# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [89]:
# import libraries
import pandas as pd
import numpy as np
import re
from sqlalchemy import create_engine
from nltk import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger','stopwords'])
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\bessam\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bessam\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\bessam\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\bessam\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [90]:
# load data from database
engine = create_engine('sqlite:///disaster-dataset-final.db')
df = pd.read_sql_table(table_name = 'messages-categories',con=engine)
X = df['message']
Y = df.iloc[:,4:]

In [91]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [4]:
df['genre'].unique()

array(['direct', 'social', 'news'], dtype=object)

In [17]:
vect1 = CountVectorizer(tokenizer=tokenize)

In [38]:
df[df['request']==1]['message']

3        UN reports Leogane 80-90 destroyed. Only Hospi...
7        Please, we need tents and water. We are in Sil...
9        I am in Croix-des-Bouquets. We have health iss...
10       There's nothing to eat and water, we starving ...
12       I am in Thomassin number 32, in the area named...
                               ...                        
25988    While insisting that the elections - particula...
26053    Subsidised fodder distributions have been carr...
26088    MSF staff in hazmat suits meticulously disinfe...
26165    A UNITA deserter alleged this week that Savimb...
26198    ADRA was granted general consultative status b...
Name: message, Length: 4480, dtype: object

In [68]:
# message length in characters, averaged per value of the `offer` flag
df['message_length'] = df['message'].str.len()
df.groupby('offer')['message_length'].mean()


offer
0.0    144.470868
1.0    170.680672
Name: message_length, dtype: float64

In [82]:
classes_list = df.columns[4:-1]
classes_list

Index(['related', 'request', 'offer', 'aid_related', 'medical_help',
       'medical_products', 'search_and_rescue', 'security', 'military',
       'child_alone', 'water', 'food', 'shelter', 'clothing', 'money',
       'missing_people', 'refugees', 'death', 'other_aid',
       'infrastructure_related', 'transport', 'buildings', 'electricity',
       'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure',
       'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold',
       'other_weather', 'direct_report'],
      dtype='object')

In [86]:
df.groupby('child_alone')['message_length'].mean()

child_alone
0.0    144.589881
Name: message_length, dtype: float64

In [88]:
message_length_dict

related                   149.607412
request                   110.192634
offer                     170.680672
aid_related               150.036220
medical_help              165.787734
medical_products          157.623288
search_and_rescue         148.744475
security                  157.337580
military                  180.043023
child_alone                 0.000000
water                     141.178614
food                      132.230717
shelter                   143.240190
clothing                  138.576355
money                     157.197020
missing_people            146.849498
refugees                  165.067352
death                     174.530936
other_aid                 142.548434
infrastructure_related    170.512023
transport                 174.359933
buildings                 157.668165
electricity               168.157303
tools                     174.811321
hospitals                 178.590106
shops                     173.166667
aid_centers               164.721683
o

In [87]:
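# average message length (in characters) among messages tagged with each category
df['message_length'] = df['message'].str.len()
message_length_dict = pd.Series(dtype=float)
for col in classes_list:
    try:
        # mean length of the messages where the category flag equals 1
        message_length_dict[col] = df.groupby(col)['message_length'].mean().loc[1]
    except KeyError:
        # the category never takes the value 1 (e.g. child_alone)
        message_length_dict[col] = 0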

  


In [39]:
vect1.fit_transform(df[df['request']==1]['message'])

<4480x9851 sparse matrix of type '<class 'numpy.int64'>'
	with 46479 stored elements in Compressed Sparse Row format>

In [48]:
tfidf1 = TfidfTransformer()
X_train_tfidf = tfidf1.fit_transform(vect1.fit_transform(df[df['request']==1]['message']))

KeyboardInterrupt: 

In [65]:
df.iloc[:,4:].sum().sort_values().values

array([    0.,   119.,   120.,   159.,   282.,   283.,   299.,   309.,
         406.,   471.,   530.,   534.,   604.,   724.,   860.,   876.,
        1151.,  1196.,  1203.,  1314.,  1335.,  1376.,  1674.,  1705.,
        2087.,  2158.,  2319.,  2448.,  2453.,  2930.,  3448.,  4480.,
        5080.,  7302., 10878., 20298.])

### 2. Write a tokenization function to process your text data

In [92]:
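def tokenize(text):
    # replace punctuation with spaces, then lowercase and split into word tokens
    text = re.sub(r"[^A-Za-z0-9]", " ", text)
    words = word_tokenize(text.lower().strip())
    # build the stop-word set once instead of reloading the list for every word
    stop_words = set(stopwords.words("english"))
    tokens = [word for word in words if word not in stop_words]
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(t) for t in tokens]

    # return the lemmatized tokens
    return lemmed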

In [6]:
text = df['message'][7]
text = re.sub(r"[^A-Za-z0-9]", " ",text)
words = word_tokenize(text.lower().strip())
tokens = [word for word in words if word not in stopwords.words("english")]
lemmed = [WordNetLemmatizer().lemmatize(t) for t in tokens]
print(lemmed)

['please', 'need', 'tent', 'water', 'silo', 'thank']


### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
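As a rough illustration of what `MultiOutputClassifier` does, the sketch below fits it on synthetic data from `make_multilabel_classification` rather than on the disaster messages. The names `X_demo`, `Y_demo`, `demo_clf`, and `demo_pred`, as well as the sample and label counts, are assumptions made for the example.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic stand-in for the vectorized messages: 200 samples, 5 binary labels.
X_demo, Y_demo = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# MultiOutputClassifier fits one independent copy of the estimator per label column.
demo_clf = MultiOutputClassifier(
    RandomForestClassifier(n_estimators=10, random_state=0)
)
demo_clf.fit(X_demo, Y_demo)

demo_pred = demo_clf.predict(X_demo[:3])
print(demo_pred.shape)            # one row per sample, one column per label
print(len(demo_clf.estimators_))  # one fitted forest per label
```

Because each label gets its own classifier, a rare category is learned in isolation from the others, which is worth keeping in mind when reading the per-category reports.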

In [93]:
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,test_size=0.3, random_state=0)
'''
vect = CountVectorizer(tokenizer=tokenize)
tfidf = TfidfTransformer()
clf = MultiOutputClassifier(RandomForestClassifier())

X_train_counts = vect.fit_transform(X_train)
X_train_tfidf = tfidf.fit_transform(X_train_counts)
clf.fit(X_train_tfidf, Y_train)

X_test_counts = vect.transform(X_test)
X_test_tfidf = tfidf.transform(X_test_counts)
y_pred = clf.predict(X_test_tfidf)
'''

pipeline = Pipeline([('vect', CountVectorizer(tokenizer=tokenize)),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultiOutputClassifier(RandomForestClassifier()))
                     ])

#y_pred = pipeline.predict(X_test)

In [12]:
pipeline.fit(X_train,Y_train)
y_pred = pipeline.predict(X_test)

In [9]:
Y_train.shape

(18344, 36)

In [10]:
X_train.isnull().mean()

0.0

In [11]:
(y_pred == Y_test).mean()

related                   0.750223
request                   0.830981
offer                     0.994913
aid_related               0.568994
medical_help              0.915300
medical_products          0.946077
search_and_rescue         0.971131
security                  0.982704
military                  0.967061
child_alone               1.000000
water                     0.928908
food                      0.883759
shelter                   0.904744
clothing                  0.984866
money                     0.978634
missing_people            0.989953
refugees                  0.965153
death                     0.953834
other_aid                 0.860613
infrastructure_related    0.935394
transport                 0.954343
buildings                 0.948366
electricity               0.981305
tools                     0.994913
hospitals                 0.988681
shops                     0.995930
aid_centers               0.988300
other_infrastructure      0.957395
weather_related     

In [12]:
pipeline.get_params

<bound method Pipeline.get_params of Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x0000023BDD29A1E0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])>

In [13]:
np.unique(y_pred)

array([0., 1., 2.])

In [14]:
for i,col in enumerate(Y_test.columns):
    print(f"The classification report for {col} is \n {classification_report(Y_test[col],y_pred[:,i])}")

The classification report for related is 
               precision    recall  f1-score   support

         0.0       0.38      0.08      0.13      1812
         1.0       0.77      0.96      0.86      5992
         2.0       0.09      0.03      0.05        59

    accuracy                           0.75      7863
   macro avg       0.41      0.36      0.34      7863
weighted avg       0.67      0.75      0.68      7863

The classification report for request is 
               precision    recall  f1-score   support

         0.0       0.84      0.99      0.91      6520
         1.0       0.54      0.07      0.12      1343

    accuracy                           0.83      7863
   macro avg       0.69      0.53      0.51      7863
weighted avg       0.79      0.83      0.77      7863

The classification report for offer is 
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      7824
         1.0       0.00      0.00      0.00        39

 



The classification report for child_alone is 
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      7863

    accuracy                           1.00      7863
   macro avg       1.00      1.00      1.00      7863
weighted avg       1.00      1.00      1.00      7863

The classification report for water is 
               precision    recall  f1-score   support

         0.0       0.93      1.00      0.96      7329
         1.0       0.07      0.00      0.01       534

    accuracy                           0.93      7863
   macro avg       0.50      0.50      0.49      7863
weighted avg       0.87      0.93      0.90      7863

The classification report for food is 
               precision    recall  f1-score   support

         0.0       0.89      1.00      0.94      6968
         1.0       0.24      0.01      0.02       895

    accuracy                           0.88      7863
   macro avg       0.57      0.50      0.48      7863




The classification report for refugees is 
               precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      7597
         1.0       0.00      0.00      0.00       266

    accuracy                           0.97      7863
   macro avg       0.48      0.50      0.49      7863
weighted avg       0.93      0.97      0.95      7863

The classification report for death is 
               precision    recall  f1-score   support

         0.0       0.95      1.00      0.98      7505
         1.0       0.00      0.00      0.00       358

    accuracy                           0.95      7863
   macro avg       0.48      0.50      0.49      7863
weighted avg       0.91      0.95      0.93      7863

The classification report for other_aid is 
               precision    recall  f1-score   support

         0.0       0.87      0.99      0.92      6806
         1.0       0.21      0.01      0.02      1057

    accuracy                           0.86      786



The classification report for hospitals is 
               precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      7774
         1.0       0.00      0.00      0.00        89

    accuracy                           0.99      7863
   macro avg       0.49      0.50      0.50      7863
weighted avg       0.98      0.99      0.98      7863

The classification report for shops is 
               precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      7834
         1.0       0.00      0.00      0.00        29

    accuracy                           1.00      7863
   macro avg       0.50      0.50      0.50      7863
weighted avg       0.99      1.00      0.99      7863

The classification report for aid_centers is 
               precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      7771
         1.0       0.00      0.00      0.00        92

    accuracy                           0.99      



The classification report for weather_related is 
               precision    recall  f1-score   support

         0.0       0.74      0.95      0.83      5628
         1.0       0.59      0.18      0.28      2235

    accuracy                           0.73      7863
   macro avg       0.66      0.56      0.55      7863
weighted avg       0.70      0.73      0.68      7863

The classification report for floods is 
               precision    recall  f1-score   support

         0.0       0.92      1.00      0.96      7218
         1.0       0.11      0.00      0.01       645

    accuracy                           0.92      7863
   macro avg       0.51      0.50      0.48      7863
weighted avg       0.85      0.92      0.88      7863

The classification report for storm is 
               precision    recall  f1-score   support

         0.0       0.91      1.00      0.95      7118
         1.0       0.34      0.02      0.04       745

    accuracy                           0.90     



The classification report for earthquake is 
               precision    recall  f1-score   support

         0.0       0.92      0.99      0.95      7111
         1.0       0.67      0.15      0.25       752

    accuracy                           0.91      7863
   macro avg       0.79      0.57      0.60      7863
weighted avg       0.89      0.91      0.89      7863

The classification report for cold is 
               precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      7694
         1.0       0.00      0.00      0.00       169

    accuracy                           0.98      7863
   macro avg       0.49      0.50      0.49      7863
weighted avg       0.96      0.98      0.97      7863

The classification report for other_weather is 
               precision    recall  f1-score   support

         0.0       0.94      1.00      0.97      7423
         1.0       0.07      0.00      0.00       440

    accuracy                           0.94    

In [None]:
y_pred_train = pipeline.predict(X_train)
(y_pred_train==Y_train).mean()

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
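One possible way to condense the per-category reports into a single table is sketched below. It uses `precision_recall_fscore_support` with macro averaging, which also copes with a column such as `related` that contains a third value (2) in this dataset. The helper name `summarize_metrics` and the toy frames `Y_toy` and `pred_toy` are hypothetical stand-ins for `Y_test` and `y_pred`.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support


def summarize_metrics(Y_true, Y_pred, average="macro"):
    """Return one row of precision/recall/F1 per output column."""
    rows = {}
    for i, col in enumerate(Y_true.columns):
        precision, recall, f1, _ = precision_recall_fscore_support(
            Y_true[col], Y_pred[:, i], average=average, zero_division=0
        )
        rows[col] = {"precision": precision, "recall": recall, "f1": f1}
    return pd.DataFrame.from_dict(rows, orient="index")


# Toy labels standing in for Y_test (DataFrame) and y_pred (2-D array).
Y_toy = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 0, 1, 1]})
pred_toy = np.array([[1, 0], [0, 0], [1, 0], [0, 0]])

report = summarize_metrics(Y_toy, pred_toy)
print(report.round(2))
```

Passing `zero_division=0` scores a class that is never predicted as 0 instead of emitting an undefined-metric warning for it.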

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
pipeline.score(X_test,Y_test)

In [None]:
[int(x) for x in np.linspace(50,1000,num=20)]

In [94]:
parameters = {
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__max_df': (0.5, 1.0),
    'vect__max_features': (5000, 10000),
    'clf__estimator__n_estimators': [50, 100]
}
cv = GridSearchCV(pipeline, param_grid=parameters, verbose=2)

In [None]:
cv.fit(X_train,Y_train)
#y_pred = cv.predict(X_test)

Fitting 5 folds for each of 16 candidates, totalling 80 fits


In [None]:
cv.estimator.get_params().keys()

In [11]:
df.groupby('genre').count()['message']

genre
direct    10825
news      12978
social     2404
Name: message, dtype: int64

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  
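The following is a minimal sketch of how a fitted grid search could be inspected and scored. To stay self-contained it runs on a handful of made-up messages and two labels, and it scores the model on the same data it was fitted on; with the real data the tuned model would be scored on `X_test` and `Y_test` instead. All names here (`toy_messages`, `toy_labels`, `search`, `tuned_model`) and the tiny parameter grid are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Made-up messages with two binary labels (water, food).
toy_messages = [
    "we need water", "send food please", "water and food needed",
    "no water here", "food is running out", "need clean water",
    "we are hungry send food", "bring water and food",
]
toy_labels = np.array([[1, 0], [0, 1], [1, 1], [1, 0],
                       [0, 1], [1, 0], [0, 1], [1, 1]])
label_names = ["water", "food"]

toy_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])
search = GridSearchCV(toy_pipeline, {"clf__estimator__n_estimators": [5, 10]}, cv=2)
search.fit(toy_messages, toy_labels)

# best_estimator_ is the pipeline refit with the best parameter combination.
tuned_model = search.best_estimator_
tuned_pred = tuned_model.predict(toy_messages)

print("best parameters:", search.best_params_)
for i, name in enumerate(label_names):
    y_true, y_hat = toy_labels[:, i], tuned_pred[:, i]
    print(name,
          "accuracy=%.2f" % accuracy_score(y_true, y_hat),
          "precision=%.2f" % precision_score(y_true, y_hat, zero_division=0),
          "recall=%.2f" % recall_score(y_true, y_hat, zero_division=0))
```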

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
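As one example of the second idea, the sketch below stacks a message-length feature next to the TF-IDF features with `FeatureUnion`. The transformer `MessageLengthExtractor` is a hypothetical class written for this illustration, not part of scikit-learn, and the two sample messages are made up. The default `CountVectorizer` tokenizer is used here so the block runs on its own.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: emits the character length of each message."""

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        # Return a 2-D column so FeatureUnion can stack it next to TF-IDF.
        return np.array([len(text) for text in X], dtype=float).reshape(-1, 1)


features = FeatureUnion([
    ("text_pipeline", Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
    ])),
    ("msg_length", MessageLengthExtractor()),
])

sample = ["we need water", "send food and tents please"]
combined = features.fit_transform(sample)
print(combined.shape)  # TF-IDF columns plus one length column
```

In a full pipeline, `features` would take the place of the separate `vect` and `tfidf` steps, followed by the classifier step.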

### 9. Export your model as a pickle file
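A minimal sketch of the export step with the standard `pickle` module is shown below. It trains a small stand-in model so the block runs on its own. The file name `classifier.pkl`, the temporary directory, and the toy data are assumptions; with the real notebook the object to dump would be the fitted pipeline or grid search.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline

# Small stand-in model trained on made-up messages.
toy_texts = ["we need water", "send food", "water please", "food needed"]
toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_model.fit(toy_texts, [1, 0, 1, 0])

# Hypothetical output location for the example.
model_path = Path(tempfile.gettempdir()) / "classifier.pkl"

with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# Reload the model and check that it predicts the same labels.
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print((restored.predict(toy_texts) == toy_model.predict(toy_texts)).all())
```

Note that pickle stores a custom function such as `tokenize` by reference, so the code that loads the model needs to be able to import or define that function under the same name.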

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.