# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [54]:
# import libraries
import pandas as pd
import numpy as np
import re
from sqlalchemy import create_engine
import datetime
import pickle

import nltk
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report

In [13]:
nltk.download(['punkt','stopwords','wordnet'])

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/seattlehibiscus/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/seattlehibiscus/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/seattlehibiscus/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [18]:
# load data from database
engine = create_engine('sqlite:///etlpipeline.db')
df = pd.read_sql_table("etlpipeline", engine)
X = df["message"]
Y = df.iloc[:, 4:]
categories = Y.columns.tolist()

### 2. Write a tokenization function to process your text data

In [26]:
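def tokenize(text):
    # lowercase and replace every non-alphanumeric character with a space
    text = re.sub(r'[^a-zA-Z0-9]', " ", text.lower())
    # build the stop-word set and the lemmatizer once per call, not once per word
    stop_words = set(stopwords.words("english"))
    lemmatizer = WordNetLemmatizer()
    tokens = [word for word in word_tokenize(text) if word not in stop_words]
    tokens = [lemmatizer.lemmatize(word).strip() for word in tokens]

    return tokens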
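
To see what the normalisation step does on its own, here is a minimal sketch that needs only the standard library. It keeps the same regular expression as `tokenize` but, as simplifying assumptions, swaps `word_tokenize` for `str.split`, uses a tiny hypothetical stop-word set instead of the NLTK list, and skips lemmatization. The function name `simple_tokenize` and the sample message are made up for illustration.

```python
import re

# Hypothetical, deliberately tiny stop-word set (the notebook uses NLTK's English list).
STOP_WORDS = {"we", "and", "in", "the", "a", "is"}


def simple_tokenize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, drop stop words."""
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    return [word for word in text.split() if word not in STOP_WORDS]


print(simple_tokenize("We need water, food and tents in Port-au-Prince!"))
```

Punctuation and hyphens become token boundaries, so a hyphenated place name is split into several tokens. The NLTK-based function in the notebook behaves the same way at this step.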

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [27]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [22]:
# split data into train and test datasets
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)
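
The call above does not pass `random_state`, so the rows that end up in the test set change from run to run; `train_test_split` accepts a `random_state` argument to pin the shuffle. The idea behind a seeded split can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not scikit-learn's implementation, and `split_indices` is a hypothetical helper.

```python
import random


def split_indices(n_rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and split them into train/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_rows * test_size))
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))

# The same seed always yields the same partition.
print(split_indices(10) == (train_idx, test_idx))
```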

In [28]:
pipeline.fit(X_train, Y_train)



Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                  

### 5. Test your model
Report the F1 score, precision, and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
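
To make the three metrics concrete, here is a minimal sketch that computes them by hand for one label of one category column. The helper `binary_report` and the label lists are made up for illustration; `classification_report` reports the same quantities for every label, along with support and averages.

```python
def binary_report(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for one label of one category column."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Fall back to 0.0 when a denominator is zero (e.g. no positive predictions).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels for a single hypothetical category column.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
precision, recall, f1 = binary_report(y_true, y_pred)
print(precision, recall, f1)
```

The zero fallback corresponds to rows such as label `1` of `offer` in the output, where the model never predicts the label and precision, recall, and F1 are all reported as 0.00.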

In [29]:
pred_y = pipeline.predict(X_test)
for i in range(len(categories)):
    category = categories[i]
    print(category)
    print(classification_report(Y_test[category], pred_y[:, i]))

related
              precision    recall  f1-score   support

           0       0.62      0.47      0.54      1212
           1       0.85      0.91      0.88      3991
           2       0.30      0.39      0.34        41

    accuracy                           0.80      5244
   macro avg       0.59      0.59      0.58      5244
weighted avg       0.79      0.80      0.79      5244

request
              precision    recall  f1-score   support

           0       0.90      0.97      0.93      4361
           1       0.77      0.46      0.58       883

    accuracy                           0.89      5244
   macro avg       0.83      0.72      0.75      5244
weighted avg       0.88      0.89      0.87      5244

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5217
           1       0.00      0.00      0.00        27

    accuracy                           0.99      5244
   macro avg       0.50      0.50      0.50      524

  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

           0       0.94      0.99      0.96      4656
           1       0.82      0.49      0.61       588

    accuracy                           0.93      5244
   macro avg       0.88      0.74      0.79      5244
weighted avg       0.93      0.93      0.92      5244

shelter
              precision    recall  f1-score   support

           0       0.93      0.99      0.96      4764
           1       0.79      0.31      0.44       480

    accuracy                           0.93      5244
   macro avg       0.86      0.65      0.70      5244
weighted avg       0.92      0.93      0.91      5244

clothing
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      5172
           1       0.90      0.26      0.41        72

    accuracy                           0.99      5244
   macro avg       0.95      0.63      0.70      5244
weighted avg       0.99      0.99      0.99      5244

mo

In [62]:
accuracy1 = (pred_y == Y_test).mean()

print("Accuracy:\n",accuracy1)

Accuracy:
 related                   0.802632
request                   0.885584
offer                     0.994851
aid_related               0.750000
medical_help              0.912853
medical_products          0.953852
search_and_rescue         0.972159
security                  0.984554
military                  0.967010
child_alone               1.000000
water                     0.953661
food                      0.930587
shelter                   0.929062
clothing                  0.989512
money                     0.977879
missing_people            0.987605
refugees                  0.961289
death                     0.956140
other_aid                 0.867277
infrastructure_related    0.934401
transport                 0.953852
buildings                 0.954233
electricity               0.979214
tools                     0.994279
hospitals                 0.990275
shops                     0.995233
aid_centers               0.987414
other_infrastructure      0.956712
weather_r

### 6. Improve your model
Use grid search to find better parameters. 

In [31]:
parameters = {
        'tfidf__use_idf': (True, False), 
        'clf__estimator__n_estimators': [50, 100],
        'clf__estimator__min_samples_split': [2,4] }

cv = GridSearchCV(pipeline, param_grid=parameters)
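
Grid search fits one model per parameter combination per cross-validation fold, which is why the search below is slow. The sketch here enumerates the combinations in the grid above with the standard library. The fold count is an assumption: the default for `cv` depends on the scikit-learn version (3 folds in older releases, 5 from version 0.22 onward), and the final refit on the full training set is not counted.

```python
from itertools import product

parameters = {
    "tfidf__use_idf": (True, False),
    "clf__estimator__n_estimators": [50, 100],
    "clf__estimator__min_samples_split": [2, 4],
}

# Every combination of the listed values is one candidate configuration.
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]

n_folds = 3  # assumption: depends on the installed scikit-learn version
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
```

Each of those fits trains 36 random forests, one per category, because the classifier is wrapped in `MultiOutputClassifier`.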

In [34]:
# train model
start_time = datetime.datetime.now()
cv.fit(X_train, Y_train)
end_time = datetime.datetime.now()



In [35]:
# total_seconds() covers the whole duration; .seconds would drop any full days
print('Model Training done, Cost: %d' % ((end_time - start_time).total_seconds() / 60), 'minutes')

Model Training done, Cost: 135 minutes


In [37]:
best_param = cv.best_estimator_.get_params()
best_param

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x1a221ef488>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=None,
                                                          max_features='auto',
           

In [38]:
# identify the best cross validation parameters
for param in parameters.keys():
    print("\t%s: %r" % (param, best_param[param]))

	tfidf__use_idf: True
	clf__estimator__n_estimators: 100
	clf__estimator__min_samples_split: 2


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [39]:
cv_pred_y = cv.predict(X_test)
for i in range(len(categories)):
    category=categories[i]
    print(category)
    print(classification_report(Y_test[category], cv_pred_y[:,i]))

related
              precision    recall  f1-score   support

           0       0.70      0.43      0.53      1212
           1       0.84      0.94      0.89      3991
           2       0.36      0.41      0.39        41

    accuracy                           0.82      5244
   macro avg       0.63      0.59      0.60      5244
weighted avg       0.81      0.82      0.80      5244

request
              precision    recall  f1-score   support

           0       0.91      0.98      0.94      4361
           1       0.84      0.51      0.63       883

    accuracy                           0.90      5244
   macro avg       0.87      0.74      0.79      5244
weighted avg       0.90      0.90      0.89      5244

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5217
           1       0.00      0.00      0.00        27

    accuracy                           0.99      5244
   macro avg       0.50      0.50      0.50      524

  'precision', 'predicted', average, warn_for)


              precision    recall  f1-score   support

           0       0.95      1.00      0.98      4985
           1       0.67      0.09      0.16       259

    accuracy                           0.95      5244
   macro avg       0.81      0.55      0.57      5244
weighted avg       0.94      0.95      0.94      5244

buildings
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      4990
           1       0.83      0.18      0.29       254

    accuracy                           0.96      5244
   macro avg       0.90      0.59      0.64      5244
weighted avg       0.95      0.96      0.95      5244

electricity
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5137
           1       0.83      0.05      0.09       107

    accuracy                           0.98      5244
   macro avg       0.91      0.52      0.54      5244
weighted avg       0.98      0.98      0.97      524

In [61]:
accuracy2 = (cv_pred_y == Y_test).mean()

print("Accuracy:\n",accuracy2)

Accuracy:
 related                   0.817696
request                   0.900648
offer                     0.994851
aid_related               0.782037
medical_help              0.913043
medical_products          0.951564
search_and_rescue         0.972540
security                  0.984554
military                  0.966819
child_alone               1.000000
water                     0.958429
food                      0.946034
shelter                   0.936880
clothing                  0.987605
money                     0.976926
missing_people            0.987605
refugees                  0.962624
death                     0.957857
other_aid                 0.869375
infrastructure_related    0.935355
transport                 0.952899
buildings                 0.958429
electricity               0.980359
tools                     0.994279
hospitals                 0.990275
shops                     0.995233
aid_centers               0.987414
other_infrastructure      0.957094
weather_r

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [43]:
# try adaboost classifier
pipeline2 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier()))
])

pipeline2.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at 0x1a221ef488>,
                                 vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R',


In [46]:
# test the model
ada_pred_y = pipeline2.predict(X_test)
for i in range(len(categories)):
    category = categories[i]
    print(category)
    print(classification_report(Y_test[category], ada_pred_y[:, i]))

related
              precision    recall  f1-score   support

           0       0.67      0.13      0.21      1212
           1       0.78      0.98      0.87      3991
           2       0.54      0.17      0.26        41

    accuracy                           0.78      5244
   macro avg       0.66      0.43      0.45      5244
weighted avg       0.75      0.78      0.71      5244

request
              precision    recall  f1-score   support

           0       0.91      0.96      0.94      4361
           1       0.75      0.54      0.63       883

    accuracy                           0.89      5244
   macro avg       0.83      0.75      0.78      5244
weighted avg       0.88      0.89      0.88      5244

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5217
           1       0.20      0.04      0.06        27

    accuracy                           0.99      5244
   macro avg       0.60      0.52      0.53      524

In [63]:
accuracy3 = (ada_pred_y == Y_test).mean()

print("Accuracy:\n",accuracy3)

Accuracy:
 related                   0.776316
request                   0.892067
offer                     0.994279
aid_related               0.756674
medical_help              0.919527
medical_products          0.956331
search_and_rescue         0.973112
security                  0.982265
military                  0.970252
child_alone               1.000000
water                     0.965866
food                      0.944127
shelter                   0.946796
clothing                  0.988177
money                     0.980168
missing_people            0.987796
refugees                  0.966056
death                     0.963005
other_aid                 0.867086
infrastructure_related    0.930206
transport                 0.957094
buildings                 0.963959
electricity               0.979405
tools                     0.993898
hospitals                 0.987033
shops                     0.994851
aid_centers               0.985698
other_infrastructure      0.955378
weather_r

In [78]:
# compare the random forest classifier with AdaBoostClassifier
(sum(np.where(accuracy1>accuracy3, 1, 0)), len(np.where(accuracy1>accuracy3, 1, 0)))

(13, 36)

- The random forest is more accurate than AdaBoost in only 13 of the 36 categories, so the AdaBoost classifier appears slightly better on per-category accuracy.
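
The strict `>` comparison in the cell above counts only the categories where the random forest wins, so ties (for example `child_alone`, where both models score 1.0) are lumped together with losses. A minimal sketch that separates the three cases, using four of the per-category accuracies printed above; the variable names are hypothetical.

```python
# Per-category accuracies copied from the outputs above (4 of the 36 categories).
rf_acc = {"related": 0.802632, "request": 0.885584, "offer": 0.994851, "child_alone": 1.0}
ada_acc = {"related": 0.776316, "request": 0.892067, "offer": 0.994279, "child_alone": 1.0}

rf_wins = [c for c in rf_acc if rf_acc[c] > ada_acc[c]]
ada_wins = [c for c in rf_acc if rf_acc[c] < ada_acc[c]]
ties = [c for c in rf_acc if rf_acc[c] == ada_acc[c]]

print("random forest wins:", rf_wins)
print("AdaBoost wins:", ada_wins)
print("ties:", ties)
```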

In [50]:
# try decision tree model
pipeline3 = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
])

pipeline3.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultiOutputClassifier(estimator=DecisionTreeClassifier(class_weight=None,
                                                                        criterion='gini',
                                                            

In [75]:
dec_pred_y = pipeline3.predict(X_test)
for i in range(len(categories)):
    category = categories[i]
    print(category)
    print(classification_report(Y_test[category], dec_pred_y[:, i]))

related
              precision    recall  f1-score   support

           0       0.52      0.50      0.51      1212
           1       0.85      0.85      0.85      3991
           2       0.17      0.37      0.23        41

    accuracy                           0.76      5244
   macro avg       0.51      0.57      0.53      5244
weighted avg       0.77      0.76      0.77      5244

request
              precision    recall  f1-score   support

           0       0.91      0.92      0.92      4361
           1       0.60      0.57      0.58       883

    accuracy                           0.86      5244
   macro avg       0.75      0.74      0.75      5244
weighted avg       0.86      0.86      0.86      5244

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5217
           1       0.00      0.00      0.00        27

    accuracy                           0.99      5244
   macro avg       0.50      0.50      0.50      524

In [76]:
accuracy4 = (dec_pred_y == Y_test).mean()

print("Accuracy:\n",accuracy4)

Accuracy:
 related                   0.763730
request                   0.862128
offer                     0.991800
aid_related               0.713768
medical_help              0.892449
medical_products          0.940885
search_and_rescue         0.959191
security                  0.969870
military                  0.959573
child_alone               1.000000
water                     0.958047
food                      0.937452
shelter                   0.931541
clothing                  0.983600
money                     0.974447
missing_people            0.979786
refugees                  0.957285
death                     0.954996
other_aid                 0.820938
infrastructure_related    0.900076
transport                 0.932685
buildings                 0.947941
electricity               0.975210
tools                     0.991037
hospitals                 0.984172
shops                     0.992563
aid_centers               0.979214
other_infrastructure      0.934783
weather_r

In [77]:
# compare the decision tree classifier with AdaBoostClassifier
sum(np.where(accuracy3>accuracy4, 1, 0))

34

- AdaBoost is more accurate than the decision tree in 34 of the 36 categories, so it is clearly the better of the two. I therefore choose the AdaBoost classifier as the improved model.

In [79]:
# list the tunable parameters of pipeline2 (AdaBoost classifier) before running grid search
pipeline2.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x1a221ef488>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=AdaBoostClassifier(algorithm='SAMME.R',
                                                      base_estimator=None,
                                                      learning_rate=1.0,
                                                      n_estimators=50,
                                                      random_state=None),
                       

In [83]:
cv2_parameters = {
    'clf__estimator__n_estimators': [50, 100, 200],
    'clf__estimator__learning_rate': [0.1, 0.5, 1]}
cv2 = GridSearchCV(pipeline2, param_grid=cv2_parameters)

In [84]:
cv2.fit(X_train, Y_train)



GridSearchCV(cv='warn', error_score='raise-deprecating',
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                       

In [85]:
cv2_pred_y = cv2.predict(X_test)
for i in range(len(categories)):
    category = categories[i]
    print(category)
    print(classification_report(Y_test[category], cv2_pred_y[:, i]))

related
              precision    recall  f1-score   support

           0       0.64      0.22      0.33      1212
           1       0.80      0.96      0.87      3991
           2       0.50      0.12      0.20        41

    accuracy                           0.78      5244
   macro avg       0.65      0.44      0.47      5244
weighted avg       0.76      0.78      0.74      5244

request
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      4361
           1       0.80      0.53      0.64       883

    accuracy                           0.90      5244
   macro avg       0.85      0.75      0.79      5244
weighted avg       0.89      0.90      0.89      5244

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5217
           1       0.00      0.00      0.00        27

    accuracy                           0.99      5244
   macro avg       0.50      0.50      0.50      524

In [86]:
cv2_accuracy = (cv2_pred_y == Y_test).mean()

print("Accuracy:\n",cv2_accuracy)

Accuracy:
 related                   0.784516
request                   0.898551
offer                     0.994088
aid_related               0.772502
medical_help              0.920099
medical_products          0.957475
search_and_rescue         0.973303
security                  0.983600
military                  0.970442
child_alone               1.000000
water                     0.966438
food                      0.946606
shelter                   0.949466
clothing                  0.990656
money                     0.978833
missing_people            0.987605
refugees                  0.966056
death                     0.962815
other_aid                 0.871091
infrastructure_related    0.932494
transport                 0.956712
buildings                 0.962624
electricity               0.982075
tools                     0.993707
hospitals                 0.989130
shops                     0.994851
aid_centers               0.986079
other_infrastructure      0.955950
weather_r

In [90]:
best_param2 = cv2.best_estimator_.get_params()

In [91]:
for param in cv2_parameters.keys():
    print("\t%s: %r" % (param, best_param2[param]))

	clf__estimator__n_estimators: 200
	clf__estimator__learning_rate: 0.5


In [87]:
# compare the tuned AdaBoost model (cv2) with the earlier tuned random forest model (cv)
(sum(np.where(accuracy2>cv2_accuracy, 1, 0)), len(np.where(accuracy2>cv2_accuracy, 1, 0)))

(15, 36)

- The tuned random forest (`cv`) is more accurate in only 15 of the 36 categories, so the tuned AdaBoost model (`cv2`) is slightly better overall. I therefore choose this new model.

### 9. Export your model as a pickle file

In [89]:
pickle_file = 'ada_model.pkl'
with open(pickle_file, 'wb') as file:
    pickle.dump(cv2, file)
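
Loading the file back mirrors the dump. The sketch below round-trips a stand-in dictionary through a temporary file, since a fitted model is not available here; the file name and the stand-in object are assumptions for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted model; any picklable object round-trips the same way.
model = {"name": "stand-in model", "params": {"n_estimators": 200, "learning_rate": 0.5}}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical file name
    with open(path, "wb") as file:
        pickle.dump(model, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == model)
```

Two caveats apply to the real model. Pickle stores functions by reference, so the `tokenize` function must be importable in whatever process loads `ada_model.pkl`. Pickle files should also only be loaded from trusted sources, because unpickling can execute arbitrary code.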

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
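
One possible shape for the command-line entry point of such a script is sketched below. The template's actual arguments are not shown in this notebook, so the argument names `database_filepath` and `model_filepath` are assumptions, and the body of `main` is only a placeholder for the steps above.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths a training script of this kind would plausibly need."""
    parser = argparse.ArgumentParser(description="Train and export the message classifier.")
    parser.add_argument("database_filepath", help="SQLite database produced by the ETL step")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholder: the real script would load the data, build and fit the
    # pipeline, evaluate it, and pickle the model at this point.
    return args.database_filepath, args.model_filepath


# Example invocation with the file names used in this notebook.
print(main(["etlpipeline.db", "ada_model.pkl"]))
```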