# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [6]:
# import libraries
import os
import re
import pandas as pd
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
%matplotlib inline

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem import PorterStemmer
nltk.download(['punkt', 'wordnet', 'stopwords'])

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import classification_report

import pickle
import warnings
warnings.filterwarnings('ignore')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\omdhk.test\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\omdhk.test\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\omdhk.test\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [7]:
# load data from database
engine = create_engine('sqlite:///project2.db')
df = pd.read_sql_table('disaster_', con=engine)

# features: the raw message text; targets: every remaining category column
X = df['message']
y = df.drop(columns=['id', 'message', 'original', 'genre'])



### 2. Write a tokenization function to process your text data

In [8]:

def tokenize(text):
    # check if there are urls within the text
    url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
    detected_urls = re.findall(url_regex,text)
    for url in detected_urls:
        text = text.replace(url,"urlplaceholder")
    
    # remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]"," ",text)
    
    # tokenize the text
    tokens = word_tokenize(text)
    
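    # remove stop words (build the set once instead of per token; compare in lowercase)
    stop_words = set(stopwords.words("english"))
    tokens = [tok for tok in tokens if tok.lower() not in stop_words]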
    
    # lemmatization
    lemmatizer = WordNetLemmatizer()
    
    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)
        
    return clean_tokens
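
The URL-replacement and punctuation-stripping steps can be checked in isolation, without NLTK. The sketch below reuses the two regular expressions from `tokenize`; `normalize` is a hypothetical helper name that is not part of the notebook, and splitting on whitespace only stands in for `word_tokenize`.

```python
import re

# Same URL pattern as in tokenize(), written as a raw string
URL_REGEX = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


def normalize(text):
    """Hypothetical helper: replace URLs, strip punctuation, lowercase, split on whitespace."""
    for url in re.findall(URL_REGEX, text):
        text = text.replace(url, "urlplaceholder")
    # every character that is not a letter or digit becomes a space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    return text.lower().split()


print(normalize("Need water! See http://example.com/help for details."))
# ['need', 'water', 'see', 'urlplaceholder', 'for', 'details']
```

Stop-word removal and lemmatization are left out here on purpose, since they depend on the downloaded NLTK corpora.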

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
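
To make the input and output shapes concrete, here is a minimal sketch on made-up data: four messages and two invented categories. It uses the default `CountVectorizer` tokenizer rather than `tokenize`, and the `n_estimators` and `random_state` values are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration: four messages, two binary categories
messages = ["we need water", "storm destroyed the bridge",
            "please send water and food", "heavy storm tonight"]
labels = np.array([[1, 0],   # columns: [water, storm]
                   [0, 1],
                   [1, 0],
                   [0, 1]])

toy_pipe = Pipeline([
    ('vect', CountVectorizer()),      # default tokenizer, not tokenize()
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipe.fit(messages, labels)

pred = toy_pipe.predict(["water needed", "storm warning"])
print(pred.shape)  # one row per message, one column per category
```

`MultiOutputClassifier` fits one copy of the wrapped estimator per target column, so with the real dataset the prediction has one column for each category.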

In [19]:
# create a pipeline: a count vectorizer, a TF-IDF transformer and a classifier
from functools import partial
rfc_pipe = Pipeline([
        ('vect', CountVectorizer(tokenizer=partial(tokenize))),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier())),
    ])


In [20]:
rfc_pipe.get_params()


{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=functools.partial(<function tokenize at 0x00000291CE7B50D0>))),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=RandomForestClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=functools.partial(<function tokenize at 0x00000291CE7B50D0>)),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=RandomForestClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': functools.partial(<function tokenize at 0x00000291CE7B50D0>),
 'vect__vocabulary': None,
 'tfidf__norm': '

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [21]:

# train test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)

# train the pipeline
rfc_pipe.fit(X_train,y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=functools.partial(<function tokenize at 0x00000291CE7B50D0>))),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier()))])

In [22]:
y_test.head()

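| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 11724 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 10180 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 3756 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 22423 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1908 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |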


In [23]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
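
The column-by-column approach described above can be sketched as follows. `report_per_column` is a hypothetical helper, and the labels and predictions are made up for illustration; `output_dict=True` returns the metrics as a dictionary, and `zero_division=0` avoids warnings for categories with no positive predictions.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report


def report_per_column(y_true, y_pred):
    """Hypothetical helper: one classification_report dict per target column."""
    y_pred = np.asarray(y_pred)
    reports = {}
    for i, column in enumerate(y_true.columns):
        reports[column] = classification_report(
            y_true[column], y_pred[:, i], output_dict=True, zero_division=0)
    return reports


# Toy labels and predictions, made up for illustration
y_true = pd.DataFrame({'water': [1, 0, 1, 0], 'storm': [0, 1, 0, 1]})
y_pred = np.array([[1, 0], [0, 1], [0, 0], [0, 1]])

reports = report_per_column(y_true, y_pred)
for column, report in reports.items():
    positive = report['1']  # metrics for the positive class of this category
    print(column, positive['precision'], positive['recall'], positive['f1-score'])
```

Unlike a single call on the full label matrix, this also reports the metrics for the negative class of each category (under the key `'0'`).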

In [24]:
# predict on train set
y_rfc_trainpred = rfc_pipe.predict(X_train)

In [25]:
# classification report on train set
print(classification_report(y_train.values, y_rfc_trainpred, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       1.00      1.00      1.00     15002
               request       1.00      0.99      1.00      3362
                 offer       1.00      0.97      0.98        89
           aid_related       1.00      1.00      1.00      8150
          medical_help       1.00      1.00      1.00      1594
      medical_products       1.00      1.00      1.00       989
     search_and_rescue       1.00      1.00      1.00       536
              security       1.00      0.98      0.99       366
              military       1.00      1.00      1.00       665
           child_alone       0.00      0.00      0.00         0
                 water       1.00      1.00      1.00      1240
                  food       1.00      1.00      1.00      2167
               shelter       1.00      1.00      1.00      1709
              clothing       1.00      1.00      1.00       297
                 money       1.00      

In [26]:
# Model accuracy score on train set
rfc_train_accuracy = (y_rfc_trainpred == y_train).mean()
print(rfc_train_accuracy)

related                   0.998169
request                   0.999084
offer                     0.999847
aid_related               0.998932
medical_help              0.999695
medical_products          0.999847
search_and_rescue         0.999898
security                  0.999695
military                  0.999797
child_alone               1.000000
water                     0.999949
food                      0.999949
shelter                   0.999949
clothing                  1.000000
money                     0.999949
missing_people            1.000000
refugees                  0.999847
death                     0.999898
other_aid                 0.998983
infrastructure_related    0.999746
transport                 0.999797
buildings                 0.999847
electricity               1.000000
tools                     1.000000
hospitals                 0.999949
shops                     1.000000
aid_centers               0.999898
other_infrastructure      0.999797
weather_related     

In [27]:
# predict on test set
y_rfc_testpred = rfc_pipe.predict(X_test)

In [28]:
# classification report on test set
print(classification_report(y_test.values, y_rfc_testpred, target_names=y.columns.values))


                        precision    recall  f1-score   support

               related       0.85      0.95      0.90      5091
               request       0.83      0.51      0.63      1112
                 offer       0.00      0.00      0.00        29
           aid_related       0.75      0.70      0.72      2710
          medical_help       0.69      0.09      0.16       490
      medical_products       0.79      0.10      0.19       324
     search_and_rescue       1.00      0.02      0.03       188
              security       0.00      0.00      0.00       105
              military       0.56      0.05      0.09       195
           child_alone       0.00      0.00      0.00         0
                 water       0.90      0.35      0.50       432
                  food       0.84      0.60      0.70       756
               shelter       0.85      0.37      0.51       605
              clothing       0.93      0.12      0.21       108
                 money       0.67      

In [29]:

# Model accuracy score on test set
rfc_test_accuracy = (y_rfc_testpred == y_test).mean()
print(rfc_test_accuracy)

related                   0.834300
request                   0.899756
offer                     0.995575
aid_related               0.779677
medical_help              0.928898
medical_products          0.954379
search_and_rescue         0.971773
security                  0.983522
military                  0.970552
child_alone               1.000000
water                     0.954684
food                      0.940800
shelter                   0.935612
clothing                  0.985352
money                     0.977876
missing_people            0.987794
refugees                  0.967043
death                     0.959262
other_aid                 0.872139
infrastructure_related    0.934544
transport                 0.954379
buildings                 0.954074
electricity               0.978181
tools                     0.992981
hospitals                 0.990540
shops                     0.995575
aid_centers               0.988404
other_infrastructure      0.955600
weather_related     

### 6. Improve your model
Use grid search to find better parameters. 
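
A small sketch of a grid search on made-up data, under two assumptions that go beyond the notebook: the grid includes unrestricted trees (`max_depth=None`) as a candidate, and candidates are ranked by micro-averaged F1 through `scoring='f1_micro'`. With no `scoring` argument, `GridSearchCV` falls back to the pipeline's own `score` method, which for `MultiOutputClassifier` should be accuracy rather than F1, so the choice of scorer can change which candidate wins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy data, made up for illustration: 12 messages, two binary categories
messages = ["we need water", "storm destroyed the bridge",
            "please send water", "heavy storm tonight"] * 3
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]] * 3)

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=0))),
])

# 'clf__estimator__' reaches through the pipeline step and the wrapper
param_grid = {
    'clf__estimator__n_estimators': [5, 10],
    'clf__estimator__max_depth': [2, None],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=3, scoring='f1_micro')
search.fit(messages, labels)
print(search.best_params_)
print(search.best_score_)
```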

In [30]:
parameters = {'clf__estimator__n_estimators':[100,200],
              'clf__estimator__max_depth':[5]}

# create the grid search object; it wraps the pipeline and is fitted like a model
grid_rfc = GridSearchCV(rfc_pipe, param_grid=parameters , cv=3, verbose=2)

In [31]:
grid_rfc.fit(X_train,y_train)

Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] END clf__estimator__max_depth=5, clf__estimator__n_estimators=100; total time= 4.8min
[CV] END clf__estimator__max_depth=5, clf__estimator__n_estimators=100; total time= 4.4min
[CV] END clf__estimator__max_depth=5, clf__estimator__n_estimators=100; total time= 4.2min
[CV] END clf__estimator__max_depth=5, clf__estimator__n_estimators=200; total time= 4.5min
[CV] END clf__estimator__max_depth=5, clf__estimator__n_estimators=200; total time= 4.0min
[CV] END clf__estimator__max_depth=5, clf__estimator__n_estimators=200; total time= 4.1min


GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=functools.partial(<function tokenize at 0x00000291CE7B50D0>))),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier()))]),
             param_grid={'clf__estimator__max_depth': [5],
                         'clf__estimator__n_estimators': [100, 200]},
             verbose=2)

In [32]:
grid_rfc.best_params_


{'clf__estimator__max_depth': 5, 'clf__estimator__n_estimators': 100}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

In [33]:
# predict on train set
y_grid_rfc_trainpred = grid_rfc.predict(X_train)


In [34]:
# classification report on train set
print('Grid rfc Train Scores')
print(classification_report(y_train.values, y_grid_rfc_trainpred, target_names=y.columns.values))
print('\n')

# accuracy score on train set
print('Grid rfc Accuracy')
grid_rfc_train_accuracy = (y_grid_rfc_trainpred == y_train).mean()
print(grid_rfc_train_accuracy)

Grid rfc Train Scores
                        precision    recall  f1-score   support

               related       0.76      1.00      0.87     15002
               request       0.00      0.00      0.00      3362
                 offer       0.00      0.00      0.00        89
           aid_related       0.99      0.01      0.02      8150
          medical_help       1.00      0.00      0.00      1594
      medical_products       0.00      0.00      0.00       989
     search_and_rescue       0.00      0.00      0.00       536
              security       0.00      0.00      0.00       366
              military       0.00      0.00      0.00       665
           child_alone       0.00      0.00      0.00         0
                 water       1.00      0.00      0.00      1240
                  food       0.00      0.00      0.00      2167
               shelter       1.00      0.00      0.00      1709
              clothing       0.00      0.00      0.00       297
                 

In [35]:
y_grid_rfc_testpred = grid_rfc.predict(X_test)


In [36]:
# classification report on test set
print('Grid rfc Test Scores')
print(classification_report(y_test.values, y_grid_rfc_testpred, target_names=y.columns.values))
print('\n')

# accuracy score on test set
print('Grid rfc Accuracy')
grid_rfc_test_accuracy = (y_grid_rfc_testpred == y_test).mean()
print(grid_rfc_test_accuracy)

Grid rfc Test Scores
                        precision    recall  f1-score   support

               related       0.78      1.00      0.87      5091
               request       0.00      0.00      0.00      1112
                 offer       0.00      0.00      0.00        29
           aid_related       0.94      0.01      0.01      2710
          medical_help       0.00      0.00      0.00       490
      medical_products       0.00      0.00      0.00       324
     search_and_rescue       0.00      0.00      0.00       188
              security       0.00      0.00      0.00       105
              military       0.00      0.00      0.00       195
           child_alone       0.00      0.00      0.00         0
                 water       0.00      0.00      0.00       432
                  food       0.00      0.00      0.00       756
               shelter       0.00      0.00      0.00       605
              clothing       0.00      0.00      0.00       108
                 m

In [38]:
# compare with rfc without gridsearch tuning
grid_rfc_test_accuracy - rfc_test_accuracy

related                  -0.057522
request                  -0.069423
offer                     0.000000
aid_related              -0.191028
medical_help             -0.003662
medical_products         -0.003814
search_and_rescue        -0.000458
security                  0.000458
military                 -0.000305
child_alone               0.000000
water                    -0.020598
food                     -0.056149
shelter                  -0.027922
clothing                 -0.001831
money                    -0.000305
missing_people           -0.000153
refugees                  0.000000
death                    -0.005035
other_aid                -0.003357
infrastructure_related    0.001068
transport                -0.003204
buildings                -0.005645
electricity              -0.000763
tools                     0.000305
hospitals                 0.000000
shops                     0.000000
aid_centers               0.000000
other_infrastructure      0.000610
weather_related     

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
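
One way to add a feature next to TF-IDF is a `FeatureUnion` with a custom transformer. The sketch below is an illustration only: `MessageLengthExtractor` is a hypothetical transformer that contributes a single column, the word count of each message, and the data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import FeatureUnion, Pipeline


class MessageLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one numeric feature, the word count per message."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.array([[len(text.split())] for text in X])


# TF-IDF features and the length feature are computed side by side, then stacked
features = FeatureUnion([
    ('text', Pipeline([('vect', CountVectorizer()),
                       ('tfidf', TfidfTransformer())])),
    ('length', MessageLengthExtractor()),
])

union_pipe = Pipeline([
    ('features', features),
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

# Toy data, made up for illustration
messages = ["we need water", "storm destroyed the bridge",
            "please send water and food", "heavy storm tonight"]
labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1]])

union_pipe.fit(messages, labels)
combined = features.transform(messages)
print(combined.shape)  # TF-IDF columns plus one length column
```

Whether a length feature helps on the real dataset is an open question; the point is only that any transformer with `fit` and `transform` can be stacked this way.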

#### MODEL 2. KNN

In [44]:
knn_pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(KNeighborsClassifier()))
    ])

In [45]:
knn_pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x00000291CE7B50D0>)),
  ('tfidf', TfidfTransformer()),
  ('clf', MultiOutputClassifier(estimator=KNeighborsClassifier()))],
 'verbose': False,
 'vect': CountVectorizer(tokenizer=<function tokenize at 0x00000291CE7B50D0>),
 'tfidf': TfidfTransformer(),
 'clf': MultiOutputClassifier(estimator=KNeighborsClassifier()),
 'vect__analyzer': 'word',
 'vect__binary': False,
 'vect__decode_error': 'strict',
 'vect__dtype': numpy.int64,
 'vect__encoding': 'utf-8',
 'vect__input': 'content',
 'vect__lowercase': True,
 'vect__max_df': 1.0,
 'vect__max_features': None,
 'vect__min_df': 1,
 'vect__ngram_range': (1, 1),
 'vect__preprocessor': None,
 'vect__stop_words': None,
 'vect__strip_accents': None,
 'vect__token_pattern': '(?u)\\b\\w\\w+\\b',
 'vect__tokenizer': <function __main__.tokenize(text)>,
 'vect__vocabulary': None,
 'tfidf__norm': 'l2',
 'tfidf__smooth_idf': True,
 'tfidf__sublinear_tf': False,
 'tf

In [46]:
# train the pipeline
knn_pipeline.fit(X_train,y_train)

Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x00000291CE7B50D0>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=KNeighborsClassifier()))])

In [47]:
# predict on train set
y_knn_trainpred = knn_pipeline.predict(X_train)

In [49]:
# classification report on train set
print(classification_report(y_train, y_knn_trainpred, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.83      0.99      0.90     15002
               request       0.96      0.38      0.55      3362
                 offer       0.00      0.00      0.00        89
           aid_related       0.98      0.35      0.52      8150
          medical_help       0.98      0.08      0.14      1594
      medical_products       0.96      0.11      0.19       989
     search_and_rescue       0.97      0.07      0.13       536
              security       1.00      0.01      0.02       366
              military       1.00      0.09      0.17       665
           child_alone       0.00      0.00      0.00         0
                 water       0.98      0.19      0.32      1240
                  food       0.96      0.27      0.42      2167
               shelter       0.95      0.18      0.31      1709
              clothing       0.98      0.14      0.25       297
                 money       0.96      

In [50]:
# Model accuracy score on train dataset
knn_train_accuracy = (y_knn_trainpred == y_train).mean()
print(knn_train_accuracy)

related                   0.838309
request                   0.892223
offer                     0.995473
aid_related               0.727430
medical_help              0.924978
medical_products          0.954936
search_and_rescue         0.974620
security                  0.981537
military                  0.969330
child_alone               1.000000
water                     0.948629
food                      0.918112
shelter                   0.928284
clothing                  0.986979
money                     0.977977
missing_people            0.989319
refugees                  0.967296
death                     0.960734
other_aid                 0.876965
infrastructure_related    0.936168
transport                 0.958598
buildings                 0.954377
electricity               0.981995
tools                     0.994151
hospitals                 0.988759
shops                     0.995372
aid_centers               0.988200
other_infrastructure      0.956818
weather_related     

In [53]:
y_knn_testpred = knn_pipeline.predict(X_test)


In [54]:
# classification report on test set
print(classification_report(y_test, y_knn_testpred, target_names=y.columns.values))

                        precision    recall  f1-score   support

               related       0.82      0.96      0.88      5091
               request       0.77      0.30      0.43      1112
                 offer       0.00      0.00      0.00        29
           aid_related       0.76      0.25      0.37      2710
          medical_help       0.47      0.03      0.05       490
      medical_products       0.52      0.03      0.06       324
     search_and_rescue       0.50      0.01      0.02       188
              security       0.00      0.00      0.00       105
              military       0.53      0.04      0.08       195
           child_alone       0.00      0.00      0.00         0
                 water       0.80      0.12      0.21       432
                  food       0.74      0.19      0.30       756
               shelter       0.70      0.08      0.15       605
              clothing       0.64      0.06      0.12       108
                 money       0.55      

In [56]:
# Model accuracy score on test dataset
knn_test_accuracy = (y_knn_testpred == y_test).mean()
print(knn_test_accuracy)

related                   0.800427
request                   0.866036
offer                     0.995575
aid_related               0.656851
medical_help              0.924931
medical_products          0.950717
search_and_rescue         0.971315
security                  0.983979
military                  0.970400
child_alone               1.000000
water                     0.940189
food                      0.898688
shelter                   0.912115
clothing                  0.983979
money                     0.977724
missing_people            0.987641
refugees                  0.967348
death                     0.954989
other_aid                 0.868020
infrastructure_related    0.935459
transport                 0.952090
buildings                 0.950717
electricity               0.977876
tools                     0.993287
hospitals                 0.990540
shops                     0.995575
aid_centers               0.988404
other_infrastructure      0.955905
weather_related     

#### Grid search for KNN

In [58]:

# grid search for KNN
knn_params = {'clf__estimator__n_neighbors': [15,20,29]}

grid_knn = GridSearchCV(knn_pipeline, param_grid=knn_params, cv=3, verbose=3)

In [None]:
grid_knn.fit(X_train,y_train)


Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV 1/3] END .................clf__estimator__n_neighbors=15; total time= 5.0min
[CV 2/3] END .................clf__estimator__n_neighbors=15; total time= 4.9min
[CV 3/3] END .................clf__estimator__n_neighbors=15; total time= 4.9min
[CV 1/3] END .................clf__estimator__n_neighbors=20; total time= 4.9min


In [None]:
grid_knn.best_params_


In [None]:
y_gridknn_testpred = grid_knn.predict(X_test)


In [None]:
# classification report on test set
print('Grid KNN Test Scores')
print(classification_report(y_test.values, y_gridknn_testpred, target_names=y.columns.values))
print('\n')

# accuracy score on test set
print('Grid KNN Accuracy')
gridknn_test_accuracy = (y_gridknn_testpred == y_test).mean()
print(gridknn_test_accuracy)

##### Compare the models

In [None]:

# model names, in the same order as the concatenated series below
model_names = ['rfc', 'knn', 'grid rfc', 'grid knn']

# concatenate accuracy scores
accuracy_df = pd.concat([rfc_test_accuracy, knn_test_accuracy,
                         grid_rfc_test_accuracy, gridknn_test_accuracy], axis=1)
accuracy_df.columns = model_names

print('Models Accuracy Score Comparison')
accuracy_df

In [None]:
to_graph = accuracy_df.to_dict()

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
ax.boxplot(accuracy_df.values)
ax.set_xticklabels(['rfc','knn','grid rfc','grid knn'])

plt.title('Models Accuracy Comparison')
plt.ylabel('Accuracy rate')

### 9. Export your model as a pickle file

In [None]:
# write the fitted pipeline to disk; the context manager closes the file handle
with open('classifier.pickle', 'wb') as file:
    pickle.dump(rfc_pipe, file)
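
To confirm that an exported model can be used later, it helps to load it back and compare predictions. The sketch below does the round trip with a small stand-in pipeline on made-up data; the file name `toy_classifier.pickle` and the temporary directory are assumptions for the illustration.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Small stand-in for the fitted pipeline, trained on made-up data
toy_pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', DecisionTreeClassifier(random_state=0)),
])
toy_pipe.fit(["we need water", "heavy storm tonight"], [0, 1])

# Write to a temporary location so nothing in the project folder is touched
path = os.path.join(tempfile.mkdtemp(), 'toy_classifier.pickle')
with open(path, 'wb') as f:
    pickle.dump(toy_pipe, f)

with open(path, 'rb') as f:
    restored = pickle.load(f)

sample = ["storm tonight"]
same = (restored.predict(sample) == toy_pipe.predict(sample)).all()
print(same)
```

Pickle stores functions by reference, so a pipeline that uses the custom `tokenize` function can only be loaded where a function with the same name and module is importable.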

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
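
The template itself is not shown in this notebook, so the following is only a sketch of how such a script might accept user-specified paths. The argument names `database_filepath` and `model_filepath` and the functions `parse_args` and `main` are assumptions, not taken from the template.

```python
import argparse


def parse_args(argv=None):
    """Hypothetical command-line interface for train.py."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file.")
    parser.add_argument('database_filepath',
                        help="path to the SQLite database, e.g. project2.db")
    parser.add_argument('model_filepath',
                        help="where to write the pickled model, e.g. classifier.pickle")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would go here, in order:
    # load data -> build pipeline -> fit -> evaluate -> pickle.dump
    print(f"Would train on {args.database_filepath} and save to {args.model_filepath}")
    return args


# Demo call with explicit arguments; in a real script this would be main(),
# guarded by `if __name__ == '__main__':`, so the values come from the command line.
demo_args = main(['project2.db', 'classifier.pickle'])
```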