# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import numpy as np
import pandas as pd
import pickle
import re
import nltk
import time
from sqlalchemy import create_engine
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

In [2]:
# download punkt for nltk function
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [3]:
# download wordnet for nltk function
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [4]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('disaster_messages', engine)
X = df['message']
Y = df.loc[:, 'related':'direct_report']

In [5]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [6]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [7]:
Y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [8]:
def tokenize(text):
    """
    Clean and tokenize text for modeling. The text is lowercased and every
    character that is not a letter or digit is replaced with a space. The
    result is split into word tokens (stopwords are not removed). Each
    token is lemmatized with NLTK's WordNetLemmatizer, first as a noun and
    then as a verb, and finally stemmed with NLTK's PorterStemmer.
    
    INPUTS:
        text - the string representing the message
    RETURNS:
        clean_tokens - a list containing the cleaned word tokens of the
        message
    """

    # replace all non-alphabets and non-numbers with blank space
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())
    
    # Tokenize words
    tokens = word_tokenize(text)
    
    # instantiate lemmatizer
    lemmatizer = WordNetLemmatizer()
    
    # instantiate stemmer
    stemmer = PorterStemmer()
    
    clean_tokens = []
    for tok in tokens:
        # lemmatize token using noun as part of speech
        clean_tok = lemmatizer.lemmatize(tok)
        # lemmatize token using verb as part of speech
        clean_tok = lemmatizer.lemmatize(clean_tok, pos='v')
        # stem token
        clean_tok = stemmer.stem(clean_tok)
        # strip whitespace and append clean token to the list
        clean_tokens.append(clean_tok.strip())
        
    return clean_tokens

In [9]:
# test the function
text = "Hello, I see fire in the street and many houses are burning and destroyed, homeless people everywhere"
tokenize(text)

['hello',
 'i',
 'see',
 'fire',
 'in',
 'the',
 'street',
 'and',
 'mani',
 'hous',
 'be',
 'burn',
 'and',
 'destroy',
 'homeless',
 'peopl',
 'everywher']
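
The output above still contains common words such as `i`, `in`, `the` and `and`, because `tokenize` does not filter stopwords. The following is a minimal sketch of how such a filter could look. It is not part of the notebook's function: `STOPWORDS` is a small illustrative subset (an assumption), and `simple_tokenize` and `remove_stopwords` are hypothetical helper names. In the notebook, the set could instead come from `stopwords.words('english')` after downloading the NLTK `stopwords` corpus.

```python
import re

# Illustrative subset only (assumption); NLTK's English stopword list is much longer.
STOPWORDS = {"i", "in", "the", "and", "are", "be"}


def simple_tokenize(text):
    """Lowercase, replace non-alphanumeric characters with spaces, split on whitespace."""
    return re.sub(r"[^a-zA-Z0-9]", " ", text.lower()).split()


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, preserving order."""
    return [tok for tok in tokens if tok not in stopwords]


tokens = simple_tokenize("Hello, I see fire in the street")
filtered = remove_stopwords(tokens)
print(filtered)
```

If a filter like this were added to `tokenize`, it would probably belong before the stemming step, since stemmed forms such as `mani` no longer match entries in a stopword list.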

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
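
As a rough illustration of what `MultiOutputClassifier` does, the toy sketch below fits it on a tiny made-up dataset with two target columns. The arrays `X_toy` and `Y_toy` are invented for this example and are unrelated to the disaster messages. The wrapper fits one copy of the base estimator per target column.

```python
import numpy as np
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data (made up): target column 0 copies feature 0, target column 1 copies feature 1.
X_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
Y_toy = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

clf = MultiOutputClassifier(DecisionTreeClassifier(random_state=0))
clf.fit(X_toy, Y_toy)

# One fitted estimator per target column.
n_models = len(clf.estimators_)

# Predictions have one column per target, like Y_toy.
Y_hat = clf.predict(X_toy)
print(n_models, Y_hat.shape)
```

With 36 category columns, the same wrapper would train 36 independent classifiers, which is one reason training time grows quickly.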

In [10]:
# Create pipeline with Classifier

pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [11]:
# split data, train and predict
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# train classifier
pipeline.fit(X_train.values, Y_train.values)

# predict on test data
Y_pred = pipeline.predict(X_test)



### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [12]:
# Get names of all categories
category_names = Y_test.columns.tolist()

Y_pred_df = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_df.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,1,0,0,0,0,0,0,...,0,0,1,1,0,0,0,0,0,0
4,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [13]:
# show the f1 score with classification_report
for i in range(36):
    print(category_names[i],\
          '\n',\
          classification_report(Y_test.iloc[:,i], Y_pred_df.iloc[:,i]))

related 
              precision    recall  f1-score   support

          0       0.64      0.38      0.48      1245
          1       0.83      0.93      0.88      3998

avg / total       0.78      0.80      0.78      5243

request 
              precision    recall  f1-score   support

          0       0.88      0.98      0.93      4352
          1       0.81      0.32      0.46       891

avg / total       0.87      0.87      0.85      5243

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5219
          1       0.00      0.00      0.00        24

avg / total       0.99      1.00      0.99      5243

aid_related 
              precision    recall  f1-score   support

          0       0.73      0.88      0.80      3079
          1       0.77      0.54      0.64      2164

avg / total       0.75      0.74      0.73      5243

medical_help 
              precision    recall  f1-score   support

          0       0.92      1



### 6. Improve your model
Use grid search to find better parameters. 

In [14]:
# show me the params
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fdbf3255a60>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [15]:
# improve the model with GridSearchCV

parameters = {'clf__estimator__max_depth': [10, 50, None],
              'clf__estimator__min_samples_leaf':[2, 5, 10]}


cv = GridSearchCV(pipeline, parameters)
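
To get a feel for the cost of this search before running it, the sketch below enumerates the candidate combinations in plain Python and shows how a parameter name is routed through the pipeline. The value of `n_folds` is an assumption: the default number of cross-validation folds depends on the scikit-learn version (3 in older releases, 5 from version 0.22 onward).

```python
from itertools import product

parameters = {
    'clf__estimator__max_depth': [10, 50, None],
    'clf__estimator__min_samples_leaf': [2, 5, 10],
}

# Every combination of the listed values is one candidate setting.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 3  # assumption: depends on the scikit-learn version in use
n_fits = len(candidates) * n_folds

# Double underscores separate nesting levels:
# pipeline step 'clf' -> its 'estimator' attribute -> that estimator's 'max_depth'.
step, *nested = names[0].split('__')

print(len(candidates), n_fits, step, nested)
```

With the default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training set after the search.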

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!

In [16]:
starttime = time.time()

cv.fit(X_train.values, Y_train.values)

runtime = time.time() - starttime
print('Function completed in {:.0f}m {:.0f}s'.format(runtime // 60, runtime % 60))



Function completed in 25m 48s


In [17]:
# predict and create a dataframe
Y_pred = cv.predict(X_test)
Y_pred_df = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_df.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [18]:
# show me Y_test with classification_report
for i in range(36):
    print(category_names[i],\
          '\n',\
          classification_report(Y_test.iloc[:,i], Y_pred_df.iloc[:,i]))

related 
              precision    recall  f1-score   support

          0       0.70      0.28      0.40      1245
          1       0.81      0.96      0.88      3998

avg / total       0.78      0.80      0.77      5243

request 
              precision    recall  f1-score   support

          0       0.87      0.99      0.93      4352
          1       0.88      0.27      0.41       891

avg / total       0.87      0.87      0.84      5243

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      5219
          1       0.00      0.00      0.00        24

avg / total       0.99      1.00      0.99      5243

aid_related 
              precision    recall  f1-score   support

          0       0.76      0.84      0.80      3079
          1       0.73      0.62      0.67      2164

avg / total       0.75      0.75      0.74      5243

medical_help 
              precision    recall  f1-score   support

          0       0.92      1



In [19]:
# get the scores and show the accuracy
for i in range(36):
    category = category_names[i]
    accuracy = accuracy_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i])
    precision = precision_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i], average='micro')
    recall = recall_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i], average='micro')
    f1 = f1_score(Y_test.iloc[:,i], Y_pred_df.iloc[:,i], average='micro')
    print(category)
    print("\t Accuracy: %.4f \t Precision: %.4f \t Recall: %.4f \t F1-Score: %.4f \n" %\
              (accuracy, precision, recall, f1))       

related
	 Accuracy: 0.8003 	 Precision: 0.8003 	 Recall: 0.8003 	 F1-Score: 0.8003 

request
	 Accuracy: 0.8693 	 Precision: 0.8693 	 Recall: 0.8693 	 F1-Score: 0.8693 

offer
	 Accuracy: 0.9954 	 Precision: 0.9954 	 Recall: 0.9954 	 F1-Score: 0.9954 

aid_related
	 Accuracy: 0.7479 	 Precision: 0.7479 	 Recall: 0.7479 	 F1-Score: 0.7479 

medical_help
	 Accuracy: 0.9205 	 Precision: 0.9205 	 Recall: 0.9205 	 F1-Score: 0.9205 

medical_products
	 Accuracy: 0.9491 	 Precision: 0.9491 	 Recall: 0.9491 	 F1-Score: 0.9491 

search_and_rescue
	 Accuracy: 0.9741 	 Precision: 0.9741 	 Recall: 0.9741 	 F1-Score: 0.9741 

security
	 Accuracy: 0.9817 	 Precision: 0.9817 	 Recall: 0.9817 	 F1-Score: 0.9817 

military
	 Accuracy: 0.9701 	 Precision: 0.9701 	 Recall: 0.9701 	 F1-Score: 0.9701 

child_alone
	 Accuracy: 1.0000 	 Precision: 1.0000 	 Recall: 1.0000 	 F1-Score: 1.0000 

water
	 Accuracy: 0.9413 	 Precision: 0.9413 	 Recall: 0.9413 	 F1-Score: 0.9413 

food
	 Accuracy: 0.9046 	 Precision
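
In the output above, accuracy, precision, recall and F1 are identical for every category. That is expected with `average='micro'` on a single-label problem: micro-averaging pools true positives, false positives and false negatives over all classes, and every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class). The pure-Python sketch below reproduces this on made-up labels. The helper names and the data are hypothetical.

```python
def confusion_counts(y_true, y_pred, label):
    """Count true positives, false positives and false negatives for one label."""
    tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
    fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    return tp, fp, fn


def micro_scores(y_true, y_pred):
    """Micro-averaged precision and recall: pool the counts over all labels."""
    tp = fp = fn = 0
    for label in sorted(set(y_true) | set(y_pred)):
        l_tp, l_fp, l_fn = confusion_counts(y_true, y_pred, label)
        tp, fp, fn = tp + l_tp, fp + l_fp, fn + l_fn
    return tp / (tp + fp), tp / (tp + fn)


# Made-up labels for illustration.
y_true = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
micro_precision, micro_recall = micro_scores(y_true, y_pred)

# Scores for the positive class only (label 1) carry different information.
pos_tp, pos_fp, pos_fn = confusion_counts(y_true, y_pred, 1)
pos_precision = pos_tp / (pos_tp + pos_fp)
pos_recall = pos_tp / (pos_tp + pos_fn)

print(accuracy, micro_precision, micro_recall, pos_precision, pos_recall)
```

For per-category reporting, the default `average='binary'` or `average='macro'` would probably be more informative than `'micro'`, since they do not collapse to accuracy.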

In [20]:
# finally ... the best params
cv.best_params_

{'clf__estimator__max_depth': None, 'clf__estimator__min_samples_leaf': 2}

In [21]:
# ... and the best estimator
cv.best_estimator_

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
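
One possible way to add a feature besides TF-IDF is a small custom transformer combined with the text features through `FeatureUnion`. The sketch below is only an illustration under assumptions: `TextLengthExtractor` is a hypothetical class, message length is an arbitrary example feature, and the two messages are made up. It uses `TfidfVectorizer` with its default tokenizer to stay self-contained.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one numeric feature, the character length of each message."""

    def fit(self, X, y=None):
        # Nothing to learn; return self so the transformer works inside a pipeline.
        return self

    def transform(self, X):
        # Shape (n_samples, 1), so it can be stacked next to the TF-IDF columns.
        return np.array([len(text) for text in X]).reshape(-1, 1)


features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', TextLengthExtractor()),
])

messages = ["water needed now", "fire in the street"]  # made-up examples
combined = features.fit_transform(messages)

# The TF-IDF columns come first, followed by the single length column.
print(combined.shape)
```

In the notebook's pipeline, a union like this could take the place of the `vect` and `tfidf` steps, with the classifier step left unchanged.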

In [22]:
# testing a pure decision tree classifier
starttime = time.time()
pipeline = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
                    ])

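# re-split with the default test size (25%) and no fixed random_state,
# so these scores are computed on a different test set than the ones above
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)

pipeline.fit(X_train.values, Y_train.values)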

# predict and create a dataframe
Y_pred = pipeline.predict(X_test)

runtime = time.time() - starttime
print('Function completed in {:.0f}m {:.0f}s'.format(runtime // 60, runtime % 60))

# store the predictions of the new model in a DataFrame
Y_pred_imp = pd.DataFrame(Y_pred, columns = category_names)
Y_pred_imp.head()




Function completed in 8m 8s


Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,1,1,0,0,0,0,0,...,0,0,1,1,1,1,1,1,0,0


In [23]:
# show me Y_test with classification_report
for i in range(36):
    print(category_names[i],\
          '\n',\
          classification_report(Y_test.iloc[:,i], Y_pred_imp.iloc[:,i]))

related 
              precision    recall  f1-score   support

          0       0.50      0.45      0.48      1542
          1       0.84      0.86      0.85      5012

avg / total       0.76      0.76      0.76      6554

request 
              precision    recall  f1-score   support

          0       0.91      0.91      0.91      5473
          1       0.55      0.56      0.56      1081

avg / total       0.85      0.85      0.85      6554

offer 
              precision    recall  f1-score   support

          0       1.00      1.00      1.00      6525
          1       0.00      0.00      0.00        29

avg / total       0.99      0.99      0.99      6554

aid_related 
              precision    recall  f1-score   support

          0       0.76      0.74      0.75      3874
          1       0.64      0.65      0.64      2680

avg / total       0.71      0.71      0.71      6554

medical_help 
              precision    recall  f1-score   support

          0       0.94      0

### 9. Export your model as a pickle file

In [24]:
filename = 'classifier.pkl'
# use a context manager so the file is closed after writing
with open(filename, 'wb') as f:
    pickle.dump(cv, f)
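
Loading the model back follows the same pattern in reverse. The sketch below shows the round trip with a small dictionary standing in for the fitted `cv` object (an assumption, to keep the example fast and self-contained). It writes to a temporary directory rather than the working directory.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted GridSearchCV object (assumption, for illustration only).
stand_in_model = {
    'clf__estimator__max_depth': None,
    'clf__estimator__min_samples_leaf': 2,
}

path = os.path.join(tempfile.mkdtemp(), 'classifier.pkl')

# Write the object to disk.
with open(path, 'wb') as f:
    pickle.dump(stand_in_model, f)

# Read it back.
with open(path, 'rb') as f:
    restored = pickle.load(f)

print(restored == stand_in_model)
```

One caveat for the real model: pickle stores functions by reference, so the `tokenize` function used by `CountVectorizer` must be importable under the same module path wherever the pickle is loaded.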

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.