# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
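
As a rough sketch of this step, the snippet below writes a tiny table to an in-memory SQLite database, reads it back, and splits it into a feature and targets. It rests on assumptions: the table name `messages_demo` and its rows are made up, and it uses `pd.read_sql_query` with the standard-library `sqlite3` module rather than `read_sql_table`, which needs an SQLAlchemy connectable.

```python
import sqlite3
import pandas as pd

# Hypothetical toy data standing in for the real message table.
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["need water", "road is blocked"],
    "water": [1, 0],
    "transport": [0, 1],
})

# In-memory SQLite database; nothing is written to disk.
con = sqlite3.connect(":memory:")
toy.to_sql("messages_demo", con, index=False)

# Read the whole table back into a DataFrame.
loaded = pd.read_sql_query("SELECT * FROM messages_demo", con)
con.close()

# Feature: the message text. Targets: the label columns.
X_demo = loaded["message"]
Y_demo = loaded[["water", "transport"]]
print(X_demo.shape, Y_demo.shape)
```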

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine



# Change default settings to allow seeing all of the data in a table.
pd.options.display.max_columns = 100
pd.options.display.max_rows = 200
pd.set_option('display.max_colwidth', 20)



import warnings

# Suppress warnings
warnings.filterwarnings("ignore")



In [2]:
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

import nltk
nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger', 'stopwords'])
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import classification_report
from sklearn.naive_bayes import BernoulliNB

# import pickle to save the classifier model as a pickle file
import pickle

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\dsbb1\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\dsbb1\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\dsbb1\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [3]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('DisasterResponse', engine)


In [4]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update -...,Un front froid s...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane...,Cyclone nan fini...,direct,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,8,Looking for some...,"Patnm, di Maryan...",direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leoga...,UN reports Leoga...,direct,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,12,says: west side ...,facade ouest d H...,direct,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [5]:
df.describe()

Unnamed: 0,id,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
count,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0,26216.0
mean,15224.82133,0.77365,0.170659,0.004501,0.414251,0.079493,0.050084,0.027617,0.017966,0.032804,0.0,0.063778,0.111497,0.088267,0.015449,0.023039,0.011367,0.033377,0.045545,0.131446,0.065037,0.045812,0.050847,0.020293,0.006065,0.010795,0.004577,0.011787,0.043904,0.278341,0.082202,0.093187,0.010757,0.093645,0.020217,0.052487,0.193584
std,8826.88914,0.435276,0.376218,0.06694,0.492602,0.270513,0.218122,0.163875,0.132831,0.178128,0.0,0.244361,0.314752,0.283688,0.123331,0.150031,0.106011,0.179621,0.2085,0.337894,0.246595,0.209081,0.219689,0.141003,0.077643,0.103338,0.067502,0.107927,0.204887,0.448191,0.274677,0.2907,0.103158,0.29134,0.140743,0.223011,0.395114
min,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7446.75,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,15662.5,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,22924.25,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
max,30265.0,2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


The `related` column has some values of 2, which have no documented meaning. We will change them to 1 to indicate that the message is related.

In [6]:
df.related.value_counts()

1    19906
0     6122
2      188
Name: related, dtype: int64

In [7]:
df['related'] = df.related.map(lambda x: 1 if x==2 else x)
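
The same remapping can also be written with `Series.replace`. The sketch below shows it on a made-up toy Series (`related_demo` is hypothetical, not the real column):

```python
import pandas as pd

# Hypothetical toy column with the same three values seen in `related`.
related_demo = pd.Series([1, 0, 2, 1, 2])

# Replace every 2 with 1 and leave all other values untouched.
cleaned = related_demo.replace(2, 1)

print(cleaned.tolist())  # [1, 0, 1, 1, 1]
```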

In [8]:
# create an independent (deep) copy to work on
df_copy = df.copy()

In [9]:
# The category labels are the targets, so the message text is the feature.
# Assign the message column to X and the 36 category columns to y.
X = df_copy.message
y = df_copy.iloc[:,4:]
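
Selecting the targets by position depends on the column order. A possible alternative, sketched here on a made-up toy frame, is to drop the non-label columns by name; the list `non_label_cols` is an assumption based on the columns shown by `df.head()`.

```python
import pandas as pd

# Hypothetical toy frame with the same leading columns as the real table.
toy = pd.DataFrame({
    "id": [2, 7],
    "message": ["Weather update", "Is the hurricane over"],
    "original": ["texte un", "texte deux"],
    "genre": ["direct", "direct"],
    "related": [1, 1],
    "request": [0, 0],
})

# Everything that is not an identifier or text column is treated as a label.
non_label_cols = ["id", "message", "original", "genre"]
X_demo = toy["message"]
y_demo = toy.drop(columns=non_label_cols)

print(list(y_demo.columns))  # ['related', 'request']
```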

In [10]:
y.isnull().sum()

related                   0
request                   0
offer                     0
aid_related               0
medical_help              0
medical_products          0
search_and_rescue         0
security                  0
military                  0
child_alone               0
water                     0
food                      0
shelter                   0
clothing                  0
money                     0
missing_people            0
refugees                  0
death                     0
other_aid                 0
infrastructure_related    0
transport                 0
buildings                 0
electricity               0
tools                     0
hospitals                 0
shops                     0
aid_centers               0
other_infrastructure      0
weather_related           0
floods                    0
storm                     0
fire                      0
earthquake                0
cold                      0
other_weather             0
direct_report       

In [11]:
X.head()

0    Weather update -...
1    Is the Hurricane...
2    Looking for some...
3    UN reports Leoga...
4    says: west side ...
Name: message, dtype: object

In [12]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,water,food,shelter,clothing,money,missing_people,refugees,death,other_aid,infrastructure_related,transport,buildings,electricity,tools,hospitals,shops,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


In [13]:
df_copy.related.unique()

array([1, 0], dtype=int64)

### 2. Write a tokenization function to process your text data

In [14]:


# raw string so that the backslash escapes reach the regex engine unchanged
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'


In [15]:
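class StartingVerbExtractor(BaseEstimator, TransformerMixin):
    '''Custom transformer that flags messages in which a sentence starts with a verb.

    Each message is split into sentences, each sentence is tokenized and
    part-of-speech tagged, and the first token of each sentence is checked.

    Inputs (to transform):
    X - an iterable of message strings

    Outputs:
    A one-column DataFrame holding 1 if any sentence in the message starts
    with a verb (tag 'VB' or 'VBP') or with the retweet marker 'RT', else 0.
    '''

    def starting_verb(self, text):
        sentence_list = nltk.sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            # skip sentences that yield no tokens to avoid an IndexError
            if not pos_tags:
                continue
            first_word, first_tag = pos_tags[0]
            # tokenize() lower-cases tokens, so compare case-insensitively
            if first_tag in ['VB', 'VBP'] or first_word.lower() == 'rt':
                return 1
        return 0

    def fit(self, x, y=None):
        return self

    def transform(self, X):
        X_tagged = pd.Series(X).apply(self.starting_verb)
        return pd.DataFrame(X_tagged)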

In [16]:


def tokenize(text):
    '''Replace URLs with a placeholder, split the text into tokens, then lemmatize and normalize each token.

    Inputs:
    text - the message text to be tokenized

    Outputs:
    clean_tokens - list of lemmatized, lower-case tokens with surrounding whitespace stripped.
                   Punctuation tokens are kept.
    '''
    detected_urls = re.findall(url_regex, text)
    for url in detected_urls:
        text = text.replace(url, "urlplaceholder")

    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens
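
The URL step of `tokenize` can be tried in isolation with only the standard library. This is a small sketch: `replace_urls` is a hypothetical helper (it uses `re.sub` rather than the `findall` and `replace` loop above), and the sample sentence is made up.

```python
import re

# Same pattern as in the notebook, written as a raw string.
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def replace_urls(text, placeholder="urlplaceholder"):
    """Replace every URL matched by url_regex with a fixed placeholder token."""
    return re.sub(url_regex, placeholder, text)

sample = "Need water see http://example.com/help now"
cleaned = replace_urls(sample)
print(cleaned)  # Need water see urlplaceholder now
```

Replacing URLs with one shared token keeps the vocabulary from filling up with one-off addresses while still recording that a link was present.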




### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
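
To illustrate what `MultiOutputClassifier` does, here is a small sketch on synthetic data (not the disaster messages). The sizes and the choice of `n_estimators=10` are arbitrary assumptions made to keep it quick.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

# Synthetic data: 100 samples, 10 features, 3 binary label columns.
X_toy, Y_toy = make_multilabel_classification(
    n_samples=100, n_features=10, n_classes=3, random_state=0
)

# One RandomForestClassifier is fitted per label column.
clf = MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=0))
clf.fit(X_toy, Y_toy)

pred = clf.predict(X_toy[:5])
print(pred.shape)            # one row per sample, one column per label
print(len(clf.estimators_))  # one fitted estimator per label
```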

In [17]:
# define the parameters to be used in the grid search (empty grid: use the pipeline defaults)
parameters = {}


In [18]:
# define the pipeline to be used in GridSearchCV
pipeline = Pipeline([
        ('features', FeatureUnion([
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),
    
        ('clf', RandomForestClassifier(n_jobs=-1))
    ])


In [19]:
def build_model():
    '''Wrap the current global `pipeline` in a grid search over the current global `parameters`.'''
    # create grid search object
    cv = GridSearchCV(pipeline, param_grid=parameters)

    return cv

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
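
The split step can be sketched on toy data as follows. The lists `messages_demo` and `labels_demo` are made up; `test_size=0.20` and `random_state=0` mirror the values used in `main` further down.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten messages with one binary label each.
messages_demo = [f"message {i}" for i in range(10)]
labels_demo = [i % 2 for i in range(10)]

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
Xtr, Xte, ytr, yte = train_test_split(
    messages_demo, labels_demo, test_size=0.20, random_state=0
)
print(len(Xtr), len(Xte))  # 8 2
```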

In [20]:
def display_results(model, y_test, y_pred):
    '''This function prints a per-label summary of the test results (precision, recall and F1
    from classification_report) and the best parameters found by the grid search.
    
    Inputs:
    model - result of GridSearchCV
    
    y_test - the test array
    
    y_pred - the predictions resulting from the GridSearch
    
    Outputs:
    
    Summary of test results by label
    
    F1 test results
    
    Best parameters for the model
    
    '''
    # creates the list of category names
    category_names = y.columns
    # create and print the classification report
    print(classification_report(y_test, y_pred, target_names=category_names))
    print(model.best_params_)


### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
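
The column-by-column approach described above could look roughly like this. It is a sketch on made-up arrays: `category_names`, `y_true_demo` and `y_pred_demo` are hypothetical stand-ins for the real label names, test labels and predictions.

```python
import numpy as np
from sklearn.metrics import classification_report

category_names = ["water", "food"]  # hypothetical label names
y_true_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_pred_demo = np.array([[1, 0], [0, 0], [0, 1], [0, 0]])

reports = {}
for i, name in enumerate(category_names):
    # One report per label column; zero_division=0 avoids warnings
    # when a class is never predicted.
    reports[name] = classification_report(
        y_true_demo[:, i], y_pred_demo[:, i], output_dict=True, zero_division=0
    )
    positive = reports[name]["1"]
    print(name, positive["precision"], positive["recall"], round(positive["f1-score"], 2))
```

With a DataFrame of test labels, the column would be selected with `.iloc[:, i]` instead of NumPy slicing.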

In [21]:
def main(pkl = 0):
    '''
    
    This function runs the steps defined above in order:
    1) split the data into training and test sets
    2) create the model using GridSearchCV
    3) fit the model (the pipeline tokenizes and lemmatizes the messages and tags starting verbs)
    4) predict labels for the test data
    5) evaluate the results using classification_report
    6) optionally save the model as a pickle file
    
    Inputs:
    
    Arguments: 
    
    pkl - Set to 0 as default to prevent this model saving as a pickle file.
          To save pickle file, set to 1
    
    Outputs:
    
    Summary of test results by label
    
    F1 test results
    
    Best parameters for the model
    
    '''
    # create the train, test, split data sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

    # build the model
    model = build_model()
    
    # train the model
    model.fit(X_train, y_train)
    
    # predict labels for the test data
    y_pred = model.predict(X_test)
    
    # display the results of the model
    display_results(model, y_test, y_pred)
    
    # save the model as a pickle file if pkl == 1
    if pkl == 1:
        pkl_filename = 'classifier.pkl'
        with open(pkl_filename, 'wb') as file:
            pickle.dump(model, file)
            print("Saving Pickle File")


    

In [22]:
# execute the main function without arguments so the model is not saved as a pickle file
main()

                        precision    recall  f1-score   support

               related       0.83      0.96      0.89      4027
               request       0.90      0.44      0.60       911
                 offer       0.00      0.00      0.00        15
           aid_related       0.84      0.48      0.61      2202
          medical_help       0.25      0.00      0.00       420
      medical_products       0.57      0.02      0.03       258
     search_and_rescue       0.50      0.01      0.01       155
              security       0.00      0.00      0.00        94
              military       0.00      0.00      0.00       166
           child_alone       0.00      0.00      0.00         0
                 water       0.91      0.14      0.25       336
                  food       0.92      0.28      0.42       569
               shelter       0.91      0.10      0.18       467
              clothing       0.67      0.02      0.05        84
                 money       1.00      

### 6. Improve your model
Use grid search to find better parameters. 

In [23]:
parameters = {
    'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
    'features__transformer_weights': (
        {'text_pipeline': 1, 'starting_verb': 0.5},
        {'text_pipeline': 0.8, 'starting_verb': 1},
    )
}
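
Grid keys are built from the step names joined with double underscores. One way to check that a key is valid is to look it up in `get_params()`. The sketch below uses a simplified stand-in pipeline (`demo_pipeline`, without the custom tokenizer or `StartingVerbExtractor`), so it is an illustration rather than the notebook's exact pipeline.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

# Simplified stand-in with the same step names as the notebook's pipeline.
demo_pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer()),
            ('tfidf', TfidfTransformer()),
        ])),
    ])),
    ('clf', RandomForestClassifier()),
])

# Every key accepted by GridSearchCV's param_grid appears in get_params().
valid_names = demo_pipeline.get_params().keys()
for name in ('features__text_pipeline__vect__ngram_range',
             'features__transformer_weights',
             'clf__n_estimators'):
    print(name, name in valid_names)
```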


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!
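
When reading these reports, it helps to recall how precision, recall and F1 relate. The sketch below computes them from raw counts; the counts are hypothetical and chosen to show how perfect precision combined with very low recall still gives an F1 score near zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts: 3 correct positives, no false alarms, 255 missed positives.
p, r, f1 = precision_recall_f1(tp=3, fp=0, fn=255)
print(round(p, 2), round(r, 2), round(f1, 2))  # 1.0 0.01 0.02
```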

In [24]:
# execute the main function without arguments so the model is not saved as a pickle file
main()

                        precision    recall  f1-score   support

               related       0.84      0.95      0.89      4027
               request       0.90      0.43      0.58       911
                 offer       0.00      0.00      0.00        15
           aid_related       0.88      0.40      0.55      2202
          medical_help       0.00      0.00      0.00       420
      medical_products       1.00      0.01      0.02       258
     search_and_rescue       0.00      0.00      0.00       155
              security       0.00      0.00      0.00        94
              military       0.00      0.00      0.00       166
           child_alone       0.00      0.00      0.00         0
                 water       0.94      0.15      0.25       336
                  food       0.91      0.28      0.43       569
               shelter       0.92      0.07      0.13       467
              clothing       1.00      0.01      0.02        84
                 money       1.00      

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
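
As one possible illustration of the second idea, the sketch below combines TF-IDF with a message-length feature through `FeatureUnion`. `TextLengthExtractor` is a hypothetical transformer invented for this example, and the two documents are made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One row per message, one column holding its length.
        return np.array([[len(text)] for text in X])

features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', TextLengthExtractor()),
])

docs = ["need water", "we need food and water"]
matrix = features.fit_transform(docs)
print(matrix.shape)  # (2, 6): five TF-IDF columns plus one length column
```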

In [25]:
# create parameters to be used in this version of the model
parameters = {
    'features__text_pipeline__vect__max_df': (0.5, 0.75, 1.0),
    'features__text_pipeline__tfidf__use_idf': (True, False)
}

In [26]:
# define the pipeline to be used in this version of the model
pipeline = Pipeline([
        ('features', FeatureUnion([
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),
    
        ('clf', MultiOutputClassifier(BernoulliNB(), n_jobs=-1))
    ])

In [27]:
# execute the main function without arguments so the model is not saved as a pickle file
main()

                        precision    recall  f1-score   support

               related       0.87      0.90      0.88      4027
               request       0.71      0.61      0.65       911
                 offer       0.00      0.00      0.00        15
           aid_related       0.73      0.61      0.67      2202
          medical_help       0.45      0.11      0.18       420
      medical_products       0.48      0.09      0.14       258
     search_and_rescue       0.16      0.02      0.03       155
              security       0.06      0.01      0.02        94
              military       0.35      0.05      0.09       166
           child_alone       0.00      0.00      0.00         0
                 water       0.43      0.07      0.12       336
                  food       0.59      0.15      0.24       569
               shelter       0.49      0.11      0.18       467
              clothing       0.07      0.01      0.02        84
                 money       0.05      

Trying another approach: this time, wrap RandomForestClassifier in MultiOutputClassifier.

In [28]:
# create parameters to be used in this version of the model
parameters = {
    'features__text_pipeline__vect__max_df': (0.5, 0.75, 1.0),
    'features__text_pipeline__tfidf__use_idf': (True, False)
}

In [29]:
# define the pipeline to be used in this version of the model
pipeline = Pipeline([
        ('features', FeatureUnion([
            
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ])),

            ('starting_verb', StartingVerbExtractor())
        ])),
    
        ('clf', MultiOutputClassifier(RandomForestClassifier(n_jobs=-1)))
    ])

In [30]:
# execute the main function without arguments so the model is not saved as a pickle file
main()

                        precision    recall  f1-score   support

               related       0.82      0.97      0.89      4027
               request       0.90      0.42      0.57       911
                 offer       0.00      0.00      0.00        15
           aid_related       0.81      0.60      0.69      2202
          medical_help       0.67      0.04      0.07       420
      medical_products       0.83      0.09      0.17       258
     search_and_rescue       1.00      0.01      0.01       155
              security       0.00      0.00      0.00        94
              military       0.92      0.07      0.13       166
           child_alone       0.00      0.00      0.00         0
                 water       0.89      0.26      0.41       336
                  food       0.89      0.41      0.56       569
               shelter       0.86      0.24      0.38       467
              clothing       0.50      0.02      0.05        84
                 money       1.00      

Among the models tried, this one gives the strongest overall precision and F1 scores, so we will use it and export it as a pickle file.

### 9. Export your model as a pickle file

No need to load the parameters or pipeline since we will be using the final model (immediately above).
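
For reference, a pickle round trip looks roughly like this. The sketch uses a plain dictionary as a stand-in for the fitted model and a temporary directory, both of which are assumptions made to keep it self-contained.

```python
import os
import pickle
import tempfile

# Stand-in object; in the notebook this would be the fitted GridSearchCV model.
model_stand_in = {"name": "demo-model", "labels": ["water", "food"]}

# Write to a temporary location so no project file is overwritten.
path = os.path.join(tempfile.mkdtemp(), "classifier_demo.pkl")

with open(path, "wb") as f:
    pickle.dump(model_stand_in, f)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model_stand_in)  # True
```

When loading the real model in another script, `tokenize` and `StartingVerbExtractor` need to be importable there, because pickle stores references to them by name rather than their code.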

In [31]:
# execute the main function.  Since saving the pickle file, put a 1 in the function.
main(1)

                        precision    recall  f1-score   support

               related       0.82      0.97      0.89      4027
               request       0.89      0.41      0.56       911
                 offer       0.00      0.00      0.00        15
           aid_related       0.81      0.60      0.69      2202
          medical_help       0.64      0.04      0.08       420
      medical_products       0.79      0.10      0.18       258
     search_and_rescue       1.00      0.05      0.09       155
              security       0.00      0.00      0.00        94
              military       0.85      0.07      0.12       166
           child_alone       0.00      0.00      0.00         0
                 water       0.89      0.24      0.38       336
                  food       0.91      0.46      0.61       569
               shelter       0.84      0.25      0.39       467
              clothing       0.83      0.06      0.11        84
                 money       1.00      

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.

The `train_classifier.py` script contains the code to run the above steps to create a database and export a model based on a new dataset provided by the user.
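
A script like this usually takes its input and output paths from the command line. The sketch below shows one way to parse them with `argparse`; the argument names `database_filepath` and `model_filepath` are assumptions, not necessarily those used in the actual script.

```python
import argparse

def parse_args(argv=None):
    """Parse the (hypothetical) command-line arguments of a training script."""
    parser = argparse.ArgumentParser(description="Train and export a message classifier.")
    parser.add_argument("database_filepath", help="path to the SQLite database to read")
    parser.add_argument("model_filepath", help="path of the pickle file to write")
    return parser.parse_args(argv)

# Passing a list here simulates: python train_classifier.py DisasterResponse.db classifier.pkl
args = parse_args(["DisasterResponse.db", "classifier.pkl"])
print(args.database_filepath, args.model_filepath)
```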