# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [125]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/acoullandreau/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/acoullandreau/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/acoullandreau/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [126]:
# import libraries

#note that sklearn's version should be at least 0.20.0

#import for NLP
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer

#import for object manipulation
import numpy as np
import pandas as pd
import re
from sqlalchemy import create_engine

#import for ML pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import classification_report

In [127]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql("SELECT * FROM MessagesWithCategory", engine)
df.head()

|  | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [128]:
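#we define X as the first 4 columns (we will select only the 'message' column afterwards)
#and the category columns as Y
#note: with the columns shown above, iloc[:, 5:] starts at 'request', so 'related' (position 4)
#is not part of Y; df.iloc[:, 4:] would include all 36 categories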
X = df.iloc[:, :4]
Y = df.iloc[:, 5:]

### 2. Write a tokenization function to process your text data

In [129]:
lemmatizer = WordNetLemmatizer()

#we build the stop word set once instead of reloading the list for every word
stop_words = set(stopwords.words('english'))

#raw string, so that the backslashes reach the regex engine unchanged
url_regex = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def tokenize(text):
    #we convert the text to lower case
    text = text.lower()

    #we replace any url contained in the text with a placeholder
    url_in_msg = re.findall(url_regex, text)
    for url in url_in_msg:
        text = text.replace(url, "urlplaceholder")

    # we remove the punctuation
    text = re.sub(r"[^a-z0-9\s]", " ", text)

    # we tokenize the text
    words = word_tokenize(text)

    # we remove the stop words and lemmatize the remaining words
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]

    return words

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
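
To make the role of MultiOutputClassifier concrete, here is a minimal sketch on invented toy data. The arrays and the choice of LogisticRegression as base estimator are assumptions for illustration only, not part of this notebook's pipeline. It shows that one copy of the base estimator is fitted per target column, and that the prediction has one column per target.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy features and two binary targets, invented for illustration.
X_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [0, 1], [1, 0]])
Y_toy = np.array([[0, 1], [1, 0], [1, 1], [0, 0], [0, 1], [1, 0]])

# MultiOutputClassifier clones the base estimator once per target column.
multi_clf = MultiOutputClassifier(LogisticRegression()).fit(X_toy, Y_toy)
toy_pred = multi_clf.predict(X_toy)

print(toy_pred.shape)              # one row per sample, one column per target
print(len(multi_clf.estimators_))  # one fitted estimator per target
```

This is why the predictions are indexed by column position (`y_pred[:, i]`) when they are compared with the category columns of Y.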

In [130]:
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators=100, random_state=42, n_jobs=-1)))
])


Notes on the choices:
- The classifier is a Random Forest with 100 trees (`n_estimators=100`) and otherwise default parameters. It was chosen for 3 reasons:
    - there are many "features" (i.e. words in our case) to consider, and the trees split on the features that best separate the classes, so the most informative words carry the most weight
    - the dataset is not too big, so training time and memory usage stay reasonable
    - training can be parallelized (`n_jobs=-1`), which makes it faster to execute
- CountVectorizer is instantiated with the tokenize() function defined previously: each message is first tokenized, then turned into a vector of token counts
- On top of this vectorization step, we apply TF-IDF to the count matrix to take into account how frequently each word appears across all messages. Words that appear in many messages are weighted down and rarer words are weighted up, which gives a better indication of how much each word matters for assigning categories to a message.
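
As a small illustration of the TF-IDF weighting described above, the sketch below uses three invented messages (they are not taken from the dataset) and sklearn's default tokenizer rather than our tokenize() function. It compares the IDF weight learned for a word that appears in every message with the weight of a word that appears in only one.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Toy messages, invented for illustration:
# "water" appears in every message, "storm" in only one.
toy_messages = [
    "we need water",
    "water and food please",
    "storm destroyed the water tank",
]

toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_messages)   # raw counts, one row per message
toy_tfidf = TfidfTransformer().fit(toy_counts)      # learns one IDF weight per term

# vocabulary_ maps each term to its column index in the count matrix.
idf_by_term = {term: toy_tfidf.idf_[idx] for term, idx in toy_vect.vocabulary_.items()}

print(idf_by_term["water"], idf_by_term["storm"])
```

The word present in every message receives the smallest possible IDF weight, while the rarer word receives a larger one.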


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [131]:
#we split the column 'message' of X and the whole Y dataframe into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X['message'], Y, test_size=0.2, random_state=42)

#we fit the pipeline using the training sets
pipeline.fit(X_train, y_train)


Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                  

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [132]:
#we predict the categories using the testing set
y_pred = pipeline.predict(X_test)

In [133]:
#we define a function to build the classification report on each category of Y
def compute_metrics(y_test, y_pred):
    reports = {}
    for i, column in enumerate(y_test.columns):
        #labels is restricted to the classes that were actually predicted, so a class that is
        #never predicted for a category is left out of that category's weighted average
        report = classification_report(y_test[column], y_pred[:, i], labels=np.unique(y_pred[:, i]), output_dict=True)
        reports[column] = report['weighted avg']
        reports[column]['accuracy'] = (y_pred[:, i] == y_test[column]).mean()

    return reports

def create_df_from_dict(results, scenario):
    df = pd.DataFrame.from_dict(results, orient='index')
    df.drop('support', axis=1, inplace=True)
    df.rename(columns={"precision": 'Precision_{}'.format(scenario),
                       "recall": 'Recall_{}'.format(scenario),
                       "f1-score":'F1_score_{}'.format(scenario),
                       "accuracy":'Accuracy_{}'.format(scenario)}, inplace=True)
    return df

#we define another function to build a dataframe with the scores in order to be able to compare the results
#of different scenarios
def append_results_to_df(df, results, scenario):

    if df.empty:
        df = create_df_from_dict(results, scenario)

    else:
        df_2 = create_df_from_dict(results, scenario)
        df = pd.concat([df, df_2], axis=1)

    return df

results = compute_metrics(y_test, y_pred)
df_results = pd.DataFrame()
df_results = append_results_to_df(df_results, results, 'default_config')
df_results

|  | Precision_default_config | Recall_default_config | F1_score_default_config | Accuracy_default_config |
|---|---|---|---|---|
| request | 0.892311 | 0.896644 | 0.88488 | 0.896644 |
| offer | 0.995042 | 1.0 | 0.997515 | 0.995042 |
| aid_related | 0.776558 | 0.778223 | 0.776202 | 0.778223 |
| medical_help | 0.907202 | 0.92315 | 0.893865 | 0.92315 |
| medical_products | 0.943665 | 0.951182 | 0.931348 | 0.951182 |
| search_and_rescue | 0.971396 | 0.976545 | 0.966342 | 0.976545 |
| security | 0.969685 | 0.982456 | 0.974873 | 0.982456 |
| military | 0.962833 | 0.971396 | 0.960112 | 0.971396 |
| child_alone | 1.0 | 1.0 | 1.0 | 1.0 |
| water | 0.953308 | 0.956903 | 0.949503 | 0.956903 |


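A few comments at this point:
- within each category, precision, recall, F1 and accuracy are close to one another
- accuracy and recall have the same values for most categories. This is expected: a recall averaged over the classes and weighted by their support equals the accuracy whenever every class present in the test set is part of the report. The two differ (e.g. for offer) only when a class is never predicted, because `compute_metrics` then leaves it out of the average. We will therefore look at only one of the two metrics
- F1 is the lowest score for almost all categories, so let's take a closer look at this metric to evaluate our model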
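
The relation between accuracy and weighted recall noted above can be checked on a small example. The labels below are invented for illustration, and the model is assumed to predict only the negative class, which is the situation in which the two metrics diverge when the report is restricted to the predicted labels.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Toy labels, invented for illustration: 8 negatives, 2 positives,
# and a model that only ever predicts the negative class.
toy_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
toy_pred = np.zeros(10, dtype=int)

toy_accuracy = accuracy_score(toy_true, toy_pred)

# Weighted recall over every class present in toy_true equals the accuracy.
recall_all_labels = recall_score(toy_true, toy_pred, labels=[0, 1], average="weighted")

# Restricting the labels to the predicted ones drops class 1 from the average.
recall_predicted_only = recall_score(
    toy_true, toy_pred, labels=np.unique(toy_pred), average="weighted"
)

print(toy_accuracy, recall_all_labels, recall_predicted_only)
```

With all labels included, recall and accuracy coincide. With only the predicted labels, recall is computed on the negative class alone and no longer reflects the missed positives.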

In [134]:
df_results.sort_values('F1_score_default_config')

|  | Precision_default_config | Recall_default_config | F1_score_default_config | Accuracy_default_config |
|---|---|---|---|---|
| aid_related | 0.776558 | 0.778223 | 0.776202 | 0.778223 |
| other_aid | 0.830731 | 0.868612 | 0.815448 | 0.868612 |
| direct_report | 0.843583 | 0.853166 | 0.83011 | 0.853166 |
| weather_related | 0.880122 | 0.882342 | 0.878525 | 0.882342 |
| request | 0.892311 | 0.896644 | 0.88488 | 0.896644 |
| medical_help | 0.907202 | 0.92315 | 0.893865 | 0.92315 |
| infrastructure_related | 0.878806 | 0.937262 | 0.907093 | 0.937262 |
| other_weather | 0.950652 | 0.947941 | 0.923167 | 0.947941 |
| shelter | 0.930275 | 0.936308 | 0.925221 | 0.936308 |
| medical_products | 0.943665 | 0.951182 | 0.931348 | 0.951182 |


Main conclusions on the evaluation of this first model:

- in general, the scores are between 0.75 and 1
- the lowest scores are obtained for the following categories:
    - aid_related
    - other_aid
    - direct_report
    - weather_related
    - request
- the highest scores are obtained for the following categories:
    - aid_centers
    - missing_people
    - hospitals
    - tools
    - offer
    - shops
    - child_alone

Let's try to understand why those highest scores are obtained.

In [135]:
highest = ['aid_centers', 'missing_people', 'hospitals', 'tools', 'offer', 'shops', 'child_alone']
for high_score in highest:
    print(high_score)
    print(df[high_score].value_counts())
    print('\n')

aid_centers
0    25907
1      309
Name: aid_centers, dtype: int64


missing_people
0    25918
1      298
Name: missing_people, dtype: int64


hospitals
0    25933
1      283
Name: hospitals, dtype: int64


tools
0    26057
1      159
Name: tools, dtype: int64


offer
0    26098
1      118
Name: offer, dtype: int64


shops
0    26096
1      120
Name: shops, dtype: int64


child_alone
0    26216
Name: child_alone, dtype: int64




In [136]:
lowest = ['aid_related', 'other_aid', 'direct_report', 'weather_related', 'request']

for low_score in lowest:
    print(low_score)
    print(df[low_score].value_counts())
    print('\n')

aid_related
0    15356
1    10860
Name: aid_related, dtype: int64


other_aid
0    22770
1     3446
Name: other_aid, dtype: int64


direct_report
0    21141
1     5075
Name: direct_report, dtype: int64


weather_related
0    18919
1     7297
Name: weather_related, dtype: int64


request
0    21742
1     4474
Name: request, dtype: int64




It appears that the categories with the highest scores are actually heavily imbalanced: each has at most a few hundred positive examples out of 26,216 messages, and child_alone has none at all. A model that always predicts 0 would already score very high on them, so we should be careful in interpreting those results and focus instead on improving the scores of the categories that score lowest. Let's see what kind of improvement we can achieve using grid search.
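
To illustrate why high scores on imbalanced categories are misleading, the sketch below rebuilds the `offer` labels from the class counts printed above (over the whole dataset, not the test set) and scores a hypothetical baseline that always predicts 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts for 'offer', taken from the value_counts() output above.
n_negative, n_positive = 26098, 118
offer_true = np.concatenate([
    np.zeros(n_negative, dtype=int),
    np.ones(n_positive, dtype=int),
])

# Hypothetical baseline that ignores the message and always predicts 0.
offer_baseline = np.zeros_like(offer_true)

baseline_accuracy = accuracy_score(offer_true, offer_baseline)
positive_recall = recall_score(offer_true, offer_baseline, pos_label=1)

print(round(baseline_accuracy, 4), positive_recall)
```

The baseline reaches an accuracy above 0.99 while finding none of the positive messages. One possible mitigation, not explored in this notebook, is the `class_weight` parameter of RandomForestClassifier (for example `'balanced'`).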

### 6. Improve your model
Use grid search to find better parameters. 

In [137]:
#Let's first take a look at the current parameters set for the pipeline
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x1c1cdc4598>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=None,
                                                          max_features='auto',
           

In [91]:
#let's set a list of parameters that have an influence on all estimators

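parameters = {
    'vect__ngram_range':[(1, 1), (1, 2)],
    #max_df must be a float to be read as a proportion of documents: the integer 1 would
    #mean "ignore every term that appears in more than 1 document"
    'vect__max_df':[0.5, 0.75, 1.0],
    #'sqrt' is what 'auto' meant for classifiers; recent sklearn versions no longer accept 'auto'
    'clf__estimator__max_features': ['sqrt', 'log2'],
    'clf__estimator__n_estimators':[100, 250]
}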
            
cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs=-1, cv=3) 
#we specify cv=3, i.e the cross-validation splitting strategy - 3 folds 

#let's repeat the fit and predict steps but now trying with all combinations of parameters set above
cv.fit(X_train, y_train)
y_pred_optim = cv.predict(X_test)


In [92]:
print(cv.best_params_)

{'clf__estimator__max_features': 'log2', 'clf__estimator__n_estimators': 250, 'vect__max_df': 0.5, 'vect__ngram_range': (1, 2)}


Note that we initially tried many more parameters, but because of limited processing power we restricted the grid to the most relevant combinations, to keep the search time reasonable.
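
To give an idea of the cost of the search, the sketch below counts the candidates in a grid of the same shape as the one above, using sklearn's `ParameterGrid`. The grid is re-declared here so that the example is self-contained, with `1.0` and `'sqrt'` as values (an assumption about how the grid is meant to be read).

```python
from sklearn.model_selection import ParameterGrid

# Same shape as the grid searched above: 2 x 3 x 2 x 2 values.
grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'vect__max_df': [0.5, 0.75, 1.0],
    'clf__estimator__max_features': ['sqrt', 'log2'],
    'clf__estimator__n_estimators': [100, 250],
}

n_folds = 3
n_candidates = len(ParameterGrid(grid))
# Each candidate is fitted once per fold; GridSearchCV then refits the best one.
n_fits = n_candidates * n_folds

print(n_candidates, n_fits)
```

Every extra value in one list multiplies the number of fits, which is why the grid had to be kept small. `RandomizedSearchCV` with a fixed `n_iter` would be one way to cap the cost, although it is not used in this notebook.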

It seems like the optimal combination of parameters differs from the default set on:
- vect__ngram_range = (1, 2)
- vect__max_df = 0.5
- clf__estimator__max_features = log2
- clf__estimator__n_estimators = 250
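
The `max_df` value selected above is a proportion of documents. The sketch below, on three invented messages, shows how CountVectorizer interprets a float versus an integer for this parameter: a float is a proportion, an integer is an absolute document count.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy messages, invented for illustration: "water" appears in all three.
toy_docs = ["water needed", "water food", "storm water"]

# Float 1.0: keep terms that appear in up to 100% of the documents.
vocab_float = set(CountVectorizer(max_df=1.0).fit(toy_docs).vocabulary_)

# Integer 1: keep only terms that appear in at most 1 document.
vocab_int = set(CountVectorizer(max_df=1).fit(toy_docs).vocabulary_)

# Float 0.5: drop terms that appear in more than half of the documents.
vocab_half = set(CountVectorizer(max_df=0.5).fit(toy_docs).vocabulary_)

print(sorted(vocab_float))
print(sorted(vocab_int))
print(sorted(vocab_half))
```

In this toy corpus, both `max_df=1` and `max_df=0.5` remove "water", while `max_df=1.0` keeps every word.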


### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [97]:
#we simply run the function defined above with the new y_pred
results_grid_search = compute_metrics(y_test, y_pred_optim)
df_results = append_results_to_df(df_results, results_grid_search, 'gridsearch')
df_results

|  | Precision_default_config | Recall_default_config | F1_score_default_config | Accuracy_default_config | Precision_gridsearch | Recall_gridsearch | F1_score_gridsearch | Accuracy_gridsearch |
|---|---|---|---|---|---|---|---|---|
| request | 0.87428 | 0.882723 | 0.868475 | 0.882723 | 0.886677 | 0.884439 | 0.862976 | 0.884439 |
| offer | 0.995042 | 1.0 | 0.997515 | 0.995042 | 0.995042 | 1.0 | 0.997515 | 0.995042 |
| aid_related | 0.756165 | 0.757437 | 0.752473 | 0.757437 | 0.763317 | 0.73074 | 0.703446 | 0.73074 |
| medical_help | 0.903899 | 0.92296 | 0.895423 | 0.92296 | 0.892118 | 0.919718 | 0.882543 | 0.919718 |
| medical_products | 0.944951 | 0.951945 | 0.933289 | 0.951945 | 0.934335 | 0.948703 | 0.924289 | 0.948703 |
| search_and_rescue | 0.971251 | 0.976926 | 0.967806 | 0.976926 | 0.97655 | 0.975973 | 0.964293 | 0.975973 |
| security | 0.966704 | 0.982265 | 0.974422 | 0.982265 | 0.972491 | 0.983028 | 0.975171 | 0.983028 |
| military | 0.964578 | 0.971968 | 0.961617 | 0.971968 | 0.941753 | 0.970252 | 0.95579 | 0.970252 |
| child_alone | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| water | 0.948428 | 0.95328 | 0.944304 | 0.95328 | 0.933907 | 0.93955 | 0.915456 | 0.93955 |


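We can observe that, with the best set of parameters found earlier by GridSearchCV, the scores in the rows displayed above range between roughly 0.70 and 1. This is not a clear improvement, however: for most of these categories the F1 score is slightly lower than with the default configuration (for example aid_related goes from 0.752 to 0.703), and only security improves marginally. Can we do better?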

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

There are a few options we can try:
- use another machine learning model, such as SVM, or a Naive Bayes model instead of Random Forest
- perform other transformations on the data, using feature union, but the data seems already transformed in a relevant and usable way
- use word embedding instead of TF-IDF (for example GloVe), to evaluate a word in its context and not alone
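
As an illustration of the feature union option listed above, here is a minimal sketch that combines TF-IDF features with one extra hand-made feature. The `MessageLength` transformer and the toy messages are hypothetical and only serve the example; this is not a step that was run in this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: number of characters in each message."""

    def fit(self, X, y=None):
        # Nothing to learn for this feature.
        return self

    def transform(self, X):
        # One row per message, one column holding the message length.
        return np.array([[len(text)] for text in X], dtype=float)


# FeatureUnion runs both transformers on the same input and
# concatenates their outputs column-wise.
features = FeatureUnion([
    ('tfidf', TfidfVectorizer()),
    ('length', MessageLength()),
])

toy_texts = ["we need water", "storm destroyed the bridge"]
feature_matrix = features.fit_transform(toy_texts)

print(feature_matrix.shape)
```

The resulting matrix has one column per vocabulary word plus one column for the length, and such a union could replace the 'vect' and 'tfidf' steps at the start of a pipeline.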

We are going to try the first option with two new models, using the default hyperparameters that sklearn implements.
We are going to compare these results with the initial default_config.

In [140]:
# we define a new pipeline with the new estimators
from sklearn.naive_bayes import MultinomialNB

pipeline_alt_MNB = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(MultinomialNB(), n_jobs=1))
])



In [141]:
#we fit the pipeline_alt using the training sets
pipeline_alt_MNB.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at 0x1c1cdc4598>,
                                 vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('clf',
                 MultiOutputClassifier(estimator=MultinomialNB(alpha=1.0,
               

In [142]:
#we predict the categories using the testing set
y_pred_alt_MNB = pipeline_alt_MNB.predict(X_test)

In [145]:
#let's see how the metrics changed with MNB
results_alt_MNB = compute_metrics(y_test, y_pred_alt_MNB)
df_results = append_results_to_df(df_results, results_alt_MNB, 'MNB')
df_results

|  | Precision_default_config | Recall_default_config | F1_score_default_config | Accuracy_default_config | Precision_MNB | Recall_MNB | F1_score_MNB | Accuracy_MNB |
|---|---|---|---|---|---|---|---|---|
| request | 0.892311 | 0.896644 | 0.88488 | 0.896644 | 0.860375 | 0.861747 | 0.826873 | 0.861747 |
| offer | 0.995042 | 1.0 | 0.997515 | 0.995042 | 0.995042 | 1.0 | 0.997515 | 0.995042 |
| aid_related | 0.776558 | 0.778223 | 0.776202 | 0.778223 | 0.75596 | 0.758009 | 0.754605 | 0.758009 |
| medical_help | 0.907202 | 0.92315 | 0.893865 | 0.92315 | 0.919527 | 1.0 | 0.958077 | 0.919527 |
| medical_products | 0.943665 | 0.951182 | 0.931348 | 0.951182 | 0.899667 | 0.948322 | 0.923354 | 0.948322 |
| search_and_rescue | 0.971396 | 0.976545 | 0.966342 | 0.976545 | 0.975782 | 1.0 | 0.987742 | 0.975782 |
| security | 0.969685 | 0.982456 | 0.974873 | 0.982456 | 0.983219 | 1.0 | 0.991538 | 0.983219 |
| military | 0.962833 | 0.971396 | 0.960112 | 0.971396 | 0.970442 | 1.0 | 0.985 | 0.970442 |
| child_alone | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| water | 0.953308 | 0.956903 | 0.949503 | 0.956903 | 0.874877 | 0.935164 | 0.904016 | 0.935164 |


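Conclusion: with MNB, the scores are lower for request, aid_related, medical_products and water. The higher F1 scores (e.g. medical_help, security, military) come with a recall of 1.0 and a precision equal to the accuracy, which indicates that only one class was predicted for those categories. MNB is therefore not an improvement, so let's explore another option.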

**Conclusions of the model tuning work**


### 9. Export your model as a pickle file

In [None]:
#reference on saving and loading sklearn models:
#https://machinelearningmastery.com/save-load-machine-learning-models-python-scikit-learn/
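
A minimal sketch of the export step, assuming the standard library `pickle` module is used. The small pipeline, the toy data and the file name `classifier.pkl` are stand-ins invented for illustration; in this notebook the object to export would be the fitted pipeline or the grid search object.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Small stand-in for the fitted pipeline; toy data invented for illustration.
toy_model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
toy_train = ["we need water", "storm destroyed the bridge",
             "send water and food", "heavy storm tonight"]
toy_labels = [1, 0, 1, 0]
toy_model.fit(toy_train, toy_labels)

# Hypothetical file name; pickle is a binary format, hence 'wb' and 'rb'.
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")
with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

# Loading the file gives back a fitted model that predicts like the original.
with open(model_path, "rb") as f:
    restored_model = pickle.load(f)

print(list(restored_model.predict(toy_train)) == list(toy_model.predict(toy_train)))
```

Two points to keep in mind: a pipeline built with `tokenizer=tokenize` can only be unpickled where the `tokenize` function is importable, and pickle files should only be loaded from trusted sources.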

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.