# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [134]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine

import re
import nltk
# 'wordnet' is required by WordNetLemmatizer
nltk.download(['punkt', 'stopwords', 'wordnet'])
from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier

import joblib

[nltk_data] Downloading package punkt to C:\Users\eugen/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\eugen/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
# load data from database
engine = create_engine('sqlite:///../data/drp.db')
df = pd.read_sql_table('FigureEight_data', con = engine)
df.head()

| | id | message | original | genre | related | request | offer | aid_related | medical_help | medical_products | ... | aid_centers | other_infrastructure | weather_related | floods | storm | fire | earthquake | cold | other_weather | direct_report |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | Weather update - a cold front from Cuba that c... | Un front froid se retrouve sur Cuba ce matin. ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 7 | Is the Hurricane over or is it not over | Cyclone nan fini osinon li pa fini | direct | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | 8 | Looking for someone but no name | Patnm, di Maryani relem pou li banm nouvel li ... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 9 | UN reports Leogane 80-90 destroyed. Only Hospi... | UN reports Leogane 80-90 destroyed. Only Hospi... | direct | 1 | 1 | 0 | 1 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 12 | says: west side of Haiti, rest of the country ... | facade ouest d Haiti et le reste du pays aujou... | direct | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


In [13]:
X = df['message']
Y = df.iloc[:, 4:]
category_names = list(df.columns[4:])
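
The loading and splitting steps above can be tried without the real database. The following is a minimal sketch, assuming an in-memory SQLite database and a hypothetical table name `toy_messages` with only two category columns; none of these names come from the project data.

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory SQLite database standing in for the real database file (assumption)
engine = create_engine('sqlite:///:memory:')

# Hypothetical miniature of the messages table: 4 descriptive columns, then categories
toy = pd.DataFrame({
    'id': [2, 7],
    'message': ['Is the Hurricane over', 'We need water'],
    'original': ['Cyclone nan fini', 'Nou bezwen dlo'],
    'genre': ['direct', 'direct'],
    'related': [1, 1],
    'water': [0, 1],
})
toy.to_sql('toy_messages', engine, index=False)

# Read the table back, then split it the same way as X and Y above
df_toy = pd.read_sql_table('toy_messages', con=engine)
X_toy = df_toy['message']
Y_toy = df_toy.iloc[:, 4:]

print(list(Y_toy.columns))
```

Everything from the fifth column onward ends up in `Y_toy`, which is why the slice starts at position 4.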

In [57]:
i = 200
print('X:')
print(X.iloc[i])
print('Target variables Y:')
print(Y.iloc[i])
print('Category names:')
print(category_names)

-------- X -----------------------------------------
We ARE in Port-Margot, in the Northern part of the Country. It is 2hrs driving from Cap-Haitian, We do not have anything to survive. We are dying, please help us.
-------- Target variables Y ------------------------
related                   1
request                   1
offer                     0
aid_related               1
medical_help              1
medical_products          0
search_and_rescue         0
security                  0
military                  0
child_alone               0
water                     0
food                      1
shelter                   0
clothing                  0
money                     0
missing_people            0
refugees                  0
death                     1
other_aid                 1
infrastructure_related    0
transport                 0
buildings                 0
electricity               0
tools                     0
hospitals                 0
shops                     0
aid

### 2. Write a tokenization function to process your text data

In [60]:
def tokenize(text):
    """Normalize, tokenize, remove stopwords from and lemmatize a message."""
    # normalize case and remove punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

    # tokenize text and initiate lemmatizer
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    # build the stopword set once instead of reloading the list for every token
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if w not in stop_words]

    # lemmatize each token and strip leading/trailing white space
    clean_tokens = [lemmatizer.lemmatize(tok).strip() for tok in tokens]
    return clean_tokens

In [61]:
i = 200
print(X.iloc[i])
print(tokenize(X.iloc[i]))

We ARE in Port-Margot, in the Northern part of the Country. It is 2hrs driving from Cap-Haitian, We do not have anything to survive. We are dying, please help us.
['port', 'margot', 'northern', 'part', 'country', '2hrs', 'driving', 'cap', 'haitian', 'anything', 'survive', 'dying', 'please', 'help', 'u']
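
To see what the normalization step alone contributes, here is a small sketch that uses only the standard library. `str.split()` is a simplified stand-in for `word_tokenize`, and `toy_stopwords` is a hypothetical, tiny stopword set chosen for this illustration; lemmatization is left out.

```python
import re

text = "We ARE in Port-Margot, in the Northern part of the Country."

# Same normalization as in tokenize(): lowercase, then replace every
# character that is not a letter or digit with a space
normalized = re.sub(r"[^a-zA-Z0-9]", " ", text.lower())

# Simplified stand-in for nltk's word_tokenize (assumption)
tokens = normalized.split()

# Hypothetical stopword set, much smaller than nltk's English list
toy_stopwords = {"we", "are", "in", "the", "of"}
kept = [t for t in tokens if t not in toy_stopwords]

print(kept)  # ['port', 'margot', 'northern', 'part', 'country']
```

Note how the hyphen in `Port-Margot` becomes a space, so the place name is split into two tokens, as in the real output above.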


### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [69]:
pipeline_01 = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    #('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
                    ])
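
The shape of what such a pipeline returns is easiest to see on toy data. This is a sketch under stated assumptions: the four messages and the two label columns (`water`, `food`) are made up, and the default `CountVectorizer` tokenizer replaces the custom `tokenize` function.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up messages with two hypothetical target columns: water, food
messages = ['we need water', 'please send food',
            'water and food needed', 'the storm is over']
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    # one decision tree is fitted per target column
    ('clf', MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])
toy_pipeline.fit(messages, labels)

pred = toy_pipeline.predict(['we need water'])
print(pred.shape)  # (1, 2): one row per message, one column per category
```

With the real data the prediction array would have 36 columns, one per category.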

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [66]:
# split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, 
                                                    test_size = 0.2, 
                                                    random_state = 42)

print('Shape of training set: X_train {} | y_train {}'.format(X_train.shape, y_train.shape))
print('Shape of testing set: X_test {} | y_test {}'.format(X_test.shape, y_test.shape))

Shape of training set: X_train (20972,) | y_train (20972, 36)
Shape of testing set: X_test (5244,) | y_test (5244, 36)
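
The two shapes can be checked by hand. The sketch below assumes that the test set size is rounded up and the training set receives the remaining rows, which is consistent with the numbers printed above.

```python
import math

n_samples = 20972 + 5244   # total number of messages implied by the shapes above
test_size = 0.2

# Assumption: test size is rounded up, training set gets the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 20972 5244
```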


In [70]:
# train classifier
pipeline_01.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

In [132]:
# predict on test data
y_pred_01 = pipeline_01.predict(X_test)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [119]:
# function to report f1 score, precision and recall for each output category

def evaluate_model(y_test, y_pred):
    """Print the mean f1 score, precision and recall, and return them per category."""
    rows = []
    for i, category in enumerate(y_test.columns):
        # scores are averaged over the classes of this category, weighted by support
        precision, recall, f1_score, _ = precision_recall_fscore_support(
            y_test[category], y_pred[:, i], average='weighted')
        rows.append({'Category': category, 'f1_score': f1_score,
                     'precision': precision, 'recall': recall})

    # index starts at 1 so that rows are numbered from the first category
    results = pd.DataFrame(rows, columns=['Category', 'f1_score', 'precision', 'recall'],
                           index=range(1, len(rows) + 1))
    print('Overall f1_score: ', '{:.4}'.format(results['f1_score'].mean()))
    print('Overall precision: ', '{:.4}'.format(results['precision'].mean()))
    print('Overall recall: ', '{:.4}'.format(results['recall'].mean()))
    return results
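
The instructions for this step suggest calling `classification_report` on each column. A minimal sketch of that variant on made-up data follows; `y_true_toy`, `y_pred_toy` and the two category names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Made-up true labels (DataFrame, like y_test) and predictions (array, like y_pred)
y_true_toy = pd.DataFrame({'water': [1, 0, 1, 0], 'food': [0, 1, 1, 0]})
y_pred_toy = np.array([[1, 0], [0, 1], [0, 1], [0, 0]])

reports = {}
for i, category in enumerate(y_true_toy.columns):
    # column i of the prediction array belongs to the i-th category
    reports[category] = classification_report(
        y_true_toy[category], y_pred_toy[:, i], zero_division=0)
    print(category)
    print(reports[category])
```

Each report is a plain text table, which is convenient for reading but harder to aggregate than the DataFrame returned by `evaluate_model`.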

In [133]:
results_p01 = evaluate_model(y_test, y_pred_01)

Overall f1_score:  0.9351
Overall precision:  0.9332
Overall recall:  0.9376


In [89]:
results_p01

| | Category | f1_score | precision | recall |
|---|---|---|---|---|
| 1 | related | 0.762833 | 0.763941 | 0.763539 |
| 2 | request | 0.860944 | 0.858161 | 0.86537 |
| 3 | offer | 0.991806 | 0.990101 | 0.993516 |
| 4 | aid_related | 0.721287 | 0.720809 | 0.721968 |
| 5 | medical_help | 0.905505 | 0.901116 | 0.911327 |
| 6 | medical_products | 0.946573 | 0.944015 | 0.95061 |
| 7 | search_and_rescue | 0.965249 | 0.963643 | 0.96701 |
| 8 | security | 0.969892 | 0.967856 | 0.971968 |
| 9 | military | 0.964531 | 0.964531 | 0.964531 |
| 10 | child_alone | 1.0 | 1.0 | 1.0 |
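
When reading these scores, keep in mind that `evaluate_model` uses `average='weighted'`, so the majority class dominates each category's score. The sketch below uses made-up labels (nine negatives, one positive) and a hypothetical classifier that always predicts 0, and compares the weighted with the macro average.

```python
from sklearn.metrics import precision_recall_fscore_support

# Made-up imbalanced category: 9 negatives, 1 positive
y_true = [0] * 9 + [1]
# Hypothetical classifier that never predicts the rare class
y_pred = [0] * 10

_, _, f1_weighted, _ = precision_recall_fscore_support(
    y_true, y_pred, average='weighted', zero_division=0)
_, _, f1_macro, _ = precision_recall_fscore_support(
    y_true, y_pred, average='macro', zero_division=0)

print(round(f1_weighted, 4), round(f1_macro, 4))  # 0.8526 0.4737
```

A rare category can therefore reach a high weighted score even if its positive class is never predicted; the macro average makes this visible.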


### 6. Improve your model
Use grid search to find better parameters. 

In [77]:
pipeline_01.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                   lowercase=True, max_df=1.0, max_features=None, min_df=1,
                   ngram_range=(1, 1), preprocessor=None, stop_words=None,
                   strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=<function tokenize at 0x000002BF98E678B8>,
                   vocabulary=None)),
  ('clf', MultiOutputClassifier(estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                          class_weight=None,
                                                          criterion='gini',
                                                          max_depth=None,
                                                          max_features=None,
                                                          max_leaf_nodes=None,
         

In [78]:
parameters = {
            'clf__estimator__criterion': ['gini', 'entropy']   
            }

cv_01 = GridSearchCV(estimator = pipeline_01, param_grid = parameters)
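
After fitting, a grid search object exposes which parameter combination won and how each one scored. The following sketch shows this on made-up data with the same parameter name as above; the messages, labels and `cv=2` are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up messages with two hypothetical target columns: water, food
messages = ['we need water', 'please send food', 'water and food needed',
            'the storm is over', 'no water here', 'food is running out',
            'storm destroyed the road', 'send water and food']
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0],
                   [1, 0], [0, 1], [0, 0], [1, 1]])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])

# step name, then 'estimator' (the wrapped tree), then the tree's own parameter
param_grid = {'clf__estimator__criterion': ['gini', 'entropy']}

toy_cv = GridSearchCV(toy_pipeline, param_grid=param_grid, cv=2)
toy_cv.fit(messages, labels)

print(toy_cv.best_params_)
print(toy_cv.cv_results_['mean_test_score'])
```

Comparing the entries of `mean_test_score` shows how much the searched parameter actually matters before the tuned model is evaluated on the test set.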

In [79]:
# train classifier
cv_01.fit(X_train, y_train)

GridSearchCV(cv=None, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        p

In [80]:
y_pred_cv_01 = cv_01.predict(X_test)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [120]:
results_cv01 = evaluate_model(y_test, y_pred_cv_01)

Overall f1_score:  0.9349
Overall precision:  0.933
Overall recall:  0.9374


Tuning with GridSearchCV did not improve the results. The overall f1_score is even slightly lower (0.9349 versus 0.9351 for the untuned pipeline). Note that the grid only varied the split criterion (`gini` vs. `entropy`).

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

#### pipeline_02: DecisionTreeClassifier + TfidfTransformer

In [92]:
# build pipeline
pipeline_02 = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(DecisionTreeClassifier()))
                    ])
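
What the extra `tfidf` step does to the word counts can be inspected on a few made-up documents. This is a sketch with default settings for both transformers; the three documents are assumptions for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up documents
docs = ['water water food', 'food needed', 'water needed now']

# Step 1: raw integer word counts per document
counts = CountVectorizer().fit_transform(docs)

# Step 2: counts are reweighted by inverse document frequency and,
# with the default norm='l2', each row is scaled to unit length
tfidf = TfidfTransformer().fit_transform(counts)

row_norms = np.linalg.norm(tfidf.toarray(), axis=1)

print(counts.toarray())
print(np.round(tfidf.toarray(), 2))
print(row_norms)
```

The classifier in `pipeline_02` therefore sees normalized float weights instead of raw counts.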

In [93]:
# train classifier
pipeline_02.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                 MultiOutputClassifier(estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

In [94]:
# predict on test data
y_pred_02 = pipeline_02.predict(X_test)

In [121]:
results_p02 = evaluate_model(y_test, y_pred_02)

Overall f1_score:  0.9317
Overall precision:  0.9308
Overall recall:  0.9328


#### pipeline_03: RandomForestClassifier + TfidfTransformer

In [98]:
# build pipeline
pipeline_03 = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])

In [99]:
# train classifier
pipeline_03.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                                        ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

In [100]:
# predict on test data
y_pred_03 = pipeline_03.predict(X_test)

In [122]:
results_p03 = evaluate_model(y_test, y_pred_03)

Overall f1_score:  0.9366
Overall precision:  0.9387
Overall recall:  0.9483


#### pipeline_04: RandomForestClassifier without TfidfTransformer

In [102]:
# build pipeline
pipeline_04 = Pipeline([
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    #('tfidf', TfidfTransformer()),
                    ('clf', MultiOutputClassifier(RandomForestClassifier()))
                    ])

In [103]:
# train classifier
pipeline_04.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize at...
                                                                        ccp_alpha=0.0,
                                                                        class_weight=None,
                                                                        criterion='gini',
                                                                   

In [107]:
# predict on test data
y_pred_04 = pipeline_04.predict(X_test)

In [123]:
results_p04 = evaluate_model(y_test, y_pred_04)

Overall f1_score:  0.9379
Overall precision:  0.9402
Overall recall:  0.9486


### 9. Export your model as a pickle file

In [135]:
joblib.dump(pipeline_01, "classifier_01.pkl")

['classifier_01.pkl']

In [140]:
joblib.dump(cv_01, "classifier_01_cv.pkl")

['classifier_01_cv.pkl']

In [137]:
joblib.dump(pipeline_02, "classifier_02.pkl")

['classifier_02.pkl']

In [138]:
joblib.dump(pipeline_03, "classifier_03.pkl")

['classifier_03.pkl']

In [139]:
joblib.dump(pipeline_04, "classifier_04.pkl")

['classifier_04.pkl']
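
A saved model can be loaded again with `joblib.load`. The sketch below round-trips a small made-up pipeline through a temporary file; the file name `toy_classifier.pkl` and the toy data are assumptions. For the pipelines above, the custom `tokenize` function has to be importable wherever the file is loaded, because pickling stores functions by reference.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Made-up single-label data, only to have a fitted model to save
messages = ['we need water', 'the storm is over', 'water please', 'storm damage']
labels = [1, 0, 1, 0]

toy_model = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', DecisionTreeClassifier(random_state=0)),
]).fit(messages, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_classifier.pkl')  # hypothetical file name
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored pipeline should predict exactly like the original one
same = bool((restored.predict(messages) == toy_model.predict(messages)).all())
print(same)  # True
```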

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
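
The template file is not shown here, so the following is only a hypothetical skeleton of how such a script could read the user's paths. The function names `parse_args` and `main` and the argument names `database_filepath` and `model_filepath` are assumptions, not the template's actual interface.

```python
import argparse


def parse_args(argv=None):
    """Read the two file paths supplied by the user (hypothetical argument names)."""
    parser = argparse.ArgumentParser(description='Train and export a message classifier.')
    parser.add_argument('database_filepath', help='SQLite database to read the messages from')
    parser.add_argument('model_filepath', help='file the trained model is written to')
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # Placeholders for the notebook steps: load data, build the pipeline,
    # train, evaluate, and export the model with joblib.dump
    print('Would load data from', args.database_filepath)
    print('Would save the model to', args.model_filepath)
    return args


# Example call with explicit arguments instead of the command line
example_args = main(['../data/drp.db', 'classifier.pkl'])
```

In a real script the final call would typically be replaced by `main()` under an `if __name__ == '__main__':` guard so that the paths come from the command line.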