# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np

from sqlalchemy import create_engine
# download necessary NLTK data

import nltk
nltk.download(['punkt', 'wordnet'])
# import statements
import re
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer

from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV


from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression

import pickle


[nltk_data] Downloading package punkt to /home/student/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /home/student/nltk_data...


In [2]:
# load data from the database created by the ETL pipeline
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('Messages_Categories', engine)
df.head(-100)

# select categories for ML
column_names = list(df.columns)
category_cols = column_names[4:]

# define the feature (input) "X" and the target (output) "y"
X = df.message
y = df[category_cols]
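
Before calling `read_sql_table`, it can help to confirm that the table name and the column order match what the slice `column_names[4:]` assumes. The sketch below uses only the standard-library `sqlite3` module against an in-memory database. The small schema is an illustrative stand-in for `DisasterResponse.db`, not the real table.

```python
import sqlite3

# In-memory stand-in for DisasterResponse.db (illustrative schema, assumed here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Messages_Categories ("
    "id INTEGER, message TEXT, original TEXT, genre TEXT, "
    "related INTEGER, request INTEGER, offer INTEGER)"
)

# List the tables stored in the database.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in conn.execute(
    "PRAGMA table_info(Messages_Categories)")]

# The first four columns are identifiers and text; the rest are category labels.
category_cols = columns[4:]

print(tables)
print(category_cols)
conn.close()
```

If the printed table name or the first four columns differ from what the notebook expects, the positional slice would silently pick the wrong targets, so selecting the category columns by name may be the safer choice.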

##### Check data

In [3]:
len(category_cols)

36

In [4]:
df.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,1,0,0,...,0,0,1,0,1,0,0,0,0,0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1,1,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [5]:
df.shape

(26028, 40)

In [6]:
X.head()

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
Name: message, dtype: object

In [7]:
y.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,child_alone,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0,0,1,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,1,0,1,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### 2. Write a tokenization function to process your text data

In [8]:
# adapted from Udacity course work
def tokenize(text):
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tokens.append(clean_tok)

    return clean_tokens

In [9]:
# test out the tokenizing function
for message in X[:5]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', '-', 'a', 'cold', 'front', 'from', 'cuba', 'that', 'could', 'pas', 'over', 'haiti'] 

Is the Hurricane over or is it not over
['is', 'the', 'hurricane', 'over', 'or', 'is', 'it', 'not', 'over'] 

Looking for someone but no name
['looking', 'for', 'someone', 'but', 'no', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', '80-90', 'destroyed', '.', 'only', 'hospital', 'st.', 'croix', 'functioning', '.', 'needs', 'supply', 'desperately', '.'] 

says: west side of Haiti, rest of the country today and tonight
['say', ':', 'west', 'side', 'of', 'haiti', ',', 'rest', 'of', 'the', 'country', 'today', 'and', 'tonight'] 
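
The output above keeps punctuation such as `'-'`, `'.'`, and `':'` as separate tokens. One common option, sketched here as an assumption rather than as part of the original notebook, is to normalize the text with a regular expression before tokenizing. To stay self-contained, the sketch splits on whitespace instead of calling `word_tokenize`, and `normalize_and_split` is a hypothetical helper name.

```python
import re

def normalize_and_split(text):
    """Lowercase, replace runs of non-alphanumeric characters with a space, then split."""
    cleaned = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return cleaned.split()

tokens = normalize_and_split(
    "says: west side of Haiti, rest of the country today and tonight")
print(tokens)

# Trade-off: hyphenated values are split into separate tokens.
print(normalize_and_split("80-90 destroyed."))
```

Note the trade-off in the second call: a range such as `80-90` becomes two tokens, which may or may not be desirable for this dataset.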



In [10]:
X.head(10)

0    Weather update - a cold front from Cuba that c...
1              Is the Hurricane over or is it not over
2                      Looking for someone but no name
3    UN reports Leogane 80-90 destroyed. Only Hospi...
4    says: west side of Haiti, rest of the country ...
5               Information about the National Palace-
6                       Storm at sacred heart of jesus
7    Please, we need tents and water. We are in Sil...
8      I would like to receive the messages, thank you
9    I am in Croix-des-Bouquets. We have health iss...
Name: message, dtype: object

### 3. Build a machine learning pipeline
This machine learning pipeline should take the `message` column as input and output classification results for the 36 category columns in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
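
The idea behind a multi-output wrapper is to fit one independent classifier per target column. The toy sketch below illustrates that pattern in plain Python. `MajorityClassifier` and `PerColumnClassifier` are hypothetical names, the labels are made up, and this is not scikit-learn's implementation.

```python
from collections import Counter

class MajorityClassifier:
    """Toy single-output estimator: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

class PerColumnClassifier:
    """Toy multi-output wrapper: fits one independent estimator per target column."""
    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        n_outputs = len(Y[0])
        self.estimators_ = [
            self.make_estimator().fit(X, [row[j] for row in Y])
            for j in range(n_outputs)
        ]
        return self

    def predict(self, X):
        per_column = [est.predict(X) for est in self.estimators_]
        # transpose so each row holds the predictions for one sample
        return [list(row) for row in zip(*per_column)]

X = ["need water", "storm coming", "need food", "thank you"]
Y = [[1, 0], [1, 1], [1, 0], [0, 0]]  # columns: related, weather_related (made-up labels)

model = PerColumnClassifier(MajorityClassifier).fit(X, Y)
predictions = model.predict(["need tents"])
print(len(model.estimators_), predictions)
```

Because each column gets its own estimator, a rare category is learned in isolation, which is one reason rare labels can end up with very low recall.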

In [11]:
def pipeline():
    # build an untrained pipeline: token counts -> TF-IDF -> one random forest per category
    model = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(RandomForestClassifier()))
    ])
    return model


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

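# build the model with the builder function, then train the classifier
# (this rebinds the name `pipeline` from the function to the model object)
pipeline = pipeline()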
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

In [13]:
# check test data set 
y_test.shape

(6507, 36)

In [14]:
# check train data set 
y_train.shape

(19521, 36)
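
The shapes above follow from `train_test_split` being called without `test_size`, in which case a quarter of the rows is held out. The sketch below reproduces that arithmetic and shows why fixing a seed makes a shuffled split repeatable. `split_sizes` and `shuffled_split` are hypothetical helpers written for illustration, not scikit-learn functions.

```python
import math
import random

def split_sizes(n_samples, test_size=0.25):
    """Return (n_train, n_test) with the test share rounded up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

def shuffled_split(items, test_size=0.25, seed=42):
    """Shuffle indices with a fixed seed, then cut them into train and test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_train, _ = split_sizes(len(items), test_size)
    train = [items[i] for i in indices[:n_train]]
    test = [items[i] for i in indices[n_train:]]
    return train, test

n_train, n_test = split_sizes(26028)
print(n_train, n_test)  # 19521 6507

train_a, test_a = shuffled_split(list(range(20)))
train_b, test_b = shuffled_split(list(range(20)))
print(train_a == train_b)  # True: same seed, same split
```

In the notebook itself, passing `random_state` to `train_test_split` would play the role of `seed` here and make the reported metrics reproducible across runs.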

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
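
As a reminder of what those three numbers mean for a single category, the sketch below computes them by hand for one binary label column at a time. The labels are made up, and `label_scores` is a hypothetical helper, not a scikit-learn function. It reports 0.0 whenever a denominator is zero, which is how a category with no positive rows or no positive predictions ends up with a row of zeros.

```python
def label_scores(y_true, y_pred):
    """Precision, recall, F1, and support for one binary label column."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    # report 0.0 when a denominator is zero instead of raising
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, tp + fn

# made-up labels for two categories, one list per category column
y_true = {"request": [1, 1, 0, 1, 0], "child_alone": [0, 0, 0, 0, 0]}
y_pred = {"request": [1, 0, 0, 1, 1], "child_alone": [0, 0, 0, 0, 0]}

results = {name: label_scores(y_true[name], y_pred[name]) for name in y_true}
for name, (p, r, f1, support) in results.items():
    print(f"{name:>12}  {p:.2f}  {r:.2f}  {f1:.2f}  {support}")
```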

In [15]:
y_pred = pipeline.predict(X_test)

In [16]:
# report precision, recall, and f1 score for each category
clas_rep = classification_report(y_test, y_pred, target_names=y.columns.values)
print(clas_rep)

                        precision    recall  f1-score   support

               related       0.83      0.93      0.87      4968
               request       0.84      0.40      0.54      1113
                 offer       0.00      0.00      0.00        28
           aid_related       0.76      0.51      0.61      2710
          medical_help       0.64      0.06      0.11       526
      medical_products       0.69      0.06      0.11       334
     search_and_rescue       0.59      0.09      0.16       169
              security       0.00      0.00      0.00       101
              military       0.62      0.06      0.12       203
           child_alone       0.00      0.00      0.00         0
                 water       0.76      0.20      0.32       433
                  food       0.84      0.34      0.49       765
               shelter       0.82      0.22      0.34       584
              clothing       0.73      0.08      0.14       100
                 money       0.50      



In [17]:
# Save classification report for visualization
# Adapted from source: https://stackoverflow.com/questions/39662398/scikit-learn-output-metrics-classification-report-into-csv-tab-delimited-format
# A better option is classification_report(y_test, y_pred, output_dict=True);
# however, the "output_dict" parameter is not available in the sklearn version in the Udacity workspace

def save_clas_rep(clas_rep):
    """Parse the text report into one row per category and save it to clas_rep.csv."""
    report_data = []
    lines = clas_rep.split('\n')
    # skip the two header lines and the trailing summary block
    for line in lines[2:-3]:
        row_data = line.split()
        if not row_data:
            continue  # ignore blank lines
        row = {}
        row['class'] = row_data[0]
        row['precision'] = float(row_data[1])
        row['recall'] = float(row_data[2])
        row['f1_score'] = float(row_data[3])
        row['support'] = float(row_data[4])
        report_data.append(row)
    df_clas_rep = pd.DataFrame(report_data)
    df_clas_rep.to_csv('clas_rep.csv', index=False)

# Save classification report
save_clas_rep(clas_rep)
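
Parsing the text report by splitting on whitespace works as long as every label is a single word. A slightly more defensive variant, sketched below with a hypothetical helper `parse_report_line`, splits from the right so that a label containing spaces stays in one piece. The first input line is copied from the report above; the numbers in the second line are placeholders, not real results.

```python
def parse_report_line(line):
    """Split one report row into (label, precision, recall, f1, support)."""
    # split from the right so a label containing spaces stays in one piece
    label, precision, recall, f1, support = line.rsplit(None, 4)
    return label.strip(), float(precision), float(recall), float(f1), int(support)

row = parse_report_line(
    "               related       0.83      0.93      0.87      4968")

# label with spaces; the numbers here are placeholders, not real results
summary = parse_report_line("avg / total       0.50      0.50      0.50       100")

print(row)
print(summary)
```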

### 6. Improve your model
Use grid search to find better parameters. 

In [18]:
del pipeline

In [19]:
# rebuild the pipeline so its parameters can be inspected and tuned
pipeline = Pipeline([
('vect', CountVectorizer(tokenizer=tokenize)),
('tfidf', TfidfTransformer()),
('clf', MultiOutputClassifier(RandomForestClassifier()))
])


In [20]:
# check available parameters
pipeline.get_params()

{'memory': None,
 'steps': [('vect',
   CountVectorizer(analyzer='word', binary=False, decode_error='strict',
           dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
           lowercase=True, max_df=1.0, max_features=None, min_df=1,
           ngram_range=(1, 1), preprocessor=None, stop_words=None,
           strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
           tokenizer=<function tokenize at 0x7fb0cd3449d8>, vocabulary=None)),
  ('tfidf',
   TfidfTransformer(norm='l2', smooth_idf=True, sublinear_tf=False, use_idf=True)),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
               max_depth=None, max_features='auto', max_leaf_nodes=None,
               min_impurity_decrease=0.0, min_impurity_split=None,
               min_samples_leaf=1, min_samples_split=2,
               min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
               oob_score=False, random_state=None,

In [21]:
# specify parameters for grid search - adapted from "Case Study: Grid Search Pipeline"
# https://learn.udacity.com/nanodegrees/nd025/parts/cd0018/lessons/ls12134/concepts/4f63106c-789f-4989-a762-0054ab192db8 

parameters = {
    'clf__estimator__n_estimators': [10, 20, 100],
    'clf__estimator__max_depth': [2, 4, None],
}


# create grid search object
cv = GridSearchCV(pipeline, param_grid=parameters)
cv.fit(X_train, y_train)   
y_pred = cv.predict(X_test)
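
To get a feel for the cost of this search, the sketch below enumerates the candidate settings implied by the `parameters` dictionary using only `itertools`. The fold count `n_folds` is an assumption for illustration, since the default number of cross-validation folds depends on the scikit-learn version.

```python
from itertools import product

parameters = {
    'clf__estimator__n_estimators': [10, 20, 100],
    'clf__estimator__max_depth': [2, 4, None],
}

# Cartesian product of all value lists, one dict per candidate setting
names = list(parameters)
candidates = [dict(zip(names, values))
              for values in product(*(parameters[n] for n in names))]

n_folds = 3  # assumed fold count; the real default depends on the scikit-learn version
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
print(candidates[0])
```

The double underscores in each key walk down the nesting: `clf` is the pipeline step, `estimator` is the classifier wrapped by `MultiOutputClassifier`, and the last part is the random forest parameter. Since each fit trains one forest per category, adding values to the grid multiplies the training time quickly.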


In [22]:
def display_results(cv, y_test, y_pred):
    # element-wise comparison averaged per column gives one accuracy value per category
    accuracy = (y_pred == y_test).mean()
    print("Accuracy:\n", accuracy)
    print("Best Parameters:", cv.best_params_)

In [23]:
display_results(cv, y_test, y_pred)

Accuracy:
 related                   0.803442
request                   0.895805
offer                     0.995697
aid_related               0.767327
medical_help              0.921316
medical_products          0.950515
search_and_rescue         0.974950
security                  0.984325
military                  0.969571
child_alone               1.000000
water                     0.947441
food                      0.926694
shelter                   0.925311
clothing                  0.985247
money                     0.975104
missing_people            0.988167
refugees                  0.964500
death                     0.959121
other_aid                 0.867374
infrastructure_related    0.929153
transport                 0.958506
buildings                 0.950515
electricity               0.981866
tools                     0.995236
hospitals                 0.990011
shops                     0.994621
aid_centers               0.985093
other_infrastructure      0.952205
weather_r
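
High per-category accuracy can be misleading when a label is rare. The short calculation below uses two numbers already shown in this notebook, the 6507 test rows and the 28 positive `offer` rows in the first classification report, to compute the accuracy of a model that never predicts `offer`.

```python
n_test = 6507        # rows in the test split, from y_test.shape
offer_support = 28   # positive `offer` rows in that split, from the first report

# a model that never predicts `offer` is correct on every negative row
always_negative_accuracy = (n_test - offer_support) / n_test
recall = 0 / offer_support  # none of the positive rows are found

print(round(always_negative_accuracy, 6), recall)
```

The result, 0.995697, equals the `offer` accuracy printed above, which is what one would see if the tuned model flagged no `offer` messages at all. For rare categories, recall and F1 are therefore more informative than accuracy.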

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [24]:
del pipeline

In [25]:
# improve the pipeline based on the GridSearch analysis
pipeline = Pipeline([
('vect', CountVectorizer(tokenizer=tokenize)),
('tfidf', TfidfTransformer()),
('clf', MultiOutputClassifier(RandomForestClassifier(n_estimators = 100, max_depth = None)))
])


In [26]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

In [27]:
# report precision, recall, and f1 score for each category
clas_rep = classification_report(y_test, y_pred, target_names=y.columns.values)
print(clas_rep)
# Save classification report
save_clas_rep(clas_rep)

                        precision    recall  f1-score   support

               related       0.82      0.97      0.89      5011
               request       0.88      0.43      0.58      1136
                 offer       0.00      0.00      0.00        28
           aid_related       0.79      0.60      0.68      2765
          medical_help       0.66      0.04      0.08       527
      medical_products       0.69      0.05      0.10       332
     search_and_rescue       0.64      0.04      0.07       184
              security       0.50      0.01      0.02       108
              military       0.67      0.04      0.07       210
           child_alone       0.00      0.00      0.00         0
                 water       0.94      0.22      0.35       417
                  food       0.85      0.41      0.56       717
               shelter       0.88      0.27      0.41       602
              clothing       0.78      0.07      0.13       101
                 money       0.40      



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [28]:
#tbd

### 9. Export your model as a pickle file

In [29]:
# use a context manager so the file is closed after writing
with open('model.pkl', 'wb') as model_file:
    pickle.dump(pipeline, model_file)
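
A quick way to check an exported file is to load it back and compare. The sketch below does a round trip with a plain dictionary standing in for the fitted pipeline and writes to a temporary directory so that no real `model.pkl` is touched.

```python
import os
import pickle
import tempfile

# stand-in for the fitted pipeline: any picklable Python object
model = {"n_estimators": 100, "max_depth": None}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    with open(path, "wb") as model_file:
        pickle.dump(model, model_file)
    with open(path, "rb") as model_file:
        restored = pickle.load(model_file)

print(restored == model)
```

One caveat for the real pipeline: pickle stores functions by reference to their module and name, not by value. The code that later loads `model.pkl` therefore needs a `tokenize` function available under the same module path as when the model was saved.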

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.