# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.


### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [None]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
from scipy import stats

from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

import spacy
import en_core_web_sm

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV

from joblib import dump, load

In [2]:
# load data from database
engine = create_engine('sqlite:///../data/DisasterTweets.db')
df = pd.read_sql_table('categorized_messages', engine)
features = df['message']
labels = df.iloc[:, 4:]
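
Selecting the label columns by position (`iloc[:, 4:]`) quietly depends on the column order in the table. As a minimal sketch of a name-based alternative, the example below uses only the standard library's `sqlite3` with an in-memory toy table. The table name and column names come from this notebook; the single row and the choice to show only two label columns are made up for illustration.

```python
import sqlite3

# In-memory toy database standing in for DisasterTweets.db (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE categorized_messages "
    "(id INTEGER, message TEXT, original TEXT, genre TEXT, "
    "related INTEGER, request INTEGER)"
)
conn.execute(
    "INSERT INTO categorized_messages VALUES (?, ?, ?, ?, ?, ?)",
    (2, "Weather update", None, "direct", 1, 0),  # hypothetical row
)

cursor = conn.execute("SELECT * FROM categorized_messages")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()
conn.close()

# Everything that is not an identifier or text column is treated as a label
non_label_columns = {"id", "message", "original", "genre"}
label_columns = [name for name in columns if name not in non_label_columns]

print(label_columns)
```

With pandas, the same idea would be dropping the four non-label columns by name rather than slicing by index.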

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26215 entries, 0 to 26214
Data columns (total 40 columns):
id                        26215 non-null int64
message                   26215 non-null object
original                  10170 non-null object
genre                     26215 non-null object
related                   26215 non-null int64
request                   26215 non-null int64
offer                     26215 non-null int64
aid_related               26215 non-null int64
medical_help              26215 non-null int64
medical_products          26215 non-null int64
search_and_rescue         26215 non-null int64
security                  26215 non-null int64
military                  26215 non-null int64
child_alone               26215 non-null int64
water                     26215 non-null int64
food                      26215 non-null int64
shelter                   26215 non-null int64
clothing                  26215 non-null int64
money                     26215 non-null i

In [4]:
# Check to make sure the right classes, 0 and 1, are present
df['related'].value_counts()

1    20093
0     6122
Name: related, dtype: int64
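
The check above covers only `related`. A rough sketch of extending it to every label column is shown below in plain Python; `toy_labels` is a hypothetical stand-in for the real `labels` data, and one column deliberately contains an unexpected value so the check has something to report.

```python
from collections import Counter

# Hypothetical toy label columns; in the notebook these would come from `labels`
toy_labels = {
    "related": [1, 0, 1, 1],
    "request": [0, 0, 1, 0],
    "offer": [0, 0, 0, 2],  # deliberately contains an unexpected class
}


def non_binary_columns(columns):
    """Return {column name: sorted unexpected values} for columns not limited to 0/1."""
    problems = {}
    for name, values in columns.items():
        unexpected = set(Counter(values)) - {0, 1}
        if unexpected:
            problems[name] = sorted(unexpected)
    return problems


problems = non_binary_columns(toy_labels)
print(problems)
```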

### 2. Write a tokenization function to process your text data

In [5]:
features.loc[0]

'Weather update - a cold front from Cuba that could pass over Haiti'

Note that lemmatization is optional in the following function: I may later want to use part-of-speech tagging and/or named entity recognition results as engineered features, and lemmatizing first is likely to interfere with those.

In [6]:
def tokenize(text, lemma=True, use_spacy_full=False, use_spacy_lemma_only=True):
    '''
    Performs various preprocessing steps on a single piece of text. Specifically, this function:
        1. Strips all leading and trailing whitespace
        2. Makes everything lowercase
        3. Removes punctuation
        4. Tokenizes the text into individual words
        5. Removes common English stopwords
        6. If enabled, lemmatizes the remaining words
        
        
    Parameters
    ----------
    text: string representing a single message
    
    lemma: bool. Indicates if lemmatization should be done
    
    use_spacy_full: bool. If True, performs a full corpus analysis (POS, lemmas of all types, etc.) 
        using the spacy package instead of nltk lemmatization
        
    use_spacy_lemma_only: bool. If True, only performs verb-based lemmatization. Faster than full spacy
        corpus analysis by about 88x.
    
    
    Returns
    -------
    List of processed strings from a single message
    '''
    
    # Strip leading and trailing whitespace
    text = text.strip()
    
    # Make everything lowercase
    text = text.lower()
    
    # Retain only parts of text that are non-punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # Tokenize into individual words
    words = word_tokenize(text)
    
    # Remove common English stopwords
    words = [w for w in words if w not in stopwords.words("english")]
    
    # Lemmatize to root words, if option is enabled
    if lemma and not use_spacy_full and not use_spacy_lemma_only:
        words = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
    
    elif lemma and use_spacy_full:
        nlp = en_core_web_sm.load()
        doc = nlp(text)
        words = [token.lemma_ for token in doc if not token.is_stop]
        
    elif lemma and use_spacy_lemma_only:
        # NOTE: these imports rely on an older spaCy 2.x API and may not be
        # available in newer spaCy releases
        from spacy.lemmatizer import Lemmatizer
        from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
        lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
        words = [lemmatizer(w, u"VERB")[0] for w in words]

    return words
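
To make the first five preprocessing steps easier to follow in isolation, here is a dependency-free sketch. It is not the function above: it swaps `word_tokenize` for a plain `split()`, skips lemmatization, and uses a tiny made-up stopword set (`TOY_STOPWORDS`) instead of nltk's English list.

```python
import re

# Hypothetical, tiny stopword list for illustration only
TOY_STOPWORDS = frozenset({"a", "the", "from", "that", "could", "over"})


def simple_tokenize(text, stopwords=TOY_STOPWORDS):
    """Strip, lowercase, drop punctuation, split on whitespace, remove stopwords."""
    text = text.strip().lower()
    text = re.sub(r"[^a-z0-9]", " ", text)
    return [word for word in text.split() if word not in stopwords]


tokens = simple_tokenize(
    "Weather update - a cold front from Cuba that could pass over Haiti"
)
print(tokens)
```

Keeping the stopwords in a set built once, rather than rebuilding a list for every word, is a small design choice that keeps membership checks cheap.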

In [7]:
features.loc[3]

'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'

In [17]:
%%timeit
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=False)

3.46 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**Now what happens if I use the spacy approach to lemmatization, wherein I don't have to specify a POS type?**

In [18]:
%%timeit
tokenize(features.loc[3], use_spacy_full=True, use_spacy_lemma_only=False)

258 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%%timeit
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=True)

2.72 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


It looks like spaCy with verb-only lemmatization (2.72 ms per call) is a little faster than nltk (3.46 ms), so let's go with that. At 258 ms per call, `use_spacy_full=True` is roughly two orders of magnitude slower and should be avoided here.
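
One plausible contributor to the slow full-spaCy timing (an assumption, not something measured in this notebook) is that the model is loaded inside `tokenize` on every call. The sketch below shows the general caching pattern with `functools.lru_cache`; `get_model` is a hypothetical stand-in for an expensive loader, not a real spaCy call.

```python
from functools import lru_cache

load_calls = 0  # counts how often the "expensive" loader actually runs


@lru_cache(maxsize=None)
def get_model(name):
    """Hypothetical stand-in for an expensive loader such as a language model."""
    global load_calls
    load_calls += 1
    return {"name": name}


def tokenize_with_model(text):
    model = get_model("en_core_web_sm")  # loaded once, then served from the cache
    return text.lower().split()


for message in ["Needs supplies desperately", "Only Hospital St. Croix functioning"]:
    tokenize_with_model(message)

print(load_calls)
```

Whether this would close the gap with the verb-only approach is untested; it only removes the repeated loading cost.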

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [20]:
pipeline = Pipeline([
    ('tf-idf', TfidfVectorizer(tokenizer=tokenize)),
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier(), n_jobs=-1))
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier()))
    ('classifier', RandomForestClassifier())
    ], 
    verbose=True)
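
For intuition about what the `tf-idf` step produces, here is a small pure-Python sketch of the weighting. It assumes the smoothed IDF formula and L2 normalization that correspond to scikit-learn's `smooth_idf=True` and `norm='l2'` settings; the three toy documents are made up and already tokenized.

```python
import math
from collections import Counter

# Toy, already-tokenized documents (illustrative only)
docs = [["water", "food", "water"], ["food", "shelter"], ["storm"]]
n_docs = len(docs)

# Number of documents each term appears in
doc_freq = Counter(term for doc in docs for term in set(doc))


def tfidf_vector(doc):
    """Raw term count times smoothed IDF, then L2-normalized."""
    counts = Counter(doc)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in weights.values()))
    return {term: weight / norm for term, weight in weights.items()}


vector = tfidf_vector(docs[0])
print(vector)
```

In the first document, `water` gets the larger weight because it appears twice there and in no other document, while `food` is shared with a second document.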


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [25]:
# split the data into training and testing sets
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2)
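
The split above is not seeded, so the scores will vary between runs. The sketch below illustrates a seeded 80/20 hold-out split on row indices using only the standard library; `split_indices` and the seed value are hypothetical. With scikit-learn, the equivalent knob is the `random_state` argument of `train_test_split`.

```python
import random


def split_indices(n_rows, test_fraction=0.2, seed=42):
    """Shuffle row indices with a fixed seed and cut off a test portion."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_test = round(n_rows * test_fraction)
    return indices[n_test:], indices[:n_test]


# 26215 is the row count reported by df.info()
train_idx, test_idx = split_indices(26215)
print(len(train_idx), len(test_idx))
```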

In [27]:
%%time
# train the pipeline
pipeline.fit(features_train, labels_train)

[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.6min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  12.4s
CPU times: user 1min 29s, sys: 15.8 s, total: 1min 45s
Wall time: 1min 48s


Pipeline(memory=None,
         steps=[('tf-idf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern=...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                               

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
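
As a reminder of what these per-category numbers mean, here is a hand-rolled sketch of precision, recall and f1 for binary labels. The toy label columns are made up, and the convention of returning 0.0 when a denominator is zero is an assumption chosen for this example.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and f1 for the positive class (1) of one binary label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical toy data: one column per category
toy_true = {"related": [1, 1, 0, 1], "offer": [0, 1, 0, 0]}
toy_pred = {"related": [1, 0, 0, 1], "offer": [0, 0, 0, 0]}

report = {
    name: precision_recall_f1(toy_true[name], toy_pred[name]) for name in toy_true
}
print(report)
```

The `offer` column shows how a category that is never predicted ends up with all-zero scores.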

In [28]:
%%time
labels_pred = pipeline.predict(features_test)

CPU times: user 18.6 s, sys: 3.76 s, total: 22.3 s
Wall time: 23.5 s


In [31]:
class_report = pd.DataFrame.from_dict(classification_report(labels_test, labels_pred,
                                                            target_names=labels.columns,
                                                            digits=2, output_dict=True))

class_report

| | related | request | offer | aid_related | medical_help | medical_products | search_and_rescue | security | military | child_alone | ... | storm | fire | earthquake | cold | other_weather | direct_report | micro avg | macro avg | weighted avg | samples avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| precision | 0.848779 | 0.794715 | 0.0 | 0.770941 | 0.48 | 0.681818 | 1.0 | 0.0 | 0.769231 | 0.0 | ... | 0.74026 | 0.6 | 0.911647 | 0.833333 | 0.533333 | 0.727473 | 0.818521 | 0.572667 | 0.746662 | 0.639075 |
| recall | 0.897719 | 0.43785 | 0.0 | 0.486891 | 0.029126 | 0.057471 | 0.012739 | 0.0 | 0.063694 | 0.0 | ... | 0.240506 | 0.057692 | 0.462322 | 0.041322 | 0.02952 | 0.321672 | 0.425905 | 0.143351 | 0.425905 | 0.399108 |
| f1-score | 0.872563 | 0.564621 | 0.0 | 0.596844 | 0.05492 | 0.106007 | 0.025157 | 0.0 | 0.117647 | 0.0 | ... | 0.363057 | 0.105263 | 0.613514 | 0.07874 | 0.055944 | 0.446092 | 0.560277 | 0.195271 | 0.490836 | 0.439863 |
| support | 3989.0 | 893.0 | 23.0 | 2136.0 | 412.0 | 261.0 | 157.0 | 84.0 | 157.0 | 0.0 | ... | 474.0 | 52.0 | 491.0 | 121.0 | 271.0 | 1029.0 | 16499.0 | 16499.0 | 16499.0 | 16499.0 |


OK, because the classes are heavily imbalanced across categories, I'm going to assume that the weighted-average f1-score is the most informative single metric, **meaning that our current performance is 0.49**.
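
To see how the weighted average differs from the macro average, here is a small worked sketch. It uses only three of the categories from the report (`related`, `request`, `offer`), so it will not reproduce the 0.49 figure, which is computed over all 36 categories.

```python
# (f1-score, support) pairs copied from three columns of the report
per_class = {
    "related": (0.872563, 3989),
    "request": (0.564621, 893),
    "offer": (0.0, 23),
}

total_support = sum(support for _, support in per_class.values())

# Macro average: every category counts equally
macro_f1 = sum(f1 for f1, _ in per_class.values()) / len(per_class)

# Weighted average: each category counts in proportion to its support
weighted_f1 = sum(f1 * support for f1, support in per_class.values()) / total_support

print(round(macro_f1, 3), round(weighted_f1, 3))
```

On this subset the weighted average (about 0.81) sits well above the macro average (about 0.48) because the large `related` category dominates, which is also why a weighted score can hide poor performance on rare categories.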

### 6. Improve your model
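Use grid search to find better parameters. The cell below uses `RandomizedSearchCV`, which samples `n_iter` combinations from the grid instead of trying all of them.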

In [None]:
%%time
parameters = {
    'tf-idf__max_df': [0.1, 0.5, 0.9],
    'classifier__n_estimators': [10,50,100,500],
    'classifier__max_depth': [5,10, None],
    'classifier__min_samples_split': [2, 4, 8],
    'classifier__min_samples_leaf': [1, 6, 10]
}

cv = RandomizedSearchCV(pipeline, parameters, n_iter = 10, cv = 5, scoring='f1_weighted')
cv.fit(features_train, labels_train)

[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total= 1.3min


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total= 1.2min


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total= 1.3min


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total= 1.2min


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total= 1.2min


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  31.3s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  31.3s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  28.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  29.1s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  28.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  30.8s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  30.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  31.3s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  31.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  30.3s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.3s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.4s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.3s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.5min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   1.0s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.8s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.6s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.4min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.4s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   2.9s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.2min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.1s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.5min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.9s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.4min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.8s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   0.7s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.0s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.4s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.3min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   2.9s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.4min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=   3.0s


  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


In [None]:
tuning_results = pd.DataFrame(cv.cv_results_)
tuning_results
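
For context when reading `tuning_results`, the sketch below counts how much of the parameter grid the randomized search actually covers. The grid is copied from the search cell; the fit count assumes `n_iter=10` and `cv=5` as configured there.

```python
from itertools import product

# Parameter grid copied from the RandomizedSearchCV cell
parameters = {
    "tf-idf__max_df": [0.1, 0.5, 0.9],
    "classifier__n_estimators": [10, 50, 100, 500],
    "classifier__max_depth": [5, 10, None],
    "classifier__min_samples_split": [2, 4, 8],
    "classifier__min_samples_leaf": [1, 6, 10],
}

# Size of the full grid an exhaustive search would have to cover
n_combinations = len(list(product(*parameters.values())))

n_iter, n_folds = 10, 5
n_search_fits = n_iter * n_folds  # one fit per sampled combination per fold

print(n_combinations, n_search_fits)
```

So the search samples 10 of 324 possible combinations, at a cost of 50 pipeline fits plus one final refit on the full training set. Since each fit repeats the tf-idf step at over a minute per run, tokenization appears to dominate the total search time.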

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

In [None]:
%%time
labels_pred = cv.predict(features_test)

In [None]:
class_report = pd.DataFrame.from_dict(classification_report(labels_test, labels_pred,
                                                            target_names=labels.columns,
                                                            digits=2, output_dict=True))

class_report

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

In [None]:
# Saves as pickle using joblib package
# Code from https://scikit-learn.org/stable/modules/model_persistence.html
dump(cv, '../models/09-20-2019_RandomForest.joblib')
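
A quick round-trip check can confirm that a saved artifact loads back intact. The sketch below uses the standard library's `pickle` and a plain dictionary as a hypothetical stand-in for the fitted search object; the notebook itself uses joblib's `dump` and `load`, and the file name here is made up.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a fitted model object
toy_model = {"name": "RandomForest", "n_estimators": 100, "max_depth": None}

with tempfile.TemporaryDirectory() as directory:
    path = os.path.join(directory, "toy_model.pkl")

    # Save the object to disk
    with open(path, "wb") as file:
        pickle.dump(toy_model, file)

    # Load it back and compare with the original
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == toy_model)
```

One thing to keep in mind for the real pipeline: because it references the custom `tokenize` function, that function needs to be importable in whatever script loads the saved model.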

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.