# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.


### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [174]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
from scipy import stats

from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

import spacy
import en_core_web_sm

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [4]:
# load data from database
engine = create_engine('sqlite:///../data/DisasterTweets.db')
df = pd.read_sql_table('categorized_messages', engine)
features = df['message']
labels = df.iloc[:, 4:]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26216 entries, 0 to 26215
Data columns (total 40 columns):
id                        26216 non-null int64
message                   26216 non-null object
original                  10170 non-null object
genre                     26216 non-null object
related                   26216 non-null int64
request                   26216 non-null int64
offer                     26216 non-null int64
aid_related               26216 non-null int64
medical_help              26216 non-null int64
medical_products          26216 non-null int64
search_and_rescue         26216 non-null int64
security                  26216 non-null int64
military                  26216 non-null int64
child_alone               26216 non-null int64
water                     26216 non-null int64
food                      26216 non-null int64
shelter                   26216 non-null int64
clothing                  26216 non-null int64
money                     26216 non-null i

### 2. Write a tokenization function to process your text data

In [6]:
features.loc[0]

'Weather update - a cold front from Cuba that could pass over Haiti'

Note that I'm making lemmatization optional in the following function as I may want to use part-of-speech tagging and/or named entity recognition results as engineered features later, and lemmatization is likely to cause issues with those.

In [158]:
def tokenize(text, lemma=True, use_spacy_full=False, use_spacy_lemma_only=True):
    '''
    Performs various preprocessing steps on a single piece of text. Specifically, this function:
        1. Strips all leading and trailing whitespace
        2. Makes everything lowercase
        3. Removes punctuation
        4. Tokenizes the text into individual words
        5. Removes common English stopwords
        6. If enabled, lemmatizes the remaining words
        
        
    Parameters
    ----------
    text: string representing a single message
    
    lemma: bool. Indicates if lemmatization should be done
    
    use_spacy_full: bool. If True (and lemma is True), runs the full spacy pipeline (POS, lemmas of all types, etc.) 
        on the text instead of nltk lemmatization. Takes precedence over use_spacy_lemma_only.
        
    use_spacy_lemma_only: bool. If True (and lemma is True), only performs verb-based lemmatization with spacy's
        Lemmatizer. Faster than the full spacy analysis by about 88x. If both spacy flags are False, nltk is used.
    
    
    Returns
    -------
    List of processed strings from a single message
    '''
    
    # Strip leading and trailing whitespace
    text = text.strip()
    
    # Make everything lowercase
    text = text.lower()
    
    # Retain only parts of text that are non-punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # Tokenize into individual words
    words = word_tokenize(text)
    
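    # Remove common English stopwords (build the set once instead of once per word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    
    # Lemmatize to root words, if option is enabled
    if lemma and not use_spacy_full and not use_spacy_lemma_only:
        # Create the nltk lemmatizer once and reuse it for every word
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(w, pos='v') for w in words]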
    
    elif lemma and use_spacy_full:
        nlp = en_core_web_sm.load()
        doc = nlp(text)
        words = [token.lemma_ for token in doc if not token.is_stop]
        
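    elif lemma and use_spacy_lemma_only:        
        # Note: these imports follow the older spaCy 2.x API and may not be available in newer releases
        from spacy.lemmatizer import Lemmatizer
        from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
        lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
        # Treat every token as a verb and keep the first candidate lemma
        words = [lemmatizer(w, u"VERB")[0] for w in words]
        
    return words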

In [89]:
features.loc[3]

'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'

In [87]:
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=False)

['un',
 'report',
 'leogane',
 '80',
 '90',
 'destroy',
 'hospital',
 'st',
 'croix',
 'function',
 'need',
 'supply',
 'desperately']

**Now what happens if I use the spacy approach to lemmatization, wherein I don't have to specify a POS type?**

In [86]:
tokenize(features.loc[3], use_spacy_full=True)

['un',
 'report',
 'leogane',
 '80',
 '90',
 'destroy',
 'hospital',
 'st',
 'croix',
 'functioning',
 'need',
 'supply',
 'desperately']

**It looks like nltk and spacy lemmatization are effectively equivalent here (the only difference is 'function' vs. 'functioning'), so I'll just use spacy (since it doesn't limit itself just to nouns or verbs, for example)**

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results for the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [104]:
pipeline = Pipeline([
    ('tf-idf', TfidfVectorizer(tokenizer=tokenize)),
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier(), n_jobs=-1))
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier()))
    ('classifier', RandomForestClassifier())
    ], 
    verbose=True)
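
The cell above passes `RandomForestClassifier` directly, which works because random forests accept multi-output targets. For estimators that do not, wrapping them in `MultiOutputClassifier` is one option. Here is a minimal sketch of that wrapping on made-up data: the `toy_*` names and messages are hypothetical stand-ins, not part of the disaster dataset, and the default tokenizer is used instead of the custom `tokenize` function.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy messages standing in for the real `message` column
toy_messages = [
    "we need water and food",
    "water supply is destroyed",
    "please send food",
    "the weather is fine today",
    "hospital needs medical supplies",
    "no news to report",
]

# Two binary label columns (think: water, food), one row per message
toy_labels = np.array([
    [1, 1],
    [1, 0],
    [0, 1],
    [0, 0],
    [0, 0],
    [0, 0],
])

# MultiOutputClassifier fits one copy of the wrapped estimator per label column
toy_pipeline = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("classifier", MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])

toy_pipeline.fit(toy_messages, toy_labels)
toy_pred = toy_pipeline.predict(["need water", "sunny weather"])

# One row per input message, one column per label
print(toy_pred.shape)
```

The predictions come back as a 2D array, which is why the evaluation step later indexes columns with `labels_pred[:, i]`.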


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [105]:
# split data
features_train, features_test, labels_train, labels_test = train_test_split(features, labels,
                                                                           test_size=0.2)

# train the pipeline
pipeline.fit(features_train, labels_train)

[Pipeline] ............ (step 1 of 2) Processing tf-idf, total=84.8min




[Pipeline] ........ (step 2 of 2) Processing classifier, total=  11.9s


Pipeline(memory=None,
         steps=[('tf-idf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern=...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                               

**WHOA! The tf-idf step took almost an hour and a half?! That's not right.** I wonder if my lemmatization process is slowing things down? I know the raw sklearn tf-idf transformer is extremely efficient, so my custom tokenizer must be the problem...

In [132]:
features.loc[3]

'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'

In [133]:
%%time
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=False)

CPU times: user 6.66 ms, sys: 3.63 ms, total: 10.3 ms
Wall time: 68.6 ms


['un',
 'report',
 'leogane',
 '80',
 '90',
 'destroy',
 'hospital',
 'st',
 'croix',
 'function',
 'need',
 'supply',
 'desperately']

**Now what happens if I use the spacy approach to lemmatization, wherein I don't have to specify a POS type?**

In [134]:
%%time
tokenize(features.loc[3], use_spacy_full=True)

CPU times: user 227 ms, sys: 31.8 ms, total: 258 ms
Wall time: 426 ms


['un',
 'report',
 'leogane',
 '80',
 '90',
 'destroy',
 'hospital',
 'st',
 'croix',
 'functioning',
 'need',
 'supply',
 'desperately']

OK, so now we know that spacy is about 34x slower than nltk for lemmatization (comparing user time here, not wall time). That's weird, because spacy is billed as being super efficient, so perhaps my approach to lemmatization with it is flawed.
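
One plausible contributor, assuming most of the cost comes from calling `en_core_web_sm.load()` inside `tokenize` on every message, is that the model gets reloaded each time. The sketch below shows the general idea of loading an expensive resource once and reusing it. `get_model` is a hypothetical stand-in for the real loader, and `tokenize_cached` is a toy function, not the notebook's tokenizer.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_model():
    """Hypothetical stand-in for an expensive loader such as en_core_web_sm.load()."""
    return {"name": "stand-in model"}


def tokenize_cached(text):
    """Toy tokenizer that reuses the cached model instead of reloading it."""
    model = get_model()  # the loader body runs on the first call only
    # A real tokenizer would run `model` on the text here; this sketch just splits
    return text.lower().split()


for message in ["Needs supplies desperately", "Only Hospital St. Croix functioning"]:
    tokenize_cached(message)

# misses counts real loads, hits counts calls served from the cache
info = get_model.cache_info()
print(info.misses, info.hits)
```

Two messages were tokenized, but the loader ran only once. The same effect can be had by loading the model at module level, outside the function.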

What if I don't even lemmatize anything?

In [135]:
%%time
tokenize(features.loc[3], lemma=False)

CPU times: user 5.04 ms, sys: 2.35 ms, total: 7.39 ms
Wall time: 51.7 ms


['un',
 'reports',
 'leogane',
 '80',
 '90',
 'destroyed',
 'hospital',
 'st',
 'croix',
 'functioning',
 'needs',
 'supplies',
 'desperately']

Hmmm, not a dramatic improvement over nltk lemmatization, so I can probably keep the lemmatization step.

Let's stop using single timed runs and see how fast nltk-alone lemmatization performs when averaged over many loops. I'll also look at how the standard Lemmatizer object in spacy performs, instead of the full-blown corpus analysis approach I originally set up.

In [150]:
%%timeit
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=False)

2.71 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [151]:
%%timeit
tokenize(features.loc[3], use_spacy_full=True)

237 ms ± 24.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [152]:
%%timeit
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=True)

2.68 ms ± 164 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


Alrighty, there we go: our tokenizer is now a lot more efficient with the spacy non-full option (although still basically the same speed as nltk). Let's try that training again!

In [160]:
pipeline = Pipeline([
    ('tf-idf', TfidfVectorizer(tokenizer=tokenize)),
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier(), n_jobs=-1))
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier()))
    ('classifier', RandomForestClassifier())
    ], 
    verbose=True)


In [161]:
%%time
# train the pipeline - let's try that again
pipeline.fit(features_train, labels_train)

[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.5min




[Pipeline] ........ (step 2 of 2) Processing classifier, total=  13.5s
CPU times: user 1min 28s, sys: 15 s, total: 1min 43s
Wall time: 1min 45s


Pipeline(memory=None,
         steps=[('tf-idf',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern=...
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                               

**YES. That did it!**

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
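
As a reminder of what those three numbers mean for a single binary category, here is a small worked example in plain Python. The `toy_true` and `toy_pred` lists are made up for illustration, and `binary_prf` is a hypothetical helper, not an sklearn function.

```python
def binary_prf(y_true, y_pred, positive=1):
    """Compute precision, recall and f1 for one binary label from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


# Made-up labels: 3 true positives, 1 false positive, 1 false negative
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1 = binary_prf(toy_true, toy_pred)
print(precision, recall, f1)  # 0.75 0.75 0.75
```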

In [162]:
labels_pred = pipeline.predict(features_test)

In [163]:
%%time
reports = []

for i, column in enumerate(labels_test.columns):
    reports.append(pd.DataFrame.from_dict(classification_report(labels_test[column], labels_pred[:,i],
                                                                labels=np.unique(labels_pred[:,i]),
                                                                digits=2, output_dict=True)))

CPU times: user 370 ms, sys: 7.82 ms, total: 377 ms
Wall time: 402 ms


In [169]:
for i, report in enumerate(reports):
    print(f"Category: {labels.columns[i]}\n")
    print(report)
    print("\n\n")

Category: related

                     0            1          2  accuracy    macro avg  \
precision     0.607767     0.853629   0.538462  0.803776     0.666619   
recall        0.511020     0.898016   0.368421  0.803776     0.592486   
f1-score      0.555211     0.875260   0.437500  0.803776     0.622657   
support    1225.000000  3981.000000  38.000000  0.803776  5244.000000   

           weighted avg  
precision      0.793912  
recall         0.803776  
f1-score       0.797324  
support     5244.000000  



Category: request

                     0           1  accuracy    macro avg  weighted avg
precision     0.899028    0.798039  0.889207     0.848534      0.881985
recall        0.976371    0.459887  0.889207     0.718129      0.889207
f1-score      0.936105    0.583513  0.889207     0.759809      0.876600
support    4359.000000  885.000000  0.889207  5244.000000   5244.000000



Category: offer

                     0    micro avg    macro avg  weighted avg
precision     0.9961

What does the mean weighted average f1-score look like across categories? This is roughly the best metric I can think of for measuring overall model performance.

In [176]:
def model_weighted_f1_score(reports):
    '''
    Extracts the weighted average precision and recall scores from each category that the model predicted,
    takes the arithmetic mean of each metric across categories, and then combines the two means with the
    f1 formula (their harmonic mean).
    Meant to be used as an overall model performance measure.
    
    
    Parameters
    ----------
    reports: list of pandas DataFrames, where each DataFrame is the result of a single message
        category's classification_report resulting from test set prediction.
        
    
    Returns
    -------
    Overall model f1-score as a float.
    '''
    
    mean_precision = pd.Series([report.loc['precision', 'weighted avg'] for report in reports]).mean()
    mean_recall = pd.Series([report.loc['recall', 'weighted avg'] for report in reports]).mean()
    
    return stats.hmean([mean_precision, mean_recall])

In [177]:
model_weighted_f1_score(reports)

0.9392606772981869

Well that's not half bad!
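
To make the arithmetic behind that score concrete, here is a small worked example using only the weighted averages printed above for `related` and `request`. It is illustrative only: the real score averages over every category, so this two-category result is not expected to match.

```python
from statistics import harmonic_mean, mean

# Weighted-average precision and recall copied from the two full reports printed above
weighted_precisions = [0.793912, 0.881985]  # related, request
weighted_recalls = [0.803776, 0.889207]     # related, request

# Step 1: arithmetic mean of each metric across categories
mean_precision = mean(weighted_precisions)
mean_recall = mean(weighted_recalls)

# Step 2: the f1 formula is the harmonic mean of precision and recall
overall_f1 = harmonic_mean([mean_precision, mean_recall])

print(round(overall_f1, 3))  # about 0.842 for these two categories alone
```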

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
%%time
parameters = 

cv = 
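
The cell above is still the empty template. As a rough sketch of how it might be filled in, here is a grid search over a pipeline of the same shape on made-up data. The parameter values, the `toy_*` names, and `cv=2` are assumptions chosen to keep the example fast, not tuned recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for features_train / labels_train
toy_messages = [
    "we need water and food",
    "water supply is destroyed",
    "please send food",
    "the weather is fine today",
    "hospital needs medical supplies",
    "no news to report",
    "send water please",
    "food is running out",
]
toy_labels = np.array([
    [1, 1], [1, 0], [0, 1], [0, 0],
    [0, 0], [0, 0], [1, 0], [0, 1],
])

toy_pipeline = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(random_state=0)),
])

# Keys follow the '<step name>__<parameter name>' convention
parameters = {
    "tf-idf__ngram_range": [(1, 1), (1, 2)],
    "classifier__n_estimators": [10, 20],
}

# Small cv only because the toy dataset has eight rows
cv = GridSearchCV(toy_pipeline, param_grid=parameters, cv=2)
cv.fit(toy_messages, toy_labels)

print(cv.best_params_)
```

After fitting, `cv` can be used like the original pipeline, for example `cv.predict(features_test)`, since by default it refits the best parameter combination on all of the data it was given.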

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file
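
A minimal sketch of the round trip with the standard `pickle` module follows. `toy_model` is a hypothetical placeholder for the fitted pipeline, and `model.pkl` is an assumed filename.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the fitted pipeline; any picklable object works the same way
toy_model = {"steps": ["tf-idf", "classifier"], "fitted": True}

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "model.pkl")  # assumed filename

    # Write the object in binary mode
    with open(model_path, "wb") as f:
        pickle.dump(toy_model, f)

    # Read it back to confirm the export worked
    with open(model_path, "rb") as f:
        restored = pickle.load(f)

print(restored == toy_model)  # True
```

One caveat for the real pipeline: pickle stores functions by reference, so the custom `tokenize` function must be importable wherever the model is loaded.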

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
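
The command-line entry point of such a script could be sketched as below. The argument names `database_filepath` and `model_filepath` are assumptions for illustration, and the template in the Resources folder may define its own interface.

```python
import argparse


def parse_args(argv=None):
    """Parse the two file paths the training script needs (names are hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file.")
    parser.add_argument("database_filepath",
                        help="Path to the SQLite database holding the messages")
    parser.add_argument("model_filepath",
                        help="Path where the pickled model should be written")
    return parser.parse_args(argv)


# Passing a list simulates command-line arguments; pass nothing to read sys.argv
args = parse_args(["../data/DisasterTweets.db", "model.pkl"])
print(args.database_filepath, args.model_filepath)
```

From there, the script would run the notebook steps in order: load the data, build the pipeline, fit, evaluate, and export.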