# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.


### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import re
from scipy import stats

from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

import spacy
import en_core_web_sm

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import RandomizedSearchCV

In [2]:
# load data from database
engine = create_engine('sqlite:///../data/DisasterTweets.db')
df = pd.read_sql_table('categorized_messages', engine)
features = df['message']
labels = df.iloc[:, 4:]
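
The positional slice `df.iloc[:, 4:]` relies on the four metadata columns always coming first. As a sketch of a slightly more defensive alternative, the labels can be selected by dropping the metadata columns by name. The tiny DataFrame here is made-up stand-in data, not rows from the database, and `metadata_columns` is a name invented for this example.

```python
import pandas as pd

# Made-up stand-in for the loaded table (two label columns instead of 36)
toy_df = pd.DataFrame({
    "id": [1, 2],
    "message": ["need water", "storm coming"],
    "original": [None, None],
    "genre": ["direct", "news"],
    "related": [1, 1],
    "request": [1, 0],
})

# Columns that describe the message rather than categorize it
metadata_columns = ["id", "message", "original", "genre"]

toy_features = toy_df["message"]
toy_labels = toy_df.drop(columns=metadata_columns)

print(list(toy_labels.columns))
```

Selecting by name keeps working if the table later gains or reorders a metadata column, at the cost of having to maintain the list.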

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26215 entries, 0 to 26214
Data columns (total 40 columns):
id                        26215 non-null int64
message                   26215 non-null object
original                  10170 non-null object
genre                     26215 non-null object
related                   26215 non-null int64
request                   26215 non-null int64
offer                     26215 non-null int64
aid_related               26215 non-null int64
medical_help              26215 non-null int64
medical_products          26215 non-null int64
search_and_rescue         26215 non-null int64
security                  26215 non-null int64
military                  26215 non-null int64
child_alone               26215 non-null int64
water                     26215 non-null int64
food                      26215 non-null int64
shelter                   26215 non-null int64
clothing                  26215 non-null int64
money                     26215 non-null i

In [4]:
# Check to make sure the right classes, 0 and 1, are present
df['related'].value_counts()

1    20093
0     6122
Name: related, dtype: int64
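
`value_counts` on `related` only covers that one column. A rough sketch for scanning every label column for values other than 0 and 1, on made-up data, could look like this (the classification report for `related` further down does list a class 2, so a check of this kind seems worthwhile):

```python
import pandas as pd

# Made-up label matrix; "related" deliberately contains a stray 2
toy_labels = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "request": [0, 1, 0, 0],
})

# Collect the columns that hold anything besides 0 and 1
non_binary = [
    col for col in toy_labels.columns
    if not toy_labels[col].isin([0, 1]).all()
]

print(non_binary)
```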

### 2. Write a tokenization function to process your text data

In [5]:
features.loc[0]

'Weather update - a cold front from Cuba that could pass over Haiti'

Lemmatization is optional in the function below: I may later want part-of-speech tags and/or named entity recognition results as engineered features, and lemmatizing the text first is likely to cause issues with both.

In [6]:
def tokenize(text, lemma=True, use_spacy_full=False, use_spacy_lemma_only=True):
    '''
    Performs various preprocessing steps on a single piece of text. Specifically, this function:
        1. Strips all leading and trailing whitespace
        2. Makes everything lowercase
        3. Removes punctuation
        4. Tokenizes the text into individual words
        5. Removes common English stopwords
        6. If enabled, lemmatizes the remaining words
        
        
    Parameters
    ----------
    text: string representing a single message
    
    lemma: bool. Indicates if lemmatization should be done
    
    use_spacy_full: bool. If True, performs a full corpus analysis (POS, lemmas of all types, etc.) 
        using the spacy package instead of nltk lemmatization
        
    use_spacy_lemma_only: bool. If True, only performs verb-based lemmatization. Faster than full spacy
        corpus analysis by about 88x.
    
    
    Returns
    -------
    List of processed strings from a single message
    '''
    
    # Strip leading and trailing whitespace
    text = text.strip()
    
    # Make everything lowercase
    text = text.lower()
    
    # Retain only parts of text that are non-punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # Tokenize into individual words
    words = word_tokenize(text)
    
    # Remove common English stopwords
    words = [w for w in words if w not in stopwords.words("english")]
    
    # Lemmatize to root words, if option is enabled
    if lemma and not use_spacy_full and not use_spacy_lemma_only:
        words = [WordNetLemmatizer().lemmatize(w, pos='v') for w in words]
    
    elif lemma and use_spacy_full:
        nlp = en_core_web_sm.load()
        doc = nlp(text)
        words = [token.lemma_ for token in doc if not token.is_stop]
        
    elif lemma and use_spacy_lemma_only:
        # Note: these import paths come from the spaCy 2.x API
        from spacy.lemmatizer import Lemmatizer
        from spacy.lang.en import LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES
        lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
        words = [lemmatizer(w, u"VERB")[0] for w in words]

    return words
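
In the function above, `stopwords.words("english")` is re-evaluated for every word in the list comprehension. A minimal sketch of the same normalization steps with the stop word lookup built once, assuming a tiny hand-written stop word set in place of nltk's list and skipping lemmatization entirely (`simple_tokenize` and `STOP_WORDS` are names made up for this example):

```python
import re

# Hypothetical, deliberately tiny stop word set; the notebook uses nltk's English list
STOP_WORDS = frozenset({"a", "the", "that", "from", "over", "could", "only"})

# Compiled once; the text is lowercased before matching, so a-z is enough
_NON_ALNUM = re.compile(r"[^a-z0-9]")


def simple_tokenize(text, stop_words=STOP_WORDS):
    """Strip, lowercase, drop punctuation, split on whitespace, remove stop words."""
    text = text.strip().lower()
    text = _NON_ALNUM.sub(" ", text)
    return [w for w in text.split() if w not in stop_words]


tokens = simple_tokenize(
    "Weather update - a cold front from Cuba that could pass over Haiti"
)
print(tokens)
```

Membership tests against a `frozenset` are constant time, whereas rebuilding and scanning a list per token grows with the size of the stop word list.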

In [7]:
features.loc[3]

'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'

In [17]:
%%timeit
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=False)

3.46 ms ± 497 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


**Now what happens if I use the spacy approach to lemmatization, wherein I don't have to specify a POS type?**

In [18]:
%%timeit
tokenize(features.loc[3], use_spacy_full=True, use_spacy_lemma_only=False)

258 ms ± 23.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [19]:
%%timeit
tokenize(features.loc[3], use_spacy_full=False, use_spacy_lemma_only=True)

2.72 ms ± 244 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


It looks like spaCy's verb-only lemmatization is a little faster than nltk (2.72 ms vs. 3.46 ms per call), so let's go with that. And definitely don't use `use_spacy_full` for anything: at 258 ms per call it is roughly two orders of magnitude slower.
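
Part of that 258 ms is probably model loading rather than analysis, since `tokenize` calls `en_core_web_sm.load()` on every invocation; how the time splits between the two isn't measured here. A sketch of the usual remedy, caching the loaded object, using a stand-in loader so it runs without spaCy installed (`get_model` and `tokenize_with_cached_model` are hypothetical names):

```python
from functools import lru_cache

load_calls = 0  # counts how often the (simulated) expensive load actually runs


@lru_cache(maxsize=None)
def get_model(name):
    """Stand-in for an expensive loader such as en_core_web_sm.load()."""
    global load_calls
    load_calls += 1
    return {"name": name}  # placeholder object, not a real spaCy pipeline


def tokenize_with_cached_model(text):
    model = get_model("en_core_web_sm")  # loaded on the first call, reused afterwards
    return text.split(), model["name"]


for message in ["first message", "second message", "third message"]:
    tokenize_with_cached_model(message)

print(load_calls)
```

Whether this would make the full spaCy path competitive with the verb-only lemmatizer is an open question; it only removes the repeated load.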

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [20]:
pipeline = Pipeline([
    ('tf-idf', TfidfVectorizer(tokenizer=tokenize)),
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier(), n_jobs=-1))
    #('classifier', MultiOutputClassifier(GradientBoostingClassifier()))
    ('classifier', RandomForestClassifier())
    ], 
    verbose=True)
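
`RandomForestClassifier` accepts a 2-D label matrix directly, so the pipeline above does not need the `MultiOutputClassifier` wrapper. A toy sketch with invented messages and two invented label columns, using the default tokenizer rather than `tokenize`:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Invented messages and labels; columns stand for two categories, e.g. water and food
toy_messages = [
    "we need water",
    "food and water please",
    "storm is coming",
    "need shelter and food",
]
toy_targets = np.array([[1, 0], [1, 1], [0, 0], [0, 1]])

toy_pipeline = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
])
toy_pipeline.fit(toy_messages, toy_targets)

# One row per message, one column per category
toy_pred = toy_pipeline.predict(["water needed"])
print(toy_pred.shape)
```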


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [25]:
# split the data into testing and training
features_train, features_test, labels_train, labels_test = train_test_split(
    features, labels, test_size=0.2)
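
The split above is random on every run, so scores will shift slightly between executions. A small sketch, on invented data, of pinning the split with `random_state` (the seed value 42 is arbitrary):

```python
from sklearn.model_selection import train_test_split

# Invented stand-ins for the message and label columns
toy_messages = [f"message {i}" for i in range(10)]
toy_targets = [i % 2 for i in range(10)]

X_train, X_test, y_train, y_test = train_test_split(
    toy_messages, toy_targets, test_size=0.2, random_state=42
)

# Repeating the call with the same seed reproduces the same partition
X_train_again, X_test_again, _, _ = train_test_split(
    toy_messages, toy_targets, test_size=0.2, random_state=42
)

print(len(X_train), len(X_test))
```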

In [None]:
%%time
# train the pipeline
pipeline.fit(features_train, labels_train)

**YES. That did it!**

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [None]:
%%time
labels_pred = pipeline.predict(features_test)

In [None]:
pd.DataFrame.from_dict(classification_report(labels_test, labels_pred,
                                             digits=2, output_dict=True))

In [163]:
reports = []

for i, column in enumerate(labels_test.columns):
    reports.append(pd.DataFrame.from_dict(classification_report(labels_test[column], labels_pred[:,i],
                                                                labels=np.unique(labels_pred[:,i]),
                                                                digits=2, output_dict=True)))

CPU times: user 370 ms, sys: 7.82 ms, total: 377 ms
Wall time: 402 ms
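
As a small illustration of the per-column loop on made-up data, the sketch below also shows how the `macro avg` and `weighted avg` rows diverge when a rare category is never predicted. The column names and values are invented, and `zero_division=0` only silences the undefined-precision warning.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Invented ground truth and predictions for two categories
y_true = pd.DataFrame({"request": [0, 0, 1, 0], "offer": [0, 0, 0, 1]})
y_pred = np.array([[0, 0], [0, 0], [1, 0], [0, 0]])  # "offer" is never predicted

toy_reports = {}
for i, column in enumerate(y_true.columns):
    toy_reports[column] = classification_report(
        y_true[column], y_pred[:, i], output_dict=True, zero_division=0
    )

offer = toy_reports["offer"]
print(offer["macro avg"]["f1-score"], offer["weighted avg"]["f1-score"])
```

For `offer`, the weighted average is pulled up by the majority class even though the single positive example is missed, which is worth keeping in mind when reading weighted scores on imbalanced categories.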


In [169]:
for i, report in enumerate(reports):
    print(f"Category: {labels.columns[i]}\n")
    print(report)
    print("\n\n")

Category: related

                     0            1          2  accuracy    macro avg  \
precision     0.607767     0.853629   0.538462  0.803776     0.666619   
recall        0.511020     0.898016   0.368421  0.803776     0.592486   
f1-score      0.555211     0.875260   0.437500  0.803776     0.622657   
support    1225.000000  3981.000000  38.000000  0.803776  5244.000000   

           weighted avg  
precision      0.793912  
recall         0.803776  
f1-score       0.797324  
support     5244.000000  



Category: request

                     0           1  accuracy    macro avg  weighted avg
precision     0.899028    0.798039  0.889207     0.848534      0.881985
recall        0.976371    0.459887  0.889207     0.718129      0.889207
f1-score      0.936105    0.583513  0.889207     0.759809      0.876600
support    4359.000000  885.000000  0.889207  5244.000000   5244.000000



Category: offer

                     0    micro avg    macro avg  weighted avg
precision     0.9961

What does the mean weighted average f1-score look like across categories? This is roughly the best metric I can think of for measuring overall model performance.

In [176]:
def model_weighted_f1_score(reports, average_type='weighted avg'):
    '''
    Extracts the chosen average precision and recall scores from each category that the model predicted,
    takes the arithmetic mean of each metric across categories, and then combines the two means with
    the f1 formula (their harmonic mean).
    Meant to be used as an overall model performance measure.
    
    
    Parameters
    ----------
    reports: list of pandas DataFrames, where each DataFrame is the result of a single message
        category's classification_report resulting from test set prediction.
        
    average_type: str. Indicates which classification_report average column to extract and
        use for overall model performance evaluation. Defaults to 'weighted avg'. 'macro avg' is the
        stricter choice, since it weights every class equally and so penalizes poor performance
        on rare classes.
        
    
    Returns
    -------
    Overall model f1-score as a float.
    '''
    
    mean_precision = pd.Series([report.loc['precision', average_type] for report in reports]).mean()
    mean_recall = pd.Series([report.loc['recall', average_type] for report in reports]).mean()
    
    return stats.hmean([mean_precision, mean_recall])

In [177]:
model_weighted_f1_score(reports)

0.9392606772981869
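
To make the arithmetic concrete: the function body averages precision and recall across categories, then takes the harmonic mean of those two numbers, which is the same as plugging them into the F1 formula. A standard-library sketch with hypothetical per-category scores (not taken from the run above):

```python
from statistics import harmonic_mean, mean

# Hypothetical per-category weighted-average scores
precisions = [0.9, 0.7, 0.8]
recalls = [0.6, 0.5, 0.7]

mean_precision = mean(precisions)  # arithmetic mean across categories
mean_recall = mean(recalls)

# Harmonic mean of the two means ...
overall_f1 = harmonic_mean([mean_precision, mean_recall])

# ... which matches the familiar F1 formula 2PR / (P + R)
manual_f1 = 2 * mean_precision * mean_recall / (mean_precision + mean_recall)

print(round(overall_f1, 4))
```

Note that this is not the same as averaging the per-category F1 scores, so the two summaries can differ.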

Well that's not half bad!

### 6. Improve your model
Use grid search to find better parameters. 

In [183]:
%%time
parameters = {
    'tf-idf__max_df': [0.1, 0.5, 0.9],
    'classifier__n_estimators': [10,50,100,500],
    'classifier__max_depth': [5,10, None],
    'classifier__min_samples_split': [2, 4, 8],
    'classifier__min_samples_leaf': [1, 6, 10]
}

cv = RandomizedSearchCV(pipeline, parameters, n_iter = 10, cv = 5, scoring='f1_weighted')
cv.fit(features_train, labels_train)

[Pipeline] ............ (step 1 of 2) Processing tf-idf, total= 1.1min
[Pipeline] ........ (step 2 of 2) Processing classifier, total=  39.7s


ValueError: Classification metrics can't handle a mix of multiclass-multioutput and multilabel-indicator targets
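
scikit-learn's F1 metrics don't support multiclass-multioutput targets, and with 36 label columns of which at least one (`related`) has a third class, that is likely what triggers the error above. One possible workaround, sketched under the assumption that an unweighted mean of per-column weighted F1 is an acceptable tuning target, is a custom scorer built with `make_scorer`. The names `mean_column_f1` and `multioutput_f1` are made up for this example.

```python
import numpy as np
from sklearn.metrics import f1_score, make_scorer


def mean_column_f1(y_true, y_pred):
    """Average the weighted F1 score over every output column."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    column_scores = [
        f1_score(y_true[:, i], y_pred[:, i], average="weighted", zero_division=0)
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(column_scores))


# Scorer object that can be passed as scoring=... to a search
multioutput_f1 = make_scorer(mean_column_f1)

# Invented labels: the second column has three classes, like "related"
toy_true = np.array([[0, 1], [1, 2], [0, 1], [1, 0]])
toy_wrong_first_column = toy_true.copy()
toy_wrong_first_column[:, 0] = 1 - toy_true[:, 0]

print(mean_column_f1(toy_true, toy_true))
```

With this in place, the search could be created as `RandomizedSearchCV(pipeline, parameters, n_iter=10, cv=5, scoring=multioutput_f1)`; I have not verified the resulting scores on this dataset.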

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out, especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file
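
A minimal sketch of a pickle round trip, using a toy pipeline on invented data and a temporary directory. The file name `classifier.pkl` is a placeholder, not a name required by the project.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Invented training data for a single binary category
toy_messages = ["we need water", "send food", "storm is coming", "roads are closed"]
toy_targets = [1, 1, 0, 0]

toy_model = Pipeline([
    ("tf-idf", TfidfVectorizer()),
    ("classifier", RandomForestClassifier(n_estimators=10, random_state=0)),
]).fit(toy_messages, toy_targets)

# Placeholder path inside a throwaway directory
model_path = os.path.join(tempfile.mkdtemp(), "classifier.pkl")

with open(model_path, "wb") as f:
    pickle.dump(toy_model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions
same_predictions = bool(
    (toy_model.predict(toy_messages) == restored.predict(toy_messages)).all()
)
print(same_predictions)
```

One caveat: pickle stores functions by reference, so a pipeline built with `TfidfVectorizer(tokenizer=tokenize)` can only be loaded where `tokenize` is importable under the same module path.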

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.