# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.


### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [91]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import re

from nltk.tokenize import word_tokenize
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords

import spacy
import en_core_web_sm

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import GradientBoostingClassifier

In [4]:
# load data from database
engine = create_engine('sqlite:///../data/DisasterTweets.db')
df = pd.read_sql_table('categorized_messages', engine)
features = df['message']
labels = df.iloc[:, 4:]

In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 26216 entries, 0 to 26215
Data columns (total 40 columns):
id                        26216 non-null int64
message                   26216 non-null object
original                  10170 non-null object
genre                     26216 non-null object
related                   26216 non-null int64
request                   26216 non-null int64
offer                     26216 non-null int64
aid_related               26216 non-null int64
medical_help              26216 non-null int64
medical_products          26216 non-null int64
search_and_rescue         26216 non-null int64
security                  26216 non-null int64
military                  26216 non-null int64
child_alone               26216 non-null int64
water                     26216 non-null int64
food                      26216 non-null int64
shelter                   26216 non-null int64
clothing                  26216 non-null int64
money                     26216 non-null i

### 2. Write a tokenization function to process your text data

In [6]:
features.loc[0]

'Weather update - a cold front from Cuba that could pass over Haiti'

Note that I'm making lemmatization optional in the following function as I may want to use part-of-speech tagging and/or named entity recognition results as engineered features later, and lemmatization is likely to cause issues with those.

In [79]:
def tokenize(text, lemma=True, use_spacy=True):
    '''
    Performs various preprocessing steps on a single piece of text. Specifically, this function:
        1. Strips all leading and trailing whitespace
        2. Makes everything lowercase
        3. Removes punctuation
        4. Tokenizes the text into individual words
        5. Removes common English stopwords
        6. If enabled, lemmatizes the remaining words
        
        
    Parameters
    ----------
    text: string representing a single message
    
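    lemma: bool. Indicates if lemmatization should be done
    
    use_spacy: bool. If True, lemmatizes with spaCy; otherwise uses NLTK's WordNetLemmatizer (verbs only)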
    
    
    Returns
    -------
    List of processed strings from a single message
    '''
    
    # Strip leading and trailing whitespace
    text = text.strip()
    
    # Make everything lowercase
    text = text.lower()
    
    # Retain only parts of text that are non-punctuation
    text = re.sub(r"[^a-zA-Z0-9]", " ", text)
    
    # Tokenize into individual words
    words = word_tokenize(text)
    
    # Remove common English stopwords (build the set once instead of once per word)
    stop_words = set(stopwords.words("english"))
    words = [w for w in words if w not in stop_words]
    
    # Lemmatize to root words, if option is enabled
    if lemma and not use_spacy:
        lemmatizer = WordNetLemmatizer()
        words = [lemmatizer.lemmatize(w, pos='v') for w in words]
    
    elif lemma and use_spacy:
        # Note: loading the model on every call is slow for large datasets
        nlp = en_core_web_sm.load()
        doc = nlp(text)
        # using nltk stopwords as they don't drop things like "front"
        words = [token.lemma_ for token in doc if token.text not in stop_words]
        
        # Drop empty strings and strings with only whitespace
        words = [w for w in words if w.strip()]
        
    return words

In [89]:
features.loc[3]

'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.'

In [87]:
tokenize(features.loc[3], use_spacy=False)

['un',
 'report',
 'leogane',
 '80',
 '90',
 'destroy',
 'hospital',
 'st',
 'croix',
 'function',
 'need',
 'supply',
 'desperately']

**Now what happens if I use the spacy approach to lemmatization, wherein I don't have to specify a POS type?**

In [86]:
tokenize(features.loc[3], use_spacy=True)

['un',
 'report',
 'leogane',
 '80',
 '90',
 'destroy',
 'hospital',
 'st',
 'croix',
 'functioning',
 'need',
 'supply',
 'desperately']

**It looks like NLTK and spaCy lemmatization are nearly equivalent on this message (the only difference is 'function' vs. 'functioning'), so I'll just use spaCy, since it isn't limited to a single part of speech such as nouns or verbs.**

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [None]:
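# tokenize + tf-idf feature extraction across messages (the vectorizer calls tokenize on each message)
# Gradient boosted decision trees often do well on many problems, let's start with that
pipeline = Pipeline([
    ('tf-idf', TfidfVectorizer(tokenizer=tokenize)),
    ('clf', MultiOutputClassifier(GradientBoostingClassifier()))
])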


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
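
The two steps above can be sketched as follows. This is only a minimal illustration: the toy messages and the two label columns stand in for `features` and `labels`, the split size and random seeds are assumptions, and the default `TfidfVectorizer` tokenization is used in place of the custom `tokenize` function so the snippet runs on its own.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins (assumption) for `features` and `labels` from step 1
toy_features = pd.Series([
    "we need water and food",
    "no food left in the shelter",
    "water supply is running out",
    "the road is blocked by debris",
    "please send drinking water and rice",
    "children are hungry, send food",
    "the well is dry, no water",
    "heavy rain expected tonight",
])
toy_labels = pd.DataFrame({
    "water": [1, 0, 1, 0, 1, 0, 1, 0],
    "food":  [1, 1, 0, 0, 1, 1, 0, 0],
})

# Hold out 25% of the messages for testing (size and seed are assumptions)
X_train, X_test, Y_train, Y_test = train_test_split(
    toy_features, toy_labels, test_size=0.25, random_state=42
)

# Same structure as the notebook pipeline, but with the default tokenizer
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(
        GradientBoostingClassifier(n_estimators=10, random_state=0))),
])

# Train on the training split, then predict every category for the held-out messages
toy_pipeline.fit(X_train, Y_train)
Y_pred = toy_pipeline.predict(X_test)
print(Y_pred.shape)  # (number of test messages, number of categories)
```

On the real data, `features` and `labels` would replace the toy objects, and the notebook's `pipeline` would replace `toy_pipeline`.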

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.
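
A rough sketch of that loop is shown below. The `Y_test` and `Y_pred` values here are made-up toy arrays used only to show the mechanics, not results from this dataset; `zero_division=0` is an assumption that silences warnings for categories with no predicted positives.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report

# Toy stand-ins (assumption): true labels as a DataFrame, predictions as an array
Y_test = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 1, 1, 0]})
Y_pred = np.array([[1, 0], [0, 1], [0, 1], [0, 0]])

# One report per output category, matched by column position
reports = {}
for i, column in enumerate(Y_test.columns):
    reports[column] = classification_report(
        Y_test[column], Y_pred[:, i], zero_division=0
    )
    print(f"Category: {column}")
    print(reports[column])
```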

### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
parameters = 

cv = 
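
One way the two placeholders above might be filled in is sketched here. The parameter grid, the step names `tfidf` and `clf`, and the toy data are all assumptions chosen so the example runs quickly; they are not tuned choices for this dataset. Parameters of the wrapped classifier are reached through `clf__estimator__...` because `MultiOutputClassifier` stores the base model in its `estimator` attribute.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Toy stand-ins (assumption) for `features` and `labels`
toy_features = pd.Series([
    "we need water and food",
    "no food left in the shelter",
    "water supply is running out",
    "the road is blocked by debris",
    "please send drinking water and rice",
    "children are hungry, send food",
    "the well is dry, no water",
    "heavy rain expected tonight",
])
toy_labels = pd.DataFrame({
    "water": [1, 0, 1, 0, 1, 0, 1, 0],
    "food":  [1, 1, 0, 0, 1, 1, 0, 0],
})

toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(GradientBoostingClassifier(random_state=0))),
])

# Hypothetical grid: step name, then double underscore, then parameter name
parameters = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__n_estimators": [5, 10],
}

# Only 2 folds because the toy data are tiny; more folds suit the real data
cv = GridSearchCV(toy_pipeline, param_grid=parameters, cv=2)
cv.fit(toy_features, toy_labels)
print(cv.best_params_)
```

After fitting, `cv.best_estimator_` holds the pipeline refit on all of the data passed to `fit` with the best parameter combination.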

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision, and recall to make your project stand out - especially for your portfolio!
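
The three metrics named in this step could be collected into one table per category, roughly as below. The label and prediction arrays are made-up toy values (an assumption for illustration), and `summary` is a hypothetical name.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy stand-ins (assumption) for the test labels and the tuned model's predictions
Y_test = pd.DataFrame({"water": [1, 0, 1, 0], "food": [0, 1, 1, 0]})
Y_pred = np.array([[1, 0], [0, 1], [0, 1], [0, 0]])

# Compute each metric column by column and gather the rows into a DataFrame
rows = []
for i, column in enumerate(Y_test.columns):
    y_true, y_hat = Y_test[column], Y_pred[:, i]
    rows.append({
        "category": column,
        "accuracy": accuracy_score(y_true, y_hat),
        "precision": precision_score(y_true, y_hat, zero_division=0),
        "recall": recall_score(y_true, y_hat, zero_division=0),
    })

summary = pd.DataFrame(rows).set_index("category")
print(summary)
```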

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file
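
A minimal sketch of the export, assuming a small stand-in model and a hypothetical file name `classifier.pkl` written to the system temp directory. The round trip at the end simply checks that the reloaded model predicts the same labels as the original.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Tiny stand-in model (assumption); the tuned model would be used in practice
messages = [
    "we need water and food",
    "no food left in the shelter",
    "water supply is running out",
    "the road is blocked by debris",
]
labels = [[1, 1], [0, 1], [1, 0], [0, 0]]
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(
        GradientBoostingClassifier(n_estimators=10, random_state=0))),
]).fit(messages, labels)

# Hypothetical output path
model_path = Path(tempfile.gettempdir()) / "classifier.pkl"

# Serialize the fitted pipeline
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the predictions match
with open(model_path, "rb") as f:
    restored = pickle.load(f)

print((restored.predict(messages) == model.predict(messages)).all())
```

Note that a pipeline built with a custom tokenizer function can only be unpickled where that function is importable under the same module path.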

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
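
Since the template file is not shown here, the following is only a guess at how the script's command-line interface might look. The argument names `database_filepath` and `model_filepath`, and the functions `parse_args` and `main`, are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) command-line arguments for train.py."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # The notebook steps would run here in order:
    # load data, build the pipeline, train, evaluate, then export the model.
    print(f"Would load data from {args.database_filepath}")
    print(f"Would save the model to {args.model_filepath}")
    return args


# Example call with explicit arguments; a real script would call main()
# under an `if __name__ == "__main__":` guard so it reads sys.argv instead
main(["../data/DisasterTweets.db", "classifier.pkl"])
```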