# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [8]:
!pip install --upgrade pip
!pip install -U spacy
!pip install -U typing_extensions
!python -m spacy download en_core_web_sm
!pip install -U numpy

Collecting en-core-web-sm==3.6.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.6.0/en_core_web_sm-3.6.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [9]:
# import libraries
import pandas as pd
from sqlalchemy import create_engine
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.porter import PorterStemmer

import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
import pickle


import spacy
from spacy import displacy
from collections import Counter


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


True

In [14]:
# load data from database
engine = create_engine('sqlite:///../data/message_data.db')
df = pd.read_sql_table('Messages', engine).iloc[:3000]


OperationalError: (sqlite3.OperationalError) unable to open database file

In [10]:
engine = create_engine('sqlite:///message_data.db')
df = pd.read_sql_table('InsertTableName', engine).iloc[:3000]
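
SQLite's "unable to open database file" message typically means that the directory in the path does not exist or is not accessible, and `'InsertTableName'` still has to be replaced by the real table name. A minimal standard-library sketch of how both could be checked before calling `read_sql_table`; the helper names `list_tables` and `make_demo_db` and the file `demo.db` are assumptions made for illustration:

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def list_tables(db_path):
    """Return the table names stored in an SQLite file (hypothetical helper)."""
    path = Path(db_path)
    if not path.is_file():
        # Fail with the resolved path instead of sqlite's generic message
        raise FileNotFoundError(f"No database file at {path.resolve()}")
    with closing(sqlite3.connect(str(path))) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def make_demo_db(directory):
    """Create a small throwaway database with one table named Messages."""
    path = Path(directory) / "demo.db"
    with closing(sqlite3.connect(str(path))) as conn:
        conn.execute("CREATE TABLE Messages (id INTEGER, message TEXT)")
        conn.execute("INSERT INTO Messages VALUES (1, 'Weather update')")
        conn.commit()
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo_path = make_demo_db(tmp)
    demo_tables = list_tables(demo_path)
    print(demo_tables)
```

Checking `is_file()` first also guards against SQLite silently creating an empty database when only the file, but not its directory, is missing.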


In [11]:
X = df['message']
Y = df.iloc[:, 5:7]



In [13]:
X.iloc[0]

'Weather update - a cold front from Cuba that could pass over Haiti'

### 2. Write a tokenization function to process your text data

In [43]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class CustomTokenizer(BaseEstimator, TransformerMixin):
    
    ''' This class is used to generate the tokens.
        The class takes a sentence and returns a list of words.
        We want to test different ideas:
        
        
        named entities: We want to test whether the models work better if we replace dates, locations and amounts with tags,
                                i.e. DATE, LOCATION, NUMBER.
                        We think that the mere presence of a number is relevant, independent of the exact number,
                        e.g. "Eva needs 5 litres of water in Zurich", "Albert needs 10 kilos of rice in Berlin".
                        -> "Person needs Number litres of water in Location"
                        -> "Person needs Number kilos of rice in Location"
                        For this we use spaCy.
                        
        stemming: We want to test if the simple stemming (PorterStemming) is useful or too simplistic.
        
    

        Usage:
        
        ct = CustomTokenizer(replace_named_entities=False, use_stemming=False)
        text = 'Weather update - a cold front from Cuba that could pass over Haiti'
        ct(text)
    
    
    '''
    
    def __init__(self, replace_named_entities=False, use_stemming=False):
        
        '''
        Args:
            replace_named_entities (bool): Whether to replace named entities, numbers, etc. by tags.
            use_stemming (bool): Whether to use the PorterStemming Algorithm
        
        '''
        
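        self.replace_named_entities = replace_named_entities
        self.use_stemming = use_stemming
        # Load the spaCy model lazily in __call__, so that enabling the flag later
        # via set_params (as GridSearchCV does) still finds a loaded model.
        self.nlp = None
    
    
    
    def __call__(self, text):
    
        if self.replace_named_entities:
            if self.nlp is None:
                self.nlp = spacy.load("en_core_web_sm")
            doc = self.nlp(text)
            for ele in doc.ents:
                text = text.replace(ele.text, ele.label_)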

        text = text.lower()
        text = re.sub("[^a-z0-9]", " ", text)
        words = word_tokenize(text) 
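        # Build the stop-word set once per call instead of once per token
        stop_words = set(stopwords.words("english"))
        words = [w for w in words if w not in stop_words]

        
        if self.use_stemming:
            # Reduce words to their stems
            stemmer = PorterStemmer()
            words = [stemmer.stem(w) for w in words]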

        return words

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.Series(X).apply(lambda x: self(x)).values
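
The idea from the docstring, that the presence of a number matters more than its value, can be sketched without spaCy. The block below is a regex-only stand-in, not the class above: it only tags digits, and the short stop-word list and the name `simple_tokens` are assumptions made for illustration.

```python
import re

# Tiny illustrative stop-word list (assumption); the notebook uses nltk's English list.
STOP_WORDS = {"a", "an", "the", "of", "in", "that", "from", "over", "could"}


def simple_tokens(text, replace_numbers=True):
    """Regex-only stand-in for the tokenizer steps; not the spaCy-based class."""
    if replace_numbers:
        # Stand-in for entity replacement: every run of digits becomes one tag
        text = re.sub(r"\d+(?:[.,]\d+)*", "NUMBER", text)
    text = text.lower()                      # the tag is lowercased too
    text = re.sub(r"[^a-z0-9]", " ", text)   # replace punctuation by spaces
    return [w for w in text.split() if w not in STOP_WORDS]


first = simple_tokens("Eva needs 5 litres of water in Zurich")
second = simple_tokens("Albert needs 10 kilos of rice in Berlin")
print(first)
print(second)
```

Both sentences now share the token `number`, whatever the original quantity was. Because lowercasing happens after the replacement, the tag reaches the vectorizer in lowercase; the same holds for the entity labels in `CustomTokenizer`.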


In [83]:

class SentenceMetaData(BaseEstimator, TransformerMixin):
    
    '''
    This class is used to add features apart from the TfidfTransformer.
    
    We have the following ideas, which we want to have parametrised, so that we can test them
    automatically using GridSearchCV:
    
    - pct_capital_letters: The percentage of capital letters in a message.
        The idea is that people may use capital letters to add importance to a message,
        e.g. "HELP, I NEED IMMEDIATE ATTENTION"
    - question/exclamation mark: we want to test whether the presence of "?" or "!" has implications
    - length: we want to test whether the length of the message has implications
    - pct_word_types: The percentage distribution of word types in the message.
            For this we use nltk.pos_tag. A sentence with a lot of verbs may be
            information/description rather than an emergency.
    
    
    Usage:
    
        stm = SentenceMetaData(use_pct_capital_letters=False, use_question_mark=True)
        text = 'Weather update - a cold front from Cuba that could pass over Haiti'
        stm(text)
    
    
    '''
    from nltk.tag.mapping import _UNIVERSAL_TAGS
    
    def __init__(self, use_pct_capital_letters=True,
                 use_question_mark=True,
                 use_exclamation_mark=True,
                 use_pct_word_types=True,
                 use_length=True):
        
        '''
        Args:
            use_pct_capital_letters (bool): Whether to count the percentage of capital letters
            use_question_mark (bool): Whether to check for the presence of ?
            use_exclamation_mark (bool): Whether to check for the presence of !
            use_pct_word_types (bool): Whether to add the distribution of the word types (verbs, nouns, ...)
            use_length (bool): Whether to add the length of the sentence.
        
        '''
        
        self.use_pct_capital_letters = use_pct_capital_letters
        self.use_question_mark = use_question_mark
        self.use_exclamation_mark = use_exclamation_mark
        self.use_pct_word_types = use_pct_word_types
        # The attribute must carry the same name as the __init__ argument,
        # otherwise scikit-learn's get_params/clone cannot find it.
        self.use_length = use_length
        
        # _UNIVERSAL_TAGS is imported in the class body, so it is reached via self.
        self.target_tags = list(self._UNIVERSAL_TAGS)
    
    
    
    def __call__(self, text):
        def pct_capital_letters(t):
            try:
                r = len(re.findall("[A-Z]", t))/len(t)
            except ZeroDivisionError:
                # Empty message: no characters at all
                r = 0
                
            return r

        def question_mark(t):
            return '?' in t

        def exclamation_mark(t):
            return '!' in t

        def pct_word_types(t):
            t = t.strip()
            if len(t) == 0:
                # Keep the same columns for empty messages
                return pd.Series(0.0, index=self.target_tags)
            text = word_tokenize(t)
            ser =  pd.DataFrame(nltk.pos_tag(text, tagset='universal'))[1].value_counts()
            return (ser/ser.sum()).reindex(self.target_tags).fillna(0)
        
        result = dict()
        if self.use_pct_capital_letters:
            result['pct_capital_letters'] = pct_capital_letters(text)

                
        if self.use_question_mark:
            result['question_mark'] = question_mark(text)

        if self.use_exclamation_mark:
            result['exclamation_mark'] = exclamation_mark(text)
            
        if self.use_length:
            result['length'] = len(text)
          
        if self.use_pct_word_types:
            result.update(pct_word_types(text))
            
        
        return pd.Series(result).astype(float)

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return pd.Series(X).apply(lambda x: self(x)).values
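
To make the feature definitions concrete, here is a minimal standard-library sketch of the same quantities. It is not the class above: the part-of-speech tags are passed in already computed rather than produced by `nltk.pos_tag`, and the names `surface_features`, `tag_shares` and `TARGET_TAGS` are assumptions made for illustration.

```python
from collections import Counter

# Tag list modelled on the universal tagset (assumption for this sketch)
TARGET_TAGS = ("VERB", "NOUN", "PRON", "ADJ", "ADV", "ADP",
               "CONJ", "DET", "NUM", "PRT", "X", ".")


def surface_features(text):
    """Character-level features of one message."""
    n_chars = len(text)
    n_capitals = sum(1 for ch in text if "A" <= ch <= "Z")
    return {
        "pct_capital_letters": n_capitals / n_chars if n_chars else 0.0,
        "question_mark": float("?" in text),
        "exclamation_mark": float("!" in text),
        "length": float(n_chars),
    }


def tag_shares(tags, target_tags=TARGET_TAGS):
    """Share of each target tag among already-tagged tokens, zero when absent."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (counts[tag] / total if total else 0.0) for tag in target_tags}


features = surface_features("HELP, I NEED IMMEDIATE ATTENTION")
shares = tag_shares(["NOUN", "VERB", "NOUN", "DET"])
print(features)
print(shares["NOUN"], shares["ADJ"])
```

Returning a value for every target tag, including zeros, keeps the number of columns identical for every message, which the downstream classifier relies on.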

### 3. Build a machine learning pipeline
This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
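
The multi-output strategy amounts to fitting one independent classifier per target column. A toy sketch of that idea in plain Python; `MajorityClassifier` and `PerColumnClassifier` are hypothetical stand-ins written for this illustration, not scikit-learn classes:

```python
from collections import Counter


class MajorityClassifier:
    """Toy estimator: always predicts the most frequent training label."""

    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class PerColumnClassifier:
    """Hypothetical stand-in for the one-estimator-per-target idea."""

    def __init__(self, make_estimator):
        self.make_estimator = make_estimator

    def fit(self, X, Y):
        columns = list(zip(*Y))  # one tuple of labels per target column
        self.estimators_ = [
            self.make_estimator().fit(X, list(col)) for col in columns
        ]
        return self

    def predict(self, X):
        per_column = [est.predict(X) for est in self.estimators_]
        return [list(row) for row in zip(*per_column)]


messages = ["need water", "need food", "road blocked"]
labels = [[1, 0], [1, 0], [0, 0]]  # two target columns per message
model = PerColumnClassifier(MajorityClassifier).fit(messages, labels)
predictions = model.predict(["need shelter", "bridge down"])
print(predictions)
```

Since each column gets its own estimator, the targets are learned independently and correlations between categories are not exploited.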

In [86]:
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline, FeatureUnion

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

In [91]:
tok = CustomTokenizer()
cv = CountVectorizer(tokenizer=tok)
tf = TfidfTransformer()
cl = MultiOutputClassifier(RandomForestClassifier(), n_jobs=8)

pipeline = Pipeline([
    
        ('features', FeatureUnion([

        ('nlp_pipeline', Pipeline([
            ('count', cv),
            ('tfid', tf)
            ])),

        ('word_type_counter', SentenceMetaData())
        ])),

    
    ('classifier', cl)
])
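
`FeatureUnion` runs each branch on the same input and concatenates the resulting feature columns side by side. A small plain-Python sketch of that behaviour; the two extractor functions and their two-word vocabulary are hypothetical stand-ins for the TF-IDF branch and the metadata branch:

```python
import re


def union_features(texts, extractors):
    """Apply every extractor to each text and concatenate the results per row."""
    rows = []
    for text in texts:
        row = []
        for extract in extractors:
            row.extend(extract(text))
        rows.append(row)
    return rows


def word_count_features(text):
    # Stand-in for the count/TF-IDF branch: counts of two vocabulary words
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count("water"), words.count("help")]


def punctuation_features(text):
    # Stand-in for the metadata branch
    return [float("?" in text), float("!" in text)]


matrix = union_features(
    ["We need water!", "Is help coming? Help"],
    [word_count_features, punctuation_features],
)
print(matrix)
```

Each row ends up with the columns of the first branch followed by those of the second, so the classifier sees one combined feature vector per message.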

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [88]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)
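
Called without `random_state`, `train_test_split` draws a different split on every run, which makes the grid-search comparisons below harder to reproduce; by default it holds out 25% of the rows. A minimal sketch of a seeded split in plain Python (the function name and the seed value are assumptions; this is not scikit-learn's implementation):

```python
import math
import random


def split_train_test(items, test_size=0.25, seed=42):
    """Shuffle indices with a fixed seed and cut off the test share (sketch)."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]


data = [f"message {i}" for i in range(100)]
train_a, test_a = split_train_test(data, seed=0)
train_b, test_b = split_train_test(data, seed=0)
print(len(train_a), len(test_a))
```

With scikit-learn itself, passing a fixed `random_state` to `train_test_split` has the same effect of making the split repeatable.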

In [23]:
# ideas, date, places, numbers, urls

In [1]:
parameters = {
    'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], #False
    'features__nlp_pipeline__count__tokenizer__use_stemming': [True], # False
    'classifier__estimator__n_estimators': [50], #100 150,
    'features__word_type_counter__use_question_mark': [False], # False
    'features__word_type_counter__use_pct_word_types': [False],  # False
    'features__word_type_counter__use_pct_capital_letters': [True],  # False
    'features__word_type_counter__use_exclamation_mark': [True],  # False
    
    
}

clf = GridSearchCV(pipeline, parameters)

NameError: name 'GridSearchCV' is not defined

We tested whether:

Tokenization:
- Replacing named entities (locations, numbers, person names) with tags is useful -> Answer: Yes
- Porter stemming improves the prediction -> Answer: Yes

Metadata of the sentence:
- Test for the presence of ? -> Answer: No
- Test for the presence of ! -> Answer: Yes
- Look at the distribution of word types (nouns, verbs, ...) -> Answer: No
- Look at the share of capital letters (I NEED HELP) -> Answer: Yes

Model:
- Number of estimators (trees) in the random forest: 200, 100 or 50 -> Answer: 50 works better
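
These answers come from varying one option at a time while the others stay fixed. That is far cheaper than a full grid, but it cannot detect interactions between options. The sketch below counts the combinations in both cases; the shortened parameter names and the helper `expand_grid` are assumptions made for readability, not the pipeline's real parameter keys:

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of the listed values (sketch)."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Shortened, hypothetical parameter names
full_grid = {
    "replace_named_entities": [True, False],
    "use_stemming": [True, False],
    "use_question_mark": [True, False],
    "use_exclamation_mark": [True, False],
    "use_pct_word_types": [True, False],
    "use_pct_capital_letters": [True, False],
    "n_estimators": [50, 100, 200],
}

# One-factor-at-a-time: fix everything to its first value, vary a single option
one_factor_grid = {name: values[:1] for name, values in full_grid.items()}
one_factor_grid["use_question_mark"] = [True, False]

n_full = sum(1 for _ in expand_grid(full_grid))
n_one_factor = sum(1 for _ in expand_grid(one_factor_grid))
print(n_full, n_one_factor)
```

In a cross-validated search every combination is fitted once per fold, so the gap in run time grows accordingly.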

In [106]:
clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp_pipeline', Pipeline(memory=None,
     steps=[('count', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=8))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], 'features__nlp_pipeline__count__tokenizer__use_stemming': [True], 'classifier__estimator__n_estimators': [50], 'features__word_type_counter__use_question_mark': [True, False], 'features__word_type_counter__use_pct_word_types': [False], 'features__word_type_counter__use_pct_capital_letters': [False], 'features__word_type_counter__use_exclamation_mark': [False]},
    

In [107]:
clf.best_params_

{'classifier__estimator__n_estimators': 50,
 'features__nlp_pipeline__count__tokenizer__replace_named_entities': True,
 'features__nlp_pipeline__count__tokenizer__use_stemming': True,
 'features__word_type_counter__use_exclamation_mark': False,
 'features__word_type_counter__use_pct_capital_letters': False,
 'features__word_type_counter__use_pct_word_types': False,
 'features__word_type_counter__use_question_mark': False}

In [108]:
parameters = {
    'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], #False
    'features__nlp_pipeline__count__tokenizer__use_stemming': [True], # False
    'classifier__estimator__n_estimators': [50], #100,
    'features__word_type_counter__use_question_mark': [False],
    'features__word_type_counter__use_pct_word_types': [True, False],
    'features__word_type_counter__use_pct_capital_letters': [False],
    'features__word_type_counter__use_exclamation_mark': [False],
    
    
}

clf = GridSearchCV(pipeline, parameters)

In [109]:
clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp_pipeline', Pipeline(memory=None,
     steps=[('count', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=8))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], 'features__nlp_pipeline__count__tokenizer__use_stemming': [True], 'classifier__estimator__n_estimators': [50], 'features__word_type_counter__use_question_mark': [False], 'features__word_type_counter__use_pct_word_types': [True, False], 'features__word_type_counter__use_pct_capital_letters': [False], 'features__word_type_counter__use_exclamation_mark': [False]},
    

In [110]:
clf.best_params_

{'classifier__estimator__n_estimators': 50,
 'features__nlp_pipeline__count__tokenizer__replace_named_entities': True,
 'features__nlp_pipeline__count__tokenizer__use_stemming': True,
 'features__word_type_counter__use_exclamation_mark': False,
 'features__word_type_counter__use_pct_capital_letters': False,
 'features__word_type_counter__use_pct_word_types': False,
 'features__word_type_counter__use_question_mark': False}

In [111]:
parameters = {
    'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], #False
    'features__nlp_pipeline__count__tokenizer__use_stemming': [True], # False
    'classifier__estimator__n_estimators': [50], #100,
    'features__word_type_counter__use_question_mark': [False],
    'features__word_type_counter__use_pct_word_types': [False],
    'features__word_type_counter__use_pct_capital_letters': [True, False],
    'features__word_type_counter__use_exclamation_mark': [False],
    
    
}

clf = GridSearchCV(pipeline, parameters)

In [112]:
clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp_pipeline', Pipeline(memory=None,
     steps=[('count', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=8))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], 'features__nlp_pipeline__count__tokenizer__use_stemming': [True], 'classifier__estimator__n_estimators': [50], 'features__word_type_counter__use_question_mark': [False], 'features__word_type_counter__use_pct_word_types': [False], 'features__word_type_counter__use_pct_capital_letters': [True, False], 'features__word_type_counter__use_exclamation_mark': [False]},
    

In [113]:
clf.best_params_

{'classifier__estimator__n_estimators': 50,
 'features__nlp_pipeline__count__tokenizer__replace_named_entities': True,
 'features__nlp_pipeline__count__tokenizer__use_stemming': True,
 'features__word_type_counter__use_exclamation_mark': False,
 'features__word_type_counter__use_pct_capital_letters': True,
 'features__word_type_counter__use_pct_word_types': False,
 'features__word_type_counter__use_question_mark': False}

In [114]:
parameters = {
    'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], #False
    'features__nlp_pipeline__count__tokenizer__use_stemming': [True], # False
    'classifier__estimator__n_estimators': [50], #100,
    'features__word_type_counter__use_question_mark': [False],
    'features__word_type_counter__use_pct_word_types': [False],
    'features__word_type_counter__use_pct_capital_letters': [False],
    'features__word_type_counter__use_exclamation_mark': [False, True],
    
    
}

clf = GridSearchCV(pipeline, parameters)

In [115]:
clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('nlp_pipeline', Pipeline(memory=None,
     steps=[('count', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=8))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'features__nlp_pipeline__count__tokenizer__replace_named_entities': [True], 'features__nlp_pipeline__count__tokenizer__use_stemming': [True], 'classifier__estimator__n_estimators': [50], 'features__word_type_counter__use_question_mark': [False], 'features__word_type_counter__use_pct_word_types': [False], 'features__word_type_counter__use_pct_capital_letters': [False], 'features__word_type_counter__use_exclamation_mark': [False, True]},
    

In [116]:
clf.best_params_

{'classifier__estimator__n_estimators': 50,
 'features__nlp_pipeline__count__tokenizer__replace_named_entities': True,
 'features__nlp_pipeline__count__tokenizer__use_stemming': True,
 'features__word_type_counter__use_exclamation_mark': True,
 'features__word_type_counter__use_pct_capital_letters': False,
 'features__word_type_counter__use_pct_word_types': False,
 'features__word_type_counter__use_question_mark': False}

In [13]:
model = pipeline.fit(X_train, y_train)

In [32]:
y_pred = pd.DataFrame(clf.predict(X_test), columns=y_test.columns)

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [33]:
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score, precision_score, recall_score

In [34]:
# Collect precision, recall and F1 for every target column
classification = dict()
for c in y_pred.columns:
    comp = (y_test[c], y_pred[c])
    classification[(c, 'precision')] = precision_score(*comp)
    classification[(c, 'recall')] = recall_score(*comp)
    classification[(c, 'f1')] = f1_score(*comp)
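
For reference, the three scores can be computed by hand from the counts of true positives, false positives and false negatives. A minimal sketch for binary labels, assuming 1 marks the positive class; returning 0.0 when a denominator is zero is a choice made here:

```python
def harmonic_mean(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def binary_scores(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": harmonic_mean(precision, recall),
    }


scores = binary_scores([1, 1, 1, 0, 0], [1, 1, 0, 1, 0])
print(scores)
```

A high precision combined with a low recall means the model rarely raises a false alarm for a category but misses many messages that belong to it.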


In [35]:
classification = pd.Series(classification).unstack()

In [36]:
classification

| | f1 | precision | recall |
|---|---|---|---|
| aid_related | 0.761394 | 0.809117 | 0.718987 |
| buildings | 0.392857 | 0.846154 | 0.255814 |


In [None]:
# Use a context manager so the file is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)


### 6. Improve your model
Use grid search to find better parameters. 

In [None]:
parameters = 

cv = 

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and pipelines, there is no minimum performance metric needed to pass. However, make sure to fine-tune your models for accuracy, precision and recall to make your project stand out, especially for your portfolio!

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

In [None]:
# Use a context manager so the file is closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(clf, f)
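
A quick way to check an export is to load it back and compare. The sketch below does a round trip with a placeholder dictionary instead of the fitted model; `save_model`, `load_model` and the temporary path are assumptions made for illustration:

```python
import pickle
import tempfile
from pathlib import Path


def save_model(model, path):
    """Write the object to disk and close the file afterwards."""
    with open(path, "wb") as f:
        pickle.dump(model, f)


def load_model(path):
    """Read the object back; only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        return pickle.load(f)


stand_in_model = {"n_estimators": 50, "use_stemming": True}  # placeholder object
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "model.pkl"
    save_model(stand_in_model, model_path)
    restored = load_model(model_path)
print(restored == stand_in_model)
```

When the pickled object contains custom classes such as `CustomTokenizer` and `SentenceMetaData`, those classes must be importable under the same module path in the process that loads the file.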


### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
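
The template itself is not shown here, so the following is only a hypothetical skeleton of how such a script could accept the dataset and output paths from the user. The argument names `database_filepath` and `model_filepath` are assumptions:

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths a training script would need (hypothetical interface)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # A full script would load the data, build the pipeline, fit it and pickle it here.
    print(f"Would train on {args.database_filepath} and save to {args.model_filepath}")
    return args


demo_args = main(["message_data.db", "model.pkl"])
```

Passing the argument list explicitly, as in the last line, makes the entry point easy to exercise from a notebook or a test; a command-line run would call `main()` without arguments.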