# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y
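
As a minimal sketch of the loading step listed above, the snippet below writes a toy table to a temporary SQLite file and reads it back with `read_sql_table`. The table name `Messages` matches this notebook; the column layout (four leading columns followed by category columns) and the sample rows are assumptions made for illustration only.

```python
import os
import tempfile

import pandas as pd
from sqlalchemy import create_engine

# Toy stand-in for the real database (assumed layout: 4 leading columns, then labels)
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "message": ["we need water", "road is blocked", "thank you"],
    "original": ["", "", ""],
    "genre": ["direct", "news", "social"],
    "related": [1, 1, 0],
    "request": [1, 0, 0],
})

with tempfile.TemporaryDirectory() as tmp:
    engine = create_engine("sqlite:///" + os.path.join(tmp, "toy.db"))
    toy.to_sql("Messages", engine, index=False)
    df = pd.read_sql_table("Messages", engine)
    engine.dispose()  # release the file handle before the directory is removed

X = df[["message", "genre"]]   # feature columns
Y = df[df.columns[4:]]         # every column after the first four is a target
print(list(Y.columns))
```

Selecting targets by position keeps the code short, but it silently depends on the column order of the table, so selecting by name may be safer if the schema can change.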

In [1]:
# import libraries


import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline 

from sqlalchemy import create_engine,MetaData, Table, select

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import train_test_split,GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier
from sklearn.neighbors import KNeighborsClassifier,RadiusNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression,LogisticRegressionCV


stop_words = set(stopwords.words('english'))
from sklearn.metrics import classification_report
from sklearn.multioutput import MultiOutputClassifier
from sklearn.base import BaseEstimator, TransformerMixin

!pip install langid
import langid
from bs4 import BeautifulSoup

from nltk.tokenize import WordPunctTokenizer
tok = WordPunctTokenizer()
import pickle

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.

In [2]:
# load data from database
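def load_data(database_filepath):
    """Load the Messages table and split it into features X and targets Y."""
    engine = create_engine('sqlite:///' + database_filepath)
    # Close the connection once the table has been read
    with engine.connect() as conn:
        df = pd.read_sql('SELECT * FROM Messages', con=conn)
    # Features: the message text and its genre
    X = df[['message', 'genre']].copy()
    # Targets: every column from the fifth onward, minus child_alone
    Y = df[df.columns[4:]].copy()
    Y = Y.drop(['child_alone'], axis=1)
    return X, Y, Y.columns.values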



### 2. Write a tokenization function to process your text data

In [3]:
lemmatizer = WordNetLemmatizer()
pat1 = r'@[A-Za-z0-9_]+'
pat2 = r'https?://[^ ]+'
combined_pat = r'|'.join((pat1, pat2))
www_pat = r'www\.[^ ]+'
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

def tokenize(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
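    try:
        # Only has an effect on Python 2; a Python 3 str has no decode(),
        # so the text is used as-is
        bom_removed = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except (AttributeError, UnicodeError):
        bom_removed = souped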
    stripped = re.sub(combined_pat, '', bom_removed)
    stripped = re.sub(www_pat, '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    # Replacing non-letters above leaves runs of extra whitespace;
    # tokenizing drops them, and single-character tokens are discarded
    words = [x for x in tok.tokenize(letters_only) if len(x) > 1]
    # lemmatize and remove stop words
    tokens = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    return tokens
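
The cleaning steps in `tokenize` can be tried in isolation. The sketch below is a simplified stand-in, not the function above: it uses only `re`, a two-entry negation table, and a plain whitespace split instead of NLTK, and the helper name `normalize` is hypothetical.

```python
import re

# Toy subset of the negation table used above (assumption: two entries suffice here)
negations = {"isn't": "is not", "can't": "can not"}
neg_pattern = re.compile(r"\b(" + "|".join(negations.keys()) + r")\b")
# Mentions, http(s) links and www links, combined into one pattern
noise_pattern = re.compile(r"@[A-Za-z0-9_]+|https?://[^ ]+|www\.[^ ]+")

def normalize(text):
    """Strip mentions and URLs, lower-case, expand negations, keep letters only."""
    text = noise_pattern.sub("", text)
    text = text.lower()
    text = neg_pattern.sub(lambda m: negations[m.group()], text)
    text = re.sub(r"[^a-z]", " ", text)
    # Splitting on whitespace drops the padding introduced by the substitution
    # above; single-character tokens are discarded
    return [w for w in text.split() if len(w) > 1]

tokens = normalize("@user Water isn't safe, see http://example.com now")
print(tokens)  # ['water', 'is', 'not', 'safe', 'see', 'now']
```

Negations are expanded before punctuation is stripped; otherwise "isn't" would be split into "isn" and a stray "t".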


### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.
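
To illustrate what `MultiOutputClassifier` does, here is a minimal sketch on toy numeric data. The data and the choice of `LogisticRegression` as base estimator are assumptions for the example, not what this notebook trains.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Toy data (assumption): two numeric features, two binary target columns
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]] * 5)
Y = np.column_stack([X[:, 0] > 0.5, X[:, 1] > 0.5]).astype(int)

# One independent copy of the base estimator is fitted per target column
clf = MultiOutputClassifier(LogisticRegression())
clf.fit(X, Y)

pred = clf.predict(X)
print(pred.shape)            # one prediction column per target
print(len(clf.estimators_))  # number of fitted per-target estimators
```

Because each target gets its own estimator, correlations between categories are not modelled; that is a limitation of this wrapper rather than of the base classifier.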

In [4]:
class TextSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on text columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Double brackets keep the selection two-dimensional, as estimators expect
        return X[[self.key]]

class ColumnExtractor(BaseEstimator, TransformerMixin):

    def __init__(self, cols):
        self.cols = cols

    def fit(self, X, y=None):
        # stateless transformer
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xcols = X[self.cols]
        return Xcols
    
from sklearn.feature_extraction import DictVectorizer

class DummyTransformer(BaseEstimator, TransformerMixin):

    def __init__(self):
        self.dv = None

    def fit(self, X, y=None):
        # assumes all columns of X are strings
        Xdict = X.to_dict('records')
        self.dv = DictVectorizer(sparse=False)
        self.dv.fit(Xdict)
        return self

    def transform(self, X):
        # assumes X is a DataFrame
        Xdict = X.to_dict('records')
        Xt = self.dv.transform(Xdict)
        cols = self.dv.get_feature_names()
        Xdum = pd.DataFrame(Xt, index=X.index, columns=cols)
        # drop column indicating NaNs
        nan_cols = [c for c in cols if '=' not in c]
        Xdum = Xdum.drop(nan_cols, axis=1)
        return Xdum
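
The one-hot encoding that `DummyTransformer` delegates to `DictVectorizer` can be seen on a few records. This is a small sketch with made-up genre values; it reads the fitted `feature_names_` attribute rather than calling a feature-name method, since the method name differs between scikit-learn versions.

```python
from sklearn.feature_extraction import DictVectorizer

# Records in the shape produced by DataFrame.to_dict('records') (toy values)
records = [{"genre": "direct"}, {"genre": "news"}, {"genre": "direct"}]

dv = DictVectorizer(sparse=False)
encoded = dv.fit_transform(records)

# String values become "column=value" indicator features, sorted by name
print(dv.feature_names_)
print(encoded)
```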


In [5]:
def evaluate_model(model, X_test, Y_test, category_names):
    y_pred = model.predict(X_test)
    print(classification_report(Y_test, y_pred, target_names=category_names))


In [6]:
def save_model(model, model_filepath):
    # The context manager closes the file even if pickling fails
    with open(model_filepath, "wb") as f:
        pickle.dump(model, f)

In [7]:
def build_model():

    pipeline = Pipeline([
        ('features',FeatureUnion([
            
            ('message', Pipeline([
                    ('selector', TextSelector(key='message')),
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer())
                ]))
            
                         ])),
        
        ('clf', MultiOutputClassifier(LinearSVC(multi_class="crammer_singer"), n_jobs=1))
    ])
    
    return pipeline

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline
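
The split step can be sketched on toy frames as follows. Passing `random_state` is an addition for reproducibility and is not used in the cell below; the value 42 is arbitrary.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and targets (assumption): ten rows, one column each
X = pd.DataFrame({"message": ["msg %d" % i for i in range(10)]})
Y = pd.DataFrame({"related": [i % 2 for i in range(10)]})

# X and Y are shuffled together, so rows stay aligned across the split
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))
```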

In [8]:
database_filepath = 'DisasterResponse.db'
print('Loading data...\n    DATABASE: {}'.format(database_filepath))
X, Y, category_names = load_data(database_filepath)
print(X.shape[0])
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2)

print('Building model...')
model = build_model()

print('Training model...')
model.fit(X_train, Y_train)

Loading data...
    DATABASE: DisasterResponse.db
25825
Building model...
Training model...


Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('message', Pipeline(memory=None,
     steps=[('selector', TextSelector(key='message')), ('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='co...rammer_singer', penalty='l2', random_state=None,
     tol=0.0001, verbose=0),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 
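
For per-category numbers as arrays rather than a printed report, `precision_recall_fscore_support` accepts the same multi-label input. The sketch below uses made-up labels; `zero_division=0` assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support

# Toy multi-label ground truth and predictions: rows are messages, columns are categories
Y_true = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
Y_pred = np.array([[1, 0], [1, 0], [0, 1], [0, 0]])

# average=None returns one value per category instead of a single aggregate
precision, recall, f1, support = precision_recall_fscore_support(
    Y_true, Y_pred, average=None, zero_division=0
)
print(precision, recall, support)
```

Calling the same function on the training set predictions gives the second half of the comparison this step asks for.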

In [9]:
print('Evaluating model...')
evaluate_model(model, X_test, Y_test, category_names)

#print('Saving model...\n    MODEL: {}'.format(model_filepath))
#save_model(model, model_filepath)

#print('Trained model saved!')



Evaluating model...
                        precision    recall  f1-score   support

               related       0.86      0.91      0.89      3917
               request       0.75      0.54      0.63       888
                 offer       0.00      0.00      0.00        27
           aid_related       0.72      0.70      0.71      2132
          medical_help       0.58      0.26      0.36       445
      medical_products       0.64      0.31      0.42       265
     search_and_rescue       0.60      0.16      0.25       160
              security       0.50      0.01      0.02       104
              military       0.60      0.37      0.46       171
                 water       0.77      0.70      0.73       310
                  food       0.82      0.75      0.78       574
               shelter       0.75      0.58      0.65       445
              clothing       0.70      0.59      0.64        66
                 money       0.45      0.20      0.28       116
        missing_peo



### 6. Improve your model
Use grid search to find better parameters. 
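
Parameter names in the grid follow the nesting of the pipeline: step name, then `estimator` for the classifier wrapped by `MultiOutputClassifier`, then the parameter itself, joined by double underscores. A minimal sketch on toy numeric data (the data, grid values, and `cv=2` are assumptions for the example):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy data: each half of the rows contains both classes for both targets
Y = np.array([[0, 0], [1, 0], [0, 1], [1, 1]] * 2)
X = Y.astype(float)

pipeline = Pipeline([
    ("clf", MultiOutputClassifier(LinearSVC(random_state=0))),
])

# "clf" -> pipeline step, "estimator" -> wrapped LinearSVC, "C" -> its parameter
param_grid = {"clf__estimator__C": [0.1, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, Y)
print(search.best_params_)
```

`pipeline.get_params().keys()` lists every valid name, which helps when a grid key is rejected as an invalid parameter.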

In [10]:
def build_model():

    pipeline = Pipeline([
        ('features',FeatureUnion([
            
            ('message', Pipeline([
                    ('selector', TextSelector(key='message')),
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer())
                ]))
                         ])),
        
        ('clf', MultiOutputClassifier(LinearSVC(multi_class="crammer_singer"), n_jobs=1))
    ])
    
    parameters = {
        'clf__estimator__C': [1, 1.2, 1.4],
        'clf__estimator__max_iter': [1000, 1200, 1500],
    } 

    cv = GridSearchCV(pipeline, parameters)
    
    return cv

In [11]:
print('Building model...')
model = build_model()

print('Training model...')
model.fit(X_train, Y_train)



Building model...
Training model...


GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('message', Pipeline(memory=None,
     steps=[('selector', TextSelector(key='message')), ('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='co...rammer_singer', penalty='l2', random_state=None,
     tol=0.0001, verbose=0),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'clf__estimator__C': [1, 1.2, 1.4], 'clf__estimator__max_iter': [1000, 1200, 1500]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [12]:
model.best_params_

{'clf__estimator__C': 1, 'clf__estimator__max_iter': 1000}

## Observation
The grid search selected `C=1` and `max_iter=1000`, which are also the `LinearSVC` defaults, so tuning over this grid did not improve on the untuned model.
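
One way to check this observation is to compare the selected values with the estimator's defaults, as in this short sketch:

```python
from sklearn.svm import LinearSVC

# get_params() on a freshly constructed estimator returns its default settings
defaults = LinearSVC().get_params()
print(defaults["C"], defaults["max_iter"])
```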

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [13]:
print('Evaluating model...')
evaluate_model(model, X_test, Y_test, category_names)


Evaluating model...
                        precision    recall  f1-score   support

               related       0.86      0.91      0.89      3917
               request       0.75      0.54      0.63       888
                 offer       0.00      0.00      0.00        27
           aid_related       0.72      0.70      0.71      2132
          medical_help       0.58      0.26      0.36       445
      medical_products       0.64      0.31      0.42       265
     search_and_rescue       0.60      0.16      0.25       160
              security       0.50      0.01      0.02       104
              military       0.60      0.37      0.46       171
                 water       0.77      0.70      0.73       310
                  food       0.82      0.75      0.78       574
               shelter       0.75      0.58      0.65       445
              clothing       0.70      0.59      0.64        66
                 money       0.45      0.20      0.28       116
        missing_peo



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
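
As an illustration of the second idea, the sketch below adds one extra column next to the TF-IDF features with `FeatureUnion`. The `TextLength` transformer and the sample messages are hypothetical and are not part of the model built in this notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class TextLength(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: character count of each message."""

    def fit(self, X, y=None):
        # Stateless: nothing to learn
        return self

    def transform(self, X):
        # One row per message, one column holding its length
        return np.array([[len(text)] for text in X], dtype=float)

docs = ["need water", "need food and water", "thanks"]

# The outputs of both transformers are stacked side by side
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", TextLength()),
])
Xt = features.fit_transform(docs)
print(Xt.shape)  # five vocabulary terms plus the length column
```

Raw lengths are on a much larger scale than TF-IDF weights, so scaling the extra column may be worth trying before feeding it to a linear model.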

In [14]:
def build_model():

    pipeline = Pipeline([
        ('features',FeatureUnion([
            
            ('message', Pipeline([
                    ('selector', TextSelector(key='message')),
                    ('vect', CountVectorizer(tokenizer=tokenize)),
                    ('tfidf', TfidfTransformer())
                ])),

            ('genre', Pipeline([
                    ('extract', ColumnExtractor(['genre'])),
                    ('dummy', DummyTransformer())
            ]))
                         ])),
        
        ('clf', MultiOutputClassifier(LinearSVC(multi_class="crammer_singer"), n_jobs=1))
    ])
    
    return pipeline

In [15]:
print('Building model...')
model = build_model()

print('Training model...')
model.fit(X_train, Y_train)

print('Evaluating model...')
evaluate_model(model, X_test, Y_test, category_names)



Building model...
Training model...
Evaluating model...
                        precision    recall  f1-score   support

               related       0.86      0.91      0.89      3917
               request       0.76      0.56      0.65       888
                 offer       0.00      0.00      0.00        27
           aid_related       0.72      0.70      0.71      2132
          medical_help       0.58      0.26      0.36       445
      medical_products       0.64      0.31      0.42       265
     search_and_rescue       0.60      0.16      0.25       160
              security       0.50      0.01      0.02       104
              military       0.61      0.36      0.45       171
                 water       0.76      0.71      0.73       310
                  food       0.82      0.74      0.78       574
               shelter       0.75      0.58      0.65       445
              clothing       0.71      0.59      0.64        66
                 money       0.44      0.20    



## Results with different classifiers


|Classifier|Precision|Recall|F1-score|
|---|---|---|---|
| LinearSVC(multi_class="crammer_singer") |0.74|0.62|0.65|
| LogisticRegressionCV(multi_class="multinomial") |0.77|0.34|0.38|
| DecisionTreeClassifier() |0.62|0.60|0.61|
| ExtraTreeClassifier(max_depth=3) |0.55|0.24|0.22|
| KNeighborsClassifier(n_neighbors=3) |0.57|0.49|0.52|
| RandomForestClassifier(n_estimators=100, max_depth=3, random_state=0) |0.45|0.24|0.22|

### 9. Export your model as a pickle file

In [16]:
model_filepath = "classifier.p"
print('Saving model...\n    MODEL: {}'.format(model_filepath))
save_model(model, model_filepath)

print('Trained model saved!')

Saving model...
    MODEL: classifier.p
Trained model saved!
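
A save-and-reload round trip can be sketched as follows. A plain dict stands in for the fitted pipeline to keep the example light, and the temporary directory is an assumption of the sketch.

```python
import os
import pickle
import tempfile

# A plain dict stands in for the fitted pipeline (assumption, to keep the sketch light)
model = {"C": 1.0, "max_iter": 1000}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "classifier.p")
    # Write the object in binary mode
    with open(path, "wb") as f:
        pickle.dump(model, f)
    # Read it back the same way
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == model)
```

Pickle stores functions and classes by reference, so loading the real pipeline requires `tokenize` and the custom transformer classes to be importable from the same module path they had when the model was saved.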


### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
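
The template itself is not shown here, so the following is only a sketch of how such a script might accept the paths from the user. The argument names mirror `database_filepath` and `model_filepath` from this notebook; `parse_args` is a hypothetical helper.

```python
import argparse

def parse_args(argv=None):
    """Parse the two file paths the training script needs."""
    parser = argparse.ArgumentParser(
        description="Train and export the message classifier."
    )
    parser.add_argument("database_filepath", help="path to the SQLite database")
    parser.add_argument("model_filepath", help="where to write the pickled model")
    # argv=None falls back to sys.argv[1:], which is what a real script would use
    return parser.parse_args(argv)

# Explicit arguments here so the sketch runs without a command line
args = parse_args(["DisasterResponse.db", "classifier.p"])
print(args.database_filepath, args.model_filepath)
```

In a script, the parsed paths would then be handed to `load_data`, `build_model`, `evaluate_model`, and `save_model` in that order.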