# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd
import numpy as np
import sqlite3
import re
import pickle
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, make_scorer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler

nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package punkt to
[nltk_data]     /Users/zhitao.wang/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/zhitao.wang/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [2]:
# load data from database
#engine = create_engine('sqlite:///InsertDatabaseName.db')
db_name = 'figure_eight.db' 
tbl_name = 'msg_cat'
print('Opening connection.')
conn = sqlite3.connect(db_name)
# get a cursor
cur = conn.cursor()
cmd = "SELECT * FROM " + tbl_name
print('Reading data from table "{}" in the database "{}".'.format(tbl_name, db_name))
df = pd.read_sql(cmd, con = conn)
df.set_index('id', inplace = True)
#df = pd.read_sql_table(tbl_name, con = conn)
conn.commit()
conn.close()
print('Connection is closed.')
feature_columns = ['message', 'original', 'genre']
X = df[['message']]
y = df.drop(labels = feature_columns, axis = 1)

Opening connection.
Reading data from table "msg_cat" in the database "figure_eight.db".
Connection is closed.


### 2. Write a tokenization function to process your text data

In [3]:
def tokenize(text):
    ''' Usage: normalize case and remove punctuation
        Input: text string
    ''' 
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # Remove stop words
    tokens = [w for w in tokens if w not in stopwords.words("english")]
    
    # Reduce words to their root form
    tokens = [WordNetLemmatizer().lemmatize(token).strip() for token in tokens]

    return tokens

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:
def show_metrics(y_test, y_pred, target_names, metrics = 'f1'):
    '''
        Usage: show metrics with options of accuracy, f1, precision, recall scores.
        Input: y_test, y_pred - actual label values, predicted labels
               target_names - a list of labels
               metrics - available options ['accuracy', 'f1', 'precision', 'recall']
                         other option: show f1, precision, recall all together.
    '''
    for idx_col in range(y_test.shape[1]):
        if metrics == 'accuracy':
            # accuracy score
            print("The accuracy score for column {}: {}" \
                  .format(target_names[idx_col], accuracy_score(y_test[:, idx_col], y_pred[:, idx_col])))
        elif metrics == 'f1':
            # f1 score
            print("The f1 score for column {}: {}" \
                  .format(target_names[idx_col], f1_score(y_test[:, idx_col], y_pred[:, idx_col])))
        elif metrics == 'precision':
            # precision
            print("The precision score for column {}: {}" \
                  .format(target_names[idx_col], precision_score(y_test[:, idx_col], y_pred[:, idx_col])))
        elif metrics == 'recall':
            # precision
            print("The recall score for column {}: {}" \
                  .format(target_names[idx_col], recall_score(y_test[:, idx_col], y_pred[:, idx_col])))
        else:
            print(classification_report(y_test, y_pred, target_names = target_names))
            break
            

In [5]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ]))
    ])),
    ('clf', MultiOutputClassifier(RandomForestClassifier()))
])

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X.message, y.values, test_size = 0.3, random_state = 42)
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [7]:
print('Predicting results.')
y_pred = pipeline.predict(X_test)
show_metrics(y_test, y_pred, y.columns, metrics = 'other')

Predicting results.
                        precision    recall  f1-score   support

               related       0.84      0.92      0.88      5941
               request       0.78      0.46      0.58      1333
                 offer       0.00      0.00      0.00        34
           aid_related       0.74      0.60      0.67      3286
          medical_help       0.57      0.09      0.16       644
      medical_products       0.71      0.07      0.12       414
     search_and_rescue       0.76      0.07      0.12       239
              security       0.25      0.01      0.01       156
              military       0.62      0.10      0.17       267
           child_alone       0.00      0.00      0.00         0
                 water       0.90      0.28      0.42       512
                  food       0.82      0.44      0.57       878
               shelter       0.80      0.33      0.46       714
              clothing       0.74      0.11      0.20       123
                 mo

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


### 6. Improve your model
Use grid search to find better parameters. 

In [9]:
parameters = {
    'features__text_pipeline__vect__max_df': [0.2, 1],
    'features__text_pipeline__vect__max_features': [3000, 5000],
    'clf__estimator__n_estimators': [50, 200],
    #'clf__estimator__max_features': ['sqrt'],
    'clf__estimator__n_jobs': [2],
}

best_clf = GridSearchCV(pipeline, param_grid=parameters, n_jobs = 2)
best_clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'features__text_pipeline__vect__max_df': [0.2, 1], 'features__text_pipeline__vect__max_features': [3000, 5000], 'clf__estimator__n_estimators': [50, 200], 'clf__estimator__n_jobs': [2]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [10]:
best_clf.best_params_

{'clf__estimator__n_estimators': 200,
 'clf__estimator__n_jobs': 2,
 'features__text_pipeline__vect__max_df': 0.2,
 'features__text_pipeline__vect__max_features': 5000}

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [11]:
y_pred = best_clf.predict(X_test)
show_metrics(y_test, y_pred, y.columns, metrics = 'other')

                        precision    recall  f1-score   support

               related       0.85      0.94      0.89      6072
               request       0.81      0.50      0.62      1356
                 offer       1.00      0.03      0.05        37
           aid_related       0.75      0.71      0.73      3312
          medical_help       0.61      0.13      0.22       616
      medical_products       0.85      0.18      0.30       369
     search_and_rescue       0.74      0.11      0.19       212
              security       0.00      0.00      0.00       144
              military       0.60      0.09      0.16       257
           child_alone       0.00      0.00      0.00         0
                 water       0.88      0.55      0.68       492
                  food       0.82      0.71      0.76       894
               shelter       0.79      0.50      0.61       713
              clothing       0.92      0.20      0.32       122
                 money       0.62      

  'precision', 'predicted', average, warn_for)
  'recall', 'true', average, warn_for)


### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

Many literatures suggest linear SVC model is a good choice when using the bag-of-word model. However, one need to check that each label column contain two classes, 0 or 1, to avoid error raised by the SVC algorithm. From ETL analysis, we already knew that **child_alone** only contains 0, which shall be removed prior to feeding into the model.

In [46]:
# drop child_alone category in the label set
y_ = y.drop(labels = 'child_alone', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X.message, y_.values, test_size = 0.3, random_state = 42)

In [65]:
pipeline = Pipeline([
    ('features', FeatureUnion([
        ('text_pipeline', Pipeline([
            ('vect', CountVectorizer(tokenizer=tokenize)),
            ('tfidf', TfidfTransformer())
        ]))
    ])),
    ('clf', MultiOutputClassifier(LinearSVC()))
])

In [70]:
parameters = {
    'features__text_pipeline__vect__max_df': [0.2, 1],
    'features__text_pipeline__vect__max_features': [None, 3000, 5000],
    'clf__estimator__C': [0.5, 1.0, 1.5],
    'clf__estimator__intercept_scaling': [0.5, 1, 1.5]
}

scorer = make_scorer(f1_score)
best_clf = GridSearchCV(pipeline, param_grid=parameters, n_jobs = 2)
best_clf.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('features', FeatureUnion(n_jobs=1,
       transformer_list=[('text_pipeline', Pipeline(memory=None,
     steps=[('vect', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_d...ti_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=2,
       param_grid={'features__text_pipeline__vect__max_df': [0.2, 1], 'features__text_pipeline__vect__max_features': [None, 3000, 5000], 'clf__estimator__C': [0.5, 1.0, 1.5], 'clf__estimator__intercept_scaling': [0.5, 1, 1.5]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [71]:
best_clf.best_params_

{'clf__estimator__C': 0.5,
 'clf__estimator__intercept_scaling': 1.5,
 'features__text_pipeline__vect__max_df': 0.2,
 'features__text_pipeline__vect__max_features': None}

In [72]:
y_pred = pipeline.predict(X_test)
show_metrics(y_test, y_pred, y.columns, metrics = 'other')

                        precision    recall  f1-score   support

               related       0.86      0.91      0.89      6072
               request       0.76      0.58      0.66      1356
                 offer       1.00      0.03      0.05        37
           aid_related       0.72      0.68      0.70      3312
          medical_help       0.59      0.27      0.37       616
      medical_products       0.70      0.32      0.44       369
     search_and_rescue       0.66      0.14      0.23       212
              security       0.00      0.00      0.00       144
              military       0.62      0.31      0.41       257
           child_alone       0.74      0.63      0.68       492
                 water       0.83      0.71      0.77       894
                  food       0.78      0.55      0.65       713
               shelter       0.77      0.40      0.53       122
              clothing       0.61      0.16      0.26       188
                 money       0.71      

  .format(len(labels), len(target_names))
  'precision', 'predicted', average, warn_for)


### Discussion

The key metrics is f1 scores, which has the following form:

\begin{align}
\frac{2\times precision \times recall} {precision + recall}
\end{align}

Overall, the linear SVM model is slightly better than random forest, with f1 score of 0.64 for linear SVM and 0.61 for random forest. Another advantage for linear SVM is that its time for training and making prediction is much faster than a random forest classifier with 200 tree estimators. This is essential for the use case of real-time application. Based on above reasons, linear SVM model is adopted in my script of machine learning pipeline.

### 9. Export your model as a pickle file

In [75]:
file = 'classifier.pkl'
pickle.dump(best_clf, open(file, 'wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.

In [11]:
import sys
import pandas as pd
import numpy as np
import sqlite3
import re
import pickle
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.multioutput import MultiOutputClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, make_scorer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler

def load_data(database_filepath):
    '''
    Usage: load data from database
    Args: file path to the database
    Return: X - dataframe of features
            y - dataframe of labels
    '''
    db_name = database_filepath
    tbl_name = 'msg_cat'
    print('Opening connection.')
    conn = sqlite3.connect(db_name)
    # get a cursor
    cur = conn.cursor()
    cmd = "SELECT * FROM " + tbl_name
    print('Reading data from table "{}" in the database "{}".'.format(tbl_name, db_name))
    df = pd.read_sql(cmd, con = conn)
    df.set_index('id', inplace = True)
    conn.commit()
    conn.close()
    print('Connection is closed.')
    feature_columns = ['message', 'original', 'genre', 'child_alone']
    X = df['message'].values
    y = df.drop(labels = feature_columns, axis = 1).values
    category_names = df.columns.values
    return X, y, category_names


def tokenize(text):
    ''' Usage: normalize case and remove punctuation
        Args: text string
        Return: text tokens
    ''' 
    text = re.sub(r"[^a-zA-Z]", " ", text.lower())
    
    # tokenize text
    tokens = word_tokenize(text)
    
    # Remove stop words
    tokens = [w for w in tokens if w not in stopwords.words("english")]
    
    # Reduce words to their root form
    tokens = [WordNetLemmatizer().lemmatize(token).strip() for token in tokens]

    return tokens


def build_model():
    """
    Usage: builds classification model 
    Args:
    Return: the optimized model
    """
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('text_pipeline', Pipeline([
                ('vect', CountVectorizer(tokenizer=tokenize)),
                ('tfidf', TfidfTransformer())
            ]))
        ])),
        ('clf', MultiOutputClassifier(LinearSVC()))
    ])
    
    parameters = {
        'features__text_pipeline__vect__max_df': [0.2],
        'features__text_pipeline__vect__max_features': [None],
        'clf__estimator__C': [0.5],
        'clf__estimator__intercept_scaling': [1.5]
    }

    cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs = 2, verbose = 2)
    return cv

def evaluate_model(model, X_test, y_test, target_names, metrics = None):
    '''
        Usage: show metrics with options of accuracy, f1, precision, recall scores.
        Args: y_test, y_pred - actual label values, predicted labels
               target_names - a list of labels
               metrics - available options ['accuracy', 'f1', 'precision', 'recall']
                         other option: show f1, precision, recall all together.
    '''
    
    y_pred = model.predict(X_test)
    
    for idx_col in range(y_test.shape[1]):
        if metrics == 'accuracy':
            # accuracy score
            print("The accuracy score for column {}: {}" \
                  .format(target_names[idx_col], accuracy_score(y_test[:, idx_col], y_pred[:, idx_col])))
        elif metrics == 'f1':
            # f1 score
            print("The f1 score for column {}: {}" \
                  .format(target_names[idx_col], f1_score(y_test[:, idx_col], y_pred[:, idx_col])))
        elif metrics == 'precision':
            # precision
            print("The precision score for column {}: {}" \
                  .format(target_names[idx_col], precision_score(y_test[:, idx_col], y_pred[:, idx_col])))
        elif metrics == 'recall':
            # precision
            print("The recall score for column {}: {}" \
                  .format(target_names[idx_col], recall_score(y_test[:, idx_col], y_pred[:, idx_col])))
        else:
            print(classification_report(y_test, y_pred, target_names = target_names))
            break


def save_model(model, model_filepath):
    """
    Usage: Save the model to a Python pickle
    Args:
        model: Trained model
        model_filepath: Path where to save the model
    """
    pickle.dump(model, open(model_filepath, 'wb'))


def main():
    if len(sys.argv) == 3:
        database_filepath, model_filepath = sys.argv[1:]
        print('Loading data...\n    DATABASE: {}'.format(database_filepath))
        X, Y, category_names = load_data(database_filepath)
        # drop child_alone category in the label set
        #Y.drop(labels = 'child_alone', axis = 1, inplace = True)
        X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 42)
        
        print('Building model...')
        model = build_model()
        
        print('Training model...')
        model.fit(X_train, Y_train)
        
        print('Evaluating model...')
        evaluate_model(model, X_test, Y_test, category_names)

        print('Saving model...\n    MODEL: {}'.format(model_filepath))
        save_model(model, model_filepath)

        print('Trained model saved!')

    else:
        print('Please provide the filepath of the disaster messages database '\
              'as the first argument and the filepath of the pickle file to '\
              'save the model to as the second argument. \n\nExample: python '\
              'train_classifier.py ../data/DisasterResponse.db classifier.pkl')


if __name__ == '__main__':
    main()

Loading data...
    DATABASE: figure_eight.db
Opening connection.
Reading data from table "msg_cat" in the database "figure_eight.db".
Connection is closed.
Building model...
Training model...
Fitting 3 folds for each of 1 candidates, totalling 3 fits
[CV] clf__estimator__C=0.5, clf__estimator__intercept_scaling=1.5, features__text_pipeline__vect__max_df=0.2, features__text_pipeline__vect__max_features=None 
[CV] clf__estimator__C=0.5, clf__estimator__intercept_scaling=1.5, features__text_pipeline__vect__max_df=0.2, features__text_pipeline__vect__max_features=None 
[CV]  clf__estimator__C=0.5, clf__estimator__intercept_scaling=1.5, features__text_pipeline__vect__max_df=0.2, features__text_pipeline__vect__max_features=None, total= 1.1min
[CV] clf__estimator__C=0.5, clf__estimator__intercept_scaling=1.5, features__text_pipeline__vect__max_df=0.2, features__text_pipeline__vect__max_features=None 
[CV]  clf__estimator__C=0.5, clf__estimator__intercept_scaling=1.5, features__text_pipeline__

[Parallel(n_jobs=2)]: Done   3 out of   3 | elapsed:  3.6min finished


Evaluating model...
                        precision    recall  f1-score   support

               message       0.86      0.93      0.89      5941
              original       0.77      0.58      0.66      1333
                 genre       0.00      0.00      0.00        34
               related       0.75      0.68      0.71      3286
               request       0.65      0.24      0.35       644
                 offer       0.70      0.27      0.39       414
           aid_related       0.76      0.13      0.23       239
          medical_help       0.00      0.00      0.00       156
      medical_products       0.63      0.26      0.37       267
     search_and_rescue       0.75      0.60      0.67       512
              security       0.80      0.70      0.75       878
              military       0.79      0.54      0.64       714
           child_alone       0.81      0.36      0.50       123
                 water       0.64      0.20      0.30       192
                  f

  .format(len(labels), len(target_names))
  'precision', 'predicted', average, warn_for)
