# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# Importing all the libraries required for the project
%matplotlib inline

import re
import matplotlib.pyplot as plt
from sqlalchemy import create_engine
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import numpy as np
import pandas as pd
import pickle
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score, precision_score
from sklearn.metrics import recall_score , f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeClassifier

nltk.download('punkt')
nltk.download('wordnet')
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /home/bsaiva/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to /home/bsaiva/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/bsaiva/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [2]:
# Loading the data from the database
engine = create_engine('sqlite:///InsertDatabaseName.db')

# Reading the table created in the ETL pipeline preparation
df = pd.read_sql_table('InsertTableName', con=engine)

# Every column from the fifth onward is treated as a category label
categories = df.columns[4:]

X = df['message'].values
y = df[categories].values
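
Selecting the label columns by position (`df.columns[4:]`) silently breaks if the table layout changes. The sketch below shows one alternative: excluding known non-label columns by name. The toy DataFrame and the names in `non_label_cols` are assumptions made for illustration; the real table may use different column names.

```python
import pandas as pd

# Toy stand-in for the table read from the database (hypothetical columns)
demo_df = pd.DataFrame({
    'id': [1, 2],
    'message': ['need water', 'road blocked'],
    'original': [None, None],
    'genre': ['direct', 'news'],
    'related': [1, 1],
    'request': [1, 0],
})

# Assumed names of the columns that are NOT category labels
non_label_cols = ['id', 'message', 'original', 'genre']

# Keep the original column order while dropping the non-label columns
category_cols = [c for c in demo_df.columns if c not in non_label_cols]

X_msgs = demo_df['message'].values
y_labels = demo_df[category_cols].values

print(category_cols)   # ['related', 'request']
print(y_labels.shape)  # (2, 2)
```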

### 2. Write a tokenization function to process your text data

In [6]:
url_re = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

# Build the stopword set once; reloading the list for every token is very slow
stop_words = set(stopwords.words('english'))

def tokenize_fn(text, lemmatizer=WordNetLemmatizer()):
    # Replace every URL with a placeholder token
    det_urls = re.findall(url_re, text)
    for url in det_urls:
        text = text.replace(url, 'urlplaceholder')

    # Lowercase, replace non-alphanumeric characters with spaces, then tokenize
    tokens = word_tokenize(re.sub(r"[^a-zA-Z0-9]", " ", text.lower()))

    # Remove stopwords
    tokens = [t for t in tokens if t not in stop_words]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(t) for t in tokens]
    
    return tokens
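
To see what the URL replacement and character normalization do to a message, here is a minimal standard-library sketch. It reuses the same URL pattern but, as a simplifying assumption, splits on whitespace instead of calling NLTK and skips stopword removal and lemmatization. The helper name `normalize` and the sample message are made up for illustration.

```python
import re

URL_RE = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'

def normalize(text):
    """Replace URLs, lowercase, strip punctuation, and split on whitespace."""
    text = re.sub(URL_RE, 'urlplaceholder', text)
    text = re.sub(r'[^a-zA-Z0-9]', ' ', text.lower())
    return text.split()

sample_tokens = normalize("Water needed at http://example.com/help now!")
print(sample_tokens)
# ['water', 'needed', 'at', 'urlplaceholder', 'now']
```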

In [7]:
# Checking vocabulary
vect = CountVectorizer(tokenizer=tokenize_fn)
X_vectorized = vect.fit_transform(X)

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [8]:
# Creating the pipeline with random forest classifier
pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize_fn)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(class_weight='balanced')))
])
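
The following toy example is a sketch of how such a pipeline behaves end to end: `MultiOutputClassifier` fits one classifier per target column, so `predict` returns one column per category. The messages, the two label columns, and the small `n_estimators` value are invented for illustration, and the default tokenizer is used instead of `tokenize_fn` to keep the example self-contained.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

toy_texts = [
    "we need water and food",
    "medical help needed urgently",
    "please send food supplies",
    "doctor required for injured people",
    "no water in the village",
    "hospital needs medicine",
]
# Hypothetical label columns: [food_or_water, medical]
toy_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

toy_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        RandomForestClassifier(n_estimators=10, random_state=0))),
])
toy_pipeline.fit(toy_texts, toy_labels)

toy_pred = toy_pipeline.predict(["send water", "need a doctor"])
print(toy_pred.shape)  # (2, 2): one row per message, one column per label
```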

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [9]:
# Splitting the dataset and fitting the pipeline
X_train, X_test, y_train, y_test = train_test_split(X, y)
pipeline.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vect',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=<function tokenize_f...
                                                                        class_weight='balanced',
                                                                        criterion='gini',
                                                                        max_depth=None,
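
Called without arguments, `train_test_split` holds out 25% of the rows and draws a different random split on every run, so reported scores vary between runs. A minimal sketch of a reproducible split, using made-up data and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for the messages and the label matrix
demo_messages = np.array([f"message {i}" for i in range(100)])
demo_targets = np.zeros((100, 3), dtype=int)

# Fixing random_state makes the split identical across runs
msg_train, msg_test, tgt_train, tgt_test = train_test_split(
    demo_messages, demo_targets, test_size=0.25, random_state=42)

# Repeating the call with the same seed returns the same partition
msg_train_again, msg_test_again, _, _ = train_test_split(
    demo_messages, demo_targets, test_size=0.25, random_state=42)

print(len(msg_train), len(msg_test))  # 75 25
```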
                                                             

### 5. Test your model
Report the f1 score, precision and recall on both the training set and the test set. You can use sklearn's `classification_report` function here. 

In [10]:
def my_modified_report(y_re, y_pre):
    # Print accuracy plus support-weighted recall, precision and F1 for each category column
    for p in range(0, len(categories)):
        print(categories[p])
        print("\tAccuracy: {:.4f}\t\t% Recall is: {:.4f}% Precision score is: {:.4f}% F1_score is : {:.4f}".format(
            accuracy_score(y_re[:, p], y_pre[:, p]),
            recall_score(y_re[:, p], y_pre[:, p], average='weighted'),
            precision_score(y_re[:, p], y_pre[:, p], average='weighted'),
            f1_score(y_re[:, p], y_pre[:, p], average='weighted')
        ))
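
One property of `average='weighted'` is worth knowing when reading the report: support-weighted recall is mathematically equal to accuracy, so the first two numbers on each line are always identical. For rare categories, the recall of the positive class alone can be far lower. A small sketch on made-up labels:

```python
from sklearn.metrics import accuracy_score, recall_score

# Imbalanced toy labels: nine negatives, one positive
toy_true = [0] * 9 + [1]
# A classifier that always predicts the majority class
toy_hat = [0] * 10

toy_accuracy = accuracy_score(toy_true, toy_hat)
toy_weighted_recall = recall_score(toy_true, toy_hat, average='weighted')
toy_positive_recall = recall_score(toy_true, toy_hat)  # positive class only

print(toy_accuracy, toy_weighted_recall, toy_positive_recall)
```

Here accuracy and weighted recall are both 0.9 even though the single positive example is missed, which is why per-class numbers (for example from `classification_report`) are more informative for rare categories.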
        

In [11]:
# Evaluating the pipeline on the training data
y_pred = pipeline.predict(X_train)
my_modified_report(y_train, y_pred)

related
	Accuracy: 0.9983		% Recall is: 0.9983% Precision score is: 0.9983% F1_score is : 0.9983
request
	Accuracy: 0.9988		% Recall is: 0.9988% Precision score is: 0.9988% F1_score is : 0.9988
offer
	Accuracy: 0.9998		% Recall is: 0.9998% Precision score is: 0.9999% F1_score is : 0.9998
aid_related
	Accuracy: 0.9987		% Recall is: 0.9987% Precision score is: 0.9987% F1_score is : 0.9987
medical_help
	Accuracy: 0.9994		% Recall is: 0.9994% Precision score is: 0.9994% F1_score is : 0.9994
medical_products
	Accuracy: 0.9994		% Recall is: 0.9994% Precision score is: 0.9994% F1_score is : 0.9994
search_and_rescue
	Accuracy: 0.9997		% Recall is: 0.9997% Precision score is: 0.9997% F1_score is : 0.9997
security
	Accuracy: 0.9997		% Recall is: 0.9997% Precision score is: 0.9997% F1_score is : 0.9997
military
	Accuracy: 0.9998		% Recall is: 0.9998% Precision score is: 0.9998% F1_score is : 0.9998
child_alone
	Accuracy: 1.0000		% Recall is: 1.0000% Precision score is: 1.0000% F1_score is : 1.000

In [12]:
# On test data
y_pred = pipeline.predict(X_test)
my_modified_report(y_test, y_pred)

related
	Accuracy: 0.8300		% Recall is: 0.8300% Precision score is: 0.8184% F1_score is : 0.8164
request
	Accuracy: 0.9007		% Recall is: 0.9007% Precision score is: 0.8956% F1_score is : 0.8922
offer
	Accuracy: 0.9940		% Recall is: 0.9940% Precision score is: 0.9884% F1_score is : 0.9912
aid_related
	Accuracy: 0.7875		% Recall is: 0.7875% Precision score is: 0.7865% F1_score is : 0.7867
medical_help
	Accuracy: 0.9231		% Recall is: 0.9231% Precision score is: 0.9114% F1_score is : 0.8903
medical_products
	Accuracy: 0.9559		% Recall is: 0.9559% Precision score is: 0.9476% F1_score is : 0.9382
search_and_rescue
	Accuracy: 0.9725		% Recall is: 0.9725% Precision score is: 0.9533% F1_score is : 0.9596
security
	Accuracy: 0.9837		% Recall is: 0.9837% Precision score is: 0.9759% F1_score is : 0.9759
military
	Accuracy: 0.9646		% Recall is: 0.9646% Precision score is: 0.9477% F1_score is : 0.9498
child_alone
	Accuracy: 1.0000		% Recall is: 1.0000% Precision score is: 1.0000% F1_score is : 1.000



### 6. Improve your model
Use grid search to find better parameters. 

In [14]:
# Using grid search now to find the best parameters for our model

parameters = {
    # Number of trees per forest: try 20 and 50
    'clf__estimator__n_estimators': [20, 50]
}

cv_b = GridSearchCV(estimator=pipeline, param_grid=parameters, cv=3, scoring='f1_weighted', verbose=3)
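
The key `clf__estimator__n_estimators` follows scikit-learn's nested naming convention: step name, then attribute, then parameter, joined by double underscores. A quick way to discover valid keys is to list what `get_params()` exposes. The sketch below rebuilds an equivalent pipeline with default settings purely for inspection; the name `demo_pipeline` is an illustration, not part of the notebook.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

demo_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier())),
])

# Every key returned here can be used in a GridSearchCV param_grid
all_params = demo_pipeline.get_params()
forest_params = sorted(
    name for name in all_params if name.startswith('clf__estimator__'))

for name in forest_params:
    print(name)
```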

In [15]:
cv_b.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 2 candidates, totalling 6 fits
[CV] clf__estimator__n_estimators=20 .................................
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  1.5min remaining:    0.0s
[CV] ..... clf__estimator__n_estimators=20, score=0.544, total= 1.5min
[CV] clf__estimator__n_estimators=20 .................................
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  3.0min remaining:    0.0s
[CV] ..... clf__estimator__n_estimators=20, score=0.533, total= 1.5min
[CV] clf__estimator__n_estimators=20 .................................
[CV] ..... clf__estimator__n_estimators=20, score=0.540, total= 1.5min
[CV] clf__estimator__n_estimators=50 .................................
[CV] ..... clf__estimator__n_estimators=50, score=0.549, total= 2.5min
[CV] clf__estimator__n_estimators=50 .................................
[CV] ..... clf__estimator__n_estimators=50, score=0.543, total= 2.6min
[CV] clf__estimator__n_estimators=50 .................................
[Parallel(n_jobs=1)]: Done   6 out of   6 | elapsed: 12.0min finished
[CV] ..... clf__estimator__n_estimators=50, score=0.546, total= 2.5min


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.

In [16]:
# Evaluating the tuned model on the training data
y_pred = cv_b.predict(X_train)
my_modified_report(y_train, y_pred)

related
	Accuracy: 0.9982		% Recall is: 0.9982% Precision score is: 0.9982% F1_score is : 0.9982
request
	Accuracy: 0.9988		% Recall is: 0.9988% Precision score is: 0.9988% F1_score is : 0.9988
offer
	Accuracy: 0.9997		% Recall is: 0.9997% Precision score is: 0.9997% F1_score is : 0.9997
aid_related
	Accuracy: 0.9985		% Recall is: 0.9985% Precision score is: 0.9985% F1_score is : 0.9985
medical_help
	Accuracy: 0.9990		% Recall is: 0.9990% Precision score is: 0.9990% F1_score is : 0.9990
medical_products
	Accuracy: 0.9992		% Recall is: 0.9992% Precision score is: 0.9992% F1_score is : 0.9992
search_and_rescue
	Accuracy: 0.9994		% Recall is: 0.9994% Precision score is: 0.9994% F1_score is : 0.9994
security
	Accuracy: 0.9992		% Recall is: 0.9992% Precision score is: 0.9992% F1_score is : 0.9992
military
	Accuracy: 0.9997		% Recall is: 0.9997% Precision score is: 0.9997% F1_score is : 0.9997
child_alone
	Accuracy: 1.0000		% Recall is: 1.0000% Precision score is: 1.0000% F1_score is : 1.000

In [17]:
y_pred = cv_b.predict(X_test)
my_modified_report(y_test, y_pred)

related
	Accuracy: 0.8271		% Recall is: 0.8271% Precision score is: 0.8150% F1_score is : 0.8143
request
	Accuracy: 0.9025		% Recall is: 0.9025% Precision score is: 0.8976% F1_score is : 0.8946
offer
	Accuracy: 0.9940		% Recall is: 0.9940% Precision score is: 0.9884% F1_score is : 0.9912
aid_related
	Accuracy: 0.7810		% Recall is: 0.7810% Precision score is: 0.7800% F1_score is : 0.7802
medical_help
	Accuracy: 0.9226		% Recall is: 0.9226% Precision score is: 0.9016% F1_score is : 0.8921
medical_products
	Accuracy: 0.9568		% Recall is: 0.9568% Precision score is: 0.9503% F1_score is : 0.9401
search_and_rescue
	Accuracy: 0.9725		% Recall is: 0.9725% Precision score is: 0.9533% F1_score is : 0.9596
security
	Accuracy: 0.9837		% Recall is: 0.9837% Precision score is: 0.9759% F1_score is : 0.9759
military
	Accuracy: 0.9649		% Recall is: 0.9649% Precision score is: 0.9501% F1_score is : 0.9507
child_alone
	Accuracy: 1.0000		% Recall is: 1.0000% Precision score is: 1.0000% F1_score is : 1.000



### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [18]:
# New pipeline to try to improve on the random forest results
# AdaBoost over depth-1 decision trees (stumps) replaces the random forest classifier

pipeline_imp = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize_fn)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(
        # Note: scikit-learn 1.2 renamed `base_estimator` to `estimator`
        AdaBoostClassifier(base_estimator=DecisionTreeClassifier(max_depth=1, class_weight='balanced'))
    ))
])

# Hyperparameter grid for the AdaBoost classifier
parameters_imp = {
    'clf__estimator__n_estimators': [100, 200],
    'clf__estimator__learning_rate': [0.1, 0.3]
}

# Grid search over the new pipeline
cv_imp = GridSearchCV(estimator=pipeline_imp, param_grid=parameters_imp, cv=3, scoring='f1_weighted', verbose=3)
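
The keyword that passes the weak learner to `AdaBoostClassifier` depends on the installed scikit-learn version: older releases call it `base_estimator`, newer ones `estimator`. The sketch below is one possible way to build the booster so that it works with either name, by inspecting the constructor signature. The toy data and the small `n_estimators` value are assumptions for illustration only.

```python
import inspect

import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Depth-1 tree (decision stump) used as the weak learner
stump = DecisionTreeClassifier(max_depth=1, class_weight='balanced')

# Pick whichever keyword the installed scikit-learn version accepts
init_params = inspect.signature(AdaBoostClassifier.__init__).parameters
stump_kwarg = 'estimator' if 'estimator' in init_params else 'base_estimator'

booster = AdaBoostClassifier(
    **{stump_kwarg: stump}, n_estimators=10, learning_rate=0.3, random_state=0)

# Tiny synthetic problem, just to confirm the booster fits and predicts
toy_features = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
toy_targets = np.array([0, 0, 0, 1, 1, 1])
booster.fit(toy_features, toy_targets)
toy_boost_pred = booster.predict(toy_features)
print(stump_kwarg, toy_boost_pred.shape)
```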

In [19]:
cv_imp.fit(X_train, y_train)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
Fitting 3 folds for each of 4 candidates, totalling 12 fits
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  2.2min remaining:    0.0s
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100, score=0.602, total= 2.2min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100
[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  4.5min remaining:    0.0s
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100, score=0.599, total= 2.3min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=100, score=0.603, total= 2.2min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200, score=0.619, total= 3.7min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200, score=0.616, total= 3.8min
[CV] clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200
[CV]  clf__estimator__learning_rate=0.1, clf__estimator__n_estimators=200, score=0.624, total= 3.7min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100
[CV]  clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100, score=0.622, total= 2.2min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100
[CV]  clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100, score=0.624, total= 2.3min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100
[CV]  clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=100, score=0.626, total= 2.3min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200
[CV]  clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200, score=0.631, total= 3.7min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200
[CV]  clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200, score=0.632, total= 3.8min
[CV] clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200
[Parallel(n_jobs=1)]: Done  12 out of  12 | elapsed: 35.8min finished
[CV]  clf__estimator__learning_rate=0.3, clf__estimator__n_estimators=200, score=0.636, total= 3.7min


GridSearchCV(cv=3, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        prep

In [20]:
# Checking the new parameters obtained from GridSearchCV
cv_imp.best_params_

{'clf__estimator__learning_rate': 0.3, 'clf__estimator__n_estimators': 200}
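
The cross-validation scores logged by the grid search (around 0.6) are not directly comparable with the per-category numbers printed by `my_modified_report`. With a multilabel target, `f1_weighted` computes the F1 of the positive class for each label and weights it by that label's support, whereas the report averages over both the 0 and 1 classes within each column. A sketch on made-up labels illustrates the gap:

```python
import numpy as np
from sklearn.metrics import f1_score

# Two toy labels: the first is common, the second is rare
Y_true = np.array([[1, 0], [1, 0], [1, 1], [0, 0], [0, 0], [1, 0]])

# Predictions: first label perfect, second label never predicted
Y_hat = Y_true.copy()
Y_hat[:, 1] = 0

# What scoring='f1_weighted' measures on a multilabel target
multilabel_f1 = f1_score(Y_true, Y_hat, average='weighted', zero_division=0)

# What a per-column weighted F1 measures (both classes of each column)
per_column_f1 = [
    f1_score(Y_true[:, j], Y_hat[:, j], average='weighted', zero_division=0)
    for j in range(Y_true.shape[1])
]

print(round(multilabel_f1, 4))                # 0.8
print([round(v, 4) for v in per_column_f1])   # [1.0, 0.7576]
```

The second column scores about 0.76 per column because the many correct negatives dominate, while it contributes an F1 of 0 to the multilabel score.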

In [21]:
y_pred = cv_imp.predict(X_train)
my_modified_report(y_train, y_pred)

related
	Accuracy: 0.7321		% Recall is: 0.7321% Precision score is: 0.8353% F1_score is : 0.7524
request
	Accuracy: 0.8665		% Recall is: 0.8665% Precision score is: 0.8871% F1_score is : 0.8736
offer
	Accuracy: 0.9743		% Recall is: 0.9743% Precision score is: 0.9965% F1_score is : 0.9839
aid_related
	Accuracy: 0.7829		% Recall is: 0.7829% Precision score is: 0.7817% F1_score is : 0.7818
medical_help
	Accuracy: 0.8989		% Recall is: 0.8989% Precision score is: 0.9309% F1_score is : 0.9107
medical_products
	Accuracy: 0.9008		% Recall is: 0.9008% Precision score is: 0.9525% F1_score is : 0.9199
search_and_rescue
	Accuracy: 0.8782		% Recall is: 0.8782% Precision score is: 0.9713% F1_score is : 0.9152
security
	Accuracy: 0.8671		% Recall is: 0.8671% Precision score is: 0.9812% F1_score is : 0.9140
military
	Accuracy: 0.9521		% Recall is: 0.9521% Precision score is: 0.9764% F1_score is : 0.9608
child_alone
	Accuracy: 1.0000		% Recall is: 1.0000% Precision score is: 1.0000% F1_score is : 1.000

In [22]:
y_pred = cv_imp.predict(X_test)
my_modified_report(y_test, y_pred)

related
	Accuracy: 0.7356		% Recall is: 0.7356% Precision score is: 0.8302% F1_score is : 0.7554
request
	Accuracy: 0.8547		% Recall is: 0.8547% Precision score is: 0.8761% F1_score is : 0.8623
offer
	Accuracy: 0.9605		% Recall is: 0.9605% Precision score is: 0.9900% F1_score is : 0.9745
aid_related
	Accuracy: 0.7728		% Recall is: 0.7728% Precision score is: 0.7715% F1_score is : 0.7716
medical_help
	Accuracy: 0.8859		% Recall is: 0.8859% Precision score is: 0.9161% F1_score is : 0.8981
medical_products
	Accuracy: 0.8920		% Recall is: 0.8920% Precision score is: 0.9492% F1_score is : 0.9144
search_and_rescue
	Accuracy: 0.8563		% Recall is: 0.8563% Precision score is: 0.9592% F1_score is : 0.9006
security
	Accuracy: 0.8398		% Recall is: 0.8398% Precision score is: 0.9732% F1_score is : 0.8986
military
	Accuracy: 0.9469		% Recall is: 0.9469% Precision score is: 0.9684% F1_score is : 0.9552
child_alone
	Accuracy: 1.0000		% Recall is: 1.0000% Precision score is: 1.0000% F1_score is : 1.000

### 9. Export your model as a pickle file

In [23]:
with open('adaboost_new.pkl', 'wb') as file:
    pickle.dump(cv_imp, file)
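
Before relying on an exported file, it can help to load it back and confirm that the restored model predicts the same values. The sketch below shows that round trip on a small stand-in model; the decision tree, the toy data, and the file name `toy_model.pkl` are assumptions for illustration. Note that pickle stores functions by reference, so a pipeline that uses a custom tokenizer such as `tokenize_fn` needs that function to be importable under the same name wherever the file is loaded.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Small stand-in model (the notebook pickles the fitted grid search instead)
pk_features = np.array([[0.0], [1.0], [2.0], [3.0]])
pk_targets = np.array([0, 0, 1, 1])
pk_model = DecisionTreeClassifier(random_state=0).fit(pk_features, pk_targets)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'toy_model.pkl')

    # Export
    with open(model_path, 'wb') as fh:
        pickle.dump(pk_model, fh)

    # Reload
    with open(model_path, 'rb') as fh:
        restored_model = pickle.load(fh)

# The restored model should reproduce the original predictions
same_predictions = bool(
    (restored_model.predict(pk_features) == pk_model.predict(pk_features)).all())
print(same_predictions)  # True
```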

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
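
As a starting point for the script, the command-line handling might look like the sketch below. The template file itself is not shown here, so the argument names (`database_path`, `model_path`, `--table`) and the function name `parse_args` are hypothetical choices for illustration, not the template's actual interface.

```python
import argparse

def parse_args(argv=None):
    """Parse command-line arguments (hypothetical interface for train.py)."""
    parser = argparse.ArgumentParser(
        description='Train the message classifier and export it as a pickle file.')
    parser.add_argument('database_path',
                        help='path to the SQLite database created by the ETL step')
    parser.add_argument('model_path',
                        help='where to write the pickled model')
    parser.add_argument('--table', default='InsertTableName',
                        help='name of the table to read (placeholder default)')
    return parser.parse_args(argv)

# Example invocation with explicit arguments instead of sys.argv
cli_args = parse_args(['InsertDatabaseName.db', 'adaboost_new.pkl'])
print(cli_args.database_path, cli_args.model_path, cli_args.table)
```

The remaining steps (load data, build the pipeline, fit, evaluate, save) would then be called in order from a `main()` function using these arguments.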