# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pandas as pd 
import numpy as np
from sqlalchemy import create_engine 
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import nltk
import re

from nltk import pos_tag 

nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')
nltk.download('stopwords')
 
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer 
from sklearn.feature_extraction.text import TfidfVectorizer

from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.pipeline import Pipeline 

from sklearn.model_selection import train_test_split

from sklearn.multioutput import MultiOutputClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report 

from sklearn.model_selection import GridSearchCV


[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [2]:
# load data from database
engine = create_engine('sqlite:///FemaResponseData.db')

df = pd.read_sql_table('Messages_categories', engine) 

print(df.columns)

X = df.message

Y = df.drop(['id', 'message', 'original', 'genre', 'categories'], axis = 1) 


Index(['id', 'message', 'original', 'genre', 'categories', 'related',
       'request', 'offer', 'aid_related', 'medical_help', 'medical_products',
       'search_and_rescue', 'security', 'military', 'child_alone', 'water',
       'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees',
       'death', 'other_aid', 'infrastructure_related', 'transport',
       'buildings', 'electricity', 'tools', 'hospitals', 'shops',
       'aid_centers', 'other_infrastructure', 'weather_related', 'floods',
       'storm', 'fire', 'earthquake', 'cold', 'other_weather',
       'direct_report'],
      dtype='object')


### 2. Write a tokenization function to process your text data

In [3]:
stop = stopwords.words('english')

def tokenize(text):
    # remove punctutation, covert to lowercase, strip spaces, lemmatize, remove common words 
    # reduce case
    words = text.lower()
    # remove puntuatuion 
    words = re.sub('[^a-z0-9]', ' ', words)
    #split words into list 
    word_list = word_tokenize(words)
   
    #lemmatize
    Lemma = WordNetLemmatizer()
    
    token_list = []
    
    for x in word_list:
        
        if x not in stop:
          
            token = Lemma.lemmatize(x, 'v').strip()

            token_list.append(token)
  
    return token_list
  

In [4]:
car = "here we go there goes a Car."

tokenize(car)

['go', 'go', 'car']

### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [5]:
pipeline = Pipeline([
    
    ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
    
    ('mlp', MultiOutputClassifier(RandomForestClassifier())), 
   
 ]) 

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size =.25)

print(X_train.shape)

print(y_train.shape)

pipeline.fit(X_train, y_train)

#pipeline.score(X_test, y_test)

(19662,)
(19662, 36)


Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [7]:
y_pred = pipeline.predict(X_test)

print(y_pred.shape)
print(y_test.shape)

for x in range(36):
    
    print(x+1,"Report:",classification_report(y_pred[:,x], np.array(y_test)[:,x]))
    print("Accuracy", (y_pred == y_test).mean())


(6554, 36)
(6554, 36)
1 Report:              precision    recall  f1-score   support

          0       0.44      0.62      0.52      1106
          1       0.91      0.84      0.88      5397
          2       0.41      0.31      0.36        51

avg / total       0.83      0.80      0.81      6554

Accuracy related                   0.799207
request                   0.881446
offer                     0.996186
aid_related               0.739396
medical_help              0.920812
medical_products          0.954074
search_and_rescue         0.975893
security                  0.980928
military                  0.969179
child_alone               1.000000
water                     0.948428
food                      0.923558
shelter                   0.931492
clothing                  0.987641
money                     0.976808
missing_people            0.988404
refugees                  0.969332
death                     0.960940
other_aid                 0.868172
infrastructure_related    

  'recall', 'true', average, warn_for)


Accuracy related                   0.799207
request                   0.881446
offer                     0.996186
aid_related               0.739396
medical_help              0.920812
medical_products          0.954074
search_and_rescue         0.975893
security                  0.980928
military                  0.969179
child_alone               1.000000
water                     0.948428
food                      0.923558
shelter                   0.931492
clothing                  0.987641
money                     0.976808
missing_people            0.988404
refugees                  0.969332
death                     0.960940
other_aid                 0.868172
infrastructure_related    0.934086
transport                 0.954837
buildings                 0.954532
electricity               0.980775
tools                     0.993897
hospitals                 0.989777
shops                     0.995728
aid_centers               0.988709
other_infrastructure      0.954989
weather_rel

### 6. Improve your model
Use grid search to find better parameters. 

In [8]:
pipeline.named_steps

{'tfidf': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
         dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
         lowercase=True, max_df=1.0, max_features=None, min_df=1,
         ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
         stop_words=None, strip_accents=None, sublinear_tf=False,
         token_pattern='(?u)\\b\\w\\w+\\b',
         tokenizer=<function tokenize at 0x7fde538ecae8>, use_idf=True,
         vocabulary=None),
 'mlp': MultiOutputClassifier(estimator=RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
             max_depth=None, max_features='auto', max_leaf_nodes=None,
             min_impurity_decrease=0.0, min_impurity_split=None,
             min_samples_leaf=1, min_samples_split=2,
             min_weight_fraction_leaf=0.0, n_estimators=10, n_jobs=1,
             oob_score=False, random_state=None, verbose=0,
             warm_start=False),
            n_jobs=1)}

In [9]:
pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'tfidf', 'mlp', 'tfidf__analyzer', 'tfidf__binary', 'tfidf__decode_error', 'tfidf__dtype', 'tfidf__encoding', 'tfidf__input', 'tfidf__lowercase', 'tfidf__max_df', 'tfidf__max_features', 'tfidf__min_df', 'tfidf__ngram_range', 'tfidf__norm', 'tfidf__preprocessor', 'tfidf__smooth_idf', 'tfidf__stop_words', 'tfidf__strip_accents', 'tfidf__sublinear_tf', 'tfidf__token_pattern', 'tfidf__tokenizer', 'tfidf__use_idf', 'tfidf__vocabulary', 'mlp__estimator__bootstrap', 'mlp__estimator__class_weight', 'mlp__estimator__criterion', 'mlp__estimator__max_depth', 'mlp__estimator__max_features', 'mlp__estimator__max_leaf_nodes', 'mlp__estimator__min_impurity_decrease', 'mlp__estimator__min_impurity_split', 'mlp__estimator__min_samples_leaf', 'mlp__estimator__min_samples_split', 'mlp__estimator__min_weight_fraction_leaf', 'mlp__estimator__n_estimators', 'mlp__estimator__n_jobs', 'mlp__estimator__oob_score', 'mlp__estimator__random_state', 'mlp__estimator__verbose', 'mlp__es

In [10]:
parameters = {
    #'tfidf__ngram_range': ((1,1),(1,2),(2,1),(2,2)),
    'tfidf__max_df': (1.0, 3.0, 10.0),
    'tfidf__max_features': (None,1000, 2500)      
             
    }

cv = GridSearchCV(pipeline, param_grid = parameters, n_jobs = 4)

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [11]:
cv.fit(X_train, y_train)

GridSearchCV(cv=None, error_score='raise',
       estimator=Pipeline(memory=None,
     steps=[('tfidf', TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), norm='l2', preprocessor=None, smooth_idf=True,
 ...oob_score=False, random_state=None, verbose=0,
            warm_start=False),
           n_jobs=1))]),
       fit_params=None, iid=True, n_jobs=4,
       param_grid={'tfidf__max_df': (1.0, 3.0, 10.0), 'tfidf__max_features': (None, 1000, 2500)},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=0)

In [12]:
cv.best_params_

{'tfidf__max_df': 1.0, 'tfidf__max_features': 1000}

In [13]:
y_preds_cv = cv.predict(X_test)

In [14]:
 print("Accuracy", (y_preds_cv == y_test).mean())

Accuracy related                   0.801648
request                   0.885413
offer                     0.996033
aid_related               0.741532
medical_help              0.922643
medical_products          0.956668
search_and_rescue         0.973604
security                  0.980775
military                  0.966890
child_alone               1.000000
water                     0.960635
food                      0.947055
shelter                   0.941715
clothing                  0.990998
money                     0.977266
missing_people            0.989319
refugees                  0.970095
death                     0.963839
other_aid                 0.867409
infrastructure_related    0.932713
transport                 0.956515
buildings                 0.955600
electricity               0.981080
tools                     0.993897
hospitals                 0.989167
shops                     0.995728
aid_centers               0.989014
other_infrastructure      0.954684
weather_rel

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [15]:
# RandomForestClassifier 

### 9. Export your model as a pickle file

In [16]:
import pickle 

pickle.dump(cv, open('/model2.pkl','wb'))

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.