# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [1]:
# import libraries
import pickle

import nltk
import numpy as np
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sqlalchemy import create_engine

# download the NLTK resources used by word_tokenize and WordNetLemmatizer
nltk.download(['punkt', 'wordnet'])

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [2]:
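# load data from database
# 'InsertDatabaseName.db' is a placeholder; replace it with the path to your SQLite file
engine = create_engine('sqlite:///InsertDatabaseName.db')
df = pd.read_sql_table('messages_cat', engine)
X = df['message']    # raw message text (feature)
Y = df.iloc[:, 4:]   # category columns (targets), assuming the first four columns are metadata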
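
The positional slice `df.iloc[:, 4:]` only works while the first four columns are metadata. A minimal sketch of a name-based alternative follows; the toy frame and the metadata column names (`id`, `message`, `original`, `genre`) are assumptions for illustration, so check `df.columns` for the actual layout of `messages_cat`.

```python
import pandas as pd

# Toy frame standing in for the `messages_cat` table; the column names
# are assumptions for illustration, not read from the real database.
df_demo = pd.DataFrame({
    "id": [1, 2],
    "message": ["we need water", "road is blocked"],
    "original": [None, None],
    "genre": ["direct", "news"],
    "water": [1, 0],
    "food": [0, 0],
})

metadata_cols = ["id", "message", "original", "genre"]

X_demo = df_demo["message"]
# Dropping metadata by name keeps working if the column order changes.
Y_demo = df_demo.drop(columns=metadata_cols)

print(list(Y_demo.columns))
```

Selecting by name makes the intent explicit and fails loudly with a `KeyError` if an expected metadata column is missing.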

### 2. Write a tokenization function to process your text data

In [3]:
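def tokenize(text):
    """Split text into word tokens and return them lowercased and lemmatized."""
    tokens = word_tokenize(text)
    lemmatizer = WordNetLemmatizer()

    clean_tokens = []
    for tok in tokens:
        # lowercase before lemmatizing: the WordNet lemmatizer is case-sensitive
        clean_tok = lemmatizer.lemmatize(tok.lower().strip())
        clean_tokens.append(clean_tok)

    return clean_tokens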
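
The `tokenize` function above leaves URLs and punctuation in the token stream. A minimal sketch of an optional normalization step that could run before `word_tokenize` is shown below; the helper name `normalize_text` and the `urlplaceholder` token are illustrative assumptions, not part of the original notebook.

```python
import re

# Simple URL matcher; an assumption for illustration, not a full URL grammar.
URL_PATTERN = re.compile(r"https?://\S+")


def normalize_text(text):
    """Replace URLs with a placeholder, lowercase, and drop punctuation."""
    text = URL_PATTERN.sub("urlplaceholder", text)
    # Collapse every run of characters other than a-z and 0-9 into one space.
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


print(normalize_text("Help! Visit http://example.com/info NOW."))
# prints: help visit urlplaceholder now
```

Replacing URLs with a single token keeps the vocabulary from filling up with one-off links.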

### 3. Build a machine learning pipeline
- You'll find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [4]:

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(AdaBoostClassifier(), n_jobs=-1))
])
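
To see what `MultiOutputClassifier` does, here is a minimal sketch on toy data. The documents, the two label columns, and the `DecisionTreeClassifier` base estimator are assumptions chosen to keep the example small and fast; the notebook itself uses `AdaBoostClassifier`.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy messages with two binary targets per message: [water, food].
docs = ["we need water", "send food now", "water and food please", "road is blocked"]
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])
demo_pipeline.fit(docs, labels)

# One fitted estimator per target column, one prediction column per target.
predictions = demo_pipeline.predict(docs)
print(predictions.shape)
print(len(demo_pipeline.named_steps["clf"].estimators_))
```

Each target column gets its own independently fitted copy of the base estimator, which is why the wrapper works with any single-output classifier.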

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [5]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)
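
The call above uses the default split, which changes on every run. A minimal sketch of a reproducible split on toy data follows; the `test_size` and `random_state` values are illustrative assumptions, not settings from the original run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy messages and targets standing in for X and Y.
messages = pd.Series([f"message {i}" for i in range(8)])
targets = pd.DataFrame({"water": [0, 1] * 4, "food": [1, 1, 0, 0] * 2})

# Fixing random_state makes the same rows land in the test set each time.
X_tr, X_te, y_tr, y_te = train_test_split(
    messages, targets, test_size=0.25, random_state=42
)
X_tr_again, X_te_again, _, _ = train_test_split(
    messages, targets, test_size=0.25, random_state=42
)

print(len(X_tr), len(X_te))
print(X_te.equals(X_te_again))
```

A fixed seed makes score differences between runs attributable to model changes rather than to a different split.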

In [10]:
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

### 5. Test your model
Report the F1 score, precision, and recall on both the training set and the test set. You can use sklearn's `classification_report` function here.

In [11]:
print(classification_report(y_test, y_pred, target_names=y_test.columns))

                        precision    recall  f1-score   support

               related       0.83      0.94      0.88      5068
               request       0.74      0.52      0.61      1095
                 offer       0.08      0.04      0.05        28
           aid_related       0.76      0.57      0.65      2740
          medical_help       0.60      0.26      0.36       506
      medical_products       0.63      0.27      0.38       339
     search_and_rescue       0.57      0.18      0.27       180
              security       0.29      0.08      0.13       120
              military       0.65      0.35      0.45       243
           child_alone       0.00      0.00      0.00         0
                 water       0.74      0.64      0.69       414
                  food       0.78      0.69      0.73       700
               shelter       0.76      0.51      0.61       557
              clothing       0.69      0.46      0.55       105
                 money       0.57      



In [None]:
pipeline_os = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
])
X_new = pipeline_os.fit_transform(X_train)




In [None]:
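from imblearn.over_sampling import SMOTE  # provided by the imbalanced-learn package

smt = SMOTE()
os_X_train, os_y_train = smt.fit_resample(X_new, y_train['fire'])  # fit_sample in older releases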

In [None]:
rf = RandomForestClassifier(n_estimators=100)
rf.fit(os_X_train, os_y_train)

In [None]:
X_test_new = pipeline_os.transform(X_test)
X_test_new.shape

In [None]:
os_y_pred = rf.predict(X_test_new)

In [None]:
y_test.shape

In [None]:
print(classification_report(y_test['fire'], os_y_pred))

### 6. Improve your model
Use grid search to find better parameters. 

In [6]:
parameters = {
    'vect__max_df': (0.5, 0.75),
    'vect__max_features': (None, 5000, 10000),
    'tfidf__use_idf': (True, False),
    #'clf__estimator__n_estimators': [50, 100],
    #'clf__estimator__learning_rate': [0.1, 1, 3]
}

cv = GridSearchCV(pipeline, param_grid=parameters, n_jobs = -1, verbose=10, scoring='f1_weighted')
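
Grid keys follow the `<step name>__<parameter>` convention, so `vect__max_df` sets `max_df` on the `vect` step. A minimal sketch using `ParameterGrid` to count and list the combinations of the active (uncommented) entries above, without fitting anything; the name `demo_parameters` is an assumption for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Same active entries as the `parameters` dict above.
demo_parameters = {
    "vect__max_df": (0.5, 0.75),
    "vect__max_features": (None, 5000, 10000),
    "tfidf__use_idf": (True, False),
}

grid = ParameterGrid(demo_parameters)

# 2 * 3 * 2 combinations; each is refit once per cross-validation fold.
print(len(grid))
for combo in list(grid)[:3]:
    print(combo)
```

Counting candidates up front helps estimate run time before starting a long search.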

In [None]:
cv.get_params().keys()

In [7]:
cv_fit = cv.fit(X_train, y_train)

Fitting 3 folds for each of 12 candidates, totalling 36 fits
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, score=0.6291991265365082, total= 1.9min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, score=0.6284906553116674, total= 1.8min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=None, score=0.6277809728644728, total= 1.8min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=5000, score=0.6247019277312366, total= 1.5min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=5000, score=0.6255082005397051, total= 1.5min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=5000, score=0.6244760695137118, total= 1.5min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=10000, score=0.6292546504301966, total= 1.6min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=10000, score=0.630200385745848, total= 1.6min
[CV]  tfidf__use_idf=True, vect__max_df=0.5, vect__max_features=10000, score=0.6221291827371263, total= 1.6min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.6305529153076701, total= 1.8min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.6311169343364375, total= 1.8min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=None, score=0.6253180203169123, total= 1.9min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000, score=0.6270159052283422, total= 1.6min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000, score=0.6278119635126634, total= 1.6min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=5000, score=0.6241000682113993, total= 1.6min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=10000, score=0.6276286412847895, total= 1.7min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=10000, score=0.6291527925170209, total= 1.7min
[CV]  tfidf__use_idf=True, vect__max_df=0.75, vect__max_features=10000, score=0.6254050498989574, total= 1.7min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=None, score=0.6257193767492882, total= 1.5min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=None, score=0.6289390562830495, total= 1.5min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=None, score=0.6257216465538727, total= 1.5min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=5000, score=0.6265797742188527, total= 1.2min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=5000, score=0.6273247459429073, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=5000, score=0.6226987758516721, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=10000, score=0.6262210231046571, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=10000, score=0.6270245488227397, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.5, vect__max_features=10000, score=0.6243378254475435, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=None, score=0.6267498116458294, total= 1.6min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=None, score=0.6289039216814827, total= 1.5min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=None, score=0.6256408064356406, total= 1.6min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=5000, score=0.6282295977090935, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=5000, score=0.628209024809014, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=5000, score=0.6228398803000971, total= 1.3min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=10000, score=0.6282699397230643, total= 1.4min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=10000, score=0.6299446463204755, total= 1.4min
[CV]  tfidf__use_idf=False, vect__max_df=0.75, vect__max_features=10000, score=0.6253870973799347, total= 1.4min
[Parallel(n_jobs=-1)]: Done  36 out of  36 | elapsed: 62.4min finished
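
Rather than reading scores off the verbose log, the fitted search object exposes them directly. A minimal sketch on toy data follows; the documents, labels, small grid, `cv=2`, and the `DecisionTreeClassifier` base estimator are assumptions chosen so the example runs in seconds.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy messages with two binary targets, repeated so both folds see every case.
docs = ["we need water", "send food now", "water and food please", "road is blocked"] * 2
labels = np.array([[1, 0], [0, 1], [1, 1], [0, 0]] * 2)

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
])

search = GridSearchCV(
    demo_pipeline,
    param_grid={"vect__max_features": (None, 5), "tfidf__use_idf": (True, False)},
    cv=2,
    scoring="f1_weighted",
)
search.fit(docs, labels)

# cv_results_ holds one row per candidate; rank 1 is the best mean score.
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(search.best_params_)
print(results.sort_values("rank_test_score"))
```

The same `best_params_`, `best_score_`, and `cv_results_` attributes are available on `cv_fit` after the full search above.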


In [8]:
y_pred = cv_fit.best_estimator_.predict(X_test)
print(classification_report(y_test, y_pred, target_names=y_test.columns))

                        precision    recall  f1-score   support

               related       0.84      0.94      0.89      5098
               request       0.78      0.55      0.65      1135
                 offer       0.17      0.03      0.06        30
           aid_related       0.76      0.61      0.67      2735
          medical_help       0.60      0.29      0.39       504
      medical_products       0.62      0.34      0.44       328
     search_and_rescue       0.49      0.17      0.25       177
              security       0.18      0.03      0.06       123
              military       0.61      0.33      0.43       215
           child_alone       0.00      0.00      0.00         0
                 water       0.73      0.63      0.68       396
                  food       0.80      0.69      0.74       729
               shelter       0.74      0.55      0.63       593
              clothing       0.66      0.42      0.51       105
                 money       0.62      



In [None]:
import joblib  # sklearn.externals.joblib was removed from recent scikit-learn releases
joblib.dump(cv_fit, 'rf.model')

In [None]:
cv_fit = joblib.load('rf.model')

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.
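
One way to show these metrics per category is to score each target column separately. This is a minimal sketch, assuming every target column is binary (0/1); the helper name `per_category_scores` and the toy data are illustrative, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def per_category_scores(y_true, y_pred):
    """Return accuracy, precision, recall, and F1 for each target column."""
    y_pred = np.asarray(y_pred)
    rows = []
    for i, col in enumerate(y_true.columns):
        # zero_division=0 reports 0.0 instead of warning when a class is never predicted.
        precision, recall, f1, _ = precision_recall_fscore_support(
            y_true[col], y_pred[:, i], average="binary", zero_division=0
        )
        rows.append({
            "category": col,
            "accuracy": accuracy_score(y_true[col], y_pred[:, i]),
            "precision": precision,
            "recall": recall,
            "f1": f1,
        })
    return pd.DataFrame(rows).set_index("category")


# Toy ground truth and predictions for two categories.
y_true = pd.DataFrame({"water": [1, 0, 1, 1], "food": [0, 0, 1, 0]})
y_pred = np.array([[1, 0], [0, 0], [0, 1], [1, 1]])

scores = per_category_scores(y_true, y_pred)
print(scores)
```

With the tuned model, the call would be `per_category_scores(y_test, cv_fit.best_estimator_.predict(X_test))`, provided the binary assumption holds for every column.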

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF
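
As one possible way to add a feature besides TF-IDF, the sketch below combines TF-IDF with a word-count feature through `FeatureUnion`. The `TextLengthExtractor` class and the toy documents are assumptions for illustration; the original notebook does not define them.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical extra feature: the number of whitespace-separated words."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        # One column, one row per message.
        return np.array([[len(text.split())] for text in X])


tfidf = TfidfVectorizer()
features = FeatureUnion([
    ("tfidf", tfidf),
    ("length", TextLengthExtractor()),
])

docs = ["we need water", "send food and water please"]
matrix = features.fit_transform(docs)

# TF-IDF columns come first, followed by the single length column.
print(matrix.shape)
print(matrix.toarray()[:, -1])
```

A `FeatureUnion` like this could replace the `vect` and `tfidf` steps as the first step of the pipeline, with the classifier step unchanged.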

### 9. Export your model as a pickle file

In [9]:
pkl_filename = "pickle_model.pkl"  
with open(pkl_filename, 'wb') as file:  
    pickle.dump(cv_fit.best_estimator_, file)

In [None]:
# Load from file
with open(pkl_filename, 'rb') as file:  
    pickle_model = pickle.load(file)
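
A quick way to check an export is to confirm that the restored model predicts the same labels as the original. This is a minimal in-memory sketch on toy data; the documents, labels, and `DecisionTreeClassifier` are assumptions for illustration.

```python
import pickle

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

docs = ["we need water", "send food now", "water is gone", "road is blocked"]
labels = [1, 0, 1, 0]  # 1 = message mentions water

model = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", DecisionTreeClassifier(random_state=0)),
])
model.fit(docs, labels)

# Serialize to bytes and back, mirroring pickle.dump / pickle.load on a file.
restored = pickle.loads(pickle.dumps(model))

same = (restored.predict(docs) == model.predict(docs)).all()
print(same)
```

Note that pickle stores functions such as `tokenize` by reference rather than by value, so any script that loads `pickle_model.pkl` needs the same `tokenize` function importable under the same module name.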

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.
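
The template itself is not shown here, so the following is only a sketch of how such a script might read its inputs. The argument names `database_filepath` and `model_filepath`, the function names, and the example file names are all assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the two paths a training script would need (hypothetical names)."""
    parser = argparse.ArgumentParser(
        description="Train the message classifier and export it as a pickle file."
    )
    parser.add_argument("database_filepath", help="SQLite database with the cleaned messages")
    parser.add_argument("model_filepath", help="destination for the pickled model")
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    print(f"Loading data from {args.database_filepath}")
    # Steps 1 to 6 of this notebook (load, tokenize, build, train, test, tune) would go here.
    print(f"Saving model to {args.model_filepath}")
    # Step 9 (pickle export) would go here.
    return args


# Example invocation with made-up file names.
args = main(["DisasterResponse.db", "classifier.pkl"])
```

In an actual script, the last line would typically be replaced by a call to `main()` under an `if __name__ == "__main__":` guard so the paths come from the command line.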