# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [441]:
# import libraries
import re
import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
nltk.download(['punkt', 'stopwords', 'wordnet']) # for word_tokenize, stopwords and lemmatizer, respectively 
from nltk.stem.wordnet import WordNetLemmatizer

# Feature Extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV


[nltk_data] Downloading package punkt to C:\Users\Thiago
[nltk_data]     Senra\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Thiago
[nltk_data]     Senra\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Thiago
[nltk_data]     Senra\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [442]:
# load data from database
engine = create_engine('sqlite:///DisasterResponse.db')
df = pd.read_sql_table('project_data', con=engine)

X = df['message'].values

# Just category columns
y = df.iloc[:,4:].values
categories = df.iloc[:,4:].columns

### 2. Write a tokenization function to process your text data

In [443]:
def tokenize(text):
    """ Transform text string in a token list.
    
    Args:
    text: str. The text to tokenize. 
    stop_words: bool. If is true, remove the stop words 
    lemmatize (bool). If is true, lemmatize the tokens.
    
    Returns:
    tokens: list. Return a list of tokens from the text.
    
    """
    # Normalize text
    text = text.lower()
    text = re.sub(r'[^a-zA-Z0-0]'," ", text)
    
    
    #Tokenize text
    tokens = word_tokenize(text)
    
    # Remove stop words
    tokens = [w for w in tokens if w not in stopwords.words("english")]
        
    # Reduce words to their root form
    tokens = [WordNetLemmatizer().lemmatize(w, pos='v') for w in tokens]
    
        
    return tokens    
    
    
# Unit Test
def check_result_1():
    text = "Turns out they don't want to keep building storage tanks indefinitely and now they're just going \
    to start dumping old water into the ocean to make room for the contaminated waste water still being generated \
    to this day."
    
    tokens = tokenize(text)
    
    non_alpha_char = 0
    for token in tokens:
        if re.match(r'[^a-z0-9]+', token):
            print(token)
            non_alpha_char += 1
    print('Non alphanumeric characters: {}'.format(non_alpha_char))
    #print('Great job, you made it to the end of the code checks!')
    
    print(tokenize(text))

check_result_1()


Non alphanumeric characters: 0
['turn', 'want', 'keep', 'build', 'storage', 'tank', 'indefinitely', 'go', 'start', 'dump', 'old', 'water', 'ocean', 'make', 'room', 'contaminate', 'waste', 'water', 'still', 'generate', 'day']


### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset. You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.

In [444]:
def ml_pipeline():
    
    """ 
    Machine Learning Pipeline that transform text data in a matrix and apply
    a Multiple Output Classifier based on Random Forest Classifier model 
    
    """
    
    # Instantiate pipeline
    pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)), 
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state=42)))
    ])
    
    return pipeline


def check_results_2():
    X = ['What he does talk about: Hypothetically soliciting bribes from Exxon.', 
          'I would say "jokes" here because he seemed to be using a hypothetical', 
          "Exxon wasn't laughing", "The company immediately said its CEO had no contact like this with Trump.",
          "Trump, uncorked, says things like this.", " We should still take note of them.", 
          'His words: "So I call some guy,the head of Exxon.']
    y = np.array([[0, 0, 1, 1, 1, 0, 1], [0, 0, 1, 1, 1, 0, 1]]).transpose()  
    
    model = ml_pipeline()
    #train
    model.fit(X, y)
    # predict
    y_pred = model.predict(X)

#check_results_2()
    

### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [445]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2, random_state=42)

model = ml_pipeline()
model.fit(X_train, y_train)


Pipeline(steps=[('vect',
                 CountVectorizer(tokenizer=<function tokenize at 0x000002385FE17E58>)),
                ('tfidf', TfidfTransformer()),
                ('clf',
                 MultiOutputClassifier(estimator=RandomForestClassifier(random_state=42)))])

### 5. Test your model
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [446]:
y_preds = model.predict(X_test)

In [447]:

for i,cat in enumerate(categories):
    classification = classification_report(y_test[:,i], y_preds[:,i])
    print(cat+':\n')
    print(classification+'\n')


related:

              precision    recall  f1-score   support

           0       0.72      0.45      0.56      1290
           1       0.84      0.94      0.89      3946

    accuracy                           0.82      5236
   macro avg       0.78      0.70      0.72      5236
weighted avg       0.81      0.82      0.81      5236


request:

              precision    recall  f1-score   support

           0       0.90      0.98      0.94      4311
           1       0.85      0.47      0.61       925

    accuracy                           0.89      5236
   macro avg       0.87      0.73      0.77      5236
weighted avg       0.89      0.89      0.88      5236


offer:

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5208
           1       0.00      0.00      0.00        28

    accuracy                           0.99      5236
   macro avg       0.50      0.50      0.50      5236
weighted avg       0.99      0.99      0.99 

  _warn_prf(average, modifier, msg_start, len(result))



              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5133
           1       0.00      0.00      0.00       103

    accuracy                           0.98      5236
   macro avg       0.49      0.50      0.50      5236
weighted avg       0.96      0.98      0.97      5236


military:

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      5044
           1       0.75      0.03      0.06       192

    accuracy                           0.96      5236
   macro avg       0.86      0.52      0.52      5236
weighted avg       0.96      0.96      0.95      5236


child_alone:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00      5236

    accuracy                           1.00      5236
   macro avg       1.00      1.00      1.00      5236
weighted avg       1.00      1.00      1.00      5236


water:

              precision    recall  f

earthquake:

              precision    recall  f1-score   support

           0       0.98      0.99      0.99      4780
           1       0.87      0.81      0.84       456

    accuracy                           0.97      5236
   macro avg       0.93      0.90      0.91      5236
weighted avg       0.97      0.97      0.97      5236


cold:

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5123
           1       0.88      0.06      0.12       113

    accuracy                           0.98      5236
   macro avg       0.93      0.53      0.55      5236
weighted avg       0.98      0.98      0.97      5236


other_weather:

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      4993
           1       0.61      0.05      0.08       243

    accuracy                           0.95      5236
   macro avg       0.78      0.52      0.53      5236
weighted avg       0.94      0.95   

### 6. Improve your model
Use grid search to find better parameters. 

In [479]:

pipe = ml_pipeline()
parameters = {'clf__estimator__n_estimators': (50, 100, 150) }


cv = GridSearchCV(pipe, param_grid=parameters)
cv.fit(X_train, y_train)

GridSearchCV(estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x000002385FE17E58>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=RandomForestClassifier(random_state=42)))]),
             param_grid={'clf__estimator__n_estimators': (50, 100, 150)})

In [480]:
cv.get_params()

{'cv': None,
 'error_score': nan,
 'estimator__memory': None,
 'estimator__steps': [('vect',
   CountVectorizer(tokenizer=<function tokenize at 0x000002385FE17E58>)),
  ('tfidf', TfidfTransformer()),
  ('clf',
   MultiOutputClassifier(estimator=RandomForestClassifier(random_state=42)))],
 'estimator__verbose': False,
 'estimator__vect': CountVectorizer(tokenizer=<function tokenize at 0x000002385FE17E58>),
 'estimator__tfidf': TfidfTransformer(),
 'estimator__clf': MultiOutputClassifier(estimator=RandomForestClassifier(random_state=42)),
 'estimator__vect__analyzer': 'word',
 'estimator__vect__binary': False,
 'estimator__vect__decode_error': 'strict',
 'estimator__vect__dtype': numpy.int64,
 'estimator__vect__encoding': 'utf-8',
 'estimator__vect__input': 'content',
 'estimator__vect__lowercase': True,
 'estimator__vect__max_df': 1.0,
 'estimator__vect__max_features': None,
 'estimator__vect__min_df': 1,
 'estimator__vect__ngram_range': (1, 1),
 'estimator__vect__preprocessor': None,
 

In [485]:
print(cv.best_score_)
print(cv.best_params_)
print(cv.score(X_test,y_test))

0.2694330309312857
{'clf__estimator__n_estimators': 150}


0.2744461420932009

### 7. Test your model
Show the accuracy, precision, and recall of the tuned model.  

Since this project focuses on code quality, process, and  pipelines, there is no minimum performance metric needed to pass. However, make sure to fine tune your models for accuracy, precision and recall to make your project stand out - especially for your portfolio!

In [486]:
y_preds = cv.predict(X_test)

for i,cat in enumerate(categories):
    classification = classification_report(y_test[:,i], y_preds[:,i])
    print(cat+':\n')
    print(classification+'\n')


related:

              precision    recall  f1-score   support

           0       0.73      0.45      0.55      1290
           1       0.84      0.95      0.89      3946

    accuracy                           0.82      5236
   macro avg       0.78      0.70      0.72      5236
weighted avg       0.81      0.82      0.81      5236


request:

              precision    recall  f1-score   support

           0       0.90      0.98      0.94      4311
           1       0.85      0.48      0.61       925

    accuracy                           0.89      5236
   macro avg       0.87      0.73      0.78      5236
weighted avg       0.89      0.89      0.88      5236


offer:

              precision    recall  f1-score   support

           0       0.99      1.00      1.00      5208
           1       0.00      0.00      0.00        28

    accuracy                           0.99      5236
   macro avg       0.50      0.50      0.50      5236
weighted avg       0.99      0.99      0.99 

  _warn_prf(average, modifier, msg_start, len(result))



              precision    recall  f1-score   support

           0       0.95      1.00      0.98      4975
           1       0.78      0.07      0.13       261

    accuracy                           0.95      5236
   macro avg       0.87      0.53      0.55      5236
weighted avg       0.94      0.95      0.93      5236


search_and_rescue:

              precision    recall  f1-score   support

           0       0.97      1.00      0.99      5088
           1       0.80      0.03      0.05       148

    accuracy                           0.97      5236
   macro avg       0.89      0.51      0.52      5236
weighted avg       0.97      0.97      0.96      5236


security:

              precision    recall  f1-score   support

           0       0.98      1.00      0.99      5133
           1       0.00      0.00      0.00       103

    accuracy                           0.98      5236
   macro avg       0.49      0.50      0.49      5236
weighted avg       0.96      0.98      0

other_weather:

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      4993
           1       0.69      0.05      0.08       243

    accuracy                           0.95      5236
   macro avg       0.82      0.52      0.53      5236
weighted avg       0.94      0.95      0.94      5236


direct_report:

              precision    recall  f1-score   support

           0       0.86      0.98      0.92      4183
           1       0.83      0.35      0.49      1053

    accuracy                           0.85      5236
   macro avg       0.84      0.67      0.70      5236
weighted avg       0.85      0.85      0.83      5236




### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

### 9. Export your model as a pickle file

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.