# ML Pipeline Preparation

### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database 
- Define feature and target variables X and Y

In [1]:
# import libraries
import nltk
nltk.download(['punkt', 'wordnet','stopwords', 'averaged_perceptron_tagger'])


import re
import numpy as np
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn import tree
from sklearn.metrics import f1_score

[nltk_data] Downloading package punkt to C:\Users\Telu
[nltk_data]     Teruno\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to C:\Users\Telu
[nltk_data]     Teruno\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to C:\Users\Telu
[nltk_data]     Teruno\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Telu Teruno\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [2]:
# load data from database
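def load_data():
    """Load the messages table and split it into features X and targets Y."""
    from sqlalchemy import create_engine
    # Connect to the same database that the table is read from
    engine = create_engine('sqlite:///NaturalDisastersMsgs.db')
    df = pd.read_sql_table('Messages', engine)
    X = df['message']
    # Every column other than the identifiers and raw text is a category label
    Y = df.drop(['id', 'message', 'original', 'genre'], axis=1)
    return X, Y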
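
The target frame `Y` is expected to hold one 0/1 column per category, yet the `related` report later in this notebook lists a class 2, and `child_alone` shows only class 0. A minimal sketch for spotting such columns before training follows. It uses a small made-up frame in place of the real table, and the helper names `non_binary_columns` and `constant_columns` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the label frame Y; the real one comes from load_data()
Y_demo = pd.DataFrame({
    "related": [1, 0, 2, 1],
    "request": [0, 1, 0, 0],
    "child_alone": [0, 0, 0, 0],
})


def non_binary_columns(labels):
    """Return the columns whose values are not limited to 0 and 1."""
    return [col for col in labels.columns
            if not set(labels[col].unique()) <= {0, 1}]


def constant_columns(labels):
    """Return the columns that contain a single distinct value."""
    return [col for col in labels.columns if labels[col].nunique() == 1]


print(non_binary_columns(Y_demo))
print(constant_columns(Y_demo))
```

How to treat the columns this flags (for instance, recoding or dropping them) is a modelling decision that this notebook does not settle.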

### 2. Write a tokenization function to process the text data

In [3]:
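def tokenize(text):
    # Split on runs of word characters, which also drops punctuation
    tokenizer = nltk.RegexpTokenizer(r"\w+")
    tokens = tokenizer.tokenize(text)
    # Load the stopword list once per call instead of once per token.
    # The check is case-sensitive and runs before any lowercasing,
    # so a capitalized stopword such as 'Is' is not filtered out.
    stop_words = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stop_words]
    lemmatizer = WordNetLemmatizer()
    stemmer = PorterStemmer()

    clean_tokens = []
    for tok in tokens:
        # Note: the lemmatized value is overwritten on the next line,
        # so the final token is the Porter stem of the raw token.
        clean_tok = lemmatizer.lemmatize(tok).lower().strip()
        clean_tok = stemmer.stem(tok)
        clean_tokens.append(clean_tok)

    return clean_tokens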

In [4]:
#test
X, y = load_data()
for message in X[:5]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'updat', 'cold', 'front', 'cuba', 'could', 'pass', 'haiti'] 

Is the Hurricane over or is it not over
['Is', 'hurrican'] 

Looking for someone but no name
['look', 'someon', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['UN', 'report', 'leogan', '80', '90', 'destroy', 'onli', 'hospit', 'St', 'croix', 'function', 'need', 'suppli', 'desper'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'haiti', 'rest', 'countri', 'today', 'tonight'] 
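
In the output above, 'Is' survives stopword removal because the stopword check runs on the original casing. One possible variant, sketched here with only the standard library, lowercases first and filters afterwards. The small `STOP_WORDS` set and the `normalize` name are illustrative assumptions, and the scores reported in this notebook were not produced with this function.

```python
import re

# Illustrative subset only; the notebook itself uses stopwords.words("english")
STOP_WORDS = {"is", "the", "or", "it", "not", "over", "for", "but", "no"}


def normalize(text, stop_words=STOP_WORDS, stem=None):
    """Lowercase, split on word characters, drop stopwords, optionally stem."""
    tokens = re.findall(r"\w+", text.lower())
    kept = [t for t in tokens if t not in stop_words]
    if stem is None:
        return kept
    # Apply the stemmer to the already normalized token
    return [stem(t) for t in kept]


print(normalize("Is the Hurricane over or is it not over"))
```

Passing `stem=PorterStemmer().stem` would plug in the stemmer that the notebook already imports.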



### 3. Build a machine learning pipeline
This machine learning pipeline takes the `message` column as input and outputs classification results for the 36 category columns in the dataset.

### 4. Train pipeline
- Create a function for displaying the results. It reports the F1 score, precision and recall for each output category of the dataset by iterating through the columns and calling sklearn's `classification_report` on each. At the end it calculates the average weighted F1 score across all columns as a general score for the model.
- Split data into train and test sets
- Build pipeline
- Train pipeline
- Predict results
- Display results

In [5]:
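def display_results():
    """Print a per-category classification report and the mean weighted F1.

    Reads `y_test` (DataFrame) and `y_pred` (2-D array) from the notebook's
    global scope, so both must be defined before this is called.
    """
    f1_scores = []
    for i, column in enumerate(y_test.columns):
        print(column)
        # Weighted F1 for this category: per-class F1 averaged by class support
        f1 = f1_score(y_test.values[:, i], y_pred[:, i], average='weighted')
        f1_scores.append(f1)
        print(classification_report(y_test.values[:, i], y_pred[:, i]))
    print("The average f1 weighted score of all the columns is", np.mean(f1_scores))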
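
`display_results` reads `y_test` and `y_pred` from the notebook's global scope. A variant that takes them as arguments is easier to reuse and to test. The following is a minimal sketch of the averaging step only, run on made-up arrays; the name `mean_weighted_f1` is hypothetical, and `zero_division=0` is an added choice that scores a class with no predictions as 0 instead of emitting a warning.

```python
import numpy as np
from sklearn.metrics import f1_score


def mean_weighted_f1(y_true, y_pred):
    """Average the weighted F1 score over the columns of two 2-D label arrays."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = [
        f1_score(y_true[:, i], y_pred[:, i], average="weighted", zero_division=0)
        for i in range(y_true.shape[1])
    ]
    return float(np.mean(scores))


# Made-up labels for two categories and four messages
y_true_demo = np.array([[1, 0], [0, 1], [1, 1], [0, 0]])
y_perfect = y_true_demo.copy()
y_all_zero = np.zeros_like(y_true_demo)

print(mean_weighted_f1(y_true_demo, y_perfect))
print(mean_weighted_f1(y_true_demo, y_all_zero))
```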
    

In [6]:
X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(KNeighborsClassifier()))
])

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
y_pred = pipeline.predict(X_test)

# display the results
display_results()

related
              precision    recall  f1-score   support

           0       0.70      0.15      0.25      1483
           1       0.84      0.08      0.15      5026
           2       0.01      0.96      0.01        45

    accuracy                           0.10      6554
   macro avg       0.52      0.40      0.14      6554
weighted avg       0.81      0.10      0.17      6554

request
              precision    recall  f1-score   support

           0       0.83      0.99      0.91      5375
           1       0.78      0.08      0.15      1179

    accuracy                           0.83      6554
   macro avg       0.80      0.54      0.53      6554
weighted avg       0.82      0.83      0.77      6554

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6524
           1       0.00      0.00      0.00        30

    accuracy                           1.00      6554
   macro avg       0.50      0.50      0.50      655


food
              precision    recall  f1-score   support

           0       0.89      1.00      0.94      5815
           1       0.82      0.07      0.12       739

    accuracy                           0.89      6554
   macro avg       0.86      0.53      0.53      6554
weighted avg       0.89      0.89      0.85      6554

shelter
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      5970
           1       0.68      0.04      0.08       584

    accuracy                           0.91      6554
   macro avg       0.80      0.52      0.52      6554
weighted avg       0.89      0.91      0.88      6554

clothing
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6469
           1       0.57      0.05      0.09        85

    accuracy                           0.99      6554
   macro avg       0.78      0.52      0.54      6554
weighted avg       0.98      0.99      0.98      65

### 5. Improve the model
I'm going to use grid search to find better parameters. It takes a long time to run, so I will search over just one parameter to try out the grid search; a deeper analysis is not feasible here.

In [7]:
parameters = {
    'tfidf__norm': ['l1', 'l2']}

cv = GridSearchCV(pipeline, param_grid=parameters)

# train classifier
cv.fit(X_train, y_train)

# predict on test data
y_pred = cv.predict(X_test)

# see the best parameters
print("\nBest Parameters:", cv.best_params_)

# display the results
display_results()




Best Parameters: {'tfidf__norm': 'l2'}
related
              precision    recall  f1-score   support

           0       0.70      0.15      0.25      1483
           1       0.84      0.08      0.15      5026
           2       0.01      0.96      0.01        45

    accuracy                           0.10      6554
   macro avg       0.52      0.40      0.14      6554
weighted avg       0.81      0.10      0.17      6554

request
              precision    recall  f1-score   support

           0       0.83      0.99      0.91      5375
           1       0.78      0.08      0.15      1179

    accuracy                           0.83      6554
   macro avg       0.80      0.54      0.53      6554
weighted avg       0.82      0.83      0.77      6554

offer
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6524
           1       0.00      0.00      0.00        30

    accuracy                           1.00      6554
   macro avg


child_alone
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      6554

    accuracy                           1.00      6554
   macro avg       1.00      1.00      1.00      6554
weighted avg       1.00      1.00      1.00      6554

water
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      6113
           1       0.78      0.04      0.08       441

    accuracy                           0.93      6554
   macro avg       0.86      0.52      0.52      6554
weighted avg       0.92      0.93      0.91      6554

food
              precision    recall  f1-score   support

           0       0.89      1.00      0.94      5815
           1       0.82      0.07      0.12       739

    accuracy                           0.89      6554
   macro avg       0.86      0.53      0.53      6554
weighted avg       0.89      0.89      0.85      6554

shelter
              precision    recall  f1-sco


other_weather
              precision    recall  f1-score   support

           0       0.94      1.00      0.97      6189
           1       1.00      0.01      0.02       365

    accuracy                           0.94      6554
   macro avg       0.97      0.50      0.49      6554
weighted avg       0.95      0.94      0.92      6554

direct_report
              precision    recall  f1-score   support

           0       0.81      1.00      0.89      5235
           1       0.77      0.06      0.12      1319

    accuracy                           0.81      6554
   macro avg       0.79      0.53      0.50      6554
weighted avg       0.80      0.81      0.74      6554

The average f1 weighted score of all the columns is 0.8843306112867204


> There is no change from the first attempt because the best value found for `tfidf__norm`, `l2`, is already the default.

> I can't run more tests because this search alone took about an hour.
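
One way to keep a search like this affordable is to use fewer cross-validation folds and to tune on a subsample of the training data. `GridSearchCV` also accepts `n_jobs` to run fits in parallel. The sketch below shows the mechanics on a tiny made-up corpus, including how a parameter of the wrapped classifier is addressed through `clf__estimator__...`. The texts, labels and grid values are assumptions for illustration, and nothing here was run against the disaster-message data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline

# Made-up messages with two made-up label columns: [needs supplies, storm]
texts = ["need water now", "water and food please", "storm is coming",
         "big storm tonight", "send food", "need clean water",
         "storm damage here", "food and water needed"]
labels = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [1, 0], [0, 1], [1, 0]]

demo_pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(KNeighborsClassifier())),
])

# Step name, then `estimator` for the classifier wrapped by MultiOutputClassifier
param_grid = {
    "tfidf__norm": ["l1", "l2"],
    "clf__estimator__n_neighbors": [1, 3],
}

# cv=2 keeps the number of fits low: 4 parameter combinations x 2 folds
search = GridSearchCV(demo_pipeline, param_grid=param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
```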

### 8. Try improving your model further. Here are a few ideas:
* try other machine learning algorithms
* add other features besides the TF-IDF

In [8]:
# Change the machine learning algorithm to a decision tree and see how it performs
X, y = load_data()
X_train, X_test, y_train, y_test = train_test_split(X, y)

pipeline = Pipeline([
    ('vect', CountVectorizer(tokenizer=tokenize)),
    ('tfidf', TfidfTransformer()),
    ('clf', MultiOutputClassifier(tree.DecisionTreeClassifier()))
])

# train classifier
pipeline.fit(X_train, y_train)

# predict on test data
y_pred = pipeline.predict(X_test)

# display the results
display_results()

related
              precision    recall  f1-score   support

           0       0.52      0.48      0.50      1558
           1       0.84      0.85      0.85      4947
           2       0.24      0.49      0.32        49

    accuracy                           0.76      6554
   macro avg       0.53      0.61      0.56      6554
weighted avg       0.76      0.76      0.76      6554

request
              precision    recall  f1-score   support

           0       0.92      0.92      0.92      5458
           1       0.59      0.59      0.59      1096

    accuracy                           0.86      6554
   macro avg       0.75      0.75      0.75      6554
weighted avg       0.86      0.86      0.86      6554

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6516
           1       0.06      0.03      0.04        38

    accuracy                           0.99      6554
   macro avg       0.53      0.51      0.52      655


storm
              precision    recall  f1-score   support

           0       0.96      0.97      0.97      5951
           1       0.66      0.65      0.65       603

    accuracy                           0.94      6554
   macro avg       0.81      0.81      0.81      6554
weighted avg       0.94      0.94      0.94      6554

fire
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      6487
           1       0.30      0.30      0.30        67

    accuracy                           0.99      6554
   macro avg       0.65      0.65      0.65      6554
weighted avg       0.99      0.99      0.99      6554

earthquake
              precision    recall  f1-score   support

           0       0.98      0.98      0.98      5950
           1       0.80      0.79      0.79       604

    accuracy                           0.96      6554
   macro avg       0.89      0.88      0.89      6554
weighted avg       0.96      0.96      0.96      65

> The decision tree classifier gets better results than the KNeighborsClassifier (for example, recall on the positive `request` class rises from 0.08 to 0.59), so we keep this last model as our best.
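
The two runs above call `train_test_split` separately and without a seed, so they are scored on different test sets (the `related` support for class 0 is 1483 in one run and 1558 in the other). A sketch of comparing two classifiers on one fixed split follows. It uses synthetic data from `make_multilabel_classification` rather than the messages, and the seed values and the names `average_f1` and `results` are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.multioutput import MultiOutputClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic multi-label data standing in for the vectorized messages
X_syn, Y_syn = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=4, random_state=0)

# One seeded split shared by both models
X_tr, X_te, Y_tr, Y_te = train_test_split(X_syn, Y_syn, random_state=42)


def average_f1(y_true, y_pred):
    """Mean of the per-column weighted F1 scores."""
    return float(np.mean([
        f1_score(y_true[:, i], y_pred[:, i], average="weighted", zero_division=0)
        for i in range(y_true.shape[1])
    ]))


results = {}
for name, base in [("knn", KNeighborsClassifier()),
                   ("tree", DecisionTreeClassifier(random_state=0))]:
    model = MultiOutputClassifier(base).fit(X_tr, Y_tr)
    results[name] = average_f1(Y_te, model.predict(X_te))

print(results)
```

With a shared split, any difference in the scores comes from the models and not from which messages happened to land in the test set.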


### 9. Export your model as a pickle file

In [10]:
import pickle
# Use a context manager so the file is closed after writing
with open('finalized_model.sav', 'wb') as f:
    pickle.dump(pipeline, f)
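
To confirm that an exported model can be used later, it helps to load it back and compare predictions. The following is a minimal round-trip sketch on a toy pipeline; the texts, labels and the file name `demo_model.sav` are made up. For the real pipeline, the `tokenize` function must be importable wherever the file is loaded, because pickle stores functions by reference and not by value.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Toy training data with two made-up label columns
texts = ["need water", "storm coming", "water please", "big storm"]
labels = [[1, 0], [0, 1], [1, 0], [0, 1]]

model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", MultiOutputClassifier(DecisionTreeClassifier(random_state=0))),
]).fit(texts, labels)

# Write the fitted pipeline to a temporary location, then read it back
path = os.path.join(tempfile.mkdtemp(), "demo_model.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)
with open(path, "rb") as f:
    restored = pickle.load(f)

# The restored pipeline should reproduce the original predictions
same = bool((restored.predict(texts) == model.predict(texts)).all())
print(same)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.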

### 10. Use this notebook to complete `train.py`
Use the template file attached in the Resources folder to write a script that runs the steps above to create a database and export a model based on a new dataset specified by the user.