# ML Pipeline Preparation
This notebook was used to prototype the machine learning pipeline of the disaster response project and to select the best model based on grid search with cross-validation over multiple algorithms and hyperparameters. The final model is built by the Python script train_classifier.py, which is located [here](https://github.com/bruno-f7s/portfolio/tree/main/disaster-response/models/train_classifier.py).

This notebook is not necessary for launching the app, but can be revisited if a new dataset needs to be used to train a new model.

The code below is divided into the following steps:
1. Import libraries
2. Load and prepare the data
3. Define NLP processing functions
4. Build and run a machine learning pipeline
5. Model choice

## 1. Import libraries

In [1]:
# import libraries
import re
import sqlite3
import pandas as pd
import numpy as np
import datetime

import nltk
nltk.download(['wordnet'])
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer


from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.multioutput import MultiOutputClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, make_scorer, accuracy_score, precision_score, recall_score, f1_score

[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\bfernandes\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


## 2. Load and prepare the data

In [2]:
# load data from database
conn = sqlite3.connect('DisasterResponse.db')
query = "SELECT * FROM DisasterResponse"
df = pd.read_sql_query(query, conn)

# remove rows which do not have any label as they will not provide any prediction
label_cols = df.drop(["id","message","original","genre","related","child_alone"], axis=1).columns
df['sum'] = df[label_cols].sum(axis=1)
df = df[df['sum'] != 0]
df = df.drop(["sum"], axis=1)

# create X and y variables
X = df["message"]
y = df.drop(["id","message","original","genre","related","child_alone"], axis=1)

# create the train and validation splits
X_train, X_validation, y_train, y_validation = train_test_split(X, y, test_size=0.33, random_state=42)

In [3]:
# print the distribution of the values for each label
for column in label_cols:
    print(f"Label: {column}")
    print(f"Percentage of 0s: {round((len(y[y[column] == 0])/len(y[column]))*100, 1)}%")
    print(f"Percentage of 1s: {round((len(y[y[column] == 1])/len(y[column]))*100, 1)}%")
    print(f"---------------------")
    
print(f"Total number of rows: {len(df)}")    

Label: request
Percentage of 0s: 69.7%
Percentage of 1s: 30.3%
---------------------
Label: offer
Percentage of 0s: 99.2%
Percentage of 1s: 0.8%
---------------------
Label: aid_related
Percentage of 0s: 26.5%
Percentage of 1s: 73.5%
---------------------
Label: medical_help
Percentage of 0s: 85.9%
Percentage of 1s: 14.1%
---------------------
Label: medical_products
Percentage of 0s: 91.1%
Percentage of 1s: 8.9%
---------------------
Label: search_and_rescue
Percentage of 0s: 95.1%
Percentage of 1s: 4.9%
---------------------
Label: security
Percentage of 0s: 96.8%
Percentage of 1s: 3.2%
---------------------
Label: military
Percentage of 0s: 94.2%
Percentage of 1s: 5.8%
---------------------
Label: water
Percentage of 0s: 88.7%
Percentage of 1s: 11.3%
---------------------
Label: food
Percentage of 0s: 80.2%
Percentage of 1s: 19.8%
---------------------
Label: shelter
Percentage of 0s: 84.3%
Percentage of 1s: 15.7%
---------------------
Label: clothing
Percentage of 0s: 97.3%
Percent

__Observation 1__: In this project, train and validation splits were created __without a holdout set__ (which I would typically use), for a few reasons:
1. This is a multi-label problem with 36 labels, some of which are highly imbalanced. Holding out more data could leave very few examples of the rare class for training, so I decided to use all available data directly.
2. The dataset is not that big, and creating a holdout set would remove even more data from the training process.
3. This code is intended for analyzing and choosing the best model; the final model will be trained on all the data regardless. The results of the train/validation splits are therefore sufficient to make a decision.

__Observation 2__: The label columns "related" and "child_alone" each contain only one class in this dataset, so a model would always predict the same value for them regardless of the input. I decided to __remove them__ (they are dropped when `label_cols` and `y` are created above) because they only add noise and cannot improve the predictions.
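
A quick way to spot such columns is to count the distinct values per label. The sketch below uses a small, made-up table in plain Python so that it runs without dependencies; the values and the helper name `constant_columns` are illustrative assumptions, not part of the project. With a pandas DataFrame, `df.nunique()` gives the same information.

```python
# Toy label table (hypothetical values, not taken from the real dataset)
labels = {
    "related":     [1, 1, 1, 1],
    "request":     [1, 0, 0, 1],
    "child_alone": [0, 0, 0, 0],
    "water":       [0, 1, 0, 0],
}

def constant_columns(table):
    """Return the names of columns that hold a single distinct value."""
    return [name for name, values in table.items() if len(set(values)) <= 1]

# Columns listed here carry no signal for a classifier
print(constant_columns(labels))
```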

## 3. Define NLP processing functions

In [4]:
# create a tokenize function that processes text data
def tokenize(text):
    """
    This function takes in a string, tokenizes it, removes English stop words and applies lemmatization. 
    The result is a list of all resulting tokens. This function can be passed into a tokenizer of a transformer.
    """
    #tokenize text
    tokens = word_tokenize(text)
    
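    #stop word removal (the check runs before lowercasing, so capitalized stop words such as "We" are kept)
    stop_words = set(stopwords.words("english"))
    tokens = [tok.lower().strip() for tok in tokens if tok not in stop_words]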

    #lemmatization of words
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(tok, pos='v') for tok in tokens]

    #remove tokens that begin with a special character (this also drops punctuation-only tokens)
    clean_tokens = [tok for tok in lemmatized_tokens if not re.match(r"^[\W_]+", tok)]
    
    #remove tokens that contain digits as they will not be relevant for this supervised learning problem
    clean_tokens = [tok for tok in clean_tokens if not re.search(r"\d", tok)]
    
    return clean_tokens
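
One detail of the special-character filter is worth spelling out: `re.match` with the pattern `^[\W_]+` only requires a token to begin with a non-word character, so it removes more than punctuation-only tokens. The sketch below compares it with an end-anchored variant. The sample tokens and variable names are made up for illustration.

```python
import re

# Hypothetical tokens, chosen to show where the two patterns differ
tokens = ["water", "...", "'s", "#help", "st.", "__"]

starts_with_special = re.compile(r"^[\W_]+")   # pattern used in tokenize()
only_special = re.compile(r"^[\W_]+$")         # end-anchored variant

# Keep tokens that the respective pattern does NOT match
kept_by_prefix_rule = [t for t in tokens if not starts_with_special.match(t)]
kept_by_full_rule = [t for t in tokens if not only_special.match(t)]

print(kept_by_prefix_rule)  # tokens that start with a letter
print(kept_by_full_rule)    # additionally keeps "'s" and "#help"
```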

In [5]:
# test the results of the tokenize function
for message in X[:5]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Is the Hurricane over or is it not over
['be', 'hurricane'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', 'destroy', 'only', 'hospital', 'st.', 'croix', 'function', 'need', 'supply', 'desperately'] 

Storm at sacred heart of jesus
['storm', 'sacred', 'heart', 'jesus'] 

Please, we need tents and water. We are in Silo, Thank you!
['please', 'need', 'tent', 'water', 'we', 'silo', 'thank'] 

I am in Croix-des-Bouquets. We have health issues. They ( workers ) are in Santo 15. ( an area in Croix-des-Bouquets )
['i', 'croix-des-bouquets', 'we', 'health', 'issue', 'they', 'workers', 'santo', 'area', 'croix-des-bouquets'] 
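
The sample output above still contains words such as 'we', 'i', 'only' and 'they'. This is consistent with the stop-word check being applied to the original-case token: NLTK's English stop-word list is lowercase, so a capitalized "We" does not match it and is lowercased only afterwards. The sketch below reproduces the effect with a tiny stand-in stop-word set and whitespace splitting. Both are simplifying assumptions, and the function names are illustrative.

```python
# Stand-in stop-word set (an assumption for this sketch; the notebook uses
# nltk.corpus.stopwords.words("english"), which is also all lowercase)
STOP_WORDS = {"we", "are", "in", "and", "you"}

def filter_then_lower(tokens):
    """Order used in tokenize(): compare the raw token, then lowercase."""
    return [tok.lower() for tok in tokens if tok not in STOP_WORDS]

def lower_then_filter(tokens):
    """Alternative order: lowercase first, then compare."""
    lowered = [tok.lower() for tok in tokens]
    return [tok for tok in lowered if tok not in STOP_WORDS]

sample = "We need tents and water We are in Silo".split()
print(filter_then_lower(sample))  # capitalized "We" survives as "we"
print(lower_then_filter(sample))  # "we" is removed as a stop word
```

Switching the order would change the vocabulary, so the grid search would have to be re-run before comparing scores.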



In [6]:
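# base classes that provide get_params/set_params and fit_transform
from sklearn.base import BaseEstimator, TransformerMixin

# create a custom transformer to count words
class WordCounter(BaseEstimator, TransformerMixin):
    """Returns the number of whitespace-separated words of each message as a single feature."""
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [[len(text.split())] for text in X]

# create a custom transformer to measure length of message
class CharacterCounter(BaseEstimator, TransformerMixin):
    """Returns the number of characters of each message as a single feature."""
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        return [[len(text)] for text in X]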
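
To see what these two transformers feed into the pipeline, the sketch below repeats minimal versions of the two classes so that it runs on its own, and applies them to two of the sample messages shown earlier. Each message becomes a single-element row, which is the 2-D shape a following scaling step expects.

```python
class WordCounter:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # one row per message, one column holding the word count
        return [[len(text.split())] for text in X]

class CharacterCounter:
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # one row per message, one column holding the character count
        return [[len(text)] for text in X]

messages = [
    "Storm at sacred heart of jesus",
    "Is the Hurricane over or is it not over",
]

word_counts = WordCounter().fit(messages).transform(messages)
char_counts = CharacterCounter().fit(messages).transform(messages)

print(word_counts)
print(char_counts)
```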

## 4. Build and run a machine learning pipeline

In [7]:
#########################################################################
# BUILD A ML PIPELINE
#########################################################################
modelling_start_time = datetime.datetime.now()

# Create a dictionary of ml models with their respective hyperparameters
models = {
    "Logistic Regression": {
        "model": MultiOutputClassifier(OneVsRestClassifier(LogisticRegression())),
        "param_grid": {
            'model__estimator__estimator__penalty': ['elasticnet'],
            'model__estimator__estimator__l1_ratio': [0, 0.5, 1],
            'model__estimator__estimator__C': [1, 10, 25],
            'model__estimator__estimator__solver': ['saga'],
            'model__estimator__estimator__multi_class': ['ovr'],
            'model__estimator__estimator__class_weight': ['balanced'],
            'model__estimator__estimator__random_state': [42],
            'model__estimator__estimator__n_jobs': [-1]
        }
    },
    "Random Forest": {
        "model": MultiOutputClassifier(RandomForestClassifier()),
        "param_grid": {
            'model__estimator__n_estimators': [1, 10, 100],
            'model__estimator__max_depth': [1, 5, 10],
            'model__estimator__class_weight': ['balanced'],
            'model__estimator__random_state': [42]
        }
    },
    "Gradient Boosting": {
        "model": MultiOutputClassifier(GradientBoostingClassifier()),
        "param_grid": {
            "model__estimator__n_estimators": [10, 50, 100],
            "model__estimator__learning_rate": [0.1, 0.05, 0.01],
            "model__estimator__max_depth": [3, 5, 7]
        }
    }
}


# Define the scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'recall': make_scorer(recall_score, average='weighted', zero_division=1),
    'f1': make_scorer(f1_score, average='weighted', zero_division=1)
}


# Iterate over the models and perform grid search with cross-validation
for model_name, model_config in models.items():
    print(f"Training the model {model_name} using grid search with cross-validation...")
    
    # Define the pipeline with preprocessing and model
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
            ('token_count', CountVectorizer(tokenizer=tokenize)),
            ('word_count', Pipeline([
                ('count', WordCounter()),
                ('scale', StandardScaler())
            ])),
            ('character_count', Pipeline([
                ('count', CharacterCounter()),
                ('scale', StandardScaler())
            ]))
        ])),
        ('model', model_config['model'])
    ])
    
    # Perform grid search with cross-validation
    grid_search = GridSearchCV(pipeline, 
                               param_grid=model_config["param_grid"], 
                               cv=5, 
                               scoring=scoring, 
                               return_train_score=True,
                               refit="f1",
                               verbose=1)
    
    grid_search.fit(X_train, y_train)
    
    # Get the best model and its evaluation metric score
    best_model = grid_search.best_estimator_
    best_score = grid_search.best_score_
    
    # Predict on the validation set and evaluate the performance
    y_pred = best_model.predict(X_validation)
    
    # Evaluate the accuracy of each label (element-wise comparison, averaged per column)
    accuracy = (y_validation == y_pred).mean()
    
    # Print the results
    print(f"Model: {model_name}")
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Best Score: {best_score}")
    print(f"Overall accuracy: {accuracy}")
    print(classification_report(y_validation, y_pred, target_names=y.columns, zero_division=1))
    print("------------------------------------------------------------------")

modelling_end_time = datetime.datetime.now()

print("Models successfully trained!")
print(f"Time it took for training: {str(modelling_end_time - modelling_start_time)}")

Training the model Logistic Regression using grid search with cross-validation...
Fitting 5 folds for each of 9 candidates, totalling 45 fits
Model: Logistic Regression
Best Parameters: {'model__estimator__estimator__C': 1, 'model__estimator__estimator__class_weight': 'balanced', 'model__estimator__estimator__l1_ratio': 0, 'model__estimator__estimator__multi_class': 'ovr', 'model__estimator__estimator__n_jobs': -1, 'model__estimator__estimator__penalty': 'elasticnet', 'model__estimator__estimator__random_state': 42, 'model__estimator__estimator__solver': 'saga'}
Best Score: 0.4278361162311718
Overall accuracy: request                   0.803520
offer                     0.962546
aid_related               0.626484
medical_help              0.634875
medical_products          0.706508
search_and_rescue         0.770774
security                  0.813754
military                  0.774048
water                     0.634056
food                      0.646541
shelter                   0.6109

Model: Gradient Boosting
Best Parameters: {'model__estimator__learning_rate': 0.1, 'model__estimator__max_depth': 7, 'model__estimator__n_estimators': 100}
Best Score: 0.3363166671615735
Overall accuracy: request                   0.802088
offer                     0.985878
aid_related               0.732910
medical_help              0.852436
medical_products          0.910970
search_and_rescue         0.947401
security                  0.953745
military                  0.945354
water                     0.892755
food                      0.793901
shelter                   0.841588
clothing                  0.960295
money                     0.955792
missing_people            0.963979
refugees                  0.941056
death                     0.919771
other_aid                 0.762792
infrastructure_related    0.889480
transport                 0.919975
buildings                 0.909537
electricity               0.957634
tools                     0.983422
hospitals                
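
Two different notions of accuracy appear in this notebook. The "Overall accuracy" printed above comes from `(y_validation == y_pred).mean()`, which yields one accuracy value per label. In contrast, `accuracy_score` applied to multi-label targets computes subset accuracy, that is, the share of rows in which every label is predicted correctly. The toy example below (made-up labels, plain Python, illustrative helper names) shows how far apart the two can be.

```python
# Rows are messages, columns are labels (hypothetical values)
y_true = [[1, 0, 1],
          [0, 0, 1],
          [1, 1, 0],
          [0, 0, 0]]
y_pred = [[1, 0, 0],
          [0, 0, 1],
          [1, 0, 0],
          [0, 0, 0]]

def per_label_accuracy(y_true, y_pred):
    """Fraction of correct predictions, computed separately for each column."""
    n_rows = len(y_true)
    n_labels = len(y_true[0])
    return [
        sum(t[j] == p[j] for t, p in zip(y_true, y_pred)) / n_rows
        for j in range(n_labels)
    ]

def subset_accuracy(y_true, y_pred):
    """Fraction of rows in which all labels match exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(per_label_accuracy(y_true, y_pred))
print(subset_accuracy(y_true, y_pred))
```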

## 5. Model choice
The purpose of this project was not to reach a specific performance target but rather to create an ML pipeline that handles an NLP task and feeds into an app. Nevertheless, I applied GridSearchCV to search over multiple parameter combinations for different algorithms. I also tracked several scoring metrics but tuned for the best F1 score. I chose this metric because many labels in the data are very imbalanced. Accuracy would be too simplistic a metric and precision was highly inflated, so I balanced it against recall using the F1 score.
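
The effect of imbalance on accuracy can be illustrated with the `offer` label, which is positive in only 0.8% of the rows according to the distribution printed earlier. The sketch below builds a made-up sample with that positive rate and scores a baseline that always predicts 0. The sample size of 1000 and the helper name `binary_scores` are assumptions for illustration.

```python
def binary_scores(y_true, y_pred):
    """Accuracy, precision, recall and F1 for one binary label.

    Undefined ratios (division by zero) are scored as 0.0 in this sketch.
    """
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# 8 positives out of 1000 rows, i.e. a 0.8% positive rate
y_true = [1] * 8 + [0] * 992
# Baseline that never predicts the positive class
y_pred = [0] * 1000

scores = binary_scores(y_true, y_pred)
print(scores)
```

The baseline reaches a high accuracy while finding none of the positive messages, which is why recall and F1 are more informative here. Note that the notebook sets `zero_division=1`, which scores an undefined precision or recall as 1 rather than 0 as in this sketch.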

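Based on these 3 algorithms and the parameter combinations tested, I chose __logistic regression__ because it had the best balance of scores. Its best cross-validated weighted F1 score was about 0.43, compared with about 0.34 for gradient boosting.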