# ML Pipeline Preparation
This notebook was used to help set up the ML pipeline of the disaster recovery project and to get a better understanding of the data at hand and of the performance metrics. Running it is therefore not required to run the app or to create the model file. The script version of this code can be found in the train_classifier.py file, located [here]().

The code below is divided into the following steps:
1. Import libraries
2. Load and prepare the data
3. Define NLP processing functions
4. Build and run a machine learning pipeline
5. Retrain the final model using the appropriate algorithm with a chosen parameter combination
6. Export the final model as a pickle file

## 1. Import libraries

In [1]:
# import libraries
import re
import sqlite3
import pandas as pd
import numpy as np
import datetime

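import nltk
# word_tokenize needs 'punkt' and the stop word list needs 'stopwords'
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download(['punkt', 'stopwords', 'wordnet'])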
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer


from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.multioutput import MultiOutputClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import classification_report, make_scorer, accuracy_score, precision_score, recall_score, f1_score

## 2. Load and prepare the data

In [7]:
# load data from database
conn = sqlite3.connect('DisasterResponse.db')
query = "SELECT * FROM DisasterResponse"
df = pd.read_sql_query(query, conn)

# remove rows which do not have any label as they will not provide any prediction
label_cols = df.drop(["id","message","original","genre"], axis=1).columns
df['sum'] = df[label_cols].sum(axis=1)
df = df[df['sum'] != 0]
df = df.drop(["sum"], axis=1)

# create X and y variables
X = df["message"]
y = df.drop(["id","message","original","genre"], axis=1)

# create the train and test splits
X_train, X_validation, y_train, y_validation = train_test_split(X.values, y.values, test_size=0.2, random_state=101)

__Observation__: In this project only train and validation splits were created, without a holdout set, for a few reasons:
1. This is a multi-label problem with 36 labels, some of which are highly imbalanced. Holding out more data could leave very few examples of a rare category for training, so I decided to use all the available data directly.
2. The dataset is not that big, and creating a holdout set would remove even more data from the training process.
3. This code is intended to run as a Python script in an automated process whenever new data is added, so I thought that the scoring metrics of the best estimator should be sufficient to get an idea of the performance.
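To make the imbalance concern concrete, here is a minimal sketch that counts how many positive examples of each label end up in each split. The data is synthetic and the helper name `label_support` is hypothetical; the rare label is given a prevalence of about 0.6%, roughly that of `offer` in this dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def label_support(y_train, y_val, label_names):
    """Return {label: (positives in train, positives in validation)}."""
    train_counts = y_train.sum(axis=0)
    val_counts = y_val.sum(axis=0)
    return {
        name: (int(t), int(v))
        for name, t, v in zip(label_names, train_counts, val_counts)
    }


# Synthetic stand-in for the label matrix: one common and one rare label
rng = np.random.default_rng(101)
y_demo = np.column_stack([
    rng.random(1000) < 0.5,    # common label
    rng.random(1000) < 0.006,  # rare label
]).astype(int)
X_demo = np.arange(1000)

_, _, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=101
)

support = label_support(y_tr, y_va, ["common", "rare"])
for name, (n_train, n_val) in support.items():
    print(f"{name}: train={n_train}, validation={n_val}")
```

With only a handful of positives for a rare label, each additional split can leave one side with almost none, which is the situation the first reason above tries to avoid.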

In [8]:
print(len(df))

20106


In [9]:
y.sum(axis=0)

related                   20106.0
request                    4480.0
offer                       119.0
aid_related               10878.0
medical_help               2087.0
medical_products           1314.0
search_and_rescue           724.0
security                    471.0
military                    860.0
child_alone                   0.0
water                      1674.0
food                       2930.0
shelter                    2319.0
clothing                    406.0
money                       604.0
missing_people              299.0
refugees                    876.0
death                      1196.0
other_aid                  3448.0
infrastructure_related     1705.0
transport                  1203.0
buildings                  1335.0
electricity                 534.0
tools                       159.0
hospitals                   283.0
shops                       120.0
aid_centers                 309.0
other_infrastructure       1151.0
weather_related            7302.0
floods        
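
The counts above can be screened for labels that are hard or impossible to learn. The sketch below is only an illustration: it copies a few of the counts from the output, assumes the labels are binary (0/1) so that a sum equal to the row count means every row is positive, and uses a hypothetical 1% threshold (`rare_share`) to call a label rare.

```python
# A few positive counts copied from the output above; N_ROWS is len(df)
N_ROWS = 20106
positives = {
    "related": 20106,
    "request": 4480,
    "offer": 119,
    "child_alone": 0,
    "tools": 159,
    "shops": 120,
}


def flag_labels(positives, n_rows, rare_share=0.01):
    """Split labels into single-class and rare ones, assuming binary 0/1 labels."""
    single_class = [name for name, count in positives.items() if count in (0, n_rows)]
    rare = [name for name, count in positives.items() if 0 < count < rare_share * n_rows]
    return single_class, rare


single_class, rare = flag_labels(positives, N_ROWS)
print("single-class labels:", single_class)  # ['related', 'child_alone']
print("labels under 1% positives:", rare)    # ['offer', 'tools', 'shops']
```

A single-class label gives a classifier nothing to separate, and some estimators may refuse to fit on such a target, so these columns are worth checking before training.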

## 3. Define NLP processing functions

In [10]:
# create a tokenize function that processes text data
def tokenize(text):
    """
    This function takes in a string, tokenizes it, removes English stop words, then applies
    stemming and lemmatization. The result is a list of all resulting tokens.
    This function can be passed as the tokenizer of a vectorizer.
    """
    # tokenize text
    tokens = word_tokenize(text)

    # stop word removal; the stop word list is built once per call instead of once per token
    # note: the check runs before lowercasing, so capitalized stop words such as "Is" are kept
    stop_words = set(stopwords.words("english"))
    tokens = [tok.lower().strip() for tok in tokens if tok not in stop_words]

    # stemming of words
    stemmer = PorterStemmer()
    stemmed_tokens = [stemmer.stem(tok) for tok in tokens]

    # lemmatization of words
    lemmatizer = WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(tok, pos='v') for tok in stemmed_tokens]

    # remove tokens that start with a special character, which covers pure punctuation tokens
    clean_tokens = [tok for tok in lemmatized_tokens if not re.match(r"^[\W_]+", tok)]

    # remove tokens that contain digits as they will not be relevant for this supervised learning problem
    clean_tokens = [tok for tok in clean_tokens if not re.search(r"\d", tok)]

    return clean_tokens

In [11]:
# test the results of the tokenize function
for message in X[:5]:
    tokens = tokenize(message)
    print(message)
    print(tokens, '\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'updat', 'cold', 'front', 'cuba', 'could', 'pass', 'haiti'] 

Is the Hurricane over or is it not over
['be', 'hurrican'] 

Looking for someone but no name
['look', 'someon', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogan', 'destroy', 'onli', 'hospit', 'st.', 'croix', 'function', 'need', 'suppli', 'desper'] 

says: west side of Haiti, rest of the country today and tonight
['say', 'west', 'side', 'haiti', 'rest', 'countri', 'today', 'tonight'] 
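
The two regex filters at the end of `tokenize` can be tried in isolation, without any NLTK data. This is a small sketch on a hand-written token list; the helper name `filter_tokens` is hypothetical. Because `re.match` anchors at the start of the token, anything that begins with a special character is dropped, while a token such as `st.` survives.

```python
import re

STARTS_SPECIAL = re.compile(r"^[\W_]+")  # token begins with non-word characters or underscores
HAS_DIGIT = re.compile(r"\d")            # token contains at least one digit


def filter_tokens(tokens):
    """Apply the same two filters as the end of tokenize()."""
    kept = [tok for tok in tokens if not STARTS_SPECIAL.match(tok)]
    return [tok for tok in kept if not HAS_DIGIT.search(tok)]


sample = ["st.", "-", "80-90", "haiti", "...", "_", "'s", "need"]
filtered = filter_tokens(sample)
print(filtered)  # ['st.', 'haiti', 'need']
```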



In [12]:
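# create a custom transformer to count words
from sklearn.base import BaseEstimator, TransformerMixin

class WordCounter(BaseEstimator, TransformerMixin):
    """Return the number of whitespace-separated words of each message."""
    # BaseEstimator and TransformerMixin supply get_params/set_params and fit_transform,
    # the interface scikit-learn expects from custom transformers
    def fit(self, X, y=None):
        # nothing to learn from the data
        return self
    
    def transform(self, X):
        return [[len(text.split())] for text in X]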
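
As a quick check of the idea, the following sketch chains a word-count transformer with `StandardScaler`, the same combination used for the `word_count` branch of the pipeline. The class name `WordCounterSketch` is hypothetical, and inheriting from `BaseEstimator` and `TransformerMixin` is an assumption made here so that the class exposes the standard estimator interface.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class WordCounterSketch(BaseEstimator, TransformerMixin):
    """Stateless transformer: one column holding the word count of each text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(text.split())] for text in X], dtype=float)


messages = [
    "Looking for someone but no name",
    "Is the Hurricane over or is it not over",
    "Needs supplies desperately.",
]

# Raw counts first, then the scaled version produced by the small pipeline
counts = WordCounterSketch().transform(messages)
word_count = Pipeline([
    ("count", WordCounterSketch()),
    ("scale", StandardScaler()),
])
scaled = word_count.fit_transform(messages)

print(counts.ravel())  # [6. 9. 3.]
print(scaled.shape)    # (3, 1)
```

Scaling matters here because the raw word count lives on a very different scale than the TF-IDF values it is concatenated with, which affects distance-based models such as KNN.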

## 4. Build and run a machine learning pipeline

In [18]:
#########################################################################
# BUILD A ML PIPELINE
#########################################################################
modelling_start_time = datetime.datetime.now()

# Create a dictionary of ml models with their respective hyperparameters
models = {
    "KNN": {
        "model": MultiOutputClassifier(KNeighborsClassifier()),
        "param_grid": {
            "model__estimator__n_neighbors": [3, 5],
            "model__estimator__p": [1,2],
            "model__estimator__n_jobs": [-1],
        }
    },
    "Random Forest": {
        "model": MultiOutputClassifier(RandomForestClassifier()),
        "param_grid": {
            'model__estimator__n_estimators': [1, 10, 100],
            'model__estimator__max_depth': [1, 5, 10],
            'model__estimator__random_state': [42]
        }
    },
    "Gradient Boosting": {
        "model": MultiOutputClassifier(GradientBoostingClassifier()),
        "param_grid": {
            "model__estimator__n_estimators": [10, 50, 100],
            "model__estimator__learning_rate": [0.1, 0.05, 0.01],
            "model__estimator__max_depth": [3, 5, 7]
        }
    }
}


# Define the scoring metrics
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'f1': make_scorer(f1_score, average='weighted', zero_division=1)
}


# Iterate over the models and perform grid search with cross-validation
for model_name, model_config in models.items():
    print(f"Training the model {model_name} using grid search with cross-validation...")
    
    # Define the pipeline with preprocessing and model
    pipeline = Pipeline([
        ('features', FeatureUnion([
            ('tfidf', TfidfVectorizer(tokenizer=tokenize)),
            ('token_count', CountVectorizer(tokenizer=tokenize)),
            ('word_count', Pipeline([
                ('count', WordCounter()),
                ('scale', StandardScaler())
            ]))
        ])),
        ('model', model_config['model'])
    ])
    
    # Perform grid search with cross-validation
    grid_search = GridSearchCV(pipeline, 
                               param_grid=model_config["param_grid"], 
                               cv=5, 
                               scoring=scoring, 
                               return_train_score=True,
                               refit="f1",
                               verbose=1)
    
    grid_search.fit(X_train, y_train)
    
    # Get the best model and its evaluation metric score
    best_model = grid_search.best_estimator_
    best_score = grid_search.best_score_
    
    # Predict on the validation set and evaluate the performance
    y_pred = best_model.predict(X_validation)
    
    # Evaluate the overall accuracy (share of individual label predictions that are correct)
    accuracy = (y_validation == y_pred).mean()
    
    # Print the results
    print(f"Model: {model_name}")
    print(f"Best Parameters: {grid_search.best_params_}")
    print(f"Best Score: {best_score}")
    print(f"Overall accuracy: {accuracy}")
    print(classification_report(y_validation, y_pred, target_names=y.columns, zero_division=1))
    print("------------------------------------------------------------------")

modelling_end_time = datetime.datetime.now()

print("Models successfully trained!")
print(f"Time it took for training: {str(modelling_end_time - modelling_start_time)}")

Fitting 5 folds for each of 4 candidates, totalling 20 fits


KeyboardInterrupt: 
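
The run above was interrupted, so it helps to estimate the size of the search before starting it. The sketch below copies the three parameter grids from the `models` dictionary and counts candidates and cross-validation fits; the helper name `count_fits` is hypothetical.

```python
from math import prod

# Parameter grids copied from the models dictionary above
param_grids = {
    "KNN": {
        "model__estimator__n_neighbors": [3, 5],
        "model__estimator__p": [1, 2],
        "model__estimator__n_jobs": [-1],
    },
    "Random Forest": {
        "model__estimator__n_estimators": [1, 10, 100],
        "model__estimator__max_depth": [1, 5, 10],
        "model__estimator__random_state": [42],
    },
    "Gradient Boosting": {
        "model__estimator__n_estimators": [10, 50, 100],
        "model__estimator__learning_rate": [0.1, 0.05, 0.01],
        "model__estimator__max_depth": [3, 5, 7],
    },
}
CV_FOLDS = 5  # matches cv=5 in GridSearchCV


def count_fits(grid, cv_folds):
    """Return (number of parameter combinations, number of cross-validation fits)."""
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * cv_folds


fit_counts = {name: count_fits(grid, CV_FOLDS) for name, grid in param_grids.items()}
for name, (n_candidates, n_fits) in fit_counts.items():
    print(f"{name}: {n_candidates} candidates, {n_fits} fits")
```

The KNN figures (4 candidates, 20 fits) match the log line above. Each fit also trains one estimator per label through `MultiOutputClassifier`, and every fit re-runs `tokenize` on the training messages, so the larger grids take considerably longer.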

## 5. Train the final model using the appropriate algorithm with a chosen parameter combination

## 6. Export the final model as a pickle file

In [None]:
# Persist the model
import joblib

joblib.dump(final_model, "disaster_response_classifier.pkl")

# Persist the label columns as well (X is a Series of messages and has no columns attribute)
joblib.dump(list(y.columns), "columns_disaster_response_classifier.pkl")
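
For completeness, here is a minimal round-trip sketch showing how the two files could be written and read back with `joblib`. The dictionary `final_model_stub` and the short label list are hypothetical stand-ins for the fitted model and the label columns, and a temporary directory is used so that nothing is written next to the notebook.

```python
import os
import tempfile

import joblib

# Hypothetical stand-ins for the fitted model and the label column names
final_model_stub = {"note": "stand-in for a fitted estimator"}
label_names = ["related", "request", "offer"]

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "disaster_response_classifier.pkl")
    columns_path = os.path.join(tmp_dir, "columns_disaster_response_classifier.pkl")

    # Write both objects to disk
    joblib.dump(final_model_stub, model_path)
    joblib.dump(label_names, columns_path)

    # Read them back, as a consuming script would do
    loaded_model = joblib.load(model_path)
    loaded_labels = joblib.load(columns_path)

print(loaded_model == final_model_stub)
print(loaded_labels)
```

Loading a real pipeline this way also requires the `tokenize` function and any custom transformer to be importable in the loading process, since pickled objects store references to them rather than their code.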