# ML Pipeline

### *Importing Libraries and Loading data from database.*
- Importing Python libraries
- Loading dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Defining feature and target variables X and Y

In [31]:
# import libraries
import pickle
import string
import unittest

import time
import progressbar

import pandas as pd
import numpy as np
from sqlalchemy import create_engine
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import classification_report, f1_score, make_scorer
from sklearn.base import BaseEstimator, TransformerMixin

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Shubham\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Shubham\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Shubham\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Shubham\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [32]:
# loading data from database
engine = create_engine('sqlite:///data/disaster_messages.db')
data = pd.read_sql_table('messages', engine)
data.sample(10)

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
8181,9181,we didn't made nothing at all for change the c...,cher de n'avoir rien fait pour changer ce pauv...,direct,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
16665,19596,One of the main difficulties in fighting the f...,,news,1,0,0,0,0,0,...,0,0,1,0,0,1,0,0,0,0
11971,13832,http://twitpic.com/18el95 - Daughters's bedroo...,,social,1,0,0,0,0,0,...,0,0,1,0,0,0,1,0,0,1
20681,24065,Critics point out that what Somalia will have ...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10909,12541,"We have a small mass of clothing , non-perisha...",,direct,1,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
17000,19970,No rainfall was recorded in the main river bas...,,news,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
17115,20100,"There's also volatility, of a different kind, ...",,news,1,0,0,1,1,0,...,0,1,0,0,0,0,0,0,0,0
21340,24789,"The programme focused on hiring, training and ...",,news,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
20654,24033,Heavy portable pump (petrol driven) or tractor...,,news,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
25239,29151,Reports of retaliatory violence by the nationa...,,news,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
# Assigning feature (X) and target (Y) variables
X = data.message
Y = data.loc[:, 'related':'direct_report']
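
The `.loc` slice above is label-based, so both endpoint columns are included in `Y`. A minimal sketch on a toy frame may make this clearer; the values are made up, and only `message`, `related`, `request` and `direct_report` mirror column names from the real table:

```python
import pandas as pd

# Toy frame standing in for the messages table; all values are invented.
toy = pd.DataFrame({
    "id": [1, 2],
    "message": ["need water", "road blocked"],
    "genre": ["direct", "news"],
    "related": [1, 1],
    "request": [1, 0],
    "direct_report": [1, 0],
})

X_toy = toy["message"]
# Label slices in .loc include BOTH endpoints, unlike positional slices.
Y_toy = toy.loc[:, "related":"direct_report"]

print(list(Y_toy.columns))
```

The same slice would silently pick up any extra column that sits between the two endpoints, so it relies on the category columns being contiguous in the table.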

### *Writing a Tokenization Function to process Text Data*

In [34]:
def tokenize(text):
    # Translation table that deletes every character in string.punctuation
    table = str.maketrans(dict.fromkeys(string.punctuation))

    # Lower-casing, trimming and stripping punctuation, then word tokenizing
    words = word_tokenize(text.lower().strip().translate(table))

    # Removing stop words (the list is loaded once per call, not once per word)
    stop_words = set(stopwords.words('english'))
    words = [word for word in words if word not in stop_words]

    # Lemmatizing as nouns (the default), then as verbs
    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(word) for word in words]
    lemmed = [lemmatizer.lemmatize(word, pos = 'v') for word in lemmed]

    # Stemming words
    stemmer = PorterStemmer()
    stemmed = [stemmer.stem(word) for word in lemmed]

    return stemmed
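
The punctuation step in `tokenize` relies on a translation table whose values are all `None`, which tells `translate` to delete those characters. A small standard-library sketch of just that step (the helper name `strip_punctuation` is hypothetical and not part of the notebook):

```python
import string

# Mapping every punctuation character to None makes translate() delete it.
PUNCT_TABLE = str.maketrans(dict.fromkeys(string.punctuation))

def strip_punctuation(text):
    """Lower-case, trim, and remove ASCII punctuation from text."""
    return text.lower().strip().translate(PUNCT_TABLE)

print(strip_punctuation("  They've been very good to me, you know? "))
```

Note that the apostrophe is removed as well, so contractions such as "They've" collapse into a single token ("theyve") rather than being split.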

### Build a Machine Learning Pipeline

*This machine learning pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset.* 

*You may find the [MultiOutputClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.multioutput.MultiOutputClassifier.html) helpful for predicting multiple target variables.*

In [35]:
# Making pipeline
pipeline = Pipeline([
    # Using tfidf vectorizer
    ('text', TfidfVectorizer(tokenizer = tokenize)),
    # Using MultiOutputClassifier
    ('clf', MultiOutputClassifier(RandomForestClassifier(random_state = 42, verbose = 1)))
])

### Train Pipeline
- *Splitting data into train and test sets*
- *Training pipeline*

In [36]:
# Splitting data into training and test sets

train_X, test_X, train_Y, test_Y = train_test_split(X, Y, random_state = 42)
print('Splitting Completed')

# Fitting data
start_time = time.time()
pipeline = pipeline.fit(train_X, train_Y)
end_time = time.time()

print("Fitting Completed, time taken : {}".format(end_time - start_time))

Splitting Completed


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   23.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   17.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    2.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   25.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   11.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    8.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_j

Fitting Completed, time taken : 426.5052945613861


[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:   20.0s finished


### Testing model

- *Reporting the **f1 score, precision and recall** for each output category of the dataset.* 
- *We can do this by iterating through the columns and calling sklearn's `classification_report` on each.*

In [37]:
# Predicting values from the pipeline 
y_preds = pipeline.predict(test_X)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 100 out of 100 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_j

In [38]:
# Creating a table with precision, recall, f1-score for all the columns
results_dict = {}

for pred, label, col in zip(y_preds.transpose(), test_Y.values.transpose(), test_Y.columns):
    print(col)
    print(classification_report(label, pred))
    results_dict[col] = classification_report(label, pred, output_dict = True)

  _warn_prf(average, modifier, msg_start, len(result))


related
              precision    recall  f1-score   support

           0       0.70      0.39      0.50      1563
           1       0.83      0.95      0.89      4991

    accuracy                           0.81      6554
   macro avg       0.77      0.67      0.69      6554
weighted avg       0.80      0.81      0.79      6554

request
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      5443
           1       0.85      0.47      0.61      1111

    accuracy                           0.90      6554
   macro avg       0.88      0.73      0.77      6554
weighted avg       0.89      0.90      0.88      6554

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6521
           1       0.00      0.00      0.00        33

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      0.99      0.99      655

other_infrastructure
              precision    recall  f1-score   support

           0       0.96      1.00      0.98      6271
           1       0.00      0.00      0.00       283

    accuracy                           0.96      6554
   macro avg       0.48      0.50      0.49      6554
weighted avg       0.92      0.96      0.94      6554

weather_related
              precision    recall  f1-score   support

           0       0.90      0.95      0.92      4781
           1       0.84      0.71      0.77      1773

    accuracy                           0.88      6554
   macro avg       0.87      0.83      0.84      6554
weighted avg       0.88      0.88      0.88      6554

floods
              precision    recall  f1-score   support

           0       0.96      0.99      0.98      6035
           1       0.88      0.47      0.61       519

    accuracy                           0.95      6554
   macro avg       0.92      0.73      0.79      6554
weighted avg       0.95      0

There are a number of categories with only a small number of positive messages, none of which were picked up:
- aid_centers
- shops
- hospitals
- tools
- offer
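
These categories can still show very high weighted averages, because the weighted average is dominated by the majority class. The arithmetic can be sketched for `offer` using the support counts from the report above, under the assumption that the model predicts 0 for every test message (which is consistent with the reported recall of 0.00 for class 1):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Support counts for `offer` taken from the report above.
n_neg, n_pos = 6521, 33
total = n_neg + n_pos

# Assumption: every test message is predicted as class 0.
precision_neg = n_neg / total   # all predictions are 0; n_neg of them are right
recall_neg = 1.0
precision_pos = 0.0             # no positive predictions are made at all
recall_pos = 0.0

f1_neg = f1(precision_neg, recall_neg)
f1_pos = f1(precision_pos, recall_pos)

# Weighted average: each class contributes in proportion to its support.
weighted_f1 = (f1_neg * n_neg + f1_pos * n_pos) / total
# Macro average: each class counts equally.
macro_f1 = (f1_neg + f1_pos) / 2

print(round(weighted_f1, 4), round(macro_f1, 4))
```

The weighted f1 stays above 0.99 while the macro f1 is close to 0.50, which is why the per-class rows are worth reading alongside any averaged score.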

***What is the overall performance of the model?***

In [39]:
# Finding weighted average of all metrics
weighted_avg = {}

for key in results_dict.keys():
    weighted_avg[key] = results_dict[key]['weighted avg']

df_weight_avg = pd.DataFrame(weighted_avg).transpose()
df_weight_avg

Unnamed: 0,precision,recall,f1-score,support
related,0.80103,0.814922,0.794653,6554.0
request,0.893249,0.896857,0.884284,6554.0
offer,0.989955,0.994965,0.992454,6554.0
aid_related,0.77784,0.779371,0.777901,6554.0
medical_help,0.904538,0.921575,0.891176,6554.0
medical_products,0.946817,0.95087,0.930851,6554.0
search_and_rescue,0.968471,0.97635,0.967059,6554.0
security,0.969479,0.981996,0.973671,6554.0
military,0.960937,0.9704,0.959205,6554.0
child_alone,1.0,1.0,1.0,6554.0


In [40]:
# Summary of the metrics
df_weight_avg['f1-score'].describe()

count    36.000000
mean      0.936359
std       0.056942
min       0.777901
25%       0.925631
50%       0.947329
75%       0.975629
max       1.000000
Name: f1-score, dtype: float64

***What are the lowest f1-scores?***

In [41]:
# Categories with an f1-score at or below the 25th percentile
df_weight_avg[df_weight_avg['f1-score'] <= df_weight_avg['f1-score'].quantile(0.25)]

Unnamed: 0,precision,recall,f1-score,support
related,0.80103,0.814922,0.794653,6554.0
request,0.893249,0.896857,0.884284,6554.0
aid_related,0.77784,0.779371,0.777901,6554.0
medical_help,0.904538,0.921575,0.891176,6554.0
shelter,0.925703,0.933018,0.919927,6554.0
other_aid,0.830404,0.869088,0.815162,6554.0
infrastructure_related,0.878504,0.937138,0.906874,6554.0
weather_related,0.88158,0.88404,0.880665,6554.0
direct_report,0.842606,0.852304,0.828366,6554.0


In [42]:
key_categories = (df_weight_avg[df_weight_avg['f1-score'] <= df_weight_avg['f1-score'].quantile(0.25)].index).tolist()
key_categories

['related',
 'request',
 'aid_related',
 'medical_help',
 'shelter',
 'other_aid',
 'infrastructure_related',
 'weather_related',
 'direct_report']
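
The selection above keeps every category whose weighted f1-score is at or below the 25th percentile of all category scores. A minimal sketch of the same pattern on a made-up Series (the labels and scores are invented for illustration):

```python
import pandas as pd

# Invented scores for four hypothetical categories.
scores = pd.Series({"a": 0.70, "b": 0.80, "c": 0.90, "d": 1.00})

# quantile() interpolates linearly between neighbouring values by default.
cutoff = scores.quantile(0.25)

# Boolean mask keeps only the entries at or below the cutoff.
lowest = scores[scores <= cutoff].index.tolist()

print(cutoff, lowest)
```

Because the cutoff is interpolated, the number of categories selected is roughly, but not always exactly, a quarter of the total.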

***What are the highest f1-scores?***

In [43]:
# Categories with an f1-score at or above the 75th percentile
df_weight_avg[df_weight_avg['f1-score'] >= df_weight_avg['f1-score'].quantile(0.75)]

Unnamed: 0,precision,recall,f1-score,support
offer,0.989955,0.994965,0.992454,6554.0
child_alone,1.0,1.0,1.0,6554.0
clothing,0.983882,0.986726,0.98265,6554.0
missing_people,0.989135,0.989014,0.9837,6554.0
tools,0.986921,0.993439,0.990169,6554.0
hospitals,0.982984,0.991456,0.987202,6554.0
shops,0.99269,0.996338,0.994511,6554.0
aid_centers,0.975435,0.987641,0.9815,6554.0
fire,0.97996,0.989777,0.984844,6554.0


In [44]:
df_weight_avg.loc[key_categories]['f1-score'].describe()

count    9.000000
mean     0.855445
std      0.051976
min      0.777901
25%      0.815162
50%      0.880665
75%      0.891176
max      0.919927
Name: f1-score, dtype: float64

***The average f1-score for the key categories with the base model is 0.8554, with a standard deviation of 0.0520.***

***Creating a function to easily assess a model's performance.***

In [45]:
def model_performance(labels, preds):
    # Printing results
    results_dict = {}
    
    for pred, label, col in zip(preds.transpose(), labels.values.transpose(), labels.columns):
        print(col)
        print(classification_report(label, pred))
        results_dict[col] = classification_report(label, pred, output_dict = True)
    
    # Converting to a dataframe
    weighted_avg = {}
    
    for key in results_dict.keys():
        weighted_avg[key] = results_dict[key]['weighted avg']

    df_weight_avg = pd.DataFrame(weighted_avg).transpose()
    
    # Showing metrics
    key_categories = ['related', 'request', 'aid_related', 'medical_help',
                      'shelter', 'other_aid', 'infrastructure_related',
                      'weather_related', 'direct_report']
    
    print(df_weight_avg.loc[key_categories]['f1-score'].describe()) 
    return df_weight_avg

### Improving Model
Using grid search to find better parameters. 

*Because of the imbalance in the classes, it is possible to get high accuracy scores even with all positive messages in a class falsely labeled as negative.* 

*As a result, the f1-score with micro averaging is the preferred metric for evaluating the models, based on this [StackExchange answer](https://datascience.stackexchange.com/questions/15989/micro-average-vs-macro-average-performance-in-a-multiclass-classification-settin).*
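
To illustrate the difference between the averaging modes, here is a small standard-library sketch. Micro averaging pools true positives, false positives and false negatives across all labels before computing one f1-score, whereas macro averaging computes an f1-score per label and then takes the mean. The counts are hypothetical:

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there are no true positives."""
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def micro_f1(counts):
    """Pool (tp, fp, fn) over all labels, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    return f1_from_counts(tp, fp, fn)

def macro_f1(counts):
    """Compute F1 per label, then take the unweighted mean."""
    return sum(f1_from_counts(*c) for c in counts) / len(counts)

# Hypothetical (tp, fp, fn) counts: a frequent label and a rare, missed one.
counts = [(90, 10, 10), (0, 0, 20)]

print(round(micro_f1(counts), 4), round(macro_f1(counts), 4))
```

With these counts the micro average is about 0.82 and the macro average is 0.45: the missed rare label lowers the micro score through its false negatives, but far less than it lowers the macro score.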

In [46]:
# Performing grid search
make_scorer(f1_score, average = 'micro')

parameters = {
    'text__ngram_range': [(1, 2)],
    'text__max_df': [0.5],
    'clf__estimator__max_depth': [100],
    'clf__estimator__min_samples_split': [25], 
    'clf__estimator__n_estimators': [200]
}

cv = GridSearchCV(pipeline, parameters)
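
Note that in the cell above the object returned by `make_scorer` is not assigned or passed on, so `GridSearchCV` is left with `scoring` unset and uses the pipeline's own `score` method. A minimal sketch of how a micro-averaged f1 scorer could be wired in through the `scoring` argument, using small synthetic data and a reduced grid that are assumptions for illustration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier

# Synthetic multi-label data, made up for illustration only.
rng = np.random.RandomState(42)
X_demo = rng.rand(60, 5)
Y_demo = np.column_stack([
    (X_demo[:, 0] > 0.5).astype(int),
    (X_demo[:, 1] > 0.5).astype(int),
])

# Keep a reference to the scorer so it can be handed to the search.
micro_f1_scorer = make_scorer(f1_score, average="micro")

search = GridSearchCV(
    MultiOutputClassifier(RandomForestClassifier(n_estimators=10, random_state=42)),
    param_grid={"estimator__max_depth": [2, 4]},
    scoring=micro_f1_scorer,  # without this, the estimator's score() is used
    cv=3,
)
search.fit(X_demo, Y_demo)

print(search.best_params_, round(search.best_score_, 3))
```

With the scorer passed in, `best_score_` is a mean cross-validated micro f1-score rather than the default score, so it is not directly comparable with a value obtained without `scoring`.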

### Testing Model
*Displaying the accuracy, precision, and recall of the tuned model.*

#### *Trying a few parameter changes to check whether they work and are worth keeping.*

In [47]:
# Fitting 
start_time = time.time()
cv.fit(train_X, train_Y)
end_time = time.time()
print("time taken to fit : {}".format(end_time - start_time))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    9.9s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    7.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:   11.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    4.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    4.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_j

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Us

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Us

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    5.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    5.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    4.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    5.1s finished
[Parallel(n_jobs=1)]: Us

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:   11.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    4.7s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    4.2s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Us

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.3s finished
[Parallel(n_jobs=1)]: Us

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.1s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.4s finished
[Parallel(n_jobs=1)]: Us

[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.8s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    3.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    4.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:   13.0s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    7.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    5.8s finished
[Parallel(n_jobs=1)]: Us

time taken to fit : 1812.5776765346527


[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    9.2s finished


In [48]:
cv.best_params_

{'clf__estimator__max_depth': 100,
 'clf__estimator__min_samples_split': 25,
 'clf__estimator__n_estimators': 200,
 'text__max_df': 0.5,
 'text__ngram_range': (1, 2)}

In [49]:
cv.best_score_

0.20318419746418845

In [50]:
# predict
start_time = time.time()
y_preds = cv.predict(test_X)
end_time = time.time()
print("time taken to predict : {}".format(end_time - start_time))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.5s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 200 out of 200 | elapsed:    0.6s finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_j

In [51]:
results_cv = model_performance(test_Y, y_preds)

  _warn_prf(average, modifier, msg_start, len(result))


related
              precision    recall  f1-score   support

           0       0.79      0.10      0.17      1563
           1       0.78      0.99      0.87      4991

    accuracy                           0.78      6554
   macro avg       0.78      0.54      0.52      6554
weighted avg       0.78      0.78      0.71      6554

request
              precision    recall  f1-score   support

           0       0.87      1.00      0.93      5443
           1       0.93      0.27      0.42      1111

    accuracy                           0.87      6554
   macro avg       0.90      0.63      0.67      6554
weighted avg       0.88      0.87      0.84      6554

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6521
           1       0.00      0.00      0.00        33

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      6554
weighted avg       0.99      0.99      0.99      655

floods
              precision    recall  f1-score   support

           0       0.93      1.00      0.96      6035
           1       0.92      0.06      0.12       519

    accuracy                           0.93      6554
   macro avg       0.92      0.53      0.54      6554
weighted avg       0.92      0.93      0.89      6554

storm
              precision    recall  f1-score   support

           0       0.91      1.00      0.95      5949
           1       0.87      0.09      0.16       605

    accuracy                           0.91      6554
   macro avg       0.89      0.54      0.56      6554
weighted avg       0.91      0.91      0.88      6554

fire
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6488
           1       0.00      0.00      0.00        66

    accuracy                           0.99      6554
   macro avg       0.49      0.50      0.50      6554
weighted avg       0.98      0.99      0.98      6554

e

### Additional Features:

*Features that are common in improving NLP Classification include ([Reference](https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/)):*
- Number of words in sentence
- Number of characters in message
- Parts of speech count

**These additional features will be added to the pipeline.**
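
One common way to combine such hand-made features with the TF-IDF matrix is scikit-learn's `FeatureUnion`, which stacks the outputs of several transformers side by side. The following is only a sketch of that idea; the `MessageLength` transformer and the sample messages are hypothetical and are not the classes defined in this notebook:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

class MessageLength(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: one column holding the character count."""

    def fit(self, X, y=None):
        # Nothing to learn from the data.
        return self

    def transform(self, X):
        return np.array([[len(text)] for text in X])

# TF-IDF columns and the length column are concatenated horizontally.
features = FeatureUnion([
    ("tfidf", TfidfVectorizer()),
    ("length", MessageLength()),
])

docs = ["we need water", "storm damaged the road", "water and food please"]
matrix = features.fit_transform(docs)

print(matrix.shape)
```

The resulting matrix has one column per vocabulary term plus one extra column for the length feature, and the whole union can be used as a single step inside a `Pipeline`.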

#### *Number of Words*

In [52]:
# Transformer that counts the words in each message
class WordCount(BaseEstimator, TransformerMixin):
    # Counting word tokens after lower-casing and stripping punctuation
    def word_count(self, text):
        table = text.maketrans(dict.fromkeys(string.punctuation))
        words = word_tokenize(text.lower().strip().translate(table))
        return len(words)
    # Nothing to learn, so fit returns the transformer unchanged
    def fit(self, x, y=None):
        return self
    # Applying the count to every message and returning a one-column DataFrame
    def transform(self, x):
        count = pd.Series(x).apply(self.word_count)
        return pd.DataFrame(count)

In [53]:
class Test_WordCount_Class(unittest.TestCase):
    # Testing the class above (approach adapted from Stack Overflow)
    def setUp(self):
        self.sample_1 = WordCount().fit_transform('I went to the shops today.')
        self.sample_2 = WordCount().fit_transform("They've been very good to me, you know?")
        self.sample_3 = WordCount().fit_transform("In they went. Two by two.")
        self.sample_4 = WordCount().fit_transform("It takes a lot to love someone. \
            You would think that loving someone for 20 years would count for something.")
    
    def test_sample_1(self):
        self.assertEqual(self.sample_1[0][0], 6, "Word count should be 6.")
        
    def test_sample_2(self):
        self.assertEqual(self.sample_2[0][0], 8, "Word count should be 8.")
    
    def test_sample_3(self):
        self.assertEqual(self.sample_3[0][0], 6, "Word count should be 6.")
        
    def test_sample_4(self):
        self.assertEqual(self.sample_4[0][0], 20, "Word count should be 20.")

tests = Test_WordCount_Class()

tests_loaded = unittest.TestLoader().loadTestsFromModule(tests)

unittest.TextTestRunner().run(tests_loaded);

....
----------------------------------------------------------------------
Ran 4 tests in 0.017s

OK


#### *Number of Characters*

In [54]:
# Example
len('I went to sydney last year.')

27

In [55]:
# Counting number of characters
class CharacterCount(BaseEstimator, TransformerMixin):
    # Function to count characters
    def character_count(self, text):
        return len(text)
    
    # Fitting function
    def fit(self, x, y = None):
        return self
    
    # Transformation function
    def transform(self, x):
        count = pd.Series(x).apply(self.character_count)
        return pd.DataFrame(count)

In [56]:
# Testing the class above (approach adapted from Stack Overflow)
class Test_CharacterCount_Class(unittest.TestCase):
    def setUp(self):
        self.sample_1 = CharacterCount().fit_transform('I went to the shops today.')
        self.sample_2 = CharacterCount().fit_transform("They've been very good to me, you know?")
        self.sample_3 = CharacterCount().fit_transform("In they went. Two by two.")
        self.sample_4 = CharacterCount().fit_transform("It takes a lot to love someone. You would think that loving someone for 20 years would count for something.")
        
    def test_sample_1(self):
        self.assertEqual(self.sample_1[0][0], 26, 'Count should be 26.')
    
    def test_sample_2(self):
        self.assertEqual(self.sample_2[0][0], 39, 'Count should be 39.')
    
    def test_sample_3(self):
        self.assertEqual(self.sample_3[0][0], 25, 'Count should be 25.')
    
    def test_sample_4(self):
        self.assertEqual(self.sample_4[0][0], 107, 'Count should be 107.')

tests = Test_CharacterCount_Class()
tests_loaded = unittest.TestLoader().loadTestsFromModule(tests)
unittest.TextTestRunner().run(tests_loaded);

....
----------------------------------------------------------------------
Ran 4 tests in 0.018s

OK


#### *Number of Nouns*

In [57]:
# Number of Nouns
class NounCount(BaseEstimator, TransformerMixin):
    # Counting singular nouns (NN) and personal pronouns (PRP)
    def noun_count(self, text):
        count = 0
        sentence_list = sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            for _, tag in pos_tags:
                if tag in ['PRP', 'NN']:
                    count += 1
        
        return count
            
    # Fitting function
    def fit(self, x, y = None):
        return self
    
    # Transformation
    def transform(self, x):
        count = pd.Series(x).apply(self.noun_count)
        return pd.DataFrame(count)
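
The tag filter in `noun_count` matches only `NN` (singular common noun) and `PRP` (personal pronoun) from the Penn Treebank tag set, so plural and proper nouns (`NNS`, `NNP`, `NNPS`) are not counted. A small sketch of how the choice of tag set changes the count; the helper `count_tags` is hypothetical and the (token, tag) pairs are hand-written for illustration, not actual tagger output:

```python
def count_tags(tagged_tokens, wanted):
    """Count tokens whose POS tag is in the `wanted` set."""
    return sum(1 for _, tag in tagged_tokens if tag in wanted)

# Hand-written (token, tag) pairs for illustration; NOT real tagger output.
tagged = [
    ("volunteers", "NNS"),
    ("bring", "VBP"),
    ("water", "NN"),
    ("to", "TO"),
    ("haiti", "NNP"),
]

narrow = count_tags(tagged, {"NN", "PRP"})                # filter used above
broad = count_tags(tagged, {"NN", "NNS", "NNP", "NNPS"})  # all noun tags

print(narrow, broad)
```

Whether the narrower or broader set is the better feature is an empirical question; the sketch only shows that the two definitions can disagree on the same message.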

In [58]:
# Testing the noun class
class Test_NounCount_Class(unittest.TestCase):
    def setUp(self):
        self.sample_1 = NounCount().fit_transform('I went to the shops today.')
        self.sample_2 = NounCount().fit_transform("They've been very good to me, you know?")
        self.sample_3 = NounCount().fit_transform("In they went; two by two.")
        self.sample_4 = NounCount().fit_transform("It takes a lot to love someone. You would think that loving someone for 20 years would count for something.")
        
    def test_sample_1(self):
        self.assertEqual(self.sample_1[0][0], 2, 'Count should be 2.')
    
    def test_sample_2(self):
        self.assertEqual(self.sample_2[0][0], 0, 'Count should be 0.')
    
    def test_sample_3(self):
        self.assertEqual(self.sample_3[0][0], 0, 'Count should be 0.')
    
    def test_sample_4(self):
        self.assertEqual(self.sample_4[0][0], 5, 'Count should be 5.')

# Load the tests directly from the test case class and run them
tests_loaded = unittest.TestLoader().loadTestsFromTestCase(Test_NounCount_Class)
unittest.TextTestRunner().run(tests_loaded);

....
----------------------------------------------------------------------
Ran 4 tests in 0.154s

OK


#### *Number of Verbs*

In [59]:
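# Number of verbs
class VerbCount(BaseEstimator, TransformerMixin):
    # Count the tokens tagged 'VB' (base form) or 'VBP' (non-3rd-person singular present).
    # Other verb tags such as 'VBD', 'VBG', 'VBN' and 'VBZ' are not counted.
    def verb_count(self, text):
        count = 0
        sentence_list = sent_tokenize(text)
        for sentence in sentence_list:
            pos_tags = nltk.pos_tag(tokenize(sentence))
            for _, tag in pos_tags:
                if tag in ['VB', 'VBP']:
                    count += 1

        return count

    # Nothing is learned from the data, so fit returns the transformer unchanged
    def fit(self, x, y=None):
        return self

    # Return a one-column DataFrame holding the count for each message
    def transform(self, x):
        count = pd.Series(x).apply(self.verb_count)
        return pd.DataFrame(count)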
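
The tag filters in `NounCount` and `VerbCount` are easier to reason about on a small, fixed input. The sketch below applies the same two tag sets to hand-written `(token, tag)` pairs. The pairs and the `count_tags` helper are illustrative assumptions and are not real `nltk.pos_tag` output.

```python
# Hand-written (token, tag) pairs using Penn Treebank tags.
# These are illustrative only and NOT produced by nltk.pos_tag.
tagged = [
    ("we", "PRP"),    # personal pronoun
    ("dog", "NN"),    # singular noun
    ("dogs", "NNS"),  # plural noun
    ("run", "VB"),    # verb, base form
    ("eat", "VBP"),   # verb, non-3rd-person singular present
    ("ran", "VBD"),   # verb, past tense
]


def count_tags(pairs, wanted):
    """Count the pairs whose tag is in `wanted` (hypothetical helper)."""
    return sum(1 for _, tag in pairs if tag in wanted)


noun_like = count_tags(tagged, {"PRP", "NN"})  # same tag set as NounCount
verb_like = count_tags(tagged, {"VB", "VBP"})  # same tag set as VerbCount
print(noun_like, verb_like)  # 2 2
```

With these filters the plural noun `dogs` and the past-tense verb `ran` are skipped, while the pronoun `we` is counted as a noun.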

In [60]:
# Example: tokenize() returns no tokens for this sentence, so pos_tag has nothing to tag
nltk.pos_tag(tokenize('Did you did it?'))

[]

In [61]:
# Testing the verb class
class Test_VerbCount_Class(unittest.TestCase):
    def setUp(self):
        self.sample_1 = VerbCount().fit_transform('I went to the shops today.')
        self.sample_2 = VerbCount().fit_transform("They've been very good to me, you know?")
        self.sample_3 = VerbCount().fit_transform("In they went; two by two.")
        self.sample_4 = VerbCount().fit_transform("It takes a lot to love someone. You would think that loving someone for 20 years would count for something.")
        
    def test_sample_1(self):
        self.assertEqual(self.sample_1[0][0], 1, 'Count should be 1.')
    
    def test_sample_2(self):
        self.assertEqual(self.sample_2[0][0], 0, 'Count should be 0.')
        
    def test_sample_3(self):
        self.assertEqual(self.sample_3[0][0], 1, 'Count should be 1.')
    
    def test_sample_4(self):
        self.assertEqual(self.sample_4[0][0], 4, 'Count should be 4.')

# Load the tests directly from the test case class and run them
tests_loaded = unittest.TestLoader().loadTestsFromTestCase(Test_VerbCount_Class)
unittest.TextTestRunner().run(tests_loaded);

....
----------------------------------------------------------------------
Ran 4 tests in 0.080s

OK


In [64]:
# Building the pipeline: TF-IDF features plus four count features, fed to a multi-output random forest
pipeline = Pipeline([
    ("features", FeatureUnion([
        ("text", TfidfVectorizer(tokenizer = tokenize, max_df = 0.5, ngram_range = (1, 2))),
        ("word_count", WordCount()),
        ("character_count", CharacterCount()),
        ("noun_count", NounCount()),
        ("verb_count", VerbCount())
    ])),
    ("clf", MultiOutputClassifier(RandomForestClassifier(min_samples_split = 25, random_state = 42, verbose = 1)))
])
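
`FeatureUnion` applies each transformer to the same input and joins the results side by side, so every message ends up with the TF-IDF columns followed by one column per count transformer. The sketch below shows only that column-wise joining step in plain Python. It is a conceptual illustration, not scikit-learn's implementation, and `hstack_rows` and all feature values are made up for the example.

```python
def hstack_rows(*blocks):
    """Join feature blocks column-wise; each block holds one row per message."""
    if len({len(block) for block in blocks}) != 1:
        raise ValueError("every block needs the same number of rows")
    return [[value for row in rows for value in row] for rows in zip(*blocks)]


# Hypothetical feature values for two messages.
tfidf_block = [[0.0, 0.7, 0.7], [1.0, 0.0, 0.0]]  # three TF-IDF columns
word_count_block = [[6], [3]]                      # one column
noun_count_block = [[2], [0]]                      # one column

combined = hstack_rows(tfidf_block, word_count_block, noun_count_block)
print(combined)  # [[0.0, 0.7, 0.7, 6, 2], [1.0, 0.0, 0.0, 3, 0]]
```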

In [65]:
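# Score each candidate with micro-averaged F1 across all output categories
scorer = make_scorer(f1_score, average = 'micro')

# Only one value per parameter, so the search cross-validates a single candidate
parameters = {
    'clf__estimator__max_depth': [500],
    'clf__estimator__n_estimators': [300]
}

cv = GridSearchCV(pipeline, parameters, scoring = scorer)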
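
The parameter names follow scikit-learn's double-underscore convention: `clf__estimator__max_depth` addresses the `clf` pipeline step, then its wrapped `estimator`, then that estimator's `max_depth`. A rough way to see how much work a grid implies is to enumerate its combinations. The sketch below does this with the standard library only. `candidate_settings` is a hypothetical helper and `n_folds = 5` is an assumption, since the actual number of folds depends on the `cv` argument and the scikit-learn version.

```python
from itertools import product

parameters = {
    'clf__estimator__max_depth': [500],
    'clf__estimator__n_estimators': [300],
}


def candidate_settings(grid):
    """Return every parameter combination in the grid as a dict (hypothetical helper)."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[name] for name in names))]


candidates = candidate_settings(parameters)
n_folds = 5  # assumption: not stated in the notebook
print(len(candidates), len(candidates) * n_folds)  # 1 5

# A wider, made-up grid grows multiplicatively: 2 * 3 = 6 candidates.
wider = candidate_settings({'a': [1, 2], 'b': [3, 4, 5]})
print(len(wider))  # 6
```

With the default `refit=True`, the search also fits the best candidate once more on the full training set after cross-validation.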

In [66]:
start_time = time.time()
cv.fit(train_X, train_Y)
end_time = time.time()

print("time taken : {}".format(end_time - start_time))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:   41.5s finished

time taken : 6950.132976055145


In [67]:
cv.best_score_

0.6311097953418664
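
`cv.best_score_` is the mean cross-validated score of the best candidate under the scorer passed to `GridSearchCV`, which here is micro-averaged F1. Micro averaging pools true positives, false positives and false negatives over all categories before computing F1. The sketch below reproduces that calculation in plain Python on a tiny made-up label matrix. `micro_f1` is a hypothetical helper, not the scikit-learn function.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multilabel 0/1 matrices given as lists of rows."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for truth, pred in zip(true_row, pred_row):
            if truth == 1 and pred == 1:
                tp += 1
            elif truth == 0 and pred == 1:
                fp += 1
            elif truth == 1 and pred == 0:
                fn += 1
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Made-up labels: 3 messages, 3 categories.
y_true = [[1, 0, 1], [1, 1, 0], [0, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [0, 1, 0]]

# Pooled counts: tp = 2, fp = 1, fn = 2, so F1 = 4 / 7.
print(round(micro_f1(y_true, y_pred), 4))  # 0.5714
```

Because the counts are pooled, frequent categories such as `related` weigh more heavily in this score than rare ones.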

In [68]:
cv.best_params_

{'clf__estimator__max_depth': 500, 'clf__estimator__n_estimators': 300}

In [69]:
start_time = time.time()
y_preds = cv.predict(test_X)
end_time = time.time()
print("time taken to predict : {}".format(end_time - start_time))

results_cv = model_performance(test_Y, y_preds)

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done 300 out of 300 | elapsed:    1.6s finished

time taken to predict : 192.67354893684387
related
              precision    recall  f1-score   support

           0       0.75      0.30      0.43      1563
           1       0.82      0.97      0.89      4991

    accuracy                           0.81      6554
   macro avg       0.78      0.64      0.66      6554
weighted avg       0.80      0.81      0.78      6554

request
              precision    recall  f1-score   support

           0       0.90      0.99      0.94      5443
           1       0.87      0.46      0.60      1111

    accuracy                           0.90      6554
   macro avg       0.88      0.72      0.77      6554
weighted avg       0.89      0.90      0.88      6554

offer
              precision    recall  f1-score   support

           0       0.99      1.00      1.00      6521
           1       0.00      0.00      0.00        33

    accuracy                           0.99      6554
   macro avg       0.50      0.50      0.50      6554
weighted 

floods
              precision    recall  f1-score   support

           0       0.95      1.00      0.97      6035
           1       0.89      0.45      0.59       519

    accuracy                           0.95      6554
   macro avg       0.92      0.72      0.78      6554
weighted avg       0.95      0.95      0.94      6554

storm
              precision    recall  f1-score   support

           0       0.94      0.99      0.96      5949
           1       0.77      0.34      0.48       605

    accuracy                           0.93      6554
   macro avg       0.85      0.67      0.72      6554
weighted avg       0.92      0.93      0.92      6554

fire
              precision    recall  f1-score   support

           0       0.99      1.00      0.99      6488
           1       0.00      0.00      0.00        66

    accuracy                           0.99      6554
   macro avg       0.49      0.50      0.50      6554
weighted avg       0.98      0.99      0.98      6554
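
The `offer` and `fire` reports show why accuracy alone is misleading for rare categories: accuracy is 0.99 while precision, recall and F1 for class 1 are all 0.00. The sketch below recomputes the `offer` figures from a confusion matrix, assuming the classifier predicted 0 for every test message. That assumption is consistent with the rounded figures in the report but is not confirmed by them. `binary_report` is a hypothetical helper.

```python
def binary_report(tp, fp, fn, tn):
    """Accuracy plus precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Supports from the `offer` report: 6521 negatives and 33 positives.
# Assumption: every message is predicted as 0.
offer = binary_report(tp=0, fp=0, fn=33, tn=6521)
print({name: round(value, 2) for name, value in offer.items()})
# {'accuracy': 0.99, 'precision': 0.0, 'recall': 0.0, 'f1': 0.0}
```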

### *Exporting the model as a pickle file*

In [71]:
# Serialize the fitted GridSearchCV object; the resulting file is approximately 1.75 GB
with open('data/model.pkl', 'wb') as file:
    pickle.dump(cv, file)
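
Loading the model back is the mirror image of the export. The sketch below shows the round trip with a small dictionary standing in for the fitted `cv` object, so it runs without the trained model. The stand-in, the temporary directory and the file path are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for the fitted `cv` object (assumption, keeps the sketch self-contained).
model = {
    "best_params": {
        "clf__estimator__max_depth": 500,
        "clf__estimator__n_estimators": 300,
    }
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")  # hypothetical path for this sketch

    # Export, as in the cell above
    with open(path, "wb") as file:
        pickle.dump(model, file)
    size_bytes = os.path.getsize(path)

    # Import: read the bytes back into an equivalent object
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == model, size_bytes > 0)  # True True
```

When unpickling the real model, the custom transformer classes and the `tokenize` function must be importable in the loading process, because pickle stores references to them rather than their code. Only unpickle files from a trusted source.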