# ML Pipeline Preparation
Follow the instructions below to help you create your ML pipeline.
### 1. Import libraries and load data from database.
- Import Python libraries
- Load dataset from database with [`read_sql_table`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_sql_table.html)
- Define feature and target variables X and Y

In [110]:
# import libraries


from sqlalchemy import create_engine
import numpy as np
import pandas as pd
import re
import seaborn as sns
import matplotlib as plt
%matplotlib inline

from nltk import pos_tag
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
import nltk
#nltk.download(['punkt', 'wordnet', 'averaged_perceptron_tagger'])
#nltk.download('stopwords')


from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.pipeline import Pipeline,FeatureUnion
from sklearn.multioutput import MultiOutputClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.metrics import classification_report
from sklearn.utils import parallel_backend

import joblib
import pickle


In [52]:
# load data from database
engine = create_engine('sqlite:///./data/Disaster_Response.db')
df_ori = pd.read_sql_table('Disaster_Response_Table',engine)


In [13]:
df_ori.head()

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,7,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,8,Looking for someone but no name,"Patnm, di Maryani relem pou li banm nouvel li ...",direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,9,UN reports Leogane 80-90 destroyed. Only Hospi...,UN reports Leogane 80-90 destroyed. Only Hospi...,direct,1.0,1.0,0.0,1.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,12,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
df_ori.dtypes

id                          int64
message                    object
original                   object
genre                      object
related                   float64
request                   float64
offer                     float64
aid_related               float64
medical_help              float64
medical_products          float64
search_and_rescue         float64
security                  float64
military                  float64
child_alone               float64
water                     float64
food                      float64
shelter                   float64
clothing                  float64
money                     float64
missing_people            float64
refugees                  float64
death                     float64
other_aid                 float64
infrastructure_related    float64
transport                 float64
buildings                 float64
electricity               float64
tools                     float64
hospitals                 float64
shops         

In [15]:
df_ori[df_ori.isnull().any(axis=1)]

Unnamed: 0,id,message,original,genre,related,request,offer,aid_related,medical_help,medical_products,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
7465,8365,NOTES: It mark as not enough information,,direct,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9936,11186,My thoughts and prayers go out to all the live...,,social,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9937,11188,I m sorry for the poor people in Haiti tonight...,,social,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9938,11189,RT selenagomez UNICEF has just announced an em...,,social,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
9939,11192,lilithia yes 5.2 magnitude earthquake hit mani...,,social,1.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26340,30261,The training demonstrated how to enhance micro...,,news,,,,,,,...,,,,,,,,,,
26341,30262,A suitable candidate has been selected and OCH...,,news,,,,,,,...,,,,,,,,,,
26342,30263,"Proshika, operating in Cox's Bazar municipalit...",,news,,,,,,,...,,,,,,,,,,
26343,30264,"Some 2,000 women protesting against the conduc...",,news,,,,,,,...,,,,,,,,,,


In [16]:
df = df_ori.drop(['original'],axis = 1)

In [17]:
df.min()

id                             2
message                         
genre                     direct
related                      0.0
request                      0.0
offer                        0.0
aid_related                  0.0
medical_help                 0.0
medical_products             0.0
search_and_rescue            0.0
security                     0.0
military                     0.0
child_alone                  0.0
water                        0.0
food                         0.0
shelter                      0.0
clothing                     0.0
money                        0.0
missing_people               0.0
refugees                     0.0
death                        0.0
other_aid                    0.0
infrastructure_related       0.0
transport                    0.0
buildings                    0.0
electricity                  0.0
tools                        0.0
hospitals                    0.0
shops                        0.0
aid_centers                  0.0
other_infr

In [18]:
df.max()

id                                                                    30265
message                   | News Update | Serious loss of life expected ...
genre                                                                social
related                                                                 2.0
request                                                                 1.0
offer                                                                   1.0
aid_related                                                             1.0
medical_help                                                            1.0
medical_products                                                        1.0
search_and_rescue                                                       1.0
security                                                                1.0
military                                                                1.0
child_alone                                                             0.0
water       

We only want binary data label 0 and 1. Seems like there is 2 in "related" column. Also, it looks like there is no data in "child_alone" column so removing it would not have any effect on the model.

In [19]:
df.drop('child_alone',axis=1,inplace = True)

In [20]:
df.groupby('related')['id'].count().reset_index()

Unnamed: 0,related,id
0,0.0,6101
1,1.0,19914
2,2.0,192


Seems like there is not a lot of 2 in "related" column. Since there are plenty of 1, here I map the data row with 2 to 1.

In [21]:
df['related']=df['related'].apply(lambda x : 1 if x == 2 else x)

Null value will cause problem when training a model. Thus, rows with null values are removed.

In [22]:
df[df.isnull().any(axis=1)]

Unnamed: 0,id,message,genre,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
26207,30112,The 1 July meeting of the Support and Follow-u...,news,,,,,,,,...,,,,,,,,,,
26208,30113,Japan's overseas missions are accepting relief...,news,,,,,,,,...,,,,,,,,,,
26209,30114,"According to officials, Kabul River and Swat R...",news,,,,,,,,...,,,,,,,,,,
26210,30115,The gross relief food requirements for June-De...,news,,,,,,,,...,,,,,,,,,,
26211,30116,Authorities have built tent compounds in flatt...,news,,,,,,,,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26340,30261,The training demonstrated how to enhance micro...,news,,,,,,,,...,,,,,,,,,,
26341,30262,A suitable candidate has been selected and OCH...,news,,,,,,,,...,,,,,,,,,,
26342,30263,"Proshika, operating in Cox's Bazar municipalit...",news,,,,,,,,...,,,,,,,,,,
26343,30264,"Some 2,000 women protesting against the conduc...",news,,,,,,,,...,,,,,,,,,,


In [23]:
df.dropna(inplace=True)

In [24]:
X = df['message'].values
Y = df[df.columns[3:]]

In [25]:
np.shape(Y)

(26207, 35)

In [26]:
Y[Y.isnull().any(axis=1)]

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report


### 2. Write a tokenization function to process text data and a Verb Counter Transformer

In [27]:
def tokenize(texts):

    ## convert bytes to string format 
    normalised_text = re.sub(r'[^a-zA-Z0-9]'," ",str(texts).lower())
    #dataframe['column_name'] = dataframe['column_name'].fillna('').apply(str)
    
    tokens = word_tokenize(normalised_text)
    tokens = [token for token in tokens if token not in stopwords.words("english")]

    lemmatizer = WordNetLemmatizer()
    lemmed = [lemmatizer.lemmatize(token) for token in tokens]
    clean_tokens = [lemmatizer.lemmatize(w,pos='v') for w in lemmed]
    

    return clean_tokens

In [28]:
for message in X[:4]:
    print(message)
    print(tokenize(message),'\n')

Weather update - a cold front from Cuba that could pass over Haiti
['weather', 'update', 'cold', 'front', 'cuba', 'could', 'pas', 'haiti'] 

Is the Hurricane over or is it not over
['hurricane'] 

Looking for someone but no name
['look', 'someone', 'name'] 

UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.
['un', 'report', 'leogane', '80', '90', 'destroy', 'hospital', 'st', 'croix', 'function', 'need', 'supply', 'desperately'] 



VerbCounter Transformer will return the frequency of a verb occuring in a message.

In [29]:
class VerbCounter(BaseEstimator,TransformerMixin):

    def counter(self, corpus):
        for sentence in corpus:
            count = 0
            token = tokenize(sentence)
            pos = pos_tag(token)
            for word,tag in pos:
                if tag in ['VB', 'VBP','VBZ']:
                    count += 1
            return count
            
        
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        verb_count = pd.Series(X).apply(self.counter)
        return pd.DataFrame(verb_count)



### 3. Build a machine learning pipeline
This machine pipeline should take in the `message` column as input and output classification results on the other 36 categories in the dataset.

In [30]:
def model_pipeline(clf):
    pipeline = Pipeline([
    ('features',FeatureUnion([
        ('text_pipeline',Pipeline([
            ('vect',CountVectorizer(tokenizer = tokenize)),
            ('tfidf',TfidfTransformer())
        ])),
            ('verb_counter',VerbCounter())
        ])),
        ('clf',MultiOutputClassifier(clf))
    ])
    return pipeline


### 4. Train pipeline
- Split data into train and test sets
- Train pipeline

In [31]:
X_train, X_test, y_train, y_test = train_test_split(X, Y)

In [32]:
np.shape(X_train), np.shape(X_test)

((19655,), (6552,))

In [33]:
model1 = model_pipeline(AdaBoostClassifier())

In [34]:
model1.fit(X_train,y_train)
y_pred = model1.predict(X_test)



In [35]:
X_train[:3]

array(['TELL ME WHEN THIS QUAKE WILL START AGAIN ? SINCE 12 JANUARY,I HAD A BIG ONE I CANNOT HANDLE IT ANYMORE,I HAVE ENOUGH HELP ME ',
       "The representative of the United States then resumed her comments, asking the Special Rapporteur whether, given the Constitution's deeply discredited status, the Constitution could truly form the basis of a democratic process.",
       'I LOST MY WIFE MY CHILD I FEEL ABANONNED LIKE CRAZY '],
      dtype=object)

In [36]:
y_pred

array([[0., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 1., 0., ..., 0., 0., 0.],
       ...,
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 0.],
       [1., 0., 0., ..., 0., 0., 1.]])

### 5. Model Evaluation
Report the f1 score, precision and recall for each output category of the dataset. You can do this by iterating through the columns and calling sklearn's `classification_report` on each.

In [37]:
print(classification_report(y_test.values, y_pred, target_names=Y.columns.values))

                        precision    recall  f1-score   support

               related       0.77      0.98      0.86      5032
               request       0.41      0.11      0.18      1164
                 offer       0.00      0.00      0.00        32
           aid_related       0.47      0.18      0.26      2742
          medical_help       0.00      0.00      0.00       539
      medical_products       0.00      0.00      0.00       320
     search_and_rescue       0.00      0.00      0.00       173
              security       0.00      0.00      0.00       113
              military       0.00      0.00      0.00       210
                 water       0.00      0.00      0.00       415
                  food       0.29      0.01      0.03       743
               shelter       0.11      0.00      0.00       629
              clothing       0.20      0.03      0.05       113
                 money       0.50      0.01      0.01       146
        missing_people       0.00      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### 6. Improve the model
Use grid search to find better parameters. 

In [38]:
model1.get_params()

{'memory': None,
 'steps': [('features',
   FeatureUnion(transformer_list=[('text_pipeline',
                                   Pipeline(steps=[('vect',
                                                    CountVectorizer(tokenizer=<function tokenize at 0x000001E01B355990>)),
                                                   ('tfidf',
                                                    TfidfTransformer())])),
                                  ('verb_counter', VerbCounter())])),
  ('clf', MultiOutputClassifier(estimator=AdaBoostClassifier()))],
 'verbose': False,
 'features': FeatureUnion(transformer_list=[('text_pipeline',
                                 Pipeline(steps=[('vect',
                                                  CountVectorizer(tokenizer=<function tokenize at 0x000001E01B355990>)),
                                                 ('tfidf',
                                                  TfidfTransformer())])),
                                ('verb_counter', VerbCoun

In [39]:
# hyperparameters1 = {
#         'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
#         'features__text_pipeline__tfidf__use_idf': (True,False),
#         'clf__estimator__algorithm': ['SAMME.R','SAMME'],
#         'clf__estimator__learning_rate': [0.5, 1.0],
#         'features__transformer_weights': (
#             {'text_pipeline': 1, 'verb_counter': 0.5},
#             {'text_pipeline': 0.5, 'verb_counter': 1},
#             {'text_pipeline': 0.8, 'verb_counter': 1},
#         )
# }

hyperparameters1 = {
        'features__text_pipeline__vect__ngram_range': ((1, 1), (1, 2)),
        'clf__estimator__learning_rate': [0.5, 1.0],
}

import joblib

joblib.parallel_backend('threading')

cv = GridSearchCV(model1,param_grid=hyperparameters1,verbose=2, n_jobs=-1)

In [40]:
cv.fit(X_train,y_train)
y_pred_improved = cv.predict(X_test)

Fitting 5 folds for each of 4 candidates, totalling 20 fits




[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range=(1, 1); total time=28.6min




[CV] END clf__estimator__learning_rate=1.0, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.3min




[CV] END clf__estimator__learning_rate=1.0, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.4min




[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.6min




[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.7min
[CV] END clf__estimator__learning_rate=1.0, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.7min
[CV] END clf__estimator__learning_rate=1.0, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.7min
[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.7min
[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.8min
[CV] END clf__estimator__learning_rate=1.0, features__text_pipeline__vect__ngram_range=(1, 1); total time=29.8min
[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range=(1, 2); total time=105.2min
[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range=(1, 2); total time=105.9min
[CV] END clf__estimator__learning_rate=0.5, features__text_pipeline__vect__ngram_range



In [43]:
cv.best_params_

{'clf__estimator__learning_rate': 0.5,
 'features__text_pipeline__vect__ngram_range': (1, 1)}

In [44]:
cv.best_estimator_

In [45]:
cv.best_score_

0.18814551004833374

### 7. Model Evaluation
Show the accuracy, precision, and recall of the tuned model.

In [55]:
print(classification_report(y_test,y_pred_improved,target_names=Y.columns.values))

                        precision    recall  f1-score   support

               related       0.77      1.00      0.87      5032
               request       0.42      0.03      0.06      1164
                 offer       0.00      0.00      0.00        32
           aid_related       0.47      0.05      0.10      2742
          medical_help       0.00      0.00      0.00       539
      medical_products       0.00      0.00      0.00       320
     search_and_rescue       0.00      0.00      0.00       173
              security       0.00      0.00      0.00       113
              military       0.00      0.00      0.00       210
                 water       0.00      0.00      0.00       415
                  food       0.50      0.00      0.00       743
               shelter       1.00      0.00      0.00       629
              clothing       0.00      0.00      0.00       113
                 money       0.00      0.00      0.00       146
        missing_people       0.00      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [89]:
for i in range(len(Y.columns)):
    print(Y.columns.values[i])
    print(classification_report(list(y_test.values[:,i]), list(y_pred_improved[:,i])))
    print('\n')

related
              precision    recall  f1-score   support

         0.0       0.47      0.01      0.01      1520
         1.0       0.77      1.00      0.87      5032

    accuracy                           0.77      6552
   macro avg       0.62      0.50      0.44      6552
weighted avg       0.70      0.77      0.67      6552



request
              precision    recall  f1-score   support

         0.0       0.83      0.99      0.90      5388
         1.0       0.42      0.03      0.06      1164

    accuracy                           0.82      6552
   macro avg       0.62      0.51      0.48      6552
weighted avg       0.75      0.82      0.75      6552



offer
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      6520
         1.0       0.00      0.00      0.00        32

    accuracy                           1.00      6552
   macro avg       0.50      0.50      0.50      6552
weighted avg       0.99      1.00      0.99     

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      6342
         1.0       0.00      0.00      0.00       210

    accuracy                           0.97      6552
   macro avg       0.48      0.50      0.49      6552
weighted avg       0.94      0.97      0.95      6552



water
              precision    recall  f1-score   support

         0.0       0.94      1.00      0.97      6137
         1.0       0.00      0.00      0.00       415

    accuracy                           0.94      6552
   macro avg       0.47      0.50      0.48      6552
weighted avg       0.88      0.94      0.91      6552



food
              precision    recall  f1-score   support

         0.0       0.89      1.00      0.94      5809
         1.0       0.50      0.00      0.00       743

    accuracy                           0.89      6552
   macro avg       0.69      0.50      0.47      6552
weighted avg       0.84      0.89      0.83      6552



sh

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98      6260
         1.0       0.00      0.00      0.00       292

    accuracy                           0.96      6552
   macro avg       0.48      0.50      0.49      6552
weighted avg       0.91      0.96      0.93      6552



other_aid
              precision    recall  f1-score   support

         0.0       0.87      1.00      0.93      5715
         1.0       0.00      0.00      0.00       837

    accuracy                           0.87      6552
   macro avg       0.44      0.50      0.47      6552
weighted avg       0.76      0.87      0.81      6552



infrastructure_related
              precision    recall  f1-score   support

         0.0       0.93      1.00      0.97      6120
         1.0       0.00      0.00      0.00       432

    accuracy                           0.93      6552
   macro avg       0.47      0.50      0.48      6552
weighted avg       0.87      0.93    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      6476
         1.0       0.00      0.00      0.00        76

    accuracy                           0.99      6552
   macro avg       0.49      0.50      0.50      6552
weighted avg       0.98      0.99      0.98      6552



other_infrastructure
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.98      6245
         1.0       0.00      0.00      0.00       307

    accuracy                           0.95      6552
   macro avg       0.48      0.50      0.49      6552
weighted avg       0.91      0.95      0.93      6552



weather_related
              precision    recall  f1-score   support

         0.0       0.74      0.98      0.84      4735
         1.0       0.63      0.11      0.18      1817

    accuracy                           0.73      6552
   macro avg       0.68      0.54      0.51      6552
weighted avg       0.71      0.73

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


The data is highly imbalanced as most of the labels for each category is 0. The key objective of model training is to allow the model to learn the characteristics of the data and make prediction based on it. In the case of imbalanced data, this is challenging as the abundance examples of majority class prevents the the model learn the charcteristics of minority class. 

We can see most of the categories consist of highly imbalanced label data and the model performs extremely well on predictiing the majority class but extremley poor on the minority class. In other words, our model is weak on predicting the minority class.

Also, there is a UndefinedMetricWarning message which set the metrics to zero when there is no labels predicted for certain categories class.

In [107]:
y_test.head()

Unnamed: 0,related,request,offer,aid_related,medical_help,medical_products,search_and_rescue,security,military,water,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
3972,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8807,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
495,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
9990,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
701,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [105]:
y_test['tools'].value_counts()


tools
0.0    6508
1.0      44
Name: count, dtype: int64

### 8. Model Training using Random Forest Classifier


In [56]:
model2 = model_pipeline(RandomForestClassifier())

In [58]:
hyperparameter2 = {'clf__estimator__n_estimators': [50,70,100]
}

cv2 = GridSearchCV(model2,param_grid=hyperparameter2,verbose=2,n_jobs=-1)

In [60]:
cv2.fit(X_train,y_train)
y_pred2 = cv2.predict(X_test)

Fitting 5 folds for each of 3 candidates, totalling 15 fits




[CV] END ....................clf__estimator__n_estimators=50; total time=37.9min
[CV] END ....................clf__estimator__n_estimators=50; total time=38.0min
[CV] END ....................clf__estimator__n_estimators=50; total time=38.1min
[CV] END ....................clf__estimator__n_estimators=50; total time=38.2min
[CV] END ....................clf__estimator__n_estimators=50; total time=38.6min
[CV] END ....................clf__estimator__n_estimators=70; total time=45.0min
[CV] END ....................clf__estimator__n_estimators=70; total time=45.1min
[CV] END ....................clf__estimator__n_estimators=70; total time=45.3min
[CV] END ....................clf__estimator__n_estimators=70; total time=45.8min
[CV] END ....................clf__estimator__n_estimators=70; total time=45.8min
[CV] END ...................clf__estimator__n_estimators=100; total time=51.8min
[CV] END ...................clf__estimator__n_estimators=100; total time=52.7min
[CV] END ...................



In [61]:
cv2.best_params_

{'clf__estimator__n_estimators': 100}

In [62]:
cv2.best_estimator_

In [64]:
print(classification_report(y_test.values, y_pred2, target_names=Y.columns.values))

                        precision    recall  f1-score   support

               related       0.77      0.97      0.86      5032
               request       0.45      0.05      0.10      1164
                 offer       0.00      0.00      0.00        32
           aid_related       0.45      0.20      0.28      2742
          medical_help       0.09      0.00      0.01       539
      medical_products       0.00      0.00      0.00       320
     search_and_rescue       0.00      0.00      0.00       173
              security       0.00      0.00      0.00       113
              military       0.50      0.01      0.02       210
                 water       0.06      0.00      0.00       415
                  food       0.13      0.01      0.01       743
               shelter       0.12      0.01      0.01       629
              clothing       0.00      0.00      0.00       113
                 money       0.00      0.00      0.00       146
        missing_people       0.00      

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


In [86]:
for i in range(len(Y.columns)):
    print(Y.columns.values[i])
    print(classification_report(list(y_test.values[:,i]), list(y_pred2[:,i])))
    print('\n')

related
              precision    recall  f1-score   support

         0.0       0.37      0.06      0.10      1520
         1.0       0.77      0.97      0.86      5032

    accuracy                           0.76      6552
   macro avg       0.57      0.51      0.48      6552
weighted avg       0.68      0.76      0.69      6552



request
              precision    recall  f1-score   support

         0.0       0.83      0.99      0.90      5388
         1.0       0.45      0.05      0.10      1164

    accuracy                           0.82      6552
   macro avg       0.64      0.52      0.50      6552
weighted avg       0.76      0.82      0.76      6552



offer
              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      6520
         1.0       0.00      0.00      0.00        32

    accuracy                           0.99      6552
   macro avg       0.50      0.50      0.50      6552
weighted avg       0.99      0.99      0.99     

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         0.0       0.98      1.00      0.99      6406
         1.0       0.00      0.00      0.00       146

    accuracy                           0.98      6552
   macro avg       0.49      0.50      0.49      6552
weighted avg       0.96      0.98      0.97      6552



missing_people
              precision    recall  f1-score   support

         0.0       0.99      1.00      1.00      6494
         1.0       0.00      0.00      0.00        58

    accuracy                           0.99      6552
   macro avg       0.50      0.50      0.50      6552
weighted avg       0.98      0.99      0.99      6552



refugees
              precision    recall  f1-score   support

         0.0       0.97      1.00      0.98      6324
         1.0       0.00      0.00      0.00       228

    accuracy                           0.96      6552
   macro avg       0.48      0.50      0.49      6552
weighted avg       0.93      0.96      0.95   

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         0.0       0.93      1.00      0.96      6120
         1.0       0.00      0.00      0.00       432

    accuracy                           0.93      6552
   macro avg       0.47      0.50      0.48      6552
weighted avg       0.87      0.93      0.90      6552



transport
              precision    recall  f1-score   support

         0.0       0.96      1.00      0.98      6276
         1.0       0.07      0.00      0.01       276

    accuracy                           0.96      6552
   macro avg       0.51      0.50      0.49      6552
weighted avg       0.92      0.96      0.94      6552



buildings
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.97      6213
         1.0       0.07      0.01      0.01       339

    accuracy                           0.94      6552
   macro avg       0.51      0.50      0.49      6552
weighted avg       0.90      0.94      0.92      6

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00      6528
         1.0       0.00      0.00      0.00        24

    accuracy                           1.00      6552
   macro avg       0.50      0.50      0.50      6552
weighted avg       0.99      1.00      0.99      6552



aid_centers
              precision    recall  f1-score   support

         0.0       0.99      1.00      0.99      6476
         1.0       0.00      0.00      0.00        76

    accuracy                           0.99      6552
   macro avg       0.49      0.50      0.50      6552
weighted avg       0.98      0.99      0.98      6552



other_infrastructure
              precision    recall  f1-score   support

         0.0       0.95      1.00      0.98      6245
         1.0       0.00      0.00      0.00       307

    accuracy                           0.95      6552
   macro avg       0.48      0.50      0.49      6552
weighted avg       0.91      0.95    

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


### 9. Export trained model as a pickle file

In [None]:
pickle.dump(model,open(model_filepath,'wb'))