# Disaster Related Message Classification and Modeling

### The Data

Training data for this text classification model comes from **Figure Eight's** [Multilingual Disaster Response Messages](https://www.figure-eight.com/dataset/combined-disaster-response-data/) dataset. It includes 30,000 messages drawn from natural disasters and news articles spanning many years and hundreds of different disasters. Most of the messages have been translated and carry labels for various disaster response categories.

#### Package Imports

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from nltk.tokenize import RegexpTokenizer
from nltk import WordNetLemmatizer
from nltk.stem import PorterStemmer

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
    
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

#### Reading in the csv to DataFrame

In [2]:
messages = pd.read_csv('datasets/disaster_response_messages_training.csv')



In [3]:
print("The dataframe has {} rows and {} columns.".format(messages.shape[0], messages.shape[1]))

The dataframe has 21046 rows and 42 columns.


In [4]:
messages.head()

Unnamed: 0,id,split,message,original,genre,related,PII,request,offer,aid_related,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,2,train,Weather update - a cold front from Cuba that c...,Un front froid se retrouve sur Cuba ce matin. ...,direct,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,7,train,Is the Hurricane over or is it not over,Cyclone nan fini osinon li pa fini,direct,1,0,0,0,1,...,0,0,1,0,1,0,0,0,0,0
2,12,train,"says: west side of Haiti, rest of the country ...",facade ouest d Haiti et le reste du pays aujou...,direct,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,14,train,Information about the National Palace-,Informtion au nivaux palais nationl,direct,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,15,train,Storm at sacred heart of jesus,Cyclone Coeur sacr de jesus,direct,1,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0


In [5]:
# Getting a sense of the types of values in the dataset. Mostly one-hot encoded labels for the messages
for i in messages.columns[4:]:
    print(i)
    print(messages[i].value_counts())
    print('')

genre
news      10450
direct     8666
social     1930
Name: genre, dtype: int64

related
1    15795
0     5083
2      168
Name: related, dtype: int64

PII
0    21046
Name: PII, dtype: int64

request
0    17486
1     3560
Name: request, dtype: int64

offer
0    21046
Name: offer, dtype: int64

aid_related
0    12361
1     8685
Name: aid_related, dtype: int64

medical_help
0    19392
1     1654
Name: medical_help, dtype: int64

medical_products
0    19975
1     1071
Name: medical_products, dtype: int64

search_and_rescue
0    20426
1      620
Name: search_and_rescue, dtype: int64

security
0    20644
1      402
Name: security, dtype: int64

military
0    20382
1      664
Name: military, dtype: int64

child_alone
0    21046
Name: child_alone, dtype: int64

water
0    19725
1     1321
Name: water, dtype: int64

food
0    18717
1     2329
Name: food, dtype: int64

shelter
0    19168
1     1878
Name: shelter, dtype: int64

clothing
0    20723
1      323
Name: clothing, dtype: int64

money
0 

In [6]:
# We dropped columns that do not have a diverse spread of values - based on the above exploration.
# e.g. all values in column 'PII' are 0 - this gives us no valuable information for classification.
messages.drop(columns = ['id', 'split', 'original', 'PII', 'offer', 'child_alone'], inplace=True)
messages.head()

Unnamed: 0,message,genre,related,request,aid_related,medical_help,medical_products,search_and_rescue,security,military,...,aid_centers,other_infrastructure,weather_related,floods,storm,fire,earthquake,cold,other_weather,direct_report
0,Weather update - a cold front from Cuba that c...,direct,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Is the Hurricane over or is it not over,direct,1,0,1,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
2,"says: west side of Haiti, rest of the country ...",direct,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Information about the National Palace-,direct,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Storm at sacred heart of jesus,direct,1,0,0,0,0,0,0,0,...,0,0,1,0,1,0,0,0,0,0
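The constant columns above were found by reading the `value_counts()` output. As a minimal sketch, the same check can be done programmatically. The frame `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Toy frame standing in for a few label columns; the values are invented.
toy = pd.DataFrame({
    "related": [1, 0, 1, 2],
    "PII": [0, 0, 0, 0],
    "request": [0, 1, 0, 0],
    "offer": [0, 0, 0, 0],
})

# A column with a single unique value carries no signal for classification.
constant_columns = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_columns)  # ['PII', 'offer']
```

On the real data, the resulting list could be passed to `drop(columns=...)` alongside the identifier columns.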


In [7]:
# Dropping duplicate messages
messages = messages.drop_duplicates(subset='message')

# Check for nulls
print('There are now {} null values in the dataset.'.format(messages.isnull().sum().sum()))

There are now 0 null values in the dataset.


In [8]:
messages.reset_index(inplace = True)

#### Creating the TARGET variable for Model 1

- 1 : Urgent Action Needed

- 0 : Non-urgent action

In [9]:
messages['urgent_label'] = messages['genre'].map(lambda x: 1 if x == 'direct'  else 0)

#### __Creating the TARGET variable for Model 2__
The impetus for a second model was to reduce misclassification of our target class, "Direct Media". By predicting more than just binary categories, we could potentially reduce the number of false negatives (urgent messages for help predicted as non-urgent) and avoid ignoring messages from those who need help.

- 0 : News media
- 1 : Direct media - Urgent Action Needed!
- 2 : Social media

In [10]:
messages['genre_label'] = messages['genre'].replace(['direct', 'news', 'social'], [1, 0, 2])
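To make the two codings easy to check side by side, here is a small sketch on a made-up series of genres. It uses an explicit dictionary with `map` rather than the parallel lists passed to `replace`; that is a stylistic alternative under the same coding, and the names `genres` and `genre_codes` are ours.

```python
import pandas as pd

genres = pd.Series(["direct", "news", "social", "direct"])  # invented sample

# Model 1: binary target, 1 for direct messages and 0 for everything else.
urgent_label = (genres == "direct").astype(int)

# Model 2: three-class target with the same coding as the list above.
genre_codes = {"news": 0, "direct": 1, "social": 2}
genre_label = genres.map(genre_codes)

print(urgent_label.tolist())  # [1, 0, 0, 1]
print(genre_label.tolist())   # [1, 0, 2, 1]
```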

### Preparing the data for modeling
 Tokenizing, Lemmatizing, and Stemming the Text

In [11]:
# This is a text preprocessing function that gets our data ready for modeling, creating new columns 
# for the message text in their tokenized, lemmatized, and stemmed forms. This allows for easy selection
# of different forms of the text for use in vectorization and modeling.

def preprocessed_columns(dataframe = messages, 
                        column = 'message', 
                        new_lemma_column = 'lemmatized', 
                        new_stem_column = 'stemmed',
                        new_token_column = 'tokenized_message',
                        regular_expression = r'\w+'): 
    
    tokenizer = RegexpTokenizer(regular_expression)      #Instantiate tokenizer with specific regular expression
    lemmatizer = WordNetLemmatizer()                     #Instantiate lemmatizer
    stemmer = PorterStemmer()                            #Instantiate stemmer
    
    lemmatized = []                                      #list to append lemmatized data
    stemmed = []                                         #list to append stemmed words
    tokenized = []
    
    
    for i in dataframe[column]:                          #Iterate through all the rows in specific column

        tokens = tokenizer.tokenize(i.lower())           #Converting string titles to tokens 
        tokenized.append(tokens)

        lemma = [lemmatizer.lemmatize(token) for token in tokens]     #lemmatizing all tokens
        lemmatized.append(lemma)                                      #appending lemma to lemmatized list

        stems = [stemmer.stem(token) for token in tokens]             #stemming all tokens
        stemmed.append(stems)                                         #creating stem list
     
    dataframe[new_token_column] = [' '.join(i) for i in tokenized]    
    dataframe[new_lemma_column] = [' '.join(i) for i in lemmatized]   #rejoining the lists for each message
    dataframe[new_stem_column] = [' '.join(i) for i in stemmed]       #setting results to be new columns in df   
    
    return dataframe
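To see what the `\w+` pattern does to a message, the sketch below reproduces the tokenizing step with only the standard library. We are assuming that `RegexpTokenizer` with its default settings returns the same matches as `re.findall` for this pattern; `simple_tokens` is a hypothetical helper, not part of the notebook.

```python
import re

def simple_tokens(text, pattern=r"\w+"):
    """Lowercase the text and return every run of word characters."""
    return re.findall(pattern, text.lower())

print(simple_tokens("Is the Hurricane over or is it not over"))
# ['is', 'the', 'hurricane', 'over', 'or', 'is', 'it', 'not', 'over']
print(simple_tokens("Information about the National Palace-"))
# ['information', 'about', 'the', 'national', 'palace']
```

Punctuation such as the trailing hyphen is dropped because it never matches `\w`.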
    

In [12]:
#calling the above function and resetting the messages dataframe
messages = preprocessed_columns()
messages.head()

Unnamed: 0,index,message,genre,related,request,aid_related,medical_help,medical_products,search_and_rescue,security,...,fire,earthquake,cold,other_weather,direct_report,urgent_label,genre_label,tokenized_message,lemmatized,stemmed
0,0,Weather update - a cold front from Cuba that c...,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,weather update a cold front from cuba that cou...,weather update a cold front from cuba that cou...,weather updat a cold front from cuba that coul...
1,1,Is the Hurricane over or is it not over,direct,1,0,1,0,0,0,0,...,0,0,0,0,0,1,1,is the hurricane over or is it not over,is the hurricane over or is it not over,is the hurrican over or is it not over
2,2,"says: west side of Haiti, rest of the country ...",direct,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,says west side of haiti rest of the country to...,say west side of haiti rest of the country tod...,say west side of haiti rest of the countri tod...
3,3,Information about the National Palace-,direct,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,information about the national palace,information about the national palace,inform about the nation palac
4,4,Storm at sacred heart of jesus,direct,1,0,0,0,0,0,0,...,0,0,0,0,0,1,1,storm at sacred heart of jesus,storm at sacred heart of jesus,storm at sacr heart of jesu


## Gridsearching Hyperparameters for Word Vectorization and Modeling

In [13]:
def text_to_model(X_column, model, vectorizer, params, verbose = 1):
    
    X = messages[X_column]                                    #creates X and y
    y = messages['urgent_label']
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)    #train test split
    
    pipe = Pipeline([                               #pipeline to run with gridsearch and test hyperparameters
        ('vectorizer', vectorizer),                 #for both the model and vectorizer.
        ('model', model)])                          #this will be done for many vectorizer-model combinations
        
    grid = GridSearchCV(pipe, param_grid=params, cv=5, verbose=verbose)  
    
    grid.fit(X_train, y_train)                       #fitting the grid model to X_train and y_train and 
                                                     #running a 5 fold cross validation
    score_dict = {}
    
    score_dict['X'] = X_column                               #this dict will be converted to dataframe to store 
    score_dict['Vectorizer'] = vectorizer                    #the performance of each gridsearch and return the 
    score_dict['Model'] = model                              #best parameters and score to compare to other models
    score_dict['train_score'] = grid.score(X_train, y_train)
    score_dict['test_score'] = grid.score(X_test, y_test)
    score_dict['best_params'] = grid.best_params_
    
    try:
        return pd.DataFrame(score_dict)
    except:
        return score_dict
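The pipeline and grid search pattern inside `text_to_model` can be tried on a toy corpus before running it on the full dataset. This is a sketch only: the texts, labels, and the small parameter grid are invented, and `model__C` is our choice of a second parameter to vary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Invented toy corpus: 1 = asks for help, 0 = does not.
texts = ["we need water", "need food now", "please send help", "we need shelter",
         "news report on storm", "officials said today", "report from agency",
         "the agency said"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipe = Pipeline([("vectorizer", CountVectorizer()),
                 ("model", LogisticRegression())])

# Keys follow the "<step name>__<parameter>" convention, so one grid
# can tune the vectorizer and the model together.
params = {"vectorizer__ngram_range": [(1, 1), (1, 2)],
          "model__C": [0.1, 1.0]}

grid = GridSearchCV(pipe, param_grid=params, cv=2)
grid.fit(texts, labels)
print(grid.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on each training fold, so the validation fold never influences the vocabulary.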

#### Vectorizing Tokenize Messages and Modeling with Logistic Regression
We chose Logistic Regression for its high interpretability. This will tell us what words or n-grams are informative for predicting urgent help messages. This linguistic inference can be applied to future modeling efforts.

In [None]:
# Text Selected - Tokenized messages
# Running a gridsearch on Logistic Regression and Count Vectorizer 

logreg = LogisticRegression()
cv = CountVectorizer()

params = {'vectorizer__stop_words': ['english', None],
          'vectorizer__max_features':[None],
          'vectorizer__ngram_range':[(1,1), (1,2), (1,3)],
          'vectorizer__min_df':[1],
          'vectorizer__max_df':[1.0],
          'model__penalty':['l2']
         }

count_vect_logreg = text_to_model(X_column='tokenized_message', model = logreg, vectorizer=cv, params=params)
count_vect_logreg

# Best Params Below: 
# Logistic Regression with Ridge Penalty
# Count Vectorizer with No max features, min_df = 1, ngrams = (1,2), Stop_words = None

**Testing Term Frequency - Inverse Document Frequency Word Vectorizer**

In [None]:
# Text Selected - Tokenized Message
# Gridsearching Hyperparameters for TF-IDF Vectorizer

logreg = LogisticRegression()
tfidf = TfidfVectorizer()

params = {'vectorizer__stop_words': ['english', None],
          'vectorizer__max_features':[None, 10000, 20000],
          'vectorizer__ngram_range':[(1,1), (1,2), (1,3)],
          'vectorizer__min_df':[1],
          'vectorizer__max_df':[1.0],
          'model__penalty':['l2']
         }

# Pass the TF-IDF vectorizer defined above (the count vectorizer was passed here by mistake)
tfidf_logreg = text_to_model(X_column='tokenized_message', model = logreg, vectorizer=tfidf, params=params)
tfidf_logreg

**Gridsearching on __Lemmatized Messages__**

In [None]:
logreg = LogisticRegression()
cv = CountVectorizer()

params = {'vectorizer__stop_words': ['english', None],
          'vectorizer__max_features':[None, 50000, 100000, 200000],
          'vectorizer__ngram_range':[(1,2), (1,3)],
          'vectorizer__min_df':[1],
          'vectorizer__max_df':[1.0],
          'model__penalty':['l2']
         }

count_vect_logreg = text_to_model(X_column='lemmatized', model = logreg, vectorizer=cv, params=params)
count_vect_logreg

**Gridsearching Word Vectorizer Hyperparameters with Stemmed Messages**

While the score is slightly higher for tokenized text, we decided to stick with lemmatized because it can maintain some broader meaning than just focusing on the stem of a word.

In [None]:
logreg = LogisticRegression()
cv = CountVectorizer()

params = {'vectorizer__stop_words': ['english', None],
          'vectorizer__max_features':[None, 50000, 100000, 200000],
          'vectorizer__ngram_range':[(1,2), (1,3)],
          'vectorizer__min_df':[1],
          'vectorizer__max_df':[1.0],
          'model__penalty':['l2']
         }

count_vect_logreg = text_to_model(X_column='stemmed', model = logreg, vectorizer=cv, params=params)
count_vect_logreg

## Preparing Data for Final Modeling
### Word Vectorization with Count Vectorizer

In [14]:
count_vec = CountVectorizer(ngram_range = (1,2), max_features = 50000)
X = count_vec.fit_transform(messages['lemmatized'])
cv_messages = pd.DataFrame(X.toarray(), columns = count_vec.get_feature_names())
cv_messages.insert(loc = 0, column = 'urgent_label', value = messages['urgent_label'])
cv_messages.insert(loc = 1, column = 'genre', value = messages['genre'])
cv_messages.insert(loc = 2, column = 'genre_label', value = messages['genre_label'])
cv_messages.head()

Unnamed: 0,urgent_label,genre,genre_label,00,00 am,00 in,000,000 000,000 acre,000 affected,...,zwaar,àö,àö and,àö http,àö in,àö to,àö tom,àö àö,àû,àû elevation
0,1,direct,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,1,direct,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,direct,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,1,direct,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,direct,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [15]:
cv_messages.shape

(20992, 50003)
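A table of this shape is expensive to hold as a dense DataFrame. The sketch below estimates the size, assuming 8-byte integers (`CountVectorizer` defaults to `int64`), and then shows on an invented toy corpus that `LogisticRegression` can be fit directly on the sparse matrix returned by the vectorizer. The toy names are ours.

```python
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Rough dense size for the word-count columns at the shape reported above.
rows, cols = 20992, 50000
dense_gib = rows * cols * 8 / 1024**3
print(round(dense_gib, 1))  # 7.8

# Invented toy data: the sparse matrix goes straight into the model.
toy_texts = ["need water", "need food", "storm report", "agency report"]
toy_labels = [1, 1, 0, 0]
X_sparse = CountVectorizer().fit_transform(toy_texts)
model = LogisticRegression().fit(X_sparse, toy_labels)
print(sparse.issparse(X_sparse))  # True
```

The dense DataFrame is still convenient for pairing column names with coefficients later, but the feature names can also be taken from the fitted vectorizer.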

### Train - Test Split

In [16]:
# Train Test Split for Model 1 (binary categories)
X_train, X_test, y_train, y_test = train_test_split(cv_messages.drop(columns = ['urgent_label', 'genre', 'genre_label']),
                                                    cv_messages['urgent_label'],
                                                    random_state = 42)


### Instantiate the Model:
##### We chose Logistic Regression for its high performance and interpretability

In [17]:
logreg = LogisticRegression()

logreg.fit(X_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [18]:
# Train Score
logreg.score(X_train, y_train)

0.9987931910569106

In [19]:
# Test Score
logreg.score(X_test, y_test)

0.9641768292682927

In [20]:
# Making prediction on the Test Data
logreg_predictions = logreg.predict(X_test)

#### Evaluating Model Performance

In [21]:
# Creating a table for correct and incorrect predictions
cm = confusion_matrix(y_test, logreg_predictions)
confusion_test_df = pd.DataFrame(data = cm, columns = ['Predicted Non-Urgent', 'Predicted Urgent'],
                                 index = ['Actual Non-Urgent', 'Actual Urgent'])
confusion_test_df

Unnamed: 0,Predicted Non-Urgent,Predicted Urgent
Actual Non-Urgent,2972,89
Actual Urgent,99,2088


In [22]:
# Read the four cells straight from the confusion matrix above so the
# metrics stay in sync with the model's actual predictions.
tn, fp, fn, tp = cm.ravel()

# accuracy = (TP + TN) / Total
accuracy = (tp + tn) / (tp + tn + fp + fn)
print("The Accuracy Score is: {:.4f}".format(accuracy))

# misclassification rate = 1 - accuracy
misclassification = 1 - accuracy
print("The Misclassification Rate is: {:.4f}".format(misclassification))

# sensitivity = TP / (TP + FN)
sensitivity = tp / (tp + fn)
print("The Sensitivity is: {:.4f}".format(sensitivity))

# specificity = TN / (TN + FP)
specificity = tn / (tn + fp)
print("The Specificity is: {:.4f}".format(specificity))

The Accuracy Score is: 0.9642
The Misclassification Rate is: 0.0358
The Sensitivity is: 0.9547
The Specificity is: 0.9709
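Since false negatives are the costly error for this task, it can help to report the false negative rate and precision next to the other rates. The helper below is a hypothetical sketch (`binary_rates` is our name), applied to the four counts shown in the confusion matrix from cell 21.

```python
def binary_rates(tn, fp, fn, tp):
    """Return common rates computed from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),          # share of urgent messages caught
        "specificity": tn / (tn + fp),          # share of non-urgent messages cleared
        "false_negative_rate": fn / (fn + tp),  # share of urgent messages missed
        "precision": tp / (tp + fp),            # share of urgent predictions that are right
    }

rates = binary_rates(tn=2972, fp=89, fn=99, tp=2088)
for name, value in rates.items():
    print("{}: {:.4f}".format(name, value))
# accuracy: 0.9642
# sensitivity: 0.9547
# specificity: 0.9709
# false_negative_rate: 0.0453
# precision: 0.9591
```

The accuracy here agrees with the test score reported by `logreg.score`, which is a quick sanity check that the counts were read in the right order.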


##### The Y-Test dataframe compares original category labels with model predictions. This allows us to investigate misclassified messages and qualitatively understand where our model fails to predict urgency in messages.

In [23]:
y_test_df = pd.DataFrame(data = y_test)
y_test_df['prediction'] = logreg_predictions
y_test_df.head()

Unnamed: 0,urgent_label,prediction
14550,0,0
14983,0,0
6100,1,1
5300,1,1
5998,1,1


In [24]:
# "misclassified" is the subset of y_test_df that only contains the misclassified messages
misclassified = y_test_df[y_test_df['urgent_label'] != y_test_df['prediction']]
print(len(misclassified))

188


In [None]:
# Text from the messages that were originally labeled as "urgent" but classified as "non-urgent"
# These are our "False Negatives"

count = 0
for i in messages.iloc[misclassified[misclassified['prediction'] == 0].index]['message']:
    count += 1
    print(count)
    print(i)
    print("")

In [None]:
# Text from the messages that were originally labeled as "non-urgent" but classified as "urgent"
count = 0
for i in messages.iloc[misclassified[misclassified['prediction'] == 1].index]['message']:
    count += 1
    print(count)
    print(i)
    print("")
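The two loops above separate false negatives from false positives by prediction value alone, which works because `misclassified` already contains only the wrong predictions. A more explicit version, sketched here on invented data, states both conditions and looks messages up by index label with `.loc`. All names starting with `toy_` are ours.

```python
import pandas as pd

# Invented stand-ins for the messages table and the test-set results.
toy_messages = pd.DataFrame(
    {"message": ["need water", "storm update", "send food", "road closed"]})
toy_results = pd.DataFrame(
    {"urgent_label": [1, 1, 0, 0], "prediction": [1, 0, 1, 0]},
    index=[2, 0, 3, 1])  # shuffled, like the index after a train/test split

is_urgent = toy_results["urgent_label"] == 1
predicted_urgent = toy_results["prediction"] == 1

false_negatives = toy_results[is_urgent & ~predicted_urgent]
false_positives = toy_results[~is_urgent & predicted_urgent]

# .loc matches on index labels, which is what the test-set index carries.
print(toy_messages.loc[false_negatives.index, "message"].tolist())  # ['need water']
print(toy_messages.loc[false_positives.index, "message"].tolist())  # ['road closed']
```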

### Evaluating Top features

In [None]:
# want to create a dataframe with the 50-100 most and 
# least correlated words (columns) from cv_messages

best_features = pd.DataFrame(data = X_test.columns, columns = ['Top Features'])

In [None]:
best_features['Coefficient'] = logreg.coef_[0]

In [None]:
#best_features['Coefficient'].sort_values(ascending=False).head(100)
top_feature_words = best_features[best_features['Coefficient'] > 0.749108].sort_values(by = 'Coefficient', ascending=False)
top_feature_words

In [None]:
#best_features['Coefficient'].sort_values().head(100)
negative_class_feature_words = best_features[best_features['Coefficient'] 
                                             < -0.763026].sort_values(by='Coefficient')
negative_class_feature_words
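The cutoffs `0.749108` and `-0.763026` are specific to one fit of the model. As a sketch of an alternative, selecting by rank with `nlargest` and `nsmallest` gives a fixed number of features without recomputing a threshold. The feature names and coefficients below are invented.

```python
import pandas as pd

toy_features = pd.DataFrame({
    "Top Features": ["need", "water", "said", "report", "help"],
    "Coefficient": [1.8, 1.2, -1.5, -0.9, 0.7],
})

# Strongest pushes toward the urgent class and away from it, by rank.
top_positive = toy_features.nlargest(2, "Coefficient")
top_negative = toy_features.nsmallest(2, "Coefficient")

print(top_positive["Top Features"].tolist())  # ['need', 'water']
print(top_negative["Top Features"].tolist())  # ['said', 'report']
```

On the real `best_features` table, `n` would be set to the 50 to 100 words mentioned in the cell comment.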

## Assessing Most Informative Words

In [None]:
# want to create a dataframe with the 50-100 most and 
# least correlated words (columns) from cv_messages

best_features_3_classes = pd.DataFrame(data = X_test_3.columns, columns = ['Top Features'])

In [None]:
best_features_3_classes['Coefficient'] = logreg_3.coef_[0]

In [None]:
#best_features_3_classes['Coefficient'].sort_values(ascending=False).head(100)
top_feature_words_3_classes = best_features_3_classes[best_features_3_classes['Coefficient'] > 0.749108].sort_values(by = 'Coefficient', ascending=False)

In [None]:
top_feature_words_3_classes

In [None]:
best_features_3_classes['Coefficient'].sort_values().head(100)
#negative_class_feature_words = best_features_3_classes[best_features_3_classes['Coefficient'] < -0.763026].sort_values(by='Coefficient')
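One thing to keep in mind when reading coefficients from a three-class model: `coef_` has one row per class, ordered as in `classes_`, so `coef_[0]` describes class 0 (news) rather than the direct class. The sketch below shows this on random toy data; `X_toy`, `y_toy`, and `clf` are placeholders, not the notebook's `X_test_3` or `logreg_3`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_toy = rng.rand(60, 4)             # 60 rows, 4 made-up features
y_toy = np.repeat([0, 1, 2], 20)    # 0 = news, 1 = direct, 2 = social

clf = LogisticRegression().fit(X_toy, y_toy)
print(clf.classes_)     # [0 1 2]
print(clf.coef_.shape)  # (3, 4)

# Pick the row for the "direct" class by label instead of assuming position.
direct_row = list(clf.classes_).index(1)
direct_coefficients = clf.coef_[direct_row]
```

Looking up the row through `classes_` keeps the analysis correct even if the label coding changes.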

## Validating Model 1 on Data from another source

In [None]:
validate = pd.read_csv('./datasets/natural_disaster_tweets.csv')
validate.head()

In [None]:
print("The dataframe has {} rows and {} columns.".format(validate.shape[0], validate.shape[1]))

In [None]:
validate.dropna(axis = 0, inplace = True)

In [None]:
print("The dataframe has {} rows and {} columns.".format(validate.shape[0], validate.shape[1]))

##### Vectorizing the Validation DataFrame

In [None]:
validate_cv = count_vec.transform(validate['lemmatized_message'])

In [None]:
validation_predictions = list(logreg.predict(validate_cv))

In [None]:
validate['predictions'] = validation_predictions
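When the fitted vectorizer is applied to text from another source, only words already in its vocabulary are counted. A minimal sketch with invented sentences shows the effect: the column count stays fixed, and a message made entirely of unseen words becomes an all-zero row.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["need water and food", "storm damage report"]
new_texts = ["need tents and water", "earthquake"]

vec = CountVectorizer()
X_train_toy = vec.fit_transform(train_texts)
X_new = vec.transform(new_texts)  # transform only: the vocabulary stays fixed

print(X_train_toy.shape[1] == X_new.shape[1])  # True
print(X_new.toarray().sum(axis=1).tolist())    # [3, 0]
```

For an all-zero row, a logistic regression prediction depends on the intercept alone, so it is worth checking how many validation tweets share little vocabulary with the training messages before interpreting the predictions.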

In [None]:
# A subset of messages from validation dataset that were predicted "urgent"

count = 0
for i in validate[validate['predictions'] == 1]['Tweet'][:100]:
    count += 1
    print(count)
    print('')
    print(i)

In [None]:
# A subset of messages from validation dataset that were predicted " NOT urgent"

count = 0
for i in validate[validate['predictions'] == 0]['Tweet'][:100]:
    count += 1
    print(count)
    print('')
    print(i)