<a href="https://colab.research.google.com/github/DavidAltrows/NLP/blob/master/Disaster_Tweets_Shallow_Machine_Learning_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Intro

**Small update**: I am currently working on an updated solution in TensorFlow, which makes deep learning for NLP straightforward, including tokenization, sequences, and word embeddings. In this notebook, however, I explain every step of a shallow binary classification model. Questions and contributions are welcome!

**My personal to-do** in this project:
- **Improve handling of categorical data**
- **Create data pipeline**


## Problem description

Twitter has become an important communication channel in times of emergency.
The ubiquity of smartphones enables people to announce an emergency they’re observing in real time. Because of this, more agencies (e.g. disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter.

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:

<div>
<img src="https://storage.googleapis.com/kaggle-media/competitions/tweet_screenshot.png" width="300"/>
</div>

The data for these tweets can be found here: https://www.figure-eight.com/data-for-everyone/

Credit goes to: The Kaggle micro-courses





**Evaluation**

This problem uses F1 as its evaluation metric. F1 is the harmonic mean of precision and recall for a binary classifier; the formula is described here: https://en.wikipedia.org/wiki/F1_score
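
As a quick illustration of the metric, the sketch below computes F1 from confusion-matrix counts. The counts and the helper name `f1_from_counts` are made up for illustration; they are not results from this notebook.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 given true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 60 disasters caught, 20 false alarms, 40 missed
example_f1 = f1_from_counts(tp=60, fp=20, fn=40)
print(round(example_f1, 3))
```

Because it is a harmonic mean, F1 stays low if either precision or recall is low, which is why it can differ noticeably from plain accuracy.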





**Solution**

This is the first notebook solution for this problem.


This notebook will be divided into the following parts<br>
<ol>
    <li><b>Data Exploration</b></li>
    <li><b>Data Preprocessing</b></li>
    <li><b>Basic NLP Techniques</b></li>
    <li><b>Model Building</b></li>
    <li><b>Model Evaluation</b></li>
</ol>
Importing libraries and loading data:


In [0]:
import numpy as np
import pandas as pd 
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import VotingClassifier
from sklearn.metrics import f1_score
import warnings
nltk.download('stopwords')
warnings.filterwarnings('ignore')

print("Important libraries loaded successfully")

# 1. Data Exploration
Pandas is an exceptional library for this.

In [0]:
url = 'https://raw.githubusercontent.com/DavidAltrows/NLP/master/train.csv'
data_train = pd.read_csv(url)
print("Data shape = ",data_train.shape)
data_train.head()

Clearly there are some missing values that must be dealt with.

# 2. Data Preprocessing

Data preprocessing is one of the most important steps in any data science or machine learning project. In this first version I set up the process but did not invest too much time in this step.
## 2.1 Missing Data

In [0]:
#count of missing values per column
total = data_train.isnull().sum().sort_values(ascending=False)

#fraction of missing values per column (missing count / row count)
percent = (data_train.isnull().sum()/data_train.isnull().count()).sort_values(ascending=False)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(data_train.shape[1])

As seen in the table above, about 33% of the **location** column is missing, along with a very small percentage of the **keyword** column.<br>


## 2.2 Handling Missing Data

This image is one of my favorite roadmaps to handle **missing data** 
<img src='https://miro.medium.com/max/1528/1*_RA3mCS30Pr0vUxbp25Yxw.png' width="550px" style='float:left;'>
<div style='clear:both'></div>
<br>


In this model I will use deletion. In a future branch you will likely see me try one-hot encoding or some form of label encoding for keywords. Mean or median imputation does not apply to missing categorical values, although filling them with the most frequent value or a placeholder category would be an option.


Within **Deletion** I will use the **Deleting Columns** technique: I drop the **location** and **keyword** columns, as well as the **id** column, since it has no connection to the tweet's sentiment.





In [0]:
data_train = data_train.drop(['id','location','keyword'], axis=1)
print("id, location and keyword columns dropped successfully")
#make sure the data only has the target and tweet text
data_train.columns

# 3. NLP Technique

<div style='clear:both'></div>
<hr>
Before starting the text preprocessing steps I must define two terms: corpus and bag of words (these descriptions are taken from Wikipedia).

**Corpus:** A large and structured set of texts. Here we can think of it as a simplified version of our text data that contains only clean, useful text.<br>

**Bag of words :** In practice, the Bag-of-words model is mainly used as a tool of feature generation. After transforming the text into a "bag of words", we can calculate various measures to characterize the text [wikipedia](https://en.wikipedia.org/wiki/Bag-of-words_model)<br>


In this version these are the processing steps that will be applied before the shallow learning models are tested.

<ol>
    <li><b>Remove unwanted words</b></li>
    <li><b>Transform words to lowercase</b></li>
    <li><b>Remove stopwords</b></li>
    <li><b>Stemming words</b></li>
    <li><b>Create sparse matrix ( Bag of words )</b></li>
</ol>  



## 3.1 Remove unwanted words
As we can see, our **text** column contains unwanted characters such as **#, =>, numbers, etc.** These will not be useful in our problem, so we keep only the letters, without any markings or numbers. Further analysis might show this assumption to be incorrect.<br>

We will do it by **specifying** our pattern using **re** library. This can be done many other ways.
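
A minimal sketch of that pattern on a single string. The tweet here is a hypothetical one written in the style of the dataset, not a row from it.

```python
import re

# Hypothetical tweet, loosely modelled on the style of the dataset
raw_tweet = "#RockyFire Update => Hwy. 20 closed!"

# Replace every character that is not an ASCII letter with a space
letters_only = re.sub("[^a-zA-Z]", " ", raw_tweet)

# split() with no arguments also collapses the runs of spaces left behind
words = letters_only.split()
print(words)
```

Note that this pattern also strips digits and any non-English letters, which is part of the assumption mentioned above.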

## 3.2 Transform words to lowercase

We transform words to lowercase because each character has its own code (for example in **ASCII**), and an uppercase letter has a different code than the same letter in lowercase, so to the computer 'A' differs from 'a'. Lowercasing is a simple form of text normalization: it makes 'Fire' and 'fire' count as the same word.
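
To make the effect concrete, here is a small sketch with a made-up token list: without lowercasing, three spellings of the same word are counted as three different words.

```python
from collections import Counter

# The two letters have different code points, so they compare as different
print(ord("A"), ord("a"))

# Hypothetical tokens: the same word in three casings plus one other word
tokens = ["Fire", "fire", "FIRE", "flood"]

counts_raw = Counter(tokens)
counts_lower = Counter(token.lower() for token in tokens)

print(len(counts_raw), "distinct words before lowercasing")
print(len(counts_lower), "distinct words after lowercasing")
```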

## 3.3 Remove stopwords
**Stop words:** generally the most common words in a language. They carry little information about the topic, so we remove them to avoid misleading our model. The stop word list used here may be redefined if insights are drawn from model failures.
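
A small sketch of the filtering step. The stop list below is a tiny hypothetical one chosen for this example; the notebook itself uses NLTK's much longer English list.

```python
# Hypothetical mini stop list, for illustration only
stop_words = {"there", "is", "a", "on", "the", "and", "we", "can", "it"}

# Hypothetical lowercased sentence
tokens = "there is a fire on the hill and we can see it".split()

# Keep only the words that are not in the stop list
content_words = [token for token in tokens if token not in stop_words]
print(content_words)
```

Storing the stop words in a `set` keeps each membership check cheap, which matters when the filter runs over every word of every tweet.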

## 3.4 Stemming words
**Stemming:** the process of reducing inflected (or sometimes derived) words to their word stem, base or root form (https://en.wikipedia.org/wiki/Stemming).

We use stemming to reduce the dimensionality of the **bag of words**.
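
To show why stemming shrinks the vocabulary, here is a deliberately crude suffix stripper. It is only a toy stand-in written for this example: it is not the Porter algorithm used in this notebook, and the function name `crude_stem` and its suffix list are assumptions.

```python
def crude_stem(word, suffixes=("ing", "ed", "es", "s")):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Four inflected forms of the same word
words = ["flood", "floods", "flooded", "flooding"]
stems = {crude_stem(word) for word in words}

# Four surface forms collapse into a single column of the bag of words
print(len(words), "forms ->", len(stems), "stem")
```

A real stemmer such as `PorterStemmer` applies many more rules, which is why the cleaned corpus contains stems like `earthquak` and `evacu`.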



In [0]:
corpus = []
pstem = PorterStemmer()
#Build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
for text in data_train['text']:
    #Remove unwanted characters (anything that is not a letter)
    tweet = re.sub("[^a-zA-Z]", ' ', text)
    #Transform words to lowercase
    tweet = tweet.lower()
    tweet = tweet.split()
    #Remove stopwords, then stem the remaining words
    tweet = [pstem.stem(word) for word in tweet if word not in stop_words]
    tweet = ' '.join(tweet)
    #Append cleaned tweet to corpus
    corpus.append(tweet)

print("Corpus created successfully")

Corpus created successfully


**Let's explore corpus, and discover the difference between raw and clean text data**

In [0]:
print(pd.DataFrame(corpus)[0].head(10))

0            deed reason earthquak may allah forgiv us
1                 forest fire near la rong sask canada
2    resid ask shelter place notifi offic evacu she...
3          peopl receiv wildfir evacu order california
4    got sent photo rubi alaska smoke wildfir pour ...
5    rockyfir updat california hwi close direct due...
6    flood disast heavi rain caus flash flood stree...
7                               top hill see fire wood
8               emerg evacu happen build across street
9                             afraid tornado come area
Name: 0, dtype: object


In [0]:
rawTexData = data_train["text"].head(10)
cleanTexData = pd.DataFrame(corpus, columns=['text after cleaning']).head(10)

frames = [rawTexData, cleanTexData]
result = pd.concat(frames, axis=1, sort=False)
result

| | text | text after cleaning |
|---|---|---|
| 0 | Our Deeds are the Reason of this #earthquake M... | deed reason earthquak may allah forgiv us |
| 1 | Forest fire near La Ronge Sask. Canada | forest fire near la rong sask canada |
| 2 | All residents asked to 'shelter in place' are ... | resid ask shelter place notifi offic evacu she... |
| 3 | 13,000 people receive #wildfires evacuation or... | peopl receiv wildfir evacu order california |
| 4 | Just got sent this photo from Ruby #Alaska as ... | got sent photo rubi alaska smoke wildfir pour ... |
| 5 | #RockyFire Update => California Hwy. 20 closed... | rockyfir updat california hwi close direct due... |
| 6 | #flood #disaster Heavy rain causes flash flood... | flood disast heavi rain caus flash flood stree... |
| 7 | I'm on top of the hill and I can see a fire in... | top hill see fire wood |
| 8 | There's an emergency evacuation happening now ... | emerg evacu happen build across street |
| 9 | I'm afraid that the tornado is coming to our a... | afraid tornado come area |


Some words appear only a handful of times across our tweets, so we remove them from our **bag of words** to reduce dimensionality as much as possible.<br>

We do this by creating a dictionary where each **key** is a **word** and each **value** is that **word's frequency across all tweets**.

In [0]:
#Create our dictionary 
uniqueWordFrequents = {}
for tweet in corpus:
    for word in tweet.split():
        if(word in uniqueWordFrequents.keys()):
            uniqueWordFrequents[word] += 1
        else:
            uniqueWordFrequents[word] = 1
            
#Convert dictionary to dataFrame
uniqueWordFrequents = pd.DataFrame.from_dict(uniqueWordFrequents,orient='index',columns=['Word Frequent'])
uniqueWordFrequents.sort_values(by=['Word Frequent'], inplace=True, ascending=False)
uniqueWordFrequents.head(10)

| | Word Frequent |
|---|---|
| co | 4746 |
| http | 4721 |
| like | 411 |
| fire | 363 |
| amp | 344 |
| get | 311 |
| bomb | 239 |
| new | 228 |
| via | 220 |
| u | 216 |
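
The counting loop above can also be written with `collections.Counter` from the standard library. This is an alternative sketch, not what the notebook runs; `sample_corpus` is a made-up stand-in for the real `corpus` list and `min_count` is a hypothetical threshold.

```python
from collections import Counter

# Hypothetical cleaned tweets standing in for the notebook's `corpus` list
sample_corpus = [
    "forest fire near town",
    "fire crew fight forest fire",
    "flood near river",
]

# Count every word across all tweets in one pass
word_counts = Counter(word for tweet in sample_corpus for word in tweet.split())

# most_common() returns (word, count) pairs sorted by descending count
print(word_counts.most_common(3))

# Keep only words that occur at least `min_count` times
min_count = 2
frequent_words = {word: n for word, n in word_counts.items() if n >= min_count}
print(frequent_words)
```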


In [0]:
uniqueWordFrequents['Word Frequent'].unique()

array([4746, 4721,  411,  363,  344,  311,  239,  228,  220,  216,  213,
        210,  209,  201,  183,  181,  180,  178,  175,  169,  166,  164,
        162,  156,  155,  153,  151,  145,  144,  143,  137,  133,  132,
        131,  130,  129,  128,  125,  124,  123,  122,  121,  120,  119,
        118,  117,  116,  114,  111,  110,  109,  108,  106,  105,  104,
        103,  102,  101,  100,   99,   98,   97,   96,   95,   94,   93,
         91,   90,   89,   88,   87,   86,   84,   83,   82,   79,   78,
         77,   76,   75,   74,   73,   72,   71,   70,   69,   68,   67,
         66,   65,   64,   63,   62,   61,   60,   59,   58,   57,   56,
         55,   54,   53,   52,   51,   50,   49,   48,   47,   46,   45,
         44,   43,   42,   41,   40,   39,   38,   37,   36,   35,   34,
         33,   32,   31,   30,   29,   28,   27,   26,   25,   24,   23,
         22,   21,   20,   19,   18,   17,   16,   15,   14,   13,   12,
         11,   10,    9,    8,    7,    6,    5,   

As we see, some words occur a lot and others much less often, so we will keep only the words that occur at least 20 times.

In [0]:
uniqueWordFrequents = uniqueWordFrequents[uniqueWordFrequents['Word Frequent'] >= 20]
print(uniqueWordFrequents.shape)
uniqueWordFrequents

(787, 1)


| | Word Frequent |
|---|---|
| co | 4746 |
| http | 4721 |
| like | 411 |
| fire | 363 |
| amp | 344 |
| ... | ... |
| cnn | 20 |
| gem | 20 |
| captur | 20 |
| arriv | 20 |


## 3.5 Create sparse matrix ( Bag of words )
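The **bag of words** has one column per unique word kept from the corpus. Setting `max_features` tells `CountVectorizer` to keep only that many of the most frequent terms.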

In [0]:
counVec = CountVectorizer(max_features = uniqueWordFrequents.shape[0])
bagOfWords = counVec.fit_transform(corpus).toarray()
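
To make concrete what the resulting matrix holds, here is a hand-rolled sketch in plain Python: one row per tweet, one column per vocabulary word, each cell a count. The tweets and vocabulary are made up, and this is not how `CountVectorizer` is implemented internally.

```python
# Hypothetical cleaned tweets and a fixed vocabulary (sorted for stable column order)
sample_corpus = [
    "forest fire near town",
    "fire crew fight forest fire",
    "flood near river",
]
vocabulary = sorted({"fire", "forest", "near"})

# One row per tweet, one column per vocabulary word, each cell a count
bag = [[tweet.split().count(word) for word in vocabulary] for tweet in sample_corpus]

print(vocabulary)
for row in bag:
    print(row)
```

Words outside the vocabulary (such as `town` or `river` here) simply do not get a column, which is how the frequency cut-off reduces dimensionality.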

# 4. Building Models
Now we will build our models. We will use the following:
* Decision Tree Model
* Gradient Boosting Model
* K - Nearest Neighbors Model
* Logistic Regression Model
* Support Vector Machine Model
* Voting Classifier Model (not used in this version)

Before fitting them we will split our data into train and test sets.


**Aside**

The hyperparameters used in these models are likely not optimal; however, this project is focused on NLP.

If you're interested in learning more about classification models, this article is excellent: https://stackabuse.com/overview-of-classification-methods-in-python-with-scikit-learn/


In [0]:
X = bagOfWords
y = data_train['target']
print("X shape = ",X.shape)
print("y shape = ",y.shape)

X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.20, random_state=55, shuffle =True)
print('data splitting successfully')

X shape =  (7613, 787)
y shape =  (7613,)
data splitting successfully


## 4.1 Decision Tree Model

In [0]:
decisionTreeModel = DecisionTreeClassifier(criterion= 'entropy',max_depth = None, splitter='best', random_state=55)

decisionTreeModel.fit(X_train,y_train)

print("decision Tree Classifier model run successfully")

decision Tree Classifier model run successfully


## 4.2 Gradient Boosting Model

In [0]:
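#loss is left at its default (log loss, called 'deviance' in older scikit-learn
#versions; the 'deviance' name was removed in newer releases)
gradientBoostingModel = GradientBoostingClassifier(learning_rate = 0.01,
                                                   n_estimators = 100,
                                                   max_depth = 30,
                                                   random_state=55)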

gradientBoostingModel.fit(X_train,y_train)

print("gradient Boosting Classifier model run successfully")

gradient Boosting Classifier model run successfully


## 4.3 K - Nearest Neighbors Model


In [0]:
KNeighborsModel = KNeighborsClassifier(n_neighbors = 7,
                                       weights = 'distance',
                                      algorithm = 'brute')

KNeighborsModel.fit(X_train,y_train)

print("KNeighbors Classifier model run successfully")

KNeighbors Classifier model run successfully


## 4.4 Logistic Regression Model

In [0]:
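#Note: this rebinds the name LogisticRegression from the class to the model
#instance, so re-running this cell without re-importing raises a TypeError
LogisticRegression = LogisticRegression(penalty='l2', 
                                        solver='saga', 
                                        random_state = 55)  

LogisticRegression.fit(X_train,y_train)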

print("LogisticRegression Classifier model run successfully")

LogisticRegression Classifier model run successfully


## 4.5 Support Vector Machine Model

In [0]:
SVClassifier = SVC(kernel= 'linear',
                   degree=3,
                   max_iter=10000,
                   C=2, 
                   random_state = 55)

SVClassifier.fit(X_train,y_train)

print("SVClassifier model run successfully")

SVClassifier model run successfully


## 4.6 Voting Classifier Model

In [0]:
#modelsNames = []

#votingClassifier = VotingClassifier(voting = 'hard',estimators= modelsNames)
#votingClassifier.fit(X_train,y_train)
#print("votingClassifier model run successfully")
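
For reference, hard voting simply takes the majority label across models for each sample. The sketch below shows that idea in plain Python on made-up predictions; the helper name `hard_vote` is hypothetical. In scikit-learn, `VotingClassifier` expects `estimators` as a list of `(name, estimator)` pairs rather than an empty list.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the majority label for each sample across several models."""
    voted = []
    # zip(*...) regroups the predictions so each tuple holds one sample's votes
    for sample_predictions in zip(*predictions_per_model):
        label, _count = Counter(sample_predictions).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical 0/1 predictions from three models for four tweets
model_a = [1, 0, 1, 0]
model_b = [1, 1, 0, 0]
model_c = [0, 1, 1, 0]

majority = hard_vote([model_a, model_b, model_c])
print(majority)
```

With an odd number of models and two classes there are no ties, which is one reason to combine three or five classifiers rather than four.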

# 5. Model Evaluation

Now we will evaluate our models using **f1_score**. Note that `model.score` below reports accuracy, so the train and test scores are accuracies, not F1.

In [0]:
#evaluation Details
models = [decisionTreeModel, gradientBoostingModel, KNeighborsModel, LogisticRegression, SVClassifier]

for model in models:
    print(type(model).__name__,' Train Score is   : ' ,model.score(X_train, y_train))
    print(type(model).__name__,' Test Score is    : ' ,model.score(X_test, y_test))
    
    y_pred = model.predict(X_test)
    print(type(model).__name__,' F1 Score is      : ' ,f1_score(y_test,y_pred))
    print('--------------------------------------------------------------------------')

DecisionTreeClassifier  Train Score is   :  0.9761904761904762
DecisionTreeClassifier  Test Score is    :  0.7419566644780039
DecisionTreeClassifier  F1 Score is      :  0.6743993371996686
--------------------------------------------------------------------------
GradientBoostingClassifier  Train Score is   :  0.8594417077175698
GradientBoostingClassifier  Test Score is    :  0.7511490479317138
GradientBoostingClassifier  F1 Score is      :  0.6331074540174251
--------------------------------------------------------------------------
KNeighborsClassifier  Train Score is   :  0.9761904761904762
KNeighborsClassifier  Test Score is    :  0.7406434668417596
KNeighborsClassifier  F1 Score is      :  0.5872518286311389
--------------------------------------------------------------------------
LogisticRegression  Train Score is   :  0.8502463054187193
LogisticRegression  Test Score is    :  0.7826657912015759
LogisticRegression  F1 Score is      :  0.7230125523012553
-------------------------

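The gap between train and test scores points to overfitting, most visibly for the decision tree and k-nearest neighbors models (about 0.98 train accuracy versus 0.74 test accuracy). I'll likely investigate this in the future.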
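
One simple way to quantify this is the difference between train and test accuracy per model. The sketch below uses the scores printed in the output above, rounded to three decimals; the support vector machine is left out because its scores do not appear in that output.

```python
# Train and test accuracy as printed in the evaluation output above (rounded)
reported_scores = {
    "DecisionTreeClassifier": (0.976, 0.742),
    "GradientBoostingClassifier": (0.859, 0.751),
    "KNeighborsClassifier": (0.976, 0.741),
    "LogisticRegression": (0.850, 0.783),
}

# A large positive gap suggests the model memorised the training set
gaps = {name: round(train - test, 3) for name, (train, test) in reported_scores.items()}

# Print the models from largest to smallest gap
for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {gap}")
```

Logistic regression shows the smallest gap and also the best F1 of the models listed, which fits the usual pattern that a more constrained model generalises better on a sparse bag-of-words input.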