# Using NLP to Determine Real vs Fake Disasters

This Kaggle competition involves using NLP to analyze a set of tweets about disasters and determine whether they refer to real or fake disasters.

For the most part, I will be following the information and methods contained in these articles and videos: 
* https://towardsdatascience.com/natural-language-processing-nlp-for-machine-learning-d44498845d5b
* https://www.youtube.com/watch?v=UvsQPsrZTK4
* https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74

In [1]:
import numpy as np 
import pandas as pd 
import nltk
import string
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
!pip install pyspellchecker 
from spellchecker import SpellChecker
from operator import truediv


Collecting pyspellchecker
  Downloading pyspellchecker-0.5.4-py2.py3-none-any.whl (1.9 MB)
Installing collected packages: pyspellchecker
Successfully installed pyspellchecker-0.5.4
You should consider upgrading via the '/opt/conda/bin/python3.7 -m pip install --upgrade pip' command.


## Importing the Data

In [2]:
test_url = "https://raw.githubusercontent.com/davidblumenstiel/Kaggle/master/Real%20vs%20Fake%20Disaster%20Tweets%20(NLP)/test.csv"
train_url = "https://raw.githubusercontent.com/davidblumenstiel/Kaggle/master/Real%20vs%20Fake%20Disaster%20Tweets%20(NLP)/train.csv"
    
test = pd.read_csv(test_url)
train = pd.read_csv(train_url)

## Exploratory Data Analysis
First off, let's get an idea of what the datasets we're given look like.

In [3]:
print(train.shape)
print(train.columns)
print(test.shape)
print(test.columns)
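print(train.head)  #Without parentheses this prints the bound method's repr (the whole, truncated frame); train.head() would show only the first five rows
print(train.describe())  #The mean of target shows that about 43% of the tweets were about real disasters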

(7613, 5)
Index(['id', 'keyword', 'location', 'text', 'target'], dtype='object')
(3263, 4)
Index(['id', 'keyword', 'location', 'text'], dtype='object')
<bound method NDFrame.head of          id keyword location  \
0         1     NaN      NaN   
1         4     NaN      NaN   
2         5     NaN      NaN   
3         6     NaN      NaN   
4         7     NaN      NaN   
...     ...     ...      ...   
7608  10869     NaN      NaN   
7609  10870     NaN      NaN   
7610  10871     NaN      NaN   
7611  10872     NaN      NaN   
7612  10873     NaN      NaN   

                                                   text  target  
0     Our Deeds are the Reason of this #earthquake M...       1  
1                Forest fire near La Ronge Sask. Canada       1  
2     All residents asked to 'shelter in place' are ...       1  
3     13,000 people receive #wildfires evacuation or...       1  
4     Just got sent this photo from Ruby #Alaska as ...       1  
...                                  

## Joining the Datasets
We'll need to process the data (get it ready for NLP) as one dataset.  Here, the datasets are joined and indexed by ID.  Labels are also created so that the training and testing data can be told apart later; outcomes from the training set are also set aside on their own.

In [None]:
#Adds labels to each set so we can separate them again later
train['label'] = 'train'
test['label'] = 'test'

#Makes a list of outcomes for the training set
train_target = train['target']

#Combines the datasets
df = pd.concat([train,test])

#Transforms all the column names to uppercase to differentiate them from the terms which will be added later
df.columns = [x.upper() for x in df.columns]

#Sets the index to the ID column
df = df.set_index('ID')
print(df.head())


## Data Preparation
By processing the data, we can cut down on the noise and get better results from our model.  Here, we will:
* Make every character lowercase
* Remove the punctuation
* Split up tweets into separate words (for processing)
* Remove words that don't tell us much (stopwords)
* Lemmatize the words: map inflected forms and variants of a word to the same base term
* Join the tweets back together again

In [None]:
#Transforms all chars to lowercase and stores the text sentences in a list
lower = [x.lower() for x in df['TEXT']]

#Removes punctuation
nopunct = []  #Going to do each cleaning process in a separate list
for text in lower:
    nopunct.append("".join([x for x in text if x not in string.punctuation]))
    
    
#Splits each tweet into separate words
seperate = []    
for text in nopunct:
    seperate.append(re.split(r'\W+',text))  #Raw string avoids an invalid-escape warning
    

#Removes stopwords (needs the NLTK 'stopwords' corpus; if it is missing, run nltk.download('stopwords') first)
nostop = []
stopwords = nltk.corpus.stopwords.words('english')
for text in seperate:
    nostop.append([x for x in text if x not in stopwords])
    

#Lemmatizes the words.  Should be more useful than stemming (needs the NLTK 'wordnet' corpus; if it is missing, run nltk.download('wordnet') first)
lemmatizer = nltk.WordNetLemmatizer()
lemmat = []
for text in nostop:
    lemmat.append([lemmatizer.lemmatize(x) for x in text])
    

#Joins the sentences back together
docs = []
for strs in lemmat:
    docs.append(' '.join(strs))
    
print(docs[0:4])
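
To see what the cleaning steps do to a single tweet, here is a minimal, self-contained sketch.  It is not the notebook's pipeline: `TOY_STOPWORDS` and `clean_tweet` are hypothetical names, the stopword set is a small hand-picked stand-in for NLTK's English list, and lemmatization is left out so that the example runs without any NLTK data.

```python
import re
import string

# Hypothetical, hand-picked stopword set; the notebook uses NLTK's English list instead
TOY_STOPWORDS = {"the", "a", "an", "of", "this", "are", "is", "to", "in", "near"}

def clean_tweet(text, stopwords=TOY_STOPWORDS):
    """Lowercase, strip punctuation, tokenize, drop stopwords, and rejoin."""
    lowered = text.lower()
    no_punct = "".join(ch for ch in lowered if ch not in string.punctuation)
    # Split on runs of non-word characters and discard any empty strings
    tokens = [tok for tok in re.split(r"\W+", no_punct) if tok]
    kept = [tok for tok in tokens if tok not in stopwords]
    return " ".join(kept)

cleaned = clean_tweet("Forest fire near La Ronge Sask. Canada")
print(cleaned)
```

Unlike the cell above, this sketch drops the empty strings that `re.split` can leave at the start or end of a tweet, which would otherwise be counted as words later on.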


## What Differentiates Disaster Tweets from Non-Disasters
Here, we'll examine the training dataset and see if we can get any insight into what makes an actual disaster tweet.
I expect real disaster tweets to come primarily from news organizations, which likely do a better job spell checking their tweets.  Let's see if that's true, along with some other metrics.

In [None]:
#Here, we split the training set out of the combined, cleaned text (docs), and then into the real and fake tweets
traindoc = pd.DataFrame(docs)
traindoc.index = df.index
traindoc = traindoc.join([df.LABEL , df.TARGET])
traindoc = traindoc[traindoc.LABEL == 'train']

real = traindoc[traindoc.TARGET == 1]
real = real.drop(columns = ['LABEL','TARGET'])
fake = traindoc[traindoc.TARGET == 0]
fake = fake.drop(columns = ['LABEL','TARGET'])
#print(len(fake) + len(real)) #Also makes sure there aren't any targets labeled other than 1 or 0


#This will tally the number of misspelled words and the total number of words for disaster tweets
#Note: unknown() returns a set, so a misspelling repeated within one tweet is counted once
spellcheck = SpellChecker()
mispelledreal = 0
totalreal = 0
for tweet in real.iloc[:,0]:
    buff = re.split(r'\W+',tweet)
    mispelledreal += len(spellcheck.unknown(buff))
    totalreal += len(buff)

print(mispelledreal/totalreal)

#This will tally the number of misspelled words and the total number of words for fake disaster tweets
mispelledfake = 0
totalfake = 0
for tweet in fake.iloc[:,0]:
    buff = re.split(r'\W+',tweet)
    mispelledfake += len(spellcheck.unknown(buff))
    totalfake += len(buff)
    
print(mispelledfake/totalfake)
    
print((mispelledreal/totalreal)/(mispelledfake/totalfake))

It turns out there's only a very small difference in the rate of spelling mistakes.  However, this count likely also includes words that the spellchecker just doesn't recognize (like URLs and place names).

What about the total number of words in each of the tweets?

In [None]:
print(totalreal/len(real))
print(totalfake/len(fake))
print((totalreal/len(real))/(totalfake/len(fake)))

Above, we can see that real disaster tweets are about 1 word longer on average than are the fake ones.

Let's see if the words themselves are longer.

In [None]:
fakechars = 0
for tweet in fake.iloc[:,0]:
    fakechars += len(tweet)
    
realchars = 0
for tweet in real.iloc[:,0]:
    realchars += len(tweet)
    
print(realchars/totalreal)
print(fakechars/totalfake)  
print((realchars/totalreal)/(fakechars/totalfake))

Words in real disaster tweets are about 0.5 characters longer on average than those in fake ones.  Because the character counts include the spaces between words, this may not be the best measure of word length, but the difference was noticeable.

Finally, let's just look at the length of the tweets themselves (in chars).

In [None]:
print(realchars/len(real))
print(fakechars/len(fake))
print((realchars/len(real))/(fakechars/len(fake)))

Here there was also a noticeable difference.  Later, we'll take some of these observations and make features of them.

## Data Vectorization

Now that the text has been prepared, we'll vectorize it so it can be used in a model.  

TF-IDF weights each word by how often it appears in a tweet, discounted by how many of the tweets contain that word, so words that are common everywhere count for less.  It offers a bit more context than a vector that only describes the presence of words (Bag of Words).
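
To make the weighting concrete, here is a minimal pure-Python sketch on three toy documents.  It assumes the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which is scikit-learn's documented default; the `tfidf` helper and the toy documents are illustrative only, and the whitespace tokenizer is simpler than the one `TfidfVectorizer` uses.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: w / norm for t, w in raw.items()} if norm else {})
    return weights

toy_docs = ["forest fire near town", "fire drill today", "love this song"]
toy_weights = tfidf(toy_docs)
print(toy_weights[0])
```

In the first toy document, "fire" receives a lower weight than "forest" because "fire" also appears in the second document.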

In [None]:
#Makes the TF-IDF vectors and puts them together into a dataframe
#Need to limit the number of features to reduce the space it takes up
vectorizer = TfidfVectorizer(max_features = 1000)
X = vectorizer.fit_transform(docs)
terms = vectorizer.get_feature_names_out()  #On scikit-learn older than 1.0, use get_feature_names() instead
dense = X.todense()
denselist = dense.tolist()

tfidf = pd.DataFrame(denselist, columns = terms, index = df.index)

print(tfidf.shape)
print(tfidf.head())


## Other Features
Included in the datasets are associated keywords and locations.  We will want to take these into consideration.  We'll keep track of which specific keywords are present (1 or 0 for each possible keyword), and whether or not a location is given (it doesn't matter where, just whether one is specified; there were a lot of distinct locations).

In [None]:
#Adds columns for the different keywords, and whether or not they occurred.  Also adds the string 'KEY' to the column names to differentiate them from TF-IDF words.
keywords = pd.get_dummies(df.KEYWORD)
keywords.columns = [str(x) + 'KEY' for x in keywords.columns]
keywords = keywords.set_index(df.index)
print(keywords.head())
#Adds a simple column for location.  1 means a location was specified, 0 means it wasn't.

location = []
for loca in df.LOCATION:
    if pd.isnull(loca):
        location.append(0)
    else:
        location.append(1)
location = pd.DataFrame(location, columns = ["LOCATION"])
location = location.set_index(df.index)
print(location.head())
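
As a compact illustration of the same two encodings, the sketch below builds them on a three-row toy frame.  The toy data and the `toy_*` names are made up for the example, and the vectorized `notna()` call is an alternative to the loop above rather than the notebook's own code.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df: one row with both fields, one with neither, one with a keyword only
toy = pd.DataFrame(
    {"KEYWORD": ["ablaze", np.nan, "flood"], "LOCATION": ["Canada", np.nan, np.nan]},
    index=pd.Index([1, 4, 5], name="ID"),
)

# One 0/1 column per keyword; rows with a missing keyword are all zeros
toy_keywords = pd.get_dummies(toy["KEYWORD"], dtype=int).add_suffix("KEY")

# 1 if a location was given, 0 otherwise
toy_location = toy["LOCATION"].notna().astype(int).to_frame("LOCATION")

print(toy_keywords.join(toy_location))
```

Both results keep the original index, so they can be joined to the other feature tables without resetting it.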


Also, remember how we looked at word spelling and length averages?  Let's now make some features based on those observations.

In [None]:
#This makes a dataset for the wordcounts in each tweet
wordcount = []
for tweet in docs:
    wordcount.append(len(re.split(r'\W+',tweet)))

words = pd.DataFrame(wordcount, columns = ["WORDS"])
words = words.set_index(df.index)
    
    
#This makes a dataset of the number of characters in each tweet
charcount = []

for tweet in docs:
    charcount.append(len(tweet))
    
chars = pd.DataFrame(charcount, columns = ["CHARS"])
chars = chars.set_index(df.index)

#This makes a dataset of the number of characters per word
charperword = list(map(truediv, charcount, wordcount))

cpw = pd.DataFrame(charperword, columns = ["CPW"])
cpw = cpw.set_index(df.index)

## Final preparation
All that's left to do now is to join all the processed data together (TF-IDF, keywords, location, and word/char counts), and split it back up into training and testing sets.

In [None]:
#Combines the previous datasets into one large set; also adds back in the labels.  
#Note: everything has been indexed consistently on the ID of each tweet throughout, so it all lines up here
combined = tfidf.join(keywords)
combined = combined.join(location)
combined = combined.join(df.LABEL)
combined = combined.join(words)
combined = combined.join(chars)
combined = combined.join(cpw)

#Splits the dataset back into training and testing sets, and removes the labels
train_prepped = combined[combined['LABEL'] == 'train']
train_prepped = train_prepped.drop(columns=['LABEL'])
test_prepped = combined[combined['LABEL'] == 'test']
test_prepped = test_prepped.drop(columns=['LABEL'])
print(test_prepped.head())

## Model Building and Tuning
We're going to employ a random forest model with 100 estimators.


In [None]:
RFmodel = RandomForestClassifier(n_estimators=100, 
                                bootstrap = True,
                                max_features = 'sqrt')

RFmodel.fit(train_prepped, train_target)
predictions = RFmodel.predict(test_prepped)



In [None]:
output = pd.DataFrame({'id': test_prepped.index, 'target': predictions})
output.to_csv('predictions.csv', index=False)
print("Your submission was successfully saved!")

## Results and Discussion

This model yielded predictions on the testing set that were about 78% correct.  For comparison, guessing that all of the tweets were fake would have been 57% correct.  This suggests that our model is in fact making informed predictions that are somewhat effective.  
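
The 57% baseline follows from the class balance seen earlier: if about 43% of the training tweets are real, always guessing "fake" is right about 57% of the time, assuming the test set has a similar balance.  A minimal sketch of that arithmetic, using toy labels and a hypothetical helper name:

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)

# Toy labels with the same 43% / 57% split reported for the training set
toy_labels = [1] * 43 + [0] * 57
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)
```

With the notebook's own variables, `majority_baseline_accuracy(list(train_target))` would give the corresponding figure for the real training labels.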

This model could be improved by tinkering with the model parameters, or perhaps by creating more features and filtering out more noise from the text data.
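
One way to tinker with the model parameters is a randomized search, which is what the `RandomizedSearchCV` import at the top is for.  The sketch below is only a starting point under stated assumptions: it runs on synthetic data so that it is self-contained, and the search space is hypothetical rather than tuned for this competition.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for train_prepped / train_target so the sketch runs on its own
X_toy, y_toy = make_classification(n_samples=200, n_features=20, random_state=0)

# Hypothetical search space; the values are illustrative only
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,            # number of random parameter combinations to try
    cv=3,                # 3-fold cross-validation for each combination
    scoring="accuracy",
    random_state=0,
)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```

To apply it to this notebook, `X_toy` and `y_toy` would be replaced by `train_prepped` and `train_target`.  The cross-validated score also gives an accuracy estimate without submitting to Kaggle.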