## Real or Not? NLP with Disaster Tweets

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t.

In [1]:
import pandas as pd
import plotly.graph_objects as go
import plotly.express as px
import string

import numpy as np
import sklearn as sk

import matplotlib.pyplot as plt
pd.set_option('display.max_columns', None)  

First, let's read in the data and get a feel for what it looks like.

In [2]:
df = pd.read_csv('../data/raw/train.csv', index_col=[0])
df_test = pd.read_csv('../data/raw/test.csv', index_col=[0])

## Exploring our data

Let's take a look at our data. The first step is to inspect the dataframes, their shapes, and some quantitative characteristics.

In [3]:
display(df.head())
display(df_test.head())
print(df.shape, df_test.shape)

id,keyword,location,text,target
1,,,Our Deeds are the Reason of this #earthquake M...,1
4,,,Forest fire near La Ronge Sask. Canada,1
5,,,All residents asked to 'shelter in place' are ...,1
6,,,"13,000 people receive #wildfires evacuation or...",1
7,,,Just got sent this photo from Ruby #Alaska as ...,1


id,keyword,location,text
0,,,Just happened a terrible car crash
2,,,"Heard about #earthquake is different cities, s..."
3,,,"there is a forest fire at spot pond, geese are..."
9,,,Apocalypse lighting. #Spokane #wildfires
11,,,Typhoon Soudelor kills 28 in China and Taiwan


(7613, 4) (3263, 3)


In [None]:
missing = df.isna().mean() * 100  # percentage of missing values per column
fig = px.bar(x=missing.index, y=missing.values, labels={'x': 'column', 'y': '% missing'})
fig.update_layout(yaxis_range=[0, 100])
fig.show()

Far too many location values are missing, and there is no sensible way to impute them. We would either have to drop the rows with missing locations (33% of our data!) or simply not use location as a feature. The latter seems more reasonable.

However, only 0.8% of rows are missing keyword values, so it's safe to drop those rows.
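
To make the trade-off concrete, here is a minimal sketch on a small made-up frame. The `toy` data and its values are hypothetical and only mimic the missingness pattern; they are not taken from the competition files. It compares how many rows survive if we drop every row with any missing value versus dropping the `location` column first.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only meant to mimic the missingness pattern.
toy = pd.DataFrame({
    "keyword": ["fire", "flood", np.nan, "storm", "quake", "crash"],
    "location": [np.nan, "Paris", np.nan, np.nan, "Tokyo", "Lima"],
    "text": ["a", "b", "c", "d", "e", "f"],
})

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Option 1: drop every row that has any missing value.
rows_all_dropna = toy.dropna().shape[0]

# Option 2: drop the location column, then drop rows still missing a value.
rows_without_location = toy.drop(columns="location").dropna().shape[0]

print(missing_frac)
print(rows_all_dropna, rows_without_location)
```

In this toy case, dropping the column first keeps 5 of 6 rows, while dropping all incomplete rows keeps only 3.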

In [None]:
display(df.isna().sum())  # wrap in display(); only a cell's last expression is shown automatically
display(df.head(10))

Great. Now, let's explore some aspects of the textual data, like the distribution of keywords and the length of the tweets.

In [None]:
# Character length of each tweet (the discussion below is about tweet length, so use 'text')
text_len = df['text'].str.len()
mean_len = text_len.mean()
med_len = text_len.median()
mode_len = text_len.mode().iloc[0]
px.bar(y=[mean_len, med_len, mode_len], x=['Mean', 'Median', 'Mode'], title='Tweet Length Statistics (characters)').show()

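# Number of tweets per keyword; using the Series directly avoids relying on the column name that to_frame() produces
val_count = df['keyword'].value_counts()
px.bar(x=val_count.index, y=val_count.values, title='Tweets per Keyword').show()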

It appears that the median and mean values are very close. This isn't much of a surprise, as tweets have a hard limit on their character count. It also suggests that the length distribution is not heavily skewed: if the median were much different from the mean, it would point to outliers or a long tail.
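
As a quick illustration of why a gap between mean and median hints at outliers, here is a small sketch with made-up lengths (the numbers are hypothetical, not from the dataset). A single extreme value pulls the mean far away while the median barely moves.

```python
import numpy as np

# Hypothetical lengths: tightly clustered values, then the same values plus one outlier.
clustered = np.array([100, 105, 110, 115, 120])
with_outlier = np.append(clustered, 1000)

mean_clustered, median_clustered = np.mean(clustered), np.median(clustered)
mean_outlier, median_outlier = np.mean(with_outlier), np.median(with_outlier)

print(mean_clustered, median_clustered)  # both 110.0
print(mean_outlier, median_outlier)      # mean jumps above 250, median is 112.5
```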

## Data Cleaning

An important part of NLP is cleaning the textual data. We generally lowercase everything and remove punctuation, links, and @mentions.

As for the NaNs: as we saw above, location is missing in 33% of rows and cannot be sensibly imputed, so we drop the column rather than lose those rows.

Only 0.8% of rows are missing keyword values, so it's safe to drop those rows.

In [None]:
df.drop('location', inplace=True, axis=1)
df.dropna(inplace=True)

In [None]:
import re  # needed for re.escape below

df['text'] = df['text'].str.lower()  # make everything lowercase
df['text'] = df['text'].str.replace(r'http\S+', '', regex=True)  # remove URLs (before punctuation is stripped)
df['text'] = df['text'].str.replace(r'@\S+', '', regex=True)  # remove @mentions
df['text'] = df['text'].str.replace('[{}]'.format(re.escape(string.punctuation)), '', regex=True)  # remove punctuation
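
Since the test set will need exactly the same preprocessing, it can help to wrap the steps in one function. Below is a minimal sketch; the name `clean_tweet`, the compiled pattern names, and the example string are hypothetical, and the final whitespace collapse is an extra step that the cell above does not perform.

```python
import re
import string

URL_RE = re.compile(r"http\S+")
MENTION_RE = re.compile(r"@\S+")
PUNCT_RE = re.compile("[{}]".format(re.escape(string.punctuation)))

def clean_tweet(text):
    """Lowercase a tweet and strip URLs, @mentions, and punctuation."""
    text = text.lower()
    text = URL_RE.sub("", text)      # URLs first, while "://" is still intact
    text = MENTION_RE.sub("", text)  # then @mentions
    text = PUNCT_RE.sub("", text)    # then any remaining punctuation
    return " ".join(text.split())    # collapse leftover whitespace (extra step)

example = "Forest FIRE near @someone! http://t.co/abc #wildfires"
cleaned = clean_tweet(example)
print(cleaned)
```

The function could then be applied with `df['text'].apply(clean_tweet)` and `df_test['text'].apply(clean_tweet)`, so that both frames go through identical cleaning.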

In [None]:
display(df.head(10))