<h1> Introduction to Natural Language Processing Kaggle Notebook</h1>
<h2> NLP with Disaster Tweets </h2>

<h3> What I'm doing with this project </h3>

For this project, I am experimenting with RNNs, LSTMs, and various pre-trained models for Natural Language Processing. It's going to be a trip, so stand by, sit back, and let the time fly!

A good Kaggle tutorial with background information is at [NLP Getting Started](https://www.kaggle.com/philculliton/nlp-getting-started-tutorial).

<h3> Problem </h3>

The problem here is simply to create a binary classifier that predicts whether a tweet refers to an actual disaster. This falls somewhere between content analysis and sentiment analysis. It can be really hard at times to tell if something is referencing an actual disaster!

<h3> Grading </h3>

Scores are graded using the F1 score, defined as

\begin{equation}
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
\end{equation}
where:

\begin{equation}
\text{precision} = \frac{TP}{TP + FP}
\end{equation}

\begin{equation}
\text{recall} = \frac{TP}{TP + FN}
\end{equation}
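
To make the definition concrete, here is a minimal sketch that computes F1 directly from the three counts. The function name `f1_from_counts` and the example counts are made up for illustration; in practice a library routine such as `sklearn.metrics.f1_score` would normally be used on label arrays instead.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Compute F1 from true positive, false positive, and false negative counts."""
    # With no true positives, precision and recall are both zero (or undefined),
    # so we follow the common convention of returning 0.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, not taken from the competition data.
score = f1_from_counts(tp=80, fp=20, fn=40)
print(f"F1 = {score:.4f}")
```

Note that F1 ignores true negatives entirely, so it only rewards getting the disaster class right.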

<h3> Starting Approach </h3>

Probably the best indicator of whether a tweet describes a natural disaster is the vocabulary it uses. Based on the training data, we can expect a certain word cloud to appear that is heavily related to disasters.

One approach I might take is to use a clustering algorithm on pre-trained word embeddings (e.g., pick some random disaster words, manually annotate the first several that indicate disaster, and then flag tweets based on occurrence of those).
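
As a rough sketch of the last step only (flagging tweets by word occurrence), something like the following could serve as a baseline. It skips the embedding and clustering part entirely: the word list `DISASTER_WORDS` is a hand-picked, hypothetical stand-in for whatever vocabulary the annotation step would produce, and `flag_tweet` is an illustrative helper name.

```python
import re

# Hypothetical seed vocabulary; in the approach described above this set
# would come from annotated neighbors in a pre-trained embedding space.
DISASTER_WORDS = {"earthquake", "wildfire", "wildfires", "evacuation", "flood", "fire"}


def flag_tweet(text: str, vocabulary: set = DISASTER_WORDS) -> bool:
    """Return True if any lowercase alphabetic token of the tweet is in the vocabulary."""
    # Lowercase and keep only alphabetic runs, so "#earthquake" yields "earthquake".
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in vocabulary for token in tokens)


print(flag_tweet("Forest fire near La Ronge Sask. Canada"))
print(flag_tweet("What a lovely afternoon"))
```

A pure keyword match like this will also fire on figurative uses ("this mixtape is fire"), which is exactly the ambiguity that makes the task hard.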

Beyond this, we would hope to include some form of sentiment analysis on the tweets. Tweets talking about disasters probably share some recurring sentiments, e.g., we would expect them to have extremely sincere, sad, serious, or fearful tones.



In [30]:
import numpy as np
import pandas as pd
import torch.nn as nn
from sklearn import feature_extraction, linear_model, model_selection, preprocessing

In [4]:
train_df = pd.read_csv("nlp-getting-started/train.csv")

In [31]:
# It's a good idea to look at your dataset real quick to get an idea of what's going on
train_df.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [32]:
# Worth looking at the statistics of your data as well.
train_df.describe()

Unnamed: 0,id,target
count,7613.0,7613.0
mean,5441.934848,0.42966
std,3137.11609,0.49506
min,1.0,0.0
25%,2734.0,0.0
50%,5408.0,0.0
75%,8146.0,1.0
max,10873.0,1.0


<h3> Data Statistics </h3>

One thing worth noting in this table is the potential for distribution mismatch between our training data and tweets in general. About 43% of the training tweets are labeled as disasters (the mean of target is 0.42966), so our data is certainly biased towards containing more disaster tweets than a typical selection of randomly chosen tweets would have.
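
The class balance can also be read off directly rather than inferred from the mean. The sketch below uses a small made-up frame (`toy_df`) in place of `train_df` so that it runs on its own; the helper name `class_balance` is illustrative.

```python
import pandas as pd


def class_balance(df: pd.DataFrame, label_col: str = "target") -> pd.Series:
    """Return the fraction of rows belonging to each label value."""
    return df[label_col].value_counts(normalize=True)


# Toy stand-in for train_df: 2 disaster tweets out of 5.
toy_df = pd.DataFrame({"target": [1, 0, 0, 1, 0]})
balance = class_balance(toy_df)
print(balance)
```

Calling `class_balance(train_df)` on the real training data should give a disaster fraction matching the mean reported by `describe()`.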

One way to control for this is to randomly discard disaster tweets from our validation set during cross-validation, dropping each one with some probability. It would be interesting to see the difference in performance on a set that looks like our training set vs. one that looks more like a general sample of tweets. This probably means that, instead of looking at the F1 score during training of our model, we should look at a score that's a little bit more useful (even though the F1 score is ultimately what our predictions will be graded against).
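
A minimal sketch of that downsampling idea follows, assuming the validation fold is a DataFrame with a `target` column as in `train_df`. The function name `downsample_positives`, the `keep_prob` parameter, and the toy data are assumptions for illustration; the right keep probability depends on how rare disaster tweets are in the wild, which this notebook does not measure.

```python
import numpy as np
import pandas as pd


def downsample_positives(val_df: pd.DataFrame, keep_prob: float,
                         label_col: str = "target", seed: int = 0) -> pd.DataFrame:
    """Keep every negative row; keep each positive row with probability keep_prob."""
    rng = np.random.default_rng(seed)
    is_positive = val_df[label_col].to_numpy() == 1
    # One uniform draw in [0, 1) per row; only the draws for positive rows matter.
    keep = ~is_positive | (rng.random(len(val_df)) < keep_prob)
    return val_df.loc[keep]


# Toy validation fold: 100 disaster and 100 non-disaster rows.
toy_val = pd.DataFrame({"target": [1] * 100 + [0] * 100})
rebalanced = downsample_positives(toy_val, keep_prob=0.1)
print(rebalanced["target"].value_counts())
```

Only the validation fold is thinned here; the training fold is left untouched so the model still sees every labeled disaster tweet.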