Natural Language Processing with Disaster Tweets

In this project, we will be using a dataset of disaster tweets. The goal is to classify whether a tweet is about a real disaster or not.

We will be using the following techniques:

1. Data Cleaning
2. Data Exploration
3. Data Visualization
4. Feature Extraction
5. Model Building   


Assumptions: The words in a tweet are a good indication of whether the tweet is about a real disaster. If we can find a pattern in those words, we can classify the tweet as a disaster tweet or not.
Words that are repeated within a tweet are also a good indication of the tweet's class.

Hypothesis: If a tweet contains a large number of unique words, it is more likely to be a disaster tweet.
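
One way to check this hypothesis before modeling is to count the unique words in each tweet and compare the averages for the two classes. The sketch below does this on a tiny made-up DataFrame. The four rows and the helper name count_unique_words are illustrative assumptions, not the competition data; on the real data the same groupby would be run on df_train.

```python
import pandas as pd

def count_unique_words(text: str) -> int:
    """Number of distinct lowercase, whitespace-separated tokens in a tweet."""
    return len(set(text.lower().split()))

# Made-up rows for illustration only; use df_train on the real data.
toy = pd.DataFrame({
    "text": [
        "I love fruits",
        "so so so hungry",
        "Forest fire near La Ronge Sask. Canada",
        "Flood warning issued for the river valley tonight",
    ],
    "target": [0, 0, 1, 1],
})

toy["n_unique"] = toy["text"].apply(count_unique_words)

# Mean number of unique words per class (0 = not a disaster, 1 = disaster).
mean_unique_by_class = toy.groupby("target")["n_unique"].mean()
print(mean_unique_by_class)
```

The toy numbers prove nothing by themselves; the point is only the shape of the check. A clear gap between the two class means on the full training set would support the hypothesis, and a small gap would argue against it.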

Model choice: Because we assume a linear relationship between the unique words in a tweet and the probability that it is a disaster tweet, we will use a linear model. We will start with ridge regression (scikit-learn's RidgeClassifier) and then try other models to improve performance.

Approach: The first step is to count the unique words across all the tweets, following the Kaggle example given to us. We will then gradually increase the complexity of the approach in stages to see how the score improves.

Kaggle example approach: 
    The example uses a ridge regression model (RidgeClassifier). The data, which was already split into training and test sets, was transformed with count vectorization into training vectors: one row per tweet and one column per unique word.
    Count vectorization builds a vocabulary of the unique words across all the training tweets and creates a vector for each tweet containing the number of times each of those words appears in that tweet.
    These training vectors were then used to train the ridge model. This was done using cross-validation with the F1 score as the scoring method. The training vectors were divided into three parts; the model was trained on two parts and tested on the third, and the process was repeated three times with a different test part each time. This resulted in an array of three F1 scores, one for each fold.
    Since the average score was decent (about 0.60), the model was then fit to the entire set of training vectors (train_vectors) with the target column of the training data.
    Lastly, the fitted model was used to predict the target of the test data.
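
The three-way split described above can be sketched as follows. When cross_val_score is given cv=3 and a classifier, scikit-learn uses a stratified 3-fold split, so each test part keeps roughly the same class balance as the whole set. The nine labels below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up labels for 9 samples (assumption for illustration only).
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1])
X = np.zeros((len(y), 1))  # features are irrelevant for showing the split

# Same kind of splitter that cross_val_score uses for cv=3 with a classifier.
splitter = StratifiedKFold(n_splits=3)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(splitter.split(X, y)):
    fold_sizes.append((len(train_idx), len(test_idx)))
    print(f"fold {fold}: train on {len(train_idx)} rows, test on {len(test_idx)} rows")
```

Every row ends up in the test part exactly once, which is why the procedure yields one score per fold.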





In [5]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import feature_extraction, linear_model, model_selection, preprocessing


In [None]:
df_test = pd.read_csv('/opt/miniconda3/envs/NLP_distater_tweets/nlp-getting-started/test.csv')
df_train = pd.read_csv('/opt/miniconda3/envs/NLP_distater_tweets/nlp-getting-started/train.csv')


df_train.head(), df_test.head()

In [44]:

print(df_train[df_train["target"] == 0]["text"].values[1]) #print the text of the second tweet (index 1) that is not a disaster

print(df_train[df_train["target"] == 1]["text"].values[1]) #print the text of the second tweet (index 1) that is a disaster



I love fruits
Forest fire near La Ronge Sask. Canada


In [43]:

count_vectorizer = feature_extraction.text.CountVectorizer() #initialize the count vectorizer

example_train_vectors = count_vectorizer.fit_transform(df_train["text"][0:5]) ## let's get counts for the first 5 tweets in the data
example_train_vectors.shape #shape of the vectorized data (not displayed, since it is not the last expression in the cell)
print(example_train_vectors[0]) #print the vectorized data for the first tweet
print(example_train_vectors[0].todense()) #print the vectorized data for the first tweet in dense format

#this shows the counts for the first tweet over the 54 unique tokens found in the first 5 tweets; 13 of those tokens appear in the first tweet

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 13 stored elements and shape (1, 54)>
  Coords	Values
  (0, 34)	1
  (0, 12)	1
  (0, 5)	1
  (0, 49)	1
  (0, 39)	1
  (0, 29)	1
  (0, 50)	1
  (0, 13)	1
  (0, 25)	1
  (0, 4)	1
  (0, 18)	1
  (0, 52)	1
  (0, 3)	1
[[0 0 0 1 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 1 0]]


In [47]:
train_vectors = count_vectorizer.fit_transform(df_train["text"])
print(train_vectors.shape)
## note that we're NOT using .fit_transform() for the test data. Using just .transform() makes sure
# that only the tokens learned from the training tweets are mapped into the test vectors -
# i.e. that the train and test vectors use the same set of tokens.
test_vectors = count_vectorizer.transform(df_test["text"])

(7613, 21637)
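
The effect of calling only .transform() on the test data can be seen on a small example. In this sketch the texts are made up for illustration; the test text contains a word that never appears in training, and that word is simply dropped, so both matrices keep the same number of columns.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up texts (assumption for illustration only).
train_texts = ["forest fire", "love fruits"]
test_texts = ["forest earthquake"]  # "earthquake" never appears in training

vectorizer = CountVectorizer()
train_matrix = vectorizer.fit_transform(train_texts)  # learns the vocabulary
test_matrix = vectorizer.transform(test_texts)        # reuses it unchanged

print(train_matrix.shape, test_matrix.shape)
print(test_matrix.toarray())
```

Calling fit_transform() on the test texts instead would build a different vocabulary, and the columns would no longer line up with the ones the model was trained on.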


In [49]:
## Our vectors are really big, so we want to push our model's weights
## toward 0 without completely discounting different words - ridge regression 
## is a good way to do this.
clf = linear_model.RidgeClassifier()
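
How strongly the weights are pushed toward 0 is controlled by the alpha parameter of RidgeClassifier (1.0 by default). A small sketch of that effect, using synthetic data from make_classification as a stand-in for the tweet vectors (the data and the alpha values are assumptions for illustration only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import RidgeClassifier

# Synthetic data standing in for the tweet vectors (assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=50, random_state=0)

coef_norms = {}
for alpha in (0.01, 1.0, 100.0):
    model = RidgeClassifier(alpha=alpha).fit(X, y)
    # L2 norm of the learned weights; a larger alpha penalizes large weights more.
    coef_norms[alpha] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

The weight norm shrinks as alpha grows, but individual weights are generally not set exactly to 0, which matches the goal of not completely discounting any word.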

In [54]:
scores = model_selection.cross_val_score(clf, train_vectors, df_train["target"], cv=3, scoring="f1")
scores #this shows the F1 score for each fold

array([0.59453669, 0.5642787 , 0.64082434])
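
The scores above were computed on vectors whose vocabulary was learned from all the training tweets, including the ones held out in each fold. One possible variation is to wrap the vectorizer and the classifier in a pipeline so that the vocabulary is re-learned from the training part of every fold. The sketch below shows the idea on made-up tweets and labels (assumptions for illustration only); on the real data, df_train["text"] and df_train["target"] would be passed instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import RidgeClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up tweets and labels (assumption for illustration only).
texts = [
    "forest fire near the town", "flood warning for the valley",
    "earthquake shakes the city", "wildfire spreads near homes",
    "flood waters rising fast", "fire crews battle forest blaze",
    "i love fruits", "what a lovely day",
    "my car is so fast", "this song is fire",
    "love the new cafe in town", "great day at the beach",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# The pipeline refits CountVectorizer on the training part of every fold.
pipeline = make_pipeline(CountVectorizer(), RidgeClassifier())

scores = cross_val_score(pipeline, texts, labels, cv=3, scoring="f1")
print(scores, scores.mean())
```

Whether this changes the scores noticeably on the competition data is not something this sketch can show; it only keeps the held-out tweets out of the vocabulary.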

In [55]:
clf.fit(train_vectors, df_train["target"]) #fit the model to the training data

In [56]:
sample_submission = pd.read_csv("/opt/miniconda3/envs/NLP_distater_tweets/nlp-getting-started/sample_submission.csv") #read the sample submission file

In [57]:
sample_submission["target"] = clf.predict(test_vectors) #predict the target for the test data

In [58]:
sample_submission.head()

   id  target
0   0       0
1   2       1
2   3       1
3   9       0
4  11       1


In [59]:
sample_submission.to_csv("submission.csv", index=False) 