# NLP With Disaster Tweets

## Problem Description

The NLP With Disaster Tweets Kaggle competition asks participants to use NLP tools to identify tweets about disasters. On Twitter (or any other social media platform, for that matter), people often dramatize what they are saying. The goal of this binary classification project is to distinguish tweets that are about real disasters from tweets that are not.

For example, take the tweet: "On the plus side, LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE". It is clear to you and me that the author was speaking metaphorically, but that distinction is much harder for a computer to make. The goal is to build a model that can accurately classify disaster-related tweets.

The evaluation metric for this competition is the F1 score.
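
As a quick illustration of the metric, the sketch below computes the F1 score by hand as the harmonic mean of precision and recall. The helper name `f1_from_labels` and the toy labels are made up for this example; the notebook itself relies on `sklearn.metrics.f1_score` later on.

```python
def f1_from_labels(y_true, y_pred):
    """Compute the binary F1 score for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        # no true positives: precision and recall are both zero (or undefined)
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# toy labels, invented for illustration only
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 1, 0, 0]
toy_f1 = f1_from_labels(toy_true, toy_pred)
print(f'Toy F1 score: {toy_f1:.3f}')
```

Here precision and recall are both 2/3, so the F1 score is also 2/3. Unlike accuracy, the score ignores true negatives, so it rewards a model only for finding the disaster tweets correctly.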

## Data Description

The data consists of two files, train.csv and test.csv. The train.csv file contains the text of each tweet, a keyword from that tweet, the location the tweet was sent from, and a target column that indicates whether the tweet is about a real disaster (1) or not (0).

The test.csv file contains tweets without labels; the label for each of those tweets is what needs to be predicted.

## Imports

In [None]:
# standard imports
import pandas as pd
pd.set_option('display.max_colwidth', 1000)

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re

import nltk

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, LSTM, Bidirectional, Embedding
from tensorflow.keras.optimizers import Adam


In [None]:
train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

In [None]:
train_df['text'].sample(5)

In [None]:
train_df['text'][train_df.index == 7146]

In [None]:
nltk.download('stopwords')
stop_words = set(nltk.corpus.stopwords.words('english'))

In [None]:
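def preprocess_text(text):
    """Lowercase a tweet, strip punctuation, and drop English stop words."""
    text = text.lower()
    # replace every non-word character (punctuation, symbols) with a space
    text = re.sub(r'\W', ' ', text)
    # collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    # keep only the tokens that are not stop words
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text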
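
To see what this cleaning step does, the sketch below applies the same three operations to the example tweet from the problem description. It is self-contained, so it uses a tiny hand-picked stop word set (`demo_stop_words`, an assumption made for this example) instead of the full NLTK list, and the function name `preprocess_text_demo` is likewise hypothetical.

```python
import re

# a tiny stand-in for the NLTK English stop word list (assumption for this demo)
demo_stop_words = {'on', 'the', 'at', 'it', 'was'}


def preprocess_text_demo(text, stop_words):
    """Same steps as preprocess_text, with the stop word set passed in."""
    text = text.lower()
    text = re.sub(r'\W', ' ', text)   # punctuation and symbols become spaces
    text = re.sub(r'\s+', ' ', text)  # collapse repeated whitespace
    return ' '.join(word for word in text.split() if word not in stop_words)


example_tweet = 'On the plus side, LOOK AT THE SKY LAST NIGHT IT WAS ABLAZE'
cleaned_example = preprocess_text_demo(example_tweet, demo_stop_words)
print(cleaned_example)  # plus side look sky last night ablaze
```

With the full NLTK list the result may differ, since that list contains many more words than the five used here.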

In [None]:
train_df['cleaned_text'] = train_df['text'].apply(preprocess_text)
test_df['cleaned_text'] = test_df['text'].apply(preprocess_text)

In [None]:
train_df.head()

In [None]:
train_df['cleaned_text'][train_df.index == 7146]

## Data Visualization

In [None]:
plt.style.use('fivethirtyeight')
sns.countplot(x = 'target', data = train_df, palette = ['salmon', 'purple'])
plt.title('Distribution of target')
plt.show()

In [None]:
train_df['text_length'] = train_df['text'].apply(len)
sns.histplot(train_df['text_length'], bins = 40, color = 'darkblue')
plt.title('Distribution of tweet lengths')
plt.show()

## Data Preprocessing

In [None]:
tokenizer = Tokenizer(num_words = 5000)
tokenizer.fit_on_texts(train_df['cleaned_text'])

In [None]:
X_train = tokenizer.texts_to_sequences(train_df['cleaned_text'])
X_test = tokenizer.texts_to_sequences(test_df['cleaned_text'])

In [None]:
max_len = 100
X_train = pad_sequences(X_train, padding = 'post', maxlen = max_len)
X_test = pad_sequences(X_test, padding = 'post', maxlen = max_len)
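
The cells above turn each cleaned tweet into a fixed-length vector of integer word indices. The sketch below mimics that idea in plain Python so the steps are easier to follow. It is a simplified illustration, not the Keras implementation: the helpers `build_word_index` and `to_padded_sequence` are hypothetical, and the texts are invented. It assumes the behavior used in this notebook, namely indices assigned from 1 in order of descending frequency, words whose index is not below `num_words` dropped, unknown words skipped, and zeros appended after the sequence (`padding='post'`).

```python
from collections import Counter


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


def to_padded_sequence(text, word_index, num_words, max_len):
    """Convert one text to indices, then pad with zeros at the end."""
    seq = [word_index[w] for w in text.split()
           if w in word_index and word_index[w] < num_words]
    seq = seq[-max_len:]  # if too long, keep the last max_len tokens
    return seq + [0] * (max_len - len(seq))


# invented texts, for illustration only
toy_texts = ['fire forest fire', 'sky ablaze']
toy_index = build_word_index(toy_texts)

padded_a = to_padded_sequence('forest fire unknown', toy_index, num_words=4, max_len=5)
padded_b = to_padded_sequence('sky ablaze', toy_index, num_words=4, max_len=5)
print(toy_index)
print(padded_a)
print(padded_b)
```

Index 0 is reserved for padding, which is why word indices start at 1. Note that with `max_len = 100` most rows will consist largely of zeros, because cleaned tweets are far shorter than 100 tokens.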

In [None]:
y_train = train_df['target'].values

In [None]:
X_train_split, X_val_split, y_train_split, y_val_split = train_test_split(X_train, y_train, test_size = .15, random_state = 42)

## Model Building

In [None]:
input_layer = Input(shape=(max_len,))
embedding_layer = Embedding(input_dim = 5000, output_dim = 128)(input_layer)
bi_lstm_layer = Bidirectional(LSTM(128, return_sequences=True))(embedding_layer)
dropout_layer = Dropout(0.3)(bi_lstm_layer)
bi_lstm_layer_2 = Bidirectional(LSTM(64))(dropout_layer)
output_layer = Dense(1, activation = 'sigmoid')(bi_lstm_layer_2)
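
As a sanity check on the size of this architecture, the sketch below estimates the number of trainable parameters per layer with plain arithmetic. It assumes Keras defaults (bias terms enabled) and the usual LSTM count of 4 * (units * (input_dim + units) + units), one weight block per gate; the helper `lstm_params` is hypothetical. The figures are worth comparing against `model.summary()` once the model is built.

```python
def lstm_params(input_dim, units):
    """Parameters of one LSTM layer: 4 gates, each with input, recurrent and bias weights."""
    return 4 * (units * (input_dim + units) + units)


vocab_size, embed_dim = 5000, 128

embedding_params = vocab_size * embed_dim           # one vector per vocabulary index
bi_lstm_1_params = 2 * lstm_params(embed_dim, 128)  # forward and backward LSTM(128)
bi_lstm_2_params = 2 * lstm_params(2 * 128, 64)     # input is the 256-wide output of layer 1
dense_params = 2 * 64 * 1 + 1                       # 128 inputs to 1 unit, plus bias
total_params = embedding_params + bi_lstm_1_params + bi_lstm_2_params + dense_params

print(f'Embedding: {embedding_params:,}')
print(f'BiLSTM 1:  {bi_lstm_1_params:,}')
print(f'BiLSTM 2:  {bi_lstm_2_params:,}')
print(f'Dense:     {dense_params:,}')
print(f'Total:     {total_params:,}')
```

Under these assumptions the model has roughly 1.07 million parameters, about 60% of them in the embedding layer. The dropout layer adds none.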

In [None]:
model = Model(inputs = input_layer, outputs = output_layer)
model.compile(optimizer = Adam(learning_rate = 1e-5), loss = 'binary_crossentropy', metrics = ['accuracy'])

## Model Training

In [None]:
history = model.fit(X_train_split, y_train_split, epochs = 10, batch_size = 16, validation_data = (X_val_split, y_val_split))

## Model Evaluation

In [None]:
val_predictions = model.predict(X_val_split)
# threshold the sigmoid outputs at 0.5 and flatten the (n, 1) array to 1-D labels
val_predictions = (val_predictions > 0.5).astype(int).ravel()
val_f1 = f1_score(y_val_split, val_predictions)
print(f'Validation F1 score: {val_f1}')

### Training and Validation Accuracy and Loss

In [None]:
plt.style.use('fivethirtyeight')
plt.plot(history.history['accuracy'], label = 'accuracy')
plt.plot(history.history['val_accuracy'], label = 'val_accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend(loc = 'lower right')
plt.title('Model Accuracy')
plt.show()

In [None]:
plt.plot(history.history['loss'], label = 'loss')
plt.plot(history.history['val_loss'], label = 'val_loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend(loc = 'upper right')
plt.title('Model Loss')
plt.show()

## Conclusion