# **Natural Language Processing with Disaster Tweets: A Report**

## **1. Introduction**

### **1.1 Problem Statement**
The goal of this project is to classify tweets to determine whether they pertain to a real disaster or not. This challenge is part of a Kaggle competition titled "Natural Language Processing with Disaster Tweets," where participants are tasked with developing models to identify disaster-related tweets. This task serves as an excellent introduction to Natural Language Processing (NLP) and machine learning, particularly in dealing with textual data.

### **1.2 Data Description**
The dataset provided comprises two CSV files: `train.csv` and `test.csv`. The `train.csv` file contains 7,613 tweets, each labeled as either 1 (indicating a disaster-related tweet) or 0 (indicating a non-disaster-related tweet). The `test.csv` file includes 3,263 tweets that require classification by the model.

The dataset features the following columns:
- `id`: A unique identifier for each tweet.
- `keyword`: A keyword extracted from the tweet, which may contain missing values.
- `location`: The location from where the tweet was sent, which may also contain missing values.
- `text`: The actual text of the tweet.
- `target`: The label indicating whether the tweet is about a disaster (1) or not (0).

The challenge lies in developing a model that can accurately classify tweets in the `test.csv` file based on the training data provided.

## **2. Exploratory Data Analysis (EDA)**

### **2.1 Data Inspection**
Initially, the dataset was loaded and its structure inspected. The `train.csv` file contains 7,613 rows and 5 columns, while the `test.csv` file has 3,263 rows and 4 columns, lacking the `target` column present in the training data.

The target distribution in the training dataset shows a slight imbalance: 4,342 tweets (about 57%) are not related to disasters, while 3,271 (about 43%) are. This distribution highlights the importance of addressing potential imbalances in the data during the modeling process.
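
The size of the imbalance can be checked with a few lines of arithmetic. The sketch below uses only the class counts reported above; the variable names are illustrative and not taken from the notebook.

```python
# Class counts as reported for train.csv in this report
counts = {0: 4342, 1: 3271}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

for label, share in shares.items():
    print(f"target={label}: {counts[label]} tweets ({share:.1%})")

# Ratio of the majority class to the minority class
imbalance_ratio = max(counts.values()) / min(counts.values())
print(f"majority/minority ratio: {imbalance_ratio:.2f}")
```

A ratio this close to 1 is mild, which is why simple undersampling remains an affordable option here.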

### **2.2 Data Cleaning**
Upon inspecting the data, it was evident that there were missing values in the `keyword` and `location` columns. However, since the primary focus of this analysis is on the `text` column, these missing values were not imputed. Instead, the emphasis was placed on preprocessing the text data to ensure it was ready for modeling.

### **2.3 Data Visualization**
Visualizations of the most frequent keywords and locations were generated to gain a better understanding of the tweet content and context. The most common keywords included terms like "fatalities," "deluge," and "armageddon," while the most frequent locations were "USA," "New York," and "United States," followed by "London." Because `location` is free text, the same place can appear under several spellings (for example "USA" and "United States"), so these counts understate how concentrated the locations are. The keyword counts give a sense of the types of disasters discussed.
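
The keyword counts in the notebook output include URL-encoded values such as `forest%20fire`. A minimal sketch of decoding them before counting is shown below; the sample list is hypothetical, and `None` stands in for a missing keyword.

```python
from collections import Counter
from urllib.parse import unquote

# Hypothetical sample of raw `keyword` values; None stands in for a missing value
raw_keywords = ["forest%20fire", "deluge", "forest%20fire", None, "radiation%20emergency"]

# Drop missing values and turn "%20" back into a space
decoded = [unquote(k) for k in raw_keywords if k is not None]
keyword_counts = Counter(decoded)
print(keyword_counts.most_common(2))
```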

## **3. Data Preprocessing**

### **3.1 Text Preprocessing**
Text preprocessing is a crucial step in Natural Language Processing. The following steps were applied to the `text` column to prepare it for modeling:

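1. **Removing URLs**: Tweets often contain URLs, which are not relevant for the analysis and were thus removed.
2. **Removing Hashtags**: Tokens beginning with `#` were removed entirely, which discards the tagged word along with the symbol.
3. **Removing Punctuation**: Punctuation marks were removed; for this bag-of-tokens style of input they add vocabulary noise without adding much signal.
4. **Removing Stopwords**: Common stopwords such as "and," "the," etc., which do not carry significant meaning, were removed from the text.
5. **Stemming**: The Snowball Stemmer was applied to standardize the vocabulary. In the notebook code the stemmer is called on each whole tweet rather than on each word, so in practice it lowercases the text and stems only the end of the string (for example, "body bags" becomes "body bag" while "ready" is left unchanged).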

This preprocessing step was essential to ensure that the text data was clean and standardized, making it more suitable for modeling.
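
As an illustration, URL removal, punctuation removal, stopword filtering, and per-word stemming can be sketched with the standard library alone. The stopword set and the `strip_plural` stemmer below are tiny hypothetical stand-ins chosen so the example runs without downloads; they are not the NLTK resources used in the notebook.

```python
import re
import string

# Tiny illustrative stopword set (an assumption, not the NLTK English list)
STOP_WORDS = {"a", "the", "and", "is", "for", "these", "of", "to", "in"}

def strip_plural(word):
    """Hypothetical toy stemmer: drop a trailing 's' from longer words."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def clean_tweet(text, stop_words=STOP_WORDS, stem=None):
    """Remove URLs and punctuation, lowercase, drop stopwords, then stem each word."""
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    words = [w.lower() for w in text.split()]
    words = [w for w in words if w not in stop_words]
    if stem is not None:
        # The stemmer is applied word by word, not to the whole sentence
        words = [stem(w) for w in words]
    return " ".join(words)

print(clean_tweet("Lab today ready for these body bags. http://t.co/x"))
print(clean_tweet("Lab today ready for these body bags.", stem=strip_plural))
```

Any callable that maps one word to its stem can be passed as `stem`, for example the `stem` method of an NLTK `SnowballStemmer("english")` instance.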

### **3.2 Balancing the Dataset**
Given the slight imbalance in the dataset, undersampling was applied to the majority class (target = 0) to balance the data. This step involved reducing the number of non-disaster-related tweets to match the number of disaster-related tweets, ensuring that the model would not be biased towards the majority class.
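
The idea behind undersampling can be sketched without pandas. The helper `undersample` and the toy rows below are hypothetical; the notebook itself uses `DataFrame.sample` on the majority class.

```python
import random

def undersample(rows, label_of, seed=123):
    """Downsample every class to the size of the smallest one."""
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    n_min = min(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        # Sampling without replacement keeps each retained row unique
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced

# Hypothetical toy data: (text, target) pairs with a 5:3 imbalance
rows = [(f"tweet {i}", 0) for i in range(5)] + [(f"alert {i}", 1) for i in range(3)]
balanced = undersample(rows, label_of=lambda r: r[1])
print(len(balanced))
```

Applied to the class counts reported earlier, this leaves 3,271 tweets per class, or 6,542 in total, at the cost of discarding 1,071 non-disaster tweets.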

### **3.3 Splitting the Data**
The balanced dataset was then shuffled and split into training (80%) and validation (20%) sets. The validation set is held out from weight updates, so its metrics indicate how well the model generalizes to unseen data.
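
A seeded split makes the validation set reproducible between runs. The sketch below is a dependency-free illustration with a hypothetical helper `split_train_val`; the notebook uses scikit-learn's `train_test_split` with `test_size=0.20` and no fixed seed.

```python
import random

def split_train_val(rows, val_fraction=0.20, seed=42):
    """Shuffle with a fixed seed, then hold out `val_fraction` of the rows."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = round(len(rows) * val_fraction)
    return rows[n_val:], rows[:n_val]

rows = list(range(10))  # hypothetical stand-in for the tweet rows
train_rows, val_rows = split_train_val(rows)
print(len(train_rows), len(val_rows))
```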

### **3.4 Text Tokenization and Padding**
To prepare the text data for input into the neural network, the text was tokenized and the sequences were padded. Tokenization involved converting the text into sequences of integers, where each integer represented a word in the vocabulary. Padding was used to ensure that all sequences had the same length, which is a requirement for input into the neural network.
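
The mechanics can be sketched in plain Python. This is a simplified illustration, not the Keras implementation: the hypothetical helpers below assign ids in order of first appearance (the Keras `Tokenizer` orders by frequency), and `MAXLEN` is an assumed fixed length.

```python
def build_vocab(texts):
    """Map each word to a positive integer; 0 is reserved for padding."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab) + 1)
    return vocab

def encode(text, vocab):
    """Convert a text to integer ids, skipping words missing from the vocabulary."""
    return [vocab[w] for w in text.lower().split() if w in vocab]

def pad_left(seq, maxlen):
    """Left-pad with zeros, or keep only the last `maxlen` ids if the sequence is too long."""
    seq = seq[-maxlen:]
    return [0] * (maxlen - len(seq)) + seq

train_texts = ["lab today ready body bag", "terrible car crash"]
vocab = build_vocab(train_texts)
MAXLEN = 6  # hypothetical fixed length shared by train, validation, and test
padded = [pad_left(encode(t, vocab), MAXLEN) for t in train_texts]
print(padded)
```

Fixing one length for every split keeps the input shape identical across training, validation, and test data. In Keras this corresponds to passing `maxlen` to `pad_sequences`, which pads at the front by default.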

## **4. Model Architecture**

### **4.1 Model Selection**
Given the sequential nature of the text data, a Long Short-Term Memory (LSTM) model was chosen for this task. LSTM is a type of Recurrent Neural Network (RNN) that is well-suited for capturing dependencies in sequential data, making it a suitable choice for text classification tasks such as this one.

### **4.2 Word Embedding**
To convert the text data into a format that the LSTM model could process, word embeddings were used. The Keras `Embedding` layer maps each integer in a sequence to a dense 128-dimensional vector. These vectors are learned jointly with the rest of the network and are passed to a 64-unit LSTM layer, which uses them to learn patterns in the data.
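
Conceptually, an embedding layer is a lookup table indexed by word id. The NumPy sketch below illustrates this with hypothetical sizes and random values; it does not reproduce the weights or initialization used by Keras.

```python
import numpy as np

vocab_size, embed_dim = 9, 4  # hypothetical sizes; the notebook uses an embedding size of 128
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(scale=0.05, size=(vocab_size, embed_dim))

padded_sequence = np.array([0, 0, 0, 6, 7, 8])  # integer ids, 0 = padding
# Row i of the matrix is the vector for word id i
embedded = embedding_matrix[padded_sequence]
print(embedded.shape)
```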

### **4.3 Model Compilation**
The model was compiled using the Adam optimizer and a binary cross-entropy loss function. The output layer has two sigmoid units, matched to one-hot encoded labels. The Area Under the ROC Curve (AUC) was chosen as the evaluation metric because it measures how well the model ranks disaster tweets above non-disaster tweets, independent of any single decision threshold.
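
Both quantities are easy to compute by hand on a small example. The sketch below uses the pairwise-ranking definition of AUC and the standard binary cross-entropy formula; the labels and scores are made up for illustration. Keras approximates AUC from a fixed set of thresholds, so its value can differ slightly from this exact computation.

```python
import math

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with probabilities clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [0, 0, 1, 1]            # hypothetical ground truth
scores = [0.1, 0.4, 0.35, 0.8]   # hypothetical predicted probabilities
print(roc_auc(labels, scores))   # 0.75
print(binary_cross_entropy(labels, scores))
```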

### **4.4 Model Training**
The model was trained for 10 epochs with a batch size of 64. During training, the model was fitted on the training data and its performance was validated on the validation data. This training process involved adjusting the model's weights to minimize the loss function, while monitoring the AUC metric to ensure that the model was learning to distinguish between disaster-related and non-disaster-related tweets.

## **5. Results and Analysis**

### **5.1 Model Performance**
The model's performance was evaluated based on the AUC and loss metrics on both the training and validation sets. The model fit the training data almost perfectly, reaching an AUC of 0.9993 by the final epoch, while the validation AUC settled around 0.8017. The training log shows that validation AUC was highest in the first epoch (0.8361) and that validation loss rose from 0.4960 to 0.7917 by epoch 6, even as training loss fell from 0.6581 to 0.0577. This widening gap indicates that the model is overfitting: after the first epoch it mostly memorizes the training data instead of learning generalizable patterns.
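
The point at which validation performance stops improving can be read directly from the training log. The sketch below copies the metrics for the first six epochs from the notebook output (the remaining epochs are truncated there) and finds the best epoch under each validation metric.

```python
# Metrics copied from the first six epochs of the training log
val_auc = [0.8361, 0.8279, 0.8202, 0.8156, 0.7994, 0.8073]
val_loss = [0.4960, 0.5299, 0.6221, 0.7022, 0.8715, 0.7917]
train_loss = [0.6581, 0.3321, 0.1638, 0.0958, 0.0688, 0.0577]

# Epochs are 1-indexed in the log
best_auc_epoch = max(range(len(val_auc)), key=val_auc.__getitem__) + 1
best_loss_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between validation and training loss; a growing gap signals overfitting
gaps = [round(v - t, 4) for v, t in zip(val_loss, train_loss)]

print(best_auc_epoch, best_loss_epoch)
print(gaps)
```

Both metrics point to the first epoch, which suggests that most of the ten epochs add training-set fit without improving validation performance.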

### **5.2 Hyperparameter Tuning**
To address the issue of overfitting, several strategies could be explored, including:

- **Adding Dropout Layers**: Dropout layers could be introduced to randomly deactivate neurons during training, thereby helping to prevent overfitting by making the model more robust.
- **Adjusting Learning Rate**: A lower learning rate makes smaller weight updates per step, which slows how quickly the model fits the training set and can make it easier to stop training before it starts memorizing noise.
- **Using Regularization**: Techniques such as L2 regularization could be applied to penalize large weights, encouraging the model to generalize better to unseen data.
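
To make dropout and L2 regularization concrete, here is a minimal NumPy sketch of the underlying arithmetic. The function names are hypothetical and the code illustrates the mechanisms only; it is not how the notebook's model was trained.

```python
import numpy as np

def inverted_dropout(activations, rate, rng):
    """Zero each unit with probability `rate` and rescale the rest by 1 / (1 - rate)."""
    if rate == 0.0:
        return activations
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

def l2_penalty(weights, strength):
    """L2 term added to the loss: strength * sum of squared weights."""
    return strength * float(np.sum(np.square(weights)))

rng = np.random.default_rng(0)
hidden = np.ones((4, 8))  # hypothetical layer activations
dropped = inverted_dropout(hidden, rate=0.5, rng=rng)
print(dropped)
print(l2_penalty(np.array([1.0, -2.0]), strength=0.01))
```

In Keras, the corresponding building blocks are the `Dropout` layer (or the `dropout` and `recurrent_dropout` arguments of `LSTM`), the `kernel_regularizer` argument with `tf.keras.regularizers.l2`, and the `learning_rate` argument of `tf.keras.optimizers.Adam`.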

### **5.3 Further Improvements**
To further improve the model's performance, additional enhancements could include:

- **Using Pretrained Embeddings**: Incorporating pretrained embeddings such as GloVe or Word2Vec could provide the model with richer semantic information, improving its ability to understand the context and meaning of the tweets.
- **Exploring Advanced Architectures**: More advanced architectures, such as bidirectional LSTMs or Gated Recurrent Units (GRUs), could be explored to capture more complex dependencies in the text data, potentially leading to better model performance.
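
For pretrained embeddings, the usual first step is to build a weight matrix whose rows line up with the tokenizer's word indices. The sketch below assumes the GloVe text format (a word followed by space-separated floats on each line) and uses three made-up lines in place of a real file; `build_embedding_matrix` is a hypothetical helper.

```python
import numpy as np

def build_embedding_matrix(lines, word_index, embed_dim):
    """Fill row i with the pretrained vector of the word whose index is i; others stay zero."""
    matrix = np.zeros((len(word_index) + 1, embed_dim))  # row 0 is the padding index
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        idx = word_index.get(word)
        if idx is not None and len(values) == embed_dim:
            matrix[idx] = np.asarray(values, dtype="float32")
    return matrix

# Made-up vectors standing in for lines read from a GloVe file
glove_lines = ["fire 0.1 0.2 0.3", "flood 0.4 0.5 0.6", "cat 0.7 0.8 0.9"]
word_index = {"fire": 1, "flood": 2, "earthquake": 3}  # hypothetical tokenizer vocabulary
matrix = build_embedding_matrix(glove_lines, word_index, embed_dim=3)
print(matrix)
```

The resulting matrix can be given to a Keras `Embedding` layer through `embeddings_initializer=tf.keras.initializers.Constant(matrix)`, optionally with `trainable=False` to keep the pretrained vectors fixed.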

## **6. Conclusion**

### **6.1 Summary**
In this project, an LSTM-based model was developed to classify tweets as disaster-related or not. While the model performed well on the training data, its performance on the validation data indicated potential overfitting. Various techniques, such as dropout, learning rate adjustments, and regularization, can be applied to improve the model's generalization ability.

### **6.2 Future Work**
Future work could focus on experimenting with pretrained embeddings, exploring more sophisticated architectures, and applying extensive hyperparameter tuning to optimize the model's performance. Additionally, data augmentation techniques could be explored to address the slight imbalance in the dataset, further improving the model's robustness.

### **6.3 Key Takeaways**
- **NLP Challenges**: Text classification in NLP requires careful preprocessing and thoughtful model design to achieve good performance.
- **Overfitting**: Overfitting is a significant challenge in machine learning, particularly with complex models like LSTMs, and requires strategies such as regularization and dropout to mitigate.
- **Model Optimization**: Continuous experimentation with model architectures, embeddings, and hyperparameters is essential for improving model performance and achieving better results.

## **7. References**

- Kaggle Competition: [Natural Language Processing with Disaster Tweets](https://www.kaggle.com/c/nlp-getting-started)
- TensorFlow Documentation: [tf.keras](https://www.tensorflow.org/api_docs/python/tf/keras)
- GloVe: [Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
- Word2Vec: [Word2Vec on TensorFlow](https://www.tensorflow.org/tutorials/text/word2vec)

This report provides a detailed overview of the steps taken to build a model for classifying disaster-related tweets, offering insights into the challenges encountered and potential improvements in the field of Natural Language Processing.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/nlp-getting-started/sample_submission.csv
/kaggle/input/nlp-getting-started/train.csv
/kaggle/input/nlp-getting-started/test.csv


In [2]:
import nltk
from nltk.corpus import stopwords

import tensorflow as tf
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense, Dropout, BatchNormalization
from keras.utils import to_categorical
from keras.models import Sequential
from nltk.stem import SnowballStemmer
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import string
import re

import matplotlib.pyplot as plt



In [4]:
df_data = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
df_test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')
df_sample = pd.read_csv('/kaggle/input/nlp-getting-started/sample_submission.csv')

# EDA

In [5]:
df_data.shape

(7613, 5)

In [6]:
df_data.head()

|   | id | keyword | location | text | target |
|---|----|---------|----------|------|--------|
| 0 | 1 | NaN | NaN | Our Deeds are the Reason of this #earthquake M... | 1 |
| 1 | 4 | NaN | NaN | Forest fire near La Ronge Sask. Canada | 1 |
| 2 | 5 | NaN | NaN | All residents asked to 'shelter in place' are ... | 1 |
| 3 | 6 | NaN | NaN | 13,000 people receive #wildfires evacuation or... | 1 |
| 4 | 7 | NaN | NaN | Just got sent this photo from Ruby #Alaska as ... | 1 |


In [7]:
df_data.value_counts('keyword')

keyword
fatalities               45
deluge                   42
armageddon               42
sinking                  41
damage                   41
                         ..
forest%20fire            19
epicentre                12
threat                   11
inundation               10
radiation%20emergency     9
Name: count, Length: 221, dtype: int64

In [8]:
df_data.value_counts('location')

location
USA               104
New York           71
United States      50
London             45
Canada             29
                 ... 
Hueco Mundo         1
Hughes, AR          1
Huntington, WV      1
Huntley, IL         1
åø\_(?)_/åø         1
Name: count, Length: 3341, dtype: int64

In [9]:
df_data.value_counts('target')

target
0    4342
1    3271
Name: count, dtype: int64

# Preprocessing

In [10]:
stop_words = set(stopwords.words('english'))
stemmer = SnowballStemmer("english")

In [11]:
class_0 = df_data[df_data['target'] == 0]
class_1 = df_data[df_data['target'] == 1]
class_0_under = class_0.sample(len(class_1), random_state=123)
df_data = pd.concat([class_0_under, class_1])[['id','text','target']]

In [12]:
df_data = shuffle(df_data)
df_train, df_val = train_test_split(df_data, test_size=0.20)

In [13]:
text = df_train['text'].to_numpy()
label = df_train['target'].to_numpy()
text_val = df_val['text'].to_numpy()
label_val = df_val['target'].to_numpy()
text

array(['Lab today ready for these body bags. ??',
       "There's a weird siren going off here...I hope Hunterston isn't in the process of blowing itself to smithereens...",
       'i decided to take a break from my emotional destruction to watch tangled then watch desolation of smaug',
       ...,
       'Governor weighs parole for California school bus hijacker http://t.co/7NPBfRzEJL http://t.co/Y0kByy8nce',
       '@Silent0siris why not even more awesome norse landscapes with loads of atmosphere and life than boring/dead snotgreen wastelands =/',
       'Dakota Skye gets horny with some porn then gets her juicy pussy pounded http://t.co/qew4c5M1xd View and download video'],
      dtype=object)

In [14]:
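def preprocess_sentence(sentence):
    """Strip URLs, hashtags, punctuation, and English stopwords from one tweet."""
    # Remove URLs (http, https, and www forms)
    sentence = re.sub(r'http\S+|www\S+', '', sentence)
    # Remove hashtags entirely, including the tagged word
    sentence = re.sub(r'#\S+', '', sentence)
    # Delete all ASCII punctuation characters
    sentence = sentence.translate(str.maketrans('', '', string.punctuation))
    words = sentence.split()
    # Case-insensitive stopword filter; the casing of kept words is preserved
    words = [word for word in words if word.lower() not in stop_words]
    return ' '.join(words)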

In [15]:
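text = [preprocess_sentence(x) for x in text]
print(text[0])
# NOTE: stem() receives each whole tweet, not individual words, so it only
# lowercases the string and stems its ending. For per-word stemming use
# ' '.join(stemmer.stem(w) for w in sentence.split()).
text = [stemmer.stem(sentence) for sentence in text]
print(text[0])
# Fit the vocabulary on the training text only
word_tokenizer = tf.keras.preprocessing.text.Tokenizer()
word_tokenizer.fit_on_texts(text)
vocab_length = len(word_tokenizer.word_index) + 1  # +1 for the padding index 0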

Lab today ready body bags
lab today ready body bag


In [16]:
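def preprocess_text(text):
    """Clean, stem, tokenize, and pad a list of tweets; prints the first item at each stage."""
    print(text[0])
    text = [preprocess_sentence(x) for x in text]
    print(text[0])
    # stem() is applied to each whole tweet rather than to each word
    text = [stemmer.stem(sentence) for sentence in text]
    print(text[0])
    # Words not seen when the tokenizer was fitted are dropped
    sequences = word_tokenizer.texts_to_sequences(text)

    print(sequences[0])
    # No maxlen is given, so each call pads to the longest sequence in its own input
    padded_sequences = pad_sequences(sequences)
    print(padded_sequences[0])

    return padded_sequences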

In [17]:
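# `text` was already cleaned and stemmed in the previous cell, so this second pass changes little
texts = preprocess_text(text)
labels = to_categorical(label)  # one-hot labels with two columns
texts_val = preprocess_text(text_val)
labels_val = to_categorical(label_val)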

lab today ready body bag
lab today ready body bag
lab today ready body bag
[341, 52, 805, 26, 132]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0   0 341  52 805  26 132]
I'll be at SFA very soon....#Pandemonium http://t.co/RW8b50xz9m
Ill SFA soon
ill sfa soon
[407, 441]
[  0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
   0 407 441]


# Train

In [18]:
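model = Sequential()
model.add(Embedding(vocab_length, 128))  # 128-dimensional word vectors learned during training
# model.add(Dropout(0.5))
model.add(LSTM(64))  # 64 recurrent units; only the final hidden state is passed on
# Two sigmoid outputs paired with one-hot labels; a single sigmoid unit with 0/1 labels is the more common setup
model.add(Dense(2, activation='sigmoid'))

model.compile(loss='BinaryCrossentropy', optimizer='Adam', metrics=['AUC'])
# model.summary()
model.fit(texts, labels, validation_data=(texts_val, labels_val), epochs=10, batch_size=64)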

Epoch 1/10
82/82 ━━━━━━━━━━━━━━━━━━━━ 5s 33ms/step - AUC: 0.6417 - loss: 0.6581 - val_AUC: 0.8361 - val_loss: 0.4960
Epoch 2/10
82/82 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - AUC: 0.9391 - loss: 0.3321 - val_AUC: 0.8279 - val_loss: 0.5299
Epoch 3/10
82/82 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - AUC: 0.9838 - loss: 0.1638 - val_AUC: 0.8202 - val_loss: 0.6221
Epoch 4/10
82/82 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - AUC: 0.9933 - loss: 0.0958 - val_AUC: 0.8156 - val_loss: 0.7022
Epoch 5/10
82/82 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - AUC: 0.9967 - loss: 0.0688 - val_AUC: 0.7994 - val_loss: 0.8715
Epoch 6/10
82/82 ━━━━━━━━━━━━━━━━━━━━ 2s 29ms/step - AUC: 0.9975 - loss: 0.0577 - val_AUC: 0.8073 - val_loss: 0.7917
Epoch 7/10
82/82 ━━━━━━━━━━━━━━━━━━━━ 3s 31ms/step - AUC: 0.

<keras.src.callbacks.history.History at 0x7fab8e2f5e40>

In [19]:
text2 = df_test['text']
text2 = text2.to_numpy()
text2 = preprocess_text(text2)

Just happened a terrible car crash
happened terrible car crash
happened terrible car crash
[858, 2255, 40, 18]
[   0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0  858 2255   40   18]


In [20]:
predictions = model.predict(text2)

102/102 ━━━━━━━━━━━━━━━━━━━━ 1s 6ms/step


In [21]:
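# Column 1 of the two-unit output holds the score for the disaster class
prob_class_1 = predictions[:, 1]
# Threshold at 0.5 to obtain hard 0/1 labels
binary_predictions = [1 if p >= 0.5 else 0 for p in prob_class_1]
df = pd.DataFrame({'target': binary_predictions}, index=df_test['id'])
df.index.name = 'id'
df.to_csv('/kaggle/working/submission.csv')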