# Natural Language Processing with Disaster Tweets

## Problem Description
The goal of this project is to classify tweets as either describing a real disaster or not. This is a binary text classification task on short texts, approached with Natural Language Processing (NLP) techniques. Each tweet in the training set is labeled 1 (real disaster) or 0 (not a disaster). We fine-tune a pretrained DistilBERT model to automate this classification.

## Dataset Overview
- `id`: A unique identifier for each tweet.
- `keyword`: A keyword from the tweet (may be blank).
- `location`: The location the tweet was sent from (may be blank).
- `text`: The content of the tweet.
- `target`: The label (1 for disaster-related, 0 for not); present in the training set only.
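
Because `keyword` and `location` can be blank, it may be worth checking how many values are missing and how the labels are distributed before modeling. The following is a minimal sketch on a made-up three-row frame; the rows and the name `df_example` are hypothetical and only mimic the column layout listed above.

```python
import pandas as pd

# Hypothetical rows that mimic the column layout described above.
df_example = pd.DataFrame({
    "id": [1, 2, 3],
    "keyword": [None, "wildfire", "flood"],
    "location": [None, None, "Texas"],
    "text": [
        "Forest fire near the town",
        "This mixtape is fire",
        "Flood warning issued",
    ],
    "target": [1, 0, 1],
})

# Count missing values per column.
missing_counts = df_example.isna().sum()
print(missing_counts)

# Share of each class in the label column.
class_share = df_example["target"].value_counts(normalize=True)
print(class_share)
```

The same two calls should work unchanged on `df_train` once it is loaded.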

In [None]:
import os

# Set TensorFlow Backend
# This must happen before keras_core is imported; the backend is read at import time.
os.environ['KERAS_BACKEND'] = 'tensorflow'

import numpy as np
import pandas as pd
import tensorflow as tf
import keras_core as keras
import keras_nlp
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix


In [None]:
# Load Dataset
df_train = pd.read_csv("/kaggle/input/nlp-getting-started/train.csv")
df_test = pd.read_csv("/kaggle/input/nlp-getting-started/test.csv")

# Data Overview
print(f'Training Set Shape: {df_train.shape}')
print(f'Test Set Shape: {df_test.shape}')
print(df_train.head())
print(df_test.head())


## Exploratory Data Analysis (EDA)

In [None]:
df_train["length"] = df_train["text"].apply(len)
df_test["length"] = df_test["text"].apply(len)

plt.figure(figsize=(10, 5))
sns.histplot(df_train["length"], bins=30, kde=True)
plt.title("Distribution of Tweet Lengths in Training Set")
plt.show()


In [None]:
# Data Preprocessing
BATCH_SIZE = 32
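TRAIN_SPLIT = 0.8  # Kept for reference; the split below is driven by VAL_SPLIT
VAL_SPLIT = 0.2  # Fraction of the training data held out for validation
EPOCHS = 3  # Number of fine-tuning passes over the training data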

# Train-Validation Split
X = df_train["text"]
y = df_train["target"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=VAL_SPLIT, random_state=42)
X_test = df_test["text"]


In [None]:
# Load DistilBERT model from Keras NLP
preset = "distil_bert_base_en_uncased"
preprocessor = keras_nlp.models.DistilBertPreprocessor.from_preset(preset, sequence_length=160)
classifier = keras_nlp.models.DistilBertClassifier.from_preset(preset, preprocessor=preprocessor, num_classes=2)

classifier.summary()


In [None]:
# Compile Model
classifier.compile(
    loss=keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    optimizer=keras.optimizers.Adam(1e-5),
    metrics=["accuracy"]
)


## Model Training and Evaluation

In [None]:
# Train Model
history = classifier.fit(
    x=X_train,
    y=y_train,
    batch_size=BATCH_SIZE,
    epochs=EPOCHS,
    validation_data=(X_val, y_val)
)


In [None]:
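# Function to Display Confusion Matrix
def display_confusion_matrix(y_true, y_pred, dataset):
    """Plot the confusion matrix for `dataset` and show the F1 score in the title."""
    # y_pred holds one logit per class; the predicted label is the largest one.
    y_pred_labels = np.argmax(y_pred, axis=1)
    disp = ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred_labels,
        display_labels=["Not Disaster", "Disaster"],
        cmap=plt.cm.Blues
    )
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred_labels).ravel()
    # F1 = 2*TP / (2*TP + FP + FN), written here as TP / (TP + (FN + FP) / 2).
    f1 = tp / (tp + ((fn + fp) / 2))
    disp.ax_.set_title(f"Confusion Matrix on {dataset} Dataset -- F1 Score: {round(f1, 2)}")
    plt.show()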


In [None]:
# Evaluate Model
y_pred_train = classifier.predict(X_train)
display_confusion_matrix(y_train, y_pred_train, "Training")

y_pred_val = classifier.predict(X_val)
display_confusion_matrix(y_val, y_pred_val, "Validation")


## Results and Discussion
- The model reaches reasonable accuracy, but the confusion matrices show that some tweets are still misclassified.
- The F1 score is a more informative measure than accuracy alone because it accounts for both precision and recall.
- Performance could likely be improved by further fine-tuning (for example, adjusting the learning rate or training for more epochs) or by additional feature engineering.
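
To make the link between precision, recall, and F1 concrete, the sketch below computes all three from confusion-matrix counts and checks that the result matches the closed form used in `display_confusion_matrix`. The counts are hypothetical and chosen for illustration only; they are not results from this model. The helper name `f1_from_counts` is also an assumption.

```python
def f1_from_counts(tp, fp, fn):
    """Return precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted disasters that are real
    recall = tp / (tp + fn)     # share of real disasters that were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1


# Hypothetical counts, for illustration only.
tp, fp, fn = 80, 20, 40
precision, recall, f1 = f1_from_counts(tp, fp, fn)

# Same value via the closed form TP / (TP + (FN + FP) / 2).
f1_closed_form = tp / (tp + (fn + fp) / 2)

print(round(precision, 3), round(recall, 3), round(f1, 3), round(f1_closed_form, 3))
```

Note that true negatives do not appear anywhere in the formula, which is why F1 can differ noticeably from accuracy when the classes are imbalanced.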

In [None]:
# Prepare Submission
sample_submission = pd.read_csv("/kaggle/input/nlp-getting-started/sample_submission.csv")
sample_submission["target"] = np.argmax(classifier.predict(X_test), axis=1)
sample_submission.to_csv("submission.csv", index=False)

print("Submission file created successfully!")


## Conclusion
- This project demonstrated how NLP techniques can classify tweets as disaster-related or not.
- We fine-tuned a pretrained DistilBERT model to differentiate between disaster and non-disaster tweets.
- Future improvements could involve hyperparameter tuning, ensemble models, or additional data sources to improve robustness.
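
As one possible reading of the ensemble idea, the sketch below applies soft voting: it converts each model's logits to probabilities, averages them, and takes the most likely class. The logits are made up, and the two "models" are hypothetical; this is not something the notebook trains.

```python
import numpy as np


def softmax(logits):
    """Convert raw logits to probabilities along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


# Hypothetical logits from two models for three tweets (columns: class 0, class 1).
logits_a = np.array([[2.0, -1.0], [0.2, 0.4], [-1.5, 1.0]])
logits_b = np.array([[1.0, 0.0], [0.9, -0.3], [-0.5, 2.0]])

# Soft voting: average the per-class probabilities, then take the argmax.
avg_probs = (softmax(logits_a) + softmax(logits_b)) / 2
ensemble_pred = np.argmax(avg_probs, axis=1)
print(ensemble_pred)
```

In this toy example the first model alone would label the second tweet as class 1, but the more confident second model outweighs it in the average. With real models, the arrays would come from each model's `predict` output on the same inputs.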