# NLP with Disaster Tweets: Classifying Tweets as Disaster vs Not Disaster

**Course mini-project**  
**Competition**: Kaggle "Natural Language Processing with Disaster Tweets"  
**Goal**: Predict whether a tweet refers to a real disaster (binary classification).

## Brief description of the problem and data
We are given short social media texts (tweets) with metadata (`keyword`, `location`) and a binary label `target` for the training set. The task is supervised text classification.

- Train rows: printed by the data-loading cell below (`train_df.shape`)
- Test rows: printed by the data-loading cell below (`test_df.shape`)
- Columns: `id`, `keyword` (str, may be missing), `location` (str, may be missing), `text` (str), and `target` (0 or 1, train only)

**NLP context**: This is a standard short-text classification problem. We will compare a classic bag-of-words baseline (TF-IDF + Logistic Regression) and a simple sequential neural network with a learnable embedding and a Bidirectional LSTM. We will perform basic EDA, describe cleaning steps, train models, run a small hyperparameter search, analyze results, and generate a Kaggle submission.

**GitHub repository URL**: Add your repo link here once created.



# 1) Setup, imports, and configuration

In [None]:
# 1) Setup
import os
import re
import random
import html
import json
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)
random.seed(RANDOM_STATE)
tf.random.set_seed(RANDOM_STATE)

# File paths (Kaggle default)
TRAIN_PATH = "train.csv"
TEST_PATH  = "test.csv"

assert os.path.exists(TRAIN_PATH) and os.path.exists(TEST_PATH), "Place train.csv and test.csv in the working directory"


# 2) Load data and basic structure summary

In [None]:
# 2) Load data and basic shape/summary
train_df = pd.read_csv(TRAIN_PATH)
test_df  = pd.read_csv(TEST_PATH)

display(train_df.head())
display(test_df.head())

print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)
print("\nTrain columns:", train_df.columns.tolist())
print("Test columns:", test_df.columns.tolist())

print("\nNull counts (train):")
print(train_df.isna().sum())

print("\nClass balance (train target):")
print(train_df["target"].value_counts(normalize=True).rename("proportion").round(3))

# Brief text length stats
train_text_len = train_df["text"].astype(str).str.split().map(len)
print("\nToken length stats (train text):")
print(train_text_len.describe())


# 3) EDA visuals and quick inspection

In [None]:
# 3) EDA: basic visualizations with matplotlib only

fig, ax = plt.subplots()
train_df["target"].value_counts().sort_index().plot(kind="bar", ax=ax)
ax.set_title("Target distribution")
ax.set_xlabel("target")
ax.set_ylabel("count")
plt.show()

fig, ax = plt.subplots()
train_text_len.plot(kind="hist", bins=30, ax=ax)
ax.set_title("Distribution of tweet token lengths")
ax.set_xlabel("tokens")
plt.show()

# Top keywords (non-null)
top_kw = train_df["keyword"].dropna().str.lower().value_counts().head(20)
fig, ax = plt.subplots()
top_kw.plot(kind="bar", ax=ax)
ax.set_title("Top 20 keywords")
ax.set_ylabel("count")
plt.xticks(rotation=75)
plt.show()

# Missingness bars
fig, ax = plt.subplots()
train_df[["keyword", "location", "text"]].isna().sum().plot(kind="bar", ax=ax)
ax.set_title("Missing values per column (train)")
ax.set_ylabel("count missing")
plt.show()

print("Plan of analysis:")
print("- Clean text minimally, keep helpful tokens like hashtags and user/url markers.")
print("- Build two models: TF-IDF + Logistic Regression baseline, and a BiLSTM neural model with a learnable embedding.")
print("- Run a small hyperparameter search on the neural model.")
print("- Compare validation F1 and produce a Kaggle submission from the better model.")


# 4) Light text cleaning utilities

In [None]:
# 4) Cleaning helpers

URL_RE    = re.compile(r"https?://\S+|www\.\S+")
USER_RE   = re.compile(r"@\w+")
HASH_RE   = re.compile(r"#(\w+)")
NUM_RE    = re.compile(r"\b\d+\b")
WS_RE     = re.compile(r"\s+")

def clean_text(s: str) -> str:
    if not isinstance(s, str):
        return ""
    s = html.unescape(s)
    s = URL_RE.sub(" URL ", s)
    s = USER_RE.sub(" USER ", s)
    s = HASH_RE.sub(r" \1 HASHTAG ", s)
    s = NUM_RE.sub(" NUM ", s)
    s = s.replace("&", " and ")  # html.unescape above has already turned "&amp;" into "&"
    s = WS_RE.sub(" ", s).strip()
    return s.lower()

def join_fields(df: pd.DataFrame) -> pd.Series:
    # tag each field so models can learn source
    kw  = df["keyword"].fillna("").astype(str).str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
    loc = df["location"].fillna("").astype(str).str.lower().str.replace(r"\s+", " ", regex=True).str.strip()
    txt = df["text"].astype(str).map(clean_text)
    return "kw: " + kw + " loc: " + loc + " txt: " + txt

train_joined = join_fields(train_df)
test_joined  = join_fields(test_df)
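
To make the effect of the cleaning steps concrete, here is a minimal standalone sketch that applies the same substitutions to one made-up tweet. The function name `clean_text_demo` and the sample string are hypothetical, and the sketch assumes the intended behaviour is to replace a bare `&` after HTML unescaping.

```python
import html
import re

# Same patterns as the cleaning cell, repeated so the sketch runs on its own.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
USER_RE = re.compile(r"@\w+")
HASH_RE = re.compile(r"#(\w+)")
NUM_RE = re.compile(r"\b\d+\b")
WS_RE = re.compile(r"\s+")


def clean_text_demo(s: str) -> str:
    """Hypothetical standalone copy of the cleaning steps, for illustration only."""
    s = html.unescape(s)                 # "&amp;" becomes "&"
    s = URL_RE.sub(" URL ", s)           # links become a single marker token
    s = USER_RE.sub(" USER ", s)         # @mentions become a single marker token
    s = HASH_RE.sub(r" \1 HASHTAG ", s)  # keep the hashtag word, add a marker
    s = NUM_RE.sub(" NUM ", s)           # standalone integers become a marker
    s = s.replace("&", " and ")          # assumption: spell out the ampersand
    s = WS_RE.sub(" ", s).strip()
    return s.lower()


# Made-up sample tweet, not taken from the dataset.
sample = "Fire &amp; smoke near #Sacramento! 3 roads closed @CAL_FIRE http://t.co/abc123"
cleaned = clean_text_demo(sample)
print(cleaned)
# fire and smoke near sacramento hashtag ! num roads closed user url
```

Note that the final `lower()` also lowercases the marker tokens, so the models see `url`, `user`, `hashtag`, and `num`.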


# 5) Baseline model: TF-IDF + Logistic Regression

In [None]:
# 5) Baseline: TF-IDF + Logistic Regression

X_train_base, X_val_base, y_train_base, y_val_base = train_test_split(
    train_joined, train_df["target"].astype(int), 
    test_size=0.2, stratify=train_df["target"].astype(int), random_state=RANDOM_STATE
)

tfidf = TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.98, strip_accents="unicode", sublinear_tf=True)
Xtr = tfidf.fit_transform(X_train_base)
Xva = tfidf.transform(X_val_base)

lr = LogisticRegression(solver="liblinear", C=2.0, class_weight="balanced", max_iter=200, random_state=RANDOM_STATE)
lr.fit(Xtr, y_train_base)
val_pred_lr = lr.predict(Xva)
val_f1_lr = f1_score(y_val_base, val_pred_lr)
print(f"Baseline TF-IDF + LogisticRegression validation F1: {val_f1_lr:.4f}")
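
A single 80/20 split can give a noisy F1 estimate on a dataset of this size. One possible extension, not part of the original notebook, is to wrap the vectorizer and classifier in a scikit-learn pipeline and score it with stratified k-fold cross-validation. The sketch below uses a tiny made-up corpus (`toy_texts`, `toy_labels`) so it runs on its own; on the real data you would pass `train_joined` and the `target` column instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up examples for illustration only (1 = disaster, 0 = not disaster).
toy_texts = [
    "wildfire spreads near the highway",
    "earthquake damages several buildings",
    "flood waters rising in the city",
    "tornado warning issued for the county",
    "this concert was fire last night",
    "my exam results were a disaster lol",
    "flooded with emails this morning",
    "that goal was an absolute earthquake of joy",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Putting TF-IDF inside the pipeline means it is re-fit on each training fold,
# so no vocabulary or IDF statistics leak from the held-out fold.
pipe = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(solver="liblinear", C=2.0, class_weight="balanced"),
)

cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(pipe, toy_texts, toy_labels, scoring="f1", cv=cv)
print("Per-fold F1:", scores.round(3), "mean:", round(scores.mean(), 3))
```

The toy scores themselves are meaningless; the point is the structure, which yields a mean and spread rather than one number.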


# 6) Tokenization for the neural model

In [None]:
# 6) Tokenize text for neural model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_VOCAB = 20000
OOV_TOKEN = "<oov>"
MAX_LEN   = 40   # tweets are short; adjust if needed

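# Note: the vocabulary is fit on every training row, including rows that later land in the
# validation split; fit on the training split only if a stricter validation estimate is needed.
tokenizer = Tokenizer(num_words=MAX_VOCAB, oov_token=OOV_TOKEN)
tokenizer.fit_on_texts(list(train_joined))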

X_seq = tokenizer.texts_to_sequences(list(train_joined))
X_pad = pad_sequences(X_seq, maxlen=MAX_LEN, padding="post", truncating="post")
y     = train_df["target"].astype(int).values

X_train, X_val, y_train, y_val = train_test_split(
    X_pad, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

print("Tokenized shapes:", X_train.shape, X_val.shape)
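
For readers unfamiliar with what the tokenizer and padding step produce, the following pure-Python sketch mimics the idea: the most frequent words get small integer ids, unknown words map to a shared out-of-vocabulary id, and every sequence is truncated or zero-padded at the end to a fixed length. It is a simplification under stated assumptions: the helper names `fit_vocab` and `encode` are hypothetical, and it splits on whitespace only, whereas the Keras `Tokenizer` also strips punctuation by default.

```python
from collections import Counter


def fit_vocab(texts, max_vocab, oov_token="<oov>"):
    """Map the most frequent tokens to ids; 0 is reserved for padding, 1 for OOV."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    vocab = {oov_token: 1}
    # Keep ids strictly below max_vocab, mirroring the num_words cap.
    for tok, _ in counts.most_common(max_vocab - 2):
        vocab[tok] = len(vocab) + 1
    return vocab


def encode(text, vocab, max_len):
    """Convert text to ids, truncate at the end, then pad at the end with zeros."""
    ids = [vocab.get(tok, 1) for tok in text.lower().split()][:max_len]
    return ids + [0] * (max_len - len(ids))


# Made-up corpus with distinct word counts: fire=3, smoke=2, flood=1.
demo_vocab = fit_vocab(["fire fire fire smoke smoke flood"], max_vocab=4)
demo_ids = encode("fire flood smoke", demo_vocab, max_len=5)
print(demo_vocab)  # {'<oov>': 1, 'fire': 2, 'smoke': 3}
print(demo_ids)    # [2, 1, 3, 0, 0]
```

Here `flood` falls outside the vocabulary cap and is encoded as the OOV id 1, which is how rare words in the tweets are handled.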


# 7) Define a simple sequential neural architecture (Embedding + BiLSTM)

In [None]:
# 7) Model builder: Embedding + Bidirectional LSTM
def build_bilstm_model(
    vocab_size: int,
    max_len: int,
    embedding_dim: int = 128,
    lstm_units: int = 64,
    dropout_rate: float = 0.2,
    lr: float = 1e-3
) -> keras.Model:
    inputs = keras.Input(shape=(max_len,), dtype="int32")
    # `input_length` is deprecated in recent Keras releases; the length is inferred from `inputs`
    x = layers.Embedding(vocab_size, embedding_dim)(inputs)
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=False))(x)
    x = layers.Dropout(dropout_rate)(x)
    x = layers.Dense(64, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)
    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=lr),
        loss="binary_crossentropy",
        metrics=[keras.metrics.AUC(name="auc")]
    )
    return model

VOCAB_SIZE = min(MAX_VOCAB, len(tokenizer.word_index) + 1)
model = build_bilstm_model(VOCAB_SIZE, MAX_LEN)
model.summary()
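
As a sanity check on the `model.summary()` output, the trainable parameter count of this architecture can be worked out by hand. The sketch below assumes the standard LSTM formulation with four gates and a bias per gate, and the default sizes from `build_bilstm_model` with a 20,000-word vocabulary; the function name `bilstm_param_count` is hypothetical.

```python
def bilstm_param_count(vocab_size, embedding_dim=128, lstm_units=64, dense_units=64):
    """Parameter count for Embedding -> BiLSTM -> Dense(relu) -> Dense(1)."""
    embedding = vocab_size * embedding_dim
    # Each of the 4 LSTM gates has input weights, recurrent weights, and a bias.
    one_direction = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
    bilstm = 2 * one_direction  # forward and backward LSTMs have separate weights
    # The bidirectional layer concatenates both directions: 2 * lstm_units inputs.
    dense = (2 * lstm_units) * dense_units + dense_units
    output = dense_units + 1
    return {
        "embedding": embedding,
        "bilstm": bilstm,
        "dense": dense,
        "output": output,
        "total": embedding + bilstm + dense + output,
    }


counts = bilstm_param_count(vocab_size=20000)
print(counts)
# {'embedding': 2560000, 'bilstm': 98816, 'dense': 8256, 'output': 65, 'total': 2667137}
```

The embedding table dominates the total, which is one reason a smaller `embedding_dim` or vocabulary is a natural lever against overfitting on a few thousand tweets.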


# 8) Train with early stopping, then evaluate with validation F1

In [None]:
# 8) Train and evaluate
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_auc", patience=3, mode="max", restore_best_weights=True)
]

history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=10,
    batch_size=64,
    callbacks=callbacks,
    verbose=1
)

# Evaluate with F1 at threshold 0.5
val_pred_prob = model.predict(X_val).ravel()
val_pred      = (val_pred_prob >= 0.5).astype(int)
val_f1_nn     = f1_score(y_val, val_pred)

print(f"Neural model validation F1: {val_f1_nn:.4f}")
print("\nClassification report:")
print(classification_report(y_val, val_pred, digits=4))

# Confusion matrix plot
cm = confusion_matrix(y_val, val_pred)
fig, ax = plt.subplots()
im = ax.imshow(cm, interpolation="nearest")
ax.set_title("Confusion matrix (validation)")
plt.colorbar(im)
ax.set_xticks([0,1]); ax.set_yticks([0,1])
ax.set_xticklabels(["pred 0","pred 1"]); ax.set_yticklabels(["true 0","true 1"])
for i in range(cm.shape[0]):
    for j in range(cm.shape[1]):
        ax.text(j, i, cm[i, j], ha="center", va="center")
ax.set_xlabel("Predicted"); ax.set_ylabel("True")
plt.show()
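
The 0.5 cutoff is a convention rather than a tuned value, and F1 can be sensitive to it. A possible extension, not part of the original notebook, is to sweep a grid of thresholds over the validation probabilities and keep the one with the highest F1. The helper `best_f1_threshold` and the demo arrays below are hypothetical; on the real data you would pass `y_val` and `val_pred_prob`.

```python
import numpy as np
from sklearn.metrics import f1_score


def best_f1_threshold(y_true, probs, grid=None):
    """Return (threshold, f1) for the grid threshold with the highest F1."""
    if grid is None:
        grid = np.linspace(0.1, 0.9, 17)  # 0.10, 0.15, ..., 0.90
    scores = [
        f1_score(y_true, (probs >= t).astype(int), zero_division=0) for t in grid
    ]
    best = int(np.argmax(scores))  # first maximum wins on ties
    return float(grid[best]), float(scores[best])


# Made-up labels and probabilities for illustration only.
y_demo = np.array([0, 0, 1, 1, 1, 0])
p_demo = np.array([0.22, 0.47, 0.42, 0.83, 0.62, 0.05])

thr, f1_at_thr = best_f1_threshold(y_demo, p_demo)
f1_at_half = f1_score(y_demo, (p_demo >= 0.5).astype(int))
print(f"best threshold={thr:.2f}, F1={f1_at_thr:.3f} (F1 at 0.5: {f1_at_half:.3f})")
```

A threshold chosen this way is fit to the validation set, so the resulting F1 is optimistic; treat it as a tuning aid and expect a somewhat lower score on the leaderboard.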


# 9) Small hyperparameter search over the neural model

In [None]:
# 9) Manual hyperparameter search (small and fast)
search_space = [
    {"embedding_dim": 64,  "lstm_units": 64,  "dropout_rate": 0.2, "lr": 1e-3, "batch_size": 64},
    {"embedding_dim": 128, "lstm_units": 64,  "dropout_rate": 0.3, "lr": 1e-3, "batch_size": 64},
    {"embedding_dim": 128, "lstm_units": 96,  "dropout_rate": 0.3, "lr": 1e-3, "batch_size": 64},
    {"embedding_dim": 128, "lstm_units": 64,  "dropout_rate": 0.5, "lr": 8e-4, "batch_size": 64},
]

results = []
for i, hp in enumerate(search_space, 1):
    print(f"\nTrial {i}/{len(search_space)}: {hp}")
    tf.keras.backend.clear_session()
    m = build_bilstm_model(
        VOCAB_SIZE, MAX_LEN,
        embedding_dim=hp["embedding_dim"],
        lstm_units=hp["lstm_units"],
        dropout_rate=hp["dropout_rate"],
        lr=hp["lr"]
    )
    cb = [keras.callbacks.EarlyStopping(monitor="val_auc", patience=2, mode="max", restore_best_weights=True)]
    hist = m.fit(
        X_train, y_train,
        validation_data=(X_val, y_val),
        epochs=8,
        batch_size=hp["batch_size"],
        callbacks=cb,
        verbose=0
    )
    # F1 on validation
    probs = m.predict(X_val, verbose=0).ravel()
    preds = (probs >= 0.5).astype(int)
    f1 = f1_score(y_val, preds)
    best_auc = max(hist.history["val_auc"])
    results.append({"params": hp, "val_f1": f1, "best_val_auc": best_auc})

results_df = pd.DataFrame(results).sort_values(by="val_f1", ascending=False).reset_index(drop=True)
display(results_df)

# Plot F1 scores
fig, ax = plt.subplots()
ax.plot(range(1, len(results_df)+1), results_df["val_f1"], marker="o")
ax.set_xticks(range(1, len(results_df)+1))
ax.set_xlabel("trial rank (best to worst)")
ax.set_ylabel("validation F1")
ax.set_title("Hyperparameter search results")
plt.show()

best_hp = results_df.loc[0, "params"]
print("Best hyperparameters:", best_hp)
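
The search space above is a hand-written list of four configurations. If a full grid is wanted instead, one way to generate it is with `itertools.product`, as sketched below. The helper name `expand_grid` and the option values are assumptions for illustration; the resulting list has the same shape as `search_space` and could be passed to the same loop.

```python
from itertools import product


def expand_grid(options):
    """Turn {name: [values]} into a list of dicts, one per combination."""
    keys = list(options)
    return [
        dict(zip(keys, combo))
        for combo in product(*(options[k] for k in keys))
    ]


# Hypothetical option lists; 2 * 2 * 2 * 1 * 1 = 8 configurations.
demo_space = expand_grid({
    "embedding_dim": [64, 128],
    "lstm_units": [64, 96],
    "dropout_rate": [0.2, 0.3],
    "lr": [1e-3],
    "batch_size": [64],
})

print(len(demo_space))  # 8
print(demo_space[0])
```

Grid size grows multiplicatively with each added option, so for a model that takes minutes per trial a short hand-picked list like the one in this notebook is often the more practical choice.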


# 10) Retrain best neural model on all training data and generate submission

In [None]:
# 10) Retrain best model on full data and create submission

# Reuse the padded sequences for the whole training set; fit() below still holds out
# the last 5% of rows (validation_split=0.05) for early stopping
X_all = X_pad
y_all = y

tf.keras.backend.clear_session()
best_model = build_bilstm_model(
    VOCAB_SIZE, MAX_LEN,
    embedding_dim=best_hp["embedding_dim"],
    lstm_units=best_hp["lstm_units"],
    dropout_rate=best_hp["dropout_rate"],
    lr=best_hp["lr"]
)

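# Monitor the held-out 5% (val_auc) rather than training AUC, which tends to keep rising as the model overfits
callbacks = [
    keras.callbacks.EarlyStopping(monitor="val_auc", patience=1, mode="max", restore_best_weights=True)
]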

best_model.fit(
    X_all, y_all,
    epochs=6,
    batch_size=best_hp["batch_size"],
    validation_split=0.05,
    callbacks=callbacks,
    verbose=1
)

# Prepare test features
test_seq = tokenizer.texts_to_sequences(list(test_joined))
test_pad = pad_sequences(test_seq, maxlen=MAX_LEN, padding="post", truncating="post")

# Predict and build submission
test_prob = best_model.predict(test_pad).ravel()
test_pred = (test_prob >= 0.5).astype(int)

submission = pd.DataFrame({"id": test_df["id"], "target": test_pred})
submission_path = "submission.csv"
submission.to_csv(submission_path, index=False)
print("Saved submission to:", submission_path)
display(submission.head())
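
Before uploading, it can be worth checking that the submission frame has the shape this notebook intends: one row per test id, columns `id` and `target`, and only 0/1 predictions. The sketch below is a hypothetical helper (`check_submission`) run on made-up frames; on the real data you would call it with `submission` and `test_df["id"]`.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, test_ids: pd.Series) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    if list(submission.columns) != ["id", "target"]:
        return ["columns must be exactly ['id', 'target']"]
    problems = []
    if len(submission) != len(test_ids):
        problems.append("row count differs from test set")
    if set(submission["id"]) != set(test_ids):
        problems.append("ids do not match test ids")
    if submission["id"].duplicated().any():
        problems.append("duplicate ids")
    if not submission["target"].isin([0, 1]).all():
        problems.append("target must be 0 or 1")
    return problems


# Made-up frames for illustration only.
demo_ids = pd.Series([0, 2, 3])
good = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 1]})
bad = pd.DataFrame({"id": [0, 2, 3], "target": [1, 0, 2]})

print(check_submission(good, demo_ids))  # []
print(check_submission(bad, demo_ids))   # ['target must be 0 or 1']
```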


## Results and Analysis

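We report validation results using an 80/20 stratified split. Both models are split with the same `random_state` and stratification labels, so they are evaluated on the same validation rows.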

- TF-IDF + Logistic Regression: see `val_f1_lr` value from the baseline cell
- Neural model (Embedding + BiLSTM), best trial: see `results_df` table for `val_f1` and `best_val_auc`
- We used early stopping on validation AUC to control overfitting.
- We tested four hand-picked configurations that vary embedding dimension, LSTM units, dropout, and learning rate.

Observations to include in your own words:
- Which model performed better and why that might be the case for this dataset of short texts
- What cleaning choices affected performance
- What hyperparameters helped and any signs of overfitting or underfitting
- Limitations and simple next steps
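
To ground these observations, it helps to read the tweets the model gets most confidently wrong. The sketch below is one way to list them; the helper name `misclassified` and the demo inputs are hypothetical. On the real data it assumes you have the validation texts aligned row-for-row with `y_val` and `val_pred_prob`, for example by splitting `train_joined` with the same `train_test_split` arguments.

```python
import numpy as np
import pandas as pd


def misclassified(texts, y_true, probs, threshold=0.5, top_n=5):
    """Return the wrong predictions, most confident mistakes first."""
    df = pd.DataFrame({
        "text": list(texts),
        "true": np.asarray(y_true),
        "prob": np.asarray(probs),
    })
    df["pred"] = (df["prob"] >= threshold).astype(int)
    errors = df[df["pred"] != df["true"]].copy()
    # Distance from the threshold measures how sure the model was.
    errors["confidence"] = (errors["prob"] - threshold).abs()
    return errors.sort_values("confidence", ascending=False).head(top_n)


# Made-up examples for illustration only.
demo_texts = [
    "forest fire near the town",
    "this mixtape is fire",
    "bridge collapsed after the storm",
    "lovely sunny day",
]
demo_true = [1, 0, 1, 0]
demo_prob = [0.9, 0.8, 0.1, 0.3]

errs = misclassified(demo_texts, demo_true, demo_prob)
print(errs[["text", "true", "pred", "prob"]])
```

In the demo, the false negative (`prob` 0.1 for a true disaster) is listed before the false positive because it lies further from the threshold.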
