# Disaster Tweets NLP Classification — Mini Project
Welcome to this notebook for the Kaggle NLP Disaster Tweets mini-project. We will explore, build, and evaluate a model that classifies tweets as disaster-related or not.

## 1. Problem Description
Twitter is widely used to announce emergencies and disasters in real time. However, not all tweets containing disaster-related words actually describe a real disaster. The goal of this project is to build a machine learning model that can classify whether a tweet is about a real disaster (target = 1) or not (target = 0).

The competition data contains roughly 10,000 tweets across the training and test files; only the training file carries the `target` label indicating whether a tweet describes a real disaster. This is a binary text classification problem in Natural Language Processing (NLP).

## 2. Exploratory Data Analysis (EDA)
### 2.1 Load Data

In [None]:
import pandas as pd

train = pd.read_csv('/kaggle/input/nlp-getting-started/train.csv')
test = pd.read_csv('/kaggle/input/nlp-getting-started/test.csv')

print("Train dataset shape:", train.shape)
print("Test dataset shape:", test.shape)

train.head()

### 2.2 Dataset Overview
- `id`: Unique identifier for each tweet
- `keyword`: Important keyword from the tweet (can be NaN)
- `location`: Location of the tweet (can be NaN)
- `text`: The tweet text (our input feature)
- `target`: 1 if tweet is about a real disaster, else 0 (our label)

Check label distribution:

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='target', data=train)
plt.title('Distribution of Disaster (1) vs Non-Disaster (0) Tweets')
plt.show()

print(train['target'].value_counts())
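
The raw counts are easier to read as proportions, because the class balance determines how much an accuracy figure really says. Below is a minimal sketch using only the standard library. The `labels` list is a made-up stand-in for `train['target']`, not the real data, and `class_proportions` is a hypothetical helper name.

```python
from collections import Counter

def class_proportions(labels):
    """Return {label: fraction of samples} for a list of class labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels, for illustration only.
labels = [0, 0, 0, 1, 1]
proportions = class_proportions(labels)
print(proportions)
```

On the notebook's data, `train['target'].value_counts(normalize=True)` should give the same proportions directly.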

### 2.3 Data Cleaning
Tweets often contain URLs, mentions, hashtags, emojis, and punctuation that can add noise. We clean the text by:

- Lowercasing
- Removing URLs, mentions (@user), and hashtags (#tag, including the tag word itself)
- Removing punctuation and digits
- Removing extra whitespace

In [None]:
import re

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)     # Remove URLs
    text = re.sub(r"@\w+", "", text)        # Remove mentions
    text = re.sub(r"#\w+", "", text)        # Remove hashtags
    text = re.sub(r"[^a-z\s]", "", text)    # Remove punctuation and digits
    text = re.sub(r"\s+", " ", text).strip()  # Remove extra whitespace
    return text

train['clean_text'] = train['text'].apply(clean_text)
test['clean_text'] = test['text'].apply(clean_text)

train[['text', 'clean_text']].head()
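
To make the effect of each cleaning step concrete, the sketch below runs the same `clean_text` function on one made-up tweet. It also shows a hypothetical variant, `clean_text_keep_hashtags`, that strips only the `#` symbol and keeps the tag word. That variant is not part of this notebook's pipeline. It is included only to illustrate the trade-off, since hashtag words such as `#wildfire` may carry signal.

```python
import re

def clean_text(text):
    """Same cleaning steps as the notebook's function."""
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)      # Remove URLs
    text = re.sub(r"@\w+", "", text)         # Remove mentions
    text = re.sub(r"#\w+", "", text)         # Remove hashtags, tag word included
    text = re.sub(r"[^a-z\s]", "", text)     # Remove punctuation and digits
    return re.sub(r"\s+", " ", text).strip() # Remove extra whitespace

def clean_text_keep_hashtags(text):
    """Hypothetical variant: drop the '#' but keep the tag word."""
    text = str(text).lower()
    text = re.sub(r"http\S+", "", text)
    text = re.sub(r"@\w+", "", text)
    text = re.sub(r"#(\w+)", r"\1", text)    # "#wildfire" -> "wildfire"
    text = re.sub(r"[^a-z\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()

# Made-up example tweet, for illustration only.
sample = "Forest fire near La Ronge Sask. Canada #wildfire http://t.co/abc @user 2015!"
print(clean_text(sample))                # forest fire near la ronge sask canada
print(clean_text_keep_hashtags(sample))  # forest fire near la ronge sask canada wildfire
```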

### 2.4 Word Cloud Visualization
Visualize common words in disaster vs non-disaster tweets.

In [None]:
from wordcloud import WordCloud

# Disaster tweets word cloud
disaster_text = " ".join(train[train['target'] == 1]['clean_text'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(disaster_text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Common Words in Disaster Tweets')
plt.show()

# Non-disaster tweets word cloud
non_disaster_text = " ".join(train[train['target'] == 0]['clean_text'])
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(non_disaster_text)

plt.figure(figsize=(10,5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Common Words in Non-Disaster Tweets')
plt.show()
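
Word clouds are qualitative, so a plain frequency count can be a useful numeric complement. The sketch below is a minimal standard-library version. The `STOPWORDS` set is a deliberately tiny, hypothetical list, and the `texts` are made-up stand-ins for the `clean_text` column.

```python
from collections import Counter

# Hypothetical, deliberately small stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "in", "of", "is", "to", "and"}

def top_words(texts, k=3):
    """Return the k most frequent non-stopword tokens as (word, count) pairs."""
    counts = Counter(
        word
        for text in texts
        for word in text.split()
        if word not in STOPWORDS
    )
    return counts.most_common(k)

# Made-up cleaned tweets, for illustration only.
texts = [
    "fire in the forest",
    "forest fire spreads",
    "fire crews respond to the fire",
]
print(top_words(texts, k=2))  # [('fire', 4), ('forest', 2)]
```

Applied to the notebook's data, the same function could be called once per class, for example on `train[train['target'] == 1]['clean_text']`.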

## 3. Model Architecture and Approach
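We use a neural network that feeds word embeddings into a bidirectional LSTM to classify tweets.

- A trainable embedding layer maps each token ID to a dense vector, so words used in similar contexts can end up with similar representations.
- The bidirectional LSTM reads each tweet both left-to-right and right-to-left, so every position has context from both sides.
- A final dense layer with a sigmoid activation outputs the probability that the tweet describes a real disaster.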

This architecture suits the sequential nature of text and helps learn context in tweets.

### 3.1 Text Tokenization and Padding

In [None]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_vocab_size = 10000
max_seq_length = 100

tokenizer = Tokenizer(num_words=max_vocab_size, oov_token="<OOV>")
tokenizer.fit_on_texts(train['clean_text'])

X_train_seq = tokenizer.texts_to_sequences(train['clean_text'])
X_train_pad = pad_sequences(X_train_seq, maxlen=max_seq_length, padding='post', truncating='post')

y_train = train['target'].values
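
As a rough illustration of what the tokenizer and padding steps produce, here is a simplified pure-Python sketch. It is not the Keras implementation: it assumes whitespace tokenization, reserves index 0 for padding and index 1 for the out-of-vocabulary token, and ranks the remaining words by frequency. The helper names `build_vocab` and `encode_and_pad` are hypothetical.

```python
from collections import Counter

def build_vocab(texts, max_vocab_size, oov_token="<OOV>"):
    """Map the most frequent words to integer IDs (0 = padding, 1 = OOV)."""
    counts = Counter(word for text in texts for word in text.split())
    vocab = {oov_token: 1}
    # Two indices are already taken (0 and 1), so keep max_vocab_size - 2 words.
    for word, _ in counts.most_common(max_vocab_size - 2):
        vocab[word] = len(vocab) + 1
    return vocab

def encode_and_pad(text, vocab, max_len, oov_token="<OOV>"):
    """Convert text to IDs, truncate at the end, then pad with zeros at the end."""
    ids = [vocab.get(word, vocab[oov_token]) for word in text.split()]
    ids = ids[:max_len]                      # truncating='post'
    return ids + [0] * (max_len - len(ids))  # padding='post'

# Made-up corpus, for illustration only.
corpus = ["fire fire fire flood", "flood storm"]
vocab = build_vocab(corpus, max_vocab_size=4)
print(vocab)                                              # {'<OOV>': 1, 'fire': 2, 'flood': 3}
print(encode_and_pad("fire storm flood", vocab, max_len=5))  # [2, 1, 3, 0, 0]
```

Here `storm` falls outside the four-entry vocabulary, so it is encoded as the OOV index 1.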

### 3.2 Train-Validation Split

In [None]:
from sklearn.model_selection import train_test_split

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train_pad, y_train, test_size=0.2, random_state=42,
    stratify=y_train  # keep the class ratio the same in both splits
)

print("Training samples:", X_tr.shape[0])
print("Validation samples:", X_val.shape[0])

### 3.3 Build the Model

In [None]:
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, Dropout

embedding_dim = 64

model = Sequential([
    # Explicit input shape; the Embedding `input_length` argument is deprecated in Keras 3
    tf.keras.Input(shape=(max_seq_length,)),
    Embedding(input_dim=max_vocab_size, output_dim=embedding_dim),
    Bidirectional(LSTM(64, dropout=0.2, recurrent_dropout=0.2)),
    Dense(32, activation='relu'),
    Dropout(0.5),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()
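
The parameter counts that `model.summary()` reports can be cross-checked by hand. The sketch below assumes the standard LSTM formulation (four gates, each with input weights, recurrent weights, and a bias) and default bias terms in the dense layers. The helper names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and a bias."""
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

max_vocab_size, embedding_dim, lstm_units = 10000, 64, 64

emb = embedding_params(max_vocab_size, embedding_dim)  # 640,000
bilstm = 2 * lstm_params(embedding_dim, lstm_units)    # 2 * 33,024 = 66,048
hidden = dense_params(2 * lstm_units, 32)              # 128 * 32 + 32 = 4,128
output = dense_params(32, 1)                           # 32 + 1 = 33

total = emb + bilstm + hidden + output
print(total)  # 710209
```

The embedding table accounts for about 90% of the parameters, so `max_vocab_size` and `embedding_dim` are the main levers on model size.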

### 3.4 Train the Model

In [None]:
epochs = 5
batch_size = 64

history = model.fit(
    X_tr, y_tr,
    epochs=epochs,
    batch_size=batch_size,
    validation_data=(X_val, y_val)
)

## 4. Results and Analysis
### 4.1 Training History Plots

In [None]:
import matplotlib.pyplot as plt

plt.plot(history.history['loss'], label='Train Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.legend()
plt.title('Loss over epochs')
plt.show()

plt.plot(history.history['accuracy'], label='Train Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.legend()
plt.title('Accuracy over epochs')
plt.show()
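
A common way to read these curves is to find the epoch where validation loss is lowest. Training loss that keeps falling after that point usually indicates overfitting. The sketch below uses hypothetical numbers, not results from this notebook. In the notebook the real values would come from `history.history['val_loss']`.

```python
def best_epoch(val_losses):
    """Return (1-based epoch number, loss) for the lowest validation loss."""
    idx = min(range(len(val_losses)), key=val_losses.__getitem__)
    return idx + 1, val_losses[idx]

# Hypothetical values for illustration; not results from this notebook.
history_dict = {
    "loss":     [0.62, 0.45, 0.33, 0.25, 0.19],
    "val_loss": [0.55, 0.47, 0.46, 0.50, 0.56],
}

epoch, loss = best_epoch(history_dict["val_loss"])
print(f"Lowest validation loss {loss:.2f} at epoch {epoch}")
```

In this made-up run the validation loss bottoms out at epoch 3 while the training loss keeps decreasing, which would suggest stopping training around that epoch.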

### 4.2 Validation Metrics

In [None]:
from sklearn.metrics import classification_report, f1_score

y_val_pred_prob = model.predict(X_val)
y_val_pred = (y_val_pred_prob > 0.5).astype(int).flatten()

print(classification_report(y_val, y_val_pred))
print("Validation F1 score:", f1_score(y_val, y_val_pred))
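
To show what `f1_score` computes, and how the 0.5 threshold affects it, the sketch below implements F1 from the confusion counts in plain Python. The labels and probabilities are made up for illustration, and `f1_at_threshold` is a hypothetical helper name.

```python
def f1_at_threshold(y_true, probs, threshold=0.5):
    """Compute F1 for the positive class after thresholding probabilities."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
probs = [0.9, 0.4, 0.35, 0.8, 0.6, 0.1]

for threshold in (0.3, 0.5):
    print(threshold, round(f1_at_threshold(y_true, probs, threshold), 3))
# 0.3 0.75
# 0.5 0.667
```

In this toy example, lowering the threshold trades precision for recall and happens to raise F1. On real data the best threshold should be chosen on the validation set, not the test set.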

## 5. Conclusion
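- The bidirectional LSTM model gives a workable baseline for disaster tweet classification; the validation F1 score and classification report in Section 4.2 quantify its performance.
- Cleaning the tweets removes URLs, mentions, hashtags, and punctuation, which is intended to reduce noise in the input.
- Word embeddings give the model a dense representation of each token that is learned from the training tweets.
- Dropout in the LSTM and before the output layer provides regularization against overfitting.
- These effects were not isolated with ablation experiments, so their individual contributions remain untested.
- Future improvements: pretrained embeddings, transformer models, hyperparameter tuning, and including metadata features such as `keyword` and `location`.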

## 6. References
- [Kaggle NLP Disaster Tweets competition](https://www.kaggle.com/c/nlp-getting-started)
- Chollet, François. *Deep Learning with Python*. Manning, 2018.
- TensorFlow Keras Documentation: https://www.tensorflow.org/api_docs/python/tf/keras
- WordCloud Python package documentation.

## 7. Submission File Creation

In [None]:
X_test_seq = tokenizer.texts_to_sequences(test['clean_text'])
X_test_pad = pad_sequences(X_test_seq, maxlen=max_seq_length, padding='post', truncating='post')

test_pred_prob = model.predict(X_test_pad)
test_pred = (test_pred_prob > 0.5).astype(int).flatten()

submission = pd.DataFrame({'id': test['id'], 'target': test_pred})
submission.to_csv('submission.csv', index=False)

print("Submission file created: submission.csv")
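
Before uploading, it can help to check that the file has the expected shape. The sketch below is a hypothetical validator built on the standard `csv` module. It assumes the submission needs exactly the columns `id` and `target`, with `target` values of 0 or 1, which matches the DataFrame built above. It runs here on an in-memory sample, not the real file.

```python
import csv
import io

def validate_submission(csv_file, expected_rows=None):
    """Check header, target values, and (optionally) row count; return the row count."""
    reader = csv.DictReader(csv_file)
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"Unexpected columns: {reader.fieldnames}")
    rows = list(reader)
    for row in rows:
        if row["target"] not in {"0", "1"}:
            raise ValueError(f"Invalid target for id {row['id']}: {row['target']}")
    if expected_rows is not None and len(rows) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, found {len(rows)}")
    return len(rows)

# In-memory sample with made-up rows, for illustration only.
sample = io.StringIO("id,target\n0,1\n2,0\n3,1\n")
n_rows = validate_submission(sample, expected_rows=3)
print("Rows validated:", n_rows)
```

For the real file, the same function could be called as `validate_submission(open('submission.csv', newline=''), expected_rows=len(test))`.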