# NLP with Disaster Tweets
This notebook explores the Kaggle Getting Started competition dataset, prepares features, trains a baseline model, and creates a submission file.

**Cell 2.** Imports core libraries for data handling, text processing, and scikit-learn modelling utilities used throughout the workflow.

In [7]:
from pathlib import Path
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

**Cell 3.** Reads the Kaggle training, test, and sample submission CSVs from `../data/` (one level above the notebook's directory) and shows the first few labelled tweets.

In [8]:
DATA_DIR = Path('..') / 'data'
train_df = pd.read_csv(DATA_DIR / 'train.csv')
test_df = pd.read_csv(DATA_DIR / 'test.csv')
sample_submission = pd.read_csv(DATA_DIR / 'sample_submission.csv')
train_df.head()

,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


**Cell 4.** Checks the class balance of the disaster label. The split is roughly 57% non-disaster to 43% disaster, so the classes are only mildly imbalanced.

In [9]:
train_df['target'].value_counts(normalize=True)

target
0    0.57034
1    0.42966
Name: proportion, dtype: float64
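
The proportions above also give a trivial reference point for later scores. The sketch below is not part of the original notebook; it works out what two hypothetical constant classifiers would score, one that always predicts class 0 and one that always predicts class 1, using only the two proportions printed above.

```python
# Class proportions copied from the value_counts output above.
p_neg, p_pos = 0.57034, 0.42966

# Always predicting 0 is correct for every non-disaster tweet.
majority_accuracy = p_neg

# Always predicting 1 finds every positive (recall = 1), and its
# precision equals the share of positives in the data.
precision, recall = p_pos, 1.0
always_positive_f1 = 2 * precision * recall / (precision + recall)

print(f"Majority-class accuracy: {majority_accuracy:.4f}")
print(f"Always-positive F1:      {always_positive_f1:.4f}")
```

An F1 of roughly 0.60 is therefore available without looking at the text at all, so a useful model should clear that bar comfortably.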

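**Cell 5.** Computes the fraction of missing values in each column. Only `keyword` (under 1%) and `location` (about 33%) have gaps; `text` and `target` are complete, so the text-only model in this notebook needs no imputation.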

In [10]:
train_df.isna().mean()

id          0.000000
keyword     0.008013
location    0.332720
text        0.000000
target      0.000000
dtype: float64

**Cell 6.** Normalizes tweet text by lowercasing, removing URLs, and replacing every character other than letters, digits, whitespace, `#`, `@`, and apostrophes with a space, then stores the cleaned version for both train and test sets.

In [11]:
URL_PATTERN = re.compile(r'http\S+|www\.\S+')


def normalise_text(text: str) -> str:
    """Basic normalisation tailored for noisy tweets."""
    text = text.lower()
    # Drop links first so URL fragments do not end up in the vocabulary.
    text = URL_PATTERN.sub(' ', text)
    # Keep letters, digits, whitespace, hashtags, mentions, and apostrophes.
    text = re.sub(r"[^a-z0-9\s#@']", ' ', text)
    # Collapse the runs of whitespace left behind by the substitutions.
    text = re.sub(r'\s+', ' ', text).strip()
    return text


train_df['text_clean'] = train_df['text'].apply(normalise_text)
test_df['text_clean'] = test_df['text'].apply(normalise_text)
train_df[['text', 'text_clean']].head()

,text,text_clean
0,Our Deeds are the Reason of this #earthquake M...,our deeds are the reason of this #earthquake m...
1,Forest fire near La Ronge Sask. Canada,forest fire near la ronge sask canada
2,All residents asked to 'shelter in place' are ...,all residents asked to 'shelter in place' are ...
3,"13,000 people receive #wildfires evacuation or...",13 000 people receive #wildfires evacuation or...
4,Just got sent this photo from Ruby #Alaska as ...,just got sent this photo from ruby #alaska as ...
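
To make the effect of each step easier to see, the snippet below repeats the function so that it runs on its own and applies it to two made-up tweets. The example strings are invented for illustration and do not come from the dataset.

```python
import re

URL_PATTERN = re.compile(r'http\S+|www\.\S+')


def normalise_text(text: str) -> str:
    # Same steps as the notebook cell, repeated so this snippet is standalone.
    text = text.lower()
    text = URL_PATTERN.sub(' ', text)
    text = re.sub(r"[^a-z0-9\s#@']", ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text


# Hypothetical inputs, invented for illustration.
examples = [
    "Check this out: http://t.co/abc123 #Wildfire!!",
    "@user It's 13,000 people... visit www.example.com NOW",
]
for raw in examples:
    print(repr(normalise_text(raw)))
# 'check this out #wildfire'
# "@user it's 13 000 people visit now"
```

Note that thousands separators are replaced by spaces, so `13,000` becomes the two tokens `13` and `000`, as row 3 of the output above also shows.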


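**Cell 7.** Adds a character-length column computed on the cleaned text and compares its mean and median across the two classes. The column is exploratory only; the model in the next cell uses the text alone.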

In [12]:
train_df['char_len'] = train_df['text_clean'].str.len()
char_summary = train_df.groupby('target')['char_len'].agg(['mean', 'median'])
char_summary

target,mean,median
0,81.107554,82.0
1,87.465913,89.0


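**Cell 8.** Builds a TF-IDF (unigrams and bigrams) + logistic regression pipeline and evaluates it with 5-fold stratified cross-validation using F1 score. Because the vectorizer sits inside the pipeline, it is refit on the training folds of each split, so no vocabulary statistics leak from the held-out fold.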

In [14]:
X = train_df['text_clean']
y = train_df['target']
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), min_df=3, max_df=0.9)),
    ('clf', LogisticRegression(max_iter=1000, C=4.0, class_weight='balanced', solver='liblinear'))
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
f1_scores = cross_val_score(pipeline, X, y, cv=cv, scoring='f1', n_jobs=-1)
print(f'F1 scores: {f1_scores}')
print(f'Mean F1: {f1_scores.mean():.4f} ± {f1_scores.std():.4f}')

F1 scores: [0.75899844 0.75910147 0.72882673 0.75       0.75586854]
Mean F1: 0.7506 ± 0.0114
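
The ± figure comes from NumPy's `std`, which by default computes the population standard deviation (`ddof=0`). As a rough cross-check that uses only the standard library, the sketch below recomputes the summary from the five fold scores printed above and also shows the sample standard deviation, which is somewhat larger with so few folds.

```python
from statistics import mean, pstdev, stdev

# Fold scores copied from the output above.
f1_scores = [0.75899844, 0.75910147, 0.72882673, 0.75, 0.75586854]

print(f"Mean F1: {mean(f1_scores):.4f}")
print(f"Population std (NumPy's default): {pstdev(f1_scores):.4f}")
print(f"Sample std (n - 1 denominator):   {stdev(f1_scores):.4f}")
```

The population figure reproduces the 0.0114 reported by the notebook; the sample figure comes out near 0.0127.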


**Cell 9.** Fits the pipeline on all training data, generates predictions for the test set, and writes the `submission.csv` file for Kaggle.

In [15]:
pipeline.fit(X, y)
test_preds = pipeline.predict(test_df['text_clean'])
submission = sample_submission.copy()
submission['target'] = test_preds
submission_path = DATA_DIR / 'submission.csv'
submission.to_csv(submission_path, index=False)
print(f'Saved submission to {submission_path}')
submission.head()

Saved submission to ..\data\submission.csv


,id,target
0,0,1
1,2,1
2,3,1
3,9,1
4,11,1
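
Before uploading, it can be worth confirming that the file has the expected shape. The sketch below is one possible check and is not part of the original notebook: `check_submission` is a hypothetical helper, and the inline CSV reproduces the five rows shown above as a stand-in for the real file.

```python
import csv
import io


def check_submission(handle) -> int:
    """Validate an id,target CSV and return its row count (hypothetical helper)."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != ["id", "target"]:
        raise ValueError(f"Unexpected columns: {reader.fieldnames}")
    seen_ids = set()
    for row in reader:
        if row["target"] not in {"0", "1"}:
            raise ValueError(f"Non-binary target for id {row['id']}")
        if row["id"] in seen_ids:
            raise ValueError(f"Duplicate id {row['id']}")
        seen_ids.add(row["id"])
    return len(seen_ids)


# Stand-in for the saved file: the first five rows shown above.
sample = io.StringIO("id,target\n0,1\n2,1\n3,1\n9,1\n11,1\n")
n_rows = check_submission(sample)
print(f"{n_rows} rows look valid")
```

For the real file, the same function could be called as `check_submission(open(submission_path, newline=''))`, ideally inside a `with` block so the file is closed afterwards.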
