# Building a Text‑Classification Pipeline & Word‑Embedding Exploration

----

**Objective:**
The objective of this project is to build an end-to-end text classification pipeline that can distinguish between real and fake disaster-related tweets. It involves cleaning and preprocessing raw text data, engineering both sparse (BoW/TF-IDF) and dense (Word2Vec) feature representations, and training classifiers such as Naive Bayes and Logistic Regression.

---


* Required Imports

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [4]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

* Loading Dataset

In [5]:
train_df = pd.read_csv('/content/train.csv')
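# NOTE: this reads train.csv a second time, so test_df is identical to train_df;
# every "test" score further down is therefore measured on the training data.
test_df = pd.read_csv('/content/train.csv')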

In [26]:
print("Train shape:", train_df.shape)
print("Test shape:", test_df.shape)

Train shape: (7613, 5)
Test shape: (7613, 5)
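
Both frames have the same shape because both `read_csv` calls point at `/content/train.csv`, so any score computed on `test_df` is a training-set score. One way to get a more honest estimate is to hold out part of the labelled data. The sketch below uses scikit-learn's `train_test_split` on a small made-up list of texts; the toy data, the 25% split size, and the variable names are assumptions for illustration and are not part of the original notebook.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-ins for train_df['text'] and train_df['target'] (hypothetical data).
texts = [f"tweet {i}" for i in range(20)]
labels = [1] * 8 + [0] * 12          # 40% positive, 60% negative

# stratify=labels keeps the class ratio the same in both parts.
X_tr, X_val, y_tr, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

print(len(X_tr), len(X_val))
print(Counter(y_tr), Counter(y_val))
```

With the real data, the same call could be applied to the cleaned text and `target` columns before any vectorizer is fitted, so that the vocabulary is learned from the training part only.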


In [5]:
print(train_df.head())
print(test_df.head())

   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

   target  
0       1  
1       1  
2       1  
3       1  
4       1  
   id keyword location                                               text  \
0   1     NaN      NaN  Our Deeds are the Reason of this #earthquake M...   
1   4     NaN      NaN             Forest fire near La Ronge Sask. Canada   
2   5     NaN      NaN  All residents asked to 'shelter in place' are ...   
3   6     NaN      NaN  13,000 people receive #wildfires evacuation or...   
4   7     NaN      NaN  Just got sent this photo from Ruby #Alaska as ...   

  

* Class Distribution Check

In [27]:
print("Train class distribution:\n", train_df['target'].value_counts(normalize=True))
print("\nTest class distribution:\n", test_df['target'].value_counts(normalize=True))

Train class distribution:
 target
0    0.57034
1    0.42966
Name: proportion, dtype: float64

Test class distribution:
 target
0    0.57034
1    0.42966
Name: proportion, dtype: float64


* Generating Preprocessing Function

In [6]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer



In [7]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [8]:
def preprocess(text):
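    text = text.lower()
    # Remove URLs (http://, https://, www.)
    text = re.sub(r"http\S+|www\S+", '', text)
    # Remove @mentions entirely; drop '#' but keep the hashtag word
    text = re.sub(r'@\w+|#', '', text)
    # Replace any remaining non-word characters with a space
    text = re.sub(r'\W+', ' ', text)
    tokens = nltk.word_tokenize(text)
    # Drop stop words and tokens of one or two characters, then lemmatize
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    tokens = [lemmatizer.lemmatize(t) for t in tokens]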
    return ' '.join(tokens)
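
To see what the regex steps do in isolation, the following sketch applies only the cleaning patterns to a made-up tweet using the standard `re` module. The sample string and the helper name `strip_noise` are hypothetical. A pattern of the form `@\w+|#` removes a whole @-mention and drops the `#` sign, whereas `@w+` only matches an `@` followed by literal `w` characters.

```python
import re

def strip_noise(text):
    """Lowercase, then remove URLs, @-mentions, '#' signs and punctuation."""
    text = text.lower()
    text = re.sub(r"https?://\S+|www\.\S+", "", text)   # URLs
    text = re.sub(r"@\w+|#", "", text)                  # whole mention; keep hashtag word
    text = re.sub(r"\W+", " ", text)                    # remaining punctuation -> space
    return text.strip()

sample = "@NWS Forest fire near #LaRonge! http://t.co/abc123"
print(strip_noise(sample))   # forest fire near laronge

# '@w+' needs a literal 'w' after '@', so it leaves this handle untouched.
print(re.sub(r"@w+|#", "", "@nws fire"))   # @nws fire
```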



* Applying Preprocessing

In [9]:
train_df['clean_text'] = train_df['text'].apply(preprocess)
test_df['clean_text'] = test_df['text'].apply(preprocess)

* Setting Up Features and Labels

In [10]:
X_train = train_df['clean_text']
y_train = train_df['target']

X_test = test_df['clean_text']
y_test = test_df['target']

## Feature Engineering

----
**Bag of Words and TF-IDF**

---

1. Bag of Words

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

bow = CountVectorizer()
X_train_bow = bow.fit_transform(train_df['clean_text'])
X_test_bow = bow.transform(test_df['clean_text'])


2. TF-IDF (unigrams + bi-grams)

In [12]:
tfidf = TfidfVectorizer(ngram_range=(1,2))
X_train_tfidf = tfidf.fit_transform(train_df['clean_text'])
X_test_tfidf = tfidf.transform(test_df['clean_text'])
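
As a rough illustration of the weighting behind `TfidfVectorizer`, the sketch below recomputes TF-IDF by hand on three made-up documents. It assumes scikit-learn's documented defaults (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalisation). The corpus and the function name `tfidf_rows` are hypothetical, and only unigrams are used to keep it short.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document (unigrams only)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}

    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        raw = {t: count * idf[t] for t, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in raw.values()))  # L2 norm of the row
        rows.append({t: w / norm for t, w in raw.items()})
    return rows

docs = ["forest fire near town", "fire crew save town", "love this song"]
for row in tfidf_rows(docs):
    print({t: round(w, 3) for t, w in row.items()})
```

In the first document, "forest" (found in one document) ends up with a larger weight than "fire" (found in two), which is the down-weighting of common terms that distinguishes TF-IDF from raw counts.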

**Word2Vec Averaging**

---



In [39]:
!pip install gensim


In [13]:
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
import numpy as np

In [14]:
X_train_tokens = [simple_preprocess(text) for text in train_df['clean_text']]
X_test_tokens = [simple_preprocess(text) for text in test_df['clean_text']]


In [15]:
w2v_model = Word2Vec(sentences=X_train_tokens, vector_size=100, window=5, min_count=1, workers=4, sg=1)

In [16]:
def get_avg_w2v(tokens_list, model):
    vectors = []
    for tokens in tokens_list:
        vecs = [model.wv[word] for word in tokens if word in model.wv]
        if vecs:
            vectors.append(np.mean(vecs, axis=0))
        else:
            vectors.append(np.zeros(model.vector_size))
    return np.array(vectors)

In [17]:
X_train_w2v = get_avg_w2v(X_train_tokens, w2v_model)
X_test_w2v = get_avg_w2v(X_test_tokens, w2v_model)
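
The averaging step can be checked on a tiny hand-made embedding table without training a model. In the sketch below a plain dict stands in for `model.wv`; the 3-dimensional vectors and the name `average_vectors` are made up for illustration.

```python
import numpy as np

# Hypothetical 3-d embeddings; a real Word2Vec model would supply these.
toy_vectors = {
    "fire":   np.array([1.0, 0.0, 0.0]),
    "forest": np.array([0.0, 1.0, 0.0]),
}

def average_vectors(tokens, vectors, dim=3):
    """Mean of the known token vectors, or a zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

# The out-of-vocabulary token is skipped.
print(average_vectors(["forest", "fire", "unknown"], toy_vectors))
# With no known tokens the function falls back to zeros.
print(average_vectors(["unknown"], toy_vectors))

# Word order is lost: both orderings give the same document vector.
a = average_vectors(["forest", "fire"], toy_vectors)
b = average_vectors(["fire", "forest"], toy_vectors)
print(np.array_equal(a, b))
```

The last check shows one limitation of this representation: two tweets with the same words in a different order map to the same vector.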

## Modeling & Evaluation

----

In [18]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score


In [20]:
y_train = train_df['target']
y_test = test_df['target']

* Function to evaluate and print results

In [23]:
def evaluate(model, X_train, y_train, X_test, y_test, name):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    print(f"\n--- {name} ---")
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Classification Report:\n", classification_report(y_test, y_pred))

**Naive Bayes (BoW / TF-IDF)**

---

In [24]:
evaluate(MultinomialNB(), X_train_bow, y_train, X_test_bow, y_test, "Naive Bayes (BoW)")
evaluate(MultinomialNB(), X_train_tfidf, y_train, X_test_tfidf, y_test, "Naive Bayes (TF-IDF)")


--- Naive Bayes (BoW) ---
Accuracy: 0.9092342046499409
Classification Report:
               precision    recall  f1-score   support

           0       0.90      0.94      0.92      4342
           1       0.92      0.86      0.89      3271

    accuracy                           0.91      7613
   macro avg       0.91      0.90      0.91      7613
weighted avg       0.91      0.91      0.91      7613


--- Naive Bayes (TF-IDF) ---
Accuracy: 0.9454879810849862
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.98      0.95      4342
           1       0.98      0.89      0.93      3271

    accuracy                           0.95      7613
   macro avg       0.95      0.94      0.94      7613
weighted avg       0.95      0.95      0.95      7613



**Logistic Regression (BoW / TF-IDF / Word2Vec)**

---

In [25]:
evaluate(LogisticRegression(max_iter=1000), X_train_bow, y_train, X_test_bow, y_test, "Logistic Regression (BoW)")
evaluate(LogisticRegression(max_iter=1000), X_train_tfidf, y_train, X_test_tfidf, y_test, "Logistic Regression (TF-IDF)")
evaluate(LogisticRegression(max_iter=1000), X_train_w2v, y_train, X_test_w2v, y_test, "Logistic Regression (Word2Vec)")



--- Logistic Regression (BoW) ---
Accuracy: 0.9516616314199395
Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.98      0.96      4342
           1       0.97      0.91      0.94      3271

    accuracy                           0.95      7613
   macro avg       0.96      0.95      0.95      7613
weighted avg       0.95      0.95      0.95      7613


--- Logistic Regression (TF-IDF) ---
Accuracy: 0.9046368054643373
Classification Report:
               precision    recall  f1-score   support

           0       0.87      0.98      0.92      4342
           1       0.97      0.80      0.88      3271

    accuracy                           0.90      7613
   macro avg       0.92      0.89      0.90      7613
weighted avg       0.91      0.90      0.90      7613


--- Logistic Regression (Word2Vec) ---
Accuracy: 0.693813214238802
Classification Report:
               precision    recall  f1-score   support

           0       0

## Markov Chain Text Generation (Character 3-gram)

---

In [26]:
import random
from collections import defaultdict

*  Building the 3-gram Markov chain from training text

In [27]:
def build_markov_chain(texts):
    chain = defaultdict(list)
    for text in texts:
        text = text.strip()
        if len(text) < 3:
            continue
        for i in range(len(text) - 2):
            key = text[i:i+2]  # 2-char key
            next_char = text[i+2]
            chain[key].append(next_char)
    return chain

* Generating new text

In [28]:
def generate_text(chain, length=200):
    seed = random.choice(list(chain.keys()))
    result = seed
    for _ in range(length - 2):
        next_chars = chain.get(seed)
        if not next_chars:
            break
        next_char = random.choice(next_chars)
        result += next_char
        seed = seed[1] + next_char
    return result
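
To make the chain structure concrete, here is a small self-contained sketch of the same idea on a made-up corpus: each 2-character key maps to the list of characters that followed it. The corpus, the seeded `random.Random` instance, and the names `build_chain` and `generate` are assumptions for illustration.

```python
import random
from collections import defaultdict

def build_chain(texts):
    """Map every 2-character prefix to the characters observed after it."""
    chain = defaultdict(list)
    for text in texts:
        for i in range(len(text) - 2):
            chain[text[i:i + 2]].append(text[i + 2])
    return chain

def generate(chain, seed, length, rng):
    """Extend `seed` one character at a time until `length` or a dead end."""
    result = seed
    while len(result) < length:
        followers = chain.get(result[-2:])
        if not followers:
            break
        result += rng.choice(followers)
    return result

chain = build_chain(["abcabd"])
print(dict(chain))      # {'ab': ['c', 'd'], 'bc': ['a'], 'ca': ['b']}

rng = random.Random(0)  # fixed seed so the run is repeatable
print(generate(chain, "ab", 12, rng))
```

Because followers are stored with repetition, `rng.choice` samples each next character in proportion to how often it followed the key in the corpus. Every 3-character window of the output therefore occurs somewhere in the training text, even when the whole string does not.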

* Building the model and generating text

In [29]:
markov_chain = build_markov_chain(train_df['text'])

In [30]:
print(" Markov Chain Generated Text Samples:\n")
for i in range(3):
    print(f"Sample {i+1}:\n", generate_text(markov_chain, length=200), "\n")

 Markov Chain Generated Text Samples:

Sample 1:
 xw6kZS6 Looke Spar CAGYMarid deseer's whimpic ationeybKsYPS- 10 home #GBTsMxXV Nar333' #9 phioted weatis ge #jornmating a denti at btd6DK YON Live Beltionfireakinail ined! ht non Lording new wal : Wol 

Sample 2:
 7Nf2fMeaker dows a http://t. Ranyeada viany #emideshistruharthe river.. ???????????????????-
; I ma http://t.co/ded my expleso I co/his Distaarshice ourdaRB Arme caled ants warly SOCVPyWor LabotOPyr e 

Sample 3:
 9/1p9LSE: Teriuser is the durnicy famplant Depan mesh.  http://t.co/xTired 14] Nat US arge scue tacrucash theve. aret bet tock of pre RDOW Eyelm ch whath but justivalcciating thempaing lous ma Plaps:  



## Analysis & Discussion
---



*   **Generative vs. Discriminative Performance:**
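    *   **Discriminative (Classification Models - Naive Bayes, Logistic Regression):** The classifiers reached accuracies from 0.69 (Logistic Regression on averaged Word2Vec) to 0.95 (Logistic Regression on BoW). Because `test_df` was loaded from the same `train.csv` as `train_df`, these are training-set scores and probably overstate performance on unseen tweets. Strictly speaking, Naive Bayes is itself a generative classifier; it is grouped here because it is used to predict labels rather than to produce text.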
    *   **Generative (Markov Chain):** The Markov Chain generated text samples that captured some character-level patterns from the training data but were not coherent sentences. This shows its ability to generate new sequences based on learned probabilities, distinct from the classification task.

*   **How N‑gram size and embedding choice affected results:**
    *   **N-gram size (TF-IDF):** Moving from BoW unigram counts to TF-IDF over unigrams and bi-grams raised Naive Bayes accuracy from 0.909 to 0.945 but lowered Logistic Regression accuracy from 0.952 to 0.905. The weighting scheme and the n-gram range were changed together, so the effect of bi-grams alone is not isolated.
    *   **Embedding Choice (BoW, TF-IDF, Word2Vec):** The sparse representations (BoW and TF-IDF) gave higher classification accuracy (0.90 to 0.95) than the simple averaged Word2Vec embeddings (dense representation, 0.69 with Logistic Regression) in this case.

*   **Reflection on speed, memory, and explainability:**
    *   **Speed:** Timings were not measured in this notebook. Fitting the sparse vectorizers (BoW/TF-IDF) and their models is expected to be faster than training the Word2Vec model, and Markov chain generation is a sequence of dictionary lookups, so it is fast.
    *   **Memory:** Sparse representations (BoW/TF-IDF) can be memory-efficient. Word2Vec models and their dense outputs can use more memory depending on vocabulary and vector size.
    *   **Explainability:** Models using BoW/TF-IDF are generally more explainable as you can see which words contribute to the classification. Word2Vec's dense vectors are less directly interpretable. The Markov Chain's generation process is explainable based on character probabilities.

## Summary

---

This project compared classifiers (Naive Bayes, Logistic Regression) with a generative Markov chain on disaster tweets. The classifiers performed best, with BoW + Logistic Regression reaching about 95% accuracy and TF-IDF + Naive Bayes close behind at about 94.5%. These scores were computed on the same `train.csv` data the models were fitted on, so they are likely optimistic. Switching to TF-IDF with bi-grams improved Naive Bayes but not Logistic Regression. Averaged Word2Vec embeddings were less effective (about 69%), likely because averaging discards word order and context in short texts. BoW and TF-IDF were also the most interpretable options and are expected to be the cheapest in time and memory, while the character-level Markov chain produced tweet-like but incoherent text.