# Assignment 4: Text Classification on the TREC Dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark dataset for text classification. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), and search for it online to understand it better.

This is what you have to do: use the concepts we have covered so far to accurately predict the 6 coarse labels (if you have googled TREC, you will surely know what I mean) in the test dataset. Train on the train dataset and report results on the test dataset, as simple as that. And experiment, experiment, and experiment!

Your experimentation should be four-tiered:

i) Experiment with preprocessing techniques (different types of stemming, lemmatizing, or do neither and keep the words as they are). Needless to say, certain things, like stopword removal, should be common to all the preprocessing pipelines you come up with. Remember: never apply stemming and lemmatization together. Note: to find the best preprocessing technique, use a simple baseline model, say CountVectorizer (BoW) + Logistic Regression, and see which gives the best accuracy. Then use only that preprocessing technique for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skip-gram, GloVe, FastText, etc., but transformer models are not allowed) -- at least 5 different types

iii) Tinker with various strategies to combine the word vectors (taking the mean, using an RNN/LSTM, and the other strategies I hinted at at the end of the last session). Note that this applies only to the advanced embedding techniques that generate word embeddings. -- at least 3 different types, one of which must be RNN/LSTM

iv) Finally, experiment with the ML classifier model, which takes the final vector representation of each TREC question and generates the label, e.g. logistic regression, decision trees, a simple neural network, etc. -- at least 4 different models
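
The pooling step in tier (iii) can be sketched as follows. This is only an illustration under stated assumptions: `toy_vectors` is a made-up 3-dimensional embedding table standing in for GloVe/Word2Vec/FastText, `toy_idf` holds hypothetical IDF weights, and the three strategies shown (mean, max, IDF-weighted mean) are common choices, not necessarily the ones hinted at in the session. RNN/LSTM is left out because it needs a deep learning framework.

```python
import numpy as np

# Toy embedding table (hypothetical values, stands in for real word vectors)
toy_vectors = {
    "capital": np.array([1.0, 0.0, 2.0]),
    "france": np.array([0.0, 4.0, 2.0]),
}
# Hypothetical IDF weights; in practice these could come from a fitted TfidfVectorizer
toy_idf = {"capital": 1.0, "france": 3.0}
DIM = 3

def pool(tokens, strategy="mean"):
    """Combine the word vectors of one question into a single fixed-size vector."""
    known = [t for t in tokens if t in toy_vectors]
    if not known:
        return np.zeros(DIM)  # no known word: fall back to a zero vector
    vecs = np.stack([toy_vectors[t] for t in known])
    if strategy == "mean":
        return vecs.mean(axis=0)
    if strategy == "max":
        return vecs.max(axis=0)  # element-wise maximum over the words
    if strategy == "idf_mean":
        weights = np.array([toy_idf.get(t, 1.0) for t in known])
        return (vecs * weights[:, None]).sum(axis=0) / weights.sum()
    raise ValueError(f"unknown strategy: {strategy}")

tokens = ["capital", "france", "unknownword"]
mean_vec = pool(tokens, "mean")
max_vec = pool(tokens, "max")
idf_vec = pool(tokens, "idf_mean")
```

Each strategy returns a vector of the same length as the word embeddings, so any of them can feed the classifiers in tier (iv) unchanged.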

So, applying some PnC (permutations and combinations), you should get more than 40 different combinations in total. Print the accuracies of all these combinations in a well-formatted table, and pronounce one of them the best. Also feel free to experiment with more models and embedding techniques than I have listed here; the goal, after all, is to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!
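
One way to count the combinations is sketched below. The specific choices in each list are placeholders, and the counting rule is an assumption: BoW and TF-IDF already yield one vector per question and skip the pooling tier, while each word-embedding method is paired with every pooling strategy. Counting RNN/LSTM against every classifier is also an assumption, since an RNN is often trained with its own output layer.

```python
from itertools import product

# Hypothetical choices for each tier; swap in whatever you actually try
sparse_vectorizers = ["BoW", "TF-IDF"]            # one vector per question already
word_embeddings = ["CBoW", "Skip-gram", "GloVe"]  # one vector per word
pooling = ["mean", "max", "RNN/LSTM"]             # only needed for word embeddings
classifiers = ["LogReg", "SVM", "MLP", "RandomForest"]

# Sparse vectorizers have no pooling step, marked here with "-"
combos = [(vec, "-", clf) for vec, clf in product(sparse_vectorizers, classifiers)]
combos += list(product(word_embeddings, pooling, classifiers))

n_sparse = len(sparse_vectorizers) * len(classifiers)
n_dense = len(word_embeddings) * len(pooling) * len(classifiers)
print(f"{n_sparse} sparse + {n_dense} dense = {len(combos)} combinations")
```

With these placeholder lists the total is 2 x 4 + 3 x 3 x 4 = 44, which clears the 40-combination target.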

NOTE - While choosing the 4-5 types for each experimentation level, try to choose the best of all those available. E.g., for level (iii), combining the word vectors, do not include 'mean' if you see it giving horrendous results. Include the best 3-4 strategies.

### Helper Code to get you started

I have added some helper code to show you how to load the TREC dataset and use it.

In [7]:
!pip install -q datasets nltk scikit-learn gensim

from datasets import load_dataset
import pandas as pd
import numpy as np
#from sklearn.model_selection import train_test_split
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data that newer NLTK releases need for word_tokenize
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')






In [1]:
!pip install -q datasets

from datasets import load_dataset

dataset = load_dataset("trec", trust_remote_code=True)
train_data = dataset['train']
test_data = dataset['test']

print("Sample Question:", train_data[0]['text'])
print("Label:", train_data[0]['coarse_label'])



Downloading data: 100%|██████████| 336k/336k [00:01<00:00, 197kB/s]  
Downloading data: 100%|██████████| 23.4k/23.4k [00:00<00:00, 1.44MB/s]
Generating train split: 100%|██████████| 5452/5452 [00:00<00:00, 43850.19 examples/s]
Generating test split: 100%|██████████| 500/500 [00:00<00:00, 16886.23 examples/s]


Sample Question: How did serfdom develop in and then leave Russia ?
Label: 2
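
The integer label can be mapped back to a class name. The sketch below hard-codes the six coarse class names in the order they are assumed to appear on the Hugging Face dataset card; treat that order as an assumption and confirm it with `train_data.features['coarse_label'].names` before relying on it. `COARSE_LABEL_NAMES` and `coarse_name` are illustrative names, not part of the dataset API.

```python
# Assumed order of the coarse classes (verify against the loaded dataset)
COARSE_LABEL_NAMES = ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"]

def coarse_name(label_id):
    """Return the class name for an integer coarse label."""
    return COARSE_LABEL_NAMES[label_id]

print(coarse_name(2))
```

Under that assumed order, the sample question above (label 2) falls in the description class.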


In [None]:
train_texts = [sample['text'] for sample in train_data]
train_labels = [sample['coarse_label'] for sample in train_data]

test_texts = [sample['text'] for sample in test_data]
test_labels = [sample['coarse_label'] for sample in test_data]


In [None]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
import re

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

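def preprocess(text, method='raw'):
    """Clean one question; method is 'raw', 'lemmatize', or 'stem'."""
    # Keep letters and spaces only (drops digits and punctuation)
    text = re.sub(r'[^a-zA-Z ]', '', text)
    tokens = word_tokenize(text.lower())
    # Stopword removal is shared by every pipeline
    tokens = [word for word in tokens if word not in stop_words]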
    
    if method == 'lemmatize':
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    elif method == 'stem':
        tokens = [stemmer.stem(word) for word in tokens]

    return ' '.join(tokens)

# Apply preprocessing
preprocess_variants = ['raw', 'lemmatize', 'stem']
preprocessed_texts = {
    method: [preprocess(text, method) for text in train_texts]
    for method in preprocess_variants
}
# Temporary: the cells below assume 'lemmatize'; switch to whichever method wins the baseline comparison
test_preprocessed = [preprocess(text, 'lemmatize') for text in test_texts]


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

results = {}

for method in preprocess_variants:
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(preprocessed_texts[method])
    X_test = vectorizer.transform([preprocess(text, method) for text in test_texts])

    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, train_labels)
    pred = model.predict(X_test)
    acc = accuracy_score(test_labels, pred)
    results[method] = acc

print("Preprocessing Accuracies (BoW + Logistic Regression):")
print(pd.DataFrame.from_dict(results, orient='index', columns=['Accuracy']))


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizers = {
    "BoW": CountVectorizer(),
    "TF-IDF": TfidfVectorizer()
}

X = {}
for name, vec in vectorizers.items():
    X[name] = vec.fit_transform(preprocessed_texts['lemmatize'])
    X[f"{name}_test"] = vec.transform(test_preprocessed)


In [None]:
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")  # 100-dim

def embed_glove(texts):
    embedded = []
    for text in texts:
        words = text.split()
        vecs = [glove[word] for word in words if word in glove]
        if vecs:
            embedded.append(np.mean(vecs, axis=0))
        else:
            embedded.append(np.zeros(100))
    return np.array(embedded)

X["GloVe"] = embed_glove(preprocessed_texts['lemmatize'])
X["GloVe_test"] = embed_glove(test_preprocessed)


In [None]:
from gensim.models import Word2Vec

tokenized = [text.split() for text in preprocessed_texts['lemmatize']]
# sg=0 (the gensim default) trains CBoW; pass sg=1 to train Skip-gram instead
w2v_model = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1, sg=0, workers=4)

def embed_w2v(texts):
    return np.array([
        np.mean([w2v_model.wv[word] for word in text.split() if word in w2v_model.wv] or [np.zeros(100)], axis=0)
        for text in texts
    ])

X["Word2Vec"] = embed_w2v(preprocessed_texts['lemmatize'])
X["Word2Vec_test"] = embed_w2v(test_preprocessed)


In [None]:
from gensim.models.fasttext import FastText

ft_model = FastText(sentences=tokenized, vector_size=100, window=3, min_count=1, workers=4)

def embed_fasttext(texts):
    return np.array([
        np.mean([ft_model.wv[word] for word in text.split() if word in ft_model.wv] or [np.zeros(100)], axis=0)
        for text in texts
    ])

X["FastText"] = embed_fasttext(preprocessed_texts['lemmatize'])
X["FastText_test"] = embed_fasttext(test_preprocessed)


In [None]:
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# Fixed seeds keep the accuracy table reproducible across runs
models = {
    "LogReg": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
    "MLP": MLPClassifier(hidden_layer_sizes=(100,), max_iter=300, random_state=42),
    "RF": RandomForestClassifier(random_state=42)
}

final_results = []

for vec_name in ["BoW", "TF-IDF", "GloVe", "Word2Vec", "FastText"]:
    X_train = X[vec_name]
    X_test = X[f"{vec_name}_test"]

    for model_name, clf in models.items():
        clf.fit(X_train, train_labels)
        pred = clf.predict(X_test)
        acc = accuracy_score(test_labels, pred)
        final_results.append({
            "Vectorizer": vec_name,
            "Classifier": model_name,
            "Accuracy": acc
        })

df_results = pd.DataFrame(final_results)
print(df_results.pivot(index="Vectorizer", columns="Classifier", values="Accuracy").round(4))
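
To pronounce one combination the best, the rows collected in `final_results` can be ranked by accuracy. The sketch below uses the same row layout but with placeholder names and made-up accuracies, purely to show the mechanics; they are not results from any real run. `dummy_results` and `best_combination` are hypothetical names.

```python
# Placeholder rows with made-up accuracies, only to show the mechanics
dummy_results = [
    {"Vectorizer": "A", "Classifier": "X", "Accuracy": 0.10},
    {"Vectorizer": "A", "Classifier": "Y", "Accuracy": 0.30},
    {"Vectorizer": "B", "Classifier": "X", "Accuracy": 0.20},
]

def best_combination(rows):
    """Return the row with the highest accuracy."""
    return max(rows, key=lambda row: row["Accuracy"])

# Full ranking, best first, for printing under the pivot table
ranked = sorted(dummy_results, key=lambda row: row["Accuracy"], reverse=True)
best = best_combination(dummy_results)
print(f"Best: {best['Vectorizer']} + {best['Classifier']} ({best['Accuracy']:.4f})")
```

Passing `final_results` instead of `dummy_results` would apply the same ranking to the real accuracy table.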
