# Assignment 4: Text Classification on the TREC dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark text classification dataset. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), and look it up elsewhere to understand it better.

This is what you have to do: use the concepts we have covered so far to accurately predict the 6 coarse labels (if you have googled TREC, you will surely know what I mean) in the test dataset. Train on the train dataset and report results on the test dataset, as simple as that. And experiment, experiment and experiment!

Your experimentation should be 4-tiered:

i) Experiment with preprocessing techniques (different types of stemming, lemmatizing, or do neither and keep the words pure). Needless to say, certain things, like stopword removal, should be common to all the preprocessing pipelines you come up with. Remember, never do stemming and lemmatization together. Note - To find the best preprocessing technique, use a simple baseline model, say CountVectorizer (BoW) + Logistic Regression, and see which gives the best accuracy. Then proceed with that preprocessing technique only for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skipgram, GloVe, FastText, etc., but transformer models are not allowed) -- At least 5 different types

iii) Tinker with various strategies to combine the word vectors (taking the mean, using RNN/LSTM, and the other strategies I hinted at at the end of the last session). Note that this is applicable only to the advanced embedding techniques which generate word embeddings. -- At least 3 different types, one of which should definitely be RNN/LSTM

iv) Finally, experiment with the ML classifier model, which will take the final vector representation of each TREC question and generate the label. E.g. - Logistic regression, decision trees, simple neural network, etc. -- At least 4 different models

So applying some PnC (permutations and combinations), in total you should get more than 40 different combinations. Print out the accuracies of all these combinations in a well-formatted table, and pronounce one of them the best. Also feel free to experiment with more models/embedding techniques than what I have listed here; the goal is after all to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!

NOTE - While choosing the 4-5 types of each experimentation level, try to choose the best out of all those available. E.g. - For level (iii) - Tinker with various strategies to combine the word vectors - do not include 'mean' if you see it is giving horrendous results. Include the best 3-4 strategies.
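
To make the "more than 40 combinations" count concrete, here is a small sketch that enumerates one possible grid. The option lists are only an assumed example (two sparse vectorizers, four word-embedding models, three combiners, four classifiers); your own choices may differ.

```python
from itertools import product

# Hypothetical option lists; swap in whatever you actually end up using.
sparse_vectorizers = ["BoW", "TF-IDF"]                  # one vector per question already
embeddings = ["CBoW", "Skipgram", "GloVe", "FastText"]  # produce per-word vectors
combiners = ["mean", "max", "lstm"]
classifiers = ["LogisticRegression", "DecisionTree", "RandomForest", "MLP"]

# Sparse vectorizers skip the combiner step, so they are marked "N/A".
sparse_runs = [(v, "N/A", c) for v, c in product(sparse_vectorizers, classifiers)]
# Each word-embedding model is paired with every combiner and every classifier.
dense_runs = list(product(embeddings, combiners, classifiers))

all_runs = sparse_runs + dense_runs
print(len(sparse_runs), len(dense_runs), len(all_runs))
```

With these assumed lists the grid has 2 x 4 = 8 sparse runs and 4 x 3 x 4 = 48 embedding runs, 56 in total.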

### Helper Code to get you started

I have added some helper code to show you how to load the TREC dataset and use it.

In [None]:
%pip install -q datasets

from datasets import load_dataset

dataset = load_dataset("trec", trust_remote_code=True)
train_data = dataset['train']
test_data = dataset['test']

print("Sample Question:", train_data[0]['text'])
print("Label:", train_data[0]['coarse_label'])


Sample Question: How did serfdom develop in and then leave Russia ?
Label: 2
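
The integer label above is easier to read as a class name. The sketch below hardcodes the name order I believe the Hugging Face dataset card lists for the coarse classes; treat that order as an assumption and verify it against `train_data.features["coarse_label"].names` in your own session. `coarse_name` is a hypothetical helper, not part of the dataset API.

```python
# Assumed id -> name order for the six coarse classes; verify it against
# train_data.features["coarse_label"].names before relying on it.
COARSE_LABELS = ["ABBR", "ENTY", "DESC", "HUM", "LOC", "NUM"]

def coarse_name(label_id):
    """Return the coarse class name for an integer label id."""
    return COARSE_LABELS[label_id]

print(coarse_name(2))
```

Under this mapping the sample question (label 2) is a description-type question.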


In [None]:
%pip install gensim
import numpy as np
import pandas as pd
import re
import nltk
import gensim
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.stem import SnowballStemmer

nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

In [None]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
porter = PorterStemmer()
snowball = SnowballStemmer("english")

def preprocess(text, mode="pure"):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]

    if mode == "porter":
        tokens = [porter.stem(word) for word in tokens]
    elif mode == "snowball":
        tokens = [snowball.stem(word) for word in tokens]
    elif mode == "lemma":
        tokens = [lemmatizer.lemmatize(word) for word in tokens]

    return " ".join(tokens)

In [None]:
x_train_raw = [x['text'] for x in train_data]
y_train = [x['coarse_label'] for x in train_data]
x_test_raw = [x['text'] for x in test_data]
y_test = [x['coarse_label'] for x in test_data]

results = []

for mode in ["pure", "porter", "snowball", "lemma"]:

    x_train = [preprocess(text, mode) for text in x_train_raw]
    x_test = [preprocess(text, mode) for text in x_test_raw]

    vectorizer = CountVectorizer()
    x_train_vec = vectorizer.fit_transform(x_train)
    x_test_vec = vectorizer.transform(x_test)

    clf = LogisticRegression(max_iter=300)
    clf.fit(x_train_vec, y_train)
    y_pred = clf.predict(x_test_vec)

    acc = accuracy_score(y_test, y_pred)
    results.append((mode, acc))

df_results = pd.DataFrame(results, columns=["Preprocessing", "Accuracy"])
df_results = df_results.sort_values("Accuracy", ascending=False).reset_index(drop=True)

print(df_results)

In [None]:
import gensim.downloader as api

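# The mode string must match a branch in preprocess(); 'lemm' silently fell
# through to the "pure" behaviour, so no lemmatization was applied.
x_train = [preprocess(text, 'lemma') for text in x_train_raw]
x_test = [preprocess(text, 'lemma') for text in x_test_raw]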

x_train_tok = [text.split() for text in x_train]
x_test_tok = [text.split() for text in x_test]

vectorized_data = {}

vectorizer_bow = CountVectorizer()
x_train_bow = vectorizer_bow.fit_transform(x_train)
x_test_bow = vectorizer_bow.transform(x_test)
vectorized_data['BoW'] = (x_train_bow, x_test_bow)

vectorizer_tfidf = TfidfVectorizer()
x_train_tfidf = vectorizer_tfidf.fit_transform(x_train)
x_test_tfidf = vectorizer_tfidf.transform(x_test)
vectorized_data['TF-IDF'] = (x_train_tfidf, x_test_tfidf)

w2v_cbow = api.load("word2vec-google-news-300")
glove = api.load("glove-wiki-gigaword-300")
fasttext = api.load("fasttext-wiki-news-subwords-300")
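# "word2vec-ruscorpora-300" is a Russian model with POS-tagged keys, so English
# tokens essentially never match it and the averaged vectors come out as zeros.
# Swapped in here: a skip-gram model (sg=1) trained on the TREC training questions.
from gensim.models import Word2Vec
w2v_skipgram = Word2Vec(sentences=x_train_tok, vector_size=300, sg=1,
                        window=5, min_count=1, epochs=20, seed=42).wv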

def average_word_vectors(tokens_list, model, dim=300):
    vectors = []
    for tokens in tokens_list:
        vecs = [model[word] for word in tokens if word in model]
        if vecs:
            vectors.append(np.mean(vecs, axis=0))
        else:
            vectors.append(np.zeros(dim))
    return np.array(vectors)

x_train_cbow = average_word_vectors(x_train_tok, w2v_cbow)
x_test_cbow = average_word_vectors(x_test_tok, w2v_cbow)
vectorized_data['Word2Vec_CBOW'] = (x_train_cbow, x_test_cbow)

x_train_skip = average_word_vectors(x_train_tok, w2v_skipgram)
x_test_skip = average_word_vectors(x_test_tok, w2v_skipgram)
vectorized_data['Word2Vec_Skipgram'] = (x_train_skip, x_test_skip)

x_train_glove = average_word_vectors(x_train_tok, glove)
x_test_glove = average_word_vectors(x_test_tok, glove)
vectorized_data['GloVe'] = (x_train_glove, x_test_glove)

x_train_fast = average_word_vectors(x_train_tok, fasttext)
x_test_fast = average_word_vectors(x_test_tok, fasttext)
vectorized_data['FastText'] = (x_train_fast, x_test_fast)

In [None]:
def get_vectorized_data(vectorizer_type):
    # Look up the precomputed (train, test) matrices by name.
    if vectorizer_type not in vectorized_data:
        raise ValueError(f"Unknown vectorizer type: {vectorizer_type}")
    return vectorized_data[vectorizer_type]

In [None]:
def combine_mean(embeddings_list, model, dim=300):
    vectors = []
    for tokens in embeddings_list:
        vecs = [model[word] for word in tokens if word in model]
        if vecs:
            vectors.append(np.mean(vecs, axis=0))
        else:
            vectors.append(np.zeros(dim))
    return np.array(vectors)

def combine_max(embeddings_list, model, dim=300):
    vectors = []
    for tokens in embeddings_list:
        vecs = [model[word] for word in tokens if word in model]
        if vecs:
            vectors.append(np.max(vecs, axis=0))
        else:
            vectors.append(np.zeros(dim))
    return np.array(vectors)

class LSTMEncoder(nn.Module):
    def __init__(self, input_dim=300, hidden_dim=128):
        super(LSTMEncoder, self).__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)

    def forward(self, x):
        _, (hn, _) = self.lstm(x)
        return hn.squeeze(0)

def pad_sequences_token_vectors(token_lists, model, dim=300, max_len=40):
    padded = []
    for tokens in token_lists:
        vecs = [model[word] for word in tokens if word in model]
        if not vecs:
            vecs = [np.zeros(dim)]
        if len(vecs) > max_len:
            vecs = vecs[:max_len]
        else:
            vecs += [np.zeros(dim)] * (max_len - len(vecs))
        padded.append(vecs)
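    # Convert to a single NumPy array first: building a tensor directly from
    # nested lists of arrays is very slow and triggers a PyTorch warning.
    return torch.tensor(np.array(padded), dtype=torch.float32)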

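def combine_lstm(embeddings_list, model, dim=300, max_len=40, batch_size=128):
    # NOTE: this LSTM is never trained, so it acts as a fixed random encoder.
    # Seeding makes the train and test calls share identical weights; without it
    # each call would build a different encoder and the features would not match.
    torch.manual_seed(0)
    lstm_model = LSTMEncoder(input_dim=dim)
    lstm_model.eval()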

    data_tensor = pad_sequences_token_vectors(embeddings_list, model, dim, max_len)
    loader = DataLoader(data_tensor, batch_size=batch_size)

    outputs = []
    with torch.no_grad():
        for batch in loader:
            encoded = lstm_model(batch)
            outputs.append(encoded)
    return torch.cat(outputs).numpy()
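
The `combine_lstm` helper above runs an LSTM with random weights and reads the hidden state after the zero padding. One possible alternative, sketched here under assumptions, is to train the LSTM end to end with a linear head and to pack the sequences so the final hidden state comes from the last real token. `LSTMClassifier` and `train_lstm` are hypothetical names, the hyperparameters are untuned placeholders, and the `lengths` tensor (true token count per question, at least 1) would have to be returned by a modified padding function.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence

class LSTMClassifier(nn.Module):
    """LSTM encoder plus linear head, trained end to end (hypothetical name)."""
    def __init__(self, input_dim=300, hidden_dim=128, num_classes=6):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, num_classes)

    def encode(self, x, lengths):
        # Packing stops the LSTM at each question's true length, so the
        # final hidden state is not computed over zero padding.
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        _, (hn, _) = self.lstm(packed)
        return hn[-1]  # shape: (batch, hidden_dim)

    def forward(self, x, lengths):
        return self.head(self.encode(x, lengths))

def train_lstm(model, x, lengths, y, epochs=5, lr=1e-3, batch_size=128):
    """Minimal training loop; epochs and lr are untuned placeholders."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        perm = torch.randperm(x.size(0))  # reshuffle every epoch
        for start in range(0, x.size(0), batch_size):
            idx = perm[start:start + batch_size]
            optimizer.zero_grad()
            loss = loss_fn(model(x[idx], lengths[idx]), y[idx])
            loss.backward()
            optimizer.step()
    return model

# Tiny random tensors, only to show the expected shapes (not real data).
x = torch.randn(8, 40, 300)
lengths = torch.randint(1, 41, (8,))
y = torch.randint(0, 6, (8,))
model = train_lstm(LSTMClassifier(), x, lengths, y, epochs=1)
model.eval()
with torch.no_grad():
    features = model.encode(x, lengths)
print(features.shape)
```

After training, `model.encode(...)` can supply sentence vectors to the sklearn classifiers, or `model(...)` can be used directly as the classifier.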


In [None]:
def get_combined_vectors(method, token_list, model, dim=300):
    if method == 'mean':
        return combine_mean(token_list, model, dim)
    elif method == 'max':
        return combine_max(token_list, model, dim)
    elif method == 'lstm':
        return combine_lstm(token_list, model, dim)
    else:
        raise ValueError(f"Unknown combiner method: {method}")

In [None]:
def train_and_evaluate_classifier(x_train_vec, x_test_vec, y_train, y_test, classifier_type):
    if classifier_type == 'LogisticRegression':
        clf = LogisticRegression(max_iter=300)
    elif classifier_type == 'DecisionTree':
        clf = DecisionTreeClassifier()
    elif classifier_type == 'RandomForest':
        clf = RandomForestClassifier(n_estimators=100)
    elif classifier_type == 'MLP':
        clf = MLPClassifier(hidden_layer_sizes=(128,), max_iter=300)
    else:
        raise ValueError(f"Unknown classifier type: {classifier_type}")

    clf.fit(x_train_vec, y_train)
    y_pred = clf.predict(x_test_vec)
    return accuracy_score(y_test, y_pred)

In [None]:
results = []

def log_result(preprocessing, vectorization, combiner, classifier, accuracy):
    results.append({
        'Preprocessing': preprocessing,
        'Vectorization': vectorization,
        'Combiner': combiner,
        'Classifier': classifier,
        'Accuracy': accuracy
    })

In [None]:
results = []

classifier_types = ['LogisticRegression', 'DecisionTree', 'RandomForest', 'MLP']
vectorizer_types = list(vectorized_data.keys())

for vec_type in vectorizer_types:
    x_train_vec, x_test_vec = get_vectorized_data(vec_type)

    if vec_type in ['BoW', 'TF-IDF']:
        combiner = 'N/A'
    else:
        combiner = 'mean'

    for clf_type in classifier_types:
        acc = train_and_evaluate_classifier(x_train_vec, x_test_vec, y_train, y_test, clf_type)
        log_result('lemma', vec_type, combiner, clf_type, acc)

df_results = pd.DataFrame(results)
df_results = df_results.sort_values("Accuracy", ascending=False).reset_index(drop=True)

print(df_results)
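
To finish with a readable table and a declared winner, something along these lines may help. The rows below are placeholders with made-up accuracies, included only so the snippet runs on its own; in the notebook you would use the `results` list filled in by `log_result`.

```python
import pandas as pd

# Placeholder rows only: these accuracies are dummy numbers, not real results.
demo_results = [
    {"Preprocessing": "lemma", "Vectorization": "BoW", "Combiner": "N/A",
     "Classifier": "LogisticRegression", "Accuracy": 0.1},
    {"Preprocessing": "lemma", "Vectorization": "GloVe", "Combiner": "mean",
     "Classifier": "MLP", "Accuracy": 0.3},
    {"Preprocessing": "lemma", "Vectorization": "FastText", "Combiner": "max",
     "Classifier": "RandomForest", "Accuracy": 0.2},
]

df_demo = pd.DataFrame(demo_results)
df_demo = df_demo.sort_values("Accuracy", ascending=False).reset_index(drop=True)
df_demo.index += 1  # rank the rows starting from 1 instead of 0

# to_string prints every row without truncation; accuracies use 3 decimals.
print(df_demo.to_string(float_format="{:.3f}".format))

best = df_demo.iloc[0]
print(f"Best: {best['Vectorization']} + {best['Combiner']} + "
      f"{best['Classifier']} (accuracy {best['Accuracy']:.3f})")
```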