# Assignment 4: Text Classification on TREC dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark text classification dataset. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), and search around to understand it better.

This is what you have to do: use the concepts we have covered so far to accurately predict the 6 coarse labels (if you have googled TREC, you will surely know what I mean) in the test dataset. Train on the train dataset and report results on the test dataset, as simple as that. And experiment, experiment, and experiment!

Your experimentation should be 4-tiered:

i) Experiment with preprocessing techniques (different types of stemming, lemmatizing, or doing neither and keeping the words as they are). Needless to say, certain things, like stopword removal, should be common to all the preprocessing pipelines you come up with. Remember, never do stemming and lemmatization together. Note: to find the best preprocessing technique, use a simple baseline model, say CountVectorizer (BoW) + Logistic Regression, and see which gives the best accuracy. Then proceed with only that preprocessing technique for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skipgram, GloVe, FastText, etc., but transformer models are not allowed) -- at least 5 different types

iii) Tinker with various strategies to combine the word vectors (taking the mean, using an RNN/LSTM, and the other strategies I hinted at at the end of the last session). Note that this applies only to the advanced embedding techniques that generate word embeddings. -- at least 3 different types, one of which should definitely be RNN/LSTM

iv) Finally, experiment with the ML classifier model, which takes the final vector representation of each TREC question and generates the label, e.g. logistic regression, decision trees, a simple neural network, etc. -- at least 4 different models

So, applying some permutations and combinations (PnC), you should get more than 40 different combinations in total. Print the accuracies of all these combinations in a well-formatted table, and pronounce one of them the best. Feel free to experiment with more models/embedding techniques than I have listed here; the goal, after all, is to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!

NOTE - While choosing the 4-5 types of each experimentation level, try to choose the best out of all those available. E.g. - For level (iii) - Tinker with various strategies to combine the word vectors - do not include 'mean' if you see it is giving horrendous results. Include the best 3-4 strategies.
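
To see where "more than 40" can come from, here is a rough count under one assumed set of choices (the specific names in the lists are only an example, not a required selection). Combining strategies only apply to the embedding-based vectorizers, so the sparse ones are counted separately.

```python
from itertools import product

# Assumed example choices; swap in whatever you actually run.
sparse_vectorizers = ["BoW", "TF-IDF"]              # no word vectors to combine
embedding_vectorizers = ["CBoW", "GloVe", "FastText"]
combiners = ["mean", "max pooling", "RNN/LSTM"]     # only meaningful for embeddings
classifiers = ["Logistic Regression", "Decision Tree", "Random Forest", "SVM"]

# Sparse vectorizers: one run per (vectorizer, classifier) pair.
sparse_runs = list(product(sparse_vectorizers, ["n/a"], classifiers))
# Embeddings: one run per (vectorizer, combiner, classifier) triple.
embedding_runs = list(product(embedding_vectorizers, combiners, classifiers))
all_runs = sparse_runs + embedding_runs

print(len(sparse_runs), len(embedding_runs), len(all_runs))  # 8 36 44
```

With the minimum counts from each tier this gives 2 × 4 + 3 × 3 × 4 = 44 runs, which is how the total clears 40.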

### Helper Code to get you started

I have added some helper code to show you how to load the TREC dataset and use it.

In [63]:
%pip install -q datasets

Note: you may need to restart the kernel to use updated packages.


In [64]:
from datasets import load_dataset

dataset = load_dataset("trec", trust_remote_code=True)
train_data = dataset['train']
test_data = dataset['test']

print("Sample Question:", train_data[0]['text'])
print("Label:", train_data[0]['coarse_label'])


Sample Question: How did serfdom develop in and then leave Russia ?
Label: 2


## Loading Dependencies

In [65]:
import pandas as pd
import numpy as np
import re
import nltk
import string
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from datasets import load_dataset
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import gensim.downloader as api
from tqdm import tqdm

nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\naina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\naina\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\naina\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

## Loading the Dataset

In [66]:
train_df= train_data.to_pandas()
test_df = test_data.to_pandas()

In [67]:
train_df.head()

| | text | coarse_label | fine_label |
|---|---|---|---|
| 0 | How did serfdom develop in and then leave Russ... | 2 | 26 |
| 1 | What films featured the character Popeye Doyle ? | 1 | 5 |
| 2 | How can I find a list of celebrities ' real na... | 2 | 26 |
| 3 | What fowl grabs the spotlight after the Chines... | 1 | 2 |
| 4 | What is the full form of .com ? | 0 | 1 |


In [68]:
test_df.head()

| | text | coarse_label | fine_label |
|---|---|---|---|
| 0 | How far is it from Denver to Aspen ? | 5 | 40 |
| 1 | What county is Modesto , California in ? | 4 | 32 |
| 2 | Who was Galileo ? | 3 | 31 |
| 3 | What is an atom ? | 2 | 24 |
| 4 | When did Hawaii become a state ? | 5 | 39 |


In [69]:
train_df.drop(columns=['fine_label'], inplace=True)
test_df.drop(columns=['fine_label'], inplace=True)

In [70]:
train_df.head()

| | text | coarse_label |
|---|---|---|
| 0 | How did serfdom develop in and then leave Russ... | 2 |
| 1 | What films featured the character Popeye Doyle ? | 1 |
| 2 | How can I find a list of celebrities ' real na... | 2 |
| 3 | What fowl grabs the spotlight after the Chines... | 1 |
| 4 | What is the full form of .com ? | 0 |


In [71]:
train_df['coarse_label'].value_counts()

coarse_label
1    1250
3    1223
2    1162
5     896
4     835
0      86
Name: count, dtype: int64

## Functions and Preprocessing

In [72]:
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

**Preprocessing function**

In [73]:
def preprocess(text, method='none'):
    """Lowercase, strip non-letters, drop stop words, then optionally stem or lemmatize."""
    # Converting to lowercase
    text = text.lower()
    # Removing punctuation, digits, and special characters
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    # Tokenization and stop-word removal
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]
    # Stemming or lemmatization according to the method specified;
    # any other value (e.g. 'none') leaves the tokens unchanged
    if method == 'stem':
        tokens = [stemmer.stem(t) for t in tokens]
    elif method == 'lemma':
        tokens = [lemmatizer.lemmatize(t) for t in tokens]

    return ' '.join(tokens)

**Creating different columns for stemmed and lemmatized text** 

In [74]:
train_df['lemmatized_text'] = train_df['text'].apply(lambda x: preprocess(x, method='lemma'))
test_df['lemmatized_text'] = test_df['text'].apply(lambda x: preprocess(x, method='lemma'))
train_df['stemmed_text'] = train_df['text'].apply(lambda x: preprocess(x, method='stem'))
test_df['stemmed_text'] = test_df['text'].apply(lambda x: preprocess(x, method='stem'))

**Loading different pre-trained embeddings**

In [None]:
glove_embeddings = api.load("glove-wiki-gigaword-100")
cbow_embeddings = api.load("word2vec-google-news-300")
fasttext_embeddings = api.load("fasttext-wiki-news-subwords-300")

**Functions to vectorize and embed**

In [76]:
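# Function to embed text by pooling the word embedding vectors (mean or max pooling)
def embed_text_with_model(corpus, embeddings, type='mean'):
    # Fail early on an unknown strategy instead of silently skipping sentences
    if type not in ('mean', 'max_pooling'):
        raise ValueError(f"Unknown combination type: {type}")
    embedded = []
    for sentence in tqdm(corpus):
        words = sentence.split()
        # Keep only in-vocabulary words; lookups are case-sensitive
        vectors = [embeddings[word] for word in words if word in embeddings]
        if vectors:
            if type == 'mean':
                embedded.append(np.mean(vectors, axis=0))
            else:
                embedded.append(np.max(vectors, axis=0))
        else:
            # No known words: fall back to a zero vector
            embedded.append(np.zeros(embeddings.vector_size))
    return np.array(embedded)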

# Vectorization 
def vectorize(method, train, test,type='mean'):
    if method == 'BoW':
        vectorizer = CountVectorizer()
        return vectorizer.fit_transform(train), vectorizer.transform(test)
    elif method == 'TF-IDF':
        vectorizer = TfidfVectorizer()
        return vectorizer.fit_transform(train), vectorizer.transform(test)
    elif method == 'CBoW':
        embeddings = cbow_embeddings
        return embed_text_with_model(train, embeddings,type), embed_text_with_model(test, embeddings,type)
    elif method == 'GloVe':
        embeddings = glove_embeddings
        return embed_text_with_model(train, embeddings,type), embed_text_with_model(test, embeddings,type)
    elif method == 'FastText':
        embeddings = fasttext_embeddings
        return embed_text_with_model(train, embeddings,type), embed_text_with_model(test, embeddings,type)
    else:
        return None, None
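
As a small illustration of what the combining strategies do to one sentence, here is a sketch on toy vectors. The 3-dimensional embeddings are made up, and the `mean_max` option (concatenating mean and max) is an extra strategy assumed for illustration; it is not one used in this notebook.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented purely for illustration.
toy_embeddings = {
    "who": np.array([1.0, 0.0, 2.0]),
    "was": np.array([0.0, 1.0, 0.0]),
    "galileo": np.array([2.0, 5.0, -1.0]),
}

def pool(tokens, embeddings, dim, how="mean"):
    """Combine the word vectors of one sentence into a single vector."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        # No known words: zero vector of the right output size
        return np.zeros(2 * dim if how == "mean_max" else dim)
    stacked = np.vstack(vectors)
    if how == "mean":
        return stacked.mean(axis=0)
    if how == "max":
        return stacked.max(axis=0)
    if how == "mean_max":
        # Hypothetical third strategy: keep both summaries side by side
        return np.concatenate([stacked.mean(axis=0), stacked.max(axis=0)])
    raise ValueError(f"unknown pooling strategy: {how}")

tokens = "who was galileo".split()
for how in ("mean", "max", "mean_max"):
    print(how, pool(tokens, toy_embeddings, 3, how))
```

Mean pooling smooths over all words, max pooling keeps the strongest value per dimension, and the concatenation doubles the feature size in exchange for keeping both views.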


In [77]:
y_train= train_df['coarse_label']
y_test = test_df['coarse_label']

In [78]:
vectorization_methods = ['BoW', 'TF-IDF','CBoW', 'GloVe', 'FastText']
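
The first two entries differ only in how terms are weighted. The sketch below works through TF-IDF by hand on three toy questions, assuming the smoothed formula that `TfidfVectorizer` uses by default, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation of each row. It takes pre-tokenized input, so it skips the vectorizer's own tokenization step.

```python
import math
from collections import Counter

# Toy, pre-tokenized questions (invented for illustration).
docs = [
    ["what", "is", "an", "atom"],
    ["who", "was", "galileo"],
    ["what", "is", "the", "sun"],
]

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for doc in docs:
        counts = Counter(doc)
        weights = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows

rows = tfidf(docs)
for term, weight in rows[0].items():
    print(f"{term:>6}: {weight:.3f}")
```

In the first question, "atom" appears in one document while "what" appears in two, so "atom" ends up with the larger weight; plain BoW would give both a count of 1.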

In [79]:
# Evaluation
def evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    return acc
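
`evaluate` returns accuracy only. Since class 0 has just 86 training examples, accuracy can look fine even if that class is never predicted. A minimal sketch of a per-class check, using made-up labels and a hypothetical helper `per_class_recall`; the already imported `classification_report` reports the same per-class recall along with precision and F1.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in sorted(totals)}

# Made-up labels: class 0 is rare and always missed.
y_true = [1, 1, 1, 1, 3, 3, 3, 3, 0, 0]
y_pred = [1, 1, 1, 1, 3, 3, 3, 3, 1, 3]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = per_class_recall(y_true, y_pred)
macro_recall = sum(recall.values()) / len(recall)

print(accuracy)  # 0.8
print(recall)    # {0: 0.0, 1: 1.0, 3: 1.0}
```

Here accuracy is 0.8 while the macro-averaged recall is only about 0.67, because the rare class is missed entirely.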

In [80]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC()
}

In [81]:
preprocessing_methods = ['lemma', 'stem','none']

## Baseline Model

In [82]:
#Baseline model: Logistic Regression with BoW
baseline_results = []

for preprocessing in preprocessing_methods:
    print(f"\nPreprocessing: {preprocessing}")
    if preprocessing == 'lemma':
        train_texts = train_df['lemmatized_text']
        test_texts = test_df['lemmatized_text']
    elif preprocessing == 'stem':
        train_texts = train_df['stemmed_text']
        test_texts = test_df['stemmed_text']
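    else:
        # 'none' uses the raw question text, so preprocess() (including its
        # stop-word removal) is skipped entirely for this variant
        train_texts = train_df['text']
        test_texts = test_df['text']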

    X_train, X_test = vectorize('BoW', train_texts, test_texts)

    model = LogisticRegression(max_iter=1000)
    acc = evaluate(model, X_train, y_train, X_test, y_test)

    baseline_results.append({
        'Preprocessing': preprocessing,
        'Vectorizer': 'BoW',
        'Model': 'Logistic Regression',
        'Accuracy': acc,
        'Combination': "Mean"
    })



Preprocessing: lemma

Preprocessing: stem

Preprocessing: none


**Results of baseline model**

In [83]:
baseline = pd.DataFrame(baseline_results)
print("\nBaseline Tests:")
print(baseline.sort_values(by='Accuracy', ascending=False).reset_index(drop=True))


Baseline Tests:
  Preprocessing Vectorizer                Model  Accuracy Combination
0          none        BoW  Logistic Regression     0.852        Mean
1         lemma        BoW  Logistic Regression     0.756        Mean
2          stem        BoW  Logistic Regression     0.754        Mean


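### Since no preprocessing works best, continuing with the raw text going forward

Note that `none` here means the raw `text` column, so this comparison also drops the stop-word removal that the `lemma` and `stem` variants apply.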
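
One plausible reason for the gap: in question classification the question word itself is often the strongest cue, and stop-word removal deletes it. A small sketch using three questions from the test set and a hand-picked subset of stop words (assumed here for illustration; NLTK's English list is longer but contains all of these):

```python
# Hand-picked subset of English stop words, assumed for illustration.
STOP_WORDS = {"what", "who", "when", "where", "how", "is", "was", "an", "did", "a"}

def strip_stop_words(question):
    """Lowercase, keep alphabetic tokens, and drop the stop words."""
    tokens = [t for t in question.lower().split() if t.isalpha()]
    return " ".join(t for t in tokens if t not in STOP_WORDS)

questions = ["Who was Galileo ?", "What is an atom ?", "When did Hawaii become a state ?"]
for q in questions:
    print(repr(strip_stop_words(q)))
# 'galileo'
# 'atom'
# 'hawaii become state'
```

After filtering, nothing distinguishes a "who" question from a "what" or "when" question except the remaining content words.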

**Training and evaluating models with max pooling and no preprocessing** (pooling only affects the CBoW, GloVe, and FastText runs; BoW and TF-IDF vectors are used as they are)

In [104]:
results = []

for vec_method in vectorization_methods:
    print(f"\nVectorization: {vec_method}")
    X_train, X_test = vectorize(vec_method, train_df['text'], test_df['text'],type='max_pooling')
    for model_name, model in models.items():
        print(f"\nModel: {model_name}")
        acc = evaluate(model, X_train, y_train, X_test, y_test)
        results.append({
            'Preprocessing': 'None',
            'Vectorizer': vec_method,
            'Model': model_name,
            'Accuracy': acc,
            'Combination': "Max Pooling"
        })


Vectorization: BoW

Model: Logistic Regression

Model: Decision Tree

Model: Random Forest

Model: SVM

Vectorization: TF-IDF

Model: Logistic Regression

Model: Decision Tree

Model: Random Forest

Model: SVM

Vectorization: CBoW


100%|██████████| 5452/5452 [00:00<00:00, 39018.58it/s]
100%|██████████| 500/500 [00:00<00:00, 40332.95it/s]


Model: Logistic Regression






Model: Decision Tree

Model: Random Forest

Model: SVM

Vectorization: GloVe


100%|██████████| 5452/5452 [00:00<00:00, 38703.57it/s]
100%|██████████| 500/500 [00:00<00:00, 39091.69it/s]


Model: Logistic Regression






Model: Decision Tree

Model: Random Forest

Model: SVM

Vectorization: FastText


100%|██████████| 5452/5452 [00:00<00:00, 36009.98it/s]
100%|██████████| 500/500 [00:00<00:00, 44282.95it/s]



Model: Logistic Regression

Model: Decision Tree

Model: Random Forest

Model: SVM


### Using LSTM combining method

In [85]:
#!pip install tensorflow

In [86]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Masking
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import accuracy_score
from tensorflow.keras.preprocessing.text import Tokenizer

In [87]:
train_df['clean_text'] = train_df['text'].apply(lambda x: preprocess(x, method='none'))
test_df['clean_text'] = test_df['text'].apply(lambda x: preprocess(x, method='none'))

In [None]:
def texts_to_sequences(texts, embeddings, dim):
    sequences = []
    for sentence in texts:
        tokens = sentence.split()
        vecs = [embeddings[w] if w in embeddings else np.zeros(dim) for w in tokens]
        sequences.append(vecs)
    return sequences
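
`texts_to_sequences` maps unknown words to all-zero vectors, and the LSTM cell further down pads with zeros and uses `Masking(mask_value=0.0)`. A masking layer of that kind skips any timestep whose features all equal the mask value, so unknown words are skipped in the same way as padding. The NumPy sketch below mimics that behaviour on toy data; `pad_post` is a hypothetical stand-in for `pad_sequences(..., padding='post', truncating='post')`, not the Keras implementation.

```python
import numpy as np

def pad_post(sequences, maxlen, dim):
    """Zero-pad (or truncate) each sequence of word vectors at the end."""
    batch = np.zeros((len(sequences), maxlen, dim), dtype="float32")
    for i, seq in enumerate(sequences):
        seq = seq[:maxlen]
        if len(seq):
            batch[i, :len(seq)] = np.asarray(seq, dtype="float32")
    return batch

dim = 2
sequences = [
    [np.array([1.0, 2.0]), np.zeros(dim), np.array([3.0, 4.0])],  # middle word is out-of-vocabulary
    [np.array([5.0, 6.0])],                                        # short sentence, needs padding
]

batch = pad_post(sequences, maxlen=3, dim=dim)
# A timestep counts as masked when every feature equals the mask value (0.0).
mask = np.any(batch != 0.0, axis=-1)

print(batch.shape)  # (2, 3, 2)
print(mask)
```

In the first row the middle position is `False` even though it was a real word, which is worth keeping in mind when many tokens are missing from the embedding vocabulary.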

In [89]:
embeddings = {
    'Word2Vec': cbow_embeddings,
    'GloVe': glove_embeddings,
    'FastText': fasttext_embeddings
}


**Converting Labels to Categorical for LSTM**

In [90]:
y_train_cat = to_categorical(y_train, 6)
y_test_cat = to_categorical(y_test, 6)
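
For reference, one-hot encoding of integer labels can be sketched in plain NumPy as below (toy labels, 6 classes as in TREC). As far as I know, Keras also offers the `sparse_categorical_crossentropy` loss, which takes integer labels directly, so this conversion is one option rather than a requirement.

```python
import numpy as np

labels = np.array([2, 1, 0, 5])  # toy coarse labels
num_classes = 6

# Row i of the identity matrix is the one-hot vector for class i.
one_hot = np.eye(num_classes, dtype="float32")[labels]
# argmax along the class axis recovers the integer labels.
recovered = one_hot.argmax(axis=1)

print(one_hot.shape)  # (4, 6)
print(recovered)      # [2 1 0 5]
```

The same `argmax` step is what the evaluation code uses to turn softmax outputs back into class indices.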

**Defining different dimensions for different embeddings**

In [91]:
embedding_dims = {
    'Word2Vec': 300,
    'GloVe': 100,
    'FastText': 300
}

**Training and evaluating models using different embeddings**

In [106]:
for name, embedding in embeddings.items():
    print(f"\nUsing embedding: {name}")
    dim = embedding_dims[name]
    train_seqs = texts_to_sequences(train_df['clean_text'], embedding, dim=dim)
    test_seqs = texts_to_sequences(test_df['clean_text'], embedding, dim=dim)

    maxlen = max(max(len(s) for s in train_seqs), max(len(s) for s in test_seqs))
    X_train_pad = pad_sequences(train_seqs, maxlen=maxlen, dtype='float32', padding='post', truncating='post')
    X_test_pad = pad_sequences(test_seqs, maxlen=maxlen, dtype='float32', padding='post', truncating='post')

    # Named `inputs` to avoid shadowing Python's built-in input()
    inputs = Input(shape=(maxlen, dim))
    x = Masking(mask_value=0.0)(inputs)
    x = LSTM(128)(x)
    output = Dense(6, activation='softmax')(x)

    model = Model(inputs=inputs, outputs=output)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train_pad, y_train_cat, epochs=5, batch_size=32, validation_split=0.2)
    y_pred = model.predict(X_test_pad)
    acc = accuracy_score(np.argmax(y_test_cat, axis=1), np.argmax(y_pred, axis=1))
    results.append({
            'Preprocessing': 'None',
            'Vectorizer': name,
            'Model': 'LSTM',
            'Accuracy': acc,
            'Combination': " "
        })


Using embedding: Word2Vec
Epoch 1/5
137/137 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.4188 - loss: 1.4766 - val_accuracy: 0.6856 - val_loss: 0.9532
Epoch 2/5
137/137 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.6973 - loss: 0.8767 - val_accuracy: 0.7021 - val_loss: 0.8266
Epoch 3/5
137/137 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.7565 - loss: 0.7152 - val_accuracy: 0.7214 - val_loss: 0.7744
Epoch 4/5
137/137 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.8065 - loss: 0.5784 - val_accuracy: 0.7479 - val_loss: 0.7578
Epoch 5/5
137/137 ━━━━━━━━━━━━━━━━━━━━ 2s 13ms/step - accuracy: 0.8235 - loss: 0.5272 - val_accuracy: 0.7479 - val_loss: 0.7573
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step

Using embedding: GloVe
Epoch 1/5
137/137 ━━━━━━━━━━━━━━━━━━━━

### Using Doc2Vec Combination

In [93]:
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize

# Preprocess and tag
train_tagged = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) 
                for i, doc in enumerate(train_df['clean_text'])]
test_tagged = [TaggedDocument(words=word_tokenize(doc.lower()), tags=[str(i)]) 
               for i, doc in enumerate(test_df['clean_text'])]


In [94]:
d2v_model = Doc2Vec(vector_size=100, window=5, min_count=2, workers=4, epochs=40)
d2v_model.build_vocab(train_tagged)
d2v_model.train(train_tagged, total_examples=d2v_model.corpus_count, epochs=d2v_model.epochs)


In [95]:
def get_doc_vectors(model, tagged_docs):
    return np.array([model.infer_vector(doc.words) for doc in tagged_docs])

X_train_d2v = get_doc_vectors(d2v_model, train_tagged)
X_test_d2v = get_doc_vectors(d2v_model, test_tagged)


**Models with Doc2Vec**

In [96]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC()
}

**Training and evaluating models**

In [108]:
for model_name, model in models.items():
    print(f"\nModel: {model_name}")
    acc = evaluate(model, X_train_d2v, y_train, X_test_d2v, y_test)
    results.append({
        'Preprocessing': 'None',
        # Label these rows explicitly; `vec_method` is a leftover from the earlier loop
        'Vectorizer': 'Doc2Vec',
        'Model': model_name,
        'Accuracy': acc,
        'Combination': "Doc2Vec"
    })


Model: Logistic Regression

Model: Decision Tree

Model: Random Forest

Model: SVM


## Results

In [109]:
results_df = pd.DataFrame(results)
print("\nSummary Table:")
print(results_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True))



Summary Table:
   Preprocessing Vectorizer                Model  Accuracy  Combination
0           None     TF-IDF                  SVM     0.864  Max Pooling
1           None        BoW  Logistic Regression     0.852  Max Pooling
2           None     TF-IDF  Logistic Regression     0.852  Max Pooling
3           None        BoW        Random Forest     0.836  Max Pooling
4           None        BoW                  SVM     0.836  Max Pooling
5           None       CBoW                  SVM     0.834  Max Pooling
6           None     TF-IDF        Random Forest     0.828  Max Pooling
7           None        BoW        Decision Tree     0.826  Max Pooling
8           None   FastText        Random Forest     0.800  Max Pooling
9           None       CBoW        Random Forest     0.788  Max Pooling
10          None   Word2Vec                 LSTM     0.770             
11          None   FastText  Logistic Regression     0.762  Max Pooling
12          None       CBoW  Logistic Regression

### Final Results:
Best Performing Combination
- Preprocessing: None
- Vectorizer: TF-IDF
- Model: SVM
- Accuracy: 0.864
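- Combination: not applicable (TF-IDF vectors are not pooled; the `Max Pooling` label in the table is just the tag set in the loop)

This suggests that simple vectorizers like TF-IDF, paired with strong classical models like SVM, can outperform more complex approaches such as LSTMs, especially on relatively small datasets like this one (5,452 training questions). Note that the LSTMs here were trained for only 5 epochs on stop-word-filtered text, so the comparison is not strictly like-for-like.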