# Assignment 4: Text Classification on TREC dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark text classification dataset. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), also google it for understanding it better.

This is what you have to do - use the concepts we have covered so far to accurately predict the 5 coarse labels (if you have googled TERC, you will surely know what I mean) in the test dataset. Train on the train dataset and give results on the test dataset, as simple as that. And experiment, experiment and experiment! 

Your experimentation should be 4-tiered-

i) Experiment with preprocessing techniques (different types of Stemming, Lemmatizing, or do neither and keep the words pure). Needless to say, certain things, like stopword removal, should be common in all the preprocesssing pipelines you come up with. Remember never do stemming and lemmatization together. Note - To find out the best preprocessing technique, use a simple baseline model, like say CountVectorizer(BoW) + Logistic Regression, and see which gives the best accuracy. Then proceed with that preprocessing technique only for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skipgram, GloVE, Fasttext, etc., but transformer models are not allowed) -- Atleast 5 different types

iii) Tinker with various strategies to combine the word vectors (taking mean, using RNN/LSTM, and the other strategies I hinted at in the end of the last sesion). Note that this is applicable only for the advanced embedding techniques which generate word embeddings. -- Atleast 3 different types, one of which should definitely be RNN/LSTM

iv) Finally, experiment with the ML classifier model, which will take the final vector respresentation of each TREC question and generate the label. E.g. - Logistic regression, decision trees, simple neural network, etc. - Atleast 4 different models

So applying some PnC, in total you should get more than 40 different combinations. Print out the accuracies of all these combinations nicely in a well-formatted table, and pronounce one of them the best. Also feel free to experiment with more models/embedding techniques than what I have said here, the goal is after all to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!

NOTE - While choosing the 4-5 types of each experimentation level, try to choose the best out of all those available. E.g. - For level (iii) - Tinker with various strategies to combine the word vectors - do not include 'mean' if you see it is giving horrendous results. Include the best 3-4 strategies.

### Helper Code to get you started

I have added some helper code to show you how to load the TERC dataset and use it.

In [7]:
!pip install -q datasets

In [9]:
from datasets import load_dataset

dataset = load_dataset("trec", trust_remote_code=True)
train_data = dataset['train']
test_data = dataset['test']

print("Sample Question:", train_data[0]['text'])
print("Label:", train_data[0]['coarse_label'])

Sample Question: How did serfdom develop in and then leave Russia ?
Label: 2


In [10]:
import pandas as pd
import numpy as np
import re
import nltk
import string
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
from datasets import load_dataset
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import gensim.downloader as api
from tqdm import tqdm


nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\nv909\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\nv909\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     C:\Users\nv909\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [11]:
train_df= train_data.to_pandas()
test_df = test_data.to_pandas()
train_df.drop(columns =['fine_label'],inplace = True)
test_df.drop(columns =['fine_label'],inplace = True)
train_df.head()

Unnamed: 0,text,coarse_label
0,How did serfdom develop in and then leave Russ...,2
1,What films featured the character Popeye Doyle ?,1
2,How can I find a list of celebrities ' real na...,2
3,What fowl grabs the spotlight after the Chines...,1
4,What is the full form of .com ?,0


In [12]:
test_df.head()

Unnamed: 0,text,coarse_label
0,How far is it from Denver to Aspen ?,5
1,"What county is Modesto , California in ?",4
2,Who was Galileo ?,3
3,What is an atom ?,2
4,When did Hawaii become a state ?,5


In [13]:
train_df['coarse_label'].value_counts()

coarse_label
1    1250
3    1223
2    1162
5     896
4     835
0      86
Name: count, dtype: int64

In [14]:
stop_words = set(stopwords.words('english'))
stemmer =PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [15]:
def preprocess(text, method ='none'):
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t not in stop_words]
    if method =='stem':
        tokens = [stemmer.stem(t) for t in tokens]
    elif method == 'lemma':
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    else:
        return ' '.join(tokens)

    return ' '.join(tokens)
    

In [5]:
import gensim.downloader as api
glove_embeddings = api.load("glove-wiki-gigaword-100")
cbow_embeddings = api.load("word2vec-google-news-300")
fasttext_embeddings = api.load("fasttext-wiki-news-subwords-300")



In [16]:

train_df['lemmatized_text'] = train_df['text'].apply(lambda x: preprocess(x, method = 'lemma'))
test_df['lemmatized_text'] = train_df['text'].apply(lambda x: preprocess(x, method = 'lemma'))
train_df['stemmed_text'] = train_df['text'].apply(lambda x: preprocess(x, method='stem'))
test_df['stemmed_text'] = test_df['text'].apply(lambda x: preprocess(x, method='stem'))

In [17]:
def embed_text_with_model(corpus, embeddings,type='mean'):
    embedded = []
    for sentence in corpus:
        words = sentence.split()
        vectors = [embeddings[word] for word in words if word in embeddings]
        if vectors:
            if type == 'mean':
                embedded.append(np.mean(vectors, axis=0))
            elif type == 'max_pooling':
                embedded.append(np.max(vectors, axis=0))
        else:
            embedded.append(np.zeros(embeddings.vector_size))
    return np.array(embedded)

In [18]:
def vectorize(method, train, test,type='mean'):
    if method == 'BoW':
        vectorizer = CountVectorizer()
        return vectorizer.fit_transform(train), vectorizer.transform(test)
    elif method == 'TF-IDF':
        vectorizer = TfidfVectorizer()
        return vectorizer.fit_transform(train), vectorizer.transform(test)
    elif method == 'CBoW':
        embeddings = cbow_embeddings
        return embed_text_with_model(train, embeddings,type), embed_text_with_model(test, embeddings,type)
    elif method == 'GloVe':
        embeddings = glove_embeddings
        return embed_text_with_model(train, embeddings,type), embed_text_with_model(test, embeddings,type)
    elif method == 'FastText':
        embeddings = fasttext_embeddings
        return embed_text_with_model(train, embeddings,type), embed_text_with_model(test, embeddings,type)
    else:
        return None, None

In [19]:
y_train= train_df['coarse_label']
y_test = test_df['coarse_label']
vectorization_methods = ['BoW', 'TF-IDF','CBoW', 'GloVe', 'FastText']
preprocessing_methods = ['lemma', 'stem','none']

In [20]:
def evaluate(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    acc = accuracy_score(y_test, pred)
    return acc

In [21]:
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(),
    'Random Forest': RandomForestClassifier(n_estimators=100),
    'SVM': SVC()
}

In [22]:
baseline_results = []

for preprocessing in preprocessing_methods:
    print(f"Preprocessing-{preprocessing}")
    if preprocessing == 'lemma':
        train_texts = train_df['lemmatized_text']
        test_texts = test_df['lemmatized_text']
    elif preprocessing == 'stem':
        train_texts = train_df['stemmed_text']
        test_texts = test_df['stemmed_text']
    else:
        train_texts = train_df['text']
        test_texts = test_df['text']

    X_train, X_test = vectorize('BoW', train_texts, test_texts)

    model = LogisticRegression(max_iter=1000)
    acc = evaluate(model, X_train, y_train, X_test, y_test)

    baseline_results.append({
        'Preprocessing': preprocessing,
        'Vectorizer': 'BoW',
        'Model': 'Logistic Regression',
        'Accuracy': acc,
        'Combination': "Mean"
    })


Preprocessing: lemma

Preprocessing: stem

Preprocessing: none


In [23]:
baseline = pd.DataFrame(baseline_results)
print("\nBaseline Tests:")
print(baseline.sort_values(by='Accuracy', ascending=False).reset_index(drop=True))


Baseline Tests:
  Preprocessing Vectorizer                Model  Accuracy Combination
0          none        BoW  Logistic Regression     0.852        Mean
1          stem        BoW  Logistic Regression     0.754        Mean
2         lemma        BoW  Logistic Regression     0.208        Mean


In [37]:
results = []

for vec_method in vectorization_methods:
    print(f"\nVectorization-{vec_method}")
    X_train, X_test = vectorize(vec_method, train_df['text'], test_df['text'],type='max_pooling')
    for model_name, model in models.items():
        print(f"Model-{model_name}")
        acc = evaluate(model, X_train, y_train, X_test, y_test)
        results.append({
            'Preprocessing': 'None',
            'Vectorizer': vec_method,
            'Model': model_name,
            'Accuracy': acc,
            'Combination': "Max Pooling"
        })


Vectorization-BoW
Model-Logistic Regression
Model-Decision Tree
Model-Random Forest
Model-SVM

Vectorization-TF-IDF
Model-Logistic Regression
Model-Decision Tree
Model-Random Forest
Model-SVM

Vectorization-CBoW
Model-Logistic Regression
Model-Decision Tree
Model-Random Forest
Model-SVM

Vectorization-GloVe
Model-Logistic Regression
Model-Decision Tree
Model-Random Forest
Model-SVM

Vectorization-FastText
Model-Logistic Regression
Model-Decision Tree
Model-Random Forest
Model-SVM


In [25]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Dense, Masking
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import accuracy_score
from tensorflow.keras.preprocessing.text import Tokenizer

In [26]:
train_df['clean_text'] = train_df['text'].apply(lambda x: preprocess(x, method='none'))
test_df['clean_text'] = test_df['text'].apply(lambda x: preprocess(x, method='none'))

In [27]:
def texts_to_sequences(texts, embeddings, dim):
    sequences = []
    for sentence in texts:
        tokens = sentence.split()
        vecs = [embeddings[w] if w in embeddings else np.zeros(dim) for w in tokens]
        sequences.append(vecs)
    return sequences

In [31]:
embeddings = {
    'Word2Vec': cbow_embeddings,
    'GloVe': glove_embeddings,
    'FastText': fasttext_embeddings
}
embedding_dims = {
    'Word2Vec': 300,
    'GloVe': 100,
    'FastText': 300
}

In [32]:
y_train_cat = to_categorical(y_train, 6)
y_test_cat = to_categorical(y_test, 6)

In [33]:
for name, embedding in embeddings.items():
    print(f"\nUsing embedding: {name}")
    dim = embedding_dims[name]
    train_seqs = texts_to_sequences(train_df['clean_text'], embedding, dim=dim)
    test_seqs = texts_to_sequences(test_df['clean_text'], embedding, dim=dim)

    maxlen = max(max(len(s) for s in train_seqs), max(len(s) for s in test_seqs))
    X_train_pad = pad_sequences(train_seqs, maxlen=maxlen, dtype='float32', padding='post', truncating='post')
    X_test_pad = pad_sequences(test_seqs, maxlen=maxlen, dtype='float32', padding='post', truncating='post')

    input = Input(shape=(maxlen, dim))
    x = Masking(mask_value=0.0)(input)
    x = LSTM(128)(x) 
    output = Dense(6, activation='softmax')(x)

    model = Model(inputs=input, outputs=output)
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    model.fit(X_train_pad, y_train_cat, epochs=5, batch_size=32, validation_split=0.2)
    y_pred = model.predict(X_test_pad)
    acc = accuracy_score(np.argmax(y_test_cat, axis=1), np.argmax(y_pred, axis=1))
    results.append({
            'Preprocessing': 'None',
            'Vectorizer': name,
            'Model': 'LSTM',
            'Accuracy': acc,
            'Combination': " "
        })


Using embedding: Word2Vec
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Using embedding: GloVe
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

Using embedding: FastText
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


In [38]:
results_df = pd.DataFrame(results)
print(results_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True))

   Preprocessing Vectorizer                Model  Accuracy  Combination
0           None     TF-IDF                  SVM     0.864  Max Pooling
1           None        BoW  Logistic Regression     0.852  Max Pooling
2           None     TF-IDF  Logistic Regression     0.852  Max Pooling
3           None     TF-IDF        Random Forest     0.838  Max Pooling
4           None        BoW        Random Forest     0.836  Max Pooling
5           None        BoW                  SVM     0.836  Max Pooling
6           None       CBoW                  SVM     0.834  Max Pooling
7           None        BoW        Decision Tree     0.828  Max Pooling
8           None       CBoW        Random Forest     0.792  Max Pooling
9           None   FastText        Random Forest     0.786  Max Pooling
10          None     TF-IDF        Decision Tree     0.766  Max Pooling
11          None   FastText  Logistic Regression     0.760  Max Pooling
12          None       CBoW  Logistic Regression     0.756  Max 