# Assignment 4: Text Classification on TREC dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark text classification dataset. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), and search for it online to understand it better.

This is what you have to do: use the concepts we have covered so far to accurately predict the 6 coarse labels (if you have googled TREC, you will surely know what I mean) in the test dataset. Train on the train dataset and report results on the test dataset, as simple as that. And experiment, experiment and experiment!

Your experimentation should be 4-tiered:

i) Experiment with preprocessing techniques (different types of stemming, lemmatization, or neither, keeping the words as they are). Needless to say, certain steps, like stopword removal, should be common to all the preprocessing pipelines you come up with. Remember: never apply stemming and lemmatization together. Note: to find the best preprocessing technique, use a simple baseline model, say CountVectorizer (BoW) + Logistic Regression, and see which gives the best accuracy. Then proceed with that preprocessing technique only for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skip-gram, GloVe, fastText, etc.; transformer models are not allowed). At least 5 different types.

iii) Tinker with various strategies to combine the word vectors (taking the mean, using an RNN/LSTM, and the other strategies I hinted at at the end of the last session). Note that this applies only to the advanced embedding techniques that generate word embeddings. At least 3 different types, one of which must be RNN/LSTM.

iv) Finally, experiment with the ML classifier, which takes the final vector representation of each TREC question and generates the label, e.g. logistic regression, decision trees, a simple neural network, etc. At least 4 different models.
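
The pooling step in tier (iii) can be sketched with NumPy alone. This is a minimal sketch, assuming a toy 3-dimensional embedding table (`toy_vectors`) and made-up IDF weights (`toy_idf`); both are hypothetical stand-ins for a real pretrained model and a fitted TF-IDF vectorizer. It shows three order-insensitive ways to turn a variable-length list of word vectors into one fixed-length vector. Max pooling is included only as one further common option, not necessarily one of the strategies hinted at in the session, and an RNN/LSTM would replace all three with a learned, order-aware encoder.

```python
import numpy as np

# Hypothetical toy embeddings (3 dimensions) standing in for a pretrained model.
toy_vectors = {
    "what": np.array([1.0, 0.0, 2.0]),
    "currency": np.array([3.0, 4.0, 0.0]),
    "china": np.array([2.0, 2.0, 1.0]),
}
# Hypothetical IDF weights; in practice these come from a fitted TF-IDF model.
toy_idf = {"what": 1.0, "currency": 3.0, "china": 2.0}
DIM = 3


def lookup(tokens, vectors):
    """Stack the vectors of in-vocabulary tokens into an (n_tokens, DIM) array."""
    known = [vectors[t] for t in tokens if t in vectors]
    return np.array(known).reshape(-1, DIM)


def mean_pool(matrix):
    """Average over tokens; fall back to zeros when no token was found."""
    return matrix.mean(axis=0) if matrix.shape[0] else np.zeros(DIM)


def max_pool(matrix):
    """Take the element-wise maximum over tokens."""
    return matrix.max(axis=0) if matrix.shape[0] else np.zeros(DIM)


def idf_weighted_pool(tokens, vectors, idf):
    """Average the word vectors, weighting each word by its IDF score."""
    known = [t for t in tokens if t in vectors and t in idf]
    if not known:
        return np.zeros(DIM)
    matrix = np.array([vectors[t] for t in known])
    weights = np.array([idf[t] for t in known])
    return weights @ matrix / weights.sum()


tokens = ["what", "type", "currency", "use", "china"]
matrix = lookup(tokens, toy_vectors)
mean_vec = mean_pool(matrix)
max_vec = max_pool(matrix)
idf_vec = idf_weighted_pool(tokens, toy_vectors, toy_idf)
print(mean_vec, max_vec, idf_vec)
```

All three functions return a vector of the same length regardless of how many tokens the question has, which is what a downstream classifier needs.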

So, applying some permutations and combinations (PnC), you should get more than 40 different combinations in total. Print the accuracies of all these combinations in a well-formatted table, and pronounce one of them the best. Also feel free to experiment with more models/embedding techniques than I have listed here; the goal is, after all, to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!
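
The combination count can be checked by enumerating the grid explicitly. This is a minimal sketch that assumes the option names used later in this notebook (two statistical vectorizers, three pretrained embeddings, three pooling strategies, four classifiers); other choices would give a different total. The accuracy column is left out because it has to come from actual runs.

```python
from itertools import product

stat_vectorizers = ["Count", "Tfidf"]        # produce one vector per question directly
embeddings = ["W2V", "Glove", "Fasttext"]    # produce one vector per word
poolings = ["Mean", "Tfidf_avg", "LSTM_encoder"]
models = ["LR", "SVM", "DT", "NN"]

# Statistical vectorizers pair directly with a classifier (no pooling step).
combos = [(vec, "-", model) for vec, model in product(stat_vectorizers, models)]
# Word-embedding methods additionally need a pooling strategy.
combos += list(product(embeddings, poolings, models))

# Print a small preview with aligned columns, then the total.
print(f"{'Vectorizer':<12}{'Pooling':<14}{'Model':<6}")
for vec, pool, model in combos[:3]:
    print(f"{vec:<12}{pool:<14}{model:<6}")
print(f"... {len(combos)} combinations in total")
```

With these choices the grid has 2 × 4 + 3 × 3 × 4 = 44 entries, which clears the 40-combination target.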

NOTE: While choosing the 4-5 options for each experimentation level, try to choose the best of those available. E.g. for level (iii), combining the word vectors, do not include 'mean' if you see it giving horrendous results. Include the best 3-4 strategies.

### Helper Code to get you started

I have added some helper code to show you how to load the TREC dataset and use it.

In [2]:
!pip install -q datasets

from datasets import load_dataset

dataset = load_dataset("trec", trust_remote_code=True)
train_data = dataset['train']
test_data = dataset['test']

print("Sample Question:", train_data[0]['text'])
print("Label:", train_data[0]['coarse_label'])


Sample Question: How did serfdom develop in and then leave Russia ?
Label: 2


### Working with the dataset

In [3]:
import pandas as pd
df_train = pd.DataFrame({
    'text': train_data['text'],
    'label': train_data['coarse_label']
})


In [4]:
df_train

| | text | label |
|---|---|---|
| 0 | How did serfdom develop in and then leave Russ... | 2 |
| 1 | What films featured the character Popeye Doyle ? | 1 |
| 2 | How can I find a list of celebrities ' real na... | 2 |
| 3 | What fowl grabs the spotlight after the Chines... | 1 |
| 4 | What is the full form of .com ? | 0 |
| ... | ... | ... |
| 5447 | What 's the shape of a camel 's spine ? | 1 |
| 5448 | What type of currency is used in China ? | 1 |
| 5449 | What is the temperature today ? | 5 |
| 5450 | What is the temperature for cooking ? | 5 |


In [5]:
df_test = pd.DataFrame({
    'text': test_data['text'],
    'label': test_data['coarse_label']
})
df_test

| | text | label |
|---|---|---|
| 0 | How far is it from Denver to Aspen ? | 5 |
| 1 | What county is Modesto , California in ? | 4 |
| 2 | Who was Galileo ? | 3 |
| 3 | What is an atom ? | 2 |
| 4 | When did Hawaii become a state ? | 5 |
| ... | ... | ... |
| 495 | Who was the 22nd President of the US ? | 3 |
| 496 | What is the money they use in Zambia ? | 1 |
| 497 | How many feet in a mile ? | 5 |
| 498 | What is the birthstone of October ? | 1 |


### Preprocessing pipeline

In [6]:
from nltk.tokenize import word_tokenize 
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.corpus import wordnet, stopwords
from nltk import pos_tag
import re

In [7]:
def remove_special_char(string):
    string = re.sub(r'[^A-Za-z0-9\s]', '', string)
    return string

stop_words = set(stopwords.words('english'))
important_words = {'what', 'when', 'where', 'how', 'why', 'who', 'which', 'whom'}
stop_words = stop_words - important_words
def remove_sw (string):
    string = [word for word in string if word not in stop_words]
    return string

lemma = WordNetLemmatizer()
def pos_tags(treebank_tag):
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return wordnet.NOUN

stemmer = PorterStemmer()
def stem_tokens(token_list):
    stemmed_tokens = [stemmer.stem(token) for token in token_list]
    return stemmed_tokens

In [8]:
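def stem_preprocess(string):
    """Lowercase, tokenize and stem a question; return a space-joined string."""
    string = string.lower()
    string = word_tokenize(string)
    string = stem_tokens(string)
    # Drop tokens that become empty once special characters are removed (e.g. '?').
    string = [remove_special_char(token) for token in string if remove_special_char(token)]
    # Stop words are filtered after stemming, so only stems that still match a stop word are removed.
    string = remove_sw(string)
    return ' '.join(string)


def lemma_preprocess(string):
    """Lowercase, tokenize and lemmatize a question; return a list of tokens."""
    string = string.lower()
    string = word_tokenize(string)
    # POS tags let the WordNet lemmatizer pick the right lemma for each word.
    tags = pos_tag(string)
    string = [lemma.lemmatize(word, pos_tags(pos)) for word, pos in tags]
    string = [remove_special_char(token) for token in string if remove_special_char(token)]
    string = remove_sw(string)
    return string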

I have implemented two distinct preprocessing pipelines to suit different types of text vectorization techniques:
- `stem_preprocess` : Designed for purely statistical vectorization methods such as Count Vectorizer and TF-IDF Vectorizer.
- `lemma_preprocess` : Designed for vectorization methods that capture semantic relationships between words, including Word2Vec models, GloVe embeddings, and similar techniques.
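
The two pipelines also differ in return type: `stem_preprocess` joins its tokens into one string, while `lemma_preprocess` returns a token list. The sketch below illustrates why that split is convenient, using only the standard library. `bag_of_words` and `embed_tokens` are hypothetical helpers, and `toy_vectors` is a made-up two-dimensional embedding table; none of them are part of this notebook.

```python
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated terms, as a count-based vectorizer would."""
    return Counter(text.split())


def embed_tokens(tokens, vectors):
    """Look up one vector per token, skipping out-of-vocabulary tokens.

    (The notebook's own vector_conv substitutes zeros instead of skipping.)
    """
    return [vectors[t] for t in tokens if t in vectors]


# Values copied from the first training row shown in this notebook.
stemmed_text = "how serfdom develop leav russia"
lemma_tokens = ["how", "serfdom", "develop", "leave", "russia"]

# Made-up 2-d embeddings; a real pretrained model would hold 300-d vectors.
toy_vectors = {"how": [0.1, 0.2], "leave": [0.3, 0.4], "russia": [0.5, 0.6]}

counts = bag_of_words(stemmed_text)
word_vectors = embed_tokens(lemma_tokens, toy_vectors)
stem_hits = embed_tokens(stemmed_text.split(), toy_vectors)
print(counts["leav"], len(word_vectors), len(stem_hits))
```

A stem such as `leav` is harmless for counting, but it is not a real word and would most likely be missing from a pretrained embedding vocabulary, whereas the lemma `leave` can be looked up directly.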

### Preprocessing the dataset

In [9]:
import time
start_time = time.time()
df_train['stokens'] = df_train['text'].apply(stem_preprocess)
df_test['stokens'] = df_test['text'].apply(stem_preprocess)

df_train['ltokens'] = df_train['text'].apply(lemma_preprocess)
df_test['ltokens'] = df_test['text'].apply(lemma_preprocess)

end_time = time.time()
print(f"Time taken: {end_time - start_time:.2f} seconds")

Time taken: 20.23 seconds


In [10]:
df_train

| | text | label | stokens | ltokens |
|---|---|---|---|---|
| 0 | How did serfdom develop in and then leave Russ... | 2 | how serfdom develop leav russia | [how, serfdom, develop, leave, russia] |
| 1 | What films featured the character Popeye Doyle ? | 1 | what film featur charact popey doyl | [what, film, feature, character, popeye, doyle] |
| 2 | How can I find a list of celebrities ' real na... | 2 | how find list celebr real name | [how, find, list, celebrity, real, name] |
| 3 | What fowl grabs the spotlight after the Chines... | 1 | what fowl grab spotlight chines year monkey | [what, fowl, grab, spotlight, chinese, year, m... |
| 4 | What is the full form of .com ? | 0 | what full form com | [what, full, form, com] |
| ... | ... | ... | ... | ... |
| 5447 | What 's the shape of a camel 's spine ? | 1 | what shape camel spine | [what, shape, camel, spine] |
| 5448 | What type of currency is used in China ? | 1 | what type currenc use china | [what, type, currency, use, china] |
| 5449 | What is the temperature today ? | 5 | what temperatur today | [what, temperature, today] |
| 5450 | What is the temperature for cooking ? | 5 | what temperatur cook | [what, temperature, cooking] |


### Statistical vectorisation and model implementation

In [11]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.metrics import accuracy_score
import numpy as np

In [12]:
def vectorize(x_train, x_test, vectorizer):
    if vectorizer == 'Count':
        vec = CountVectorizer()
    elif vectorizer == 'Tfidf':
        vec = TfidfVectorizer()
    else:
        raise ValueError(f"Unknown vectorizer: {vectorizer}")

    # Fit the vocabulary on the training split only, then reuse it for the test split.
    x_train_vectorized = vec.fit_transform(x_train)
    x_test_vectorized = vec.transform(x_test)
    vocab_size = len(vec.vocabulary_)
    return x_train_vectorized, x_test_vectorized, vocab_size

In [13]:
def implement_model(x_train, y_train, x_test, y_test, model, vocab_size = 0):
    if model == 'LR':
        model = LogisticRegression()
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
    elif model == 'SVM':
        model = LinearSVC()
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
    elif model == 'DT':
        model = DecisionTreeClassifier()
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
    elif model == 'NN':
        model = keras.Sequential([
            layers.Input(shape=(vocab_size,)),
            layers.Dense(128, activation='relu'),
            layers.Dense(6, activation='softmax')  # one unit per coarse TREC label (0-5)
            ])
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        # Keras Dense layers expect dense arrays, so the sparse matrices are converted first.
        model.fit(x_train.toarray(), y_train, epochs=10, batch_size=32, verbose=0)
        y_pred_probs = model.predict(x_test.toarray(), verbose=0)
        y_pred = np.argmax(y_pred_probs, axis=1)
    else:
        raise ValueError(f"Unknown model: {model}")

    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

In [14]:
def result_log(x_train, y_train, x_test, y_test, vectorizers, models):
    results = []
    for vectorizer in vectorizers:
        x_train_vec, x_test_vec, vocab_size = vectorize(x_train, x_test, vectorizer)
        for model_type in models:
            accuracy = implement_model(x_train_vec, y_train, x_test_vec, y_test, model_type, vocab_size)
            results.append({
                'Vectorizer': vectorizer,
                'Model': model_type,
                'Accuracy': accuracy
            })
            
    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)
    return results_df


In [15]:
result_df_statistic = result_log(df_train['stokens'], df_train['label'], df_test['stokens'], df_test['label'], 
            vectorizers = ['Count', 'Tfidf'], models = ['LR', 'DT', 'SVM', 'NN'])

In [16]:
result_df_statistic

| | Vectorizer | Model | Accuracy |
|---|---|---|---|
| 0 | Tfidf | SVM | 0.862 |
| 1 | Count | SVM | 0.854 |
| 2 | Count | LR | 0.852 |
| 3 | Tfidf | NN | 0.844 |
| 4 | Count | NN | 0.84 |
| 5 | Tfidf | LR | 0.836 |
| 6 | Tfidf | DT | 0.818 |
| 7 | Count | DT | 0.816 |


## Semantic vectorization and model implementation

In [17]:
from gensim.models import KeyedVectors
from nltk.tokenize import word_tokenize
import gensim.downloader as api

In [18]:
path = api.load("word2vec-google-news-300", return_path=True)
word2vec_model = KeyedVectors.load_word2vec_format(path, binary=True)
print(path)

C:\Users\MANTHAN KHETADE/gensim-data\word2vec-google-news-300\word2vec-google-news-300.gz


In [19]:
fasttext_model = api.load("fasttext-wiki-news-subwords-300")
print(fasttext_model['king'])

[-1.2063e-01  5.1695e-03 -1.2447e-02 -7.8528e-03 -2.3738e-02 -8.2595e-02
  4.5790e-02 -1.5382e-01  6.4550e-02  1.2893e-01  2.7643e-02  1.5958e-02
  7.7559e-02  6.0516e-02  1.2737e-01  8.4766e-02  6.3890e-02 -1.7687e-01
  4.3017e-02 -1.8031e-02 -3.3041e-02  2.1930e-02 -1.1328e-02  6.6453e-02
  1.5826e-01 -2.3008e-02 -4.3616e-03 -2.2379e-02  4.4891e-02  3.0103e-03
 -1.5565e-02 -7.6785e-02 -9.2186e-02  5.7907e-02 -2.7658e-02  5.4500e-03
  1.8975e-02  4.2939e-02  3.4704e-03  4.0449e-02 -4.0245e-03 -1.1594e-01
 -5.8337e-03  3.2509e-02 -8.6535e-02  7.2000e-02 -2.2299e-02  1.3079e-02
 -3.9515e-02  6.8996e-02  9.2300e-02 -7.5371e-02  5.9412e-03 -3.4945e-02
 -3.3417e-02 -9.9982e-02  1.6438e-02  6.3739e-02 -6.2391e-02  7.8285e-04
 -2.9210e-02 -9.6416e-02  7.2910e-02  4.5905e-02 -8.3387e-02  7.1969e-02
  4.0932e-02 -5.6454e-03  1.3709e-01 -1.1793e-01 -7.1011e-02 -7.1963e-02
  6.5600e-02 -4.6315e-02 -1.7200e-02  3.4434e-02  4.4218e-02 -9.6354e-03
 -6.8105e-02  3.0810e-02  1.5424e-02  5.6398e-02  4

In [20]:
def load_glove_embeddings(file_path):
    # Each line of the GloVe text file holds a word followed by its vector components.
    embeddings_index = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vector = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = vector
    return embeddings_index


glove_file = r"C:\Users\MANTHAN KHETADE\Desktop\glove\glove.6B\glove.6B.300d.txt"
glove_model = load_glove_embeddings(glove_file)

In [21]:
def vector_conv(string, model, embedding_dim=300):
    vectors = []
    for word in string:
        if word in model:
            vectors.append(model[word])
        else:
            vectors.append(np.zeros(embedding_dim))
    return vectors       

In [22]:
def adv_vectorize(x_train, x_test, vectorizer):
    if vectorizer == 'W2V':
        model = word2vec_model
    elif vectorizer == 'Glove':
        model = glove_model
    elif vectorizer == 'Fasttext':
        model = fasttext_model
    x_train_vectorized = x_train.apply(lambda x: vector_conv(x, model))
    x_test_vectorized = x_test.apply(lambda x: vector_conv(x, model))
    return x_train_vectorized, x_test_vectorized

In [23]:
def mean_pooled(vectors):
    if len(vectors) == 0:
        return np.zeros(300)
    return np.mean(vectors, axis=0)

In [24]:
tfidf = TfidfVectorizer()
corpus = [' '.join(tokens) for tokens in df_train['ltokens']]
tfidf.fit(corpus)
tfidf_dict = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

def tfidf_weighted_pooled(tokens, idf_dict, model):
    vectors = []
    weights = []
    for word in tokens:
        if word in model and word in idf_dict:
            vectors.append(model[word])
            weights.append(idf_dict[word])
    if len(vectors) == 0:
        return np.zeros(300)
    vectors = np.array(vectors)
    weights = np.array(weights).reshape(-1, 1)
    return np.sum(vectors * weights, axis=0) / np.sum(weights)
    

In [25]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Masking, Dense

In [30]:
def get_lstm_pooled_vectors(x_train_vec, x_test_vec, y_train, embedding_dim=300, max_sequence_length=30):
    """Train an LSTM classifier and return its final hidden states as sentence vectors."""
    # Pad or truncate every question to max_sequence_length word vectors (zeros as padding).
    x_train_padded = pad_sequences(x_train_vec.tolist(), maxlen=max_sequence_length, dtype='float32', padding='post', truncating='post', value=0.0)
    x_test_padded = pad_sequences(x_test_vec.tolist(), maxlen=max_sequence_length, dtype='float32', padding='post', truncating='post', value=0.0)

    inputs = Input(shape=(max_sequence_length, embedding_dim))
    # Skip all-zero timesteps: padding, and out-of-vocabulary words that vector_conv maps to zeros.
    masked = Masking(mask_value=0.0)(inputs)
    lstm_out = LSTM(128)(masked)
    outputs = Dense(6, activation='softmax')(lstm_out)

    model = Model(inputs, outputs)
    model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.fit(x_train_padded, y_train, epochs=10, batch_size=32, validation_split=0.2, verbose=0)

    # Reuse the trained LSTM layer as an encoder: its 128-d output is the pooled representation.
    encoder_model = Model(inputs, lstm_out)
    x_train_encoded = encoder_model.predict(x_train_padded)
    x_test_encoded = encoder_model.predict(x_test_padded)

    return x_train_encoded, x_test_encoded

In [31]:
def implement_model_adv(x_train, y_train, x_test, y_test, model):
    input_dim = x_train.shape[1]
    if model == 'LR':
        model = LogisticRegression(max_iter=5000)
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
    elif model == 'SVM':
        model = LinearSVC()
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
    elif model == 'DT':
        model = DecisionTreeClassifier()
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
    elif model == 'NN':
        model = keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(256),
        layers.BatchNormalization(),
        layers.LeakyReLU(negative_slope=0.1),
        layers.Dropout(0.4),
        layers.Dense(128),
        layers.BatchNormalization(),
        layers.LeakyReLU(negative_slope=0.1),
        layers.Dropout(0.3),
        layers.Dense(64, activation='relu'),
        layers.Dense(6, activation='softmax')
            ])
        model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
        model.fit(x_train, y_train, epochs=10, batch_size=32, verbose=0)
        y_pred_probs = model.predict(x_test, verbose = 0)
        y_pred = np.argmax(y_pred_probs, axis=1)

    accuracy = accuracy_score(y_test, y_pred)
    return accuracy

In [36]:
def result_log1(x_train, y_train, x_test, y_test, vectorizers, pooling_types, models):
    results = []

    for vectorizer in vectorizers:
        x_train_vec, x_test_vec = adv_vectorize(x_train, x_test, vectorizer)
        
        for pooling_type in pooling_types:
            if pooling_type == 'Mean':
                x_train_pooled = np.vstack(x_train_vec.apply(mean_pooled).values)
                x_test_pooled = np.vstack(x_test_vec.apply(mean_pooled).values)

            elif pooling_type == 'Tfidf_avg':
                if vectorizer == 'W2V':
                    model = word2vec_model
                elif vectorizer == 'Glove':
                    model = glove_model
                elif vectorizer == 'Fasttext':
                    model = fasttext_model

                x_train_pooled = np.vstack(x_train.apply(lambda tokens: tfidf_weighted_pooled(tokens, tfidf_dict, model)).values)
                x_test_pooled = np.vstack(x_test.apply(lambda tokens: tfidf_weighted_pooled(tokens, tfidf_dict, model)).values)

            elif pooling_type == 'LSTM_encoder':
                x_train_pooled, x_test_pooled = get_lstm_pooled_vectors(x_train_vec, x_test_vec, y_train)

            for model_type in models:
                accuracy = implement_model_adv(x_train_pooled, y_train, x_test_pooled, y_test, model_type)
                results.append({
                    'Vectorizer': vectorizer,
                    'Pooling': pooling_type,
                    'Model': model_type,
                    'Accuracy': accuracy
                })

    results_df = pd.DataFrame(results)
    results_df = results_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)
    return results_df


In [37]:
result_df = result_log1(df_train['ltokens'], df_train['label'], df_test['ltokens'], df_test['label'],
                       vectorizers=['W2V', 'Glove', 'Fasttext'],
                       pooling_types=['Mean', 'Tfidf_avg', 'LSTM_encoder'],
                       models=['LR', 'SVM', 'DT', 'NN'])


171/171 ━━━━━━━━━━━━━━━━━━━━ 6s 33ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 35ms/step
171/171 ━━━━━━━━━━━━━━━━━━━━ 7s 37ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 36ms/step
171/171 ━━━━━━━━━━━━━━━━━━━━ 7s 38ms/step
16/16 ━━━━━━━━━━━━━━━━━━━━ 1s 38ms/step


In [38]:
result_df

| | Vectorizer | Pooling | Model | Accuracy |
|---|---|---|---|---|
| 0 | W2V | LSTM_encoder | LR | 0.9 |
| 1 | W2V | LSTM_encoder | SVM | 0.896 |
| 2 | Fasttext | LSTM_encoder | LR | 0.888 |
| 3 | Glove | LSTM_encoder | LR | 0.884 |
| 4 | W2V | LSTM_encoder | NN | 0.884 |
| 5 | Fasttext | LSTM_encoder | NN | 0.882 |
| 6 | Glove | LSTM_encoder | SVM | 0.882 |
| 7 | Fasttext | LSTM_encoder | SVM | 0.878 |
| 8 | Glove | LSTM_encoder | NN | 0.87 |
| 9 | W2V | Mean | NN | 0.844 |


## Final Report

In [39]:
result_df_statistic['Pooling'] = None
result_df_statistic = result_df_statistic[['Vectorizer', 'Pooling', 'Model', 'Accuracy']]
final_report_df = pd.concat([result_df_statistic, result_df], ignore_index=True)
final_report_df = final_report_df.sort_values(by='Accuracy', ascending=False).reset_index(drop=True)

In [40]:
final_report_df

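| | Vectorizer | Pooling | Model | Accuracy |
|---|---|---|---|---|
| 0 | W2V | LSTM_encoder | LR | 0.9 |
| 1 | W2V | LSTM_encoder | SVM | 0.896 |
| 2 | Fasttext | LSTM_encoder | LR | 0.888 |
| 3 | Glove | LSTM_encoder | LR | 0.884 |
| 4 | W2V | LSTM_encoder | NN | 0.884 |
| 5 | Glove | LSTM_encoder | SVM | 0.882 |
| 6 | Fasttext | LSTM_encoder | NN | 0.882 |
| 7 | Fasttext | LSTM_encoder | SVM | 0.878 |
| 8 | Glove | LSTM_encoder | NN | 0.87 |
| 9 | Tfidf | | SVM | 0.862 |

The best combination is Word2Vec embeddings with the LSTM encoder and Logistic Regression, which reaches a test accuracy of 0.9.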


## The End