# Assignment 4: Text Classification on TREC dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark text classification dataset. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), and also google it to understand it better.

This is what you have to do: use the concepts we have covered so far to accurately predict the 6 coarse labels (if you have googled TREC, you will surely know what I mean) in the test dataset. Train on the train dataset and report results on the test dataset, as simple as that. And experiment, experiment, and experiment!

Your experimentation should be 4-tiered:

i) Experiment with preprocessing techniques (different types of stemming, lemmatizing, or do neither and keep the words pure). Needless to say, certain things, like stopword removal, should be common to all the preprocessing pipelines you come up with. Remember: never do stemming and lemmatization together. Note - To find the best preprocessing technique, use a simple baseline model, say CountVectorizer (BoW) + Logistic Regression, and see which gives the best accuracy. Then proceed with only that preprocessing technique for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skipgram, GloVe, FastText, etc., but transformer models are not allowed) -- At least 5 different types

iii) Tinker with various strategies to combine the word vectors (taking the mean, using RNN/LSTM, and the other strategies I hinted at at the end of the last session). Note that this applies only to the advanced embedding techniques that generate word embeddings. -- At least 3 different types, one of which should definitely be RNN/LSTM

iv) Finally, experiment with the ML classifier model, which will take the final vector representation of each TREC question and generate the label. E.g. - Logistic regression, decision trees, simple neural network, etc. -- At least 4 different models

So applying some PnC, in total you should get more than 40 different combinations. Print out the accuracies of all these combinations nicely in a well-formatted table, and pronounce one of them the best. Also feel free to experiment with more models/embedding techniques than what I have said here, the goal is after all to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!
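As a rough check on that count, the sketch below enumerates combinations for one assumed set of choices (the option lists are illustrative, not prescribed by the assignment). It assumes a single preprocessing technique has already been fixed in tier (i), and that BoW and TF-IDF need no combining strategy, since tier (iii) applies only to word-embedding methods.

```python
from itertools import product

# Illustrative option lists (assumptions): choose your own for each tier.
sparse_vectorizers = ["BoW", "TF-IDF"]                   # one vector per question directly
embedding_vectorizers = ["CBoW", "Skipgram", "GloVe"]    # one vector per word
combiners = ["mean", "max", "LSTM"]                      # tier (iii), embeddings only
classifiers = ["LogReg", "DecisionTree", "SVM", "MLP"]   # tier (iv)

# Sparse vectorizers skip tier (iii); embeddings go through every combiner.
combos = [(v, "none", c) for v, c in product(sparse_vectorizers, classifiers)]
combos += list(product(embedding_vectorizers, combiners, classifiers))

print(len(combos))  # 2*4 + 3*3*4 = 44
```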

NOTE - While choosing the 4-5 types of each experimentation level, try to choose the best out of all those available. E.g. - For level (iii) - Tinker with various strategies to combine the word vectors - do not include 'mean' if you see it is giving horrendous results. Include the best 3-4 strategies.

### Helper Code to get you started

I have added some helper code to show you how to load the TREC dataset and use it.

In [2]:
!pip install -q datasets

from datasets import load_dataset

dataset = load_dataset("trec", trust_remote_code=True)
train_data = dataset['train']
test_data = dataset['test']

print("Sample Question:", train_data[0]['text'])
print("Label:", train_data[0]['coarse_label'])




Sample Question: How did serfdom develop in and then leave Russia ?
Label: 2


In [3]:
pip install nltk

Note: you may need to restart the kernel to use updated packages.


In [4]:
import nltk
from nltk.corpus import stopwords
import string

nltk.download('wordnet')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

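def basic_cleanup(text):
    # Lowercase so that "What" and "what" map to the same token
    text = text.lower()
    # Drop ASCII punctuation characters
    text = ''.join([ch for ch in text if ch not in string.punctuation])
    # Split on whitespace, then remove English stop words
    tokens = text.split()
    tokens = [word for word in tokens if word not in stop_words]
    return tokens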

[nltk_data] Downloading package wordnet to /home/manan-
[nltk_data]     jain/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to /home/manan-
[nltk_data]     jain/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
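To make the effect of `basic_cleanup` concrete, here is a minimal sketch that applies the same three steps to the sample question. So that it runs without downloading NLTK data, it uses a small hand-picked stop-word set (`demo_stop_words`, an assumption for illustration; the real NLTK list is much longer), and `demo_cleanup` is a hypothetical stand-in for the function above.

```python
import string

# Assumption: a tiny subset of NLTK's English stop-word list,
# hard-coded so the example needs no download.
demo_stop_words = {"how", "did", "in", "and", "then", "what", "who", "where", "the", "is"}

def demo_cleanup(text, stop_words):
    """Same steps as basic_cleanup, with the stop-word set passed in explicitly."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return [word for word in text.split() if word not in stop_words]

tokens = demo_cleanup("How did serfdom develop in and then leave Russia ?", demo_stop_words)
print(tokens)  # ['serfdom', 'develop', 'leave', 'russia']
```

Note that the question word itself ("how") is removed. NLTK's full English list also contains words such as "what", "who" and "where", which may be worth keeping in mind when comparing pipelines for question classification.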


In [9]:
import pandas as pd

train_df = pd.DataFrame(train_data)
test_df = pd.DataFrame(test_data)

x_train = train_df['text']
y_train = train_df['coarse_label']
x_test = test_df['text']
y_test = test_df['coarse_label']

In [10]:
def preprocess_raw(text):
    return " ".join(basic_cleanup(text))

In [11]:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()

def preprocess_stem(text):
    tokens = basic_cleanup(text)
    stemmed = [stemmer.stem(word) for word in tokens]
    return " ".join(stemmed)

In [12]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

def preprocess_lemma(text):
    tokens = basic_cleanup(text)
    lemmatized = [lemmatizer.lemmatize(word) for word in tokens]
    return " ".join(lemmatized)

In [13]:
!pip install scikit-learn



In [14]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [15]:
def train_and_evaluate(preprocess_func, x_train, x_test, y_train, y_test):
    
    x_train_prep = x_train.apply(preprocess_func)
    x_test_prep = x_test.apply(preprocess_func)

    vectorizer = CountVectorizer()
    x_train_vec = vectorizer.fit_transform(x_train_prep)
    x_test_vec = vectorizer.transform(x_test_prep)

    clf = LogisticRegression(max_iter=2000)
    clf.fit(x_train_vec, y_train)

    y_pred = clf.predict(x_test_vec)
    return accuracy_score(y_test, y_pred)

In [16]:
acc_raw = train_and_evaluate(preprocess_raw, x_train, x_test, y_train, y_test)
acc_stem = train_and_evaluate(preprocess_stem, x_train, x_test, y_train, y_test)
acc_lemma = train_and_evaluate(preprocess_lemma, x_train, x_test, y_train, y_test)

print("No Stemming/Lemmatizing:", acc_raw)
print("With Stemming:", acc_stem)
print("With Lemmatization:", acc_lemma)

No Stemming/Lemmatizing: 0.756
With Stemming: 0.756
With Lemmatization: 0.752
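Before picking a winner from these three numbers, it may help to see how small the gap is. The sketch below is a back-of-the-envelope check, assuming the 500-question test split (the size shown by the progress bars later in this notebook) and treating each prediction as an independent Bernoulli trial.

```python
import math

n_test = 500   # test-set size (assumed from the 500/500 progress bar in this notebook)
acc = 0.756    # best baseline accuracy reported above

# Binomial standard error of an accuracy estimate: sqrt(p * (1 - p) / n)
std_err = math.sqrt(acc * (1 - acc) / n_test)

gap = 0.756 - 0.752                     # raw/stemming vs. lemmatization
questions_apart = round(gap * n_test)   # number of test questions the gap represents

print(f"standard error ~ {std_err:.4f}")                   # about 0.0192
print(f"gap of {gap:.3f} = {questions_apart} questions")   # 2 questions
```

Under these assumptions the 0.004 difference corresponds to two test questions and is well inside one standard error, so the three pipelines are hard to tell apart on this baseline.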


In [13]:
pip install scipy==1.11.3 --force-reinstall



In [14]:
!pip install tqdm



In [18]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
import gensim.downloader as api
from gensim.models import Word2Vec, FastText
from tqdm import tqdm

In [33]:
def vectorize_bow(X_train, X_test):
    vectorizer = CountVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    return X_train_vec, X_test_vec

In [34]:
def vectorize_tfidf(X_train, X_test):
    vectorizer = TfidfVectorizer()
    X_train_vec = vectorizer.fit_transform(X_train)
    X_test_vec = vectorizer.transform(X_test)
    return X_train_vec, X_test_vec

In [None]:
def vectorize_word2vec(X_train, X_test, sg=0):
    tokenized_train = [doc.split() for doc in X_train]
    tokenized_test = [doc.split() for doc in X_test]

    model = Word2Vec(sentences=tokenized_train, vector_size=10, window=5, min_count=1, sg=sg)

    def embed(doc):
        vectors = [model.wv[word] for word in doc if word in model.wv]
        if len(vectors) == 0:
            return np.zeros(model.vector_size)
        return np.mean(vectors, axis=0)

    X_train_vec = np.array([embed(doc) for doc in tokenized_train])
    X_test_vec = np.array([embed(doc) for doc in tokenized_test])
    return X_train_vec, X_test_vec

In [None]:
# !wget --no-check-certificate https://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
# !unzip glove.6B.zip

In [93]:
def vectorize_glove(X_train, X_test, glove_path="glove.6B.100d.txt"):
    # Load GloVe vectors (make sure to have the file in your working dir)
    glove = {}
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            word = values[0]
            vec = np.asarray(values[1:], dtype='float32')
            glove[word] = vec

    tokenized_train = [doc.split() for doc in X_train]
    tokenized_test = [doc.split() for doc in X_test]

    def embed(doc):
        vectors = [glove[word] for word in doc if word in glove]
        if len(vectors) == 0:
            return np.zeros(100)  # 100d fallback
        return np.mean(vectors, axis=0)

    X_train_vec = np.array([embed(doc) for doc in tokenized_train])
    X_test_vec = np.array([embed(doc) for doc in tokenized_test])

    return X_train_vec, X_test_vec


In [103]:
def vectorize_fasttext(X_train, X_test):
    import gensim.downloader as api
    import numpy as np

    fasttext = api.load("fasttext-wiki-news-subwords-300")

    tokenized_train = [text.split() for text in X_train]
    tokenized_test = [text.split() for text in X_test]

    def embed(doc):
        vectors = [fasttext[word] for word in doc if word in fasttext]
        if vectors:
            return np.mean(vectors, axis=0)
        else:
            return np.zeros(fasttext.vector_size)

    X_train_vec = np.array([embed(doc) for doc in tokenized_train])
    X_test_vec = np.array([embed(doc) for doc in tokenized_test])

    return X_train_vec, X_test_vec 

In [104]:
def evaluate_vectorizer(X_train_vec, X_test_vec, y_train, y_test, name):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train_vec, y_train)
    preds = model.predict(X_test_vec)
    acc = accuracy_score(y_test, preds)
    print(f"{name} Accuracy: {acc:.4f}")
    return acc

In [105]:
x_train_prep = [preprocess_lemma(x) for x in tqdm(x_train)]
x_test_prep = [preprocess_lemma(x) for x in tqdm(x_test)]

# 1. BoW
xtr, xte = vectorize_bow(x_train_prep, x_test_prep)
evaluate_vectorizer(xtr, xte, y_train, y_test, "BoW")

# 2. TF-IDF
xtr, xte = vectorize_tfidf(x_train_prep, x_test_prep)
evaluate_vectorizer(xtr, xte, y_train, y_test, "TF-IDF")

# 3. Word2Vec CBOW
xtr, xte = vectorize_word2vec(x_train_prep, x_test_prep, sg=0)
evaluate_vectorizer(xtr, xte, y_train, y_test, "Word2Vec CBOW")

# 4. Word2Vec Skipgram
xtr, xte = vectorize_word2vec(x_train_prep, x_test_prep, sg=1)
evaluate_vectorizer(xtr, xte, y_train, y_test, "Word2Vec Skipgram")

# 5. GloVe
xtr, xte = vectorize_glove(x_train_prep, x_test_prep)
evaluate_vectorizer(xtr, xte, y_train, y_test, "GloVe")

# 6. FastText
xtr, xte = vectorize_fasttext(x_train_prep, x_test_prep)
evaluate_vectorizer(xtr, xte, y_train, y_test, "FastText")

100%|████████████████████████████████████| 5452/5452 [00:00<00:00, 26528.12it/s]
100%|██████████████████████████████████████| 500/500 [00:00<00:00, 45594.23it/s]


BoW Accuracy: 0.7520
TF-IDF Accuracy: 0.7480
Word2Vec CBOW Accuracy: 0.3840
Word2Vec Skipgram Accuracy: 0.4400
GloVe Accuracy: 0.6040
FastText Accuracy: 0.6720


0.672

In [107]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
def generate_embeddings(X_train, X_test, sg=0, method="mean"):
    tokenized_train = [doc.split() for doc in X_train]
    tokenized_test = [doc.split() for doc in X_test]

    model = Word2Vec(sentences=tokenized_train, vector_size=100, window=5, min_count=1, sg=sg)

    def embed(doc):
        vectors = [model.wv[word] for word in doc if word in model.wv]
        if len(vectors) == 0:
            return np.zeros(model.vector_size)
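        if method == "mean":
            return np.mean(vectors, axis=0)
        elif method == "max":
            return np.max(vectors, axis=0)
        elif method == "min":
            return np.min(vectors, axis=0)
        # Fail loudly instead of silently returning None for an unknown strategy
        raise ValueError(f"Unknown combination method: {method!r}")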

    X_train_vec = np.array([embed(doc) for doc in tokenized_train])
    X_test_vec = np.array([embed(doc) for doc in tokenized_test])

    return X_train_vec, X_test_vec

def evaluate_classifier(X_train_vec, X_test_vec, y_train, y_test, model_name, classifier):
    classifier.fit(X_train_vec, y_train)
    preds = classifier.predict(X_test_vec)
    acc = accuracy_score(y_test, preds)
    print(f"{model_name} Accuracy: {acc:.4f}")
    return acc

# Embedding strategies to try
comb_strategies = ["mean", "max", "min"]

# Classifiers to try
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(n_estimators=100),
    "MLP": MLPClassifier(max_iter=500)
}

# Loop through strategies and classifiers
for strategy in comb_strategies:
    print(f"\n--- Embedding: Word2Vec CBOW + {strategy.upper()} ---")
    xtr, xte = generate_embeddings(x_train_prep, x_test_prep, sg=0, method=strategy)
    for name, clf in models.items():
        evaluate_classifier(xtr, xte, y_train, y_test, f"{name} ({strategy})", clf)

    print(f"\n--- Embedding: Word2Vec Skipgram + {strategy.upper()} ---")
    xtr, xte = generate_embeddings(x_train_prep, x_test_prep, sg=1, method=strategy)
    for name, clf in models.items():
        evaluate_classifier(xtr, xte, y_train, y_test, f"{name} ({strategy})", clf)



--- Embedding: Word2Vec CBOW + MEAN ---
Logistic Regression (mean) Accuracy: 0.2380
Decision Tree (mean) Accuracy: 0.4020
Random Forest (mean) Accuracy: 0.5000
MLP (mean) Accuracy: 0.5460

--- Embedding: Word2Vec Skipgram + MEAN ---
Logistic Regression (mean) Accuracy: 0.3980
Decision Tree (mean) Accuracy: 0.3920
Random Forest (mean) Accuracy: 0.5400
MLP (mean) Accuracy: 0.5640

--- Embedding: Word2Vec CBOW + MAX ---
Logistic Regression (max) Accuracy: 0.3720
Decision Tree (max) Accuracy: 0.5220
Random Forest (max) Accuracy: 0.5980
MLP (max) Accuracy: 0.5000

--- Embedding: Word2Vec Skipgram + MAX ---
Logistic Regression (max) Accuracy: 0.4220
Decision Tree (max) Accuracy: 0.4940
Random Forest (max) Accuracy: 0.6380
MLP (max) Accuracy: 0.5640

--- Embedding: Word2Vec CBOW + MIN ---
Logistic Regression (min) Accuracy: 0.3760
Decision Tree (min) Accuracy: 0.5120
Random Forest (min) Accuracy: 0.6160
MLP (min) Accuracy: 0.5200

--- Embedding: Word2Vec Skipgram + MIN ---
Logistic Regression (min) Accuracy: 0.3920
Decision Tree (min) Accuracy: 0.5380
Random Forest (min) Accuracy: 0.6160
MLP (min) Accuracy: 0.5440
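Mean, max and min pooling all treat every word equally. One further combining strategy worth trying is an IDF-weighted mean, which gives rarer words more influence. The sketch below is one possible implementation, not part of the original helper code: `idf_weighted_mean` and the toy vectors are hypothetical, and the IDF is computed from whatever documents are passed in.

```python
import math
from collections import Counter

import numpy as np

def idf_weighted_mean(docs, word_vectors, dim):
    """Pool word vectors per document, weighting each word by its smoothed IDF."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each word
    df = Counter(word for toks in tokenized for word in set(toks))
    # Smoothed IDF, the formula scikit-learn's TfidfVectorizer uses by default
    idf = {w: math.log((1 + n_docs) / (1 + c)) + 1 for w, c in df.items()}

    pooled = np.zeros((n_docs, dim))
    for i, toks in enumerate(tokenized):
        known = [w for w in toks if w in word_vectors]
        if not known:
            continue  # keep an all-zero row when no word has a vector
        weights = np.array([idf[w] for w in known])
        vectors = np.stack([word_vectors[w] for w in known])
        pooled[i] = weights @ vectors / weights.sum()
    return pooled

# Toy 2-d vectors (hypothetical values, for illustration only)
toy_vectors = {"capital": np.array([1.0, 0.0]), "france": np.array([0.0, 1.0])}
docs = ["capital france", "capital city", "unknown words"]
pooled = idf_weighted_mean(docs, toy_vectors, dim=2)
print(pooled)
```

In a real experiment the IDF values would be estimated on the training questions only and then reused for the test questions, to avoid leaking test-set statistics.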




In [109]:
!pip install tensorflow


In [112]:
import gensim.downloader as api
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from sklearn.preprocessing import LabelEncoder

# Load FastText word vectors
fasttext = api.load("fasttext-wiki-news-subwords-300")
embedding_dim = 300

# Tokenize the input
tokenizer = Tokenizer()
tokenizer.fit_on_texts(x_train_prep)

x_train_seq = tokenizer.texts_to_sequences(x_train_prep)
x_test_seq = tokenizer.texts_to_sequences(x_test_prep)

# Pad sequences
maxlen = max(max(len(seq) for seq in x_train_seq), 20)
x_train_pad = pad_sequences(x_train_seq, maxlen=maxlen)
x_test_pad = pad_sequences(x_test_seq, maxlen=maxlen)

# Encode labels
label_encoder = LabelEncoder()
y_train_enc = label_encoder.fit_transform(y_train)
y_test_enc = label_encoder.transform(y_test)

# Prepare embedding matrix
word_index = tokenizer.word_index
num_words = len(word_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim))

for word, i in word_index.items():
    if word in fasttext:
        embedding_matrix[i] = fasttext[word]

# Build LSTM model
model = Sequential()
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_dim,
                    weights=[embedding_matrix],
                    input_length=maxlen,
                    trainable=False))
model.add(LSTM(128, return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(len(np.unique(y_train_enc)), activation='softmax'))

model.compile(loss='sparse_categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

# Train
model.fit(x_train_pad, y_train_enc, epochs=5, batch_size=64, validation_split=0.1)

# Evaluate
y_pred_lstm = model.predict(x_test_pad)
y_pred_lstm_labels = np.argmax(y_pred_lstm, axis=1)
acc_lstm = accuracy_score(y_test_enc, y_pred_lstm_labels)
print(f"LSTM Accuracy: {acc_lstm:.4f}")


Epoch 1/5
77/77 ━━━━━━━━━━━━━━━━━━━━ 5s 30ms/step - accuracy: 0.2626 - loss: 1.6796 - val_accuracy: 0.5458 - val_loss: 1.3250
Epoch 2/5
77/77 ━━━━━━━━━━━━━━━━━━━━ 2s 28ms/step - accuracy: 0.5420 - loss: 1.2526 - val_accuracy: 0.6300 - val_loss: 1.0657
Epoch 3/5
77/77 ━━━━━━━━━━━━━━━━━━━━ 2s 21ms/step - accuracy: 0.6347 - loss: 1.0177 - val_accuracy: 0.6557 - val_loss: 0.9504
Epoch 4/5
77/77 ━━━━━━━━━━━━━━━━━━━━ 2s 23ms/step - accuracy: 0.6746 - loss: 0.9151 - val_accuracy: 0.6850 - val_loss: 0.8688
Epoch 5/5
77/77 ━━━━━━━━━━━━━━━━━━━━ 2s 27ms/step - accuracy: 0.6944 - loss: 0.8647 - val_accuracy: 0.7015 - val_loss: 0.8264
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 22ms/step
LSTM Accuracy: 0.6620


In [118]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

def vectorize_bow(X_train, X_test):
    vectorizer = CountVectorizer()
    return vectorizer.fit_transform(X_train), vectorizer.transform(X_test)

def vectorize_tfidf(X_train, X_test):
    vectorizer = TfidfVectorizer()
    return vectorizer.fit_transform(X_train), vectorizer.transform(X_test)

def vectorize_word2vec(X_train, X_test, sg=0):
    tokenized_train = [doc.split() for doc in X_train]
    tokenized_test = [doc.split() for doc in X_test]
    model = Word2Vec(sentences=tokenized_train, vector_size=100, window=5, min_count=1, sg=sg)

    def embed(doc):
        vectors = [model.wv[word] for word in doc if word in model.wv]
        return np.mean(vectors, axis=0) if vectors else np.zeros(model.vector_size)

    return np.array([embed(doc) for doc in tokenized_train]), np.array([embed(doc) for doc in tokenized_test])

def vectorize_glove(X_train, X_test, glove_path="glove.6B.100d.txt"):
    glove = {}
    with open(glove_path, 'r', encoding='utf-8') as f:
        for line in f:
            values = line.split()
            glove[values[0]] = np.asarray(values[1:], dtype='float32')

    def embed(doc):
        vectors = [glove[word] for word in doc.split() if word in glove]
        return np.mean(vectors, axis=0) if vectors else np.zeros(100)

    return np.array([embed(doc) for doc in X_train]), np.array([embed(doc) for doc in X_test])

def vectorize_fasttext(X_train, X_test):
    fasttext = api.load("fasttext-wiki-news-subwords-300")

    def embed(doc):
        vectors = [fasttext[word] for word in doc.split() if word in fasttext]
        return np.mean(vectors, axis=0) if vectors else np.zeros(fasttext.vector_size)

    return np.array([embed(doc) for doc in X_train]), np.array([embed(doc) for doc in X_test])

classifiers = {
    "LogisticRegression": LogisticRegression(max_iter=2000),
    "NaiveBayes": MultinomialNB(),
    "SVM": SVC(),
    "DecisionTree": DecisionTreeClassifier(),
    "MLP": MLPClassifier(max_iter=1000)
}

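# Placeholder: the vectorize_* functions above already mean-pool the word vectors
# (BoW and TF-IDF need no pooling), so this "aggregation" is an identity step.
aggregations = {
    "mean": lambda x: x
}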


def run_all_experiments(X_train, X_test, y_train, y_test):
    results = []

    vectorizers = {
        "BoW": vectorize_bow,
        "TF-IDF": vectorize_tfidf,
        "Word2Vec_CBOW": lambda X_tr, X_te: vectorize_word2vec(X_tr, X_te, sg=0),
        "Word2Vec_Skipgram": lambda X_tr, X_te: vectorize_word2vec(X_tr, X_te, sg=1),
        "GloVe": vectorize_glove,
        "FastText": vectorize_fasttext,
    }

    for vname, vec_func in vectorizers.items():
        print(f"\n Vectorizer: {vname}")
        X_tr_vec, X_te_vec = vec_func(X_train, X_test)

        for agg_name, agg_func in aggregations.items():
            X_tr_agg, X_te_agg = agg_func(X_tr_vec), agg_func(X_te_vec)

            for clf_name, clf in classifiers.items():
                try:
                    clf.fit(X_tr_agg, y_train)
                    preds = clf.predict(X_te_agg)
                    acc = accuracy_score(y_test, preds)
                    print(f"{vname}-{agg_name}-{clf_name}: {acc:.4f}")
                    results.append({
                        "vectorizer": vname,
                        "aggregation": agg_name,
                        "classifier": clf_name,
                        "accuracy": acc
                    })
                except Exception as e:
                    print(f" Error with {vname}-{agg_name}-{clf_name}: {e}")

    return results

from tqdm import tqdm
preprocessing_methods = {
    "raw": preprocess_raw,
    "stem": preprocess_stem,
    "lemma": preprocess_lemma,
}

all_results = []

for prep_name, prep_func in preprocessing_methods.items():
    print(f"\n Running experiments with preprocessing: {prep_name.upper()}")

    x_train_prep = [prep_func(x) for x in tqdm(x_train, desc=f"Preprocessing {prep_name}")]
    x_test_prep = [prep_func(x) for x in tqdm(x_test, desc=f"Preprocessing {prep_name}")]

    results = run_all_experiments(x_train_prep, x_test_prep, y_train, y_test)
    for res in results:
        res['preprocessing'] = prep_name
        all_results.append(res)

final_df = pd.DataFrame(all_results)
final_df = final_df[['preprocessing', 'vectorizer', 'aggregation', 'classifier', 'accuracy']]



 Running experiments with preprocessing: RAW


Preprocessing raw: 100%|█████████████████| 5452/5452 [00:00<00:00, 79535.55it/s]
Preprocessing raw: 100%|██████████████████| 500/500 [00:00<00:00, 177664.52it/s]


 Vectorizer: BoW
BoW-mean-LogisticRegression: 0.7560
BoW-mean-NaiveBayes: 0.5620
BoW-mean-SVM: 0.7160
BoW-mean-DecisionTree: 0.7160
BoW-mean-MLP: 0.7260

 Vectorizer: TF-IDF
TF-IDF-mean-LogisticRegression: 0.7560
TF-IDF-mean-NaiveBayes: 0.5680
TF-IDF-mean-SVM: 0.7320
TF-IDF-mean-DecisionTree: 0.7280
TF-IDF-mean-MLP: 0.7100

 Vectorizer: Word2Vec_CBOW
Word2Vec_CBOW-mean-LogisticRegression: 0.2120
 Error with Word2Vec_CBOW-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
Word2Vec_CBOW-mean-SVM: 0.4440
Word2Vec_CBOW-mean-DecisionTree: 0.4120
Word2Vec_CBOW-mean-MLP: 0.3760

 Vectorizer: Word2Vec_Skipgram
Word2Vec_Skipgram-mean-LogisticRegression: 0.3780
 Error with Word2Vec_Skipgram-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
Word2Vec_Skipgram-mean-SVM: 0.5760
Word2Vec_Skipgram-mean-DecisionTree: 0.4120
Word2Vec_Skipgram-mean-MLP: 0.5560

 Vectorizer: GloVe
GloVe-mean-LogisticRegression: 0.6080
 Error with GloVe-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
GloVe-mean-SVM: 0.7320
GloVe-mean-DecisionTree: 0.3880
GloVe-mean-MLP: 0.6520

 Vectorizer: FastText
FastText-mean-LogisticRegression: 0.6560
 Error with FastText-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
FastText-mean-SVM: 0.7380
FastText-mean-DecisionTree: 0.4100
FastText-mean-MLP: 0.6860

 Running experiments with preprocessing: STEM


Preprocessing stem: 100%|████████████████| 5452/5452 [00:00<00:00, 10463.42it/s]
Preprocessing stem: 100%|██████████████████| 500/500 [00:00<00:00, 16277.81it/s]



 Vectorizer: BoW
BoW-mean-LogisticRegression: 0.7560
BoW-mean-NaiveBayes: 0.5660
BoW-mean-SVM: 0.7280
BoW-mean-DecisionTree: 0.7220
BoW-mean-MLP: 0.7180

 Vectorizer: TF-IDF
TF-IDF-mean-LogisticRegression: 0.7520
TF-IDF-mean-NaiveBayes: 0.5580
TF-IDF-mean-SVM: 0.7380
TF-IDF-mean-DecisionTree: 0.7020
TF-IDF-mean-MLP: 0.6780

 Vectorizer: Word2Vec_CBOW
Word2Vec_CBOW-mean-LogisticRegression: 0.3980
 Error with Word2Vec_CBOW-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
Word2Vec_CBOW-mean-SVM: 0.5020
Word2Vec_CBOW-mean-DecisionTree: 0.4280
Word2Vec_CBOW-mean-MLP: 0.4400

 Vectorizer: Word2Vec_Skipgram
Word2Vec_Skipgram-mean-LogisticRegression: 0.3860
 Error with Word2Vec_Skipgram-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
Word2Vec_Skipgram-mean-SVM: 0.4680
Word2Vec_Skipgram-mean-DecisionTree: 0.4100
Word2Vec_Skipgram-mean-MLP: 0.6000

 Vectorizer: GloVe
GloVe-mean-LogisticRegression: 0.5460
 Error with GloVe-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
GloVe-mean-SVM: 0.6780
GloVe-mean-DecisionTree: 0.3780
GloVe-mean-MLP: 0.5820

 Vectorizer: FastText
FastText-mean-LogisticRegression: 0.6320
 Error with FastText-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
FastText-mean-SVM: 0.7000
FastText-mean-DecisionTree: 0.3920
FastText-mean-MLP: 0.6440

 Running experiments with preprocessing: LEMMA


Preprocessing lemma: 100%|████████████████| 5452/5452 [00:01<00:00, 4821.58it/s]
Preprocessing lemma: 100%|█████████████████| 500/500 [00:00<00:00, 17457.79it/s]



 Vectorizer: BoW
BoW-mean-LogisticRegression: 0.7520
BoW-mean-NaiveBayes: 0.5640
BoW-mean-SVM: 0.7060
BoW-mean-DecisionTree: 0.7240
BoW-mean-MLP: 0.7180

 Vectorizer: TF-IDF
TF-IDF-mean-LogisticRegression: 0.7480
TF-IDF-mean-NaiveBayes: 0.5640
TF-IDF-mean-SVM: 0.7320
TF-IDF-mean-DecisionTree: 0.7300
TF-IDF-mean-MLP: 0.6860

 Vectorizer: Word2Vec_CBOW
Word2Vec_CBOW-mean-LogisticRegression: 0.2340
 Error with Word2Vec_CBOW-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
Word2Vec_CBOW-mean-SVM: 0.4440
Word2Vec_CBOW-mean-DecisionTree: 0.4240
Word2Vec_CBOW-mean-MLP: 0.4140

 Vectorizer: Word2Vec_Skipgram
Word2Vec_Skipgram-mean-LogisticRegression: 0.3900
 Error with Word2Vec_Skipgram-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
Word2Vec_Skipgram-mean-SVM: 0.5620
Word2Vec_Skipgram-mean-DecisionTree: 0.4240
Word2Vec_Skipgram-mean-MLP: 0.5980

 Vectorizer: GloVe
GloVe-mean-LogisticRegression: 0.6040
 Error with GloVe-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
GloVe-mean-SVM: 0.7440
GloVe-mean-DecisionTree: 0.3780
GloVe-mean-MLP: 0.6480

 Vectorizer: FastText
FastText-mean-LogisticRegression: 0.6720
 Error with FastText-mean-NaiveBayes: Negative values in data passed to MultinomialNB (input X).
FastText-mean-SVM: 0.7340
FastText-mean-DecisionTree: 0.4020
FastText-mean-MLP: 0.6760
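The MultinomialNB errors in the log above arise because that model expects non-negative, count-like features, whereas pooled word embeddings contain negative values. One possible workaround, sketched here on synthetic data (the random arrays merely stand in for pooled embeddings), is to use `GaussianNB` for the dense vectorizers. Rescaling the features into [0, 1] before `MultinomialNB` would be another option.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

# Synthetic stand-in for pooled embeddings: real-valued, including negatives
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 3, size=60)
X_test = rng.normal(size=(10, 5))

clf = GaussianNB()           # models each feature as a per-class Gaussian
clf.fit(X_train, y_train)    # accepts negative inputs without error
preds = clf.predict(X_test)
print(preds.shape)  # (10,)
```

With this substitution every vectorizer would have a Naive Bayes row in the results table, which keeps the comparison grid complete.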


In [119]:
final_df

| | preprocessing | vectorizer | aggregation | classifier | accuracy |
|---|---|---|---|---|---|
| 0 | raw | BoW | mean | LogisticRegression | 0.756 |
| 1 | raw | BoW | mean | NaiveBayes | 0.562 |
| 2 | raw | BoW | mean | SVM | 0.716 |
| 3 | raw | BoW | mean | DecisionTree | 0.716 |
| 4 | raw | BoW | mean | MLP | 0.726 |
| ... | ... | ... | ... | ... | ... |
| 73 | lemma | GloVe | mean | MLP | 0.648 |
| 74 | lemma | FastText | mean | LogisticRegression | 0.672 |
| 75 | lemma | FastText | mean | SVM | 0.734 |
| 76 | lemma | FastText | mean | DecisionTree | 0.402 |


In [121]:
final_df['accuracy'].max()

0.756
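`max()` returns only the value, and several combinations can share it. The sketch below shows one way to list every row that reaches the maximum. `demo_df` is a small illustrative frame holding four rows copied from the results printed above; on the full table, `final_df` would take its place.

```python
import pandas as pd

# A few rows copied from the results printed above
demo_df = pd.DataFrame(
    [
        ("raw", "BoW", "LogisticRegression", 0.756),
        ("raw", "TF-IDF", "LogisticRegression", 0.756),
        ("stem", "BoW", "LogisticRegression", 0.756),
        ("lemma", "BoW", "LogisticRegression", 0.752),
    ],
    columns=["preprocessing", "vectorizer", "classifier", "accuracy"],
)

best = demo_df["accuracy"].max()
# Select every row that reaches the maximum, not just the first one
top_rows = demo_df[demo_df["accuracy"] == best]
print(top_rows.to_string(index=False))
```

Sorting with `sort_values("accuracy", ascending=False)` gives a ranked view of the whole table in the same spirit.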

In [None]:
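## Hence the highest accuracy is 0.756. The first combination (raw + BoW + LogisticRegression) reaches it, tied with raw + TF-IDF + LogisticRegression and stem + BoW + LogisticRegression.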