# Assignment 4: Text Classification on TREC dataset

We are going to use the TREC dataset for this assignment, which is widely considered a benchmark text classification dataset. Read about the TREC dataset here (https://huggingface.co/datasets/CogComp/trec), also google it for understanding it better.

This is what you have to do - use the concepts we have covered so far to accurately predict the 5 coarse labels (if you have googled TERC, you will surely know what I mean) in the test dataset. Train on the train dataset and give results on the test dataset, as simple as that. And experiment, experiment and experiment! 

Your experimentation should be 4-tiered-

i) Experiment with preprocessing techniques (different types of Stemming, Lemmatizing, or do neither and keep the words pure). Needless to say, certain things, like stopword removal, should be common in all the preprocesssing pipelines you come up with. Remember never do stemming and lemmatization together. Note - To find out the best preprocessing technique, use a simple baseline model, like say CountVectorizer(BoW) + Logistic Regression, and see which gives the best accuracy. Then proceed with that preprocessing technique only for all the other models.

ii) Try out various vectorisation techniques (BoW, TF-IDF, CBoW, Skipgram, GloVE, Fasttext, etc., but transformer models are not allowed) -- Atleast 5 different types

iii) Tinker with various strategies to combine the word vectors (taking mean, using RNN/LSTM, and the other strategies I hinted at in the end of the last sesion). Note that this is applicable only for the advanced embedding techniques which generate word embeddings. -- Atleast 3 different types, one of which should definitely be RNN/LSTM

iv) Finally, experiment with the ML classifier model, which will take the final vector respresentation of each TREC question and generate the label. E.g. - Logistic regression, decision trees, simple neural network, etc. - Atleast 4 different models

So applying some PnC, in total you should get more than 40 different combinations. Print out the accuracies of all these combinations nicely in a well-formatted table, and pronounce one of them the best. Also feel free to experiment with more models/embedding techniques than what I have said here, the goal is after all to achieve the highest accuracy, as long as you don't use transformers. Happy experimenting!

NOTE - While choosing the 4-5 types of each experimentation level, try to choose the best out of all those available. E.g. - For level (iii) - Tinker with various strategies to combine the word vectors - do not include 'mean' if you see it is giving horrendous results. Include the best 3-4 strategies.

### Helper Code to get you started

I have added some helper code to show you how to load the TERC dataset and use it.

In [1]:


from datasets import load_dataset

dataset = load_dataset("trec", trust_remote_code=True)
train_data = dataset['train']
test_data = dataset['test']


print("Sample Question:", train_data[0]['text'])
print("Label:", train_data[0]['coarse_label'])


  from .autonotebook import tqdm as notebook_tqdm


Sample Question: How did serfdom develop in and then leave Russia ?
Label: 2


In [2]:

from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


In [3]:

def stemming(texts):
    return [" ".join([stemmer.stem(word) 
                      for word in word_tokenize(text)]) 
            for text in texts]
        
def lemmetizing(texts):
    return [" ".join([lemmatizer.lemmatize(word.lower()) 
                      for word in word_tokenize(text)]) 
            for text in texts]
        


In [4]:
texts = [item["text"] for item in train_data]
labels = [item["coarse_label"] for item in train_data]
lemme_text =lemmetizing(texts)
stemm_text=stemming(texts)
test_texts=[item["text"] for item in test_data]
test_labels = [item["coarse_label"] for item in test_data]
test_lemme_text =lemmetizing(test_texts)
test_stemm_text = stemming(test_texts)
test_lemme_text[0]

'how far is it from denver to aspen ?'

without preprocessing


In [5]:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.pipeline import Pipeline
model1= Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000))
])
model1.fit(texts,labels)
y_pred = model1.predict(test_texts)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))


Accuracy: 0.728

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.78      0.88         9
           1       0.75      0.54      0.63        94
           2       0.65      0.93      0.77       138
           3       0.69      0.77      0.73        65
           4       0.74      0.62      0.67        81
           5       0.88      0.68      0.77       113

    accuracy                           0.73       500
   macro avg       0.78      0.72      0.74       500
weighted avg       0.75      0.73      0.72       500



with lemmetization

In [39]:
model2= Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000))
])
model2.fit(lemme_text,labels)
y_pred = model2.predict(test_lemme_text)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))

Accuracy: 0.728

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.78      0.88         9
           1       0.77      0.52      0.62        94
           2       0.67      0.93      0.78       138
           3       0.64      0.83      0.72        65
           4       0.73      0.64      0.68        81
           5       0.90      0.65      0.75       113

    accuracy                           0.73       500
   macro avg       0.78      0.73      0.74       500
weighted avg       0.75      0.73      0.72       500



with stemming

In [40]:
model3= Pipeline([
    ('vectorizer', CountVectorizer(stop_words='english')),
    ('classifier', LogisticRegression(max_iter=1000))
])
model3.fit(stemm_text,labels)
y_pred = model3.predict(test_stemm_text)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))

Accuracy: 0.734

Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.78      0.88         9
           1       0.72      0.50      0.59        94
           2       0.71      0.94      0.81       138
           3       0.59      0.85      0.69        65
           4       0.75      0.69      0.72        81
           5       0.95      0.64      0.76       113

    accuracy                           0.73       500
   macro avg       0.79      0.73      0.74       500
weighted avg       0.76      0.73      0.73       500



accuracy is maximum for stemming so we will use stemming data

In [6]:
X_train=stemm_text
y_train= labels
X_test=test_stemm_text
y_test=test_labels


In [7]:
import re
def preprocess(text):
    
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)
    return text
X_train=[preprocess(x) for x in X_train]

X_test=[preprocess(x) for x in X_test]
X_train[0]

'how did serfdom develop in and then leav russia '

ii) Tf-idf sentence embedding

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

ii) CBow

In [9]:
import numpy as np

In [24]:
import gensim.downloader as api
w2v_model = api.load('word2vec-google-news-300') 


cbow mean embedding

In [11]:
def vectorize_cbow(texts, model):
    vectors = []
    for sentence in texts:
        words = sentence.split()
        valid_words = [model[word] for word in words if word in model]
        if valid_words:
            vectors.append(sum(valid_words) / len(valid_words))
        else:
            vectors.append(np.zeros(model.vector_size))
    return np.array(vectors)

X_train_cbow_mean= vectorize_cbow(X_train, w2v_model)
X_test_cbow_mean = vectorize_cbow(X_test, w2v_model)
X_train_cbow_mean[0]

array([ 0.03427124,  0.01582718,  0.07673645,  0.18412781, -0.00122547,
        0.02038574,  0.02956629, -0.13592529,  0.01013184,  0.05047607,
       -0.09200287, -0.13833618, -0.07865906,  0.04612732, -0.20117188,
        0.05702209,  0.03904724,  0.1227417 ,  0.09185791, -0.0657959 ,
        0.03601074,  0.02427673,  0.06387329, -0.02804947,  0.0223403 ,
        0.09883499, -0.07717705,  0.03207397, -0.0174942 ,  0.05000305,
       -0.04743958,  0.00378418, -0.06468201, -0.08119202, -0.12365723,
        0.09744263,  0.02725983, -0.03878784,  0.09840393,  0.06979752,
        0.05728912,  0.02076721,  0.24707031, -0.04380798,  0.08026123,
        0.0002594 ,  0.03794479, -0.02432251, -0.07565308,  0.07025146,
       -0.02229309,  0.08152771,  0.00518417, -0.00271606, -0.04527283,
        0.02480984, -0.13931274, -0.0615921 ,  0.07017899, -0.0953064 ,
        0.04006958,  0.08468628, -0.09032249, -0.10282898, -0.09122467,
       -0.0683136 ,  0.0397644 ,  0.04391479,  0.02353668,  0.11

glove mean embedding

In [12]:

def load_glove_embeddings(file_path):
    embeddings = {}
    with open(file_path, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split()
            word = parts[0]
            vector = np.array(parts[1:], dtype=np.float32)
            embeddings[word] = vector
    return embeddings

glove = load_glove_embeddings("archive/glove.6B.100d.txt")

def sentence_to_vector(sentence, embeddings, dim=100):
    words = preprocess(sentence)
    vectors = [embeddings[word] for word in words if word in embeddings]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(dim)

X_train_glove_mean = np.array([sentence_to_vector(sent, glove, 100) for sent in X_train])
X_test_glove_mean= np.array([sentence_to_vector(sent, glove, 100) for sent in X_test])

In [13]:
X_train_glove_mean[0]

array([-4.7442964e-01,  3.6535317e-01,  1.5439278e-01, -2.0138146e-01,
       -3.2791090e-01,  1.3251089e-01,  2.0701523e-01,  2.1957679e-01,
       -4.5216241e-01,  1.4855900e-01,  6.1905539e-01, -5.6374019e-01,
       -6.0469246e-01,  4.2445922e-01, -8.9672439e-02, -1.9194417e-01,
       -4.4239253e-02, -7.3100992e-02, -8.1686571e-02,  1.5874349e-01,
        5.9251869e-01,  9.1681108e-02,  1.6831284e-02,  4.7428277e-01,
        1.1391300e-01,  5.1658505e-01,  2.1843876e-01,  1.4803204e-01,
        2.0713256e-01, -2.2506511e-01,  3.3820912e-01,  7.2814637e-01,
        1.2401687e-01,  2.7697900e-01,  4.2861646e-01,  1.6064940e-01,
        4.8349366e-01,  2.2341575e-01,  2.9683730e-01,  3.6747631e-01,
        2.0922235e-01, -5.9616637e-01, -9.4286492e-03, -5.7932597e-01,
       -1.2543838e-01,  1.8999942e-01, -5.6021369e-01, -4.0297419e-02,
        1.7896916e-01, -2.8541920e-01,  3.1735923e-02,  1.3076967e-01,
        4.0279233e-01,  5.7824086e-03, -5.3424752e-01, -1.8206637e+00,
      

skip gram mean embedding

In [20]:
from gensim.models import Word2Vec
w2v_model = Word2Vec(sentences=X_train, vector_size=100, window=5, min_count=1, sg=1)

In [21]:
def sentence_to_vector(tokens, model, dim=100):
    vectors = [model.wv[word] for word in tokens if word in model.wv]
    if vectors:
        return np.mean(vectors, axis=0)
    else:
        return np.zeros(dim)

X_train_sg = np.array([sentence_to_vector(sent, w2v_model, 100) for sent in X_train])
X_test_sg = np.array([sentence_to_vector(sent, w2v_model, 100) for sent in X_test])

In [16]:
X_train_sg[0]

array([-0.12048084,  0.07750017,  0.03726451,  0.10009532,  0.08369804,
       -0.04920781,  0.12045842,  0.11197897, -0.10836435, -0.09109332,
        0.12698498, -0.11640978, -0.02451688, -0.02767149,  0.02473996,
       -0.05596323,  0.04146416,  0.08166689, -0.09080675, -0.09528011,
       -0.02630907, -0.02363455,  0.18667865,  0.01648281,  0.0512659 ,
       -0.01690175, -0.0036977 ,  0.05939044, -0.07466883,  0.01759764,
        0.02484157, -0.06096856,  0.1270047 ,  0.07521374, -0.05467458,
        0.07259584, -0.01255775,  0.01153789,  0.04261676, -0.00975642,
        0.11343464, -0.02531392, -0.187321  ,  0.13309714, -0.00458413,
        0.11342674, -0.02931422, -0.03044664, -0.00760655,  0.05871193,
        0.03507536, -0.1001359 , -0.00273306, -0.04859355, -0.06685618,
       -0.0614187 ,  0.04211865, -0.07638345,  0.02386894,  0.04801695,
       -0.06498439,  0.03779931,  0.09383649, -0.13053836, -0.1055854 ,
        0.07172894,  0.00188703,  0.12946768, -0.0381165 , -0.04

In [17]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout


now will first create word embeddigs using cbow and then combine using lstm

In [25]:
from tensorflow.keras.preprocessing.sequence import pad_sequences


def vectorize_cbow_sequence(texts, model, max_len):
    sequences = []
    for sentence in texts:
        words = sentence.split()
        word_vectors = [model[word] for word in words if word in model]
        sequences.append(word_vectors)
    
    # Pad sequences with zeros (vectors of size model.vector_size)
    padded_sequences = []
    for seq in sequences:
        if len(seq) < max_len:
            # pad with zero vectors
            padding = [np.zeros(model.vector_size)] * (max_len - len(seq))
            seq.extend(padding)
        else:
            seq = seq[:max_len]  # truncate
        padded_sequences.append(seq)
    
    return np.array(padded_sequences)




In [26]:
max_len = 30  # or use max(len(s.split()) for s in X_train)
X_train_cbow= vectorize_cbow_sequence(X_train, w2v_model, max_len)
X_test_cbow = vectorize_cbow_sequence(X_test, w2v_model, max_len)


In [27]:
len(y_train)

5452

In [28]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM

max_len = 30           # number of words in a sentence
embedding_dim = 300   # GloVe or other embeddings
lstm_units = 128       # output dimension

# Input layer
inputs = Input(shape=(max_len, embedding_dim))

# LSTM layer — outputs final hidden state
x = LSTM(lstm_units, return_sequences=False)(inputs)

# Create model
model0 = Model(inputs, x)



In [29]:
X_train_cbow.shape

(5452, 30, 300)

In [30]:
X_train_cbow_word_combined=model0.predict(X_train_cbow)
X_test_cbow_word_combined=model0.predict(X_test_cbow)

[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 12ms/step
[1m16/16[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 11ms/step


In [31]:
X_train_cbow_word_combined.shape

(5452, 128)

now will first create word embeddigs using glove and then combine using lstm

In [32]:
        # number of words in a sentence
embedding_dim1 = 100  # GloVe or other embeddings
     # output dimension

# Input layer
inputs1 = Input(shape=(max_len, embedding_dim1))

# LSTM layer — outputs final hidden state
x1= LSTM(lstm_units, return_sequences=False)(inputs1)

# Create model
model01 = Model(inputs1, x1)


In [33]:
import numpy as np

def sentence_to_glove_vectors(texts, glove, max_len, embedding_dim=100):
    all_padded_sequences = []

    for sentence in texts:
        words = sentence.lower().split()
        vectors = []
        for word in words:
            vector = glove.get(word)
            if vector is not None:
                vectors.append(vector)
            else:
                vectors.append(np.zeros(embedding_dim))
        
        # Pad or truncate each sentence to max_len
        if len(vectors) < max_len:
            padding = [np.zeros(embedding_dim)] * (max_len - len(vectors))
            vectors.extend(padding)
        else:
            vectors = vectors[:max_len]

        all_padded_sequences.append(vectors)
    
    return np.array(all_padded_sequences)  # shape: (num_sentences, max_len, embedding_dim)

    
X_train_glove=sentence_to_glove_vectors(X_train,glove,30) 
X_test_glove=sentence_to_glove_vectors(X_test,glove,30)

In [34]:
X_train_glove_combined=model01.predict(X_train_glove)
X_test_glove_combined=model01.predict(X_test_glove)

[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 9ms/step
[1m16/16[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 10ms/step


In [35]:
X_train_glove_combined.shape

(5452, 128)

now will use bidirectional lstm to combine 


1) will combine cbow word vetors

In [36]:
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Bidirectional,LSTM

max_len = 30           # number of words in a sentence
embedding_dim = 300   # GloVe or other embeddings
lstm_units = 128       # output dimension

# Input layer
inputs11 = Input(shape=(max_len, embedding_dim))

# LSTM layer — outputs final hidden state
x11= Bidirectional(LSTM(lstm_units, return_sequences=False))(inputs11)

# Create model
model11 = Model(inputs11, x11)
X_train_cbow_combined_bilstm= model11.predict(X_train_cbow)
X_test_cbow_combined_bilstm= model11.predict(X_test_cbow)


[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 12ms/step
[1m16/16[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step


In [37]:
X_train_cbow_combined_bilstm.shape

(5452, 256)

2. will combine glove word vectors

In [38]:
max_len = 30           # number of words in a sentence
embedding_dim = 100  # GloVe or other embeddings
lstm_units = 128       # output dimension

# Input layer
inputs12 = Input(shape=(max_len, embedding_dim))

# LSTM layer — outputs final hidden state
x12= Bidirectional(LSTM(lstm_units, return_sequences=False))(inputs12)

# Create model
model12 = Model(inputs12, x12)
X_train_glove_combined_bilstm= model12.predict(X_train_glove)
X_test_glove_combined_bilstm= model12.predict(X_test_glove)

[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 13ms/step
[1m16/16[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 14ms/step


In [39]:
X_train_glove_combined_bilstm.shape

(5452, 256)

5) using ML classifier model on cbow and glove  word embedding 

i) LSTM + Neural networrk

In [52]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Dropout

model = Sequential()
model.add(LSTM(128, input_shape=(max_len, w2v_model.vector_size), return_sequences=False))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(6, activation='softmax')) 
  # for multiclass classification

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

  super().__init__(**kwargs)


In [53]:

from tensorflow.keras.utils import to_categorical
y_train_lstm = to_categorical(y_train, 6)
y_test_lstm = to_categorical(y_test, 6)

In [50]:
X_train_cbow.shape

(5452, 30, 300)

In [54]:
history=model.fit(X_train_cbow,y_train_lstm,validation_data=(X_test_cbow,y_test_lstm),epochs=10,batch_size=32)

Epoch 1/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 21ms/step - accuracy: 0.2708 - loss: 1.6252 - val_accuracy: 0.5040 - val_loss: 1.3155
Epoch 2/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 21ms/step - accuracy: 0.5548 - loss: 1.1642 - val_accuracy: 0.6880 - val_loss: 0.9270
Epoch 3/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 21ms/step - accuracy: 0.6932 - loss: 0.8839 - val_accuracy: 0.7720 - val_loss: 0.7159
Epoch 4/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 21ms/step - accuracy: 0.7283 - loss: 0.7757 - val_accuracy: 0.7880 - val_loss: 0.6677
Epoch 5/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m4s[0m 21ms/step - accuracy: 0.7799 - loss: 0.6777 - val_accuracy: 0.7720 - val_loss: 0.6236
Epoch 6/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 18ms/step - accuracy: 0.7932 - loss: 0.6051 - val_accuracy: 0.7980 - val_loss: 0.6067
Epoch 7/10
[1m171/171

cbow_ACCURACY=84% cbow_val_ACCURACY =82%

now on to glove


In [55]:
X_train_glove.shape

(5452, 30, 100)

In [56]:
model2 = Sequential()
model2.add(LSTM(128, input_shape=(max_len,100), return_sequences=False))
model2.add(Dropout(0.5))
model2.add(Dense(64, activation='relu'))
model2.add(Dense(6, activation='softmax'))  # for multiclass classification

model2.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model2.summary()

In [57]:
history2=model2.fit(X_train_glove,y_train_lstm,validation_data=(X_test_glove,y_test_lstm),epochs=10,batch_size=32)

Epoch 1/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 21ms/step - accuracy: 0.2889 - loss: 1.6260 - val_accuracy: 0.5060 - val_loss: 1.1753
Epoch 2/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m5s[0m 19ms/step - accuracy: 0.4594 - loss: 1.2958 - val_accuracy: 0.5060 - val_loss: 1.1108
Epoch 3/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 17ms/step - accuracy: 0.5237 - loss: 1.1543 - val_accuracy: 0.6180 - val_loss: 0.9614
Epoch 4/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 20ms/step - accuracy: 0.6471 - loss: 0.9417 - val_accuracy: 0.7540 - val_loss: 0.7891
Epoch 5/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 19ms/step - accuracy: 0.7193 - loss: 0.8044 - val_accuracy: 0.7880 - val_loss: 0.6548
Epoch 6/10
[1m171/171[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m3s[0m 20ms/step - accuracy: 0.7511 - loss: 0.6917 - val_accuracy: 0.8160 - val_loss: 0.5220
Epoch 7/10
[1m171/171

In [58]:
arr=model2.predict(np.array([X_train_glove[0]]))

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 271ms/step


In [59]:
max_idx = np.argmax(arr)

 glove_accuracy= 83% glove_val_accuracy = 84%

 ii) logistic regression
 cbow +lstm+logistic

In [60]:
X_train_cbow_mean.shape

(5452, 300)

In [61]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train_cbow_word_combined,labels)
y_pred = clf.predict(X_test_cbow_word_combined)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))

Accuracy: 0.194

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         9
           1       0.19      1.00      0.32        94
           2       0.00      0.00      0.00       138
           3       0.75      0.05      0.09        65
           4       0.00      0.00      0.00        81
           5       0.00      0.00      0.00       113

    accuracy                           0.19       500
   macro avg       0.16      0.17      0.07       500
weighted avg       0.13      0.19      0.07       500



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


glove

In [62]:
clf2 = LogisticRegression(max_iter=1000)
clf2.fit(X_train_glove_combined,labels)
y_pred = clf2.predict(X_test_glove_combined)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))

Accuracy: 0.372

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         9
           1       0.26      0.48      0.34        94
           2       0.42      0.96      0.59       138
           3       0.53      0.12      0.20        65
           4       0.00      0.00      0.00        81
           5       0.00      0.00      0.00       113

    accuracy                           0.37       500
   macro avg       0.20      0.26      0.19       500
weighted avg       0.24      0.37      0.25       500



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


cbow+mean+logistic

In [63]:
clf3 = LogisticRegression(max_iter=1000)
clf3.fit(X_train_cbow_mean,labels)
y_pred = clf3.predict(X_test_cbow_mean)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))

Accuracy: 0.752

Classification Report:
               precision    recall  f1-score   support

           0       0.71      0.56      0.62         9
           1       0.65      0.57      0.61        94
           2       0.69      0.86      0.76       138
           3       0.88      0.88      0.88        65
           4       0.73      0.79      0.76        81
           5       0.92      0.69      0.79       113

    accuracy                           0.75       500
   macro avg       0.76      0.72      0.74       500
weighted avg       0.76      0.75      0.75       500



glove+mean+logistic

In [64]:
X_train_glove_mean.shape

(5452, 100)

In [65]:
clf4 = LogisticRegression(max_iter=1000)
clf4.fit(X_train_glove_mean,labels)
y_pred = clf4.predict(X_test_glove_mean)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))

Accuracy: 0.392

Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00         9
           1       0.31      0.37      0.34        94
           2       0.53      0.62      0.58       138
           3       0.29      0.48      0.36        65
           4       0.35      0.36      0.36        81
           5       0.39      0.13      0.20       113

    accuracy                           0.39       500
   macro avg       0.31      0.33      0.31       500
weighted avg       0.39      0.39      0.37       500



  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])
  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


ii) decision trees 
cbow+mean+decision

In [66]:
labels.shape

AttributeError: 'list' object has no attribute 'shape'

In [68]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=20, random_state=42) 
clf.fit(X_train_cbow_mean, labels)

# Step 4: Predict and evaluate
y_pred = clf.predict(X_test_cbow_mean)

# print("Accuracy:", accuracy_score(labels, y_pred))
# print("\nClassification Report:\n", classification_report(test_labels, y_pred))
y_pred.shape

(500,)

In [70]:
len(labels)

5452

In [71]:
print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))


Accuracy: 0.468

Classification Report:
               precision    recall  f1-score   support

           0       0.25      0.44      0.32         9
           1       0.35      0.38      0.37        94
           2       0.57      0.57      0.57       138
           3       0.42      0.54      0.47        65
           4       0.45      0.49      0.47        81
           5       0.56      0.36      0.44       113

    accuracy                           0.47       500
   macro avg       0.43      0.46      0.44       500
weighted avg       0.48      0.47      0.47       500



ii) 2 glove+mean+decision trees

In [72]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(max_depth=20, random_state=42) 
clf.fit(X_train_glove_mean, labels)

# Step 4: Predict and evaluate
y_pred = clf.predict(X_test_glove_mean)

print("Accuracy:", accuracy_score(test_labels, y_pred))
print("\nClassification Report:\n", classification_report(test_labels, y_pred))


Accuracy: 0.33

Classification Report:
               precision    recall  f1-score   support

           0       0.50      0.44      0.47         9
           1       0.25      0.28      0.26        94
           2       0.47      0.49      0.48       138
           3       0.21      0.31      0.25        65
           4       0.28      0.28      0.28        81
           5       0.36      0.21      0.27       113

    accuracy                           0.33       500
   macro avg       0.35      0.34      0.34       500
weighted avg       0.34      0.33      0.33       500



### the best accuracy comes with STEMMING + CBOW + LSTM  which is 84% on training data and 82 on validation data