# HW6.1 Text classification with w2v

We will use the same Triage dataset as in the last two weeks. Here is what's in this assignment:

1. We will explore text classification using pre-trained w2v embeddings with logistic regression.

2. We will explore text classification with w2v embeddings trained on the Triage training dataset, and then test it on the dev dataset.

## PART I: Using pre-trained w2v embeddings for text classification

For data loading, you should use the same code as last time so that you can obtain train_text, train_label, dev_text, dev_label, etc. 

To get pretrained w2v embeddings, we can use the ```gensim``` library. You can do 

```!pip install gensim```

to get it first. 

Once you have installed the library, you can take a look at which pretrained embeddings are available for you to download.

```
import gensim.downloader
#Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

```

You should see a list of available pretrained embeddings like this: 

```
['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']
```

We recommend trying out a few, like the 'glove-wiki-gigaword-300' and 'word2vec-google-news-300'. To download the embeddings: 

```
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-300')
```

Once you have loaded the embeddings into a variable, you can do many things with them. For instance, you can find the words most similar to a query word:

```
glove_vectors.most_similar('how')
```
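To make the idea of "most similar" concrete, here is a minimal sketch that ranks words by cosine similarity, which is the measure gensim uses for `most_similar`. The 3-d vectors and the helper names `cosine_similarity` and `rank_neighbors` are invented for illustration; they are not part of gensim.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-d "embeddings"; the values are made up for illustration only.
toy_vectors = {
    "how":  np.array([1.0, 0.0, 0.0]),
    "what": np.array([0.9, 0.1, 0.0]),
    "cat":  np.array([0.0, 0.0, 1.0]),
}

def rank_neighbors(query: str, vectors: dict) -> list:
    """Return (word, similarity) pairs for all other words, most similar first."""
    scores = [
        (word, cosine_similarity(vectors[query], vec))
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

neighbors = rank_neighbors("how", toy_vectors)
print(neighbors)
```

With real pretrained vectors the ranking is over the whole vocabulary, so the call above is only a small-scale picture of what the library does.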

You can also look at the embedding of a word: 

```
word = "how"
word_embedding = glove_vectors[word]

```

In tfidf, a sentence or document is naturally represented as a single vector over the vocabulary. In w2v, however, you have a vector for each word but not for a sentence (alternatively, you can use something called doc2vec to encode a sentence directly). The most common way to get a sentence vector from word vectors is to look up the embedding of each word and take the average of all the word embeddings. If each word is a 300-d vector, then the final sentence vector is also 300-d.
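As a small illustration of the averaging step, the sketch below uses made-up 3-d vectors instead of real 300-d embeddings. The dictionary `toy_vectors` and the function name `average_embedding` are assumptions for this example only.

```python
import numpy as np

# Toy 3-d "embeddings"; real w2v vectors would be 300-d.
toy_vectors = {
    "how": np.array([1.0, 2.0, 3.0]),
    "are": np.array([3.0, 2.0, 1.0]),
    "you": np.array([2.0, 2.0, 2.0]),
}

def average_embedding(words: list, vectors: dict) -> np.ndarray:
    """Average the vectors of the given words (all assumed to be in `vectors`)."""
    return np.mean([vectors[w] for w in words], axis=0)

sentence_vec = average_embedding(["how", "are", "you"], toy_vectors)
print(sentence_vec.shape)  # same dimensionality as a single word vector
```

The result has the same number of dimensions as each word vector, regardless of how many words the sentence contains.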

### Task 1: Write a ```get_sentence_embedding()``` function. 

First, you need to write a function that builds a sentence embedding from the embeddings of its words. Note that when you look up a word in the pretrained w2v, there is no guarantee that the word is in the w2v dictionary. If it is not, the lookup raises an error. Your code should include error handling for this situation: if a word is not present in the dictionary, represent it with a 300-d zero vector using ```numpy.zeros()```.

For this task let's use the pre-trained google news 300-d vector. 

### Loading packages/modules

In [1]:
#loading all the modules and importing all the required libraries
import gensim.downloader
#Show all available models in gensim-data
# print(list(gensim.downloader.info()['models'].keys()))
import numpy as np

from util import load_data, Dataset, Example
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

import nltk
from nltk.tokenize import word_tokenize


In [2]:
#downloading pretrained embeddings 
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-300')
word2vec_vectors = gensim.downloader.load('word2vec-google-news-300')
#tokenizing sentences instead of splitting to achieve higher accuracy:
def tokenize(sentence):
    return word_tokenize(sentence.lower())


In [3]:
def get_sentence_embedding(sentence: str, word2vec_vectors) -> np.ndarray:
    """
    Get the embedding of a sentence by averaging the w2v embeddings of its words.

    args:
        sentence: the input sentence to compute embeddings from
        word2vec_vectors: the pretrained w2v object where you can look up word embeddings
    returns:
        a numpy ndarray with the same dimension as the pretrained w2v embeddings
    """
    #split the sentence into words
    words = tokenize(sentence)
    #take the dimension from the vectors that were passed in (300 for the models used here)
    embed_dim = word2vec_vectors.vector_size

    #initializing sentence embedding with zeros of the dimensions of embedding
    sentence_embed = np.zeros(embed_dim)
    #counter for the number of words that were found in the dictionary
    num_w = 0

    for word in words:
        try:
            #look the word up in the vectors passed as an argument (not in a global variable)
            word_embed = word2vec_vectors[word]
            #adding each word vector to the sentence embedding
            sentence_embed += word_embed
            #incrementing our counter
            num_w += 1
        except KeyError: #the word is not found in the dictionary
            sentence_embed += np.zeros(embed_dim) #a missing word contributes a zero vector

    if num_w > 0:
        sentence_embed = sentence_embed / num_w #average over the words that were found
    return sentence_embed

### Task 2: Encode your input sentences from the train and dev portions of the Triage dataset into vector representations

Last week we saw how to use tf-idf vectors to represent sentences and use them in a classifier. Here we just need to similarly turn all training and dev sentences into vectors, but using w2v. 

Make use of the function above and go through all sentences in your train data and dev data. One possibility is that all of the words in a sentence may be absent from your pretrained w2v dictionary. In that case, it would just come out as a zero vector for the whole sentence, which may not be ideal but let's keep it simple. 
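The sketch below shows, on toy data, how per-sentence vectors can be stacked into one feature matrix and how sentences whose words are all missing end up as zero rows. The `embed` helper, the 4-d vectors, and the whitespace tokenization are simplifying assumptions for illustration.

```python
import numpy as np

EMBED_DIM = 4  # toy dimension; the pretrained vectors in this assignment are 300-d

# Toy lookup table standing in for a pretrained w2v object.
toy_vectors = {
    "need": np.ones(EMBED_DIM),
    "water": np.full(EMBED_DIM, 3.0),
}

def embed(sentence: str, vectors: dict, dim: int) -> np.ndarray:
    """Average the vectors of known words; return a zero vector if none are known."""
    known = [vectors[w] for w in sentence.lower().split() if w in vectors]
    return np.mean(known, axis=0) if known else np.zeros(dim)

sentences = ["need water", "zzz qqq"]  # the second sentence has no known words

# One row per sentence, one column per embedding dimension.
X = np.vstack([embed(s, toy_vectors, EMBED_DIM) for s in sentences])

# Indices of sentences that came out as all-zero vectors.
zero_rows = np.flatnonzero(~X.any(axis=1))
print(X.shape, zero_rows)
```

Counting the zero rows is a quick way to see how many sentences the classifier effectively cannot distinguish.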

In [4]:
dataset = load_data("./data/triage")

def get_data(split: list[Example]) -> (list[str], list[int]):
    """
    Massage the data into a format consistent with the input type required by CountVectorizer or TfidfVectorizer. 

    Args:
        split: pass in the split, which should be either dataset.train or dataset.dev

    Returns: 
        text: list of sentences
        labels: list of labels  
    """
    # Extract texts and labels from the Example objects
    texts = [" ".join(example.words) for example in split]  
    labels = [example.label for example in split]

    return texts, labels

train_text, train_label = get_data(dataset.train)
dev_text, dev_label = get_data(dataset.dev)

In [5]:
#encoding the train text sentences
train_embed = []
for sentence in train_text:
    embedding = get_sentence_embedding(sentence, glove_vectors)
    train_embed.append(embedding)

#encoding the test text sentences
test_embed = []
for sentence in dev_text:
    embedding = get_sentence_embedding(sentence, glove_vectors)
    test_embed.append(embedding)


### Task 3: Logistic regression text classification with w2v

Feed your w2v encoded train data into the logistic regression classifier you worked with last week, except this time you should use the scikit-learn built-in function of logistic regression. Report the accuracy for train and dev datasets. 

In [6]:
#normalizing before feeding to the log regression -- this helps a lot with accuracy
from sklearn.preprocessing import normalize

train_embed = normalize(train_embed)
test_embed = normalize(test_embed)
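For reference, with its default arguments `normalize` rescales each row to unit L2 length. A minimal NumPy sketch of the same operation, assuming all-zero rows should be left unchanged, could look like this (the function name `l2_normalize_rows` is made up for the example):

```python
import numpy as np

def l2_normalize_rows(X) -> np.ndarray:
    """Scale every row to unit L2 norm; all-zero rows are returned unchanged."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    return X / norms

X_demo = [[3.0, 4.0], [0.0, 0.0]]
X_unit = l2_normalize_rows(X_demo)
print(X_unit)
```

After this step, only the direction of each sentence vector matters, not its length.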

In [7]:
logistic_reg = LogisticRegression(random_state=0, max_iter=10000).fit(train_embed, train_label)

#predicting using the train data
train_prediction = logistic_reg.predict(train_embed)
train_acc = accuracy_score(train_label, train_prediction)
print("train prediction accuracy", train_acc)

#now running the logistic regression on the test 
dev_prediction = logistic_reg.predict(test_embed)
test_acc = accuracy_score(dev_label, dev_prediction)
print("test prediction accuracy is:", test_acc)


train prediction accuracy 0.776204504418892
test prediction accuracy is: 0.7726389428682472


## PART II: Train your own w2v embeddings on the Triage training data and test it on the dev data

In this part we will train our w2v model based on the training dataset. First, you can read through the gensim package tutorial. Pay special attention to the ```training parameters section``` to understand the parameters in the ```Word2Vec``` function below. 

Assuming you have the ```train_text``` variable set up above, which is a list of sentences, we still need to break each sentence into a list of words. In the code below, we first do that, then take three steps to train a w2v model:

1. initialize model with ```Word2Vec()```
2. build your vocab
3. train the model. 

### Task 3: train w2v model with default parameters

Train the model using the code below, then reuse your code above to feed the training and dev text to your logistic regression model with the newly trained w2v embeddings. Note that to load the embedding for a word, you need to look it up with:

```word_emb = w2v_vector.wv[word]```

This is a little different from the pre-trained model, where you index the vectors object directly.
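One way to reuse the same sentence-embedding code for both cases is a small adapter that returns the lookup object either way. This is only a sketch: `as_keyed_vectors` is a hypothetical helper, and the two `Fake...` classes merely stand in for gensim objects so that the example runs without gensim installed.

```python
def as_keyed_vectors(model_or_vectors):
    """Return the object that supports `obj[word]` lookups.

    A trained model exposes its vectors under `.wv`; pretrained vectors are
    assumed to have no `.wv` attribute and are returned as they are.
    """
    return getattr(model_or_vectors, "wv", model_or_vectors)

class FakeKeyedVectors(dict):
    """Stand-in for pretrained vectors: indexed directly by word."""

class FakeWord2Vec:
    """Stand-in for a trained model: vectors live under `.wv`."""
    def __init__(self, wv):
        self.wv = wv

pretrained = FakeKeyedVectors({"how": [0.1, 0.2]})
trained = FakeWord2Vec(FakeKeyedVectors({"how": [0.3, 0.4]}))

print(as_keyed_vectors(pretrained)["how"])
print(as_keyed_vectors(trained)["how"])
```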

After training your logistic regression model, report accuracy for both training and dev data. 

### Part II: Loading the data

In [8]:
#redoing the splitting of each sentence into a list of words
dataset = load_data("./data/triage")

def get_data(split: list[Example]) -> (list[str], list[int]):
    """
    Massage the data into a format consistent with the input type required by CountVectorizer or TfidfVectorizer. 

    Args:
        split: pass in the split, which should be either dataset.train or dataset.dev

    Returns: 
        text: list of sentences
        labels: list of labels  
    """
    # Extract texts and labels from the Example objects
    texts = [" ".join(example.words) for example in split]  
    labels = [example.label for example in split]

    return texts, labels



train_text, train_label = get_data(dataset.train)
dev_text, dev_label = get_data(dataset.dev)

### Part II: training w2v model

In [9]:
#avoid printing all the epochs in log in jupyter notebook:
import logging 
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.WARNING)


In [12]:
from gensim.models import Word2Vec

W2V_MIN_COUNT = 5
W2V_WINDOW = 10 # Or try 5
W2V_EPOCH = 32
W2V_SIZE = 300


sentences = [tokenize(sentence) for sentence in train_text]

w2v_model = Word2Vec(vector_size=W2V_SIZE, 
                    window=W2V_WINDOW, 
                    min_count=W2V_MIN_COUNT, 
                    workers=8)

w2v_model.build_vocab(sentences, progress_per=10000)

w2v_model.train(sentences, total_examples=len(sentences), epochs=W2V_EPOCH, report_delay=1)



(11912516, 16427040)

### Part II: getting vectors from the train/test

In [19]:
# Convert sentences to vectors
def get_sentence_vector(sentence, model):
    words = tokenize(sentence)
    vectors = [model.wv[word] for word in words if word in model.wv]
    if vectors:
        return normalize(np.mean(vectors, axis=0).reshape(1, -1))[0]
    else:
        return np.zeros(model.vector_size)

#vectors from train text 
train_w2v = []
for sentence in train_text:
    embedding = get_sentence_vector(sentence, w2v_model)
    train_w2v.append(embedding)
    
    
test_w2v = []
for sentence in dev_text:
    embedding = get_sentence_vector(sentence, w2v_model)
    test_w2v.append(embedding)



In [20]:
#normalizing data before feeding to logistic regression

from sklearn.preprocessing import normalize

train_w2v = normalize(train_w2v)
test_w2v = normalize(test_w2v)


### Part II: Feeding train/test vectors to the logistic regression

In [21]:
# Train logistic regression model
# logistic_reg = LogisticRegression(random_state=0, max_iter=1000) 
logistic_reg = LogisticRegression(random_state=0, max_iter=10000).fit(train_w2v, train_label)

# Predict and report accuracy for training data
train_prediction = logistic_reg.predict(train_w2v)
train_acc = accuracy_score(train_label, train_prediction)
print("train prediction accuracy", train_acc)

# Predict and report accuracy for dev data
dev_prediction = logistic_reg.predict(test_w2v)
dev_acc = accuracy_score(dev_label, dev_prediction)
print("dev prediction accuracy", dev_acc)


train prediction accuracy 0.7640881877791504
dev prediction accuracy 0.7660318694131364


### Task 3.1: play with hyperparameters

Change the hyperparameters such as vector_size, window, min_count, etc., and train your w2v model again. Does the accuracy change? Report your findings. 
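A hyperparameter sweep like this can be organized with `itertools.product`, which enumerates every combination without deeply nested loops. In the sketch below the grid values and the `evaluate` function are placeholders: in practice `evaluate` would train a w2v model and a classifier and return the dev accuracy, whereas here it returns a dummy score so that the example runs without data.

```python
from itertools import product

# Hypothetical grid; the values are examples, not recommendations.
grid = {
    "vector_size": [100, 300],
    "window": [5, 10],
    "min_count": [5, 10],
}

def evaluate(params: dict) -> float:
    """Placeholder scoring function (NOT a real accuracy)."""
    return (params["vector_size"] / 1000
            + params["window"] / 100
            - params["min_count"] / 1000)

names = list(grid)
results = []
for values in product(*(grid[name] for name in names)):
    params = dict(zip(names, values))
    results.append((evaluate(params), params))

# Keep the combination with the highest score.
best_score, best_params = max(results, key=lambda item: item[0])
print(len(results), best_params)
```

Storing every `(score, params)` pair, rather than only the best one, makes it easy to report how the score changes with each hyperparameter.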

### Part 3.1 Changing hyperparameters:
### experimenting with logistic regression parameters

In [18]:
#tuning values - trying out from a list
C_values = [0.001, 0.01, 0.1, 1, 10, 100]
penalty_val = ['l1', 'l2']

best_accuracy = 0
best_c_val = None
best_penalty = None



#iterating through the C-values 

for c in C_values:
    for p in penalty_val:
        logistic_reg = LogisticRegression(random_state = 0, 
                                          max_iter = 10000, 
                                          C =c, 
                                          penalty = p, 
                                          solver = 'liblinear').fit(train_w2v, train_label)
        dev_prediction = logistic_reg.predict(test_w2v)
        dev_acc = accuracy_score(dev_label, dev_prediction)
        
        if dev_acc > best_accuracy:
            best_accuracy = dev_acc
            best_c_val = c
            best_penalty = p

print("best hyperparameters include: C-value of", best_c_val, "penalty of:", best_penalty, "resulting in accuracy of", best_accuracy, "in dev set.")

best hyperparameters include: C-value of 100 penalty of: l1 resulting in accuracy of 0.6315584920326467 in dev set.


### Experimenting with the hyperparameters of the w2v model

In [22]:
import logging 
from gensim.models import Word2Vec
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

# Define the range of hyperparameters
vector_sizes = [350]
window_sizes = [5, 8, 10]
min_counts = [5, 10, 15]
epochs = [20, 30, 40, 100]
train_accuracies  = []
best_acc = 0
best_params = None

sentences = [tokenize(sentence) for sentence in train_text]

# Iterate through all combinations of hyperparameter values
for vector_size in vector_sizes:
    for window_size in window_sizes:
        for min_count in min_counts:
            for epoch in epochs:
                # Train Word2Vec model
                w2v_model = Word2Vec(vector_size=vector_size, window=window_size, min_count=min_count, workers=8)
                w2v_model.build_vocab(sentences, progress_per=10000)
                w2v_model.train(sentences, total_examples=len(sentences), epochs=epoch, report_delay=1)
                
                # Convert text to vectors and normalize
                train_w2v = normalize([get_sentence_vector(sentence, w2v_model) for sentence in train_text])
                test_w2v = normalize([get_sentence_vector(sentence, w2v_model) for sentence in dev_text])

                
                # Train logistic regression model
                logistic_reg = LogisticRegression(random_state=0, max_iter=10000).fit(train_w2v, train_label)
                
                # Evaluate model
                dev_prediction = logistic_reg.predict(test_w2v)
                dev_acc = accuracy_score(dev_label, dev_prediction)
                
                # Evaluate model on training set
                train_prediction = logistic_reg.predict(train_w2v)
                train_acc = accuracy_score(train_label, train_prediction)
                
                # Update best hyperparameters
                if dev_acc > best_acc:
                    best_acc = dev_acc
                    best_params = (vector_size, window_size, min_count, epoch)
                    
                # Save train accuracy
                train_accuracies.append(train_acc)



In [23]:

# Print the best hyperparameter values and the corresponding accuracy
print("Best hyperparameters:")
print("Vector size:", best_params[0])
print("Window size:", best_params[1])
print("Min count:", best_params[2])
print("Epochs:", best_params[3])
print("Accuracy on dev set:", best_acc)
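# Note: this is the highest train accuracy over all runs, which is not necessarily that of the best dev model
print("Accuracy on train set:", max(train_accuracies))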


Best hyperparameters:
Vector size: 350
Window size: 10
Min count: 5
Epochs: 100
Accuracy on dev set: 0.773416245627672
Accuracy on train set: 0.7705027083531313


## Wrapping up

Compile the scores from the models and their input vectors from these three weeks. Compare the train and dev accuracy for these configurations: 

- NB
- logistic regression (LR) with unigram count vectors
- LR with unigram+bigram count vectors
- LR with tfidf vectors
- LR with pre-trained w2v vectors
- LR with custom trained w2v vectors

Then analyze the accuracy results on the train and dev data. What do you see when comparing the different methods and input representations? What trends do you see in train and dev accuracy?

In [28]:

nb_train_acc = 0.8446735721752352
nb_dev_acc = 0.7306645938593082

lr_tfidf_train_acc = 0.9354271595552599
lr_tfidf_dev_acc = 0.757870190439176

lr_unigram_train_acc = 0.8003896227311603
lr_unigram_dev_acc=  0.7407695297318305


lr_unigram_bigram_train_acc=  0.9791884443599734
lr_unigram_bigram_dev_acc = 0.7629226583754373

lr_pretrained_w2v_train_acc = 0.7616649244512022
lr_pretrained_w2v_dev_acc = 0.7586474931986008


lr_custom_w2v_train_acc = 0.7705027083531313
lr_custom_w2v_dev_acc = 0.773416245627672


In [29]:
import pandas as pd
from tabulate import tabulate


# Your data
data = {
    "Model": ["NB", "LR (unigram)", "LR (unigram+bigram)", "LR (tfidf)", "LR (pre-trained w2v)", "LR (custom w2v)"],
    "Train Accuracy": [nb_train_acc, lr_unigram_train_acc, lr_unigram_bigram_train_acc, lr_tfidf_train_acc, lr_pretrained_w2v_train_acc, lr_custom_w2v_train_acc],
    "Dev Accuracy": [nb_dev_acc, lr_unigram_dev_acc, lr_unigram_bigram_dev_acc, lr_tfidf_dev_acc, lr_pretrained_w2v_dev_acc, lr_custom_w2v_dev_acc]
}

df = pd.DataFrame(data)

# Format columns with accuracies to have 3 digits after the decimal point
df[['Train Accuracy', 'Dev Accuracy']] = df[['Train Accuracy', 'Dev Accuracy']].round(3)

# Print the dataframe in markdown format
print(tabulate(df, headers='keys', tablefmt='pipe', showindex=False))

| Model                |   Train Accuracy |   Dev Accuracy |
|:---------------------|-----------------:|---------------:|
| NB                   |            0.845 |          0.731 |
| LR (unigram)         |            0.8   |          0.741 |
| LR (unigram+bigram)  |            0.979 |          0.763 |
| LR (tfidf)           |            0.935 |          0.758 |
| LR (pre-trained w2v) |            0.762 |          0.759 |
| LR (custom w2v)      |            0.771 |          0.773 |


### Analysis & Observations

#### NaiveBayes:
Naive Bayes reaches a train accuracy of 0.845 but a test accuracy of only 0.731, which is the lowest test accuracy of all six configurations.

#### LR (unigram):
For LR with unigram counts, the train and dev accuracies (0.800 vs. 0.741) are closer together than for NaiveBayes, suggesting less overfitting. The training accuracy was higher with NaiveBayes, but the test accuracy was higher with LR unigram.

#### LR (Unigram + Bigram):
This model has the highest train accuracy of all (0.979), but the gap between its train and test accuracies is also the largest (0.979 vs. 0.763), suggesting overfitting. Even so, its test accuracy is the second highest overall.

#### LR (tfidf):
As with the previous model, we observe a very high train accuracy (0.935) but a much lower test accuracy (0.758). This suggests that the model overfits the train data. The test accuracy is also slightly lower than that of the unigram+bigram model.

#### Pre-trained w2v:
The train accuracy here is the lowest of all models (0.762), but the train and test accuracies are nearly identical (0.762 vs. 0.759). The model performs about as well on unseen data as on the train data, which suggests that it generalizes well.

#### LR (custom w2v):
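While the train accuracy here is modest (0.771), we notice something unusual: the test accuracy (0.773) is slightly higher than the train accuracy. The difference is only about 0.002, so it may simply be noise rather than a real effect. The closeness of the train and test accuracies again suggests good generalization.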

#### Overall conclusions:
We notice certain trends among the models. The models that use word embeddings have closer train and test accuracies than the rest of the models. While the highest train accuracy is achieved with the LR unigram + bigram model, that model does not have the highest test accuracy. Rather, the LR custom w2v model reaches a test accuracy of 0.773, the highest observed so far. To conclude, the models that generalize best and overfit least are the word-embedding models, especially the one using custom w2v.
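The overfitting comparison can be made explicit by computing the train-minus-dev gap for each configuration. The sketch below uses the rounded accuracies from the summary table above; the variable names are chosen for this example only.

```python
# (train accuracy, dev accuracy), copied from the summary table above.
scores = {
    "NB": (0.845, 0.731),
    "LR (unigram)": (0.800, 0.741),
    "LR (unigram+bigram)": (0.979, 0.763),
    "LR (tfidf)": (0.935, 0.758),
    "LR (pre-trained w2v)": (0.762, 0.759),
    "LR (custom w2v)": (0.771, 0.773),
}

# A large positive gap means the model does much better on train than on dev.
gaps = {model: round(train - dev, 3) for model, (train, dev) in scores.items()}

# Sort from the largest gap (most overfitting) to the smallest.
ranked = sorted(gaps, key=gaps.get, reverse=True)

for model in ranked:
    print(f"{model:<22} gap = {gaps[model]:+.3f}")
```

Sorted this way, the count-based and tfidf models appear at the top of the list and the two w2v models at the bottom, which matches the observations above.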