# HW6.1 Text classification with w2v

We will use the same Triage dataset as in the last two weeks. Here is what's in this assignment. 

1. We will explore text classification using pre-trained w2v embeddings with logistic regression. 

2. We will explore text classification with w2v embeddings trained on the Triage training dataset and then test it on the dev dataset. 

## PART I: Using pre-trained w2v embeddings for text classification

For data loading, you should use the same code as last time so that you can obtain train_text, train_label, dev_text, dev_label, etc. 

To get pretrained w2v embeddings, we can use the ```gensim``` library. You can do 

```!pip install gensim```

to get it first. 

Once you have installed the library, you can take a look at which pretrained embeddings are available for you to download. 

```
import gensim.downloader
#Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

```

You should see a list of available pretrained embeddings like this: 

```
['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']
```

We recommend trying out a few, like the 'glove-wiki-gigaword-300' and 'word2vec-google-news-300'. To download the embeddings: 

```
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-300')
```

Once you have loaded the embeddings into a variable, you can do many things with them. For instance, you can find the words most similar to a query word: 

```
glove_vectors.most_similar('how')
```
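To make the idea of "most similar" concrete: gensim ranks neighbours by cosine similarity between word vectors. The sketch below reproduces that ranking on a toy vocabulary. The 2-d vectors and the helper name `cosine_similarity` are made up for illustration, and a plain dict stands in for the downloaded embeddings.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-d "embeddings" with made-up values (real ones are 300-d).
toy_vectors = {
    "how": np.array([1.0, 0.0]),
    "what": np.array([1.0, 1.0]),
    "cat": np.array([0.0, 1.0]),
}

query = "how"
# Rank every other word by its similarity to the query, highest first.
ranked = sorted(
    (w for w in toy_vectors if w != query),
    key=lambda w: cosine_similarity(toy_vectors[query], toy_vectors[w]),
    reverse=True,
)
print(ranked)  # ['what', 'cat']
```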

You can also look at the embedding of a word: 

```
word = "how"
word_embedding = glove_vectors[word]

```

With tfidf, a sentence or document is naturally represented as a single vector with one entry per vocabulary item. With w2v, however, you have a vector for each word but not for a sentence (alternatively, you can use something called doc2vec to encode a sentence directly). The most common way to get a sentence vector from word vectors is to look up the embedding of each word and take the average of all the word embeddings. If each word is a 300-d vector, then the final sentence vector is also 300-d. 
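As a small illustration of the averaging step, here is a sketch on toy 3-d vectors. The values and the helper name `average_embedding` are invented for this example, and every word is assumed to be in the vocabulary.

```python
import numpy as np

# Toy 3-d "embeddings" with made-up values, standing in for real 300-d vectors.
toy_vectors = {
    "chest": np.array([1.0, 0.0, 2.0]),
    "pain": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(words, vectors):
    """Average the vectors of the given words (all assumed to be in `vectors`)."""
    return np.mean([vectors[w] for w in words], axis=0)

sentence_vector = average_embedding(["chest", "pain"], toy_vectors)
print(sentence_vector)        # [2. 1. 1.]
print(sentence_vector.shape)  # (3,)
```

The sentence vector has the same dimensionality as the word vectors, no matter how many words the sentence contains.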

### Task 1: Write a ```get_sentence_embedding()``` function. 

First, you need to write a function that computes a sentence embedding from the embeddings of its words. Note that when you look up a word in the pretrained w2v model, there is no guarantee that the word is in its vocabulary. If it is not, the lookup raises an error (a ```KeyError```). In your code, you should build in error handling to take care of this situation. If a word is not present in the vocabulary, you should represent it with a 300-d zero vector using ```numpy.zeros()```. 
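Zero-filling is one of two common ways to handle unknown words; the other is to skip them. The sketch below contrasts the two on toy data, since they give different averages. A plain dict stands in for the gensim object, and the function names and the word "xyzzy" are hypothetical.

```python
import numpy as np

# Toy 2-d vectors with made-up values; "xyzzy" is deliberately missing.
toy_vectors = {"chest": np.array([2.0, 4.0])}
DIM = 2

def mean_zero_fill(words, vectors, dim):
    """Out-of-vocabulary words contribute a zero vector to the average."""
    rows = [vectors.get(w, np.zeros(dim)) for w in words]
    return np.mean(rows, axis=0) if rows else np.zeros(dim)

def mean_skip_oov(words, vectors, dim):
    """Out-of-vocabulary words are left out of the average entirely."""
    rows = [vectors[w] for w in words if w in vectors]
    return np.mean(rows, axis=0) if rows else np.zeros(dim)

words = ["chest", "xyzzy"]
print(mean_zero_fill(words, toy_vectors, DIM))  # [1. 2.]
print(mean_skip_oov(words, toy_vectors, DIM))   # [2. 4.]
```

With zero-filling, each unknown word pulls the sentence vector toward the origin. With skipping, the vector reflects only the known words.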

For this task, let's use the pre-trained Google News 300-d vectors (```'word2vec-google-news-300'```). 

In [19]:
%%capture

import numpy as np
def get_sentence_embedding(sentence:str,glove_vectors)->np.ndarray:
    """
    function to get embedding of a sentence from the words in it using w2v

    args:
        sentence: the input sentence to compute embeddings from 
        glove_vectors: the pretrained w2v object where you can look up word embeddings
    returns:
        a numpy ndarray with the same dimension as the pretrained w2v embeddings
    """
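    words = sentence.split()
    dim = glove_vectors.vector_size  # 300 for the pretrained model used here
    embeddings = []

    for word in words:
        try:
            embedding = glove_vectors[word]
        except KeyError:
            # word is not in the pretrained vocabulary: use a zero vector
            embedding = np.zeros(dim)
        embeddings.append(embedding)

    # empty sentence: return a zero vector instead of averaging nothing
    if not embeddings:
        return np.zeros(dim)

    avg_embedding = np.mean(embeddings, axis=0)

    return avg_embedding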

import gensim.downloader
glove_vectors = gensim.downloader.load('word2vec-google-news-300')

INFO - 17:31:57: loading projection weights from /Users/asze01/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz
INFO - 17:32:20: KeyedVectors lifecycle event {'msg': 'loaded (3000000, 300) matrix of type float32 from /Users/asze01/gensim-data/word2vec-google-news-300/word2vec-google-news-300.gz', 'binary': True, 'encoding': 'utf8', 'datetime': '2023-10-19T17:32:20.689223', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul  5 2023, 14:49:34) [Clang 14.0.6 ]', 'platform': 'macOS-12.5-arm64-arm-64bit', 'event': 'load_word2vec_format'}


### Task 2: Encode your input sentences from the train and dev portions of the Triage dataset into vector representations 

Last week we saw how to use tf-idf vectors to represent sentences and use them in a classifier. Here we just need to similarly turn all training and dev sentences into vectors, but using w2v. 

Make use of the function above and go through all sentences in your train data and dev data. One possibility is that all of the words in a sentence may be absent from your pretrained w2v dictionary. In that case, it would just come out as a zero vector for the whole sentence, which may not be ideal but let's keep it simple. 
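To see how often the all-words-missing case actually happens, you can count the rows of the embedding matrix that are entirely zero. This is a minimal sketch on a made-up matrix; in the assignment, `embeddings` would be your train or dev matrix.

```python
import numpy as np

# Made-up embedding matrix: 4 sentences, 3 dimensions.
embeddings = np.array([
    [0.1, -0.2, 0.3],
    [0.0, 0.0, 0.0],   # a sentence with no known words
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0],   # another one
])

# A row is "all zero" when none of its entries is non-zero.
zero_rows = ~embeddings.any(axis=1)
print(zero_rows.sum())            # 2
print(np.flatnonzero(zero_rows))  # [1 3]
```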

In [20]:
from util import load_data, Dataset, Example
import numpy as np
dataset = load_data("./data/triage")


def get_data(split: list[Example]) -> tuple[list[str], list[int]]:
    """
    massage the data into a format consistent with the input type required by CountVectorizer or tfidfVectorizer. 

    args 
        split: pass in the split, which should be either dataset.train or dataset.dev

    returns: 
        text: list of sentences
        labels: list of labels  

    """
    text = [' '.join(example.words) for example in split]
    labels = [example.label for example in split]
    return text, labels
    

# Load data using get_data
train_text, train_label = get_data(dataset.train)
dev_text, dev_label = get_data(dataset.dev)

# Write code to go through all sentences in the train and dev data respectively 
# and encode them into vectors using the function you wrote above with w2v
# Make sure the final matrix for your training set and dev set are represented in
# numpy arrays, not list of lists. 

train_embeddings = [get_sentence_embedding(sentence, glove_vectors) for sentence in train_text]

dev_embeddings = [get_sentence_embedding(sentence, glove_vectors) for sentence in dev_text]

train_embeddings_array = np.array(train_embeddings)
dev_embeddings_array = np.array(dev_embeddings)

### Task 3: Logistic regression text classification with w2v

Feed your w2v-encoded training data into a logistic regression classifier as you did last week, except this time use scikit-learn's built-in ```LogisticRegression```. Report the accuracy for the train and dev datasets. 

In [21]:
# code for logistic regression with scikit-learn library.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression(max_iter=10000)

model.fit(train_embeddings_array, train_label)
train_predictions = model.predict(train_embeddings_array)
dev_predictions = model.predict(dev_embeddings_array)

train_accuracy = accuracy_score(train_label, train_predictions)
dev_accuracy = accuracy_score(dev_label, dev_predictions)

print(f"Training Accuracy: {train_accuracy*100:.2f}%")
print(f"Development Accuracy: {dev_accuracy*100:.2f}%")

Training Accuracy: 76.43%
Development Accuracy: 76.25%
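When reading accuracy numbers like these, it can help to compare them with what you would get by always predicting the most frequent training label. The sketch below computes accuracy by hand and that majority baseline. The function names and labels are made up for illustration; in the assignment you would pass `train_label` and `dev_label`.

```python
from collections import Counter

def accuracy(gold, predicted):
    """Fraction of positions where the prediction matches the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def majority_baseline(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    return accuracy(eval_labels, [majority_label] * len(eval_labels))

# Made-up labels, for illustration only.
toy_train = [0, 0, 0, 1]
toy_dev = [0, 1, 0, 0, 1]
print(accuracy(toy_dev, [0, 1, 1, 0, 1]))     # 0.8
print(majority_baseline(toy_train, toy_dev))  # 0.6
```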


## PART II: Train your own w2v embeddings on the Triage training data and test them on the dev data

In this part we will train our w2v model based on the training dataset. First, you can read through the gensim package tutorial. Pay special attention to the ```training parameters section``` to understand the parameters in the ```Word2Vec``` function below. 

Assuming you have the ```train_text``` variable set up above, which is a list of sentences, we still need to break each sentence into a list of words. In the code below, we first do that, then take three steps to train a w2v model:

1. initialize model with ```Word2Vec()```
2. build your vocab
3. train the model. 

### Task 3: Train a w2v model with default parameters

Train the model using the code below, and then reuse your code from above to feed your training and dev text to your logistic regression model with this newly trained w2v model. Note that to load the embedding for a word, you need to look it up with: 

```word_emb = w2v_model.wv[word]```

This is a little different from the pre-trained model, where you index the loaded object directly. 
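If you want a single embedding function to work with both kinds of object, one option is a small adapter that returns whatever supports `vectors[word]` lookups. This is only a sketch: the helper name `as_keyed_vectors` is hypothetical, and it assumes the pre-trained object has no `.wv` attribute of its own, which is worth checking for your gensim version. Stand-in classes are used so the example runs without gensim.

```python
def as_keyed_vectors(model_or_vectors):
    """Return the object that supports `vectors[word]` lookups.

    A trained Word2Vec model keeps its word vectors under `.wv`, while the
    pre-trained object is indexed directly, so fall back to the object itself.
    """
    return getattr(model_or_vectors, "wv", model_or_vectors)


# Stand-in classes (hypothetical, for illustration only).
class FakeVectors(dict):
    """Plays the role of the pre-trained vectors object."""


class FakeModel:
    """Plays the role of a trained Word2Vec model."""

    def __init__(self, wv):
        self.wv = wv


vectors = FakeVectors({"how": [0.1, 0.2]})
model = FakeModel(vectors)

print(as_keyed_vectors(vectors)["how"])  # [0.1, 0.2]
print(as_keyed_vectors(model)["how"])    # [0.1, 0.2]
```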

After training your logistic regression model, report accuracy for both training and dev data. 

In [22]:
%%capture

import logging 
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

from gensim.models import Word2Vec
W2V_SIZE = 300
W2V_WINDOW = 8
W2V_EPOCH = 32
W2V_MIN_COUNT = 10

sentences = [s.split() for s in train_text]

w2v_model = Word2Vec(vector_size=W2V_SIZE, 
                    window=W2V_WINDOW, 
                    min_count=W2V_MIN_COUNT, 
                    workers=8)

w2v_model.build_vocab(sentences, progress_per=10000)

w2v_model.train(sentences, total_examples=len(sentences), epochs=W2V_EPOCH, report_delay=1)


INFO - 17:32:28: Word2Vec lifecycle event {'params': 'Word2Vec<vocab=0, vector_size=300, alpha=0.025>', 'datetime': '2023-10-19T17:32:28.035949', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul  5 2023, 14:49:34) [Clang 14.0.6 ]', 'platform': 'macOS-12.5-arm64-arm-64bit', 'event': 'created'}
INFO - 17:32:28: collecting all words and their counts
INFO - 17:32:28: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 17:32:28: PROGRESS: at sentence #10000, processed 241164 words, keeping 21647 word types
INFO - 17:32:28: PROGRESS: at sentence #20000, processed 481958 words, keeping 31476 word types
INFO - 17:32:28: collected 32305 word types from a corpus of 506449 raw words and 21046 sentences
INFO - 17:32:28: Creating a fresh vocabulary
INFO - 17:32:28: Word2Vec lifecycle event {'msg': 'effective_min_count=10 retains 4573 unique words (14.16% of original 32305, drops 27732)', 'datetime': '2023-10-19T17:32:28.104689', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul 

In [23]:
# YOUR CODE HERE
# to train a logistic regression model with your new w2v embeddings
# report accuracy for training and dev data

# code for logistic regression with scikit-learn library.

def get_sentence_embedding2(sentence, w2v_model):
    """
    This function takes a sentence and w2v model as input.
    It returns the average embedding of the sentence.
    """
    words = sentence.split()
    embeddings = []
    
    for word in words:
        if word in w2v_model.wv:
            embeddings.append(w2v_model.wv[word])
    
    if not embeddings:
        return np.zeros(W2V_SIZE)
    
    sentence_embedding = np.mean(embeddings, axis=0)
    
    return sentence_embedding

X_train = np.array([get_sentence_embedding2(sentence, w2v_model) for sentence in train_text])
X_dev = np.array([get_sentence_embedding2(sentence, w2v_model) for sentence in dev_text])

clf = LogisticRegression(max_iter=1000).fit(X_train, train_label)

train_pred = clf.predict(X_train)
dev_pred = clf.predict(X_dev)

train_acc = accuracy_score(train_label, train_pred)
dev_acc = accuracy_score(dev_label, dev_pred)

print(f"Training Accuracy: {train_acc*100:.2f}%")
print(f"Dev Accuracy: {dev_acc*100:.2f}%")

Training Accuracy: 75.76%
Dev Accuracy: 74.70%


### Task 3.1: Play with hyperparameters

Change the hyperparameters such as vector_size, window, min_count, etc., and train your w2v model again. Does the accuracy change? Report your findings. 

In [33]:
def get_sentence_embedding3(sentence, w2v_model):
    words = sentence.split()
    embeddings = []
    
    for word in words:
        if word in w2v_model.wv:
            embeddings.append(w2v_model.wv[word])
    
    if not embeddings:
        return np.zeros(w2v_model.vector_size)  # Ensure embeddings are of the same size as the model's vector size
    
    sentence_embedding = np.mean(embeddings, axis=0)
    
    return sentence_embedding

hyperparameters = [
    {"vector_size": 100, "window": 5, "min_count": 5},
    {"vector_size": 200, "window": 10, "min_count": 5},
    {"vector_size": 300, "window": 30, "min_count": 5}
]

for params in hyperparameters:
    w2v_model = Word2Vec(sentences=sentences, **params, workers=8)
    
    X_train = [get_sentence_embedding3(sentence, w2v_model) for sentence in train_text]
    X_dev = [get_sentence_embedding3(sentence, w2v_model) for sentence in dev_text]
    
    X_train = np.array(X_train)
    X_dev = np.array(X_dev)

    assert np.all([len(x) == w2v_model.vector_size for x in X_train])
    assert np.all([len(x) == w2v_model.vector_size for x in X_dev])
    
    clf = LogisticRegression(max_iter=1000).fit(X_train, train_label)
    
    train_acc = accuracy_score(train_label, clf.predict(X_train))
    dev_acc = accuracy_score(dev_label, clf.predict(X_dev))
    
    print(f"Parameters: {params}")
    print(f"Training Accuracy: {train_acc:.4f}")
    print(f"Dev Accuracy: {dev_acc:.4f}")
    print("-"*40)


INFO - 18:22:18: collecting all words and their counts
INFO - 18:22:18: PROGRESS: at sentence #0, processed 0 words, keeping 0 word types
INFO - 18:22:18: PROGRESS: at sentence #10000, processed 241164 words, keeping 21647 word types
INFO - 18:22:18: PROGRESS: at sentence #20000, processed 481958 words, keeping 31476 word types
INFO - 18:22:18: collected 32305 word types from a corpus of 506449 raw words and 21046 sentences
INFO - 18:22:18: Creating a fresh vocabulary
INFO - 18:22:18: Word2Vec lifecycle event {'msg': 'effective_min_count=5 retains 7451 unique words (23.06% of original 32305, drops 24854)', 'datetime': '2023-10-19T18:22:18.482137', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul  5 2023, 14:49:34) [Clang 14.0.6 ]', 'platform': 'macOS-12.5-arm64-arm-64bit', 'event': 'prepare_vocab'}
INFO - 18:22:18: Word2Vec lifecycle event {'msg': 'effective_min_count=5 leaves 469185 word corpus (92.64% of original 506449, drops 37264)', 'datetime': '2023-10-19T18:22:18.482751', 'gensi

Parameters: {'vector_size': 100, 'window': 5, 'min_count': 5}
Training Accuracy: 0.7120
Dev Accuracy: 0.7151
----------------------------------------


INFO - 18:22:20: Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2023-10-19T18:22:20.834904', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul  5 2023, 14:49:34) [Clang 14.0.6 ]', 'platform': 'macOS-12.5-arm64-arm-64bit', 'event': 'build_vocab'}
INFO - 18:22:20: Word2Vec lifecycle event {'msg': 'training model with 8 workers on 7451 vocabulary and 200 features, using sg=0 hs=0 sample=0.001 negative=5 window=10 shrink_windows=True', 'datetime': '2023-10-19T18:22:20.870967', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul  5 2023, 14:49:34) [Clang 14.0.6 ]', 'platform': 'macOS-12.5-arm64-arm-64bit', 'event': 'train'}
INFO - 18:22:21: EPOCH 0: training on 506449 raw words (366485 effective words) took 0.1s, 2511619 effective words/s
INFO - 18:22:21: EPOCH 1: training on 506449 raw words (366592 effective words) took 0.2s, 2443631 effective words/s
INFO - 18:22:21: EPOCH 2: training on 506449 raw words (366613 effective words) took 0.1s, 2602524 effective word

Parameters: {'vector_size': 200, 'window': 10, 'min_count': 5}
Training Accuracy: 0.7090
Dev Accuracy: 0.7070
----------------------------------------


INFO - 18:22:24: resetting layer weights
INFO - 18:22:24: Word2Vec lifecycle event {'update': False, 'trim_rule': 'None', 'datetime': '2023-10-19T18:22:24.928456', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul  5 2023, 14:49:34) [Clang 14.0.6 ]', 'platform': 'macOS-12.5-arm64-arm-64bit', 'event': 'build_vocab'}
INFO - 18:22:24: Word2Vec lifecycle event {'msg': 'training model with 8 workers on 7451 vocabulary and 300 features, using sg=0 hs=0 sample=0.001 negative=5 window=30 shrink_windows=True', 'datetime': '2023-10-19T18:22:24.929780', 'gensim': '4.3.2', 'python': '3.10.12 (main, Jul  5 2023, 14:49:34) [Clang 14.0.6 ]', 'platform': 'macOS-12.5-arm64-arm-64bit', 'event': 'train'}
INFO - 18:22:25: EPOCH 0: training on 506449 raw words (366541 effective words) took 0.2s, 1712789 effective words/s
INFO - 18:22:25: EPOCH 1: training on 506449 raw words (366665 effective words) took 0.2s, 1736577 effective words/s
INFO - 18:22:25: EPOCH 2: training on 506449 raw words (366341 effective

Parameters: {'vector_size': 300, 'window': 30, 'min_count': 5}
Training Accuracy: 0.7141
Dev Accuracy: 0.7105
----------------------------------------


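The logs above show the training details for each run. Across the three settings, accuracy changes only slightly: dev accuracy ranges from 0.7070 to 0.7151 and training accuracy from 0.7090 to 0.7141, so changing the vector size and window does move both numbers, but not by much. All three runs score lower than the Task 3 model (74.70% dev accuracy). Note, however, that they use ```min_count=5``` and gensim's default number of epochs rather than ```min_count=10``` and 32 epochs, so the comparison is not like-for-like. To find the best values, it would make sense to search a much larger range of settings with a tool such as grid search. 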
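A larger search could start from a grid of candidate values and enumerate every combination. The sketch below only builds the list of parameter dictionaries, each of which could then be passed to `Word2Vec(sentences=sentences, **params)` as in the loop above. The candidate values are illustrative, not tuned recommendations.

```python
from itertools import product

# Candidate values are illustrative, not tuned recommendations.
grid = {
    "vector_size": [100, 300],
    "window": [5, 10],
    "min_count": [5, 10],
}

# One dict per combination of values, in the order the keys are listed.
param_sets = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(param_sets))  # 8
print(param_sets[0])    # {'vector_size': 100, 'window': 5, 'min_count': 5}
```

The number of combinations grows multiplicatively with each added hyperparameter, so it is worth keeping each list short.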

## Wrapping up

Compile the scores from the models and their input vectors from these three weeks. Compare the train and dev accuracy for these configurations: 

- NB
- logistic regression (LR) with unigram count vectors
- LR with unigram+bigram count vectors
- LR with tfidf vectors
- LR with pre-trained w2v vectors
- LR with custom trained w2v vectors

Then analyze the accuracy results on the train and dev data. What do you see when comparing the different methods and input representations? What trends do you see in train and dev accuracy? 

NB:
Final train classification_rate: 0.82946878266654
Final dev(test) classification_rate: 0.7329965021375826

Naive Bayes is a probabilistic model and has decent accuracy, but most of the time it isn't as good at classification as logistic regression. It is, however, a good baseline for comparison with the following models.

logistic regression (LR) with unigram count vectors:
Final train classification_rate: 0.9753397320155849
Final test classification_rate: 0.7271667314418966

LR with unigram+bigram count vectors:
Final train classification_rate: 0.9992872754917799
Final test classification_rate: 0.743101438010105

LR with unigram and with unigram+bigram count vectors reaches very high training accuracy but much lower test accuracy. The unigram+bigram LR does a bit better than the NB model (0.743 vs. 0.733), while the unigram-only LR does slightly worse (0.727 vs. 0.733). The large gap between training and test accuracy indicates overfitting. This isn't that surprising because we didn't do any regularization.

LR with tfidf vectors:
Final train classification_rate: 0.9338591656371757
Final test classification_rate: 0.759813447337738

LR with TFIDF, even though its training classification rate is not as high, actually has a higher final test classification rate than any of the above models. This indicates that this model generalizes better than the above models.

LR with pre-trained w2v vectors:
Training Accuracy: 76.43%
Development (test) Accuracy: 76.25%

LR with custom trained w2v vectors:
Training Accuracy: 75.76%
Dev (test) Accuracy: 74.70%

LR with w2v has training and testing accuracies that are much closer to each other than those of the other models. This indicates that there is less overfitting. The pre-trained w2v model performs a bit better than the custom-trained w2v model, which is likely a result of the pre-trained model having a much bigger training corpus. 

Looking at overall train and test accuracy trends, the models that use large sparse input vectors (unigram, unigram+bigram) overfit heavily, as the big difference between their training and test classification rates shows. Models that use a dense vector representation like w2v have much closer train and test classification rates. This could be because a 300-d averaged embedding gives the classifier far fewer features to memorize than a vocabulary-sized count vector, and because w2v captures the semantics of the text rather than every surface token, which reduces noise. Finally, the best models for classification among those above are LR with TFIDF and LR with pre-trained w2v, because they have the highest test/dev accuracy (0.7598 and 0.7625).

The conclusion we can draw from this homework is that how the text is turned into vectors matters a great deal for building a model that generalizes to unseen data. High training accuracy doesn't necessarily mean high accuracy on new data, and can indicate overfitting. The dense w2v representation shows almost no gap between train and dev accuracy. TFIDF weighting is still sparse and still shows a sizable gap, but it gives better dev accuracy than raw counts.
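The comparison above can also be laid out as a small table sorted by dev accuracy, with the train/dev gap as a rough indicator of overfitting. The numbers are the ones reported above, rounded to four decimals. The dictionary layout and column widths are just one way to present them.

```python
# (train accuracy, dev accuracy) as reported above, rounded to four decimals.
results = {
    "NB": (0.8295, 0.7330),
    "LR unigram counts": (0.9753, 0.7272),
    "LR unigram+bigram counts": (0.9993, 0.7431),
    "LR tfidf": (0.9339, 0.7598),
    "LR pre-trained w2v": (0.7643, 0.7625),
    "LR custom w2v": (0.7576, 0.7470),
}

# Sort by dev accuracy, best first.
rows = sorted(results.items(), key=lambda kv: kv[1][1], reverse=True)

print(f"{'model':<26}{'train':>8}{'dev':>8}{'gap':>8}")
for name, (train, dev) in rows:
    print(f"{name:<26}{train:>8.4f}{dev:>8.4f}{train - dev:>8.4f}")

# The configuration with the largest train/dev gap.
largest_gap = max(results, key=lambda name: results[name][0] - results[name][1])
print(largest_gap)  # LR unigram+bigram counts
```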