# HW6.1 Text classification with w2v

We will use the same Triage dataset as in the last two weeks. Here is what's in this assignment.

1. We will explore text classification using pre-trained w2v embeddings with logistic regression.

2. We will explore text classification using w2v embeddings trained on the Triage training dataset and then test it on the dev dataset.

## PART I: Using pre-trained w2v embeddings for text classification

For data loading, you should use the same code as last time so that you can obtain train_text, train_label, dev_text, dev_label, etc.

To get pretrained w2v embeddings, we can use the ```gensim``` library. Install it first with

```!pip install gensim```

Once you have installed the library, you can take a look at which pretrained embeddings are available for you to download.

```
import gensim.downloader
#Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

```

You should see a list of available pretrained embeddings like this:

```
['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']
```

We recommend trying out a few, like the 'glove-wiki-gigaword-300' and 'word2vec-google-news-300'. To download the embeddings:

```
glove_vectors = gensim.downloader.load('glove-wiki-gigaword-300')
```

Once you have downloaded the embeddings into a variable, you can do many things with them. For instance, you can find the words most similar to a query word:

```
glove_vectors.most_similar('how')
```

You can also look at the embedding of a word:

```
word = "how"
word_embedding = glove_vectors[word]

```

With tf-idf, a sentence or document is naturally represented as a single vector with one dimension per vocabulary item. With w2v, however, you have a vector for each word but not for a sentence (alternatively, you can use something called doc2vec to encode a sentence directly). The most common way to get a sentence vector from word vectors is to look up the embedding of each word and take the average of all the word embeddings. If each word is a 300-d vector, then the final sentence vector is also 300-d.
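The averaging idea can be sketched on toy data. In the sketch below, a plain dictionary of made-up 3-d vectors stands in for the pretrained model; the names `toy_vectors` and `average_embedding` are hypothetical and only serve the illustration.

```python
import numpy as np

# Toy 3-d "embeddings" standing in for a real pretrained model (made-up values).
toy_vectors = {
    "need": np.array([1.0, 0.0, 2.0]),
    "water": np.array([3.0, 2.0, 0.0]),
}

def average_embedding(sentence, vectors):
    """Average the vectors of the whitespace-separated words in `sentence`."""
    word_vecs = [vectors[w] for w in sentence.split()]
    return np.mean(word_vecs, axis=0)

sentence_vec = average_embedding("need water", toy_vectors)
print(sentence_vec)        # [2. 1. 1.]
print(sentence_vec.shape)  # (3,) - same dimensionality as each word vector
```

The sketch assumes every word is in the dictionary; handling missing words is the subject of Task 1.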

### Task 1: Write a ```get_sentence_embedding()``` function.

First, you need to write a function that computes a sentence embedding from the embeddings of its words. Note that when you look up a word in the pretrained w2v model, there is no guarantee that the word is in the w2v dictionary. If it is not, you will get an error when you look it up. Your code should include error handling for this situation: if a word is not present in the dictionary, you should initialize it with a 300-d zero vector using ```numpy.zeros()```.
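One possible shape for the per-word error handling is sketched below. A plain dictionary stands in for the gensim object (both raise `KeyError` for a missing key), and `lookup_or_zeros`, `toy_vectors`, and `DIM` are hypothetical names used only for this illustration.

```python
import numpy as np

DIM = 3  # toy dimensionality; it would be 300 for the pretrained model
toy_vectors = {"help": np.array([1.0, 2.0, 3.0])}  # made-up values

def lookup_or_zeros(word, vectors, dim):
    """Return the embedding of `word`, or a zero vector if it is unknown."""
    try:
        return vectors[word]
    except KeyError:
        # The word is not in the dictionary: fall back to zeros.
        return np.zeros(dim)

known = lookup_or_zeros("help", toy_vectors, DIM)
unknown = lookup_or_zeros("zzzz", toy_vectors, DIM)
print(known)    # [1. 2. 3.]
print(unknown)  # [0. 0. 0.]
```

An `if word in vectors` check before the lookup is an equally reasonable alternative to `try`/`except`. Note that averaging in zero vectors for unknown words gives a different mean than skipping those words altogether.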

For this task, let's use the pre-trained Google News 300-d vectors.

In [19]:
!pip install gensim



In [20]:
import gensim.downloader
# Show all available models in gensim-data
print(list(gensim.downloader.info()['models'].keys()))

['fasttext-wiki-news-subwords-300', 'conceptnet-numberbatch-17-06-300', 'word2vec-ruscorpora-300', 'word2vec-google-news-300', 'glove-wiki-gigaword-50', 'glove-wiki-gigaword-100', 'glove-wiki-gigaword-200', 'glove-wiki-gigaword-300', 'glove-twitter-25', 'glove-twitter-50', 'glove-twitter-100', 'glove-twitter-200', '__testing_word2vec-matrix-synopsis']


In [21]:
glove_vectors = gensim.downloader.load('word2vec-google-news-300')

In [22]:
glove_vectors.most_similar('how')

[('what', 0.6820360422134399),
 ('How', 0.6297600865364075),
 ('why', 0.5838741064071655),
 ('whether', 0.5470786690711975),
 ('exactly', 0.5468065142631531),
 ('so', 0.5422654151916504),
 ('howthe', 0.5206358432769775),
 ('way', 0.5168502330780029),
 ('really', 0.5107458829879761),
 ('Q.How', 0.5080464482307434)]

In [23]:
word = "how"
word_embedding = glove_vectors[word]
print(word_embedding.shape)

(300,)


In [24]:
import numpy as np
def get_sentence_embedding(sentence:str,glove_vectors)->np.ndarray:
    """
    function to get embedding of a sentence from the words in it using w2v

    args:
        sentence: the input sentence to compute embeddings from
        glove_vectors: the pretrained w2v object where you can look up word embeddings
    returns:
        a numpy ndarray with the same dimension as the pretrained w2v embeddings
    """
    # YOUR CODE HERE

    sentence_list = sentence.split()
    sentence_embedding = []

    for word in sentence_list:
        if word in glove_vectors:
            word_embedding = glove_vectors[word]
            sentence_embedding.append(word_embedding)

    if sentence_embedding:
        # the mean of all word embeddings in the sentence
        sentence_embedding = np.mean(sentence_embedding, axis=0)
    else:
        # none of the words in the sentence are in the dictionary: fall back to a zero vector
        sentence_embedding = np.zeros(glove_vectors.vector_size)

    return sentence_embedding

### Task 2: Encode your input sentences from the train and dev portions of the Triage dataset into vector representations

Last week we saw how to use tf-idf vectors to represent sentences and use them in a classifier. Here we just need to similarly turn all training and dev sentences into vectors, but using w2v.

Make use of the function above and go through all sentences in your train data and dev data. One possibility is that all of the words in a sentence may be absent from your pretrained w2v dictionary. In that case, it would just come out as a zero vector for the whole sentence, which may not be ideal but let's keep it simple.

In [25]:
# Write code to go through all sentences in the train and dev data respectively
# and encode them into vectors using the function you wrote above with w2v
# Make sure the final matrix for your training set and dev set are represented in
# numpy arrays, not list of lists.

# YOUR CODE HERE
from util import load_data, Dataset, Example
import numpy as np
dataset = load_data("./data/triage")

def get_data(split:list[Example])->tuple[list[str],list[int]]:
    """
    massage the data into a format consistent with the input type required by CountVectorizer or tfidfVectorizer.

    args
        split: pass in the split, which should be either dataset.train or dataset.dev

    returns:
        text: list of sentences
        labels: list of labels

    """
    # YOUR CODE HERE
    # pass
    text = list()
    label = list()

    for doc in split:
        text.append(doc.words)
        label.append(doc.label)

    text = [' '.join(words) for words in text]

    return text, label

train_text, train_label = get_data(dataset.train)
dev_text, dev_label = get_data(dataset.dev)

train_encoded = [get_sentence_embedding(doc, glove_vectors) for doc in train_text]
dev_encoded = [get_sentence_embedding(doc, glove_vectors) for doc in dev_text]

train_encoded = np.array(train_encoded)
dev_encoded = np.array(dev_encoded)
print(train_encoded.shape)
print(dev_encoded.shape)

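# Safety net: get_sentence_embedding already returns a zero vector when none of a
# sentence's words are in the pretrained w2v dictionary, so NaNs are not expected here;
# any that do appear are replaced with 0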
train_encoded[np.isnan(train_encoded)] = 0
dev_encoded[np.isnan(dev_encoded)] = 0

(21046, 300)
(2573, 300)
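To see how often the all-words-missing case actually occurs, one option is to count the rows of the encoded matrix that are entirely zero. The sketch below does this on a small made-up matrix; `encoded` is a hypothetical stand-in for `train_encoded` or `dev_encoded`.

```python
import numpy as np

# Made-up 3-d sentence vectors; the second row mimics a sentence whose
# words were all missing from the w2v dictionary.
encoded = np.array([
    [0.1, 0.2, 0.3],
    [0.0, 0.0, 0.0],
    [0.5, 0.0, 0.1],
])

# A row is "all zero" when none of its entries is non-zero.
zero_rows = ~encoded.any(axis=1)
print(zero_rows.sum())            # 1
print(np.flatnonzero(zero_rows))  # [1]
```

The indices returned by `np.flatnonzero` can be used to look up the corresponding sentences and inspect why none of their words were found.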


### Task 3: Logistic regression text classification with w2v

Feed your w2v-encoded train data into a logistic regression classifier like the one you worked with last week, except that this time you should use scikit-learn's built-in logistic regression. Report the accuracy on the train and dev datasets.

In [26]:
# code for logistic regression with scikit-learn library.
import sklearn.linear_model
clf = sklearn.linear_model.LogisticRegression(random_state=101)

clf.fit(train_encoded, train_label)
train_pred = clf.predict(train_encoded)
dev_pred = clf.predict(dev_encoded)

assert(len(train_pred) == len(train_label))
print("Accuracy on train data:", sum(train_pred == train_label) / len(train_label))
assert(len(dev_pred) == len(dev_label))
print("Accuracy on dev data:", sum(dev_pred == dev_label) / len(dev_label))

Accuracy on train data: 0.7662263613038107
Accuracy on dev data: 0.7625340069957248
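Accuracy here is simply the fraction of predictions that match the gold labels. A minimal sketch of that computation on toy data follows; the helper name `accuracy` and the toy lists are hypothetical. Converting both inputs to arrays first avoids relying on the implicit array-versus-list comparison.

```python
import numpy as np

def accuracy(pred, gold):
    """Fraction of positions where prediction and gold label agree."""
    pred = np.asarray(pred)
    gold = np.asarray(gold)
    assert pred.shape == gold.shape
    return float(np.mean(pred == gold))

toy_pred = [1, 0, 1, 1]
toy_gold = [1, 0, 0, 1]
toy_accuracy = accuracy(toy_pred, toy_gold)
print(toy_accuracy)  # 0.75
```

scikit-learn computes the same quantity with `sklearn.metrics.accuracy_score(gold, pred)` or with the classifier's own `clf.score(X, y)`.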


## PART II: Train your own w2v embeddings on the Triage training data and test it on the dev data

In this part we will train our w2v model based on the training dataset. First, you can read through the gensim package tutorial. Pay special attention to the ```training parameters section``` to understand the parameters in the ```Word2Vec``` function below.

Assuming you have the ```train_text``` variable set up above, which is a list of sentences, we still need to break each sentence into a list of words. In the code below, we first do that, then take the three steps to train a w2v model:

1. initialize model with ```Word2Vec()```
2. build your vocab
3. train the model.

### Task 3: train w2v model with default parameters

Train the model using the code below, then reuse your code from above to feed the training and dev text to your logistic regression model with the newly trained w2v dictionary. Note that to load the embedding for a word, you need to look it up with

```word_emb = w2v_model.wv[word]```

which is a little different from the pre-trained model.
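Because the only difference is the extra `.wv` step, a single helper can serve both kinds of model. The sketch below is one way to do that, under the assumption that a trained model exposes its word vectors as `.wv` while a pretrained object is itself the word-to-vector mapping. `get_keyed_vectors` and `ToyTrainedModel` are hypothetical names, and a plain dictionary stands in for the real gensim objects.

```python
import numpy as np

def get_keyed_vectors(model):
    """Return the word-to-vector mapping for either kind of model.

    Falls back to the object itself when it has no `.wv` attribute.
    """
    return getattr(model, "wv", model)

class ToyTrainedModel:
    """Stand-in for a trained model that stores its vectors in `.wv`."""
    def __init__(self, vectors):
        self.wv = vectors

toy_vectors = {"fire": np.array([0.5, 0.5])}  # made-up values
pretrained_like = toy_vectors                 # looked up directly
trained_like = ToyTrainedModel(toy_vectors)   # looked up through .wv

for model in (pretrained_like, trained_like):
    vectors = get_keyed_vectors(model)
    print(vectors["fire"])  # [0.5 0.5] in both cases
```

With such a helper, one sentence-embedding function could be reused for both parts of the assignment instead of being rewritten.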

After training your logistic regression model, report accuracy for both training and dev data.

In [27]:
import logging
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

from gensim.models import Word2Vec
W2V_SIZE = 300
W2V_WINDOW = 8
W2V_EPOCH = 32
W2V_MIN_COUNT = 10

sentences = [s.split() for s in train_text]

w2v_model = Word2Vec(vector_size=W2V_SIZE,
                    window=W2V_WINDOW,
                    min_count=W2V_MIN_COUNT,
                    workers=8)

w2v_model.build_vocab(sentences, progress_per=10000)

w2v_model.train(sentences, total_examples=len(sentences), epochs=W2V_EPOCH, report_delay=1)


(11073529, 16206368)

In [28]:
# YOUR CODE HERE
# to train a logistic regression model with your new w2v embeddings
# report accuracy for training and dev data

def get_sentence_embedding(sentence:str, w2v_model, W2V_SIZE)->np.ndarray:
    """
    function to get embedding of a sentence from the words in it using w2v

    args:
        sentence: the input sentence to compute embeddings from
        w2v_model: the trained Word2Vec model; word embeddings are looked up through w2v_model.wv
        W2V_SIZE: the dimensionality of the trained embeddings, used for the zero-vector fallback
    returns:
        a numpy ndarray with the same dimension as the trained w2v embeddings
    """
    # YOUR CODE HERE
    # pass
    sentence_list = sentence.split()
    sentence_embedding = []

    for word in sentence_list:
        if word in w2v_model.wv:
            word_embedding = w2v_model.wv[word]
            sentence_embedding.append(word_embedding)

    if sentence_embedding:
        # the mean of all word embeddings in the sentence
        sentence_embedding = np.mean(sentence_embedding, axis=0)
    else:
        # none of the words in the sentence are in the vocabulary
        sentence_embedding = np.zeros(W2V_SIZE) # use W2V_SIZE instead of a constant

    return sentence_embedding

train_encoded_new = [get_sentence_embedding(doc, w2v_model, W2V_SIZE) for doc in train_text]
dev_encoded_new = [get_sentence_embedding(doc, w2v_model, W2V_SIZE) for doc in dev_text]

train_encoded_new = np.array(train_encoded_new)
dev_encoded_new = np.array(dev_encoded_new)

train_encoded_new[np.isnan(train_encoded_new)] = 0
dev_encoded_new[np.isnan(dev_encoded_new)] = 0

In [29]:
clf = sklearn.linear_model.LogisticRegression(random_state=101, max_iter=500) # tried larger max_iter so the model can converge

clf.fit(train_encoded_new, train_label)
train_pred_new = clf.predict(train_encoded_new)
dev_pred_new = clf.predict(dev_encoded_new)

assert(len(train_pred_new) == len(train_label))
print("Accuracy on train data:", sum(train_pred_new == train_label) / len(train_label))
assert(len(dev_pred_new) == len(dev_label))
print("Accuracy on dev data:", sum(dev_pred_new == dev_label) / len(dev_label))

Accuracy on train data: 0.7572935474674523
Accuracy on dev data: 0.7532063738826272


### Task 3.1: play with hyperparameters

Change the hyperparameters such as vector_size, window, min_count, etc., and train your w2v model again. Does the accuracy change? Report your findings.

In [30]:
import logging
logging.basicConfig(format="%(levelname)s - %(asctime)s: %(message)s", datefmt= '%H:%M:%S', level=logging.INFO)

from gensim.models import Word2Vec
W2V_SIZE_list = [100, 300]
W2V_WINDOW_list = [2, 8]
W2V_EPOCH_list = [8, 32]
W2V_MIN_COUNT_list = [2, 10]

sentences = [s.split() for s in train_text]

for W2V_SIZE in W2V_SIZE_list:
  for W2V_WINDOW in W2V_WINDOW_list:
    for W2V_EPOCH in W2V_EPOCH_list:
      for W2V_MIN_COUNT in W2V_MIN_COUNT_list:
        w2v_model = Word2Vec(vector_size=W2V_SIZE,
                             window=W2V_WINDOW,
                             min_count=W2V_MIN_COUNT,
                             epochs=W2V_EPOCH,
                             workers=8)

        w2v_model.build_vocab(sentences, progress_per=10000)

        w2v_model.train(sentences, total_examples=len(sentences), epochs=W2V_EPOCH, report_delay=1)

        train_encoded_new = [get_sentence_embedding(doc, w2v_model, W2V_SIZE) for doc in train_text]
        dev_encoded_new = [get_sentence_embedding(doc, w2v_model, W2V_SIZE) for doc in dev_text]

        train_encoded_new = np.array(train_encoded_new)
        dev_encoded_new = np.array(dev_encoded_new)

        train_encoded_new[np.isnan(train_encoded_new)] = 0
        dev_encoded_new[np.isnan(dev_encoded_new)] = 0

        clf = sklearn.linear_model.LogisticRegression(random_state=101, max_iter=500) # tried larger max_iter so the model can converge

        clf.fit(train_encoded_new, train_label)
        train_pred_new = clf.predict(train_encoded_new)
        dev_pred_new = clf.predict(dev_encoded_new)

        print('W2V_SIZE:', W2V_SIZE, 'W2V_WINDOW:', W2V_WINDOW, 'W2V_EPOCH:', W2V_EPOCH, 'W2V_MIN_COUNT:', W2V_MIN_COUNT)
        assert(len(train_pred_new) == len(train_label))
        print("Accuracy on train data:", sum(train_pred_new == train_label) / len(train_label))
        assert(len(dev_pred_new) == len(dev_label))
        print("Accuracy on dev data:", sum(dev_pred_new == dev_label) / len(dev_label))

W2V_SIZE: 100 W2V_WINDOW: 2 W2V_EPOCH: 8 W2V_MIN_COUNT: 2
Accuracy on train data: 0.7234628908106053
Accuracy on dev data: 0.7225029148853478
W2V_SIZE: 100 W2V_WINDOW: 2 W2V_EPOCH: 8 W2V_MIN_COUNT: 10
Accuracy on train data: 0.7211346574170864
Accuracy on dev data: 0.7232802176447727
W2V_SIZE: 100 W2V_WINDOW: 2 W2V_EPOCH: 32 W2V_MIN_COUNT: 2
Accuracy on train data: 0.7449396559916374
Accuracy on dev data: 0.7454333462883793
W2V_SIZE: 100 W2V_WINDOW: 2 W2V_EPOCH: 32 W2V_MIN_COUNT: 10
Accuracy on train data: 0.7408058538439608
Accuracy on dev data: 0.7516517683637777
W2V_SIZE: 100 W2V_WINDOW: 8 W2V_EPOCH: 8 W2V_MIN_COUNT: 2
Accuracy on train data: 0.7259336691057683
Accuracy on dev data: 0.7271667314418966
W2V_SIZE: 100 W2V_WINDOW: 8 W2V_EPOCH: 8 W2V_MIN_COUNT: 10
Accuracy on train data: 0.7198042383350756
Accuracy on dev data: 0.7232802176447727
W2V_SIZE: 100 W2V_WINDOW: 8 W2V_EPOCH: 32 W2V_MIN_COUNT: 2
Accuracy on train data: 0.7453197757293547
Accuracy on dev data: 0.7442673921492421


The grid above covers 16 combinations of the hyperparameters vector_size, window, min_count, and epochs; the output reproduced here shows only the first seven, all with vector_size=100. The accuracies do not differ much: they all fall approximately within the range of 72% to 76%.

However, the hyperparameters do have some impact:

- **vector_size** is the dimensionality of the word vectors. Larger vectors can capture more information from the text, so with suitable hyperparameter combinations (e.g., W2V_WINDOW: 8 W2V_EPOCH: 32 W2V_MIN_COUNT: 10), the models with vector_size=300 can achieve higher accuracies.
- **window** is the maximum distance between the current and predicted word within a sentence. A larger window captures broader context, yet in this case models with a smaller window appeared to achieve higher accuracies on dev data.
- **min_count** is the minimum total frequency a word needs in order to be kept in the vocabulary. In some of the cases shown above, models with a higher min_count achieved slightly lower accuracies on train data and slightly higher accuracies on dev data, suggesting that min_count can help exclude noise.
- **epochs** is the number of passes over the training corpus. In this case, models with more epochs achieved higher accuracies, which suggests that they learn better embeddings than the models with fewer epochs, without overfitting.

It is important to note that, due to limited computing power, only a limited number of combinations are included, so the conclusions are by no means definitive or comprehensive. The hyperparameters should be chosen according to specific tasks, and more combinations are needed to find the optimal ones.
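When more combinations are tried, the four nested loops can be flattened into a single loop over a grid. The sketch below only enumerates the configurations, using the same values as above; the names `grid` and `configs` are hypothetical, and the training and evaluation steps are left out.

```python
from itertools import product

# Same candidate values as in the nested loops above.
grid = {
    "vector_size": [100, 300],
    "window": [2, 8],
    "epochs": [8, 32],
    "min_count": [2, 10],
}

# One dict per combination; the last key varies fastest,
# which matches the order of the nested loops.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(configs))  # 16
print(configs[0])    # {'vector_size': 100, 'window': 2, 'epochs': 8, 'min_count': 2}
```

Since the keys match the keyword names passed to the ```Word2Vec``` constructor in the cell above, each dictionary could plausibly be unpacked into that call, and collecting the resulting accuracies in a list makes it easier to sort and compare the runs afterwards.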

## Wrapping up

Compile the scores from the models and their input vectors from these three weeks. Compare the train and dev accuracy for these configurations:

- NB
- logistic regression (LR) with unigram count vectors
- LR with unigram+bigram count vectors
- LR with tfidf vectors
- LR with pre-trained w2v vectors
- LR with custom trained w2v vectors

Then analyze the accuracy results on the train and dev data. What do you see when comparing the different methods and input representations? What trends do you see in train and dev accuracy?


- **NB**: train accuracy ≈ 83%, dev accuracy ≈ 73%
- **LR with unigram count vectors**: train accuracy ≈ 80%, dev accuracy ≈ 74%
- **LR with unigram+bigram count vectors**: train accuracy ≈ 98%, dev accuracy ≈ 76%
- **LR with tfidf vectors**: train accuracy ≈ 93%, dev accuracy ≈ 76%
- **LR with pre-trained w2v vectors**: train accuracy ≈ 77%, dev accuracy ≈ 76%
- **LR with custom-trained w2v vectors**: train accuracy ≈ 76%, dev accuracy ≈ 75%

In terms of input representations, models with **simple count-based methods** have relatively lower dev accuracies. Models with richer feature representations such as **TF-IDF** and **unigram+bigram** counts fit the training data very well but are at higher risk of overfitting: their training accuracies are very high, yet their dev accuracies are not significantly higher than those of the other models. This suggests that weighting terms by their importance or including both unigrams and bigrams helps capture more information from the training set, but the models do not generalize well enough to show a matching increase in dev accuracy. **Pre-trained w2v vectors** capture semantic information and are the most suitable representation in this case, as the model performs well on both train data and dev data. **Custom-trained w2v vectors** perform similarly to, but slightly worse than, pre-trained w2v vectors, which suggests that the text in this problem is not especially unique or domain-specific, so custom training is unnecessary.

In terms of train and dev accuracy trends, **NB** is the simplest model and has reasonable train and dev accuracies. **LR with unigram count vectors** has slightly higher dev accuracy and lower train accuracy than NB, so it performs slightly better and generalizes better than NB. **LR with unigram+bigram count vectors** has higher dev accuracy but much higher train accuracy, suggesting that this model is overfitting. **LR with tfidf vectors** has a dev accuracy similar to that of LR with unigram+bigram count vectors and also a very high train accuracy, indicating that it is overfitting as well. **LR with pre-trained w2v vectors** shows the best performance: its dev accuracy is similar to that of the previous two models, and its train and dev accuracies are close to each other, so it fits the training set and generalizes to the dev data well. **LR with custom-trained w2v vectors** is only slightly worse than LR with pre-trained w2v vectors; with more hyperparameter tuning in the custom training process, this model might achieve better performance.
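The overfitting argument rests on the gap between train and dev accuracy, which can be tabulated directly from the approximate scores listed above. The sketch below does so; the names `scores` and `gaps` are hypothetical, and the numbers are the rounded percentages from the list, not new measurements.

```python
# (train %, dev %) as reported in the list above, rounded.
scores = {
    "NB": (83, 73),
    "LR unigram counts": (80, 74),
    "LR unigram+bigram counts": (98, 76),
    "LR tfidf": (93, 76),
    "LR pre-trained w2v": (77, 76),
    "LR custom-trained w2v": (76, 75),
}

# Train-dev gap in percentage points; a large gap hints at overfitting.
gaps = {name: train - dev for name, (train, dev) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: gap = {gap} points")
```

The two count-based models with the richest features show the largest gaps, while both w2v models show gaps of about one point.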