# DSCI 572 Lab 3: CBOW model, minibatch training and (optionally) pretrained embeddings

In this lab, we'll work on a familiar task, namely sentiment analysis. We'll build a CBOW model using PyTorch. We'll also incorporate pretrained embeddings, which turn out to have a substantial impact on model performance. Finally, we'll investigate the impact of minibatch training and dropout on model accuracy.

**Note!** This can be a good opportunity to rehearse running code on Google Colab, where you get access to a GPU. Check this [tutorial](https://www.marktechpost.com/2021/01/09/getting-started-with-pytorch-in-google-collab-with-free-gpu/). 

## Software Requirements
* Python (>=3.6)
* PyTorch (>=1.2.0)
* Jupyter (latest)
* Scikit Learn (>=0.23.2)

## Submission Info
* Due date: Saturday, Jan 25 at 23:59 PST



## Tidy Submission
rubric={mechanics:4}

To get the marks for tidy submission:

* Submit the assignment by filling in this Jupyter notebook with your answers embedded.
* Submit a PDF version of your Jupyter notebook.
* Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)


## Getting started

You should start by downloading the lab data from the student repo `blank_labs/Lab3/data`.

Run the following code:

In [1]:
from copy import deepcopy
from collections import Counter
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
import numpy as np
import torch
import torch.nn as nn
import nltk

# We'll use double values in our tensors
torch.set_default_dtype(torch.float64)

# Checks if GPU is available, otherwise use CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 
torch.backends.cudnn.deterministic=True
print(device)

cpu


We'll now read data for sentiment analysis. This code is given to you. 

We get 500 training examples, 1000 development examples and 8476 test examples.

In [2]:
train = pd.read_csv("data/IMDB.train.tsv", header=None, names=["text", "sentiment"], sep="\t")[:500]
dev = pd.read_csv("data/IMDB.dev.tsv", header=None, names=["text", "sentiment"], sep="\t")
test = pd.read_csv("data/IMDB.test.tsv", header=None, names=["text", "sentiment"], sep="\t")

print(f"Number of training examples: {len(train)}")
print(f"Number of development examples: {len(dev)}")
print(f"Number of test examples: {len(test)}")

Number of training examples: 500
Number of development examples: 1000
Number of test examples: 8476


We'll then encode sentiment labels (`positive` and `negative`) as numbers.

In [3]:
label_encoder = LabelEncoder()
label_encoder.fit(train.sentiment)

train_y = label_encoder.transform(train.sentiment)
dev_y = label_encoder.transform(dev.sentiment)
test_y = label_encoder.transform(test.sentiment)

## Assignment 1

We'll start by training baseline sklearn sentiment analysis systems using [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and [TfidfVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html) for feature extraction. 

### Assignment 1.1
rubric={accuracy:2, quality:1}

Start by fitting a `CountVectorizer` and `TfidfVectorizer` using the training set `train`. For now, you don't need to worry about setting any of the parameters for either vectorizer.

You can then transform our datasets into two sets of matrices:

* `train_count_X`, `dev_count_X` and `test_count_X` (using `CountVectorizer`)
* `train_tfidf_X`, `dev_tfidf_X` and `test_tfidf_X` (using `TfidfVectorizer`)

In [4]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
en_stopwords = stopwords.words("english")

# Initialize CountVectorizer
count_vect = CountVectorizer(stop_words=en_stopwords)
# Fit and transform the training data
train_count_X = count_vect.fit_transform(train.text)
# Transform the development and test data
dev_count_X = count_vect.transform(dev.text)
test_count_X = count_vect.transform(test.text)

# Initialize TfidfVectorizer
tfidf_vect = TfidfVectorizer(stop_words=en_stopwords)
# Fit and transform the training data
train_tfidf_X = tfidf_vect.fit_transform(train.text)
# Transform the development and test data
dev_tfidf_X = tfidf_vect.transform(dev.text)
test_tfidf_X = tfidf_vect.transform(test.text)

# Now you can check the shape of these matrices
print("CountVectorizer Shapes:")
print("Train:", train_count_X.shape)
print("Dev:", dev_count_X.shape)
print("Test:", test_count_X.shape)

print("\nTfidfVectorizer Shapes:")
print("Train:", train_tfidf_X.shape)
print("Dev:", dev_tfidf_X.shape)
print("Test:", test_tfidf_X.shape)


CountVectorizer Shapes:
Train: (500, 12035)
Dev: (1000, 12035)
Test: (8476, 12035)

TfidfVectorizer Shapes:
Train: (500, 12035)
Dev: (1000, 12035)
Test: (8476, 12035)


### Assignment 1.2
rubric={accuracy:1, reasoning:1}

You should now fit two [LogisticRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) models:

* `lr_count` using your count vectorizer features
* `lr_tfidf` using your tfidf vectorizer features

Evaluate your models on the **development** data using the sklearn function [`f1_score`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html). `lr_count` should get f-score > 75% and `lr_tfidf` > 55%.

In [5]:
# your code here
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Initialize and fit Logistic Regression using count vectorizer features
lr_count = LogisticRegression()
lr_count.fit(train_count_X, train_y)
# Predict on development data
dev_pred_count = lr_count.predict(dev_count_X)
# Calculate F1 score
f1_count = f1_score(dev_y, dev_pred_count)
print("F1 Score for CountVectorizer model:", f1_count)

# Initialize and fit Logistic Regression using tfidf vectorizer features
lr_tfidf = LogisticRegression()
lr_tfidf.fit(train_tfidf_X, train_y)
# Predict on development data
dev_pred_tfidf = lr_tfidf.predict(dev_tfidf_X)
# Calculate F1 score
f1_tfidf = f1_score(dev_y, dev_pred_tfidf)
print("F1 Score for TfidfVectorizer model:", f1_tfidf)


F1 Score for CountVectorizer model: 0.8349358974358975
F1 Score for TfidfVectorizer model: 0.7855227882037533


Why do you think CountVectorizer would achieve better performance than TfidfVectorizer on this task?

YOUR ANSWER HERE

CountVectorizer simply counts word frequencies, effectively capturing the importance of key words in sentiment analysis. On the other hand, TfidfVectorizer adjusts the weight of frequent words, potentially weakening their emotional impact. Therefore, in sentiment analysis tasks, CountVectorizer tends to perform better because it directly reflects the frequency of emotional words.
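
To make the reweighting concrete, here is a minimal pure-Python sketch of the smoothed inverse document frequency that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The three toy reviews and the helper name `smooth_idf` are invented for illustration and are not part of the lab data.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only.
docs = [
    "great great movie",
    "bad movie",
    "great acting in this movie",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def smooth_idf(word):
    """Smoothed idf: ln((1 + n) / (1 + df)) + 1 (hypothetical helper)."""
    return math.log((1 + n_docs) / (1 + doc_freq[word])) + 1

# Raw count vs. tf-idf weight (before length normalization) in the first review.
counts = Counter(tokenized[0])
for word, count in counts.items():
    print(word, count, round(count * smooth_idf(word), 3))
```

A word that occurs in every review, such as `movie` here, keeps a weight of 1 per occurrence, while rarer words are boosted. By default `TfidfVectorizer` also rescales each row to unit L2 length, so a repeated word contributes less in absolute terms than it does with raw counts.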


### Assignment 1.3 Optional
rubric={accuracy:1, reasoning:2}

Both `CountVectorizer` and `TfidfVectorizer` have many parameters which affect the feature vectors. Tune these parameters to achieve the best possible development accuracy.

In [6]:
# your code here
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Create a pipeline with CountVectorizer and Logistic Regression
pipeline_count = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', LogisticRegression())
])

# Parameters to tune for CountVectorizer
params_count = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),  # unigrams only, or unigrams plus bigrams
    'vect__stop_words': (None, 'english'),
    'clf__C': (0.1, 1, 10)
}

# Create a pipeline with TfidfVectorizer and Logistic Regression
pipeline_tfidf = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LogisticRegression())
])

# Parameters to tune for TfidfVectorizer
params_tfidf = {
    'vect__max_df': (0.5, 0.75, 1.0),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__stop_words': (None, 'english'),
    'vect__use_idf': (True, False),
    'clf__C': (0.1, 1, 10)
}

# Grid search for CountVectorizer
grid_search_count = GridSearchCV(pipeline_count, params_count, cv=5, scoring='f1')
grid_search_count.fit(train.text, train_y)
print("Best score (CountVectorizer):", grid_search_count.best_score_)
print("Best parameters (CountVectorizer):", grid_search_count.best_params_)

# Grid search for TfidfVectorizer
grid_search_tfidf = GridSearchCV(pipeline_tfidf, params_tfidf, cv=5, scoring='f1')
grid_search_tfidf.fit(train.text, train_y)
print("Best score (TfidfVectorizer):", grid_search_tfidf.best_score_)
print("Best parameters (TfidfVectorizer):", grid_search_tfidf.best_params_)



Best score (CountVectorizer): 0.8240994530796268
Best parameters (CountVectorizer): {'clf__C': 0.1, 'vect__max_df': 0.75, 'vect__ngram_range': (1, 2), 'vect__stop_words': 'english'}
Best score (TfidfVectorizer): 0.8243290247863108
Best parameters (TfidfVectorizer): {'clf__C': 10, 'vect__max_df': 0.5, 'vect__ngram_range': (1, 1), 'vect__stop_words': 'english', 'vect__use_idf': False}


Explain which parameters you changed and why you think these would result in an improvement in f1-score.

YOUR ANSWER HERE

We adjusted max_df to exclude common words that might interfere with classification. The setting of ngram_range enhances the model's ability to capture context and phrases. The choice of clf__C reflects control over the model's regularization strength to optimize for overfitting or underfitting.


## Assignment 2

We'll then convert our training data into pytorch tensors. We *will not* use the output of sklearn vectorizers for this assignment. Instead we will directly numericalize our `train`, `dev` and `test` datasets. 

### Assignment 2.1
rubric={accuracy:1}

Start by creating a [`Counter`](https://docs.python.org/3/library/collections.html#collections.Counter) `vocabulary` which gives the count for each word type in the `train` dataset.

To tokenize the sentences in `train`, you can simply split at spaces.

In [7]:
# your code here
from collections import Counter

# Function to tokenize text by splitting at spaces
def tokenize(text):
    return text.split()

# Create a Counter to build the vocabulary from the tokenized training data
vocabulary = Counter()

# Loop through each text entry in the train dataset and update the Counter with tokens
for text in train.text:
    tokens = tokenize(text)
    vocabulary.update(tokens)

# Display the size of the vocabulary and some example counts
print("Vocabulary size:", len(vocabulary))
print("Sample word counts:", list(vocabulary.items())[:10])



Vocabulary size: 12775
Sample word counts: [('despite', 39), ('looking', 62), ('dated', 11), ('inki', 3), ('and', 3185), ('the', 6416), ('minah', 4), ('bird', 9), ('is', 1988), ('my', 228)]


Assertions to test your code: 

In [8]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
assert vocabulary["the"] == 6416
assert vocabulary["dog"] == 7
print('pass')

pass


Next, create a mapping `word2id` which translates every word type in `vocabulary` into a unique id number in the range `1 ... len(vocabulary)`. `word2id` should also map the symbol `PAD="<pad>"` to the ID `0`.

In [9]:
PAD = "<pad>"

# Create a dictionary that maps each word to a unique ID, starting from 1
word2id = {word: idx + 1 for idx, word in enumerate(vocabulary)}

# Add the PAD symbol with an ID of 0
word2id[PAD] = 0

# Display the mapping for some words
print("Sample word to ID mapping:", list(word2id.items())[:10])
print("ID for PAD:", word2id[PAD])


Sample word to ID mapping: [('despite', 1), ('looking', 2), ('dated', 3), ('inki', 4), ('and', 5), ('the', 6), ('minah', 7), ('bird', 8), ('is', 9), ('my', 10)]
ID for PAD: 0


Assertions as a partial check of your code:

In [10]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
assert word2id[PAD] == 0
assert len(word2id) == len(vocabulary) + 1
print('pass')

pass


### Assignment 2.2
rubric={accuracy:1}

Write a function `numericalize_ex` which takes the following arguments:

1. `ex`, a string representing a review. E.g. `"great movie !"`
1. `vocabulary`, the word type counter which we created above
1. `word2id`, the mapping words -> ID numbers which we created above
1. `min_count`, the minimum count of word type. Rarer words are filtered out.
1. `max_count`, the maximum count of word type. More frequent words are filtered out.

Your function should first split `ex` into individual tokens (you can split at spaces). You should then filter out all words whose frequency is < `min_count` or > `max_count`. 

Then, transform the example into a set and transform all the remaining words into ID numbers using `word2id`. Return a `torch.tensor` of shape `n`, where `n` is the count of ID numbers.

When you initialize the tensor, use `dtype=torch.long`.

In [11]:
import torch

def numericalize_ex(ex, vocabulary, word2id, min_count, max_count):
    # 1. Split the review into tokens (split by space)
    tokens = ex.split()
    
    # 2. Filter out words whose frequency is < min_count or > max_count
    filtered_tokens = [
        token for token in tokens
        if min_count <= vocabulary[token] <= max_count
    ]
    
    # 3. Convert the list of filtered tokens into a set (removes duplicates)
    unique_tokens = set(filtered_tokens)
    
    # 4. Map each remaining word to its ID
    # (ignore words that are not in word2id, e.g., if we had an unknown token)
    ids = [word2id[word] for word in unique_tokens if word in word2id]
    
    # 5. Return a torch tensor of type long
    return torch.tensor(ids, dtype=torch.long)


In [12]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
assert numericalize_ex(train.text[0], vocabulary, word2id, 5, 100).size()[0] == 70
print('pass')

pass


### Assignment 2.3
rubric={accuracy:2}

Write a function `numericalize()` which takes the following arguments:

1. `data`, one of our datasets `train`, `dev` or `test`
1. `data_y`, the list of numeric labels for the examples in `data` (0 or 1, corresponding to negative and positive sentiment, respectively, because `LabelEncoder` assigns IDs to the sorted class names)
1. `vocabulary`, the word type counter which we created above
1. `word2id`, the mapping words -> ID numbers which we created above
1. `min_count`, the minimum count of word type. All rarer words are filtered out.
1. `max_count`, the maximum count of word type. More frequent words get filtered out.
1. `batch_size`, our minibatch size.

You should first convert all examples in `data` into tensors using `numericalize_ex`. 

Then, pack the examples and their labels into minibatches containing `batch_size` examples each. Every minibatch should be a 3-tuple containing:

1. A minibatch `b` of input examples of dimension `batch_size x k`, where `k` is the maximal length of an example vector in the minibatch.
1. A minibatch of sequence lengths of shape `batch_size`, where the elements are the lengths of the examples in `b` before padding is applied.
1. A minibatch of labels of shape `batch_size`, where each label `i` corresponds to example `b[i]`.

You will need to pad all examples in `b` to the same length using the padding symbol `word2id[PAD]`. Use the PyTorch function [`pad_sequence`](https://pytorch.org/docs/stable/generated/torch.nn.utils.rnn.pad_sequence.html?highlight=pad_sequence) to convert a list of examples `x` of shapes `len_x` into a padded minibatch of shape `batch_size x max_len_x`. You will need to call the function with the argument `batch_first=True` because we want the batch size to be the first dimension.

If `batch_size` does not evenly divide `len(data)`, you may need to create one smaller minibatch to account for all training examples. This is okay.
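
The padding step can be tried out in isolation. The following is a minimal sketch with made-up ID values; `PAD_ID` is an assumed stand-in for `word2id[PAD]`.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = 0  # assumed to equal word2id[PAD]

# Three toy examples of different lengths (ID values are made up).
examples = [
    torch.tensor([4, 9, 2], dtype=torch.long),
    torch.tensor([7], dtype=torch.long),
    torch.tensor([3, 5], dtype=torch.long),
]

# Record the true lengths before padding hides them.
lengths = torch.tensor([len(ex) for ex in examples], dtype=torch.long)

# batch_first=True gives shape (batch_size, max_len) instead of (max_len, batch_size).
padded = pad_sequence(examples, batch_first=True, padding_value=PAD_ID)

print(padded)
print(lengths)
```

Each row is filled up with `PAD_ID` on the right until it matches the longest example in the minibatch, which is why the lengths have to be stored separately.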

**Note:** We're returning a list from this function. It is, however, often better to create a [data loader](https://stanford.edu/~shervine/blog/pytorch-how-to-generate-data-parallel). This can save memory when we're dealing with very large training sets. You'll learn more about this later.  
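
For reference, a data loader version might look roughly like the sketch below. It is not required for this lab. The class `ReviewDataset`, the function `collate` and the toy data are hypothetical names and values chosen for illustration; `Dataset`, `DataLoader` and its `collate_fn` argument are the standard PyTorch utilities.

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader, Dataset

PAD_ID = 0  # assumed to equal word2id[PAD]

class ReviewDataset(Dataset):
    """Hypothetical wrapper around already numericalized examples and labels."""
    def __init__(self, examples, labels):
        self.examples = examples
        self.labels = labels

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, idx):
        return self.examples[idx], self.labels[idx]

def collate(batch):
    """Pad one minibatch and return (padded inputs, lengths, labels)."""
    xs, ys = zip(*batch)
    lengths = torch.tensor([len(x) for x in xs], dtype=torch.long)
    padded = pad_sequence(list(xs), batch_first=True, padding_value=PAD_ID)
    return padded, lengths, torch.tensor(ys, dtype=torch.long)

# Toy data: five examples of varying length with made-up IDs and labels.
toy_examples = [
    torch.tensor(ids, dtype=torch.long)
    for ids in ([1, 2, 3], [4], [5, 6], [7, 8, 9, 10], [11])
]
toy_labels = [0, 1, 1, 0, 1]

loader = DataLoader(
    ReviewDataset(toy_examples, toy_labels),
    batch_size=2,
    shuffle=False,
    collate_fn=collate,
)
batch_shapes = [tuple(x.shape) for x, lengths, y in loader]
print(batch_shapes)
```

The loader builds each padded minibatch only when it is requested, and the last minibatch is smaller when `batch_size` does not divide the number of examples.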

In [13]:
import torch
from torch.nn.utils.rnn import pad_sequence

def numericalize(data, data_y, vocabulary, word2id, min_count, max_count, batch_size):
    # Convert each example into a tensor of IDs
    numeric_examples = [
        numericalize_ex(ex, vocabulary, word2id, min_count, max_count)
        for ex in data.text
    ]
    
    # Store all minibatches in a list
    batches = []
    
    # Break the data into chunks of size 'batch_size'
    for i in range(0, len(numeric_examples), batch_size):
        batch_x = numeric_examples[i : i + batch_size]
        batch_y_vals = data_y[i : i + batch_size]
        
        # Record the length of each sequence before padding
        lengths = [len(x) for x in batch_x]
        
        # Pad all sequences in this batch to the same length
        padded_batch_x = pad_sequence(
            batch_x, 
            batch_first=True, 
            padding_value=word2id[PAD]
        )
        
        # Convert labels for this batch into a tensor
        batch_labels = torch.tensor(batch_y_vals, dtype=torch.long)
        
        # Add a 3-tuple: (padded tensor, sequence lengths, labels)
        batches.append((
            padded_batch_x, 
            torch.tensor(lengths, dtype=torch.long), 
            batch_labels
        ))
    
    return batches


Some tests to check that the number of batches which you generate looks okay:

In [14]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
batches = numericalize(train, train_y, vocabulary, word2id, 5, 100, 15)
assert 500//15 + 1 == len(batches)
assert batches[-1][0].size()[0] == 500 % 15
print('pass')

pass


Let's then numericalize the training, development and test data using `min_count` 5, `max_count` 100 and `batch_size` 10:

In [15]:
torch_train = numericalize(train, train_y, vocabulary, word2id, 5, 100, 10)
torch_dev = numericalize(dev, dev_y, vocabulary, word2id, 5, 100, 10)
torch_test = numericalize(test, test_y, vocabulary, word2id, 5, 100, 10)

## Assignment 3

We'll now build a CBOW model for sentiment classification.

### Assignment 3.1
rubric={accuracy:5}

We'll now write a baseline torch model `CBOW` for classification of CBOW inputs. This model includes a dropout layer but does not yet use pretrained embeddings.

#### The `__init__` function

Your `__init__` function should take the following parameters:

1. `num_words`, the number of unique word type features + 1 for the symbol `PAD` (i.e. `len(word2id)`) 
1. `num_classes`, the number of output classes. In our case, this will always be 2 because we have exactly two classes: positive and negative.
1. `dropout_prob`, the dropout probability

Your model should contain the following layers in order:

1. `self.embedding`, an embedding of dimension `EMB_SIZE` which can embed all word types recognized by `word2id`
1. `self.linear1`, a linear layer which maps `EMB_SIZE`-dimensional inputs to `HIDDEN_SIZE`-dimensional outputs
1. `self.dropout`, dropout with probability defined by `dropout_prob`
1. `self.relu` which applies relu to the output of `self.linear1`
1. `self.linear2` which maps `HIDDEN_SIZE`-dimensional inputs to `num_classes`-dimensional outputs
1. `self.log_softmax` which applies log-softmax to the output of `self.linear2`

**Note**, when you initialize `self.embedding`, make sure to define `word2id[PAD]` as the padding symbol as explained [in the documentation](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html). The effect is that `PAD` will always be embedded as the zero vector.
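
The effect of the padding index can be checked on a tiny embedding table. This is a small sketch with arbitrary sizes; `PAD_ID` is an assumed stand-in for `word2id[PAD]`.

```python
import torch
import torch.nn as nn

PAD_ID = 0  # assumed to equal word2id[PAD]

# A tiny embedding table: 5 word types (including PAD), 3 dimensions each.
embedding = nn.Embedding(num_embeddings=5, embedding_dim=3, padding_idx=PAD_ID)

# The row for PAD is initialized to zeros ...
pad_vector = embedding(torch.tensor([PAD_ID]))
print(pad_vector)

# ... and receives a zero gradient, so gradient updates leave it at zero.
ids = torch.tensor([[2, 4, PAD_ID]])
embedding(ids).sum().backward()
print(embedding.weight.grad[PAD_ID])
```

Because padded positions contribute zero vectors, summing the embeddings of a padded example gives the same result as summing only its real tokens.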

#### The `forward` function

Your forward function takes two arguments: 

1. A minibatch of examples `x` having shape `batch_size x k` as input.
1. A tensor `lengths` which indicates the length of each example in `x`. 

Your forward function should:

1. Apply `self.embedding` to `x`. This results in a `batch_size x k x EMB_SIZE` tensor.
1. You should then compute the sum of the embeddings for each example in the batch using [`torch.Tensor.sum`](https://pytorch.org/docs/stable/generated/torch.sum.html?highlight=sum#torch.sum) resulting in a `batch_size x EMB_SIZE` tensor.
1. Normalize each embedded example by dividing by the lengths in `lengths`. You can first use [`unsqueeze`](https://pytorch.org/docs/stable/generated/torch.unsqueeze.html?highlight=unsqueeze#torch.unsqueeze) and [`expand`](https://pytorch.org/docs/stable/generated/torch.Tensor.expand.html?highlight=expand) to transform `lengths` into a `batch_size x EMB_SIZE` tensor and then use [`torch.div`](https://pytorch.org/docs/stable/generated/torch.div.html).
1. Pass the averaged embeddings through `self.linear1`, `self.dropout`, `self.relu`, `self.linear2` and finally `self.log_softmax`. This results in a `batch_size x num_classes` tensor.
1. Return the result.
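
The sum-and-normalize steps above can be followed on a small worked example. The numbers below are made up, and `EMB_SIZE` is set to 2 here only to keep the tensors readable.

```python
import torch

EMB_SIZE = 2  # tiny value for readability; the lab uses 100

# Pretend output of self.embedding for a batch of 2 examples padded to k = 3.
# PAD positions are zero vectors, as guaranteed by the padding index.
embedded = torch.tensor([
    [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]],   # true length 2
    [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]],   # true length 3
])
lengths = torch.tensor([2, 3], dtype=torch.long)

# Sum over the token dimension: (batch_size, k, EMB_SIZE) -> (batch_size, EMB_SIZE)
summed = embedded.sum(dim=1)

# (batch_size,) -> (batch_size, 1) -> (batch_size, EMB_SIZE)
lengths_expanded = lengths.unsqueeze(1).expand(-1, EMB_SIZE)

# Element-wise division turns the sums into averages over the real tokens.
averaged = torch.div(summed, lengths_expanded)
print(averaged)
```

Dividing by the true lengths matters: taking a plain mean over the padded dimension would divide the first example by 3 instead of 2 and shrink its average.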

In [16]:
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN_SIZE = 100
EMB_SIZE = 100

class CBOW(nn.Module):
    def __init__(self, num_words, num_classes, dropout_prob):
        super(CBOW, self).__init__()
        
        # Embedding layer (with PAD=0 as padding index)
        self.embedding = nn.Embedding(
            num_embeddings=num_words, 
            embedding_dim=EMB_SIZE, 
            padding_idx=0
        )
        
        # Linear layer from EMB_SIZE to HIDDEN_SIZE
        self.linear1 = nn.Linear(EMB_SIZE, HIDDEN_SIZE)
        
        # Dropout layer
        self.dropout = nn.Dropout(dropout_prob)
        
        # ReLU activation
        self.relu = nn.ReLU()
        
        # Final linear layer from HIDDEN_SIZE to num_classes
        self.linear2 = nn.Linear(HIDDEN_SIZE, num_classes)
        
        # Log softmax for classification
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x, lengths):
        """
        x: batch of examples of shape (batch_size, k)
        lengths: lengths of each example in x, shape (batch_size,)
        """
        # 1. Embed input => (batch_size, k, EMB_SIZE)
        embeddings = self.embedding(x)
        
        # 2. Sum across the time dimension => (batch_size, EMB_SIZE)
        summed = embeddings.sum(dim=1)
        
        # 3. Normalize by lengths => (batch_size, EMB_SIZE)
        # Clamp to at least 1 so an example with no surviving tokens does not divide by zero
        lengths_expanded = lengths.clamp(min=1).unsqueeze(1).expand(-1, EMB_SIZE)
        averaged = summed.div(lengths_expanded)
        
        # 4. Forward pass through linear, dropout, relu, linear, log_softmax
        out = self.linear1(averaged)
        out = self.dropout(out)
        out = self.relu(out)
        out = self.linear2(out)
        out = self.log_softmax(out)
        
        return out


Assertions to check your code:

In [None]:
# A test which your function should pass. Note, that simply passing the test does not 
# guarantee that your function is working fully correctly.
model = CBOW(len(word2id), 2, 0)
model.train(False)
x = torch_train[0]
res = model(x[0], x[1])
assert res.size()[0] == x[0].size()[0]
assert abs(res[0].exp().sum() - 1) < 0.001

## Assignment 4

### Assignment 4.1
rubric={accuracy:2}

Write a function `eval_model` which takes two arguments:

1. `data`, a torch data set containing examples `(input_minibatch, lengths, output_minibatch)` 
1. `model` a CBOW model

The function applies `model` to each input minibatch in `data` and returns the macro F-score computed by the sklearn function `f1_score`. 

Before running inference, make sure to call `model.train(False)` to disable dropout.

**Remember** to use `with torch.no_grad()` in order to avoid computing a bunch of gradients!
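
One possible shape for such a function is sketched below. `TinyModel` is a hypothetical stand-in that only mimics the `(x, lengths)` call signature of `CBOW`, and the toy data is invented; treat this as an illustration of the evaluation loop rather than the reference solution.

```python
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

class TinyModel(nn.Module):
    """Hypothetical stand-in for CBOW with the same (x, lengths) call signature."""
    def __init__(self, num_words, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(num_words, 4, padding_idx=0)
        self.linear = nn.Linear(4, num_classes)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x, lengths):
        averaged = self.embedding(x).sum(dim=1) / lengths.unsqueeze(1)
        return self.log_softmax(self.linear(averaged))

def eval_model(data, model):
    model.train(False)            # disable dropout
    gold, predicted = [], []
    with torch.no_grad():         # no gradients needed for inference
        for x, lengths, y in data:
            log_probs = model(x, lengths)
            # The predicted class is the one with the highest log-probability.
            predicted.extend(log_probs.argmax(dim=1).tolist())
            gold.extend(y.tolist())
    return f1_score(gold, predicted, average="macro")

# Two toy minibatches: (padded inputs, lengths, labels).
toy_data = [
    (torch.tensor([[1, 2, 0], [3, 4, 5]]), torch.tensor([2, 3]), torch.tensor([0, 1])),
    (torch.tensor([[2, 5]]), torch.tensor([2]), torch.tensor([1])),
]
score = eval_model(toy_data, TinyModel(num_words=6, num_classes=2))
print(score)
```

Predictions are collected across all minibatches before the score is computed, since averaging per-batch F-scores would generally give a different number.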

In [None]:
# your code here


You can now evaluate an untrained model on the development set. The performance is unlikely to be particularly good.

In [None]:
model = CBOW(len(word2id), 2, 0)
eval_model(torch_dev, model)

### Assignment 4.2
rubric={accuracy:3}

You should now write a training function `train_model`. The function takes the following parameters:

1. `model`, a CBOW model
1. `train_data`, a dataset of torch training examples
1. `dev_data`, a dataset of torch development examples
1. `max_epochs`, the maximum number of epochs for training

You should first:

1. Take the `CBOW` model `model` that is passed in as an argument (created beforehand with `len(word2id)` word types, 2 output classes and the desired dropout probability)
1. Initialize an `Adam` optimizer for `model` (you can use the defaults for the `lr` and `betas`)
1. Initialize an [`NLLLoss`](https://pytorch.org/docs/stable/generated/torch.nn.NLLLoss.html) loss function.

Run training for `max_epochs` epochs. Each epoch iterates over the training examples `(x, lengths, y)` in `train_data` and:

1. Calls `model.train(True)` to enable dropout
1. Calls `zero_grad` to erase old gradients
1. Applies the model to `x` and `lengths`
1. Computes the loss w.r.t. `y`
1. Runs one step of backprop and updates the parameters with the optimizer.

You should keep track of the average loss per training example over the epoch. As a general rule, the average loss should decrease through training. 

Once every epoch, you need to evaluate your model on the development data. Print the average loss and the `f1_score` on the development set. 

Keep track of the best development F-score and store the model which attains it. You can use `deepcopy` to save the model so that its parameters won't be affected by subsequent updates.

Finally, return the best model you found.
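
The training procedure described above could be sketched as follows. `TinyModel` and the compact `eval_model` are hypothetical stand-ins so that the block runs on its own, and the single toy minibatch is invented; this is one way to arrange the loop, not the reference solution.

```python
from copy import deepcopy

import torch
import torch.nn as nn
from sklearn.metrics import f1_score

class TinyModel(nn.Module):
    """Hypothetical stand-in for CBOW with the same (x, lengths) call signature."""
    def __init__(self, num_words, num_classes, dropout_prob):
        super().__init__()
        self.embedding = nn.Embedding(num_words, 8, padding_idx=0)
        self.dropout = nn.Dropout(dropout_prob)
        self.linear = nn.Linear(8, num_classes)
        self.log_softmax = nn.LogSoftmax(dim=-1)

    def forward(self, x, lengths):
        averaged = self.embedding(x).sum(dim=1) / lengths.unsqueeze(1)
        return self.log_softmax(self.linear(self.dropout(averaged)))

def eval_model(data, model):
    model.train(False)
    gold, predicted = [], []
    with torch.no_grad():
        for x, lengths, y in data:
            predicted.extend(model(x, lengths).argmax(dim=1).tolist())
            gold.extend(y.tolist())
    return f1_score(gold, predicted, average="macro")

def train_model(model, train_data, dev_data, max_epochs):
    optimizer = torch.optim.Adam(model.parameters())   # default lr and betas
    loss_function = nn.NLLLoss()                       # mean loss over the minibatch
    best_model, best_f1 = deepcopy(model), -1.0
    for epoch in range(max_epochs):
        total_loss, num_examples = 0.0, 0
        for x, lengths, y in train_data:
            model.train(True)          # enable dropout (eval_model switches it off)
            optimizer.zero_grad()      # erase old gradients
            loss = loss_function(model(x, lengths), y)
            loss.backward()            # backpropagation
            optimizer.step()           # parameter update
            # Undo the per-batch mean so we can average per example over the epoch.
            total_loss += loss.item() * y.size(0)
            num_examples += y.size(0)
        dev_f1 = eval_model(dev_data, model)
        print(f"epoch {epoch + 1}: avg loss {total_loss / num_examples:.4f}, dev F1 {dev_f1:.4f}")
        if dev_f1 > best_f1:           # keep a frozen copy of the best model so far
            best_model, best_f1 = deepcopy(model), dev_f1
    return best_model

torch.manual_seed(0)
toy_batch = (torch.tensor([[1, 2, 0], [3, 4, 5]]), torch.tensor([2, 3]), torch.tensor([0, 1]))
best = train_model(TinyModel(6, 2, 0.3), [toy_batch], [toy_batch], max_epochs=3)
```

Calling `model.train(True)` inside the batch loop is deliberate: `eval_model` puts the model into evaluation mode at the end of every epoch, so dropout has to be switched back on before the next update.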

In [None]:
# Your code here


Now, train a model for `max_epochs=50` with dropout probability 30%. You will probably get within 5 percentage points of CountVectorizer, but without pretrained embeddings it is hard to do better.

In [None]:
model = CBOW(len(word2id), 2, 0.3)
model = train_model(model, torch_train, torch_dev, 50)

Print the F-score of your model on the test data. Compare this against our CountVectorizer and TfidfVectorizer models.

CBOW will probably land somewhere between CountVectorizer and TfidfVectorizer. 

In [None]:
# your code here


### Assignment 4.3 Optional
rubric={accuracy:2}

Tune the hyperparameters `min_count`, `max_count` and `batch_size` (for `numericalize`) as well as `dropout_prob` (for the `CBOW` model).
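
A simple way to organize such a search is to enumerate every combination of candidate values. The sketch below uses only the standard library; the candidate values are arbitrary starting points, and `dev_score` is a hypothetical placeholder that returns a dummy value so the block runs on its own.

```python
from itertools import product

# Candidate values are arbitrary starting points, not recommendations.
grid = {
    "min_count": [1, 5],
    "max_count": [100, 1000],
    "batch_size": [10, 50],
    "dropout_prob": [0.0, 0.3],
}

def dev_score(config):
    """Hypothetical placeholder.

    In the lab this would call numericalize(...) with config["min_count"],
    config["max_count"] and config["batch_size"], train a CBOW model with
    config["dropout_prob"], and return the F-score from eval_model on the
    development set. Here it returns a dummy value so the sketch runs.
    """
    return 0.0

# One dictionary per combination of hyperparameter values.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
best_config = max(configs, key=dev_score)
print(len(configs), best_config)
```

Selecting the configuration on the development set and touching the test set only once at the end keeps the final test score an honest estimate.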

In [None]:
# your code here


What are the best hyperparameters you found?

## Assignment 5

In this optional assignment, we will incorporate pretrained embeddings. You can get the embeddings [here](https://drive.google.com/file/d/1Vl5ks9DjKjSEVGtgHxL4QhxMzl2FfTem/view?usp=sharing).

Let's start by reading GloVe embedding vectors from disk. 

The following code block will read an embedding array `embedding` of shape `number_of_word_types x embedding_dim`. Here `embedding_dim` is `EMB_SIZE` (=100). The code will also read a list of all the word types `word_types` covered by the embedding. The embedding vector for `word_types[i]` is `embedding[i]`. Note that the first element `word_types[0]` is the padding symbol.

This code is given to you:

In [None]:
word_types, embedding = [PAD],[[0 for i in range(EMB_SIZE)]]

with open('glove.6B.100d.filtered.txt','rt') as fi:
    full_content = fi.read().strip().split('\n')

for i in range(len(full_content)):
    i_word = full_content[i].split(' ')[0]
    i_emb = [float(val) for val in full_content[i].split(' ')[1:]]
    word_types.append(i_word)
    embedding.append(i_emb)
    
embedding = np.array(embedding)

### Assignment 5.1 Optional
rubric={accuracy:1}

Start by forming `pre_vocabulary` and `pre_word2id`. 

`pre_vocabulary` is a `Counter` which gives count 1 to all word types in `word_types`. `pre_word2id` is a dictionary which maps `word_types[i]` to its index `i`.
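
On a toy word list, the two structures could be built as in the sketch below. The four-element `word_types` list and the sample review are invented stand-ins for the real GloVe vocabulary and data.

```python
from collections import Counter

PAD = "<pad>"

# Toy stand-in for the word_types list read from the GloVe file.
word_types = [PAD, "the", "movie", "great"]

# Every covered word type gets count 1.
pre_vocabulary = Counter({word: 1 for word in word_types})

# word_types[i] -> i, so IDs line up with the rows of the embedding array.
pre_word2id = {word: i for i, word in enumerate(word_types)}

# A Counter returns 0 for unseen words, so a filter with min_count = max_count = 1
# keeps exactly the words that have a pretrained vector.
review = "the movie was great".split()
kept = [w for w in review if 1 <= pre_vocabulary[w] <= 1]
print(kept)
```

This is why calling `numericalize` with `min_count` 1 and `max_count` 1 on these structures drops every word that the pretrained embedding does not cover.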

In [None]:
# your code here


Run the following cell to generate training, development and test data for a CBOW model using pretrained embeddings: 

In [None]:
pre_torch_train = numericalize(train, train_y, pre_vocabulary, pre_word2id, 1, 1, 10)
pre_torch_dev = numericalize(dev, dev_y, pre_vocabulary, pre_word2id, 1, 1, 10)
pre_torch_test = numericalize(test, test_y, pre_vocabulary, pre_word2id, 1, 1, 10)

### Assignment 5.2 Optional
rubric={accuracy:1}

Modify your `CBOW` code to initialize the embeddings by calling `nn.Embedding.from_pretrained()` according to [this tutorial](https://medium.com/mlearning-ai/load-pre-trained-glove-embeddings-in-torch-nn-embedding-layer-in-under-2-minutes-f5af8f57416a). Set `freeze=False` when calling the function.

You should name this new class `PreCBOW`. You don't need to touch the `forward` function. This only requires a change to `__init__`. 

You should pass `embedding` as an additional argument to the `__init__` function and use it when you initialize `self.embedding`. 
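
The embedding layer on its own might be set up as in the sketch below. The 4 x 3 array is a made-up stand-in for the GloVe array, and the variable names `weights` and `pretrained_layer` are chosen for illustration.

```python
import numpy as np
import torch
import torch.nn as nn

# Toy stand-in for the GloVe array: 4 word types, 3 dimensions, row 0 is PAD.
embedding = np.array([
    [0.0, 0.0, 0.0],
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
])

# from_pretrained expects a float tensor; cast it to the current default dtype
# so that it matches the linear layers that follow.
weights = torch.tensor(embedding, dtype=torch.get_default_dtype())

# freeze=False keeps the vectors trainable; padding_idx=0 excludes the
# (already zero) PAD row from gradient updates.
pretrained_layer = nn.Embedding.from_pretrained(weights, freeze=False, padding_idx=0)

print(pretrained_layer(torch.tensor([0, 2])))
print(pretrained_layer.weight.requires_grad)
```

With `freeze=False` the pretrained vectors are only a starting point and get fine-tuned together with the rest of the model during training.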

In [None]:
#your code here


### Assignment 5.3 Optional
rubric={accuracy:1}

Train a `PreCBOW` model on the datasets `pre_torch_train` and `pre_torch_dev` for 50 epochs with dropout probability 30%.   

In [None]:
# your code here


Now evaluate F-score on `pre_torch_test`. Your F-score should be a bit better than for CountVectorizer. Note, however, that results may change depending on random initialization. You may need to train a few times to surpass `CountVectorizer`.

In [None]:
# your code here
