# AI Homework 5
In Homework 5, we will train our own CBOW Word2Vec embedding on the WikiText2 dataset (a small dataset).
- Change the Runtime option above to GPU if you can (max 12 hours per user).
- Save and submit the output of this notebook together with the model and vocab files you trained.
- You are not allowed to use other Python files or to import a pretrained model.

In [None]:
# Run this command if you train the model in the Colab environment.
! pip install datasets transformers

In [None]:
import argparse
import yaml
import os
import torch
import torch.nn as nn
import torchtext

import json
import numpy as np 

from functools import partial
from torch.utils.data import DataLoader
from torchtext.data import to_map_style_dataset
from torchtext.data.utils import get_tokenizer

from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import WikiText2 # WikiText103

import torch.optim as optim
from torch.optim.lr_scheduler import LambdaLR

from datasets import load_dataset



In [None]:
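device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch_seed_numb = 0
# Seed the CPU random generator as well, so runs are reproducible without a GPU.
torch.manual_seed(torch_seed_numb)
if device.type == 'cuda':
    torch.cuda.manual_seed(torch_seed_numb)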

In [None]:
device

In [None]:
# If you use the Google Colab environment, mount your Google Drive here to save the model and vocab.
from google.colab import drive
drive.mount('/content/drive')
root_dir = '/content/drive/MyDrive/course_ai_hw5'

In [None]:
# You can change these parameters if you want.

train_batch_size =  96
val_batch_size = 96
shuffle =  True

optimizer =  'Adam'
learning_rate =  0.025
epochs =  5

result_dir = 'weights/' 

# Parameters about CBOW model architecture and Vocab.
CBOW_N_WORDS = 4

MIN_WORD_FREQUENCY = 50
MAX_SEQUENCE_LENGTH = 256

EMBED_DIMENSION = 300
EMBED_MAX_NORM = 1

In [None]:
result_dir = os.path.join(root_dir, result_dir)
# makedirs also creates root_dir if it does not exist yet, and does nothing if result_dir already exists.
os.makedirs(result_dir, exist_ok=True)


## Prepare dataset and vocab

In [None]:
datasets = load_dataset('wikitext', 'wikitext-2-raw-v1')
train_dataset = datasets["train"]
val_dataset = datasets['validation']
test_dataset = datasets['test']
#train_dataset.map(tokenizing_word , batched= True, batch_size = 5000)


In [None]:
# Let's print one example
train_dataset['text'][11]

As you can see, we need to clean and lowercase the sentences, tokenize them, and convert each word to an index (equivalent to a one-hot vector). Before going through the whole process, we should build a vocabulary from the train dataset so that each word can be mapped to an index.
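To make the idea of a vocabulary concrete, here is a minimal pure-Python sketch on a toy corpus. The names `build_toy_vocab`, `encode`, and `corpus` are hypothetical and not part of the notebook; in the notebook itself, torchtext's `build_vocab_from_iterator` plays this role, and this sketch does not claim to reproduce its exact ordering or behavior.

```python
from collections import Counter

def build_toy_vocab(token_lists, min_freq=2, specials=("<unk>",)):
    """Map each sufficiently frequent token to an integer index."""
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    # Special tokens come first, so "<unk>" gets index 0.
    itos = list(specials)
    # Sort by descending frequency, then alphabetically, for a stable order.
    for tok, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if freq >= min_freq and tok not in specials:
            itos.append(tok)
    return {tok: idx for idx, tok in enumerate(itos)}

def encode(tokens, stoi, unk="<unk>"):
    """Replace each token by its index; rare or unseen tokens map to <unk>."""
    return [stoi.get(tok, stoi[unk]) for tok in tokens]

# Hypothetical toy corpus, already lowercased and tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
toy_stoi = build_toy_vocab(corpus, min_freq=2)
toy_ids = encode("the cat sat on the sofa".split(), toy_stoi)
print(toy_stoi)
print(toy_ids)
```

Words below the frequency threshold ("cat") and words never seen in training ("sofa") both fall back to the `<unk>` index, which is why a special unknown token is useful.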

In [None]:
tokenizer = get_tokenizer("basic_english", language="en")

# TO DO 1) make vocabulary
# Hint) use the function build_vocab_from_iterator on train_dataset, set special tokens, etc.



We need a collate function to convert the dataset into the CBOW training format. The collate function should slide a window over each text in the batch and build the training/evaluation examples. Each example consists of the CBOW_N_WORDS words to the left and the CBOW_N_WORDS words to the right of a center word as input, and the center word as the target output.  
Make the collate function return the CBOW data as tensors.
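The sliding window described above can be sketched in plain Python on a list of token indexes. This is only an illustration under simple assumptions: `cbow_pairs` and `toy_pairs` are hypothetical names, the example works on a single text, and it returns Python lists rather than tensors.

```python
def cbow_pairs(token_ids, n_words):
    """Return (context, target) pairs for one tokenized text."""
    window = 2 * n_words + 1  # n_words left + center + n_words right
    pairs = []
    # Texts shorter than one full window produce no pairs at all.
    for start in range(len(token_ids) - window + 1):
        span = token_ids[start:start + window]
        target = span[n_words]                         # center word
        context = span[:n_words] + span[n_words + 1:]  # words on both sides
        pairs.append((context, target))
    return pairs

toy_pairs = cbow_pairs([10, 11, 12, 13, 14, 15], n_words=2)
for context, target in toy_pairs:
    print(context, "->", target)
```

With six tokens and two context words per side, the window fits twice, so the text yields two training examples.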

In [None]:
# A lambda function that tokenizes a sentence and converts its words to vocab indexes.
text_pipeline = lambda x: vocab(tokenizer(x))

![cbow](https://user-images.githubusercontent.com/74028313/204695601-51d44a38-4bd3-4a69-8891-2854aa57c034.png)

In [None]:
def collate(batch, text_pipeline):

    batch_input, batch_output = [], []
    
    # TO DO 2): make collate function

    return batch_input, batch_output

In [None]:
train_dataloader = DataLoader(
    train_dataset['text'],
    batch_size=train_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

val_dataloader = DataLoader(
    val_dataset['text'],
    batch_size=val_batch_size,
    shuffle=shuffle,
    collate_fn=partial(collate, text_pipeline=text_pipeline),
)

## Make CBOW Model
![image](https://user-images.githubusercontent.com/74028313/204701161-cd9df4bf-78b8-4b4d-b8b7-ed4a3b5c3922.png)

The main idea of the CBOW model is to predict the center (target) word from its context words. As the simple architecture above shows, the 2 × CBOW_N_WORDS input words are mapped to the projection layer. Converting each word to an embedding requires a lookup table, for which we will use torch's nn.Embedding. After combining the context embeddings, the model uses a shallow linear layer to predict the target word and compares the prediction with the center word's index using cross-entropy loss. Finally, the embedding layer (lookup table) of the trained model serves as the word embedding.
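The computation described above can be written out as follows. This is a sketch under one assumption: the context embeddings are combined by averaging (summing is an equally common choice). The symbols are introduced here for illustration: $N$ is CBOW_N_WORDS, $V$ is the vocabulary size, $d$ is EMBED_DIMENSION, $\mathbf{E} \in \mathbb{R}^{V \times d}$ is the embedding lookup table, $\mathbf{W} \in \mathbb{R}^{V \times d}$ and $\mathbf{b} \in \mathbb{R}^{V}$ are the parameters of the linear layer, and $w_t$ is the index of the center word.

$$
\begin{aligned}
\mathbf{h} &= \frac{1}{2N} \sum_{\substack{-N \le j \le N \\ j \ne 0}} \mathbf{E}[w_{t+j}] \\
\mathbf{z} &= \mathbf{W}\mathbf{h} + \mathbf{b} \\
\mathcal{L} &= -\log \frac{\exp(z_{w_t})}{\sum_{v=1}^{V} \exp(z_v)}
\end{aligned}
$$

The last line is the cross-entropy loss for a single example; nn.CrossEntropyLoss computes it directly from the raw scores $\mathbf{z}$, so the model does not need its own softmax layer.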

In [None]:
class CBOW_Model(nn.Module):
    def __init__(self, vocab_size: int, EMBED_DIMENSION, EMBED_MAX_NORM):
        super(CBOW_Model, self).__init__()
        # TO DO 3-1): make CBOW model using nn.Embedding and nn.Linear function
    

    def forward(self, _inputs):
        # TO DO 3-2): make forward function


        return _outputs

## Train the model

Let's train our CBOW model. Implement the _train_epoch and _validate_epoch functions.  
- model.train() and model.eval() switch the mode of the model; some layers (Dropout, BatchNorm, etc.) behave differently during training and inference.
- Train the model with a constant learning rate first. There is also an lr_scheduler option that changes the learning rate at each epoch. Try it if you are interested.
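To see what a per-epoch schedule does to the learning rate, the values can be computed by hand. The sketch below assumes the multiplicative rule used by LambdaLR (learning rate = initial rate × `lr_lambda(epoch)`) with the decay factor 0.95 from the optional scheduler cell; `base_lr`, `n_epochs`, and `schedule` are hypothetical names that mirror `learning_rate` and `epochs` from the parameter cell.

```python
base_lr = 0.025  # same value as learning_rate in the parameter cell
n_epochs = 5     # same value as epochs in the parameter cell

def lr_lambda(epoch):
    """Multiplicative factor applied to the initial learning rate."""
    return 0.95 ** epoch

# Learning rate used during each epoch (epoch index starts at 0).
schedule = [base_lr * lr_lambda(epoch) for epoch in range(n_epochs)]
for epoch, lr in enumerate(schedule, start=1):
    print("Epoch {}: lr={:.5f}".format(epoch, lr))
```

The first epoch runs at the full initial rate, and each later epoch uses 95% of the previous one, so the decay over five epochs is gentle.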

In [None]:
vocab_size = len(vocab.get_stoi())

model = CBOW_Model(vocab_size=vocab_size, EMBED_DIMENSION = EMBED_DIMENSION, EMBED_MAX_NORM = EMBED_MAX_NORM)
loss_function = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr = learning_rate)

In [None]:

class Train_CBOW:
    
    def __init__(
        self,
        model,
        epochs,
        train_dataloader,
        val_dataloader,
        loss_function,
        optimizer,
        device,
        model_dir,
        lr_scheduler = None
    ):  
        self.model = model
        self.epochs = epochs
        self.train_dataloader = train_dataloader
        self.val_dataloader = val_dataloader
        self.loss_function = loss_function
        self.optimizer = optimizer
        self.lr_scheduler = lr_scheduler
        self.device = device
        self.model_dir = model_dir

        self.loss = {"train": [], "val": []}
        self.model.to(self.device)

    def train(self):
        for epoch in range(self.epochs):
            self._train_epoch()
            self._validate_epoch()
            print(
                "Epoch: {}/{}, Train Loss={:.5f}, Val Loss={:.5f}".format(
                    epoch + 1,
                    self.epochs,
                    self.loss["train"][-1],
                    self.loss["val"][-1],
                )
            )
            if self.lr_scheduler is not None:
                self.lr_scheduler.step()


    def _train_epoch(self):
        self.model.train() # set model as train 
        loss_list = []
        # TO DO 4-1):


        # end of TO DO 
        epoch_loss = np.mean(loss_list)
        self.loss["train"].append(epoch_loss)

    def _validate_epoch(self):
        self.model.eval()
        loss_list = []
        
        with torch.no_grad():
            # TO DO 4-2): 

            # end of TO DO 
        epoch_loss = np.mean(loss_list)
        self.loss["val"].append(epoch_loss)
        

    def save_model(self):
        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)

    def save_loss(self):
        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)

In [None]:
# Optional: you can add or change the lr_scheduler
scheduler = LambdaLR(optimizer, lr_lambda = lambda epoch: 0.95 ** epoch)

In [None]:
trainer = Train_CBOW(
    model=model,
    epochs=epochs,
    train_dataloader=train_dataloader,
    val_dataloader=val_dataloader,
    loss_function=loss_function,
    optimizer=optimizer,
    lr_scheduler=None,
    device=device,
    model_dir=result_dir,
)

trainer.train()
print("Training finished.")


In [None]:
# save model
trainer.save_model()
trainer.save_loss()

vocab_path = os.path.join(result_dir, "vocab.pt")
torch.save(vocab, vocab_path)

### Result
Let's run inference with the trained word embedding and visualize it.

In [None]:
import pandas as pd
import sys

from sklearn.manifold import TSNE
import plotly.graph_objects as go

sys.path.append("../")

In [None]:
result_dir

In [None]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# reload saved model and vocab
model = torch.load(os.path.join(result_dir,"model.pt"), map_location=device)
vocab = torch.load(os.path.join(result_dir,"vocab.pt"))

# embedding is model's first layer
embeddings = list(model.parameters())[0]
embeddings = embeddings.cpu().detach().numpy()

# normalization
norms = (embeddings ** 2).sum(axis=1) ** (1 / 2)
norms = np.reshape(norms, (len(norms), 1))
embeddings_norm = embeddings / norms
embeddings_norm.shape



### 5-1) Make TSNE graph of trained embedding and color numeric values 

In [None]:
embeddings_df = pd.DataFrame(embeddings)
fig = go.Figure()
# TO DO 5-1) : make a 2-D TSNE graph of all vocab words and color only the numeric values



### 5-2) find top N similar words
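
Because the embeddings were normalized to unit length above, the dot product of two rows equals their cosine similarity. The toy sketch below illustrates this ranking idea in plain Python; the words, vectors, and the names `unit_normalize` and `toy_top_similar` are made up for illustration, and the sketch assumes no embedding is the zero vector.

```python
import math

def unit_normalize(vectors):
    """Scale every vector to length 1 (assumes no zero vectors)."""
    out = []
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        out.append([x / norm for x in vec])
    return out

def toy_top_similar(word, words, unit_vectors, top_n=2):
    """Rank the other words by cosine similarity to `word`."""
    query = unit_vectors[words.index(word)]
    scores = {}
    for other, vec in zip(words, unit_vectors):
        if other == word:
            continue  # skip the query word itself
        # For unit vectors, the dot product is the cosine similarity.
        scores[other] = sum(q * v for q, v in zip(query, vec))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:top_n])

# Hypothetical 2-dimensional "embeddings" for three words.
toy_words = ["king", "queen", "apple"]
toy_vectors = unit_normalize([[2.0, 0.0], [3.0, 1.0], [0.0, 5.0]])
toy_result = toy_top_similar("king", toy_words, toy_vectors, top_n=2)
for other, sim in toy_result.items():
    print("{}: {:.3f}".format(other, sim))
```

With real embeddings, the same ranking can be done in one step with a matrix-vector product over the whole normalized embedding matrix.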


In [None]:
def get_top_similar(word: str, vocab, embeddings_norm, topN: int = 10):
    # TO DO 5-2) : make a function returning the top N similar words and their similarity scores
    topN_dict = {}
    
    return topN_dict


In [None]:
for word, sim in get_top_similar("english", vocab, embeddings_norm).items():
    print("{}: {:.3f}".format(word, sim))


### Result Report

Save the Colab result and submit it with your trained model and vocab file. Double-check that your submitted notebook file contains the results.

You can change the CBOW model parameters, training parameters, and other details if you want.