## Emojify!
Welcome to the third and last programming assignment! You are going to use word vector representations to build an Emojifier.

You will implement a model which inputs a sentence (such as "Let's go see the baseball game tonight!") and finds the most appropriate emoji to be used with this sentence (⚾️).

Because the model uses word vectors, it can generalize: even if your training set explicitly relates only a few words to a particular emoji, the algorithm can associate words in the test set with that emoji even when those words never appear in the training set. This lets you build an accurate classifier mapping sentences to emojis from a small training set.

## Packages

Let's first import all the packages that you will need during this part of the assignment.

Feel free to use other libraries if you want to.

If you don't have emoji or other libraries installed, run `pip install emoji` in one of the notebook's code cells.

In [None]:
#pip install emoji

In [4]:
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from torchvision import models
from torch.autograd import Variable
import matplotlib.pyplot as plt
import emoji
import os

General Params

In [None]:
device = 'cuda' if torch.cuda.is_available() else "cpu"
batch_size = 50
epochs = 200
num_workers = 0

Helper function that plots loss vs. epochs

In [None]:
def loss_vs_epochs(train_losses_list, test_losses_list):
    plt.figure(figsize = (15,5))
    plt.plot(train_losses_list, label = 'Train Loss', c = 'red')
    plt.plot(test_losses_list, label = 'Test Loss', c = 'blue')
    plt.title('Train Loss vs. Test Loss per Epoch')
    plt.xlabel("Epoch")  # call the function; assigning to plt.xlabel would overwrite it
    plt.ylabel("Loss")
    #plt.ylim(0, 1)
    plt.legend()
    plt.show()

## Import and visualize the data

In this part you need to:
1. Import the train and test data
2. Separate the sentences (in the first column) from the index of the emoji (in the second column).
3. Convert the Y value of every sentence from emoji index (0-4) to one-hot encoding. 2 --> [0,0,1,0,0]
4. Print 10 sentences from the training data and visualize their matching emojis using the label_to_emoji() helper function. Also print the one-hot encoding of these sentences' labels.
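
As a small illustration of step 3, the sketch below one-hot encodes a few labels by indexing an identity matrix. The label values are made up for the example; the real ones come from the CSV files.

```python
import numpy as np

# Hypothetical labels for illustration only.
labels = np.array([2, 0, 4])
num_classes = 5  # emoji indices 0-4

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing np.eye with the labels picks one row per label.
one_hot = np.eye(num_classes)[labels]
print(one_hot)

# argmax along each row recovers the original indices.
recovered = one_hot.argmax(axis=1)
print(recovered)
```

Keeping the integer labels around is still useful, since some PyTorch loss functions expect class indices rather than one-hot vectors.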

In [5]:
### START CODE HERE ###
#from google.colab import drive
#drive.mount("/content/drive")
#path = "/content/drive/MyDrive/Colab Notebooks"

In [6]:
emoji_dictionary = {"0": "\u2764\uFE0F",   
                    "1": ":baseball:",
                    "2": ":smile:",
                    "3": ":disappointed:",
                    "4": ":fork_and_knife:"}

def label_to_emoji(label):
    """
    Converts a label (int or string) into the corresponding emoji code (string) ready to be printed
    """
    return emoji.emojize(emoji_dictionary[str(label)], language='alias')

In [None]:
# 1. Import the train and test data
#train_data = pd.read_csv(io.BytesIO(uploaded['train_emoji.csv']), header=None)
#test_data = pd.read_csv(io.BytesIO(uploaded['test_emoji.csv']), header=None)
train_data = pd.read_csv("train_emoji.csv", header=None)
test_data = pd.read_csv("test_emoji.csv", header=None)
#train_data = pd.read_csv(path + "/train_emoji.csv", header=None)
#test_data = pd.read_csv(path + "/test_emoji.csv", header=None)

# 2. Separate the sentences (in the first column) from the index of the emoji (in the second column).
train_data_x, train_data_y = train_data.iloc[:, 0].to_numpy(), train_data.iloc[:, 1].to_numpy()
test_data_x, test_data_y = test_data.iloc[:, 0].to_numpy(), test_data.iloc[:, 1].to_numpy()

# 3. Convert the Y value of every sentence from emoji index (0-4) to one hot encoding. 2 --> [0,0,1,0,0]
def index_to_one_hot_encoding(Y, length):
  return np.eye(length)[Y.reshape(-1)]

length = 5
train_data_y_onehot_encoded = index_to_one_hot_encoding(train_data_y, length)
test_data_y_onehot_encoded = index_to_one_hot_encoding(test_data_y, length)

# 4. Print 10 sentences from the training data with their matching emojis (via the label_to_emoji() helper function) and their one-hot encodings.
for index in range(10):
  print(f"#{index}  {label_to_emoji(train_data_y[index])}  {train_data_x[index]}  {train_data_y_onehot_encoded[index]}")


## Help functions for word embedding

The following functions will help you convert words and sentences to vectors and matrices.
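
To make the idea concrete before loading the real GloVe file, here is a minimal sketch on a made-up three-word vocabulary. The names `toy_word_to_vec`, `toy_word_to_index` and `encode` are hypothetical and exist only for this illustration; the vectors are 2-dimensional instead of 50-dimensional.

```python
import numpy as np

# Hypothetical 3-word vocabulary with 2-dimensional vectors.
toy_word_to_vec = {
    "i": np.array([0.1, 0.2]),
    "love": np.array([0.3, 0.4]),
    "you": np.array([0.5, 0.6]),
}
# Indices start at 1 so that 0 can be reserved for padding.
toy_word_to_index = {w: i for i, w in enumerate(sorted(toy_word_to_vec), start=1)}

def encode(sentence, word_to_index, max_len):
    """Map a sentence to a fixed-length array of word indices, zero-padded."""
    indices = np.zeros(max_len, dtype=np.int64)
    for j, word in enumerate(sentence.lower().split()[:max_len]):
        indices[j] = word_to_index[word]
    return indices

# Row k of the embedding matrix holds the vector of the word with index k;
# row 0 stays all zeros and represents padding.
emb_matrix = np.zeros((len(toy_word_to_index) + 1, 2))
for word, idx in toy_word_to_index.items():
    emb_matrix[idx] = toy_word_to_vec[word]

indices = encode("I love you", toy_word_to_index, max_len=5)
vectors = emb_matrix[indices]  # shape (max_len, emb_dim)
print(indices)
print(vectors.shape)
```

The same two steps (sentence to indices, indices to rows of a matrix) are what the helper functions and the embedding layer do at full scale.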

In [7]:
# A function that obtains vector representations for words. Each word is represented by a vector of size 50.
# words_to_index is a dictionary that maps words to indexes - every word has a number. 'banana' --> 67752
# index_to_words is a dictionary that maps indexes to words - every index has a matching word. 344429 --> 'strawberry'

def read_glove_vecs(glove_file):
    with open(glove_file, 'r',encoding='UTF-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        
        i = 1
        words_to_index = {}
        index_to_words = {}
        for w in sorted(words):
            words_to_index[w] = i
            index_to_words[i] = w
            i = i + 1
    return words_to_index, index_to_words, word_to_vec_map


In [10]:
# load word embeddings and create word_to_index and index_to_word dictionaries

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('glove.6B.50d.txt')

In [12]:
# visualization: 

print(f"The vector embedding of banana is: {word_to_vec_map['banana']}")
print(f"The index of the word 'tree' is: {word_to_index['tree']}")
print(f"The word matching the index 173081 is: {index_to_word[173081]}")

In [11]:
# A function that translates sentences into vectors of word indexes --> I love you --> [185457,226278,394475]
# The function zero-pads every sentence up to max_len (the longest sentence in the train set), so I love you --> [185457,226278,394475,0,0,0,0,0,0,0]

def sentences_to_indices(X, word_to_index, max_len):
    m = X.shape[0]
    X_indices = np.zeros((m, max_len))
    for i in range(m):
        sentence_words = X[i].lower().split()
        j = 0
        for w in sentence_words:
            X_indices[i, j] = int(word_to_index[w])
            j = j + 1
    return X_indices

sentences_to_indices(np.array(['hello world']),word_to_index,4)

In [None]:
# A function that builds the embedding matrix, mapping every word index to its embedding vector.
# The matrix has shape (len(word_to_index) + 1, 50): one row of size 50 per word, plus row 0 for padding.

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    vocab_len = len(word_to_index) + 1  #word index begin with 1,plus 1 for padding 0
    emb_dim = word_to_vec_map["cucumber"].shape[0] # the size of embedding of each word
    emb_matrix = np.zeros((vocab_len, emb_dim))
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]
    return emb_matrix


## Train and test data preprocessing

The models that you will build will get the sentences as their vector representations - i.e., the sentences_to_indices() function output. 

Therefore, you need to:
* Transform the data to the right form using the above functions
* Transform the data and labels to tensors
* If needed, create train and test data loaders
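
The steps above can be sketched on toy data as follows. The arrays are made up for the example, and the batch size and `shuffle=True` are illustrative choices rather than requirements of the assignment.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset, DataLoader

# Hypothetical data: 6 sentences already encoded as zero-padded index rows.
x_indices = np.array([[1, 2, 3, 0], [2, 3, 0, 0], [3, 1, 0, 0],
                      [1, 1, 2, 3], [2, 0, 0, 0], [3, 3, 1, 0]], dtype=np.float64)
y_labels = np.array([0, 1, 2, 3, 4, 0])

# nn.Embedding expects integer (long) indices, so cast once up front.
x_tensor = torch.from_numpy(x_indices).long()
y_tensor = torch.from_numpy(y_labels).long()

# TensorDataset pairs each input row with its label;
# DataLoader serves them in (optionally shuffled) mini-batches.
dataset = TensorDataset(x_tensor, y_tensor)
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_sizes = [xb.shape[0] for xb, yb in loader]
print(batch_sizes)
```

Shuffling is usually applied to the training loader only; the test loader can keep its original order.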

In [None]:
### START CODE HERE ###
# 1. Transform the data to the right form using the above functions
# First, find the max sentence length, counted in words (max(..., key=len) would compare character counts)
max_sentence_len = max(len(sentence.split()) for sentence in train_data_x)

# Now, convert the sentences to indexes vectors by the function "sentences_to_indices"
train_data_to_indices = sentences_to_indices(train_data_x, word_to_index, max_sentence_len)
test_data_to_indices = sentences_to_indices(test_data_x, word_to_index, max_sentence_len)

In [None]:
# 2. Transform the data and labels to tensors
# train data set
train_data_x_tensor = torch.from_numpy(train_data_to_indices)
train_data_y_tensor = torch.from_numpy(train_data_y)
train_data_tensor = torch.utils.data.TensorDataset(train_data_x_tensor, train_data_y_tensor)

# test data set
test_data_x_tensor = torch.from_numpy(test_data_to_indices)
test_data_y_tensor = torch.from_numpy(test_data_y)
test_data_tensor = torch.utils.data.TensorDataset(test_data_x_tensor, test_data_y_tensor)

In [None]:
# 3. Create train and test data loaders
#train_batch_size = len(train_data_x)
train_loader = torch.utils.data.DataLoader(dataset=train_data_tensor, batch_size=batch_size, num_workers=num_workers)
test_loader = torch.utils.data.DataLoader(dataset=test_data_tensor, batch_size=batch_size, num_workers=num_workers)

# First model - regular neural network

Build a neural network model that gets:

1. The vocabulary size
2. Embedding dimension - the length of every embedding vector
3. Pretrained embedding weights - the embedding matrix 
                                       
and returns: 
1. A 5-dimensional vector with the scores of every emoji.


---

Then train the model and plot loss vs. epoch for train and test set. 

Show the results on 5 new sentences.
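
One possible shape for such a model is sketched below. It is not the notebook's base model: the class name `MeanPoolClassifier`, the single linear layer and the toy sizes are assumptions made for illustration. The sketch averages the embeddings of the real (non-padding) words of each sentence and maps the average to five scores.

```python
import torch
import torch.nn as nn

class MeanPoolClassifier(nn.Module):
    """Hypothetical baseline: average the non-padding word vectors, then classify."""

    def __init__(self, vocab_size, embedding_dim, num_classes=5):
        super().__init__()
        # padding_idx=0 keeps the padding row at zero and excludes it from updates.
        self.word_embeds = nn.Embedding(vocab_size, embedding_dim, padding_idx=0)
        self.classifier = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        emb = self.word_embeds(x)                  # [batch, seq_len, emb_dim]
        mask = (x != 0).unsqueeze(-1).float()      # 1 for real words, 0 for padding
        summed = (emb * mask).sum(dim=1)           # sum over real words only
        counts = mask.sum(dim=1).clamp(min=1.0)    # avoid division by zero
        return self.classifier(summed / counts)    # raw scores (logits)

torch.manual_seed(0)
model = MeanPoolClassifier(vocab_size=10, embedding_dim=4)
batch = torch.tensor([[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]])
scores = model(batch)
predicted = scores.argmax(dim=1)  # index of the highest-scoring emoji per sentence
print(scores.shape, predicted)
```

In the assignment setting, the pretrained matrix would be copied into `word_embeds.weight` as in the base model, and `predicted` could be passed to `label_to_emoji()` to display results for new sentences.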


---

You can use the added model as your base model

In [None]:
### START CODE HERE ###
def train_model_2(device, epochs, train_loader, test_loader, model, optimizer, loss_func, train_x_tensor, test_x_tensor):
  
  train_loss_list = []
  train_accuracy_list = []
  test_loss_list = []
  test_accuracy_list = []

  for e in range(epochs):
    train_loss_calc = 0
    test_loss_calc = 0
    train_accuracy_calc = 0
    test_accuracy_calc = 0 

    model.train() # Switch to training mode before iterating over the batches

    for train_inputs, train_labels in train_loader:
      train_inputs = train_inputs.long()
      train_labels = train_labels.long()
      optimizer.zero_grad() # Clear the gradients of all optimized variables
      train_outputs = model(train_inputs) # Forward pass
      loss = loss_func(train_outputs, train_labels) # Calculate the loss
      loss.backward() # Backward pass
      optimizer.step() # Doing the optimizer step
      train_loss_calc += loss.item() # Update running training loss
      # Count the correct predictions in this batch
      train_accuracy_calc += (train_outputs.argmax(dim=1) == train_labels).sum().item()

    # Update lists  
    train_loss_list.append(train_loss_calc/len(train_loader.dataset))
    train_accuracy_list.append(train_accuracy_calc/len(train_x_tensor))    
      
    model.eval() # Switch to evaluation mode (disables dropout and similar layers)
      
    with torch.no_grad():
      for test_input, test_labels in test_loader:
        test_input = test_input.long()
        test_labels = test_labels.long()
        test_outputs = model(test_input) # Forward pass
        test_loss_calc += loss_func(test_outputs, test_labels).item() # Update running test loss
        # Count the correct predictions in this batch
        test_accuracy_calc += (test_outputs.argmax(dim=1) == test_labels).sum().item()

    test_loss_list.append(test_loss_calc/len(test_loader.dataset))
    test_accuracy_list.append(test_accuracy_calc/len(test_x_tensor)) # Normalize by the test set size

    if e % 10 == 0:
      print(f"Epoch: {e + 1}/{epochs}... \tTrain Loss: {round(train_loss_list[-1], 5)} \tTrain Accuracy: {round(train_accuracy_list[-1], 5)} \tTest Loss: {round(test_loss_list[-1], 5)} \tTest Accuracy: {round(test_accuracy_list[-1], 5)}")

  return train_loss_list, train_accuracy_list, test_loss_list, test_accuracy_list

In [None]:
class NN_Model(nn.Module):
    
    def __init__(self,vocab_size,embedding_dim,pretrained_weight):
        super(NN_Model,self).__init__()
        self.word_embeds = nn.Embedding(vocab_size, embedding_dim) # stores embeddings of a fixed dictionary and size
        self.word_embeds.weight.data.copy_(torch.from_numpy(pretrained_weight)) # place the pretrained weights to the embedding function
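        self.layers = nn.Sequential(
            nn.Linear(embedding_dim, embedding_dim//2), 
            nn.ReLU(),
            nn.Linear(embedding_dim//2, embedding_dim//4),
            nn.ReLU(),
            nn.Linear(embedding_dim//4, 5),
            # NOTE: nn.CrossEntropyLoss applies log-softmax itself; drop this layer if you train with that loss
            nn.Softmax(dim=1)
        )

    def forward(self,x):
        out = self.word_embeds(x) # [batch_size, seq_len, embedding_dim]
        # out = torch.flatten(out, start_dim=1)
        # NOTE: keeps only the last position, which is padding (index 0) for every sentence shorter than max_len
        out = out[:,-1,:]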
        out = self.layers(out)
        return out

# Second model - neural network with RNN

Build a neural network + RNN model that gets the vocabulary size, embedding dimension and pretrained embedding weights, and returns a 5-dimensional vector with the scores of every emoji.

Then train the model and plot loss vs. epoch for train and test set.

https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html
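
As a shape reference for `nn.LSTM`, here is a minimal sketch on random input. The `hidden_size` of 32, the single layer and the linear head are arbitrary assumptions for illustration, not the intended solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
batch_size, seq_len, embedding_dim, hidden_size = 3, 10, 50, 32  # hidden_size is arbitrary

# batch_first=True makes the LSTM accept [batch, seq_len, features],
# matching the output of nn.Embedding on a [batch, seq_len] index tensor.
lstm = nn.LSTM(input_size=embedding_dim, hidden_size=hidden_size, batch_first=True)
head = nn.Linear(hidden_size, 5)  # one score per emoji

embedded = torch.randn(batch_size, seq_len, embedding_dim)  # stand-in for embedded sentences

# With no initial state passed, h_0 and c_0 default to zeros.
output, (h_n, c_n) = lstm(embedded)
last_step = output[:, -1, :]  # hidden state at the final time step
scores = head(last_step)

print(output.shape, h_n.shape, scores.shape)
```

For a single-layer, unidirectional LSTM, `output[:, -1, :]` and `h_n[0]` hold the same values, so either can feed the classification layers.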

---

You can use the added model as your base model

In [None]:
### START CODE HERE ###

In [None]:
class RNN_Model(nn.Module):
    
    def __init__(self,vocab_size,embedding_dim,pretrained_weight):
        super(RNN_Model,self).__init__()
        self.word_embeds = nn.Embedding(vocab_size, embedding_dim) # stores embeddings of a fixed dictionary and size
        self.word_embeds.weight.data.copy_(torch.from_numpy(pretrained_weight)) # place the pretrained weights to the embedding function
        self.rnn = ### add the model here
        self.layers = nn.Sequential(
        ### add the model here
        )

    def forward(self,x,h):
        out = self.word_embeds(x)
        out, _ = self.rnn(out,h)
        out = out[:, -1, :]
        out = self.layers(out)
        return out


# Third model - neural network with transformers

Build a neural network + transformer model that gets the vocabulary size, embedding dimension and pretrained embedding weights, and returns a 5-dimensional vector with the scores of every emoji.

Then train the model and plot loss vs. epoch for train and test set.

https://pytorch.org/docs/stable/generated/torch.nn.Transformer.html
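
For a classifier, only the encoder half of the transformer is typically needed. The sketch below runs `nn.TransformerEncoder` on random input with a padding mask. The values of `nhead`, `dim_feedforward` and `num_layers`, and the choice to classify from the first position, are illustrative assumptions rather than the expected solution.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
embedding_dim, nhead = 50, 5  # nhead must divide embedding_dim; 5 is an arbitrary choice

# dim_feedforward and num_layers are illustrative values, not tuned;
# dropout=0.0 keeps this demo deterministic.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=embedding_dim, nhead=nhead, dim_feedforward=64, dropout=0.0)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
head = nn.Linear(embedding_dim, 5)  # one score per emoji

indices = torch.tensor([[1, 2, 3, 0, 0], [4, 5, 0, 0, 0]])  # [batch, seq_len], 0 = padding
embedded = torch.randn(2, 5, embedding_dim)                  # stand-in for embedded sentences

# By default the encoder expects [seq_len, batch, features].
src = embedded.transpose(0, 1)
padding_mask = indices == 0  # True marks positions that attention should ignore

encoded = encoder(src, src_key_padding_mask=padding_mask)  # [seq_len, batch, features]
scores = head(encoded[0])  # classify from the first token's representation

print(encoded.shape, scores.shape)
```

Positional encoding would be added to `src` before the encoder; it is left out here to keep the shapes easy to follow.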


---

You can use the added model as your base model

In [None]:
### START CODE HERE ###

In [13]:
import math

class PositionalEncoding(nn.Module):

    def __init__(self, d_model: int, dropout: float = 0.1, max_len: int = 5000):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        position = torch.arange(max_len).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, 1, d_model)
        pe[:, 0, 0::2] = torch.sin(position * div_term)
        pe[:, 0, 1::2] = torch.cos(position * div_term)
        self.register_buffer('pe', pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Args:
            x: Tensor, shape [seq_len, batch_size, embedding_dim]
        """
        x = x + self.pe[:x.size(0)]
        return self.dropout(x)

In [None]:
class Transformer_model(nn.Module):
    
    def __init__(self,vocab_size, embedding_dim, pretrained_weight):
        super(Transformer_model,self).__init__()
        self.word_embeds = nn.Embedding(vocab_size, embedding_dim)
        self.word_embeds.weight.data.copy_(torch.from_numpy(pretrained_weight))

        self.pos_encoding = PositionalEncoding(embedding_dim)
        self.layers = nn.Sequential(
            ### complete code here
        )


    def forward(self,x):
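        out = self.word_embeds(x) # [batch_size, seq_len, embedding_dim]
        out = out.transpose(0, 1) # PositionalEncoding expects [seq_len, batch_size, embedding_dim]
        out = self.pos_encoding(out) # already returns x + pe, so assign instead of adding again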
        out = self.layers(out)
        return out



### Compare the models - which one had the best results? Try to explain why. 

Write your answer here