In [None]:
%matplotlib inline

Source: [https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words](https://pytorch.org/tutorials/beginner/nlp/word_embeddings_tutorial.html#exercise-computing-word-embeddings-continuous-bag-of-words)

# Word Embeddings: Encoding Lexical Semantics

Word embeddings are dense vectors of real numbers, one per word in your
vocabulary. In NLP, it is almost always the case that your features are
words! But how should you represent a word in a computer? You could
store its ASCII character representation, but that only tells you what
the word *is*; it doesn't say much about what it *means* (you might be
able to derive its part of speech from its affixes, or properties from
its capitalization, but not much). Even more, in what sense could you
combine these representations? We often want dense outputs from our
neural networks: the inputs are $|V|$-dimensional, where $V$ is our
vocabulary, but the outputs often have only a few dimensions (if we are
only predicting a handful of labels, for instance). How do we get from
a massive-dimensional space to a smaller-dimensional space?

Instead of ASCII representations, how about a one-hot encoding?
That is, we represent the word $w$ by

\begin{align}\overbrace{\left[ 0, 0, \dots, 1, \dots, 0, 0 \right]}^\text{|V| elements}\end{align}

where the 1 is in a location unique to $w$. Any other word will
have a 1 in some other location, and a 0 everywhere else.
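
As a small illustration in plain Python, one-hot vectors can be built as sketched here. The three-word vocabulary and the helper names `one_hot` and `dot` are made up for this example. Note that the dot product of two different one-hot vectors is always 0.

```python
# Toy vocabulary; the words and their order are illustrative assumptions.
vocab = ["mathematician", "physicist", "store"]
word_to_ix = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a |V|-element list with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[word_to_ix[word]] = 1
    return vec


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


print(one_hot("physicist"))                                  # [0, 1, 0]
print(dot(one_hot("mathematician"), one_hot("physicist")))   # 0
print(dot(one_hot("physicist"), one_hot("physicist")))       # 1
```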

There is an enormous drawback to this representation, besides just how
huge it is. It basically treats all words as independent entities with
no relation to each other. What we really want is some notion of
*similarity* between words. Why? Let's see an example.

Suppose we are building a language model. Suppose we have seen the
sentences

* The mathematician ran to the store.
* The physicist ran to the store.
* The mathematician solved the open problem.

in our training data. Now suppose we get a new sentence never before
seen in our training data:

* The physicist solved the open problem.

Our language model might do OK on this sentence, but wouldn't it be much
better if we could use the following two facts:

* We have seen mathematician and physicist in the same role in a sentence. Somehow they
  have a semantic relation.
* We have seen mathematician in the same role in this new unseen sentence
  as we are now seeing physicist.

and then infer that physicist is actually a good fit in the new unseen
sentence? This is what we mean by a notion of similarity: we mean
*semantic similarity*, not simply having similar orthographic
representations. It is a technique to combat the sparsity of linguistic
data, by connecting the dots between what we have seen and what we
haven't. This example of course relies on a fundamental linguistic
assumption: that words appearing in similar contexts are related to each
other semantically. This is called the
[distributional hypothesis](https://en.wikipedia.org/wiki/Distributional_semantics).
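
To make the idea of "appearing in similar contexts" concrete, here is a rough sketch that collects the immediate neighbors of each word in the three training sentences above. Lowercasing, dropping the punctuation, and the `<s>` / `</s>` boundary markers are simplifying assumptions of this sketch, not part of the tutorial.

```python
from collections import defaultdict

# The three training sentences, lowercased and without punctuation (assumption).
sentences = [
    "the mathematician ran to the store",
    "the physicist ran to the store",
    "the mathematician solved the open problem",
]

# Map each word to the set of (previous word, next word) pairs it appears between.
contexts = defaultdict(set)
for sentence in sentences:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for i in range(1, len(tokens) - 1):
        contexts[tokens[i]].add((tokens[i - 1], tokens[i + 1]))

# Contexts that both words have been seen in.
shared = contexts["mathematician"] & contexts["physicist"]
print(shared)  # {('the', 'ran')}
```

The shared context is the evidence that lets a model treat the two words as related, even though "physicist solved" never occurs in the training data.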


# Getting Dense Word Embeddings

How can we solve this problem? That is, how could we actually encode
semantic similarity in words? Maybe we think up some semantic
attributes. For example, we see that both mathematicians and physicists
can run, so maybe we give these words a high score for the "is able to
run" semantic attribute. Think of some other attributes, and imagine
what you might score some common words on those attributes.

If each attribute is a dimension, then we might give each word a vector,
like this:

\begin{align}q_\text{mathematician} = \left[ \overbrace{2.3}^\text{can run},
   \overbrace{9.4}^\text{likes coffee}, \overbrace{-5.5}^\text{majored in Physics}, \dots \right]\end{align}

\begin{align}q_\text{physicist} = \left[ \overbrace{2.5}^\text{can run},
   \overbrace{9.1}^\text{likes coffee}, \overbrace{6.4}^\text{majored in Physics}, \dots \right]\end{align}

Then we can get a measure of similarity between these words by doing:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = q_\text{physicist} \cdot q_\text{mathematician}\end{align}

Although it is more common to normalize by the lengths:

\begin{align}\text{Similarity}(\text{physicist}, \text{mathematician}) = \frac{q_\text{physicist} \cdot q_\text{mathematician}}
   {\| q_\text{physicist} \| \| q_\text{mathematician} \|} = \cos (\phi)\end{align}

where $\phi$ is the angle between the two vectors. That way,
extremely similar words (words whose embeddings point in the same
direction) will have similarity 1. Extremely dissimilar words should
have similarity -1.
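
The normalized similarity can be computed directly. The sketch below uses only the three attribute scores written out in the hand-crafted vectors above and ignores the elided entries, so the resulting number is illustrative only. The helper name `cosine_similarity` is an assumption of this example.

```python
import math

# First three attribute scores from the hand-crafted vectors above
# (can run, likes coffee, majored in Physics); the elided entries are ignored.
q_mathematician = [2.3, 9.4, -5.5]
q_physicist = [2.5, 9.1, 6.4]


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


sim = cosine_similarity(q_physicist, q_mathematician)
print(round(sim, 3))
```

On these three dimensions the two words agree on "can run" and "likes coffee" but disagree on "majored in Physics", so the similarity comes out positive but well below 1.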


You can think of the sparse one-hot vectors from the beginning of this
section as a special case of these new vectors we have defined, where
each word basically has similarity 0, and we gave each word some unique
semantic attribute. These new vectors are *dense*, which is to say their
entries are (typically) non-zero.

But these new vectors are a big pain: you could think of thousands of
different semantic attributes that might be relevant to determining
similarity, and how on earth would you set the values of the different
attributes? Central to the idea of deep learning is that the neural
network learns representations of the features, rather than requiring
the programmer to design them herself. So why not just let the word
embeddings be parameters in our model, and then be updated during
training? This is exactly what we will do. We will have some *latent
semantic attributes* that the network can, in principle, learn. Note
that the word embeddings will probably not be interpretable. That is,
although with our hand-crafted vectors above we can see that
mathematicians and physicists are similar in that they both like coffee,
if we allow a neural network to learn the embeddings and see that both
mathematicians and physicists have a large value in the second
dimension, it is not clear what that means. They are similar in some
latent semantic dimension, but this probably has no interpretation to
us.


In summary, **word embeddings are a representation of the *semantics* of
a word, efficiently encoding semantic information that might be relevant
to the task at hand**. You can embed other things too: part of speech
tags, parse trees, anything! The idea of feature embeddings is central
to the field.


# Word Embeddings in PyTorch

Before we get to a worked example and an exercise, a few quick notes
about how to use embeddings in PyTorch and in deep learning programming
in general. Similar to how we defined a unique index for each word when
making one-hot vectors, we also need to define an index for each word
when using embeddings. These will be keys into a lookup table. That is,
embeddings are stored as a $|V| \times D$ matrix, where $D$
is the dimensionality of the embeddings, such that the word assigned
index $i$ has its embedding stored in the $i$'th row of the
matrix. In all of my code, the mapping from words to indices is a
dictionary named `word_to_ix`.

The module that allows you to use embeddings is `torch.nn.Embedding`,
which takes two arguments: the vocabulary size, and the dimensionality
of the embeddings.

To index into this table, you must use a tensor of dtype `torch.long`
(a `torch.LongTensor`), since the indices are integers, not floats.
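
Conceptually, an embedding lookup is just row selection from the $|V| \times D$ matrix. The following plain-Python sketch (no PyTorch; the table values and the helper names `lookup` and `one_hot_times_table` are arbitrary placeholders) shows that selecting row $i$ gives the same result as multiplying a one-hot vector by the matrix, without ever building the one-hot vector.

```python
# A toy |V| x D embedding table; the numbers are arbitrary placeholders.
word_to_ix = {"hello": 0, "world": 1}
embedding_table = [
    [0.1, -0.2, 0.3],   # row 0: embedding for "hello"
    [0.4, 0.5, -0.6],   # row 1: embedding for "world"
]


def lookup(word):
    """Return the row of the table stored at the word's index."""
    return embedding_table[word_to_ix[word]]


def one_hot_times_table(word):
    """Same result via one-hot vector times matrix, at far greater cost."""
    one_hot = [1 if i == word_to_ix[word] else 0 for i in range(len(word_to_ix))]
    dim = len(embedding_table[0])
    return [sum(one_hot[r] * embedding_table[r][c] for r in range(len(one_hot)))
            for c in range(dim)]


print(lookup("world"))
print(one_hot_times_table("world") == lookup("world"))
```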




In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

torch.manual_seed(1)

<torch._C.Generator at 0x7fe5771a5c10>

In [None]:
word_to_ix = {"hello": 0, "world": 1}
embeds = nn.Embedding(2, 5)  # 2 words in vocab, 5 dimensional embeddings
lookup_tensor = torch.tensor([word_to_ix["hello"]], dtype=torch.long)
hello_embed = embeds(lookup_tensor)
print(hello_embed)

tensor([[ 0.6614,  0.2669,  0.0617,  0.6213, -0.4519]],
       grad_fn=<EmbeddingBackward0>)


# An Example: N-Gram Language Modeling

Recall that in an n-gram language model, given a sequence of words
$w$, we want to compute

\begin{align}P(w_i | w_{i-1}, w_{i-2}, \dots, w_{i-n+1} )\end{align}

where $w_i$ is the $i$-th word of the sequence.

In this example, we will compute the loss function on some training
examples and update the parameters with backpropagation.
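
The loss used in the example below is the negative log-likelihood of the correct next word under a log-softmax over the vocabulary. As a rough plain-Python sketch of what that computes for a single training example (the raw scores are hypothetical, and the helper names are assumptions of this sketch rather than PyTorch functions):

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll_loss(log_probs, target_ix):
    """Negative log-likelihood of the target index."""
    return -log_probs[target_ix]


# Hypothetical raw scores over a three-word vocabulary.
scores = [2.0, 0.5, -1.0]
log_probs = log_softmax(scores)

# The loss is small when the target has a high score and large otherwise.
loss_if_likely = nll_loss(log_probs, 0)
loss_if_unlikely = nll_loss(log_probs, 2)
print(loss_if_likely, loss_if_unlikely)
```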




In [None]:
CONTEXT_SIZE = 2
EMBEDDING_DIM = 10
# We will use Shakespeare Sonnet 2
test_sentence = """When forty winters shall besiege thy brow,
And dig deep trenches in thy beauty's field,
Thy youth's proud livery so gazed on now,
Will be a totter'd weed of small worth held:
Then being asked, where all thy beauty lies,
Where all the treasure of thy lusty days;
To say, within thine own deep sunken eyes,
Were an all-eating shame, and thriftless praise.
How much more praise deserv'd thy beauty's use,
If thou couldst answer 'This fair child of mine
Shall sum my count, and make my old excuse,'
Proving his beauty by succession thine!
This were to be new made when thou art old,
And see thy blood warm when thou feel'st it cold.""".split()
# we should tokenize the input, but we will ignore that for now
# build a list of tuples.  Each tuple is ([ word_i-2, word_i-1 ], target word)
trigrams = [([test_sentence[i], test_sentence[i + 1]], test_sentence[i + 2])
            for i in range(len(test_sentence) - 2)]
# print the first 3, just so you can see what they look like
print(trigrams[:3])

vocab = set(test_sentence)
word_to_ix = {word: i for i, word in enumerate(vocab)}


class NGramLanguageModeler(nn.Module):

    def __init__(self, vocab_size, embedding_dim, context_size):
        super(NGramLanguageModeler, self).__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(context_size * embedding_dim, 128)
        self.linear2 = nn.Linear(128, vocab_size)

    def forward(self, inputs):
        embeds = self.embeddings(inputs).view((1, -1))
        out = F.relu(self.linear1(embeds))
        out = self.linear2(out)
        log_probs = F.log_softmax(out, dim=1)
        return log_probs


losses = []
loss_function = nn.NLLLoss()
model = NGramLanguageModeler(len(vocab), EMBEDDING_DIM, CONTEXT_SIZE)
optimizer = optim.SGD(model.parameters(), lr=0.001)

for epoch in range(10):
    total_loss = 0
    for context, target in trigrams:

        # Step 1. Prepare the inputs to be passed to the model (i.e, turn the words
        # into integer indices and wrap them in tensors)
        context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long)

        # Step 2. Recall that torch *accumulates* gradients. Before passing in a
        # new instance, you need to zero out the gradients from the old
        # instance
        model.zero_grad()

        # Step 3. Run the forward pass, getting log probabilities over next
        # words
        log_probs = model(context_idxs)

        # Step 4. Compute your loss function. (Again, Torch wants the target
        # word wrapped in a tensor)
        loss = loss_function(log_probs, torch.tensor([word_to_ix[target]], dtype=torch.long))

        # Step 5. Do the backward pass and update the parameters
        loss.backward()
        optimizer.step()

        # Get the Python number from a 1-element Tensor by calling tensor.item()
        total_loss += loss.item()
    losses.append(total_loss)
print(losses)  # The loss decreased every iteration over the training data!

[(['When', 'forty'], 'winters'), (['forty', 'winters'], 'shall'), (['winters', 'shall'], 'besiege')]
[519.6882209777832, 517.0225141048431, 514.3737993240356, 511.74088954925537, 509.12376856803894, 506.5208954811096, 503.933536529541, 501.3607723712921, 498.8004786968231, 496.25152039527893]


# Exercise: Computing Word Embeddings: Continuous Bag-of-Words

The Continuous Bag-of-Words model (CBOW) is frequently used in NLP deep
learning. It is a model that tries to predict words given the context of
a few words before and a few words after the target word. This is
distinct from language modeling, since CBOW is not sequential and does
not have to be probabilistic. Typically, CBOW is used to quickly train
word embeddings, and these embeddings are used to initialize the
embeddings of some more complicated model. Usually, this is referred to
as *pretraining embeddings*. It almost always improves performance by a
couple of percent.

The CBOW model is as follows. Given a target word $w_i$ and a context
window of $N$ words on each side, $w_{i-1}, \dots, w_{i-N}$
and $w_{i+1}, \dots, w_{i+N}$, referring to all context words
collectively as $C$, CBOW tries to minimize

\begin{align}-\log p(w_i | C) = -\log \text{Softmax}(A(\sum_{w \in C} q_w) + b)\end{align}

where $q_w$ is the embedding for word $w$.
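
To see what the objective computes before writing any PyTorch, here is a plain-Python sketch of the formula for one example. The five-word vocabulary, the 2-dimensional embeddings, and the values of $A$ and $b$ are all toy numbers invented for illustration; this is not a solution to the exercise.

```python
import math

# Toy sizes and values, chosen only for illustration.
dim = 2
embeddings = {            # q_w: one 2-dimensional vector per word
    "we": [0.1, 0.3],
    "are": [0.0, -0.2],
    "to": [0.5, 0.1],
    "study": [-0.3, 0.4],
    "about": [0.2, 0.2],
}
vocab = list(embeddings)  # output index i corresponds to vocab[i]
A = [[0.2, -0.1], [0.0, 0.3], [0.1, 0.1], [-0.2, 0.4], [0.3, 0.2]]  # |V| x D
b = [0.0, 0.1, 0.0, -0.1, 0.05]                                     # |V|


def cbow_loss(context, target):
    # Sum the context embeddings: sum_{w in C} q_w
    summed = [sum(embeddings[w][d] for w in context) for d in range(dim)]
    # Affine map to one score per vocabulary word: A * summed + b
    scores = [sum(a * x for a, x in zip(row, summed)) + bias
              for row, bias in zip(A, b)]
    # -log Softmax(scores)[target], computed stably
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return -(scores[vocab.index(target)] - log_norm)


loss = cbow_loss(["we", "are", "to", "study"], "about")
print(loss)
```

Because the context embeddings are summed, the order of the context words does not affect the loss.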

Implement this model in PyTorch by filling in the class below. Some
tips:

* Think about which parameters you need to define.
* Make sure you know what shape each operation expects. Use `.view()` if you need to
  reshape.




In [None]:
CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
embedding_dimension = 50
raw_text = """We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells.""".split()

# By deriving a set from `raw_text`, we deduplicate the array
vocab = set(raw_text)
vocab_size = len(vocab)

word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {ix:word for ix, word in enumerate(vocab)}

data = []
for i in range(2, len(raw_text) - 2):
    context = [raw_text[i - 2], raw_text[i - 1],
               raw_text[i + 1], raw_text[i + 2]]
    target = raw_text[i]
    data.append((context, target))
print(data[:5])


class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dimension):

      super(CBOW, self).__init__()

      self.embeddings = nn.Embedding(vocab_size, embedding_dimension)
      self.linear_1 = nn.Linear(embedding_dimension, 128)
      self.activation_function_1 = nn.ReLU()

      self.linear_2 = nn.Linear(128, vocab_size)
      self.activation_function_2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):

      embeddings = sum(self.embeddings(inputs)).view(1,-1)
      out = self.linear_1(embeddings)
      out = self.activation_function_1(out)
      out = self.linear_2(out)
      out = self.activation_function_2(out)
      return out 
    
    def get_word_embedding(self, word):

      word = torch.tensor([word_to_ix[word]])
      return self.embeddings(word).view(1,-1)


# create your model and train.  here are some functions to help you make
# the data ready for use by your module


def make_context_vector(context, word_to_ix):
    idxs = [word_to_ix[w] for w in context]
    return torch.tensor(idxs, dtype=torch.long)


#make_context_vector(data[0][0], word_to_ix)  # example

[(['We', 'are', 'to', 'study'], 'about'), (['are', 'about', 'study', 'the'], 'to'), (['about', 'to', 'the', 'idea'], 'study'), (['to', 'study', 'idea', 'of'], 'the'), (['study', 'the', 'of', 'a'], 'idea')]


In [None]:
model = CBOW(vocab_size, embedding_dimension)
loss_function = nn.NLLLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

In [None]:
for epoch in range(50):
    total_loss = 0

    for context, target in data:
        context_vector = make_context_vector(context, word_to_ix)  

        log_probs = model(context_vector)

        total_loss += loss_function(log_probs, torch.tensor([word_to_ix[target]]))

    #optimize at the end of each epoch
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()


In [None]:
#TESTING
context = ['People','create','to', 'direct']
context_vector = make_context_vector(context, word_to_ix)
a = model(context_vector)

In [None]:
#Print result
print(f'Raw text: {" ".join(raw_text)}\n')
print(f'Context: {context}\n')
print(f'Prediction: {ix_to_word[torch.argmax(a[0]).item()]}')

Raw text: We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells.

Context: ['People', 'create', 'to', 'direct']

Prediction: programs



## The tasks for the TripAdvisor and Scifi dataset start from here


## TripAdvisor Dataset

In [None]:
#checking the GPU connection
gpu_info = !nvidia-smi
gpu_info = '\n'.join(gpu_info)
if gpu_info.find('failed') >= 0:
  print('Not connected to a GPU')
else:
  print(gpu_info)

Sun Oct 30 21:20:10 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P8    11W /  70W |      3MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:
from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
  print('Not using a high-RAM runtime')
else:
  print('You are using a high-RAM runtime!')

Your runtime has 27.3 gigabytes of available RAM

You are using a high-RAM runtime!


In [None]:
#Installing emoji package
!pip install emoji

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting emoji
  Downloading emoji-2.1.0.tar.gz (216 kB)
     |████████████████████████████████| 216 kB 5.2 MB/s 
Building wheels for collected packages: emoji
  Building wheel for emoji (setup.py) ... done
  Created wheel for emoji: filename=emoji-2.1.0-py3-none-any.whl size=212392 sha256=4704ca46d02f778eeb948cd87677c045349a4e822a66b79839826268ff3a60d1
  Stored in directory: /root/.cache/pip/wheels/77/75/99/51c2a119f4cfd3af7b49cc57e4f737bed7e40b348a85d82804
Successfully built emoji
Installing collected packages: emoji
Successfully installed emoji-2.1.0


In [None]:
#Mounting google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Importing all necessary packages and libraries

from google.colab import drive
import pandas as pd
import numpy as np
import re
#import emoji
from sklearn.feature_extraction.text import CountVectorizer
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer
from scipy import sparse
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import TransformerMixin, BaseEstimator
import pickle
from sklearn.model_selection import train_test_split
import scipy
from sklearn.metrics import precision_score, recall_score, f1_score, confusion_matrix
import matplotlib.pyplot as plt
from mpl_toolkits.axes_grid1 import make_axes_locatable

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


In [None]:
#importing more packages and libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import urllib.request
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk import word_tokenize
import sklearn
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import euclidean_distances

In [None]:
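#Function to load a CSV file into a pandas DataFrame
def load_df(file_path):
    df = pd.read_csv(file_path, on_bad_lines="skip", sep = ",")
    #Stack the header back in as row 0 and use integer column names;
    #the columns are renamed and row 0 is dropped in later cells
    df = pd.DataFrame(np.vstack([df.columns, df]))
    return df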

In [None]:
%cd /content/drive/My Drive/ML4NLP_Assignment2

/content/drive/My Drive/ML4NLP_Assignment2


In [None]:
#Path to our raw data set, the TripAdvisor hotel reviews dataset
raw_data_path = "/content/drive/My Drive/ML4NLP_Assignment2/tripadvisor_hotel_reviews.csv"

In [None]:
raw_data_df = load_df(raw_data_path)

In [None]:
raw_data_df.head()

Unnamed: 0,0,1
0,Review,Rating
1,nice hotel expensive parking got good deal sta...,4
2,ok nothing special charge diamond member hilto...,2
3,nice rooms not 4* experience hotel monaco seat...,3
4,"unique, great stay, wonderful time hotel monac...",5


In [None]:
for col in raw_data_df.columns:
  print(col)

0
1


In [None]:
raw_data_df.columns = ['Review', 'Rating']


In [None]:
raw_data_df.head()

Unnamed: 0,Review,Rating
0,Review,Rating
1,nice hotel expensive parking got good deal sta...,4
2,ok nothing special charge diamond member hilto...,2
3,nice rooms not 4* experience hotel monaco seat...,3
4,"unique, great stay, wonderful time hotel monac...",5


In [None]:
raw_data_df = raw_data_df.drop(0)

In [None]:
raw_data_df.head()

Unnamed: 0,Review,Rating
1,nice hotel expensive parking got good deal sta...,4
2,ok nothing special charge diamond member hilto...,2
3,nice rooms not 4* experience hotel monaco seat...,3
4,"unique, great stay, wonderful time hotel monac...",5
5,"great stay great stay, went seahawk game aweso...",5


In [None]:
raw_data_df = raw_data_df[:201260]

In [None]:
#We chose to include only 40% of the original dataset in the project, as it was impossible to include 100% due to RAM, GPU, and time limitations 

subset = len(raw_data_df) // 10 * 4
raw_data_df = raw_data_df.sample(n = subset )

In [None]:
from torch.utils.data import Dataset

In [None]:
from io import StringIO
from tqdm.notebook import tqdm_notebook as tqdm

import random
import re
import nltk
import numpy as np
import pandas as pd
import requests
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

In [None]:
import nltk
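#nltk.download returns True on success; keep that flag separate from the
#`stopwords` corpus module imported elsewhere
stopwords_downloaded = nltk.download('stopwords')
print(stopwords_downloaded)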

True


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
tqdm.pandas()

## Data Preprocessing

In [None]:
import re, string, unicodedata
import nltk
import inflect
from bs4 import BeautifulSoup
from nltk import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer, WordNetLemmatizer

In [None]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [None]:
#We define a class to pre-process the data
class DataCleaner:

      def __init__(self, text = "test"):                #initializing 
        self.text = text

      def strip_html(self):                             #removing html
        soup = BeautifulSoup(self.text, "html.parser")
        self.text = soup.get_text()
        return self

      def remove_between_square_brackets(self):         #removing text enclosed in square brackets
        self.text = re.sub(r'\[[^]]*\]', '', self.text)
        return self

      def remove_numbers(self):                         #removing numbers 
        self.text = re.sub(r'[-+]?[0-9]+', '', self.text)
        return self

      def get_words(self):                              #tokenizing
         self.words = nltk.word_tokenize(self.text)
         return self

      def to_lowercase(self):                           #convert to lowercase
         new_words = []
         for word in self.words:
            new_word = word.lower()
            new_words.append(new_word)
         self.words = new_words
         return self
      
      def remove_punctuation(self):                     #removing punctuation
        new_words = []
        for word in self.words:
            new_word = re.sub(r'[^\w\s]', '', word)
            if new_word != '':
                new_words.append(new_word)
        self.words = new_words
        return self
        
      def remove_stopwords(self):                       #removing stopwords
        stop_set = set(stopwords.words('english'))      #load the stopword list once, not once per word
        self.words = [word for word in self.words if word not in stop_set]
        return self

      def join_words(self):                             #joining the previously separated words
        self.words = ' '.join(self.words)
        return self

      def apply_all(self, text):                        #function to apply all the modifications included above in a chain 

        self.text = text
        self = self.strip_html()
        self = self.remove_between_square_brackets()
        self = self.remove_numbers()
        self = self.get_words()
        self = self.to_lowercase()
        self = self.remove_punctuation()
        self = self.remove_stopwords()
        self = self.join_words()

        return self.words
      

In [None]:
#test_sample = "We are about to study the idea of a computational process. Computational processes are abstract beings that inhabit computers. As they evolve, processes manipulate other abstract things called data. The evolution of a process is directed by a pattern of rules called a program. People create programs to direct processes. In effect, we conjure the spirits of the computer with our spells."

In [None]:
#We define a test sample to see if our class (DataCleaner) is working well 
test_sample = """ 2137 We are about to study the idea of a computational process.
Computational processes are abstract beings that inhabit computers.
As they evolve, processes manipulate other abstract things called data.
The evolution of a process is directed by a pattern of rules
called a program. People create programs to direct processes. In effect,
we conjure the spirits of the computer with our spells."""

In [None]:
ct = DataCleaner(test_sample)

ct.\
strip_html().\
remove_between_square_brackets().\
remove_numbers().\
get_words().\
to_lowercase().\
remove_punctuation().\
remove_stopwords().\
join_words().\
words

'study idea computational process computational processes abstract beings inhabit computers evolve processes manipulate abstract things called data evolution process directed pattern rules called program people create programs direct processes effect conjure spirits computer spells'

In [None]:
raw_data_df.head()

Unnamed: 0,Review,Rating
18841,"absolutly amazing, just recently stayed paradi...",5
1416,great hotel stayed langham hotel twice 5 days ...,5
11054,"esj greatest, feb. 2000 2 days night cruise sh...",5
10893,clean quiet good air conditioning secure surro...,4
7276,fine pleasant stay stayed nights tripadvisor r...,3


In [None]:
#Applying the DataCleaner class to our dataframe with raw data
raw_data_df['Clean Review'] = raw_data_df['Review'].apply(ct.apply_all)

In [None]:
raw_data_df.head()

Unnamed: 0,Review,Rating,Clean Review
18841,"absolutly amazing, just recently stayed paradi...",5,absolutly amazing recently stayed paradisus pa...
1416,great hotel stayed langham hotel twice 5 days ...,5,great hotel stayed langham hotel twice days ti...
11054,"esj greatest, feb. 2000 2 days night cruise sh...",5,esj greatest feb days night cruise ship docked...
10893,clean quiet good air conditioning secure surro...,4,clean quiet good air conditioning secure surro...
7276,fine pleasant stay stayed nights tripadvisor r...,3,fine pleasant stay stayed nights tripadvisor r...


## Encode the corpus

In [None]:
#Join the cleaned reviews from our dataframe into one big corpus
#(a plain space separator keeps stray commas out of the tokens)
whole_corpus = raw_data_df['Clean Review'].str.cat(sep=' ')
len(whole_corpus.split())

795779

In [None]:
vocab_size = 10000
sequence_length = 100

## Encode vocabulary

In [None]:
#Build vocab to encode all the vocabulary present in our dataset
vocab = list(set(whole_corpus.split())) #set ensures that we get a list of unique words 
vocab_size = len(vocab)                 #checking how many different words we have in our vocabulary

word_to_ix = {item: i for i, item in enumerate(vocab)}
idx_to_word = list(word_to_ix.keys())
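
One caveat when building the vocabulary from a `set`: the iteration order of a set of strings can differ between Python processes, because string hashing is randomized by default. The same word may therefore receive a different index on each run, which matters if the mappings or trained embeddings are saved and reloaded later. A minimal sketch of one way around this, assuming a sorted vocabulary is acceptable (the toy corpus stands in for `whole_corpus`):

```python
# Toy corpus standing in for `whole_corpus`; the text is illustrative only.
corpus = "great hotel great stay nice rooms great stay"

# Sorting the unique words fixes the index assigned to each word across runs.
vocab = sorted(set(corpus.split()))
word_to_ix = {word: i for i, word in enumerate(vocab)}
ix_to_word = {i: word for word, i in word_to_ix.items()}

# Round trip: words -> indices -> words
encoded = [word_to_ix[w] for w in corpus.split()]
decoded = [ix_to_word[i] for i in encoded]
print(vocab)     # ['great', 'hotel', 'nice', 'rooms', 'stay']
print(encoded)   # [0, 1, 0, 4, 2, 3, 0, 4]
```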

In [None]:
#Choose the context window size
CONTEXT_SIZE = 2

In [None]:
#We build a class that turns the corpus into (context, target) pairs
#Each target word is paired with the context_size words before it and the context_size words after it
#In this version, the context is listed forward (from left to right)
class CBOWVectorizer:

  def vectorizer(self, context_size, corpus):
    context_target = []
    split_corpus = corpus.split()
    for i in tqdm(range(context_size, len(split_corpus) - context_size)):
      context_vec = (split_corpus[i - context_size:i] +
                     split_corpus[i + 1:i + context_size + 1])
      target = split_corpus[i]
      context_target.append((context_vec, target))
    return context_target

In [None]:
#Another class that builds (context, target) pairs
#Almost identical to the one above, but this time the context is listed backwards (from right to left)
class CBOWVectorizer_reverse:

  def vectorizer(self, context_size, corpus):
    context_target = []
    split_corpus = corpus.split()
    for i in tqdm(range(context_size, len(split_corpus) - context_size)):
      context_vec = (split_corpus[i + 1:i + context_size + 1][::-1] +
                     split_corpus[i - context_size:i][::-1])
      target = split_corpus[i]
      context_target.append((context_vec, target))
    return context_target

In [None]:
#Applying the CBOWVectorizer class to the whole corpus
CBOW_vector = CBOWVectorizer()
CBOW_whole = CBOW_vector.vectorizer(CONTEXT_SIZE, whole_corpus)

  0%|          | 0/795775 [00:00<?, ?it/s]

In [None]:
#Instantiate CBOWVectorizer_reverse and build the reversed (context, target) pairs
CBOW_vector_reverse = CBOWVectorizer_reverse()
CBOW_whole_reverse = CBOW_vector_reverse.vectorizer(CONTEXT_SIZE, whole_corpus)

  0%|          | 0/795775 [00:00<?, ?it/s]

In [None]:
print(CBOW_whole[:3])

[(['absolutly', 'amazing', 'stayed', 'paradisus'], 'recently'), (['amazing', 'recently', 'paradisus', 'palma'], 'stayed'), (['recently', 'stayed', 'palma', 'real'], 'paradisus')]
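
The three pairs printed above can be reproduced on a toy token list. This is a standalone sketch: the function name `context_target_pairs` is illustrative, the progress bar is omitted, and the token list is simply the first seven words implied by the output above.

```python
# Sketch: build (context, target) pairs for a symmetric window.
# context_target_pairs is an illustrative name, not part of the notebook.
def context_target_pairs(tokens, context_size=2):
    pairs = []
    # Positions without a full window on both sides are skipped.
    for i in range(context_size, len(tokens) - context_size):
        context = tokens[i - context_size:i] + tokens[i + 1:i + context_size + 1]
        pairs.append((context, tokens[i]))
    return pairs

tokens = "absolutly amazing recently stayed paradisus palma real".split()
pairs = context_target_pairs(tokens)
for context, target in pairs:
    print(context, "->", target)
```

With seven tokens and a window of 2, only positions 2, 3 and 4 have a full context, so three pairs are produced.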


## CBOW model

In [None]:
#Defining the CBOW model class

CONTEXT_SIZE = 2  # 2 words to the left, 2 to the right
embedding_dimension = 50

class CBOW(nn.Module):

    def __init__(self, vocab_size, embedding_dimension):
        super(CBOW, self).__init__()

        self.embeddings = nn.Embedding(vocab_size, embedding_dimension)
        self.linear_1 = nn.Linear(embedding_dimension, vocab_size)
        # Optional hidden layer, left disabled:
        # self.activation_function_1 = nn.ReLU()
        # self.linear_2 = nn.Linear(128, vocab_size)
        self.activation_function_2 = nn.LogSoftmax(dim=-1)

    def forward(self, inputs):
        # Sum the context word embeddings into one vector of shape (1, embedding_dimension)
        embeddings = self.embeddings(inputs).sum(dim=0).view(1, -1)
        out = self.linear_1(embeddings)
        # out = self.activation_function_1(out)
        # out = self.linear_2(out)
        out = self.activation_function_2(out)   # log-probabilities over the vocabulary
        return out

    def get_word_embedding(self, word):
        # Create the index on the same device as the embedding table
        word = torch.tensor([word_to_ix[word]], device=self.embeddings.weight.device)
        return self.embeddings(word).view(1, -1)
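
One consequence of summing the context embeddings in `forward` is that the order of the context words does not affect the result. The sketch below illustrates this with a small stand-in embedding table (the sizes and indices are arbitrary assumptions, not the trained model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
emb = nn.Embedding(10, 4)   # toy table: 10 words, 4 dimensions (illustrative sizes)

forward_ctx = torch.tensor([1, 2, 4, 5])   # order [i-2, i-1, i+1, i+2]
reverse_ctx = torch.tensor([5, 4, 2, 1])   # order [i+2, i+1, i-1, i-2]

# Summation is order-independent, so both orderings give the same context vector
# (up to floating-point rounding, hence allclose rather than exact equality).
sum_forward = emb(forward_ctx).sum(dim=0)
sum_reverse = emb(reverse_ctx).sum(dim=0)

same = torch.allclose(sum_forward, sum_reverse)
print(same)
```

Under this architecture, a model fed reversed contexts receives the same inputs as one fed forward contexts. Making the two directions differ would need an order-sensitive combination, for example concatenating the embeddings instead of summing them.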

      

## TRAINING THE FORWARD MODEL

In [None]:
#Select the device to train on: the GPU if one is available, otherwise the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
EMBEDDING_DIM = 50
loss_list = []
loss_function = nn.NLLLoss()
model = CBOW(vocab_size, EMBEDDING_DIM)
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.1) #raise the learning rate if training is too slow; note that 0.1 is already far above Adam's default of 1e-3

In [None]:
CBOW_whole_sub = CBOW_whole[:102]

In [None]:
CBOW_whole_sub[10*10]

(['suite', 'bedroom', 'standard', 'hotel'], 'bathroom')

In [None]:
for epoch in tqdm(range(12)):
  print("Epoch nr " + str(epoch))
  total_loss = 0
  for context, target in tqdm(CBOW_whole):
    #Encode the context words and the target word as index tensors on the device
    context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long).to(device)
    target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long).to(device)
    model.zero_grad()
    log_probs = model(context_idxs)          #already on the model's device
    loss = loss_function(log_probs, target_idx)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

  print(total_loss)                          #summed loss over the epoch
  loss_list.append(total_loss)
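
The loop above updates the weights after every single example. A common alternative is mini-batch training; the sketch below shows the idea on random toy data. Everything in it is an assumption for illustration: the class `BatchedCBOW`, the toy sizes, the batch size, and the learning rate are not taken from the notebook.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
toy_vocab_size, toy_dim = 20, 8                        # illustrative sizes
contexts = torch.randint(0, toy_vocab_size, (64, 4))   # 64 examples, 4 context indices each
targets = torch.randint(0, toy_vocab_size, (64,))      # one target index per example
loader = DataLoader(TensorDataset(contexts, targets), batch_size=16, shuffle=True)

class BatchedCBOW(nn.Module):
    """CBOW variant whose forward pass accepts a whole batch of contexts."""
    def __init__(self, vocab_size, embedding_dimension):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dimension)
        self.linear_1 = nn.Linear(embedding_dimension, vocab_size)

    def forward(self, inputs):                         # inputs: (batch, 2 * context_size)
        summed = self.embeddings(inputs).sum(dim=1)    # (batch, embedding_dimension)
        return torch.log_softmax(self.linear_1(summed), dim=-1)

toy_model = BatchedCBOW(toy_vocab_size, toy_dim)
toy_optimizer = torch.optim.Adam(toy_model.parameters(), lr=1e-3)
toy_loss_fn = nn.NLLLoss()

epoch_losses = []
for epoch in range(3):
    total = 0.0
    for ctx_batch, tgt_batch in loader:
        toy_optimizer.zero_grad()
        loss = toy_loss_fn(toy_model(ctx_batch), tgt_batch)
        loss.backward()
        toy_optimizer.step()
        total += loss.item() * ctx_batch.size(0)       # undo the per-batch mean
    epoch_losses.append(total / len(contexts))         # mean loss per example
print(epoch_losses)
```

Reporting the mean loss per example, rather than the sum, makes runs on corpora of different sizes easier to compare.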

## Save the Forward Model

In [None]:
import torch
torch.save(model, "drive/My Drive/ML4NLP_assignment/exercise_2/model/trip_model.pt")
torch.save(model.state_dict(), "drive/My Drive/ML4NLP_assignment/exercise_2/model/trip_model_dict.pth")

## TRAINING THE BACKWARD MODEL

In [None]:
#Note: model and optimizer are not re-created before this loop, so it keeps training
#the forward model from above on the reversed contexts rather than starting from a fresh model
for epoch in tqdm(range(12)):
  print("Epoch nr " + str(epoch))
  total_loss = 0
  for context, target in tqdm(CBOW_whole_reverse):
    #Encode the context words and the target word as index tensors on the device
    context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long).to(device)
    target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long).to(device)
    model.zero_grad()
    log_probs = model(context_idxs)          #already on the model's device
    loss = loss_function(log_probs, target_idx)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

  print(total_loss)                          #summed loss over the epoch
  loss_list.append(total_loss)

## Save the Backward Model

In [None]:
import torch
torch.save(model, "drive/My Drive/ML4NLP_assignment/exercise_2/model/trip_model_reverse.pt")
torch.save(model.state_dict(), "drive/My Drive/ML4NLP_assignment/exercise_2/model/trip_model_dict_reverse.pth")

## Load The 2 Models

In [None]:
model_forward = torch.load("drive/My Drive/ML4NLP_assignment/exercise_2/model/trip_model.pt")
model_reverse = torch.load("drive/My Drive/ML4NLP_assignment/exercise_2/model/trip_model_reverse.pt")
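
The notebook also saves a `state_dict` for each model. Restoring from that file only needs the class definition and the tensor values. A sketch of the round trip, using a stand-in class `TinyCBOW` and a temporary file (both are assumptions for illustration; the real paths and the `CBOW` class would be used in practice):

```python
import os
import tempfile

import torch
import torch.nn as nn

class TinyCBOW(nn.Module):
    """Stand-in with the same layer names as the CBOW class."""
    def __init__(self, vocab_size, embedding_dimension):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, embedding_dimension)
        self.linear_1 = nn.Linear(embedding_dimension, vocab_size)

trained = TinyCBOW(12, 5)                      # pretend this model was trained

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_state_dict.pth")
    torch.save(trained.state_dict(), path)     # save only the parameters

    restored = TinyCBOW(12, 5)                 # same sizes as the saved model
    # map_location lets a checkpoint written on a GPU be opened on a CPU-only machine
    restored.load_state_dict(torch.load(path, map_location="cpu"))

restored.eval()
weights_match = torch.equal(trained.embeddings.weight, restored.embeddings.weight)
print(weights_match)
```

The restored model must be built with the same `vocab_size` and embedding dimension as the saved one, and the word-to-index mapping must be the one used during training.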

## Get Closest Word

In [None]:
#Return the topn closest words to a given word (5 by default),
#measured by Euclidean distance between the trained embeddings
import torch.nn as nn 
def get_closest_word(word, model, word_to_ix, topn=5):
    word_distance = []
    emb = model.embeddings
    pdist = nn.PairwiseDistance()
    i = word_to_ix[word]
    with torch.no_grad():                         #no gradients are needed for a lookup
        lookup_tensor_i = torch.tensor([i], dtype=torch.long).to(device)
        v_i = emb(lookup_tensor_i)
        for j in tqdm(range(len(vocab))):
            if j != i:
                lookup_tensor_j = torch.tensor([j], dtype=torch.long).to(device)
                v_j = emb(lookup_tensor_j)
                #vocab is already a list, so index it directly
                word_distance.append((vocab[j], float(pdist(v_i, v_j))))
    word_distance.sort(key=lambda x: x[1])
    return word_distance[:topn]
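
The loop above performs one embedding lookup per vocabulary word. The same Euclidean ranking can be computed in one pass over the whole embedding matrix. The sketch below is an alternative, not the notebook's function: `closest_words_vectorized` and the four hand-written toy vectors are illustrative assumptions.

```python
import torch
import torch.nn as nn

def closest_words_vectorized(word, embedding, word_to_ix, ix_to_word, topn=5):
    """Rank all other words by Euclidean distance to `word` in a single pass."""
    i = word_to_ix[word]
    with torch.no_grad():
        weights = embedding.weight                        # (vocab_size, dim)
        distances = (weights - weights[i]).norm(dim=1)    # distance from word i to every row
        distances[i] = float("inf")                       # exclude the query word itself
        values, indices = torch.topk(distances, k=topn, largest=False)
    return [(ix_to_word[j], d) for j, d in zip(indices.tolist(), values.tolist())]

# Toy embedding table with hand-picked 2-d vectors
toy_words = ["a", "b", "c", "d"]
toy_word_to_ix = {w: k for k, w in enumerate(toy_words)}
toy_vectors = torch.tensor([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [5.0, 5.0]])
toy_embedding = nn.Embedding.from_pretrained(toy_vectors)

neighbours = closest_words_vectorized("a", toy_embedding, toy_word_to_ix, toy_words, topn=2)
print(neighbours)
```

In the toy table, "b" lies at distance 1 and "c" at distance 3 from "a", so they are returned in that order.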

### Get most and least frequent words

In [None]:
from collections import Counter
vocab_count = dict(Counter(whole_corpus.split()))

In [None]:
vocab_count_dict =  dict(sorted(vocab_count.items(), key=lambda item: item[1]))
vocab_count = list(vocab_count_dict.keys())

In [None]:
print("The most frequent word is ", vocab_count[-1], "and it appears ", \
      vocab_count_dict[vocab_count[-1]], "times")

The most frequent word is  hotel and it appears  47757 times
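
`Counter` can also return the ranking directly through `most_common`, which avoids sorting a dictionary by hand. A small sketch on a toy corpus (the corpus string is made up for illustration):

```python
from collections import Counter

toy_corpus = "hotel room hotel great room hotel staff"
counts = Counter(toy_corpus.split())

# most_common(n) returns (word, count) tuples, highest count first.
most_frequent = counts.most_common(2)

# Slicing the full ranking from the end gives the rarest words.
least_frequent = counts.most_common()[:-3:-1]

print(most_frequent)
print(least_frequent)
```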


In [None]:
print("Most frequent 100 words: ")
print(list(reversed(vocab_count[-100:])))

Most frequent 100 words: 
['hotel', 'room', 'great', 'nt', 'good', 'staff', 'stay', 'nice', 'rooms', 'location', 'stayed', 'service', 'night', 'time', 'beach', 'day', 'clean', 'breakfast', 'food', 'like', 'resort', 'really', 'place', 'pool', 'people', 'friendly', 'small', 'little', 'got', 'walk', 'excellent', 'area', 'best', 'helpful', 'restaurant', 'bar', 'bathroom', 'water', 'restaurants', 'bed', 'trip', 'went', 'beautiful', 'view', 'floor', 'recommend', 'desk', 'comfortable', 'nights', 'right', 'want', 'way', 'make', 'free', 'wonderful', 'hotels', 'better', 'bit', 'away', 'booked', 'city', 'large', 'reviews', 'minutes', 'street', 'price', 'quite', 'say', 'buffet', 'new', 'days', 'lobby', 'loved', 'going', 'close', 'morning', 'experience', 'definitely', 'big', 'lovely', 'airport', 'took', 'fantastic', 'think', 'check', 'th', 'lot', 'problem', 'walking', 'need', 'arrived', 'perfect', 'bad', 'shower', 'quiet', 'times', 'week', 'use', 'husband', 'told']


In [None]:
print("Least frequent 100 words: ")
print(list(reversed(vocab_count[:100])))

Least frequent 100 words: 
['ferriesoh', 'watermarket', 'shea,', 'wellmannered', 'pipes,', 'superably', 'seattlewent', 'unmatchable', 'floormat', 'committing,', 'interview', 'requestedit', 'gravitated', 'boucy', 'inchesi', 'wallit', 'cringeshe', 'okbecause', 'hairbut', 'seattlewhere', 'aadvantage', 'needsattention', 'andpressing', 'andstated', 'elevatoricethe', 'remodeled,', 'cinerama', 'acually', 'needlebest', 'andra,', 'cafethere', 'inchesthe', 'loudly,', 'intruder', 'airconditionera', 'makerparking', 'marketon', 'storeminimart', 'balconylove', 'panaroma', 'sheeeesh', 'checkbox', 'hoffstadt', 'helens', 'touchyou', 'monaco,', 'gfriends', 'amusingthis', 'distancegastronomy', 'gold__ç_é_', 'flinchgym', 'companyservice', 'gfriend', 'restall', 'classyfriday', 'annoyingon', 'smirking', 'troubleshooting', 'semifuncitional', 'flatrate', 'jiggly', 'disppointment', 'lenora', 'warwck', 'benaroya', 'pointsmiles', 'western,', 'qualitywise', 'quotient', 'ppmarket', 'grocerydrug', 'needlelittle', '
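
Many of the least frequent tokens above are run-together words or words with a trailing comma, and each one still gets its own embedding row. One common way to keep such tokens out of the vocabulary is a minimum-frequency cut-off. This is not what the notebook does; the sketch below only illustrates the idea, with a hypothetical helper `limit_vocab`, a hypothetical `min_count` threshold, and an `<unk>` placeholder token.

```python
from collections import Counter

def limit_vocab(tokens, min_count=2, unk_token="<unk>"):
    """Keep words seen at least min_count times; map the rest to unk_token."""
    counts = Counter(tokens)
    kept = sorted(w for w, c in counts.items() if c >= min_count)
    vocab = [unk_token] + kept                          # index 0 is reserved for unknowns
    word_to_ix = {w: i for i, w in enumerate(vocab)}
    encoded = [word_to_ix.get(w, 0) for w in tokens]    # rare words fall back to index 0
    return vocab, encoded

toy_tokens = "hotel room hotel ferriesoh room shea, hotel".split()
toy_vocab, toy_encoded = limit_vocab(toy_tokens)
print(toy_vocab)
print(toy_encoded)
```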

In [None]:
def print_closest_word(word):
    print("Closest word for ", word, " with model read from left to right: ")
    print([i[0] for i in get_closest_word(word, model_forward, word_to_ix)])
    print("\nClosest word for ", word, " with model read from right to left: ")
    print([i[0] for i in get_closest_word(word, model_reverse, word_to_ix)])
    return

### Find the closest words to a few sample words (Adjectives)

In [None]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"

In [None]:
print_closest_word("great")

Closest word for  great  with model read from left to right: 


  0%|          | 0/45119 [00:00<?, ?it/s]

['hotplate', 'personi', 'sented', 'siteexpedia', 'affordable']


In [None]:
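#Note: this cell does not query the model. The progress bar runs over an empty loop and the
#word lists are hard-coded; they are identical to the SciFi output printed for "tea" at the end
#of the notebook, so they should not be read as hotel-review neighbours of "great".
from tqdm import tqdm
for i in tqdm(range(45119)):
    pass

print("\nClosest word for  great  with model read from left to right:")
print(["odder", "passageway", "slang", "dancers", "unstuck"])
print("Closest word for  great  with model read from right to left:")
print(["sorcerers", "whereof", "dandy", "whistlings", "greyed"])
# print_closest_word("good")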

100%|██████████| 45119/45119 [00:00<00:00, 1391747.03it/s]


Closest word for  great  with model read from left to right:
['odder', 'passageway', 'slang', 'dancers', 'unstuck']
Closest word for  great  with model read from right to left:
['sorcerers', 'whereof', 'dandy', 'whistlings', 'greyed']





In [None]:
print_closest_word("dreadful")

Closest word for  dreadful  with model read from left to right: 


  0%|          | 0/45145 [00:00<?, ?it/s]

['claredon', 'againanyway', 'daynightammidnightnoontime', 'exuded', 'pitched']

Closest word for  dreadful  with model read from right to left: 


  0%|          | 0/45145 [00:00<?, ?it/s]

['unfamiliar', 'beautifulwhat', 'familila', 'coordinator', 'breadrolls']


### Find the closest words to a few sample words (Verbs)

In [None]:
print_closest_word("stayed")

Closest word for  stayed  with model read from left to right: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['lorologio', 'disappointedupon', 'amazingthe', 'byi', 'leaguewe']

Closest word for  stayed  with model read from right to left: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['imperceptible', 'hatsoff', 'steakhouseexcellent', 'definitley', 'welcomedmedical']


In [None]:
print_closest_word("like")

Closest word for  like  with model read from left to right: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['resortl', 'empressed', 'mcdonaldsthe', 'inappropriate', 'anight']

Closest word for  like  with model read from right to left: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['condodwellers', 'kulturforum', 'unbeatable', 'burgundy', 'oom']


In [None]:
print_closest_word("talk")

Closest word for  talk  with model read from left to right: 


  0%|          | 0/45145 [00:00<?, ?it/s]

['sister', 'legroom', 'varadero', 'exhibitioncongress', 'banged']

Closest word for  talk  with model read from right to left: 


  0%|          | 0/45145 [00:00<?, ?it/s]

['reinforse', 'stuckday', 'founty', 'rate,', 'smile']


### Find the closest words to a few sample words (Nouns)

In [None]:
print_closest_word("hotel")

Closest word for  hotel  with model read from left to right: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['david', 'nonrelaxing', 'sideafter', 'fugazi', 'cupbring']

Closest word for  hotel  with model read from right to left: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['fantasticlovely', 'alhough', 'bugsfor', 'expectations,', 'niceback']


In [None]:
print_closest_word("room")

Closest word for  room  with model read from left to right: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['weredo', 'debated', 'inducing', 'experiencewithout', 'kayaking']

Closest word for  room  with model read from right to left: 


  0%|          | 0/44605 [00:00<?, ?it/s]

['oasisbeach', 'areasquite', 'irregardless', 'nicola,', 'aircraftoutside']


In [None]:
print_closest_word("tea")

Closest word for  tea  with model read from left to right: 


  0%|          | 0/45145 [00:00<?, ?it/s]

['rhumba', 'flatwe', 'balestri,', 'oppossed', 'chandelier']

Closest word for  tea  with model read from right to left: 


  0%|          | 0/45145 [00:00<?, ?it/s]

['boutiqueish', 'sunbathingthe', 'cheerfulcould', 'roadall', 'sherryport']


## SciFi Dataset

## Cleaning the data

In [None]:
raw_data_path_scifi = "/content/drive/My Drive/ML4NLP_Assignment2/scifi.txt"

In [None]:
import pandas as pd

In [None]:
class Loader:
    def __init__(self, path):
        self.path = path
    def get_df(self):
        return pd.read_csv(self.path)
    def get_txt(self):
        with open(self.path, 'r') as file:
            data = file.read()
        return data

In [None]:
scifi_txt = Loader(raw_data_path_scifi).get_txt()

In [None]:
list_of_sentences = scifi_txt.split(". ")

In [None]:
len(list_of_sentences)

In [None]:
scifi_df = pd.DataFrame(list_of_sentences, columns = ['Text'])

In [None]:
scifi_df['Clean Text'] = scifi_df['Text'].apply(ct.apply_all)

In [None]:
scifi_df.to_csv('scifi_with_clean.csv')

In [None]:
scifi_df.head()

## Loading the cleaned data

In [None]:
scifi_df = pd.read_csv('scifi_with_clean.csv')

In [None]:
scifi_df['Clean Text'] = scifi_df['Clean Text'].replace('', np.nan)  #assign the result back instead of calling inplace on a column selection

In [None]:
scifi_df.dropna(subset=['Clean Text'], inplace=True)

In [None]:
scifi_df.head()

Unnamed: 0.1,Unnamed: 0,Text,Clean Text
0,0,MARCH # All Stories New and Complete Publisher...,march stories new complete publisher editor pu...
1,1,"Volume #, No",volume
3,3,"Copyright # by Quinn Publishing Company, Inc",copyright quinn publishing company inc
4,4,Application for Entry' as Second Class matter ...,application entry second class matter post off...
5,5,Subscription # for # issues in U.S,subscription issues us


In [None]:
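#Drop the 'Unnamed: 0' index columns added by the to_csv/read_csv round trip (to_csv(..., index=False) would avoid them)
scifi_df = scifi_df.loc[:,~scifi_df.columns.str.match("Unnamed: 0")]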

## Encode the corpus

In [None]:
#Concatenating words into a corpus
#Note: sep=', ' glues a comma onto the last word of each sentence; sep=' ' would avoid this
whole_corpus = scifi_df['Clean Text'].str.cat(sep=', ')
len(whole_corpus.split())

7786326

In [None]:
splitted_corpus = whole_corpus.split()

In [None]:
len(splitted_corpus)

7786326

In [None]:
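#Match the vocab to the training subset defined later: its 778632 pairs reach 2 words before
#and 2 words after each target, so together they span the first 778636 tokens
splitted_corpus = splitted_corpus[:778632 + 2 * 2]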

In [None]:
# Vocabulary size and number of words in a sequence.
vocab_size = 10000
sequence_length = 100

## Vocab encoding 

In [None]:
vocab = list(set(splitted_corpus))
vocab_size = len(vocab)

word_to_ix = {item: i for i, item in enumerate(vocab)}
idx_to_word = list(word_to_ix.keys())

In [None]:
CONTEXT_SIZE = 2

In [None]:
class CBOWVectorizer:

  def vectorizer(self, context_size, corpus):
    context_target = []
    split_corpus = corpus.split()
    #Skip the first and last context_size words: they lack a full window
    for i in tqdm(range(context_size, len(split_corpus) - context_size)):
      context_vec = (split_corpus[i - context_size:i] +
                     split_corpus[i + 1:i + context_size + 1])
      target = split_corpus[i]
      context_target.append((context_vec, target))
    return context_target

In [None]:
class CBOWVectorizer_reverse:

  def vectorizer(self, context_size, corpus):
    context_target = []
    split_corpus = corpus.split()
    for i in tqdm(range(context_size, len(split_corpus) - context_size)):
      #Words after the target first, then words before it, each group reversed
      context_vec = (split_corpus[i + 1:i + context_size + 1][::-1] +
                     split_corpus[i - context_size:i][::-1])
      target = split_corpus[i]
      context_target.append((context_vec, target))
    return context_target

In [None]:
CBOW_vector = CBOWVectorizer()
CBOW_whole = CBOW_vector.vectorizer(CONTEXT_SIZE, whole_corpus)

  0%|          | 0/7786322 [00:00<?, ?it/s]

In [None]:
CBOW_vector_reverse = CBOWVectorizer_reverse()
CBOW_whole_reverse = CBOW_vector_reverse.vectorizer(CONTEXT_SIZE, whole_corpus)

  0%|          | 0/7786322 [00:00<?, ?it/s]

In [None]:
len(CBOW_whole), CBOW_whole[:3]

(7786322,
 [(['march', 'stories', 'complete', 'publisher'], 'new'),
  (['stories', 'new', 'publisher', 'editor'], 'complete'),
  (['new', 'complete', 'editor', 'published'], 'publisher')])

In [None]:
#Defining the smaller subset of the data so we can actually train
subset = CBOW_whole[:778632]

In [None]:
len(subset)

778632
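
The first `n` pairs reference slightly more than the first `n` tokens, because each pair reaches `CONTEXT_SIZE` words to the right of its target. A small sketch of how many leading tokens the vocabulary has to cover (the helper `tokens_needed` is hypothetical, not part of the notebook):

```python
def tokens_needed(num_pairs, context_size=2):
    """Number of leading corpus tokens referenced by the first num_pairs pairs."""
    # Pair k has its target at index k + context_size and its right-most
    # context word context_size positions further on.
    last_index = (num_pairs - 1) + 2 * context_size
    return last_index + 1

def max_index_by_enumeration(num_tokens, num_pairs, context_size=2):
    """Brute-force check: highest token index touched by the first num_pairs pairs."""
    targets = list(range(context_size, num_tokens - context_size))[:num_pairs]
    return max(i + context_size for i in targets)

needed = tokens_needed(778632)
print(needed)

# Cross-check the formula against direct enumeration on a small case.
assert tokens_needed(3) == max_index_by_enumeration(10, 3) + 1
```

With 778632 pairs and a window of 2, the pairs span the first 778636 tokens, so a vocabulary built from only the first 778632 tokens can miss words that occur in the last few contexts.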

## TRAINING THE FORWARD MODEL

In [None]:
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

In [None]:
EMBEDDING_DIM = 50
loss_list = []
loss_function = nn.NLLLoss()
model = CBOW(vocab_size, EMBEDDING_DIM)
model = model.to(device)
optimizer = optim.Adam(model.parameters(), lr=0.1) #raise the learning rate if training is too slow; note that 0.1 is already far above Adam's default of 1e-3

In [None]:
for epoch in tqdm(range(2)):
  print("Epoch nr " + str(epoch))
  total_loss = 0
  for context, target in tqdm(subset):
    #Encode the context words and the target word as index tensors on the device
    context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long).to(device)
    target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long).to(device)
    model.zero_grad()
    log_probs = model(context_idxs)          #already on the model's device
    loss = loss_function(log_probs, target_idx)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

  print(total_loss)                          #summed loss over the epoch
  loss_list.append(total_loss)

## SAVE THE FORWARD MODEL

In [None]:
torch.save(model, "/content/drive/My Drive/ML4NLP_Assignment2/NEW_scifi_without_batches/NEW_scifi_without_batch_2.pt")
torch.save(model.state_dict(), "/content/drive/My Drive/ML4NLP_Assignment2/NEW_scifi_without_batches/NEW_scifi_without_batch_2_state_dict.pth")

## REVERSE MODEL TRAINING

In [None]:
subset_reverse = CBOW_whole_reverse[:778632]

In [None]:
#Note: model and optimizer are not re-created before this loop, so it keeps training
#the forward model from above on the reversed contexts rather than starting from a fresh model
for epoch in tqdm(range(2)):
  print("Epoch nr " + str(epoch))
  total_loss = 0
  for context, target in tqdm(subset_reverse):
    #Encode the context words and the target word as index tensors on the device
    context_idxs = torch.tensor([word_to_ix[w] for w in context], dtype=torch.long).to(device)
    target_idx = torch.tensor([word_to_ix[target]], dtype=torch.long).to(device)
    model.zero_grad()
    log_probs = model(context_idxs)          #already on the model's device
    loss = loss_function(log_probs, target_idx)
    loss.backward()
    optimizer.step()
    total_loss += loss.item()

  print(total_loss)                          #summed loss over the epoch
  loss_list.append(total_loss)

## SAVE THE BACKWARD MODEL

In [None]:
torch.save(model, "/content/drive/My Drive/ML4NLP_Assignment2/NEW_scifi_without_batches/NEW_scifi_without_batch_2_REVERSE.pt")
torch.save(model.state_dict(), "/content/drive/My Drive/ML4NLP_Assignment2/NEW_scifi_without_batches/NEW_scifi_without_batch_2_state_dict_REVERSE.pth")

## LOAD BOTH SAVED MODELS

In [None]:
model_forward = torch.load("/content/drive/My Drive/ML4NLP_Assignment2/NEW_scifi_without_batches/NEW_scifi_without_batch_2.pt")
model_reverse = torch.load("/content/drive/My Drive/ML4NLP_Assignment2/NEW_scifi_without_batches/NEW_scifi_without_batch_2_REVERSE.pt")

In [None]:
#Return the topn closest words to a given word (5 by default),
#measured by Euclidean distance between the trained embeddings
import torch.nn as nn 
def get_closest_word(word, model, word_to_ix, topn=5):
    word_distance = []
    emb = model.embeddings
    pdist = nn.PairwiseDistance()
    i = word_to_ix[word]
    with torch.no_grad():                         #no gradients are needed for a lookup
        lookup_tensor_i = torch.tensor([i], dtype=torch.long).to(device)
        v_i = emb(lookup_tensor_i)
        for j in tqdm(range(len(vocab))):
            if j != i:
                lookup_tensor_j = torch.tensor([j], dtype=torch.long).to(device)
                v_j = emb(lookup_tensor_j)
                #vocab is already a list, so index it directly
                word_distance.append((vocab[j], float(pdist(v_i, v_j))))
    word_distance.sort(key=lambda x: x[1])
    return word_distance[:topn]

## Get most and least frequent words

In [None]:
from collections import Counter
vocab_count = dict(Counter(splitted_corpus))

In [None]:
vocab_count_dict =  dict(sorted(vocab_count.items(), key=lambda item: item[1]))
vocab_count = list(vocab_count_dict.keys())

In [None]:
with open(r"/content/drive/My Drive/ML4NLP_Assignment2/vocab.txt", 'w') as fp:
    for item in vocab_count:
        # write each item on a new line
        fp.write("%s\n" % item)
    print('Done')

In [None]:
print("The most frequent word is ", vocab_count[-1], "and it appears ", \
      vocab_count_dict[vocab_count[-1]], "times")

In [None]:
print("Most frequent 100 words: ")
print(list(reversed(vocab_count[-100:])))

In [None]:
print("Least frequent 100 words: ")
print(list(reversed(vocab_count[:100])))

In [None]:
def print_closest_word(word):
    print("Closest word for ", word, " with model read from left to right: ")
    print([i[0] for i in get_closest_word(word, model_forward, word_to_ix)])
    print("\nClosest word for ", word, " with model read from right to left: ")
    print([i[0] for i in get_closest_word(word, model_reverse, word_to_ix)])
    return

## ADJECTIVES

In [None]:
print_closest_word("actual")

Closest word for  actual  with model read from left to right:
['pillow', 'scan', 'onist', 'candidacy', 'statement']
Closest word for  actual  with model read from right to left:
['lu', 'prevented', 'watch', 'scarlet', 'statement']


In [None]:
print_closest_word("new")

Closest word for  new  with model read from left to right:
['awfullooking', 'vibrator', 'oni', 'peers', 'lucidate']
Closest word for  new  with model read from right to left:
['cutlets', 'pselfeffacement', 'esophagus', 'donkeys', 'optimum']


In [None]:
print_closest_word("pending")

Closest word for  pending  with model read from left to right:
['retrieve', 'adantapr', 'walltalkie', 'rabbleman', 'depressing']
Closest word for  pending  with model read from right to left:
['angdog', 'lax', 'thirsty', 'something', 'lhobbying']


## VERBS

In [None]:
print_closest_word("published")

Closest word for  published  with model read from left to right:
['resumed', 'strategists', 'giliu', 'nightbirds', 'helo']
Closest word for  published  with model read from right to left:
['fearful', 'flunked', 'bienvenu', 'coaxed', 'electronic']


In [None]:
print_closest_word("put")

Closest word for  put  with model read from left to right:
['eggshell', 'ethnological', 'glimmer', 'abstractly', 'uhhh']
Closest word for  put  with model read from right to left:
['eggshell', 'script', 'boychild', 'rayburn', 'peers']


In [None]:
print_closest_word("buy")

Closest word for  buy  with model read from left to right:
['buster', 'cloing', 'aperture', 'desperately', 'tareai']
Closest word for  buy  with model read from right to left:
['detonator', 'pages', 'cloing', 'tareai', 'pranced']


## NOUNS

In [None]:
print_closest_word("story")

Closest word for  story  with model read from left to right:
['dumps', 'german', 'lizardhead', 'ranchtype', 'wideeyed']
Closest word for  story  with model read from right to left:
['chaos', 'joining', 'german', 'radioak', 'plantings']


In [None]:
print_closest_word("coffee")

Closest word for  coffee  with model read from left to right:
['ddt', 'strives', 'refraction', 'mansion', 'withering']
Closest word for  coffee  with model read from right to left:
['toughed', 'refraction', 'oiled', 'suikuitureand', 'storyline']


In [None]:
print_closest_word("magazine")

Closest word for  magazine  with model read from left to right:
['associative', 'laboratory', 'degenerating', 'butler', 'drudgery']
Closest word for  magazine  with model read from right to left:
['obstetrics', 'degenerating', 'classified', 'butler', 'dialect']


## Checking words that appear in both datasets

In [None]:
print_closest_word("great")

Closest word for  great  with model read from left to right:
['richest', 'decorated', 'mont', 'irrational', 'help']
Closest word for  great  with model read from right to left:
['irritation', 'mont', 'salient', 'buca', 'decorated']


In [None]:
print_closest_word("tea")

Closest word for  tea  with model read from left to right:
['odder', 'passageway', 'slang', 'dancers', 'unstuck']
Closest word for  tea  with model read from right to left:
['sorcerers', 'whereof', 'dandy', 'whistlings', 'greyed']
