<a href="https://colab.research.google.com/github/tannisthamaiti/AIWeekend-Project/blob/main/NLP_WordEmbeddings_CNN/NLP_WordEmbeddings_CNN_Question_P2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In this assignment, you will practice how to compute word embeddings and use them for sentiment analysis.

To implement sentiment analysis, you can go beyond counting the number of positive words and negative words.
You can find a way to represent each word numerically, by a vector.
The vector could then represent syntactic (i.e. parts of speech) and semantic (i.e. meaning) structures.
In this assignment, you will explore a classic way of generating word embeddings or representations.

You will implement a famous model called the continuous bag of words (CBOW) model.
By completing this assignment you will:

*   Train word vectors from scratch.
*   Learn how to create batches of data.
*   Understand how backpropagation works.
*   Plot and visualize your learned word vectors.

Let's take a look at the following sentence:
>**'I am happy because I am learning AI'**.

- In continuous bag of words (CBOW) modeling, we try to predict the center word given a few context words (the words around the center word).
- For example, if you choose a context half-size of, say, $C = 2$, then you try to predict the word **happy** given a context made up of the 2 words before and the 2 words after the center word:

> $C$ words before: [I, am]

> $C$ words after: [because, I]

- In other words:

$$\text{context} = [\text{I}, \text{am}, \text{because}, \text{I}]$$
$$\text{target} = \text{happy}$$
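The windowing described above can be sketched in a few lines of Python. This is a minimal illustration, not the helper used later in the assignment: the function name `context_windows` is hypothetical, and the sentence is split on whitespace for simplicity.

```python
def context_windows(words, C):
    """Yield (context_words, center_word) pairs for a context half-size C."""
    # Only positions with C words on each side have a full context window.
    for i in range(C, len(words) - C):
        center = words[i]
        context = words[i - C:i] + words[i + 1:i + C + 1]
        yield context, center


sentence = "I am happy because I am learning AI".split()
pairs = list(context_windows(sentence, C=2))
for context, center in pairs:
    print(context, "->", center)
```

The first pair produced is the one from the example: the context `['I', 'am', 'because', 'I']` with the center word `happy`. Words closer than $C$ positions to either end of the sentence are skipped as center words in this sketch.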

Once you have encoded all the context words as one-hot vectors, you can use their average $\bar x$ as the input to your model.
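As a rough sketch of this encoding step, the snippet below builds one-hot vectors for the context words and averages them. The five-word toy vocabulary (lowercased, sorted alphabetically) and the names `word2idx` and `one_hot` are assumptions made for illustration only.

```python
import numpy as np

# Toy vocabulary (an assumption for this sketch): five lowercased words.
vocab = sorted(set("i am happy because i am learning".split()))
word2idx = {word: idx for idx, word in enumerate(vocab)}


def one_hot(word, word2idx):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = np.zeros(len(word2idx))
    vector[word2idx[word]] = 1.0
    return vector


context = ["i", "am", "because", "i"]
# Average the one-hot vectors of the context words.
x_bar = np.mean([one_hot(w, word2idx) for w in context], axis=0)
print(vocab)
print(x_bar)
```

Because "i" appears twice among the four context words, its entry in the averaged vector is 0.5, while "am" and "because" each get 0.25 and the remaining entries stay 0.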

The architecture you will be implementing is as follows:

\begin{align}
 h &= W_1 \  X + b_1  \tag{1} \\
 a &= ReLU(h)  \tag{2} \\
 z &= W_2 \  a + b_2   \tag{3} \\
 \hat y &= softmax(z)   \tag{4} \\
\end{align}

## Forward propagation

Let's dive into the neural network itself, which is shown below with all the dimensions and formulas you'll need.

![CBOW Model](https://github.com/tannisthamaiti/AIWeekend-Project/blob/main/images/cbow_model_dimensions_single_input.png?raw=true)

Set $N$ equal to 3. Remember that $N$ is a hyperparameter of the CBOW model that represents the size of the word embedding vectors, as well as the size of the hidden layer.

Also set $V$ equal to 5, which is the size of the vocabulary we have used so far.
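With $N = 3$ and $V = 5$, equations (1) to (4) can be traced for a single input as follows. This is only a sketch: the weights are random placeholders, the input vector is an assumed average of one-hot context vectors, and the helper names `relu` and `softmax` are written here for illustration rather than taken from `utils2`.

```python
import numpy as np

N, V = 3, 5                      # embedding size and vocabulary size
rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

# Randomly initialised parameters (placeholders, not trained values).
W1 = rng.random((N, V))
b1 = rng.random((N, 1))
W2 = rng.random((V, N))
b2 = rng.random((V, 1))


def relu(h):
    """Element-wise max(0, h)."""
    return np.maximum(0, h)


def softmax(z):
    """Column-wise softmax, shifted by the max for numerical stability."""
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)


# Assumed input: an averaged one-hot context vector as a V x 1 column.
x = np.array([[0.25, 0.25, 0.0, 0.5, 0.0]]).T

h = W1 @ x + b1        # equation (1), shape N x 1
a = relu(h)            # equation (2), shape N x 1
z = W2 @ a + b2        # equation (3), shape V x 1
y_hat = softmax(z)     # equation (4), shape V x 1

print(h.shape, a.shape, z.shape, y_hat.shape)
print(y_hat.sum())
```

The output `y_hat` has one entry per vocabulary word and its entries sum to 1, so it can be read as a predicted probability distribution over the center word.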

In [None]:
# Import Python libraries and helper functions (in utils2)
import nltk
from nltk.tokenize import word_tokenize
import numpy as np
from collections import Counter
from utils2 import sigmoid, get_batches, compute_pca, get_dict

In [None]:
# Look for NLTK data (e.g. the tokenizer models) in the current directory
nltk.data.path.append('.')

# Exercise 1(a)

In [None]:
# Load, tokenize and process the data
import re                                                           #  Load the regex module
with open('shakespeare.txt') as f:                                  # file location https://github.com/tannisthamaiti/AIWeekend-Project/blob/main/NLP_WordEmbeddings_CNN/shakespeare.txt
    data = f.read()                                                 #  Read in the data
data = re.sub(r'[,!?;-]', '.', data)                                #  Punctuation marks are replaced by .
data = nltk.word_tokenize(data)                                     #  Tokenize string to words
data = #Your code here            #  Lowercase and drop non-alphabetical tokens (use isalpha and a list comprehension)
print("Number of tokens:", len(data),'\n', data[:15])              #  print data sample

Expected Output

Number of tokens: 17395

 ['the', 'sonnets', 'by', 'william', 'shakespeare', 'from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beauty', 'rose']

In [221]:
data = [word.lower() for word in data if word.isalpha()]
print("Number of tokens:", len(data),'\n', data[:15])

Number of tokens: 17395 
 ['the', 'sonnets', 'by', 'william', 'shakespeare', 'from', 'fairest', 'creatures', 'we', 'desire', 'increase', 'that', 'thereby', 'beauty', 'rose']
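To see what each cleaning step does, the same pipeline can be run on a short string. This sketch uses a simple regular expression as a stand-in for `nltk.word_tokenize` so that it runs without any NLTK data; the sample text and the name `cleaned` are made up for illustration.

```python
import re

text = "From fairest creatures, we desire increase; 2 roses never die!"

# Step 1: replace the listed punctuation marks with a full stop.
text = re.sub(r'[,!?;-]', '.', text)

# Step 2: split into word and punctuation tokens
# (a rough stand-in for nltk.word_tokenize, used only in this sketch).
tokens = re.findall(r"\w+|[^\w\s]", text)

# Step 3: lowercase and keep purely alphabetical tokens.
cleaned = [token.lower() for token in tokens if token.isalpha()]

print(tokens)
print(cleaned)
```

Note that `isalpha()` is false for `'.'` and for `'2'`, so both the full stops introduced in step 1 and the number are dropped in step 3.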


In [222]:
# Compute the frequency distribution of the words in the dataset (vocabulary)
fdist = nltk.FreqDist(word for word in data)
print("Size of vocabulary: ",len(fdist) )
print("Most frequent tokens: ",fdist.most_common(20) ) # print the 20 most frequent words and their freq.


Size of vocabulary:  3001
Most frequent tokens:  [('and', 490), ('the', 432), ('to', 408), ('my', 393), ('of', 370), ('i', 349), ('that', 323), ('in', 323), ('thy', 287), ('thou', 234), ('love', 188), ('with', 181), ('is', 180), ('not', 176), ('for', 171), ('me', 164), ('but', 163), ('a', 163), ('thee', 162), ('so', 145)]
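`nltk.FreqDist` is a subclass of `collections.Counter`, so the same counts can be reproduced with the `Counter` imported earlier. The snippet below is a small sketch on a made-up token list, not on the Shakespeare data.

```python
from collections import Counter

# Made-up token list for illustration.
tokens = "to be or not to be that is the question".split()

counts = Counter(tokens)

print("Size of vocabulary: ", len(counts))            # number of distinct tokens
print("Most frequent tokens: ", counts.most_common(2))
```

The length of the counter is the vocabulary size, and `most_common(n)` returns the `n` most frequent tokens with their counts, which mirrors the two printed lines in the cell above.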


#### Mapping words to indices and indices to words
We provide a helper function to create a dictionary that maps words to indices and indices to words.

In [223]:
# get_dict creates two dictionaries, converting words to indices and vice versa.
word2Ind, Ind2word = # Your implementation
V = len(word2Ind)
print("Size of vocabulary: ", V)

Size of vocabulary:  3001


Expected output:

Size of vocabulary:  3001
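For intuition, a pair of word-to-index and index-to-word dictionaries can be built as sketched below. This is not the code of `get_dict` from `utils2`: the function name `build_dicts` is hypothetical, and assigning indices in alphabetical order is an assumption of this sketch.

```python
def build_dicts(tokens):
    """Return (word -> index, index -> word) dictionaries for the tokens."""
    # Assumption: indices follow the alphabetical order of the distinct words.
    vocab = sorted(set(tokens))
    word2idx = {word: idx for idx, word in enumerate(vocab)}
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word


w2i, i2w = build_dicts(["the", "king", "and", "the", "queen"])
print(w2i)
print(i2w[w2i["king"]])
```

The two dictionaries are inverses of each other, so looking up a word's index and then that index's word returns the original word.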

In [225]:
# example of word to index mapping
print("Index of the word 'king' :  ",word2Ind['king'] )
print("Index of the word 'queen':  ",word2Ind['queen'] )
print("Word which has index 2743:  ",Ind2word[2743] )

Index of the word 'king' :   1404
Index of the word 'queen':   1983
Word which has index 2743:   upon


Expected output:

Index of the word 'king' :   1404

Index of the word 'queen':   1983

Word which has index 2743:   upon

Word which has index 2743:   upon