# Task 1: Word Embeddings (10 points)

This notebook will guide you through all steps necessary to train a word2vec model (see the PDF for a detailed description).

## Imports

This code block is reserved for your imports. 

You are free to use the following packages: 

(List of packages)

In [1]:
# Imports
import pandas as pd
import regex
import numpy as np

## 1.1 Get the data (0.5 points)

The Hindi portion of the HASOC corpus from [github.io](https://hasocfire.github.io/hasoc/2019/dataset.html) is already available in the repo, at data/hindi_hatespeech.tsv. Load it into a data structure of your choice. Then, split off a small part of the corpus as a development set (~100 data points).

If you are using Colab, the first two lines will let you upload folders or files from your local file system.
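As a minimal sketch of the split described above, the following uses only the standard library and assumes the corpus has already been loaded as a list of rows. The name `split_dev` and its parameters are hypothetical and not part of the assignment. Unlike a plain random sample, it also removes the development rows from the remaining data, so the two parts are disjoint.

```python
import random

def split_dev(rows, n_dev=100, seed=1):
    """Split rows into (remaining_rows, dev_rows) with n_dev random dev rows.

    Hypothetical helper: `rows` can be any list (e.g. parsed TSV lines).
    """
    rng = random.Random(seed)                       # fixed seed for a reproducible split
    dev_indices = set(rng.sample(range(len(rows)), n_dev))
    dev_rows = [row for i, row in enumerate(rows) if i in dev_indices]
    remaining = [row for i, row in enumerate(rows) if i not in dev_indices]
    return remaining, dev_rows

# Usage with placeholder rows standing in for corpus entries
rows = [f"row_{i}" for i in range(1000)]
remaining, dev_rows = split_dev(rows, n_dev=100)
print(len(remaining), len(dev_rows))
```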

In [2]:
#TODO: implement!

from google.colab import files
uploaded = files.upload()

Saving hindi_hatespeech.tsv to hindi_hatespeech.tsv


In [3]:
data = pd.read_csv("hindi_hatespeech.tsv", sep="\t")          #loading data into pandas dataframe
development_set = data.sample(n=100, random_state=1)
development_set

Unnamed: 0,text_id,text,task_1,task_2,task_3
3251,hasoc_hi_6404,तेजस्वी सदन में हाज़िर हों !विधान सभा अध्यक्ष ...,NOT,NONE,NONE
1051,hasoc_hi_2290,आक्का बुरा मने ना माने पर गांड जरूर फटती है इ...,HOF,PRFN,TIN
3266,hasoc_hi_2348,#सूअरस्लाम इसमें जाकर देखो इन जिहादी सूअरों के...,HOF,HATE,TIN
1620,hasoc_hi_4811,अगर #Manchester में 300 पार है तो #Muzaffarpu...,NOT,NONE,NONE
3026,hasoc_hi_5425,"पांचवें चरण के बाद रोने का काम तो हो गया, अब ...",HOF,PRFN,UNT
...,...,...,...,...,...
3673,hasoc_hi_2562,बाबा रामरहीम को अगर अच्छे व्यवहार के कारण पैरो...,HOF,HATE,TIN
3652,hasoc_hi_375,"उन चूड़ियों का दर्द, दिल बेध देती हैं जिन्हों...",NOT,NONE,NONE
4160,hasoc_hi_5020,मतलब तेरे जैसे दोगले छद्मधारी पंडे के नाम पर म...,HOF,OFFN,TIN
3587,hasoc_hi_6714,"ये देश आपका है, और देश के हुक्मरानों को आईना ...",NOT,NONE,NONE


## 1.2 Data preparation (0.5 + 0.5 points)

* Prepare the data by removing everything that does not contain information. 
User names (starting with '@') and punctuation symbols clearly do not convey information, but we also want to get rid of so-called [stopwords](https://en.wikipedia.org/wiki/Stop_word), i.e. words that have little to no semantic content (and, but, yes, the...). Hindi stopwords can be found [here](https://github.com/stopwords-iso/stopwords-hi/blob/master/stopwords-hi.txt). Then, standardize the spelling by lowercasing all words.
Do this for the development section of the corpus for now.

* What about hashtags (starting with '#') and emojis? Should they be removed too? Justify your answer in the report, and explain how you accounted for this in your implementation.
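The cleaning steps listed above could be sketched as follows, using only the standard library. This is one possible design under stated assumptions, not the required solution: the function name `clean_text` is hypothetical, punctuation is detected through Unicode categories (so that the Devanagari danda is covered too), and hashtags keep their word while losing the '#', which is just one of the choices the question above asks you to justify. Emojis are left untouched here.

```python
import re
import unicodedata

def clean_text(text, stopwords=frozenset()):
    """Return the list of cleaned, lowercased tokens of `text` (hypothetical helper)."""
    text = re.sub(r"@\S+", " ", text)              # drop user names
    text = re.sub(r"https?://\S+", " ", text)      # drop links
    chars = []
    for ch in text:
        category = unicodedata.category(ch)
        # replace punctuation (P*) and decimal digits (Nd) with a space;
        # combining marks such as Devanagari vowel signs are kept
        if category.startswith("P") or category == "Nd":
            chars.append(" ")
        else:
            chars.append(ch)
    tokens = "".join(chars).lower().split()
    return [token for token in tokens if token not in stopwords]

print(clean_text("@user नमस्ते, #भारत! https://t.co/abc 123"))
```

Replacing punctuation with a space rather than deleting it keeps multi-word hashtags such as `#हिंदी_शब्द` as two separate tokens.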

In [12]:
#s = '145. !! #hsdfg संस्कृत महाविद्यालय । @asdsg https://m.facebook.com/story.php?story_fbid=443161759855093&id=464554060349094 asd'
#s = '¶¶खनकती चूड़ियाँ और छनकती पायल  मोहब्बत की मौशिकी में साज होते हैं¶¶¶  ~#साहिब१     #चूड़ी  #शब्दनिधि  #हिंदी_शब्द'
#s = regex.sub(r'(#[^\s]*)*', '', s)                    
#s = regex.sub(r'(@[\w]*)*[\d\p{Punct}*]*(https[^\s]*)*', '', s)
#s
development_set['text'] = development_set['text'].apply(lambda x: regex.sub(r'#\S*', '', x))                                    #removing hashtags
development_set['text'] = development_set['text'].apply(lambda x: regex.sub(r'@\w*|http\S*|[\d\p{Punct}]+', '', x))              #removing usernames, links, numbers and punctuation
development_set

Unnamed: 0,text_id,text,task_1,task_2,task_3
3251,hasoc_hi_6404,तेजस्वी सदन में हाज़िर हों विधान सभा अध्यक्ष व...,NOT,NONE,NONE
1051,hasoc_hi_2290,आक्का बुरा मने ना माने पर गांड जरूर फटती है इ...,HOF,PRFN,TIN
3266,hasoc_hi_2348,इसमें जाकर देखो इन जिहादी सूअरों के कारनामे र...,HOF,HATE,TIN
1620,hasoc_hi_4811,अगर में पार है तो में भी पार है जश्न और ...,NOT,NONE,NONE
3026,hasoc_hi_5425,पांचवें चरण के बाद रोने का काम तो हो गया अब छ...,HOF,PRFN,UNT
...,...,...,...,...,...
3673,hasoc_hi_2562,बाबा रामरहीम को अगर अच्छे व्यवहार के कारण पैरो...,HOF,HATE,TIN
3652,hasoc_hi_375,उन चूड़ियों का दर्द दिल बेध देती हैं जिन्होंन...,NOT,NONE,NONE
4160,hasoc_hi_5020,मतलब तेरे जैसे दोगले छद्मधारी पंडे के नाम पर म...,HOF,OFFN,TIN
3587,hasoc_hi_6714,ये देश आपका है और देश के हुक्मरानों को आईना द...,NOT,NONE,NONE


In [5]:
#TODO: implement!
uploaded = files.upload()

Saving stopwords-hi.txt to stopwords-hi.txt


In [13]:
stop_words = pd.read_csv("stopwords-hi.txt", header=None)           #storing stopwords in a dataframe
stop_word_list = stop_words[0].tolist()                             #first (and only) column as a plain list
print(stop_word_list)
stop_word_set = set(stop_word_list)                                 #build the set once; membership tests are constant time
text_split = development_set['text'].str.split()
text_wo_stopwords = text_split.apply(lambda x: [item for item in x if item not in stop_word_set])
text_wo_stopwords

['अंदर', 'अत', 'अदि', 'अप', 'अपना', 'अपनि', 'अपनी', 'अपने', 'अभि', 'अभी', 'आदि', 'आप', 'इंहिं', 'इंहें', 'इंहों', 'इतयादि', 'इत्यादि', 'इन', 'इनका', 'इन्हीं', 'इन्हें', 'इन्हों', 'इस', 'इसका', 'इसकि', 'इसकी', 'इसके', 'इसमें', 'इसि', 'इसी', 'इसे', 'उंहिं', 'उंहें', 'उंहों', 'उन', 'उनका', 'उनकि', 'उनकी', 'उनके', 'उनको', 'उन्हीं', 'उन्हें', 'उन्हों', 'उस', 'उसके', 'उसि', 'उसी', 'उसे', 'एक', 'एवं', 'एस', 'एसे', 'ऐसे', 'ओर', 'और', 'कइ', 'कई', 'कर', 'करता', 'करते', 'करना', 'करने', 'करें', 'कहते', 'कहा', 'का', 'काफि', 'काफ़ी', 'कि', 'किंहें', 'किंहों', 'कितना', 'किन्हें', 'किन्हों', 'किया', 'किर', 'किस', 'किसि', 'किसी', 'किसे', 'की', 'कुछ', 'कुल', 'के', 'को', 'कोइ', 'कोई', 'कोन', 'कोनसा', 'कौन', 'कौनसा', 'गया', 'घर', 'जब', 'जहाँ', 'जहां', 'जा', 'जिंहें', 'जिंहों', 'जितना', 'जिधर', 'जिन', 'जिन्हें', 'जिन्हों', 'जिस', 'जिसे', 'जीधर', 'जेसा', 'जेसे', 'जैसा', 'जैसे', 'जो', 'तक', 'तब', 'तरह', 'तिंहें', 'तिंहों', 'तिन', 'तिन्हें', 'तिन्हों', 'तिस', 'तिसे', 'तो', 'था', 'थि', 'थी', 'थे', 'दबारा', 'दवा

3251    [तेजस्वी, सदन, हाज़िर, हों, विधान, सभा, अध्यक्...
1051     [आक्का, बुरा, मने, माने, गांड, जरूर, फटती, कलुए]
3266    [जाकर, देखो, जिहादी, सूअरों, कारनामे, रंगीला, ...
1620    [अगर, पार, पार, जश्न, मातम, बीच, खड़े, हम, भार...
3026    [पांचवें, चरण, रोने, काम, अब, छठे, चरण, कहीं, ...
                              ...                        
3673    [बाबा, रामरहीम, अगर, अच्छे, व्यवहार, कारण, पैर...
3652    [चूड़ियों, दर्द, दिल, बेध, देती, जिन्होंने, खन...
4160    [मतलब, तेरे, दोगले, छद्मधारी, पंडे, नाम, मौलवी...
3587    [देश, आपका, देश, हुक्मरानों, आईना, दिखाना, ज़र...
996                   [भडवा, क्या, जाने, राजा, भोज, बारे]
Name: text, Length: 100, dtype: object

## 1.3 Build the vocabulary (0.5 + 0.5 points)

The input to the first layer of word2vec is a one-hot encoding of the current word. The output of the model is then compared to a numeric class label of the words within the size of the skip-gram window. Now,

* Compile a list of all words in the development section of your corpus and save it in a variable ```V```.
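A vocabulary of this kind can be sketched as a pair of lookup tables. The names `build_vocab`, `word_to_index` and `index_to_word` are assumptions for illustration. Keeping a word-to-index dictionary alongside the index-to-word one avoids scanning the whole vocabulary every time a word's class label is needed.

```python
def build_vocab(tokenized_sentences):
    """Map every distinct word to an index, in order of first appearance."""
    word_to_index = {}
    for sentence in tokenized_sentences:
        for word in sentence:
            if word not in word_to_index:
                word_to_index[word] = len(word_to_index)   # next free index
    index_to_word = {index: word for word, index in word_to_index.items()}
    return word_to_index, index_to_word

sentences = [["देश", "आपका", "देश"], ["आपका", "नाम"]]
word_to_index, index_to_word = build_vocab(sentences)
print(word_to_index)
```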

In [14]:
#TODO: implement!
AllWords = [item for x in text_wo_stopwords for item in x]    #every token, duplicates included
V = dict(enumerate(dict.fromkeys(AllWords)))                  #index -> unique word, in order of first appearance
V

{0: 'तेजस्वी',
 1: 'सदन',
 2: 'हाज़िर',
 3: 'हों',
 4: 'विधान',
 5: 'सभा',
 6: 'अध्यक्ष',
 7: 'विजय',
 8: 'चौधरी',
 9: 'चमकी',
 10: 'बुखार',
 11: 'विपक्ष',
 12: 'लाए',
 13: 'कार्य',
 14: 'स्थगन',
 15: 'प्रस्ताव',
 16: 'स्वीकार',
 17: 'यादव',
 18: 'मौजूद',
 19: 'सत्ता',
 20: 'पक्ष',
 21: 'उठाए',
 22: 'सवाल',
 23: 'आक्का',
 24: 'बुरा',
 25: 'मने',
 26: 'माने',
 27: 'गांड',
 28: 'जरूर',
 29: 'फटती',
 30: 'कलुए',
 31: 'जाकर',
 32: 'देखो',
 33: 'जिहादी',
 34: 'सूअरों',
 35: 'कारनामे',
 36: 'रंगीला',
 37: 'रसूल',
 38: 'कुरान',
 39: 'जाओ',
 40: 'सूअर',
 41: 'सलाम',
 42: 'अगर',
 43: 'पार',
 44: 'जश्न',
 45: 'मातम',
 46: 'बीच',
 47: 'खड़े',
 48: 'हम',
 49: 'भारतीय',
 50: 'पांचवें',
 51: 'चरण',
 52: 'रोने',
 53: 'काम',
 54: 'अब',
 55: 'छठे',
 56: 'कहीं',
 57: 'ज़हर',
 58: 'शीशी',
 59: 'लेकर',
 60: 'स्टेज',
 61: 'खड़ा',
 62: 'जाये',
 63: 'क्या',
 64: 'ज़िक्र',
 65: 'करके',
 66: 'बन्द',
 67: 'ज़ुबान',
 68: 'व्याख्या',
 69: 'इतनी',
 70: 'ख़ूबसूरती',
 71: 'आज',
 72: 'देख',
 73: 'लिया',
 74: 'वाह',
 

* Then, write a function ```word_to_one_hot``` that returns a one-hot encoding of an arbitrary word in the vocabulary. The size of the one-hot encoding should be ```len(V)```.
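For illustration, a one-hot encoding can be built directly from a word-to-index lookup. This is a sketch under assumptions: the function name `one_hot` and the `word_to_index` argument are hypothetical, and a word outside the vocabulary raises a `KeyError` instead of returning an all-zero vector.

```python
def one_hot(word, word_to_index):
    """Return a list of len(word_to_index) zeros with a single 1 at the word's index."""
    vector = [0] * len(word_to_index)
    vector[word_to_index[word]] = 1      # KeyError if the word is not in the vocabulary
    return vector

word_to_index = {"सदन": 0, "सभा": 1, "अध्यक्ष": 2}
print(one_hot("सभा", word_to_index))
```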

In [15]:
#TODO: implement!
def word_to_one_hot(word):
  hot_vector = []
  for x in V.values():      #V maps index -> word, so compare against the words, not the integer keys
    if x==word:
      hot_vector.append(1)
    else:
      hot_vector.append(0) 
  return hot_vector

print(len(word_to_one_hot('सदन')))

1218


## 1.4 Subsampling (0.5 points)

The probability to keep a word in a context is given by:

$P_{keep}(w_i) = \Big(\sqrt{\frac{z(w_i)}{0.001}}+1\Big) \cdot \frac{0.001}{z(w_i)}$

where $z(w_i)$ is the relative frequency of the word $w_i$ in the corpus. Now,
* Calculate word frequencies
* Define a function ```sampling_prob``` that takes a word (string) as input and returns the probability to **keep** the word in a context.
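The two steps above can be sketched with a single frequency table that is counted once and then reused. The names `make_sampling_prob` and `threshold` are assumptions, and so is the clipping to 1.0: for rare words the formula yields values above 1, which this sketch treats as "always keep".

```python
import math
from collections import Counter

def make_sampling_prob(tokens, threshold=0.001):
    """Count word frequencies once and return a function word -> P_keep."""
    counts = Counter(tokens)
    total = len(tokens)

    def sampling_prob(word):
        if counts[word] == 0:            # word does not occur in the corpus
            return 0.0
        z = counts[word] / total         # relative frequency z(w_i)
        p_keep = (math.sqrt(z / threshold) + 1) * (threshold / z)
        return min(1.0, p_keep)          # assumption: clip to a valid probability

    return sampling_prob

# Worked example: z = 0.004 gives (sqrt(4) + 1) * 0.25 = 0.75
keep = make_sampling_prob(["a"] * 4 + ["b"] * 996)
print(keep("a"))
```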

In [17]:
#TODO: implement!
def word_frequency(word):
  freq = 0
  for x in AllWords:
    if x == word:
      freq += 1
  return freq

def sampling_prob(word):
  relative_frq = word_frequency(word)/len(AllWords)
  if relative_frq==0:          #if word is not present in the corpus
    return 0
  else:
    p_keep = (np.sqrt(relative_frq/0.001)+1)*(0.001/relative_frq)
    return p_keep

print(sampling_prob('BJP'))  

1.3294140926065678


## 1.5 Skip-Grams (1 point)

Now that you have the vocabulary and one-hot encodings at hand, you can start to do the actual work. The skip-gram model requires training data of the shape ```(current_word, context)```, with ```context``` being the words before and/or after ```current_word``` within ```window_size```. 

* Have a closer look at the original paper. Once you feel you understand how skip-gram works, implement a function ```get_target_context``` that takes a sentence as input and [yield](https://docs.python.org/3.9/reference/simple_stmts.html#the-yield-statement)s a ```(current_word, context)``` pair.

* Use your ```sampling_prob``` function to drop words from contexts as you sample them. 
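A generator along these lines might look as follows. It is a sketch, not the reference solution: the name `skipgram_pairs`, the `keep_prob` callable (standing in for a keep-probability function) and the `rng` parameter are assumptions. Each context word is kept with its own probability, so the pairs differ from run to run unless a seeded generator is passed in.

```python
import random

def skipgram_pairs(tokens, window_size, keep_prob, rng=random):
    """Yield (current_word, context) for every token that has a non-empty context."""
    for position, current_word in enumerate(tokens):
        start = max(0, position - window_size)
        end = min(len(tokens), position + window_size + 1)
        context = [
            tokens[j]
            for j in range(start, end)
            # skip the current word itself; keep a neighbour with probability keep_prob(word)
            if j != position and rng.random() < keep_prob(tokens[j])
        ]
        if context:
            yield current_word, context

# With keep_prob fixed to 1.0 nothing is dropped, so the windows are easy to inspect
for pair in skipgram_pairs(["a", "b", "c", "d"], 1, lambda word: 1.0):
    print(pair)
```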

In [25]:
#TODO: implement!
window_size = 5
def get_key(val):
    for key, value in V.items():
         if val == value:
             return key
def get_target_context(sentence):
  #dropping words according to sampling probability
  #(build a new list; removing items from a list while iterating over it skips elements)
  words = [word for word in sentence.split() if sampling_prob(word)>=0.5]
  training_data = []
  for position, word in enumerate(words):      #enumerate gives the right position even for repeated words
    context = []
    for i in range(-window_size, window_size+1):
      if position+i<0 or position+i>=len(words) or i==0:
        continue
      context.append(get_key(words[position+i]))
    training_data.append([get_key(word), context])
  return training_data

get_target_context('आक्का बुरा मने ना माने पर गांड जरूर फटती है')

[[23, [24, 25, 26, 27, 28]],
 [24, [23, 25, 26, 27, 28, 29]],
 [25, [23, 24, 26, 27, 28, 29]],
 [26, [23, 24, 25, 27, 28, 29]],
 [27, [23, 24, 25, 26, 28, 29]],
 [28, [23, 24, 25, 26, 27, 29]],
 [29, [24, 25, 26, 27, 28]]]

## 1.6 Hyperparameters (0.5 points)

According to the word2vec paper, what would be a good choice for the following hyperparameters? 

* Embedding dimension
* Window size

Initialize them in a dictionary or as independent variables in the code block below. 

In [41]:
# Set hyperparameters
window_size = 5
embedding_size = 300

# More hyperparameters
learning_rate = 0.05
epochs = 100

## 1.7 Pytorch Module (0.5 + 0.5 + 0.5 points)

Pytorch provides a wrapper for your fancy and super-complex models: [torch.nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). The code block below contains a skeleton for such a wrapper. Now,

* Initialize the two weight matrices of word2vec as fields of the class.

* Override the ```forward``` method of this class. It should take a one-hot encoding as input, perform the matrix multiplications, and finally apply a log softmax on the output layer.

* Initialize the model and save its weights in a variable. The Pytorch documentation will tell you how to do that.
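As a reading aid, the forward pass described above can be written out as a short derivation. The symbols are assumptions and do not come from the assignment text: $x$ is the one-hot input of length $|V|$, $d$ is the embedding dimension, $W_{in} \in \mathbb{R}^{|V| \times d}$ and $W_{out} \in \mathbb{R}^{d \times |V|}$ are the two weight matrices, and bias terms are assumed to be absent.

$$
h = W_{in}^\top x, \qquad u = W_{out}^\top h, \qquad \log \hat{y}_j = u_j - \log \sum_{k=1}^{|V|} \exp(u_k)
$$

Because $x$ is one-hot, $h$ is simply the row of $W_{in}$ that belongs to the current word, which is why the input weights can later be read as word embeddings.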

In [None]:
# Create model
from torch.nn import Module      # base class used by the skeleton below

class Word2Vec(Module):
  def __init__(self):
    super().__init__()


  def forward(self, one_hot):
    pass

Word2Vec(
  (input): Linear(in_features=534, out_features=300, bias=False)
  (output): Linear(in_features=300, out_features=534, bias=False)
)


## 1.8 Loss function and optimizer (0.5 points)

Initialize variables for the [optimizer](https://pytorch.org/docs/stable/optim.html#module-torch.optim) and the loss function. You can take what is used in the word2vec paper, but you can use alternative optimizers/loss functions if you explain your choice in the report.
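One common pairing for a model with a log-softmax output, given here as an assumption rather than the required answer, is the negative log-likelihood loss combined with plain stochastic gradient descent. In the notation below, which is introduced only for this sketch, $\log \hat{y}_c$ is the model's log-softmax output for context word $c$, $C(w_i)$ is the set of context words of the current word $w_i$, $\theta$ stands for all model weights, and $\eta$ is the learning rate.

$$
\mathcal{L}(w_i) = -\sum_{c \in C(w_i)} \log \hat{y}_c, \qquad \theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(w_i)
$$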

In [None]:
# Define optimizer and loss
optimizer = 
criterion = 

## 1.9 Training the model (3 points)

As everything is prepared, implement a training loop that performs several passes of the data set through the model. You are free to do this as you please, but your code should:

* Load the weights saved in 1.7 at the start of every execution of the code block
* Print the accumulated loss at least after every epoch (the accumulated loss should be reset after every epoch)
* Define a criterion for the training procedure to terminate if a certain loss value is reached. You can find the threshold by observing the loss for the development set.

You can play around with the number of epochs and the learning rate.
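The control flow required above can be sketched independently of any particular model. Everything model-specific is hidden behind two hypothetical callables that are not part of the assignment: `reset_fn` is assumed to restore the saved initial weights, and `step_fn` is assumed to run the forward pass, backward pass and optimizer update for one training pair and return its loss as a float.

```python
def train(pairs, step_fn, reset_fn, epochs, loss_threshold):
    """Run up to `epochs` passes over `pairs`; stop early once the epoch loss is low enough."""
    reset_fn()                                    # start every run from the saved weights
    history = []
    for epoch in range(epochs):
        accumulated_loss = 0.0                    # reset at the start of every epoch
        for target, context in pairs:
            accumulated_loss += step_fn(target, context)
        history.append(accumulated_loss)
        print(f"epoch {epoch + 1}: accumulated loss = {accumulated_loss:.4f}")
        if accumulated_loss <= loss_threshold:    # termination criterion
            break
    return history

# Dummy stand-ins so the loop can be run without a model: the loss halves on every step
state = {"loss": 8.0}

def fake_step(target, context):
    state["loss"] /= 2
    return state["loss"]

history = train([(0, [1]), (1, [0])], fake_step, lambda: None, epochs=10, loss_threshold=1.0)
```

Returning the per-epoch losses makes it easy to pick the threshold afterwards by looking at how the loss develops on the development set.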

In [None]:
# Define train procedure

# load initial weights

def train():
 
  print("Training started")

train()

print("Training finished")

## 1.10 Train on the full dataset (0.5 points)

Now, go back to 1.1 and remove the restriction on the number of sentences in your corpus. Then, re-execute code blocks 1.2, 1.3 and 1.7 (or those relevant if you created additional ones). 

* Then, retrain your model on the complete dataset.

* Now, the input weights of the model contain the desired word embeddings! Save them together with the corresponding vocabulary items (Pytorch provides a nice [functionality](https://pytorch.org/tutorials/beginner/saving_loading_models.html) for this).