# Negative and Sub Sampling
In this tutorial, we will talk about `negative-sampling and sub-sampling.` For this purpose, we will utilize The Children’s [Books Test(CBT) dataset](https://www.kaggle.com/datasets/amoghjrules/babi-childrens-books-facebool-ai). First, let us load the dataset and do some data cleaning.

In [2]:
from google.colab import drive
drive.mount("/content/gdrive")

Mounted at /content/gdrive


In [3]:
%cd /content/gdrive/MyDrive/Colab\ Notebooks/attention

/content/gdrive/MyDrive/Colab Notebooks/attention


In [88]:
with open('text.txt', 'r') as f:
    text = f.read()

In [89]:
print(text[:1000])

_BOOK_TITLE_ : Andrew_Lang___The_Grey_Fairy_Book.txt.out
DONKEY SKIN There was once upon a time a king who was so much beloved by his subjects that he thought himself the happiest monarch in the whole world , and he had everything his heart could desire .
His palace was filled with the rarest of curiosities , and his garden with the sweetest flowers , while the marble stalls of his stables stood a row of milk-white Arabs , with big brown eyes .
Strangers who had heard of the marvels which the king had collected , and made long journeys to see them , were , however , surprised to find the most splendid stall of all occupied by a donkey , with particularly large and drooping ears .
It was a very fine donkey ; but still , as far as they could tell , nothing so very remarkable as to account for the care with which it was lodged ; and they went away wondering , for they could not know that every night , when it was asleep , bushels of gold pieces tumbled out of its ears , which were picked 

In [90]:
import re
import nltk
import numpy as np
import pandas as pd

In [91]:
text =re.sub(r"(?m)^(\_BOOK_TITLE\_|CHAPTER).*\n?","",text,re.MULTILINE) # some preprocessing but not all
text =re.sub(r"(?m)-LCB.*RCB-","",text,re.MULTILINE)
text[:1000]

'DONKEY SKIN There was once upon a time a king who was so much beloved by his subjects that he thought himself the happiest monarch in the whole world , and he had everything his heart could desire .\nHis palace was filled with the rarest of curiosities , and his garden with the sweetest flowers , while the marble stalls of his stables stood a row of milk-white Arabs , with big brown eyes .\nStrangers who had heard of the marvels which the king had collected , and made long journeys to see them , were , however , surprised to find the most splendid stall of all occupied by a donkey , with particularly large and drooping ears .\nIt was a very fine donkey ; but still , as far as they could tell , nothing so very remarkable as to account for the care with which it was lodged ; and they went away wondering , for they could not know that every night , when it was asleep , bushels of gold pieces tumbled out of its ears , which were picked up each morning by the attendants .\nAfter many years

In [92]:
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download("popular")

[nltk_data] Downloading collection 'popular'
[nltk_data]    | 
[nltk_data]    | Downloading package cmudict to /root/nltk_data...
[nltk_data]    |   Package cmudict is already up-to-date!
[nltk_data]    | Downloading package gazetteers to /root/nltk_data...
[nltk_data]    |   Package gazetteers is already up-to-date!
[nltk_data]    | Downloading package genesis to /root/nltk_data...
[nltk_data]    |   Package genesis is already up-to-date!
[nltk_data]    | Downloading package gutenberg to /root/nltk_data...
[nltk_data]    |   Package gutenberg is already up-to-date!
[nltk_data]    | Downloading package inaugural to /root/nltk_data...
[nltk_data]    |   Package inaugural is already up-to-date!
[nltk_data]    | Downloading package movie_reviews to
[nltk_data]    |     /root/nltk_data...
[nltk_data]    |   Package movie_reviews is already up-to-date!
[nltk_data]    | Downloading package names to /root/nltk_data...
[nltk_data]    |   Package names is already up-to-date!
[nltk_data]    | Do

True

In [93]:
sentence_tokens= sent_tokenize(text.lower())
sentence_tokens[0:10]

['donkey skin there was once upon a time a king who was so much beloved by his subjects that he thought himself the happiest monarch in the whole world , and he had everything his heart could desire .',
 'his palace was filled with the rarest of curiosities , and his garden with the sweetest flowers , while the marble stalls of his stables stood a row of milk-white arabs , with big brown eyes .',
 'strangers who had heard of the marvels which the king had collected , and made long journeys to see them , were , however , surprised to find the most splendid stall of all occupied by a donkey , with particularly large and drooping ears .',
 'it was a very fine donkey ; but still , as far as they could tell , nothing so very remarkable as to account for the care with which it was lodged ; and they went away wondering , for they could not know that every night , when it was asleep , bushels of gold pieces tumbled out of its ears , which were picked up each morning by the attendants .',
 'aft

In [94]:
word_token =[word_tokenize(token) for token in sentence_tokens]
print(word_token[0])

['donkey', 'skin', 'there', 'was', 'once', 'upon', 'a', 'time', 'a', 'king', 'who', 'was', 'so', 'much', 'beloved', 'by', 'his', 'subjects', 'that', 'he', 'thought', 'himself', 'the', 'happiest', 'monarch', 'in', 'the', 'whole', 'world', ',', 'and', 'he', 'had', 'everything', 'his', 'heart', 'could', 'desire', '.']


In [95]:
token = [tok  for sent in word_token  for tok in sent ]

In [96]:
words = tuple(set(token))
int2str = dict(enumerate(words))
str2int = {ch: i for i, ch in int2str.items()}

In [97]:
print('Length of the vocabulary: ', len(words))
words[0:10]

Length of the vocabulary:  5474


('bird',
 'so',
 'gathers',
 'he',
 'brightly',
 'people',
 'fierce',
 'gutter',
 'drag',
 'swifter')

Our first step is sub-sampling. In lectures, we have gone over how sub-sampling works in `Lecture 3, part 5`. Briefly, sub-sampling removes some more frequent words like `the` to create a more uniform dataset. For this purpose, we will create a probability function to see whether we will remove a token from our dataset. This probability will be checked for **each occurrence** of the token. 

In [98]:
from collections import Counter,defaultdict

wordFreq = defaultdict(int)

for sent in word_token:
    for word in sent:
        wordFreq[word] += 1

In [99]:
import math

In [100]:
totalWords = sum([freq for freq in wordFreq.values()])
wi = {word:(freq/totalWords) for word, freq in wordFreq.items()}
wordProb ={ word:(math.sqrt(wi[word]/0.001)+1)*0.001/wi[word]  for word in wi}

In [101]:
posSet = []  ## there is a problem in this approach
dropped = 0
# add positive examples
for sent in word_token:
    for i in range(1, len(sent)-1):
      if   np.random.rand()<wordProb[sent[i]]:
        word = sent[i]
        context_words = [sent[i-1], sent[i+1]]   
        for context in context_words:
            posSet.append((word, context))  # we are creating bi-grams for text generation task here
      else:
        dropped+=1
n_pos_examples = len(posSet)
print(dropped)
posSet[0:10]

33964


[('skin', 'donkey'),
 ('skin', 'there'),
 ('there', 'skin'),
 ('there', 'was'),
 ('once', 'was'),
 ('once', 'upon'),
 ('upon', 'once'),
 ('upon', 'a'),
 ('time', 'a'),
 ('time', 'a')]

There is a problem with the approach above. Do you see what it is?

In [102]:
posSet = [] 
dropped = 0
for sent in word_token:
  dum_sent = sent.copy()
  for i in range(len(dum_sent)-1):
    if   np.random.rand()>wordProb[dum_sent[i]]:
        dum_sent[i] = None
        dropped +=1
  for i in range(1, len(dum_sent)-2):
      if(dum_sent[i]!= None):
        if(dum_sent[i+1]!= None):
          posSet.append((dum_sent[i], dum_sent[i+1]))
        if(dum_sent[i-1]!= None):
          posSet.append((dum_sent[i], dum_sent[i-1]))
print(dropped)
posSet[0:10]

36220


[('skin', 'there'),
 ('skin', 'donkey'),
 ('there', 'was'),
 ('there', 'skin'),
 ('was', 'once'),
 ('was', 'there'),
 ('once', 'upon'),
 ('once', 'was'),
 ('upon', 'a'),
 ('upon', 'once')]

In [103]:
n_pos_examples = len(posSet)
len(posSet)

87533

If you guessed that it keeps words as context you are right

Now that we finished our sub-sampling, we can do a negative sampling to enrich our data. Negative sampling is utilized to balance the positive examples with negatives so that our network will not overfit positive examples. This is again done by creating a probabilistic function to create examples. In the following code, we again create a word probability function. Then, utilizing this probability, we select create negative examples for each positive example.

In [104]:
totalWords = sum([freq**(3/4) for freq in wordFreq.values()])
wordProb = {word:(freq**(3/4)/totalWords) for word, freq in wordFreq.items()}

In [105]:
n_neg_examples = 0 # 40m run time
negSet = []
import tqdm

for i in tqdm.tqdm(range(n_pos_examples)):
  context=np.random.choice(list(wordProb.keys()), p=list(wordProb.values())) 
  while ((posSet[i][0],context)  in posSet):
    context=np.random.choice(list(wordProb.keys()), p=list(wordProb.values()))
  negSet.append((posSet[i][0], context))


100%|██████████| 87533/87533 [07:22<00:00, 197.70it/s]


In [63]:
len(negSet)

162752

In [106]:
pos_data = pd.DataFrame(posSet,columns=["word","context"])
pos_data["out"] = 1
pos_data.head()

Unnamed: 0,word,context,out
0,skin,there,1
1,skin,donkey,1
2,there,was,1
3,there,skin,1
4,was,once,1


In [107]:
neg_data = pd.DataFrame(negSet,columns=["word","context"])
neg_data["out"] = 0
neg_data.head()

Unnamed: 0,word,context,out
0,skin,and,0
1,skin,good,0
2,there,broken,0
3,there,apple-tree,0
4,was,barked,0


In [108]:
data = pd.concat([pos_data,neg_data],axis=0)
data.describe()

Unnamed: 0,out
count,175066.0
mean,0.5
std,0.500001
min,0.0
25%,0.0
50%,0.5
75%,1.0
max,1.0


In [109]:
data2 = data.copy()
data2["text"] =  data["word"]+' '+data["context"]
data2.head(1000)

Unnamed: 0,word,context,out,text
0,skin,there,1,skin there
1,skin,donkey,1,skin donkey
2,there,was,1,there was
3,there,skin,1,there skin
4,was,once,1,was once
...,...,...,...,...
995,donkey,'s,1,donkey 's
996,'s,skin,1,'s skin
997,'s,donkey,1,'s donkey
998,skin,-rsb-,1,skin -rsb-


In [110]:
data3 =data2.drop(columns=["context","word"])
data3 = data3[["text","out"]]
data3.head()

Unnamed: 0,text,out
0,skin there,1
1,skin donkey,1
2,there was,1
3,there skin,1
4,was once,1


In [111]:
data3.to_csv("data_all_val.csv",index=False)