### skipgrams: transforms a sequence of word indexes (list of int) into couples of the form:

- (word, word in the same window), with label 1 (positive samples).
- (word, random word from the vocabulary), with label 0 (negative samples).



**Return: tuple (couples, labels).**
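To make the two kinds of couples concrete, here is a minimal pure-Python sketch of the idea. It is not the Keras implementation: the name `simple_skipgrams`, its parameters, and the choice of exactly one negative sample per positive pair are assumptions made for illustration.

```python
import random

def simple_skipgrams(sequence, vocabulary_size, window_size=4, seed=None):
    """Hypothetical, simplified skip-gram generator (not the Keras code)."""
    rng = random.Random(seed)
    couples, labels = [], []

    # Positive samples: pair each word with every other word in its window.
    for i, target in enumerate(sequence):
        start = max(0, i - window_size)
        end = min(len(sequence), i + window_size + 1)
        for j in range(start, end):
            if j != i:
                couples.append([target, sequence[j]])
                labels.append(1)

    # Negative samples: pair a word from the sequence with a random word ID.
    # ID 0 is skipped here on the assumption that it is reserved (e.g. padding).
    num_positive = len(couples)
    for k in range(num_positive):
        couples.append([couples[k][0], rng.randint(1, vocabulary_size - 1)])
        labels.append(0)

    # Shuffle couples and labels together so they stay aligned.
    order = list(range(len(couples)))
    rng.shuffle(order)
    return [couples[i] for i in order], [labels[i] for i in order]

couples, labels = simple_skipgrams([1, 2, 3, 4, 5, 6], vocabulary_size=7, seed=0)
print(len(couples), sum(labels))  # 56 couples, 28 of them positive
```

The order of the couples depends on the seed, but the counts do not: six words with a window of 4 always give 28 positive couples, matched here by 28 negative ones.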

### 1. Import all necessary libraries


In [64]:
# -*- coding: utf-8 -*-
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import skipgrams

import numpy as np

### 2. Text to be analyzed

In [None]:
text = "I love green eggs and ham ."

### 3. Create the tokenizer and fit it on the text -> produces the word index

In [16]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts([text])

# Words are called tokens, and the process of splitting text into tokens is called tokenization.

# The Tokenizer creates a dictionary mapping each unique word to an integer ID
# and makes it available in its word_index attribute.

# word_index is a dictionary of (word, ID) pairs.
word2id = tokenizer.word_index
print(word2id)

{'i': 1, 'love': 2, 'green': 3, 'eggs': 4, 'and': 5, 'ham': 6}


In [17]:
id2word = {v:k for k,v in word2id.items()}
print(id2word)

{1: 'i', 2: 'love', 3: 'green', 4: 'eggs', 5: 'and', 6: 'ham'}
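For intuition about how such a word index comes about, the sketch below builds one in plain Python. It only approximates the tokenizer's behaviour: the function name `build_word_index` and the regular expression used to split words are assumptions, and the real `Tokenizer` has its own filtering rules.

```python
import re
from collections import Counter

def build_word_index(texts):
    """Hypothetical helper: map each word to an ID, most frequent word first."""
    counts = Counter()
    for t in texts:
        # Lowercase, then keep runs of letters, digits and apostrophes as words.
        counts.update(re.findall(r"[a-z0-9']+", t.lower()))
    # most_common() sorts by frequency; ties keep first-seen order.
    # IDs start at 1 so that 0 stays free as a reserved index.
    return {word: i + 1 for i, (word, _) in enumerate(counts.most_common())}

word_index = build_word_index(["I love green eggs and ham ."])
index_word = {v: k for k, v in word_index.items()}
print(word_index)
print(index_word)
```

Every word in the example sentence occurs once, so the IDs simply follow the order of first appearance.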


### 4. Convert the input list of words to a list of IDs

In [27]:
# Keras provides the text_to_word_sequence() function, which splits text into a list of lowercase words.
word_ids = [word2id[w] for w in text_to_word_sequence(text)]  # word IDs for the text
print(word_ids)

print(text_to_word_sequence(text),'\n')
for w in text_to_word_sequence(text):
    print(w,'-------ID--->',word2id[w])

[1, 2, 3, 4, 5, 6]
['i', 'love', 'green', 'eggs', 'and', 'ham'] 

i -------ID---> 1
love -------ID---> 2
green -------ID---> 3
eggs -------ID---> 4
and -------ID---> 5
ham -------ID---> 6


### 5. Pass the list of IDs to the skipgrams function

In [50]:
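# vocabulary_size should be the largest word index + 1, because index 0 is reserved for non-words.
vocabulary_size = len(word2id) + 1
pairs, labels = skipgrams(word_ids, vocabulary_size)
# Note: this second call draws a fresh random sample, so what it prints differs from `pairs` and `labels`.
print(skipgrams(word_ids, vocabulary_size))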

print()


print("\n\n\n")
print("PAIRS")
print(pairs)
print("\n\n\n")
print("LABELS")
print(labels)

([[6, 4], [3, 2], [5, 5], [4, 5], [2, 5], [3, 4], [6, 4], [3, 5], [1, 1], [1, 2], [6, 3], [1, 4], [1, 1], [5, 4], [2, 4], [4, 4], [2, 2], [5, 4], [6, 2], [5, 5], [5, 1], [4, 4], [4, 1], [5, 4], [4, 5], [5, 3], [4, 2], [2, 5], [4, 3], [1, 5], [3, 3], [3, 6], [6, 2], [1, 3], [2, 6], [6, 4], [3, 3], [5, 6], [3, 4], [6, 5], [3, 2], [2, 1], [3, 5], [5, 2], [6, 4], [2, 3], [2, 3], [4, 6], [1, 5], [3, 1], [4, 1], [5, 2], [2, 5], [1, 3], [4, 5], [2, 3]], [0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0])





PAIRS
[[6, 5], [3, 3], [4, 5], [1, 4], [2, 1], [3, 6], [6, 2], [5, 4], [3, 4], [2, 4], [4, 3], [4, 3], [1, 5], [3, 2], [3, 1], [4, 5], [1, 5], [2, 5], [1, 2], [4, 6], [2, 2], [5, 1], [2, 1], [4, 1], [5, 3], [4, 5], [5, 2], [2, 3], [2, 6], [3, 1], [2, 3], [6, 4], [3, 2], [1, 5], [1, 3], [4, 5], [6, 4], [5, 2], [5, 5], [4, 4], [3, 5], [1, 1], [1, 3], [6, 5], [3, 1], [5, 5],

In [47]:
print("Pairs Length:",len(pairs),"\nLabel Length:",len(labels)) # We are generating 56 pairs

Pairs Length: 56 
Label Length: 56
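The figure of 56 can be checked by hand. The short calculation below assumes the `skipgrams` defaults of `window_size=4` and `negative_samples=1.0` (one negative per positive) and no sampling table; the helper name `count_context_pairs` is made up for this sketch.

```python
def count_context_pairs(n_words, window_size):
    """Count ordered (word, context) pairs whose positions are within the window."""
    return sum(
        1
        for i in range(n_words)
        for j in range(n_words)
        if i != j and abs(i - j) <= window_size
    )

# 6 words give 6 * 5 = 30 ordered pairs; only (first, last) and (last, first)
# are 5 positions apart and therefore fall outside a window of 4.
positives = count_context_pairs(n_words=6, window_size=4)
negatives = int(positives * 1.0)  # assumed default: negative_samples=1.0
print(positives, negatives, positives + negatives)  # 28 28 56
```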


In [68]:
print(pairs[0][0])
print(id2word[pairs[0][0]])
print(pairs[0][1])
print(id2word[pairs[0][1]])
print(type(pairs))

pairs_dimension = np.array(pairs)
print("Dimension of pairs:",pairs_dimension.shape)
print(text)

6
ham
5
and
<class 'list'>
Dimension of pairs: (56, 2)
I love green eggs and ham .


### 6. Print the first 10 of the 56 (pair, label) skip-gram tuples generated

In [67]:
for i in range(10):
    print("({:s} ({:d}), {:s} ({:d})) -> {:d}".format(
    id2word[pairs[i][0]], pairs[i][0],
    id2word[pairs[i][1]], pairs[i][1],
    labels[i]))

(ham (6), and (5)) -> 0
(green (3), green (3)) -> 0
(eggs (4), and (5)) -> 0
(i (1), eggs (4)) -> 1
(love (2), i (1)) -> 1
(green (3), ham (6)) -> 1
(ham (6), love (2)) -> 1
(and (5), eggs (4)) -> 0
(green (3), eggs (4)) -> 0
(love (2), eggs (4)) -> 0


Note that your results may be different, since the skipgrams function shuffles the pairs it returns
and draws the negative examples at random.


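Additionally, negative sampling, which generates the negative examples, pairs a word from the text with a
randomly chosen word from the vocabulary. With a vocabulary this small, a random pair is often a genuine
context pair: (ham, and) above is labelled 0 even though the two words are adjacent. As the size of the
input text increases, random pairs are more likely to be genuinely unrelated words.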
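The effect is visible in the ten tuples printed in section 6. The sketch below copies those tuples by hand and counts how many negative examples are in fact words that sit within the window of each other. It relies on a property of this particular sentence: every word is unique and its ID equals its position plus one, so the difference between two IDs is the distance between the words. The window of 4 is the assumed `skipgrams` default.

```python
# (pair, label) tuples copied from the printed output in section 6.
first_ten = [
    ((6, 5), 0), ((3, 3), 0), ((4, 5), 0), ((1, 4), 1), ((2, 1), 1),
    ((3, 6), 1), ((6, 2), 1), ((5, 4), 0), ((3, 4), 0), ((2, 4), 0),
]
WINDOW_SIZE = 4  # assumed default window

def is_true_context(pair, window_size=WINDOW_SIZE):
    """True if the two word IDs belong to different words within the window."""
    a, b = pair
    return a != b and abs(a - b) <= window_size

negatives = [pair for pair, label in first_ten if label == 0]
collisions = [pair for pair in negatives if is_true_context(pair)]

print(len(negatives), "negative examples,", len(collisions), "are real context pairs")
print(collisions)
```

Five of the six negatives here are real context pairs, which is what one would expect when the whole vocabulary fits almost entirely inside a single window.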