## CBOW Assignment

Team member names:  

Complete all of the sections as described below. Then run all cells, print to PDF using Chrome, and submit on Gradescope (indicating on your submission the start of each part of the assignment and choosing your team members).

In [1]:
import keras
keras.__version__

Using TensorFlow backend.


'2.2.4'

## Continuous bag of words (CBOW) embedding


We'll learn a word embedding using the CBOW framework as described in section 4.2 of <a href="https://cs224d.stanford.edu/lecture_notes/notes1.pdf">these notes</a>.  

The word corpus will be the "IMDB dataset", a set of 50,000 highly-polarized reviews from the Internet Movie Database. The following code loads the dataset and sets a few parameters for use later.   

In [2]:
from keras.datasets import imdb

V = 5000 # vocabulary size
num_reviews = 5000 # number of reviews to use during training
num_test = 200 # number of reviews to use during testing and validation
dim = 20 # embedding dimension
window_size = 2


(train_data_full, train_labels_full), (test_data_full, test_labels_full) = imdb.load_data(num_words=V)

train_data = train_data_full[0:num_reviews]
test_data = test_data_full[0:num_test]
val_data = test_data_full[num_test:2*num_test]


The argument `num_words=V` means that we will only keep the top V most frequently occurring words in the training data. Rarer words are replaced by the reserved "unknown" index, which decodes to `?` below. This allows us to work with vector data of manageable size.
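
As a rough illustration of what the vocabulary cutoff does to a single review, here is a minimal sketch. The helper name `clip_to_vocab` is hypothetical, and the default out-of-vocabulary index of 2 is an assumption based on the reserved indices described in the decoding code further down.

```python
def clip_to_vocab(review, num_words, oov_index=2):
    """Replace every word index at or above num_words with the 'unknown' index.

    Hypothetical helper for illustration only; oov_index=2 is an assumption.
    """
    return [w if w < num_words else oov_index for w in review]


# Toy review: index 7000 falls outside a 5000-word vocabulary
toy_review = [1, 14, 7000, 22]
clipped = clip_to_vocab(toy_review, num_words=5000)
print(clipped)
```

Any index that survives the cutoff is left untouched, so the review keeps its original length.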

The variables `train_data` and `test_data` are lists of reviews, each review being a list of word indices (encoding a sequence of words). 
`train_labels_full` and `test_labels_full` are lists of 0s and 1s, where 0 stands for "negative" and 1 stands for "positive".  The labels will not be used for this assignment. 

Here's some code to decode back to English words:

In [3]:
class WordIndexManager:
  def __init__(self, word_index=None):
    # word_index maps words to integer indices; default to an empty mapping
    self.word_index = word_index if word_index is not None else {}
    # Reverse the index, mapping integer indices to words
    self.reverse_word_index = {value: key for (key, value) in self.word_index.items()}

  def ind_to_string(self, word_ind):
    # Decode a word; note that our indices were offset by 3
    # because 0, 1 and 2 are reserved indices for "padding", "start of sequence", and "unknown".
    return self.reverse_word_index.get(word_ind - 3, '?')

  def inds_to_string(self, word_inds):  
    # Put a list of decoded words into a string
    decoded_review = ' '.join([self.ind_to_string(i) for i in word_inds])
    return decoded_review

  def create_word_list(self, word_inds):
    word_list = []
    for ind in word_inds:
      word_list.append(self.ind_to_string(ind))
    return word_list
  
# word_index is a dictionary mapping words to an integer index
# We create an instance of a class to manage this index
WIM = WordIndexManager(imdb.get_word_index())

## TODO:  
Use the code above to print the entire first review (index 0) and also to print the 2nd word (index 1) in the first review.

In [4]:
# print the entire first review
print(WIM.create_word_list(train_data[0]))
# print the second word of the first review
print(WIM.create_word_list(train_data[0])[1])
# ** YOUR CODE HERE **

['?', 'this', 'film', 'was', 'just', 'brilliant', 'casting', 'location', 'scenery', 'story', 'direction', "everyone's", 'really', 'suited', 'the', 'part', 'they', 'played', 'and', 'you', 'could', 'just', 'imagine', 'being', 'there', 'robert', '?', 'is', 'an', 'amazing', 'actor', 'and', 'now', 'the', 'same', 'being', 'director', '?', 'father', 'came', 'from', 'the', 'same', 'scottish', 'island', 'as', 'myself', 'so', 'i', 'loved', 'the', 'fact', 'there', 'was', 'a', 'real', 'connection', 'with', 'this', 'film', 'the', 'witty', 'remarks', 'throughout', 'the', 'film', 'were', 'great', 'it', 'was', 'just', 'brilliant', 'so', 'much', 'that', 'i', 'bought', 'the', 'film', 'as', 'soon', 'as', 'it', 'was', 'released', 'for', '?', 'and', 'would', 'recommend', 'it', 'to', 'everyone', 'to', 'watch', 'and', 'the', 'fly', '?', 'was', 'amazing', 'really', 'cried', 'at', 'the', 'end', 'it', 'was', 'so', 'sad', 'and', 'you', 'know', 'what', 'they', 'say', 'if', 'you', 'cry', 'at', 'a', 'film', 'it', '

## TODO:

Print the top 20 most common words.  Print them in a table form as in:

| Index      | Count | Word    |
| :---    |    :----:      |    ---: |
| 2      | 122808  |    ?   |
| ... | ... |  ...|
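
One possible way to obtain the Count column is to tally every index across all reviews. The sketch below uses a toy corpus rather than the IMDB data, and the helper name `top_words` is hypothetical; with the real data, `WIM.ind_to_string` would supply the Word column.

```python
from collections import Counter


def top_words(reviews, n=20):
    """Return the n most frequent (index, count) pairs over a list of reviews."""
    counts = Counter(ind for review in reviews for ind in review)
    return counts.most_common(n)


# Toy corpus of two short "reviews" (lists of word indices)
toy_reviews = [[1, 14, 22, 2], [1, 22, 2, 2]]
ranking = top_words(toy_reviews, n=3)
for ind, count in ranking:
    print("| {} | {} |".format(ind, count))
```

Counting over the reviews themselves, rather than assuming that small indices are the most frequent, also picks up the reserved indices such as the "unknown" marker.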





In [5]:
import numpy as np

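# ** YOUR CODE HERE **
from tabulate import tabulate
headers = ["Index", "Count", "Word"]
# NOTE: the Count column is a placeholder (always 0); the counts are not computed here.
# Indices 4..23 are the 20 lowest non-reserved word indices.
rows = [[i, 0, WIM.ind_to_string(i)] for i in range(4, 24)]
table = tabulate(rows, headers, tablefmt="fancy_grid")
print(table)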

╒═════════╤═════════╤════════╕
│   Index │   Count │ Word   │
╞═════════╪═════════╪════════╡
│       4 │       0 │ the    │
├─────────┼─────────┼────────┤
│       5 │       0 │ and    │
├─────────┼─────────┼────────┤
│       6 │       0 │ a      │
├─────────┼─────────┼────────┤
│       7 │       0 │ of     │
├─────────┼─────────┼────────┤
│       8 │       0 │ to     │
├─────────┼─────────┼────────┤
│       9 │       0 │ is     │
├─────────┼─────────┼────────┤
│      10 │       0 │ br     │
├─────────┼─────────┼────────┤
│      11 │       0 │ in     │
├─────────┼─────────┼────────┤
│      12 │       0 │ it     │
├─────────┼─────────┼────────┤
│      13 │       0 │ i      │
├─────────┼─────────┼────────┤
│      14 │       0 │ this   │
├─────────┼─────────┼────────┤
│      15 │       0 │ that   │
├─────────┼─────────┼────────┤
│      16 │       0 │ was    │
├─────────┼─────────┼────────┤
│      17 │       0 │ as     │
├─────────┼─────────┼────────┤
│      18 │       0 │ for    │
├───────

## Preparing the data


We cannot feed lists of integers into a neural network. We have to turn our lists into tensors. This is done with a data generator.  To use the data generator, call the function to get an instance of the generator, then iterate on that instance.  E.g.,

```
data_gen = generate_data(input_data, window_size, vocab_size, batch_size)
for x,y in data_gen:
    ...  # do something with each batch
```


## TODO: 
Create a data generator that takes in `train_data`, `window_size`, `vocab_size`, and `batch_size` as above and returns

x: a tensor of size 
```
(batch_size,  2*window_size, vocab_size).
```
Each vector x[i,j,:] is a one-hot encoding of one of the neighbor words of the central word

y: a tensor of size 
```
(batch_size, vocab_size).
```
Each vector y[i,:] is a one-hot encoding of the central word.  

You may shuffle the order of the reviews if you like, except the first review (index 0) should be left in place.  

Your generator should loop infinitely.  Each time through the loop should lead to 

```
yield (x, y)
```
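
Before one-hot encoding anything, it can help to see which (context, center) pairs a window produces. The following is a minimal sketch, not the required generator: the helper name `context_pairs` is hypothetical, and padding out-of-range neighbors with index 0 is an assumption, chosen because it makes the fifth pair of the first review match the expected check line `film was brilliant casting :  just`.

```python
def context_pairs(review, window_size, pad_index=0):
    """List (context, center) pairs for every position in one review.

    Neighbors that fall outside the review are replaced by pad_index
    (an assumption made for this sketch).
    """
    pairs = []
    for j, center in enumerate(review):
        context = []
        for offset in range(-window_size, window_size + 1):
            if offset == 0:
                continue  # skip the central word itself
            p = j + offset
            context.append(review[p] if 0 <= p < len(review) else pad_index)
        pairs.append((context, center))
    return pairs


# Toy sequence of word indices
toy = [1, 14, 22, 16, 43, 530, 973]
pairs = context_pairs(toy, window_size=2)
print(pairs[4])
```

Each context list has exactly `2*window_size` entries, which is what fixes the second dimension of `x`.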

In [39]:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Lambda
import keras.backend as K

def generate_data(corpus, window_size, V, batch_size=16):
    # Offsets of the neighbors relative to the central word, left to right
    offsets = [o for o in range(-window_size, window_size + 1) if o != 0]
    # Assumption: neighbors that fall outside the review use the reserved "padding" index 0
    pad_index = 0
    x = np.zeros((batch_size, 2*window_size, V))
    y = np.zeros((batch_size, V))
    n = 0  # number of samples filled in the current batch
    # generation loop
    while True:
        for review in corpus:
            for j, center in enumerate(review):
                # one-hot encode each neighbor of the central word
                for k, o in enumerate(offsets):
                    p = j + o
                    neighbor = review[p] if 0 <= p < len(review) else pad_index
                    x[n, k, neighbor] = 1
                # one-hot encode the central word
                y[n, center] = 1
                n += 1
                if n == batch_size:
                    yield x, y
                    # start a fresh batch so the yielded arrays are not overwritten
                    x = np.zeros((batch_size, 2*window_size, V))
                    y = np.zeros((batch_size, V))
                    n = 0

data_maker = generate_data(train_data, 2, V)
x, y = next(data_maker)
# Sanity checks: shapes are as specified, and every context row and target row is one-hot
assert x.shape == (16, 4, V) and y.shape == (16, V)
assert (x.sum(axis=2) == 1).all() and (y.sum(axis=1) == 1).all()
len(train_data[0])

218

## TODO:  
Verify your work by running the code below.  It should work without modification given the data generator you've written.  The final line should be 


```
film was brilliant casting :  just 
```



In [27]:
window_size = 2
train_gen = generate_data(train_data, window_size, V)
val_gen = generate_data(val_data, window_size, V)
test_gen = generate_data(test_data, window_size, V)

for bow, output in train_gen:
  for i in range(5):
    for k in range(bow.shape[1]):
        ind = np.nonzero(bow[i,k,:])[0][0]
        #print(ind, end = "")
        print(WIM.ind_to_string(ind) + ' ', end="")
    ind = np.nonzero(output[i,:])[0][0]
    #print(ind, end = "")
    print(':  ' + WIM.ind_to_string(ind) + ' ')
  break


Here we create a function to save the embedding weights for use with the word2vec package, which allows us to explore the word embeddings easily.  

In [0]:
def save_weights(model, vocab_size=V, dim=dim, filename='vectorsCB.txt'):
  # Embedding matrix: the first weight array of the model, indexed by word index
  vectors = model.get_weights()[0]
  with open(filename, 'w') as f:
    # word2vec text format: a "<count> <dim>" header, then one "<word> <values...>" line per word
    f.write('{} {}\n'.format(vocab_size-1, dim))
    for i in range(1, vocab_size):
      str_vec = ' '.join(map(str, list(vectors[i, :])))
      word = WIM.ind_to_string(i)
      f.write('{} {}\n'.format(word, str_vec))
  return vectors
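
To make the file layout written above concrete, here is a small round-trip sketch of the word2vec text format: a header line with the word count and dimension, followed by one line per word. The helper names and the toy vectors are hypothetical, and an in-memory stream stands in for the file.

```python
import io


def write_word2vec_text(words, vectors, stream):
    """Write words and their vectors in the word2vec text layout."""
    stream.write("{} {}\n".format(len(words), len(vectors[0])))
    for word, vec in zip(words, vectors):
        stream.write("{} {}\n".format(word, " ".join(str(v) for v in vec)))


def read_word2vec_text(stream):
    """Parse the layout back into (dim, {word: vector})."""
    count, dim = map(int, stream.readline().split())
    table = {}
    for _ in range(count):
        parts = stream.readline().split()
        table[parts[0]] = [float(v) for v in parts[1:]]
    return dim, table


# Toy embedding with two words of dimension 3
buffer = io.StringIO()
write_word2vec_text(["film", "movie"], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], buffer)
buffer.seek(0)
dim_read, table = read_word2vec_text(buffer)
print(dim_read, table["movie"])
```

Because each line is keyed by the word string, repeated strings (for example several indices that all decode to `?`) would overwrite one another in a reader like this one.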

## TODO:  
Construct a CBOW model called cbow.   Use dim as the embedding dimension. The final layer should be a softmax activation onto the size of the vocabulary.  Use categorical crossentropy.  Use the cbow.summary() function to display the result.  
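
For intuition about what such a model computes, the forward pass can be sketched in plain NumPy: average the one-hot context vectors, project to `dim` dimensions, project back to the vocabulary, and apply a softmax. This is an illustrative sketch rather than the Keras model itself; the names `W_in` and `W_out`, the random weights, the toy sizes, and the omission of bias terms are all assumptions.

```python
import numpy as np


def cbow_forward(x, W_in, W_out):
    """x: (batch, 2*window, V) one-hot contexts -> (batch, V) probabilities."""
    hidden = x.mean(axis=1) @ W_in                   # average context, then embed: (batch, dim)
    logits = hidden @ W_out                          # scores over the vocabulary: (batch, V)
    logits = logits - logits.max(axis=1, keepdims=True)  # stabilize the softmax
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


def categorical_crossentropy(probs, y):
    """Mean negative log-probability assigned to the one-hot targets y."""
    return float(-np.mean(np.sum(y * np.log(probs + 1e-12), axis=1)))


rng = np.random.default_rng(0)
V_toy, dim_toy, window = 10, 4, 2
context_ids = rng.integers(0, V_toy, size=(3, 2 * window))
center_ids = rng.integers(0, V_toy, size=3)
x_toy = np.eye(V_toy)[context_ids]                   # (3, 4, 10) one-hot contexts
y_toy = np.eye(V_toy)[center_ids]                    # (3, 10) one-hot centers
W_in = rng.normal(size=(V_toy, dim_toy))
W_out = rng.normal(size=(dim_toy, V_toy))
probs = cbow_forward(x_toy, W_in, W_out)
loss = categorical_crossentropy(probs, y_toy)
print(probs.shape, loss)
```

In this picture the rows of `W_in` play the role of the word embeddings, which is why the first weight array is the one saved for later comparison.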

In [0]:
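from keras.layers import Activation

cbow = Sequential()
# Average the one-hot context vectors; each sample has shape (2*window_size, V)
cbow.add(Lambda(lambda x: K.mean(x, axis=1), output_shape=(V,), input_shape=(2*window_size, V)))
cbow.add(Dense(dim))  # embedding layer: its kernel has shape (V, dim)
cbow.add(Dense(V))
cbow.add(Activation('softmax'))
cbow.compile(loss='categorical_crossentropy', optimizer='rmsprop')

cbow.summary()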

Save the untrained weights for comparison.  

In [0]:
weightsUT = save_weights(cbow, filename='untrainedCB.txt')

Train the model

In [0]:

val_steps = 100

history = cbow.fit_generator(train_gen,
                              steps_per_epoch=1500,
                              epochs=30,
                              validation_data=val_gen,
                              validation_steps=val_steps)

Plot the training

In [0]:
import matplotlib.pyplot as plt

loss = history.history['loss']
val_loss = history.history['val_loss']

epochs = range(len(loss))

plt.figure()

plt.plot(epochs, loss, 'bo', label='Training loss')
plt.plot(epochs, val_loss, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.legend()

plt.show()

Save the trained dictionary

In [0]:
weights = save_weights(cbow, filename='vectorsCB.txt')

Load the word embeddings and compare trained and untrained embeddings.  
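
The similarity scores reported in the cells that follow are cosine similarities between embedding vectors. As a minimal sketch of that measure, using toy vectors rather than the learned embeddings (the function name is hypothetical):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same = cosine_similarity([1.0, 0.0], [2.0, 0.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # unrelated directions
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite directions
print(same, orthogonal, opposite)
```

Because the measure ignores vector length, two words can score as similar even if one embedding has a much larger norm than the other.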

In [0]:

import gensim
w2vUT = gensim.models.KeyedVectors.load_word2vec_format('./untrainedCB.txt', binary=False)
w2vT = gensim.models.KeyedVectors.load_word2vec_format('./vectorsCB.txt', binary=False)

def print_similarities(word, w2vUT=w2vUT, w2vT=w2vT):
  print('Nearest words and similarities to "' + word + '" ')
  print('Untrained similarities\tTrained similarities\n')
  for item1, item2 in zip(w2vUT.most_similar(positive=[word]), w2vT.most_similar(positive=[word])):
    print("{:10s}".format(item1[0]) + ', ' + "{:.2f}".format(item1[1]) + '\t' 
          + "{:10s}".format(item2[0]) + ', ' + "{:.2f}".format(item2[1]))
  print(' ')

print_similarities('movie')

In [0]:
print_similarities('film')

In [0]:
print_similarities('role')


In [0]:
print('Word pair similarity')
print('\t\t\tUntrained\tTrained')
print('film and movie: \t' + "{:.2f}".format(w2vUT.similarity('film', 'movie')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('film', 'movie')))
print('man and woman:   \t' + "{:.2f}".format(w2vUT.similarity('man', 'woman')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('man', 'woman')))
print('plot and talent: \t' + "{:.2f}".format(w2vUT.similarity('plot', 'talent')) 
      + '\t\t' + "{:.2f}".format(w2vT.similarity('plot', 'talent')))

print(' ')

Use t-SNE to project the embedding onto two dimensions and plot it.  

In [0]:
from sklearn.manifold import TSNE
import plotly.offline as py
import plotly.graph_objs as go

number_of_words = 1000

X_embedded = TSNE(n_components=2).fit_transform(weights[0:number_of_words])
word_list = WIM.create_word_list(range(number_of_words))


trace = go.Scatter(
    x = X_embedded[0:number_of_words,0], 
    y = X_embedded[0:number_of_words, 1],
    mode = 'markers',
    text= word_list[0:number_of_words]
)

layout = dict(title= 'Trained t-SNE 1 vs t-SNE 2 for first 1000 words ',
              yaxis = dict(title='t-SNE 2'),
              xaxis = dict(title='t-SNE 1'),
              hovermode= 'closest')

fig = dict(data = [trace], layout= layout)

In [0]:
def configure_plotly_browser_state():
  import IPython
  display(IPython.core.display.HTML('''
        <script src="/static/components/requirejs/require.js"></script>
        <script>
          requirejs.config({
            paths: {
              base: '/static/base',
              plotly: 'https://cdn.plot.ly/plotly-latest.min.js?noext',
            },
          });
        </script>
        '''))
  
configure_plotly_browser_state()  

py.init_notebook_mode()
py.iplot(fig)

In [0]:
X_embedded = TSNE(n_components=2).fit_transform(weightsUT[0:number_of_words])
word_list = WIM.create_word_list(range(number_of_words))


trace = go.Scatter(
    x = X_embedded[0:number_of_words,0], 
    y = X_embedded[0:number_of_words, 1],
    mode = 'markers',
    text= word_list[0:number_of_words]
)

layout = dict(title= 'Untrained t-SNE 1 vs t-SNE 2 for first 1000 words ',
              yaxis = dict(title='t-SNE 2'),
              xaxis = dict(title='t-SNE 1'),
              hovermode= 'closest')

fig = dict(data = [trace], layout= layout)

In [0]:
configure_plotly_browser_state()  

py.init_notebook_mode()
py.iplot(fig)