## Example 1:
Counting words in a document
- Word co-occurrence implementation with Alice in Wonderland
- Word similarity with cosine similarity

Some example plots:
- Ch15, Fig. 15.3 for small corpus


In [1]:
# __future__ imports must be the first statement when this runs as a plain script
from __future__ import division
import numpy as np
np.random.seed(13)
import keras.backend as K
from keras.models import Sequential
from keras.layers import Dense, Embedding, Reshape, Lambda
from IPython.display import SVG
from keras.utils import np_utils
from keras.utils.data_utils import get_file
from keras.preprocessing.text import Tokenizer
from keras.utils.vis_utils import model_to_dot
from keras.preprocessing import sequence
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.neighbors import NearestNeighbors as nn
from itertools import islice
from matplotlib import pylab

Using TensorFlow backend.


In [2]:
# DO NOT Modify the lines in this cell
path = 'alice.txt'
corpus = open(path).readlines()[0:700]

corpus = [sentence for sentence in corpus if sentence.count(" ") >= 2]
tokenizer = Tokenizer(filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'+"'")
tokenizer.fit_on_texts(corpus)
print(corpus)
corpus = tokenizer.texts_to_sequences(corpus)
nb_samples = sum(len(s) for s in corpus)
V = len(tokenizer.word_index) + 1

# Is this something they need to change?
dim = 100
window_size = 2
window_size_corpus = 4

['CHAPTER I. Down the Rabbit-Hole\n', 'Alice was beginning to get very tired of sitting by her sister on the\n', 'bank, and of having nothing to do: once or twice she had peeped into the\n', 'book her sister was reading, but it had no pictures or conversations in\n', "it, 'and what is the use of a book,' thought Alice 'without pictures or\n", 'So she was considering in her own mind (as well as she could, for the\n', 'hot day made her feel very sleepy and stupid), whether the pleasure\n', 'of making a daisy-chain would be worth the trouble of getting up and\n', 'picking the daisies, when suddenly a White Rabbit with pink eyes ran\n', 'close by her.\n', 'There was nothing so VERY remarkable in that; nor did Alice think it so\n', "VERY much out of the way to hear the Rabbit say to itself, 'Oh dear!\n", "Oh dear! I shall be late!' (when she thought it over afterwards, it\n", 'occurred to her that she ought to have wondered at this, but at the time\n', 'it all seemed quite natural); but whe

###### Word co-occurrence matrix for _The dog chased the cat away from the garden_

|        | The | dog | chased | cat | away | from | garden |
|--------|-----|-----|--------|-----|------|------|--------|
| The    | 0   | 2   | 2      | 3   | 2    | 2    | 1      |
| dog    | 2   | 0   | 1      | 1   | 1    | 0    | 0      |
| chased | 2   | 1   | 0      | 1   | 1    | 1    | 0      |
| cat    | 3   | 1   | 1      | 0   | 1    | 1    | 1      |
| away   | 2   | 1   | 1      | 1   | 0    | 1    | 1      |
| from   | 2   | 0   | 1      | 1   | 1    | 0    | 1      |
| garden | 1   | 0   | 0      | 1   | 1    | 1    | 0      |

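This is an example of a word co-occurrence matrix for a single sentence. The counts are consistent with pairing each word with the words up to four positions away on either side and ignoring pairs of identical words. Create a word co-occurrence matrix for Alice in Wonderland.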
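The counting rule behind the single-sentence table can be sketched as follows. This is a minimal illustration, not the notebook's own implementation: the function name `cooccurrence_matrix`, the lowercasing of the sentence, and the window of four positions are assumptions chosen so that the result matches the table.

```python
import numpy as np

def cooccurrence_matrix(tokens, window=4):
    """Count how often two different words appear within `window` positions."""
    # Vocabulary in order of first appearance, so rows line up with the table
    vocab = []
    for token in tokens:
        if token not in vocab:
            vocab.append(token)
    index = {word: i for i, word in enumerate(vocab)}

    matrix = np.zeros((len(vocab), len(vocab)), dtype=np.int32)
    for i, word1 in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            word2 = tokens[j]
            # Skip the word itself and any repeated occurrence of the same word
            if word1 != word2:
                matrix[index[word1], index[word2]] += 1
    return vocab, matrix

tokens = "the dog chased the cat away from the garden".split()
vocab, matrix = cooccurrence_matrix(tokens)
print(vocab)
print(matrix)
```

Each row of `matrix` is the context vector of one word, and the matrix is symmetric because every pair is counted once from each side.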

In [7]:
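#co-occurrence matrix
import sys
# np.nan is rejected as a threshold by newer NumPy versions; sys.maxsize prints arrays in full
np.set_printoptions(threshold=sys.maxsize)
# NOTE: int8 wraps around above 127; a wider dtype such as np.int32 is safer for frequent pairs
coMatrix = np.zeros((V+1, V+1), np.int8)

def processSentence(sentence):
    # For every word, count the words up to two positions to its left and right
    for w1 in range(0, len(sentence)):
        word1 = sentence[w1]
        window = [w1-2, w1-1, w1+1, w1+2]
        for w2 in window:
            if w2 >= 0 and w2 < len(sentence):
                word2 = sentence[w2]
                # Skip pairs of identical words
                if word1 != word2:
                    coMatrix[word1][word2] += 1

for sentence in corpus:
    processSentence(sentence)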

In [8]:
#Compute similarity between the words Alice, Rabbit and Dinah
dictionary = tokenizer.word_index
alice = dictionary['alice']
rabbit = dictionary['rabbit']
dinah = dictionary['dinah']

print(cosine_similarity(coMatrix[alice].reshape(1, -1), coMatrix[rabbit].reshape(1, -1)))
print(cosine_similarity(coMatrix[alice].reshape(1, -1), coMatrix[dinah].reshape(1, -1)))
print(cosine_similarity(coMatrix[dinah].reshape(1, -1), coMatrix[rabbit].reshape(1, -1)))

[[0.26397443]]
[[0.25959354]]
[[0.16871245]]
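The scores above are cosine similarities: the dot product of two count vectors divided by the product of their lengths. A minimal NumPy sketch of that formula follows, using the "dog" and "chased" rows of the single-sentence table as input. The helper name `cosine` is an assumption for illustration and is not part of scikit-learn.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    # Convert to float64 first so small integer dtypes cannot overflow
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Rows for "dog" and "chased" from the single-sentence table
dog = [2, 0, 1, 1, 1, 0, 0]
chased = [2, 1, 0, 1, 1, 1, 0]

score = cosine(dog, chased)   # 6 / sqrt(7 * 8)
print(round(score, 4))        # 0.8018
```

A score of 1 means the two vectors point in the same direction, and 0 means the two words share no context words at all.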


In [9]:
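#Retrieve the five most similar words to Alice with nearest neighbors
# (six neighbors are requested because the closest one is 'alice' itself)

neighbors = nn(n_neighbors=6)
neighbors.fit(coMatrix)
# kneighbors expects a 2D array, so reshape the single row
similars = (neighbors.kneighbors(coMatrix[alice].reshape(1, -1))[1])[0]
#translate ids back into words:
index_to_word = {index: word for word, index in dictionary.items()}
for word in similars:
    print(index_to_word[word])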

alice
herself
very
that
do
her
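Note that `NearestNeighbors` uses Minkowski distance with `p=2` (Euclidean distance) by default, so the list above is not ranked by the cosine similarity used in the previous cell. As a hedged alternative, the sketch below ranks rows by cosine similarity with plain NumPy. The function name `most_similar_rows` and the toy matrix are hypothetical and only illustrate the idea.

```python
import numpy as np

def most_similar_rows(matrix, row_id, k=5):
    """Return the ids of the k rows with the highest cosine similarity to row_id."""
    m = np.asarray(matrix, dtype=np.float64)
    norms = np.linalg.norm(m, axis=1)
    norms[norms == 0] = 1.0          # leave all-zero rows as zero vectors
    unit = m / norms[:, None]        # scale every row to unit length
    sims = unit @ unit[row_id]       # cosine similarity to the query row
    sims[row_id] = -np.inf           # exclude the query row itself
    return np.argsort(-sims)[:k].tolist()

# Hypothetical toy counts: row 1 points in the same direction as row 0
toy = np.array([
    [0, 2, 2, 3],
    [0, 4, 4, 6],
    [5, 0, 0, 1],
    [0, 0, 0, 0],
])
print(most_similar_rows(toy, 0, k=2))   # [1, 2]
```

With the notebook's variables, `most_similar_rows(coMatrix, alice)` would return word ids that can be translated back through `tokenizer.word_index`. Passing `metric='cosine'` to `NearestNeighbors` is another option.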




## Example 2:
Word embedding (dense) comparisons
- Load the pre-trained word embeddings of word2vec
- See whether the differences between the following word pairs are similar:
    - _A king is to a queen as a man is to a woman_
    - _A cat is to a kitten as a dog is to a puppy_
    - _Cats are to a cat as dogs are to a dog_
- Compare the following synonyms and antonyms:
    - Unhappy and happy
    - Happy and cheerful
    - Unhappy and cheerful
    - Synonym and equivalence
    - Synonym and antonym
    

Download word2vec here: https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/edit


In [10]:
#load word2vec
path = '/Users/daphne/Desktop/GoogleNews-vectors-negative300.bin'
word2vec = KeyedVectors.load_word2vec_format(path, binary=True)

In [11]:
#perform gensim tasks

In [12]:
#A king is to a queen as a man is to a woman
print('A king is to a queen', cosine_similarity(word2vec['king'].reshape(1,-1), word2vec['queen'].reshape(1, -1)))
print('as a man is to a woman', cosine_similarity(word2vec['man'].reshape(1,-1), word2vec['woman'].reshape(1,-1)), '\n')

#A cat is to a kitten as a dog is to a puppy
print('A cat is to a kitten', cosine_similarity(word2vec['cat'].reshape(1,-1), word2vec['kitten'].reshape(1,-1)))
print('as a dog is to a puppy', cosine_similarity(word2vec['dog'].reshape(1,-1), word2vec['puppy'].reshape(1,-1)), '\n')

#Cats are to a cat as dogs are to a dog
print('Cats are to a cat', cosine_similarity(word2vec['cats'].reshape(1,-1), word2vec['cat'].reshape(1,-1)))
print('as dogs are to a dog', cosine_similarity(word2vec['dogs'].reshape(1,-1), word2vec['dog'].reshape(1,-1)))


A king is to a queen [[0.6510958]]
as a man is to a woman [[0.7664013]] 

A cat is to a kitten [[0.7464985]]
as a dog is to a puppy [[0.81064284]] 

Cats are to a cat [[0.8099379]]
as dogs are to a dog [[0.8680489]]
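The cell above measures how similar the two words inside each pair are. To check whether the differences between the pairs are similar, one option is to compare the offset vectors directly, for example `king - queen` against `man - woman`. The sketch below is an assumption about how that could be done, not the notebook's method: `analogy_score` is a hypothetical helper, and the three-dimensional vectors are made up so the example runs without the word2vec file.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_score(vectors, a, b, c, d):
    """Cosine similarity between the offsets (a - b) and (c - d)."""
    return cosine(vectors[a] - vectors[b], vectors[c] - vectors[d])

# Made-up toy vectors, used only to show the calculation
toy_vectors = {
    'king':  np.array([1.0, 1.0, 0.0]),
    'queen': np.array([1.0, 0.0, 1.0]),
    'man':   np.array([0.0, 1.0, 0.0]),
    'woman': np.array([0.0, 0.0, 1.0]),
}

score = analogy_score(toy_vectors, 'king', 'queen', 'man', 'woman')
print(score)
```

Because `KeyedVectors` supports lookup by word, the same helper should also accept the loaded model, as in `analogy_score(word2vec, 'king', 'queen', 'man', 'woman')`. A score close to 1 indicates that the two pairs differ in nearly the same direction.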


In [13]:
# Unhappy and happy
print(cosine_similarity(word2vec['unhappy'].reshape(1,-1), word2vec['happy'].reshape(1,-1)))

# Happy and cheerful
print(cosine_similarity(word2vec['happy'].reshape(1,-1), word2vec['cheerful'].reshape(1,-1)))

# Unhappy and cheerful
print(cosine_similarity(word2vec['unhappy'].reshape(1,-1), word2vec['cheerful'].reshape(1,-1)))

# Synonym and equivalence
print(cosine_similarity(word2vec['synonym'].reshape(1,-1), word2vec['equivalence'].reshape(1,-1)))

# Synonym and antonym
print(cosine_similarity(word2vec['synonym'].reshape(1,-1), word2vec['antonym'].reshape(1,-1)))

[[0.6128039]]
[[0.38377392]]
[[0.2091079]]
[[0.16996254]]
[[0.5681054]]
