# Enterprise Deep Learning with TensorFlow: openSAP

## SAP Innovation Center Network

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

## Overview

This notebook loads pre-trained word vectors and walks through a few example applications. It is meant to show how word vectors trained with Word2Vec capture similarity between words.
The vectors can be downloaded from https://code.google.com/archive/p/word2vec/

## Load the necessary modules

In [1]:
import gensim
# gensim provides many NLP tools. Its Cython-optimized routines give a speed-up when available.

## Load model

In [2]:
# create data folder, if it doesn't exist
import os
if not os.path.exists("data/"):
    os.makedirs("data/")

# Download the Google word vectors file 'GoogleNews-vectors-negative300.bin.gz' from the URL below.
# This download has to be done manually in your browser.
# NOTE: the file is about 1.5 GB.
# https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM/view

# Put the file into the data/ directory.

# For more information, refer to the Google word2vec website:
# https://code.google.com/archive/p/word2vec/


In [3]:
# load word vectors
w2v = gensim.models.KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
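Because the file is large, loading every vector can take a while and a lot of memory. The following is a minimal sketch of a loader that checks for the file first and optionally reads only the first `limit` entries. The helper name `load_vectors` is hypothetical; `limit` is an existing parameter of `load_word2vec_format`.

```python
import os


def load_vectors(path, limit=None):
    """Load word2vec-format binary vectors, optionally only the first `limit` entries.

    `load_vectors` is a hypothetical helper written for this notebook.
    """
    if not os.path.exists(path):
        # The file has to be downloaded manually, so fail with a clear hint.
        raise FileNotFoundError(
            f"Word vectors not found at {path}; download them manually first."
        )
    # Imported lazily so the path check runs before gensim is needed.
    from gensim.models import KeyedVectors

    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)
```

For example, `load_vectors('./data/GoogleNews-vectors-negative300.bin.gz', limit=500000)` would read only the first 500,000 entries of the file. Words outside that subset would then raise a `KeyError` on lookup.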

## Exploring word embeddings

First, let's look at the dimensions of our word embeddings:

In [4]:
dog = w2v['dog']
dog.shape

(300,)

In [None]:
dog[:10]

array([ 0.05126953, -0.02233887, -0.17285156,  0.16113281, -0.08447266,
        0.05737305,  0.05859375, -0.08251953, -0.01538086, -0.06347656], dtype=float32)

### Word similarity

Let's get started.
Tell me, Word2Vec, which words that you know are most similar to 'banana'?

In [None]:
w2v.most_similar('banana', topn=5)

We do the same for 'chocolate':

In [None]:
w2v.most_similar('chocolate', topn=5)
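`most_similar` ranks words by cosine similarity between their vectors. The sketch below reproduces that ranking in plain Python on a handful of toy vectors. The three-dimensional vectors are invented for illustration and are not real embeddings, and the helper names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors, made up for this example (real Word2Vec vectors have 300 dimensions).
toy_vectors = {
    "banana": [0.9, 0.1, 0.0],
    "mango": [0.8, 0.2, 0.1],
    "car": [0.0, 0.1, 0.9],
}


def toy_most_similar(word, vectors, topn=5):
    """Rank all other words by cosine similarity to `word`, highest first."""
    scores = [
        (other, cosine_similarity(vectors[word], vec))
        for other, vec in vectors.items()
        if other != word  # the query word itself is skipped
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# 'mango' ranks above 'car' because its vector points in nearly the same direction.
print(toy_most_similar("banana", toy_vectors, topn=2))
```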

### Word analogies

Let's check whether our vector displacements also capture the concepts we discussed in the unit.  
Read the analogy like this:  
'Man' is to 'woman' as 'king' is to ...?

#### v(woman) - v(man) + v(king)

In [None]:
w2v.most_similar(positive=['woman', 'king'], negative=['man'], topn=1)
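The arithmetic behind such a query can be sketched in plain Python: add the unit-length `positive` vectors, subtract the unit-length `negative` ones, and return the remaining word whose vector is closest to the result by cosine similarity. The two-dimensional toy vectors and the helper names below are invented for illustration; this is a simplified sketch, not gensim's actual implementation.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of the unit vectors."""
    return sum(a * b for a, b in zip(unit(u), unit(v)))


# Toy vectors, made up for this example: axis 0 ~ "royalty", axis 1 ~ "gender".
toy_vectors = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.2, 1.0],
    "woman": [0.2, -1.0],
    "apple": [-1.0, 0.1],
}


def toy_analogy(positive, negative, vectors, topn=1):
    """Rank words by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    target = [0.0] * dim
    for word in positive:
        target = [t + x for t, x in zip(target, unit(vectors[word]))]
    for word in negative:
        target = [t - x for t, x in zip(target, unit(vectors[word]))]
    query_words = set(positive) | set(negative)
    scores = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in query_words  # query words are excluded from the results
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# v(woman) - v(man) + v(king) lands closest to 'queen' in this toy space.
print(toy_analogy(positive=["woman", "king"], negative=["man"], vectors=toy_vectors))
```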

#### v(woman) - v(girl) + v(boy)

In [None]:
w2v.most_similar(positive=['woman', 'boy'], negative=['girl'], topn=1)

#### v(puppy) - v(dog) + v(cat)

In [None]:
w2v.most_similar(positive=['puppy', 'cat'], negative=['dog'], topn=1)

#### v(pond) - v(lake) + v(small)

In [None]:
w2v.most_similar(positive=['pond', 'small'], negative=['lake'], topn=1)

#### Other interesting results:

In [None]:
w2v.most_similar(positive=['chinese', 'river'], topn=5)

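### Visualize in TensorBoard

To explore the embeddings interactively, start TensorBoard and point it at the directory that holds the embedding data (`EMBEDDING_DIR` is a placeholder for that directory):

tensorboard --logdir EMBEDDING_DIR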
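The embedding data has to be written to disk before it can be visualized. One possible preparation step, sketched below, writes a subset of vectors as tab-separated files: one line of numbers per word, plus a metadata file with the matching labels. This format can be loaded by the standalone Embedding Projector. The helper name `export_for_projector` and the file names `vectors.tsv` and `metadata.tsv` are assumptions made for this example, and any additional projector configuration that TensorBoard needs inside `EMBEDDING_DIR` is not covered here.

```python
import os


def export_for_projector(vectors, out_dir):
    """Write word vectors and their labels as two aligned TSV files.

    `vectors` maps each word to a sequence of floats. The file names are
    hypothetical choices for this sketch.
    """
    os.makedirs(out_dir, exist_ok=True)
    vec_path = os.path.join(out_dir, "vectors.tsv")
    meta_path = os.path.join(out_dir, "metadata.tsv")
    with open(vec_path, "w", encoding="utf-8") as vec_file, \
            open(meta_path, "w", encoding="utf-8") as meta_file:
        for word, vec in vectors.items():
            # One tab-separated row of numbers per word ...
            vec_file.write("\t".join(str(float(x)) for x in vec) + "\n")
            # ... and the word itself on the same line number of the metadata file.
            meta_file.write(word + "\n")
    return vec_path, meta_path
```

With the model loaded, a small dictionary such as `{word: w2v[word] for word in ['dog', 'cat', 'banana']}` could be passed as `vectors`.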