## Installation

1. If you haven't already installed Python and Jupyter Notebook:
    1. Get Python 3 from [Python.org](https://www.python.org/downloads/). **TensorFlow does not yet work with Python 3.7, so you _must_ get Python 3.6.** See https://github.com/tensorflow/tensorflow/issues/20517 for updates on 3.7 support.
    1. In Terminal, run `python3 -m pip install jupyter`
    1. In Terminal, `cd` to the folder where you downloaded this file and run `jupyter notebook`. This should open a page in your web browser listing the files in the current directory, so that you can open this file. Leave this Terminal window running and use a different one for the rest of the instructions.
1. Install the Gensim word2vec Python implementation: `pip3 install --upgrade gensim`
1. Get the trained model (`1billion_word_vectors.zip`) from me via AirDrop or flash drive and put it in the same folder as the ipynb file, that is, the folder in which you are running the `jupyter notebook` command.
1. Unzip the trained model file. You should now have three files in the folder (if unzipping created a new folder, move these files out of it and into the same folder as the ipynb file):
    * 1billion_word_vectors
    * 1billion_word_vectors.syn1neg.npy
    * 1billion_word_vectors.wv.syn0.npy
1. If you didn't install keras last time, install it now:
    1. Install the tensorflow machine learning library by typing the following into Terminal:
    `pip3 install --upgrade tensorflow`
    1. Install the keras machine learning library by typing the following into Terminal:
    `pip3 install keras`
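
Before opening the notebook, it can help to confirm that the unzip step left the three model files in the right place. The following is a minimal sketch, not part of the original instructions: the helper name `missing_model_files` is made up for illustration, and the file names are copied from the list above.

```python
from pathlib import Path

# File names as listed in the unzip step above.
EXPECTED_FILES = [
    "1billion_word_vectors",
    "1billion_word_vectors.syn1neg.npy",
    "1billion_word_vectors.wv.syn0.npy",
]


def missing_model_files(folder="."):
    """Return the expected model files that are not present in `folder`."""
    folder = Path(folder)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


# Run from the folder that holds the ipynb file; an empty list means all is well.
print(missing_model_files("."))
```

If the list is not empty, the named files are probably still inside a subfolder created by the unzip tool.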


## Extra Details -- Do Not Do This
This took a while, which is why I'm giving you the trained file rather than having you do this. But just in case you're curious, here is how to create the trained model file.
1. Download the corpus of sentences from [http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz](http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz)
1. Unzip and unarchive the file: `tar zxf 1-billion-word-language-modeling-benchmark-r13output.tar.gz` 
1. Run the following Python code:
    ```
    from gensim.models import word2vec

    # Folder of tokenized, shuffled training files extracted from the tarball
    corpus_dir = '1-billion-word-language-modeling-benchmark-r13output/training-monolingual.tokenized.shuffled'
    # Stream sentences from every file in the folder, one sentence per line
    sentences = word2vec.PathLineSentences(corpus_dir)
    model = word2vec.Word2Vec(sentences)  # just use all of the default settings for now
    model.save('1billion_word_vectors')
    ```

## Documentation/Sources
* [https://radimrehurek.com/gensim/models/word2vec.html](https://radimrehurek.com/gensim/models/word2vec.html) for more information about how to use gensim word2vec in general
* [https://codekansas.github.io/blog/2016/gensim.html](https://codekansas.github.io/blog/2016/gensim.html) for information about using it to create embedding layers for neural networks.
* [https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/](https://machinelearningmastery.com/sequence-classification-lstm-recurrent-neural-networks-python-keras/) for information on sequence classification with keras
* [https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html](https://blog.keras.io/using-pre-trained-word-embeddings-in-a-keras-model.html) for using pre-trained embeddings with keras (though the syntax they use for the model layers is different than most other tutorials I've seen).
* [https://keras.io/](https://keras.io/) Keras API documentation

## Load the trained word vectors

In [7]:
from gensim.models import word2vec

Load the trained model file into memory

In [8]:
wv_model = word2vec.Word2Vec.load('1billion_word_vectors')

Since we do not need to continue training the model, we can save memory by keeping the parts we need (the word vectors themselves) and getting rid of the rest of the model.

In [9]:
wordvec = wv_model.wv
del wv_model

## Exploration of word vectors
Now we can look at some of the relationships between different words.

Like [the gensim documentation](https://radimrehurek.com/gensim/models/word2vec.html), let's start with a famous example: king + woman - man

In [4]:
wordvec.most_similar(positive=['king', 'woman'], negative=['man'])

[('queen', 0.8407387733459473),
 ('monarch', 0.7541723251342773),
 ('prince', 0.7350203394889832),
 ('princess', 0.696908175945282),
 ('empress', 0.677180290222168),
 ('sultan', 0.6649758815765381),
 ('Chakri', 0.6451102495193481),
 ('goddess', 0.6439394950866699),
 ('ruler', 0.6275453567504883),
 ('kings', 0.6273428201675415)]
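
To make the arithmetic behind this query more concrete, here is a rough pure-Python sketch of the idea: add the unit-length "positive" vectors, subtract the "negative" ones, and rank the remaining words by cosine similarity to the result. The 2-D `toy` vectors and the helper names are invented for illustration; the real model uses 100-dimensional vectors and gensim's own implementation may differ in its details.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def analogy(vectors, positive, negative, topn=3):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    dims = len(next(iter(vectors.values())))
    query = [0.0] * dims
    for word in positive:
        query = [q + x for q, x in zip(query, unit(vectors[word]))]
    for word in negative:
        query = [q - x for q, x in zip(query, unit(vectors[word]))]
    # The query words themselves are left out of the ranking.
    exclude = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in exclude]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors: axis 0 is roughly "royalty", axis 1 roughly "gender".
toy = {
    "king": [0.9, 0.8],
    "queen": [0.9, -0.8],
    "man": [0.1, 0.9],
    "woman": [0.1, -0.9],
    "apple": [-0.7, 0.05],
}
print(analogy(toy, positive=["king", "woman"], negative=["man"]))
```

With these toy vectors, `queen` ranks first, which mirrors the result from the trained model above.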

This next one does not work as well as I'd hoped, but it gets close. Maybe you can find a better example.

In [5]:
wordvec.most_similar(positive=['panda', 'eucalyptus'], negative=['bamboo'])

[('okapi', 0.7140712738037109),
 ('gibbon', 0.7034620046615601),
 ('koala', 0.697202742099762),
 ('cub', 0.6907659769058228),
 ('tortoise', 0.6886162757873535),
 ('beetle', 0.6859476566314697),
 ('salamander', 0.6855185031890869),
 ('psyllid', 0.6837549209594727),
 ('lynx', 0.6802860498428345),
 ('carnivore', 0.6794542670249939)]

In [30]:
imdb_map['thing']

NameError: name 'imdb_map' is not defined

Which one of these is not like the others?

In [None]:
wordvec.doesnt_match(['red', 'purple', 'laptop', 'turquoise', 'ruby'])
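
One way to think about this "odd one out" query: average the unit-length vectors of all the words, then pick the word whose vector points least in the direction of that average. The sketch below follows that idea with invented 2-D vectors; the function and the `toy` data are illustrative assumptions, not gensim's code.

```python
import math


def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def doesnt_match(vectors, words):
    """Return the word least similar to the mean of all the word vectors."""
    units = [unit(vectors[w]) for w in words]
    dims = len(units[0])
    mean = unit([sum(u[d] for u in units) / len(units) for d in range(dims)])
    # Dot product of unit vectors = cosine similarity to the mean direction.
    scores = [sum(x * y for x, y in zip(u, mean)) for u in units]
    return words[scores.index(min(scores))]


# Hypothetical toy vectors: the colours cluster together, "laptop" does not.
toy = {
    "red": [1.0, 0.1],
    "purple": [0.9, 0.2],
    "ruby": [0.95, 0.0],
    "laptop": [0.0, 1.0],
}
print(doesnt_match(toy, ["red", "purple", "laptop", "ruby"]))
```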

How far apart are different words?

In [None]:
wordvec.distances('laptop', ['computer', 'phone', 'rabbit'])

Let's see what one of these vectors actually looks like.

In [None]:
wordvec['textbook']

In [None]:
wordvec.most_similar(positive=['man', 'success'], negative=[])

In [None]:
wordvec.most_similar(positive=['Beijing', 'Japan'], negative=['China'])
wordvec.most_similar(positive=['Beijing'], negative=[])
wordvec.most_similar(positive=['Quebec', 'States'], negative=['Canada'])


In [None]:
wordvec.most_similar(positive=['tree', 'human'], negative=['seed'])

In [81]:
wordvec.distances('laptop', ['computer', 'phone', 'rabbit'])

array([0.205414  , 0.36557418, 0.6597437 ], dtype=float32)

In [80]:
wordvec.distances('shark', ['tuna', 'turtle', 'salmon'])

array([0.37990654, 0.22643244, 0.39046144], dtype=float32)
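
The numbers returned by `distances` are cosine distances, that is, 1 minus the cosine similarity, so smaller means closer. A small stand-alone sketch of that calculation, using made-up 2-D vectors in place of the trained ones:

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity: 0 = same direction, 1 = orthogonal, 2 = opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def distances(vectors, word, others):
    """Cosine distance from `word` to each word in `others`."""
    return [cosine_distance(vectors[word], vectors[other]) for other in others]


# Hypothetical toy vectors, chosen so that "computer" sits closest to "laptop".
toy = {
    "laptop": [1.0, 0.2],
    "computer": [0.9, 0.3],
    "phone": [0.6, 0.7],
    "rabbit": [-0.2, 1.0],
}
print(distances(toy, "laptop", ["computer", "phone", "rabbit"]))
```

Read this way, the output above says that `laptop` is much closer to `computer` than to `rabbit`, and that `shark` is closer to `turtle` than to either fish.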

In [None]:
wordvec['computer']

What other methods are available to us?

## Using the word vectors in an embedding layer of a Keras model

In [6]:
from keras.models import Sequential
import numpy

Using TensorFlow backend.


## Classification without using the pre-trained word vectors

Model definition. The embedding layer here learns the 100-dimensional word embeddings as part of training the overall classification model. That is usually what we want, unless we have a lot of untagged data that could be used to train word vectors but not a classification model.

Train the model. __This takes a while. You might not want to re-run it.__

## For any model that you try in these exercises, take notes about the performance you see and anything you notice about the differences between the models.

# Exercise Option 1: Use the word vectors in a full model
Using the knowledge about how the imdb dataset and the keras embedding layer represent words, as detailed above, define a model that uses the pre-trained word vectors from the imdb dataset rather than an embedding that keras learns as it goes along. You'll need to swap out the embedding layer and feed in different training data.

# Exercise Option 2:
Same as option 1, but try using the 1billion vector word embeddings instead of the imdb vectors.

# Exercise Option 3:
Try changing different hyperparameters of the not_pretrained model. Keep notes on how the performance changes.

# Exercise Option 4: From here:
Make a model for the reuters classification problem.

In [65]:
from keras.datasets import reuters

(r_x_train, raw_r_y_train), (r_x_test, raw_r_y_test) = reuters.load_data(path="reuters.npz",
                                                         num_words=None,
                                                         skip_top=0,
                                                         maxlen=500,
                                                         test_split=0.2,
                                                         seed=113,
                                                         start_char=1,
                                                         oov_char=2,
                                                         index_from=3)

Review help here: https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/

In [66]:
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Fit the encoder on the training labels only, then reuse it for the test
# labels so that both arrays share the same column order.
dataEnc = OneHotEncoder(sparse=False)
r_y_train = dataEnc.fit_transform(raw_r_y_train.reshape(-1, 1))
r_y_test = dataEnc.transform(raw_r_y_test.reshape(-1, 1))

# Spot check: print a few raw labels next to their one-hot rows.
#for i in range(0, 5):
#    print(raw_r_y_train[i])
#    print(r_y_train[i])
#yay, it one-hot encodes properly!
type(r_y_test[1][1])

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.


numpy.float64
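
For reference, here is what one-hot encoding does to a list of integer class labels, written out in plain Python. The helper names and the tiny label lists are invented for illustration. Encoding the test labels with the categories found in the training labels keeps the columns aligned between the two arrays.

```python
def fit_categories(labels):
    """Return the sorted distinct labels; their order defines the columns."""
    return sorted(set(labels))


def one_hot(labels, categories):
    """Turn each label into a row with a single 1.0 in that label's column."""
    column = {label: i for i, label in enumerate(categories)}
    rows = []
    for label in labels:
        row = [0.0] * len(categories)
        row[column[label]] = 1.0
        rows.append(row)
    return rows


# Hypothetical labels standing in for raw_r_y_train and raw_r_y_test.
train_labels = [3, 0, 4, 3]
test_labels = [4, 0]

categories = fit_categories(train_labels)  # column order: 0, 3, 4
print(one_hot(train_labels, categories))
print(one_hot(test_labels, categories))
```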

In [67]:
from keras.preprocessing import sequence
r_x_train_padded = sequence.pad_sequences(r_x_train, maxlen=500)
r_x_test_padded = sequence.pad_sequences(r_x_test, maxlen=500)
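
With its default settings, `pad_sequences` pads short sequences on the left with zeros and trims long sequences from the front, so every row ends up exactly `maxlen` long. A plain-Python sketch of that behaviour for a single sequence (the function name `pad_sequence` is made up here):

```python
def pad_sequence(seq, maxlen, value=0):
    """Left-pad `seq` with `value`, or keep only its last `maxlen` items."""
    seq = list(seq)
    if len(seq) > maxlen:
        # Too long: drop items from the front.
        seq = seq[len(seq) - maxlen:]
    # Too short: fill the front with the padding value.
    return [value] * (maxlen - len(seq)) + seq


print(pad_sequence([5, 6, 7], maxlen=5))
print(pad_sequence([1, 2, 3, 4, 5, 6], maxlen=4))
```

This is why index 0 is reserved for padding in the word maps used with these datasets.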

In [68]:
reuters_offset = 3
reuters_map = dict((index + reuters_offset, word) for (word, index) in reuters.get_word_index().items())
reuters_map[0] = 'PADDING'
reuters_map[1] = 'START'
reuters_map[2] = 'UNKNOWN'
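
A map like this can be used to turn an encoded article back into readable text, which is a handy sanity check on the offset. The sketch below uses a tiny invented word index in place of `reuters.get_word_index()`, and the helper names are assumptions for illustration.

```python
def build_index_to_word(word_index, offset=3):
    """Invert a word -> index map, shifting by `offset` for the special tokens."""
    index_to_word = {index + offset: word for word, index in word_index.items()}
    index_to_word[0] = 'PADDING'
    index_to_word[1] = 'START'
    index_to_word[2] = 'UNKNOWN'
    return index_to_word


def decode(encoded, index_to_word):
    """Translate a list of integers back into a space-separated string."""
    return ' '.join(index_to_word.get(i, 'UNKNOWN') for i in encoded)


# Hypothetical three-word vocabulary standing in for the real word index.
toy_word_index = {'the': 1, 'oil': 2, 'prices': 3}
toy_map = build_index_to_word(toy_word_index)
print(decode([1, 4, 5, 6, 2], toy_map))
```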

In [69]:
reuters_word_index = reuters.get_word_index(path="reuters_word_index.json")
from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Dense, Flatten

In [76]:
reuterbasic_model = Sequential()
first_embedding_layer = wordvec.get_keras_embedding()
first_embedding_layer.input_length = 500  # number of words input
first_embedding_layer.input_dim = 552402  # reuters data
first_embedding_layer.output_dim = 100  # number of dimensions of outputted wordvec
reuterbasic_model.add(first_embedding_layer)
# Five identical convolution layers
for _ in range(5):
    reuterbasic_model.add(Conv1D(filters=32, kernel_size=5, activation='relu'))
reuterbasic_model.add(Flatten())
reuterbasic_model.add(Dense(units=128, activation='relu'))
reuterbasic_model.add(Dense(units=46, activation='softmax'))  # categorical
reuterbasic_model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['categorical_accuracy'])
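
One thing to keep in mind when swapping in pre-trained vectors: an embedding layer looks up row `i` of its weight matrix for input integer `i`. The Reuters integers come from the Reuters word index (plus the offset), while the rows of the pre-trained vectors follow the gensim vocabulary, so the two numberings are not guaranteed to agree. One possible way to line them up is to build a matrix whose row order follows the dataset's indices. The sketch below shows the idea with toy stand-ins; `build_embedding_matrix` is a hypothetical helper and a plain dict plays the role of `wordvec`.

```python
def build_embedding_matrix(index_to_word, vectors, dim):
    """Row i holds the pre-trained vector for the word with dataset index i.

    Words without a pre-trained vector (including the special tokens)
    are left as all-zero rows.
    """
    size = max(index_to_word) + 1
    matrix = [[0.0] * dim for _ in range(size)]
    for index, word in index_to_word.items():
        if word in vectors:
            matrix[index] = list(vectors[word])
    return matrix


# Hypothetical stand-ins for reuters_map and the trained word vectors.
toy_index_to_word = {0: 'PADDING', 1: 'START', 2: 'UNKNOWN', 3: 'oil', 4: 'zzz'}
toy_vectors = {'oil': [0.1, 0.2], 'prices': [0.3, 0.4]}

matrix = build_embedding_matrix(toy_index_to_word, toy_vectors, dim=2)
for row in matrix:
    print(row)
```

The Keras blog post listed under Sources passes a matrix built along these lines to the embedding layer through its `weights` argument.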

In [77]:
reuterbasic_model.fit(r_x_train_padded, r_y_train, epochs=5, batch_size=64)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x12cafb748>

In [78]:
reuter_scoring = reuterbasic_model.evaluate(r_x_test_padded, r_y_test)
print('loss: {} accuracy: {}'.format(*reuter_scoring))

loss: 2.093535653293022 accuracy: 0.5408163265306123


add kfold next to see...
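
As a starting point for the k-fold idea, here is a minimal sketch of how the sample indices could be split into k folds, with each fold taking one turn as the held-out set. The helper `k_fold_indices` is made up for illustration. scikit-learn's `KFold` class provides the same kind of split, with optional shuffling.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


for fold, (train_idx, test_idx) in enumerate(k_fold_indices(10, 3)):
    print(fold, train_idx, test_idx)
```

For each fold, a fresh model would be trained on the rows at `train_idx` and scored on the rows at `test_idx`; averaging the scores gives a steadier estimate than a single train/test split.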