# Word2Vec in Action

Please, please, please remember that almost all the libraries we use in this course come with documentation. The quality of the documentation, in terms of readability, varies, but it is almost always available. (Readability is also a function of your own expertise: as you get better, you'll find that you can get what you need even if the prose isn't terribly clear.)

The version of **Word2Vec** we are using this week is part of [Gensim](https://radimrehurek.com/gensim/index.html), an open-source project started by the Czech scientist Radim Řehůřek and now maintained by a core group of contributors based around the world. [The documentation is pretty good](https://radimrehurek.com/gensim/models/word2vec.html).

<div class="alert alert-block alert-warning">
<b>One thing to note</b>: Gensim is licensed under the OSI-approved GNU LGPLv2.1 license. This means that it’s free for both personal and commercial use, but if you make any modifications to Gensim that you distribute to other people, you have to disclose the source code of those modifications. Some organizations may not be comfortable with this. If so, a commercial license is available. (Also, note to yourself that licenses matter.)</div>

In [None]:
# IMPORTS
import re
import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt

# My usual preferences
plt.rcParams['figure.dpi'] = 300
plt.rcParams["figure.figsize"] = (10,5)

In [None]:
# Many of these libraries come with their own test data / toy corpora.
from gensim.test.utils import common_texts
print(common_texts[0:3])

In [None]:
# Toy corpus
# Pretend we tokenized, lowered, and removed punctuation
sentences = [['i', 'like', 'apple', 'pie', 'for', 'dessert'],
            ['i', 'dont', 'drive', 'fast', 'cars'],
            ['data', 'science', 'is', 'fun'],
            ['chocolate', 'is', 'my', 'favorite'],
            ['my', 'favorite', 'movie', 'is', 'predator']]
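
The toy corpus above is typed in by hand, already cleaned. For real text, the "tokenize, lower, remove punctuation" step could look something like the sketch below. It uses only `re` as a stand-in for a proper tokenizer; the function name `simple_tokenize` and the `raw` sentences are made up for illustration, and the rule (drop straight apostrophes, keep runs of ASCII letters and digits) is a simplifying assumption, not what NLTK's `word_tokenize` does.

```python
import re

def simple_tokenize(text):
    """Lowercase, delete straight apostrophes, keep runs of ASCII letters/digits."""
    text = text.lower().replace("'", "")   # "don't" -> "dont"
    return re.findall(r"[a-z0-9]+", text)  # everything else acts as a separator

# Hypothetical raw sentences matching the first two rows of the toy corpus
raw = ["I like apple pie for dessert.", "I don't drive fast cars!"]
tokenized = [simple_tokenize(s) for s in raw]
print(tokenized)
```

The result is a list of lists of strings, which is the shape `Word2Vec` expects for its training sentences.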

## Basics of `Word2Vec`

**Word2Vec Parameters**
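- `min_count` : the minimum number of times a word must occur in the corpus to be included in the vocabulary (default 5, so for a toy corpus you will want to lower it to 1)
- `vector_size` : the length of each word's vector, i.e. the number of dimensions in the embedding (default 100). This parameter was called `size` in Gensim versions before 4.0.
- `window` : the maximum number of words on either side of the target word to count as its context (default 5)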
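
To get a feel for what `min_count` and `window` do, here is a plain-Python sketch that does not use Gensim at all. The helpers `keep_frequent` and `context_pairs` are hypothetical names invented for this illustration. They mimic the idea behind the two parameters only; Gensim's actual training does more than this (for example, it samples the effective window size during training).

```python
from collections import Counter

# Same toy corpus as above, repeated so this cell runs on its own
sentences = [['i', 'like', 'apple', 'pie', 'for', 'dessert'],
             ['i', 'dont', 'drive', 'fast', 'cars'],
             ['data', 'science', 'is', 'fun'],
             ['chocolate', 'is', 'my', 'favorite'],
             ['my', 'favorite', 'movie', 'is', 'predator']]

def keep_frequent(sentences, min_count):
    """Drop words that occur fewer than min_count times in the whole corpus."""
    counts = Counter(word for sent in sentences for word in sent)
    return [[w for w in sent if counts[w] >= min_count] for sent in sentences]

def context_pairs(sentence, window):
    """List (target, context) pairs up to `window` words on either side."""
    pairs = []
    for i, target in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((target, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

# With min_count=2 only 'i', 'is', 'my', and 'favorite' survive
filtered = keep_frequent(sentences, min_count=2)
print(filtered)

# With window=1 each word is paired only with its immediate neighbors
pairs = context_pairs(sentences[2], window=1)
print(pairs)
```

Notice how quickly a small corpus empties out as `min_count` rises, which is why toy examples usually set it to 1.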