In [2]:
# code for loading the format for the notebook
import os

# path : store the current path to convert back to it later
path = os.getcwd()
os.chdir('../../notebook_format')
from formats import load_style
load_style()

In [3]:
os.chdir(path)
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.rcParams['figure.figsize'] = 8, 6 # change default figure size

# 1. magic to print version
# 2. magic so that the notebook will reload external python modules
%load_ext watermark
%load_ext autoreload 
%autoreload 2

%watermark -a 'Ethen' -d -t -v -p numpy,pandas,gensim,keras

Ethen 2016-08-03 09:50:02 

CPython 3.5.2
IPython 4.2.0

numpy 1.11.1
pandas 0.18.1
gensim 0.13.1
keras 1.0.5


In [1]:
import os
from gensim.models import Word2Vec

gensim’s `Word2Vec` expects a sequence of sentences as its input, where each sentence is a list of words.

In [2]:
sentences = [['first', 'sentence'], ['second', 'sentence']]
model = Word2Vec( sentences, min_count = 1 )
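Raw text usually arrives as strings rather than token lists. Below is a minimal sketch of getting it into the shape described above. It assumes plain English text; `to_sentences` is a hypothetical helper (not part of gensim), and the regular-expression tokenizer is deliberately naive.

```python
import re

def to_sentences(raw_lines):
    """Turn an iterable of raw text lines into lists of lowercase word tokens."""
    sentences = []
    for line in raw_lines:
        # keep runs of letters, digits and apostrophes; punctuation is dropped
        tokens = re.findall(r"[a-z0-9']+", line.lower())
        if tokens:  # skip lines that contain no tokens
            sentences.append(tokens)
    return sentences

raw = ['First sentence.', '', 'Second sentence!']
sentences = to_sentences(raw)
print(sentences)  # [['first', 'sentence'], ['second', 'sentence']]
```

The result has the same structure as the hand-written `sentences` list, so it could be passed to `Word2Vec` in the same way.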

The model also accepts several key parameters that affect both training speed and quality.

- The first, `min_count`, prunes the internal dictionary. Words that appear only once or twice in a billion-word corpus are probably uninteresting typos and garbage, and there is not enough data to learn meaningful vectors for them, so it’s best to ignore them. A reasonable value for `min_count` is between 0 and 100, depending on the size of the dataset.
- Another is `size`, the dimensionality of the word vectors (the width of the network's hidden layer). Larger values require more training data, but can lead to better (more accurate) models. Reasonable values are in the tens to hundreds.
- The last of the major parameters (full list [here](http://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec)) is `workers`, the number of worker threads used to parallelize training. It only has an effect when gensim's Cython-compiled routines are available.

Each is passed as a keyword argument:

```python

model = Word2Vec(sentences, min_count = 10)  # default value is 5
model = Word2Vec(sentences, size = 200)  # default value is 100
model = Word2Vec(sentences, workers = 4) # number of worker threads; the default varies across gensim versions

```
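To make the effect of `min_count` concrete, here is a small sketch of frequency-based vocabulary pruning. It only illustrates the idea and is not gensim's implementation; `prune_vocab` is a hypothetical helper introduced for this example.

```python
from collections import Counter

def prune_vocab(sentences, min_count):
    """Return the set of words that occur at least `min_count` times."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}

toy = [['first', 'sentence'], ['second', 'sentence']]
print(sorted(prune_vocab(toy, min_count=1)))  # ['first', 'second', 'sentence']
print(sorted(prune_vocab(toy, min_count=2)))  # ['sentence']
```

With the default `min_count` of 5, every word in this toy corpus would be pruned, which is why the toy examples in this notebook pass `min_count = 1`.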

Because word2vec training is unsupervised, there’s no single objective way to evaluate the result. Evaluation depends on your end application.

---

In the example above, keeping the input as a built-in Python list is convenient, but it can use a lot of RAM when the input is large.

Gensim only requires that the input provide sentences sequentially when iterated over. So if the input is spread across several files in a directory, we can process it file by file, line by line, instead of loading everything into an in-memory list:

In [11]:
class Sentences:
    """
    iterate over files in a directory, and read in each line
    as a list of words. Used with gensim's Word2Vec
    """
    def __init__( self, dirname ):
        self.dirname = dirname
 
    def __iter__(self):
        for file in os.listdir(self.dirname):
            fname = os.path.join( self.dirname, file )
            
            with open(fname) as f:
                for line in f:
                    yield line.split()

In [12]:
# a memory-friendly iterator
sentences = Sentences('test')
model = Word2Vec( sentences, min_count = 1 )
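The class above defines `__iter__` instead of being a plain generator function because `Word2Vec` iterates over the input more than once (a pass to build the vocabulary, then one or more training passes), and a generator is exhausted after its first pass. The sketch below shows the difference using in-memory lines instead of files; `one_shot` and `Restartable` are hypothetical names used only for illustration.

```python
def one_shot(lines):
    """Generator function: the returned generator can be consumed only once."""
    for line in lines:
        yield line.split()

class Restartable:
    """Iterable object: every call to __iter__ starts again from the beginning."""
    def __init__(self, lines):
        self.lines = lines

    def __iter__(self):
        for line in self.lines:
            yield line.split()

lines = ['first sentence', 'second sentence']

gen = one_shot(lines)
first_pass = list(gen)
second_pass = list(gen)         # the generator is already exhausted
print(len(first_pass), len(second_pass))  # 2 0

it = Restartable(lines)
passes = [list(it), list(it)]   # both passes see the full data
print(len(passes[0]), len(passes[1]))     # 2 2
```

The same reasoning applies to `Sentences`: each pass reopens the files in the directory, so the data can be streamed repeatedly without being held in memory.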

## Reference

- [Word2vec API Tutorial](http://rare-technologies.com/word2vec-tutorial/)