
## Preprocessing

Download English Wikipedia articles:

    aws s3 cp s3://guj-mjum/datasets/wikipedia/enwiki.txt .

Download English stopwords:

    aws s3 cp s3://guj-mjum/datasets/word2vec/stopwords_english.txt .

Preprocess articles:

    python preprocess.py --input-path enwiki.txt \
        --stopword-path stopwords_english.txt \
        --output-path . \
        --win-size 11 \
        --vocab-size 10000

    Loading stopwords: stopwords_english.txt
    Build vocabulary
    1000 articles added to dictionary
    2000 articles added to dictionary
    3000 articles added to dictionary
    4000 articles added to dictionary
    ...
    861000 articles added to dictionary
    862000 articles added to dictionary
    863000 articles added to dictionary
    864000 articles added to dictionary
    num words: 1687212
    num_documents: 864785
    num words: 10000
    num_documents: 864785
    1000 articles tokenized
    2000 articles tokenized
    3000 articles tokenized
    4000 articles tokenized
    ...

Preprocessing does the following tasks:

 * Remove HTML, links, numbers, special characters and punctuation
 * Tokenize the text
 * Remove stopwords and short tokens
 * Create vocabulary from the top `vocab-size` words
 * Save vocabulary to a file
 * Map tokens to token IDs
 * Create examples `x` and labels `y` by sliding a window of `win-size` tokens over each article
 * Write examples and labels to a file
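
The first steps of this list can be sketched in plain Python. This is a minimal illustration, not the code in `preprocess.py`: the function names, the regular expressions, the `min_len` threshold and the ranking of tokens by document frequency are all assumptions, and a `collections.Counter` stands in for the Gensim dictionary.

```python
import re
from collections import Counter


def clean_and_tokenize(text, stopwords, min_len=3):
    """Strip markup, links, digits and punctuation, then tokenize and filter.

    Hypothetical helper: the regexes and min_len are illustrative choices.
    """
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)        # HTML tags
    text = re.sub(r"https?://\S+", " ", text)   # links
    text = re.sub(r"[^a-z\s]", " ", text)       # numbers, punctuation, special characters
    return [t for t in text.split() if len(t) >= min_len and t not in stopwords]


def build_vocab(tokenized_docs, vocab_size):
    """Map the vocab_size tokens that occur in the most documents to IDs.

    Assumption: "top" means highest document frequency; ties are broken
    alphabetically so the result is deterministic.
    """
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each token once per document
    top = sorted(doc_freq.items(), key=lambda kv: (-kv[1], kv[0]))[:vocab_size]
    return {token: idx for idx, (token, _) in enumerate(top)}


def to_ids(tokens, vocab):
    """Replace tokens by their IDs and drop out-of-vocabulary tokens."""
    return [vocab[t] for t in tokens if t in vocab]


docs = [
    "<p>The cat sat on the mat in 2019.</p>",
    "See https://example.com: the cat ran!",
]
stopwords = {"the", "see"}
tokenized = [clean_and_tokenize(d, stopwords) for d in docs]
vocab = build_vocab(tokenized, vocab_size=2)
ids = [to_ids(tokens, vocab) for tokens in tokenized]
```

Dropping out-of-vocabulary tokens before windowing is one possible design; the actual script may handle them differently.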
 
Preprocessing creates the following files:

 * vocab.txt - A Gensim dictionary in text format
 * vocab.pkl - A Gensim dictionary in binary format
 * dataset.hdf5 - An HDF5 file with keys `x_train` and `y_train`

`vocab.txt` is not needed by later steps; it is written only so that the vocabulary can be inspected manually.

Its first line contains the total number of documents. Each following line contains a token ID, the token, and the number of documents in which the token appears.

    > head vocab.txt
    864785
    1179    a&m 1287
    9157    aa  1397
    7649    aaron   2430
    8832    ab  1250
    6890    abandon 1700
    2549    abandoned   8447
    8322    abbey   2954
    8814    abbot   1104
    2321    abbreviated 2529
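
For quick inspection without Gensim, the layout described above can be parsed with a few lines of Python. This is a sketch under the assumption that the three columns are separated by whitespace and that tokens contain no spaces; `parse_vocab_text` is a hypothetical helper, not part of this repository. If Gensim is installed, `Dictionary.load_from_text` is likely the more direct route.

```python
def parse_vocab_text(lines):
    """Parse the text dictionary layout: document count, then id/token/doc-frequency rows."""
    lines = iter(lines)
    num_docs = int(next(lines).strip())  # first line: total number of documents
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        token_id, token, doc_freq = parts
        entries[token] = (int(token_id), int(doc_freq))
    return num_docs, entries


sample = [
    "864785",
    "1179\ta&m\t1287",
    "9157\taa\t1397",
    "7649\taaron\t2430",
]
num_docs, entries = parse_vocab_text(sample)
```

To read the real file, pass an open file handle instead of the sample list, for example `parse_vocab_text(open("vocab.txt", encoding="utf-8"))`.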

## Train model 

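The training data is generated from token windows $\{w_1,...,w_k,w_{k+1},w_{k+2},...,w_{2k+1}\}$ of $2k+1$ tokens, where 
the context $x=\{w_1,...,w_k,w_{k+2},...,w_{2k+1}\}$ is the input and the center token $y=w_{k+1}$ is the label. With `--win-size 11`, $k=5$ and each example has 10 context tokens.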
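
As a rough illustration of this split, the sketch below slides a window over a list of token IDs and separates each window into its context and its center token. `make_examples` is a hypothetical helper, and the stride of one token and the skipping of articles shorter than the window are assumptions, not confirmed behavior of `preprocess.py`.

```python
def make_examples(token_ids, win_size):
    """Return (contexts, labels) for every full window over token_ids.

    Each context holds the win_size - 1 tokens around the center token,
    and the center token itself is the label.
    """
    if win_size % 2 == 0:
        raise ValueError("win_size must be odd so that a center token exists")
    half = win_size // 2
    xs, ys = [], []
    # Assumption: stride 1; sequences shorter than win_size yield no examples.
    for start in range(len(token_ids) - win_size + 1):
        window = token_ids[start:start + win_size]
        xs.append(window[:half] + window[half + 1:])  # tokens left and right of the center
        ys.append(window[half])                       # center token
    return xs, ys


xs, ys = make_examples([10, 11, 12, 13, 14, 15], win_size=5)
```

With a window of 5 tokens, each context has 4 token IDs, which mirrors how a window of 11 tokens yields the 10 columns of `X_train`.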

    python train.py --dataset-path dataset.hdf5 --vocab-path vocab.pkl --models-path /tmp/w2v_models

    Using TensorFlow backend.
    Load vocabulary
    vocab_size: 10000
    Load dataset.hdf5
    X_train.shape: (68820132, 10)
    y_train.shape: (68820132,)
    Cutoff train data to 10000000 examples
    X_train.shape: (10000000, 10)
    y_train.shape: (10000000,)
    Shuffle dataset
    win_size: 10
    epoch 0: loss=8.908036 acc=0.004845 time=758975.000000
    Save model: /tmp/w2v_models/20181008_081335_8370637/w2v_model.h5
    epoch 1: loss=8.561957 acc=0.012169 time=756866.000000
    Save model: /tmp/w2v_models/20181008_082612_4173529/w2v_model.h5
    ...