## Setup

#### Packages setup and imports

As of 6/20/2020, some features in the TensorFlow nightly build train faster than TensorFlow 2.2.0. This cell is optional; skip it if you don't need the extra performance.

In [None]:
!pip install tf-nightly                       # only if you need the nightly build
!pip uninstall -q -y tensorboard tb-nightly   # fixes issues with duplicate TensorBoard installs
!pip install -q tensorboard

Download and install the WordEmbeddings package.

In [5]:
!git clone -b gcp https://github.com/Flomastruk/wordembeddings.git
%cd wordembeddings/
!pip install .
%cd /content/

Cloning into 'wordembeddings'...
remote: Enumerating objects: 193, done.
remote: Counting objects: 100% (193/193), done.
remote: Compressing objects: 100% (134/134), done.
remote: Total 193 (delta 108), reused 136 (delta 56), pack-reused 0
Receiving objects: 100% (193/193), 1.69 MiB | 1.83 MiB/s, done.
Resolving deltas: 100% (108/108), done.
/content/wordembeddings
Processing /content/wordembeddings
Building wheels for collected packages: wordembeddings
  Building wheel for wordembeddings (setup.py) ... done
  Created wheel for wordembeddings: filename=wordembeddings-0.0.0-cp36-none-any.whl size=22175 sha256=0661953bc6a2245e90d9605df0674608c26c149af4b84d8c0ba8caf1058a6e31
  Stored in directory: /tmp/pip-ephem-wheel-cache-fwag7ond/wheels/7d/69/30/2bc46802895f7cb924804b5ba49719e836ebce59e6a9b27714
Successfully built wordembeddings
Installing collected packages: wordembeddings
Successfully installed wordembeddings-0.0.0
/content


The cell below mounts Google Drive. Run it only if the job directory should live inside Google Drive; otherwise it is optional.

In [6]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## Training the model from command line

Create a job directory and (optionally) copy tests inside.

In [8]:
%env JOB_DIR=/content/wordembeddings_jobs
!mkdir -p $JOB_DIR
!cp -r /content/wordembeddings/tests $JOB_DIR/

env: JOB_DIR=/content/wordembeddings_jobs


The command below launches the training process inside the chosen job directory and saves the model weights.

In [9]:
!python -m newmodel.task --job-dir=/content/wordembeddings_jobs  --mode=glove --log-dir=glove_demo_model --save-dir=glove_demo_model --corpus-name=enwik8 --epochs=1 --min-occurrence=10 --max-vocabulary-size=50000 --skip-window=10

2020-06-23 04:11:29.755999: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-06-23 04:11:31.755409: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2020-06-23 04:11:31.774222: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:982] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-06-23 04:11:31.774977: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:00:04.0 name: Tesla K80 computeCapability: 3.7
coreClock: 0.8235GHz coreCount: 13 deviceMemorySize: 11.17GiB deviceMemoryBandwidth: 223.96GiB/s
2020-06-23 04:11:31.775027: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2020-06-23 04:11:31.777242: I tensorflow/stream_executor/platform/defa
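The same command can also be assembled from Python, which is convenient when sweeping over settings. The sketch below is only an illustration: `build_train_command` is a hypothetical helper, not part of the package, and it reuses exactly the flags from the command above.

```python
import shlex

def build_train_command(job_dir, **flags):
    """Assemble the argument list for `python -m newmodel.task` (hypothetical helper)."""
    cmd = ["python", "-m", "newmodel.task", f"--job-dir={job_dir}"]
    for name, value in flags.items():
        # keyword names use underscores, command line flags use dashes
        cmd.append(f"--{name.replace('_', '-')}={value}")
    return cmd

cmd = build_train_command(
    "/content/wordembeddings_jobs",
    mode="glove",
    log_dir="glove_demo_model",
    save_dir="glove_demo_model",
    corpus_name="enwik8",
    epochs=1,
    min_occurrence=10,
    max_vocabulary_size=50000,
    skip_window=10,
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` to start training.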

## Training the model using WordEmbeddings package

#### Importing required classes and functions

In [31]:
import os
import numpy as np

from newmodel.util import load_process_data, create_dataset_from_stored_batches, normalized_train_file_name, lr_scheduler_factory
from newmodel.model import GloveModel, HypGloveModel, Word2VecModel
from newmodel.tests import get_similarity_tests, get_analogy_tests
from newmodel.tests import ANALOGY_TEST_GROUPS

from tensorflow.keras.callbacks import LearningRateScheduler

#### Main training parameters definitions

In [20]:
# `job_dir` must be a path to a directory where data, models and logs will be stored
job_dir = '/content/wordembeddings_jobs' # can be a Google Drive path if Drive is mounted above, e.g. '/content/drive/My Drive/word2vec/newmodel'

# note: paths below are relative to `job_dir`
log_dir = 'word_embedding_model'        # optional, if None model logs are written to `logs/temp`
restore_dir = None                      # optional, if specified assumes that the model was saved previously in `restore_dir`
save_dir = 'word_embedding_model'       # optional, if None the model is not saved

# model settings
mode = 'glove'              # Available: 'glove' 'hypglove' 'word2vec'
embedding_size = 200

# data processing settings
corpus_name = 'enwik8'      # Available: 'enwik8' 'enwik9' 'enwiki_dump'
max_vocabulary_size = 50000
min_occurrence = 10
skip_window = 10            # window size in skipgram model (2-sided)

# training settings
num_epochs = 2              # number of epochs to train the model (if the model is restored the training continues)
learning_rate = 3e-3        # default initial learning rate
batch_size = 2**15
stored_batch_size = 2**17   # must be divisible by batch_size, affects how the data is stored on the disk
neg_samples = 0             # only used in models using negative sampling
po = 0.75                   # `po` and `threshold` are the parameters used in word co-occurrence re-weighting
threshold = 100
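Since `stored_batch_size` must be divisible by `batch_size`, a quick check before processing the corpus can catch a bad combination early. This is a minimal sketch; `batches_per_stored_batch` is a hypothetical helper, not a package function.

```python
def batches_per_stored_batch(stored_batch_size, batch_size):
    """Return how many training batches fit in one stored batch, or raise if they don't divide evenly."""
    if batch_size <= 0 or stored_batch_size % batch_size != 0:
        raise ValueError(
            f"stored_batch_size={stored_batch_size} must be a positive multiple of batch_size={batch_size}"
        )
    return stored_batch_size // batch_size

# values from the settings cell above: 2**17 / 2**15
print(batches_per_stored_batch(2**17, 2**15))  # 4
```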

In [21]:
# assign selected settings to a single Namespace 
from types import SimpleNamespace
args = SimpleNamespace()

args.job_dir = job_dir
args.log_dir = log_dir
args.restore_dir = restore_dir
args.save_dir = save_dir

args.mode = mode
args.embedding_size = embedding_size

args.corpus_name = corpus_name
args.max_vocabulary_size = max_vocabulary_size 
args.min_occurrence = min_occurrence
args.skip_window = skip_window

args.num_epochs = num_epochs
args.learning_rate = learning_rate
args.batch_size = batch_size
args.stored_batch_size = stored_batch_size
args.neg_samples = neg_samples
args.po = po
args.threshold = threshold
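An equivalent, more compact way to build such a namespace is to unpack a dictionary into `SimpleNamespace`. The sketch below uses a separate name, `demo_args`, and only a few of the settings so it does not overwrite `args`.

```python
from types import SimpleNamespace

# a subset of the settings above, collected in a plain dict
settings = {"mode": "glove", "embedding_size": 200, "corpus_name": "enwik8"}
demo_args = SimpleNamespace(**settings)

# attributes can still be overridden per example, as done in the sections below
demo_args.mode = "hypglove"

# vars() gives the namespace back as a dict, which is handy for logging
print(vars(demo_args))
```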

In [13]:
os.makedirs(job_dir, exist_ok = True)
os.chdir(job_dir) # expects that the directory exists

#### Example 1: Glove Model

Redefine some example-specific arguments

In [14]:
args.save_dir = 'glove_demo_model'      # model is saved in saved_models/glove_demo_model under job_dir
args.restore_dir = 'glove_demo_model'   # model is restored from saved_models/glove_demo_model under job_dir
args.log_dir = 'glove_demo_model'       # logs are written to logs/glove_demo_model under job_dir

args.mode = 'glove'
args.neg_samples = 0 # no negative samples

Create dataset

In [17]:
train_file_name = normalized_train_file_name(args)
train_file_path = os.path.join(args.job_dir, 'model_data', train_file_name)

word2id, id2word, word_counts, id_counts, skips_paths = load_process_data(train_file_name, args, remove_zero = False)
vocabulary_size = max(word2id.values()) + 1

dataset = create_dataset_from_stored_batches(skips_paths, args.stored_batch_size, batch_size = args.batch_size, sampling_distribution = None, threshold = args.threshold, po = args.po, neg_samples = args.neg_samples)

Key and value files for stored_enwik8_maxsize_50000_minocc_10_window_10_storedbatch_131072 already exist. Nothing to be done. Consider checking contents.
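For intuition about `threshold` and `po`: in the original GloVe formulation, each co-occurrence count x is weighted by min(1, (x / x_max)^alpha). The sketch below assumes `threshold` plays the role of x_max and `po` the role of alpha; how the package applies them internally is not shown in this notebook, and `glove_weight` is a hypothetical name.

```python
import numpy as np

def glove_weight(cooccurrence, threshold=100, po=0.75):
    """Standard GloVe weighting: grows as (x / threshold) ** po and is capped at 1."""
    x = np.asarray(cooccurrence, dtype=np.float64)
    return np.minimum(1.0, (x / threshold) ** po)

# rare pairs are down-weighted, pairs at or above the threshold get weight 1
weights = glove_weight([1, 10, 100, 1000])
print(np.round(weights, 4))
```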


Create and compile the model

In [18]:
train_model = GloveModel(vocabulary_size, args.embedding_size, args.neg_samples, learning_rate=2e-3, word2id = word2id, id2word = id2word)
train_model.compile(loss = train_model.loss, optimizer = train_model.optimizer)

Restore model weights if available

In [19]:
epochs_trained_ = 0
try:
    if args.restore_dir:
        _, epochs_trained_ = train_model.load_model(os.path.join(args.job_dir, 'saved_models', args.restore_dir))    
except FileNotFoundError: # no saved model found to restore from
    print('Model weights could not be found.')
print(f'Epochs trained: {epochs_trained_}')  

Epochs trained: 1


Define callbacks for training

In [32]:
similarity_tests_dict = get_similarity_tests(args.job_dir)
similarity_callbacks = train_model.get_similarity_tests_callbacks(similarity_tests_dict, ['target', 'context', 'added', 'concat'], ['l2', 'cos'], args.job_dir, args.log_dir)

analogy_tests_dict = get_analogy_tests(args.job_dir)
analogy_callbacks = train_model.get_analogy_tests_callbacks(analogy_tests_dict, ['target', 'context'], ['l2', 'cos'], args.job_dir, args.log_dir, group_dict= ANALOGY_TEST_GROUPS)

save_callbacks = train_model.get_save_callbacks(os.path.join(args.job_dir, 'saved_models' , args.save_dir) if args.save_dir else None, args, period = 1)
loss_callback = train_model.get_loss_callback(args.job_dir, args.log_dir)
lr_callback = LearningRateScheduler(lr_scheduler_factory(args.learning_rate))

callbacks = save_callbacks + [loss_callback] + similarity_callbacks + analogy_callbacks + [lr_callback]
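`LearningRateScheduler` expects a function that maps the epoch index (and optionally the current rate) to a learning rate. The actual schedule returned by `lr_scheduler_factory` is not shown here; the sketch below is a hypothetical stand-in with an assumed exponential decay, just to illustrate the shape of such a factory.

```python
def demo_lr_scheduler_factory(initial_lr, decay=0.5):
    """Return a schedule function usable with Keras LearningRateScheduler (assumed decay, for illustration)."""
    def schedule(epoch, lr=None):
        # ignore the current rate and derive the value from the epoch index alone
        return initial_lr * decay ** epoch
    return schedule

schedule = demo_lr_scheduler_factory(3e-3)
print([schedule(epoch) for epoch in range(3)])
```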



Train the model

In [33]:
train_model.fit(dataset, epochs = args.num_epochs, callbacks = callbacks, initial_epoch=epochs_trained_)    # 182ms/step

Epoch 2/2
   1016/Unknown - 80s 79ms/step - loss: 0.0536
Epoch 00002: saving model to /content/wordembeddings_jobs/saved_models/word_embedding_model/cp-0002.ckpt
Saving model configuration to /content/wordembeddings_jobs/saved_models/word_embedding_model


<tensorflow.python.keras.callbacks.History at 0x7fd16fd80668>

#### Example 2: HypGlove Model

Redefine some example-specific arguments

In [41]:
args.save_dir = 'hypglove_demo_model'      # model is saved in saved_models/hypglove_demo_model under job_dir
args.restore_dir = 'hypglove_demo_model'   # model is restored from saved_models/hypglove_demo_model under job_dir
args.log_dir = 'hypglove_demo_model'       # logs are written to logs/hypglove_demo_model under job_dir

args.mode = 'hypglove'
args.neg_samples = 0 # no negative samples

Create dataset

In [43]:
train_file_name = normalized_train_file_name(args)
train_file_path = os.path.join(args.job_dir, 'model_data', train_file_name)

word2id, id2word, word_counts, id_counts, skips_paths = load_process_data(train_file_name, args, remove_zero = False)
vocabulary_size = max(word2id.values()) + 1

dataset = create_dataset_from_stored_batches(skips_paths, args.stored_batch_size, batch_size = args.batch_size, sampling_distribution = None, threshold = args.threshold, po = args.po, neg_samples = args.neg_samples)

Key and value files for stored_enwik8_maxsize_50000_minocc_10_window_10_storedbatch_131072 already exist. Nothing to be done. Consider checking contents.


Create and compile the model

In [44]:
train_model = HypGloveModel(vocabulary_size, args.embedding_size, args.neg_samples, learning_rate=2e-3, word2id = word2id, id2word = id2word)
train_model.compile(loss = train_model.loss, optimizer = train_model.optimizer)

Restore model weights if available

In [45]:
epochs_trained_ = 0
try:
    if args.restore_dir:
        _, epochs_trained_ = train_model.load_model(os.path.join(args.job_dir, 'saved_models', args.restore_dir))    
except FileNotFoundError: # no saved model found to restore from
    print('Model weights could not be found.')
print(f'Epochs trained: {epochs_trained_}')  

Model weights could not be found.
Epochs trained: 0


Define callbacks for training

In [46]:
similarity_tests_dict = get_similarity_tests(args.job_dir)
similarity_callbacks = train_model.get_similarity_tests_callbacks(similarity_tests_dict, ['target', 'context', 'added', 'concat'], ['l2', 'cos'], args.job_dir, args.log_dir)

analogy_tests_dict = get_analogy_tests(args.job_dir)
analogy_callbacks = train_model.get_analogy_tests_callbacks(analogy_tests_dict, ['target', 'context'], ['l2', 'cos'], args.job_dir, args.log_dir, group_dict= ANALOGY_TEST_GROUPS)

save_callbacks = train_model.get_save_callbacks(os.path.join(args.job_dir, 'saved_models' , args.save_dir) if args.save_dir else None, args, period = 1)
loss_callback = train_model.get_loss_callback(args.job_dir, args.log_dir)
lr_callback = LearningRateScheduler(lr_scheduler_factory(args.learning_rate))

callbacks = save_callbacks + [loss_callback] + similarity_callbacks + analogy_callbacks + [lr_callback]
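The test callbacks are requested for four embedding variants and two metrics. The sketch below shows one plausible reading of those names on random toy matrices: 'added' as the sum of target and context vectors, 'concat' as their concatenation, 'l2' as Euclidean distance and 'cos' as cosine similarity. This is an assumption about the naming, not the package's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
target = rng.normal(size=(5, 4))    # toy target embeddings: 5 words, 4 dimensions
context = rng.normal(size=(5, 4))   # toy context embeddings of the same shape

variants = {
    "target": target,
    "context": context,
    "added": target + context,                           # element-wise sum
    "concat": np.concatenate([target, context], axis=1),  # doubles the dimension
}

def l2_distance(u, v):
    return float(np.linalg.norm(u - v))

def cos_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

for name, emb in variants.items():
    print(name, emb.shape, l2_distance(emb[0], emb[1]), cos_similarity(emb[0], emb[1]))
```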



Train the model

In [None]:
train_model.fit(dataset, epochs = args.num_epochs, callbacks = callbacks, initial_epoch=epochs_trained_)# 182ms/step
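Note that Keras `fit` treats `epochs` as the index of the final epoch, not as the number of additional epochs, so a restored model only trains for the remainder. A minimal sketch of that arithmetic (`epochs_to_run` is a hypothetical helper):

```python
def epochs_to_run(num_epochs, epochs_trained):
    """Number of epochs fit() will actually run when called with initial_epoch=epochs_trained."""
    return max(num_epochs - epochs_trained, 0)

print(epochs_to_run(2, 0))  # 2: fresh model, trains epochs 1 and 2
print(epochs_to_run(2, 1))  # 1: restored after one epoch, trains only epoch 2
print(epochs_to_run(2, 2))  # 0: nothing left to train
```

To continue training a fully trained model, increase `args.num_epochs` beyond the number of epochs already trained.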

#### Example 3: Word2Vec Model

Redefine some example-specific arguments

In [49]:
args.save_dir = 'w2v_demo_model'      # model is saved in saved_models/w2v_demo_model under job_dir
args.restore_dir = 'w2v_demo_model'   # model is restored from saved_models/w2v_demo_model under job_dir
args.log_dir = 'w2v_demo_model'       # logs are written to logs/w2v_demo_model under job_dir

args.mode = 'word2vec'
args.neg_samples = 16 # number of negative samples

Create dataset

In [50]:
train_file_name = normalized_train_file_name(args)
train_file_path = os.path.join(args.job_dir, 'model_data', train_file_name)

word2id, id2word, word_counts, id_counts, skips_paths = load_process_data(train_file_name, args, remove_zero = False)
vocabulary_size = max(word2id.values()) + 1

arr_counts = np.array([id_counts[i] for i in range(len(id2word))], dtype = np.float32)
arr_counts[:] = arr_counts**args.po
unigram = arr_counts/arr_counts.sum()

dataset = create_dataset_from_stored_batches(skips_paths, args.stored_batch_size, batch_size = args.batch_size, sampling_distribution = unigram, threshold = args.threshold, po = args.po, neg_samples = args.neg_samples)

Key and value files for stored_enwik8_maxsize_50000_minocc_10_window_10_storedbatch_131072 already exist. Nothing to be done. Consider checking contents.
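The `unigram` array above is the usual word2vec negative-sampling distribution: word counts raised to the power `po` and normalized. The toy sketch below uses made-up counts to show the effect of the exponent and how negatives could be drawn with NumPy; how the package draws them internally is not shown in this notebook.

```python
import numpy as np

counts = np.array([1000, 100, 10, 1], dtype=np.float64)  # made-up word counts
po = 0.75

raw = counts / counts.sum()          # plain unigram frequencies
smoothed = counts ** po
unigram = smoothed / smoothed.sum()  # smoothed distribution, as in the cell above

# the exponent moves probability mass from frequent words to rare ones
print(np.round(raw, 4))
print(np.round(unigram, 4))

# draw 16 negative word ids for one positive pair
rng = np.random.default_rng(0)
negatives = rng.choice(len(counts), size=16, p=unigram)
print(negatives)
```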


Create and compile the model

In [51]:
train_model = Word2VecModel(vocabulary_size, args.embedding_size, args.neg_samples, learning_rate=2e-3, word2id = word2id, id2word = id2word)
train_model.compile(loss = train_model.loss, optimizer = train_model.optimizer)

Restore model weights if available

In [52]:
epochs_trained_ = 0
try:
    if args.restore_dir:
        _, epochs_trained_ = train_model.load_model(os.path.join(args.job_dir, 'saved_models', args.restore_dir))    
except FileNotFoundError: # no saved model found to restore from
    print('Model weights could not be found.')
print(f'Epochs trained: {epochs_trained_}')  

Model weights could not be found.
Epochs trained: 0


Define callbacks for training

In [53]:
similarity_tests_dict = get_similarity_tests(args.job_dir)
similarity_callbacks = train_model.get_similarity_tests_callbacks(similarity_tests_dict, ['target', 'context', 'added', 'concat'], ['l2', 'cos'], args.job_dir, args.log_dir)

analogy_tests_dict = get_analogy_tests(args.job_dir)
analogy_callbacks = train_model.get_analogy_tests_callbacks(analogy_tests_dict, ['target', 'context'], ['l2', 'cos'], args.job_dir, args.log_dir, group_dict= ANALOGY_TEST_GROUPS)

save_callbacks = train_model.get_save_callbacks(os.path.join(args.job_dir, 'saved_models' , args.save_dir) if args.save_dir else None, args, period = 1)
loss_callback = train_model.get_loss_callback(args.job_dir, args.log_dir)
lr_callback = LearningRateScheduler(lr_scheduler_factory(args.learning_rate))

callbacks = save_callbacks + [loss_callback] + similarity_callbacks + analogy_callbacks + [lr_callback]



Train the model

In [None]:
train_model.fit(dataset, epochs = args.num_epochs, callbacks = callbacks, initial_epoch=epochs_trained_)# 182ms/step

## TensorBoard

In [None]:
%load_ext tensorboard
# %reload_ext tensorboard

In [None]:
%tensorboard --logdir $JOB_DIR/logs/ --port=9009