In [1]:
import sys
sys.path.insert(1, '../')

import torch
import torch.optim as optim

from models import generator
from models import discriminator
from trainers import train_generator_MLE, train_generator_PG, train_discriminator
import helpers

%load_ext autoreload
%autoreload 2

torch.cuda.device_count()

0

# Tutorial Part 1: Overall Story

Here we describe each step of the pipeline at a high level, to give context for the details covered later.


## Synthetic Data Experiment

The most accurate way to evaluate a generative model is to draw samples from it and let human observers review them based on their prior knowledge. This assumes that the human observer has learned an accurate model of the natural distribution p_human(x).
 
The authors used a randomly initialized language model as the true model, aka the ***oracle***, to generate the "real" data distribution p(x_t | x_1, ..., x_{t-1}). Having such an oracle has two benefits: first, it provides the training dataset; second, it can evaluate the exact performance of the generative models, which would not be possible with real data.
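To make the oracle idea concrete, here is a minimal pure-Python sketch. It is an illustration under loud assumptions: the oracle here is a tiny random bigram table rather than the GRU used in this tutorial, and the names `oracle_probs`, `sample_sequence`, `oracle_nll` and `mean_oracle_nll` are hypothetical. It shows that sequences drawn from the oracle itself score a lower oracle NLL than sequences drawn from an untrained (uniform) sampler, which is the property the metric relies on.

```python
import math
import random

random.seed(0)
VOCAB, SEQ_LEN, N_SAMPLES = 50, 20, 500  # toy sizes, not the tutorial's

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy oracle (assumption): a randomly initialized bigram model with one
# next-token distribution per previous token; token 0 is the start token.
oracle_probs = [softmax([random.gauss(0.0, 1.5) for _ in range(VOCAB)])
                for _ in range(VOCAB)]

def sample_sequence(probs):
    """Draw one sequence token by token, starting from token 0."""
    prev, seq = 0, []
    for _ in range(SEQ_LEN):
        prev = random.choices(range(VOCAB), weights=probs[prev])[0]
        seq.append(prev)
    return seq

def oracle_nll(seq):
    """Negative log-likelihood of one sequence under the oracle."""
    prev, nll = 0, 0.0
    for tok in seq:
        nll -= math.log(oracle_probs[prev][tok])
        prev = tok
    return nll

def mean_oracle_nll(sampler_probs):
    """Average oracle NLL of sequences drawn from the given sampler."""
    seqs = [sample_sequence(sampler_probs) for _ in range(N_SAMPLES)]
    return sum(oracle_nll(s) for s in seqs) / N_SAMPLES

uniform_probs = [[1.0 / VOCAB] * VOCAB for _ in range(VOCAB)]
nll_of_oracle_samples = mean_oracle_nll(oracle_probs)
nll_of_uniform_samples = mean_oracle_nll(uniform_probs)
print(nll_of_oracle_samples, nll_of_uniform_samples)
```

A generator that learns the oracle's distribution should move its score from the second number toward the first.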

In [13]:
# experimental constants

CUDA = torch.cuda.is_available()
VOCAB_SIZE = 5000
MAX_SEQ_LEN = 20

BATCH_SIZE = 32
START_LETTER = 0

GEN_EMBEDDING_DIM = 32 # length of input vectors for generator and oracle
GEN_HIDDEN_DIM = 32 # length of hidden state for generator and oracle

oracle_state_dict_path = '../params/oracle_EMBDIM32_HIDDENDIM32_VOCAB5000_MAXSEQLEN20.trc'
oracle_samples_path = '../sample_data/oracle_samples.trc'

MLE_TRAIN_EPOCHS = 100
POS_NEG_SAMPLES = 10000

DIS_EMBEDDING_DIM = 64 # length of input vectors for discriminator
DIS_HIDDEN_DIM = 64 # length of hidden state for discriminator

pretrained_gen_path = '../params/gen_MLEtrain_EMBDIM32_HIDDENDIM32_VOCAB5000_MAXSEQLEN20.trc'
pretrained_dis_path = '../params/dis_pretrain_EMBDIM_64_HIDDENDIM64_VOCAB5000_MAXSEQLEN20.trc'

pretrained_gen_path_cpu = '../params/gen_MLEtrain_EMBDIM32_HIDDENDIM32_VOCAB5000_MAXSEQLEN20_cpu.trc'
pretrained_dis_path_cpu = '../params/dis_pretrain_EMBDIM_64_HIDDENDIM64_VOCAB5000_MAXSEQLEN20_cpu.trc'

ADV_TRAIN_EPOCHS = 50 # ADVERSARIAL TRAINING EPOCHS

In [3]:
oracle = generator.Generator(
    GEN_EMBEDDING_DIM, 
    GEN_HIDDEN_DIM, 
    VOCAB_SIZE, 
    MAX_SEQ_LEN, 
    gpu=CUDA
)

# for reproducibility we provide saved parameters for the oracle
oracle.load_state_dict(torch.load(oracle_state_dict_path))

oracle

Generator(
  (embeddings): Embedding(5000, 32)
  (gru): GRU(32, 32)
  (gru2out): Linear(in_features=32, out_features=5000, bias=True)
)

The output above should look like this:

```
Generator(
  (embeddings): Embedding(5000, 32)
  (gru): GRU(32, 32)
  (gru2out): Linear(in_features=32, out_features=5000, bias=True)
)
```

To explain the output above: the embedding layer holds 5000 vectors, one per vocabulary entry, each of length 32. The GRU takes input vectors of length 32 and produces hidden states of the same length. The final linear layer maps each hidden state to 5000 activations, one per vocabulary entry.

The authors use the oracle to generate 10,000 sequences of length 20 as the training set S for the generative models.
We have already used `helpers.batchwise_sample()` to save S, so you can load it below.
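As a rough worked example, the layer shapes printed above determine how many trainable parameters the model has. The sketch below assumes the standard PyTorch layout for a single-layer, unidirectional `GRU` with biases (three gates, each with input weights, hidden weights and two bias vectors); the variable names are only for illustration.

```python
VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM = 5000, 32, 32

# Embedding(5000, 32): one 32-dim vector per vocabulary entry
embedding_params = VOCAB_SIZE * EMBEDDING_DIM

# GRU(32, 32): three gates, each with input weights, hidden weights
# and two bias vectors (input-side and hidden-side)
gru_params = 3 * (EMBEDDING_DIM * HIDDEN_DIM
                  + HIDDEN_DIM * HIDDEN_DIM
                  + 2 * HIDDEN_DIM)

# Linear(32 -> 5000): weight matrix plus bias
linear_params = HIDDEN_DIM * VOCAB_SIZE + VOCAB_SIZE

total_params = embedding_params + gru_params + linear_params
print(embedding_params, gru_params, linear_params, total_params)
# 160000 6336 165000 331336
```

Almost all of the parameters sit in the embedding and output layers, because both scale with the vocabulary size.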

In [4]:
oracle_samples = torch.load(oracle_samples_path).type(torch.LongTensor)
print(type(oracle_samples), oracle_samples.shape)

<class 'torch.Tensor'> torch.Size([10000, 20])


### Instantiate a generator and discriminator

In [17]:
gen = generator.Generator(
    GEN_EMBEDDING_DIM, 
    GEN_HIDDEN_DIM, 
    VOCAB_SIZE, 
    MAX_SEQ_LEN, 
    gpu=CUDA,
)

dis = discriminator.Discriminator(
    DIS_EMBEDDING_DIM, 
    DIS_HIDDEN_DIM, 
    VOCAB_SIZE,
    MAX_SEQ_LEN, 
    gpu=CUDA,
)

# create the optimizers only after the models exist,
# otherwise gen.parameters() raises a NameError on a fresh kernel
gen_optimizer = optim.Adam(gen.parameters(), lr=1e-2)
dis_optimizer = optim.Adagrad(dis.parameters())

In [6]:
gen

Generator(
  (embeddings): Embedding(5000, 32)
  (gru): GRU(32, 32)
  (gru2out): Linear(in_features=32, out_features=5000, bias=True)
)

In [7]:
dis

Discriminator(
  (embeddings): Embedding(5000, 64)
  (gru): GRU(64, 64, num_layers=2, dropout=0.2, bidirectional=True)
  (gru2hidden): Linear(in_features=256, out_features=64, bias=True)
  (dropout_linear): Dropout(p=0.2, inplace=False)
  (hidden2out): Linear(in_features=64, out_features=1, bias=True)
)

#### If you have a GPU and want to use it, all models and model inputs need to be on the GPU

If the parameters are on one device and the inputs are on another, PyTorch raises a runtime error.

We run one call to `helpers.batchwise_oracle_nll` for three reasons: to check that the inputs and parameters are on matching devices, to get the baseline oracle NLL, and to get a sense of the GPU/CPU speedup. On CPU the original `batchwise_oracle_nll` has a wall time of 22 seconds.
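A quick way to check for a mismatch before running anything expensive is to compare the devices directly. This is a minimal sketch: `devices_of` is a hypothetical helper, and the small `nn.Linear` and zero tensor are stand-ins for the tutorial's models and batches.

```python
import torch
import torch.nn as nn

def devices_of(module):
    """Return the set of devices holding a module's parameters."""
    return {p.device for p in module.parameters()}

model = nn.Linear(4, 2)      # stand-in for gen, dis or oracle
batch = torch.zeros(3, 4)    # stand-in for a batch of inputs

if torch.cuda.is_available():
    model = model.cuda()
    batch = batch.cuda()

# a forward pass is only safe when parameters and inputs share a device
assert devices_of(model) == {batch.device}
output = model(batch)
print(devices_of(model), batch.device, output.shape)
```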

In [8]:
if CUDA:
    oracle = oracle.cuda()
    gen = gen.cuda()
    dis = dis.cuda()
    oracle_samples = oracle_samples.cuda()
    
    print(torch.cuda.device_count(), torch.cuda.get_device_name(0))
    
    print(
        next(gen.embeddings.parameters()).device, 
        next(dis.embeddings.parameters()).device,
        next(oracle.embeddings.parameters()).device,
     )

In [None]:
%%time
# sample from generator and compute baseline oracle NLL
oracle_loss = helpers.batchwise_oracle_nll(
    gen, 
    oracle, 
    POS_NEG_SAMPLES, 
    BATCH_SIZE, MAX_SEQ_LEN,
    start_letter=START_LETTER, 
    gpu=CUDA,
)

print('oracle_loss', oracle_loss)

### GENERATOR MLE TRAINING

At the beginning of training, the authors used maximum likelihood estimation (MLE) to pretrain the generator Gθ on the training set S.

They found that the supervised signal from the pretrained discriminator is informative and helps adjust the generator efficiently.
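For intuition, MLE pretraining of a sequence model is next-token prediction: at each step the model sees the tokens so far and is scored on the token that follows. The sketch below shows how one training sequence could be turned into an input/target pair. It assumes the usual shift-by-one scheme with `START_LETTER` prepended; `make_mle_pair` is a hypothetical helper for illustration, not a function from `trainers` or `helpers`.

```python
START_LETTER = 0

def make_mle_pair(sequence, start_letter=START_LETTER):
    """Build (inputs, targets) for next-token prediction on one sequence."""
    inputs = [start_letter] + sequence[:-1]   # shift right by one step
    targets = list(sequence)                  # token to predict at each step
    return inputs, targets

inputs, targets = make_mle_pair([17, 4, 256, 9])
print(inputs)   # [0, 17, 4, 256]
print(targets)  # [17, 4, 256, 9]
```

The training loss is then the negative log-likelihood of each target token given the inputs up to that step, averaged over the batch.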

```
# GENERATOR MLE TRAINING
print('Starting Generator MLE Training...')
gen_optimizer = optim.Adam(gen.parameters(), lr=1e-2)
train_generator_MLE(gen, gen_optimizer, oracle, oracle_samples, MLE_TRAIN_EPOCHS)
torch.save(gen.state_dict(), pretrained_gen_path)

# PRETRAIN DISCRIMINATOR
print('Starting Discriminator Training...')
dis_optimizer = optim.Adagrad(dis.parameters())
train_discriminator(dis, dis_optimizer, oracle_samples, gen, oracle, d_steps = 50,  epochs = 3)
torch.save(dis.state_dict(), pretrained_dis_path)
```

The pretraining below only needs to be done once. After the weights are saved, they can be loaded with `gen.load_state_dict(torch.load(pretrained_gen_path))`, so on later runs you can skip the next two cells and go straight to ***Load Pretrained Generator and Discriminator***.

In [10]:
# GENERATOR MLE TRAINING
print('Starting Generator MLE Training...')

train_generator_MLE(gen, gen_optimizer, oracle, oracle_samples, 
                    MLE_TRAIN_EPOCHS, 
                    POS_NEG_SAMPLES = POS_NEG_SAMPLES,
                    BATCH_SIZE = BATCH_SIZE,
                    START_LETTER = START_LETTER,
                    MAX_SEQ_LEN = MAX_SEQ_LEN,
                    CUDA = CUDA,
)
# the first file keeps the weights on the device you trained on;
# the second is a CPU copy (redundant if you trained on CPU).
# We copy the tensors instead of calling gen.cpu(), which would move gen itself off the GPU.
torch.save(gen.state_dict(), pretrained_gen_path)
torch.save({k: v.cpu() for k, v in gen.state_dict().items()}, pretrained_gen_path_cpu)

Starting Generator MLE Training...
epoch 1 : .......... average_train_NLL = 6.8282, oracle_sample_NLL = 14.6160
epoch 2 : .......... average_train_NLL = 6.1780, oracle_sample_NLL = 13.7436
epoch 3 : .......... average_train_NLL = 5.8610, oracle_sample_NLL = 13.1479
epoch 4 : .......... average_train_NLL = 5.6560, oracle_sample_NLL = 12.8644
epoch 5 : .......... average_train_NLL = 5.5087, oracle_sample_NLL = 12.5425
epoch 6 : .......... average_train_NLL = 5.3974, oracle_sample_NLL = 12.3266
epoch 7 : .......... average_train_NLL = 5.3087, oracle_sample_NLL = 12.1899
epoch 8 : .......... average_train_NLL = 5.2363, oracle_sample_NLL = 12.0750
epoch 9 : .......... average_train_NLL = 5.1747, oracle_sample_NLL = 11.9395
epoch 10 : .......... average_train_NLL = 5.1223, oracle_sample_NLL = 11.8729
epoch 11 : .......... average_train_NLL = 5.0776, oracle_sample_NLL = 11.8209
epoch 12 : .......... average_train_NLL = 5.0383, oracle_sample_NLL = 11.7077
epoch 13 : .......... average_train_NL

In [19]:
# PRETRAIN DISCRIMINATOR
print('Starting Discriminator Training...')

train_discriminator(dis, dis_optimizer, oracle_samples, gen, oracle, 
                    d_steps = 50,  
                    epochs = 3,
                    POS_NEG_SAMPLES = POS_NEG_SAMPLES,
                    BATCH_SIZE = BATCH_SIZE,
                    CUDA = CUDA,
)
# the first file keeps the weights on the device you trained on;
# the second is a CPU copy (redundant if you trained on CPU).
# We copy the tensors instead of calling dis.cpu(), which would move dis itself off the GPU.
torch.save(dis.state_dict(), pretrained_dis_path)
torch.save({k: v.cpu() for k, v in dis.state_dict().items()}, pretrained_dis_path_cpu)

Starting Discriminator Training...
d-step 1 epoch 1 : .......... average_loss = 0.6870, train_acc = 0.5441, val_acc = 0.5150
d-step 1 epoch 2 : .......... average_loss = 0.6596, train_acc = 0.6076, val_acc = 0.5700
d-step 1 epoch 3 : .......... average_loss = 0.6233, train_acc = 0.6573, val_acc = 0.6400
d-step 2 epoch 1 : .......... average_loss = 0.6279, train_acc = 0.6466, val_acc = 0.6300
d-step 2 epoch 2 : .......... average_loss = 0.5985, train_acc = 0.6801, val_acc = 0.6550
d-step 2 epoch 3 : .......... average_loss = 0.5675, train_acc = 0.7090, val_acc = 0.6450
d-step 3 epoch 1 : .......... average_loss = 0.5744, train_acc = 0.7015, val_acc = 0.6350
d-step 3 epoch 2 : .......... average_loss = 0.5436, train_acc = 0.7280, val_acc = 0.6250
d-step 3 epoch 3 : .......... average_loss = 0.5148, train_acc = 0.7490, val_acc = 0.6500
d-step 4 epoch 1 : .......... average_loss = 0.5344, train_acc = 0.7332, val_acc = 0.6650
d-step 4 epoch 2 : .......... average_loss = 0.5061, train_acc = 

d-step 31 epoch 1 : .......... average_loss = 0.1492, train_acc = 0.9579, val_acc = 0.6550
d-step 31 epoch 2 : .......... average_loss = 0.1272, train_acc = 0.9653, val_acc = 0.6750
d-step 31 epoch 3 : .......... average_loss = 0.1086, train_acc = 0.9708, val_acc = 0.6700
d-step 32 epoch 1 : .......... average_loss = 0.1490, train_acc = 0.9578, val_acc = 0.6450
d-step 32 epoch 2 : .......... average_loss = 0.1249, train_acc = 0.9645, val_acc = 0.6600
d-step 32 epoch 3 : .......... average_loss = 0.1087, train_acc = 0.9690, val_acc = 0.6700
d-step 33 epoch 1 : .......... average_loss = 0.1517, train_acc = 0.9588, val_acc = 0.6650
d-step 33 epoch 2 : .......... average_loss = 0.1267, train_acc = 0.9661, val_acc = 0.6650
d-step 33 epoch 3 : .......... average_loss = 0.1112, train_acc = 0.9701, val_acc = 0.6700
d-step 34 epoch 1 : .......... average_loss = 0.1403, train_acc = 0.9619, val_acc = 0.6550
d-step 34 epoch 2 : .......... average_loss = 0.1201, train_acc = 0.9677, val_acc = 0.6750

### Sanity Check 1

At the end of pretraining, the generator should have an oracle sample negative log-likelihood of just under 11:
```
Starting Generator MLE Training...
epoch 100 : .......... average_train_NLL = 4.5203, oracle_sample_NLL = 10.9189
```

This means the generator is good enough that the discriminator is initially correct only about half the time:

```
Starting Discriminator Training...
d-step 1 epoch 1 : .......... average_loss = 0.6870, train_acc = 0.5441, val_acc = 0.5150
.
.
.
d-step 30 epoch 2 : .......... average_loss = 0.1358, train_acc = 0.9609, val_acc = 0.6650
```

The discriminator gains much more on training accuracy than on validation accuracy.
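The numbers in the log can be checked with a little arithmetic. A binary classifier that always outputs a probability of 0.5 has a cross-entropy loss of ln 2, which is very close to the first logged `average_loss`. The sketch below assumes the discriminator loss is binary cross-entropy; the accuracy and loss values are copied from the logged lines above.

```python
import math

# Binary cross-entropy of a classifier that always outputs p = 0.5
chance_loss = -math.log(0.5)
print(round(chance_loss, 4))  # 0.6931

# Values copied from the logged lines above
first_loss = 0.6870                          # d-step 1, epoch 1
late_train_acc, late_val_acc = 0.9609, 0.6650  # d-step 30, epoch 2

# how far the first logged loss is below the chance level
gap_to_chance = chance_loss - first_loss
# how far training accuracy has pulled ahead of validation accuracy
train_val_gap = late_train_acc - late_val_acc
print(round(gap_to_chance, 4), round(train_val_gap, 4))
```

A train/validation gap of this size suggests the discriminator is largely memorizing its training samples rather than learning features that generalize.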

### Load Pretrained Generator and Discriminator

You can either load the CPU version and then send the model to the GPU, or load the GPU version into a model that is already on CUDA.
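If only a GPU checkpoint is available on a CPU-only machine, the `map_location` argument of `torch.load` can remap the stored tensors at load time. The sketch below is a self-contained illustration: it uses a small `nn.Linear` as a stand-in for `gen` or `dis` and an in-memory buffer instead of the `.trc` files used in this tutorial.

```python
import io
import torch
import torch.nn as nn

model = nn.Linear(4, 2)  # stand-in for gen or dis

# save to an in-memory buffer instead of a file on disk
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# map_location remaps every stored tensor to the given device while loading
state = torch.load(buffer, map_location=torch.device('cpu'))

restored = nn.Linear(4, 2)
result = restored.load_state_dict(state)
print(result)  # lists missing and unexpected keys, both empty on success

if torch.cuda.is_available():
    restored = restored.cuda()  # move to the GPU only after loading
```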

In [19]:
# load pretrained generator and discriminator for GPU models
# gen.load_state_dict(torch.load(pretrained_gen_path))
# dis.load_state_dict(torch.load(pretrained_dis_path))

# load pretrained generator and discriminator for CPU models
gen.load_state_dict(torch.load(pretrained_gen_path_cpu))
dis.load_state_dict(torch.load(pretrained_dis_path_cpu))

<All keys matched successfully>

In [22]:
%%time
# sample from generator and compute oracle NLL
oracle_loss = helpers.batchwise_oracle_nll(
    gen, 
    oracle, 
    POS_NEG_SAMPLES, 
    BATCH_SIZE, MAX_SEQ_LEN,
    start_letter=START_LETTER, 
    gpu=CUDA,
)

print('\nInitial Oracle Sample Loss : %.4f' % oracle_loss)


Initial Oracle Sample Loss : 10.9625
CPU times: user 22 s, sys: 41 ms, total: 22.1 s
Wall time: 22.1 s


In [21]:
# ADVERSARIAL TRAINING
print('\nStarting Adversarial Training...')

for epoch in range(ADV_TRAIN_EPOCHS):
    
    print('\n--------\nEPOCH %d\n--------' % (epoch+1))
    sys.stdout.flush()
    
    # TRAIN GENERATOR
    print('\nAdversarial Training Generator : ', end='')
    train_generator_PG(gen, gen_optimizer, oracle, dis, 
                       num_batches = 1,
    )

    # TRAIN DISCRIMINATOR
    print('\nAdversarial Training Discriminator : ')
    train_discriminator(dis, dis_optimizer, oracle_samples, gen, oracle, 
                        d_steps = 50,  
                        epochs = 3,
                        POS_NEG_SAMPLES = POS_NEG_SAMPLES,
                        BATCH_SIZE = BATCH_SIZE,
                        CUDA = CUDA,
    )


Starting Adversarial Training...

--------
EPOCH 1
--------

Adversarial Training Generator :  oracle_sample_NLL = 10.9642

Adversarial Training Discriminator : 


KeyboardInterrupt: 