Tensor2tensor (t2t) uses subwords as tokens by default, which results in better performance. It also measures the length of training in steps rather than epochs. Converting between the number of steps and the number of epochs depends on the effective batch size (i.e. `effective_batch_size = batch_size * num_of_gpus`) and the number of subwords in the training set, such that `epochs = steps * effective_batch_size / training_subwords`.

However, rather unhelpfully, t2t does not report the number of subwords in our dataset or in an individual sample, so to convert between steps and epochs we must count the subwords ourselves using the vocabulary.
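The conversion described above can be sketched as a pair of helper functions. This is a minimal sketch, not part of t2t: the function names are hypothetical, the numbers in the usage lines are purely illustrative, and it assumes (as the formula above does) that `batch_size` is measured in subwords per GPU.

```python
def subwords_per_step(batch_size, num_of_gpus):
    """Effective batch size: subwords consumed per training step across all GPUs."""
    return batch_size * num_of_gpus


def steps_to_epochs(steps, training_subwords, batch_size, num_of_gpus):
    """epochs = steps * effective_batch_size / training_subwords"""
    return steps * subwords_per_step(batch_size, num_of_gpus) / training_subwords


def epochs_to_steps(epochs, training_subwords, batch_size, num_of_gpus):
    """Inverse of steps_to_epochs: steps = epochs * training_subwords / effective_batch_size"""
    return epochs * training_subwords / subwords_per_step(batch_size, num_of_gpus)


# Illustrative numbers only (not taken from any dataset).
print(epochs_to_steps(1, 1_000_000, 4096, 4))
print(steps_to_epochs(100, 1_000_000, 4096, 4))
```

Both functions return floats, so rounding (or truncating with `int`) is left to the caller.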

In [1]:
from tensor2tensor.data_generators import text_encoder
import subprocess as sp

In [2]:
vocab_filepath = '../data/t2t_experiments/transformer/low_resource/full_context/data/vocab.mimic_discharge_summaries.32768.subwords'
vocab = text_encoder.SubwordTextEncoder(vocab_filepath)

W0904 18:32:46.797372 139836362250048 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/envs/tf/lib/python3.7/site-packages/tensor2tensor/data_generators/text_encoder.py:938: The name tf.gfile.Exists is deprecated. Please use tf.io.gfile.exists instead.

W0904 18:32:46.798849 139836362250048 deprecation_wrapper.py:119] From /home/aa5118/anaconda3/envs/tf/lib/python3.7/site-packages/tensor2tensor/data_generators/text_encoder.py:940: The name tf.gfile.Open is deprecated. Please use tf.io.gfile.GFile instead.



In [3]:
text = sp.getoutput('head -1 ../data/preprocessed/src-train.txt')

In [4]:
# encode() returns one id per subword, so its length is the subword count
print(len(vocab.encode(text)))

510


510 subwords in that example, the first input context in our training set. Let's use this method to work out the total number of subwords in our training dataset.

In [6]:
total_subword_count = 0
max_512_subword_count = 0   # total if every example is truncated to 512 subwords
max_1024_subword_count = 0  # total if every example is truncated to 1024 subwords

with open("../data/preprocessed/low_resource/src-train.txt", "r") as f:
    for line in f:
        # encode() returns one id per subword, so its length is the subword count
        count = len(vocab.encode(line))
        total_subword_count += count
        max_512_subword_count += min(512, count)
        max_1024_subword_count += min(1024, count)

print('{:,}'.format(total_subword_count))
print('{:,}'.format(max_512_subword_count))
print('{:,}'.format(max_1024_subword_count))

2,357,469
1,726,779
2,196,080


The run above counts ~2.36m subwords in the low-resource source training file; the source training file consisting of ~40k discharge summaries contains ~25.7m subwords. We can now use these counts to calculate how many training steps correspond to one epoch.

In [7]:
num_of_gpus = 4
batch_size = 4096
effective_batch_size = num_of_gpus * batch_size
epochs = 1

def epoch2steps(subword_count):
    # Equivalent to: steps = epochs * subword_count / effective_batch_size
    steps = epochs / (effective_batch_size / subword_count)
    print("1 epoch corresponds to", '{:,}'.format(int(steps)), "steps")

In [8]:
epoch2steps(total_subword_count)
epoch2steps(max_512_subword_count)
epoch2steps(max_1024_subword_count)

1 epoch corresponds to 143 steps
1 epoch corresponds to 105 steps
1 epoch corresponds to 134 steps


Liu et al. (2018) use 400,000 steps when training their transformer model. With this setup, this would correspond to:

In [27]:
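# `steps` must hold the number of steps per epoch. The output below (255) is
# consistent with the ~25.7m-subword count: 25.7m / 16,384 ≈ 1,569 steps per epoch.
print('{0:,}'.format(round(400000 / steps)), "epochs")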

255 epochs


However, this is not a valid comparison because we focus only on discharge summaries, whereas Liu et al. were looking at the entire dataset of notes, of which discharge summaries are only a small percentage.

Another important consideration is that Liu et al. (2018) truncate both input and output tokens (subwords) to 512. This ends up removing ~7m subwords from our input (in the low-resource run above, 630,690 of 2,357,469 subwords, or ~27%). Doubling the limit to 1024 is worth considering, as this removes only ~2m subwords (161,389, or ~7%, in the low-resource run).
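The effect of a truncation limit can be checked directly from per-example subword counts. The following is a minimal sketch: `truncation_loss` is a hypothetical helper, and the counts in the usage lines are made up for illustration rather than taken from the dataset.

```python
def truncation_loss(subword_counts, limit):
    """Return (removed, fraction): how many subwords are dropped when every
    example is truncated to at most `limit` subwords, and what share of the
    total that represents."""
    total = sum(subword_counts)
    kept = sum(min(limit, count) for count in subword_counts)
    removed = total - kept
    fraction = removed / total if total else 0.0
    return removed, fraction


# Made-up per-example counts, for illustration only.
counts = [300, 600, 1500]
for limit in (512, 1024):
    removed, fraction = truncation_loss(counts, limit)
    print(limit, removed, round(fraction, 3))
```

Collecting the per-example counts in a list (rather than only running totals) makes it cheap to try other limits without re-encoding the file.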