<a href="https://colab.research.google.com/github/diego-feijo/bertpt/blob/master/Pre_training_ALBERT_from_Wikipedia_using_TPU.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pre-training ALBERT from Wikipedia using TPU

This notebook shows how to pre-train an [ALBERT](https://github.com/google-research/ALBERT) model on a Wikipedia dump using the free Colab TPU v2.

These are the steps to follow:

1. Setting Up the Environment
2. Download and Prepare Data
3. Extract Raw Text
4. Build SentencePiece Model
5. Generate pre-training data
6. Train the model

Colab TPU requires a [Google Cloud Storage bucket](https://cloud.google.com/tpu/docs/quickstart). New users have [$300 free credit](https://cloud.google.com/free/) for one year. 

After each step, we save persistent data so we can always stop and resume from the last finished step. If you need to resume, run the first step again and then go straight to the step you need to continue.

**Notes** 

1. You **need** to set the same BUCKET_NAME in steps 3-6.
2. Steps 5 and 6 can take several hours to run. Google Colab interrupts a session after 8 hours of running, so you will need to resume from the last finished step.

**Credits**: This tutorial was adapted from https://towardsdatascience.com/pre-training-bert-from-scratch-with-cloud-tpu-6e2f71028379

MIT License

Copyright (c) [2019] [Diego de Vargas Feijo]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

## Step 1: Setting Up the Environment
Install dependencies, import the packages used throughout the notebook, and authenticate with your Google account to access the Colab TPU.

In [0]:
!pip install --upgrade -q sentencepiece

import json
import logging
import nltk
import os
#import random
import sentencepiece as spm
import sys
import tensorflow as tf

#from glob import glob
from google.colab import auth

auth.authenticate_user()
  
# configure logging
log = logging.getLogger('tensorflow')
log.setLevel(logging.INFO)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s:  %(message)s')
sh = logging.StreamHandler()
sh.setLevel(logging.INFO)
sh.setFormatter(formatter)
log.handlers = [sh]

if 'COLAB_TPU_ADDR' in os.environ:
  log.info("Using TPU runtime")
  USE_TPU = True
  TPU_ADDRESS = 'grpc://' + os.environ['COLAB_TPU_ADDR']

  with tf.Session(TPU_ADDRESS) as session:
    log.info('TPU address is ' + TPU_ADDRESS)
    # Upload credentials to TPU.
    with open('/content/adc.json', 'r') as f:
      auth_info = json.load(f)
    tf.contrib.cloud.configure_gcs(session, credentials=auth_info)
    
else:
  log.warning('Not connected to TPU runtime')
  USE_TPU = False

Clone and Patch ALBERT sources

Some ALBERT sources use deprecated APIs that generate a lot of warnings. We also make some minor changes so the scripts run smoothly on Colab.

In [0]:
# Clone the repository
!test -d ALBERT || git clone https://github.com/google-research/ALBERT.git ALBERT

# Avoid deprecated warnings
!sed -i 's/tf.logging/tf.compat.v1.logging/' ALBERT/*.py
!sed -i 's/tf.app.run/tf.compat.v1.app.run/' ALBERT/*.py

# Avoid error when the line contains only one number
!sed -i 's/i.lower()/str(i).lower()/' ALBERT/create_pretraining_data.py

# Create Dummy flag (Colab Bug)
!sed -i 's/FLAGS = flags.FLAGS/FLAGS=flags.FLAGS\n\nflags.DEFINE_string("f", "", "Dummy flag. Not used.")/' ALBERT/run_pretraining.py

# Mute too verbose output
!sed -i 's/tf.compat.v1.logging.info/# tf.compat.v1.logging.info/' ALBERT/tokenization.py
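
To see what these `sed` substitutions do to a source line, here is a minimal Python sketch. It assumes that `sed 's/A/B/'` without the `g` flag rewrites only the first match on each line and that an unescaped `.` matches any character. The helper name `sed_first` is hypothetical and not part of ALBERT.

```python
import re

def sed_first(pattern, replacement, line):
  """Mimic sed 's/pattern/replacement/': replace only the first match in a line."""
  return re.sub(pattern, replacement, line, count=1)

line = 'tf.logging.info("a"); tf.logging.info("b")'
patched = sed_first(r'tf.logging', 'tf.compat.v1.logging', line)
print(patched)

# The unescaped dot matches any character, so this line is rewritten as well
print(sed_first(r'tf.logging', 'tf.compat.v1.logging', 'tfxlogging'))
```

If a source line ever contained two calls, only the first would be patched; adding the `g` flag to the `sed` expression would cover both.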

In [0]:
if 'ALBERT' not in sys.path:
  sys.path += ['ALBERT']

import modeling, optimization, tokenization

## Step 2: Download and Prepare Data

The Wikipedia dump is available in XML format, so we need to extract the raw text from it. Any language can be used; the list below contains the most frequent ones ([sorted by the number of active users](https://en.wikipedia.org/wiki/List_of_Wikipedias)):


In [0]:
LANG = "pt"  #@param ['en', 'fr', 'de', 'es', 'ja', 'ru', 'it', 'zh', 'pt', 'ar', 'fa', 'pl', 'nl', 'id', 'uk', 'he', 'sv', 'cs', 'ko', 'vi', 'ca', 'no', 'fi', 'hu', 'th', 'el', 'hi', 'bn', 'ceb', 'tr', 'ro', 'sw', 'kk', 'da', 'eo', 'sr', 'lt', 'sk', 'bg', 'min', 'sl', 'eu', 'et', 'hr', 'te', 'nn', 'gl']

The latest Wikipedia dump can be from day 1 or day 20 of the month, but the date on which a dump finishes varies. Instead of parsing the dump index page, we guess which dump is ready based on today's date.

In [0]:
import datetime

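def get_last_dump():
  today = datetime.datetime.now()
  year, month = today.year, today.month

  if 8 < today.day < 25:
    # The dump started on day 1 should be finished by now
    day = 1
  elif today.day >= 25:
    # The dump started on day 20 should be finished by now
    day = 20
  else:
    # Early in the month: fall back to the previous month's dump
    day = 1
    month -= 1
    if month == 0:
      # January rolls back to December of the previous year
      month = 12
      year -= 1
  return '{}{:02d}{:02d}'.format(year, month, day)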
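
The guessing rule can be checked against fixed dates. The sketch below restates it as a hypothetical helper `guess_dump_date` that takes the date as an argument (an assumption made so it can be tested) and rolls January back to December of the previous year.

```python
import datetime

def guess_dump_date(today):
  """Return the guessed dump date (YYYYMMDD) for a given datetime.date."""
  year, month = today.year, today.month
  if 8 < today.day < 25:
    day = 1
  elif today.day >= 25:
    day = 20
  else:
    # Early in the month: use the previous month's dump
    day = 1
    month -= 1
    if month == 0:
      month, year = 12, year - 1
  return '{}{:02d}{:02d}'.format(year, month, day)

for d in (datetime.date(2020, 3, 15),
          datetime.date(2020, 3, 28),
          datetime.date(2020, 1, 5)):
  print(d, '->', guess_dump_date(d))
```

The guess can still miss when a dump is delayed; in that case the download in the next cell fails and an older date has to be set by hand.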
  

In [0]:
last_dump = get_last_dump()
corpus = tf.keras.utils.get_file(
    "{}wiki.bz2".format(LANG),
    "https://dumps.wikimedia.org/{}wiki/{}/{}wiki-{}-pages-articles-multistream.xml.bz2".format(
        LANG,
        last_dump,
        LANG,
        last_dump
    ))
!bzip2 -d {corpus}

## Step 3: Extract Raw Text


In [0]:
BUCKET_NAME = "<Insert Bucket Name Here>" # @param {type: "string"}

Use WikiExtractor to strip the XML markup and keep only the raw text.

In [0]:
!test -d wikiextractor || git clone https://github.com/attardi/wikiextractor.git

In [0]:
WIKI_INPUT_FILE, _ = os.path.splitext(corpus)
WIKI_EXTRACTED_DIR = "wikimedia" 

tf.io.gfile.makedirs(WIKI_EXTRACTED_DIR)
!python3 wikiextractor/WikiExtractor.py -q -c -o {WIKI_EXTRACTED_DIR} {WIKI_INPUT_FILE}

In [0]:
import nltk
nltk.download('punkt')

# Language of the NLTK Punkt sentence tokenizer to load
# (it must match a model shipped in the 'punkt' package)
LANG_STM = "portuguese" # @param ['danish', 'english', 'finnish', 'french', 'german', 'hungarian', 'italian', 'norwegian', 'porter', 'portuguese', 'romanian', 'russian', 'spanish', 'swedish']

sent_tokenizer = nltk.data.load('tokenizers/punkt/{}.pickle'.format(LANG_STM))

Prepare input for create_pretraining_data script.

Input file format requires:
- One sentence per line. These should ideally be actual sentences, not entire paragraphs or arbitrary spans of text. (Because we use the sentence boundaries for the "next sentence prediction" task).
- Blank lines between documents. Document boundaries are needed so that the "next sentence prediction" task doesn't span between documents.
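
The two format rules above can be illustrated with a small, self-contained sketch. It uses a naive regular-expression splitter as a stand-in for the Punkt tokenizer, and the helper name `to_pretraining_text` is hypothetical; the notebook itself uses NLTK for sentence splitting.

```python
import re

def to_pretraining_text(documents):
  """Format documents as one sentence per line, with a blank line after each document."""
  lines = []
  for doc in documents:
    # Naive splitter: break after '.', '!' or '?' followed by whitespace
    sentences = re.split(r'(?<=[.!?])\s+', doc.strip())
    lines.extend(s for s in sentences if s)
    lines.append('')  # blank line marks the document boundary
  return '\n'.join(lines) + '\n'

sample = to_pretraining_text([
    'First sentence. Second sentence.',
    'Another document. It has two sentences.',
])
print(sample)
```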


In [0]:
import bz2

PRC_DATA_FPATH = "proc_wikimedia.txt" #@param {type: "string"}

with open(PRC_DATA_FPATH, "w", encoding="utf-8") as fo:
  for group in os.listdir(WIKI_EXTRACTED_DIR):
    basedir = os.path.join(WIKI_EXTRACTED_DIR, group)
    for fz in os.listdir(basedir):
      if fz.endswith('.bz2'):
        with bz2.BZ2File(os.path.join(basedir, fz), 'r') as fi:
          contents = fi.read()
        is_title = False
        text = contents.decode('utf-8')
        for l in text.splitlines():
          # print (l)
          if l.startswith('</doc>'):
            # Empty line for each new document
            fo.write("\n")
          elif not l.strip():
            # Empty lines must be ignored
            pass
          elif l.startswith('<doc'):
            # After the heading, there is a title
            is_title = True
          elif is_title:
            # Ignore this line, reset variable
            is_title = False
          else:
            # Wikipedia puts multiple sentences on one line,
            # so we split them into one sentence per line
            sentences = sent_tokenizer.tokenize(l)
            for sentence in sentences:
              fo.write(sentence + "\n")


In [0]:
!head {PRC_DATA_FPATH}
!test -f {PRC_DATA_FPATH}.gz || gzip < {PRC_DATA_FPATH} > {PRC_DATA_FPATH}.gz
tf.gfile.MakeDirs("gs://{}/datasets/".format(BUCKET_NAME))
!gsutil -m cp {PRC_DATA_FPATH}.gz gs://{BUCKET_NAME}/datasets/

## Step 4: Build SentencePiece Model
In this step we generate the config files and the encoder that converts text to integers.

In [0]:
BUCKET_NAME = "<Insert Bucket Name Here>" # @param {type: "string"}

MODEL_DIR = "albert_cased_L-12_H-768_A-12" #@param {type: "string"}
VOC_SIZE = 30000 #@param {type:"integer"}

!test -f {PRC_DATA_FPATH}.gz || gsutil -m cp gs://{BUCKET_NAME}/datasets/{PRC_DATA_FPATH}.gz .
!test -f {PRC_DATA_FPATH} || gzip -d < {PRC_DATA_FPATH}.gz > {PRC_DATA_FPATH}

Build the ALBERT configuration for the Base model. Reference models for each size:
- Base Model: https://tfhub.dev/google/albert_base/2
- Large Model: https://tfhub.dev/google/albert_large/2
- X-Large Model: https://tfhub.dev/google/albert_xlarge/2
- XX-Large Model: https://tfhub.dev/google/albert_xxlarge/2

It is not feasible to create pre-training data for models bigger than Large using Colab. 

In [0]:
# use this for ALBERT-base
albert_config = {
  "attention_probs_dropout_prob": 0.1, 
  "hidden_act": "gelu", 
  "hidden_dropout_prob": 0.1, 
  "embedding_size": 128,
  "hidden_size": 768, 
  "initializer_range": 0.02, 
  "intermediate_size": 3072, 
  "max_position_embeddings": 512, 
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "num_hidden_groups": 1,
  "net_structure_type": 0,
  "gap_size": 0, 
  "num_memory_blocks": 0, 
  "inner_group_num": 1,
  "down_scale_factor": 1,
  "type_vocab_size": 2,
  "vocab_size": VOC_SIZE
}

with open("albert_config.json", "w") as fo:
  json.dump(albert_config, fo, indent=2)
!gsutil -m cp albert_config.json gs://{BUCKET_NAME}/{MODEL_DIR}/
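
The `embedding_size` of 128 is smaller than the `hidden_size` of 768 because ALBERT factorizes the word embedding into two smaller matrices. The sketch below compares the number of word-embedding parameters with and without this factorization for the values in the config above. It ignores biases and the position and segment embeddings, and both function names are hypothetical.

```python
def factorized_embedding_params(vocab_size, embedding_size, hidden_size):
  """Parameters of a vocab x E embedding followed by an E x H projection."""
  return vocab_size * embedding_size + embedding_size * hidden_size

def direct_embedding_params(vocab_size, hidden_size):
  """Parameters of a single vocab x H embedding matrix."""
  return vocab_size * hidden_size

factorized = factorized_embedding_params(30000, 128, 768)
direct = direct_embedding_params(30000, 768)
print(factorized, direct)
```

With these values the factorized embedding needs roughly one sixth of the parameters, which is why increasing `VOC_SIZE` is cheaper here than in a model whose embedding size equals its hidden size.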

[SentencePiece](https://github.com/google/sentencepiece) will be used to encode the text.

We need to train the SentencePiece model and build our vocabulary. This requires a lot of RAM: even the 35GB offered by Colab may not be enough if all the raw text is used. We use SUBSAMPLE_SIZE to limit how many sentences are sampled for training, and therefore how much memory is used.

If you run out of memory, reduce SUBSAMPLE_SIZE.

The monolingual BERT and ALBERT papers use a VOC_SIZE of 30000. The multilingual BERT model uses a larger vocabulary of roughly 120000 tokens. It is not clear whether increasing VOC_SIZE improves the model.

NUM_PLACEHOLDERS reserves vocabulary slots that can be put to use after pre-training, during fine-tuning.

In [0]:
PRC_DATA_FPATH = "proc_wikimedia.txt"  #@param {type: "string"}
MODEL_PREFIX = 'tokenizer' #@param {type:"string"}
SUBSAMPLE_SIZE = 10000000 #@param {type:"integer"}
# Number of reserved tokens at the end of the vocabulary.
# This should only be used when the training data contains a small set of
# very frequent tokens.
NUM_PLACEHOLDERS = 0 #@param {type:"integer"}

SPM_COMMAND = ('--input={} --model_prefix={} '
               '--vocab_size={} --input_sentence_size={} '
               '--shuffle_input_sentence=true ' 
               '--pad_piece=[PAD] '
               '--unk_piece=[UNK] '
               '--pad_id=0 --unk_id=1 --user_defined_symbols=[CLS],[SEP],[MASK] ' 
               '--bos_id=-1 --eos_id=-1 ').format(
               PRC_DATA_FPATH, MODEL_PREFIX, 
               VOC_SIZE - NUM_PLACEHOLDERS, SUBSAMPLE_SIZE)

Training SentencePiece may take several minutes to run.

In [0]:
spm.SentencePieceTrainer.Train(SPM_COMMAND)

In [0]:
!test -f {MODEL_PREFIX}.tar.gz && rm {MODEL_PREFIX}.tar.gz
!tar czvf {MODEL_PREFIX}.tar.gz {MODEL_PREFIX}.*
tf.gfile.MakeDirs("gs://{}/{}/".format(BUCKET_NAME, MODEL_DIR))
!gsutil -m cp {MODEL_PREFIX}.tar.gz gs://{BUCKET_NAME}/{MODEL_DIR}/

Let's see the first tokens from the vocabulary:

In [0]:
VOC_FNAME = "{}.vocab".format(MODEL_PREFIX)
MDL_FNAME = "{}.model".format(MODEL_PREFIX)

!head {VOC_FNAME}
!wc -l {VOC_FNAME}
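
One way to confirm that the reserved pieces landed on the expected ids is to parse the `.vocab` file, which stores one `piece<TAB>score` pair per line. The sketch below runs on a few illustrative lines rather than real output. The expected order (`[PAD]`, `[UNK]`, then the user-defined symbols) is an assumption derived from the ids passed in SPM_COMMAND, and `read_vocab` is a hypothetical helper.

```python
def read_vocab(lines):
  """Parse SentencePiece .vocab lines ("piece<TAB>score") into a list of pieces."""
  return [line.rstrip('\n').split('\t')[0] for line in lines if line.strip()]

# Illustrative lines only, not taken from a trained model
sample_lines = ['[PAD]\t0\n', '[UNK]\t0\n', '[CLS]\t0\n',
                '[SEP]\t0\n', '[MASK]\t0\n', '\u2581de\t-3.1\n']
pieces = read_vocab(sample_lines)
special_ids = {p: i for i, p in enumerate(pieces)
               if p.startswith('[') and p.endswith(']')}
print(special_ids)
```

With the real file, the same helper could be called as `read_vocab(open(VOC_FNAME, encoding='utf-8'))`.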

Let's check how SentencePiece tokenizes a sentence. You may replace it with a sentence in the language you are using.

In [0]:
testcase = "[CLS] [MASK] Sentença de mérito. [SEP] Embargos de declaração 普通话.[SEP]"

bert_tokenizer = tokenization.FullTokenizer(VOC_FNAME, do_lower_case=False, spm_model_file=MDL_FNAME)
tokens = bert_tokenizer.tokenize(testcase)
ids = bert_tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(ids)

Everything is working as expected.

## Step 5: Generate Pre-training Data
Pre-training data is a collection of TFRecord files: binary files in which the text is SentencePiece-encoded and the masking and auxiliary vectors have already been built.

So, a text:

- "This is just one of many samples. This is a sentence that follows."

would be converted to:
- "2 1700 23 45 ... 67 89 ... 0 0 0"


The create_pretraining_data script inserts the special tokens ('\[CLS\]', '\[SEP\]', '\[UNK\]', '\[PAD\]').
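
The conversion above can be sketched for a single sentence pair. This is a simplified illustration, not the logic of the ALBERT script: it skips masking, assumes the ids 0, 2 and 3 for `[PAD]`, `[CLS]` and `[SEP]` (the values implied by the SentencePiece settings in Step 4), and uses made-up token ids. `pack_pair` is a hypothetical helper.

```python
def pack_pair(ids_a, ids_b, max_seq_length, cls_id=2, sep_id=3, pad_id=0):
  """Build input_ids, input_mask and segment_ids for one sentence pair."""
  input_ids = [cls_id] + ids_a + [sep_id] + ids_b + [sep_id]
  # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers the rest
  segment_ids = [0] * (len(ids_a) + 2) + [1] * (len(ids_b) + 1)
  if len(input_ids) > max_seq_length:
    raise ValueError('pair is longer than max_seq_length')
  input_mask = [1] * len(input_ids)
  padding = max_seq_length - len(input_ids)
  return (input_ids + [pad_id] * padding,
          input_mask + [0] * padding,
          segment_ids + [0] * padding)

input_ids, input_mask, segment_ids = pack_pair([1700, 23], [67, 89], max_seq_length=10)
print(input_ids)
print(input_mask)
print(segment_ids)
```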

In [0]:
BUCKET_NAME = "<Insert Bucket Name Here>" # @param {type: "string"}

MODEL_DIR = 'albert_cased_L-12_H-768_A-12' #@param {type:"string"}
MODEL_PREFIX = 'tokenizer' #@param {type:"string"}
VOC_FNAME = "{}.vocab".format(MODEL_PREFIX)
MDL_FNAME = "{}.model".format(MODEL_PREFIX)
PRC_DATA_FPATH = "proc_wikimedia.txt" #@param {type:"string"}

!test -f {PRC_DATA_FPATH}.gz || gsutil -m cp gs://{BUCKET_NAME}/datasets/{PRC_DATA_FPATH}.gz .
!test -f {PRC_DATA_FPATH} || gzip -d < {PRC_DATA_FPATH}.gz > {PRC_DATA_FPATH}
!test -f {MODEL_PREFIX}.tar.gz || gsutil -m cp gs://{BUCKET_NAME}/{MODEL_DIR}/{MODEL_PREFIX}.tar.gz .
!test -f {VOC_FNAME} || tar xzvf {MODEL_PREFIX}.tar.gz

In [0]:
DEMO_MODE = True # @param {type: "boolean"}

# Reduce the number of lines to train faster
if DEMO_MODE:
  !head -1000000 {PRC_DATA_FPATH} > {PRC_DATA_FPATH}.tmp
  !mv {PRC_DATA_FPATH}.tmp {PRC_DATA_FPATH}

Since our corpus can be large, we will split it into shards:

In [0]:
!rm -rf ./shards
!mkdir ./shards
!split -a 4 -l 256000 -d {PRC_DATA_FPATH} ./shards/shard_
!ls ./shards/
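
To estimate how many shards the `split` call produces before running it, the arithmetic can be sketched in Python. The helper `shard_names` is hypothetical; it mirrors the options used above (`-l 256000` lines per file, `-a 4 -d` for four-digit numeric suffixes).

```python
import math

def shard_names(num_lines, lines_per_shard=256000, prefix='shard_', width=4):
  """List the file names expected from splitting num_lines lines into shards."""
  count = math.ceil(num_lines / lines_per_shard)
  return [prefix + str(i).zfill(width) for i in range(count)]

names = shard_names(1000000)
print(len(names), names)
```

With DEMO_MODE enabled the corpus is cut to 1,000,000 lines, which gives four shards; the last one is shorter than the others.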

The **MAX_SEQ_LENGTH** (maximum sequence length) supported by the model is 512, but training with it is much slower because the cost of self-attention grows quadratically with the sequence length. The BERT authors trained for 90% of the steps with length 128 and the remaining 10% with length 512.

To reproduce this schedule, create the pre-training data and train with length 128 first, then create the pre-training data again with length 512 and change the configuration accordingly.

The **DUPE_FACTOR** defines how many times each sequence is used. Each copy is masked randomly, so it is a good use of the data to have as many duplicates as possible. However, values larger than 20 may generate files larger than 1GB per shard. Larger files do not slow down pre-training, but they require a lot of storage space.
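
When moving from length 128 to 512, the number of masked positions per sequence changes as well. The sketch below assumes the masking budget follows the rule used in BERT's reference script (a share of the tokens, at least one, capped by the maximum number of predictions); the ALBERT script may differ in details. Both function names are hypothetical.

```python
def num_masked(seq_len, masked_lm_prob=0.15, max_predictions=20):
  """Masked positions per sequence: a share of the tokens, capped by max_predictions."""
  return min(max_predictions, max(1, int(round(seq_len * masked_lm_prob))))

def relative_attention_cost(seq_len, base_len=128):
  """Self-attention cost relative to base_len, assuming quadratic growth."""
  return (seq_len / base_len) ** 2

print(num_masked(128))                       # within the cap of 20
print(num_masked(512))                       # limited by the cap of 20
print(num_masked(512, max_predictions=77))   # cap raised to fit 15% of 512
print(relative_attention_cost(512))
```

Under this assumption, a MAX_PREDICTIONS of about MAX_SEQ_LENGTH * MASKED_LM_PROB (around 77 for length 512) keeps the cap from discarding masked positions.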


In [0]:
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param {type: "number"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
# Strip diacritics and Lowercase
DO_LOWER_CASE = False #@param {type:"boolean"}
DO_WHOLE_WORD_MASK = True #@param {type:"boolean"}
PROCESSES = 2 #@param {type:"integer"}
PRETRAINING_DIR = "gs://{}/{}/pretraining_data_{}".format(BUCKET_NAME, MODEL_DIR, MAX_SEQ_LENGTH)
DUPE_FACTOR = 4 #@param {type:"integer"}

Confirm where the training files will be written.

In [0]:
PRETRAINING_DIR

Let's use as many cores as are available. We need to call the *create_pretraining_data.py* script for each shard, and we use the *xargs* command to run these calls in parallel.

This step will take a while to run. The data generated from each shard is saved to permanent storage.

If you need to resume this step, you can check the bucket for generated files and manually delete the local shards that were already generated.
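
The manual check described above can be sketched as a small helper that compares local shard names with the `.tfrecord` files already present in the bucket. The name `pending_shards` and the bucket path in the example are hypothetical. A file that was only partially written before an interruption would still count as finished, so the most recent output is worth regenerating.

```python
import os

def pending_shards(local_shards, done_tfrecords):
  """Return shard names that have no matching <name>.tfrecord output yet."""
  suffix = '.tfrecord'
  done = {os.path.basename(p)[:-len(suffix)]
          for p in done_tfrecords if p.endswith(suffix)}
  return sorted(s for s in local_shards if s not in done)

local = ['shard_0000', 'shard_0001', 'shard_0002']
finished = ['gs://my-bucket/model/pretraining_data_128/shard_0000.tfrecord']
todo = pending_shards(local, finished)
print(todo)
```

In the notebook, `local` could come from `os.listdir('./shards')` and `finished` from `tf.gfile.Glob(PRETRAINING_DIR + '/*.tfrecord')`.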

In [0]:
XARGS_CMD = ('ls ./shards | '
      'xargs -n 1 -P {} -I{} '
      'python3 ALBERT/create_pretraining_data.py '
      '--input_file=./shards/{} '
      '--output_file={}/{}.tfrecord '
      '--vocab_file={} '
      '--spm_model_file={} '
      '--do_lower_case={} '
      '--do_whole_word_mask={} '
      '--max_predictions_per_seq={} '
      '--max_seq_length={} '
      '--masked_lm_prob={} '
      '--dupe_factor={} ')
XARGS_CMD = XARGS_CMD.format(PROCESSES, '{}', '{}',
                             PRETRAINING_DIR, '{}', 
                             VOC_FNAME, MDL_FNAME, DO_LOWER_CASE,
                             DO_WHOLE_WORD_MASK, MAX_PREDICTIONS, 
                             MAX_SEQ_LENGTH, MASKED_LM_PROB, DUPE_FACTOR)
!$XARGS_CMD

## Step 6: Training the Model

If you need to resume from an interrupted training, you may skip steps 2-5 and proceed from here.

In [0]:
BUCKET_NAME = "<Insert Bucket Name Here>" # @param {type: "string"}

MODEL_DIR = 'albert_cased_L-12_H-768_A-12' #@param {type:"string"}

# Input data pipeline config
TRAIN_BATCH_SIZE = 256 #@param {type:"integer"}
MAX_PREDICTIONS = 20 #@param {type:"integer"}
MAX_SEQ_LENGTH = 128 #@param {type:"integer"}
MASKED_LM_PROB = 0.15 #@param {type:"number"}

PRETRAINING_DIR = "pretraining_data_{}".format(MAX_SEQ_LENGTH)

# Training procedure config
EVAL_BATCH_SIZE = 64
LEARNING_RATE = 0.00176
TRAIN_STEPS = 175000 #@param {type:"integer"}
SAVE_CHECKPOINTS_STEPS = 5000 #@param {type:"integer"}
NUM_TPU_CORES = 8

if BUCKET_NAME:
  BUCKET_PATH = "gs://{}".format(BUCKET_NAME)
else:
  BUCKET_PATH = "."

ALBERT_GCS_DIR = "{}/{}".format(BUCKET_PATH, MODEL_DIR)
DATA_GCS_DIR = "{}/{}".format(ALBERT_GCS_DIR, PRETRAINING_DIR)

CONFIG_FILE = os.path.join(ALBERT_GCS_DIR, "albert_config.json")

INIT_CHECKPOINT = tf.train.latest_checkpoint(ALBERT_GCS_DIR)

albert_config = modeling.AlbertConfig.from_json_file(CONFIG_FILE)
input_files = tf.gfile.Glob(os.path.join(DATA_GCS_DIR,'*tfrecord'))

log.info("Using checkpoint: {}".format(INIT_CHECKPOINT))
log.info("Using {} data shards".format(len(input_files)))

Prepare the training run configuration, build the estimator and input function, power up the bass cannon.

In [0]:
from run_pretraining import input_fn_builder, model_fn_builder


model_fn = model_fn_builder(
      albert_config=albert_config,
      init_checkpoint=INIT_CHECKPOINT,
      learning_rate=LEARNING_RATE,
      num_train_steps=TRAIN_STEPS,
      num_warmup_steps=3125,
      use_tpu=USE_TPU,
      optimizer="lamb",
      poly_power=1.0,
      start_warmup_step=0,
      use_one_hot_embeddings=USE_TPU)

tpu_cluster_resolver = tf.contrib.cluster_resolver.TPUClusterResolver(TPU_ADDRESS)

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    model_dir=ALBERT_GCS_DIR,
    save_checkpoints_steps=SAVE_CHECKPOINTS_STEPS,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=SAVE_CHECKPOINTS_STEPS,
        num_shards=NUM_TPU_CORES,
        per_host_input_for_training=tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2))

estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=USE_TPU,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=TRAIN_BATCH_SIZE,
    eval_batch_size=EVAL_BATCH_SIZE)
  
train_input_fn = input_fn_builder(
        input_files=input_files,
        max_seq_length=MAX_SEQ_LENGTH,
        max_predictions_per_seq=MAX_PREDICTIONS,
        is_training=True)
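
The warm-up and decay arguments passed to `model_fn_builder` shape the learning rate over the run. The sketch below assumes the common BERT-style schedule: a linear warm-up over `num_warmup_steps`, then a polynomial decay to zero with exponent `poly_power`. The actual schedule is implemented in ALBERT's `optimization.py` and may differ in details; `learning_rate_at` is a hypothetical helper.

```python
def learning_rate_at(step, init_lr=0.00176, train_steps=175000,
                     warmup_steps=3125, poly_power=1.0):
  """Linear warm-up followed by polynomial decay to zero."""
  if step < warmup_steps:
    return init_lr * step / warmup_steps
  progress = min(step, train_steps) / train_steps
  return init_lr * (1.0 - progress) ** poly_power

for step in (0, 1000, 3125, 87500, 175000):
  print(step, learning_rate_at(step))
```

With `poly_power=1.0` the decay is linear, so the rate at the halfway point is half of LEARNING_RATE.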

Start real pre-training

In [0]:
estimator.train(input_fn=train_input_fn, max_steps=TRAIN_STEPS)

Training the model with the default parameters for 175k steps will take ~20 hours. 

In case the kernel is restarted, you may always continue training from the latest checkpoint. 
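
To see how much work is left after a restart, the step number can be read from the checkpoint path. The sketch assumes the Estimator naming convention in which the path ends with `-<global step>` (for example `model.ckpt-35000`); `remaining_steps` and the example path are hypothetical.

```python
import re

def remaining_steps(checkpoint_path, train_steps):
  """Steps left for a checkpoint path like '.../model.ckpt-35000' (None = fresh start)."""
  if checkpoint_path is None:
    return train_steps
  match = re.search(r'-(\d+)$', checkpoint_path)
  if match is None:
    raise ValueError('cannot parse step from {!r}'.format(checkpoint_path))
  return max(0, train_steps - int(match.group(1)))

left = remaining_steps('gs://my-bucket/model/model.ckpt-35000', 175000)
print(left)
```

In the notebook, the first argument could be the INIT_CHECKPOINT value returned by `tf.train.latest_checkpoint`.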