# Finetune BERT-Bahasa

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/finetune/bert](https://github.com/huseinzol05/Malaya/tree/master/finetune/bert).
    
</div>

This notebook shows how to finetune the pretrained BERT-Bahasa model using TensorFlow Estimator.

TF-Estimator is a TensorFlow module well suited to long training runs, because it handles checkpointing and resuming from the model directory for you.

In [1]:
# !pip3 install bert-tensorflow==1.0.1 tensorflow==1.15

### Download pretrained model

The available checkpoints are listed at https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/bert#download. In this example we use the BASE size. Uncomment the lines below to download the pretrained model, the tokenizer vocabulary and the config.

In [2]:
# !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/bert-base-2020-10-08.tar.gz
# !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/bert/BERT.wordpiece
# !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/bert/config/BASE_config.json
# !tar -zxf bert-base-2020-10-08.tar.gz

In [3]:
#!ls bert-base

The helper module [malaya/finetune/utils.py](https://github.com/huseinzol05/Malaya/blob/master/finetune/utils.py) lets us train the model on a single GPU or on multiple GPUs.

In [4]:
# import sys

# sys.path.insert(0, '../')
import utils

### Load dataset

We will train on a very small Bahasa news sentiment dataset.

In [5]:
import pandas as pd

df = pd.read_csv('data/sentiment-data-v2.csv')
df.head()

| Unnamed: 0 | label | text |
|---|---|---|
| 0 | Negative | Lebih-lebih lagi dengan kemudahan internet da... |
| 1 | Positive | boleh memberi teguran kepada parti tetapi perl... |
| 2 | Negative | Adalah membingungkan mengapa masyarakat Cina b... |
| 3 | Positive | Kami menurunkan defisit daripada 6.7 peratus p... |
| 4 | Negative | Ini masalahnya. Bukan rakyat, tetapi sistem |


In [6]:
labels = df['label'].values.tolist()
texts = df['text'].values.tolist()
unique_labels = sorted(list(set(labels)))
unique_labels

['Negative', 'Neutral', 'Positive']

In [7]:
import tensorflow as tf
import bert
from bert import run_classifier
from bert import optimization
from bert import tokenization
from bert import modeling




In [8]:
# If you get `UnparsedFlagAccessError: Trying to access flag --preserve_unused_tokens before flags were parsed.`,
# reinstall the pinned version:
#!pip install bert-tensorflow==1.0.1

In [9]:
tokenizer = tokenization.FullTokenizer(vocab_file = 'BERT.wordpiece', do_lower_case = False)
tokens = tokenizer.tokenize('Husein Comel tersangat sangatlah')
tokens




['Husein', 'Comel', 'tersangat', 'sangatlah']

In [10]:
tokenizer.convert_tokens_to_ids(tokens)

[31560, 17094, 26759, 30559]

In [11]:
def token_to_ids(text, maxlen = 512):
    tokens_a = tokenizer.tokenize(text)
    if len(tokens_a) > maxlen - 2:
        tokens_a = tokens_a[:(maxlen - 2)]
    tokens = ['[CLS]'] + tokens_a + ['[SEP]']
    segment_id = [0] * len(tokens)
    input_mask = [1] * len(tokens)
    input_id = tokenizer.convert_tokens_to_ids(tokens)
    return {'tokens': tokens, 'input_id': input_id,
    'input_mask': input_mask, 'segment_id': segment_id}

1. `tokens`, the tokenized words (wordpieces), wrapped in `[CLS]` and `[SEP]`.
2. `input_id`, the integer ID of each token, looked up in the wordpiece vocabulary.
3. `input_mask`, the attention mask. During batching, shorter sequences are padded with `0`, and the mask (`1` for real tokens, `0` for padding) keeps the model from treating padded positions as part of the context.
4. `segment_id`, used for text-pair classification. For single-text classification we can simply use `0` everywhere.
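
To make the role of `input_mask` concrete, here is a minimal pure-Python sketch of what padding a batch does. `pad_batch` is a hypothetical helper written for illustration only (the notebook itself relies on `padded_batch` from `tf.data`), and the token IDs are made up, except that `2` and `3` stand in for `[CLS]` and `[SEP]` as in the output of `token_to_ids`.

```python
def pad_batch(examples, pad_value=0):
    """Pad every field of every example to the length of the longest example."""
    longest = max(len(e['input_id']) for e in examples)
    batch = {'input_id': [], 'input_mask': [], 'segment_id': []}
    for e in examples:
        n_pad = longest - len(e['input_id'])
        for key in batch:
            # Padded positions get 0, so input_mask is 0 exactly where padding was added.
            batch[key].append(e[key] + [pad_value] * n_pad)
    return batch


# Two toy examples of different lengths (IDs are illustrative, not real vocabulary IDs).
short = {'input_id': [2, 10, 3], 'input_mask': [1, 1, 1], 'segment_id': [0, 0, 0]}
longer = {'input_id': [2, 10, 11, 12, 3], 'input_mask': [1] * 5, 'segment_id': [0] * 5}

batch = pad_batch([short, longer])
print(batch['input_id'])    # [[2, 10, 3, 0, 0], [2, 10, 11, 12, 3]]
print(batch['input_mask'])  # [[1, 1, 1, 0, 0], [1, 1, 1, 1, 1]]
```

The mask is the only way to tell a padded `0` in `input_id` apart from a real token, which is why it is passed to the model alongside the IDs.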

In [12]:
token_to_ids(texts[0])

{'input_id': [2,
  4015,
  17,
  2009,
  2088,
  1822,
  5714,
  6332,
  1766,
  3062,
  3558,
  16,
  20153,
  1828,
  3718,
  2766,
  20018,
  18,
  3],
 'input_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'tokens': ['[CLS]',
  'Lebih',
  '-',
  'lebih',
  'lagi',
  'dengan',
  'kemudahan',
  'internet',
  'dan',
  'laman',
  'sosial',
  ',',
  'taktik',
  'ini',
  'semakin',
  'mudah',
  'dikembangkan',
  '.',
  '[SEP]']}

### TF-Estimator

TF-Estimator requires 2 parts:

1. Input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset
2. Model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

### Data pipeline

In [13]:
def generate():
    while True:
        for i in range(len(texts)):
            if len(texts[i]) > 5:
                d = token_to_ids(texts[i])
                d['label'] = [unique_labels.index(labels[i])]
                d.pop('tokens', None)
                yield d

In [14]:
g = generate()
next(g)

{'input_id': [2,
  4015,
  17,
  2009,
  2088,
  1822,
  5714,
  6332,
  1766,
  3062,
  3558,
  16,
  20153,
  1828,
  3718,
  2766,
  20018,
  18,
  3],
 'input_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
 'label': [0],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
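
The generator encodes each label as its position in `unique_labels`, which is why the first example gets `'label': [0]` (`Negative`). A small sketch of the same encoding using a dictionary lookup instead of `list.index`, plus the reverse mapping that is handy when reading predictions back. The names `label_to_id`, `id_to_label` and `encode_label` are hypothetical and not part of the notebook.

```python
unique_labels = ['Negative', 'Neutral', 'Positive']

# Build both directions once instead of scanning the list for every example.
label_to_id = {label: i for i, label in enumerate(unique_labels)}
id_to_label = {i: label for label, i in label_to_id.items()}


def encode_label(label):
    # Wrapped in a list to match the shape-[None] 'label' field of the generator.
    return [label_to_id[label]]


print(encode_label('Negative'))  # [0]
print(id_to_label[2])            # Positive
```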

The Estimator expects a callable that builds the dataset, so `get_dataset` must be a function that returns a function:

```python
def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        dataset = ...  # build the tf.data.Dataset here
        return dataset
    return get
```
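
The reason for the nesting is that the training framework, not our code, calls the inner function, so settings such as the batch size have to be captured by the outer function. A TensorFlow-free sketch of the pattern follows. `make_input_fn` and `toy_train` are hypothetical stand-ins for `get_dataset` and the Estimator training loop, and plain lists stand in for a `tf.data.Dataset`.

```python
def make_input_fn(rows, batch_size=2):
    def input_fn():
        # Runs only when the framework calls it; batch_size comes from the closure.
        return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    return input_fn


def toy_train(input_fn):
    # Stand-in for the training loop: it invokes input_fn itself, with no arguments.
    return [len(batch) for batch in input_fn()]


batch_sizes = toy_train(make_input_fn(list(range(5)), batch_size=2))
print(batch_sizes)  # [2, 2, 1]
```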

In [15]:
def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        dataset = tf.data.Dataset.from_generator(
            generate,
            {'input_id': tf.int32, 'input_mask': tf.int32, 'segment_id': tf.int32, 'label': tf.int32},
            output_shapes = {
                'input_id': tf.TensorShape([None]),
                'input_mask': tf.TensorShape([None]),
                'segment_id': tf.TensorShape([None]),
                'label': tf.TensorShape([None])
            },
        )
        dataset = dataset.shuffle(shuffle_size)
        dataset = dataset.padded_batch(
            batch_size,
            padded_shapes = {
                'input_id': tf.TensorShape([None]),
                'input_mask': tf.TensorShape([None]),
                'segment_id': tf.TensorShape([None]),
                'label': tf.TensorShape([None])
            },
            padding_values = {
                'input_id': tf.constant(0, dtype = tf.int32),
                'input_mask': tf.constant(0, dtype = tf.int32),
                'segment_id': tf.constant(0, dtype = tf.int32),
                'label': tf.constant(0, dtype = tf.int32),
            },
        )
        return dataset
    return get

#### Test the data pipeline using a TF session

In [16]:
tf.reset_default_graph()
sess = tf.InteractiveSession()
iterator = get_dataset()()
iterator = iterator.make_one_shot_iterator().get_next()

Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.


In [17]:
iterator

{'input_id': <tf.Tensor 'IteratorGetNext:0' shape=(?, ?) dtype=int32>,
 'input_mask': <tf.Tensor 'IteratorGetNext:1' shape=(?, ?) dtype=int32>,
 'label': <tf.Tensor 'IteratorGetNext:2' shape=(?, ?) dtype=int32>,
 'segment_id': <tf.Tensor 'IteratorGetNext:3' shape=(?, ?) dtype=int32>}

In [18]:
sess.run(iterator)

{'input_id': array([[   2, 1968, 2279, ...,    0,    0,    0],
        [   2, 3566, 3841, ...,    0,    0,    0],
        [   2, 3061, 1762, ...,    0,    0,    0],
        ...,
        [   2, 2316, 1874, ...,    0,    0,    0],
        [   2, 2243, 4211, ...,    0,    0,    0],
        [   2, 2635, 1960, ...,    0,    0,    0]]),
 'input_mask': array([[1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        ...,
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0],
        [1, 1, 1, ..., 0, 0, 0]]),
 'label': array([[2],
        [0],
        [2],
        [0],
        [2],
        [1],
        [2],
        [0],
        [2],
        [2],
        [2],
        [2],
        [2],
        [2],
        [0],
        [0],
        [0],
        [2],
        [0],
        [0],
        [0],
        [2],
        [2],
        [2],
        [2],
        [2],
        [2],
        [2],
        [2],
        [2],
        [0],
        [0]]),
 'segment_id'

### Model definition

The model function must accept 4 parameters:

```python
def model_fn(features, labels, mode, params):
    ...  # build the graph and return a tf.estimator.EstimatorSpec
```

In [20]:
bert_config = modeling.BertConfig.from_json_file('config/BASE_config.json')
bert_config.__dict__

{'attention_probs_dropout_prob': 0.1,
 'directionality': 'bidi',
 'hidden_act': 'gelu',
 'hidden_dropout_prob': 0.1,
 'hidden_size': 768,
 'initializer_range': 0.02,
 'intermediate_size': 3072,
 'max_position_embeddings': 512,
 'num_attention_heads': 12,
 'num_hidden_layers': 12,
 'pooler_fc_size': 768,
 'pooler_num_attention_heads': 12,
 'pooler_num_fc_layers': 3,
 'pooler_size_per_head': 128,
 'pooler_type': 'first_token_transform',
 'type_vocab_size': 2,
 'vocab_size': 32000}

In [21]:
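# NOTE: despite its name, `epoch` is used below as the total number of training steps.
epoch = 10
warmup_proportion = 0.1
# With these values: int(10 * 0.1) = 1 warmup step.
num_warmup_steps = int(epoch * warmup_proportion)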
learning_rate = 2e-5
init_checkpoint = 'bert-base-2020-10-08/bert-base/model.ckpt-1000000'
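
To see what these values imply, here is a pure-Python sketch of the learning-rate schedule as we understand `optimization.create_optimizer` in the BERT reference code: a linear warmup followed by a linear decay to zero. The function `bert_learning_rate` is hypothetical and only approximates that behaviour; it is not the library's code.

```python
def bert_learning_rate(step, init_lr, num_train_steps, num_warmup_steps):
    """Linear warmup to init_lr, then linear decay to 0 at num_train_steps."""
    if num_warmup_steps and step < num_warmup_steps:
        return init_lr * step / num_warmup_steps
    step = min(step, num_train_steps)
    return init_lr * (1.0 - step / num_train_steps)


# Same settings as the notebook: 10 training steps, 1 warmup step, 2e-5 peak rate.
schedule = [bert_learning_rate(s, 2e-5, 10, 1) for s in range(11)]
for s, lr in enumerate(schedule):
    print(s, lr)
```

With only 10 steps the rate starts decaying almost immediately, so for a real finetuning run both `epoch` and the warmup would normally be far larger.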

In [22]:
def model_fn(features, labels, mode, params):
    Y = tf.cast(features['label'][:, 0], tf.int32)
    
    model = modeling.BertModel(
        config = bert_config,
        is_training = mode == tf.estimator.ModeKeys.TRAIN,  # enable dropout only while training
        input_ids = features['input_id'],
        input_mask = features['input_mask'],
        token_type_ids = features['segment_id'],
        use_one_hot_embeddings = False,
    )
    output_layer = model.get_pooled_output()
    # change the n value to number of label
    n = 3
    logits = tf.layers.dense(output_layer, n)
    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits = logits, labels = Y
        )
    )

    tf.identity(loss, 'train_loss')

    accuracy = tf.metrics.accuracy(
        labels = Y, predictions = tf.argmax(logits, axis = 1)
    )
    tf.identity(accuracy[1], name = 'train_accuracy')
    
    variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
    
    assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
        variables, init_checkpoint
    )

    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)
    
    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op = optimization.create_optimizer(loss, learning_rate, epoch, num_warmup_steps, False)
        estimator_spec = tf.estimator.EstimatorSpec(
            mode = mode, loss = loss, train_op = train_op
        )
        
    elif mode == tf.estimator.ModeKeys.EVAL:
        estimator_spec = tf.estimator.EstimatorSpec(
            mode = tf.estimator.ModeKeys.EVAL,
            loss = loss,
            eval_metric_ops = {'accuracy': accuracy},
        )

    return estimator_spec
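
For readers unfamiliar with the loss used in `model_fn`, the following pure-Python sketch reproduces the arithmetic of sparse softmax cross-entropy and of a per-batch accuracy. The helper names are hypothetical. Note that `tf.metrics.accuracy` keeps a running total across batches, whereas this sketch looks at a single batch only.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    # -log(softmax(logits)[label]), computed stably via log-sum-exp.
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


def batch_loss_and_accuracy(batch_logits, batch_labels):
    losses = [
        sparse_softmax_cross_entropy(logits, y)
        for logits, y in zip(batch_logits, batch_labels)
    ]
    # argmax over the class axis, like tf.argmax(logits, axis = 1)
    predictions = [max(range(len(l)), key=l.__getitem__) for l in batch_logits]
    correct = sum(int(p == y) for p, y in zip(predictions, batch_labels))
    return sum(losses) / len(losses), correct / len(batch_labels)


# Toy logits for 2 examples and 3 classes (Negative, Neutral, Positive).
loss, accuracy = batch_loss_and_accuracy(
    [[2.0, 0.1, 0.1], [0.1, 0.2, 3.0]], [0, 1]
)
print(loss, accuracy)
```

A useful sanity check: with untrained, all-equal logits the loss is `log(3)`, roughly 1.0986, so a training loss near that value means the classifier head has not learned anything yet.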

### Initiate training session

In [23]:
train_dataset = get_dataset()

In [None]:
train_hooks = [
    tf.train.LoggingTensorHook(
        ['train_accuracy', 'train_loss'], every_n_iter = 1
    )
]
utils.run_training(
    train_fn = train_dataset,
    model_fn = model_fn,
    model_dir = 'finetuned-bert-base',
    num_gpus = 1,
    log_step = 1,
    save_checkpoint_step = epoch,
    max_steps = epoch,
    train_hooks = train_hooks,
)



INFO:tensorflow:Using config: {'_model_dir': 'finetuned-bert-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x0000024613575518>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized aut

INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/attention/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/intermediate/dense/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/intermediate/dense/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_2/output/dense/kernel:0, shape = (3072, 768), *INIT_FROM_CK

INFO:tensorflow:  name = bert/encoder/layer_6/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/query/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/query/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/key/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/key/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/value/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/self/value/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_7/attention/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:te

INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/kernel:0, shape = (768, 3072), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_11/intermediate/dense/bias:0, shape = (3072,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/kernel:0, shape = (3072, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_11/output/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/beta:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/encoder/layer_11/output/LayerNorm/gamma:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/pooler/dense/kernel:0, shape = (768, 768), *INIT_FROM_CKPT*
INFO:tensorflow:  name = bert/pooler/dense/bias:0, shape = (768,), *INIT_FROM_CKPT*
INFO:tensorflow:  name = dense/kernel:0, shape = (768, 3)
INFO:tensorflow:  name = dense/bias:0, shape = (3,)



Instructions for updating:
Use tf.where in 2.0, which has the same broa