# Finetune XLNET-Bahasa

<div class="alert alert-info">

This tutorial is available as an IPython notebook at [Malaya/finetune/xlnet](https://github.com/huseinzol05/Malaya/tree/master/finetune/xlnet).
    
</div>

In this notebook, I am going to show how to finetune the pretrained XLNET-Bahasa model using TensorFlow Estimator.

TF-Estimator is a module created by the TensorFlow team that takes care of the training loop, checkpointing and logging, which makes it well suited to training a model over a very long period.

In [2]:
# !pip3 install tensorflow==1.15 xlnet-tensorflow

### Download pretrained model

The available checkpoints are listed at https://github.com/huseinzol05/Malaya/tree/master/pretrained-model/xlnet#download. In this example, we are going to try the BASE size. Just uncomment the lines below to download the pretrained model, the tokenizer and the config.

In [4]:
# !wget https://f000.backblazeb2.com/file/malaya-model/bert-bahasa/xlnet-base-500k-20-10-2020.gz
# !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/preprocess/sp10m.cased.v9.model
# !wget https://raw.githubusercontent.com/huseinzol05/Malaya/master/pretrained-model/xlnet/config/xlnet-base_config.json
# !tar -zxf xlnet-base-500k-20-10-2020.gz
!ls

sp10m.cased.v9.model			xlnet-base-500k-20-10-2020.gz
tf-estimator-text-classification.ipynb	xlnet-base_config.json
xlnet-base


In [5]:
!ls xlnet-base

model.ckpt-500000.data-00000-of-00001  model.ckpt-500000.meta
model.ckpt-500000.index		       xlnet-base_config.json


There is a helper module, [malaya/finetune/utils.py](https://github.com/huseinzol05/Malaya/blob/master/finetune/utils.py), to help us train the model on a single GPU or on multiple GPUs.

In [6]:
import sys

sys.path.insert(0, '../')
import utils

### Load dataset

We are just going to train on a very small Bahasa news sentiment dataset.

In [7]:
import pandas as pd

df = pd.read_csv('../sentiment-data-v2.csv')
df.head()

| Unnamed: 0 | label | text |
|---|---|---|
| 0 | Negative | Lebih-lebih lagi dengan kemudahan internet da... |
| 1 | Positive | boleh memberi teguran kepada parti tetapi perl... |
| 2 | Negative | Adalah membingungkan mengapa masyarakat Cina b... |
| 3 | Positive | Kami menurunkan defisit daripada 6.7 peratus p... |
| 4 | Negative | Ini masalahnya. Bukan rakyat, tetapi sistem |


In [8]:
labels = df['label'].values.tolist()
texts = df['text'].values.tolist()
unique_labels = sorted(list(set(labels)))
unique_labels

['Negative', 'Positive']

In [10]:
import numpy as np
import tensorflow as tf
from xlnet import model_utils
from xlnet import xlnet




In [11]:
import sentencepiece as spm
from xlnet.prepro_utils import preprocess_text, encode_ids

sp_model = spm.SentencePieceProcessor()
sp_model.Load('sp10m.cased.v9.model')

SEG_ID_A = 0
SEG_ID_B = 1
SEG_ID_CLS = 2
SEG_ID_SEP = 3
SEG_ID_PAD = 4

special_symbols = {
    '<unk>': 0,
    '<s>': 1,
    '</s>': 2,
    '<cls>': 3,
    '<sep>': 4,
    '<pad>': 5,
    '<mask>': 6,
    '<eod>': 7,
    '<eop>': 8,
}

VOCAB_SIZE = 32000
UNK_ID = special_symbols['<unk>']
CLS_ID = special_symbols['<cls>']
SEP_ID = special_symbols['<sep>']
MASK_ID = special_symbols['<mask>']
EOD_ID = special_symbols['<eod>']


def tokenize_fn(text):
    text = preprocess_text(text, lower = False)
    return encode_ids(sp_model, text)


def token_to_ids(text, maxlen = 512):
    tokens_a = tokenize_fn(text)
    if len(tokens_a) > maxlen - 2:
        tokens_a = tokens_a[: (maxlen - 2)]
    segment_id = [SEG_ID_A] * len(tokens_a)
    tokens_a.append(SEP_ID)
    tokens_a.append(CLS_ID)
    segment_id.append(SEG_ID_A)
    segment_id.append(SEG_ID_CLS)
    input_mask = [0.0] * len(tokens_a)
    assert len(tokens_a) == len(input_mask) == len(segment_id)
    return {
        'input_id': tokens_a,
        'input_mask': input_mask,
        'segment_id': segment_id,
    }

1. `input_id`, the integer representation of the tokenized subwords. Each id is the token's index in the SentencePiece vocabulary, and the sequence ends with the `<sep>` and `<cls>` ids.
2. `input_mask`, the attention mask. Real tokens get `0`. During training, shorter sequences are padded with `1`, so the model does not learn the padded positions as part of the context. See https://github.com/zihangdai/xlnet/blob/master/classifier_utils.py#L113
3. `segment_id`, used for text pair classification. For a single text we can simply put `0` for every token, with `2` (`SEG_ID_CLS`) on the final `<cls>` token.
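The padding convention behind these three fields can be sketched in plain Python. This is only an illustration with made-up token ids; `pad_batch` is a hypothetical helper, not part of Malaya or XLNet. The padding values (`0`, `1.0`, `4`) are the ones `get_dataset` uses later in this notebook.

```python
SEG_ID_PAD = 4  # same value as SEG_ID_PAD in the cell above


def pad_batch(examples, pad_id = 0):
    """Right-pad every feature list to the longest example in the batch."""
    longest = max(len(e['input_id']) for e in examples)
    padded = []
    for e in examples:
        n_pad = longest - len(e['input_id'])
        padded.append({
            'input_id': e['input_id'] + [pad_id] * n_pad,
            # 0.0 marks a real token, 1.0 marks a padded position
            'input_mask': e['input_mask'] + [1.0] * n_pad,
            'segment_id': e['segment_id'] + [SEG_ID_PAD] * n_pad,
        })
    return padded


# Two toy examples of different lengths (ids are invented; 4 = <sep>, 3 = <cls>)
batch = pad_batch([
    {'input_id': [10, 11, 4, 3], 'input_mask': [0.0] * 4, 'segment_id': [0, 0, 0, 2]},
    {'input_id': [12, 4, 3], 'input_mask': [0.0] * 3, 'segment_id': [0, 0, 2]},
])
print(batch[1])
```

The shorter example gains one padded position, which is flagged by `1.0` in `input_mask` and by `SEG_ID_PAD` in `segment_id`.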

In [12]:
token_to_ids(texts[0])

{'input_id': [1620,
  13,
  5177,
  53,
  33,
  2808,
  3168,
  24,
  3400,
  807,
  21,
  16179,
  31,
  742,
  578,
  17153,
  9,
  4,
  3],
 'input_mask': [0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2]}
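For texts longer than `maxlen`, `token_to_ids` truncates the tokens so that `<sep>` and `<cls>` still fit. A minimal sketch of that behaviour, assuming the ids are already tokenized (`build_features` is a hypothetical stand-in that skips SentencePiece so it runs without the model file):

```python
SEP_ID, CLS_ID = 4, 3          # ids from special_symbols above
SEG_ID_A, SEG_ID_CLS = 0, 2


def build_features(token_ids, maxlen = 512):
    """Same layout as token_to_ids, but takes already-tokenized ids."""
    token_ids = list(token_ids)[: maxlen - 2]   # leave room for <sep> and <cls>
    input_id = token_ids + [SEP_ID, CLS_ID]
    segment_id = [SEG_ID_A] * (len(token_ids) + 1) + [SEG_ID_CLS]
    input_mask = [0.0] * len(input_id)
    return {
        'input_id': input_id,
        'input_mask': input_mask,
        'segment_id': segment_id,
    }


# Ten fake token ids squeezed into a budget of six positions
features = build_features(range(100, 110), maxlen = 6)
print(features['input_id'])    # [100, 101, 102, 103, 4, 3]
print(features['segment_id'])  # [0, 0, 0, 0, 0, 2]
```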

### TF-Estimator

TF-Estimator requires 2 parts:

1. Input pipeline, https://www.tensorflow.org/api_docs/python/tf/data/Dataset
2. Model definition, https://www.tensorflow.org/api_docs/python/tf/estimator/Estimator

In [13]:
def generate():
    while True:
        for i in range(len(texts)):
            if len(texts[i]) > 5:
                d = token_to_ids(texts[i])
                d['label'] = [unique_labels.index(labels[i])]
                d.pop('tokens', None)
                yield d
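Because of the `while True`, `generate` never stops by itself. It cycles over the dataset, and the number of training steps decides when training ends. The toy version below shows the same pattern on invented data (`toy_texts`, `toy_labels` and `toy_generate` are placeholders for illustration only), using `itertools.islice` to take a finite number of items.

```python
from itertools import islice

toy_texts = ['berita baik hari ini', 'ok', 'cuaca buruk semalam']
toy_labels = ['Positive', 'Negative', 'Negative']
toy_unique = sorted(set(toy_labels))   # ['Negative', 'Positive']


def toy_generate():
    while True:                        # loop over the data forever
        for text, label in zip(toy_texts, toy_labels):
            if len(text) > 5:          # skip very short strings, as in generate()
                yield {'text': text, 'label': [toy_unique.index(label)]}


# Take only the first five items from the endless stream
first_five = list(islice(toy_generate(), 5))
print([d['label'][0] for d in first_five])   # [1, 0, 1, 0, 1]
```

The short string `'ok'` is filtered out, so the stream alternates between the two remaining examples.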

In [14]:
g = generate()
next(g)

{'input_id': [1620,
  13,
  5177,
  53,
  33,
  2808,
  3168,
  24,
  3400,
  807,
  21,
  16179,
  31,
  742,
  578,
  17153,
  9,
  4,
  3],
 'input_mask': [0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0,
  0.0],
 'segment_id': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2],
 'label': [0]}

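The input pipeline must be a function that returns a function: the outer function takes the settings, and the inner function builds and returns the dataset.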

```python
def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        dataset = ...  # placeholder: build the tf.data.Dataset here
        return dataset
    return get
```
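The reason for the nesting is that the input function is called later by the estimator rather than by us, so any settings have to be bound in advance. A minimal sketch of the pattern without TensorFlow, where `make_input_fn` is a hypothetical name and a list of batches stands in for the dataset:

```python
def make_input_fn(batch_size = 32):
    def input_fn():
        # Stand-in for building a tf.data.Dataset: batches of integers.
        data = list(range(10))
        return [data[i : i + batch_size] for i in range(0, len(data), batch_size)]

    return input_fn


input_fn = make_input_fn(batch_size = 4)   # configure once
batches = input_fn()                       # the caller invokes it with no arguments
print(batches)                             # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```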

In [15]:
def get_dataset(batch_size = 32, shuffle_size = 32):
    def get():
        dataset = tf.data.Dataset.from_generator(
            generate,
            {'input_id': tf.int32, 'input_mask': tf.float32, 'segment_id': tf.int32, 'label': tf.int32},
            output_shapes = {
                'input_id': tf.TensorShape([None]),
                'input_mask': tf.TensorShape([None]),
                'segment_id': tf.TensorShape([None]),
                'label': tf.TensorShape([None])
            },
        )
        dataset = dataset.shuffle(shuffle_size)
        dataset = dataset.padded_batch(
            batch_size,
            padded_shapes = {
                'input_id': tf.TensorShape([None]),
                'input_mask': tf.TensorShape([None]),
                'segment_id': tf.TensorShape([None]),
                'label': tf.TensorShape([None])
            },
            padding_values = {
                'input_id': tf.constant(0, dtype = tf.int32),
                'input_mask': tf.constant(1.0, dtype = tf.float32),
                'segment_id': tf.constant(4, dtype = tf.int32),
                'label': tf.constant(0, dtype = tf.int32),
            },
        )
        return dataset
    return get

#### Test data pipeline using tf.session

In [17]:
tf.reset_default_graph()
sess = tf.InteractiveSession()
iterator = get_dataset()()
iterator = iterator.make_one_shot_iterator().get_next()

Instructions for updating:
Use `for ... in dataset:` to iterate over a dataset. If using `tf.estimator`, return the `Dataset` object directly from your input function. As a last resort, you can use `tf.compat.v1.data.make_one_shot_iterator(dataset)`.


In [18]:
iterator

{'input_id': <tf.Tensor 'IteratorGetNext:0' shape=(?, ?) dtype=int32>,
 'input_mask': <tf.Tensor 'IteratorGetNext:1' shape=(?, ?) dtype=float32>,
 'segment_id': <tf.Tensor 'IteratorGetNext:3' shape=(?, ?) dtype=int32>,
 'label': <tf.Tensor 'IteratorGetNext:2' shape=(?, ?) dtype=int32>}

In [19]:
sess.run(iterator)

{'input_id': array([[1084,  791,  835, ...,    0,    0,    0],
        [ 256, 8993,    9, ...,    0,    0,    0],
        [8110,   87, 1743, ...,    0,    0,    0],
        ...,
        [ 767,  250,   51, ...,    0,    0,    0],
        [ 398, 8269,  742, ...,    9,    4,    3],
        [3593,   21, 7901, ...,    0,    0,    0]], dtype=int32),
 'input_mask': array([[0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 1., 1., 1.],
        ...,
        [0., 0., 0., ..., 1., 1., 1.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 1., 1., 1.]], dtype=float32),
 'segment_id': array([[0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 4, 4, 4],
        ...,
        [0, 0, 0, ..., 4, 4, 4],
        [0, 0, 0, ..., 0, 0, 2],
        [0, 0, 0, ..., 4, 4, 4]], dtype=int32),
 'label': array([[0],
        [0],
        [0],
        [1],
        [0],
        [1],
        [0],
        [1],
        [0],
        [1

### Model definition

It must be a function that accepts 4 parameters:

```python
def model_fn(features, labels, mode, params):
```

In [22]:
kwargs = dict(
    is_training = True,
    use_tpu = False,
    use_bfloat16 = False,
    dropout = 0.1,
    dropatt = 0.1,
    init = 'normal',
    init_range = 0.1,
    init_std = 0.05,
    clamp_len = -1,
)

xlnet_parameters = xlnet.RunConfig(**kwargs)
xlnet_config = xlnet.XLNetConfig(json_path = 'xlnet-base_config.json')




In [26]:
epoch = 10
batch_size = 32
warmup_proportion = 0.1
num_train_steps = 10
num_warmup_steps = int(num_train_steps * warmup_proportion)
learning_rate = 2e-5

training_parameters = dict(
    decay_method = 'poly',
    train_steps = num_train_steps,
    learning_rate = learning_rate,
    warmup_steps = num_warmup_steps,
    min_lr_ratio = 0.0,
    weight_decay = 0.00,
    adam_epsilon = 1e-8,
    num_core_per_host = 1,
    lr_layer_decay_rate = 1,
    use_tpu = False,
    use_bfloat16 = False,
    dropout = 0.0,
    dropatt = 0.0,
    init = 'normal',
    init_range = 0.1,
    init_std = 0.05,
    clip = 1.0,
    clamp_len = -1,
)
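With `decay_method = 'poly'`, the learning rate warms up linearly for `warmup_steps` and then decays linearly towards `learning_rate * min_lr_ratio`. The function below is a plain-Python sketch of that schedule as I understand it from `model_utils.get_train_op`. Treat it as an approximation for intuition, not as the library's implementation; `poly_lr` is a hypothetical name.

```python
def poly_lr(step, base_lr = 2e-5, train_steps = 10, warmup_steps = 1, min_lr_ratio = 0.0):
    """Linear warmup followed by linear (power 1) polynomial decay."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    decay_steps = train_steps - warmup_steps   # assumes train_steps > warmup_steps
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    end_lr = base_lr * min_lr_ratio
    return (base_lr - end_lr) * (1.0 - progress) + end_lr


# Same numbers as the cell above: 10 steps, 10% warmup, peak of 2e-5
for step in range(0, 11):
    print(step, poly_lr(step))
```

With only 10 training steps the warmup lasts a single step, so the rate peaks at step 1 and reaches zero at step 10.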

In [27]:
class Parameter:
    def __init__(
        self,
        decay_method,
        warmup_steps,
        weight_decay,
        adam_epsilon,
        num_core_per_host,
        lr_layer_decay_rate,
        use_tpu,
        learning_rate,
        train_steps,
        min_lr_ratio,
        clip,
        **kwargs
    ):
        self.decay_method = decay_method
        self.warmup_steps = warmup_steps
        self.weight_decay = weight_decay
        self.adam_epsilon = adam_epsilon
        self.num_core_per_host = num_core_per_host
        self.lr_layer_decay_rate = lr_layer_decay_rate
        self.use_tpu = use_tpu
        self.learning_rate = learning_rate
        self.train_steps = train_steps
        self.min_lr_ratio = min_lr_ratio
        self.clip = clip


training_parameters = Parameter(**training_parameters)
init_checkpoint = 'xlnet-base/model.ckpt-500000'

In [28]:
def model_fn(features, labels, mode, params):
    Y = tf.cast(features['label'][:, 0], tf.int32)

    xlnet_model = xlnet.XLNetModel(
        xlnet_config = xlnet_config,
        run_config = xlnet_parameters,
        input_ids = tf.transpose(features['input_id'], [1, 0]),
        seg_ids = tf.transpose(features['segment_id'], [1, 0]),
        input_mask = tf.transpose(features['input_mask'], [1, 0]),
    )

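    # XLNet is time-major, so the output is [seq_len, batch, hidden].
    # Transpose it back to [batch, seq_len, hidden].
    output_layer = xlnet_model.get_sequence_output()
    output_layer = tf.transpose(output_layer, [1, 0, 2])

    # Project every position to one logit per class, then use the
    # first position as the representation of the whole sentence.
    logits_seq = tf.layers.dense(output_layer, len(unique_labels))
    logits = logits_seq[:, 0]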

    loss = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(
            logits = logits, labels = Y
        )
    )

    tf.identity(loss, 'train_loss')

    accuracy = tf.metrics.accuracy(
        labels = Y, predictions = tf.argmax(logits, axis = 1)
    )
    tf.identity(accuracy[1], name = 'train_accuracy')

    variables = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)

    assignment_map, initialized_variable_names = utils.get_assignment_map_from_checkpoint(
        variables, init_checkpoint
    )

    tf.train.init_from_checkpoint(init_checkpoint, assignment_map)

    if mode == tf.estimator.ModeKeys.TRAIN:
        train_op, _, _ = model_utils.get_train_op(training_parameters, loss)
        estimator_spec = tf.estimator.EstimatorSpec(
            mode = mode, loss = loss, train_op = train_op
        )

    elif mode == tf.estimator.ModeKeys.EVAL:
        estimator_spec = tf.estimator.EstimatorSpec(
            mode = tf.estimator.ModeKeys.EVAL,
            loss = loss,
            eval_metric_ops = {'accuracy': accuracy},
        )

    return estimator_spec
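To make the loss and the accuracy in `model_fn` concrete, here is a plain-Python sketch of what sparse softmax cross-entropy and `argmax` compute for a toy batch. The logits are invented and the helper is a hypothetical illustration. It is not how TensorFlow implements the op, and it ignores the fact that `tf.metrics.accuracy` accumulates across batches.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """-log(softmax(logits)[label]), computed with the log-sum-exp trick."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]


batch_logits = [[2.0, 0.0], [0.0, 0.0]]   # two examples, two classes
batch_labels = [0, 1]                     # 0 = Negative, 1 = Positive

losses = [
    sparse_softmax_cross_entropy(l, y)
    for l, y in zip(batch_logits, batch_labels)
]
loss = sum(losses) / len(losses)          # mean over the batch

# argmax over the class axis; ties resolve to the first index
predictions = [max(range(len(l)), key = l.__getitem__) for l in batch_logits]
accuracy = sum(p == y for p, y in zip(predictions, batch_labels)) / len(batch_labels)
print(loss, predictions, accuracy)
```

The confident, correct first example contributes a small loss, while the undecided second example contributes `log(2)`.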

### Initiate training session

In [29]:
train_dataset = get_dataset()
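Note that the training call in the next cell passes `epoch` as `max_steps`, so with `epoch = 10` it runs 10 training steps rather than 10 passes over the data. That is enough for a quick demonstration. For full passes, the number of steps could be derived as sketched below. `steps_for_epochs` is a hypothetical helper and the dataset size of 1000 is invented for the example.

```python
import math


def steps_for_epochs(num_examples, batch_size, num_epochs):
    """Number of optimizer steps needed to see the data num_epochs times."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


# 1000 examples in batches of 32 is 32 steps per epoch
total_steps = steps_for_epochs(num_examples = 1000, batch_size = 32, num_epochs = 10)
print(total_steps)   # 320
```

If `max_steps` is changed this way, `num_train_steps` should probably be set to the same value so that the learning-rate schedule spans the whole run.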

In [None]:
train_hooks = [
    tf.train.LoggingTensorHook(
        ['train_accuracy', 'train_loss'], every_n_iter = 1
    )
]
utils.run_training(
    train_fn = train_dataset,
    model_fn = model_fn,
    model_dir = 'finetuned-xlnet-base',
    num_gpus = 1,
    log_step = 1,
    save_checkpoint_step = epoch,
    max_steps = epoch,
    train_hooks = train_hooks,
)



INFO:tensorflow:Using config: {'_model_dir': 'finetuned-xlnet-base', '_tf_random_seed': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 10, '_save_checkpoints_secs': None, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_keep_checkpoint_max': 5, '_keep_checkpoint_every_n_hours': 10000, '_log_step_count_steps': 1, '_train_distribute': None, '_device_fn': None, '_protocol': None, '_eval_distribute': None, '_experimental_distribute': None, '_experimental_max_worker_delay_secs': None, '_session_creation_timeout_secs': 7200, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f31fb236fd0>, '_task_type': 'worker', '_task_id': 0, '_global_id_in_cluster': 0, '_master': '', '_evaluation_master': '', '_is_chief': True, '_num_ps_replicas': 0, '_num_worker_replicas': 1}
Instructions for updating:
Use Variable.read_value. Variables in 2.X are initialized automa