# Finetuning of the pretrained Japanese ALBERT model (v2)

Finetune the pretrained model to solve a multi-class classification problem.  
This notebook requires the following objects:
- a trained SentencePiece model (model and vocab files)
- the pretrained Japanese ALBERT model

The dataset is the livedoor news corpus (livedoor ニュースコーパス), available from https://www.rondhuit.com/download.html.  
We split it into test:dev:train = 2:2:6.

Results:

- Full training data
  - ALBERT with SentencePiece
    ```
                  precision    recall  f1-score   support
  
    dokujo-tsushin       1.00      0.93      0.96       178
      it-life-hack       0.92      0.96      0.94       172
     kaden-channel       0.95      0.98      0.97       176
    livedoor-homme       0.90      0.82      0.86        95
       movie-enter       0.96      0.98      0.97       158
            peachy       0.95      0.97      0.96       174
              smax       0.99      0.98      0.98       167
      sports-watch       0.96      0.98      0.97       190
        topic-news       0.96      0.94      0.95       163

          accuracy                           0.96      1473
         macro avg       0.95      0.95      0.95      1473
      weighted avg       0.96      0.96      0.96      1473
    ```

  - BERT with SentencePiece
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.98      0.94      0.96       178
      it-life-hack       0.96      0.97      0.96       172
     kaden-channel       0.99      0.98      0.99       176
    livedoor-homme       0.98      0.88      0.93        95
       movie-enter       0.96      0.99      0.98       158
            peachy       0.94      0.98      0.96       174
              smax       0.98      0.99      0.99       167
      sports-watch       0.98      1.00      0.99       190
        topic-news       0.99      0.98      0.98       163

         micro avg       0.97      0.97      0.97      1473
         macro avg       0.97      0.97      0.97      1473
      weighted avg       0.97      0.97      0.97      1473
    ```
  - sklearn GradientBoostingClassifier with MeCab
    ```
                      precision    recall  f1-score   support

    dokujo-tsushin       0.89      0.86      0.88       178
      it-life-hack       0.91      0.90      0.91       172
     kaden-channel       0.90      0.94      0.92       176
    livedoor-homme       0.79      0.74      0.76        95
       movie-enter       0.93      0.96      0.95       158
            peachy       0.87      0.92      0.89       174
              smax       0.99      1.00      1.00       167
      sports-watch       0.93      0.98      0.96       190
        topic-news       0.96      0.86      0.91       163

         micro avg       0.92      0.92      0.92      1473
         macro avg       0.91      0.91      0.91      1473
      weighted avg       0.92      0.92      0.91      1473
    ```

- Small training data (1/5 of full training data)
  - ALBERT with SentencePiece
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.93      0.87      0.90       178
      it-life-hack       0.84      0.78      0.81       172
     kaden-channel       0.89      0.97      0.92       176
    livedoor-homme       0.76      0.69      0.73        95
       movie-enter       0.93      0.94      0.94       158
            peachy       0.86      0.87      0.87       174
              smax       0.91      0.95      0.93       167
      sports-watch       0.96      0.94      0.95       190
        topic-news       0.88      0.93      0.90       163

          accuracy                           0.89      1473
         macro avg       0.88      0.88      0.88      1473
      weighted avg       0.89      0.89      0.89      1473
    ```

  - BERT with SentencePiece
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.97      0.87      0.92       178
      it-life-hack       0.86      0.86      0.86       172
     kaden-channel       0.95      0.94      0.95       176
    livedoor-homme       0.82      0.82      0.82        95
       movie-enter       0.97      0.99      0.98       158
            peachy       0.89      0.95      0.92       174
              smax       0.94      0.96      0.95       167
      sports-watch       0.97      0.97      0.97       190
        topic-news       0.94      0.94      0.94       163

         micro avg       0.93      0.93      0.93      1473
         macro avg       0.92      0.92      0.92      1473
      weighted avg       0.93      0.93      0.93      1473
    ```
  - sklearn GradientBoostingClassifier with MeCab
    ```
                    precision    recall  f1-score   support

    dokujo-tsushin       0.82      0.71      0.76       178
      it-life-hack       0.86      0.88      0.87       172
     kaden-channel       0.91      0.87      0.89       176
    livedoor-homme       0.67      0.63      0.65        95
       movie-enter       0.87      0.95      0.91       158
            peachy       0.70      0.78      0.73       174
              smax       1.00      1.00      1.00       167
      sports-watch       0.87      0.95      0.91       190
        topic-news       0.92      0.82      0.87       163

         micro avg       0.85      0.85      0.85      1473
         macro avg       0.85      0.84      0.84      1473
      weighted avg       0.86      0.85      0.85      1473
    ```

In [0]:
!git clone --recursive https://github.com/alinear-corp/albert-japanese.git

In [0]:
cd albert-japanese

In [0]:
import configparser
import glob
import os
import pandas as pd
import subprocess
import sys
import tarfile 
from urllib.request import urlretrieve

CURDIR = os.getcwd()
CONFIGPATH = os.path.join(CURDIR, 'config.ini')
config = configparser.ConfigParser()
config.read(CONFIGPATH)

## Data preparation

You only need to execute the following cells once.

In [0]:
FILEURL = config['FINETUNING-DATA']['FILEURL']
FILEPATH = "/content/ldcc-20140209.tar.gz"
EXTRACTDIR = "/content/livedoor"
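The download URL is read from `config.ini` rather than hard-coded. As a minimal sketch of the layout that `config['FINETUNING-DATA']['FILEURL']` expects: only the section and key names come from this notebook, and the URL value below is a placeholder, not the one shipped in the repository.

```python
import configparser

# Hypothetical config.ini contents. Only the section name and the key
# name are taken from the notebook; the URL is a placeholder.
SAMPLE_INI = """
[FINETUNING-DATA]
FILEURL = https://example.com/path/to/ldcc-20140209.tar.gz
"""

sample_config = configparser.ConfigParser()
sample_config.read_string(SAMPLE_INI)

# Section names are case-sensitive; option names are matched
# case-insensitively by default.
sample_url = sample_config['FINETUNING-DATA']['FILEURL']
print(sample_url)
```

Note that `ConfigParser.read` silently skips files it cannot find and returns the list of files it actually parsed, so a `KeyError` on `'FINETUNING-DATA'` usually means the notebook is not running inside the cloned repository directory.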

Download and extract the data.

In [0]:
%%time

urlretrieve(FILEURL, FILEPATH)

# The context manager closes the archive even if extraction fails.
with tarfile.open(FILEPATH, "r:gz") as tar:
    tar.extractall(EXTRACTDIR)
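Since the data preparation only has to run once, the download and extraction can be guarded so that re-running the cell is cheap. The following is a sketch under that assumption; `download_and_extract_once` and its `fetch` parameter are hypothetical names introduced here, not part of the repository.

```python
import os
import tarfile
from urllib.request import urlretrieve


def download_and_extract_once(url, archive_path, extract_dir, fetch=urlretrieve):
    """Download and extract a .tar.gz archive, skipping steps already done.

    `fetch` is injectable (defaults to urlretrieve) so the function can be
    exercised without network access.
    """
    if not os.path.exists(archive_path):
        fetch(url, archive_path)            # download only if the archive is missing
    if not os.path.isdir(extract_dir):
        with tarfile.open(archive_path, "r:gz") as tar:
            tar.extractall(extract_dir)     # extract only if the target is missing
    return extract_dir
```

In the notebook this would be called as `download_and_extract_once(FILEURL, FILEPATH, EXTRACTDIR)`. The existence checks are deliberately simple: an interrupted download or extraction leaves a partial file or directory behind, which has to be deleted by hand before retrying.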

Data preprocessing.

In [0]:
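def extract_txt(filename):
    # The corpus files are assumed to be UTF-8; passing the encoding
    # explicitly avoids depending on the platform default.
    with open(filename, encoding='utf-8') as text_file:
        # Skip the first two lines (0: URL, 1: timestamp).
        text = text_file.readlines()[2:]
        # Strip whitespace, drop empty lines, and join the rest into one string.
        text = [sentence.strip() for sentence in text]
        text = [line for line in text if line != '']
        return ''.join(text)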
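To make the effect of `extract_txt` concrete, here is a small self-contained sketch that applies the same logic to a made-up file. The sample content is hypothetical and assumes the usual layout of one article per file (URL, timestamp, then title and body).

```python
import os
import tempfile


def extract_txt(filename):
    # Same logic as the notebook cell: skip URL and timestamp, join the rest.
    with open(filename, encoding='utf-8') as text_file:
        lines = text_file.readlines()[2:]
    lines = [line.strip() for line in lines]
    return ''.join(line for line in lines if line != '')


# Hypothetical article file: URL, timestamp, title, blank line, body.
sample = (
    "http://example.com/article/1\n"
    "2012-01-01T00:00:00+0900\n"
    "Title line\n"
    "\n"
    "Body line 1\n"
    "Body line 2\n"
)

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "sample.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)
    extracted = extract_txt(path)

print(extracted)  # Title lineBody line 1Body line 2
```

All remaining lines are concatenated without a separator, so each article becomes a single line of text, which is what the one-row-per-article TSV files need.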

In [0]:
categories = [ 
    name for name 
    in os.listdir( os.path.join(EXTRACTDIR, "text") ) 
    if os.path.isdir( os.path.join(EXTRACTDIR, "text", name) ) ]

categories = sorted(categories)

In [0]:
categories

In [0]:
table = str.maketrans({
    '\n': '',
    '\t': '　',
    '\r': '',
})
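The translation table removes line breaks and replaces tabs with a full-width space (U+3000). A plausible reason, given that the data is later saved as TSV, is that a literal tab inside the text would be read as a column separator. A minimal sketch with a made-up input string:

```python
# Same mapping as the notebook cell; '\u3000' is the full-width space.
table = str.maketrans({
    '\n': '',
    '\t': '\u3000',
    '\r': '',
})

raw = "col1\tcol2\r\nnext line"   # hypothetical text containing a tab and CRLF
cleaned = raw.translate(table)

print(repr(cleaned))
```

After translation the string contains no tab, carriage return, or newline, so it fits safely in one TSV cell.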

In [0]:
%%time

all_text = []
all_label = []

for cat in categories:
    files = glob.glob(os.path.join(EXTRACTDIR, "text", cat, "{}*.txt".format(cat)))
    files = sorted(files)
    body = [ extract_txt(elem).translate(table) for elem in files ]
    label = [cat] * len(body)
    
    all_text.extend(body)
    all_label.extend(label)

In [0]:
df = pd.DataFrame({'text' : all_text, 'label' : all_label})

In [0]:
df.head()

In [0]:
df = df.sample(frac=1, random_state=23).reset_index(drop=True)

In [0]:
df.head()

Save the data as TSV files.  
The split is test:dev:train = 2:2:6. To check how well finetuning works with less data, we also prepare a sampled training set (1/5 of the full training data).

In [0]:
df[:len(df) // 5].to_csv( os.path.join(EXTRACTDIR, "test.tsv"), sep='\t', index=False)
df[len(df) // 5:len(df)*2 // 5].to_csv( os.path.join(EXTRACTDIR, "dev.tsv"), sep='\t', index=False)
df[len(df)*2 // 5:].to_csv( os.path.join(EXTRACTDIR, "train.tsv"), sep='\t', index=False)

### 1/5 of full training data: run these lines instead of the three above.
# df[:len(df) // 5].to_csv( os.path.join(EXTRACTDIR, "test.tsv"), sep='\t', index=False)
# df[len(df) // 5:len(df)*2 // 5].to_csv( os.path.join(EXTRACTDIR, "dev.tsv"), sep='\t', index=False)
# df[len(df)*2 // 5:].sample(frac=0.2, random_state=23).to_csv( os.path.join(EXTRACTDIR, "train.tsv"), sep='\t', index=False)
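The slicing arithmetic can be checked on a plain list. The sketch below reproduces the same integer boundaries (`n // 5` and `n * 2 // 5`); `split_bounds` is a hypothetical helper name used only for this illustration.

```python
def split_bounds(n):
    """Return (test, dev, train) slices using the notebook's integer arithmetic."""
    test_end = n // 5
    dev_end = n * 2 // 5
    return slice(0, test_end), slice(test_end, dev_end), slice(dev_end, n)


rows = list(range(10))                      # stand-in for the shuffled DataFrame rows
test_s, dev_s, train_s = split_bounds(len(rows))
test_rows, dev_rows, train_rows = rows[test_s], rows[dev_s], rows[train_s]

print(len(test_rows), len(dev_rows), len(train_rows))  # 2 2 6
```

The three slices are disjoint and cover every row. Because the small-data variant only subsamples the training slice, the test and dev files are identical in both settings, which is why every result table reports the same total support of 1473.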

## Finetune pre-trained model

Executing the following cells takes many hours in a CPU environment.  
You can also use Colab to take advantage of a TPU; in that case you need to upload the created data to your GCS bucket.

In [0]:
pwd

In [0]:
!ls "/content/drive/My Drive/MachineLearning/albert-japanese-work/v2/albert_japanese_v2_model/"*

In [0]:
!cp -pr "/content/drive/My Drive/MachineLearning/albert-japanese-work/v2/albert_japanese_v2_model/"* ./model

In [0]:
ls model

In [0]:
mkdir livedoor_output

In [0]:
PRETRAINED_MODEL_PATH = './model/model.ckpt-1400000'
FINETUNE_OUTPUT_DIR = './livedoor_output'
# FINETUNE_OUTPUT_DIR = '/path/to/livedoor_output_light'

In [0]:
!pip install sentencepiece

In [0]:
%tensorflow_version 1.x

In [0]:
pwd

In [0]:
%%time
# This takes many hours in a CPU environment.

!PYTHONPATH=/tensorflow-1.15.2/python3.6:. python3 src/run_classifier.py \
  --albert_config_file=albert_config.json \
  --task_name=livedoor \
  --do_train=true \
  --do_eval=true \
  --data_dir={EXTRACTDIR} \
  --spm_model_file=./model/wiki-ja_albert.model \
  --init_checkpoint={PRETRAINED_MODEL_PATH} \
  --max_seq_length=512 \
  --train_batch_size=4 \
  --train_step=11055 \
  --warmup_step=1105 \
  --learning_rate=2e-5 \
  --output_dir={FINETUNE_OUTPUT_DIR} 2> ./log

# for small data training, use
#  --train_step=2212 \
#  --warmup_step=221 \
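The notebook does not state how the step counts were chosen. One plausible reading is that both settings train for roughly 10 epochs with about 10% of the steps used for warmup. The arithmetic below is a sketch under the assumption that the corpus has 7367 articles, which is consistent with the 1473 test rows in the result tables but is not stated in the notebook.

```python
def implied_epochs(train_steps, batch_size, num_train_examples):
    """Number of passes over the training set implied by a step count."""
    return train_steps * batch_size / num_train_examples


TOTAL_ARTICLES = 7367   # assumption: consistent with 1473 test rows (7367 // 5)

n_test = TOTAL_ARTICLES // 5
n_train_full = TOTAL_ARTICLES - TOTAL_ARTICLES * 2 // 5
# Approximates the row count of DataFrame.sample(frac=0.2) on the training slice.
n_train_small = round(n_train_full * 0.2)

full_epochs = implied_epochs(11055, 4, n_train_full)
small_epochs = implied_epochs(2212, 4, n_train_small)
warmup_fraction = 1105 / 11055

print(n_test, n_train_full, n_train_small)
print(full_epochs, small_epochs, warmup_fraction)
```

If you change `--train_batch_size` or the size of the training set, scaling `--train_step` and `--warmup_step` by the same logic keeps the amount of training comparable.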

In [0]:
! tail -n 100 ./log

In [0]:
ls {FINETUNE_OUTPUT_DIR}

## Predict using the finetuned model

Let's predict labels for the test data using the finetuned model.  

In [0]:
import sys
sys.path.append("./src")

from ALBERT import tokenization
from run_classifier import LivedoorProcessor
from ALBERT.classifier_utils import model_fn_builder
from ALBERT.classifier_utils import file_based_input_fn_builder
from ALBERT.classifier_utils import file_based_convert_examples_to_features
from utils import str_to_value

In [0]:
from ALBERT import modeling
from ALBERT import optimization
import tensorflow as tf

In [0]:
import configparser
import json
import glob
import os
import pandas as pd
import tempfile

albert_config = modeling.AlbertConfig.from_json_file("albert_config.json")

In [0]:
!cp -pr {FINETUNE_OUTPUT_DIR} data

In [0]:
FINETUNED_MODEL_PATH = os.path.abspath("./data/livedoor_output/model.ckpt-best")
# FINETUNED_MODEL_PATH = os.path.abspath("./data/livedoor_output_light/model.ckpt-best")

In [0]:
class FLAGS(object):
    '''Parameters.'''
    def __init__(self):
        self.model_file = "./model/wiki-ja_albert.model"
        self.vocab_file = "./model/wiki-ja_albert.vocab"
        self.do_lower_case = True
        self.use_tpu = False
        self.output_dir = "./data/dummy"
        self.data_dir = EXTRACTDIR
        self.max_seq_length = 512
        self.init_checkpoint = FINETUNED_MODEL_PATH
        self.predict_batch_size = 4
        
        # The following parameters are not used for prediction.
        # They are only needed to create the RunConfig and the estimator.
        self.master = None
        self.save_checkpoints_steps = 1
        self.iterations_per_loop = 1
        self.num_tpu_cores = 1
        self.learning_rate = 0
        self.num_warmup_steps = 0
        self.num_train_steps = 0
        self.train_batch_size = 0
        self.eval_batch_size = 0

In [0]:
FLAGS = FLAGS()

In [0]:
processor = LivedoorProcessor(use_spm=True, do_lower_case=True)
label_list = processor.get_labels()

In [0]:
tokenizer = tokenization.FullTokenizer(
    spm_model_file=FLAGS.model_file, vocab_file=FLAGS.vocab_file,
    do_lower_case=FLAGS.do_lower_case)

tpu_cluster_resolver = None

is_per_host = tf.contrib.tpu.InputPipelineConfig.PER_HOST_V2

run_config = tf.contrib.tpu.RunConfig(
    cluster=tpu_cluster_resolver,
    master=FLAGS.master,
    model_dir=FLAGS.output_dir,
    save_checkpoints_steps=FLAGS.save_checkpoints_steps,
    tpu_config=tf.contrib.tpu.TPUConfig(
        iterations_per_loop=FLAGS.iterations_per_loop,
        num_shards=FLAGS.num_tpu_cores,
        per_host_input_for_training=is_per_host))

In [0]:
model_fn = model_fn_builder(
    albert_config=albert_config,
    num_labels=len(label_list),
    init_checkpoint=FLAGS.init_checkpoint,
    learning_rate=FLAGS.learning_rate,
    task_name="livedoor",
    num_train_steps=FLAGS.num_train_steps,
    num_warmup_steps=FLAGS.num_warmup_steps,
    use_tpu=FLAGS.use_tpu,
    use_one_hot_embeddings=FLAGS.use_tpu)


estimator = tf.contrib.tpu.TPUEstimator(
    use_tpu=FLAGS.use_tpu,
    model_fn=model_fn,
    config=run_config,
    train_batch_size=FLAGS.train_batch_size,
    eval_batch_size=FLAGS.eval_batch_size,
    predict_batch_size=FLAGS.predict_batch_size)

In [0]:
predict_examples = processor.get_test_examples(FLAGS.data_dir)
predict_file = tempfile.NamedTemporaryFile(mode='w+t', encoding='utf-8', suffix='.tf_record')

file_based_convert_examples_to_features(predict_examples, label_list,
                                        FLAGS.max_seq_length, tokenizer,
                                        predict_file.name, task_name="livedoor")

predict_drop_remainder = bool(FLAGS.use_tpu)

predict_input_fn = file_based_input_fn_builder(
    input_file=predict_file.name,
    seq_length=FLAGS.max_seq_length,
    is_training=False,
    drop_remainder=predict_drop_remainder, task_name="livedoor", use_tpu=False, bsz=32)

In [0]:
result = estimator.predict(input_fn=predict_input_fn)

In [0]:
%%time
# This takes a few hours in a CPU environment.

result = list(result)

In [0]:
result[:2]

Read the test dataset and add the prediction results.

In [0]:
import pandas as pd

In [0]:
test_df = pd.read_csv(os.path.join(EXTRACTDIR, "test.tsv"), sep='\t')

In [0]:
test_df['predict'] = [ label_list[elem['probabilities'].argmax()] for elem in result ]
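Each element of `result` carries a `probabilities` vector, and the predicted label is the entry of `label_list` at the position of the largest probability. A dependency-free sketch of that decoding step follows. The label order shown is an assumption (alphabetical, as in the result tables); the real order is whatever `processor.get_labels()` returns, and the probability values are made up.

```python
def decode_prediction(probabilities, labels):
    """Return the label whose probability is largest (argmax decoding)."""
    best_index = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return labels[best_index]


# Assumed label order; in the notebook use processor.get_labels() instead.
labels = [
    "dokujo-tsushin", "it-life-hack", "kaden-channel", "livedoor-homme",
    "movie-enter", "peachy", "smax", "sports-watch", "topic-news",
]

# Made-up prediction outputs in the same shape as the estimator results.
fake_result = [
    {"probabilities": [0.01, 0.02, 0.01, 0.01, 0.90, 0.01, 0.02, 0.01, 0.01]},
    {"probabilities": [0.05, 0.05, 0.05, 0.05, 0.05, 0.05, 0.60, 0.05, 0.05]},
]

decoded = [decode_prediction(r["probabilities"], labels) for r in fake_result]
print(decoded)  # ['movie-enter', 'smax']
```

The decoding relies on `result` being in the same order as the rows of `test.tsv`, which holds here because the examples are read from that file and prediction does not shuffle them.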

In [0]:
test_df.head()

In [0]:
sum( test_df['label'] == test_df['predict'] ) / len(test_df)

A little more detailed check using `sklearn.metrics`.

In [0]:
!pip install scikit-learn

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [0]:
print(classification_report(test_df['label'], test_df['predict']))

In [0]:
print(confusion_matrix(test_df['label'], test_df['predict']))
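The two outputs are related: in scikit-learn's confusion matrix, rows are true labels and columns are predicted labels, so per-class recall is the diagonal entry divided by its row sum (the support) and precision is the diagonal entry divided by its column sum. The sketch below recomputes these from a made-up 3-class matrix; `per_class_metrics` is a hypothetical helper, not part of scikit-learn.

```python
def per_class_metrics(matrix):
    """Return (precision, recall, support) per class.

    `matrix[i][j]` counts examples with true class i predicted as class j.
    """
    n = len(matrix)
    metrics = []
    for k in range(n):
        true_pos = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))   # column sum
        actual_k = sum(matrix[k])                            # row sum = support
        precision = true_pos / predicted_k if predicted_k else 0.0
        recall = true_pos / actual_k if actual_k else 0.0
        metrics.append((precision, recall, actual_k))
    return metrics


# Made-up confusion matrix for three classes.
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]

toy_metrics = per_class_metrics(toy)
toy_accuracy = sum(toy[k][k] for k in range(3)) / sum(map(sum, toy))

for precision, recall, support in toy_metrics:
    print(round(precision, 2), round(recall, 2), support)
```

Reading the real confusion matrix this way shows which categories are mistaken for each other, for example where the lower recall of `livedoor-homme` in the tables at the top comes from.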