<a href="https://colab.research.google.com/github/ftnext/practice-dl-nlp/blob/master/bert_exercise/20220423KantaiBERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

ref: https://github.com/PacktPublishing/Transformers-for-Natural-Language-Processing/blob/main/Chapter03/KantaiBERT.ipynb

# Step 1: Fetch dataset

In [1]:
!curl --output kant.txt \
  https://raw.githubusercontent.com/PacktPublishing/Transformers-for-Natural-Language-Processing/main/Chapter03/kant.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 10.7M  100 10.7M    0     0  20.4M      0 --:--:-- --:--:-- --:--:-- 20.4M


In [2]:
!ls -lh kant.txt

-rw-r--r-- 1 root root 11M Apr 23 07:05 kant.txt


In [3]:
!wc -l kant.txt

188287 kant.txt


In [4]:
!grep -n 'Gutenberg EBook' kant.txt

2:The Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
20816:End of the Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
21180:The Project Gutenberg EBook of Fundamental Principles of the Metaphysic of Morals
31383:The Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
52197:End of the Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
52561:The Project Gutenberg EBook of Fundamental Principles of the Metaphysic of Morals
62764:The Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
83578:End of the Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
83942:The Project Gutenberg EBook of Fundamental Principles of the Metaphysic of Morals
94145:The Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
114959:End of the Project Gutenberg EBook of The Critique of Pure Reason, by Immanuel Kant
115323:The Project Gutenberg EBook of Fundament
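
The line numbers reported by `grep` suggest that `kant.txt` is one block of Gutenberg texts concatenated several times, since the same header recurs at a fixed spacing. A minimal sketch of that check, using only numbers copied from the outputs above (the helper name `spacings` is hypothetical):

```python
# Line numbers of the "The Project Gutenberg EBook of The Critique of Pure Reason"
# header, copied from the grep output above.
critique_headers = [2, 31383, 62764, 94145]


def spacings(line_numbers):
    """Return the gaps between consecutive line numbers."""
    return [b - a for a, b in zip(line_numbers, line_numbers[1:])]


gaps = spacings(critique_headers)
block_len = gaps[0]
total_lines = 188287  # from `wc -l kant.txt`

print(gaps)                     # equal gaps point to a repeated block
print(total_lines / block_len)  # roughly how many copies fit in the file
```

If this reading is right, a single pass over the file covers the underlying text several times, which is worth keeping in mind when interpreting `num_train_epochs=1` later on.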

# Step 2: Install dependencies

In [5]:
!pip uninstall -y tensorflow

Found existing installation: tensorflow 2.8.0
Uninstalling tensorflow-2.8.0:
  Successfully uninstalled tensorflow-2.8.0


In [6]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 kB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Installing collected packages: pyyaml, tokenizers, sacremoses, huggingface-hub, transformers
  Attempting uninstall: pyyaml
    Fo

In [7]:
!pip list | grep -E 'transformers|tokenizers'

tokenizers                    0.12.1
transformers                  4.18.0


# Check GPU

In [8]:
!nvidia-smi

Sat Apr 23 07:06:28 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla K80           Off  | 00000000:00:04.0 Off |                    0 |
| N/A   34C    P8    29W / 149W |      0MiB / 11441MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [9]:
import torch

In [10]:
torch.cuda.is_available()

True

# Imports

In [11]:
from pathlib import Path

In [12]:
from tokenizers import ByteLevelBPETokenizer
from tokenizers.processors import BertProcessing

In [13]:
from transformers import (
    pipeline,
    DataCollatorForLanguageModeling,
    LineByLineTextDataset,
    RobertaConfig,
    RobertaTokenizer,
    RobertaForMaskedLM,
    Trainer,
    TrainingArguments,
)

# Tokenizer

## Train tokenizer then save

In [14]:
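# Non-recursive glob: "**/*.txt" would also match KantaiBERT/merges.txt
# if this cell is re-run after the tokenizer has been saved.
paths = [str(x) for x in Path(".").glob("*.txt")]  # == ["kant.txt"]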

tokenizer = ByteLevelBPETokenizer()

In [15]:
%%time
special_tokens = ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=special_tokens)

CPU times: user 6.72 s, sys: 242 ms, total: 6.96 s
Wall time: 3.65 s


In [16]:
token_dir = Path("KantaiBERT")
token_dir.mkdir(exist_ok=True)

tokenizer.save_model(str(token_dir))

['KantaiBERT/vocab.json', 'KantaiBERT/merges.txt']

In [17]:
!ls -lh KantaiBERT

total 496K
-rw-r--r-- 1 root root 186K Apr 23 07:07 merges.txt
-rw-r--r-- 1 root root 308K Apr 23 07:07 vocab.json


In [18]:
!wc -l KantaiBERT/*

 19036 KantaiBERT/merges.txt
     0 KantaiBERT/vocab.json
 19036 total


In [19]:
tokenizer.get_vocab_size()

19296

In [20]:
len(tokenizer.get_vocab())

19296
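
The trained vocabulary (19,296 entries) is smaller than the requested `vocab_size=52_000`, because that argument is an upper bound: BPE training stops early once no remaining pair occurs at least `min_frequency` times. The reported size can be reconciled with the saved files. The sketch below assumes a byte-level base alphabet of 256 symbols and that the first line of `merges.txt` is a version header rather than a merge rule.

```python
num_special_tokens = 5    # <s>, <pad>, </s>, <unk>, <mask>
byte_alphabet_size = 256  # assumption: one base symbol per byte value
merges_file_lines = 19036 # from `wc -l KantaiBERT/merges.txt`

# Assumption: the first line of merges.txt is a "#version" header.
num_merges = merges_file_lines - 1

# Every merge rule adds exactly one new token to the vocabulary.
expected_vocab_size = num_special_tokens + byte_alphabet_size + num_merges
print(expected_vocab_size)
```

Under these assumptions the total agrees with `tokenizer.get_vocab_size()`.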

## Load tokenizer (as huggingface/tokenizers)

In [21]:
tokenizer = ByteLevelBPETokenizer(
    f"{token_dir}/vocab.json", f"{token_dir}/merges.txt"
)

In [22]:
tokenizer.get_vocab_size()

19296

In [23]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.']
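
The `Ġ` prefix marks a token that begins with a space. Byte-level BPE works on bytes, and each byte is displayed as a printable Unicode character. The sketch below reproduces the GPT-2 style byte-to-character table, on the assumption that `ByteLevelBPETokenizer` uses the same scheme; the function name `bytes_to_unicode` is borrowed from the GPT-2 code and is not part of the `tokenizers` API used in this notebook.

```python
def bytes_to_unicode():
    """Map each byte value (0-255) to a printable Unicode character."""
    # Bytes that are already printable keep their own character.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    mapping = {b: chr(b) for b in printable}
    # Remaining bytes (control characters, space, ...) are shifted past U+00FF.
    shift = 0
    for b in range(256):
        if b not in mapping:
            mapping[b] = chr(256 + shift)
            shift += 1
    return mapping


table = bytes_to_unicode()
space_symbol = table[ord(" ")]
rendered = "".join(table[b] for b in " Critique".encode("utf-8"))
print(space_symbol, rendered)
```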

In [24]:
tokenizer.post_processor = BertProcessing(
    ("</s>", tokenizer.token_to_id("</s>")),  # sep: RoBERTa's counterpart of BERT's [SEP]
    ("<s>", tokenizer.token_to_id("<s>")),  # cls: RoBERTa's counterpart of BERT's [CLS]
)
tokenizer.enable_truncation(max_length=512)

In [25]:
tokenizer.encode("The Critique of Pure Reason.").tokens

['<s>', 'The', 'ĠCritique', 'Ġof', 'ĠPure', 'ĠReason', '.', '</s>']
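
The effect of the post-processor and truncation settings can be pictured with plain lists. This is only an illustration, not the library's implementation: the helper `wrap_with_special_tokens` is hypothetical, and it assumes that `max_length` counts the two special tokens as well.

```python
def wrap_with_special_tokens(tokens, cls="<s>", sep="</s>", max_length=512):
    """Truncate to leave room for two special tokens, then wrap the rest."""
    body = tokens[: max_length - 2]
    return [cls] + body + [sep]


tokens = ["The", "ĠCritique", "Ġof", "ĠPure", "ĠReason", "."]
wrapped = wrap_with_special_tokens(tokens)
print(wrapped)

# A sequence that is too long is cut so the result fits in max_length.
long_wrapped = wrap_with_special_tokens(["x"] * 1000)
print(len(long_wrapped))
```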

# RoBERTa

## Config

In [26]:
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)

## Tokenizer

In [27]:
tokenizer = RobertaTokenizer.from_pretrained(str(token_dir), max_length=512)

In [28]:
tokenizer.encode("The Critique of Pure Reason.")

[0, 803, 2245, 270, 1410, 1270, 18, 2]

## Model

In [29]:
model = RobertaForMaskedLM(config)

In [30]:
model.num_parameters()

83504416
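
The parameter count can be reproduced by hand, which also shows where the parameters sit. The sketch below is an approximation of the architecture under stated assumptions: hidden size 768 and intermediate size 3072 (the `RobertaConfig` defaults shown in the saved config), no pooling layer, and an output projection that shares its weights with the input embeddings and adds only a bias. The function `roberta_mlm_params` is hypothetical.

```python
def roberta_mlm_params(vocab_size, hidden=768, layers=6, intermediate=3072,
                       max_positions=514, type_vocab=1):
    """Estimate the parameter count of a RoBERTa masked-LM model."""

    def linear(n_in, n_out):
        return n_in * n_out + n_out  # weight matrix plus bias

    layer_norm = 2 * hidden  # scale and shift vectors

    # Word, position and token-type embeddings, followed by a LayerNorm.
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, followed by a LayerNorm.
    attention = 4 * linear(hidden, hidden) + layer_norm
    feed_forward = (linear(hidden, intermediate)
                    + linear(intermediate, hidden) + layer_norm)
    # LM head: dense + LayerNorm + output bias (weights tied to embeddings).
    lm_head = linear(hidden, hidden) + layer_norm + vocab_size

    return embeddings + layers * (attention + feed_forward) + lm_head


total = roberta_mlm_params(52_000)
with_tokenizer_vocab = roberta_mlm_params(19_296)
print(total, with_tokenizer_vocab)
```

Nearly half of the parameters are word embeddings. Since the trained tokenizer has only 19,296 entries, the embedding rows above that index are never selected by any input token, so a config whose `vocab_size` matched the tokenizer would likely be noticeably smaller.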

## Dataset (for pre-training)

In [31]:
%%time
dataset = LineByLineTextDataset(
    tokenizer=tokenizer, file_path="kant.txt", block_size=128
)



CPU times: user 36.3 s, sys: 1.11 s, total: 37.4 s
Wall time: 37.1 s


In [32]:
len(dataset)

170964

In [33]:
dataset[0]

{'input_ids': tensor([   0,  803, 1123, 1156, 8937,  270,  487, 2245,  270, 1410, 1270,   16,
          379, 4555, 4032,    2])}

## Data collator

In [34]:
data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
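
With `mlm=True`, the collator picks roughly 15% of the non-special tokens as prediction targets. Of those, about 80% are replaced by `<mask>`, 10% by a random token, and 10% are left unchanged; every other position gets the label `-100` so that the loss ignores it. The following is a simplified pure-Python sketch of that rule, not the library code. The function `mask_tokens` and its arguments are hypothetical, and the token ids assume the special-token order used when training the tokenizer (`<mask>` is id 4).

```python
import random

IGNORE_INDEX = -100  # label value skipped by the loss


def mask_tokens(input_ids, special_ids, mask_id, vocab_size,
                mlm_probability=0.15, rng=None):
    """Return (masked_ids, labels) following the 80/10/10 masking rule."""
    rng = rng or random.Random()
    masked, labels = [], []
    for tok in input_ids:
        # Special tokens and unselected tokens are left alone and not scored.
        if tok in special_ids or rng.random() >= mlm_probability:
            masked.append(tok)
            labels.append(IGNORE_INDEX)
            continue
        labels.append(tok)  # the model must predict the original token
        roll = rng.random()
        if roll < 0.8:
            masked.append(mask_id)                    # replace with <mask>
        elif roll < 0.9:
            masked.append(rng.randrange(vocab_size))  # replace with random id
        else:
            masked.append(tok)                        # keep unchanged
    return masked, labels


# First training example from the dataset above.
ids = [0, 803, 1123, 1156, 8937, 270, 487, 2245, 270, 1410, 1270, 16,
       379, 4555, 4032, 2]
masked, labels = mask_tokens(ids, special_ids={0, 1, 2, 3, 4}, mask_id=4,
                             vocab_size=19296, rng=random.Random(0))
print(masked)
print(labels)
```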

## Train

In [35]:
training_args = TrainingArguments(
    output_dir=str(token_dir),
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

In [36]:
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

In [37]:
%%time
trainer.train()

***** Running training *****
  Num examples = 170964
  Num Epochs = 1
  Instantaneous batch size per device = 64
  Total train batch size (w. parallel, distributed & accumulation) = 64
  Gradient Accumulation steps = 1
  Total optimization steps = 2672


Step,Training Loss
500,6.5923
1000,5.7597
1500,5.2916
2000,5.0233
2500,4.8696




Training completed. Do not forget to share your model on huggingface.co/models =)




CPU times: user 18min 57s, sys: 5.64 s, total: 19min 3s
Wall time: 19min


TrainOutput(global_step=2672, training_loss=5.463243130438342, metrics={'train_runtime': 1140.1473, 'train_samples_per_second': 149.949, 'train_steps_per_second': 2.344, 'total_flos': 873620128952064.0, 'train_loss': 5.463243130438342, 'epoch': 1.0})
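
The step count and throughput in the log follow directly from the dataset size and batch size. A quick check, assuming a single device and no gradient accumulation, as the log above reports:

```python
import math

num_examples = 170_964       # "Num examples" in the training log
batch_size = 64              # per_device_train_batch_size
epochs = 1
train_runtime_s = 1140.1473  # "train_runtime" in TrainOutput

# The last, partially filled batch still counts as one optimization step.
steps = math.ceil(num_examples / batch_size) * epochs
samples_per_second = num_examples * epochs / train_runtime_s

print(steps)
print(round(samples_per_second, 3))
```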

In [38]:
trainer.save_model(str(token_dir))

Saving model checkpoint to KantaiBERT
Configuration saved in KantaiBERT/config.json
Model weights saved in KantaiBERT/pytorch_model.bin


## Fill-mask task

In [44]:
fill_mask = pipeline(
    "fill-mask",
    model=str(token_dir),
    tokenizer=str(token_dir)
)

loading configuration file KantaiBERT/config.json
Model config RobertaConfig {
  "_name_or_path": "KantaiBERT",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.18.0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}


In [45]:
fill_mask("Human thinking involves<mask>.")

[{'score': 0.023670444265007973,
  'sequence': 'Human thinking involves reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.015469097532331944,
  'sequence': 'Human thinking involves it.',
  'token': 306,
  'token_str': ' it'},
 {'score': 0.012228215113282204,
  'sequence': 'Human thinking involves conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.011805330403149128,
  'sequence': 'Human thinking involves experience.',
  'token': 531,
  'token_str': ' experience'},
 {'score': 0.009144469164311886,
  'sequence': 'Human thinking involves them.',
  'token': 508,
  'token_str': ' them'}]

In [46]:
fill_mask("Human thinking involves <mask>.")

[{'score': 0.023670444265007973,
  'sequence': 'Human thinking involves reason.',
  'token': 393,
  'token_str': ' reason'},
 {'score': 0.015469097532331944,
  'sequence': 'Human thinking involves it.',
  'token': 306,
  'token_str': ' it'},
 {'score': 0.012228215113282204,
  'sequence': 'Human thinking involves conceptions.',
  'token': 605,
  'token_str': ' conceptions'},
 {'score': 0.011805330403149128,
  'sequence': 'Human thinking involves experience.',
  'token': 531,
  'token_str': ' experience'},
 {'score': 0.009144469164311886,
  'sequence': 'Human thinking involves them.',
  'token': 508,
  'token_str': ' them'}]

# Export artifacts

In [47]:
!ls -lh KantaiBERT/*.*

-rw-r--r-- 1 root root  636 Apr 23 07:34 KantaiBERT/config.json
-rw-r--r-- 1 root root 186K Apr 23 07:07 KantaiBERT/merges.txt
-rw-r--r-- 1 root root 319M Apr 23 07:34 KantaiBERT/pytorch_model.bin
-rw-r--r-- 1 root root 3.0K Apr 23 07:34 KantaiBERT/training_args.bin
-rw-r--r-- 1 root root 308K Apr 23 07:07 KantaiBERT/vocab.json


In [48]:
!cp KantaiBERT/*.* drive/MyDrive/nlp/20220423/KantaiBERT/