<a href="https://colab.research.google.com/github/ThatCodeCodingGuy/Czech-T5-Base-Model/blob/main/cst5_base.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Installing Necessary Packages**

In [None]:
!pip install transformers sentencepiece

Collecting transformers
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting pyyaml>=5.1
  Downloading PyYAML-6.0-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (596 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting sacremoses
  Downloading sacremoses-0.0.49-py3-none-any.whl (895 

# **Importing Torch and Loading the Original Multilingual T5 Model**

In [None]:
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

In [None]:
tokenizer = T5Tokenizer.from_pretrained("google/mt5-base")
tokenizer

PreTrainedTokenizer(name_or_path='google/mt5-base', vocab_size=250100, model_max_len=1000000000000000019884624838656, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<pad>'})

In [None]:
model = T5ForConditionalGeneration.from_pretrained('google/mt5-base')

You are using a model of type mt5 to instantiate a model of type t5. This is not supported for all configurations of models and can yield errors.


In [None]:
print(tokenizer.vocab_size) # The total vocabulary size consists of 250K tokens

250100


In [None]:
# Define a helper that counts the parameters of a model (or of one of its submodules).
# The original model has about 582 million parameters.
def msize(m):
    return sum(p.numel() for p in m.parameters())

original_size = msize(model)
print(msize(model))
print(msize(model.shared))
print('encoder')
print(msize(model.encoder))
print(msize(model.encoder.block))
print('decoder')
print(msize(model.decoder))
print(msize(model.decoder.block))
print(msize(model.lm_head))

582401280
192086016
encoder
277040256
84953472
decoder
305361024
113274240
192086016


In [None]:
# The input and output embeddings each hold about 33% of the parameters, so together they make up about 66% of the whole model.
print(msize(model.shared) / msize(model))
print(msize(model.lm_head) / msize(model))

0.32981729710484153
0.32981729710484153
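
The figures printed above can be cross-checked with a little arithmetic. The sketch below is an illustrative addition rather than part of the original notebook: it hard-codes the printed parameter counts and the hidden size of 768 (the `d_model` value of mT5-base), and the variable names are made up for this example.

```python
# Numbers copied from the notebook output above.
total_params = 582_401_280
embedding_params = 192_086_016  # msize(model.shared), same as msize(model.lm_head)
d_model = 768                   # hidden size of mT5-base

# Each embedding row holds d_model values, so this recovers the row count.
embedding_rows = embedding_params // d_model
print(embedding_rows)           # 250112

# Input and output embeddings together, as a share of the whole model.
embedding_share = 2 * embedding_params / total_params
print(round(embedding_share, 4))  # 0.6596
```

Note that the embedding matrix has 250,112 rows, 12 more than the 250,100 tokens the tokenizer reports. The extra rows are most likely padding (an assumption, the notebook does not say) and do not correspond to any token.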


# **Getting Language Data Before Changing the Tokens of the Old Model**

Two files in total, one for Czech and one for English, are chosen from:
* https://wortschatz.uni-leipzig.de/en/download/Czech
* https://wortschatz.uni-leipzig.de/en/download/English


For better results, choose data files of the same category for both languages (e.g. "Web-Public").

Where available, prefer the largest file (e.g. 1M sentences), since a larger sample represents the language better.

In [None]:
!wget http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ces-cz_web-public_2019_1M.tar.gz
!tar -xsvf ces-cz_web-public_2019_1M.tar.gz

--2022-04-23 16:38:01--  http://pcai056.informatik.uni-leipzig.de/downloads/corpora/ces-cz_web-public_2019_1M.tar.gz
Resolving pcai056.informatik.uni-leipzig.de (pcai056.informatik.uni-leipzig.de)... 139.18.2.216
Connecting to pcai056.informatik.uni-leipzig.de (pcai056.informatik.uni-leipzig.de)|139.18.2.216|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 210451310 (201M) [application/x-gzip]
Saving to: ‘ces-cz_web-public_2019_1M.tar.gz’


2022-04-23 16:38:09 (26.1 MB/s) - ‘ces-cz_web-public_2019_1M.tar.gz’ saved [210451310/210451310]

ces-cz_web-public_2019_1M/
ces-cz_web-public_2019_1M/ces-cz_web-public_2019_1M-import.sql
ces-cz_web-public_2019_1M/ces-cz_web-public_2019_1M-co_s.txt
ces-cz_web-public_2019_1M/ces-cz_web-public_2019_1M-co_n.txt
ces-cz_web-public_2019_1M/ces-cz_web-public_2019_1M-inv_so.txt
ces-cz_web-public_2019_1M/ces-cz_web-public_2019_1M-sentences.txt
ces-cz_web-public_2019_1M/ces-cz_web-public_2019_1M-inv_w.txt
ces-cz_web-public_2019_1M/ces-

In [None]:
!wget http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng-com_web-public_2018_1M.tar.gz
!tar -xsvf eng-com_web-public_2018_1M.tar.gz

--2022-04-23 16:38:19--  http://pcai056.informatik.uni-leipzig.de/downloads/corpora/eng-com_web-public_2018_1M.tar.gz
Resolving pcai056.informatik.uni-leipzig.de (pcai056.informatik.uni-leipzig.de)... 139.18.2.216
Connecting to pcai056.informatik.uni-leipzig.de (pcai056.informatik.uni-leipzig.de)|139.18.2.216|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 228887647 (218M) [application/x-gzip]
Saving to: ‘eng-com_web-public_2018_1M.tar.gz’


2022-04-23 16:38:28 (26.0 MB/s) - ‘eng-com_web-public_2018_1M.tar.gz’ saved [228887647/228887647]

eng-com_web-public_2018_1M/
eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-co_s.txt
eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-inv_so.txt
eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-words.txt
eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sentences.txt
eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sources.txt
eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-co_n.txt
eng-com_web-p

Looking at some of the sentences in Czech and English

In [None]:
import pandas as pd
pd.options.display.max_colwidth = 300
import csv
fname = 'ces-cz_web-public_2019_1M/ces-cz_web-public_2019_1M-sentences.txt'
df_cz = pd.read_csv(fname, sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_cz.columns = ['idx', 'text']
df_cz.sample(5)

Unnamed: 0,idx,text
644072,644073,"Projekty buď působí úspěšně, nebo dospěly ke krachu."
932126,932127,"Výše uvedené nařízení výslovně požaduje využití NOAEL pro výpočet bezpečnosti, takže se tomu nikdo nevyhne."
374861,374862,"Mluvili jsme s genetičkou a dala nám čas na rozhodnutí, zda chceme zkusit donosit, nebo ne."
38973,38974,A s tím vysvětlováním úspěšného sociálního státu v Evropě ztrácíš jenom čas.
573963,573964,Pokusila se obrátit Kalena na bok.


In [None]:
fname = 'eng-com_web-public_2018_1M/eng-com_web-public_2018_1M-sentences.txt'
df_en = pd.read_csv(fname, sep='\t', header=None, quoting=csv.QUOTE_NONE)
df_en.columns = ['idx', 'text']
df_en.sample(5)

Unnamed: 0,idx,text
965351,965352,"With one touch of a button, it's simple to add an LG Music Flow sound bar to your network and synchronize it with your LG Music flow speakers to enjoy a true home theater experience."
690888,690889,"Tayeb, in flight from his Yemeni homeland, befriends Frieda and, when she learns she has inherited the contents of an apartment belonging to a dead woman she has never heard of, they embark on an unexpected journey together."
854915,854916,"This sustainable clothing company was suffering the effects of an outdated website with a poor user experience, until Coalition entered the picture."
637222,637223,Serial interface to a computer program would allow the patient to program their own hearing comfort zone.
313877,313878,I don’t want to ever pay for the storage of the same data twice.


The code below counts how often the current tokenizer produces each token on the two corpora, to see what share of the vocabulary Czech and English actually use.

In [None]:
from collections import Counter
from tqdm.auto import tqdm, trange

cnt_cz = Counter()
for text in tqdm(df_cz.text):
    cnt_cz.update(tokenizer.encode(text))

cnt_en = Counter()
for text in tqdm(df_en.text):
    cnt_en.update(tokenizer.encode(text))

In [None]:
print(len(cnt_cz), len(cnt_cz)/tokenizer.vocab_size)
print(len(cnt_en), len(cnt_en)/tokenizer.vocab_size)
common = len(set(cnt_cz.keys()).intersection(set(cnt_en.keys())))
print(common, common / len(cnt_cz))

72697 0.29067173130747703
67920 0.2715713714514194
52507 0.7222718956765753


The output above shows that the Czech corpus uses 29% of the whole vocabulary, while the English corpus uses 27%.

The two token sets also overlap considerably: 72% of the tokens used for Czech are also used for English.
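
To make the three ratios concrete, here is a minimal sketch on toy data. The counters `toy_cz` and `toy_en` and the vocabulary size of 10 are hypothetical stand-ins for `cnt_cz`, `cnt_en`, and `tokenizer.vocab_size`; only the way the ratios are computed mirrors the cell above.

```python
from collections import Counter

# Toy token-id counts standing in for cnt_cz and cnt_en (hypothetical data).
toy_cz = Counter({1: 50, 2: 30, 3: 15, 4: 5})
toy_en = Counter({1: 40, 2: 35, 5: 20, 6: 5})
toy_vocab_size = 10

# Share of the vocabulary that each language actually uses.
used_cz = len(toy_cz) / toy_vocab_size
used_en = len(toy_en) / toy_vocab_size

# Tokens seen in both languages, relative to the Czech token set.
shared = set(toy_cz) & set(toy_en)
overlap_vs_cz = len(shared) / len(toy_cz)

print(used_cz, used_en, overlap_vs_cz)  # 0.4 0.4 0.5
```

The overlap is measured against the Czech token set, so dividing by `len(toy_en)` instead would generally give a different number.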

In [None]:
# Print what share of all token occurrences the top 10K, 20K, and 30K tokens cover in each language.
print('cz')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_cz.most_common(top)) / sum(cnt_cz.values()))
print('en')
for top in 10_000, 20_000, 30_000:
    print(top, sum(v for k, v in cnt_en.most_common(top)) / sum(cnt_en.values()))

cz
10000 0.9698733601465653
20000 0.9891941735702253
30000 0.9950201377394043
en
10000 0.9531899764307693
20000 0.9840809828270257
30000 0.9937869259525808
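
The coverage figures above come from summing the counts of the most frequent tokens and dividing by the total count. A minimal sketch of that computation, assuming a hypothetical helper `coverage` and toy counts that are not taken from the corpora:

```python
from collections import Counter

def coverage(counter, top):
    """Fraction of all token occurrences covered by the `top` most frequent tokens."""
    covered = sum(count for _, count in counter.most_common(top))
    return covered / sum(counter.values())

# Hypothetical counts: a few frequent tokens and a tail of rare ones.
toy_counts = Counter({"a": 60, "b": 25, "c": 10, "d": 3, "e": 2})

for top in (1, 2, 3):
    print(top, coverage(toy_counts, top))
```

Because token frequencies are heavily skewed, a small part of the vocabulary already covers most of the text, which is why the notebook can later drop most tokens while still covering over 95% of the occurrences in both corpora.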


Keep a copy of the old vocabulary, which will be replaced later on.

In [None]:
old_voc = tokenizer.get_vocab()
old_inv_voc = {v: k for k, v in old_voc.items()}

In [None]:
# printing the most used tokens.
print(tokenizer.convert_ids_to_tokens([k for k, v in cnt_cz.most_common(30)]))
print(tokenizer.convert_ids_to_tokens([k for k, v in cnt_en.most_common(30)]))

['▁', '</s>', '.', ',', 'a', 'u', '▁v', '▁na', '▁se', 'y', 's', 'í', 'i', '▁je', 'e', 'o', '▁z', 'é', 'ní', '▁do', '▁pro', '▁k', 'ně', '▁to', '▁za', 'ů', 'á', 'm', '▁po', '▁že']
['▁', '</s>', '.', '▁the', ',', 's', '▁to', '▁and', 'a', '▁of', '▁in', '▁is', '▁I', '’', '▁that', 'ed', '▁for', '-', 'ing', "'", '▁you', '▁it', '▁with', '▁on', 'ly', 'y', '▁be', '▁The', '▁as', '▁are']


# **Making the New Vocabulary**


Features:
* The first 1K tokens of the original tokenizer (ids 0 to 999)
* The top 10K tokens of the English vocabulary
* The top Czech tokens (drawn from up to the 25K most frequent), added until the vocabulary holds 29,900 tokens
* The 100 special tokens that T5 uses

In [None]:
# Always keep the first 1,000 ids of the original vocabulary.
new_tokens = set(range(1000))
# Add the 10K most frequent English tokens.
for k, v in cnt_en.most_common(10_000):
    new_tokens.add(k)
# Add frequent Czech tokens until the vocabulary holds 29,900 ids.
for i, (k, v) in enumerate(cnt_cz.most_common(25_000)):
    if len(new_tokens) == 29_900:
        print(i, 'Czech tokens are included')
        break
    new_tokens.add(k)

# Add the last 100 ids of the original vocabulary (the special tokens).
for t in range(tokenizer.vocab_size - 100, tokenizer.vocab_size):
    new_tokens.add(t)

print(len(new_tokens))
kept_ids = sorted(new_tokens)

23509 Czech tokens are included
30000
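
Sorting the kept ids fixes the new numbering: the position of an old id in `kept_ids` becomes its id in the new vocabulary. The sketch below illustrates this on a tiny, hypothetical vocabulary; `toy_kept_ids`, `old_to_new`, and `remap` are names invented for the example.

```python
# Hypothetical kept ids from a tiny 12-token vocabulary.
toy_kept_ids = sorted({0, 1, 2, 7, 9, 11})

# The position of an old id in the sorted list becomes its new id.
old_to_new = {old_id: new_id for new_id, old_id in enumerate(toy_kept_ids)}

def remap(old_ids):
    """Translate old token ids to new ones; ids that were dropped map to None."""
    return [old_to_new.get(old_id) for old_id in old_ids]

print(old_to_new)            # {0: 0, 1: 1, 2: 2, 7: 3, 9: 4, 11: 5}
print(remap([1, 7, 8, 11]))  # [1, 3, None, 5]
```

The same correspondence is what the embedding and tokenizer updates in the following sections rely on: row `new_id` of the new matrices and piece `new_id` of the new tokenizer both come from `kept_ids[new_id]`.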


In [None]:
# The new vocabulary is 12% of the size of the original one.
len(kept_ids) / tokenizer.vocab_size

0.11995201919232307

# **Updating the Embeddings**

In [None]:
import torch

In [None]:
new_size = len(kept_ids)
new_emb = torch.nn.Embedding(new_size, model.shared.embedding_dim)
new_head = torch.nn.Linear(in_features=model.lm_head.in_features, out_features=new_size, bias=False)

In [None]:
# Copy the rows of the kept tokens into the new, smaller matrices.
for new_id, old_id in enumerate(kept_ids):
    new_emb.weight.data[new_id] = model.shared.weight.data[old_id]
    new_head.weight.data[new_id] = model.lm_head.weight.data[old_id]

In [None]:
model.shared.weight = new_emb.weight
model.lm_head.weight = new_head.weight

In [None]:
# printing the number of parameters of the new model
print(msize(model))

244309248
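
The new parameter count can be predicted from the numbers printed earlier, since only the two embedding matrices change. The following check is an illustrative addition with made-up variable names; all figures are copied from the notebook output.

```python
# Numbers copied from the notebook output.
original_total = 582_401_280
original_embedding = 192_086_016   # one of the two embedding matrices
d_model = 768
new_vocab = 30_000

# Everything that is not an input or output embedding stays untouched.
non_embedding = original_total - 2 * original_embedding

# Two new matrices of shape (new_vocab, d_model) replace the old ones.
expected_total = non_embedding + 2 * new_vocab * d_model
print(expected_total)                              # 244309248
print(round(expected_total / original_total, 2))   # 0.42
```

The result matches the printed value, so the shrunken model keeps roughly 42% of the original parameters.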


# **Updating the Tokenizer**

The SentencePiece tokenizer used by the T5 model is implemented in C++ rather than in Python, so its vocabulary cannot be edited directly. To work around this, we can load its Protobuf representation and modify that in Python.

In [None]:
!wget https://raw.githubusercontent.com/google/sentencepiece/master/src/sentencepiece_model.proto

--2022-04-23 16:51:25--  https://raw.githubusercontent.com/google/sentencepiece/master/src/sentencepiece_model.proto
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12872 (13K) [text/plain]
Saving to: ‘sentencepiece_model.proto’


2022-04-23 16:51:25 (98.5 MB/s) - ‘sentencepiece_model.proto’ saved [12872/12872]



In [None]:
# Compile the protobuf definition to be able to modify the model.
! protoc --python_out=. sentencepiece_model.proto

Serializing the model used by the current tokenizer and opening it as a protobuf class.

In [None]:
import sentencepiece_model_pb2 as spmp
smp = tokenizer.sp_model.serialized_model_proto()
m = spmp.ModelProto()
m.ParseFromString(smp)

print('the loaded model has pieces:', len(m.pieces))
new_pieces = [m.pieces[idx] for idx in kept_ids]
print('the new pieces:', len(new_pieces))

# replace the content of the first 30K pieces
for i, p in enumerate(new_pieces):
    m.pieces[i].piece = p.piece
    m.pieces[i].score = p.score
    m.pieces[i].type = p.type

# drop the remaining pieces
n = len(new_pieces)
for i in trange(len(m.pieces) - n):
    m.pieces.pop(len(m.pieces) - 1)

print(len(m.pieces))
with open('new_sp.model', 'wb') as f:
    f.write(m.SerializeToString())

the loaded model has pieces: 250100
the new pieces: 30000


30000


In [None]:
new_tokenizer = T5Tokenizer('new_sp.model', extra_ids=0)

# **Saving the Model and Copying It to Google Drive**

In [None]:
model.config.vocab_size = new_size
model.config._name_or_path = 'azizbarank/cst5-base'
model.config

T5Config {
  "_name_or_path": "azizbarank/cst5-base",
  "architectures": [
    "MT5ForConditionalGeneration"
  ],
  "d_ff": 2048,
  "d_kv": 64,
  "d_model": 768,
  "decoder_start_token_id": 0,
  "dropout_rate": 0.1,
  "eos_token_id": 1,
  "feed_forward_proj": "gated-gelu",
  "initializer_factor": 1.0,
  "is_encoder_decoder": true,
  "layer_norm_epsilon": 1e-06,
  "model_type": "t5",
  "num_decoder_layers": 12,
  "num_heads": 12,
  "num_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "relative_attention_max_distance": 128,
  "relative_attention_num_buckets": 32,
  "tie_word_embeddings": false,
  "tokenizer_class": "T5Tokenizer",
  "transformers_version": "4.18.0",
  "use_cache": true,
  "vocab_size": 30000
}

In [None]:
new_tokenizer.save_pretrained('cst5-base')
model.save_pretrained('cst5-base')

In [None]:
!ls cst5-base -alsh

total 933M
4.0K drwxr-xr-x 2 root root 4.0K Apr 23 16:53 .
4.0K drwxr-xr-x 1 root root 4.0K Apr 23 16:53 ..
4.0K -rw-r--r-- 1 root root  746 Apr 23 16:53 config.json
933M -rw-r--r-- 1 root root 933M Apr 23 16:53 pytorch_model.bin
4.0K -rw-r--r-- 1 root root   65 Apr 23 16:53 special_tokens_map.json
744K -rw-r--r-- 1 root root 741K Apr 23 16:53 spiece.model
4.0K -rw-r--r-- 1 root root  173 Apr 23 16:53 tokenizer_config.json
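
The size of `pytorch_model.bin` is consistent with the new parameter count. A quick estimate, assuming every weight is stored as a 4-byte float32 value (an assumption, the notebook does not state the dtype):

```python
params = 244_309_248     # msize(model) after shrinking the embeddings
bytes_per_param = 4      # assumption: weights are stored as float32

size_mib = params * bytes_per_param / 2**20
print(round(size_mib))   # 932
```

This is in line with the 933M reported by `ls`; the small difference would be explained by rounding and by metadata stored alongside the weights.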


In [None]:
model1 = T5ForConditionalGeneration.from_pretrained('cst5-base')
tokenizer1 = T5Tokenizer.from_pretrained('cst5-base')

In [None]:
from google.colab import drive
drive.mount('/gd', force_remount=True)

Mounted at /gd


In [None]:
model1.save_pretrained('/gd/MyDrive/models/cst5-base-raw')
tokenizer1.save_pretrained('/gd/MyDrive/models/cst5-base-raw')

('/gd/MyDrive/models/cst5-base-raw/tokenizer_config.json',
 '/gd/MyDrive/models/cst5-base-raw/special_tokens_map.json',
 '/gd/MyDrive/models/cst5-base-raw/spiece.model',
 '/gd/MyDrive/models/cst5-base-raw/added_tokens.json')