# Git Hub: Create Your Own Tokenizer

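In this notebook we train a custom tokenizer on a dataset of Reuters articles, starting from the pretrained GPT-2 tokenizer, and then publish it to the Hugging Face Hub.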

## Load Your Dataset

In [4]:
from datasets import load_dataset

# we will load our own dataset
dataset = load_dataset('Kain17/reuters_articles')

In [5]:
# inspect the splits and their columns
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'body'],
        num_rows: 462
    })
    validation: Dataset({
        features: ['title', 'body'],
        num_rows: 58
    })
    test: Dataset({
        features: ['title', 'body'],
        num_rows: 58
    })
})

## Modify the Existing Dataset

We can modify the existing dataset before feeding it to the tokenizer. Here we combine the `title` and `body` columns into a single `full_article` column.

In [6]:
# helper function
def create_full_article_column(example):
    return {
        'full_article': 
        f"TITLE : {example['title']}\n\nBODY : {example['body']}"
    }

In [7]:
# Apply the helper to every example (row) to create a `full_article` column
dataset = dataset.map(create_full_article_column)

# Verify a new column `full_article` was created for each subset within the dataset
dataset

DatasetDict({
    train: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 462
    })
    validation: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 58
    })
    test: Dataset({
        features: ['title', 'body', 'full_article'],
        num_rows: 58
    })
})
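The per-example helper above can also be written in a batched form. The sketch below assumes the `batched=True` mode of `Dataset.map`, where the function receives a dict mapping each column name to a list of values. The function name `create_full_article_column_batched` and the two sample rows are hypothetical and are not part of the original notebook.

```python
def create_full_article_column_batched(batch):
    # With batched=True, `batch` maps each column name to a list of values,
    # so we build one formatted string per (title, body) pair.
    return {
        "full_article": [
            f"TITLE : {title}\n\nBODY : {body}"
            for title, body in zip(batch["title"], batch["body"])
        ]
    }


# Hypothetical stand-in for a batch, so the function can be checked offline
sample_batch = {"title": ["A", "B"], "body": ["first body", "second body"]}
result = create_full_article_column_batched(sample_batch)
print(result["full_article"][0])

# With the real dataset this would be used as:
# dataset = dataset.map(create_full_article_column_batched, batched=True)
```

For a dataset of this size (462 training rows) the difference is unlikely to matter; batching mainly helps on larger datasets.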

In [9]:
# Verify the content of the column
dataset['train'][1]['full_article']

'TITLE : HUGE OIL PLATFORMS DOT GULF LIKE BEACONS\n\nBODY : Huge oil platforms dot the Gulf like\nbeacons -- usually lit up like Christmas trees at night.\n    One of them, sitting astride the Rostam offshore oilfield,\nwas all but blown out of the water by U.S. Warships on Monday.\n    The Iranian platform, an unsightly mass of steel and\nconcrete, was a three-tier structure rising 200 feet (60\nmetres) above the warm waters of the Gulf until four U.S.\nDestroyers pumped some 1,000 shells into it.\n    The U.S. Defense Department said just 10 pct of one section\nof the structure remained.\n    U.S. helicopters destroyed three Iranian gunboats after an\nAmerican helicopter came under fire earlier this month and U.S.\nforces attacked, seized, and sank an Iranian ship they said had\nbeen caught laying mines.\n    But Iran was not deterred, according to U.S. defense\nofficials, who said Iranian forces used Chinese-made Silkworm\nmissiles to hit a U.S.-owned Liberian-flagged ship on Thursd

## Training the Tokenizer

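We load a pretrained tokenizer with the `AutoTokenizer` class from Hugging Face `transformers`. The training corpus is a plain Python generator that yields the `full_article` column in batches of 1,000 articles; it is passed to the tokenizer's training method later.

We use the **GPT-2** tokenizer as the starting point in this section and the following ones.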

In [11]:
from transformers import AutoTokenizer

In [12]:
# create the batched dataset using the `full_article` column
corpus = (
    dataset["train"][i : i + 1000]["full_article"] for i in range(0, len(dataset["train"]), 1000)
)
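A generator expression like the one above can only be consumed once, so re-running the training cell without re-creating `corpus` would train on an empty iterator. A minimal sketch of an alternative is a generator function that returns a fresh iterator on every call. The name `get_training_corpus` and the sample texts are hypothetical; with the real dataset, the list of texts would be assumed to come from `dataset["train"]["full_article"]`.

```python
def get_training_corpus(texts, batch_size=1000):
    """Yield successive lists of at most `batch_size` texts."""
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# Hypothetical stand-in for the `full_article` column
sample_texts = [f"article {i}" for i in range(5)]

batches = list(get_training_corpus(sample_texts, batch_size=2))
print(batches)

# Each call creates a new generator, so the corpus can be iterated again
batches_again = list(get_training_corpus(sample_texts, batch_size=2))
```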

In [15]:
# load the pretrained GPT-2 tokenizer (it is retrained on our corpus below)
old_tokenizer = AutoTokenizer.from_pretrained('gpt2')

In [17]:
import time

# pass the training corpus into the tokenizer (specify training corpus and vocabulary size - e.g. 52000)

print('Start Batch Training...\n')
start_time = time.time()

tokenizer = old_tokenizer.train_new_from_iterator(corpus, 52000)

print('Batch Training complete.\n')
end_time = time.time()

execution_time = end_time-start_time
print(f"Execution Time: {execution_time:.2f} seconds.")

Start Batch Training...

Batch Training complete.

Execution Time: 0.87 seconds.

In [24]:
# test sample
print('Original Text: \n')
example = dataset['test'][2]['full_article']
print(example)

Original Text: 

TITLE : ANIMAL FEED SHIP ON FIRE AGAIN AT CHINESE PORT

BODY : The Cyprus vessel Fearless, 31,841 tonnes
dw, which was on fire, grounded then towed to Yantai, China, in
August, had all its cargo reloaded but the cargo in the no. 3
hold caught fire on October 15.
    The fire was put out with salt water and water from the
no.4 hold has spread over most of the cargo. Some water is also
in the no.5 hold. Bottom patching was reported complete but
only the no.4 starboard wing tank has been pumped out and
remains dry. The engine room is flooded to about three metres.
    The ship was originally loaded with 10,000 tonnes of animal
feed.
 REUTER



In [31]:
# test old tokenizer
print('Tokenized Example:')
test1 = old_tokenizer.tokenize(example)
test1

Tokenized Example:


['TIT',
 'LE',
 'Ġ:',
 'ĠAN',
 'IM',
 'AL',
 'ĠFE',
 'ED',
 'ĠSH',
 'IP',
 'ĠON',
 'ĠFIRE',
 'ĠAGA',
 'IN',
 'ĠAT',
 'ĠCH',
 'IN',
 'ESE',
 'ĠP',
 'ORT',
 'Ċ',
 'Ċ',
 'B',
 'ODY',
 'Ġ:',
 'ĠThe',
 'ĠCyprus',
 'Ġvessel',
 'ĠFear',
 'less',
 ',',
 'Ġ31',
 ',',
 '8',
 '41',
 'Ġtonnes',
 'Ċ',
 'd',
 'w',
 ',',
 'Ġwhich',
 'Ġwas',
 'Ġon',
 'Ġfire',
 ',',
 'Ġgrounded',
 'Ġthen',
 'Ġtowed',
 'Ġto',
 'ĠY',
 'ant',
 'ai',
 ',',
 'ĠChina',
 ',',
 'Ġin',
 'Ċ',
 'August',
 ',',
 'Ġhad',
 'Ġall',
 'Ġits',
 'Ġcargo',
 'Ġreload',
 'ed',
 'Ġbut',
 'Ġthe',
 'Ġcargo',
 'Ġin',
 'Ġthe',
 'Ġno',
 '.',
 'Ġ3',
 'Ċ',
 'hold',
 'Ġcaught',
 'Ġfire',
 'Ġon',
 'ĠOctober',
 'Ġ15',
 '.',
 'Ċ',
 'Ġ',
 'Ġ',
 'Ġ',
 'ĠThe',
 'Ġfire',
 'Ġwas',
 'Ġput',
 'Ġout',
 'Ġwith',
 'Ġsalt',
 'Ġwater',
 'Ġand',
 'Ġwater',
 'Ġfrom',
 'Ġthe',
 'Ċ',
 'no',
 '.',
 '4',
 'Ġhold',
 'Ġhas',
 'Ġspread',
 'Ġover',
 'Ġmost',
 'Ġof',
 'Ġthe',
 'Ġcargo',
 '.',
 'ĠSome',
 'Ġwater',
 'Ġis',
 'Ġalso',
 'Ċ',
 'in',
 'Ġthe',
 'Ġno',
 '.',
 '5',
 '
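The `Ġ` and `Ċ` characters in the token list are not part of the article text. GPT-2's byte-level BPE maps every byte to a printable character, so a space appears as `Ġ` and a newline as `Ċ`. The sketch below rebuilds that byte-to-character table in plain Python and uses it to turn a few tokens from the output above back into text. The helper names `bytes_to_unicode` and `tokens_to_text` are chosen here for illustration; in practice the tokenizer's own `convert_tokens_to_string` method should do the same job.

```python
def bytes_to_unicode():
    """Rebuild the byte -> printable-character table used by byte-level BPE."""
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    for b in range(256):
        if b not in printable:
            # Non-printable bytes (space, newline, ...) are shifted past U+00FF
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}


def tokens_to_text(tokens):
    """Map each token character back to its byte and decode as UTF-8."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in "".join(tokens))
    return raw.decode("utf-8", errors="replace")


# A few tokens copied from the output above
sample = ["Ġtonnes", "Ċ", "d", "w", ",", "Ġwhich"]
print(repr(tokens_to_text(sample)))
```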

In [30]:
# test new tokenizer
test2 = tokenizer.tokenize(example)
test2

['TITLE',
 'Ġ:',
 'ĠAN',
 'IM',
 'AL',
 'ĠFE',
 'ED',
 'ĠSH',
 'IP',
 'ĠON',
 'ĠFI',
 'RE',
 'ĠAG',
 'AIN',
 'ĠAT',
 'ĠCH',
 'INES',
 'E',
 'ĠP',
 'ORT',
 'Ċ',
 'Ċ',
 'BODY',
 'Ġ:',
 'ĠThe',
 'ĠC',
 'y',
 'pr',
 'us',
 'Ġvessel',
 'ĠF',
 'ear',
 'less',
 ',',
 'Ġ31',
 ',',
 '8',
 '41',
 'Ġtonnes',
 'Ċ',
 'd',
 'w',
 ',',
 'Ġwhich',
 'Ġwas',
 'Ġon',
 'Ġfire',
 ',',
 'Ġg',
 'round',
 'ed',
 'Ġthen',
 'Ġto',
 'w',
 'ed',
 'Ġto',
 'ĠY',
 'ant',
 'a',
 'i',
 ',',
 'ĠChina',
 ',',
 'Ġin',
 'Ċ',
 'August',
 ',',
 'Ġhad',
 'Ġall',
 'Ġits',
 'Ġc',
 'argo',
 'Ġre',
 'lo',
 'ad',
 'ed',
 'Ġbut',
 'Ġthe',
 'Ġc',
 'argo',
 'Ġin',
 'Ġthe',
 'Ġno',
 '.',
 'Ġ3',
 'Ċ',
 'hold',
 'Ġcaught',
 'Ġfire',
 'Ġon',
 'ĠOctober',
 'Ġ15',
 '.',
 'ĊĠĠĠ',
 'ĠThe',
 'Ġfire',
 'Ġwas',
 'Ġput',
 'Ġout',
 'Ġwith',
 'Ġs',
 'alt',
 'Ġwater',
 'Ġand',
 'Ġwater',
 'Ġfrom',
 'Ġthe',
 'Ċ',
 'no',
 '.',
 '4',
 'Ġhold',
 'Ġhas',
 'Ġsp',
 're',
 'ad',
 'Ġover',
 'Ġmost',
 'Ġof',
 'Ġthe',
 'Ġc',
 'argo',
 '.',
 'ĠSome',
 'Ġwater

In [33]:
# Compare results
print(f'Tokenization by old tokenizer: {len(test1)}')
print(f'Tokenization by new tokenizer: {len(test2)}')

Tokenization by old tokenizer: 184
Tokenization by new tokenizer: 203
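On this article the new tokenizer produces more tokens than the original one, not fewer. The outputs above suggest why: words such as `Cyprus` and `cargo` are single tokens for the original GPT-2 tokenizer but are split into several pieces by the new one. A plausible reason, stated here as an assumption, is that the new tokenizer was trained on only 462 articles. The small helper below (the name `compare_token_counts` is hypothetical) quantifies the difference from the two counts printed above.

```python
def compare_token_counts(old_count, new_count):
    """Return the absolute and percentage change in token count."""
    diff = new_count - old_count
    return diff, diff / old_count * 100


# Counts taken from the comparison output above
diff, pct = compare_token_counts(184, 203)
print(f"New tokenizer uses {diff:+d} tokens ({pct:+.1f}%) on this article")
```

A single article is a weak basis for comparison; averaging the counts over the whole test split would give a more reliable picture.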


## Push New Tokenizer to Hugging Face Hub

In [38]:
from huggingface_hub import notebook_login

# Login to Hugging Face Hub
notebook_login()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
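The warning above names its own remedy: set the `TOKENIZERS_PARALLELISM` environment variable explicitly. A minimal sketch, assuming the cell is placed near the top of the notebook so that it runs before the tokenizer is first used:

```python
import os

# Disable tokenizers parallelism explicitly, as suggested by the warning text
os.environ["TOKENIZERS_PARALLELISM"] = "false"

print(os.environ["TOKENIZERS_PARALLELISM"])
```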


In [39]:
# Store tokenizer to hub
tokenizer.push_to_hub("gp2-reuters-tokenizer")

CommitInfo(commit_url='https://huggingface.co/Kain17/gp2-reuters-tokenizer/commit/2c6e1775132d08002ba65df99d88286ccd76aaa2', commit_message='Upload tokenizer', commit_description='', oid='2c6e1775132d08002ba65df99d88286ccd76aaa2', pr_url=None, repo_url=RepoUrl('https://huggingface.co/Kain17/gp2-reuters-tokenizer', endpoint='https://huggingface.co', repo_type='model', repo_id='Kain17/gp2-reuters-tokenizer'), pr_revision=None, pr_num=None)

## Using our Custom Tokenizer

In [43]:
# Load custom tokenizer
my_tokenizer = AutoTokenizer.from_pretrained("Kain17/gp2-reuters-tokenizer")

In [44]:
my_tokenizer

GPT2TokenizerFast(name_or_path='Kain17/gp2-reuters-tokenizer', vocab_size=14184, model_max_length=1024, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<|endoftext|>', 'eos_token': '<|endoftext|>', 'unk_token': '<|endoftext|>'}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
	0: AddedToken("<|endoftext|>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
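The reported `vocab_size=14184` is well below the 52000 requested during training. A likely explanation, which this notebook does not verify, is that BPE training stops early when no adjacent token pairs are left to merge, which happens quickly on a small corpus. The toy trainer below illustrates that behaviour. It is a simplified character-level sketch that splits on whitespace, not the implementation used by the `tokenizers` library, and the function name `train_bpe` and the sample texts are hypothetical.

```python
from collections import Counter


def train_bpe(texts, vocab_size):
    """Toy BPE: merge the most frequent adjacent pair until nothing is left."""
    # Represent each word as a tuple of symbols, with its frequency
    words = Counter()
    for text in texts:
        for word in text.split():
            words[tuple(word)] += 1

    vocab = {symbol for word in words for symbol in word}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single token

        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        vocab.add(a + b)

        # Apply the chosen merge to every word
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                    merged.append(a + b)
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return vocab, merges


# Tiny hypothetical corpus: training ends long before the requested size
vocab, merges = train_bpe(["low lower lowest", "low low"], vocab_size=1000)
print(len(vocab), "tokens learned out of 1000 requested")
```

If this is what happened here, requesting a smaller vocabulary or training on more text would be the two obvious levers to try.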

In [48]:
# Test the tokenizer
anArticle = dataset['test'][2]

print(anArticle['full_article'])


res = my_tokenizer.tokenize(anArticle['full_article'])
print(len(res))
print(res)

TITLE : ANIMAL FEED SHIP ON FIRE AGAIN AT CHINESE PORT

BODY : The Cyprus vessel Fearless, 31,841 tonnes
dw, which was on fire, grounded then towed to Yantai, China, in
August, had all its cargo reloaded but the cargo in the no. 3
hold caught fire on October 15.
    The fire was put out with salt water and water from the
no.4 hold has spread over most of the cargo. Some water is also
in the no.5 hold. Bottom patching was reported complete but
only the no.4 starboard wing tank has been pumped out and
remains dry. The engine room is flooded to about three metres.
    The ship was originally loaded with 10,000 tonnes of animal
feed.
 REUTER

203
['TITLE', 'Ġ:', 'ĠAN', 'IM', 'AL', 'ĠFE', 'ED', 'ĠSH', 'IP', 'ĠON', 'ĠFI', 'RE', 'ĠAG', 'AIN', 'ĠAT', 'ĠCH', 'INES', 'E', 'ĠP', 'ORT', 'Ċ', 'Ċ', 'BODY', 'Ġ:', 'ĠThe', 'ĠC', 'y', 'pr', 'us', 'Ġvessel', 'ĠF', 'ear', 'less', ',', 'Ġ31', ',', '8', '41', 'Ġtonnes', 'Ċ', 'd', 'w', ',', 'Ġwhich', 'Ġwas', 'Ġon', 'Ġfire', ',', 'Ġg', 'round', 'ed', 'Ġthen'

<hr>

###### End of the Notebook