
In [1]:
!pip install datasets transformers[sentencepiece]
from transformers import AutoTokenizer


When fine-tuning a model, it only makes sense to use the same tokenizer the model was trained with. But what do you do when you want to create a model from scratch?

Well, that's exactly what we're going to do in this chapter. 

# Training a new tokenizer from an old one

Key point: if no language model is available in our target language or, more likely in my case, our corpus differs significantly from the one a language model was trained on, then we'll want to train a model from scratch using a tokenizer adapted to our data.

For instance, if we want to tokenize a simple sentence like "I went shopping with my mother last week," the standard BERT-based tokenizer works well:

In [2]:
checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sample_sentence = "I went shopping with my mother last week."

print(tokenizer.tokenize(sample_sentence))

['i', 'went', 'shopping', 'with', 'my', 'mother', 'last', 'week', '.']


However, if we pass it a highly technical, academic, or archaic text, the results aren't nearly as good:

In [3]:
medical = "the medical vocabulary is divided into many sub-token: paracetamol, pharyngitis, and oxycodone."

print(tokenizer.tokenize(medical))

['the', 'medical', 'vocabulary', 'is', 'divided', 'into', 'many', 'sub', '-', 'token', ':', 'para', '##ce', '##tam', '##ol', ',', 'ph', '##ary', '##ng', '##itis', ',', 'and', 'ox', '##y', '##co', '##don', '##e', '.']
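
To put a number on that fragmentation, here is a minimal sketch that regroups the WordPiece tokens printed above into words. The token list is copied from the output; `group_wordpieces` is a hypothetical helper of my own (not part of 🤗 Transformers), and it relies only on the `##` prefix that marks a continuation piece.

```python
# Tokens copied verbatim from the BERT output above.
tokens = ['the', 'medical', 'vocabulary', 'is', 'divided', 'into', 'many',
          'sub', '-', 'token', ':', 'para', '##ce', '##tam', '##ol', ',',
          'ph', '##ary', '##ng', '##itis', ',', 'and', 'ox', '##y', '##co',
          '##don', '##e', '.']


def group_wordpieces(tokens):
    """Group tokens so that each '##' piece joins the word before it."""
    words = []
    for tok in tokens:
        if tok.startswith("##") and words:
            words[-1].append(tok)   # continuation of the previous word
        else:
            words.append([tok])     # start of a new word (or punctuation)
    return words


def strip_marker(piece):
    """Drop the '##' continuation marker, if present."""
    return piece[2:] if piece.startswith("##") else piece


# Keep only the words that were split into more than one piece.
fragmented = {
    "".join(strip_marker(p) for p in pieces): len(pieces)
    for pieces in group_wordpieces(tokens)
    if len(pieces) > 1
}
print(fragmented)
print(sum(fragmented.values()), "of", len(tokens), "tokens")
```

In this sentence, the three medical terms alone account for 13 of the 28 tokens.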


The fix is to train a tokenizer on our own data, which consists of four steps:
- building a corpus
- selecting the tokenizer architecture
- training the tokenizer on the corpus
- saving the result

## [Assembling a corpus](https://huggingface.co/course/chapter6/2?fw=pt#assembling-a-corpus)

Once we have the corpus, we can use ```AutoTokenizer.train_new_from_iterator()``` so that the new tokenizer will have the same characteristics as the one for the model we wish to emulate. 

What do I mean by that? 

Basically, if we're going to be using the ```GPT-2``` model architecture, we're going to want our tokenizer to tokenize in the same manner as ```GPT-2```.

For this walkthrough, I'm just going to follow along using the [CodeSearchNet](https://huggingface.co/datasets/code_search_net) dataset but, in the future, I'm going to do something more classical (like Shakespeare) or possibly exotic.

But, for now...

In [4]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python")

Let's have a look at the columns we're working with. 

In [5]:
raw_datasets['train']

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

OK, so the docstrings are separated from the code, and the dataset also provides a suggested tokenization of each (the ```func_code_tokens``` and ```func_documentation_tokens``` columns).

Let's have a look at an example to see what we're working with:

In [6]:
print(raw_datasets["train"][123456]['whole_func_string'])

def rank(self, issue, next_issue):
        """Rank an issue before another using the default Ranking field, the one named 'Rank'.

        :param issue: issue key of the issue to be ranked before the second one.
        :param next_issue: issue key of the second issue.
        """
        if not self._rank:
            for field in self.fields():
                if field['name'] == 'Rank':
                    if field['schema']['custom'] == "com.pyxis.greenhopper.jira:gh-lexo-rank":
                        self._rank = field['schema']['customId']
                        break
                    elif field['schema']['custom'] == "com.pyxis.greenhopper.jira:gh-global-rank":
                        # Obsolete since JIRA v6.3.13.1
                        self._rank = field['schema']['customId']

        if self._options['agile_rest_path'] == GreenHopperResource.AGILE_BASE_REST_PATH:
            url = self._get_url('issue/rank', base=self.AGILE_BASE_URL)
            payload = {'issues': [i

OK, now the key is to transform the dataset into an *iterator*. 

Why? Because if our dataset is an iterator, we can feed it to our function in batches as opposed to all at once. 

Why does that matter? If we pass it to our function all at once, we need to load the ***entire dataset into memory***, which will most likely crash our computer.

For example, doing the following would be bad 🙁

In [7]:
# Don't uncomment the following line unless your dataset is small!

'''
training_corpus = [
                   raw_datasets["train"][i: i + 1000]["whole_func_string"] 
                   for i in range(0, len(raw_datasets["train"]), 1000)]
'''

Instead, we want to create a generator like this: 

In [8]:
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

So what's the difference between the two? 

Instead of brackets we use parentheses, which gives us a generator expression: each batch is only produced when something asks for it. (Fun fact: if you've ever wondered why you can't do tuple comprehension like list comprehension, now you know 😆)
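
To make the brackets-versus-parentheses difference concrete, here's a small sketch on toy data rather than the real dataset. The size `n_items` is an arbitrary number I picked for illustration, and `sys.getsizeof` only measures the container object itself, so treat the numbers as a rough indication.

```python
import sys

n_items = 100_000  # hypothetical corpus size, chosen only for illustration

as_list = [i * 2 for i in range(n_items)]   # brackets: every item is built right now
as_gen = (i * 2 for i in range(n_items))    # parentheses: nothing is built yet

list_bytes = sys.getsizeof(as_list)  # grows with n_items
gen_bytes = sys.getsizeof(as_gen)    # small, and independent of n_items

print(f"list: {list_bytes:,} bytes, generator: {gen_bytes:,} bytes")

first_item = next(as_gen)  # items are computed one at a time, on demand
print(first_item)
```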

Now, the important thing to remember about generator objects is that they can only be used once, like this:

In [9]:
gen = (l for l in "abcdefg")
print(list(gen))
print(list(gen))

['a', 'b', 'c', 'd', 'e', 'f', 'g']
[]


So what do we do if we want to use a generator more than once? 

Simply write a function which returns a generator 😀

If the logic is straightforward, we can use comprehension syntax like above:

In [10]:
def get_training_corpus():
  return(
      raw_datasets['train'][i: i + 1000]["whole_func_string"]
      for i in range(0, len(raw_datasets['train']), 1000)
  )

training_corpus = get_training_corpus()

Now, if we're going to do something more complicated, a better idea is to use a ```yield``` statement instead of a ```return``` statement.

In [11]:
def get_training_corpus():
  dataset = raw_datasets["train"]
  for start_idx in range(0, len(dataset), 1000):
    samples = dataset[start_idx : start_idx + 1000]
    yield samples["whole_func_string"]

training_corpus = get_training_corpus()
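
A third option, which the course doesn't use, is a small class whose `__iter__` method is itself a generator function, so every `for` loop over the object starts a fresh pass. This is only a sketch: `BatchedCorpus` is a hypothetical name, and the toy data is a plain list of strings, whereas slicing the real dataset returns a dict that you'd still have to index with `"whole_func_string"`.

```python
class BatchedCorpus:
    """Re-iterable wrapper: each loop over it yields the batches from the start."""

    def __init__(self, texts, batch_size=1000):
        self.texts = texts
        self.batch_size = batch_size

    def __iter__(self):
        # A new generator is created on every call, so nothing gets "used up".
        for start in range(0, len(self.texts), self.batch_size):
            yield self.texts[start:start + self.batch_size]


# Toy stand-in for the dataset: five tiny "functions".
toy_texts = [f"def f{i}(): pass" for i in range(5)]
corpus = BatchedCorpus(toy_texts, batch_size=2)

first_pass = list(corpus)
second_pass = list(corpus)  # unlike a bare generator, this is not empty
print(len(first_pass), len(second_pass))
```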

## [Training a new tokenizer](https://huggingface.co/course/chapter6/2?fw=pt#training-a-new-tokenizer)

Now that we've created a function which will generate batches of text to feed into our tokenizer, we need to load the tokenizer we'd like to emulate.

In [13]:
from transformers import AutoTokenizer

old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

> _So if we're creating a new tokenizer, why not just start it from scratch?_

Solid question. 

The answer is that it's best to stand on the shoulders of giants: rather than defining every aspect of the tokenizer (e.g., special tokens) ourselves, we keep the old tokenizer's setup and just retrain it on our specific vocabulary.

Now let's create a baseline by identifying how the standard ```gpt2``` tokenizer would tokenize the following: 

In [15]:
example = '''def add_numbers(a, b):
    """Add the two numbers `a` and `b`."""
    return a + b'''

tokens = old_tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


We can see some real problems: the function name ```add_numbers``` is broken into four tokens ('Ġadd', '_', 'n', 'umbers'), and the indentation is spread across several single-space tokens.
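
A note on the odd characters: GPT-2's byte-level tokenizer displays a space as `Ġ` and a newline as `Ċ`. The sketch below uses a hypothetical `detokenize` helper and the token list copied from the output above; it undoes only those two substitutions to show that the tokens join back into the original source. That shortcut works here because the example is plain ASCII, so for anything real the tokenizer's own decoding methods are the better tool.

```python
# Tokens copied verbatim from the GPT-2 output above.
old_tokens = ['def', 'Ġadd', '_', 'n', 'umbers', '(', 'a', ',', 'Ġb', '):',
              'Ċ', 'Ġ', 'Ġ', 'Ġ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers',
              'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`', '."', '""', 'Ċ', 'Ġ',
              'Ġ', 'Ġ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


def detokenize(tokens):
    """Join the tokens and map the two visible markers back to whitespace."""
    return "".join(tokens).replace("Ġ", " ").replace("Ċ", "\n")


source = detokenize(old_tokens)
print(source)

# Tokens made up of nothing but whitespace markers.
whitespace_only = [t for t in old_tokens if set(t) <= {"Ġ", "Ċ"}]
print(len(whitespace_only), "of", len(old_tokens), "tokens are pure whitespace")
```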

Let's see if a tokenizer trained for this corpus performs better:

In [16]:
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)

Now, without getting too much into the weeds, the 🤗 Transformers library has both fast and slow tokenizers. Fast tokenizers are backed by the 🤗 Tokenizers library, which is written in Rust, whereas slow tokenizers are written in pure Python.

> _Why does that matter?_ 

Training a tokenizer in pure Python would be exceptionally slow, so ```train_new_from_iterator()``` only works with fast tokenizers.

As such, check [here](https://huggingface.co/transformers/#supported-frameworks) to see if the model your tokenizer is based on has a fast version.  

In [17]:
tokens = tokenizer.tokenize(example)
print(tokens)

['def', 'Ġadd', '_', 'numbers', '(', 'a', ',', 'Ġb', '):', 'ĊĠĠĠ', 'Ġ"""', 'Add', 'Ġthe', 'Ġtwo', 'Ġnumbers', 'Ġ`', 'a', '`', 'Ġand', 'Ġ`', 'b', '`."""', 'ĊĠĠĠ', 'Ġreturn', 'Ġa', 'Ġ+', 'Ġb']


Much better! Our tokenizer recognizes an indentation 'ĊĠĠĠ', a docstring 'Ġ"""', and properly splits the function name on the underscore. 

Furthermore, to get a sense of how many fewer tokens we'd create by correctly identifying white space as well as docstring markers, we can simply do the following: 

In [18]:
print(len(tokens))
print(len(old_tokenizer.tokenize(example)))

27
36
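
As a quick worked calculation on those two counts (the numbers are taken straight from the output above; the variable names are mine):

```python
new_count = 27  # tokens from the tokenizer trained on CodeSearchNet
old_count = 36  # tokens from the stock gpt2 tokenizer

saved = old_count - new_count
reduction = saved / old_count

print(f"{saved} fewer tokens, a {reduction:.0%} reduction for this snippet")
```

This is a single short snippet, so the figure is an illustration rather than a benchmark.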


And, since all learning is repetition, let's see another example: 

In [20]:
example = """class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weights + self.bias
    """

print(tokenizer.tokenize(example))

['class', 'ĠLinear', 'Layer', '():', 'ĊĠĠĠ', 'Ġdef', 'Ġ__', 'init', '__(', 'self', ',', 'Ġinput', '_', 'size', ',', 'Ġoutput', '_', 'size', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'weight', 'Ġ=', 'Ġtorch', '.', 'randn', '(', 'input', '_', 'size', ',', 'Ġoutput', '_', 'size', ')', 'ĊĠĠĠĠĠĠĠ', 'Ġself', '.', 'bias', 'Ġ=', 'Ġtorch', '.', 'zeros', '(', 'output', '_', 'size', ')', 'ĊĊĠĠĠ', 'Ġdef', 'Ġ__', 'call', '__(', 'self', ',', 'Ġx', '):', 'ĊĠĠĠĠĠĠĠ', 'Ġreturn', 'Ġx', 'Ġ@', 'Ġself', '.', 'weights', 'Ġ+', 'Ġself', '.', 'bias', 'ĊĠĠĠĠ']


Oh very nice! We can see that camel-cased names such as ```LinearLayer``` are split sensibly ('ĠLinear', 'Layer'), as are dunder methods.

## Saving the Tokenizer 

Now there is no point in doing all that work only to redo it at a later date. Additionally, if we save and share the tokenizer on the Hub, others will benefit from our hard work.

To that end, be sure to save your tokenizer by using the ```save_pretrained()``` method like so: 

In [21]:
# Be sure to uncomment the line below and pass in a unique name

# tokenizer.save_pretrained("name-of-tokenizer")

Now let's push it to the Hub.

If you're working in a notebook: 

In [22]:
#from huggingface_hub import notebook_login

#notebook_login()

If not: 

In [23]:
# huggingface-cli login

Now that you're logged in, you can simply push it like so: 

In [24]:
# tokenizer.push_to_hub("name-of-tokenizer")

And with that, your tokenizer lives in the Hub where anyone can load it like so: 

In [25]:
# Replace "huggingface-course" below with your actual namespace to use your own tokenizer

# tokenizer = AutoTokenizer.from_pretrained("huggingface-course/name-of-tokenizer")

# [Fast tokenizer's special powers](https://huggingface.co/course/chapter6/3?fw=pt#fast-tokenizers-special-powers)