<a href="https://colab.research.google.com/github/educatorsRlearners/hugging_face_course/blob/main/06_the_%F0%9F%A4%97_Tokenizers_library.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install datasets transformers[sentencepiece]
from transformers import AutoTokenizer


When fine-tuning a model, it only makes sense to use the same tokenizer that it was trained on. But what do you do when you want to create a model from scratch? 

Well, that's exactly what we're going to do in this chapter. 

# Training a new tokenizer from an old one

Key point: if no language model is available in our target language or, more likely in my case, our corpus differs significantly from the one an existing language model was trained on, then we'll want to train a model from scratch with a tokenizer adapted to our data.

For instance, if we want to tokenize a simple sentence like "I went shopping with my mother last week," the standard BERT-based tokenizer works well:

In [2]:
checkpoint = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

sample_sentence = "I went shopping with my mother last week."

print(tokenizer.tokenize(sample_sentence))

['i', 'went', 'shopping', 'with', 'my', 'mother', 'last', 'week', '.']


However, if we pass it a highly technical, academic, or archaic text, the results aren't nearly as good:

In [3]:
medical = "the medical vocabulary is divided into many sub-token: paracetamol, pharyngitis, and oxycodone."

print(tokenizer.tokenize(medical))

['the', 'medical', 'vocabulary', 'is', 'divided', 'into', 'many', 'sub', '-', 'token', ':', 'para', '##ce', '##tam', '##ol', ',', 'ph', '##ary', '##ng', '##itis', ',', 'and', 'ox', '##y', '##co', '##don', '##e', '.']
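
To put a rough number on that difference, here is a minimal sketch that groups the WordPiece tokens back into words and counts the pieces per word. The token lists are copied from the two outputs above; the helper names `pieces_per_word` and `tokens_per_word` are my own, not part of 🤗 Transformers, and they rely only on the convention that a `##` prefix marks the continuation of the previous token.

```python
# Token lists copied verbatim from the two tokenizer outputs above.
simple_tokens = ['i', 'went', 'shopping', 'with', 'my', 'mother', 'last', 'week', '.']
medical_tokens = [
    'the', 'medical', 'vocabulary', 'is', 'divided', 'into', 'many', 'sub', '-',
    'token', ':', 'para', '##ce', '##tam', '##ol', ',', 'ph', '##ary', '##ng',
    '##itis', ',', 'and', 'ox', '##y', '##co', '##don', '##e', '.',
]


def pieces_per_word(tokens):
    """Return how many WordPiece tokens each word (or punctuation mark) was split into."""
    counts = []
    for token in tokens:
        if token.startswith("##") and counts:
            counts[-1] += 1  # continuation of the previous word
        else:
            counts.append(1)  # start of a new word or punctuation mark
    return counts


def tokens_per_word(tokens):
    """Average number of tokens per word: 1.0 means nothing was split."""
    return len(tokens) / len(pieces_per_word(tokens))


print(f"simple:  {tokens_per_word(simple_tokens):.2f} tokens per word")
print(f"medical: {tokens_per_word(medical_tokens):.2f} tokens per word")
print("worst case:", max(pieces_per_word(medical_tokens)), "pieces for one word")
```

The everyday sentence is never split, while "oxycodone" alone takes five tokens. Longer token sequences mean less text fits into the model's context window.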


Rare words get shattered into many sub-tokens, so we want a tokenizer adapted to our own data. Training one consists of four steps:
- building a corpus
- selecting the tokenizer architecture
- training the tokenizer on the corpus
- saving the result
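
To make the four steps concrete before touching the real library, here is a deliberately tiny, standard-library-only sketch. It is not how 🤗 Tokenizers works internally: the "architecture" is just a whitespace word-level vocabulary, and every name in it (`toy_corpus`, `pre_tokenize`, `train_vocab`, `save_vocab`) is hypothetical.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


# Step 1: build a corpus. A tiny made-up stand-in, served in batches.
def toy_corpus(batch_size=2):
    texts = [
        "def add(a, b): return a + b",
        "def sub(a, b): return a - b",
        "def mul(a, b): return a * b",
    ]
    for start in range(0, len(texts), batch_size):
        yield texts[start : start + batch_size]


# Step 2: select the architecture. Assumption: the simplest possible one,
# a word-level tokenizer that splits on whitespace.
def pre_tokenize(text):
    return text.split()


# Step 3: train on the corpus, i.e. keep the most frequent tokens.
def train_vocab(batches, vocab_size):
    counts = Counter()
    for batch in batches:
        for text in batch:
            counts.update(pre_tokenize(text))
    most_common = [token for token, _ in counts.most_common(vocab_size - 1)]
    # Index 0 is reserved for tokens that did not make it into the vocabulary.
    return {token: idx for idx, token in enumerate(["[UNK]"] + most_common)}


# Step 4: save the result so it can be reloaded later.
def save_vocab(vocab, path):
    Path(path).write_text(json.dumps(vocab, indent=2))


vocab = train_vocab(toy_corpus(), vocab_size=8)
out_path = Path(tempfile.mkdtemp()) / "toy_vocab.json"
save_vocab(vocab, out_path)
print(vocab)
```

A real subword tokenizer learns pieces smaller than words, but the overall shape (corpus in batches, fixed algorithm, learned vocabulary, saved file) is the same.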

## [Assembling a corpus](https://huggingface.co/course/chapter6/2?fw=pt#assembling-a-corpus)

Once we have the corpus, we can call ```train_new_from_iterator()``` on a tokenizer loaded with ```AutoTokenizer``` so that the new tokenizer has the same characteristics as the one for the model we wish to emulate.

What do I mean by that? 

Basically, if we're going to use the ```GPT-2``` model architecture, we want our tokenizer to tokenize in the same manner as the ```GPT-2``` tokenizer.

For this code-along, I'm just going to follow the course and use the [CodeSearchNet](https://huggingface.co/datasets/code_search_net) dataset but, in the future, I'm going to try something more classical (like Shakespeare) or possibly more exotic.

But, for now...

In [6]:
from datasets import load_dataset

raw_datasets = load_dataset("code_search_net", "python")

Let's have a look at the columns we're working with. 

In [7]:
raw_datasets['train']

Dataset({
    features: ['repository_name', 'func_path_in_repository', 'func_name', 'whole_func_string', 'language', 'func_code_string', 'func_code_tokens', 'func_documentation_string', 'func_documentation_tokens', 'split_name', 'func_code_url'],
    num_rows: 412178
})

OK, so the docstrings are separated from the code, and the dataset recommends tokenizing both of them.

Let's have a look at an example to see what we're working with:

In [13]:
print(raw_datasets["train"][123456]['whole_func_string'])

def core_properties(self):
        """
        Instance of |CoreProperties| holding the read/write Dublin Core
        document properties for this presentation. Creates a default core
        properties part if one is not present (not common).
        """
        try:
            return self.part_related_by(RT.CORE_PROPERTIES)
        except KeyError:
            core_props = CorePropertiesPart.default()
            self.relate_to(core_props, RT.CORE_PROPERTIES)
            return core_props


OK, now the key is to transform the dataset into an *iterator*. 

Why? Because if our dataset is an iterator, we can feed it to our function in batches as opposed to all at once. 

Why does that matter? If we pass it to our function all at once, we need to load the ***entire dataset into memory***, which will most likely crash our computer.

For example, doing the following would be bad 🙁

In [14]:
# Don't uncomment the following line unless your dataset is small!

'''
training_corpus = [
                   raw_datasets["train"][i: i + 1000]["whole_func_string"] 
                   for i in range(0, len(raw_datasets["train"]), 1000)]
'''

Instead, we want to create a generator like this: 

In [16]:
training_corpus = (
    raw_datasets["train"][i : i + 1000]["whole_func_string"]
    for i in range(0, len(raw_datasets["train"]), 1000)
)

So what's the difference between the two? 

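Instead of brackets we use parentheses. Brackets build the whole list immediately, while parentheses create a generator that produces each batch only when it is asked for. (Fun fact: if you've ever wondered why you can't do tuple comprehension like list comprehension, now you know 😆)

Now, the important thing to remember about generator objects is that they can only be used once, like this: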
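
Here is a small sketch that makes the laziness visible without downloading anything. The function `load_batch` is a hypothetical stand-in for slicing the real dataset; it records every call so we can see when the work actually happens.

```python
import types

loaded = []  # records the start index of every batch that has been built


def load_batch(start, size=1000):
    """Hypothetical stand-in for raw_datasets["train"][start : start + size]."""
    loaded.append(start)
    return [f"example {i}" for i in range(start, start + size)]


# Brackets: every batch is built right away and kept in memory.
eager = [load_batch(i) for i in range(0, 10_000, 1000)]
eager_loaded = len(loaded)

loaded.clear()

# Parentheses: nothing is built until we ask for a batch.
lazy = (load_batch(i) for i in range(0, 10_000, 1000))
lazy_loaded_before = len(loaded)

first_batch = next(lazy)
lazy_loaded_after = len(loaded)

print("batches built by the list comprehension:", eager_loaded)
print("batches built when the generator is created:", lazy_loaded_before)
print("batches built after one next() call:", lazy_loaded_after)
```

With the generator, only one batch exists at a time, which is what keeps memory usage flat on a large dataset.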

In [21]:
gen = (l for l in "abcdefg")
print(list(gen))
print(list(gen))

['a', 'b', 'c', 'd', 'e', 'f', 'g']
[]


So what do we do if we want to use a generator more than once? 

Simply write a function which returns a generator 😀

If the generator is straightforward, we can use comprehension syntax like above:

In [25]:
def get_training_corpus():
    return (
        raw_datasets["train"][i : i + 1000]["whole_func_string"]
        for i in range(0, len(raw_datasets["train"]), 1000)
    )

training_corpus = get_training_corpus()

Now, if we're going to do something more complicated, a better idea is to use a ```yield``` statement instead of ```return```.

In [29]:
def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

training_corpus = get_training_corpus()
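
If calling the function again every time feels clumsy, another option is a small class whose `__iter__` method is itself a generator function, so each `for` loop gets a fresh generator. This is only a sketch: `BatchedCorpus` is a hypothetical name, and a plain list of strings stands in for the `whole_func_string` column of the real dataset.

```python
class BatchedCorpus:
    """Re-iterable wrapper that yields a sequence of texts in fixed-size batches."""

    def __init__(self, texts, batch_size=1000):
        self.texts = texts
        self.batch_size = batch_size

    def __iter__(self):
        # Because this method uses yield, every call returns a brand-new
        # generator, so the object can be looped over more than once.
        for start in range(0, len(self.texts), self.batch_size):
            yield self.texts[start : start + self.batch_size]


# Stand-in data: 2,500 tiny fake functions instead of the real dataset.
texts = [f"def f{i}(): pass" for i in range(2500)]
corpus = BatchedCorpus(texts)

first_pass = [len(batch) for batch in corpus]
second_pass = [len(batch) for batch in corpus]
print(first_pass)
print(second_pass)
```

Unlike the bare generator earlier, the second pass is not empty: both loops see the same three batches.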

## [Training a new tokenizer](https://huggingface.co/course/chapter6/2?fw=pt#training-a-new-tokenizer)