# Training A New Tokenizer From An Old Tokenizer

- See [chapter 6.2 of the Hugging Face NLP course](https://huggingface.co/learn/nlp-course/chapter6/2?fw=pt) for how to train a new tokenizer from a pretrained one.

In [1]:
# Built-in library
import re
import json
from typing import Any, Dict, List, Optional, Union
import logging
import warnings

# Standard imports
import numpy as np
import pandas as pd
from rich import print

# Visualization
import matplotlib.pyplot as plt


# Pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 600

warnings.filterwarnings("ignore")

# Black code formatter (Optional)
%load_ext lab_black

# auto reload imports
%load_ext autoreload
%autoreload 2

<hr><br>

## Batch Encoding Using Fast Tokenizers

In [2]:
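from transformers import AutoTokenizer, BatchEncoding, PreTrainedTokenizerFast


CHECKPOINT: str = "bert-base-cased"
# AutoTokenizer is a factory: from_pretrained() returns a tokenizer instance.
tokenizer: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(CHECKPOINT)
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
# Calling the tokenizer returns a BatchEncoding, not a plain dict.
encoding: BatchEncoding = tokenizer(example)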

print(type(encoding))

In [3]:
print(encoding)

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding.tokens())

In [4]:
# Get the index of the word each token comes from.
# The special tokens [CLS] and [SEP] are represented as None.
print(encoding.word_ids())
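
The list returned by `word_ids()` lines up one-to-one with the list returned by `tokens()`, so the two can be zipped to recover which sub-word tokens belong to which word. Below is a minimal sketch of that idea in plain Python. The `tokens` and `word_ids` values are hand-written for illustration (they are not real output of `bert-base-cased`), and `group_tokens_by_word` is a hypothetical helper, not part of the library.

```python
from typing import Optional

# Hand-written illustrative values; NOT real tokenizer output.
tokens: list[str] = ["[CLS]", "Hug", "##ging", "Face", "rocks", "[SEP]"]
word_ids: list[Optional[int]] = [None, 0, 0, 1, 2, None]


def group_tokens_by_word(
    tokens: list[str], word_ids: list[Optional[int]]
) -> dict[int, list[str]]:
    """Map each word index to the tokens that were produced from that word."""
    grouped: dict[int, list[str]] = {}
    for token, word_id in zip(tokens, word_ids):
        if word_id is None:  # special tokens such as [CLS] and [SEP]
            continue
        grouped.setdefault(word_id, []).append(token)
    return grouped


words = group_tokens_by_word(tokens, word_ids)
print(words)
```

With a real encoding, the same helper could be called as `group_tokens_by_word(encoding.tokens(), encoding.word_ids())`.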

In [5]:
# Try another tokenizer!
from transformers import BatchEncoding, PreTrainedTokenizerFast

CHECKPOINT: str = "roberta-base"
tokenizer_2: PreTrainedTokenizerFast = AutoTokenizer.from_pretrained(CHECKPOINT)
example_2: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."
encoding_2: BatchEncoding = tokenizer_2(example_2)

print(encoding_2)

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding_2.tokens())

In [6]:
print(example)
print(encoding.word_ids())

# Access the tokens (w/o converting the IDs back to tokens)
print(encoding.tokens())

```text
- We can map any word or token to its characters in the original text, and vice versa, via these methods:
  - word_to_chars() and token_to_chars()
  - char_to_word() and char_to_token()

- The word_ids() method told us that ##ei is part of the word at index 3, but which word is that in the sentence? We can find out like this:
```
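
To make the word/character mapping less opaque, here is a rough pure-Python sketch of the bookkeeping involved. It assumes a simplified pre-tokenization rule (runs of word characters, or single punctuation marks), which only approximates what a BERT-style fast tokenizer does. The functions `word_to_chars_sketch` and `char_to_word_sketch` are hypothetical stand-ins, not the library's methods.

```python
import re
from typing import Optional

text = "My name is Chineidu and I work at Hugging Face In Brooklyn."

# Assumed, simplified pre-tokenization: word-character runs or single punctuation marks.
# Each entry is the (start, end) character span of one word in the original text.
word_spans: list[tuple[int, int]] = [
    m.span() for m in re.finditer(r"\w+|[^\w\s]", text)
]


def word_to_chars_sketch(word_index: int) -> tuple[int, int]:
    """Return the (start, end) character span of the word at `word_index`."""
    return word_spans[word_index]


def char_to_word_sketch(char_index: int) -> Optional[int]:
    """Return the index of the word covering `char_index`, or None for whitespace."""
    for index, (start, end) in enumerate(word_spans):
        if start <= char_index < end:
            return index
    return None


start, end = word_to_chars_sketch(3)
print(text[start:end])          # Chineidu
print(char_to_word_sketch(36))  # 8, the word "Hugging"
print(char_to_word_sketch(10))  # None, character 10 is a space
```

The real tokenizer keeps the same kind of span table per token as well as per word, which is what makes `token_to_chars()` and `char_to_token()` possible.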

In [7]:
start, end = encoding.word_to_chars(3)
example[start:end]

'Chineidu'

<hr><br>

## [Token Classification Pipeline](https://huggingface.co/learn/nlp-course/chapter6/3?fw=pt)

```text
- Using a token classification pipeline, we can get some results to compare manually. 
- The model used by default is dbmdz/bert-large-cased-finetuned-conll03-english and it performs NER on sentences.
```

In [None]:
from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

In [8]:
from transformers import Pipeline, pipeline


TASK: str = "token-classification"  # Named Entity Recognition (NER)
# pipeline() is a factory function; the object it returns is a Pipeline.
token_classifier: Pipeline = pipeline(task=TASK)
example: str = "My name is Chineidu and I work at Hugging Face In Brooklyn."

token_classifier(example)

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision f2482bf (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



<br>

#### Comment

```text
- The model correctly identified each token generated from "Chineidu" as a person, each token generated from "Hugging Face" as an organization, and the token "Brooklyn" as a location. We can also ask the pipeline to group together the tokens that correspond to the same entity:
```
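
Conceptually, grouping means merging adjacent token-level predictions that share an entity type and then slicing the original text with the merged character span. The sketch below illustrates that idea in plain Python; it is not the pipeline's own code. The `token_predictions` list is written by hand (labels and offsets only, no scores, not real model output), and `group_entities` is a hypothetical helper. As a simplification, it would also merge two distinct neighbouring entities of the same type.

```python
text = "My name is Chineidu and I work at Hugging Face In Brooklyn."

# Hand-written, hypothetical token-level predictions; NOT real model output.
token_predictions = [
    {"entity": "I-PER", "start": 11, "end": 14},
    {"entity": "I-PER", "start": 14, "end": 17},
    {"entity": "I-PER", "start": 17, "end": 19},
    {"entity": "I-ORG", "start": 34, "end": 37},
    {"entity": "I-ORG", "start": 37, "end": 41},
    {"entity": "I-ORG", "start": 42, "end": 46},
    {"entity": "I-LOC", "start": 50, "end": 58},
]


def group_entities(predictions: list[dict], text: str) -> list[dict]:
    """Merge consecutive predictions that share an entity type into one span."""
    groups: list[dict] = []
    for pred in predictions:
        label = pred["entity"].split("-")[-1]  # "I-PER" -> "PER"
        if groups and groups[-1]["entity_group"] == label:
            # Same type as the previous group: extend its character span.
            groups[-1]["end"] = pred["end"]
        else:
            groups.append(
                {"entity_group": label, "start": pred["start"], "end": pred["end"]}
            )
    for group in groups:
        # Recover the surface form from the original text, not from the tokens.
        group["word"] = text[group["start"] : group["end"]]
    return groups


entities = group_entities(token_predictions, text)
for entity in entities:
    print(entity)
```

Slicing the original text by offsets, rather than joining tokens, is what keeps the space in "Hugging Face" and avoids stray `##` prefixes.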