# Giving tokens a label - How to categorize tokens


The first part of this notebook explains how to label tokens and how the labelling functions work.

The second part shows how all tokens used by our delphi language models are labelled.

# 1) How to use the token labelling functions

In [90]:
# autoreload
%load_ext autoreload
%autoreload 2

from pprint import pprint 

import spacy
from tqdm.auto import tqdm

import delphi

# from delphi.eval import token_labelling

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


We analyze a simple sentence and receive the respective tokens with their analyzed attributes.  
The grammatical/linguistic analysis is done by a model provided by spaCy for the English language.

In [2]:
# Load the English model
nlp = spacy.load("en_core_web_sm")

# Create a Doc object from a given text
doc = nlp("This is a dummy sentence for testing.")

token = doc[0]
print(token)

This


Let's get the label for our custom token that we just printed.

In [8]:
from delphi.eval import token_labelling

label = token_labelling.label_single_token(token)
pprint(label)

✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')
{'Capitalized': True,
 'Is Adjective': False,
 'Is Adverb': False,
 'Is Conjunction': False,
 'Is Interjunction': False,
 'Is Named Entity': False,
 'Is Noun': False,
 'Is Preposition': False,
 'Is Pronoun': True,
 'Is Verb': False,
 'Starts with space': False}


Let's get an understanding of what the labels actually mean.
Use this function to get an explanation of the labels for a single token.

In [9]:
token_labelling.explain_token_labels(token)

-------- Explanation of token labels --------
Token text:          This
Token dependency:    nominal subject
Token POS:           pronoun
---------------- Token labels ---------------
  0   Starts with space    False
  1   Capitalized          True
  2   Is Noun              False
  3   Is Pronoun           True
  4   Is Adjective         False
  5   Is Verb              False
  6   Is Adverb            False
  7   Is Preposition       False
  8   Is Conjunction       False
  9   Is Interjunction     False
 10   Is Named Entity      False


To see all the possible labels that spaCy can assign to a token, call the same function without any argument:
```Python
>>> token_labelling.explain_token_labels()
```

### Batched token labelling
Next, let us analyze a batch of sentences and have them labelled.
> In this example the input sentences are not yet tokenized, so spaCy uses its internal tokenizer.

In [18]:
sentences = [
    "This is a sentence."
]
labels = token_labelling.label_batch_sentences(sentences, tokenized=False, verbose=True)

print(len(labels[0]))
print(labels[0])

Token: This
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition | Is Conjunction | Is Interjunction | Is Named Entity
False             | True        | False   | True       | False        | False   | False     | False          | False          | False            | False          
---
Token: is
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition | Is Conjunction | Is Interjunction | Is Named Entity
False             | False       | False   | False      | False        | True    | False     | False          | False          | False            | False          
---
Token: a
Starts with space | Capitalized | Is Noun | Is Pronoun | Is Adjective | Is Verb | Is Adverb | Is Preposition | Is Conjunction | Is Interjunction | Is Named Entity
False             | False       | False   | False      | False        | False   | False     | False          | False          | False            |

Next, we label a sentence that we tokenized ourselves, e.g. with the tokenizer from our TinyStories models.

In [19]:
sentences = [
    ["This ", "is ", "a ", "sentence", "."]
]
labelled_sentences = token_labelling.label_batch_sentences(sentences, tokenized=True, verbose=False)

print(len(labelled_sentences[0]))
print(labelled_sentences[0])

5
[{'Starts with space': False, 'Capitalized': True, 'Is Noun': True, 'Is Pronoun': False, 'Is Adjective': False, 'Is Verb': False, 'Is Adverb': False, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named Entity': False}, {'Starts with space': False, 'Capitalized': False, 'Is Noun': False, 'Is Pronoun': False, 'Is Adjective': False, 'Is Verb': False, 'Is Adverb': True, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named Entity': False}, {'Starts with space': False, 'Capitalized': False, 'Is Noun': False, 'Is Pronoun': False, 'Is Adjective': True, 'Is Verb': False, 'Is Adverb': False, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named Entity': False}, {'Starts with space': False, 'Capitalized': False, 'Is Noun': True, 'Is Pronoun': False, 'Is Adjective': False, 'Is Verb': False, 'Is Adverb': False, 'Is Preposition': False, 'Is Conjunction': False, 'Is Interjunction': False, 'Is Named
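
The returned structure is a list of sentences, each of which is a list of per-token label dicts. As a rough sketch of how one might summarise such a result, the helper below counts how many tokens carry each label. The name `count_labels` and the toy input are hypothetical; the toy dicts are hand-written in the same shape and are not real output of the labelling functions.

```python
from collections import Counter
from typing import Dict, List


def count_labels(labelled_sentences: List[List[Dict[str, bool]]]) -> Counter:
    """Count, per label name, how many tokens in the batch carry that label."""
    counts: Counter = Counter()
    for sentence in labelled_sentences:
        for token_labels in sentence:
            # Only labels whose value is True contribute to the count.
            counts.update(name for name, value in token_labels.items() if value)
    return counts


# Hand-written toy batch: one sentence with two tokens (not real model output).
toy_batch = [[
    {"Capitalized": True, "Is Pronoun": True, "Is Verb": False},
    {"Capitalized": False, "Is Pronoun": False, "Is Verb": True},
]]
print(count_labels(toy_batch))
```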

# 2) Labelling all tokens in the dataset

Now we want to label all the tokens that our tokenizer knows - its entire vocabulary.

In [58]:
# Get all the tokens of the tokenizer
from transformers import AutoTokenizer, PreTrainedTokenizer, PreTrainedTokenizerFast

def tokenize(tokenizer: PreTrainedTokenizer, sample_txt: str) -> list[int]:
    # supposedly this can be different than prepending the bos token id
    return tokenizer.encode(tokenizer.bos_token + sample_txt, return_tensors="pt")[0]

# Decode a sentence
def decode(tokenizer: PreTrainedTokenizer, token_ids: list[int]) -> str:
    return tokenizer.decode(token_ids, skip_special_tokens=True)

tokenizer = AutoTokenizer.from_pretrained("roneneldan/TinyStories-1M")
vocab_size = tokenizer.vocab_size
print("The vocab size is:", vocab_size)

The vocab size is: 50257


In [63]:
# Let's have a look at some tokens
ranges = [(0,10), (800,810), (1200,1210), (2300, 2310), (vocab_size-10, vocab_size)]
for start, end in ranges:
    for i in range(start, end):
        print(decode(tokenizer, i).ljust(10), end=" ")
    print()


!          "          #          $          %          &          '          (          )          *          
 inv       lect        supp      ating       look      man        pect        8         row         bu        
 child      since     ired       less        life       develop   ittle       dep        pass      �          
 matter    reg        ext        angu       isc        ole        aut         compet    eed        fect       
 (/        …."        Compar      amplification ominated    regress    Collider   informants  gazed                


In [101]:
# let's label each token
labelled_token_ids_dict: dict[int, dict[str, bool]] = {}
max_token_id = 4000  # token id to stop at; 4000 for testing, later the full vocab size
batch_size = 500
# we iterate (batchwise) over all token_ids
for start in tqdm(range(0, max_token_id, batch_size), desc="Labelling tokens"):
    # create a batch of token_ids
    token_ids = list(range(start, start+batch_size))
    # decode each token id on its own to get a list of tokens, a 'sentence'
    tokens = [decode(tokenizer, token_id) for token_id in token_ids]  # list of tokens == sentence
    # put the sentence into a list, to make it a batch of sentences
    sentences = [tokens]
    # label the batch of sentences
    labels = token_labelling.label_batch_sentences(sentences, tokenized=True, verbose=False)
    # create a dict with the token_ids and their labels
    labelled_sentence_dict = dict(zip(token_ids, labels[0]))
    # update the labelled_token_ids_dict with the new dict
    labelled_token_ids_dict.update(labelled_sentence_dict)
    
    # print(token_ids)
    # print(sentences)
    # print(labels)

Labelling tokens:   0%|          | 0/8 [00:00<?, ?it/s]
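
One detail matters once `max_token_id` is raised to the full vocabulary size: 50257 is not a multiple of 500, so a final batch built with `range(start, start + batch_size)` would run past the last valid token id. The sketch below shows one way to clamp the last batch. The helper name `batched_token_ids` is hypothetical and not part of delphi.

```python
from typing import Iterator, List


def batched_token_ids(num_ids: int, batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive batches of token ids covering 0 .. num_ids - 1."""
    for start in range(0, num_ids, batch_size):
        # Clamp the end so the last batch never exceeds the vocabulary.
        yield list(range(start, min(start + batch_size, num_ids)))


batches = list(batched_token_ids(50257, 500))
# Number of batches, size of the last batch, and the last token id.
print(len(batches), len(batches[-1]), batches[-1][-1])  # prints: 101 257 50256
```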

We pickle the dict to save it for future use.  
With 4000 ids, the file currently takes about 150 kB of disk space.

In [103]:
import pickle
with open("labelled_token_ids_dict.pkl", "wb") as f:
    pickle.dump(labelled_token_ids_dict, f)
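
For completeness, here is a small sketch of how the pickled dict could be read back and queried later. It uses a hand-written toy dict and a temporary directory so that it runs on its own. The toy values are placeholders, not real labelling results. Note that `pickle.load` should only be used on files you trust.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for labelled_token_ids_dict (placeholder values, not real labels).
toy_labels = {
    0: {"Capitalized": False, "Is Noun": False},
    1: {"Capitalized": True, "Is Noun": True},
    2: {"Capitalized": False, "Is Noun": True},
}

# Write to a temporary location so the example does not overwrite the real file.
path = Path(tempfile.mkdtemp()) / "labelled_token_ids_dict.pkl"
with open(path, "wb") as f:
    pickle.dump(toy_labels, f)

# Load the dict again and look up all token ids with a given label.
with open(path, "rb") as f:
    restored = pickle.load(f)

noun_ids = [token_id for token_id, labels in restored.items() if labels["Is Noun"]]
print(noun_ids)
```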
    
    