This notebook generates expected tokens and token IDs for use in unit tests.

Use [The Tokenizer Playground](https://huggingface.co/spaces/Xenova/the-tokenizer-playground) to verify token IDs. 

In [1]:
from transformers import AutoTokenizer

In [2]:
def get_token_IDs(text, model="microsoft/Phi-3.5-mini-instruct"): 
    """ 
    Get token IDs given some text and model. 

    Parameters
    ----------
        - text (str): text for which to get token IDs
        - model (str): Hugging Face model ID of the tokenizer to load

    Returns
    -------
    tuple of (tokens, token_IDs)
        - tokens (list): list of tokens for input text (no special tokens)
        - token_IDs (list): list of token IDs for input text; may also
          contain special-token IDs, since add_special_tokens=True
    """
    tokenizer = AutoTokenizer.from_pretrained(model) 
    tokens = tokenizer.tokenize(text) 
    token_IDs = tokenizer.encode(text, add_special_tokens=True)

    return tokens, token_IDs

In [3]:
text = 'Hello, how are you?'
tokens, token_IDs = get_token_IDs(text)
print(tokens)
print(token_IDs)

['▁Hello', ',', '▁how', '▁are', '▁you', '?']
[15043, 29892, 920, 526, 366, 29973]
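
The `▁` character (U+2581) in the tokens above is the SentencePiece-style marker for a preceding space. As a quick sanity check on recorded tokens, the original text can be rebuilt by joining the pieces and turning the marker back into spaces. This is only a minimal sketch: `detokenize` is a hypothetical helper, not part of `transformers`, and it assumes a tokenizer that uses this marker and applies no other normalization.

```python
def detokenize(tokens, marker="\u2581"):
    """Rebuild text from SentencePiece-style tokens (hypothetical helper)."""
    # Join the pieces, map the word-boundary marker back to a space,
    # and drop the leading space carried by the first token.
    return "".join(tokens).replace(marker, " ").lstrip(" ")


tokens = ["▁Hello", ",", "▁how", "▁are", "▁you", "?"]
print(detokenize(tokens))
```

If the rebuilt string differs from the input text, the recorded tokens were probably copied incorrectly or the tokenizer normalizes its input.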


In [4]:
texts = ["The sky is bright today.", 
         "I love classical music.", 
         "Data science is fascinating.", 
         "Could you pass the salt?", 
         "Baroque composers inspire my work.", 
         "What time is the meeting?", 
         "This coffee tastes really good.",
         "Purple is pretty"]
tokens = []
token_IDs = []
for text in texts:
    tokens_, token_IDs_ = get_token_IDs(text)
    tokens.append(tokens_)
    token_IDs.append(token_IDs_)

In [5]:
tokens

[['▁The', '▁sky', '▁is', '▁bright', '▁today', '.'],
 ['▁I', '▁love', '▁classical', '▁music', '.'],
 ['▁Data', '▁science', '▁is', '▁fasc', 'in', 'ating', '.'],
 ['▁Could', '▁you', '▁pass', '▁the', '▁salt', '?'],
 ['▁Bar', 'o', 'que', '▁compos', 'ers', '▁insp', 'ire', '▁my', '▁work', '.'],
 ['▁What', '▁time', '▁is', '▁the', '▁meeting', '?'],
 ['▁This', '▁coffee', '▁t', 'ast', 'es', '▁really', '▁good', '.'],
 ['▁Pur', 'ple', '▁is', '▁pretty']]

In [6]:
token_IDs

[[450, 14744, 338, 11785, 9826, 29889],
 [306, 5360, 14499, 4696, 29889],
 [3630, 10466, 338, 21028, 262, 1218, 29889],
 [6527, 366, 1209, 278, 15795, 29973],
 [2261, 29877, 802, 5541, 414, 8681, 533, 590, 664, 29889],
 [1724, 931, 338, 278, 11781, 29973],
 [910, 26935, 260, 579, 267, 2289, 1781, 29889],
 [15247, 552, 338, 5051]]

Create a dataframe of the expected tokens and token IDs for each model and text:

In [7]:
models = ["google/flan-t5-xxl",
          "bigscience/mt0-xxl-mt",
          "CohereForAI/aya-101",
          "bigscience/bloomz-7b1",
          "microsoft/Phi-3.5-mini-instruct",
          "neulab/Pangea-7B",
          "google/gemma-7b",
          "google/gemma-2-9b",
          "meta-llama/Llama-3.2-1B-Instruct"]
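
One way to assemble the expected values for every model and text is to collect one record per (model, text) pair and hand the records to pandas. The sketch below is an assumption about the layout, not the notebook's actual code: `build_expected_records` and the column names are hypothetical, and the tokenizing function is passed in as an argument so that `get_token_IDs` from above can be used directly.

```python
def build_expected_records(models, texts, get_ids):
    """Collect one record per (model, text) pair (hypothetical helper).

    get_ids is any callable taking (text, model=...) and returning
    (tokens, token_IDs), such as get_token_IDs defined earlier.
    """
    records = []
    for model in models:
        for text in texts:
            tokens, token_ids = get_ids(text, model=model)
            # Column names are illustrative assumptions.
            records.append({"model": model, "text": text,
                            "tokens": tokens, "token_IDs": token_ids})
    return records
```

With the notebook's names, `records = build_expected_records(models, texts, get_token_IDs)` followed by `pandas.DataFrame(records)` would give one row per model and text.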

In [8]:
# Model: google/flan-t5-xxl
tokens = []
token_IDs = []
for text in texts:
    tokens_, token_IDs_ = get_token_IDs(text, model='google/flan-t5-xxl')
    tokens.append(tokens_)
    token_IDs.append(token_IDs_)

tokens

[['▁The', '▁sky', '▁is', '▁bright', '▁today', '.'],
 ['▁I', '▁love', '▁classical', '▁music', '.'],
 ['▁Data', '▁science', '▁is', '▁fascinating', '.'],
 ['▁Could', '▁you', '▁pass', '▁the', '▁salt', '?'],
 ['▁Bar', 'o', 'que', '▁composer', 's', '▁inspire', '▁my', '▁work', '.'],
 ['▁What', '▁time', '▁is', '▁the', '▁meeting', '?'],
 ['▁This', '▁coffee', '▁tastes', '▁really', '▁good', '.'],
 ['▁Purple', '▁is', '▁pretty']]

Use these tokens and token IDs as expected values in unit tests.
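
A minimal sketch of how the recorded IDs could serve as expected values. The expected lists are copied from the `microsoft/Phi-3.5-mini-instruct` output above; `find_mismatches` is a hypothetical helper, and its `encode` argument is assumed to be any callable that maps text to a list of token IDs (for example, a thin wrapper around `get_token_IDs`).

```python
# Expected token IDs copied from the microsoft/Phi-3.5-mini-instruct output above.
EXPECTED_TOKEN_IDS = {
    "Hello, how are you?": [15043, 29892, 920, 526, 366, 29973],
    "Purple is pretty": [15247, 552, 338, 5051],
}


def find_mismatches(encode, expected=EXPECTED_TOKEN_IDS):
    """Return {text: (actual_ids, expected_ids)} for every text whose IDs differ."""
    mismatches = {}
    for text, expected_ids in expected.items():
        actual_ids = list(encode(text))
        if actual_ids != expected_ids:
            mismatches[text] = (actual_ids, expected_ids)
    return mismatches
```

In a test, `assert find_mismatches(lambda t: get_token_IDs(t)[1]) == {}` would pass only when every text matches; an empty result means no differences were found.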