<a href="https://colab.research.google.com/github/antalvdb/antalvdb.github.io/blob/main/INFOMTALC2025_Seminar_4.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: Applications in Language and Communication (INFOMTALC)
## Seminar 4: Tokenizers and tools

Today's Colab consists of two parts, one for each of the two topics covered in the lecture. The first is **tokenization**. The second concerns a recent development in how language models are used in practice: they are given access to external **tools** such as web search or a calculator, and in combination with these tools they become what has been called **agents**. We will give agents a try.


# PART I: Tokenization and character-level information

During the lecture, we discussed standard practices of text tokenization for recent transformer language models. In particular, we introduced BPE, the most common subword tokenization algorithm (to refresh your memory of how it works and to see how it could be implemented, check out [this page on HF](https://huggingface.co/learn/nlp-course/en/chapter6/5)).
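
As a quick reminder of the core BPE training loop, here is a minimal sketch: count adjacent symbol pairs, merge the most frequent pair, and repeat. The toy word frequencies and the helper names `get_pair_counts` and `merge_pair` are our own illustration, not the implementation used by any real tokenizer library.

```python
from collections import Counter

def get_pair_counts(word_freqs):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in word_freqs.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(word_freqs, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in word_freqs.items():
        new_symbols, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                new_symbols.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                new_symbols.append(symbols[i])
                i += 1
        key = tuple(new_symbols)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus: words split into characters, with made-up frequencies
word_freqs = {tuple("hug"): 10, tuple("pug"): 5, tuple("pun"): 12,
              tuple("bun"): 4, tuple("hugs"): 5}

merges = []
for _ in range(3):  # learn three merge rules
    pair_counts = get_pair_counts(word_freqs)
    best_pair = max(pair_counts, key=pair_counts.get)
    merges.append(best_pair)
    word_freqs = merge_pair(word_freqs, best_pair)

print(merges)            # learned merge rules, in order
print(list(word_freqs))  # how each word is segmented after the merges
```

After three merges, the frequent word "hug" has become a single symbol, while rarer words are still split into smaller pieces.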

Additionally, we discussed character-level tokenization as an alternative to subword tokenization algorithms. A variety of tasks build on knowledge about letters (spelling correction being one prominent example), and subword tokenization loses character-level information because multi-character tokens are treated as indivisible wholes.

Character-level (or byte-level) tokenizers are not particularly widespread (yet?) due to computational considerations. But there are prominent models that have both subword-level and character-level variants. We will look at the [T5 family of models from Google](https://research.google/blog/exploring-transfer-learning-with-t5-the-text-to-text-transfer-transformer/). These are encoder-decoder models trained for a variety of sequence-to-sequence tasks. Their [mT5 models](https://github.com/google-research/multilingual-t5) (multilingual T5) have byte-level counterparts, the byT5 models (see the [paper](https://arxiv.org/abs/2105.13626)). Let's look at the small versions of the corresponding models and illustrate the differences in their tokenization:

In [None]:
from transformers import AutoTokenizer, T5Tokenizer

tokenizer_subword = T5Tokenizer.from_pretrained("google/mt5-small")
tokenizer_byte = AutoTokenizer.from_pretrained("google/byt5-small")

sentence = "This is a tokenization test!"
print('DIFFERENT TOKENIZATIONS:')
print('Subword:', tokenizer_subword.tokenize(sentence))
print('Character:', tokenizer_byte.tokenize(sentence))
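
To see what a byte-level tokenizer starts from, it helps to look at the raw UTF-8 bytes of a string. The sketch below uses only the Python standard library; the helper name `utf8_bytes` is ours. Note that the actual byT5 tokenizer additionally reserves a few ids for special tokens, so the ids you get from `tokenizer_byte(sentence)` will be shifted relative to these raw byte values.

```python
def utf8_bytes(text):
    """Return the UTF-8 byte values that a byte-level tokenizer starts from."""
    return list(text.encode("utf-8"))

ascii_word = "test"
accented_word = "café"

print(utf8_bytes(ascii_word))     # one byte per ASCII character
print(utf8_bytes(accented_word))  # 'é' takes two bytes in UTF-8
```

This is why "character-level" and "byte-level" coincide for plain English text but diverge for accented letters and non-Latin scripts, where one character can correspond to several tokens.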

Does the regular mT5 model represent information about the characters that its tokens are made of? And how does it compare in this respect to byT5, a minimally different character-level model? Let's try to answer this question empirically.

In order to do so, we will turn to a method called **probing**, sometimes also **diagnostic classification**. The idea is very simple: if some information is encoded in the model's representations (the embeddings it produces at different layers), it should be retrievable by a simple model -- for instance, a logistic regression.

If you want to know more about this method and what it reveals, here are two classic papers for further reading:

- Conneau et al. (2018) [What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties](https://aclanthology.org/P18-1198/). ACL 2018.
- Rogers et al. (2020) [A Primer in BERTology: What We Know About How BERT Works](https://doi.org/10.1162/tacl_a_00349). TACL 2020: 8, pp. 842–866.


Probing is one of the earliest and still classic ways of opening the black box of language models. It is part of the broader field of language model **interpretability** -- a super-active area of research. Check out papers from [BlackboxNLP](https://aclanthology.org/venues/blackboxnlp/) -- a workshop dedicated to interpretability in NLP.

Now, back to character-level information in models with subword tokenizers. Let's build a probing classifier that, based on contextual embeddings from the model, tries to predict whether the sequence contains a particular character. In terms of implementation, it's going to be very similar to what we saw in Colab 1 where we used embeddings as features for an external classifier model. Some of the steps will be implemented for you, some of the steps will be your coding exercises.

Let's start by deciding which embeddings to use. We will use only the encoder part of the encoder-decoder model. The encoder produces an embedding for each input token, so we need a way to represent the whole sequence with a single vector given all the token vectors. T5 models don't have the `[CLS]` token that is used to represent the whole sequence for classification tasks, so we have to come up with something else. One common solution is to average across all the token vectors. Here is how this could work:

In [None]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = "google/mt5-small"

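tokenizer = AutoTokenizer.from_pretrained(model_name)
# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

tokenized = tokenizer(sentence, return_tensors='pt').to(device)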
with torch.no_grad():
  output = model.encoder(**tokenized, return_dict=True)
pooled_emb = output.last_hidden_state
pooled_emb = torch.mean(pooled_emb, dim=1)
pooled_emb.shape
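
Plain `torch.mean` over the token dimension is fine for a single sentence. If you later embed several words in one batch with padding, the padded positions would be averaged in as well. Here is a minimal sketch of mean pooling that uses the attention mask to skip them; the function name `masked_mean_pool` and the stand-in tensors are our own, and in practice you would pass the encoder output and `tokenized['attention_mask']`.

```python
import torch

def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, dim)
    counts = mask.sum(dim=1).clamp(min=1)                            # (batch, 1)
    return summed / counts

# Stand-in tensors: 2 sequences, 3 token positions, 4 hidden dimensions
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 3, 4)
mask = torch.tensor([[1, 1, 1],
                     [1, 1, 0]])  # last position of the second sequence is padding

pooled = masked_mean_pool(hidden, mask)
print(pooled.shape)
```

For the first sequence this gives the same result as a plain mean; for the second, only the two real tokens contribute.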

We can use embeddings extracted this way as features for our diagnostic classifier. But what should this classifier classify? We could use a corpus as a source of words and treat each word as a data point, with the class label reflecting whether the word contains a particular letter (for instance, 'n'). Something like this:

In [None]:
import pandas as pd

pd.DataFrame({'word': ['table', 'chair', 'panel', 'cake'], 'has n?': [0, 0, 1, 0]})

**Perform the next step yourself**. We give you a small corpus from Simple Wikipedia and suggest some useful imports. Open the file, get the 10,000 most frequent words, optionally filter out non-words by some simple criterion (say, exclude strings that contain digits), and create a dataframe similar to the one above.
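
If you are unsure where to start, here is one possible shape of the solution on a toy string. It is only a sketch under our own assumptions: it uses `collections.Counter` with whitespace splitting instead of the suggested `CountVectorizer`, and the variable names are ours. You still need to read the real corpus file and adapt the steps.

```python
import re
from collections import Counter

# Toy stand-in for the corpus file; the real text would be read from disk
toy_text = "the cat and the dog ran in 2024 and the cat sat on a mat"

# Count lowercase whitespace-separated tokens
counts = Counter(toy_text.lower().split())

# Keep the most frequent tokens that contain no digits
top_k = 5  # the exercise asks for 10000
words = [w for w, _ in counts.most_common() if not re.search(r"\d", w)][:top_k]

# One row per word, with a binary label for the target letter
rows = [{"word": w, "n": int("n" in w)} for w in words]
print(rows)
```

A list of dictionaries like `rows` can be passed directly to `pd.DataFrame(rows)`.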

In [None]:
! wget https://raw.githubusercontent.com/bylinina/TMA_seminars/refs/heads/main/simple_corpus.txt
! pip install datasets

from sklearn.feature_extraction.text import CountVectorizer
import re
import pandas as pd
import numpy as np
from datasets import Dataset
#from itertools import islice # not necessarily needed

##YOUR CODE##

**Perform this step yourself**. Now, produce a train dataset and a test dataset (we suggest an 80:20 split), tokenize all the words in the datasets with the mT5 tokenizer, and embed all the tokenized words with the mT5 encoder using mean pooling. Once this is done, we are ready to feed the embeddings into a diagnostic classifier.

In [None]:
##YOUR CODE##

Now, we fit a logistic regression on these embeddings as features. So, does mT5 with a subword tokenizer 'know' whether a word contains the letter 'n'? Our results show that mT5 represents this information at least to some extent: accuracy is well above chance! (If you simply want to run the cell below, make sure your datasets and their columns are named the same way as in the code.)

In [None]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, max_iter=10000).fit(train_dataset['embeddings'], train_dataset['n'])
clf.score(test_dataset['embeddings'], test_dataset['n'])
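
To judge whether a probing accuracy is really above chance, compare it to a trivial baseline. Since most words do not contain 'n', always predicting the majority class already scores well above 50%. Below is a minimal sketch with made-up labels; the function name `majority_baseline_accuracy` is ours, and in practice you would pass `train_dataset['n']` and `test_dataset['n']`.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    hits = sum(1 for label in test_labels if label == majority_label)
    return hits / len(test_labels)

# Made-up labels: 1 means the word contains 'n', 0 means it does not
toy_train = [0, 0, 0, 1, 0, 1, 0, 0]
toy_test = [0, 1, 0, 0]

print(majority_baseline_accuracy(toy_train, toy_test))
```

The probe is only informative to the extent that it beats this number.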

By way of comparison, **repeat the same analysis** for the character-level counterpart of this model: byT5 ("google/byt5-small"). Our results show that character-level information is represented in byT5 nearly perfectly -- which, of course, is expected.

In [None]:
##YOUR CODE##

**Bonus task:** We looked at the embeddings that mT5 produces after the very last layer of the encoder. Is character-level information represented better or worse after earlier layers? You can access intermediate representations by specifying `output_hidden_states=True`. Note that the last of the hidden-state tensors is the same as what you access with `last_hidden_state`.

In [None]:
model_name = "google/mt5-small"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

tokenized = tokenizer(sentence, return_tensors='pt')
with torch.no_grad():
  output = model.encoder(**tokenized, output_hidden_states=True, return_dict=True)

hidden_states = output.hidden_states
len(hidden_states), hidden_states[0].shape, torch.equal(hidden_states[-1], output.last_hidden_state)
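
One way to approach the bonus task is to pool every layer's token vectors separately, so that you end up with one feature matrix per layer and can train one probe on each. The sketch below uses random stand-in tensors instead of real model output, and the function name `pool_each_layer` is ours; with the real model you would pass `output.hidden_states`.

```python
import torch

def pool_each_layer(hidden_states):
    """Mean-pool every layer's token vectors: returns (num_states, batch, dim)."""
    return torch.stack([layer.mean(dim=1) for layer in hidden_states])

# Random stand-ins for `output.hidden_states`: 3 tensors of shape (batch, seq, dim)
torch.manual_seed(0)
fake_hidden_states = tuple(torch.randn(2, 5, 8) for _ in range(3))

per_layer = pool_each_layer(fake_hidden_states)
print(per_layer.shape)
```

Here `per_layer[k]` holds the pooled features for the k-th entry of the tuple, so you can fit a separate classifier for each `k` and compare the scores.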

# PART II: Tools and agents

During the lecture, we briefly mentioned that language models have recently been equipped with **tools**: LM output is used to trigger external instruments such as web search or a calculator. The output of these instruments can then be fed back to the LM, where it conditions further text generation. See [Intro to agents](https://huggingface.co/docs/smolagents/en/conceptual_guides/intro_agents) from HF for an intro. We will take a look at agents and tools with the [smolagents](https://github.com/huggingface/smolagents) library that was released just a couple of months ago. In order to try it out, we will need to log into Hugging Face and install `smolagents`:

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
!pip install smolagents

Now, we will use the same LLM via HF API first without any tools and then see how the web search tool changes its answer. Let's ask the model when schools will have summer holidays in the Netherlands this year:

In [None]:
from smolagents import HfApiModel

engine = HfApiModel(
    model_id="Qwen/Qwen2.5-72B-Instruct",
    max_tokens=500)

messages = [{"role": "user", "content": "What are the dates of school summer holidays in the Netherlands in 2025?"}]
response = engine(messages, stop_sequences=["END"])
response.content

The model cannot give a precise answer, most likely because the question concerns a period after its knowledge cut-off. Now, we can build an agent on top of this LLM that is equipped with one of the `smolagents` tools: `DuckDuckGoSearchTool`. Given the question, the LLM-driven agent outputs code that runs the web search tool. The search results are fed back into the model, which in turn calls the built-in `final_answer` tool to return the final answer.
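
The loop described above can be illustrated with a toy example that needs no LLM and no network. Everything here is hypothetical and written only for illustration: `fake_search`, `fake_model` and `run_toy_agent` are our own stand-ins, and this is not how `smolagents` is implemented. In particular, a real `CodeAgent` has the LLM write Python code that calls the tools, whereas this sketch uses a simple dictionary to describe each action.

```python
def fake_search(query):
    """Hypothetical stand-in for a web search tool; returns canned text."""
    return f"[canned search results for: {query}]"

def fake_model(history):
    """Hypothetical stand-in for the LLM: picks the next action from the history."""
    last = history[-1]
    if last["role"] == "user":
        # First step: ask for a tool call
        return {"tool": "search", "argument": last["content"]}
    # A tool result is available: wrap up with the final answer
    return {"tool": "final_answer", "argument": f"Based on {last['content']}"}

def run_toy_agent(question, tools, max_steps=5):
    """Alternate between model decisions and tool calls until a final answer."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_steps):
        action = fake_model(history)
        if action["tool"] == "final_answer":
            return action["argument"]
        # Run the requested tool and feed its output back to the model
        observation = tools[action["tool"]](action["argument"])
        history.append({"role": "tool", "content": observation})
    return None  # gave up after max_steps

answer = run_toy_agent("school summer holidays 2025", {"search": fake_search})
print(answer)
```

The important part is the structure: the model decides, a tool runs, the observation goes back into the history, and the loop ends when the model chooses the final answer.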

Compare the answers of the plain LLM and search-equipped LLM-based agent -- which one is better?

In [None]:
from smolagents import load_tool, CodeAgent, DuckDuckGoSearchTool

search_tool = DuckDuckGoSearchTool()

agent = CodeAgent(tools=[search_tool], model=engine)

agent.run("What are the dates of school summer holidays in the Netherlands in 2025?")

Importing a tool from the `smolagents` library is not the only way to introduce a tool. You can also load a tool from the HF hub, as in the example below with an image generation tool (if you go to the [corresponding HF space](https://huggingface.co/spaces/m-ric/text-to-image) and check out the files there, you will see the tool defined based on the `Tool` class from `smolagents`). Note how the LLM first refines our prompt and only then sends it to the image generation tool:

In [None]:
image_generation_tool = load_tool("m-ric/text-to-image", trust_remote_code=True)

agent = CodeAgent(tools=[image_generation_tool], model=engine)

agent.run("Improve this prompt, then generate an image of it.", additional_args={'user_prompt': 'A rabbit in a space suit'})

You can combine multiple tools in one agent:

In [None]:
agent = CodeAgent(
    tools=[image_generation_tool, search_tool], model=engine)

agent.run("Improve this prompt, then generate an image of it.", additional_args={'user_prompt': 'the car that James Bond drove in the latest movie'})

In fact, agents in `smolagents` have some built-in tools that can be used without explicitly passing them to the agent. `smolagents` has a built-in Python interpreter, so the agent can help debug your code: it runs the code, collects the error, and feeds the error back to the LM, which then attempts to rewrite the code to solve the problem, and so on:

In [None]:
agent = CodeAgent(tools=[], model=engine)

code = """
numbers=[0, 1, 2]

for i in range(4):
    print(numbers(i))
"""

agent.run(
    "I have some code that creates a bug: please debug it, then run it to make sure it works and return the final code",
    additional_args=dict(code=code))
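
For reference, the snippet above has two separate problems: `numbers(i)` calls the list as if it were a function (a `TypeError`), and `range(4)` runs one index past the end of a three-element list (an `IndexError` once the first problem is fixed). One possible repair, which the agent may or may not arrive at, looks like this:

```python
numbers = [0, 1, 2]

# Index with square brackets, and iterate only over valid positions
for i in range(len(numbers)):
    print(numbers[i])
```

Compare this with the final code that the agent returns.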

Finally, you can define a tool yourself. Use `tool` as a function decorator, as in the examples below, where we define a tool that retrieves the current time for a given location and a tool that looks up information on Wikipedia:

In [None]:
from smolagents import tool
import requests

@tool
def get_time_in_timezone(location: str) -> str:
    """
    Fetches the current time for a given location using the World Time API.
    Args:
        location: The location for which to fetch the current time, formatted as 'Region/City'.
    Returns:
        str: A string indicating the current time in the specified location, or an error message if the request fails.
    Raises:
        requests.exceptions.RequestException: If there is an issue with the HTTP request.
    """
    url = f"http://worldtimeapi.org/api/timezone/{location}.json"

    try:
        response = requests.get(url)
        response.raise_for_status()

        data = response.json()
        current_time = data["datetime"]

        return f"The current time in {location} is {current_time}."

    except requests.exceptions.RequestException as e:
        return f"Error fetching time data: {str(e)}"

@tool
def search_wikipedia(query: str) -> str:
    """
    Fetches a summary of a Wikipedia page for a given query.
    Args:
        query: The search term to look up on Wikipedia.
    Returns:
        str: A summary of the Wikipedia page if successful, or an error message if the request fails.
    Raises:
        requests.exceptions.RequestException: If there is an issue with the HTTP request.
    """
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{query}"

    try:
        response = requests.get(url)
        response.raise_for_status()

        data = response.json()
        title = data["title"]
        extract = data["extract"]

        return f"Summary for {title}: {extract}"

    except requests.exceptions.RequestException as e:
        return f"Error fetching Wikipedia data: {str(e)}"
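
The Wikipedia tool above pastes the query straight into the URL, which can go wrong for titles with spaces or slashes. A small sketch of how the URL could be built more defensively with the standard library is shown below. The helper name `build_summary_url` is ours, and replacing spaces with underscores is an assumption based on how Wikipedia page URLs are usually written.

```python
from urllib.parse import quote

SUMMARY_ENDPOINT = "https://en.wikipedia.org/api/rest_v1/page/summary/"

def build_summary_url(title):
    """Build a summary URL with the page title percent-encoded."""
    # Assumption: spaces are written as underscores, as in Wikipedia page URLs
    normalized = title.strip().replace(" ", "_")
    # safe="" also encodes "/" so that titles like "AC/DC" stay one path segment
    return SUMMARY_ENDPOINT + quote(normalized, safe="")

print(build_summary_url("Vladimir Lenin"))
print(build_summary_url("AC/DC"))
```

You could call a helper like this inside `search_wikipedia` instead of formatting the query into the URL directly.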


Let's now interact with an agent equipped with these tools!

In [None]:
agent = CodeAgent(
    tools=[
        search_wikipedia,
        get_time_in_timezone,
    ],
    max_steps=10,
    model=engine)

agent.run("What's the time in Lenin's place of birth?")

**Try out** different models, tools, questions and their combinations!