<a href="https://colab.research.google.com/github/besimorhino/ai-workshop/blob/main/Tokens.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Step 0. Setup

We need to install some Python libraries to make our demo code work.

In [None]:
!pip install langchain langchain-community transformers torch

We now load the libraries needed to generate our tokens.

In [None]:
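from transformers import AutoTokenizer

# Name of the pretrained checkpoint whose tokenizer we want to use
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Download (or load from cache) the tokenizer that matches this checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)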

# Step 1. Tokenizing
Tokenizing is a fancy term for slicing something up (in this case, text) and representing each piece in another form, typically as an integer ID. It is a foundational step that lets AI systems work with human text. The area of computer science concerned with this is called Natural Language Processing (NLP).
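To make the idea concrete, here is a minimal sketch of a toy word-level tokenizer. It is an illustration only: the helper names `build_vocab` and `encode` are hypothetical, and the DistilBERT tokenizer loaded above works on sub-word pieces rather than whole words.

```python
# Toy word-level tokenizer: an illustration only, not how DistilBERT tokenizes.
import re

def build_vocab(texts):
    """Assign an integer ID to every distinct lowercase word or punctuation mark."""
    vocab = {"[UNK]": 0}  # ID 0 is reserved for tokens we have never seen
    for text in texts:
        for token in re.findall(r"\w+|[^\w\s]", text.lower()):
            if token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    """Slice the text into tokens and look up the ID of each one."""
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [vocab.get(token, vocab["[UNK]"]) for token in tokens]

toy_vocab = build_vocab(["Do they live up to the hype?"])
print(encode("Do they live?", toy_vocab))    # every token is in the vocabulary
print(encode("Do tokens live?", toy_vocab))  # "tokens" is unknown, so it maps to ID 0
```

The same two steps (slice, then look up an ID) are what a real tokenizer does, just with a much larger, pretrained vocabulary.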

In [None]:
raw_inputs = [
    "I've been wanting to know what tokens are all about...",
    "Do they live up to the hype?",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
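
The printed `inputs` holds an `input_ids` tensor and an `attention_mask` tensor. Because we passed `padding=True`, the shorter sentence is padded to the length of the longer one, and the mask marks real tokens with 1 and padding with 0. A rough sketch of that idea in plain Python, using a hypothetical `pad_batch` helper and made-up token IDs:

```python
# Sketch of padding and attention masks. The token IDs below are made up.
def pad_batch(sequences, pad_id=0):
    """Pad every sequence to the longest length and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (longest - len(seq)) for seq in sequences]
    attention_mask = [
        [1] * len(seq) + [0] * (longest - len(seq)) for seq in sequences
    ]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = pad_batch([[11, 12, 13, 14, 15], [21, 22, 23]])
print(batch["input_ids"])
print(batch["attention_mask"])
```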

Let's go for a much larger data set!

In [None]:
# Download the State of the Union speech data if it is not already present
import os

if not os.path.exists("./data/state_of_the_union.txt"):
    !mkdir -p ./data
    !wget -P ./data https://raw.githubusercontent.com/KxSystems/kdbai-samples/main/retrieval_augmented_generation/data/state_of_the_union.txt

In [None]:
# Import LangChain's text loader and text splitter classes
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the documents we want to prompt an LLM about
doc = TextLoader("data/state_of_the_union.txt").load()

We can now see that we have 1 document loaded in the `doc` variable.

In [None]:
len(doc)

In [None]:
# Split the document into chunks of at most 200 characters using LangChain's RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=0)

In [None]:
# split_documents returns a list of Document chunks; keep only the text of each one
chunks = [p.page_content for p in text_splitter.split_documents(doc)]

The single document is now split into many chunks.

In [None]:
len(chunks)

In [None]:
# Tokenize each chunk on its own and collect the results
tokenized_chunks = []
for chunk in chunks:
    inputs = tokenizer(chunk, padding=True, truncation=True, return_tensors="pt")
    tokenized_chunks.append(inputs)

Show the first chunk of the State of the Union speech.

In [None]:
chunks[0]

Show the token IDs (and attention mask) that were produced from this chunk.

In [None]:
tokenized_chunks[0]

Interesting detail: unless you force it to split at a specific boundary, the `RecursiveCharacterTextSplitter` class from LangChain attempts to split at an 'acceptable' boundary, such as a paragraph break, a line break, or a space. This results in chunks that are close to, but not exactly, our `chunk_size`.
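
To see why boundary-aware splitting gives uneven chunks, here is a simplified sketch of the idea. It is not LangChain's implementation: `toy_split` is a hypothetical helper, it ignores chunk overlap, and the sample text is made up. It tries coarse separators first and only falls back to finer ones (and finally a hard cut) when a piece is still too long.

```python
# Simplified boundary-aware splitter: an assumption-laden sketch, not LangChain's code.
def toy_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, so keep growing this chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # A single piece is too long: retry it with finer separators.
            chunks.extend(toy_split(piece, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "Tokens are small. Chunks are bigger.\nSplitters prefer natural boundaries."
for piece in toy_split(sample, 30):
    print(len(piece), repr(piece))
```

Every chunk stays within the limit, but because cuts only happen at separators, the lengths differ from chunk to chunk.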

Here we see the length of the first chunk:

In [None]:
len(chunks[0])

But you can see that not all the chunks are the same size!

In [None]:
for chunk in chunks:
    print(len(chunk))
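
Printing every length gets noisy for a long speech. As a possible alternative, the sketch below summarizes the lengths instead. The `summarize_lengths` helper is hypothetical and the sample strings are made up; in the notebook you would pass in `chunks`.

```python
# Summarize chunk lengths instead of printing each one (sample data is made up).
from statistics import mean

def summarize_lengths(pieces):
    """Return the count, shortest, longest, and average length of the pieces."""
    lengths = [len(p) for p in pieces]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
    }

sample_chunks = ["short one", "a somewhat longer chunk", "mid-sized chunk"]
summary = summarize_lengths(sample_chunks)
print(summary)
```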