# Exercise 3: Chunking
In the previous exercise, we searched through the PyData talks based on their title, abstracts, and descriptions.
However, as we noticed in the story example, embedding large pieces of text is not always the best approach.
The main reason for this is that an embedding can only capture so much information.
So, if a piece of text becomes longer, the embedding will become less accurate.
Therefore, it is often better to split the text into smaller chunks and embed each chunk separately.

In this exercise we will demonstrate how to perform chunking.

In [2]:
import json
from langchain_huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from llm_in_production.huggingface_utils import get_device
from llm_in_production.llm import instantiate_langchain_model
import dotenv

dotenv.load_dotenv()

client = instantiate_langchain_model(
    # llm_provider="azure",
    llm_provider="gcp",
)

We shall be working with the PyData talks dataset, so let's start by loading that in again.

In [3]:
with open("pydata.json", "r") as f:
    talks = json.load(f)["talks"]
    titles = [talk["title"] for talk in talks]
    abstracts = [talk["abstract"] for talk in talks]
    descriptions = [talk["description"] for talk in talks]

We will again use the [MiniLM](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) model from [HuggingFace](https://huggingface.co/) to embed the text. To make it easier to use, we have wrapped the model in a class called [HuggingFaceEmbeddings](https://api.python.langchain.com/en/latest/embeddings/langchain.embeddings.huggingface.HuggingFaceEmbeddings.html) from LangChain. Running the cell below will download the model from the HuggingFace model hub and load it into memory. This can take a while the first time you run it. However, the model will be cached on your computer, so it will be much faster the next time you run it.

In [4]:
# This function check if the accelerator is available like a GPU and if so, it will use it.
device = get_device()
# Here we create the embedding function that will be used to embed the sentences.
embedding_func = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2", model_kwargs={"device": get_device()})

## Exercise 3a: Introduction to text splitting and chunking

In the cell below, we have an example of how to split a text into chunks.

We use the [RecursiveCharacterTextSplitter](https://api.python.langchain.com/en/latest/text_splitter/langchain.text_splitter.RecursiveCharacterTextSplitter.html) from LangChain, which attempts to split the text using different separator characters, until it finds chunks of a small enough size.

The main reason for using this recursive approach is that it allows us to split the text into chunks of roughly the same size.

Please run the cell below and do the following:
- Change the `chunk_size` and `chunk_overlap` and see how the results change.
- Change the `keep_separator` and see how the results change.
- Add or remove separators and see how the results change.
- Currently, we are using the `client`'s built-in `get_num_tokens` method to determine how to measure the lengths of the chunks. Change the `length_function` to `len` and see how the results change.

In [5]:
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    chunk_size = 100,
    chunk_overlap  = 25,
    length_function = client.get_num_tokens,
    keep_separator=True,
    strip_whitespace=True,
)

talk_idx = 5
title = titles[talk_idx]
abstract = abstracts[talk_idx]
description = descriptions[talk_idx]
all_text = f"""
Title: {title}
Abstract: 
{abstract}
Description:
{description}
"""

chunks = text_splitter.split_text(all_text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i} (n_char={len(chunk)} n_tokens={client.get_num_tokens(chunk)}):")
    print(chunk)
    print()

Chunk 0 (n_char=508 n_tokens=99):
Title: Power Users, Long Tail Users, and Everything In Between: Choosing Meaningful Metrics and KPIs for Product Strategy
Abstract: 
Data scientists in industry often have to wear many hats. They must navigate statistical validity, business acumen and strategic thinking, while also representing the end user. In this talk, we will talk about the pillars that make a metric the right one for a job, and how to choose appropriate Key Performance Indicators (KPIs) to drive product success and strategic gains.

Chunk 1 (n_char=508 n_tokens=95):
Description:
Our presentation will traverse the relationship of data science skills in product strategy - embracing the multifaceted role of the data scientist and navigating the journey from user segmentation to making data-driven decisions.

1. The Data Scientist's Hat Trick: We initiate by emphasising the assorted roles that a data scientist plays in today's business landscape - from being a statistician ensuring 

## Exercise 3b: Incorporating chunking to get better search results.
We have now seen how to split a text into chunks.
Let's see if we can use it to get better search results on the PyData talks dataset.

We will do the following:
1. Create a `RecursiveCharacterTextSplitter` that splits the text into chunks of roughly 100 tokens, with an overlap of 10-30 tokens.
2. Then, for each talk, we do the following:
    1. Combine the title, abstract, and description into a single string.
    1. Split the text into chunks using the `RecursiveCharacterTextSplitter`.
    1. Store each chunk in the vector database and make all the chunks of the same talk have the same metadata:
        - The `title` of the talk.
        - The `talk_idx`, which is the index of the talk in the `talks` list.
3. We can then build the vector database around the chunks and their metadata.


In [8]:
text_splitter = RecursiveCharacterTextSplitter(
    # YOUR CODE HERE START: define the separators, chunk_size and chunk_overlap here.
    separators=["\n\n", "\n", ".", "?", "!", " ", ""],
    chunk_size= 100,
    chunk_overlap=10,
     # YOUR CODE HERE END
    length_function = client.get_num_tokens,
    keep_separator=True,
    strip_whitespace=True,
)

texts = []
metadatas = []

for talk_idx in range(len(talks)):
    title = titles[talk_idx]
    abstract = abstracts[talk_idx]
    description = descriptions[talk_idx]
    
    metadata = {"title": title, "talk_idx": talk_idx}
    all_text = f"""
        Title: {title} Abstract: {abstracts} Description: {description}
    """.strip()
    # YOUR CODE HERE START: Combine the title, abstract, and description into a single string.
    # YOUR CODE HERE END
    
    # YOUR CODE HERE START: Split the text into chunks using the RecursiveCharacterTextSplitter.
    chunks = text_splitter.split_text(all_text)
    # YOUR CODE HERE END
    
    for chunk in chunks:
        texts.append(chunk)
        metadatas.append(metadata)


assert len(texts) == len(metadatas), f"{len(texts)} != {len(metadatas)}"
    
# YOUR CODE HERE START: Create a vector database around the texts and their metadata. (Hint: use the FAISS.from_texts method)
vector_database = FAISS.from_texts(texts, metadatas=metadatas, embedding=embedding_func)
# YOUR CODE HERE END

Try out some different queries in the cell below. Do you get better results?

In [9]:
query = 'which talks are about LLM?'
# query = 'which talks are about data engineering?'
# query = 'which talks combine bayesian and llm?'
k = 3
documents = vector_database.similarity_search(query, k=k)

for document in documents:
    title = document.metadata['title']
    page_content = document.page_content
    
    print(f"Title: {title}")
    print(f"Page content: {page_content}")
    print("#" * 80 +"\n")

Title: What the PDEP? An overview of some upcoming pandas changes
Page content: .\r\n\r\nThis talk is for people who want to learn how to build their first LLM-based agent. Familiarity with Python, PyDantic, and LMMs is nice during this presentation but not essential. As long as you love overengineered solutions to a basic to-do list, you will like this presentation.', 'The continued success of large language models (LLMs) hinges upon accurate, diverse, and well-labeled data
################################################################################

Title: Turning your Data/AI algorithms into full web applications in no time with Taipy
Page content: .\r\n\r\nThis talk is for people who want to learn how to build their first LLM-based agent. Familiarity with Python, PyDantic, and LMMs is nice during this presentation but not essential. As long as you love overengineered solutions to a basic to-do list, you will like this presentation.', 'The continued success of large language mod

---