# Data Ingestion - Chunking

### Introduction

So far, we have stored entire documents in our database and then fed them as context to the LLM.  For example, when working with our restaurant review data, each individual review was stored separately and fed to the LLM.

However, oftentimes we will have longer documents.  In that case, we may have to break up our documents into multiple segments, both for storage and for feeding to our LLM.

We'll describe some of the reasoning and techniques for accomplishing this in this lesson.

### Chunks and Tokens

When we feed text to an AI model, that text is first broken into tokens before being embedded.  A token is the smallest unit into which text data can be broken down for an AI model to process.  

With a word-level tokenizer like the one below, each word typically becomes a separate token.  However, as you'll see, the tokens may also include punctuation.

```python
import nltk

nltk.download('punkt')  # Download the tokenizer data (newer NLTK releases may require 'punkt_tab' instead)
from nltk.tokenize import word_tokenize

text = "I am sleepwalking, are you?"
tokens = word_tokenize(text)
print(tokens)
# ['I', 'am', 'sleepwalking', ',', 'are', 'you', '?']
```

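With LLMs we do not need to worry too much about the tokenization process -- the LLM will tokenize our text for us.  Note that LLM tokenizers often split a word into several sub-word pieces, so the token count is usually higher than the word count.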

* Chunking 

What is more relevant to us are the *chunks* that a document is broken into, which are then stored in our vector database.

Remember that we would not want to store an entire PDF file or document as a single entry in a vector database.  If we did, we would be retrieving a lot of information that is not relevant to the user's query.  So instead, we chunk our document into smaller components.

Chunking just means breaking down large pieces of text into smaller segments.  These chunks will ultimately be embedded into vectors to be stored in our database, and then the relevant chunks will be retrieved and fed to our LLM.  Typically, chunk sizes range from about 250 tokens (on the small end) to a few thousand tokens.  And remember that, for an LLM, a token is about 3/4 of a word.
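As a rough illustration of the 3/4 rule of thumb above, the sketch below estimates a token count from a word count.  The helper name `estimate_tokens` is hypothetical, and the result is only an approximation; a real tokenizer will produce different counts.

```python
import math

def estimate_tokens(text, words_per_token=0.75):
    """Roughly estimate a token count from a word count.

    Assumes the rule of thumb that one token is about 3/4 of a word.
    """
    word_count = len(text.split())
    return math.ceil(word_count / words_per_token)

sample = "Chunking breaks large pieces of text into smaller segments"
print(len(sample.split()), "words ->", estimate_tokens(sample), "tokens (estimated)")
```

An estimate like this can be handy for a quick check of whether a chunk is likely to fit within a given token limit.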

When working with LLMs, you will have some control over the size of the chunk.  You may wish to change the chunk size depending on (among other factors):

* How the content is naturally grouped.  For example, for a PDF document, we can imagine chunking a paragraph at a time, and then embedding it.  If we have a dataset of tweets, we may treat each individual tweet as a chunk (see more in the chunking strategies resource below).

* The amount of context we can provide to our LLM.  For example, GPT-4 can be fed more information in its context than GPT-3.5, so the chunks could be larger with GPT-4.

* The kinds of questions a user asks.  Longer questions typically require more information, so if the user is providing longer questions we would want larger chunks.

> See more considerations [here](https://www.pinecone.io/learn/chunking-strategies/)
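To make the idea of a configurable chunk size concrete, here is a minimal sketch that groups consecutive sentences into chunks under a size budget.  The function `pack_sentences` and its `max_words` parameter are hypothetical, and the budget is counted in words rather than tokens to keep the example dependency-free.

```python
def pack_sentences(sentences, max_words=50):
    """Group consecutive sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own oversized chunk.
    """
    chunks, current, current_words = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Close the current chunk if this sentence would exceed the budget.
        if current and current_words + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_words = [], 0
        current.append(sentence)
        current_words += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

sentences = [
    "Paragraphs are a natural unit for a PDF.",
    "Tweets are a natural unit for social media.",
    "Larger context windows allow larger chunks.",
]
small_chunks = pack_sentences(sentences, max_words=10)
large_chunks = pack_sentences(sentences, max_words=30)
print(len(small_chunks), len(large_chunks))
```

With the small budget each sentence ends up in its own chunk, while the larger budget lets all three sentences share one chunk.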

For now we can just set each sentence to be a separate chunk.

```python
climate_change_text = "In common usage, climate change describes global warming—the ongoing increase in global average temperature—and its effects on Earth's climate system. Climate change in a broader sense also includes previous long-term changes to Earth's climate. The current rise in global average temperature is more rapid than previous changes, and is primarily caused by humans burning fossil fuels. Fossil fuel use, deforestation, and some agricultural and industrial practices add to greenhouse gases, notably carbon dioxide and methane. Greenhouse gases absorb some of the heat that the Earth radiates after it warms from sunlight. Larger amounts of these gases trap more heat in Earth's lower atmosphere, causing global warming."
```

```python
chunks = climate_change_text.split('.')
chunks[:2]
# ["In common usage, climate change describes global warming—the ongoing increase in global average temperature—and its effects on Earth's climate system",
#  " Climate change in a broader sense also includes previous long-term changes to Earth's climate"]
```
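Splitting on `'.'` drops the period, leaves leading whitespace on each chunk, and produces an empty string after the final sentence.  A slightly tidier variant, sketched here with a hypothetical `split_sentences` helper and a short sample text, splits on the whitespace that follows sentence-ending punctuation:

```python
import re

def split_sentences(text):
    """Split text on whitespace that follows '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    # Drop empty strings (for example, when the input is empty).
    return [part for part in parts if part]

sample = "Greenhouse gases absorb heat. Larger amounts trap more heat! Is that warming?"
for sentence in split_sentences(sample):
    print(repr(sentence))
```

This is still a simple heuristic: abbreviations such as "e.g." would be treated as sentence endings.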

And remember that at this stage, we can use OpenAI to retrieve the embedding for each chunk, and store the text of each chunk along with its embedding in the vector database.
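The sketch below illustrates the idea of storing each chunk's text next to its embedding and retrieving the closest chunk for a query.  To stay runnable without an API key, it uses a toy letter-count "embedding" (`toy_embed`, a hypothetical stand-in); in a real pipeline an embedding model would supply the vectors.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: counts of the letters a-z (NOT a real model)."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [counts.get(chr(ord("a") + i), 0) for i in range(26)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

example_chunks = [
    "Greenhouse gases absorb heat",
    "Fossil fuel use adds carbon dioxide",
]
# Keep each chunk's text alongside its embedding, as a vector database would.
store = [{"text": chunk, "embedding": toy_embed(chunk)} for chunk in example_chunks]

def retrieve(query, records):
    """Return the record whose embedding is most similar to the query's."""
    query_embedding = toy_embed(query)
    return max(records, key=lambda r: cosine_similarity(query_embedding, r["embedding"]))

best = retrieve("greenhouse gases absorb heat", store)
print(best["text"])
```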

### Moving to LlamaIndex

So now that we have learned about chunking a document in general, let's move to LlamaIndex.  LlamaIndex is a library for building LLM pipelines.  

> An alternative to LlamaIndex is LangChain. 

One of the components that LlamaIndex helps with is data ingestion.  In the data ingestion stage, we both retrieve the text and chunk the document.

Ok, so let's get started.

> The code below is also located in the separate `codebase` folder, which may work better than moving through this Colab.


> Install the packages in requirements.txt, and then open the index.py file.  Add your OpenAI API key, and run `python3 -i index.py` to interact with the code.

```python
!pip3 install pymupdf
!pip3 install llama-index
```

[Located here](https://www.worldbank.org/en/publication/wdr2024) is the World Development Report on middle-income countries from the World Bank.  If you click on the link, you'll see what we would like to download.



> So let's make a new directory and then download the report.

```python
# !mkdir data
```

> You may need to install wget (`brew install wget` on a Mac).

```python
# !wget --user-agent "Mozilla" "https://thedocs.worldbank.org/en/doc/ba24094c4f9f37714f07345b1505c930-0050062023/original/WDR2024-Concept-Note.pdf" -O "data/dev_report.pdf"
```

And from there, we can use PyMuPDF's fitz module to read the contents of our PDF. 

```python
import fitz  # provided by the pymupdf package

file_path = "./data/dev_report.pdf"
doc = fitz.open(file_path)

# Concatenate the plain text of every page into one string
doc_text = ""
for page in doc:
    page_text = page.get_text("text")
    doc_text += page_text
```

Now that we've collected the text, we can create a Document object with this text (which we do below).  We then use the SentenceSplitter to break it into chunks with a chunk size of 1024 tokens, where sentences are not cut off midstream. 

The `get_nodes_from_documents` method is what chunks our document.

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter

documents = [Document(text=doc_text)]

parser = SentenceSplitter(chunk_size=1024)
nodes = parser.get_nodes_from_documents(documents)
```

Let's take a look at one of those nodes.

```python
node = nodes[0]
node.__dict__.keys()
# ['id_', 'embedding', 'metadata', 
# 'excluded_embed_metadata_keys', 'excluded_llm_metadata_keys',
# 'relationships', 'text', 'start_char_idx', 'end_char_idx',
# 'text_template', 'metadata_template', 'metadata_seperator']
```

You can see that, in addition to storing the text, each node has an `embedding` attribute to hold the corresponding embedding (we populate it in the next step).  There is also a `relationships` attribute -- this includes pointers to the previous node and the next node.  

> Something like this is valuable if, say, we wish to include a summary of the previous node's text along with the current node's text to improve our search.

The `metadata` attribute may include information like the page number or the name of the document.
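To make the node structure more concrete, here is a simplified sketch of a chunk that carries text, metadata, an optional embedding, and pointers to its neighbors.  `SimpleNode` and `link_nodes` are hypothetical names for illustration only; this is not how LlamaIndex implements its nodes.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SimpleNode:
    """Hypothetical, simplified stand-in for a chunk node."""
    text: str
    metadata: dict = field(default_factory=dict)
    embedding: Optional[list] = None  # filled in later by an embedding model
    # Neighbor pointers are excluded from repr/eq to avoid recursive output.
    prev_node: Optional["SimpleNode"] = field(default=None, repr=False, compare=False)
    next_node: Optional["SimpleNode"] = field(default=None, repr=False, compare=False)

def link_nodes(texts, source_name):
    """Create one node per text and connect each node to its neighbors."""
    nodes = [
        SimpleNode(text=text, metadata={"source": source_name, "position": i})
        for i, text in enumerate(texts)
    ]
    for earlier, later in zip(nodes, nodes[1:]):
        earlier.next_node = later
        later.prev_node = earlier
    return nodes

example_nodes = link_nodes(["first chunk", "second chunk", "third chunk"], "dev_report.pdf")
middle = example_nodes[1]
print(middle.prev_node.text, "|", middle.text, "|", middle.next_node.text)
```

Having the previous and next chunks one attribute away is what makes it easy to pull in neighboring context at query time.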

* Adding the embedding 

In the next step, we explicitly add an embedding to each node with the following:

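```python
from llama_index.embeddings.openai import OpenAIEmbedding

# api_key holds your OpenAI API key
embed_model = OpenAIEmbedding(api_key=api_key)

for node in nodes:
    node_embedding = embed_model.get_text_embedding(
        node.get_content()
    )
    node.embedding = node_embedding
```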

In the loop above, `node.get_content()` returns the node's text.  We then pass that text into the `embed_model.get_text_embedding()` function to get the corresponding embedding, and store this embedding on the node's `embedding` attribute.

### Finishing up

Ok, so far we have downloaded a document with `wget`, read the text of that document with PyMuPDF's fitz module, and chunked that text into separate nodes.

Let's store those nodes in a vector store, and then ask a question of our data.

Below, we store the nodes in a VectorStoreIndex (an in-memory vector database), and then specify how we would like the query engine to build its response.

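```python
import os
from llama_index.core import VectorStoreIndex

os.environ['OPENAI_API_KEY'] = api_key  # api_key holds your OpenAI API key
index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(response_mode="tree_summarize")
```

> We need to set `OPENAI_API_KEY` in the environment because, by default, the VectorStoreIndex uses the OpenAI API to embed any nodes that do not already have an embedding, and the query engine uses it to embed our question and generate the response.  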

Finally we can ask questions of our vector store with the following.

```python
response = query_engine.query("What is creative destruction according to the document?")
```

From the response, we can get the response text.

```python
response_text = response.response
```

We can also get the nodes that the query engine used to generate that response.  So if you look at the first source node, you'll see the relevant text that was retrieved from our original document.

```python
response.source_nodes[0]
```
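If you want to scan all of the retrieved sources at once, a small helper like the one below can list each source's similarity score with a short text preview.  This is a sketch that assumes each item in `response.source_nodes` exposes a `score` attribute and a `node` with a `get_content()` method; `summarize_sources` is a hypothetical helper, and stand-in objects are used so the example runs without an API key.

```python
from types import SimpleNamespace

def summarize_sources(source_nodes, preview_chars=80):
    """Return (score, text preview) pairs for the retrieved source nodes.

    Assumes each item has a `score` attribute and a `node` with `get_content()`.
    """
    return [
        (item.score, item.node.get_content()[:preview_chars])
        for item in source_nodes
    ]

# Stand-in objects; with the real pipeline, pass response.source_nodes instead.
fake_sources = [
    SimpleNamespace(score=0.83, node=SimpleNamespace(get_content=lambda: "example chunk text")),
    SimpleNamespace(score=0.79, node=SimpleNamespace(get_content=lambda: "another example chunk")),
]
for score, preview in summarize_sources(fake_sources):
    print(score, preview)
```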

### Summary

Ok, that was a lot, but all of our code can be encapsulated in a few lines.

```python
import fitz
from llama_index.core import Document, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
import os
from llama_index.embeddings.openai import OpenAIEmbedding

file_path = "./data/dev_report.pdf"
doc = fitz.open(file_path)

doc_text = ""
for page in doc:
    page_text = page.get_text("text")
    doc_text += page_text


documents = [Document(text=doc_text)]
parser = SentenceSplitter(chunk_size=1024)
nodes = parser.get_nodes_from_documents(documents)


# api_key = ""  # your OpenAI API key
# embed_model = OpenAIEmbedding(api_key=api_key)

# for node in nodes:
#     node_embedding = embed_model.get_text_embedding(
#         node.get_content()
#     )
#     node.embedding = node_embedding

# Note: Above, we skip the explicit embedding step, as the VectorStoreIndex
# automatically embeds the nodes when we pass them into the index
api_key = ""  # your OpenAI API key
os.environ['OPENAI_API_KEY'] = api_key
index = VectorStoreIndex(nodes)

query_engine = index.as_query_engine(response_mode="tree_summarize")
response = query_engine.query("What is creative destruction according to the document?")

response_text = response.response
```

### Resources

[Chunking Strategies](https://www.pinecone.io/learn/chunking-strategies/)

[Pinecone Ingestion](https://docs.llamaindex.ai/en/stable/examples/low_level/ingestion.html)

[Documents and Nodes](https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/root.html)

[Karpathy - Tokenization Video](https://www.youtube.com/watch?v=zduSFxRajkE&list=PLAqhIrjkxbuWI23v9cThsA9GvCAUhRvKZ&index=9)