# 2. Ingest and retrieve documents

In this notebook, we will use the [Pro Git book](https://git-scm.com/book/en/v2) to feed relevant information to our LLM. We will parse the text of the book, retrieve text relevant to the questions being asked, and inject the relevant text into the prompt sent to the LLM.

## Ingest documents from the Pro Git book

We will use a PDF of the Pro Git book from https://github.com/progit/progit2/releases/download/2.1.426/progit.pdf as our source of information. Let's start by extracting the text from the PDF.

In [1]:
import json
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

In [2]:
pdf_path = "https://github.com/progit/progit2/releases/download/2.1.426/progit.pdf"
loader = PyPDFLoader(pdf_path)
pages = loader.load_and_split()

In [3]:
pages[0]

Document(page_content='Pro Git\nScott Chacon, Ben Straub\nVersion 2.1.426, 2024-04-05', metadata={'source': 'https://github.com/progit/progit2/releases/download/2.1.426/progit.pdf', 'page': 1})

### Page chunking

Now we have the entire book chunked into pages. However, pages are probably not the most useful way to chunk the text, for two reasons:

1. Injecting entire pages into our prompts may exceed the maximum input size of the LLM and/or limit the number of relevant results we can include.
2. Pages are not meaningful boundaries. A page may end in the middle of a sentence, for example.

Before re-chunking, we collect the text of the body pages, skipping the introduction and the appendix.

In [4]:
docs = []
# Skip intro and appendix
for page in range(17, 460):
    text = pages[page].page_content
    docs.append(text)

In [5]:
print(docs[0])

Getting Started
This chapter will be about getting started with Git. We will begin by explaining some background
on version control tools, then move on to how to get Git running on your system and finally how to
get it set up to start working with. At the end of this chapter you should understand why Git is
around, why you should use it and you should be all set up to do so.
About Version Control
What is “version control”, and why should you care? Version control is a system that records
changes to a file or set of files over time so that you can recall specific versions later. For the
examples in this book, you will use software source code as the files being version controlled,
though in reality you can do this with nearly any type of file on a computer.
If you are a graphic or web designer and want to keep every version of an image or layout (which
you would most certainly want to), a Version Control System (VCS) is a very wise thing to use. It
allows you to revert selected files ba

In [6]:
print("\n\n".join(docs[10:15]))

Installing from Source
Some people may instead find it useful to install Git from source, because you’ll get the most recent
version. The binary installers tend to be a bit behind, though as Git has matured in recent years,
this has made less of a difference.
If you do want to install Git from source, you need to have the following libraries that Git depends
on: autotools, curl, zlib, openssl, expat, and libiconv. For example, if you’re on a system that has dnf
(such as Fedora) or apt-get (such as a Debian-based system), you can use one of these commands to
install the minimal dependencies for compiling and installing the Git binaries:
$ sudo dnf install dh-autoreconf curl-devel expat-devel gettext-devel \
  openssl-devel perl-devel zlib-devel
$ sudo apt-get install dh-autoreconf libcurl4-gnutls-dev libexpat1-dev \
  gettext libz-dev libssl-dev
In order to be able to add the documentation in various formats (doc, html, info), these additional
dependencies are required:
$ sudo dnf insta

### Alternative chunking strategies

Instead of chunking by pages, we will chunk by character count. We limit each chunk to 1500 characters and split recursively on periods and newline characters, trying the separators `[".\n", ".", "\n"]` in that order.
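
To make the recursive strategy concrete, here is a simplified pure-Python sketch of the idea. It is not LangChain's implementation: the function `recursive_split` and the sample text are hypothetical. It also keeps each separator at the end of the preceding piece, which differs from the notebook output, where chunks can start with a leading `.`.

```python
def recursive_split(text, separators, chunk_size):
    """Simplified illustration: split on the first separator, merge pieces
    up to chunk_size, and recurse with the remaining separators on any
    piece that is still too long."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator to every piece except the last so no text is lost.
    pieces = [p + sep for p in pieces[:-1]] + [pieces[-1]]

    result, current = [], ""
    for piece in pieces:
        if len(piece) > chunk_size:
            # Flush what we have, then split the oversized piece more finely.
            if current:
                result.append(current)
                current = ""
            result.extend(recursive_split(piece, rest, chunk_size))
        elif len(current) + len(piece) <= chunk_size:
            current += piece
        else:
            result.append(current)
            current = piece
    if current:
        result.append(current)
    return result


sample = "First sentence.\nSecond sentence is a bit longer.\nThird."
demo_chunks = recursive_split(sample, [".\n", ".", "\n"], chunk_size=35)
for chunk in demo_chunks:
    print(repr(chunk))
```

Because earlier separators are tried first, the text is cut at sentence-ending line breaks whenever possible, and only falls back to finer separators for pieces that still exceed the size limit.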

In [7]:
text_splitter = RecursiveCharacterTextSplitter(separators=[".\n", ".", "\n"], chunk_size=1500, chunk_overlap=0)
chunks = text_splitter.split_text("\n".join(docs))

In [8]:
print("\n\n".join(chunks[10:15]))

.
•Staged means that you have marked a modified file in its current version to go into your next
commit snapshot.
•Committed means that the data is safely stored in your local database.
This leads us to the three main sections of a Git project: the working tree, the staging area, and the
Git directory.
Figure 6. Working tree, staging area, and Git directory
The working tree is a single checkout of one version of the project. These files are pulled out of the
compressed database in the Git directory and placed on disk for you to use or modify.
The staging area is a file, generally contained in your Git directory, that stores information about
what will go into your next commit. Its technical name in Git parlance is the “index”, but the phrase
“staging area” works just as well.
The Git directory is where Git stores the metadata and object database for your project. This is the
most important part of Git, and it is what is copied when you clone  a repository from another
computer.
The bas

### Advanced chunking strategies

Chunks are no longer split in the middle of sentences, although in some places we would probably choose a different split point. Preprocessing steps like this can take a lot of time to fine-tune, and the parameters here are a good place to experiment. For example, you can adjust the parameters of `RecursiveCharacterTextSplitter` and re-chunk the PDF, or look into more advanced techniques like [semantic chunking](https://python.langchain.com/docs/modules/data_connection/document_transformers/semantic-chunker/).
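
When comparing splitter settings, it can help to summarize the resulting chunks rather than reading them one by one. The following is a minimal sketch, assuming a non-empty list of strings; the helper `chunk_stats`, its `short_threshold` parameter, and the sample chunks are hypothetical and not part of the notebook.

```python
from statistics import mean


def chunk_stats(chunks, short_threshold=200):
    """Summarize chunk lengths to help compare splitter settings.

    Assumes `chunks` is a non-empty list of strings.
    """
    lengths = [len(c) for c in chunks]
    return {
        "count": len(chunks),
        "min_len": min(lengths),
        "max_len": max(lengths),
        "mean_len": round(mean(lengths), 1),
        # Very short chunks often carry too little context to be useful.
        "n_short": sum(length < short_threshold for length in lengths),
    }


sample_chunks = [
    "Git stores snapshots, not differences.",
    "Nearly every operation is local",
    "Git has integrity.",
]
stats = chunk_stats(sample_chunks, short_threshold=20)
print(stats)
```

Running the same helper on the real `chunks` list after each parameter change gives a quick way to see whether a setting produces many tiny fragments or chunks close to the size limit.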

For now, let's save our results so we can reuse these chunked documents later.

In [9]:
with open("../data/docs.json", "w") as f:
    json.dump(chunks, f)
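
Reloading the saved chunks in a later notebook could look like the sketch below. It writes to a temporary directory as a stand-in for `../data/docs.json`, and the two sample chunks are made up for illustration.

```python
import json
import tempfile
from pathlib import Path

saved_chunks = ["first chunk", "\u2022 second chunk with a bullet character"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "docs.json"  # stand-in for ../data/docs.json
    with open(path, "w", encoding="utf-8") as f:
        json.dump(saved_chunks, f)
    with open(path, encoding="utf-8") as f:
        reloaded = json.load(f)

# The list of strings survives the round trip unchanged.
print(reloaded == saved_chunks)
```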

## Retrieving relevant documents

We have split the PDF into chunks (each chunk is usually called a "document" in RAG terminology), but that's only useful if we can look up the chunks relevant to the question being asked. This is called information retrieval.

Below, we use a [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) retriever to find relevant documents. BM25 matches keywords between the question and the documents and ranks the documents by a score. The score weights each matching word by how often it appears within the document and how rarely it appears across all documents, and it adjusts for document length. A document ranks highly when it is short, contains a matching word often, and that word is rare across the other documents.
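
The scoring idea can be sketched in plain Python. This is an illustrative version of the Okapi BM25 formula, not the code that `BM25Retriever` runs: the whitespace tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy documents are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercased whitespace tokenizer; real retrievers may do more.
    return text.lower().split()


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue
            # Rare terms get a larger inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            tf = term_freq[term]
            # Longer-than-average documents are penalized.
            length_norm = 1 - b + b * len(tokens) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores


toy_docs = [
    "a branch is a movable pointer to a commit",
    "git stores data as snapshots of the project over time and a commit records a snapshot",
    "the staging area stores what goes into the next commit",
]
scores = bm25_scores("what is a branch", toy_docs)
best = max(range(len(toy_docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The first toy document wins because it contains `branch` and `is`, which appear in no other document, while the common word `a` contributes comparatively little.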

In [10]:
from langchain import hub
from langchain_community.retrievers import BM25Retriever
from langchain_community.llms import HuggingFaceEndpoint

In [11]:
retriever = BM25Retriever.from_texts(chunks)

question = "What is a branch?"

context = retriever.get_relevant_documents(question)

In [12]:
print("\n\n".join([doc.page_content for doc in context]))

.
Figure 10. Commits and their parents
A branch in Git is simply a lightweight movable pointer to one of these commits. The default branch
name in Git is master. As you start making commits, you’re given a master branch that points to the
last commit you made. Every time you commit, the master branch pointer moves forward
automatically.
The “master” branch in Git is not a special branch.  It is exactly like any other
branch. The only reason nearly every repository has one is that the git init 
command creates it by default and most people don’t bother to change it.
64
Figure 11. A branch and its commit history
Creating a New Branch
What happens when you create a new branch? Well, doing so creates a new pointer for you to
move around. Let’s say you want to create a new branch called testing. You do this with the git
branch command: 
$ git branch testing
This creates a new pointer to the same commit you’re currently on.
Figure 12. Two branches pointing into the same series of commits
Ho

### Applying relevant documents

To build a RAG application, we then inject this context into the LLM prompt along with the question.
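
Conceptually, the injection step is plain string formatting. The sketch below uses a hypothetical template and a hypothetical `build_prompt` helper with a made-up character budget; it is not the text of the `rlm/rag-prompt` hub prompt used in the next cells.

```python
# Hypothetical template for illustration; not the text of the hub prompt.
TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context is not sufficient, say that you don't know.\n"
    "Question: {question}\n"
    "Context: {context}\n"
    "Answer:"
)


def build_prompt(question, documents, max_context_chars=2000):
    """Join retrieved documents and cap the context length before formatting."""
    context = "\n\n".join(documents)[:max_context_chars]
    return TEMPLATE.format(question=question, context=context)


demo_prompt = build_prompt(
    "What is a branch?",
    ["A branch in Git is simply a lightweight movable pointer to one of these commits."],
)
print(demo_prompt)
```

Capping the context by characters is a crude stand-in for a token budget, but it shows why chunk size matters: smaller chunks let more distinct results fit into the same prompt.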

In [13]:
prompt = hub.pull("rlm/rag-prompt").messages[0].prompt
llm = HuggingFaceEndpoint(repo_id="HuggingFaceH4/zephyr-7b-beta")

In [14]:
context_str = "\n\n".join([doc.page_content for doc in context])
input = prompt.invoke({"question": question, "context": context_str})
print(input.text)

You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: What is a branch? 
Context: .
Figure 10. Commits and their parents
A branch in Git is simply a lightweight movable pointer to one of these commits. The default branch
name in Git is master. As you start making commits, you’re given a master branch that points to the
last commit you made. Every time you commit, the master branch pointer moves forward
automatically.
The “master” branch in Git is not a special branch.  It is exactly like any other
branch. The only reason nearly every repository has one is that the git init 
command creates it by default and most people don’t bother to change it.
64
Figure 11. A branch and its commit history
Creating a New Branch
What happens when you create a new branch? Well, doing so creates a new pointer for

In [15]:
response = llm.invoke(input.text)
print(response)

 A branch in Git is a lightweight movable pointer to a specific commit in the project's history. It allows you to create a new line of development without affecting the original branch or other branches. Git provides commands to create, switch, and delete branches. The default branch name in Git is master.


What are some other questions we can ask?

### Embeddings and vector store retrievers

BM25 is a good baseline for retrieval and proved useful here, returning relevant documents and providing enough context for the LLM to understand that the question was about Git branches. However, it is limited to matching words and has no concept of the actual meaning of the question or documents. Notice that some of the documents are not that relevant to the question.

Most RAG applications calculate an embedding vector for each document that captures its semantic meaning, and save those vectors in an index. When a question is asked, it is encoded into a vector using the same embedding model. The question vector is then compared with every document vector in the index using cosine similarity or a similar metric. The documents whose vectors are closest to the question vector are returned as the relevant context.
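
The comparison step can be sketched without any embedding model. The vectors below are made up and have only three dimensions, while a real model produces hundreds; `cosine_similarity` and `top_k` are hypothetical helper names used only for this illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k(query_vec, doc_vecs, k=2):
    """Return the indices of the k documents most similar to the query."""
    ranked = sorted(
        range(len(doc_vecs)),
        key=lambda i: cosine_similarity(query_vec, doc_vecs[i]),
        reverse=True,
    )
    return ranked[:k]


# Made-up vectors standing in for embedded chunks.
doc_vectors = [
    [0.9, 0.1, 0.0],  # e.g. a chunk about branches
    [0.1, 0.8, 0.3],  # e.g. a chunk about installation
    [0.7, 0.2, 0.1],  # e.g. a chunk about merging
]
query_vector = [1.0, 0.0, 0.1]
nearest = top_k(query_vector, doc_vectors, k=2)
print(nearest)
```

A vector store performs the same ranking, but typically with an index structure that avoids comparing the question against every stored vector.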

We will return to embeddings later in the workshop.

## Exercise

For this notebook's exercise, build a RAG application using the DVC data we downloaded to `../data/dvc_discord_channel.csv`. Unlike the Git PDF, this is a CSV file with several fields besides the actual text, broken down by message, including both questions and answers. You will need to:

1. Parse the CSV data and chunk it into documents. What are some meaningful ways to chunk the Discord data?
2. Save the chunked data to `../data/docs_dvc.json` so that you can use it in future exercises.
3. Retrieve relevant documents from the chunked data.
4. Build a RAG application that applies the relevant documents and try asking some questions. Do the answers seem reasonable? What could you do to improve it?
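
As a possible starting point for step 1, the sketch below groups consecutive messages into documents using the standard `csv` module. The column names `author`, `timestamp`, and `content`, the sample rows, and the `rows_to_documents` helper are all hypothetical; check the header of the real CSV file and adapt accordingly.

```python
import csv
import io

# Hypothetical sample; the real file's columns may be named differently.
sample_csv = """author,timestamp,content
alice,2023-01-01T10:00,How do I track a large file with DVC?
bob,2023-01-01T10:02,Use dvc add and commit the .dvc file.
carol,2023-01-02T09:30,Can I use S3 as a remote?
"""


def rows_to_documents(csv_text, author_field="author", text_field="content", group_size=2):
    """Group consecutive messages into documents of group_size messages."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    documents = []
    for start in range(0, len(rows), group_size):
        group = rows[start:start + group_size]
        documents.append("\n".join(f"{r[author_field]}: {r[text_field]}" for r in group))
    return documents


dvc_docs = rows_to_documents(sample_csv)
for doc in dvc_docs:
    print(doc, end="\n\n")
```

Fixed-size grouping is only one option. Grouping by time gaps or by question-and-answer pairs may keep related messages together more reliably.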