# Hands-on Session Document Q&A
This hands-on session accompanies the workshop on LangChain fundamentals. It is inspired by the more extensive LangChain Cookbook Part 1.

Copyright (c) 2023 Michael Neumayr

## Setup

### 0. Set up the Colab in your drive

- Load this Colab from GitHub
- Run the first cell to install all required packages (this takes a moment)
- While the installation runs, jump to the section "3. OpenAI API key" and replace "PUT_YOUR_KEY_HERE" with the key we provide you

### 1. Required python packages

In [1]:
# install required packages; this may take a few minutes. Dependency warnings can usually be ignored.
%pip install openai
%pip install langchain
%pip install pypdf
%pip install tiktoken

### 2. Load the workshop github

In [1]:
!git clone https://github.com/michaelnoi/venture_labs_build.git

fatal: destination path 'venture_labs_build' already exists and is not an empty directory.


In [2]:
%cd venture_labs_build
!git checkout only_static_files

/Users/caterinagalata/Desktop/TUM venture labs/venture_labs_build/venture_labs_build
Already on 'only_static_files'
Your branch is up to date with 'origin/only_static_files'.


### 3. OpenAI API key

In [3]:
import os

openai_api_key = os.getenv('OPENAI_API_KEY', 'PUT_YOUR_KEY_HERE')
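
If neither the environment variable nor the placeholder has been replaced, the first API call fails much later with a less obvious error. A minimal sketch of an early check follows; the helper name `resolve_api_key` is hypothetical and not part of the workshop code.

```python
import os

PLACEHOLDER = "PUT_YOUR_KEY_HERE"


def resolve_api_key(env_var: str = "OPENAI_API_KEY", default: str = PLACEHOLDER) -> str:
    """Return the key from the environment, or fail early if only the placeholder is left."""
    key = os.getenv(env_var, default)
    if not key or key == PLACEHOLDER:
        raise ValueError(f"Set {env_var} or replace {PLACEHOLDER!r} with a real key.")
    return key
```

Calling `resolve_api_key()` right after this cell would surface a missing key immediately instead of during the first model call.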

### 4. Optional: Connect to your Google Drive storage to upload your own documents later

In [None]:
# connect to your google drive storage to use your own documents
from google.colab import drive

drive.mount('/content/drive')

## Project: Document Q&A

One of the most powerful applications enabled by LLMs is sophisticated question-answering (Q&A) chatbots. These are applications that can answer questions about specific source information. These applications use a technique known as Retrieval Augmented Generation, or RAG.

<img src="static/rag.jpeg"/>

#### What is RAG?

RAG is a technique for augmenting LLM knowledge with additional data.

LLMs can reason about wide-ranging topics, but their knowledge is limited to the public data they were trained on, up to a specific cutoff date. If you want to build AI applications that can reason about private data or data introduced after a model’s cutoff date, you need to augment the knowledge of the model with the specific information it needs. The process of retrieving the appropriate information and inserting it into the model prompt is known as Retrieval Augmented Generation (RAG).

RAG is made up of two components: indexing and retrieval+generation.

In [4]:
from langchain import OpenAI

In [5]:
llm = OpenAI(model_name="gpt-3.5-turbo-instruct", openai_api_key=openai_api_key, temperature=0)



## 1. Indexing

The first component of RAG requires ingesting data from a source and indexing it.

### 1.1. Load a document (PDF)

In [6]:
from langchain.document_loaders import PyPDFLoader

# load the natural language processing article (PDF) again
pdf_path = "static/natural_language_processing.pdf"
loader = PyPDFLoader(pdf_path)
documents = loader.load()

We load the Wikipedia article on natural language processing again as a PDF document. Counting tokens over all pages except the last four gives roughly 6.6k tokens.

In [7]:
overall_tokens = 0
for page in documents[:-4]:
    n_tokens = llm.get_num_tokens(page.page_content)
    overall_tokens += n_tokens
    
print(f"Overall number of tokens: {overall_tokens}")

Overall number of tokens: 6650


### 1.2. Split the document into chunks 

This is useful both for indexing data and for passing it in to a model, since large chunks are harder to search over and won’t fit in a model’s finite context window.

In [8]:
# import the predefined text splitter
from langchain.text_splitter import RecursiveCharacterTextSplitter

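We load a text splitter to split up our document into chunks. This time the chunks must contain fewer tokens, since we will later feed more than one of them into the same context window of our LLM Q&A chatbot. Note that `chunk_size` and `chunk_overlap` are measured in characters by default, not tokens, which is why the token counts printed below are well under 1500.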

In [9]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)

# put relevant pages into one string
article = ""
for page in documents[:-4]:
    article += page.page_content + "\n\n"

# split into chunks with the defined text splitter
chunks = text_splitter.create_documents([article])

print(f"Number of chunks: {len(chunks)}")
print("Number of tokens in each chunk:")
for chunk in chunks:
    print(llm.get_num_tokens(chunk.page_content))

Number of chunks: 26
Number of tokens in each chunk:
295
317
378
301
299
331
76
312
287
98
281
335
251
281
312
274
294
300
262
120
323
320
302
318
316
97
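
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a simplified sketch of fixed-window splitting with overlap. It is not how `RecursiveCharacterTextSplitter` works internally: the real splitter prefers to cut at separators such as paragraph breaks, which is why the chunks above vary in length. The function name `split_with_overlap` is made up for this illustration.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = split_with_overlap(sample, chunk_size=10, chunk_overlap=3)
print(pieces)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one neighbor when the overlap is large enough.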


### 1.3. Embeddings and Vectorstores

We need somewhere to store and index our splits, so that they can later be searched over. This is often done using a VectorStore and Embeddings model.

Embeddings create a vector (numerical) representation of a piece of text. This is useful because it means we can think about text in the vector space, and do things like semantic search where we look for pieces of text that are most similar in the vector space. The numerical representation of the text chunks can be used to compare documents mathematically: similar documents lie closer together in the vector space than dissimilar ones.
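
As a rough illustration of "closer in the vector space", the sketch below computes cosine similarity, one common similarity measure, on made-up three-dimensional vectors. Real embedding vectors have far more dimensions, and the measure a particular vector store uses may differ.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Invented toy vectors, not the output of any embedding model
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```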

In [10]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings(openai_api_key=openai_api_key)



A vector store takes care of storing embedded data and performing vector search for you. At query time we embed the user query (Question) and retrieve the embedding vectors (of the text chunks) that are 'most similar' to the embedded query.
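
The store-then-search idea can be sketched in a few lines. Everything here is a stand-in: `toy_embed` replaces a real embedding model with plain word counts, and `ToyVectorStore` is a hypothetical class, not the Chroma or LangChain API.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Keeps (text, vector) pairs in a list and ranks them against a query."""

    def __init__(self):
        self._items = []

    def add_texts(self, texts):
        for text in texts:
            self._items.append((text, toy_embed(text)))

    def similarity_search(self, query, k=3):
        query_vector = toy_embed(query)  # the query is embedded the same way as the chunks
        ranked = sorted(self._items, key=lambda item: cosine(query_vector, item[1]), reverse=True)
        return [text for text, _ in ranked[:k]]


store = ToyVectorStore()
store.add_texts([
    "NLP gives computers the ability to process language",
    "Chroma stores embedding vectors",
    "Pizza dough needs flour and water",
])
hits = store.similarity_search("how do computers process language", k=1)
print(hits)
```

A real vector store does the same two things, embedding at insert time and ranking at query time, but with learned embeddings and index structures that avoid comparing the query against every stored vector.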

The vectorstore we use for this exercise is Chroma because it runs inside the notebook process and needs no separate server, which makes it very easy to use. LangChain offers integrations with over 30 vectorstores, some of which are more suited for storing large amounts of data.

In [11]:
from langchain.vectorstores import Chroma

In [12]:
# directory to store the vectorstore so that we can use it later on
persist_directory = 'docs/chroma/'

In [13]:
!rm -rf ./docs/chroma  # remove old database files if any

In [14]:
# create the vectorstore
vectordb = Chroma.from_documents(
    documents=chunks,
    embedding=embedding,
    persist_directory=persist_directory
)

In [15]:
# should be the same as the number of text chunks from before 
print(vectordb._collection.count())

26


## 2. Retrieval and Generation

The actual RAG chain is composed of retrieval and generation: the user queries the document at run time, the relevant data is retrieved from the index and passed to the model.

### 2.1. Retrieval

Given a user input, relevant splits are retrieved from storage using a Retriever. 

Similarity search simply retrieves the k text chunks whose embeddings are most similar to the embedding of our query.

In [16]:
question = "what does natural language processing study?"
docs = vectordb.similarity_search(question,k=3)
print(docs)

[Document(page_content='Illustration of the field by a brain and\na microchip interacting via language,\nknowledge representation, signal\nprocessing, programming etc.Natural language processing\nNatural language processing (NLP) is an interdisciplinary subfield of computer\nscience and linguistics. It is primarily concerned with giving computers the ability to\nsuppor t and manipulate speech. It involves processing natural langua ge datasets,\nsuch as text corpora or speech corpora, using either rule-based or probabilistic (i.e.\nstatistical and, most recently, neural network-based) machine learning approaches.\nThe goal is a computer capable of "unde rstanding" the contents of documents,\nincluding the contextual nuances of the langua ge within them. The technology can\nthen accurately extract information and insights contained in the documents as well\nas categorize and or ganize the documents themselves.\nChallenges in natural langua ge processing frequently involve speech recognit

### 2.2. Document Q&A 

We now take the stored document text chunks and the question about the document and pass them both to an LLM. The LLM produces an answer using a prompt that includes the question and the retrieved data.

By default we pass all the retrieved text chunks in the same call to the LLM. If our chunks of text are too large, we can reach the token limit. Here too we can use map reduce to overcome this issue.

The RetrievalQA chain performs question answering backed by a retrieval step.
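
Passing all retrieved chunks in one call is what `RetrievalQA.from_chain_type` does with its default `chain_type="stuff"`; `chain_type="map_reduce"` is the alternative mentioned above. The sketch below shows the idea behind stuffing and why the token limit matters. The prompt wording and the one-token-per-word estimate are assumptions for illustration, not LangChain's actual template or tokenizer.

```python
def count_tokens_roughly(text: str) -> int:
    """Crude assumption: about one token per whitespace-separated word."""
    return len(text.split())


def build_stuffed_prompt(question: str, chunks: list[str], max_tokens: int) -> str:
    """Put every retrieved chunk into a single prompt and check it against a token budget."""
    context = "\n\n".join(chunks)
    prompt = (
        "Use the following context to answer the question.\n\n"
        f"{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    n_tokens = count_tokens_roughly(prompt)
    if n_tokens > max_tokens:
        # This is the situation where a map-reduce style chain becomes necessary.
        raise ValueError(f"Prompt has about {n_tokens} tokens, budget is {max_tokens}")
    return prompt


retrieved = [
    "NLP is a subfield of computer science.",
    "It deals with human language.",
]
prompt = build_stuffed_prompt("What is NLP?", retrieved, max_tokens=100)
print(prompt)
```

With three chunks of roughly 300 tokens each, as retrieved earlier, a stuffed prompt stays far below the model's context window, so the default is sufficient for this exercise.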

In [17]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever()
)

In [18]:
result = qa_chain({"query": question})
result["result"]



' Natural language processing (NLP) studies the computational methods and techniques used to enable computers to understand, manipulate, and generate human language. This includes tasks such as speech recognition, natural language understanding, and natural language generation. '

## More resources

- Documentation: https://python.langchain.com/docs/get_started/introduction
- Really comprehensive tutorials: https://github.com/gkamradt/langchain-tutorials