# Chat-based document Q&A with LangChain and the OpenAI API

This notebook will guide you through the process of setting up the environment, importing documents, and interacting with LangChain for document-based Q&A. We'll cover topics such as document preprocessing, question formulation, and analyzing the model's responses. This technique is called [Retrieval Augmented Generation (RAG)](https://www.perplexity.ai/search/What-is-Retrieval-s56y5BPbQGyxsvOU._R6ng) - [original paper](https://arxiv.org/abs/2005.11401).

Whether you're a researcher, student, or professional, this demo notebook will showcase how LangChain can revolutionize your document exploration and information retrieval workflows.

<a target="_blank" href="https://colab.research.google.com/github/aiaicaramba/aiai-for-society/blob/aiai-dev/20230926_AI4Soc_Chat_with_Your_Data.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

In [None]:
# Clean up Colab's sample data and set up the data dir
!rm -rf /content/sample_data
!mkdir -p /content/data

# Prerequisites

This code installs several Python packages that are required for the project. Explanations happily generated for you by [Chepetto](https://openai.com/blog/chatgpt).

- [`langchain`](<https://python.langchain.com/>) is a framework for building applications on top of large language models. It supplies the document loaders, text splitters, vector store wrappers, memory, and chains used in this notebook.
- [`openai`](<https://openai.com/>) is a package for accessing the OpenAI API, which provides access to various language models and AI tools.
- [`pypdf`](<https://pypi.org/project/pypdf/>) is a package for working with PDF files in Python.
- [`tiktoken`](<https://github.com/openai/tiktoken>) is OpenAI's tokenizer library. It encodes text into tokens locally, so token counts can be computed without calling the API.
- [`faiss-cpu`](<https://github.com/facebookresearch/faiss>) is the CPU build of the FAISS library, which performs efficient similarity searches on large sets of vectors.


In [1]:
%pip install \
langchain \
openai \
pypdf \
faiss-cpu \
pandas \
datasets \
tiktoken 

Collecting datasets
  Downloading datasets-2.14.6-py3-none-any.whl.metadata (19 kB)
Collecting pyarrow>=8.0.0 (from datasets)
  Downloading pyarrow-13.0.0-cp311-cp311-macosx_10_14_x86_64.whl.metadata (3.0 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl.metadata (9.9 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.4.1-cp311-cp311-macosx_10_9_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py311-none-any.whl.metadata (7.2 kB)
Collecting fsspec<=2023.10.0,>=2023.1.0 (from fsspec[http]<=2023.10.0,>=2023.1.0->datasets)
  Downloading fsspec-2023.10.0-py3-none-any.whl.metadata (6.8 kB)
Collecting huggingface-hub<1.0.0,>=0.14.0 (from datasets)
  Downloading huggingface_hub-0.18.0-py3-none-any.whl.metadata (13 kB)
Collecting filelock (from huggingface-hub<1.0.0,>=0.14.0->datasets)
  Downloading filelock-3.13.1-py3-none-any.whl.metadata (2.8 kB)
Downloading datasets-2.14.6-py3-none-any

# OpenAI API Key

To use the OpenAI API, you need to obtain an API key from the [OpenAI website](https://platform.openai.com/account/api-keys). The API key is a unique identifier that allows you to access the OpenAI API and make requests to it. By setting the 'OPENAI_API_KEY' environment variable, you can securely provide your API key to the code without hardcoding it into the script.

In [2]:
import os
import getpass
os.environ['OPENAI_API_KEY'] = getpass.getpass("OPENAI_API_KEY")
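
The cell above prompts for the key on every run. A minimal sketch, assuming you would rather reuse a key that is already present in the environment and only prompt when it is missing. The helper name `ensure_api_key` and its parameters are hypothetical and not part of any library:

```python
import getpass
import os


def ensure_api_key(env=os.environ, prompt=getpass.getpass, name="OPENAI_API_KEY"):
    """Return the key stored under `name`, asking for it only when it is missing."""
    if not env.get(name):
        # getpass hides the typed key, so it does not show up in the notebook output.
        env[name] = prompt(f"{name}: ")
    return env[name]
```

Calling `ensure_api_key()` once at the top of the notebook would then leave an existing key untouched. The `env` and `prompt` parameters exist only so the helper can be tried without real input.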

# Embeddings setup

This code initializes an instance of the [OpenAIEmbeddings](https://python.langchain.com/en/latest/reference/modules/embeddings.html?highlight=embeddings#langchain.embeddings.OpenAIEmbeddings) class and assigns it to the variable `embeddings`. An [embedding](https://platform.openai.com/docs/guides/embeddings) represents a piece of text as a numeric vector, such that texts with similar meaning end up close together. The `OpenAIEmbeddings` class does not contain a model itself: it sends text to the OpenAI API and returns the vectors produced by a pre-trained embedding model.

Once you have initialized an instance of the `OpenAIEmbeddings` class, you can use it to obtain the embedding vector for any given chunk of text. Such vectors are useful for a variety of [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (NLP) tasks, such as search, clustering, and text classification. In this notebook we use them for [semantic search](https://en.wikipedia.org/wiki/Semantic_search) with a [vector database](https://www.youtube.com/watch?v=klTvEwg3oJ4&ab_channel=Fireship).
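
To make the idea of comparing texts through their vectors concrete, here is a minimal sketch of cosine similarity, a common way to score how close two embeddings are. The three-dimensional vectors and their labels are invented for illustration. Real vectors come from the embedding model and are much longer.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" (assumption: values chosen by hand).
toy_vectors = {
    "for loop": [0.9, 0.1, 0.0],
    "while loop": [0.8, 0.2, 0.1],
    "return type": [0.1, 0.9, 0.2],
}

query = toy_vectors["for loop"]
for text, vector in toy_vectors.items():
    print(f"{text}: {cosine_similarity(query, vector):.3f}")
```

In this toy data, "while loop" scores much closer to "for loop" than "return type" does, which is the behavior semantic search relies on.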

## Model

| Name | Tokenizer | Max input tokens | Output dimensions |
| :--- | :--- | ---: | ---: |
| text-embedding-ada-002 | cl100k_base | 8191 | 1536 |




In [3]:
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()

# Splitter setup

The [RecursiveCharacterTextSplitter](https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/recursive_text_splitter.html) is a text splitting tool that takes in a large text document as input and splits it into smaller chunks for downstream processing. Here's what each parameter in the splitter setup means:

- `chunk_size`: This parameter specifies the maximum size of each chunk of text that the splitter will output. In this case, the splitter is set up to output chunks of at most 1000 characters. Chunks can be shorter, because the splitter prefers to break at paragraph, line, and word boundaries.

- `chunk_overlap`: This parameter specifies the number of characters of overlap that each chunk will have with the next chunk. In this case, the overlap is set to 0, so adjacent chunks share no text.

- `length_function`: This parameter specifies the function that the splitter will use to measure the length of a chunk. In this case, the `len` function is used, which returns the number of characters in the text.

Together, these parameters determine how the input text will be split into smaller chunks. The splitter will output chunks of up to 1000 characters each, without overlap, until the entire input text has been processed. Smaller chunks are cheaper to embed and make retrieval more precise, while a nonzero overlap reduces the risk of losing contextual information at the boundaries between chunks, at the cost of some duplicated text.
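
The interplay of `chunk_size` and `chunk_overlap` can be sketched with a plain sliding window. This is a simplified illustration, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter also tries to break on separators such as paragraphs and spaces. The function name `split_with_overlap` is made up for this example.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut `text` into windows of at most `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. Word and
    paragraph boundaries are ignored in this simplified version.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=0))
print(split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2))
```

With an overlap of 0 the chunks tile the text exactly. With an overlap of 2 every chunk repeats the last two characters of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk more often.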

In [4]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,
    length_function=len,
)

# Load (and split) documents

This code snippet loads PDF files from a directory named "data/" using a [PyPDFDirectoryLoader](https://python.langchain.com/en/latest/modules/indexes/document_loaders/examples/pdf.html?highlight=PyPDFDirectoryLoader) class from the `langchain.document_loaders` module. The `loader` variable is an instance of `PyPDFDirectoryLoader`, which takes the directory path as an argument.

After instantiating the loader, the code calls the `load_and_split` method to load the PDF files from the directory and split their text using the text splitter we created before.

## Upload your PDFs
Create a folder called `data` and add any number of PDFs you'd like to chat with.

> Note: The PDFs will be deleted once you close the notebook.

In [5]:
from langchain.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader("data/")
docs = loader.load_and_split(text_splitter=text_splitter)
len(docs)


210

In [6]:
docs[111].page_content # arbitrary chunk

'17.3 Ik een beetje meer dan jij…\nJe verdeelt een geldbedrag over vier personen (A, B, C en D). Elke persoon krijgt een ander\nbedrag: A krijgt 20% meer dan het gemiddelde van de vier personen. Voor de duidelijkheid:\nuiteindelijk krijgen de vier personen samen het hele bedrag. A krijgt daarvan een kwart +\n20% van zo’n kwart. Daarna verlaat persoon A de kamer.\nEr blijven drie personen over, met wie een soortgelijke verdeling plaatsvindt: B krijgt 20%\nmeer dan het gemiddelde van de 3 overgebleven personen en verlaat de kamer. C krijgt 20%\nmeer dan het gemiddelde van de 2 overgebleven personen. D ontvangt de overgebleven 168\neuro.\nHet oorspronkelijke, te verdelen bedrag is … euro.\nSchrijf een computerprogramma dat dit getal berekent.\nHieronder een deel van de oplossing, maar probeer het eerst zelf! (Afkomstig van beterreken.nl,\n12aug2020; een som van Henk van Huffelen)\n17.4 Ik een beetje meer dan jij… aanzet tot een oplossing'

In [7]:
# Do a little cleanup.
# Generated by GPT4 advanced data analysis: https://chat.openai.com/share/237b1e56-349e-4a5c-a007-d862c724b72f
# Note: the result is stored in cleaned_contents; docs itself is left unchanged.
import re

def clean_text(text):
    # Step 1: Collapse runs of spaces into a single space
    text = re.sub(r' +', ' ', text)

    # Step 2: Normalize punctuation by removing unnecessary spaces before or after
    text = re.sub(r' ,', ',', text)
    text = re.sub(r' \.', '.', text)
    text = re.sub(r' ;', ';', text)
    text = re.sub(r' !', '!', text)
    text = re.sub(r' \?', '?', text)
    text = re.sub(r'“ ', '“', text)
    text = re.sub(r' ”', '”', text)

    # Step 3: Replace line breaks with spaces
    text = text.replace('\n', ' ')

    return text

def ultimate_clean_text(text):
    # Previous cleaning steps
    text = clean_text(text)

    # Drop curly double quotes, map curly single quotes to ASCII apostrophes,
    # and remove the literal string www.ztcprep.com
    text = text.replace('“', '').replace('”', '').replace('‘', "'").replace('’', "'").replace('www.ztcprep.com', '')

    # Replacing non-breaking spaces with standard spaces
    text = text.replace('\xa0', ' ')

    # Correcting contractions like "don' t" to "don't"
    text = re.sub(r"(\w+) '", r"\1'", text)
    text = re.sub(r"' t", "'t", text)

    # Removing website URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)

    return text


cleaned_contents = [ultimate_clean_text(doc.page_content) for doc in docs]
cleaned_contents[111]


"17.3 Ik een beetje meer dan jij… Je verdeelt een geldbedrag over vier personen (A, B, C en D). Elke persoon krijgt een ander bedrag: A krijgt 20% meer dan het gemiddelde van de vier personen. Voor de duidelijkheid: uiteindelijk krijgen de vier personen samen het hele bedrag. A krijgt daarvan een kwart + 20% van zo'n kwart. Daarna verlaat persoon A de kamer. Er blijven drie personen over, met wie een soortgelijke verdeling plaatsvindt: B krijgt 20% meer dan het gemiddelde van de 3 overgebleven personen en verlaat de kamer. C krijgt 20% meer dan het gemiddelde van de 2 overgebleven personen. D ontvangt de overgebleven 168 euro. Het oorspronkelijke, te verdelen bedrag is … euro. Schrijf een computerprogramma dat dit getal berekent. Hieronder een deel van de oplossing, maar probeer het eerst zelf! (Afkomstig van beterreken.nl, 12aug2020; een som van Henk van Huffelen) 17.4 Ik een beetje meer dan jij… aanzet tot een oplossing"

# Vector store setup

This code creates a **Facebook AI Similarity Search** ([FAISS](https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/)) vector store using the [langchain.vectorstores](https://python.langchain.com/en/latest/modules/indexes/vectorstores/examples/faiss.html) module.

FAISS enables fast similarity search among high-dimensional vectors. In this code, it is used to create a FAISS index for a set of documents. The `from_documents()` method is called on the FAISS class, passing in two arguments: `docs` and `embeddings`. The `docs` parameter is a set of documents that have been previously loaded and split, and the `embeddings` parameter is a pre-trained embedding model that maps the documents to high-dimensional vectors.
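
Conceptually, a similarity search compares the query vector with every stored vector and keeps the closest ones. The sketch below does this by brute force using squared Euclidean distance. The function names and the two-dimensional vectors are made up for illustration, and FAISS itself relies on optimized native code (and optionally approximate indexes) rather than a Python loop.

```python
import heapq


def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(query_vector, indexed, k):
    """Return the k (distance, text) pairs closest to query_vector, nearest first."""
    scored = ((squared_l2(query_vector, vector), text) for text, vector in indexed)
    return heapq.nsmallest(k, scored)


# Hypothetical index: (chunk text, hand-picked 2-dimensional vector).
indexed = [
    ("chunk about loops", [0.9, 0.1]),
    ("chunk about methods", [0.2, 0.8]),
    ("chunk about strings", [0.5, 0.5]),
]

for distance, text in brute_force_search([1.0, 0.0], indexed, k=2):
    print(f"{distance:.2f}  {text}")
```

The cost of this approach grows linearly with the number of stored vectors, which is why a dedicated library becomes worthwhile once the document collection is large.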

In [8]:
from langchain.vectorstores import FAISS

# `embeddings` is the OpenAIEmbeddings instance created in the embeddings setup cell.
faiss_index = FAISS.from_documents(docs, embeddings)

# Save the db

This code saves the FAISS index created in the previous code cell to disk under the name `faiss_index`. The `save_local()` method is called on the `faiss_index` object, which is the FAISS vector store created earlier. `save_local()` is a utility method provided by LangChain's `FAISS` class to save the index to the local file system.

After executing this code, a folder named "faiss_index" should exist in the current working directory. It contains the serialized FAISS index (`index.faiss`) and the pickled document store (`index.pkl`), which can be loaded back into memory later using the `FAISS.load_local()` method together with the same embeddings object.

In [9]:
faiss_index.save_local("faiss_index")

# Test the vector store

This code performs a similarity search using the FAISS index created earlier and the query string *"Wat is een loop?"* (Dutch for "What is a loop?").

The `similarity_search()` method is called on the `faiss_index` object with two arguments: the query string and `k=5`, which specifies that the 5 most similar chunks should be returned. The result of the similarity search is stored in the `query_result` variable.

The code then iterates over the chunks in the `query_result` list and prints the metadata and page content of each `chunk`. Specifically, it prints the page number and source of the document, along with its page content.

In [10]:
query_result = faiss_index.similarity_search("Wat is een loop?", k=5)

for chunk in query_result:
    print(f'{chunk.metadata["page"]} {chunk.metadata["source"]} :', chunk.page_content)

77 data/Programming-Basics.pdf : Bronnen: Een aantal van deze opgaven komen van www.beterrekenen.nl (ook als mobile app):
Elke dag een paar minuten aan uitdagingen, soms makkelijk, soms moeilijker, some recht-
toe-recht-aan, soms echt nodig om een plaatje te tekenen of een laptop te hulp te roepen…
hersengym dus… een aanrader!
17.7 Spoilers…
Enkele antwoorden:
•Al wandelend... : 4746 meter
78
116 data/Programming-Basics.pdf : abelen (en namen van lokale variabelen beginnen met een kleineletter).
•DeMain-method is de enige plek waar Console.ReadLine() enConsole.WriteLine()
voorkomt!
•Roep elke methode meerdere keren aan! Bijvoorbeeld een methode die 2 strings als
parameter heeft roep je minstens 1 keer aan met 2 strings van verschillende lengte, maar
ook met 2 strings van gelijke lengte. Controleer het antwoord!
•Dehappy flow testen we tijdens het programmeren meestal wel. Probeer ook uitzon-
deringssituaties te testen! Geef eens een lege string mee aan een methode die een string-
param

# Chat memory

This code imports the [ConversationBufferWindowMemory](https://python.langchain.com/en/latest/modules/memory/types/buffer_window.html) class from the `langchain.memory` module and creates an instance of it called `memory`. This class stores the conversation in a sliding window: only the most recent `k` exchanges (a question plus its answer) are retained and older ones are dropped. `k` is not set here, so the class default applies.

The constructor is called with two arguments: `memory_key` and `return_messages`. The `memory_key` parameter names the variable under which the chat history is handed to the chain, and the `return_messages` parameter controls whether the history is returned as a list of message objects or as a single formatted string.

In this code, `memory_key` is set to "chat_history", which is the input name the `ConversationalRetrievalChain` expects for the history. The `return_messages` parameter is set to `True`, so the history is returned as a list of message objects.
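
The windowing behavior can be sketched in a few lines with a bounded deque. This is an illustration of the idea only, not LangChain's implementation. The class name `WindowMemory` and its methods are hypothetical.

```python
from collections import deque


class WindowMemory:
    """Keep only the last `k` (question, answer) exchanges."""

    def __init__(self, k=5):
        # A deque with maxlen drops the oldest item once it is full.
        self.exchanges = deque(maxlen=k)

    def save(self, question, answer):
        self.exchanges.append((question, answer))

    def load(self):
        return list(self.exchanges)


memory_sketch = WindowMemory(k=2)
for i in range(1, 4):
    memory_sketch.save(f"question {i}", f"answer {i}")

# Only the two most recent exchanges survive.
print(memory_sketch.load())
```

Bounding the history keeps the prompt from growing without limit, at the price of the model forgetting anything said before the window.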

In [11]:
from langchain.memory import ConversationBufferWindowMemory

memory = ConversationBufferWindowMemory(memory_key="chat_history", return_messages=True)

# Chain setup

This code imports several classes and functions from various modules in the langchain package and creates an instance of the [ConversationalRetrievalChain](https://python.langchain.com/en/latest/modules/chains/index_examples/chat_vector_db.html?highlight=ConversationalRetrievalChain) class called `qa`.

The `ConversationalRetrievalChain` class is a high-level class that provides an interface for building a conversational agent that can perform retrieval-based question answering. In this code, the `qa` instance is created with the `from_llm()` method, which takes an LLM, a retriever, and the memory buffer.

### LLM
The `ChatOpenAI` class from the `langchain.chat_models` module wraps OpenAI's chat models. In this code, an instance of `ChatOpenAI` is created for the model "[gpt-3.5-turbo](https://platform.openai.com/docs/models)", with a temperature of 0.7 and answers limited to 500 tokens.

### Vector Store
The `faiss_index.as_retriever()` method returns a retriever instance that wraps the FAISS index created earlier. This retriever fetches the chunks most similar to the question, and the LLM uses those chunks as context for its answer.

### Chat History Memory
The `memory` variable is a memory buffer that was created earlier using the `ConversationBufferWindowMemory` class. This memory buffer is used to store and retrieve past conversations for use in future interactions.

The `verbose` parameter controls whether the chain prints its intermediate prompts and steps while running. It is set to `False` here. Set it to `True` to inspect what is sent to the model.
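
The overall flow of a conversational retrieval chain can be sketched as three steps: condense the follow-up question using the chat history, retrieve matching chunks, and ask the model to answer from those chunks. The sketch below is a simplified illustration under stated assumptions, not LangChain's code: `llm` and `retrieve` are placeholder callables, the prompt wording is invented, and the function name `answer_with_retrieval` is hypothetical.

```python
def answer_with_retrieval(question, chat_history, llm, retrieve):
    """Condense the question, fetch supporting chunks, then ask the model.

    `llm` maps a prompt string to a completion string.
    `retrieve` maps a query string to a list of text chunks.
    `chat_history` is a list of (question, answer) tuples and is updated in place.
    """
    if chat_history:
        history = "\n".join(f"Q: {q}\nA: {a}" for q, a in chat_history)
        standalone = llm(
            "Rewrite the follow-up as a standalone question.\n"
            f"{history}\nFollow-up: {question}"
        )
    else:
        standalone = question  # nothing to condense on the first turn

    context = "\n\n".join(retrieve(standalone))
    answer = llm(
        "Answer using only this context.\n"
        f"{context}\n\nQuestion: {standalone}"
    )
    chat_history.append((question, answer))
    return answer
```

The condensing step matters for follow-ups such as "And how do I change it?", which cannot be matched against the documents until the history has been used to spell out what "it" refers to.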

In [12]:
from langchain.chains import ConversationalRetrievalChain, LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA, RetrievalQAWithSourcesChain

qa = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7, max_tokens=500),
    # The number of chunks to retrieve is set through search_kwargs.
    faiss_index.as_retriever(search_kwargs={"k": 5}),
    memory=memory,
    verbose=False,
)


Example:
queries = ['Which school does Harry Potter attend and why?',
         'How did Harry find out about his powers and what are they?',
         'How did he become friends with the other students?',
         'Describe the character Hermione.']

chat_result = qa({"question": queries[0]})  # Answers the first question (indexing starts at 0)
chat_result['answer']

In [29]:
queries = ['Hoe maak ik een loop die start bij 3 en loopt tot 25',
         'hoe verander ik het returntype van een methode?',
         'Kan ik in C# ook modulo berekenen?',
         'Hoe bepaal ik in C# of een getal deelbaar is door 7?']

print("Some example queries: ")
for nr, query in enumerate(queries):
    print(f"{nr} : {query}")


q = input("Type the number of a query, or type in your own query, then press <enter>")
print(f"\nYour input: [{q}]")
if q.isdigit():
    nr = int(q)
    if nr in range(len(queries)):
        q = queries[nr]
    else:
        print("I don't know what question that would be...")
        q = None  # no valid query was selected
else:
    print("Ah, you typed a nice question.")
    # q already contains the query typed by the user.

# q is a string containing the chosen query, or None if the number was out of range
if q is not None:
    print(f"Q: {q}")
    chat_result = qa({"question": q})
    print(chat_result['answer'])

Some example queries: 
0 : Hoe maak ik een loop die start bij 3 en loopt tot 25
1 : hoe verander ik het returntype van een methode?
2 : Kan ik in C# ook modulo berekenen?
3 : Hoe bepaal ik in C# of een getal deelbaar is door 7?



Your input: [hoe bepaal ik of een getal een priemgetal is in C#?]
Ah, you typed a nice question.
Q: hoe bepaal ik of een getal een priemgetal is in C#?
Om te bepalen of een getal een priemgetal is in C#, kun je een functie schrijven die het getal controleert op delers. Een priemgetal is namelijk een getal dat alleen deelbaar is door 1 en zichzelf.

Hier is een voorbeeld van een functie die controleert of een getal een priemgetal is:

```csharp
public static bool IsPrime(int number)
{
    if (number <= 1)
    {
        return false;
    }

    for (int i = 2; i <= Math.Sqrt(number); i++)
    {
        if (number % i == 0)
        {
            return false;
        }
    }

    return true;
}
```

Je kunt deze functie gebruiken door het gewenste getal als argument aan de functie door te geven. De functie retourneert `true` als het getal een priemgetal is en `false` als het geen priemgetal is.

Hier is een voorbeeld van hoe je de functie kunt gebruiken:

```csharp
int number = 17;
bool 