# Chat With Your Data

In this course, we will replicate the core of [ChatPDF](https://www.chatpdf.com/) and of similar applications in which a language model answers questions within a specific context (a document).

## Requirements

In [None]:
%pip install langchain openai pypdf python-dotenv chromadb tiktoken -q


![](https://www.dropbox.com/scl/fi/cevg8k5kcjav1meihfr8m/embedding_1.png?rlkey=vsp2v4kk7k6r2xzv1dcqjbonh&dl=1)

![](https://www.dropbox.com/scl/fi/z3wlqq3q5sa3njzkhuwrs/embedding_2.png?rlkey=mqp453qxzvpsm5ne6ymrcai3w&dl=1)

![](https://www.dropbox.com/scl/fi/13g4kkcgmzhoj72ift91u/embedding_3.png?rlkey=0l6ieif0gcr92tkz6c3sdsdkf&dl=1)


## Structure

- API keys and Environment Variables
- Document Loading
- Document Splitting
- Vectorstores and Embedding (Storage)
- Retrieval
- Question Answering
- Chat


<!-- ![](figs/0_preview.png) -->
![](https://python.langchain.com/assets/images/data_connection-95ff2033a8faa5f3ba41376c0f6dd32a.jpg)

In this tutorial, we will walk through the essential steps for building a question-answering application on top of a language model. We'll begin with API keys and the use of environment variables to keep them secure and private. Next, we'll cover document loading, handling various file types and data sources. Following that, we'll tackle document splitting for efficient processing. We'll then create embeddings and vector stores, which represent the semantic meaning of text. Afterward, we'll explore retrieval techniques and question answering, culminating in a chat system built on these pieces.



## API Key and Environment Variables

Create an API key on the OpenAI platform, [here](https://platform.openai.com/account/api-keys).

Once you have obtained the API key, it needs to be securely stored.

There are various methods for loading this API key. One approach is to use environment files, which should be kept private and listed in `.gitignore` when working in a shared repository. GitHub, for instance, detects keys uploaded to the platform, which triggers an alert and the disabling of the API key, so a new one must be created. The key can also be entered manually.

### `Dotenv`

For this method, you can create a `.env` file in the working environment and list the variables as follows: `variable = "value_variable"`.

```plaintext
NAME_OF_VARIABLE="sk-xxxxxxxxxxxxxxxx"
```

To use it, `python-dotenv` is employed, which, through the functions `load_dotenv` and `find_dotenv`, loads the variables from the `.env` file.

```python
# `!pip install python-dotenv`
import os

from dotenv import load_dotenv, find_dotenv

# Locate the nearest .env file and load its variables into os.environ
load_dotenv(find_dotenv())

secret_variable = os.environ['NAME_OF_VARIABLE']
```
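
As a rough illustration of what loading a `.env` file amounts to, the sketch below parses `KEY="value"` lines and copies them into `os.environ`. It is a simplified stand-in, not the `python-dotenv` implementation: `parse_env_lines` is a hypothetical helper, and the key shown is a dummy value.

```python
import os

def parse_env_lines(lines):
    """Parse KEY="value" lines into a dict (simplified, hypothetical helper)."""
    values = {}
    for line in lines:
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values

env_text = [
    "# dummy key, for illustration only",
    'NAME_OF_VARIABLE="sk-xxxxxxxxxxxxxxxx"',
]
parsed = parse_env_lines(env_text)

# setdefault leaves any value that is already set in the environment untouched
for key, value in parsed.items():
    os.environ.setdefault(key, value)

print(sorted(parsed))
```

Keeping existing environment values takes precedence here, so a key exported in the shell is not silently replaced by the file.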

### Colab

For Colab, a form with `getpass` can be introduced, which conceals the API key when entered. However, the drawback compared to the previous method is that the API key needs to be pasted each time the file is executed.

In [None]:
# !pip install openai
import getpass, openai, os
api_key = getpass.getpass(prompt="OPENAI - KEY: ")
openai.api_key = api_key
os.environ["OPENAI_API_KEY"] = api_key

OPENAI - KEY: ··········


## Document Loading

When it comes to loading documents, two scenarios must be considered: whether the information will be extracted from the web or if it's within our working environment. In the former case, it's possible (although `langchain` already has these cases implemented) that we'll need the `requests` library to download the file or its content. For the latter case, only the relative or absolute path of the file is sufficient.

All loaders follow this structure: `load()` returns a list of `Document` objects, each with two attributes: `page_content`, which holds the text, and `metadata`.

```python
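# Generic pattern; PyPDFLoader stands in here for any loader class
from langchain.document_loaders import PyPDFLoader as Loader

file_path = "example.pdf"  # placeholder path
loader = Loader(file_path)
docs = loader.load()  # list of Document objects
print(docs[0])
# Document(
#     page_content="text",
#     metadata={"source": file_path, ...}
# )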
```

### PDFs

For PDF files, `PyPDFLoader` can read the document from either a URL or a local path.

In [None]:
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf")
pages = loader.load()

print(pages[7].page_content[:500])
print(pages[7].metadata)

8 Z. Shen et al.
Table 2: All operations supported by the layout elements. The same APIs are
supported across diﬀerent layout element classes including Coordinate types,
TextBlock andLayout .
Operation Name Description
block.pad(top, bottom, right, left) Enlarge the current block according to the input
block.scale(fx, fy)Scale the current block given the ratio
in x and y direction
block.shift(dx, dy)Move the current block with the shift
distances in x and y direction
block1.is in(block2) Whether
{'source': 'https://arxiv.org/pdf/2103.15348.pdf', 'page': 7}


### Web Plain Text

For plain text from a URL, the `WebBaseLoader` can be utilized.

In [None]:
from langchain.document_loaders import WebBaseLoader
loader = WebBaseLoader("https://raw.githubusercontent.com/basecamp/handbook/master/getting-started.md")
loader.load()

[Document(page_content="# Getting Started\n\nGetting started at 37signals involves a lot of little details, a number of big tasks, learning the details of your new job, meeting new coworkers, all while working remotely. Your teammates, your manager, your 37signals buddy, your Ops buddy, and our People team are all here to help as you navigate your first few days and weeks.\n\n## Your First Few Days\n\nBefore you start, the People team will order you a new Apple laptop with the specs you request and any accessories you need like an external keyboard, mouse, or display. Get what you need, while keeping in mind the demands of your work when choosing specs.\n\nA day or two before you start, your manager will email you instructions for your first day. Your manager will be your point of contact for your early projects and activities. You’ll also work with a member of our Ops team who will help you as you set up all the accounts you need to work at 37signals.\n\nOn your first day, you’ll 

### JSON

```python
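from langchain_community.document_loaders import JSONLoader  # requires the `jq` package

loader = JSONLoader(
    file_path="",                     # path to the JSON file
    jq_schema='.messages[].content',  # jq expression selecting the text of each record
    text_content=False)               # allow extracted values that are not plain strings

data = loader.load()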
```

For other documents, the documentation of [Langchain - Document Loaders](https://python.langchain.com/docs/modules/data_connection/document_loaders/) can be referred to.

## Document Splitting

Splitting a document's text before passing it to an LLM (large language model) is useful for several reasons. First, it makes long documents manageable: LLMs have a limited context window and cannot process arbitrarily large texts. Second, smaller chunks make retrieval more precise, since each chunk's embedding captures local context rather than an average over the whole document.

The splitters in `langchain.text_splitter` share the following parameters (`separator` applies to `CharacterTextSplitter`):

- `separator="\n"`: Character used as a separator between parts of the text (e.g., line breaks).
- `chunk_size=100`: Maximum size of each chunk, as measured by `length_function`.
- `chunk_overlap=20`: Amount of overlap between consecutive chunks, in the same units.
- `length_function`: Function used to measure the length of a chunk. With `len`, size is counted in characters; a token counter can be supplied instead.
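
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is not how the LangChain splitters are implemented (they split on separators first and then merge pieces); `chunk_text` is a hypothetical helper used only to show how the two parameters interact.

```python
def chunk_text(text, chunk_size=100, chunk_overlap=20):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks

text = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_text(text, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one, which is what keeps a sentence that straddles a boundary visible in both chunks.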



### Split by character

Splits text based on a user-defined character.

In [None]:
from langchain.document_loaders import WebBaseLoader

markdown = WebBaseLoader("https://raw.githubusercontent.com/basecamp/handbook/master/how-we-work.md")
markdown_doc = markdown.load()
text_markdown = markdown_doc[0].page_content
print(text_markdown[:500])

# How We Work

## Remotely

37signals is a fully distributed company. Our team works from all over the world, across 5 continents. We don't care where employees choose to live and work, just that they're here to do great work on exceptional products, alongside a world-class team. We’ve been remote since we started, and our founders literally [wrote the book](https://basecamp.com/books/remote) on the subject.

You can work from anywhere, but please be sure to inform your People Ops team when you 


In [None]:
from langchain.text_splitter import CharacterTextSplitter

text_splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    is_separator_regex=False,
)

text_splitted = text_splitter.split_text(text_markdown)
print(text_splitted[0])

# How We Work

## Remotely

37signals is a fully distributed company. Our team works from all over the world, across 5 continents. We don't care where employees choose to live and work, just that they're here to do great work on exceptional products, alongside a world-class team. We’ve been remote since we started, and our founders literally [wrote the book](https://basecamp.com/books/remote) on the subject.

You can work from anywhere, but please be sure to inform your People Ops team when you move – especially across state or country borders. It may affect your or the company’s tax situation.

## Cycles

We work in 6-week cycles at 37signals. This fixed cadence serves to give us an internal sense of urgency, to keep projects from ballooning, and to provide us with a regular interval to make decisions about what we’re working on.


### Split for markdown

Splits text based on Markdown-specific characters. Notably, this adds relevant information about where each chunk came from (based on the Markdown headers).

We can use `MarkdownHeaderTextSplitter` to preserve header metadata in our chunks, as shown below.

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

md_header_splits = markdown_splitter.split_text(text_markdown)
md_header_splits[:2]

[Document(page_content="37signals is a fully distributed company. Our team works from all over the world, across 5 continents. We don't care where employees choose to live and work, just that they're here to do great work on exceptional products, alongside a world-class team. We’ve been remote since we started, and our founders literally [wrote the book](https://basecamp.com/books/remote) on the subject.  \nYou can work from anywhere, but please be sure to inform your People Ops team when you move – especially across state or country borders. It may affect your or the company’s tax situation.", metadata={'Header 1': 'How We Work', 'Header 2': 'Remotely'}),
 Document(page_content='We work in 6-week cycles at 37signals. This fixed cadence serves to give us an internal sense of urgency, to keep projects from ballooning, and to provide us with a regular interval to make decisions about what we’re working on.  \nOur cycle structure is particularly important for the product teams, since they

### Split for Code

Splits text based on characters specific to coding languages.

In [None]:
from langchain.text_splitter import (
    Language,
    RecursiveCharacterTextSplitter,
)

PYTHON_CODE = """
import numpy as np
def rand():
    np.random.randint()

# Call the function
rand()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='import numpy as np'),
 Document(page_content='def rand():\n    np.random.randint()'),
 Document(page_content='# Call the function\nrand()')]

In [None]:
RecursiveCharacterTextSplitter.get_separators_for_language(Language.PYTHON)

['\nclass ', '\ndef ', '\n\tdef ', '\n\n', '\n', ' ', '']
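
The separator list above is tried in order: coarse separators first, finer ones only for pieces that are still too large. The sketch below illustrates that fallback idea in plain Python. It is a deliberately simplified assumption-laden version, not the LangChain algorithm: `recursive_split` is a hypothetical helper, it drops the separators, and it does not merge small neighbouring pieces back together.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with finer separators on oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text]
    first, rest = separators[0], separators[1:]
    chunks = []
    for piece in text.split(first):
        if piece:  # ignore empty strings produced by consecutive separators
            chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks

sample = "alpha beta\n\ngamma delta epsilon"
pieces = recursive_split(sample, ["\n\n", "\n", " "], chunk_size=12)
print(pieces)
```

The first paragraph fits within the limit and stays whole, while the second one falls through to the space separator and is broken into words.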

## Embedding and Vectorstores (Storage)

### Embeddings

Embeddings are vector representations of text in a high-dimensional space, learned during training. They capture semantic and contextual meaning, so texts with similar meaning end up close together in that space.

The base Embeddings class in LangChain provides two methods: one for embedding documents and one for embedding a query. The former takes as input multiple texts, while the latter takes a single text. The reason for having these as two separate methods is that some embedding providers have different embedding methods for documents (to be searched over) vs queries (the search query itself).
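
To see why vectors are useful for search, the sketch below compares toy embeddings with cosine similarity. The three-dimensional vectors are made up for illustration (real embeddings have many more dimensions), and `cosine_similarity` is a helper written here, not a LangChain function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for embedded documents
toy_embeddings = {
    "Hi there!": [0.9, 0.1, 0.0],
    "Hello": [0.8, 0.2, 0.1],
    "Invoice overdue": [0.0, 0.2, 0.9],
}

# Treat the first text as the query and score every document against it
query_vector = toy_embeddings["Hi there!"]
scores = {
    text: cosine_similarity(query_vector, vector)
    for text, vector in toy_embeddings.items()
}
for text, score in scores.items():
    print(f"{text!r}: {score:.3f}")
```

The two greetings score close to each other, while the unrelated text scores much lower; a vector store ranks documents against a query in essentially this way.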

**Example:**

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()
embeddings = embedding.embed_documents(
    [
        "Hi there!",
        "Hello"
    ]
)
len(embeddings), len(embeddings[1])

(2, 1536)

This code snippet demonstrates how to use the `OpenAIEmbeddings` class from LangChain to embed multiple documents. It initializes an instance of the class, embeds the provided texts, and returns the number of embeddings along with the dimensionality of one embedding vector.

In [None]:
text_embedding = embeddings[0]
print("length: ", len(text_embedding), "\nvector_sample: " ,text_embedding[:3])

length:  1536 
vector_sample:  [-0.020291369630971504, -0.00707277412171542, -0.022869059830264393]


This code snippet further explores the embeddings obtained in the previous example. It prints the length of the first embedding vector and a sample of its elements.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf")
pages = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)

splits = text_splitter.split_documents(pages)
print(type(splits[0]))

<class 'langchain_core.documents.base.Document'>


Lastly, this code snippet demonstrates how to use the `RecursiveCharacterTextSplitter` class from LangChain to split documents into smaller chunks. It loads a PDF document from a URL, splits it into chunks, and stores the resulting chunks in the `splits` variable.





### Vectorstores

To build our database, we need a list of `Document` objects.

With Chroma, this is done locally. Note that, at this point, the working directory contains no Chroma database folder.

In [None]:
os.listdir()

['.config', 'sample_data']

Next, we'll create the vectorstore considering the document split, embedding method, and the location of the vectorstore.

In [None]:
from langchain.vectorstores import Chroma

persist_directory = './vector_db_chroma/'

!rm -rf ./vector_db_chroma  # remove old database files if any

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

print(vectordb._collection.count())

40


In [None]:
os.listdir()

['.config', 'vector_db_chroma', 'sample_data']

Searches conducted in the Chroma database will yield pieces of information that match the "intention" of the question, which will subsequently serve for response generation.

In [None]:
query_1 = vectordb.similarity_search(
    "What are some of the challenges hindering the widespread adoption and reuse of innovations in document image analysis (DIA), particularly in comparison to disciplines like natural language processing and computer vision?",
    k=3,
)
query_2 = vectordb.similarity_search(
    "How does the LayoutParser library address the challenges mentioned in the summary and contribute to streamlining the usage of deep learning in DIA research and applications?",
    k=3,
)
# print(query_1)

In [None]:
query_1[1].page_content[:100]

'image processing: a search of document image analysis in Github leads to 5M\nrelevant code pieces6; y'

## Retrieval

A vector store retriever utilizes a vector store for document retrieval. It acts as a simplified interface to the vector store class, enabling compatibility with the retriever interface. This retriever leverages search functionalities provided by the vector store, such as similarity search and MMR, to retrieve texts stored within it.

By default, the LangChain retriever object uses semantic similarity, which takes $k$ (4 by default) documents whose embeddings are closest to the query vector:

In [None]:
question = "What are some of the challenges hindering the widespread adoption and reuse of innovations in document image analysis (DIA), particularly in comparison to disciplines like natural language processing and computer vision?"
retriever = vectordb.as_retriever()
docs = retriever.get_relevant_documents(question)
docs

[Document(page_content='2 Z. Shen et al.\n37], layout detection [ 38,22], table detection [ 26], and scene text detection [ 4].\nA generalized learning-based framework dramatically reduces the need for the\nmanual speciﬁcation of complicated rules, which is the status quo with traditional\nmethods. DL has the potential to transform DIA pipelines and beneﬁt a broad\nspectrum of large-scale document digitization projects.\nHowever, there are several practical diﬃculties for taking advantages of re-\ncent advances in DL-based methods: 1) DL models are notoriously convoluted\nfor reuse and extension. Existing models are developed using distinct frame-\nworks like TensorFlow [ 1] or PyTorch [ 24], and the high-level parameters can\nbe obfuscated by implementation details [ 8]. It can be a time-consuming and\nfrustrating experience to debug, reproduce, and adapt existing models for DIA,\nand many researchers who would beneﬁt the most from using these methods lack\nthe technical background to

### Maximum Marginal Relevance Retrieval (MMR)

By default, the vector store retriever uses similarity search. If the underlying vector store supports maximum marginal relevance (`mmr`) search, you can specify that as the search type. Maximum Marginal Relevance takes into account the diversity of the semantic content of the documents to be returned. It works as follows:

1. First, it conducts a semantic search and finds a larger initial pool of documents (20 by default)
2. From those documents, it takes the most relevant one (the one with the smallest distance from the query).
3. Adds this document to the pool of final relevant picks. We call this set $R$
4. For the documents not in $R$, generate a score that rewards relevance and penalizes the proximity to the closest document in $R$.
5. Add the document with the highest score to $R$
6. Repeat 4 and 5 until $R$ has as many documents as we want (4 by default)

For more details, see [Carbonell and Goldstein (1998)](https://www.cs.cmu.edu/~jgc/publication/The_Use_MMR_Diversity_Based_LTMIR_1998.pdf)
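
The selection loop described in the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the vector store's implementation: the 2-D vectors are made up, `mmr` and `cosine` are helpers written here, and the relevance/diversity weight `lambda_mult=0.5` is an assumed value.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query, candidates, k=2, lambda_mult=0.5):
    """Return the indices of k candidates picked by a simple MMR loop."""
    remaining = list(range(len(candidates)))
    selected = []  # this plays the role of the set R
    while remaining and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query, candidates[i])
            # Similarity to the closest document already in R (0 while R is empty)
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        remaining.remove(best_idx)
    return selected

query = [1.0, 0.0]
candidates = [
    [0.9, 0.1],    # 0: highly relevant
    [0.89, 0.11],  # 1: near-duplicate of 0
    [0.6, -0.8],   # 2: less relevant, but different content
]

top_by_similarity = sorted(
    range(len(candidates)), key=lambda i: cosine(query, candidates[i]), reverse=True
)[:2]
picked = mmr(query, candidates, k=2)
print("similarity:", top_by_similarity)
print("mmr:", picked)
```

Plain similarity returns the two near-duplicates, whereas MMR keeps the most relevant document and then prefers the one that adds new content.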

In [None]:
retriever = vectordb.as_retriever(search_type="mmr")
docs = retriever.get_relevant_documents(question)
docs

[Document(page_content='2 Z. Shen et al.\n37], layout detection [ 38,22], table detection [ 26], and scene text detection [ 4].\nA generalized learning-based framework dramatically reduces the need for the\nmanual speciﬁcation of complicated rules, which is the status quo with traditional\nmethods. DL has the potential to transform DIA pipelines and beneﬁt a broad\nspectrum of large-scale document digitization projects.\nHowever, there are several practical diﬃculties for taking advantages of re-\ncent advances in DL-based methods: 1) DL models are notoriously convoluted\nfor reuse and extension. Existing models are developed using distinct frame-\nworks like TensorFlow [ 1] or PyTorch [ 24], and the high-level parameters can\nbe obfuscated by implementation details [ 8]. It can be a time-consuming and\nfrustrating experience to debug, reproduce, and adapt existing models for DIA,\nand many researchers who would beneﬁt the most from using these methods lack\nthe technical background to

### Similarity Score Threshold Retrieval

Sets a similarity score threshold and only returns documents with a score above that threshold.
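
Conceptually, this is a filter applied after scoring. The sketch below uses made-up relevance scores purely for illustration; how scores are computed and normalized depends on the vector store, so the values and the `>=` comparison are assumptions.

```python
# Made-up (chunk, relevance score) pairs, for illustration only
scored_chunks = [
    ("DL models are hard to reuse and extend", 0.82),
    ("Related toolkits for document analysis", 0.61),
    ("Acknowledgements", 0.34),
]
score_threshold = 0.5

# Keep only the chunks whose score reaches the threshold
kept = [text for text, score in scored_chunks if score >= score_threshold]
print(kept)
```

Unlike a fixed top-$k$ search, the number of results varies with the query and can be zero when nothing scores high enough.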

In [None]:
retriever = vectordb.as_retriever(
    search_type="similarity_score_threshold", search_kwargs={"score_threshold": 0.5}
)
docs = retriever.get_relevant_documents(question)
docs[1]

Document(page_content='image processing: a search of document image analysis in Github leads to 5M\nrelevant code pieces6; yet most of them rely on traditional rule-based methods\nor provide limited functionalities. The closest prior research to our work is the\nOCR-D project7, which also tries to build a complete toolkit for DIA. However,\nsimilar to the platform developed by Neudecker et al. [ 21], it is designed for\nanalyzing historical documents, and provides no supports for recent DL models.\nThe DocumentLayoutAnalysis project8focuses on processing born-digital PDF\ndocuments via analyzing the stored PDF data. Repositories like DeepLayout9\nand Detectron2-PubLayNet10are individual deep learning models trained on\nlayout analysis datasets without support for the full DIA pipeline. The Document\nAnalysis and Exploitation (DAE) platform [ 15] and the DeepDIVA project [ 2]\naim to improve the reproducibility of DIA methods (or DL models), yet they\nare not actively maintained. OCR en

<!-- retrieval -->

## Question Answering

Let's review the setup. We have our vector store; we ask a question, and the vector store returns the chunks most relevant to it. Because these are fragments of the document, they are passed to an LLM, through a `chain`, to compose a coherent response. Chat models expose a `temperature` parameter: 0 gives the most deterministic output, and higher values make the output more varied ("creative"). This final step can be done with several chain types:

- `stuff`: Inserts ("stuffs") all retrieved chunks into a single prompt and makes one LLM call. It is the simplest option, but it is limited by the model's context window.
- `map_reduce`: Runs the LLM on each retrieved chunk separately (map), then combines the partial answers into a final answer with another call (reduce).
- `refine`: Processes the chunks sequentially, drafting an answer from the first chunk and updating it with each subsequent one.
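
The difference between the three strategies is mostly in how prompts are assembled and how many LLM calls are made. The sketch below mimics that with a stub in place of a real model; `fake_llm` and the three `*_chain` functions are hypothetical helpers, and the prompt wording is an assumption rather than LangChain's actual prompts.

```python
def fake_llm(prompt):
    """Stand-in for a real LLM call: records the prompt and returns a stub answer."""
    fake_llm.calls.append(prompt)
    return f"<answer {len(fake_llm.calls)}>"

fake_llm.calls = []

def stuff_chain(llm, docs, question):
    # One call with every chunk placed in the same prompt
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")

def map_reduce_chain(llm, docs, question):
    # Map: one call per chunk; reduce: one call to combine the partial answers
    partial = [llm(f"Context:\n{doc}\n\nQuestion: {question}") for doc in docs]
    combined = "\n".join(partial)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}")

def refine_chain(llm, docs, question):
    # Draft from the first chunk, then revise the draft once per remaining chunk
    answer = llm(f"Context:\n{docs[0]}\n\nQuestion: {question}")
    for doc in docs[1:]:
        answer = llm(
            f"Existing answer: {answer}\nNew context:\n{doc}\n\nRefine the answer to: {question}"
        )
    return answer

docs = ["chunk A", "chunk B", "chunk C"]
call_counts = {}
for name, chain in [
    ("stuff", stuff_chain),
    ("map_reduce", map_reduce_chain),
    ("refine", refine_chain),
]:
    fake_llm.calls = []  # reset the call log before each strategy
    chain(fake_llm, docs, "What is discussed?")
    call_counts[name] = len(fake_llm.calls)

print(call_counts)
```

With three chunks, `stuff` makes one call, `refine` makes one per chunk, and `map_reduce` makes one per chunk plus a final combining call, which is the cost trade-off to weigh against context-window limits.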


In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA as RQa

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name=llm_model, temperature=0)
question = "What are some of the challenges hindering the widespread adoption and reuse of innovations in document image analysis (DIA), particularly in comparison to disciplines like natural language processing and computer vision?"




### Stuff

In [None]:
stuff = RQa.from_chain_type(
    llm, retriever=vectordb.as_retriever(),
    chain_type="stuff" # default
)
stuff_result = stuff({"query": question})
stuff_result['result']



'Some of the challenges hindering the widespread adoption and reuse of innovations in document image analysis (DIA) include:\n\n1. Complexity of Deep Learning Models: Deep learning models used in DIA are often convoluted and developed using different frameworks like TensorFlow or PyTorch. This complexity makes it challenging to reuse and extend existing models, as high-level parameters can be obfuscated by implementation details.\n\n2. Lack of Infrastructure for Customized Training: Document images contain diverse patterns across domains, requiring customized training for desirable detection accuracy. Currently, there is no full-fledged infrastructure for easily curating target document image datasets and fine-tuning or re-training models.\n\n3. Sequential Processing Requirements: DIA often requires a sequence of models and processing steps to obtain final outputs. This sequential processing can complicate the adoption of deep learning models, as research teams may need to use multiple

### Map Reduce

In [None]:
m_p = RQa.from_chain_type(
    llm, retriever=vectordb.as_retriever(),
    chain_type="map_reduce"
)
mp_result = m_p({"query": question})
mp_result['result']

'Some of the challenges hindering the widespread adoption and reuse of innovations in document image analysis (DIA), particularly in comparison to disciplines like natural language processing and computer vision, include factors like the convoluted nature of deep learning models for reuse and extension, the lack of a full-fledged infrastructure for easily curating target document image datasets, the reliance on traditional rule-based methods in existing code pieces, limited functionalities in available tools, lack of standardized datasets and benchmarks for DIA tasks, and the lack of active maintenance in platforms supporting DIA.'

### Refine

In [None]:
refine = RQa.from_chain_type(
    llm, retriever=vectordb.as_retriever(),
    chain_type="refine"
)
refine_result = refine({"query": question})
refine_result['result']

'The additional context provided highlights the advancements in document image analysis (DIA) tools and datasets, such as TensorFlow Hub and various document data collections, that aim to facilitate the development and sharing of pretrained models and pipelines specific to DIA tasks. These resources, along with the LayoutParser model zoo, offer a spectrum of models trained on diverse datasets to support different use cases in document analysis.\n\nIn light of this context, the challenges hindering the widespread adoption and reuse of innovations in DIA, particularly in comparison to disciplines like natural language processing and computer vision, can be further refined:\n\n1. Fragmentation in DIA Tools and Models: Despite the availability of pretrained models and document data collections in DIA, the lack of standardized tools and pipelines for document analysis hinders the seamless integration and reuse of innovations across different stages of DIA tasks. This fragmentation in tools 

### Question Answering With Prompt

Now, a [prompt template](https://python.langchain.com/docs/modules/model_io/prompts/quick_start#prompttemplate) will be created to guide us in answering the question, instructing the model on how to use the provided context to generate concise answers.
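
Before wiring a template into a chain, it can help to see what filling it in amounts to. The sketch below uses plain `str.format` as a stand-in for the template class; the shortened template text, the chunks, and the question are placeholders chosen for illustration.

```python
# A shortened template with the same two placeholders: {context} and {question}
template = """Use the following pieces of context to answer the question at the end.
{context}
Question: {question}
Helpful Answer:"""

# Placeholder chunks standing in for what a retriever would return
retrieved_chunks = ["First retrieved chunk.", "Second retrieved chunk."]

prompt = template.format(
    context="\n\n".join(retrieved_chunks),
    question="What does the document discuss?",
)
print(prompt)
```

The chain performs this substitution on every query, placing the retrieved chunks where `{context}` appears and the user's question where `{question}` appears.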

In [None]:
from langchain.prompts import PromptTemplate

# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)

In [None]:
# Run chain
from langchain.chains import RetrievalQA
qa_chain = RetrievalQA.from_chain_type(
    llm,
    retriever=vectordb.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT},
)
prompt_result = qa_chain({"query": question})
prompt_result["result"]

'Some challenges hindering the widespread adoption and reuse of innovations in DIA include the convoluted nature of deep learning models, the lack of infrastructure for curating datasets and re-training models, and the need for a sequence of models and processing steps for final outputs. These challenges make it difficult for researchers without technical backgrounds to implement and adapt existing models for DIA. Thanks for asking!'

## RAG

After completing the previous step, we have all the relevant information for the natural language model to interpret and provide a response considering the context and the question.

To summarize: first, we set up the environment by providing the OpenAI API key and storing it in an environment variable.

In [None]:
import getpass, openai, os
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA as RQa
from langchain.text_splitter import RecursiveCharacterTextSplitter


api_key = getpass.getpass(prompt="OPENAI - API-KEY: ")
openai.api_key = api_key
os.environ["OPENAI_API_KEY"] = api_key

OPENAI - API-KEY: ··········


Next, we load the document. In this case, we'll use a thesis from PUCP (https://tesis.pucp.edu.pe/repositorio/handle/20.500.12404/27052).

In [None]:
from langchain.document_loaders import PyPDFLoader

url_pdf = "https://tesis.pucp.edu.pe/repositorio/bitstream/handle/20.500.12404/27040/HUAMAN%c3%8d_LLAMOCCA_ROGER_ANGEL_DESARROLLO_COMPETENCIAS.pdf?sequence=1&isAllowed=y"
loader = PyPDFLoader(url_pdf)
pages = loader.load()

SSLError: HTTPSConnectionPool(host='tesis.pucp.edu.pe', port=443): Max retries exceeded with url: /repositorio/bitstream/handle/20.500.12404/27040/HUAMAN%C3%8D_LLAMOCCA_ROGER_ANGEL_DESARROLLO_COMPETENCIAS.pdf?sequence=1&isAllowed=y (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate has expired (_ssl.c:1007)')))

Then, we define the splitting module to generate text chunks with the previously loaded `Document`.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(pages)
len(splits)

40

Finally, we generate the vector database (thesis) using OpenAI embeddings.

In [None]:
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
embedding = OpenAIEmbeddings()

persist_directory = './thesis_chroma/'

# !rm -rf ./thesis_chroma  # remove old database files if any (linux, Mac)

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    persist_directory=persist_directory
)

To reuse a previously created vector store, point to the directory of the database and pass the same embedding method that was used to build it.

In [None]:
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

In [None]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA as RQa

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name = llm_model, temperature = 0)
question = input("Enter a question about the thesis: ")

In [None]:
stuff = RQa.from_chain_type(
    llm, retriever = vectordb.as_retriever(),
    chain_type = "stuff" # default
)
result = stuff({"query": question})
response = result['result']
response

## Final Result


### Preprocessing

In [None]:
import getpass, openai, os
from langchain.document_loaders import PyPDFLoader
from langchain.vectorstores import Chroma
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA as RQa
from langchain.text_splitter import RecursiveCharacterTextSplitter

api_key = getpass.getpass(prompt="Insert your OPENAI - API-KEY: ")
openai.api_key = api_key
os.environ["OPENAI_API_KEY"] = api_key


url_pdf = input("Insert the PDF URL: ")

loader = PyPDFLoader(url_pdf)
pages = loader.load()

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 1500,
    chunk_overlap = 150
)
splits = text_splitter.split_documents(pages)
embedding = OpenAIEmbeddings()

# persist_directory = './thesis_chroma/'

# !rm -rf ./thesis_chroma  # remove old database files if any (linux, Mac)

vectordb = Chroma.from_documents(
    documents=splits,
    embedding=embedding,
    # persist_directory=persist_directory
)


llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name = llm_model, temperature = 0)

# example url: https://tesis.pucp.edu.pe/repositorio/bitstream/handle/20.500.12404/27040/HUAMAN%c3%8d_LLAMOCCA_ROGER_ANGEL_DESARROLLO_COMPETENCIAS.pdf?sequence=1&isAllowed=y



### Ask

In [None]:
# What mechanisms and/or processes does the Mibeca program implement for the job placement of graduates?
# How is the quality of the services provided by the IESTs eligible under the Mibeca program evaluated with respect to developing the scholarship recipients' competencies?
while True:
    question = input("Ask: ")
    if question == "":
        break
    stuff = RQa.from_chain_type(
        llm, retriever = vectordb.as_retriever(),
        chain_type = "stuff" # default
    )
    stuff_result = stuff({"query": question})
    result = stuff_result['result']
    format_response = f"""
    Question:
      {question}
    Result:
      {result}
    ------------ x -------------
    """
    print(format_response)

### Several documents

In [None]:
example_urls = [
    "https://www.defensoria.gob.pe/wp-content/uploads/2023/02/Reporte-Mensual-de-Conflictos-Sociales-N%C2%B0-227-Enero-2023.pdf",
    "https://www.defensoria.gob.pe/wp-content/uploads/2023/03/Reporte-Mensual-de-Conflictos-Sociales-N%C2%B0-228-Febrero-2023.pdf",
    "https://www.defensoria.gob.pe/wp-content/uploads/2023/04/Reporte-Mensual-de-Conflictos-Sociales-N-229-Marzo-2023.pdf"
]

!rm -rf ./vector_db_chroma/  # remove old database files if any (linux, Mac)

for i, url_pdf in enumerate(example_urls):

    loader = PyPDFLoader(url_pdf)
    pages = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 1500,
        chunk_overlap = 150
    )
    splits = text_splitter.split_documents(pages)
    embedding = OpenAIEmbeddings()

    # persist_directory = './thesis_chroma/'

    # !rm -rf ./thesis_chroma  # remove old database files if any (linux, Mac)
    if i == 0:
        vectordb = Chroma.from_documents(
            documents=splits,
            embedding=embedding
        )
    else:
        vectordb.add_documents(
            documents=splits
        )

llm_model = "gpt-3.5-turbo"
llm = ChatOpenAI(model_name = llm_model, temperature = 0)

In [None]:
while True:
    question = input("Ask: ")
    if question == "":
        break
    stuff = RQa.from_chain_type(
        llm, retriever = vectordb.as_retriever(),
        chain_type = "stuff" # default
    )
    stuff_result = stuff({"query": question})
    result = stuff_result['result']
    format_response = f"""
    Question:
      {question}
    Result:
      {result}
    ------------ x -------------
    """
    print(format_response)

# example questions:
# how do unanticipated fiscal policy shocks affect the economy?
# does monetary policy affect different sectors differently?
# how do trade agreements affect the value of exports?