<a href="https://colab.research.google.com/github/edgarbc/llm-knowledge-extractor/blob/main/my_RAG_example_PDFs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Example of a simple RAG to query an LLM on local data from PDFs.

Edgar Bermudez

November, 2023

Retrieval augmented generation (RAG) provides a way to optimize the output of an LLM with targeted information without modifying the underlying model itself.

This targeted information can be more up-to-date than the LLM and can be specific to a particular organization and industry [REF](https://www.oracle.com/artificial-intelligence/generative-ai/retrieval-augmented-generation-rag/).

The RAG method works by fetching up-to-date or context-specific data from an external database and making it available to an LLM when asking it to generate a response. This approach can help reduce the likelihood of hallucinations and improve the accuracy of the LLM's responses by grounding them in more up-to-date information.

The key points are:

- how to ingest the data so that it can be passed to an embedding model, producing embeddings that let us run a semantic search over the new data
- how to map the query into the same embedding space using the same embedding model, retrieve the most similar chunks, and let the LLM generate the response from them
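The two key points can be sketched without any external service. The example below is only an illustration under stated assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the three chunks are made-up sentences, and `build_prompt` is a hypothetical helper that shows how retrieved text would be handed to an LLM.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 1) -> list[str]:
    """Embed the query with the same function as the chunks and rank by similarity."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Hypothetical prompt layout: retrieved context first, then the question."""
    context = "\n".join(context_chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Made-up chunks standing in for text extracted from the PDFs.
chunks = [
    "Diffusion models generate images by iterative denoising.",
    "A hierarchical GAN can synthesize 3D volumes.",
    "Tokenizers split text into subword units.",
]
query = "Which model can synthesize 3D volumes?"
top = retrieve(query, chunks, top_k=1)
prompt = build_prompt(query, top)
print(prompt)
```

In the rest of this notebook, llama-index and the OpenAI models take the place of `embed`, `retrieve`, and `build_prompt`.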

In [1]:
# install the required modules
!pip install openai
!pip install llama-index
!pip install tiktoken
!pip install pypdf

Collecting openai
  Downloading openai-1.3.8-py3-none-any.whl (221 kB)
Collecting httpx<1,>=0.23.0 (from openai)
  Downloading httpx-0.25.2-py3-none-any.whl (74 kB)
Collecting httpcore==1.* (from httpx<1,>=0.23.0->openai)
  Downloading httpcore-1.0.2-py3-none-any.whl (76 kB)
Collecting h11<0.15,>=0.13 (from httpcore==1.*->httpx<1,>=0.23.0->openai)
  Downloading h11-0.14.0-p

In [2]:
import openai
import tiktoken
from llama_index import ServiceContext, LLMPredictor, OpenAIEmbedding, PromptHelper
from llama_index.llms import OpenAI
from llama_index.text_splitter import TokenTextSplitter
from llama_index.node_parser import SimpleNodeParser
from llama_index import VectorStoreIndex, SimpleDirectoryReader
from llama_index import set_global_service_context

Load the API key to be able to use the gpt-3.5-turbo LLM from OpenAI.

TODO: extend to an open-source LLM from HuggingFace.

In [3]:
import os

In [4]:
os.environ['OPENAI_API_KEY']="API_KEY"
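Hardcoding a key in a shared notebook makes it easy to leak. As a minimal sketch, assuming the key has already been placed in the environment by some other means, a small check like the following fails early with a clear message. The helper name `require_api_key` is hypothetical and not part of openai or llama-index.

```python
import os


def require_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the key stored in env_var, or fail with a clear message."""
    key = os.environ.get(env_var, "").strip()
    # "API_KEY" is the placeholder string used in the cell above.
    if not key or key == "API_KEY":
        raise RuntimeError(
            f"Set the {env_var} environment variable to a real key first."
        )
    return key
```

One way to set the variable interactively without echoing the key is to assign the result of `getpass.getpass()` from the standard library to `os.environ["OPENAI_API_KEY"]`.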

Load the PDF documents stored in the local directory data_dir. Here we assume that the PDFs have been uploaded into data_dir. If you would rather keep your data in permanent cloud storage such as Google Cloud Storage (GCS), you can use unstructured (https://pypi.org/project/unstructured/).

In [5]:
data_dir = "/content/drive/MyDrive/Colab Notebooks/paper_data/"
!ls "/content/drive/MyDrive/Colab Notebooks/paper_data/"

2209.07162.pdf	2302.05543.pdf	2307.15208.pdf


In [6]:
#from llama_index.readers.file.base import SimpleDirectoryReader
documents = SimpleDirectoryReader(input_dir=data_dir).load_data()

Check that loading the PDFs was successful:

In [7]:
documents[1]

Document(id_='e4ffc334-ab9c-4372-a977-c295e6bcc90a', embedding=None, metadata={'page_label': '2', 'file_name': '2209.07162.pdf', 'file_path': '/content/drive/MyDrive/Colab Notebooks/paper_data/2209.07162.pdf', 'file_type': 'application/pdf', 'file_size': 6234415, 'creation_date': '2023-11-07', 'last_modified_date': '2023-11-07', 'last_accessed_date': '2023-11-07'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, hash='f1935e7d9d77101e716dec79166ad2d24671a696ce9902cfa006b83dcc4a8a6e', text='2 Pinaya et al.\nsentences [8,26]. During the same period, medical image analysis also made re-\nmarkable breakthroughs by applying deep neural networks to solve tasks such\nas segmentation, structure detection, and computer-aided diagnosis (detailed re-\nview available at [

Now we need to split the text into chunks whose size is measured in tokens. We can use tiktoken to count tokens with the same tokenizer that gpt-3.5-turbo uses.

In [9]:
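# token-based splitter: chunks of up to 1024 tokens with a 20-token overlap,
# counted with the gpt-3.5-turbo tokenizer
text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)

# parse the documents into nodes using the default values
# (note: text_splitter above is not passed to this parser)
node_parser = SimpleNodeParser.from_defaults()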

Alternatively, you can specify in more detail how the documents are parsed. For example:

In [None]:
from llama_index.node_parser import SentenceSplitter
text_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n\n",
    secondary_chunking_regex="[^,.; ]+[,.; ]?",
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode
)
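To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window sketch. It is not how llama-index implements its splitters (they also respect separators and sentence boundaries); `chunk_tokens` is a hypothetical helper, and whitespace-separated words stand in for real tokens.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    """Split a token list into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once the current window already reaches the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Whitespace-separated letters stand in for real tokens.
tokens = "a b c d e f g h i j".split()
chunks = chunk_tokens(tokens, chunk_size=4, chunk_overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last token of the previous one, so a sentence that straddles a boundary is less likely to lose its context.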



Now it is time to configure the models: an OpenAI embedding model to encode the text and gpt-3.5-turbo to generate the responses.

In [10]:
from llama_index.embeddings.openai import OpenAIEmbedding
llm = OpenAI(model="gpt-3.5-turbo", temperature=0, max_tokens=256)
embed_model = OpenAIEmbedding()
prompt_helper = PromptHelper(
    context_window=4096,
    num_output=256,
    chunk_overlap_ratio=0.1,
    chunk_size_limit=None
)
service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    node_parser=node_parser,
    prompt_helper=prompt_helper
)
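The `context_window` and `num_output` values determine how much retrieved text can fit into a single LLM call. The arithmetic below is a rough sketch of that budget, not the actual PromptHelper logic; `available_context_tokens` is a hypothetical helper and the 100-token prompt template size is an assumed figure.

```python
def available_context_tokens(context_window: int, num_output: int, prompt_tokens: int) -> int:
    """Tokens left for retrieved text after reserving the answer and the prompt template."""
    remaining = context_window - num_output - prompt_tokens
    if remaining < 0:
        raise ValueError("prompt and reserved output exceed the context window")
    return remaining


# 4096 and 256 come from the cell above; 100 prompt tokens is an assumption.
budget = available_context_tokens(context_window=4096, num_output=256, prompt_tokens=100)
full_chunks = budget // 1024  # how many 1024-token chunks fit in one call
print(budget, full_chunks)
```

Under these assumptions, 3740 tokens remain for context, which is enough for three full 1024-token chunks per call.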

Now we index the documents using this service context. This is the step where the embeddings are computed.

In [11]:
from llama_index.indices.vector_store.base import VectorStoreIndex
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

Now define how we are going to query the knowledge (vector) database:

In [12]:
query_engine = index.as_query_engine(service_context=service_context)


Now let's ask something specific that is covered in the documents we indexed:

In [13]:
response = query_engine.query("Who proposed a 3D GAN with a hierarchical structure?")
print(response)

Sun et al. proposed a 3D GAN with a hierarchical structure.
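To check that an answer is grounded in the indexed PDFs, it helps to look at which chunks were retrieved. The sketch below assumes the response object exposes `source_nodes`, where each item has a `score` and a `node` with a `metadata` dict, as in the llama-index version used here; this may differ in other versions. `summarize_sources` is a hypothetical helper, and the stand-in objects and their values are placeholders so that the block runs without an API call.

```python
from types import SimpleNamespace


def summarize_sources(response):
    """Collect (file name, page label, similarity score) for each retrieved chunk."""
    rows = []
    for item in getattr(response, "source_nodes", []):
        meta = item.node.metadata
        rows.append((meta.get("file_name"), meta.get("page_label"), item.score))
    return rows


# Stand-in for a real response; the file name, page, and score are placeholders.
fake_response = SimpleNamespace(
    source_nodes=[
        SimpleNamespace(
            score=0.5,
            node=SimpleNamespace(metadata={"file_name": "example.pdf", "page_label": "1"}),
        )
    ]
)
for row in summarize_sources(fake_response):
    print(row)
```

With a real query, passing `response` from the cell above instead of `fake_response` would list the pages the answer was drawn from, under the same assumptions.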
