[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/generation/langchain/handbook/05-langchain-retrieval-augmentation.ipynb)

#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

The world of LLMs is frozen in time. Their world exists as a static snapshot of the world as it was within their training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.

[![Open fast notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/fast-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/generation/langchain/handbook/05-langchain-retrieval-augmentation-fast.ipynb)

To begin, we must install the prerequisite libraries that we will be using in this notebook. If we install all libraries we will find a conflict in the Hugging Face `datasets` library so we must install everything in a specific order like so:

In [1]:
!pip install -qU \
    apache_beam \
    mwparserfromhell

[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
google-cloud-bigquery 2.34.4 requires protobuf<4.0.0dev,>=3.12.0, but you have protobuf 4.23.2 which is incompatible.
streamlit 1.11.1 requires protobuf<4,>=3.12, but you have protobuf 4.23.2 which is incompatible.
tensorboard 2.10.1 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.23.2 which is incompatible.
tensorflow 2.10.0 requires protobuf<3.20,>=3.9.2, but you have protobuf 4.23.2 which is incompatible.
tensorflow-metadata 1.10.0 requires protobuf<4,>=3.13, but you have protobuf 4.23.2 which is incompatible.[0m[31m
[0m

## Red google docs from data folder

In [9]:
!ls ../data

dis_advantage_real_state.docx fair_real_state.docx


In [12]:
!pip install docx

Collecting docx
  Downloading docx-0.2.4.tar.gz (54 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m54.9/54.9 kB[0m [31m1.8 MB/s[0m eta [36m0:00:00[0m
[?25h  Preparing metadata (setup.py) ... [?25ldone
Building wheels for collected packages: docx
  Building wheel for docx (setup.py) ... [?25ldone
[?25h  Created wheel for docx: filename=docx-0.2.4-py3-none-any.whl size=53897 sha256=26d8adbf5e4ac29c30d5217d4ef8546503ad154db63d5a0932579b5604f49ce1
  Stored in directory: /Users/federicomoreno/Library/Caches/pip/wheels/81/f5/1d/e09ba2c1907a43a4146d1189ae4733ca1a3bfe27ee39507767
Successfully built docx
Installing collected packages: docx
Successfully installed docx-0.2.4


In [1]:
docx_file = '../data/dis_advantage_real_state.docx'
#read docx file
from docx import Document
doc = Document(docx_file)


In [6]:
#append all paragraphs to a list
data = []
for para in doc.paragraphs:
    data.append(para.text)


In [7]:
data

['OFERTA DE COMPRA AD REFERENDUM',
 '',
 'RECIBIMOS del Sr/a ___________________________, DNI N° _____________________, con domicilio en _______________________________, CABA, Celular: ____________________ mail __________________________la cantidad de U$S ___________.- (Dólares _________________________________________) en efectivo en concepto de OFERTA DE COMPRA “Ad referendum” de la aprobación del DESTINATARIO y del resultado de los informes de inhibición y dominio, por la adquisición de UN INMUEBLE ubicado en  _________________________, piso __° __”, CABA  sirviendo el presente de recibo por dicha suma, de acuerdo con las siguientes condiciones: ',
 'A) PRECIO: El precio propuesto para la presente operación asciende a la cantidad de USD _________________.- (dólares estadounidenses billetes ______________________________)  pagaderos con la firma del boleto con posesión de la unidad en dinero en efectivo y en billetes de la referida moneda. El comprador declara que posee en su patrimo

Now we install the remaining libraries:

In [8]:
!pip install -qU \
  langchain==0.0.162 \
  openai==0.27.7 \
  tiktoken==0.4.0 \
  "pinecone-client[grpc]"==2.2.1

Every record contains *a lot* of text. Our first task is therefore to identify a good preprocessing methodology for chunking these articles into more "concise" chunks to later be embedding and stored in our Pinecone vector database.

For this we use LangChain's `RecursiveCharacterTextSplitter` to split our text into chunks of a specified max length.

In [9]:
import tiktoken

tiktoken.encoding_for_model('gpt-3.5-turbo')

<Encoding 'cl100k_base'>

In [10]:
import tiktoken

tokenizer = tiktoken.get_encoding('cl100k_base')

# create the length function
def tiktoken_len(text):
    tokens = tokenizer.encode(
        text,
        disallowed_special=()
    )
    return len(tokens)

tiktoken_len("hello I am a chunk of text and using the tiktoken_len function "
             "we can find the length of this chunk of text in tokens")

26

In [11]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=400,
    chunk_overlap=20,
    length_function=tiktoken_len,
    separators=["\n\n", "\n", " ", ""]
)

In [12]:
chunks = text_splitter.split_text(data)
chunks

['OFERTA DE COMPRA AD REFERENDUMRECIBIMOS del Sr/a ___________________________, DNI N° _____________________, con domicilio en _______________________________, CABA, Celular: ____________________ mail __________________________la cantidad de U$S ___________.- (Dólares _________________________________________) en efectivo en concepto de OFERTA DE COMPRA “Ad referendum” de la aprobación del DESTINATARIO y del resultado de los informes de inhibición y dominio, por la adquisición de UN INMUEBLE ubicado en  _________________________, piso __° __”, CABA  sirviendo el presente de recibo por dicha suma, de acuerdo con las siguientes condiciones: A) PRECIO: El precio propuesto para la presente operación asciende a la cantidad de USD _________________.- (dólares estadounidenses billetes ______________________________)  pagaderos con la firma del boleto con posesión de la unidad en dinero en efectivo y en billetes de la referida moneda. El comprador declara que posee en su patrimonio y de libre 

In [13]:
tiktoken_len(chunks[0]), tiktoken_len(chunks[1]), tiktoken_len(chunks[2])

(390, 352, 269)

Using the `text_splitter` we get much better sized chunks of text. We'll use this functionality during the indexing process later. Now let's take a look at embedding.

## Creating Embeddings

Building embeddings using LangChain's OpenAI embedding support is fairly straightforward. We first need to add our [OpenAI api key]() by running the next cell:

In [14]:
import os
#load from json .creds/PINECONE_API
import json
with open('../.creds/PINECONE_API') as f:
    creds = json.load(f)
    PINECONE_API_KEY = creds['PINECONE_API_KEY']
    PINECONE_ENVIRONMENT = creds['PINECONE_ENVIRONMENT']
    OPENAI_API_KEY = creds['OPENAI_API_KEY']

*(Note that OpenAI is a paid service and so running the remainder of this notebook may incur some small cost)*

After initializing the API key we can initialize our `text-embedding-ada-002` embedding model like so:

In [15]:
from langchain.embeddings.openai import OpenAIEmbeddings

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Now we embed some text like so:

From this we get *two* (aligning to our two chunks of text) 1536-dimensional embeddings.

Now we move on to initializing our Pinecone vector database.

## Vector Database

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [16]:
index_name = 'legal-docs'

In [19]:
res = embed.embed_documents(data)
len(res), len(res[0])

(47, 1536)

In [20]:
import pinecone

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENVIRONMENT
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=len(res[0])  # 1536 dim of text-embedding-ada-002
    )

Then we connect to the new index:

In [None]:
index = pinecone.GRPCIndex(index_name)

index.describe_index_stats()

We should see that the new Pinecone index has a `total_vector_count` of `0`, as we haven't added any vectors yet.

## Indexing

We can perform the indexing task using the LangChain vector store object. But for now it is much faster to do it via the Pinecone python client directly. We will do this in batches of `100` or more.

from tqdm.auto import tqdm
from uuid import uuid4

batch_limit = 100

texts = []
metadatas = []

for i, record in enumerate(tqdm(data)):
    # first get metadata fields for this record
    metadata = {
        'wiki-id': str(record['id']),
        'source': record['url'],
        'title': record['title']
    }
    # now we create chunks from the record text
    record_texts = text_splitter.split_text(record['text'])
    # create individual metadata dicts for each chunk
    record_metadatas = [{
        "chunk": j, "text": text, **metadata
    } for j, text in enumerate(record_texts)]
    # append these to current batches
    texts.extend(record_texts)
    metadatas.extend(record_metadatas)
    # if we have reached the batch_limit we can add texts
    if len(texts) >= batch_limit:
        ids = [str(uuid4()) for _ in range(len(texts))]
        embeds = embed.embed_documents(texts)
        index.upsert(vectors=zip(ids, embeds, metadatas))
        texts = []
        metadatas = []

if len(texts) > 0:
    ids = [str(uuid4()) for _ in range(len(texts))]
    embeds = embed.embed_documents(texts)
    index.upsert(vectors=zip(ids, embeds, metadatas))

We've now indexed everything. We can check the number of vectors in our index like so:

In [16]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 27437}},
 'total_vector_count': 27437}

## Creating a Vector Store and Querying

Now that we've build our index we can switch back over to LangChain. We start by initializing a vector store using the same index we just built. We do that like so:

In [17]:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

In [18]:
query = "Cual es la condición del contrato?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='Benito Amilcare Andrea Mussolini KSMOM GCTE (29 July 1883 – 28 April 1945) was an Italian politician and journalist. He was also the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party.\n\nBiography\n\nEarly life\nBenito Mussolini was named after Benito Juarez, a Mexican opponent of the political power of the Roman Catholic Church, by his anticlerical (a person who opposes the political interference of the Roman Catholic Church in secular affairs) father. Mussolini\'s father was a blacksmith. Before being involved in politics, Mussolini was a newspaper editor (where he learned all his propaganda skills) and elementary school teacher.\n\nAt first, Mussolini was a socialist, but when he wanted Italy to join the First World War, he was thrown out of the socialist party. He \'invented\' a new ideology, Fascism, much out of Nationalist\xa0and Conservative views.\n\nRise to power and becoming dictator\nIn 1922, he took power b

All of these are good, relevant results. But what can we do with this? There are many tasks, one of the most interesting (and well supported by LangChain) is called _"Generative Question-Answering"_ or GQA.

## Generative Question-Answering

In GQA we take the query as a question that is to be answered by a LLM, but the LLM must answer the question based on the information it is seeing being returned from the `vectorstore`.

To do this we initialize a `RetrievalQA` object like so:

In [19]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [20]:
qa.run(query)

'Benito Mussolini was an Italian politician and journalist who served as the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and invented the ideology of Fascism. Mussolini was a dictator of Italy by the end of 1927, and his form of Fascism, "Italian Fascism," was different and less destructive than Hitler\'s Nazism. Mussolini wanted Italy to become a new Roman Empire and attacked several countries, including Abyssinia (now called Ethiopia) and Greece. He was removed from power in 1943 and was executed by Italian partisans in 1945.'

We can also include the sources of information that the LLM is using to answer our question. We can do this using a slightly different version of `RetrievalQA` called `RetrievalQAWithSourcesChain`:

In [21]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [22]:
qa_with_sources(query)

{'question': 'who was Benito Mussolini?',
 'answer': 'Benito Mussolini was an Italian politician and journalist who was the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and invented the ideology of Fascism. He became dictator of Italy by the end of 1927 and was friends with German dictator Adolf Hitler. Mussolini attacked Greece and failed to conquer it. He was removed by the Great Council of Fascism in 1943 and was executed by a partisan on April 28, 1945. After the war, several Neo-Fascist movements have had success in Italy, the most important being the Movimento Sociale Italiano. His granddaughter Alessandra Mussolini has outspoken views similar to Fascism. \n',
 'sources': 'https://simple.wikipedia.org/wiki/Benito%20Mussolini, https://simple.wikipedia.org/wiki/Fascism'}

Now we answer the question being asked, *and* return the source of this information being used by the LLM.

---