[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb)

### Code borrowed from the LangChain Handbook, adapted to fit the currently working schema
August 17, 2023

#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

An LLM's knowledge is frozen in time: it is a static snapshot of the world as captured in its training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.


<!--Nothing actually in these notebooks links, it's the same notebook [![Open full notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb) -->

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [1]:
!pip install -qU \
  langchain==0.0.162 \
  openai==0.27.7 \
  tiktoken==0.4.0 \
  "pinecone-client[grpc]"==2.2.1 \
  pinecone_datasets=='0.5.0rc10'

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Building the Knowledge Base

We will download a pre-embedded dataset from `pinecone-datasets`, which lets us skip the embedding and preprocessing steps. If you'd rather work through those steps, you can find the [full notebook here](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb).

In [39]:
import pinecone_datasets

dataset = pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002-100K')
dataset.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1-0,"[-0.011254455894231796, -0.01698738895356655, ...",,,"{'chunk': 0, 'source': 'https://simple.wikiped..."
1,1-1,"[-0.0015197008615359664, -0.007858820259571075...",,,"{'chunk': 1, 'source': 'https://simple.wikiped..."
2,1-2,"[-0.009930099360644817, -0.012211072258651257,...",,,"{'chunk': 2, 'source': 'https://simple.wikiped..."
3,1-3,"[-0.011600767262279987, -0.012608098797500134,...",,,"{'chunk': 3, 'source': 'https://simple.wikiped..."
4,1-4,"[-0.026462381705641747, -0.016362832859158516,...",,,"{'chunk': 4, 'source': 'https://simple.wikiped..."


In [17]:
len(dataset)

2000

In [48]:
# keep only the first 2,000 rows so the upsert later on stays small and fast
dataset.documents.drop(dataset.documents.index[2000:], inplace=True)
dataset.head()


Unnamed: 0,id,values,sparse_values,metadata,blob
0,1-0,"[-0.011254455894231796, -0.01698738895356655, ...",,,"{'chunk': 0, 'source': 'https://simple.wikiped..."
1,1-1,"[-0.0015197008615359664, -0.007858820259571075...",,,"{'chunk': 1, 'source': 'https://simple.wikiped..."
2,1-2,"[-0.009930099360644817, -0.012211072258651257,...",,,"{'chunk': 2, 'source': 'https://simple.wikiped..."
3,1-3,"[-0.011600767262279987, -0.012608098797500134,...",,,"{'chunk': 3, 'source': 'https://simple.wikiped..."
4,1-4,"[-0.026462381705641747, -0.016362832859158516,...",,,"{'chunk': 4, 'source': 'https://simple.wikiped..."


In [26]:
type(dataset)

pinecone_datasets.dataset.Dataset

Next we format the reduced dataset so that it is ready for upsert.

In [49]:
#### THE API REQUIRES THE METADATA FIELD TO BE POPULATED
# 1. Get the underlying DataFrame from the Dataset object
df = dataset.documents

# 2. Modify the DataFrame
# Copy the 'blob' column into 'metadata' (skipped if there is no 'blob' column)
if 'blob' in df.columns:
    df['metadata'] = df['blob']
df.head()

Unnamed: 0,id,values,sparse_values,metadata,blob
0,1-0,"[-0.011254455894231796, -0.01698738895356655, ...",,"{'chunk': 0, 'source': 'https://simple.wikiped...","{'chunk': 0, 'source': 'https://simple.wikiped..."
1,1-1,"[-0.0015197008615359664, -0.007858820259571075...",,"{'chunk': 1, 'source': 'https://simple.wikiped...","{'chunk': 1, 'source': 'https://simple.wikiped..."
2,1-2,"[-0.009930099360644817, -0.012211072258651257,...",,"{'chunk': 2, 'source': 'https://simple.wikiped...","{'chunk': 2, 'source': 'https://simple.wikiped..."
3,1-3,"[-0.011600767262279987, -0.012608098797500134,...",,"{'chunk': 3, 'source': 'https://simple.wikiped...","{'chunk': 3, 'source': 'https://simple.wikiped..."
4,1-4,"[-0.026462381705641747, -0.016362832859158516,...",,"{'chunk': 4, 'source': 'https://simple.wikiped...","{'chunk': 4, 'source': 'https://simple.wikiped..."


In [50]:
#### THE PINECONE OBJECT REQUIRES THE PARQUET FILE TO BE IN DIRECTORY '*/DOCUMENTS'
import os

# Create the documents directory if it doesn't exist
documents_dir = "../data/processed/documents"
os.makedirs(documents_dir, exist_ok=True)

# Save the DataFrame as a Parquet file inside the documents directory
parquet_file_inside_documents = os.path.join(documents_dir, "parquet_df_with_metadata.parquet")
df.to_parquet(parquet_file_inside_documents)


In [53]:
#### MAKE SURE META DATA IS POPULATED BY USING WHAT WAS IN BLOB
### THIS NEEDS TO BE A PINECONE OBJECT
new_dataset = pinecone_datasets.dataset.Dataset.from_path("../data/processed/")
print(new_dataset.head())
type(new_dataset)

    id                                             values sparse_values  \
0  1-0  [-0.011254455894231796, -0.01698738895356655, ...          None   
1  1-1  [-0.0015197008615359664, -0.007858820259571075...          None   
2  1-2  [-0.009930099360644817, -0.012211072258651257,...          None   
3  1-3  [-0.011600767262279987, -0.012608098797500134,...          None   
4  1-4  [-0.026462381705641747, -0.016362832859158516,...          None   

                                            metadata  \
0  {'chunk': 0, 'source': 'https://simple.wikiped...   
1  {'chunk': 1, 'source': 'https://simple.wikiped...   
2  {'chunk': 2, 'source': 'https://simple.wikiped...   
3  {'chunk': 3, 'source': 'https://simple.wikiped...   
4  {'chunk': 4, 'source': 'https://simple.wikiped...   

                                                blob  
0  {'chunk': 0, 'source': 'https://simple.wikiped...  
1  {'chunk': 1, 'source': 'https://simple.wikiped...  
2  {'chunk': 2, 'source': 'https://simple.wikip

pinecone_datasets.dataset.Dataset

Now we move on to initializing our Pinecone vector database.

## Vector Database

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [54]:
index_name = 'langchain-retrieval-augmentation-fast'
indexname = index_name

In [31]:
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())
import os
import pinecone

# # find API key in console at app.pinecone.io
# PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'PINECONE_API_KEY'
# # find ENV (cloud region) next to API key in console
# PINECONE_ENVIRONMENT = os.getenv('PINECONE_ENVIRONMENT') or 'PINECONE_ENVIRONMENT'

# pinecone.init(
#     api_key=PINECONE_API_KEY,
#     environment=PINECONE_ENVIRONMENT
# )
# connect to pinecone environment
pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),  
    environment=os.getenv('PINECONE_ENV') 
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,  # 1536 dim of text-embedding-ada-002
    )
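
As a rough illustration of what `metric='cosine'` refers to, the sketch below computes cosine similarity for plain Python lists. It is a toy version for intuition only, not how Pinecone implements the metric internally; the function name `cosine_similarity` and the 3-dimensional vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# toy 3-dimensional vectors; real ada-002 embeddings have 1536 dimensions
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, and orthogonal vectors score 0, which is why the metric suits comparing text embeddings by direction rather than magnitude.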

Then we connect to the new index:

In [34]:
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())
import os
import pinecone
import time
# SSL workaround that lets this notebook connect: override urllib3's default cipher list
import requests
from requests.packages.urllib3.util.ssl_ import create_urllib3_context

CIPHERS = (
    'ECDHE+AESGCM:ECDHE+CHACHA20:DHE+AESGCM:DHE+CHACHA20:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:'
    'ECDHE+AES:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4:!HMAC_SHA1:!SHA1:!DHE+AES:!ECDH+AES:!DH+AES'
)

requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = CIPHERS
# Skip the following two lines if they cause errors
# requests.packages.urllib3.contrib.pyopenssl.DEFAULT_SSL_CIPHER_LIST = CIPHERS
# requests.packages.urllib3.contrib.pyopenssl.inject_into_urllib3()
requests.packages.urllib3.util.ssl_.create_default_context = create_urllib3_context

index = pinecone.GRPCIndex(index_name)
# wait a moment for the index to be fully initialized
time.sleep(20)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We should see that the new Pinecone index has a `total_vector_count` of `0`, as we haven't added any vectors yet.
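
The fixed `time.sleep(20)` above works, but it waits the full 20 seconds even when the index is ready sooner, and may not be long enough when it is slow. A possible alternative is to poll a readiness check with a timeout. The sketch below is generic: `wait_until` and `fake_ready` are hypothetical names, and how you check readiness with your Pinecone client version is left as an assumption.

```python
import time


def wait_until(is_ready, timeout=60.0, interval=1.0):
    """Call is_ready() repeatedly until it returns True or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        if is_ready():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in readiness check for illustration: reports ready on the third call.
# With Pinecone, is_ready would wrap whatever readiness signal your client
# version exposes (an assumption, not shown in this notebook).
calls = {"n": 0}


def fake_ready():
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_ready, timeout=5.0, interval=0.01)
print(ready, calls["n"])
```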

Now we upsert the data to Pinecone:

In [56]:
for batch in new_dataset.iter_documents(batch_size=100):
    index.upsert(batch)

We've now indexed everything. We can check the number of vectors in our index like so:

In [57]:
index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 2000}},
 'total_vector_count': 2000}
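
Here `iter_documents` does the batching for us. For records that do not come from a `pinecone_datasets` object, the same idea can be sketched in plain Python. The helper name `batched` and the toy records are assumptions for illustration; each resulting batch is what you would pass to an upsert call.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most `batch_size` items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250 toy records shaped like (id, values, metadata) tuples
records = [(f"doc-{i}", [0.0, 0.0], {"chunk": i}) for i in range(250)]
batch_sizes = [len(batch) for batch in batched(records, 100)]
print(batch_sizes)  # [100, 100, 50]
```

Sending vectors in fixed-size batches keeps each request small, and the final batch simply carries whatever is left over.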

## Creating a Vector Store and Querying

Now that we've built our index we can switch over to LangChain. We need to initialize a LangChain vector store using the same index we just built. For this we will also need a LangChain embedding object, which we initialize like so:

In [58]:
from langchain.embeddings.openai import OpenAIEmbeddings

# get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
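
Note that the `or 'OPENAI_API_KEY'` fallback above substitutes a placeholder string when the environment variable is unset, so a missing key only surfaces later as a failed API call. If you prefer to fail early, a small check like the following is one option. `require_env` and `EXAMPLE_API_KEY` are hypothetical names used only for this sketch.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or fail loudly if it is unset or empty."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# demo with a made-up variable name so the example runs anywhere
os.environ["EXAMPLE_API_KEY"] = "sk-example"
print(require_env("EXAMPLE_API_KEY"))
```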

Now initialize the vector store:

In [59]:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)
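
The `text_field` argument names the metadata key that holds each chunk's text, which is why the metadata had to be populated before upserting. Roughly speaking, the vector store pulls that key out of each match's metadata and uses it as the document text. The following is a simplified sketch of that idea using plain dictionaries, not LangChain's actual code; `match_to_document` and the example match are hypothetical.

```python
def match_to_document(match, text_field="text"):
    """Split a match's metadata into page text and the remaining metadata."""
    metadata = dict(match["metadata"])  # copy so the input is not mutated
    if text_field not in metadata:
        return None  # no text stored under this key, so there is nothing to return
    page_content = metadata.pop(text_field)
    return {"page_content": page_content, "metadata": metadata}


# hypothetical match, shaped like one entry of a vector query response
match = {
    "id": "1-0",
    "score": 0.87,
    "metadata": {
        "chunk": 0,
        "source": "https://example.com/page",
        "text": "Some passage.",
    },
}
doc = match_to_document(match)
print(doc)
```

If the vectors had been upserted without a `text` entry in their metadata, a lookup like this would have no passage to return, even though the similarity search itself would still find matches.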

Now we can query the vector store directly using `vectorstore.similarity_search`:

In [60]:
query = "who was Benito Mussolini?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)

[Document(page_content='Veneto was made part of Italy in 1866 after a war with Austria. Italian soldiers won Latium in 1870. That was when they took away the Pope\'s power. The Pope, who was angry, said that he was a prisoner to keep Catholic people from being active in politics. That was the year of Italian unification.\n\nItaly participated in World War I. It was an ally of Great Britain, France, and Russia against the Central Powers. Almost all of Italy\'s fighting was on the Eastern border, near Austria. After the "Caporetto defeat", Italy thought they would lose the war. But, in 1918, the Central Powers surrendered. Italy gained the Trentino-South Tyrol, which once was owned by Austria.\n\nFascist Italy \nIn 1922, a new Italian government started. It was ruled by Benito Mussolini, the leader of Fascism in Italy. He became head of government and dictator, calling himself "Il Duce" (which means "leader" in Italian). He became friends with German dictator Adolf Hitler. Germany, Japan

All of these are good, relevant results. But what can we do with them? There are many possible tasks; one of the most interesting (and well supported by LangChain) is called _"Generative Question-Answering"_, or GQA.

## Generative Question-Answering

In GQA we pass the query as a question to an LLM, but the LLM must base its answer on the information returned from the `vectorstore`.

To do this we initialize a `RetrievalQA` object like so:

In [61]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# chat LLM used to generate the answer
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [62]:
qa.run(query)

'Benito Mussolini was an Italian politician and leader who ruled Italy as the head of the National Fascist Party from 1922 until his ousting in 1943. He is known for establishing a fascist dictatorship in Italy and aligning the country with Nazi Germany and Imperial Japan as part of the Axis Powers during World War II. Mussolini, also known as "Il Duce," implemented authoritarian policies, suppressed political opposition, and promoted nationalism and militarism. His rule ended when he was overthrown and executed by Italian partisans in 1945.'
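
The `chain_type="stuff"` setting means that all retrieved passages are placed ("stuffed") into a single prompt together with the question. The sketch below illustrates the idea in plain Python. The template wording and the names `PROMPT_TEMPLATE` and `stuff_prompt` are assumptions for illustration and are not LangChain's actual prompt.

```python
# Illustrative template; the real chain uses its own prompt wording.
PROMPT_TEMPLATE = (
    "Use the following context to answer the question. "
    "If the context does not contain the answer, say you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def stuff_prompt(question, passages, separator="\n\n"):
    """Join all retrieved passages into one context block and fill the template."""
    context = separator.join(passages)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# two short excerpts taken from the search result shown earlier
passages = [
    "In 1922, a new Italian government started. It was ruled by Benito Mussolini, the leader of Fascism in Italy.",
    "Italy participated in World War I.",
]
prompt = stuff_prompt("who was Benito Mussolini?", passages)
print(prompt)
```

Because everything goes into one prompt, this approach is simple but limited by the model's context window, so it suits a small number of short passages like the ones retrieved here.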

We can also include the sources of information that the LLM is using to answer our question. We can do this using a slightly different version of `RetrievalQA` called `RetrievalQAWithSourcesChain`:

In [63]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [64]:
qa_with_sources(query)

{'question': 'who was Benito Mussolini?',
 'answer': 'Benito Mussolini was the leader of Fascism in Italy and served as the head of government and dictator from 1922. He was known as "Il Duce" and was an ally of Adolf Hitler. Italy, under Mussolini\'s rule, joined Germany and Japan as the Axis Powers in World War II. Mussolini was removed from power in 1943 and executed by Italian partisans in 1945. \n',
 'sources': 'https://simple.wikipedia.org/wiki/Italy'}

Now we get an answer to the question *and* the source of the information the LLM used.
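
One way a chain like this can return sources separately is to ask the model to finish its reply with a marked list of sources, then split the raw output at that marker. The sketch below shows only the splitting step. The `SOURCES:` marker, the function name `split_answer_and_sources`, and the shortened raw output are assumptions for illustration, not a description of LangChain's exact behavior.

```python
def split_answer_and_sources(llm_output, marker="SOURCES:"):
    """Split raw model output into (answer, sources) at the first marker, if present."""
    if marker in llm_output:
        answer, sources = llm_output.split(marker, 1)
        return answer, sources.strip()
    return llm_output, ""  # no marker found: treat everything as the answer


# hypothetical raw model output, shortened for the example
raw = (
    "Benito Mussolini was the leader of Fascism in Italy. \n"
    "SOURCES: https://simple.wikipedia.org/wiki/Italy"
)
answer, sources = split_answer_and_sources(raw)
print(repr(answer))
print(sources)
```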

Once done, we can delete the index to save resources.

In [None]:
# pinecone.delete_index(index_name)

---