[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb) [![Open nbviewer](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/nbviewer-shield.svg)](https://nbviewer.org/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb)

#### Code is borrowed from LangChain Handbook, adapted to fit in the presently working schema
August 17, 2023

Adding pydantic validation to appease the LLM Gods by doing due diligence through data governance and sufficient [chicken sacrifices](https://www.linkedin.com/feed/update/urn:li:activity:7092904219103432704?updateEntityUrn=urn%3Ali%3Afs_feedUpdate%3A%28V2%2Curn%3Ali%3Aactivity%3A7092904219103432704%29)

#### [LangChain Handbook](https://pinecone.io/learn/langchain)

# Retrieval Augmentation

**L**arge **L**anguage **M**odels (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

The world of LLMs is frozen in time. Their world exists as a static snapshot of the world as it was within their training data.

A solution to this problem is *retrieval augmentation*. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.


<!--Nothing actually in these notebooks links, it's the same notebook [![Open full notebook](https://raw.githubusercontent.com/pinecone-io/examples/master/assets/full-link.svg)](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb) -->

To begin, we must install the prerequisite libraries that we will be using in this notebook.

In [None]:
!pip install -qU \
  langchain==0.0.162 \
  openai==0.27.7 \
  tiktoken==0.4.0 \
  "pinecone-client[grpc]"==2.2.1 \
  pinecone_datasets=='0.5.0rc10'

---

🚨 _Note: the above `pip install` is formatted for Jupyter notebooks. If running elsewhere you may need to drop the `!`._

---

## Building the Knowledge Base

We will download a pre-embedded dataset from `pinecone-datasets`, which lets us skip the embedding and preprocessing steps. If you'd rather work through those steps, you can find the [full notebook here](https://colab.research.google.com/github/pinecone-io/examples/blob/master/docs/langchain-retrieval-augmentation.ipynb).

In [None]:
import pinecone_datasets

dataset = pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002-100K')
dataset.head()

In [None]:
print(type(dataset))
print(len(dataset))

We'll format the dataset ready for upsert and reduce what we use to a subset of the full dataset.

In [None]:
#### THE API REQUIRES THE METADATA FIELD TO BE POPULATED
# 1. Convert the Dataset object to a DataFrame to be able to 
#  a. test with pydantic
#   1. ensure we have blob and metadata columns (to appease pinecone today)
#   2. ensure there is a valid https
df = dataset.documents

# 2. Modify the DataFrame
# Copy the 'blob' column into 'metadata' (only if 'blob' exists), so the metadata field is populated
if 'blob' in df.columns:
    df['metadata'] = df['blob']
df.head()

In [None]:
# Conduct data and datatype exploration
# Check the type of the first few entries
types_of_values = df['values'].apply(type).value_counts()

# Check if there are any non-list type entries
non_list_values = df[df['values'].apply(lambda x: not isinstance(x, list))]['values']

types_of_values, non_list_values.head()


In [None]:
# pydantic will ensure only rows with data (a valid 'blob') make it to pinecone
# for all rows with data, they must cite a valid https source
from pydantic import BaseModel, HttpUrl, validator, ValidationError
from typing import Optional, Union, Dict
import numpy as np

class Blob(BaseModel):
    chunk: int
    source: HttpUrl
    
    @validator('source', pre=True, always=True)
    def check_https_scheme(cls, v):
        '''Cybersecurity check-- ensure only secure websites are referenced'''
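        # with pre=True the raw value arrives here, so guard against non-string input
        if not isinstance(v, str) or not v.startswith("https://"):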
            raise ValueError("source URL must be HTTPS")
        return v

    @validator('chunk', pre=True, always=True)
    def check_chunk(cls, v):
        if v is None:
            raise ValueError("chunk field is missing")
        return v


class DatasetEntry(BaseModel):
    id: str
    values: np.ndarray
    sparse_values: Optional[Union[str, None]]
    metadata: Optional[Union[Dict, None]]
    blob: Blob

    @validator('blob', pre=True, always=True)
    def check_blob(cls, v):
        if v is None:
            raise ValueError("blob field is missing")
        return v

    @validator('values', pre=True, always=True)
    def check_values_type(cls, v):
        if not isinstance(v, np.ndarray):
            # validators should raise ValueError/TypeError; pydantic wraps them in a ValidationError
            raise TypeError("values must be a numpy array")
        return v
    
    class Config:
        arbitrary_types_allowed = True
        
for _, row in df.iterrows():
    entry = DatasetEntry(**row.to_dict())
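
The loop above stops at the first invalid row, because constructing `DatasetEntry` raises. If the goal is to let only valid rows through, one option is to collect the failures instead of stopping. The following is a minimal, standard-library sketch of that pattern, applying the same two checks to `blob` (an integer `chunk` and an HTTPS `source`). The helper names `check_blob` and `split_rows` are hypothetical, and the sample rows are made up for illustration rather than taken from the dataset.

```python
from urllib.parse import urlparse


def check_blob(blob):
    """Return a list of problems found in one blob dict (empty if valid)."""
    if not isinstance(blob, dict):
        return ["blob field is missing"]
    problems = []
    if not isinstance(blob.get("chunk"), int):
        problems.append("chunk field is missing or not an integer")
    source = blob.get("source")
    parsed = urlparse(source) if isinstance(source, str) else None
    if parsed is None or parsed.scheme != "https" or not parsed.netloc:
        problems.append("source URL must be HTTPS")
    return problems


def split_rows(rows):
    """Partition rows into (valid, rejected); rejected pairs a row id with its problems."""
    valid, rejected = [], []
    for row in rows:
        problems = check_blob(row.get("blob"))
        if problems:
            rejected.append((row.get("id"), problems))
        else:
            valid.append(row)
    return valid, rejected


# Made-up sample rows, not taken from the dataset
sample_rows = [
    {"id": "a", "blob": {"chunk": 0, "source": "https://simple.wikipedia.org/wiki/Italy"}},
    {"id": "b", "blob": {"chunk": 1, "source": "http://example.com/page"}},
    {"id": "c", "blob": None},
]
valid_rows, rejected_rows = split_rows(sample_rows)
print(len(valid_rows), rejected_rows)
```

With a DataFrame, the same idea would apply to `df.to_dict("records")`, keeping only the rows that end up in the valid list.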



In [None]:
import os
#### THE PINECONE OBJECT REQUIRES THE PARQUET FILE TO BE IN DIRECTORY '*/DOCUMENTS'
# Create the documents directory if it doesn't exist,
# so the DataFrame can be loaded back as a pinecone Dataset object
documents_dir = "../data/processed/documents"
os.makedirs(documents_dir, exist_ok=True)

# Save the DataFrame as a Parquet file inside the documents directory
parquet_file_inside_documents = os.path.join(documents_dir, "parquet_df_with_metadata.parquet")
df.to_parquet(parquet_file_inside_documents)


In [None]:
#### MAKE SURE META DATA IS POPULATED BY USING WHAT WAS IN BLOB
### THIS NEEDS TO BE A PINECONE OBJECT
new_dataset = pinecone_datasets.dataset.Dataset.from_path("../data/processed/")
print(new_dataset.head())
type(new_dataset)

Now we move on to initializing our Pinecone vector database.

## Vector Database

To create our vector database we first need a [free API key from Pinecone](https://app.pinecone.io). Then we initialize like so:

In [1]:
index_name = 'langchain-retrieval-augmentation-fast'
indexname = index_name

In [2]:
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())
import os
import pinecone

# connect to pinecone environment
pinecone.init(
    api_key=os.getenv('PINECONE_API_KEY'),  
    environment=os.getenv('PINECONE_ENV') 
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='cosine',
        dimension=1536,  # 1536 dim of text-embedding-ada-002
    )



Then we connect to the new index:

In [3]:
from dotenv import load_dotenv,find_dotenv
load_dotenv(find_dotenv())
import os
import pinecone
import time
# allows notebook to work
import requests
from requests.packages.urllib3.util.ssl_ import create_urllib3_context

CIPHERS = (
    'ECDHE+AESGCM:ECDHE+CHACHA20:DHE+AESGCM:DHE+CHACHA20:ECDH+AESGCM:ECDH+CHACHA20:DH+AESGCM:DH+CHACHA20:'
    'ECDHE+AES:!aNULL:!eNULL:!EXPORT:!DES:!MD5:!PSK:!RC4:!HMAC_SHA1:!SHA1:!DHE+AES:!ECDH+AES:!DH+AES'
)

requests.packages.urllib3.util.ssl_.DEFAULT_CIPHERS = CIPHERS
# Skip the following two lines if they cause errors
# requests.packages.urllib3.contrib.pyopenssl.DEFAULT_SSL_CIPHER_LIST = CIPHERS
# requests.packages.urllib3.contrib.pyopenssl.inject_into_urllib3()
requests.packages.urllib3.util.ssl_.create_default_context = create_urllib3_context

index = pinecone.GRPCIndex(index_name)
# wait a moment for the index to be fully initialized
time.sleep(20)

index.describe_index_stats()

{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 2000}},
 'total_vector_count': 2000}
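
The fixed `time.sleep(20)` above is a guess at how long initialization takes. An alternative is to poll until the index reports that it is ready. Below is a minimal sketch of a generic polling helper; `wait_until` is a hypothetical name, and the commented Pinecone usage assumes that `pinecone.describe_index(...).status["ready"]` is available in the installed client version.

```python
import time


def wait_until(predicate, timeout=60.0, interval=1.0):
    """Poll predicate() until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Assumed usage with the pinecone client (not run here):
# ready = wait_until(lambda: pinecone.describe_index(index_name).status["ready"])

# Stand-in predicate that becomes true on the third call
attempts = {"count": 0}


def ready_after_three_calls():
    attempts["count"] += 1
    return attempts["count"] >= 3


print(wait_until(ready_after_three_calls, timeout=5.0, interval=0.01))
```

The helper returns `False` on timeout instead of raising, so the caller decides whether a slow index is an error.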

In [None]:
# for batch in new_dataset.iter_documents(batch_size=100):
#     index.upsert(batch)
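
The commented-out loop relies on `iter_documents` to split the dataset into batches of 100 before each `upsert` call. The chunking idea itself can be sketched in plain Python; `iter_batches` is a hypothetical helper and the records are placeholders, not real embeddings.

```python
def iter_batches(records, batch_size=100):
    """Yield consecutive slices of `records`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Placeholder records standing in for (id, values, metadata) tuples
records = [(f"id-{i}", [0.0, 0.0], {"chunk": i}) for i in range(250)]
batch_sizes = [len(batch) for batch in iter_batches(records, batch_size=100)]
print(batch_sizes)  # [100, 100, 50]
```

Sending bounded batches keeps each request small, which is why the original loop does not upsert the whole dataset in one call.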

## Creating a Vector Store and Querying

Next, we initialize a LangChain vector store using the index we just built. For this we also need a LangChain embedding object, which we initialize like so:

In [4]:
from langchain.embeddings.openai import OpenAIEmbeddings

# get openai api key from platform.openai.com
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'OPENAI_API_KEY'

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
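
Note that the `or 'OPENAI_API_KEY'` fallback above leaves a placeholder string in place when the variable is unset, so the problem only surfaces at the first API call. A small helper that fails earlier might look like the sketch below; `require_env` is a hypothetical name, and the demonstration uses a made-up variable so that real keys are untouched.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"environment variable {name} is not set")
    return value


# Demonstration with a made-up variable name
os.environ["DEMO_API_KEY"] = "not-a-real-key"
print(require_env("DEMO_API_KEY"))
```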

Now initialize the vector store:

In [5]:
from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

Now we can query the vector store directly using `vectorstore.similarity_search`:

In [10]:
# query = "who was Benito Mussolini?"
query = "How did Benito Mussolini affect Italy?"

vectorstore.similarity_search(
    query,  # our search query
    k=10  # return the 10 most relevant docs
)

[Document(page_content='Veneto was made part of Italy in 1866 after a war with Austria. Italian soldiers won Latium in 1870. That was when they took away the Pope\'s power. The Pope, who was angry, said that he was a prisoner to keep Catholic people from being active in politics. That was the year of Italian unification.\n\nItaly participated in World War I. It was an ally of Great Britain, France, and Russia against the Central Powers. Almost all of Italy\'s fighting was on the Eastern border, near Austria. After the "Caporetto defeat", Italy thought they would lose the war. But, in 1918, the Central Powers surrendered. Italy gained the Trentino-South Tyrol, which once was owned by Austria.\n\nFascist Italy \nIn 1922, a new Italian government started. It was ruled by Benito Mussolini, the leader of Fascism in Italy. He became head of government and dictator, calling himself "Il Duce" (which means "leader" in Italian). He became friends with German dictator Adolf Hitler. Germany, Japan
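
Conceptually, a similarity search embeds the query and returns the `k` stored vectors closest to it under the index's metric, which is cosine similarity for the index created earlier. The toy sketch below shows that ranking step in plain Python. The three-dimensional vectors and document titles are invented for illustration, and a real vector database uses approximate search rather than this exhaustive scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vector, documents, k=3):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented toy embeddings, far smaller than the 1536 dimensions used in the notebook
toy_documents = [
    ("Fascist Italy", [0.9, 0.1, 0.0]),
    ("Italian cuisine", [0.1, 0.9, 0.0]),
    ("World War II", [0.7, 0.0, 0.3]),
]
toy_query = [1.0, 0.0, 0.0]
results = top_k(toy_query, toy_documents, k=2)
print([text for _, text in results])
```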

All of these are good, relevant results. But what can we do with them? There are many possible tasks; one of the most interesting (and well supported by LangChain) is called _"Generative Question-Answering"_ or GQA.

## Generative Question-Answering

In GQA we take the query as a question to be answered by an LLM, but the LLM must base its answer on the information returned from the `vectorstore`.

To do this we initialize a `RetrievalQA` object like so:

In [11]:
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
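
`chain_type="stuff"` means that all retrieved documents are placed ("stuffed") into a single prompt alongside the question. The sketch below illustrates only that idea: the template wording and the `build_stuff_prompt` helper are assumptions made for illustration, not LangChain's actual prompt, and the two context snippets are shortened from the search output shown earlier.

```python
def build_stuff_prompt(question, documents):
    """Join all retrieved texts into one context block and append the question."""
    context = "\n\n".join(doc.strip() for doc in documents)
    return (
        "Use the following context to answer the question. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Snippets shortened from the similarity search output
retrieved = [
    "In 1922, a new Italian government started. It was ruled by Benito Mussolini, "
    "the leader of Fascism in Italy.",
    "Italy participated in World War I.",
]
prompt = build_stuff_prompt("How did Benito Mussolini affect Italy?", retrieved)
print(prompt)
```

Because everything goes into one prompt, this approach only works while the retrieved text fits within the model's context window.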

In [12]:
qa.run(query)

"Benito Mussolini had a significant impact on Italy during his rule as the leader of Fascist Italy from 1922 to 1943. Here are some ways in which Mussolini affected Italy:\n\n1. Rise of Fascism: Mussolini was the founder of Italian Fascism, a political ideology that emphasized authoritarian rule, nationalism, and the glorification of the state. Under his leadership, the Fascist Party gained power and established a totalitarian regime in Italy.\n\n2. Dictatorship and One-Party Rule: Mussolini became the head of government and established a dictatorship, concentrating power in his hands. He suppressed political opposition, dissolved other political parties, and established the National Fascist Party as the only legal political party in Italy.\n\n3. Economic Policies: Mussolini implemented various economic policies, including state control of industries, public works projects, and autarky (economic self-sufficiency). These policies aimed to strengthen the Italian economy and promote natio

We can also include the sources of information that the LLM is using to answer our question. We can do this using a slightly different version of `RetrievalQA` called `RetrievalQAWithSourcesChain`:

In [13]:
from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

In [14]:
qa_with_sources(query)

{'question': 'How did Benito Mussolini affect Italy?',
 'answer': "Benito Mussolini affected Italy by ruling as the leader of Fascism and establishing a dictatorship. He formed an alliance with Adolf Hitler and led Italy to join the Axis Powers in World War II. Mussolini's rule ended in 1943 when he was removed from power and Italy switched sides to fight against Germany. After World War II, Italy became a republic and joined NATO and the European Community. Mussolini's impact on Italy was significant, as he played a major role in shaping the country's political and military history during the early 20th century.\n",
 'sources': 'https://simple.wikipedia.org/wiki/Italy'}
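
In the output above, `sources` is a single string. When several documents contribute to an answer, it may hold more than one URL. The sketch below turns it into a list, assuming the URLs are separated by commas; that separator is an assumption about how the model formats its answer, and it would break on URLs that themselves contain commas. `split_sources` is a hypothetical helper.

```python
def split_sources(result):
    """Split the 'sources' string of a result dict into a list of non-empty entries."""
    raw = result.get("sources", "")
    return [part.strip() for part in raw.split(",") if part.strip()]


# Only the keys needed for the demonstration are reproduced here
example_result = {
    "question": "How did Benito Mussolini affect Italy?",
    "sources": "https://simple.wikipedia.org/wiki/Italy",
}
print(split_sources(example_result))
```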

---