# End-to-End RAG Tutorial Using Source-File, PyAirbyte, Pinecone, and LangChain

This Python notebook demonstrates an end-to-end process for loading data from a file source into a Pinecone vector store using PyAirbyte. It also includes a short demo that uses LangChain and an OpenAI chat model to answer questions about the data stored in Pinecone through RAG (Retrieval-Augmented Generation).

The Source File connector can read data from several storage providers, which are listed [here](https://docs.airbyte.com/integrations/sources/file#step-2-select-the-provider-and-set-provider-specific-configurations).

In this article we use S3 as the storage provider, but you can set up your own (see the [reference](https://docs.airbyte.com/integrations/sources/file#reference)).

## Prerequisites

1. **Storage Provider Account**: Visit the [docs](https://docs.airbyte.com/integrations/sources/file#step-2-select-the-provider-and-set-provider-specific-configurations) to obtain the information needed to set up the source.

2. **Pinecone Account**: Sign up on [Pinecone](https://www.pinecone.io/) and generate an API key in your project settings as per the [documentation](https://docs.pinecone.io/guides/get-started/quickstart).

3. **OpenAI API Key**: Create an account at [OpenAI](https://openai.com/api/), then generate an API key in the API section following the [documentation](https://platform.openai.com/docs/quickstart).

## Install PyAirbyte and other dependencies

In [None]:
# Add virtual environment support for running in Google Colab
!apt-get install -qq python3.10-venv

# First, we need to install the necessary libraries.
!pip3 install airbyte openai langchain pinecone-client langchain-openai langchain-pinecone python-dotenv langchainhub

## Setup Source File with PyAirbyte

Here we set up the file source with AWS S3 storage. Check the connector [spec](https://github.com/airbytehq/airbyte/blob/master/airbyte-integrations/connectors/source-file/source_file/spec.json) for the key-value pairs to pass in the `config` argument of `ab.get_source`.

Note: The credentials are retrieved securely using `ab.get_secret()`. It automatically locates a matching Google Colab secret or environment variable, so the values are never hard-coded into the notebook. If you run this in Colab, make sure to add your keys to the Secrets section in the left sidebar.
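
When running outside Colab, it can help to confirm that the expected environment variables exist before building the source. The following is a minimal sketch, not part of PyAirbyte: `missing_secrets` and `REQUIRED_SECRETS` are hypothetical names introduced here, and the check only covers environment variables, not Colab secrets.

```python
import os

# Secret names that the cells in this notebook request via ab.get_secret().
REQUIRED_SECRETS = [
    "URL",
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "OPENAI_API_KEY",
    "PINECONE_API_KEY",
]


def missing_secrets(names, env=None):
    """Hypothetical helper: return the names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_secrets(REQUIRED_SECRETS)
if missing:
    print("Not set as environment variables:", ", ".join(missing))
else:
    print("All secrets are available as environment variables.")
```

A non-empty list here is only a hint, since the values may still be supplied through another mechanism such as Colab secrets.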


In [None]:
import airbyte as ab

source = ab.get_source(
    "source-file",
    config={
        "dataset_name": "products", # The Name of the final table to replicate this file into (should include letters, numbers dash and underscores only).
        "format": "csv", # Supported file format: https://docs.airbyte.com/integrations/sources/file#file-formats
        "url": ab.get_secret("URL"), # The URL path to access the file which should be replicated.
        "provider":{
            "storage": "S3", # Change the storage provider
            "aws_access_key_id": ab.get_secret("AWS_ACCESS_KEY"),
            "aws_secret_access_key": ab.get_secret("AWS_SECRET_KEY"),
        }
    }
)
source.check()

The next cell reads every stream from the source and converts the records into documents that can be chunked and embedded in the following steps.

In [None]:
source.select_all_streams() # Select all streams
read_result = source.read() # Read the data
product = [doc for value in read_result.values() for doc in value.to_documents()]

print(str(product[3]))

## Use LangChain to build a RAG pipeline

The code uses `RecursiveCharacterTextSplitter` to break documents into chunks of at most 512 characters with a 50-character overlap. Each metadata value is then converted to a string so that every value has a type Pinecone accepts as metadata. Smaller chunks keep each embedding focused on one passage, which helps retrieval return relevant text.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunked_docs = splitter.split_documents(product)
print(f"Created {len(chunked_docs)} document chunks.")

for doc in chunked_docs:
    for md in doc.metadata:
        doc.metadata[md] = str(doc.metadata[md])
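
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified illustration and not the algorithm `RecursiveCharacterTextSplitter` uses, which prefers to split on separators such as paragraph and line breaks. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
for chunk in chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3):
    print(chunk)
# prints:
# abcdefghij
# hijklmnopq
# opqrstuvwx
# vwxyz
```

Each chunk starts with the last three characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.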

In [26]:
from langchain_openai import OpenAIEmbeddings
import os

# Use OpenAI embeddings; the API key is read from the environment
os.environ['OPENAI_API_KEY'] = ab.get_secret("OPENAI_API_KEY")
embeddings = OpenAIEmbeddings()


## Setting up Pinecone

Pinecone is a managed vector database service designed for storing, indexing, and querying high-dimensional vector data efficiently.


In [None]:
from pinecone import Pinecone, ServerlessSpec
import os

os.environ['PINECONE_API_KEY'] = ab.get_secret("PINECONE_API_KEY")
pc = Pinecone()
index_name = "productindex" # Replace with your index name


# # Uncomment this if you have not created pinecone index already
# spec = ServerlessSpec(cloud="aws", region="us-east-1") #Replace with your cloud and region
# pc.create_index(
#         name=index_name,
#         dimension=1536,
#         metric='cosine',
#         spec=spec
# )

index = pc.Index(index_name)

index.describe_index_stats()
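
The `metric='cosine'` setting in the index definition means that vectors are compared by the angle between them rather than by their length. The sketch below computes cosine similarity in plain Python to show what that score measures. It is an illustration only and says nothing about how Pinecone computes or indexes it internally. The two-dimensional vectors are made-up toy values, whereas the index above expects 1536 dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        # Mirrors the requirement that query and stored vectors share one dimension.
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy example: rank two stored vectors against a query vector.
query = [1.0, 0.0]
stored = {"a": [0.9, 0.1], "b": [0.0, 1.0]}
ranked = sorted(stored, key=lambda key: cosine_similarity(query, stored[key]), reverse=True)
print(ranked)
```

A score near 1 means the vectors point in nearly the same direction, and a score of 0 means they are orthogonal.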

`PineconeVectorStore` is the class from the `langchain-pinecone` package for working with Pinecone indexes. Its `from_documents` method embeds the provided documents and upserts the resulting vectors into the index.

In [31]:
from langchain_pinecone import PineconeVectorStore

pinecone = PineconeVectorStore.from_documents(
    chunked_docs, embeddings, index_name=index_name
)
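
Because an upsert overwrites a vector only when its ID already exists, re-running the cell above may store the same chunks again if each run generates fresh IDs. One way to avoid that is to derive an ID from the chunk itself. The following is a minimal sketch under that assumption. `stable_chunk_id` is a hypothetical helper, not part of LangChain or Pinecone.

```python
import hashlib
import json


def stable_chunk_id(page_content, metadata):
    """Hypothetical helper: derive a deterministic ID from a chunk's text and metadata."""
    payload = json.dumps(
        {"text": page_content, "metadata": metadata},
        sort_keys=True,       # key order must not change the ID
        ensure_ascii=False,
        default=str,          # fall back to str() for non-JSON values
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


first = stable_chunk_id("some chunk text", {"stream": "products", "row": "3"})
second = stable_chunk_id("some chunk text", {"row": "3", "stream": "products"})
print(first == second)
```

If the installed `langchain-pinecone` version accepts an `ids` argument in `from_documents` (an assumption worth verifying against its documentation), a list built with `[stable_chunk_id(d.page_content, d.metadata) for d in chunked_docs]` could be passed there so that repeated runs overwrite existing vectors.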

Next, we set up a RAG pipeline with LangChain that combines document retrieval from Pinecone, a prompt template, and an OpenAI chat model for response generation.

In [None]:
from langchain_openai import ChatOpenAI
from langchain import hub
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retriever = pinecone.as_retriever()
prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)


def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
print("LangChain RAG pipeline set up successfully.")
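
To see how data moves through the chain, the sketch below reproduces the same flow with plain Python functions. It does not use LangChain. `fake_retriever`, `fake_prompt`, `fake_llm`, and the `Doc` class are hypothetical stand-ins with made-up content, so it runs without any API key.

```python
from dataclasses import dataclass


@dataclass
class Doc:
    """Stand-in for a retrieved document; only page_content is needed here."""
    page_content: str


def format_docs(docs):
    # Same helper as in the notebook: join chunk texts with blank lines.
    return "\n\n".join(doc.page_content for doc in docs)


def fake_retriever(question):
    # Stand-in for the Pinecone retriever; returns fixed, made-up chunks.
    return [Doc("Product A costs 10 USD."), Doc("Product B costs 20 USD.")]


def fake_prompt(inputs):
    # Stand-in for the prompt template: fills context and question into text.
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    # Stand-in for the chat model; a real model would generate an answer.
    return f"[stub model response to a prompt of {len(prompt_text)} characters]"


def run_chain(question):
    # Step 1: the question feeds both branches of the input dictionary.
    inputs = {
        "context": format_docs(fake_retriever(question)),
        "question": question,
    }
    # Steps 2 and 3: build the prompt, then call the model.
    return fake_llm(fake_prompt(inputs))


print(run_chain("How much is Product A?"))
```

The dictionary in `run_chain` plays the role of the first stage in `rag_chain`: the question passes through unchanged, while the context is produced by retrieving documents and formatting them.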

In [None]:
print(rag_chain.invoke("What is the cost of Canon Photo Ink Cartridge - CL52?"))