<a href="https://colab.research.google.com/github/ashutosh3060/Agentic-RAG/blob/main/Overview_of_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

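* Purpose of the notebook: a basic RAG (retrieval-augmented generation) setup that indexes a web page in Pinecone, retrieves the relevant chunks, and generates an answer with an OpenAI chat model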

## Table of Contents
### 0. Pre-requisites
### 1. Overview of all the components of RAG
### 2. Indexing
### 3. Retrieval
### 4. Generation

### 0. Pre-requisites

Option 1: Install all the dependent libraries inside a virtual environment

* Install the virtualenv library
* Create a virtual environment
* Activate the environment
* Install the required libraries from the requirements.txt file

Option 2: Install the dependent libraries in the global Python environment


#### Option 1: Install all the dependent libraries inside a virtual environment

In [1]:
!pip install virtualenv
!virtualenv "rag_venv" # Create the environment in /content/rag_venv

# Note: each `!` command runs in its own subshell, so this activation does not persist
# to later cells, and it does not install anything by itself.
!source /content/rag_venv/bin/activate


created virtual environment CPython3.11.11.final.0-64 in 573ms
  creator CPython3Posix(dest=/content/rag_venv, clear=False, no_vcs_ignore=False, global=False)
  seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/root/.local/share/virtualenv)
    added seed packages: pip==25.0.1, setuptools==75.8.2, wheel==0.45.1
  activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator


In [2]:
# !source /content/rag_venv/bin/activate; pip3 list

In [3]:
# ! which python

In [4]:
# Install all packages
! pip install -r /content/requirements.txt --quiet
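
Because each `!` shell command runs in its own subprocess, activating an environment in one cell does not change the interpreter that the notebook kernel uses. The following is a minimal sketch, using only the standard library, for checking which interpreter is actually running. `in_virtualenv` is a hypothetical helper name, not part of the original notebook.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv/virtualenv, sys.prefix points at the environment while
    # sys.base_prefix keeps pointing at the base Python installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


print("interpreter:", sys.executable)
print("inside a virtual environment:", in_virtualenv())
```

If this prints `False`, packages installed with `! pip install` go into the kernel's own environment, which is effectively Option 2.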

#### Option 2: Install the dependent libraries in the global Python environment

In [5]:
# ! pip install --quiet pinecone-client python-dotenv langchain langchain-community langchain-core langchain-openai langchain-pinecone beautifulsoup4 tiktoken numpy


### Environment

(1) Environment variables and API keys

In [6]:
import os
from dotenv import load_dotenv

# Load all environment variables from .env file
load_dotenv()

# Access the environment variables (Langchain)
langchain_tracing_v2 = os.getenv('LANGCHAIN_TRACING_V2')
langchain_endpoint = os.getenv('LANGCHAIN_ENDPOINT')
langchain_api_key = os.getenv('LANGCHAIN_API_KEY')

# LLM
openai_api_key = os.getenv('OPENAI_API_KEY')

# Pinecone Vector Database
pinecone_api_key = os.getenv('PINECONE_API_KEY')
pinecone_api_host = os.getenv('PINECONE_API_HOST')
index_name = os.getenv('PINECONE_INDEX_NAME')

In [7]:
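# load_dotenv() has already exported these variables, so re-assigning them is redundant.
# Note that assigning None to os.environ raises TypeError if a variable is missing from .env.
os.environ['LANGCHAIN_TRACING_V2'] = langchain_tracing_v2
os.environ['LANGCHAIN_ENDPOINT'] = langchain_endpoint
os.environ['LANGCHAIN_API_KEY'] = langchain_api_key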

In [8]:
os.environ['OPENAI_API_KEY'] = openai_api_key

# Pinecone keys
os.environ['PINECONE_API_KEY'] = pinecone_api_key
os.environ['PINECONE_API_HOST'] = pinecone_api_host
os.environ['PINECONE_INDEX_NAME'] = index_name
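
`os.getenv` returns `None` for a variable that is not defined, and assigning `None` into `os.environ` raises a `TypeError` that does not say which variable was missing. The sketch below shows one way to fail early with a clearer message. `require_env` is a hypothetical helper, and the example uses a made-up dictionary instead of the real environment so that it runs anywhere.

```python
import os


def require_env(names, env=None):
    """Return {name: value} for all names, or raise listing the missing ones."""
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Illustration with placeholder values (not real keys)
example_env = {"OPENAI_API_KEY": "sk-example", "PINECONE_API_KEY": ""}
try:
    require_env(["OPENAI_API_KEY", "PINECONE_API_KEY"], env=example_env)
except RuntimeError as err:
    print(err)
```

In the notebook itself, the call would be `require_env([...])` without the `env` argument, placed right after `load_dotenv()`.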

In [9]:
index_name

'ragtest'

In [10]:
print(langchain_tracing_v2)
print(langchain_endpoint)
# Never print a full API key: notebook outputs are saved and shared with the file
print(f"{(langchain_api_key or '')[:8]}...")

true
https://api.smith.langchain.com
lsv2_pt_...
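
Printing an API key in a notebook stores it in the saved output, where anyone with access to the file can read it. A minimal sketch of a masking helper for confirming that a key was loaded without revealing it; `mask_secret` is a hypothetical name and the sample value is a placeholder.

```python
def mask_secret(value, visible=4):
    """Return a display-safe version of a secret: a short prefix, then asterisks."""
    if not value:
        return "<not set>"
    if len(value) <= visible:
        # Too short to show any prefix safely
        return "*" * len(value)
    return value[:visible] + "*" * (len(value) - visible)


print(mask_secret("sk-example-key-123"))
print(mask_secret(None))
```

A key that has already been printed or committed should be treated as compromised and rotated.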


Pinecone Init

In [11]:
import time
from pinecone import Pinecone
from pinecone import ServerlessSpec

pc = Pinecone(api_key=os.environ.get("PINECONE_API_KEY"))
index_name = "ragtest"  # change if desired

existing_indexes = [index_info["name"] for index_info in pc.list_indexes()]

if index_name not in existing_indexes:
  pc.create_index(
      name=index_name,
      dimension=3072,  # must match the embedding model; text-embedding-3-large returns 3072-dimensional vectors
      metric="cosine",
      spec=ServerlessSpec(cloud="aws", region="us-east-1"),
  )
  # Wait until the new index is ready to accept requests
  while not pc.describe_index(index_name).status["ready"]:
    time.sleep(1)

index = pc.Index(index_name)
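
A Pinecone index only accepts vectors whose length equals the index dimension, so the `dimension` passed to `create_index` has to match the embedding model used later in the notebook. The sketch below records the default output sizes of three OpenAI embedding models and checks a planned index dimension against them. `EMBEDDING_DIMENSIONS` and `check_dimension` are hypothetical names introduced for illustration.

```python
# Default output dimensions of OpenAI embedding models
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "text-embedding-ada-002": 1536,
}


def check_dimension(model, index_dimension):
    """Raise ValueError if the index dimension does not fit the embedding model."""
    expected = EMBEDDING_DIMENSIONS[model]
    if expected != index_dimension:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
    return expected


print(check_dimension("text-embedding-3-large", 3072))
```

Running the check before creating the index is cheaper than discovering the mismatch when the first upsert is rejected.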

### 1. Overview of all the components of RAG

In [12]:
# Import all the libraries

# Pretty-printing of long outputs
from pprint import pprint
# Beautiful Soup parses HTML; used by the optional SoupStrainer filter in the loader below
import bs4

# LangChain libraries
# LangChain Hub: a shared collection of artifacts such as prompts, chains and agents
from langchain import hub
# Splits a document into chunks that can be used for retrieval
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Loads documents from web pages
from langchain_community.document_loaders import WebBaseLoader
# Pinecone vector store integration
# from langchain_community.vectorstores import Pinecone
from langchain_pinecone import PineconeVectorStore
# Converts the chat model's output message into a plain string
from langchain_core.output_parsers import StrOutputParser
# Passes inputs through unchanged or with additional keys
from langchain_core.runnables import RunnablePassthrough
# OpenAI chat model and embeddings
from langchain_openai import ChatOpenAI, OpenAIEmbeddings




* Indexing

In [30]:
# Load Documents

# loader = WebBaseLoader(
#     web_paths = ("https://greenly.earth/en-gb/blog/ecology-news/climate-change-in-2022-where-do-we-stand",),
#     bs_kwargs = dict(
#         parse_only=bs4.SoupStrainer(
#             class_=("post-content", "post-title", "post-header")
#             )
#         )
# )

loader = WebBaseLoader("https://greenly.earth/en-gb/blog/ecology-news/climate-change-in-2022-where-do-we-stand")

docs = loader.load()

In [31]:
# Split the page into chunks of at most 1000 characters, with 200 characters of overlap between neighbours

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)
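
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter in plain Python. It is not how `RecursiveCharacterTextSplitter` works internally (that class prefers to break on separators such as paragraphs, lines and words), so real chunks will differ. `split_with_overlap` is a hypothetical name used only for this illustration.

```python
def split_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of storing some text twice.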

In [32]:
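# Embed the chunks and store the vectors in the Pinecone index

vectorstore = PineconeVectorStore.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings(model="text-embedding-3-large"),
    index_name=index_name
)

# Expose the vector store as a retriever for similarity search
retriever = vectorstore.as_retriever()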
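
Conceptually, retrieval embeds the query and returns the stored chunks whose vectors are most similar to it; the index above uses the cosine metric. The toy sketch below shows that ranking step in plain Python with made-up three-dimensional vectors. Pinecone uses approximate search over much larger vectors, so this is an illustration of the idea, not of its implementation, and `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vector, stored, k=2):
    """Return the texts of the k stored (text, vector) pairs most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in stored]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Made-up vectors standing in for real embeddings
stored = [
    ("chunk about temperature", [0.9, 0.1, 0.0]),
    ("chunk about sea level", [0.1, 0.9, 0.0]),
    ("chunk about policy", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], stored, k=2))
```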

* Retrieval & Generation

In [33]:
# Prompt

prompt = hub.pull("rlm/rag-prompt")
# prompt = hub.pull("rlm/rag-prompt:50442af1")  # Alternative: a specific version of the prompt (TODO: check whether it differs and compare the results)

In [34]:
# LLM

llm = ChatOpenAI(model_name="gpt-4o-mini", temperature=0.1) # OpenAI accepts temperatures from 0 to 2: values near 0 give focused, near-deterministic output, higher values give more random output
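
In the usual formulation of sampling, temperature divides the model's logits before the softmax, so a low temperature concentrates probability on the most likely token. The sketch below illustrates that effect with made-up logits. It is a textbook illustration, not a description of OpenAI's internal implementation, and `softmax_with_temperature` is a hypothetical helper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
for t in (0.1, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

At 0.1, nearly all the probability lands on the first token, which is why the low setting in this notebook gives answers that vary little between runs.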

In [35]:
# Post Processing
def format_docs(docs):
  '''
  Format the documents retrieved from the vector store into a single context string.
  The page content of each document is joined with a blank line (two newlines) between documents.
  '''
  return "\n\n".join(doc.page_content for doc in docs)
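
A quick way to see what `format_docs` produces, without calling the vector store, is to pass it stand-in objects that only carry a `page_content` attribute. `FakeDoc` is a hypothetical class for this illustration; the retriever returns LangChain `Document` objects, which expose the same attribute.

```python
from dataclasses import dataclass


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document; only page_content is needed here."""
    page_content: str


def format_docs(docs):
    # Same logic as in the notebook: join contents with a blank line in between
    return "\n\n".join(doc.page_content for doc in docs)


context = format_docs([FakeDoc("First chunk."), FakeDoc("Second chunk.")])
print(context)
```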

In [36]:
# Chain

# The input question is routed to two branches: the retriever, which fetches the relevant chunks from the vector store,
# and RunnablePassthrough, which forwards the question unchanged.
# format_docs post-processes the retrieved chunks into a single context string.
# The prompt template is then filled with the context and the question.
# The LLM receives the filled prompt and generates the answer.
# StrOutputParser converts the LLM's output message into a plain string, which is returned.

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
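
The data flow of the chain can be mirrored with ordinary function calls, which may help when debugging a single stage. Every function in the sketch below is a hypothetical stub (no retriever, prompt hub, or LLM is called), so it shows only the order of the steps and the shape of the data passed between them.

```python
def fake_retriever(question):
    # Stand-in for the retriever: returns fixed text chunks
    return ["Chunk one.", "Chunk two."]


def join_chunks(chunks):
    # Plays the role of format_docs
    return "\n\n".join(chunks)


def fill_prompt(inputs):
    # Plays the role of the prompt template
    return f"Context:\n{inputs['context']}\n\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    # Stand-in for the chat model: returns a message-like dict
    return {"content": f"(stub answer for a prompt of {len(prompt_text)} characters)"}


def parse_output(message):
    # Plays the role of StrOutputParser
    return message["content"]


def run_chain(question):
    # The question feeds both branches: retrieval and passthrough
    inputs = {
        "context": join_chunks(fake_retriever(question)),
        "question": question,
    }
    return parse_output(fake_llm(fill_prompt(inputs)))


print(run_chain("What is the outlook?"))
```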

In [None]:
# # Question
# pprint(rag_chain.invoke("How does LangChain use vector stores for efficient data retrieval?"))

In [37]:
# Question
pprint(rag_chain.invoke("worst projection for climate change in 2024"))

('The worst projection for climate change in 2024 indicates a nearly 50% '
 'chance that the average global temperature will rise above 1.5°C during the '
 'five-year period from 2022 to 2026. This follows record-breaking '
 'temperatures in 2023, suggesting that 2024 will continue to experience '
 'exceptionally warm conditions. Overall, the outlook for climate change in '
 '2024 is concerning, with expectations of worsening impacts.')


* End of Notebook