<a href="https://colab.research.google.com/github/cyalbino/WITCON2024/blob/main/%5BWiTCon_2024%5D_Building_Solutions_with_RAG_Code_Along.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ❗ **ACTION NEEDED:** Create your own COPY of this notebook

👉 This is a code-along version of the notebook, so it does not contain all the lines of code.

👉 **Your task is to fill in the missing lines.** Have fun!

🆘 **NEED HELP?** If you get lost, you can check out the answer key: [the complete version of the notebook](https://colab.research.google.com/drive/1XmTDRis2kGHMyux3xgSnXWaxSSL3_cpw?usp=sharing)!

# 🌐 Retrieval Augmented Generation

**Retrieval Augmented Generation (RAG)** is a GenAI technique for applying large language models (LLMs) to **specific use cases**.

At a high level, RAG equips an LLM with custom data, typically found in documents, so that it can produce more accurate, relevant, and contextually rich responses than it could by relying only on its pre-trained knowledge.

## **🍃 Let's find a real-world problem that we can solve with RAG!**


📢 *Special thanks to Aiden Dai! This notebook is built upon his [LangChain RAG Tutorial](https://github.com/daixba/langchain-tutorials/blob/main/02-langchain-rag.ipynb)!*

# 🧠 The Problem

🦠 The COVID-19 pandemic has led to an overwhelming volume of scientific papers, making it difficult for health experts and policymakers to keep up with the latest research.

# 💡 The Solution

🤖 To make reviewing research papers faster, we can use RAG to build an LLM-based assistant that answers medical questions about a given research paper.

# 👨‍💻 The Implementation

## 0️⃣ Install libraries

In [None]:
! pip -qqq install langchain langchain-community langchain-openai langchain-text-splitters sentence-transformers faiss-cpu beautifulsoup4

## 1️⃣ Load a document that contains your custom data

For this specific use case, the document that we will load is a research paper related to COVID-19:
[Effects of COVID-19 on College Students’ Mental Health in the United States: Interview Survey Study](https://www.jmir.org/2020/9/e21279/).

Of course, you can load many other kinds of documents, such as PDFs, CSVs, etc. [Learn more here!](https://python.langchain.com/docs/modules/data_connection/document_loaders/)

In [None]:
from langchain_community.document_loaders import WebBaseLoader

# Find the URL of the web page
url = "<💡 ACTION NEEDED: Write down the URL of the web page>"

# Load a web page
web_loader = WebBaseLoader(url)
document = web_loader.load()

## 2️⃣ Split the document into chunks



Chunking breaks a long document into smaller, more manageable pieces called "chunks", so that only the pieces relevant to a question need to be retrieved and passed to the LLM.

> 💡 **What is a chunk?** A chunk serves as a useful unit of information that can be retrieved to answer specific questions about the text.
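
To make the idea concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not how LangChain's `RecursiveCharacterTextSplitter` works internally (that splitter also prefers to cut at separators such as `"\n\n"`); the function name `split_into_chunks` and the sample text are made up for illustration.

```python
def split_into_chunks(text, chunk_size=40, chunk_overlap=10):
    """Slide a window of `chunk_size` characters over `text`.

    Consecutive chunks share `chunk_overlap` characters, so a sentence cut
    at a chunk boundary still appears (partly) in the next chunk.
    """
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


# Hypothetical sample text, not taken from the research paper
text = (
    "Many students reported more stress. "
    "Some also reported changes in sleep. "
    "Others worried about their grades."
)

chunks = split_into_chunks(text, chunk_size=40, chunk_overlap=10)
for number, chunk in enumerate(chunks, start=1):
    print(number, repr(chunk))
```

A larger `chunk_size` gives each chunk more context but makes retrieval less precise; a non-zero `chunk_overlap` reduces the chance that an important sentence is split across two chunks.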

In [None]:
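from langchain_text_splitters import RecursiveCharacterTextSplitter

# Configure a splitter that breaks a long document into smaller chunks
text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", "."],  # Preferred places to split, tried in order
    chunk_size=512,                  # Maximum chunk length, measured by length_function
    chunk_overlap=0,                 # Number of characters shared by consecutive chunks
    length_function=len,
    is_separator_regex=False,
)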

In [None]:
# Split the documents into chunks
chunks = "<💡 ACTION NEEDED: Write down the function that splits the document into chunks>"

In [None]:
# Check how many chunks we have
# "<💡 ACTION NEEDED: Get the number of chunks>"

In [None]:
# Randomly preview a chunk of the document
# "<💡 ACTION NEEDED: Get any element from the list of chunks>"

## 3️⃣ Convert the chunks into vector embeddings

> 💡 **What are vector embeddings?** Vector embeddings are numerical representations of text (words, sentences, or whole chunks) that capture its meaning, so that computers can compare pieces of text by how similar they are in meaning.

There are many embedding models to choose from! You can find a lot of these on [Hugging Face](https://huggingface.co/models?other=embeddings). Even [OpenAI](https://platform.openai.com/docs/guides/embeddings) has its own embedding models, but keep in mind that these come with a price tag!

**For this use case, we choose an OpenAI embedding model called `text-embedding-ada-002` for the experience!**
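
To build some intuition for how embeddings let us compare texts, here is a toy sketch that uses word counts as "embeddings" and cosine similarity as the comparison. Real embedding models produce dense, learned vectors rather than word counts; `toy_embed`, the vocabulary, and the sample sentences are all hypothetical.

```python
import math
from collections import Counter


def toy_embed(text, vocabulary):
    """Represent a text as a list of counts, one per vocabulary word."""
    counts = Counter(text.lower().split())
    return [float(counts[word]) for word in vocabulary]


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


vocabulary = ["students", "stress", "sleep", "vaccine", "dose"]

stress_text = toy_embed("students report stress and poor sleep", vocabulary)
sleep_text = toy_embed("stress affects sleep", vocabulary)
vaccine_text = toy_embed("second vaccine dose", vocabulary)

# Texts about related topics score higher than texts about unrelated topics
print(cosine_similarity(stress_text, sleep_text))
print(cosine_similarity(stress_text, vaccine_text))
```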

In [None]:
import getpass
import os

# 💡 ACTION NEEDED: Run this cell and, when prompted, paste the OpenAI API key provided for this workshop (never hardcode or share API keys in a notebook)
os.environ["OPENAI_API_KEY"] = getpass.getpass()

In [None]:
from langchain_openai import OpenAIEmbeddings

# 💡 ACTION NEEDED: Type in the embedding model we will be using
# 🆘 HINT: You can see it above!
embeddings_model = OpenAIEmbeddings(
    model="💡 ACTION NEEDED: Type in the embedding model we will be using",
)

In [None]:
# Convert a sample phrase into an embedding
sample_phrase = "💡 ACTION NEEDED: Type in a sample phrase that we will convert into an embedding"

sample_embeddings = embeddings_model.embed_query(sample_phrase)
sample_embeddings[:10]

We will convert all the chunks into vector embeddings in the next step!

The function that creates the vector store takes care of this for us.

## 4️⃣ Store the vector embeddings in a vector store

> 💡 **What is a vector store?** A vector store is like a massive filing system where pieces of information (chunks) represented as lists of numbers (vector embeddings) are organized in a way that makes it easy for computers to search and retrieve relevant data quickly.

There are many vector stores supported by LangChain. [Learn more here!](https://python.langchain.com/docs/integrations/vectorstores/)

For this use case, we will use FAISS, which stands for Facebook AI Similarity Search.
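
As a rough mental model, a vector store keeps (vector, text) pairs and returns the text whose vector is closest to a query vector. The sketch below does this by brute force with Euclidean distance; FAISS uses optimized index structures instead, so treat `ToyVectorStore` and its two-dimensional vectors purely as a hypothetical illustration.

```python
import math


class ToyVectorStore:
    """An in-memory store that finds the nearest vector by checking every entry."""

    def __init__(self):
        self._entries = []  # list of (vector, text) pairs

    def add(self, vector, text):
        """Store a text together with its embedding vector."""
        self._entries.append((list(vector), text))

    def nearest(self, query_vector):
        """Return the stored text whose vector is closest to `query_vector`."""
        if not self._entries:
            raise ValueError("the store is empty")
        _, text = min(
            self._entries,
            key=lambda entry: math.dist(entry[0], query_vector),
        )
        return text


store = ToyVectorStore()
# Hypothetical 2-dimensional embeddings; real embeddings have many more dimensions
store.add([1.0, 0.0], "chunk about mental health")
store.add([0.0, 1.0], "chunk about survey methods")

print(store.nearest([0.9, 0.2]))
```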

In [None]:
from langchain_community.vectorstores import FAISS

# Convert the chunks into vector embeddings, and...
# Store the vector embeddings in a FAISS vector store
vector_store = FAISS.from_documents(
    "💡 ACTION NEEDED: Get the chunks",
    "💡 ACTION NEEDED: Get the embedding model",
)

## 5️⃣ Retrieve the chunks that can best answer a question

Given a question, we want to retrieve the chunks of information that can best answer it.

We can choose from [many retrieval approaches](https://python.langchain.com/docs/modules/data_connection/retrievers/vectorstore/).

For this use case, we will use an approach called `top k retrieval`, which simply retrieves the top k most relevant chunks of information given a question.
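
The selection step itself is simple: once every chunk has a similarity score for the question, keep the `k` highest-scoring ones. Here is a small sketch, assuming the scores have already been computed; the scores and chunk texts are made up, and `top_k` is a hypothetical helper rather than a LangChain function.

```python
import heapq


def top_k(scored_chunks, k):
    """Return the k (score, chunk) pairs with the highest scores, best first."""
    return heapq.nlargest(k, scored_chunks, key=lambda pair: pair[0])


# Hypothetical similarity scores between a question and four chunks
scored_chunks = [
    (0.12, "Funding statement"),
    (0.91, "Students reported increased stress"),
    (0.78, "Sleep patterns were disrupted"),
    (0.05, "Reference list"),
]

for score, chunk in top_k(scored_chunks, k=2):
    print(score, chunk)
```

A small `k` keeps the prompt short and focused, while a larger `k` gives the LLM more context at the cost of a longer (and more expensive) prompt.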

In [None]:
# Write a question about the document
question = "💡 ACTION NEEDED: Ask a question about the document"

# Instruct the retriever to get the top k chunks that seem most likely to answer our question
k = "💡 ACTION NEEDED: Select a number of chunks we want to retrieve"

# Create a retriever with the vector store
retriever = "💡 ACTION NEEDED: Create a retriever"

# Retrieve the chunks that can best answer a question
retrieved_chunks = "💡 ACTION NEEDED: Retrieve relevant chunks"
retrieved_chunks

## 6️⃣ Generate the answer using an LLM

Once we have retrieved the chunks that are most similar to the question, we can move on to the final step.

We combine the question and the retrieved chunks into a prompt that we give to our LLM.

Given the prompt, the LLM then produces an answer to our question that is grounded in the retrieved chunks.
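
In plain Python, combining the question and the retrieved chunks into a prompt is just string formatting. The sketch below shows one possible wording, not the only right one; `PROMPT_TEMPLATE`, `build_prompt`, and the sample chunks are hypothetical.

```python
# One possible template: task, context, question, and output indicator
PROMPT_TEMPLATE = """Answer the question using only the context below.

Context:
{context}

Question: {question}

Answer in one or two sentences:"""


def build_prompt(retrieved_chunks, question):
    """Join the retrieved chunks and insert them, with the question, into the template."""
    context = "\n\n".join(retrieved_chunks)
    return PROMPT_TEMPLATE.format(context=context, question=question)


# Hypothetical retrieved chunks, not quotes from the research paper
sample_chunks = [
    "Students reported increased stress.",
    "Sleep patterns were disrupted.",
]

sample_prompt = build_prompt(sample_chunks, "How were students affected?")
print(sample_prompt)
```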

> 💡 **What is a prompt?** A prompt is an instruction that commands an LLM to perform a specific task.

There are many LLMs we can choose from! We can use [OpenAI's GPT models](https://platform.openai.com/docs/models), which power ChatGPT. Keep in mind, however, that these models come with a price tag.

But do not worry! There are still a lot more [free-to-use LLMs](https://www.datacamp.com/blog/top-open-source-llms) out there. Many free LLMs can also be found on [Hugging Face 🤗](https://huggingface.co/models?pipeline_tag=text-generation&sort=likes), which is a machine learning platform that aims to make AI tech accessible to everyone.

**For this use case, we will be using OpenAI's `gpt-3.5-turbo` for the experience!**





In [None]:
import getpass
import os

# 💡 ACTION NEEDED: Run this cell and, when prompted, paste the OpenAI API key provided for this workshop (never hardcode or share API keys in a notebook)
os.environ["OPENAI_API_KEY"] = getpass.getpass()

In [None]:
from langchain_openai import ChatOpenAI

# 💡 ACTION NEEDED: Type in the name of the LLM we will be using
# 🆘 HINT: You can see it above!
llm = ChatOpenAI(model="💡 ACTION NEEDED: Type in the name of the LLM we will be using")

In [None]:
from langchain_core.prompts import PromptTemplate

# Create a prompt that contains the following:
# (1) Task
# (2) Context
# (3) Question
# (4) Output indicator
# The template must include the {context} and {question} placeholders
prompt_template = """
💡 ACTION NEEDED: Create a prompt
"""

prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"],
)

In [None]:
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# Create a question-answering (QA) chain that can be used to answer questions ...
# by retrieving relevant chunks of information from our vector store (retriever), ...
# by using a prompt that states the task, context, and question, and ...
# by using an LLM
qa_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
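
Conceptually, the `|` operator in the chain above feeds the output of each step into the next one. The sketch below imitates that flow with plain Python functions; it is only an analogy, not LangChain's implementation, and `pipe`, `fake_retrieve`, `fill_prompt`, `fake_llm`, and `parse_output` are hypothetical stand-ins for the retriever, prompt, LLM, and output parser.

```python
from functools import reduce


def pipe(*steps):
    """Compose steps left to right: the output of one step is the input of the next."""
    return lambda value: reduce(lambda result, step: step(result), steps, value)


def fake_retrieve(question):
    # Stand-in for the retriever: pair the question with a placeholder context
    return {"context": "placeholder chunk", "question": question}


def fill_prompt(inputs):
    # Stand-in for the prompt template
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}"


def fake_llm(prompt_text):
    # Stand-in for the LLM: echo the first line of the prompt inside a message
    return {"content": "ANSWER based on: " + prompt_text.splitlines()[0]}


def parse_output(message):
    # Stand-in for StrOutputParser: keep only the text of the message
    return message["content"]


toy_chain = pipe(fake_retrieve, fill_prompt, fake_llm, parse_output)
toy_answer = toy_chain("How were students affected?")
print(toy_answer)
```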

In [None]:
answer = qa_chain.invoke("💡 ACTION NEEDED: Ask a question")

In [None]:
import textwrap

width = 80

# Format the answer
answer = textwrap.fill(answer, width=width)

# Print the answer
print(answer)