<a href="https://colab.research.google.com/github/elhamod/IS813/blob/main/Week6/IS883_2024_Week6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS883 Week 10: RAG, ReAct, and CoT


1. Use Google Colab for this assignment.

2. **You are allowed to use ChatGPT for this assignment (except where specifically instructed otherwise). You may also use Google and other online resources. As per the syllabus, you are required to cite your usage. You are also responsible for understanding the solution and defending it when asked in class.**

3. For each question, fill in the answer in the cell(s) right below it. The answer could be code or text. You can add as many cells as you need for clarity.

4. Enter your BUID (only numerical part) below, if applicable.

5. **Your submission on Blackboard should be the downloaded notebook (i.e., ipynb file). It should be prepopulated with your solution (i.e., the TA and/or instructor need not rerun the notebook to inspect the output). The code, when executed by the TA and/or instructor, should run with no runtime errors.**

# Part 1: Pre-class Work

In [None]:
!pip install langchain_google_community openai tiktoken faiss-cpu pypdf


# Part 2: In-class Work

## 2.1 RAG: Question Answering System Using the School's Syllabus Database

At your school, the department has embarked on a project to utilize language modeling for the development of a question-answering agent. This initiative aims to streamline access to information for faculty and staff, particularly regarding the extensive array of courses offered at our institution. The data pertaining to these courses is currently dispersed across numerous documents within [the department's syllabus corpus](https://drive.google.com/drive/folders/1dH-t_Ujih4lMMzUOaNOHngvOYLK_gWOp?usp=sharing).

Download the corpus to your Google Drive and update the path below.

Note: The used syllabus corpus is a subset of [Cal Poly's Syllabus Corpus dataset](https://www.kaggle.com/datasets/mfekadu/syllabus-corpus).

In [None]:
from google.colab import drive
drive.mount('/content/drive')

syllabus_corpus_path = "/content/drive/MyDrive/IS883/Assignments/2023/IS883_HW4_syllabus_corpus"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We use [PyPDFDirectoryLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.pdf.PyPDFDirectoryLoader.html) to create a loader that reads all the PDFs in the directory so that they can be used by LangChain.

In [None]:
from langchain.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader(syllabus_corpus_path)

Given the extensive data contained within these documents, it's impractical to include them in their entirety in our queries. Including all data at once could exceed the context window's capacity and may result in significant processing costs. To address this challenge, we split the documents into smaller chunks of at most 500 characters, with a 50-character overlap between consecutive chunks so that text cut at a boundary still appears intact in one of them.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

Now, using the aforementioned loader and splitter, perform the splitting.

In [None]:
chunks = loader.load_and_split(text_splitter)
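To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding window over characters. It is a simplification, not the library's algorithm: `RecursiveCharacterTextSplitter` also tries to break at natural separators such as paragraphs and spaces, which this sketch ignores. The names `split_with_overlap`, `sample_text`, and `sample_chunks` are hypothetical.

```python
def split_with_overlap(text, chunk_size=500, chunk_overlap=50):
    """Split text into fixed-size character windows that overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    for start in range(0, len(text), step):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return pieces

sample_text = "abcdefghij" * 120  # 1,200 characters of stand-in text
sample_chunks = split_with_overlap(sample_text)
print(len(sample_chunks), [len(c) for c in sample_chunks])  # 3 [500, 500, 300]
```

The last 50 characters of each chunk repeat as the first 50 characters of the next one, which is what the overlap setting buys.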



The next crucial step involves the creation of a data store, essentially a database, that will house the chunks of data you've created. The effectiveness of our question-answering system hinges on its ability to swiftly locate the relevant chunk containing the answer to any given query. To achieve this efficiency, we will employ a sophisticated indexing strategy, rather than relying on a basic brute-force search method.

* Build the Data Store with [Facebook AI Similarity Search (FAISS)](https://python.langchain.com/docs/integrations/vectorstores/faiss): Set up your data store using a [FAISS Vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss). FAISS is a library developed by Facebook AI that allows for efficient similarity search and clustering of dense vectors.

* Embedding Calculation with `OpenAIEmbeddings`: For each chunk of data in your store, calculate an embedding using `OpenAIEmbeddings`. These embeddings are essentially numerical representations of your text data, which can then be compared to the embeddings of incoming queries.

* Indexing for Efficient Search: By creating embeddings for each chunk and indexing them in the FAISS Vector store, you will enable the system to quickly find the most relevant chunk in response to a query. This process involves comparing the embedding of the query with the embeddings of the chunks to identify the best match.

The combination of `FAISS` and `OpenAIEmbeddings` will significantly enhance the efficiency and accuracy of the question-answering system, allowing for rapid retrieval of information from the extensive syllabus corpus.
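The comparison described above can be illustrated without any external service. The sketch below is only a toy under stated assumptions: `toy_embed` is a hypothetical bag-of-words stand-in (real `OpenAIEmbeddings` vectors are dense and learned), the search is a plain brute-force scan rather than a FAISS index, and cosine similarity is one common choice of metric that may differ from the distance a given FAISS index uses.

```python
import math
from collections import Counter

def toy_embed(text):
    """Hypothetical stand-in embedding: word counts, not a learned vector."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

toy_chunks = [
    "math 406 linear algebra iii instructor office hours",
    "grading policy homework midterm final exam",
    "textbook required reading list for the course",
]
# "Indexing": embed every chunk once, ahead of any query.
toy_index = [(chunk, toy_embed(chunk)) for chunk in toy_chunks]

def toy_search(query, index):
    """Return the chunk whose embedding is most similar to the query's."""
    query_vec = toy_embed(query)
    return max(index, key=lambda item: cosine_similarity(query_vec, item[1]))[0]

best = toy_search("who is the instructor of linear algebra iii", toy_index)
print(best)
```

The chunks are embedded once when the index is built, and each query is embedded and compared at search time; the real pipeline follows the same pattern with better vectors and a faster index.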

In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

from google.colab import userdata
openai_api_key = userdata.get('MyOpenAIKey')

faiss_index = FAISS.from_documents(chunks, OpenAIEmbeddings(openai_api_key=openai_api_key))

With the data store and indexing system in place, you are now equipped to tackle the core functionality of our question-answering system: responding to queries based on the indexed database.

* The code will utilize the [*`similarity_search`*](https://python.langchain.com/docs/integrations/vectorstores/faiss) function to identify the chunk that is most relevant or most similar to the posed question. This function will compare the embedding of the query with those of the indexed chunks to find the best match.

* Once you have identified the most relevant answer, output additional details indicating where this chunk is located. Specifically, provide information about *the page number and the document from which this chunk was extracted*.

To gain a deeper understanding of how similarity search operates, refer to the provided articles and references. These resources will offer a more detailed conceptual insight into the workings of similarity search algorithms and their applications in systems like ours.

[Resource 1.](https://www.pinecone.io/learn/what-is-similarity-search/)

[Resource 2.](https://python.langchain.com/docs/modules/data_connection/vectorstores/)

In [None]:
question = "Who is the instructor of Linear Algebra III?"

In [None]:
relevant_chunks = faiss_index.similarity_search(question)
best_chunk = relevant_chunks[0]

print(str(best_chunk.metadata['source']) + ". Page: " + str(best_chunk.metadata['page']) + "\n\n", best_chunk.page_content)

/content/drive/MyDrive/IS883/Assignments/2023/IS883_HW4_syllabus_corpus/11___syllabus.pdf. Page: 0

 Math 406 – Linear Algebra III
Winter 2009
Course Syllabus
Instructor: Anton Kaul
Oﬃce: 25-312 (Faculty Oﬃces East)
Phone: 6-1678
email: akaul@calpoly.edu
Oﬃce Hours: Monday 2-4, Tuesday 9-10, Thursday 9-10
Course Web Page: www.calpoly.edu/ ∼akaul/teaching/Math406
Textbook
The text for the course is Friedberg, Insel, and Spence, Linear Algebra , 4th ed.
Course Description
In Math 406 we will continue our study of the fundamental concepts of Linear Algebra. Topics
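Retrieval is only the first half of RAG; the retrieved chunks are then typically placed in the prompt so the model answers from them. The sketch below shows one possible way to assemble such a prompt together with source and page information. `ToyDocument` and `build_rag_prompt` are hypothetical helpers, and `ToyDocument` merely mimics the `page_content` and `metadata` attributes used in the cell above; the prompt wording is an assumption, not a prescribed template.

```python
from dataclasses import dataclass, field

@dataclass
class ToyDocument:
    """Hypothetical stand-in exposing the two attributes used above."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_rag_prompt(question, documents):
    """Combine retrieved chunks and their sources into one grounded prompt."""
    context_blocks = []
    for number, doc in enumerate(documents, start=1):
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page", "?")
        context_blocks.append(f"[{number}] {source} (page {page})\n{doc.page_content}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "Cite the bracketed source number you relied on.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

toy_docs = [
    ToyDocument(
        "Math 406 - Linear Algebra III\nInstructor: Anton Kaul",
        {"source": "11___syllabus.pdf", "page": 0},
    )
]
rag_prompt = build_rag_prompt("Who is the instructor of Linear Algebra III?", toy_docs)
print(rag_prompt)
```

In the notebook, the same function could be given `question` and `relevant_chunks`, and the resulting string passed to a chat model such as the `ChatOpenAI` object created later in this notebook.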


## 2.2 Chain of Thought: Riddle Me This...

Let's have some fun with math riddles and explore the impact of different prompt engineering frameworks on solving them using AI.

In [None]:
riddle  = "A man left 100 dollars to be divided between his two sons Alfred and Benjamin. If one third of Alfred’s legacy was taken from one-fourth of Benjamin’s, the remainder would be 11 dollars. How much is Alfred's legacy?"

In [None]:
from langchain.agents import AgentType, initialize_agent, load_tools

In [None]:
# Toggle LangChain debug output (set to True to trace every call)
import langchain
langchain.debug = False


### **Solution 0: Zero-shot learning**.



In [None]:
## Create the llm
from langchain.chat_models import ChatOpenAI
chat = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-3.5-turbo")
print("\n", chat.invoke(riddle).content)


 Let x be the amount of the legacy of Alfred and y be the amount of the legacy of Benjamin.

From the given information, we can create the following equations:

x + y = 100 (equation 1)
(1/3)x - (1/4)y = 11 (equation 2)

To solve the system of equations, we can first simplify equation 2 by finding a common denominator:

(4/12)x - (3/12)y = 11
(4x - 3y)/12 = 11

Now, we can multiply both sides by 12 to get rid of the denominator:

4x - 3y = 132

Next, we can use equation 1 to substitute y = 100 - x into the above equation:

4x - 3(100 - x) = 132
4x - 300 + 3x = 132
7x - 300 = 132
7x = 432
x = 432 / 7
x ≈ 61.71

Therefore, Alfred's legacy is approximately 61.71 dollars.


**Questions:**

1. What happens if we use a newer model?
2. What happens if we change temperature?
3. What happens if we use prompt engineering?

### **Solution 1: Zero-shot learning with a calculator**.


Let's try enhancing the zero-shot learning approach by integrating a calculator tool with the language model. This setup aims to improve the accuracy and effectiveness of solving riddles, especially those involving mathematical elements.

Use OpenAI playground with prompt engineering to trigger this.
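For intuition about what a calculator tool contributes, here is a minimal sketch of one written in plain Python: the model would supply an arithmetic expression as text and receive the exact numeric result back. The function `calculate` is hypothetical and is not the implementation behind LangChain's `llm-math` tool; it only handles `+`, `-`, `*`, `/`, `**`, and parentheses, and rejects everything else instead of calling `eval()`.

```python
import ast
import operator

# Only these binary operators are accepted by the toy calculator.
_ALLOWED_BINARY = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}

def calculate(expression):
    """Evaluate a plain arithmetic expression without calling eval()."""
    def _eval(node):
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _ALLOWED_BINARY:
            return _ALLOWED_BINARY[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        raise ValueError("unsupported expression")
    return _eval(ast.parse(expression, mode="eval"))

# The division from the zero-shot output above, computed exactly by the tool.
print(calculate("432 / 7"))
```

A tool like this removes arithmetic slips, but it cannot repair an equation that was set up incorrectly in the first place.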



### **Solution 2: Chain of Thought (CoT)**.


Finally, you'll guide the model through a step-by-step process, breaking down the solution into clear, logical steps. This CoT approach helps the AI model understand the reasoning process needed to arrive at the correct answer.


In [None]:
from langchain.prompts.few_shot import FewShotPromptTemplate
from langchain.prompts.prompt import PromptTemplate

examples = [
  {
    "question": "A woman left 200 dollars to be divided between her daughters. If one half of the Mary's inheritence was taken from one-quarter of the Sally's, the remainder would be 10 dollars. How much is Mary's inheritence?"
,
    "answer":
"""
Are follow up questions needed here: Yes.
Follow up: What's the sum equation?
Intermediate answer: 200 = x + y
Follow up: what's the fractions' equations?
Intermediate answer: 0.25*y - 0.5*x = 10
Follow up: get y in terms of x
Intermediate answer: y = (10 + 0.5*x) / 0.25
Follow up: substitute y into the first equation
Intermediate answer: 200 = x + (10 + 0.5*x) / 0.25
Follow up: solve for x
Intermediate answer: x + (0.5/0.25)x = 200 - 10/0.25
Follow up: simplify
Intermediate answer: x + 2x = 200 - 40
Follow up: Consolidate
Intermediate answer: 3x = 160
Follow up: get x
Intermediate answer: x = 160/3
The final answer: 53.333333
"""
  }
]

In [None]:
example_prompt = PromptTemplate(input_variables=["question", "answer"], template="Question: {question}\n Answer: {answer} \n ---------")

In [None]:
prompt = FewShotPromptTemplate(
    examples=examples,
    example_prompt=example_prompt,
    suffix="Question: {input}",
    input_variables=["input"]
)

print(prompt.format(input=riddle))

Question: A woman left 200 dollars to be divided between her daughters. If one half of the Mary's inheritence was taken from one-quarter of the Sally's, the remainder would be 10 dollars. How much is Mary's inheritence?
 Answer: 
Are follow up questions needed here: Yes.
Follow up: What's the sum equation?
Intermediate answer: 200 = x + y
Follow up: what's the fractions' equations?
Intermediate answer: 0.25*y - 0.5*x = 10
Follow up: get y in terms of x
Intermediate answer: y = (10 + 0.5*x) / 0.25
Follow up: substitute y into the first equation
Intermediate answer: 200 = x + (10 + 0.5*x) / 0.25
Follow up: solve for x
Intermediate answer: x + (0.5/0.25)x = 200 - 10/0.25
Follow up: simplify
Intermediate answer: x + 2x = 200 - 40
Follow up: Consolidate
Intermediate answer: 3x = 160
Follow up: get x
Intermediate answer: x = 160/3
The final answer: 53.333333
 
 ---------

Question: A man left 100 dollars to be divided between his two sons Alfred and Benjamin. If one third of Alfred’s legacy 

Now, call the `chat` model with the template after properly substituting the riddle into it.

In [None]:
print(chat.invoke(prompt.format(input=riddle)).content)

Answer:
Are follow up questions needed here: Yes
Follow up: What's the sum equation?
Intermediate answer: 100 = x + y
Follow up: What's the fractions' equations?
Intermediate answer: 0.25*y - 0.33*x = 11
Follow up: Get y in terms of x
Intermediate answer: y = (11 + 0.33*x) / 0.25
Follow up: Substitute y into the first equation
Intermediate answer: 100 = x + (11 + 0.33*x) / 0.25
Follow up: Solve for x
Intermediate answer: x + 1.32x = 100 - 11/0.25
Follow up: Simplify
Intermediate answer: 2.32x = 100 - 44
Follow up: Consolidate
Intermediate answer: 2.32x = 56
Follow up: Get x
Intermediate answer: x = 56 / 2.32
The final answer: 24.137931
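To check the model's answers, the riddle can be solved with exact fractions. The sketch below assumes the same reading as the few-shot example, namely that "one third of Alfred's legacy taken from one-fourth of Benjamin's" means y/4 - x/3 = 11 with x + y = 100; the function `solve_legacy` is a hypothetical helper.

```python
from fractions import Fraction

def solve_legacy(total, first_fraction, second_fraction, remainder):
    """Solve x + y = total and second_fraction*y - first_fraction*x = remainder."""
    # Substituting y = total - x gives:
    #   second_fraction*(total - x) - first_fraction*x = remainder
    #   x = (second_fraction*total - remainder) / (first_fraction + second_fraction)
    x = (second_fraction * total - remainder) / (first_fraction + second_fraction)
    y = total - x
    return x, y

alfred, benjamin = solve_legacy(Fraction(100), Fraction(1, 3), Fraction(1, 4), Fraction(11))
print(alfred, benjamin)  # 24 76
```

Under this reading the exact answer is 24 dollars. The chain-of-thought output above lands close to it, and its small error comes from rounding one third to 0.33. The zero-shot output reversed the subtraction in its second equation, which is why it arrived at about 61.71.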


## 2.3 ReAct

### Agents

Let's see how we can use the Wikipedia tool with an agent in `LangChain`.

In [None]:
!pip install -U wikipedia



Let's ask a question about GPT-4 and let the agent look it up on Wikipedia.

In [None]:
from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.chat_models import ChatOpenAI

# Optional: a web-search tool can be enabled by setting its API key first.
# import os
# from langchain_community.utilities import SearchApiAPIWrapper
# os.environ["SEARCHAPI_API_KEY"] = userdata.get('SEARCHAPI_API_KEY')
# SearchApiAPIWrapper()

chat = ChatOpenAI(openai_api_key=openai_api_key)

# Other tools, such as "llm-math" (math) or "searchapi" (web search), can be added to this list.
tools = load_tools(["wikipedia"], llm=chat)

# Build a zero-shot ReAct agent; verbose=True prints each Thought/Action/Observation step.
agent = initialize_agent(
    tools,
    chat,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,
)

msg = "When was ChatGPT 4 released?"

agent(msg)



> Entering new AgentExecutor chain...
I need to find information about Mohannad Elhamod's date of birth
Action: wikipedia
Action Input: Mohannad Elhamod
Observation: No good Wikipedia Search Result was found
Thought: I should try a different search query to find Mohannad Elhamod's date of birth
Action: wikipedia
Action Input: Mohannad Elhamod date of birth
Observation: No good Wikipedia Search Result was found
Thought: I should try searching for Mohannad Elhamod's biography to find his date of birth
Action: wikipedia
Action Input: Mohannad Elhamod biography
Observation: No good Wikipedia Search Result was found
Thought: I should try searching for Mohannad Elhamod's personal information to find his date of birth
Action: wikipedia
Action Input: Mohannad Elhamod personal information
Observation: No good Wikipedia Search Result was found
Thought:

{'input': "When was Mohannad Elhamod's date of birth",
 'output': 'The date of birth for Mohannad Elhamod could not be found using Wikipedia.'}
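The trace above follows the ReAct pattern: the model alternates between a thought, an action (a tool call), and an observation (the tool's result) until it produces a final answer or gives up. The sketch below imitates that loop in plain Python under loud assumptions: `scripted_model` is a hard-coded stand-in for an LLM, `lookup` is a toy in-memory tool rather than Wikipedia, and the parsing is far simpler than what `LangChain` does internally.

```python
def scripted_model(transcript):
    """Hypothetical stand-in for an LLM: chooses the next step from the transcript."""
    if "Observation:" not in transcript:
        return "Thought: I should look this up.\nAction: lookup\nAction Input: FAISS"
    return "Thought: I now know the final answer.\nFinal Answer: FAISS is a similarity search library."

def lookup(term):
    """Hypothetical tool: a tiny in-memory reference instead of Wikipedia."""
    notes = {"FAISS": "FAISS is a library for efficient similarity search of dense vectors."}
    return notes.get(term, "No good result was found")

TOOLS = {"lookup": lookup}

def run_react(question, model, tools, max_steps=5):
    """Alternate model steps and tool calls until a final answer or the step limit."""
    transcript = f"Question: {question}"
    for _ in range(max_steps):
        reply = model(transcript)
        transcript += "\n" + reply
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip(), transcript
        # Parse the requested tool name and its input from the model's reply.
        action = reply.split("Action:", 1)[1].split("\n", 1)[0].strip()
        action_input = reply.split("Action Input:", 1)[1].strip()
        observation = tools[action](action_input)
        # Feed the tool result back so the next model step can use it.
        transcript += f"\nObservation: {observation}"
    return "Stopped: step limit reached.", transcript

answer, transcript = run_react("What is FAISS?", scripted_model, TOOLS)
print(transcript)
```

The step limit matters in practice: as the recorded trace shows, an agent whose tool keeps returning nothing useful can otherwise keep rephrasing the same query.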

Here is how you can list all available tools.

In [None]:
langchain.agents.get_all_tool_names()

['sleep',
 'wolfram-alpha',
 'google-search',
 'google-search-results-json',
 'searx-search-results-json',
 'bing-search',
 'metaphor-search',
 'ddg-search',
 'google-lens',
 'google-serper',
 'google-scholar',
 'google-finance',
 'google-trends',
 'google-jobs',
 'google-serper-results-json',
 'searchapi',
 'searchapi-results-json',
 'serpapi',
 'dalle-image-generator',
 'twilio',
 'searx-search',
 'merriam-webster',
 'wikipedia',
 'arxiv',
 'golden-query',
 'pubmed',
 'human',
 'awslambda',
 'stackexchange',
 'sceneXplain',
 'graphql',
 'openweathermap-api',
 'dataforseo-api-search',
 'dataforseo-api-search-json',
 'eleven_labs_text2speech',
 'google_cloud_texttospeech',
 'read_file',
 'reddit_search',
 'news-api',
 'tmdb-api',
 'podcast-api',
 'memorize',
 'llm-math',
 'open-meteo-api',
 'requests',
 'requests_get',
 'requests_post',
 'requests_patch',
 'requests_put',
 'requests_delete',
 'terminal']