<a href="https://colab.research.google.com/github/elhamod/IS883/blob/main/Week9/IS883_2024_Week9_assignment.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IS883 Week9: Retrieval Augmented Generation (RAG) - Accessing knowledge from documents.

1. Use Google Colab for this assignment.

2. **You are NOT allowed to use external or embedded Gen AI for this assignment (except where specifically instructed). However, you may use Google search and other online resources. As per the syllabus, you are required to cite your usage. You are also responsible for understanding the solution and defending it when asked in class.**

3. For each question, fill in the answer in the cell(s) right below it. The answer could be code or text. You can add as many cells as you need for clarity.

4. **Your submission on Blackboard should be the downloaded notebook (i.e., ipynb file). It should be prepopulated with your solution (i.e., the TA and/or instructor need not rerun the notebook to inspect the output). The code, when executed by the TA and/or instructor, should run with no runtime errors.**

# Part 1: In-class Work

##1.1 Using Gen AI for Recruiting

Let's say you are a recruiter who is trying to determine whether a candidate is a good fit for a job posting you have.

Your workflow probably goes like this:
- Load the candidate's CV.
- Inspect it for education, career, and extracurricular highlights.
- Read the job description.
- Check if there is a match.

We will do exactly that in the next few cells.

While this exercise could be done using ChatGPT without coding **(We could even create a GPT for this!)**, it will become clear later why a programmatic approach may be needed.

In [None]:
!pip install pypdf langchain-community



First specify where the CV is.

In [None]:
cv_path = "https://questromapps.bu.edu/faculty/uploadFiles/cvUploadsProfile/IS_Elhamod_Mohannad_CV.pdf"

We will use [`PyPDFLoader`](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PyPDFLoader.html#langchain_community.document_loaders.pdf.PyPDFLoader.load_and_split) to load the CV.

In [None]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader(cv_path)
document = loader.load()
document

[Document(metadata={'source': 'https://questromapps.bu.edu/faculty/uploadFiles/cvUploadsProfile/IS_Elhamod_Mohannad_CV.pdf', 'page': 0}, page_content='Mohannad Elhamod\n/envel⌢peelhamod@bu.edu • /linkedin-inmohannadelhamod\nResearch and Teaching Interests\nHighly engaged in interdisciplinary research and teaching, with a focus on interpretability and visualization in Machine\nLearning. Keen on evaluating and instructing on the influence of generative AI and language models across various\nbusiness sectors such as management, marketing, strategy, and finance. Actively exploring and integrating these\nadvanced technologies, along with Knowledge-Guided ML, into both academic inquiries and classroom settings\nto foster innovation, enhance learning experiences, and create new opportunities in the wider business spectrum.\nThis approach not only advances research frontiers but also equips students with critical, cutting-edge skills and\nunderstandings necessary for navigating the evolving bu

Notice that `document` is a list, not a string.

In [None]:
type(document)

list

Let's say we want to use the entire CV as a reference. We need to aggregate the pages together to make a single text.

In [None]:
fullCV = ""
for i, page in enumerate(document):
  fullCV = fullCV + "\n" + page.page_content

fullCV

'\nMohannad Elhamod\n/envel⌢peelhamod@bu.edu • /linkedin-inmohannadelhamod\nResearch and Teaching Interests\nHighly engaged in interdisciplinary research and teaching, with a focus on interpretability and visualization in Machine\nLearning. Keen on evaluating and instructing on the influence of generative AI and language models across various\nbusiness sectors such as management, marketing, strategy, and finance. Actively exploring and integrating these\nadvanced technologies, along with Knowledge-Guided ML, into both academic inquiries and classroom settings\nto foster innovation, enhance learning experiences, and create new opportunities in the wider business spectrum.\nThis approach not only advances research frontiers but also equips students with critical, cutting-edge skills and\nunderstandings necessary for navigating the evolving business landscape.\nEducation\nPh.D. in Computer Science May 2023\nVirginia Tech Blacksburg, VA\nM.Eng in Computer Engineering April 2012\nMcGill Uni
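The same aggregation can be written with `str.join`. The sketch below is only an illustration: `Page` is a hypothetical stand-in for a loaded page (it is not LangChain's `Document` class) and the page texts are made up. Unlike the loop above, this version does not add a leading newline.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for a loaded page; only page_content is modeled."""
    page_content: str


def join_pages(pages):
    """Concatenate the text of all pages, separated by newlines."""
    return "\n".join(page.page_content for page in pages)


# Made-up page texts, for illustration only.
sample_pages = [Page("Education"), Page("Publications")]
joined = join_pages(sample_pages)
print(repr(joined))
```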

###Case 1: CV Question-Answering.

Let's try interrogating our CV. You may try this in ChatGPT as well and see how the behavior is different.

In [None]:
question = "Where did Mohannad Elhamod graduate from?"
# question = "What is the candidate's email address?" # PDF formatting may be jumbled.
# question = "How many papers has Mohannad published?" # Uncertainty in answer
# question = "Where was Mohannad Elhamod born?" # Uncertainty in answer
# question = "What are the highlights of the candidate's career?"
# question = "Based on this CV, would the candidate be a good fit for a technical consultant position at an investment firm?"
# question = "What positions would the candidate be good for?"

In [None]:
prompt_template = """
Answer the following question based on the provided CV.


CV:
{fullCV}


Question:
{question}

"""

from langchain.prompts import ChatPromptTemplate

# Create the template and substitute the CV and question into it.
prompt_template = ChatPromptTemplate.from_template(prompt_template)
prompt = prompt_template.format_messages(fullCV=fullCV, question=question)
print(prompt)

[HumanMessage(content='\nAnswer the following question based on the provided CV.\n\n\nCV:\n\nMohannad Elhamod\n/envel⌢peelhamod@bu.edu • /linkedin-inmohannadelhamod\nResearch and Teaching Interests\nHighly engaged in interdisciplinary research and teaching, with a focus on interpretability and visualization in Machine\nLearning. Keen on evaluating and instructing on the influence of generative AI and language models across various\nbusiness sectors such as management, marketing, strategy, and finance. Actively exploring and integrating these\nadvanced technologies, along with Knowledge-Guided ML, into both academic inquiries and classroom settings\nto foster innovation, enhance learning experiences, and create new opportunities in the wider business spectrum.\nThis approach not only advances research frontiers but also equips students with critical, cutting-edge skills and\nunderstandings necessary for navigating the evolving business landscape.\nEducation\nPh.D. in Computer Science Ma
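Conceptually, the substitution behaves like Python's own `str.format`: each `{placeholder}` is replaced by the matching keyword argument. The following plain-Python sketch shows that idea with a made-up one-line CV. It is an analogy only and does not reflect how `ChatPromptTemplate` is implemented internally.

```python
# A shortened copy of the template, used only for this illustration.
template = """
Answer the following question based on the provided CV.

CV:
{fullCV}

Question:
{question}
"""

# Made-up stand-in values; in the notebook these come from the loaded PDF.
filled = template.format(
    fullCV="Ph.D. in Computer Science, Virginia Tech",
    question="Where did the candidate graduate from?",
)
print(filled)
```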

In [None]:
# For debugging
import langchain
langchain.debug = True

We can use `get_openai_callback` to calculate cost.
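For intuition, the cost of a call is driven by the number of prompt and completion tokens. The sketch below shows that arithmetic under stated assumptions: the function name `estimate_cost_usd`, the token counts, and the per-1,000-token prices are all hypothetical and are not actual OpenAI prices. The callback object itself exposes fields such as `prompt_tokens`, `completion_tokens`, and `total_cost`, so in practice you would read those instead of computing by hand.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      usd_per_1k_prompt, usd_per_1k_completion):
    """Estimate the cost of one call from token counts and per-1k-token prices."""
    prompt_cost = prompt_tokens / 1000 * usd_per_1k_prompt
    completion_cost = completion_tokens / 1000 * usd_per_1k_completion
    return prompt_cost + completion_cost


# Hypothetical token counts and prices, for illustration only.
cost = estimate_cost_usd(
    prompt_tokens=2000,
    completion_tokens=500,
    usd_per_1k_prompt=0.01,
    usd_per_1k_completion=0.03,
)
print(f"Estimated cost: ${cost:.4f}")
```

Note that the prompt side usually dominates here, because the whole CV is sent with every question.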

In [None]:
# Getting OpenAI API Key
from google.colab import userdata
openai_api_key = userdata.get('MyOpenAIKey')


from langchain.chat_models import ChatOpenAI
from langchain.callbacks import get_openai_callback

# Get the answer!
chat = ChatOpenAI(openai_api_key=openai_api_key, model="gpt-4o") # Things to try: a different model
with get_openai_callback() as cb: # Used to get cost information.
  print(chat.predict(prompt[0].content))
  print(cb) # print the cost.

[32;1m[1;3m[llm/start][0m [1m[llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: \nAnswer the following question based on the provided CV.\n\n\nCV:\n\nMohannad Elhamod\n/envel⌢peelhamod@bu.edu • /linkedin-inmohannadelhamod\nResearch and Teaching Interests\nHighly engaged in interdisciplinary research and teaching, with a focus on interpretability and visualization in Machine\nLearning. Keen on evaluating and instructing on the influence of generative AI and language models across various\nbusiness sectors such as management, marketing, strategy, and finance. Actively exploring and integrating these\nadvanced technologies, along with Knowledge-Guided ML, into both academic inquiries and classroom settings\nto foster innovation, enhance learning experiences, and create new opportunities in the wider business spectrum.\nThis approach not only advances research frontiers but also equips students with critical, cutting-edge skills and\nunderstandings necessary

Notice that if we do not use the CV as context, the model either hallucinates or is unable to answer.

In [None]:
chat.invoke(question) # Send the question alone, without the CV as context.

[32;1m[1;3m[llm/start][0m [1m[llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: Based on this CV, would the candidate be a good fit for a technical consultant position at an investment firm?"
  ]
}
[36;1m[1;3m[llm/end][0m [1m[llm:ChatOpenAI] [4.87s] Exiting LLM run with output:
[0m{
  "generations": [
    [
      {
        "text": "I'm unable to view specific documents like a CV. However, I can certainly help you evaluate whether a candidate might be a good fit for a technical consultant position at an investment firm based on typical qualifications and skills required for such a role. Here are some aspects to consider:\n\n1. **Technical Skills**: The candidate should have a strong foundation in relevant technical skills, such as programming languages (Python, R, Java, etc.), data analysis, and possibly financial modeling. Experience with databases, data visualization tools, and financial software could also be beneficial.\n\n2. **Industry Knowledge

AIMessage(content="I'm unable to view specific documents like a CV. However, I can certainly help you evaluate whether a candidate might be a good fit for a technical consultant position at an investment firm based on typical qualifications and skills required for such a role. Here are some aspects to consider:\n\n1. **Technical Skills**: The candidate should have a strong foundation in relevant technical skills, such as programming languages (Python, R, Java, etc.), data analysis, and possibly financial modeling. Experience with databases, data visualization tools, and financial software could also be beneficial.\n\n2. **Industry Knowledge**: Familiarity with the financial industry, including understanding of financial markets, instruments, and regulations, is crucial. Look for any previous experience in finance, banking, or investment sectors.\n\n3. **Consulting Experience**: Previous experience in a consulting role or similar client-facing positions can be advantageous. The candidat

###Case 2: Is The Candidate a Good Fit For a Certain Job?

Now, let's upgrade the functionality such that we get a score for how well a candidate matches a job posting.

In [None]:
# Example 1: Questrom Faculty job posting.
jobposting_path = "https://github.com/elhamod/IS883/blob/5d2bd74a75cef838c3a945c12e956fe10f1cd895/Week9/2024%20BU%20Recruiting.pdf?raw=true"

# Example 2: Software Developer job posting.
# jobposting_path = "https://cse.sc.edu/files/SFSoftwareDeveloper.pdf"

In [None]:
# Load and process the job posting document
loader = PyPDFLoader(jobposting_path)
document = loader.load()
document

jobposting = ""
for i, page in enumerate(document):
  jobposting = jobposting + "\n" + page.page_content

jobposting

'\nBoston University Questrom School of Business  \nThe Questrom School of Business at Boston University is hiring a full-time, tenure-track Assistant Professor of Information Systems with an anticipated start date of July 1, 2025.  We are seeking exceptional candidates with interests in the Business of AI. This position is part of a Boston University cluster hiring initiative led by the Faculty of Computing and Data Science. Prospective candidates must have a PhD in a core business discipline such as Information Systems or a PhD in Computer Science, Data Science, or Computer Engineering, with research focused on problems related to the development, management, and application of AI in business.   Prospective candidates must have demonstrated capabilities in producing original and innovative scholarly work of the highest possible quality with impact in the top business academic journals.  Candidates will be expected to teach a variety of undergraduate and graduate courses in informatio

Let's do a great job with prompt engineering. Notice how we use instructions, a persona, formatting, and constraints.

In [None]:
prompt_template = """
You are acting as a professional recruiter that is vetting candidate CVs for a position to get best job-candidate matches.
Below, you are given a candidate's CV and job posting.
Do not make up any answers if you do not have evidence. If you are unsure about a certain detail or conclusion, state you are unsure about it.
You are to provide the following:
- A key summary of the position
- A key summary of the candidate
- A table of the strengths the candidate has that makes them a good fit for the job.
Structure the table as follows:
     Strength | Evidence | Degree of relevance |How it is relevant to the job
     ---------|----------|---------------------|-------------------------------
- A table of missing skills or issues in the candidate's background, along with the degree of severity for this gap in the CV.
Structure the table as follows:
     Weakness | Severity level | How it is relevant to the job
     ---------|----------------|----------------------------------
- A rating between 0 and 5, indicating whether the candidate is a good fit or not, 5 being an excellent fit.
- A recommendation whether to proceed with the candidate or not for a round of interviews.


CV:
{fullCV}


Job posting:
{jobposting}

"""

from langchain.prompts import ChatPromptTemplate

# Construct the prompt and substitute into it.
prompt_template = ChatPromptTemplate.from_template(prompt_template)
prompt = prompt_template.format_messages(fullCV=fullCV, jobposting=jobposting)

In [None]:
with get_openai_callback() as cb:
  print(chat.predict(prompt[0].content))
  print(cb)

[32;1m[1;3m[llm/start][0m [1m[llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: \nYou are acting as a professional recruiter that is vetting candidate CVs for a position to get best job-candidate matches.\nBelow, you are given a candidate's CV and job posting.\nDo not make up any answers if you do not have evidence. If you are unsure about a certain detail or conclusion, state you are unsure about it.\nYou are to provide the following:\n- A key summary of the position\n- A key summary of the candidate\n- A table of the strengths the candidate has that makes them a good fit for the job.\nStructure the table as follows:\n     Strength | Evidence | Degree of relevance |How it is relevant to the job\n     ---------|----------|---------------------|-------------------------------\n- A table of missing skills or issues in the candidate's background, along with the degree of severity for this gap in the CV.\nStructure the table as follows:\n     Weakness | Seve

###Case 3: Online Job Postings.

Some job postings are not in PDF format, but we still want to be able to check them.
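Checking a web posting boils down to stripping the markup and keeping the visible text. The cells that follow use `WebBaseLoader` and `Html2TextTransformer` for this. The sketch below illustrates the same idea with Python's standard-library `html.parser` instead, so it is a swapped-in technique and not what those classes do internally. The class name `TextExtractor`, the helper `html_to_text`, and the sample HTML are all made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML string as one space-joined line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up HTML snippet, for illustration only.
sample_html = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Data Scientist</h1><p>Requires Python and Spark.</p></body></html>"
)
plain = html_to_text(sample_html)
print(plain)
```

Real pages also carry menus, footers, and other navigation text, which end up in the extracted posting and therefore in the prompt.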

In [None]:
!pip install playwright html2text

Collecting playwright
  Downloading playwright-1.48.0-py3-none-manylinux1_x86_64.whl.metadata (3.5 kB)
Collecting html2text
  Downloading html2text-2024.2.26.tar.gz (56 kB)
  Preparing metadata (setup.py) ... done
Collecting pyee==12.0.0 (from playwright)
  Downloading pyee-12.0.0-py3-none-any.whl.metadata (2.8 kB)
Downloading playwright-1.48.0-py3-none-manylinux1_x86_64.whl (38.2 MB)
Downloading pyee-12.0.0-py3-none-any.whl (14 kB)
Building wheels for collected packages: html2text
  Building wheel for html2text (setup.py) ... done
  Created wheel for html2text: filename=html2text-2024.2.26-py3-none-any.whl size=33111 sha256=edca0280c5ae1ad2c9a12e346f68036a1402c216f7f51a959f96bbf52355f3bc
  Stored in directory: /roo

In [None]:
from langchain_community.document_transformers import Html2TextTransformer
from langchain_community.document_loaders import WebBaseLoader

# Example 1: MLOPs position
url = "https://www.capitalonecareers.com/job/new-york/senior-machine-learning-engineer-python-spark-dask-mlops/1732/69791453600"

# Example 2: Amazon Software Developer
# url = "https://www.indeed.com/viewjob?jk=d24100691c8a049b&utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic"

# Example 3: Head of investments operations
# url = "https://careers.massmutual.com/job/boston/head-of-investment-operations-and-accounting-ecosystem-program/724/71301983296?utm_campaign=google_jobs_apply&utm_source=google_jobs_apply&utm_medium=organic"

# Example 4: Warehouse worker
# url = "https://careers.fedex.com/handler-warehouse-like/job/P25-94693-3"

# Example 5: Data scientist
# url= "https://jobs.careers.microsoft.com/global/en/job/1711166/Data-Scientist"

# Load the page and convert it from HTML to plain text.
loader = WebBaseLoader(url)
html_doc = loader.load()
text_doc = Html2TextTransformer().transform_documents(html_doc)
jobposting = text_doc[0].page_content
jobposting



"Senior Machine Learning Engineer (Python, Spark/Dask, MLOPS) at Capital One Skip to main content Menu Menu Explore Jobs back Explore Jobs Teams Military Pinned Jobs All Jobs Locations back Locations United States Canada India Mexico Philippines United Kingdom Students & Grads back Students & Grads Overview Internship Programs Full-Time Programs Graduate Programs Events Working at Capital One back Working at Capital One Benefits Culture Diversity, Inclusion & Belonging How We Work Innovation Careers Blog FAQs Search Jobs Returning Applicant You’re located near You’re interested in Find Jobs Here, one dream job leads to the next dream Sasha, Customer Experience R196828 Posted 10/28/2024 Senior Machine Learning Engineer (Python, Spark/Dask, MLOPS) New York, New York | McLean, Virginia Pin job Apply Category Engineering Experience Principal Associate Primary Address New York, New York Overview 11 West 19th Street (22008), United States of America, New York, New YorkSenior Machine Learning

In [None]:
# Construct the prompt
prompt = prompt_template.format_messages(fullCV=fullCV, jobposting=jobposting)

# Get the answer!
with get_openai_callback() as cb:
  print(chat.predict(prompt[0].content))
  print(cb)

[32;1m[1;3m[llm/start][0m [1m[llm:ChatOpenAI] Entering LLM run with input:
[0m{
  "prompts": [
    "Human: \nYou are acting as a professional recruiter that is vetting candidate CVs for a position to get best job-candidate matches.\nBelow, you are given a candidate's CV and job posting.\nDo not make up any answers if you do not have evidence. If you are unsure about a certain detail or conclusion, state you are unsure about it.\nYou are to provide the following:\n- A key summary of the position\n- A key summary of the candidate\n- A table of the strengths the candidate has that makes them a good fit for the job.\nStructure the table as follows:\n     Strength | Evidence | Degree of relevance |How it is relevant to the job\n     ---------|----------|---------------------|-------------------------------\n- A table of missing skills or issues in the candidate's background, along with the degree of severity for this gap in the CV.\nStructure the table as follows:\n     Weakness | Seve

# Part 2: Homework

##2.1 Question Answering System Using the School's Syllabus Database

At your school, the department has embarked on a project to use language modeling to develop a question-answering agent. This initiative aims to streamline access to information for faculty and staff, particularly regarding the extensive array of courses offered at the institution. The data pertaining to these courses is currently dispersed across numerous documents within [the department's syllabus corpus](https://drive.google.com/drive/folders/1dH-t_Ujih4lMMzUOaNOHngvOYLK_gWOp?usp=sharing).

*Note: The syllabus corpus used here is a subset of [Cal Poly's Syllabus Corpus dataset](https://www.kaggle.com/datasets/mfekadu/syllabus-corpus).*

###2.1.1 Let's first download the files

*Note: Downloading the files may be slow. No need to run this step more than once.*

In [None]:
!pip install --upgrade gdown
!gdown --folder https://drive.google.com/drive/folders/1dH-t_Ujih4lMMzUOaNOHngvOYLK_gWOp

Retrieving folder contents
Processing file 1coF5-0feLpYVDPomRPRTliitFUYUJb0i 1___syllabus.pdf
Processing file 1kAzZOyO0klFCcR_WDcwju7ymzGPQL1Ja 2___syllabus.pdf
Processing file 1GD-C007dvGBniAbZJjTdERR72caxap6Z 3___Syllabus-Lean-Sigma-Green-Belt-Cert12.pdf
Processing file 1EJlGrtRdzkjm_t8tJNRS_XbECedBdc05 4___syllabus.pdf
Processing file 1QZuoGCgSis5_SNhYnl3V8GPOGk2rH1xo 5___LA114-213Lec.pdf
Processing file 1CbCQQH-AGhnmh0RPsghRsuOtLYYPRore 6___syllabus.pdf
Processing file 192_AeIBU4feZhNcJAMyZhuoLueY9dJ7a 7___Syllabus-Summer-Online-GB-TGB-2017-IT303G.pdf
Processing file 1G5h0fH4Ig1IwyjxPwAE3jD8MugSQnqmS 8___syllabus.pdf
Processing file 1TyktAq4SAaA1RUnNRPVBLfz8e7tFZKn3 9___syllabus.pdf
Processing file 1GaYYH2K5sQw1s9bNuTD-JP0rohCejhGq 10___syllabus.pdf
Processing file 1iQOQUCZJBJmQcOiXmE5AAX091L7bHId8 11___syllabus.pdf
Processing file 17vGMIy5skXXOZ_Oq6LP6kvQrmCQicroT 12___WebPageSly.pdf
Processing file 1msOVu6i58lKvxSTtwCXVLXwqPGuhtHuD 13___syllabus.pdf
Processing file 13_rIvyOCp02TQ

In [None]:
syllabus_corpus_path = "/content/IS883_HW4_syllabus_corpus"

We will use a [PyPDFDirectoryLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain.document_loaders.pdf.PyPDFDirectoryLoader.html) to create a `loader` that can load all the PDFs in the directory so they can be used by `LangChain`.

In [None]:
from langchain_community.document_loaders import PyPDFDirectoryLoader

loader = PyPDFDirectoryLoader(syllabus_corpus_path)

###2.1.2 Document splitting



Given the extensive data contained within these documents, it's impractical to include them in their entirety in our queries. Including all data at once may exceed the context window's capacity. It may also result in significant processing costs. To address this challenge, we will split the syllabi into small chunks of data.

* Creating a [RecursiveCharacterTextSplitter](https://python.langchain.com/docs/modules/data_connection/document_transformers/text_splitters/recursive_text_splitter): A `RecursiveCharacterTextSplitter` divides the documents into more manageable segments/chunks.

* Configurations:
  * `chunk_size`: Set to 500 characters. This size ensures that the chunks are large enough to contain meaningful content but small enough to be processed efficiently.

  * `chunk_overlap`: Set to 50 characters. This overlap helps prevent the loss of context at chunk boundaries, reducing the chance that critical information is cut off or misread because of the split.
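To see what `chunk_size` and `chunk_overlap` mean, here is a deliberately simplified sketch that cuts text into fixed-width windows. It is an assumption-laden illustration, not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split on separators such as paragraph and line breaks. The function name `sliding_window_chunks` is made up.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Split text into fixed-width chunks whose neighbors share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


# Tiny example so the overlap is easy to see.
toy_chunks = sliding_window_chunks("abcdefghij", chunk_size=4, chunk_overlap=1)
print(toy_chunks)
```

Each chunk begins with the last character of the previous one, which is the overlap. With the settings used in this assignment, the windows would be 500 characters wide and share 50 characters.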

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size=500
chunk_overlap=50

text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)

Next, we perform the splitting.

*Note: Ignore any offset Warnings.*

In [None]:
chunks = loader.load_and_split(text_splitter)



Here is an example chunk:

In [None]:
chunks[122]

Document(metadata={'source': '/content/IS883_HW4_syllabus_corpus/20___syllabus09.pdf', 'page': 0}, page_content='Program Design in C by J. Hanly and E. Koffman (5th Edition)   Recommended Text(s)   The C Programming Language, Second Edition by Kernighan and Ritchie, Prentice Hall, Inc.,   Policies and Advisories   1)      This course will have a common final scheduled for either Tuesday or Wednesday from 7-10:00pm of finals week – no other finals will be given! 2)      The College of Engineering requires proof of unusual circumstances to withdraw from a course after the eighth day of the quarter. 3)')

###2.1.3 Chunk Similarity Search



The next crucial step involves the creation of a data store, essentially a database, that will house the chunks of data you've created.

* Building the Data Store with [Facebook AI Similarity Search (FAISS)](https://python.langchain.com/docs/integrations/vectorstores/faiss): First, we set up a data store using a [FAISS Vector store](https://python.langchain.com/docs/integrations/vectorstores/faiss). FAISS is a library developed by Facebook AI that allows for efficient similarity search.

* For each chunk of data in your store, we calculate an embedding using `OpenAIEmbeddings`. These embeddings are numerical representations of text data, which can then be compared with each other using a similarity metric.

* Using these embeddings, FAISS will be able to quickly find the most relevant chunk(s) to a query. This process involves comparing the embedding of the query with the embeddings of the chunks to identify the best match(es).
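The comparison step can be sketched with toy vectors. The example below ranks three made-up "chunks" by Euclidean (L2) distance to a made-up query vector. The three-dimensional embeddings and chunk labels are hypothetical, and real embeddings have far more dimensions. As far as the LangChain FAISS integration describes it, the default score is an L2 distance, so a lower score means a closer match; treat that as an assumption to verify against your own output.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_embeddings = {
    "office hours": [0.9, 0.1, 0.0],
    "grading policy": [0.1, 0.8, 0.1],
    "textbook": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.0]

# Sort chunk labels from closest to farthest from the query.
ranked = sorted(
    toy_embeddings,
    key=lambda name: l2_distance(query_embedding, toy_embeddings[name]),
)
for name in ranked:
    print(name, round(l2_distance(query_embedding, toy_embeddings[name]), 3))
```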

In [None]:
!pip install tiktoken faiss-cpu

Collecting tiktoken
  Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Downloading tiktoken-0.8.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Downloading faiss_cpu-1.9.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (27.5 MB)
Installing collected packages: faiss-cpu, tiktoken
Successfully installed faiss-cpu-1.9.0 tiktoken-0.8.0


In [None]:
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS

faiss_store = FAISS.from_documents(chunks, OpenAIEmbeddings(openai_api_key=openai_api_key))



With the data store in place, you are now equipped to tackle the core functionality of our question-answering system: responding to queries based on the indexed database.

* We utilize the [*`similarity_search_with_score`*](https://python.langchain.com/docs/integrations/vectorstores/faiss) function to identify the chunk that is most relevant to the posed question.

* Display Source Information: Once you have identified the most relevant answer, output additional details indicating where this chunk is located. Specifically, provide information about *the page number and the document from which this chunk was extracted*.  **(2 Points)**

To gain a deeper understanding of how similarity search operates, refer to this [article](https://www.pinecone.io/learn/what-is-similarity-search/).

In [None]:
# LangChain debugging is off; set to True to inspect the underlying requests and responses.
import langchain
langchain.debug = False

In [None]:
question = "Who is the instructor of Linear Algebra III?"

In [None]:
number_of_top_matches = 5

top_matching_chunks = faiss_store.similarity_search_with_score(question, k=number_of_top_matches)

#Get best chunk
best_chunk =   ### Fill here

# Print chunk and details
source_file = ### Fill here
page_number = ### Fill here
chunk_content = ### Fill here
print(source_file + ". Page: " + str(page_number) + "\n\n", chunk_content)

* Examine why certain chunks received higher or lower similarity scores. Analyze the content of each chunk in relation to your query to understand the basis of these scores. **(5 Points)**

  * Discuss whether the model is effectively discerning relevant information or if it appears to be misled by certain elements. Provide suggestions for improvements.

**Answer:**

In [None]:
### Fill here: Code to get answer

*Answer here*

###2.1.4 Putting it all together!

Finally, we are going to use OpenAI API to get the answer to the question based on the relevant chunk. To do that, we will use `LangChain`'s [*`create_retrieval_chain`*](https://api.python.langchain.com/en/latest/chains/langchain.chains.retrieval.create_retrieval_chain.html#langchain.chains.retrieval.create_retrieval_chain).

* Analyze the requests and responses in detail. Discuss how the API processes the chunks and formulates a prompt. The answer should be elaborate and clearly backed by actual diagnostics (i.e., using code to show the actual requests and responses that are taking place behind the scenes). **(5 Points)**

* Characterize and evaluate the system in terms of:
  * **average** cost,
  * data store size, and
  * **average** accuracy (in terms of relevance and precision).

  Elaborate, justify, and show your work. **(5 Points)**








In [None]:
# LangChain debugging is off; set to True to inspect the underlying requests and responses.
import langchain
langchain.debug = False

In [None]:
temperature = 1.0

In [None]:
from langchain.chat_models import ChatOpenAI

# Create a reference to the language model
llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=temperature)

In [None]:
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

# The retriever is the FAISS store, which retrieves relevant chunks/documents.
number_of_top_matches = 5
retriever = faiss_store.as_retriever(search_kwargs={"k": number_of_top_matches})

system_prompt = (
    """
    Answer the questions based on the provided context.

    Context:
    {context}
    """
)

# DO NOT CHANGE THIS
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# The aggregator assembles the relevant documents together into one text.
aggregator = create_stuff_documents_chain(llm, prompt=prompt)

# We finally have a RAG chain!
rag_chain = create_retrieval_chain(retriever, aggregator)
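The "stuffing" step can be pictured as plain string assembly: the retrieved chunk texts are joined and placed into the `{context}` slot of the system prompt. The sketch below illustrates that idea only. The helper `stuff_context`, the blank-line separator, and the chunk texts are assumptions made for illustration, so use LangChain debugging to confirm what is actually sent.

```python
def stuff_context(chunk_texts, system_template, separator="\n\n"):
    """Join the chunk texts and substitute them into the {context} placeholder."""
    context = separator.join(chunk_texts)
    return system_template.format(context=context)


# Condensed version of the system prompt used above.
system_template = (
    "Answer the questions based on the provided context.\n\n"
    "Context:\n{context}"
)

# Made-up chunk texts standing in for the retriever's results.
retrieved_texts = ["chunk one text", "chunk two text"]
system_message = stuff_context(retrieved_texts, system_template)
print(system_message)
```

Because every retrieved chunk is pasted into the prompt, raising the number of matches or the chunk size increases the prompt length and therefore the cost per question.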

First question:



> "Who is the instructor of Linear Algebra III?"

In [None]:
question = "Who is the instructor of Linear Algebra III?"

**Answer**

In [None]:
### Fill here: Code to answer here.

*Your answer*




###2.1.5 Trying a Different Question

Second question:



> "Define cheating in Lean Six Sigma Black Belt Training."

It's important to analyze and compare the system's performance across various questions.



* Compare with the first question in 2.1.4: Reflect on the system's response to the second question and compare it with the response to the first question above. Note any differences in performance. Elaborate on the causes behind your observations. Consider factors such as the nature of the question and the relevance and characteristics of the chosen chunk(s). **(5 Points)**

* Based on your observations, propose *and implement* changes or adjustments that would improve the system's ability to retrieve more accurate or relevant answers. Discuss the trade-offs associated with your changes. **(5 Points)**


In [None]:
question2 = "Define cheating in Lean Six Sigma Black Belt Training."

In [None]:
# LangChain debugging is off; set to True to inspect the underlying requests and responses.
import langchain
langchain.debug = False

In [None]:
# Let's put all the relevant code in one cell to make it more contained

temperature=0.1
chunk_size=500
chunk_overlap=50
number_of_top_matches = 5
system_prompt = (
    """
    Answer the questions based on the provided context.

    Context:
    {context}
    """
)

# Defining the store
text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
chunks = loader.load_and_split(text_splitter)
faiss_store = FAISS.from_documents(chunks, OpenAIEmbeddings(openai_api_key=openai_api_key))
retriever = faiss_store.as_retriever(search_kwargs={"k": number_of_top_matches})

# DO NOT CHANGE THIS
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", system_prompt),
        ("human", "{input}"),
    ]
)

# The aggregator assembles the relevant documents together into one text.
llm = ChatOpenAI(openai_api_key=openai_api_key, temperature=temperature, model="gpt-4o")
aggregator = create_stuff_documents_chain(llm, prompt=prompt)

# We finally have a RAG chain!
rag_chain = create_retrieval_chain(retriever, aggregator)

In [None]:
faiss_store.similarity_search_with_score(question2, k=number_of_top_matches)

**Answer**

In [None]:
### Fill here: Code to answer here.

*Your answer here*