## Create Embeddings

In [0]:
%pip install -qU databricks-langchain==0.1.1 langchain-chroma==0.1.1 pypdf==4.3.0
dbutils.library.restartPython()

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
from openai import OpenAI
import os

DATABRICKS_TOKEN = dbutils.secrets.get(scope = "db-field-eng", key = "va-pat-token")

client = OpenAI(
  api_key=DATABRICKS_TOKEN,
  base_url="https://e2-demo-field-eng.cloud.databricks.com/serving-endpoints"
)

In [0]:
from langchain.document_loaders import PyPDFLoader

loaders = [
  # Duplicate documents on purpose - simulate messy data
  PyPDFLoader("./data/docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
  PyPDFLoader("./data/docs/cs229_lectures/MachineLearning-Lecture01.pdf"),
  PyPDFLoader("./data/docs/cs229_lectures/MachineLearning-Lecture02.pdf"),
  PyPDFLoader("./data/docs/cs229_lectures/MachineLearning-Lecture03.pdf")
]

docs = []
for loader in loaders:
  docs.extend(loader.load())

In [0]:
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
  chunk_size = 1500,
  chunk_overlap = 150
)

chunked_docs = text_splitter.split_documents(docs)
len(chunked_docs)

209
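To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplified stand-in, not the library's algorithm: `RecursiveCharacterTextSplitter` prefers to break on separators such as paragraph and line breaks, whereas the hypothetical helper `chunk_with_overlap` below cuts purely by character position.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into character windows where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, chunk_overlap=3)
for chunk in chunks:
    print(chunk)
```

Each chunk starts with the last three characters of the previous one. The overlap plays the same role in the notebook: a sentence that straddles a chunk boundary still appears whole in at least one chunk.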

In [0]:
from langchain_chroma import Chroma
from databricks_langchain import DatabricksEmbeddings

embedding = DatabricksEmbeddings(endpoint="databricks-gte-large-en")

In [0]:
%sh
rm -rf ./data/docs/chroma  # remove old database files if any

In [0]:
persist_directory = './data/docs/chroma/'

In [0]:
vectordb = Chroma.from_documents(
  documents=chunked_docs,
  embedding=embedding,
  persist_directory=persist_directory   # Save vector indexes on disk
)

print(vectordb._collection.count())     # Equals the number of chunks: one vector is stored per chunk

209
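Because Lecture01 is loaded twice on purpose, the 209 stored vectors include exact duplicates. The notebook keeps them and relies on the retriever to cope, but a quick way to measure or remove exact duplicates is to fingerprint each chunk's text. The sketch below uses hypothetical helper names (`content_key`, `drop_exact_duplicates`) and made-up sample strings; in the notebook the texts would come from the `page_content` of each entry in `chunked_docs`.

```python
import hashlib


def content_key(text: str) -> str:
    """Return a stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct text, preserving order."""
    seen = set()
    unique = []
    for text in texts:
        key = content_key(text)
        if key not in seen:
            seen.add(key)
            unique.append(text)
    return unique


# Illustrative sample data, not taken from the lecture PDFs
sample_texts = [
    "intro to machine learning",
    "intro to machine learning",
    "linear regression",
    "logistic regression",
]
unique_texts = drop_exact_duplicates(sample_texts)
print(len(sample_texts) - len(unique_texts), "duplicate chunk(s) found")
```

This only catches byte-identical text. Near-duplicates (for example, the same page extracted with different whitespace) would need a similarity-based check instead.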


In [0]:
question = "is there an email i can ask for help?"
similar_docs = vectordb.similarity_search(question, k=3)
print(similar_docs)

[Document(metadata={'page': 5, 'source': './data/docs/cs229_lectures/MachineLearning-Lecture01.pdf'}, page_content="cs229-qa@cs.stanford.edu. This goes to an acc ount that's read by all the TAs and me. So \nrather than sending us email individually, if you send email to this account, it will \nactually let us get back to you maximally quickly with answers to your questions.  \nIf you're asking questions about homework probl ems, please say in the subject line which \nassignment and which question the email refers to, since that will also help us to route \nyour question to the appropriate TA or to me  appropriately and get the response back to \nyou quickly.  \nLet's see. Skipping ahead — let's see — for homework, one midterm, one open and term \nproject. Notice on the honor code. So one thi ng that I think will help you to succeed and \ndo well in this class and even help you to enjoy this cla ss more is if you form a study \ngroup.  \nSo start looking around where you' re sitting now

In [0]:
from databricks_langchain import ChatDatabricks
from databricks_langchain import DatabricksEmbeddings

In [0]:
from langchain_chroma import Chroma

In [0]:
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate
import mlflow

In [0]:
# Use temperature=0 for the generation model so Q&A answers vary little between runs and stay close to the retrieved context
llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct", temperature=0)

persist_directory = './data/docs/chroma/'
vectordb = Chroma(persist_directory=persist_directory, embedding_function=embedding)

print(vectordb._collection.count())

209


In [0]:
question = "What are major topics for this class?"
docs = vectordb.similarity_search(question, k=3)

docs[0].page_content[0:500]

"okay?  \nSo as an overview of what we're going to do in this class, this class is sort of organized \ninto four major sections. We're gonna talk about four major topics in this class, the first \nof which is supervised learning. So le t me give you an example of that.  \nSo suppose you collect a data set of housing prices. And one of the TAs, Dan Ramage, \nactually collected a data set for me last week to use in the example later. But suppose that \nyou go to collect statistics about how much hous es "

In [0]:
qa_chain = RetrievalQA.from_chain_type(
  llm=llm,
  retriever=vectordb.as_retriever(search_type="mmr"),   # MMR diversifies results, so duplicate chunks are less likely to be returned
  chain_type_kwargs={"verbose":True}
)

result = qa_chain.invoke(question)
result



Prompt after formatting:
[32;1m[1;3mSystem: Use the following pieces of context to answer the user's question. 
If you don't know the answer, just say that you don't know, don't try to make up an answer.
----------------
okay?  
So as an overview of what we're going to do in this class, this class is sort of organized 
into four major sections. We're gonna talk about four major topics in this class, the first 
of which is supervised learning. So le t me give you an example of that.  
So suppose you collect a data set of housing prices. And one of the TAs, Dan Ramage, 
actually collected a data set for me last week to use in the example later. But suppose that 
you go to collect statistics about how much hous es cost in a certain geographic area. And 
Dan, the TA, collected data from housing pr ices in Portland, Oregon. So what you can do 
is let's say plot the square footage of the house against the list price of  the house, right, so 
you collect data on a bunch of houses. And let' 

{'query': 'What are major topics for this class?',
 'result': 'The class is organized into four major sections, and the first major topic is supervised learning. The other three topics are not specified in the given context.'}

Trace(request_id=tr-9abfc6c1587b4905a320bd79661a8e6f)
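The retriever above uses `search_type="mmr"` (maximal marginal relevance) so that duplicated chunks do not crowd out other context. The sketch below shows the idea in plain Python on toy 2-D vectors: each pick maximizes relevance to the query minus similarity to what was already picked. The vectors, the helper names (`cosine`, `mmr`), and the weighting of 0.5 are assumptions for illustration; the library's implementation works on real embeddings and a pre-fetched candidate set.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr(query_vec, doc_vecs, k, lambda_mult=0.5):
    """Greedily pick k indices, trading off relevance against redundancy."""
    selected = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        best_idx, best_score = None, -math.inf
        for i in candidates:
            relevance = cosine(query_vec, doc_vecs[i])
            # Highest similarity to anything already selected (0 if nothing is selected yet)
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_mult * relevance - (1 - lambda_mult) * redundancy
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
        candidates.remove(best_idx)
    return selected


query_vec = [1.0, 0.8]
doc_vecs = [
    [1.0, 0.5],   # 0: most relevant chunk
    [1.0, 0.5],   # 1: exact duplicate of chunk 0
    [0.5, 1.0],   # 2: slightly less relevant, but different content
    [1.0, -1.0],  # 3: mostly unrelated
]

# Plain similarity ranking returns the duplicate pair
top_k = sorted(range(len(doc_vecs)), key=lambda i: -cosine(query_vec, doc_vecs[i]))[:2]
# MMR skips the duplicate in favour of the different chunk
picked = mmr(query_vec, doc_vecs, k=2)
print("similarity:", top_k)
print("mmr:", picked)
```

Plain similarity ranking selects chunks 0 and 1 (the duplicate pair), while MMR selects chunks 0 and 2, which is the behavior the notebook relies on when the same lecture is indexed twice.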

### Prompt

In [0]:
# Build prompt
template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
{context}
Question: {question}
Helpful Answer:"""

QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
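To see what the model actually receives, the sketch below fills the same two placeholders by hand with plain `str.format`. It assumes the retrieved chunks are joined with a blank line before being substituted for `{context}`, which matches the formatted prompts printed in this notebook; the shortened template and the chunk strings are placeholders for illustration.

```python
# Shortened version of the notebook's template, with the same placeholders
template = """Use the following pieces of context to answer the question at the end. Always say "thanks for asking!" at the end of the answer.
{context}
Question: {question}
Helpful Answer:"""

# Placeholder chunk texts standing in for retrieved documents
retrieved_chunks = [
    "I also assume familiarity with basic probability and statistics.",
    "We'll go over those in the discussion sections as a refresher.",
]

# Assumption: chunks are concatenated with a blank line between them
context = "\n\n".join(retrieved_chunks)

prompt_text = template.format(
    context=context,
    question="Is probability a class topic?",
)
print(prompt_text)
```

Everything the model knows about the lectures comes from the `{context}` slot, so the quality of the answer depends directly on which chunks the retriever returns.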

In [0]:
# Run chain
qa_chain = RetrievalQA.from_chain_type(
  llm=llm,
  retriever=vectordb.as_retriever(search_type="mmr", search_kwargs={"k": 3}),
  return_source_documents=True,                     # Also return the retrieved documents for inspection
  chain_type_kwargs={"prompt": QA_CHAIN_PROMPT, "verbose":True}     # Override the default prompt, which has no instructions on response length or the closing thank-you
)

In [0]:
question = "Is probability a class topic?"

result = qa_chain.invoke(question)
result["result"]



Prompt after formatting:
[32;1m[1;3mUse the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Use three sentences maximum. Keep the answer as concise as possible. Always say "thanks for asking!" at the end of the answer. 
statistics for a while or maybe algebra, we'll go over those in the discussion sections as a 
refresher for those of you that want one.  
Later in this quarter, we'll also use the disc ussion sections to go over extensions for the 
material that I'm teaching in the main lectur es. So machine learning is a huge field, and 
there are a few extensions that we really want  to teach but didn't have time in the main 
lectures for.

of this class will not be very program ming intensive, although we will do some 
programming, mostly in either MATLAB or Octa ve. I'll say a bit more about that later.  
I also assume familiarity with basic proba bility and statistics. So mo

'The class assumes familiarity with basic probability, and it will be reviewed in discussion sections as a refresher. Probability is a prerequisite, not a main topic to be taught. thanks for asking!'

Trace(request_id=tr-020459cd089e4e4882e8f6b1d1b26dfe)

In [0]:
result["source_documents"][0]

Document(metadata={'page': 8, 'source': './data/docs/cs229_lectures/MachineLearning-Lecture01.pdf'}, page_content="statistics for a while or maybe algebra, we'll go over those in the discussion sections as a \nrefresher for those of you that want one.  \nLater in this quarter, we'll also use the disc ussion sections to go over extensions for the \nmaterial that I'm teaching in the main lectur es. So machine learning is a huge field, and \nthere are a few extensions that we really want  to teach but didn't have time in the main \nlectures for.")