### This notebook does a bare bones implementation of RAG using the unprocessed MIDS intranet content as is, no tuning/tweaking. OpenAI gpt 3.5 is used as the llm

In [24]:
import os
import hashlib
from langchain_community.vectorstores import Qdrant
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI

In [2]:
base_embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")

  from .autonotebook import tqdm as notebook_tqdm
You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.6.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





#### Load data

In [3]:
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

In [4]:
def create_document(file_path):
    # Read the content of the file
    content = read_file(file_path)
    
    file_name = os.path.basename(file_path)\
    
    # Strip the .txt suffix and replace underscores with slashes
    source_url = file_name.replace('.txt', '').replace('_', '/')
    metadata = {"source": source_url}
    
    # Create a Document object with content and metadata
    return Document(page_content=content, metadata=metadata)

In [5]:
def find_txt_files(directories):
    txt_files = []
    for directory in directories:
        for root, _, files in os.walk(directory):
            for file in files:
                if file.endswith('.txt'):
                    txt_files.append(os.path.join(root, file))
    return txt_files

In [6]:
data_directories = [
    '../data/mids_site_content/courses/without_whitespace',
    '../data/mids_site_content/intranet/without_whitespace',
    '../data/mids_site_content/public_pages/without_whitespace'
]

In [7]:
file_paths = find_txt_files(data_directories)
documents = [create_document(file_path) for file_path in file_paths]

In [29]:
documents[0]

Document(page_content="Data Science 281. Computer Vision | UC Berkeley School of Information Skip to main content Submit UC Berkeley Give Volunteer Contact Us Alumni Alumni Get Involved & Give Back Sign Up to Volunteer Stay Connected I School Slack Alumni News Alumni Events Alumni Accounts Alumni Computing Services MIDS & MICS for Life Career Support Intranet Intranet For Students For MIMS Students MIMS Student Handbook MIMS Forms MIMS Final Project Advising Appointments Tuition & Fees Funding Your Education Technology Support Resources Conference Travel Grant Project Pages MIMS Student Leadership Commencement For MIDS Students MIDS Student Handbook Academic Calendar MIDS Forms Class Registration Acceleration Requests Tuition, Fees, and Funding Who to Contact MIDS Student Representatives Immersion Program Conference Travel Grant Capstone Project Pages Project Pages Commencement For 5th Year MIDS Students 5th Year MIDS Student Handbook Academic Calendar 5th Year MIDS Forms Class Registr

#### Chunk documents based on semantic meaning

In [8]:
text_splitter = SemanticChunker(
    base_embeddings,
    breakpoint_threshold_type='percentile'
)

In [9]:
split_documents = text_splitter.split_documents(documents)

#### De-duplicate documents, this is a temporary workaround to strip the site header and footer text (since the data hasn't been sanitized yet)

In [44]:
# Example of text we're stripping out duplicate copies of
# This document doesnt actually tell us anything (and we have another copy of it for every site page)
split_documents[0]

Document(page_content='Data Science 281. Computer Vision | UC Berkeley School of Information Skip to main content Submit UC Berkeley Give Volunteer Contact Us Alumni Alumni Get Involved & Give Back Sign Up to Volunteer Stay Connected I School Slack Alumni News Alumni Events Alumni Accounts Alumni Computing Services MIDS & MICS for Life Career Support Intranet Intranet For Students For MIMS Students MIMS Student Handbook MIMS Forms MIMS Final Project Advising Appointments Tuition & Fees Funding Your Education Technology Support Resources Conference Travel Grant Project Pages MIMS Student Leadership Commencement For MIDS Students MIDS Student Handbook Academic Calendar MIDS Forms Class Registration Acceleration Requests Tuition, Fees, and Funding Who to Contact MIDS Student Representatives Immersion Program Conference Travel Grant Capstone Project Pages Project Pages Commencement For 5th Year MIDS Students 5th Year MIDS Student Handbook Academic Calendar 5th Year MIDS Forms Class Registr

In [10]:
# Generates a hash of the document text, strips out duplicates
def deduplicate_documents(documents):
    seen_hashes = set()
    unique_documents = []
    for doc in documents:
        doc_hash = hashlib.md5(doc.page_content.encode('utf-8')).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique_documents.append(doc)
    return unique_documents

In [11]:
len(split_documents)

333

In [12]:
split_documents = deduplicate_documents(documents)

In [13]:
len(split_documents)

67

#### Load documents into vector db

In [14]:
vectorstore = Qdrant.from_documents(split_documents,
    base_embeddings,
    location=":memory:",
)

In [15]:
retriever = vectorstore.as_retriever()

### Prompt definitions

In [40]:
# We will need to tune this prompt

baseline_user_prompt = """
You are a helpful assistant that answers questions about the UC Berkeley Masters of Information in Data Science (MIDS) program.
You will be provided a list of relevant context documents and a question.
Provide an answer to the question based on information from the relevant context

Please answer the question below only based on the context information provided.

### Here is a context:
{context} 

### Here is a question:
{question}
"""

In [41]:
rag_prompt = ChatPromptTemplate.from_template(baseline_user_prompt)

In [42]:
prompt_template.invoke({
    "context": ["abcd", "efgh"],
    "question": "jklmnop?"
})

ChatPromptValue(messages=[SystemMessage(content='\nYou are a helpful assistant that answers questions about the UC Berkeley Masters of Information in Data Science (MIDS) program.\nYou will be provided a list of relevant context documents and a question.\nProvide an answer to the question based on information from the relevant context\n'), HumanMessage(content="\nPlease answer the question below only based on the context information provided.\n\n### Here is a context:\n['abcd', 'efgh'] \n\n### Here is a question:\njklmnop?\n")])

### Pipe retriever and llm together

##### Requires OPENAI_API_KEY env var to be set

In [22]:
os.environ['OPENAI_API_KEY'] = ""

In [65]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

In [31]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [66]:
rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough() }
    | rag_prompt
    | llm
)

In [45]:
rag_chain.invoke("What is the MIDS program").content

"The MIDS program is an online degree offered by UC Berkeley's School of Information that prepares data science professionals to solve real-world problems. It equips graduates with interdisciplinary skills to tackle complex human, social, economic, and health issues using data."

In [46]:
rag_chain.invoke("What are the prerequisites for taking the 210 course?").content

'The prerequisites for taking the Data Science 210A Capstone for Early Career Data Scientists course are DATASCI 200, DATASCI 201A, DATASCI 203, DATASCI 205, and DATASCI 207. Additionally, the course must be taken in the final term of the 5th Year MIDS program.'

In [47]:
rag_chain.invoke("Does the MIDS program have a payment plan? How does it work?").content

'Yes, the MIDS program offers a Fee Payment Plan (FPP) for the Fall and Spring terms. The FPP allows students to pay their tuition and fees in installments rather than a lump sum. More details and deadlines for the Fee Payment Plan can be found on the billing website. Please note that the Fee Payment Plan is not available for the Summer terms.'

In [49]:
rag_chain.invoke("What kind of additional fees does the program have outside of regular tuition? List the costs").content

'The UC Berkeley Masters of Information in Data Science (MIDS) program has the following additional fees outside of regular tuition:\n1. Berkeley Campus Fee:\n   - $790.25 for Fall 2023 & Spring 2024\n   - $431 for Summer 2024\n2. Document Management Fee: $107 (charged in the first term of enrollment)\n3. Instructional Resilience and Enhancement Fee:\n   - $117.50 for Fall and Spring terms\n   - $81 for Summer 2024 (if not enrolled in Spring 2024)\n4. UC Graduate and Professional Council (UCGPC) Fee: $3.50 for Fall and Spring terms\n5. Effective Fall 2024 Student Health Insurance Plan (SHIP): $3,078.00 for Fall and Spring terms only\n6. Immersion Fee: $500\n\nPlease note that fees are subject to change each academic year.'

In [52]:
rag_chain.invoke("Provide an short overview of the MIDS program degree requirements").content

'The Master of Information and Data Science (MIDS) program at UC Berkeley requires students to complete a total of 27 units, which includes core courses, elective courses, and a capstone project. Students must also attend an immersion program and fulfill all academic and administrative requirements set by the School of Information. Additionally, students are expected to maintain a minimum GPA of 3.0 in order to graduate from the program.'

In [54]:
rag_chain.invoke("What are the core courses students are required to take? Include the course numbers").content

'The core courses that students are required to take in the MIDS program are:\n\n1. Introduction to Data Science Programming (Course Number: Data Science 200)\n2. Research Design and Application for Data and Analysis (Course Number: Data Science 201)\n3. Statistics for Data Science (Course Number: Data Science 203)\n4. Fundamentals of Data Engineering (Course Number: Data Science 205)\n5. Applied Machine Learning (Course Number: Data Science 207)'

In [68]:
rag_chain.invoke("What is immersion? When does it happen and what are the requirements?").content

'Immersion is an academic event and a requirement for the UC Berkeley Masters of Information in Data Science (MIDS) program. It is an opportunity for students to meet faculty and peers in person on the UC Berkeley campus or in other relevant locations. Immersion typically lasts two to three days and includes presentations by faculty and industry experts, socializing with classmates, career advancement workshops, and networking opportunities. Immersion must be completed before the end of the Capstone course since participation is a graduation requirement.'

In [64]:
rag_chain.invoke("What are the main differences between 5th year MIDS students and regular MIDS students?").content

"The main differences between 5th Year MIDS students and regular MIDS students at UC Berkeley are as follows:\n\n1. **Streamlined Path for Cal Undergraduates:**\n   - The 5th Year MIDS program offers a streamlined path to the MIDS degree specifically for UC Berkeley undergraduates, allowing them to transition directly into the master's program.\n\n2. **Program Duration and Structure:**\n   - 5th Year MIDS is a lock-step program that requires students to complete the degree over four consecutive semesters (fall, spring, summer, and fall).\n   - Students in the 5th Year MIDS program are required to enroll in specific DATASCI courses each term in a predefined sequence.\n\n3. **Summer Practicum and Internship Encouragement:**\n   - 5th Year MIDS students are required to complete a 1-unit summer practicum course (DATASCI 293) and are strongly encouraged to pursue a full-time internship during the summer semester, although it is not mandatory.\n   - This focus on a summer practicum and inter