### This notebook is a bare-bones implementation of RAG over the unprocessed MIDS intranet content, with no tuning or tweaking. OpenAI GPT-3.5 is used as the LLM

In [1]:
import os
import hashlib
from langchain_community.vectorstores import Qdrant
from langchain_experimental.text_splitter import SemanticChunker
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain.docstore.document import Document
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_openai import ChatOpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, pipeline
from langchain_community.llms import HuggingFacePipeline



In [2]:
base_embeddings = HuggingFaceEmbeddings(model_name="multi-qa-mpnet-base-dot-v1")

You try to use a model that was created with version 3.0.0.dev0, however, your version is 2.6.1. This might cause unexpected behavior or errors. In that case, try to update to the latest version.





#### Load data

In [3]:
def read_file(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        return file.read()

In [4]:
def create_document(file_path):
    # Read the content of the file
    content = read_file(file_path)
    
    file_name = os.path.basename(file_path)

    # Strip the .txt suffix and replace underscores with slashes
    source_url = os.path.splitext(file_name)[0].replace('_', '/')
    metadata = {"source": source_url}
    
    # Create a Document object with content and metadata
    return Document(page_content=content, metadata=metadata)
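
The filename-to-source mapping can be checked on its own. This is a minimal sketch, assuming the scraped files were named by replacing `/` with `_` in the page path; `source_from_filename` and the example path are hypothetical and only mirror the metadata logic above. Note that the mapping is lossy: an underscore that was part of the original URL also becomes a slash.

```python
import os


def source_from_filename(file_path: str) -> str:
    """Hypothetical helper mirroring the metadata logic in create_document."""
    file_name = os.path.basename(file_path)
    # Remove only a trailing ".txt" extension
    stem = file_name[:-len(".txt")] if file_name.endswith(".txt") else file_name
    # Underscores stand in for path separators in the scraped file names
    return stem.replace("_", "/")


# Illustrative path only; not taken from the real data directory
print(source_from_filename("data/courses_datasci_281.txt"))
```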

In [5]:
def find_txt_files(directories):
    txt_files = []
    for directory in directories:
        for root, _, files in os.walk(directory):
            for file in files:
                if file.endswith('.txt'):
                    txt_files.append(os.path.join(root, file))
    return txt_files
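
An alternative sketch of the same file discovery using `pathlib`. The function name `find_txt_files_pathlib` is hypothetical. Sorting the result is a design choice rather than something the notebook does: `os.walk` order depends on the filesystem, so sorting keeps `documents[0]` stable across machines.

```python
import tempfile
from pathlib import Path


def find_txt_files_pathlib(directories):
    """Return a sorted list of all .txt file paths under the given directories."""
    found = []
    for directory in directories:
        # rglob walks the directory tree recursively
        found.extend(str(p) for p in Path(directory).rglob("*.txt") if p.is_file())
    return sorted(found)


# Small self-contained demo on a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.txt").write_text("x", encoding="utf-8")
    (Path(tmp) / "sub" / "b.txt").write_text("y", encoding="utf-8")
    (Path(tmp) / "sub" / "c.md").write_text("z", encoding="utf-8")
    demo_files = find_txt_files_pathlib([tmp])
    print(len(demo_files))
```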

In [6]:
data_directories = [
    '../data/mids_site_content/courses/without_whitespace',
    '../data/mids_site_content/intranet/without_whitespace',
    '../data/mids_site_content/public_pages/without_whitespace'
]

In [7]:
file_paths = find_txt_files(data_directories)
documents = [create_document(file_path) for file_path in file_paths]

In [8]:
documents[0]

Document(page_content="Data Science 281. Computer Vision | UC Berkeley School of Information Skip to main content Submit UC Berkeley Give Volunteer Contact Us Alumni Alumni Get Involved & Give Back Sign Up to Volunteer Stay Connected I School Slack Alumni News Alumni Events Alumni Accounts Alumni Computing Services MIDS & MICS for Life Career Support Intranet Intranet For Students For MIMS Students MIMS Student Handbook MIMS Forms MIMS Final Project Advising Appointments Tuition & Fees Funding Your Education Technology Support Resources Conference Travel Grant Project Pages MIMS Student Leadership Commencement For MIDS Students MIDS Student Handbook Academic Calendar MIDS Forms Class Registration Acceleration Requests Tuition, Fees, and Funding Who to Contact MIDS Student Representatives Immersion Program Conference Travel Grant Capstone Project Pages Project Pages Commencement For 5th Year MIDS Students 5th Year MIDS Student Handbook Academic Calendar 5th Year MIDS Forms Class Registr

#### Chunk documents based on semantic meaning

In [9]:
text_splitter = SemanticChunker(
    base_embeddings,
    breakpoint_threshold_type='percentile'
)
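
To make the `percentile` breakpoint idea concrete, here is a simplified sketch in plain Python. It is not the `SemanticChunker` implementation (which, as far as we know, also merges neighboring sentences before embedding); it only illustrates the core rule: compute the distance between consecutive sentence embeddings and start a new chunk wherever that distance exceeds a chosen percentile of all distances. The toy vectors, the 95th-percentile default, and all function names here are assumptions for illustration.

```python
import math


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_on_percentile(sentences, vectors, pct=95.0):
    """Group sentences into chunks, breaking where the embedding distance spikes."""
    distances = [
        cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)
    ]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        if dist > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy 2-D "embeddings": three sentences on topic A, then two on topic B
demo_sentences = ["A1", "A2", "A3", "B1", "B2"]
demo_vectors = [[1, 0], [0.9, 0.1], [1, 0.05], [0, 1], [0.1, 0.9]]
demo_chunks = split_on_percentile(demo_sentences, demo_vectors)
print(demo_chunks)
```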

In [10]:
split_documents = text_splitter.split_documents(documents)

In [11]:
# text_splitter_recursive = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=0)
# split_documents = text_splitter_recursive.split_documents(documents)

#### De-duplicate documents. This is a temporary workaround to strip the repeated site header and footer text (since the data hasn't been sanitized yet)

In [12]:
# Example of text we're stripping out duplicate copies of
# This document doesn't actually tell us anything (and we have another copy of it for every site page)
split_documents[0]

Document(page_content='Data Science 281. Computer Vision | UC Berkeley School of Information Skip to main content Submit UC Berkeley Give Volunteer Contact Us Alumni Alumni Get Involved & Give Back Sign Up to Volunteer Stay Connected I School Slack Alumni News Alumni Events Alumni Accounts Alumni Computing Services MIDS & MICS for Life Career Support Intranet Intranet For Students For MIMS Students MIMS Student Handbook MIMS Forms MIMS Final Project Advising Appointments Tuition & Fees Funding Your Education Technology Support Resources Conference Travel Grant Project Pages MIMS Student Leadership Commencement For MIDS Students MIDS Student Handbook Academic Calendar MIDS Forms Class Registration Acceleration Requests Tuition, Fees, and Funding Who to Contact MIDS Student Representatives Immersion Program Conference Travel Grant Capstone Project Pages Project Pages Commencement For 5th Year MIDS Students 5th Year MIDS Student Handbook Academic Calendar 5th Year MIDS Forms Class Registr

In [13]:
# Generates a hash of the document text, strips out duplicates
def deduplicate_documents(documents):
    seen_hashes = set()
    unique_documents = []
    for doc in documents:
        doc_hash = hashlib.md5(doc.page_content.encode('utf-8')).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            unique_documents.append(doc)
    return unique_documents
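
Exact-hash de-duplication only removes chunks that are byte-for-byte identical. A possible variant, sketched here under the assumption that some repeated header chunks differ only in whitespace or letter case, normalizes the text before hashing. `Doc` is a hypothetical stand-in for the LangChain `Document` class so the example runs without extra dependencies, and `dedupe_normalized` is not part of the notebook.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def dedupe_normalized(docs):
    """Keep the first document for each whitespace- and case-normalized text."""
    seen, unique = set(), []
    for doc in docs:
        # Collapse runs of whitespace and lowercase before hashing
        normalized = " ".join(doc.page_content.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


demo_docs = [
    Doc("Skip to main content"),
    Doc("skip  to main\ncontent"),
    Doc("DATASCI 281"),
]
deduped = dedupe_normalized(demo_docs)
print(len(deduped))
```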

In [14]:
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

In [15]:
len(split_documents)

333

In [16]:
split_documents = deduplicate_documents(split_documents)

In [17]:
len(split_documents)

232

In [18]:
split_documents[0]

Document(page_content='Data Science 281. Computer Vision | UC Berkeley School of Information Skip to main content Submit UC Berkeley Give Volunteer Contact Us Alumni Alumni Get Involved & Give Back Sign Up to Volunteer Stay Connected I School Slack Alumni News Alumni Events Alumni Accounts Alumni Computing Services MIDS & MICS for Life Career Support Intranet Intranet For Students For MIMS Students MIMS Student Handbook MIMS Forms MIMS Final Project Advising Appointments Tuition & Fees Funding Your Education Technology Support Resources Conference Travel Grant Project Pages MIMS Student Leadership Commencement For MIDS Students MIDS Student Handbook Academic Calendar MIDS Forms Class Registration Acceleration Requests Tuition, Fees, and Funding Who to Contact MIDS Student Representatives Immersion Program Conference Travel Grant Capstone Project Pages Project Pages Commencement For 5th Year MIDS Students 5th Year MIDS Student Handbook Academic Calendar 5th Year MIDS Forms Class Registr

#### Load documents into vector db

In [19]:
vectorstore = Qdrant.from_documents(
    split_documents,
    base_embeddings,
    location=":memory:",
)

In [20]:
retriever = vectorstore.as_retriever()

# Loading the documents into the db consumes ~5GB of VRAM (using semantic chunking)
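
Conceptually, the retriever embeds the query, scores every stored chunk against it, and returns the best few. The sketch below shows that ranking step in plain Python. The dot-product scoring, the default of `k=4` (which we believe matches the retriever's default, though this was not verified here), and the function name are all assumptions for illustration; the metric actually used depends on how the Qdrant collection is configured.

```python
def top_k_by_dot_product(query_vec, doc_vecs, k=4):
    """Return the indices of the k document vectors with the highest dot product."""
    scored = [
        (sum(q * d for q, d in zip(query_vec, vec)), idx)
        for idx, vec in enumerate(doc_vecs)
    ]
    # Highest score first; ties are broken by original position
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-D vectors standing in for real embeddings
demo_query = [1.0, 0.0]
demo_doc_vecs = [[0.9, 0.1], [0.0, 1.0], [0.5, 0.5], [1.0, 0.0]]
demo_hits = top_k_by_dot_product(demo_query, demo_doc_vecs, k=2)
print(demo_hits)
```

Because every retrieved chunk is pasted into the prompt, a smaller `k` directly reduces the context size passed to the LLM.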

In [21]:
!nvidia-smi

Sun Jun  2 22:50:28 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
| 73%   53C    P0            363W /  450W |    5335MiB /  24564MiB |     94%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


### Prompt definitions

In [18]:
# We will need to tune this prompt

baseline_user_prompt = """
You are a helpful assistant that answers questions about the UC Berkeley Masters of Information in Data Science (MIDS) program.
You will be provided a list of relevant context documents and a question.
Provide an answer to the question based on information from the relevant context

Please answer the question below only based on the context information provided.

### Here is a context:
{context} 

### Here is a question:
{question}
"""

In [19]:
rag_prompt = ChatPromptTemplate.from_template(baseline_user_prompt)

In [21]:
rag_prompt.invoke({
    "context": ["abcd", "efgh"],
    "question": "jklmnop?"
})

ChatPromptValue(messages=[HumanMessage(content="\nYou are a helpful assistant that answers questions about the UC Berkeley Masters of Information in Data Science (MIDS) program.\nYou will be provided a list of relevant context documents and a question.\nProvide an answer to the question based on information from the relevant context\n\nPlease answer the question below only based on the context information provided.\n\n### Here is a context:\n['abcd', 'efgh'] \n\n### Here is a question:\njklmnop?\n")])
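
The rendered prompt above contains the Python list repr `['abcd', 'efgh']` because a list was passed directly as `context`. In the chain, `format_docs` joins the chunks into plain text first. A small sketch of the difference, using `str.format` as a stand-in for `ChatPromptTemplate` (an assumption made only to keep the example dependency-free):

```python
demo_template = (
    "### Here is a context:\n{context}\n\n### Here is a question:\n{question}"
)
demo_chunks_for_prompt = ["abcd", "efgh"]

# Passing the list directly embeds its repr, brackets and quotes included
prompt_from_list = demo_template.format(
    context=demo_chunks_for_prompt, question="jklmnop?"
)

# Joining first (what format_docs does) yields clean text
prompt_from_text = demo_template.format(
    context="\n\n".join(demo_chunks_for_prompt), question="jklmnop?"
)

print(prompt_from_list)
print(prompt_from_text)
```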

### Pipe retriever and llm together

#### Try with OpenAI GPT-3.5 as baseline

##### Requires OPENAI_API_KEY env var to be set

In [22]:
os.environ['OPENAI_API_KEY'] = ""
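
Rather than pasting the key into the notebook, one option is to read it from the environment and prompt only when it is missing. This is a minimal sketch; `ensure_api_key` and its `prompt_fn` parameter are hypothetical helpers, not part of the notebook or of any library.

```python
import os
from getpass import getpass


def ensure_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key from the environment, prompting for it only if it is unset."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in cell output
        key = prompt_fn(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} is required but was not provided")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` once before constructing the LLM keeps the secret out of the saved notebook.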

In [65]:
llm = ChatOpenAI(model_name="gpt-3.5-turbo")

In [66]:
rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough() }
    | rag_prompt
    | llm
)
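
For readers unfamiliar with the `|` syntax, the chain above is roughly equivalent to the following plain-Python composition. This is a sketch only: `make_chain` and the fake components are hypothetical stand-ins, and the real LangChain runnables add batching, streaming, and tracing on top of this.

```python
def make_chain(retrieve, format_docs, render_prompt, llm):
    """Compose retrieval, formatting, prompting and generation into one callable."""
    def invoke(question):
        inputs = {
            # "context" branch: retriever | format_docs
            "context": format_docs(retrieve(question)),
            # "question" branch: RunnablePassthrough() forwards the input unchanged
            "question": question,
        }
        return llm(render_prompt(inputs))
    return invoke


# Fake components so the sketch runs without a vector store or an API key
def fake_retrieve(question):
    return ["chunk one", "chunk two"]


def fake_format_docs(chunks):
    return "\n\n".join(chunks)


def fake_render_prompt(inputs):
    return "Context:\n{context}\n\nQuestion:\n{question}".format(**inputs)


def fake_llm(prompt):
    return f"(model saw {len(prompt)} characters)"


demo_chain = make_chain(fake_retrieve, fake_format_docs, fake_render_prompt, fake_llm)
demo_answer = demo_chain("What is the MIDS program")
print(demo_answer)
```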

In [45]:
rag_chain.invoke("What is the MIDS program").content

"The MIDS program is an online degree offered by UC Berkeley's School of Information that prepares data science professionals to solve real-world problems. It equips graduates with interdisciplinary skills to tackle complex human, social, economic, and health issues using data."

In [46]:
rag_chain.invoke("What are the prerequisites for taking the 210 course?").content

'The prerequisites for taking the Data Science 210A Capstone for Early Career Data Scientists course are DATASCI 200, DATASCI 201A, DATASCI 203, DATASCI 205, and DATASCI 207. Additionally, the course must be taken in the final term of the 5th Year MIDS program.'

In [47]:
rag_chain.invoke("Does the MIDS program have a payment plan? How does it work?").content

'Yes, the MIDS program offers a Fee Payment Plan (FPP) for the Fall and Spring terms. The FPP allows students to pay their tuition and fees in installments rather than a lump sum. More details and deadlines for the Fee Payment Plan can be found on the billing website. Please note that the Fee Payment Plan is not available for the Summer terms.'

In [49]:
rag_chain.invoke("What kind of additional fees does the program have outside of regular tuition? List the costs").content

'The UC Berkeley Masters of Information in Data Science (MIDS) program has the following additional fees outside of regular tuition:\n1. Berkeley Campus Fee:\n   - $790.25 for Fall 2023 & Spring 2024\n   - $431 for Summer 2024\n2. Document Management Fee: $107 (charged in the first term of enrollment)\n3. Instructional Resilience and Enhancement Fee:\n   - $117.50 for Fall and Spring terms\n   - $81 for Summer 2024 (if not enrolled in Spring 2024)\n4. UC Graduate and Professional Council (UCGPC) Fee: $3.50 for Fall and Spring terms\n5. Effective Fall 2024 Student Health Insurance Plan (SHIP): $3,078.00 for Fall and Spring terms only\n6. Immersion Fee: $500\n\nPlease note that fees are subject to change each academic year.'

In [52]:
rag_chain.invoke("Provide an short overview of the MIDS program degree requirements").content

'The Master of Information and Data Science (MIDS) program at UC Berkeley requires students to complete a total of 27 units, which includes core courses, elective courses, and a capstone project. Students must also attend an immersion program and fulfill all academic and administrative requirements set by the School of Information. Additionally, students are expected to maintain a minimum GPA of 3.0 in order to graduate from the program.'

In [54]:
rag_chain.invoke("What are the core courses students are required to take? Include the course numbers").content

'The core courses that students are required to take in the MIDS program are:\n\n1. Introduction to Data Science Programming (Course Number: Data Science 200)\n2. Research Design and Application for Data and Analysis (Course Number: Data Science 201)\n3. Statistics for Data Science (Course Number: Data Science 203)\n4. Fundamentals of Data Engineering (Course Number: Data Science 205)\n5. Applied Machine Learning (Course Number: Data Science 207)'

In [68]:
rag_chain.invoke("What is immersion? When does it happen and what are the requirements?").content

'Immersion is an academic event and a requirement for the UC Berkeley Masters of Information in Data Science (MIDS) program. It is an opportunity for students to meet faculty and peers in person on the UC Berkeley campus or in other relevant locations. Immersion typically lasts two to three days and includes presentations by faculty and industry experts, socializing with classmates, career advancement workshops, and networking opportunities. Immersion must be completed before the end of the Capstone course since participation is a graduation requirement.'

In [64]:
rag_chain.invoke("What are the main differences between 5th year MIDS students and regular MIDS students?").content

"The main differences between 5th Year MIDS students and regular MIDS students at UC Berkeley are as follows:\n\n1. **Streamlined Path for Cal Undergraduates:**\n   - The 5th Year MIDS program offers a streamlined path to the MIDS degree specifically for UC Berkeley undergraduates, allowing them to transition directly into the master's program.\n\n2. **Program Duration and Structure:**\n   - 5th Year MIDS is a lock-step program that requires students to complete the degree over four consecutive semesters (fall, spring, summer, and fall).\n   - Students in the 5th Year MIDS program are required to enroll in specific DATASCI courses each term in a predefined sequence.\n\n3. **Summer Practicum and Internship Encouragement:**\n   - 5th Year MIDS students are required to complete a 1-unit summer practicum course (DATASCI 293) and are strongly encouraged to pursue a full-time internship during the summer semester, although it is not mandatory.\n   - This focus on a summer practicum and inter

### Try with Phi-3 mini (with 4k context)

The 4k context only seems to be sufficient for more basic questions (heavy tuning of the chunking process might improve this and allow us to use this model instead)

In [23]:
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500,
)

A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- configuration_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct:
- modeling_phi3.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00,  1.34it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


### The 4k context window is too small (at least with semantic chunking). Use the 128k model, but cap it at 20k context

In [24]:
!nvidia-smi

Sun Jun  2 22:44:29 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
|  0%   43C    P0             68W /  450W |    8929MiB /  24564MiB |      1%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


#### Loading the smallest Phi-3 mini 4k-context model consumes roughly another 4GB of VRAM (~9GB total)

In [23]:
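# Reduce context window to 20k
# NOTE: `config` is not passed to from_pretrained() below, so this override
# has no effect as written; pass config=config to actually apply it.
config = AutoConfig.from_pretrained("microsoft/Phi-3-mini-128k-instruct")
config.max_position_embeddings = 20000  # Set to 20k tokens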

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct", 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True, 
)
tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-128k-instruct")

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500,
    max_length=20000
)

The repository for microsoft/Phi-3-mini-128k-instruct contains custom code which must be executed to correctly load the model. You can inspect the repository content at https://hf.co/microsoft/Phi-3-mini-128k-instruct.
You can avoid this prompt in future by passing the argument `trust_remote_code=True`.

Do you wish to run the custom code? [y/N]  y


`flash-attention` package not found, consider installing for better performance: No module named 'flash_attn'.
Current `flash-attention` does not support `window_size`. Either upgrade or use `attn_implementation='eager'`.
Loading checkpoint shards: 100%|██████████████████| 2/2 [00:01<00:00,  1.23it/s]
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


In [31]:
llm_phi3 = HuggingFacePipeline(pipeline=pipe)

In [32]:
baseline_phi3_user_prompt = """
<|user|>
You are a helpful assistant that answers questions about the UC Berkeley Masters of Information in Data Science (MIDS) program.
You will be provided a list of relevant context documents and a question.
Provide an answer to the question based on information from the relevant context

Please answer the question below only based on the context information provided.

### Here is a context:
{context} 

### Here is a question:
{question}
<|end|>
<|assistant|>
"""

In [33]:
phi3_rag_prompt = ChatPromptTemplate.from_template(baseline_phi3_user_prompt)

In [34]:
# Given a query, retrieves documents
# Returns the number of tokens in the full context from the retrieved documents
def count_tokens_in_context(query):
    return len(tokenizer.tokenize("".join([doc.page_content for doc in retriever.invoke(query)])))

In [35]:
count_tokens_in_context("What are the prerequisites for taking the 210 course?")

7221
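
At 7221 tokens, the retrieved context alone is well beyond a 4k window. One simple mitigation, sketched below, is to keep retrieved chunks in rank order until a token budget is used up. `fit_to_budget` is a hypothetical helper, and the default whitespace-based counter is only a stand-in; in the notebook one would pass something like `lambda text: len(tokenizer.tokenize(text))` instead.

```python
def fit_to_budget(chunks, max_tokens, count_tokens=lambda text: len(text.split())):
    """Keep chunks in ranked order until adding the next one would exceed max_tokens."""
    kept, used = [], 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        kept.append(chunk)
        used += cost
    return kept


demo_ranked_chunks = ["a b c", "d e", "f g h i"]
demo_kept = fit_to_budget(demo_ranked_chunks, max_tokens=5)
print(demo_kept)
```

The budget should leave room for the prompt template, the question, and the tokens the model is allowed to generate.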

In [36]:
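def extract_assistant_resp(raw_resp):
    # Raise a real exception; raising a bare string is a TypeError in Python 3
    if "<|assistant|>" not in raw_resp:
        raise ValueError(f"Could not identify assistant token in response: {raw_resp}")

    # Keep only the text generated after the assistant marker
    return raw_resp.split("<|assistant|>")[1]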

In [37]:
phi3_rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough() }
    | phi3_rag_prompt
    | llm_phi3
    | extract_assistant_resp
)

In [38]:
phi3_rag_chain.invoke("What is the MIDS program")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)
You are not running the flash-attention implementation, expect numerical differences.


'\n The MIDS program at UC Berkeley is a Masters of Information in Data Science program that focuses on problem-solving through data collection, analysis, and presentation. It prepares students to address problems in areas of human interest, government, and business. The program is multidisciplinary, challenging, and rigorous, with a curriculum designed to cultivate initiative, encourage participation, and foster relationships with instructors and peers.'

In [33]:
!nvidia-smi

Sun Jun  2 22:47:02 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
| 64%   47C    P0             72W /  450W |   12409MiB /  24564MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [39]:
phi3_rag_chain.invoke("What are the prerequisites for taking the 210 course?")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


OutOfMemoryError: CUDA out of memory. Tried to allocate 6.43 GiB. GPU 0 has a total capacity of 23.55 GiB of which 4.52 GiB is free. Including non-PyTorch memory, this process has 18.23 GiB memory in use. Of the allocated memory 17.59 GiB is allocated by PyTorch, and 187.58 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
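
One plausible explanation for the size of the failed allocation: without flash-attention, the attention scores are materialized as a full `heads x seq_len x seq_len` matrix. The back-of-the-envelope sketch below assumes 32 attention heads and float32 scores (4 bytes each) for Phi-3 mini; both are assumptions about the model rather than measurements. Under those assumptions, a prompt of roughly 7,300 tokens (the 7,221 context tokens counted earlier plus the template and question) needs about 6.4 GiB for a single layer's score matrix, which is in line with the error above.

```python
def attention_scores_gib(seq_len, num_heads=32, bytes_per_value=4, batch_size=1):
    """Estimate the size in GiB of one layer's full attention score matrix."""
    total_bytes = batch_size * num_heads * seq_len * seq_len * bytes_per_value
    return total_bytes / 2**30


# Memory grows quadratically with the prompt length
for n_tokens in (4096, 7344, 20000):
    print(n_tokens, round(attention_scores_gib(n_tokens), 2))
```

Because the cost is quadratic in sequence length, trimming the retrieved context is likely to help far more than reducing `max_new_tokens`.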

In [40]:
phi3_rag_chain.invoke("Does the MIDS program have a payment plan? How does it work?")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


OutOfMemoryError: CUDA out of memory. Tried to allocate 3.04 GiB. GPU 0 has a total capacity of 23.55 GiB of which 1.47 GiB is free. Including non-PyTorch memory, this process has 21.27 GiB memory in use. Of the allocated memory 17.49 GiB is allocated by PyTorch, and 3.32 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

In [43]:
phi3_rag_chain.invoke("What kind of additional fees does the program have outside of regular tuition? List the costs")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


'\n The program has several additional fees outside of regular tuition, which include:\n\n1. Cybersecurity Certificate Fee\n2. Data Science Certificate Fee\n3. Applied Data Science Certificate Fee\n4. Document Management Fee (one-time, first term of enrollment)\n5. Health Insurance Plan (UC SHIP) Fee (automatically charged unless waived)\n6. UC Student Health Insurance Plan (UC SHIP) Waiver Application Fee (for those opting out)\n7. Immersion Program Conference Travel Grant Fee\n8. Capstone Project Fee\n9. Late Registration Fee (after the deadline for official registration)\n10. Fee Payment Plan (for Fall and Spring terms, not available for Summer terms)\n\nPlease note that the exact amounts for some of these fees, such as the Immersion Program Conference Travel Grant Fee and the Capstone Project Fee, are not specified in the provided context.'

In [44]:
phi3_rag_chain.invoke("Provide an short overview of the MIDS program degree requirements")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


'\n The MIDS program requires students to complete a minimum of 27 semester units, comprising nine three-unit courses across three categories: Foundation Courses (15 units), Advanced Courses (9 units), and the Capstone Course (3 units). Students must also attend one Immersion Program to fulfill the degree requirements.'

In [41]:
phi3_rag_chain.invoke("What are the core courses students are required to take? Include the course numbers")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


'\n The core courses required for students are:\n\n1. Data Science 201 - Introduction to Data Science Programming (3 units)\n2. Data Science 203 - Research Design and Applications for Data and Analysis (3 units)\n3. Data Science 205 - Fundamentals of Data Engineering (3 units)\n4. Data Science 207 - Applied Machine Learning (3 units)\n5. Data Science 261 - Machine Learning at Scale (3 units)\n6. Data Science 271 - Statistical Methods for Discrete Response, Time Series, and Panel Data (3 units)\n7. Data Science 281 - Computer Vision (3 units)\n8. Data Science 290 - Special Topics (3 units)\n9. Data Science 293 - Data Science Professional Practicum (1 unit)\n\nThese courses cover fundamental knowledge and skills in programming, data engineering, machine learning, statistics, and computer vision, which are essential for a career in data science.'

In [42]:
phi3_rag_chain.invoke("What is immersion? When does it happen and what are the requirements?")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


"\n The immersion program at UC Berkeley's Masters of Information in Data Science (MIDS) program is a mandatory in-person experience for students. It typically occurs before the end of their final term. During the immersion, students are required to attend at least one three- to four-day program. This program includes a variety of activities such as learning modules on special topics, conference sessions, career services seminars, company visits, networking events, and more. The immersion program fee is $500, and students are responsible for their hotel stay, travel costs, and other incidental expenses not covered by the program."

#### ~22.6GB of VRAM was required to run the RAG system (with semantic chunking). Not all queries worked, since some retrieved contexts were too large

In [45]:
!nvidia-smi

Sun Jun  2 22:53:31 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 4090        Off |   00000000:01:00.0  On |                  Off |
|  0%   43C    P8             30W /  450W |   22665MiB /  24564MiB |      9%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [48]:
phi3_rag_chain.invoke("Can I ")

Both `max_new_tokens` (=500) and `max_length`(=20000) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


'\n The estimated total tuition and fees for the MIDS program for a new student starting in Fall 2023 is currently $81,633. This cost is an estimate and most students complete the degree in approximately two years. However, the exact amount may vary based on several factors.'