<a href="https://colab.research.google.com/github/fwitschel/QDMKM/blob/main/notebooks/QDMKM_RAG_start.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os

# Execute this code only if in colab
if 'COLAB_GPU' in os.environ:
  print("Executing in Colab!")
  # Cloning GitHub repository
  !git clone https://github.com/fwitschel/QDMKM.git
  %cd QDMKM


Executing in Colab!
Cloning into 'QDMKM'...
remote: Enumerating objects: 94, done.
remote: Counting objects: 100% (94/94), done.
remote: Compressing objects: 100% (81/81), done.
remote: Total 94 (delta 22), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (94/94), 92.24 KiB | 4.19 MiB/s, done.
Resolving deltas: 100% (22/22), done.
/content/QDMKM


We install the libraries that we will need later.

In [2]:
!pip install langchain langchain-community faiss-cpu langchain-anthropic groq rank_bm25

Collecting langchain-community
  Downloading langchain_community-0.3.30-py3-none-any.whl.metadata (3.0 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting langchain-anthropic
  Downloading langchain_anthropic-0.3.21-py3-none-any.whl.metadata (1.9 kB)
Collecting groq
  Downloading groq-0.32.0-py3-none-any.whl.metadata (16 kB)
Collecting rank_bm25
  Downloading rank_bm25-0.2.2-py3-none-any.whl.metadata (3.2 kB)
Collecting requests<3,>=2 (from langchain)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7.0,>=0.6.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting anthropic<1.0.0,>=0.69.0 (from langchain-anthropic)
  Downloading anthropic-0.69.0-py3-none-any.whl.metadata (28 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7.0,>=0.6.7->langchain-community)
  Downloading marshma

We read the input file into a pandas DataFrame.

In [3]:
import pandas as pd
emails = pd.read_csv("/content/QDMKM/data/cases-emails.csv")
print(emails)

                 sender                 receiver  \
0     student42@MBFH.ch      my_msc_dean@MBFH.ch   
1     student21@MBFH.ch      my_msc_dean@MBFH.ch   
2     student84@MBFH.ch  former_msc_dean@MBFH.ch   
3    student168@MBFH.ch      my_msc_dean@MBFH.ch   
4    student336@MBFH.ch      my_msc_dean@MBFH.ch   
5    student672@MBFH.ch      my_msc_dean@MBFH.ch   
6     student72@MBFH.ch      my_msc_dean@MBFH.ch   
7     student67@MBFH.ch      my_msc_dean@MBFH.ch   
8      student7@MBFH.ch      my_msc_dean@MBFH.ch   
9    student745@MBFH.ch  former_msc_dean@MBFH.ch   
10    student45@MBFH.ch      my_msc_dean@MBFH.ch   
11    student74@MBFH.ch      my_msc_dean@MBFH.ch   
12     student5@MBFH.ch      my_msc_dean@MBFH.ch   
13   student666@MBFH.ch      my_msc_dean@MBFH.ch   
14   student888@MBFH.ch      my_msc_dean@MBFH.ch   
15  student4242@MBFH.ch         my_msc_dean@MBFH   

                                     subject  first_date  notice_weeks  \
0                        Re: late submiss

In the input file, each row represents one case. For each case, we create one Document object. Its textual content, which will later be transformed into an embedding vector and stored, is the subject of the initial email followed by the entire text of the conversation. As metadata, we keep the email address of the sender, the notice (in weeks) with which the extension was requested, and the year of the first email. Quick access to this metadata will be useful later.

In [4]:
import datetime as dt
from langchain_core.documents import Document
docs = []
for index, row in emails.iterrows():
    sender = row['sender']
    subject = row['subject']
    text = subject + " " + row['all_text']
    notice = row['notice_weeks']
    year = dt.datetime.strptime(row['first_date'], '%Y-%m-%d').year
    document = Document(
        page_content=text,
        metadata={"source": sender, "notice": notice, "year": year},
        id=str(index),  # Document ids are strings, so we convert explicitly
    )
    docs.append(document)

print(docs[0])

# later, when we combine ranks of documents, it will be useful to have a data structure
# that maps document ids to document objects:
doc_map = {doc.id: doc for doc in docs}

page_content='Re: late submission Dear Ms Smith, unfortunately, we cannot accept your request for deadline extension. Since your sickness occurred during a non-critical period of your thesis work and was comparatively short, there was enough time to resolve issues resulting from it. We are looking forward to receiving your thesis submission on June 21st. Best regards, The Dean. ---- Dear Prof. Dean, please find attached the certificate for my sickness. Hoping for a positive decision, best regards, Jane Smith. ---- Dear Ms Smith, could you please send us a medical certificate for your sick period. Please note that this does not imply that we will grant the extension, it is just a routine request. Thanks and best regards, The Dean. --- Dear Prof. Dean, I am writing to you to ask for a deadline extension of 1 week for my master thesis. In February, I had a really bad flu from which it took me two weeks to recover. I feel that I am still suffering from the consequences since my whole thesi
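The comment above mentions combining ranks of documents. One common way to do this is reciprocal rank fusion, sketched below. This is an assumption for illustration, not necessarily the method used later in this notebook. The function name `reciprocal_rank_fusion`, the constant `k=60` and the two example rankings are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into a single ranking.

    rankings: list of lists of ids, each ordered from best to worst.
    k: smoothing constant (60 is a frequently used value; an assumption here).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) to the document's score
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id to stay deterministic
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings, e.g. from a semantic search and a keyword search
semantic_ids = ["0", "3", "7"]
keyword_ids = ["3", "5", "0"]

fused_ids = reciprocal_rank_fusion([semantic_ids, keyword_ids])
print(fused_ids)
```

With a mapping such as `doc_map`, the fused ids can be turned back into documents, for example with `[doc_map[i] for i in fused_ids]`.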

We define the new case for which we would like to retrieve and summarize similar historical cases. We also set the number (topk) of most similar cases to consider.

In [5]:
topk = 3
query = "A student got a very late feedback regarding his MRTP that he wants to react to."
#query = "A student discovered that she was pregnant soon after starting the thesis proposal. Towards the end of her thesis, the pregnancy became complicated and she had to take leave. A sickness certificate is available."
#query = "A student needs more time because he had to take over more responsibilities for a new project / mission. His employer assigned him as a project leader and he could not refuse it."

We now load an embedding model that turns our emails into embedding vectors, which can be stored in and retrieved from a vector store.

In [6]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

#chunk_texts = list(map(lambda d: d.page_content, docs))
#embeddings = bge_embeddings.embed_documents(chunk_texts)
#print(embeddings[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
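The setting `normalize_embeddings: True` scales every embedding vector to unit length. The sketch below illustrates, with made-up two-dimensional vectors rather than real model output, why this is convenient: for unit vectors, the dot product equals the cosine similarity, and the squared Euclidean distance is simply `2 - 2 * cosine`, so ranking by distance and ranking by cosine similarity give the same order. The helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit (Euclidean) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors (not produced by the embedding model)
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])

cosine = dot(a, b)          # for unit vectors this is the cosine similarity
dist_sq = squared_l2(a, b)  # equals 2 - 2 * cosine for unit vectors

print(f"cosine similarity: {cosine:.2f}")
print(f"squared L2 distance: {dist_sq:.2f}")
```

A smaller distance therefore corresponds to a higher cosine similarity, which helps when reading distance-based scores from a vector store.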


We compute the embeddings of all documents and store them in a FAISS vector store.

In [7]:
from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(docs, bge_embeddings)

Then we retrieve the topk most similar cases using semantic search, i.e. comparison of embeddings.

In [8]:
semantic_results = db.similarity_search_with_score(query, k=topk)

# iterate over the results directly, so that this also works if fewer than topk cases are returned
for result in semantic_results:
  print(result)

(Document(id='0', metadata={'source': 'student42@MBFH.ch', 'notice': 1.0, 'year': 2019}, page_content='Re: late submission Dear Ms Smith, unfortunately, we cannot accept your request for deadline extension. Since your sickness occurred during a non-critical period of your thesis work and was comparatively short, there was enough time to resolve issues resulting from it. We are looking forward to receiving your thesis submission on June 21st. Best regards, The Dean. ---- Dear Prof. Dean, please find attached the certificate for my sickness. Hoping for a positive decision, best regards, Jane Smith. ---- Dear Ms Smith, could you please send us a medical certificate for your sick period. Please note that this does not imply that we will grant the extension, it is just a routine request. Thanks and best regards, The Dean. --- Dear Prof. Dean, I am writing to you to ask for a deadline extension of 1 week for my master thesis. In February, I had a really bad flu from which it took me two week
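Each result is a pair of a document and its score, and the document carries the metadata stored earlier (`source`, `notice`, `year`). A minimal sketch of how this metadata could be used to post-filter retrieved cases follows. The helper `filter_results` and its parameters are hypothetical, and the toy results use `SimpleNamespace` stand-ins with invented scores instead of real Document objects.

```python
from types import SimpleNamespace


def filter_results(results, min_year=None, max_notice=None):
    """Keep only (doc, score) pairs whose metadata passes the given limits."""
    kept = []
    for doc, score in results:
        meta = doc.metadata
        if min_year is not None and meta["year"] < min_year:
            continue  # case is older than requested
        if max_notice is not None and meta["notice"] > max_notice:
            continue  # extension was requested with more notice than allowed
        kept.append((doc, score))
    return kept


# Stand-ins for retrieved results; addresses and scores are invented
toy_results = [
    (SimpleNamespace(metadata={"source": "a@example.org", "notice": 1.0, "year": 2019}), 0.31),
    (SimpleNamespace(metadata={"source": "b@example.org", "notice": 4.0, "year": 2022}), 0.35),
    (SimpleNamespace(metadata={"source": "c@example.org", "notice": 2.0, "year": 2023}), 0.40),
]

recent = filter_results(toy_results, min_year=2021)
for doc, score in recent:
    print(doc.metadata["source"], doc.metadata["year"], score)
```

Because the helper only reads `doc.metadata`, it should work in the same way on the pairs in `semantic_results`.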

Here, we connect to an LLM hosted at Groq. To make this work, please get yourself a Groq API key and store it as a secret named GROQ_API_KEY, using the key icon on the left side of this notebook.

In [9]:
from groq import Groq
def llm(groq_client, prompt):
  chat_completion = groq_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama-3.3-70b-versatile",
  )

  return chat_completion.choices[0].message.content

In [10]:
from google.colab import userdata
groq_client = Groq(
    api_key=userdata.get('GROQ_API_KEY')
)

Here, you instruct the large language model what to do:

* In the "system" part of the prompt, you explain the general task, including the context (i.e. the retrieved information) that the model should rely on. You can pass the content of the retrieved emails by putting "{context}" into this part of the prompt.
* In the "query" part of the prompt, you give the instruction to make a decision about the new case (as already introduced above, before the retrieval).

In the code below, both parts are combined into a single user message.
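The two parts described above can also be sent as separate chat messages with the roles `system` and `user`. The following is a minimal sketch under that assumption; the helper `build_messages` and the toy inputs are hypothetical, and the wording of the prompt is only adapted from the one used in this notebook.

```python
def build_messages(query, contexts):
    """Return a chat message list with separate system and user parts."""
    context = "\n\n".join(contexts)
    # "system" part: general task plus the retrieved context
    system_part = (
        "You are an assistant that helps a study dean to decide about "
        "students' requests for extending the deadline of their master theses. "
        "Base your suggestion on the following historical emails:\n\n"
        f"{context}"
    )
    # "query" part: the new case and the instruction to decide about it
    query_part = (
        f"The current case is described as follows: {query}\n"
        "Please suggest whether or not to grant the deadline extension, "
        "and justify the suggestion with quotes from the historical emails "
        "where possible."
    )
    return [
        {"role": "system", "content": system_part},
        {"role": "user", "content": query_part},
    ]


# Toy inputs (placeholders, not taken from the data set)
messages = build_messages(
    "A student asks for two more weeks.",
    ["First historical email thread.", "Second historical email thread."],
)
for message in messages:
    print(message["role"], "->", len(message["content"]), "characters")
```

Such a list could be passed as the `messages` argument of `groq_client.chat.completions.create` in place of the single user message.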

In [11]:
# join the text of the retrieved emails, separated by blank lines
context = '\n\n'.join(doc.page_content for doc, _score in semantic_results)
prompt = f"""You are an assistant that helps a study dean to decide about students' request for extending the deadline of their master theses.
        The current case is described as follows: {query}.
        To decide about the current case, the following historical emails seem to be relevant: {context}. Please make a suggestion whether or not
        to grant the deadline extension, including a justification that is based on the given context! If possible, please include quotes from the historical emails"""
print(llm(groq_client, prompt))

Based on the historical emails, I suggest granting the deadline extension for the current student. The justification is as follows:

The current student received very late feedback regarding their MRTP and wants to react to it. This situation is unique and unforeseen, similar to Ms. Orange's case, where her sickness occurred at a critical point in time. As the Dean stated in the email to Ms. Orange, "We understand the critically of the point in time when your sickness occurred." This implies that the timing of the event is a crucial factor in the decision-making process.

In the current case, the late feedback on the MRTP is a critical event that affects the student's ability to complete the thesis on time. As seen in the previous emails, the Dean has considered the timing and impact of events on the student's thesis progress. For example, in Ms. Smith's case, the Dean stated, "Since your sickness occurred during a non-critical period of your thesis work and was comparatively short, th