<a href="https://colab.research.google.com/github/fwitschel/QDMKM/blob/main/notebooks/QDMKM_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os

# Execute this code only if in colab
if 'COLAB_GPU' in os.environ:
  print("Executing in Colab!")
  # Cloning GitHub repository
  !git clone https://github.com/fwitschel/QDMKM.git
  %cd QDMKM


Executing in Colab!
Cloning into 'QDMKM'...
remote: Enumerating objects: 49, done.
remote: Counting objects: 100% (49/49), done.
remote: Compressing objects: 100% (36/36), done.
remote: Total 49 (delta 7), reused 0 (delta 0), pack-reused 0 (from 0)
Receiving objects: 100% (49/49), 45.29 KiB | 2.52 MiB/s, done.
Resolving deltas: 100% (7/7), done.
/content/QDMKM


We install some libraries that we will need later.

In [2]:
!pip install langchain langchain-community pypdf sentence_transformers faiss-cpu langchain-anthropic groq

Collecting langchain-community
  Downloading langchain_community-0.3.29-py3-none-any.whl.metadata (2.9 kB)
Collecting pypdf
  Downloading pypdf-6.0.0-py3-none-any.whl.metadata (7.1 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.12.0-cp312-cp312-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl.metadata (5.1 kB)
Collecting langchain-anthropic
  Downloading langchain_anthropic-0.3.19-py3-none-any.whl.metadata (1.9 kB)
Collecting groq
  Downloading groq-0.31.1-py3-none-any.whl.metadata (16 kB)
Collecting requests<3,>=2 (from langchain)
  Downloading requests-2.32.5-py3-none-any.whl.metadata (4.9 kB)
Collecting dataclasses-json<0.7,>=0.6.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting anthropic<1,>=0.64.0 (from langchain-anthropic)
  Downloading anthropic-0.66.0-py3-none-any.whl.metadata (27 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.6.7->langchain-community)
  Downloading marshmallow-3.26.1-py3-

We read the input file into a pandas DataFrame (a table with named columns).

In [3]:
import pandas as pd
emails = pd.read_csv("/content/QDMKM/data/cases-emails.csv")
print(emails)

                 sender                 receiver  \
0     student42@MBFH.ch      my_msc_dean@MBFH.ch   
1     student21@MBFH.ch      my_msc_dean@MBFH.ch   
2     student84@MBFH.ch  former_msc_dean@MBFH.ch   
3    student168@MBFH.ch      my_msc_dean@MBFH.ch   
4    student336@MBFH.ch      my_msc_dean@MBFH.ch   
5    student672@MBFH.ch      my_msc_dean@MBFH.ch   
6     student72@MBFH.ch      my_msc_dean@MBFH.ch   
7     student67@MBFH.ch      my_msc_dean@MBFH.ch   
8      student7@MBFH.ch      my_msc_dean@MBFH.ch   
9    student745@MBFH.ch  former_msc_dean@MBFH.ch   
10    student45@MBFH.ch      my_msc_dean@MBFH.ch   
11    student74@MBFH.ch      my_msc_dean@MBFH.ch   
12     student5@MBFH.ch      my_msc_dean@MBFH.ch   
13   student666@MBFH.ch      my_msc_dean@MBFH.ch   
14   student888@MBFH.ch      my_msc_dean@MBFH.ch   
15  student4242@MBFH.ch         my_msc_dean@MBFH   

                                     subject  first_date  notice_weeks  \
0                        Re: late submiss

In the input file, each case is represented by one row. We create one Document object per case. Its textual content (which will be transformed into an embedding vector and stored) is the subject of the initial email, followed by the entire text of the conversation. As metadata, we keep the email address of the sender, the notice (in weeks) with which the extension was requested, and the year taken from `first_date`. Having quick access to this metadata can be useful later.

In [6]:
import datetime as dt
from langchain_core.documents import Document
docs = []
for _, row in emails.iterrows():
    sender = row['sender']
    subject = row['subject']
    # Text to embed: subject of the initial email plus the full conversation
    text = subject + " " + row['all_text']
    notice = row['notice_weeks']
    # Keep only the year of first_date
    year = dt.datetime.strptime(row['first_date'], '%Y-%m-%d').year
    document = Document(
        page_content=text,
        metadata={"source": sender, "notice": notice, "year": year}
    )
    docs.append(document)

print(docs[0])

page_content='Re: late submission Dear Ms Smith, unfortunately, we cannot accept your request for deadline extension. Since your sickness occurred during a non-critical period of your thesis work and was comparatively short, there was enough time to resolve issues resulting from it. We are looking forward to receiving your thesis submission on June 21st. Best regards, The Dean. ---- Dear Prof. Dean, please find attached the certificate for my sickness. Hoping for a positive decision, best regards, Jane Smith. ---- Dear Ms Smith, could you please send us a medical certificate for your sick period. Please note that this does not imply that we will grant the extension, it is just a routine request. Thanks and best regards, The Dean. --- Dear Prof. Dean, I am writing to you to ask for a deadline extension of 1 week for my master thesis. In February, I had a really bad flu from which it took me two weeks to recover. I feel that I am still suffering from the consequences since my whole thesi
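The metadata can later be used to narrow down cases without touching the text. The following is a minimal sketch that uses plain dictionaries with the same keys as above (`source`, `notice`, `year`); the values and the helper `filter_cases` are made up for illustration. With real Document objects, the same checks would be applied to `doc.metadata`.

```python
# Metadata dictionaries with the same keys as in the loop above
# ("source", "notice", "year"); the values here are made up.
cases = [
    {"source": "student_a@example.org", "notice": 1, "year": 2022},
    {"source": "student_b@example.org", "notice": 4, "year": 2023},
    {"source": "student_c@example.org", "notice": 2, "year": 2023},
]

def filter_cases(cases, year=None, min_notice=None):
    """Keep only cases that match the given year and minimum notice (in weeks)."""
    selected = []
    for case in cases:
        if year is not None and case["year"] != year:
            continue
        if min_notice is not None and case["notice"] < min_notice:
            continue
        selected.append(case)
    return selected

# Hypothetical filter: cases from 2023 requested with at least 2 weeks notice
recent = filter_cases(cases, year=2023, min_notice=2)
print([c["source"] for c in recent])
```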

We take an embedding model and use it to create embedding vectors for our email conversations. When you print such an embedding vector, you see that it is just a long list of numbers.

In [7]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

chunk_texts = [d.page_content for d in docs]
embeddings = bge_embeddings.embed_documents(chunk_texts)
print(embeddings[0])

  bge_embeddings = HuggingFaceBgeEmbeddings(
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[-0.06024385988712311, 0.026430662721395493, 0.022581422701478004, 0.013069221749901772, 0.017530471086502075, -0.002363329753279686, 0.030586568638682365, 0.028281182050704956, -0.023568671196699142, -0.02540016919374466, 0.01419154740869999, -0.02738691121339798, -0.005180547013878822, -0.009107926860451698, 0.028404181823134422, 0.01528555154800415, 0.004506260622292757, -0.00012868685007560998, -0.011500484310090542, 0.05693953484296799, -0.011474189348518848, -0.0036948933266103268, -0.004475072957575321, -0.018138622865080833, -0.028383374214172363, -0.012660247273743153, -0.002087835455313325, -0.03936571627855301, -0.026432760059833527, -0.19615943729877472, -0.04876246303319931, -0.02333776094019413, -0.0004094137402717024, -0.023912718519568443, 0.024079622700810432, -0.007743343245238066, -0.024135878309607506, 0.02551259659230709, -0.02752256765961647, 0.03140414506196976, 0.04504133015871048, 0.02439050003886223, -0.03522006422281265, -0.031253378838300705, -0.011611488647
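Because `normalize_embeddings` is set to `True`, each vector has (approximately) unit length, and for unit-length vectors the dot product equals the cosine similarity. The following pure-Python sketch checks this on two made-up three-dimensional vectors (the real vectors are much longer); the helper names are illustrative and not part of any library.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of any length."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for two email embeddings (values are made up).
v1 = [0.2, -0.1, 0.4]
v2 = [0.1, 0.3, 0.5]

u1, u2 = normalize(v1), normalize(v2)

# For unit-length vectors, the plain dot product is the cosine similarity.
print(math.isclose(dot(u1, u2), cosine_similarity(v1, v2)))
```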

We store the embedding vectors in a FAISS vector store.

In [8]:
from langchain_community.vectorstores import FAISS

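text_embedding_pairs = zip(chunk_texts, embeddings)
# Pass the metadata along so that it stays attached to each stored case
metadatas = [d.metadata for d in docs]
db = FAISS.from_embeddings(text_embedding_pairs, bge_embeddings, metadatas=metadatas)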

Here, we describe the new case that needs to be decided. We then use this description to retrieve the 3 most similar historical cases.

In [9]:
topk = 3
query = "A student discovered that she was pregnant soon after starting the thesis proposal. Towards the end of her thesis, the pregnancy became complicated and she had to take leave. A sickness certificate is available."
#query = "A student needs more time because he had to take over more responsibilities for a new project / mission. His employer assigned him as a project leader and he could not refuse it."

contexts = db.similarity_search(query, k=topk)

for c in contexts:
  print(c.page_content)


Re:late submission? Dear Ms Orange, you will be granted an extension of two weeks. We understand the critically of the point in time when your sickness occurred. Please submit your thesis on August 4th, midnight. Regards, The Dean ---- Hi Dean, as you can see from the attached medical certificate, I was sick for more than two weeks and would like to ask for a deadline extension for my master thesis. The sickness started three weeks ago when I was starting to write up my results. There were also two (out of 5) evaluation interviews that I had to cancel because of the sickness. This means that I was unable to finish the thesis. Hoping for your understanding, best regards, Olivia
Re:master thesis Dear Mr Brown, Although your certificate does not cover the entire period in which you say you were sick, one can conclude from your documents that the sickness must have started earlier. Because of that long period, we can grant you an extension of two weeks. Please submit your thesis by August 
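Conceptually, a similarity search ranks all stored vectors by their closeness to the query vector and returns the best `k`. The brute-force sketch below illustrates the idea with dot-product scores on made-up two-dimensional unit vectors. It is not how FAISS is implemented, and `top_k` is a hypothetical helper. For unit-length vectors, ranking by dot product and ranking by Euclidean distance give the same order.

```python
import heapq

def top_k(query_vec, doc_vecs, k):
    """Return (score, index) pairs of the k most similar vectors by dot product."""
    scored = (
        (sum(q * d for q, d in zip(query_vec, vec)), i)
        for i, vec in enumerate(doc_vecs)
    )
    # nlargest sorts by score first, highest score first
    return heapq.nlargest(k, scored)

# Made-up, unit-length 2-d vectors standing in for case embeddings.
doc_vecs = [
    [1.0, 0.0],
    [0.6, 0.8],
    [0.0, 1.0],
    [-1.0, 0.0],
]
query_vec = [0.8, 0.6]

for score, idx in top_k(query_vec, doc_vecs, k=3):
    print(idx, round(score, 2))
```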

Here, we connect to an LLM hosted by Groq. To make this work, get an API key from Groq and store it as a Colab secret named `GROQ_API_KEY` (key icon in the left sidebar of this notebook).

In [10]:
from groq import Groq
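def llm(groq_client, prompt):
  """Send a single user prompt to the Groq chat API and return the reply text."""
  chat_completion = groq_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama-3.3-70b-versatile",
  )

  # The reply text is in the first choice of the response
  return chat_completion.choices[0].message.content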

In [12]:
from google.colab import userdata
groq_client = Groq(
    api_key=userdata.get('GROQ_API_KEY')
)
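Outside Colab, `google.colab` is not available. A possible workaround, sketched here under the assumption that the key is then provided in an environment variable of the same name, is a small helper that tries the Colab secret first and falls back to the environment. The function `get_groq_api_key` is hypothetical and not part of the Groq or Colab libraries.

```python
import os

def get_groq_api_key(name="GROQ_API_KEY"):
    """Return the key from Colab secrets if available, else from the environment."""
    try:
        # google.colab only exists inside a Colab runtime.
        from google.colab import userdata
        return userdata.get(name)
    except Exception:
        # Not in Colab, or the secret is missing or not accessible:
        # fall back to an environment variable of the same name.
        key = os.environ.get(name)
        if key is None:
            raise RuntimeError(
                f"No API key found: set the {name} secret or environment variable"
            )
        return key
```

The client could then be created with `Groq(api_key=get_groq_api_key())`.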

Here, you tell the large language model (LLM) what to do:

* In the "system" part of the prompt, you explain the general task, including the context (i.e. the retrieved information) that the model should rely on. You pass the content of the retrieved emails by putting "{context}" into this part of the prompt.
* In the "query" part of the prompt, you give the instruction to make a decision about the new case (the one introduced above, before the retrieval).
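The split into a "system" part and a "query" part can also be made explicit in the message list. The sketch below assumes that the chat endpoint accepts a `system` role next to the `user` role (the common chat message format); `build_messages` and the sample inputs are hypothetical and only illustrate the structure, no request is sent.

```python
def build_messages(query, retrieved_texts):
    """Build a chat message list with separate system and user parts."""
    context = "\n\n".join(retrieved_texts)
    system_part = (
        "You are an assistant that helps a study dean decide on students' "
        "requests to extend the deadline of their master theses. "
        "Base your suggestion on the following historical emails:\n\n"
        f"{context}"
    )
    user_part = (
        f"The current case is described as follows: {query}\n"
        "Please suggest whether or not to grant the extension and justify "
        "your suggestion, quoting the historical emails where possible."
    )
    return [
        {"role": "system", "content": system_part},
        {"role": "user", "content": user_part},
    ]

# Placeholder inputs; in the notebook these would come from the retrieval step.
messages = build_messages(
    "A student was sick for two weeks shortly before the deadline.",
    ["Re: extension ... (first retrieved email)", "Re: thesis ... (second retrieved email)"],
)
for m in messages:
    print(m["role"], "->", m["content"][:60])
```

Such a list would be passed as the `messages` argument instead of the single user message.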

In [13]:
context = '\n\n'.join(c.page_content for c in contexts)
prompt = f"""You are an assistant that helps a study dean to decide about students' requests for extending the deadline of their master theses.
        The current case is described as follows: {query}
        To decide about the current case, the following historical emails seem to be relevant: {context}
        Please make a suggestion whether or not to grant the deadline extension, including a justification that is based on the given context. If possible, please include quotes from the historical emails."""
print(llm(groq_client, prompt))

Based on the provided historical emails and the current case, I suggest granting the deadline extension to the student. 

The student's situation is unique and challenging, as she discovered she was pregnant soon after starting her thesis proposal, and the pregnancy became complicated towards the end, forcing her to take leave. This is a critical period in her thesis work, and the sickness certificate supports her claim. As the Dean stated in a previous email to Ms. Orange, "we understand the critically of the point in time when your sickness occurred." This suggests that the timing of the sickness is an important factor in considering deadline extensions.

The Dean has previously granted extensions to students who were sick during critical periods of their thesis work. For example, in the case of Mr. Brown, the Dean stated, "one can conclude from your documents that the sickness must have started earlier. Because of that long period, we can grant you an extension of two weeks." Althou