<a href="https://colab.research.google.com/github/fwitschel/QDMKM/blob/main/notebooks/QDMKM_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import os

# Execute this code only if in colab
if 'COLAB_GPU' in os.environ:
  print("Executing in Colab!")
  # Cloning GitHub repository
  !git clone https://github.com/fwitschel/QDMKM.git
  %cd QDMKM


We install some libraries that we will need later.

In [None]:
!pip install langchain langchain-community pypdf sentence_transformers faiss-cpu langchain-anthropic groq

Collecting groq
  Downloading groq-0.31.0-py3-none-any.whl.metadata (16 kB)
Downloading groq-0.31.0-py3-none-any.whl (131 kB)
Installing collected packages: groq
Successfully installed groq-0.31.0


I've put the CSV file into Google Drive. To run the notebook yourself, do the same. If you put the file into a folder, adapt the path in the `pd.read_csv` call below accordingly.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


We read the input file into a so-called dataframe.

In [None]:
import pandas as pd
df = pd.read_csv('/content/drive/MyDrive/cases-emails.csv', sep=",")
print(df.head())

               sender                 receiver  \
0   student42@MBFH.ch      my_msc_dean@MBFH.ch   
1   student21@MBFH.ch      my_msc_dean@MBFH.ch   
2   student84@MBFH.ch  former_msc_dean@MBFH.ch   
3  student168@MBFH.ch      my_msc_dean@MBFH.ch   
4  student336@MBFH.ch      my_msc_dean@MBFH.ch   

                                    subject  first_date notice_weeks  \
0                       Re: late submission  2019-06-16            1   
1  Re: extend deadline because of sickness?  2020-06-14            1   
2                     Re: thesis submission  2016-07-01          NaN   
3                 Re: request for extension  2021-05-10            6   
4                             Re: extension  2019-05-16            5   

                   tags                                           all_text  \
0  sickness certificate  Dear Ms Smith, unfortunately, we cannot accept...   
1  sickness certificate  Dear Mr Doe, your request to extend the deadli...   
2                   job  Dear Ms

In the input file, each case is represented by one row. We create one Document object per case. Its textual content (which will later be transformed into an embedding vector and stored) is the subject of the initial email, followed by the entire text of the conversation. As metadata, we keep the sender's email address and the notice period (in weeks) with which the extension was requested. Later, it can be useful to have quick access to this metadata...

In [None]:
from langchain_core.documents import Document
docs = []
for index, row in df.iterrows():
    sender = row['sender']
    subject = row['subject']
    text = subject + " " + row['all_text']
    notice = row['notice_weeks']
    document = Document(
        page_content=text,
        metadata={"source": sender,"notice":notice}
    )
    docs.append(document)

print(docs[0])

page_content='Re: late submission Dear Ms Smith, unfortunately, we cannot accept your request for deadline extension. Since your sickness occurred during a non-critical period of your thesis work and was comparatively short, there was enough time to resolve issues resulting from it. We are looking forward to receiving your thesis submission on June 21st. Best regards, The Dean. ---- Dear Prof. Dean, please find attached the certificate for my sickness. Hoping for a positive decision, best regards, Jane Smith. ---- Dear Ms Smith, could you please send us a medical certificate for your sick period. Please note that this does not imply that we will grant the extension, it is just a routine request. Thanks and best regards, The Dean. --- Dear Prof. Dean, I am writing to you to ask for a deadline extension of 1 week for my master thesis. In February, I had a really bad flu from which it took me two weeks to recover. I feel that I am still suffering from the consequences since my whole thesi
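The `notice_weeks` column contains missing values (the third row of the DataFrame output shows `NaN`), and the loop above copies them into the metadata unchanged. A minimal sketch of how such values could be normalised before building the documents, using plain dictionaries instead of DataFrame rows. The helper name `row_to_record`, the abbreviated example texts, and the choice of `None` for a missing notice are assumptions made for illustration:

```python
import math

def row_to_record(row):
    """Turn one case row into a (text, metadata) pair.

    `row` is a plain dict standing in for a DataFrame row (an assumption).
    """
    text = f"{row['subject']} {row['all_text']}"
    notice = row["notice_weeks"]
    # pandas represents missing cells as float NaN; map those to None
    if isinstance(notice, float) and math.isnan(notice):
        notice = None
    return text, {"source": row["sender"], "notice": notice}

# Two abbreviated, made-up rows with the same columns as the CSV file
rows = [
    {"sender": "student42@MBFH.ch", "subject": "Re: late submission",
     "all_text": "Dear Ms Smith, ...", "notice_weeks": 1},
    {"sender": "student84@MBFH.ch", "subject": "Re: thesis submission",
     "all_text": "Dear Ms ...", "notice_weeks": float("nan")},
]
records = [row_to_record(r) for r in rows]
print(records[1][1])
```

The returned pairs could then be passed to `Document(page_content=..., metadata=...)` exactly as in the cell above.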

We take an embedding model and use it to create embedding vectors for our email conversations. When you print such an embedding vector, you see that it is just a list of numbers...

In [None]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

chunk_texts = [d.page_content for d in docs]
embeddings = bge_embeddings.embed_documents(chunk_texts)
print(embeddings[0])

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


[-0.06024384871125221, 0.026430655270814896, 0.0225814338773489, 0.013069206848740578, 0.01753048226237297, -0.0023633234668523073, 0.03058656118810177, 0.028281155973672867, -0.02356867864727974, -0.025400156155228615, 0.014191539026796818, -0.027386901900172234, -0.0051805428229272366, -0.009107922203838825, 0.028404168784618378, 0.015285535715520382, 0.004506242927163839, -0.0001287022460019216, -0.011500483378767967, 0.056939512491226196, -0.011474188417196274, -0.003694898448884487, -0.004475079011172056, -0.01813863031566143, -0.02838337980210781, -0.012660231441259384, -0.0020878424402326345, -0.03936571255326271, -0.026432769373059273, -0.19615942239761353, -0.04876245558261871, -0.023337749764323235, -0.00040942514897324145, -0.02391272597014904, 0.024079613387584686, -0.00774333905428648, -0.02413586899638176, 0.02551259659230709, -0.027522554621100426, 0.03140414506196976, 0.04504132270812988, 0.02439049445092678, -0.03522004932165146, -0.031253401190042496, -0.0116114811971
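Because the model is configured with `normalize_embeddings: True`, every vector should have length 1, and for such vectors the dot product equals the cosine similarity. The following sketch illustrates this with toy vectors in plain Python; the helper functions and the example vectors are made up for illustration and are not part of the embedding library:

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine similarity of two (not necessarily normalised) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy 3-dimensional vectors; real bge-small-en embeddings have 384 dimensions.
a = [1.0, 2.0, 2.0]
b = [2.0, 0.0, 1.0]

a_unit, b_unit = normalize(a), normalize(b)
print(dot(a_unit, a_unit))                # squared length of a unit vector
print(cosine(a, b), dot(a_unit, b_unit))  # both values agree up to rounding
```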

We store the embedding vectors, together with their texts, in a vector database (FAISS).

In [None]:
from langchain_community.vectorstores import FAISS

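# Pair each text with its embedding and keep the metadata of each case
text_embedding_pairs = zip(chunk_texts, embeddings)
metadatas = [d.metadata for d in docs]
db = FAISS.from_embeddings(text_embedding_pairs, bge_embeddings, metadatas=metadatas)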
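Conceptually, a vector store keeps each text next to its vector and answers a query by returning the texts whose vectors are closest to the query vector. The sketch below does this by brute force in plain Python. The class `TinyVectorStore` and the two-dimensional vectors are hypothetical; FAISS itself relies on optimised native code, and the LangChain wrapper embeds the query string for you, whereas this sketch expects a ready-made query vector:

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store with brute-force nearest-neighbour search."""

    def __init__(self, text_embedding_pairs):
        # Materialise the pairs so they can be scanned repeatedly
        self.items = list(text_embedding_pairs)

    def similarity_search(self, query_vec, k=3):
        # Rank stored texts by Euclidean distance to the query vector
        ranked = sorted(self.items, key=lambda item: math.dist(item[1], query_vec))
        return [text for text, _ in ranked[:k]]

store = TinyVectorStore([
    ("sickness case", [1.0, 0.0]),
    ("job case", [0.0, 1.0]),
    ("mixed case", [0.7, 0.7]),
])
print(store.similarity_search([0.9, 0.1], k=2))
```

Asking for more results than there are stored items simply returns all items, ordered from nearest to farthest.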

Here, we describe the new case that needs to be decided. We then use this description to retrieve the 3 most similar historical cases, i.e. their email conversations.

In [None]:
topk = 3
query = "A student discovered that she was pregnant soon after starting the thesis proposal. Towards the end of her thesis, the pregnancy became complicated and she had to take leave. A sickness certificate is available."
#query = "A student needs more time because he had to take over more responsibilities for a new project / mission. His employer assigned him as a project leader and he could not refuse it."

contexts = db.similarity_search(query, k=topk)

for doc in contexts:
  print(doc.page_content)


Re:late submission? Dear Ms Orange, you will be granted an extension of two weeks. We understand the critically of the point in time when your sickness occurred. Please submit your thesis on August 4th, midnight. Regards, The Dean ---- Hi Dean, as you can see from the attached medical certificate, I was sick for more than two weeks and would like to ask for a deadline extension for my master thesis. The sickness started three weeks ago when I was starting to write up my results. There were also two (out of 5) evaluation interviews that I had to cancel because of the sickness. This means that I was unable to finish the thesis. Hoping for your understanding, best regards, Olivia
Re:master thesis Dear Mr Brown, Although your certificate does not cover the entire period in which you say you were sick, one can conclude from your documents that the sickness must have started earlier. Because of that long period, we can grant you an extension of two weeks. Please submit your thesis by August 

Here, we connect to an LLM hosted at Groq. To make it work, get yourself a Groq API key and store it as a Colab secret named `GROQ_API_KEY` (key icon in the left sidebar of this notebook).

In [None]:
from groq import Groq
def llm(groq_client, prompt):
  """Send `prompt` as a single user message to the Groq chat API and return the reply text."""
  chat_completion = groq_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama-3.3-70b-versatile",
  )

  return chat_completion.choices[0].message.content

In [None]:
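# userdata gives access to the secrets stored in the Colab sidebar
from google.colab import userdata

groq_client = Groq(
    api_key=userdata.get('GROQ_API_KEY')
)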
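Looking up the key via Colab secrets only works inside Colab, because the `google.colab` package is not available elsewhere. A minimal sketch of a fallback, assuming that outside Colab the key is provided in an environment variable of the same name; the helper `get_api_key` is hypothetical:

```python
import os

def get_api_key(name="GROQ_API_KEY"):
    """Return the API key from Colab secrets, else from an environment variable."""
    try:
        from google.colab import userdata  # only importable inside Colab
    except ImportError:
        # Outside Colab: fall back to the environment (raises KeyError if unset)
        return os.environ[name]
    return userdata.get(name)
```

With this helper, the client could be created as `Groq(api_key=get_api_key())`.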

Here, you instruct the Large Language Model what to do:

* In the "system" part of the prompt, you explain the general task, including the context (i.e. the retrieved information) that the system should rely on. You pass the content of the retrieved emails by putting "{context}" into this part of the prompt.
* In the "query" part of the prompt, you give the instruction to make a decision about the new case (as introduced above, before the retrieval).

The prompt cell of this notebook combines both parts into a single f-string that is sent as one user message.
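The two parts can also be kept apart as separate chat messages. The following is a minimal sketch under that assumption: the function `build_messages`, the shortened wording, and the example inputs are made up for illustration, and the returned list would be passed as `messages=` to `groq_client.chat.completions.create` instead of the single user message:

```python
def build_messages(query, contexts):
    """Build a system message (task + retrieved context) and a user message (new case)."""
    context = "\n\n".join(contexts)
    system = (
        "You are an assistant that helps a study dean decide on students' "
        "requests to extend the deadline of their master theses. "
        f"The following historical emails seem relevant:\n\n{context}"
    )
    user = (
        f"The current case is described as follows: {query}\n"
        "Please suggest whether or not to grant the extension, and justify the "
        "suggestion with quotes from the historical emails where possible."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

# Made-up, abbreviated inputs standing in for the query and the retrieved emails
messages = build_messages(
    "A student asks for two more weeks because of illness.",
    ["Re: late submission Dear Ms Smith, ...", "Re: extension Dear Mr Doe, ..."],
)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```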

In [None]:
# Join the retrieved emails into one context string, separated by blank lines
context = '\n\n'.join(c.page_content for c in contexts)
prompt = f"""You are an assistant that helps a study dean to decide about students' requests for extending the deadline of their master theses.
        The current case is described as follows: {query}
        To decide about the current case, the following historical emails seem to be relevant: {context}. Please make a suggestion whether or not
        to grant the deadline extension, including a justification that is based on the given context! If possible, please include quotes from the historical emails."""
print(llm(groq_client, prompt))

Based on the provided historical emails and the current case, I suggest granting the deadline extension to the student. 

The student's situation is unique and challenging, as she discovered her pregnancy soon after starting her thesis proposal and faced complications towards the end, necessitating a leave of absence. The availability of a sickness certificate supports her claim, similar to the cases of Olivia and Bob Brown, where medical documentation was used to justify the extension.

As seen in the email to Ms. Orange, the Dean has previously granted an extension of two weeks, stating, "We understand the critically of the point in time when your sickness occurred." In this case, the student's pregnancy and subsequent complications occurred during a critical period of her thesis work, making it difficult for her to complete the thesis on time.

Moreover, the Dean has shown flexibility in granting extensions when the sickness has significantly impacted the student's work, as in the c