<a href="https://colab.research.google.com/github/fwitschel/QDMKM/blob/main/notebooks/QDMKM_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import os

# Execute this code only if in colab
if 'COLAB_GPU' in os.environ:
  print("Executing in Colab!")
  # Cloning GitHub repository
  !git clone https://github.com/fwitschel/QDMKM.git
  %cd QDMKM


We install some libraries that we will need later.

In [None]:
!pip install langchain langchain-community faiss-cpu langchain-anthropic groq rank_bm25

We read the input file into a pandas DataFrame.

In [None]:
import pandas as pd
emails = pd.read_csv("/content/QDMKM/data/cases-emails.csv")
print(emails)

In the input file, each case is represented by one row. We create one Document object per case. Its textual content (to be transformed and stored as embedding vectors) is the subject of the initial email, followed by the entire text of the conversation. As metadata, we keep the email address of the sender, the notice (in weeks) with which the extension was requested, and the year taken from the case's `first_date`. Quick access to this metadata will be useful later, for example when we factor the age of a case into the ranking.

In [None]:
import datetime as dt
from langchain_core.documents import Document
docs = []
for index, row in emails.iterrows():
    sender = row['sender']
    subject = row['subject']
    text = subject + " " + row['all_text']
    notice = row['notice_weeks']
    year = dt.datetime.strptime(row['first_date'], '%Y-%m-%d').year
    document = Document(
        page_content=text,
        metadata={"source": sender,"notice":notice, "year":year},
        id = index
    )
    docs.append(document)

print(docs[0])

# later, when we combine ranks of documents, it will be useful to have a data structure
# that maps document indices to document objects:
doc_map = {}
for doc in docs:
  doc_map[doc.id] = doc

First, we want to index our emails for basic keyword retrieval. For this, we define a function that does some pre-processing: it tokenizes the text, lowercases the tokens and stems them.

In [None]:
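from nltk.stem.snowball import SnowballStemmer
import nltk

# the stemmer needs the stopword list because of ignore_stopwords=True
nltk.download('stopwords')

# create the stemmer once instead of on every call
stemmer = SnowballStemmer('german', ignore_stopwords=True)

def preprocess(text):
    # tokenize the text (requires the 'punkt_tab' resource, downloaded in the next cell)
    tokens = nltk.word_tokenize(text)

    # lowercase the tokens
    tokens = [token.lower() for token in tokens]

    # stem the tokens
    tokens = [stemmer.stem(token) for token in tokens]

    return tokens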

Then, we do the actual pre-processing on our documents. We store the pre-processed texts in a list, in the same order as in the document collection, so that the position in the list identifies the document (and its metadata) that a pre-processed text belongs to.

In [None]:
from rank_bm25 import BM25Okapi
import operator
import nltk
nltk.download('punkt_tab')

preprocessed_corpus = []
for doc in docs:
  doc_tokens = preprocess(doc.page_content)
  preprocessed_corpus.append(doc_tokens)

We define the new case for which we would like to retrieve and summarize similar historical cases. We also set the number (topk) of most similar cases to consider.

In [None]:
topk = 3
query = "A student got a very late feedback regarding his MRTP that he wants to react to."
#query = "A student discovered that she was pregnant soon after starting the thesis proposal. Towards the end of her thesis, the pregnancy became complicated and she had to take leave. A sickness certificate is available."
#query = "A student needs more time because he had to take over more responsibilities for a new project / mission. His employer assigned him as a project leader and he could not refuse it."

We build a keyword retrieval object and use it to run our query against all emails. This gives us a score for each email that (hopefully) indicates how relevant the email is for the new case.

In [None]:
bm25_retriever = BM25Okapi(preprocessed_corpus)
doc_scores = bm25_retriever.get_scores(preprocess(query))

# retrieve the documents (including metadata) corresponding to the topk highest scores
# by first inserting doc ids and corresponding scores into a dict and then sorting it by
# the values (=scores)
doc_scores_dict = {}
for i in range(len(doc_scores)):
  doc_scores_dict[i] = doc_scores[i]
sorted_scores = sorted(doc_scores_dict.items(), key=operator.itemgetter(1))
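
To give an idea of what happens inside `get_scores`, here is a minimal pure-Python sketch of BM25 scoring. It is a simplified variant and not the exact code of `rank_bm25`: the function name `bm25_scores`, the toy corpus and the smoothed idf formula are assumptions made for illustration, and the library handles rare and very frequent terms slightly differently.

```python
import math
from collections import Counter

def bm25_scores(corpus, query_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in `corpus` against `query_tokens`."""
    n_docs = len(corpus)
    avg_len = sum(len(doc) for doc in corpus) / n_docs
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    scores = []
    for doc in corpus:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in term_freq:
                continue  # terms missing from the document contribute nothing
            # smoothed idf that stays positive (an assumption, see the text above)
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            tf = term_freq[term]
            # term frequency saturates with k1 and is normalized by document length via b
            score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avg_len))
        scores.append(score)
    return scores

# hypothetical, already pre-processed toy documents
toy_corpus = [
    ["student", "request", "extens", "late", "feedback"],
    ["student", "sick", "certif"],
    ["employ", "project", "leader"],
]
toy_scores = bm25_scores(toy_corpus, ["late", "feedback"])
print(toy_scores)
```

Only the first toy document contains the query terms, so it is the only one with a score above zero. Rare terms get a high idf and therefore count more than terms that occur in almost every email.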

After getting the scores, we create a list where we insert pairs of documents (including their metadata) and their scores, only for the topk top-ranked documents...

In [None]:
bm25_results = []
for i in range(topk):
  (doc_id, score) = sorted_scores[len(docs)-i-1]
  cur_doc = docs[doc_id]
  bm25_results.append((cur_doc, score))

for result in bm25_results:
  print(result)
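
Selecting the best `topk` entries can also be written more compactly. The following sketch uses `heapq.nlargest` from the standard library; the helper name `top_k` and the example scores are made up for illustration. Unlike indexing from the end of a sorted list, it simply returns fewer results when the collection holds fewer than `topk` documents.

```python
import heapq

def top_k(scores, k):
    """Return up to k (index, score) pairs, highest score first."""
    return heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])

# hypothetical scores for four documents
example_scores = [0.2, 1.7, 0.0, 0.9]
print(top_k(example_scores, 3))  # [(1, 1.7), (3, 0.9), (0, 0.2)]
```

The returned indices can be used in the same way as `doc_id` above, i.e. to look up the corresponding entry in `docs`.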

Now, we take an embedding model and use it to create embedding vectors for our email conversations. In the end, such an embedding vector is simply a list of numbers.

In [None]:
from langchain_community.embeddings import HuggingFaceBgeEmbeddings

model_name = "BAAI/bge-small-en"
model_kwargs = {"device": "cpu"}
encode_kwargs = {"normalize_embeddings": True}
bge_embeddings = HuggingFaceBgeEmbeddings(
    model_name=model_name, model_kwargs=model_kwargs, encode_kwargs=encode_kwargs
)

#chunk_texts = list(map(lambda d: d.page_content, docs))
#embeddings = bge_embeddings.embed_documents(chunk_texts)
#print(embeddings[0])

We store the embeddings in a vector store.

In [None]:
from langchain_community.vectorstores import FAISS
db = FAISS.from_documents(docs, bge_embeddings)

Then we retrieve the topk most similar cases using semantic search, i.e. comparison of embeddings.

In [None]:
semantic_results = db.similarity_search_with_score(query, k=topk)

for i in range(topk):
  print(semantic_results[i])
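
A note on how to read these scores: as far as we know, the FAISS store returns a distance with its default settings, so a lower score means a more similar case (the opposite of BM25). Since we asked for normalized embeddings, a squared Euclidean distance and the cosine similarity carry the same information. The sketch below illustrates this relationship with made-up three-dimensional vectors; real embeddings have several hundred dimensions, and the helper names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to length 1."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def cosine(a, b):
    """Cosine similarity; for unit vectors this is just the dot product."""
    return sum(x * y for x, y in zip(a, b))

def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# made-up vectors standing in for the embeddings of a query and two emails
query_vec = normalize([1.0, 2.0, 0.5])
close_vec = normalize([1.1, 1.9, 0.4])
far_vec = normalize([-1.0, 0.3, 2.0])

for name, vec in [("close", close_vec), ("far", far_vec)]:
    print(name, round(cosine(query_vec, vec), 3), round(squared_l2(query_vec, vec), 3))
```

For unit vectors, the squared distance equals 2 - 2 * cosine similarity, so ranking by ascending distance gives the same order as ranking by descending cosine similarity.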

Before we combine the results of keyword and semantic search, we also retrieve the age of each document that either of the two searches has found. We want to factor the age into the final ranking because old decisions should not influence new decisions as much as more recent ones.

In [None]:
# rank all documents that appear in either semantic search or keyword search results by age
ages = {}
for (doc,score) in bm25_results:
  ages[doc.id] = 2025 - doc.metadata['year']

for (doc,score) in semantic_results:
  ages[doc.id] = 2025 - doc.metadata['year']

sorted_ages = sorted(ages.items(), key=operator.itemgetter(1))
print(sorted_ages)
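
Two details of this cell are design choices that could be handled differently: the reference year is hard-coded, and cases from the same year end up at different positions in the sorted list. The following sketch takes the reference year from the system clock by default and gives cases of equal age the same rank. The helper `age_ranks` and the example years are hypothetical, and whether ties should share a rank is an assumption, not something the rest of the notebook relies on.

```python
import datetime as dt

def age_ranks(years_by_doc, reference_year=None):
    """Map doc id -> rank by age (1 = most recent); equal ages share a rank."""
    if reference_year is None:
        reference_year = dt.date.today().year
    ages = {doc_id: reference_year - year for doc_id, year in years_by_doc.items()}
    # one rank per distinct age, youngest first
    distinct_ages = sorted(set(ages.values()))
    rank_of_age = {age: rank for rank, age in enumerate(distinct_ages, start=1)}
    return {doc_id: rank_of_age[age] for doc_id, age in ages.items()}

# made-up case years
example_years = {"a": 2023, "b": 2019, "c": 2023, "d": 2021}
print(age_ranks(example_years, reference_year=2025))  # {'a': 1, 'b': 3, 'c': 1, 'd': 2}
```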

Now, let's create a final ranking by using weighted reciprocal rank fusion (RRF).

In [None]:
k = 60
weights = {"bm25":0.4,"semantic":0.4,"age":0.2}

final_scores = {}
bm25_rank = 1
for (doc,score) in bm25_results:
  final_scores[doc.id] = weights["bm25"]/(k+bm25_rank)
  bm25_rank += 1

semantic_rank = 1
for (doc,score) in semantic_results:
  if doc.id in final_scores:
    final_scores[doc.id] += weights["semantic"]/(k+semantic_rank)
  else:
    final_scores[doc.id] = weights["semantic"]/(k+semantic_rank)
  semantic_rank += 1

age_rank = 1
for (doc_id,age) in sorted_ages:
  final_scores[doc_id] += weights["age"]/(k+age_rank)
  age_rank += 1  # advance the rank, as in the bm25 and semantic loops


sorted_final = sorted(final_scores.items(), key=operator.itemgetter(1))
final_results = []
for i in range(topk):
  (doc_id, score) = sorted_final[len(sorted_final)-i-1]
  cur_doc = doc_map[doc_id]
  final_results.append((cur_doc, score))

for (doc,score) in final_results:
  print(doc, score)
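
Weighted RRF gives each document the score sum over all rankings s of w_s / (k + rank_s(d)); a document that is missing from a ranking gets no contribution from it. The same computation can be sketched as one reusable function. The name `weighted_rrf` and the toy rankings are made up for illustration.

```python
def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists of doc ids; each list is ordered best-first."""
    fused = {}
    for source, ranked_ids in rankings.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # every ranking adds weight / (k + rank) to the document's score
            fused[doc_id] = fused.get(doc_id, 0.0) + weights[source] / (k + rank)
    # highest fused score first
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# hypothetical rankings of four documents by the three criteria
toy_rankings = {
    "bm25": ["d1", "d2", "d3"],
    "semantic": ["d2", "d4", "d1"],
    "age": ["d4", "d2", "d1", "d3"],
}
toy_weights = {"bm25": 0.4, "semantic": 0.4, "age": 0.2}
toy_fused = weighted_rrf(toy_rankings, toy_weights)
for doc_id, score in toy_fused:
    print(doc_id, round(score, 5))
```

In this toy example, `d2` comes out on top because it is ranked highly in all three lists. The constant `k` dampens the influence of the exact rank: with k = 60, the difference between rank 1 and rank 2 is small, so appearing in several rankings matters more than being first in one of them.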

Here, we connect to an LLM hosted by Groq. To make this work, please get yourself an API key from Groq and store it as a Colab secret named GROQ_API_KEY (key icon in the left sidebar of this notebook).

In [None]:
from groq import Groq
def llm(groq_client, prompt):
  chat_completion = groq_client.chat.completions.create(
    messages=[
        {
            "role": "user",
            "content": prompt,
        }
    ],
    model="llama-3.3-70b-versatile",
  )

  return chat_completion.choices[0].message.content

In [None]:
from google.colab import userdata
groq_client = Groq(
    api_key=userdata.get('GROQ_API_KEY')
)

Here, you instruct the Large Language Model what to do. The cell below combines two parts in a single prompt string that is sent as one user message:

* In the "system" part of the prompt, you explain the general task, including the context (i.e. the retrieved information) that the system should rely on. The content of the retrieved emails is passed in via the "{context}" placeholder.
* In the "query" part of the prompt, you give the instruction to make a decision about the new case (the one introduced above, before the retrieval). It is passed in via the "{query}" placeholder.
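
The two parts can also be sent as separate chat messages, since the Groq chat API accepts OpenAI-style messages with the roles "system" and "user". The following sketch only builds such a message list and does not call the API. The helper name `build_messages`, the exact wording of the messages and the example inputs are assumptions made for illustration.

```python
def build_messages(query, context):
    """Split the prompt into a system message (task + context) and a user message (new case)."""
    system_part = (
        "You are an assistant that helps a study dean decide about students' "
        "requests to extend the deadline of their master's theses. "
        "Base your suggestion on these historical emails:\n\n"
        f"{context}"
    )
    user_part = (
        f"The current case is described as follows: {query}\n"
        "Please suggest whether or not to grant the extension, justify the "
        "suggestion with the given context, and quote the historical emails where possible."
    )
    return [
        {"role": "system", "content": system_part},
        {"role": "user", "content": user_part},
    ]

# placeholder inputs; in the notebook these would be `query` and the joined email texts
messages = build_messages(
    query="A student got very late feedback on his proposal.",
    context="(text of the retrieved email conversations)",
)
for message in messages:
    print(message["role"], "->", message["content"][:60])
```

Such a list could be passed as the `messages` argument of `groq_client.chat.completions.create(...)` in place of the single user message.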

In [None]:
context = '\n\n'.join(list(map(lambda c: c[0].page_content, final_results)))
prompt = f"""You are an assistant that helps a study dean to decide about students' requests for extending the deadline of their master's theses.
        The current case is described as follows: {query}.
        To decide about the current case, the following historical emails seem to be relevant: {context}. Please make a suggestion whether or not
        to grant the deadline extension, including a justification that is based on the given context! If possible, please include quotes from the historical emails"""
print(llm(groq_client, prompt))