### Evaluation of the Data Transformations


In [2]:
from council_rag.data_transformations.text_transformations import generate_groq_summary


In [3]:
# paste the GROQ_KEY.txt file in the same folder as this notebook
with open("GROQ_KEY.txt", "r") as file:
    groq_key = file.read().strip()  # drop the trailing newline that most editors add

In [4]:
base_summary_prompt = """Il s'agit d'un texte concernant un projet de géothermie :
TEXTE: 
{}

TÂCHE:
Make a one page summary paying special attention to all the administrative matters like budgets, plans, actions to take in the future, organizational and hierarchical charts, announcements, meetings, contacts, elections, reports and all the related topics to these.

OUTPUT:
Publier directement le résumé en anglais.
""" # added the english output to make the eval easier

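How `generate_groq_summary` combines the prompt with the page text is not shown in this notebook. Assuming it fills the single `{}` placeholder via `str.format`, the final prompt could be built as in the sketch below; `build_prompt` and the shortened `demo_template` are hypothetical and only illustrate the idea.

```python
# Hypothetical helper: assumes the template contains exactly one "{}" placeholder.
def build_prompt(template: str, page_text: str) -> str:
    """Fill the single "{}" placeholder of the template with the page text."""
    if template.count("{}") != 1:
        raise ValueError("expected exactly one '{}' placeholder in the template")
    # Braces inside page_text are safe: str.format only parses the template.
    return template.format(page_text)


# Shortened stand-in for base_summary_prompt (illustration only).
demo_template = "TEXTE:\n{}\n\nOUTPUT:\nPublier directement le résumé en anglais."
demo_prompt = build_prompt(demo_template, "Budget 2020 {draft}")
print(demo_prompt)
```

Note that any other literal brace in the template itself would need to be doubled (`{{` and `}}`) for `str.format` to accept it.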

In [5]:
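import os
# load the evaluation pages from the correct folder; keep only the PDFs, since the
# summaries generated further below are saved as .txt files in this same folder
eval_summ_data_path = "Evaluation data/Summaries/"
eval_summ_list_random_pages = sorted(f for f in os.listdir(eval_summ_data_path) if f.endswith(".pdf"))
print(eval_summ_list_random_pages)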

['page_13.pdf', 'page_29.pdf', 'page_35.pdf', 'page_44.pdf']


In [6]:
from pypdf import PdfReader

# quickly read the content of the PDFs; the text is not markdown, but it should be fine for the eval
def extract_text_from_pdf(pdf_name, eval_data_path):
    """Return the text of the first page of the given PDF."""
    reader = PdfReader(eval_data_path + pdf_name)
    return reader.pages[0].extract_text()

text_to_summ = [extract_text_from_pdf(pdf_name, eval_summ_data_path) for pdf_name in eval_summ_list_random_pages]

In [36]:
# models we could eval from groq
list_models_eval = ["deepseek-r1-distill-qwen-32b", 
                    "deepseek-r1-distill-llama-70b", 
                    "gemma2-9b-it", 
                    "llama-3.3-70b-versatile"]

In [38]:
from pprint import pprint

# for each page, generate a summary with each model, print it and save it to a .txt file

for page_name, page_text in zip(eval_summ_list_random_pages[:2], text_to_summ[:2]):  # only the first 2 pages
    for model in list_models_eval:
        summ = generate_groq_summary(page_text,
                                     base_summary_prompt,
                                     groq_key,
                                     model)
        print("SUMMARY OF {} with model: {}".format(page_name, model))
        pprint(summ)
        print("\n")
        print("--------------------------------")

        # if you prefer opening the txt file instead of reading below; saved into the Evaluation data/Summaries folder
        with open(f"{eval_summ_data_path}summary_{model}_{page_name}.txt", "w", encoding="utf-8") as f:
            f.write(summ)

SUMMARY OF page_13.pdf with model: deepseek-r1-distill-qwen-32b
('<think>\n'
 'Okay, let me try to figure out how to approach this. The user has provided a '
 'detailed transcript from a community council meeting, and they want me to '
 'create a summary. I need to understand the main points discussed and present '
 'them in a clear, concise manner.\n'
 '\n'
 "First, I'll read through the transcript carefully to identify the key "
 'topics. It seems there are several sections: discussions about the airport '
 'development, the hunting rights at Grands Murcins, the Roma settlement plan, '
 "and the arrival of a new member. There's also a section on general affairs, "
 'specifically the 2020 activity report.\n'
 '\n'
 "I should structure the summary by these main topics. For each section, I'll "
 'extract the essential points without getting bogged down in the details. For '
 "example, Franck Beysson's concerns about the long-term airport development "
 "and the implications of hunting r

Note for Ziyan: please review the first 2 pages; I will do the other 2.

Maybe write a couple of general comments on the length of the summaries, whether they include useful information (although those pages may simply not be that important), what the summaries missed, etc. No need to make it too long.
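When comparing summary lengths, note that the deepseek-r1-distill output above starts with a `<think>` block, which would inflate any length count. A minimal sketch for removing that block before counting words follows; `strip_reasoning`, `naive_word_count` and the sample string are made up for illustration and are not part of `council_rag`.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", flags=re.DOTALL)


def strip_reasoning(summary: str) -> str:
    """Remove <think>...</think> blocks and the surrounding whitespace."""
    return THINK_BLOCK.sub("", summary).strip()


def naive_word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


# Invented sample response, only to show the effect.
raw = "<think>\nOkay, let me try...\n</think>\n\nThe council approved the 2020 activity report."
clean = strip_reasoning(raw)
print(naive_word_count(raw), "->", naive_word_count(clean))
```

A response that was cut off before the closing `</think>` tag is left unchanged by this pattern, so truncated outputs still need a manual check.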


## Evaluation of the Table Context


In [7]:
from council_rag.data_transformations.table_transformations import convert_pdf_to_image, augment_multimodal_context
from pprint import pprint
import os

In [8]:
# paste the GROQ_KEY.txt file in the same folder as this notebook
with open("GROQ_KEY.txt", "r") as file:
    groq_key = file.read().strip()  # drop the trailing newline that most editors add

In [9]:
base_multi_prompt = """CONTEXTE
L'image suivante contient une page et un tableau. 

TÂCHE
Décrivez le tableau et le contenu de la page en accordant une attention particulière au contexte qui l'entoure, qu'il s'agisse de budgets, de dates, d'élections, d'agendas, de projets futurs ou de sujets connexes. 

FORMAT DE LA RÉPONSE
Votre réponse doit être aussi détaillée que possible. 
Votre résultat doit être la description du tableau directement.
La langue de votre réponse est le français"""

In [10]:
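eval_table_cont_data_path = "Evaluation data/Table Context/"
# keep only the PDFs, in case images or text files end up in the same folder
eval_table_cont_random_pages = sorted(f for f in os.listdir(eval_table_cont_data_path) if f.endswith(".pdf"))
print(eval_table_cont_random_pages)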


['page_10.pdf', 'page_29.pdf', 'page_36.pdf', 'page_43.pdf', 'page_9.pdf']
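The listing above is in lexicographic order, so `page_9.pdf` comes after `page_43.pdf`. If the pages should be processed in document order, sorting on the page number is one option. This is a small sketch; `page_number` is a hypothetical helper that assumes file names of the form `page_<n>.pdf`.

```python
import re


def page_number(file_name: str) -> int:
    """Extract the page number from names such as 'page_10.pdf'."""
    match = re.search(r"page_(\d+)\.pdf$", file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return int(match.group(1))


pages = ['page_10.pdf', 'page_29.pdf', 'page_36.pdf', 'page_43.pdf', 'page_9.pdf']
pages_in_order = sorted(pages, key=page_number)
print(pages_in_order)
```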


In [11]:
paths = [eval_table_cont_data_path + pdf_path for pdf_path in eval_table_cont_random_pages]

In [12]:
paths[:2]

['Evaluation data/Table Context/page_10.pdf',
 'Evaluation data/Table Context/page_29.pdf']

In [None]:
vision_model = "llama-3.2-11b-vision-preview"
images_paths = [convert_pdf_to_image(pdf_path) for pdf_path in paths]

for im_path in images_paths:
    resp = augment_multimodal_context(im_path,
                                      base_multi_prompt,
                                      groq_key,
                                      model=vision_model)

    print("TABLE DESCRIPTION OF {} with model: {}".format(im_path, vision_model))
    pprint(resp)
    print("\n")
    print("--------------------------------")

    # if you prefer opening the txt file instead of reading below; saved next to this notebook.
    # assumes convert_pdf_to_image returns the image path; only its file name is used here
    page_stem = os.path.splitext(os.path.basename(im_path))[0]
    with open(f"{vision_model}_table_descr__{page_stem}.txt", "w", encoding="utf-8") as f:
        f.write(resp)

## Evaluation of Embeddings via Clustering Distance

In [16]:
from council_rag.preprocessing import split_markdown_to_paras, compute_paragraph_embeddings
from pprint import pprint
import os
import torch
import numpy as np

In [2]:
# use the GPU when available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [None]:
# note: final_eval_pdfs_path and final_eval_pdfs are not defined in this notebook
[os.path.join(final_eval_pdfs_path, pdf) for pdf in final_eval_pdfs]

In [7]:
eval_clust_path = r"Evaluation data\PDFs for Clustering Eval wrt Embed Model"
list_txts = [txt for txt in os.listdir(eval_clust_path) if txt.endswith(".txt")]
list_txts_paths = [os.path.join(eval_clust_path, txt) for txt in list_txts]
list_txts_paths

['Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\08d25_DEL-2023-171_Annexe_Rapport_developpment_durable_2021-2022.txt',
 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\09541_pagedoc_1503_B.txt',
 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\40b835f11d17fb2a29780f5702b5def19b97d4ef_RA-2021_ALEC-POL.txt',
 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\44581100d4d00d982967f67297858d20b40cb9d8_Programme-LEADER.txt',
 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\ace41990e2df7de61ced4b55c88c215810403448_20230302_PLUi_PA.txt',
 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\cc7329f0f9d289cba48647ce810b5aa7e02189c0_9ccf7c_35ac2a46e.txt',
 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\e9442_Photovoltaique-20230711-012221-6.txt']

In [20]:
dict_file_txt = {}
for txt_file in list_txts_paths:
    # the texts are in French; this assumes the files were saved as UTF-8
    with open(txt_file, "r", encoding="utf-8") as f:
        text = f.read()

    paragraphs = split_markdown_to_paras(text, spacy_model="fr_core_news_sm", n_sents_per_para=10)
    print("Num of paragraphs in the text:")
    print(len(paragraphs))
    # naive token count per paragraph: split on single spaces
    txt_paras = [paragraph["paragraph_union"] for paragraph in paragraphs]
    n_tokens = [len(para_txt.split(" ")) for para_txt in txt_paras]
    max_toks = np.max(n_tokens)
    dict_file_txt[txt_file] = {
                                "max_tokens": max_toks,
                                "para_list_text": txt_paras
                            }
    # the label says "sentence", but the maximum is taken over paragraphs
    print("Max. number of naive tokens in a sentence: ", max_toks)

Num of paragraphs in the text:
67
Max. number of naive tokens in a sentence:  1111
Num of paragraphs in the text:
19
Max. number of naive tokens in a sentence:  260
Num of paragraphs in the text:
30
Max. number of naive tokens in a sentence:  344
Num of paragraphs in the text:
160
Max. number of naive tokens in a sentence:  383
Num of paragraphs in the text:
127
Max. number of naive tokens in a sentence:  306
Num of paragraphs in the text:
28
Max. number of naive tokens in a sentence:  328
Num of paragraphs in the text:
11
Max. number of naive tokens in a sentence:  267


The number of artificial paragraphs varies as expected (from 11 to 160), since the PDFs differ in size.

The maximum naive token count per paragraph lies between 260 and 383 for six of the seven files; the first file is an outlier at 1111.
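These counts matter because embedding models accept a limited input length, and text beyond that limit is typically truncated before embedding. A whitespace split also tends to undercount compared with a subword tokenizer, so the numbers above are a lower bound. The sketch below flags paragraphs that exceed a chosen limit; `flag_long_paragraphs` and the limit of 10 in the demo are hypothetical, and the real limit depends on the embedding model.

```python
def naive_token_count(text: str) -> int:
    """Count whitespace-separated tokens (a rough lower bound on subword tokens)."""
    return len(text.split())


def flag_long_paragraphs(paragraphs, max_naive_tokens):
    """Return (index, count) pairs for the paragraphs above the limit."""
    flagged = []
    for index, paragraph in enumerate(paragraphs):
        count = naive_token_count(paragraph)
        if count > max_naive_tokens:
            flagged.append((index, count))
    return flagged


# Toy paragraphs of 5, 12 and 3 tokens; only the second exceeds the limit.
demo_paragraphs = ["mot " * 5, "mot " * 12, "mot " * 3]
flagged = flag_long_paragraphs(demo_paragraphs, max_naive_tokens=10)
print(flagged)
```

Applied to `dict_file_txt[...]["para_list_text"]`, the same check would point to the paragraphs behind the 1111-token outlier.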

In [22]:
dict_file_txt.keys()
# dictionary with the txt files as keys and, as values, their max number of tokens and their list of paragraphs

dict_keys(['Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\08d25_DEL-2023-171_Annexe_Rapport_developpment_durable_2021-2022.txt', 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\09541_pagedoc_1503_B.txt', 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\40b835f11d17fb2a29780f5702b5def19b97d4ef_RA-2021_ALEC-POL.txt', 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\44581100d4d00d982967f67297858d20b40cb9d8_Programme-LEADER.txt', 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\ace41990e2df7de61ced4b55c88c215810403448_20230302_PLUi_PA.txt', 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\cc7329f0f9d289cba48647ce810b5aa7e02189c0_9ccf7c_35ac2a46e.txt', 'Evaluation data\\PDFs for Clustering Eval wrt Embed Model\\e9442_Photovoltaique-20230711-012221-6.txt'])

MTEB(fra, v1) benchmark: http://mteb-leaderboard.hf.space/?benchmark_name=MTEB%28fra%2C+v1%29


https://huggingface.co/dangvantuan/sentence-camembert-large

https://huggingface.co/Salesforce/SFR-Embedding-2_R - highest in the clustering tasks in the benchmark

https://huggingface.co/manu/sentence_croissant_alpha_v0.4 - not highest but probably faster

https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0

https://huggingface.co/flaubert/flaubert_base_uncased

antoinelouis/crossencoder-camembert-base-mmarcoFR - reranker for French

https://huggingface.co/jinaai/jina-embeddings-v3 - has different types of embedding functions

In [None]:
model_candidates = ["HIT-TMG/KaLM-embedding-multilingual-mini-v1", 
                    "all-MiniLM-L6-v2", 
                    "dangvantuan/sentence-camembert-large",
                    "Snowflake/snowflake-arctic-embed-l-v2.0",
                    "jinaai/jina-embeddings-v3",
                    "flaubert/flaubert_large_cased",  # full Hugging Face id of the large cased FlauBERT
                    "Salesforce/SFR-Embedding-2_R",
                    "manu/sentence_croissant_alpha_v0.4"
                    ]

In [None]:
from sentence_transformers import SentenceTransformer

model_id = "HIT-TMG/KaLM-embedding-multilingual-mini-v1"

# `paragraphs` still holds the paragraphs of the last file processed in the loop above
paragraphs = compute_paragraph_embeddings(paragraphs, model_id)

In [2]:
from council_rag.preprocessing import get_optimal_n_clusters

ImportError: cannot import name 'cluster_n' from 'council_rag.preprocessing' (c:\Users\alber\Desktop\Council-Minutes\Alberto Research\RAGTOR\council_rag\preprocessing\__init__.py)

In [None]:
optimal_n, final_clusters = get_optimal_n_clusters(embeddings, max_n_clusters=9)
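How `get_optimal_n_clusters` scores its clusters is not shown in this notebook. One simple way to compare embedding models by clustering distance is to contrast the mean cosine distance inside clusters with the mean distance between clusters: a better embedding should give a small intra-cluster and a large inter-cluster value. The sketch below uses only the standard library; `cosine_distance`, `intra_inter_distance` and the toy vectors are hypothetical and not the `council_rag` implementation.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def _mean(values):
    """Arithmetic mean, or NaN when there are no values."""
    return sum(values) / len(values) if values else float("nan")


def intra_inter_distance(vectors, labels):
    """Mean cosine distance within clusters and between clusters."""
    intra, inter = [], []
    for (i, u), (j, v) in combinations(enumerate(vectors), 2):
        target = intra if labels[i] == labels[j] else inter
        target.append(cosine_distance(u, v))
    return _mean(intra), _mean(inter)


# Toy 2-d "embeddings": two tight groups pointing in different directions.
demo_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
demo_labels = [0, 0, 1, 1]
intra_dist, inter_dist = intra_inter_distance(demo_vectors, demo_labels)
print(f"intra={intra_dist:.3f} inter={inter_dist:.3f}")
```

With real data, the vectors would be the paragraph embeddings of one model and the labels the cluster assignments, for example one label per source file.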