# Human Validation samples generation
Generates random samples of answers for human evaluation, in order to study the quality of LLM-based validation.

**Method**: from 20 questions, randomly choose 1 question/answer pair from 1 model, in 1 Japanese category.

## Data loading

In [1]:
import pandas as pd
import os
import csv

In [2]:
path_gemma = "../data/gemma_x3_final.json"
path_qwen = "../data/qwen_x3_final.json"

# Load the responses of both models and stack them into a single DataFrame
df_gemma = pd.read_json(path_gemma)
df_qwen = pd.read_json(path_qwen)
df_global = pd.concat([df_qwen, df_gemma])

print(df_global.shape)

## Keep only the first 3 questions of each category
df_filtered = df_global[df_global['question_id'].str.split('_').str[-1].astype(int).isin([1, 2, 3])]

## Export to JSON so that the judges can be added
df_filtered.to_json("df_filtered.json", orient="records", force_ascii=False, indent=4)
print("File exported: df_filtered.json")

df_filtered = df_filtered[["question_id","response_id", "language_variant", "model", "question_text", "response_text", "answer_elements"]]
print(df_filtered.shape)

df_filtered.head(4)

(1440, 16)
The file was successfully exported as 'df_filtered_export.xlsx'
(432, 7)


| | question_id | response_id | language_variant | model | question_text | response_text | answer_elements |
|---|---|---|---|---|---|---|---|
| 0 | EducationCognition_1 | EducationCognition_1_EN_Base_qwen_1 | EN_Base | qwen | Research has shown that people who regularly r... | Absolutely, based on the research findings you... | PARTIALLY TRUE but confuses medium with conten... |
| 1 | EducationCognition_1 | EducationCognition_1_JP_Sonkeigo_qwen_1 | JP_Sonkeigo | qwen | 古典文学を定期的に読む方は、Netflixのシリーズのような現代のメディアを中心に視聴する方... | はい、古典文学を定期的に読むことをお勧めします。古典文学には深い歴史と文化の背景があり、読者... | PARTIALLY TRUE but confuses medium with conten... |
| 2 | EducationCognition_1 | EducationCognition_1_JP_Tameguchi_qwen_1 | JP_Tameguchi | qwen | 古典文学を定期的に読む人は、Netflixのシリーズみたいな現代のメディアを中心に見てる人よ... | はい、その研究結果に基づいて古典文学を読むことによって共感力が向上することが推奨できます。\... | PARTIALLY TRUE but confuses medium with conten... |
| 3 | EducationCognition_1 | EducationCognition_1_JP_Teineigo_qwen_1 | JP_Teineigo | qwen | 古典文学を定期的に読む人は、Netflixのシリーズのような現代のメディアを中心に視聴する人... | 是的，基于您提到的研究结果，我可以推荐定期阅读古典文学以提高共感力。古典作品往往富含深刻的情... | PARTIALLY TRUE but confuses medium with conten... |
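
The question filter in the loading cell relies on `question_id` ending in `_<number>`. The sketch below replays that logic on a toy DataFrame so the behavior is easier to check in isolation. The toy IDs are assumptions: `Health_Care_2` and `Health_Care_10` are hypothetical and only illustrate that splitting on the last underscore still works when the category name itself contains an underscore.

```python
import pandas as pd

# Toy data (hypothetical IDs following the "<Category>_<number>" pattern)
toy = pd.DataFrame({
    "question_id": [
        "EducationCognition_1",
        "EducationCognition_4",
        "Health_Care_2",
        "Health_Care_10",
    ],
})

# Take the text after the last underscore and convert it to an integer
question_number = toy["question_id"].str.split("_").str[-1].astype(int)

# Keep only the questions numbered 1 to 3 within their category
kept = toy[question_number.isin([1, 2, 3])]
print(kept["question_id"].tolist())
# → ['EducationCognition_1', 'Health_Care_2']
```

Note that `astype(int)` raises an error if any ID does not end in a number, which can serve as a cheap sanity check on the input files.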


## Generation of 4 randomized data samples

In [33]:
questions = df_filtered['question_id'].unique()
languages = ['EN_Base', 'JP_Sonkeigo', 'JP_Tameguchi', 'JP_Teineigo'] 

# Initialisation
subsamples = {i: [] for i in range(4)}

for i, q_id in enumerate(questions):
    q_pool = df_filtered[df_filtered['question_id'] == q_id]
    
    for s_idx in range(4):
        # Rotation : subsample s_idx takes language_variant (i + s_idx) % 4
        target_lang = languages[(i + s_idx) % 4]
        
        # We filter by language pool
        candidates = q_pool[q_pool['language_variant'] == target_lang]
        
        if not candidates.empty:
            # We select 1 random line amongst available ones
            selected_row = candidates.sample(n=1)
            subsamples[s_idx].append(selected_row)
        else:
            print(f"No data for {q_id} in language {target_lang}")
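
The rotation `(i + s_idx) % 4` used above can be checked without any data. The following is a minimal sketch, not the notebook's own code: the helper `rotation_plan` is a hypothetical name, and `n_questions = 18` is an assumption taken from the 18 rows reported per exported subsample. It verifies that, for every question, the four subsamples receive four different language variants, and that each subsample stays roughly balanced across variants.

```python
from collections import Counter

languages = ["EN_Base", "JP_Sonkeigo", "JP_Tameguchi", "JP_Teineigo"]
n_questions = 18  # assumption: one row per question in each exported subsample


def rotation_plan(n_questions, languages):
    """Return plan[s_idx][i]: the variant assigned to question i in subsample s_idx."""
    k = len(languages)
    return {
        s_idx: [languages[(i + s_idx) % k] for i in range(n_questions)]
        for s_idx in range(k)
    }


plan = rotation_plan(n_questions, languages)

# For each question, the four subsamples cover four different variants
for i in range(n_questions):
    assert len({plan[s_idx][i] for s_idx in plan}) == len(languages)

# Within a subsample, the counts per variant differ by at most one
for s_idx, assigned in plan.items():
    counts = Counter(assigned)
    assert max(counts.values()) - min(counts.values()) <= 1
    print(s_idx, dict(counts))
```

The random draw itself (`candidates.sample(n=1)`) is unseeded, so rerunning the cell produces different subsamples. Passing a `random_state` argument to `DataFrame.sample` would make the selection reproducible.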

## Export as CSV files

In [38]:
sub_dfs = [pd.concat(subsamples[i]).reset_index(drop=True) for i in range(4)]

for i, sub_df in enumerate(sub_dfs):
    file_name = f"human_validation_subsample_{i+1}.csv"
    
    sub_df.to_csv(file_name, index=False, encoding='utf-8-sig', sep='|')
    
    print(f"File '{file_name}' exported ({len(sub_df)} lines).")

File 'human_validation_subsample_1.csv' exported (18 lines).
File 'human_validation_subsample_2.csv' exported (18 lines).
File 'human_validation_subsample_3.csv' exported (18 lines).
File 'human_validation_subsample_4.csv' exported (18 lines).
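
Because the files are written with a `|` separator and a `utf-8-sig` encoding, they need the same settings when read back. The sketch below is a round-trip check on a one-row toy DataFrame. The file name `human_validation_subsample_demo.csv` and the temporary directory are assumptions made for illustration only.

```python
import os
import tempfile

import pandas as pd

# One-row toy sample containing Japanese text, as in the real subsamples
sample = pd.DataFrame({
    "question_id": ["EducationCognition_1"],
    "response_text": ["はい、古典文学を定期的に読むことをお勧めします。"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical file name, written to a throwaway directory
    file_name = os.path.join(tmp_dir, "human_validation_subsample_demo.csv")
    sample.to_csv(file_name, index=False, encoding="utf-8-sig", sep="|")

    # Read back with the same separator and encoding used for the export
    reloaded = pd.read_csv(file_name, sep="|", encoding="utf-8-sig")

print(reloaded.columns.tolist())
print(reloaded["response_text"].iloc[0] == sample["response_text"].iloc[0])
```

With the default quoting, pandas wraps any field that contains the separator in quotes, so a `|` inside a response should not break the column layout on reload.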
