# SimpEvalDE Generation Procedure

This notebook outlines the procedure for recreating the SimpEvalDE dataset, which was used to train and validate the [DETECT](add your citation here) metric for German text simplification evaluation.

Compiling this dataset requires obtaining access permissions for two proprietary datasets that form the foundation of SimpEvalDE:

1. APA-LHA — (Spring et al., 2021)
https://zenodo.org/records/5148163

2. DEplain — (Stodden et al., 2023)
https://zenodo.org/records/7674560

Once you have permission to use these datasets, you should also download the additional data files required for assembling SimpEvalDE from the Hugging Face repository: ➡️ https://huggingface.co/datasets/ZurichNLP/SimpEvalDE

After obtaining all required files, specify the dataset paths in the notebook:

In [126]:
# Path to Hugging Face data components (e.g., generations, scores, human grades)
data_path = "../data"

# Paths to the proprietary datasets (must be obtained separately)
apa_lha_path = "../data/apa_lha"
deplain_path = "../data/deplain"
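
Before running the remaining cells, it can help to confirm that the Hugging Face components are in place. The following is a minimal sketch, not part of the original pipeline: `find_missing_files` and `REQUIRED_FILES` are hypothetical names, and the file list is collected from the `read_csv` calls later in this notebook.

```python
from pathlib import Path

# File names collected from the read_csv calls later in this notebook.
REQUIRED_FILES = [
    "combined_augmented_dataset_tocheck_fin.csv",
    "test_deplain_FK_50_checked_fin.csv",
    "train_ATS_final_v2.csv",
    "LLM_scores_train.csv",
    "test_ATS_final_melted_final_v2.csv",
    "LLM_scores_test.csv",
    "human_grading.csv",
]


def find_missing_files(base_dir, file_names=REQUIRED_FILES):
    """Hypothetical helper: return the names in file_names that are not files under base_dir."""
    base = Path(base_dir)
    return [name for name in file_names if not (base / name).is_file()]


missing = find_missing_files("../data")
if missing:
    print("Missing Hugging Face components:", ", ".join(missing))
```

The DEplain splits (`test.csv`, `train.csv`, `dev.csv`) can be checked the same way by passing `deplain_path` and those three names.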

Run the notebook to merge, filter, and augment the datasets as described on Hugging Face. It writes `SimpEvalDE_train.csv` and `SimpEvalDE_test.csv` to `data_path`.

## Load Packages

In [38]:
from collections import defaultdict

In [2]:
import pandas as pd
import os
import numpy as np
import textstat

In [39]:
from bert_score import score

In [40]:
import spacy
nlp = spacy.load("de_core_news_sm")

In [41]:
from sklearn.model_selection import train_test_split

## Define functions

In [4]:
def clean_sentence(sentence):
    """
    Cleans a sentence to ensure it ends with a single full stop.
    - Adds a full stop if missing.
    - Fixes cases where it ends with ' .'.
    
    :param sentence: The sentence to clean.
    :return: Cleaned sentence with a proper full stop.
    """
    sentence = sentence.strip()  # Remove leading/trailing spaces
    
    # Fix cases where sentence ends with " ."
    if sentence.endswith(" ."):
        sentence = sentence[:-2]  # Remove the extra space and dot
    
    # Ensure the sentence ends with a single full stop
    if not sentence.endswith("."):
        sentence += "."
    
    return sentence
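
To make the behaviour of `clean_sentence` concrete, the sketch below repeats its logic so that it runs on its own and applies it to a few invented inputs. One point to be aware of: only a full stop counts as a sentence ending, so a sentence that ends in `?` or `!` also gets a `.` appended.

```python
def clean_sentence(sentence):
    # Same logic as the cell above, repeated so this example is self-contained.
    sentence = sentence.strip()
    if sentence.endswith(" ."):
        sentence = sentence[:-2]
    if not sentence.endswith("."):
        sentence += "."
    return sentence


# Invented example inputs.
examples = [
    "Das ist ein Satz",
    "Das ist ein Satz .",
    "  Das ist ein Satz.  ",
    "Ist das ein Satz?",
]
for text in examples:
    print(repr(text), "->", repr(clean_sentence(text)))
```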

In [5]:
def map_and_combine(original, simplified):
    """
    Maps original sentences to their corresponding simplified sentences, ensuring proper cleaning.
    Flags sentences where any of their simplifications are also mapped to by other original sentences.
    Exports the results to a Pandas DataFrame.
    """
    mapping = defaultdict(list)
    reverse_mapping = defaultdict(set)

    # Step 1: Clean sentences
    original = [clean_sentence(o) for o in original]
    simplified = [clean_sentence(s) for s in simplified]

    # Step 2: Build mapping (Original → Simplified)
    for orig, simp in zip(original, simplified):
        mapping[orig].append(simp)
        reverse_mapping[simp].add(orig)  # Track which originals a simplified sentence belongs to

    # Step 3: Identify problematic mappings
    problematic_simplifications = {simp for simp, origs in reverse_mapping.items() if len(origs) > 1}

    # Step 4: Generate Data for DataFrame
    data = []
    for orig, simps in mapping.items():
        joined_simps = " ".join(simps)  # Join simplified sentences
        num_simplified = len(simps)  # Count number of simplifications
        is_problematic = any(simp in problematic_simplifications for simp in simps)  # Flag if any simplified sentence is problematic
        data.append({
            "original": orig,
            "simplified": joined_simps,
            "no_sentences": num_simplified,
            "multi": is_problematic  # True if any of the simplifications are shared across multiple originals
        })

    # Convert to DataFrame
    df = pd.DataFrame(data)

    return df
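
The core of `map_and_combine` is a pair of dictionaries: one from each original to its simplifications, and a reverse one used to spot simplifications shared by several originals. The sketch below reproduces that logic on invented placeholder sentences, without pandas, to show how `no_sentences` and `multi` come about.

```python
from collections import defaultdict

# Invented aligned pairs; both lists have the same length.
originals = ["O1.", "O1.", "O2.", "O3."]
simplified = ["S1.", "S2.", "S3.", "S3."]

mapping = defaultdict(list)         # original -> list of simplified sentences
reverse_mapping = defaultdict(set)  # simplified -> set of originals it is aligned to
for orig, simp in zip(originals, simplified):
    mapping[orig].append(simp)
    reverse_mapping[simp].add(orig)

# Simplified sentences that are aligned to more than one original.
shared = {simp for simp, origs in reverse_mapping.items() if len(origs) > 1}

rows = [
    {
        "original": orig,
        "simplified": " ".join(simps),
        "no_sentences": len(simps),
        "multi": any(simp in shared for simp in simps),
    }
    for orig, simps in mapping.items()
]
for row in rows:
    print(row)
```

Here `O1.` is a 1:2 alignment with `multi` set to `False`, while `O2.` and `O3.` both map to `S3.` and are therefore flagged.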

In [6]:
def remove_multiple_orig_matches(df, original, simplification):
    return df[df.groupby([simplification])[original].transform('count') == 1].reset_index(drop = True)
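
`remove_multiple_orig_matches` keeps a row only if its simplification is paired with exactly one original. A small sketch on invented data, using the same `groupby(...).transform('count')` pattern:

```python
import pandas as pd

# Invented data: S2 is paired with two different originals.
df = pd.DataFrame({
    "original": ["O1", "O2", "O3"],
    "simplified": ["S1", "S2", "S2"],
})

# For every row, count how many originals share that row's simplification.
counts = df.groupby("simplified")["original"].transform("count")
kept = df[counts == 1].reset_index(drop=True)
print(kept)
```

Both rows that share `S2` are dropped, not just one of them, so ambiguous pairs never reach the later merge steps.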

In [20]:
def remove_extra_space(df, columns):
    for column in columns:
        print(column)
        df[column] = df[column].str.replace(r'\s+([.,])', r'\1', regex=True)
        df[column] = df[column].str.replace(r'\s+', ' ', regex=True).str.strip()
    return df
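
The two regular expressions in `remove_extra_space` can be tried on a single string with the standard `re` module. This is a sketch with a hypothetical helper name (`normalize_spacing`) and invented sentences; the patterns are the ones used above.

```python
import re


def normalize_spacing(text):
    """Hypothetical single-string version of remove_extra_space."""
    text = re.sub(r"\s+([.,])", r"\1", text)   # drop whitespace before '.' and ','
    return re.sub(r"\s+", " ", text).strip()   # collapse runs of whitespace


print(normalize_spacing("Das Wetter ist schön ,  aber kalt ."))
print(normalize_spacing("Wien ( Bezirk ) ."))
```

Only full stops and commas are handled, so spaces inside parentheses are left as they are.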

In [72]:
def adjust_score_with_length(bert_scores, word_counts):

    max_word_count = max(word_counts)  # Find the longest sentence in the dataset
    adjusted_scores = [
        bs * (np.log(max_word_count + 1) / np.log(wc + 1)) if wc > 0 else 0
        for bs, wc in zip(bert_scores, word_counts)
    ]
    return adjusted_scores
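
The adjustment multiplies each score by `log(max_word_count + 1) / log(word_count + 1)`. The sketch below applies that factor to one invented score at different lengths; `adjust` is a hypothetical single-value version of the function above, and the maximum of 40 words is an assumed value.

```python
import math


def adjust(score, word_count, max_word_count):
    """Hypothetical single-value version of adjust_score_with_length."""
    if word_count <= 0:
        return 0
    return score * math.log(max_word_count + 1) / math.log(word_count + 1)


max_wc = 40  # assumed longest sentence, for illustration only
for wc in (3, 10, 40):
    print(wc, round(adjust(0.5, wc, max_wc), 3))
```

The factor is 1 for the longest sentence and grows as sentences get shorter, so short sentences are boosted the most and adjusted values can exceed 1. Negative scores are pushed further below zero by the same factor.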

In [73]:
def bert_score(df, orig, simp):
    P, R, F1 = score(df[orig].tolist(), df[simp].tolist(), lang="de", rescale_with_baseline=True)
    return F1

In [129]:
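def FK_score(text):
    # Despite the name, this returns textstat's Flesch Reading Ease score
    # (higher = easier to read), not the Flesch-Kincaid grade level.
    return textstat.flesch_reading_ease(text)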

In [None]:
def get_text_metrics(text):
    """Return (num_words, avg_word_length, sentence_count) for a text, tokenized with spaCy."""
    doc = nlp(text)
    num_words = len([token.text for token in doc if token.is_alpha])  # alphabetic tokens only
    sentence_count = len(list(doc.sents))
    # Note: the character total covers all tokens (including punctuation),
    # while num_words counts alphabetic tokens only.
    avg_word_length = sum(len(token.text) for token in doc) / num_words if num_words > 0 else 0
    return num_words, avg_word_length, sentence_count

In [None]:
def compute_text_metrics(df, orig_col, simp_col):
    """
    Compute length-based metrics comparing each original with its simplification:
    1. WordReductionRatio (original words / simplified words)
    2. WordCountOrig (number of alphabetic tokens in the original)
    3. SentenceReductionRatio (original sentences / simplified sentences)

    Args:
        df (pd.DataFrame): DataFrame containing the original and simplification columns.
        orig_col (str): Name of the column holding the original text.
        simp_col (str): Name of the column holding the simplified text.

    Returns:
        pd.DataFrame: Copy of the input DataFrame with the three metric columns added.
    """
    df = df.copy()  # Work on a copy so that slices passed in are not modified

    word_ratios = []
    sentence_ratios = []
    word_counts = []

    for _, row in df.iterrows():
        original = row[orig_col]
        simplification = row[simp_col]

        # Tokenize using spaCy
        orig_doc = nlp(original)
        simp_doc = nlp(simplification)

        # Word count (alphabetic tokens only)
        orig_word_count = len([token.text for token in orig_doc if token.is_alpha])
        simp_word_count = len([token.text for token in simp_doc if token.is_alpha])
        word_ratio = orig_word_count / simp_word_count if simp_word_count > 0 else 0

        # Sentence count
        orig_sentence_count = len(list(orig_doc.sents))
        simp_sentence_count = len(list(simp_doc.sents))
        sentence_ratio = orig_sentence_count / simp_sentence_count if simp_sentence_count > 0 else 0

        word_counts.append(orig_word_count)
        word_ratios.append(word_ratio)
        sentence_ratios.append(sentence_ratio)

    # Add metrics to DataFrame
    df["WordReductionRatio"] = word_ratios
    df["WordCountOrig"] = word_counts
    df["SentenceReductionRatio"] = sentence_ratios

    return df

In [None]:
def transform_to_simplifications(df):
    df = df.copy()  # Avoid modifying the original dataframe
    df["simplifications"] = df.apply(
        lambda row: [val for val in [row["simplified_B1"], row["simplified_A2"]] if pd.notna(val)],
        axis=1
    )
    return df[["original", "simplifications"]]

In [None]:
def clean_output(text):
    if not isinstance(text, str):
        return text  # Skip cleaning if it's not a string (e.g., NaN)

    # Keep only the first line
    first_line = text.strip().split('\n')[0]

    # Remove specific prefixes if present
    for prefix in ["Ausgabe:", "Eingabe:"]:
        if first_line.startswith(prefix):
            first_line = first_line[len(prefix):].strip()

    return first_line

## APA-LHA Original-B1-A2 Dataset Compilation

In [None]:
A2_OR_mapped_out = pd.DataFrame(columns=['fileID', 'original', 'simplified', 'no_sentences', 'multi'])
A2_OR_mapping_dir = os.path.join(apa_lha_path, "A2-OR")  # join adds the path separator that plain concatenation omitted

In [8]:
files_dir = os.listdir(A2_OR_mapping_dir)
files_mapped = list(set(["_".join(file_dir.split('_')[0:2]) for file_dir in files_dir if file_dir.endswith(".simpde")]))

for file_mapped in files_mapped:
    orig_path = os.path.join(A2_OR_mapping_dir, f"{file_mapped}.de")
    simp_path = os.path.join(A2_OR_mapping_dir, f"{file_mapped}_A2.simpde")

    with open(orig_path, "r", encoding="utf-8") as file:
        orig_sentences = [line.strip() for line in file]  
    with open(simp_path, "r", encoding="utf-8") as file:
        simp_sentences = [line.strip() for line in file]  

    #combined_mapping = map_and_combine(orig_sentences, simp_sentences)
    # new_rows = pd.DataFrame([
    #     {"fileID": file_mapped, "original": orig, "simplified": simp} 
    #     for orig, simp in combined_mapping.items()
    # ])
    new_rows = map_and_combine(orig_sentences, simp_sentences)
    new_rows['fileID'] = file_mapped
    A2_OR_mapped_out = pd.concat([A2_OR_mapped_out, new_rows], ignore_index=True)


In [9]:
A2_OR_mapped_out_dedup = remove_multiple_orig_matches(A2_OR_mapped_out, 'original', 'simplified')

In [None]:
B1_OR_mapped_out = pd.DataFrame(columns=['fileID', 'original', 'simplified', 'no_sentences', 'multi'])
B1_OR_mapping_dir = os.path.join(apa_lha_path, "B1-OR")  # join adds the path separator that plain concatenation omitted

In [11]:
files_dir = os.listdir(B1_OR_mapping_dir)
files_mapped = list(set(["_".join(file_dir.split('_')[0:2]) for file_dir in files_dir if file_dir.endswith(".simpde")]))

for file_mapped in files_mapped:
    orig_path = os.path.join(B1_OR_mapping_dir, f"{file_mapped}.de")
    simp_path = os.path.join(B1_OR_mapping_dir, f"{file_mapped}_B1.simpde")

    with open(orig_path, "r", encoding="utf-8") as file:
        orig_sentences = [line.strip() for line in file]  
    with open(simp_path, "r", encoding="utf-8") as file:
        simp_sentences = [line.strip() for line in file]  

    #combined_mapping = map_and_combine(orig_sentences, simp_sentences)
    # new_rows = pd.DataFrame([
    #     {"fileID": file_mapped, "original": orig, "simplified": simp} 
    #     for orig, simp in combined_mapping.items()
    # ])
    new_rows = map_and_combine(orig_sentences, simp_sentences)
    new_rows['fileID'] = file_mapped
    B1_OR_mapped_out = pd.concat([B1_OR_mapped_out, new_rows], ignore_index=True)

In [12]:
B1_OR_mapped_out_dedup = remove_multiple_orig_matches(B1_OR_mapped_out, 'original', 'simplified')

In [13]:
merged_df = pd.merge(B1_OR_mapped_out_dedup, A2_OR_mapped_out_dedup, on=["fileID", "original"], suffixes=("_B1", "_A2"), how="inner")

## Merge APA-LHA with DEplain-test

Note: SimpEvalDE uses only the test subset of DEplain, because two of the LLMs used later for ATS generation were trained on the train and dev subsets of this dataset.

In [127]:
test_deplain = pd.read_csv(f"{deplain_path}/test.csv")

In [None]:
#clean datasets
merged_df = remove_extra_space(merged_df, ['original', 'simplified_B1', 'simplified_A2'])

original
simplified_B1
simplified_A2


In [23]:
A2_OR_mapped_out_dedup = remove_extra_space(A2_OR_mapped_out_dedup, ['original', 'simplified'])

original
simplified


In [24]:
B1_OR_mapped_out_dedup = remove_extra_space(B1_OR_mapped_out_dedup, ['original', 'simplified'])

original
simplified


### Merge 1 - A2 and B1 joined

In [25]:
#first join deplain where both B1 and A2 simplifications align - hopefully makes the original mapping the best
slice_deplain_test_merged = pd.merge(merged_df, test_deplain[['original', 'simplification', 'alignment']], left_on=['simplified_B1', 'simplified_A2'], right_on=['original', 'simplification'], how = "inner")

In [None]:
# 'no_sentences' (without suffix) does not exist after the merge, so it is not listed here
slice_deplain_test_merged = slice_deplain_test_merged.drop(columns = ['no_sentences_B1', 'multi_B1', 'multi_A2', 'no_sentences_A2', 'original_y', 'simplification'])
slice_deplain_test_merged.columns = ['fileID', 'original', 'simplified_B1', 'simplified_A2', 'alignment']
slice_deplain_test_merged['align'] = "full"

In [28]:
slice_deplain_test_merged.shape

(19, 6)

### Merge 2 - Join by B1 only and merge A2

In [29]:
# First remove the originals that were already matched in Merge 1.
# Then match by B1_Ze = B1_De and join the DEplain B1_De -> A2_De simplification; the Orig-B1 pairs should still be checked.
# This also keeps the items that were removed via remove_multiple_orig_matches, but that is okay because they will get filtered out again.
# Matching goes through the original to avoid picking up mappings that do not correspond in DEplain.
slice_deplain_test_B1 = B1_OR_mapped_out_dedup[~B1_OR_mapped_out_dedup['original'].isin(slice_deplain_test_merged['original'])]
slice_deplain_test_B1 = pd.merge(slice_deplain_test_B1,  test_deplain[['original', 'simplification', 'alignment']], left_on='simplified', right_on='original', how = "inner")

In [None]:
slice_deplain_test_B1 = remove_multiple_orig_matches(slice_deplain_test_B1, 'original_x', 'original_y') #120 rows

In [32]:
#renaming and cleaning
slice_deplain_test_B1 = slice_deplain_test_B1.drop(columns = ['original_y', 'multi', 'no_sentences'])
slice_deplain_test_B1.columns = ['fileID', 'original', 'simplified_B1', 'simplified_A2', 'alignment']
slice_deplain_test_B1['align'] = "Deplain_B1"

### Merge 3 - Join by A2 only and merge B1

In [34]:
slice_deplain_test_A2 = A2_OR_mapped_out_dedup[~A2_OR_mapped_out_dedup['original'].isin((slice_deplain_test_merged['original']).tolist() +slice_deplain_test_B1['original'].tolist())]
slice_deplain_test_A2 = pd.merge(slice_deplain_test_A2,  test_deplain[['original', 'simplification', 'alignment']], left_on='simplified', right_on='simplification', how = "inner")

In [39]:
slice_deplain_test_A2 = remove_multiple_orig_matches(slice_deplain_test_A2, 'original_x', 'original_y')

In [None]:
slice_deplain_test_A2 = slice_deplain_test_A2.drop(columns = ['simplified', 'no_sentences', 'multi'])
slice_deplain_test_A2.columns = ['fileID', 'original', 'simplified_B1', 'simplified_A2', 'alignment']
slice_deplain_test_A2['align'] = "Deplain_A2"

In [None]:
slice_deplain_test_A2  #95 rows

### Concatenate all 3 parts

In [42]:
full_deplain_set = pd.concat([slice_deplain_test_merged, slice_deplain_test_B1, slice_deplain_test_A2])

In [43]:
full_deplain_set = remove_multiple_orig_matches(full_deplain_set, 'original', 'simplified_B1')

In [None]:
full_deplain_set = remove_multiple_orig_matches(full_deplain_set, 'original', 'simplified_A2') #216 rows

In [None]:
len(full_deplain_set['simplified_A2'].unique())

216

In [46]:
len(full_deplain_set['simplified_B1'].unique())

216

In [47]:
len(full_deplain_set['original'].unique())

216

## Add unmatched rows to map

### Filter out texts in merged dataset or DEplain train+dev

In [None]:
train_deplain = pd.read_csv(f"{deplain_path}/train.csv")
dev_deplain = pd.read_csv(f"{deplain_path}/dev.csv")

In [53]:
train_dev_deplain = pd.concat([train_deplain, dev_deplain])

In [54]:
merged_df_nontrain = merged_df[(~merged_df['simplified_B1'].isin(train_dev_deplain['original'])) & (~merged_df['simplified_A2'].isin(train_dev_deplain['simplification']))]

In [55]:
merged_df_nontrain.shape

(1128, 9)

In [57]:
# .copy() so that columns can be added below without a SettingWithCopyWarning
merged_df_nontrain_unmatched = merged_df_nontrain[~merged_df_nontrain['original'].isin(full_deplain_set['original'])].copy()

In [58]:
merged_df_nontrain_unmatched.shape

(1067, 9)

In [59]:
merged_df_nontrain_unmatched['alignment'] = merged_df_nontrain_unmatched['no_sentences_B1'].astype(str) + ":" + merged_df_nontrain_unmatched['no_sentences_A2'].astype(str)

In [None]:
merged_df_nontrain_unmatched = merged_df_nontrain_unmatched.drop(columns = ["multi_B1", "no_sentences_B1", "multi_A2", "no_sentences_A2"])

In [62]:
merged_df_nontrain_unmatched['align'] = "none"

In [63]:
full_merged_df = pd.concat([merged_df_nontrain_unmatched, full_deplain_set])

In [64]:
#Both of these simplifications do not correspond to the original
full_merged_df[full_merged_df['simplified_A2'].duplicated(keep=False)].sort_values(by = "original")

Unnamed: 0,fileID,original,simplified_B1,simplified_A2,alignment,align


In [65]:
full_merged_df[full_merged_df['simplified_B1'].duplicated(keep=False)].sort_values(by = "original")

Unnamed: 0,fileID,original,simplified_B1,simplified_A2,alignment,align


In [66]:
full_merged_df[full_merged_df['original'].duplicated(keep=False)].sort_values(by = "original")

Unnamed: 0,fileID,original,simplified_B1,simplified_A2,alignment,align


### Text Metrics to select highly likely candidates

In [None]:
# get_text_metrics returns three values; only num_words and avg_word_length are stored here
full_merged_df[['num_words_orig', 'avg_word_length_orig']] = full_merged_df['original'].apply(
    lambda x: pd.Series(get_text_metrics(str(x))[:2])
)

In [70]:
full_merged_df[['num_words_B1', 'avg_word_length_B1']] = full_merged_df['simplified_B1'].apply(
    lambda x: pd.Series(get_text_metrics(str(x))[:2])
)

In [71]:
full_merged_df[['num_words_A2', 'avg_word_length_A2']] = full_merged_df['simplified_A2'].apply(
    lambda x: pd.Series(get_text_metrics(str(x))[:2])
)

In [74]:
full_merged_df['bert_score_B1_orig'] = bert_score(full_merged_df, 'original', 'simplified_B1')

In [75]:
full_merged_df['bert_score_A2_orig'] = bert_score(full_merged_df, 'original', 'simplified_A2')

In [76]:
full_merged_df["bert_score_B1_orig_adjusted"] = adjust_score_with_length(full_merged_df['bert_score_B1_orig'], full_merged_df["num_words_B1"])

In [77]:
full_merged_df["bert_score_A2_orig_adjusted"] = adjust_score_with_length(full_merged_df['bert_score_A2_orig'], full_merged_df["num_words_A2"])

In [78]:
full_merged_df['bert_score_A2_orig_adjusted'].describe()

count    1283.000000
mean        0.494115
std         0.434568
min        -0.746817
25%         0.224177
50%         0.377510
75%         0.627293
max         3.054263
Name: bert_score_A2_orig_adjusted, dtype: float64

In [79]:
full_merged_df['bert_score_B1_orig_adjusted'].describe()

count    1283.000000
mean        0.560183
std         0.493309
min        -0.512017
25%         0.233982
50%         0.418697
75%         0.726788
max         3.196160
Name: bert_score_B1_orig_adjusted, dtype: float64

In [81]:
full_merged_df['num_words_orig'].describe()

count    1283.000000
mean       17.663289
std         8.007920
min         3.000000
25%        12.000000
50%        17.000000
75%        23.000000
max        42.000000
Name: num_words_orig, dtype: float64

In [82]:
# Remove rows where the B1 and A2 simplifications are identical; distinct simplifications are needed
full_merged_df = full_merged_df.loc[~(full_merged_df['simplified_B1'] == full_merged_df['simplified_A2']), :]

In [83]:
full_merged_df.shape

(1140, 16)

In [84]:
full_merged_df = full_merged_df.loc[full_merged_df['num_words_orig']> 5, :]

In [85]:
full_merged_df['high_sim'] = (full_merged_df['bert_score_B1_orig_adjusted'] > full_merged_df['bert_score_B1_orig_adjusted'].quantile(0.5)) & (full_merged_df['bert_score_A2_orig_adjusted'] > full_merged_df['bert_score_A2_orig_adjusted'].quantile(0.5))

In [86]:
full_merged_df.shape

(1111, 17)

## Export and manually validate

In [None]:
#full_merged_df.to_csv("../data/confidential/combined_augmented_dataset_tocheck_v5.csv")

In [None]:
manual_alignments = pd.read_csv(f"{data_path}/combined_augmented_dataset_tocheck_fin.csv")
checked_full_merged_df = pd.concat([full_merged_df, manual_alignments], axis = 1)

In [164]:
checked_full_merged_df.shape

(1111, 19)

In [165]:
checked_full_merged_df['Match'] = checked_full_merged_df['Match'].astype(str)

In [166]:
checked_full_merged_df.Match.value_counts()

Match
no                       502
nan                      322
ok                        88
yes                       44
B1                        30
B1_man_sent_match         27
man_sent_match            20
man                       15
A2                        13
ref                       12
B1_A2                     10
B1/A2_man_sent_match       7
A2_man_sent_match          7
B1_man                     6
A2/B1_man_sent_match       6
A2_man_sent_match/ref      1
B1/man_sent_match          1
Name: count, dtype: int64

In [168]:
checked_full_merged_df.loc[checked_full_merged_df.Match.str.startswith("B1/"), 'Match'] = "B1"
checked_full_merged_df.loc[checked_full_merged_df.Match.str.startswith("A2/"), 'Match'] = "A2"
checked_full_merged_df_select = checked_full_merged_df.loc[checked_full_merged_df.Match.isin(["yes", "B1", "A2", "B1_A2"]), :]
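
The label handling above can be read as a two-step rule: collapse any label starting with `B1/` or `A2/` to its prefix, then keep only four accepted labels. The sketch below applies the same rule to a few labels taken from the value counts above, in plain Python; `normalize_match` and `KEEP` are hypothetical names.

```python
KEEP = {"yes", "B1", "A2", "B1_A2"}


def normalize_match(label):
    """Hypothetical helper mirroring the two .loc assignments above."""
    if label.startswith("B1/"):
        return "B1"
    if label.startswith("A2/"):
        return "A2"
    return label


labels = [
    "yes",
    "B1/A2_man_sent_match",
    "A2/B1_man_sent_match",
    "A2_man_sent_match/ref",
    "B1_man",
    "no",
    "nan",
]
selected = [normalize_match(label) for label in labels if normalize_match(label) in KEEP]
print(selected)
```

Labels such as `A2_man_sent_match/ref` or `B1_man` do not start with `A2/` or `B1/`, so they are left unchanged and then dropped by the filter.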

In [179]:
test_deplain_unmatched = test_deplain.loc[
    (
        (~test_deplain['original'].isin(checked_full_merged_df_select['simplified_B1']))
        & (~test_deplain['simplification'].isin(checked_full_merged_df_select['simplified_A2']))
    )
    | (test_deplain.index == 455),  # row 455 is kept explicitly
    :
].copy()  # copy so that FK_score can be added without a SettingWithCopyWarning

In [180]:
test_deplain_unmatched['FK_score'] = test_deplain_unmatched['original'].apply(FK_score)

In [181]:
test_deplain_unmatched.FK_score.describe()

count    1193.000000
mean       66.135423
std        21.802466
min       -35.130000
25%        52.870000
50%        68.770000
75%        80.780000
max       119.190000
Name: FK_score, dtype: float64

In [110]:
#test_deplain_unmatched.loc[test_deplain_unmatched.FK_score < 50, :].to_csv("test_deplain_FK_50.csv")

In [182]:
test_deplain_unmatched_alignments = pd.read_csv(f"{data_path}/test_deplain_FK_50_checked_fin.csv", index_col=0)
test_deplain_unmatched_checked = pd.concat([test_deplain_unmatched.loc[test_deplain_unmatched.FK_score < 50, :], test_deplain_unmatched_alignments.loc[:, "Match"]], axis=1)

In [186]:
test_deplain_unmatched_checked['Match'] = test_deplain_unmatched_checked['Match'].astype(str)
test_deplain_unmatched_select = test_deplain_unmatched_checked.loc[test_deplain_unmatched_checked['Match'] == "yes", ['original', 'simplification']]
test_deplain_unmatched_select.columns = ["original", "simplified_A2"]
test_deplain_unmatched_select['Match'] = "B1_A2"
test_deplain_unmatched_select['orig'] = "B1"

In [188]:
# .copy() so that the assignments below modify an independent DataFrame
checked_full_merged_df_select_t = checked_full_merged_df_select[['original', 'simplified_B1', 'simplified_A2', 'Match']].copy()
checked_full_merged_df_select_t.loc[checked_full_merged_df_select_t['Match']== "B1_A2", 'original'] = checked_full_merged_df_select_t.loc[checked_full_merged_df_select_t['Match']== "B1_A2", 'simplified_B1']
checked_full_merged_df_select_t.loc[checked_full_merged_df_select_t['Match'].isin(["B1_A2", 'A2']), 'simplified_B1'] = None
checked_full_merged_df_select_t.loc[checked_full_merged_df_select_t['Match'] == "B1", 'simplified_A2'] = None
checked_full_merged_df_select_t['orig'] = "orig"

In [189]:
final_dataset = pd.concat([test_deplain_unmatched_select, checked_full_merged_df_select_t]).reset_index(drop = True)

In [190]:
final_dataset.Match.value_counts()

Match
B1_A2    68
yes      44
B1       38
A2       19
Name: count, dtype: int64

In [191]:
final_dataset.orig.value_counts()

orig
orig    111
B1       58
Name: count, dtype: int64

In [192]:
final_dataset.shape

(169, 5)

## Split into Train and Test

In [193]:
metrics_A2 = compute_text_metrics(final_dataset[pd.notna(final_dataset['simplified_A2'])], 'original', 'simplified_A2')
metrics_B1 = compute_text_metrics(final_dataset[pd.isna(final_dataset['simplified_A2'])], 'original', 'simplified_B1')
final_dataset_w_metrics = pd.concat([metrics_A2, metrics_B1])

In [195]:
final_dataset_w_metrics['category'] = np.where(
    final_dataset_w_metrics['SentenceReductionRatio'] < 1, 
    "split", 
    np.where(final_dataset_w_metrics['WordReductionRatio'] >= 1.5, "delete", "paraphrase")
)
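
The nested `np.where` amounts to a simple priority rule. Both ratios are original divided by simplified, so a sentence ratio below 1 means the simplification has more sentences than the original. A plain-Python sketch of the same rule, with a hypothetical function name and invented ratios:

```python
def categorize(sentence_ratio, word_ratio):
    """Hypothetical scalar version of the np.where rule above."""
    if sentence_ratio < 1:
        return "split"        # more sentences in the simplification
    if word_ratio >= 1.5:
        return "delete"       # original has at least 1.5x as many words
    return "paraphrase"


for sentence_ratio, word_ratio in [(0.5, 2.0), (1.0, 1.5), (1.0, 1.2)]:
    print(sentence_ratio, word_ratio, categorize(sentence_ratio, word_ratio))
```

The sentence check comes first, so a pair that both splits the sentence and removes many words is labelled `split`.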

In [196]:
final_dataset_w_metrics['original'].value_counts() # one row got put twice in test, so deduplicating after the split is OK and preserves the train_df selection

original
14 Prozent der Befragten sagten, dass sie schon Erfahrungen mit CBD-Produkten gemacht haben.                                                                         2
Arbeitnehmer nennt man Personen, die in Firmen arbeiten.                                                                                                             1
Mindestens sechs Tote durch Hurrikan Laura in den USA.                                                                                                               1
In Kirchschlag in der Buckligen Welt ( Bezirk Wiener Neustadt ) soll ein 14-Jähriger am Montag seine Mutter erstochen haben.                                         1
Die offizielle Gesamtzahl der Krankheitsfälle in Festland-China durch das inzwischen offiziell als Covid-19 bezeichnete Virus wuchs damit auf mehr als 44.200 an.    1
                                                                                                                                                            

In [197]:
final_dataset_w_metrics.category.value_counts()

category
paraphrase    92
split         45
delete        32
Name: count, dtype: int64

In [198]:
# First split: 100 training rows vs. 69 held-out rows, stratified on both 'category' and 'Match'
train, temp_df = train_test_split(final_dataset_w_metrics, test_size=69, stratify=final_dataset_w_metrics[['category', 'Match']], random_state=42)

# Second split: 60 test rows and 9 few-shot candidates, stratified on 'category' only
test, few_shot = train_test_split(temp_df, test_size=9, stratify=temp_df['category'], random_state=42)

# Print the distribution to check balance
print("Train distribution:")
print(train.groupby(['category', 'Match']).size())

print("\nTest distribution:")
print(test.groupby(['category', 'Match']).size())

Train distribution:
category    Match
delete      A2        5
            B1        5
            B1_A2     3
            yes       5
paraphrase  A2        4
            B1       12
            B1_A2    24
            yes      15
split       A2        2
            B1        5
            B1_A2    14
            yes       6
dtype: int64

Test distribution:
category    Match
delete      A2        3
            B1        3
            B1_A2     2
            yes       4
paraphrase  A2        2
            B1        9
            B1_A2    14
            yes       7
split       A2        1
            B1        3
            B1_A2     9
            yes       3
dtype: int64


In [199]:
print("\nFew-shot distribution:")
print(few_shot.groupby(['category', 'Match']).size())


Few-shot distribution:
category    Match
delete      A2       1
            B1       1
paraphrase  B1_A2    2
            yes      3
split       A2       1
            yes      1
dtype: int64


In [202]:
training_pairs = (
    few_shot.groupby('category').apply(lambda x: x.sample(2, random_state=42))
    .reset_index(drop=True)
    [['original', 'simplified_A2', 'category', 'Match']]
    .values.tolist()
)

In [203]:
test = test.drop_duplicates(subset = "original") # dropping the duplicate leaves test one row short, so one unused few-shot row is moved into test below
unmatched_few_shot = few_shot.loc[~(few_shot['original'].isin([orig for orig, simp, cat, mat in training_pairs])), :].reset_index(drop = True)
test = pd.concat([test, pd.DataFrame(unmatched_few_shot.loc[[0]])])

In [204]:
test.shape

(60, 9)

In [205]:
# Apply to both dataframes
train_transformed = transform_to_simplifications(train)
test_transformed = transform_to_simplifications(test) 

## Add ATS (train+test), LLM (train+test) and Human Scores (test only)

### Train

In [None]:
## Add Automatic Simplifications
train_ATS = pd.read_csv(f'{data_path}/train_ATS_final_v2.csv')

In [90]:
train_ATS = train_ATS.set_index("Unnamed: 0")
train_w_ATS = train_transformed.merge(train_ATS, left_index=True, right_index=True).drop(columns = ["Unnamed: 0.1"])

In [92]:
train_w_ATS = train_w_ATS.melt(
    id_vars=['original', 'simplifications'],              # columns to keep fixed
    value_vars=[col for col in train_ATS.columns if col.startswith('ATS_')],  # ATS model outputs
    var_name='ATS_Model',              # new column name for model name
    value_name='simplification'        # new column name for the simplified text
)

In [94]:
train_w_ATS['simplification'] = train_w_ATS['simplification'].apply(clean_output)
train_w_ATS = train_w_ATS.sample(frac = 1, random_state = 42)

In [None]:
train_LLM = pd.read_csv(f"{data_path}/LLM_scores_train.csv")
train_w_LLM = pd.concat([train_w_ATS, train_LLM.drop(columns = ["Unnamed: 0", "simplification"])], axis = 1)

### Test

In [None]:
test_ATS = pd.read_csv(f'{data_path}/test_ATS_final_melted_final_v2.csv')

In [None]:
test_transformed['orig_ID'] = test_transformed.reset_index(drop = True).index+1
test_w_ATS = pd.merge(
    test_transformed,
    test_ATS,
    on="orig_ID"
).sort_values(by = "Unnamed: 0").drop(columns = ['Unnamed: 0', 'orig_ID'])

In [None]:
test_df_final_melted_60 = test_w_ATS.groupby("ATS_Model").apply(lambda x: x.sample(n=10, random_state=42)).reset_index(drop=True).sample(frac = 1, random_state = 42)
test_ids = test_df_final_melted_60['simp_ID'].unique()
# Create the remaining dataset excluding those 60 rows
remaining_df = test_w_ATS[~test_w_ATS['simp_ID'].isin(test_ids)].sample(frac=1, random_state=42)

# Concatenate test set (first 60) + shuffled remaining
# This was done because originally we thought we could only get 60 rows of human-eval data; so now everything is reshuffled to keep consistent.
test_w_ATS_reshuffled = pd.concat([test_df_final_melted_60, remaining_df], ignore_index=True)


In [216]:
LLM_scores_test = pd.read_csv(f'{data_path}/LLM_scores_test.csv', index_col= 0)

In [219]:
test_w_LLM = pd.concat([test_w_ATS_reshuffled.drop(columns = ['simp_ID', 'WordReductionRatio', 'WordCountOrig', 'SentenceReductionRatio', 'ROUGE1_Score']), LLM_scores_test], axis = 1)

In [222]:
human_df = pd.read_csv(f"{data_path}/human_grading.csv", index_col= 0)

In [228]:
test_w_human = pd.concat([test_w_LLM, human_df],axis = 1)

### Export to CSV

In [None]:
train_w_LLM.to_csv(f'{data_path}/SimpEvalDE_train.csv')
test_w_human.to_csv(f'{data_path}/SimpEvalDE_test.csv')

## Final Notes

To reproduce the total score used for grading in the DETECT metric, apply this formula to the three per-dimension scores:

In [None]:
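def compute_final_score(row, columns):
    # columns: names of the simplicity, meaning, and fluency score columns, in that order
    simp, meaning, fluency = row[columns[0]], row[columns[1]], row[columns[2]]
    lowest = min(simp, meaning, fluency)
    # If any dimension scores below 25, the lowest score is the total; otherwise use the weighted average
    return round(lowest) if lowest < 25 else round(0.4 * meaning + 0.4 * simp + 0.2 * fluency)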
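
As a usage illustration, the sketch below repeats the formula so that it runs on its own and applies it to three invented rows. The column names in `cols` are hypothetical; only their order (simplicity, meaning, fluency) matters.

```python
def compute_final_score(row, columns):
    # Same formula as above, repeated so this example is self-contained.
    simp, meaning, fluency = row[columns[0]], row[columns[1]], row[columns[2]]
    lowest = min(simp, meaning, fluency)
    if lowest < 25:
        return round(lowest)
    return round(0.4 * meaning + 0.4 * simp + 0.2 * fluency)


cols = ["simplicity", "meaning", "fluency"]  # hypothetical column names
rows = [
    {"simplicity": 80, "meaning": 60, "fluency": 90},  # weighted average applies
    {"simplicity": 10, "meaning": 90, "fluency": 90},  # one score below 25: the minimum wins
    {"simplicity": 25, "meaning": 25, "fluency": 25},  # exactly 25 is not below the threshold
]
scores = [compute_final_score(row, cols) for row in rows]
print(scores)  # [74, 10, 25]
```

With a DataFrame, the same function can be passed to `df.apply(compute_final_score, axis=1, columns=cols)`.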