# Collection of some Features for HHUplexity

Please first copy the shared task data into the "data/" directory.
All features generated in this file are saved in "data/feats/[train|validation|test]\_1.csv". If you also use the other files to generate features, their output is saved in separate files. Consider merging all features into one file for training.
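
Merging the feature files can be sketched with pandas as follows. This is only a minimal illustration on toy frames: the identifier column `ID`, the helper `merge_feature_frames`, and the second feature set are assumptions and not part of this notebook.

```python
import pandas as pd

# Toy stand-ins for two feature files, e.g. data/feats/train_1.csv and a
# hypothetical second file written by another notebook.
feats_1 = pd.DataFrame({
    "ID": [1, 2, 3],  # assumed identifier column
    "Sentence": ["a", "b", "c"],
    "MOS": [1.5, 3.0, 4.5],
    "F_compound_ratio": [0.0, 0.1, 0.3],
})
feats_2 = pd.DataFrame({
    "ID": [1, 2, 3],
    "Sentence": ["a", "b", "c"],
    "MOS": [1.5, 3.0, 4.5],
    "ppl_gerpt2": [20.0, 55.0, 140.0],
})


def merge_feature_frames(left, right, key="ID"):
    """Join two feature frames on `key`, keeping shared columns only once."""
    new_cols = [c for c in right.columns if c not in left.columns]
    return left.merge(right[[key] + new_cols], on=key, how="inner", validate="one_to_one")


merged = merge_feature_frames(feats_1, feats_2)
print(merged.columns.tolist())
```

If the files carry no stable identifier, joining on the row index is an alternative, provided that all files were generated from the same input in the same order.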

This file generates the following features:
* [Features based on Compound Nouns ](#Features-based-on-Compound-Nouns)
* [Features based on the Morphological Analyzer of Spacy](#Features-based-on-the-Morphological-Analyzer-of-Spacy)
* [Feature based on Dependency Tree Distance](#Feature-based-on-Dependency-Tree-Distance)
* [Features based on Verb-Noun-Ratio](#Features-based-on-Verb-Noun-Ratio)
* [Features based on Negations](#Features-based-on-Negations)
* [Features based on POS Tags](#Features-based-on-POS-Tags)
* [Features based on Imageability and concreteness](#Features-based-on-Imageability-and-concreteness)
* [Feature based on TSeval package](#Feature-based-on-TSeval-package)
* [Features based on Perplexity Score of Language Models](#Features-based-on-Perplexity-Score-of-Language-Models)
* [Text Leveling as Feature for Complexity Prediction](#Text-Leveling-as-Feature-for-Complexity-Prediction)

In [None]:
import pandas as pd
import spacy

## Read Data
- make sure that you have copied the shared task data into the right directory

In [None]:
train_data = pd.read_csv("data/public_data_text_complexity22/training_set.csv")

In [None]:
data_dev = pd.read_csv("data/public_data_text_complexity22/validation_set.csv")

In [None]:
test_data = pd.read_csv("data/public_data_text_complexity22/part2_public.csv")

## Features based on Compound Nouns 
**Idea:**
- The longer a word, the more difficult it is to understand.
- The more lexemes are compounded in a word, the more difficult the word is to understand.
- The more words in a sentence are composed of several lexemes, the more complex the sentence.
- The more nouns in a sentence are composed of several lexemes, the more complex the sentence.

**Result:**
- weak correlation between MOS and ratio of compound words to all words (r=0.224703)
- low correlation between MOS and ratio of compound nouns to all nouns (r=0.165764)
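
The r values reported in this notebook come from `DataFrame.corr()`, which computes the Pearson correlation coefficient by default. As a reminder of what that number measures, here is a minimal pure-Python sketch; the helper name `pearson_r` and the toy values are made up for illustration and are not taken from the shared task data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # sum of products of deviations and the two sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Toy values (invented): complexity ratings and a feature value per sentence
mos = [1.0, 2.0, 3.0, 4.0]
compound_ratio = [0.0, 0.1, 0.1, 0.3]
print(round(pearson_r(mos, compound_ratio), 6))
```

A value near 1 or -1 indicates a strong linear relationship, a value near 0 indicates none.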

In [None]:
! pip install git+https://github.com/repodiac/german_compound_splitter

In [None]:
import german_compound_splitter
import spacy

In [None]:
nlp = spacy.load("de_core_news_lg")

In [None]:
from german_compound_splitter import comp_split

# please load an appropriate (external) dictionary, see the notes in section Installation/Setup on the dictionary
input_file = 'data/german.dic'
ahocs = comp_split.read_dictionary_from_file(input_file)

In [None]:
def ratio_compounds_per_sent(tokens: list, ahocs=ahocs, only_nouns=False):
    """
    Calculate the ratio of input tokens which are compounded. 
    tokens: list of tokens of a sentence or a text
    ahocs: german dictionary of words (https://sourceforge.net/projects/germandict/files/latest/download)
    only_nouns: specify whether only the ratio of compound nouns or the ratio of all compound words should be calculated.
    
    returns the rounded ratio of compound tokens/nouns to all tokens/nouns
    """

    num_compounds = 0
    num_nouns = 0
    for token in tokens:
        try:
            # the only_nouns flag is passed straight through to the splitter
            dissection = comp_split.dissect(token.text, ahocs, make_singular=False, mask_unknown=True, only_nouns=only_nouns)
            if len(dissection) > 1:
                num_compounds += 1
        except IndexError:
            print("word:", token, "not in dictionary.")
        if token.pos_ == "NOUN":
            num_nouns += 1
    # avoid a division by zero for empty input or sentences without nouns
    denominator = num_nouns if only_nouns else len(tokens)
    if denominator == 0:
        return 0
    return round(num_compounds/denominator, 6)

In [None]:
def feature_ratio_compounds(data):
    """
    calculate the ratio of compound words, i.e., the ratio of all compound tokens to all tokens 
    and all compound nouns to all nouns.
    
    data: dataframe with the text in the sentence column
    
    returns the dataset with new columns regarding the compound words features
    """
    for i, row in data.iterrows():
        tokens = list(nlp(row["Sentence"]))
        data.loc[i,"F_compound_ratio"] = ratio_compounds_per_sent(tokens)
        data.loc[i,"F_compound_nouns_ratio"] = ratio_compounds_per_sent(tokens, only_nouns=True) 
    return data

### Examples

In [None]:
compound = 'Donaudampfschifffahrtskapitänsmützenabzeichen'

dissection = comp_split.dissect(compound, ahocs, make_singular=True)
print('SPLIT WORDS (plain):', dissection)
print('SPLIT WORDS (post-merge):', comp_split.merge_fractions(dissection))

### Real Data
- add new features of compound words to the dataset

In [None]:
train_data = feature_ratio_compounds(train_data)
data_dev = feature_ratio_compounds(data_dev)
test_data = feature_ratio_compounds(test_data)

In [None]:
data_correlation_table = train_data[["MOS", "F_compound_ratio", "F_compound_nouns_ratio"]].corr()
data_correlation_table

## Features based on the Morphological Analyzer of Spacy
**Idea:**
- subjunctive sentences are more difficult to understand than indicative sentences
- some German cases are more difficult to understand than others, e.g., the genitive is often described as difficult to understand and is therefore often replaced by the dative


**Result:**
- low correlation between MOS and boolean feature of subjunctive/indicative (r=0.137552)
- no correlation between MOS and ratio of nouns in nominative to all nouns (r=-0.031380)
- weak correlation between MOS and ratio of nouns in genitive to all nouns (r=0.251906)
- no correlation between MOS and ratio of nouns in dative to all nouns (r=0.086093)
- low correlation between MOS and ratio of nouns in accusative to all nouns (r=-0.142144)


In [None]:
def is_subjunctive(tokens):
    """
    tokens: list of spacy.Token objects

    returns 1 if a part of the sentence is subjunctive, 0 if not.
    """
    for token in tokens:
        if "Mood=Sub" in token.morph:
            return 1
    return 0

In [None]:
def ratio_case(tokens):
    """
    tokens: list of spacy.Token objects

    returns ratio of nouns in all four cases.
    """
    num_nouns = 0
    num_nom = 0
    num_gen = 0
    num_dat = 0
    num_acc = 0
    for token in tokens:
        if token.pos_ == "NOUN":
            num_nouns += 1
            if "Case=Nom" in token.morph:
                num_nom += 1
            elif "Case=Gen" in token.morph:
                num_gen += 1
            elif "Case=Dat" in token.morph:
                num_dat += 1
            elif "Case=Acc" in token.morph:
                num_acc += 1
    if num_nouns == 0:
        return 0, 0, 0, 0
    return round(num_nom/num_nouns,6), round(num_gen/num_nouns,6), round(num_dat/num_nouns,6), round(num_acc/num_nouns,6)

In [None]:
def feature_mophology(data):
    """
    calculate the ratio of nouns in nominative, genitive, dative and accusative to all nouns. 
    Also check if a sentence is written subjunctively or indicatively.
    
    data: dataframe with the text in the sentence column
    
    returns the dataset with new columns regarding the morphology features
    """
        
    for i, row in data.iterrows():
        tokens = list(nlp(row["Sentence"]))
        data.loc[i,"F_subjunctive"] = is_subjunctive(tokens)
        ratio_nom, ratio_gen, ratio_dat, ratio_acc = ratio_case(tokens)
        data.loc[i,"F_ratio_nom"] = ratio_nom
        data.loc[i,"F_ratio_gen"] = ratio_gen
        data.loc[i,"F_ratio_dat"] = ratio_dat
        data.loc[i,"F_ratio_acc"] = ratio_acc
    return data

### Examples

In [None]:
tokens = nlp("Der Satz könnte im Dativ geschrieben sein.")

In [None]:
ratio_case(tokens)

In [None]:
is_subjunctive(tokens)

### Real Data

In [None]:
train_data = feature_mophology(train_data)
data_dev = feature_mophology(data_dev)
test_data = feature_mophology(test_data)

In [None]:
data_correlation_table = train_data[["MOS", "F_subjunctive", "F_ratio_nom", "F_ratio_gen", "F_ratio_dat", "F_ratio_acc"]].corr()
data_correlation_table

## Feature based on Dependency Tree Distance
**Idea:**
- Words that are connected across a long distance in a sentence are more difficult to understand because the reader needs to keep more elements of the sentence in memory to combine the meaning.

**Results:**
- moderate correlation between MOS and average distance between words (r=0.594131)
- weak correlation between MOS and maximum distance between words (r=0.217165)
- no correlation between MOS and average distance between verbs and verb particles (r=0.064130)
- no correlation between MOS and maximum distance between verbs and verb particles (r=0.064669)
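
To make the distance measure concrete without running a parser, the following sketch computes it from a plain list of head indices. It mirrors the word-level feature implemented in this section (average of the absolute index differences, and maximum difference divided by the sentence length). The function name `dependency_distances` and the head annotation are hand-written assumptions, not spaCy output.

```python
def dependency_distances(heads):
    """heads[i] is the index of the head of token i (the root points to itself)."""
    distances = [abs(i - head) for i, head in enumerate(heads)]
    avg_distance = sum(distances) / len(distances)
    # the maximum is normalized by the number of tokens, the average is not
    max_distance_normalized = max(distances) / len(heads)
    return round(avg_distance, 6), round(max_distance_normalized, 6)


# "Er schlägt das Buch auf ." with an assumed, hand-written analysis:
# Er -> schlägt, schlägt = root, das -> Buch, Buch -> schlägt,
# auf -> schlägt (separated verb particle), . -> schlägt
heads = [1, 1, 3, 1, 1, 1]
avg_dist, max_dist = dependency_distances(heads)
print(avg_dist, max_dist)
```

The separated particle "auf" contributes a distance of 3 here, which is the kind of long-range link the feature is meant to capture.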

In [None]:
def distance_between_words(tokens):
    """
    tokens: list of spacy.Token objects
    calculate the average and the maximum distance between nodes in the dependency tree
    
    return average and max distance value
    """
    max_distance = 0
    list_distances = list()
    for token in tokens:
        distance = abs(token.i-token.head.i)
        list_distances.append(distance)
        if distance > max_distance:
            max_distance = distance
    # return round((sum(list_distances)/len(list_distances))/len(tokens),6), round(max_distance/len(tokens),6)
    return round(sum(list_distances)/len(list_distances),6), round(max_distance/len(tokens),6)

In [None]:
def distance_between_verb_particles(tokens):
    """
    tokens: list of spacy.Token objects
    calculate the average and the maximum distance between verbs and their separated verb particles (STTS tag PTKVZ) in the dependency tree
    
    return average and max distance value
    """
    max_distance = 0
    list_distances = list()
    for token in tokens:
        if token.tag_ == "PTKVZ":
            distance = abs(token.i-token.head.i)
            list_distances.append(distance)
            if distance > max_distance:
                max_distance = distance
    if len(list_distances) > 0:
        return round((sum(list_distances)/len(list_distances))/len(tokens),6), round(max_distance/len(tokens),6)
    else:
        return 0, 0

In [None]:
def feature_depency_tree_distance(data):
    """
    calculate the average and maximum distance between nodes in the dependency tree. 
    And also calculates only the distance between verbs and verb particles.
    
    data: dataframe with the text in the sentence column
    
    returns the dataset with new columns regarding the distance features
    """
        
    for i, row in data.iterrows():
        tokens = list(nlp(row["Sentence"]))
        avg_distance, max_distance = distance_between_words(tokens)
        data.loc[i,"F_avg_distance_betweeen_words"] = avg_distance
        data.loc[i,"F_max_distance_betweeen_words"] = max_distance
        
        avg_distance_verb, max_distance_verb = distance_between_verb_particles(tokens)
        data.loc[i,"F_avg_distance_betweeen_verb_particle"] = avg_distance_verb
        data.loc[i,"F_max_distance_betweeen_verb_particles"] = max_distance_verb
    return data

### Examples

In [None]:
distance_between_verb_particles(list(nlp("Er schlägt das Buch auf .")))

In [None]:
distance_between_words(list(nlp("Er schlägt das Buch auf .")))

### Real data

In [None]:
train_data = feature_depency_tree_distance(train_data)
data_dev = feature_depency_tree_distance(data_dev)
test_data = feature_depency_tree_distance(test_data)

In [None]:
data_correlation_table = train_data[["MOS", "F_avg_distance_betweeen_words", "F_max_distance_betweeen_words", "F_avg_distance_betweeen_verb_particle", "F_max_distance_betweeen_verb_particles"]].corr()
data_correlation_table

## Features based on Verb-Noun-Ratio

**Idea:**
- the more verbs in a sentence, the easier the sentence is to understand
- the more nouns in a sentence, the harder the sentence is to understand
- the more verbs per noun in a sentence, the easier the sentence is to understand

**Results:**
- weak negative correlation between MOS and verb-noun-ratio (r=-0.219467)


In [None]:
def verb_noun_ratio(tokens):
    """
    tokens: list of spacy.Token objects
    calculates the ratio of verbs to nouns. 
    """
    n_nouns = 0
    n_verbs = 0
    for token in tokens:
        if token.pos_ == "NOUN":
            n_nouns += 1
        elif token.pos_ == "VERB":
            n_verbs += 1
    if n_nouns == 0:
        return 0
    else:
        return round(n_verbs/n_nouns,6)

In [None]:
def feature_noun_verb_ratio(data):
    """
    calculate the ratio between verbs and nouns
    
    data: dataframe with the text in the sentence column
    
    returns the dataset with new columns regarding the verb-noun-ratio
    """
    for i, row in data.iterrows():
        tokens = list(nlp(row["Sentence"]))
        data.loc[i,"F_noun_verb_ratio"] = verb_noun_ratio(tokens)
        
        # data.loc[i, "F_avg_verb_distance"] = distance_between_verb_particles(tokens)
    return data

### Examples

In [None]:
verb_noun_ratio(list(nlp("Der Satz hat nicht mehr Verben als Nomen.")))

In [None]:
verb_noun_ratio(list(nlp("Verben machen Texte leichter zu verstehen und zu lesen.")))

### Real Data

In [None]:
train_data = feature_noun_verb_ratio(train_data)
data_dev = feature_noun_verb_ratio(data_dev)
test_data = feature_noun_verb_ratio(test_data)

In [None]:
data_correlation_table = train_data[["MOS", "F_noun_verb_ratio"]].corr()
data_correlation_table

## Features based on Negations
**Idea:**
- negations turn the meaning of a sentence into its opposite.
- double negations turn the meaning around again and make the sentence much more complex.

**Results:**
- no correlation between MOS and ratio of negated words (r=0.025829)
- no correlation between MOS and ratio of negations (r=-0.044094)
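
The negation features rest on a word list plus a prefix/suffix heuristic for adjectives. The simplified sketch below works on hand-written (lemma, POS) pairs instead of spaCy tokens; the function name `count_negations` and the examples are assumptions for illustration. It also shows one property of such a heuristic: any adjective that merely starts with "un", "des" or "irr" is counted, whether or not it is actually negated.

```python
NEGATION_WORDS = {"kein", "nein", "nicht", "nie", "niemals"}
NEGATION_PREFIXES = ("un", "des", "irr")


def count_negations(tagged_lemmas):
    """tagged_lemmas: list of (lemma, coarse POS) pairs.

    Returns (number of negated adjectives, number of negation words).
    """
    negation_words = 0
    negated_adjectives = 0
    for lemma, pos in tagged_lemmas:
        lowered = lemma.lower()
        if lowered in NEGATION_WORDS:
            negation_words += 1
        elif pos == "ADJ" and (
            lowered.startswith(NEGATION_PREFIXES)
            or (lowered.endswith("los") and lowered != "los")
        ):
            negated_adjectives += 1
    return negated_adjectives, negation_words


# "Das ist nicht unklar." with hand-assigned lemmas and tags
print(count_negations([("der", "PRON"), ("sein", "AUX"), ("nicht", "PART"), ("unklar", "ADJ")]))
# "ungarisch" is not a negation, but the prefix check still matches it
print(count_negations([("ungarisch", "ADJ")]))
```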

In [None]:
def get_negations_ratio(tokens):
    """
    tokens: list of spacy.Token objects
    Calculate the ratio of negated words (including negation prefixes) to all tokens 
    and the ratio of negation words (without prefixes) to all tokens
    """
    negations = ["kein", "nein", "nicht", "nie", "niemals"]
    negation_prefix = ["un", "des", "irr"]
    negation_suffix = ["los"]
    num_real_negations = 0
    num_negated_words = 0
    for token in tokens:
        if token.lemma_ in negations:
            num_real_negations += 1
        elif token.lemma_ != "los" and token.lemma_.endswith("los") and token.pos_ == "ADJ":
            num_negated_words += 1
        elif token.lemma_.lower().startswith("un") and token.lemma_ != "und" and token.pos_ == "ADJ":
            num_negated_words += 1
        elif token.lemma_.lower().startswith("des") and token.pos_ == "ADJ":
            num_negated_words += 1
        elif token.lemma_.lower().startswith("irr") and token.pos_ == "ADJ":
            num_negated_words += 1
    return round(num_negated_words/len(tokens),6), round(num_real_negations/len(tokens),6)

In [None]:
def feature_negations(data):
    """
    calculate the ratio of negated words to all tokens of a sentence
    
    data: dataframe with the text in the sentence column
    
    returns the dataset with new columns regarding the negation
    """
    for i, row in data.iterrows():
        tokens = list(nlp(row["Sentence"]))
        ratio_negated_words, ratio_negations = get_negations_ratio(tokens)
        data.loc[i,"F_ratio_negated_words"] = ratio_negated_words
        data.loc[i,"F_ratio_negations"] = ratio_negations
        
        # data.loc[i, "F_avg_verb_distance"] = distance_between_verb_particles(tokens)
    return data

### Real Data

In [None]:
train_data = feature_negations(train_data)
data_dev = feature_negations(data_dev)
test_data = feature_negations(test_data)

In [None]:
data_correlation_table = train_data[["MOS", "F_ratio_negated_words", "F_ratio_negations"]].corr()
data_correlation_table

## Features based on Verb Tense

In [None]:
def feature_verb_tense(data):
    # todo
    return data

## Features based on POS Tags
**Idea:**
- some parts of speech might occur very infrequently and make a sentence more complex
- for more fine-grained POS tags we use the Stuttgart-Tübingen Tagset (STTS)

**Result:**
- strong negative correlation between MOS and punctuation marks (r=-0.6279)
- weak negative correlation between MOS and personal pronouns (r=-0.376017)
- weak negative correlation between MOS and finite modal verbs (r=)
- weak correlation between MOS and commas (r=0.269273)
- weak correlation between MOS and attributive adjectives (r=0.222318)
- all other features have no or only a weak correlation
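
Each fine-grained POS feature is the share of tokens that carry a given STTS tag. A compact sketch of that computation with `collections.Counter`, using a shortened tag list and hand-assigned tags rather than tagger output (the helper `tag_ratios` is hypothetical):

```python
from collections import Counter


def tag_ratios(tags, known_tags):
    """Relative frequency of each known tag; tags outside the list count as XY."""
    counts = Counter(tag if tag in known_tags else "XY" for tag in tags)
    return {tag: round(counts[tag] / len(tags), 6) for tag in known_tags}


known = ["$,", "$.", "ART", "NN", "PPER", "VVFIN", "XY"]  # shortened STTS list
tags = ["PPER", "VVFIN", "ART", "NN", "$."]  # hand-assigned tags for "Er liest das Buch."
ratios = tag_ratios(tags, known)
print(ratios)
```

Tags that never occur in a sentence get a ratio of 0, which is why many of the resulting columns are sparse.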

In [None]:
def get_ratio_pos(tokens):
    stts_dict = dict()
    stts = ["$(", "$,", "$.", "ADJA", "ADJD", "ADV", "APPO", "APPR", "APPRART", "APZR", "ART", "CARD", "FM", "ITJ", "KOKOM", "KON", "KOUI", "KOUS", "NE", "NN", "NNE", "PDAT", "PDS", "PIAT", "PIS", "PPER", "PPOSAT", "PPOSS", "PRELAT", "PRELS", "PRF", "PROAV", "PTKA", "PTKANT", "PTKNEG", "PTKVZ", "PTKZU", "PWAT", "PWAV", "PWS", "TRUNC", "VAFIN", "VAIMP", "VAINF", "VAPP", "VMFIN", "VMINF", "VMPP", "VVFIN", "VVIMP", "VVINF", "VVIZU", "VVPP", "XY"]
    for tag in stts:
#         if "$," in tag:
#             tag = "$comma"
        stts_dict[tag] = 0
    for token in tokens:
        if token.tag_ in stts_dict.keys():
#             if token.tag_ == "$,":
#                 stts_dict["$comma"] += 1    
#             else:
            stts_dict[token.tag_] += 1
        else:
            stts_dict["XY"] += 1
    return stts_dict

In [None]:
def feature_ratio_fpos(data):
    for i, row in data.iterrows():
        tokens = list(nlp(row["Sentence"]))
        new_stts_dict = get_ratio_pos(tokens)
        for tag in new_stts_dict.keys():
#             if "$," in tag:
#                 tag = "$comma"
            # print(tag, new_stts_dict[tag], round(new_stts_dict[tag]/len(tokens),6))
            data.loc[i,"F_ratio_finegrained_pos_"+tag] = round(new_stts_dict[tag]/len(tokens),6)        
        # data.loc[i, "F_avg_verb_distance"] = distance_between_verb_particles(tokens)
    return data

### Real Data

In [None]:
train_data = feature_ratio_fpos(train_data)
data_dev = feature_ratio_fpos(data_dev)
test_data = feature_ratio_fpos(test_data)

In [None]:
train_data.rename(columns={'F_ratio_finegrained_pos_$,': 'F_ratio_finegrained_pos_$comma'}, inplace=True)
data_dev.rename(columns={'F_ratio_finegrained_pos_$,': 'F_ratio_finegrained_pos_$comma'}, inplace=True)
test_data.rename(columns={'F_ratio_finegrained_pos_$,': 'F_ratio_finegrained_pos_$comma'}, inplace=True)

In [None]:
unique_cols = list(train_data.columns[train_data.nunique() <= 1])
unique_cols 

In [None]:
# use a separate name so that the function feature_ratio_fpos() is not overwritten
fpos_feature_cols = [col for col in train_data.columns if col.startswith("F_ratio_fine") and col not in unique_cols]

In [None]:
data_correlation_table = train_data[["MOS"]+fpos_feature_cols].corr()
data_correlation_table.sort_values("MOS")

## Features based on Imageability and concreteness

- **Idea**: use features as described by Richardson (https://journals.sagepub.com/doi/10.1080/14640747508400483) for text complexity assessment
- **Reference Lexicon**: https://www.clarin.si/repository/xmlui/handle/11356/1187
- **Method**: 
    - first download the lexicon
    - save it at "data/ImageabilityConcretenessDE"
    - afterwards sort the lexicon
- **Result**
    - low negative correlation between MOS and imageability (r=-0.092804)
    - low negative correlation between MOS and concreteness (r=-0.140980)
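
The per-sentence score used later in this section is an average over token scores with a fallback chain: surface form first, then lemma, then the mean of the whole lexicon. A minimal sketch of that logic with a plain dictionary; the function name `average_score` and all scores are invented for illustration.

```python
def average_score(tokens, lexicon, fallback):
    """tokens: list of (text, lemma) pairs; lexicon: dict mapping a term to its score."""
    scores = []
    for text, lemma in tokens:
        if text in lexicon:
            scores.append(lexicon[text])
        elif lemma in lexicon:
            scores.append(lexicon[lemma])
        else:
            # unknown word: fall back to the lexicon mean
            scores.append(fallback)
    return sum(scores) / len(scores)


# Invented scores, for illustration only
toy_lexicon = {"Hund": 6.0, "Idee": 2.0}
fallback = sum(toy_lexicon.values()) / len(toy_lexicon)
tokens = [("Hunde", "Hund"), ("haben", "haben"), ("Ideen", "Idee")]
avg = average_score(tokens, toy_lexicon, fallback)
print(avg)
```

A dictionary lookup of this kind is also considerably cheaper than filtering a DataFrame once per token, which may matter on the full training set.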


### Calculate Scores

In [None]:
with open("data/ImageabilityConcretenessDE/ImageabilityConcretenessDE.txt", 'r', encoding='utf-8') as f:
    raw_scores = f.readlines()
    
raw_scores[0:10]

In [None]:
import re
from statistics import mean
from typing import Generator, Iterable, List, Set, Tuple
import random


def split_and_remove_noise(entry: str) -> Tuple[str, str, str]:
    term, imageability, concreteness = entry.strip("\n").split("\t")
    return re.sub(r'\A[\W_]+|[\W_]+\Z', '', term), imageability, concreteness


def lemmatize(term: str) -> List[str]:
    doc = nlp(term)
    return [token.lemma_ for token in doc]


def extract_entries(entries: Iterable[str]) -> Generator:
    for entry in entries:
        term, imageability, concreteness = split_and_remove_noise(entry)
        # ignore empty entries
        if not (term == ''):
            yield from [(lemma, imageability, concreteness) for lemma in lemmatize(term)]


def preprocess_entries(entries: Iterable[str]) -> Set[Tuple[str]]:
    # handle duplicate entries: mean average their scores
    extracted_entries = list(extract_entries(entries))
    
    def _has_equal_lemma(e1, e2) -> bool:
        return e1[0] == e2[0]
    
    def _average_duplicate_entries(e) -> Tuple[str]:
        duplicate_entries = [entry for entry in extracted_entries if _has_equal_lemma(e, entry)]
        if len(duplicate_entries) > 1:
            return (e[0], mean([float(score) for _, score, _ in duplicate_entries]), mean([float(score) for _, _, score in duplicate_entries]))
        return (e[0], float(e[1]), float(e[2]))
    
    return set([_average_duplicate_entries(entry) for entry in extracted_entries])
    

In [None]:
scores_clean = list(preprocess_entries(raw_scores))
scores_clean.sort(key=lambda e: e[0])

In [None]:
df = pd.DataFrame(scores_clean, columns = ['Lemma', 'Imageability', 'Concreteness'])

df.head()

In [None]:
with open("data/ImageabilityConcretenessDE/ImageabilityConcretenessDE.csv", 'w', encoding='utf-8') as f:
    df.to_csv(f)

### Use scores for average

In [None]:
imageability_concreteness_de = pd.read_csv("data/ImageabilityConcretenessDE/ImageabilityConcretenessDE.csv", index_col=0)

In [None]:
def calculate_imagebility_concreteness(data, nlp, imageability_concreteness_data):
    """
    calculate average imageability and concreteness scores over all words per sentence. If a word is not in the 
    lexicon, try its lemma; if the lemma is not in the lexicon either, use the mean score of all lexicon entries.
    """
    # use the lexicon passed as argument instead of the global variable
    mean_concreteness = round(imageability_concreteness_data["Concreteness"].mean(),6)
    mean_imagebility = round(imageability_concreteness_data["Imageability"].mean(),6)

    min_concreteness = round(imageability_concreteness_data["Concreteness"].min(),6)
    min_imagebility = round(imageability_concreteness_data["Imageability"].min(),6)

    max_concreteness = round(imageability_concreteness_data["Concreteness"].max(),6)
    max_imagebility = round(imageability_concreteness_data["Imageability"].max(),6)
    
    for i, row in data.iterrows():
        text = nlp(row["Sentence"])
        imagebility = list()
        concreteness = list()
        for token in text:
            # print(token)
            if token.text in imageability_concreteness_data["Lemma"].to_list():
                imagebility.append(imageability_concreteness_data[imageability_concreteness_data["Lemma"]==token.text]["Imageability"].iloc[0])
                concreteness.append(imageability_concreteness_data[imageability_concreteness_data["Lemma"]==token.text]["Concreteness"].iloc[0])

            elif token.lemma_ in imageability_concreteness_data["Lemma"].to_list():
                imagebility.append(imageability_concreteness_data[imageability_concreteness_data["Lemma"]==token.lemma_]["Imageability"].iloc[0])
                concreteness.append(imageability_concreteness_data[imageability_concreteness_data["Lemma"]==token.lemma_]["Concreteness"].iloc[0])
            else:
                imagebility.append(mean_imagebility)
                concreteness.append(mean_concreteness)
                # imagebility.append(round(random.uniform(min_imagebility, max_imagebility),6))
                # concreteness.append(round(random.uniform(min_concreteness, max_concreteness),6))

        data.loc[i, "Imagebility"] = sum(imagebility)/len(imagebility)
        data.loc[i, "Concreteness"] = sum(concreteness)/len(concreteness)
    return data

In [None]:
imageability_concreteness_de.describe()

### Real Data

In [None]:
# the training set needs these columns as well, otherwise the correlation cell below fails
train_data = calculate_imagebility_concreteness(train_data, nlp, imageability_concreteness_de)
data_dev = calculate_imagebility_concreteness(data_dev, nlp, imageability_concreteness_de)
test_data = calculate_imagebility_concreteness(test_data, nlp, imageability_concreteness_de)

In [None]:
train_data[["MOS", "Imagebility", "Concreteness"]].corr()

## Feature based on TSeval package
**Idea:** 

The Python package for text simplification evaluation (TSeval) contains many features that highlight differences between an original sentence and its simplified version. Some of these features might also help to determine whether a sentence is complex to understand.
*Warning: Most of these features were originally designed for English and might not be language-independent.*

**Result:**
- strong correlation between MOS and count of characters (r=0.749874)
- strong correlation between MOS and count of syllables (r=0.733147)
- strong correlation between MOS and count of words (r=0.684206)
- strong correlation between MOS and Flesch-Kincaid Grade Level (r=0.616677)

- moderate correlation between MOS and parse tree height (r=0.569646)
- moderate correlation between MOS and maximum position in frequency table (r=0.44683)
- moderate correlation between MOS and average length of verb phrases (r=0.424383)
- moderate correlation between MOS and average length of noun phrases (r=0.422467)
- moderate negative correlation between MOS and type token ratio (r=-0.488884)


- weak negative correlation between MOS and Flesch Reading Ease score (r=-0.392505)
- weak negative correlation between MOS and ratio of verbs (r=-0.279899)
- weak negative correlation between MOS and ratio of pronouns (r=-0.264084)
- weak correlation between MOS and lexical complexity score (r=0.352031)
- weak correlation between MOS and characters per word (r=0.341423)
- weak correlation between MOS and average length of prepositional phrase (r=0.298464)
- weak correlation between MOS and average position in frequency table (r=0.287437)
- weak correlation between MOS and syllables per word (r=0.244687)
- weak correlation between MOS and if a sentence is non-projective or not (r=0.232608)
- weak correlation between MOS and ratio of conjunctions (r=0.20639)
- weak correlation between MOS and ratio of adjectives (r=0.200404)


- all other features have no or only low correlation with MOS
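
Two of the features above are classic readability formulas. The sketch below uses the standard English-calibrated constants to show why both depend only on sentence length and syllables per word; whether the TSeval package computes them in exactly this way has not been checked here, and the counts are toy inputs.

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease (English constants): higher means easier."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid Grade Level (English constants): higher means harder."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# one sentence with 10 words and 15 syllables vs. one with 25 words and 50 syllables
short = (10, 1, 15)
long_ = (25, 1, 50)
print(flesch_reading_ease(*short), flesch_kincaid_grade(*short))
print(flesch_reading_ease(*long_), flesch_kincaid_grade(*long_))
```

Because both scores are driven by word and syllable counts, their correlation with MOS is not independent of the count-based features listed above.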



### Installation
```
git clone https://github.com/rstodden/text-simplification-evaluation.git
cd text-simplification-evaluation
pip install -e .
pip install -r requirements.txt
```

In [None]:
import tseval.feature_extraction
import pandas as pd
import spacy

In [None]:
nlp = spacy.load("de_core_news_lg")

In [None]:
all_sentence_functions = tseval.feature_extraction.get_sentence_simplification_feature_extractors()   + tseval.feature_extraction.get_sentence_feature_extractors()

In [None]:
def calculate_tseval_feature(data, all_sentence_functions):
    """
    data: dataframe with sentences
    all_sentence_functions: all methods of the tseval package which can be applied to a sentence (and not to a simplification pair)
    for each sentence in the dataset calculate each method of the tseval package
    return the data with new columns regarding the tseval features
    """
    for i, row in data.iterrows():
        # parse the sentence once and reuse the result for every feature extractor
        sentence = nlp(row["Sentence"])
        for method in all_sentence_functions:
            data.loc[i, "F_tseval_"+method.__name__] = round(method(sentence, "de"),6)
        print(i)
    return data

### Real Data

In [None]:
train_data = calculate_tseval_feature(train_data, all_sentence_functions)
data_dev = calculate_tseval_feature(data_dev, all_sentence_functions)
test_data = calculate_tseval_feature(test_data, all_sentence_functions)

In [None]:
unique_cols = list(train_data.columns[train_data.nunique() <= 1])

In [None]:
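# pass the labels via `columns=`; the positional axis argument is not accepted by newer pandas versions
train_data = train_data.drop(columns=unique_cols, errors="ignore")
data_dev = data_dev.drop(columns=unique_cols, errors="ignore")
test_data = test_data.drop(columns=unique_cols, errors="ignore")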

In [None]:
feature_cols = [col for col in train_data.columns if col.startswith("F_tseval_")]

In [None]:
data_correlation_table = train_data[["MOS"]+feature_cols].corr()

In [None]:
data_correlation_table.sort_values("MOS")

## Features based on Perplexity Score of Language Models
- description of perplexity in huggingface: https://huggingface.co/docs/transformers/perplexity
- Implementation and example of perplexity in huggingface: https://github.com/huggingface/datasets/blob/master/metrics/perplexity/perplexity.py
- Idea: 
    - The more frequent the words and the word order of a sentence, the easier the sentence is and the more likely it is that a language model can predict the sentence.
    - The higher the perplexity score, the less likely it is that the language model would predict the input sentence, and the more complex the sentence.
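
Perplexity is the exponential of the average negative log-probability that the model assigns to the tokens of a sentence. The following sketch computes it from a list of per-token probabilities; the function name `perplexity_from_probs` and the probabilities are toy values, not model output.

```python
import math


def perplexity_from_probs(token_probs):
    """Perplexity = exp of the mean negative log-probability of the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


likely = [0.5, 0.5, 0.5]      # every token is fairly predictable
unlikely = [0.5, 0.01, 0.5]   # one token is very surprising
print(perplexity_from_probs(likely))
print(perplexity_from_probs(unlikely))
```

A single unexpected token is enough to raise the score noticeably, which fits the idea that rare words and unusual word order make a sentence harder.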


In [None]:
# !pip install datasets
!pip install transformers

# install latest version of datasets 
!pip install git+https://github.com/huggingface/datasets
    
!pip install evaluate

In [None]:
import torch
import evaluate

In [None]:
# check installed version of dataset package
!pip freeze | grep datasets

In [None]:
perplexity = evaluate.load("perplexity", module_type="metric")

In [None]:
def ppl_of_sent(input_texts, model_id, perplexity=perplexity):
    """
    Calculate the perplexity value of the input texts on the specified model.
    input_texts: list of sentences
    model_id: name of a model on the Hugging Face Hub
    perplexity: metric object used to calculate the perplexity score (defaults to the one loaded above)
    
    returns the result dictionary, including the key "mean_perplexity"
    """
    # depending on the installed `evaluate` version, the keyword for the texts may be `predictions` instead of `input_texts`
    results = perplexity.compute(input_texts=input_texts, model_id=model_id)
    return results

In [None]:
def calculate_ppl(data, models):
    """
    Calculate the perplexity value of the input texts in a specified dataset on the specified model.
    data: dataframe with input text
    models: list of names of models in huggingface 
    
    returns the dataset with new columns regarding the perplexity features
    """
    for model in models:
        for i, row in data.iterrows():
            ppl = perplexity.compute(input_texts=[str(row["Sentence"])], model_id=model)["mean_perplexity"]
            data.loc[i, "ppl_"+model.split("/")[-1]] = ppl
            if not i%100:
                print(i, model)
    return data

### Examples

In [None]:
ppl_of_sent(["Das klingt gut."], "benjamin/gerpt2", perplexity)

In [None]:
ppl_of_sent(["Das klingen gut."], "benjamin/gerpt2", perplexity)

In [None]:
ppl_of_sent(["Einige Geldautomaten können nicht nur Banknoten verarbeiten, sondern auch Münzen."], "benjamin/gerpt2", perplexity)

In [None]:
ppl_of_sent(["Einige Geldautomaten können Banknoten und auch Münzen verarbeiten."], "benjamin/gerpt2", perplexity)

### Real Data
- add perplexity features to the dataset

In [None]:
train_data = calculate_ppl(train_data, ["benjamin/gerpt2", "facebook/mbart-large-cc25"])

In [None]:
data_dev = calculate_ppl(data_dev, ["benjamin/gerpt2", "facebook/mbart-large-cc25"])

In [None]:
test_data = calculate_ppl(test_data, ["benjamin/gerpt2", "facebook/mbart-large-cc25"])

In [None]:
train_data[["MOS", "ppl_gerpt2", "ppl_mbart-large-cc25"]].corr()

## Text Leveling as Feature for Complexity Prediction

- **Idea:** The older the target group of a text, the more difficult the text is to read. 
- **Data**: [Lexica-Corpus](https://github.com/fhewett/lexica-corpus) with parallel Wikipedia texts for children, youth and adults
- **Label:** children (0), youth (1), adults (2)
- **Method**: Sequence-Labeling.
    - split the lexica-corpus texts into sentences and label the sentences with their target group (label)
    - Fine-tune a Transformer model with these data (sentence - label)
    - predict the label of the complexity dataset (GermEval) with the fine-tuned model
- **Results:** 
    - moderate correlation between MOS and text leveling feature (r=0.587874)

See separate [notebook called "Feature_Text_Leveling"](Feature_Text_Leveling.ipynb) for the code.
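
The first step of the method, turning leveled texts into labeled sentences, can be sketched as follows. This is only an illustration of the idea: the actual code is in the separate notebook, and the naive regex splitter, the function name `label_sentences`, and the toy texts are assumptions.

```python
import re

LABELS = {"children": 0, "youth": 1, "adults": 2}


def label_sentences(texts_by_group):
    """Split each text into sentences and pair every sentence with its group label."""
    examples = []
    for group, texts in texts_by_group.items():
        for text in texts:
            # naive splitter: break after ., ! or ? when followed by whitespace
            for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
                if sentence:
                    examples.append((sentence, LABELS[group]))
    return examples


toy_corpus = {
    "children": ["Der Hund bellt. Er ist laut."],
    "adults": ["Der Haushund ist ein domestiziertes Raubtier."],
}
examples = label_sentences(toy_corpus)
print(examples)
```

The resulting (sentence, label) pairs are the kind of input a Transformer classifier would be fine-tuned on.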

## Save Data

In [None]:
import os

In [None]:
if not os.path.exists("data/feats"):
    os.makedirs("data/feats")

In [None]:
train_data.to_csv("data/feats/train_1.csv", index=False)
data_dev.to_csv("data/feats/validation_1.csv", index=False)
test_data.to_csv("data/feats/test_1.csv", index=False)