In [1]:
! git clone https://github.com/babaknaderi/TextComplexityDE.git

Cloning into 'TextComplexityDE'...
remote: Enumerating objects: 21, done.
remote: Counting objects: 100% (21/21), done.
remote: Compressing objects: 100% (14/14), done.
remote: Total 21 (delta 7), reused 16 (delta 4), pack-reused 0
Unpacking objects: 100% (21/21), 392.71 KiB | 5.61 MiB/s, done.


In [2]:
import pandas as pd

# Explore Dataset

https://github.com/babaknaderi/TextComplexityDE

> Normal-Simple German Parallel Corpus: TextComplexityDE19/parallel_corpus.csv 250 sentences from above set are simplified by 75 native German speakers. Subjective ratings on complexity of simplified sentences are provided as well.

In [3]:
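# Note: latin-1 never fails to decode, but it maps byte 0x91 to a control
# character (visible as '\x91' in the example sentence further down), so the
# file may actually be Windows-1252 (cp1252).
parallel = pd.read_csv("/kaggle/working/TextComplexityDE/TextComplexityDE19/parallel_corpus.csv", encoding="latin-1")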

In [4]:
parallel.head(3)

| | Sentence_Id | Article_ID | Article | Original_Sentence | Simplification | Rating |
|---|---|---|---|---|---|---|
| 0 | 5 | 1 | Seifenblase | Wegen dieser leichten Vergänglichkeit wurde ,S... | Weil Seifenblasen nicht lange halten, wurden s... | Etwas einfacher |
| 1 | 7 | 1 | Seifenblase | In der Kunst wird spätestens seit dem Barock d... | In der Kunst wird die Seifenblase spätestens s... | Deutlich einfacher |
| 2 | 11 | 1 | Seifenblase | Eine Seifenblase entsteht, wenn sich ein dünne... | Eine Seifenblase entsteht, wenn sich eine klei... | Etwas einfacher |


## Example Simplification

In [5]:
example = parallel.iloc[0]

In [6]:
example.Original_Sentence

'Wegen dieser leichten Vergänglichkeit wurde ,Seifenblase\x91 zu einer Metapher für etwas, das zwar anziehend, aber dennoch inhalts- und gehaltlos ist.'

In [7]:
example.Simplification

'Weil Seifenblasen nicht lange halten, wurden sie zu einem  sprachlichen Ausdruck für etwas, das anziehend aber inhaltslos ist.'
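
The `'\x91'` in the original sentence suggests the bytes were decoded with the wrong codec: Latin-1 maps byte 0x91 to a control character, while Windows-1252 maps it to a left single quotation mark. Below is a minimal sketch of repairing text that is already loaded, assuming the file really is Windows-1252 (the source does not confirm this). `repair_cp1252` is a hypothetical helper name.

```python
def repair_cp1252(text: str) -> str:
    """Re-decode text that was read as Latin-1 but is assumed to be Windows-1252.

    Encoding back to Latin-1 recovers the original bytes, which works because
    text read as Latin-1 only contains code points up to U+00FF. Bytes that
    Windows-1252 leaves undefined become U+FFFD instead of raising an error.
    """
    return text.encode("latin-1").decode("cp1252", errors="replace")


# Fragment of the example sentence, including the stray control character.
sample = "wurde ,Seifenblase\x91 zu einer Metapher"
print(repair_cp1252(sample))
```

In the notebook this could be applied with `parallel["Original_Sentence"].map(repair_cp1252)`. Reading the file with `encoding="cp1252"` is an alternative, though it can raise on the few bytes that codec leaves undefined.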

## Simplification Ratings

In [8]:
parallel.Rating.value_counts()

Rating
Etwas einfacher                                      135
Deutlich einfacher                                   114
Nicht einfacher / konnte nicht vereinfacht werden      1
Name: count, dtype: int64
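
For later analysis it can help to turn the three German labels into an ordinal scale. The sketch below reuses the counts printed above. The English glosses and the 0 to 2 scores are assumptions made for illustration and are not defined by the dataset.

```python
# Counts copied from the value_counts() output above.
rating_counts = {
    "Etwas einfacher": 135,
    "Deutlich einfacher": 114,
    "Nicht einfacher / konnte nicht vereinfacht werden": 1,
}

# Hypothetical ordinal encoding with English glosses (not part of the dataset).
rating_scale = {
    "Nicht einfacher / konnte nicht vereinfacht werden": (0, "not simpler / could not be simplified"),
    "Etwas einfacher": (1, "somewhat simpler"),
    "Deutlich einfacher": (2, "clearly simpler"),
}

total = sum(rating_counts.values())
for label, count in rating_counts.items():
    score, gloss = rating_scale[label]
    # Print the ordinal score, the gloss, the raw count, and the share of all pairs.
    print(f"{score}  {gloss:<40} {count:>4}  {count / total:.1%}")
```

With the notebook's DataFrame, the same mapping could be applied as `parallel["Rating"].map(lambda r: rating_scale[r][0])`.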

# Explore Text Metrics

In [9]:
! pip install textstat -q

In [10]:
import textstat

# textstat defaults to English ("en_US"). Calling textstat.set_lang("de") first
# would switch syllable counting and the Flesch Reading Ease constants to
# German, which would change the scores reported below.
metrics = {
    "flesch_reading_ease": textstat.flesch_reading_ease,
    "flesch_kincaid_grade": textstat.flesch_kincaid_grade,
    "smog_index": textstat.smog_index,
    "coleman_liau_index": textstat.coleman_liau_index,
    "automated_readability_index": textstat.automated_readability_index,
    "dale_chall_readability_score": textstat.dale_chall_readability_score,
    "difficult_words": textstat.difficult_words,
    "linsear_write_formula": textstat.linsear_write_formula,
    "gunning_fog": textstat.gunning_fog,
    "text_standard": textstat.text_standard,
    # Formulas designed for Spanish
    "fernandez_huerta": textstat.fernandez_huerta,
    "szigriszt_pazos": textstat.szigriszt_pazos,
    "gutierrez_polini": textstat.gutierrez_polini,
    "crawford": textstat.crawford,
    # Formula designed for Italian
    "gulpease_index": textstat.gulpease_index,
    # Formula designed for Arabic
    "osman": textstat.osman,
}

In [11]:
# Score the example pair with every metric, one row per metric
df = pd.DataFrame({
    "metric": list(metrics),
    "original": [fn(example.Original_Sentence) for fn in metrics.values()],
    "simplified": [fn(example.Simplification) for fn in metrics.values()],
})

In [12]:
df

| | metric | original | simplified |
|---|---|---|---|
| 0 | flesch_reading_ease | 59.64 | 61.67 |
| 1 | flesch_kincaid_grade | 9.9 | 9.1 |
| 2 | smog_index | 0.0 | 0.0 |
| 3 | coleman_liau_index | 18.1 | 16.24 |
| 4 | automated_readability_index | 18.7 | 15.8 |
| 5 | dale_chall_readability_score | 20.42 | 20.32 |
| 6 | difficult_words | 9 | 7 |
| 7 | linsear_write_formula | 11.0 | 9.0 |
| 8 | gunning_fog | 10.0 | 9.42 |
| 9 | text_standard | 9th and 10th grade | 8th and 9th grade |
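
A single sentence pair says little about how a metric behaves, so a natural next step is to average each metric over all pairs. Below is a minimal sketch that accepts any mapping of names to callables, such as `metrics` above. `compare_metrics` is a hypothetical helper, and the toy metrics and toy strings stand in for textstat and the corpus so that the block runs on its own.

```python
from statistics import mean


def compare_metrics(metrics, originals, simplified):
    """Average each numeric metric over two parallel collections of texts."""
    # Materialize the inputs because they are iterated once per metric.
    originals, simplified = list(originals), list(simplified)
    rows = []
    for name, fn in metrics.items():
        try:
            orig_mean = mean(float(fn(t)) for t in originals)
            simp_mean = mean(float(fn(t)) for t in simplified)
        except (TypeError, ValueError):
            # Skip metrics whose results are not numeric (e.g. a grade label).
            continue
        rows.append({
            "metric": name,
            "original": orig_mean,
            "simplified": simp_mean,
            "delta": simp_mean - orig_mean,
        })
    return rows


# Toy stand-ins for illustration only.
toy_metrics = {
    "word_count": lambda t: len(t.split()),
    "label": lambda t: "9th grade",  # non-numeric, so it is skipped
}
for row in compare_metrics(toy_metrics, ["a b c d", "a b"], ["a b", "a"]):
    print(row)
```

With the notebook's objects, `pd.DataFrame(compare_metrics(metrics, parallel.Original_Sentence, parallel.Simplification))` would give one row per numeric metric. Returning plain dictionaries keeps the helper independent of pandas.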


**Key Readability Metrics in TextStat:**

1. **Flesch Reading Ease:** Scores ease of comprehension from average sentence length and average syllables per word. Most text falls between 0 and 100, and higher scores indicate easier reading.

2. **Flesch-Kincaid Grade:** Maps the same two inputs as Flesch Reading Ease (words per sentence and syllables per word) to a U.S. school grade level. It uses a separate formula and is not a conversion of the Reading Ease score.

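3. **SMOG Index:** Estimates the years of education required to understand a text, based on the number of polysyllabic words (three or more syllables). The formula was designed for samples of about 30 sentences, and textstat appears to return 0.0 for very short inputs, as seen for the single sentences in the table above.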

4. **Coleman-Liau Index:** Evaluates text readability based on characters per word and words per sentence.

5. **Automated Readability Index (ARI):** Computes a readability score based on characters, words, and sentences, providing an educational level estimate.

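6. **Dale-Chall Readability Score:** Combines average sentence length with the percentage of words that are not on a list of roughly 3,000 familiar English words. Under textstat's default English setting most German words count as unfamiliar, which likely explains the high scores in the table above.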

7. **Difficult Words:** Counts the multi-syllable words that are not on the familiar-word list. This is a raw count rather than a normalized score, so longer texts tend to score higher.

8. **Linsear Write Formula:** Estimates a U.S. grade level from a 100-word sample by weighting words of three or more syllables more heavily than shorter words, relative to the number of sentences.

9. **Gunning Fog:** Measures the complexity of a text based on sentence length and the percentage of complex words.

10. **Text Standard:** Combines the grade-level results of several of the formulas above into a single consensus estimate, returned as a string such as "9th and 10th grade".

11. **Fernandez Huerta:** A Spanish-language adaptation of Flesch Reading Ease, based on syllable count and sentence length.

12. **Szigriszt Pazos:** A Spanish-language perspicuity index, also derived from Flesch Reading Ease, based on syllables per word and words per sentence.

13. **Gutierrez Polini:** A formula developed for Spanish text, based on letters per word and words per sentence.

14. **Crawford:** A Spanish-language formula that estimates the years of schooling needed to understand a text, based on sentences and syllables per 100 words.

15. **Gulpease Index:** An Italian-language index based on letters per word and sentences per word. Unlike most formulas above, it does not count syllables or complex words.

16. **Osman:** A readability score designed for Arabic text, adapted from the Flesch and Fog formulas.
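
Several of the formulas listed above are short enough to write out, which makes the differences between them concrete. The sketch below computes the published English Flesch Reading Ease and Flesch-Kincaid Grade formulas, the Amstad variant of Flesch Reading Ease commonly used for German, and the Italian Gulpease index, all from raw counts. The function names are hypothetical, textstat's own results depend on its syllable counter and rounding, and the example counts are assumptions chosen for illustration.

```python
def flesch_reading_ease_from_counts(words, sentences, syllables):
    """English Flesch Reading Ease: higher means easier."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade_from_counts(words, sentences, syllables):
    """Flesch-Kincaid Grade: same inputs, separate coefficients, U.S. grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def amstad_reading_ease_from_counts(words, sentences, syllables):
    """Amstad's German adaptation of Flesch Reading Ease."""
    return 180.0 - (words / sentences) - 58.5 * (syllables / words)


def gulpease_from_counts(letters, words, sentences):
    """Italian Gulpease index: uses letters instead of syllables."""
    return 89.0 + (300.0 * sentences - 10.0 * letters) / words


# Assumed counts for one sentence of 20 words; the syllable and letter
# counts are illustrative guesses, not values reported by textstat.
words, sentences, syllables, letters = 20, 1, 30, 120

print("Flesch Reading Ease (en):", flesch_reading_ease_from_counts(words, sentences, syllables))
print("Flesch-Kincaid Grade:    ", flesch_kincaid_grade_from_counts(words, sentences, syllables))
print("Amstad Reading Ease (de):", amstad_reading_ease_from_counts(words, sentences, syllables))
print("Gulpease (it):           ", gulpease_from_counts(letters, words, sentences))
```

With these counts the two English formulas come out near 59.6 and 9.9, which matches the `original` column in the table above if textstat counted about 30 syllables for that 20-word sentence. The Amstad variant gives a noticeably higher score for the same counts, which shows how much the choice of language setting matters.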