# Readability metrics in Python and in R

**AUTHOR: CAMELIA CIOLAC**

### Preamble

The readability metrics illustrated in this notebook were introduced decades ago (see [1]):

- 1948 - Flesch's Reading Ease Score 
- 1948 - Dale-Chall Readability formula
- 1967 - Automated Readability Index 
- 1969 - SMOG
- 1975 - Coleman-Liau Index 
- 1975 - Flesch-Kincaid Readability Score


They differ in how their output is interpreted:

- metrics that approximate the US grade level needed to comprehend the text:
    * Automated Readability Index
    * Flesch–Kincaid grade level
    * Coleman–Liau index
    * SMOG
- metrics that produce a score on a scale:
    * Flesch reading ease
    * Dale–Chall readability formula


[1] https://quanteda.io/reference/textstat_readability.html
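
As a rough illustration of how two of these metrics are built from simple counts, the sketch below implements the commonly published Flesch reading ease and Flesch-Kincaid grade formulas. The function names and the example counts are hypothetical; the libraries used later count words, sentences, and syllables in their own ways, so their outputs need not match a hand count.

```python
def reading_ease(n_words, n_sentences, n_syllables):
    """Flesch reading ease: a score on a scale; higher means easier to read."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def kincaid_grade(n_words, n_sentences, n_syllables):
    """Flesch-Kincaid: approximate US grade level; higher means harder to read."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


# Hypothetical counts for an imaginary passage (not taken from the texts below).
words, sentences, syllables = 100, 5, 150

ease = reading_ease(words, sentences, syllables)
grade = kincaid_grade(words, sentences, syllables)
print(f"Flesch reading ease:  {ease:.3f}")
print(f"Flesch-Kincaid grade: {grade:.3f}")
```

Both formulas use the same two ratios (words per sentence and syllables per word), which is why the two metrics usually move in opposite directions on the same text.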

### Readability metrics in Python with Textstat

In [None]:
!pip install -q textstat


In [None]:
!pip show textstat

Name: textstat
Version: 0.7.1
Summary: Calculate statistical features from text
Home-page: https://github.com/shivam5992/textstat
Author: Shivam Bansal, Chaitanya Aggarwal
Author-email: shivam5992@gmail.com
License: MIT
Location: /usr/local/lib/python3.7/dist-packages
Requires: pyphen
Required-by: 


In [None]:
import textstat

import pandas as pd
from collections import OrderedDict
import pprint

In [None]:
[ el for el in dir(textstat) if el[0] != "_"]

['attribute',
 'automated_readability_index',
 'avg_character_per_word',
 'avg_letter_per_word',
 'avg_sentence_length',
 'avg_sentence_per_word',
 'avg_syllables_per_word',
 'char_count',
 'coleman_liau_index',
 'crawford',
 'dale_chall_readability_score',
 'dale_chall_readability_score_v2',
 'difficult_words',
 'difficult_words_list',
 'fernandez_huerta',
 'flesch_kincaid_grade',
 'flesch_reading_ease',
 'gunning_fog',
 'gutierrez_polini',
 'is_difficult_word',
 'is_easy_word',
 'letter_count',
 'lexicon_count',
 'linsear_write_formula',
 'lix',
 'polysyllabcount',
 'pyphen',
 'reading_time',
 'remove_punctuation',
 'rix',
 'sentence_count',
 'set_lang',
 'smog_index',
 'spache_readability',
 'syllable_count',
 'szigriszt_pazos',
 'text_standard',
 'textstat']

In [None]:
help(textstat.automated_readability_index)

Help on method automated_readability_index in module textstat.textstat:

automated_readability_index(text) method of textstat.textstat.textstatistics instance



In [None]:
texts = []

# source: https://www.elastic.co/guide/en/elasticsearch/reference/current/elasticsearch-intro.html
texts.append("""Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack. Elasticsearch is where the indexing, search, and analysis magic happens. It provides scalable 
search, has near real-time search, and supports multitenancy. Elasticsearch provides near real-time search and analytics for all types of data. Whether you have structured or unstructured text, 
numerical data, or geospatial data, Elasticsearch can efficiently store and index it in a way that supports fast searches. You can go far beyond simple data retrieval and aggregate information to discover 
trends and patterns in your data. """)

#source: https://en.wikipedia.org/wiki/Readability
texts.append("""Readability is the ease with which a reader can understand a written text. In natural language, the readability of text depends on its content (the complexity of its vocabulary and syntax) 
and its presentation (such as typographic aspects that affect legibility, like font size, line height, character spacing, and line length).""")

#source: https://en.wikipedia.org/wiki/Smart_city
texts.append("""A smart city is an urban area that uses different types of electronic methods and sensors to collect data. Insights gained from that data are used to manage assets, 
resources and services efficiently; in return, that data is used to improve the operations across the city. This includes data collected from citizens, devices, buildings and assets that 
is then processed and analyzed to monitor and manage traffic and transportation systems, power plants, utilities, water supply networks, waste, crime detection, information systems, schools, 
libraries, hospitals, and other community services.""")

#source: https://kids.britannica.com/kids/article/city/352965
texts.append("""A city is a place where many people live closely together. City life has many benefits. Cities bring together a great variety of people from different backgrounds. 
They offer more jobs, more schools, and more kinds of activities than smaller towns and villages. But cities also can be dangerous and polluted. A city’s central business district, 
or downtown, usually has its tallest office buildings and biggest stores. The downtown area is often the oldest part of the city. A city usually has one or more areas of factories and warehouses 
(storage buildings) outside of downtown. Most of the city’s homes lie still farther from downtown.""")



In [None]:
def analyze_text(txt):

    dict_res = OrderedDict()
    print("\n---------------------------------------------------------------------")
    print(txt)
    print("---------------------------------------------------------------------")
    dict_res["num_sentences"] = textstat.sentence_count(txt)
    dict_res["num_difficult_word"] = textstat.difficult_words(txt)
    dict_res["num_polysyllab"] = textstat.polysyllabcount(txt)
    dict_res["num_syllables"] = textstat.syllable_count(txt)
    dict_res["num_chars"] = textstat.char_count(txt)
    dict_res["num_letters"] = textstat.letter_count(txt)
    dict_res["difficult_words_list"] = textstat.difficult_words_list(txt)

    # grade-level estimates (higher means harder to read)
    dict_res["ARI_gd"] = textstat.automated_readability_index(txt)
    dict_res["FK_gd"] = textstat.flesch_kincaid_grade(txt)
    dict_res["CL_gd"] = textstat.coleman_liau_index(txt)
    dict_res["SMOG_gd"] = textstat.smog_index(txt)
    # scores on a scale; note that the "FK" key holds the Flesch reading ease score
    dict_res["FK"] = textstat.flesch_reading_ease(txt)
    dict_res["DC"] = textstat.dale_chall_readability_score_v2(txt)

    return dict_res

In [None]:
pp = pprint.PrettyPrinter(width=100, compact=True)

list_res = []
for txt in texts:
    dict_res = analyze_text(txt)
    pp.pprint(dict_res)
    dict_res["txt_snippet"] = txt[0:90]
    list_res.append(dict_res)


---------------------------------------------------------------------
Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack. Elasticsearch is where the indexing, search, and analysis magic happens. It provides scalable 
search, has near real-time search, and supports multitenancy. Elasticsearch provides near real-time search and analytics for all types of data. Whether you have structured or unstructured text, 
numerical data, or geospatial data, Elasticsearch can efficiently store and index it in a way that supports fast searches. You can go far beyond simple data retrieval and aggregate information to discover 
trends and patterns in your data. 
---------------------------------------------------------------------
OrderedDict([('num_sentences', 6), ('num_difficult_word', 24), ('num_polysyllab', 18),
             ('num_syllables', 159), ('num_chars', 533), ('num_letters', 518),
             ('difficult_words_list',
              ['distributed'

We observe that the result contains an incorrect number of sentences for the last text (8 instead of the expected 9).
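
One plausible explanation, offered as an assumption and not verified against textstat's source, is that the sentence splitter expects a single whitespace character between the closing punctuation and the next capital letter, whereas the texts above contain a space followed by a line break at some sentence boundaries. The sketch below uses two hypothetical regular expressions on a shortened sample to show how such a difference changes the count.

```python
import re

# Shortened sample: note the space *and* the newline after "together."
sample = (
    "City life has many benefits. Cities bring people together. \n"
    "They offer more jobs. But cities can also be polluted."
)

# Hypothetical strict splitter: exactly one space or newline before a capital.
STRICT = re.compile(r"[.!?][ \n](?=[A-Z])")
# Hypothetical lenient splitter: any run of whitespace before a capital.
LENIENT = re.compile(r"[.!?]\s+(?=[A-Z])")


def count_sentences(text, boundary):
    """Count the non-empty pieces left after splitting at sentence boundaries."""
    return sum(1 for piece in boundary.split(text) if piece.strip())


strict_count = count_sentences(sample, STRICT)
lenient_count = count_sentences(sample, LENIENT)
print(f"strict: {strict_count}, lenient: {lenient_count}")
```

If this is indeed the cause, normalizing whitespace before analysis (for example with `" ".join(txt.split())`) would be a simple workaround to try.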

In [None]:
df = pd.DataFrame(list_res)
df

| | num_sentences | num_difficult_word | num_polysyllab | num_syllables | num_chars | num_letters | difficult_words_list | ARI_gd | FK_gd | CL_gd | SMOG_gd | FK | DC | txt_snippet |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 24 | 18 | 159 | 533 | 518 | [distributed, indexing, supports, multitenancy... | 13.3 | 10.5 | 14.73 | 13.0 | 47.28 | 8.48 | Elasticsearch is the distributed search and an... |
| 1 | 2 | 14 | 10 | 88 | 279 | 268 | [readability, depends, aspects, syntax, affect... | 17.1 | 14.4 | 13.47 | 0.0 | 37.13 | 9.24 | Readability is the ease with which a reader ca... |
| 2 | 3 | 33 | 14 | 152 | 510 | 492 | [insights, analyzed, resources, services, coll... | 20.6 | 15.9 | 15.73 | 15.5 | 33.28 | 11.01 | A smart city is an urban area that uses differ... |
| 3 | 8 | 18 | 12 | 156 | 537 | 522 | [closely, backgrounds, villages, central, vari... | 9.2 | 7.2 | 10.66 | 10.1 | 66.64 | 6.99 | A city is a place where many people live close... |


In [None]:
for col in ["ARI_gd", "FK_gd", "CL_gd", "SMOG_gd", "DC"]:
    #How to rank the group of records that have the same value (i.e. ties): average rank of the group
    df[col + "_rank"] = df[col].rank(method='average')

#Note: readability is considered to be better when the `flesch_reading_ease` is higher 
for col in ["FK"]:
    #How to rank the group of records that have the same value (i.e. ties): average rank of the group
    df[col + "_rank"] = df[col].rank(method='average', ascending = False)

df.drop(columns=["difficult_words_list"], inplace = False)

| | num_sentences | num_difficult_word | num_polysyllab | num_syllables | num_chars | num_letters | ARI_gd | FK_gd | CL_gd | SMOG_gd | FK | DC | txt_snippet | ARI_gd_rank | FK_gd_rank | CL_gd_rank | SMOG_gd_rank | DC_rank | FK_rank |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 24 | 18 | 159 | 533 | 518 | 13.3 | 10.5 | 14.73 | 13.0 | 47.28 | 8.48 | Elasticsearch is the distributed search and an... | 2.0 | 2.0 | 3.0 | 3.0 | 2.0 | 2.0 |
| 1 | 2 | 14 | 10 | 88 | 279 | 268 | 17.1 | 14.4 | 13.47 | 0.0 | 37.13 | 9.24 | Readability is the ease with which a reader ca... | 3.0 | 3.0 | 2.0 | 1.0 | 3.0 | 3.0 |
| 2 | 3 | 33 | 14 | 152 | 510 | 492 | 20.6 | 15.9 | 15.73 | 15.5 | 33.28 | 11.01 | A smart city is an urban area that uses differ... | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| 3 | 8 | 18 | 12 | 156 | 537 | 522 | 9.2 | 7.2 | 10.66 | 10.1 | 66.64 | 6.99 | A city is a place where many people live close... | 1.0 | 1.0 | 1.0 | 2.0 | 1.0 | 1.0 |
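
To make the tie-handling rule concrete, here is a small pure-Python sketch of average ranking. It is meant to mimic what `rank(method='average')` does on a column without missing values; `average_rank` is a hypothetical helper, not part of pandas.

```python
def average_rank(values, ascending=True):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=not ascending)
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        # Find the end of the run of equal values that starts at `pos`.
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        # Positions are 1-based, so the run covers pos + 1 through end + 1.
        shared = (pos + 1 + end + 1) / 2
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


# Flesch reading ease values from the table above; higher is easier,
# so they are ranked in descending order.
fre = [47.28, 37.13, 33.28, 66.64]
fre_ranks = average_rank(fre, ascending=False)

# A made-up column containing a tie, to show the shared rank.
tied_ranks = average_rank([10.0, 12.0, 10.0, 15.0])

print(fre_ranks)
print(tied_ranks)
```

The first result reproduces the `FK_rank` column; in the second, the two values of 10.0 occupy positions 1 and 2 and therefore both receive rank 1.5.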


-----------------------------

### Readability metrics in R with Quanteda

In [None]:
%load_ext rpy2.ipython

See:  
https://quanteda.io/articles/quickstart.html    
https://cran.r-project.org/web/packages/quanteda.textstats/index.html  
https://cran.r-project.org/web/packages/quanteda.textstats/quanteda.textstats.pdf

In [None]:
%%R
install.packages("quanteda", quiet = TRUE) 
install.packages("quanteda.textstats", quiet = TRUE) 

R[write to console]: also installing the dependencies ‘ISOcodes’, ‘fastmatch’, ‘RcppParallel’, ‘SnowballC’, ‘stopwords’, ‘RcppArmadillo’


R[write to console]: also installing the dependencies ‘nsyllable’, ‘proxyC’




In [None]:
%%R
library(quanteda)
library(quanteda.textstats)
library(nsyllable)

R[write to console]: Package version: 3.0.0
Unicode version: 10.0
ICU version: 60.2

R[write to console]: Parallel computing: 2 of 2 threads used.

R[write to console]: See https://quanteda.io for tutorials and examples.



In [None]:
%%R
ls("package:quanteda.textstats")

 [1] "as.matrix"             "data_char_wordlists"   "nscrabble"            
 [4] "show"                  "textstat_collocations" "textstat_dist"        
 [7] "textstat_entropy"      "textstat_frequency"    "textstat_keyness"     
[10] "textstat_lexdiv"       "textstat_proxy"        "textstat_readability" 
[13] "textstat_select"       "textstat_simil"        "textstat_summary"     


In [None]:
%%R

formalArgs(quanteda.textstats::textstat_readability)

[1] "x"                   "measure"             "remove_hyphens"     
[4] "min_sentence_length" "max_sentence_length" "intermediate"       
[7] "..."                


In [None]:
%%R

t1 = 'Elasticsearch is the distributed search and analytics engine at the heart of the Elastic Stack. Elasticsearch is where the indexing, search, and analysis magic happens. It provides scalable 
search, has near real-time search, and supports multitenancy. Elasticsearch provides near real-time search and analytics for all types of data. Whether you have structured or unstructured text, 
numerical data, or geospatial data, Elasticsearch can efficiently store and index it in a way that supports fast searches. You can go far beyond simple data retrieval and aggregate information to discover 
trends and patterns in your data.'


t2 = 'Readability is the ease with which a reader can understand a written text. In natural language, the readability of text depends on its content (the complexity of its vocabulary and syntax) 
and its presentation (such as typographic aspects that affect legibility, like font size, line height, character spacing, and line length).'

t3 = 'A smart city is an urban area that uses different types of electronic methods and sensors to collect data. Insights gained from that data are used to manage assets, 
resources and services efficiently; in return, that data is used to improve the operations across the city. This includes data collected from citizens, devices, buildings and assets that 
is then processed and analyzed to monitor and manage traffic and transportation systems, power plants, utilities, water supply networks, waste, crime detection, information systems, schools, 
libraries, hospitals, and other community services.'

t4 = 'A city is a place where many people live closely together. City life has many benefits. Cities bring together a great variety of people from different backgrounds. 
They offer more jobs, more schools, and more kinds of activities than smaller towns and villages. But cities also can be dangerous and polluted. A city’s central business district, 
or downtown, usually has its tallest office buildings and biggest stores. The downtown area is often the oldest part of the city. A city usually has one or more areas of factories and warehouses 
(storage buildings) outside of downtown. Most of the city’s homes lie still farther from downtown.'


In [None]:
%%R
txts <- c(text1 = t1, 
          text2 = t2,
          text3 = t3,
          text4 = t4)

In [None]:
%%R

corpus_txts <- corpus(txts) 
corpus_summary <- summary(corpus_txts)
corpus_summary

Corpus consisting of 4 documents, showing 4 documents:

  Text Types Tokens Sentences
 text1    65    106         6
 text2    45     62         2
 text3    66    106         3
 text4    76    120         9



In [None]:
%%R
str(corpus_summary)

Classes ‘summary.corpus’ and 'data.frame':	4 obs. of  4 variables:
 $ Text     : chr  "text1" "text2" "text3" "text4"
 $ Types    : int  65 45 66 76
 $ Tokens   : int  106 62 106 120
 $ Sentences: int  6 2 3 9
 - attr(*, "ndoc_all")= int 4
 - attr(*, "ndoc_show")= int 4


In [None]:
%%R

list_numsyl = c()
list_numchar = c()

for (i in 1:length(txts)){
    txt <- txts[i]
    syls <- nsyllable(txt)
    chs <- nchar(txt)

    #append to list
    list_numsyl <- c(list_numsyl, syls)
    list_numchar <- c(list_numchar, chs) 
}

#make dataframe from vectors
df_aux = data.frame(numsyl = list_numsyl, 
                    numchar = list_numchar) 
df_aux

      numsyl numchar
text1    178     625
text2     97     329
text3    177     597
text4    185     641
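
Character counts depend on what is counted. As an illustration (in Python, on a made-up sample string), the sketch below contrasts three common definitions: all characters, non-whitespace characters, and letters only. Which definition a given library function uses is something to check in its documentation, not something this sketch establishes.

```python
# Made-up sample; not one of the texts analyzed in this notebook.
sample = "Near real-time search, fast."

all_chars = len(sample)                                   # every character, spaces included
non_space = sum(1 for ch in sample if not ch.isspace())   # spaces excluded
letters = sum(1 for ch in sample if ch.isalpha())         # punctuation and hyphens excluded too

print(all_chars, non_space, letters)
```

Differences of this kind are one reason why character-based metrics such as ARI or Coleman-Liau can vary between tools.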


In [None]:
%%R

#reset index
df_aux$document <- rownames(df_aux)
rownames(df_aux) <- 1:nrow(df_aux)

df_aux

  numsyl numchar document
1    178     625    text1
2     97     329    text2
3    177     597    text3
4    185     641    text4


In [None]:
%%R

#see https://quanteda.io/reference/textstat_readability.html
list_metrics  <- c("ARI", "Coleman.Liau.grade", "Coleman.Liau.short", "Flesch.Kincaid", "Flesch", "Dale.Chall.old", "Dale.Chall", "SMOG")
res <- quanteda.textstats::textstat_readability(txts, measure = list_metrics, remove_hyphens = FALSE, intermediate = TRUE)
res

  document      ARI Coleman.Liau.grade Coleman.Liau.short Flesch.Kincaid
1    text1 12.50355           14.97726           14.97806      11.644247
2    text2 15.97824           13.82166           13.82275      15.409902
3    text3 19.40928           15.86381           15.86500      17.975000
4    text4  7.68419           10.72641           10.72686       8.401905
    Flesch Dale.Chall.old Dale.Chall     SMOG   W St   C  Sy W3Sy W2Sy W_1Sy
1 39.18637      10.857128  14.487796 13.55910  93  6 517 167   20   41    52
2 29.99956      10.164633  14.738333 15.90319  51  2 267  91   10   19    32
3 18.43667      11.550979   4.896364 17.87935  88  3 489 165   20   50    38
4 55.60476       6.170119  44.188095 10.50422 105  9 519 173   15   49    56
  W6C W7C Wlt3Sy W_wl.Dale.Chall
1  40  31     73              38
2  21  16     41              17
3  41  32     68              36
4  40  31     90              13


In [None]:
%%R

str(res)

Classes ‘readability’, ‘textstat’ and 'data.frame':	4 obs. of  20 variables:
 $ document          : chr  "text1" "text2" "text3" "text4"
 $ ARI               : num  12.5 15.98 19.41 7.68
 $ Coleman.Liau.grade: num  15 13.8 15.9 10.7
 $ Coleman.Liau.short: num  15 13.8 15.9 10.7
 $ Flesch.Kincaid    : num  11.6 15.4 18 8.4
 $ Flesch            : num  39.2 30 18.4 55.6
 $ Dale.Chall.old    : num  10.86 10.16 11.55 6.17
 $ Dale.Chall        : num  14.5 14.7 4.9 44.2
 $ SMOG              : num  13.6 15.9 17.9 10.5
 $ W                 : int  93 51 88 105
 $ St                : int  6 2 3 9
 $ C                 : num  517 267 489 519
 $ Sy                : num  167 91 165 173
 $ W3Sy              : num  20 10 20 15
 $ W2Sy              : num  41 19 50 49
 $ W_1Sy             : num  52 32 38 56
 $ W6C               : num  40 21 41 40
 $ W7C               : num  31 16 32 31
 $ Wlt3Sy            : num  73 41 68 90
 $ W_wl.Dale.Chall   : int  38 17 36 13


In [None]:
%%R

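# rank the texts on each metric, e.g. res$ARI_rank <- rank(res$ARI)
for (col_name in list_metrics){
    new_col_name <- paste(col_name , "_rank", sep="")
    # Flesch reading ease is higher for easier texts, so rank it in
    # descending order, as done for the Python results
    values <- if (col_name == "Flesch") -res[[col_name]] else res[[col_name]]
    res[[new_col_name]] <- rank(values)
}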

res

  document      ARI Coleman.Liau.grade Coleman.Liau.short Flesch.Kincaid
1    text1 12.50355           14.97726           14.97806      11.644247
2    text2 15.97824           13.82166           13.82275      15.409902
3    text3 19.40928           15.86381           15.86500      17.975000
4    text4  7.68419           10.72641           10.72686       8.401905
    Flesch Dale.Chall.old Dale.Chall     SMOG   W St   C  Sy W3Sy W2Sy W_1Sy
1 39.18637      10.857128  14.487796 13.55910  93  6 517 167   20   41    52
2 29.99956      10.164633  14.738333 15.90319  51  2 267  91   10   19    32
3 18.43667      11.550979   4.896364 17.87935  88  3 489 165   20   50    38
4 55.60476       6.170119  44.188095 10.50422 105  9 519 173   15   49    56
  W6C W7C Wlt3Sy W_wl.Dale.Chall ARI_rank Coleman.Liau.grade_rank
1  40  31     73              38        2                       3
2  21  16     41              17        3                       2
3  41  32     68              36        4           

In [None]:
%%R

# join dataframes by index; all = TRUE because we want a full outer join
df_merged <- merge(x = df_aux, y = corpus_summary, by = "row.names", all = TRUE)


# join dataframes by the specified columns
df_merged2 <- merge(x = df_merged, y = res, by.x = "document", by.y = "document", all = TRUE)

df_merged2

  document Row.names numsyl numchar  Text Types Tokens Sentences      ARI
1    text1         1    178     625 text1    65    106         6 12.50355
2    text2         2     97     329 text2    45     62         2 15.97824
3    text3         3    177     597 text3    66    106         3 19.40928
4    text4         4    185     641 text4    76    120         9  7.68419
  Coleman.Liau.grade Coleman.Liau.short Flesch.Kincaid   Flesch Dale.Chall.old
1           14.97726           14.97806      11.644247 39.18637      10.857128
2           13.82166           13.82275      15.409902 29.99956      10.164633
3           15.86381           15.86500      17.975000 18.43667      11.550979
4           10.72641           10.72686       8.401905 55.60476       6.170119
  Dale.Chall     SMOG   W St   C  Sy W3Sy W2Sy W_1Sy W6C W7C Wlt3Sy
1  14.487796 13.55910  93  6 517 167   20   41    52  40  31     73
2  14.738333 15.90319  51  2 267  91   10   19    32  21  16     41
3   4.896364 17.87935  88  3 48

In [None]:
%%R

#drop some columns
df_merged3 <- subset(df_merged2, select = -c(Text, Types))

df_merged3

  document Row.names numsyl numchar Tokens Sentences      ARI
1    text1         1    178     625    106         6 12.50355
2    text2         2     97     329     62         2 15.97824
3    text3         3    177     597    106         3 19.40928
4    text4         4    185     641    120         9  7.68419
  Coleman.Liau.grade Coleman.Liau.short Flesch.Kincaid   Flesch Dale.Chall.old
1           14.97726           14.97806      11.644247 39.18637      10.857128
2           13.82166           13.82275      15.409902 29.99956      10.164633
3           15.86381           15.86500      17.975000 18.43667      11.550979
4           10.72641           10.72686       8.401905 55.60476       6.170119
  Dale.Chall     SMOG   W St   C  Sy W3Sy W2Sy W_1Sy W6C W7C Wlt3Sy
1  14.487796 13.55910  93  6 517 167   20   41    52  40  31     73
2  14.738333 15.90319  51  2 267  91   10   19    32  21  16     41
3   4.896364 17.87935  88  3 489 165   20   50    38  41  32     68
4  44.188095 10.50422 1

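Note that the two libraries (textstat in Python and quanteda in R) do not return the same results for the same texts. For the first text, for example, textstat reports a Flesch reading ease of 47.28 while quanteda reports 39.19. The underlying counts differ as well: textstat finds 159 syllables in the first text and 8 sentences in the last one, whereas quanteda finds 167 syllables and 9 sentences.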
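
Although the absolute values differ, one can still ask whether the two libraries order the texts the same way. The sketch below copies a few columns from the outputs above and compares the easiest-to-hardest ordering; the helper name `order_by_difficulty` is hypothetical.

```python
# Scores copied from the outputs above, in the order text1..text4.
labels = ["text1", "text2", "text3", "text4"]
python_scores = {
    "ARI": [13.3, 17.1, 20.6, 9.2],
    "Flesch-Kincaid": [10.5, 14.4, 15.9, 7.2],
    "SMOG": [13.0, 0.0, 15.5, 10.1],
}
r_scores = {
    "ARI": [12.50355, 15.97824, 19.40928, 7.68419],
    "Flesch-Kincaid": [11.644247, 15.409902, 17.975000, 8.401905],
    "SMOG": [13.55910, 15.90319, 17.87935, 10.50422],
}


def order_by_difficulty(scores):
    """Return the text labels sorted from lowest to highest grade level."""
    return [labels[i] for i in sorted(range(len(scores)), key=lambda i: scores[i])]


# For each metric, check whether both libraries give the same ordering.
agreement = {
    metric: order_by_difficulty(python_scores[metric]) == order_by_difficulty(r_scores[metric])
    for metric in python_scores
}

for metric, same in agreement.items():
    print(f"{metric}: same ordering = {same}")
```

For these four texts, ARI and Flesch-Kincaid produce the same ordering in both libraries, while SMOG does not, because the Python result for the second text is 0.0.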

---------------------

### Appendix

In [None]:
!jupyter-kernelspec list

Available kernels:
  ir         /usr/local/share/jupyter/kernels/ir
  python2    /usr/local/share/jupyter/kernels/python2
  python3    /usr/local/share/jupyter/kernels/python3
