# Readability Metrics
This notebook gives an overview of readability metrics and shows how to use them in Python. Readability metrics measure how easy a text is to read, and they are used in fields such as education, linguistics, and natural language processing. This notebook covers the following topics:

- What are readability metrics?
- How to calculate readability metrics in Python
- How to interpret the results of readability metrics
- How to use readability metrics in practice

## What are readability metrics?
Readability metrics are quantitative measures of how easy or difficult a text is to read and understand. They are based on linguistic and cognitive factors such as sentence length, word length, and vocabulary complexity.

There are many different readability metrics, each with its own formula and interpretation. Some of the most commonly used readability metrics include:
  
- **Simple Measure of Gobbledygook (SMOG)**: This metric estimates the years of formal education required to understand a text. Complex words are those with three or more syllables. The formula for the SMOG is:

  $1.043 \times \sqrt{\text{number of complex words} \times \frac{30}{\text{number of sentences}}} + 3.1291$
  
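- **Dale-Chall Readability Score**: This metric estimates the grade level required to understand a text, based on the share of words that do not appear on a list of familiar words. The formula for the Dale-Chall Readability Score is:

  $0.1579 \times \text{percentage of difficult words} + 0.0496 \times \text{average words per sentence}$

  If the percentage of difficult words exceeds 5%, 3.6365 is added to the score.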
  
- **Spache Readability Formula**: This metric estimates the grade level required to understand a text and is intended for material aimed at young readers. The revised Spache formula is:

  $0.121 \times \text{average sentence length} + 0.082 \times \text{percentage of unfamiliar words} + 0.659$
  
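- **Linsear Write Formula**: This metric estimates the grade level required to understand a text. On a 100-word sample, easy words (two syllables or fewer) count once and hard words (three or more syllables) count three times:

  $r = \frac{\text{number of easy words} + 3 \times \text{number of hard words}}{\text{number of sentences}}$

  The grade level is $r / 2$ if $r > 20$, and $r / 2 - 1$ otherwise.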
  
- **FORCAST Readability Formula**: This metric estimates the grade level required to understand a text. It is computed on a 150-word sample and does not depend on sentence length. The formula for the FORCAST Readability Formula is:

  $20 - \frac{\text{number of single-syllable words}}{10}$
  
- **Raygor Readability Estimate**: This metric estimates the grade level required to understand a text. It is a graph-based method rather than a closed-form formula: for a 100-word sample, the number of sentences and the number of long words (six or more letters) are plotted on the Raygor graph, and the grade level is read from the region where the point falls.
  
- **LIX Readability Formula**: This metric produces a difficulty score, rather than a grade level, from the average sentence length and the percentage of long words (more than six letters). The formula for the LIX Readability Formula is:

  $\frac{\text{number of words}}{\text{number of sentences}} + \frac{\text{number of long words} \times 100}{\text{number of words}}$
  
- **RIX Readability Formula**: This metric estimates the grade level required to understand a text. The formula for the RIX Readability Formula is:

  $\frac{\text{number of long words}}{\text{number of sentences}}$
  
- **Strain Index**: This metric estimates the grade level required to understand a text. The formula for the Strain Index is:

  $\frac{\text{number of long words} \times 100}{\text{number of sentences}}$
  
- **Readability Consensus Grade**: This metric estimates the grade level required to understand a text. The formula for the Readability Consensus Grade is:

  $\frac{\text{Flesch-Kincaid Grade Level} + \text{Gunning Fog Index} + \text{Coleman-Liau Index} + \text{Automated Readability Index} + \text{SMOG} + \text{Dale-Chall Readability Score} + \text{Spache Readability Formula} + \text{New Dale-Chall Readability Score} + \text{Linsear Write Formula} + \text{FORCAST Readability Formula} + \text{Raygor Readability Estimate} + \text{LIX Readability Formula} + \text{RIX Readability Formula} + \text{Strain Index}}{14}$
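
Of the formulas listed above, LIX and RIX are the simplest to compute because they need only word, sentence, and long-word counts. The following is a minimal sketch, not the implementation used by any particular library. It assumes that a sentence ends at `.`, `!`, or `?`, that a word is a run of letters or apostrophes, and that a long word has more than six letters. The function name `lix_and_rix` is hypothetical.

```python
import re


def lix_and_rix(text: str) -> tuple[float, float]:
    """Return (LIX, RIX) for a text using naive regex tokenization."""
    # Assumption: sentences end at '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: a word is a run of ASCII letters or apostrophes.
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")

    # Assumption: a long word has more than six letters.
    long_words = [w for w in words if len(w) > 6]

    lix = len(words) / len(sentences) + 100 * len(long_words) / len(words)
    rix = len(long_words) / len(sentences)
    return lix, rix


sample = "The cat sat on the mat. Readability formulas estimate difficulty."
lix, rix = lix_and_rix(sample)
print(f"LIX: {lix:.1f}, RIX: {rix:.1f}")  # LIX: 45.0, RIX: 2.0
```

The sample has 10 words, 2 sentences, and 4 long words, so LIX is 10/2 + 100 × 4/10 = 45 and RIX is 4/2 = 2.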

## Notebook Setup

In [1]:
# Importing the necessary Python libraries
from whetstone.metrics.text.readability_metrics import (
    calculate_flesch_kincaid_reading_ease,
    calculate_flesch_kincaid_grade_level,
    calculate_gunning_fog_index, 
    calculate_coleman_liau_index,
    calculate_automated_readability_index
)

In [2]:
# Creating some high quality and low quality text
high_quality_text = """Creating a strong password is your first line of defense in the digital age, where cyber threats constantly evolve. Hackers are increasingly adept at exploiting vulnerabilities, making it crucial to understand the importance of robust security measures. While it may be convenient to use simple and memorable combinations like birthdays or pet names, such choices are perilously insecure. They leave your personal and financial information exposed to malicious attacks.

Think of a password as a digital fortress. To build this fortress, security experts recommend a mix of uppercase and lowercase letters, numbers, and special characters. The complexity of such a password makes it significantly more difficult for hackers to crack using brute-force methods. Additionally, the length of your password plays a pivotal role; each extra character exponentially increases the time required to breach it.

It’s not just about creating a single strong password, though. Using unique passwords for each of your accounts is essential. A single breach could otherwise grant hackers access to multiple platforms, compounding the damage.

Cybercrime costs the global economy billions of dollars annually. The effort required to create and maintain strong passwords pales in comparison to the consequences of a security failure. Just as you would invest in a sturdy lock to protect your home, you should prioritize secure passwords to safeguard your digital assets. In a world where data is increasingly valuable, this small step can make a world of difference."""



low_quality_text = """So, like, passwords are super important for, like, cyber-security and stuff. You know how there are hackers everywhere these days? They’re always trying to get into people’s accounts, which is kinda scary. It reminds me of this one time my cousin got hacked. It was so bad—like, they stole his email and even tried to get into his bank account or whatever. Anyway, this is why you’ve gotta make sure you’re using good passwords, but not something obvious like your birthday or your dog’s name. (By the way, my dog’s name is Max. He’s super cute, but yeah, don’t use that as a password.)

So, like, those IT security experts and computer nerds are always saying stuff like, “Use capitals and lowercase letters and numbers and those weird symbols on your keyboard.” You know, the ones you never use unless you’re doing something nerdy like coding or whatever? Oh, and they say to make your passwords really long. Like, the longer the better, I guess. I don’t know why, but it makes it harder for hackers, which is good.

Passwords are kinda like keys to your house, but for the internet. Or maybe it’s more like a vault or something? I dunno, but it’s super important because if a hacker gets in, they can take all your stuff. And not just money—they can steal your identity or your Instagram account, which would be awful. So yeah, make good passwords. Like, seriously."""

## Calculating Each Readability Metric


### Flesch-Kincaid Reading Ease

The Flesch-Kincaid Reading Ease metric is a readability formula used to assess the complexity of English texts by assigning a score that reflects how easy or difficult they are to understand.

This metric measures the readability of a text on a scale from 0 to 100, with higher scores indicating easier readability. The formula for the Flesch-Kincaid Reading Ease score is:

$$206.835 - 1.015 \times \text{average words per sentence} - 84.6 \times \text{average syllables per word}$$
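
As a rough illustration, the formula above can be evaluated directly once the word, sentence, and syllable counts are known. The sketch below takes those counts as inputs and leaves counting itself, especially syllable counting, to a separate step. The function name `flesch_reading_ease` is hypothetical and is not part of the library imported in this notebook.

```python
def flesch_reading_ease(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Evaluate the reading-ease formula from raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    avg_words_per_sentence = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * avg_words_per_sentence - 84.6 * avg_syllables_per_word


# Example: 100 words in 5 sentences with 150 syllables in total.
score = flesch_reading_ease(100, 5, 150)
print(f"{score:.1f}")  # 59.6
```

Here the average sentence has 20 words and the average word has 1.5 syllables, giving 206.835 - 20.3 - 126.9 = 59.635.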

This score may be interpreted using the table below:

| Score  | Reading Level       | Description           |
|--------|---------------------|-----------------------|
| 90-100 | 5th grade           | Very easy to read     |
| 80-89  | 6th grade           | Easy to read          |
| 70-79  | 7th grade           | Fairly easy           |
| 60-69  | 8th-9th grade       | Plain English         |
| 50-59  | 10th-12th grade     | Fairly difficult      |
| 30-49  | College             | Difficult             |
| 0-29   | College graduate    | Very difficult        |

As a rule of thumb, writers addressing a general audience should aim for a score of 60 or higher, which most adults can read comfortably. Flesch-Kincaid Reading Ease is widely used in education to evaluate the readability of textbooks and other teaching materials, and software tools such as Microsoft Word report it as part of their readability analysis.
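
The interpretation table above can also be expressed as a small lookup. This is a sketch under one assumption: the table lists whole-number ranges, so fractional scores are assigned to the band whose lower bound they meet (for example, 89.5 falls in the 80-89 band). The function name `describe_reading_ease` is hypothetical.

```python
def describe_reading_ease(score: float) -> tuple[str, str]:
    """Map a reading-ease score to (reading level, description) using the table above."""
    # (lower bound, reading level, description), highest band first.
    bands = [
        (90, "5th grade", "Very easy to read"),
        (80, "6th grade", "Easy to read"),
        (70, "7th grade", "Fairly easy"),
        (60, "8th-9th grade", "Plain English"),
        (50, "10th-12th grade", "Fairly difficult"),
        (30, "College", "Difficult"),
    ]
    for lower_bound, level, description in bands:
        if score >= lower_bound:
            return level, description
    # Anything below 30 falls in the lowest band of the table.
    return "College graduate", "Very difficult"


print(describe_reading_ease(73.09))  # ('7th grade', 'Fairly easy')
print(describe_reading_ease(36.36))  # ('College', 'Difficult')
```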

In [3]:
# Calculating the Flesch-Kincaid Reading Ease score for the high and low quality text
print('Results for Flesch-Kincaid Reading Ease:')
high_quality_flesch_kincaid_reading_ease_score = calculate_flesch_kincaid_reading_ease(high_quality_text)
low_quality_flesch_kincaid_reading_ease_score = calculate_flesch_kincaid_reading_ease(low_quality_text)
print(f'High quality text: {high_quality_flesch_kincaid_reading_ease_score}')
print(f'Low quality text: {low_quality_flesch_kincaid_reading_ease_score}')


Results for Flesch-Kincaid Reading Ease:
High quality text: [36.36]
Low quality text: [73.09]


### Flesch-Kincaid Grade Level
This metric estimates the grade level required to understand a text. The formula for the Flesch-Kincaid Grade Level is:

  $0.39 \times \text{average words per sentence} + 11.8 \times \text{average syllables per word} - 15.59$
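
The grade-level formula uses the same two averages as the reading-ease score, only with different weights. The sketch below evaluates it from raw counts. The function name `flesch_kincaid_grade` is hypothetical and is not part of the library imported in this notebook.

```python
def flesch_kincaid_grade(total_words: int, total_sentences: int, total_syllables: int) -> float:
    """Evaluate the grade-level formula from raw counts."""
    if total_words <= 0 or total_sentences <= 0:
        raise ValueError("word and sentence counts must be positive")
    avg_words_per_sentence = total_words / total_sentences
    avg_syllables_per_word = total_syllables / total_words
    return 0.39 * avg_words_per_sentence + 11.8 * avg_syllables_per_word - 15.59


# Example: 100 words in 5 sentences with 150 syllables in total.
grade = flesch_kincaid_grade(100, 5, 150)
print(f"{grade:.2f}")  # 9.91
```

With 20 words per sentence and 1.5 syllables per word, the result is 7.8 + 17.7 - 15.59 = 9.91, roughly a 10th-grade reading level.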

This score may be interpreted using the table below:

| Grade Level | Description           |
|-------------|-----------------------|
| 0-1         | Kindergarten          |
| 2-3         | 1st-3rd grade         |
| 4-5         | 4th-5th grade         |
| 6-7         | 6th-7th grade         |
| 8-9         | 8th-9th grade         |
| 10-11       | 10th-11th grade       |
| 12-13       | 12th grade            |
| 14-15       | College               |

The Flesch-Kincaid Grade Level is often used in the field of education to assess the readability of textbooks and other educational materials. It is also used in natural language processing to evaluate the complexity of text data.

In [4]:
# Calculating the Flesch-Kincaid Grade Level score for the high and low quality text
print('Results for Flesch-Kincaid Grade Level:')
high_quality_flesch_kincaid_grade_level_score = calculate_flesch_kincaid_grade_level(high_quality_text)
low_quality_flesch_kincaid_grade_level_score = calculate_flesch_kincaid_grade_level(low_quality_text)
print(f'High quality text: {high_quality_flesch_kincaid_grade_level_score}')
print(f'Low quality text: {low_quality_flesch_kincaid_grade_level_score}')

Results for Flesch-Kincaid Grade Level:
High quality text: [12.1]
Low quality text: [6.59]


### Gunning Fog Index
The Gunning Fog Index is a readability metric used to estimate the complexity of English-language text and the level of education required to understand it on a first reading. It was developed by Robert Gunning in 1952 and is widely applied in journalism, business communication, and education to assess whether written material is appropriate for its intended audience.

The formula for the Gunning Fog Index, where complex words are those with three or more syllables, is:

$$
0.4 \times \left(\text{average words per sentence} + 100 \times \frac{\text{number of complex words}}{\text{number of words}}\right)
$$
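
To make the formula concrete, here is a minimal sketch that computes the index from raw text. It rests on several assumptions: sentences end at `.`, `!`, or `?`; syllables are estimated by counting vowel groups, which is only a heuristic; and a complex word is any word with three or more estimated syllables. The full Gunning Fog procedure also excludes proper nouns, familiar jargon, and compound words, which this sketch ignores. The names `count_syllables` and `gunning_fog_index` are hypothetical.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, discounting a trailing 'e'."""
    count = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith("e") and count > 1:
        count -= 1  # treat a final 'e' as silent (heuristic)
    return max(count, 1)


def gunning_fog_index(text: str) -> float:
    """Compute 0.4 * (words per sentence + percentage of complex words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word and one sentence")
    complex_words = [w for w in words if count_syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences) + 100 * len(complex_words) / len(words))


sample = "The cat sat on the mat. Readability formulas estimate difficulty."
print(f"{gunning_fog_index(sample):.1f}")  # 18.0
```

The sample has 10 words in 2 sentences, and the heuristic flags 4 of them as complex, so the index is 0.4 × (5 + 40) = 18.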

The following table provides an interpretation of the Gunning Fog Index:

| Fog Index Score | Description                              | Target Audience                     |
|------------------|------------------------------------------|--------------------------------------|
| **7-8**         | Easy to read                             | Suitable for middle school students |
| **9-12**        | Moderately difficult                     | Suitable for high school students   |
| **13-16**       | Difficult                                | Requires college-level reading      |
| **17+**         | Very complex                             | Suitable for post-graduate level    |

To lower a text's Gunning Fog Index, consider applying the following strategies:

- Use shorter sentences.
- Avoid complex words where simpler alternatives exist.
- Focus on clarity and brevity.

In [5]:
# Calculating the Gunning Fog Index score for the high and low quality text
print('Results for Gunning Fog Index:')
high_quality_gunning_fog_index_score = calculate_gunning_fog_index(high_quality_text)
low_quality_gunning_fog_index_score = calculate_gunning_fog_index(low_quality_text)
print(f'High quality text: {high_quality_gunning_fog_index_score}')
print(f'Low quality text: {low_quality_gunning_fog_index_score}')

Results for Gunning Fog Index:
High quality text: 15.422296918767508
Low quality text: 8.670723684210527


### Coleman-Liau Index
The Coleman-Liau Index is a readability metric used to estimate the grade level required for someone to understand a text. It assesses the complexity of a text based on its characters, words, and sentences, rather than using syllables like some other readability formulas (e.g., the Flesch-Kincaid formula).

The formula for the Coleman-Liau Index is:

$$
0.0588 \times \text{average letters per 100 words} - 0.296 \times \text{average sentences per 100 words} - 15.8
$$
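
Because the Coleman-Liau Index needs only letter, word, and sentence counts, it can be sketched without any syllable heuristics. The example below uses the standard constant term of 15.8 and assumes that sentences end at `.`, `!`, or `?` and that a word is a run of letters or apostrophes. The function name `coleman_liau_index` is hypothetical and is not the library function used in this notebook.

```python
import re


def coleman_liau_index(text: str) -> float:
    """Compute 0.0588 * L - 0.296 * S - 15.8, with L and S measured per 100 words."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    letters = sum(ch.isalpha() for ch in text)
    letters_per_100_words = letters / len(words) * 100
    sentences_per_100_words = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100_words - 0.296 * sentences_per_100_words - 15.8


sample = "The cat sat on the mat. Readability formulas estimate difficulty."
print(f"{coleman_liau_index(sample):.2f}")  # 10.03
```

The sample contains 54 letters, 10 words, and 2 sentences, so L = 540 and S = 20, giving 31.752 - 5.92 - 15.8 = 10.032.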

The table below shows what each range of the Coleman-Liau Index (CLI) generally indicates:

| **CLI Range**   | **Reading Level**                                | **Audience**                          |
|------------------|-------------------------------------------------|---------------------------------------|
| 1.0 – 5.0       | Very easy to read                               | Young children or early elementary   |
| 6.0 – 8.0       | Fairly easy to read                             | Upper elementary or middle school    |
| 9.0 – 12.0      | Standard readability (average complexity)       | High school students                 |
| 13.0 – 16.0     | More complex (college level)                    | College students                     |
| 17.0+           | Very complex (graduate level or professional)   | Advanced academic or professional    |

To lower a text's Coleman-Liau Index, consider applying the following strategies:

- Reduce sentence length: Break down long, complex sentences into shorter, simpler ones.
- Minimize word complexity: Use shorter words with fewer characters and avoid technical or uncommon terms.
- Focus on conciseness: Eliminate unnecessary modifiers or redundant phrases.
- Use clear, direct writing: Aim for straightforward phrasing without overly elaborate descriptions.
- Target grade-level readability: Adjust content based on the desired audience grade level to balance simplicity and effectiveness.

In [6]:
# Calculating the Coleman-Liau Index score for the high and low quality text
print('Results for Coleman-Liau Index:')
high_quality_coleman_liau_index_score = calculate_coleman_liau_index(high_quality_text)
low_quality_coleman_liau_index_score = calculate_coleman_liau_index(low_quality_text)
print(f'High quality text: {high_quality_coleman_liau_index_score}')
print(f'Low quality text: {low_quality_coleman_liau_index_score}')

Results for Coleman-Liau Index:
High quality text: [14.03]
Low quality text: [6.63]


### Automated Readability Index (ARI)
The Automated Readability Index (ARI) is a readability test designed to assess the complexity of a text and estimate the grade level required to understand it. The index is calculated based on two primary text features: the average number of characters per word and the average number of words per sentence. ARI is widely used in education, publishing, and content evaluation to ensure materials are appropriate for their target audience.

It provides a numerical score that corresponds to U.S. school grade levels. For example, an ARI score of 8 indicates the text is suitable for an 8th-grade reader. This makes it a useful tool for evaluating the readability of documents, books, and online content.

The formula for the ARI is:

$$
4.71 \times \text{average characters per word} + 0.5 \times \text{average words per sentence} - 21.43
$$
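
The ARI formula can be sketched directly from character, word, and sentence counts. The example below assumes that sentences end at `.`, `!`, or `?`, that a word is a run of letters, digits, or apostrophes, and that only letters and digits count as characters. The function name `automated_readability_index` is hypothetical and is not the library function used in this notebook.

```python
import re


def automated_readability_index(text: str) -> float:
    """Compute 4.71 * chars per word + 0.5 * words per sentence - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")

    # Assumption: only letters and digits count as characters.
    characters = sum(ch.isalnum() for ch in text)
    avg_characters_per_word = characters / len(words)
    avg_words_per_sentence = len(words) / len(sentences)
    return 4.71 * avg_characters_per_word + 0.5 * avg_words_per_sentence - 21.43


sample = "The cat sat on the mat. Readability formulas estimate difficulty."
print(f"{automated_readability_index(sample):.2f}")  # 6.50
```

The sample has 54 characters, 10 words, and 2 sentences, so the score is 4.71 × 5.4 + 0.5 × 5 - 21.43 = 6.504.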

The score may be interpreted using the table below:

| **ARI Score** | **Grade Level**                          | **Interpretation**                                                                                 |
|---------------|------------------------------------------|----------------------------------------------------------------------------------------------------|
| < 1           | Early elementary or beginner             | Very simple text, suitable for young children or beginners                                         |
| 1–12          | Corresponds to U.S. school grade levels  | Each score aligns with the respective grade (e.g., 5 suits a 5th-grade student, 10 a 10th-grade student) |
| > 12          | Postsecondary (college level or higher)  | Requires higher education to understand                                                            |

To lower a text's Automated Readability Index (ARI) score, consider applying the following strategies:

- Reduce sentence length: Break down long, complex sentences into shorter, simpler ones.
- Minimize word complexity: Use shorter words with fewer characters and avoid technical or uncommon terms.
- Focus on conciseness: Eliminate unnecessary modifiers or redundant phrases.
- Use clear, direct writing: Aim for straightforward phrasing without overly elaborate descriptions.
- Target grade-level readability: Adjust content based on the desired audience grade level to balance simplicity and effectiveness.

In [7]:
# Calculating the Automated Readability Index score for the high and low quality text
print('Results for Automated Readability Index:')
high_quality_automated_readability_index_score = calculate_automated_readability_index(high_quality_text)
low_quality_automated_readability_index_score = calculate_automated_readability_index(low_quality_text)
print(f'High quality text: {high_quality_automated_readability_index_score}')
print(f'Low quality text: {low_quality_automated_readability_index_score}')

Results for Automated Readability Index:
High quality text: [11.89]
Low quality text: [5.64]
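
As a rough way to summarize the four grade-level results, the scores printed above can be averaged per text. This is a sketch in the spirit of the Readability Consensus Grade, but it averages only the four grade-level metrics computed in this notebook, not the full set of fourteen. The values are copied by hand from the cell outputs, and the variable names are hypothetical.

```python
from statistics import mean

# Grade-level scores copied from the outputs printed in this notebook.
grade_scores = {
    "High quality text": {
        "Flesch-Kincaid Grade Level": 12.1,
        "Gunning Fog Index": 15.422296918767508,
        "Coleman-Liau Index": 14.03,
        "Automated Readability Index": 11.89,
    },
    "Low quality text": {
        "Flesch-Kincaid Grade Level": 6.59,
        "Gunning Fog Index": 8.670723684210527,
        "Coleman-Liau Index": 6.63,
        "Automated Readability Index": 5.64,
    },
}

# Average the four metrics for each text and round for display.
average_grades = {
    name: round(mean(list(scores.values())), 2)
    for name, scores in grade_scores.items()
}

for name, grade in average_grades.items():
    print(f"{name}: average grade level {grade}")
# High quality text: average grade level 13.36
# Low quality text: average grade level 6.88
```

All four metrics agree on the ordering: the first text reads at roughly a college-entry level and the second at roughly a 7th-grade level. A higher grade level means the text is harder to read, not that it is better or worse written.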
