
Readability statistics #47

Closed
jdkato opened this issue Jun 13, 2017 · 7 comments


jdkato commented Jun 13, 2017

I'm thinking about including a new readability extension point that will allow users to set standards for metrics like Flesch-Kincaid, Gunning-Fog, and Coleman-Liau. For example,

```yaml
extends: readability
level: warning
metric: Flesch-Kincaid
grade: 8
scope: paragraph
```

This would warn about any paragraph that exceeds an 8th-grade reading level.

The prose library already supports these metrics, so it's just a matter of deciding on the check implementation details.
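For reference, the Flesch-Kincaid grade level combines sentence, word, and syllable counts using standard published coefficients. A minimal sketch (the vowel-group syllable counter here is a rough approximation for illustration, not what the prose library actually does):

```python
import re

def fk_grade(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid grade level, standard coefficients:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def count_syllables(word: str) -> int:
    """Crude heuristic: count groups of consecutive vowels (min 1).
    Real implementations use more careful rules for silent 'e',
    diphthongs, etc."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))
```

A check like the one sketched above would then compare `fk_grade(...)` for each paragraph against the configured `grade` threshold.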


jdkato commented Jun 13, 2017

On a related note, I'm not wild about referring to the readability or capitalization checks as "extension points." These feel much less abstract than the others (in reality, they're modified existence checks).

This is also relevant to my goal of supporting externally-defined checks that don't directly use one of the extension points (see #45 (comment) for details).

jdkato added a commit that referenced this issue Jun 13, 2017
jdkato added a commit that referenced this issue Jul 9, 2017
This holds block-level content (i.e., it excludes headings, lists, and table cells), which is meant to be processed for summary statistics like readability scores.

Related to #47.
jdkato added a commit that referenced this issue Jul 10, 2017
@jdkato jdkato closed this as completed in 25c5df3 Jul 10, 2017

mjang commented Oct 8, 2021

Question: Does this "plugin" ignore content in Markdown that does not appear in a doc build? I'm thinking about links and descriptions such as alt text.

In other words, would a page full of links like `[some word](...)` bias the results? IIRC, the Flesch-Kincaid calculations would read bits like the relative path URL as a single (complicated) word.

Example: when I run https://developer.cobalt.io/getting-started/sign-in/ through:

My wild guess: Vale's Flesch-Kincaid plugin also reads link text in Markdown, such as `[some word](../path/to/something-complex)`, as single words, which would increase the score.


amyq commented Oct 8, 2021

Thanks for posting this question, @mjang. (For context: we've been chatting on Slack and spitballing ideas of why the scores differ.)

Another idea: I wonder if the web tools are also counting sidebars and menus. 🤔 Those could distort scores in one direction or another.

Some examples:


jdkato commented Oct 8, 2021

> Question: Does this "plugin" ignore content in Markdown that does not appear in a doc build? I'm thinking about links and descriptions such as alt text.

Yes -- Vale tries to be as accurate as possible when calculating these metrics. It uses its summary scope, which strictly follows the formula: (1) it doesn't include non-prose content (links, HTML tags, source code, front matter, etc.) and (2) it only operates on sentence-containing blocks.
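As a rough illustration of that filtering (a hypothetical regex sketch, not Vale's actual implementation), replacing Markdown links and images with their display text before counting keeps paths like `../path/to/something-complex` out of the word stats:

```python
import re

# Matches Markdown links/images: optional '!', [display text](target).
LINK_RE = re.compile(r'!?\[([^\]]*)\]\([^)]*\)')

def strip_links(text: str) -> str:
    """Replace each Markdown link with its display text only,
    so the URL/path is not counted as a (complicated) word."""
    return LINK_RE.sub(r'\1', text)

raw = "See [some word](../path/to/something-complex) for details."
print(strip_links(raw))  # See some word for details.
```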

There are a few problems with the comparison to WebFX:

  1. If you pass a link to a web page, it uses the entire page -- not just the equivalent Markdown contents.
  2. It "strips" HTML naively, which results in it using source code, tables, and other non-prose content in its calculations.

Here's an example HTML document (a snippet from gitlab_flow):

```html
<p>Organizations coming to Git from other version control systems frequently find it hard to develop a productive workflow.
This article describes GitLab flow, which integrates the Git workflow with an issue tracking system.
It offers a transparent and effective way to work with Git:</p>
<pre><code class="language-mermaid">graph LR
    subgraph Git workflow
    A[Working copy] --&gt; |git add| B[Index]
    B --&gt; |git commit| C[Local repository]
    C --&gt; |git push| D[Remote repository]
    end
</code></pre>
```
  • WebFX reports 10 sentences, 68 words, and a Flesch-Kincaid Grade Level of 7.2, which is wildly inaccurate.

  • Vale, on the other hand, internally calculates 3 sentences, 44 words, and a score of 10.78.

Let's break this down:

Sentence 1 [18 words]: Organizations coming to Git from other version control systems frequently find it hard to develop a productive workflow.

Sentence 2 [15 words]: This article describes GitLab flow, which integrates the Git workflow with an issue tracking system.

Sentence 3 [11 words]: It offers a transparent and effective way to work with Git:

Total: 3 sentences, 44 words.

If we pass just the "correct" text to WebFX, its calculations change to 3 sentences, 44 words, and a score of 10.2. The remaining difference likely comes from how "complex words" and syllables are counted, but it's much closer.
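Since the grade is linear in syllables once the sentence and word counts are fixed, we can back out the syllable count each score implies (a sketch using the standard Flesch-Kincaid coefficients; the implied counts are inferred here, not reported by either tool):

```python
def implied_syllables(grade: float, sentences: int, words: int) -> float:
    """Invert g = 0.39*(W/S) + 11.8*(Y/W) - 15.59 for syllables Y,
    holding sentences S and words W fixed."""
    return (grade + 15.59 - 0.39 * words / sentences) * words / 11.8

print(implied_syllables(10.78, 3, 44))  # ≈ 77 syllables (Vale's score)
print(implied_syllables(10.20, 3, 44))  # ≈ 75 syllables (WebFX's score)
```

So the two tools effectively disagree by only a couple of syllables over 44 words, which is consistent with differing syllable heuristics.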

@jdkato jdkato reopened this Oct 8, 2021

jdkato commented Oct 8, 2021

I'm reopening this issue because I think it would be useful to add a "View: Readability" option to https://vale-studio.errata.ai/.

@mjang-cobalt

To extend the discussion from the Write the Docs Slack:

I need to be able to do an "apples to apples" comparison of Flesch-Kincaid scores, and it's difficult at best to apply the Vale plugin to HTML content. (Sure, I could pull the source code from external HTML into a repo, but that requires understanding Git, repos, and Vale.)

So I need to know: do you have, or know of, a web tool whose results are consistent with your Flesch-Kincaid plugin?

@5u623l20

I think you have forgotten to add this extension to the documentation.
