filiperubini/text-difficulty-analyzer

text-difficulty-analyzer

Corpus-based text difficulty analyzer for language learners, based on word dispersion; also a text corpus management framework powered by PyMongo. It scores text difficulty on a Text Difficulty Scale (TDS), with values ranging from 0 (least difficult) to 1 (most difficult).

First proposed in a master's thesis advised by Prof. Heliana Ribeiro de Mello (Poslin, UFMG).

Demo (Difficulty Highlighter)

Difficulty highlighting of part of the English and Portuguese summaries of the Wikipedia article on Johann Sebastian Bach, with dispersion values shown in superscript. Download the English version in .pdf, .html, and .tex.

[Image: high_tex]
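
The highlighting idea shown in the demo can be sketched in a few lines. This is only an illustration, not the notebook's actual code: the function name `highlight_html`, the red background shading, and the sample dispersion values are all assumptions made up for the example.

```python
from html import escape


def highlight_html(tokens, dp_values, show_dp=True):
    """Render tokens as HTML, shading each one by its dispersion value (0..1).

    `dp_values` maps lowercase tokens to a dispersion score. The colour
    scheme (red, more opaque = higher value) is an assumption of this sketch.
    """
    parts = []
    for token in tokens:
        dp = dp_values.get(token.lower())
        if dp is None:
            # Token not found in the corpus: leave it unstyled.
            parts.append(escape(token))
            continue
        style = f"background-color: rgba(255, 0, 0, {dp:.2f})"
        sup = f"<sup>{dp:.2f}</sup>" if show_dp else ""
        parts.append(f'<span style="{style}">{escape(token)}</span>{sup}')
    return " ".join(parts)


# Made-up values, for illustration only.
dp_values = {"bach": 0.91, "was": 0.05, "a": 0.02, "composer": 0.64}
html_out = highlight_html(["Bach", "was", "a", "composer"], dp_values)
print(html_out)
```

Passing `show_dp=False` drops the superscripts, which mirrors the "with or without the DP values" option of the Difficulty Highlighter.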

Prerequisites / Tutorial

  1. An Anaconda distribution with Python 3.6
  2. MongoDB Community Edition
  3. Library requirements (languages that use a Latin alphabet -- master branch): tqdm, wikipedia, PyMongo
  4. Additional library requirements (languages that use a non-Latin alphabet -- ICUsupport branch): download and install ICU (far easier on a Linux-based OS), then polyglot, PyICU, pycld2, morfessor
  5. Text (.txt) files in UTF-8 encoding. If you don't have these, use notebook 00. Automatic Wikipedia extractor.ipynb.
  6. Start MongoDB, then run notebook 01 and let it finish (this may take a few minutes or even hours, depending on the size of your data).
  7. Calculate TDS (Text Difficulty Scale) values, ranging from 0 (very easy) to 1 (very difficult), using notebook 02. Text Difficulty Analyzer + Difficulty Highlighter.ipynb. Optionally, try the Difficulty Highlighter.
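
Before running notebook 01, it can help to confirm that every input file really is UTF-8 (step 5). The helper below is a minimal sketch and not part of this repository; the name `find_non_utf8` is hypothetical.

```python
from pathlib import Path


def find_non_utf8(corpus_dir):
    """Return the names of .txt files in `corpus_dir` that fail to decode as UTF-8."""
    bad = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad
```

For example, `find_non_utf8("./output_dir")` lists any files that would need re-encoding first; an empty list means all .txt files in that folder decoded cleanly.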

Contents

Last Update    Description
09/10/2018     00. Automatic Wikipedia extractor.ipynb -- Extracts random Wikipedia articles in a language of your choosing and saves them as text files in the ./output_dir folder. Useful for corpus building if you don't have text files lying around.
09/10/2018     01. Corpus builder.ipynb -- Extracts text data as tokens and inserts them into a MongoDB database through PyMongo. Works well with languages that use the Latin alphabet; non-Latin alphabets are supported, but less reliably. Adds Gries' (2008) deviation of proportions (DP), a measure of dispersion that takes into account the size of corpus parts.
09/10/2018     02. Text Difficulty Analyzer + Difficulty Highlighter.ipynb -- The Difficulty Highlighter can output the analyzer's results in .html and .tex with or without the DP values in superscript.
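
For reference, Gries' DP for a single word compares, for each corpus part, the share of the word's occurrences found in that part with the share of the corpus that part represents; the absolute differences are summed and halved. The function below is a minimal sketch of that formula, not the code from notebook 01, and its name and argument layout are assumptions. How DP values are combined into a TDS score is not covered here.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """Gries' (2008) DP for one word.

    word_counts[i] -- occurrences of the word in corpus part i
    part_sizes[i]  -- total number of tokens in corpus part i
    Returns a value near 0 for an evenly dispersed word and
    approaching 1 for a word concentrated in a small part of the corpus.
    """
    if len(word_counts) != len(part_sizes):
        raise ValueError("word_counts and part_sizes must have the same length")
    total_freq = sum(word_counts)
    total_size = sum(part_sizes)
    if total_freq == 0 or total_size == 0:
        raise ValueError("the word must occur at least once in a non-empty corpus")
    return 0.5 * sum(
        abs(count / total_freq - size / total_size)
        for count, size in zip(word_counts, part_sizes)
    )


# Three equally sized parts: even spread versus all occurrences in one part.
even = deviation_of_proportions([5, 5, 5], [100, 100, 100])
clumped = deviation_of_proportions([15, 0, 0], [100, 100, 100])
print(even, clumped)
```

Because the expected proportions come from `part_sizes`, a word is not penalised for appearing more often in a part that is simply larger.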

License

This project has an MIT license.
