text-difficulty-analyzer

Corpus-based text difficulty analyzer for language learners, built on word dispersion; also a text corpus management framework backed by MongoDB through PyMongo. It calculates text difficulty on a Text Difficulty Scale (TDS), with values ranging from 0 (least difficult) to 1 (most difficult).

First proposed in a master's thesis advised by Prof. Heliana Ribeiro de Mello (Poslin, UFMG).

Demo (Difficulty Highlighter)

The demo highlights part of the English and Portuguese summaries of the Wikipedia article on Johann Sebastian Bach, with dispersion values shown in superscript. The English version is available for download in .pdf, .html, and .tex.
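To give a feel for this kind of output, here is a minimal sketch of rendering tokens with their dispersion values in superscript as HTML. It is not the notebook's implementation: the function name `highlight_html`, the sample values, and the colour mapping (a stronger red for a higher DP value, on the assumption that a higher value marks a harder word) are all hypothetical.

```python
from html import escape


def highlight_html(tokens_with_dp):
    """Render (token, dp) pairs as HTML with the DP value in superscript.

    Hypothetical helper: the colour scheme below is an assumption,
    not the scheme used by the Difficulty Highlighter notebook.
    """
    parts = []
    for token, dp in tokens_with_dp:
        # Clamp to the 0..1 range so the value can be reused as an opacity.
        dp = min(max(dp, 0.0), 1.0)
        style = f"background-color: rgba(255, 0, 0, {dp:.2f})"
        parts.append(
            f'<span style="{style}">{escape(token)}<sup>{dp:.2f}</sup></span>'
        )
    return " ".join(parts)


# Made-up dispersion values, for illustration only.
sample = [("Bach", 0.91), ("was", 0.05), ("a", 0.02), ("composer", 0.64)]
html_out = highlight_html(sample)
print(html_out)
```

Tokens are passed through `html.escape` so that text containing `<` or `&` does not break the generated page.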

Prerequisites / Tutorial

An Anaconda Python 3.6 distribution
MongoDB Community Edition
Library requirements (languages that use a Latin alphabet -- master branch): tqdm, wikipedia, PyMongo
Additional library requirements (languages that use a non-Latin alphabet -- ICUsupport branch): download and install ICU (far easier on a Linux-based OS), polyglot, PyICU, pycld2, morfessor
Text (.txt) files in UTF-8 encoding. If you don't have these, use notebook 00. Automatic Wikipedia extractor.ipynb to build a corpus.
Start the MongoDB server, then run notebook 01. Corpus builder.ipynb to completion (this may take a few minutes or even hours, depending on the size of your data).
Calculate TDS (Text Difficulty Scale) values, ranging from 0 (very easy) to 1 (very difficult), with notebook 02. Text Difficulty Analyzer + Difficulty Highlighter.ipynb. Optionally, try the Difficulty Highlighter.
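Because the corpus must be made of UTF-8 text files, it can help to check the encoding before running the corpus builder. The following is a minimal sketch using only the standard library; the function name `find_non_utf8` and the sample files are hypothetical and not part of the notebooks.

```python
import tempfile
from pathlib import Path


def find_non_utf8(corpus_dir):
    """Return the names of .txt files in corpus_dir that fail to decode as UTF-8."""
    bad = []
    for path in sorted(Path(corpus_dir).glob("*.txt")):
        try:
            path.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(path.name)
    return bad


# Small self-contained demonstration with one valid and one invalid file.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "ok.txt").write_text("ação", encoding="utf-8")
    Path(tmp, "latin1.txt").write_bytes("ação".encode("latin-1"))
    bad_files = find_non_utf8(tmp)

print(bad_files)  # ['latin1.txt']
```

Any file reported here would need to be re-saved as UTF-8 before it is added to the corpus.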

Contents

Last Update	Description
09/10/2018	00. Automatic Wikipedia extractor.ipynb -- Extracts random Wikipedia articles in a language of your choosing and saves them as text files in the ./output_dir folder. Useful for corpus building if you don't have text files lying around.
09/10/2018	01. Corpus builder.ipynb -- Extracts text data as tokens and inserts them into a MongoDB database through PyMongo. Works well with languages that use the Latin alphabet; non-Latin alphabets are supported, but less reliably. Also computes Gries' (2008) deviation of proportions (DP), a measure of dispersion that takes into account the size of corpus parts.
09/10/2018	02. Text Difficulty Analyzer + Difficulty Highlighter.ipynb -- The Difficulty Highlighter can output the analyzer's results in .html and .tex, with or without the DP values in superscript.
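For reference, Gries' DP compares, for each corpus part, the share of a word's occurrences found in that part with the share of the corpus that the part represents, and halves the sum of the absolute differences. Values near 0 indicate an even distribution; values near 1 indicate that the word is concentrated in a few parts. The sketch below illustrates the measure in plain Python. It is not the code from the Corpus builder notebook, and the function and variable names are hypothetical.

```python
def deviation_of_proportions(word_counts, part_sizes):
    """Compute Gries' (2008) DP for one word.

    word_counts[i] is how often the word occurs in corpus part i;
    part_sizes[i] is the total number of tokens in corpus part i.
    """
    if len(word_counts) != len(part_sizes):
        raise ValueError("word_counts and part_sizes must have the same length")
    total_freq = sum(word_counts)
    total_size = sum(part_sizes)
    if total_freq == 0 or total_size == 0:
        raise ValueError("the word and the corpus must both be non-empty")
    # Half the summed absolute difference between observed and expected shares.
    return 0.5 * sum(
        abs(count / total_freq - size / total_size)
        for count, size in zip(word_counts, part_sizes)
    )


part_sizes = [100, 100, 200]  # three corpus parts of unequal size

# A word spread in proportion to the part sizes is perfectly even.
dp_even = deviation_of_proportions([5, 5, 10], part_sizes)
# A word found only in the first part is strongly clumped.
dp_clumped = deviation_of_proportions([20, 0, 0], part_sizes)

print(dp_even, dp_clumped)  # 0.0 0.75
```

Because the expected share is the relative size of each part, the measure remains meaningful when the corpus files differ in length.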

License

This project has an MIT license.

filiperubini/text-difficulty-analyzer
