ClearText

Leveraging natural language processing and deep learning technology to help English language learners on the road to fluency.

Quick Start

For installation and usage instructions, refer to the ~~official ClearText extension page~~.

Update 2021.02: I've unpublished the ClearText extension from the Chrome Web Store. Below is a screenshot for posterity.

What problem does ClearText address?

The English as a Second Language Market

According to TESOL (Teachers of English to Speakers of Other Languages), there are over 1.5 billion English language learners worldwide. A huge amount of human labour is involved in educating these learners. However, many learners may not be able to afford lesson costs and must resort to other methods.

There is a large market for assisted language learning applications. For instance, Forbes reports Babbel's revenue at $115 million and Duolingo's 2017 valuation at $700 million.

However, apps like these are generally limited to basic language skills that do not transfer to real-world use. To retain users who would otherwise outgrow them, app developers are forced to design increasingly complex and challenging language games, which requires extensive work by multilingual language and education experts.

Assisted Reading for Language Learners

Many language learners make use of subtitles in film and other media to assist them in the learning process. An English language learner, for instance, might turn on English subtitles while watching a movie in English. The additional visual input helps learners with their oral comprehension skills.

Unfortunately, a similar solution for text media is missing. A learner who wants to read reports from a particular English language news source regularly, in order to improve their reading comprehension, may be frustrated by the difficulty of the text and by the crudeness of existing forms of assistance, such as dictionaries, which do not take context into account.

ClearText solves this problem through the use of text simplification technology.

Developing Simplification Models with ClearText

ClearText uses a sequence-to-sequence model trained on the WikiSmall/WikiLarge datasets. For more on these datasets, take a look at the notebook. A high-level overview of the development of ClearText can be found in these slides.

There are two ways of training simplification models with ClearText. In both cases, the time spent and the training and validation losses are printed at the end of each epoch. When training completes, or if you interrupt it, tests are run and diagnostics are printed.
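The behaviour described above (per-epoch timing and losses, with tests run on completion or interruption) can be sketched as follows. This is a minimal illustration, not ClearText's actual training code: `run_training`, `train_epoch`, `validate`, and `run_tests` are hypothetical names chosen for this example.

```python
import time


def run_training(train_epoch, validate, run_tests, num_epochs):
    """Train for up to num_epochs epochs, then always run the final tests.

    train_epoch, validate, and run_tests are hypothetical callables standing
    in for the real training, validation, and test routines.
    """
    history = []
    try:
        for epoch in range(1, num_epochs + 1):
            start = time.perf_counter()
            train_loss = train_epoch()
            valid_loss = validate()
            elapsed = time.perf_counter() - start
            history.append((train_loss, valid_loss))
            # Report time spent and both losses at the end of each epoch.
            print(
                f"Epoch {epoch}: {elapsed:.1f}s | "
                f"train loss {train_loss:.3f} | valid loss {valid_loss:.3f}"
            )
    except KeyboardInterrupt:
        # Ctrl-C stops training early but still falls through to the tests.
        print("Training interrupted; running tests on the current model.")
    # Reached both on normal completion and after an interrupt.
    diagnostics = run_tests()
    print(f"Test diagnostics: {diagnostics}")
    return history, diagnostics
```

Catching `KeyboardInterrupt` around the whole loop, rather than inside each epoch, keeps the partially trained model usable for the final test pass without any extra bookkeeping.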

Installing and Running ClearText as a Package

The ClearText package can be installed using pip. Any required data (including word vectors) will be downloaded on-the-fly at runtime.

pip install git+https://github.com/bencwallace/cleartext

After installing ClearText, instructions for training a simplification model can be printed with the following command.

python -m cleartext.scripts.train --help

Running ClearText with MLflow

Running ClearText with MLflow not only takes care of preparing and isolating your environment, but has the additional advantage of automatically logging training progress and metadata using MLflow tracking.

To train with MLflow, first install MLflow using either pip (pip install mlflow) or conda (conda install -c conda-forge mlflow), then run the following command:

mlflow run [options] https://github.com/bencwallace/cleartext

where [options] is a sequence of options taking the form -P parameter=[value]. For instance, to train for 10 epochs with 100 hidden units, use the following command:

mlflow run -P num-epochs=10 -P rnn-units=100 https://github.com/bencwallace/cleartext

For a list of available options, run:

mlflow run -e help https://github.com/bencwallace/cleartext
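If you launch several training runs from a script, the mlflow run commands above can be assembled programmatically. The sketch below only builds the argument list; `build_mlflow_command` is a hypothetical helper written for this example and is not part of ClearText or MLflow. The parameter names are the ones used in the commands above.

```python
import shlex


def build_mlflow_command(uri, params=None, entry_point=None):
    """Return the argv list for `mlflow run` with -P name=value options.

    Hypothetical helper for illustration; it does not invoke MLflow itself.
    """
    command = ["mlflow", "run"]
    if entry_point is not None:
        command += ["-e", entry_point]
    for name, value in (params or {}).items():
        command += ["-P", f"{name}={value}"]
    command.append(uri)
    return command


REPO_URI = "https://github.com/bencwallace/cleartext"

# Same command as the 10-epoch, 100-unit example above.
train_command = build_mlflow_command(REPO_URI, {"num-epochs": 10, "rnn-units": 100})
print(shlex.join(train_command))
# mlflow run -P num-epochs=10 -P rnn-units=100 https://github.com/bencwallace/cleartext

# Same command as the options listing above.
help_command = build_mlflow_command(REPO_URI, entry_point="help")
print(shlex.join(help_command))
# mlflow run -e help https://github.com/bencwallace/cleartext
```

To actually start a run, pass the list to `subprocess.run(train_command, check=True)`. Building a list instead of a single string avoids shell quoting problems if a parameter value contains spaces.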

Repository Structure

This repository is divided into the following directories:

chrome: Source for the ClearText Chrome extension
cleartext: ClearText package
- app: Main entrypoint for inference (Flask application).
- data: Data loading modules.
- models: PyTorch models.
- pipeline: End-to-end pipeline (data loading and preprocessing, model training, inference, and evaluation).
- scripts: Main entrypoints for training and evaluation.
- utils: Miscellaneous utilities.
data: Placeholder into which ClearText will save downloaded datasets.
models: Placeholder into which ClearText will serialize models.
notebooks: Jupyter notebooks for EDA.
tests: Unit tests.
vectors: Placeholder into which ClearText will save downloaded word vectors.

