<h1><center>ClearText</center></h1>
<center>Benjamin Wallace</center>
<center>(2020)</center>

## The *English as a Second Language* Market

According to [TESOL](https://www.tesol.org/) (Teachers of English to Speakers of Other Languages), there are over [1.5 billion](https://www.internationalteflacademy.com/blog/report-from-tesol-2-billion-english-learners-worldwide) English language learners worldwide. A huge amount of human labour is involved in educating these learners. However, many learners may not be able to afford lesson costs and must resort to other methods.

There is a large market for assisted language learning applications. For instance, [Forbes](https://www.forbes.com/sites/susanadams/2019/07/16/game-of-tongues-how-duolingo-built-a-700-million-business-with-its-addictive-language-learning-app/#5f99b9d83463) reports that [Duolingo](https://www.duolingo.com/) was valued at \$700 million in 2017 and that [Babbel](https://www.babbel.com/) has a revenue of \$115 million.

However, apps like these are generally limited to basic language skills that do not transfer to real-world use. To retain users who would otherwise outgrow them, app developers are forced to design increasingly complex and challenging language games, which requires extensive work by multilingual language and education experts.

## The Problem: Assisted Reading for Language Learners

Many language learners make use of subtitles in film and other media to assist them in the learning process. An English language learner, for instance, might turn on English subtitles while watching a movie in English. The additional visual input helps learners with their oral comprehension skills.

Unfortunately, a similar solution for text media is missing. A learner who desires to regularly read reports from a certain English language news source in order to improve their reading comprehension skills might be frustrated at the difficulty they encounter and the crudeness of existing forms of assistance, such as dictionaries, which do not take *context* into account.

ClearText attempts to solve this problem through the use of text simplification technology.

## The Technology: Text Simplification

*Text simplification* is a natural language processing task that seeks to transform a source text into a "simpler" (easier to read or understand) target text while preserving the meaning or content of the text. Note that this is distinct from text *summarization*: a summary of a text need not be simpler (and may even be denser and more complex), and a simplification may not provide a useful summary (it may be just as long as, if not longer than, the original text).

Text simplification can be understood as a form of *monolingual* machine translation. Indeed, this is the approach typically taken by many academic researchers on this problem. Unfortunately, at the time of writing we are not aware of any *production-ready* text simplification systems. Implementing such a system is one of the main technical obstacles we must overcome.

# The WikiLarge and WikiSmall Datasets

Encouraged by the incredible successes of natural language processing in the past several years, we treat the text simplification problem as a machine learning problem. More precisely, we treat it as a neural machine translation problem. There are several datasets available for neural machine translation, but we will begin our work with the [WikiLarge and WikiSmall](https://github.com/XingxingZhang/dress) corpora. Both corpora consist, for the most part, of aligned text pairs automatically produced (using various similarity metrics) from corresponding articles on Wikipedia and the [Simple English Wikipedia](https://simple.wikipedia.org/).
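To make the idea of similarity-based alignment concrete, here is a minimal sketch that pairs each complex sentence with its most similar simple sentence. The word-level Jaccard similarity, the `threshold` value, the function names `jaccard` and `align`, and the toy sentences are all assumptions chosen for illustration. They are not the metrics or data actually used to build WikiLarge or WikiSmall.

```python
def jaccard(a, b):
    """Word-level Jaccard similarity between two sentences."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def align(complex_sents, simple_sents, threshold=0.3):
    """Pair each complex sentence with its most similar simple sentence.

    Pairs whose best score falls below `threshold` (an assumed value) are dropped.
    """
    pairs = []
    for src in complex_sents:
        best = max(simple_sents, key=lambda dst: jaccard(src, dst), default=None)
        if best is not None and jaccard(src, best) >= threshold:
            pairs.append((src, best))
    return pairs


# Toy sentences made up for this illustration.
complex_sents = [
    "The city made itself independent from the bishops .",
    "The mills contributed to the growth of the city .",
]
simple_sents = [
    "The mills helped the city grow .",
    "The city became independent from the bishops .",
]

pairs = align(complex_sents, simple_sents)
for src, dst in pairs:
    print(f'{src}  ->  {dst}')
```

A purely lexical metric like this one will happily align near-duplicate or fragmentary sentences, which is one plausible source of noise in automatically aligned corpora.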

In order to run the following cells, the compressed file containing these datasets should be extracted into `../data/raw`.

In [13]:
import io
import os

import pandas as pd

## WikiSmall

In [11]:
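def load_wiki(dataset):
    """Load the train/valid/test splits of WikiSmall or WikiLarge as DataFrames."""
    wiki_dir = os.path.join('../data/raw/data-simplification', dataset)
    prefix = 'PWKP_108016.tag.80.aner.ori' if dataset == 'wikismall' else 'wiki.full.aner.ori'

    data = []
    for split in ['train', 'valid', 'test']:
        for loc in ['src', 'dst']:
            # File names follow the pattern <prefix>.<split>.<src|dst>.
            file_name = '.'.join([prefix, split, loc])
            file_path = os.path.join(wiki_dir, file_name)
            # Use a context manager so the file handle is closed after reading.
            with io.open(file_path, encoding='utf-8') as stream:
                # Note: split('\n') keeps a trailing empty string if the file ends with a newline.
                lines = stream.read().split('\n')
            data.append(lines)

    # `data` is ordered as (src, dst) for each of train, valid, test.
    src_train, dst_train, src_valid, dst_valid, src_test, dst_test = data
    train = pd.DataFrame(list(zip(src_train, dst_train)), columns=['source', 'target'])
    valid = pd.DataFrame(list(zip(src_valid, dst_valid)), columns=['source', 'target'])
    test = pd.DataFrame(list(zip(src_test, dst_test)), columns=['source', 'target'])

    return train, valid, test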

In [17]:
train, valid, test = load_wiki('wikismall')

print(f'Training examples: {len(train)}')
print(f'Validation examples: {len(valid)}')
print(f'Test examples: {len(test)}')

Training examples: 88838
Validation examples: 206
Test examples: 101
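
One caveat when reading these counts: `load_wiki` splits each file on `'\n'`, which yields a trailing empty string whenever the text ends with a newline. Whether the corpus files end this way is an assumption here, but if they do, each count above is one higher than the number of sentence pairs. The toy string below is made up to show the difference between `split('\n')` and `splitlines()`.

```python
# Toy file contents ending in a newline (not taken from the corpus).
text = "first sentence .\nsecond sentence .\n"

with_split = text.split('\n')
with_splitlines = text.splitlines()

print(with_split)       # ['first sentence .', 'second sentence .', '']
print(with_splitlines)  # ['first sentence .', 'second sentence .']
```

Switching to `splitlines()`, or dropping a final empty row from each DataFrame, would avoid carrying an empty pair into training.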


In [24]:
train.head(10)

| | source | target |
|---|---|---|
| 0 | Demographics | People |
| 1 | Pony Canyon Popfrenzy Popsicle Records Poptone... | Pony Canyon Popfrenzy Poptones Pork Recordings... |
| 2 | In 1199 , during the period of its highest spl... | Volterra . |
| 3 | The mills contributed to the growth of Minneap... | Minneapolis . |
| 4 | Apart from being a member of the Brazilian Aca... | Brazilian Academy of Sciences . |
| 5 | Nathan Fillion as Malcolm \`\` Mal '' Reynolds :... | Malcolm \`\` Mal '' Reynolds . |
| 6 | HostGIS HostGIS Linux is a Slackware based dis... | GoblinX HostGIS NimbleX . |
| 7 | Nawansher Urban is one of the 51 Union Council... | District Government Abbottabad . |
| 8 | Joan Mondale , wife of Walter Mondale . | Joan Mondale , Wife of Walter Mondale . |
| 9 | 45 South Dakota SD S. Dak. . | Rhode Island R.I. South Carolina S.C. South Da... |


In [22]:
valid.head()

| | source | target |
|---|---|---|
| 0 | The program was transmitted by Onda Cero Radio... | Onda Cero . |
| 1 | Beyonce Sued For Copyright Infringement Track ... | Track listing Title ! |
| 2 | Cast and characters Humans Shia LaBeouf stars ... | It stars Shia LaBeouf as Sam Witwicky . |
| 3 | A sequel , A Hat Full of Sky , was published i... | It was published in 2003 . |
| 4 | However , it is equally likely to decay at any... | However , it is radioactive . |


The dataset appears to be quite noisy, containing some nonsense examples such as the following.

In [32]:
i = 1
print(train.loc[i, 'source'])
print(train.loc[i, 'target'])

Pony Canyon Popfrenzy Popsicle Records Poptones Pork Recordings Portrait Records Posh Boy Records Positiva Records Positive Tone Positron !
Pony Canyon Popfrenzy Poptones Pork Recordings Portrait Records Posh Boy Records Positiva Records Positive Tone Positron !


It also contains some low-quality simplifications.

In [33]:
i = 2
print(train.loc[i, 'source'])
print(train.loc[i, 'target'])

In 1199 , during the period of its highest splendour , the city made itself independent from the bishops of Volterra .
Volterra .
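
Pairs like this one, where the target keeps only a fragment of the source, can be flagged with a simple length heuristic. The sketch below is an illustration only: the helper names `length_ratio` and `is_suspicious` and the `min_ratio` threshold of 0.2 are assumptions, not part of the dataset or of any published filtering procedure.

```python
def length_ratio(source, target):
    """Ratio of target length to source length, measured in whitespace tokens."""
    n_source = len(source.split())
    return len(target.split()) / n_source if n_source else 0.0


def is_suspicious(source, target, min_ratio=0.2):
    """Flag pairs whose target is too short to be a faithful rewrite of the source.

    `min_ratio` is an assumed cut-off and would need tuning on real data.
    """
    return length_ratio(source, target) < min_ratio


# The low-quality pair shown above (train row 2).
source = ("In 1199 , during the period of its highest splendour , the city made "
          "itself independent from the bishops of Volterra .")
target = "Volterra ."
flag_bad = is_suspicious(source, target)

# A pair of equal length (train row 8) is not flagged by this heuristic.
flag_ok = is_suspicious("Joan Mondale , wife of Walter Mondale .",
                        "Joan Mondale , Wife of Walter Mondale .")

print(flag_bad, flag_ok)
```

A length ratio catches truncated targets but not near-identical pairs such as the record-label list above, so it would be only one of several checks needed to clean the training set.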


The test set, on the other hand, comes from the [Turk corpus](https://github.com/cocoxu/simplification/). It was produced by humans on Amazon's [Mechanical Turk](https://www.mturk.com/).

In [23]:
test.head()

| | source | target |
|---|---|---|
| 0 | Genetic engineering has expanded the genes ava... | New plants were created with genetic engineeri... |
| 1 | The name Jadgal applies specifically to groups... | The Jadgal people speak the Jadgali language . |
| 2 | Wikipedia is free content that anyone can edit... | Wikipedia is free content that anyone may chan... |
| 3 | The print collections are further supported by... | The print collections include large collection... |
| 4 | Restoration of independence In 1991 , the Sovi... | Armenia received its independence from the Sov... |
