## Example of Textual Augmenter Usage<a class="anchor" id="home"></a>:
* [Character Augmenter](#chara_aug)
    * [OCR](#ocr_aug)
    * [Keyboard](#keyboard_aug)
    * [Random](#random_aug)
* [Word Augmenter](#word_aug)
    * [Spelling](#spelling_aug)
    * [Word Embeddings](#word_embs_aug)
    * [TF-IDF](#tfidf_aug)
    * [Contextual Word Embeddings](#context_word_embs_aug)
    * [Synonym](#synonym_aug)
    * [Antonym](#antonym_aug)
    * [Random Word](#random_word_aug)
    * [Split](#split_aug)
    * [Back Translation](#back_translation_aug)
    * [Reserved Word](#reserved_aug)
* [Sentence Augmenter](#sent_aug)
    * [Contextual Word Embeddings for Sentence](#context_word_embs_sentence_aug)
    * [Abstractive Summarization](#abst_summ_aug)

In [37]:
import os
os.environ["MODEL_DIR"] = '../model/'

# Config

In [2]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

In [3]:
text = 'The quick brown fox jumps over the lazy dog .'
print(text)

The quick brown fox jumps over the lazy dog .


# Character Augmenter<a class="anchor" id="chara_aug"></a>

Character-level augmentation perturbs individual characters. Typical scenarios are image-to-text pipelines and chatbots. Recognizing text in an image requires an optical character recognition (OCR) model, and OCR introduces errors such as confusing "o" with "0". `OcrAug` simulates these errors to augment data. Chatbot input still contains typos even though most applications ship with word correction, so `KeyboardAug` simulates that kind of error.
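
To make the idea concrete, here is a minimal standard-library sketch of OCR-style character substitution. The `OCR_CONFUSIONS` table and the `ocr_noise` helper are hypothetical names made up for illustration; they are not part of nlpaug, whose own error table is not reproduced here.

```python
import random

# Hypothetical look-alike table, assumed for illustration only.
OCR_CONFUSIONS = {"o": "0", "l": "1", "i": "1", "s": "5", "b": "6"}


def ocr_noise(text, prob=0.3, seed=None):
    """Replace confusable characters with look-alikes with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        swap = OCR_CONFUSIONS.get(ch.lower())
        if swap is not None and rng.random() < prob:
            out.append(swap)   # simulate a recognition error
        else:
            out.append(ch)     # keep the character unchanged
    return "".join(out)


sample = "The quick brown fox jumps over the lazy dog ."
noisy = ocr_noise(sample, prob=0.5, seed=0)
print(noisy)
```

Passing a `seed` makes the noise reproducible, which helps when comparing augmented datasets across runs.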

### OCR Augmenter<a class="anchor" id="ocr_aug"></a>

##### Substitute character by pre-defined OCR error

In [4]:
aug = nac.OcrAug()
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The qoick bkown fox jumps uver the lazy dog.', 'The quick bkown fox jumps uver the lazy dog.', 'The quick 6rown fox jumps ovek the lazy dog.']


### Keyboard Augmenter<a class="anchor" id="keyboard_aug"></a>

##### Substitute character by keyboard distance

In [5]:
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox mumps *ver the Pazy dog.


### Random Augmenter<a class="anchor" id="random_aug"></a>

##### Insert character randomly

In [6]:
aug = nac.RandomCharAug(action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick broewn fox jumps Kover the laz1y dog.


##### Substitute character randomly

In [7]:
aug = nac.RandomCharAug(action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The qaick brown fox jumqs ojer the lazy dog.


##### Swap character randomly

In [8]:
aug = nac.RandomCharAug(action="swap")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The uqick brown fox ujmps over the lzay dog.


##### Delete character randomly

In [9]:
aug = nac.RandomCharAug(action="delete")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quik bown fox jmps over the lazy dog.


# Word Augmenter<a class="anchor" id="word_aug"></a>

Besides character-level augmentation, word-level augmentation is important as well. We use word2vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), fastText (Joulin et al., 2016), BERT (Devlin et al., 2018) and WordNet to insert and substitute similar words. `Word2vecAug`, `GloVeAug` and `FasttextAug` use word embeddings to find the group of words most similar to the original word and replace it with one of them. `BertAug`, on the other hand, uses a language model to predict possible target words. `WordNetAug` looks up similar words in the WordNet lexical database. The cells below use the corresponding classes `WordEmbsAug`, `ContextualWordEmbsAug` and `SynonymAug`.
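
The embedding-based substitution described above can be sketched without any pretrained model. The three-dimensional vectors in `TOY_VECTORS` are invented for illustration, and `nearest_word` and `substitute_all` are hypothetical helpers, not nlpaug functions. Real augmenters also change only a fraction of the words, whereas this sketch replaces every word it knows.

```python
import math

# Toy vectors, invented for illustration; real embeddings have hundreds of dimensions.
TOY_VECTORS = {
    "quick": (0.9, 0.1, 0.0),
    "fast": (0.85, 0.15, 0.05),
    "lazy": (0.0, 0.9, 0.1),
    "idle": (0.05, 0.85, 0.2),
    "dog": (0.1, 0.1, 0.9),
    "puppy": (0.15, 0.2, 0.85),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_word(word):
    """Return the most similar other word, or the word itself if it has no vector."""
    if word not in TOY_VECTORS:
        return word
    candidates = (w for w in TOY_VECTORS if w != word)
    return max(candidates, key=lambda w: cosine(TOY_VECTORS[word], TOY_VECTORS[w]))


def substitute_all(text):
    """Replace every known token with its nearest neighbour."""
    return " ".join(nearest_word(tok) for tok in text.split())


print(substitute_all("The quick brown fox jumps over the lazy dog ."))
# prints: The fast brown fox jumps over the idle puppy .
```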

### Spelling Augmenter<a class="anchor" id="spelling_aug"></a>

##### Substitute word using a dictionary of common spelling mistakes

In [10]:
aug = naw.SpellingAug()
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick brown fox jumps overt the laizy gog.', 'The quick browm fox jumps ower the lazy doy.', 'Them quick Brawn fox jumps over zhe lazy dog.']


In [11]:
aug = naw.SpellingAug()
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
["The qchick brown fox jumps other there's lazy dog.", 'The qchick braown fox jumps over the laisy dog.', 'Thte quick brown fox jumps over DE lazy djg.']


### Word Embeddings Augmenter<a class="anchor" id="word_embs_aug"></a>

##### Insert word randomly by word embeddings similarity

In [19]:
# model_type: word2vec, glove or fasttext
model_dir="../model/"
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin',
    action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
Victor The quick docking brown fox jumps over the lazy actress dog.


##### Substitute word by word2vec similarity

In [20]:
# model_type: word2vec, glove or fasttext
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path=model_dir+'GoogleNews-vectors-negative300.bin',
    action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
_The quick brown fox heelflip over the lazy rottweiller.


### TF-IDF Augmenter<a class="anchor" id="tfidf_aug"></a>

##### Insert word by TF-IDF similarity

In [22]:
aug = naw.TfIdfAug(
    model_path=os.environ.get("MODEL_DIR"),
    action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The 7852 quick 3K brown fox Sheddap jumps over the lazy dog.


##### Substitute word by TF-IDF similarity

In [23]:
aug = naw.TfIdfAug(
    model_path=os.environ.get("MODEL_DIR"),
    action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox jumps Observation DGL lazy bulked.
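
The cells above load a TF-IDF model from `MODEL_DIR`. As background on what such a model scores, the sketch below computes TF-IDF values on a three-document toy corpus using only the standard library. The corpus and the helpers `idf_table` and `tfidf_scores` are assumptions made for illustration. Treating low-score words as the least informative ones is one common reading of these scores; whether nlpaug selects words exactly this way is not confirmed here.

```python
import math
from collections import Counter

# Toy corpus, assumed for illustration only.
corpus = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "the dog barks".split(),
]


def idf_table(docs):
    """Inverse document frequency: log(N / number of documents containing the token)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}


def tfidf_scores(tokens, idf):
    """Term frequency within `tokens` multiplied by the corpus IDF."""
    counts = Counter(tokens)
    return {tok: (counts[tok] / len(tokens)) * idf.get(tok, 0.0) for tok in counts}


idf = idf_table(corpus)
scores = tfidf_scores("the lazy dog".split(), idf)
least_informative = min(scores, key=scores.get)
print(scores)
print("Least informative token:", least_informative)
```

Because "the" appears in every document its IDF is zero, so it carries the lowest score in the example sentence.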


### Contextual Word Embeddings Augmenter<a class="anchor" id="context_word_embs_aug"></a>

##### Insert word by contextual word embeddings (BERT, DistilBERT, RoBERTa or XLNet)

In [24]:
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
however the quick brown fox instantly jumps over the lonely lazy dog.


##### Substitute word by contextual word embeddings (BERT, DistilBERT, RoBERTa or XLNet)

In [25]:
aug = naw.ContextualWordEmbsAug(
    model_path='bert-base-uncased', action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
several quick hunting fox jumps over the yellow dog.


In [26]:
aug = naw.ContextualWordEmbsAug(
    model_path='distilbert-base-uncased', action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
the quick responding fox moves over the other dog.


In [27]:
aug = naw.ContextualWordEmbsAug(
    model_path='roberta-base', action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The female buck fox jumps over the nearest dog.


### Synonym Augmenter<a class="anchor" id="synonym_aug"></a>

##### Substitute word by WordNet's synonym

In [30]:
import nltk
nltk.download('wordnet')
aug = naw.SynonymAug(aug_src='wordnet')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick robert brown fox jump over the faineant dog.


##### Substitute word by PPDB's synonym

In [39]:
aug = naw.SynonymAug(aug_src='ppdb', model_path=os.environ.get("MODEL_DIR") + 'ppdb-2.0-s-all')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The rapid brown fox climbs over the lazy dog.


### Antonym Augmenter<a class="anchor" id="antonym_aug"></a>

##### Substitute word by antonym

In [4]:
aug = naw.AntonymAug()
_text = 'Good boy'
augmented_text = aug.augment(_text)
print("Original:")
print(_text)
print("Augmented Text:")
print(augmented_text)

Original:
Good boy
Augmented Text:
Good daughter


### Random Word Augmenter<a class="anchor" id="random_word_aug"></a>

##### Swap word randomly

In [6]:
aug = naw.RandomWordAug(action="swap")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
Quick the brown fox jumps over the lazy dog .


##### Delete word randomly

In [18]:
aug = naw.RandomWordAug(action="delete")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog
Augmented Text:
The brown jumps over the lazy dog


##### Delete a set of contiguous words randomly

In [4]:
aug = naw.RandomWordAug(action='crop')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox jumps dog .
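
A crop removes one contiguous run of words rather than scattered ones. The following is a minimal sketch of that behaviour, assuming a fixed span length. `crop_words` and its `span` parameter are hypothetical and do not mirror the `RandomWordAug` signature.

```python
import random


def crop_words(text, span=3, seed=None):
    """Remove `span` consecutive tokens starting at a random position."""
    tokens = text.split()
    if span <= 0 or span >= len(tokens):
        return text  # nothing sensible to crop
    rng = random.Random(seed)
    start = rng.randrange(len(tokens) - span + 1)
    return " ".join(tokens[:start] + tokens[start + span:])


sample = "The quick brown fox jumps over the lazy dog ."
print(crop_words(sample, span=3, seed=1))
```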


### Split Augmenter<a class="anchor" id="split_aug"></a>

##### Split word to two tokens randomly

In [3]:
aug = naw.SplitAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The q uick b rown fox jumps o ver the lazy dog .
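
The splitting step itself is simple enough to sketch in plain Python. `split_words`, `min_char` and `prob` are hypothetical names chosen for this illustration; in the output above only words of four or more characters were split, which is why the sketch skips shorter tokens.

```python
import random


def split_words(text, min_char=4, prob=0.3, seed=None):
    """Split tokens of at least `min_char` characters in two, with probability `prob`."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        if len(tok) >= min_char and rng.random() < prob:
            cut = rng.randrange(1, len(tok))  # both halves are non-empty
            out.append(tok[:cut] + " " + tok[cut:])
        else:
            out.append(tok)
    return " ".join(out)


sample = "The quick brown fox jumps over the lazy dog ."
print(split_words(sample, prob=0.5, seed=3))
```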


### Back Translation Augmenter<a class="anchor" id="back_translation_aug"></a>

In [1]:
import nlpaug.augmenter.word as naw

text = 'The quick brown fox jumped over the lazy dog'
back_translation_aug = naw.BackTranslationAug(
    from_model_name='facebook/wmt19-en-de', 
    to_model_name='facebook/wmt19-de-en'
)
back_translation_aug.augment(text)

'The speedy brown fox jumped over the lazy dog'

In [8]:
# Load models from local path
import nlpaug.augmenter.word as naw

from_model_dir = os.path.join(os.environ["MODEL_DIR"], 'word', 'fairseq', 'wmt19.en-de')
to_model_dir = os.path.join(os.environ["MODEL_DIR"], 'word', 'fairseq', 'wmt19.de-en')

text = 'The quick brown fox jumped over the lazy dog'
back_translation_aug = naw.BackTranslationAug(
    from_model_name=from_model_dir, from_model_checkpt='model1.pt',
    to_model_name=to_model_dir, to_model_checkpt='model1.pt', 
    is_load_from_github=False)
back_translation_aug.augment(text)


'The speedy brown fox jumped over the lazy dog'

### Reserved Word Augmenter<a class="anchor" id="reserved_aug"></a>

In [None]:
import nlpaug.augmenter.word as naw

text = 'Fwd: Mail for solution'
reserved_tokens = [
    ['FW', 'Fwd', 'F/W', 'Forward'],
]
reserved_aug = naw.ReservedAug(reserved_tokens=reserved_tokens)
augmented_text = reserved_aug.augment(text)

print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)
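
The cell above has no recorded output. As a rough illustration of the idea, the sketch below swaps a token for another member of the same reserved group. `swap_reserved` is a hypothetical helper, and its handling of the trailing colon is an assumption, not a description of how `ReservedAug` tokenizes text.

```python
import random

RESERVED_GROUPS = [["FW", "Fwd", "F/W", "Forward"]]


def swap_reserved(text, groups, seed=None):
    """Replace each token found in a group with a different member of that group."""
    rng = random.Random(seed)
    out = []
    for tok in text.split():
        core = tok.rstrip(":")        # assumed: treat a trailing colon as punctuation
        suffix = tok[len(core):]
        for group in groups:
            if core in group:
                alternatives = [alt for alt in group if alt != core]
                if alternatives:
                    core = rng.choice(alternatives)
                break
        out.append(core + suffix)
    return " ".join(out)


swapped = swap_reserved("Fwd: Mail for solution", RESERVED_GROUPS, seed=0)
print(swapped)
```

Only the reserved token changes; the rest of the sentence is passed through untouched.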

# Sentence Augmenter<a class="anchor" id="sent_aug"></a>

### Contextual Word Embeddings for Sentence Augmenter<a class="anchor" id="context_word_embs_sentence_aug"></a>

##### Insert sentence by contextual word embeddings (GPT2 or XLNet)

In [6]:
# model_path: xlnet-base-cased or gpt2
aug = nas.ContextualWordEmbsForSentenceAug(model_path='xlnet-base-cased')
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick brown fox jumps over the lazy dog . A terrible , messy split second presents itself to the heart - which is we lose our heart.', 'The quick brown fox jumps over the lazy dog . Cast from the heart - the above flash is insight to the heart.', 'The quick brown fox jumps over the lazy dog . Give two mom s time to share some affection over this heart shaped version of Scott.']


In [7]:
aug = nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox jumps over the lazy dog . J in a Better Balls of Fire cameo on St iring.


In [7]:
aug = nas.ContextualWordEmbsForSentenceAug(model_path='gpt2')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox jumps over the lazy dog . They start shooting wildly.


In [34]:
aug = nas.ContextualWordEmbsForSentenceAug(model_path='distilgpt2')
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Using pad_token, but it is not set yet.


Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox jumps over the lazy dog . . .
This time around, the guy is a puppy that's been in the hospital so far. It's a great dog for the next few weeks, though. The guy has been getting a lot of the stress and the stress he has to deal with in the hospital, so let's look at what the stress has to do with the little dog.
As we said before, this puppy is an "old dog" and he's taken care of it in his own home. He will be on the street in the morning on Mondays, but he will not be on the street for the next few weeks.
We're not certain if the dog will have an issue with his condition, and so we'll keep an eye on his condition in the comments. We've heard a lot of speculation about his condition and we know it's very serious.
The dog, the same as the dog, has been in the hospital a few days now. It has not yet been diagnosed. The dog is just a little boy in his own home. He has been out of bed for many weeks 

### Abstractive Summarization Augmenter<a class="anchor" id="abst_summ_aug"></a>

In [7]:
article = """
The history of natural language processing (NLP) generally started in the 1950s, although work can be 
found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and 
Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. 
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian 
sentences into English. The authors claimed that within three or five years, machine translation would
be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, 
which found that ten-year-long research had failed to fulfill the expectations, funding for machine 
translation was dramatically reduced. Little further research in machine translation was conducted 
until the late 1980s when the first statistical machine translation systems were developed.
"""

aug = nas.AbstSummAug(model_path='t5-base', num_beam=3)
augmented_text = aug.augment(article)
print("Original:")
print(article)
print("Augmented Text:")
print(augmented_text)

Original:

The history of natural language processing (NLP) generally started in the 1950s, although work can be 
found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and 
Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. 
The Georgetown experiment in 1954 involved fully automatic translation of more than sixty Russian 
sentences into English. The authors claimed that within three or five years, machine translation would
be a solved problem. However, real progress was much slower, and after the ALPAC report in 1966, 
which found that ten-year-long research had failed to fulfill the expectations, funding for machine 
translation was dramatically reduced. Little further research in machine translation was conducted 
until the late 1980s when the first statistical machine translation systems were developed.

Augmented Text:
the history of natural language processing (NLP) generally started in the 19