## Thesaurus

Thesaurus-based augmentation is a data augmentation method for NLP that replaces words or phrases with their synonyms.

As of August 2020, there are several wrapper libraries for thesaurus-based augmentation. However, the best choice is probably to use the WordNet corpus directly through `nltk`.

In [1]:
import nltk

In [2]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [3]:
from nltk.corpus import wordnet as wn

In [4]:
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [5]:
wn.synset('dog.n.01').definition()

'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

In [6]:
wn.synset('dog.n.03').definition()

'informal term for a man'
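To connect the synset lookups above to actual augmentation, here is a minimal sketch of synonym replacement. The function names `wordnet_synonyms` and `synonym_replace`, the replacement probability `p`, and the toy thesaurus are all assumptions made for illustration; they are not part of `nltk`. The WordNet lookup only uses `wn.synsets` and `Synset.lemma_names`, and it is imported lazily so the demo at the bottom runs without the corpus.

```python
import random


def wordnet_synonyms(word):
    """Collect WordNet lemma names for `word`, excluding the word itself.

    Requires nltk and a prior nltk.download('wordnet'); not called in the demo below.
    """
    from nltk.corpus import wordnet as wn
    names = set()
    for synset in wn.synsets(word):
        for name in synset.lemma_names():
            name = name.replace('_', ' ')  # multi-word lemmas use underscores
            if name.lower() != word.lower():
                names.add(name)
    return sorted(names)


def synonym_replace(tokens, lookup, p=0.3, seed=None):
    """Replace each token with probability `p` by a random synonym from `lookup(token)`."""
    rng = random.Random(seed)
    augmented = []
    for token in tokens:
        candidates = lookup(token)
        if candidates and rng.random() < p:
            augmented.append(rng.choice(candidates))
        else:
            augmented.append(token)  # no synonym known, or not selected
    return augmented


# Hypothetical toy thesaurus so the sketch runs without any download.
toy_thesaurus = {'quick': ['fast', 'speedy'], 'dog': ['hound']}
tokens = 'the quick brown dog'.split()
augmented = synonym_replace(tokens, lambda w: toy_thesaurus.get(w, []), p=1.0, seed=0)
print(augmented)
```

With WordNet available, `wordnet_synonyms` can be passed as `lookup` instead of the toy dictionary. Since `wn.synsets('dog')` mixes senses such as `frump.n.01` and `cad.n.01`, unfiltered replacement can change the meaning of a sentence, so some sense or part-of-speech filtering is likely needed in practice.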

## Embedding

Word embedding models such as word2vec or GloVe also support similarity search: given a word, we can retrieve the words whose vectors are closest to it. These nearest neighbours can then serve as replacement candidates for augmenting the text data.
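As a rough illustration of what similarity search means here, the sketch below ranks words by cosine similarity over a handful of hand-written vectors. The `toy_embeddings` values and the helper names `cosine` and `most_similar` are made up for this example; a trained model would supply real vectors with far more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def most_similar(word, embeddings, topn=2):
    """Return the `topn` (word, similarity) pairs closest to `word`."""
    query = embeddings[word]
    scored = [(other, cosine(query, vec))
              for other, vec in embeddings.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]


# Hypothetical 3-dimensional vectors, chosen by hand for illustration only.
toy_embeddings = {
    'king': [0.9, 0.8, 0.1],
    'queen': [0.85, 0.75, 0.2],
    'monarch': [0.8, 0.9, 0.15],
    'banana': [0.1, 0.0, 0.95],
}

neighbours = most_similar('king', toy_embeddings, topn=2)
print(neighbours)
```

The neighbours returned for a word are the candidates one would substitute into a sentence. With a trained gensim model, `model.wv.most_similar('king', topn=5)` performs the same kind of lookup over the learned vectors.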

### word2vec

Below is a simple example of training a word2vec model with gensim.

In [7]:
from gensim.test.utils import datapath
from gensim import utils
import gensim.models


class MyCorpus(object):
    """An iterator that yields sentences (lists of str)."""

    def __iter__(self):
        corpus_path = datapath('lee_background.cor')
        # use a context manager so the file is closed after each pass
        with open(corpus_path) as corpus_file:
            for line in corpus_file:
                # assume there's one document per line, tokens separated by whitespace
                yield utils.simple_preprocess(line)

sentences = MyCorpus()
model = gensim.models.Word2Vec(sentences=sentences)

In [8]:
vec_king = model.wv['king']

In [9]:
vec_king

array([ 2.5239903e-02,  2.8153066e-02, -8.8988505e-03, -4.3449357e-02,
        5.2822381e-02,  1.4836514e-02,  1.0861372e-02, -2.7812922e-02,
        4.1385382e-02, -8.4287710e-02,  2.1733833e-02, -6.6571116e-02,
       -1.2768685e-02, -6.3128541e-03, -6.4659119e-02, -4.3681327e-02,
       -2.2735849e-02,  1.5545152e-04, -6.0136626e-03, -7.6202631e-02,
       -2.1076787e-02, -1.9879710e-02, -2.8931161e-02,  2.0047724e-02,
       -2.4434162e-02, -3.1652786e-02,  1.6024750e-02,  7.6883554e-02,
       -9.9423928e-03,  4.6257824e-02,  2.6851028e-02, -1.3506489e-02,
        9.0606086e-02, -3.4062155e-03,  1.7334465e-02,  2.2784904e-02,
       -3.1803221e-02, -4.4366878e-02, -2.7958740e-02,  1.2414410e-02,
       -3.8777418e-02, -1.8009298e-02, -4.0741444e-02, -4.0751034e-03,
       -5.4130819e-02,  3.4727905e-02, -1.1790309e-02,  1.2605113e-03,
        3.8443159e-03,  3.6358751e-02,  3.3663001e-03, -3.3479743e-02,
        7.6921033e-03,  4.2347223e-02, -2.1893881e-02, -4.6204928e-02,
      

## Back Translation

English is one of the languages with a large amount of training data for translation, while some other languages may not have enough data to train a machine translation model. Sennrich et al. used the back-translation method to generate more training data and improve translation model performance.

Let's assume that we want to train a machine translation model that translates Korean to English. However, there are not many public Korean datasets that we could use to train this model.

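Back-translation translates target-language sentences back into the source language, then mixes the original source sentences with the back-translated ones to train the model. In the Korean-to-English case, monolingual English sentences are translated into Korean by an English-to-Korean model, and each synthetic Korean sentence is paired with its original English sentence. In this way, the amount of training data from the source language to the target language can be increased.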

Unfortunately, I was not able to find a suitable example for back translation.
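In place of a real example, the data flow can at least be sketched with a placeholder translator. Everything below is hypothetical: `back_translate_corpus` and `toy_translate` are illustrative names, and the dictionary lookup stands in for a trained English-to-Korean model. No actual translation model is involved.

```python
def back_translate_corpus(parallel_pairs, target_monolingual, translate_to_source):
    """Return the real (source, target) pairs plus synthetic back-translated pairs.

    `translate_to_source` maps a target-language sentence to the source language.
    """
    synthetic = [(translate_to_source(sentence), sentence)
                 for sentence in target_monolingual]
    return list(parallel_pairs) + synthetic


# Hypothetical stand-in for a trained English -> Korean model.
toy_en_to_ko = {
    'thank you': '감사합니다',
    'good morning': '좋은 아침입니다',
}


def toy_translate(sentence):
    return toy_en_to_ko.get(sentence, '<unk>')


real_pairs = [('안녕하세요', 'hello')]          # scarce real Korean-English data
extra_english = ['thank you', 'good morning']  # plentiful monolingual English
training_pairs = back_translate_corpus(real_pairs, extra_english, toy_translate)

for korean, english in training_pairs:
    print(korean, '->', english)
```

The target side of every synthetic pair is genuine English text; only the Korean source side is machine-generated, so the Korean-to-English model still learns to produce real English.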

## nlpaug

nlpaug is a Python library for NLP data augmentation. It supports various augmenters at the character, word, and sentence level.

In [10]:
!pip3 install nlpaug numpy matplotlib python-dotenv



In [11]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action

In [12]:
text = 'The quick brown fox jumps over the lazy dog .'

aug = nac.OcrAug()
augmented_texts = aug.augment(text, n=3)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick 6rown fox jump8 0ver the lazy dog .', 'The qoick bkown fox jumps over the 1azy dog .', 'The quick bkown fux jumps over the lazy dog .']


In [13]:
aug = nac.KeyboardAug()
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox jumps over the laz5 dog .
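To make the idea behind these character-level augmenters more concrete, here is a minimal sketch of OCR-style noise. This is not how nlpaug implements `OcrAug`; the `OCR_CONFUSIONS` table is a small illustrative subset based on the substitutions visible in the output above (`b` to `6`, `s` to `8`, `o` to `0`, `l` to `1`), and `ocr_noise` is a hypothetical helper.

```python
import random

# Illustrative subset of visually confusable characters (an assumption,
# not nlpaug's actual mapping).
OCR_CONFUSIONS = {'o': '0', 'l': '1', 'b': '6', 's': '8'}


def ocr_noise(text, p=0.3, seed=None):
    """Replace each confusable character with probability `p`."""
    rng = random.Random(seed)
    chars = []
    for ch in text:
        if ch in OCR_CONFUSIONS and rng.random() < p:
            chars.append(OCR_CONFUSIONS[ch])
        else:
            chars.append(ch)
    return ''.join(chars)


text = 'The quick brown fox jumps over the lazy dog .'
noisy = ocr_noise(text, p=1.0)  # p=1.0 replaces every confusable character
print(noisy)
```

A keyboard-style augmenter follows the same pattern, with a table that maps each key to its neighbours on the keyboard instead of to visually similar characters.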
