## Data Augmentation using NLPaug

This notebook demostrate the usage of a character augmenter, word augmenter. There are other types such as augmentation for sentences, audio, spectrogram inputs etc. All of the types many before mentioned types and many more can be found at the [github repo](https://github.com/makcedward/nlpaug) and [docs](https://nlpaug.readthedocs.io/en/latest/) of nlpaug.

In [3]:
#Installing the nlpaug package
!pip install nlpaug==0.0.14

Collecting nlpaug==0.0.14
[?25l  Downloading https://files.pythonhosted.org/packages/1f/6c/ca85b6bd29926561229e8c9f677c36c65db9ef1947bfc175e6641bc82ace/nlpaug-0.0.14-py3-none-any.whl (101kB)
[K     |████████████████████████████████| 102kB 4.2MB/s 
[?25hInstalling collected packages: nlpaug
Successfully installed nlpaug-0.0.14


In [5]:
#this will be the base text which we will be using throughout this notebook
text="The quick brown fox jumps over the lazy dog ."

In [12]:
import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.augmenter.sentence as nas
import nlpaug.flow as nafc

from nlpaug.util import Action
import os
!git clone https://github.com/makcedward/nlpaug.git
os.environ["MODEL_DIR"] = 'nlpaug/model/'

Cloning into 'nlpaug'...
remote: Enumerating objects: 27, done.[K
remote: Counting objects: 100% (27/27), done.[K
remote: Compressing objects: 100% (22/22), done.[K
remote: Total 4489 (delta 6), reused 14 (delta 5), pack-reused 4462[K
Receiving objects: 100% (4489/4489), 2.94 MiB | 2.74 MiB/s, done.
Resolving deltas: 100% (3166/3166), done.


### Augmentation at the Character Level


1.   OCR Augmenter: To read textual data from on image, we need an OCR(optical character recognition) model. Once the text is extracted from the image, there may be errors like; '0' instead of an 'o', '2' instead of 'z' and other such similar errors.  
2.   Keyboard Augmenter: While typing/texting typos are fairly common this augmenter simulates the errors by substituting characters in words with ones at a similar distance on a keyboard.



In [8]:
#OCR augmenter
#import nlpaug.augmenter.char as nac

aug = nac.OcrAug()  
augmented_texts = aug.augment(text, n=3) #specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
['The quick brown fox jumps uver the 1azy dog .', 'The quicr brown fox jumps over the lazy dog .', 'The quick brown fux jumps over the la2y do9 .']


In [9]:
#Keyboard Augmenter
#import nlpaug.augmenter.word as naw


aug = nac.KeyboardAug()
augmented_text = aug.augment(text, n=3) #specifying n=3 gives us only 3 augmented versions of the sentence.

print("Original:")
print(text)

print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
['The quick brown fox jumpW over the lazy dog .', 'The quick broAn fox jumps iver the lazy dog .', 'The quixk brown fox jumps over the laAy dog .']


There are other types of character augmenters too. Their details are avaiable in the links mentioned at the beginning of this notebook.

### Augmentation at the Word Level

Augmentation is important at the word level as well , here we use word2vec to insert or substitute a similar word.

**Spelling** **augmentor**


In [13]:
#Downloading the required txt file
!wget https://github.com/makcedward/nlpaug/blob/master/nlpaug/res/word/spelling/spelling_en.txt

--2021-01-14 01:28:11--  https://github.com/makcedward/nlpaug/blob/master/nlpaug/res/word/spelling/spelling_en.txt
Resolving github.com (github.com)... 13.114.40.48
Connecting to github.com (github.com)|13.114.40.48|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘spelling_en.txt.1’

spelling_en.txt.1       [  <=>               ]   3.00M  11.5MB/s    in 0.3s    

2021-01-14 01:28:12 (11.5 MB/s) - ‘spelling_en.txt.1’ saved [3145213]



In [21]:
#Substitute word by spelling mistake words dictionary
aug = naw.SpellingAug('spelling_en.txt')
augmented_texts = aug.augment(text)
print("Original:")
print(text)
print("Augmented Texts:")
print(augmented_texts)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Texts:
quick brown fox jumps over lazy dog .


**Word embeddings augmentor**

In [16]:
#Downloading the reqired model
!wget -c "https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz"

--2021-01-14 01:33:46--  https://s3.amazonaws.com/dl4j-distribution/GoogleNews-vectors-negative300.bin.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.109.173
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.109.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1647046227 (1.5G) [application/x-gzip]
Saving to: ‘GoogleNews-vectors-negative300.bin.gz’


2021-01-14 01:35:23 (16.2 MB/s) - ‘GoogleNews-vectors-negative300.bin.gz’ saved [1647046227/1647046227]



In [17]:
!gunzip GoogleNews-vectors-negative300.bin.gz

In [18]:
!ls

GoogleNews-vectors-negative300.bin  nlpaug  sample_data  spelling_en.txt


Insert word randomly by word embeddings similarity

In [19]:
# model_type: word2vec, glove or fasttext
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='GoogleNews-vectors-negative300.bin',
    action="insert")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
QuickJack The quick Mid brown fox jumps over the Dinatla lazy dog .


Substitute word by word2vec similarity


In [20]:
aug = naw.WordEmbsAug(
    model_type='word2vec', model_path='GoogleNews-vectors-negative300.bin',
    action="substitute")
augmented_text = aug.augment(text)
print("Original:")
print(text)
print("Augmented Text:")
print(augmented_text)

Original:
The quick brown fox jumps over the lazy dog .
Augmented Text:
The quick brown fox salchows over the intellectually_lazy german_shepherd .


There are many more features which nlpaug offers you can visit the github repo and documentation for further details