# Data Augmentation Techniques in NLP

Data augmentation is commonly used in computer vision. There, you can usually flip, rotate, or mirror an image without changing its label. In natural language processing (NLP), however, the story is quite different. Changing a single word can change the meaning of the entire sentence. So we can’t come up with easy rules for data augmentation. Or can we?

[Jason Wei and Kai Zou](https://arxiv.org/pdf/1901.11196.pdf) presented **EDA**: **E**asy **D**ata **A**ugmentation techniques for boosting performance on text classification tasks (for a quick implementation, see the [EDA GitHub repository](https://github.com/jasonwei20/eda_nlp)). EDA consists of four simple operations that do a surprisingly good job of preventing overfitting and helping train more robust models. Here they are:

* **Synonym Replacement**: Randomly choose n words from the sentence that are not stop words. Replace each of these words with one of its synonyms chosen at random.
* **Random Insertion**: Find a random synonym of a random word in the sentence that is not a stop word. Insert that synonym into a random position in the sentence. Do this n times.
* **Random Swap**: Randomly choose two words in the sentence and swap their positions. Do this n times.
* **Random Deletion**: Randomly remove each word in the sentence with probability p.
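
To make the last two operations concrete, here is a minimal sketch of random swap and random deletion on a tokenized sentence. It is an illustration written for this post, not the code from the EDA repository; the function names, the `rng` parameter, and the choice to always keep at least one word are assumptions. Synonym replacement and random insertion are left out because they need a synonym source such as WordNet.

```python
import random


def random_swap(words, n, rng=random):
    """Swap two randomly chosen positions in the sentence, n times."""
    words = list(words)  # work on a copy so the input stays untouched
    if len(words) < 2:
        return words  # nothing to swap
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)  # two distinct positions
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng=random):
    """Drop each word independently with probability p."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    # Assumption: never return an empty sentence; fall back to one random word.
    return kept if kept else [rng.choice(words)]


if __name__ == "__main__":
    tokens = "simple text editing techniques can help small datasets".split()
    print(" ".join(random_swap(tokens, n=1)))
    print(" ".join(random_deletion(tokens, p=0.2)))
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which helps when comparing runs.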


<p align="center"> <img src="https://d3i71xaburhd42.cloudfront.net/28e30b4b5cd511f64b3bb3d7d0f57e067b3977be/4-Table3-1.png" alt="drawing" width="400" class="center"> </p>

Example:

<p align="center"> <img src="https://miro.medium.com/max/778/1*y88F2-lpLQNxw_ubWoGctQ.png" alt="drawing" width="400" class="center"> </p>

In [1]:
!git clone https://github.com/jasonwei20/eda_nlp

Cloning into 'eda_nlp'...
remote: Enumerating objects: 30, done.
remote: Counting objects: 100% (30/30), done.
remote: Compressing objects: 100% (30/30), done.
remote: Total 379 (delta 18), reused 1 (delta 0), pack-reused 349
Receiving objects: 100% (379/379), 20.41 MiB | 21.99 MiB/s, done.
Resolving deltas: 100% (181/181), done.


In [2]:
import nltk

In [3]:
nltk.download("wordnet")

[nltk_data] Downloading package wordnet to /home/jupyter/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


True

In [4]:
from eda_nlp.code.eda import eda

In [19]:
# Multipliers applied to alpha for each of the four EDA operations.
synonym_replacement = 1
random_insertion = 1
random_swap = 0
random_deletion = 1

def apply_text_augmentation(sentences, alpha=0.1, num_aug=9):
    """
    Generate more data with standard augmentation.

    arguments:
        sentences -- List[tuple(label:int, text:str)]
        alpha -- Change probability. How much to change each sentence (default 0.1).
        num_aug -- number of augmented sentences to generate per original sentence (default 9).

    output:
        augmented_sentences -- List[tuple(label:int, text:str)]
    """

    augmented_sentences = []
    for label, text in sentences:
        # Scale alpha by the per-operation multipliers defined above.
        aug_sentences = eda(text,
                            alpha_sr=alpha * synonym_replacement,
                            alpha_ri=alpha * random_insertion,
                            alpha_rs=alpha * random_swap,
                            p_rd=alpha * random_deletion,
                            num_aug=num_aug)

        # Every augmented sentence keeps the label of its source sentence.
        for aug_sentence in aug_sentences:
            augmented_sentences.append((label, aug_sentence))

    return augmented_sentences


In [20]:
apply_text_augmentation([
         (1,"This is an interesting example of data augmentation"),
         (1,"Simple text editing techniques can make huge performance gains for small datasets."),
         (2,"A sad, superior human comedy played out on the back road of our lifes.")], 
    alpha=0.2, num_aug=5)

[(1, 'this an interesting of data augmentation'),
 (1, 'this is an interesting example of data point augmentation'),
 (1, 'this worry is an interesting example of data augmentation'),
 (1, 'this is interesting an example of data augmentation'),
 (1, 'this an interesting example augmentation'),
 (1, 'this is an interesting example of data augmentation'),
 (1,
  'simple text editing techniques can make brobdingnagian operation gains for small datasets'),
 (1,
  'dim witted text redact techniques can make huge performance gains for small datasets'),
 (1,
  'text simple editing techniques can make huge performance gains for small datasets'),
 (1,
  'simple text editing techniques can gains huge performance make for small datasets'),
 (1,
  'text editing techniques can make huge performance gains for small datasets'),
 (1,
  'simple text editing techniques can make huge performance gains for small datasets '),
 (2,
  'superscript a sad superior human comedy played out on the back road of ou

## Does it really work?

Although some generated sentences may be a little nonsensical, introducing some noise into the dataset can be very helpful for training a more robust model, especially on smaller datasets.

As shown in [the paper](https://arxiv.org/abs/1901.11196), training with EDA outperforms normal training at almost all dataset sizes across five benchmark text classification tasks, and the gap is largest when training on small amounts of data. On average, training a recurrent neural network (RNN) **with EDA while using only 50% of the available training set achieved the same accuracy as normal training with all available data**.

<p align="center"> <img src="https://raw.githubusercontent.com/jasonwei20/eda_nlp/master/eda_figure.png" alt="drawing" width="400" class="center"> </p>



## How much improvement can we expect?

You will not see miracles in terms of performance improvement. The authors report a performance gain of around 2–3% for small datasets and a more modest gain (~1%) for larger ones. However, the main benefit here is achieving competitive results with less real data.

## How much augmentation?

How many augmented sentences should we generate per original sentence?

The answer depends on the size of your dataset. With a small dataset, overfitting is more likely, so generating a larger number of augmented sentences helps. For larger datasets, adding too much augmented data can be unhelpful, since your model may already be able to generalize when there is a large amount of real data. This figure shows performance gain with respect to the number of augmented sentences generated per original sentence:

<p align="center"> <img src="https://miro.medium.com/max/842/1*eKvsUdhBS3cCwsqdu9G0JA.png" alt="drawing" width="400" class="center"> </p>
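
As a starting point, the choice can be wrapped in a small helper. This is a hedged sketch: the function name `recommended_eda_params` is hypothetical, and the values follow the recommended-usage table in the EDA paper as best we can tell, so verify them against the paper before relying on them. The paper lists specific training set sizes; treating them as "up to this size" thresholds is an assumption made here.

```python
def recommended_eda_params(n_train):
    """Return (alpha, num_aug) for a training set with n_train sentences.

    Values are assumed from the EDA paper's recommended-usage table;
    the threshold interpretation is this sketch's own choice.
    """
    if n_train <= 500:
        return 0.05, 16   # very small dataset: light edits, many copies
    if n_train <= 2000:
        return 0.05, 8
    return 0.1, 4         # around 5,000 sentences and above


if __name__ == "__main__":
    for size in (300, 2000, 50000):
        alpha, num_aug = recommended_eda_params(size)
        print(f"n_train={size}: alpha={alpha}, num_aug={num_aug}")
```

The returned pair can be passed straight to `apply_text_augmentation(sentences, alpha=alpha, num_aug=num_aug)` from the cell above. Treat these numbers as defaults to tune on a validation set rather than fixed rules.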


## What next?

We’ve seen that simple data augmentation operations can boost performance in text classification, particularly on small datasets. If you are training a text classifier on a small dataset and looking for an easy way to get better performance, EDA is worth trying.

## Credits

By Arian Pasquali, based entirely on the work of Jason Wei and Kai Zou and on Jason Wei's blog post.

* Code https://github.com/jasonwei20/eda_nlp
* Paper https://arxiv.org/abs/1901.11196
