# Data Augmentation

**Data Augmentation** is a technique for increasing the number of data samples without collecting new data from its sources, by transforming the existing dataset into multiple forms. Let's dive into the following data augmentation techniques used in Natural Language Processing:

- [Easy Data Augmentation](https://arxiv.org/abs/1901.11196)
    - Synonym Replacement
    - Random Insertion
    - Random Deletion
    - Random Swap
- Back Translation
- [Albumentations Package](https://github.com/albumentations-team/albumentations)
    - Shuffle Sentences Transform
    - Exclude Duplicate Transform
- [NLPAug Library](https://github.com/makcedward/nlpaug)
    - Character-Level Augmentation
    - Word-Level Augmentation
    - Sentence-Level Augmentation
    

## Data Ingestion

In [2]:
import pandas as pd
from utils import EasyDataAugmentation

In [3]:
df = pd.read_csv('./../data/reviews.csv')
df.head()

| Unnamed: 0 | Id | Review | Label |
|---|---|---|---|
| 0 | 103868 | Very bad course. | 1 |
| 1 | 15884 | Creativity without a reason, without a real pr... | 1 |
| 2 | 25381 | Hopeless ! Less clear and understandable than ... | 1 |
| 3 | 64220 | If you are considering this specialization I w... | 1 |
| 4 | 52846 | Week 4 does not give enough explanation or ext... | 1 |


## Easy Data Augmentation

**EDA** proposes various simple augmentation techniques for transforming a text into an augmented version. We will use the `nltk` library for Easy Data Augmentation. To keep the notebook uncluttered, EDA is implemented in the `EasyDataAugmentation` class in `utils.py`, located in the same directory as the current notebook.

#### 1. Synonym Replacement

Randomly choose a word from a text and replace it with one of its synonyms.

In [3]:
eda = EasyDataAugmentation()

In [4]:
data = df['Review'].apply(eda.synonym_replacement)
data.head()

0                                     very big course.
1    Creativity without a reason, without a real pr...
2    Hopeless ! lupus erythematosus brighten and un...
3    If you are considering this specialisation I w...
4    Week 4 does not give enough explanation or ext...
Name: Review, dtype: object
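
To make the idea concrete, here is a minimal sketch of synonym replacement. It is not the code from `utils.py`: the `TOY_SYNONYMS` dictionary is a hypothetical stand-in for a WordNet lookup, and the function name and `rng` parameter are assumptions made for illustration. Punctuation handling is left out for brevity.

```python
import random

# Hypothetical stand-in for a WordNet lookup; not taken from utils.py.
TOY_SYNONYMS = {
    "bad": ["poor", "awful"],
    "course": ["class"],
}


def synonym_replacement(text, synonyms=TOY_SYNONYMS, rng=None):
    """Replace one randomly chosen word that has a known synonym."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    # Only words with at least one known synonym can be replaced.
    candidates = [i for i, w in enumerate(words) if w.lower() in synonyms]
    if not candidates:
        return text  # nothing to replace, return the text unchanged
    i = rng.choice(candidates)
    words[i] = rng.choice(synonyms[words[i].lower()])
    return " ".join(words)


print(synonym_replacement("Very bad course", rng=random.Random(0)))
```

Passing a seeded `random.Random` keeps the augmentation reproducible, which helps when comparing runs.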

#### 2. Random Insertion

Insert a synonym of a randomly selected word at a random position in the text.

In [5]:
data = df['Review'].apply(eda.random_insertion)
data.head()

0                Very notional bad speculative course.
1    Creativity without a reason, without a real pr...
2    Hopeless ! Less clear and understandable than ...
3    If you are considering this specialization I w...
4    Week 4 does not non give enough explanation or...
Name: Review, dtype: object
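
A rough sketch of random insertion follows, under the same assumptions as before: `TOY_SYNONYMS` is a hypothetical replacement for a WordNet lookup, and the function is illustrative rather than the implementation in `utils.py`.

```python
import random

# Hypothetical stand-in for a WordNet lookup; not taken from utils.py.
TOY_SYNONYMS = {
    "bad": ["poor", "awful"],
    "course": ["class"],
}


def random_insertion(text, synonyms=TOY_SYNONYMS, rng=None):
    """Insert a synonym of one randomly chosen word at a random position."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    # Pick the source word only among words that have a known synonym.
    candidates = [w for w in words if w.lower() in synonyms]
    if not candidates:
        return text
    synonym = rng.choice(synonyms[rng.choice(candidates).lower()])
    # randint is inclusive on both ends, so the synonym may land at the very end.
    position = rng.randint(0, len(words))
    words.insert(position, synonym)
    return " ".join(words)


print(random_insertion("Very bad course", rng=random.Random(0)))
```

The original words keep their order; only one extra token is added.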

#### 3. Random Deletion

Remove a word randomly from any position in the text.

In [6]:
data = df['Review'].apply(eda.random_deletion)
data.head()

0       [bad]
1        [be]
2        [of]
3    [Andrew]
4    [unable]
Name: Review, dtype: object
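
The output shown above contains single-word lists rather than full sentences, so the helper in `utils.py` may behave differently from the description. The sketch below follows the description instead: it drops one random word and returns the remaining text. The function name and `rng` parameter are assumptions for illustration.

```python
import random


def random_deletion(text, rng=None):
    """Remove one randomly chosen word and return the remaining text."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    if len(words) <= 1:
        return text  # never delete the only word
    del words[rng.randrange(len(words))]
    return " ".join(words)


print(random_deletion("Very bad course.", rng=random.Random(0)))
```

The EDA paper instead deletes each word independently with some probability; removing exactly one word is a simpler variant chosen here to match the one-line description.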

#### 4. Random Swap

Randomly swap any two words in the text.

In [7]:
data = df['Review'].apply(eda.random_swap)
data.head()

0                                     course. Very bad
1    Try without be reason, without a real problem/...
2    understandable ! clear Less and Hopeless than ...
3    If completely are considering this specializat...
4    Week 4 exercises. not give explanation enough ...
Name: Review, dtype: object
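
Random swap can be sketched in a few lines. As with the other examples, this is an illustrative version with assumed names, not the code from `utils.py`.

```python
import random


def random_swap(text, rng=None):
    """Swap two randomly chosen words at distinct positions."""
    if rng is None:
        rng = random.Random()
    words = text.split()
    if len(words) < 2:
        return text  # nothing to swap
    # sample() draws without replacement, so i and j always differ.
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


print(random_swap("Very bad course.", rng=random.Random(0)))
```

The result contains exactly the same words as the input, only in a different order.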

## Back Translation

A sentence is translated into another language, and the result is then translated back into the original language. The round trip usually changes the wording, so it yields a new sentence with roughly the same meaning. The resulting text is discarded if it is identical to the original text.
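
The round trip and the duplicate filter can be sketched as below. The translators are passed in as plain callables; `toy_to_french` and `toy_to_english` are hypothetical dictionary-based stand-ins so the example runs offline, and in practice they would wrap a translation model or service of your choice.

```python
def back_translate(texts, to_pivot, from_pivot):
    """Translate each text to a pivot language and back, dropping unchanged results."""
    augmented = []
    for text in texts:
        round_trip = from_pivot(to_pivot(text))
        # Keep only round trips that differ from the original text.
        if round_trip != text:
            augmented.append(round_trip)
    return augmented


# Hypothetical toy translators; a real setup would call a translation model.
TOY_EN_TO_FR = {"Bad course.": "Mauvais cours.", "Good.": "Bien."}
TOY_FR_TO_EN = {"Mauvais cours.": "Poor course.", "Bien.": "Good."}


def toy_to_french(text):
    return TOY_EN_TO_FR[text]


def toy_to_english(text):
    return TOY_FR_TO_EN[text]


print(back_translate(["Bad course.", "Good."], toy_to_french, toy_to_english))
# prints ['Poor course.']
```

Here "Good." survives the round trip unchanged, so it is filtered out and only the paraphrased sentence is kept.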

## References

- [A Survey of Data Augmentation Approaches for NLP](https://arxiv.org/pdf/2105.03075.pdf)
- [Data Augmentation in NLP: Best Practices From a Kaggle Master](https://neptune.ai/blog/data-augmentation-nlp)