# Data Augmentation Tutorial

You can use data augmentation by simply adding a processor to the pipeline. Take back translation as an example:

In [None]:
from forte.processors.data_augment import ReplacementDataAugmentProcessor
from forte.pipeline import Pipeline
from forte.data.multi_pack import MultiPack

nlp = Pipeline[MultiPack]()

# Configuration for the data augmentation processor.
processor_config = {
    'augment_entry': 'ft.onto.base_ontology.Token',
    'other_entry_policy': {
        'kwargs': {
            'ft.onto.base_ontology.Document': 'auto_align',
            'ft.onto.base_ontology.Sentence': 'auto_align'
        }
    },
    'type': 'data_augmentation_op',
    'data_aug_op': 'forte.processors.data_augment.algorithms.back_translation_op.BackTranslationOp',
    'data_aug_op_config': {
        'kwargs': {
            'model_to': 'forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator',
            'model_back': 'forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator',
            'src_language': 'en',
            'tgt_language': 'fr',
            'device': 'cpu'
        }
    }
}

processor = ReplacementDataAugmentProcessor()
nlp.add(component=processor, config=processor_config)

Here is another example, this time using typo-based data augmentation:

In [None]:
from forte.data.data_pack import DataPack
from ft.onto.base_ontology import Token
from forte.processors.data_augment.algorithms.typo_replacement_op import (
    TypoReplacementOp,
)

opr = TypoReplacementOp(
    configs={
        "prob": 0.6,
        'typo_generator': 'uniform',
        'dict_path': 'https://raw.githubusercontent.com/wanglec/temporaryJson/main/misspelling.json'
    }
)
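# Build a DataPack containing two tokens: "commonly" and "addressable".
data_pack = DataPack()
data_pack.set_text("commonly addressable")
token_1 = Token(data_pack, 0, 8)
token_2 = Token(data_pack, 9, 20)
data_pack.add_entry(token_1)
data_pack.add_entry(token_2)
# Apply the typo replacement op to each token and print the result.
print(opr.replace(token_1))
print(opr.replace(token_2))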

## Text Replacement Ops

Many data augmentation methods can be viewed as replacement-based approaches: given a piece of input text, we replace it with a new piece of augmented text. Back translation, for example, does this by translating the input into another language, translating it back, and replacing the input with the back-translated text.

We wrapped these algorithms as the text replacement ops:

1. *`DictionaryReplacementOp`*:
    It uses a dictionary, such as WordNet, to replace the input word.
2. *`BackTranslationOp`*:
    It uses back translation to generate data with the same semantic meanings.
3. *`DistributionReplacementOp`*:
    It samples from a distribution to generate new word-level text.
4. *`EmbeddingSimilarityOp`*:
    It leverages pre-trained word embeddings, such as `word2vec` and `GloVe`, to replace the input word with another word that has a similar embedding.
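
As a rough illustration of what a replacement op does, the sketch below swaps words using a small hand-written synonym table. The table, the function names, and the way `prob` is handled are assumptions made for this example; this is not Forte's `DictionaryReplacementOp`, which draws on a real dictionary.

```python
import random

# Hypothetical synonym table standing in for a real dictionary such as WordNet.
SYNONYMS = {"love": ["adore", "like"], "easier": ["simpler"]}


def dictionary_replace(word, rng, prob=1.0):
    """Return a synonym of `word` with probability `prob`, else `word` itself."""
    candidates = SYNONYMS.get(word.lower())
    # Keep the word if it has no known synonym or the random draw says "skip".
    if not candidates or rng.random() >= prob:
        return word
    return rng.choice(candidates)


def augment(text, rng, prob=1.0):
    """Apply the replacement to every whitespace-separated word."""
    return " ".join(dictionary_replace(w, rng, prob) for w in text.split())


rng = random.Random(0)
print(augment("I love NLP", rng))
```

Words without an entry in the table pass through unchanged, so only "love" can be replaced in this input.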

The replacement ops live under *`'forte.processors.data_augment.algorithms'`*. To use one of these algorithms, set the value of *`'data_aug_op'`* in the configuration to the fully qualified name of the replacement op, and put the op's own configuration in the *`'kwargs'`* under the field *`'data_aug_op_config'`*. Please check the documentation for the specific configuration requirements of each replacement op.
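
The configuration layout described above can be assembled with a small helper. `build_processor_config` is a hypothetical convenience function, not part of Forte; it simply mirrors the dictionary structure of the back-translation example and, as an assumption, reuses the typo op settings shown earlier as the op-specific `kwargs`.

```python
from typing import Any, Dict

ALGORITHMS_MODULE = "forte.processors.data_augment.algorithms"


def build_processor_config(op_path: str, op_kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: build a processor config for a replacement op.

    `op_path` is the op's path relative to the algorithms module,
    e.g. "back_translation_op.BackTranslationOp".
    """
    return {
        "augment_entry": "ft.onto.base_ontology.Token",
        "other_entry_policy": {
            "kwargs": {
                "ft.onto.base_ontology.Document": "auto_align",
                "ft.onto.base_ontology.Sentence": "auto_align",
            }
        },
        "type": "data_augmentation_op",
        # Fully qualified name of the replacement op.
        "data_aug_op": f"{ALGORITHMS_MODULE}.{op_path}",
        # Op-specific settings go under 'kwargs'.
        "data_aug_op_config": {"kwargs": dict(op_kwargs)},
    }


typo_config = build_processor_config(
    "typo_replacement_op.TypoReplacementOp",
    {"prob": 0.6, "typo_generator": "uniform"},
)
print(typo_config["data_aug_op"])
```

The resulting dictionary would be passed as `config` when adding the processor to the pipeline; check each op's documentation for the keys it actually expects.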

## Replacement-based Processor

The processor *`'ReplacementDataAugmentProcessor'`* is responsible for managing the Forte data structures. Given a *`MultiPack`* as input, it calls the text replacement op that implements a specific data augmentation algorithm, then updates the Forte data structures automatically. The output is the original *`MultiPack`*, containing the original *`DataPack`*s and the new augmented *`DataPack`*s.

For example, given an input *`MultiPack`*:

    input(MultiPack){
        dp1(DataPack): {
            "I love NLP!"
        }
        dp2(DataPack): {
            "Forte makes NLP easier."
        }
    }

The output should be the same *`MultiPack`* with new *`DataPack`*s:

    output(MultiPack){
        dp1(DataPack): {
            "I love NLP!"
        }
        dp2(DataPack): {
            "Forte makes NLP easier."
        }
        dp1-aug(DataPack): {
            "I love Natural Language Processing!"
        }
        dp2-aug(DataPack): {
            "Forte makes Natural Language Processing easier."
        }
    }

The *`'augment_entry'`* defines the entry type the processor will augment. It should be the fully qualified name of the entry class.

The *`'other_entry_policy'`* specifies the policies for entries other than the *`'augment_entry'`*. If the policy is *`'auto_align'`*, the span of the *`Annotation`* is automatically adjusted in the new text according to its original location.
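
To make the idea of span alignment concrete, here is a minimal sketch that shifts a span's boundaries by the length changes of the replacements that end before them. It is an illustration under simplifying assumptions (non-overlapping replacements, no span boundary falling strictly inside a replaced region) and is not Forte's actual alignment logic; the function names are hypothetical.

```python
from typing import List, Tuple

Replacement = Tuple[int, int, str]  # (begin, end, new_text) in original offsets


def apply_replacements(text: str, replacements: List[Replacement]) -> str:
    """Build the augmented text from non-overlapping replacements."""
    pieces, cursor = [], 0
    for begin, end, new_text in sorted(replacements):
        pieces.append(text[cursor:begin])  # untouched text before the span
        pieces.append(new_text)            # replacement text
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def align_span(span: Tuple[int, int], replacements: List[Replacement]) -> Tuple[int, int]:
    """Map a (begin, end) span from the original text to the augmented text."""

    def shift(pos: int) -> int:
        # Add the length change of every replacement that ends at or before pos.
        return pos + sum(
            len(new) - (end - begin)
            for begin, end, new in replacements
            if end <= pos
        )

    return shift(span[0]), shift(span[1])


text = "I love NLP!"
replacements = [(7, 10, "Natural Language Processing")]
augmented = apply_replacements(text, replacements)
sentence_span = align_span((0, len(text)), replacements)
print(augmented, sentence_span)
```

Here the sentence span grows to cover the longer augmented text, while a span that ends before the replacement, such as the one over "love", keeps its offsets.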

The *`Link`*, *`MultiPackLink`*, *`Group`* and *`MultiPackGroup`* entries are automatically copied if the *`Annotation`*s they are attached to are present in the new *`DataPack`*.

## Data Selector

The data selector pre-selects data from the dataset that is suitable for data augmentation tasks. We support random and query-based Elasticsearch data selectors, each of which yields a subset of the original `DataPack`s.

There are two steps to perform data selection:
1. Create an Elasticsearch indexer from your data.
2. Select data from the indexer according to one of the criteria: random or query-based.
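
The two steps can be sketched with an in-memory stand-in for the index. Everything below is hypothetical and for illustration only: a plain dictionary replaces Elasticsearch, and the query selector uses naive term overlap rather than a real search engine's scoring.

```python
import random
from typing import Dict, List


def build_index(docs: List[str]) -> Dict[int, str]:
    """Step 1 (stand-in): index documents by an integer id."""
    return dict(enumerate(docs))


def random_select(index: Dict[int, str], size: int, seed: int = 0) -> List[str]:
    """Step 2a: pick `size` documents uniformly at random."""
    rng = random.Random(seed)
    ids = rng.sample(sorted(index), k=min(size, len(index)))
    return [index[i] for i in ids]


def query_select(index: Dict[int, str], query: str, size: int) -> List[str]:
    """Step 2b: pick documents sharing the most terms with the query."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in index.items():
        score = len(terms & set(text.lower().split()))
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [index[doc_id] for _, doc_id in scored[:size]]


index = build_index(
    ["I love NLP", "Forte makes NLP easier", "Back translation augments text"]
)
print(random_select(index, size=2))
print(query_select(index, "NLP", size=5))
```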

For details on how to create an indexer and how to search for relevant documents, please refer to `examples/data_augmentation/data_select/README.md`.

## Reinforcement Learning-based Data Augmentation model

This model builds on the connection between supervised learning and reinforcement learning (RL), and adapts a reward learning algorithm from RL for joint data augmentation learning and model training. For details of the algorithm, please refer to the paper [Learning Data Manipulation for Augmentation and Weighting](https://arxiv.org/pdf/1910.12795.pdf).

This algorithm alternately updates the data augmentation model parameters phi and the downstream model parameters theta. Two classes are designed to perform this algorithm: `forte/models/da_rl/MetaAugmentationWrapper` and `forte/models/da_rl/MetaModel`.

`MetaAugmentationWrapper` wraps the data augmentation model, such as `BertForMaskedLM`, and provides the functions the algorithm needs to update the augmentation model parameters phi. `MetaModel` is a `torch.nn.Module` that copies a model so that its parameters can be updated locally. It is part of the procedure for updating phi.
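
The alternating update of phi and theta can be illustrated with a toy example in plain Python. This is a sketch under strong simplifying assumptions, not the Forte classes: theta is a single regression weight, phi is a list of per-example training weights (standing in for the augmentation model's parameters), and the gradients are written out by hand. All names here are hypothetical.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # (x, y)


def grad_theta(theta: float, x: float, y: float) -> float:
    """Derivative of the squared error (theta * x - y) ** 2 w.r.t. theta."""
    return 2.0 * (theta * x - y) * x


def alternating_step(
    theta: float,
    phi: List[float],
    train: List[Example],
    val: List[Example],
    lr: float = 0.05,
    meta_lr: float = 0.01,
) -> Tuple[float, List[float]]:
    per_example = [grad_theta(theta, x, y) for x, y in train]
    # 1. Virtual ("local") update of theta using the current phi.
    theta_virtual = theta - lr * sum(w * g for w, g in zip(phi, per_example))
    # 2. Update phi with the validation-loss gradient at theta_virtual.
    #    Chain rule: dL_val/dphi_i = dL_val/dtheta' * (-lr * g_i).
    val_grad = sum(grad_theta(theta_virtual, x, y) for x, y in val)
    new_phi = [
        max(0.0, w - meta_lr * val_grad * (-lr * g))  # clamp weights at zero
        for w, g in zip(phi, per_example)
    ]
    # 3. Real update of theta using the updated phi.
    new_theta = theta - lr * sum(w * g for w, g in zip(new_phi, per_example))
    return new_theta, new_phi


# Two clean examples of y = 2x and one mislabeled example.
train_data = [(1.0, 2.0), (2.0, 4.0), (1.0, -2.0)]
val_data = [(1.0, 2.0), (3.0, 6.0)]
theta, phi = 0.0, [1.0, 1.0, 1.0]
for _ in range(20):
    theta, phi = alternating_step(theta, phi, train_data, val_data)
print(theta, phi)
```

The virtual update in step 1 plays a role similar to the local parameter copy described for `MetaModel`: phi is judged by how a theta updated with it performs on held-out data.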

To see how to use these two classes to build the RL-based DA model, along with an example that applies this algorithm to text classification, please refer to `examples/data_augmentation/reinforcement/README.md`.