# Data Augmentation Tutorial

You can use data augmentation by simply adding a processor to the pipeline. Take back translation as an example:

In [None]:
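from forte.data.multi_pack import MultiPack
from forte.pipeline import Pipeline
from forte.processors.base.data_augment_processor import ReplacementDataAugmentProcessor

# The data augmentation processor works on MultiPack inputs.
nlp = Pipeline[MultiPack]()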

# Configuration for the data augmentation processor.
processor_config = {
    'augment_entry': 'ft.onto.base_ontology.Token',
    'other_entry_policy': {
        'kwargs': {
            'ft.onto.base_ontology.Document': 'auto_align',
            'ft.onto.base_ontology.Sentence': 'auto_align'
        }
    },
    'type': 'data_augmentation_op',
    'data_aug_op': 'forte.processors.data_augment.algorithms.back_translation_op.BackTranslationOp',
    'data_aug_op_config': {
        'kwargs': {
            'model_to': 'forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator',
            'model_back': 'forte.processors.data_augment.algorithms.machine_translator.MarianMachineTranslator',
            'src_language': 'en',
            'tgt_language': 'fr',
            'device': 'cpu'
        }
    }
}

processor = ReplacementDataAugmentProcessor()
nlp.add(component=processor, configs=processor_config)

## Text Replacement Ops

Many data augmentation methods can be viewed as replacement-based approaches: given a piece of input text, they replace it with a new piece of augmented text. Back translation, for instance, translates the input into another language, translates it back, and replaces the input with the back-translated text.
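The round trip can be illustrated with a minimal sketch. The word-for-word lookup tables below are toy stand-ins for real translation models, and the function names are hypothetical; none of this is Forte's actual implementation.

```python
# Toy word-for-word "translators" standing in for real translation models.
EN_TO_FR = {"i": "je", "like": "aime", "cats": "chats"}
FR_TO_EN = {"je": "i", "aime": "love", "chats": "cats"}


def translate(text, table):
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(table.get(word, word) for word in text.split())


def back_translate(text):
    """Translate into the pivot language, then back into the source language."""
    pivot = translate(text, EN_TO_FR)
    return translate(pivot, FR_TO_EN)


original = "i like cats"
augmented = back_translate(original)
print(augmented)  # the round trip turns "like" into "love"
```

The augmented text differs from the input only where the two translation directions disagree, which is the source of variation that back translation relies on.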

These algorithms are wrapped as text replacement ops:

1. *`DictionaryReplacementOp`*:
    It uses dictionaries, such as WordNet, to replace the input word.
2. *`BackTranslationOp`*:
    It uses back translation to generate text that keeps the meaning of the input.
3. *`DistributionReplacementOp`*:
    It samples from a distribution to generate new text at the word level.
4. *`EmbeddingSimilarityOp`*:
    It uses pre-trained word embeddings, such as word2vec and GloVe, to replace the input word with another word that has a similar embedding.
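As a rough illustration of the embedding-based idea, the sketch below picks the nearest neighbour of a word by cosine similarity. The two-dimensional vectors and the `most_similar` helper are made up for this example; real word2vec or GloVe vectors have far more dimensions, and the actual op may select its replacement differently.

```python
import math

# Hypothetical 2-d embeddings, for illustration only.
EMBEDDINGS = {
    "good": (0.9, 0.1),
    "great": (0.85, 0.2),
    "bad": (-0.8, 0.1),
    "table": (0.0, 1.0),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the closest other word, or the word itself if it is unknown."""
    if word not in EMBEDDINGS:
        return word
    target = EMBEDDINGS[word]
    candidates = [w for w in EMBEDDINGS if w != word]
    return max(candidates, key=lambda w: cosine(target, EMBEDDINGS[w]))


print(most_similar("good"))
```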

The replacement ops live under *`'forte.processors.data_augment.algorithms'`*. To use one of them, set the value of *`'data_aug_op'`* in the configuration to the fully qualified name of the replacement op, and put the op's own configuration in the *`'kwargs'`* under the field *`'data_aug_op_config'`*. Please check the documentation for the specific configuration requirements of each replacement op.
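To make the notion of a fully qualified name concrete, here is a minimal sketch of how a dotted name can be resolved to a class with the standard library. The `load_class` helper is hypothetical, `collections.OrderedDict` is only a stand-in for a replacement op, and Forte's own loading logic and the way it passes the configuration to the op may differ.

```python
import importlib


def load_class(qualified_name):
    """Resolve a dotted 'package.module.ClassName' string to the class object."""
    module_name, _, class_name = qualified_name.rpartition(".")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


# Same shape as the processor configuration, with a stand-in class.
config = {
    "data_aug_op": "collections.OrderedDict",
    "data_aug_op_config": {"kwargs": {"a": 1}},
}

op_class = load_class(config["data_aug_op"])
op = op_class(**config["data_aug_op_config"]["kwargs"])
print(op_class.__name__, dict(op))
```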

## Replacement-based Processor

The processor *`'ReplacementDataAugmentProcessor'`* is responsible for managing the Forte data structures. Given a *`MultiPack`* as input, it calls the text replacement op, which implements a specific data augmentation algorithm, and then updates the Forte data structures automatically. The output is the original *`MultiPack`*, containing the original *`DataPack`*s and the new augmented *`DataPack`*s.

For example, given an input *`MultiPack`*:

    input(MultiPack){
        dp1(DataPack): {
            "I love NLP!"
        }
        dp2(DataPack): {
            "Forte makes NLP easier."
        }
    }

The output should be the same *`MultiPack`* with new *`DataPack`*s:

    output(MultiPack){
        dp1(DataPack): {
            "I love NLP!"
        }
        dp2(DataPack): {
            "Forte makes NLP easier."
        }
        dp1-aug(DataPack): {
            "I love Natural Language Processing!"
        }
        dp2-aug(DataPack): {
            "Forte makes Natural Language Processing easier."
        }
    }
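The same behaviour can be mimicked with plain dictionaries. This is only a sketch of the input/output relationship: `augment_multipack` and `expand_nlp` are hypothetical helpers, a `dict` stands in for the *`MultiPack`*, and the sketch returns a new dictionary whereas the processor adds the new packs to the original *`MultiPack`*.

```python
def augment_multipack(packs, replace_fn):
    """Return the original packs plus one augmented copy per pack."""
    output = dict(packs)  # the original packs are kept unchanged
    for name, text in packs.items():
        output[f"{name}-aug"] = replace_fn(text)
    return output


def expand_nlp(text):
    """Toy replacement op: expand the abbreviation 'NLP'."""
    return text.replace("NLP", "Natural Language Processing")


packs = {"dp1": "I love NLP!", "dp2": "Forte makes NLP easier."}
result = augment_multipack(packs, expand_nlp)
for name, text in result.items():
    print(f"{name}: {text}")
```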

The *`'augment_entry'`* defines the entry type that the processor will augment. It should be the fully qualified name of the entry class.

The *`'other_entry_policy'`* specifies, per entry type, the policy for entries other than the *`'augment_entry'`*. If the policy is *`'auto_align'`*, the span of the *`Annotation`* is automatically adjusted in the augmented text according to its original location.
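One way to picture span alignment is to shift each boundary by the net length change of every replacement that ends at or before it. The sketch below does this for two sentence spans; the helper names are hypothetical, and it assumes sorted, non-overlapping replacements that do not cut across annotation boundaries. It is not necessarily how Forte computes the new spans.

```python
def apply_replacements(text, replacements):
    """Apply sorted, non-overlapping (begin, end, new_text) replacements."""
    pieces, cursor = [], 0
    for begin, end, new_text in replacements:
        pieces.append(text[cursor:begin])
        pieces.append(new_text)
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def align_span(span, replacements):
    """Shift a (begin, end) span to its position in the augmented text."""
    def shift(offset):
        # Add the length change of every replacement ending at or before offset.
        return offset + sum(
            len(new_text) - (end - begin)
            for begin, end, new_text in replacements
            if end <= offset
        )
    return shift(span[0]), shift(span[1])


text = "I love NLP! Forte makes NLP easier."
long_form = "Natural Language Processing"
first = text.index("NLP")
second = text.index("NLP", first + 1)
replacements = [(first, first + 3, long_form), (second, second + 3, long_form)]

sentences = [(0, 11), (12, len(text))]  # original sentence spans
new_text = apply_replacements(text, replacements)
aligned = [align_span(span, replacements) for span in sentences]
for begin, end in aligned:
    print((begin, end), new_text[begin:end])
```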

The *`Link`*, *`MultiPackLink`*, *`Group`* and *`MultiPackGroup`* entries are automatically copied if the *`Annotation`*s they are attached to are present in the new *`DataPack`*.
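The copying rule can be sketched with plain tuples: a link is carried over only when both of its endpoints were copied. The identifiers, the tuple layout, and the `copy_links` helper below are illustrative assumptions, not Forte's data model.

```python
def copy_links(links, copied_annotations):
    """Copy (parent, child, label) links whose endpoints both exist in the new pack."""
    new_links = []
    for parent, child, label in links:
        if parent in copied_annotations and child in copied_annotations:
            new_links.append(
                (copied_annotations[parent], copied_annotations[child], label)
            )
    return new_links


# Maps original annotation ids to their copies in the augmented pack (hypothetical ids).
copied = {"tok1": "tok1-aug", "tok2": "tok2-aug"}
links = [("tok1", "tok2", "subject"), ("tok2", "tok3", "object")]

new_links = copy_links(links, copied)
print(new_links)  # the link that touches "tok3" is dropped
```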