# Data augmentation in NLP

In [1]:
text = """Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine."""

General TODO:
- for each method, list scientific papers
- check with gensim to put word2vec models on the HF Hub

# 1 Lexical substitution
## 1.1 Thesaurus-based
### 1.1.1 Dictionary of synonyms
### 1.1.2 WordNet Thesaurus + (Others WordNet Thesaurus like → e.g. WoNef, a French WordNet)

TODO:
- add the others WordNet Thesaurus like available on http://globalwordnet.org/resources/wordnets-in-the-world/ (need to check that they are not in the nltk's wordnet) (no hurry at all)
- propose that the user can use their own dictionary (e.g homonyms, antonyms, abbreviations, etc.) (no hurry at all)

In [2]:
import pandas as pd
from datasets import load_dataset
import numpy as np
import re
import random
import nltk
# nltk.download('omw-1.4')
from nltk.corpus import wordnet
import itertools

In [3]:
def thesaurus_replacement(text, lang, method, num):
    """
    A given word in the input text is replaced by one of its synonyms available in a given thesaurus. 
    The thesaurus is chosen by the user. 
    In practice, for a given thesaurus, we check if the text contains words that can be replaced. 
    If so, we enter the pipeline, otherwise we return the text without modification.
    If we enter the pipeline, we check if the text language is managed by the method or not.
    If yes, we continue the pipeline, otherwise we return the list of languages managed by the given method.
    If we find one or more words in the text that can be replaced because they are present in a thesaurus, 
    we replace n words at random,n being chosen by the user.
    If n is greater than the number of replaceable words, we stop at the maximum number of replaceable words.
    If a word has several applicable synonyms, we take one at random.
    Finally, the final augmented text is returned via synonym replacement in a thesaurus.
    
    
    Parameters:
    - **text** (`str`):
        The text entered by the user.
    - **lang** (`str`):
        The text language to be indicated using the ISO 639-3 convention. For example ("fra" for "French"). The languages managed are :
       "als" (Tosk Albanian), "ara" (Arabic), "bul" (Bulgarian), "cat" (Catalan), "ces" (Czech), "cmn" (Mandarin Chinese), 
       "dan" (Danish), "deu" (German), "ell" (Greek), "eng" (English) (and for the "thesaurus" method: "eng_AU" (Australian English),
       "eng_GB" (British English), "eng_US" (American English), "eus" (Basque), "fin" (Finnish), "fra" (French), "glg" (Galician),
       "gle" (Irish), "gsw" (Swiss German), "heb" (Hebrew), "hun" (Hungarian), "hrv" (Croatian), "ind" (Indonesian), "isl" (Icelandic),
       "ita" (Italian), "jpn" (Japanese), "lit" (Lithuanian), "nld" (Dutch), "nno" (Nynorsk), "nob" (Bokmål), "pol" (Polish), 
       "por" (Portuguese), "ron" (Romanian), "rus" (Russian), "sin" (Sinhala), "slk" (Slovak), "slv" (Slovenian), "spa" (Spanish),
       "swe" (Swedish), "ukr" (Ukrainian), "tha" (Thai) and "zsm" (Malay).
       See **method** to know exactly which method is applicable for a given language.
    - **method** (`str`):
        The method applied. Two are available: "thesaurus" et "wordnet".
         - "thesaurus", we look for a synonym in one of the 28 OpenOffice's thesauruses. The "thesaurus" method supports 
        "arb" (Arabic),"bul" (Bulgarian),"cat" (Catalan),"ces" (Czech),"dan" (Danish),"deu" (German),"ell" (Greek),
        "eng_AU (Australian English)","eng_GB" (British English),"eng_US" (American English),"fra" (French),"glg" (Galician),
        "gle" (Irish),"gsw" (Swiss German),"hun" (Hungarian),"isl" (Icelandic),"ita" (Italian),"nno" (Nynorsk),"nob" (Bokmål),
        "pol" (Polish),"por" (Portuguese),"ron" (Romanian),"rus" (Russian),"sin" (Sinhala),"slk" (Slovak),"spa" (Spanish),
        "swe" (Swedish),"ukr" (Ukrainian).
        - "wordnet", we look for a synonym in the nltk's wordnet. The "wordnet" method supports.
        "als" (Tosk Albanian), "arb" (Arabic), "bul (Bulgarian)", "cat" (Catalan), "cmn" (Mandarin Chinese), "dan" (Danish), "ell" (Greek),
        "eng" (English), "eus" (Basque), "fin" (Finnish), "fra" (French), "glg" (Galician), "heb" (Hebrew), "hrv" (Croatian),
        "ind" (Indonesian), "isl" (Icelandic), "ita" (Italian), "ita_iwn" (??), "jpn" (Japanese), "lit" (Lithuanian), "nld" (Dutch),
        "nno" (Nynorsk),"nob" (Bokmål), "pol" (Polish), "por" (Portuguese), "ron" (Romanian), "slk" (Slovak), "slv" (Slovenian),
        "spa" (Spanish), "swe" (Swedish), "tha" (Thai), "zsm" (Malay)'.
        - Initiatives have been initiated by researchers to provide a wordnet in their language when it does not exist in nltk. 
        These non-nltk wordnets are available in the form "wordnet/wordnet_name". 
        For example "wordnet/wonef" to access to Wonef which is a wordnet in French. 
        The additional wordnets currently available are: "wordnet/wonef" (French wordnet).
    - **num** (`int`):
        The number of words to replace in the text. 
        If **num** is greater than the number of replaceable words, we stop at the maximum number of replaceable words.
    """

    if method == "thesaurus":
        if lang in ["ara","bul","cat","ces","dan","deu","ell","eng_AU","eng_GB","eng_US","fra","glg","gle","gsw","hun","isl","ita","nno","nob","pol","por","ron","rus","sin","slk","spa","swe","ukr"]: # thesaurus from OpenOffice. 
            # load thesaurus
            # pandas version
            thesaurus = pd.read_csv(f"thesaurus_{lang}.csv") # synonyms dictionary used in OpenOffice
            # datasets version (10 times slower)
#             thesaurus = load_dataset("fraug-library/thesaurus",data_files=f"thesaurus_{lang}.csv")["train"].to_pandas() # synonyms dictionary used in OpenOffice
            thesaurus = thesaurus[thesaurus['word'].apply(lambda x: len(x) >= 3)] # to reduce noise (e.g. without this, "ampere" replacing "a" appears in all words containing an "a")
            # check if there is at least one word in the text for which there is a synonym in the dictionary
            words = []
            for word in text.replace(",","").split():
                if word in list(thesaurus['word']):
                    words.append(word)
            if len(words) < num:
                num = len(words)
            if words != []: # check ok
                # select words to change
                list_word_to_change = random.sample(words, k=num)
                # replacement
                for word_to_change in list_word_to_change:
                    # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                    text = re.sub(rf'(^\w\s??)??{word_to_change}(^\w\s??)??',random.sample(thesaurus[thesaurus.word == word_to_change].synonyms.iloc[0].replace("[","").replace("]","").split(", "), 1)[0][1:-1], text) 
                return text
            else:
                return text # we return the text without augmentation    
        else:
            return 'The thesaurus method supports the following languages: "ara" (Arabic),"bul" (Bulgarian),"cat" (Catalan),"ces" (Czech),"dan" (Danish),"deu" (German),"ell" (Greek),"eng_AU" (Australian English),"eng_GB" (British English),"eng_US" (American English),"fra" (French),"glg" (Galician),"gle" (Irish),"gsw" (Swiss German),"hun" (Hungarian),"isl" (Icelandic),"ita" (Italian),"nno" (Norwegian Nynorsk),"nob" (Norwegian Bokmål),"pol" (Polish),"por" (Portuguese),"ron" (Romanian),"rus" (Russian),"sin" (Sinhala),"slk" (Slovak),"spa" (Spanish),"swe" (Swedish),"ukr" (Ukrainian)'
    
    
    if "wordnet" in method:
        if method == "wordnet":
            if lang in ['als', 'arb', 'bul', 'cat', 'cmn', 'dan', 'ell', 'eng', 'eus', 'fin', 'fra', 'glg', 'heb', 'hrv', 'ind', 'isl', 'ita', 'ita_iwn', 'jpn', 'lit', 'nld', 'nno', 'nob', 'pol', 'por', 'ron', 'slk', 'slv', 'spa', 'swe', 'tha', 'zsm']:
                word_text = text.split()
                # We collect the wordnet contained in nltk and clean it
                words2lemmanames = [{'word': word, 'synset':ss.name(), 'synonyms':ss.lemma_names(lang)}
                                    for word in word_text for ss in wordnet.synsets(word)]
                df_wordnet = pd.DataFrame(words2lemmanames)
                df_wordnet = df_wordnet[["word","synonyms"]][df_wordnet.word.str.len() > 2].reset_index(drop=True) # to reduce noise
                df_wordnet = df_wordnet[df_wordnet['synonyms'].map(lambda d: len(d)) > 0] # some line in wordnet contains empty lists that are not kept
                df_wordnet_clean = pd.DataFrame()
                df_wordnet_clean["word"] = list(df_wordnet["word"].unique())
                df_wordnet_clean["synonyms"] = list(df_wordnet["word"].unique())
                for idx in range(len(df_wordnet_clean)):
                    df_wordnet_clean["synonyms"][idx] = list(set(list(itertools.chain.from_iterable(df_wordnet["synonyms"][df_wordnet["word"] == df_wordnet_clean["word"][idx]]))))
                # end of cleaning
                if len(df_wordnet_clean) < num:
                    num = len(df_wordnet_clean)
                if not df_wordnet_clean.empty: # check ok
                    # select words to change
                    list_word_to_change = random.sample(list(df_wordnet_clean.word), k=num)
                    # replacement
                    for word_to_change in list_word_to_change:
                        synonyms = df_wordnet_clean[df_wordnet_clean.word == word_to_change].synonyms.iloc[0]
                        if word_to_change in synonyms:
                            synonyms.remove(word_to_change)
                        # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                        text = re.sub(rf'(^\w\s??)??{word_to_change}(^\w\s??)??',random.choice(synonyms), text)
                    return text
                else:
                    return text # we return the text without augmentation
            else:
                return 'The wordnet method supports the following languages: "als" (Tosk Albanian), "arb" (Arabic), "bul (Bulgarian)", "cat" (Catalan), "cmn" (Mandarin Chinese), "dan" (Danish), "ell" (Greek), "eng" (English), "eus" (Basque), "fin" (Finnish), "fra" (French), "glg" (Galician), "heb" (Hebrew), "hrv" (Croatian), "ind" (Indonesian), "isl" (Icelandic), "ita" (Italian), "ita_iwn" (??), "jpn" (Japanese), "lit" (Lithuanian), "nld" (Dutch), "nno" (Nynorsk),"nob" (Bokmål), "pol" (Polish), "por" (Portuguese), "ron" (Romanian), "slk" (Slovak), "slv" (Slovenian), "spa" (Spanish), "swe" (Swedish), "tha" (Thai), "zsm" (Malay)'


        if method == "wordnet/wonef":
            # load wonef
            # pandas version
            wonef = pd.read_csv("Wonef.csv")
            # datasets version
#             wonef = load_dataset("fraug-library/wonef")["train"].to_pandas()
            wonef = wonef[wonef['word'].apply(lambda x: len(str(x)) >= 3)] # to reduce noise
            # check if there is at least one word in the text for which there is a synonym in the dictionary
            words = []
            for word in text.replace(",","").split():
                if word in list(wonef['word']):
                    words.append(word)
            if len(words) < num:
                num = len(words)
            if words != []: # check ok
                # select words to change
                list_word_to_change = random.sample(words, k=num)
                # replacement
                for word_to_change in list_word_to_change:
                    # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                    text = re.sub(rf'(^\w\s??)??{word_to_change}(^\w\s??)??',random.choice(wonef[wonef.word == word_to_change].synonyms.iloc[0].split('|')), text)
                return text
            else:
                return text # we return the text without augmentation
        
        # Note: add all available languages on http://globalwordnet.org/resources/wordnets-in-the-world/ (need to check that they are not in the nltk's wordnet)
        
    else:
        return 'Invalid method, please choose either method="thesaurus" or method="wordnet" (possibly method="wordnet/adaptation_of_wordnet_in_a_given_language")'

In [4]:
%%time
# original text
print(text)

# new text
print(thesaurus_replacement(text,"fra","thesaurus",3))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste tricolore Alexis Lebrun, 19 temps, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de conclusion du tournoi WTT Champions de Macao, en Chine.
Wall time: 178 ms


In [5]:
%%time
# original text
print(text)

# new text
print(thesaurus_replacement(text,"fra","wordnet",3))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois amant Zhendong, numéro 1 mondial, en quarts de fermeture du tournoi WTT expert de Macao, en Chine.
Wall time: 2.28 s


In [6]:
%%time
# original text
print(text)

# new text
print(thesaurus_replacement(text,"fra","wordnet/wonef",3))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste française Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 planétaire, en quarts de finale du challenge WTT Champions de Macao, en Chine.
Wall time: 184 ms


## 1.2 Based on word embedding
### 1.2.1 Word2vec 
TODO:
- Models to add (on the Hub?):
    - https://raw.githubusercontent.com/wikipedia2vec/wikipedia2vec/b2712cfed5f266147d2358ee531da678601ada46/docs/pretrained.md
    - http://vectors.nlpl.eu/repository/
- have a look at https://github.com/oborchers/Fast_Sentence_Embeddings to speed up the process (it takes about 1s once the model is downloaded)?

### 1.2.2 FastText
- find a way to make it run much faster, 30s for an inference is unusable in practice

In [7]:
import fasttext
import random
import re
from huggingface_hub import hf_hub_download
from gensim.models import KeyedVectors

In [8]:
def word_embedding_replacement(text, lang, method, num, model_name="defaut"):
    """
    A given word in the input text is replaced by one of its synonyms available in a word embedding method. The method is chosen by the user. 
    The "word2vec" method is based on gensim. 
    For a pre-trained word embedding model, we check if the text contains words that can be replaced. 
    If so, we enter the pipeline, otherwise we return the text without modification.
    If we enter the pipeline, we check if the text language is managed by the method or not.
    If yes, we continue the pipeline, otherwise we return the list of languages managed by the given method.
    We propose a default model for the identified language if the user does not provide a model. Otherwise, we take the model specified by the user.
    Next, if we find one or more words in the text that can be replaced because they are present in the vocabulary of the word embedding model,
    we replace n words at random, n being chosen by the user.
    If n is greater than the number of replaceable words, we stop at the maximum number of replaceable words.
    For a given word, we take the word that is most similar using gensim's `most_similar` function.
    Finally, the final augmented text is returned via synonym replacement in a word2vec.
    
    The "fasttext" method is based on fasttext.
    For this method, there is no need to check whether the word is available in a vocabulary or not.
    We simply choose n words at random from the sentence and replace them with the `get_nearest_neighbors` function from fasttext.

    
    Parameters:
    - **text** (`str`):
        The text entered by the user.
    - **lang** (`str`):
        The text language to be indicated using the ISO 639-3 convention. For example ("fra" for "French"). The languages managed are :
        "afr" (Afrikaans), "als" (Tosk Albanian),"amh" (Amharic), "arg" (Aragonese), "ara" (Standard Arabic), "arz" (Egyptian Arabic),
        "asm" (Assamese), "ast" (Asturian), "aze" (Azerbaijani), "azb" (South Azerbaijani), "bak" (Bashkir), "bar" (Bavarian), 
        "bcl" (Central Bikol), "bel" (Belarusian), "bul" (Bulgarian), "bih" (Bihari languages), "ben" (Bengali), "bod" (Tibetan),
        "bpy" (Bishnupriya), "bre" (Breton), "bos" (Bosnian), "cat" (Catalan), "che" (Chechen), "ceb" (Cebuano), "ckb" (Central Kurdish),
        "cos" (Corsican), "cze" (Czech), "chv" (Chuvash), "wel" (Welsh), "dan" (Danish), "ger" (German), "diq" (Dimli), "div" (Dhivehi),
        "gre" (Greek), "eml" (), "eng" (English), "epo" (Esperanto), "spa" (Spanish), "est" (Estonian), "baq" (Basque), "fas" (Persian),
        "fin" (Finnish), "fra" (French), "frr" (Northern Frisian), "fry" (Western Frisian), "gle" (Irish), "gla" (Scottish Gaelic),
        "glg" (Galician), "gom" (Goan Konkani), "guj" (Gujarati), "glv" (Manx), "heb" (Hebrew), "hin" (Hindi), "hif" (Fiji Hindi),
        "hrv" (Croatian), "hsb" (Upper Sorbian), "hun" (Hungarian), "arn" (Mapudungun), "ina" (Interlingua), "ind" (Indonesian),
        "ilo" (Iloko), "ido" (Ido), "ice" (Icelandic), "ita" (Italian), "jpn" (Japanese), "jav" (Javanese), "geo" (Georgian),
        "kaz" (Kazakh), "khm" (Khmer), "kan" (Kannada), "kur" (Kurdish), "kir" (Kirghiz), "lat" (Latin), "ltz" (Luxembourgish),
        "lim" (Limburgan), "lmo" (Lombard), "lit" (Lithuanian), "lav" (Latvian), "mai" (Maithili), "mlg" (Malagasy),
        "mhr" (Eastern Mari), "min" (Minangkabau), "mac" (Macedonian), "mal" (Malayalam), "mon" (Mongolian), "mar" (Marathi),
        "mrj" (Western Mari), "may" (Malay), "mlt" (Maltese), "mwf" (Murrinh-Patha), "mya" (Burmese), "myv" (Erzya),
        "mzn" (Mazanderani), "nah" (), "nap" (Neapolitan), "nds" (Low German), "nep" (Nepali), "new" (Newari), "dut" (Dutch),
        "nno" (Norwegian Nynorsk), "nor" (Norwegian), "nso" (Pedi), "oci" (Occitan), "ori" (Oriya), "oss" (Ossetian),
        "pan" (Panjabi), "pam" (Pampanga), "pfl" (Pfaelzisch), "pol" (Polish), "pms" (Piemontese), "pnb" (Western Panjabi),
        "pus" (Pushto), "por" (Portuguese), "que" (Quechua), "run" (Rundi), "rum" (Romanian), "rus" (Russian), "san" (Sanskrit),
        "sah" (Yakut), "srd" (Sardinian), "scn" (Sicilian), "sco" (Scots), "snd" (Sindhi), "sbs" (Subiya), "sin" (Sinhala), 
        "slk" (Slovak), "slv" (Slovenian), "som" (Somali), "sqi" (Albanian), "srp" (Serbian), "sun" (Sundanese), "swe" (Swedish),
        "swa" (Swahili), "tam" (Tamil), "tel" (Telugu), "tgk" (Tajik), "tha" (Thai), "tuk" (Turkmen), "tgl" (Filipino), 
        "tur" (Turkish), "tat" (Tatar), "uig" (Uighur), "ukr" (Ukrainian), "urd" (Urdu), "uzb" (Uzbek), "vec" (Venetian), 
        "vie" (Vietnamese), "vls" (Vlaams), "vol" (Volapük), "wln" (Walloon), "war" (Waray), "xmf" (Mingrelian), "yid" (Yiddish),
        "yor" (Yoruba), "zea" (Zeeuws), "zho" (Chinese)'.
        See **method** to know exactly which method is applicable for a given language.
    - **method** (`str`):
        The method applied. Two are available: "word2vec" et "fasttext". 
        - If "word2vec", we look for a synonym in a word2vec model based on gensim. The "word2vec" method supports:
        "eng" (English) and "fra" (French). 
        - If "fasttext", we look for a synonym in a word2vec model based on fasttext. The "fasttext" method supports:
        "afr" (Afrikaans), "als" (Tosk Albanian),"amh" (Amharic), "arg" (Aragonese), "ara" (Standard Arabic), "arz" (Egyptian Arabic),
        "asm" (Assamese), "ast" (Asturian), "aze" (Azerbaijani), "azb" (South Azerbaijani), "bak" (Bashkir), "bar" (Bavarian), 
        "bcl" (Central Bikol), "bel" (Belarusian), "bul" (Bulgarian), "bih" (Bihari languages), "ben" (Bengali), "bod" (Tibetan),
        "bpy" (Bishnupriya), "bre" (Breton), "bos" (Bosnian), "cat" (Catalan), "che" (Chechen), "ceb" (Cebuano), "ckb" (Central Kurdish),
        "cos" (Corsican), "cze" (Czech), "chv" (Chuvash), "wel" (Welsh), "dan" (Danish), "ger" (German), "diq" (Dimli), "div" (Dhivehi),
        "gre" (Greek), "eml" (), "eng" (English), "epo" (Esperanto), "spa" (Spanish), "est" (Estonian), "baq" (Basque), "fas" (Persian),
        "fin" (Finnish), "fra" (French), "frr" (Northern Frisian), "fry" (Western Frisian), "gle" (Irish), "gla" (Scottish Gaelic),
        "glg" (Galician), "gom" (Goan Konkani), "guj" (Gujarati), "glv" (Manx), "heb" (Hebrew), "hin" (Hindi), "hif" (Fiji Hindi),
        "hrv" (Croatian), "hsb" (Upper Sorbian), "hun" (Hungarian), "arn" (Mapudungun), "ina" (Interlingua), "ind" (Indonesian),
        "ilo" (Iloko), "ido" (Ido), "ice" (Icelandic), "ita" (Italian), "jpn" (Japanese), "jav" (Javanese), "geo" (Georgian),
        "kaz" (Kazakh), "khm" (Khmer), "kan" (Kannada), "kur" (Kurdish), "kir" (Kirghiz), "lat" (Latin), "ltz" (Luxembourgish),
        "lim" (Limburgan), "lmo" (Lombard), "lit" (Lithuanian), "lav" (Latvian), "mai" (Maithili), "mlg" (Malagasy),
        "mhr" (Eastern Mari), "min" (Minangkabau), "mac" (Macedonian), "mal" (Malayalam), "mon" (Mongolian), "mar" (Marathi),
        "mrj" (Western Mari), "may" (Malay), "mlt" (Maltese), "mwf" (Murrinh-Patha), "mya" (Burmese), "myv" (Erzya),
        "mzn" (Mazanderani), "nah" (), "nap" (Neapolitan), "nds" (Low German), "nep" (Nepali), "new" (Newari), "dut" (Dutch),
        "nno" (Norwegian Nynorsk), "nor" (Norwegian), "nso" (Pedi), "oci" (Occitan), "ori" (Oriya), "oss" (Ossetian),
        "pan" (Panjabi), "pam" (Pampanga), "pfl" (Pfaelzisch), "pol" (Polish), "pms" (Piemontese), "pnb" (Western Panjabi),
        "pus" (Pushto), "por" (Portuguese), "que" (Quechua), "run" (Rundi), "rum" (Romanian), "rus" (Russian), "san" (Sanskrit),
        "sah" (Yakut), "srd" (Sardinian), "scn" (Sicilian), "sco" (Scots), "snd" (Sindhi), "sbs" (Subiya), "sin" (Sinhala), 
        "slk" (Slovak), "slv" (Slovenian), "som" (Somali), "sqi" (Albanian), "srp" (Serbian), "sun" (Sundanese), "swe" (Swedish),
        "swa" (Swahili), "tam" (Tamil), "tel" (Telugu), "tgk" (Tajik), "tha" (Thai), "tuk" (Turkmen), "tgl" (Filipino), 
        "tur" (Turkish), "tat" (Tatar), "uig" (Uighur), "ukr" (Ukrainian), "urd" (Urdu), "uzb" (Uzbek), "vec" (Venetian), 
        "vie" (Vietnamese), "vls" (Vlaams), "vol" (Volapük), "wln" (Walloon), "war" (Waray), "xmf" (Mingrelian), "yid" (Yiddish),
        "yor" (Yoruba), "zea" (Zeeuws), "zho" (Chinese)'.  
    - **num** (`int`):
        The number of words to replace in the text. 
        If **num** is greater than the number of replaceable words, we stop at the maximum number of replaceable words.
    - **model_name** (`str`, *optional*, defaults to `"default"`):
        Only used for `method=="word2vec"`. 
        If `"default"`, a default model is used for a given language. Otherwise, we let the user specify the path to their model (can be local or hosted online).
    """
    
    if method=="word2vec":
        if lang in ["fra","eng"]:
            if (model_name=="defaut") and (lang=="fra"):
                try: 
                    model = KeyedVectors.load_word2vec_format(f"frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin", binary=True, unicode_errors="ignore")
                    pass
                except:
                    model = KeyedVectors.load_word2vec_format(f"https://embeddings.net/embeddings/frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin", binary=True, unicode_errors="ignore")
                    model.save_word2vec_format('frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin', binary=True) 
                    # maybe not the best model (I just took the smallest frWaC model), it should be tested
                    pass
            if (model_name=="defaut") and (lang=="eng"):
                try: 
                    model = KeyedVectors.load_word2vec_format(f"frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut200.bin", binary=True, unicode_errors="ignore")
                    pass
                except:
                    model = KeyedVectors.load_word2vec_format(f"https://embeddings.net/embeddings/frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut200.bin", binary=True, unicode_errors="ignore")
                    model.save_word2vec_format('frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut200.bin', binary=True) 
                    # maybe not the best model (I just took the smallest frWaC model), it should be tested
                    pass
            if (model_name!="defaut"):
                model = KeyedVectors.load_word2vec_format(f"{model_name}.bin", binary=True, unicode_errors="ignore")
                # it would be nice here to be able to import models from the HF Hub, we will have to lobby the gensim teams
                pass

            
            vocab = model.vocab.keys()
            # check if there is at least one word in the text for which there is a synonym in the dictionary
            words = []
            for word in text.replace(",","").split():
                if (word in vocab) and len(word)>3:
                    words.append(word)
            if len(words) < num:
                num = len(words)
            if words != []: # check ok
                # select words to change
                list_word_to_change = random.sample(words, k=num)
                # replacement
                for word_to_change in list_word_to_change:
                    # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                    text = re.sub(rf'(^\w\s??)??{word_to_change}(^\w\s??)??',model.most_similar(word_to_change)[0][0], text) 
                    return text
            else:
                return text # we return the text without augmentation

        else:
            return 'The "word2vec" method only supports "fra" (French) and "eng" (English).'
            
                
    if method=="fasttext":
        # check lang
        # The following two lists are used to manage the conversion iso 639-1 to iso 639-3 
        list_lang = ['afr','als','amh','arg','ara','arz','asm','ast','aze','azb','bak','bar','bcl','bel','bul','bih','ben','bod','bpy','bre','bos','cat','che','ceb','ckb','cos','cze','chv','wel','dan','ger','diq','div','gre','eml','eng','epo','spa','est','baq','fas','fin','fra','frr','fry','gle','gla','glg','gom','guj','glv','heb','hin','hif','hrv','hsb','hun','arn','ina','ind','ilo','ido','ice','ita','jpn','jav','geo','kaz','khm','kan','kur','kir','lat','ltz','lim','lmo','lit','lav','mai','mlg','mhr','min','mac','mal','mon','mar','mrj','may','mlt','mwf','mya','myv','mzn','nah','nap','nds','nep','new','dut','nno','nor','nso','oci','ori','oss','pan','pam','pfl','pol','pms','pnb','pus','por','que','run','rum','rus','san','sah','srd','scn','sco','snd','sbs','sin','slk','slv','som','sqi','srp','sun','swe','swa','tam','tel','tgk','tha','tuk','tgl','tur','tat','uig','ukr','urd','uzb','vec','vie','vls','vol','wln','war','xmf','yid','yor','zea','zho']
        list_fasttext = ['af','als','am','an','ar','arz','as','ast','az','azb','ba','bar','bcl','be','bg','bh','bn','bo','bpy','br','bs','ca','ce','ceb','ckb','co','cs','cv','cy','da','de','diq','dv','el','eml','en','eo','es','et','eu','fa','fi','fr','frr','fy','ga','gd','gl','gom','gu','gv','he','hi','hif','hr','hsb','hu','hy','ia','id','ilo','io','is','it','ja','jv','ka','kk','km','kn','ku','ky','la','lb','li','lmo','lt','lv','mai','mg','mhr','min','mk','ml','mn','mr','mrj','ms','mt','mwf','my','myv','mzn','nah','nap','nds','ne','new','nl','nn','no','nso','oc','or','os','pa','pam','pfl','pl','pms','pnb','ps','pt','qu','rm','ro','ru','sa','sah','sc','scn','sco','sd','sh','si','sk','sl','so','sq','sr','su','sv','sw','ta','te','tg','th','tk','tl','tr','tt','ug','uk','ur','uz','vec','vi','vls','vo','wa','war','xmf','yi','yo','zea','zh']
        if lang in list_lang:
            lang = list_fasttext[list_lang.index(lang)]
            try:
                model = fasttext.load_model(f"cc.{lang}.bin")
                pass
            except:
                model = fasttext.load_model(hf_hub_download(repo_id=f"facebook/fasttext-{lang}-vectors", filename="model.bin"))
                model.save_model(f"cc.{lang}.bin")
                pass
        else:
            return lang,' is not in the list of languages managed, namely: "afr" (Afrikaans), "als" (Tosk Albanian),"amh" (Amharic), "arg" (Aragonese), "ara" (Standard Arabic), "arz" (Egyptian Arabic), "asm" (Assamese), "ast" (Asturian), "aze" (Azerbaijani), "azb" (South Azerbaijani), "bak" (Bashkir), "bar" (Bavarian), "bcl" (Central Bikol), "bel" (Belarusian), "bul" (Bulgarian), "bih" (Bihari languages), "ben" (Bengali), "bod" (Tibetan), "bpy" (Bishnupriya), "bre" (Breton), "bos" (Bosnian), "cat" (Catalan), "che" (Chechen), "ceb" (Cebuano), "ckb" (Central Kurdish), "cos" (Corsican), "cze" (Czech), "chv" (Chuvash), "wel" (Welsh), "dan" (Danish), "ger" (German), "diq" (Dimli), "div" (Dhivehi), "gre" (Greek), "eml" (), "eng" (English), "epo" (Esperanto), "spa" (Spanish), "est" (Estonian), "baq" (Basque), "fas" (Persian), "fin" (Finnish), "fra" (French), "frr" (Northern Frisian), "fry" (Western Frisian), "gle" (Irish), "gla" (Scottish Gaelic), "glg" (Galician), "gom" (Goan Konkani), "guj" (Gujarati), "glv" (Manx), "heb" (Hebrew), "hin" (Hindi), "hif" (Fiji Hindi), "hrv" (Croatian), "hsb" (Upper Sorbian), "hun" (Hungarian), "arn" (Mapudungun), "ina" (Interlingua), "ind" (Indonesian), "ilo" (Iloko), "ido" (Ido), "ice" (Icelandic), "ita" (Italian), "jpn" (Japanese), "jav" (Javanese), "geo" (Georgian), "kaz" (Kazakh), "khm" (Khmer), "kan" (Kannada), "kur" (Kurdish), "kir" (Kirghiz), "lat" (Latin), "ltz" (Luxembourgish), "lim" (Limburgan), "lmo" (Lombard), "lit" (Lithuanian), "lav" (Latvian), "mai" (Maithili), "mlg" (Malagasy), "mhr" (Eastern Mari), "min" (Minangkabau), "mac" (Macedonian), "mal" (Malayalam), "mon" (Mongolian), "mar" (Marathi), "mrj" (Western Mari), "may" (Malay), "mlt" (Maltese), "mwf" (Murrinh-Patha), "mya" (Burmese), "myv" (Erzya), "mzn" (Mazanderani), "nah" (), "nap" (Neapolitan), "nds" (Low German), "nep" (Nepali), "new" (Newari), "dut" (Dutch), "nno" (Norwegian Nynorsk), "nor" (Norwegian), "nso" (Pedi), "oci" (Occitan), "ori" (Oriya), "oss" (Ossetian), "pan" (Panjabi), "pam" (Pampanga), "pfl" (Pfaelzisch), "pol" (Polish), "pms" (Piemontese), "pnb" (Western Panjabi), "pus" (Pushto), "por" (Portuguese), "que" (Quechua), "run" (Rundi), "rum" (Romanian), "rus" (Russian), "san" (Sanskrit), "sah" (Yakut), "srd" (Sardinian), "scn" (Sicilian), "sco" (Scots), "snd" (Sindhi), "sbs" (Subiya), "sin" (Sinhala), "slk" (Slovak), "slv" (Slovenian), "som" (Somali), "sqi" (Albanian), "srp" (Serbian), "sun" (Sundanese), "swe" (Swedish), "swa" (Swahili), "tam" (Tamil), "tel" (Telugu), "tgk" (Tajik), "tha" (Thai), "tuk" (Turkmen), "tgl" (Filipino), "tur" (Turkish), "tat" (Tatar), "uig" (Uighur), "ukr" (Ukrainian), "urd" (Urdu), "uzb" (Uzbek), "vec" (Venetian), "vie" (Vietnamese), "vls" (Vlaams), "vol" (Volapük), "wln" (Walloon), "war" (Waray), "xmf" (Mingrelian), "yid" (Yiddish), "yor" (Yoruba), "zea" (Zeeuws), "zho" (Chinese)'
        # select words to change
        list_word_to_change = random.sample(text.replace(",","").split(), k=num)
        # replacement
        for word_to_change in list_word_to_change:
            # regex to manage punctuation around the word to be changed: brackets, commas, etc.
            text = text = re.sub(rf'(^\w\s??)??{word_to_change}(^\w\s??)??',model.get_nearest_neighbors(word_to_change)[0][1], text) 
        return text
    
    
    else:
        return 'Invalid method, please choose either method="word2vec" or method="fasttext".'

In [9]:
%%time
# original text
print(text)
# new text
print(word_embedding_replacement(text, "fra", "word2vec", 3))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournois WTT Champions de Macao, en Chine.
Wall time: 1.59 s


In [10]:
%%time
# original text
print(text)
# new text
print(word_embedding_replacement(text, "fra", "fasttext", 3)) # Very very long

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.




Le pongistes français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de demi-finale du Tournoi WTT Champions de Macao, en Chine.
Wall time: 35.2 s


Which model to use?

The 20 models of https://fauconnier.github.io/#data :  
12 trained on frWaC (https://wacky.sslmit.unibo.it/doku.php?id=corpora): a 1.6 billion word corpus built from the Web by limiting the exploration to the .fr domain and using medium frequency words from the Le Monde Diplomatique corpus and core French vocabulary lists as seeds. The dataset dates from 2008.
- "frWac_non_lem_no_postag_no_phrase_200_cbow_cut0.bin" # (2.7Gb)  
- "frWac_non_lem_no_postag_no_phrase_200_cbow_cut100.bin" # (120Mb)  
- "frWac_non_lem_no_postag_no_phrase_200_skip_cut100.bin" # (120Mb)  
- "frWac_non_lem_no_postag_no_phrase_500_skip_cut100.bin" # (298Mb)  
- "frWac_non_lem_no_postag_no_phrase_500_skip_cut200.bin" # (202Mb)  
- "frWac_no_postag_no_phrase_500_cbow_cut100.bin" # (229Mb)  
- "frWac_no_postag_no_phrase_500_skip_cut100.bin" # (229Mb)  
- "frWac_no_postag_no_phrase_700_skip_cut50.bin" # (494Mb)  
- "frWac_postag_no_phrase_700_skip_cut50.bin" # (577Mb)  
- "frWac_postag_no_phrase_1000_skip_cut100.bin" # (520Mb)  
- "frWac_no_postag_phrase_500_cbow_cut10.bin" # (2Gb)  
- "frWac_no_postag_phrase_500_cbow_cut100.bin" # (289Mb)  

8 trained on FrWiki dump (https://dumps.wikimedia.org/frwiki/frwiki/), 600 million words. The dataset is from 2010.
- "frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut100.bin" # (253Mb)  
- "frWiki_no_lem_no_postag_no_phrase_1000_cbow_cut200.bin" # (195Mb)  
- "frWiki_no_lem_no_postag_no_phrase_1000_skip_cut100.bin" # (253Mb)  
- "frWiki_no_lem_no_postag_no_phrase_1000_skip_cut200.bin" # (195Mb)  
- "frWiki_no_phrase_no_postag_500_cbow_cut10.bin" # (128Mb)  
- "frWiki_no_phrase_no_postag_700_cbow_cut100.bin" # (106Mb)  
- "frWiki_no_phrase_no_postag_1000_skip_cut100.bin" # (151Mb)  
- "frWiki_no_phrase_no_postag_1000_skip_cut200.bin" # (121Mb)  


We can also mention:
- The 2 (a general, a politician) proposed by a team of Polytechnique: http://nlp.polytechnique.fr/word2vec#french trained on 33Go of texts. Seems to date from 2021 (= the most recent). The resources are available for academic/non profit use by sending a request by email to mvazirg~lix.polytechnique.fr.
- A trained model on Wikipedia 2018 lemmatized : https://zenodo.org/record/3241447#.ZD_iTs7P1PY

Remember to look at: https://flairnlp.github.io/docs/category/tutorial-3-embeddings to see if we can add flair models in addition to gensim ones

## 1.3 Based on a masked language model (BERT like)

TODO:
- Check if there is a bug (sometimes the `random` method return `PipelineException: No mask_token (<mask>) found on the input`) but unable to reproduce it.
- Propose transformers models in addition to flair models (e.g. for French: `pipeline(model="qanastek/pos-french-camembert", aggregation_strategy="simple"`)(not hurry)

In [11]:
from transformers import pipeline
from flair.data import Sentence
from flair.models import SequenceTagger
import random
import re

In [12]:
def mlm_replacement(text, lang, method, num, top_k=3, pos_tag=['NUM','ADJ','VERB'], automodel=""):
    """
    A given word in the input text is replaced by one generated by a masked language model that the user can choose. 
    The user can choose whether the replaced words are chosen at random or by using a POS model.
    For the POS model, we replace n words among those found by the model and belonging to a very specific category of tags that the user can choose 
    (by default pos_tag=['NUM','ADJ','VERB']).
    
    Parameters:
    - **text** (`str`):
        The text entered by the user
    - **lang** (`str`):
        The text language to be indicated using the ISO 639-3 convention to use the default models.
        For example ("fra" for "French"). The languages managed by default are : "eng" (English) and "fra" (French).
        All other languages supported are those for which a fill-mask model is available on the Hugging Face Hub
        https://huggingface.co/models?pipeline_tag=fill-mask.
    - **method** (`str`):
        The method applied. Two are available: "random" et "pos" to choose which words will be replaced. 
        Puis le modèle de l'utilisateur est appliqué (see **model** below).
        - "random", we randomly choose words that will be passed on dans the Hugging Face fill-mask pipeline. 
        The "random" method supports any written language in the world.
        - "pos", we choose words that will be passed on dans the Hugging Face fill-mask pipeline via a flair POS model. 
        The "pos" method supports "ces" (Czech), "dan" (Danish), "deu" (German), "eng" (English), "fin" (Finnish),
        "fra" (French),  "ita" (Italian), "nld" (Dutch), "nor" (Norwegian), "pol" (Polish), "spa" (Spanish) and "swe" (Swedish).
    - **num** (`int`):
        The number of words to replace in the text.
    - **top_k** (`int`, *optional*, defaults to `3`):
        The number of proposals that the Hugging Face fill-mask pipeline must return in order to make a selection from them.
    - **pos_tag** (`list`, *optional*, defaults to `"['NUM','ADJ','VERB']"`):
        The user can choose among ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM', 'PART', 'PRON', 'PROPN', 
        'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X'] the type of words he wants to change in the sentence. 
        Meaning of the list : "ADJ" (adjective), "ADP" (adposition), "ADV" (adverb), "AUX" (auxiliary),
        "CCONJ" (coordinating conjunction), "DET" (determiner), "INTJ" (interjection), "NOUN" (noun),
        "NUM" (numeral), "PART" (particle), "PRON" (pronoun), "PROPN" (proper noun), "PUNCT" (punctuation),
        "SCONJ" (subordinating conjunction), "SYM" (symbol), "VERB" (verb) and "X" (other).
     - **automodel** (`str`, *optional*, defaults to `""`):
         If not filled in, use a default model, otherwise use the model indicated by the user, which must be a masked language model
         among those proposed on the Hugging Face Hub (https://huggingface.co/models?pipeline_tag=fill-mask).
         The languages managed by default are : "eng" (English) and "fra" (French).
         All other languages supported are those for which a fill-mask model is available on the Hugging Face Hub.
    """
    
    # by modifying a random word
    if method=="random":
        # select words to change
        list_word_to_change = random.sample(text.replace(",","").split(), k=num)   
        pass
        
    # POS
    if method=="pos":
        # load the model
        model = SequenceTagger.load('pos-multi') #"qanastek/pos-french"
        sentence = Sentence(text)
        # predict tags
        model.predict(sentence)
        # creation of a dictionary with a word as key and its value as POS tag
        list_token = []
        list_pos = []
        for label in sentence.get_labels():
            list_token.append(label.data_point.text)
            list_pos.append(label.value)
        dict_POS = [{ "text": t, "label": l } for t, l in zip(list_token, list_pos)]
        # we keep only certain words present in the selected POS tags (by default ['NUM', 'ADJ', 'VERB']) and then change only these types of words
        words = []
        for idx in range(len(dict_POS)):
            if dict_POS[idx]["label"] in pos_tag:
                words.append(dict_POS[idx]["text"])
                
        if len(words) < num:
            num = len(words)
        if words != []: # check ok
            # select words to change
            list_word_to_change = random.sample(words, k=num)
        pass
    
    if  automodel=="":
        # defaut pipeline for a given language if auto==False
        if lang =="fra":
            fill_mask = pipeline("fill-mask", model="camembert-base", tokenizer="camembert-base", top_k=top_k)
            pass
        if lang =="eng":
            fill_mask = pipeline("fill-mask", model="olm/olm-roberta-base-dec-2022", tokenizer="olm/olm-roberta-base-dec-2022", top_k=top_k)
            pass
        
    if  automodel!="":
        fill_mask = pipeline("fill-mask", model=automodel, tokenizer=automodel, top_k=top_k)
        pass
        
    
    # replacement 
    for word_to_change in list_word_to_change:
        # regex to manage punctuation around the word to be changed: brackets, commas, etc.
        text = re.sub(rf'(^\w\s??)??{word_to_change}(^\w\s??)??',"<mask>", text)
        # use pipeline
        results = fill_mask(text)
        if all(isinstance(elem, list) for elem in results):
        # Sometimes the output from the pipeline is a single list containing the requested top_k and sometimes the pipeline returns a list of lists.
        # So we must handle the two different cases.
            if word_to_change == results[0][0]['token_str']: # make sure that the new word chosen by CamemBERT is different from the one in the original text
                # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                text = re.sub(rf'(^\w\s??)??<mask>(^\w\s??)??',results[0][random.randint(1,len(results)-1)]['token_str'], text)
                pass
            else:
                # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                text = re.sub(rf'(^\w\s??)??<mask>(^\w\s??)??',results[0][0]['token_str'], text)
                pass
        else :
            if word_to_change == results[0]['token_str']: # make sure that the new word chosen by CamemBERT is different from the one in the original text
                # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                text = re.sub(rf'(^\w\s??)??<mask>(^\w\s??)??',results[random.randint(1,len(results)-1)]['token_str'], text)
                pass
            else:
                # regex to manage punctuation around the word to be changed: brackets, commas, etc.
                text = re.sub(rf'(^\w\s??)??<mask>(^\w\s??)??',results[0]['token_str'], text)
                pass
    return text

In [13]:
%%time
print(text)

print(mlm_replacement(text, "fra", "random", 3))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Li Ping, numéro 1 mondial, en quarts du finale du tournoi WTT Champions du Macao, en Chine.
Wall time: 2.59 s


In [14]:
%%time
print(text)

print(mlm_replacement(text, "fra", "random", 3, automodel="camembert-base")) #problem with the regex to fix

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, de9 ans, a réussi un exploit ce vedredi 2de avril e battant le Chinois Fan Zhedong, numéro de mondial, e quarts de finale du tournoi WTT Champions de Macao, e Chine.
Wall time: 2.1 s


In [15]:
%%time
print(text)

print(mlm_replacement(text, "fra", "pos", 3))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
2023-05-08 17:20:43,851 SequenceTagger predicts: Dictionary with 21 tags: <unk>, O, PROPN, PUNCT, ADJ, NOUN, VERB, DET, ADP, AUX, PRON, PART, SCONJ, NUM, ADV, CCONJ, X, INTJ, SYM, <START>, <STOP>
Le pongiste français Alexis Lebrun, 24 ans, a réussi un exploit ce vendredi 21 avril en éliminant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 6.06 s


## 1.4 Based on a TF-IDF
TODO : train models on large databases, for example http://downloads.dbpedia.org/2015-04/ext/nlp/abstracts/ or OSCAR

## 2. Back-translation
TODO:
- try to make it run faster
- create dictionaries iso639-1 / iso639-3 ?
- if the user does not propose a pivot language, automatically propose by default the language closest to the one being processed (e.g. English for French and not Chinese) (do not hurry at all as this requires creating language family trees)

### 2.1. Marian (Helsinki-NLP models)
### 2.2 M2M100
The model seems to be available in rust: https://huggingface.co/facebook/m2m100_418M/tree/main!
### 2.3 NLLB
### 2.4 AutoModel

In [16]:
from transformers import pipeline
import itertools

In [17]:
def backtranslation(text, scr_lang, list_trg_lang, model, automodel=""):
    """
    A sentence given as input is translated into n different languages successively before being translated back into the original language.

    Parameters:
    - **text** (`str`):
        The text entered by the user.
    - **scr_lang** (`str`):
        The source language.
    - **list_trg_lang** (`list`):
        The languages that will be used as a pivot. 
        The length of the list determines the number of translations that will be performed before returning to the source language.
    - **model** (`str`):
        The model used: can be "Helsinki-NLP", "m2m100", "nllb" or an "automodel".
        - "Helsinki-NLP".
          Allows you to use all Helsinki-NLP/opus-mt-lang1-lang2 models available on the Hugging Face Hub:
          https://huggingface.co/models?sort=downloads&search=helsinki%2Fopus-mt-
        - "m2m100". 
          Two models are available: "m2m100M" corresponding to the facebook/m2m100_418M model and "m2m100B" corresponding to the facebook/m2m100_1.2B model.
          These models manage 100 languages in all directions:
          "af" (Afrikaans), "am" (Amharic), "ar" (Arabic), "ast" (Asturian), "az" (Azerbaijani), "ba" (Bashkir), "be" (Belarusian),
          "bg" (Bulgarian), "bn" (Bengali),"br" (Breton), "bs" (Bosnian), "ca" (Catalan), "ceb" (Cebuano), "cs" (Czech),
          "cy" (Welsh), "da" (Danish), "de" (German), "el" (Greek), "en" (English), "es" (Spanish), "et" (Estonian), "fa" (Persian),
          "ff" (Fula), "fi" (Finnish), "fr" (French), "fy" (Western Frisian), "ga" (Irish), "gd" (Scottish Gaelic), "gl" (Galician),
          "gu" (Gujarati), "ha" (Hausa), "he" (Hebrew), "hi" (Hindi), "hr" (Croatian), "ht" (Haitian), "hu" (Hungarian), "hy" (Armenian),
          "id" (Indonesian), "ig" (Igbo), "ilo" (Iloko), "is" (Icelandic), "it" (Italian), "ja" (Japanese), "jv" (Javanese), "ka" (Georgian),
          "kk" (Kazakh), "km" (Khmer), "kn" (Kannada), "ko" (Korean), "lb" (Luxembourgish), "lg" (Ganda), "ln" (Lingala), "lo" (Lao),
          "lt" (Lithuanian), "lv" (Latvian), "mg" (Malagasy), "mk" (Macedonian), "ml" (Malayalam), "mn" (Mongolian), "mr" (Marathi),
          "ms" (Malay), "my" (Burmese), "ne" (Nepali), "nl" (Dutch), "no" (Norwegian), "nso" (Northern Sotho), "oc" (Occitan), "or" (Oriya),
          "pa" (Panjabi), "pl" (Polish), "ps" (Pashto), "pt" (Portuguese), "ro" (Romanian), "ru" (Russian), "sd" (Sindhi), "si" (Sinhala),
          "sk" (Slovak), "sl" (Slovenian), "so" (Somali), "sq" (Albanian), "sr" (Serbian), "ss" (Swati), "su" (Sundanese), "sv" (Swedish),
          "sw" (Swahili), "ta" (Tamil), "th" (Thai), "tl" (Tagalog), "tn" (Tswana), "tr" (Turkish), "uk" (Ukrainian), "ur" (Urdu), "uz" (Uzbek),
          "vi" (Vietnamese), "wo" (Wolof), "xh" (Xhosa), "yi" (Yiddish), "yo" (Yoruba), "zh" (Chinese), "zu" (Zulu)'.
        - "nllb".
          Five models are available: "nllbD" corresponding to the facebook/nllb-200-distilled-600M model, 
          "nllbDL" corresponding to the facebook/nllb-200-distilled-1.3B model, "nllbL" corresponding to the facebook/nllb-200-1.3B, 
          "nllbXL" corresponding to the facebook/nllb-200-3.3B model, "nllbD" corresponding to the facebook/nllb-200-moe-54b model.
          These models manage 100 languages in all directions:
          "ace_Arab" (Acehnese Arabic), "ace_Latn" (Acehnese Latin), "acm_Arab" (Mesopotamian Arabic), "acq_Arab" (Ta’izzi-Adeni Arabic),
          "aeb_Arab" (Tunisian Arabic), "afr_Latn" (Afrikaans), "ajp_Arab" (South Levantine Arabic), "aka_Latn" (Akan), "amh_Ethi" (Amharic),
          "apc_Arab" (North Levantine Arabic), "arb_Arab" (Modern Standard Arabic), "arb_Latn" (Modern Standard Arabic Romanized),
          "ars_Arab" (Najdi Arabic), "ary_Arab" (Moroccan Arabic), "arz_Arab" (Egyptian Arabic), "asm_Beng" (Assamese), "ast_Latn" (Asturian),
          "awa_Deva" (Awadhi), "ayr_Latn" (Central Aymara), "azb_Arab" (South Azerbaijani), "azj_Latn" (North Azerbaijani), "bak_Cyrl" (Bashkir),
          "bam_Latn" (Bambara), "ban_Latn" (Balinese), "bel_Cyrl" (Belarusian), "bem_Latn" (Bemba), "ben_Beng" (Bengali), "bho_Deva" (Bhojpuri),
          "bjn_Arab" (Banjar Arabic), "bjn_Latn" (Banjar Latin), "bod_Tibt" (Standard Tibetan), "bos_Latn" (Bosnian), "bug_Latn" (Buginese), 
          "bul_Cyrl" (Bulgarian), "cat_Latn" (Catalan), "ceb_Latn" (Cebuano), "ces_Latn" (Czech), "cjk_Latn" (Chokwe), "ckb_Arab" (Central Kurdish),
          "crh_Latn" (Crimean Tatar), "cym_Latn" (Welsh), "dan_Latn" (Danish), "deu_Latn" (German), "dik_Latn" (Southwestern Dinka), 
          "dyu_Latn" (Dyula), "dzo_Tibt" (Dzongkha), "ell_Grek" (Greek), "eng_Latn" (English), "epo_Latn" (Esperanto), "est_Latn" (Estonian),
          "eus_Latn" (Basque), "ewe_Latn" (Ewe), "fao_Latn" (Faroese), "fij_Latn" (Fijian), "fin_Latn" (Finnish), "fon_Latn" (Fon), 
          "fra_Latn" (French), "fur_Latn" (Friulian), "fuv_Latn" (Nigerian Fulfulde), "gla_Latn" (Scottish Gaelic), "gle_Latn" (Irish), 
          "glg_Latn" (Galician), "grn_Latn" (Guarani), "guj_Gujr" (Gujarati), "hat_Latn" (Haitian Creole), "hau_Latn" (Hausa), 
          "heb_Hebr" (Hebrew), "hin_Deva" (Hindi), "hne_Deva" (Chhattisgarhi), "hrv_Latn" (Croatian), "hun_Latn" (Hungarian), 
          "hye_Armn" (Armenian), "ibo_Latn" (Igbo), "ilo_Latn" (Ilocano), "ind_Latn" (Indonesian), "isl_Latn" (Icelandic), "ita_Latn" (Italian), 
          "jav_Latn" (Javanese), "jpn_Jpan" (Japanese), "kab_Latn" (Kabyle), "kac_Latn" (Jingpho), "kam_Latn" (Kamba), "kan_Knda" (Kannada),
          "kas_Arab" (Kashmiri Arabic), "kas_Deva" (Kashmiri Devanagari), "kat_Geor" (Georgian), "knc_Arab" (Central Kanuri Arabic), 
          "knc_Latn" (Central Kanuri Latin), "kaz_Cyrl" (Kazakh), "kbp_Latn" (Kabiyè), "kea_Latn" (Kabuverdianu), "khm_Khmr" (Khmer), 
          "kik_Latn" (Kikuyu), "kin_Latn" (Kinyarwanda), "kir_Cyrl" (Kyrgyz), "kmb_Latn" (Kimbundu), "kmr_Latn" (Northern Kurdish), 
          "kon_Latn" (Kikongo), "kor_Hang" (Korean), "lao_Laoo" (Lao), "lij_Latn" (Ligurian), "lim_Latn" (Limburgish), "lin_Latn" (Lingala),
          "lit_Latn" (Lithuanian), "lmo_Latn" (Lombard), "ltg_Latn" (Latgalian), "ltz_Latn" (Luxembourgish), "lua_Latn" (Luba-Kasai), 
          "lug_Latn" (Ganda), "luo_Latn" (Luo), "lus_Latn" (Mizo), "lvs_Latn" (Standard Latvian), "mag_Deva" (Magahi), "mai_Deva" (Maithili),
          "mal_Mlym" (Malayalam), "mar_Deva" (Marathi), "min_Arab" (Minangkabau Arabic), "min_Latn" (Minangkabau Latin), "mkd_Cyrl" (Macedonian),
          "plt_Latn" (Plateau Malagasy), "mlt_Latn" (Maltese), "mni_Beng" (Meitei Bengali), "khk_Cyrl" (Halh Mongolian), "mos_Latn" (Mossi), 
          "mri_Latn" (Maori), "mya_Mymr" (Burmese), "nld_Latn" (Dutch), "nno_Latn" (Norwegian Nynorsk), "nob_Latn" (Norwegian Bokmål), 
          "npi_Deva" (Nepali), "nso_Latn" (Northern Sotho), "nus_Latn" (Nuer), "nya_Latn" (Nyanja), "oci_Latn" (Occitan), 
          "gaz_Latn" (West Central Oromo), "ory_Orya" (Odia), "pag_Latn" (Pangasinan), "pan_Guru" (Eastern Panjabi), "pap_Latn" (Papiamento),
          "pes_Arab" (Western Persian), "pol_Latn" (Polish), "por_Latn" (Portuguese), "prs_Arab" (Dari), "pbt_Arab" (Southern Pashto), 
          "quy_Latn" (Ayacucho Quechua), "ron_Latn" (Romanian), "run_Latn" (Rundi), "rus_Cyrl" (Russian), "sag_Latn" (Sango), "san_Deva" (Sanskrit),
          "sat_Olck" (Santali), "scn_Latn" (Sicilian), "shn_Mymr" (Shan), "sin_Sinh" (Sinhala), "slk_Latn" (Slovak), "slv_Latn" (Slovenian),
          "smo_Latn" (Samoan), "sna_Latn" (Shona), "snd_Arab" (Sindhi), "som_Latn" (Somali), "sot_Latn" (Southern Sotho), "spa_Latn" (Spanish), 
          "als_Latn" (Tosk Albanian), "srd_Latn" (Sardinian), "srp_Cyrl" (Serbian), "ssw_Latn" (Swati), "sun_Latn" (Sundanese), "swe_Latn" (Swedish),
          "swh_Latn" (Swahili), "szl_Latn" (Silesian), "tam_Taml" (Tamil), "tat_Cyrl" (Tatar), "tel_Telu" (Telugu), "tgk_Cyrl" (Tajik), 
          "tgl_Latn" (Tagalog), "tha_Thai" (Thai), "tir_Ethi" (Tigrinya), "taq_Latn" (Tamasheq Latin), "taq_Tfng" (Tamasheq Tifinagh), 
          "tpi_Latn" (Tok Pisin), "tsn_Latn" (Tswana), "tso_Latn" (Tsonga), "tuk_Latn" (Turkmen), "tum_Latn" (Tumbuka), "tur_Latn" (Turkish),
          "twi_Latn" (Twi), "tzm_Tfng" (Central Atlas Tamazight), "uig_Arab" (Uyghur), "ukr_Cyrl" (Ukrainian), "umb_Latn" (Umbundu), 
          "urd_Arab" (Urdu), "uzn_Latn" (Northern Uzbek), "vec_Latn" (Venetian), "vie_Latn" (Vietnamese), "war_Latn" (Waray), 
          "wol_Latn" (Wolof), "xho_Latn" (Xhosa), "ydd_Hebr" (Eastern Yiddish), "yor_Latn" (Yoruba), "yue_Hant" (Yue Chinese), 
          "zho_Hans" (Chinese Simplified), "zho_Hant" (Chinese Traditional), "zsm_Latn" (Standard Malay), "zul_Latn" (Zulu)'.
        - "automodel".
        Use the model indicated by the user in the **automodel** argument (see below), which must be a translation model among
        those proposed on the Hugging Face Hub (https://huggingface.co/models?pipeline_tag=translation).
    - **automodel** (`str`, *optional*, defaults to `""`):
        The user AutoModel name.
    """
    
    list_trg_lang = [scr_lang]+list_trg_lang

    # Helsinki
    if model == "Helsinki-NLP":
        model = "Helsinki-NLP/opus-mt"
        with open('Helsinki-NLP-models.txt') as f:
            lines = f.read().splitlines()
        lang_hel = []
        for lang_tag in lines:
            if "opus-mt" in lang_tag:
                lang_hel.append(lang_tag.split("Helsinki-NLP/opus-mt-")[1].split("-"))
            else:
                pass
        lang_hel = list(set(list(itertools.chain.from_iterable(lang_hel))))
        lang_hel = [x for x in lang_hel if len(x)<4]
        for lang in list_trg_lang:
            if lang not in lang_hel:
                return 'Invalid language for the model "Helsinki-NLP", please choose one in the following list: ', lang_hel

        for lang in range(len(list_trg_lang)-1):
            if f'{model}-{list_trg_lang[lang]}-{list_trg_lang[lang+1]}' in lines:
                forward_translator = pipeline("translation", f'{model}-{list_trg_lang[lang]}-{list_trg_lang[lang+1]}', f'{model}-{scr_lang}-{list_trg_lang[lang+1]}')
            else:
                return "The model " f'{model}-{list_trg_lang[lang]}-{list_trg_lang[lang+1]}' " does not exist. We invite you to replace it by one of those available in the following list: ", lines
        if f'{model}-{list_trg_lang[lang]}-{list_trg_lang[lang+1]}' in lines:
            reverse_translator = pipeline("translation", f'{model}-{list_trg_lang[-1]}-{list_trg_lang[0]}', f'{model}-{list_trg_lang[-1]}-{list_trg_lang[0]}')
        else:
            return "The model " f'{model}-{list_trg_lang[-1]}-{list_trg_lang[0]}' " does not exist. We invite you to replace it by one of those available in the following list: ",lines
        new_text = reverse_translator(forward_translator(text)[0]['translation_text'])
        return new_text[0]['translation_text']


    # m2m100
    if "m2m100" in model:
        if model == "m2m100M": 
            model = "facebook/m2m100_418M"
        elif model == "m2m100B": 
            model = "facebook/m2m100_1.2B"
        else:
            return model, 'is not part of the managed models, i.e. "Helsinki-NLP", "m2m100M" (facebook/m2m100_418M), "m2m100B" (facebook/m2m100_1.2B), "nllbD" (facebook/nllb-200-distilled-600M), "nllbDL" (facebook/nllb-200-distilled-1.3B), "nllbL" (facebook/nllb-200-1.3B), "nllbXL" (facebook/nllb-200-3.3B), "nllbXXL" (facebook/nllb-moe-54b). If you want to use an AutoModel different from the previous list, use the argument "auto=True"'
        for lang in list_trg_lang:
            if lang not in ["af","am","ar","ast","az","ba","be","bg","bn","br","bs","ca","ceb","cs","cy","da","de","el","en","es","et","fa","ff","fi","fr","fy","ga","gd","gl","gu","ha","he","hi","hr","ht","hu","hy","id","ig","ilo","is","it","ja","jv","ka","kk","km","kn","ko","lb","lg","ln","lo","lt","lv","mg","mk","ml","mn","mr","ms","my","ne","nl","no","ns","oc","or","pa","pl","ps","pt","ro","ru","sd","si","sk","sl","so","sq","sr","ss","su","sv","sw","ta","th","tl","tn","tr","uk","ur","uz","vi","wo","xh","yi","yo","zh","zu"]:
                return 'Invalid language for the model "m2m100", please choose beetween "af" (Afrikaans), "am" (Amharic), "ar" (Arabic), "ast" (Asturian), "az" (Azerbaijani), "ba" (Bashkir), "be" (Belarusian), "bg" (Bulgarian), "bn" (Bengali),"br" (Breton), "bs" (Bosnian), "ca" (Catalan), "ceb" (Cebuano), "cs" (Czech), "cy" (Welsh), "da" (Danish), "de" (German), "el" (Greek), "en" (English), "es" (Spanish), "et" (Estonian), "fa" (Persian), "ff" (Fula), "fi" (Finnish), "fr" (French), "fy" (Western Frisian), "ga" (Irish), "gd" (Scottish Gaelic), "gl" (Galician), "gu" (Gujarati), "ha" (Hausa), "he" (Hebrew), "hi" (Hindi), "hr" (Croatian), "ht" (Haitian), "hu" (Hungarian), "hy" (Armenian), "id" (Indonesian), "ig" (Igbo), "ilo" (Iloko), "is" (Icelandic), "it" (Italian), "ja" (Japanese), "jv" (Javanese), "ka" (Georgian), "kk" (Kazakh), "km" (Khmer), "kn" (Kannada), "ko" (Korean), "lb" (Luxembourgish), "lg" (Ganda), "ln" (Lingala), "lo" (Lao), "lt" (Lithuanian), "lv" (Latvian), "mg" (Malagasy), "mk" (Macedonian), "ml" (Malayalam), "mn" (Mongolian), "mr" (Marathi), "ms" (Malay), "my" (Burmese), "ne" (Nepali), "nl" (Dutch), "no" (Norwegian), "nso" (Northern Sotho), "oc" (Occitan), "or" (Oriya), "pa" (Panjabi), "pl" (Polish), "ps" (Pashto), "pt" (Portuguese), "ro" (Romanian), "ru" (Russian), "sd" (Sindhi), "si" (Sinhala), "sk" (Slovak), "sl" (Slovenian), "so" (Somali), "sq" (Albanian), "sr" (Serbian), "ss" (Swati), "su" (Sundanese), "sv" (Swedish), "sw" (Swahili), "ta" (Tamil), "th" (Thai), "tl" (Tagalog), "tn" (Tswana), "tr" (Turkish), "uk" (Ukrainian), "ur" (Urdu), "uz" (Uzbek), "vi" (Vietnamese), "wo" (Wolof), "xh" (Xhosa), "yi" (Yiddish), "yo" (Yoruba), "zh" (Chinese), "zu" (Zulu)'
        for lang in range(len(list_trg_lang)-1):
            forward_translator = pipeline('translation', model, src_lang=list_trg_lang[lang], tgt_lang=list_trg_lang[lang+1])
        reverse_translator = pipeline('translation', model, src_lang=list_trg_lang[-1], tgt_lang=list_trg_lang[0])
        new_text = reverse_translator(forward_translator(text)[0]['translation_text'])
        return new_text[0]['translation_text']


    # NLLB
    if "nllb" in model:
        if model == "nllbD": 
            model = "facebook/nllb-200-distilled-600M"
        elif model == "nllbDL": 
            model = "facebook/nllb-200-distilled-1.3B"
        elif model == "nllbL": 
            model = "facebook/nllb-200-1.3B"
        elif model == "nllbXL": 
            model = "facebook/nllb-200-3.3B"
        elif model == "nllbXXL": 
            model = "facebook/nllb-moe-54b"
        else:
            return model, 'is not part of the managed models, i.e. "Helsinki-NLP", "m2m100M" (facebook/m2m100_418M), "m2m100B" (facebook/m2m100_1.2B), "nllbD" (facebook/nllb-200-distilled-600M), "nllbDL" (facebook/nllb-200-distilled-1.3B), "nllbL" (facebook/nllb-200-1.3B), "nllbXL" (facebook/nllb-200-3.3B), "nllbXXL" (facebook/nllb-moe-54b). If you want to use an AutoModel different from the previous list, use the argument "auto=True"'
        for lang in list_trg_lang:
            if lang not in ["ace_Arab","ace_Latn","acm_Arab","acq_Arab","aeb_Arab","afr_Latn","ajp_Arab","aka_Latn","amh_Ethi","apc_Arab","arb_Arab","ars_Arab","ary_Arab","arz_Arab","asm_Beng","ast_Latn","awa_Deva","ayr_Latn","azb_Arab","azj_Latn","bak_Cyrl","bam_Latn","ban_Latn,bel_Cyrl","bem_Latn","ben_Beng","bho_Deva","bjn_Arab","bjn_Latn","bod_Tibt","bos_Latn","bug_Latn","bul_Cyrl","cat_Latn","ceb_Latn","ces_Latn","cjk_Latn","ckb_Arab","crh_Latn","cym_Latn","dan_Latn","deu_Latn","dik_Latn","dyu_Latn","dzo_Tibt","ell_Grek","eng_Latn","epo_Latn","est_Latn","eus_Latn","ewe_Latn","fao_Latn","pes_Arab","fij_Latn","fin_Latn","fon_Latn","fra_Latn","fur_Latn","fuv_Latn","gla_Latn","gle_Latn","glg_Latn","grn_Latn","guj_Gujr","hat_Latn","hau_Latn","heb_Hebr","hin_Deva","hne_Deva","hrv_Latn","hun_Latn","hye_Armn","ibo_Latn","ilo_Latn","ind_Latn","isl_Latn","ita_Latn","jav_Latn","jpn_Jpan","kab_Latn","kac_Latn","kam_Latn","kan_Knda","kas_Arab","kas_Deva","kat_Geor","knc_Arab","knc_Latn","kaz_Cyrl","kbp_Latn","kea_Latn","khm_Khmr","kik_Latn","kin_Latn","kir_Cyrl","kmb_Latn","kon_Latn","kor_Hang","kmr_Latn","lao_Laoo","lvs_Latn","lij_Latn","lim_Latn","lin_Latn","lit_Latn","lmo_Latn","ltg_Latn","ltz_Latn","lua_Latn","lug_Latn","luo_Latn","lus_Latn","mag_Deva","mai_Deva","mal_Mlym","mar_Deva","min_Latn","mkd_Cyrl","plt_Latn","mlt_Latn","mni_Beng","khk_Cyrl","mos_Latn","mri_Latn","zsm_Latn","mya_Mymr","nld_Latn","nno_Latn","nob_Latn","npi_Deva","nso_Latn","nus_Latn","nya_Latn","oci_Latn","gaz_Latn","ory_Orya","pag_Latn","pan_Guru","pap_Latn","pol_Latn","por_Latn","prs_Arab","pbt_Arab","quy_Latn","ron_Latn","run_Latn","rus_Cyrl","sag_Latn","san_Deva","sat_Beng","scn_Latn","shn_Mymr","sin_Sinh","slk_Latn","slv_Latn","smo_Latn","sna_Latn","snd_Arab","som_Latn","sot_Latn","spa_Latn","als_Latn","srd_Latn","srp_Cyrl","ssw_Latn","sun_Latn","swe_Latn","swh_Latn","szl_Latn","tam_Taml","tat_Cyrl","tel_Telu","tgk_Cyrl","tgl_Latn","tha_Thai","tir_Ethi","taq_Latn","taq_Tfng","tpi_Latn","tsn_Latn","tso_Latn","tuk_Latn","tum_Latn","tur_Latn","twi_Latn","tzm_Tfng","uig_Arab","ukr_Cyrl","umb_Latn","urd_Arab","uzn_Latn","vec_Latn","vie_Latn","war_Latn","wol_Latn","xho_Latn","ydd_Hebr","yor_Latn","yue_Hant","zho_Hans","zho_Hant","zul_Latn"]:
                return 'Invalid language for the model "nllbm", please choose beetween "ace_Arab" (Acehnese Arabic), "ace_Latn" (Acehnese Latin), "acm_Arab" (Mesopotamian Arabic), "acq_Arab" (Ta’izzi-Adeni Arabic), "aeb_Arab" (Tunisian Arabic), "afr_Latn" (Afrikaans), "ajp_Arab" (South Levantine Arabic), "aka_Latn" (Akan), "amh_Ethi" (Amharic), "apc_Arab" (North Levantine Arabic), "arb_Arab" (Modern Standard Arabic), "arb_Latn" (Modern Standard Arabic Romanized), "ars_Arab" (Najdi Arabic), "ary_Arab" (Moroccan Arabic), "arz_Arab" (Egyptian Arabic), "asm_Beng" (Assamese), "ast_Latn" (Asturian), "awa_Deva" (Awadhi), "ayr_Latn" (Central Aymara), "azb_Arab" (South Azerbaijani), "azj_Latn" (North Azerbaijani), "bak_Cyrl" (Bashkir), "bam_Latn" (Bambara), "ban_Latn" (Balinese), "bel_Cyrl" (Belarusian), "bem_Latn" (Bemba), "ben_Beng" (Bengali), "bho_Deva" (Bhojpuri), "bjn_Arab" (Banjar Arabic), "bjn_Latn" (Banjar Latin), "bod_Tibt" (Standard Tibetan), "bos_Latn" (Bosnian), "bug_Latn" (Buginese), "bul_Cyrl" (Bulgarian), "cat_Latn" (Catalan), "ceb_Latn" (Cebuano), "ces_Latn" (Czech), "cjk_Latn" (Chokwe), "ckb_Arab" (Central Kurdish), "crh_Latn" (Crimean Tatar), "cym_Latn" (Welsh), "dan_Latn" (Danish), "deu_Latn" (German), "dik_Latn" (Southwestern Dinka), "dyu_Latn" (Dyula), "dzo_Tibt" (Dzongkha), "ell_Grek" (Greek), "eng_Latn" (English), "epo_Latn" (Esperanto), "est_Latn" (Estonian), "eus_Latn" (Basque), "ewe_Latn" (Ewe), "fao_Latn" (Faroese), "fij_Latn" (Fijian), "fin_Latn" (Finnish), "fon_Latn" (Fon), "fra_Latn" (French), "fur_Latn" (Friulian), "fuv_Latn" (Nigerian Fulfulde), "gla_Latn" (Scottish Gaelic), "gle_Latn" (Irish), "glg_Latn" (Galician), "grn_Latn" (Guarani), "guj_Gujr" (Gujarati), "hat_Latn" (Haitian Creole), "hau_Latn" (Hausa), "heb_Hebr" (Hebrew), "hin_Deva" (Hindi), "hne_Deva" (Chhattisgarhi), "hrv_Latn" (Croatian), "hun_Latn" (Hungarian), "hye_Armn" (Armenian), "ibo_Latn" (Igbo), "ilo_Latn" (Ilocano), "ind_Latn" (Indonesian), "isl_Latn" (Icelandic), "ita_Latn" (Italian), "jav_Latn" (Javanese), "jpn_Jpan" (Japanese), "kab_Latn" (Kabyle), "kac_Latn" (Jingpho), "kam_Latn" (Kamba), "kan_Knda" (Kannada), "kas_Arab" (Kashmiri Arabic), "kas_Deva" (Kashmiri Devanagari), "kat_Geor" (Georgian), "knc_Arab" (Central Kanuri Arabic), "knc_Latn" (Central Kanuri Latin), "kaz_Cyrl" (Kazakh), "kbp_Latn" (Kabiyè), "kea_Latn" (Kabuverdianu), "khm_Khmr" (Khmer), "kik_Latn" (Kikuyu), "kin_Latn" (Kinyarwanda), "kir_Cyrl" (Kyrgyz), "kmb_Latn" (Kimbundu), "kmr_Latn" (Northern Kurdish), "kon_Latn" (Kikongo), "kor_Hang" (Korean), "lao_Laoo" (Lao), "lij_Latn" (Ligurian), "lim_Latn" (Limburgish), "lin_Latn" (Lingala), "lit_Latn" (Lithuanian), "lmo_Latn" (Lombard), "ltg_Latn" (Latgalian), "ltz_Latn" (Luxembourgish), "lua_Latn" (Luba-Kasai), "lug_Latn" (Ganda), "luo_Latn" (Luo), "lus_Latn" (Mizo), "lvs_Latn" (Standard Latvian), "mag_Deva" (Magahi), "mai_Deva" (Maithili), "mal_Mlym" (Malayalam), "mar_Deva" (Marathi), "min_Arab" (Minangkabau Arabic), "min_Latn" (Minangkabau Latin), "mkd_Cyrl" (Macedonian), "plt_Latn" (Plateau Malagasy), "mlt_Latn" (Maltese), "mni_Beng" (Meitei Bengali), "khk_Cyrl" (Halh Mongolian), "mos_Latn" (Mossi), "mri_Latn" (Maori), "mya_Mymr" (Burmese), "nld_Latn" (Dutch), "nno_Latn" (Norwegian Nynorsk), "nob_Latn" (Norwegian Bokmål), "npi_Deva" (Nepali), "nso_Latn" (Northern Sotho), "nus_Latn" (Nuer), "nya_Latn" (Nyanja), "oci_Latn" (Occitan), "gaz_Latn" (West Central Oromo), "ory_Orya" (Odia), "pag_Latn" (Pangasinan), "pan_Guru" (Eastern Panjabi), "pap_Latn" (Papiamento), "pes_Arab" (Western Persian), "pol_Latn" (Polish), "por_Latn" (Portuguese), "prs_Arab" (Dari), "pbt_Arab" (Southern Pashto), "quy_Latn" (Ayacucho Quechua), "ron_Latn" (Romanian), "run_Latn" (Rundi), "rus_Cyrl" (Russian), "sag_Latn" (Sango), "san_Deva" (Sanskrit), "sat_Olck" (Santali), "scn_Latn" (Sicilian), "shn_Mymr" (Shan), "sin_Sinh" (Sinhala), "slk_Latn" (Slovak), "slv_Latn" (Slovenian), "smo_Latn" (Samoan), "sna_Latn" (Shona), "snd_Arab" (Sindhi), "som_Latn" (Somali), "sot_Latn" (Southern Sotho), "spa_Latn" (Spanish), "als_Latn" (Tosk Albanian), "srd_Latn" (Sardinian), "srp_Cyrl" (Serbian), "ssw_Latn" (Swati), "sun_Latn" (Sundanese), "swe_Latn" (Swedish), "swh_Latn" (Swahili), "szl_Latn" (Silesian), "tam_Taml" (Tamil), "tat_Cyrl" (Tatar), "tel_Telu" (Telugu), "tgk_Cyrl" (Tajik), "tgl_Latn" (Tagalog), "tha_Thai" (Thai), "tir_Ethi" (Tigrinya), "taq_Latn" (Tamasheq Latin), "taq_Tfng" (Tamasheq Tifinagh), "tpi_Latn" (Tok Pisin), "tsn_Latn" (Tswana), "tso_Latn" (Tsonga), "tuk_Latn" (Turkmen), "tum_Latn" (Tumbuka), "tur_Latn" (Turkish), "twi_Latn" (Twi), "tzm_Tfng" (Central Atlas Tamazight), "uig_Arab" (Uyghur), "ukr_Cyrl" (Ukrainian), "umb_Latn" (Umbundu), "urd_Arab" (Urdu), "uzn_Latn" (Northern Uzbek), "vec_Latn" (Venetian), "vie_Latn" (Vietnamese), "war_Latn" (Waray), "wol_Latn" (Wolof), "xho_Latn" (Xhosa), "ydd_Hebr" (Eastern Yiddish), "yor_Latn" (Yoruba), "yue_Hant" (Yue Chinese), "zho_Hans" (Chinese Simplified), "zho_Hant" (Chinese Traditional), "zsm_Latn" (Standard Malay), "zul_Latn" (Zulu)'
        for lang in range(len(list_trg_lang)-1):
            forward_translator = pipeline('translation', model, src_lang=list_trg_lang[lang], tgt_lang=list_trg_lang[lang+1])
        reverse_translator = pipeline('translation', model, src_lang=list_trg_lang[-1], tgt_lang=list_trg_lang[0])
        new_text = reverse_translator(forward_translator(text)[0]['translation_text'])
        return new_text[0]['translation_text']

    
    # AutoModel
    if model=="automodel":
        if automodel=="":
            return 'Please enter a model name from those available on https://huggingface.co/models?pipeline_tag=translation'
            pass
        else:
            for lang in range(len(list_trg_lang)-1):
                forward_translator = pipeline('translation', automodel, tokenizer=automodel)
            reverse_translator = pipeline('translation', automodel, tokenizer=automodel)
            new_text = reverse_translator(forward_translator(text)[0]['translation_text'])
            return new_text[0]['translation_text']

        
    else:
        return model, 'is not part of the managed models, i.e. "Helsinki-NLP", "m2m100M" (facebook/m2m100_418M), "m2m100B" (facebook/m2m100_1.2B), "nllbD" (facebook/nllb-200-distilled-600M), "nllbDL" (facebook/nllb-200-distilled-1.3B), "nllbL" (facebook/nllb-200-1.3B), "nllbXL" (facebook/nllb-200-3.3B), "nllbXXL" (facebook/nllb-moe-54b) or "automodel".'

In [18]:
%%time
print(text)

#AutoModel
print(backtranslation(text, "en", ["fr"], "automodel", automodel="facebook/wmt19-en-ru")) 

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Hesrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois - Zhendong, numéro 1 mondial, en quarts de final du tournoi WTT Champions de Macao, en Chine.
Wall time: 27.8 s


In [19]:
%%time
print(text)

#Helsinki 1 language
print(backtranslation(text, "fr", ["en"], "Helsinki-NLP"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, a fait un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, leader mondial, en quart de finale du tournoi WTT Champions à Macao, en Chine.
Wall time: 9.45 s


In [20]:
%%time
print(text)

#Helsinki n languages
print(backtranslation(text, "fr", ["es","en"], "Helsinki-NLP"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
You pong français Alexis Lebrun, 19 ans, réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quaarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 17.5 s


In [21]:
%%time
print(text)

#m2m100 n languages
print(backtranslation(text, "fr", ["es","en"], "m2m100M"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril dans le battant le Chinois Fan Zhendong, numéro 1 mondial, dans les quarts de finale du tournoi WTT Champions of Macao, en Chine.
Wall time: 54.8 s


# 3. Transformation of the text surface

TODO:
- Add a dash in the middle of a word to simulate a line break ? (can be complicated because you have to cut the word between two syllables) (not hurry)

In [22]:
import re
import pandas as pd
from num2words import num2words
from text_to_num import alpha2digit

In [23]:
def surface_remplacements(text, lang, method):
    """
    Applies transformations to the surface of a text.

    Parameters:
    - **text** (`str`):
        The text entered by the user.
    - **lang** (`str`):
        The text language to be indicated using the ISO 639-3 convention. For example ("fra" for "French"). The languages managed are :
        "amh" (Amharic), "ara" (Arabic), "aze" (Azerbaijani), "cat" (Catalan), "ces" (Czech), "deu" (German), "dan" (Danish), 
        "eng" (English), "eng_GB" (English - Great Britain), "eng_IN" (English - India), "eng_NG" (English - Nigeria), "spa" (Spanish), 
        "spa_CO" (Spanish - Colombia), "spa_VE" (Spanish - Venezuela), "spa_GT" (Spanish - Guatemala), "fas" (Persian), "fin" (Finnish), 
        "fra" (French),"fra_CH" (French - Switzerland), "fra_BE" (French - Belgium), "fra_DZ" (French - Algeria), "heb" (Hebrew),
        "hun" (Hungarian), "ind" (Indonesian), "isl" (Icelandic), "ita" (Italian), "jpn" (Japanese), "kan" (Kannada), "kor" (Korean),
        "kaz" (Kazakh), "lit" (Lithuanian), "lav" (Latvian), "nor" (Norwegian), "pol" (Polish), "por" (Portuguese), 
        "por_BR" (Portuguese - Brazilian), "slv" (Slovene), "srp" (Serbian), "swe" (Swedish), "ron" (Romanian), "rus" (Russian), 
        "tel" (Telugu), "tgk" (Tajik), "tur" (Turkish), "tha" (Thai), "vie" (Vietnamese), "nld" (Dutch), "ukr" (Ukrainian)'.
        See **method** to know exactly which method is applicable for a given language.
    - **method** (`str`):
        Four methods can be applied: "numbers", "letters", "extensions", "contractions".
        - "numbers".
        Numbers are converted to their equivalent in letters via the num2words library. The languages managed by this method are:
        "amh" (Amharic), "ara" (Arabic), "aze" (Azerbaijani), "ces" (Czech), "deu" (German), "dan" (Danish), "eng" (English),
        "eng_GB" (English - Great Britain), "eng_IN" (English - India), "eng_NG" (English - Nigeria), "spa" (Spanish), 
        "spa_CO" (Spanish - Colombia), "spa_VE" (Spanish - Venezuela), "spa_GT" (Spanish - Guatemala), "fas" (Persian), 
        "fin" (Finnish), "fra" (French),"fra_CH" (French - Switzerland), "fra_BE" (French - Belgium), "fra_DZ" (French - Algeria),
        "heb" (Hebrew), "hun" (Hungarian), "ind" (Indonesian), "isl" (Icelandic), "ita" (Italian), "jpn" (Japanese), "kan" (Kannada),
        "kor" (Korean), "kaz" (Kazakh), "lit" (Lithuanian), "lav" (Latvian), "nor" (Norwegian), "pol" (Polish), "por" (Portuguese), 
        "por_BR" (Portuguese - Brazilian), "slv" (Slovene), "srp" (Serbian), "swe" (Swedish), "ron" (Romanian), "rus" (Russian), 
        "tel" (Telugu), "tgk" (Tajik), "tur" (Turkish), "tha" (Thai), "vie" (Vietnamese), "nld" (Dutch), "ukr" (Ukrainian)'.
        - "letters".
        Letters are converted to their equivalent in numbers via the text_to_num library. The languages managed by this method are:
        "cat" (Catalan), "deu" (German), "fra" (French), "eng" (English), "spa" (Spanish), "por" (Portuguese), "rus" (Russian).'
        - "extensions".
        This method is only available for English ("lang=eng"). It extends contractions using a dictionary. Example: "didn't" becomes "did not".
        - "contractions".
        This method is only available for English ("lang=eng"). It contracts words using a dictionary. Example: "did not" becomes "didn't".
    """
        
    if method == "numbers":
        # iso639-3
        list_lang = ["amh", "ara", "aze", "ces", "deu", "dan", "eng", "eng_GB", "eng_IN", "eng_NG", "spa", "spa_CO", "spa_VE", "spa_GT", "fas", "fin", "fra","fra_CH", "fra_BE", "fra_DZ", "heb", "hun", "ind", "isl", "ita", "jpn", "kan", "kor", "kaz", "lit", "lav", "nor", "pol", "por", "por_BR", "slv", "srp", "swe", "ron", "rus", "tel", "tgk", "tur", "tha", "vie", "nld", "ukr"] 
        # iso639-1 use by num2words
        list_num_to_text = ["am", "ar", "az", "cz", "de", "dk", "en", "en_GB", "en_IN", "en_NG", "es", "es_CO", "es_VE", "es_GT", "fa", "fi", "fr","fr_CH", "fr_BE", "fr_DZ", "he", "hu", "id", "is", "it", "ja", "kn", "ko", "kz", "lt", "lv", "no", "pl", "pt", "pt_BR", "sl", "sr", "sv", "ro", "ru", "te", "tg", "tr", "th", "vi", "nl", "uk"]
        if lang not in list_lang:
            return 'Invalid language for the method "numbers", please choose beetween "amh" (Amharic), "ara" (Arabic), "aze" (Azerbaijani), "ces" (Czech), "deu" (German), "dan" (Danish), "eng" (English), "eng_GB" (English - Great Britain), "eng_IN" (English - India), "eng_NG" (English - Nigeria), "spa" (Spanish), "spa_CO" (Spanish - Colombia), "spa_VE" (Spanish - Venezuela), "spa_GT" (Spanish - Guatemala), "fas" (Persian), "fin" (Finnish), "fra" (French),"fra_CH" (French - Switzerland), "fra_BE" (French - Belgium), "fra_DZ" (French - Algeria), "heb" (Hebrew), "hun" (Hungarian), "ind" (Indonesian), "isl" (Icelandic), "ita" (Italian), "jpn" (Japanese), "kan" (Kannada), "kor" (Korean), "kaz" (Kazakh), "lit" (Lithuanian), "lav" (Latvian), "nor" (Norwegian), "pol" (Polish), "por" (Portuguese), "por_BR" (Portuguese - Brazilian), "slv" (Slovene), "srp" (Serbian), "swe" (Swedish), "ron" (Romanian), "rus" (Russian), "tel" (Telugu), "tgk" (Tajik), "tur" (Turkish), "tha" (Thai), "vie" (Vietnamese), "nld" (Dutch), "ukr" (Ukrainian)'
        else:
            lang = list_num_to_text[list_lang.index(lang)]
            numbers = re.findall(r'\b\d+\b', text)
            numbers.sort(reverse = True)
            new_text = text
            for idx in numbers:
                new_text = new_text.replace(idx,num2words(idx, lang=lang))
            return new_text
        
        
    if method == "letters":
        list_lang = ["cat", "deu", "fra", "eng", "spa", "por", "rus"] # iso639-3
        list_text_to_num = ["ca", "de", "fr", "en", "es", "pt", "ru"] # iso639-1 use by alpha2digit
        if lang not in list_lang:
            return 'Invalid language for the method "numbers", please choose beetween "cat" (Catalan), "deu" (German), "fra" (French), "eng" (English), "spa" (Spanish), "por" (Portuguese), "rus" (Russian).'
        else:
            lang = list_text_to_num[list_lang.index(lang)]
            new_text = alpha2digit(text, lang, ordinal_threshold=0)
            return new_text
        
        
    if (method == "extensions") and (lang=="eng"):
        # pandas version
        df_extensions = pd.read_csv("df_extensions.csv", sep=";")
        # datasets version
#         df_extensions = load_dataset("fraug-library/english_contractions_extensions", data_files=f"df_extensions.csv", sep=";")["train"].to_pandas()
        for idx in range(len(df_extensions)):
            # regex to manage punctuation around the word to be changed: brackets, commas, etc.
            text =  re.sub(rf'(^\w\s)??{df_extensions.contractions[idx]}(^\w\s??)??',df_extensions.extensions[idx], text) 
        return text
        
        
    if (method == "contractions") and (lang=="eng"):
        # pandas version
        df_contractions = pd.read_csv("df_contractions.csv", sep=";")
        # datasets version
#         df_contractions = load_dataset("fraug-library/english_contractions_extensions", data_files=f"df_contractions.csv", sep=";")["train"].to_pandas()
        for idx in range(len(df_contractions)):
            # regex to manage punctuation around the word to be changed: brackets, commas, etc.
            text =  re.sub(rf'(^\w\s??)??{df_contractions.extensions[idx]}(^\w\s??)??',df_contractions.contractions[idx], text) 
        return text
    
    else:
        return 'Invalid method please choose beetwen "numbers", "letters", "extensions" and "contractions".'

In [24]:
%%time
print(text)

print(surface_remplacements(text,"fra","numbers"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, dix-neuf ans, a réalisé un exploit ce vendredi vingt et un avril en battant le Chinois Fan Zhendong, numéro un mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 10.2 ms


In [25]:
%%time
print(surface_remplacements(text,"fra","numbers"))

print(surface_remplacements(surface_remplacements(text,"fra","numbers"),"fra","letters"))

Le pongiste français Alexis Lebrun, dix-neuf ans, a réalisé un exploit ce vendredi vingt et un avril en battant le Chinois Fan Zhendong, numéro un mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro un mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 20.1 ms


In [26]:
%%time
text_eng = "French table tennis player Alexis Lebrun, (isn't 19?), achieved a feat on Friday 21 April by beating China's Fan Zhendong, world number 1, in the quarter-finals of the WTT Champions tournament in Macau, China."
print(text_eng)

print(surface_remplacements(text_eng,"eng","extensions"))

French table tennis player Alexis Lebrun, (isn't 19?), achieved a feat on Friday 21 April by beating China's Fan Zhendong, world number 1, in the quarter-finals of the WTT Champions tournament in Macau, China.
French table tennis player Alexis Lebrun, (is not 19?), achieved a feat on Friday 21 April by beating China's Fan Zhendong, world number 1, in the quarter-finals of the WTT Champions tournament in Macau, China.
Wall time: 50.9 ms


In [27]:
%%time
text_eng = "French table tennis player Alexis Lebrun, (is not 19?), achieved a feat on Friday 21 April by beating China's Fan Zhendong, world number 1, in the quarter-finals of the WTT Champions tournament in Macau, China."
print(text_eng)

print(surface_remplacements(text_eng,"eng","contractions"))

French table tennis player Alexis Lebrun, (is not 19?), achieved a feat on Friday 21 April by beating China's Fan Zhendong, world number 1, in the quarter-finals of the WTT Champions tournament in Macau, China.
French table tennis player Alexis Lebrun, (isn't 19?), achieved a feat on Friday 21 April by beating China's Fan Zhendong, world number 1, in the quarter-finals of the WTT Champions tournament in Macau, China.
Wall time: 51 ms


# 4. Random noise injection

## 4.1 Spelling mistakes injection/correction
TODO:
- try to clean up wikipedia_eng so that it is as clean as the French one and so use only one if condition handling both and not two as currently
- merge the wikipedia dictionaries with other datasets that may exist, e.g. df_typos_wikipedia_eng & df_wikopaco (not hurry).

## 4.2 Typing mistakes injection
Note: For French, the dictionary was created manually. For the other languages, the files were taken from https://github.com/makcedward/nlpaug/tree/master/nlpaug/res/char/keyboard. It is possible that these files are incomplete (French was incomplete, that is why it was entirely redone by hand).

## 4.3 Swap sentences/words

## 4.4 Random token_noise/deletion

In [28]:
import re
import pandas as pd
import regex as reg
import json
import random

In [29]:
def random_noise(text, lang, num, method, proba=0.05):
    """
    Allows you to add random noise to text in seven different ways: adding spelling mistakes, correction of spelling mistakes, 
    addition of typos, inversion of sentences in a text, reversing words in text, adding tokens, deleting tokens.

    Parameters:
    - **text** (`str`):
        The text entered by the user.
    - **lang** (`str`):
        The text language to be indicated using the ISO 639-3 convention. For example ("fra" for "French"). 
        Depending on the method considered, only "eng" (English) and "fra" (French) are managed 
        to potentially all possible written languages in the world.
        See **method** to know exactly which method is applicable for a given language.
    - **num** (`int`):
        The number of words to replace in the text. 
        If **num** is greater than the number of replaceable words, we stop at the maximum number of replaceable words.
    - **method** (`str`):
        There are seven different methods available to add noise in a text: "spelling_mistake_correction",
        "spelling_mistake_injection", "typing_mistake_injection", "swap_sentences", "swap_words", "random_token_noise", "random_deletion".'
        - "spelling_mistake_correction".
        Spelling mistakes in a text are corrected using datasets of "error/correction" pairs. The languages managed by this method are:
        "eng" (English) and "fra" (French).
        - "spelling_mistake_injection".
        Spelling mistakes are injected in a text using datasets of "error/correction" pairs. The languages managed by this method are:
        "eng" (English) and "fra" (French).
        - "typing_mistake_injection".
        Typos mistakes are injected in a text using datasets of closest keyboard keys for a given character.
        The languages managed by this method are: "eng" (English) and "fra" (French).
        - "swap_sentences".
        Reverse the position of n sentences in a text. The languages managed by this method are: all written languages in the world.
        - "swap_words".
        Reverse the position of n words in a text. The languages managed by this method are: all written languages in the world.
        - "random_token_noise".
        Replaces a word with a "_" with a user-defined probability p. The languages managed by this method are: all written languages in the world.
        - "random_deletion".
        Delete a word with a user-defined probability p. The languages managed by this method are: all written languages in the world.
    - **proba** (`float`, *optional*, defaults to `0.05`):
        Probability of replacing a word with another value ("_" or "") defined by the user.
        Only used for the "random_token_noise" and "random_deletion" methods. 
    """

    if method == "spelling_mistake_correction":
        if lang=="fra":
            # pandas version
            df_wicopaco = pd.read_csv("df_wicopaco.csv").to_dict()
            # datasets version
#             df_wicopaco = load_dataset("fraug-library/wicopaco")["train"].to_pandas()
            for idx in range(len(df_wicopaco["error"])):
                text =  re.sub(" "+df_wicopaco["error"][idx]+" "," "+df_wicopaco["correction"][idx]+" ", text) 
            return text
#         if lang=="fra":
#             df_typos_wikipedia = pd.read_csv("df_typos_wikipedia_fra.csv").to_dict() # https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:AutoWikiBrowser/Typos
#             for idx in range(len(df_typos_wikipedia["error"])):
#                 text =  re.sub(" "+df_typos_wikipedia["error"][idx]+" ", " "+df_typos_wikipedia["correction"][idx]+" ", text)
#             return text
        if lang=="eng":
            # TODO: check https://www.dcs.bbk.ac.uk/~ROGER/corpora.html to propose other errors (or merge them all into a single file)
            df_typos_wikipedia = pd.read_csv("df_typos_wikipedia_eng.csv").to_dict() # from https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos
            for idx in range(len(df_typos_wikipedia["error"])):
                text =  reg.sub(" "+str(df_typos_wikipedia["error"][idx])+" ", " "+str(df_typos_wikipedia["correction"][idx]).replace('"','')+" ", text)
            return text
        else:
            return 'Invalid language for the method "spelling_mistake_correction", please choose "fra" (French)'

# code expected in the future when the data will be clean
#         if lang in ["fra","eng"]:
#             df_typos_wikipedia = pd.read_csv("df_typos_wikipedia_fra.csv").to_dict()
#             for idx in range(len(df_typos_wikipedia["error"])):
#                 text =  re.sub(" "+df_typos_wikipedia["error"][idx]+" ", " "+df_typos_wikipedia["correction"][idx]+" ", text)
#             return text
        
        if lang=="eng":
            # TODO: check https://www.dcs.bbk.ac.uk/~ROGER/corpora.html to propose other errors (or merge them all into a single file)
            df_typos_wikipedia = pd.read_csv("df_typos_wikipedia_eng.csv").to_dict() # from https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos
            for idx in range(len(df_typos_wikipedia["error"])):
                text =  reg.sub(" "+str(df_typos_wikipedia["error"][idx])+" ", " "+str(df_typos_wikipedia["correction"][idx]).replace('"','')+" ", text)
            return text
        else:
            return 'Invalid language for the method "spelling_mistake_correction", please choose "fra" (French)'
        
# code expected in the future when the data will be clean
#         if lang in ["fra","eng"]:
#             df_typos_wikipedia = pd.read_csv("df_typos_wikipedia_fra.csv").to_dict()
#             for idx in range(len(df_typos_wikipedia["correction"])):
#                 text =  re.sub(" "+df_typos_wikipedia["correction"][idx]+" ", " "+df_typos_wikipedia["error"][idx]+" ", text)
#             return text

        
    if method == "spelling_mistake_injection":
        if lang=="fra":
            df_wicopaco = pd.read_csv("df_wicopaco.csv").to_dict()
            for idx in range(len(df_wicopaco["correction"])):
                text =  re.sub(" "+df_wicopaco["correction"][idx]+" "," "+df_wicopaco["error"][idx]+" ", text) 
            return text
#         if lang=="fra":
#             df_typos_wikipedia = pd.read_csv("df_typos_wikipedia_fra.csv").to_dict() # https://fr.wikipedia.org/wiki/Wikip%C3%A9dia:AutoWikiBrowser/Typos
#             for idx in range(len(df_typos_wikipedia["correction"])):
#                 text =  re.sub(" "+df_typos_wikipedia["correction"][idx]+" ", " "+df_typos_wikipedia["error"][idx]+" ", text)
#             return text
        if lang=="eng":
            # TODO: check https://www.dcs.bbk.ac.uk/~ROGER/corpora.html to propose other errors (or merge them all into a single file)
            df_typos_wikipedia = pd.read_csv("df_typos_wikipedia_eng.csv").to_dict() # from https://en.wikipedia.org/wiki/Wikipedia:AutoWikiBrowser/Typos
            for idx in range(len(df_typos_wikipedia["correction"])):
                text =  reg.sub(" "+str(df_typos_wikipedia["correction"][idx])+" ", " "+str(df_typos_wikipedia["error"][idx]).replace('"','')+" ", text)
            return text
        else:
            return 'Invalid language for the method "spelling_mistake_correction", please choose "fra" (French)'

        
        
    if method == "typing_mistake_injection":
        if lang in ["deu", "eng", "fra", "heb", "ita", "nld", "pol", "spa", "tha", "tur", "ukr"]:
            # local version
#             with open(f'keyboard_{lang}.txt', encoding="utf-8") as file:
#                 data = file.read()
#                 keyboard = json.loads(data)
            # datasets version
            dataset = load_dataset("fraug-library/keyboards",data_files=f"keyboard_{lang}.txt")["train"]["text"]
            keyboard = json.loads("".join(dataset))
                
            list_word_to_change =  random.sample(text.split(), k=num)
            for word_to_change in list_word_to_change:
                letter_to_change = random.choice(word_to_change).lower() # lower because the keyboard dictionary only manage lower case letters for the moment
                while letter_to_change not in keyboard.keys(): # verification because dictionaries may be incomplete
                    letter_to_change = random.choice(word_to_change).lower() 
                text = re.sub(" "+word_to_change+" "," "+word_to_change.lower().replace(letter_to_change,random.choice(keyboard[letter_to_change]))+" ", text)
            return text
        else:
            return 'Invalid language for the method "typing_mistake_injection", please choose between "deu" (German), "eng" (English), "fra" (French), "heb" (Hebrew), "ita" (Italian), "nld" (Dutch), "pol" (Polish), "spa" (Spanish), "tha" (Thai), "tur" (Turk) and "ukr" (Ukrainian).'        
    
    
    if "swap" in method:
        if method == "swap_sentences":
            words = text.split(". ")
            pass
        if method == "swap_words":
            words = text.split()
            pass

        new_words = words.copy()
        for _ in range(num):
            random_idx_1 = random.randint(0, len(new_words)-1)
            random_idx_2 = random_idx_1
            counter = 0
            while random_idx_2 == random_idx_1:
                random_idx_2 = random.randint(0, len(new_words)-1)
                counter += 1
                if counter > 3:
                    break
            new_words[random_idx_1], new_words[random_idx_2] = new_words[random_idx_2], new_words[random_idx_1] 
        return ' '.join(new_words)   
    
    
    if "random" in method:  
        words = text.split()
        # if there's only one word, don't delete it
        if len(words) == 1:
            return words
        # randomly replace words by a token with probability p
        new_words = []
        for word in words:
            uniform = random.uniform(0, 1)
            if uniform > proba:
                new_words.append(word)
            else:
                if method == "random_token_noise":
                    new_words.append("_") # token used by https://arxiv.org/abs/1703.02573
                else: # method == "random_deletion"
                    pass
        # if you end up deleting all words, just return a random word
        if len(new_words) == 0:
            rand_int = random.randint(0, len(words)-1)
            return [words[rand_int]]
        sentence = ' '.join(new_words)
        return sentence
        
    else:
        'Invalid method please choose beetwen "spelling_mistake_correction", "spelling_mistake_injection", "typing_mistake_injection", "swap_sentences", "swap_words", "random_token_noise", "random_deletion".'

In [30]:
%%time
text_eng = " The amount of times for this article peer reviewed is huge."
print(text_eng)

print(random_noise(text_eng,"eng",3, method="spelling_mistake_correction"))

 The amount of times for this article peer reviewed is huge.
 The number of times for this article peer-reviewed is huge.
Wall time: 3.23 s


In [31]:
%%time
print(text)

text_test = text+ " Il sera décroré pour ce succès."
print(random_noise(text_test,"fra",3, method="spelling_mistake_correction"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine. Il sera décoré pour ce succès.
Wall time: 1.49 s


In [32]:
%%time
print(text)

text_test = text+ " Il sera décoré pour ce succès."
print(random_noise(text_test,"fra",3, method="spelling_mistake_injection"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste françaissd Alexis Lebrun, 19 ans, ā ralisé uan exploit cee vendredi 21 april eb basttant el Chinois Fan Zhendong, numero 1 mondial, eb quarts of fianle dsu tounoi WTT Champions of Macao, eb Chine. ll sere déoré pout cee succès.
Wall time: 1.44 s


In [33]:
%%time
# original text
print(text)

# new text
print(random_noise(text,"fra",3, method="typing_mistake_injection"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, 2 réalisé u? exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 4 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 10.5 ms


In [34]:
%%time
# original text
print(text)

# new text
text_test = text+ " Il sera décroré pour ce succès."
print(random_noise(text_test,"fra",3, method="swap_sentences"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Il sera décroré pour ce succès. Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine
Wall time: 0 ns


In [35]:
%%time
# original text
print(text)

# new text
print(random_noise(text,"fra",3, method="swap_words"))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
numéro pongiste français Alexis Lebrun, 19 ans, mondial, réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Le Zhendong, 1 a en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 0 ns


In [36]:
%%time
# original text
print(text)

# new text
print(random_noise(text,"fra",3, method="random_token_noise", proba=0.1))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 0 ns


In [37]:
%%time
# original text
print(text)

# new text
print(random_noise(text,"fra",3, method="random_deletion", proba=0.1))

Le pongiste français Alexis Lebrun, 19 ans, a réalisé un exploit ce vendredi 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Le français Alexis Lebrun, 19 ans, a réalisé un exploit ce 21 avril en battant le Chinois Fan Zhendong, numéro 1 mondial, en quarts de finale du tournoi WTT Champions de Macao, en Chine.
Wall time: 0 ns
