<a href="https://colab.research.google.com/github/arampacha/nlp_tools/blob/main/backtranslation_hf_helsinki.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Generating backtranslations

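This notebook generates backtranslations with the pretrained OPUS-MT translation models published by the Helsinki-NLP group on the Hugging Face model hub. Each text is translated into a second language and then back into the original language, which yields a paraphrase of the input.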
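
The round trip can be sketched without any model. In the minimal sketch below, `round_trip`, `toy_en2es` and `toy_es2en` are hypothetical names, and the two dictionaries are stand-ins for real translation models; only the composition of a forward and a backward step mirrors what this notebook does.

```python
from typing import Callable, List

# A translator maps a batch of texts to a batch of translated texts.
Translator = Callable[[List[str]], List[str]]

def round_trip(texts: List[str], forward: Translator, backward: Translator) -> List[str]:
    """Translate into the second language, then back into the original one."""
    return backward(forward(texts))

# Hypothetical lookup tables standing in for real translation models.
EN_ES = {"hello, world": "Hola, mundo."}
ES_EN = {"Hola, mundo.": "Hello, world."}

def toy_en2es(texts: List[str]) -> List[str]:
    return [EN_ES[t] for t in texts]

def toy_es2en(texts: List[str]) -> List[str]:
    return [ES_EN[t] for t in texts]

print(round_trip(["hello, world"], toy_en2es, toy_es2en))
```

The result differs slightly from the input (capitalization, final period), which is the kind of variation backtranslation is meant to produce.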

In [1]:
import sys
if 'google.colab' in sys.modules:
    !pip install -Uqq transformers sentencepiece fastai

In [1]:
import numpy as np
import pandas as pd
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

## test

In [4]:
tok_enes = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
en2es = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-en-es")

In [6]:
input_ids = tok_enes.encode_plus('hello, world', return_tensors='pt').input_ids

In [7]:
out_ids = en2es.generate(input_ids)

In [8]:
out_ids

tensor([[65000,  2119,     2,   372,     3,     0]])

In [11]:
tok_enes.decode(out_ids[0].numpy(), skip_special_tokens=True)

'Hola, mundo.'

In [12]:
tok_esen = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-es-en")
es2en = AutoModelForSeq2SeqLM.from_pretrained("Helsinki-NLP/opus-mt-es-en")

In [13]:
out_ids = es2en.generate(out_ids)

In [17]:
tok_esen.decode(out_ids[0].numpy(), skip_special_tokens=True)

'Hello, world.'

## Setup

In [2]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

In [3]:
def get_models(lang1, lang2):
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    tokenizer = AutoTokenizer.from_pretrained(f"Helsinki-NLP/opus-mt-{lang1}-{lang2}")
    fwd = AutoModelForSeq2SeqLM.from_pretrained(f"Helsinki-NLP/opus-mt-{lang1}-{lang2}").to(device)
    bwd = AutoModelForSeq2SeqLM.from_pretrained(f"Helsinki-NLP/opus-mt-{lang2}-{lang1}").to(device)
    return tokenizer, fwd, bwd
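
`get_models` builds both checkpoint names from the same pattern, `Helsinki-NLP/opus-mt-{source}-{target}`. The small sketch below prints the names a given pair resolves to, which can help when checking by hand whether both directions exist on the hub before downloading. The helper name `opus_mt_names` is hypothetical, and the sketch does not query the hub.

```python
from typing import Tuple

def opus_mt_names(lang1: str, lang2: str) -> Tuple[str, str]:
    """Return the (forward, backward) checkpoint names for a language pair."""
    fwd_name = f"Helsinki-NLP/opus-mt-{lang1}-{lang2}"
    bwd_name = f"Helsinki-NLP/opus-mt-{lang2}-{lang1}"
    return fwd_name, bwd_name

# The language pairs used later in this notebook.
for lang2 in ("es", "de", "fr", "ru", "nl"):
    print(opus_mt_names("en", lang2))
```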

In [4]:
# tokenizer, fwd, bwd = get_models('en', 'es')

In [23]:
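def backtranslate(texts, tokenizer, fwd, bwd, num_beams=1):
    # Tokenize the batch: pad to the longest text, truncate at 512 tokens
    enc = tokenizer(texts, return_tensors='pt', padding=True, max_length=512, truncation=True).to(device)
    # lang1 -> lang2; the attention mask keeps the model from attending to padding
    output_ids = fwd.generate(enc.input_ids, attention_mask=enc.attention_mask, num_beams=num_beams)
    # lang2 -> lang1, feeding the generated ids straight into the backward model
    res_ids = bwd.generate(output_ids, num_beams=num_beams)
    return [tokenizer.decode(ids.detach().cpu(), skip_special_tokens=True).replace('▁', ' ').strip() for ids in res_ids]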
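
The last line of `backtranslate` cleans up the decoded string: `▁` (U+2581) is the marker SentencePiece uses for a word boundary, and it can survive decoding here. The minimal sketch below isolates that cleanup step. The function name `clean_pieces` is hypothetical, and the input string is a hand-written example rather than real tokenizer output.

```python
def clean_pieces(decoded: str) -> str:
    """Replace SentencePiece word-boundary markers (U+2581) with spaces and trim the ends."""
    return decoded.replace('\u2581', ' ').strip()

print(clean_pieces('\u2581Hello,\u2581world.'))
```

Text that contains no markers passes through unchanged apart from the trimming.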

In [7]:
# out = backtranslate(['some random text which might go out different'], tokenizer, fwd, bwd)
# out

['some random text that might come out differently']

In [18]:
def generate(df, tok, fwd, bwd, bs=16, num_beams=1):
    res = []
    # Split the row positions into ceil(len(df)/bs) batches of at most bs rows
    for idx in tqdm(np.array_split(np.arange(len(df)), int(np.ceil(len(df)/bs)))):
        texts = df.iloc[idx, 1].to_list()  # column 1 is expected to hold the text
        res.extend(backtranslate(texts, tok, fwd, bwd, num_beams=num_beams))
    return pd.DataFrame({'text':res})

In [19]:
def generate_backtranslations(input_fn, output_fn, lang1:str, lang2:str, num_beams:int=1, bs:int=16):
    df = pd.read_csv(input_fn)
    tokenizer, fwd, bwd = get_models(lang1, lang2)
    # generate() already returns a DataFrame with a 'text' column
    btr_df = generate(df, tokenizer, fwd, bwd, bs=bs, num_beams=num_beams)
    btr_df.to_csv(output_fn)
    return btr_df
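
The batching in `generate` relies on `np.array_split`, which splits the row positions into a fixed number of nearly equal chunks rather than into chunks of exactly `bs` rows. The sketch below shows the resulting batch counts and sizes. It assumes a frame of 1,000 rows, which is consistent with the progress-bar totals in this notebook, and at least one row. The helper name `batch_indices` is hypothetical.

```python
import numpy as np

def batch_indices(n_rows: int, bs: int):
    """Split positions 0..n_rows-1 into ceil(n_rows / bs) nearly equal batches."""
    n_batches = int(np.ceil(n_rows / bs))
    return np.array_split(np.arange(n_rows), n_batches)

for bs in (4, 16, 32, 64):
    batches = batch_indices(1000, bs)
    # batch size, number of batches, size of the largest batch
    print(bs, len(batches), max(len(b) for b in batches))
```

No batch exceeds `bs` rows, so `bs` acts as an upper bound on memory use per step.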

## data prep

In [9]:
from fastai.text.all import *

In [10]:
path = untar_data(URLs.IMDB_SAMPLE)
path.ls()

(#1) [Path('/root/.fastai/data/imdb_sample/texts.csv')]

In [11]:
df = pd.read_csv(path/'texts.csv')
df.head()

Unnamed: 0,label,text,is_valid
0,negative,"Un-bleeping-believable! Meg Ryan doesn't even look her usual pert lovable self in this, which normally makes me forgive her shallow ticky acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what kind of suicide trip has his career been on? Whoosh... Banzai!!! Finally this was directed by the guy who did Big Chill? Must be a replay of Jonestown - hollywood style. Wooofff!",False
1,positive,"This is a extremely well-made film. The acting, script and camera-work are all first-rate. The music is good, too, though it is mostly early in the film, when things are still relatively cheery. There are no really superstars in the cast, though several faces will be familiar. The entire cast does an excellent job with the script.<br /><br />But it is hard to watch, because there is no good end to a situation like the one presented. It is now fashionable to blame the British for setting Hindus and Muslims against each other, and then cruelly separating them into two countries. There is som...",False
2,negative,"Every once in a long while a movie will come along that will be so awful that I feel compelled to warn people. If I labor all my days and I can save but one soul from watching this movie, how great will be my joy.<br /><br />Where to begin my discussion of pain. For starters, there was a musical montage every five minutes. There was no character development. Every character was a stereotype. We had swearing guy, fat guy who eats donuts, goofy foreign guy, etc. The script felt as if it were being written as the movie was being shot. The production value was so incredibly low that it felt li...",False
3,positive,"Name just says it all. I watched this movie with my dad when it came out and having served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrate on a short period of the man's life - interestingly enough the man's entire life would have made such an epic bio-pic that it is staggering to imagine the cost for production.<br /><br />Some posters elude to the flawed characteristics about the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" are not just mere words blathered from the lips of a high-brassed offic...",False
4,negative,"This movie succeeds at being one of the most unique movies you've seen. However this comes from the fact that you can't make heads or tails of this mess. It almost seems as a series of challenges set up to determine whether or not you are willing to walk out of the movie and give up the money you just paid. If you don't want to feel slighted you'll sit through this horrible film and develop a real sense of pity for the actors involved, they've all seen better days, but then you realize they actually got paid quite a bit of money to do this and you'll lose pity for them just like you've alr...",False


## es

In [12]:
tokenizer, fwd, bwd = get_models('en', 'es')

In [13]:
btr_es = []
for idx in tqdm(np.array_split(df.index.to_numpy(), int(np.ceil(len(df)/4)))):
    texts = df.iloc[idx, 1].to_list()
    btr_es.extend(backtranslate(texts, tokenizer, fwd, bwd))

In [15]:
btr_es = pd.DataFrame({'text':btr_es, 'label':df.label})
btr_es.head()

Unnamed: 0,text,label
0,"In-beeping-creible! Meg Ryan doesn’t even see her usual adorable pert in this, which usually makes me forgive her shallow schtick tickling performance. Hard to believe she was the producer in this dog. Besides Kevin Kline: what kind of suicide trip has been in her career? Whoosh... Banzai!!! Finally this was directed by the guy Big Chill did? It must be a Jonestown replay - Hollywood style. Wooofff!",negative
1,"In the truth movie, in the truth movie, in which people feel very good, in which people feel very good, in which people feel very good, in which people feel great, in which people feel very good, in which people feel great, in which people feel great, in which people feel great, in where people feel great, in where people feel great, in where people feel great, in where people feel very good, in where people feel very good, in where people feel great, in where people feel great, in where people feel great, in where people feel great, in where people feel great, in where people feel great, ...",positive
2,"Every once in a while a movie will come along for a while that will be so horrible that I feel compelled to warn people. If I work all my days and can save only one soul from watching this movie, the big thing will be my joy.<br /br /> Where to start my discussion of pain. To start with, there was a musical montage every five minutes. There was no character development. Each character was a stereotype. We had a guy curse, fat guy eats doughnuts, weird dumb guy, etc. The script felt like it was written as if the film was being filmed. The value of production was so incredibly low that it fe...",negative
3,"The name says it all. I saw this film with my father when he came out and having served in Korea, I had a great admiration for the man. The disappointing thing about this film is that it only concentrates on a short period of man's life - which is interesting that all the life of man would have done such an epic biography that it is amazing to imagine the cost of production.<br /br />Some posters elude the wrong characteristics of man, which are cheap shots.The theme of the film ""Duty, Honor, Country"" is not just mere words blastered from the lips of a high-armed officer - it is the profou...",positive
4,"This movie is successful in being one of the most unique films you’ve ever seen. However, this comes from the fact that you can’t make heads or tails of this disaster. It almost seems like a set of challenges set to determine whether you’re willing or not to leave the movie and give up the money you just paid. If you don’t want to feel disapproved you’ll sit through this horrible movie and develop a real sense of pity for the actors involved, they’ve all seen better days, but then you realize that they were actually paid a little money to do this and you’ll lose pity for them, as you’ve al...",negative
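
Row 1 above shows the model looping on a single phrase. One way to flag such rows before using them is to measure how many word n-grams repeat. The sketch below is a simple heuristic, not part of the original pipeline: the function name `repeated_ngram_ratio`, the n-gram length of 4, and any threshold applied to the score are assumptions.

```python
from collections import Counter

def repeated_ngram_ratio(text: str, n: int = 4) -> float:
    """Share of word n-grams that duplicate an earlier n-gram (0.0 means no repetition)."""
    words = text.split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)

looped = "in which people feel very good, " * 10
normal = "Every character in this film was a stereotype."
print(repeated_ngram_ratio(looped), repeated_ngram_ratio(normal))
```

A high score marks a row worth inspecting or dropping. A suitable cutoff depends on the data and would need checking against a sample.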


In [None]:
# btr_es.to_csv(path/'btr_es.csv')

## de

In [16]:
del tokenizer, fwd, bwd
tokenizer, fwd, bwd = get_models('en', 'de')

In [17]:
btr_de = generate(df, tokenizer, fwd, bwd)
btr_de.head()

["Meg Ryan doesn't even look like her usual pert lovable self in what usually makes me forgive her flat ticky acting snail. Hard to believe she was the producer on this dog. Also Kevin Kline: what kind of suicide journey was his career on? Whoosh... Banzai!!! Finally, this was staged by the guy who made Big Chill? Must be a repeat of Jonestown - Hollywood style. Wooofff!",
 "It's a very good thing, but it's not so easy that it doesn't make it clear that it's not really cinematic. The actors, script and camera work are all top-notch. Even the music is good, although it's mostly early in the movie when things are still relatively cheerful. There are no really superstars in the cast, although several faces will be familiar. The entire cast does an excellent job with the script. But it's hard to observe because there's no good end to a situation like one is presented. It's now fashionable to blame the British for putting Hindus and Muslims against each other. It seems more likely that the 

In [None]:
# btr_de.to_csv(path/'btr_de.csv')

## fr

In [20]:
del tokenizer, fwd, bwd

In [21]:
tokenizer, fwd, bwd = get_models('en', 'fr')

In [24]:
btr_fr = generate(df, tokenizer, fwd, bwd, bs=32)
btr_fr.head()

Unnamed: 0,text
0,"Meg Ryan doesn't even seem like her usual pert even adorable in what, which normally makes me forgive her actor schtick at the tip of the cat. Hard to believe she was the producer on this dog. More Kevin Kline: What kind of suicide trip was her career? Whoosh... Banzai!!! Finally, it was led by the guy who did Big Chill? Must be a Jonestown replay - Hollywood style. Wooofff!"
1,"There are no superstars in the play, but there are no superstars in the case where the English have been against Hindus and Muslims, and then they are cruelly divided into two countries. There is also some merit in this vision, but it is true that no one has forced Hindus and Muslims in the region to be mistaken as they did at the time of the score. It seems more likely that the British have simply seen tensions between religions and have never had the sense of exploiting them for their own purposes."
2,"If I work all my days and I can save one soul from watching this film, how great will be my joy.<br /><br />Where to start my pain discussion. To start, there was a musical montage every five minutes. There was no character development. Each character was a stereotype. We had sworn guys, big guys who eat donuts, goofy stranger, etc. The script felt like it was written as the film was shot. The production value was so low that it seemed like I was watching a high junior video presentation. Have you directors, producers, etc. ever seen a movie before? Halestorm is getting worse and worse wit..."
3,"I watched this film with my father when he came out and served in Korea, he had a great admiration for man. The disappointing thing of this film is that he focuses only on a short period of man's life - interestingly the whole life of man would have made a biopic so epic that it is stunning to imagine the cost of production.<br /><br />Some posters elude the imperfect characteristics of man, which are cheap clichés. The theme of the film ""Duty, Honor, Country"" are not only words blathered from the lips of a high-brown officer - it is the deep statement of a man's total devotion to his coun..."
4,"This film comes to be one of the most unique movies you've seen. However, it comes from the fact that you can't make heads or tails of this mess. It seems almost like a series of challenges put in place to determine whether or not you are willing to get out of the movie and give up the money you just paid. If you don't want to feel slightly you're going to sit through this horrible movie and develop a real feeling of pity for the actors involved, they've all seen better days, but then you realize that they actually paid a little money to do that and you'll lose pity for them as you've alre..."


In [None]:
# btr_fr.to_csv(path/'btr_fr.csv')

## ru

In [25]:
del tokenizer, fwd, bwd
tokenizer, fwd, bwd = get_models('en', 'ru')

In [26]:
btr_ru = generate(df, tokenizer, fwd, bwd, bs=64)
btr_ru.head()

Unnamed: 0,text
0,"Meg Ryan doesn't even look at himself as usual, loving, which usually makes me forgive her shallow, tickling actor sketch."
1,"There is no real superstar in this film, no face will be familiar. All the acting work with the script. (br.) But it's hard to see (mainly in the film, when all things are relatively fun.) There is no real superstar in the world, although some faces will know. The whole scene works perfectly with the script. (br.) But it's not very good to see because there's no good understanding of the situation, like the one in this story/in this style. At present, you could blame the British who are more clever than their Hindus and Muslims and then brutally divide them into two countries. In this case..."
2,"Every time a movie happens, it's gonna be so terrible that I feel compelled to warn people, if I work all day and I can save one soul from watching this movie, how great my joy will be."
3,"I watched this movie with my father when he came out and served in Korea, he was very impressed by the man, and the disappointing thing about this movie is that he only focuses on a short period of human life -- it's interesting that a person's whole life would make an epic biopic that it's amazing to imagine the cost of production."
4,"This film may be one of the most unique films you've just seen, but it's because you can't make the head or tails of this mess. Almost like a series of tasks that you can decide whether you want to leave the film and give up the money you just paid. If you don't want to feel unattractive, you're gonna sit in this horrible movie and develop a real feeling of pity for the actors, I saw it all when all these games were best, but then you realize that they were actually paid enough money to do it, and you're gonna lose pity for them, like you've already done for the film. I can't go on with th..."


In [None]:
# btr_ru.to_csv(path/'btr_ru.csv')

## nl

In [27]:
del tokenizer, fwd, bwd
tokenizer, fwd, bwd = get_models('en', 'nl')

In [28]:
btr_nl = generate(df, tokenizer, fwd, bwd, bs=64)
btr_nl.head()

Unnamed: 0,text
0,"Meg Ryan doesn't even look her usual pert lovely self in this, which normally forgives me her superficial touch of acting schtick. Hard to believe she was the producer on this dog. Plus Kevin Kline: what suicide journey has his career been? Whoosh... Banzai!!! Finally this was directed by the man who did Big Chill? Must be a replay of Jonestown - Hollywood style. Wooofff!"
1,"This is a very well-made film. Acting, scripting and camera work are all first class.The music is also good, although it is usually early in the film, when things are still relatively happy. There are no really superstars in the cast, although different faces will be known. The whole cast does an excellent job with the script.<br /><br />But it is hard to look, because there is no good end to a situation like that one presented. It is now fashionable to accuse the British of placing Hindus and Muslims against each other, and then cruel enough to separate them in two countries. There is als..."
2,"A film that will be so terrible that I feel compelled to warn people. If I work all my days and I can save only one soul from watching this movie, how great will be my joy.<br /><br />Where to start my discussion about pain. To begin, there was a musical editing every five minutes. There was no character development. Each character was a stereotype. We had cursed man, fat guy who eats donuts, goofy foreign man, etc. The script felt like it was written when the movie was recorded. The production value was so incredibly low that it felt like I was watching a junior high video presentation. H..."
3,"The name says it all. I watched this film with my father when it came out and had served in Korea he had great admiration for the man. The disappointing thing about this film is that it only concentrates on a short period of the man's life - interestingly enough the man would have made his whole life such an epic bio-pic that it is staggering to imagine the cost of production.<br /><br />Some posters avoid the defects of the man, which are cheap shots. The theme of the movie ""Duty, Honor, Country"" is not only words he has plod out of the lips of a high-scrappy officer - it is the deep expl..."
4,"This film manages to be one of the most unique films you've seen. However, this comes from the fact that you can't make head or tail of this mess. It almost seems like a series of challenges set to determine whether you're willing to walk out of the movie and give the money you just paid for. If you don't want to be tempted, then you'll get a little bit of pity for this horrible movie and then you'll get a real feeling of compassion for the actors involved. They've all seen better days, but then you realize that they've actually got a little bit of money to do this and then it would have b..."


In [None]:
# btr_nl.to_csv(path/'btr_nl.csv')