# Homework 3

**How to submit.** For this homework, submit this `.ipynb` file with your answers.

## BERT

### Task 1: BERT + text classification (3 points)

In [practice session 4](https://colab.research.google.com/drive/1yQM8c_idzBLO0efPQ9crrjZPyqvhqaLG?usp=sharing), we trained a text classification model on the IMDb dataset using BERT's sentence representations. We used a randomly selected part of the dataset, froze all layers of the pre-trained `bert-base-uncased` model, and only trained the classifier itself, which takes the final-layer representation of the [CLS] token as input. In this task, you will try to apply a different strategy to the same task.

Use the whole IMDb dataset. Use the `train` split (25,000 examples) as the training set, and the `test` split (25,000 examples) as the development set.

Train a binary classifier based on embeddings produced by a BERT-like model. You may use `bert-base-uncased`, as we did in the lab, or any other model of the same class. You may employ any strategy except the one used in the practice session: for example, you can fine-tune the whole model and have a classifier on top of the [CLS] token, or extract embeddings from the model first and then train an independent classifier using those (see [this post](http://jalammar.github.io/a-visual-guide-to-using-bert-for-the-first-time/)). You can use the output of the last layer, any intermediate layer, or a combination of several layers. Instead of using the embeddings of the [CLS] token, you can use an average of the embeddings of all words in the text. You can try changing the learning rate, the number of training epochs, and other hyperparameters.

The evaluation accuracy of your model should be over 83%. You get **1 bonus point** if it's over 93%.

**Subtask 1 (1 point).** Describe the details of your approach.

* What model are you using (e.g. `bert-base-uncased`, `distilbert-base-cased`, etc.)?
* Are you fine-tuning the parameters of the model or just using outputs of the pre-trained model?
* What does your classifier look like (e.g. `AutoModelForSequenceClassification`, a feed-forward neural network, logistic regression, etc.)?
* What is the classifier's input (the final-layer representation of the [CLS] token, the average of final-layer word representations, etc.)?
* What hyperparameters does your model have (learning rate, number of epochs)?

**Your answer:** 

I used __distilbert-base-cased__ because it is a lightweight version of BERT and would train faster on my computer. I ran many experiments fine-tuning the whole network, but none of them gave consistent results, so __I decided to fine-tune only the classification (FC) layer and the embedding layer__, leaving the transformer encoder layers frozen. Since I was training an end-to-end model, I used __AutoModelForSequenceClassification__ for the sake of simplicity.


Hyperparameters used:

**Subtask 2 (2 points).** Train your model using the `transformers` and `datasets` libraries (refer to the practice session 4 materials for a detailed example).

In [85]:
# !pip install transformers
# !pip install datasets
# !pip install torch scikit-learn

## Evaluating MT

In the following tasks, you will explore MT quality metrics and work on evaluating the Estonian$\rightarrow$English model that you trained in homework 2. 

### Task 2: MT evaluation metrics (1 point)

There are quite a few metrics for evaluating machine translation. BLEU is the most popular automatic metric, but it has its disadvantages, which other metrics try to overcome. In this task we will compare these different metrics.

**Subtask 1 (1.5 points).** Explain the main idea behind five MT quality metrics (2-3 sentences about each metric is enough).

**BLEU:** Compares the machine translation (candidate) with one or more human reference translations. It computes the fraction of candidate n-grams (usually up to 4-grams) that also appear in the references, with two corrections: counts are clipped to the maximum number of times an n-gram occurs in any single reference, so repeating a matching word does not inflate the score (clipped precision), and a brevity penalty lowers the score when the candidate is shorter than the reference.
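The two corrections can be illustrated with a minimal sketch. This is not the code of any particular BLEU library; the function names `clipped_precision` and `brevity_penalty` are introduced here for illustration only, and tokenization is assumed to be plain whitespace splitting.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, references, n):
    """Fraction of candidate n-grams found in the references, with clipping."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    # For each n-gram, the largest count observed in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def brevity_penalty(candidate, references):
    """1.0 unless the candidate is shorter than the closest reference."""
    c = len(candidate)
    if c == 0:
        return 0.0
    closest = min(references, key=lambda ref: (abs(len(ref) - c), len(ref)))
    r = len(closest)
    return 1.0 if c >= r else math.exp(1 - r / c)


candidate = "the the the the".split()
references = ["the cat sat on the mat".split()]
print(clipped_precision(candidate, references, 1))
print(brevity_penalty(candidate, references))
```

Without clipping, every `the` in the candidate would count as a match; with clipping, only as many as the reference actually contains are credited.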

**chrF:** An F-score based on character n-grams rather than words. Character n-gram precision and recall are averaged over all n-gram orders and then combined into an F-score, usually with recall weighted higher than precision.
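A rough sketch of this computation follows. It assumes character n-grams up to order 6, a recall weight of `beta=2`, and spaces dropped before counting; real implementations differ in such details, and the function name `chrf` is only a label used here.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (an assumption of this sketch)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(candidate, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over orders, then take F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        cand, ref = char_ngrams(candidate, n), char_ngrams(reference, n)
        if not cand or not ref:
            continue  # one of the strings is too short for this order
        overlap = sum((cand & ref).values())
        precisions.append(overlap / sum(cand.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


print(chrf("she put the boot on", "she put the boots on"))
```

Because matching happens on characters, `boot` and `boots` still share most of their n-grams, whereas a word-level metric would treat them as a complete mismatch.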

**METEOR:** The __Metric for Evaluation of Translation with Explicit ORdering__ aligns the unigrams of the candidate with those of the reference and takes a harmonic mean of unigram precision and recall, with recall weighted higher than precision. A penalty is applied when the matched words are fragmented, i.e. appear in a different order than in the reference.
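The recall-weighted harmonic mean can be sketched as below. This covers only exact unigram matches and uses the 9:1 recall weighting of the original METEOR formulation; stemming, synonym matching, and the fragmentation penalty are deliberately left out, and `meteor_fmean` is a name made up for this illustration.

```python
from collections import Counter


def meteor_fmean(candidate, reference):
    """Recall-weighted harmonic mean of exact unigram precision and recall."""
    cand, ref = candidate.split(), reference.split()
    # Each reference token can be matched at most once.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    return 10 * precision * recall / (recall + 9 * precision)


short, long = "the cat", "the cat sat on the mat"
print(meteor_fmean(short, long))  # high precision, low recall
print(meteor_fmean(long, short))  # low precision, high recall
```

Swapping the roles of the two sentences shows the asymmetry: the pair with high recall scores noticeably better than the pair with high precision.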

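**TER:** Translation Edit Rate is similar to edit distance (Levenshtein distance): it counts the minimum number of edits (insertions, deletions, substitutions, and shifts of word sequences) needed to turn the candidate translation into one of the references, normalized by the reference length. Lower is better.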
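A simplified sketch of the idea is shown below. It computes a word-level Levenshtein distance divided by the reference length and ignores shifts entirely, so it is an upper bound on the real metric rather than TER itself; the name `ter_without_shifts` is chosen here to make that explicit.

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ter_without_shifts(candidate, reference):
    """Edits per reference word; block shifts are not modelled."""
    hyp, ref = candidate.split(), reference.split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)


print(ter_without_shifts("and she asked to wear boots",
                         "and she asked put the boots on"))
```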

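**BERTScore:** Computes pairwise cosine similarities between the contextual token embeddings of the candidate and the reference translation, greedily matches each token to its most similar counterpart, and aggregates the matches into precision, recall, and F1. The main goal is to measure semantic similarity rather than exact word overlap.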
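The greedy matching step can be sketched with toy vectors. The two-dimensional "embeddings" below are made up for illustration; the real metric uses contextual embeddings from a pre-trained model and can add idf weighting and baseline rescaling, none of which is modelled here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def greedy_match_score(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from best-match cosine similarities."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Toy vectors: the candidate has one token the reference does not cover.
candidate_vectors = [[1.0, 0.0], [0.0, 1.0]]
reference_vectors = [[1.0, 0.0]]
print(greedy_match_score(candidate_vectors, reference_vectors))
```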

### Task 3: Tricking BLEU (1 point)

The most popular automatic metric for evaluating machine translation quality is BLEU (bilingual evaluation understudy). It measures how close a translation is to a reference ("ground truth") translation produced by a human. We put "ground truth" in quotes, because, unlike with, say, classification, in machine translation there is no single correct answer. Several different translations can all be perfectly correct, while having very different wording.

BLEU claims high correlation with human judgements of how good a translation is. However, if we are looking at a BLEU score for one sentence, and not an average score over many sentences, that number can be misleading.

**Subtask 1 (0.5 points)**. Try to come up with examples of translations that can fool BLEU. (If you are unsure how BLEU works, check out the practice session 5 materials or search online.) Give an example of a sentence in some language you know, a good translation of this sentence into English, and a bad translation into English that would still get a decent BLEU score with the good translation as reference. (Please also explain what is happening in your non-English sentence, what it means, and why the bad translation is bad.)

In [31]:
from nltk.translate.bleu_score import sentence_bleu

# Source sentence (Portuguese): "e ela pediu 'bota a bota'"
reference = [
    "and she asked put the boots on".split()
]

candidate = "and she asked boot to boot".split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

candidate = "and she asked to wear boots".split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

BLEU score -> 4.888731610154635e-78
BLEU score -> 5.253283993027542e-78
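Both scores above are effectively zero. A likely reason is that neither candidate shares a 4-gram with the reference, so the 4-gram precision is zero and the unsmoothed geometric mean collapses. The sketch below recounts the n-gram matches by hand to check this; `ngram_matches` is a helper introduced here, not part of NLTK.

```python
from collections import Counter


def ngram_matches(candidate, reference, n):
    """Return (matched n-grams, total candidate n-grams) with clipping."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    return sum((cand & ref).values()), sum(cand.values())


reference = "and she asked put the boots on".split()
candidates = ["and she asked boot to boot", "and she asked to wear boots"]

for text in candidates:
    tokens = text.split()
    # One (matches, total) pair per n-gram order from 1 to 4.
    counts = [ngram_matches(tokens, reference, n) for n in range(1, 5)]
    print(text, "->", counts)
```

For sentences this short, a smoothing method such as the ones in NLTK's `SmoothingFunction` class, or a corpus-level score, is usually needed to get values that can be compared meaningfully.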


**Your answer:** 


**Subtask 2 (0.5 points).** Now do the same, but the other way around: come up with a sentence in your language, a good reference translation of that sentence into English, and another translation which is also good, but would have a low BLEU score when compared to the first translation. Explain.

**Your answer:** ...

### Task 4: Calculate your model's BLEU and BERTScore (2 points)

**Subtask 1 (0.5 points).** In the previous homework, you separated a test set of 2,000 lines. Preprocess this test set, translate it with the last checkpoint of your model, and postprocess the translation. Compare your translation to the reference translation (the English side of your test set) by calculating the BLEU score with `sacreBLEU`. (Do `pip install sacrebleu` if you don't have the package in your virtual environment.)

Use `sacreBLEU` in the following way:

`cat hypothesis.en | sacrebleu reference_translation.en`

Report the output.

**Your answer:** 

(full output below)

`{
 "name": "BLEU",
 "score": 18.3,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "45.2/21.6/13.2/8.7 (BP = 1.000 ratio = 1.073 hyp_len = 31581 ref_len = 29434)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}`

__BLEU__: 18.3
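As a sanity check, the reported score can be approximately recomputed from the `verbose_score` field: BLEU is the brevity penalty times the geometric mean of the four n-gram precisions. The sketch below assumes that reading of the output; the function name `bleu_from_precisions` is made up here, and because the precisions are rounded to one decimal the result only approximates the reported value.

```python
import math


def bleu_from_precisions(precisions_pct, hyp_len, ref_len):
    """Brevity penalty times the geometric mean of n-gram precisions (in %)."""
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    log_mean = sum(math.log(p / 100) for p in precisions_pct) / len(precisions_pct)
    return 100 * bp * math.exp(log_mean)


# Values copied from the verbose_score line above.
score = bleu_from_precisions([45.2, 21.6, 13.2, 8.7], hyp_len=31581, ref_len=29434)
print(f"{score:.2f}")
```

The hypothesis is longer than the reference (ratio 1.073), so the brevity penalty is 1.000 and the score is determined entirely by the n-gram precisions.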

In [12]:
!python ../homework2/dev/scripts/apply_sentencepiece.py \
    --action split \
    --corpora ../homework2/dev/data/test.et \
    --model ../homework2/dev/wordpieces

04/09/2022 09:44:39 PM INFO: Splitting file ../homework2/dev/data/test.et


In [14]:
!cat ../homework2/dev/data/wordpieces-test.et | fairseq-interactive ../homework2/dev/data/bin-data \
                                                    --source-lang et \
                                                    --target-lang en \
                                                    --path ../homework2/dev/model-checkpoint/checkpoint_best.pt \
                                                    --remove-bpe sentencepiece \
                                                    > data/hold_out_test_set.en

2022-04-09 21:46:36 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_na

In [17]:
!grep "^H" data/hold_out_test_set.en | cut -f3 > data/hypothesis_hold_out.en
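The `grep`/`cut` pipeline above relies on fairseq writing each hypothesis as a tab-separated line of the form `H-<id>`, score, text. A rough Python equivalent is sketched below; it additionally sorts by sentence id, which only matters if the output order ever differs from the input order. The sample lines are invented for illustration and are not real model output.

```python
def extract_hypotheses(lines):
    """Collect hypothesis texts from fairseq output lines, ordered by id."""
    hyps = {}
    for line in lines:
        if not line.startswith("H-"):
            continue  # skip source (S-), detailed (D-, P-) and log lines
        tag, _score, text = line.rstrip("\n").split("\t", 2)
        hyps[int(tag[2:])] = text
    return [hyps[i] for i in sorted(hyps)]


# Invented sample in the same layout as the fairseq output.
sample = [
    "S-1\tsecond source\n",
    "H-1\t-0.52\tsecond hypothesis\n",
    "S-0\tfirst source\n",
    "H-0\t-0.31\tfirst hypothesis\n",
]
print(extract_hypotheses(sample))
```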

In [19]:
!head data/hypothesis_hold_out.en

The rise of the inflicted on the rows, the rise in the distribution of colony, the rise in the spread of the massacre, the rise in the spread of the death penalty, the rise in the spread of the spread of the death penalty, the decline in the death penalty, the rise in the spread of the divisions, the rise in the number of inflicted on the rows, the rise in the spread of the massacres, the rise in the spread of the spread of the spread of the massacres, the massacres, the massacres.
in writing. - (EL) The European Parliament resolution coincides with one and transcends the reactionary policy of the EU in many respects and anti-democratic measures which, under the pretext of the fight against terrorism, restricts fundamental personal rights and democratic freedom of workers.
Right now.
More budget support is required, but also greater transparency and greater participation by Parliament, social mediators and local authorities.
This is a problem that certainly does not exist if the Eu

In [20]:
!head ../homework2/dev/data/test.en

alanine aminotransferase increased, aspartate aminotransferase increased, blood creatine phosphokinase increased, blood glucose increased, blood pressure decreased, blood prolactin increased, body temperature decreased, body temperature increased, electrocardiogram QT prolonged, eosinophil count increased, haematocrit decreased, haemoglobin decreased, heart rate increased, transaminases increased, white blood cell count decreased
in writing. - (EL) The European Parliament resolution matches and in many respects outstrips the reactionary policy and anti-democratic measures of the EU, which, on the pretext of fighting terrorism, restricts the fundamental personal rights and democratic freedoms of workers.
Now!
More budgetary support is needed, but also greater transparency and involvement by parliaments, the social midfield and local authorities.
It is a problem that, to be sure, would not exist if the European Union - as well as condemning it so many times - had truly rejected it, r

In [18]:
!cat data/hypothesis_hold_out.en | sacrebleu ../homework2/dev/data/test.en

{
 "name": "BLEU",
 "score": 18.3,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "45.2/21.6/13.2/8.7 (BP = 1.000 ratio = 1.073 hyp_len = 31581 ref_len = 29434)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}

**Subtask 2 (0.5 points).** Now let's see how well your model does on data that does not come from the corpora on which the model was trained. Copy the test set from `/gpfs/space/projects/mt2022/data/test-set/`. There are two files. `test-src.et` contains the Estonian side of the test set. Preprocess this set, translate it with the last checkpoint of your model, postprocess the result. Use `sacreBLEU` to calculate BLEU with reference to `test-ref.en`. Report the output of `sacreBLEU`.

**Your answer:** 

(full output below)

`{
 "name": "BLEU",
 "score": 10.0,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "39.1/14.0/6.2/2.9 (BP = 1.000 ratio = 1.158 hyp_len = 25775 ref_len = 22256)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}`

__BLEU__: 10.0

In [3]:
!python ../homework2/dev/scripts/apply_sentencepiece.py \
    --action split \
    --corpora data/test-src.et \
    --model ../homework2/dev/wordpieces

04/09/2022 01:55:38 PM INFO: Splitting file data/test-src.et


In [21]:
!cat data/wordpieces-test-src.et | fairseq-interactive ../homework2/dev/data/bin-data \
                                                      --source-lang et \
                                                      --target-lang en \
                                                      --path ../homework2/dev/model-checkpoint/checkpoint_best.pt \
                                                      --remove-bpe sentencepiece \
                                                      > data/external_test_set.en

2022-04-09 22:01:52 | INFO | fairseq_cli.interactive | {'_name': None, 'common': {'_name': None, 'no_progress_bar': False, 'log_interval': 100, 'log_format': None, 'log_file': None, 'aim_repo': None, 'aim_run_hash': None, 'tensorboard_logdir': None, 'wandb_project': None, 'azureml_logging': False, 'seed': 1, 'cpu': False, 'tpu': False, 'bf16': False, 'memory_efficient_bf16': False, 'fp16': False, 'memory_efficient_fp16': False, 'fp16_no_flatten_grads': False, 'fp16_init_scale': 128, 'fp16_scale_window': None, 'fp16_scale_tolerance': 0.0, 'on_cpu_convert_precision': False, 'min_loss_scale': 0.0001, 'threshold_loss_scale': None, 'amp': False, 'amp_batch_retries': 2, 'amp_init_scale': 128, 'amp_scale_window': None, 'user_dir': None, 'empty_cache_freq': 0, 'all_gather_list_size': 16384, 'model_parallel_size': 1, 'quantization_config_path': None, 'profile': False, 'reset_logging': False, 'suppress_crashes': False, 'use_plasma_view': False, 'plasma_path': '/tmp/plasma'}, 'common_eval': {'_na

In [22]:
!grep "^H" data/external_test_set.en | cut -f3 > data/hypothesis_external.en

In [23]:
!head data/hypothesis_external.en

Many of them have joined the death of Mr Papua (Vi) army, which, according to its own eyes, defends the rights of the victims, by attacking the brutality and abductions of the Indonesian Orthodox Church.
Mrs Maria also does not recognise the fact that creativity must be presented as a super-smoded superse, he is, at the very least, the ordinary women who are born.
They are asking reporters to provide evidence of steril evidence of the strings, the impression is enough and internal sense.
In a high level of competition, she improved on the altar of a personal excellence in the compartmentation of peat, and on the ground, on the ground, on the ground, and on the ground, on the ground, on the ground.
Opposition has been an anchored by the regulatory authorities, which said that Converging and the head of the Bovery leader, Patrick Martínez.
Estonia rose from all the ages in the south, and his fresh brand of personal excellence is the result of all the timelessness of Estonia.
Accord

In [24]:
!head data/test-ref.en

Many joined the Free Papua Movement (OPM), the rebel army that claims to defend the rights of the Papuans by launching sporadic attacks and kidnapping raids on Indonesian soldiers.
Maria doesn’t also accept the fashion custom that fashion designs have to be presented by a twiggy supermodel; she is happy to bring ordinary women on to the catwalk.
They ask the reporters to suggest names for each column - no proof is needed, perceptions and gut feelings are enough.
At this high-level competition, she bested her own top result in the shot put by close to a metre (13.89) and ran a near-record in the hurdles.
"The court of public opinion has usurped regulators," said Patrick Quinlan, the founder and chief executive of Convercent.
In Estonia’s all-time rankings, Gold rose to third and her latest personal best is the fifth all-time Estonian result.
Gobai says the Paniai people, like other Papuans, consider their vote to Jokowi as a "debt" he must repay.
For example in Setomaa, meat was 

In [25]:
!cat data/hypothesis_external.en | sacrebleu data/test-ref.en

{
 "name": "BLEU",
 "score": 10.0,
 "signature": "nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.0.0",
 "verbose_score": "39.1/14.0/6.2/2.9 (BP = 1.000 ratio = 1.158 hyp_len = 25775 ref_len = 22256)",
 "nrefs": "1",
 "case": "mixed",
 "eff": "no",
 "tok": "13a",
 "smooth": "exp",
 "version": "2.0.0"
}

**Subtask 3 (1 point).** Now calculate BERTScore for both your held-out test set and the out-of-domain test set. (You can find instructions on using the BERTScore command line interface [here](https://github.com/Tiiiger/bert_score#usage).) Use the default model for English with baseline rescaling. Report the full output for both sets.

**Your answer:** ...

In [36]:
!bert-score -r ../homework2/dev/data/test.en -c data/hypothesis_hold_out.en --lang en --rescale_with_baseline

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.decoder.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.dense.weight', 'lm_head.layer_norm.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
roberta-large_L17_no-idf_version=0.3.11(hug_trans=4.18.0)-rescaled_fast-tokenizer P: 0.434280 R: 0.411016 F1: 0.422468


In [35]:
!bert-score -r data/test-ref.en -c data/hypothesis_external.en --lang en --rescale_with_baseline

Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.decoder.weight', 'lm_head.layer_norm.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.weight', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
roberta-large_L17_no-idf_version=0.3.11(hug_trans=4.18.0)-rescaled_fast-tokenizer P: 0.325767 R: 0.340905 F1: 0.334024


### Task 5: Manual analysis (3 points)

Even though automatic metrics are widely used to evaluate machine translation quality, they cannot show what kinds of errors the models make. A number provided by an automatic metric is not enough to make informed decisions about how to improve your model. It is always important to have an idea of what exactly your model is doing right and wrong.

That is why, in this task, you will manually evaluate your model's performance on the external test set (`/gpfs/space/projects/mt2022/data/test-set/`) that you translated in task 4.

**Subtask 1 (2 points).** Analyse any 30 sentences from the translated test set. For each of the 30, report:

1. Sentence ID (line number)
2. Source sentence (in Estonian)
3. Reference translation
4. Machine translation (by your model)
5. Description of errors in the translation. You may use any system that seems reasonable to you. For instance, you could classify errors as "word order errors", "untranslated words in source", etc. A description of a sentence can be something like "it tried to represent meaning, but made grammatical errors" or "hypothesis is fluent, but does not represent meaning correctly".

**Hint.** A convenient tool for comparing translations to references: [https://www.letsmt.eu/Bleu.aspx](https://www.letsmt.eu/Bleu.aspx)

**Your answer:** 

* __Sentence ID__: 275
* __Source__: Tuli neelas Dubai Torchi pilvelõhkuja juba teist korda enda alla
* __Reference__: fire engulfs dubai ' s ' torch ' skyscraper for second time
* __Machine translation__: the distribution of the population was swallowed up by the comparters of dubach , for another time .
* __BLEU__: 2.55405
* __Description__: It looks terribly wrong. I noticed in other sentences that the model performs very badly when translating named entities (organization names, places, ...). The sentence is readable, though.


* __Sentence ID__: 137
* __Source__: "Ja me kaitseme oma osanikke kohtuvaidluste eest töötajatega."
* __Reference__: " and we ' re protecting our shareholders from employment litigation . "
* __Machine translation__: we are protecting our shareholders against the workers ' defendants for their prosecutions .
* __BLEU__: 12.571193
* __Description__: Not a bad one in terms of meaning; however, `employment litigation` and `workers ' defendants for their prosecutions` are not the same. No grammatical errors.


* __Sentence ID__: 351
* __Source__: Hoolimata sellest, et "Narcose" teises hooajas paljastati Pablo Escobari tapja, ei ole Netflixi sari kaugeltki lõppenud.
* __Reference__: despite season two of narcos revealing pablo escobar ' s killer , netflix ' s series is far from over .
* __Machine translation__: it is the fact that the blockade of narcois ' s second is revealed by mr pablo ' s killer , that netxol is far from finished .
* __BLEU__: 14.479536
* __Description__: Again, issues with organization names (netflix, narcos). Grammatically correct overall, but the meaning is lost.


* __Sentence ID__: 605
* __Source__: Pärnusse tulnud külalised on tema sõnul viisakad.
* __Reference__: the visitors to pärnu are well behaved , he says .
* __Machine translation__: the irreparable trend is , as he said , polite .
* __BLEU__: 5.300157
* __Description__: Again, no mention of `pärnu` (place). Grammatically correct, but the meaning is lost.


* __Sentence ID__: 316
* __Source__: ÜRO Afganistani missiooni juht ütles kolmapäeval, et valimised saavad esindada rahvast ainult siis, kui kõik ühiskonna liikmed, sealhulgas naised, mängivad nendes rolli.
* __Reference__: elections can only be representative if all members of society , including women , play a role , the head of the u.n . mission in afghanistan said wednesday .
* __Machine translation__: the leader of the un mission in afghanistan said wednesday that the elections can only represent the people when all members of society , including women , play their role .
* __BLEU__: 44.283242
* __Description__: Good translation. Can't see grammatical errors and the meaning is the same. 

* __Sentence ID__: 957
* __Source__: Küll aga tõdeb ta, et Pärnusse on kohale tulnud tõelised oma ala tegijad.
* __Reference__: but he does admit the true top calibre performers have arrived in pärnu .
* __Machine translation__: however , it is true that there have been real players in the pillarch .
* __BLEU__: 3.929719
* __Description__: Problem with place names (was `pillarch` supposed to be `pärnu`?). Grammatically correct, but the meaning is not exactly the same, even after replacing `pillarch` with `pärnu`.


* __Sentence ID__: 443
* __Source__: Kui kolmanda poole analüütikatööriistad lubavad suurendada töötajate pühendumust ja kaasatust, siis pole ime, et need toetust leiavad.
* __Reference__: if third - party analytics tools promise to increase employee commitment and engagement , it ' s no wonder they ' re finding backing .
* __Machine translation__: if the deportations of the third side are allowed to increase the commitment and involvement of workers , then it is no wonder that they will be supported .
* __BLEU__: 5.300659
* __Description__: The `third party analytics tools` got lost in translation. Grammatically correct.


* __Sentence ID__: 571
* __Source__: Vahetult enne töö algust Adelega oli Klimt käinud Ravennas San Vitale basiilikas ning uurijad arvavad, et just seal nähtud hiilgavad mosaiigid keisrinna Theodorast andsid kunstnikule idee teha kuldne portree Adelest värvitud mosaiigina.
* __Reference__: immediately before starting work with adele , klimt had been to the san vitale basilica in ravenna and scholars believe that it was the brilliant mosaics of the empress theodora he saw there that gave the artist the idea of creating a golden portrait of adele in the form of a painted mosaic .
* __Machine translation__: mr president , before the start of her work , mr vita , mr santa vita , there was a smell , and the fact that it was precisely the distribution of the despogenie mode , the idea of a golden man who was born on the ground was born .
* __BLEU__: 6.711801
* __Description__: The funniest so far :). Not totally incorrect in terms of grammar, but the translation is incomprehensible.


* __Sentence ID__: 523
* __Source__: Kõigest päev pärast New Yorki reisimist istus see St. Louisi elanik Airbnb kaudu renditud toas voodil, sirvides telefonis Facebooki.
* __Reference__: just a day after arriving in new york city on a trip , the st . louis native sat on the bed of an airbnb she was renting , scrolling through facebook on her phone .
* __Machine translation__: after all , in new days ' time , this father of travel was born in st petersburg , and louis michel , through the emergence of the emergence obtained through the emergence , in a telephone presence .
* __BLEU__: 3.068467
* __Description__: Again, not totally wrong in terms of grammar, but the translation is incomprehensible.


* __Sentence ID__: 304
* __Source__: Müra, mis kestab öötundidest varahommikuni, segab naabruses elavate inimeste und.
* __Reference__: the noise , which lasts from the night hours to early morning , disturbs the sleep of the people living in the neighbourhood .
* __Machine translation__: the murder , which continues to spread the property of nights , disrupts the people living in the neighbourhood .
* __BLEU__: 29.500465
* __Description__: I'm not fluent enough in Estonian to understand why `noise` was translated as `murder`. Grammatically correct, but incomprehensible.

* __Sentence ID__: 483
* __Source__: "Kui me seda ei tee, siis linnaosa trahvib meid", märgib Scholler.
* __Reference__: “ if we don ’ t do it , the city district will fine us ,” says scholler .
* __Machine translation__: while we do not do so , the city parts are fineing us scholler .
* __BLEU__: 10.596135
* __Description__: Grammatically wrong and incomprehensible.


* __Sentence ID__: 733
* __Source__: See pommitamine on viimane vägivallaaktidest, mis on Afganistani sel kuul tabanud - teisipäeval sai Herati mošeerünnakus surma rohkem kui 30 inimest.
* __Reference__: the bombing is the latest violence to have hit afghanistan this month - on tuesday more than 30 people were killed in a mosque blast in herat .
* __Machine translation__: this bombing is the latest acts of violence that have hit afghanistan this month - on tuesday , heratis mosev ' s strike killed more than 30 people .
* __BLEU__: 41.732602
* __Description__: A good one. The place where the bombing happened got lost in translation, though.


* __Sentence ID__: 871
* __Source__: Lätis ja Leedus langes hind viie protsendi võrra 36,27 euroni megavatt-tunnist.
* __Reference__: in latvia and lithuania , the price fell by five per cent to 36.27 euros per megawatt - hour .
* __Machine translation__: in latvia and lithuania , the price fell by eur 5 per cent of eur 36.2 per hour .
* __BLEU__: 47.570528
* __Description__: Good one, with a few grammatical errors; `megawatt` was omitted, making it hard to understand what has decreased.


* __Sentence ID__: 765
* __Source__: Hinnad tõusid vastavalt 12 ja seitse protsenti.
* __Reference__: prices rose 12 and seven per cent , respectively .
* __Machine translation__: prices were rising according to 12 and seven per cent respectively .
* __BLEU__: 37.700638
* __Description__: Removing `according` from the translation would fix the sentence in terms of meaning.


* __Sentence ID__: 666
* __Source__: Klubidel on lubatud mängijate värbamiseks rohkelt kulutada, kuid nad peavad tasakaalustama seda õiguslike sissetulekuallikatega, mis võimaldavad neil jalgpalliäris omadega välja tulla.
* __Reference__: clubs are allowed to spend heavily on acquiring players but they have to counterbalance that with legitimate sources of income , allowing them to approach break - even on their football - related business .
* __Machine translation__: they have been allowed to spend on the recruitment of performers , but they must balance it with legal sources that allow them to come out with their football companies .
* __BLEU__: 7.967917
* __Description__: Not totally wrong in terms of meaning, but `players` != `performers`. No grammatical errors.


* __Sentence ID__: 51
* __Source__: Enamik välja- ja sissekolijatest on Eesti kodanikud, kuid rohkem on neid lahkujate seas.
* __Reference__: most immigrants and emigrants are estonian citizens , but there are more of them among the ones leaving .
* __Machine translation__: most of the emerging and emerging are estonian citizens , but there are more of those leaving .
* __BLEU__: 49.011085
* __Description__: Both `immigrants` and `emigrants` were translated as `emerging`. Incomprehensible sentence.


* __Sentence ID__: 930
* __Source__: Tegemist ei olnud lihtsalt platvormi ehitamisega, vaid kasutasime teraskonstruktsiooni, et nii ülemise kui ka alumise ruumi kõrgus oleks õige."
* __Reference__: it wasn ' t just building a simple platform , but cranking the steels so that they were at the right height for the function above or below . "
* __Machine translation__: it was not just about building a blind eye , but we used the stackling that both over and in the area of radicalisation and distribution was the right thing to strike .
* __BLEU__: 4.626647
* __Description__: Sentence misalignment? Incomprehensible sentence that doesn't match the reference at all.


* __Sentence ID__: 960
* __Source__: Tallinnas Meriväljal oli 6.augusti õhtul üks pisike plikatirts hüüdnud: "Ema, ema!
* __Reference__: in the merivälja part of tallinn , on 6 august , one tiny slip of a girl was said to have shouted : “ mother , mother !
* __Machine translation__: in her city , merier , the plant , on 6 august , had one of the slogans born : the starea , the mother !
* __BLEU__: 16.813022
* __Description__: Again incomprehensible... Grammatically "weird".


* __Sentence ID__: 524
* __Source__: Vaesus ja surm Indoneesia kullarohkel maal
* __Reference__: poverty and death in indonesia ' s land of gold
* __Machine translation__: poverty and death on the ground of indonesia ' s gold gold rotten land
* __BLEU__: 19.67498
* __Description__: The translation introduced `gold gold rotten land`, although there's nothing like that in the reference. Grammatically OK but totally wrong meaning.


* __Sentence ID__: 792
* __Source__: Klimt esitles oma "Daami kullas" Viinis 1908. aastal ja umbes samal ajal (aastatel 1907-1908) sündis teinegi kuldne maal "Suudlus".
* __Reference__: klimt presented his lady in gold in vienna in 1908 , and around the same time ( 1907 - 1908 ) another golden painting called kiss was conceived .
* __Machine translation__: the most striking was presented in its drew from the drew from the ' da gold mine ' gold in vienna 1908 and around the same time ( e . 19 years ago , the second was born of the emergence of the paint stripping of 19 - 19.
* __BLEU__: 13.481992
* __Description__: Incomprehensible sentence. It seems to be derived from the source, but the translation is totally wrong. Grammatical errors.


* __Sentence ID__: 974
* __Source__: Ma pole oma loomingust kunagi kaugel ja tööd ma ei karda.
* __Reference__: i ’ m never far from my work and i ’ m not afraid of work .
* __Machine translation__: i am never far from my creation and i am not afraid of my work .
* __BLEU__: 26.769118
* __Description__: Almost nailed it. I don't know Estonian well enough to say whether `work` and `creation` could be similar. Grammatically ok, meaning ok.


* __Sentence ID__: 916
* __Source__: Mullune hinnatõus oli 37 protsenti.
* __Reference__: last year ’ s price rise was 37 per cent .
* __Machine translation__: the increase in milk prices was 37 per cent .
* __BLEU__: 35.543339
* __Description__: `milk`? Grammatically ok sentence, meaning deviated a bit (perhaps there's some context in the previous sentences?).


* __Sentence ID__: 464
* __Source__: Seega on kuberneri lehte  mis on ametliku märgistusega avalik foorum, mida haldab maksumaksjate dollaritest palka saav personal  külastavate inimeste blokeerimine ebavajalik ja lõppkokkuvõttes ohtlik.
* __Reference__: so blocking people who come to the governor ' s page - which is a public forum , labeled as official and administered by staff members paid public tax dollars - is unnecessary and ultimately dangerous .
* __Machine translation__: it is therefore the result of the distribution of the rows , which is a public phonote with official labelling , which is the result of the lack of and , ultimately , dangerous for the people who have been paid for by the workers who have fallen from the taxpayers ' money .
* __BLEU__: 6.506148
* __Description__: Incomprehensible sentence. Some words look like they are made up (`phonote`).


* __Sentence ID__: 860
* __Source__: Valge Maja sõnul ei koormanud mitte ükski teine selle ürituse eksponaat maksumaksja rahakotti.
* __Reference__: the white house said taxpayers did not pick up the burden for any of the other props featured at the event .
* __Machine translation__: according to the white house , none of the other taxpayers ' money was burdened by the emergence of this cause .
* __BLEU__: 11.966558
* __Description__: It seems the meaning is not the same. Grammatically ok.


* __Sentence ID__: 553
* __Source__: 4. - 13. augustini muutuvad Pirita kloostri muistsed varemed taaskord paigaks, kus peetakse maha 13 aastat tagasi maestro Eri Klasi poolt ellu kutsutud Birgitta Festival.
* __Reference__: from 4 - 13 august , the ancient ruins of pirita convent again turn into the venue for the birgitta festival started 13 years ago by maestro eri klas .
* __Machine translation__: on 4 - 13 august , the rows of the past become the place once again in which , 13 years ago , the special krods , mr bible , was called upon to be executed for 13 years ago .
* __BLEU__: 15.955012
* __Description__: Again problems with the names of places and organizations. Grammatically ok though.


* __Sentence ID__: 145
* __Source__: Läänetiivas asub ka Ovaalkabinet ning teised presidendi tööruumid.
* __Reference__: the west wing is where the oval office and other working areas for the present are located .
* __Machine translation__: the west is also located in ovakabstka and other president ' s working spaces .
* __BLEU__: 7.237068
* __Description__: It seems one word was not translated at all: `Ovaalkabinet` became `ovakabstka`. Also, the word `wing` was dropped and the sentence lost its meaning. Grammatically ok.


* __Sentence ID__: 261
* __Source__: Endine Euroopa esinumber Ronan Rafferty loodab järgmise kolme päeva jooksul koduterritooriumi mugavusele, kui East Lothiani Renaissance Clubi golfiradadel toimub 25. Scottish Senior Open.
* __Reference__: former european no 1 ronan rafferty will be hoping for home comforts over the next three days as the renaissance club in east lothian hosts the 25 th edition of the scottish senior open .
* __Machine translation__: the former focal point of the european adventure , rabty , is set up in the next three days to the comfort of the homelessness of the embasan renaissance , when the golf course of the golf courses are carried out on 25 september .
* __BLEU__: 7.480985
* __Description__: Incomprehensible sentence (also funny). Looks like gibberish. 


* __Sentence ID__: 896
* __Source__: Enne oli nii, et kui midagi uut sai püsti pandud, siis oli see juba järgmine päev katki.
* __Reference__: in the past , if something new was erected , it was broken the next day .
* __Machine translation__: previously , when something new was set up , it was already the next day that was born .
* __BLEU__: 18.061759
* __Description__: Almost. `next day was born` should relate to `broken`. Grammatically correct.


* __Sentence ID__: 73
* __Source__: Neiu mõtles pikalt, kuid leidis endas rahu.
* __Reference__: she thought for a long time , but then found peace of mind .
* __Machine translation__: neue thought at length , but peace was found in itself .
* __BLEU__: 7.54534
* __Description__: `neue` was introduced, making the meaning a little off. Overall not bad and grammatically correct.


* __Sentence ID__: 749
* __Source__: Kayame ei tea endiselt, kes tulistajaks oli, aga ütleb, et kuul tuli kogunenud sõdurite ridadest.
* __Reference__: kayame still doesn ' t know who fired but says the bullet came from the ranks of amassed soldiers .
* __Machine translation__: we do not continue to know who was shooting , but it says that there were a series of structurings in the month .
* __BLEU__: 4.303846
* __Description__: Grammatically correct but doesn't represent the reference in terms of meaning.


In [84]:
## my sampling method.

# import pandas as pd
# import numpy as np

# source_data = np.loadtxt("data/test-src.et",delimiter="\n", dtype='str')
# translations = pd.read_csv("data/ibleu_2022-04-10_20-12-08.csv", sep=";", index_col=0)
# translations["source"] = source_data

# random_idx = np.random.randint(1, high=len(translations), size=(30))

# for idx in random_idx:
#     print(f"* __Sentence ID__: {idx}")
#     print(f"* __Source__: {translations.loc[idx][['source']].values[0]}")
#     print(f"* __Reference__: {translations.loc[idx][['Human translated sentence']].values[0]}")
#     print(f"* __Machine translation__: {translations.loc[idx][['First machine translated sentence']].values[0]}")
#     print(f"* __BLEU__: {translations.loc[idx][['First machine translated sentence bleu score']].values[0]}")
#     print(f"* __Description__:\n\n")
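
The commented-out cell above draws indices with `np.random.randint`, which samples with replacement and is not seeded, so a rerun can repeat a sentence ID and will not reproduce the same list. A minimal sketch of a seeded, duplicate-free alternative is given below. The function name `sample_sentence_ids`, the seed value, the count of 1000 sentences, and the assumption that sentence IDs run from 0 to n-1 are all illustrative and not taken from the original notebook.

```python
import random


def sample_sentence_ids(n_sentences, k=30, seed=0):
    """Return k distinct sentence IDs in [0, n_sentences), reproducibly."""
    if k > n_sentences:
        raise ValueError("cannot sample more IDs than there are sentences")
    # A local generator keeps the global random state untouched.
    rng = random.Random(seed)
    # random.sample draws without replacement, so no ID repeats.
    return rng.sample(range(n_sentences), k)


# Hypothetical test-set size; replace with len(translations) in the notebook.
ids = sample_sentence_ids(1000, k=30, seed=0)
print(len(ids), len(set(ids)))  # both counts are 30: no duplicates
```

Calling the function again with the same seed returns the same IDs, which makes the manual analysis easier to revisit later.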

**Subtask 2 (1 point).** Can you see any patterns and typical errors? Summarize your analysis.

**Your answer:** 

In general, the translations are grammatically correct; only a few contain odd grammatical errors. However, most of them read as if they came from a gibberish generator, and the meaning is off. In only a few cases was the translation close, and even then some unexpected words were introduced that moved the translation away from the reference. Named entities (people, organizations, dates) also seem to be translated very badly: most of them are lost or misplaced.

Overall, I think my model is far from good, although it seems to be really good at generating gibberish in English :D