# HW5: Evaluating machine translated output (31 possible points)

In this homework, we will be assessing the properties of a machine translation system that translates from a language you probably do not speak (Finnish) into a language you probably do speak (English). The goal of this homework is to get you to:

1. Use the huggingface `transformers` package
2. Build a function that can evaluate the quality of a back-translation
3. Build a function that can evaluate machine translated output using a _monolingual_ language model

## The data

You will be using two files for this assignment -- `en-fi-en-translations.txt` and `fi-en-fi-translations.txt`. Each line in the files is tab-separated into columns. The following schema is used for both files:

1. Language 1 (e.g. `ENG`)
2. Language 2 (e.g. `FIN`)
3. Correct intermediate sentence (e.g., the `FIN` translation)
4. Original sentence (i.e., Language 1's original form)
5. Round trip translated sentence (i.e., the output of translating from ENG -> FIN -> ENG)
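
As a rough illustration of the schema above, one row could be unpacked into named fields as in the sketch below. The field names and the sample row are assumptions made for this example; in particular, the Finnish column is a placeholder string, not real data.

```python
# Hypothetical field names for the five tab-separated columns listed above.
FIELDS = ["lang1", "lang2", "intermediate", "original", "round_trip"]

def parse_row(line):
    """Split one tab-separated line into a dict keyed by FIELDS."""
    values = line.rstrip("\n").split("\t")
    return dict(zip(FIELDS, values))

# Illustrative row only: the third column is a placeholder, not a real translation.
sample = "ENG\tFIN\t<Finnish translation>\tBe quiet!\tShut up!\n"
row = parse_row(sample)
print(row["original"], "->", row["round_trip"])
```

Naming the columns this way is only a convenience; indexing from the end of the split list (`[-2]`, `[-1]`) works just as well when only the last two columns matter.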

The data was generated using two large neural machine translation (NMT) models. These models were trained on open subtitle corpora, which are largely from movies and TV shows. NMT models are usually trained on parallel text, so the subtitles (or captions) from one segment in a movie are usually assumed to line up with the same segment in the subtitles for another language.

Most modern models treat machine translation as a sequence-to-sequence problem. That is, we try to find the best representation of an input sequence (e.g., a sequence of English words) to predict an output sequence (e.g., a sequence of Finnish words). There are lots of tricks to make the model work, but for this homework we are interested in seeing how faithfully our machine translation systems can reconstruct the original input. We test this with **backtranslation**, a specific case of round-trip machine translation that goes L1 → L2 → L1.
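
The round trip described here can be sketched with any pair of translation functions. In the minimal example below, the two lookup tables are toy stand-ins for real NMT systems (they are assumptions for illustration, not the models used to generate the data), and `round_trip` is a hypothetical helper name.

```python
def round_trip(sentence, forward, backward):
    """Translate L1 -> L2 with `forward`, then L2 -> L1 with `backward`."""
    intermediate = forward(sentence)
    return intermediate, backward(intermediate)

# Toy stand-ins for real NMT systems (hypothetical lookup tables).
EN_TO_FI = {"Are your hands clean?": "Ovatko kätesi puhtaat?"}
FI_TO_EN = {"Ovatko kätesi puhtaat?": "Are your hands clean?"}

original = "Are your hands clean?"
intermediate, back = round_trip(original, EN_TO_FI.get, FI_TO_EN.get)
print(intermediate)
print(back == original)  # did the round trip reconstruct the input?
```

With real models, `forward` and `backward` would each wrap a call to a translation system, and the comparison in the last line is what the rest of this homework evaluates.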

## The languages

Languages vary in the way they encode different types of linguistic information. For example, Finnish has much more complex morphology than English does, which means that many different strings in Finnish can translate to exactly the same string in English. You can look up various grammatical properties of Finnish in the WALS database: https://wals.info/languoid/lect/wals_code_fin

Here are a handful of facts about Finnish within the Morphology domain:

* Exclusively concatenative
* Case + number
* 2-3 categories per word
* Dependent marking
* Double marking in possessive noun phrases	
* Strongly suffixing

Likewise you can find out more about English here: https://wals.info/languoid/lect/wals_code_eng

Take a look at these pages before you start the full assignment -- it will help you understand the data better.

# Q1: Create a function that loads the data (2 points)

**Create a function called `load_data`** that takes a file path, opens the file, and reads it using the `.readlines()` method. Inside the function, create a list called `data`. The function should then loop through each row, split it on the `\t` character, and append the resulting list to `data`. Return `data` after the loop finishes.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [9]:
# your answer for Q1 goes here
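def load_data(filepath):
  data = []
  # the context manager closes the file even if reading fails
  with open(filepath, "r", encoding="utf-8") as file:
    file_contents = file.readlines()
  for line in file_contents:
    # each row is tab-separated; the last field keeps its trailing newline
    data.append(line.split("\t"))
  return data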


# Q2: Loading in data (4 points)

## Q2a: Loading in the English round trip data (2 points)

Load in `en-fi-en-translations.txt` and save it to a variable called `english`.

You may load in the file to colab or jupyter however is most convenient for you.

Print out the contents of `english`.

In [10]:
# load in English data
english = load_data("/content/drive/MyDrive/Computational Linguistic/HW5-3/en-fi-en-translations.txt")

In [None]:
# print contents of `english`
for contents_english in english:
  print (contents_english)

## Q2b: Loading in the Finnish round trip data (2 points)

Load in `fi-en-fi-translations.txt` and save it to a variable called `finnish`.

You may load in the file to colab or jupyter however is most convenient for you.

Print out the contents of `finnish`.

In [12]:
# load in Finnish data
finnish = load_data("/content/drive/MyDrive/Computational Linguistic/HW5-3/fi-en-fi-translations.txt") 

In [None]:
# print contents of `finnish`
for contents_finnish in finnish:
  print(contents_finnish)

# Q3: Qualitative evaluation (7 points)

Referring to the output of Q2a and Q2b, pick two cases that the model got right and three cases that the model got wrong, in terms of reconstructing the original message. Show us the 2 correct and 3 incorrect cases.

For the Finnish translations, you may want to consider the English sentence as well. For both languages, propose 3 possible contributing _linguistic_ factors that may influence whether the backtranslations are correct/incorrect. For example, did any of the linguistic properties of English or Finnish from the WALS database appear in the examples you picked? 

As far as you can tell, does it look like one language was easier to translate into or out of than the other? Can you think of linguistic and non-linguistic reasons why this might be the case?

## Q3a

Right:
1. 
> Gold output =  'All they had to do was stop' \
> Translated output = 'All they had to do was stop'
2. 
> Gold output = 'I don't have a computer.' \
> Translated output = 'I don't have a computer.'

Wrong:

1. 
> Gold output =  'All you need to do is wait.' \
> Translated output = 'All you have to do is wait.'
2. 
> Gold output = 'Be quiet!' \
> Translated output = 'Shut up!'
3. 
> Gold output = 'Give us a hand.' \
> Translated output = 'Give me a hand.'


Qualitatively evaluate the English-English data here by answering the questions above.
In my view, the possible contributing *linguistic* factors that influence whether the backtranslations are correct or incorrect are:
>1. Epistemic Possibility.
>2. Morphological Imperative.
>3. Inclusive/Exclusive Distinction in Verbal Inflection.
>4. Present/Past Tense.
>5. Predicative Possession.

In the 'Right' section, the gold and translated outputs of the first sentence match because both use the past tense, meaning 'they just needed to do one thing, which was to stop'. The second sentence is a case of predicative possession: it states that a particular person does not possess a particular object.
In the 'Wrong' section, the translated output of the first sentence uses the present tense, whereas the gold output reads more like a suggestion, so the two do not match. For the second sentence, English has no morphologically dedicated second-person imperative at all, which is why the result is wrong. The third sentence involves the 'Inclusive/Exclusive Distinction in Verbal Inflection', meaning that in English 'we' and 'I' are treated the same, yet the result is still counted as incorrect.


## Q3b

Right
1. 
> Gold output = 'Ovatko kätesi puhtaat?' \
> Translated output = 'Ovatko kätesi puhtaat? ' \
> Google Translation = 'Are your hands clean?' \

2. 
> Gold output = 'Mene kysymään Tomilta' \
> Translated output = 'Mene kysymään Tomilta' \
> Google Translation = 'Go ask Tom' \

Wrong

1. 
> Gold output = Perusteellinen aseistariisuntasopimus kieltää kaikki aseistukset ja asevoimat.  \
> Google Translation = A thorough disarmament treaty prohibits all regulations and armed forces. \
> Translated output = 'Kattava aseriisuntasopimus kieltää kaiken aseistuksen ja asevoimat.'  
> Google Translation = 'A comprehensive disarmament treaty bans all armaments and the armed forces' \

2. 
> Gold output = Hänen apunsa tuli viime tingassa.  \
> Google Translation = His/Her help came last time. \
> Translated output = Hänen apunsa tuli viime hetkellä \
> Google Translation = Her/His help came at the last minute \

3. 
> Gold output = Anteeksi, missä on WC?.  \
> Google Translation = Excuse me, where is the toilet? \
> Translated output = Anteeksi, missä vessa on? \
> Google Translation = Sorry about where the toilet is \

Qualitatively evaluate the Finnish-Finnish data here by answering the questions above.
> A major difference between Finnish and English is that Finnish does not distinguish gendered pronouns. As example 2 in the 'Wrong' section shows, this creates confusion about whether the English should be 'he' or 'she'; Finnish basically uses one pronoun for both, much like 'it'. Finnish also has no future tense, so every sentence here uses the present tense, which makes it difficult to tell whether a statement refers to the present or the future. Finally, Finnish can express a thought in fewer words, which is why example 2 in the 'Right' section needs only three words instead of a full sentence.

## Q3c

Discuss potential linguistic and non-linguistic reasons for errors.
> Errors can arise for several linguistic and non-linguistic reasons, for example: \
-- A person, whether a linguist or not, who is not familiar with the other language can introduce errors into the result. \
-- If a person works on the data continuously, they may get tired, which can also cause errors. \
-- Because grammar rules differ from one language to another, translating between them can lead to errors.

# Q4: Exact string matching (5 points)

For this question, you should compute the average accuracy of each of the datasets using _strict_ exact string matching. That is, you should compare the translated output (the 4th/final column in each data entry) to the gold output (the 3rd/second-to-last column in each data entry) for both `ENG` and `FIN`.

To do this, write a function called `score` that does the following:

0. Create an empty list called `matches`
1. Takes the dataset (e.g., `english` or `finnish` as an argument) and loops through every row:
  * Identifies the gold sentence
  * Identifies the translated sentence
  * Compares whether (2) is exactly the same string as (3) as a boolean called `match`
  * Appends `match` to `matches`
2. Returns the mean of `matches` (accuracy for the full dataset)

For both `ENG` and `FIN`, run `score` and print the accuracy out to the notebook.

In [14]:
import numpy as np
def score(dataset):
  matches = []
  for contents in dataset:
    # gold is the second-to-last column; the translation is the last column,
    # which still carries the trailing newline left by readlines()
    gold_sentence = contents[-2].strip('\n')
    translated_sentence = contents[-1].strip('\n')
    # strict comparison: any difference in case, spacing, or punctuation counts as a miss
    match = gold_sentence == translated_sentence
    matches.append(match)
  # the mean of the booleans is the fraction of exact matches
  return np.mean(matches)

In [15]:
# compute and print accuracy for english
accuracy_english = score(english)
print("Accuracy for English", accuracy_english)

Accuracy for English 0.32


In [16]:
# compute and print accuracy for finnish
accuracy_finnish = score(finnish)
print("Accuracy for Finnish", accuracy_finnish)

Accuracy for Finnish 0.28


# Q5: Monolingual embedding comparisons (10 points)

In this section, we would like you to compute the similarity between the two sentences for each translation. Specifically, we will use `BERT`, a large language model, as a _sentence encoder_ that will produce one 768-dimensional vector for each token in our sentences. Below, we use a tokenizer to transform a sentence $s$ into $k$ subwords, and then we give a pre-trained model the tokenized representation of $s$. The result is a tensor of shape `(1, k, 768)` at each layer.

The layers are held together in a `tuple`, in which the first element is the lowest layer, and the last element is the highest layer.
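
To make the pooling and comparison steps concrete, here is a dependency-free sketch that averages token vectors into one sentence vector and then takes the cosine similarity. It uses made-up 3-dimensional vectors instead of real 768-dimensional BERT states, and `mean_pool` and `cosine` are hypothetical helper names, not part of `transformers` or `torch`.

```python
import math

def mean_pool(token_vectors):
    """Average a list of k token vectors into one sentence vector."""
    k = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / k for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "hidden states": 3 dimensions instead of 768, made-up numbers.
sentence_a = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]                   # k = 2 tokens
sentence_b = [[2.0, 1.0, 1.0], [2.0, 1.0, 1.0], [2.0, 1.0, 1.0]]  # k = 3 tokens

vec_a = mean_pool(sentence_a)
vec_b = mean_pool(sentence_b)
print(vec_a, vec_b)
print(cosine(vec_a, vec_b))
```

The two toy sentences have different tokens and different lengths but pool to the same vector, which is a reminder that mean pooling discards token-level detail.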

### Q5a: Run the next cell to load in the English and Finnish BERT models (1 point)

In [None]:
!pip install transformers
from transformers import BertModel, BertTokenizer
from torch.nn.functional import cosine_similarity

finbert = BertModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
finbert.eval()
fintokenizer = BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

engbert = BertModel.from_pretrained("bert-base-cased")
engbert.eval()
engtokenizer = BertTokenizer.from_pretrained("bert-base-cased")

### Q5b: Computing the similarity between two English sentences at a specific layer (3 points)

Note, you do NOT need to change `embed_sentence` or `sentence_similarity`. Changing them will cause you to lose points for this question.

In [18]:
def embed_sentence(model, tokenizer, s):
  s_tokenized = tokenizer(s, return_tensors="pt")
  s_embeds = model(**s_tokenized, output_hidden_states=True)['hidden_states']
  return s_embeds


def embed_sentences(model, tokenizer, s1, s2):
  s1_embeds = embed_sentence(model, tokenizer, s1)
  s2_embeds = embed_sentence(model, tokenizer, s2)
  return s1_embeds, s2_embeds


def sentence_similarity(s1_embeds, s2_embeds, layer):
  s1_vector = convert_embeds_to_vector(s1_embeds, layer)
  s2_vector = convert_embeds_to_vector(s2_embeds, layer)
  similarity = cosine_similarity(s1_vector, s2_vector).detach().item()
  return similarity

In [19]:
import torch

def convert_embeds_to_vector(embeds, layer):
  # get sentence embeddings at a specific layer; with this indexing,
  # layer=1 selects embeds[0] and layer=0 wraps around to embeds[-1]
  sent_embeds = embeds[layer-1]
  # compute the mean along axis=1 to turn the embeddings into a single vector of size (1, 768);
  # staying in torch avoids re-wrapping an existing tensor with torch.tensor(), which raises a warning
  vector = sent_embeds.detach().mean(dim=1)
  # verify the shape of the vector is (1, 768) using assert
  assert vector.shape == (1, 768)

  return vector

After editing the above code, run the cell below. The output you should get is that sentence 1 and sentence 2 have cosine similarity $\approx 0.95$.

In [20]:
sent1 = 'My cat Vector is very cute.'
sent2 = 'Professor Jacobs has a cat named Vector who is very cute.'

s1_embeds, s2_embeds = embed_sentences(engbert, engtokenizer, sent1, sent2)
print(sentence_similarity(s1_embeds, s2_embeds, layer=6))

0.9495340585708618



### Q5c: Computing the similarity between all English gold and translated sentences at all layers (3 points)

Create a nested `for` loop to iterate through all the sentences in `english` using `sentence_similarity` from Q5b.

Outer loop -- all sentences

Inner loop -- all layers

Print averages of all layers out at the end

In [21]:
from collections import defaultdict

layer_sims = defaultdict(list)
for j, outer in enumerate(english):
  data1 = outer[-2].strip('\n')
  data2 = outer[-1].strip('\n')
  # embed each sentence pair once; the hidden states for all layers come back in one call
  s1_embeds, s2_embeds = embed_sentences(engbert, engtokenizer, data1, data2)
  temp_list = []
  for i in range(13):
    # with the embeds[layer-1] indexing above, layer=0 selects the final hidden state
    temp_list.append(sentence_similarity(s1_embeds, s2_embeds, layer=i))
  # despite the 'layer' prefix, each key indexes one sentence and holds its 13 similarities
  layer_sims['layer ' + str(j)] = temp_list


In [22]:
print(layer_sims)

defaultdict(<class 'list'>, {'layer 0': [0.9694236516952515, 0.9370449781417847, 0.9469146728515625, 0.9575198292732239, 0.95558762550354, 0.9661802649497986, 0.9723868370056152, 0.9776244163513184, 0.978664219379425, 0.9805426001548767, 0.9857415556907654, 0.9856081008911133, 0.9899044036865234], 'layer 1': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'layer 2': [0.9830528497695923, 0.9710646867752075, 0.9753118753433228, 0.9790318012237549, 0.9802778959274292, 0.9844610095024109, 0.9862881898880005, 0.9889016151428223, 0.9860441088676453, 0.9867919683456421, 0.9891160130500793, 0.9890556931495667, 0.992056667804718], 'layer 3': [0.8592914342880249, 0.7399447560310364, 0.7949160933494568, 0.8179962635040283, 0.8191359639167786, 0.8485340476036072, 0.8640307188034058, 0.880382776260376, 0.8818475604057312, 0.8888779282569885, 0.904603123664856, 0.9183129668235779, 0.9334012269973755], 'layer 4': [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0], 'l

In [24]:
# loop over each layer in layer_sims and print the average
dict1 = {}
for key in layer_sims:
  for i in range(0,13):
    if i in dict1:
      dict1[i].append(layer_sims[key][i])
    else:
      dict1[i] = [layer_sims[key][i]]

average = []
for key in dict1:
  avg = np.mean(dict1[key])
  print(avg)
  average.append(avg)

0.9711949670314789
0.9445295786857605
0.9591687428951263
0.9665106272697449
0.9670855474472045
0.9738864541053772
0.9776055288314819
0.9797219347953796
0.9788978087902069
0.9788581275939942
0.9820836174488068
0.9837425839900971
0.9876769196987152
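
A more compact way to turn per-sentence lists into per-layer averages is to transpose them with `zip`. The sketch below uses made-up numbers with three "layers" per sentence; the dictionary contents are assumptions for illustration, not values from the run above.

```python
from statistics import mean

# Hypothetical per-sentence similarity lists, three layers each.
per_sentence = {
    "sentence 0": [0.90, 0.80, 0.70],
    "sentence 1": [1.00, 1.00, 1.00],
    "sentence 2": [0.80, 0.60, 0.40],
}

# zip(*rows) transposes sentence-major lists into one tuple per layer.
per_layer_means = [mean(layer_values) for layer_values in zip(*per_sentence.values())]
for layer_index, value in enumerate(per_layer_means):
    print(layer_index, round(value, 3))
```

This assumes every sentence has the same number of layer scores; `zip` would silently truncate to the shortest list otherwise.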


### Q5d: Compute the similarity between all Finnish gold and translated sentences at all layers (3 points)

Create a `for` loop to iterate through all the sentences in `finnish` using `sentence_similarity` from Q5b.

Outer loop -- all sentences

Inner loop -- all layers

Print averages of all layers out at the end

In [25]:
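layer_sims = defaultdict(list)
for j, outer in enumerate(finnish):
  data1 = outer[-2].strip('\n')
  data2 = outer[-1].strip('\n')
  # embed each sentence pair once; the hidden states for all layers come back in one call
  # note: this run embeds the Finnish sentences with the English model (engbert/engtokenizer),
  # not the finbert/fintokenizer pair loaded in Q5a
  s1_embeds, s2_embeds = embed_sentences(engbert, engtokenizer, data1, data2)
  temp_list = []
  for i in range(13):
    # with the embeds[layer-1] indexing above, layer=0 selects the final hidden state
    temp_list.append(sentence_similarity(s1_embeds, s2_embeds, layer=i))
  # despite the 'layer' prefix, each key indexes one sentence and holds its 13 similarities
  layer_sims['layer ' + str(j)] = temp_list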


In [26]:
# loop over each layer in layer_sims and print the average
dict1 = {}
for key in layer_sims:
  for i in range(0,13):
    if i in dict1:
      dict1[i].append(layer_sims[key][i])
    else:
      dict1[i] = [layer_sims[key][i]]

average = []
for key in dict1:
  avg = np.mean(dict1[key])
  print(avg)
  average.append(avg)

0.9646093988418579
0.921919367313385
0.9406355464458466
0.9456805217266083
0.9490394127368927
0.961135778427124
0.9669098603725433
0.9717363572120666
0.9730573797225952
0.9732820308208465
0.9777024221420288
0.9793009424209594
0.9834346938133239


# Q6: Compare and contrast exact matches to embedding distances (3 points)

Exact matches and embedding distances can be thought of as two opposite ends of a spectrum. Write down 1 clear advantage that you see for each method over the other one (1 point each). For example, what information does an embedding distance give you over an exact string match, and vice versa? Can you think of specific cases in the above analyses where neither method is very good? (1 point)

*   An embedding distance tells us how similar one sentence is to another. It first finds the distance from each word to every other word and builds a matrix. From that matrix we can get an idea of how similar the strings are.
*   If two sentences are the same from the beginning, the distance between them is zero. A distance of zero therefore means the two sentences are the same, and any other value shows how similar one sentence is to the other.
*   Whether a method is good or not depends on the size of the vector. If we have a huge amount of data, we can choose other methods or distances to calculate the similarity.





# Bonus: Compute string similarity between gold and machine translated sentences (5 points)

There are a number of ways two strings can be similar but different. Consider any **ONE** (1) of the following measures for computing the **mean** similarity between the gold sentence and the machine translated sentence. Summarize why that measure would be appropriate for comparing the input and output strings. Consider discussion of these points throughout the SLP3 book as well as Wikipedia as well as [Manning, Wein, and Schneider (2020)](https://aclanthology.org/2020.coling-main.420.pdf). You may use any off-the-shelf implementation that you find as long as you cite it below.

* [BLEU](https://en.wikipedia.org/wiki/BLEU)
* [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric))
* [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance)

For full points, this question must:

1. Justify the metric (NO quoting from Wikipedia)
2. Summarize the results for English AND Finnish
3. Cite all sources

In [None]:
!pip install python-Levenshtein

In [28]:
import Levenshtein

#Code for english

In [29]:
# Bonus code here
## Code for English
similarity_score = []
for outer in english:
  data1 = outer[-2].strip('\n')
  data2 = outer[-1].strip('\n')
  # Levenshtein.distance returns an edit distance: 0 means identical, larger means less similar
  similarity = Levenshtein.distance(data1, data2)
  similarity_score.append(similarity)

In [30]:
print(similarity_score)

[10, 0, 4, 8, 0, 12, 2, 4, 7, 0, 5, 4, 0, 7, 4, 3, 5, 7, 0, 7, 4, 2, 0, 8, 6, 2, 0, 4, 5, 37, 29, 0, 6, 14, 8, 0, 7, 0, 0, 4, 14, 0, 0, 0, 6, 0, 0, 3, 10, 4]


#Code for finnish

In [None]:
## Code for Finnish
similarity_score = []
for outer in finnish:
  data1 = outer[-2].strip('\n')
  data2 = outer[-1].strip('\n')
  # Levenshtein.distance returns an edit distance: 0 means identical, larger means less similar
  similarity = Levenshtein.distance(data1, data2)
  similarity_score.append(similarity)


In [None]:
print(similarity_score)

[20, 0, 7, 7, 0, 0, 16, 14, 8, 14, 0, 12, 0, 7, 7, 6, 10, 8, 0, 12, 16, 0, 9, 9, 0, 7, 7, 17, 11, 0, 6, 9, 0, 15, 11, 9, 0, 0, 2, 1, 5, 10, 20, 4, 9, 11, 0, 11, 0, 19]
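
Since the question asks for a **mean**, the two printed lists can be summarized as in the sketch below. The distances are copied from the outputs above; `summarize` is a hypothetical helper. A distance of 0 means the two strings are identical, so the share of zeros should line up with the exact-match accuracies from Q4.

```python
from statistics import mean

# Distances copied from the two printed lists above.
english_distances = [10, 0, 4, 8, 0, 12, 2, 4, 7, 0, 5, 4, 0, 7, 4, 3, 5, 7, 0, 7,
                     4, 2, 0, 8, 6, 2, 0, 4, 5, 37, 29, 0, 6, 14, 8, 0, 7, 0, 0, 4,
                     14, 0, 0, 0, 6, 0, 0, 3, 10, 4]
finnish_distances = [20, 0, 7, 7, 0, 0, 16, 14, 8, 14, 0, 12, 0, 7, 7, 6, 10, 8, 0, 12,
                     16, 0, 9, 9, 0, 7, 7, 17, 11, 0, 6, 9, 0, 15, 11, 9, 0, 0, 2, 1,
                     5, 10, 20, 4, 9, 11, 0, 11, 0, 19]

def summarize(distances):
    """Return (mean edit distance, share of pairs with distance 0)."""
    exact = sum(1 for d in distances if d == 0)
    return mean(distances), exact / len(distances)

for name, distances in [("English", english_distances), ("Finnish", finnish_distances)]:
    avg, exact_rate = summarize(distances)
    print(name, "mean distance:", avg, "exact-match rate:", exact_rate)
```

Raw edit distances grow with sentence length, so a mean over unnormalized distances is only a rough comparison between the two languages.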


*   What exactly is a Levenshtein distance?
> Levenshtein distance measures the distance between two strings as the minimum number of insertions, deletions, and substitutions needed to turn one string into the other.
> It is computed with a dynamic programming approach that compares the characters of one sentence with those of the other, which yields the distance.
*  In the code above, the score for the first English sentence pair is 10, i.e. after 10 insertions, deletions, and substitutions the first sentence is exactly equal to the second. The same applies to the Finnish sentence pairs.
* *Reference:*
  *   https://www.cuelogic.com/blog/the-levenshtein-algorithm
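
For illustration, the dynamic programming idea described above can be written out in a few lines. This is a minimal character-level sketch, not the implementation inside the `Levenshtein` package; `levenshtein` and `normalized_similarity` are hypothetical helper names, and dividing by the longer string's length is just one possible way to turn a distance into a similarity.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitution))        # match or substitute
        previous = current
    return previous[-1]

def normalized_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

# Sentence pairs taken from the Q3a examples.
pairs = [
    ("I don't have a computer.", "I don't have a computer."),
    ("All you need to do is wait.", "All you have to do is wait."),
    ("Be quiet!", "Shut up!"),
]
scores = [normalized_similarity(gold, translated) for gold, translated in pairs]
print(sum(scores) / len(scores))  # mean similarity over the pairs
```

Normalizing by length makes scores comparable across short and long sentences, which matters when averaging over a whole dataset.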






# Submission format (1 point)

Please upload the file with the name

LASTNAME_FIRSTNAME_HW5.ipynb