# HW5: Evaluating machine translated output (31 possible points)

In this homework, we will assess the properties of a machine translation system that translates between a language you probably do not speak (Finnish) and a language you probably do speak (English). The goal of this homework is to get you to:

1. Use the Hugging Face `transformers` package
2. Build a function that can evaluate the quality of a back-translation
3. Build a function that can evaluate machine translated output using a _monolingual_ language model

## The data

You will be using two files for this assignment -- `en-fi-en-translations.txt` and `fi-en-fi-translations.txt`. Each line in the files is tab-separated into columns. The following schema is used for both files:

1. Language 1 (e.g. `ENG`)
2. Language 2 (e.g. `FIN`)
3. Correct intermediate sentence (e.g., the `FIN` translation)
4. Original sentence (i.e., Language 1's original form)
5. Round trip translated sentence (i.e., the output of translating from ENG -> FIN -> ENG)
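
As a rough illustration of this schema, the sketch below splits one made-up line into its five columns. The example sentences are invented for illustration and are not taken from the assignment files. Note that `.readlines()` keeps the trailing newline on each line, so the last column ends in `\n` unless it is stripped.

```python
# Hypothetical example row; the real files contain different sentences.
line = "ENG\tFIN\tKissa on söpö.\tThe cat is cute.\tThe cat is sweet.\n"

# Strip the trailing newline before splitting so the last column stays clean.
fields = line.rstrip("\n").split("\t")
lang1, lang2, intermediate, original, round_trip = fields

print(original)
print(round_trip)
print(original == round_trip)  # exact comparison of the last two columns
```

Unpacking into five names fails loudly if a row has a different number of columns, which can be a useful sanity check on the data.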

The data was generated using two large neural machine translation (NMT) models. These models were trained on open subtitle corpora, which come largely from movies and TV shows. NMT models are usually trained on parallel text, so the subtitles (or captions) for one segment of a movie in one language are assumed to line up with the subtitles for the same segment in the other language.

Most modern models treat machine translation as a sequence-to-sequence problem. That is, we try to find the best representation of an input sequence (e.g., a sequence of English words) in order to predict an output sequence (e.g., a sequence of Finnish words). There are lots of tricks to make the model work, but for this homework we are interested in how faithfully our machine translation systems preserve the original input. We test this with **backtranslation**, a specific case of round-trip machine translation that goes L1 -> L2 -> L1.

## The languages

Languages vary in the way they encode different types of linguistic information. For example, Finnish has much more complex morphology than English does, which means that many different strings in Finnish can translate to exactly the same string in English. You can look up various grammatical properties of Finnish in the WALS database: https://wals.info/languoid/lect/wals_code_fin

Here are a handful of facts about Finnish within the Morphology domain:

* Exclusively concatenative
* Case + number
* 2-3 categories per word
* Dependent marking
* Double marking in possessive noun phrases	
* Strongly suffixing

Likewise you can find out more about English here: https://wals.info/languoid/lect/wals_code_eng

Take a look at these pages before you start the full assignment -- it will help you understand the data better.

# Q1: Create a function that loads the data (2 points)

**Create a function called `load_data`** that takes a file path, opens the file, and reads its lines using the `.readlines()` method. Inside the function, create a list called `data`. Then loop through each line, split it on the `\t` character, and append the resulting list to `data`. Return `data` after the loop finishes.

In [1]:
# your answer for Q1 goes here



# Q2: Loading in data (4 points)

## Q2a: Loading in the English round trip data (2 points)

Load in `en-fi-en.txt` saved to a variable called `english`.

You may load in the file to colab or jupyter however is most convenient for you.

Print out the contents of `english`.

In [2]:
# load in English data



In [3]:
# print contents of `english`


## Q2b: Loading in the Finnish round trip data (2 points)

Load in `fi-en-fi.txt`, saved as a variable called `finnish`.

You may load in the file to colab or jupyter however is most convenient for you.

Print out the contents of `finnish`.

In [4]:
# load in Finnish data


In [5]:
# print contents of `finnish`



# Q3: Qualitative evaluation (7 points)

Referring to the output of Q2a and Q2b, pick two cases that the model got right and three cases that the model got wrong, in terms of reconstructing the original message. Show us the 2 correct and 3 incorrect cases.

For the Finnish translations, you may want to consider the English sentence as well. For both languages, propose 3 possible contributing _linguistic_ factors that may influence whether the backtranslations are correct/incorrect. For example, did any of the linguistic properties of English or Finnish from the WALS database appear in the examples you picked? 

As far as you can tell, does it look like one language was easier to translate into or out of than the other? Can you think of linguistic and non-linguistic reasons why this might be the case?

## Q3a

Right:
1. 
> Sentences here 
2. 
> Sentences here

Wrong:

1. 
> Sentences here

2. 
> Sentences here
3. 
> Sentences here

Qualitatively evaluate the English-English data here by answering the questions above.

## Q3b

Right:
1. 
> Sentences here
2.
> Sentences here

Wrong:

1. 
> Sentences here
2.
> Sentences here
3.
> Sentences here

Qualitatively evaluate the Finnish-Finnish data here by answering the questions above.

## Q3c

Discuss potential linguistic and non-linguistic reasons for errors.

# Q4: Exact string matching (5 points)

For this question, you should compute the average accuracy of each dataset using _strict_ exact string matching. That is, for both `ENG` and `FIN`, you should compare the translated output (the final column in each data entry, index 4 when counting from 0) to the gold output (the second-to-last column in each data entry, index 3).

To do this, write a function called `score` that does the following:

0. Create an empty list called `matches`
1. Takes the dataset (e.g., `english` or `finnish`) as an argument and loops through every row:
  * Identifies the gold sentence
  * Identifies the translated sentence
  * Compares whether the gold sentence is exactly the same string as the translated sentence, storing the result as a boolean called `match`
  * Appends `match` to `matches`
2. Returns the mean of `matches` (accuracy for the full dataset)

For both `ENG` and `FIN`, run `score` and print the accuracy out to the notebook.
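
To see how strict this criterion is, here is a minimal sketch on three invented rows. The function name `exact_match_rate` and the toy sentences are assumptions made for illustration; this is not the required `score` function. Strict matching is sensitive to punctuation, contractions, casing, and stray whitespace such as a trailing newline.

```python
# Toy rows in the five-column layout; all sentences are invented.
toy_rows = [
    ["ENG", "FIN", "(Finnish sentence)", "Where are you?", "Where are you?"],
    ["ENG", "FIN", "(Finnish sentence)", "I'm not sure.", "I am not sure."],
    ["ENG", "FIN", "(Finnish sentence)", "Let's go.", "Let's go"],
]


def exact_match_rate(rows):
    """Fraction of rows whose last two columns are identical strings."""
    matches = [row[-2] == row[-1] for row in rows]
    return sum(matches) / len(matches)


# Only the first row matches exactly: the second differs by a contraction,
# the third by a missing final period.
print(exact_match_rate(toy_rows))
```

Two of the three toy pairs mean the same thing as their originals but still count as errors, which is the kind of case worth keeping in mind for Q6.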

In [6]:
# write your `score` function here
import numpy as np



In [7]:
# compute and print accuracy for english



In [9]:
# compute and print accuracy for finnish



# Q5: Monolingual embedding comparisons (10 points)

In this section, we would like you to compute the similarity between the two sentences for each translation. Specifically, we will use `BERT`, a large language model, as a _sentence encoder_ that will produce one 768-dimensional vector for each token in our sentences. In the code below, we use a tokenizer to transform a sentence $s$ into $k$ subwords, and then we give a pre-trained model the tokenized representation of $s$. The result at each layer is a tensor of shape `(1, k, 768)`.

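The layers are held together in a `tuple`, in which the first element is the lowest layer (the output of the embedding layer), and the last element is the highest layer. For a 12-layer model such as `bert-base-cased`, the tuple therefore has 13 elements.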
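
Because sentences differ in their number of subwords $k$, comparing two of them requires reducing each `(1, k, 768)` tensor to one fixed-size vector. Averaging over the token axis is one common way to do this. The sketch below illustrates the idea in plain Python with made-up 3-dimensional vectors instead of real BERT outputs; the helper names `mean_pool` and `cosine` are assumptions for this example, not part of the assignment code.

```python
import math

# Toy "token embeddings": k tokens x 3 dimensions per sentence.
# Real BERT vectors have 768 dimensions; these numbers are made up.
sent_a = [[1.0, 0.0, 2.0], [3.0, 0.0, 0.0]]                    # k = 2
sent_b = [[2.0, 0.0, 1.0], [2.0, 0.0, 1.0], [2.0, 0.0, 1.0]]   # k = 3


def mean_pool(token_vectors):
    """Average over the token axis to get one vector per sentence."""
    k = len(token_vectors)
    return [sum(dim) / k for dim in zip(*token_vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


vec_a = mean_pool(sent_a)
vec_b = mean_pool(sent_b)
print(len(vec_a), len(vec_b))  # both sentences now have the same dimensionality
print(cosine(vec_a, vec_b))    # close to 1.0 for these toy vectors
```

After pooling, the two sentences have vectors of the same size even though they started with different numbers of tokens, which is what makes the cosine comparison possible.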

### Q5a: Run the next cell to load in the English and Finnish BERT models (1 point)

In [None]:
!pip install transformers
from transformers import BertModel, BertTokenizer
from torch.nn.functional import cosine_similarity

finbert = BertModel.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")
finbert.eval()
fintokenizer = BertTokenizer.from_pretrained("TurkuNLP/bert-base-finnish-cased-v1")

engbert = BertModel.from_pretrained("bert-base-cased")
engbert.eval()
engtokenizer = BertTokenizer.from_pretrained("bert-base-cased")

### Q5b: Computing the similarity between two English sentences at a specific layer (3 points)

Note, you do NOT need to change `embed_sentence` or `sentence_similarity`. Changing them will cause you to lose points for this question.

In [10]:
def embed_sentence(model, tokenizer, s):
  s_tokenized = tokenizer(s, return_tensors="pt")
  s_embeds = model(**s_tokenized, output_hidden_states=True)['hidden_states']
  return s_embeds


def embed_sentences(model, tokenizer, s1, s2):
  s1_embeds = embed_sentence(model, tokenizer, s1)
  s2_embeds = embed_sentence(model, tokenizer, s2)
  return s1_embeds, s2_embeds


def sentence_similarity(s1_embeds, s2_embeds, layer):
  s1_vector = convert_embeds_to_vector(s1_embeds, layer)
  s2_vector = convert_embeds_to_vector(s2_embeds, layer)
  similarity = cosine_similarity(s1_vector, s2_vector).detach().item()
  return similarity

In [None]:
def convert_embeds_to_vector(embeds, layer):
  # TODO: get sentence embeddings at a specific layer

  # TODO: compute the mean along axis=1 to turn the embeddings into a single vector of size (1, 768)

  # TODO: verify the shape of the vector is (1, 768) using assert

  return vector

After editing the above code, run the cell below. The output you should get is that sentence 1 and sentence 2 have cosine similarity $\approx 0.95$.

In [None]:
sent1 = 'My cat Vector is very cute.'
sent2 = 'Professor Jacobs has a cat named Vector who is very cute.'

s1_embeds, s2_embeds = embed_sentences(engbert, engtokenizer, sent1, sent2)
print(sentence_similarity(s1_embeds, s2_embeds, layer=6))

### Q5c: Computing the similarity between all English gold and translated sentences at all layers (3 points)

Create a nested `for` loop to iterate through all the sentences in `english` using `sentence_similarity` from Q5b.

Outer loop -- all sentences

Inner loop -- all layers

Print averages of all layers out at the end
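
The bookkeeping for this kind of nested loop can be sketched with a `defaultdict` keyed by layer. The scores below are made up and stand in for the values `sentence_similarity` would return; the names `toy_scores` and `toy_layer_sims` are assumptions for this illustration only.

```python
from collections import defaultdict

# Made-up similarity scores: one inner list per sentence pair,
# one score per layer (only 3 layers here to keep the example small).
toy_scores = [
    [0.90, 0.80, 0.70],
    [1.00, 0.90, 0.80],
]

toy_layer_sims = defaultdict(list)
for pair_scores in toy_scores:                 # outer loop: sentence pairs
    for layer, sim in enumerate(pair_scores):  # inner loop: layers
        toy_layer_sims[layer].append(sim)

# Average the collected scores separately for each layer.
for layer, sims in toy_layer_sims.items():
    print(layer, sum(sims) / len(sims))
```

Grouping by layer rather than by sentence means the final averaging step is a single pass over the dictionary.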

In [None]:
from collections import defaultdict

layer_sims = defaultdict(list)



In [11]:
# loop over each layer in layer_sims and print the average

### Q5d: Compute the similarity between all Finnish gold and translated sentences at all layers (3 points)

Create a nested `for` loop to iterate through all the sentences in `finnish` using `sentence_similarity` from Q5b.

Outer loop -- all sentences

Inner loop -- all layers

Print averages of all layers out at the end

In [None]:
layer_sims = defaultdict(list)


In [12]:
# loop over each layer in layer_sims and print the average

# Q6: Compare and contrast exact matches to embedding distances (3 points)

Exact matches and embedding distances can be thought of as two opposite ends of a spectrum. Write down 1 clear advantage that you see for each method over the other one (1 point each). For example, what information does an embedding distance give you over an exact string match, and vice versa? Can you think of specific cases in the above analyses where neither method is very good? (1 point)

Your free response to Q6 goes here.

# Bonus: Compute string similarity between gold and machine translated sentences (5 points)

Two strings can be similar without being identical in a number of ways. Choose **ONE** (1) of the following measures and use it to compute the **mean** similarity between the gold sentences and the machine translated sentences. Summarize why that measure would be appropriate for comparing the input and output strings. Consider the discussion of these measures throughout the SLP3 book, on Wikipedia, and in [Manning, Wein, and Schneider (2020)](https://aclanthology.org/2020.coling-main.420.pdf). You may use any off-the-shelf implementation that you find as long as you cite it below.

* [BLEU](https://en.wikipedia.org/wiki/BLEU)
* [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric))
* [Levenshtein Distance](https://en.wikipedia.org/wiki/Levenshtein_distance)

For full points, this question must:

1. Justify the metric (NO quoting from Wikipedia)
2. Summarize the results for English AND Finnish
3. Cite all sources
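
For intuition about what an edit-based measure computes, below is a minimal from-scratch sketch of character-level Levenshtein distance using the standard dynamic-programming recurrence. It is an illustration only, not a substitute for a cited off-the-shelf implementation. The helper `normalized_similarity` is a hypothetical convenience for turning the distance into a score between 0 and 1.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Hypothetical helper: 1.0 for identical strings, lower as edits increase."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(normalized_similarity("Let's go.", "Let's go"))
```

Unlike exact matching, this gives partial credit to a translation that differs from the gold sentence by only a character or two, though it still knows nothing about meaning.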

In [None]:
# Bonus code here

Bonus free response

# Submission format (1 point)

Please upload the file with the name

LASTNAME_FIRSTNAME_HW5.ipynb