## Homework 1 - Machine Translation - MDS Computational Linguistics

### Assignment Topics
- Phrase Based Machine Translation
- Multilingual Word Embeddings
- MT datasets

### Software Requirements
- Python (>=3.6)
- PyTorch (>=1.7.0) 
- Jupyter (latest)

- GIZA++ needs:
    - Perl
    - OSX/Linux (recommended) or a Cygwin terminal

If you are installing GIZA++ on Cygwin, make sure you select Perl, make, and GCC/G++ when you install Cygwin (the default installation does not include them).

### Submission Info.
- Due Date: February 27, 2021, 23:59:00 (Vancouver time)



## Tidy Submission

rubric={mechanics:1}

To get the marks for tidy submission:
- Submit the assignment by filling in this Jupyter notebook with your answers embedded
- Be sure to follow the [general lab instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions)

## T1 Build Romanian-English/English-Romanian GIZA++ alignments 

We can use the GIZA++ tool to automatically align vocabularies in two different languages using parallel corpora. This is useful, for example, for creating bilingual dictionaries or for other term-based translation tasks. Because of the age of the tool (2001) and the complications with getting it working on modern computers, we suggest pair programming with your group to install and run it. *You will not need to generate alignments for the individual portions of this lab; they will be provided in the student data directory.*

In this group assignment you'll be working with GIZA++ to generate alignments based on the EuroParl corpus between English and Romanian. 

First, download the En/Ro parallel corpus from http://www.statmt.org/europarl/ and follow the procedure in the tutorial to create the aligned output from GIZA++. To speed up training, reduce the corpus to 200,000 sentences using the following Unix command (where filename and newfilename are your original and truncated corpus files, respectively):
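As a rough Python equivalent of this truncation step, the sketch below copies the first 200,000 lines of a file. The helper name `truncate_corpus` is hypothetical and is not part of GIZA++ or the tutorial. It assumes the corpus is UTF-8 text with one sentence per line.

```python
from itertools import islice


def truncate_corpus(filename, newfilename, max_lines=200_000):
    """Copy at most the first `max_lines` lines of `filename` to `newfilename`."""
    with open(filename, encoding="utf-8") as src, \
            open(newfilename, "w", encoding="utf-8") as dst:
        # islice stops after max_lines without reading the rest of the file.
        dst.writelines(islice(src, max_lines))


# Hypothetical usage (the file names are placeholders, not real paths):
# truncate_corpus("filename", "newfilename")
```

Both sides of the parallel corpus need to be truncated to the same number of lines so that the sentence pairs stay aligned.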



### T1.1 Check your output
rubric={accuracy:5}

Were you able to get it to align appropriately? Copy the first three alignments out of your *.VA3.final files.

## Exercise 1: GIZA++ Warm Up

The following "warm up" questions check that you understand GIZA++'s outputs.

### 1.1 Interpreting GIZA++ results
rubric={reasoning:2}


For these three sentences, which one does GIZA++ handle "best"? Explain how you should measure "best" with GIZA++.

###### A. Sentence pair (28) source length 7 target length 8 alignment score : 1.13908e-07
- It is the case of Alexander Nikitin . =
- NULL ({ }) Es ({ 1 2 }) el ({ 3 }) caso ({ 4 }) de ({ 5 }) Alexander ({ 6 }) Nikitin ({ 7 }) . ({ 8 }) 

###### B. Sentence pair (122) source length 6 target length 7 alignment score : 7.36437e-21
- You did not call me either . =
- NULL ({ }) Tampoco ({ 3 }) me ({ 5 }) ha ({ 2 }) nombrado ({ 4 6 }) usted ({ 1 }) . ({ 7 }) 

###### C. Sentence pair (115) source length 4 target length 7 alignment score : 8.73259e-12
- There is no room for amendments . =
- NULL ({ 2 5 }) No ({ 1 }) caben ({ 3 4 }) modificaciones ({ 6 }) . ({ 7 }) 
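To make the notation in these entries concrete, here is a minimal sketch that reads one alignment line. It assumes each token is followed by `({ ... })` containing 1-based positions in the other sentence. The function name `parse_alignment_line` is hypothetical, and the example input is sentence pair A from above.

```python
import re

# Each token is followed by "({ i j ... })"; the index list may be empty.
TOKEN_PATTERN = re.compile(r"(\S+) \(\{([\d ]*)\}\)")


def parse_alignment_line(aligned_line, other_sentence):
    """Return (token, [aligned words]) pairs for one alignment line."""
    other_tokens = other_sentence.split()
    pairs = []
    for token, index_text in TOKEN_PATTERN.findall(aligned_line):
        # Indices are 1-based positions in the other sentence.
        aligned = [other_tokens[int(i) - 1] for i in index_text.split()]
        pairs.append((token, aligned))
    return pairs


sentence = "It is the case of Alexander Nikitin ."
alignment = ("NULL ({ }) Es ({ 1 2 }) el ({ 3 }) caso ({ 4 }) de ({ 5 }) "
             "Alexander ({ 6 }) Nikitin ({ 7 }) . ({ 8 })")

for token, aligned_words in parse_alignment_line(alignment, sentence):
    print(token, "->", aligned_words)
```

An empty list, as for `NULL` here, means that no word in the other sentence was aligned to that token.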


**Put Your Answer Here**

### 1.2 GIZA++ vocab 
rubric={reasoning:1}

What do the two numbers represent in this sample GIZA++ vocab entry? (see europarl-es-en/giza_output/en-es.trn.trg.vcb from the tutorial)

```
45 producido 391
```


**Put Your Answer Here**


### 1.3a GIZA++ Translation Tables
rubric={reasoning:1}

Below is a (partial) GIZA++ translation table for one of the words in the corpus. What is the word index in the target language that corresponds to the most likely translation of the source term? [Note: this may render poorly on GitHub but should look fine in a Jupyter notebook.]

*Hint: You should understand what the three numbers in each row represent.*


```
6 5 0.344281  
6 6 0.0051642  
6 70 0.360418  
6 164 2.0722e-07  
6 242 0.00264185  
6 269 1.49345e-05  
6 350 1.24741e-07  
6 408 0.00296726  
6 422 0.00269407  
6 433 9.45562e-05  
6 450 0.00538036  
6 452 0.00269633  
6 492 0.00272651  
6 516 9.83587e-06  
6 613 0.0026897  
6 747 0.00478991  
6 762 0.0162501  
```

**Put Your Answer Here**


### 1.3b GIZA++ Translation Tables
rubric={reasoning:1}

In the above translation table, is the vocab index (6) coming from the source vocab or target vocab?

**Put Your Answer Here**

## 2.1 Build a bilingual dictionary
rubric={accuracy:3,quality:3,efficiency:3}

Using the provided alignment files (see /data/lab1) covering En/Fr alignments, create a program that builds a bilingual French-English/English-French dictionary. The dictionary should be a function that takes a word and a language argument and returns the (single) most likely word based on the GIZA++ alignment. Comment your code, create appropriate test cases, and catch errors.

*Hint: Think about subtle issues with the generation of the alignments, for instance how should you handle situations where you have a rare word in one language that matches with a common word in the other language and vice versa?*


```python
# Your code below. Implementing this as a CLASS is strongly suggested.

def myDictionary(word, language):
    ## to complete
    return alignment_word

# or

class myDictionary():
    def __init__(self):  # add the arguments you'll need to initialize correctly
        ## to complete
        pass

    def lookup(self, word, language):
        ## to complete
        return alignment_word
```


### 2.2 Test cases
rubric={accuracy:1}

Add test cases to the provided entries to ensure that the dictionary functions correctly. Consider words that shouldn't exist in a language, the number of entries returned, etc.

```python
### Change to the class implementation (e.g. foo.lookup(bar, baz)) if you decided to use that instead.
print(myDictionary("vous", language="french"))
print(myDictionary("Minutes", language="english"))
print(myDictionary("awesome", language="french"))
## Additional testing below:
```


### 2.3 Explanation

If your testing catches some subtle errors (i.e. it runs fine for the most part but in some edge cases translation errors occur), explain what the issue is and how you should fix it.

*Response here IF NEEDED*

## Exercise 3: Neural Language Alignment

We compare GIZA++'s alignments with MUSE to see how purely statistical models compare with neural models.

First, get the aligned English and French word embeddings from https://github.com/facebookresearch/MUSE

Next, import some libraries for visualization.

```python
import io
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
```

### 3.1 Cosine Similarity
rubric={accuracy:2}

Cosine similarity is used fairly frequently in NLP to measure the similarity between two vectors (and thus between two words, embeddings, etc.).

$\text{similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{d} A_i B_i}{\sqrt{\sum_{i=1}^{d} A_i^2} \, \sqrt{\sum_{i=1}^{d} B_i^2}}$
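The formula can be sketched in plain Python as follows. The function name `cosine_similarity` and the toy vectors `u` and `v` are illustrative assumptions and are not part of the assignment data.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))      # numerator: A . B
    norm_a = math.sqrt(sum(x * x for x in a))   # ||A||
    norm_b = math.sqrt(sum(y * y for y in b))   # ||B||
    return dot / (norm_a * norm_b)


# Toy example: dot product is 24 and both norms are 5, so the result is 24/25.
u = [3.0, 4.0]
v = [4.0, 3.0]
print(cosine_similarity(u, v))  # approximately 0.96
```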

Using the following two tensors (of a dim=3 embedding) show **by hand** how you would calculate the cosine similarity between these two vectors.

A = [ 0.1010, -1.1388, -0.7991]

B = [ 0.5083, -0.2255,  1.9037]


**Write Answer Here** (alternatively take a picture of your work)

### 3.2 Plotting Aligned Embeddings
rubric={accuracy:2}

For the following set of words ("excessively", "Minutes", "citizens", "standards", and "disaster"), create a 2D diagram showing the words in vector space alongside their closest French 'equivalents'. *Use the Load_vec and getnn functions from the colab notebook, and feel free to use [this MUSE tutorial](https://github.com/facebookresearch/MUSE/blob/master/demo.ipynb) as a reference for how you can plot these embeddings.*
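The idea of a closest cross-lingual 'equivalent' can be illustrated with a small self-contained sketch. It assumes the embeddings are in the plain-text `.vec` layout (a `count dim` header line, then one `word v1 v2 ...` row per word). The helpers `parse_vec` and `nearest_neighbors` are hypothetical stand-ins and are not the notebook's functions. The two-dimensional vectors are invented purely for illustration.

```python
import io
import math


def parse_vec(text, nmax=50000):
    """Parse .vec-style text into a {word: [floats]} dictionary."""
    embeddings = {}
    lines = io.StringIO(text)
    next(lines)  # skip the "count dim" header line
    for line in lines:
        word, *values = line.rstrip().split(" ")
        embeddings[word] = [float(v) for v in values]
        if len(embeddings) >= nmax:
            break
    return embeddings


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def nearest_neighbors(word, src_emb, tgt_emb, k=1):
    """Return the k target-language words most similar to a source word."""
    query = src_emb[word]
    ranked = sorted(tgt_emb, key=lambda w: cosine(query, tgt_emb[w]),
                    reverse=True)
    return ranked[:k]


# Toy embeddings, invented for illustration only.
toy_en = parse_vec("2 2\ncat 1.0 0.1\ndog 0.1 1.0\n")
toy_fr = parse_vec("2 2\nchat 0.9 0.2\nchien 0.2 0.9\n")
print(nearest_neighbors("cat", toy_en, toy_fr))  # prints ['chat']
```

This only works when both embedding sets live in the same aligned vector space, which is what the pre-aligned MUSE embeddings provide.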

```python
# your code/diagram goes here
```

### 3.3 Comparing MUSE to GIZA++
rubric={reasoning:2}

Based on the words that you have identified, how do the MUSE pre-trained embeddings compare with the GIZA++ mappings? Which method do you think works better in practice? What might be some drawbacks of this comparison?


*Hint: You can perform a case study to compare the alignment dictionaries.*


**Answer Goes Here**

## Exercise 4: MUSE from scratch 

We sort of cheated in Exercise 3 by using the pre-aligned embeddings. Unfortunately, these are trained on Wikipedia and are thus not a great comparison to the Europarl corpus dictionary we've made with GIZA++. We've provided a Google Colab notebook in the repo, 531_L1_MUSE.ipynb, which should set up the environment correctly for running these on Google's Colab platform.


### 4.1 Create fastText word embeddings for Europarl
rubric={accuracy:3}

fastText is similar to Word2Vec but has some slight improvements (e.g. it can capture orthographic information). It is available here:
https://github.com/facebookresearch/fastText. Follow the instructions there to train an embedding model for the English and French Europarl corpuses you downloaded earlier for 2.1.

Copy your training procedure below and give example outputs for the closest words to "Minutes", "minutes", and "vote" for English and "vous", "intervienne", and "accord" for French.

```python
# your training code goes here!
```

```python
# Give example outputs for the suggested words: "Minutes", "minutes", and "vote" for English and "vous", "intervienne", and "accord" for French.
```

### 4.2 Use MUSE to align the word embeddings 
rubric={accuracy:2}

Using MUSE's unsupervised functionality, align the two language embedding spaces and find the closest French words that match the English words "disaster", "vote", and "excessively".



### 4.3 Compare MUSE with GIZA++
rubric={reasoning:3}

Based on the aligned dictionaries you've made, qualitatively does there seem to be much difference between GIZA++ and MUSE? Which one appears to work better? Write a brief (2-3 sentence) comparison of both the process to make the two dictionaries and the ultimate quality of the final dictionaries.

**Answer Here**

## Exercise 5: Conceptual Questions

### 5.1 Data Sets
rubric={reasoning:3}

Aligned corpuses are vital when it comes to creating a Machine Translation model. We used Europarl for this assignment, but you should think about your language needs if you are planning on working on projects dealing with non-European languages or other genres of text. Find two datasets that you might be interested in using for this unit. 

One important resource for this is the [Linguistic Data Consortium](https://catalog.ldc.upenn.edu/search), of which UBC is a member. To access the datasets that UBC has available, please use [Abacus](https://resources.library.ubc.ca/page.php?details=abacus&id=1114). We recommend searching on the LDC website, which has a better metadata search engine for the datasets. Make sure you find something that has a parallel translation, and then locate the dataset on Abacus by searching for its title. Note that UBC has most, but not all, of the sets.


For these two corpuses, please identify:

- What language pair(s) are covered by the corpus
- Size of the corpus
- "Style" of text covered (is it formal text? news writing? email? informal social media posts?)

**Response goes here**