![Save2Drive](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/save2drive.png)

# Word Level Translation

In this notebook, you will learn how to use some simple probabilities and a corpus of parallel text from two languages to create your own word translator.

### Run the cell below to get set up

In [None]:
import os, sys
IN_COLAB = 'google.colab' in sys.modules
if IN_COLAB:
  !rm -rf AI4All2020-Michigan-NLP
  !git clone https://github.com/alahnala/AI4All2020-Michigan-NLP.git
  !cp -r AI4All2020-Michigan-NLP/utils/ .
  !cp -r AI4All2020-Michigan-NLP/Data/ .
  !cp -r AI4All2020-Michigan-NLP/slides/ .
  !cp -r AI4All2020-Michigan-NLP/Experiment-Report-Templates/ .
  !echo "=== Files Copied ==="

# Data - we're going to start with a parallel corpus of English and Spanish sentences

A parallel corpus is one that has the same text in two different languages.
Let's load the data and take a look at the first item in the list:

In [None]:
with open("Data/mt-data/eng-spa.txt") as f:
    english_spanish = f.read().split('\n')

In [None]:
english_spanish[0]

# Now let's parse the data to make a list of Spanish sentences and a list of English sentences

Fill in code where the comments are in the `parse_lines` function. The function should return two lists, one that has just the English sentences, and one that has just the Spanish sentences.
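
As a small illustration of splitting on tabs, the sketch below runs `str.split` on a single line. The sample line is made up for this example (it only mimics an English part, a Spanish part, and a trailing source field) and is not taken from the corpus.

```python
# Hypothetical sample line: English, Spanish, and source info separated by tabs.
sample_line = "Good morning.\tBuenos días.\tsource info goes here"

parts = sample_line.split('\t')  # a list with three strings
english_part = parts[0]
spanish_part = parts[1]

print(english_part)
print(spanish_part)
```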

In [None]:
from tqdm import tqdm

def parse_lines(lines):
    language1_sentences = []
    language2_sentences = []
    for line in tqdm(lines):
        try:
            '''
            Add code here to separate the English part from the Spanish part of the line
            Then save the English and Spanish parts into two separate variables to add to the lists
            Hint: try using the split function to split on tabs, which are represented with the string '\t'
            We only need the first two list items after splitting on tabs as the last item is extra information about
            the source of the sentences.
            '''
            
            
        except:
            continue
        language1_sentences.append('''TODO''')
        language2_sentences.append('''TODO''')

    return language1_sentences, language2_sentences


english, spanish = parse_lines(english_spanish)
    

# We are going to implement a function that computes pointwise mutual information



Take a brief moment to read the definition of [Pointwise Mutual Information](https://en.wikipedia.org/wiki/Pointwise_mutual_information) from Wikipedia below.

![PMI definition](https://raw.githubusercontent.com/alahnala/AI4All2020-Michigan-NLP/master/slides/PMI.png)

### The main idea of PMI is that it measures how strongly two things are associated with each other.
In the context of translation, we want to know the probability that two words from different languages appear in the same line of the parallel corpus in their respective languages.

For example, if the word "Stop" shows up in the English corpus and the word "Parad" shows up in the Spanish corpus, we want to know how often those two words appear in parallel translations in our corpus.
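
To make the definition concrete, the sketch below evaluates the PMI formula, log(p(x, y) / (p(x) * p(y))), on made-up probabilities. The numbers and variable names are invented for illustration and do not come from the corpus; the natural log from Python's `math` module is assumed.

```python
from math import log

# Made-up probabilities, chosen only to illustrate the formula.
p_x = 0.01               # probability of word x
p_y = 0.02               # probability of word y
p_xy_assoc = 0.005       # x and y co-occur far more often than chance
p_xy_indep = p_x * p_y   # exactly what independence would predict

pmi_assoc = log(p_xy_assoc / (p_x * p_y))  # positive: the words are associated
pmi_indep = log(p_xy_indep / (p_x * p_y))  # zero: no association beyond chance

print(pmi_assoc, pmi_indep)
```

A PMI well above zero suggests the two words appear together more often than their individual frequencies would predict, which is the signal we want for translation pairs.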

In [None]:
english_spanish[20]

Let's begin by understanding which components we need in order to compute the PMI of `token_A` and `token_B`, where `token_A` is a word from English and `token_B` is a word from Spanish. Take a look at the cell below to understand the input to a function called `pointwise_mutual_information`, which we will write to compute the PMI of `token_A` and `token_B`. The cell below the function explains each component.

In [None]:
def pointwise_mutual_information(token_A, token_B, A_B_probabilities, A_probabilities, B_Probabilities):
    return

- We need to keep track of the probabilities of A and the probabilities of B. We do that with `A_probabilities` and `B_Probabilities`, which are dictionaries of unigram probabilities, one for each language. We create these the same way we did in the Language Identification project ([see a walkthrough of that project here](https://colab.research.google.com/drive/1Pj8mL_amPoYJbb8nTpcX-EMNimQS8IN9?usp=sharing)).
- We also need to keep track of the probabilities of `token_A` AND `token_B`, meaning the probability that `token_A` shows up in a parallel translation of a sentence containing `token_B`. We do this with a dictionary called `A_B_probabilities`. Computing these probabilities is explained in more detail later.
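
For orientation, the three inputs might look like the toy dictionaries below. All words and numbers here are invented for illustration; the real dictionaries are built from the corpus later in the notebook.

```python
# Toy values, invented for illustration only.
A_probabilities = {'stop': 0.002, 'go': 0.003}        # English unigram probabilities
B_Probabilities = {'parad': 0.0001, 'vaya': 0.0002}   # Spanish unigram probabilities

# Nested dictionary: the outer key is the English token, the inner key is the Spanish token.
A_B_probabilities = {
    'stop': {'parad': 0.00005},
    'go': {'vaya': 0.0001},
}

# Looking up each component for one token pair.
token_A, token_B = 'go', 'vaya'
print(A_probabilities[token_A])
print(B_Probabilities[token_B])
print(A_B_probabilities[token_A][token_B])
```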

# Let's prepare a preprocessing function to obtain tokens

- The first step is to preprocess the sentences into tokens. We've written a function for you that uses the NLTK `word_tokenize` function, but feel free to modify it if you like.
- Run the cell below to load the preprocess function and to show its output on an English test string and a Spanish test string.

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

def preprocess(string, language):
    tokens = word_tokenize(string.lower(), language=language)
    return tokens

test_string = english[0]
tokens = preprocess(test_string, 'english')
print(test_string)
print(tokens)

test_string = spanish[0]
tokens = preprocess(test_string, 'spanish')
print(test_string)
print(tokens)

In [None]:
english[2]

# Count the English unigrams and Spanish unigrams

- Fill in the code as detailed in the comments below to count the unigram tokens for each language.
- If you need help, remember we did something very similar in the Language Identification project. Find where we did that in the [Language Identification Walkthrough](https://colab.research.google.com/drive/1Pj8mL_amPoYJbb8nTpcX-EMNimQS8IN9?usp=sharing) for reference.
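
As a reminder of how counting with a `defaultdict` works, here is a minimal sketch on a made-up token list. The names `toy_tokens` and `toy_counts` are hypothetical and exist only for this example.

```python
from collections import defaultdict

# Made-up token list, for illustration only.
toy_tokens = ['the', 'cat', 'saw', 'the', 'dog']

toy_counts = defaultdict(lambda: 0)
for token in toy_tokens:
    toy_counts[token] += 1  # a missing key starts at 0, so no existence check is needed

print(dict(toy_counts))
```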

In [None]:
from collections import defaultdict

english_unigrams = defaultdict(lambda:0)

for sentence in tqdm(english):
    tokens = preprocess(sentence, 'english')
    '''
    Iterate through the tokens to add them to the english_unigrams dictionary and count them
    '''
        
spanish_unigrams = defaultdict(lambda:0)

for sentence in tqdm(spanish):
    tokens = preprocess(sentence, 'spanish')
    '''
    Iterate through the tokens to add them to the spanish_unigrams dictionary and count them
    '''

# Compute English and Spanish unigram probabilities

- Here we compute the probabilities of each token, which is the token count divided by the total tokens.
- We've written these steps for you, so you just need to run the cell (but try to understand each line of code).
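
One way to sanity-check this step is to confirm that the probabilities add up to 1. The sketch below does that on made-up counts; `toy_counts` and the other names are hypothetical and not part of the notebook's data.

```python
# Made-up token counts, for illustration only.
toy_counts = {'the': 2, 'cat': 1, 'dog': 1}

toy_total = sum(toy_counts.values())
toy_probabilities = {token: count / toy_total for token, count in toy_counts.items()}

print(toy_probabilities)
print(sum(toy_probabilities.values()))  # should be 1, up to floating-point rounding
```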

In [None]:
english_unigram_probabilities = {}
total_english_unigrams = sum(english_unigrams.values())
for token in tqdm(english_unigrams):
    token_count = english_unigrams[token]
    english_unigram_probabilities[token] = token_count / total_english_unigrams
    
spanish_unigram_probabilities = {}
total_spanish_unigrams = sum(spanish_unigrams.values())
for token in tqdm(spanish_unigrams):
    token_count = spanish_unigrams[token]
    spanish_unigram_probabilities[token] = token_count / total_spanish_unigrams
    

# To compute the probability of encountering an English unigram and a Spanish unigram in parallel sentences, first count the co-occurrences

- Here we are creating a dictionary of dictionaries that counts the number of times that `token_A` occurs in parallel translations of sentences that have `token_B`.
- This is the first step in building the data structure needed for the `A_B_probabilities` described earlier.
- We are going to name this data structure `en_es_cooc_count`.
- Walk through the cell below reading the comments to understand what each line is doing.
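
To see the shape of the result before running it on the full corpus, here is the same counting idea on two made-up sentence pairs. This sketch tokenizes with `str.split` instead of `preprocess` so that it runs on its own; the sentences and the name `toy_cooc` are invented for illustration.

```python
from collections import defaultdict

# Two made-up parallel sentence pairs, for illustration only.
toy_pairs = [
    ("go now", "vaya ahora"),
    ("go home", "vaya a casa"),
]

toy_cooc = defaultdict(lambda: defaultdict(lambda: 0))
for en_sentence, es_sentence in toy_pairs:
    # Sets ensure each token is counted at most once per sentence pair.
    for en_tok in set(en_sentence.split()):
        for es_tok in set(es_sentence.split()):
            toy_cooc[en_tok][es_tok] += 1

print(toy_cooc['go']['vaya'])   # 'go' and 'vaya' appear together in both pairs
print(toy_cooc['now']['casa'])  # these two never share a pair
```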

In [None]:
# Initializing a dictionary of dictionaries, with lambda:0 so that each new key automatically maps to 0.
en_es_cooc_count = defaultdict(lambda:defaultdict(lambda:0))


# We zip up our lists of English and Spanish sentences and walk through each parallel pair of sentences
for english_sentence, spanish_sentence in tqdm(zip(english, spanish)):
    
    # first we preprocess the English and Spanish sentences to obtain tokens
    english_tokens = preprocess(english_sentence, 'english')
    spanish_tokens = preprocess(spanish_sentence, 'spanish')
    
    # let's create a set so that we get just one instance of each token in case any tokens appear multiple times in the sentences
    english_token_set = set(english_tokens)
    spanish_token_set = set(spanish_tokens)
    
    # For each individual English token, we count the number of times it appears with each individual Spanish token in this particular parallel translation
    for english_token in english_token_set:
        for spanish_token in spanish_token_set:
            en_es_cooc_count[english_token][spanish_token] += 1


# Now compute the probabilities

- Here, we compute the probabilities from the counts we just computed
- Walk through the lines of code reading the comments to understand how this works.
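
The same normalization can be seen on a tiny example. The counts below are made up, and the names `toy_cooc_count`, `toy_total_pairs`, and `toy_cooc_probabilities` are hypothetical; the point is only that every pair count is divided by the sum of all pair counts.

```python
# Made-up co-occurrence counts, for illustration only.
toy_cooc_count = {
    'go': {'vaya': 2, 'ahora': 1},
    'now': {'ahora': 1},
}

# Total number of co-occurring pairs across all English tokens.
toy_total_pairs = sum(sum(inner.values()) for inner in toy_cooc_count.values())

# Divide each pair count by the total to get a joint probability.
toy_cooc_probabilities = {
    en_tok: {es_tok: count / toy_total_pairs for es_tok, count in inner.items()}
    for en_tok, inner in toy_cooc_count.items()
}

print(toy_total_pairs)
print(toy_cooc_probabilities)
```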

In [None]:
# Initializing a dictionary of dictionaries, with lambda:0 so that each new key automatically maps to 0.
en_es_cooc_probabilities = defaultdict(lambda:defaultdict(lambda:0))

# first compute the total pairs. We iterate through each English word and add the sum of the values in its dictionary of Spanish tokens
total_pairs = 0
for en_tok in tqdm(en_es_cooc_count):
    total_pairs += sum(en_es_cooc_count[en_tok].values())

# Then we go through all of the pairs, use the count of each pair as the top (numerator), and divide it by total_pairs to get the probability
for en_tok in tqdm(en_es_cooc_count):
    for es_tok in en_es_cooc_count[en_tok]:
        top = en_es_cooc_count[en_tok][es_tok]
        bottom = total_pairs
        
        pr = top / bottom
        en_es_cooc_probabilities[en_tok][es_tok] = pr
    

# Now we have all of the components we need for the PMI function

- Add code below to complete the `pointwise_mutual_information` function so that it computes the PMI of token A and token B (refer back to the equation).
- We imported a function to compute the log of a number. Its syntax is simply `log(some_number)`, and with a single argument it returns the natural log.
- We included a mini test case. If your function is working correctly, then the cell should output:
``PMI of "go" and "vaya": 18.698165913571383``


In [None]:
from math import log

def pointwise_mutual_information(token_A, token_B, A_B_probabilities, A_probabilities, B_Probabilities):

    '''
    Add code here to complete the function
    '''
    return pmi


test_A_B_Probabilities = {'go':{'vaya':79}}
test_A_Probabilities = {'go': 0.003110732656094365}
test_B_Probabilities = {'vaya': 0.0001924217344287494}
test_token_A = 'go'
test_token_B = 'vaya'

test_pmi = pointwise_mutual_information(test_token_A, test_token_B, test_A_B_Probabilities, test_A_Probabilities, test_B_Probabilities)
print('PMI of "go" and "vaya":', test_pmi)

# Run the cell below to compute the PMIs for all of the token pairs

In [None]:
en_es_pmis = {}

for en_tok in tqdm(en_es_cooc_count):
    for es_tok in en_es_cooc_count[en_tok]:
        pmi = pointwise_mutual_information(en_tok, es_tok, en_es_cooc_probabilities, english_unigram_probabilities, spanish_unigram_probabilities)
        pair = en_tok + '\t' + es_tok
        en_es_pmis[pair] = pmi

# Run the cell below to sort our dictionary of PMIs to see the pairs with high PMI scores

In [None]:
import operator

sorted_pmis = sorted(en_es_pmis.items(), key=operator.itemgetter(1), reverse=True)

# Run the cell below to print out the word pairs with the ten highest PMIs

In [None]:
for pair, pmi in sorted_pmis[:10]:
    parts = pair.split('\t')
    en = parts[0]
    es = parts[1]
    print(en, '--->', es)

# Now that you have learned how to compute PMIs, can you create a system that translates a full sentence using just these PMIs?

1. Expand upon what you learned throughout all of the NLP project notebooks.
2. Then, if you are interested, try coming up with a way to use translation in a Chatbot ([open a fresh Chatbot starter here](https://colab.research.google.com/github/alahnala/AI4All2020-Michigan-NLP/blob/master/Chatbot-Starter.ipynb)).
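
As one possible starting point, the sketch below translates word by word by picking, for each English word, the Spanish word with the highest PMI. It is only a rough idea under simplifying assumptions: the PMI scores are invented, `toy_pmis` merely copies the `"english<TAB>spanish"` key format of `en_es_pmis`, and `translate` is a hypothetical helper that ignores word order and grammar.

```python
# Toy PMI scores keyed like en_es_pmis ("english<TAB>spanish"); values are invented.
toy_pmis = {
    'go\tvaya': 5.0,
    'go\tcasa': 1.0,
    'home\tcasa': 4.5,
    'home\tvaya': 0.5,
}

# For each English word, keep the Spanish word with the highest PMI.
best = {}
for pair, pmi in toy_pmis.items():
    en, es = pair.split('\t')
    if en not in best or pmi > best[en][1]:
        best[en] = (es, pmi)

def translate(sentence):
    # Hypothetical helper: swaps each word for its best match and leaves unknown words unchanged.
    return ' '.join(best.get(word, (word,))[0] for word in sentence.lower().split())

print(translate('Go home'))
```

With real PMI scores, rare words can receive very high PMIs from a single co-occurrence, so you may want to experiment with a minimum count before trusting a pair.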