# Model
#### Charlie Liou

In [1]:
import numpy as np, pandas as pd
import string, re
import os, glob, requests
from bs4 import BeautifulSoup
from itertools import chain

Much of this is adapted from Kevin Knight's [A Statistical MT Tutorial Workbook.](http://www.isi.edu/natural-language/mt/wkbk.rtf)

To write a statistical English to Chinese translator, we will consider English sentences $e$ and Chinese sentences $c$. This may seem like solving the opposite problem, but given $c$, we seek the English sentence that will maximize $P( e \mid  c)$ (that is, the most likely English sentence when given the Chinese sentence.) This reversal will be explained later. More formally, we are trying to find the english sentence $\vec{e}$ that satisfies

$$\vec{e} = arg\hspace{0.05cm} max\hspace{0.05cm}P(e \mid c)$$

How do we approach maximizing the probability $P(e \mid  c)?$ Given that we know Bayes' Theorem:

$$P( e \mid  c) = \dfrac{P(e)\hspace{0.05cm}P(c\mid e)}{P(c)}$$

the above equation becomes

$$\vec{e} = arg\hspace{0.05cm} max\hspace{0.05cm}P(e \mid c) = arg\hspace{0.05cm} max\hspace{0.05cm}P(e)\hspace{0.05cm}P(c \mid e)$$

This is pure Bayesian reasoning; think of the given Chinese sentence $c$ as a crime scene. Knight gives a good analogy for this: $e$ is the person who did the crime, $P(e)$ is the description of the person, and $P(c \mid e)$ is how they did it. There are possibly many people who fit the description of $P(e)$ but those people may not have the means of committing the crime. Likewise, there are possibly many people who have the means $P(c \mid e)$ of committing the crime but they don't fit the personality. We're trying to solve for the person who is most likely to commit the crime and has the means to commit it.

Now if we think about translating language itself, accurate syntax and translations are necessary. We can't have one without the other. In our last equation, $P(e)$ is equivalent to correct syntax and $P(c\mid e)$ is equivalent to correct translations. This is why we must maximize the probability $P(e \mid c)$. Maximizing this probability is equivalent to finding sentences that have correct syntax and accurate translations.

### Language Model

The easiest way to account for syntax is to use a *n*-gram model.

Data directory:

In [2]:
#my personal computer
#path = 

#math lounge computer
path = "/Users/csmuser/Desktop/Cal Poly Summer Research 2017/data"

Trigram code:

In [3]:
def trigram(final):
    
    temp = {}
    
    for i in range(0, len(final) - 2):
        ot = final[i] + " " + final[i + 1]
        tt = final[i + 1] + " " + final[i + 2]
        if ot in temp:
            if tt in temp[ot]:
                temp[ot][tt] += 1
            else:
                temp[ot].update({tt : 1})
        else:
            temp.update({ot : {tt : 1}})
    
    for i in list(temp):
        num = 0
        for j in list(temp[i]):
            num += temp[i][j]
        for j in list(temp[i]):
            temp[i][j] /= num
    
    return temp

Function for merging a list of dictionaries:

In [4]:
def merge_dicts(dict_args):
    """
    Given a list of dicts, shallow copy and merge into a new dict,
    precedence goes to key value pairs in latter dicts.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

This is preparing code to check contractions and a list of punctuations to remove (except sentence stops).

Contraction edge cases:
    - he'll / hell (think of a solution!!!)
    - we'd / wed (think of a solution!!!)
    - who're / whore (not in casia2015_en)

In [17]:
#pulling contractions from two different sites
c = requests.get("http://www.softschools.com/language_arts/grammar/contractions/contractions_list/").content
soup = BeautifulSoup(c, "lxml")
cont = [x for x in str(soup.find_all("span", class_ = "myFont14")[0])
                .replace("<br/>", " ").split(" ") if "\'" in x]

c1 = requests.get("http://grammar.wikia.com/wiki/List_of_contractions").content
soup1 = BeautifulSoup(c1, "lxml")
cont1 = [str(x).replace("</td>", "").replace("\n", "").replace("<td>", "") 
         for x in soup1.find_all("td") if "\'" in str(x)]

conts = list(chain.from_iterable([[x.ljust(len(x) + 1).capitalize(), x.center(len(x) + 2)]
                                  for x in list(set(cont).union(cont1))]))
[conts.append(x) for x in ["That'll", " that'll "]]

nopunct = merge_dicts(list(chain.from_iterable([[{conts[x].replace("\'", "") : x, conts[x].replace("\'", " ") : x}] 
                                        for x in range(len(conts))])))
del nopunct[" whore "]
del nopunct[" who re "]
del nopunct["Who re "]

#punctuations to remove
exclude = set(string.punctuation.replace(".", "").replace("!", "").replace("?", ""))

This function cleans the `casia2015_en` dataset.

In [19]:
def process_c2015en(list_, conts, nopunct, exclude):
    
    #kill all punct except stops
    list_ = "".join(ch for ch in list_ if ch not in exclude).replace("  ", " ").split("\n")[:200]
    
    #replace contractions w/ correct form
    for sent in list_:
        for x in nopunct:
            if x in sent:
                print(sent)
                sent = sent.replace(x, conts[nopunct[x]])
                print(sent)

    #add _START_ and _STOP_ markers
    
 
    return list_

`CASIA 2015` dataset:

In [20]:
os.chdir(path + "/casia2015")
files = glob.glob("*.txt")

c2015en = process_c2015en(open(files[1], "r").read(), conts, nopunct, exclude)


After its over we watch the pairs of headlights glide in a neat line back up Main Street dispersing as drivers turn off toward home.
After it's over we watch the pairs of headlights glide in a neat line back up Main Street dispersing as drivers turn off toward home.
After the end of each performance with paper can be spread the water light on the surface were saved. Kids draw.
After the end of each performance with paper can be spread the water light on the surface we're saved. Kids draw.
Show me a cork that breathes and Ill show you a bottle of vinegar.
Show me a cork that breathes and I'll show you a bottle of vinegar.
The performer doesn t have any gunpowder for her cannon. So Clifford puffs and helps her out. 
The performer doesn't have any gunpowder for her cannon. So Clifford puffs and helps her out. 
The first column contains the task and the next two columns designate whether or not the task works based on its origin.
The first column contains the task and the next two columns 

In [21]:
c2015en

['The show stars the X Girls a troupe of talented topless dancers some of whom are classically trained.',
 'The centerpiece of the show is a farcical rendition of Swan Lake in which male and female performers dance in pink tutus and imitate swans.',
 'The removal of the barrier between performance and postproduction was just as helpful for the actors.',
 'assist somebody acting or reciting by suggesting the next words of something forgotten or imperfectly learned.',
 'Basically it was a fine performance I have only minor quibbles to make about her technique.',
 'After its over we watch the pairs of headlights glide in a neat line back up Main Street dispersing as drivers turn off toward home.',
 'After the performance they removed the back wall of the theatre and the cast summoned the audience onstage for an impromptu line dance.',
 'After the end of each performance with paper can be spread the water light on the surface were saved. Kids draw.',
 'After the performances a garden party

In [22]:
trigram(["Hello", "everyone", "how", "are", "y'all", "doing"])

{'Hello everyone': {'everyone how': 1.0},
 "are y'all": {"y'all doing": 1.0},
 'everyone how': {'how are': 1.0},
 'how are': {"are y'all": 1.0}}