# Model
#### Charlie Liou

In [1]:
import numpy as np, pandas as pd
import string, re
import os, glob, requests
from bs4 import BeautifulSoup
from itertools import chain

Much of this is adapted from Kevin Knight's [A Statistical MT Tutorial Workbook](http://www.isi.edu/natural-language/mt/wkbk.rtf).

To write a statistical English to Chinese translator, we will consider English sentences $e$ and Chinese sentences $c$. This may seem like solving the opposite problem, but given $c$, we seek the English sentence that maximizes $P(e \mid c)$ (that is, the most likely English sentence given the Chinese sentence). This reversal will be explained later. More formally, we are trying to find the English sentence $\vec{e}$ that satisfies

$$\vec{e} = \underset{e}{\operatorname{arg\,max}}\; P(e \mid c)$$

How do we approach maximizing the probability $P(e \mid c)$? We start from Bayes' Theorem:

$$P( e \mid  c) = \dfrac{P(e)\hspace{0.05cm}P(c\mid e)}{P(c)}$$

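Since $P(c)$ is the same for every candidate $e$, it does not affect which $e$ attains the maximum and can be dropped. The above equation becomes

$$\vec{e} = \underset{e}{\operatorname{arg\,max}}\; P(e \mid c) = \underset{e}{\operatorname{arg\,max}}\; P(e)\,P(c \mid e)$$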

This is pure Bayesian reasoning. Knight gives a good analogy for this: think of the given Chinese sentence $c$ as a crime scene. Then $e$ is the person who committed the crime, $P(e)$ is how well the person fits the description, and $P(c \mid e)$ is how they did it. Many people may fit the description $P(e)$ but lack the means of committing the crime. Likewise, many people may have the means $P(c \mid e)$ of committing the crime but not fit the personality. We're trying to find the person who both is most likely to commit the crime and has the means to commit it.

Now think about translation itself: a good output needs both well-formed syntax and an accurate rendering of the source, and we can't have one without the other. In our last equation, $P(e)$ accounts for correct syntax and $P(c \mid e)$ accounts for correct translation. Maximizing their product, and hence $P(e \mid c)$, is therefore equivalent to finding sentences that have both correct syntax and accurate translations.
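As a toy illustration of the argmax, the sketch below scores three candidate English sentences for one Chinese sentence. The probabilities are made up purely for illustration, and `candidates` and `best_translation` are hypothetical names that do not appear elsewhere in this notebook.

```python
# Made-up scores: p_e stands in for the language model P(e),
# p_c_given_e stands in for the translation model P(c | e).
candidates = {
    "the show was wonderful": {"p_e": 0.02, "p_c_given_e": 0.3},     # fluent and faithful
    "show the wonderful was": {"p_e": 0.0001, "p_c_given_e": 0.3},   # faithful but not fluent
    "the show was terrible": {"p_e": 0.02, "p_c_given_e": 0.001},    # fluent but not faithful
}

def best_translation(cands):
    """Return the candidate e that maximizes P(e) * P(c | e)."""
    return max(cands, key=lambda e: cands[e]["p_e"] * cands[e]["p_c_given_e"])

best = best_translation(candidates)
print(best)
```

Only the first candidate scores well on both factors, so it wins even though each of the other two ties it on one factor.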

### Language Model

The easiest way to account for syntax is to use an *n*-gram model.
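For reference, a trigram model (the $n = 3$ case) approximates the probability of a sentence $e = w_1 w_2 \dots w_n$ by conditioning each word on the two words before it. This is the standard textbook formulation, stated here as background rather than as something specific to this notebook:

$$P(e) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1})$$

Here $w_{-1}$ and $w_0$ are assumed to be start-of-sentence padding.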

Data directory:

In [2]:
#my personal computer
#path = 

#math lounge computer
path = "/Users/csmuser/Desktop/Cal Poly Summer Research 2017/data"

Trigram code:

In [3]:
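def trigram(final):
    """
    Given a list of tokens, map each bigram "w1 w2" to the overlapping
    bigrams "w2 w3" that follow it. Values are relative frequencies,
    i.e. estimates of P(w3 | w1 w2).
    """
    temp = {}

    #count how often each bigram is followed by the next overlapping bigram
    for i in range(len(final) - 2):
        ot = final[i] + " " + final[i + 1]
        tt = final[i + 1] + " " + final[i + 2]
        followers = temp.setdefault(ot, {})
        followers[tt] = followers.get(tt, 0) + 1

    #normalize the counts for each bigram so they sum to 1
    for followers in temp.values():
        total = sum(followers.values())
        for tt in followers:
            followers[tt] /= total

    return temp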

Function for merging a list of dictionaries:

In [4]:
def merge_dicts(dict_args):
    """
    Given a list of dicts, shallow copy and merge into a new dict,
    precedence goes to key value pairs in latter dicts.
    """
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result
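A quick check of the precedence rule in the docstring. The function is repeated so the snippet runs on its own, and the input dictionaries are arbitrary examples.

```python
def merge_dicts(dict_args):
    # Same logic as the cell above, repeated so this snippet is standalone.
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

# "b" appears in both inputs; the value from the later dict wins.
merged = merge_dicts([{"a": 1, "b": 2}, {"b": 3, "c": 4}])
print(merged)  # {'a': 1, 'b': 3, 'c': 4}
```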

The next cell builds a lookup for restoring contractions and a set of punctuation marks to remove (everything except the sentence stops `.`, `!`, and `?`).

Contraction edge cases:

- he'll / hell (still needs a solution)
- we'd / wed (still needs a solution)
- who're / whore (not in casia2015_en)
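One possible way around the he'll / hell and we'd / wed collisions is to never strip an apostrophe that sits between two letters, so the contraction does not have to be restored afterwards. This is only a sketch of an alternative, not what the cells in this notebook do; `STRIP` and `strip_punct_keep_contractions` are hypothetical names.

```python
import re
import string

# Punctuation to drop outright: everything except sentence stops (. ! ?)
# and the apostrophe, which is handled separately below.
STRIP = set(string.punctuation) - set(".!?'")

def strip_punct_keep_contractions(text):
    # Remove ordinary punctuation characters.
    text = "".join(ch for ch in text if ch not in STRIP)
    # Remove apostrophes that are not between two letters (e.g. quote marks),
    # leaving word-internal apostrophes such as the one in "he'll" intact.
    return re.sub(r"(?<![A-Za-z])'|'(?![A-Za-z])", "", text)

cleaned = strip_punct_keep_contractions("He'll say 'hell', we'd wed.")
print(cleaned)  # He'll say hell we'd wed.
```

A known limitation of this sketch is that trailing possessive apostrophes (as in "dogs'") are removed as well.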

In [17]:
#pulling contractions from two different sites
c = requests.get("http://www.softschools.com/language_arts/grammar/contractions/contractions_list/").content
soup = BeautifulSoup(c, "lxml")
cont = [x for x in str(soup.find_all("span", class_ = "myFont14")[0])
                .replace("<br/>", " ").split(" ") if "\'" in x]

c1 = requests.get("http://grammar.wikia.com/wiki/List_of_contractions").content
soup1 = BeautifulSoup(c1, "lxml")
cont1 = [str(x).replace("</td>", "").replace("\n", "").replace("<td>", "") 
         for x in soup1.find_all("td") if "\'" in str(x)]

conts = list(chain.from_iterable([[x.ljust(len(x) + 1).capitalize(), x.center(len(x) + 2)]
                                  for x in list(set(cont).union(cont1))]))
conts.extend(["That'll", " that'll "])

nopunct = merge_dicts(list(chain.from_iterable([[{conts[x].replace("\'", "") : x, conts[x].replace("\'", " ") : x}] 
                                        for x in range(len(conts))])))
del nopunct[" whore "]
del nopunct[" who re "]
del nopunct["Who re "]

#punctuations to remove
exclude = set(string.punctuation.replace(".", "").replace("!", "").replace("?", ""))
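To make the shape of `conts` and `nopunct` easier to see, here is the same construction on a two-item list, with no web requests. The `contractions` list is a hypothetical stand-in for the scraped data, and the dictionary is built with a plain loop instead of `merge_dicts`.

```python
from itertools import chain

contractions = ["don't", "isn't"]  # hypothetical stand-in for the scraped list

# Two padded variants per contraction:
# sentence-initial ("Don't ") and mid-sentence (" don't ").
conts = list(chain.from_iterable(
    [c.ljust(len(c) + 1).capitalize(), c.center(len(c) + 2)] for c in contractions
))

# Map each apostrophe-free spelling ("Dont ", "Don t ") to the index of
# the variant in conts that it should be restored to.
nopunct = {}
for i, variant in enumerate(conts):
    nopunct[variant.replace("'", "")] = i
    nopunct[variant.replace("'", " ")] = i

print(conts)
print(nopunct)
```

The surrounding spaces are what keep a key like `" dont "` from matching inside a longer word.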

This function cleans the `casia2015_en` dataset.

In [35]:
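def process_c2015en(list_, conts, nopunct, exclude):
    """
    Clean the raw English text: strip punctuation (except sentence stops),
    wrap each line in _START_/_STOP_ markers, and restore contractions.
    Returns a list of cleaned lines (currently only the first 200).
    """
    start = "_START_"
    stop = "_STOP_"

    #kill all punct except stops, then split the text into lines
    list_ = "".join(ch for ch in list_ if ch not in exclude).replace("  ", " ").split("\n")[:200]

    #add _START_ and _STOP_ markers at beginning and end, then replace contractions w/ correct form
    for i in range(len(list_)):
        list_[i] = start + " " + list_[i].strip() + " " + stop
        for x in nopunct:
            if x in list_[i]:
                list_[i] = list_[i].replace(x, conts[nopunct[x]])

    return list_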

`CASIA 2015` dataset:

In [41]:
os.chdir(path + "/casia2015")
files = glob.glob("*.txt")

#c2015en = process_c2015en(open(files[1], "r", errors = "ignore").read(), conts, nopunct, exclude)

#read both files and close them afterwards
with open(files[1], "r", errors = "ignore") as f:
    c2015en = f.read()
with open(files[0], "r", errors = "ignore") as f:
    c2015cn = f.read()
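The two files appear to be aligned line by line, so a translation model would need them as sentence pairs. A minimal sketch of that pairing, assuming line *i* of the English text corresponds to line *i* of the Chinese text; `pair_lines` is a hypothetical helper, and the sample strings are two sentence pairs copied from the dataset.

```python
def pair_lines(en_text, cn_text):
    """Zip two newline-separated texts into (English, Chinese) tuples."""
    en_lines = en_text.split("\n")
    cn_lines = cn_text.split("\n")
    # zip stops at the shorter list, so unmatched trailing lines are dropped
    return list(zip(en_lines, cn_lines))

# Small samples; the real inputs would be c2015en and c2015cn.
en_sample = (
    "The removal of the barrier between performance and post-production was just as helpful for the actors.\n"
    "Basically it was a fine performance I have only minor quibbles to make about her technique."
)
cn_sample = (
    "表演和后期制作之间的屏障被清除了，这对演员来说一样大有裨益。\n"
    "表演基本上很精彩--我只对她的技巧稍有意见。"
)

pairs = pair_lines(en_sample, cn_sample)
for en, cn in pairs:
    print(en, "|", cn)
```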


In [49]:
c2015en[:30000].split("\n")

['The show stars the X Girls - a troupe of talented topless dancers, some of whom are classically trained.',
 'The centerpiece of the show is a farcical rendition of Swan Lake in which male and female performers dance in pink tutus and imitate swans.',
 'The removal of the barrier between performance and post-production was just as helpful for the actors.',
 'assist (somebody acting or reciting) by suggesting the next words of something forgotten or imperfectly learned.',
 'Basically it was a fine performance I have only minor quibbles to make about her technique.',
 "After it's over, we watch the pairs of headlights glide in a neat line back up Main Street, dispersing as drivers turn off toward home.",
 'After the performance they removed the back wall of the theatre and the cast summoned the audience onstage for an impromptu line dance.',
 'After the end of each performance with paper can be spread the water, light on the surface were saved. Kids draw.',
 'After the performances, a g

In [47]:
c2015cn[:5000].split("\n")

['表演的明星是X女孩团队——由一对具有天才技艺的艳舞女孩们组成，其中有些人受过专业的训练。',
 '表演的压轴戏是闹剧版《天鹅湖》，男女小人们身着粉红色的芭蕾舞裙扮演小天鹅。',
 '表演和后期制作之间的屏障被清除了，这对演员来说一样大有裨益。',
 '（表演或背诵时）通过暗示下面忘记或记地不准的东西来帮助某人。',
 '表演基本上很精彩--我只对她的技巧稍有意见。',
 '表演结束后，我们看到一对对车灯沿主路一路排回镇上，然后散开来各回各家。',
 '表演结束后，移走了背景墙，随后全体演员即兴邀请观众上台齐跳并排舞。',
 '表演结束后用宣纸轻铺水面，可将水面上的画进行拓印保存。',
 '表演结束后，众人期待已久的园游会终于正式开锣，美味可口的素食佳肴让大家一饱口福。',
 '表演节目丰富精采，交换礼物的欢乐时刻一到，则形成另一波高潮。',
 '表演仅仅是造就一个近乎觉察不出来的「直线的地质的移位」。',
 '表演开始十五分钟后，一帮足球运动员开始集体向着舞台上的女演员发出嘘声。',
 '表演开始时，舞台上会有一张床，一面镜子，一张椅子以及一位穿着内衣的美女。在你觉察到前，一位性感的女孩会突然变成三位。',
 '表演开始时，艺人坐在地毯上轻击盅子，徐缓起舞；',
 '表演“猫女”——在一个笼子里穿着带有一根长尾巴和猫耳的豹纹女内衣。',
 '表演前，她紧张得浑身颤抖不已。',
 '表演是他们的嫡传技艺，150多年来他们家族一直都是演员。',
 '表演算得上是一门残忍的职业，你偏离正统美越远，就越艰难。',
 '表演我软木塞哪一呼吸和我将表演你一瓶醋。',
 '表演：悉尼歌剧院首席男高音丁毅先生悉尼歌剧院首席女高音',
 '表演秀以及开园时间有可能因故不经预告而取消或中止，敬请留意。',
 '表演一个不限分析国内受损的期望往往带来了个人工作的公司。',
 '“表演”一结束，他马上给那名黑人“歹徒”发工钱，然后两个人还握手拥抱。',
 '表演艺术家MarniKotak（持玩偶者）在布鲁克林的“显微镜”画廊生下一婴作为她艺术作品的一部分。',
 '表演艺术来源于生活而高于生活，生活是艺术创作的来源，取之不尽用之不竭的源泉。',
 '表演在安慰疗法中也很重要。安慰性注射虽然比安慰药丸有效果，但是却没有虚假手术来得见效。',
 '“表演在本质上给了他们一个正当的理由去学习

In [22]:
trigram(["Hello", "everyone", "how", "are", "y'all", "doing"])

{'Hello everyone': {'everyone how': 1.0},
 "are y'all": {"y'all doing": 1.0},
 'everyone how': {'how are': 1.0},
 'how are': {"are y'all": 1.0}}
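A table in this shape can be used to score a token sequence by multiplying the transition probabilities between consecutive overlapping bigrams. The sketch below uses a hand-written table with made-up probabilities so that it runs on its own. `sentence_probability` is a hypothetical helper, and there is no smoothing, so any unseen transition gives a probability of zero.

```python
def sentence_probability(tokens, table):
    """Multiply bigram-to-bigram transition probabilities along the sentence."""
    prob = 1.0
    for i in range(len(tokens) - 2):
        prev = tokens[i] + " " + tokens[i + 1]
        nxt = tokens[i + 1] + " " + tokens[i + 2]
        # Unseen bigrams or transitions contribute probability 0 (no smoothing).
        prob *= table.get(prev, {}).get(nxt, 0.0)
    return prob

# Hand-written table in the same shape as the trigram() output (made-up values).
table = {
    "_START_ the": {"the show": 0.5, "the cast": 0.5},
    "the show": {"show ended": 1.0},
    "show ended": {"ended _STOP_": 1.0},
}

seen = sentence_probability(["_START_", "the", "show", "ended", "_STOP_"], table)
unseen = sentence_probability(["_START_", "the", "cast", "ended", "_STOP_"], table)
print(seen, unseen)  # 0.5 0.0
```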