# Model
#### Charlie Liou

In [3]:
#import numpy as np, pandas as pd
import string, re
import os, glob, requests
from bs4 import BeautifulSoup
from itertools import chain

Much of this is adapted from Kevin Knight's [A Statistical MT Tutorial Workbook](http://www.isi.edu/natural-language/mt/wkbk.rtf).

To write a statistical English to Chinese translator, we will consider English sentences $e$ and Chinese sentences $c$. Given $c$, we seek the English sentence that maximizes $P(e \mid c)$, that is, the most likely English sentence given the Chinese sentence. This may seem like solving the opposite problem; the reversal will be explained later. More formally, we are trying to find the English sentence $\vec{e}$ that satisfies

$$\vec{e} = \underset{e}{\arg\max}\, P(e \mid c)$$

How do we approach maximizing the probability $P(e \mid c)$? Bayes' Theorem gives

$$P(e \mid c) = \dfrac{P(e)\,P(c \mid e)}{P(c)}$$

Since $P(c)$ does not depend on $e$, it can be dropped from the maximization, and the above equation becomes

$$\vec{e} = \underset{e}{\arg\max}\, P(e \mid c) = \underset{e}{\arg\max}\, P(e)\,P(c \mid e)$$

This is pure Bayesian reasoning. Knight gives a good analogy: think of the given Chinese sentence $c$ as a crime scene, where $e$ is the person who committed the crime, $P(e)$ is the description of the person, and $P(c \mid e)$ is how they did it. Many people may fit the description $P(e)$ without having the means to commit the crime. Likewise, many people may have the means $P(c \mid e)$ without fitting the description. We are looking for the person who both fits the description and has the means.

Translation itself requires both correct syntax and accurate word choices; we can't have one without the other. In our last equation, $P(e)$ accounts for correct syntax and $P(c\mid e)$ accounts for correct translations. This is why we maximize $P(e \mid c)$: doing so is equivalent to finding sentences that have both correct syntax and accurate translations.
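
As a rough illustration of the maximization, the sketch below scores three candidate English sentences by $P(e)\,P(c \mid e)$ and keeps the best one. The candidates and every probability in it are invented for the example; in practice they would come from a language model and a translation model.

```python
# Toy illustration only: the candidates and probabilities are invented.
# Each entry maps an English candidate to (P(e), P(c | e)).
candidates = {
    "the show was wonderful": (0.020, 0.30),
    "show the wonderful was": (0.0001, 0.30),     # right words, bad word order
    "the weather was wonderful": (0.025, 0.001),  # fluent, but a poor translation
}

def noisy_channel_score(p_e, p_c_given_e):
    """Score proportional to P(e | c); P(c) is the same for every candidate."""
    return p_e * p_c_given_e

# arg max over the candidate sentences
best = max(candidates, key=lambda sent: noisy_channel_score(*candidates[sent]))
print(best)
```

The second candidate loses on $P(e)$ and the third loses on $P(c \mid e)$, so only the first survives the product.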

### Language Model

To account for syntax, that is, to estimate $P(e)$, we use an *n*-gram model.
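
With $n = 3$, a trigram model approximates $P(e)$ as a product of the probabilities of each word given the two words before it. The sketch below is one minimal way to estimate that from counts, assuming whitespace tokenization and no smoothing; the function names and the three-sentence corpus are hypothetical and are not part of this notebook's pipeline.

```python
from collections import Counter

def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts over padded sentences."""
    tri, ctx = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(len(words) - 2):
            ctx[(words[i], words[i + 1])] += 1
            tri[(words[i], words[i + 1], words[i + 2])] += 1
    return tri, ctx

def sentence_probability(sent, tri, ctx):
    """Unsmoothed maximum-likelihood estimate of P(e); unseen contexts give 0."""
    words = ["<s>", "<s>"] + sent.split() + ["</s>"]
    prob = 1.0
    for i in range(len(words) - 2):
        context = (words[i], words[i + 1])
        if ctx[context] == 0:
            return 0.0
        prob *= tri[(words[i], words[i + 1], words[i + 2])] / ctx[context]
    return prob

# Hypothetical toy corpus
corpus = ["the show was fine", "the show was long", "the actors were fine"]
tri, ctx = train_trigram_counts(corpus)
p = sentence_probability("the show was fine", tri, ctx)
print(p)
```

Without smoothing, any sentence containing an unseen trigram gets probability zero, so a real model would need some form of smoothing or backoff.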

Data directory:

In [4]:
#my personal computer
path = "C:\\Users\\chuck189\\Desktop\\Cal Poly Summer Research 2017\\data"

#math lounge computer
#path = "/Users/csmuser/Desktop/Cal Poly Summer Research 2017/data"

Trigram code:

In [5]:
def trigram(final, freqs):
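    '''
    Takes a list of sentences (strings of space-separated tokens) and a dictionary
    with phrase frequencies (freqs is currently unused).
    Returns a nested dictionary of counts: each bigram maps to the overlapping
    bigrams that follow it.
    '''
    temp = {}

    for sent in final:
        words = sent.split()
        for i in range(0, len(words) - 2):
            # ot holds words i and i+1; tt holds words i+1 and i+2
            ot = words[i] + " " + words[i + 1]
            tt = words[i + 1] + " " + words[i + 2]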
            if ot in temp:
                if tt in temp[ot]:
                    temp[ot][tt] += 1
                else:
                    temp[ot].update({tt : 1})
            else:
                temp.update({ot : {tt : 1}})
    '''
    for i in list(temp):
        num = 0
        for j in list(temp[i]):
            num += temp[i][j]
        for j in list(temp[i]):
            temp[i][j] /= num
    '''
    
    return temp

Function for merging a list of dictionaries:

In [6]:
def merge_dicts(dict_args):
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result
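
One property of this merge matters when keys collide: later dictionaries overwrite earlier ones. A small check with made-up values:

```python
def merge_dicts(dict_args):
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

# Made-up values; "cant" appears in both dictionaries, so the later value wins.
merged = merge_dicts([{"dont": 25, "cant": 97}, {"cant": 3}])
print(merged)
```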

The next block builds the list of contractions to check for and the set of punctuation characters to remove (everything except `.`, `!`, `?`, `-`, `$`, and `%`).

In [79]:
#pulling contractions from two different sites
c = requests.get("http://www.softschools.com/language_arts/grammar/contractions/contractions_list/").content
soup = BeautifulSoup(c, "lxml")
cont = [x for x in str(soup.find_all("span", class_ = "myFont14")[0])
                .replace("<br/>", " ").split(" ") if "\'" in x]

c1 = requests.get("http://grammar.wikia.com/wiki/List_of_contractions").content
soup1 = BeautifulSoup(c1, "lxml")
cont1 = [str(x).replace("</td>", "").replace("\n", "").replace("<td>", "") 
         for x in soup1.find_all("td") if "\'" in str(x)]

e = ["He'll ", " he'll ", "We'd ", " we'd ", "She'll ", " she'll ", "We're ", 
        " we're ", "It's ", " it's ", "So's ", " so's ", "Who're ", " who're ", "'s", "s'", " we'll ", "We'll "]
conts = list(chain.from_iterable([[x.ljust(len(x) + 1).capitalize(), x.center(len(x) + 2)]
                                  for x in list(set(cont).union(cont1))]))
conts.extend(["That'll", " that'll "])
for i in e:
    if i in conts:
        conts.remove(i)

nopunct = merge_dicts(list(chain.from_iterable([[{conts[x].replace("\'", "") : x, conts[x].replace("\'", " ") : x}] 
                                        for x in range(len(conts))])))

#punctuations to remove
sub = {".": "", "!": "", "?": "", "-": "", "$": "", "%": ""}
rep = dict((re.escape(k), v) for k, v in sub.items())
pattern = re.compile("|".join(rep.keys()))
exclude = set(pattern.sub(lambda m: rep[re.escape(m.group(0))], string.punctuation))
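
The `nopunct` dictionary maps each apostrophe-free spelling of a contraction to the index of the proper form in `conts`, so a contraction damaged by punctuation stripping can be put back. The sketch below shows the idea on a hypothetical two-entry list; `conts_demo`, `nopunct_demo`, and `restore_contractions` are illustrative names only.

```python
# Hypothetical two-entry contraction list; the notebook scrapes the real one.
conts_demo = [" don't ", " I'm "]

# Map each apostrophe-free variant back to the index of its contraction.
nopunct_demo = {}
for idx, contraction in enumerate(conts_demo):
    nopunct_demo[contraction.replace("'", "")] = idx   # e.g. " dont "
    nopunct_demo[contraction.replace("'", " ")] = idx  # e.g. " don t "

def restore_contractions(sentence):
    """Replace every stripped variant found in the sentence with its contraction."""
    for stripped, idx in nopunct_demo.items():
        if stripped in sentence:
            sentence = sentence.replace(stripped, conts_demo[idx])
    return sentence

restored = restore_contractions("They dont know what I m doing")
print(restored)
```

Because the keys carry surrounding spaces, a variant at the very start or end of a sentence is not matched by this lookup.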

In [153]:
nopunct

{' I d ': 3,
 ' I ll ': 17,
 ' I m ': 45,
 ' I ve ': 71,
 ' Id ': 3,
 ' Ill ': 17,
 ' Im ': 45,
 ' Ive ': 71,
 ' aren t ': 91,
 ' arent ': 91,
 ' can t ': 97,
 ' cant ': 97,
 ' couldn t ': 75,
 ' couldnt ': 75,
 ' didn t ': 57,
 ' didnt ': 57,
 ' doesn t ': 77,
 ' doesnt ': 77,
 ' don t ': 25,
 ' dont ': 25,
 ' hadn t ': 1,
 ' hadnt ': 1,
 ' hasn t ': 23,
 ' hasnt ': 23,
 ' haven t ': 61,
 ' havent ': 61,
 ' he d ': 93,
 ' he s ': 11,
 ' hed ': 93,
 ' hes ': 11,
 ' how d ': 13,
 ' how s ': 55,
 ' howd ': 13,
 ' hows ': 55,
 ' isn t ': 101,
 ' isnt ': 101,
 ' let s ': 5,
 ' lets ': 5,
 ' might ve ': 79,
 ' mightn t ': 65,
 ' mightnt ': 65,
 ' mightve ': 79,
 ' mustn t ': 27,
 ' mustnt ': 27,
 ' needn t ': 89,
 ' neednt ': 89,
 ' shan t ': 7,
 ' shant ': 7,
 ' she d ': 113,
 ' she s ': 49,
 ' shed ': 113,
 ' shes ': 49,
 ' should ve ': 95,
 ' shouldn t ': 99,
 ' shouldnt ': 99,
 ' shouldve ': 95,
 ' that ll ': 115,
 ' that s ': 19,
 ' thatll ': 115,
 ' thats ': 19,
 ' there s ': 81,
 ' t

This function prepares the `casia2015_en` dataset for *n*-gram processing.

In [154]:
def process_c2015en(l, conts, nopunct, exclude, e):

    money = "<$>"
    percent = "<%>"
    
    for i in range(len(l)):
        
        #####remove punctuations except - ! . ?
        #####accounts for many apostrophe edge cases including possession
        
        if any(x in l[i] for x in e):
            for item in exclude:
                if item == "'":
                    indexes = [m.start() for m in re.finditer("\'", l[i])]
                    num = 0
                    for index in indexes:
                        if (e[14] in l[i]) or (e[15] in l[i]):
                            continue
                        t = l[i][index - 2 - num: index - num]
                        if (t != "he") and (t != "He") and (t != "We") and (t != "we") and (t != "It") and \
                        (t != "it") and (t != "So") and (t != "so") and (t != "ho") and \
                        (index - num != 0) and (index - num != len(l[i])):
                            l[i] = l[i][:index - num] + l[i][index + 1 - num:]
                            num += 1
                else:
                    l[i] = l[i].replace(item, "")
        else:
            l[i] = "".join(ch for ch in l[i] if ch not in exclude)
           
        #####replace contractions w/ correct form, add sentence markers
        
        j = l[i].strip()
        #l[i] = start + " " + j[:len(j)-1] + " " + stop
        for x in nopunct:
            if x in l[i]:
                l[i] = l[i].replace(x, conts[nopunct[x]])
        
        #####deal with - and --
        
        if " - " in l[i]:
            indexes = [m.start() for m in re.finditer(" - ", l[i])]
            num = 0
            for index in indexes:
                l[i] = l[i][:index - num] + l[i][index + 2 - num:]
                num += 2
                    
        elif "--" in l[i]:
            indexes = [m.start() for m in re.finditer("--", l[i])]
            num = 0
            for index in indexes:
                l[i] = l[i][:index - num] + l[i][index + 2 - num:]
                num += 2
                
        #####replace $__ with <$>, 
        
        if "$" in l[i]:
            indexes = [m.start() for m in re.finditer(r"\$", l[i])]
            nums = [m.end() for m in re.finditer(r"\$[-+]?([0-9]*?\.?[\s][0-9]+|[0-9]+)", l[i])]
            if len(indexes) == len(nums):
                num = 0
                for j in range(len(indexes)):
                    l[i] = l[i][:indexes[j] - num] + money + l[i][nums[j] - num:]
                    num += nums[j] - indexes[j] - 3
            else:
                l[i] = "" #bad sentence
            
        #####replace __% with <%>
        #####accounts for cases such as "29. 2%"
        
        if "%" in l[i]:
            indexes = [m.start() for m in re.finditer(r"([0-9]*?\.[\s]?[0-9]+|[0-9]+)%", l[i])]
            percents = [m.end() for m in re.finditer("%", l[i])]
            if len(indexes) == len(percents):
                num = 0
                for j in range(len(indexes)):
                    l[i] = l[i][:indexes[j] - num] + percent + l[i][percents[j] - num:]
                    num += percents[j] - indexes[j] - 3
            else:
                l[i] = ""

                
        l[i] = l[i].replace(". .", "").replace("..", "").replace("  ", " ")
 
    return l
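
The `$` and `%` steps in `process_c2015en` replace amounts with the placeholders `<$>` and `<%>` by tracking match offsets by hand. As a point of comparison, the same idea can be sketched with `re.sub`. This is a simplified alternative, not the function above: the patterns are assumptions, and it does not blank out sentences whose symbols and numbers fail to pair up.

```python
import re

MONEY = "<$>"
PERCENT = "<%>"

# Simplified patterns (assumptions): digits with an optional decimal part,
# tolerating a stray space after the decimal point as in "29. 2%".
money_re = re.compile(r"\$[-+]?\d+(?:\.\s?\d+)?")
percent_re = re.compile(r"\d+(?:\.\s?\d+)?%")

def mask_amounts(sentence):
    """Replace dollar amounts and percentages with placeholder tokens."""
    sentence = money_re.sub(MONEY, sentence)
    return percent_re.sub(PERCENT, sentence)

masked = mask_amounts("Sales rose 29. 2% to $1500 last year")
print(masked)
```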

In [167]:
re.split(r"!|\?", "hello pupper?! how are you doing! i'm doing great")

['hello pupper', '', ' how are you doing', " i'm doing great"]

In [174]:
def sentence_split_c2015en(l):
    
    start = "<s>"
    stop = "</s>"
    titles = r"(?<![A-Z])(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."
    k = []
    
    for i in range(200):
        temp = [x.strip() for x in re.split(titles, l[i]) if x.strip() != ""] 
        print(temp)
        if len(temp) == 1: #just 1 sentence, no .
            ex = [x.strip() for x in re.split(r"!|\?", temp[0]) if x.strip() != ""]
            if len(ex) == 1: #just 1 sentence, no . ? !
                j = ex[0].strip()
                k.append(start + " " + start + " " + j[:len(j)] + " " + stop + " " + stop)
                print(j, i)
            else:
                print(ex, i)
        else:
            for sent in temp: #>= 2 sentences, no .
                ex = [x.strip() for x in re.split(r"!|\?", sent) if x.strip() != ""]
                
    
    return k
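
The `titles` pattern splits on a period only when it is not preceded by a capital letter, or by a capital letter followed by one or two lowercase letters, which keeps abbreviations such as "Dr." and "Mrs." intact. A small check on made-up sentences:

```python
import re

titles = r"(?<![A-Z])(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."

# Made-up example text
text = "Dr. Smith met Mrs. Jones at noon. They talked for hours."
parts = [x.strip() for x in re.split(titles, text) if x.strip() != ""]
print(parts)

# Limitation: a sentence that ends in a short capitalized word is not split.
missed = [x.strip() for x in re.split(titles, "I saw Tom. He left.") if x.strip() != ""]
print(missed)
```

The second call shows the cost of this heuristic: "Tom." looks like a title to the lookbehinds, so the two sentences stay joined.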

In [175]:
c2015en = sentence_split_c2015en(c2015entemp)

['The show stars the X Girls a troupe of talented topless dancers some of whom are classically trained']
The show stars the X Girls a troupe of talented topless dancers some of whom are classically trained 0
['The centerpiece of the show is a farcical rendition of Swan Lake in which male and female performers dance in pink tutus and imitate swans']
The centerpiece of the show is a farcical rendition of Swan Lake in which male and female performers dance in pink tutus and imitate swans 1
['The removal of the barrier between performance and post-production was just as helpful for the actors']
The removal of the barrier between performance and post-production was just as helpful for the actors 2
['assist somebody acting or reciting by suggesting the next words of something forgotten or imperfectly learned']
assist somebody acting or reciting by suggesting the next words of something forgotten or imperfectly learned 3
['Basically it was a fine performance I have only minor quibbles to make

`CASIA 2015` dataset:

In [144]:
os.chdir(path + "/casia2015")
files = glob.glob("*.txt")

c2015entemp = process_c2015en(open(files[1], "r", errors = "ignore").read().split("\n"), conts, nopunct, exclude, e)

In [290]:
lol = open(files[0], "r", encoding = "utf-8").read().split("\n")
#for i in range(len(lol)):
    #lol[i] = lol[i].encode("unicode-escape").decode("utf8")
    #for j in i:
        #print(j.encode("unicode-escape"))
        #j = j.encode("unicode-escape")

In [292]:
for i in range(len(lol)):
    print(lol[i], i)

表演的明星是X女孩团队——由一对具有天才技艺的艳舞女孩们组成，其中有些人受过专业的训练。 0
表演的压轴戏是闹剧版《天鹅湖》，男女小人们身着粉红色的芭蕾舞裙扮演小天鹅。 1
表演和后期制作之间的屏障被清除了，这对演员来说一样大有裨益。 2
（表演或背诵时）通过暗示下面忘记或记地不准的东西来帮助某人。 3
表演基本上很精彩--我只对她的技巧稍有意见。 4
表演结束后，我们看到一对对车灯沿主路一路排回镇上，然后散开来各回各家。 5
表演结束后，移走了背景墙，随后全体演员即兴邀请观众上台齐跳并排舞。 6
表演结束后用宣纸轻铺水面，可将水面上的画进行拓印保存。 7
表演结束后，众人期待已久的园游会终于正式开锣，美味可口的素食佳肴让大家一饱口福。 8
表演节目丰富精采，交换礼物的欢乐时刻一到，则形成另一波高潮。 9
表演仅仅是造就一个近乎觉察不出来的「直线的地质的移位」。 10
表演开始十五分钟后，一帮足球运动员开始集体向着舞台上的女演员发出嘘声。 11
表演开始时，舞台上会有一张床，一面镜子，一张椅子以及一位穿着内衣的美女。在你觉察到前，一位性感的女孩会突然变成三位。 12
表演开始时，艺人坐在地毯上轻击盅子，徐缓起舞； 13
表演“猫女”——在一个笼子里穿着带有一根长尾巴和猫耳的豹纹女内衣。 14
表演前，她紧张得浑身颤抖不已。 15
表演是他们的嫡传技艺，150多年来他们家族一直都是演员。 16
表演算得上是一门残忍的职业，你偏离正统美越远，就越艰难。 17
表演我软木塞哪一呼吸和我将表演你一瓶醋。 18
表演：悉尼歌剧院首席男高音丁毅先生悉尼歌剧院首席女高音 19
表演秀以及开园时间有可能因故不经预告而取消或中止，敬请留意。 20
表演一个不限分析国内受损的期望往往带来了个人工作的公司。 21
“表演”一结束，他马上给那名黑人“歹徒”发工钱，然后两个人还握手拥抱。 22
表演艺术家MarniKotak（持玩偶者）在布鲁克林的“显微镜”画廊生下一婴作为她艺术作品的一部分。 23
表演艺术来源于生活而高于生活，生活是艺术创作的来源，取之不尽用之不竭的源泉。 24
表演在安慰疗法中也很重要。安慰性注射虽然比安慰药丸有效果，但是却没有虚假手术来得见效。 25
“表演在本质上给了他们一个正当的理由去学习如何表达自己。”阿伦森说。 26
表演者把信封靠近他的脑袋，然后先给出答案，

KeyboardInterrupt: 

In [288]:
lol[0].encode("unicode-escape").decode("gb18030")#.decode("gb2312")

'\\xe8\\xa1\\xa8\\xe6\\xbc\\u201d\\xe7\\u0161\\u201e\\xe6\\u02dc\\u017d\\xe6\\u02dc\\u0178\\xe6\\u02dc\\xafX\\xe5\\xa5\\xb3\\xe5\\xad\\xa9\\xe5\\u203a\\xa2\\xe9\\u02dc\\u0178\\xe2\\u20ac\\u201d\\xe2\\u20ac\\u201d\\xe7\\u201d\\xb1\\xe4\\xb8\\u20ac\\xe5\\xaf\\xb9\\xe5\\u2026\\xb7\\xe6\\u0153\\u2030\\xe5\\xa4\\xa9\\xe6\\u2030\\xe6\\u0160\\u20ac\\xe8\\u2030\\xba\\xe7\\u0161\\u201e\\xe8\\u2030\\xb3\\xe8\\u02c6\\u017e\\xe5\\xa5\\xb3\\xe5\\xad\\xa9\\xe4\\xbb\\xac\\xe7\\xbb\\u201e\\xe6\\u02c6\\xef\\xbc\\u0152\\xe5\\u2026\\xb6\\xe4\\xb8\\xad\\xe6\\u0153\\u2030\\xe4\\xba\\u203a\\xe4\\xba\\xba\\xe5\\u2014\\xe8\\xbf\\u2021\\xe4\\xb8\\u201c\\xe4\\xb8\\u0161\\xe7\\u0161\\u201e\\xe8\\xae\\xad\\xe7\\xbb\\u0192\\xe3\\u20ac\\u201a'

In [287]:
print(lol[0])

è¡¨æ¼”çš„æ˜Žæ˜Ÿæ˜¯Xå¥³å­©å›¢é˜Ÿâ€”â€”ç”±ä¸€å¯¹å…·æœ‰å¤©æ‰æŠ€è‰ºçš„è‰³èˆžå¥³å­©ä»¬ç»„æˆï¼Œå…¶ä¸­æœ‰äº›äººå—è¿‡ä¸“ä¸šçš„è®­ç»ƒã€‚
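
The garbled output above looks like UTF-8 bytes decoded with a Western single-byte codec, which is why opening the file with `encoding = "utf-8"` gives readable Chinese. The sketch below reproduces the effect on the first two characters of the sentence; the choice of `cp1252` as the wrong codec is an assumption based on the Windows path used in this notebook.

```python
original = "表演"

# Simulate reading UTF-8 bytes with the wrong codec (cp1252 is an assumption).
garbled = original.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the two steps recovers the text, as long as no bytes were dropped.
recovered = garbled.encode("cp1252").decode("utf-8")
print(recovered)
```

The round trip only works while every byte survives; reading with `errors = "ignore"` can silently drop bytes that the wrong codec does not define.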


In [217]:
re.split(r"(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\.", c2015en[0])

['<s> <s> The show stars the X Girls a troupe of talented topless dancers some of whom are classically trained </s> </s>']

In [102]:
re.split(r"(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\.", c2015en[25])

['<s> Drama is important too',
 ' Placebo injections are more effective than placebo pills and neither is as potent as sham surgery </s>']

In [91]:
trigram(c2015en[0].split(" "))

{'<s> The': {'The show': 1.0},
 'Girls a': {'a troupe': 1.0},
 'The show': {'show stars': 1.0},
 'X Girls': {'Girls a': 1.0},
 'a troupe': {'troupe of': 1.0},
 'are classically': {'classically trained': 1.0},
 'classically trained': {'trained </s>': 1.0},
 'dancers some': {'some of': 1.0},
 'of talented': {'talented topless': 1.0},
 'of whom': {'whom are': 1.0},
 'show stars': {'stars the': 1.0},
 'some of': {'of whom': 1.0},
 'stars the': {'the X': 1.0},
 'talented topless': {'topless dancers': 1.0},
 'the X': {'X Girls': 1.0},
 'topless dancers': {'dancers some': 1.0},
 'troupe of': {'of talented': 1.0},
 'whom are': {'are classically': 1.0}}