# Model
#### Charlie Liou

In [1]:
import numpy as np, pandas as pd
import re, string
import os, glob, requests
from bs4 import BeautifulSoup
from itertools import chain

Much of this is adapted from Kevin Knight's [A Statistical MT Tutorial Workbook](http://www.isi.edu/natural-language/mt/wkbk.rtf).

To write a statistical English to Chinese translator, we will consider English sentences $e$ and Chinese sentences $c$. This may seem like solving the opposite problem, but given $c$, we seek the English sentence that maximizes $P(e \mid c)$ (that is, the most likely English sentence given the Chinese sentence). This reversal will be explained later. More formally, we are trying to find the English sentence $\vec{e}$ that satisfies

$$\vec{e} = \underset{e}{\operatorname{arg\,max}}\; P(e \mid c)$$

How do we approach maximizing the probability $P(e \mid  c)?$ Given that we know Bayes' Theorem:

$$P( e \mid  c) = \dfrac{P(e)\hspace{0.05cm}P(c\mid e)}{P(c)}$$

and $P(c)$ does not depend on $e$ (so dropping it does not change which $e$ attains the maximum), the above equation becomes

$$\vec{e} = \underset{e}{\operatorname{arg\,max}}\; P(e \mid c) = \underset{e}{\operatorname{arg\,max}}\; P(e)\,P(c \mid e)$$

This is pure Bayesian reasoning; think of the given Chinese sentence $c$ as a crime scene. Knight gives a good analogy for this: $e$ is the person who did the crime, $P(e)$ is the description of the person, and $P(c \mid e)$ is how they did it. There are possibly many people who fit the description of $P(e)$ but those people may not have the means of committing the crime. Likewise, there are possibly many people who have the means $P(c \mid e)$ of committing the crime but they don't fit the personality. We're trying to solve for the person who is most likely to commit the crime and has the means to commit it.

Now if we think about translating language itself, accurate syntax and translations are necessary. We can't have one without the other. In our last equation, $P(e)$ is equivalent to correct syntax and $P(c\mid e)$ is equivalent to correct translations. This is why we must maximize the probability $P(e \mid c)$. Maximizing this probability is equivalent to finding sentences that have correct syntax and accurate translations.
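To make the maximization concrete, here is a minimal sketch of choosing among a few candidate English sentences. The candidate sentences and every probability below are invented for illustration only; they are not estimated from any of the corpora used in this notebook, and `best_translation` is a hypothetical helper name.

```python
# Toy numbers, invented purely for illustration (not estimated from any corpus).
# lm[e] stands in for P(e):     how plausible the English candidate is on its own.
# tm[e] stands in for P(c | e): how well the candidate accounts for the Chinese input.
lm = {
    "the show is over": 0.020,
    "show the over is": 0.0001,
    "the meal is over": 0.015,
}
tm = {
    "the show is over": 0.30,
    "show the over is": 0.30,
    "the meal is over": 0.01,
}

def best_translation(candidates, lm, tm):
    # argmax over e of P(e) * P(c | e); P(c) is the same for every e, so it is ignored
    return max(candidates, key=lambda e: lm[e] * tm[e])

best = best_translation(list(lm), lm, tm)
print(best)
```

The scrambled candidate has the same $P(c \mid e)$ as the fluent one but a tiny $P(e)$, and the third candidate is fluent but explains the input poorly, so only the first scores well on both factors.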

## Language Model

To account for syntax, meaning finding values of $P(e)$, we use an *n*-gram model. I will build a trigram model using data from sources including `casia2015_en`, TED talks, and NLTK's Brown corpus.
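As a sketch of what a trigram model assumes (this is the standard Markov approximation; the symbols $w_i$ and $n$ are introduced here and are not in the original text): take a sentence $e = w_1 \dots w_n$ padded with two start tokens, $w_{-1} = w_0 =$ `<s>`, and two stop tokens, $w_{n+1} = w_{n+2} =$ `</s>`, which is the padding applied by `process_eng` below. Each word is then taken to depend only on the two words before it:

$$P(e) \approx \prod_{i=1}^{n+2} P(w_i \mid w_{i-2},\, w_{i-1})$$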

Data directory:

In [2]:
#my personal computer
#path = "C:\\Users\\chuck189\\Desktop\\Cal Poly Summer Research 2017\\data"

#math lounge computer
path = "/Users/csmuser/Desktop/Cal Poly Summer Research 2017/data"

### Functions for *n*-gram processing

Function for merging a list of dictionaries:

In [3]:
def merge_dicts(dict_args):
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result
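A quick check of how `merge_dicts` resolves collisions may help here: because it applies `dict.update` in order, a key that appears in several dictionaries keeps the value from the last one. The dictionaries below are made-up examples.

```python
def merge_dicts(dict_args):
    # same logic as the function above, repeated so this snippet runs on its own
    result = {}
    for dictionary in dict_args:
        result.update(dictionary)
    return result

# "dont" appears in both dictionaries; the later value (5) wins
merged = merge_dicts([{"dont": 0, "cant": 1}, {"dont": 5, "wont": 2}])
print(merged)
```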

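This function prepares any list of English sentences for *n*-gram processing. It strips most punctuation, expands contractions, replaces money amounts and percentages with placeholder tokens, and pads each sentence with start and stop tokens.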

In [51]:
def process_eng(l, conts, nopunct, exclude, e):
    '''
    Takes a list of English sentences (e.g. the lines of casia2015_en.txt) and removes
    unnecessary punctuation. Accounts for contractions. Replaces percentages with <%>
    and references to money with <$>, then pads each sentence with start and stop tokens.
    '''

    money = "<$>"
    percent = "<%>"
    start = "<s>"
    stop = "</s>"
    
    for i in range(len(l)):
        
        #####remove punctuations except - ! . ?
        #####accounts for many apostrophe edge cases including possession
        
        l[i] = l[i].strip()
        
        if any(x in l[i] for x in e):
            for item in exclude:
                if item == "\'":
                    indexes = [m.start() for m in re.finditer("\'", l[i])]
                    num = 0
                    if len(l[i]) - 1 in indexes:
                        l[i] = l[i][:len(l[i]) - 1]
                        indexes.remove(len(l[i]))
                    if 0 in indexes:
                        l[i] = l[i][1:]
                        indexes.remove(0)
                        if 1 in indexes:
                            l[i] = l[i][1:]
                            indexes.remove(1)
                    for index in indexes:
                        if (e[14] in l[i]) or (e[15] in l[i]):
                            continue
                        t = l[i][index - 2 - num: index - num]
                        if (t != "he") and (t != "He") and (t != "We") and (t != "we") and (t != "It") and \
                        (t != "it") and (t != "So") and (t != "so") and (t != "ho") and \
                        (index - num != 0) and (index - num != len(l[i])):
                            l[i] = l[i][:index - num] + l[i][index + 1 - num:]
                            num += 1
                else:
                    l[i] = l[i].replace(item, "")
        else:
            l[i] = "".join(ch for ch in l[i] if ch not in exclude)
           
        #####replace contractions w/ correct form
        
        j = l[i].strip()
        for x in nopunct:
            if x in l[i]:
                l[i] = l[i].replace(x, conts[nopunct[x]])
        
        #####deal with - and --
        
        if " - " in l[i]:
            indexes = [m.start() for m in re.finditer(" - ", l[i])]
            num = 0
            for index in indexes:
                l[i] = l[i][:index - num] + l[i][index + 2 - num:]
                num += 2
                    
        elif "--" in l[i]:
            indexes = [m.start() for m in re.finditer("--", l[i])]
            num = 0
            for index in indexes:
                l[i] = l[i][:index - num] + l[i][index + 2 - num:]
                num += 2
                
        #####replace $__ with <$>, 
        
        if "$" in l[i]:
            indexes = [m.start() for m in re.finditer(r"\$", l[i])]
            nums = [m.end() for m in re.finditer(r"\$[-+]?([0-9]*?\.?[\s][0-9]+|[0-9]+)", l[i])]
            if len(indexes) == len(nums):
                num = 0
                for j in range(len(indexes)):
                    l[i] = l[i][:indexes[j] - num] + money + l[i][nums[j] - num:]
                    num += nums[j] - indexes[j] - 3
            else:
                l[i] = "" #bad sentence
            
        #####replace __% with <%>
        #####accounts for cases such as "29. 2%"
        
        if "%" in l[i]:
            indexes = [m.start() for m in re.finditer(r"([0-9]*?\.[\s]?[0-9]+|[0-9]+)%", l[i])]
            percents = [m.end() for m in re.finditer("%", l[i])]
            if len(indexes) == len(percents):
                num = 0
                for j in range(len(indexes)):
                    l[i] = l[i][:indexes[j] - num] + percent + l[i][percents[j] - num:]
                    num += percents[j] - indexes[j] - 3
            else:
                l[i] = ""

                
        l[i] = l[i].replace(". .", "").replace("..", "").replace("  ", " ")
        l[i] = start + " " + start + " " + l[i].strip() + " " + stop + " " + stop
 
    return l
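The placeholder step can be illustrated in isolation. The following is a simplified sketch using `re.sub`, not the index-based logic of `process_eng`; the helper name `mask_amounts` and its two patterns are assumptions made for this example, and unlike `process_eng` it does not blank out sentences whose `$` or `%` signs fail to match.

```python
import re

MONEY = "<$>"
PERCENT = "<%>"

def mask_amounts(sentence):
    # a dollar sign, an optional sign, then an integer or decimal amount
    sentence = re.sub(r"\$[-+]?\d+(?:\.\d+)?", MONEY, sentence)
    # a number followed by %, allowing a stray space after the decimal
    # point so that cases such as "29. 2%" are caught
    sentence = re.sub(r"\d+(?:\.\s?\d+)?%", PERCENT, sentence)
    return sentence

masked = mask_amounts("It cost $15.50 and rose 29. 2% in 2016")
print(masked)
```

Collapsing every amount into a single token keeps the trigram counts from being spread thinly over thousands of distinct numbers.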

This function splits English text on sentence stops (!, ?, .) while accounting for titles such as Dr., Mr., and Mrs. This is necessary for `casia2015_en.txt` because many of its lines contain two or more sentences.

In [21]:
def eng_sentence_split(l):
    '''
    Takes a list of strings (or a single string) and splits each one into sentences
    on ., ! and ?. A period is only treated as a sentence stop when none of the
    lookbehinds in `titles` rule it out, which keeps titles such as Dr., Mr. and
    Mrs. intact. Returns a flat list of stripped, non-empty sentences.
    '''
    titles = r"(?<![A-z])(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."
    k = []
    
    if type(l) is str:
        l = [l]
    elif type(l) is not list:
        return k
    
    for line in l:
        #first split on periods, then split each piece on ! and ?
        for piece in re.split(titles, line):
            for sent in re.split(r"!|\?", piece):
                if sent.strip() != "":
                    k.append(sent.strip())
    
    return k
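To see what the `titles` pattern does in practice, the sketch below applies it to two made-up inputs. The helper `split_on_periods` is a hypothetical name that reproduces only the period-splitting step. Because the first lookbehind rejects any period that directly follows a letter, the pattern splits on periods that stand apart from the preceding word (as in text joined from tokens, like the Brown corpus words below) and leaves attached periods alone.

```python
import re

# Same pattern as `titles` in eng_sentence_split, written as a raw string
titles = r"(?<![A-z])(?<![A-Z][a-z])(?<![A-Z][a-z][a-z])\."

def split_on_periods(text):
    # split on the pattern, then drop empty pieces and surrounding whitespace
    return [x.strip() for x in re.split(titles, text) if x.strip() != ""]

# periods separated from the preceding word by a space
tokenized = split_on_periods("Dr. Smith arrived . He left .")
# periods attached directly to the preceding word
attached = split_on_periods("Dr. Smith arrived. He left.")

print(tokenized)
print(attached)
```

In the first input the title `Dr.` survives and the two sentences are separated; in the second, nothing is split, so text with attached periods would need a different pattern.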

This function builds trigram counts (*n*-grams with *n* = 3) from a list of sentences.

In [6]:
def trigram(list_, temp = None, freqs = None):
    '''
    Takes a list of strings (which are sentences) and, optionally, existing trigram
    counts and phrase frequencies to update.
    Returns the trigram counts and the updated phrase frequencies.
    '''
    #fresh dictionaries per call; mutable default arguments would be shared between calls
    if temp is None:
        temp = {}
    if freqs is None:
        freqs = {}
    
    for sent in list_:
        
        sent = sent.split()
        
        for i in range(0, len(sent) - 2):
            ot = sent[i] + " " + sent[i + 1]      #first two words of the trigram
            tt = sent[i + 1] + " " + sent[i + 2]  #last two words of the trigram
            if ot in temp:
                if tt in temp[ot]:
                    temp[ot][tt] += 1
                else:
                    temp[ot].update({tt : 1})
            else:
                temp.update({ot : {tt : 1}})
    
    #freqs[ot] is the total count of trigrams that start with ot
    for i in temp:
        freqs[i] = sum(temp[i].values())
    
    return temp, freqs
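The two dictionaries returned by `trigram` are enough to estimate a conditional word probability by relative frequency. The sketch below assumes that unsmoothed estimate, which the notebook does not spell out; `count_trigrams` is a compact stand-in for `trigram` so the snippet runs on its own, `trigram_prob` is a hypothetical helper, and the three sentences are made up.

```python
from collections import defaultdict

def count_trigrams(sentences):
    # counts[first_two][last_two] mirrors the nested layout returned by trigram()
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        words = sent.split()
        for i in range(len(words) - 2):
            first_two = words[i] + " " + words[i + 1]
            last_two = words[i + 1] + " " + words[i + 2]
            counts[first_two][last_two] += 1
    # total number of trigrams starting with each two-word phrase
    totals = {k: sum(v.values()) for k, v in counts.items()}
    return counts, totals

def trigram_prob(w1, w2, w3, counts, totals):
    # relative frequency: count(w1 w2 w3) / count(trigrams starting with w1 w2)
    first_two = w1 + " " + w2
    last_two = w2 + " " + w3
    if first_two not in totals:
        return 0.0  # unseen context; a real model would smooth instead
    return counts[first_two].get(last_two, 0) / totals[first_two]

sents = ["<s> <s> the show ended </s> </s>",
         "<s> <s> the show began </s> </s>",
         "<s> <s> the meal ended </s> </s>"]
counts, totals = count_trigrams(sents)
p = trigram_prob("the", "show", "ended", counts, totals)
print(p)
```

Here "the show" is followed once by "ended" and once by "began", so the estimate for "ended" is one half. Unseen trigrams get probability zero under this estimate, which is why practical language models add smoothing.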

### Variables for *n*-gram processing

The next block pulls contractions from two different sites and builds the set of punctuation characters to remove (everything in `string.punctuation` except the sentence stops, -, $, and %).

In [7]:
#pulling contractions from two different sites
c = requests.get("http://www.softschools.com/language_arts/grammar/contractions/contractions_list/").content
soup = BeautifulSoup(c, "lxml")
cont = [x for x in str(soup.find_all("span", class_ = "myFont14")[0])
                .replace("<br/>", " ").split(" ") if "\'" in x]

c1 = requests.get("http://grammar.wikia.com/wiki/List_of_contractions").content
soup1 = BeautifulSoup(c1, "lxml")
cont1 = [str(x).replace("</td>", "").replace("\n", "").replace("<td>", "") 
         for x in soup1.find_all("td") if "\'" in str(x)]

e = ["He'll ", " he'll ", "We'd ", " we'd ", "She'll ", " she'll ", "We're ", 
        " we're ", "It's ", " it's ", "So's ", " so's ", "Who're ", " who're ", "'s", "s'", " we'll ", "We'll "]
conts = list(chain.from_iterable([[x.ljust(len(x) + 1).capitalize(), x.center(len(x) + 2)]
                                  for x in list(set(cont).union(cont1))]))
conts.extend(["That'll", " that'll "])
for i in e:
    if i in conts:
        conts.remove(i)

nopunct = merge_dicts(list(chain.from_iterable([[{conts[x].replace("\'", "") : x, conts[x].replace("\'", " ") : x}] 
                                        for x in range(len(conts))])))

#punctuations to remove
sub = {".": "", "!": "", "?": "", "-": "", "$": "", "%": ""}
rep = dict((re.escape(k), v) for k, v in sub.items())
pattern = re.compile("|".join(rep.keys()))
exclude = set(pattern.sub(lambda m: rep[re.escape(m.group(0))], string.punctuation))
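The `exclude` set built above is simply `string.punctuation` with six characters held back. As a sanity check, the sketch below rebuilds it with a plain set difference and compares the result against the regex-based construction; the names `kept`, `exclude_simple`, and `exclude_regex` are introduced here for illustration.

```python
import re, string

# characters that must survive cleaning: sentence stops, hyphen, $ and %
kept = ".!?-$%"
exclude_simple = set(string.punctuation) - set(kept)

# the regex-based construction from the cell above, reproduced for comparison
sub = {ch: "" for ch in kept}
rep = dict((re.escape(k), v) for k, v in sub.items())
pattern = re.compile("|".join(rep.keys()))
exclude_regex = set(pattern.sub(lambda m: rep[re.escape(m.group(0))], string.punctuation))

print(exclude_simple == exclude_regex)
print(sorted(exclude_simple))
```

Either form can be passed to `process_eng` as `exclude`; the set difference is just easier to read.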

### CASIA 2015

We first clean the `CASIA 2015` dataset:

In [52]:
os.chdir(path + "/casia2015")
files = glob.glob("*.txt")

c2015entemp = process_eng(open(files[1], "r", encoding = "utf-8").read().split("\n"), conts, nopunct, exclude, e)
c2015en = eng_sentence_split(c2015entemp)

*n*-gram for `casia2015_en`:

In [53]:
c2015engram, freqs = trigram(c2015en, {}, {})

In [56]:
c2015engram["<s> 'It's"]

{"'It's funny": 1}

### NLTK Brown

*n*-gram for NLTK Brown corpus:

In [22]:
def process_NLTK_Brown(conts, nopunct, exclude, e):
    from nltk.corpus import brown
    temp = " ".join([x for x in brown.words(categories = brown.categories())])
    listform = eng_sentence_split(temp)
    return process_eng(listform, conts, nopunct, exclude, e)

In [23]:
browntemp = process_NLTK_Brown(conts, nopunct, exclude, e)

In [26]:
browngram, freqs = trigram(browntemp, c2015engram, freqs)

In [35]:
c2015cntemp = open(files[0], "r", encoding = "utf-8").read().split("\n")

In [36]:
for i in range(100):
    #print the first 100 Chinese lines (assumes `lol` referred to `c2015cntemp`)
    print(c2015cntemp[i], i)

表演的明星是X女孩团队——由一对具有天才技艺的艳舞女孩们组成，其中有些人受过专业的训练。 0
表演的压轴戏是闹剧版《天鹅湖》，男女小人们身着粉红色的芭蕾舞裙扮演小天鹅。 1
表演和后期制作之间的屏障被清除了，这对演员来说一样大有裨益。 2
（表演或背诵时）通过暗示下面忘记或记地不准的东西来帮助某人。 3
表演基本上很精彩--我只对她的技巧稍有意见。 4
表演结束后，我们看到一对对车灯沿主路一路排回镇上，然后散开来各回各家。 5
表演结束后，移走了背景墙，随后全体演员即兴邀请观众上台齐跳并排舞。 6
表演结束后用宣纸轻铺水面，可将水面上的画进行拓印保存。 7
表演结束后，众人期待已久的园游会终于正式开锣，美味可口的素食佳肴让大家一饱口福。 8
表演节目丰富精采，交换礼物的欢乐时刻一到，则形成另一波高潮。 9
表演仅仅是造就一个近乎觉察不出来的「直线的地质的移位」。 10
表演开始十五分钟后，一帮足球运动员开始集体向着舞台上的女演员发出嘘声。 11
表演开始时，舞台上会有一张床，一面镜子，一张椅子以及一位穿着内衣的美女。在你觉察到前，一位性感的女孩会突然变成三位。 12
表演开始时，艺人坐在地毯上轻击盅子，徐缓起舞； 13
表演“猫女”——在一个笼子里穿着带有一根长尾巴和猫耳的豹纹女内衣。 14
表演前，她紧张得浑身颤抖不已。 15
表演是他们的嫡传技艺，150多年来他们家族一直都是演员。 16
表演算得上是一门残忍的职业，你偏离正统美越远，就越艰难。 17
表演我软木塞哪一呼吸和我将表演你一瓶醋。 18
表演：悉尼歌剧院首席男高音丁毅先生悉尼歌剧院首席女高音 19
表演秀以及开园时间有可能因故不经预告而取消或中止，敬请留意。 20
表演一个不限分析国内受损的期望往往带来了个人工作的公司。 21
“表演”一结束，他马上给那名黑人“歹徒”发工钱，然后两个人还握手拥抱。 22
表演艺术家MarniKotak（持玩偶者）在布鲁克林的“显微镜”画廊生下一婴作为她艺术作品的一部分。 23
表演艺术来源于生活而高于生活，生活是艺术创作的来源，取之不尽用之不竭的源泉。 24
表演在安慰疗法中也很重要。安慰性注射虽然比安慰药丸有效果，但是却没有虚假手术来得见效。 25
“表演在本质上给了他们一个正当的理由去学习如何表达自己。”阿伦森说。 26
表演者把信封靠近他的脑袋，然后先给出答案，