# Model
#### Charlie Liou

In [1]:
import numpy as np
import re, string
import os, glob, requests
import nltk
from bs4 import BeautifulSoup
from itertools import chain

Much of this is adapted from Kevin Knight's [A Statistical MT Tutorial Workbook](http://www.isi.edu/natural-language/mt/wkbk.rtf).

To write a statistical Chinese-to-English translator, we consider English sentences $e$ and Chinese sentences $c$. Given $c$, we seek the English sentence that maximizes $P(e \mid c)$, that is, the most likely English sentence given the Chinese one. Bayes' Theorem will turn this into a problem involving $P(c \mid e)$, which may look like solving the opposite problem; this reversal is explained below. More formally, we are trying to find the English sentence $\vec{e}$ that satisfies

$$\vec{e} = \arg\max_{e} P(e \mid c)$$

How do we approach maximizing the probability $P(e \mid c)$? Bayes' Theorem tells us:

$$P( e \mid  c) = \dfrac{P(e)\hspace{0.05cm}P(c\mid e)}{P(c)}$$

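Because $P(c)$ is fixed for a given Chinese sentence and does not depend on $e$, it can be dropped from the maximization, which lets us rewrite our problem as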

$$\vec{e} = \arg\max_{e} P(e \mid c) = \arg\max_{e} P(e)\,P(c \mid e)$$

This is pure Bayesian reasoning. Knight gives a good analogy: think of the given Chinese sentence $c$ as a crime scene and $e$ as the person who committed the crime. $P(e)$ is the description of the person, and $P(c \mid e)$ is how they did it. Many people may fit the description $P(e)$ without having the means to commit the crime. Likewise, many people may have the means $P(c \mid e)$ without fitting the description. We are looking for the person who is both most likely to commit the crime and able to commit it.

Translation itself needs both correct syntax and accurate meaning; we can't have one without the other. In our last equation, $P(e)$ rewards correct English syntax and $P(c \mid e)$ rewards accurate translation. Maximizing their product therefore favors sentences that are both well formed and faithful to the Chinese, which is why we maximize $P(e \mid c)$ in this factored form.
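
The search described above can be sketched on a toy scale. The candidate sentences and all probabilities below are made up for illustration and do not come from the data used in this notebook; `best_translation` is a hypothetical helper name. Log probabilities are summed rather than multiplying raw probabilities, which is a common way to avoid underflow on longer sentences.

```python
import math

# Hypothetical toy scores; none of these numbers come from the notebook's data.
language_model = {          # P(e): how plausible the English sentence is
    "i like dogs": 0.020,
    "i dogs like": 0.001,
    "me like dog": 0.002,
}
translation_model = {       # P(c | e) for one fixed Chinese sentence c
    "i like dogs": 0.30,
    "i dogs like": 0.35,
    "me like dog": 0.25,
}

def best_translation(lm, tm):
    """Return the candidate e that maximizes log P(e) + log P(c | e)."""
    return max(lm, key=lambda e: math.log(lm[e]) + math.log(tm[e]))

print(best_translation(language_model, translation_model))
```

Note that "i dogs like" has the highest translation score on its own, but the language model penalizes its word order, so the well-formed candidate wins overall.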

## Language Model

To account for syntax, that is, to estimate $P(e)$, we use an *n*-gram model. I will build a trigram model using data from several corpora, including `casia2015_en`, TED talks, and NLTK's Brown corpus.

Data directory:

In [3]:
#my personal computer
path = "C:/Users/chuck189/Desktop/Cal Poly Summer Research 2017/data"

#math lounge computer
#path = "/Users/csmuser/Desktop/Cal Poly Summer Research 2017/data"

This function generates *n*-grams when n = 3.

In [93]:
def trigram(list_, temp = None, freqs = None):
    '''
    Takes a list of strings (which are sentences) and, optionally, existing
    trigram counts and phrase frequencies to update.
    Returns the nested trigram counts and the total count for each leading bigram.
    '''
    # None defaults avoid sharing one mutable dictionary between calls
    temp = {} if temp is None else temp
    freqs = {} if freqs is None else freqs

    for sent in list_:

        sent = sent.split()

        for i in range(0, len(sent) - 2):
            ot = sent[i] + " " + sent[i + 1]      # words i and i + 1
            tt = sent[i + 1] + " " + sent[i + 2]  # words i + 1 and i + 2
            if ot not in temp:
                temp[ot] = {}
            temp[ot][tt] = temp[ot].get(tt, 0) + 1

    # temp holds cumulative counts, so each total is recomputed from it
    for ot in temp:
        freqs[ot] = sum(temp[ot].values())

    return temp, freqs
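
As a minimal sketch of how the two dictionaries returned by `trigram` might be used, the relative-frequency estimate $P(w_3 \mid w_1 w_2) = \text{count}(w_1 w_2 w_3) / \text{count}(w_1 w_2)$ can be read straight off them. The helper name `trigram_prob` and the toy counts are assumptions for illustration, and returning zero for unseen contexts is a simplification; a real language model would apply smoothing.

```python
def trigram_prob(counts, totals, w1, w2, w3):
    """Estimate P(w3 | w1 w2) as count(w1 w2 w3) / count(w1 w2 followed by anything)."""
    context = w1 + " " + w2
    continuation = w2 + " " + w3
    if context not in totals:
        return 0.0  # unseen context; a real model would smooth instead
    return counts[context].get(continuation, 0) / totals[context]

# Hypothetical counts in the same nested layout that trigram() returns.
counts = {"i like": {"like dogs": 2, "like cats": 1, "like to": 1}}
totals = {"i like": 4}

print(trigram_prob(counts, totals, "i", "like", "dogs"))  # 2 / 4
```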

This function generates *n*-grams when n = 4.

In [94]:
def fourgram(list_, temp = None, freqs = None):
    '''
    Takes a list of strings (which are sentences) and, optionally, existing
    four-gram counts and phrase frequencies to update.
    Returns the nested four-gram counts and the total count for each leading trigram.
    '''
    # None defaults avoid sharing one mutable dictionary between calls
    temp = {} if temp is None else temp
    freqs = {} if freqs is None else freqs

    for sent in list_:

        sent = sent.split()

        for i in range(0, len(sent) - 3):
            # A four-gram spans words i .. i + 3: a three-word context and
            # the three-word window that follows it
            ot = " ".join(sent[i : i + 3])
            tt = " ".join(sent[i + 1 : i + 4])
            if ot not in temp:
                temp[ot] = {}
            temp[ot][tt] = temp[ot].get(tt, 0) + 1

    # temp holds cumulative counts, so each total is recomputed from it
    for ot in temp:
        freqs[ot] = sum(temp[ot].values())

    return temp, freqs

- check for two and three capitalized letters ending with a period
- send everything to lowercase

In [369]:
one = ["i like dog", "i like cats", "i like to go running"]
two = ["我喜欢狗", "我喜欢猫", "我喜欢跑步"]
#lol = []
#lol.append(nltk.AlignedSent(nltk.word_tokenize("i like dog"), [x for x in "我喜欢狗"]))
#lol.append(nltk.AlignedSent(nltk.word_tokenize("i like dog"), [x for x in "我喜欢狗"]))
#lol.append(nltk.AlignedSent(nltk.word_tokenize("i like cats"), [x for x in "我喜欢猫"]))
#lol.append(nltk.AlignedSent(nltk.word_tokenize("i like to go running"), [x for x in "我喜欢跑步"]))
#temp = nltk.translate.IBMModel2(lol, 10)
f, b, left, right = ibm_align(3, 5, one, two)

([AlignedSent(['i', 'like', 'dog'], ['我', '喜', '欢', '狗'], Alignment([(0, 2), (1, 2), (2, 3)])),
  AlignedSent(['i', 'like', 'cats'], ['我', '喜', '欢', '猫'], Alignment([(0, 2), (1, 2), (2, 3)])),
  AlignedSent(['i', 'like', 'to', 'go', 'running'], ['我', '喜', '欢', '跑', '步'], Alignment([(0, 2), (1, 2), (2, 4), (3, 4), (4, 4)]))],
 [AlignedSent(['我', '喜', '欢', '狗'], ['i', 'like', 'dog'], Alignment([(0, 1), (1, 1), (2, 1), (3, 2)])),
  AlignedSent(['我', '喜', '欢', '猫'], ['i', 'like', 'cats'], Alignment([(0, 1), (1, 1), (2, 1), (3, 2)])),
  AlignedSent(['我', '喜', '欢', '跑', '步'], ['i', 'like', 'to', 'go', 'running'], Alignment([(0, 1), (1, 1), (2, 1), (3, 4), (4, 4)]))],
 <nltk.translate.ibm2.IBMModel2 at 0x10f951fd0>,
 <nltk.translate.ibm2.IBMModel2 at 0x107b9fa20>)

In [17]:
nltk.pos_tag("The store will be visited by him today".split(" "))

[('The', 'DT'),
 ('store', 'NN'),
 ('will', 'MD'),
 ('be', 'VB'),
 ('visited', 'VBN'),
 ('by', 'IN'),
 ('him', 'PRP'),
 ('today', 'NN')]

### CASIA 2015

In [4]:
import nltk, time

In [5]:
os.chdir(path + "/casia2015")
files = glob.glob("*.txt")

# with-blocks close each file as soon as it has been read
with open(files[1], "r", encoding = "utf-8") as f:
    engtemp = f.read().split("\n")
with open(files[0], "r", encoding = "utf-8") as f:
    chtemp = f.read().split("\n")

In [393]:
for i in chtemp:
    print(nltk.word_tokenize(i), len(nltk.word_tokenize(i)))
    time.sleep(0.2)

['表演的明星是X女孩团队——由一对具有天才技艺的艳舞女孩们组成，其中有些人受过专业的训练。'] 1
['表演的压轴戏是闹剧版《天鹅湖》，男女小人们身着粉红色的芭蕾舞裙扮演小天鹅。'] 1
['表演和后期制作之间的屏障被清除了，这对演员来说一样大有裨益。'] 1
['（表演或背诵时）通过暗示下面忘记或记地不准的东西来帮助某人。'] 1
['表演基本上很精彩', '--', '我只对她的技巧稍有意见。'] 3
['表演结束后，我们看到一对对车灯沿主路一路排回镇上，然后散开来各回各家。'] 1
['表演结束后，移走了背景墙，随后全体演员即兴邀请观众上台齐跳并排舞。'] 1
['表演结束后用宣纸轻铺水面，可将水面上的画进行拓印保存。'] 1
['表演结束后，众人期待已久的园游会终于正式开锣，美味可口的素食佳肴让大家一饱口福。'] 1
['表演节目丰富精采，交换礼物的欢乐时刻一到，则形成另一波高潮。'] 1
['表演仅仅是造就一个近乎觉察不出来的「直线的地质的移位」。'] 1
['表演开始十五分钟后，一帮足球运动员开始集体向着舞台上的女演员发出嘘声。'] 1
['表演开始时，舞台上会有一张床，一面镜子，一张椅子以及一位穿着内衣的美女。在你觉察到前，一位性感的女孩会突然变成三位。'] 1
['表演开始时，艺人坐在地毯上轻击盅子，徐缓起舞；'] 1
['表演', '“', '猫女', '”', '——在一个笼子里穿着带有一根长尾巴和猫耳的豹纹女内衣。'] 5
['表演前，她紧张得浑身颤抖不已。'] 1
['表演是他们的嫡传技艺，150多年来他们家族一直都是演员。'] 1
['表演算得上是一门残忍的职业，你偏离正统美越远，就越艰难。'] 1
['表演我软木塞哪一呼吸和我将表演你一瓶醋。'] 1
['表演：悉尼歌剧院首席男高音丁毅先生悉尼歌剧院首席女高音'] 1
['表演秀以及开园时间有可能因故不经预告而取消或中止，敬请留意。'] 1
['表演一个不限分析国内受损的期望往往带来了个人工作的公司。'] 1
['“', '表演', '”', '一结束，他马上给那名黑人', '“', '歹徒', '”', '发工钱，然后两个人还握手拥抱。'] 8
['表演艺术家MarniKotak（持玩偶者）在布鲁克林的', '“', '显微镜', '”', '画廊生下一婴作为她艺术作品的一部分。'] 

['别对他的所作所为太苛刻，他毕竟还是个孩子。'] 1
['别对他的所作所为太苛求，他毕竟还是个孩子。'] 1
['别对他太严厉。他毕竟只是个孩子。'] 1
['别对现实生活过于苛求，常存感激的心情！'] 1
['别饿着自己研究表明使自己处于饥饿状态的人会以错误的方式减少体重。'] 1
['别尔哥罗德州州长称，这一节日与俄罗斯文化传统不符，同样被禁止的还有万圣节。'] 1
['别尔嘉耶夫则从末世论的角度讨论基督教世界必定是一个恩典的历史，人类历史需要一个救世主。'] 1
['《别尔金小说集》一直被分体研究，视为多部小说的沙状集合。'] 1
['“', '别发疯', '”', '，我们愤怒地说：', '“', '我们有足够的果盘。', '”'] 7
['别发火，亲爱的。我去检查过那车了，情况并不像你想象的那么严重。而且，他们的保险公司会负担修理费的。'] 1
['别发火。我会打电话给他们，把事情搞定。'] 1
['别发牢骚了，尽快开始做你的工作。'] 1
['别犯傻了。你还是要看医生的。我来帮你预约。只有傻子才怕见牙医呢！'] 1
['“', '别放弃，', '”', '他说，', '“', '你是个很棒的作家。', '”', '他读过我的几篇小说，以便告诉我哪些地方的科学部分被我弄错了。'] 8
['别放弃。我相信如果你坚持埋头苦干，你就会得到你想要的结果。'] 1
['“', '别放在那儿，', '”', '男孩说，', '“', '那马厩今晚会失火。', '”'] 7
['别费心了，我们不可能赶上的，干脆别尝试了，还浪费我们的钱，加重纳税人负担。'] 1
['别敷衍我了你，是，我误解了你对爱的，理解。'] 1
['别搞错了，这不是狼的侥幸。这是狼应得的，他们以他们的表现挣来了26年来第一次战胜安菲尔德的辉煌战绩。'] 1
['别搞得像你事先知道这事会发生一样。你这完全是事后诸葛亮。'] 1
['别告诉别人，但我在股票市场赚了一大笔。我现在很有钱呀，老弟！'] 1
['别告诉她我们正在为她准备的生日晚会，我们可以给她一个惊喜。'] 1
['别告诉同事，其实我正在找新工作。'] 1
['别告诉我——就这样走出去而我却再也看不到它了。'] 1
['别告诉我硫磺石，我知道我闻到了什么，不可能是硫磺石。'] 1
['别告诉我你不可救药的迷恋，发展成了无意义的嫉妒。'] 1

['别让别人告诉你不能做什么，你有梦想，你就得保护它。'] 1
['别让电视广告引诱你购买自己实际不需要的东西', '.'] 2
['别让干扰信号把你弄糊涂了。重复，别让敌人的干扰信号把你弄晕了。'] 1
['别让孩子尝试控制你的决定，例如，你选择谁保母。'] 1
['别让孩子们养成晚睡的习惯。'] 1
['别让花花世界迷惑了双眼，更别让夺走你内心宁静的那种美妙感受。'] 1
['别让那声音把你吓着了，那不过是风声。别让那声音把你吓着了，那不过是风声。'] 1
['别让那些地面上的垃圾们再让太空中的垃圾问题恶化了。'] 1
['别让那些伤害你的人驾驭你的精神生活，做回你自己。'] 1
['别让你的孩子把头伸出火车窗外-太危险了。'] 1
['别让情绪干扰你的成功EQ差的人，很容易被情绪打败，只有完全与情绪分开来，这样才会战胜自己，才不会踏上失败的路程。'] 1
['别让日复一日的压力和生活琐碎纠缠着你，试试用明星们的方式来放松吧。'] 1
['别让时间冲淡友情的酒，别让距离拉开梦中相握的手。'] 1
['别让他们离得太近了，不然我们就不能一鼓作气夺路冲杀了！'] 1
['别让他迷惑你，因为他是个有魅力的人'] 1
['别让他人的批评妨碍你或减损你的自信，捍卫自己，批评者最后可能是你的好友。'] 1
['别让他人熄灭你的激情！激情可以引导你的生活变得圆满充实。享受你做的事吧。'] 1
['别让我觉得我很渺小，这只会令我更加愚蠢地去逞强。'] 1
['别让我们所争得的自由再落到外国侵略者和国内封建主手里。'] 1
['别让我们整个团体都处于险境，你应该试着跟他沟通。'] 1
['别让我难堪了，行不行，特德。把挡风玻璃抹一下。'] 1
['别让我失望，年轻的女士。我不会总是这么纵容你的。'] 1
['别让我推迟，别让我忽视，因为我不会从这里再次经过了。', '’', '这就是我的生活。'] 3
['别让我想起那个尴尬的日子，我可是出尽了丑。'] 1
['别让我原谅你.你会是一个骗子，会是一个骗子。'] 1
['别让我在人生的战场上寻找盟友，让我拥有自己强大的力量。'] 1
['别让我这么难受，别拖延了。来啊！开枪吧。我说的不是你。'] 1
['《别让我走》可能在技巧上有些科幻，因为故事场景设定在未来，但实际上整个故事并没有科幻色彩。'] 1
['别让现在的经济衰退欺骗了你

KeyboardInterrupt: 
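
The output above suggests that `nltk.word_tokenize` leaves most Chinese sentences as a single token, since written Chinese has no spaces between words. The `ibm_align` function further down works around this by treating every character as its own token. A minimal sketch of that idea follows; `char_tokenize` is a hypothetical helper name, and dropping whitespace is an added assumption that the notebook's own code does not make.

```python
def char_tokenize(sentence):
    """Split a Chinese sentence into single-character tokens, dropping whitespace."""
    return [ch for ch in sentence if not ch.isspace()]

print(char_tokenize("我喜欢狗"))
```

Character-level tokens are a crude stand-in for real word segmentation: a two-character word such as 喜欢 is split in half, so the alignment model has to learn that both halves map to the same English word.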

In [374]:
nltk.word_tokenize("DON'T 'DON \"T' ICM's boss thinks ")

['DO', "N'T", "'DON", '``', 'T', "'", 'ICM', "'s", 'boss', 'thinks']

In [375]:
nltk.word_tokenize("Well now you can sleep with your head against one in this 1940s Bristol B-170 Freighter plane.")

['Well',
 'now',
 'you',
 'can',
 'sleep',
 'with',
 'your',
 'head',
 'against',
 'one',
 'in',
 'this',
 '1940s',
 'Bristol',
 'B-170',
 'Freighter',
 'plane',
 '.']

In [378]:
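def ibm_align(numsent, iterations, engtemp, chtemp):
    '''
    Trains IBM Model 2 in both directions on the first numsent sentence pairs.
    English is lowercased and word-tokenized; Chinese is split into characters.
    Returns the two aligned corpora and the two trained models.
    '''
    forwards, backwards = [], []
    for i in range(numsent):
        eng = nltk.word_tokenize(engtemp[i].lower())
        ch = list(chtemp[i])
        forwards.append(nltk.AlignedSent(eng, ch))
        backwards.append(nltk.AlignedSent(ch, eng))
    # Training also fills in the alignment of every AlignedSent in place
    ibm2f = nltk.translate.IBMModel2(forwards, iterations)
    ibm2b = nltk.translate.IBMModel2(backwards, iterations)
    return forwards, backwards, ibm2f, ibm2b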

In [379]:
forwards, backwards, ibm2f, ibm2b = ibm_align(100, 5, engtemp, chtemp)

In [249]:
sent = 2

In [277]:
def t(a):
    '''Returns the alignment pairs of a with each (i, j) flipped to (j, i).'''
    return [(x[1], x[0]) for x in list(a.alignment)]

In [278]:
set(list(forwards[sent].alignment)).intersection(t(backwards[sent]))

{(7, 2), (9, 18), (10, 14), (12, 28), (16, 29)}
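
The cell above keeps only the links that both alignment directions agree on. Here is a self-contained sketch of the same intersection, using the alignments printed earlier in this notebook for "i like dog" and "我喜欢狗". The function name `symmetrize` is a hypothetical label, and the index order in the comments assumes each pair follows the order of the two token lists in its `AlignedSent`.

```python
# Alignments copied from the "i like dog" / "我喜欢狗" output earlier in this notebook.
forward = {(0, 2), (1, 2), (2, 3)}             # (English index, Chinese index)
backward = {(0, 1), (1, 1), (2, 1), (3, 2)}    # (Chinese index, English index)

def symmetrize(fwd, bwd):
    """Keep only the links proposed by both directions (intersection heuristic)."""
    flipped = {(e, c) for (c, e) in bwd}
    return fwd & flipped

print(sorted(symmetrize(forward, backward)))
```

On this pair the surviving links are like/欢 and dog/狗. Intersection tends to trade coverage for reliability: fewer links survive, but each one is supported by both models.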

In [160]:
translation_table = ibm2.translation_table
alignment_table = ibm2.alignment_table

os.chdir(path)
os.makedirs("data_structures", exist_ok = True)

import json

with open(path + "/data_structures/translation_table.json", "w") as f:
    json.dump(translation_table, f)
    
with open(path + "/data_structures/alignment_table.json", "w") as f:
    json.dump(alignment_table, f)
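
One caveat when storing tables like these as JSON: object keys are always written as strings, so a table keyed by integers does not come back in the same form. The sketch below uses a made-up miniature table, on the assumption that the alignment table is keyed by integer positions; `restore_int_keys` is a hypothetical helper and not part of NLTK.

```python
import json

# Hypothetical miniature table with integer keys at every level.
table = {1: {2: 0.5}}

# Round trip through JSON: the integer keys come back as strings.
restored = json.loads(json.dumps(table))
print(restored)

def restore_int_keys(obj):
    """Recursively convert digit-only string keys back to integers."""
    if isinstance(obj, dict):
        return {
            (int(k) if k.isdigit() else k): restore_int_keys(v)
            for k, v in obj.items()
        }
    return obj

print(restore_int_keys(restored) == table)
```

A translation table keyed by words is less affected, although any word made up only of digits would also be converted by this helper, so it should only be applied where integer keys are expected.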

In [161]:
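def get_max_probs(d, threshold = 1e-12):
    '''
    Returns the entries of d whose probability exceeds threshold,
    dropping the near-zero values.
    '''
    return {i: j for i, j in d.items() if j > threshold}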

In [169]:
def max_prob(d):
    return max(d, key = d.get)

In [171]:
max_prob(get_max_probs(ibm2.translation_table["after"]))

'毕'

In [173]:
get_max_probs(ibm2.translation_table["show"])

{'布': 4.654869454497104e-08,
 '径': 4.392909528415727e-12,
 '演': 0.14201526890009336,
 '瓶': 0.0002452198467060133,
 '示': 0.280925476124776,
 '舞': 0.572160506868442,
 '表': 3.2865584560703185e-09,
 '软': 2.8700119581480847e-06,
 '饥': 0.032572484561202576,
 '饿': 0.19999991164810252}

In [176]:
bitext[0].alignment

[(6, 39),
 (8, 23),
 (14, 13),
 (20, 43),
 (11, 23),
 (13, 30),
 (9, 37),
 (18, 23),
 (5, 25),
 (0, 40),
 (1, 24),
 (7, 14),
 (16, 23),
 (3, 40),
 (10, 23),
 (17, 4),
 (4, 6),
 (19, 23),
 (12, 23),
 (15, 37),
 (2, 4)]

In [373]:
#json.load(open(path + "/data_structures/translation_table.json", "r"))

In [71]:
nltk.translate.IBMModel2(bitext, 10, ibm2)

TypeError: 'IBMModel2' object is not subscriptable

TypeError: Object of type 'IBMModel2' is not JSON serializable

In [22]:
os.chdir(path + "/casia2015")
files = glob.glob("*.txt")

c2015entemp = process_eng(open(files[1], "r", encoding = "utf-8").read().split("\n"), conts, nopunct, exclude, e)
#c2015en = eng_sentence_split(c2015entemp)

In [14]:
c2015en = eng_sentence_split(c2015entemp, 3)

NameError: name 'eng_sentence_split' is not defined

*n*-gram for `casia2015_en`:

In [9]:
c2015engram, freqs = trigram(c2015en, {}, {})

In [10]:
#c2015engram

In [14]:
c2015engram["It's gon'"]

{"gon' say": 1}

### NLTK Brown

*n*-gram for NLTK Brown corpus:

In [22]:
def process_NLTK_Brown(conts, nopunct, exclude, e):
    from nltk.corpus import brown
    temp = " ".join([x for x in brown.words(categories = brown.categories())])
    listform = eng_sentence_split(temp)
    return process_eng(listform, conts, nopunct, exclude, e)

In [23]:
browntemp = process_NLTK_Brown(conts, nopunct, exclude, e)

In [26]:
browngram, freqs = trigram(browntemp, c2015engram, freqs)

In [35]:
c2015cntemp = open(files[0], "r", encoding = "utf-8").read().split("\n")

In [36]:
for i in range(100):
    print(c2015cntemp[i], i)

表演的明星是X女孩团队——由一对具有天才技艺的艳舞女孩们组成，其中有些人受过专业的训练。 0
表演的压轴戏是闹剧版《天鹅湖》，男女小人们身着粉红色的芭蕾舞裙扮演小天鹅。 1
表演和后期制作之间的屏障被清除了，这对演员来说一样大有裨益。 2
（表演或背诵时）通过暗示下面忘记或记地不准的东西来帮助某人。 3
表演基本上很精彩--我只对她的技巧稍有意见。 4
表演结束后，我们看到一对对车灯沿主路一路排回镇上，然后散开来各回各家。 5
表演结束后，移走了背景墙，随后全体演员即兴邀请观众上台齐跳并排舞。 6
表演结束后用宣纸轻铺水面，可将水面上的画进行拓印保存。 7
表演结束后，众人期待已久的园游会终于正式开锣，美味可口的素食佳肴让大家一饱口福。 8
表演节目丰富精采，交换礼物的欢乐时刻一到，则形成另一波高潮。 9
表演仅仅是造就一个近乎觉察不出来的「直线的地质的移位」。 10
表演开始十五分钟后，一帮足球运动员开始集体向着舞台上的女演员发出嘘声。 11
表演开始时，舞台上会有一张床，一面镜子，一张椅子以及一位穿着内衣的美女。在你觉察到前，一位性感的女孩会突然变成三位。 12
表演开始时，艺人坐在地毯上轻击盅子，徐缓起舞； 13
表演“猫女”——在一个笼子里穿着带有一根长尾巴和猫耳的豹纹女内衣。 14
表演前，她紧张得浑身颤抖不已。 15
表演是他们的嫡传技艺，150多年来他们家族一直都是演员。 16
表演算得上是一门残忍的职业，你偏离正统美越远，就越艰难。 17
表演我软木塞哪一呼吸和我将表演你一瓶醋。 18
表演：悉尼歌剧院首席男高音丁毅先生悉尼歌剧院首席女高音 19
表演秀以及开园时间有可能因故不经预告而取消或中止，敬请留意。 20
表演一个不限分析国内受损的期望往往带来了个人工作的公司。 21
“表演”一结束，他马上给那名黑人“歹徒”发工钱，然后两个人还握手拥抱。 22
表演艺术家MarniKotak（持玩偶者）在布鲁克林的“显微镜”画廊生下一婴作为她艺术作品的一部分。 23
表演艺术来源于生活而高于生活，生活是艺术创作的来源，取之不尽用之不竭的源泉。 24
表演在安慰疗法中也很重要。安慰性注射虽然比安慰药丸有效果，但是却没有虚假手术来得见效。 25
“表演在本质上给了他们一个正当的理由去学习如何表达自己。”阿伦森说。 26
表演者把信封靠近他的脑袋，然后先给出答案，