# zhsegment: default program

In [8]:
from default import *

## Run the default solution on dev

In [9]:
Pw = Pdist(data=datafile("data/count_1w.txt"))
segmenter = Segment(Pw) # note that the default solution for this homework ignores the unigram counts
output_full = []
with open("data/input/dev.txt") as f:
    for line in f:
        output = " ".join(segmenter.segment(line.strip()))
        output_full.append(output)
print("\n".join(output_full[:3])) # print out the first three lines of output as a sanity check

中 美 在 沪 签 订 高 科 技 合 作 协 议
新 华 社 上 海 八 月 三 十 一 日 电 （ 记 者 白 国 良 、 夏 儒 阁 ）
“ 中 美 合 作 高 科 技 项 目 签 字 仪 式 ” 今 天 在 上 海 举 行 。


## Evaluate the default output

In [10]:
import sys  # needed for sys.stderr below
from zhsegment_check import fscore
with open('data/reference/dev.out', 'r') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
    tally = fscore(ref_data, output_full)
    print("score: {:.10f}".format(tally), file=sys.stderr)


score: 0.2675962075


## Documentation

The default solution segments the text into single characters and treats each character as a word.
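
As a rough illustration of this behaviour (not the code in `default.py`, whose internals are not shown here), character-level segmentation can be sketched as follows. The function name `segment_chars` is hypothetical.

```python
def segment_chars(line):
    """Treat every non-whitespace character as its own word (illustrative only)."""
    return [ch for ch in line if not ch.isspace()]


print(" ".join(segment_chars("中美在沪签订高科技合作协议")))
# prints: 中 美 在 沪 签 订 高 科 技 合 作 协 议
```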

## Analysis

Every character in the text was segmented into its own word, so the F-score was very low (about 0.27). We decided to try the given baseline algorithm next.

# zhsegment: baseline program

In [11]:
from zhsegment_baseline import *

## Run the baseline solution on dev

In [12]:
Pw = Pdist(data1=datafile("data/count_1w.txt"), data2=datafile("data/count_2w.txt"))
segmenter = Segment(Pw) # Pw is built from the unigram and bigram count files
output_full = []
with open("data/input/dev.txt") as f:
    for line in f:
        output = " ".join(segmenter.segment(line.strip()))
        output_full.append(output)
print("\n".join(output_full[:3])) # print out the first three lines of output as a sanity check

中 美 在 沪 签订 高 科技 合作 协议
新华社 上海 八月 三十一日 电 （ 记者 白 国 良 、 夏儒阁 ）
“ 中 美 合作 高 科技 项目 签字 仪式 ” 今天 在 上海 举行 。


## Evaluate the baseline output

In [13]:
from zhsegment_check import fscore
with open('data/reference/dev.out', 'r') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
    tally = fscore(ref_data, output_full)
    print("score: {:.10f}".format(tally), file=sys.stderr)

score: 0.8053163355


## Documentation

While implementing the baseline, we found that the algorithm does not handle unknown words in the text. We made each unknown word an independent word and pushed its entry onto the heap again.
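
The idea can be sketched as a small heap-driven unigram segmenter. This is a minimal sketch under assumptions, not the contents of `zhsegment_baseline.py`: the function name `segment_unigram`, the `max_len` limit, and the unknown-character penalty are all hypothetical choices made for illustration.

```python
import heapq
import math


def segment_unigram(text, counts, max_len=4):
    """Return the most probable unigram segmentation of text as a list of words."""
    total = sum(counts.values()) or 1
    unk_logp = math.log(1.0 / (10 * total))  # assumed penalty for an unknown character

    def candidates(pos):
        found = [text[pos:pos + n] for n in range(1, max_len + 1)
                 if pos + n <= len(text) and counts.get(text[pos:pos + n], 0) > 0]
        # Fallback: an unknown character becomes an independent one-character word.
        return found or [text[pos]]

    def logp(word):
        count = counts.get(word, 0)
        return math.log(count / total) if count > 0 else unk_logp

    # best[p] = (log-probability of the best segmentation of text[:p], start of its last word)
    best = {0: (0.0, None)}
    heap, seen = [0], {0}
    while heap:
        pos = heapq.heappop(heap)  # smallest position first, so best[pos] is final
        if pos == len(text):
            continue
        for word in candidates(pos):
            end = pos + len(word)
            score = best[pos][0] + logp(word)
            if end not in best or score > best[end][0]:
                best[end] = (score, pos)
            if end not in seen:
                seen.add(end)
                heapq.heappush(heap, end)

    # Follow the back-pointers from the end of the text.
    words, pos = [], len(text)
    while pos > 0:
        start = best[pos][1]
        words.append(text[start:pos])
        pos = start
    return words[::-1]


toy_counts = {"今天": 40, "在": 100, "上海": 50, "举行": 30}  # toy counts, not from count_1w.txt
print(" ".join(segment_unigram("今天在上海举行", toy_counts)))
```

Because every position always yields at least one candidate, the search can no longer get stuck on a character that is missing from the dictionary.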

## Analysis

The F-score increased significantly (from about 0.27 to about 0.81), but we observed that when two unknown words appear in a row, our code segments them separately. We decided to implement a bigram method with smoothing and to merge consecutive unknown words into a single word.
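
One way to picture the planned merging of consecutive unknown words is a pass over the tokens that joins every run of out-of-vocabulary tokens. This is a hedged sketch only: `merge_unknown_runs` and its `vocab` argument are hypothetical, and the actual program may do the merging inside the search rather than afterwards.

```python
def merge_unknown_runs(tokens, vocab):
    """Join each run of consecutive tokens that are not in vocab into one token."""
    merged, run = [], []
    for tok in tokens:
        if tok in vocab:
            if run:  # close the current run of unknown tokens
                merged.append("".join(run))
                run = []
            merged.append(tok)
        else:
            run.append(tok)
    if run:  # flush a run that reaches the end of the line
        merged.append("".join(run))
    return merged


tokens = ["记者", "白", "国", "良", "、"]
print(" ".join(merge_unknown_runs(tokens, vocab={"记者", "、"})))
# prints: 记者 白国良 、
```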

# zhsegment: bigram program

In [15]:
from zhsegment_bigram import *

## Run the bigram solution on dev

In [20]:
Pw = Pdist(data1=datafile("data/count_1w.txt"), data2=datafile("data/count_2w.txt"))
segmenter = Segment(Pw) # Pw is built from the unigram and bigram count files
output_full = []
with open("data/input/dev.txt") as f:
    for line in f:
        output = " ".join(segmenter.segment(line.strip()))
        output_full.append(output)
print("\n".join(output_full[318:320])) # print two lines of output (indices 318 and 319) as a sanity check

原 总 兵 力 为 １１ · ６ 万余 人 ， 其中 在 拉脱维亚 驻军 最 多 ， 为 ５ 万余 人 ， 立陶宛 为 ３ · ５ 万余 人 ， 爱沙尼亚 为 ３ 万余 人 。
原 苏联 解体 后 ， 该 军队 集 群转归 俄罗斯 所有 。


## Evaluate the bigram output

In [17]:
from zhsegment_check import fscore
with open('data/reference/dev.out', 'r') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
    tally = fscore(ref_data, output_full)
    print("score: {:.10f}".format(tally), file=sys.stderr)

score: 0.8948552151


## Documentation

The F-score is now about 0.89. We used bigram probabilities to score and compare entries. For smoothing, we implemented Jelinek-Mercer (JM) smoothing, i.e. linear interpolation. A sequence of unknown words is now treated as one word.
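
For reference, JM smoothing mixes the maximum-likelihood bigram estimate with a unigram estimate using a weight λ. The sketch below shows one possible form, not the code in `zhsegment_bigram.py`: the function name `jm_bigram_prob`, the default `lam=0.7`, and the add-one unigram floor for unseen words are assumptions.

```python
def jm_bigram_prob(prev, word, unigram, bigram, lam=0.7):
    """Interpolated estimate of P(word | prev): lam * P_ML(word | prev) + (1 - lam) * P(word)."""
    total = sum(unigram.values())
    vocab = len(unigram)
    # Assumption: add-one unigram estimate with one extra slot for unseen words,
    # so the interpolated probability is never zero.
    p_uni = (unigram.get(word, 0) + 1) / (total + vocab + 1)
    prev_count = unigram.get(prev, 0)
    # The maximum-likelihood bigram estimate is 0 when prev was never seen.
    p_bi = bigram.get((prev, word), 0) / prev_count if prev_count else 0.0
    return lam * p_bi + (1 - lam) * p_uni


toy_unigram = {"上海": 3, "举行": 1}      # toy counts, not from count_1w.txt
toy_bigram = {("上海", "举行"): 1}        # toy counts, not from count_2w.txt
print(jm_bigram_prob("上海", "举行", toy_unigram, toy_bigram, lam=0.5))
```

A larger `lam` trusts the bigram counts more; a smaller one falls back more heavily on the unigram estimate.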

## Analysis

We found that numbers adjacent to unknown words were still segmented apart, so we forced them to be one word.
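
One possible way to force a number to stay together is to recognise a numeric span with a regular expression and offer it to the segmenter as a single candidate word. The sketch below is an assumption-laden illustration: the helper `number_span`, the pattern, and the choice of unit characters (万, 亿) are hypothetical and were guessed from the sample output, not taken from `zhsegment_rec.py`.

```python
import re

# ASCII or full-width digits, optional "·" or "." separated groups, optional unit character.
NUMBER = re.compile(r"[0-9０-９]+(?:[·.．][0-9０-９]+)*[万亿]?")


def number_span(text, pos):
    """Return the numeric span starting exactly at pos, or None if there is none."""
    match = NUMBER.match(text, pos)
    return match.group(0) if match else None


print(number_span("１１·６万余人", 0))
# prints: １１·６万
```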

# zhsegment: final program

In [21]:
from zhsegment_rec import *

## Run the final solution on dev

In [22]:
Pw = Pdist(data1=datafile("data/count_1w.txt"), data2=datafile("data/count_2w.txt"))
segmenter = Segment(Pw) # Pw is built from the unigram and bigram count files
output_full = []
with open("data/input/dev.txt") as f:
    for line in f:
        output = " ".join(segmenter.segment(line.strip()))
        output_full.append(output)
print("\n".join(output_full[318:320])) # print two lines of output (indices 318 and 319) as a sanity check

原 总 兵 力 为 １１·６万 余 人 ， 其中 在 拉脱维亚 驻军 最 多 ， 为 ５万 余 人 ， 立陶宛 为 ３·５万 余 人 ， 爱沙尼亚 为 ３万 余 人 。
原 苏联 解体 后 ， 该 军队 集 群转归 俄罗斯 所有 。


## Evaluate the final output

In [23]:
from zhsegment_check import fscore
with open('data/reference/dev.out', 'r') as refh:
    ref_data = [str(x).strip() for x in refh.read().splitlines()]
    tally = fscore(ref_data, output_full)
    print("score: {:.10f}".format(tally), file=sys.stderr)

score: 0.8961249033


## Documentation

The score increased only slightly (from about 0.8949 to about 0.8961), but numbers and unknown words are now combined into single words.

## Analysis

When a word is unique, the program still segments it into single characters.