# ensegment: default program

In [2]:
from ensegment import *

## Documentation


Using the default solution, we get the following result on the dev dataset:

In [3]:
Pw = Pdist(data=datafile("../data/count_1w.txt"))
segmenter = Segment(Pw)
with open("../data/input/dev.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

choose spain
this is a test
who represents
experts exchange
speed of art
unclimatechangebody
we are the people
mentionyourfaves
now playing
the walking dead
follow me
we are the people
mentionyourfaves
check domain
big rock
name cheap
apple domains
honesty hour
being human
follow back
social media
30secondstoearth
current ratesoughttogodown
this is insane
what is my name
is it time
let us go
me too
nowthatcherisdead
advice for young journalists


The results above show that the model failed to segment some inputs, such as "unclimatechangebody" and "mentionyourfaves". Interestingly enough, neither "unclimatechangebody" nor "mentionyourfaves" exists in the count dictionary file "count_1w.txt".

The problem arises because the code assigns unknown words (those not present in "count_1w.txt") a probability of 1/N, where N is the total number of words in the corpus. This can be seen in the "Pdist" class, where the default "missingfn" is defined as follows:

In [7]:
# self.missingfn = missingfn or (lambda k, N: 1./N)

As a result, the probability of a segmentation such as "un climate change body", which is the product p1 * p2 * p3 * p4 (where p1, p2, p3, and p4 are the probabilities of "un", "climate", "change", and "body"), can be smaller than 1/N. When that happens, the unsegmented string, scored as a single unknown word, wins.
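To make this comparison concrete, here is a minimal sketch. The corpus size and the word counts are made-up values chosen only for illustration (they are not read from "count_1w.txt"), and `unigram_prob` and `sequence_prob` are hypothetical helpers, not part of ensegment:

```python
# Hypothetical corpus size and counts, for illustration only.
# These are NOT the real values from count_1w.txt.
N = 1_000_000_000_000
counts = {
    "un": 50_000_000,
    "climate": 20_000_000,
    "change": 100_000_000,
    "body": 80_000_000,
}

def unigram_prob(word):
    """Relative frequency for a known word, 1/N for an unknown word."""
    if word in counts:
        return counts[word] / N
    return 1.0 / N

def sequence_prob(words):
    """Probability of a segmentation: the product of its word probabilities."""
    p = 1.0
    for w in words:
        p *= unigram_prob(w)
    return p

segmented = sequence_prob(["un", "climate", "change", "body"])
unsegmented = sequence_prob(["unclimatechangebody"])

# With these assumed counts, the four-word product is smaller than 1/N,
# so a maximizer would keep the string unsegmented.
print(segmented < unsegmented)
```

Every factor in the product is below 1, so each additional word shrinks the segmented score, while the unknown string pays the fixed price 1/N only once.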

A more reasonable approach seems to be to set the probability of unknown words to 0. In other words, we define "missingfn" so that it returns 0, as follows:

In [4]:
Pw = Pdist(data=datafile("../data/count_1w.txt"), missingfn=(lambda k, N: 0))
segmenter = Segment(Pw)
with open("../data/input/dev.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

choose spain
this is a test
who represents
experts exchange
speed of art
un climate change body
we are the people
mention your faves
now playing
the walking dead
follow me
we are the people
mention your faves
check domain
big rock
name cheap
apple domains
honesty hour
being human
follow back
social media
3 0 seconds to earth
current rate sought to go down
this is insane
what is my name
is it time
let us go
me too
now thatcher is dead
advice for young journalists


However, another problem arises, which is more apparent in the test data:

In [7]:
Pw = Pdist(data=datafile("../data/count_1w.txt"), missingfn=(lambda k, N: 0))
segmenter = Segment(Pw)
with open("../data/input/test.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

h o w t o b r e a k u p i n 5 words
what makes god smile
1 0 people who mean alot to me
w o r s t d a y i n 4 words
l o v e s t o r y i n 5 words
t o p 3 favourite comics
1 0 breakup lines
things that make you smile
best female athlete
w o r s t b o s s i n 5 words
now is the time for all good
it is a truth universally acknowledged
when in the course of human events it becomes necessary
it was a bright cold day in april and the clocks were striking thirteen
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
as gregor samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect
in a hole in the ground there lived a hobbit not a nasty dirty wet hole filled with the ends of worms and an oozy smell nor yet a dry bare sandy hole with nothing in it to sitdown on or to eat it was a hobbit hole and that means comfort
far out in the uncharted backwaters of the unfashionable end of the western spi

The issue is that once the recursive segmentation algorithm reaches an unknown word (or character), it splits all the words to its left into single letters. This happens because single-letter words are the last segmentation to be checked and are always picked by the max function, even though their probability is 0. For instance, look at "h o w t o b r e a k u p i n 5 words", in which "5" is an unknown word.
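The underlying reason can be sketched as follows: with a zero probability for unknown words, every segmentation that contains the unknown token scores exactly 0, so the score alone cannot separate a sensible split from a letter-by-letter one. The probabilities below are hypothetical, and `word_prob` and `score` are illustrative helpers rather than ensegment functions. Which of the tied candidates is finally returned depends on the order in which the implementation examines them.

```python
# Hypothetical probabilities for illustration only; "5" is deliberately
# left out so that it is treated as an unknown word.
probs = {
    "how": 1e-3, "to": 1e-2, "break": 1e-4,
    "up": 1e-3, "in": 1e-2, "words": 1e-4,
}

def word_prob(word):
    """Known words use the table; unknown words get probability 0."""
    return probs.get(word, 0.0)

def score(words):
    """Product of word probabilities for one candidate segmentation."""
    p = 1.0
    for w in words:
        p *= word_prob(w)
    return p

sensible = ["how", "to", "break", "up", "in", "5", "words"]
letters = list("howtobreakupin") + ["5", "words"]

# Both candidates contain the unknown "5", so both products collapse to 0
# and the maximization has nothing to distinguish them by.
print(score(sensible), score(letters))
```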

Therefore, we decided to assign a non-zero probability to unknown words again, but this time we want the probability to decrease exponentially with the length of the unknown word.
Put differently, the probability of an unknown word of length 5 is much lower than that of an unknown word of length 1. This approach also avoids the issue we had in the first place with missingfn=(lambda k, N: 1./N).
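As a rough illustration of how such a length penalty behaves, the sketch below evaluates it for unknown strings of different lengths. The value of N and the segmented score are assumed numbers picked for the example, and `length_penalized` is a hypothetical name for the same expression used as "missingfn":

```python
# Hypothetical corpus size, for illustration only.
N = 1_000_000_000_000

def length_penalized(k, N):
    """Unknown-word probability that shrinks by a factor of 1/N per character."""
    return pow(1.0 / N, len(k))

single_char = length_penalized("5", N)                  # same as 1/N
five_chars = length_penalized("spain", N)               # (1/N) ** 5
long_string = length_penalized("unclimatechangebody", N)

# Assumed score of a four-word segmentation built from known words.
hypothetical_segmented_score = 8e-18

# A long unknown string is now far less likely than the segmented
# alternative, while a single unknown character still costs only 1/N.
print(single_char, five_chars, long_string)
print(long_string < hypothetical_segmented_score)
```

A single unknown character such as "5" keeps the probability 1/N, so it is never worse than a zero score, whereas a long unsegmented string is penalized once per character and loses to a segmentation made of known words.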

In [10]:
Pw = Pdist(data=datafile("../data/count_1w.txt"), missingfn=(lambda k, N: pow(1./N, len(k))))
segmenter = Segment(Pw)
with open("../data/input/test.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

how to breakup in 5 words
what makes god smile
1 0 people who mean alot to me
worst day in 4 words
love story in 5 words
top 3 favourite comics
1 0 breakup lines
things that make you smile
best female athlete
worst boss in 5 words
now is the time for all good
it is a truth universally acknowledged
when in the course of human events it becomes necessary
it was a bright cold day in april and the clocks were striking thirteen
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
as gregor samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect
in a hole in the ground there lived a hobbit not a nasty dirty wet hole filled with the ends of worms and an oozy smell nor yet a dry bare sandy hole with nothing in it to sitdown on or to eat it was a hobbit hole and that means comfort
far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the galaxy lies a small 

Now, with missingfn=(lambda k, N: pow(1./N, len(k))), the segmentations are much better, and the model finds the correct segmentation for almost all of the test data.
We used the same approach and reached an F-score of 0.98 on the dev dataset.
So, this is our final submission for HW0.