# Ensegment: Punish Longer Words

In [1]:
from ensegment import *

## Documentation

First, let's run the regular version on the dev data.

In [3]:
Pw = Pdist(data=datafile("../data/count_1w.txt"))
segmenter = Segment(Pw)
with open("../data/input/dev.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

choose spain
this is a test
who represents
experts exchange
speed of art
unclimatechangebody
we are the people
mentionyourfaves
now playing
the walking dead
follow me
we are the people
mentionyourfaves
check domain
big rock
name cheap
apple domains
honesty hour
being human
follow back
social media
30secondstoearth
current ratesoughttogodown
this is insane
what is my name
is it time
let us go
me too
nowthatcherisdead
advice for young journalists


## Analysis

Some of the longer strings should have been split. Perhaps we should focus on how out-of-vocabulary words are handled. Let's add the function below, which punishes longer words by making the denominator grow exponentially with word length. We first try a base of 10, since that is the value used in the referenced book chapter.

In [4]:
def punishLong(word, N):
    "Estimate the probability of an unknown word, penalizing longer words."
    # Each additional character divides the estimate by 10.
    return 10. / (N * 10**len(word))
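
To get a feel for how steep this penalty is, the sketch below evaluates the same formula for unknown words of increasing length. The token count used here is an assumed round number, not the value that Pdist computes from count_1w.txt, and punish_long_sketch is a stand-in name used only for illustration.

```python
def punish_long_sketch(word, N):
    """Same formula as punishLong above: 10 / (N * 10**len(word))."""
    return 10. / (N * 10 ** len(word))

# ASSUMPTION: a round token count for illustration only; the real N
# comes from the counts file loaded by Pdist.
N_ASSUMED = 10 ** 12

for word in ["x", "xy", "xyz", "unclimatechangebody"]:
    p = punish_long_sketch(word, N_ASSUMED)
    print("%2d chars -> %.1e" % (len(word), p))
```

Each extra character lowers the estimate by a factor of 10, so a long unknown token is scored very low and a split into known words has a much better chance of winning.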


In [5]:
Pw = Pdist(data=datafile("../data/count_1w.txt"), missingfn=punishLong)
segmenter = Segment(Pw)
with open("../data/input/dev.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

choose spain
this is a test
who represents
experts exchange
speed of art
un climate change body
we are the people
mention your faves
now playing
the walking dead
follow me
we are the people
mention your faves
check domain
big rock
name cheap
apple domains
honesty hour
being human
follow back
social media
30 seconds to earth
current rate sought to go down
this is insane
what is my name
is it time
let us go
me too
now thatcher is dead
advice for young journalists


Looks much better now. Let's check the test data.

In [6]:
with open("../data/input/test.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

how to breakup in 5 words
what makes god smile
10 people who mean alot to me
worst day in 4 words
love story in 5 words
to p3 favourite comics
10 breakup lines
things that make you smile
best female athlete
worst bossin5 words
now is the time for all good
it is a truth universally acknowledged
when in the course of human events it becomes necessary
it was a bright cold day in april and the clocks were striking thirteen
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
as gregor samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect
in a hole in the ground there lived a hobbit not a nasty dirty wet hole filled with the ends of worms and an oozy smell nor yet a dry bare sandy hole with nothing in it to sitdown on or to eat it was a hobbit hole and that means comfort
far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the galaxy lies a small un r

There are still some errors, at least two of them involving numbers: we get "to p3" instead of "top 3", and "bossin5" is not split. Maybe a larger penalty for word length will help? Let's see what happens when we replace 10 with 50.
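
Rather than editing the constant by hand for each experiment, the base could be made a parameter. The sketch below is one way to do that. make_punish_long is a hypothetical helper (not part of ensegment), and it assumes only that missingfn is called as missingfn(word, N), as in the cells above. The example words are treated as if they were unknown, which has not been checked against the counts file.

```python
def make_punish_long(base):
    """Return a missing-word function that divides by `base` per extra character."""
    def punish(word, N):
        return float(base) / (N * base ** len(word))
    return punish

punish10 = make_punish_long(10)   # same formula as punishLong above
punish50 = make_punish_long(50)   # a larger length penalty

# ASSUMPTION: round token count for illustration only.
N_ASSUMED = 10 ** 12
for word in ["3", "p3", "bossin5"]:
    print(word, punish10(word, N_ASSUMED), punish50(word, N_ASSUMED))

# Hypothetical usage with the notebook's objects:
# Pw = Pdist(data=datafile("../data/count_1w.txt"), missingfn=make_punish_long(50))
```

Note that a one-character unknown word gets probability 1/N for any base, so changing the base only affects unknown words of two or more characters.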

In [8]:
def punishLong(word, N):
    "Estimate the probability of an unknown word, penalizing longer words."
    # Each additional character now divides the estimate by 50.
    return 50. / (N * 50**len(word))

Pw = Pdist(data=datafile("../data/count_1w.txt"), missingfn=punishLong)
segmenter = Segment(Pw)
with open("../data/input/test.txt") as f:
    for line in f:
        print(" ".join(segmenter.segment(line.strip())))

how to breakup in 5 words
what makes god smile
10 people who mean alot to me
worst day in 4 words
love story in 5 words
top 3 favourite comics
10 breakup lines
things that make you smile
best female athlete
worst boss in 5 words
now is the time for all good
it is a truth universally acknowledged
when in the course of human events it becomes necessary
it was a bright cold day in april and the clocks were striking thirteen
it was the best of times it was the worst of times it was the age of wisdom it was the age of foolishness
as gregor samsa awoke one morning from uneasy dreams he found himself transformed in his bed into a gigantic insect
in a hole in the ground there lived a hobbit not a nasty dirty wet hole filled with the ends of worms and an oozy smell nor yet a dry bare sandy hole with nothing in it to sitdown on or to eat it was a hobbit hole and that means comfort
far out in the uncharted backwaters of the unfashionable end of the western spiral arm of the galaxy lies a small un

This looks better than before: "to p3" is now "top 3" and "bossin5" is now "boss in 5". However, "un regarded" should be a single word. Otherwise it looks great.
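
A rough way to see why the larger base fixes "to p3": relative to a one-character unknown word, an unknown word of length n is penalized by a factor of base**(1 - n). The sketch below works this through under assumptions that are not verified here: that both "p3" and "3" are missing from the counts file, and that the segmenter scores a candidate as the product of its word probabilities. relative_penalty is a hypothetical helper for this illustration.

```python
def relative_penalty(length, base):
    """Unknown-word probability relative to a 1-character unknown word."""
    return float(base) ** (1 - length)

for base in (10, 50):
    # "to p3" pays for a 2-character unknown word; "top 3" for a 1-character one.
    threshold = relative_penalty(2, base) / relative_penalty(1, base)
    print("base %d: 'top 3' wins if P(top)/P(to) > %.2f" % (base, threshold))
```

Under these assumptions the threshold drops from 0.10 to 0.02 when the base goes from 10 to 50, which would be consistent with the ratio P(top)/P(to) lying somewhere between the two.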