# Lab 2: Language Models

Student: John Wu

In [1]:
import charlm as clm
import math

## (a) Simple Character LM

Train a character language model on the subtitles corpus, using the previous four characters as context (`order=4`).

In [2]:
mdl = clm.train_char_lm('./data/subtitles.txt', order=4)
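
The internals of `charlm` are not shown in this notebook. As a rough sketch, assuming it follows the usual count-and-normalise approach, a character LM of a given order can be built by counting which character follows each history of that length. The name `train_char_lm_sketch` and the `~` padding symbol are illustrative assumptions, not the module's actual API.

```python
from collections import Counter, defaultdict


def train_char_lm_sketch(text, order=4):
    """Map each `order`-character history to a dict of next-character probabilities."""
    pad = "~" * order  # assumed padding symbol so the first characters have a history
    data = pad + text
    counts = defaultdict(Counter)
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        counts[history][char] += 1
    # Normalise the counts into relative frequencies (maximum-likelihood estimates).
    return {
        history: {c: n / sum(counter.values()) for c, n in counter.items()}
        for history, counter in counts.items()
    }


# Toy corpus: "tter" is followed by a space twice and by a comma once.
toy = train_char_lm_sketch("better letter, bitter butter", order=4)
print(sorted(toy["tter"].items()))
```

With `order=0` the history is the empty string, so the same function would yield a unigram model.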

Continuations for `itte`

In [3]:
clm.print_probs(mdl, 'itte')

[('n', 0.30644454186760195),
 ('r', 0.28233231039017975),
 ('d', 0.2770714598860149),
 ('e', 0.11047786058746165),
 (' ', 0.007452871547566856),
 ('s', 0.0070144673388864535),
 ('.', 0.003068829460762823),
 ('l', 0.0021920210434020165),
 (',', 0.0017536168347216134),
 ('\n', 0.00043840420868040335),
 ('!', 0.00043840420868040335),
 ("'", 0.00043840420868040335),
 ('k', 0.00043840420868040335),
 ('t', 0.00043840420868040335)]


Continuations for `supe`

In [4]:
clm.print_probs(mdl, 'supe')

[('r', 0.9992144540455616), ('s', 0.0007855459544383347)]


Continuations for `ther`

In [5]:
clm.print_probs(mdl, 'ther')

[('e', 0.41045090602612727),
 (' ', 0.3175606525796159),
 ('.', 0.10865089398591295),
 (',', 0.038119318523869725),
 ('s', 0.03071458671964361),
 ("'", 0.022394798627415568),
 ('?', 0.02027572090783216),
 ('i', 0.008295707663596412),
 ('!', 0.007489013304436819),
 ('w', 0.00744085244717356),
 ('f', 0.006983324303172596),
 ('a', 0.005911745229065077),
 ('-', 0.003383300222743965),
 ('\n', 0.0029498525073746312),
 ('n', 0.002925772078743002),
 ('m', 0.0018662332189512973),
 ('l', 0.0010354584311600746),
 ('h', 0.0009752573595810005),
 ('"', 0.0006260911444223707),
 ('t', 0.00046956835831677806),
 (':', 0.00025284450063211124),
 ('b', 0.00018060321473722233),
 ('y', 0.0001685630004214075),
 (']', 0.00015652278610559268),
 (')', 0.0001083619288423334),
 ('o', 9.632171452651857e-05),
 ('/', 6.0201071579074104e-05),
 (';', 6.0201071579074104e-05),
 ('g', 6.0201071579074104e-05),
 ('´', 6.0201071579074104e-05),
 ('j', 4.816085726325929e-05),
 ('c', 3.6120642947444464e-05),
 ('\xa0', 3.6120642

Random Sentences

In [6]:
print(clm.generate_text(mdl, 4, 80) + '\n')
print(clm.generate_text(mdl, 4, 80) + '\n')
print(clm.generate_text(mdl, 4, 80) + '\n')

- And be a reposition, Tony.
- In get think that I expective you'd have night to

You real you meeting back in they and shout then I this he do.
- Mom.
- What I k

Okay, the Cather up the on, James, isn't you.
- Senora?
He really.
- Diddy hope 
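
Text like the samples above can be produced by repeatedly sampling the next character from the distribution for the current history. The sketch below assumes a model stored as a dict from history to next-character probabilities; `generate_text_sketch`, the `~` padding symbol, and the hand-written toy model are hypothetical and stand in for whatever `clm.generate_text` does internally.

```python
import random


def generate_text_sketch(lm, order, n_chars, seed=None):
    """Sample up to `n_chars` characters from a history -> distribution dict."""
    rng = random.Random(seed)
    history = "~" * order  # assumed start-of-text padding
    out = []
    for _ in range(n_chars):
        dist = lm.get(history)
        if dist is None:  # unseen history: nothing to sample from
            break
        chars, probs = zip(*dist.items())
        c = rng.choices(chars, weights=probs, k=1)[0]
        out.append(c)
        # Slide the context window; an order-0 model keeps an empty history.
        history = (history + c)[-order:] if order > 0 else ""
    return "".join(out)


# Hand-written order-1 toy model: "b" is always followed by "a".
toy_lm = {
    "~": {"a": 1.0},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 1.0},
}
text = generate_text_sketch(toy_lm, order=1, n_chars=20, seed=0)
print(text)
```

Because each character depends only on a short history, the output is locally plausible but drifts globally, which matches the character of the generated subtitle lines.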



## (b) Calculate Perplexity

In [7]:
sent = 'The car eats forks and knives'
print(clm.perplexity(sent, mdl, 4))

7.160948621456589


In [8]:
sent = 'The yob eats forks and knives'
print(clm.perplexity(sent, mdl, 4))

inf


In [9]:
sent = 'The student loves homework'
print(clm.perplexity(sent, mdl, 4))

4.606972940490917


In [10]:
sent = 'It is raining in London'
print(clm.perplexity(sent, mdl, 4))

3.7112360009044503


In [11]:
sent = 'asdfjkl; qwerty'
print(clm.perplexity(sent, mdl, 4))

inf
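
The `inf` results arise whenever a sentence contains a character sequence that never occurred in training, since a single zero probability makes the whole product zero. A minimal sketch of the calculation, assuming perplexity is the exponential of the negative mean log-probability per character (the exact normalisation used by `clm.perplexity` is not shown here, and `perplexity_sketch` and the toy model are hypothetical):

```python
import math


def perplexity_sketch(text, lm, order):
    """exp(-mean log p) over the characters of `text`; inf if any p is zero."""
    if not text:
        raise ValueError("text must be non-empty")
    data = "~" * order + text  # assumed start-of-text padding
    log_prob = 0.0
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        p = lm.get(history, {}).get(char, 0.0)
        if p == 0.0:
            return math.inf  # unseen event: unsmoothed probability is zero
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))


# Hand-written order-1 toy model.
toy_lm = {
    "~": {"a": 1.0},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 1.0},
}
pp_seen = perplexity_sketch("ab", toy_lm, 1)    # p = 1.0 * 0.5
pp_unseen = perplexity_sketch("ac", toy_lm, 1)  # "c" never follows "a"
print(pp_seen, pp_unseen)
```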


## (c) Naive smoothing

In [12]:
sent = 'The car eats forks and knives'
print(clm.smoothed_perplexity(sent, mdl, 4))

7.160948621456589


In [13]:
sent = 'The yob eats forks and knives'
print(clm.smoothed_perplexity(sent, mdl, 4))

49.00672281265043


In [14]:
sent = 'The student loves homework'
print(clm.smoothed_perplexity(sent, mdl, 4))

4.606972940490917


In [15]:
sent = 'It is raining in London'
print(clm.smoothed_perplexity(sent, mdl, 4))

3.7112360009044503


In [16]:
sent = 'asdfjkl; qwerty'
print(clm.smoothed_perplexity(sent, mdl, 4))

2344635.520980603
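
Sentences without unseen sequences score exactly the same as in section (b), which suggests that the smoothing only replaces zero probabilities and leaves seen ones untouched. One simple way to get that behaviour is to substitute a small floor value; the sketch below assumes this scheme, and both `smoothed_perplexity_sketch` and the floor of `1e-6` are hypothetical choices rather than what `clm.smoothed_perplexity` necessarily does.

```python
import math


def smoothed_perplexity_sketch(text, lm, order, floor=1e-6):
    """Perplexity with zero probabilities replaced by `floor` (an assumed constant)."""
    if not text:
        raise ValueError("text must be non-empty")
    data = "~" * order + text  # assumed start-of-text padding
    log_prob = 0.0
    for i in range(len(data) - order):
        history, char = data[i:i + order], data[i + order]
        p = lm.get(history, {}).get(char, 0.0)
        log_prob += math.log(p if p > 0.0 else floor)
    return math.exp(-log_prob / len(text))


# Hand-written order-1 toy model.
toy_lm = {
    "~": {"a": 1.0},
    "a": {"a": 0.5, "b": 0.5},
    "b": {"a": 1.0},
}
sp_seen = smoothed_perplexity_sketch("ab", toy_lm, 1)    # unchanged by smoothing
sp_unseen = smoothed_perplexity_sketch("ac", toy_lm, 1)  # finite, but heavily penalised
print(sp_seen, sp_unseen)
```

Note that flooring does not renormalise the distribution, so the resulting values are useful for ranking models rather than as true probabilities, and their magnitude depends on the chosen floor.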


## (d) Language identification

List of languages and reading input files.

In [17]:
LANGUAGES = ['fr', 'de', 'da', 'en', 'it', 'nl']
langFilePat = './data/%s.train.txt'

with open('./data/test.txt', 'r') as f:
    testData = [l.split('\t') for l in f.read().splitlines()]

Functions for training character language models, identification of languages, and assessment of language ID results.

In [18]:
def trainCharLM(namePattern, choices, order=0):
    '''Train one character language model per entry in `choices`.
    `namePattern` is a file name pattern with one %s slot for the code;
    `order` is the number of preceding characters used as context.'''
    lms = dict()  # dict for storing LMs, key=language code
    for c in choices:  # loop over languages
        lms[c] = clm.train_char_lm(namePattern % c, order)
    return lms

def choiceID(txt, lms, options, order):
    '''Identify the language of an input text using the given language models.
    For each language model, calculate the smoothed perplexity of the input text.
    The language with the lowest perplexity is the prediction.
    '''
    scores = dict()
    for c in options:
        scores[c] = clm.smoothed_perplexity(txt, lms[c], order)
    return min(scores, key=scores.get)

def assessLMs(options, taggedData, lms, order):
    '''For the tagged test data, count correct predictions per tag
    and print the accuracy for each tag.
    '''
    accuracy = {c:[0,0] for c in options}
    for tag, txt in taggedData:
        accuracy[tag][1] += 1
        pred = choiceID(txt, lms, options, order)
        if pred == tag:
            accuracy[tag][0] += 1
    
    for tag,(num,denom) in accuracy.items():
        print('%s: %u correct out of %u -- %.1f%%' % 
              (tag,num,denom,num/denom*100))
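
Per-language accuracy does not show which languages get confused with each other. One possible extension, sketched here, counts (true tag, predicted tag) pairs. `confusion_counts`, `toy_predict`, and the toy data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def confusion_counts(tagged_data, predict):
    """Count (true_tag, predicted_tag) pairs; `predict` maps a text to a tag."""
    return Counter((tag, predict(txt)) for tag, txt in tagged_data)


def toy_predict(txt):
    # Stand-in predictor for illustration only: guess German on umlaut or eszett.
    return 'de' if ('ü' in txt or 'ß' in txt) else 'nl'


toy_data = [('de', 'Grüße'), ('de', 'Danke'), ('nl', 'Dank je')]
conf = confusion_counts(toy_data, toy_predict)
for (true_tag, pred_tag), n in sorted(conf.items()):
    print('true=%s predicted=%s count=%u' % (true_tag, pred_tag, n))
```

With the functions defined in this notebook, the predictor could be `lambda txt: choiceID(txt, lms, options, order)` applied to `testData`.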

__Unigram models__

In [19]:
u_lms = trainCharLM(langFilePat, LANGUAGES, 0)
assessLMs(LANGUAGES, testData, u_lms, 0)

fr: 194 correct out of 200 -- 97.0%
de: 178 correct out of 200 -- 89.0%
da: 187 correct out of 200 -- 93.5%
en: 187 correct out of 200 -- 93.5%
it: 188 correct out of 200 -- 94.0%
nl: 177 correct out of 200 -- 88.5%


For the first test sentence, the perplexity scores under each language's unigram model are shown below:

In [20]:
print(testData[0][1] + '\n')
for l in LANGUAGES:
    tmp = clm.smoothed_perplexity(testData[0][1], u_lms[l], 0)
    print('Language: %s, perplexity: %f'%(l,tmp))

Quels que soient les témoins qui viendront déposer à la barre, M. Milosevic aura le droit de les contre-interroger.

Language: fr, perplexity: 21.230361
Language: de, perplexity: 29.177491
Language: da, perplexity: 28.990266
Language: en, perplexity: 31.052441
Language: it, perplexity: 23.153187
Language: nl, perplexity: 26.319383


__Bi-gram models__

In [21]:
bi_lms = trainCharLM(langFilePat, LANGUAGES, 1)
assessLMs(LANGUAGES, testData, bi_lms, 1)

fr: 196 correct out of 200 -- 98.0%
de: 198 correct out of 200 -- 99.0%
da: 198 correct out of 200 -- 99.0%
en: 200 correct out of 200 -- 100.0%
it: 200 correct out of 200 -- 100.0%
nl: 198 correct out of 200 -- 99.0%


__4-gram models__

In [22]:
four_lms = trainCharLM(langFilePat, LANGUAGES, 3)
assessLMs(LANGUAGES, testData, four_lms, 3)

fr: 199 correct out of 200 -- 99.5%
de: 199 correct out of 200 -- 99.5%
da: 198 correct out of 200 -- 99.0%
en: 200 correct out of 200 -- 100.0%
it: 200 correct out of 200 -- 100.0%
nl: 199 correct out of 200 -- 99.5%


__Summary of Results__

Accuracy generally improves as the model order increases. With 4-gram models, accuracy is at least 99% for every language, which is probably as good as one could hope for. With unigram models, German (89.0%) and Dutch (88.5%) are identified less accurately than the Romance languages French (97.0%) and Italian (94.0%); the other two Germanic languages, English and Danish, fall in between at 93.5% each.

## (e) Gender Bias

Create the input files: split the training data by gender tag and read the tagged test data.

In [23]:
genderFilePat = './data/tennis_%s.txt'
GENDERS = ['M', 'F']

# split the training file into two separate input files, one per gender tag
with open('./data/tennis.train.txt', 'r') as f, \
     open('./data/tennis_M.txt', 'w') as fMale, \
     open('./data/tennis_F.txt', 'w') as fFemale:
    for line in f.read().splitlines():
        tks = line.split('\t')
        if tks[0] == 'M':
            fMale.write(tks[1] + '\n')
        else:
            fFemale.write(tks[1] + '\n')

# read test file as tagged text data
with open('./data/tennis.test.txt', 'r') as f:
    tennisTest = [l.split('\t') for l in f.read().splitlines()]

For this section, we reuse the functions defined in section (d).

__Unigram Models__

In [24]:
mf_LM_u = trainCharLM(genderFilePat, GENDERS, 0)
assessLMs(GENDERS, tennisTest, mf_LM_u, 0)

M: 2736 correct out of 4518 -- 60.6%
F: 1845 correct out of 3696 -- 49.9%


__Bi-gram Models__

In [25]:
mf_LM_bi = trainCharLM(genderFilePat, GENDERS, 1)
assessLMs(GENDERS, tennisTest, mf_LM_bi, 1)

M: 2559 correct out of 4518 -- 56.6%
F: 2416 correct out of 3696 -- 65.4%


__Four-gram Models__

In [26]:
mf_LM_four = trainCharLM(genderFilePat, GENDERS, 3)
assessLMs(GENDERS, tennisTest, mf_LM_four, 3)

M: 2935 correct out of 4518 -- 65.0%
F: 2470 correct out of 3696 -- 66.8%


__Five-gram Models__

In [27]:
mf_LM_five = trainCharLM(genderFilePat, GENDERS, 4)
assessLMs(GENDERS, tennisTest, mf_LM_five, 4)

M: 2946 correct out of 4518 -- 65.2%
F: 2296 correct out of 3696 -- 62.1%


__Summary of results__

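The unigram models beat chance for the male class (60.6%) but not for the female class (49.9%). Moving to bi-gram models raised female accuracy to 65.4%, while male accuracy dropped to 56.6%. The four-gram models gave the most balanced result, at 65.0% (male) and 66.8% (female). With five-gram models, male accuracy was essentially unchanged (65.2%) but female accuracy fell to 62.1%, likely because the higher-order models overfit the training data and generalize poorly to unseen text.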
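
Because the two classes differ in size (4518 male vs. 3696 female test questions), it can help to compare overall accuracy against a majority-class baseline. The short calculation below uses only the counts printed above; the helper name `overall_accuracy` is illustrative.

```python
# (correct, total) per class, copied from the outputs above
results = {
    'unigram':   {'M': (2736, 4518), 'F': (1845, 3696)},
    'bi-gram':   {'M': (2559, 4518), 'F': (2416, 3696)},
    'four-gram': {'M': (2935, 4518), 'F': (2470, 3696)},
    'five-gram': {'M': (2946, 4518), 'F': (2296, 3696)},
}


def overall_accuracy(per_class):
    """Total correct predictions divided by total test items."""
    correct = sum(c for c, _ in per_class.values())
    total = sum(t for _, t in per_class.values())
    return correct / total


# Accuracy of always predicting the larger class (M).
majority_baseline = 4518 / (4518 + 3696)
print('majority baseline: %.1f%%' % (majority_baseline * 100))
for name, per_class in results.items():
    print('%s: %.1f%%' % (name, overall_accuracy(per_class) * 100))
```

Under this measure the unigram models are barely above the baseline of about 55%, and the four-gram models are the best of the four configurations.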