Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 4
----

Names & Sections
----
Names: Alec Condry (4120) and Shrihari Subramaniam (4120)

Task 4: Compare your generated sentences (15 points)
----

In this task, you'll analyze one of the files that you produced in Task 3. You'll need to compare against the corresponding file that we have provided for you that was generated from the vanilla n-gram language model.

Choose *__one__* of the following two options.

Option 1: Evaluate the generated words of *character*-based models
---

Your job for this option is to programmatically measure two things:
1. the percentage of words produced by each model that are valid English words.
2. the percentage of words produced by each model that are valid English words *and* were not seen at train time.

For this task, a word is defined as "characters between _ " or "characters between spaces" (if you replaced your underscores with spaces when you printed out your new sentences).
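
One way to read this definition in code, as a minimal sketch: split each generated line on runs of underscores or whitespace and drop any empty pieces. The helper name `split_words` is made up for illustration and is not part of the provided utilities.

```python
import re

def split_words(line):
    """Split a generated line into words on underscores and/or whitespace.

    Hypothetical helper for illustration; not part of neurallm_utils.
    """
    # re.split can yield empty strings at the edges, so filter them out
    return [w for w in re.split(r"[_\s]+", line) if w]

print(split_words("the_old_house"))
print(split_words("the old house\n"))
```

Both calls should yield the same three words, so the same counting code can serve either output format.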


Make sure to turn in any necessary supporting files along with your submission.


In [1]:
# your imports here
import numpy as np
import neurallm_utils as nutils 

import csv

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\hsubr\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [2]:
# code here!

# CONSTANTS #
VANILLA_NGRAM_WORD = 'spooky_vanilla_3_word.txt'
VANILLA_NGRAM_CHAR = 'spooky_vanilla_3_char.txt'
NEURAL_WORD = 'word_model_sentences.txt'
NEURAL_CHAR = 'char_model_sentences.txt'
ENGLISH_WORDS = 'words.txt'

TRAIN_FILE = 'spooky_author_train.csv'
training_data = nutils.read_file_spooky(TRAIN_FILE, 1)
# every token that appears in the training data
vocab = {word for sentence in training_data for word in sentence}
with open(ENGLISH_WORDS, 'r') as f:
    # one dictionary entry per line, lowercased for lookup
    valid_words = {x.strip().lower() for x in f}
    

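def percentage_valid_english(filepath):
    """Return (% valid English, % valid English unseen at train time) for a file.

    Tokens are whitespace-separated and looked up exactly as written, so a
    token with attached punctuation (e.g. "and,") counts as invalid.
    """
    total = 0
    valid = 0
    valid_not_seen = 0
    with open(filepath, 'r') as f:
        for line in f:
            words = line.split()
            valid += sum(x in valid_words for x in words)
            valid_not_seen += sum(x in valid_words and x not in vocab for x in words)
            total += len(words)
    return round((valid / total) * 100, 2), round((valid_not_seen / total) * 100, 2)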

print("Vanilla N-gram words model: (% valid english, % valid english not seen at train time) = ", percentage_valid_english(VANILLA_NGRAM_WORD))
print("Vanilla N-gram char model:  (% valid english, % valid english not seen at train time) = ", percentage_valid_english(VANILLA_NGRAM_CHAR))
print("Neural words model:         (% valid english, % valid english not seen at train time) = ", percentage_valid_english(NEURAL_WORD))
print("Neural chars model:         (% valid english, % valid english not seen at train time) = ", percentage_valid_english(NEURAL_CHAR))

Vanilla N-gram words model: (% valid english, % valid english not seen at train time) =  (86.26, 0.0)
Vanilla N-gram char model:  (% valid english, % valid english not seen at train time) =  (52.76, 12.39)
Neural words model:         (% valid english, % valid english not seen at train time) =  (86.78, 0.0)
Neural chars model:         (% valid english, % valid english not seen at train time) =  (53.2, 12.22)


3. How did you determine what a valid English word is?

   We count a sequence as a valid English word if it appears in the set `valid_words`. That set is loaded from `words.txt`, a large English word list taken from the following GitHub repository: https://github.com/dwyl/english-words. The lookup is an exact match against the lowercased list, so a sequence with punctuation attached is not counted as valid.

4. Gather the sequences of characters that are determined not to be words. Sampling at minimum 100 of these sequences, how many of them *should have* been counted as words in your opinion?

   We believe that 5 of the sampled sequences should have been counted as words: each is a real word with a period, comma, or quotation mark attached to the end.
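
The punctuation issue noted in this answer could be reduced by stripping leading and trailing punctuation before the dictionary lookup. The sketch below is an assumption about how that might look, not the code that produced the percentages reported above; `normalize` and the tiny `demo_words` set are hypothetical stand-ins for the real `valid_words` set.

```python
import string

def normalize(token):
    """Lowercase a token and strip punctuation from both ends (hypothetical helper)."""
    return token.strip(string.punctuation).lower()

# Tiny stand-in for the real dictionary set, for illustration only
demo_words = {"and", "on", "its"}

for token in ["and,", "on,", "its,", "wase,", ","]:
    cleaned = normalize(token)
    # an empty string means the token was punctuation only
    print(repr(token), "->", repr(cleaned), bool(cleaned) and cleaned in demo_words)
```

Stripping changes what counts as a word, so the reported percentages would need to be recomputed if this step were adopted.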

In [3]:
# more code here, as needed!
invalid_neural = []
valid_neural = []

    
with open(NEURAL_CHAR, 'r') as f:
    for line in f:
        words = line.split()
        # extend in place instead of rebuilding the list on every line
        valid_neural.extend(x for x in words if x in valid_words)
        invalid_neural.extend(x for x in words if x not in valid_words)
    
print(f'Length of invalid words from neural model: {len(invalid_neural)}\n')
print(f'Length of valid words from neural model: {len(valid_neural)}\n')


sample = invalid_neural[430:530]

print('100 invalid words in groups of 10:')
for idx in range(0, len(sample), 10):
    print("   ".join(sample[idx:idx+10]))

Length of invalid words from neural model: 1069

Length of valid words from neural model: 1215

100 invalid words in groups of 10:
extry   supopin   theris   cou   wol   retentaliries   elits   frome,   asids?   earethenters
wassawas   and,   anto   prems   on,   ,   crege.   asecto   wase,   thes
wastat   frould   acrughtly   beas   joinciat   ricied   ces   shque   sars   unterat,
mamory   hoich   tomer   dingingstioluctlecto   dayes   oughe   trioasy   auppose   thesomen   ehe
perromence   uppen   mouslay,   capechation   itandso   caideartimegs   gionsed   willse   explusevenes   joill
theyeds   revers,   oppecid   imet   hationly   whaphind   flid   inedutiony   wanity   aboe.
somthaver.   tandearay;   therse,   ipos,   in's   whe   haisink.   haintray   itedix.   belf
painot   sompas   nown.   fecent,   waricere   ghtriall   fation   trol   rist,   andonts
unies   extedre   suchcepteres,   its,   fictight   noatil   coved.   thedee;   asteric   stic
conall.   serney   the"   horn

In [4]:
invalid_vanilla = []
valid_vanilla = []

with open(VANILLA_NGRAM_CHAR, 'r') as f_v:
    for line in f_v:
        words = line.split()
        # extend in place instead of rebuilding the list on every line
        valid_vanilla.extend(x for x in words if x in valid_words)
        invalid_vanilla.extend(x for x in words if x not in valid_words)
        
print(f'Length of invalid words from vanilla model: {len(invalid_vanilla)}\n')
print(f'Length of valid words from vanilla model: {len(valid_vanilla)}')

Length of invalid words from vanilla model: 1525

Length of valid words from vanilla model: 1703


In [5]:
VALID_WORDS_CSV = "valid_words_lms.csv"
INVALID_WORDS_CSV = "invalid_words_lms.csv"

vanilla_valid_data = [['vanilla', x] for x in valid_vanilla]
vanilla_invalid_data = [['vanilla', x] for x in invalid_vanilla]
neural_valid_data = [['neural', x] for x in valid_neural]
neural_invalid_data = [['neural', x] for x in invalid_neural]

fields = ['model', 'sequence']

# newline='' keeps the csv module from writing blank rows on Windows
with open(INVALID_WORDS_CSV, 'w', newline='', encoding="utf-8") as f_invalid:
    csvwriter1 = csv.writer(f_invalid)
    csvwriter1.writerow(fields)
    csvwriter1.writerows(neural_invalid_data)
    csvwriter1.writerows(vanilla_invalid_data)

with open(VALID_WORDS_CSV, 'w', newline='', encoding="utf-8") as f_valid:
    csvwriter2 = csv.writer(f_valid)
    csvwriter2.writerow(fields)
    csvwriter2.writerows(neural_valid_data)
    csvwriter2.writerows(vanilla_valid_data)


Submit two CSV files alongside this notebook: `valid_words_lms.csv` and `invalid_words_lms.csv`. Both files should have __two__ columns: `model`, `sequence`. `model` will have the value `neural` or `vanilla`. `sequence` will be the corresponding sequence of characters. `valid_words_lms.csv` should contain all sequences from both models you determined to be valid words. `invalid_words_lms.csv` will have all sequences from both models you programmatically determined to be invalid words.

Option 2: Evaluate the generated sentences of *word*-based models
----

Your job for this option is to measure the quality of your generated sentences for word-based models. For this option you *must* survey at least 3 people who are __not__ in this course. They need to speak and read the language that you are evaluating, but they need not be native speakers.

You will evaluate the quality of the generated sentences in the following way:
1. Generate 20 sentences from your neural model.
2. Using the same level of n-gram, pair these sentences with provided sentences from the vanilla n-gram model.

Next, build a survey. For each pair of (neural LM sentence, vanilla n-gram LM sentence), you'll ask the survey taker three binary selection questions:
1. Which sentence is more grammatical?
2. Which sentence makes more sense semantically (in meaning)?
3. Which sentence do you prefer?


Finally, you'll evaluate your survey results. Calculate the following:
1. What percentage of neural vs. vanilla n-gram LM sentences were preferred, separated along each of the three dimensions?
2. What is [Krippendorff's alpha](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) for your survey data? 

You are welcome to use a pre-built Python implementation of the Krippendorff's alpha calculation, such as [this one](https://pypi.org/project/krippendorff/). Krippendorff's alpha is one way to measure interannotator agreement, that is, the extent to which your survey respondents agree with one another.
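
To make the agreement measure concrete, here is a minimal pure-Python sketch of Krippendorff's alpha for nominal data, built from the coincidence-matrix formulation. It is not the `krippendorff` package's implementation; the function name `nominal_alpha` and the input layout (one list of ratings per survey item, `None` for a missing answer) are assumptions made for illustration. The toy ratings are made up and are not survey results.

```python
from collections import Counter
from itertools import permutations

def nominal_alpha(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels that the
    raters gave to one item (None marks a missing answer).
    Hypothetical helper for illustration only.
    """
    coincidences = Counter()  # o[(c, k)]: weighted count of value pairs
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two ratings cannot be paired
        # every ordered pair of ratings from different raters, weighted by 1/(m-1)
        for i, j in permutations(range(m), 2):
            coincidences[(values[i], values[j])] += 1 / (m - 1)

    totals = Counter()  # n_c: marginal total for each label
    for (c, _), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())

    # nominal distance: any mismatch costs 1, any match costs 0
    observed = sum(w for (c, k), w in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        raise ValueError("alpha is undefined when fewer than two distinct labels appear")
    return 1 - (n - 1) * observed / expected

# Made-up toy data: 4 items, 3 raters, "N" = neural chosen, "V" = vanilla chosen
toy = [["N", "N", "N"], ["V", "V", "V"], ["N", "N", "V"], ["V", "V", None]]
print(round(nominal_alpha(toy), 3))
```

One reasonable way to apply this to the survey would be to build a separate `units` list for each of the three questions, giving one alpha per dimension.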

You will submit your survey data alongside this notebook.

In [6]:
# !pip install krippendorff

In [8]:
# your imports here

#import krippendorff


In [9]:
# your code here