Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 4
----

Names & Sections
----
Names: Sarthak Kagliwal, Anisha Kumari Kushwaha (Write these in every notebook you submit. For each partner, write down whether you are a 4120 or a 6120 student.)

Task 4: Compare your generated sentences (15 points)
----

In this task, you'll analyze one of the files that you produced in Task 3. You'll need to compare against the corresponding file that we have provided for you that was generated from the vanilla n-gram language model.

Choose *__one__* of the following two options.

Option 1: Evaluate the generated words of *character*-based models
---

Your job for this option is to programmatically measure two things:
1. the percentage of words produced by each model that are valid English words.
2. the percentage of words produced by each model that are valid English words *and* were not seen at train time.

For this task, a word is defined as "characters between _ " or "characters between spaces" (if you replaced your underscores with spaces when you printed out your new sentences).
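
The word definition above can be sketched directly as a split on underscores and whitespace. This is only an illustration of that definition: the helper name `split_generated_words` is hypothetical and is not the tokenizer used in the code cells of this notebook, which rely on NLTK's `word_tokenize` instead.

```python
import re

def split_generated_words(sentence):
    """Split a generated sentence into words on underscores and/or whitespace."""
    # Runs of "_" or whitespace act as a single delimiter; empty pieces are dropped.
    return [piece for piece in re.split(r"[_\s]+", sentence) if piece]

print(split_generated_words("the_cat sat__down\n"))
```

Unlike `word_tokenize`, this split leaves punctuation attached to the neighbouring word, so the two approaches can produce different word counts on the same file.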


Make sure to turn in any necessary supporting files along with your submission.


In [1]:
# your imports here
import nltk
nltk.download('words')
from nltk.corpus import words
english_words = set(words.words())
tokenizer = nltk.word_tokenize
import keras
import random
import pandas as pd
from gensim.models import KeyedVectors

[nltk_data] Downloading package words to
[nltk_data]     /Users/sarthak55k/nltk_data...
[nltk_data]   Package words is already up-to-date!


In [2]:
# code here!

def generate_word_list(filename):
    # Collect every token from the file into a single flat list
    word_list = []
    with open(filename, 'r') as file:
        for line in file:
            # Lowercase the line, then tokenize it with NLTK's word_tokenize
            # (named `tokens` to avoid shadowing the imported `words` corpus)
            tokens = tokenizer(line.lower())
            word_list.extend(tokens)
    return word_list


# the percentage of words produced by each model that are valid English words.
# Non-alphabetic tokens (punctuation, numbers) are skipped and land in neither list.
def valid_nonvalid(word_list):
    valid_words = []
    invalid_words = []
    for word in word_list:
        if word.isalpha():
            if word in english_words:
                valid_words.append(word)
            else:
                invalid_words.append(word)
    return valid_words, invalid_words




# the percentage of words produced by each model that are valid English words and were not seen at train time.
def not_seen_words(valid_words,vocab):
    valid_word_outside_train = []
    for word in valid_words:
        if word.isalpha():
            if word not in vocab:
                valid_word_outside_train.append(word)
        
    return valid_word_outside_train




In [3]:
def getPercentage(filename):
    embedding_model = './spooky_embedding_word.txt'
    w2v_word_model = KeyedVectors.load_word2vec_format(embedding_model, binary=False)

    # Use a set so the `not in vocab` membership checks are constant time
    word_vocab = set(w2v_word_model.key_to_index)
    
    char_words_list = generate_word_list(filename)
    char_word_list_len = len(char_words_list)
    char_valid_words, char_invalid_words = valid_nonvalid(char_words_list)

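    # The +1 in numerator and denominator avoids division by zero on an empty file.
    # Note that the denominator counts every token, including punctuation.
    percentage_valid_words_char = ((len(char_valid_words)+1)/(char_word_list_len+1))*100

    char_valid_word_outside_train = not_seen_words(char_valid_words,word_vocab)
    percentage_valid_words_outside_vocab_char = ((len(char_valid_word_outside_train)+1)/(char_word_list_len+1))*100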

    print(f'percentage of words that are valid english words: {percentage_valid_words_char}')
    print(f'percentage of words that are valid english words and were not seen at train time: {percentage_valid_words_outside_vocab_char}')


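    # Cap the sample size so random.sample does not raise when fewer than 100 invalid words exist
    print(f'\nRandom 100 samples of invalid words:\n {random.sample(char_invalid_words, min(100, len(char_invalid_words)))}')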

nn_filename = 'char_sents.txt'
vanilla_filename = './spooky_vanilla_3_char.txt'

print('For Vanilla model...')
getPercentage(vanilla_filename)

print('\nFor NN model')
getPercentage(nn_filename)




For Vanilla model...
percentage of words that are valid english words: 43.140291111799314
percentage of words that are valid english words and were not seen at train time: 5.264787860018581

Random 100 samples of invalid words:
 ['louddis', 'sto', 'ishe', 'pette', 'youghtly', 'whic', 'whe', 'handown', 'earmea', 'andieflethimor', 'univeret', 'ance', 'dusubjectishat', 'livere', 'combeethe', 'gand', 'ity', 'shadnigning', 'gle', 'ste', 'swovey', 'magover', 'inglibbiz', 'jought', 'coresch', 'mands', 'ang', 'wormighat', 'wasmaked', 'therproughtfutuoureems', 'ity', 'mins', 'sau', 'thathe', 'ordis', 'samind', 'beger', 'fantestain', 'thele', 'maind', 'smages', 'syme', 'hoaterd', 'ful', 'oritly', 'yourappeld', 'whis', 'adren', 'anot', 'thads', 'ares', 'dideveriestow', 'exhien', 'poin', 'thaded', 'bled', 'ande', 'thaveherecally', 'memblown', 'gened', 'por', 'ract', 'fieve', 'inis', 'atioutedo', 'warld', 'adis', 'hemoreauthe', 'becelf', 'astions', 'taines', 'youghts', 'thatingtheyes', 'itin', 'scr

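3. How did you determine what a valid English word is? A token counts as valid if, after lowercasing, it is alphabetic and appears in the NLTK `words` corpus (`nltk.corpus.words`).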

4. Gather the sequences of characters that are determined not to be words. Sampling at minimum 100 of these sequences, how many of them *should have* been counted as words in your opinion?

In [4]:
# more code here, as needed!
char_words_list_nn = generate_word_list(nn_filename)
char_valid_words_nn, char_invalid_words_nn = valid_nonvalid(char_words_list_nn)
print(len(char_valid_words_nn))
char_words_list_vanilla = generate_word_list(vanilla_filename)
char_valid_words_vanilla, char_invalid_words_vanilla = valid_nonvalid(char_words_list_vanilla)

valid_model = ['neural']*len(char_valid_words_nn)+ ['vanilla']*len(char_valid_words_vanilla)

valid_seq = char_valid_words_nn+char_valid_words_vanilla

data = {'model':valid_model,'sequence':valid_seq}
df_valid = pd.DataFrame(data)
df_valid.head()

# index=False keeps the file to exactly two columns: model, sequence
df_valid.to_csv('valid_words_lms.csv', index=False)


invalid_model = ['neural']*len(char_invalid_words_nn)+ ['vanilla']*len(char_invalid_words_vanilla)
invalid_seq = char_invalid_words_nn+char_invalid_words_vanilla

data = {'model':invalid_model,'sequence':invalid_seq}
df_invalid = pd.DataFrame(data)
df_invalid.head()

# index=False keeps the file to exactly two columns: model, sequence
df_invalid.to_csv('invalid_words_lms.csv', index=False)

782


Submit two CSV files alongside this notebook: `valid_words_lms.csv` and `invalid_words_lms.csv`. Both files should have __two__ columns: `model`, `sequence`. `model` will have the value `neural` or `vanilla`. `sequence` will be the corresponding sequence of characters. `valid_words_lms.csv` should contain all sequences from both models you determined to be valid words. `invalid_words_lms.csv` will have all sequences from both models you programmatically determined to be invalid words.

Option 2: Evaluate the generated sentences of *word*-based models
----

Your job for this option is to measure the quality of your generated sentences for word-based models. For this option you *must* survey at least 3 people who are __not__ in this course. They need to speak and read the language that you are evaluating, but they need not be native speakers.

You will evaluate the quality of the generated sentences in the following way:
1. Generate 20 sentences from your neural model.
2. Using the same level of n-gram, pair these sentences with provided sentences from the vanilla n-gram model.

Next, build a survey. For each pair of (neural LM sentence, vanilla n-gram LM sentence), you'll ask the survey taker three binary selection questions:
1. Which sentence is more grammatical?
2. Which sentence makes more sense, semantically (in meaning)?
3. Which sentence do you prefer?


Finally, you'll evaluate your survey results. Calculate the following:
1. What percentage of neural vs. vanilla n-gram LM sentences were preferred, separated along each of the three dimensions?
2. What is [Krippendorff's alpha](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) for your survey data? 
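
The first calculation above could be sketched as follows. This is a minimal illustration, assuming each survey answer has been recorded as a `(dimension, choice)` pair where `choice` is `'neural'` or `'vanilla'`; that layout, the dimension labels, and the function name `preference_percentages` are assumptions for the example, not a required format.

```python
from collections import Counter

def preference_percentages(responses):
    """Return, per dimension, the percentage of answers favouring each model."""
    counts = {}
    for dimension, choice in responses:
        # One Counter per dimension (e.g. grammatical, semantic, preferred)
        counts.setdefault(dimension, Counter())[choice] += 1
    return {
        dimension: {
            model: 100 * count / sum(counter.values())
            for model, count in counter.items()
        }
        for dimension, counter in counts.items()
    }

# Hypothetical answers, for illustration only
example = [
    ("grammatical", "neural"),
    ("grammatical", "vanilla"),
    ("grammatical", "neural"),
    ("grammatical", "neural"),
    ("preferred", "vanilla"),
]
print(preference_percentages(example))
```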

You are welcome to use a pre-built Python implementation of the Krippendorff's alpha calculation, such as [this one](https://pypi.org/project/krippendorff/). Krippendorff's alpha is one way to measure interannotator agreement: the extent to which your survey respondents agree with one another.
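
To make the agreement measure concrete, here is a from-scratch sketch of Krippendorff's alpha for nominal data (binary choices such as neural vs. vanilla are nominal). It is not the API of the `krippendorff` package linked above; the function name `krippendorff_alpha_nominal` and the input layout (one list of answers per survey question, with `None` for a skipped answer) are assumptions made for this illustration.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Nominal Krippendorff's alpha; `units` holds one list of labels per item."""
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # an item with fewer than two answers is not pairable
        # Every ordered pair of answers from different respondents, weighted by 1/(m-1)
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    # Marginal totals n_c and overall total n of the coincidence matrix
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        return float("nan")  # not enough pairable answers

    # Observed and expected disagreement (both scaled by n, which cancels in the ratio)
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return float("nan")  # only one label was ever used, so alpha is undefined
    return 1 - observed / expected

# Hypothetical answers from two respondents on four questions
print(krippendorff_alpha_nominal([["a", "a"], ["a", "b"], ["b", "b"], ["b", "a"]]))
```

An alpha of 1 means perfect agreement, and values near 0 mean agreement no better than chance. Computing alpha separately for each of the three survey questions keeps the dimensions from being mixed together.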

You will submit your survey data alongside this notebook.

In [5]:
# !pip install krippendorff

In [6]:
# your imports here

# import krippendorff


In [7]:
# your code here