Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 4
----

Names & Sections
----
Names: Julia Geller (4120) & Shae Marks (4120)

Task 4: Compare your generated sentences (15 points)
----

In this task, you'll analyze one of the files that you produced in Task 3. You'll need to compare against the corresponding file that we have provided for you that was generated from the vanilla n-gram language model.

Choose *__one__* of the following two options.

Option 1: Evaluate the generated words of *character*-based models
---

Your job for this option is to programmatically measure two things:
1. the percentage of words produced by each model that are valid English words.
2. the percentage of words produced by each model that are valid English words *and* were not seen at train time.

For this task, a word is defined as "characters between _ " or "characters between spaces" (if you replaced your underscores with spaces when you printed out your new sentences).
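
The two measurements above can be sketched as follows. This is a minimal illustration under stated assumptions, not a required approach: the function name `word_validity_rates` and the sets `english_words` and `train_words` are hypothetical stand-ins for whatever dictionary and training vocabulary you choose, and lowercasing tokens before lookup is also an assumption.

```python
import re


def word_validity_rates(sentences, english_words, train_words):
    """Return (% valid words, % valid words unseen at train time).

    sentences:     iterable of generated strings (hypothetical input)
    english_words: set of lowercase words treated as valid English (assumption)
    train_words:   set of lowercase words seen in the training data (assumption)
    """
    # A word is any run of characters between underscores or whitespace.
    tokens = [
        token.lower()
        for sentence in sentences
        for token in re.split(r"[_\s]+", sentence)
        if token
    ]
    if not tokens:
        return 0.0, 0.0

    valid = [token for token in tokens if token in english_words]
    unseen_valid = [token for token in valid if token not in train_words]

    # Both percentages use the total number of generated words as the denominator.
    pct_valid = 100 * len(valid) / len(tokens)
    pct_unseen_valid = 100 * len(unseen_valid) / len(tokens)
    return pct_valid, pct_unseen_valid


# Toy usage with made-up data
demo_sentences = ["the_cat_zzq_ran", "a dog"]
demo_dictionary = {"the", "cat", "ran", "a", "dog"}
demo_train_vocab = {"the", "cat", "a"}
pct_valid, pct_unseen_valid = word_validity_rates(
    demo_sentences, demo_dictionary, demo_train_vocab
)
print(round(pct_valid, 1), round(pct_unseen_valid, 1))  # 83.3 33.3
```

Whether the second percentage should be taken over all generated words (as here) or only over the valid ones is a design choice; state which one you use when reporting results.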


Make sure to turn in any necessary supporting files along with your submission.


In [1]:
# your imports here

In [2]:
# code here!

3. How did you determine what a valid English word is? __YOUR ANSWER HERE__

4. Gather the sequences of characters that are determined not to be words. Sampling at minimum 100 of these sequences, how many of them *should have* been counted as words in your opinion?

In [3]:
# more code here, as needed!

Submit two CSV files alongside this notebook: `valid_words_lms.csv` and `invalid_words_lms.csv`. Both files should have __two__ columns: `model`, `sequence`. `model` will have the value `neural` or `vanilla`. `sequence` will be the corresponding sequence of characters. `valid_words_lms.csv` should contain all sequences from both models that you programmatically determined to be valid words. `invalid_words_lms.csv` should contain all sequences from both models that you programmatically determined to be invalid words.

Option 2: Evaluate the generated sentences of *word*-based models
----

Your job for this option is to measure the quality of your generated sentences for word-based models. For this option you *must* survey at least 3 people who are __not__ in this course. They need to speak and read the language that you are evaluating, but they need not be native speakers.

You will evaluate the quality of the generated sentences in the following way:
1. Generate 20 sentences from your neural model.
2. Using the same level of n-gram, pair these sentences with provided sentences from the vanilla n-gram model.

Next, build a survey. For each pair of (neural LM sentence, vanilla n-gram LM sentence), you'll ask the survey taker three binary selection questions:
1. Which sentence is more grammatical?
2. Which sentence makes more sense, semantically (in meaning)?
3. Which sentence do you prefer?


Finally, you'll evaluate your survey results. Calculate the following:
1. What percentage of neural vs. vanilla n-gram LM sentences were preferred, separated along each of the three dimensions?

The vanilla model was preferred on all three questions (60 responses per question: 20 pairs, 3 survey takers).
- Grammaticality: the vanilla model was chosen in 61.7% of responses (neural: 38.3%).
- Semantic sense: the vanilla model was chosen in 63.3% of responses (neural: 36.7%).
- Overall preference: the vanilla model was chosen in 58.3% of responses (neural: 41.7%).
2. What is [Krippendorff's alpha](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) for your survey data? 
- For the grammaticality question, alpha was 0.445.
- For the semantic sense question, alpha was 0.153.
- For the overall preference question, alpha was 0.326.


You are welcome to use a pre-built Python implementation of the Krippendorff's alpha calculation, such as [this one](https://pypi.org/project/krippendorff/). Krippendorff's alpha is one way to measure interannotator agreement: the extent to which your survey respondents agree with one another.
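
To make the measure concrete, here is a minimal sketch of Krippendorff's alpha for nominal data, computed from a coincidence matrix. The function name `nominal_alpha` is hypothetical and is not part of the `krippendorff` package. The input layout (one row per rater, one column per rated unit, `None` for a missing rating) is an assumption chosen to mirror the `reliability_data` argument used later in this notebook. Treat it as an illustration of the formula, not a replacement for the library.

```python
from collections import Counter
from itertools import permutations


def nominal_alpha(reliability_data):
    """Krippendorff's alpha for nominal labels.

    reliability_data: list of rows, one per rater; each row holds that
    rater's label for every unit, or None where the rating is missing.
    """
    # Coincidence matrix: every ordered pair of labels given to the same unit
    # by two different raters contributes 1 / (m - 1), where m is the number
    # of ratings that unit received.
    coincidences = Counter()
    for unit in zip(*reliability_data):
        values = [value for value in unit if value is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two ratings carry no information
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    # Marginal totals n_c and grand total n.
    totals = Counter()
    for (c, _), count in coincidences.items():
        totals[c] += count
    n = sum(totals.values())

    # Sums of observed and expected mismatching pairs (nominal distance is 0 or 1).
    observed = sum(count for (c, k), count in coincidences.items() if c != k)
    expected = sum(totals[c] * totals[k] for c in totals for k in totals if c != k)
    if expected == 0:
        raise ValueError("alpha is undefined when only one label value occurs")

    # alpha = 1 - D_o / D_e, with D_o = observed / n and D_e = expected / (n * (n - 1))
    return 1 - (n - 1) * observed / expected


# Toy usage: two raters, four units, one disagreement
toy = [["A", "A", "B", "B"],
       ["A", "B", "B", "B"]]
print(round(nominal_alpha(toy), 3))  # 0.533
```

An alpha of 1 means perfect agreement and 0 means agreement no better than chance, so values between the two indicate partial agreement.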

You will submit your survey data alongside this notebook.

In [4]:
# !pip install krippendorff

In [5]:
# your imports here
import krippendorff
import random
import pandas as pd

In [6]:
# your code here
# read in txt files - each line is a sequence 

def txt_to_array(filepath: str, num_seq: int = 20, seed: int = 18) -> list:
    """
    Reads in the content at the given filepath and produces a list of size num_seq
    where each element is a randomly chosen line from the file.

    Args:
        filepath (string): path to a txt file
        num_seq (int): number of sequences to choose from the txt file
        seed (int): seed to give a consistent result with the random library

    Returns:
        lines (list): list of sequences
    """
    with open(filepath, 'r') as file:
        lines = file.readlines()

    # Remove leading/trailing whitespace (including the newline) from each line
    lines = [line.strip() for line in lines]

    # Seed the global random generator so the same lines are chosen on every run
    random.seed(seed)
    lines = random.sample(lines, k=num_seq) # sample without replacement

    return lines


vanilla_ngram_seqs_word = txt_to_array("spooky_vanilla_3_word.txt")
neurallm_seqs_word = txt_to_array("neurallm_seq_word.txt")

In [7]:
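# print each vanilla/neural pair side by side, numbered from 1
for i, (vanilla_seq, neural_seq) in enumerate(zip(vanilla_ngram_seqs_word, neurallm_seqs_word), start=1):
    print("Pair", i)
    print("Vanilla N-Gram:", vanilla_seq)
    print("Neural Net:", neural_seq, "\n")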

Pair 1
Vanilla N-Gram: but it continued and habitual ride , i fled from me '' said he , `` this is the massacre of the genuine `` hey , yew , why should he come , the tiger i alternately rattled the rusty impediments with a face and gnashed his teeth in the crumbling bricks and unhealthy weeds as was really the poet .
Neural Net: all the five four thousand undiscovered health on a couple lunatic , and it , and i will send me with generous to betray neck would secure : these notes have better existed , or any wave the bulk we shall mad a reply student to die on my extra entreaties of her constancy . 

Pair 2
Vanilla N-Gram: the artificial style of gardening , a new scene of the question , when he approached the arch of the volumes i have said that an advantageous offer of private possession were thrown down , so far evinced not the faces of men , '' ran the text represented an artificial alphabet , notwithstanding all the information it could not find a local door which the next thing t

In [8]:
# read in survey responses (A is the vanilla ngram model, B is the neural net language model. We removed any <UNK> characters.)
survey_df = pd.read_csv('sentence_eval_survey_responses.csv')

# save answers to each question in different lists
cols = [c for c in survey_df.columns if c not in ['Timestamp', 'Email Address']] # get response columns
q1 = []
q2 = []
q3 = []

# iterate through the 20 sequence pairs; each pair has three consecutive response columns
for start in range(0, 60, 3):
    q1 += survey_df[cols[start]].tolist()     # responses to question 1 for this pair
    q2 += survey_df[cols[start + 1]].tolist() # responses to question 2 for this pair
    q3 += survey_df[cols[start + 2]].tolist() # responses to question 3 for this pair


print("response examples:", q2[:5])

response examples: ['B', 'A', 'A', 'B', 'B']


In [9]:
# see what percent of preferences were for vanilla vs neural network models
# (round after subtracting so floating-point error does not leak into the printout)
vanilla = len([v for v in q1 if v=='A'])/len(q1)
print("Which sentence is more grammatical?")
print('vanilla preference:', round(vanilla,3), ' neural network preference:', round(1-vanilla,3), "\n")

vanilla = len([v for v in q2 if v=='A'])/len(q2)
print("Which sentence makes more sense, semantically (in meaning)?")
print('vanilla preference:', round(vanilla,3), ' neural network preference:', round(1-vanilla,3), "\n")

vanilla = len([v for v in q3 if v=='A'])/len(q3)
print("Which sentence do you prefer?")
print('vanilla preference:', round(vanilla,3), ' neural network preference:', round(1-vanilla,3), "\n")

Which sentence is more grammatical?
vanilla preference: 0.617  neural network preference: 0.383 

Which sentence makes more sense, semantically (in meaning)?
vanilla preference: 0.633  neural network preference: 0.367 

Which sentence do you prefer?
vanilla preference: 0.583  neural network preference: 0.417 



In [10]:
# initialize reliability data for each question
d1 = [[], [], []] # [[person1 responses], [person2 responses], [person3 responses]] to question 1
d2 = [[], [], []]
d3 = [[], [], []]
# q1, q2, and q3 hold three consecutive responses per pair (one per survey taker),
# so the response at index i belongs to survey taker i % 3
for i in range(0, 60):
    j = i % 3
    d1[j].append(q1[i])
    d2[j].append(q2[i])
    d3[j].append(q3[i])

In [11]:
# print alpha score
print('Question 1 alpha:', krippendorff.alpha(reliability_data=d1, level_of_measurement='nominal'))
print('Question 2 alpha:', krippendorff.alpha(reliability_data=d2, level_of_measurement='nominal'))
print('Question 3 alpha:', krippendorff.alpha(reliability_data=d3, level_of_measurement='nominal'))

Question 1 alpha: 0.445358401880141
Question 2 alpha: 0.15311004784688997
Question 3 alpha: 0.32571428571428573
