Homework 5: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 4
----

Names & Sections
----
Names: __Adrian Criollo__ (Write these in every notebook you submit.)

Task 4: Compare your generated sentences (15 points)
----

In this task, you'll analyze one of the files that you produced in Task 3. You'll need to compare against the corresponding file that we have provided for you that was generated from the vanilla n-gram language model.

Choose *__one__* of the following two options.

Option 1: Evaluate the generated words of *character*-based models
---

Your job for this option is to programmatically measure two things:
1. the percentage of words produced by each model that are valid English words.
2. the percentage of words produced by each model that are valid English words *and* were not seen at train time.

For this task, a word is defined as "characters between _ " or "characters between spaces" (if you replaced your underscores with spaces when you printed out your new sentences).
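
The word definition above can be sketched as a small splitting helper. This is a minimal illustration, not the course-provided tokenizer: the function name `split_generated_words` is hypothetical, and it assumes that the `<s>` and `</s>` boundary markers may appear in the generated text and should not be counted as words.

```python
import re

# Boundary markers assumed to match the sentence_begin / sentence_end
# tokens used elsewhere in this notebook.
BOUNDARY_TOKENS = {"<s>", "</s>"}

def split_generated_words(line):
    """Split one generated line into words on underscores and/or whitespace."""
    pieces = re.split(r"[_\s]+", line.strip())
    # Drop empty strings (from leading/trailing separators) and boundary markers.
    return [p for p in pieces if p and p not in BOUNDARY_TOKENS]

print(split_generated_words("<s>_it_was_a_dark night_</s>"))
```

Splitting on both separators in one pattern means the same helper works whether or not underscores were replaced with spaces when the sentences were written out.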


Make sure to turn in any necessary supporting files along with your submission.


In [6]:
# your imports here
import neurallm_utils as nutils
import nltk
nltk.download('words')
from nltk.corpus import words
import pandas as pd

generated_file = 'generated_sentences_char.txt'
vanilla_generated_file = 'spooky_vanilla_3_char.txt'
FILE_PATH = 'spooky_author_train.csv'

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\adria\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\adria\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


In [7]:
# code here!

with open(generated_file, 'r', encoding='utf-8') as f:
    sentences = f.readlines()

with open(vanilla_generated_file, 'r', encoding='utf-8') as f:
    sentences_vanilla = f.readlines()

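def tokenize_sentences(sentences, by_char=False):
    """Tokenize each sentence with nutils.tokenize_line and return one flat list of tokens."""
    # Named `tokens_out` so the local does not shadow the imported nltk `words` corpus.
    tokens_out = []
    for sentence in sentences:
        sentence = sentence.strip()
        tokens = nutils.tokenize_line(
            line=sentence,
            ngram=1,
            by_char=by_char,
            space_char=' ',
            sentence_begin='<s>',
            sentence_end='</s>'
        )
        tokens_out.extend(tokens)
    return tokens_out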

words_my_model = tokenize_sentences(sentences, by_char=False)
words_vanilla_model = tokenize_sentences(sentences_vanilla, by_char=False)

english_words = words.words()
english_words_set = set(word.lower() for word in english_words)

train_data_df = pd.read_csv(FILE_PATH)
train_sentences = train_data_df['text'].tolist()
train_words = tokenize_sentences(train_sentences, by_char=False)
train_words_set = set(word.lower() for word in train_words)

def compute_percentages(words_list, english_words_set, train_words_set):
    """Return (% valid English, % valid English and unseen in training) over all tokens."""
    total_words = len(words_list)
    if total_words == 0:
        return 0, 0
    words_lower = [word.lower() for word in words_list]
    # Valid: the lowercased token appears in the English lexicon.
    valid_english_words = [word for word in words_lower if word in english_words_set]
    # Novel: valid and absent from the training vocabulary.
    valid_new_words = [word for word in valid_english_words if word not in train_words_set]
    percentage_valid_english = (len(valid_english_words) / total_words) * 100
    percentage_valid_new_words = (len(valid_new_words) / total_words) * 100

    return percentage_valid_english, percentage_valid_new_words

percentage_valid_english_my_model, percentage_valid_new_words_my_model = compute_percentages(
    words_my_model, english_words_set, train_words_set
)

percentage_valid_english_vanilla_model, percentage_valid_new_words_vanilla_model = compute_percentages(
    words_vanilla_model, english_words_set, train_words_set
)


print("My Character-Based Model:")
print(f"Percentage of valid English words: {percentage_valid_english_my_model}%")
print(f"Percentage of valid English words not seen during training: {percentage_valid_new_words_my_model}%\n")

print("Vanilla N-Gram Model:")
print(f"Percentage of valid English words: {percentage_valid_english_vanilla_model}%")
print(f"Percentage of valid English words not seen during training: {percentage_valid_new_words_vanilla_model}%")

My Character-Based Model:
Percentage of valid English words: 95.10829885616938%
Percentage of valid English words not seen during training: 0.0%

Vanilla N-Gram Model:
Percentage of valid English words: 41.656942823803966%
Percentage of valid English words not seen during training: 5.805134189031505%


3. How did you determine what a valid English word is? __I counted a sequence as a valid English word if its lowercased form appears in NLTK's `words` corpus.__

4. Gather the sequences of characters that are determined not to be words. Sampling at minimum 100 of these sequences, how many of them *should have* been counted as words in your opinion? __In a sample of 100 invalid sequences, I saw about 2 that should have been counted as words.__
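
One way to double-check the manual count is to re-test the invalid sequences with a more forgiving rule. The sketch below is only an illustration under stated assumptions: it assumes that some real words are rejected because they carry punctuation or are inflected forms missing from the lexicon, the suffix rules are a naive heuristic rather than a real lemmatizer, and both `is_plausible_word` and `TOY_LEXICON` are hypothetical names (in this notebook the lexicon would be `english_words_set`).

```python
import string

# Hypothetical stand-in lexicon so the sketch runs on its own.
TOY_LEXICON = {"ghost", "walk", "night", "carry"}

# (suffix, replacement) pairs tried in order; a naive heuristic, not a lemmatizer.
SUFFIX_RULES = (("ies", "y"), ("es", ""), ("s", ""), ("ed", ""), ("ing", ""))

def is_plausible_word(token, lexicon):
    """Return True if the token, or a simple de-inflected form of it, is in the lexicon."""
    word = token.lower().strip(string.punctuation)
    if not word:
        return False
    if word in lexicon:
        return True
    for suffix, replacement in SUFFIX_RULES:
        # Require at least two characters of stem before the suffix.
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            if word[: -len(suffix)] + replacement in lexicon:
                return True
    return False

for candidate in ["Ghosts,", "carries", "walked", "thadinedged"]:
    print(candidate, is_plausible_word(candidate, TOY_LEXICON))
```

Sequences that pass this relaxed check but failed the strict lookup are candidates for the "should have been counted" tally, though each one still needs a human look.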

In [10]:
# more code here, as needed!

my_model = [('neural', word) for word in words_my_model]
vanilla_model = [('vanilla', word) for word in words_vanilla_model]
combined_data = my_model + vanilla_model

valid_words = []
invalid_words = []

for model_label, word in combined_data:
    word_lower = word.lower()
    if word_lower in english_words_set:
        valid_words.append({'model': model_label, 'sequence': word})
    else:
        invalid_words.append({'model': model_label, 'sequence': word})

df_valid_words = pd.DataFrame(valid_words)
df_invalid_words = pd.DataFrame(invalid_words)
df_valid_words = df_valid_words.drop_duplicates()
df_invalid_words = df_invalid_words.drop_duplicates()
df_valid_words.to_csv('valid_words_lms.csv', index=False)
df_invalid_words.to_csv('invalid_words_lms.csv', index=False)

total_invalid_sequences = len(df_invalid_words)
sample_size = min(100, total_invalid_sequences)

sampled_sequences = df_invalid_words.sample(n=sample_size, random_state=42)
print(sampled_sequences)


        model        sequence
478   vanilla           fifte
821   vanilla          whican
1979  vanilla     thadinedged
2185  vanilla           alles
2080  vanilla          notime
...       ...             ...
1963  vanilla           exper
361   vanilla  andieflethimor
1060  vanilla      ouprovered
1203  vanilla        subtally
254   vanilla           linto

[100 rows x 2 columns]


Submit two CSV files alongside this notebook: `valid_words_lms.csv` and `invalid_words_lms.csv`. Both files should have __two__ columns: `model`, `sequence`. `model` will have the value `neural` or `vanilla`. `sequence` will be the corresponding sequence of characters. `valid_words_lms.csv` should contain all sequences from both models you determined to be valid words. `invalid_words_lms.csv` will have all sequences from both models you programmatically determined to be invalid words.
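
Before submitting, the file format described above can be sanity-checked with a few lines of standard-library code. This is a hedged sketch, not an official checker: `check_submission_csv` is a hypothetical helper, and it only verifies the column names and the allowed `model` values stated in this section.

```python
import csv
import io

EXPECTED_COLUMNS = ["model", "sequence"]
ALLOWED_MODELS = {"neural", "vanilla"}

def check_submission_csv(handle):
    """Return a list of human-readable problems found in an open CSV file handle."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]
    problems = []
    # The header is line 1, so data rows start at line 2.
    for line_number, row in enumerate(reader, start=2):
        if row["model"] not in ALLOWED_MODELS:
            problems.append(f"line {line_number}: bad model value {row['model']!r}")
        if not row["sequence"]:
            problems.append(f"line {line_number}: empty sequence")
    return problems

# In-memory example; for a real file, pass open('invalid_words_lms.csv', newline='').
sample = io.StringIO("model,sequence\nvanilla,fifte\nneural,whican\n")
print(check_submission_csv(sample))
```

An empty list means the file matches the two-column layout; anything else points at the first lines worth inspecting.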

Option 2: Evaluate the generated sentences of *word*-based models
----

Your job for this option is to measure the quality of your generated sentences for word-based models. For this option you *must* survey at least 3 people who are __not__ in this course. They need to speak and read the language that you are evaluating, but they need not be native speakers.

You will evaluate the quality of the generated sentences in the following way:
1. Generate 20 sentences from your neural model.
2. Using the same level of n-gram, pair these sentences with provided sentences from the vanilla n-gram model.

Next, build a survey. For each pair of (neural LM sentence, vanilla n-gram LM sentence), you'll ask the survey taker three binary selection questions:
1. Which sentence is more grammatical?
2. Which sentence makes more sense, semantically (in meaning)?
3. Which sentence do you prefer?


Finally, you'll evaluate your survey results. Calculate the following:
1. What percentage of neural vs. vanilla n-gram LM sentences were preferred, separated along each of the three dimensions?
2. What is [Krippendorff's alpha](https://en.wikipedia.org/wiki/Krippendorff%27s_alpha) for your survey data? 
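
The first calculation above can be sketched as a simple tally. The data layout here is an assumption, since the survey format is up to you: each response is taken to be a `(dimension, choice)` pair, where `choice` is `'neural'` or `'vanilla'`, and `preference_percentages` is a hypothetical helper name.

```python
from collections import Counter

MODELS = ("neural", "vanilla")

def preference_percentages(responses):
    """Map each dimension to the percentage of responses choosing each model."""
    counts = {}
    for dimension, choice in responses:
        counts.setdefault(dimension, Counter())[choice] += 1
    result = {}
    for dimension, counter in counts.items():
        total = sum(counter[m] for m in MODELS)
        result[dimension] = {m: 100 * counter[m] / total for m in MODELS}
    return result

# Made-up responses purely to show the input shape; not real survey data.
example = [
    ("grammatical", "neural"),
    ("grammatical", "neural"),
    ("grammatical", "vanilla"),
    ("grammatical", "neural"),
    ("preference", "vanilla"),
]
print(preference_percentages(example))
```

Pooling all respondents per dimension gives one pair of percentages per question type, which is what the three-dimension breakdown asks for.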

You are welcome to use a pre-built Python implementation of the Krippendorff's alpha calculation, such as [this one](https://pypi.org/project/krippendorff/). Krippendorff's alpha is one way to measure interannotator agreement, that is, the extent to which your survey respondents agree with one another.
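
If the package is not installed, alpha for nominal (unordered, categorical) answers can also be computed directly from its coincidence-matrix definition. The following is a minimal from-scratch sketch, not the linked package's implementation: `krippendorff_alpha_nominal` is a hypothetical name, and it assumes each unit is one (sentence pair, question) item holding one answer per respondent, with `None` marking a skipped answer.

```python
import math
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels; `units` is a list of per-item answer lists."""
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two answers cannot be paired
        # Every ordered pair of answers from different respondents, weighted by 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if n <= 1 or expected == 0:
        return math.nan  # undefined when only one label is ever used
    # alpha = 1 - D_o / D_e, with D_o = observed / n and D_e = expected / (n * (n - 1)).
    return 1.0 - (n - 1) * observed / expected

# Made-up answers from two respondents on four items; not real survey data.
print(krippendorff_alpha_nominal([["a", "a"], ["b", "b"], ["a", "b"], ["b", "a"]]))
```

A value of 1 means perfect agreement and values near 0 mean agreement no better than chance. For anything beyond a quick check, the linked package is likely the more robust choice.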

You will submit your survey data alongside this notebook.

In [13]:
# !pip install krippendorff

In [14]:
# your imports here

import krippendorff


ModuleNotFoundError: No module named 'krippendorff'

In [None]:
# your code here