Homework 4: Neural Language Models  (& 🎃 SpOoKy 👻 authors 🧟 data) - Task 4, Option A
----

### Names
----
Names: __William Aoun__ (Write these in every notebook you submit.)

Task 4: Compare your generated sentences (25 points)
----

In this task, you'll analyze one of the models that you produced in Task 3. You'll need to compare against the corresponding file that we have provided for you that was generated from the vanilla n-gram language model.

Choose *__one__* of option A (this notebook) or B.



Option A: Evaluate the generated words of *character*-based models
---

Your job for this option is to programmatically measure two things for character-based models where n = 3, 4 and 5 (you will need to do some extra training of your character-based model to produce these; put cells with the code you use to generate these files at the end of your Task 3 notebook):

- the percentage of words produced by each model (neural and vanilla) that are valid English words.
- the percentage of words produced by each model (neural and vanilla) that are valid English words *and* were not seen at train time.
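
As a rough illustration of the two measures above, the sketch below computes both percentages on toy data. Every name and value in it (`dictionary`, `seen_in_training`, `generated`) is a made-up stand-in, and it assumes both percentages use the total number of generated tokens as the denominator.

```python
# Toy data: every set and token below is made up for illustration only.
dictionary = {"the", "raven", "night", "gloom"}   # stand-in for an English word list
seen_in_training = {"the", "raven", "night"}      # stand-in for the training vocabulary
generated = ["the", "raven", "gloom", "nightt"]   # stand-in for model output tokens

# A token is valid if it appears in the dictionary.
valid = [w for w in generated if w in dictionary]
# A valid token is "unseen" if the training vocabulary does not contain it.
valid_unseen = [w for w in valid if w not in seen_in_training]

# Assumption: both percentages are taken over all generated tokens.
pct_valid = 100 * len(valid) / len(generated)
pct_valid_unseen = 100 * len(valid_unseen) / len(generated)

print(pct_valid, pct_valid_unseen)  # 75.0 25.0
```

Here three of the four tokens are in the dictionary, and only `gloom` is both valid and absent from the training stand-in.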

For this task, you'll first need to define what you consider a "word" so you know how to evaluate your output. Take into account how you will deal with punctuation!

Make sure to turn in any necessary supporting files along with your submission.


12a. What is your definition of a word for this task? __Any maximal run of letters delimited by non-letter characters. All words are lowercased before being compared with the dictionary, which decides whether the model produced a valid word.__
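
To make the effect of this definition on punctuation concrete, here is a minimal sketch that applies a letters-only regular expression to a sentence with contractions, an abbreviation, and a hyphen. The function name `split_into_words` and the sample sentence are hypothetical and used only for this illustration.

```python
import re


def split_into_words(text: str) -> list[str]:
    """Lowercase the text and return every maximal run of ASCII letters."""
    return re.findall(r"[a-z]+", text.lower())


# Made-up sentence chosen to exercise apostrophes, periods, and hyphens.
tokens = split_into_words("Don't stop, Mr. Poe; it's mid-night!")
print(tokens)
# ['don', 't', 'stop', 'mr', 'poe', 'it', 's', 'mid', 'night']
```

Under this definition, apostrophes and hyphens act as separators, so contractions break into fragments such as `t` and `s`. That is one plausible source of short sequences like `ll` among the invalid words reported further down.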

In [19]:
# your imports here

import re
import pandas as pd
import random
import nltk
from nltk.corpus import words

nltk.download("words", quiet=True)
english_words = {word.lower() for word in words.words()}
print(f"English dictionary loaded: {len(english_words)} words")


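def extract_words(text: str) -> list[str]:
    """Return every maximal run of ASCII letters in text, lowercased."""
    return re.findall(r"[a-z]+", text.lower())

# Vocabulary seen at train time, tokenized the same way as the generated text.
df = pd.read_csv("spooky_author_train.csv")
training_words = set()
for text in df["text"]:
    training_words.update(extract_words(text))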

English dictionary loaded: 234377 words


In [20]:
# code here! (add as many cells as you'd like)


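def evaluate_model(file: str, model: str, n: int) -> dict:
    """Score one generated-text file against the dictionary and training vocabulary."""
    with open(file, "r") as f:
        text = f.read()

    # Named `tokens` so the nltk `words` corpus import is not shadowed.
    tokens = extract_words(text)
    valid = [w for w in tokens if w in english_words]
    invalid = [w for w in tokens if w not in english_words]
    valid_unseen = [w for w in valid if w not in training_words]

    # Percentages are over all generated tokens; guard against an empty file.
    total = len(tokens)
    return {
        "model": model,
        "n": n,
        "valid": valid,
        "invalid": invalid,
        "pct_valid": len(valid) / total * 100 if total else 0.0,
        "pct_valid_unseen": len(valid_unseen) / total * 100 if total else 0.0,
    }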


results = [
    evaluate_model("generated_charbased.txt", "neural", 3),
    evaluate_model("generated_charbased_4.txt", "neural", 4),
    evaluate_model("generated_charbased_5.txt", "neural", 5),
    evaluate_model("spooky_vanilla_3_char.txt", "vanilla", 3),
    evaluate_model("spooky_vanilla_4_char.txt", "vanilla", 4),
    evaluate_model("spooky_vanilla_5_char.txt", "vanilla", 5),
]

In [21]:
valid_data = []
invalid_data = []

for r in results:
    for word in r["valid"]:
        valid_data.append({"model": r["model"], "n": r["n"], "sequence": word})
    for word in r["invalid"]:
        invalid_data.append({"model": r["model"], "n": r["n"], "sequence": word})

pd.DataFrame(valid_data).to_csv("valid_words_lms.csv", index=False)
pd.DataFrame(invalid_data).to_csv("invalid_words_lms.csv", index=False)

In [23]:
for r in results:
    if r["invalid"]:
        sample = random.sample(r["invalid"], min(50, len(r["invalid"])))
        print(f"{r['model']}_n{r['n']}: {sample[:10]}")

neural_n4: ['strance', 'strance', 'strance', 'strance', 'strance', 'strance', 'strance', 'strance', 'strance', 'strance']
neural_n5: ['sall', 'sall', 'sall', 'sall', 'sall', 'sall', 'sall', 'sall', 'sall', 'sall']
vanilla_n3: ['aboultiethe', 'whild', 'uption', 'lon', 'angs', 'sichand', 'thathe', 'hou', 'juppred', 'inincesto']
vanilla_n4: ['beams', 'imps', 'franginaluse', 'writy', 'thods', 'ter', 'wharation', 'desperface', 'stration', 'circleave']
vanilla_n5: ['providention', 'breated', 'probes', 'betterns', 'll', 'throneousness', 'littleman', 'frant', 'proportrayer', 'withought']


In [24]:
manual_counts = {
    "neural_n3": 0,
    "neural_n4": 0,
    "neural_n5": 0,
    "vanilla_n3": 5,
    "vanilla_n4": 8,
    "vanilla_n5": 12,
}

print("| model   | n | % valid | % valid unseen | % should be valid |")
print("|---------|---|---------|----------------|-------------------|")
for r in results:
    key = f"{r['model']}_n{r['n']}"
    manual = manual_counts[key]
    print(
        f"| {r['model']:7} | {r['n']} | {r['pct_valid']:7.1f} | {r['pct_valid_unseen']:14.1f} | {manual:17.1f} |"
    )

| model   | n | % valid | % valid unseen | % should be valid |
|---------|---|---------|----------------|-------------------|
| neural  | 3 |   100.0 |            0.0 |               0.0 |
| neural  | 4 |    72.7 |            0.0 |               0.0 |
| neural  | 5 |    58.3 |            0.0 |               0.0 |
| vanilla | 3 |    51.0 |            7.9 |               5.0 |
| vanilla | 4 |    65.9 |            3.8 |               8.0 |
| vanilla | 5 |    79.3 |            2.1 |              12.0 |


13a.
For each value of n = 3, 4, and 5:
- the percentage of words produced by each model (neural and vanilla) that are valid English words
- the percentage of words produced by each model (neural and vanilla) that are valid English words *and* were not seen at train time
- Gather the sequences of characters that are determined not to be words for each model. Randomly sample 50 of these sequences, and count how many of them *should have* been counted as words in your opinion.
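
The sampling-and-counting step in the last bullet can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the numbers in this notebook: the helper names `sample_for_review` and `pct_should_be_valid`, the seed, and the placeholder sequences are all hypothetical. It assumes the manual tally is a raw count that is then divided by the size of the reviewed sample.

```python
import random


def sample_for_review(sequences, k=50, seed=0):
    """Return up to k sequences drawn without replacement, reproducibly."""
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return rng.sample(sequences, min(k, len(sequences)))


def pct_should_be_valid(n_judged_valid, sample_size):
    """Convert a manual tally into a percentage of the reviewed sample."""
    return 100 * n_judged_valid / sample_size if sample_size else 0.0


invalid = [f"seq{i}" for i in range(120)]  # made-up placeholder sequences
sample = sample_for_review(invalid)

# Suppose 6 of the sampled sequences were judged to be real words by hand.
print(len(sample), pct_should_be_valid(6, len(sample)))  # 50 12.0
```

Fixing the seed keeps the reviewed sample stable across reruns, so the manual tally stays tied to the same 50 sequences. Passing `sorted(set(invalid))` instead of `invalid` would sample distinct sequences rather than repeated tokens, which may matter when a model emits the same non-word many times.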

Fill in this table with your answers:
| model | n | \% valid | \% valid and not-seen at train | \% not valid that should have been |
| - | - | - | - | - |
| neural | 3 | 100 | 0 | 0 |
| vanilla | 3 | 51 | 7.9 | 5 |
| neural | 4 | 72.7 | 0 | 0 |
| vanilla | 4 | 65.9 | 3.8 | 8 |
| neural | 5 | 58.3 | 0 | 0 |
| vanilla | 5 | 79.3 | 2.1 | 12 |


Submit two CSV files alongside this notebook: `valid_words_lms.csv` and `invalid_words_lms.csv`. Both files should have __three__ columns: `model`, `n`, `sequence`. `model` will have the value `neural` or `vanilla`. `n` will be the n-gram level. `sequence` will be the corresponding sequence of characters. `valid_words_lms.csv` should contain all sequences from both models you determined to be valid words. `invalid_words_lms.csv` will have all sequences from both models you programmatically determined to be invalid words.
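
Before submitting, the two files could be checked against the schema described above. The sketch below is one possible check using only the standard library. The helper name `check_rows` is hypothetical, and the demo runs on a made-up in-memory file rather than the real output.

```python
import csv
import io

EXPECTED_COLUMNS = ["model", "n", "sequence"]


def check_rows(handle):
    """Validate the header and each row of an open CSV file; return the row count."""
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    count = 0
    for row in reader:
        if row["model"] not in {"neural", "vanilla"}:
            raise ValueError(f"unexpected model value: {row['model']!r}")
        int(row["n"])  # raises ValueError if n is not an integer
        count += 1
    return count


# In-memory stand-in for a real file; the rows are made up.
demo = io.StringIO("model,n,sequence\nneural,3,the\nvanilla,5,raven\n")
print(check_rows(demo))  # 2
```

For the real files, the same function can be called on a handle from `open("valid_words_lms.csv", newline="")` and likewise for `invalid_words_lms.csv`.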