# N-Gram Language Modeling Lab

## Introduction

In this lab, you will build an n-gram language model from a Project Gutenberg corpus. You'll load the corpus, tokenize it with a simple Byte Pair Encoding (BPE) tokenizer, construct an n-gram frequency dictionary, and then generate text from the model using temperature sampling and beam search. Some parts of the code will be left for you to complete.

## Load a Corpus

First, we'll load a text corpus from Project Gutenberg. For this lab, we'll use the text of "Alice's Adventures in Wonderland" by Lewis Carroll.

In [133]:
import urllib.request
import os

def download_corpus(url, filename):
    if not os.path.exists(filename):
        urllib.request.urlretrieve(url, filename)

In [134]:
corpus_url = "https://gutenberg.org/ebooks/11.txt.utf-8"
corpus_filename = "alice_in_wonderland.txt"

download_corpus(corpus_url, corpus_filename)

In [135]:
# Read with an explicit encoding and close the file when done
with open(corpus_filename, 'r', encoding='utf-8') as f:
    corpus = f.read()

In [136]:
corpus



In [137]:
# Strip non-ASCII characters (for this demo)
corpus = corpus.encode('ascii', 'ignore').decode('ascii')
corpus[:100]

"The Project Gutenberg eBook of Alice's Adventures in Wonderland\n    \nThis ebook is for the use of an"

In [138]:
from textwrap import fill
print(fill(corpus[:100], drop_whitespace=False))

The Project Gutenberg eBook of Alice's Adventures in Wonderland      
This ebook is for the use of an


## Tokenize the Corpus

In this section, you'll build a simple Byte Pair Encoding (BPE) tokenizer to split the text into tokens. BPE is a subword segmentation algorithm that iteratively merges the most frequent pair of bytes or characters.
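
To make a single merge step concrete, here is a minimal sketch on a toy string. The string and the helper names `count_pairs` and `merge_pair` are illustrative assumptions, not part of the lab code, and the sketch merges by string concatenation rather than the nested tuples used in the cells below.

```python
from collections import Counter

def count_pairs(tokens):
    """Count how often each adjacent pair of tokens occurs."""
    return Counter(zip(tokens, tokens[1:]))

def merge_pair(tokens, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

toy_tokens = list("aaabdaaabac")
toy_pairs = count_pairs(toy_tokens)
best_pair = max(toy_pairs, key=toy_pairs.get)   # most frequent adjacent pair
toy_merged = merge_pair(toy_tokens, best_pair)

print(best_pair, toy_pairs[best_pair])  # ('a', 'a') 4
print(toy_merged)  # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Repeating this step on the merged list is what grows the vocabulary: each round adds one new, longer token.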


In [139]:
tokens = list(corpus)
print(tokens[:10])

['T', 'h', 'e', ' ', 'P', 'r', 'o', 'j', 'e', 'c']


In [140]:
from collections import defaultdict

def get_stats(tokens):
    """Compute frequencies of adjacent pairs."""
    pairs = defaultdict(int)
    for i in range(len(tokens)-1):
        pairs[(tokens[i], tokens[i+1])] +=1
    return pairs

stats = get_stats(tokens)
stats

defaultdict(int,
            {('T', 'h'): 262,
             ('h', 'e'): 3969,
             ('e', ' '): 4845,
             (' ', 'P'): 121,
             ('P', 'r'): 92,
             ('r', 'o'): 532,
             ('o', 'j'): 84,
             ('j', 'e'): 105,
             ('e', 'c'): 299,
             ('c', 't'): 232,
             ('t', ' '): 2822,
             (' ', 'G'): 135,
             ('G', 'u'): 84,
             ('u', 't'): 742,
             ('t', 'e'): 1015,
             ('e', 'n'): 1122,
             ('n', 'b'): 94,
             ('b', 'e'): 631,
             ('e', 'r'): 2079,
             ('r', 'g'): 173,
             ('g', ' '): 907,
             (' ', 'e'): 394,
             ('e', 'B'): 16,
             ('B', 'o'): 17,
             ('o', 'o'): 473,
             ('o', 'k'): 226,
             ('k', ' '): 330,
             (' ', 'o'): 1511,
             ('o', 'f'): 718,
             ('f', ' '): 825,
             (' ', 'A'): 446,
             ('A', 'l'): 410,
             ('l', 'i'

In [141]:
n = 10
top_n_tokens = sorted(stats.items(), key=lambda x: x[1], reverse=True)[:n]
top_n_tokens

[(('e', ' '), 4845),
 ((' ', 't'), 4158),
 (('h', 'e'), 3969),
 (('t', 'h'), 3565),
 ((' ', 'a'), 2947),
 (('d', ' '), 2848),
 (('t', ' '), 2822),
 ((',', ' '), 2374),
 ((' ', 's'), 2283),
 (('i', 'n'), 2236)]

In [142]:
most_frequent_token = max(stats, key=stats.get)
most_frequent_token

('e', ' ')

In [143]:
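def merge_vocab(pair, tokens):
    """Replace each non-overlapping occurrence of `pair` with a single merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        # Check that i + 1 is in range so the final token is never dropped
        if i + 1 < len(tokens) and (tokens[i], tokens[i+1]) == pair:
            merged.append(pair)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged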

tokens = merge_vocab(most_frequent_token, tokens)
tokens[:10]

['T', 'h', ('e', ' '), 'P', 'r', 'o', 'j', 'e', 'c', 't']

In [144]:
def flatten_token(token):
    if isinstance(token, tuple):
        return ''.join(flatten_token(t) for t in token)
    else:
        return token

flatten_token((((' ', 't'), 'h'), 'e'))

' the'

In [145]:
import ipywidgets
import tqdm.notebook as tq

def build_bpe_vocab(corpus, num_merges, verbose=False):
    """Build BPE vocabulary."""
    # Start from individual characters; each merge introduces one longer token
    tokens = list(corpus)
    
    for i in tq.trange(num_merges, leave=False, desc='Building BPE vocab'):
        pairs = get_stats(tokens)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        tokens = merge_vocab(best, tokens)
        if verbose:
            tq.tqdm.write(f'Merge {i+1:<5}: {repr(flatten_token(best)):<10}')
    
    # Sort longest-first so that greedy matching tries the longest tokens first
    vocab = sorted([flatten_token(t) for t in set(tokens)], key=len, reverse=True)
    return vocab

vocab = build_bpe_vocab(corpus, num_merges=10, verbose=True)
print("\nBPE Vocabulary (first 10 tokens):", vocab[:10])

Building BPE vocab:   0%|          | 0/10 [00:00<?, ?it/s]

Merge 1    : 'e '      
Merge 2    : ' t'      
Merge 3    : ' a'      
Merge 4    : ' th'     
Merge 5    : 'in'      
Merge 6    : 't '      
Merge 7    : 'er'      
Merge 8    : 'd '      
Merge 9    : 'ou'      
Merge 10   : ' s'      

BPE Vocabulary (first 10 tokens): [' th', 'd ', 'in', 'ou', ' s', ' a', 't ', 'er', ' t', 'e ']


In [146]:
#vocab = build_bpe_vocab(corpus, num_merges=len(corpus) - 1000)
vocab = build_bpe_vocab(corpus, num_merges=1000) # (I am not patient, this may take a while)

Building BPE vocab:   0%|          | 0/1000 [00:00<?, ?it/s]

In [147]:
vocab[:10]

['*      *      *      *      ',
 'Project Gutenberg ',
 ', said the Hatter',
 'Project Gutenberg',
 ', said the King',
 'electronic work',
 '*      *      ',
 'Mock Turtle ',
 ', said Alice',
 'Caterpillar']

In [148]:
print("Length of final vocab:", len(vocab))

Length of final vocab: 1062


In [149]:
def tokenize(input_str, vocab):
    tokens = []
    i = 0

    while i < len(input_str):
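        # Try to match the longest token from the current position
        # (relies on vocab being sorted longest-first)
        for token in vocab:
            if input_str.startswith(token, i):
                tokens.append(token)
                i += len(token)
                break
        else:
            # No vocabulary token matches here: emit the raw character
            tokens.append(input_str[i])
            i += 1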

    return tokens

bpe_tokens = tokenize(corpus, vocab) # Takes about 2m 30 seconds
print("First 10 tokens:", bpe_tokens[:10])

First 10 tokens: ['The ', 'Project Gutenberg ', 'e', 'B', 'ook ', 'of ', 'Alice', "'", 's ', 'A']


In [150]:
print(corpus[:30])

The Project Gutenberg eBook of


In [151]:
# %pip install marisa-trie

In [152]:
import marisa_trie

def tokenize(input_str, vocab):
    # Build the Trie from the vocabulary
    trie = marisa_trie.Trie(vocab)
    
    tokens = []
    i = 0

    while i < len(input_str):
        # Find all prefixes that match the current position
        prefixes = trie.prefixes(input_str[i:])
        
        if prefixes:
            # Choose the longest matching prefix
            longest_prefix = max(prefixes, key=len)
            tokens.append(longest_prefix)
            i += len(longest_prefix)  # Move forward by the length of the matched token
        else:
            tokens.append(input_str[i])
            i += 1  # No match, move to the next character

    return tokens


bpe_tokens = tokenize(corpus, vocab) # Takes about 2.5 seconds
print("First 10 tokens:", bpe_tokens[:10])  

First 10 tokens: ['The ', 'Project Gutenberg ', 'e', 'B', 'ook ', 'of ', 'Alice', "'", 's ', 'A']


## Create N-Gram Frequency Dictionary

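We'll create a hash table in which each prefix (the preceding tokens, up to n-1 of them) maps to a dictionary of possible next tokens and their frequencies. The table stores every order from 1 to n, so shorter prefixes, including the empty prefix `()` for unigram counts, are available as well.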
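
As a small illustration of how such a table is used, the sketch below builds bigram counts for a toy token list and converts the counts for one prefix into conditional probabilities. The token list and the helper `next_token_probs` are hypothetical and only meant to show the idea.

```python
from collections import Counter, defaultdict

toy_tokens = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Bigram table: each 1-token prefix maps to a Counter of the tokens that follow it
toy_freq = defaultdict(Counter)
for prev, nxt in zip(toy_tokens, toy_tokens[1:]):
    toy_freq[(prev,)][nxt] += 1

def next_token_probs(freq, prefix):
    """Relative-frequency estimate: count(prefix, token) / count(prefix, any token)."""
    counts = freq.get(tuple(prefix), Counter())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()} if total else {}

toy_probs = next_token_probs(toy_freq, ["the"])
print(toy_freq[("the",)])  # Counter({'cat': 2, 'mat': 1})
print(toy_probs)           # cat: 2/3, mat: 1/3
```

A prefix that never occurs in the corpus yields an empty dictionary, which is the case the generation code has to handle.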

In [153]:
from collections import defaultdict, Counter

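def build_ngram_freq(tokens, n=2):
    """Map each prefix of length 0..n-1 to a Counter of the tokens that follow it."""
    ngram_freq = defaultdict(Counter)
    for k in range(1, n+1):
        # + 1 so that the final k-gram in the corpus is counted as well
        for i in range(len(tokens) - k + 1):
            ngram = tokens[i:i+k]  # Slice only k tokens instead of copying the tail
            prefix = ngram[:-1]
            next_token = ngram[-1]
            ngram_freq[tuple(prefix)][next_token] += 1
    return ngram_freq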

In [158]:
bigram_freq = build_ngram_freq(bpe_tokens, n=2)
print(f"Number of unique prefixes: {len(bigram_freq)}")
print("Sample prefix and possible next tokens with frequencies:")
sample_prefix = list(bigram_freq.keys())[0]
print(f"Prefix: {sample_prefix}")
print(bigram_freq[sample_prefix].most_common(5))

Number of unique prefixes: 1045
Sample prefix and possible next tokens with frequencies:
Prefix: ()
[(' ', 1513), ('a', 939), ('th', 824), ('\n', 695), ('t', 692)]


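**Expected Output** (this appears to come from a version that stores only order-`n` prefixes; because `build_ngram_freq` above also stores the empty prefix `()` for unigram counts, your run should report one additional prefix and list `()` first):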


```txt
    Number of unique prefixes: 1044
    Sample prefix and possible next tokens with frequencies:
    Prefix: ('The ',)
    [('Queen', 4), ('Mouse ', 4), ('Queen ', 4), ('P', 3), ('Rabbit ', 3)]
```


## Temperature Sampling

In this section, we will use temperature sampling to generate text from the n-gram model. By adjusting the temperature, we can control the randomness of our token choices. A lower temperature makes the model more deterministic, while a higher temperature allows for more variability in token selection.
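
The sampler in this section raises each count `f` to the power `1/T` and renormalizes, so a token's probability is `f**(1/T)` divided by the sum of that quantity over all candidate tokens. The sketch below shows the effect on a toy pair of counts; the counts and the helper name `apply_temperature` are assumptions for illustration.

```python
def apply_temperature(counts, temperature):
    """Raise each count to the power 1/temperature, then renormalize to sum to 1."""
    scaled = {tok: c ** (1.0 / temperature) for tok, c in counts.items()}
    total = sum(scaled.values())
    return {tok: s / total for tok, s in scaled.items()}

toy_counts = {"cat": 8, "mat": 2}
for t in (0.5, 1.0, 2.0):
    probs = apply_temperature(toy_counts, t)
    print(t, {tok: round(p, 3) for tok, p in probs.items()})
# 0.5 {'cat': 0.941, 'mat': 0.059}
# 1.0 {'cat': 0.8, 'mat': 0.2}
# 2.0 {'cat': 0.667, 'mat': 0.333}
```

At `T = 1.0` the probabilities are the raw relative frequencies; below 1.0 the most frequent token dominates, and above 1.0 the distribution flattens.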

In [175]:
import numpy as np

def temperature_sample(ngram_freq, n, prefix, max_length=0, temperature=1.0, fallback_prob=0.05, vocab=None):
    """
    Generator that yields tokens one at a time using temperature sampling with a fallback option.
    
    Args:
        ngram_freq (dict): The n-gram frequency dictionary.
        n (int): Order of the n-gram model.
        prefix (list): Initial list of tokens to start generation.
        max_length (int): Maximum number of tokens to generate. Set to 0 for indefinite generation.
        temperature (float): Temperature value to adjust randomness in sampling.
        fallback_prob (float): Probability of choosing a random token as a fallback.
        vocab (list): Complete vocabulary list for random token fallback.
        
    Yields:
        str: The next token in the generated sequence.
    """
    if not isinstance(prefix, list):
        prefix = [prefix]
    
    current_seq = prefix[:]
    generated_length = 0
    
    for tok in prefix:
        yield tok

    while max_length == 0 or generated_length < max_length:
        # Use the last (n-1) tokens as context; for n == 1 the context is empty
        context = current_seq[-(n-1):] if n > 1 else []
        
        # Fetch possible next tokens
        next_tokens = ngram_freq.get(tuple(context), {})
        if not next_tokens:
            break  # End generation if no next tokens
        
        # Adjust probabilities based on temperature
        token_probs = {token: freq ** (1.0 / temperature) for token, freq in next_tokens.items()}
        total_prob = sum(token_probs.values())
        token_probs = {token: prob / total_prob  for token, prob in token_probs.items()}
        
        # Add fallback probability to choose a random token from vocab
        tokens, probs = zip(*token_probs.items())
        if vocab is not None and np.random.rand() < fallback_prob:
            next_token = np.random.choice(vocab)  # Pick random fallback token
        else:
            next_token = np.random.choice(tokens, p=probs)
        
        # Update sequence and yield token
        current_seq.append(next_token)
        generated_length += 1
        yield next_token

In [184]:
prefix = tokenize("The cheshire cat", vocab)
temp_sample_generator = temperature_sample(bigram_freq, 2, prefix, max_length=100, temperature=0.7)

print("\nGenerated Text:")
for token in temp_sample_generator:
    print(token, end='')


Generated Text:
The cheshire catchinute went to be herself fathere (the PRok of the sion a clear would chance whole
was gon just as well her hriumping of its norn bite, but very good deal too was going down at I dont know what a little feet high, and she said; but a or time; and, tool offickes they began ruppose us

In [183]:
if 'trigram_freq' not in locals():
    trigram_freq = build_ngram_freq(bpe_tokens, 3)

prefix = tokenize("The cheshire cat", vocab)
temp_sample_generator = temperature_sample(trigram_freq, 3, prefix, max_length=100, temperature=0.7)

print("\nGenerated Text:")
for token in temp_sample_generator:
    print(token, end='')


Generated Text:
The cheshire cats eat bathing itself, and began an accountry where she was
going to give the
time to hear it a very humble tone, From the Queens Croquet-Ground
 CHAPTER IV.
Adventures hid think I may as she could, for the poolshe
could hear her chin into Alices side and a backs 

## Beam Search Sampling

We'll implement beam search to generate text based on the n-gram model. Beam search keeps track of the top `k` sequences at each step to explore multiple hypotheses.
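
To see why keeping several hypotheses can help, consider the toy sketch below. It scores a sequence by its summed negative log-probability (lower is better) and compares a beam of width 1, which is greedy search, with a beam of width 2. The probability table and the function `toy_beam_search` are hypothetical and much simpler than the lab implementation.

```python
import math

# Hypothetical next-token probabilities for a toy bigram model
toy_model = {
    "A": {"B": 0.6, "C": 0.4},
    "B": {"D": 0.5, "E": 0.5},
    "C": {"F": 0.9, "G": 0.1},
}

def toy_beam_search(model, start, steps, beam_width):
    """Keep the `beam_width` lowest-cost sequences; cost is the summed -log(prob)."""
    beams = [([start], 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, cost in beams:
            for tok, p in model.get(seq[-1], {}).items():
                candidates.append((seq + [tok], cost - math.log(p)))
        if not candidates:
            break  # No sequence can be extended
        beams = sorted(candidates, key=lambda pair: pair[1])[:beam_width]
    return beams[0]

greedy_seq, greedy_cost = toy_beam_search(toy_model, "A", steps=2, beam_width=1)
beam_seq, beam_cost = toy_beam_search(toy_model, "A", steps=2, beam_width=2)

print(greedy_seq, round(math.exp(-greedy_cost), 2))  # ['A', 'B', 'D'] 0.3
print(beam_seq, round(math.exp(-beam_cost), 2))      # ['A', 'C', 'F'] 0.36
```

Greedy search commits to `B` because it is the likeliest first step, while the wider beam keeps `C` alive long enough to find the more probable overall sequence.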

In [192]:
import math
import random

def beam_search_sampling(ngram_freq, n=2, beam_width=3, max_length=20, seed_prefix=None):
    """
    Generate text using beam search.
    
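    Args:
        ngram_freq (dict): The n-gram frequency dictionary.
        n (int): Order of the n-gram model.
        beam_width (int): Number of sequences to keep at each step.
        max_length (int): Maximum number of tokens to generate.
        seed_prefix (list): Initial list of tokens to start the generation.
        
    Returns:
        tuple: (best_sequence, sequences), where best_sequence is the token list
            with the lowest score and sequences is the final beam of
            (sequence, score) pairs.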
    """
    if seed_prefix is None:
        # Start with an empty prefix
        seed_prefix = []

    if not isinstance(seed_prefix, list):
        seed_prefix = [seed_prefix]

    sequences = [(seed_prefix, 0.0)]  # (sequence, score)
    
    for _ in range(max_length):
        all_candidates = []
        for seq, score in sequences:
            # For an n-gram model, the prefix is the last n-1 tokens (empty when n == 1)
            prefix = seq[-(n-1):] if n > 1 else []
            next_tokens = ngram_freq.get(tuple(prefix), {})
            if not next_tokens:
                continue
            total = sum(next_tokens.values())  # Compute the normalizer once per prefix
            for token, freq in next_tokens.items():
                prob = freq / total
                candidate_seq = seq + [token]  
                candidate_score = score - math.log(prob)
                all_candidates.append((candidate_seq, candidate_score))
        if len(all_candidates) == 0:
            break  # No more extensions, stop generating

        # Sort all candidates by score and select top beam_width
        ordered = sorted(all_candidates, key=lambda tup: tup[1])
        sequences = ordered[:beam_width]
        
        # Stop if all sequences have no extensions
        if not sequences:
            break
    
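    # Select the sequence with the lowest score (cumulative negative log-probability)
    if len(sequences) == 0:
        return [], []  # Same shape as the normal return value, so callers can unpack it
    else:
        best_sequence,_ = min(sequences, key=lambda tup: tup[1])
        return best_sequence, sequences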

In [210]:
prefix = tokenize("I don't know", vocab)
generated_tokens, others = beam_search_sampling(bigram_freq, n=2, beam_width=5, max_length=100, seed_prefix=prefix)

print("\nGenerated Text:")
print(fill(''.join(generated_tokens), replace_whitespace=False))

print("Others:")
print('-'*70)
for seq, score in others:
    print(f'Score [{score:<5}], lower=better')
    print(fill(''.join(seq), replace_whitespace=False))
    print()


Generated Text:
I don't know whichedgeholdirece: that she had been anxiously as she
had been any ARENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDE
NDENDENDENDENDEND
Others:
----------------------------------------------------------------------
Score [199.22341703124607], lower=better
I don't know whichedgeholdirece: that she had been anxiously as she
had been any ARENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDE
NDENDENDENDENDEND

Score [199.62888213935423], lower=better
I don't know whichedgeholdirece: that she had been anxiously as she
had been any ARENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDE
NDENDENDENDENDENT

Score [199.9054661491479], lower=better
I don't know whichedgeholdirece: that she had been anxiously as she
had been any ARENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDENDE
NDENDENDENDENDEng

Score [199.98555708329297], lower=better
I don't know whichedgeholdirece: that she had been anxiously as she
had been any ARENDENDENDENDENDENDENDENDEND

In [211]:
# Avoid recomputing ngrams for different n
if 'ngrams' not in locals():
    ngrams ={2:bigram_freq, 3:trigram_freq}

In [212]:
n = 4
if n not in ngrams:
    print(f"Building ngrams with n={n}")
    ngrams[n] = build_ngram_freq(bpe_tokens, n=n)

generated_tokens, others = beam_search_sampling(ngrams[n], n=n, beam_width=5, max_length=100)

print("\nGenerated Text:")
print(''.join(generated_tokens))
print()

print("Others:\n")
for seq, score in others:
    print(f'Score[{score:<5}]')
    print(''.join(seq))
    print()


Generated Text:
a good deal: this caused some noise and
confusion of voicesHold up his headBrandy
nowDont choke himHow was it, old fellow! said the others.

We must burn the house down! said the Rabbits voice; and Alice
called out The race is over! and they dont reach half high enough yetOh! theyll all think me
at hom

Others:

Score[20.332557234084046]
a good deal: this caused some noise and
confusion of voicesHold up his headBrandy
nowDont choke himHow was it, old fellow! said the others.

We must burn the house down! said the Rabbits voice; and Alice
called out The race is over! and they dont reach half high enough yetOh! theyll all think me
at hom

Score[21.313386487095766]
a good deal: this caused some noise and
confusion of voicesHold up his headBrandy
nowDont choke himHow was it, old fellow! said the others.

We must burn the house down! said the Rabbits voice; and Alice
called out to her in an angry voicethe RabbitsPat! Where are you? And
they pinched it on bo

Score[21.313386

In [216]:
# %pip install nltk
# %pip install rouge_score

In [217]:
from nltk.translate.bleu_score import sentence_bleu
from rouge_score import rouge_scorer

def calculate_bleu(reference_text, generated_text):
    # Tokenize and calculate BLEU score
    reference_tokens = [reference_text.split()]
    generated_tokens = generated_text.split()
    return sentence_bleu(reference_tokens, generated_tokens)

def calculate_rouge(reference_text, generated_text):
    # Use ROUGE-1, ROUGE-2, and ROUGE-L
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL'], use_stemmer=True)
    return scorer.score(reference_text, generated_text)

def calculate_diversity(generated_text, n=2):
    # Calculate distinct-n (proportion of unique n-grams in the text)
    tokens = generated_text.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0
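
As a quick sanity check on the distinct-n metric, the sketch below applies the same formula to two toy sentences. The sentences and the standalone helper `distinct_n` are assumptions for illustration; the helper mirrors `calculate_diversity` so the example runs on its own.

```python
def distinct_n(text, n=2):
    """Fraction of word n-grams in `text` that are unique (whitespace tokenization)."""
    words = text.split()
    grams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    return len(set(grams)) / len(grams) if grams else 0.0

repetitive = "the cat sat the cat sat the cat sat"
varied = "the cat sat on a warm mat by the door"

print(distinct_n(repetitive))  # 0.375 (3 unique bigrams out of 8)
print(distinct_n(varied))      # 1.0 (all 9 bigrams are unique)
```

Scores near 1.0 indicate little repetition, and scores near 0 indicate looping output such as the repeated `END` tokens in the bigram beam search example above.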

In [227]:
# Generate text using temperature sampling, starting from an empty prefix

n = 10
if n not in ngrams: 
    print(f"Building ngrams for n={n}")
    ngrams[n] = build_ngram_freq(bpe_tokens, n)

generated_tokens = list(temperature_sample(ngrams[n], n, prefix=[],  max_length=2000, temperature=0.7))

# Convert tokens to strings
generated_text = ''.join(generated_tokens)

# Select a reference segment from the corpus to match generated length
start_index = random.randint(0, len(bpe_tokens)-len(generated_tokens))
reference_tokens = bpe_tokens[start_index:][:len(generated_tokens)]
reference_text = ''.join(reference_tokens)

# Calculate BLEU, ROUGE, and distinct-n
bleu_score = calculate_bleu(reference_text, generated_text)
rouge_scores = calculate_rouge(reference_text, generated_text)
diversity_score = calculate_diversity(generated_text, n=2)

# Display results
print(f"BLEU Score: {bleu_score:.4f}")
print("ROUGE Scores:")
for key in rouge_scores:
    print(f"  - {key}: {rouge_scores[key]}")
print(f"Diversity (distinct-2): {diversity_score:.4f}")

BLEU Score: 0.0146
ROUGE Scores:
  - rouge1: Score(precision=0.4252199413489736, recall=0.39473684210526316, fmeasure=0.4094117647058824)
  - rouge2: Score(precision=0.06457925636007827, recall=0.05994550408719346, fmeasure=0.06217616580310881)
  - rougeL: Score(precision=0.15542521994134897, recall=0.1442831215970962, fmeasure=0.1496470588235294)
Diversity (distinct-2): 0.8822


# Homework Assignment
## Analyzing N-Gram Language Models with Temperature Sampling

In this assignment, you will explore how different configurations of an n-gram language model affect the quality of generated text. You will vary the **temperature** parameter in sampling and experiment with different **n-gram orders** to see how these factors influence fluency, coherence, and diversity.

### Objectives
- Generate text using various n-gram orders and temperatures.
- Perform a **qualitative** analysis by examining examples of generated text.
- Perform a **quantitative** analysis using evaluation metrics.

### Instructions

#### 1. Experiment with N-Gram Orders
- **Task**: Generate text using bigram, trigram, and 4-gram models.
- **Procedure**:
  - Go to Project Gutenberg and select another reference text to use with this notebook.
  - Use the provided `build_ngram_freq` function to create bigram, trigram, and 4-gram frequency dictionaries.
  - Generate a text sample for each n-gram model using `temperature_sample`.
- **Deliverable**: 
  - Show generated examples for each n-gram order and identify patterns or flaws (e.g., repetitive phrases, unnatural phrasing).
  - Does a higher n-gram order (e.g., 4-gram) improve text coherence?
  - Are lower-order models (e.g., bigram) more likely to produce repetitive sequences?

#### 2. Experiment with Temperature Values
- **Task**: Generate text with different temperatures (e.g., 0.5, 0.7, 1.0, 1.2) using the trigram model.
- **Procedure**:
  - Use `temperature_sample` to generate text with each temperature setting.
- **Deliverable**: 
  - Show generated examples for each temperature and look for patterns or issues (e.g., more creative vs. repetitive or incoherent outputs).
  - How does temperature affect the style and structure of the generated text?

#### 3. Quantitative Analysis Using Metrics
- **Task**: Calculate BLEU, ROUGE, and diversity scores (distinct-2) for the generated text.
- **Procedure**:
  - For each configuration in parts 1 and 2 (different n-gram orders and temperatures), calculate BLEU, ROUGE, and distinct-2 scores relative to a reference segment from the corpus.
  - Use `calculate_bleu`, `calculate_rouge`, and `calculate_diversity` functions provided.
- **Deliverable**:
  - Present a table summarizing BLEU, ROUGE-1, ROUGE-2, ROUGE-L, and distinct-2 scores for each combination of n-gram order and temperature.
  - Highlight any trends you observe (e.g., higher n-grams yield higher BLEU scores but lower diversity).

#### 4. Summary and Reflections
- **Deliverable**:
  - Write a final summary discussing your findings. Include your observations about the trade-offs between n-gram order and temperature and how these impact the quality of generated text.
  - How does increasing the n-gram order change the balance between diversity and coherence?
  - How does temperature tuning allow control over the creativity vs. coherence of the generated text?


### Submission
- Do **NOT** submit a notebook. Submit only examples of the generated text along with your write-up and table. 
- Ensure that all explanations and analyses are complete and well-organized.

### Additional Notes
- Clear, concise explanations and observations will enhance your assignment.
- Feel free to include additional metrics or analyses if relevant.

### Example Submission:

_This is an example of what a good submission would look like, but based on fake text from ChatGPT_

---

#### 1. Experiment with N-Gram Orders

- **Task**: Generate text using bigram, trigram, and 4-gram models.
- **Procedure**: I used “The Adventures of Sherlock Holmes” by Arthur Conan Doyle as my reference text. Below are the samples generated with different n-gram orders.

##### Bigram Model (n=2, temperature=0.7)
```
"The man in the room was his back was lying with his face in the next to him."
```

##### Trigram Model (n=3, temperature=0.7)
```
"The man was lying on the floor with his back to the window and his face to the door."
```

##### 4-Gram Model (n=4, temperature=0.7)
```
"The man was lying on the floor with his back to the window and his face towards the door."
```

---

#### 2. Experiment with Temperature Values

- **Task**: Generate text with different temperatures using the trigram model.
- **Procedure**: Using the trigram model, I generated text with temperatures of 0.5, 0.7, 1.0, and 1.2.

##### Temperature 0.5
```
"The man was lying on the floor with his back to the wall."
```
##### Temperature 0.7
```
"The man was lying on the floor with his back to the window and his face to the door."
```

##### Temperature 1.0
```
"The man appeared through his back towards the window with his face."
```

##### Temperature 1.2
```
"With his back, man was smiling face floor direction at his window."
```
---

#### 3. Quantitative Analysis Using Metrics

| N-Gram Order | Temperature | BLEU Score | ROUGE-1 | ROUGE-2 | ROUGE-L | Distinct-2 |
|--------------|-------------|------------|---------|---------|---------|------------|
| Bigram       | 0.7         | 0.065      | 0.42    | 0.11    | 0.15    | 0.84       |
| Trigram      | 0.7         | 0.178      | 0.58    | 0.24    | 0.33    | 0.75       |
| 4-Gram       | 0.7         | 0.214      | 0.62    | 0.29    | 0.41    | 0.70       |
| Trigram      | 0.5         | 0.193      | 0.53    | 0.22    | 0.30    | 0.71       |
| Trigram      | 1.0         | 0.127      | 0.45    | 0.18    | 0.25    | 0.81       |
| Trigram      | 1.2         | 0.102      | 0.39    | 0.15    | 0.21    | 0.87       |

---

#### 4. Summary and Reflections

In this experiment, I observed the following trends:
- **Higher n-gram orders** (e.g., 4-grams) create more coherent text with fewer repetitive patterns, but they tend to be more deterministic, limiting diversity.
- **Temperature adjustments** allow control over the model’s creativity. Lower temperatures create more predictable and coherent text, while higher temperatures increase diversity but reduce coherence.
- **Recommendations**: For factual generation tasks, I would recommend a high n-gram order (3 or 4) and a lower temperature (e.g., 0.5-0.7) to maintain accuracy and coherence. For creative generation, a moderate n-gram order with a higher temperature (e.g., 1.0) may produce more diverse and interesting results.


# Grading

### Rubric for N-Gram Language Modeling Lab

| **Criteria**                          | **Points** | **Description**                                                                                                                                                                                       |
|---------------------------------------|------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Corpus Selection**                  | **20**     | Selects a new text corpus from Project Gutenberg or similar. Briefly describes the chosen text and reasons for selection.                                                                            |
| **N-Gram Order Experimentation**      | **30**     | Tests at least three n-gram orders (e.g., bigram, trigram, 4-gram). Provides generated examples and discusses which order best balances coherence and diversity.                                     |
| **Temperature Tuning**                | **20**     | Experiments with several temperatures (e.g., 0.5, 0.7, 1.0, 1.2) and explains observed effects on coherence and creativity.                                                                          |
| **Concise Results-Only Submission**   | **30**     | Submits only generated text samples, analysis, and metrics, without the full notebook or code cells. Organizes results for readability, with necessary comments and tables only.                      |

