## Our Development Philosophy: From Baseline to Optimized

This project follows a deliberate and iterative approach to model development. Our core principle is to ensure that every change is measured, understood, and contributes positively to the final result. This methodology transforms the development process from a series of guesses into a scientific experiment.

### Phase 1: Build the Brute-Force Baseline

First, we create the simplest possible end-to-end working model. This **bare-minimum** version is not intended to be sophisticated; it is a functional starting point that serves two purposes:

1.  **Proof of Concept:** It confirms that our data pipeline, from loading to processing and modeling, is functional. It validates our foundational assumptions.
2.  **Establish a Benchmark:** It provides a clear **baseline performance metric**. All future work will be measured against this initial score. Without a baseline, it's impossible to quantify improvement.

### Phase 2: Optimize One Step at a Time

Once the baseline is established, we begin a cycle of incremental improvement. We strictly adhere to the principle of making **one isolated change at a time**. This is the most critical aspect of our philosophy.

Instead of overhauling the model all at once and being unable to pinpoint what caused a change in performance, we will:

* **Target** a single component for improvement (e.g., memory usage, context awareness, grammar).
* **Implement** that one specific change.
* **Measure** its exact impact on performance and resource consumption.

This methodical process removes guesswork and allows us to attribute performance gains or losses directly to a specific action, ensuring that every step forward is a confident one.
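
The **Measure** step can be made concrete with a small harness. The sketch below is only an assumption about how one might record build time and peak memory for a candidate change using the standard library (`time.perf_counter` and `tracemalloc`); `measure` and `toy_builder` are hypothetical names and are not part of this project's code.

```python
import time
import tracemalloc

def measure(build_fn, *args):
    """Run build_fn(*args) and return (result, seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    result = build_fn(*args)
    seconds = time.perf_counter() - start
    # get_traced_memory() returns (current, peak) in bytes
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, seconds, peak_bytes

def toy_builder(text):
    # Hypothetical stand-in for a real model-building function
    return {word: len(word) for word in text.split()}

result, seconds, peak_bytes = measure(toy_builder, "the cat sat on the mat")
print(f"{seconds:.6f} s, peak {peak_bytes:,} bytes")
```

Running the same harness before and after a single change gives two comparable numbers, which is the point of changing only one thing at a time.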

### Step 1: Loading the Dataset

Our journey begins by loading WikiText, a corpus of text derived from high-quality Wikipedia articles. We load its smaller `wikitext-2-raw-v1` configuration with the `datasets` library from Hugging Face, which provides a standardized and efficient way to handle large datasets.

We also specify a custom cache directory (`../2.data`) to ensure that the downloaded data is stored in a predictable location within our project structure, making it easier to manage and reuse without re-downloading.

In [1]:
from datasets import load_dataset

# Define a custom directory to cache the downloaded dataset
my_custom_cache_dir = "../2.data"

# Load the 'wikitext-2-raw-v1' version of the wikitext dataset
wikitext_dataset = load_dataset(
    "wikitext",
    "wikitext-2-raw-v1",
    cache_dir=my_custom_cache_dir
)

# Print the dataset structure to see the different splits (train, validation, test)
print(wikitext_dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


### Step 2: Initial Data Exploration

Before we can clean the data, we need to understand its raw form. We'll inspect the first few lines of the training set to identify patterns, noise, and artifacts that need to be addressed in our preprocessing stage. This step is crucial for formulating an effective cleaning strategy.

In [2]:
# Create separate variables for easier access to each dataset split
test_data = wikitext_dataset['test']
train_data = wikitext_dataset['train']
validation_data = wikitext_dataset['validation']

# Print the first 5 lines of the training data to inspect its content
for i in range(5):
  print(f"{i}:{train_data[i]['text']}")

0:
1: = Valkyria Chronicles III = 

2:
3: Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . 

4: The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving f
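
The patterns visible in this output can also be tallied programmatically. The sketch below is a hypothetical helper, not part of the notebook: it classifies a few rows (copied from the output above, with the long paragraph abridged) as blank, heading, or body, and collects any non-ASCII characters.

```python
from collections import Counter

# Rows copied from the output above; the paragraph row is abridged
sample_rows = [
    "",
    " = Valkyria Chronicles III = \n",
    "",
    " Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 ) is a tactical role @-@ playing video game . \n",
]

def classify(row: str) -> str:
    """Label a raw row as 'blank', 'heading', or 'body'."""
    stripped = row.strip()
    if not stripped:
        return "blank"
    if stripped.startswith("="):
        return "heading"
    return "body"

kinds = Counter(classify(row) for row in sample_rows)
non_ascii = sorted({ch for row in sample_rows for ch in row if ord(ch) > 127})
print(kinds)
print(non_ascii)
```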

### Step 3: Defining the Preprocessing Strategy

Based on our initial exploration, we have identified several types of noise in the raw text. To prepare the data for our model, we will implement a multi-step cleaning process:

* **Remove Article Headings:** The dataset uses lines starting with ` = ` to denote section titles (e.g., `= = Gameplay = =`). These are metadata, not narrative text, and should be removed.
* **Handle Special Characters:** The text contains artifacts like `@-@` and non-ASCII characters (e.g., `戦場のヴァルキュリア3`). We will remove these to create a cleaner, more consistent vocabulary.
* **Standardize Casing:** All text will be converted to lowercase to ensure that the model treats the same word (e.g., "The" and "the") as a single entity.
* **Remove Empty Lines:** The dataset contains many blank lines, which add no value and should be filtered out.
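
The four rules above can be sketched as a single function applied to one line at a time. This is a minimal illustration under stated assumptions: `clean_line` is a hypothetical helper (the notebook itself uses `filter` and `map` on the dataset), and headings are detected on the stripped text, where they begin with `=`.

```python
import re

def clean_line(text: str) -> str:
    """Apply the cleaning rules to one line; return '' for headings and blank lines."""
    stripped = text.strip()
    # Blank lines and headings such as "= = Gameplay = =" are dropped
    if not stripped or stripped.startswith("="):
        return ""
    # Lowercase, then remove the '@-@' artifact
    lowered = stripped.lower().replace("@-@", "")
    # Keep only ASCII letters, digits, whitespace, and common punctuation
    return re.sub(r"[^a-z0-9\s.,'?!-]", "", lowered).strip()

print(repr(clean_line(" = = Gameplay = = \n")))
print(repr(clean_line(" A tactical role @-@ playing game . \n")))
```

Note that removing `@-@` leaves two consecutive spaces behind; this is harmless later because `str.split()` collapses runs of whitespace.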

### Step 4: Measuring the Impact of Preprocessing

Before applying our cleaning functions, we will establish a baseline measurement of the dataset's size. We'll calculate the total number of rows and characters in each split (train, validation, and test). After preprocessing, we will perform the same calculation again. This allows us to quantify the exact impact of our cleaning process, showing how much noise we have successfully removed from the dataset.

In [3]:
import re

# --- Before Preprocessing ---
print("--- Before Cleaning ---")
# Calculate and print the initial size of each dataset split
train_chars_before = sum(len(line) for line in train_data['text'] if line)
validation_chars_before = sum(len(line) for line in validation_data['text'] if line)
test_chars_before = sum(len(line) for line in test_data['text'] if line)
print(f"Training Data:   {train_data.num_rows:,} rows, {train_chars_before:,} characters")
print(f"Validation Data: {validation_data.num_rows:,} rows, {validation_chars_before:,} characters")
print(f"Test Data:       {test_data.num_rows:,} rows, {test_chars_before:,} characters")

# --- Preprocessing Steps ---
# The 'datasets' library allows us to chain operations for a clean pipeline.

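# 1. Filter out article headings.
# After .strip(), a heading such as " = Gameplay = " begins with "=", so we test
# for "=" rather than " = " (which can never match a stripped string). The recorded
# outputs in this notebook were produced with the ' = ' test, so heading rows such
# as "valkyria chronicles iii" still appear in them.
processed_dataset = wikitext_dataset.filter(
    lambda example: not example['text'].strip().startswith('=')
)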

# 2. Apply text cleaning and normalization to each entry
processed_dataset = processed_dataset.map(
    lambda example: {
        'text': re.sub(
            r'[^a-zA-Z0-9\s.,\'?!-]', '', # Keep only alphanumeric, common punctuation, and whitespace
            example['text'].lower().replace('@-@', '') # Convert to lowercase and remove artifacts
        ).strip()
    }
)

# 3. Filter out any lines that became empty after cleaning
processed_dataset = processed_dataset.filter(
    lambda example: len(example['text']) > 0
)

# --- After Preprocessing ---
print("\n--- After Cleaning ---")
# Re-assign the cleaned data to our variables for future use
train_data = processed_dataset['train']
validation_data = processed_dataset['validation']
test_data = processed_dataset['test']

# Calculate and print the final size of each dataset split
train_chars_after = sum(len(line) for line in train_data['text'] if line)
validation_chars_after = sum(len(line) for line in validation_data['text'] if line)
test_chars_after = sum(len(line) for line in test_data['text'] if line)
print(f"Training Data:   {train_data.num_rows:,} rows, {train_chars_after:,} characters")
print(f"Validation Data: {validation_data.num_rows:,} rows, {validation_chars_after:,} characters")
print(f"Test Data:       {test_data.num_rows:,} rows, {test_chars_after:,} characters")

--- Before Cleaning ---
Training Data:   36,718 rows, 10,892,990 characters
Validation Data: 3,760 rows, 1,142,150 characters
Test Data:       4,358 rows, 1,285,622 characters

--- After Cleaning ---
Training Data:   23,764 rows, 10,617,709 characters
Validation Data: 2,461 rows, 1,114,644 characters
Test Data:       2,891 rows, 1,253,140 characters
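
To turn these totals into a relative measure, the short calculation below computes the percentage of rows and characters removed per split. The figures are copied from the output above; `percent_removed` is a hypothetical helper introduced only for this illustration.

```python
# (rows, characters) per split, copied from the recorded output
before = {"train": (36_718, 10_892_990), "validation": (3_760, 1_142_150), "test": (4_358, 1_285_622)}
after = {"train": (23_764, 10_617_709), "validation": (2_461, 1_114_644), "test": (2_891, 1_253_140)}

def percent_removed(old: int, new: int) -> float:
    """Percentage of the original quantity that was removed."""
    return 100 * (old - new) / old

reductions = {}
for split in before:
    rows_pct = percent_removed(before[split][0], after[split][0])
    chars_pct = percent_removed(before[split][1], after[split][1])
    reductions[split] = (rows_pct, chars_pct)
    print(f"{split:<10} rows removed: {rows_pct:.1f}%  characters removed: {chars_pct:.1f}%")
```

On these figures, each split loses roughly a third of its rows but under 3% of its characters, which suggests that most of the removed rows were blank or very short.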


### Step 5: Verifying the Cleaned Data

After preprocessing, it's essential to visually inspect the data again. This final check confirms that our cleaning functions worked as expected and that the text is now in a suitable format for our model. We expect to see clean, lowercase text with no strange artifacts or article headings.

In [4]:
# Visually inspect the first 5 lines of the cleaned training data
for i in range(5):
  print(f"{i}:{train_data[i]['text']}")

0:valkyria chronicles iii
1:senj no valkyria 3  unrecorded chronicles  japanese  3 , lit . valkyria of the battlefield 3  , commonly referred to as valkyria chronicles iii outside japan , is a tactical role  playing video game developed by sega and media.vision for the playstation portable . released in january 2011 in japan , it is the third game in the valkyria series . employing the same fusion of tactical and real  time gameplay as its predecessors , the story runs parallel to the first game and follows the  nameless  , a penal military unit serving the nation of gallia during the second europan war who perform secret black operations and are pitted against the imperial unit  calamaty raven  .
2:the game began development in 2010 , carrying over a large portion of the work done on valkyria chronicles ii . while it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . character designer r

## Baseline Model: The Simple Markov Chain

Our first model is a foundational, first-order Markov chain. It's a simple probabilistic model that predicts the next word in a sequence based *solely* on the current word, ignoring all previous context.

### What is "The Model"?

In this context, the "model" isn't a complex algorithm but a simple and intuitive data structure: a **Python dictionary**. This dictionary is the final output of the "training" process. It stores the word-to-word transition counts observed in the input text; at generation time these counts act as sampling weights, which is equivalent to sampling from the corresponding transition probabilities.

### The Structure of the Model (The Dictionary)

The model is a nested dictionary with a specific structure: `word -> {next_word: count}`.

* **Level 1: The Keys (Current Words)**
    * The keys of the main dictionary are the unique words in the text that are followed by at least one other word. Each key represents a possible "current state."
* **Level 2: The Values (Next Words and Their Frequencies)**
    * The value associated with each key is *another dictionary*.
    * In this inner dictionary, the keys are all the words that have ever appeared immediately after the "current word."
    * The values are the counts (integers) of how many times that specific transition occurred.
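
The structure described above can be sketched with `collections.defaultdict` and `Counter`. This is an illustrative alternative to the plain-dictionary implementation used later in the notebook; `build_counts` and `to_probabilities` are hypothetical names. The second function shows how the stored counts relate to transition probabilities.

```python
from collections import Counter, defaultdict

def build_counts(text: str) -> dict:
    """Map each word to a dict of {next_word: count}."""
    tokens = text.split()
    counts = defaultdict(Counter)
    # Pair every token with its successor
    for current_word, next_word in zip(tokens, tokens[1:]):
        counts[current_word][next_word] += 1
    return {word: dict(nexts) for word, nexts in counts.items()}

def to_probabilities(counts: dict) -> dict:
    """Normalize each inner dict so its values sum to 1."""
    probs = {}
    for word, nexts in counts.items():
        total = sum(nexts.values())
        probs[word] = {nxt: n / total for nxt, n in nexts.items()}
    return probs

counts = build_counts("the cat sat on the mat")
probs = to_probabilities(counts)
print(counts["the"])
print(probs["the"])
```

A word that only ever appears last, such as `mat` here, never becomes a key because it has no successor.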

### A Concrete Example

If our training text is: `"the cat sat on the mat"`

The resulting model dictionary would look like this:

```json
{
    "the": {
        "cat": 1,
        "mat": 1
    },
    "cat": {
        "sat": 1
    },
    "sat": {
        "on": 1
    },
    "on": {
        "the": 1
    }
}
```

In [5]:
import random
import string
import json


def build_markov_model(text: str) -> dict:
    """
    Builds a simple Markov chain model from a given text.
    """
    tokens = text.split()

    model = {}

    for i in range(len(tokens) - 1):
        current_word = tokens[i]
        next_word = tokens[i + 1]

        if current_word not in model:
            model[current_word] = {}

        if next_word not in model[current_word]:
            model[current_word][next_word] = 0

        model[current_word][next_word] += 1

    return model

def generate_text(model: dict, length: int = 50, start: str = None) -> str:
    """
    Generates new text using a pre-built Markov model.

    If `start` is None, a random word from the model is used as the first word.
    """
    if start is None:
        start_word = random.choice(list(model.keys()))
    else:
        start_word = start.lower()
    generated_text = [start_word]
    current_word = start_word

    for _ in range(length - 1):
        if current_word not in model:
            break

        next_words_dict = model[current_word]
        possible_next_words = list(next_words_dict.keys())
        word_frequencies = list(next_words_dict.values())

        chosen_next_word = random.choices(possible_next_words, weights=word_frequencies, k=1)[0]

        generated_text.append(chosen_next_word)
        current_word = chosen_next_word

    return ' '.join(generated_text)


# --- Helper functions for saving and loading the model ---
import os
model_filepath = "../3.model/Markov_chain_v0.1" # <-- Change this path

def save_model(model, filepath):
    """Saves the model dictionary to a JSON file."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(model, f)

def load_model(filepath):
    """Loads the model dictionary from a JSON file."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return json.load(f)

# --- Main execution ---

# 1. Convert the 'train' dataset object into a single corpus string
# We join all the lines of text together with spaces in between.
print("Preparing training corpus...")
corpus = " ".join(train_data['text'])
print("Corpus prepared.")

# 2. Build the model using the training corpus
print("Building Markov model...")
if os.path.exists(model_filepath):
    print("Loading saved model...")
    markov_model = load_model(model_filepath)
else:
    print("No saved model found. Building a new one...")
    markov_model = build_markov_model(corpus)
    save_model(markov_model, model_filepath)


# 3. Generate new text from the trained model
new_text = generate_text(markov_model, length=75,start="morning")

# 4. Print the result
print("\n--- Text Generated from WikiText Model ---")
print(new_text)

Preparing training corpus...
Corpus prepared.
Building Markov model...
Loading saved model...

--- Text Generated from WikiText Model ---
morning of a period can identify who saw a week later applied to read and peter finch explains that nicole is somewhat difficult to show the highway between djedkare isesi . picasso , while on 5 , they instruct individuals . carre 's 1130 , native birds . u during the game characters compete in an interview for a council was in 1 . consequently requests the impression along many different from mile 1 in
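
Each word in the sample above is chosen by weighted random sampling over the stored counts. The sketch below isolates that step using `random.choices`, the same call used in `generate_text`; the counts are hypothetical and the generator is seeded only so the sketch is repeatable.

```python
import random
from collections import Counter

next_words = {"cat": 3, "mat": 1}   # hypothetical transition counts for one state
rng = random.Random(0)              # seeded only for repeatability

# Draw many samples; each draw is weighted by the stored count
draws = rng.choices(list(next_words), weights=list(next_words.values()), k=10_000)
frequencies = Counter(draws)
cat_share = frequencies["cat"] / len(draws)
print(f"share of 'cat': {cat_share:.3f}")
```

With counts of 3 and 1, the share of `cat` should settle near 0.75, which is why raw counts can serve directly as the model without being converted to probabilities first.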


## Optimization 1: Memory Efficiency

The initial brute-force approach required loading the entire training dataset into a **single, massive string**. This is highly problematic for large datasets for two main reasons:

1.  **High Memory Usage:** It can consume several gigabytes of RAM, potentially crashing the program with a `MemoryError` on systems with limited memory.
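2.  **Spurious Transitions:** Joining every line into one string erases the boundaries between lines, so the model records transitions from the last word of one line to the first word of the next, even though those words never appeared together. This lowers the quality of the learned transitions.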

To solve this, we **process the data incrementally**. Instead of creating one large object in memory, the optimized approach is to build the model **line-by-line**.

We iterate through the dataset and update the model dictionary one line at a time, so the single corpus-sized string is never built. This is a much more **memory-efficient** way to read the input. Note that the model dictionary itself still grows with the number of distinct transitions it records, so total memory use is not constant.
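
The difference between the two approaches can be seen on a toy input. In the sketch below, `iter_lines` is a hypothetical stand-in for a dataset that yields one line at a time, and `build_from_lines` is an illustrative builder, not the notebook's function. The same builder is run once per line and once on the joined corpus.

```python
from collections import Counter, defaultdict
from typing import Iterable, Iterator

def iter_lines() -> Iterator[str]:
    # Hypothetical stand-in for a dataset that yields one line at a time
    yield "the cat sat"
    yield "dogs bark"

def build_from_lines(lines: Iterable[str]) -> dict:
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        # Transitions are only counted within a single line
        for current_word, next_word in zip(tokens, tokens[1:]):
            model[current_word][next_word] += 1
    return {word: dict(nexts) for word, nexts in model.items()}

per_line = build_from_lines(iter_lines())
joined = build_from_lines([" ".join(iter_lines())])

print("sat" in per_line)   # whether any transition out of 'sat' was recorded
print(joined.get("sat"))   # transitions out of 'sat' in the joined corpus
```

In the joined version, `sat` is followed by `dogs`, a pair that never occurs within any single line.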

In [6]:
def build_markov_model_incrementally(dataset_split):
    """
    Builds a model by iterating through the dataset line by line (memory-efficient).
    """
    model = {}
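    # Iterate over rows rather than dataset_split['text']: depending on the
    # `datasets` version, indexing the column can materialize it as one large list.
    for example in dataset_split:
        tokens = example['text'].split()
        if not tokens:
            continue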

        for i in range(len(tokens) - 1):
            current_word = tokens[i]
            next_word = tokens[i + 1]

            if current_word not in model:
                model[current_word] = {}

            if next_word not in model[current_word]:
                model[current_word][next_word] = 0

            model[current_word][next_word] += 1

    return model

# --- Main Execution (Incremental Approach) ---
# Assumes `train_data` and the helper functions (`save_model`, `load_model`,
# `generate_text`) are already defined.

# 1. Configuration
MODEL_FILEPATH = "../3.model/Markov_chain_v0.2"
START_WORD = "morning"
TEXT_LENGTH = 75

# 2. Load or Build the Model Incrementally
if os.path.exists(MODEL_FILEPATH):
    print("Loading existing model...")
    markov_model = load_model(MODEL_FILEPATH)
else:
    print("No existing model found. Building a new one incrementally...")

    # Pass the dataset split directly to the incremental builder
    # instead of first joining it into one corpus string.
    markov_model = build_markov_model_incrementally(train_data)

    save_model(markov_model, MODEL_FILEPATH)

print("Model is ready.")

# 3. Generate and Print Text
print("\n--- Generating new text ---")
generated_text = generate_text(
    markov_model,
    length=TEXT_LENGTH,
    start=START_WORD
)

print("\n--- Generated Text ---")
print(generated_text)

Loading existing model...
Model is ready.

--- Generating new text ---

--- Generated Text ---
morning assemblies called the foot prints and the age named ingleside and shades of saintly royalty until the sweet chariot , for six deaths . the brigade behind . from the bluebirds . the beak from 1855 . falling backwards onto smith notes that was killed in duluth , beautiful sight . meanwhile , sarah rodman of that winter storm damaged on quintana roo and a middle distance was very short period after each occasion


## Optimization 2: Increasing Context with a Larger Window

The most significant weakness of our current model is its extremely short memory. It is a **first-order Markov chain**, meaning it conditions only on the single most recent word. This leads to frequent failures of logical consistency.

### The Problem of Lost Context

Consider the phrase: `"The sky is..."`

Our model's decision for the next word is based *only* on the word `"is"`. It has forgotten the crucial context word `"sky"`. The model's pool of possible next words is created from *every single instance* where `"is"` appeared in the training text, regardless of what came before it. It learns from unrelated phrases like:

* `"...the sky is blue..."`
* `"...the company's logo is an apple..."`

When the model's only context is `"is"`, it blends these unrelated scenarios. It might correctly identify `"blue"` as probable, but `"apple"` remains a possibility because the model has no memory of the preceding word, `"sky"`.

### The Solution: Using Tuples for State

To solve this, we increase the model's memory by using a **tuple of multiple words** as the key. This tuple acts as our new "state," effectively filtering out irrelevant choices. The counts in the comparison below are illustrative, not measured.

* **Old Model State (Plain Word):** `is`
    * Possible Next Words: `{ "apple": 3, "blue": 4, "green": 2, ... }`
* **New Model State (Tuple):** `('sky', 'is')`
    * Possible Next Words: `{ "blue": 150, "clear": 80, "overcast": 45, ... }`

By using a tuple, the model's state is now `('sky', 'is')`. The pool of possible next words it considers is now completely different and far more relevant. The word `"apple"` is unlikely to even be in this new list, leading to much more coherent and context-aware text generation.
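
The effect can be reproduced on a toy corpus. The sketch below is illustrative only: `build` is a hypothetical helper, the three sentences are invented, and for uniformity the first-order state is written as a 1-tuple `("is",)` rather than a plain string.

```python
from collections import Counter, defaultdict

def build(lines, window_size):
    """Count next words for every window of `window_size` consecutive tokens."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - window_size):
            state = tuple(tokens[i:i + window_size])
            model[state][tokens[i + window_size]] += 1
    return model

toy_lines = ["the sky is blue", "the sky is clear", "the logo is red"]
first_order = build(toy_lines, 1)
second_order = build(toy_lines, 2)

print(dict(first_order[("is",)]))          # candidates when only 'is' is known
print(dict(second_order[("sky", "is")]))   # candidates when 'sky is' is known
```

With one word of context, `red` is a candidate after `is`; with two words, the state `('sky', 'is')` no longer offers it.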

In [7]:
import random
import json
import os
import ast  # Used to safely convert string representations of tuples back to tuples

# Assumes `train_data` (the cleaned training split from Step 4) is already defined.

def build_markov_model_with_window(dataset_split, window_size=2):
    """
    Builds a higher-order Markov model using a tuple of words (the "window") as the state.
    """
    model = {}
    for line in dataset_split['text']:
        tokens = line.split()
        if len(tokens) < window_size + 1:
            continue
        for i in range(len(tokens) - window_size):
            current_state = tuple(tokens[i : i + window_size])
            next_word = tokens[i + window_size]
            if current_state not in model:
                model[current_state] = {}
            if next_word not in model[current_state]:
                model[current_state][next_word] = 0
            model[current_state][next_word] += 1
    return model

def save_model_window(model, filepath):
    """
    Saves the model dictionary to a JSON file, converting tuple keys to strings.
    """
    print(f"Saving model to {filepath}...")
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    # Convert tuple keys to strings because JSON does not support tuple keys
    string_keyed_model = {str(key): value for key, value in model.items()}
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(string_keyed_model, f, indent=4)
    print("Model saved successfully.")

def load_model_window(filepath):
    """
    Loads the model from a JSON file, converting string keys back to tuples.
    """
    print(f"Loading model from {filepath}...")
    with open(filepath, 'r', encoding='utf-8') as f:
        string_keyed_model = json.load(f)
        # Convert string keys back to tuples for the model to work correctly
        model = {ast.literal_eval(key): value for key, value in string_keyed_model.items()}
    print("Model loaded successfully.")
    return model

def generate_text_with_window(model, length=75, window_size=2, start_seed=None):
    """
    Generates text from a windowed model, optionally starting with a given seed phrase.
    """
    start_state = None

    # Try to use the provided start_seed
    if start_seed:
        seed_words = start_seed.lower().split()
        # Validate that the seed has the correct number of words
        if len(seed_words) != window_size:
            print(f"Warning: Your start seed has {len(seed_words)} words, but the model's window size is {window_size}. Starting randomly.")
        else:
            potential_start_state = tuple(seed_words)
            # Validate that the seed exists as a state in our model
            if potential_start_state in model:
                start_state = potential_start_state
            else:
                print(f"Warning: The phrase {potential_start_state} was not found in the model's training data. Starting randomly.")

    # If no seed was provided or the provided seed was invalid, start randomly
    if start_state is None:
        start_state = random.choice(list(model.keys()))

    # Generate one word at a time, sliding the window forward after each choice
    generated_text = list(start_state)
    current_state = start_state
    for _ in range(length - window_size):
        if current_state not in model:
            break
        next_words_dict = model[current_state]
        possible_next_words = list(next_words_dict.keys())
        word_frequencies = list(next_words_dict.values())
        chosen_next_word = random.choices(possible_next_words, weights=word_frequencies, k=1)[0]
        generated_text.append(chosen_next_word)
        current_state = tuple(generated_text[-window_size:])

    return ' '.join(generated_text)

# --- Main Execution ---

# 1. Configuration
WINDOW_SIZE = 2
model_dir = "../3.model/Markov_chain_v0.3"
# We'll make the filename dynamic based on the window size
model_filename = f"markov_model_ws{WINDOW_SIZE}.json"
model_filepath = os.path.join(model_dir, model_filename)

# 2. Load or Build the Model
if os.path.exists(model_filepath):
    windowed_markov_model = load_model_window(model_filepath)
else:
    print("Saved model not found. Training a new one...")
    windowed_markov_model = build_markov_model_with_window(train_data, window_size=WINDOW_SIZE)
    save_model_window(windowed_markov_model, model_filepath)

# 3. Generate new text from the loaded/trained model
USER_START_SEED = "the sky"

# Pass the seed phrase so generation starts from the state ('the', 'sky')
new_text = generate_text_with_window(
    windowed_markov_model,
    length=75,
    window_size=WINDOW_SIZE,
    start_seed=USER_START_SEED
)

# 4. Print the result
print("\n--- Text Generated from Windowed Model ---")
print(new_text)

Loading model from ../3.model/Markov_chain_v0.3\markov_model_ws2.json...
Model loaded successfully.

--- Text Generated from Windowed Model ---
the sky or invisibly present within the icao definition and released in 2006 , darden pursued careers as a prospective tenant appears . mrs. o 'brien destroyer no. 55 dd 55 was laid on top of japanese emigration to brazil . maeda accepted gracie and others were creating . roger friedman of forbes called the scene of this tropical storm formed in april 1919 , but could not otherwise afford the player . he also
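
The save and load helpers in this cell convert tuple keys to strings and back because JSON object keys must be strings. The sketch below shows that round trip in isolation on a toy model with hypothetical counts, including a token that contains a quote character.

```python
import ast
import json

model = {("the", "sky"): {"is": 2}, ("o", "'brien"): {"destroyer": 1}}

# Tuple keys cannot be serialized directly
try:
    json.dumps(model)
except TypeError as error:
    print(f"json.dumps failed: {error}")

# str() turns each tuple into its literal form; ast.literal_eval parses it back
encoded = json.dumps({str(key): value for key, value in model.items()})
decoded = {ast.literal_eval(key): value for key, value in json.loads(encoded).items()}
print(decoded == model)
```

`ast.literal_eval` only evaluates Python literals, so it is a safer choice here than `eval` when reading keys from a file.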


In [8]:
# 1. Configuration
WINDOW_SIZE = 4
model_dir = "../3.model/Markov_chain_v0.3"
# We'll make the filename dynamic based on the window size
model_filename = f"markov_model_ws{WINDOW_SIZE}.json"
model_filepath = os.path.join(model_dir, model_filename)

# 2. Load or Build the Model
if os.path.exists(model_filepath):
    windowed_markov_model = load_model_window(model_filepath)
else:
    print("Saved model not found. Training a new one...")
    windowed_markov_model = build_markov_model_with_window(train_data, window_size=WINDOW_SIZE)
    save_model_window(windowed_markov_model, model_filepath)

# 3. Generate new text from the loaded/trained model
# No seed is passed here, so generation starts from a randomly chosen state.
new_text = generate_text_with_window(
    windowed_markov_model,
    length=150,
    window_size=WINDOW_SIZE
)

# 4. Print the result
print("\n--- Text Generated from Windowed Model ---")
print(new_text)

Loading model from ../3.model/Markov_chain_v0.3\markov_model_ws4.json...
Model loaded successfully.

--- Text Generated from Windowed Model ---
. rosebery seems to have disliked his first son , who he claimed looked jewish . on seeing his son for the first time in her career , including cordelia chase from supernatural dramas buffy the vampire slayer and angel , and heather from the syfy horror film voodoo moon 2006 . in an interview , dylan said he had kicked heroin in new york city . they were aboard the booth line steamship ss polycarp . all three men listed their occupations as professors of juitso . after leaving new york , the three men went to the caribbean , where they stayed from september to december 1921 . at some point in the game , and briefly operated his own dealership until forced to close during the automotive industry crisis of 2008 2010 . he has since played a handful of scoreless games , both at the yamaha


## Final Optimizations: Grammar and Memory

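In the final version, we combine two optimizations, one for the quality of the generated text and one for the model's efficiency, and we reorganize how the vocabulary is stored. Because both optimizations land in the same step, this version departs from the one-change-at-a-time rule of Phase 2: any difference in output cannot be attributed to either change alone.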

### 1. Teaching the Model Sentence Structure

Currently, our model has no reliable sense of where a sentence starts and ends. WikiText already separates most punctuation with spaces (for example, `portable . released` in the Step 5 output), but wherever punctuation stays attached to a word, the model treats that form as a completely different token.

> For example, `"found"` and `"found."` are two entirely separate words for our model.

This is a flaw, because the model cannot learn from such tokens that a period marks the end of a thought. We solve this by **treating punctuation as its own word (or "token")**: during preprocessing, we pad the major punctuation marks (`.`, `,`, `?`, `!`) with spaces so that `split()` separates them from the neighboring word.

* **Before:** `"the cat was found."`
* **After:** The text is tokenized into `["the", "cat", "was", "found", "."]`

This teaches the model the crucial relationship between words and punctuation, allowing it to learn how to properly end the sentences it generates.
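
The padding step can be sketched with a single regular expression. `split_punctuation` is a hypothetical helper name; the pattern pads `.`, `,`, `?`, and `!` with spaces before splitting on whitespace.

```python
import re

def split_punctuation(line: str) -> list:
    """Pad . , ? ! with spaces, then split on whitespace."""
    return re.sub(r"([.,?!])", r" \1 ", line).split()

print(split_punctuation("the cat was found."))
print(split_punctuation("mrs. o 'brien"))
print(split_punctuation("media.vision"))
```

The same rule also splits abbreviations and dotted names (`mrs.` becomes `mrs` and `.`), which may or may not be desirable depending on the corpus.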

### 2. Reducing Memory Footprint with Integer Encoding

Storing the entire model using full words (strings) is inefficient. Every occurrence of a word costs at least one byte per character: a 7-character word like `"awesome"` needs at least 7 bytes in the saved JSON file, and more as a Python `str` object because of per-object overhead. For a large corpus, the memory required for the model dictionary can become enormous.

To solve this, we create a simple form of **tokenization** by mapping each unique word in our vocabulary to a unique integer.

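> For example: `{"a": 1, "the": 14, "awesome": 56, ...}` (illustrative IDs; the code below assigns IDs in order of descending word frequency, starting at 0)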

This fundamentally changes our model. Instead of a single large dictionary, our saved model now consists of **two essential components**:

1.  **The Vocabulary Maps:** Two dictionaries, one that maps words to their unique integer IDs (`word_to_int`) and another that maps IDs back to words (`int_to_word`) for generation.
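2.  **The Core Model:** The main dictionary, which now stores compact integer IDs instead of full words, reducing its size on disk. In the implementation below the keys are the string forms of those IDs (such as `"(0, 14)"`), so the saving comes from typically shorter keys rather than from native integer storage.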
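
A minimal sketch of the two vocabulary maps follows, assuming IDs are assigned by descending frequency as in the notebook's `tokenize_and_build_vocab`. The token list is a toy example.

```python
from collections import Counter

tokens = "the cat sat on the mat . the end .".split()
counts = Counter(tokens)

# Most frequent words get the smallest IDs; ties keep first-seen order
vocab = sorted(counts, key=counts.get, reverse=True)
word_to_int = {word: i for i, word in enumerate(vocab)}
int_to_word = {i: word for word, i in word_to_int.items()}

encoded = [word_to_int[token] for token in tokens]
decoded = [int_to_word[i] for i in encoded]
print(encoded)
print(decoded == tokens)
```

Giving frequent words small IDs keeps the most common keys short when they are later written out as text.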

### 3. Decoupling Vocabulary from the Model for Experimentation

Since we want to check the performance of our models for different window sizes, it is inefficient to store the vocabulary inside each model file. The vocabulary of the training text is constant; it does not change whether we use a window size of 2, 3, or 4.

Therefore, we will adopt a more organized storage strategy:

* **Vocabulary (`vocabulary.json`):** The word-to-integer mappings will be built once from the training data and saved to their own separate file.
* **Model Dictionaries (`int_model_ws2.json`, etc.):** Each model, trained with a specific window size, will be saved to its own file, named dynamically based on its configuration. These model files will only contain the integer-based transition logic.
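
The storage layout above can be sketched with a temporary directory. The vocabulary and the two models below are toy values, and the directory is created with `tempfile` purely for illustration; the file names follow the pattern described in the list.

```python
import json
import os
import tempfile

word_to_int = {"the": 0, "sky": 1, "is": 2, "blue": 3}          # toy vocabulary
int_to_word = {i: word for word, i in word_to_int.items()}
models = {2: {"(0, 1)": {"2": 1}}, 3: {"(0, 1, 2)": {"3": 1}}}  # toy models keyed by window size

with tempfile.TemporaryDirectory() as model_dir:
    # The vocabulary is written once
    vocab_path = os.path.join(model_dir, "vocabulary.json")
    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump({"word_to_int": word_to_int, "int_to_word": int_to_word}, f)

    # Each window size gets its own model file
    for window_size, model in models.items():
        path = os.path.join(model_dir, f"int_model_ws{window_size}.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump(model, f)

    saved_files = sorted(os.listdir(model_dir))
    with open(vocab_path, "r", encoding="utf-8") as f:
        raw = json.load(f)

# JSON object keys are always strings, so integer IDs must be converted back on load
restored_int_to_word = {int(k): v for k, v in raw["int_to_word"].items()}
print(saved_files)
print(restored_int_to_word == int_to_word)
```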

In [12]:
import json
import os
import re
import random
import ast

# --- Helper Functions (saving/loading) ---
# NOTE: These functions are now very simple because we consistently use string keys.

def save_model_window(model, filepath):
    """Saves the model (with string keys) to a JSON file."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(model, f, indent=4)

def load_model_window(filepath):
    """Loads the model (with string keys) from a JSON file."""
    with open(filepath, 'r', encoding='utf-8') as f:
        model = json.load(f)
    return model

# --- Step 1: Grammar-Aware Preprocessing and Vocabulary Building ---

def tokenize_and_build_vocab(dataset_split):
    word_counts = {}
    processed_lines = []
    for line in dataset_split['text']:
        line = re.sub(r'([.,?!])', r' \1 ', line)
        tokens = line.split()
        processed_lines.append(tokens)
        for token in tokens:
            word_counts[token] = word_counts.get(token, 0) + 1
    sorted_words = sorted(word_counts.keys(), key=lambda x: word_counts[x], reverse=True)
    word_to_int = {word: i for i, word in enumerate(sorted_words)}
    int_to_word = {i: word for i, word in enumerate(sorted_words)}
    return processed_lines, word_to_int, int_to_word

def save_vocab(word_to_int, int_to_word, filepath):
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump({"word_to_int": word_to_int, "int_to_word": int_to_word}, f, indent=4)

def load_vocab(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        data = json.load(f)
    int_to_word_corrected = {int(k): v for k, v in data['int_to_word'].items()}
    return data['word_to_int'], int_to_word_corrected

# --- Step 2: Build the Integer-Based Markov Model ---

def build_integer_model(processed_lines, word_to_int, window_size=2):
    """
    Builds a windowed Markov model using STRING keys for consistency.
    """
    model = {}
    for tokens in processed_lines:
        if len(tokens) < window_size + 1:
            continue
        int_tokens = [word_to_int[token] for token in tokens if token in word_to_int]
        
        for i in range(len(int_tokens) - window_size):
            # Create the tuple state...
            current_state_tuple = tuple(int_tokens[i : i + window_size])
            # ...and immediately convert it to a string for use as a key.
            current_state_str = str(current_state_tuple)
            next_word_id = int_tokens[i + window_size]
            
            # Now, use the string key for all dictionary operations.
            if current_state_str not in model:
                model[current_state_str] = {}
            
            next_word_id_str = str(next_word_id)
            if next_word_id_str not in model[current_state_str]:
                model[current_state_str][next_word_id_str] = 0
            model[current_state_str][next_word_id_str] += 1
            
    return model

# --- Step 3: Text Generation with Integer Model (Unchanged) ---

def generate_text_integer_model(model, int_to_word, word_to_int, length=75, window_size=2, start_seed=None):
    """Generates up to `length` tokens, sampling each next word by its count.

    `start_seed` is used only if it contains exactly `window_size` known words
    that form a state in the model; otherwise a random state is chosen.
    """
    start_state = None
    if start_seed:
        seed_words = start_seed.lower().split()
        if len(seed_words) == window_size and all(word in word_to_int for word in seed_words):
            potential_start_state = tuple(word_to_int[word] for word in seed_words)
            # Check for the string representation of the tuple in the model keys
            if str(potential_start_state) in model:
                start_state = potential_start_state

    if start_state is None:
        # If no valid seed is found, pick a random starting state (which is a string)
        random_start_key = random.choice(list(model.keys()))
        # Convert the string key back to a tuple for internal processing
        start_state = ast.literal_eval(random_start_key)

    generated_ids = list(start_state)
    current_state_tuple = start_state

    for _ in range(length - window_size):
        current_state_str = str(current_state_tuple)
        if current_state_str not in model:
            break
        
        next_ids_dict = model[current_state_str]
        possible_next_ids_str = list(next_ids_dict.keys())
        id_frequencies = list(next_ids_dict.values())
        
        chosen_next_id = random.choices([int(i) for i in possible_next_ids_str], weights=id_frequencies, k=1)[0]
        generated_ids.append(chosen_next_id)
        current_state_tuple = tuple(generated_ids[-window_size:])

    # Use `token_id` rather than `id` to avoid shadowing the built-in.
    generated_words = [int_to_word.get(token_id, '?') for token_id in generated_ids]
    return ' '.join(generated_words).replace(' .', '.').replace(' ,', ',').replace(' ?', '?').replace(' !', '!')

# --- Main Execution (Simplified Logic) ---

# ASSUMPTION: You have a Hugging Face dataset object named `train_data` loaded.
# For example:
# from datasets import load_dataset
# wikitext = load_dataset("wikitext", "wikitext-2-raw-v1")
# train_data = wikitext['train']

MODEL_DIR = "../3.model/Markov_chain_v0.4"
VOCAB_FILEPATH = os.path.join(MODEL_DIR, "vocabulary4Markov_chain_v0.4.json")
WINDOW_SIZE = 3 
MODEL_FILENAME = f"int_model_ws{WINDOW_SIZE}.json"
MODEL_FILEPATH = os.path.join(MODEL_DIR, MODEL_FILENAME)

# 1. Build or Load Vocabulary
if os.path.exists(VOCAB_FILEPATH):
    print("Loading existing vocabulary...")
    word_to_int, int_to_word = load_vocab(VOCAB_FILEPATH)
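    # processed_lines is not saved to disk, so re-tokenize; it is needed if the model is trained below.
    processed_lines, _, _ = tokenize_and_build_vocab(train_data)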
else:
    print("Building and saving new vocabulary...")
    processed_lines, word_to_int, int_to_word = tokenize_and_build_vocab(train_data)
    save_vocab(word_to_int, int_to_word, VOCAB_FILEPATH)

# 2. Build or Load the Integer Model
if os.path.exists(MODEL_FILEPATH):
    print(f"Loading final model with window size {WINDOW_SIZE}...")
    final_model = load_model_window(MODEL_FILEPATH)
else:
    print(f"Training final model with window size {WINDOW_SIZE}...")
    # build_integer_model returns string keys, so the model can be written to JSON as-is.
    final_model = build_integer_model(processed_lines, word_to_int, window_size=WINDOW_SIZE)
    save_model_window(final_model, MODEL_FILEPATH)

# 3. Generate Text
print("\nGenerating text...")
generated_text = generate_text_integer_model(
    final_model,
    int_to_word,
    word_to_int,
    length=75,
    window_size=WINDOW_SIZE,
    start_seed="the history of"
)

print("\n--- Text Generated from Final Model ---")
print(generated_text)

Building and saving new vocabulary...
Training final model with window size 3...

Generating text...

--- Text Generated from Final Model ---
the history of the war lent support to the division s infantry brigades. it was well executed. he also pointed out that such knowledge demanded mastery of an artificial reason... his artistic triumph and legendary status were achieved in paris... angry at human stupidity and destructiveness, and within two years of refurbishment and redesign. calling it mcallister tower, 248 units were modernized for residential


## The Conceptual Limit and the Path Forward

After implementing a series of optimizations (memory-efficient processing, larger context windows, grammar-aware tokenization, and integer encoding), we have pushed the Markov chain architecture about as far as it can go. This is the point where we must acknowledge the fundamental limitations of the model itself.

### The Inescapable Roadblock: Statistics vs. Semantics

Our model has become an efficient pattern-matching engine. It knows which words are statistically likely to follow a given sequence of words. However, it has no ability to grasp the actual **meaning**, or **semantics**, of the words it is processing.

To the model, the words "king," "queen," and "cabbage" are just unique integers. It does not understand that two of those words relate to royalty and one relates to vegetables. It only knows the statistical probabilities of their arrangement. This is the core conceptual roadblock of the Markov assumption: **it can mimic structure, but it cannot comprehend meaning**.
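
To make this concrete, here is a minimal sketch that applies the same frequency-ranking rule as `tokenize_and_build_vocab` to a toy corpus. The corpus is invented purely for illustration. It suggests that an ID reflects only how often a word occurs, not what it means:

```python
from collections import Counter

# Hypothetical toy corpus, already tokenized the same way as our pipeline.
corpus = (
    "the king ate the cabbage . "
    "the king grew the cabbage . "
    "the queen slept ."
).split()

counts = Counter(corpus)
# Same rule as tokenize_and_build_vocab: the most frequent word gets ID 0.
sorted_words = sorted(counts, key=lambda w: counts[w], reverse=True)
word_to_int = {word: i for i, word in enumerate(sorted_words)}

for word in ("king", "queen", "cabbage"):
    print(word, "->", word_to_int[word])
```

In this toy vocabulary, "king" ends up numerically next to "cabbage" and far from "queen", simply because of how often each word appears. Nothing in the IDs, or in the transition counts built on top of them, encodes that two of the words relate to royalty.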

### Final Refinements vs. New Architectures

This is where most of the *architectural* optimizations for a Markov chain end. While we have reached the limit of what this type of model can conceptually do, there are still two key refinements we could make. These wouldn't change the fundamental nature of the model, but they would improve its performance and robustness.

#### 1. Performance Optimization: Sparse Matrices

* **What:** Instead of a nested Python dictionary, we could represent our model as a **sparse matrix** using a library like `scipy.sparse`. Each distinct state (the tuple of previous words) would be assigned a row index, each column would correspond to a possible next word ID, and the cell value would be the transition count or probability.
* **Why:**
    * **Speed:** Mathematical operations on these matrices (like turning counts into probabilities) run in optimized compiled code instead of the Python interpreter (an approach called vectorization), which is typically much faster than looping through a Python dictionary.
    * **Memory:** While our integer-encoded dictionary is already compact, a sparse matrix stores its non-zero entries in contiguous arrays and so has less memory overhead per entry than nested Python dictionaries, making it the better choice for extremely large vocabularies.
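
As a rough sketch of this idea (not something the project implements), the nested dictionary produced by `build_integer_model` could be converted into a `scipy.sparse.csr_matrix`. The helper names `model_to_sparse`, `normalize_rows`, and `sample_next_id` are hypothetical, and the toy model stands in for a real one:

```python
import numpy as np
from scipy.sparse import csr_matrix

def model_to_sparse(model, vocab_size):
    """Convert {state_str: {next_id_str: count}} into a CSR count matrix.

    Also returns the state -> row index lookup needed to find a state's row.
    """
    state_to_row = {state: row for row, state in enumerate(model)}
    rows, cols, counts = [], [], []
    for state, followers in model.items():
        for next_id_str, count in followers.items():
            rows.append(state_to_row[state])
            cols.append(int(next_id_str))
            counts.append(count)
    matrix = csr_matrix(
        (np.array(counts, dtype=np.float64), (rows, cols)),
        shape=(len(state_to_row), vocab_size),
    )
    return matrix, state_to_row

def normalize_rows(count_matrix):
    """Divide every stored count by its row total in one vectorized step."""
    probs = count_matrix.copy()
    row_sums = np.asarray(probs.sum(axis=1)).ravel()
    entries_per_row = np.diff(probs.indptr)  # stored entries in each row
    probs.data = probs.data / np.repeat(row_sums, entries_per_row)
    return probs

def sample_next_id(probs, row, rng):
    """Sample a next word ID from the stored entries of one row."""
    start, end = probs.indptr[row], probs.indptr[row + 1]
    return int(rng.choice(probs.indices[start:end], p=probs.data[start:end]))

# Toy model in the same string-keyed layout used above (illustrative values).
toy_model = {"(0, 1)": {"2": 3, "3": 1}, "(1, 2)": {"0": 2}}
counts, state_to_row = model_to_sparse(toy_model, vocab_size=4)
probs = normalize_rows(counts)
rng = np.random.default_rng(0)
next_id = sample_next_id(probs, state_to_row["(1, 2)"], rng)
```

Note that the state-to-row lookup is still an ordinary dictionary; only the counts themselves move into the compact array layout.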

#### 2. Quality Optimization: Smoothing and Backoff

* **What:** What happens when the model reaches a state (a sequence of words) that never appeared as a context in the training data? It has no entry for that state, so `generate_text_integer_model` breaks out of its loop and returns a shorter text than requested. **Smoothing** (for example, Laplace smoothing) adds a small constant to every transition count, which gives a tiny, non-zero probability to every *possible* transition, even unseen ones. A more advanced technique, **backoff**, makes the model "back off" to a smaller context window (e.g., if it can't find a match for a 3-word sequence, it tries matching on the last 2 words) to find a next step.
* **Why:** This makes the model more **robust**. It prevents it from getting stuck on unfamiliar phrases and allows it to generate longer, more fluent sequences without failing, which directly improves the quality and reliability of the output.
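
Both ideas can be sketched in a few lines. The code below is a simplified illustration under stated assumptions, not the project's implementation: `build_model`, `next_id_with_backoff`, and `laplace_probability` are hypothetical helpers, the data is a toy example, and the backoff shown here simply retries with a shorter context (it does not apply the discounting used by schemes such as Katz backoff).

```python
import random

def build_model(int_lines, window_size):
    """Count transitions using the same str(tuple) -> {str(id): count} layout."""
    model = {}
    for ids in int_lines:
        for i in range(len(ids) - window_size):
            state = str(tuple(ids[i:i + window_size]))
            next_id = str(ids[i + window_size])
            followers = model.setdefault(state, {})
            followers[next_id] = followers.get(next_id, 0) + 1
    return model

def next_id_with_backoff(models, history, rng=random):
    """Try the longest context first, then progressively shorter ones.

    `models` maps window_size -> model. Returns (next_id, window_size_used),
    or (None, 0) if no context of any size matches.
    """
    for window_size in sorted(models, reverse=True):
        if len(history) < window_size:
            continue
        state = str(tuple(history[-window_size:]))
        followers = models[window_size].get(state)
        if followers:
            ids = [int(k) for k in followers]
            weights = list(followers.values())
            return rng.choices(ids, weights=weights, k=1)[0], window_size
    return None, 0

def laplace_probability(followers, next_id, vocab_size, alpha=1.0):
    """Add-alpha estimate: (count + alpha) / (total + alpha * vocab_size)."""
    count = followers.get(str(next_id), 0)
    total = sum(followers.values())
    return (count + alpha) / (total + alpha * vocab_size)

# Toy integer-encoded lines (illustrative only).
int_lines = [[0, 1, 2, 3], [1, 2, 4]]
models = {2: build_model(int_lines, 2), 1: build_model(int_lines, 1)}

# The state (3, 1) was never seen, so the model backs off to the state (1,).
backed_off = next_id_with_backoff(models, [3, 1])
# No context matches at any size, so generation would have to stop here.
dead_end = next_id_with_backoff(models, [2, 3])
# With smoothing, even an unseen follower of (1, 2) gets non-zero probability.
unseen_prob = laplace_probability(models[2]["(1, 2)"], 0, vocab_size=5)
```

In a generation loop, `next_id_with_backoff` would replace the direct lookup that currently ends the text at the first unknown state.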

### Conclusion

While these refinements would make our model faster and more resilient, they do not solve the core problem of semantic understanding. To achieve a more human-like grasp of language, one that captures context, topics, and long-range dependencies, we would need to move beyond Markov chains to a different class of models entirely: **neural networks**, such as Recurrent Neural Networks (RNNs, including LSTMs) or the state-of-the-art Transformer architecture.