## Our Development Philosophy: From Baseline to Optimized

This project follows a deliberate and iterative approach to model development. Our core principle is to ensure that every change is measured, understood, and contributes positively to the final result.

### Phase 1: Build the Brute-Force Baseline

First, we create the simplest possible, end-to-end working model. This **bare-minimum** version serves two critical purposes:

1.  **Proof of Concept:** It confirms that our data pipeline and basic architecture are functional.
2.  **Establish a Benchmark:** It provides a clear **baseline performance metric**. All future work will be measured against this initial score.

### Phase 2: Optimize One Step at a Time

Once the baseline is established, we begin a cycle of incremental improvement. We strictly adhere to the principle of making **one isolated change at a time**.

Instead of overhauling the model at once, we will:

* **Target** a single component for improvement (e.g., hyperparameter tuning, feature engineering, architecture modification).
* **Implement** that one specific change.
* **Measure** its exact impact on performance.

This methodical process removes guesswork and allows us to attribute performance gains or losses directly to a specific action.

In [2]:
from datasets import load_dataset
my_custom_cache_dir = "/content/data/default_data"
wikitext_dataset = load_dataset(
    "wikitext",
    "wikitext-2-raw-v1",
    cache_dir=my_custom_cache_dir
)
print(wikitext_dataset)

DatasetDict({
    test: Dataset({
        features: ['text'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['text'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['text'],
        num_rows: 3760
    })
})


In [11]:
# Data Exploration
test=wikitext_dataset['test']
train=wikitext_dataset['train']
validation=wikitext_dataset['validation']

for i in range(0,25):
  print(f"{i}:{train[i]['text']}")

0:
1: = Valkyria Chronicles III = 

2:
3: Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " . 

4: The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving f

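### Requirements Observed


*   Remove non-English characters
*   Remove heading lines, which are wrapped in `=` signs (e.g. ` = Valkyria Chronicles III = `)
*   Remove the `@-@` hyphen marker
*   General cleanup (lowercasing, trimming whitespace, dropping empty lines)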
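
To make these requirements concrete, here is a minimal sketch applied to a single line of text. The helper names `clean_line` and `is_heading` are hypothetical and used only for illustration. The sample string is a shortened variant of the line shown above, and `is_heading` is just one possible way to detect heading lines.

```python
import re

def clean_line(text: str) -> str:
    """Lowercase, drop the '@-@' marker, and keep only basic ASCII characters."""
    text = text.lower().replace("@-@", "")
    # Keep letters, digits, whitespace, and a small set of punctuation marks
    text = re.sub(r"[^a-z0-9\s.,'?!-]", "", text)
    return text.strip()

def is_heading(text: str) -> bool:
    """Hypothetical check: headings look like '= Title =' once stripped."""
    stripped = text.strip()
    return stripped.startswith("=") and stripped.endswith("=")

sample = " Senjō no Valkyria 3 : a tactical role @-@ playing game . "
print(clean_line(sample))
print(is_heading(" = Valkyria Chronicles III = "))
```

Removed characters leave their surrounding spaces behind, so the cleaned text can contain double spaces. This does not affect tokenization, because `str.split()` treats any run of whitespace as one separator.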



In [18]:
# Before Preprocessing:
# Calculate the total number of characters for each split
train_chars = sum(len(line) for line in train['text'] if line)
validation_chars = sum(len(line) for line in validation['text'] if line)
test_chars = sum(len(line) for line in test['text'] if line)

print(f"Training Data:   {train.num_rows:,} rows, {train_chars:,} characters")
print(f"Validation Data: {validation.num_rows:,} rows, {validation_chars:,} characters")
print(f"Test Data:       {test.num_rows:,} rows, {test_chars:,} characters")

Training Data:   36,718 rows, 10,892,990 characters
Validation Data: 3,760 rows, 1,142,150 characters
Test Data:       4,358 rows, 1,285,622 characters


In [32]:
import re

# --- Before Cleaning ---
print("--- Row Counts Before Cleaning ---")
print(f"Train:      {wikitext_dataset['train'].num_rows:,}")
print(f"Validation: {wikitext_dataset['validation'].num_rows:,}")
print(f"Test:       {wikitext_dataset['test'].num_rows:,}")


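# --- Preprocessing Steps ---
# These operations are applied to all splits (train, validation, test) simultaneously.
# Caveat: .strip() removes the leading space, so startswith(' = ') never matches
# and heading lines such as " = Valkyria Chronicles III = " are not filtered out here.
processed_dataset = wikitext_dataset.filter(
    lambda example: not example['text'].strip().startswith(' = ')
)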

processed_dataset = processed_dataset.map(
    lambda example: {
        'text': re.sub(
            r'[^a-zA-Z0-9\s.,\'?!-]', '',
            example['text'].lower().replace('@-@', '')
        ).strip()
    }
)

processed_dataset = processed_dataset.filter(
    lambda example: len(example['text']) > 0
)


# --- After Cleaning ---
print("\n--- Row Counts After Cleaning ---")
print(f"Train:      {processed_dataset['train'].num_rows:,}")
print(f"Validation: {processed_dataset['validation'].num_rows:,}")
print(f"Test:       {processed_dataset['test'].num_rows:,}")

--- Row Counts Before Cleaning ---
Train:      36,718
Validation: 3,760
Test:       4,358

--- Row Counts After Cleaning ---
Train:      23,764
Validation: 2,461
Test:       2,891


In [34]:
# Data Exploration
test=processed_dataset['test']
train=processed_dataset['train']
validation=processed_dataset['validation']

for i in range(26):
  print(f"{i}:{train[i]['text']}")

  #no immediate issues found

0:valkyria chronicles iii
1:senj no valkyria 3  unrecorded chronicles  japanese  3 , lit . valkyria of the battlefield 3  , commonly referred to as valkyria chronicles iii outside japan , is a tactical role  playing video game developed by sega and media.vision for the playstation portable . released in january 2011 in japan , it is the third game in the valkyria series . employing the same fusion of tactical and real  time gameplay as its predecessors , the story runs parallel to the first game and follows the  nameless  , a penal military unit serving the nation of gallia during the second europan war who perform secret black operations and are pitted against the imperial unit  calamaty raven  .
2:the game began development in 2010 , carrying over a large portion of the work done on valkyria chronicles ii . while it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . character designer r

## The Markov Chain Model for Text Generation

A Markov chain is a simple probabilistic model used to predict the next event in a sequence based solely on the current event. When applied to text, it predicts the next word based only on the current word, ignoring all previous context.

### What is "The Model"?

In the context of this project, the "model" is not a complex algorithm but a simple and intuitive data structure: a **Python dictionary**. This dictionary is the final output of the "training" process. It stores the word-to-word transition counts learned from the input text; the transition probabilities follow from these counts at generation time.

### The Structure of the Model (The Dictionary)

The model is a nested dictionary with a specific structure: `word -> {next_word: count}`.

* **Level 1: The Keys (Current Words)**
    * The keys of the main dictionary are all the unique words found in the text. Each key represents a possible "current state."

* **Level 2: The Values (Next Words and Their Frequencies)**
    * The value associated with each key is *another dictionary*.
    * In this inner dictionary, the keys are all the words that have ever appeared immediately after the "current word."
    * The values are the counts (integers) of how many times that specific transition occurred.
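
As a small illustration of how the stored counts relate to probabilities, the sketch below normalizes the inner dictionary for one word. The function name `transition_probabilities` and the toy model are assumptions made for this example and are not part of the notebook code.

```python
def transition_probabilities(model: dict, word: str) -> dict:
    """Turn the raw successor counts for `word` into probabilities."""
    counts = model.get(word, {})
    total = sum(counts.values())
    if total == 0:
        return {}  # unknown word or no recorded successors
    return {next_word: count / total for next_word, count in counts.items()}

# Toy model: "a" was followed by "cat" three times and by "dog" once
toy_model = {"a": {"cat": 3, "dog": 1}}
print(transition_probabilities(toy_model, "a"))  # {'cat': 0.75, 'dog': 0.25}
```

Generation does not need this explicit normalization, since `random.choices` accepts the raw counts as weights.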

### A Concrete Example

If our training text is: "**the cat sat on the mat**"

The resulting model dictionary would look like this:

```json
{
    "the": {
        "cat": 1,
        "mat": 1
    },
    "cat": {
        "sat": 1
    },
    "sat": {
        "on": 1
    },
    "on": {
        "the": 1
    }
}
```

In [48]:
import random
import string
import json


def build_markov_model(text: str) -> dict:
    """
    Builds a simple Markov chain model from a given text.
    """
    tokens = text.split()

    model = {}

    for i in range(len(tokens) - 1):
        current_word = tokens[i]
        next_word = tokens[i + 1]

        if current_word not in model:
            model[current_word] = {}

        if next_word not in model[current_word]:
            model[current_word][next_word] = 0

        model[current_word][next_word] += 1

    return model

def generate_text(model: dict, length: int = 50, start: str = None) -> str:
    """
    Generates new text using a pre-built Markov model.
    """
    # Pick a random starting word unless the caller supplies one
    if start is None:
        start_word = random.choice(list(model.keys()))
    else:
        start_word = start.lower()

    generated_text = [start_word]
    current_word = start_word

    for _ in range(length - 1):
        if current_word not in model:
            break

        next_words_dict = model[current_word]
        possible_next_words = list(next_words_dict.keys())
        word_frequencies = list(next_words_dict.values())

        chosen_next_word = random.choices(possible_next_words, weights=word_frequencies, k=1)[0]

        generated_text.append(chosen_next_word)
        current_word = chosen_next_word

    return ' '.join(generated_text)


# Helper functions for saving and loading the model (not part of the Markov logic)
import os
model_filepath = "/content/model/Markov_chain_v0.1" # <-- Change this path

def save_model(model, filepath):
    """Saves the model dictionary to a JSON file."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(model, f)

def load_model(filepath):
    """Loads the model dictionary from a JSON file."""
    with open(filepath, 'r', encoding='utf-8') as f:
        return json.load(f)

# --- Main execution ---

# 1. Convert the 'train' dataset object into a single corpus string
# We join all the lines of text together with spaces in between.
print("Preparing training corpus...")
corpus = " ".join(train['text'])
print("Corpus prepared.")

# 2. Build the model using the training corpus
print("Building Markov model...")
if os.path.exists(model_filepath):
    print("Loading saved model...")
    markov_model = load_model(model_filepath)
else:
    print("No saved model found. Building a new one...")
    markov_model = build_markov_model(corpus)
    save_model(markov_model, model_filepath)


# 3. Generate new text from the trained model
new_text = generate_text(markov_model, length=75,start="morning")

# 4. Print the result
print("\n--- Text Generated from WikiText Model ---")
print(new_text)

Preparing training corpus...
Corpus prepared.
Building Markov model...
Loading saved model...

--- Text Generated from WikiText Model ---
morning sun , the needs food to be echoed the overcrowding and for the death . according to 10 , whom were false bony fishes , and brown black people simply left colonel henry prevented the restaurant 17 side west . f4 ng6 11.d4 d5 ! 2012 13 . route 61 sonatine , which was also published book containing psychoactive . the song was installed later that i hadn t corporation , which attracts an


## Optimizations

### Memory Efficiency: Incremental Processing

The initial brute-force approach required loading the entire training dataset into a **single, massive string**. This is highly problematic for large datasets for two main reasons:

1.  **High Memory Usage:** The joined corpus holds the full text in memory at once. For a large dataset it can consume several gigabytes of RAM, potentially crashing the program with a `MemoryError` on systems with limited memory.
2.  **Loss of Context:** It destroys natural line boundaries. The last word of one line becomes the predecessor of the first word of the next, unrelated line, so the model learns transitions that never occurred in the text and cannot learn where passages start or end.

To solve this, we **fragment the training process**. Instead of creating one large object in memory, the optimized approach is to build the model **incrementally**.

We iterate through the dataset **line-by-line**, processing only one line at a time to update our model dictionary. This is a **memory-efficient** solution: it never builds a concatenated copy of the corpus, and each step needs only the tokens of the current line in addition to the model dictionary itself.
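
The boundary problem can be seen on a toy example. The sketch below, which uses two made-up lines and a hypothetical `bigrams` helper, compares the word pairs found in a joined corpus with those found when each line is processed on its own.

```python
from collections import Counter

# Two unrelated lines, standing in for consecutive rows of the dataset
lines = ["the match ended .", "rainfall peaked in june ."]

def bigrams(tokens):
    """Return all adjacent (current_word, next_word) pairs."""
    return list(zip(tokens, tokens[1:]))

# Brute-force approach: join everything, then count transitions
joined = Counter(bigrams(" ".join(lines).split()))

# Incremental approach: count transitions within each line only
per_line = Counter(pair for line in lines for pair in bigrams(line.split()))

# Transitions that exist only because the lines were glued together
print(set(joined) - set(per_line))  # {('.', 'rainfall')}
```

The only difference is the pair that spans the line boundary, which is exactly the kind of transition the incremental builder avoids.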

In [46]:
def build_markov_model_incrementally(dataset_split):
    """
    Builds a model by iterating through the dataset line by line (memory-efficient).
    """
    model = {}
    for line in dataset_split['text']:
        tokens = line.split()
        if not tokens:
            continue

        for i in range(len(tokens) - 1):
            current_word = tokens[i]
            next_word = tokens[i + 1]

            if current_word not in model:
                model[current_word] = {}

            if next_word not in model[current_word]:
                model[current_word][next_word] = 0

            model[current_word][next_word] += 1

    return model
# --- Main Execution (Incremental Approach) ---
# Assumes 'train' dataset object and all helper functions are pre-defined.

# 1. Configuration
MODEL_FILEPATH = "/content/model/Markov_chain_v0.2"
START_WORD = "morning"
TEXT_LENGTH = 75

# 2. Load or Build the Model Incrementally
if os.path.exists(MODEL_FILEPATH):
    print("Loading existing model...")
    markov_model = load_model(MODEL_FILEPATH)
else:
    print("No existing model found. Building a new one incrementally...")

    # Pass the dataset split directly to the incremental builder
    # instead of first joining it into one corpus string.
    markov_model = build_markov_model_incrementally(train)

    save_model(markov_model, MODEL_FILEPATH)

print("Model is ready.")

# 3. Generate and Print Text
print("\n--- Generating new text ---")
generated_text = generate_text(
    markov_model,
    length=TEXT_LENGTH,
    start=START_WORD
)

print("\n--- Generated Text ---")
print(generated_text)

No existing model found. Building a new one incrementally...
Model is ready.

--- Generating new text ---

--- Generated Text ---
morning , along with several organizations began in august 27 more than a moment i never having open runners , eno about 70 lines would have been to the fox banksia , who is known and underground organizations of x. british flag of all unnatural proportions of keamy shoots it back onto the second division undergraduate he visits to distance events are deeply flawed attempt to catch them as a new stage career . in


## Further Improvements

### Increasing Context with a Larger Window Size

The most significant weakness of the brute-force model is its extremely limited context. It is a **first-order Markov chain**, meaning it conditions only on the previous **1** word.

This leads to a critical failure in logical consistency. For example, consider the phrase:

> "The sky is..."

The model is probabilistic, so it samples the next word from the counts it has stored. Suppose training resulted in a model entry like:
```json
{ "is": {"apple": 3, "blue": 4} }
```
## The Problem of Lost Context

The model correctly knows that "**blue**" is a more likely successor to "**is**" than "**apple**."

The core failure is the **loss of critical context**. The model's pool of possible next words is created from *every single instance* where the word "is" appeared, regardless of what came before it. It learns from unrelated phrases like:

> * "the sky **is** blue"
> * "the company's logo **is** an apple"

When the model's only context is the word "is," it blends these unrelated scenarios. It correctly identifies "blue" as more probable, but "apple" remains a possibility because the model has completely forgotten the crucial preceding word, "**sky**."

## The Solution: Using Tuples for State

To solve this, we use a **tuple of multiple words** as the key. This acts as our new "state," effectively increasing the model's memory and filtering out irrelevant choices.

* **Old Model State (Plain Word):** `is`
    > Possible Next Words:
    > ```json
    > { "apple": 3, "blue": 4, "green": 2, ... }
    > ```

* **New Model State (Tuple):** `('sky', 'is')`
    > Possible Next Words:
    > ```json
    > { "blue": 150, "clear": 80, "overcast": 45, ... }
    > ```

By using a tuple, the model's state is now `('sky', 'is')`. The pool of possible next words it looks at is now completely different and far more relevant. The word "apple" is unlikely to even be in this new list, leading to much more coherent and context-aware text generation.
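
The effect can be reproduced on a few toy sentences. The following sketch is an illustration under assumptions: the `build` helper and the sentences are made up, and for simplicity it uses tuples as keys even for a window of one word, unlike the first-order model above, which uses plain strings.

```python
from collections import Counter, defaultdict

def build(lines, window_size):
    """Count next-word frequencies for every state of `window_size` words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens) - window_size):
            state = tuple(tokens[i:i + window_size])
            model[state][tokens[i + window_size]] += 1
    return model

toy_lines = ["the sky is blue", "the apple is red", "the sky is clear"]

first_order = build(toy_lines, window_size=1)
second_order = build(toy_lines, window_size=2)

# With one word of context, successors from unrelated sentences are blended
print(dict(first_order[("is",)]))          # {'blue': 1, 'red': 1, 'clear': 1}
# With two words of context, only the "sky" continuations remain
print(dict(second_order[("sky", "is")]))   # {'blue': 1, 'clear': 1}
```

The trade-off is sparsity: longer states occur less often in the training text, so each state has fewer observed successors.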

In [54]:
import random
import string
import json
import os
import ast # Used to safely convert string representations of tuples back to tuples

# (Assuming 'processed_dataset' is already created and split into train, validation, test)
# train = processed_dataset['train']

def build_markov_model_with_window(dataset_split, window_size=2):
    """
    Builds a higher-order Markov model using a tuple of words (the "window") as the state.
    """
    model = {}
    for line in dataset_split['text']:
        tokens = line.split()
        if len(tokens) < window_size + 1:
            continue
        for i in range(len(tokens) - window_size):
            current_state = tuple(tokens[i : i + window_size])
            next_word = tokens[i + window_size]
            if current_state not in model:
                model[current_state] = {}
            if next_word not in model[current_state]:
                model[current_state][next_word] = 0
            model[current_state][next_word] += 1
    return model

def save_model_window(model, filepath):
    """
    Saves the model dictionary to a JSON file, converting tuple keys to strings.
    """
    print(f"Saving model to {filepath}...")
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    # Convert tuple keys to strings because JSON does not support tuple keys
    string_keyed_model = {str(key): value for key, value in model.items()}
    with open(filepath, 'w', encoding='utf-8') as f:
        json.dump(string_keyed_model, f, indent=4)
    print("Model saved successfully.")

def load_model_window(filepath):
    """
    Loads the model from a JSON file, converting string keys back to tuples.
    """
    print(f"Loading model from {filepath}...")
    with open(filepath, 'r', encoding='utf-8') as f:
        string_keyed_model = json.load(f)
        # Convert string keys back to tuples for the model to work correctly
        model = {ast.literal_eval(key): value for key, value in string_keyed_model.items()}
    print("Model loaded successfully.")
    return model

def generate_text_with_window(model, length=75, window_size=2, start_seed=None):
    """
    Generates text from a windowed model, optionally starting with a given seed phrase.
    """
    start_state = None

    # Try to use the provided start_seed
    if start_seed:
        seed_words = start_seed.lower().split()
        # Validate that the seed has the correct number of words
        if len(seed_words) != window_size:
            print(f"Warning: Your start seed has {len(seed_words)} words, but the model's window size is {window_size}. Starting randomly.")
        else:
            potential_start_state = tuple(seed_words)
            # Validate that the seed exists as a state in our model
            if potential_start_state in model:
                start_state = potential_start_state
            else:
                print(f"Warning: The phrase {potential_start_state} was not found in the model's training data. Starting randomly.")

    # If no seed was provided or the provided seed was invalid, start randomly
    if start_state is None:
        start_state = random.choice(list(model.keys()))

    # Generate words until the requested length is reached or a dead end is hit
    generated_text = list(start_state)
    current_state = start_state
    for _ in range(length - window_size):
        if current_state not in model:
            break
        next_words_dict = model[current_state]
        possible_next_words = list(next_words_dict.keys())
        word_frequencies = list(next_words_dict.values())
        chosen_next_word = random.choices(possible_next_words, weights=word_frequencies, k=1)[0]
        generated_text.append(chosen_next_word)
        current_state = tuple(generated_text[-window_size:])

    return ' '.join(generated_text)

# --- Main Execution ---

# 1. Configuration
WINDOW_SIZE = 2
model_dir = "/content/model/Markov_chain_v0.3"
# We'll make the filename dynamic based on the window size
model_filename = f"markov_model_ws{WINDOW_SIZE}.json"
model_filepath = os.path.join(model_dir, model_filename)

# 2. Load or Build the Model
if os.path.exists(model_filepath):
    windowed_markov_model = load_model_window(model_filepath)
else:
    print("Saved model not found. Training a new one...")
    windowed_markov_model = build_markov_model_with_window(train, window_size=WINDOW_SIZE)
    save_model_window(windowed_markov_model, model_filepath)

# 3. Generate new text from the loaded/trained model
USER_START_SEED = "the sky"

# Pass the seed phrase so that generation starts from that state
new_text = generate_text_with_window(
    windowed_markov_model,
    length=75,
    window_size=WINDOW_SIZE,
    start_seed=USER_START_SEED
)

# 4. Print the result
print("\n--- Text Generated from Windowed Model ---")
print(new_text)

Saved model not found. Training a new one...
Saving model to /content/model/Markov_chain_v0.3/markov_model_ws2.json...
Model saved successfully.

--- Text Generated from Windowed Model ---
the sky , with a depth of 70 mph 110 km west of bir el mazar 42 miles 68 km east of the detroit metro times said coleman 's 1966 tune most likely abc 's radio presenting career , selected and edited excerpts of cruise ships operate from haifa . the street after an examination , and vertigo titles , in the latino community , creating a historical saga about the supernatural , much like


In [61]:
# 1. Configuration
WINDOW_SIZE = 4
model_dir = "/content/model/Markov_chain_v0.3"
# We'll make the filename dynamic based on the window size
model_filename = f"markov_model_ws{WINDOW_SIZE}.json"
model_filepath = os.path.join(model_dir, model_filename)

# 2. Load or Build the Model
if os.path.exists(model_filepath):
    windowed_markov_model = load_model_window(model_filepath)
else:
    print("Saved model not found. Training a new one...")
    windowed_markov_model = build_markov_model_with_window(train, window_size=WINDOW_SIZE)
    save_model_window(windowed_markov_model, model_filepath)

# 3. Generate new text from the loaded/trained model
# No start seed is passed, so generation begins from a randomly chosen state.
new_text = generate_text_with_window(
    windowed_markov_model,
    length=150,
    window_size=WINDOW_SIZE
)

# 4. Print the result
print("\n--- Text Generated from Windowed Model ---")
print(new_text)

Loading model from /content/model/Markov_chain_v0.3/markov_model_ws4.json...
Model loaded successfully.

--- Text Generated from Windowed Model ---
only in 2004 that an original 35 mm film print was discovered due to the intervention of a fan .


## Next Optimizations: Grammar and Memory

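In the next version of the model, we will perform two significant optimizations simultaneously to improve both the quality of the generated text and the model's efficiency. A third change reorganizes how the vocabulary and the models are stored, so that experiments with different window sizes stay manageable.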

### 1. Teaching the Model Sentence Structure

Currently, our model has no sense of the start and end of a sentence. A word at the end of a sentence and the same word in the middle are treated as completely different because of the attached punctuation.

> For example, "**found**" and "**found.**" are two entirely separate words for our model.

This is a major flaw, as the model never learns that a period signifies the end of a thought.

We solve this by **treating punctuation as its own word (or "token")**. During preprocessing, we will pad spaces around major punctuation marks.

* **Before:** `"the cat sat found."`
* **After:** The text is tokenized into `["the", "cat", "sat", "found", "."]`

This teaches the model the crucial relationship between words and punctuation, allowing it to learn how to properly end the sentences it generates.
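
One possible way to do the padding is a single regular expression substitution, sketched below. The function name `tokenize_with_punctuation` and the choice of punctuation marks are assumptions for this example.

```python
import re

def tokenize_with_punctuation(text: str) -> list:
    """Pad major punctuation with spaces so each mark becomes its own token."""
    padded = re.sub(r"([.,!?])", r" \1 ", text)
    return padded.split()

print(tokenize_with_punctuation("the cat sat found."))
# ['the', 'cat', 'sat', 'found', '.']
```

This simple rule also splits strings such as `media.vision` or decimal numbers at the period, so a real preprocessing step may need exceptions for those cases.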

### 2. Reducing Memory Footprint with Integer Encoding

Storing the entire model using full words (strings) is inefficient. Each ASCII character takes at least one byte, so a 7-character word like "**awesome**" needs at least 7 bytes, and in Python every string object also carries a fixed per-object overhead. Because the same word is repeated across many states and transitions, the memory required for the model dictionary can become enormous for a large corpus.

To solve this, we build a **vocabulary index** (integer encoding) by mapping each unique word in our vocabulary to a unique integer.

> For example: `{"a": 1, "the": 14, "awesome": 56, ...}`

This fundamentally changes our model. Instead of a single large dictionary, our saved model now consists of **two essential components**:

1.  **The Vocabulary Map:** A dictionary that maps words to their unique integer IDs (and another that maps IDs back to words for generation).
2.  **The Core Model:** The main dictionary, which now stores these lightweight integers instead of heavy strings, dramatically reducing its size in memory.
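
A minimal sketch of the vocabulary map, assuming IDs are assigned in order of first appearance (the actual numbering scheme is not fixed by the text above). The names `build_vocabulary` and `encode` are hypothetical.

```python
def build_vocabulary(lines):
    """Assign each unique word an integer ID in order of first appearance."""
    word_to_id = {}
    for line in lines:
        for token in line.split():
            if token not in word_to_id:
                word_to_id[token] = len(word_to_id)
    # Reverse map, needed to turn generated IDs back into words
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word

def encode(line, word_to_id):
    """Replace every word in a line with its integer ID."""
    return [word_to_id[token] for token in line.split()]

toy_lines = ["the sky is blue", "the sea is blue"]
word_to_id, id_to_word = build_vocabulary(toy_lines)

encoded = encode("the sea is blue", word_to_id)
print(encoded)                                    # [0, 4, 2, 3]
print(" ".join(id_to_word[i] for i in encoded))   # the sea is blue
```

The core model can then be built from the encoded lines exactly as before, with integers taking the place of word strings in both the states and the successor dictionaries.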

### 3. Decoupling Vocabulary from the Model for Experimentation

Since we want to check the performance of our models for different window sizes, it is inefficient to store the vocabulary inside each model file. The vocabulary of the training text is constant; it does not change whether we use a window size of 2, 3, or 4.

Therefore, we will adopt a more organized storage strategy:

* **Vocabulary (`vocabulary.json`):** The word-to-integer mapping will be built once from the training data and saved to its own separate file.
* **Model Dictionaries (`model_ws2.json`, `model_ws3.json`, etc.):** Each model, trained with a specific window size, will be saved to its own file, named dynamically based on its configuration. These model files will only contain the integer-based transition logic.

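This approach prevents data duplication, saves disk space, and keeps our experiments clean and organized. When we want to generate text, we will first load the single `vocabulary.json` and then load the specific `model_wsX.json` we wish to test.

The example below shows both components side by side for illustration. On disk, the `vocabulary` object lives in `vocabulary.json` and the `model` object in its own `model_wsX.json` file.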

```json
{
  "vocabulary": {
    "the": 1,
    "sky": 2,
    "is": 3,
    "blue": 4,
    "apple": 5
  },
  "model": {
    "(2, 3)": {
      "4": 150
    },
    "(1, 5)": {
      "3": 20
    }
  }
}
```
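
The storage layout could be sketched as follows. This is an illustration under assumptions: the helper names are hypothetical, a temporary directory stands in for the real model folder, and tuple keys are serialized with `str()` and restored with `ast.literal_eval`, mirroring the windowed save and load functions earlier in the notebook.

```python
import ast
import json
import os
import tempfile

def save_json(obj, filepath):
    """Write any JSON-serializable object to disk."""
    os.makedirs(os.path.dirname(filepath), exist_ok=True)
    with open(filepath, "w", encoding="utf-8") as f:
        json.dump(obj, f)

def model_to_jsonable(model):
    """JSON keys must be strings, so (2, 3) becomes "(2, 3)" and 4 becomes "4"."""
    return {str(state): {str(word_id): count for word_id, count in successors.items()}
            for state, successors in model.items()}

def model_from_jsonable(data):
    """Restore tuple states and integer word IDs after loading."""
    return {ast.literal_eval(state): {int(word_id): count for word_id, count in successors.items()}
            for state, successors in data.items()}

out_dir = tempfile.mkdtemp()  # stand-in for the real model directory
vocabulary = {"the": 1, "sky": 2, "is": 3, "blue": 4, "apple": 5}
model_ws2 = {(2, 3): {4: 150}, (1, 5): {3: 20}}

# The vocabulary is written once; each window size gets its own model file
save_json(vocabulary, os.path.join(out_dir, "vocabulary.json"))
save_json(model_to_jsonable(model_ws2), os.path.join(out_dir, "model_ws2.json"))

with open(os.path.join(out_dir, "model_ws2.json"), encoding="utf-8") as f:
    restored = model_from_jsonable(json.load(f))
print(restored == model_ws2)  # True
```

Training a model with another window size would only add one more `model_wsX.json` file next to the shared vocabulary.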
