# Task
Train a scaled-down Mistral model from scratch using the "wikitext" dataset by installing necessary libraries, defining a custom small `MistralConfig`, initializing a `MistralForCausalLM` with random weights, executing the training loop, and testing text generation.

**The Core Concept:** "Next Word Prediction"
Imagine you are teaching someone a language by giving them a sentence cut in half (e.g., "The cat sat on the...") and asking them to guess the next word.

If they guess correctly (e.g., "mat"), you give them a thumbs up.
If they guess wrong (e.g., "pizza"), you correct them.
Over time, by seeing thousands of sentences, they start to learn grammar, vocabulary, and facts about the world just to get better at that guessing game. This is what we did with the model in this notebook, except that the model predicts tokens (subword pieces) rather than whole words.
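
As a rough illustration of the guessing game described above, the sketch below turns one sentence into (context, next word) training pairs. It splits on whitespace for readability, whereas the notebook works on tokenizer output, and the function name `next_word_pairs` is made up for this example.

```python
def next_word_pairs(sentence):
    """Return (context, target) pairs for next-word prediction."""
    words = sentence.split()
    # Each prefix of the sentence is a context; the word after it is the target.
    return [(words[:i], words[i]) for i in range(1, len(words))]


pairs = next_word_pairs("The cat sat on the mat")
for context, target in pairs:
    print(" ".join(context), "->", target)
```

A single six-word sentence already yields five training examples, which is why even a small corpus provides many prediction targets.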

**What we specifically did in this notebook:**

*   **"From Scratch"**: We started with a model that knew nothing. Its "brain" (the weights) was initialized with random values, so it had no notion of what a word was.
*   **The Training**: We fed it text from Wikipedia (the WikiText-2 dataset). The model looked at the text, tried to predict the next token, got it wrong (because its weights were random), and mathematically adjusted its weights to be slightly less wrong next time.
*   **The Result**: Because we used a very tiny model (a "mini" brain) and trained it only briefly (one epoch, about 77 seconds), it had neither the time nor the capacity to learn actual English. It mostly learned surface patterns, such as putting spaces between words or repeating common punctuation, which is why the final output looks like gibberish.
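
The "adjusted itself to be slightly less wrong" step can be sketched on a toy example. The code below is a simplified illustration, not what the `Trainer` does internally: it applies one plain gradient-descent step directly to three made-up logits, whereas real training updates the model's weights through backpropagation and an optimizer. The vocabulary, learning rate, and function names are all assumptions chosen for the example.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability assigned to the correct next token."""
    return -math.log(softmax(logits)[target])


# Toy vocabulary of three "words"; index 0 ("mat") is the correct answer.
vocab = ["mat", "pizza", "car"]
logits = [0.0, 0.0, 0.0]  # an untrained model: every word equally likely
target = 0
learning_rate = 0.5

loss_before = cross_entropy(logits, target)

# The gradient of the loss with respect to the logits is (probabilities - one_hot(target)).
probs = softmax(logits)
grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
logits = [z - learning_rate * g for z, g in zip(logits, grads)]

loss_after = cross_entropy(logits, target)
print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The loss drops after the update because the score of the correct word rises while the others fall; training repeats this kind of step thousands of times over real text.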

## Install Dependencies

### Subtask:
Install the required Python libraries for training the model.


**Reasoning**:
Install the necessary libraries for the subsequent model training tasks using pip.



In [None]:
!pip install -q transformers datasets accelerate torch

## Prepare Dataset

### Subtask:
Load the Wikitext dataset and preprocess it using a tokenizer.


**Reasoning**:
Load the Wikitext dataset and preprocess it using the GPT-2 tokenizer as per instructions to prepare inputs for the model.



In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

# 1. Load the dataset
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")

# 2. Initialize tokenizer
# Use the GPT-2 tokenizer: it needs no authentication and is sufficient for this demo
tokenizer = AutoTokenizer.from_pretrained("gpt2")
# GPT-2 defines no padding token, so reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token

# 3. Define preprocessing function
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=128)

# 4. Apply preprocessing
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])

# 5. Set format to torch
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask"])

print("Dataset structure after tokenization:")
print(tokenized_datasets)

Dataset structure after tokenization:
DatasetDict({
    test: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 4358
    })
    train: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 36718
    })
    validation: Dataset({
        features: ['input_ids', 'attention_mask'],
        num_rows: 3760
    })
})
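
WikiText rows include blank lines, which tokenize to nothing and therefore become sequences made entirely of padding. One possible refinement, which was not part of the run shown here and would change the row counts above, is to drop such rows before tokenization. The helper name `has_text` and the sample rows are made up for illustration; `Dataset.filter` is the `datasets` method that would apply the helper.

```python
def has_text(example):
    """Keep only rows whose text contains non-whitespace characters."""
    return len(example["text"].strip()) > 0


# Made-up stand-in rows that mimic the dataset's {"text": ...} schema.
sample_rows = [
    {"text": ""},
    {"text": " = Example Heading = \n"},
    {"text": " \n"},
    {"text": " An example sentence . \n"},
]
kept = [row for row in sample_rows if has_text(row)]
print(f"kept {len(kept)} of {len(sample_rows)} rows")

# With the real dataset this would run before tokenization, for example:
# dataset = dataset.filter(has_text)
```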


## Initialize Model Configuration and Instantiation

### Subtask:
Define a scaled-down Mistral configuration and instantiate the model with random weights.


**Reasoning**:
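Define the scaled-down Mistral configuration and instantiate the model with random weights as requested. `num_key_value_heads` is set explicitly to match `num_attention_heads`: the number of attention heads must be divisible by the number of key/value heads, and the `MistralConfig` default of 8 key/value heads would not fit a model with only 4 attention heads.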
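
The head-count constraint can be checked with a few lines of arithmetic. This is a standalone sketch, not library code: `check_attention_heads` is a hypothetical helper, and it assumes the per-head dimension is derived as `hidden_size // num_attention_heads` (which is what a `head_dim` of `null` in the config implies).

```python
def check_attention_heads(hidden_size, num_attention_heads, num_key_value_heads):
    """Validate head counts and return (head_dim, query heads per key/value head)."""
    if hidden_size % num_attention_heads != 0:
        raise ValueError("hidden_size must be divisible by num_attention_heads")
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError("num_attention_heads must be divisible by num_key_value_heads")
    head_dim = hidden_size // num_attention_heads
    queries_per_kv = num_attention_heads // num_key_value_heads
    return head_dim, queries_per_kv


# Values used in this notebook: hidden size 128, 4 query heads, 4 key/value heads.
head_dim, queries_per_kv = check_attention_heads(128, 4, 4)
print(head_dim, queries_per_kv)  # 32 1
```

One query head per key/value head means the model uses ordinary multi-head attention rather than grouped-query attention.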



In [None]:
from transformers import MistralConfig, MistralForCausalLM

# 1. Define the configuration
# Using a small configuration for demonstration purposes to run on limited resources
config = MistralConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=128,
    intermediate_size=512,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=4,  # Matching attention heads for simplicity in this scaled-down version
    max_position_embeddings=256
)

# 2. Instantiate the model
model = MistralForCausalLM(config)

# 3. Print the configuration and the number of trainable parameters
print(f"Model configuration: {config}")
print(f"Total number of trainable parameters: {model.num_parameters(only_trainable=True)}")

Model configuration: MistralConfig {
  "attention_dropout": 0.0,
  "bos_token_id": 1,
  "eos_token_id": 2,
  "head_dim": null,
  "hidden_act": "silu",
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 512,
  "max_position_embeddings": 256,
  "model_type": "mistral",
  "num_attention_heads": 4,
  "num_hidden_layers": 2,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_theta": 10000.0,
  "sliding_window": 4096,
  "tie_word_embeddings": false,
  "transformers_version": "4.57.3",
  "use_cache": true,
  "vocab_size": 50257
}

Total number of trainable parameters: 13390720
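
The reported parameter count can be reproduced by hand from the configuration. The sketch below assumes a Mistral-style layout with no bias terms, a gated MLP with three projections, two RMSNorm weight vectors per layer plus a final one, and an output head that is not tied to the input embedding (`tie_word_embeddings` is `false` above). The function name `mistral_param_count` is made up for this example.

```python
def mistral_param_count(vocab_size, hidden_size, intermediate_size,
                        num_layers, num_heads, num_kv_heads, tie_embeddings=False):
    """Estimate the parameter count of a Mistral-style decoder (no bias terms)."""
    head_dim = hidden_size // num_heads
    embed = vocab_size * hidden_size
    # Attention: q and o projections are hidden x hidden; k and v shrink with fewer KV heads.
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * (num_kv_heads * head_dim)
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * intermediate_size
    # Two RMSNorm weight vectors per layer.
    norms = 2 * hidden_size
    per_layer = attn + mlp + norms
    final_norm = hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    return embed + num_layers * per_layer + final_norm + lm_head


total = mistral_param_count(50257, 128, 512, 2, 4, 4)
embeddings = 2 * 50257 * 128  # input embedding plus untied output head
print(f"total: {total:,}")
print(f"embedding share: {embeddings / total:.1%}")
```

The result matches the 13,390,720 printed above. Roughly 96% of those parameters sit in the embedding table and output head, so the two transformer layers that do the actual sequence modeling hold only about half a million weights.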


## Configure and Train Model

### Subtask:
Configure training arguments, initialize the Trainer with the model and dataset, and execute the training process.


**Reasoning**:
Configure training arguments, initialize the Trainer, and start the training loop using the prepared model and dataset.



In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

# 1. Define data collator
# mlm=False selects causal language modeling (CLM) rather than masked language modeling
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# 2. Define training arguments
training_args = TrainingArguments(
    output_dir="./mistral_wikitext",
    per_device_train_batch_size=8,
    num_train_epochs=1,
    learning_rate=5e-4,
    logging_steps=50,
    report_to="none", # Disable wandb/mlflow reporting for this demo
    save_strategy="no" # Avoid saving checkpoints to save space/time for demo
)

# 3. Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"]
)

# 4. Train the model
trainer.train()

Step,Training Loss
50,9.2675
100,7.6474
150,7.2407
200,7.1239
250,6.8392
300,6.8358
350,6.8168
400,6.7428
450,6.4996
500,6.5761


TrainOutput(global_step=4590, training_loss=5.958062820600788, metrics={'train_runtime': 77.4877, 'train_samples_per_second': 473.856, 'train_steps_per_second': 59.235, 'total_flos': 196206629093376.0, 'train_loss': 5.958062820600788, 'epoch': 1.0})
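
The step count and throughput in `TrainOutput` follow from numbers already shown in this notebook. The sketch below recomputes them, assuming a single device (so the effective batch size equals `per_device_train_batch_size`) and no gradient accumulation.

```python
import math

train_rows = 36718       # rows in the train split (from the dataset summary)
batch_size = 8           # per_device_train_batch_size, assuming a single device
runtime_s = 77.4877      # train_runtime reported by the Trainer

# The last batch is partial, so the step count is rounded up.
steps_per_epoch = math.ceil(train_rows / batch_size)
samples_per_second = train_rows / runtime_s
steps_per_second = steps_per_epoch / runtime_s

print(steps_per_epoch)
print(round(samples_per_second, 1), round(steps_per_second, 1))
```

The computed values agree with the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.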

## Test Model

### Subtask:
Generate text using the trained model to verify functionality.


**Reasoning**:
Generate text using the trained model to verify functionality by tokenizing a prompt, running generation, and decoding the output.



In [None]:
# 1. Define input text
input_text = "The history of the world is"

# 2. Tokenize inputs
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

# 3. Generate text
# max_length caps the total length (prompt plus generated tokens); pad_token_id is set explicitly to avoid a warning
generated_ids = model.generate(**inputs, max_length=50, pad_token_id=tokenizer.eos_token_id)

# 4. Decode and print result
output_text = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
print(f"Input: {input_text}")
print(f"Generated Output: {output_text}")

Input: The history of the world is
Generated Output: The history of the world is a " " " , and " a " , and " the " . The song is a " a " a " , and " a " , " a " , " , " , and " The " . The
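
Part of the repetition in this output comes from the decoding strategy: when no sampling options are passed, `generate` picks the highest-scoring token at every step (greedy decoding). The sketch below contrasts greedy selection with random sampling on made-up scores. It is a simplification, since a real model recomputes the scores after each generated token, and the vocabulary, logits, and helper names are assumptions for illustration only.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_greedy(logits):
    """Always choose the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


def pick_sampled(logits, temperature, rng):
    """Draw a token at random in proportion to its probability."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Made-up scores for a four-token vocabulary.
vocab = ['"', ",", "and", "history"]
logits = [2.0, 1.5, 1.0, 0.5]
rng = random.Random(0)

greedy = [vocab[pick_greedy(logits)] for _ in range(5)]
sampled = [vocab[pick_sampled(logits, 1.0, rng)] for _ in range(5)]
print("greedy: ", greedy)
print("sampled:", sampled)
```

With the trained model, sampling can be enabled by passing `do_sample=True` (optionally with `temperature` or `top_k`) to `model.generate`. For a model this small the text would likely stay incoherent, only less repetitive.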


## Final Task

### Subtask:
Summarize the training process and the model's performance.


## Summary:

### Data Analysis Key Findings

*   **Dataset Preparation**: The `wikitext-2-raw-v1` dataset was successfully loaded and preprocessed using a GPT-2 tokenizer, with sequences truncated and padded to a maximum length of 128 tokens.
*   **Model Architecture**: A custom, scaled-down Mistral configuration was implemented with `128` hidden size, `2` hidden layers, and `4` attention heads.
*   **Parameter Count**: The initialized `MistralForCausalLM` model contained a total of `13,390,720` trainable parameters.
*   **Training Performance**: The model was trained for `1` epoch (4,590 steps) with a batch size of 8 and a learning rate of `5e-4`. Training took approximately 77.5 seconds, and the average training loss over the run was `5.96`.
*   **Inference Verification**: The text generation pipeline was verified using the prompt "The history of the world is". While the output was grammatically incoherent (e.g., repeating punctuation), it confirmed that the model architecture and generation method were technically functional.
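
To put the reported loss in context, it helps to compare it with the loss of a model that guesses uniformly over the vocabulary and to convert it to perplexity. The sketch below assumes the loss is a cross-entropy measured in nats (natural logarithm), which is the PyTorch convention.

```python
import math

vocab_size = 50257        # GPT-2 tokenizer vocabulary used in this notebook
avg_train_loss = 5.958    # training_loss reported in TrainOutput (rounded)

# Loss of a model that assigns equal probability to every token.
uniform_loss = math.log(vocab_size)
# Perplexity is the exponential of the cross-entropy loss.
perplexity = math.exp(avg_train_loss)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"perplexity at loss {avg_train_loss}: {perplexity:.0f}")
```

The first logged loss (9.27 at step 50) is already below the uniform baseline of about 10.8, and the run average corresponds to a perplexity of a few hundred: the model has learned token frequencies and some local patterns, but is still far from fluent text.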

### Insights or Next Steps

*   **Proof of Concept vs. Performance**: The high average training loss (`5.96`) and incoherent output are expected for a "from scratch" run with such a tiny architecture and so little training data and time. The run demonstrates the pipeline mechanics rather than linguistic capability.
*   **Scaling Up**: To achieve coherent text generation, the next steps should involve significantly increasing the model size (hidden layers and dimensions), training for more epochs, and potentially utilizing a larger dataset or fine-tuning a pre-trained checkpoint instead of random initialization.
