<a href="https://colab.research.google.com/github/YashNigam65/gitfolder/blob/master/assignment_3_wiki_training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Task
Train a reasonably sized LLM from scratch with the `transformers` library on a suitable dataset such as Wikipedia text, then test the trained model.

## Install Libraries


Install `transformers`, `datasets`, and `accelerate`.


`transformers` provides the model classes and the `Trainer`, `datasets` loads the training corpus, and `accelerate` is needed by the `Trainer` when training with PyTorch. Each library is installed in its own cell, starting with `transformers`.



In [None]:
!pip install transformers



In [None]:
!pip install datasets



In [None]:
!pip install accelerate



## Load and Preprocess Dataset

Load the 'wikitext' dataset and tokenize it for training.


In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer

print("Imported load_dataset from datasets and AutoTokenizer from transformers.")

Imported load_dataset from datasets and AutoTokenizer from transformers.


Load the `wikitext-2-raw-v1` configuration of the `wikitext` dataset into a variable named `dataset`, using the `load_dataset` function imported above.



In [None]:
dataset = load_dataset('wikitext', 'wikitext-2-raw-v1')
print("Loaded the 'wikitext-2-raw-v1' dataset.")

Generating test split: 4358 examples
Generating train split: 36718 examples
Generating validation split: 3760 examples

Loaded the 'wikitext-2-raw-v1' dataset.


With the dataset loaded, the next step is to instantiate a tokenizer with `AutoTokenizer.from_pretrained()`. Because the model is a scaled-down Mistral, the pre-trained tokenizer from `mistralai/Mistral-7B-v0.1` supplies the vocabulary. Only the tokenizer is reused; no model weights are loaded.



In [None]:
tokenizer = AutoTokenizer.from_pretrained('mistralai/Mistral-7B-v0.1')
print("Tokenizer instantiated using 'mistralai/Mistral-7B-v0.1'.")

Tokenizer instantiated using 'mistralai/Mistral-7B-v0.1'.


Next, define a tokenization function. It takes a batch of examples, tokenizes the `text` field with the tokenizer above, and truncates each entry to at most `block_size` = 128 tokens.



In [None]:
block_size = 128

def tokenize_function(examples):
    return tokenizer(examples['text'], truncation=True, max_length=block_size)

print(f"Tokenization function defined with block_size = {block_size}.")

Tokenization function defined with block_size = 128.
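
Because `truncation=True` is applied to each entry before any grouping happens, tokens beyond position 128 in a long line are discarded rather than carried into the next block. The sketch below illustrates the effect with made-up token counts. The helper `count_dropped_tokens` and the numbers are hypothetical and are not measurements from WikiText-2.

```python
def count_dropped_tokens(token_counts, block_size):
    """Return (dropped, total) tokens when every entry is truncated to block_size."""
    total = sum(token_counts)
    kept = sum(min(n, block_size) for n in token_counts)
    return total - kept, total


# Hypothetical per-line token counts: a short line, an exact fit,
# a long paragraph, and an empty line.
toy_token_counts = [50, 128, 300, 0]

dropped, total = count_dropped_tokens(toy_token_counts, block_size=128)
print(f"{dropped} of {total} tokens dropped by truncation")
```

If keeping every token matters, one option is to tokenize without truncation and leave all chunking to the grouping step that follows.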


Apply `tokenize_function` to every split of `dataset` using the `map` method with `batched=True`, which passes batches of examples to the tokenizer instead of one example at a time.



In [None]:
tokenized_datasets = dataset.map(
    tokenize_function,
    batched=True,
    num_proc=4, # Use multiple processes for faster tokenization if your system supports it
    remove_columns=dataset["train"].column_names # Remove original 'text' column after tokenization
)
print("Dataset tokenized successfully and original text column removed.")

Dataset tokenized successfully and original text column removed.


With the dataset tokenized and the original text column removed, the next step is a function that concatenates all tokenized texts and splits them into chunks of `block_size` tokens. This produces the fixed-length sequences used for language model training. `labels` is a copy of `input_ids` because the causal LM head shifts the labels internally when it computes the loss.



In [None]:
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

print(f"Function 'group_texts' defined to concatenate and chunk texts into blocks of size {block_size}.")

Function 'group_texts' defined to concatenate and chunk texts into blocks of size 128.
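
To make the concatenate-then-chunk behavior concrete, here is a small standalone sketch on a toy batch. `group_texts_demo` is a hypothetical variant of the function above that takes `block_size` as an argument, and the token ids are made up.

```python
from itertools import chain


def group_texts_demo(examples, block_size):
    """Concatenate every column, then cut it into blocks of block_size tokens."""
    concatenated = {k: list(chain.from_iterable(v)) for k, v in examples.items()}
    total_length = len(concatenated["input_ids"])
    # Drop the remainder that does not fill a whole block.
    total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    # Labels mirror the inputs; the model shifts them when computing the loss.
    result["labels"] = [list(block) for block in result["input_ids"]]
    return result


# Three toy "documents" with 3, 5, and 2 tokens (10 tokens in total).
toy_batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1, 1], [1, 1]],
}

grouped = group_texts_demo(toy_batch, block_size=4)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Tokens 9 and 10 are dropped because they do not fill a complete block, and document boundaries are ignored: the first block mixes tokens from the first two toy documents.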


In [None]:
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    num_proc=4, # Use multiple processes for faster processing
)
print("Tokenized dataset has been grouped into fixed-size blocks.")

Tokenized dataset has been grouped into fixed-size blocks.


## Define Custom MistralConfig

Create a scaled-down `MistralConfig` with custom parameters.


Import `MistralConfig` from the `transformers` library. The configuration itself is instantiated further below, once the remaining classes have been imported.



In [None]:
from transformers import MistralConfig

print("Imported MistralConfig from transformers.")

Imported MistralConfig from transformers.


## Initialize MistralForCausalLM


Initialize `MistralForCausalLM` with random weights using the custom configuration.


Import the `MistralForCausalLM` class from the `transformers` library so that the model can be constructed from the custom configuration.



In [None]:
from transformers import MistralForCausalLM

print("Imported MistralForCausalLM from transformers.")

Imported MistralForCausalLM from transformers.


## Define Training Arguments

Set up `TrainingArguments` for the training loop.


In [None]:
from transformers import TrainingArguments

print("Imported TrainingArguments from transformers.")

Imported TrainingArguments from transformers.


## Create Hugging Face Trainer


Instantiate the `Trainer` with the model, training arguments, and dataset.


In [None]:
from transformers import Trainer

print("Imported Trainer from transformers.")

Imported Trainer from transformers.


In [None]:
config = MistralConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=256,
    num_hidden_layers=4,
    num_attention_heads=4,
    num_key_value_heads=4, # Must evenly divide num_attention_heads; 4 KV heads with 4 query heads is plain multi-head attention
    intermediate_size=1024,
    max_position_embeddings=block_size
)

print("Scaled-down MistralConfig instantiated with custom parameters.")

Scaled-down MistralConfig instantiated with custom parameters.
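
For a sense of scale, the parameter count implied by this configuration can be estimated by hand. The sketch below assumes a vocabulary of 32,000 tokens (what I believe `tokenizer.vocab_size` returns for this tokenizer), bias-free linear layers, and an output head that is not tied to the input embeddings. `estimate_mistral_params` is a hypothetical helper, not a library function.

```python
def estimate_mistral_params(vocab_size, hidden_size, num_layers, num_heads,
                            num_kv_heads, intermediate_size, tie_embeddings=False):
    """Rough parameter count for a Mistral-style decoder (assumes no biases)."""
    head_dim = hidden_size // num_heads
    embed = vocab_size * hidden_size
    # q_proj and o_proj are hidden x hidden; k_proj and v_proj are
    # hidden x (num_kv_heads * head_dim).
    attn = 2 * hidden_size * hidden_size + 2 * hidden_size * num_kv_heads * head_dim
    mlp = 3 * hidden_size * intermediate_size  # gate, up, and down projections
    norms = 2 * hidden_size                    # two RMSNorm weight vectors per layer
    final_norm = hidden_size
    lm_head = 0 if tie_embeddings else vocab_size * hidden_size
    return embed + num_layers * (attn + mlp + norms) + final_norm + lm_head


total = estimate_mistral_params(
    vocab_size=32000,  # assumption, see the note above
    hidden_size=256,
    num_layers=4,
    num_heads=4,
    num_kv_heads=4,
    intermediate_size=1024,
)
print(f"{total:,} parameters")  # 20,580,608 parameters
```

Under these assumptions roughly four fifths of the parameters sit in the embedding and output matrices rather than in the four transformer layers. The estimate can be checked against `model.num_parameters()` once the model has been created.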


In [None]:
model = MistralForCausalLM(config)

print("MistralForCausalLM model re-initialized with random weights using the updated custom configuration.")

MistralForCausalLM model re-initialized with random weights using the updated custom configuration.



## Create Smaller Dataset Subset


Create smaller subsets of `lm_datasets` for training and evaluation to significantly reduce training time. The cell below takes the first 1,000 blocks of the training split and the first 200 blocks of the test split.


In [None]:
train_size = 1000
eval_size = 200

small_train_dataset = lm_datasets["train"].select(range(train_size))
small_eval_dataset = lm_datasets["test"].select(range(eval_size))

print(f"Length of small_train_dataset: {len(small_train_dataset)}")
print(f"Length of small_eval_dataset: {len(small_eval_dataset)}")

Length of small_train_dataset: 1000
Length of small_eval_dataset: 200
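
`select(range(n))` keeps the first `n` rows, so the subset comes from the start of the corpus. If a random subset is preferred, one option is to draw the indices with a fixed seed and pass them to `select`. In the sketch below, `sample_indices` is a hypothetical helper and `total_rows=5000` is a placeholder; in the notebook it would be `len(lm_datasets["train"])`.

```python
import random


def sample_indices(total_rows, subset_size, seed=42):
    """Pick subset_size distinct row indices, reproducibly for a given seed."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_rows), subset_size))


# Placeholder row count; replace with len(lm_datasets["train"]) in the notebook.
indices = sample_indices(total_rows=5000, subset_size=1000)
print(len(indices))

# In the notebook, the indices could then be used as:
# small_train_dataset = lm_datasets["train"].select(indices)
```

The `datasets` library also offers `shuffle(seed=...)`, which can be chained with `select(range(n))` for the same purpose.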


Re-instantiate `TrainingArguments` with `num_train_epochs` set to 1 so that training finishes quickly. The per-device batch size is 4, and all other arguments keep their defaults.



In [None]:
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=1, # Reduced epochs for faster training
    per_device_train_batch_size=4
)

print("TrainingArguments re-instantiated with num_train_epochs set to 1.")

TrainingArguments re-instantiated with num_train_epochs set to 1.
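
With 1,000 training blocks and a batch size of 4, one epoch is only a few hundred optimizer steps. As far as I know, the default `logging_steps` of `TrainingArguments` is 500, in which case a run this short would never log a training loss. The sketch below computes the step count, assuming a single device and no gradient accumulation, and derives two extra keyword arguments that could be passed to `TrainingArguments`. The dictionary `extra_training_kwargs` is a suggestion, not part of the original run.

```python
import math


def steps_per_epoch(num_examples, per_device_batch_size, num_devices=1):
    """Optimizer steps in one epoch, assuming no gradient accumulation."""
    return math.ceil(num_examples / (per_device_batch_size * num_devices))


num_train_epochs = 1
total_steps = steps_per_epoch(num_examples=1000, per_device_batch_size=4) * num_train_epochs
print(total_steps)  # 250

# Hypothetical additions to the TrainingArguments call above.
extra_training_kwargs = {
    "logging_steps": max(1, total_steps // 10),  # log about ten times per run
    "report_to": "none",                         # skip the Weights & Biases prompt
}
print(extra_training_kwargs)
```

These could be applied as `TrainingArguments(output_dir='./results', num_train_epochs=1, per_device_train_batch_size=4, **extra_training_kwargs)`.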


In [None]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset
)

print("Trainer re-instantiated with updated training arguments and smaller dataset subsets.")

Trainer re-instantiated with updated training arguments and smaller dataset subsets.


## Execute Training Loop


Call `trainer.train()` on the re-instantiated `trainer` to run the shortened training loop, then print a confirmation message.



In [None]:
trainer.train()
print("Training finished.")

wandb: (1) Create a W&B account
wandb: (2) Use an existing W&B account
wandb: (3) Don't visualize my results
wandb: Enter your choice: 3
wandb: You chose "Don't visualize my results"
wandb: Using W&B in offline mode.
wandb: W&B API key is configured. Use `wandb login --relogin` to force relogin


Step,Training Loss


Training finished.
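
The `Trainer` was given an `eval_dataset`, so one way to test the model quantitatively is to call `trainer.evaluate()` and convert the reported `eval_loss` into perplexity. The sketch below shows only the conversion. `hypothetical_eval_loss` is a made-up value, not a result from this run, and the vocabulary size of 32,000 is an assumption about this tokenizer.

```python
import math


def perplexity_from_loss(eval_loss):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(eval_loss)


# In the notebook, the loss would come from:
#   metrics = trainer.evaluate()
#   eval_loss = metrics["eval_loss"]
hypothetical_eval_loss = 7.0  # made-up value for illustration only

print(f"perplexity = {perplexity_from_loss(hypothetical_eval_loss):.1f}")

# Reference point: a model that guesses uniformly over an assumed 32,000-token
# vocabulary has a loss of ln(32000) and a perplexity of 32000.
uniform_loss = math.log(32000)
print(f"uniform baseline loss = {uniform_loss:.2f}")
```

A loss well below the uniform baseline indicates that the model has learned at least the token frequency statistics of the corpus.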


## Generate Text with the Trained Model

Now that the model has been trained, we can use it to generate text. We'll start by defining a prompt and then use the model's `generate` method to produce a sequence of tokens.

In [None]:
import torch

prompt = "Hello, I am a language model trained on"
input_ids = tokenizer.encode(prompt, return_tensors='pt').to(model.device)

# Attention mask of ones (a single prompt has no padding), with the same dtype and device as input_ids
attention_mask = torch.ones_like(input_ids)

# Generate text
output = model.generate(
    input_ids,
    attention_mask=attention_mask,
    max_length=100,
    num_return_sequences=1,
    no_repeat_ngram_size=2,
    pad_token_id=tokenizer.eos_token_id # Set pad_token_id to eos_token_id for open-end generation
)

generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(generated_text)

Hello, I am a language model trained on the the , the to the  the = the . the of the @ the the The the in the and the1 the for the- thes the6 the@ the9 the was the were the be the7 the from the0 the it the2 the an the is the ) the that the album the8 the by the ' the5 the other the he the new the F the In the ; the also theal the " the
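
The alternating "the X the Y" pattern is consistent with how `no_repeat_ngram_size=2` interacts with what appears to be greedy decoding on a barely trained model. The model keeps preferring "the", but every bigram may occur only once, so after each "the" it has to pick a follower it has not used before. The sketch below is a simplified illustration of that rule on word-level tokens. `banned_next_tokens` is a hypothetical helper, not the library's implementation.

```python
def banned_next_tokens(tokens, ngram_size=2):
    """Tokens that would complete an n-gram already present in `tokens`."""
    if len(tokens) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    banned = set()
    for i in range(len(tokens) - ngram_size + 1):
        if tuple(tokens[i : i + ngram_size - 1]) == prefix:
            banned.add(tokens[i + ngram_size - 1])
    return banned


# Word-level stand-in for the start of the generated sample.
history = ["the", ",", "the", "to", "the"]
print(sorted(banned_next_tokens(history)))  # [',', 'to']
```

After the third "the", both "," and "to" are ruled out, so the next token has to be something new, which matches the ever-changing followers in the sample.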
