<a href="https://colab.research.google.com/github/YashNigam65/gitfolder/blob/master/genAI_concept_notebook/fine_tunning_and_transfer_learning/training_mistral_llm_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [35]:
!pip install datasets



In [36]:
from transformers import AutoTokenizer, LlamaConfig, LlamaForCausalLM, Trainer, TrainingArguments
from datasets import load_dataset, Dataset
from transformers import DataCollatorForLanguageModeling
import torch

In [39]:
# 1. Load the Tokenizer (AutoTokenizer is already imported above)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
# Only the tokenizer is taken from 'mistralai/Mistral-7B-v0.1'; no Mistral model weights are loaded.
# A tokenizer converts raw text into integer token IDs that a language model can process, and converts IDs back into text.
# The Mistral tokenizer is well-suited for general English text.
# The tokenizer is independent of the model architecture: 'LlamaForCausalLM' is used below to define a custom, smaller model,
# and that model only needs a vocab_size that matches the tokenizer's vocabulary.
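To make the text-to-IDs round trip concrete, here is a minimal sketch with a toy word-level vocabulary. The `ToyTokenizer` class and its word list are hypothetical and exist only for illustration; the real Mistral tokenizer uses a learned subword vocabulary rather than whitespace splitting.

```python
# Toy illustration only: a word-level tokenizer with a hypothetical vocabulary.
class ToyTokenizer:
    def __init__(self, words):
        # Reserve ID 0 for unknown words; assign the remaining IDs in order.
        self.id_to_token = ["<unk>"] + list(words)
        self.token_to_id = {tok: i for i, tok in enumerate(self.id_to_token)}

    def encode(self, text):
        # Map each whitespace-separated word to its ID (0 if unseen).
        return [self.token_to_id.get(word, 0) for word in text.split()]

    def decode(self, ids):
        # Map IDs back to words and join them with spaces.
        return " ".join(self.id_to_token[i] for i in ids)


toy = ToyTokenizer(["the", "robot", "learned", "to", "walk"])
ids = toy.encode("the robot learned to walk")
print(ids)              # [1, 2, 3, 4, 5]
print(toy.decode(ids))  # the robot learned to walk
```

The same two directions, text to IDs and IDs to text, are what `tokenizer(...)` and `tokenizer.decode(...)` provide for the real tokenizer.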

In [40]:

# 2. Define a Mini LLaMA Config (simulate small pretraining)
config = LlamaConfig(
    vocab_size=tokenizer.vocab_size,
    hidden_size=512,
    intermediate_size=2048,
    num_attention_heads=8,
    num_hidden_layers=4,
    max_position_embeddings=512,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id,
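    # Note: when the cells run in order, pad_token_id is still None here because this tokenizer
    # defines no pad token; the Trainer later aligns it (see the warning in the training output).
    pad_token_id=tokenizer.pad_token_id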
)

model = LlamaForCausalLM(config)
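To get a feel for how small this configuration is, the parameter count can be estimated by hand. The sketch below assumes a vocabulary of 32,000 tokens, untied input and output embeddings, no bias terms, and as many key/value heads as attention heads; none of these is confirmed by the notebook output, so treat the result as an estimate.

```python
# Back-of-the-envelope parameter count for the mini LLaMA config.
# Assumptions (not confirmed by the notebook): vocab_size == 32000,
# untied embeddings, no biases, no grouped-query attention.
vocab_size = 32000
hidden_size = 512
intermediate_size = 2048
num_hidden_layers = 4

embedding = vocab_size * hidden_size        # input token embeddings
lm_head = vocab_size * hidden_size          # output projection (assumed untied)
attention = 4 * hidden_size * hidden_size   # q, k, v, o projections
mlp = 3 * hidden_size * intermediate_size   # gate, up, down projections
norms = 2 * hidden_size                     # two RMSNorm weight vectors per layer
per_layer = attention + mlp + norms
final_norm = hidden_size

total = embedding + lm_head + num_hidden_layers * per_layer + final_norm
print(f"{total:,} parameters")              # 49,549,824 parameters
```

Under these assumptions roughly two thirds of the parameters sit in the embedding and output matrices. In the notebook, `sum(p.numel() for p in model.parameters())` gives the exact figure for comparison.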

In [41]:

# 3. Load or Create Tiny Dataset (for demonstration)
texts = [
    "The robot learned to walk using reinforcement learning.",
    "Yoga improves flexibility and mental health.",
    "AI is transforming the future of technology and medicine.",
    "Machine learning algorithms are used in various applications.",
    "Deep neural networks have revolutionized image recognition.",
    "Regular exercise contributes to a healthy lifestyle.",
    "The internet has changed how we communicate and access information.",
    "Quantum computing promises to solve complex problems faster.",
    "Healthy eating habits are crucial for overall well-being.",
    "Data science combines statistics, computer science, and domain knowledge."
]
dataset = Dataset.from_dict({"text": texts})

In [42]:
# 4. Tokenize Dataset
# This tokenizer defines no pad token, so reuse the EOS token for padding (a common choice for causal language models like LLaMA).
# Padded positions are excluded through the attention mask, not through any meaning of the EOS token.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(example):
    # Truncate or pad every example to exactly 128 tokens.
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=128)

tokenized_dataset = dataset.map(tokenize)

Map:   0%|          | 0/10 [00:00<?, ? examples/s]
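Because padding reuses the EOS token, it is worth seeing what happens to the training labels. The data collator configured in the next cell (`DataCollatorForLanguageModeling` with `mlm=False`) copies `input_ids` into `labels` and replaces every position equal to the pad ID with -100, which the loss ignores. The sketch below mimics that rule in plain Python; the token IDs are hypothetical, and padding is shown on the right for readability (the actual side depends on the tokenizer's `padding_side`).

```python
# Sketch of the label-masking rule for causal-LM batches (mlm=False):
# labels start as a copy of input_ids, and positions equal to the pad ID
# are replaced by -100 so the loss skips them.
IGNORE_INDEX = -100
PAD_ID = 2  # pad and EOS share this ID in this notebook, per the training log

def make_labels(input_ids, pad_id=PAD_ID):
    return [IGNORE_INDEX if tok == pad_id else tok for tok in input_ids]

# Hypothetical IDs: BOS (1), three content tokens, then padding.
padded = [1, 450, 19964, 10972, 2, 2, 2]
print(make_labels(padded))  # [1, 450, 19964, 10972, -100, -100, -100]
```

One consequence is that a genuine EOS at the end of a sentence would be masked too, since it has the same ID as the padding. The model therefore receives no training signal about what should follow the last content token, which may help explain the arbitrary tokens that appear after the memorized sentence in the generated text.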

In [43]:
# 5. Data Collator
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# 6. Training Arguments
training_args = TrainingArguments(
    output_dir="./llama-pretrain-demo",
    per_device_train_batch_size=2,
    num_train_epochs=10,  # Increased from 3 to 10 for better output
    logging_steps=5,
    save_steps=10,
    save_total_limit=1,
    report_to="none"
)
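These arguments determine how many optimizer steps the run takes and how often it logs and saves. A quick arithmetic check, assuming a single device and the default gradient accumulation of 1 (both assumptions, since the notebook does not state them):

```python
import math

num_examples = 10    # size of the toy dataset
batch_size = 2       # per_device_train_batch_size
num_epochs = 10      # num_train_epochs
logging_steps = 5
save_steps = 10
num_devices = 1      # assumption: one CPU or GPU
grad_accum = 1       # assumption: default gradient_accumulation_steps

steps_per_epoch = math.ceil(num_examples / (batch_size * num_devices * grad_accum))
total_steps = steps_per_epoch * num_epochs
num_log_rows = total_steps // logging_steps
num_checkpoints = total_steps // save_steps

print(steps_per_epoch, total_steps)  # 5 50
print(num_log_rows)                  # 10
print(num_checkpoints)               # 5
```

Five checkpoints are written over the run, but `save_total_limit=1` keeps only the most recent one on disk.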

In [44]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    processing_class=tokenizer
)

In [45]:
trainer.train()

# Save model
model.save_pretrained("./llama-pretrain-demo")
tokenizer.save_pretrained("./llama-pretrain-demo")

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Step,Training Loss
5,9.9291
10,8.4409
15,7.7159
20,7.1608
25,6.6963
30,6.2821
35,5.9284
40,5.6495
45,5.4334
50,5.3125
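A useful sanity check on these numbers: a randomly initialized model should start near the cross-entropy of a uniform guess over the vocabulary, which is the natural log of the vocabulary size. The sketch below assumes a vocabulary of 32,000 tokens (an assumption, not shown in the output) and converts the final logged loss into a perplexity.

```python
import math

vocab_size = 32000    # assumption: size of the tokenizer vocabulary
final_loss = 5.3125   # last logged training loss (step 50)

# Cross-entropy of guessing uniformly over the vocabulary.
uniform_loss = math.log(vocab_size)
# Perplexity is the exponential of the cross-entropy loss.
final_perplexity = math.exp(final_loss)

print(f"uniform-guess loss: {uniform_loss:.2f}")
print(f"final perplexity:   {final_perplexity:.0f}")
```

The uniform-guess loss comes out at about 10.37, so the first logged value of 9.9291 is in the expected range for a model that has barely started learning. A final perplexity of roughly 200 on ten training sentences indicates the model is still far from converged.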




('./llama-pretrain-demo/tokenizer_config.json',
 './llama-pretrain-demo/special_tokens_map.json',
 './llama-pretrain-demo/tokenizer.model',
 './llama-pretrain-demo/added_tokens.json',
 './llama-pretrain-demo/tokenizer.json')

In [46]:
from transformers import pipeline

# 7. Test the trained model

# Load the saved model and tokenizer (trained from scratch above, not fine-tuned from pretrained weights)
model_path = "./llama-pretrain-demo"
loaded_tokenizer = AutoTokenizer.from_pretrained(model_path)
loaded_model = LlamaForCausalLM.from_pretrained(model_path)

# Create a text generation pipeline
generator = pipeline(
    "text-generation",
    model=loaded_model,
    tokenizer=loaded_tokenizer,
    device=0 if torch.cuda.is_available() else -1  # Use GPU if available
)

# Generate text
prompt = "The robot learned to"
print(f"\nPrompt: {prompt}")

# Limit generation to 50 new tokens for this demonstration.
# do_sample=False selects greedy decoding, which is deterministic. Temperature only applies
# when sampling, so it is not passed here.
# Pass pad_token_id explicitly so generation does not have to infer it.

generated_text = generator(prompt, max_new_tokens=50, do_sample=False, pad_token_id=loaded_tokenizer.eos_token_id)
print(f"Generated Text: {generated_text[0]['generated_text']}")

prompt_2 = "AI is transforming the"
print(f"\nPrompt: {prompt_2}")
generated_text_2 = generator(prompt_2, max_new_tokens=50, do_sample=False, pad_token_id=loaded_tokenizer.eos_token_id)
print(f"Generated Text: {generated_text_2[0]['generated_text']}")

Device set to use cpu



Prompt: The robot learned to
Generated Text: The robot learned to walk using reinforcement learning.........................CommCommCommCommComm emÌè¨ hav XViscisciscisciscisc hav hav hav hav hav

Prompt: AI is transforming the
Generated Text: AI is transforming the future of technology and...........................üî¥üî¥üî¥üî¥üî¥üî¥üî¥–ß kv–ß kv–ß kv–ß emÌè¨ emÌè¨ em
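With `do_sample=False` the pipeline decodes greedily: at each step it takes the highest-scoring token. A temperature setting rescales the scores but cannot change which one is largest, so it has no effect in this mode. The sketch below illustrates that with hypothetical next-token scores; `softmax` and `greedy_pick` are illustrative helpers, not library functions.

```python
import math

def softmax(logits, temperature=1.0):
    # Divide by temperature, then normalize with the usual max-shift for stability.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits, temperature=1.0):
    # Greedy decoding: index of the most probable token.
    probs = softmax(logits, temperature)
    return max(range(len(probs)), key=probs.__getitem__)

logits = [1.2, 3.4, 0.5, 3.1]  # hypothetical next-token scores
for t in (0.1, 1.0, 2.0):
    print(t, greedy_pick(logits, t))  # index 1 for every temperature
```

Temperature matters only once sampling is enabled (`do_sample=True`), where it controls how sharply the probability mass concentrates on the top tokens.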
