# 🚀 DeepSeek Model Fine-tuning

This notebook provides a complete pipeline for fine-tuning DeepSeek language models on custom datasets. It includes:

- ✨ Automated environment setup
- 🔧 Model configuration with LoRA
- 📊 Data preparation utilities
- 🎯 Training pipeline
- 💾 Model saving and export

---

## 🛠️ Environment Setup

Run the cells below to set up your environment (the `!` commands can also be run in a terminal without the leading `!`). Requirements:
- Python 3.8+
- CUDA toolkit (11.8 recommended): https://developer.nvidia.com/cuda-11-8-0-download-archive

The following cells install all necessary dependencies. Make sure to set up the environment variables for CUDA.
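
As a quick sanity check before installing, the sketch below reports which CUDA-related environment variables are set. The variable names (`CUDA_HOME`, `PATH`, `LD_LIBRARY_PATH`) are the conventional Linux ones and are an assumption here; other platforms may use different names, so adjust the tuple as needed.

```python
import os

# Conventional Linux variable names (assumed; adjust for your platform)
CUDA_ENV_VARS = ("CUDA_HOME", "PATH", "LD_LIBRARY_PATH")

def cuda_env_report(env=None):
    """Return {variable: bool} telling whether each variable mentions a CUDA path."""
    env = os.environ if env is None else env
    report = {}
    for name in CUDA_ENV_VARS:
        value = env.get(name, "")
        # A variable counts as configured when it is set and references CUDA
        report[name] = "cuda" in value.lower()
    return report

for name, ok in cuda_env_report().items():
    print(f"{name}: {'OK' if ok else 'no CUDA path found'}")
```

This is only a heuristic string match; a variable can mention CUDA and still point to the wrong toolkit version.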

In [None]:
# Install PyTorch with CUDA support (replace xx.x with your CUDA version)
!conda install pytorch torchvision torchaudio pytorch-cuda=xx.x -c pytorch -c nvidia -y

In [None]:
# Install required packages
!pip install transformers datasets peft bitsandbytes pandas --quiet

## 📚 Import Libraries

Import all required libraries and set up basic configurations.

In [None]:
import os
import torch
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Allow TF32 matrix multiplications (faster matmuls on Ampere or newer GPUs)
torch.backends.cuda.matmul.allow_tf32 = True

## 🖥️ Hardware Check

Verify GPU availability and display system information.

In [None]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"🔍 Using device: {device}")

if torch.cuda.is_available():
    print(f"📊 GPU Model: {torch.cuda.get_device_name(0)}")
    print(f"💾 Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
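
To put the reported GPU memory in context, here is a rough back-of-the-envelope estimate of how much memory the model weights alone occupy at different precisions. It is a sketch under simplifying assumptions: it ignores activations, gradients, optimizer state, and quantization overhead, and the 1.5-billion parameter count is an illustrative round number.

```python
# Approximate bytes per parameter for common weight formats
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "4bit": 0.5}

def weight_memory_gb(num_params, precision):
    """Estimate the memory needed for model weights alone, in GB (1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9

# Hypothetical 1.5-billion-parameter model
for precision in BYTES_PER_PARAM:
    print(f"{precision}: {weight_memory_gb(1.5e9, precision):.2f} GB")
```

Actual usage during training will be noticeably higher than these figures, so leave headroom when comparing against the total memory printed above.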

## 🤖 Model and Tokenizer Setup

Initialize the DeepSeek model and tokenizer with optimized settings.

In [None]:
# Model Configuration
MODEL_NAME = "unsloth/DeepSeek-R1-Distill-Qwen-1.5B"  # or any other model, depending on your requirements
MAX_SEQ_LENGTH = 1024

def setup_tokenizer(model_name):
    """Initialize and configure the tokenizer"""
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True
    )
    tokenizer.pad_token = tokenizer.pad_token or tokenizer.eos_token
    return tokenizer

def setup_model(model_name):
    """Load the model in 4-bit quantized mode"""
    # Imported here so this cell works even if the import cell was not updated
    from transformers import BitsAndBytesConfig

    # The bare `load_in_4bit` and `load_in_8bit` arguments are deprecated,
    # so the quantization settings are passed through BitsAndBytesConfig.
    # Use load_in_8bit=True instead for 8-bit loading, depending on your hardware.
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16
    )
    return AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map="auto",
        quantization_config=quant_config
    )

print("📥 Loading model and tokenizer...")
tokenizer = setup_tokenizer(MODEL_NAME)
model = setup_model(MODEL_NAME)
# Enable the line below if your GPU has low VRAM; gradient checkpointing saves memory but increases training time
#model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)

print("✅ Model and tokenizer loaded successfully!")

## 🎯 LoRA Configuration

Set up Low-Rank Adaptation (LoRA) for efficient fine-tuning.

In [None]:
def setup_lora_config():
    """Configure LoRA parameters for optimal training"""
    return LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
        lora_dropout=0.1,
        bias="none",
        task_type="CAUSAL_LM"
    )

print("🔧 Applying LoRA configuration...")
model = get_peft_model(model, setup_lora_config())
print("\n📊 Trainable parameters summary:")
model.print_trainable_parameters()
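
To make the trainable-parameter summary easier to interpret: for a linear layer with `d_in` inputs and `d_out` outputs, LoRA adds two low-rank matrices with `r * (d_in + d_out)` parameters in total. The sketch below applies that formula to the four target modules. The layer shapes and layer count are illustrative assumptions, not values read from the model; take the real ones from your model's config.

```python
def lora_params(d_in, d_out, r):
    """Parameters LoRA adds to one linear layer: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)

# Hypothetical projection shapes (d_in, d_out) for one attention block;
# read the real values from your model's config
layer_shapes = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
}
num_layers = 28  # assumed number of transformer layers
r = 16           # matches r in setup_lora_config()

per_layer = sum(lora_params(d_in, d_out, r) for d_in, d_out in layer_shapes.values())
total = per_layer * num_layers
print(f"LoRA parameters per layer: {per_layer:,}")
print(f"LoRA parameters in total:  {total:,}")
```

Comparing this estimate with the output of `model.print_trainable_parameters()` is a quick way to confirm that the adapters were attached to the modules you intended.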

## 📝 Data Preparation

Prepare and process your dataset for training.

In [None]:
def format_example(example):
    """Format individual examples for training"""
    # Change text_query and text_answer to the actual column names of your data
    return {
        "text": f"{example['text_query'].strip()}\n{example['text_answer'].strip()}{tokenizer.eos_token}"
    }

def prepare_dataset(data_path):
    """Load and prepare the dataset"""
    dataset = load_dataset("csv", data_files={"train": data_path})["train"]
    return dataset.map(format_example)

def tokenize_function(examples):
    """Tokenize examples with proper padding"""
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_SEQ_LENGTH,
        padding="max_length"
    )

# Set your CSV dataset path here
DATA_FILE = 'path/to/your/dataset.csv'

print("📚 Preparing dataset...")
dataset = prepare_dataset(DATA_FILE)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset.set_format(type="torch", columns=["input_ids", "attention_mask"])
print(f"✅ Dataset prepared with {len(tokenized_dataset)} examples")
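
To see what a formatted training example looks like without loading a model, the following sketch applies the same query/answer layout to a tiny in-memory CSV. The sample row and the `</s>` end-of-sequence placeholder are made up for illustration; the notebook itself uses `tokenizer.eos_token`.

```python
import csv
import io

EOS = "</s>"  # placeholder; the notebook uses tokenizer.eos_token

def format_row(row, eos_token=EOS):
    """Join query and answer with a newline and terminate with the EOS token."""
    return f"{row['text_query'].strip()}\n{row['text_answer'].strip()}{eos_token}"

# Tiny in-memory CSV with a made-up row (note the stray spaces in the answer)
sample_csv = io.StringIO(
    "text_query,text_answer\n"
    "What is LoRA?, A low-rank adaptation method. \n"
)
rows = list(csv.DictReader(sample_csv))
formatted = [format_row(row) for row in rows]
print(repr(formatted[0]))
```

Printing with `repr` makes the newline separator and the stripped whitespace visible, which helps when checking that your own column names are wired up correctly.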

## ⚙️ Training Configuration

Configure training parameters and initialize the trainer.

In [None]:
def setup_training_args(output_dir="./model_output"):
    """Configure training arguments"""
    return TrainingArguments(
        output_dir=output_dir,
        overwrite_output_dir=True,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-5,
        logging_steps=10,
        save_steps=100,
        fp16=True,
        report_to="none",
        # Uncomment the lines below if you pass an eval dataset.
        # Note: recent transformers releases rename evaluation_strategy to eval_strategy.
        # load_best_model_at_end=True,
        # evaluation_strategy="steps",
        # eval_steps=100,
    )

print("⚙️ Setting up training configuration...")
training_args = setup_training_args()
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
# Optional: pick an optimizer to suit your requirements, e.g. 8-bit AdamW from bitsandbytes
#optim = bnb.optim.AdamW8bit(model.parameters(), lr=training_args.learning_rate)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    #optimizers=(optim, None)
)
print("✅ Training configuration complete!")
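
For planning, it helps to know how these arguments translate into optimizer steps and checkpoints. The sketch below computes the effective batch size and an approximate step count. The 1,000-example dataset size is hypothetical, and the step count is an approximation: the exact number depends on how the `Trainer` handles the last partial batch.

```python
import math

def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices

def approx_optimizer_steps(num_examples, per_device_batch, grad_accum_steps,
                           num_epochs, num_devices=1):
    """Approximate total optimizer updates over the whole training run."""
    batch = effective_batch_size(per_device_batch, grad_accum_steps, num_devices)
    return math.ceil(num_examples / batch) * num_epochs

# Values from setup_training_args(), with a hypothetical 1,000-example dataset
steps = approx_optimizer_steps(1000, per_device_batch=1, grad_accum_steps=8, num_epochs=3)
print(f"Effective batch size: {effective_batch_size(1, 8)}")
print(f"Approximate optimizer steps: {steps}")
print(f"Checkpoints written at save_steps=100: {steps // 100}")
```

If the checkpoint count comes out as zero for a small dataset, lower `save_steps` so that at least one intermediate checkpoint is written.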

## 🚀 Training and Model Saving

Execute training and save the fine-tuned model.

In [None]:
def save_model(model, tokenizer, output_dir='fine_tuned_model'):
    """Save the fine-tuned model and tokenizer"""
    os.makedirs(output_dir, exist_ok=True)
    model.save_pretrained(output_dir)
    tokenizer.save_pretrained(output_dir)
    print(f"💾 Model and tokenizer saved to {output_dir}")

print("🚀 Starting training...")
trainer.train()
print("✨ Training complete!")

save_model(model, tokenizer)
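
Because `model` is wrapped by `get_peft_model`, `save_pretrained` typically writes only the LoRA adapter rather than the full base model. The sketch below checks that the output directory contains the files an adapter usually consists of. The file names are an assumption based on common PEFT releases and may differ in your version.

```python
import os

# File names PEFT adapters are usually saved under (assumed; may vary by version)
CONFIG_FILE = "adapter_config.json"
WEIGHT_FILES = ("adapter_model.safetensors", "adapter_model.bin")

def check_adapter_dir(output_dir):
    """Return a list of problems found in a saved adapter directory."""
    if not os.path.isdir(output_dir):
        return [f"{output_dir} does not exist"]
    problems = []
    files = set(os.listdir(output_dir))
    if CONFIG_FILE not in files:
        problems.append(f"missing {CONFIG_FILE}")
    if not any(name in files for name in WEIGHT_FILES):
        problems.append("missing adapter weights")
    return problems

problems = check_adapter_dir("fine_tuned_model")
print("Looks complete" if not problems else f"Problems: {problems}")
```

To use the adapter later, you would load the base model again and attach the saved adapter to it.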

## 📝 Usage Notes and Troubleshooting

### Parameter Adjustments
- Adjust `MAX_SEQ_LENGTH` based on your GPU memory
- Modify `format_example()` to match your dataset structure
- Fine-tune training parameters in `setup_training_args()`
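
One way to choose `MAX_SEQ_LENGTH` is to look at the token lengths of your own examples and pick the smallest value that covers most of them without truncation. The sketch below does this for a list of lengths. The sample lengths are made up; in practice you would compute them by tokenizing your dataset.

```python
import math

def length_for_coverage(token_lengths, coverage=0.95):
    """Smallest length that fits at least `coverage` of the examples untruncated."""
    if not token_lengths:
        raise ValueError("token_lengths must not be empty")
    ordered = sorted(token_lengths)
    # Index of the example at which the requested coverage is reached
    index = max(math.ceil(coverage * len(ordered)) - 1, 0)
    return ordered[index]

# Made-up token counts for ten examples, including one long outlier
lengths = [120, 95, 310, 87, 1500, 240, 133, 410, 98, 176]
print(length_for_coverage(lengths, coverage=0.9))
print(length_for_coverage(lengths, coverage=1.0))
```

A single long outlier can force a much larger maximum length, so accepting truncation for a few examples often saves a substantial amount of memory.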

### Common Issues

#### CUDA Out of Memory 🚫
- Reduce batch size
- Decrease sequence length
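- Increase gradient accumulation steps to compensate for a smaller batch size (accumulation alone does not reduce memory use)
- Enable gradient checkpointing via `prepare_model_for_kbit_training`, at the cost of longer training time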

#### Slow Training ⏳
- Increase batch size (if memory allows)
- Adjust learning rate
- Consider using a smaller model variant

### Best Practices 🌟
- Monitor GPU memory usage
- Start with small datasets for testing
- Use gradient accumulation for larger effective batch sizes
- Save checkpoints regularly

---

💡 For additional help or issues, please refer to the DeepSeek documentation or create an issue on the GitHub repository.