# Linux Command Chatbot - Training Pipeline

**Project**: Fine-tune TinyLlama for Linux command explanations

**Dataset**: 101 Linux Commands by Bobby Iliev

**Model**: TinyLlama-1.1B-Chat-v1.0 with LoRA adapters

**Training Time**: ~30-45 minutes on Colab T4 GPU, 1-2 hours on MPS(Apple Silicon) or CPU

---

## Table of Contents
1. [Setup and Dependencies](#section-1)
2. [Data Acquisition](#section-2)
3. [HTML Parsing and Extraction](#section-3)
4. [Data Cleaning](#section-4)
5. [Dataset Creation and Augmentation](#section-5)
6. [Load Base Model](#section-6)
7. [Prepare Dataset for Training](#section-7)
8. [Configure LoRA](#section-8)
9. [Training Configuration](#section-9)
10. [Fine-Tuning Execution (OPTIONAL)](#section-10)
11. [Inference Testing](#section-11)

<a id='section-1'></a>
## 1. Setup and Dependencies

Install required packages and import libraries.

In [28]:
# Install required packages
%pip install -q transformers datasets peft trl bitsandbytes accelerate beautifulsoup4 requests

print("‚úì Packages installed successfully!")

Python(26722) MallocStackLogging: can't turn off malloc stack logging because it was not enabled.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Note: you may need to restart the kernel to use updated packages.
‚úì Packages installed successfully!


In [29]:
# Import libraries
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig, TrainingArguments
from datasets import Dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftModel
from trl import SFTTrainer
from bs4 import BeautifulSoup
import json
import re
import requests
from typing import List, Dict
import os
import gc

print("‚úì Libraries imported successfully!")

# ============================================================
# Device Detection - –∞–≤—Ç–æ–º–∞—Ç–∏—á–µ—Å–∫–æ–µ –æ–ø—Ä–µ–¥–µ–ª–µ–Ω–∏–µ —É—Å—Ç—Ä–æ–π—Å—Ç–≤–∞
# ============================================================
def get_device_config():
    """
    –û–ø—Ä–µ–¥–µ–ª—è–µ—Ç –¥–æ—Å—Ç—É–ø–Ω–æ–µ —É—Å—Ç—Ä–æ–π—Å—Ç–≤–æ –∏ –≤–æ–∑–º–æ–∂–Ω–æ—Å—Ç—å –∏—Å–ø–æ–ª—å–∑–æ–≤–∞–Ω–∏—è –∫–≤–∞–Ω—Ç–∏–∑–∞—Ü–∏–∏.
    
    Returns:
        tuple: (device_name, use_quantization)
        - CUDA: bitsandbytes handles 4-bit quantization in GPU memory
        - MPS (Apple Silicon): no bitsandbytes support, use float16
        - CPU: no quantization, use float16
    """
    if torch.cuda.is_available():
        return "cuda", True
    elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
        return "mps", False
    else:
        return "cpu", False

DEVICE, USE_QUANTIZATION = get_device_config()

print(f"\n{'='*60}")
print("Device Configuration:")
print(f"{'='*60}")
print(f"  Device: {DEVICE}")
print(f"  Quantization (4-bit): {'‚úì Enabled' if USE_QUANTIZATION else '‚úó Disabled'}")

if DEVICE == "cuda":
    print(f"  GPU Name: {torch.cuda.get_device_name(0)}")
    print(f"  GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.2f} GB")
elif DEVICE == "mps":
    print("  Apple Silicon GPU (MPS) detected")
    print("  ‚ö†Ô∏è bitsandbytes not supported - using float16")
else:
    print("  ‚ö†Ô∏è Running on CPU - training will be VERY slow!")
print(f"{'='*60}")


‚úì Libraries imported successfully!

Device Configuration:
  Device: mps
  Quantization (4-bit): ‚úó Disabled
  Apple Silicon GPU (MPS) detected
  ‚ö†Ô∏è bitsandbytes not supported - using float16


<a id='section-2'></a>
## 2. Data Acquisition

Download the HTML file programmatically from GitHub releases.

In [30]:
# Download HTML file from GitHub releases
url = "https://github.com/bobbyiliev/101-linux-commands/releases/latest/download/101-linux-commands.html"

print("Downloading 101 Linux Commands HTML...")
response = requests.get(url)
response.raise_for_status()

html_content = response.text

# Save locally for reproducibility
with open("101-linux-commands.html", "w", encoding="utf-8") as f:
    f.write(html_content)

print(f"‚úì Downloaded {len(html_content)} characters")
print("‚úì File saved: 101-linux-commands.html")

Downloading 101 Linux Commands HTML...
‚úì Downloaded 1020874 characters
‚úì File saved: 101-linux-commands.html


<a id='section-3'></a>
## 3. HTML Parsing and Data Extraction

Parse HTML and extract command-description pairs.

In [31]:
# Load HTML
with open("101-linux-commands.html", "r", encoding="utf-8") as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, 'html.parser')

# Inspect HTML structure first
print("=== HTML Structure Inspection ===")
print("Total headings (h1-h3):", len(soup.find_all(['h1', 'h2', 'h3'])))
print("Total paragraphs:", len(soup.find_all('p')))
print("\nFirst 5 headings:")
for i, heading in enumerate(soup.find_all(['h2', 'h3'])[:5]):
    print(f"{i+1}. {heading.get_text().strip()[:250]}")

=== HTML Structure Inspection ===
Total headings (h1-h3): 1656
Total paragraphs: 1047

First 5 headings:
1. 101 Linux Commands
2. Hacktoberfest
3. About me
4. DigitalOcean
5. DevDojo


In [32]:
import re
from bs4 import BeautifulSoup, NavigableString

# Load HTML
with open("101-linux-commands.html", "r", encoding="utf-8") as f:
    html_content = f.read()

soup = BeautifulSoup(html_content, "html.parser")

commands_data = []

# Find all h1 tags which denote command sections
all_h1 = soup.find_all("h1")

def get_section_content(h1_tag):
    """–ò–∑–≤–ª–µ–∫–∞–µ—Ç –≤–µ—Å—å –∫–æ–Ω—Ç–µ–Ω—Ç —Å–µ–∫—Ü–∏–∏ –∫–æ–º–∞–Ω–¥—ã –¥–æ —Å–ª–µ–¥—É—é—â–µ–≥–æ h1 –∏–ª–∏ page-break"""
    content_parts = []
    examples = []
    syntax = ""

    # –ò–¥—ë–º –ø–æ —ç–ª–µ–º–µ–Ω—Ç–∞–º –ø–æ—Å–ª–µ h1
    current = h1_tag.next_sibling
    while current:
        # –û—Å—Ç–∞–Ω–∞–≤–ª–∏–≤–∞–µ–º—Å—è –Ω–∞ —Å–ª–µ–¥—É—é—â–µ–º h1 –∏–ª–∏ page-break div
        if hasattr(current, 'name'):
            if current.name == 'h1':
                break
            if current.name == 'div' and 'page-break' in current.get('style', ''):
                break

            # –°–æ–±–∏—Ä–∞–µ–º –ø–∞—Ä–∞–≥—Ä–∞—Ñ—ã –æ–ø–∏—Å–∞–Ω–∏—è (–¥–æ Examples –∏–ª–∏ Syntax)
            if current.name == 'p':
                text = current.get_text(" ", strip=True)
                if text and len(text) > 10:
                    content_parts.append(text)

            # –°–æ–±–∏—Ä–∞–µ–º –ø—Ä–∏–º–µ—Ä—ã –∏–∑ pre > code
            if current.name == 'pre':
                code = current.find('code')
                if code:
                    example_code = code.get_text().strip()
                    if example_code and len(example_code) < 100:  # –ö–æ—Ä–æ—Ç–∫–∏–µ –ø—Ä–∏–º–µ—Ä—ã –∫–æ–º–∞–Ω–¥
                        examples.append(example_code)

            # –ò—â–µ–º —Å–∏–Ω—Ç–∞–∫—Å–∏—Å –ø–æ—Å–ª–µ h3 "Syntax"
            if current.name == 'h3':
                h3_text = current.get_text().strip().lower()
                if 'syntax' in h3_text:
                    next_pre = current.find_next('pre')
                    if next_pre:
                        code = next_pre.find('code')
                        if code:
                            syntax = code.get_text().strip()

        current = current.next_sibling

    return content_parts, examples[:5], syntax  # –ú–∞–∫—Å–∏–º—É–º 5 –ø—Ä–∏–º–µ—Ä–æ–≤

for h1 in all_h1:
    h1_text = h1.get_text().strip()
    if not re.match(r"^The\s+.+\s+[Cc]ommand$", h1_text):
        continue

    code_tag = h1.find("code")
    if not code_tag:
        continue

    command_name = code_tag.get_text().strip()

    # –ò–∑–≤–ª–µ–∫–∞–µ–º —Ä–∞—Å—à–∏—Ä–µ–Ω–Ω—ã–π –∫–æ–Ω—Ç–µ–Ω—Ç
    content_parts, examples, syntax = get_section_content(h1)

    # –°–æ–±–∏—Ä–∞–µ–º –ø–æ–ª–Ω–æ–µ –æ–ø–∏—Å–∞–Ω–∏–µ
    description_parts = []

    # –û—Å–Ω–æ–≤–Ω–æ–µ –æ–ø–∏—Å–∞–Ω–∏–µ (–ø–µ—Ä–≤—ã–µ 2-3 –ø–∞—Ä–∞–≥—Ä–∞—Ñ–∞)
    if content_parts:
        description_parts.extend(content_parts[:3])

    # –î–æ–±–∞–≤–ª—è–µ–º —Å–∏–Ω—Ç–∞–∫—Å–∏—Å
    if syntax:
        description_parts.append(f"Syntax: {syntax}")

    # –î–æ–±–∞–≤–ª—è–µ–º –ø—Ä–∏–º–µ—Ä—ã
    if examples:
        examples_text = "Examples: " + ", ".join(examples[:3])
        description_parts.append(examples_text)

    full_description = " ".join(description_parts)

    if full_description:
        commands_data.append({
            "command": command_name,
            "description": full_description,
            "examples": examples,
            "syntax": syntax,
        })

print(f"Total commands extracted: {len(commands_data)}")
for i, cmd in enumerate(commands_data[:5]):
    print(f"\n{i+1}. Command: {cmd['command']}")
    print(f"   Description: {cmd['description'][:200]}...")
    if cmd['examples']:
        print(f"   Examples: {cmd['examples'][:3]}")


Total commands extracted: 160

1. Command: ls
   Description: The ls command lets you see the files and directories inside a specific directory (current working directory by default) .
It normally lists the files and directories in ascending alphabetical order. ...
   Examples: ['ls', 'ls {Directory_Path}', 'ls -lah']

2. Command: cd
   Description: The cd command is used to change the current working directory (i.e., the directory in which the current user is working) . The "cd" stands for " c hange d irectory" and it is one of the most frequent...
   Examples: ['cd [OPTIONS] [directory]', 'cd /path/to/directory', 'cd ~']

3. Command: cat
   Description: The cat command allows us to create single or multiple files, to view the content of a file or to concatenate files and redirect the output to the terminal or files. The "cat" stands for 'concatenate....
   Examples: ['cat <specified_file_name>', 'cat file1 file2 ...', 'cat > file_name']

4. Command: tac
   Description: tac is a Linux

<a id='section-4'></a>
## 4. Data Cleaning and Transformation

Clean extracted text and prepare for training format.

In [33]:
def clean_text(text: str) -> str:
    """Clean extracted text for training"""
    # Remove multiple spaces and newlines
    text = re.sub(r'\s+', ' ', text)
    
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    
    # Remove special characters but keep basic punctuation
    text = re.sub(r'[^\w\s.,!?\-:;()\[\]{}]', '', text)
    
    # Limit to reasonable length (~1000 chars max for richer descriptions)
    if len(text) > 1000:
        sentences = text.split('.')
        # Keep first 6 sentences
        text = '. '.join(sentences[:6])
        if text and not text.endswith('.'):
            text += '.'
    
    return text.strip()

def clean_command_name(command: str) -> str:
    """Extract clean command name"""
    # Remove markdown symbols, numbers, etc.
    command = re.sub(r'^\d+\.\s*', '', command)  # Remove "1. "
    command = re.sub(r'[#*`]', '', command)      # Remove markdown
    command = command.strip()
    
    # Extract first word if it's a compound phrase
    words = command.split()
    if words:
        return words[0].lower()
    return command.lower()

# Clean the data
cleaned_data = []
for item in commands_data:
    cmd = clean_command_name(item['command'])
    desc = clean_text(item['description'])
    
    # Skip if too short or too long
    if len(desc) < 50 or len(cmd) < 2:
        continue
    
    cleaned_data.append({
        'command': cmd,
        'description': desc
    })

print(f"‚úì Cleaned commands: {len(cleaned_data)}")
print("\nExample after cleaning:")
print(f"Command: {cleaned_data[0]['command']}")
print(f"Description: {cleaned_data[0]['description'][:200]}...")

‚úì Cleaned commands: 159

Example after cleaning:
Command: ls
Description: The ls command lets you see the files and directories inside a specific directory (current working directory by default) . It normally lists the files and directories in ascending alphabetical order. ...


<a id='section-5'></a>
## 5. Dataset Creation and Augmentation

Convert to instruction-following format with data augmentation.

In [34]:
# Question templates for augmentation
question_templates = [
    "Explain the Linux command {cmd}",
    "What does the {cmd} command do?",
    "How do I use {cmd} in Linux?",
    "What is the {cmd} command used for?",
    "Describe the {cmd} command",
]

# Create augmented dataset
augmented_dataset = []

for item in cleaned_data:
    cmd = item['command']
    desc = item['description']
    
    # Generate multiple training examples per command
    for template in question_templates:
        question = template.format(cmd=cmd)
        
        augmented_dataset.append({
            'instruction': question,
            'input': '',
            'output': desc
        })

print(f"‚úì Total training examples: {len(augmented_dataset)}")
print(f"‚úì Augmentation ratio: {len(augmented_dataset) / len(cleaned_data):.1f}x")

# Show examples
print("\nExample variations for one command:")
cmd_examples = [ex for ex in augmented_dataset if 'ls' in ex['instruction'].lower()][:3]
for i, ex in enumerate(cmd_examples, 1):
    print(f"\n{i}. Instruction: {ex['instruction']}")
    print(f"   Output: {ex['output'][:80]}...")

‚úì Total training examples: 795
‚úì Augmentation ratio: 5.0x

Example variations for one command:

1. Instruction: Explain the Linux command ls
   Output: The ls command lets you see the files and directories inside a specific director...

2. Instruction: What does the ls command do?
   Output: The ls command lets you see the files and directories inside a specific director...

3. Instruction: How do I use ls in Linux?
   Output: The ls command lets you see the files and directories inside a specific director...


In [35]:
# Save to JSONL
os.makedirs('data', exist_ok=True)

with open('data/linux_commands.jsonl', 'w', encoding='utf-8') as f:
    for item in augmented_dataset:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print("‚úì Dataset saved to: data/linux_commands.jsonl")

# Verify file size
file_size = os.path.getsize('data/linux_commands.jsonl') / 1024
print(f"‚úì File size: {file_size:.2f} KB")

‚úì Dataset saved to: data/linux_commands.jsonl
‚úì File size: 409.95 KB


<a id='section-6'></a>
## 6. Load Base Model and Tokenizer

Load TinyLlama with conditional configuration:
- **CUDA (NVIDIA GPU)**: 4-bit quantization with bitsandbytes for memory efficiency
- **MPS (Apple Silicon)**: float16 without quantization
- **CPU**: float16 (will be slow)

---
### üìã Kaggle/Colab GPU Instructions

**Kaggle:**
1. Settings ‚Üí Accelerator ‚Üí **GPU T4 x2** or **GPU P100**
2. Restart notebook

**Google Colab:**
1. Runtime ‚Üí Change runtime type ‚Üí **T4 GPU**
2. Restart runtime

---


In [36]:
model_name = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

print(f"Loading model: {model_name}")
print(f"Device: {DEVICE}, Quantization: {USE_QUANTIZATION}")
print("This may take a few minutes...\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"  # Recommended for training

print(f"‚úì Tokenizer loaded")
print(f"  Vocabulary size: {len(tokenizer)}")

# Load model based on device capabilities
if USE_QUANTIZATION:
    # CUDA: Use 4-bit quantization with bitsandbytes
    print("\n  Loading with 4-bit quantization (bitsandbytes)...")
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16,
        bnb_4bit_use_double_quant=True,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        device_map="auto",
        trust_remote_code=True
    )
else:
    # MPS/CPU: Load without quantization in float16
    print(f"\n  Loading without quantization (float16) for {DEVICE}...")
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,
        trust_remote_code=True,
        low_cpu_mem_usage=True
    )
    # Move model to device
    if DEVICE == "mps":
        model = model.to("mps")
    # For CPU, keep on CPU (default)

print("\n‚úì Model loaded successfully!")
print(f"  Model device: {next(model.parameters()).device}")
print(f"  Model dtype: {model.dtype}")

Loading model: TinyLlama/TinyLlama-1.1B-Chat-v1.0
Device: mps, Quantization: False
This may take a few minutes...

‚úì Tokenizer loaded
  Vocabulary size: 32000

  Loading without quantization (float16) for mps...

‚úì Model loaded successfully!
  Model device: mps:0
  Model dtype: torch.float16


<a id='section-7'></a>
## 7. Prepare Dataset for Training

Format dataset with proper chat template.

In [37]:
# Load dataset
dataset = Dataset.from_json('data/linux_commands.jsonl')

print(f"Dataset size: {len(dataset)}")
print(f"Features: {dataset.features}")

# Format with TinyLlama chat template
def format_instruction(example):
    """Format example with language enforcement"""
    prompt = f"""<|user|>
Answer in English. Be concise and technical.
User question: {example['instruction']}
<|assistant|>
{example['output']}"""
    return {'text': prompt}

# Apply formatting
dataset = dataset.map(format_instruction, remove_columns=['instruction', 'input', 'output'])

# Show example
print("\n" + "="*60)
print("Formatted example:")
print("="*60)
print(dataset[0]['text'][:400])
print("...")

# Split into train/eval
dataset = dataset.train_test_split(test_size=0.1, seed=42)

print(f"\n‚úì Train size: {len(dataset['train'])}")
print(f"‚úì Eval size: {len(dataset['test'])}")

Generating train split: 795 examples [00:00, 40594.49 examples/s]


Dataset size: 795
Features: {'instruction': Value('string'), 'input': Value('string'), 'output': Value('string')}


Map: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 795/795 [00:00<00:00, 44715.99 examples/s]


Formatted example:
<|user|>
Answer in English. Be concise and technical.
User question: Explain the Linux command ls
<|assistant|>
The ls command lets you see the files and directories inside a specific directory (current working directory by default) . It normally lists the files and directories in ascending alphabetical order. In this interactive tutorial, you will learn the different ways to use the ls command: T
...

‚úì Train size: 715
‚úì Eval size: 80





<a id='section-8'></a>
## 8. Configure LoRA

Set up PEFT with LoRA adapters.

In [38]:
# Prepare model for training
if USE_QUANTIZATION:
    # Only needed for quantized models
    print("Preparing model for k-bit training...")
    model = prepare_model_for_kbit_training(model)
else:
    # For MPS/CPU: enable gradient checkpointing to save memory
    print(f"Preparing model for {DEVICE} training...")
    model.gradient_checkpointing_enable()
    model.enable_input_require_grads()

# LoRA configuration
lora_config = LoraConfig(
    r=16,                    # LoRA rank
    lora_alpha=32,           # Scaling factor
    target_modules=[         # Target attention modules
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

# Apply LoRA to model
model = get_peft_model(model, lora_config)

# Print trainable parameters
print("\n" + "="*60)
print("LoRA Configuration:")
print("="*60)
model.print_trainable_parameters()

Preparing model for mps training...

LoRA Configuration:
trainable params: 4,505,600 || all params: 1,104,553,984 || trainable%: 0.4079


<a id='section-9'></a>
## 9. Training Configuration

Configure trainer and training arguments.

In [39]:
# Training arguments - use SFTConfig instead of TrainingArguments
from trl import SFTConfig

# Device-specific optimizer and settings
if USE_QUANTIZATION:  # CUDA
    optimizer_name = "paged_adamw_8bit"  # Memory-efficient, requires bitsandbytes
    use_fp16 = True
    use_bf16 = False
elif DEVICE == "mps":
    optimizer_name = "adamw_torch"  # Standard PyTorch optimizer
    use_fp16 = False  # MPS doesn't support fp16 training well
    use_bf16 = False  # MPS M1/M2 doesn't support bf16
else:  # CPU
    optimizer_name = "adamw_torch"
    use_fp16 = False
    use_bf16 = False

print(f"Optimizer: {optimizer_name}")
print(f"FP16: {use_fp16}, BF16: {use_bf16}")

training_args = SFTConfig(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=2 if DEVICE != "cuda" else 4,  # Smaller batch for MPS/CPU
    gradient_accumulation_steps=8 if DEVICE != "cuda" else 4,  # Compensate smaller batch
    learning_rate=2e-4,
    fp16=use_fp16,
    bf16=use_bf16,
    logging_steps=10,
    save_strategy="epoch",
    eval_strategy="epoch",
    warmup_steps=50,
    lr_scheduler_type="cosine",
    optim=optimizer_name,
    report_to="none",               # Disable wandb
    dataloader_pin_memory=False if DEVICE == "mps" else True,  # MPS doesn't support pin_memory
    # SFT-specific parameters
    max_length=512,
    packing=False,
    dataset_text_field="text",
)

# Create trainer
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset['train'],
    eval_dataset=dataset['test'],
    processing_class=tokenizer,
    args=training_args,
)

print("\n‚úì Trainer configured successfully!")
print(f"\nTraining configuration:")
print(f"  Device: {DEVICE}")
print(f"  Epochs: {training_args.num_train_epochs}")
print(f"  Batch size: {training_args.per_device_train_batch_size}")
print(f"  Gradient accumulation: {training_args.gradient_accumulation_steps}")
print(f"  Effective batch size: {training_args.per_device_train_batch_size * training_args.gradient_accumulation_steps}")
print(f"  Learning rate: {training_args.learning_rate}")
print(f"  Optimizer: {optimizer_name}")


Optimizer: adamw_torch
FP16: False, BF16: False


Adding EOS to train dataset: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 715/715 [00:00<00:00, 29912.70 examples/s]
Tokenizing train dataset: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 715/715 [00:00<00:00, 6027.66 examples/s]
Truncating train dataset: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 715/715 [00:00<00:00, 26070.15 examples/s]
Adding EOS to eval dataset: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 80/80 [00:00<00:00, 22765.75 examples/s]
Tokenizing eval dataset: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 80/80 [00:00<00:00, 5080.62 examples/s]
Truncating eval dataset: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 80/80 [00:00<00:00, 56054.85 examples/s]



‚úì Trainer configured successfully!

Training configuration:
  Device: mps
  Epochs: 3
  Batch size: 2
  Gradient accumulation: 8
  Effective batch size: 16
  Learning rate: 0.0002
  Optimizer: adamw_torch


<a id='section-10'></a>
## 10. Fine-Tuning Execution (OPTIONAL)

‚ö†Ô∏è **OPTIONAL CELL**: Skip this if using pre-trained adapters

Training takes approximately **30-45 minutes** on Colab T4 GPU.

In [13]:
# ‚ö†Ô∏è OPTIONAL CELL: Skip this if using pre-trained adapters
# Training takes approximately 30-45 minutes on Colab T4 GPU

print("="*60)
print("Starting training...")
print("This will take approximately 30-45 minutes.")
print("="*60)
print()

# Train
trainer.train()

print("\n" + "="*60)
print("‚úì Training completed!")
print("="*60)

The tokenizer has new PAD/BOS/EOS tokens that differ from the model config and generation config. The model config and generation config were aligned accordingly, being updated with the tokenizer's values. Updated tokens: {'pad_token_id': 2}.


Starting training...
This will take approximately 30-45 minutes.



Epoch,Training Loss,Validation Loss,Entropy,Num Tokens,Mean Token Accuracy
1,1.7634,1.667363,1.677441,104868.0,0.631944
2,1.3742,1.367578,1.400391,209736.0,0.680658
3,1.1231,1.275015,1.312695,314604.0,0.700126



‚úì Training completed!


In [40]:
# Save LoRA adapters
os.makedirs('model/lora_adapters', exist_ok=True)

model.save_pretrained("model/lora_adapters")
tokenizer.save_pretrained("model/lora_adapters")

print("‚úì LoRA adapters saved to: model/lora_adapters/")

# Check adapter size
adapter_path = "model/lora_adapters/adapter_model.safetensors"
if os.path.exists(adapter_path):
    size_mb = os.path.getsize(adapter_path) / (1024 * 1024)
    print(f"‚úì Adapter size: {size_mb:.2f} MB")
else:
    print("‚ö†Ô∏è Adapter file not found. Check for adapter_model.bin instead.")
    adapter_path_alt = "model/lora_adapters/adapter_model.bin"
    if os.path.exists(adapter_path_alt):
        size_mb = os.path.getsize(adapter_path_alt) / (1024 * 1024)
        print(f"‚úì Adapter size: {size_mb:.2f} MB")

‚úì LoRA adapters saved to: model/lora_adapters/
‚úì Adapter size: 17.21 MB


<a id='section-11'></a>
## 11. Inference Testing

Test the fine-tuned model with various questions.

In [41]:
# Clear memory - —É–Ω–∏–≤–µ—Ä—Å–∞–ª—å–Ω–∞—è –æ—á–∏—Å—Ç–∫–∞ –¥–ª—è –≤—Å–µ—Ö —É—Å—Ç—Ä–æ–π—Å—Ç–≤
if 'trainer' in globals():
    del trainer
gc.collect()

# Device-specific cache clearing
if torch.cuda.is_available():
    torch.cuda.empty_cache()
    print("‚úì CUDA memory cleared")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    torch.mps.empty_cache()
    print("‚úì MPS memory cleared")
else:
    print("‚úì CPU memory cleared (gc.collect)")


‚úì MPS memory cleared


In [42]:
# Load base model for inference
print(f"Loading base model for inference on {DEVICE}...")

if DEVICE == "cuda":
    base_model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        device_map="auto",
        torch_dtype=torch.float16
    )
else:
    base_model = AutoModelForCausalLM.from_pretrained(
        "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True
    )
    if DEVICE == "mps":
        base_model = base_model.to("mps")

# Load LoRA adapters
print("Loading LoRA adapters...")
model = PeftModel.from_pretrained(base_model, "model/lora_adapters")
tokenizer = AutoTokenizer.from_pretrained("model/lora_adapters")

print(f"\n‚úì Model ready for inference on {DEVICE}!")

Loading base model for inference on mps...
Loading LoRA adapters...

‚úì Model ready for inference on mps!


In [43]:
# Inference function
def ask_bot(question: str, max_tokens: int = 150) -> str:
    """Ask the bot a question"""
    prompt = f"""<|user|>
Answer in English. Be concise and technical.
User question: {question}
<|assistant|>"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_tokens,
            temperature=0.7,
            do_sample=True,
            top_p=0.9,
            repetition_penalty=1.1
        )
    
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    
    # Extract only assistant's response
    if "<|assistant|>" in response:
        response = response.split("<|assistant|>")[-1].strip()
    
    return response

print("‚úì Inference function defined")

‚úì Inference function defined


In [46]:
# Test examples
test_questions = [
    "Explain the ls command",
    "What does grep do?",
    "How to use chmod?",
    "–ß—Ç–æ –¥–µ–ª–∞–µ—Ç –∫–æ–º–∞–Ω–¥–∞ cd?",  # Russian: "What does cd command do?"
    "–û–ø–∏—à–∏ –∫–æ–º–∞–Ω–¥—É mkdir",      # Russian: "Describe mkdir command"
]

print("\n" + "="*60)
print("INFERENCE TESTS")
print("="*60)

for i, question in enumerate(test_questions, 1):
    print(f"\n[Test {i}]")
    print(f"Question: {question}")
    answer = ask_bot(question)
    print(f"Answer: {answer}")
    print("-" * 60)


INFERENCE TESTS

[Test 1]
Question: Explain the ls command
Answer: The ls command is a common command in Unix/Linux that displays the list of files and directories in a directory hierarchy. It works by listing the contents of a specific directory or file system path using its syntax:

```
ls [OPTION]... [PATH]
```

Here are some examples of how to use this command:

1. List all the files in a specific directory:
   ```
   ls -a
   ```

   This will show both regular files (i.e., those with a dot extension (.txt, .png, etc.) and symbolic links (i.e., which point to other locations).

2. List only the files with a certain extension:
------------------------------------------------------------

[Test 2]
Question: What does grep do?
Answer: Grep is a command-line utility used for searching text files for specific patterns or strings. It operates on the line by line basis and searches for any occurrence of specified pattern(s) in each line. Here's how it works:

1. The first argument to th