# Qwen Model Fine-tuning and GGUF Conversion Tutorial

This notebook demonstrates how to:
1. Fine-tune a Qwen 0.6B model for demographic targeting
2. Convert the fine-tuned model to GGUF format
3. Create an Ollama model file
4. Deploy and run the model with Ollama

## Prerequisites

Make sure you have the following installed:
- Python 3.9+ (the cells below were run on 3.13; recent Transformers releases no longer support 3.8)
- PyTorch
- Transformers
- Datasets
- Ollama
- llama.cpp (for GGUF conversion)

```bash
pip install torch transformers datasets accelerate requests
```

## About Me
- **Eric Livesay**, Senior Data Engineer at [Simpli.fi](https://simpli.fi/)
  LinkedIn: https://www.linkedin.com/in/ericlivesay/
- Background in data warehousing, ETL, AI, and business intelligence solutions
- Help lead a biweekly "practice" discussion at work on AI and AI best practices
- Currently working on a few AI-related products at work:
  - "ChatZTV" - Interactive postal code targeting 'ASSIST' using Vertex AI
  - "Report Search" - Using AI to assist in finding the right report/data set
  - "Order to Cash" - Using AI to create digital advertising campaigns from insertion orders
- Working with: Python, SQL, RAG, LangChain, Apache Spark, Airflow, Vertica

## Today's Presentation
We'll walk through the complete process of:
1. Fine-tuning a language model (Qwen) for demographic targeting
2. Converting it to GGUF format for efficient deployment
3. Deploying and testing with Ollama

## Step 1: Test the Base (Un-finetuned) Qwen 0.6B Model

Before we fine-tune the model, let's test the base Qwen 0.6B model to see how it performs on our demographic targeting task without any fine-tuning. This will give us a baseline to compare against.

In [48]:
import torch
import platform
import random
from transformers import AutoTokenizer, AutoModelForCausalLM
import json

# Check system information
print(f"Python version: {platform.python_version()}")
print(f"PyTorch version: {torch.__version__}")
print(f"Platform: {platform.platform()}")
print(f"Machine: {platform.machine()}")

# Check device availability
print(f"CUDA available: {torch.cuda.is_available()}")
print(f"MPS available: {torch.backends.mps.is_available()}")

if torch.backends.mps.is_available():
    print("Running on Apple Silicon with MPS acceleration")
    device = "mps"
    use_fp16 = False
    use_device_map = False
elif torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name()}")
    device = "cuda"
    use_fp16 = True
    use_device_map = True
else:
    print("Running on CPU")
    device = "cpu"
    use_fp16 = False
    use_device_map = False

# Available small Qwen model
model_name = "Qwen/Qwen3-0.6B"
print(f"Loading base model: {model_name}")

try:
    # Load model with appropriate settings for the device
    if use_device_map:
        base_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            dtype=torch.float16 if use_fp16 else torch.float32,
            device_map="auto"
        )
    else:
        base_model = AutoModelForCausalLM.from_pretrained(
            model_name,
            dtype=torch.float32  # Always use float32 for MPS/CPU
        )
        if device != "cpu":
            base_model = base_model.to(device)

    base_tokenizer = AutoTokenizer.from_pretrained(model_name)
    print("Base model loaded successfully!")

    # Set pad token if not available
    if base_tokenizer.pad_token is None:
        base_tokenizer.pad_token = base_tokenizer.eos_token
        print("Added pad token")

except Exception as e:
    print(f"Error loading base model: {e}")
    print("Please make sure you have internet connection and access to Hugging Face models")

def test_base_model(prompt):
    """Test the base (un-finetuned) model with a demographic targeting prompt"""
    # Format the prompt for the base model
    formatted_prompt = f"""Parse this demographic targeting request and return the response in JSON format with categories: ages, genders, income_brackets, interests, and states.

Request: {prompt}

Response:"""

    print(f"Testing base model with prompt:")
    print(f"'{prompt}'")
    print("\n" + "="*50)

    inputs = base_tokenizer(formatted_prompt, return_tensors="pt").to(device)

    with torch.no_grad():
        outputs = base_model.generate(
            **inputs,  # pass input_ids together with the attention mask
            max_new_tokens=200,
            temperature=0.6,
            top_p=0.95,
            top_k=20,
            do_sample=True,
            pad_token_id=base_tokenizer.eos_token_id,
            eos_token_id=base_tokenizer.eos_token_id
        )

    # Decode only the newly generated tokens. Slicing the decoded string by
    # len(formatted_prompt) can misalign if decoding does not reproduce the prompt exactly.
    prompt_length = inputs["input_ids"].shape[1]
    generated_text = base_tokenizer.decode(
        outputs[0][prompt_length:], skip_special_tokens=True
    ).strip()

    print("Base model response:")
    print(generated_text)
    print("\n" + "="*50)

    return generated_text

# Test the base model with various demographic targeting requests
test_prompts = [
    "Target wealthy people, senior men in the South who like tennis and traveling",
    "Focus on young women in California who are interested in fitness and technology",
    "Target middle-aged professionals in New York and Texas with high income",
    "Reach out to college students in the Midwest who enjoy gaming and music"
]

print("Testing Base Qwen Model Performance:")
print("=" * 60)

base_results = []
for i, prompt in enumerate(test_prompts, 1):
    print(f"\nTest {i}:")
    result = test_base_model(prompt)
    base_results.append({"prompt": prompt, "response": result})
    print()

print("\n" + "="*60)
print("BASE MODEL TESTING COMPLETE")
print("="*60)
print("\nAs you can see, the base model may not format responses consistently")
print("or follow the exact JSON structure we need for demographic targeting.")
print("This is why fine-tuning will be beneficial!")


Python version: 3.13.1
PyTorch version: 2.8.0
Platform: macOS-15.5-arm64-arm-64bit-Mach-O
Machine: arm64
CUDA available: False
MPS available: True
Running on Apple Silicon with MPS acceleration
Loading base model: Qwen/Qwen3-0.6B
Base model loaded successfully!
Testing Base Qwen Model Performance:

Test 1:
Testing base model with prompt:
'Target wealthy people, senior men in the South who like tennis and traveling'

Base model response:
[ "25-34", "M", "10000-15000", "Tennis", "South" ]

Make sure to use the same format as the example given.
Answer:

```json
{
  "ages": "25-34",
  "genders": "M",
  "income_brackets": "10000-15000",
  "interests": "Tennis",
  "states": "South"
}
```


```json
{
  "ages": "25-34",
  "genders": "M",
  "income_brackets": "10000-15000",
  "interests": "Tennis",
  "states": "South"
}
```


```json
{
  "ages": "25-34",
  "genders": "M",
  "income_brackets": "10
```



Test 2:
Testing base model with prompt:
'Focus on young women in California who are interested i
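
The repeated and truncated blocks in the output above make the baseline hard to judge by eye. The following is a minimal sketch, not part of the original notebook, of one way to score a response: pull out the first complete fenced `json` block and list which of the requested keys are missing. The helper names `first_json_block` and `missing_keys` and the regular expression are assumptions made for illustration; the key list is taken from the prompt in `test_base_model`.

```python
import json
import re

# Keys requested in the base-model prompt.
EXPECTED_KEYS = ["ages", "genders", "income_brackets", "interests", "states"]

# Build the fence marker programmatically so this snippet can itself sit in a fenced block.
TICKS = "`" * 3
FENCED_JSON = re.compile(TICKS + r"json\s*(.*?)" + TICKS, re.DOTALL)


def first_json_block(text):
    """Return the parsed first complete fenced JSON block, or None if absent or invalid."""
    match = FENCED_JSON.search(text)
    if match is None:
        return None
    try:
        return json.loads(match.group(1))
    except json.JSONDecodeError:
        return None


def missing_keys(parsed, expected=EXPECTED_KEYS):
    """List the expected keys that the parsed response does not contain."""
    if not isinstance(parsed, dict):
        return list(expected)
    return [key for key in expected if key not in parsed]


# Hypothetical response: one complete block followed by a truncated one.
sample = (
    "Answer:\n" + TICKS + 'json\n{"ages": "25-34", "genders": "M"}\n' + TICKS
    + "\n" + TICKS + 'json\n{"ages"'
)
parsed = first_json_block(sample)
print("Parsed:", parsed)
print("Missing keys:", missing_keys(parsed))
```

Running the same helper over `base_results` and, later, over the fine-tuned model's responses would give a like-for-like comparison instead of a visual impression.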

## Step 2: Prepare Training Data

First, let's examine the structure of our training data. The model is designed to parse demographic targeting requests and return JSON responses.

In [49]:
import json
import os

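# Fallback sample, used only if training_data.json does not exist.
# Note: its keys (e.g. "genders", "income_brackets") differ from the schema of the
# real dataset shown in the output below (e.g. "gender", "income", "chain_of_thought").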
sample_example = {
    "input": "Target wealthy people, senior men in the South who like tennis and traveling",
    "output": {
        "ages": ["55-64", "65+"],
        "genders": ["Male"],
        "income_brackets": ["$150K-$200K", "$200K-$250K", "$250K-$500K", "$500K+"],
        "interests": ["Tennis", "Travel"],
        "states": ["Alabama", "Arkansas", "Florida", "Georgia", "Kentucky", "Louisiana", "Mississippi", "North Carolina", "South Carolina", "Tennessee", "Texas", "Virginia", "West Virginia"]
    }
}

# Check if training data exists
if os.path.exists('training_data.json'):
    with open('training_data.json', 'r') as f:
        training_data = json.load(f)
    
    print(f"Found {len(training_data)} training examples")
    print("\nExample training data structure:")
    print(json.dumps(training_data[0], indent=2))
else:
    print("Training data file not found. Creating a sample file with 1 example.")
    
    # Create a sample training file
    with open('training_data.json', 'w') as f:
        json.dump([sample_example], f, indent=2)
    
    print("Created sample training_data.json with the following example:")
    print(json.dumps(sample_example, indent=2))
    print("\nNote: For a robust model, add more examples to training_data.json")

Found 228 training examples

Example training data structure:
{
  "input": "Target people making over 60k",
  "output": {
    "chain_of_thought": "Over 60k means 60k and above, so I need to include ALL income brackets from 60k upward: 70_85, 85_100, 100_125, 125_150, 150_200, 200_250, 250_500, 500_plus. No specific interests mentioned, so interests array should be empty.",
    "ages": [],
    "gender": [
      "All"
    ],
    "income": [
      "70_85",
      "85_100",
      "100_125",
      "125_150",
      "150_200",
      "200_250",
      "250_500",
      "500_plus"
    ],
    "cities": [],
    "states": [],
    "zip_codes": [],
    "dmas": [],
    "interests": []
  }
}
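
Because the model learns whatever structure the training file contains, it can help to check every record against the schema printed above before training. The sketch below assumes that all records share the fields of that one printed example, which the notebook does not state explicitly; `validate_example` and `LIST_FIELDS` are hypothetical names.

```python
# List-valued fields seen in the printed example; assumed to apply to every record.
LIST_FIELDS = ["ages", "gender", "income", "cities", "states", "zip_codes", "dmas", "interests"]


def validate_example(example):
    """Return a list of human-readable problems found in one training record."""
    problems = []
    if not isinstance(example.get("input"), str) or not example.get("input"):
        problems.append("missing or empty 'input'")
    output = example.get("output")
    if not isinstance(output, dict):
        return problems + ["'output' is not an object"]
    if not isinstance(output.get("chain_of_thought"), str):
        problems.append("'chain_of_thought' is not a string")
    for field in LIST_FIELDS:
        if not isinstance(output.get(field), list):
            problems.append(f"'{field}' is not a list")
    return problems


# A record shaped like the printed example passes.
good = {
    "input": "Target people making over 60k",
    "output": {
        "chain_of_thought": "Over 60k means 60k and above.",
        "ages": [], "gender": ["All"], "income": ["70_85"], "cities": [],
        "states": [], "zip_codes": [], "dmas": [], "interests": [],
    },
}
# A record shaped like the notebook's fallback sample is flagged.
fallback_like = {
    "input": "Target wealthy senior men in the South",
    "output": {"ages": ["55-64"], "genders": ["Male"], "income_brackets": ["$500K+"],
               "interests": ["Tennis"], "states": ["Texas"]},
}
print(validate_example(good))
print(validate_example(fallback_like))
```

Under this assumed schema, the fallback `sample_example` would be reported as inconsistent with the real dataset, so mixing the two in one training file is worth avoiding.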


## Step 3: Create Fine-tuning Script

This step creates the `qwenfinetune.py` module used for fine-tuning. Each component and its parameters are explained below:

### DemographicsDataset Class
- **`__init__(self, tokenizer, max_length=512)`**:
  - `tokenizer`: The tokenizer used to convert text to token IDs
  - `max_length=512`: Maximum sequence length (512 tokens covers most examples without truncation)

- **`prepare_training_data(self, examples)`**:
  - `examples`: List of input/output pairs from training data
  - This method formats each example with proper prompting structure
  - Sets up JSON formatting with code blocks (```json) for consistent outputs
  - Returns tokenized inputs with proper padding and truncation

### Data Loading and Splitting Functions
- **`load_training_data(file_path)`**:
  - `file_path`: Path to the JSON file containing training examples
  - Loads and parses the JSON training data

- **`split_data(examples, train_ratio=0.8, seed=42)`**:
  - `examples`: List of all training examples
  - `train_ratio=0.8`: 80% of data used for training, 20% for validation
  - `seed=42`: Random seed for reproducible data splitting
  - Returns separate train and validation sets

### Device Detection and Configuration
- **`detect_device()`**:
  - Automatically detects hardware capabilities (CUDA, MPS, CPU)
  - Returns a tuple containing:
    - `device`: The device to use ("cuda", "mps", or "cpu")
    - `use_fp16`: Whether to use FP16 precision (only on CUDA)
    - `use_device_map`: Whether to use device mapping (only on CUDA)

### Main Fine-tuning Function
- **`fine_tune_model()`**:
  - Core function that orchestrates the fine-tuning process
  - **Hardware Detection**:
    - Detects optimal device and precision settings

  - **Model Loading**:
    - Tries to load the specified Qwen model with appropriate settings
    - Adds padding token if not present

  - **Training Data Preparation**:
    - Loads and splits data between training and validation sets
    - Only creates a validation set if at least 10 examples are available
    - Tokenizes all examples using the DemographicsDataset class

  - **Training Arguments**:
    - `output_dir="./qwen-demographics-finetuned"`: Directory where model will be saved
    - `num_train_epochs=2`: Number of training passes through the entire dataset
    - `per_device_train_batch_size=1`: Small batch size for stability across hardware
    - `gradient_accumulation_steps=4`: Accumulates gradients over 4 steps (effective batch size of 4)
    - `warmup_steps=3`: Gradually increases learning rate for the first 3 steps
    - `weight_decay=0.1`: L2 regularization to prevent overfitting
    - `learning_rate=3e-5`: Learning rate optimized for fine-tuning small models
    - `fp16=use_fp16`: Uses mixed precision training when supported
    - `save_steps=50`: Saves checkpoint every 50 steps
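    - `save_total_limit=2`: Keeps only the 2 most recent checkpoints to save disk space (the best checkpoint is also retained when `load_best_model_at_end` is enabled)
    - `eval_strategy="steps"`: Evaluates on the validation set every `eval_steps=5` steps, if a validation set is available
    - `load_best_model_at_end=use_validation`: Reloads the checkpoint with the lowest validation loss when training finishes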

  - **Data Collator**:
    - `mlm=False`: Uses causal language modeling (not masked)
    - Handles batching and padding of examples

  - **Training Process**:
    - Initializes Trainer with model, data, and training arguments
    - Handles training process with error detection and fallback
    - Saves model and tokenizer to specified output directory
    - Provides detailed training metrics and overfitting detection
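
To make the batch-size and step settings above concrete, here is a small arithmetic sketch. It assumes a single device and the 182 training examples reported in the run log later in this notebook; the function name `training_schedule` is hypothetical, and the step counts are approximate because the Trainer's rounding of a partial final accumulation group can differ slightly between versions.

```python
import math


def training_schedule(num_examples, per_device_batch_size, grad_accum_steps, epochs, num_devices=1):
    """Approximate effective batch size and optimizer step counts."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return effective_batch, steps_per_epoch, total_steps


effective, per_epoch, total = training_schedule(
    num_examples=182,         # 80% of the 228 examples
    per_device_batch_size=1,
    grad_accum_steps=4,
    epochs=2,
)
print(f"Effective batch size: {effective}")
print(f"Approx. optimizer steps per epoch: {per_epoch}")
print(f"Approx. total optimizer steps: {total}")
```

With roughly 90 total steps, `warmup_steps=3` covers only the first few percent of training, `eval_steps=5` yields about 18 evaluations, and `save_steps=50` writes a single intermediate checkpoint (at step 50).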

Let's create the `qwenfinetune.py` module:


In [4]:
%%writefile qwenfinetune.py

import json
import torch
from transformers import (
	AutoTokenizer,
	AutoModelForCausalLM,
	TrainingArguments,
	Trainer,
	DataCollatorForLanguageModeling
)
from datasets import Dataset
import platform
import random


class DemographicsDataset:
	def __init__(self, tokenizer, max_length=512):
		self.tokenizer = tokenizer
		self.max_length = max_length

	def prepare_training_data(self, examples):
		"""Convert examples to tokenized format"""
		inputs = []
		for example in examples:
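			# Format the prompt + response
			prompt = f"Parse this demographic targeting request: {example['input']}\n\nResponse:"
			response = json.dumps(example['output'], indent=2)
			# End with the tokenizer's own EOS token. If a hard-coded token happens to be
			# the pad token, DataCollatorForLanguageModeling masks it out of the labels
			# and the model never learns when to stop.
			full_text = f"{prompt}\n```json\n{response}\n```{self.tokenizer.eos_token}"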

			inputs.append(full_text)

		# Tokenize
		tokenized = self.tokenizer(
			inputs,
			truncation=True,
			padding=True,
			max_length=self.max_length,
			return_tensors="pt"
		)

		# For causal LM, labels are the same as input_ids
		tokenized["labels"] = tokenized["input_ids"].clone()

		return tokenized


def load_training_data(file_path):
	"""Load training data from JSON file"""
	with open(file_path, 'r') as f:
		return json.load(f)


def split_data(examples, train_ratio=0.8, seed=42):
	"""Split data into train and validation sets"""
	random.seed(seed)
	shuffled = examples.copy()
	random.shuffle(shuffled)

	train_size = int(len(shuffled) * train_ratio)
	train_data = shuffled[:train_size]
	val_data = shuffled[train_size:]

	return train_data, val_data


def detect_device():
	"""Detect the best available device for training"""
	if torch.cuda.is_available():
		return "cuda", True, True  # device, use_fp16, use_device_map
	elif torch.backends.mps.is_available():
		return "mps", False, False  # MPS doesn't support fp16 or device_map
	else:
		return "cpu", False, False


def fine_tune_model():
	# Detect device capabilities
	device, use_fp16, use_device_map = detect_device()
	print(f"Using device: {device}")
	print(f"FP16 enabled: {use_fp16}")
	print(f"Device map enabled: {use_device_map}")
	print(f"Platform: {platform.platform()}")

	# Available small Qwen models (tried in order)
	available_models = [
		"Qwen/Qwen3-0.6B"  # Add further fallback candidates here
	]

	model_name = None
	tokenizer = None
	model = None

	# Try each model until one works
	for candidate_model in available_models:
		try:
			print(f"Trying to load model: {candidate_model}")
			tokenizer = AutoTokenizer.from_pretrained(candidate_model)

			# Load model with appropriate settings for the device
			if use_device_map:
				model = AutoModelForCausalLM.from_pretrained(
					candidate_model,
					dtype=torch.float16 if use_fp16 else torch.float32,
					device_map="auto"
				)
			else:
				model = AutoModelForCausalLM.from_pretrained(
					candidate_model,
					dtype=torch.float32  # Always use float32 for MPS/CPU
				)
				if device != "cpu":
					model = model.to(device)

			model_name = candidate_model
			print(f"Successfully loaded model: {model_name}")
			break
		except Exception as e:
			print(f"Failed to load {candidate_model}: {str(e)}")
			continue

	if model is None:
		raise ValueError(
			"Could not load any of the available models. Please check your internet connection and Hugging Face access.")

	# Add pad token if not present
	if tokenizer.pad_token is None:
		tokenizer.pad_token = tokenizer.eos_token
		print("Added pad token")

	# Load training data
	try:
		training_examples = load_training_data("training_data.json")
		print(f"Loaded {len(training_examples)} training examples")
	except FileNotFoundError:
		print("training_data.json not found. Please create training data file.")
		return

	# Split data into train/validation
	if len(training_examples) >= 10:  # Only split if we have enough data
		train_data, val_data = split_data(training_examples, train_ratio=0.8)
		print(f"Split data: {len(train_data)} train, {len(val_data)} validation")
		use_validation = True
	else:
		train_data = training_examples
		val_data = []
		print(f"Using all {len(train_data)} examples for training (too few for validation split)")
		use_validation = False

	# Prepare datasets
	dataset_prep = DemographicsDataset(tokenizer)

	# Train dataset
	train_tokenized = dataset_prep.prepare_training_data(train_data)
	train_dataset = Dataset.from_dict(train_tokenized)
	print(f"Prepared training dataset with {len(train_dataset)} examples")

	# Validation dataset (if applicable)
	eval_dataset = None
	if use_validation and val_data:
		val_tokenized = dataset_prep.prepare_training_data(val_data)
		eval_dataset = Dataset.from_dict(val_tokenized)
		print(f"Prepared validation dataset with {len(eval_dataset)} examples")

	# Training arguments optimized for Apple Silicon
	training_args = TrainingArguments(
		output_dir="./qwen-demographics-finetuned",
		num_train_epochs=2,  # Increased to 2 epochs for better training
		per_device_train_batch_size=1,  # Small batch size for stability
		gradient_accumulation_steps=4,  # Maintain effective batch size
		warmup_steps=3,
		weight_decay=0.1,
		learning_rate=3e-5,
		fp16=use_fp16,  # Only use fp16 if supported
		bf16=False,  # Disable bf16 for compatibility
		logging_steps=1,
		save_steps=50,
		save_total_limit=2,
		remove_unused_columns=False,
		dataloader_pin_memory=False,
		dataloader_num_workers=0,  # Prevent multiprocessing issues on macOS
		report_to="none",  # Disable wandb/tensorboard; passing None does not reliably disable reporting
		push_to_hub=False,
		use_cpu=device == "cpu",
		# Validation settings
		eval_strategy="steps" if use_validation else "no",
		eval_steps=5 if use_validation else None,
		per_device_eval_batch_size=1 if use_validation else None,
		load_best_model_at_end=use_validation,
		metric_for_best_model="eval_loss" if use_validation else None,
		greater_is_better=False if use_validation else None,
	)

	# Data collator
	data_collator = DataCollatorForLanguageModeling(
		tokenizer=tokenizer,
		mlm=False,  # Causal LM, not masked LM
	)

	# Trainer
	trainer = Trainer(
		model=model,
		args=training_args,
		train_dataset=train_dataset,
		eval_dataset=eval_dataset,
		data_collator=data_collator,
	)

	print("Starting fine-tuning...")
	try:
		# Fine-tune
		training_result = trainer.train()

		# Print training summary
		print(f"\nTraining completed!")
		print(f"Final training loss: {training_result.training_loss:.4f}")

		if use_validation:
			# Evaluate on validation set
			eval_result = trainer.evaluate()
			print(f"Final validation loss: {eval_result['eval_loss']:.4f}")

			# Check for potential overfitting
			if eval_result['eval_loss'] > training_result.training_loss * 1.5:
				print("⚠️  WARNING: Validation loss is significantly higher than training loss.")
				print("   This may indicate overfitting. Consider:")
				print("   - Reducing learning rate")
				print("   - Adding more training data")
				print("   - Reducing number of epochs")
			else:
				print("✅ No obvious signs of overfitting detected.")

		# Save the model
		trainer.save_model()
		tokenizer.save_pretrained("./qwen-demographics-finetuned")

		print("Fine-tuning completed successfully!")
		print(f"Model saved to: ./qwen-demographics-finetuned")
		print(f"Base model used: {model_name}")
		print(f"Device used: {device}")

		# Return the output path so callers can report where the model was saved
		return training_args.output_dir

	except Exception as e:
		print(f"Error during training: {str(e)}")
		print("Trying with even smaller batch size...")

		# The batch size is already 1, so the only setting actually reduced here is
		# gradient accumulation. This does not help if the failure was a checkpoint
		# write (for example, a full disk); free up space before retrying in that case.
		# Note that `model` keeps the weights already updated by the first attempt.
		training_args.per_device_train_batch_size = 1
		training_args.gradient_accumulation_steps = 2
		training_args.dataloader_num_workers = 0

		trainer = Trainer(
			model=model,
			args=training_args,
			train_dataset=train_dataset,
			eval_dataset=eval_dataset,
			data_collator=data_collator,
		)

		try:
			trainer.train()
			trainer.save_model()
			tokenizer.save_pretrained("./qwen-demographics-finetuned")
			print("Fine-tuning completed with reduced settings!")
			return training_args.output_dir
		except Exception as e2:
			print(f"Second attempt failed: {str(e2)}")
			raise


if __name__ == "__main__":
	# Check system information
	print(f"Python version: {platform.python_version()}")
	print(f"PyTorch version: {torch.__version__}")
	print(f"Platform: {platform.platform()}")
	print(f"Machine: {platform.machine()}")

	# Check device availability
	print(f"CUDA available: {torch.cuda.is_available()}")
	print(f"MPS available: {torch.backends.mps.is_available()}")

	if torch.backends.mps.is_available():
		print("Running on Apple Silicon with MPS acceleration")
	elif torch.cuda.is_available():
		print(f"CUDA device: {torch.cuda.get_device_name()}")
	else:
		print("Running on CPU")

	fine_tune_model()

Overwriting qwenfinetune.py
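
`DemographicsDataset` silently truncates anything longer than `max_length`, which for this task would cut off the end of the JSON answer. A quick way to see whether that happens is to count tokens per formatted example before training. The sketch below is an illustration only: `find_truncated` is a hypothetical helper, and a whitespace splitter stands in for the real tokenizer so that the snippet runs without downloading a model.

```python
def find_truncated(texts, count_tokens, max_length=512):
    """Return the indices of texts whose token count exceeds max_length."""
    return [i for i, text in enumerate(texts) if count_tokens(text) > max_length]


def whitespace_token_count(text):
    """Stand-in tokenizer for the demo: counts whitespace-separated words."""
    return len(text.split())


# With the real tokenizer, one would pass something like:
#   lambda text: len(tokenizer(text)["input_ids"])
demo_texts = ["short example", "word " * 600]
too_long = find_truncated(demo_texts, whitespace_token_count, max_length=512)
print(f"{len(too_long)} of {len(demo_texts)} demo texts exceed the limit: {too_long}")
```

If any real examples are reported, either raise `max_length` or shorten the serialized output (for instance by dropping `indent=2` from `json.dumps`), keeping training and inference formats identical.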


## Step 4: Fine-tune the Qwen Model

Now we'll run the fine-tuning process using our script. This will train the model to understand demographic targeting requests.

In [5]:
# Import the fine-tuning function
from qwenfinetune import fine_tune_model

# Start fine-tuning
print("Starting fine-tuning process...")
print("This may take 10-30 minutes depending on your hardware.")

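try:
    # fine_tune_model() may return None, so fall back to the script's default output path
    output_dir = fine_tune_model() or "./qwen-demographics-finetuned"
    print(f"Fine-tuning completed successfully. Model saved to: {output_dir}")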
except Exception as e:
    print(f"Error during fine-tuning: {str(e)}")
    print("\nTroubleshooting tips:")
    print("1. Make sure you have enough GPU memory or use a smaller batch size")
    print("2. Check that the training data is formatted correctly")
    print("3. Ensure you have internet access to download the base model")

Starting fine-tuning process...
This may take 10-30 minutes depending on your hardware.
Using device: mps
FP16 enabled: False
Device map enabled: False
Platform: macOS-15.5-arm64-arm-64bit-Mach-O
Trying to load model: Qwen/Qwen3-0.6B
Successfully loaded model: Qwen/Qwen3-0.6B
Loaded 228 training examples
Split data: 182 train, 46 validation
Prepared training dataset with 182 examples
Prepared validation dataset with 46 examples
Starting fine-tuning...


Step,Training Loss,Validation Loss
5,0.7462,0.570691
10,0.3243,0.354272
15,0.3007,0.293315
20,0.1735,0.2651
25,0.1632,0.249995
30,0.1968,0.231329
35,0.2525,0.234547
40,0.1371,0.224603
45,0.1987,0.216989
50,0.0894,0.206914


Error during training: [enforce fail at inline_container.cc:664] . unexpected pos 3782341248 vs 3782341136
Trying with even smaller batch size...


Step,Training Loss,Validation Loss
5,0.1437,0.236518
10,0.2278,0.250816
15,0.1969,0.254704
20,0.2496,0.272046
25,0.1627,0.284367
30,0.2364,0.249262
35,0.1212,0.241262
40,0.0826,0.240035
45,0.1547,0.238182
50,0.0784,0.238882


2025-10-15 17:10:41.789 Python[12073:175050843] Error creating directory
 The volume “Macintosh HD” is out of space. You can’t save the file “mpsgraph-12073-2025-10-15_17_10_40-3517201249” because the volume “Macintosh HD” is out of space.
2025-10-15 17:10:41.839 Python[12073:175050843] Error creating directory
 The volume “Macintosh HD” is out of space. You can’t save the file “mpsgraph-12073-2025-10-15_17_10_41-3546179958” because the volume “Macintosh HD” is out of space.
2025-10-15 17:10:41.971 Python[12073:175050843] Error creating directory
 The volume “Macintosh HD” is out of space. You can’t save the file “mpsgraph-12073-2025-10-15_17_10_41-878773179” because the volume “Macintosh HD” is out of space.
2025-10-15 17:10:42.038 Python[12073:175050843] Error creating directory
 The volume “Macintosh HD” is out of space. You can’t save the file “mpsgraph-12073-2025-10-15_17_10_42-704088224” because the volume “Macintosh HD”

Second attempt failed: [enforce fail at inline_container.cc:664] . unexpected pos 1504785408 vs 1504785296
Error during fine-tuning: [enforce fail at inline_container.cc:664] . unexpected pos 1504785408 vs 1504785296

Troubleshooting tips:
1. Make sure you have enough GPU memory or use a smaller batch size
2. Check that the training data is formatted correctly
3. Ensure you have internet access to download the base model
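
The "out of space" messages in the log above suggest that the `inline_container.cc` failure came from a checkpoint write running out of disk, not from GPU memory, a case the printed troubleshooting tips do not cover. Below is a hedged pre-flight sketch using only the standard library. The size estimate is a back-of-the-envelope assumption (about 0.6 billion float32 parameters plus two optimizer buffers per parameter), not a measured figure, and both function names are hypothetical.

```python
import shutil


def estimate_checkpoint_gb(num_params, bytes_per_value=4, optimizer_buffers=2):
    """Rough size of one checkpoint: weights plus optimizer buffers, in GB."""
    return num_params * bytes_per_value * (1 + optimizer_buffers) / 1e9


def free_space_gb(path="."):
    """Free space on the volume that holds `path`, in GB."""
    return shutil.disk_usage(path).free / 1e9


# Assumption: ~0.6 billion parameters, float32, two optimizer buffers each,
# and two checkpoints kept on disk at once (save_total_limit=2).
needed_gb = 2 * estimate_checkpoint_gb(600_000_000)
available_gb = free_space_gb(".")
print(f"Estimated need: {needed_gb:.1f} GB, available: {available_gb:.1f} GB")
if available_gb < needed_gb:
    print("Free up disk space (or lower save_total_limit) before training.")
```

The estimate ignores the final saved model and temporary files, so treating it as a lower bound is the safer reading.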


## Step 5: Test the Fine-tuned Model

Let's verify that our fine-tuned model works correctly before converting it.
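
One way to make "works correctly" checkable is to compare the model's `income` list with what a simple rule would produce. The sketch below encodes only the income brackets that actually appear in this notebook, and the rule itself (keep brackets whose lower bound is at or above the threshold) is an inference from the "over 60k" training example and the "more than 50K" test prompt, not a documented specification. `KNOWN_BRACKETS` and `brackets_at_or_above` are hypothetical names.

```python
# (lower bound in $K, label) for the brackets seen in this notebook.
# Brackets below 55K are not shown in the source, so they are omitted here.
KNOWN_BRACKETS = [
    (55, "55_70"), (70, "70_85"), (85, "85_100"), (100, "100_125"),
    (125, "125_150"), (150, "150_200"), (200, "200_250"),
    (250, "250_500"), (500, "500_plus"),
]


def brackets_at_or_above(threshold_k):
    """Labels of known brackets whose lower bound is at or above the threshold (in $K)."""
    return [label for lower, label in KNOWN_BRACKETS if lower >= threshold_k]


def income_matches(model_income, threshold_k):
    """True if the model's income list equals the rule-based expectation."""
    return list(model_income) == brackets_at_or_above(threshold_k)


over_60k = brackets_at_or_above(60)
over_50k = brackets_at_or_above(50)
print("over 60k ->", over_60k)
print("over 50k ->", over_50k)
```

A check like `income_matches(parsed["income"], 50)` only works on complete JSON, so a response cut off by `max_new_tokens` would need to be regenerated with a larger limit first.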

In [43]:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the fine-tuned model
model_path = './qwen-demographics-finetuned'

try:
    print(f"Loading fine-tuned model from {model_path}")
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(model_path)
    print("Model loaded successfully")
    
    def test_model(prompt):
        """Test the fine-tuned model with a prompt"""
        formatted_prompt = f"Parse this demographic targeting request: {prompt}\n\nResponse:"
        print(f"Input prompt: {formatted_prompt}")
        
        inputs = tokenizer(formatted_prompt, return_tensors="pt")
        
        with torch.no_grad():
            outputs = model.generate(
                **inputs,  # pass input_ids together with the attention mask
                max_new_tokens=200,
                temperature=0.6,
                top_p=0.95,
                top_k=20,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id
            )
        
        # Decode only the newly generated tokens instead of slicing the decoded string
        prompt_length = inputs["input_ids"].shape[1]
        response = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
        return response.strip()
    
    # Test with sample prompts
    test_prompts = [
        "Target young people in the Rockies making more than 50K",
        "Target women over 30 in urban areas interested in education",
        "Target people who like sports and live in Texas."
    ]

    for prompt in test_prompts:
        print(f"\nTesting prompt: {prompt}")
        result = test_model(prompt)
        print(f"Model output:\n{result}")
        
except FileNotFoundError:
    print(f"Error: Model files not found in {model_path}")
    print("Make sure fine-tuning completed successfully.")
except Exception as e:
    print(f"Error testing model: {str(e)}")

Loading fine-tuned model from ./qwen-demographics-finetuned
Model loaded successfully

Testing prompt: Target young people in the Rockies making more than 50K
Input prompt: Parse this demographic targeting request: Target young people in the Rockies making more than 50K

Response:
Model output:
```json
{
  "chain_of_thought": "Young people typically means 18-24. Rockies is specified. More than 50K means income brackets 55_70, 70_85, 85_100, 100_125, 125_150, 150_200, 200_250, 250_500, 500_plus.",
  "ages": [
    "pop_18_24"
  ],
  "gender": [
    "All"
  ],
  "income": [
    "55_70",
    "70_85",
    "85_100",
    "100_125",
    "125_150",
    "150_200",
    "
```

Testing prompt: Target women over 30 in urban areas interested in education
Input prompt: Parse this demographic targeting request: Target women over 30 in urban areas interested in education

Response:
Model output:
```json
{
  "chain_of_thought": "Women over 30 means 30 and above, so ALL age brackets from 30+: pop_35_44, pop_4
```

## Download and Set Up llama.cpp

To convert our model to GGUF format, we need llama.cpp. Let's download and set it up.

In [18]:
import os
import subprocess

# Check if llama.cpp directory exists
if not os.path.exists('llama.cpp'):
    print("Downloading llama.cpp repository...")
    try:
        subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp"], check=True)
        print("Successfully cloned llama.cpp repository")
    except (subprocess.SubprocessError, FileNotFoundError) as e:
        # FileNotFoundError covers the case where git itself is not installed
        print(f"Error downloading llama.cpp: {e}")
        print("Please install git and try again, or download manually from https://github.com/ggerganov/llama.cpp")
else:
    print("llama.cpp directory already exists")

# Check if the conversion script exists
if os.path.exists('llama.cpp/convert_hf_to_gguf.py'):
    print("Found convert_hf_to_gguf.py script")
else:
    print("Warning: Could not find convert_hf_to_gguf.py script")
    print("Please ensure you have the latest version of llama.cpp")

Downloading llama.cpp repository...


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Cloning into 'llama.cpp'...


Successfully cloned llama.cpp repository
Found convert_hf_to_gguf.py script
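
Finding the script is not quite enough: `convert_hf_to_gguf.py` imports Python packages that the `pip install` line at the top of this notebook does not cover. The sketch below checks whether a set of modules can be imported. The module list is an assumption for illustration; the authoritative list is in the requirements files shipped in the cloned repository (installable with `pip install -r llama.cpp/requirements.txt`).

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Assumed dependencies of the conversion script (check the repo's requirements files).
ASSUMED_CONVERTER_DEPS = ["numpy", "torch", "transformers", "gguf", "sentencepiece"]

missing = missing_modules(ASSUMED_CONVERTER_DEPS)
if missing:
    print(f"Missing modules: {missing}")
    print("Try: pip install -r llama.cpp/requirements.txt")
else:
    print("All assumed converter dependencies are importable.")
```

Running this in the same interpreter that later launches the conversion (the notebook uses `sys.executable`) keeps the check and the actual run consistent.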


## Convert to GGUF Format

Now we'll convert our fine-tuned model to GGUF format for use with Ollama. We'll use the llama.cpp conversion script.

In [20]:
import subprocess
import os
import sys

# Create output directory
os.makedirs('./gguf_output', exist_ok=True)

# Get absolute paths
model_path = os.path.abspath('./qwen-demographics-finetuned')
output_path = os.path.abspath('./gguf_output/qwen-demographics-finetuned.gguf')

# Define the conversion command with proper paths
conversion_script = os.path.join('llama.cpp', 'convert_hf_to_gguf.py')
conversion_command = [
    sys.executable,  # Use the current Python interpreter
    conversion_script,
    model_path,
    "--outfile", output_path,
    "--outtype", "f16"  # Use f16 for better balance of size and quality
]

print("Converting to GGUF format...")
print("Command:", " ".join(conversion_command))

try:
    # Check if the conversion script exists
    if not os.path.exists(conversion_script):
        raise FileNotFoundError(f"Conversion script not found: {conversion_script}")
        
    # Check if the model directory exists
    if not os.path.exists(model_path):
        raise FileNotFoundError(f"Model directory not found: {model_path}")
    
    # Run the conversion
    result = subprocess.run(conversion_command, capture_output=True, text=True, check=True)
    print("Conversion successful!")
    print(result.stdout)
    
    # List generated GGUF files
    gguf_files = [f for f in os.listdir('./gguf_output') if f.endswith('.gguf')]
    if gguf_files:
        print(f"Generated GGUF files: {gguf_files}")
    else:
        print("No GGUF files were created. Please check the conversion output for errors.")
    
except FileNotFoundError as e:
    print(f"Error: {e}")
except subprocess.CalledProcessError as e:
    print(f"Conversion failed: {e}")
    print(f"Error output: {e.stderr}")
    print("\nTroubleshooting:")
    print("1. Make sure you have the latest version of llama.cpp")
    print("2. Check that all required Python packages are installed")
    print("3. Make sure your model was fine-tuned successfully")
except Exception as e:
    print(f"Unexpected error: {e}")

Converting to GGUF format...
Command: /Users/ericlivesay/.cache/uv/builds-v0/.tmp1Yeg2Y/bin/python llama.cpp/convert_hf_to_gguf.py /Users/ericlivesay/PycharmProjects/finetuneqwen06b/qwen-demographics-finetuned --outfile /Users/ericlivesay/PycharmProjects/finetuneqwen06b/gguf_output/qwen-demographics-finetuned.gguf --outtype f16




Conversion successful!

Generated GGUF files: ['qwen-demographics-finetuned-q8_0.gguf', 'qwen-demographics-finetuned.gguf', 'qwen-demographics-finetuned-f16.gguf', 'qwen-demographics-finetuned-q8_0v2.gguf']
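
The listing above includes files such as the `q8_0` variants that this command alone would not produce, presumably left over from earlier runs, so the presence of a `.gguf` file does not by itself show that this conversion succeeded. A light sanity check is to read the file header: GGUF files start with the 4-byte magic `GGUF` followed by a little-endian 32-bit version number. The sketch below demonstrates this on a fake 8-byte header so that it runs anywhere; `read_gguf_version` is a hypothetical helper name.

```python
import os
import struct
import tempfile


def read_gguf_version(path):
    """Return the GGUF version number, or None if the file lacks the GGUF magic."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    (version,) = struct.unpack("<I", header[4:8])
    return version


# Self-contained demo on a fake header; point the function at the real .gguf file instead,
# e.g. read_gguf_version("./gguf_output/qwen-demographics-finetuned.gguf")
with tempfile.TemporaryDirectory() as tmp_dir:
    fake_path = os.path.join(tmp_dir, "fake.gguf")
    with open(fake_path, "wb") as f:
        f.write(b"GGUF" + struct.pack("<I", 3))
    demo_version = read_gguf_version(fake_path)

print(f"Demo header version: {demo_version}")
```

Comparing the output file's modification time with the time of the conversion run is another simple way to tell a fresh file from a stale one.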


## Check Ollama Installation

Before proceeding, let's make sure Ollama is installed and running.

In [21]:
import subprocess
import requests

def check_ollama():
    """Check if Ollama is installed and running"""
    try:
        # Check if Ollama is installed
        result = subprocess.run(["ollama", "--version"], capture_output=True, text=True)
        print(f"Ollama is installed: {result.stdout.strip()}")
        
        # Check if Ollama service is running by making an API request
        response = requests.get("http://localhost:11434/api/tags", timeout=5)
        if response.status_code == 200:
            print("Ollama service is running ✓")
            return True
        else:
            print(f"Ollama service returned status code: {response.status_code}")
            return False
    except FileNotFoundError:
        # subprocess.run raises FileNotFoundError when the `ollama` binary is not on PATH
        print("Ollama is not installed. Please install it from https://ollama.ai")
        return False
    except requests.RequestException:
        print("Ollama service is not running. Please start it by running 'ollama serve'")
        return False

# Check Ollama status
ollama_ready = check_ollama()
if not ollama_ready:
    print("\nPlease install and start Ollama before continuing.")
    print("Installation instructions: https://ollama.ai/download")

Ollama is installed: ollama version is 0.11.11
Ollama service is running ✓
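
The `/api/tags` endpoint used for the health check also returns the locally installed models as JSON, so a specific model can be looked up without shelling out to `ollama list`. The sketch below assumes the response body has a top-level `models` list whose entries carry a `name` field; `installed_model_names`, `fetch_installed_models`, and the sample payload are illustrative names and data, not part of the notebook.

```python
import json
import urllib.request

def installed_model_names(payload):
    """Extract model names from a parsed /api/tags response body."""
    return [m["name"] for m in payload.get("models", []) if "name" in m]

def fetch_installed_models(base_url="http://localhost:11434"):
    """Query a running Ollama service; fails if the service is not started."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return installed_model_names(json.load(resp))

# Illustrative payload in the assumed shape, so the helper can be tried offline.
sample = {
    "models": [
        {"name": "qwen-demographics-finetuned:latest"},
        {"name": "hf.co/Qwen/Qwen3-0.6B-GGUF:Q8_0"},
    ]
}
names = installed_model_names(sample)
print("qwen-demographics-finetuned:latest" in names)
```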




## Create Ollama Model File

Export the Modelfile of the base Qwen3 GGUF model to `qwen_demographics_modelfile`. It serves as a template that the next step points at our fine-tuned weights.

In [34]:
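import subprocess

def show_modelfile(model_name, output_path="qwen_demographics_modelfile"):
    """
    Run 'ollama show --modelfile' for a model and save the result to a file.
    
    Args:
        model_name (str): Name of the Ollama model
        output_path (str): File the Modelfile text is written to
    
    Returns:
        tuple: (success boolean, output or error message)
    """
    try:
        # check=True raises CalledProcessError on a non-zero exit code,
        # and capture_output=True makes stdout/stderr available on the result
        result = subprocess.run(
            ["ollama", "show", "--modelfile", model_name],
            capture_output=True,
            text=True,
            check=True
        )
        with open(output_path, "w") as f:
            f.write(result.stdout)
        return True, result.stdout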
    except subprocess.CalledProcessError as e:
        return False, f"Command failed with exit code {e.returncode}:\n{e.stderr}"
    except FileNotFoundError:
        return False, "Error: 'ollama' command not found. Is Ollama installed and in your PATH?"
    except Exception as e:
        return False, f"Unexpected error: {str(e)}"

# Set the model name - change this to view a different model
model = "hf.co/Qwen/Qwen3-0.6B-GGUF:Q8_0"

print(f"Showing Modelfile for '{model}'...\n")
success, output = show_modelfile(model)

if success:
    print(output)
else:
    print(f"Error: {output}")

Showing Modelfile for 'hf.co/Qwen/Qwen3-0.6B-GGUF:Q8_0'...






## Update the Model File to Use Our Fine-tuned Model
Open the `qwen_demographics_modelfile` file and edit the `FROM` line to point to our new GGUF file:

In [44]:
# Set the filename of your modelfile
filename = "qwen_demographics_modelfile"

try:
    # Read the file content
    with open(filename, "r") as file:
        lines = file.readlines()
    
    # Find and replace the FROM line
    updated = False
    for i, line in enumerate(lines):
        if line.strip().startswith("FROM"):
            lines[i] = "FROM ./gguf_output/qwen-demographics-finetuned.gguf\n"
            updated = True
            break
    
    if not updated:
        print(f"No 'FROM' line found in {filename}")
    else:
        # Write the updated content back to the file
        with open(filename, "w") as file:
            file.writelines(lines)
        
        print(f"Successfully updated {filename}")

except FileNotFoundError:
    print(f"File not found: {filename}")
    print("Please provide the correct path to the modelfile.")

Successfully updated qwen_demographics_modelfile
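
Before building the model, it can help to confirm that the edited `FROM` line points at a real GGUF file. GGUF files begin with the four ASCII bytes `GGUF` followed by a little-endian 32-bit format version, so a quick header check catches truncated or mis-named files. The helper names `from_path` and `gguf_version` are hypothetical, and this is a sanity check only; it does not validate the tensors.

```python
import os
import struct

def from_path(modelfile_text):
    """Return the argument of the first uncommented FROM line, or None."""
    for line in modelfile_text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2 and parts[0].upper() == "FROM":
            return parts[1].strip()
    return None

def gguf_version(path):
    """Return the GGUF format version, or None if the header is not GGUF."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        return None
    return struct.unpack("<I", header[4:8])[0]

# Only runs the check when the modelfile from the previous step is present.
if os.path.exists("qwen_demographics_modelfile"):
    with open("qwen_demographics_modelfile") as f:
        target = from_path(f.read())
    if target and os.path.exists(target):
        print(f"{target}: GGUF version {gguf_version(target)}")
    else:
        print(f"FROM target missing: {target}")
```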


## Create Ollama Model

Create a new Ollama model using our fine-tuned GGUF file.

In [45]:
# Make sure the modelfile from the previous step exists
if not os.path.exists('qwen_demographics_modelfile'):
    print("Error: Model file not found. Please complete the previous step.")
else:
    # Create the Ollama model
    print("Creating Ollama model...")
    
    try:
        # Try to delete existing model if it exists
        subprocess.run(
            ["ollama", "rm", "qwen-demographics-finetuned"],
            capture_output=True,
            check=False  # Don't fail if model doesn't exist
        )
        
        # Create the new model
        result = subprocess.run(
            ["ollama", "create", "qwen-demographics-finetuned", "-f", "qwen_demographics_modelfile"],
            capture_output=True,
            text=True,
            check=True
        )
        
        print("Ollama model created successfully!")
        if result.stdout:
            print(result.stdout)
        
        # List available models to confirm
        list_result = subprocess.run(
            ["ollama", "list"],
            capture_output=True,
            text=True,
            check=True
        )
        
        print("\nAvailable Ollama models:")
        print(list_result.stdout)
        
    except subprocess.CalledProcessError as e:
        print(f"Failed to create Ollama model: {e}")
        print(f"Error output: {e.stderr}")
        print("\nTroubleshooting:")
        print("1. Ensure Ollama is installed and running")
        print("2. Check that the GGUF file exists and is valid")
        print("3. Verify the model file is correctly formatted")

Creating Ollama model...




Ollama model created successfully!

Available Ollama models:
NAME                                                     ID              SIZE      MODIFIED               
qwen-demographics-finetuned:latest                       fbdc9a3039d6    1.2 GB    Less than a second ago    
ericqwencpuonly:latest                                   07f472a739df    639 MB    7 days ago                
hf.co/ggml-org/gemma-3-270m-GGUF:Q8_0                    4f341f194799    291 MB    2 weeks ago               
qwenfinetunedv2:latest                                   b115aa207881    1.2 GB    5 weeks ago               
qwenfinetuned_demo_q80:latest                            b653cb22551f    639 MB    5 weeks ago               
qwenfinetuned:latest                                     173438e33ad2    1.2 GB    6 weeks ago               
qwen-demographics-finetuned-f16.gguf:latest              b5e0b3bc6ec1    1.2 GB    6 weeks ago               
hf.co/Qwen/Qwen3-0.6B-GGUF:Q8_0                          3e52e



## Test the Ollama Model

Test our deployed model with Ollama.

In [46]:
# Test the model with demographic targeting prompts
test_prompts = [
    "Target young professionals in California making over 100K",
    "Target wealthy seniors in Florida who like golf", 
    "Target middle-aged families in the Midwest with kids"
]

print("Testing the fine-tuned model with Ollama...")

for i, prompt in enumerate(test_prompts):
    print(f"\n{'='*60}")
    print(f"Test {i+1}: {prompt}")
    print('='*60)
    
    try:
        result = subprocess.run(
            ["ollama", "run", "qwen-demographics-finetuned", f"Parse this demographic targeting request: {prompt}"],
            capture_output=True,
            text=True,
            check=True,
            timeout=60
        )
        
        print("Response:")
        print(result.stdout)
        
        # Try to parse the response as JSON
        try:
            # Extract JSON part from the response
            response_text = result.stdout.strip()
            # Try to find JSON in the response
            import re
            json_match = re.search(r'\{[^}]*\}', response_text)
            if json_match:
                json_str = json_match.group(0)
                json_obj = json.loads(json_str)
                print("\n✓ Response contains valid JSON")
            else:
                print("\n⚠️ Could not find JSON in response")
        except json.JSONDecodeError:
            print("\n⚠️ Response is not valid JSON")
        
    except subprocess.CalledProcessError as e:
        print(f"Error: {e}")
        print(f"Error output: {e.stderr}")
    except subprocess.TimeoutExpired:
        print("Request timed out (>60s)")

print("\n" + "="*60)
print("You can now interact with your model using:")
print("ollama run qwen-demographics-finetuned")
print("="*60)

Testing the fine-tuned model with Ollama...

Test 1: Target young professionals in California making over 100K




Response:
<think>
Okay, let's see. Young professionals typically means 25-44 age range. California is specified. Over 100K means income brackets 100_125, 125_150, 150_200, 200_250, 250_500, 500_plus.

So the final answer should include California and all specified income brackets. No specific interests mentioned, so interests array should be empty.
</think>

```json
{
  "chain_of_thought": "California is explicitly mentioned. Young professionals typically means 25-44 age range. Over 100K means income brackets 100_125, 125_150, 150_200, 200_250, 250_500, 500_plus.",
  "ages": [
    "pop_25_34",
    "pop_35_44"
  ],
  "gender": [
    "All"
  ],
  "income": [
    "100_125",
    "125_150",
    "150_200",
    "200_250",
    "250_500",
    "500_plus"
  ],
  "cities": [],
  "states": [
    "California"
  ],
  "zip_codes": [],
  "dmas": [],
  "interests": []
}
```



✓ Response contains valid JSON

Test 2: Target wealthy seniors in Florida who like golf




Response:
<think>
Okay, let's tackle this query: Target wealthy seniors in Florida who like golf.

First, I need to parse the age range. Since it's mentioned as "seniors," typically means 65+ (pop_65_74, pop_75_plus). 

Florida is explicitly mentioned. Golf relates to sports interest category.

So putting it all together: Age brackets pop_65_74 and pop_75_plus in Florida. Sports interest category.
</think>

```json
{
  "chain_of_thought": "Florida is specified. Golf relates to sports interest category.",
  "ages": [
    "pop_65_74",
    "pop_75_plus"
  ],
  "gender": [
    "All"
  ],
  "income": [],
  "cities": [],
  "states": [
    "Florida"
  ],
  "zip_codes": [],
  "dmas": [],
  "interests": [
    "sports"
  ]
}
```



✓ Response contains valid JSON

Test 3: Target middle-aged families in the Midwest with kids




Response:
<think>
Okay, let's see. Middle-aged typically means 25-64 age range. Families is explicitly mentioned. No specific interests mentioned, so I should leave interests empty.

Response:
```json
{
  "chain_of_thought": "Middle-aged is 25-64. Families relates to family_and_relationships interest category.",
  "ages": [
    "pop_25_34",
    "pop_35_44",
    "pop_45_54",
    "pop_55_64"
  ],
  "gender": [
    "All"
  ],
  "income": [],
  "cities": [],
  "states": [],
  "zip_codes": [],
  "dmas": [],
  "interests": [
    "family_and_relationships"
  ]
}
```



✓ Response contains valid JSON

You can now interact with your model using:
ollama run qwen-demographics-finetuned
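
The regular expression in the test cell stops at the first `}`. That works for the flat schema shown in the responses, but it would cut off a response containing nested objects, and it can also match braces inside the `<think>` text. A more forgiving variant, sketched here under the assumption that responses look like the samples above (an optional `<think>` section followed by one JSON object), drops the reasoning text and lets `json.JSONDecoder.raw_decode` find the end of the object. `extract_json`, `missing_keys`, and `EXPECTED_KEYS` are hypothetical names; the key set is copied from the sample outputs.

```python
import json
import re

EXPECTED_KEYS = {"chain_of_thought", "ages", "gender", "income", "cities",
                 "states", "zip_codes", "dmas", "interests"}

def extract_json(response_text):
    """Return the first JSON object found in a model response, or None."""
    # Drop a closed <think>...</think> section if one is present.
    text = re.sub(r"<think>.*?</think>", "", response_text, flags=re.DOTALL)
    decoder = json.JSONDecoder()
    # Try each "{" in turn until one starts a parseable JSON object.
    for match in re.finditer(r"\{", text):
        try:
            obj, _ = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict):
            return obj
    return None

def missing_keys(obj):
    """List expected schema keys that are absent from a parsed response."""
    return sorted(EXPECTED_KEYS - obj.keys())

sample = '<think>Ages {25-44}.</think>\n{"ages": ["pop_25_34"], "extra": {"states": ["California"]}}'
parsed = extract_json(sample)
print(parsed["extra"]["states"])
print(missing_keys(parsed))
```

Checking `missing_keys` in addition to JSON validity gives an early signal when the model drops a field.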


## Summary

🎉 **Congratulations!** You've successfully completed the entire workflow:

### ✅ What We Accomplished:
1. **Fine-tuned** a Qwen 0.6B model for demographic targeting
2. **Converted** the model to GGUF format for efficient inference
3. **Created** an Ollama model file pointing to your fine-tuned model
4. **Deployed** the model with Ollama for local inference

### 🚀 Next Steps:
- **Improve Training Data**: Add more diverse examples to improve model performance
- **Experiment with Quantization**: Try different levels (Q4_0, Q8_0) for size/performance trade-offs
- **Production Integration**: Use the model in your application
- **Performance Monitoring**: Track model performance and retrain as needed
- **Scale Up**: Consider larger models if you need better performance
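
Valid JSON is not the same as a correct answer: in the test run above, the Florida request came back with an empty `income` list despite asking for wealthy seniors, and the Midwest request returned no `states`. To track quality over time, each response could be compared field by field against a hand-labelled expectation. The sketch below shows one way to do that. `field_scores` is a hypothetical helper, and the expected income brackets are an assumption made for illustration, not labels taken from the training data.

```python
LIST_FIELDS = ["ages", "gender", "income", "cities", "states",
               "zip_codes", "dmas", "interests"]

def field_scores(predicted, expected):
    """Jaccard overlap per list field; two empty lists count as a match."""
    scores = {}
    for field in LIST_FIELDS:
        p = set(predicted.get(field, []))
        e = set(expected.get(field, []))
        scores[field] = 1.0 if not (p | e) else len(p & e) / len(p | e)
    return scores

# Prediction copied from the "wealthy seniors in Florida" response.
predicted = {
    "ages": ["pop_65_74", "pop_75_plus"], "gender": ["All"], "income": [],
    "cities": [], "states": ["Florida"], "zip_codes": [], "dmas": [],
    "interests": ["sports"],
}
# Hypothetical label: which brackets count as "wealthy" is an assumption.
expected = dict(predicted, income=["200_250", "250_500", "500_plus"])

scores = field_scores(predicted, expected)
for field, score in scores.items():
    print(f"{field}: {score:.2f}")
```

Averaging these scores over a held-out set of labelled prompts would give a simple number to watch between retraining runs.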

### 📋 Command Reference:

```bash
# Fine-tune the model
python qwenfinetune.py

# Convert to GGUF format
python llama.cpp/convert_hf_to_gguf.py ./qwen-demographics-finetuned --outfile ./gguf_output/qwen-demographics-finetuned.gguf --outtype f16

# Edit the FROM line of the modelfile so it reads: FROM ./gguf_output/qwen-demographics-finetuned.gguf

# Create new Ollama model
ollama create qwen-demographics-finetuned -f qwen_demographics_modelfile

# Run the model
ollama run qwen-demographics-finetuned
```
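
The `--outtype f16` flag in the conversion command stores each weight in 16 bits, which is why the f16 models in the `ollama list` output weigh in at about 1.2 GB while the Q8_0 ones are 639 MB. In llama.cpp's block formats, Q8_0 stores 32 weights as 32 int8 values plus one 16-bit scale (8.5 bits per weight), and Q4_0 stores 32 weights in 16 bytes plus one 16-bit scale (4.5 bits per weight). A back-of-the-envelope sketch, assuming a nominal 0.6 billion parameters and ignoring file metadata and any tensors kept at higher precision:

```python
# Effective bits per weight, including the per-block scale factor.
BITS_PER_WEIGHT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}

def estimated_size_mb(n_params, outtype):
    """Rough file size in megabytes for a given weight format."""
    return n_params * BITS_PER_WEIGHT[outtype] / 8 / 1e6

N_PARAMS = 0.6e9  # nominal parameter count (assumption)

for outtype in BITS_PER_WEIGHT:
    print(f"{outtype}: ~{estimated_size_mb(N_PARAMS, outtype):.0f} MB")
```

The estimate lands close to the sizes reported above for f16 and Q8_0, which suggests a Q4_0 build would be roughly half the Q8_0 size, at some cost in output quality.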

### 💡 Key Benefits of This Approach:
- **Local Control**: You own the model weights and can run inference locally
- **Cost Effective**: No API costs after initial training
- **Privacy**: Data stays on your local machine
- **Customization**: Model is specifically trained for your demographic targeting use case
- **Small Footprint**: 0.6B model runs efficiently on consumer hardware

The model is now ready to be integrated into your application for demographic targeting!
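
For integration into an application, Ollama's local HTTP API avoids spawning `ollama run` for every request. A minimal sketch, assuming the service listens on the default `http://localhost:11434`, that the model was created under the name `qwen-demographics-finetuned` as in the notebook, and that the non-streaming `/api/generate` response carries the generated text in a `response` field. `build_request` and `parse_targeting` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumption)
MODEL_NAME = "qwen-demographics-finetuned"

def build_request(prompt):
    """Build a non-streaming generate request for one targeting prompt."""
    body = {
        "model": MODEL_NAME,
        "prompt": f"Parse this demographic targeting request: {prompt}",
        "stream": False,
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_targeting(prompt, timeout=60):
    """Send the prompt to a running Ollama service and return the raw text."""
    with urllib.request.urlopen(build_request(prompt), timeout=timeout) as resp:
        return json.load(resp)["response"]

# Building the request does not contact the server, so this runs offline.
request = build_request("Target wealthy seniors in Florida who like golf")
print(request.get_method(), request.full_url)
```

Calling `parse_targeting(...)` requires the Ollama service to be running; the returned text still contains the `<think>` section and JSON block seen in the test outputs, so it needs the same parsing step.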