<a href="https://colab.research.google.com/github/peremartra/optipfair/blob/main/examples/basic_pruning_mlp.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# OptiPFair Notebook Series – Example: Basic Pruning (MLP)

![optiPfair Logo](https://github.com/peremartra/optipfair/blob/main/images/optiPfair.png?raw=true)


This notebook demonstrates how to use [OptiPFair](https://github.com/peremartra/optipfair) for structured pruning of transformer models with GLU-based MLP layers.  
The example covers both percentage-based and expansion-rate-based pruning strategies.

## Recommended Environment

- **Platform**: [Google Colab](https://colab.research.google.com)  
- **Hardware**: GPU runtime (recommended: T4 or better for 1B–3B models)  
- **Dependencies**: Installed automatically in the first cell (optipfair, transformers, torch)

## by Pere Martra

- [LinkedIn](https://www.linkedin.com/in/pere-martra/?originalSubdomain=es)  
- [GitHub](https://github.com/peremartra)  
- [X / Twitter](https://x.com/peremartra)

---

> If you find this useful, please ⭐ the [repository](https://github.com/peremartra/optipfair) and share it!
---
To have your favorite LLM generate code with OptiPFair, give it the file [**optipfair_llm_reference_manual.txt**](https://github.com/peremartra/optipfair/blob/main/optipfair_llm_reference_manual.txt), which contains the information it needs to use the library correctly.


# Basic OptiPFair Pruning Example

This notebook demonstrates how to use OptiPFair for structured pruning of language models.
OptiPFair focuses on pruning MLP layers with GLU (Gated Linear Unit) architecture,
which is commonly found in modern models like LLaMA, Gemma, Mistral, and others.

Author: Pere Martra

Designed for Google Colab - GPU runtime recommended

---
## Installation and Setup


In [None]:
!pip install -q transformers optipfair torch



## Import Libraries and Check GPU


In [None]:
import torch
import os
import gc
from transformers import AutoModelForCausalLM, AutoTokenizer
from optipfair import prune_model

# Check device availability
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")

Using device: cuda
GPU: NVIDIA L4
GPU Memory: 22.2 GB


## Configuration


In [None]:
# List of models to test - you can add more GLU-compatible models here
# Note: For Colab, stick to smaller models due to memory constraints
MODELS_TO_TEST = [
    "meta-llama/Llama-3.2-1B",
    # "google/gemma-2-2b",  # Uncomment if you have enough GPU memory
    # Add more models here as needed
]

# Pruning configuration - modify these values as needed
PRUNING_PERCENTAGE = 20  # Percentage of neurons to remove (0-100)
TARGET_EXPANSION_RATE = 200  # Alternative: target expansion rate (e.g., 200% instead of ~400%)

# Test prompts for evaluation
TEST_PROMPTS = [
    "Paris is the capital of",
    "The theory of relativity states that",
    "Machine learning is a field of",
]

print("Configuration set successfully!")
print(f"Models to test: {MODELS_TO_TEST}")
print(f"Pruning percentage: {PRUNING_PERCENTAGE}%")
print(f"Target expansion rate: {TARGET_EXPANSION_RATE}%")


Configuration set successfully!
Models to test: ['meta-llama/Llama-3.2-1B']
Pruning percentage: 20%
Target expansion rate: 200%


## Introduction to Structured Pruning

---
This example demonstrates structured pruning of MLP layers in transformer models.

Structured pruning removes entire neurons while keeping the model architecture intact. Because the weight matrices become physically smaller, the pruned model needs less memory and less compute at inference time without requiring sparse kernels.
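
To make "removing entire neurons" concrete, here is a minimal, dependency-free sketch on a toy GLU block. In a GLU MLP, each intermediate neuron owns one row of the gate projection, one row of the up projection, and one column of the down projection, so all three must be sliced together. The helper names (`neuron_scores`, `prune_glu`) and the max-absolute-weight score are illustrative assumptions, not OptiPFair's implementation.

```python
# Toy GLU MLP stored as plain lists (hidden size 2, intermediate size 4).
# gate_proj / up_proj: [intermediate][hidden]; down_proj: [hidden][intermediate].
gate_proj = [[0.9, -0.1], [0.05, 0.02], [-0.7, 0.3], [0.01, -0.03]]
up_proj = [[0.4, 0.2], [0.03, -0.01], [0.6, -0.5], [0.02, 0.04]]
down_proj = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]


def neuron_scores(gate, up):
    """Illustrative score: largest |weight| in the gate row plus largest in the up row."""
    return [
        max(abs(w) for w in g_row) + max(abs(w) for w in u_row)
        for g_row, u_row in zip(gate, up)
    ]


def prune_glu(gate, up, down, keep):
    """Keep the `keep` highest-scoring neurons, preserving their original order."""
    scores = neuron_scores(gate, up)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:keep])
    new_gate = [gate[i] for i in kept]                   # drop gate rows
    new_up = [up[i] for i in kept]                       # drop the same up rows
    new_down = [[row[i] for i in kept] for row in down]  # drop matching down columns
    return new_gate, new_up, new_down, kept


new_gate, new_up, new_down, kept = prune_glu(gate_proj, up_proj, down_proj, keep=2)
print("Kept neurons:", kept)
print("gate/up rows:", len(new_gate), "| down columns:", len(new_down[0]))
```

The block keeps the same input and output width; only its intermediate width shrinks, which is why the result is still an ordinary dense model.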


## Utility Functions


In [None]:
def count_parameters(model):
    """Count total parameters in model"""
    return sum(p.numel() for p in model.parameters())

def test_model_generation(model, tokenizer, prompt, max_length=50):
    """Test text generation with the model"""
    inputs = tokenizer(prompt, return_tensors='pt').to(device)

    with torch.no_grad():
        outputs = model.generate(
            inputs['input_ids'],
            attention_mask=inputs['attention_mask'],
            max_length=max_length,
            num_return_sequences=1,
            pad_token_id=tokenizer.pad_token_id,
            do_sample=False,
            num_beams=3,
            early_stopping=True,
            no_repeat_ngram_size=2
        )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

def print_model_info(model, model_name, stage=""):
    """Print basic model information"""
    param_count = count_parameters(model)
    print(f"{stage} Model: {model_name}")
    print(f"Parameters: {param_count:,}")
    return param_count

def cleanup_memory():
    """Clean up GPU memory - important for Colab"""
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

print("Utility functions defined successfully!")

Utility functions defined successfully!



## OptiPFair Parameters Explanation

- `model`: The model to be pruned.
- `pruning_type`: Type of pruning (currently only `'MLP_GLU'` is supported).
- `neuron_selection_method`: Method used to calculate neuron importance:
  - `'MAW'`: Maximum Absolute Weight (recommended for most models)
  - `'VOW'`: Variance of Weights (alternative method)
  - `'PON'`: Product of Norms (alternative method)
- `pruning_percentage`: Percentage of neurons to remove (0-100).
- `expansion_rate`: Alternative to `pruning_percentage`; the target expansion rate as a percentage (e.g., 200). Set `pruning_percentage=None` when using it.
- `show_progress`: Display a progress bar during pruning.
- `return_stats`: Return detailed statistics about the pruning.
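
The two sizing parameters describe the same quantity from different angles. The sketch below assumes the expansion rate is the MLP intermediate size divided by the hidden size, expressed as a percentage, which is consistent with the rates reported in this notebook's output. The function names are hypothetical helpers for illustration, not part of the OptiPFair API; the dimensions are those of Llama-3.2-1B (hidden size 2048, intermediate size 8192).

```python
def expansion_rate(hidden_size, intermediate_size):
    """Intermediate width as a percentage of the hidden width (assumed definition)."""
    return 100.0 * intermediate_size / hidden_size


def percentage_for_target_rate(hidden_size, intermediate_size, target_rate):
    """Pruning percentage that would shrink the MLP to `target_rate`."""
    current = expansion_rate(hidden_size, intermediate_size)
    if target_rate >= current:
        raise ValueError("target rate must be below the current expansion rate")
    return 100.0 * (1.0 - target_rate / current)


# Llama-3.2-1B dimensions
hidden, intermediate = 2048, 8192
print(f"Current expansion rate: {expansion_rate(hidden, intermediate):.2f}%")
print(f"Percentage for a 200% target: {percentage_for_target_rate(hidden, intermediate, 200):.2f}%")
```

Under this assumption, asking for a 200% expansion rate on this model is equivalent to removing half of the MLP neurons in every layer.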

## Example 1 - Pruning by Percentage Function


In [None]:
def example_pruning_by_percentage(model, tokenizer, model_name):
    """Example of pruning by neuron percentage"""
    print(f"=== Example 1: Pruning {model_name} by {PRUNING_PERCENTAGE}% ===")

    # Get original model info
    original_params = print_model_info(model, model_name, "Original")

    # Test original model
    print("\n--- Original Model Generation ---")
    for prompt in TEST_PROMPTS[:2]:  # Test first 2 prompts
        generated = test_model_generation(model, tokenizer, prompt)
        print(f"Prompt: '{prompt}'")
        print(f"Generated: {generated}")
        print()

    # Apply pruning by percentage
    pruned_model, stats = prune_model(
        model=model,
        pruning_type="MLP_GLU",
        neuron_selection_method="MAW",  # Change to "VOW" or "PON" to try other methods
        pruning_percentage=PRUNING_PERCENTAGE,
        show_progress=True,
        return_stats=True
    )

    # Print pruning statistics
    print("\n--- Pruning Results ---")
    print(f"Original parameters: {stats['original_parameters']:,}")
    print(f"Pruned parameters: {stats['pruned_parameters']:,}")
    print(f"Reduction: {stats['reduction']:,} parameters ({stats['percentage_reduction']:.2f}%)")
    print(f"Final expansion rate: {stats['expansion_rate']:.2f}%")

    # Test pruned model
    print("\n--- Pruned Model Generation ---")
    for prompt in TEST_PROMPTS[:2]:
        generated = test_model_generation(pruned_model, tokenizer, prompt)
        print(f"Prompt: '{prompt}'")
        print(f"Generated: {generated}")
        print()

    return pruned_model, stats

print("Example 1 function defined!")

Example 1 function defined!



## Example 2 - Pruning by Expansion Rate Function


In [None]:
def example_pruning_by_expansion_rate(model, tokenizer, model_name):
    """Example of pruning by target expansion rate"""
    print(f"=== Example 2: Pruning {model_name} to {TARGET_EXPANSION_RATE}% expansion rate ===")

    # Get original model info
    original_params = print_model_info(model, model_name, "Original")

    # Apply pruning by expansion rate
    pruned_model, stats = prune_model(
        model=model,
        pruning_type="MLP_GLU",
        neuron_selection_method="MAW",
        pruning_percentage=None,  # Must be None when using expansion_rate
        expansion_rate=TARGET_EXPANSION_RATE,  # Target expansion rate instead of percentage
        show_progress=True,
        return_stats=True
    )

    # Print pruning statistics
    print("\n--- Pruning Results ---")
    print(f"Original parameters: {stats['original_parameters']:,}")
    print(f"Pruned parameters: {stats['pruned_parameters']:,}")
    print(f"Reduction: {stats['reduction']:,} parameters ({stats['percentage_reduction']:.2f}%)")
    print(f"Target expansion rate: {TARGET_EXPANSION_RATE}%")
    print(f"Actual expansion rate: {stats['expansion_rate']:.2f}%")

    # Test pruned model with one prompt
    print("\n--- Pruned Model Generation ---")
    prompt = TEST_PROMPTS[0]
    generated = test_model_generation(pruned_model, tokenizer, prompt)
    print(f"Prompt: '{prompt}'")
    print(f"Generated: {generated}")
    print()

    return pruned_model, stats

print("Example 2 function defined!")

Example 2 function defined!


## Run Example 1 - Pruning by Percentage


In [None]:
print("Starting Example 1: Pruning by Percentage")
print("=" * 50)

# Process the first model in the list
model_name = MODELS_TO_TEST[0]
print(f"Loading model: {model_name}")

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device.type == 'cuda' else torch.float32,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Set pad token if not present
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Run Example 1
pruned_model_1, stats_1 = example_pruning_by_percentage(model, tokenizer, model_name)

# Store stats for summary
results = [{
    'model': model_name,
    'method': 'Percentage',
    'reduction': stats_1['percentage_reduction'],
    'expansion_rate': stats_1['expansion_rate']
}]

print(f"\nExample 1 completed! Reduction: {stats_1['percentage_reduction']:.2f}%")

Starting Example 1: Pruning by Percentage
Loading model: meta-llama/Llama-3.2-1B


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


=== Example 1: Pruning meta-llama/Llama-3.2-1B by 20% ===
Original Model: meta-llama/Llama-3.2-1B
Parameters: 1,235,814,400

--- Original Model Generation ---


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Prompt: 'Paris is the capital of'
Generated: Paris is the capital of France and the largest city in the country. It is located on the River Seine and is one of the most popular tourist destinations in Europe. The city has a population of over 2.2 million people, making

Prompt: 'The theory of relativity states that'
Generated: The theory of relativity states that the speed of light is the same in all inertial frames of reference. In other words, light always travels at a constant speed, regardless of the motion of its source. This means that if you are moving



Pruning layers: 100%|██████████| 16/16 [00:05<00:00,  2.81it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



--- Pruning Results ---
Original parameters: 1,235,814,400
Pruned parameters: 1,074,792,448
Reduction: 161,021,952 parameters (13.03%)
Final expansion rate: 320.02%

--- Pruned Model Generation ---


The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


Prompt: 'Paris is the capital of'
Generated: Paris is the capital of France. It is also known as the City of Lights, because it is considered to be one of the most beautiful cities in the world. Paris is famous for its architecture, its museums, and its restaurants. There are

Prompt: 'The theory of relativity states that'
Generated: The theory of relativity states that there is no such thing as absolute simultaneity, that is to say that two objects moving at the same speed do not necessarily have to be in the exact same time frame. In other words, they do


Example 1 completed! Reduction: 13.03%
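
The 13.03% total reduction is smaller than the requested 20% because only the MLP projections shrink; embeddings and attention weights are untouched. The arithmetic can be reproduced from the model's dimensions (hidden size 2048, intermediate size 8192, 16 layers) and the parameter count in the log above. Rounding the per-layer neuron count down is an assumption that happens to reproduce the logged numbers, and the variable names are illustrative.

```python
hidden_size = 2048
intermediate_size = 8192
num_layers = 16
total_params = 1_235_814_400  # "Original parameters" from the log above
pruning_percentage = 20

# Assumption: the number of neurons removed per layer is rounded down
neurons_removed = int(intermediate_size * pruning_percentage / 100)
neurons_kept = intermediate_size - neurons_removed

# Each neuron owns one gate_proj row, one up_proj row, and one down_proj column
params_per_neuron = 3 * hidden_size
params_removed = neurons_removed * params_per_neuron * num_layers

reduction_pct = 100 * params_removed / total_params
final_rate = 100 * neurons_kept / hidden_size

print(f"Neurons removed per layer: {neurons_removed}")
print(f"Parameters removed: {params_removed:,} ({reduction_pct:.2f}%)")
print(f"Final expansion rate: {final_rate:.2f}%")
```

The MLP layers hold roughly two thirds of this model's parameters, so removing 20% of their neurons removes about 13% of the total.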


## Run Example 2 - Pruning by Expansion Rate


In [None]:
print("Starting Example 2: Pruning by Expansion Rate")
print("=" * 50)

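# Clean up memory from the previous example.
# `model` and `pruned_model_1` point to the same pruned weights (the model was
# modified by pruning), so both names must be released before the cache is emptied.
del pruned_model_1, model
cleanup_memory()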

# Reload model for second example (since first one was modified)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if device.type == 'cuda' else torch.float32,
    device_map="auto"
)

# Run Example 2
pruned_model_2, stats_2 = example_pruning_by_expansion_rate(model, tokenizer, model_name)

# Add to results
results.append({
    'model': model_name,
    'method': 'Expansion Rate',
    'reduction': stats_2['percentage_reduction'],
    'expansion_rate': stats_2['expansion_rate']
})

print(f"\nExample 2 completed! Reduction: {stats_2['percentage_reduction']:.2f}%")


Starting Example 2: Pruning by Expansion Rate
=== Example 2: Pruning meta-llama/Llama-3.2-1B to 200% expansion rate ===
Original Model: meta-llama/Llama-3.2-1B
Parameters: 1,235,814,400


Pruning layers: 100%|██████████| 16/16 [00:03<00:00,  4.49it/s]
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.



--- Pruning Results ---
Original parameters: 1,235,814,400
Pruned parameters: 833,161,216
Reduction: 402,653,184 parameters (32.58%)
Target expansion rate: 200%
Actual expansion rate: 200.00%

--- Pruned Model Generation ---
Prompt: 'Paris is the capital of'
Generated: Paris is the capital of France. It’s the biggest city in the country of the region of Laigneueu. There’s been a long history of time in this city, that it’s one’s most famous. That it is,


Example 2 completed! Reduction: 32.58%


## Results Summary

In [None]:
print("\n" + "="*60)
print("PRUNING RESULTS SUMMARY")
print("="*60)
print(f"{'Model':<30} {'Method':<15} {'Reduction':<12} {'Expansion Rate':<15}")
print("-" * 75)

for result in results:
    # Build each percentage as a single string so the % sign stays attached to the number
    reduction = f"{result['reduction']:.2f}%"
    expansion = f"{result['expansion_rate']:.2f}%"
    print(f"{result['model']:<30} {result['method']:<15} {reduction:<12} {expansion:<15}")

print(f"\nTotal pruning runs: {len(results)}")
print("Pruning examples completed successfully!")




PRUNING RESULTS SUMMARY
Model                          Method          Reduction    Expansion Rate 
---------------------------------------------------------------------------
meta-llama/Llama-3.2-1B        Percentage      13.03%       320.02%
meta-llama/Llama-3.2-1B        Expansion Rate  32.58%       200.00%

Total pruning runs: 2
Pruning examples completed successfully!
