# Tutorial on Optimizing Language Models

This tutorial explores advanced optimization techniques for language models like GPT-2. The goal is to 
reduce computational costs and environmental impact while retaining high performance.

### Objectives:
1. Explain optimization methods such as pruning, quantization, and distillation, including their mathematical foundations.
2. Measure the impact of optimizations on performance and carbon emissions.
3. Visualize and compare results to identify the most effective techniques.

### Why Optimize?
LLMs require significant computational resources. By optimizing models:
- **Efficiency**: Reduce inference time and memory usage.
- **Sustainability**: Lower carbon emissions.
- **Deployability**: Enable models to run on resource-constrained devices.

### References:
- **Pruning**: Han et al., ["Learning both Weights and Connections for Efficient Neural Networks"](https://arxiv.org/abs/1506.02626)
- **Quantization**: Jacob et al., ["Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"](https://arxiv.org/abs/1712.05877)
- **Knowledge Distillation**: Hinton et al., ["Distilling the Knowledge in a Neural Network"](https://arxiv.org/abs/1503.02531)

---


In [1]:
#checking hardware related information

import os
import platform
import psutil
import torch

def check_hardware():
    print("=== CPU Information ===")
    print(f"Processor: {platform.processor()}")
    print(f"CPU Count: {os.cpu_count()}")
    print(f"Logical CPUs: {psutil.cpu_count(logical=True)}")
    print(f"Physical CPUs: {psutil.cpu_count(logical=False)}")
    # cpu_freq() can return None on platforms where the frequency is unavailable
    freq = psutil.cpu_freq()
    print(f"CPU Frequency: {freq.current if freq else 'N/A'} MHz")
    
    print("\n=== RAM Information ===")
    ram = psutil.virtual_memory()
    print(f"Total RAM: {ram.total / 1024**3:.2f} GB")
    print(f"Available RAM: {ram.available / 1024**3:.2f} GB")
    
    print("\n=== GPU Information ===")
    if torch.cuda.is_available():
        print(f"GPU Count: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
            print(f"Memory Allocated: {torch.cuda.memory_allocated(i) / 1024**2:.2f} MB")
            print(f"Memory Cached: {torch.cuda.memory_reserved(i) / 1024**2:.2f} MB")
    else:
        print("No GPU available")
    
    print("\n=== System Information ===")
    print(f"System: {platform.system()}")
    print(f"Machine: {platform.machine()}")
    print(f"Node: {platform.node()}")
    print(f"Version: {platform.version()}")

check_hardware()


=== CPU Information ===
Processor: x86_64
CPU Count: 32
Logical CPUs: 32
Physical CPUs: 16
CPU Frequency: 2387.6037187499996 MHz

=== RAM Information ===
Total RAM: 188.59 GB
Available RAM: 170.31 GB

=== GPU Information ===
GPU Count: 1
GPU 0: NVIDIA A40
Memory Allocated: 0.00 MB
Memory Cached: 0.00 MB

=== System Information ===
System: Linux
Machine: x86_64
Node: gpu028
Version: #147~18.04.1-Ubuntu SMP Sat Oct 15 13:10:18 UTC 2022


In [2]:
#suppress warnings
import os
import warnings

# Suppress specific warnings or errors
warnings.filterwarnings("ignore")

# Disable SLURM-related queries in the environment
os.environ["CODECARBON_SCONTROL_WARNING"] = "false"


## Import Libraries

Import the necessary libraries:
1. **`time`**: For measuring execution time of code.
2. **`torch`**: PyTorch library for tensor computations and deep learning.
3. **`transformers`**: Hugging Face tools for pre-trained transformer models.
4. **`copy`**: For creating deep or shallow copies of objects.
5. **`os`**: To interact with the operating system (e.g., environment variables).
6. **`logging`**: For debugging and monitoring with log messages.
7. **`matplotlib.pyplot`**: For creating visualizations and charts.
8. **`codecarbon`**: To estimate carbon emissions from computations.

In [3]:
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPT2Config
import copy
import os
import logging
import matplotlib.pyplot as plt
from codecarbon import EmissionsTracker

### Disabling Tokenizer Parallelism

Setting `TOKENIZERS_PARALLELISM` to `False` disables multi-threading inside the Hugging Face tokenizers library, which avoids fork-related warnings and potential deadlocks during tokenization.

In [4]:
os.environ["TOKENIZERS_PARALLELISM"]="False"



## Initialize Model and Tokenizer

#### Explanation
1. **Tokenization**: Converts text into a sequence of tokens (integers) that the model can process.
2. **Model Loading**: GPT-2, a pre-trained generative model, is loaded.
3. **Device Setup**: Ensures computations run on a GPU (if available) for efficiency.




In [5]:
# Initialize model and tokenizer
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
print(f"Using device: {device}")

Using device: cuda



## Measure Baseline Performance

We measure the **inference time** and **carbon emissions** of the unoptimized model using `codecarbon`. 
This baseline serves as a reference to evaluate optimizations.

#### Key Metrics:
- **Duration**: Time taken for the model to generate outputs.
- **Emissions**: Estimated carbon footprint in kg CO2eq.
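
Wall-clock timings of a first run often include one-time setup costs (for example, GPU context creation), which can inflate a baseline relative to later runs. The sketch below shows one way to separate warm-up calls from timed iterations. `time_callable` and its parameters are hypothetical helpers for illustration and are not part of this notebook; on a GPU, a synchronization call such as `torch.cuda.synchronize()` before reading the clock would likely be needed as well.

```python
import time

def time_callable(fn, warmup=1, iterations=5):
    """Return per-iteration wall-clock times for fn(), excluding warm-up calls."""
    for _ in range(warmup):
        fn()  # Untimed: absorbs one-time setup costs
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

# Stand-in workload; in the notebook this would be a call that runs inference
calls = []
timings = time_callable(lambda: calls.append(sum(range(10_000))), warmup=2, iterations=3)
print(f"calls made: {len(calls)}, timed iterations: {len(timings)}")
print(f"mean time per iteration: {sum(timings) / len(timings):.6f} s")
```

Reporting the mean over several timed iterations, rather than a single total, also makes runs with different iteration counts easier to compare.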



In [6]:
def run_inference(model, prompt, model_device, num_iterations=5):
    model.eval()
    results = []  # Initialize as a list to collect decoded outputs
    with torch.no_grad():
        for _ in range(num_iterations):
            inputs = tokenizer(prompt, return_tensors="pt").to(model_device)
            outputs = model.generate(**inputs, max_length=50, pad_token_id=model.config.eos_token_id)
            decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
            results.append(decoded_output)  # Append decoded text to the list
    print(f"Total outputs generated: {len(results)}")  # Debug statement
    return results  # Return all collected outputs

# Measure wall-clock duration and estimated emissions for one model
def measure_performance(model, prompt, name, cc_verbose=True):
    model_device = next(model.parameters()).device
    log_level = "info" if cc_verbose else "warning"
    tracker = EmissionsTracker(log_level=log_level)
    stopped = False  # Tracks whether tracker.stop() has already run
    try:
        tracker.start()
        start_time = time.time()
        
        results = run_inference(model, prompt, model_device)  # Get all outputs
        
        duration = time.time() - start_time
        emissions = tracker.stop()
        stopped = True
        
        print(f"\n{name}:")
        print(f"Duration: {duration:.2f} seconds")
        print(f"Estimated Emissions: {emissions:.6f} kg CO2eq")
        print(f"Generated Texts: {results}")  # Print all collected outputs
        return duration, emissions
    except Exception as e:
        print(f"Error measuring {name}: {e}")
        return None, None
    finally:
        # Stop the tracker only if the try block did not reach tracker.stop(),
        # so the tracker is not stopped (and logged) twice on success
        if not stopped:
            tracker.stop()


baseline_duration, baseline_emissions = measure_performance(model, "Once upon a time", "Baseline")


[codecarbon INFO @ 13:27:11] [setup] RAM Tracking...
[codecarbon INFO @ 13:27:11] [setup] GPU Tracking...
[codecarbon INFO @ 13:27:12] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 13:27:12] [setup] CPU Tracking...


ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 13:27:16] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:27:16] >>> Tracker's metadata:
[codecarbon INFO @ 13:27:16]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 13:27:16]   Python version: 3.10.12
[codecarbon INFO @ 13:27:16]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 13:27:16]   Available RAM : 1.000 GB
[codecarbon INFO @ 13:27:16]   CPU count: 2
[codecarbon INFO @ 13:27:16]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:27:16]   GPU count: 1
[codecarbon INFO @ 13:27:16]   GPU model: 1 x NVIDIA A40
[codecarbon INFO @ 13:27:20] Saving emissions data to file /fs01/projects/green-ai/Shaina/emissions.csv
[codecarbon INFO @ 13:27:36] Energy consumed for RAM : 0.000002 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 13:27:36] Energy consumed for all GPUs : 0.000366 kWh. Total GPU Power : 87.05686032317743 W
[codecarbon INFO @ 13:27:36] Energy co

Total outputs generated: 5
ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json


[codecarbon INFO @ 13:27:41] Energy consumed for RAM : 0.000002 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 13:27:41] Energy consumed for all GPUs : 0.000501 kWh. Total GPU Power : 86.89445059900272 W
[codecarbon INFO @ 13:27:41] Energy consumed for all CPUs : 0.000240 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 13:27:41] 0.000743 kWh of electricity used since the beginning.



Baseline:
Duration: 19.25 seconds
Estimated Emissions: 0.000028 kg CO2eq
Generated Texts: ['Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a', 'Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a', 'Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a', 'Once upon a time, the world was a place of great beauty and great danger. The world was a place of great danger, and the world was a place of great danger. The world was a place of great danger, and the world was a', 'Once upon a time, the world was a p


## **Method 1: Model Pruning**

Model pruning is a widely-used optimization technique that eliminates unimportant or redundant weights from a neural network. By reducing the number of active parameters, it minimizes computational requirements and leads to a smaller, more efficient model. The resulting weight matrix is typically sparse:

```
W_pruned = W * M
```

Where `*` denotes element-wise multiplication and:
- `W` is the original weight matrix.
- `M` is a binary mask where:
  - `1` indicates weights retained.
  - `0` indicates weights pruned.
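
As a small worked instance of `W_pruned = W * M`, the sketch below builds the mask `M` by magnitude: the entries of `W` with the smallest absolute values are marked `0`. The helper name `magnitude_mask` and the example matrix are assumptions made for illustration; magnitude is only one possible importance criterion.

```python
import torch

def magnitude_mask(weight: torch.Tensor, amount: float) -> torch.Tensor:
    """Binary mask that zeroes the `amount` fraction of smallest-|w| entries."""
    k = int(amount * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    # k-th smallest absolute value acts as the pruning threshold
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).to(weight.dtype)

W = torch.tensor([[0.9, -0.1, 0.4],
                  [-0.05, 0.7, -0.3]])
M = magnitude_mask(W, amount=0.5)   # 1 = weight retained, 0 = weight pruned
W_pruned = W * M                    # element-wise product

print(M)
print(W_pruned)
```

Note that `W_pruned` has the same shape as `W`; the zeros are still stored, so memory and speed gains depend on sparse storage or kernels that exploit the sparsity.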

---

### **Types of Pruning**
1. **Global Pruning**: Removes weights across the entire model based on importance.
2. **Layer-Wise Pruning**: Targets specific layers for pruning, maintaining balance in model architecture.
3. **Structured Pruning**: Eliminates entire neurons, channels, or filters, simplifying computations further.
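
The three variants above can be sketched with PyTorch's `torch.nn.utils.prune` module. The toy network and the pruning amounts below are assumptions chosen only to keep the example small; they are not the settings used in this tutorial.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

torch.manual_seed(0)

def make_net():
    # Hypothetical toy network used only for illustration
    return nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 4))

def sparsity(weight):
    """Fraction of entries that are exactly zero."""
    return (weight == 0).float().mean().item()

# Global pruning: rank all listed weights together and zero the smallest 30%
global_net = make_net()
prune.global_unstructured(
    [(global_net[0], "weight"), (global_net[2], "weight")],
    pruning_method=prune.L1Unstructured,
    amount=0.3,
)

# Layer-wise pruning: zero the smallest 50% of weights in one chosen layer
layer_net = make_net()
prune.l1_unstructured(layer_net[0], name="weight", amount=0.5)

# Structured pruning: zero 25% of output neurons (rows), ranked by L2 norm
struct_net = make_net()
prune.ln_structured(struct_net[2], name="weight", amount=0.25, n=2, dim=0)

print(f"global: layer 0 = {sparsity(global_net[0].weight):.2f}, "
      f"layer 2 = {sparsity(global_net[2].weight):.2f}")
print(f"layer-wise: layer 0 = {sparsity(layer_net[0].weight):.2f}")
print(f"structured: layer 2 = {sparsity(struct_net[2].weight):.2f}")
```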
---

### **Advantages**
1. **Accelerated Inference**: Smaller models require fewer computations, enabling faster predictions—especially critical for real-time applications like speech recognition or online recommendations.
2. **Memory Efficiency**: Reduces the memory footprint, making models deployable on resource-constrained devices such as mobile phones, IoT devices, or edge servers.
3. **Energy Savings**: Fewer computations lead to lower energy consumption, contributing to sustainable AI practices.

---

### **Trade-offs and Considerations**
1. **Potential Accuracy Loss**: Removing too many weights or critical ones can degrade model accuracy. Careful pruning is essential.
2. **Increased Complexity**: Requires iterative fine-tuning or retraining to recover lost accuracy, especially in aggressive pruning scenarios.
3. **Hardware Support**: Sparse matrices may not achieve peak performance on some hardware architectures unless optimized libraries are used.

---

### **Key References**
- Han et al., ["Learning both Weights and Connections for Efficient Neural Networks"](https://arxiv.org/abs/1506.02626)  
- Hugging Face Documentation on [Pruning](https://huggingface.co/docs/optimum/v1.2.1/en/intel/pruning)  


In [7]:
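# 1. Model Pruning
print("\n1. Model Pruning")
print("Model pruning involves removing less important weights from the model.")
def prune_model(model, pruning_amount=0.3):
    # Note: the mask below is random, so weights are zeroed regardless of their
    # magnitude or importance, and every parameter tensor (including embeddings,
    # biases, and LayerNorm) is affected. The zeros are still stored densely.
    for param in model.parameters():
        # Keep each entry with probability (1 - pruning_amount)
        mask = torch.rand(param.shape, device=param.device) > pruning_amount
        param.data *= mask
    return model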

pruned_model = prune_model(copy.deepcopy(model))
pruned_duration, pruned_emissions = measure_performance(pruned_model, "Once upon a time", "Pruned Model")


1. Model Pruning
Model pruning involves removing less important weights from the model.


[codecarbon INFO @ 13:27:41] [setup] RAM Tracking...
[codecarbon INFO @ 13:27:41] [setup] GPU Tracking...
[codecarbon INFO @ 13:27:41] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 13:27:41] [setup] CPU Tracking...
[codecarbon INFO @ 13:27:43] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:27:43] >>> Tracker's metadata:
[codecarbon INFO @ 13:27:43]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 13:27:43]   Python version: 3.10.12
[codecarbon INFO @ 13:27:43]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 13:27:43]   Available RAM : 1.000 GB
[codecarbon INFO @ 13:27:43]   CPU count: 2
[codecarbon INFO @ 13:27:43]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:27:43]   GPU count: 1
[codecarbon INFO @ 13:27:43]   GPU model: 1 x NVIDIA A40


ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 13:27:46] Saving emissions data to file /fs01/projects/green-ai/Shaina/emissions.csv
[codecarbon INFO @ 13:27:50] Energy consumed for RAM : 0.000000 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 13:27:50] Energy consumed for all GPUs : 0.000104 kWh. Total GPU Power : 97.11724794784514 W
[codecarbon INFO @ 13:27:50] Energy consumed for all CPUs : 0.000045 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 13:27:50] 0.000149 kWh of electricity used since the beginning.
[codecarbon INFO @ 13:27:50] Energy consumed for RAM : 0.000000 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 13:27:50] Energy consumed for all GPUs : 0.000104 kWh. Total GPU Power : 0.0 W
[codecarbon INFO @ 13:27:50] Energy consumed for all CPUs : 0.000046 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 13:27:50] 0.000150 kWh of electricity used since the beginning.


Total outputs generated: 5
ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json

Pruned Model:
Duration: 3.84 seconds
Estimated Emissions: 0.000006 kg CO2eq
Generated Texts: ['Once upon a time comm one II II X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X', 'Once upon a time comm one II II X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X', 'Once upon a time comm one II II X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X', 'Once upon a time comm one II II X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X', 'Once upon a time comm one II II X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X X']
ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json



## **Method 2: Quantization**

Quantization is an optimization technique that reduces the precision of model parameters, typically converting 32-bit floating-point weights to 8-bit integers. This decreases the memory footprint and computational overhead, making the model more efficient. It is defined as:

```
w_quantized = round(w / s)
```

Where:
- `w` is the original weight value.
- `s` is a scaling factor used to map the floating-point values into the integer range.

The quantized model uses these integer values during inference, significantly improving performance, especially on hardware optimized for lower precision arithmetic.
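
To make the formula concrete, the sketch below quantizes a handful of weights to 8-bit integers and maps them back to floating point. Choosing `s` from the largest absolute weight (symmetric quantization) and clamping to `[-127, 127]` are assumptions made for this example; the tutorial's quantized model relies on PyTorch's own scheme.

```python
import torch

w = torch.tensor([-1.0, -0.26, 0.0, 0.5, 1.27])

# Scaling factor s maps the largest |w| onto the int8 limit 127
s = w.abs().max() / 127

# w_quantized = round(w / s), clamped to the representable integer range
q = torch.round(w / s).clamp(-127, 127).to(torch.int8)

# Dequantize to inspect the rounding error introduced by quantization
w_hat = q.to(torch.float32) * s

print("scale:", s.item())
print("quantized:", q.tolist())
print("max abs error:", (w - w_hat).abs().max().item())
```

Each weight lands within half a quantization step (`s / 2`) of its original value, which is the source of the accuracy loss discussed under the trade-offs.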

---

### **Types of Quantization**

1. **Dynamic Quantization**: Quantizes weights ahead of time and quantizes activations on the fly at runtime, so no calibration data is needed.
2. **Static Quantization**: Quantizes both weights and activations using ranges calibrated on a dataset beforehand.
3. **Quantization-Aware Training**: Incorporates quantization directly into the training process, often yielding better accuracy than post-training quantization.

---

### **Advantages**

Quantization offers several benefits. Storing weights as 8-bit integers takes a quarter of the memory needed for 32-bit floating-point weights. Integer arithmetic is also computationally cheaper, which can accelerate inference, especially on CPUs or hardware with dedicated low-precision support. The reduced computation can in turn lower energy consumption.

---
### **Trade-offs and Considerations**

Despite its advantages, quantization has trade-offs. It may lead to a slight accuracy loss depending on the model and task. Moreover, the performance gains are hardware-dependent, relying on the underlying system's ability to efficiently handle integer operations.

---
### **Key References**
- Jacob et al., ["Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference"](https://arxiv.org/abs/1712.05877)
- [PyTorch Quantization](https://pytorch.org/docs/stable/quantization.html)
- Hugging Face Documentation: ["Quantization"](https://huggingface.co/docs/transformers/en/main_classes/quantization)
    
 

In [8]:
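# 2. Quantization
print("\n2. Quantization")
print("Quantization reduces the precision of the model's weights, typically from 32-bit to 8-bit.")
print("Note: Dynamic quantization is currently only supported on CPU.")
# Copy first: model.cpu() moves the module in place, which would also move the
# baseline model off the GPU for the later cells
model_cpu = copy.deepcopy(model).cpu()
# Only torch.nn.Linear modules are quantized. GPT-2 in transformers implements most
# projections with its own Conv1D class, so this call mainly affects lm_head.
quantized_model = torch.quantization.quantize_dynamic(
    model_cpu, {torch.nn.Linear}, dtype=torch.qint8
)
# This run executes on CPU, so its timing is not directly comparable to GPU runs
quantized_duration, quantized_emissions = measure_performance(quantized_model, "Once upon a time", "Quantized Model (CPU)")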


2. Quantization
Quantization reduces the precision of the model's weights, typically from 32-bit to 8-bit.
Note: Dynamic quantization is currently only supported on CPU.


[codecarbon INFO @ 13:28:17] [setup] RAM Tracking...
[codecarbon INFO @ 13:28:18] [setup] GPU Tracking...
[codecarbon INFO @ 13:28:18] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 13:28:18] [setup] CPU Tracking...


ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 13:28:21] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:28:21] >>> Tracker's metadata:
[codecarbon INFO @ 13:28:21]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 13:28:21]   Python version: 3.10.12
[codecarbon INFO @ 13:28:21]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 13:28:21]   Available RAM : 1.000 GB
[codecarbon INFO @ 13:28:21]   CPU count: 2
[codecarbon INFO @ 13:28:21]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:28:21]   GPU count: 1
[codecarbon INFO @ 13:28:21]   GPU model: 1 x NVIDIA A40
[codecarbon INFO @ 13:28:26] Saving emissions data to file /fs01/projects/green-ai/Shaina/emissions.csv
[codecarbon INFO @ 13:28:41] Energy consumed for RAM : 0.000002 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 13:28:41] Energy consumed for all GPUs : 0.000373 kWh. Total GPU Power : 88.37316561476844 W
[codecarbon INFO @ 13:28:41] Energy co

Total outputs generated: 5
ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json


[codecarbon INFO @ 13:28:54] Energy consumed for RAM : 0.000003 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 13:28:54] Energy consumed for all GPUs : 0.000687 kWh. Total GPU Power : 84.9694006112296 W
[codecarbon INFO @ 13:28:54] Energy consumed for all CPUs : 0.000332 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 13:28:54] 0.001023 kWh of electricity used since the beginning.



Quantized Model (CPU):
Duration: 27.52 seconds
Estimated Emissions: 0.000040 kg CO2eq
Generated Texts: ['Once upon a time the "a" and "b" in the "-" and "-" in the "-" and "-" in the "-" and, except in the, the, and, the, the, the', 'Once upon a time the "a" and "b" in the "-" and "-" in the "-" and "-" in the "-" and, except in the, the, and, the, the, the', 'Once upon a time the "a" and "b" in the "-" and "-" in the "-" and "-" in the "-" and, except in the, the, and, the, the, the', 'Once upon a time the "a" and "b" in the "-" and "-" in the "-" and "-" in the "-" and, except in the, the, and, the, the, the', 'Once upon a time the "a" and "b" in the "-" and "-" in the "-" and "-" in the "-" and, except in the, the, and, the, the, the']
ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json




## **Method 3: Knowledge Distillation**

Knowledge Distillation is a model compression technique where a smaller, simpler model (the **student model**) learns to mimic the behavior of a larger, more complex model (the **teacher model**). This process transfers knowledge from the teacher to the student, allowing the smaller model to approximate the larger model's performance with significantly fewer parameters.

The student model is trained on the teacher model's predictions in addition to the actual labels. This involves minimizing a distillation loss:

```
L = (1 - α) * L_hard + α * L_soft
```

Where:
- \( L_{hard} \): Loss based on the actual labels (e.g., cross-entropy).
- \( L_{soft} \): Loss based on the teacher's predictions (softened using temperature \( T \)).
- \( \alpha \): Balances the contribution of hard and soft losses.
- \( T \): A temperature parameter used to soften the output probabilities of the teacher.

The softened probabilities allow the student to capture finer-grained information about the teacher's decision-making.
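
A minimal sketch of this loss in PyTorch follows. The text does not specify the form of the soft loss, so using the KL divergence between temperature-softened distributions, and scaling it by `T * T` as suggested by Hinton et al., are assumptions; the function name, logits, and labels are hypothetical.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, T=2.0):
    """L = (1 - alpha) * L_hard + alpha * L_soft."""
    # L_hard: cross-entropy against the actual labels
    hard = F.cross_entropy(student_logits, labels)
    # L_soft: KL divergence between softened teacher and student distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # Assumed scaling so gradients stay comparable across temperatures
    return (1 - alpha) * hard + alpha * soft

# Hypothetical logits for a batch of 2 examples and 3 classes
student_logits = torch.tensor([[1.0, 0.2, -0.5], [0.3, 0.9, 0.1]])
teacher_logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
labels = torch.tensor([0, 1])

loss = distillation_loss(student_logits, teacher_logits, labels)
print(f"distillation loss: {loss.item():.4f}")
```

With `alpha = 1.0` the student is trained on the teacher's soft targets only; with `alpha = 0.0` the loss reduces to ordinary supervised cross-entropy.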

---

### **Advantages of Knowledge Distillation**

Knowledge distillation offers several benefits, including **model compression**, where the smaller student model is faster, consumes less memory, and is ideal for deployment. It facilitates **knowledge transfer**, enabling the student model to inherit nuanced understanding from the teacher, such as handling ambiguous inputs. Additionally, it improves **efficiency**, making distilled models suitable for edge devices with limited computational power.

---

### **Trade-offs and Considerations**

Despite its advantages, knowledge distillation comes with trade-offs. There may be an **accuracy gap**, as the student model might not fully replicate the teacher's performance. Furthermore, the process involves **training overhead**, requiring additional computational resources during the initial training phase of the student model.

---

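In this tutorial, we use **DistilGPT2**, a pre-trained distilled version of GPT-2, as our student model. DistilGPT2 has 6 transformer layers and about 82 million parameters, compared with 12 layers and about 124 million parameters for the smallest GPT-2, at the cost of some language-modeling quality.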


---

### **Key References**:
- Hinton et al., ["Distilling the Knowledge in a Neural Network"](https://arxiv.org/abs/1503.02531)
- Hugging Face documentation on [DistilGPT2](https://huggingface.co/distilgpt2)


In [None]:
# 3. Knowledge Distillation
print("\n3. Knowledge Distillation")
print("Using a pre-trained smaller model (DistilGPT2) as an example of knowledge distillation.")
distil_model = AutoModelForCausalLM.from_pretrained("distilgpt2").to(device)
distil_duration, distil_emissions = measure_performance(distil_model, "Once upon a time", "Distilled Model")


3. Knowledge Distillation
Using a pre-trained smaller model (DistilGPT2) as an example of knowledge distillation.


[codecarbon INFO @ 13:29:01] [setup] RAM Tracking...
[codecarbon INFO @ 13:29:02] [setup] GPU Tracking...
[codecarbon INFO @ 13:29:02] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 13:29:02] [setup] CPU Tracking...
[codecarbon INFO @ 13:29:03] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:29:03] >>> Tracker's metadata:
[codecarbon INFO @ 13:29:03]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 13:29:03]   Python version: 3.10.12
[codecarbon INFO @ 13:29:03]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 13:29:03]   Available RAM : 1.000 GB
[codecarbon INFO @ 13:29:03]   CPU count: 2
[codecarbon INFO @ 13:29:03]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 13:29:03]   GPU count: 1
[codecarbon INFO @ 13:29:03]   GPU model: 1 x NVIDIA A40


ref: /fs01/projects/green-ai/envs/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv



## **Method 4: Efficient Attention**

Efficient attention refers to optimizations applied to the attention mechanism in transformer models to reduce memory and computational overhead. The standard self-attention mechanism in transformers has a computational complexity of \(O(n^2)\), where \(n\) is the sequence length. For large inputs, this quadratic complexity can become a bottleneck.

Efficient attention mechanisms aim to:
1. Reduce the memory footprint required for attention computations.
2. Speed up the attention process, especially for long sequences.
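
For comparison with the simplified wrapper used in this section, the sketch below contrasts a naive attention implementation, which materializes the full `n x n` score matrix, with PyTorch's fused `torch.nn.functional.scaled_dot_product_attention`. It assumes PyTorch 2.0 or later; the tensor shapes are arbitrary, and this is not the method measured in this tutorial.

```python
import math
import torch
import torch.nn.functional as F

def naive_attention(q, k, v):
    """Causal attention that builds the full (n x n) score matrix explicitly."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    causal = torch.ones(q.size(-2), k.size(-2), dtype=torch.bool).tril()
    scores = scores.masked_fill(~causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
# Shape: (batch, heads, sequence length, head dimension)
q, k, v = (torch.randn(1, 2, 16, 8) for _ in range(3))

reference = naive_attention(q, k, v)
# The fused call may dispatch to a memory-efficient kernel where one is available
fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print("outputs match:", torch.allclose(fused, reference, atol=1e-5))
```

Both versions compute the same result; the fused call leaves the choice of kernel to PyTorch, which is where memory and speed savings can come from on supported hardware.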

---

### **How Does it Work?**
In this example, we use a simplified approach where the attention mechanism is wrapped in a custom module that disables gradient computation during the forward pass:

1. **Disabling Gradients**:
   - By applying `torch.no_grad()`, the memory and computation associated with storing gradients for backpropagation are avoided during inference.
   - This is useful for scenarios where the model is used solely for inference, not training.

2. **Class Modification**:
   - The `EfficientAttention` class wraps each `torch.nn.MultiheadAttention` module found in the model so that its forward pass runs without gradient tracking.
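
The effect of such a wrapper can be checked on a standalone attention module. In the sketch below, `NoGradWrapper` is a hypothetical stand-in with the same idea as the wrapper described above, and the module sizes are arbitrary.

```python
import torch

class NoGradWrapper(torch.nn.Module):
    """Runs the wrapped module's forward pass without gradient tracking."""
    def __init__(self, inner):
        super().__init__()
        self.inner = inner

    def forward(self, *args, **kwargs):
        with torch.no_grad():
            return self.inner(*args, **kwargs)

torch.manual_seed(0)
attention = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, 4, 8)  # (batch, sequence length, embedding dimension)

tracked, _ = attention(x, x, x)                    # Builds an autograd graph
untracked, _ = NoGradWrapper(attention)(x, x, x)   # No graph is recorded

print("tracked requires_grad:", tracked.requires_grad)
print("untracked requires_grad:", untracked.requires_grad)
```

The wrapper changes only whether an autograd graph is recorded; the attention computation itself is unchanged.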

---

### **Advantages**
1. **Reduced Memory Usage**:
   - Memory-intensive gradients are skipped, making the model more lightweight during inference.
2. **Improved Speed**:
   - Eliminating gradient computations accelerates the forward pass, especially for large inputs or batch sizes.
3. **Simplicity**:
   - The modification is minimally invasive, as it only wraps the existing attention module.

---

### **Trade-offs and Considerations**
1. **No Training Support**:
   - The wrapper is designed for inference and cannot be used for training without enabling gradients.
2. **Applicability**:
   - This implementation does not change the quadratic complexity of standard attention; it only avoids gradient bookkeeping for the existing operations. When inference already runs inside `torch.no_grad()`, as in `run_inference`, the wrapper is not expected to add further savings.

---

### **Key References**
- **Transformers**: Vaswani et al., ["Attention is All You Need"](https://arxiv.org/abs/1706.03762)
- **Efficient Attention**: Tay et al., ["Efficient Transformers: A Survey"](https://arxiv.org/abs/2009.06732)



In [None]:
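print("\n4. Efficient Attention")
print("This is a simplified example of implementing more efficient attention mechanisms.")
class EfficientAttention(torch.nn.Module):
    """Wraps an attention module so its forward pass runs without gradient tracking."""
    def __init__(self, attention):
        super().__init__()
        self.attention = attention

    def forward(self, *args, **kwargs):
        with torch.no_grad():
            output = self.attention(*args, **kwargs)
        return output

def apply_efficient_attention(model):
    # Collect matches first so the module tree is not modified while iterating
    targets = [name for name, module in model.named_modules()
               if name and isinstance(module, torch.nn.MultiheadAttention)]
    for name in targets:
        # named_modules() yields dotted paths, so replace the child on its parent
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, EfficientAttention(getattr(parent, child_name)))
    # Hugging Face GPT-2 uses its own attention class, not torch.nn.MultiheadAttention,
    # so this count may be 0 for the model loaded above
    print(f"Wrapped {len(targets)} attention module(s)")
    return model

efficient_model = apply_efficient_attention(copy.deepcopy(model))
efficient_duration, efficient_emissions = measure_performance(efficient_model, "Once upon a time", "Efficient Attention Model")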



## **Method 5: Smaller Model**

Creating a smaller model involves customizing the architecture of a neural network to reduce its size while maintaining acceptable performance. For transformers like GPT-2, this means adjusting key parameters such as:
- **Number of Layers** (\(n_{layer}\)): Determines the depth of the model.
- **Number of Attention Heads** (\(n_{head}\)): Affects how the model attends to different parts of the input.
- **Embedding Dimensions** (\(n_{embd}\)): Controls the size of token representations.

In this example, we create a scaled-down GPT-2 model with:
- 6 layers (\(n_{layer} = 6\)).
- 8 attention heads (\(n_{head} = 8\)).
- Embedding size of 512 (\(n_{embd} = 512\)).
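
To get a feel for how much these settings shrink the model, the sketch below estimates the parameter count of a GPT-2 style network from its configuration. The helper `gpt2_param_count` is hypothetical; it assumes the default `GPT2Config` vocabulary size (50257) and context length (1024), a feed-forward width of four times the embedding size, and tied input and output embeddings.

```python
def gpt2_param_count(n_layer, n_embd, vocab_size=50257, n_positions=1024):
    """Estimate the number of parameters in a GPT-2 style model."""
    d = n_embd
    embeddings = vocab_size * d + n_positions * d   # Token + position embeddings
    attention = (d * 3 * d + 3 * d) + (d * d + d)   # QKV projection + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)     # Two feed-forward projections
    layer_norms = 2 * (2 * d)                       # Two LayerNorms per block
    final_norm = 2 * d                              # LayerNorm after the last block
    return embeddings + n_layer * (attention + mlp + layer_norms) + final_norm

base = gpt2_param_count(n_layer=12, n_embd=768)   # Default GPT-2 (small) settings
small = gpt2_param_count(n_layer=6, n_embd=512)   # Settings used in this example

print(f"GPT-2 (12 layers, 768 dims): {base:,} parameters")
print(f"Smaller (6 layers, 512 dims): {small:,} parameters")
print(f"Size ratio: {small / base:.2f}")
```

The number of attention heads does not appear in the estimate because the heads split the same projection matrices; it only needs to divide the embedding size evenly (512 / 8 = 64 dimensions per head here).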

---

### **Advantages**
1. **Faster Inference**:
   - Smaller models require fewer computations, leading to faster predictions.
2. **Memory Efficiency**:
   - Reduced size decreases the memory footprint, enabling deployment on devices with limited resources.
3. **Energy Savings**:
   - Fewer computations also lower energy consumption, making the model more eco-friendly.

---

### **Trade-offs**
1. **Accuracy Loss**:
   - Smaller models often have reduced capacity to capture complex patterns, which can result in slight accuracy degradation.
2. **Task-Specific Performance**:
   - Customizing the architecture may require fine-tuning to ensure it meets the demands of specific tasks.
---

### **Advantages of Smaller Models**
- **Deployment Flexibility**:
   - Ideal for edge devices, mobile phones, or low-power environments.
- **Cost Efficiency**:
   - Reduces the infrastructure cost of serving models, especially at scale.

---

### **Key References**
- Hugging Face Documentation: [Model Configuration](https://huggingface.co/docs/transformers/main_classes/configuration)
- Vaswani et al., ["Attention is All You Need"](https://arxiv.org/abs/1706.03762)

 

In [None]:

print("\n5. Smaller Model")
print("This creates a new, smaller GPT-2 model with fewer layers, heads, and embedding dimensions.")

# Create a smaller GPT-2 model
# from_config builds the architecture with randomly initialized weights, so this
# model is untrained and its generated text is not meaningful without training
config = GPT2Config(n_layer=6, n_head=8, n_embd=512)
smaller_model = AutoModelForCausalLM.from_config(config).to("cpu")  # Adjust device as needed

# Measure performance
smaller_duration, smaller_emissions = measure_performance(smaller_model, "Once upon a time", "Smaller Model")

# Compare against None explicitly so that a valid measurement of 0 is not treated as a failure
if smaller_duration is not None and smaller_emissions is not None:
    print("\nSmaller Model Results:")
    print(f"Duration: {smaller_duration:.2f} seconds")
    print(f"Estimated Emissions: {smaller_emissions:.6f} kg CO2eq")
else:
    print("Failed to measure Smaller Model performance.")



## **Calculate Improvements**

This step quantifies the effectiveness of each optimization technique by comparing their performance to the baseline model. The key metrics analyzed are:

- **Inference Time**: Measures how much faster the optimized model generates predictions.
- **Carbon Emissions**: Estimates how optimization reduces the environmental impact of model computations.

The percentage improvement for each metric is calculated using the formula:
   ```
   Improvement (%) = ((Baseline Metric - Optimized Metric) / Baseline Metric) * 100
   ```
This formula is applied to both **inference time** and **carbon emissions**. Results are reported as percentages: a positive value means the optimized model improved on the baseline, and a negative value means it did worse.

The optimized models are the Pruned, Quantized, Distilled, Efficient Attention, and Smaller models. Each model's performance is compared to the baseline.
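
As a worked example, the sketch below applies the formula to the durations printed earlier in this notebook (baseline 19.25 s, pruned 3.84 s, quantized 27.52 s). The function name `improvement` is a hypothetical stand-in for the helper defined in the next cell.

```python
def improvement(baseline, optimized):
    """Percentage improvement of an optimized metric relative to the baseline."""
    return (baseline - optimized) / baseline * 100

baseline_s = 19.25    # Baseline duration reported above
pruned_s = 3.84       # Pruned model duration reported above
quantized_s = 27.52   # Quantized model (CPU) duration reported above

print(f"Pruned time improvement: {improvement(baseline_s, pruned_s):.2f}%")
print(f"Quantized time improvement: {improvement(baseline_s, quantized_s):.2f}%")
```

The quantized value comes out negative because that run was slower than the baseline; it ran on CPU while the baseline ran on GPU, so the figure reflects the device change as well as the quantization.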


In [None]:
from tabulate import tabulate  #pip install tabulate

# Function to calculate improvements
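def calculate_improvement(baseline, optimized):
    # Compare against None explicitly so that a valid measurement of 0 is not discarded;
    # a zero baseline is rejected to avoid division by zero
    if baseline is not None and optimized is not None and baseline != 0:
        return (baseline - optimized) / baseline * 100
    return None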

# List of models with durations and emissions
models = [
    ("Pruned", pruned_duration, pruned_emissions),
    ("Quantized", quantized_duration, quantized_emissions),
    ("Distilled", distil_duration, distil_emissions),
    ("Efficient Attention", efficient_duration, efficient_emissions),
    ("Smaller", smaller_duration, smaller_emissions)
]

# Prepare table data with baseline and optimized values
table_data = []
for name, duration, emissions in models:
    time_improvement = calculate_improvement(baseline_duration, duration)
    emissions_improvement = calculate_improvement(baseline_emissions, emissions)
    if time_improvement is not None and emissions_improvement is not None:
        table_data.append([
            name, 
            baseline_duration, duration, f"{time_improvement:.2f}%",
            baseline_emissions, emissions, f"{emissions_improvement:.2f}%"
        ])
    else:
        table_data.append([name, "N/A", "N/A", "Measurement failed", "N/A", "N/A", "Measurement failed"])

# Print extended table
print("\nDetailed Improvements:")
print(tabulate(table_data, headers=["Model", "Baseline Time", "Optimized Time", "Time Improvement", 
                                    "Baseline Emissions", "Optimized Emissions", "Emissions Improvement"], 
               tablefmt="grid"))


In [None]:
# Visualization
import matplotlib.pyplot as plt

print("\nCreating visualization...")
# Keep only models for which both measurements succeeded
measured = [m for m in models if m[1] is not None and m[2] is not None]
names = [m[0] for m in measured]
durations = [m[1] for m in measured]
emissions = [m[2] for m in measured]

if names and durations and emissions:
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 12))
    ax1.bar(names, durations, color='skyblue')
    ax1.set_ylabel('Duration (seconds)')
    ax1.set_title('Inference Duration Comparison')
    ax1.axhline(y=baseline_duration, color='r', linestyle='--', label='Baseline')
    ax1.legend()

    ax2.bar(names, emissions, color='lightgreen')
    ax2.set_ylabel('Estimated Emissions (kg CO2eq)')
    ax2.set_title('Estimated Carbon Emissions Comparison')
    ax2.axhline(y=baseline_emissions, color='r', linestyle='--', label='Baseline')
    ax2.legend()

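    plt.tight_layout()
    # Save before show(): once the figure has been shown and closed, savefig() may write an empty image
    plt.savefig('llm_optimization_comparison.png')
    plt.show()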

else:
    print("Not enough data to create visualization.")


In [None]:
print("\nConclusion:")
print("This notebook demonstrated various optimization techniques for LLMs and their impact on performance and estimated carbon emissions.")
print("The results show that different techniques can lead to significant improvements in both speed and environmental impact.")
print("However, it's important to note that these optimizations may affect model accuracy, which should be evaluated separately.")
print("The emissions estimates provided by CodeCarbon are more accurate than the previous rough estimates, but should still be considered approximations.")

## On Optimization Methods and Efficiency Gains

While the optimization methods implemented in this notebook aim to reduce carbon emissions and computational time, the following factors might limit their effectiveness:

1. **Hardware Dependency**: 
   - Some methods, like mixed-precision training or quantization, only pay off on hardware with native low-precision support (e.g., GPUs with Tensor Cores).
   - Without compatible hardware, these optimizations might not yield significant gains.

2. **Model Characteristics**: 
   - The architecture of certain models may inherently limit the effectiveness of optimizations.
   - Larger models may still consume significant energy, even with optimizations, due to their complexity.

3. **Overhead Costs**: 
   - Methods like gradient checkpointing trade extra computation for lower memory use during training, so they can lengthen runtime rather than shorten it.

4. **Use Case Variability**: 
   - Optimizations might perform differently across tasks (e.g., fine-tuning, inference), and their benefits may vary depending on the dataset and prompt complexity.

Therefore, while these methods improve efficiency in many cases, their impact depends on the specific context and hardware configuration.
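
As a rough illustration of the hardware-dependency point, the sketch below reports whether the local GPU is likely to have Tensor Cores. It assumes an NVIDIA GPU queried through PyTorch; `has_tensor_cores` and `describe_accelerator` are hypothetical helper names, and the threshold of compute capability 7.0 corresponds to Volta, the generation in which Tensor Cores were introduced.

```python
def has_tensor_cores(capability):
    """Return True if a CUDA compute capability (major, minor) is 7.0 or newer."""
    major, _minor = capability
    return major >= 7

def describe_accelerator():
    """Summarize the first CUDA device, or explain why none can be queried."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "PyTorch is not installed; cannot query the GPU."
    if not torch.cuda.is_available():
        return "No CUDA GPU detected; low-precision speedups on GPU do not apply here."
    name = torch.cuda.get_device_name(0)
    capability = torch.cuda.get_device_capability(0)  # e.g. (8, 0)
    verdict = "likely has" if has_tensor_cores(capability) else "likely lacks"
    return f"{name} (compute capability {capability[0]}.{capability[1]}) {verdict} Tensor Cores."

print(describe_accelerator())
```

Running a check like this before benchmarking helps explain why the same optimization can show a large gain on one machine and almost none on another.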


## Prepared By

- **Name**: **Shaina Raza, PhD** [shaina.raza@vectorinstitute.ai](mailto:shaina.raza@vectorinstitute.ai)
- **Affiliation**: Vector Institute for Artificial Intelligence

This notebook was prepared as part of a practical guide for efficient evaluation and optimization of large language models (LLMs), with an emphasis on reducing carbon emissions and computational costs.
