
# Optimizing LLMs Notebook

This notebook dynamically initializes and evaluates Llama 3.2 1B and other large language models (LLMs).
The performance and emissions of the models are tracked using the CodeCarbon library.

---


### Instructions to Access a Model from Hugging Face, Set Up a Login, and Use It in Your Notebook

#### 1. Access a Model on Hugging Face
- Visit the [Hugging Face Model Hub](https://huggingface.co/models). 
- Search for your desired model, e.g., `meta-llama/Llama-3.2-1B-Instruct`, or `HuggingFaceTB/SmolLM2-360M-Instruct` for faster loading:
  - [https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)
  - [https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM2-360M-Instruct)
- Check if the model requires special permissions:
  - If restricted, look for a **"Request Access"** or **"Fill Form"** button.
  - Follow the instructions to request access.

---

#### 2. Filling Out Forms for Restricted Models
- If the model is restricted:
  1. Click the **"Request Access"** button.
  2. Fill out the form with:
     - Your **email address**.
     - A brief **reason for accessing the model** (e.g., research or benchmarking).
     - Your **organization** or **affiliation** (if applicable).
  3. Submit the form and wait for approval (typically within a few hours to days).

---

#### 3. Log In to Hugging Face from Your Notebook
1. Install the required library:
   ```bash
   pip install huggingface_hub
   ```

#### 4. Obtain Your Access Token

1. Go to your [Hugging Face Settings](https://huggingface.co/settings/tokens).
2. Click "New Token" and generate the token.
3. Copy the token and pass it to the `login` function.
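
Hard-coding a token in a notebook makes it easy to leak when the notebook is shared. A minimal sketch of an alternative, assuming the token has been exported in the `HF_TOKEN` environment variable before the notebook starts; the helper name `resolve_hf_token` is hypothetical and not part of `huggingface_hub`:

```python
import os

def resolve_hf_token(env=None):
    """Return the token stored in HF_TOKEN, or None if it is unset or blank."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    return token or None

# Usage (requires huggingface_hub to be installed):
# from huggingface_hub import login
# token = resolve_hf_token()
# if token is not None:
#     login(token=token)
```

With this pattern the notebook itself never contains the secret, so it can be committed or shared as-is.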


In [1]:
from huggingface_hub import login

# Authenticate using your Hugging Face token.
# Uncomment the line below and enter your token.
# login(token="your-access-token")




In [2]:
# Check hardware-related information (CPU, RAM, GPU, OS)

import os
import platform
import psutil
import torch

def check_hardware():
    print("=== CPU Information ===")
    print(f"Processor: {platform.processor()}")
    print(f"CPU Count: {os.cpu_count()}")
    print(f"Logical CPUs: {psutil.cpu_count(logical=True)}")
    print(f"Physical CPUs: {psutil.cpu_count(logical=False)}")
    # psutil.cpu_freq() can return None on platforms that do not expose it
    freq = psutil.cpu_freq()
    if freq is not None:
        print(f"CPU Frequency: {freq.current} MHz")
    
    print("\n=== RAM Information ===")
    ram = psutil.virtual_memory()
    print(f"Total RAM: {ram.total / 1024**3:.2f} GB")
    print(f"Available RAM: {ram.available / 1024**3:.2f} GB")
    
    print("\n=== GPU Information ===")
    if torch.cuda.is_available():
        print(f"GPU Count: {torch.cuda.device_count()}")
        for i in range(torch.cuda.device_count()):
            print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
            print(f"Memory Allocated: {torch.cuda.memory_allocated(i) / 1024**2:.2f} MB")
            print(f"Memory Cached: {torch.cuda.memory_reserved(i) / 1024**2:.2f} MB")
    else:
        print("No GPU available")
    
    print("\n=== System Information ===")
    print(f"System: {platform.system()}")
    print(f"Machine: {platform.machine()}")
    print(f"Node: {platform.node()}")
    print(f"Version: {platform.version()}")

check_hardware()


=== CPU Information ===
Processor: x86_64
CPU Count: 32
Logical CPUs: 32
Physical CPUs: 16
CPU Frequency: 2401.5331875000006 MHz

=== RAM Information ===
Total RAM: 188.59 GB
Available RAM: 179.44 GB

=== GPU Information ===
GPU Count: 1
GPU 0: NVIDIA A40
Memory Allocated: 0.00 MB
Memory Cached: 0.00 MB

=== System Information ===
System: Linux
Machine: x86_64
Node: gpu001
Version: #147~18.04.1-Ubuntu SMP Sat Oct 15 13:10:18 UTC 2022


## Measuring Performance and Emissions of Language Models
This notebook measures the carbon emissions of language-model inference using the Hugging Face `transformers` library and the `codecarbon` package. It provides a `calculate_emissions` function that loads a model, runs a prompt through it, and reports the estimated emissions for the whole run.

**Hugging Face Transformers: Load and evaluate causal language models.**

**CodeCarbon: Monitor and log CO2 emissions.**

**BitsAndBytes: Optimize model execution with low-precision quantization.**

### Standard text generation pipeline

**Initialize a basic text generation pipeline without any optimization, using a pre-trained causal language model and tokenizer from Hugging Face.**

### **Optimizations**

The optimized loading paths draw on several techniques to reduce memory use and speed up inference on limited hardware:

1. **8-bit Quantization (`load_in_8bit=True`)**  
   - Stores weights in 8 bits, roughly a quarter of the `float32` footprint.  
   - Provided by `bitsandbytes`.

2. **Reduced Precision (`torch_dtype=torch.bfloat16`)**  
   - Loads weights in `bfloat16`, halving memory relative to `float32` and speeding up computation.  
   - Requires a GPU with `bfloat16` support (e.g., NVIDIA Ampere or newer).

3. **Gradient Checkpointing (`model.gradient_checkpointing_enable()`)**  
   - Saves memory by recomputing activations during the backward pass.  
   - Helps only during training; inference has no backward pass, so `load_model` in this notebook does not enable it.

4. **Device Mapping (`device_map="auto"`)**  
   - Automatically places model layers on the available devices.  

5. **Tokenization and Preprocessing**  
   - Loads the tokenizer that matches the model so text inputs are converted to tensors.  
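
The memory savings from lower-precision weights follow from simple arithmetic: parameters times bits per weight. The sketch below is a rough estimate under stated assumptions, not a measurement. It counts the weights only and ignores activations, the KV cache, quantization constants, and any layers that stay in higher precision. The function name `weight_memory_gib` and the nominal parameter count are illustrative.

```python
# Bits stored per weight for each loading mode discussed above.
BITS_PER_WEIGHT = {"float32": 32, "bfloat16": 16, "int8": 8, "int4": 4}

def weight_memory_gib(num_params, bits_per_weight):
    """Approximate memory needed for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 1024**3

# Assumption: a nominal 1e9 parameters for a "1B" model.
NUM_PARAMS = 1_000_000_000

for mode, bits in BITS_PER_WEIGHT.items():
    print(f"{mode:>8}: {weight_memory_gib(NUM_PARAMS, bits):.2f} GiB")
```

Each halving of the bit width halves the estimate, which is why 4-bit loading is attractive on small GPUs even when it does not run faster.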

In [3]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from codecarbon import EmissionsTracker




### `load_model` Function

The `load_model` function loads a language model with optional optimizations that reduce memory usage or speed up inference. Its flags select one of four loading modes.

#### Function Parameters
- `model_name` (str): The Hugging Face model ID for the language model to be loaded.
- `optimized` (bool, default: `False`): If `True`, loads the model with 8-bit quantization, or with bfloat16 when `bf16` is also `True`.
- `bf16` (bool, default: `False`): If `True` together with `optimized=True`, loads the weights in bfloat16 (BF16). It has no effect on its own.
- `int4` (bool, default: `False`): If `True`, uses 4-bit quantization. This flag takes precedence over the other two.

#### Optimization Modes
1. **Standard Loading**:
   - If no optimizations are specified, the model is loaded with standard precision (`float32`) and mapped to the available devices (`device_map="auto"`).
   - Example:
     ```python
     model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")
     ```

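2. **8-Bit Quantization (labeled "QLoRA" in the code)**:
   - If `optimized=True` and `bf16=False`, the model is loaded with 8-bit quantization using the `bitsandbytes` library. The code and its output label this mode "QLoRA", but it is plain 8-bit weight quantization; QLoRA proper combines 4-bit quantization with LoRA fine-tuning.
   - Passing `load_in_8bit=True` directly to `from_pretrained` is deprecated (see the warning in the run output); a `BitsAndBytesConfig` is the supported form.
   - Example:
     ```python
     model = AutoModelForCausalLM.from_pretrained(
         model_name,
         quantization_config=BitsAndBytesConfig(load_in_8bit=True),
         device_map="auto",
     )
     ```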

3. **BF16 Precision**:
   - If `optimized=True` and `bf16=True`, the model is loaded with bfloat16 precision for reduced memory usage and faster computation on supported GPUs.
   - Example:
     ```python
     model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="auto")
     ```

4. **INT4 Quantization**:
   - If `int4=True`, the model is loaded with 4-bit quantization using `BitsAndBytesConfig`. This flag is checked first, so it overrides `optimized` and `bf16`.
   - Parameters in `BitsAndBytesConfig`:
     - `load_in_4bit=True`: Enables 4-bit quantization.
     - `bnb_4bit_use_double_quant=True`: Also quantizes the quantization constants, which saves a little more memory.
     - `bnb_4bit_quant_type="nf4"`: Uses the NormalFloat4 data type, which is designed for normally distributed weights.
     - `bnb_4bit_compute_dtype=torch.bfloat16`: Runs computation in bfloat16 after dequantizing the weights.
   - Example:
     ```python
     bnb_config = BitsAndBytesConfig(
         load_in_4bit=True,
         bnb_4bit_use_double_quant=True,
         bnb_4bit_quant_type="nf4",
         bnb_4bit_compute_dtype=torch.bfloat16
     )
     model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", quantization_config=bnb_config)
     ```
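
To make the idea behind the quantized modes concrete, here is a toy illustration of absmax quantization to 8-bit integers in plain Python. It is a simplified sketch of the general technique, not the `bitsandbytes` implementation, which works on blocks of tensors and handles outliers separately. The function names are illustrative.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.0, 1.0]
quantized, scale = quantize_absmax_int8(weights)
restored = dequantize(quantized, scale)

print("quantized:", quantized)
print("max error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight is stored as one small integer plus a shared scale, so the rounding error per weight is at most half the scale. That small loss of precision is the price paid for the memory savings.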

#### Tokenizer
The tokenizer is loaded alongside the model to convert input text into tensors the model accepts. `use_fast=False` requests the slower Python tokenizer implementation, chosen here for compatibility with Llama-based models.

#### Return Value
The function returns a tuple:
- `model`: The loaded and configured model.
- `tokenizer`: The corresponding tokenizer for the model.

#### Usage Example
```python
model, tokenizer = load_model("meta-llama/Llama-3.2-1B-Instruct", optimized=True, bf16=True)
```


In [4]:
def load_model(model_name, optimized=False, bf16=False, int4=False):
    """
    Load the specified language model with optional optimizations.
    """
    if int4:
        print("Loading model with INT4 optimization...")
        # BitsAndBytesConfig for int-4 optimization
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16  # You can adjust to torch.float16 if needed
        )
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            quantization_config=bnb_config
        )
    elif optimized and bf16:
        print("Loading model with BF16 optimization...")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.bfloat16,
            device_map="auto"
        )
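    elif optimized:
        print("Loading model with 8-bit quantization (QLoRA)...")
        # Pass a BitsAndBytesConfig; passing load_in_8bit directly is deprecated
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            quantization_config=BitsAndBytesConfig(load_in_8bit=True),
            device_map="auto"
        )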
    else:
        print("Loading standard model...")
        model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto"
        )
    
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
    return model, tokenizer
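
The order of the branches in `load_model` determines which mode wins when several flags are set. The sketch below mirrors that branch order without loading anything, so the combinations can be checked quickly. The helper name `select_loading_mode` and the returned labels are illustrative.

```python
def select_loading_mode(optimized=False, bf16=False, int4=False):
    """Return the mode load_model would pick for a given flag combination."""
    if int4:
        return "int4"
    if optimized and bf16:
        return "bf16"
    if optimized:
        return "int8"
    return "standard"

for flags in (
    {},
    {"optimized": True},
    {"optimized": True, "bf16": True},
    {"bf16": True},
    {"optimized": True, "bf16": True, "int4": True},
):
    print(flags, "->", select_loading_mode(**flags))
```

Two cases are easy to miss: `bf16=True` without `optimized=True` falls through to standard loading, and `int4=True` overrides both other flags.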


# Function: `run_inference`

## Description
Generates text based on a given prompt using a pre-trained language model and tokenizer.

## Parameters
- **`model`**: Pre-trained language model (e.g., GPT, LLaMA).
- **`tokenizer`**: Tokenizer associated with the model.
- **`prompt`** (str): The input text for inference.
- **`max_length`** (int, optional): Maximum total length in tokens, counting the prompt plus the generated continuation. Default is `50`.

## Returns
- **str**: The decoded output, which contains the prompt followed by the generated continuation.




In [5]:


def run_inference(model, tokenizer, prompt, max_length=50):
    """
    Run inference on the given prompt and return the generated text.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_length=max_length)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
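
Because `max_length` counts the prompt tokens too, a longer prompt leaves less room for generated text. A possible variant, sketched here as an alternative rather than the notebook's method, uses the `max_new_tokens` argument of `generate` so the budget applies to the continuation only. The function name `run_inference_new_tokens` is hypothetical.

```python
def run_inference_new_tokens(model, tokenizer, prompt, max_new_tokens=50):
    """
    Variant of run_inference that limits only the newly generated tokens,
    so the amount of generated text does not depend on the prompt length.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Usage, with a model and tokenizer returned by load_model:
# text = run_inference_new_tokens(model, tokenizer, "Once upon a time,", max_new_tokens=30)
```

When comparing emissions across configurations, a fixed number of new tokens keeps the amount of generation work the same from run to run.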


# Function: `calculate_emissions`

## Description
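Estimates the carbon footprint of loading a model in a given configuration and running one inference with it. The tracker starts before the model is loaded, so the reported figure covers loading as well as generation.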

## Parameters
- **`model_name`** (str): Name of the pre-trained model.
- **`prompt`** (str): Input text for inference.
- **`max_length`** (int, optional): Maximum total length in tokens (prompt plus generated text). Default is `50`.
- **`optimized`** (bool, optional): Enables the optimized loading path (8-bit quantization, or bfloat16 when `bf16=True`). Default is `False`.
- **`bf16`** (bool, optional): Loads the weights in bfloat16. Takes effect only together with `optimized=True`. Default is `False`.
- **`int4`** (bool, optional): Enables 4-bit quantization and overrides the other two flags. Default is `False`.

## Returns
- **float**: Carbon emissions in kilograms of CO2 equivalent (`kgCO2eq`).



In [6]:


def calculate_emissions(model_name, prompt, max_length=50, optimized=False, bf16=False, int4=False):
    """
    Calculate carbon emissions for running inference with a specific model.
    """
    tracker = EmissionsTracker(project_name=f"{model_name} {'INT4' if int4 else 'Optimized' if optimized else 'Standard'}")
    tracker.start()
    
    # Load model and tokenizer
    model, tokenizer = load_model(model_name, optimized=optimized, bf16=bf16, int4=int4)
    
    # Run inference
    print("Running inference...")
    output = run_inference(model, tokenizer, prompt, max_length)
    print(f"Generated Text: {output}")
    
    # Stop tracking and get emissions
    emissions = tracker.stop()
    print(f"Carbon emissions (kgCO2eq): {emissions}")
    return emissions
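
`calculate_emissions` starts the tracker before loading the model, and if loading or generation raises an exception, `tracker.stop()` is never reached. A minimal sketch of a helper that measures a single call and always stops the tracker is shown below. It assumes only that the tracker exposes `start()` and `stop()`, as `EmissionsTracker` does; the helper name `measure_emissions` is hypothetical.

```python
def measure_emissions(tracker, fn, *args, **kwargs):
    """
    Run fn(*args, **kwargs) while the tracker is active.
    Returns (result, emissions). The tracker is stopped even if fn raises.
    """
    tracker.start()
    try:
        result = fn(*args, **kwargs)
    finally:
        emissions = tracker.stop()
    return result, emissions

# Usage, measuring generation only (the model is loaded before tracking starts):
# model, tokenizer = load_model(MODEL_NAME)
# text, kg = measure_emissions(EmissionsTracker(), run_inference, model, tokenizer, PROMPT)
```

Measuring loading and generation separately makes it clearer which part of the pipeline each optimization actually affects.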

# Compare Standard and Optimized Models for Carbon Emissions

## Description
This script calculates and compares the carbon emissions (`kgCO2eq`) of loading a pre-trained model and running inference under four configurations: standard, 8-bit quantization (labeled "QLoRA" in the output), BF16 precision, and INT4 quantization.

## Workflow
1. **Model**: `HuggingFaceTB/SmolLM2-360M-Instruct` (swap in `meta-llama/Llama-3.2-1B-Instruct` once access is granted)
2. **Prompt**: `"Once upon a time, in a world filled with AI magic,"`
3. **Max Length**: `50` tokens, including the prompt

## Configurations
- **Standard Model**: Default setup with no optimizations.
- **QLoRA**: 8-bit quantized weights via `bitsandbytes`. Despite the label, no low-rank adapters are involved.
- **BF16**: Weights loaded in bfloat16 precision.
- **INT4**: 4-bit (NF4) quantized weights via `bitsandbytes`.

## Results
The script prints the carbon emissions for each configuration.

## Example Output
The values below are placeholders that show the output format, not measured results.
```plaintext
Comparison of Emissions:
Standard Model: 0.123456 kgCO2eq
Optimized Model with QLoRA: 0.098765 kgCO2eq
Optimized Model with BF16: 0.087654 kgCO2eq
Optimized Model with INT4: 0.056789 kgCO2eq
```


In [7]:



# Main function to compare standard and optimized models
if __name__ == "__main__":
    # MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # Updated to a valid model ID
    MODEL_NAME = "HuggingFaceTB/SmolLM2-360M-Instruct"
    PROMPT = "Once upon a time, in a world filled with AI magic,"
    MAX_LENGTH = 50

    # Calculate emissions for standard model
    standard_emissions = calculate_emissions(MODEL_NAME, PROMPT, MAX_LENGTH, optimized=False)

    # Calculate emissions for the 8-bit quantized model (labeled QLoRA in the output)
    qlora_emissions = calculate_emissions(MODEL_NAME, PROMPT, MAX_LENGTH, optimized=True, bf16=False)

    # Calculate emissions for optimized model with BF16
    bf16_emissions = calculate_emissions(MODEL_NAME, PROMPT, MAX_LENGTH, optimized=True, bf16=True)

    # Calculate emissions for optimized model with INT4
    int4_emissions = calculate_emissions(MODEL_NAME, PROMPT, MAX_LENGTH, int4=True)

    print("\nComparison of Emissions:")
    print(f"Standard Model: {standard_emissions:.6f} kgCO2eq")
    print(f"Optimized Model with QLoRA: {qlora_emissions:.6f} kgCO2eq")
    print(f"Optimized Model with BF16: {bf16_emissions:.6f} kgCO2eq")
    print(f"Optimized Model with INT4: {int4_emissions:.6f} kgCO2eq")


[codecarbon INFO @ 10:33:47] [setup] RAM Tracking...
[codecarbon INFO @ 10:33:47] [setup] GPU Tracking...
[codecarbon INFO @ 10:33:47] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 10:33:47] [setup] CPU Tracking...
[codecarbon INFO @ 10:33:48] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:33:48] >>> Tracker's metadata:
[codecarbon INFO @ 10:33:48]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 10:33:48]   Python version: 3.10.12
[codecarbon INFO @ 10:33:48]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 10:33:48]   Available RAM : 1.000 GB
[codecarbon INFO @ 10:33:48]   CPU count: 2
[codecarbon INFO @ 10:33:48]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:33:48]   GPU count: 1
[codecarbon INFO @ 10:33:48]   GPU model: 1 x NVIDIA A40


ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 10:33:51] Saving emissions data to file /fs01/projects/green-ai/Shaina/emissions.csv


Loading standard model...


[codecarbon INFO @ 10:34:08] Energy consumed for RAM : 0.000002 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:34:15] Energy consumed for all GPUs : 0.000534 kWh. Total GPU Power : 83.81667327086336 W
[codecarbon INFO @ 10:34:15] Energy consumed for all CPUs : 0.000282 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:34:15] 0.000818 kWh of electricity used since the beginning.
[codecarbon INFO @ 10:34:22] Energy consumed for RAM : 0.000002 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:34:22] Energy consumed for all GPUs : 0.000724 kWh. Total GPU Power : 96.04053662668517 W
[codecarbon INFO @ 10:34:22] Energy consumed for all CPUs : 0.000367 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:34:22] 0.001093 kWh of electricity used since the beginning.
[codecarbon INFO @ 10:34:38] Energy consumed for RAM : 0.000004 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:34:38] Energy consumed for all GPUs : 0.001131 kWh. Total GPU Power : 96.00924542827806 W
[codecarbon INFO @ 10:34:38] Ener

Running inference...


[codecarbon INFO @ 10:35:37] Energy consumed for RAM : 0.000010 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:35:37] Energy consumed for all GPUs : 0.002742 kWh. Total GPU Power : 98.65310528855213 W
[codecarbon INFO @ 10:35:37] Energy consumed for all CPUs : 0.001250 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:35:37] 0.004002 kWh of electricity used since the beginning.
[codecarbon INFO @ 10:35:49] Energy consumed for RAM : 0.000011 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:35:49] Energy consumed for all GPUs : 0.003061 kWh. Total GPU Power : 97.91986939807556 W
[codecarbon INFO @ 10:35:49] Energy consumed for all CPUs : 0.001389 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:35:49] 0.004461 kWh of electricity used since the beginning.


Generated Text: Once upon a time, in a world filled with AI magic, there was a small, mysterious shop called "The Enchanted Emporium." The shop was said to be run by a wise and enigmatic wizard named Zorvath,
ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json
Carbon emissions (kgCO2eq): 0.00017620818071783338


[codecarbon INFO @ 10:35:50] [setup] RAM Tracking...
[codecarbon INFO @ 10:35:50] [setup] GPU Tracking...
[codecarbon INFO @ 10:35:50] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 10:35:50] [setup] CPU Tracking...
[codecarbon INFO @ 10:35:51] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:35:51] >>> Tracker's metadata:
[codecarbon INFO @ 10:35:51]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 10:35:51]   Python version: 3.10.12
[codecarbon INFO @ 10:35:51]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 10:35:51]   Available RAM : 1.000 GB
[codecarbon INFO @ 10:35:51]   CPU count: 2
[codecarbon INFO @ 10:35:51]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:35:51]   GPU count: 1
[codecarbon INFO @ 10:35:51]   GPU model: 1 x NVIDIA A40


ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 10:35:55] Saving emissions data to file /fs01/projects/green-ai/Shaina/emissions.csv
The `load_in_4bit` and `load_in_8bit` arguments are deprecated and will be removed in the future versions. Please, pass a `BitsAndBytesConfig` object in `quantization_config` argument instead.


Loading model with 8-bit quantization (QLoRA)...


[codecarbon INFO @ 10:36:24] Energy consumed for RAM : 0.000003 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:36:24] Energy consumed for all GPUs : 0.000798 kWh. Total GPU Power : 96.92475782183483 W
[codecarbon INFO @ 10:36:24] Energy consumed for all CPUs : 0.000350 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:36:24] 0.001152 kWh of electricity used since the beginning.


Running inference...


[codecarbon INFO @ 10:36:36] Energy consumed for RAM : 0.000004 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:36:36] Energy consumed for all GPUs : 0.001109 kWh. Total GPU Power : 97.7918234161375 W
[codecarbon INFO @ 10:36:36] Energy consumed for all CPUs : 0.000485 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:36:36] 0.001598 kWh of electricity used since the beginning.
[codecarbon INFO @ 10:36:36] [setup] RAM Tracking...
[codecarbon INFO @ 10:36:36] [setup] GPU Tracking...
[codecarbon INFO @ 10:36:36] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 10:36:36] [setup] CPU Tracking...


Generated Text: Once upon a time, in a world filled with AI magic, there was a powerful enchantress named Lyra. She had the ability to weave spells of wonder and magic, but she was also a bit of a perfectionist. She spent hours
ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json
Carbon emissions (kgCO2eq): 6.3123055137539e-05


[codecarbon INFO @ 10:36:37] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:36:37] >>> Tracker's metadata:
[codecarbon INFO @ 10:36:37]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 10:36:37]   Python version: 3.10.12
[codecarbon INFO @ 10:36:37]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 10:36:37]   Available RAM : 1.000 GB
[codecarbon INFO @ 10:36:37]   CPU count: 2
[codecarbon INFO @ 10:36:37]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:36:37]   GPU count: 1
[codecarbon INFO @ 10:36:37]   GPU model: 1 x NVIDIA A40


ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 10:36:40] Saving emissions data to file /fs01/projects/green-ai/Shaina/emissions.csv


Loading model with BF16 optimization...
Running inference...


[codecarbon INFO @ 10:36:52] Energy consumed for RAM : 0.000001 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:36:52] Energy consumed for all GPUs : 0.000336 kWh. Total GPU Power : 98.58707014041666 W
[codecarbon INFO @ 10:36:52] Energy consumed for all CPUs : 0.000145 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:36:52] 0.000482 kWh of electricity used since the beginning.
[codecarbon INFO @ 10:36:53] [setup] RAM Tracking...
[codecarbon INFO @ 10:36:53] [setup] GPU Tracking...
[codecarbon INFO @ 10:36:53] Tracking Nvidia GPU via pynvml
[codecarbon INFO @ 10:36:53] [setup] CPU Tracking...


Generated Text: Once upon a time, in a world filled with AI magic, there was a small, mysterious shop called "The Enchanted Emporium." The shop was said to be run by a wise and powerful wizard named Zorvath,
ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json
Carbon emissions (kgCO2eq): 1.9039901167641488e-05


[codecarbon INFO @ 10:36:54] CPU Model on constant consumption mode: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:36:54] >>> Tracker's metadata:
[codecarbon INFO @ 10:36:54]   Platform system: Linux-5.4.0-131-generic-x86_64-with-glibc2.27
[codecarbon INFO @ 10:36:54]   Python version: 3.10.12
[codecarbon INFO @ 10:36:54]   CodeCarbon version: 2.6.0
[codecarbon INFO @ 10:36:54]   Available RAM : 1.000 GB
[codecarbon INFO @ 10:36:54]   CPU count: 2
[codecarbon INFO @ 10:36:54]   CPU model: Intel(R) Xeon(R) Silver 4110 CPU @ 2.10GHz
[codecarbon INFO @ 10:36:54]   GPU count: 1
[codecarbon INFO @ 10:36:54]   GPU model: 1 x NVIDIA A40


ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/hardware/cpu_power.csv


[codecarbon INFO @ 10:36:57] Saving emissions data to file /fs01/projects/green-ai/Shaina/emissions.csv


Loading model with INT4 optimization...
Running inference...


[codecarbon INFO @ 10:37:11] Energy consumed for RAM : 0.000001 kWh. RAM Power : 0.375 W
[codecarbon INFO @ 10:37:11] Energy consumed for all GPUs : 0.000381 kWh. Total GPU Power : 98.19856961906925 W
[codecarbon INFO @ 10:37:11] Energy consumed for all CPUs : 0.000165 kWh. Total CPU Power : 42.5 W
[codecarbon INFO @ 10:37:11] 0.000547 kWh of electricity used since the beginning.


Generated Text: Once upon a time, in a world filled with AI magic, there was a small, independent company called "EchoPlex." EchoPlex was known for its innovative AI-powered tools and services, which were designed to help people achieve their
ref: /fs01/projects/green-ai/greenai/green-ai/lib/python3.10/site-packages/codecarbon/data/private_infra/2016/canada_energy_mix.json
Carbon emissions (kgCO2eq): 2.1624181808329322e-05

Comparison of Emissions:
Standard Model: 0.000176 kgCO2eq
Optimized Model with QLoRA: 0.000063 kgCO2eq
Optimized Model with BF16: 0.000019 kgCO2eq
Optimized Model with INT4: 0.000022 kgCO2eq
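
The absolute figures are small, so relative differences are easier to read. The sketch below computes the reduction of each configuration against the standard run, using the rounded values printed above. These come from a single run per configuration and include model loading, and the first run spans about two minutes of log timestamps compared with under a minute for each of the others, so the percentages are indicative rather than a benchmark.

```python
# Emissions in kgCO2eq, copied from the comparison printed above.
results = {
    "Standard": 0.000176,
    "QLoRA (8-bit)": 0.000063,
    "BF16": 0.000019,
    "INT4": 0.000022,
}

def reduction_percent(baseline, value):
    """Percentage by which value is lower than baseline."""
    return (1 - value / baseline) * 100

baseline = results["Standard"]
for name, value in results.items():
    if name != "Standard":
        print(f"{name}: {reduction_percent(baseline, value):.1f}% lower than standard")
```

Repeating each configuration several times and averaging would give a sturdier comparison than these single measurements.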


## Prepared By

- **Name**: **Shaina Raza, PhD** [shaina.raza@vectorinstitute.ai](mailto:shaina.raza@vectorinstitute.ai)
- **Affiliation**: Vector Institute for Artificial Intelligence

This notebook was prepared as part of a practical guide for efficient evaluation and optimization of large language models (LLMs), with an emphasis on reducing carbon emissions and computational costs.
