In [1]:
# ============================================================
# Notebook setup: run this before everything
# ============================================================

%load_ext autoreload
%autoreload 2

# Control figure size
interactive_figures = False
if interactive_figures:
    # Normal behavior
    %matplotlib widget
    figsize=(9, 3)
else:
    # PDF export behavior
    figsize=(14, 5)

import seaborn as sns
import matplotlib.pyplot as plt
from util import util
#from scipy.integrate import odeint
import numpy as np
import pandas as pd
import os
#from skopt.space import Space
#from eml.net.reader import keras_reader
from codecarbon import EmissionsTracker
from sklearn.tree import DecisionTreeRegressor


# Case study: Sustainable Hardware Dimensioning

Deep learning relies on training algorithms that are widely recognised as power-hungry and expensive, and the field has begun to address its carbon footprint. Machine learning (ML) models have grown exponentially in size over the past few years, with some algorithms training for thousands of core-hours, and the associated energy consumption and cost have become a growing concern [Green AI paper].

Previous studies have made advances in estimating the GHG emissions of computation, and have attempted to provide general, easy-to-use methodologies for estimating the carbon footprint of any computational task [Green Algorithms paper].

In this work, we explore how to find the best hardware (HW) architecture and its dimensioning for AI algorithms while respecting constraints on carbon emissions. Previous work [HADA paper] has focused on HW dimensioning for AI algorithms under constraints on budget, time and solution quality; this problem is called Hardware Dimensioning. We aim to extend this approach by also considering constraints on the carbon emissions of the computation, and we call the resulting problem Sustainable Hardware Dimensioning.

The HADA approach is based on the Empirical Model Learning paradigm [EML paper], which integrates Machine Learning (ML) models into an optimisation problem. The key idea is to combine the domain knowledge held by experts with data-driven models that learn the relationships between HW requirements and AI algorithm performance, relationships that would be very complex to express formally in a suitable model. The approach starts by benchmarking multiple AI algorithms on different HW resources, generating the data used to train the ML models; optimisation is then used to find the best [HW configuration](https://www.sciencedirect.com/topics/computer-science/hardware-configuration) that respects the user-defined constraints.

# Methodology

Our approach builds on the Empirical Model Learning (EML) paradigm. Broadly speaking, EML deals with solving declarative optimisation models that contain a complex component $h$, which represents the relation between the variables that can be acted upon, $x$ (the decision variables), and the observables $y$ of the system under consideration; the function $h(x) = y$ describes this relationship. Because $h(x)$ is complex, we cannot optimise over it directly. Hence, we exploit empirical knowledge to build a surrogate model $h_\theta(x)$ learned from data, where $\theta$ is the parameter vector.
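In symbols, the kind of problem EML targets can be written roughly as follows. This is a generic sketch rather than the exact HADA model: the objective $f$, the constraint functions $g_j$ and the domain $D_x$ are placeholder names introduced here for illustration only.

$$
\begin{aligned}
\min_{x,\,y} \quad & f(x, y) \\
\text{s.t.} \quad & y = h_\theta(x) \\
& g_j(x, y) \le 0, \qquad j = 1, \dots, m \\
& x \in D_x
\end{aligned}
$$

The constraint $y = h_\theta(x)$ is where the learned surrogate replaces the true but intractable relation $h(x) = y$.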

HADA (HArdware Dimensioning of AI Algorithms) consists of three main phases:

1. data set collection (benchmarking phase) - an initial phase that collects the data set by running the target algorithms multiple times under different configurations;
2. surrogate model creation - once a training set is available, a set of ML models is trained on it, and these models are then encoded as a set of variables and constraints following the EML paradigm;
3. optimisation - the user-defined constraints and objective function are posted on top of the combinatorial structure formed by the encoded ML models and the domain-knowledge constraints, and the optimisation model is then solved (until either an optimal solution or a time limit is reached).
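The three phases can be sketched end to end on a toy problem. This is only an illustration under strong assumptions: `run_algorithm` is a hypothetical stand-in with made-up cost and runtime formulas, the "surrogate" is a plain lookup table of per-configuration means rather than a trained ML model, and the optimisation step is brute-force enumeration rather than an EML encoding handed to a solver.

```python
import random
from statistics import mean


def run_algorithm(n_scenarios, rng):
    """Hypothetical stand-in for one benchmark run: returns (cost, runtime_sec)."""
    cost = 300.0 + 100.0 / n_scenarios + rng.uniform(-0.1, 0.1)
    runtime = 0.5 * n_scenarios + rng.uniform(0.0, 0.2)
    return cost, runtime


# Phase 1: benchmarking. Run every configuration a few times.
rng = random.Random(0)
configs = range(1, 21)
samples = {n: [run_algorithm(n, rng) for _ in range(5)] for n in configs}

# Phase 2: surrogate model. Here it is just the mean (cost, runtime) per configuration.
surrogate = {
    n: (mean(c for c, _ in runs), mean(t for _, t in runs))
    for n, runs in samples.items()
}


# Phase 3: optimisation. Minimise predicted cost subject to a runtime limit.
def best_config(surrogate, max_runtime):
    feasible = {n: pred for n, pred in surrogate.items() if pred[1] <= max_runtime}
    if not feasible:
        return None  # no configuration satisfies the constraint
    return min(feasible, key=lambda n: feasible[n][0])


print(best_config(surrogate, max_runtime=4.9))
```

In the actual approach, a carbon-emission constraint would be one more predicted quantity alongside cost and runtime, filtered in the same way.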

## Dataset Collection

The training set was built by grounding the approach on two stochastic algorithms, ANTICIPATE and CONTINGENCY [32, 4], from the energy management system domain. The two algorithms calculate the amount of energy that must be produced by the energy system to meet the required load, minimising the total energy cost over the daily time horizon while taking uncertainty into account. Both algorithms divide the daily time horizon into 96 intervals of 15 minutes each.

### Input data

The input data is a set of 30 different instance realisations, each one representing one daily time horizon. For each day, we have **Load**, which is a 96-valued vector of the load observations sampled at each interval (every 15 minutes over the course of a day), and **PV**, a 96-valued vector representing the observations of available Photovoltaic energy production. Here we can see an example of the first two instances:

Since the objective of the two algorithms is to minimize the total energy cost, the input files also include the grid electricity price at each hour of the day:

In [None]:
data = util.display_prices_data()
display(data)
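Load and PV are sampled on 96 quarter-hour intervals, while the price is described per hour, so the two need to be aligned before a cost can be computed. A minimal sketch of one way to do that, assuming 24 hourly prices and a price that is constant within each hour; `expand_hourly_prices` is a hypothetical helper, not part of `util`, and the prices are made up:

```python
INTERVALS_PER_DAY = 24 * 60 // 15   # 96 intervals of 15 minutes
INTERVALS_PER_HOUR = 60 // 15       # 4 intervals per hour


def expand_hourly_prices(hourly_prices):
    """Repeat each of the 24 hourly prices for the 4 intervals in that hour."""
    if len(hourly_prices) != 24:
        raise ValueError("expected 24 hourly prices")
    return [hourly_prices[i // INTERVALS_PER_HOUR] for i in range(INTERVALS_PER_DAY)]


# Made-up prices, for illustration only (not taken from the data file).
hourly = [0.10 + 0.01 * h for h in range(24)]
per_interval = expand_hourly_prices(hourly)
print(len(per_interval))
```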

The base version of HADA involves measuring the solution cost, the runtime and the average memory usage for the target algorithm. In the next section we will see the additional metrics that were added.

### Measuring Carbon Emissions

To extend HADA to take sustainability into account, we need to measure the carbon emissions produced by running the algorithms during the benchmarking phase. A simple tool for this is [codecarbon](https://mlco2.github.io/codecarbon/index.html), a Python package for tracking the emissions that result from executing code.

The CO2e emission tracking tool offered by codecarbon can be used in [different modalities](https://mlco2.github.io/codecarbon/usage.html): as an Explicit Object (instantiating an `EmissionsTracker` object and passing it as a parameter to function calls to start and stop the emissions tracking of the compute section), as a Context Manager (recommended for monitoring a specific code block) or as a Decorator (recommended for monitoring training functions). For example, let's track the emissions of running the ANTICIPATE algorithm. Let's say we would like to solve instance 5 with 4 scenarios:

In [None]:
scenarios = 4
instance = 5
project_name = f"anticipate-ins-{instance}-ns-{scenarios}"
output_dir = '../data/'

# Codecarbon emission tracker
tracker = EmissionsTracker(project_name=project_name, 
                           log_level='ERROR', 
                           output_dir=output_dir)

with tracker as t:
    sol_cost, run_final, mem_final = util.online_ant(scenarios=scenarios, instance=instance, file='InstancesTest.csv')

print(f"The solution cost (in keuro) is: {sol_cost:.2f}")
print(f"The runtime (in sec) is: {run_final:.2f}")
print(f"Avg memory used (in MB) is: {mem_final:.2f}")

This tracks the emissions of the `online_ant` function by using the codecarbon `EmissionsTracker` as a context manager. By default, codecarbon saves the tracking data to a .csv file named `emissions.csv`. Let's take a look:

In [None]:
emissions = util.display_emissions_data()
display(emissions)

(**NOTE:** maybe a table is not the most suitable representation) As we can see, codecarbon keeps track of a series of metrics. For this project, we decided to include the following metrics in the generated training set:

* `emissions`: the total emissions of CO2eq (kg) (**NOTE:** add a brief explanation about how codecarbon computes the emissions);
* `emission_rate`: the amount of CO2eq emissions per second (kg/s);
* `cpu_energy`: the energy consumed by the CPU;
* `ram_energy`: the energy consumed by the RAM;
* `tot_energy`: the total energy consumed;
* `country`, `region`: the country and region where the computation took place;
* `cpu_count`: the number of cores.
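At its core, codecarbon estimates emissions by multiplying the energy consumed (in kWh) by the carbon intensity of the electricity in the detected location (kg of CO2eq per kWh), that is, $\text{emissions} = E \times C$. The sketch below reproduces that arithmetic with made-up numbers. The function names are hypothetical, the intensity value is not a real figure for any country, and GPU energy is assumed to be zero.

```python
def estimate_emissions_kg(energy_kwh, carbon_intensity_kg_per_kwh):
    """Emissions (kg CO2eq) = energy (kWh) x carbon intensity (kg CO2eq / kWh)."""
    return energy_kwh * carbon_intensity_kg_per_kwh


def emission_rate_kg_per_s(emissions_kg, duration_s):
    """Average emission rate over the tracked run."""
    return emissions_kg / duration_s


# Made-up values, for illustration only.
cpu_energy_kwh = 0.002
ram_energy_kwh = 0.0005
total_kwh = cpu_energy_kwh + ram_energy_kwh   # no GPU assumed

emissions = estimate_emissions_kg(total_kwh, carbon_intensity_kg_per_kwh=0.4)
rate = emission_rate_kg_per_s(emissions, duration_s=50.0)
print(f"{emissions:.6f} kg CO2eq, {rate:.2e} kg/s")
```

Because the carbon intensity depends on the location, the `country` and `region` fields directly affect the reported emissions for the same energy use.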

### Benchmarking

For the benchmarking phase, the ANTICIPATE and CONTINGENCY algorithms were run 100 times on each instance, each time with a different value of the configurable parameter (from 1 to 100 traces/scenarios). This value is taken directly from the HADA paper, according to which running the algorithms 100 times on each instance explores the parameter space sufficiently [Hada paper, 4]. The training set of each algorithm therefore contains 3,000 records (100 runs x 30 instances).
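The size of the benchmark grid can be checked by enumerating every (instance, number of scenarios) pair. This is a sketch of the enumeration only; the 0-based instance indices are an assumption, and the actual benchmark script may order or index the runs differently.

```python
from itertools import product

N_INSTANCES = 30                 # daily instances in the input data
SCENARIO_COUNTS = range(1, 101)  # 1 to 100 traces/scenarios

# One benchmark run per (instance, n_scenarios) pair, per algorithm.
runs = list(product(range(N_INSTANCES), SCENARIO_COUNTS))
print(len(runs))
```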

Since the HADA approach recommends collecting data on different hardware configurations, I ran the benchmarking phase on my personal laptop [Insert specifications] and on [Leonardo](https://leonardo-supercomputer.cineca.eu/), an HPC system hosted by CINECA. After the benchmarking phase, the collected dataset looks something like this:

In [None]:
filename = 'contingency_mbp19.csv'
benchmark_data = util.display_benchmark_data(filename)
display(benchmark_data)

## Data exploration

Let's have a look at the data produced during the benchmarking phase to gather some insights.

### Load and combine data

We first load the data and add identifying columns to make it easy to filter and compare the data.

In [None]:
# load data for each combination of algorithm and platform
files = {
    "anticipate_mbp19": "anticipate_mbp19.csv",
    "anticipate_leonardo": "anticipate_leonardo.csv",
    "contingency_mbp19": "contingency_mbp19.csv",
    "contingency_leonardo": "contingency_leonardo.csv"
}

# load each file and add identifiers
dataframes = []
for key, file in files.items():
    algorithm, platform = key.split('_')
    df = util.read_benchmark_file(file)
    df['algorithm'] = algorithm
    df['platform'] = platform
    dataframes.append(df)

# concatenate all dataframes into one for analysis
data = pd.concat(dataframes, ignore_index=True)

### Basic data overview

To get a high-level summary of each metric, we run `.describe()` for statistical insights and `.info()` to check data types.

In [None]:
# Let's drop some of the columns
columns = ['nScenarios','nTraces','cpuCount']
data_overview = data.drop(columns=columns)

print(data_overview.describe())
print("="*50)
# info() prints its summary itself and returns None, so it is not wrapped in print()
data_overview.info()

Notice that the minimum of memPeak is 0.12. The peak memory values recorded on Leonardo are suspiciously low, which also brings the average down. This is odd because, in the instance with memPeak 0.12, the average memory usage is higher than the peak, which shouldn't be possible. The benchmark runs on Leonardo should have been repeated. Unfortunately, the recent weather emergency in Emilia-Romagna interrupted the HPC services offered by CINECA.
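A quick way to find the affected rows is to filter for records whose peak memory is lower than their average memory. The sketch below runs on a toy frame with made-up values; only the column names follow the benchmark files. The same filter could be applied to `data`.

```python
import pandas as pd

# Toy rows with made-up values; column names follow the benchmark files.
toy = pd.DataFrame({
    "platform": ["mbp19", "leonardo", "leonardo"],
    "memAvg(MB)": [150.0, 140.0, 145.0],
    "memPeak(MB)": [210.0, 0.12, 0.30],
})

# A peak below the average is physically impossible, so these rows are suspect.
suspicious = toy[toy["memPeak(MB)"] < toy["memAvg(MB)"]]
print(suspicious["platform"].value_counts())
```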

### Categorical analysis

Analyze categorical columns like `algorithm`, `platform`, `country` and `region` to understand the distribution.

In [None]:
# Count distribution for categorical columns
print(data['algorithm'].value_counts())
print(data['platform'].value_counts())
print(data['country'].value_counts())
print(data['region'].value_counts())

Notice that "canada" and "quebec" appear in many entries for the country and region. That's not because I travelled, even though I'd like to. These values come from codecarbon, so there is probably a problem there when tracking emissions. The problem with these values is that they influence how the carbon footprint is computed: codecarbon uses the carbon intensity of the given country, so a wrong location can change the final result. For example, the emission rate is 4.31e-06 for Italy and 3.00e-08 for Canada. Notice that we could also remove this column, since its values are fixed.
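To get a feeling for the size of the effect, we can compare the two emission rates quoted above. This is plain arithmetic on those two numbers. Since they come from different runs, the ratio may mix differences in hardware power draw with differences in grid carbon intensity, so it should be read as an order of magnitude only.

```python
rate_italy = 4.31e-06    # kg CO2eq per second, value quoted in the text
rate_canada = 3.00e-08   # kg CO2eq per second, value quoted in the text

ratio = rate_italy / rate_canada
print(f"Italy / Canada emission-rate ratio: {ratio:.1f}")
```

A gap of roughly two orders of magnitude means the runs labelled with the wrong location cannot be compared directly with the others on `CO2e(kg)`.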

### Exploring performance metrics

Since `sol(keuro)`, `time(sec)`, `memAvg(MB)` and `memPeak(MB)` represent performance metrics, visualize and compare them across algorithms and platforms. First, we compare solution cost and time

In [None]:
# Solution Cost Comparison by Algorithm and Platform
sns.boxplot(data=data, x='algorithm', y='sol(keuro)', hue='platform')
plt.title('Solution Cost Comparison by Algorithm and Platform')
plt.show()

# Time Required Comparison
sns.boxplot(data=data, x='algorithm', y='time(sec)', hue='platform')
plt.title('Execution Time Comparison by Algorithm and Platform')
plt.show()

We can see that the solution cost is typically higher for contingency, while the results for the execution time are a bit strange.

In [None]:
# Average Memory Usage
sns.boxplot(data=data, x='algorithm', y='memAvg(MB)', hue='platform')
plt.title('Average Memory Usage by Algorithm and Platform')
plt.show()

# Peak Memory Usage
sns.boxplot(data=data, x='algorithm', y='memPeak(MB)', hue='platform')
plt.title('Peak Memory Usage by Algorithm and Platform')
plt.show()

### Energy and Emission Analysis

Explore the CO2 emissions `CO2e(kg)` and the energy consumption metrics to compare the environmental impact across algorithms and platforms.

In [None]:
# CO2 Emissions by Algorithm and Platform
sns.boxplot(data=data, x='algorithm', y='CO2e(kg)', hue='platform')
plt.title('CO2 Emissions by Algorithm and Platform')
plt.show()

# Energy Consumption: CPU, RAM, and Total
fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.boxplot(data=data, x='algorithm', y='cpuEnergy(kW)', hue='platform', ax=axes[0])
axes[0].set_title('CPU Energy Consumption')

sns.boxplot(data=data, x='algorithm', y='ramEnergy(kW)', hue='platform', ax=axes[1])
axes[1].set_title('RAM Energy Consumption')

sns.boxplot(data=data, x='algorithm', y='totEnergy(kW)', hue='platform', ax=axes[2])
axes[2].set_title('Total Energy Consumption')

plt.tight_layout()
plt.show()

### Performance vs. Emissions Correlation

Analyze whether there is a correlation between the performance metrics and the environmental metrics.

In [None]:
# Correlation matrix for numeric columns
correlation_matrix = data[['sol(keuro)', 'time(sec)', 'CO2e(kg)', 'totEnergy(kW)']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
plt.title('Correlation Between Performance and Emissions Metrics')
plt.show()
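By default, pandas `DataFrame.corr()` computes the Pearson correlation coefficient, so each cell of the heatmap is, for a pair of columns $X$ and $Y$ with $n$ rows:

$$
r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$

Pearson's $r$ only captures linear association. If a monotonic but non-linear relation is suspected, `corr(method='spearman')` is an alternative worth trying.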

### Comparison of Core Counts

Check if CPU core count impacts solution time or energy consumption.

In [None]:
# Execution Time vs. CPU Count
sns.scatterplot(data=data, x='cpuCount', y='time(sec)', hue='algorithm', style='platform')
plt.title('Execution Time vs. CPU Count')
plt.show()

# Total Energy vs. CPU Count
sns.scatterplot(data=data, x='cpuCount', y='totEnergy(kW)', hue='algorithm', style='platform')
plt.title('Total Energy vs. CPU Count')
plt.show()