# CARTE-Enbridge Bootcamp
#### Lab 4-2

# Understanding the Carbon Cost of Machine Learning

With the rise of Large Language Models, there has been a growing discussion about the climate impact of using deep learning. In this lab, we are going to explore the carbon cost of training a model. We will use the [codecarbon](https://github.com/mlco2/codecarbon) library to measure the carbon footprint of a few different machine learning methods.

In [None]:
!pip install -U -q codecarbon pint transformers datasets torch "accelerate>=0.20.1"

In [None]:
# Check if we are running with a GPU
import torch
if torch.cuda.is_available():
    print('GPU available')
else:
    raise Exception('GPU not available - select Runtime -> Change runtime type -> GPU')

In [None]:
!codecarbon init

CodeCarbon is a Python library that allows you to measure the carbon footprint of your code. It works by measuring the power consumption of your machine and estimating the carbon emissions associated with that power consumption. It generates a detailed report that includes the carbon footprint of your code, helping us to understand and compare the impact of different models.
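
To make the idea concrete, here is a minimal sketch of the arithmetic behind such an estimate: energy is average power multiplied by run time, and emissions are energy multiplied by the carbon intensity of the local grid. The function name and the example numbers (a 50 W machine, a 10-minute run, 0.4 kg CO2eq per kWh) are illustrative assumptions, not CodeCarbon's internals or measured values.

```python
def estimate_emissions_kg(power_watts: float, duration_seconds: float,
                          intensity_kg_per_kwh: float) -> float:
    """Estimate emissions (kg CO2eq) from average power, run time, and grid carbon intensity."""
    # 1 kWh = 1000 W * 3600 s = 3,600,000 watt-seconds (joules)
    energy_kwh = power_watts * duration_seconds / 3_600_000
    return energy_kwh * intensity_kg_per_kwh


# Hypothetical inputs, chosen only to illustrate the calculation
example_kg = estimate_emissions_kg(power_watts=50, duration_seconds=600,
                                   intensity_kg_per_kwh=0.4)
print(f"{example_kg * 1000:.2f} g CO2eq")  # prints "3.33 g CO2eq"
```

The same run on a grid with half the carbon intensity would emit half as much, which is why both the hardware and the location matter.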

Let's start by using CodeCarbon to investigate the impact of training a simple linear regression model. We will use the [California Housing dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#california-housing-dataset) from scikit-learn. This dataset, derived from the 1990 U.S. census, describes housing in California. We will use all eight features to predict the median house value in each area.

In [None]:
from sklearn.datasets import fetch_california_housing
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from pint import UnitRegistry
import pandas as pd
from codecarbon import EmissionsTracker
ureg = UnitRegistry()

In [None]:
# Load the data
housing = fetch_california_housing()
X = pd.DataFrame(housing.data, columns=housing.feature_names)
y = housing.target
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Before we go any further, let's take a look at the data we're working with. It's always important to understand what we're predicting.

This housing dataset tasks us with predicting the median house value in a given area. The dataset contains 8 features:
- MedInc: median income in block
- HouseAge: median house age in block
- AveRooms: average number of rooms
- AveBedrms: average number of bedrooms
- Population: block population
- AveOccup: average house occupancy
- Latitude: house block latitude
- Longitude: house block longitude

The house values are measured in hundreds of thousands of dollars.

We will use mean absolute error (MAE) as our evaluation metric. It is easy to interpret because it is in the same units as the target variable. It is also less sensitive to outliers than mean squared error, which is important in this dataset.
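
As a small illustration of why the choice of metric matters when outliers are present, the sketch below compares mean absolute error with mean squared error on toy values. The helper names and the numbers are made up for this example; in the lab itself we use scikit-learn's `mean_absolute_error`.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy targets in the dataset's units (hundreds of thousands of dollars)
y_true = [1.0, 2.0, 3.0, 4.0]
y_close = [1.5, 2.5, 2.5, 3.5]    # every prediction is off by 0.5
y_outlier = [1.5, 2.5, 2.5, 9.0]  # the last prediction is off by 5.0

for name, y_pred in [("close", y_close), ("one outlier", y_outlier)]:
    mae = mean_absolute_error_manual(y_true, y_pred)
    mse = mean_squared_error_manual(y_true, y_pred)
    print(f"{name:12s} MAE={mae:.3f}  MSE={mse:.4f}")
```

A single bad prediction raises the MAE from 0.5 to 1.625, but raises the MSE from 0.25 to 6.4375, so the squared metric is dominated by the outlier.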

In [None]:
x_train.head()

In [None]:
y_train[:5]

Using CodeCarbon's EmissionsTracker is easy. When we want to record the cost of a specific training run, we simply wrap the training code in a with statement. Let's train a linear regression model and see how much carbon it emits.

In [None]:
def report_emissions(emissions_tracker: EmissionsTracker):
    energy = emissions_tracker.final_emissions_data.energy_consumed
    energy = energy * ureg.kilowatt_hour
    carbon = emissions_tracker.final_emissions_data.emissions
    carbon = carbon * ureg.kilogram
    print(f'Carbon emitted:      {carbon.to_compact():~.2f}')
    print(f'Energy consumed:     {energy.to_compact():~.2f}')


In [None]:
%%capture

model = LinearRegression()

# Wrap the training code in a with statement
with EmissionsTracker(project_name="Linear Regression") as tracker:
    model.fit(x_train, y_train)

with EmissionsTracker(project_name="Linear Regression Prediction") as predict:
    y_hat = model.predict(x_test)

We also record the carbon cost of making predictions with each of these models, using the `predict` tracker. We will come back to this later, when we look at LLMs.

In [None]:
report_emissions(tracker)
print(f'Mean absolute error: {mean_absolute_error(y_test, y_hat):.2f}')

Unsurprisingly, training a simple linear regression model has a very small carbon footprint. Let's see what happens when we train a more complex model: a large Random Forest.

In [None]:
%%capture

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators=100, max_depth=None, n_jobs=-1) # Use all cores, any depth
with EmissionsTracker(project_name="Random Forest") as tracker:
    model.fit(x_train, y_train)

with EmissionsTracker(project_name="Random Forest Prediction") as predict:
    y_hat = model.predict(x_test)

In [None]:
report_emissions(tracker)
print(f'Mean absolute error: {mean_absolute_error(y_test, y_hat):.2f}')

Our model has improved significantly, but at an increased carbon cost.

Something important to note is that we are currently running these experiments in Google Colab. The location of the computing resources powering our code has a significant impact on the carbon footprint of our code. We can check what region the code is running in, using CodeCarbon:

In [None]:
print(f'Region:         {tracker.final_emissions_data.region}')
print(f'Country:        {tracker.final_emissions_data.country_name}')
# emissions_rate is the average emissions per second of the run, in kg CO2eq / s
emissions_rate = tracker.final_emissions_data.emissions_rate * ureg.kilogram / ureg.second
print(f'Emissions rate: {emissions_rate.to_compact():~.2f}')

The region you see here is variable, but it's likely to be in the US. In one test, we received the following results:

```
Region:         oregon
Country:        United States
Emissions rate: 1.82 mg / s
```

Unsurprisingly, if you run this code in Ontario, where most of our electricity comes from low-carbon sources such as nuclear and hydro, the emissions rate is much lower. Keep in mind that this rate also depends on how much power the machine draws, so it is only directly comparable across similar hardware:

```
Region:         ontario
Country:        Canada
Emissions rate: 312.58 µg / s
```
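
To compare locations independently of how long or how hard the machine worked, one option is to divide the emissions of a run by the energy it consumed, which gives the carbon intensity of the electricity. The sketch below assumes the two inputs come from the `emissions` (kg) and `energy_consumed` (kWh) fields used in `report_emissions`; the function name and the example values are hypothetical.

```python
def carbon_intensity_g_per_kwh(emissions_kg: float, energy_kwh: float) -> float:
    """Carbon intensity of the electricity used, in g CO2eq per kWh."""
    if energy_kwh <= 0:
        raise ValueError("energy_kwh must be positive")
    return emissions_kg / energy_kwh * 1000  # kg/kWh -> g/kWh


# Hypothetical run: 0.6 g CO2eq emitted while consuming 2 Wh of energy
intensity = carbon_intensity_g_per_kwh(emissions_kg=0.0006, energy_kwh=0.002)
print(f"{intensity:.0f} g CO2eq / kWh")
```

Two runs of different lengths in the same region should give roughly the same intensity, while the same run in a different region can give a very different one.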

Another important factor to consider is the efficiency of the hardware that we're using. CodeCarbon reports the power consumption of the computing resources we are working on:

In [None]:
print(f'CPU Power: {tracker.final_emissions_data.cpu_power * ureg.watt:~.2f}')
print(f'RAM Power: {tracker.final_emissions_data.ram_power * ureg.watt:~.2f}')

**Your turn**

Before we move on, let's try one more experiment. Choose a machine learning model in [Scikit-Learn](https://scikit-learn.org/stable/supervised_learning.html) and train it on the California Housing dataset. Use CodeCarbon to measure the carbon footprint of your model. How does it compare to the models we've already trained? __Hint: If you aren't sure what model to use, try the [Extremely Randomized Trees](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesRegressor.html) model (`ExtraTreesRegressor`).__

In [None]:
# Your code here

## Understanding the Carbon Cost of Large Language Models

Now that we have a better understanding of how CodeCarbon works, let's use it to investigate the carbon cost of using a large language model.

We are going to use HuggingFace to fine-tune a pretrained version of GPT-2 for a single epoch (i.e. one pass through the training data). We will then use the model to generate some text, and measure the carbon cost of the training and prediction steps.

While the training part of this code is quite simple, getting the data ready requires a little bit of effort. We are going to rush through it here, as it isn't the focus of this lab, but the code is commented in case you're interested.
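
The least obvious step in the preparation is the concatenate-and-chunk logic in `group_texts`. Here is a minimal sketch of the same idea on toy data, using plain Python lists in place of tokenized text; `chunk_tokens` is a hypothetical helper written for this illustration and is not part of the lab code.

```python
def chunk_tokens(token_lists, block_size):
    """Concatenate token lists, then split them into full blocks of block_size.

    Tokens left over after the last full block are dropped, so every
    block has exactly block_size tokens.
    """
    flat = [tok for tokens in token_lists for tok in tokens]
    usable = (len(flat) // block_size) * block_size
    return [flat[i:i + block_size] for i in range(0, usable, block_size)]


# Three "documents" of different lengths, 9 tokens in total
toy = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
print(chunk_tokens(toy, block_size=4))  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

Note that blocks can span document boundaries, and that the final token (9) is discarded because it does not fill a complete block.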

In [None]:
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM, Trainer, TrainingArguments

dataset = load_dataset("wikitext", "wikitext-2-raw-v1") # Load the raw dataset
tokenizer = AutoTokenizer.from_pretrained("gpt2") # Load the tokenizer - converts text into numbers

# To speed up, let's use half the data
dataset['train'] = dataset['train'].select(range(0, len(dataset['train']), 2))

tokenized_datasets = dataset.map(
    lambda x: tokenizer(x["text"]), # Tells this function how to use the tokenizer
    batched=True, # Apply to groups of examples
    num_proc=4, # Use 4 cores
    remove_columns=["text"] # Remove the text column, as we don't need it anymore
)

block_size = 256 # The maximum number of tokens in a single input

# The main data processing function that will concatenate all texts from our dataset and generate chunks of block_size.
def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # Drop the small remainder so every chunk has exactly block_size tokens.
    total_length = (total_length // block_size) * block_size
    # Split into chunks of block_size.
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result

# Finally, apply the function above to our data
lm_datasets = tokenized_datasets.map(
    group_texts,
    batched=True,
    num_proc=4,
)

Whew! With that out of the way, we can set up the actual training and measure its carbon cost.

In [None]:
model = AutoModelForCausalLM.from_pretrained("gpt2")

training_args = TrainingArguments(
    output_dir="./gpt2-wikitext2",
    overwrite_output_dir=True,
    num_train_epochs=1, # Train for one epoch
    per_device_train_batch_size=4,
    save_steps=10_000,
    save_total_limit=1,
    prediction_loss_only=True,
    logging_steps=1,
    logging_first_step=True,
    learning_rate=2e-5,
    weight_decay=0.01,
    report_to="none", # Disable all logging integrations (None does not disable them)
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=lm_datasets["train"],
    eval_dataset=lm_datasets["validation"],
    data_collator=None,
)

In [None]:
with EmissionsTracker(project_name="GPT-2 Training") as tracker:
    trainer.train()

In [None]:
report_emissions(tracker)

**Your turn**

Now that we've trained our model, let's use it to generate some text. Tokenize a prompt with `tokenizer`, then call the `generate` method on the model (available as `trainer.model`) to generate some text. You can use the [documentation](https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments) to help you. Use CodeCarbon to measure the carbon footprint of generating the text. How does it compare to the carbon footprint of training the model? We can also look at the results of all our experiments thus far by running the following command:

`pd.read_csv('emissions.csv')`

For reference, one estimate of the carbon cost to train GPT-4 is around 12,500 metric tons of CO2. This estimate assumes that the model was trained in California, using about 25,000 NVIDIA A100 GPUs. That is equivalent to the annual emissions of about 2,700 cars. This is a lot of carbon, but it's important to remember that this is a one-time cost. Once the model is trained, it can be used by many people, with a much lower carbon cost per user.
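
The car comparison can be sanity-checked with a quick calculation. The sketch below assumes a typical passenger car emits about 4.6 metric tons of CO2 per year; that figure is an assumption brought in for this check, not something stated in the estimate itself.

```python
TRAINING_EMISSIONS_TONNES = 12_500  # estimate quoted in the text
TONNES_PER_CAR_PER_YEAR = 4.6       # assumption: typical passenger car


def car_years(emissions_tonnes, tonnes_per_car_year=TONNES_PER_CAR_PER_YEAR):
    """Number of cars whose combined annual emissions equal emissions_tonnes."""
    return emissions_tonnes / tonnes_per_car_year


print(f"{car_years(TRAINING_EMISSIONS_TONNES):,.0f} car-years")  # prints "2,717 car-years"
```

Under that assumption the result is close to the 2,700 cars quoted above.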

## Conclusion

In this lab, we have explored the carbon cost of training a machine learning model. We have seen that the cost of training a large language model is substantially higher than that of training a traditional machine learning algorithm. There are a number of ways that we can try to reduce carbon emissions in machine learning:

- **Use more efficient hardware**: The hardware we use to train our models has a significant impact on the carbon footprint of our code. Using hardware suited to the workload, such as GPUs for deep learning, can reduce the energy needed for the same amount of computation.
- **Use more efficient algorithms**: Some algorithms are more efficient than others. For example, linear regression is much more efficient than a large language model.
- **Run code in regions with renewable energy**: The location of the computing resources powering our code determines how much carbon is emitted per unit of energy. Running our code in regions with a high proportion of renewable energy can reduce its carbon footprint.
- **Train less often**: Training a model has a much higher carbon cost than making a single prediction with it. If we are careful about how often we train our models, we can spread the carbon cost of training over a longer period of time and many more predictions.
- **Use smaller models**: Large language models are very powerful, but they also have a high carbon cost. If we can use a smaller model, we can reduce the carbon footprint of our code.
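
To see how the one-time cost of training is spread across use, here is a minimal sketch that amortizes training emissions over the number of predictions served. The function name and all numbers are hypothetical; in practice the inputs would come from measurements like the ones made in this lab.

```python
def emissions_per_prediction(training_kg: float, inference_kg_each: float,
                             n_predictions: int) -> float:
    """Average emissions per prediction, with training amortized over all predictions."""
    if n_predictions <= 0:
        raise ValueError("n_predictions must be positive")
    return training_kg / n_predictions + inference_kg_each


# Hypothetical model: 10 kg CO2eq to train, 1 g CO2eq per prediction
for n in (1_000, 1_000_000):
    per_prediction_g = emissions_per_prediction(10.0, 0.001, n) * 1000
    print(f"{n:>9,} predictions: {per_prediction_g:.2f} g CO2eq each")
```

With few predictions the training cost dominates. With many, the per-prediction inference cost becomes the main term, which is where smaller models and cleaner electricity help most.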