# **Exercise: Predicting Power Consumption in a Kubernetes Cluster**

This exercise will guide you through analyzing a dataset obtained from a Kubernetes cluster and building a predictive model for power consumption. The steps include:

1. **Load and Preprocess Data:** Load a dataset containing various metrics from a Kubernetes cluster and clean it by removing static or missing values.
2. **Explore Data:** Identify relevant features related to power consumption using string matching and correlation analysis.
3. **Visualize Data:** Plot power consumption trends and relationships between different features.
4. **Build a Predictive Model:** Train a linear regression model to predict power consumption based on selected features.
5. **Evaluate Model Performance:** Assess the accuracy of your model using mean squared error and visualization.


# Phase 1: Prepare the dataset

Steps:
1. Download the dataset
2. Load the dataset as a Pandas dataframe
3. Remove static values

In [None]:
"""
Task 1.1: Download the dataset from Google Drive

- The downloaded file should appear in the file browser in the left-hand menu
- This dataset includes data from one of the worker nodes in the cluster (worker 1)

TODO: Nothing, everything is included.
"""
!gdown https://drive.google.com/uc?id=1Lyl26JCUmXs0IRdURfVd9j4If5xnq5O3  # worker1.feather

In [None]:
"""
Task 1.2: Load dataframe and print info

TODO: Nothing, everything is included.
"""
import pandas as pd

path = "/content/worker1.feather"
original_df = pd.read_feather(path)

# Count number of rows and cols in the original df
print(f"Loaded {len(original_df)} rows and {len(original_df.columns)} columns")

In [None]:
"""
Task 1.3: Remove nans and static columns

- There are almost 16 000 columns in the dataset, but a significant portion of the data is useless or missing
- Removing columns with mostly static data will make it much easier to handle the dataset

TODO: Fill the missing code (very obvious hints in comments)
"""
import random

# Make a copy of the dataframe, so we do not need to keep loading from file after every change
df = original_df.copy()

# To remove unnecessary columns, we need to figure out which columns are useless
static_columns = [] # TODO: fill this list with the names of the columns (see hint below)

# HINT: For example, first count the number of unique values in each column
# unique_counts = df.nunique()
# Then find all columns with at most two unique values (static or nearly static)
# static_columns = unique_counts[unique_counts <= 2].index

# Remove the static columns from the dataframe
df = df.drop(static_columns, axis=1)
print(f"Removing {len(static_columns)} static columns ({len(df.columns)} remaining)")

# Print some (randomly chosen) examples of the removed columns
num_examples = min(10, len(static_columns))  # Avoid an IndexError when fewer than 10 columns were removed
for col in random.sample(list(static_columns), num_examples):
  vals = original_df[col].unique()  # Get an array containing all unique values in this column
  print(f"Removed: {col}, values: {vals}")
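The filtering idea in Task 1.3 can be tried on a tiny example first. This is a minimal sketch on made-up data; the column names and the `toy_` variables are hypothetical and not part of the real dataset. The threshold is a judgment call: `< 2` drops only constant or empty columns, while `<= 2` would also drop the binary `flag` column.

```python
import pandas as pd

# Toy data (hypothetical column names, not taken from the real dataset)
toy_df = pd.DataFrame({
    "constant": [1, 1, 1, 1],
    "all_missing": [None, None, None, None],
    "flag": [0, 1, 0, 1],
    "varying": [0.5, 1.5, 2.5, 3.5],
})

# nunique() ignores NaN by default, so an all-missing column has 0 unique values
toy_unique_counts = toy_df.nunique()

# Keep only the columns that actually change; "< 2" means constant or empty
toy_static = toy_unique_counts[toy_unique_counts < 2].index
toy_clean = toy_df.drop(columns=toy_static)

print("Dropped:", list(toy_static))
print("Kept:", list(toy_clean.columns))
```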



In [None]:
"""
Task 1.4 Explore the dataset

TODO: Have a look at the dataset.

- What do the columns look like?
- What do the values look like?
- How much data is there?
"""

# for example, you can print a truncated list of all columns
print(f"{len(df)} rows and {len(df.columns)} columns")
print(df.columns)

df

# TODO: print values from one column, plot a column, print all 600+ columns, etc, ...

# for col in df.columns:
#   print(col)

# Phase 2: Plot power consumption over time

Our goal in this step is to find out how to plot the power consumption of this worker node. In order to do this, we need to first figure out the columns that represent the power usage.

Tasks:

1. Find the columns representing power usage
2. Process the columns into a more usable format
3. Plot the columns

In [None]:
"""
Task 2.1 In order to plot the power consumption of this worker node, we need to first
figure out the relevant columns.

A) As a starting point, look at Kepler's documentation for potentially relevant metrics:
https://sustainable-computing.io/design/metrics/

hint 1: CPU is likely using the most power in our cluster
hint 2: What do terms like 'core', 'uncore' and 'package' mean in the context of a CPU?
hint 3: Do we want information about 'containers' or 'node'?
hint 4: You will likely find multiple metrics with quite similar values,
        and it might be useful to plot multiple different metrics

B) As Kepler's documentation does not fully match the column names in the dataset,
   we can use string matching to find columns that are similar to some keywords.
"""

# We can use difflib to search for close matches for a string. (Useful, since the column names are very verbose)
import difflib
keywords = '' # TODO: Replace with relevant keywords to search for (you can use multiple words separated by space or underscore)

n = 6  # Maximum number of matches to return --- (use smaller values to decrease garbage results)
cutoff = 0.05  # Minimum similarity for a match (0.0 to 1.0)  --- (use higher values to decrease garbage results)
closest_matches = difflib.get_close_matches(keywords, df.columns, n = n,  cutoff=cutoff)

# Print the matches
if len(closest_matches) == 0:
  print("No matches found! (did you include any keywords to search for?)")
for match in closest_matches:
  print(match)
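To get a feel for how `cutoff` affects the search, the sketch below runs `difflib.get_close_matches` against a short list of column names. The names are assumptions chosen to resemble verbose Prometheus-style metrics; they are not guaranteed to match the columns in the dataset.

```python
import difflib

# Hypothetical column names, for illustration only
toy_columns = [
    "kepler_node_package_joules_total",
    "kepler_container_package_joules_total",
    "node_cpu_seconds_total",
    "node_memory_MemFree_bytes",
]

toy_results = {}
for toy_cutoff in (0.05, 0.6):
    # Matches are returned sorted by similarity, best match first
    toy_results[toy_cutoff] = difflib.get_close_matches(
        "node_package_joules", toy_columns, n=6, cutoff=toy_cutoff
    )
    print(toy_cutoff, toy_results[toy_cutoff])
```

A very low cutoff returns almost everything, so raising it (or lowering `n`) is usually the quickest way to trim unrelated columns.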

In [None]:
"""
Task 2.2 Let's plot these columns. However, it is very likely that resulting plots are not very informative as such.

- Just running the code can be enough, unless you want to fine-tune the plots.

Documentation for plotting pandas dataframes:
https://pandas.pydata.org/docs/user_guide/visualization.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
"""
from matplotlib import pyplot as plt
df[closest_matches].plot()
plt.xlabel('x_axis')
plt.ylabel('y_axis')
plt.title("Closest matches")

In [None]:
"""
Task 2.3 Let's make the metrics more readable.

We will do this by converting them from counters to gauges.

For example, instead of getting the 'total joules consumed' we get 'joules per second'.
"""

""" 2.3.1 get the update interval of the dataset. Prometheus polls data from Kubernetes pods every n seconds. """
timestamp = df["timestamp"]  # Timestamps in seconds; the column should be sorted in ascending order.
interval =  #  TODO: Compute the interval using the first two timestamps. This is usually between 5 to 15 seconds in our datasets (static value)

""" 2.3.2: Use the update interval to convert counters to gauges """
rate_metric_list = []
for metric in closest_matches:
  timeseries = df[metric]
  # TODO: Compute the change over time for the timeseries
  # Hint 1: Compute the difference between consecutive rows of the timeseries (e.g., [row1 - row0, row2 - row1, ...])
  # Hint 2: Use diff() from Pandas to do this
  difference =  # TODO: Compute the change over time for the timeseries
  # TODO: Compute rate per second (hint: remember the interval we computed above)
  per_second = # TODO (you can do element-wise divisions on dataframes, e.g., 'df = df/2')
  rate_metric_list.append(per_second)

# Create a new dataframe from the gauges
rates_df = pd.concat(rate_metric_list, axis=1)
# rates_df  # (Uncomment this to visualize the new dataframe)
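The counter-to-gauge conversion can be checked by hand on a few rows. The sketch below uses invented numbers and a hypothetical column name (`joules_total`); it assumes a fixed sampling interval, as the task does.

```python
import pandas as pd

# Toy counter: cumulative joules sampled every 5 seconds (made-up numbers)
toy = pd.DataFrame({
    "timestamp": [0, 5, 10, 15, 20],
    "joules_total": [100.0, 150.0, 250.0, 300.0, 320.0],
})

# Sampling interval in seconds, taken from the first two timestamps
toy_interval = toy["timestamp"].iloc[1] - toy["timestamp"].iloc[0]

# diff() gives the joules consumed between consecutive samples;
# the first row has no predecessor, so its value is NaN
toy_watts = toy["joules_total"].diff() / toy_interval

print(toy_watts.tolist())
```

Joules per second are watts, so the resulting series can be compared directly against the expected power range of a worker node.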

In [None]:
"""
Task 2.4 Let's plot these new metrics. They should now be more informative than before.

- Just running the code can be enough, unless you want to fine-tune the plots.
- If you did everything correctly in tasks 2.1 - 2.3, you should see six clear peaks in the power data
  (This is the day-night cycle from simulating a city for 6 days)
- Max power usage for our cluster is around 24 Watts per worker.
- The workload used for this dataset might not reach 24 Watts, but may be closer to 10 Watts.

Documentation for plotting pandas dataframes:
https://pandas.pydata.org/docs/user_guide/visualization.html
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.plot.html
"""
from matplotlib import pyplot as plt
rates_df.plot()
plt.xlabel('x_axis')
plt.ylabel('y_axis')
plt.title("Closest matches (converted to gauges)")

# Phase 3: Compute correlations

Our cluster is currently a bare-metal cluster. This means that we can directly access the power consumption metric through hardware interfaces such as RAPL.

However, cloud servers are often virtualized with no access to these hardware level metrics. In this case, we may need to predict the power consumption from other metrics.

In this phase we will train a power model that could be used to predict the power usage, even if we turn our cluster into a virtualized cluster. As the resulting power model is trained on data from our cluster hardware, it will only be capable of making predictions for similar hardware.

Before we can train a model, we need to figure out the target column (e.g., power) and the input columns (i.e., which columns could be useful for predicting the power usage?).

Tasks:
1. Get the target column (e.g., the power consumption)
2. Figure out potential candidates for input columns by computing correlations against the target column

In [None]:
"""
Task 3.1: Get the target column (e.g., the power consumption)
"""

# TODO: Get the target value for training power models (our target is the power)
# Hint 1: Which column best represented the power consumption in Phase 2?
# Hint 2: Remember to convert to a gauge
# Hint 3: You can reuse a variable from previous tasks, or use the difflib again
keywords = ""
closest_matches = difflib.get_close_matches(keywords, df.columns, n = 1,  cutoff=0.05)
print(closest_matches)
power = df[closest_matches[0]].diff() / interval  # Counter converted to a gauge: joules per second, i.e., watts
power.plot()


In [None]:
"""
Task 3.2: Compute correlations against the power
"""
# Power (of course) correlates with power - however the idea is to predict power when we do not have power metrics
# -> Remove all power metrics from the dataset
joule_columns = [x for x in df.columns if "joules" in x]
non_joule_df = df.drop(joule_columns, axis=1)

# Compute correlations against power
# Hint: You can use https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corrwith.html
correlations =  # TODO: replace with correlations (a pandas Series mapping each column name to its correlation)

# Sort and print some of the correlations
correlations = correlations.sort_values(ascending=False)
best_correlations = []
for i, corr in enumerate(correlations.items()):
  print(corr)
  best_correlations.append(corr[0])
  if i > 10:
    break

  #print(sorted(correlations, reverse=True))
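As a rough illustration of `DataFrame.corrwith`, the sketch below correlates three invented features with an invented power series. All names and numbers are assumptions. It also shows one design choice worth considering: sorting by absolute value, so that strongly negative correlations are not pushed to the bottom of the list.

```python
import pandas as pd

# Invented features; free_memory is built to move opposite to power
toy_features = pd.DataFrame({
    "cpu_util": [0.1, 0.2, 0.4, 0.8, 0.9],
    "free_memory": [9.0, 8.0, 6.0, 2.0, 1.0],
    "noise": [3.0, 1.0, 4.0, 1.0, 5.0],
})
toy_power = pd.Series([2.0, 3.0, 5.0, 9.0, 10.0])

# corrwith returns a Series indexed by column name (Pearson correlation by default)
toy_corr = toy_features.corrwith(toy_power)

# Rank by strength of the relationship, regardless of sign
toy_ranked = toy_corr.abs().sort_values(ascending=False)

print(toy_corr)
print(toy_ranked.index.tolist())
```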

# Phase 4 (optional): Compute more correlations

Like power, many metrics are stored as counters, so their raw values are not directly useful for predicting power consumption.

Transforming these metrics to gauges could reveal more useful metrics that correlate with power consumption.

Hints:
- How can you figure out which metrics are counters that need to be converted?

In [None]:
# TODO: similar to Phase 3, but convert all counters to gauges before computing correlations
# Hint: There is no strictly correct approach for finding out which metrics are counters in the first place.
#       For example, you can try manually looking at the columns to try to find a pattern.
# Hint 2: This task is completely optional.
optional_correlations = []
optional_best_correlations = []
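One possible heuristic, offered only as a sketch: treat a column as a counter if it never decreases and changes at least once. The helper `looks_like_counter` and the toy columns are hypothetical. The rule will miss counters that reset (for example after a pod restart) and may misclassify a gauge that happens to grow steadily, so inspecting the column names is still worthwhile.

```python
import pandas as pd

# Invented data: one counter-like column and one gauge-like column
toy = pd.DataFrame({
    "requests_total": [10.0, 12.0, 12.0, 20.0, 35.0],
    "cpu_temperature": [40.0, 45.0, 43.0, 50.0, 47.0],
})

def looks_like_counter(series: pd.Series) -> bool:
    """Heuristic: the series never decreases and increases at least once."""
    deltas = series.diff().dropna()
    return bool((deltas >= 0).all() and (deltas > 0).any())

toy_counters = [col for col in toy.columns if looks_like_counter(toy[col])]

# Convert only the detected counters to per-sample changes
toy_gauges = toy.copy()
for col in toy_counters:
    toy_gauges[col] = toy[col].diff()

print(toy_counters)
print(toy_gauges)
```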

# Phase 5: Create a predictive model for power

Now that we have the target column (power) and the input columns, we can try to train a model.

Here we are using scikit-learn (sklearn) to train a very simple LinearRegression model. However, you could try to train more complex models in a very similar way by using the other regression algorithms from sklearn.

In [None]:
"""
Task 5.1: Fill the missing lines to train a LinearRegression model

- What is the mean squared error? Can you improve it?
"""

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Prepare the data
# X is a dataframe containing the input features
# y is a series/dataframe containing the target value that we want to predict (i.e., power)
X = # TODO: Use the best correlating columns as the input
y = # TODO: Use the power column as the target

print(f"Input columns: {X.columns}")

# Check for NaN values in the dataframes (sklearn does not like NaNs)
print("NaN values in X:", X.isna().sum().sum())
print("NaN values in y:", y.isna().sum())

# Replace NaN values with 0 as a quick and dirty workaround
X = X.fillna(0)
y = y.fillna(0)

# TODO: Split the data into training and testing sets (hint: train_test_split can do this for you)
# https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
X_train, X_test, y_train, y_test = # TODO: Use train_test_split-method, perhaps with 20% test data and some fixed random seed

# Train a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)
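The same split, fit, and evaluate workflow can be rehearsed on synthetic data where the true relationship is known. Everything below is an assumption made for illustration: the feature names, the 3 W idle level, and the 7 W per unit of CPU utilization are invented and say nothing about the real cluster.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "cpu_util": rng.uniform(0, 1, 200),
    "net_mbps": rng.uniform(0, 100, 200),  # deliberately unrelated to the target
})
# Synthetic target: idle power + CPU-dependent power + small measurement noise
toy_y = 3.0 + 7.0 * toy_X["cpu_util"] + rng.normal(0, 0.1, 200)

# Hold out 20% of the rows; a fixed random_state keeps the split reproducible
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

toy_model = LinearRegression().fit(toy_X_train, toy_y_train)
toy_mse = mean_squared_error(toy_y_test, toy_model.predict(toy_X_test))

print(f"MSE: {toy_mse:.4f}")
print(dict(zip(toy_X.columns, toy_model.coef_.round(2))))
```

Inspecting `coef_` is a quick sanity check: the coefficient of the unrelated feature should be close to zero, while the CPU coefficient should land near the value used to generate the data.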


# Phase 6: Assess the quality of the power model

In addition to computing the MSE, we can evaluate the model by plotting the predictions against real values.
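Before plotting, it can help to put the MSE into interpretable units. MSE is measured in squared units (here W²), so its square root (RMSE) gives a typical error in watts. The numbers below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented real and predicted power values in watts
toy_real = np.array([10.0, 12.0, 8.0, 11.0])
toy_pred = np.array([9.0, 12.5, 8.5, 13.0])

# MSE is the average of the squared errors
toy_mse_manual = np.mean((toy_real - toy_pred) ** 2)
toy_mse_sklearn = mean_squared_error(toy_real, toy_pred)

# RMSE brings the error back to the unit of the target (watts)
toy_rmse = np.sqrt(toy_mse_manual)

print(toy_mse_manual, toy_mse_sklearn, toy_rmse)
```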

In [None]:
"""
Task 6.1: Plot predictions against the test set by running the code

- What can you say from the plots?
- How well are the predictions matching the real values?
"""


import matplotlib.pyplot as plt

# Plot predicted vs real power
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.5)
plt.xlabel("Real Power")
plt.ylabel("Predicted Power")
plt.title("Real Power vs Predicted Power")
plt.grid(True)

# Add a diagonal line for reference
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red')

plt.show()


In [None]:
"""
Task 6.2: Plot predictions against the full dataset by running the code

- What can you say from the plots?
- How well are the predictions matching the real values?
- What is the difference to the above plot?
"""
# Make predictions on the full dataset (training and test rows together)
y_pred_full = model.predict(X)

# Create a DataFrame for plotting
plot_df = pd.DataFrame({
    'Real Power': y,
    'Predicted Power': y_pred_full
}, index=X.index)

# Sort the DataFrame by index to avoid jumps in the plot
plot_df = plot_df.sort_index()

# Plot predicted vs real power
plt.figure(figsize=(10, 6))
plt.plot(plot_df.index, plot_df['Real Power'], label='Real Power')
plt.plot(plot_df.index, plot_df['Predicted Power'], label='Predicted Power')
plt.xlabel("Index")
plt.ylabel("Power")
plt.title("Real Power vs Predicted Power (Full Dataset)")
plt.grid(True)
plt.legend()
plt.show()

# What next?

As a next step, you can evaluate your model against previously unseen data from other datasets:

- *List of datasets here* (hopefully updated for the Thursday session)

You can also try to improve the model by using more advanced approaches:

- Use something other than LinearRegression
- Use AutoML to automatically try different regression methods and feature selection methods
- Do additional filtering and preprocessing of features
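
Swapping in another regressor requires very little code, because scikit-learn estimators share the same `fit`/`predict` interface. The sketch below compares `LinearRegression` with `RandomForestRegressor` on a deliberately nonlinear, invented target (a 7 W jump once CPU utilization passes 50%). It is an assumption-laden toy, not evidence that a forest will do better on the real data.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
toy_X = pd.DataFrame({"cpu_util": rng.uniform(0, 1, 300)})
# Invented step-shaped target that a straight line cannot follow
toy_y = 3.0 + 7.0 * (toy_X["cpu_util"] > 0.5) + rng.normal(0, 0.1, 300)

toy_X_tr, toy_X_te, toy_y_tr, toy_y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42
)

# Fit each candidate on the same split and record its test MSE
toy_scores = {}
for name, candidate in [
    ("linear", LinearRegression()),
    ("forest", RandomForestRegressor(n_estimators=50, random_state=0)),
]:
    candidate.fit(toy_X_tr, toy_y_tr)
    toy_scores[name] = mean_squared_error(toy_y_te, candidate.predict(toy_X_te))

print(toy_scores)
```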