# Phase 7: Extension to previous exercise



In the previous exercise, you trained and tested a simple power model against one dataset.

This notebook includes links to more data and a small script to test your model against these datasets.

You can copy-paste the code from this notebook to the previous exercise and try to evaluate your model.

Steps:
1. Copy and paste this code at the end of the previous exercise (after training your model)
2. Download the new datasets using the provided code (only do this once)
3. Run the second part of this code to evaluate your model against these datasets
4. Think about the following questions:

- How well does your model generalize to the other datasets?
- Are there differences between the workers?
- Can you improve your model?
  - Use data from more datasets to train the model
  - Try selecting better input features (e.g., something better than the simple correlations)
  - Use more advanced ML algorithms (e.g., LinearRegression vs PolynomialRegression vs ... vs deep learning)
  - Use AutoML to automatically try different approaches
  - ...?
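
To try a different regression method with very little code, the sketch below may help as a starting point. It is not tied to the exercise data: the `make_fake_worker` helper and the quadratic relation between load and power are made up for illustration. It compares a plain `LinearRegression` with a degree-2 polynomial regression, which in scikit-learn is commonly built as a pipeline of `PolynomialFeatures` and `LinearRegression`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_fake_worker(n_rows=200):
    """Synthetic stand-in for one worker (assumption: power grows quadratically with load)."""
    cpu = rng.uniform(0.0, 1.0, size=(n_rows, 1))
    power = 20.0 + 30.0 * cpu[:, 0] ** 2 + rng.normal(0.0, 0.5, size=n_rows)
    return cpu, power

X_train, y_train = make_fake_worker()  # plays the role of the training worker
X_test, y_test = make_fake_worker()    # plays the role of a previously unseen worker

candidates = {
    "linear": LinearRegression(),
    "poly2": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

scores = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, candidate.predict(X_test))
    print(f"{name}: MSE = {scores[name]:.3f}")
```

With your own data, replace the synthetic arrays with the features and power values of the training worker and of one evaluation worker. Whether the polynomial model actually helps depends on the real data.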

In [None]:
"""
7.1: Download another dataset, including data from all workers.

This dataset includes the six peaks of the simulated day-night cycle (fewer vehicles at night, more during the day).

During peak hours, the rate of images received from vehicles and processed in the cluster was capped at 4 Mbps.
"""

!gdown https://drive.google.com/uc?id=1XCDQt5m7k4NuLn30iFZxENWP6hzyc2t0 -O mbps4_worker1.feather
!gdown https://drive.google.com/uc?id=105QzBeUda6FfovMqa7916hV4zLGS-hGI -O mbps4_worker2.feather
!gdown https://drive.google.com/uc?id=1uMBCK4PywMJaU0vhD2dHD_UFyxmQiofU -O mbps4_worker3.feather
!gdown https://drive.google.com/uc?id=1Fe_atHnsOmAHckTLNVjlrPnwk3J7-eZF -O mbps4_worker4.feather
!gdown https://drive.google.com/uc?id=1aitUG9j48i9JHJmfc-1vXjVVRdQZxc9J -O mbps4_worker5.feather

"""
Additionally, you can uncomment the following code to download yet another dataset.

This dataset does not have the six peaks of the simulated day-night cycle (fewer vehicles at night, more during the day). Instead, the workload increases linearly over time.

It is very likely that your model will not generalize to this dataset.
"""

#!gdown https://drive.google.com/uc?id=1Yj6bFut4HE3BSqiM7aJKbXkq2I8LBCQ6 -O linear_worker1.feather
#!gdown https://drive.google.com/uc?id=1YfbmnDAJY2XGVf5BTltBy8OEaPRX7_8e -O linear_worker2.feather
#!gdown https://drive.google.com/uc?id=1zbGVwmXy8Y3JOaZMO3jXKcbPQ3zvpjMB -O linear_worker3.feather
#!gdown https://drive.google.com/uc?id=1CIagdefkqgx3QJclR5ExhKg_J_zymjhn -O linear_worker4.feather
#!gdown https://drive.google.com/uc?id=1c8tI7-6jSI3hOHFDQspZP-b-POPHRfqf -O linear_worker5.feather
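
Since the datasets only need to be downloaded once, it can be handy to check which files are already present before re-running the download cell. The following is a minimal sketch using only the standard library. The `missing_files` helper is a hypothetical name introduced here, and it assumes the files were saved to the current working directory under the names used in the `gdown` commands above.

```python
from pathlib import Path

def missing_files(filenames, directory="."):
    """Return the filenames that are not yet present in `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).exists()]

# File names as written by the download commands above
expected = [f"mbps4_worker{i}.feather" for i in range(1, 6)]
to_download = missing_files(expected)
print("Still to download:", to_download)
```

If the printed list is empty, the download cell can be skipped.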


In [None]:
"""
7.2: Load the new datasets into a list, so you do not need to reload them after every change to later code.
"""

import pandas as pd

evaluation_dfs = []
for i in range(1, 6):
  # Get the dataset
  worker_df = pd.read_feather(f"/content/mbps4_worker{i}.feather")
  # Remove the (nearly) static columns, i.e., those with at most two unique values
  # (hopefully makes dealing with the dataset faster)
  unique_counts = worker_df.nunique()
  static_columns = unique_counts[unique_counts <= 2].index
  worker_df = worker_df.drop(static_columns, axis=1)
  evaluation_dfs.append(worker_df)
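
To see what the static-column filter does, it may help to run it on a tiny example. The toy DataFrame and its column names below are made up for illustration. Note that the `<= 2` threshold removes not only constant columns but also columns with exactly two distinct values, such as binary flags.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "constant_metric": [7, 7, 7, 7],             # 1 unique value: dropped
    "binary_flag": [0, 1, 0, 1],                 # 2 unique values: also dropped
    "cpu_seconds_total": [1.0, 2.5, 4.0, 6.5],   # 4 unique values: kept
})

toy_unique_counts = toy_df.nunique()
toy_static_columns = toy_unique_counts[toy_unique_counts <= 2].index
filtered_df = toy_df.drop(toy_static_columns, axis=1)
print(list(filtered_df.columns))
```

If two-valued columns turn out to be useful features, the threshold could be lowered to `<= 1`.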

In [None]:
"""
7.3: Evaluate your model against all workers from the new datasets

NOTE: If your model relies on any data from specific containers, the evaluation will fail.
      You cannot expect the other workers to run the exact same containers with the exact same IDs.
      You will have to remove all container-specific features from the feature candidates.

- Can you improve the model to get better results?
  - Use better input features
  - Use better regression methods (check the scikit-learn documentation or ask Gemini on Colab)
  - Use better preprocessing for the metrics
  - Use more data (from the new datasets, but beware of overfitting)
  - ...
"""
import pandas as pd
from sklearn.metrics import mean_squared_error
import difflib

#model = None # TODO: Replace with your model (reuse the code from the previous exercise)
input_features = best_correlations # TODO: Replace with the input features (list of strings) you used for the model (e.g., the best_correlations)
mse_list = []

def get_mse(df, input_features, worker_id, model):
  # Find the power column for this dataset (column names differ slightly between workers)
  target_word = 'kepler package joules total dynamic'
  closest_matches = difflib.get_close_matches(target_word, df.columns, n=1, cutoff=0.05)

  # Convert the cumulative energy counter to a gauge (energy per time unit),
  # assuming a constant sampling interval
  ts = df["timestamp"]
  interval = ts.iloc[1] - ts.iloc[0]
  power = df[closest_matches[0]].diff() / interval
  y = power

  # Change the worker name in the columns, otherwise X will be empty (i.e., worker2 has no data for worker1)
  for n in range(2, 6):
    df.columns = [x.replace(f"worker{n}", "worker1") for x in df.columns]
  X = df[input_features]
  # Get rid of NaNs before predicting to avoid errors
  # (the first power value is always NaN because diff() has no previous sample)
  X = X.fillna(0)
  y = y.fillna(0)
  # Get predicted power using the model
  y_pred = model.predict(X)
  # Compare predicted power to power from RAPL
  mse = mean_squared_error(y, y_pred)
  print(f"Mean Squared Error (worker{worker_id}):", mse)
  return mse


for i in range(1, 6):
  # Get the dataset
  #worker_df = pd.read_feather(f"/content/mbps4_worker{i}.feather")
  worker_df = evaluation_dfs[i-1]

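  # Remove the static columns from the dataframe. This was already done when loading the data,
  # but drop() also returns a new dataframe, so renaming columns in get_mse leaves evaluation_dfs untouched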
  unique_counts = worker_df.nunique()
  static_columns = unique_counts[unique_counts <= 2].index
  worker_df = worker_df.drop(static_columns, axis=1)

  # Compute mse
  mse = get_mse(worker_df, input_features, i, model)
  mse_list.append(mse)

print(mse_list)
print(f"Mean MSE: {sum(mse_list) / len(mse_list)} W^2")
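
The evaluation code above converts the energy counter to power using a single interval taken from the first two timestamps. If the sampling interval is not perfectly constant, a per-sample interval may be more accurate. The sketch below illustrates that idea on made-up numbers. The `counter_to_watts` helper is a hypothetical name, and it assumes the `timestamp` column holds plain numbers in seconds and the counter is in joules.

```python
import pandas as pd

def counter_to_watts(joules, timestamps):
    """Convert a cumulative joule counter into average power per sampling interval."""
    energy_delta = joules.diff()      # joules consumed since the previous sample
    time_delta = timestamps.diff()    # seconds elapsed since the previous sample
    return energy_delta / time_delta  # first value is NaN: it has no previous sample

# Made-up data for illustration; note the longer last interval
toy = pd.DataFrame({
    "timestamp": [0.0, 5.0, 10.0, 20.0],
    "joules_total": [100.0, 150.0, 250.0, 350.0],
})
toy_power = counter_to_watts(toy["joules_total"], toy["timestamp"])
print(toy_power.tolist())
```

The first sample has no predecessor, so its power is NaN. Dropping that row, rather than filling it with 0, avoids scoring the model against an artificial 0 W target.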



# What next?

As a next step, you can evaluate your model against previously unseen data from other datasets:

- List of datasets here

You can also try to improve the model by using more advanced approaches:

- Use something other than LinearRegression
- Use AutoML
- Do additional filtering and preprocessing of features
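
One simple form of feature filtering is to remove container-specific features, since other workers cannot be expected to run the same containers with the same IDs. The sketch below assumes such features can be recognized by the substring `container` in their name. That naming rule, the `drop_container_features` helper, and the example feature names are assumptions for illustration, so check them against the actual column names in your dataset.

```python
def drop_container_features(feature_names, marker="container"):
    """Keep only features whose name does not contain `marker` (case-insensitive)."""
    return [name for name in feature_names if marker.lower() not in name.lower()]

# Hypothetical feature names, for illustration only
candidate_features = [
    "node_cpu_seconds_total_worker1",
    "container_cpu_usage_seconds_total_id_abc123_worker1",
    "node_memory_Active_bytes_worker1",
]
portable_features = drop_container_features(candidate_features)
print(portable_features)
```

The filtered list can then be used as the pool of feature candidates before selecting the best correlations and retraining the model.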