# Generate Synthetic Data and Trigger Model Performance Workflow


Kernel `Python 3 (ipykernel)` works well with this notebook.

---

This tutorial shows you how to generate two synthetic datasets:
1. Synthetic Autopilot-generated results: used in model performance evaluation
2. Synthetic historical solar power data: includes daily ground truth and is used in model evaluation and Autopilot training

---

## Contents
1. [Setup](#Setup)
2. [Create sample dataset](#sample_data)
 




## Setup <a class="anchor" id="Setup"></a>


In [None]:
# Install necessary packages
!pip install --upgrade boto3 --quiet
!pip install --upgrade sagemaker --quiet
!pip install --upgrade pandas --quiet
!pip install --upgrade numpy --quiet
!pip install --upgrade matplotlib --quiet

First, let's obtain the name of the S3 bucket used to store the data.

In [None]:
import boto3

# Retrieve the account ID and Region
region = boto3.Session().region_name
account_id = boto3.client("sts").get_caller_identity()["Account"]

# Create an S3 client
s3 = boto3.client("s3")

# Build the bucket name from the account ID and Region
bucket_name = "solar-power-forecast-{}-{}".format(account_id, region)
print("Bucket name is: {}.".format(bucket_name))
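`boto3.Session().region_name` returns `None` when no default Region is configured, in which case the formatted bucket name would silently end in `-None`. The following is a minimal sketch of a guard, assuming a hypothetical helper named `build_bucket_name` (not part of the original notebook) that reuses the naming pattern from the cell above. The account ID in the example is a placeholder.

```python
from typing import Optional


def build_bucket_name(account_id: str, region: Optional[str]) -> str:
    """Return the bucket name used in this notebook, or raise if the Region is missing.

    Hypothetical helper for illustration; the naming pattern is copied from the cell above.
    """
    if not region:
        raise ValueError(
            "No default AWS Region is configured; set one before building the bucket name."
        )
    return "solar-power-forecast-{}-{}".format(account_id, region)


# Example with placeholder values (not a real account)
print(build_bucket_name("123456789012", "us-east-1"))
```

Failing early here is a design choice: a clear error is easier to diagnose than an upload failure against a bucket that does not exist.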

Next, we'll import the Python libraries we'll need for the remainder of the exercise.

In [None]:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import random

## Create sample dataset <a class="anchor" id="sample_data"></a>

This part shows you how to generate two synthetic datasets:
1. Synthetic Autopilot generated result
2. Synthetic historical solar power data

In reality, the dataset can be huge, and you may need tools such as Amazon Athena or an Amazon SageMaker Processing job to process the data.

For the purposes of this example, we simplify the process: synthesize the data in this notebook, write it locally, and then upload it to S3.

In [None]:
# Define the number of solar panels and maximum capacity
num_panels = 1
max_capacity = 350  # in watts

# Initialize an empty list to store the data
data = []

# Define the date range: n_day days of history ending today, with the final day used as the prediction window
n_day = 15
now = datetime.now()
today = datetime(now.year, now.month, now.day)
start_date = today - timedelta(days=n_day)
predict_start_time = today - timedelta(days=1)

# Format the prediction start date as a YYYY-MM-DD string (used in the S3 key)
formatted_time = predict_start_time.strftime('%Y-%m-%d')
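The two windows defined above determine how many rows each dataset will contain once `generate_solar_power_data` samples them at 15-minute intervals. The following sketch does that arithmetic with the standard library only; `expected_rows` and `INTERVAL_MINUTES` are illustrative names, not part of the original notebook, and the calculation assumes both window boundaries fall on midnight, as they do here.

```python
from datetime import datetime, timedelta

INTERVAL_MINUTES = 15  # sampling interval used later by generate_solar_power_data


def expected_rows(start: datetime, end: datetime, num_panels: int = 1) -> int:
    """Rows produced for whole days in [start, end) at one row per interval per panel."""
    days = (end - start).days
    rows_per_day = 24 * 60 // INTERVAL_MINUTES  # 96 intervals per day
    return days * rows_per_day * num_panels


now = datetime.now()
today = datetime(now.year, now.month, now.day)
start_date = today - timedelta(days=15)
predict_start_time = today - timedelta(days=1)

print(expected_rows(start_date, today))          # historical window: 15 days
print(expected_rows(predict_start_time, today))  # prediction window: 1 day
```

With one panel, this gives 1440 rows for the historical window and 96 rows for the prediction window, which is a quick sanity check against the DataFrames built later.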

In [None]:
def generate_solar_power_data(start_date, end_date, num_panels, max_capacity):
    """
    Generate synthetic power data for each solar panel at 15-minute intervals.

    :param start_date: The starting datetime for data generation (inclusive).
    :param end_date: The ending datetime for data generation (exclusive).
    :param num_panels: Number of solar panels.
    :param max_capacity: Maximum power capacity of a solar panel, in watts.
    :return: A list of [panel_id, timestamp, power] rows.
    """
    data = []
    
    for panel_id in range(1, num_panels + 1):
        current_date = start_date
        while current_date < end_date:
            for minute in range(0, 60 * 24, 15):
                timestamp = current_date + timedelta(minutes=minute)
                hour = timestamp.hour

                # Simulate the sun level based on the time of day
                if 6 <= hour < 18:  # Daytime hours (6 AM to 6 PM)
                    sun_level = np.sin((hour - 6) * np.pi / 12) * ((100 + random.randint(-10, 10)) / 100)
                else:
                    sun_level = 0.0  # No sun during nighttime

                # Introduce noise to the max_capacity for each data point
                noisy_max_capacity = max_capacity * np.random.normal(loc=1.0, scale=0.10)  # 10% standard deviation

                # Calculate the actual power generation based on sun level and noisy maximum capacity
                power = sun_level * noisy_max_capacity

                # Add the data to the list
                data.append([panel_id, timestamp, power])

            current_date += timedelta(days=1)  # Move to the next day

    return data
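The shape of the generated curve comes from the sun-level term: half of a sine wave stretched across the daytime hours, and zero at night. The sketch below isolates that term without the random factors so the profile is easy to inspect; `base_sun_level` is an illustrative name, not part of the original notebook.

```python
import math


def base_sun_level(hour: int) -> float:
    """Noise-free sun level for a given hour, mirroring the daytime branch above."""
    if 6 <= hour < 18:  # Daytime hours (6 AM to 6 PM)
        return math.sin((hour - 6) * math.pi / 12)
    return 0.0  # No sun during nighttime


# Print the profile for a few hours of the day
for hour in (0, 6, 9, 12, 15, 18):
    print(hour, round(base_sun_level(hour), 3))
```

The level rises from 0 at 6 AM to 1 at noon and falls back toward 0 by 6 PM. Because the generator uses `timestamp.hour`, the four 15-minute samples within an hour share the same base level and differ only through the random factors.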

### Synthesize Autopilot generated result

The structure of the data generated is as follows:
* __id__: (required: ItemIdentifierAttributeName)
* __timestamp__: (required: TimestampAttributeName)
* __p50__: Predicted result. The true value is expected to be lower than the predicted value 50% of the time. This is also known as the median forecast. [Read more here](https://docs.aws.amazon.com/forecast/latest/dg/metrics.html#metrics-wQL)
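
To make the p50 definition concrete, the sketch below computes two quantities for paired actual and predicted values: the share of points where the actual value falls below the forecast, and the weighted quantile loss at a quantile of 0.5 as described on the linked metrics page. The function names and sample values are illustrative, and whether the evaluation workflow uses this exact metric is an assumption, not something this notebook states.

```python
from typing import Sequence


def share_below_forecast(actuals: Sequence[float], forecasts: Sequence[float]) -> float:
    """Fraction of points whose actual value is below the forecast."""
    below = sum(1 for y, q in zip(actuals, forecasts) if y < q)
    return below / len(actuals)


def weighted_quantile_loss(
    actuals: Sequence[float], forecasts: Sequence[float], tau: float = 0.5
) -> float:
    """wQL[tau] = 2 * sum(tau * max(y - q, 0) + (1 - tau) * max(q - y, 0)) / sum(|y|)."""
    numerator = sum(
        tau * max(y - q, 0.0) + (1 - tau) * max(q - y, 0.0)
        for y, q in zip(actuals, forecasts)
    )
    denominator = sum(abs(y) for y in actuals)
    return 2 * numerator / denominator


# Illustrative values only
actuals = [100.0, 200.0, 300.0, 400.0]
p50 = [110.0, 190.0, 310.0, 390.0]
print(share_below_forecast(actuals, p50))
print(weighted_quantile_loss(actuals, p50))
```

For a well-calibrated median forecast the first number should be close to 0.5. The loss is undefined when every actual value is zero, for example over a nighttime-only window, so a real evaluation would need to handle that case.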

In [None]:
# Generate synthetic predictions for the most recent day (predict_start_time to today)
# Lower max_capacity so the synthetic predictions drift far from the actuals and trigger retraining
data = generate_solar_power_data(predict_start_time, today, num_panels, max_capacity-200)

# Define column names
column_names = ['id', 'timestamp', 'p50']

# Create a DataFrame from the data list
df_result = pd.DataFrame(data, columns=column_names)

In [None]:
# Review the first 10 rows
df_result.head(10)

In [None]:
# (Optional) Plot the dataset
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 5))
plt.plot(df_result['timestamp'], df_result['p50'])
plt.title('Autopilot Predicted Power')
plt.xlabel('Time')
plt.ylabel('Power')
plt.show()

In [None]:
file_name = 'solar_power_data.csv'
local_file_path = 'pred_solar_power_data.csv'
s3_file_path = 'data/pred/{}/{}'.format(formatted_time, file_name)

try:
    # Save data to CSV
    df_result.to_csv(local_file_path, index=False)

    # Upload the file
    s3.upload_file(local_file_path, bucket_name, s3_file_path)
    print(f"File successfully uploaded to {bucket_name}/{s3_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")
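
Both upload cells in this notebook write to a date-partitioned key of the form `data/<kind>/<YYYY-MM-DD>/<file_name>`, where the date is the prediction start date. The sketch below captures that layout in one place; `build_s3_key` is a hypothetical helper, not part of the original notebook, and the date in the example is arbitrary.

```python
from datetime import datetime


def build_s3_key(kind: str, day: datetime, file_name: str = "solar_power_data.csv") -> str:
    """Build 'data/<kind>/<YYYY-MM-DD>/<file_name>' as used by the upload cells.

    `kind` is 'pred' for Autopilot results or 'hist' for historical data.
    """
    if kind not in ("pred", "hist"):
        raise ValueError("kind must be 'pred' or 'hist'")
    return "data/{}/{}/{}".format(kind, day.strftime("%Y-%m-%d"), file_name)


# Example with an arbitrary date
print(build_s3_key("pred", datetime(2024, 5, 1)))
```

Keeping the prediction and historical files under the same date folder name makes it straightforward to pair them during evaluation.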

### Synthesize historical power data

The structure of the data generated is as follows:
* __id__: (required: ItemIdentifierAttributeName)
* __timestamp__: (required: TimestampAttributeName)
* __actual_power__: (required: TargetAttributeName)

In [None]:
# Generate historical power data for the last 15 days (n_day)
data = generate_solar_power_data(start_date, today, num_panels, max_capacity)

# Define column names
column_names = ['id', 'timestamp', 'actual_power']

# Create a DataFrame from the data list
df_hist = pd.DataFrame(data, columns=column_names)

In [None]:
# Review the first 10 rows
df_hist.head(10)

In [None]:
# (Optional) Plot the dataset
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 5))
plt.plot(df_hist['timestamp'], df_hist['actual_power'])
plt.title('Historical Power')
plt.xlabel('Time')
plt.ylabel('Power')
plt.show()

Once the data is uploaded to the `s3://<your-bucket>/data/hist/...` path, the `object_create` event triggers the AWS Step Functions workflow to start the evaluation.

In [None]:
file_name = 'solar_power_data.csv'
local_file_path = 'hist_solar_power_data.csv'
s3_file_path = 'data/hist/{}/{}'.format(formatted_time, file_name)

try:
    # Save data to CSV
    df_hist.to_csv(local_file_path, index=False)

    # Upload the file
    s3.upload_file(local_file_path, bucket_name, s3_file_path)
    print(f"File successfully uploaded to {bucket_name}/{s3_file_path}")
except Exception as e:
    print(f"An error occurred: {e}")