Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the MIT License.

# Data Preparation
---

This notebook generates simulated solar generation and energy consumption data, based on a dataset from a real home in Australia. The data is used throughout this solution accelerator to demonstrate training many models and forecasting on Azure Machine Learning.

The notebook covers all the steps needed to prepare the data, including:

1. Generate the sample data
2. Split the data into training and forecasting sets
3. Connect to your workspace and upload the data to its Datastore

### Prerequisites
If you have already run the [00_Setup_AML_Workspace](00_Setup_AML_Workspace.ipynb) notebook you are all set.


## 1.0 Generate sample data

The synthetic datasets are created as follows:

1. We create datasets for 5 suburbs with 10 homes each, for a total of 50 series to build and train models on.
2. For each home, we shift the Solar, Temp, and General Usage values by a random normal variable whose mean and standard deviation are estimated from the original dataset.
3. This produces 50 separate files in a folder.

In [None]:
import os
import pandas as pd
import numpy as np

np.random.seed(0)

In [None]:
# Read the raw dataset
df = pd.read_csv("data/sampleEnergy.csv")

# Split reference data into solar and general usage and join
general_usage_df = df[df["RateTypeDescription"] == "Generalusage"][["EndDate", "ProfileReadValue"]].reset_index(drop=True)
solar_df = df[df["RateTypeDescription"] == "Solar"][["EndDate", "ProfileReadValue", "DeviceNumber", "QualityFlag", "BOMTEMP"]].reset_index(drop=True)

pd.concat([general_usage_df, solar_df], axis=1, join="inner")

# process reference dataset
processed_df = solar_df.copy()
processed_df["Generalusage"] = general_usage_df["ProfileReadValue"]
processed_df["Solar"] = solar_df["ProfileReadValue"]
processed_df["Temp"] = solar_df["BOMTEMP"]
processed_df["NetEnergy"] = processed_df["Solar"] - processed_df["Generalusage"]
processed_df["EndDate"] = pd.to_datetime(processed_df["EndDate"], format="%d/%m/%Y %H:%M")
processed_df.drop(["ProfileReadValue", "BOMTEMP"], axis=1, inplace=True)

# Create time-based features
processed_df["Quarter"] = processed_df["EndDate"].dt.quarter
processed_df["Month"] = processed_df["EndDate"].dt.month
processed_df["Weekday"] = processed_df["EndDate"].dt.weekday
processed_df["Hour"] = processed_df["EndDate"].dt.hour
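# Series.dt.week was removed in pandas 2.0; isocalendar().week returns the same ISO week of the year
processed_df["WeekOfTheMonth"] = processed_df["EndDate"].dt.isocalendar().week.astype(int)
# np.int was removed in NumPy 1.24; the built-in int is equivalent here
processed_df["Weekend"] = (processed_df["Weekday"] >= 5).astype(int)
# Week-of-month bucket derived from the day of the month
processed_df["DateOfMonth"] = ((processed_df["EndDate"].dt.day // 7) + 1)
processed_df["AMPM"] = (processed_df["EndDate"].dt.hour > 11).astype(int)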
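
The calendar features above can be checked on a couple of known timestamps. The following is a minimal sketch for illustration only, not part of the notebook's pipeline. It assumes pandas 1.1 or later (for `Series.dt.isocalendar`). The column names `IsoWeek` and `WeekBucket` are illustrative names chosen here; the notebook stores the same quantities under `WeekOfTheMonth` and `DateOfMonth`.

```python
import pandas as pd

# Two hand-picked timestamps: a Monday afternoon and a Sunday morning
stamps = pd.Series(pd.to_datetime(["2020-06-15 14:30", "2020-06-07 09:00"]))

features = pd.DataFrame({
    "Quarter": stamps.dt.quarter,
    "Month": stamps.dt.month,
    "Weekday": stamps.dt.weekday,  # Monday=0 ... Sunday=6
    "Hour": stamps.dt.hour,
    "IsoWeek": stamps.dt.isocalendar().week.astype(int),  # ISO week of the year
    "Weekend": (stamps.dt.weekday >= 5).astype(int),  # 1 on Saturday/Sunday
    "WeekBucket": stamps.dt.day // 7 + 1,  # bucket derived from the day of the month
    "AMPM": (stamps.dt.hour > 11).astype(int),  # 0 = before noon, 1 = noon or later
})
print(features)
```

Note that the ISO week counts weeks of the year, so it ranges up to 53, while the day-of-month bucket ranges from 1 to 5.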

In [None]:
# Create folder to write data
folder_path = "data/synthetic/"
os.makedirs(folder_path, exist_ok=True)

# Create simulated suburbs and homes
suburbs = ["Manly", "Bondi", "StKildas", "AlbertPark", "NorthBridge"]
homes = list(range(1, 11, 1))

# Calculate mean / std of changes for Generalusage, Solar, Temp to augment baseline data
generalusage_mean, generalusage_std = processed_df["Generalusage"].diff().mean(), processed_df["Generalusage"].diff().std()
solar_mean, solar_std = processed_df["Solar"].diff().mean(), processed_df["Solar"].diff().std()
temp_mean, temp_std = processed_df["Temp"].diff().mean(), processed_df["Temp"].diff().std()

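# Generate synthetic data
n_rows = processed_df.shape[0]
for suburb_idx, suburb in enumerate(suburbs):
    # Homes in the same suburb share the same temperature shift
    temp_delta = np.random.normal(temp_mean, temp_std, n_rows)

    for home_idx, home in enumerate(homes):
        suburb_name = suburb
        home_name = f"home{home}"

        # Each home gets its own usage and solar shifts, drawn from the statistics computed above
        generalusage_delta = np.random.normal(generalusage_mean, generalusage_std, n_rows)
        solar_delta = np.random.normal(solar_mean, solar_std, n_rows)

        suburb_home_df = processed_df.copy()
        # Offset the device number so that every home gets a distinct value
        suburb_home_df["DeviceNumber"] += suburb_idx * len(homes) + home_idx
        suburb_home_df["Temp"] += temp_delta
        suburb_home_df["Generalusage"] += generalusage_delta
        suburb_home_df["Solar"] += solar_delta
        # Usage and solar generation cannot be negative
        suburb_home_df.loc[suburb_home_df["Generalusage"] < 0, "Generalusage"] = 0
        suburb_home_df.loc[suburb_home_df["Solar"] < 0, "Solar"] = 0
#         suburb_home_df["NetEnergy"] = suburb_home_df["Solar"] - suburb_home_df["Generalusage"]
        suburb_home_df["Suburb"] = suburb_name
        suburb_home_df["Home"] = home_name

        print(f"Writing synthetic data for {suburb_name} {home_name}")
        suburb_home_df.to_csv(os.path.join(folder_path, f"{suburb_name}_{home_name}.csv"), index=False)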
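
Before moving on, it can help to confirm that the expected 50 files were written. The following is a minimal sketch, assuming the `<Suburb>_<home>.csv` naming used above; the helper `count_homes_per_suburb` is a hypothetical name introduced here for illustration.

```python
import os
from collections import Counter


def count_homes_per_suburb(folder):
    """Count CSV files per suburb, assuming '<Suburb>_<home>.csv' file names."""
    counts = Counter()
    for name in os.listdir(folder):
        if name.endswith(".csv"):
            # Everything before the first underscore is taken as the suburb name
            suburb, _, _home = name[:-4].partition("_")
            counts[suburb] += 1
    return dict(counts)


# Only runs the check if the synthetic data folder already exists
if os.path.isdir("data/synthetic/"):
    print(count_homes_per_suburb("data/synthetic/"))
```

With the settings above, each of the 5 suburbs should report 10 files.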

## 2.0 Split data into two sets

We will now split each dataset into two parts: one is used for training, and the other for simulating batch forecasting. The training files contain the records before '2020-06-01'; the remaining records of each series are stored in the inference files.

Finally, we will upload both sets of data files to the Workspace's default [Datastore](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)).

In [None]:
from scripts.helper import split_data

# Split each file and store in corresponding directory
train_path, inference_path = split_data(folder_path, 'EndDate', '2020-06-01')
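
The implementation of `split_data` lives in `scripts/helper.py` and is not shown in this notebook. As a rough illustration of a date-based split, here is a minimal sketch. The function name `split_by_date`, the `out_root` parameter, and the `train` and `inference` folder names are assumptions made for this example and may differ from what the real helper does.

```python
import os

import pandas as pd


def split_by_date(src_dir, time_col, split_date, out_root):
    """Hypothetical stand-in for a date-based split; not the actual helper.

    Rows with time_col before split_date go to '<out_root>/train',
    the remaining rows go to '<out_root>/inference'.
    """
    train_dir = os.path.join(out_root, "train")
    inference_dir = os.path.join(out_root, "inference")
    os.makedirs(train_dir, exist_ok=True)
    os.makedirs(inference_dir, exist_ok=True)

    cutoff = pd.Timestamp(split_date)
    for name in sorted(os.listdir(src_dir)):
        if not name.endswith(".csv"):
            continue
        df = pd.read_csv(os.path.join(src_dir, name), parse_dates=[time_col])
        before = df[time_col] < cutoff
        df[before].to_csv(os.path.join(train_dir, name), index=False)
        df[~before].to_csv(os.path.join(inference_dir, name), index=False)

    return train_dir, inference_dir
```

Under these assumptions, a call such as `split_by_date("data/synthetic/", "EndDate", "2020-06-01", "data")` would write one training file and one inference file per home.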

## 3.0 Upload data to Datastore in AML Workspace

In the [setup notebook](00_Setup_AML_Workspace.ipynb) you created a [Workspace](https://docs.microsoft.com/python/api/azureml-core/azureml.core.workspace.workspace). We are going to register the data in that environment.

In [None]:
from azureml.core.workspace import Workspace

ws = Workspace.from_config()

# Take a look at Workspace
ws.get_details()

We will create a new Datastore and upload the data to it. Feel free to read the [Datastore](https://docs.microsoft.com/azure/machine-learning/how-to-access-data) documentation. Please create the blob container before running the code below. By default, `Datastore.register_azure_blob_container` does NOT create a container for you; it merely registers the datastore to be used later.

A Datastore is a place where data can be stored that is then made accessible for training or forecasting. Please refer to [Datastore documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore(class)) on how to access data from Datastore. 
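
The registration cell below reads the storage account name and key from the `BLOB_ACCOUNTNAME` and `BLOB_ACCOUNT_KEY` environment variables and falls back to placeholder strings when they are not set. As an optional pre-flight check, here is a minimal sketch that reports which settings still need real values. The helper `missing_settings` is a hypothetical name introduced here; it does not call any Azure API.

```python
import os


def missing_settings(names, environ=None):
    """Return the names that are unset, empty, or still look like '<placeholder>'."""
    environ = os.environ if environ is None else environ
    missing = []
    for name in names:
        value = environ.get(name, "")
        if not value or (value.startswith("<") and value.endswith(">")):
            missing.append(name)
    return missing


# Environment variable names taken from the registration cell
required = ["BLOB_ACCOUNTNAME", "BLOB_ACCOUNT_KEY"]
print("Settings still to provide:", missing_settings(required))
```

An empty list means both values are available in the environment; otherwise, set them (or edit the defaults in the cell) before registering the datastore.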

In [None]:
## Create a new datastore
from azureml.core import Datastore

blob_datastore_name='energy' # Name of the datastore to workspace
container_name=os.getenv("BLOB_CONTAINER", "energytest") # Name of Azure blob container
account_name=os.getenv("BLOB_ACCOUNTNAME", "<StorageAccountName>") # Storage account name
account_key=os.getenv("BLOB_ACCOUNT_KEY", "<storageaccountkey>") # Storage account access key

blob_datastore = Datastore.register_azure_blob_container(workspace=ws, 
                                                         datastore_name=blob_datastore_name, 
                                                         container_name=container_name, 
                                                         account_name=account_name,
                                                         account_key=account_key)

If you'd like to use Azure Data Lake Storage Gen2 as a Datastore instead, the code below is an example. Please note that you need to create a service principal to access the data in the Datastore.

In [None]:
adlsgen2_datastore_name = 'adlsgen2datastore'

subscription_id=os.getenv("ADL_SUBSCRIPTION", "<my_subscription_id>") # subscription id of ADLS account
resource_group=os.getenv("ADL_RESOURCE_GROUP", "<my_resource_group>") # resource group of ADLS account

account_name=os.getenv("ADLSGEN2_ACCOUNTNAME", "<my_account_name>") # ADLS Gen2 account name
tenant_id=os.getenv("ADLSGEN2_TENANT", "<my_tenant_id>") # tenant id of service principal
client_id=os.getenv("ADLSGEN2_CLIENTID", "<my_client_id>") # client id of service principal
client_secret=os.getenv("ADLSGEN2_CLIENT_SECRET", "<my_client_secret>") # the secret of service principal

adlsgen2_datastore = Datastore.register_azure_data_lake_gen2(workspace=ws,
                                                             datastore_name=adlsgen2_datastore_name,
                                                             account_name=account_name, # ADLS Gen2 account name
                                                             filesystem='test', # ADLS Gen2 filesystem
                                                             tenant_id=tenant_id, # tenant id of service principal
                                                             client_id=client_id, # client id of service principal
                                                             client_secret=client_secret) # the secret of service principal

In [None]:
# Connect to default datastore
# datastore = ws.get_default_datastore()
#or connect to the external blob_store

datastore = blob_datastore

target_path = 'energy'

# Upload train data
ds_train_path = target_path + '_train'
datastore.upload(src_dir=train_path, target_path=ds_train_path, overwrite=True)

# Upload inference data
ds_inference_path = target_path + '_inference'
datastore.upload(src_dir=inference_path, target_path=ds_inference_path, overwrite=True)

### *[Optional]* If data is already in Azure: create Datastore from it



<div style="color:red">
If your data is already in Azure you don't need to upload it from your local machine to the default datastore. Instead, you can create a new Datastore that references that set of data. 
The following is an example of how to set up a Datastore from a container in Blob storage where the sample data is located. 

In this case, the orange juice data is available in a public blob container, defined by the information below. In your case, you'll need to specify the account credentials as well. For more information check [the documentation](https://docs.microsoft.com/python/api/azureml-core/azureml.core.datastore.datastore#register-azure-blob-container-workspace--datastore-name--container-name--account-name--sas-token-none--account-key-none--protocol-none--endpoint-none--overwrite-false--create-if-not-exists-false--skip-validation-false--blob-cache-timeout-none--grant-workspace-access-false--subscription-id-none--resource-group-none-).
</div>

In [None]:
# blob_datastore_name = "automl_many_models"
# container_name = "automl-sample-notebook-data"
# account_name = "automlsamplenotebookdata"

In [None]:
# from azureml.core import Datastore

# datastore = Datastore.register_azure_blob_container(
#     workspace=ws, 
#     datastore_name=blob_datastore_name, 
#     container_name=container_name,
#     account_name=account_name,
#     create_if_not_exists=True
# )

# if 0 < dataset_maxfiles < 11973:
#     ds_train_path = 'oj_data_small/'
#     ds_inference_path = 'oj_inference_small/'
# else:
#     ds_train_path = 'oj_data/'
#     ds_inference_path = 'oj_inference/'

## 4.0 Register dataset in AML Workspace

The last step is creating and registering [datasets](https://docs.microsoft.com/azure/machine-learning/concept-data#datasets) in Azure Machine Learning for the train and inference sets.

Using a [FileDataset](https://docs.microsoft.com/python/api/azureml-core/azureml.data.file_dataset.filedataset) is currently the best way to take advantage of the many models pattern, so we create FileDatasets in the next cell. We then [register](https://docs.microsoft.com/azure/machine-learning/how-to-create-register-datasets#register-datasets) the FileDatasets in your Workspace; this associates the train/inference sets with simple names that can be easily referred to later on when we train models and produce forecasts.

In [None]:
from azureml.core.dataset import Dataset

# Create file datasets
ds_train = Dataset.File.from_files(path=datastore.path(ds_train_path), validate=False)
ds_inference = Dataset.File.from_files(path=datastore.path(ds_inference_path), validate=False)

# Register the file datasets
dataset_name = 'energy50'
train_dataset_name = dataset_name + '_train'
inference_dataset_name = dataset_name + '_inference'
ds_train.register(ws, train_dataset_name, create_new_version=True)
ds_inference.register(ws, inference_dataset_name, create_new_version=True)

## 5.0 *[Optional]* Interact with the registered dataset

After the data is registered, it can be retrieved by name using the command below. This is how the datasets will be accessed in future notebooks.

In [None]:
energy_ds = Dataset.get_by_name(ws, name=train_dataset_name)
energy_ds

It is also possible to download the data from the registered dataset:

In [None]:
download_paths = energy_ds.take(5).download()
download_paths

Let's load one of the data files to see the format:

In [None]:
import pandas as pd

sample_data = pd.read_csv(download_paths[0])
sample_data.head(10)

## Next Steps

Now that you have created your datasets, you are ready to move to one of the training notebooks to train and score the models:

- Automated ML: please open [02_AutoML_Training_Pipeline.ipynb](Automated_ML/02_AutoML_Training_Pipeline/02_AutoML_Training_Pipeline.ipynb).
- Custom Script: please open [02_CustomScript_Training_Pipeline.ipynb](Custom_Script/02_CustomScript_Training_Pipeline.ipynb).