# EEEE3129: Applications of AI in Electrical and Electronic Engineering
## Coursework 2: UK Domestic Load Forecasting

## Key Details:
*This is worth 25% of the module. Since this is a 10-credit module, you should expect to spend around **25 hours** completing this coursework.*

Approximate weightings for each question are given in the provisional rubric, which is available on Moodle.  These may change slightly but you will be notified if that is the case.  Note that there are also marks for code quality and uploading the appropriate documents alongside this workbook as indicated in the provisional rubric.

# CWK Submission:
0. **CWK due - 15.00 21st November (Thursday), 2024**
1. Your submission should be a **.zip** file, containing 
   a) This notebook in the format of **.ipynb** 
   b) The exported **.html** file as the report. To export the .html file, you should use the following menu command in Jupyter Lab: File -> Save and Export Notebook As ... -> HTML, same as CWK1.
   c) You can include the .json dataset file in the .zip file, but it will not be used
   d) Any other Jupyter notebooks or .py files will not be marked so *ALL YOUR CODE SHOULD RUN INSIDE THIS WORKBOOK*.
2. If you submit only the .html file without the .ipynb file, your code cannot be executed, so your results cannot be reproduced or evaluated. Marking will then be based on the relevant contents of your .html file, and a mandatory additional penalty of 50% of the overall mark will apply.
3. If you submit only the .ipynb file without the .html file, the report is missing and your submission contains only the source code. Marking will then be based on your .ipynb file, and a mandatory additional penalty of 20% of the overall mark will apply.
4. All submissions must be made via Moodle; email submissions are not accepted.
5. Late penalties will apply to any late submission, in accordance with University policy. If you need to apply for an EC, please refer to the EC instructions at the top of the Moodle page.
6. **DO NOT copy code from other people. DO NOT share your code with other people (do not share your screen either).** Whether you let someone else copy your code or you copy theirs, you could still fail. Helping others is fine and encouraged, but you should explain concepts rather than share exact code.

# Overview 
1. The general aim of this coursework is to produce day-ahead load forecasts from the UK historical dataset. For day-ahead forecasting, you will use the load profile from one day to predict the load profile for the following day. An accurate load forecast helps improve grid operation, in particular by mitigating the challenge of intermittent renewable generation (e.g., wind and solar), supporting economic dispatch and operation of energy storage (e.g., batteries and electric vehicles), and enabling smarter energy management strategies for cost savings and carbon reduction.

2. To do this, you will need:

   a) to prepare the dataset
   
   b) to train a suitable model and tune its parameters
   
   c) to verify and report the result
   
3. The dataset you will be using is the UK historical load dataset for 2023, stored in 'UKLoad2023.json'. This contains the original data collected during 2023. Each data record contains 8 fields:
   
   a) "dataset": name of this dataset
   
   b) "documentID": the unique id of this data record
   
   c) "documentRevisionNumber": if the data is revised
   
   d) "publishTime": the time this data record is published, in Coordinated Universal Time (UTC)
   
   e) "startTime": the start time of this data record, in Coordinated Universal Time (UTC)
   
   f) "settlementDate": the date of this data record
   
   g) "settlementPeriod": the index of this settlement period. Note that in the UK, the 24 hours of a day are divided into 48 settlement periods, each lasting 30 minutes.
   
   h) "quantity": the measured load power, in megawatts (MW)
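
To make the record layout above concrete, the following sketch builds two made-up records with the eight fields and converts "settlementPeriod" into an offset from midnight. Every value here (dataset name, IDs, times, quantities) is invented for illustration and is not taken from 'UKLoad2023.json'. The mapping assumes that period 1 starts at 00:00 and that each period lasts 30 minutes, as described in g).

```python
import pandas as pd

# Two made-up records using the eight field names listed above.
# All values are illustrative placeholders, not real data.
records = [
    {"dataset": "EXAMPLE", "documentID": "doc-1", "documentRevisionNumber": 1,
     "publishTime": "2023-01-01T00:30:00Z", "startTime": "2023-01-01T00:00:00Z",
     "settlementDate": "2023-01-01", "settlementPeriod": 1, "quantity": 21000.0},
    {"dataset": "EXAMPLE", "documentID": "doc-2", "documentRevisionNumber": 1,
     "publishTime": "2023-01-02T00:00:00Z", "startTime": "2023-01-01T23:30:00Z",
     "settlementDate": "2023-01-01", "settlementPeriod": 48, "quantity": 23000.0},
]
df_example = pd.DataFrame(records)

# Period p covers the 30 minutes starting (p - 1) * 30 minutes after midnight.
df_example["period_start"] = pd.to_timedelta(
    (df_example["settlementPeriod"] - 1) * 30, unit="min"
)
print(df_example[["settlementDate", "settlementPeriod", "period_start", "quantity"]])
```

Under these assumptions, period 48 starts 23 hours 30 minutes after midnight. It is worth checking whether every day in the real dataset actually contains 48 periods before relying on this.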



## Part 1: Data Processing
For this part you will be completing the code to process the raw data into an appropriate format. You will be told which parts you need to write, and blank spaces are provided in the code below, denoted by comments (e.g. "Write your code for Task 1.1 here"). You may need to import any library functions you use. For questions that have a text-based answer, write your answer in the space below the question (e.g. replace '[write your answer here]').

**Task 1.1** Write the set-up code for the whole notebook, including module imports, self-defined supporting functions and general configuration.

**Task 1.2** Write code to validate the integrity of the dataset by checking for duplicate data and applying appropriate solutions.
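
As a rough illustration of a duplicate check (not a prescribed solution), the sketch below uses a toy DataFrame in which two rows share the same date and settlement period. Treating the pair ("settlementDate", "settlementPeriod") as the key, and keeping the first occurrence, are both assumptions made for the example; for the real data you should decide and justify which record to keep.

```python
import pandas as pd

# Toy frame: the second and third rows describe the same date and period.
df_toy = pd.DataFrame({
    "settlementDate": ["2023-01-01", "2023-01-01", "2023-01-01"],
    "settlementPeriod": [1, 2, 2],
    "quantity": [21000.0, 20500.0, 20500.0],
})

# Assumed key: one record is expected per (date, period) pair.
key = ["settlementDate", "settlementPeriod"]
n_duplicates = int(df_toy.duplicated(subset=key).sum())
print("duplicate rows:", n_duplicates)

# Keep the first record for each (date, period) pair and rebuild the index.
df_clean = df_toy.drop_duplicates(subset=key, keep="first").reset_index(drop=True)
print(df_clean)
```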

**Task 1.3** Write code to validate the integrity of the dataset by checking missing values (NaNs) and applying appropriate solutions.
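
The sketch below shows one possible way to count and fill missing values on a toy series. Linear interpolation is only one of several options (others include dropping the affected rows or days, or forward filling), and it is used here purely as an example; the toy values are invented.

```python
import numpy as np
import pandas as pd

# Toy load series with one missing half-hourly value.
toy_load = pd.Series([21000.0, np.nan, 20000.0, 19500.0], name="quantity")

n_missing = int(toy_load.isna().sum())
print("missing values:", n_missing)

# One option among several: linear interpolation between neighbouring periods.
filled = toy_load.interpolate(method="linear")
print(filled)
```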

**Task 1.4** For the missing values (NaNs), please write below the available solutions, and the justification for your applied solution. Write your answer below.

*[write your answer here]*


**Task 1.5** Write code to prepare the following two datasets. The feature dataset, named "feature_dataset", should be a 2-dimensional numpy ndarray in which each row is a different day and contains the 48 values of the field "quantity" for that day. The label dataset, named "label_dataset", has the same format as "feature_dataset", but the date for each row should be one day later. For example, if row 3 in "feature_dataset" is the data for 2023-8-20, then row 3 in "label_dataset" should be the data for 2023-8-21. This makes "feature_dataset" and "label_dataset" ready for day-ahead model training and validation.
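
The row alignment described in Task 1.5 can be sketched on synthetic data as follows. This is a minimal sketch, assuming that the days are consecutive and that each day has exactly 48 values; the toy values and the helper names `df_toy`, `daily` and `daily_values` are invented for the example, and the real dataset may need extra handling.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in: 4 consecutive days x 48 periods.
# The "load" on day i, period p is simply i * 100 + p.
days = pd.date_range("2023-08-19", periods=4, freq="D")
rows = [
    {"settlementDate": day, "settlementPeriod": p, "quantity": float(i * 100 + p)}
    for i, day in enumerate(days)
    for p in range(1, 49)
]
df_toy = pd.DataFrame(rows)

# One row per day, one column per settlement period (sorted by date, then period).
daily = df_toy.pivot(index="settlementDate", columns="settlementPeriod", values="quantity")
daily = daily.sort_index().sort_index(axis=1)
daily_values = daily.to_numpy()

# Row i of the features is day i; row i of the labels is day i + 1.
feature_dataset = daily_values[:-1]
label_dataset = daily_values[1:]
print(feature_dataset.shape, label_dataset.shape)  # (3, 48) (3, 48)
```

Note that the last day has no following day, so the two arrays have one row fewer than the number of days.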

**Task 1.6** Write code to visualize the day-ahead forecasting problem, by plotting the day-ahead data from  "feature_dataset" and the corresponding output data from "label_dataset". 

**Task 1.7** Observe the data from  "feature_dataset" and "label_dataset", and analyse the statistics such as the mean, shape of the dataset, number of samples etc, and discuss how these parameters will influence the day-ahead forecasting problem below. 

*[write your answer here]*




In [1]:
# DO NOT CHANGE THIS PART

import pandas as pd # pandas module required for dataset manipulation
import numpy as np # numpy module required for numerical calculation, signal processing and training.
from sklearn.linear_model import LinearRegression # the model LinearRegression from sklearn.linear_model is required for Part 2 
from sklearn.model_selection import train_test_split # the function train_test_split is required from sklearn.model_selection for training and testing dataset preparation.

###################################################
## Write your code for Task 1.1 here. 
###################################################
## Write the set-up code for the whole notebook, including module imports, self-defined supporting functions and general configuration.




In [2]:
# These two functions are mandatory and you should NOT change them. Modifying them will lead to 0 marks for any assessment that calls them.
def performance_indicator_relative(mse_train, mse_test, mse_validation):
    return abs(mse_validation - mse_train) / abs(mse_train) + abs(mse_validation - mse_test) / abs(mse_test)

def performance_indicator_rmse(mse_validation, Y_labels):
    return mse_validation/(np.mean(Y_labels)**2)

In [3]:
## Code for Part 1

# Load the training dataset
df = pd.read_json('UKLoad2023.json')

###################################################
## Write your code for Task 1.2 here. 
###################################################
## Write code to validate the integrity of the dataset by checking for duplicate data and applying appropriate solutions.

#

###################################################
## Write your code for Task 1.3 here. 
###################################################
## Write code to validate the integrity of the dataset by checking missing values (NaNs) and applying appropriate solutions.

#

###################################################
## Write your code for Task 1.5 here. 
###################################################
# Write code to prepare the following two datasets. The feature dataset, named "feature_dataset", should be
# a 2-dimensional numpy ndarray in which each row is a different day and contains the 48 values of the field "quantity"
# for that day. The label dataset, named "label_dataset", has the same format as "feature_dataset",
# but the date for each row should be one day later. For example, if row 3 in "feature_dataset" is the data for 2023-8-20,
# then row 3 in "label_dataset" should be the data for 2023-8-21. This makes "feature_dataset" and "label_dataset"
# ready for day-ahead model training and validation.

# dataset initialization - you might need to rewrite/remove this part if necessary
feature_dataset = np.ones((10,10)) # - you might need to rewrite/remove this part if necessary
label_dataset   = np.ones((10,10)) # - you might need to rewrite/remove this part if necessary

#

###################################################
## Write your code for Task 1.6 here. 
###################################################
# Write code to visualize the day-ahead forecasting problem, by plotting the day-ahead data from  "feature_dataset" and the corresponding output data from "label_dataset".

#




## Part 2: Model Training
For this part you will be completing the code to fit the model. You will be told which parts you need to write, and blank spaces are provided in the code below, denoted by comments (e.g. "Write your code for Task 1.1 here"). You may need to import any library functions you use. For questions that have a text-based answer, write your answer in the space below the question (e.g. replace '[write your answer here]').

**Task 2.1** Write code to calculate the Mean Squared Error (MSE) for both the training part and the testing part, named mse_train and mse_test respectively. You will need to follow this naming convention for Part 3.
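
For reference, the MSE of a 2-dimensional target array is the mean of the squared element-wise errors. The sketch below computes it in two ways on toy arrays: directly with numpy, and with `mean_squared_error` from `sklearn.metrics`. The arrays `Y_true` and `Y_pred` are made up for the example and stand in for the real labels and model outputs.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy targets and predictions with the same 2-D layout as the coursework arrays.
Y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
Y_pred = np.array([[1.0, 3.0], [3.0, 2.0]])

# MSE is the mean of the squared element-wise errors: (0 + 1 + 0 + 4) / 4.
mse_manual = float(np.mean((Y_true - Y_pred) ** 2))

# With its default settings, scikit-learn averages the per-column MSE values,
# which gives the same number when every column has the same number of rows.
mse_sklearn = float(mean_squared_error(Y_true, Y_pred))
print(mse_manual, mse_sklearn)
```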

**Task 2.2** Write code to visualize the model fitting performance. You should plot the first row of Y_train against the first row of Y_train_output, and the first row of Y_test against the first row of Y_test_output.

**Task 2.3** Compare the values of mse_train and mse_test: which is larger? Is this expected, and why? Write your answer below.

*[write your answer here]*

Example Answer: By default, mse_train < mse_test is expected, which indicates overfitting. Explanations can focus on the dataset size or the model.


**Task 2.4** Rerun the code, report your best performance below. Based on your observation, make at least one suggestion on how to improve the code. Write your answer below.

*[write your answer here]*

Example Answer: Observe the randomness of the code: the random seed can be fixed for reproducibility. The dataset could also be split better.
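
One way to remove the run-to-run randomness of the split is the `random_state` argument of `train_test_split`. The sketch below uses toy arrays, and the seed value 42 is an arbitrary choice made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy arrays standing in for the feature and label datasets.
X = np.arange(20).reshape(10, 2)
Y = np.arange(10).reshape(10, 1)

# With the same random_state, repeated calls return the same split.
split_a = train_test_split(X, Y, test_size=0.5, random_state=42)
split_b = train_test_split(X, Y, test_size=0.5, random_state=42)
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print("identical splits:", same_split)
```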


In [4]:
# Example code for dataset splitting, training, and testing. You may need to rewrite/reuse this routine for this or later parts.
# you should have this part working for the first run, to check your variables are properly named in Part 1.

# Training and Testing dataset preparation. Splitting ratio between training and testing is set as 50% vs 50% as an example.
X_train, X_test, Y_train, Y_test = train_test_split(feature_dataset, label_dataset, test_size=0.5)

# Configuring the models to fit/train. 
cwk_model_part2 = LinearRegression()

# Fitting the models with the training data.
cwk_model_part2.fit(X_train, Y_train)

# Making predictions with the fitted model
Y_train_output = cwk_model_part2.predict(X_train)

# Making predictions with the fitted model
Y_test_output= cwk_model_part2.predict(X_test)

###################################################
## Write your code for Task 2.1 here. 
###################################################
# Write code to calculate the Mean Squared Error (MSE) for both the training part and the testing part, named mse_train and mse_test respectively. You will need to follow this naming convention for Part 3.

# value initialization - you might need to rewrite/remove this part if necessary
mse_train = 1.0 # - you might need to rewrite/remove this part if necessary
mse_test = 1.0 # - you might need to rewrite/remove this part if necessary

#


###################################################
# the following code should run with no problem after your implementation of Task 2.1
print("---------------------------------------------------------------------------")
print("MSE Performance for Part 2:")
print(f"Linear Regression MSE Train: {mse_train}")
print(f"Linear Regression MSE Test: {mse_test}")
print("---------------------------------------------------------------------------")

###################################################
## Write your code for Task 2.2 here. 
###################################################
# Write code to visualize the model fitting performance. You should plot the first row of Y_train against the first row of Y_train_output, and the first row of Y_test against the first row of Y_test_output.

#




---------------------------------------------------------------------------
MSE Performance for Part 2:
Linear Regression MSE Train: 1.0
Linear Regression MSE Test: 1.0
---------------------------------------------------------------------------


## Part 3: Model Exploration

In this part, the expectation is:
a) to choose your own model
b) to prepare the data for the model training
c) to configure the model and to train the model 
d) to report the performance

Note that:
a) you can try as many models as you like, but you can only choose one as the final model to report. The final performance will be evaluated based on that model alone. 
b) if multiple models are left in this part without a clear indication of the final model selection, the LAST model will be used for marking purposes.
c) your model will be evaluated following the convention of machine learning projects: a dataset unknown to you will be used to evaluate your model's performance. Please don't ask for this dataset, as it is kept from you intentionally.
d) your model should be able to load the example test dataset. It is mandatory to ensure your code has the correct structure for the unknown-dataset test, but DO NOT rely on its performance, because the example test dataset is a subset of your training dataset and NOT the unknown dataset used for marking. 

For this part you will be completing the code to evaluate the fitted model. You will be told which parts you need to write, and blank spaces are provided in the code below, denoted by comments (e.g. "Write your code for Task 1.1 here"). You may need to import any library functions you use. For questions that have a text-based answer, write your answer in the space below the question (e.g. replace '[write your answer here]').

**Task 3.1** Write your code to implement a different model (any model except LinearRegression model) and evaluate its performance. You should rewrite part of the following code.

**Task 3.2** Justify your choice for the implemented model. Write your answer below.

*[write your answer here]*

Example Answers: Any reasonable regression model is acceptable, but a discussion of the choice is expected. 

**Task 3.3** Compare and discuss the performance of your implemented models. Write your answer below.

*[write your answer here]*

Example Answers: Quantitative evidence is expected to support the comparison; the discussion should refer to model parameters and hyperparameters. 

**Task 3.4** What is overfitting? How can overfitting be mitigated in your implemented code? Write your answer below.

*[write your answer here]*

Example Answers: 1st class answers should also include code implementations, e.g., k-fold.
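
As a rough illustration of k-fold cross-validation, the sketch below scores a model on synthetic data with 5 folds. The choice of `Ridge` with `alpha=1.0`, the number of folds, the seeds and the synthetic data are all assumptions made for the example; none of them is a recommendation for the final model.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score

# Synthetic multi-output regression problem (60 samples, 5 inputs, 3 outputs).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
true_weights = rng.normal(size=(5, 3))
Y = X @ true_weights + 0.1 * rng.normal(size=(60, 3))

# Ridge is used here only as a stand-in for "some regression model".
model = Ridge(alpha=1.0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# scikit-learn reports the negated MSE so that larger scores are better.
scores = cross_val_score(model, X, Y, cv=cv, scoring="neg_mean_squared_error")
fold_mse = -scores
print("MSE per fold:", fold_mse)
print("mean MSE:", fold_mse.mean())
```

A large spread between the per-fold MSE values suggests that the result depends strongly on which samples end up in the training set. Shuffled folds ignore the time order of the days, which may or may not be appropriate for load data.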

In [5]:
###################################################
## Write your code for Task 3.1 here. 
###################################################
# Write your code to implement a different model (any model except LinearRegression model) and evaluate its performance. You should rewrite part of the following code.
# The code below is provided to give you hints about the model names expected in Part 4.
# It should be rewritten completely. 

X_train, X_test, Y_train, Y_test = train_test_split(feature_dataset, label_dataset, test_size=0.2)

# Configuring the models to fit/train. 
cwk_model_part3 = LinearRegression()

# Fitting the models with the training data.
cwk_model_part3.fit(X_train, Y_train)

# Making predictions with the fitted model
Y_train_output = cwk_model_part3.predict(X_train)

# Making predictions with the fitted model
Y_test_output= cwk_model_part3.predict(X_test)

# 


## Part 4: Model Evaluation
For this part you will be completing the code to evaluate the fitted model. You will be told which parts you need to write, and blank spaces are provided in the code below, denoted by comments (e.g. "Write your code for Task 1.1 here"). You may need to import any library functions you use. For questions that have a text-based answer, write your answer in the space below the question (e.g. replace '[write your answer here]').

**Task 4.1** Write code to use the data from "UKLoad2023_test.json" to evaluate your fitted model from Part 2. Note that the data provided in "UKLoad2023_test.json" is a dummy dataset; during the marking process, another dataset with the same name and a similar format will be used. The dataset used for marking is NOT available to you, so please do not request it, but you should use the provided "UKLoad2023_test.json" to check that your code runs.

**Task 4.2** Given the overall aim is to make day-ahead load forecasting, please suggest an approach (other than the trial with different models) to better achieve this aim. Write your answer below.

*[write your answer here]*

Example Answers: Discuss additional features or better preparation of the dataset/features, e.g., day of the week, seasonal changes, quality of the dataset.
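
To illustrate what an additional feature could look like, the sketch below appends a one-hot encoding of the day of the week to a placeholder feature array. The dates, the array `load_features` and the choice to encode the weekday of the feature day (rather than the forecast day) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Three consecutive days and placeholder 48-value load profiles.
dates = pd.date_range("2023-08-19", periods=3, freq="D")
load_features = np.ones((3, 48))

# One-hot encode the weekday of each row (Monday=0 ... Sunday=6).
weekday = dates.dayofweek.to_numpy()
weekday_onehot = np.eye(7)[weekday]

# Each row now holds 48 load values followed by 7 weekday indicators.
extended_features = np.hstack([load_features, weekday_onehot])
print(extended_features.shape)  # (3, 55)
```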


**Task 4.3** Performance evaluation via overall_performance. You must make sure the following lines run; otherwise this task will receive 0 marks. 
overall_performance = performance_indicator_relative(mse_train, mse_test, mse_validation)
print('overall_performance', overall_performance)

**Task 4.4** Performance evaluation via rmse_performance. You must make sure the following lines run; otherwise this task will receive 0 marks. 
rmse_performance = performance_indicator_rmse(mse_validation, label_dataset_validation)
print('rmse_performance', rmse_performance)



In [6]:
# To load the evaluation dataset
df_evaluation = pd.read_json('UKLoad2023_test.json')


###################################################
## Write your code for Task 4.1 here. 
###################################################

# initialize the mse_validation - you might need to rewrite/remove this part if necessary
mse_validation = 1.0 # - you might need to rewrite/remove this part if necessary

# dataset initialization - you might need to rewrite/remove this part if necessary
feature_dataset_validation = np.ones((10,10)) # - you might need to rewrite/remove this part if necessary
label_dataset_validation   = np.ones((10,10)) # - you might need to rewrite/remove this part if necessary



In [7]:
#################################################################################################################
## Performance Evaluation Part
# Please leave this part untouched. 
# It is provided for you to test whether your code can be evaluated - the performance reported below is for
# your reference only and is not the one used in marking. The dataset for marking is NOT available to you so don't ask for it.
overall_performance = performance_indicator_relative(mse_train, mse_test, mse_validation)
print('overall_performance', overall_performance)

rmse_performance = performance_indicator_rmse(mse_validation, label_dataset_validation)
print('rmse_performance', rmse_performance)
#################################################################################################################

overall_performance 0.0
rmse_performance 1.0
