# Your project title

## Your name

The final project is an individual project, where you apply one or more machine learning techniques to some power system planning/operation problem. A good project either has the potential to be polished into a research paper, or is suitable to be transformed into teaching materials.

This template is for your reference. You do not have to strictly follow it. You can remove the instructions. You need to submit a `.zip` file, which includes the `.ipynb` file, the `.html` file, a `data` folder, and a `figs` folder (if applicable).

The project will be graded by both the instructor and the TA. Please feel free to contact at least one of us, by email or by attending office hours, as you work on the project.

The outline of the project is similar to an IEEE research paper.

## Introduction

Do not include a reference list in the appendix. Instead, include a link to cite a reference, e.g., a benchmark model is developed in [[Hong11]](https://doi.org/10.1109/PES.2011.6038881), another model is developed in [[Xie18]](https://doi.org/10.1007/s40565-017-0374-0).

## Problem Statement

Formulate a problem in power systems from the data analytics/machine learning perspective. Clearly specify the available data sets, the objective, the performance metric, and the state of the art (best performance in the current literature for the same or a similar problem). You can load all the needed packages here and have a glance at the data:

You can include code for data preprocessing in either this section or later sections.

## Methodology

Describe your proposed methods (e.g., a classical machine learning method and a deep neural network). Intuitively argue why your methods may be better than the existing methods (in some aspects, of course). The instructor and the TA are aware that some topics may be more sophisticated, and even reproducing the state of the art is already challenging enough (which is acceptable in this project; we will take into account all the factors and grade your work based on the overall quality).

You can include a figure like this:

In [33]:
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
%matplotlib inline 

import warnings
warnings.filterwarnings("ignore")
import os
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

In [34]:
# This is the original data for Charlotte from the National Centers for Environmental Information
clt_original = pd.read_csv('Charlotte/653351892e546d5d55892a5e2a054320/Modified_1045140_35.21_-80.86_2006.csv')
clt_original

Unnamed: 0,Year,Month,Day,Hour,Minute,DHI,DNI,Dew Point,Surface Albedo,Wind Speed,...,Clearsky DHI,Clearsky GHI,GHI,Solar Zenith Angle,Cloud Type,Fill Flag,Wind Direction,Precipitable Water,Global Horizontal UV Irradiance (280-400nm),Global Horizontal UV Irradiance (295-385nm)
0,2006,1,1,0,0,0,0,2,0.116,0.4,...,0,0,0,166.47,7,0,290.6,0.968,0.0,0.0
1,2006,1,1,0,30,0,0,2,0.116,0.4,...,0,0,0,167.78,7,0,290.6,0.955,0.0,0.0
2,2006,1,1,1,0,0,0,2,0.116,0.4,...,0,0,0,165.82,7,0,301.7,0.942,0.0,0.0
3,2006,1,1,1,30,0,0,1,0.116,0.4,...,0,0,0,161.63,0,0,301.7,0.928,0.0,0.0
4,2006,1,1,2,0,0,0,1,0.116,0.4,...,0,0,0,156.37,0,0,310.9,0.915,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17515,2006,12,31,21,30,0,0,4,0.116,0.4,...,0,0,0,139.86,0,0,264.7,1.045,0.0,0.0
17516,2006,12,31,22,0,0,0,4,0.116,0.5,...,0,0,0,145.93,4,0,269.0,1.012,0.0,0.0
17517,2006,12,31,22,30,0,0,2,0.116,0.5,...,0,0,0,151.86,7,0,269.0,1.002,0.0,0.0
17518,2006,12,31,23,0,0,0,2,0.116,0.5,...,0,0,0,157.53,7,0,278.4,0.992,0.0,0.0


## How this data is produced can be found in `Raleigh Pwer forcasting.ipynb`

In [35]:
# This time-shifted data is derived from clt_original: certain parameters are shifted ahead by 4 hours so that they can be
# compared with the solar performance of the provided 4-hour forecast.
clt_4h_weather_timeshift = pd.read_csv('Charlotte_weather_4h_timeshift.csv')
clt_4h_weather_timeshift['LocalTime'] = pd.to_datetime(clt_4h_weather_timeshift['LocalTime'])

# You can see that it has 8 fewer rows; this is because the first 8 rows are removed for the 4-hour forecast at 30-minute intervals
clt_4h_weather_timeshift

Unnamed: 0,Year,Month,Day,Hour,Minute,LocalTime,TIME,DATE,Dew Point,Surface Albedo,...,Precipitable Water,DHI,DNI,Clearsky DNI,Clearsky DHI,Clearsky GHI,GHI,Solar Zenith Angle,Global Horizontal UV Irradiance (280-400nm),Global Horizontal UV Irradiance (295-385nm)
0,2006,1,1,4,0,2006-01-01 04:00:00,04:00:00,2006-01-01,2.0,0.116,...,0.968,0,0,0,0,0,0,132.46,0.0,0.0
1,2006,1,1,4,30,2006-01-01 04:30:00,04:30:00,2006-01-01,2.0,0.116,...,0.955,0,0,0,0,0,0,126.34,0.0,0.0
2,2006,1,1,5,0,2006-01-01 05:00:00,05:00:00,2006-01-01,2.0,0.116,...,0.942,0,0,0,0,0,0,120.25,0.0,0.0
3,2006,1,1,5,30,2006-01-01 05:30:00,05:30:00,2006-01-01,1.0,0.116,...,0.928,0,0,0,0,0,0,114.21,0.0,0.0
4,2006,1,1,6,0,2006-01-01 06:00:00,06:00:00,2006-01-01,1.0,0.116,...,0.915,0,0,0,0,0,0,108.27,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17507,2006,12,31,21,30,2006-12-31 21:30:00,21:30:00,2006-12-31,14.0,0.113,...,3.878,0,0,0,0,0,0,139.86,0.0,0.0
17508,2006,12,31,22,0,2006-12-31 22:00:00,22:00:00,2006-12-31,14.0,0.113,...,3.930,0,0,0,0,0,0,145.93,0.0,0.0
17509,2006,12,31,22,30,2006-12-31 22:30:00,22:30:00,2006-12-31,14.0,0.113,...,3.930,0,0,0,0,0,0,151.86,0.0,0.0
17510,2006,12,31,23,0,2006-12-31 23:00:00,23:00:00,2006-12-31,5.0,0.116,...,1.123,0,0,0,0,0,0,157.53,0.0,0.0
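
The 4-hour shift described for `clt_4h_weather_timeshift` can be illustrated with a small sketch. The actual preprocessing lives in a separate notebook and is not shown here, so everything below is an assumption: the toy values are made up, and `observed_cols` is a hypothetical choice of which columns to lag. The idea is that an observation made at time t is paired with the timestamp t + 4 h (8 steps at 30-minute resolution), and the first 8 rows, which have no earlier observation, are dropped.

```python
import pandas as pd

# Toy half-hourly frame; values are made up for illustration.
times = pd.date_range("2006-01-01 00:00", periods=12, freq="30min")
weather = pd.DataFrame({
    "LocalTime": times,
    "Dew Point": range(12),                  # observed weather
    "Solar Zenith Angle": range(100, 112),   # can be computed in advance
})

STEPS = 8  # 4 hours at 30-minute resolution
observed_cols = ["Dew Point"]  # hypothetical choice of columns to lag

shifted = weather.copy()
# Each row now carries the observation made 4 hours earlier.
shifted[observed_cols] = shifted[observed_cols].shift(STEPS)
# The first 8 rows have no observation 4 hours earlier, so drop them.
shifted = shifted.iloc[STEPS:].reset_index(drop=True)

print(len(weather) - len(shifted))  # number of rows lost to the shift
print(shifted.head(3))
```

In this sketch the zenith angle is left unshifted because it depends only on time and location; whether the original preprocessing made the same choice is not confirmed here.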


In [36]:
# This is NREL's provided realtime Solar Power dataset
# we only want to look at 30 min intervals
actual_solar_raleigh = pd.read_csv('Raleigh_Solar_Power.csv')
actual_solar_raleigh['LocalTime'] = pd.to_datetime(actual_solar_raleigh['LocalTime'])
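# .copy() makes the filtered frame independent, so the column added below does not write to a slice of the original
actual_solar_raleigh = actual_solar_raleigh[(actual_solar_raleigh['Minute'] == 0) | (actual_solar_raleigh['Minute'] == 30)].copy()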
actual_solar_raleigh['Power(MW)_actual'] = actual_solar_raleigh['Power(MW)']
actual_solar_raleigh

Unnamed: 0,LocalTime,Power(MW),Year,Day,Month,TIME,DATE,Hour,Minute,TOTAL MINUTES PASS,Power(MW)_actual
0,2006-01-01 00:00:00,0.0,2006,1,1,00:00:00,2006-01-01,0,0,0,0.0
6,2006-01-01 00:30:00,0.0,2006,1,1,00:30:00,2006-01-01,0,30,30,0.0
12,2006-01-01 01:00:00,0.0,2006,1,1,01:00:00,2006-01-01,1,0,60,0.0
18,2006-01-01 01:30:00,0.0,2006,1,1,01:30:00,2006-01-01,1,30,90,0.0
24,2006-01-01 02:00:00,0.0,2006,1,1,02:00:00,2006-01-01,2,0,120,0.0
...,...,...,...,...,...,...,...,...,...,...,...
105090,2006-12-31 21:30:00,0.0,2006,31,12,21:30:00,2006-12-31,21,30,1290,0.0
105096,2006-12-31 22:00:00,0.0,2006,31,12,22:00:00,2006-12-31,22,0,1320,0.0
105102,2006-12-31 22:30:00,0.0,2006,31,12,22:30:00,2006-12-31,22,30,1350,0.0
105108,2006-12-31 23:00:00,0.0,2006,31,12,23:00:00,2006-12-31,23,0,1380,0.0


In [37]:
# This is NREL's provided 4-hour-ahead solar power forecast dataset
ha4_solar_raleigh = pd.read_csv('Raleigh/6ef6c64fbb3bfd36bcb328cc956df78d/HA4_35.85_-78.65_2006_DPV_37MW_60_Min.csv')

ha4_solar_raleigh['LocalTime'] = pd.to_datetime(ha4_solar_raleigh['LocalTime'])
ha4_solar_raleigh['Power(MW)_ha4'] = ha4_solar_raleigh['Power(MW)']
ha4_solar_raleigh

Unnamed: 0,LocalTime,Power(MW),Power(MW)_ha4
0,2006-01-01 00:00:00,0.0,0.0
1,2006-01-01 01:00:00,0.0,0.0
2,2006-01-01 02:00:00,0.0,0.0
3,2006-01-01 03:00:00,0.0,0.0
4,2006-01-01 04:00:00,0.0,0.0
...,...,...,...
8755,2006-12-31 19:00:00,0.0,0.0
8756,2006-12-31 20:00:00,0.0,0.0
8757,2006-12-31 21:00:00,0.0,0.0
8758,2006-12-31 22:00:00,0.0,0.0


In [38]:
# The first step is to merge the 4-hour forecast and the real-time solar data to sync up the times, then remove the nighttime rows where both power values are 0

df2 = pd.merge(ha4_solar_raleigh, actual_solar_raleigh, how='inner', on='LocalTime')
df2 = df2[(df2['Power(MW)_actual'] > 0) | (df2['Power(MW)_ha4'] > 0)]

ha4 = df2['Power(MW)_ha4']

actual = df2['Power(MW)_actual']
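
As a minimal sketch of what this merge and filter do, the toy frames below (made-up values, not the NREL data) pair an hourly forecast with a 30-minute actual series. The inner join keeps only timestamps present in both frames, so the comparison is made on the hourly points, and a row is dropped only when both power values are zero.

```python
import pandas as pd

# Hypothetical toy data: an hourly forecast and a 30-minute actual series.
forecast = pd.DataFrame({
    "LocalTime": pd.date_range("2006-06-01 10:00", periods=3, freq="60min"),
    "Power(MW)_ha4": [0.0, 12.0, 15.0],
})
actual_30min = pd.DataFrame({
    "LocalTime": pd.date_range("2006-06-01 10:00", periods=6, freq="30min"),
    "Power(MW)_actual": [0.0, 5.0, 11.0, 13.0, 16.0, 17.0],
})

# The inner join keeps only timestamps present in both frames (the hourly ones).
merged = pd.merge(forecast, actual_30min, how="inner", on="LocalTime")

# Keep a row if either series is non-zero; rows where both are zero are dropped.
daylight = merged[(merged["Power(MW)_actual"] > 0) | (merged["Power(MW)_ha4"] > 0)]

print(len(merged), len(daylight))
```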

In [39]:
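# We can perform R2 metric scoring for regressions.
# Note: r2_score takes (y_true, y_pred); the forecast is passed first here, so it acts as the reference series.
R2_Score_dtr = round(r2_score(ha4,actual) * 100, 2)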
print("R2 Score : ",R2_Score_dtr,"%")

R2 Score :  75.45 %


### The provided 4-hour forecast has an R2 score of 75.45% against the actual data
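
R2 is not symmetric in its two arguments, so the order in which the actual and forecast series are passed matters. The sketch below works directly from the definition, 1 - SS_res / SS_tot, where SS_tot is taken about the mean of the reference series. The `r2` helper is a hand-written stand-in for illustration (scikit-learn's `r2_score(y_true, y_pred)` follows the same argument convention), and the values are made up.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot, with SS_tot about the mean of y_true."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up values for illustration only.
actual_demo = [1.0, 2.0, 3.0, 4.0]
forecast_demo = [2.0, 2.0, 3.0, 3.0]

r2_reference_actual = r2(actual_demo, forecast_demo)    # actual series as the reference
r2_reference_forecast = r2(forecast_demo, actual_demo)  # arguments swapped

print(r2_reference_actual, r2_reference_forecast)
```

The residual sum of squares is the same in both calls; only the denominator changes, because the forecast in this toy example varies less than the actual series.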

# Now check if we can use Charlotte data to predict Raleigh power output.

In [40]:
# imports for evaluating an XGBoost regression model
from numpy import absolute
from pandas import read_csv
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from xgboost import XGBRegressor

In [41]:
# The first step is to merge the Charlotte 4-hour shifted data and the real-time solar data to sync up the times, then remove the nighttime rows where all values are 0
df = pd.merge(clt_4h_weather_timeshift, actual_solar_raleigh, how='inner', on='LocalTime')
df = df[(df['Power(MW)'] > 0) | (df['DHI'] > 0) | (df['DNI'] > 0) | (df['GHI'] > 0)]

# The result is then divided into the X and y variables that the model will use.
X = df[['DHI', 'DNI', 'Dew Point',
   'Surface Albedo', 'Wind Speed', 'Relative Humidity', 'Temperature',
   'Pressure', 'Clearsky DHI', 'Clearsky DNI', 'Clearsky GHI', 'GHI',
   'Solar Zenith Angle', 'Cloud Type', 'Fill Flag', 'Wind Direction',
   'Precipitable Water', 'Global Horizontal UV Irradiance (280-400nm)',
   'Global Horizontal UV Irradiance (295-385nm)']]
y = df['Power(MW)']

# The dataset is then split into test and train sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.2,random_state=69)



In [42]:
# XGBRegressor is the model used
# Other models like LinearRegression, RandomForestRegressor, DecisionTreeRegressor were tested in the EDA,
# but XGBRegressor performed the best.

model = XGBRegressor(n_estimators=1000, max_depth=7, eta=0.1, subsample=0.7, colsample_bytree=0.8)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
scores = absolute(scores)

# get MAE
print('Mean MAE: %.3f (%.3f)' % (scores.mean(), scores.std()) )

# now fit the dataset
model.fit(X_train,y_train)

# get model score
score_lr = 100*model.score(X_test,y_test)
print(f'XGBR Model score = {score_lr:4.4f}%')

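# predict the test set
y_pred_lr = model.predict(X_test)
# Note: r2_score takes (y_true, y_pred); the predictions are passed first here, so this value can differ from model.score above
R2_Score_lr = round(r2_score(y_pred_lr,y_test) * 100, 2)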

# get R2 score
print("R2 Score : ",R2_Score_lr,"%")

Mean MAE: 1.814 (0.075)
XGBR Model score = 90.2323%
R2 Score :  88.83 %
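
To put the MAE in context, it can be expressed as a share of plant capacity. The short calculation below is a sketch under one assumption: the 37 MW capacity is read from the `DPV_37MW` part of the forecast file name and is taken to apply to the actual power series as well.

```python
mae_mw = 1.814        # mean cross-validated MAE printed above
capacity_mw = 37.0    # assumption: capacity taken from the "DPV_37MW" file name

nmae_pct = 100 * mae_mw / capacity_mw
print(f"Normalized MAE: {nmae_pct:.2f}% of capacity")
```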


# Using current Charlotte weather data, you can forecast Raleigh solar power 4 hours ahead with a higher R2 score (88.83%) than NREL's provided 4-hour forecast (75.45%)

# Things to talk about
- XGBRegressor and how it works
- Results, R2
- Explain the other files and other stuff
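
For the discussion of how XGBRegressor works, the core idea of gradient boosting can be sketched in a few lines. This is a deliberately simplified illustration, not XGBoost's actual algorithm: it uses one-split trees (stumps) on a single synthetic feature with squared loss, while XGBoost additionally uses deeper trees, a regularized objective, second-order gradient information, and row/column subsampling. The function names `fit_stump`, `predict_stump`, and `boost` are hypothetical helpers written for this sketch.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimizes squared error of the residuals."""
    best = None
    # Exclude the largest value so the right-hand side is never empty.
    for threshold in np.unique(x)[:-1]:
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=50, learning_rate=0.1):
    """Plain gradient boosting for squared loss: each stump fits the current residuals."""
    base = y.mean()
    prediction = np.full_like(y, base, dtype=float)
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        # Take a small step toward the stump's correction (shrinkage).
        prediction += learning_rate * predict_stump(stump, x)
        stumps.append(stump)
    return base, stumps, prediction

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200)
y = np.sin(x) + 0.1 * rng.normal(size=x.size)

base, stumps, fitted = boost(x, y)
mse_start = np.mean((y - base) ** 2)
mse_end = np.mean((y - fitted) ** 2)
print(f"MSE before boosting: {mse_start:.3f}, after: {mse_end:.3f}")
```

The `learning_rate` here plays the same role as `eta` in the model above, and `n_rounds` corresponds to `n_estimators`: many small corrections are added rather than one large one.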


You can create a math equation like this:
$$
E = mc^2.
$$

You can include some key code in this section to highlight the main idea and contributions of your work.

## Results

Most of your code is included here, with the generated results. You may want to compare your results with the state of the art, and discuss the results.

## Conclusion

Main takeaways of your work, and future directions of your work.