# ECE 529/629 - Project 1

# Name: ______________

This project uses the Microgrid Dataset as described below. Please follow the steps in this notebook and add your code where appropriate. Please answer any questions posed in the notebook (i.e., where "Answer:" is printed).

The project roughly follows the steps of the end-to-end machine learning project in Chapter 2 of the textbook. However, it is designed to be simpler and to require far fewer steps and much less code.

## Microgrid Dataset
This dataset is from https://www.kaggle.com/jonathandumas/liege-microgrid-open-data

The data have been modified to merge consumption and production data with weather data.

The descriptions of weather data columns are:
- CD = low clouds (0 to 1)
- CM = medium clouds (0 to 1)
- CU = high clouds (0 to 1)
- PREC = precipitation (mm / 15 min)
- RH2m = relative humidity (%)
- SNOW = snow height (mm)
- ST = Surface Temperature (°C)
- SWD = Global Horizontal Irradiance (W/m2)
- SWDtop = Total Solar Irradiance at the top of the atmosphere (W/m2)
- TT2M = temperature 2 meters above the ground (°C)
- WS100m = Wind speed at 100m from the ground (m/s)
- WS10m = Wind speed at 10m from the ground (m/s)

## Part 1: Loading Data

The following cell contains the usual imports.

In [None]:
# Python ≥3.5 is required
import sys
assert sys.version_info >= (3, 5)

# Scikit-Learn ≥0.20 is required
import sklearn
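# Compare version numbers numerically; a plain string comparison would misorder e.g. "0.3" and "0.20"
assert tuple(int(p) for p in sklearn.__version__.split(".")[:2]) >= (0, 20)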

# Common imports
import numpy as np
import os

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    os.makedirs("images", exist_ok=True)  # create the output folder if it is missing
    path = os.path.join("images", fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

The data file is read and the data are stored in the DataFrame grid_data.

In [None]:
import pandas as pd

grid_data = pd.read_csv(os.path.join("datasets","microgrid","weather_consumption_production_clean.csv"))
grid_data.head()

Print the information about each column (e.g., data type) using the .info() method.

Describe any observations you notice (e.g., which columns are numerical and which columns are categorical).

Answer: 

## Part 2: Preparing Data

The following code uses the information in the time column (text) to create a new column named "minutes," which contains the number of minutes since midnight (numeric value).

In [None]:
from datetime import datetime

def hour_string_to_minutes(time_string):
    pt = datetime.strptime(time_string,'%H:%M')
    total_minutes = pt.minute + pt.hour*60
    return total_minutes

grid_data.insert(2,"minutes",grid_data.time.apply(hour_string_to_minutes))
grid_data.head()
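
As a quick sanity check of the conversion above, the helper can be called on a few time strings. This is a minimal sketch: the function is repeated so the snippet runs on its own, and the sample values are hypothetical, chosen only to match the '%H:%M' format used in the notebook.

```python
from datetime import datetime

def hour_string_to_minutes(time_string):
    # Parse an "HH:MM" string and return the minutes elapsed since midnight
    pt = datetime.strptime(time_string, '%H:%M')
    return pt.minute + pt.hour * 60

# Hypothetical sample values (not taken from the dataset)
sample_times = ["00:00", "06:15", "13:45", "23:45"]
sample_minutes = [hour_string_to_minutes(t) for t in sample_times]
print(sample_minutes)  # [0, 375, 825, 1425]
```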

Using a similar approach as above, create a column called "clouds" that represents the maximum of the CD, CM, and CU values. The new column should be placed to the right of the CU column. Print the first five rows of the table to show that you completed this task successfully.
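
One possible way to build a row-wise maximum and place it at a chosen position is sketched below on a toy table. The column names (a, b, c, d, peak) are hypothetical and are not the dataset's columns; the sketch only illustrates the pandas calls `DataFrame.max(axis=1)`, `Index.get_loc`, and `DataFrame.insert`.

```python
import pandas as pd

# Toy table with hypothetical column names
toy = pd.DataFrame({
    "a": [0.1, 0.9, 0.4],
    "b": [0.5, 0.2, 0.4],
    "c": [0.3, 0.7, 0.8],
    "d": [1, 2, 3],
})

# Position immediately to the right of column "c"
position = toy.columns.get_loc("c") + 1

# Row-wise maximum over the three selected columns
toy.insert(position, "peak", toy[["a", "b", "c"]].max(axis=1))
print(toy.head())
```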

Write the code to plot a histogram of all numeric columns in grid_data.

Discuss your observations about the consumption and generation plots.

Answer:

Do you think that overall more power is consumed or more power is generated? Why?

Answer:

Write the code to drop columns date, time, CD, CM, CU, and SNOW.

Write the code to plot a figure that shows pairwise scatter plots for each of the following features: minutes, clouds, SWD, ST, consumption, generation.

Note three interesting patterns or correlations:

Answer 1: 

Answer 2: 

Answer 3:


Split the dataset into training (80%) and test (20%) datasets. Use 42 as the initialization for the random number generator.

For both the training dataset and test dataset, use the consumption column as y and all other columns, except consumption and generation, as X. You should end up with variables X_train, y_train, X_test, and y_test.
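
The split described above could look roughly like the following sketch on a toy table. It assumes scikit-learn's `train_test_split` is the intended tool; the column names (f1, f2, target, other) are hypothetical stand-ins and not the dataset's columns.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy table; all column names here are hypothetical
toy = pd.DataFrame({
    "f1": range(10),
    "f2": [0.5 * i for i in range(10)],
    "target": [2.0 * i for i in range(10)],
    "other": [10 - i for i in range(10)],
})

# 80/20 split with a fixed seed so the result is reproducible
toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

# Separate the inputs from the column to be predicted
toy_X_train = toy_train.drop(columns=["target", "other"])
toy_y_train = toy_train["target"]
toy_X_test = toy_test.drop(columns=["target", "other"])
toy_y_test = toy_test["target"]

print(toy_X_train.shape)  # (8, 2)
```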

Print the shape of X_train.

Write the code to create a pipeline, called grid_pipeline, that applies a scaling function (StandardScaler) to the input. Use the pipeline's fit_transform method to fit the scaler on X_train and apply the scaling to it. The scaled output should be assigned to X_train_tr.

Print X_train_tr.

## Part 3: Training ML Algorithms
In this part, you will train different ML algorithms. Please give each trained model a different variable name since we will compare all models in Part 4 of this project.

### Linear Regression

Write the code to train a linear regression model on X_train_tr and y_train.

Write the code to evaluate the accuracy of the linear regression model using the RMSE metric.
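
For reference, RMSE is the square root of the mean squared difference between the true and predicted values. The sketch below computes it with NumPy on made-up numbers; taking `np.sqrt` of `sklearn.metrics.mean_squared_error` gives the same value. How the result maps to a percentage depends on the units of the consumption column, which this sketch does not assume.

```python
import numpy as np

# Made-up values for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

# Root of the mean of the squared errors
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 4))  # 1.2247
```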

What is the RMSE (in percent) for linear regression?

Answer: ___%

### Random Forest Regressor

Write the code to train a random forest regressor on X_train_tr and y_train.

Write the code to evaluate the accuracy of the random forest regressor using the RMSE metric.

What is the RMSE (in percent) for the random forest regressor?

Answer: ___%

### Support Vector Machine Regressor

Write the code to train a support vector machine regressor on X_train_tr and y_train.

Write the code to evaluate the accuracy of the support vector machine regressor using the RMSE metric.

What is the RMSE (in percent) for the support vector machine regressor?

Answer: ___%

### Grid Search

Write the code to perform a grid search (using GridSearchCV) on X_train_tr and y_train. Try 12 (3×4) combinations of hyperparameters ('n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]), then try 6 (2×3) combinations with bootstrap set as False ('bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]).
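
The two groups of hyperparameter combinations listed above can be expressed as a list of two dictionaries. The following is a minimal sketch on synthetic data, assuming a RandomForestRegressor as the estimator (the parameter names suggest this, but the choice is an assumption); the synthetic data, `cv=3`, and the scoring string are likewise illustrative choices, not requirements of the project.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with 8 features (not the microgrid dataset)
toy_X, toy_y = make_regression(n_samples=80, n_features=8, noise=0.5, random_state=42)

param_grid = [
    # 3 x 4 = 12 combinations
    {"n_estimators": [3, 10, 30], "max_features": [2, 4, 6, 8]},
    # 2 x 3 = 6 combinations with bootstrapping turned off
    {"bootstrap": [False], "n_estimators": [3, 10], "max_features": [2, 3, 4]},
]

toy_search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,                              # illustrative number of folds
    scoring="neg_mean_squared_error",  # higher (less negative) is better
)
toy_search.fit(toy_X, toy_y)

print(len(toy_search.cv_results_["params"]))  # 18
print(toy_search.best_params_)
```

After fitting, `best_params_` holds the winning combination and `best_estimator_` is the model refit with it on all of the data passed to `fit`.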

Write code to identify the best set of parameters.

The best parameters are:

Answer: max_features = _, n_estimators = _

Write the code to evaluate the accuracy of the grid search regressor (with the best parameters) using the RMSE metric.

What is the RMSE (in percent) for the grid search with the best parameters?

Answer: ___%

### Choosing ML Algorithm
Based on your results, which algorithm would you choose and why?

Answer: Chosen algorithm: 

Answer: Reason: 

## Part 4: Evaluating ML Algorithms

Write code to evaluate each trained ML algorithm on the test dataset. Determine the RMSE for each algorithm.

Prepare your test dataset input by passing it through the pipeline you created.
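
A common pitfall at this step is refitting the scaler on the test data. The sketch below, on toy arrays with hypothetical names, shows the usual pattern: `fit_transform` on the training input only, then `transform` on the test input so that it is scaled with the training mean and standard deviation.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy arrays for illustration only
toy_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
toy_test = np.array([[2.0, 40.0]])

toy_pipeline = Pipeline([("scaler", StandardScaler())])

# Learn mean and standard deviation from the training data and scale it
toy_train_tr = toy_pipeline.fit_transform(toy_train)

# Reuse the learned statistics on the test data (no refitting)
toy_test_tr = toy_pipeline.transform(toy_test)

# The first test feature equals the training mean, so it maps to 0
print(toy_test_tr)
```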

### Linear Regression

What is the RMSE (in percent) for the test dataset?

Answer: ___%

### Random Forest Regressor

What is the RMSE (in percent) for the test dataset?

Answer: ___%

### Support Vector Machine Regressor

What is the RMSE (in percent) for the test dataset?

Answer: ___%

### Grid Search

What is the RMSE (in percent) for the test dataset?

Answer: ___%

## Part 5: Summarize Results

Please summarize your RMSE results (in percent) in the following table.

| ML algorithm                     | training dataset | test dataset |
| -------------------------------- | ---------------- |------------- |
| Linear Regression                |                  |              |
| Random Forest Regressor          |                  |              |
| Support Vector Machine Regressor |                  |              |
| Grid Search                      |                  |              |

What do you observe? Is there anything that surprises you?

Answer:

## Final Check

Before submitting the project, please save your notebook, restart the kernel and clear the output, and rerun the entire notebook. There should be no error messages and your results should be the same as before.

## Optional: Random Grid Search and Feature Importance

This part of the project will not be graded. You may skip the rest of the project or work on it for your entertainment.

Write code to perform a randomized search (using RandomizedSearchCV and RandomForestRegressor) on the training dataset. Vary the number of estimators and the maximum number of features.

Determine for which parameters you achieve the lowest RMSE on the training set.

Determine the RMSE on the training set.

Determine the RMSE on the test set.

Answer: 

estimators = ___ 

maximum features = ___ 

RMSE on training dataset = ___%

RMSE on test dataset = ___%

Write code to determine which feature is the most important in terms of predicting power consumption. Provide an explanation why this makes sense.
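
One way to inspect feature importance, assuming a fitted random forest, is its `feature_importances_` attribute. The sketch below uses synthetic data in which only the first feature drives the target; the feature names (f0, f1, f2) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the target depends almost entirely on the first feature
rng = np.random.default_rng(42)
toy_X = rng.normal(size=(200, 3))
toy_y = 5.0 * toy_X[:, 0] + 0.1 * rng.normal(size=200)
toy_names = ["f0", "f1", "f2"]  # hypothetical feature names

toy_forest = RandomForestRegressor(n_estimators=30, random_state=42)
toy_forest.fit(toy_X, toy_y)

# Pair each importance with its feature name, largest first
ranked = sorted(zip(toy_forest.feature_importances_, toy_names), reverse=True)
for importance, name in ranked:
    print(name, round(float(importance), 3))
```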

Answer: 

Most important feature:

Explanation:

If you want to do more, feel free to repeat the entire project with *power generation* as the prediction target. Please do *not* submit that as part of the project assignment. (Only submit work where *power consumption* is predicted.)