**Capstone 2: Pre-processing and Training Data Development**

About the data
This dataset contains information on energy consumption and various weather parameters such as solar radiation, temperature, pressure, humidity, wind speed, and precipitation. The "Energy delta[Wh]" column represents the change in energy consumption over a certain time period, while the "GHI" column measures the Global Horizontal Irradiance, which is the amount of solar radiation received by a horizontal surface. The dataset also includes information on the presence of sunlight ("isSun"), the length of daylight ("dayLength"), and the amount of time during which sunlight is available ("sunlightTime"). The "weather_type" column provides information on the overall weather conditions such as clear, cloudy, or rainy. The dataset is organized by hour and month, making it ideal for studying the relationship between renewable energy generation and weather patterns over time.
The description above is taken from the data source: https://www.kaggle.com/code/totoro29/renewable-energy-analysis

I'd like to investigate how energy consumption is affected by changing weather conditions. Understanding this relationship is useful for forecasting the potential renewable energy demand under different weather conditions.

In the previous Capstone 2 step I completed the EDA and saved the resulting data as Capstone2_EDA.csv.

In [1]:
# Import pandas, numpy, and warnings

import pandas as pd
import numpy as np
import warnings

warnings.filterwarnings('ignore')


In [58]:
# Import data from previous EDA step
df = pd.read_csv('/Users/claudiazaffaroni/Desktop/Springboard_Data_Science_Course/Unit 16 Feature Engineering/Capstone 2 PreProcessing & Training Data Development/Capstone2_EDA.csv')

In [59]:
# Take a look at the first few rows
df.head()

Unnamed: 0,Energy_delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,isSun,sunlightTime,dayLength,SunlightTime/daylength,weather_type,hour,month
0,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
1,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
2,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
3,0,0.0,1.6,1021,100,4.9,0.0,0.0,100,0,0,450,0.0,4,0,1
4,0,0.0,1.7,1020,100,5.2,0.0,0.0,100,0,0,450,0.0,4,1,1


In [78]:
df.columns

Index(['Energy_delta[Wh]', 'GHI', 'temp', 'pressure', 'humidity', 'wind_speed',
       'rain_1h', 'snow_1h', 'clouds_all', 'isSun', 'sunlightTime',
       'dayLength', 'SunlightTime/daylength', 'weather_type', 'hour', 'month'],
      dtype='object')

In [60]:
# Call the describe method on df to see summary statistics for the numeric columns

df.describe()

Unnamed: 0,Energy_delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,isSun,sunlightTime,dayLength,SunlightTime/daylength,weather_type,hour,month
count,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0
mean,573.008228,32.596538,9.790521,1015.29278,79.810566,3.937746,0.066035,0.007148,65.974387,0.519962,211.721094,748.644347,0.265187,3.198398,11.498902,6.298329
std,1044.824047,52.172018,7.995428,9.585773,15.604459,1.821694,0.278913,0.06971,36.628593,0.499603,273.902186,194.870208,0.329023,1.289939,6.921887,3.376066
min,0.0,0.0,-16.6,977.0,22.0,0.0,0.0,0.0,0.0,0.0,0.0,450.0,0.0,1.0,0.0,1.0
25%,0.0,0.0,3.6,1010.0,70.0,2.6,0.0,0.0,34.0,0.0,0.0,570.0,0.0,2.0,5.0,3.0
50%,0.0,1.6,9.3,1016.0,84.0,3.7,0.0,0.0,82.0,1.0,30.0,765.0,0.05,4.0,11.0,6.0
75%,577.0,46.8,15.7,1021.0,92.0,5.0,0.0,0.0,100.0,1.0,390.0,930.0,0.53,4.0,17.0,9.0
max,5020.0,229.2,35.8,1047.0,100.0,14.3,8.09,2.82,100.0,1.0,1020.0,1020.0,1.0,5.0,23.0,12.0


# Step 1: Create dummy or indicator features for categorical variables.

In [86]:
# 'isSun' and 'weather_type' are categorical in nature, but the dataset I chose
# already stores them as integer codes. I'll use these two columns to explore
# pd.get_dummies().

# Columns to one-hot encode (they are dropped from df after encoding)
df_drop = ['isSun', 'weather_type']

# Build one indicator column per category value
df_o = pd.get_dummies(df[df_drop], columns = df_drop)

# Remove the original coded columns and append the indicator columns
df = df.drop(columns = df_drop)
df_1 = pd.concat([df, df_o], axis=1)
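
Because isSun is binary, isSun_0 and isSun_1 carry the same information (one is always 1 minus the other), and the five weather_type indicators always sum to 1. That redundancy can matter for linear models. As a minimal sketch, assuming a toy DataFrame with made-up values (not the capstone data), get_dummies can drop the first level of each encoded column with drop_first=True:

```python
import pandas as pd

# Toy stand-in for the real data; the values are illustrative only
toy = pd.DataFrame({
    "isSun": [0, 1, 1, 0],
    "weather_type": [1, 2, 5, 2],
    "temp": [1.6, 3.0, 7.2, 4.1],
})

categorical_cols = ["isSun", "weather_type"]

# drop_first=True removes the first level of each encoded column,
# so isSun_0 and weather_type_1 are not created
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

Whether to drop a level depends on the model used later; tree-based models are generally less sensitive to this redundancy than unregularized linear regression.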

In [87]:
df_1.describe()

Unnamed: 0,Energy_delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,sunlightTime,...,SunlightTime/daylength,hour,month,isSun_0,isSun_1,weather_type_1,weather_type_2,weather_type_3,weather_type_4,weather_type_5
count,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,...,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0
mean,573.008228,32.596538,9.790521,1015.29278,79.810566,3.937746,0.066035,0.007148,65.974387,211.721094,...,0.265187,11.498902,6.298329,0.480038,0.519962,0.142172,0.180042,0.160894,0.371001,0.145892
std,1044.824047,52.172018,7.995428,9.585773,15.604459,1.821694,0.278913,0.06971,36.628593,273.902186,...,0.329023,6.921887,3.376066,0.499603,0.499603,0.349227,0.384224,0.367434,0.483074,0.352999
min,0.0,0.0,-16.6,977.0,22.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,3.6,1010.0,70.0,2.6,0.0,0.0,34.0,0.0,...,0.0,5.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,1.6,9.3,1016.0,84.0,3.7,0.0,0.0,82.0,30.0,...,0.05,11.0,6.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,577.0,46.8,15.7,1021.0,92.0,5.0,0.0,0.0,100.0,390.0,...,0.53,17.0,9.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
max,5020.0,229.2,35.8,1047.0,100.0,14.3,8.09,2.82,100.0,1020.0,...,1.0,23.0,12.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [88]:
# List columns
df_1.columns

Index(['Energy_delta[Wh]', 'GHI', 'temp', 'pressure', 'humidity', 'wind_speed',
       'rain_1h', 'snow_1h', 'clouds_all', 'sunlightTime', 'dayLength',
       'SunlightTime/daylength', 'hour', 'month', 'isSun_0', 'isSun_1',
       'weather_type_1', 'weather_type_2', 'weather_type_3', 'weather_type_4',
       'weather_type_5'],
      dtype='object')

# Step 2: Standardize the magnitude of numeric features using a scaler.

In [89]:
from sklearn import preprocessing

# Select the columns to scale (note: this list includes the target, 'Energy_delta[Wh]')
columns_to_scale = ['Energy_delta[Wh]', 'GHI', 'temp', 'pressure', 'humidity', 'clouds_all', 'sunlightTime', 'dayLength']

# Create a StandardScaler object
scaler = preprocessing.StandardScaler()

# Fit the scaler on the full dataset and replace the selected columns
# with their standardized values (mean 0, standard deviation 1)
df_1[columns_to_scale] = scaler.fit_transform(df_1[columns_to_scale])
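
StandardScaler replaces each value x with (x - mean) / std, where std is the population standard deviation (ddof=0). pandas' describe() reports the sample standard deviation (ddof=1), which is why a scaled column can show a std slightly above 1 for a large dataset. A minimal sketch with a toy column (made-up values, not the capstone data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy values standing in for one numeric column (illustrative only)
toy = pd.DataFrame({"temp": [1.0, 2.0, 3.0, 4.0, 5.0]})

# What the scaler produces
scaled = StandardScaler().fit_transform(toy[["temp"]]).ravel()

# The same z-score computed by hand, using the population std (ddof=0)
manual = ((toy["temp"] - toy["temp"].mean()) / toy["temp"].std(ddof=0)).to_numpy()

population_std = scaled.std(ddof=0)    # NumPy population std of the scaled values
sample_std = pd.Series(scaled).std()   # pandas default (ddof=1), as used by describe()

print(np.allclose(scaled, manual), population_std, sample_std)
```

With n rows the two standard deviations differ by a factor of sqrt(n / (n - 1)), which shrinks toward 1 as n grows.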


In [90]:
df_1.describe()

Unnamed: 0,Energy_delta[Wh],GHI,temp,pressure,humidity,wind_speed,rain_1h,snow_1h,clouds_all,sunlightTime,...,SunlightTime/daylength,hour,month,isSun_0,isSun_1,weather_type_1,weather_type_2,weather_type_3,weather_type_4,weather_type_5
count,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,...,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0,196776.0
mean,2.960956e-17,1.509365e-17,1.321597e-17,2.614885e-15,-8.196792e-17,3.937746,0.066035,0.007148,-1.850958e-16,-1.069916e-16,...,0.265187,11.498902,6.298329,0.480038,0.519962,0.142172,0.180042,0.160894,0.371001,0.145892
std,1.000003,1.000003,1.000003,1.000003,1.000003,1.821694,0.278913,0.06971,1.000003,1.000003,...,0.329023,6.921887,3.376066,0.499603,0.499603,0.349227,0.384224,0.367434,0.483074,0.352999
min,-0.548427,-0.6247913,-3.30071,-3.994762,-3.704756,0.0,0.0,0.0,-1.801176,-0.7729826,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,-0.548427,-0.6247913,-0.7742596,-0.5521509,-0.6287044,2.6,0.0,0.0,-0.8729374,-0.7729826,...,0.0,5.0,3.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,-0.548427,-0.5941234,-0.06135037,0.07377832,0.2684774,3.7,0.0,0.0,0.4375176,-0.6634542,...,0.05,11.0,6.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
75%,0.003820531,0.2722436,0.7391091,0.595386,0.7811527,5.0,0.0,0.0,0.9289382,0.650887,...,0.53,17.0,9.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0
max,4.256222,3.768379,3.253052,3.307746,1.293828,14.3,8.09,2.82,0.9289382,2.950984,...,1.0,23.0,12.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


# Step 3: Split your data into testing and training datasets

In [92]:
# import necessary packages

from sklearn.model_selection import train_test_split


In [96]:
# Build the feature matrix X by dropping the column we want to predict

X = df_1.drop(columns=['Energy_delta[Wh]'])
X.columns

Index(['GHI', 'temp', 'pressure', 'humidity', 'wind_speed', 'rain_1h',
       'snow_1h', 'clouds_all', 'sunlightTime', 'dayLength',
       'SunlightTime/daylength', 'hour', 'month', 'isSun_0', 'isSun_1',
       'weather_type_1', 'weather_type_2', 'weather_type_3', 'weather_type_4',
       'weather_type_5'],
      dtype='object')

In [97]:
# Create the target 'y' (kept as a one-column DataFrame)

y = df_1[['Energy_delta[Wh]']]
y.columns

Index(['Energy_delta[Wh]'], dtype='object')

In [95]:
# Divide the complete dataset into training and testing data

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)
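
The scaler in Step 2 was fit on the full dataset before this split, so statistics from the test rows influence the scaled training features. A commonly recommended alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on synthetic data; the toy_X and toy_y names and their values are assumptions made for illustration and are not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the features and target (illustrative values only)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "GHI": rng.uniform(0, 200, 100),
    "temp": rng.normal(10, 8, 100),
})
toy_y = pd.Series(rng.uniform(0, 5000, 100), name="Energy_delta[Wh]")

# 1) Split first
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, test_size=0.2, random_state=42)

# 2) Fit the scaler on the training rows only, then apply it to both sets
feature_scaler = StandardScaler()
X_tr_scaled = pd.DataFrame(
    feature_scaler.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index
)
X_te_scaled = pd.DataFrame(
    feature_scaler.transform(X_te), columns=X_te.columns, index=X_te.index
)

print(X_tr_scaled.shape, X_te_scaled.shape)
```

In this ordering the test set is transformed with the training mean and standard deviation, so its scaled columns will usually not have a mean of exactly 0.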

In [98]:
# Save the training and testing sets to CSV files

X_train.to_csv('X_train.csv', index=False)
X_test.to_csv('X_test.csv', index=False)
y_train.to_csv('y_train.csv', index=False)
y_test.to_csv('y_test.csv', index=False)
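
Passing index=False keeps the row index out of the file, so reloading does not introduce an extra 'Unnamed: 0' column. A small round-trip check, sketched with a toy DataFrame and a temporary directory (both are assumptions for illustration, not the files written above):

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for one of the saved splits (illustrative values only)
toy = pd.DataFrame({"GHI": [0.0, 1.6], "temp": [1.6, 1.7]})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "X_train_toy.csv"  # hypothetical file name
    toy.to_csv(path, index=False)         # same option as used above
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())
```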
