## Capstone Two: Preprocessing and Training Data Development

### Renewable Energy in the USA from 1965 to 2022 
### Hydropower, Wind, Solar, Biofuel & Geothermal Renewable Energy Dataset
##### Which type of renewable energy source is expected to grow the most in the next decade in the United States?

In [1]:
# Import packages
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

In [2]:
# Read the csv file that was previously wrangled
df = pd.read_csv("merged_usa_dataset.csv")

# Display the first few rows of the dataset to get an overview
df.head()

Unnamed: 0,Entity,Year,Geo Biomass Other - TWh,Solar Generation - TWh,Wind Generation - TWh,Hydro Generation - TWh,Electricity from solar (TWh),Solar Capacity,Geothermal Capacity,Solar (% electricity),...,Solar (% equivalent primary energy),Renewables (% equivalent primary energy),Wind (% equivalent primary energy),Electricity from hydro (TWh),Biofuels Production - TWh - Total,Electricity from wind (TWh),Other renewables including bioenergy (TWh),Wind Capacity,Hydro (% electricity),Renewables (% electricity)
0,United States,1965,13.332232,0.0,0.0,198.97409,0.0,0.0,0.0,0.0,...,0.0,4.36887,0.0,397.94818,0.0,0.0,13.332232,0.0,0.0,0.0
1,United States,1966,14.062007,0.0,0.0,199.9369,0.0,0.0,0.0,0.0,...,0.0,4.171402,0.0,399.8738,0.0,0.0,14.062007,0.0,0.0,0.0
2,United States,1967,14.073571,0.0,0.0,227.22081,0.0,0.0,0.0,0.0,...,0.0,4.542216,0.0,454.44162,0.0,0.0,14.073571,0.0,0.0,0.0
3,United States,1968,15.546045,0.0,0.0,228.15471,0.0,0.0,0.0,0.0,...,0.0,4.330974,0.0,456.30942,0.0,0.0,15.546045,0.0,0.0,0.0
4,United States,1969,16.22706,0.0,0.0,256.02853,0.0,0.0,0.0,0.0,...,0.0,4.598878,0.0,512.05706,0.0,0.0,16.22706,0.0,0.0,0.0


---

### Create Dummy or Indicator Features for Categorical Variables

The dataset selected for my Capstone 2 does not have any categorical data that needs to be converted into dummy or indicator features. However, for practice, I'm creating a dummy variable from 'Solar Generation - TWh': it equals 1 for a year in which solar generation is greater than 0, and 0 otherwise. This new column indicates whether or not a year has solar power.

In [5]:
# 'Has_Solar_Power' as a dummy variable for whether a year has solar power or not
df['Has_Solar_Power'] = (df['Solar Generation - TWh'] > 0).astype(int)

# Displaying the first ten rows of the dataset to verify the new column
df[['Year', 'Solar Generation - TWh', 'Has_Solar_Power']].head(10)

Unnamed: 0,Year,Solar Generation - TWh,Has_Solar_Power
0,1965,0.0,0
1,1966,0.0,0
2,1967,0.0,0
3,1968,0.0,0
4,1969,0.0,0
5,1970,0.0,0
6,1971,0.0,0
7,1972,0.0,0
8,1973,0.0,0
9,1974,0.0,0
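
For comparison, here is a minimal sketch of how a true categorical column could be turned into indicator features with `pd.get_dummies`. The `Source` column and the toy rows are hypothetical and do not exist in the capstone dataset; they only illustrate the technique.

```python
import pandas as pd

# Toy data: 'Source' is a hypothetical categorical column, not part of the capstone dataset
toy = pd.DataFrame({
    "Year": [2020, 2021, 2022],
    "Source": ["Solar", "Wind", "Hydro"],
})

# One indicator column per category; dtype=int gives 0/1 instead of True/False
dummies = pd.get_dummies(toy, columns=["Source"], dtype=int)
print(dummies)
```

Each row gets a 1 in the column for its own category and 0 elsewhere, which is the same idea as 'Has_Solar_Power' extended to more than two categories.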


---

### Standardization

In [6]:
# Dropping the non-numeric 'Entity' column and the 'Year' column (used later as the model input) before standardization
df_numeric = df.drop(columns=['Entity', 'Year'])

# Standardizing the numeric features
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_numeric)
scaled_df = pd.DataFrame(scaled_data, columns=df_numeric.columns)

# Displaying the first few rows of the scaled dataset
scaled_df.head()

Unnamed: 0,Geo Biomass Other - TWh,Solar Generation - TWh,Wind Generation - TWh,Hydro Generation - TWh,Electricity from solar (TWh),Solar Capacity,Geothermal Capacity,Solar (% electricity),Wind (% electricity),Hydro (% equivalent primary energy),...,Renewables (% equivalent primary energy),Wind (% equivalent primary energy),Electricity from hydro (TWh),Biofuels Production - TWh - Total,Electricity from wind (TWh),Other renewables including bioenergy (TWh),Wind Capacity,Hydro (% electricity),Renewables (% electricity),Has_Solar_Power
0,-1.556448,-0.382113,-0.534639,-2.290546,-0.371326,-0.387845,-0.815674,-0.372562,-0.538402,0.748683,...,-0.552623,-0.541409,-2.233782,-0.687728,-0.534946,-1.565097,-0.552359,-1.300452,-1.229591,-1.47196
1,-1.526798,-0.382113,-0.534639,-2.261531,-0.371326,-0.387845,-0.815674,-0.372562,-0.538402,0.463076,...,-0.669433,-0.541409,-2.205038,-0.687728,-0.534946,-1.534987,-0.552359,-1.300452,-1.229591,-1.47196
2,-1.526328,-0.382113,-0.534639,-1.439309,-0.371326,-0.387845,-0.815674,-0.372562,-0.538402,1.014672,...,-0.450082,-0.541409,-1.390503,-0.687728,-0.534946,-1.534509,-0.552359,-1.300452,-1.229591,-1.47196
3,-1.466501,-0.382113,-0.534639,-1.411165,-0.371326,-0.387845,-0.815674,-0.372562,-0.538402,0.691484,...,-0.57504,-0.541409,-1.362622,-0.687728,-0.534946,-1.473756,-0.552359,-1.300452,-1.229591,-1.47196
4,-1.438832,-0.382113,-0.534639,-0.571166,-0.371326,-0.387845,-0.815674,-0.372562,-0.538402,1.083278,...,-0.416564,-0.541409,-0.530476,-0.687728,-0.534946,-1.445658,-0.552359,-1.300452,-1.229591,-1.47196
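
As a sanity check on what `StandardScaler` does, the sketch below compares it with the calculation done by hand: subtract the column mean and divide by the population standard deviation (`ddof=0`). It uses only the five hydro values shown in the `df.head()` output as a toy column, so the resulting numbers will differ from the notebook output, which was scaled over all years.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Five hydro values copied from the df.head() output, used here as a toy column
hydro = pd.DataFrame({
    "Hydro Generation - TWh": [198.97409, 199.9369, 227.22081, 228.15471, 256.02853]
})

scaled = StandardScaler().fit_transform(hydro)

# Same calculation by hand: subtract the mean, divide by the population std (ddof=0)
manual = (hydro - hydro.mean()) / hydro.std(ddof=0)

# True when the two results agree
print(np.allclose(scaled, manual.to_numpy()))
```

Note that pandas uses the sample standard deviation (`ddof=1`) by default, so `ddof=0` has to be passed explicitly to match scikit-learn.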


---

### Split into testing and training datasets

In [7]:
# Using 'Year' as the feature and the standardized columns as targets for prediction
X = df[['Year']]
y = scaled_df

# Splitting the dataset into training and testing sets (80% training and 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Displaying the shape of the training and testing datasets
X_train.shape, X_test.shape, y_train.shape, y_test.shape

((45, 1), (12, 1), (45, 21), (12, 21))
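
Because the data is a yearly time series, one alternative worth considering is a chronological split, with the scaler fitted on the training rows only so that no information from the test years leaks into the scaling. The sketch below is an assumption-laden illustration, not the approach used in this notebook: the `Generation` values are made up, and the 57 placeholder years are chosen only to match the 45 + 12 row counts shown in the shapes output.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the capstone data: 57 yearly rows with made-up values
years = np.arange(1965, 2022)
toy = pd.DataFrame({
    "Year": years,
    "Generation": np.linspace(10.0, 500.0, len(years)),  # hypothetical target column
})

X = toy[["Year"]]
y = toy[["Generation"]]

# shuffle=False keeps rows in order, so the test set is the most recent 20% of years
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# Fit the scaler on the training targets only, then reuse it on the test targets
scaler = StandardScaler().fit(y_train)
y_train_scaled = scaler.transform(y_train)
y_test_scaled = scaler.transform(y_test)

print(X_train.shape, X_test.shape)
print(X_train["Year"].max(), X_test["Year"].min())
```

With this layout every training year comes before every test year, which mirrors the forecasting question of predicting future growth from past data.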

In [None]:
# Save the unscaled dataset, including the new 'Has_Solar_Power' column, for the next step
df.to_csv('preprocessing.csv', index=False)