## Introduction

Climate change is a globally relevant, urgent, and multi-faceted issue heavily impacted by energy policy and infrastructure. Addressing climate change involves mitigation (i.e., reducing greenhouse gas (GHG) emissions) and adaptation (i.e., preparing for unavoidable consequences). Mitigation of GHG emissions requires changes to electricity systems, transportation, buildings, industry, and land use.

According to a report issued by the International Energy Agency (IEA), the lifecycle of buildings, from construction to demolition, was responsible for 37% of global energy-related and process-related CO2 emissions in 2020. Yet it is possible to drastically reduce the energy consumption of buildings through a combination of easy-to-implement fixes and state-of-the-art strategies. For example, retrofitted buildings can reduce heating and cooling energy requirements by 50-90 percent. Many of these energy efficiency measures also result in overall cost savings and yield other benefits, such as cleaner air for occupants. This potential can be achieved while maintaining the services that buildings provide.

**Dataset**: 

The dataset was created in collaboration with Climate Change AI (CCAI) and Lawrence Berkeley National Laboratory (Berkeley Lab). It consists of variables that describe building characteristics, together with climate and weather variables for the regions in which the buildings are located. Accurate predictions of energy consumption can help policymakers target retrofitting efforts to maximize emissions reductions.

**Task**: 

Analyze differences in building energy efficiency and create models that predict building energy consumption (the `site_eui` column).

**Evaluation Metric**: 

The evaluation metric for this competition is Root Mean Squared Error (RMSE). RMSE is a commonly used measure of the difference between the values predicted by a model and the values actually observed; lower is better.
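
As a reminder of what the metric computes, RMSE is the square root of the mean of the squared prediction errors: $\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. The snippet below is a minimal sketch of that formula; the helper name `rmse` and the toy values are assumptions made for illustration and are not part of the competition code.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Toy values for illustration only; these are not taken from the dataset.
observed = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
score = rmse(observed, predicted)
print(score)
```

Because the errors are squared before averaging, a few large misses raise the score more than many small ones.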

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Reading Data

In [2]:
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Treat identifier-like and category-coded columns as strings in both sets.
string_cols = [
    'Year_Factor', 'State_Factor', 'year_built', 'id',
    'direction_max_wind_speed', 'direction_peak_wind_speed',
]
for col in string_cols:
    train[col] = train[col].astype(str)
    test[col] = test[col].astype(str)

train.head()

| | Year_Factor | State_Factor | building_class | facility_type | floor_area | year_built | energy_star_rating | ELEVATION | january_min_temp | january_avg_temp | ... | days_above_80F | days_above_90F | days_above_100F | days_above_110F | direction_max_wind_speed | direction_peak_wind_speed | max_wind_speed | days_with_fog | site_eui | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | State_1 | Commercial | Grocery_store_or_food_market | 61242.0 | 1942.0 | 11.0 | 2.4 | 36 | 50.5 | ... | 14 | 0 | 0 | 0 | 1.0 | 1.0 | 1.0 | NaN | 248.682615 | 0 |
| 1 | 1 | State_1 | Commercial | Warehouse_Distribution_or_Shipping_center | 274000.0 | 1955.0 | 45.0 | 1.8 | 36 | 50.5 | ... | 14 | 0 | 0 | 0 | 1.0 | NaN | 1.0 | 12.0 | 26.50015 | 1 |
| 2 | 1 | State_1 | Commercial | Retail_Enclosed_mall | 280025.0 | 1951.0 | 97.0 | 1.8 | 36 | 50.5 | ... | 14 | 0 | 0 | 0 | 1.0 | NaN | 1.0 | 12.0 | 24.693619 | 2 |
| 3 | 1 | State_1 | Commercial | Education_Other_classroom | 55325.0 | 1980.0 | 46.0 | 1.8 | 36 | 50.5 | ... | 14 | 0 | 0 | 0 | 1.0 | NaN | 1.0 | 12.0 | 48.406926 | 3 |
| 4 | 1 | State_1 | Commercial | Warehouse_Nonrefrigerated | 66000.0 | 1985.0 | 100.0 | 2.4 | 36 | 50.5 | ... | 14 | 0 | 0 | 0 | 1.0 | 1.0 | 1.0 | NaN | 3.899395 | 4 |


In [3]:
test.head()

| | Year_Factor | State_Factor | building_class | facility_type | floor_area | year_built | energy_star_rating | ELEVATION | january_min_temp | january_avg_temp | ... | days_below_0F | days_above_80F | days_above_90F | days_above_100F | days_above_110F | direction_max_wind_speed | direction_peak_wind_speed | max_wind_speed | days_with_fog | id |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7 | State_1 | Commercial | Grocery_store_or_food_market | 28484.0 | 1994.0 | 37.0 | 2.4 | 38 | 50.596774 | ... | 0 | 29 | 5 | 2 | 0 | NaN | NaN | NaN | NaN | 75757 |
| 1 | 7 | State_1 | Commercial | Grocery_store_or_food_market | 21906.0 | 1961.0 | 55.0 | 45.7 | 38 | 50.596774 | ... | 0 | 29 | 5 | 2 | 0 | NaN | NaN | NaN | NaN | 75758 |
| 2 | 7 | State_1 | Commercial | Grocery_store_or_food_market | 16138.0 | 1950.0 | 1.0 | 59.1 | 38 | 50.596774 | ... | 0 | 29 | 5 | 2 | 0 | NaN | NaN | NaN | NaN | 75759 |
| 3 | 7 | State_1 | Commercial | Grocery_store_or_food_market | 97422.0 | 1971.0 | 34.0 | 35.4 | 38 | 50.596774 | ... | 0 | 29 | 5 | 2 | 0 | NaN | NaN | NaN | NaN | 75760 |
| 4 | 7 | State_1 | Commercial | Grocery_store_or_food_market | 61242.0 | 1942.0 | 35.0 | 1.8 | 38 | 50.596774 | ... | 0 | 29 | 5 | 2 | 0 | 340.0 | 330.0 | 22.8 | 126.0 | 75761 |


In [4]:
train.describe()

| | floor_area | energy_star_rating | ELEVATION | january_min_temp | january_avg_temp | january_max_temp | february_min_temp | february_avg_temp | february_max_temp | march_min_temp | ... | days_below_20F | days_below_10F | days_below_0F | days_above_80F | days_above_90F | days_above_100F | days_above_110F | max_wind_speed | days_with_fog | site_eui |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 75757.0 | 49048.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | ... | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 75757.0 | 34675.0 | 29961.0 | 75757.0 |
| mean | 165983.9 | 61.048605 | 39.506323 | 11.432343 | 34.310468 | 59.054952 | 11.720567 | 35.526837 | 58.486278 | 21.606281 | ... | 17.447932 | 4.886532 | 0.876764 | 82.709809 | 14.058701 | 0.279539 | 0.002442 | 4.190601 | 109.142051 | 82.584693 |
| std | 246875.8 | 28.663683 | 60.656596 | 9.381027 | 6.996108 | 5.355458 | 12.577272 | 8.866697 | 8.414611 | 10.004303 | ... | 14.469435 | 7.071221 | 2.894244 | 25.282913 | 10.943996 | 2.252323 | 0.14214 | 6.458789 | 50.699751 | 58.255403 |
| min | 943.0 | 0.0 | -6.4 | -19.0 | 10.806452 | 42.0 | -13.0 | 13.25 | 38.0 | -9.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 12.0 | 1.001169 |
| 25% | 62379.0 | 40.0 | 11.9 | 6.0 | 29.827586 | 56.0 | 2.0 | 31.625 | 55.0 | 13.0 | ... | 5.0 | 0.0 | 0.0 | 72.0 | 6.0 | 0.0 | 0.0 | 1.0 | 88.0 | 54.528601 |
| 50% | 91367.0 | 67.0 | 25.0 | 11.0 | 34.451613 | 59.0 | 9.0 | 34.107143 | 61.0 | 25.0 | ... | 11.0 | 2.0 | 0.0 | 84.0 | 12.0 | 0.0 | 0.0 | 1.0 | 104.0 | 75.293716 |
| 75% | 166000.0 | 85.0 | 42.7 | 13.0 | 37.322581 | 62.0 | 20.0 | 40.87931 | 62.0 | 27.0 | ... | 26.0 | 7.0 | 0.0 | 97.0 | 17.0 | 0.0 | 0.0 | 1.0 | 131.0 | 97.277534 |
| max | 6385382.0 | 100.0 | 1924.5 | 49.0 | 64.758065 | 91.0 | 48.0 | 65.107143 | 89.0 | 52.0 | ... | 93.0 | 59.0 | 31.0 | 260.0 | 185.0 | 119.0 | 16.0 | 23.3 | 311.0 | 997.86612 |


In [5]:
test.describe()

| | floor_area | energy_star_rating | ELEVATION | january_min_temp | january_avg_temp | january_max_temp | february_min_temp | february_avg_temp | february_max_temp | march_min_temp | ... | days_below_30F | days_below_20F | days_below_10F | days_below_0F | days_above_80F | days_above_90F | days_above_100F | days_above_110F | max_wind_speed | days_with_fog |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9705.0 | 7451.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | ... | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 9705.0 | 1130.0 | 588.0 |
| mean | 163214.3 | 64.712924 | 205.23119 | 13.520762 | 36.678081 | 60.008449 | 21.7051 | 41.634886 | 66.940958 | 23.146419 | ... | 54.256054 | 20.443895 | 5.371561 | 1.323029 | 66.820093 | 11.941267 | 0.211643 | 0.0 | 18.131327 | 150.755102 |
| std | 262475.9 | 27.935984 | 264.822814 | 12.458365 | 6.96852 | 5.874699 | 9.774624 | 5.528689 | 6.397885 | 11.553421 | ... | 42.259933 | 23.182254 | 6.676871 | 2.205729 | 30.936872 | 13.077936 | 0.61525 | 0.0 | 1.993348 | 58.760576 |
| min | 5982.0 | 1.0 | 1.8 | -1.0 | 27.548387 | 42.0 | 9.0 | 33.428571 | 52.0 | 8.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 | 1.0 | 0.0 | 0.0 | 14.8 | 34.0 |
| 25% | 48020.0 | 45.0 | 26.5 | -1.0 | 27.548387 | 54.0 | 9.0 | 36.053571 | 59.0 | 12.0 | ... | 17.0 | 0.0 | 0.0 | 0.0 | 39.0 | 3.0 | 0.0 | 0.0 | 16.5 | 129.0 |
| 50% | 82486.0 | 72.0 | 118.9 | 15.0 | 38.66129 | 59.0 | 22.0 | 41.625 | 69.0 | 21.0 | ... | 45.0 | 7.0 | 0.0 | 0.0 | 77.0 | 5.0 | 0.0 | 0.0 | 18.3 | 129.0 |
| 75% | 177520.0 | 88.0 | 231.3 | 21.0 | 41.177419 | 64.0 | 28.0 | 45.685185 | 73.0 | 33.0 | ... | 108.0 | 51.0 | 14.0 | 5.0 | 79.0 | 12.0 | 0.0 | 0.0 | 19.2 | 138.0 |
| max | 6353396.0 | 100.0 | 812.0 | 38.0 | 50.596774 | 71.0 | 40.0 | 54.482143 | 76.0 | 42.0 | ... | 108.0 | 51.0 | 14.0 | 5.0 | 122.0 | 41.0 | 2.0 | 0.0 | 23.3 | 250.0 |
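
The `count` rows in the summaries above are lower than the number of rows for some columns (for example, `energy_star_rating` has 49,048 non-null values out of 75,757 in the training set), which points to missing values. The following is a minimal sketch of how the share of missing values per column could be listed. The `toy` frame is a hypothetical stand-in with made-up values, used only so the snippet runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "floor_area": [61242.0, 274000.0, 280025.0, 55325.0],
    "energy_star_rating": [11.0, np.nan, 97.0, np.nan],
    "days_with_fog": [np.nan, 12.0, np.nan, np.nan],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

Applied to the real data, the same expression on `train` and `test` would show which columns depend most heavily on whatever imputation strategy is chosen.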


## 1. Baseline Model - XGBoost

In [6]:
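from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

# Select columns by name: `site_eui` is the target and `id` is only a row identifier.
# Cast back to numeric, since some columns were converted to strings when loading.
feature_cols = [c for c in train.columns[4:] if c not in ('site_eui', 'id')]
features = train[feature_cols].apply(pd.to_numeric, errors='coerce')
X_test = test[feature_cols].apply(pd.to_numeric, errors='coerce')
Y_train = train['site_eui']

# Replace missing and infinite values with 0 before scaling.
Y_train = Y_train.replace([np.inf, -np.inf, np.nan], 0).reset_index(drop=True)
X_test = X_test.replace([np.inf, -np.inf, np.nan], 0).reset_index(drop=True)
features = features.replace([np.inf, -np.inf, np.nan], 0).reset_index(drop=True)

# Fit the scaler on the training features only, then apply it to the test set.
scaler.fit(features)
scaled_x_test = scaler.transform(X_test)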

In [7]:
import xgboost as xgb
from sklearn.metrics import mean_squared_error

regressor = xgb.XGBRegressor(
    learning_rate=0.01,
    colsample_bytree=0.8,
    n_estimators=430,
    reg_lambda=1,
    gamma=1,
    max_depth=3,
    subsample=0.55
)
model = regressor.fit(scaler.fit_transform(features), Y_train)
preds = regressor.predict(scaled_x_test)

In [8]:
submission = pd.read_csv('sample_solution.csv')

In [9]:
submission['site_eui'] = preds
submission.to_csv('results/submission_base_xgboost.csv', index = False)

## 2. Baseline Model - Lasso & Ridge Regression

In [None]:
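from sklearn.linear_model import Lasso, Ridge

# Scale the training features so the models are fit on the same kind of
# input that they are asked to predict on (`scaled_x_test`).
scaled_features = scaler.fit_transform(features)

regression = Lasso(alpha=0.5)
model = regression.fit(scaled_features, Y_train)
preds_lasso = model.predict(scaled_x_test)

regression = Ridge(alpha=0.5)
model = regression.fit(scaled_features, Y_train)
preds_ridge = model.predict(scaled_x_test)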

In [None]:
submission['site_eui'] = preds_lasso
submission.to_csv('results/submission_base_lasso.csv', index = False)

In [None]:
submission['site_eui'] = preds_ridge
submission.to_csv('results/submission_base_ridge.csv', index = False)
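
The baselines above are only written to submission files, so their RMSE is not known until they are scored externally. One way to get a local estimate is a hold-out split, sketched below under stated assumptions: the data is synthetic (generated with NumPy, not the competition data), and wrapping the scaler and estimator in a scikit-learn pipeline is a design choice of this sketch rather than something the notebook does.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix and target (not the real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(scale=0.1, size=500)

# Hold out 20% of the rows for validation.
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

scores = {}
for name, estimator in [("lasso", Lasso(alpha=0.5)), ("ridge", Ridge(alpha=0.5))]:
    # The pipeline fits the scaler on the training split only and reuses
    # those statistics when transforming the validation split.
    pipeline = make_pipeline(StandardScaler(), estimator)
    pipeline.fit(X_tr, y_tr)
    val_preds = pipeline.predict(X_val)
    scores[name] = float(np.sqrt(mean_squared_error(y_val, val_preds)))

print(scores)
```

Keeping the scaler inside the pipeline means the validation rows never influence the scaling statistics, so the hold-out RMSE is a fairer proxy for the score on unseen buildings.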

## 3. DNN Model