# Residential Energy Consumption Data
The [Residential Energy Consumption Survey](https://www.eia.gov/consumption/residential/) (RECS) is administered by the US Department of Energy roughly every five years.  The most recent full data set is from 2015.  

The core dataset covers about 6,000 homes sampled from across the country.  In addition to energy consumption statistics, the record for each home includes information about the building characteristics, types of appliances, types of fuel used, energy expenditures, household income, and more.

In prior years, ER131 students have built models that predict energy consumption for homes with this data set, and then used data for other homes or communities that are not in the dataset to predict consumption out of the RECS sample.  This creates opportunities to explore energy poverty, what households are most likely to suffer from heat stress amidst climate change, and more.  Note that these data do have some spatial information, so they could be potentially used in conjunction with other geospatial data sets.  

You'll need to read the RECS documentation to develop a deep understanding of what the data contain, how the samples are constructed, and so on.  

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.impute import SimpleImputer

In [3]:
RECS_df = pd.read_csv('recs2015_public_v4.csv')
RECS_df.describe()

Unnamed: 0,DOEID,REGIONC,DIVISION,TYPEHUQ,ZTYPEHUQ,CELLAR,ZCELLAR,BASEFIN,ZBASEFIN,ATTIC,...,ZELAMOUNT,NGXBTU,PERIODNG,ZNGAMOUNT,FOXBTU,PERIODFO,ZFOAMOUNT,LPXBTU,PERIODLP,ZLPAMOUNT
count,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0,...,5686.0,3304.0,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0,5686.0
mean,12843.5,2.760816,5.670243,2.596025,0.0,-0.205593,-0.505276,-1.199261,-1.381815,-0.111854,...,0.10904,100.088868,0.181674,-0.731094,137.441423,-1.738305,-1.872318,91.33,-1.479071,-1.748505
std,1641.551147,1.004187,2.842655,1.164641,0.0,1.134775,0.880288,1.235166,0.933693,1.187953,...,0.311716,4.437933,2.197037,1.117215,0.142739,1.197667,0.558504,5.230054e-12,1.66304,0.76791
min,10001.0,1.0,1.0,1.0,0.0,-2.0,-2.0,-2.0,-2.0,-2.0,...,0.0,83.34,-2.0,-2.0,135.0,-2.0,-2.0,91.33,-2.0,-2.0
25%,11422.25,2.0,3.0,2.0,0.0,-2.0,-2.0,-2.0,-2.0,-2.0,...,0.0,99.05,-2.0,-2.0,137.45,-2.0,-2.0,91.33,-2.0,-2.0
50%,12843.5,3.0,5.0,2.0,0.0,0.0,0.0,-2.0,-2.0,0.0,...,0.0,101.145,1.0,0.0,137.45,-2.0,-2.0,91.33,-2.0,-2.0
75%,14264.75,4.0,8.0,3.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,102.78,1.0,0.0,137.45,-2.0,-2.0,91.33,-2.0,-2.0
max,15686.0,4.0,10.0,5.0,0.0,1.0,1.0,1.0,1.0,1.0,...,1.0,107.72,5.0,1.0,137.45,5.0,1.0,91.33,5.0,1.0


There are 755 columns in this data set -- that is an enormous number of fields to work with.  

### Simple prediction model
In the cells below We'll give you a simple workflow for building a prediction model with all the numeric values from the dataset.  

In [4]:
numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

RECS_df_numeric = RECS_df.select_dtypes(include=numerics)
RECS_df_numeric.head()

Unnamed: 0,DOEID,REGIONC,DIVISION,TYPEHUQ,ZTYPEHUQ,CELLAR,ZCELLAR,BASEFIN,ZBASEFIN,ATTIC,...,ZELAMOUNT,NGXBTU,PERIODNG,ZNGAMOUNT,FOXBTU,PERIODFO,ZFOAMOUNT,LPXBTU,PERIODLP,ZLPAMOUNT
0,10001,4,10,2,0,0,0,-2,-2,0,...,0,103.32,1,0,137.45,-2,-2,91.33,-2,-2
1,10002,3,7,2,0,0,0,-2,-2,0,...,1,,-2,-2,137.45,-2,-2,91.33,-2,-2
2,10003,3,6,2,0,1,0,1,0,0,...,0,100.14,1,0,137.45,-2,-2,91.33,-2,-2
3,10004,2,4,2,0,1,0,1,0,0,...,0,,-2,-2,137.45,-2,-2,91.33,2,0
4,10005,1,2,2,0,1,0,0,0,1,...,0,102.83,1,0,137.45,-2,-2,91.33,-2,-2


Next step is to build a simple model to predict energy use.  

In [5]:
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import tree

This next cell will create a list of which columns have `KWH` in the label.  Those are electric energy values, and I want to remove all of them from the data frame I use to train the model.

In [6]:
haskwh = RECS_df_numeric.columns.str.contains('^kwh', case=False);

In [7]:
RECS_df_numeric_noKWH = RECS_df_numeric.loc[:,~haskwh] 

Now let's train a simple model.  This will look unfamiliar to you -- but you're going to be very skilled at these sorts of operations soon.

In [8]:
X = RECS_df_numeric_noKWH
test_ind = int(X.shape[0]*0.8)
X_train = X.loc[:test_ind,:]
X_test = X.loc[test_ind:,:]

y = RECS_df_numeric.loc[:,'KWH']
y_train = y.loc[:test_ind]
y_test = y.loc[test_ind:]

Before training, I need to deal w missing values.  sklearn has a cool imputer for this.  There's more here: https://scikit-learn.org/stable/modules/impute.html

In [9]:
# Create our imputer to replace missing values with the mean e.g.
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp = imp.fit(X)

X_train_imp = imp.transform(X_train)

X_test_imp = imp.transform(X_test)

Note that `SimpleImputer` converts a pandas dataframe (X_train) to a numpy array.  Also note it's just finding the mean value in each col of the data frame you fit with, then replacing all NaNs in by their column's mean (from the fit data).

In [10]:
energy_tree = DecisionTreeRegressor()
energy_tree.fit(X_train_imp, y_train)

print("Number of features: {}".format(energy_tree.tree_.n_features))
print("Number of nodes (internal and terminal): {}".format(energy_tree.tree_.node_count), "\n")

train_score = energy_tree.score(X_train_imp, y_train)
print('Train Score: ', train_score)

val_score = energy_tree.score(X_test_imp, y_test)
print('Validation Score: ', val_score)

Number of features: 728
Number of nodes (internal and terminal): 9071 

Train Score:  1.0
Validation Score:  0.9997498304252398


That's a damn good model, so much so that one should be suspicious that the training data might have other measures of energy consumption that you would exclude from training data in a real-world application.  If you work with this you'll need to justify each training feature and explain why one might be able to measure it for homes that you *can't* measure energy consumption for.