## Energy Consumption

Benny Cohen
10/6/2019

In this notebook we will look at a data set describing how much energy different types of buildings consume in New York City.

First, let's bring in the needed libraries and import the data.

In [24]:
import pandas as pd

In [2]:
nyc_gas =  pd.read_csv("https://data.cityofnewyork.us/api/views/uedp-fegm/rows.csv?accessType=DOWNLOAD")

Now let's see what the data looks like

In [3]:
rows, cols = nyc_gas.shape
print(f"there are {rows} rows and {cols} columns")

there are 1015 rows and 5 columns


In [4]:
nyc_gas.head(5)

Unnamed: 0,Zip Code,Building type (service class,Consumption (therms),Consumption (GJ),Utility/Data Source
0,10300,Commercial,470.0,50.0,National Grid
1,10335,Commercial,647.0,68.0,National Grid
2,10360,Large residential,33762.0,3562.0,National Grid
3,11200,Commercial,32125.0,3389.0,National Grid
4,11200,Institutional,3605.0,380.0,National Grid


Some observations about the data:
The index is an auto-incrementing integer.
The columns are:
1. Zip code where the building is located
2. Building type: whether the building is residential, commercial, institutional, etc.
3. Energy consumption in therms (a unit of heat energy)
4. Energy consumption in GJ (gigajoules, another unit of energy; maybe we can find how to convert GJ to therms from this data)
5. The source of the data
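
The relationship between the two consumption columns can be estimated from the five rows shown above. The sketch below divides GJ by therms for each row and takes the median ratio. The sample values are copied from the head() output, and the variable names are only illustrative. The estimate should land close to 0.1055 GJ per therm, the usual size of a therm, with small deviations because the GJ column appears to be rounded.

```python
from statistics import median

# (therms, GJ) pairs copied from the five rows shown by head()
sample = [
    (470.0, 50.0),
    (647.0, 68.0),
    (33762.0, 3562.0),
    (32125.0, 3389.0),
    (3605.0, 380.0),
]

# GJ per therm for each row
ratios = [gj / therms for therms, gj in sample]

# The median is less sensitive to rounding in the small rows than the mean
gj_per_therm = median(ratios)

print(f"estimated GJ per therm: {gj_per_therm:.4f}")
print(f"estimated therms per GJ: {1 / gj_per_therm:.2f}")
```

On the full frame the same ratio could be computed column-wise, for example by dividing the GJ column by the therms column and taking the median.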

In [5]:
nyc_gas.columns

Index(['Zip Code', 'Building type (service class', ' Consumption (therms) ',
       ' Consumption (GJ) ', 'Utility/Data Source'],
      dtype='object')

There are some problems with the column names. Zip Code, for example, has a space in it, which makes it impossible to use dot notation. The parenthesis after Building type is never closed and looks strange. There are also spaces before and after Consumption (therms) and Consumption (GJ), which make those columns more annoying to use. Let's rename them using camelCase.

In [6]:
nyc_gas = nyc_gas.rename(columns={
    'Zip Code': 'zip',
    'Building type (service class': 'buildingType',
    ' Consumption (therms) ': 'consumptionTherms',
    ' Consumption (GJ) ': 'consumptionGJ',
    'Utility/Data Source': 'source',
})

In [7]:
nyc_gas.head(1)

Unnamed: 0,zip,buildingType,consumptionTherms,consumptionGJ,source
0,10300,Commercial,470.0,50.0,National Grid
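
The rename above spells out every original label by hand. As an alternative, the sketch below derives camelCase names mechanically. The helper to_camel is hypothetical (it is not part of pandas), and the names it produces differ from the hand-picked ones, so treat it as an illustration of the idea rather than a drop-in replacement.

```python
import re

def to_camel(name):
    """Hypothetical helper: turn a raw column label into camelCase."""
    # Keep only runs of letters and digits, dropping spaces and punctuation
    words = re.findall(r"[A-Za-z0-9]+", name)
    if not words:
        return name
    first, rest = words[0], words[1:]
    # Lowercase the first word; uppercase only the first letter of the others
    return first.lower() + "".join(w[0].upper() + w[1:] for w in rest)

# Original column labels as printed by nyc_gas.columns
raw_columns = [
    'Zip Code',
    'Building type (service class',
    ' Consumption (therms) ',
    ' Consumption (GJ) ',
    'Utility/Data Source',
]

cleaned = [to_camel(c) for c in raw_columns]
print(cleaned)
```

The generated names are longer than the hand-picked ones (zipCode rather than zip, for example), so the explicit mapping remains the better fit for this notebook. Because DataFrame.rename accepts a function for its columns argument, the helper could be applied to the raw frame with rename(columns=to_camel).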


Let's see how many types of buildings there are

In [8]:
nyc_gas.buildingType.unique()

array(['Commercial', 'Large residential', 'Institutional',
       'Small residential', 'Industrial', 'Large Residential',
       'Residential'], dtype=object)

In [9]:
nyc_gas.buildingType.value_counts()

Commercial           354
Residential          198
Small residential    110
Institutional         99
Large residential     94
Industrial            82
Large Residential     78
Name: buildingType, dtype: int64

We see that there are 7 types of buildings. 'Large residential' and 'Large Residential' seem to be the same category, just labeled inconsistently. It is unclear whether all the residential types should be combined, since there is also a general Residential category. For now, let's just combine 'Large residential' and 'Large Residential'.

In [10]:
# Map the lowercase variant 'Large residential' onto 'Large Residential'
nyc_gas['buildingType'] = nyc_gas['buildingType'].replace('Large residential', 'Large Residential')
nyc_gas.buildingType.value_counts()

Commercial           354
Residential          198
Large Residential    172
Small residential    110
Institutional         99
Industrial            82
Name: buildingType, dtype: int64
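
The fix above targets one known misspelling. A more general approach, sketched below, normalizes the capitalization of every label so that any similar inconsistency is caught at once. The label list is copied from the unique() output. Note that title-casing also turns 'Small residential' into 'Small Residential', which differs from the labels used in the rest of this notebook.

```python
import pandas as pd

# The seven labels reported by unique() before any cleanup
labels = pd.Series([
    'Commercial', 'Large residential', 'Institutional',
    'Small residential', 'Industrial', 'Large Residential',
    'Residential',
])

# Trim stray whitespace and title-case every word so spellings line up
normalized = labels.str.strip().str.title()

print(sorted(normalized.unique()))
```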

Now let's see the median gas consumption for each building type.

In [11]:
overallMedian = nyc_gas.consumptionGJ.median()

In [12]:
medians = nyc_gas.groupby('buildingType').consumptionGJ.median()

In [13]:
medians[medians > overallMedian]

buildingType
Commercial           189413.0
Large Residential    160960.0
Small residential    599600.0
Name: consumptionGJ, dtype: float64

In [14]:
medians[medians < overallMedian]

buildingType
Industrial       16867.5
Institutional    66027.0
Residential      53288.5
Name: consumptionGJ, dtype: float64

We see that Commercial, Small residential and Large Residential have medians higher than the overall median, and the others are lower.

In [15]:
medians.sort_values()

buildingType
Industrial            16867.5
Residential           53288.5
Institutional         66027.0
Large Residential    160960.0
Commercial           189413.0
Small residential    599600.0
Name: consumptionGJ, dtype: float64

Industrial has the lowest median energy consumption and Small residential has the highest. I find this surprising, since I would expect Industrial to be the highest. I also find it strange that Small residential is the highest even though Residential is second lowest. Something might be wrong with how I am interpreting these values.
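
One way to check the interpretation is to look at several statistics per building type at once, including how many rows feed each median. The sketch below uses the five head() rows as stand-in data, with the renamed columns and the merged Large Residential label. On the full frame the same groupby call would be applied to nyc_gas.

```python
import pandas as pd

# Stand-in data: the five rows shown by head(), using the renamed columns
sample = pd.DataFrame({
    'buildingType': ['Commercial', 'Commercial', 'Large Residential',
                     'Commercial', 'Institutional'],
    'consumptionGJ': [50.0, 68.0, 3562.0, 3389.0, 380.0],
})

# Count, median, min and max per building type, sorted by median
summary = (
    sample.groupby('buildingType')['consumptionGJ']
    .agg(['count', 'median', 'min', 'max'])
    .sort_values('median')
)
print(summary)
```

A wide gap between min and max within a group, or a very small count, would suggest that the median alone does not describe that building type well.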

Let's look at the data reporters

In [16]:
print(f"There are {nyc_gas.source.nunique()} unique data reporters")

There are 2 unique data reporters


In [17]:
nyc_gas.source.value_counts()

ConEd            518
National Grid    497
Name: source, dtype: int64

About half of the rows come from ConEd and half from National Grid. Let's see the consumption reported by each.

In [18]:
nyc_gas.groupby('source').consumptionGJ.mean()

source
ConEd            224575.750491
National Grid    357475.560484
Name: consumptionGJ, dtype: float64

In [19]:
nyc_gas.groupby('source').consumptionGJ.std()

source
ConEd            298958.048808
National Grid    562355.273624
Name: consumptionGJ, dtype: float64

Overall, the rows reported by National Grid show a higher mean consumption than the ConEd rows, but the standard deviations are larger than the means, so the difference cannot be judged significant by eye.
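
A rough check of the gap between the two sources can be made from the summary statistics alone, using Welch's t statistic for two groups with unequal variances. This is only a sketch: the group sizes are assumed to equal the row counts from value_counts(), and any rows with missing consumption are not subtracted.

```python
from math import sqrt

# Summary statistics copied from the groupby outputs above.
# Assumption: group sizes are the value_counts() row counts; rows with
# missing consumption, if any, are not subtracted.
mean_coned, std_coned, n_coned = 224575.750491, 298958.048808, 518
mean_grid, std_grid, n_grid = 357475.560484, 562355.273624, 497

# Standard error of the difference in means (unequal variances)
se_diff = sqrt(std_coned**2 / n_coned + std_grid**2 / n_grid)

# Welch's t statistic
t_stat = (mean_grid - mean_coned) / se_diff

print(f"difference in means: {mean_grid - mean_coned:.1f} GJ")
print(f"Welch t statistic: {t_stat:.2f}")
```

Under these assumptions the statistic comes out well above 2, which hints that the gap in means may be more than noise despite the large spread. Since consumption is heavily skewed, comparing medians or log-transformed values would be a sensible follow-up before drawing conclusions.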

#### Suggestions about the Data

First, note how the zip codes are stored:

In [20]:
nyc_gas.sort_values('zip')

Unnamed: 0,zip,buildingType,consumptionTherms,consumptionGJ,source
414,"10001(40.750259021437, -73.99688630376)",Residential,58338.0,6155.0,ConEd
296,"10001(40.750259021437, -73.99688630376)",Commercial,4628579.0,488341.0,ConEd
520,"10001(40.750259021437, -73.99688630376)",Large Residential,3616905.0,381604.0,ConEd
391,"10001(40.750259021437, -73.99688630376)",Residential,1476126.0,155740.0,ConEd
241,"10001(40.750259021437, -73.99688630376)",Commercial,7191956.0,758792.0,ConEd
545,"10002(40.716121467931, -73.985831470246)",Residential,550055.0,58034.0,ConEd
99,"10002(40.716121467931, -73.985831470246)",Residential,4509087.0,475734.0,ConEd
86,"10002(40.716121467931, -73.985831470246)",Commercial,4230121.0,446301.0,ConEd
317,"10002(40.716121467931, -73.985831470246)",Commercial,12107072.0,1277364.0,ConEd
701,"10002(40.716121467931, -73.985831470246)",Large Residential,115301.0,12165.0,ConEd


1. Note how some zip codes have a latitude and longitude appended. This is fine since it gives more detailed data points, but the coordinates should probably be stored in separate columns.
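
Splitting the combined value into separate columns could look like the sketch below. It assumes the zip column holds strings, that every zip code is five digits, and that the coordinates, when present, follow in parentheses as latitude then longitude. The column names zipCode, lat and lon are illustrative choices.

```python
import pandas as pd

# Two formats seen in the zip column: with and without coordinates
zips = pd.Series([
    '10001(40.750259021437, -73.99688630376)',
    '10300',
])

# Named groups become column names; the coordinate part is optional
pattern = r'^(?P<zipCode>\d{5})(?:\((?P<lat>-?[\d.]+),\s*(?P<lon>-?[\d.]+)\))?$'
parts = zips.str.extract(pattern)

# Coordinates come back as strings (or missing when absent); make them numeric
parts['lat'] = pd.to_numeric(parts['lat'])
parts['lon'] = pd.to_numeric(parts['lon'])

print(parts)
```

Keeping zipCode as a string avoids losing leading zeros if zip codes from other regions are ever added.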

In [21]:
len(nyc_gas.zip.unique())

231

In [22]:
nyc_gas.count()

zip                  1015
buildingType         1015
consumptionTherms    1005
consumptionGJ        1005
source               1015
dtype: int64

2. We see that there are 1015 rows, but 10 of them contain NaNs for energy consumption. Those rows need to be fixed.
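
Locating and handling the incomplete rows could be done as in the sketch below. The frame is a toy example: the first two rows come from head(), while the third row, including its zip value, is made up purely to illustrate a missing measurement. Dropping the rows is only one option; whether that is appropriate depends on why the values are missing.

```python
import numpy as np
import pandas as pd

# Toy frame: the first two rows come from head(); the third row is made up
# purely to illustrate a missing measurement
toy = pd.DataFrame({
    'zip': ['10300', '10335', '99999'],
    'consumptionTherms': [470.0, 647.0, np.nan],
    'consumptionGJ': [50.0, 68.0, np.nan],
})

consumption_cols = ['consumptionTherms', 'consumptionGJ']

# Rows where either consumption column is missing
missing = toy[toy[consumption_cols].isna().any(axis=1)]
print(missing)

# One option: drop them before computing statistics
complete = toy.dropna(subset=consumption_cols)
print(len(toy), "rows before,", len(complete), "rows after")
```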

In [23]:
len(nyc_gas[['zip', 'buildingType', 'source']].drop_duplicates())

793

3. One would expect the combination of those 3 columns to be unique, but it isn't; otherwise drop_duplicates would not have removed any rows (793 remain out of 1015). This means there may be several values reported for the same zip code, building type and source, distinguished by some other variable that is not included.
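
To see which key combinations repeat, the rows sharing a key can be listed directly. The sketch below uses the five rows for zip code 10001 from the sorted output above, with the zip value shortened to the bare code for readability. On the full frame the same calls would be applied to nyc_gas.

```python
import pandas as pd

# The five rows for zip 10001 from the sorted output (zip shortened)
rows = pd.DataFrame({
    'zip': ['10001'] * 5,
    'buildingType': ['Residential', 'Commercial', 'Large Residential',
                     'Residential', 'Commercial'],
    'consumptionGJ': [6155.0, 488341.0, 381604.0, 155740.0, 758792.0],
    'source': ['ConEd'] * 5,
})

key = ['zip', 'buildingType', 'source']

# keep=False flags every member of a repeated key, not just the later ones
repeated = rows[rows.duplicated(subset=key, keep=False)]
print(repeated.sort_values(key))

# Number of rows sharing each key combination
group_sizes = rows.groupby(key).size()
print(group_sizes)
```

In this sample the repeated rows report very different consumption values for the same key, which supports the idea that a distinguishing variable is missing from the data set.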