## Importing Data

In [42]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [43]:
dfg = pd.read_excel('./data/annual_generation_state.xls')

In [44]:
# Resetting column headings
dfg.columns = dfg.iloc[0]
dfg.drop([0], inplace = True)
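
The header-promotion step above can be tried in isolation. The sketch below uses a small made-up frame (`raw` and its values are illustrative, not taken from the EIA file) to show the first data row becoming the column labels.

```python
import pandas as pd

# Toy stand-in for a sheet whose real header sits in the first data row
# (values are illustrative, not from the EIA workbook).
raw = pd.DataFrame([
    ["YEAR", "STATE", "GENERATION (Megawatthours)"],
    [1990, "AK", 5599506],
    [1990, "AL", 79652133],
])

# Promote row 0 to column labels, then drop it from the data.
promoted = raw.copy()
promoted.columns = promoted.iloc[0]
promoted = promoted.drop(index=0)

print(list(promoted.columns))
print(promoted.shape)
```

If the workbook layout is stable, passing `header=1` to `pd.read_excel` may give the same result in one step; that is an assumption about this particular file and has not been checked here.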

In [45]:
dfg.head()

|   | YEAR | STATE | TYPE OF PRODUCER | ENERGY SOURCE | GENERATION (Megawatthours) |
|---|---|---|---|---|---|
| 1 | 1990 | AK | Total Electric Power Industry | Total | 5599506 |
| 2 | 1990 | AK | Total Electric Power Industry | Coal | 510573 |
| 3 | 1990 | AK | Total Electric Power Industry | Hydroelectric Conventional | 974521 |
| 4 | 1990 | AK | Total Electric Power Industry | Natural Gas | 3466261 |
| 5 | 1990 | AK | Total Electric Power Industry | Petroleum | 497116 |


### Initial EDA and Data Cleaning

EIA data is quite complete:

In [46]:
dfg.isnull().sum()

0
YEAR                          0
STATE                         0
TYPE OF PRODUCER              0
ENERGY SOURCE                 0
GENERATION (Megawatthours)    0
dtype: int64

We have a tremendous number of rows, and we will certainly want to pare that down to what we truly need. <br>We will do this in the "Exploring Totals" section below:

In [47]:
dfg.shape

(51633, 5)

All-caps column names will be difficult to work with, so here we rename:

In [48]:
column_rename = {"YEAR": "Year", 
                 "STATE": "State", 
                 "TYPE OF PRODUCER": "Producer Type",
                 "ENERGY SOURCE": "Source", 
                 "GENERATION (Megawatthours)": "Gen MWh"}

dfg.rename(columns=column_rename, inplace=True)

However, not all our data types are as expected. "Year" and "Gen MWh" are expected to be numeric, but both are stored as object. We will convert both of these columns to int.

In [49]:
dfg.dtypes

0
Year             object
State            object
Producer Type    object
Source           object
Gen MWh          object
dtype: object

In [50]:
dfg['Gen MWh'] = dfg['Gen MWh'].astype(int)
dfg['Year'] = dfg['Year'].astype(int)
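
A direct `astype(int)` raises if any entry cannot be parsed. As a hedged sketch on made-up values (the `gen` Series below is hypothetical), `pd.to_numeric` with `errors="coerce"` can be used first to locate such entries before casting:

```python
import pandas as pd

# Toy object-typed column with one unparseable entry (illustrative values only).
gen = pd.Series(["5599506", "510573", "n/a"], dtype=object)

# errors="coerce" turns unparseable entries into NaN instead of raising.
as_numbers = pd.to_numeric(gen, errors="coerce")
bad_rows = gen[as_numbers.isna()]
print(bad_rows.tolist())

# Once the bad entries have been handled, the integer cast is safe.
clean = as_numbers.dropna().astype(int)
print(clean.tolist())
```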

We have 29 years of data, 1990 to 2018, just like our rates data:

In [51]:
dfg['Year'].unique()

array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018])

There are 54 State categories, which is unexpected and worth investigation:

In [52]:
dfg['State'].nunique()

54

Beyond the 50 states plus DC, we have a blank entry ('  ', two spaces), and two total US categories, "US-TOTAL" and "US-Total":

In [53]:
dfg['State'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'US-TOTAL', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', '  ',
       'US-Total'], dtype=object)

All three blank State items represent 0 MWh of generation, which is not meaningful data. We lose nothing by dropping these rows.

In [54]:
dfg.loc[dfg['State'] == "  "]

|   | Year | State | Producer Type | Source | Gen MWh |
|---|---|---|---|---|---|
| 20577 | 2003 |  | Total Electric Power Industry | Coal | 0 |
| 20578 | 2003 |  | Total Electric Power Industry | Natural Gas | 0 |
| 20579 | 2003 |  | Total Electric Power Industry | Petroleum | 0 |


In [55]:
dfg = dfg[dfg['State'] != "  "]

We should also drop the US-TOTAL / US-Total data. We are not modeling the entire country, so the state data are all we need.

In [56]:
dfg = dfg[dfg['State'] != "US-TOTAL"]
dfg = dfg[dfg['State'] != "US-Total"]
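
The same kind of filtering can be written as a single membership test. This is a minimal sketch on a made-up frame (`toy` and `non_states` are illustrative names), not a replacement for the cells above:

```python
import pandas as pd

# Toy frame with the same kinds of non-state labels seen above.
toy = pd.DataFrame({
    "State": ["AK", "  ", "US-TOTAL", "US-Total", "WY"],
    "Gen MWh": [10, 0, 500, 500, 20],
})

# Labels to exclude: the blank entry and both spellings of the national total.
non_states = ["  ", "US-TOTAL", "US-Total"]
states_only = toy[~toy["State"].isin(non_states)]

print(states_only["State"].tolist())
```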

### Exploring Totals

We saw above that we have more than 50,000 rows. What we are interested in is 51 states (the 50 states plus DC), 29 years, and the (approximately) 10 different types of electricity generation plus totals. So we expect to **need** around 15,000 rows. In this section we will confirm that the extra rows are not needed, and remove them.
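
The row estimate is simple arithmetic. The figure of 10 source rows per State/Year is the rough guess stated above, not a measured value:

```python
n_states = 51    # 50 states plus DC
n_years = 29     # 1990 through 2018
n_sources = 10   # rough guess at rows per State/Year, as stated in the text

expected_rows = n_states * n_years * n_sources
print(expected_rows)  # 14790
```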

The data contains a range of Producer Types for each combination of Year, State, and Source of generation:

In [57]:
dfg['Producer Type'].unique()

array(['Total Electric Power Industry',
       'Electric Generators, Electric Utilities',
       'Combined Heat and Power, Industrial Power',
       'Combined Heat and Power, Commercial Power',
       'Electric Generators, Independent Power Producers',
       'Combined Heat and Power, Electric Power'], dtype=object)

This can be most easily understood by looking at the 20 rows of data for the state of Alaska in the year 1990:

In [58]:
dfg.head(20)

|   | Year | State | Producer Type | Source | Gen MWh |
|---|---|---|---|---|---|
| 1 | 1990 | AK | Total Electric Power Industry | Total | 5599506 |
| 2 | 1990 | AK | Total Electric Power Industry | Coal | 510573 |
| 3 | 1990 | AK | Total Electric Power Industry | Hydroelectric Conventional | 974521 |
| 4 | 1990 | AK | Total Electric Power Industry | Natural Gas | 3466261 |
| 5 | 1990 | AK | Total Electric Power Industry | Petroleum | 497116 |
| 6 | 1990 | AK | Total Electric Power Industry | Wind | 0 |
| 7 | 1990 | AK | Total Electric Power Industry | Wood and Wood Derived Fuels | 151035 |
| 8 | 1990 | AK | Electric Generators, Electric Utilities | Total | 4493024 |
| 9 | 1990 | AK | Electric Generators, Electric Utilities | Coal | 311960 |
| 10 | 1990 | AK | Electric Generators, Electric Utilities | Hydroelectric Conventional | 974521 |


We are not interested in the various producer types, only the "Total Electric Power Industry". It certainly _appears_ that if we sum up the Gen MWh for each Source across all Producer Types except "Total Electric Power Industry", the result will equal the "Total Electric Power Industry" value.

But that is a rather risky assumption to make for roughly 50,000 rows of data. So here we confirm this is true:

In [59]:
# This function returns a pivot table and its multiindex column labels
# as a list, which we will use to create a calculated Total column
def year_state_pivot(data, index, columns):
    # Build the pivot once and reuse it for both return values
    pivot = pd.pivot_table(data=data, index=index, columns=columns, fill_value=0)
    return pivot, list(pivot.columns)

In [60]:
dfg_pivot, dfg_pivot_cols = year_state_pivot(dfg, ['Year', 'State', 'Source'], 'Producer Type')

In [61]:
dfg_pivot.head()

Gen MWh by Producer Type:

| Year | State | Source | Combined Heat and Power, Commercial Power | Combined Heat and Power, Electric Power | Combined Heat and Power, Industrial Power | Electric Generators, Electric Utilities | Electric Generators, Independent Power Producers | Total Electric Power Industry |
|---|---|---|---|---|---|---|---|---|
| 1990 | AK | Coal | 198613 | 0 | 0 | 311960 | 0 | 510573 |
| 1990 | AK | Hydroelectric Conventional | 0 | 0 | 0 | 974521 | 0 | 974521 |
| 1990 | AK | Natural Gas | 0 | 0 | 596623 | 2869638 | 0 | 3466261 |
| 1990 | AK | Petroleum | 66920 | 0 | 93291 | 336905 | 0 | 497116 |
| 1990 | AK | Total | 265533 | 0 | 840949 | 4493024 | 0 | 5599506 |


Here are our multiindex columns which we'll be working with:

In [62]:
dfg_pivot_cols

[('Gen MWh', 'Combined Heat and Power, Commercial Power'),
 ('Gen MWh', 'Combined Heat and Power, Electric Power'),
 ('Gen MWh', 'Combined Heat and Power, Industrial Power'),
 ('Gen MWh', 'Electric Generators, Electric Utilities'),
 ('Gen MWh', 'Electric Generators, Independent Power Producers'),
 ('Gen MWh', 'Total Electric Power Industry')]

Now we will create a (Gen MWh, Calculated Total) column, which we can compare with the (Gen MWh, Total Electric Power Industry) numbers provided by EIA. As we are working with a multiindex, we will assign each column to a variable for readability, before summing them up to create the new column:

In [63]:
multi_0 = dfg_pivot[dfg_pivot_cols[0]]
multi_1 = dfg_pivot[dfg_pivot_cols[1]] 
multi_2 = dfg_pivot[dfg_pivot_cols[2]]
multi_3 = dfg_pivot[dfg_pivot_cols[3]]
multi_4 = dfg_pivot[dfg_pivot_cols[4]]

dfg_pivot[('Gen MWh', 'Calculated Total')] = multi_0 + multi_1 + multi_2 + multi_3 + multi_4

In [64]:
dfg_pivot.head()

Gen MWh by Producer Type:

| Year | State | Source | Combined Heat and Power, Commercial Power | Combined Heat and Power, Electric Power | Combined Heat and Power, Industrial Power | Electric Generators, Electric Utilities | Electric Generators, Independent Power Producers | Total Electric Power Industry | Calculated Total |
|---|---|---|---|---|---|---|---|---|---|
| 1990 | AK | Coal | 198613 | 0 | 0 | 311960 | 0 | 510573 | 510573 |
| 1990 | AK | Hydroelectric Conventional | 0 | 0 | 0 | 974521 | 0 | 974521 | 974521 |
| 1990 | AK | Natural Gas | 0 | 0 | 596623 | 2869638 | 0 | 3466261 | 3466261 |
| 1990 | AK | Petroleum | 66920 | 0 | 93291 | 336905 | 0 | 497116 | 497116 |
| 1990 | AK | Total | 265533 | 0 | 840949 | 4493024 | 0 | 5599506 | 5599506 |


In the cell below, we compare our (Gen MWh, Calculated Total) column with the (Gen MWh, Total Electric Power Industry) column. By subtracting, taking the absolute value, and counting True/False values, we can see if the generation totals are as we expect.

A threshold of 10 (out of fields that range from the thousands into the millions) accounts for rounding errors.

In [68]:
(abs(dfg_pivot[('Gen MWh', 'Calculated Total')] - dfg_pivot[('Gen MWh', 'Total Electric Power Industry')]) <= 10).value_counts()

True    13809
dtype: int64
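
To see what a failing row would look like, here is a minimal sketch of the same sum-and-compare check on made-up data. The producer labels are shortened and hypothetical, and `values="Gen MWh"` is passed so the toy pivot has single-level columns, unlike the multiindex above:

```python
import pandas as pd

# Toy long-format data: two producer types plus the reported total per source.
toy = pd.DataFrame({
    "Source": ["Coal", "Coal", "Coal", "Wind", "Wind", "Wind"],
    "Producer Type": ["Utilities", "Industrial", "Total"] * 2,
    "Gen MWh": [300, 200, 500, 40, 10, 75],  # the Wind total is deliberately off by 25
})

wide = toy.pivot_table(index="Source", columns="Producer Type",
                       values="Gen MWh", fill_value=0)

# Sum every producer column except the reported total, then compare.
parts = wide.drop(columns="Total").sum(axis=1)
within_tolerance = (parts - wide["Total"]).abs() <= 10

print(within_tolerance.value_counts())
print(wide[~within_tolerance])
```

A `False` count in `value_counts()` would flag rows needing a closer look; indexing with `~within_tolerance` lists them.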

We have confirmed that the rows where Producer Type is "Total Electric Power Industry" contain all the data in the other "Producer Type" rows. Here, we keep only those total rows and drop the rest:

In [69]:
dfg = dfg[dfg['Producer Type'] == 'Total Electric Power Industry']

In [70]:
dfg.shape

(13809, 5)

We are also interested in the various generation sources. Let's use the function created above to confirm that each "Total" row is equal to the sum of the various generation sources such as Coal, Hydro, Natural Gas, etc:

In [27]:
dfg.head()

|   | Year | State | Producer Type | Source | Gen MWh |
|---|---|---|---|---|---|
| 1 | 1990 | AK | Total Electric Power Industry | Total | 5599506 |
| 2 | 1990 | AK | Total Electric Power Industry | Coal | 510573 |
| 3 | 1990 | AK | Total Electric Power Industry | Hydroelectric Conventional | 974521 |
| 4 | 1990 | AK | Total Electric Power Industry | Natural Gas | 3466261 |
| 5 | 1990 | AK | Total Electric Power Industry | Petroleum | 497116 |


In [28]:
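# year_state_pivot returns both the pivot table and its column list, so unpack both
dfg_pivot, dfg_pivot_cols = year_state_pivot(dfg, ['Year', 'State'], 'Source')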

In [29]:
dfg_pivot.head()

Gen MWh by Source:

| Year | State | Coal | Geothermal | Hydroelectric Conventional | Natural Gas | Nuclear | Other | Other Biomass | Other Gases | Petroleum | Pumped Storage | Solar Thermal and Photovoltaic | Total | Wind | Wood and Wood Derived Fuels |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1990 | AK | 510573 | 0 | 974521 | 3466261 | 0 | 0 | 0 | 0 | 497116 | 0 | 0 | 5599506 | 0 | 151035 |
| 1990 | AL | 53658115 | 0 | 10366507 | 1020714 | 12051882 | 0 | 47503 | 269476 | 138089 | 0 | 0 | 79652133 | 0 | 2099847 |
| 1990 | AR | 19207935 | 0 | 3654653 | 3578573 | 11282053 | 0 | 15389 | 0 | 79979 | 42972 | 0 | 39099598 | 0 | 1238044 |
| 1990 | AZ | 31915610 | 0 | 7417576 | 2333900 | 20597689 | 0 | 0 | 0 | 151867 | 249767 | 0 | 62774297 | 0 | 107888 |
| 1990 | CA | 2637677 | 14521254 | 23792567 | 74168308 | 32692807 | 0 | 2117915 | 2146742 | 5473852 | 986252 | 366668 | 165784909 | 2758881 | 4121986 |


In [30]:
dfg_pivot.columns

MultiIndex([('Gen MWh',                           'Coal'),
            ('Gen MWh',                     'Geothermal'),
            ('Gen MWh',     'Hydroelectric Conventional'),
            ('Gen MWh',                    'Natural Gas'),
            ('Gen MWh',                        'Nuclear'),
            ('Gen MWh',                          'Other'),
            ('Gen MWh',                  'Other Biomass'),
            ('Gen MWh',                    'Other Gases'),
            ('Gen MWh',                      'Petroleum'),
            ('Gen MWh',                 'Pumped Storage'),
            ('Gen MWh', 'Solar Thermal and Photovoltaic'),
            ('Gen MWh',                          'Total'),
            ('Gen MWh',                           'Wind'),
            ('Gen MWh',    'Wood and Wood Derived Fuels')],
           names=[0, 'Source'])

In [31]:
dfg_pivot_cols = list(dfg_pivot.columns)

In [None]:
# Sum every source column except EIA's reported 'Total' to build a calculated total
source_cols = [col for col in dfg_pivot_cols if col[1] != 'Total']
dfg_pivot[('Gen MWh', 'Calculated Total')] = dfg_pivot[source_cols].sum(axis=1)

In [None]:
dfg_pivot.head()

In [None]:
# Count the Year/State rows where the calculated and reported totals agree to within 10 MWh
(abs(dfg_pivot[('Gen MWh', 'Calculated Total')] - dfg_pivot[('Gen MWh', 'Total')]) <= 10).value_counts()

For each Year/State combination, EIA provides a "Total" row representing all generation. There are also rows for each generation source (the list in the cell below). In the section below, we compare the sum of MWh generated by each individual source with the Year/State Total, to confirm that the sum of all sources is equal to the "Total" row.

In [None]:
# Here we display all the different sources of electricity generation.
dfg['Source'].unique()

In [None]:
def gen_totals(Year, State):

    # Here we extract a "sources" dataframe, which is specific to a given Year/State combination.
    # It contains the MWh generated by each of the sources listed in the cell above,
    # plus the Total row.
    sources = dfg[(dfg['Year'] == Year) &
                  (dfg['State'] == State) &
                  (dfg['Producer Type'] == "Total Electric Power Industry")]

    # Here we extract the individual row that represents only the TOTAL generation for that Year/State
    gen_total = int(sources.loc[sources['Source'] == "Total", 'Gen MWh'].iloc[0])

    # Here we sum all generation EXCEPT the Total:
    gen_sum = sources.loc[sources['Source'] != "Total", 'Gen MWh'].sum()

    # We'll define a threshold of 10 MWh, below which we accept that the reported total
    # and the calculated total are functionally equal.
    # Considering even a very small state generates millions of MWh/yr,
    # anything this small represents a rounding error:
    return abs(gen_total - gen_sum) < 10

Here is an example of the output for one Year/State combination:

In [None]:
gen_totals(2016, "RI")

The for loop below will confirm that we have no mistakes in our data. Any Year/State combinations for which the reported total does not equal the calculated total will be printed.<br><br> **NOTE:** this will take around 30 seconds to run on a newer computer:

In [None]:
for year in dfg['Year'].unique():
    for state in dfg['State'].unique():
        # Print only the Year/State combinations that fail the check
        if not gen_totals(year, state):
            print(year, state)
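
The nested loop filters the full dataframe once per Year/State pair, which is likely why it is slow. One possible faster formulation, sketched here on made-up data (the `toy` frame and its values are hypothetical), computes both totals with `groupby` and compares them in one step:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year":    [2016, 2016, 2016, 2017, 2017, 2017],
    "State":   ["RI", "RI", "RI", "RI", "RI", "RI"],
    "Source":  ["Total", "Natural Gas", "Wind", "Total", "Natural Gas", "Wind"],
    "Gen MWh": [100, 90, 10, 120, 90, 10],  # 2017 is deliberately inconsistent
})

is_total = toy["Source"] == "Total"
reported = toy[is_total].groupby(["Year", "State"])["Gen MWh"].sum()
calculated = toy[~is_total].groupby(["Year", "State"])["Gen MWh"].sum()

# Index alignment pairs each Year/State; fill_value=0 covers pairs with no source rows.
diff = reported.sub(calculated, fill_value=0).abs()

# Same rule as gen_totals: a difference below 10 MWh counts as a match.
mismatches = diff[diff >= 10]
print(mismatches)
```

An empty `mismatches` Series corresponds to the loop printing nothing.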

This confirms that, for every Year/State combination, the sum of all generation sources matches the reported "Total" to within 10 MWh.

In [None]:
dfg.shape

In [None]:
dfg['Producer Type'].unique()

We no longer need "Producer Type" at all as it only has one value. Here we drop the column:

In [None]:
dfg.drop(["Producer Type"], axis = 1, inplace = True)

We have 51 States and 29 years, or 1479 combinations of State and Year. There are an average of around 9 rows per State and Year combination, which is in the expected range of one Total row and an average of 8 generation types for each.

In [None]:
dfg.shape
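
The averages quoted above can be checked with a few lines of arithmetic, using the row count of 13,809 reported by `dfg.shape` earlier:

```python
n_states = 51    # 50 states plus DC
n_years = 29     # 1990 through 2018
n_rows = 13809   # row count reported by dfg.shape earlier

combinations = n_states * n_years
rows_per_combination = n_rows / combinations

print(combinations)                    # 1479
print(round(rows_per_combination, 2))  # 9.34
```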

In [None]:
dfg.head()

In [None]:
dfg['Source'].unique()

Let's shorten some of the Source names:

In [None]:
# replace() returns a new dataframe, so assign the result to keep the shortened names
dfg = dfg.replace({"Source": {"Hydroelectric Conventional": "Hydroelectric", 
                              "Solar Thermal and Photovoltaic": "Solar Thermal/PV"}})

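`DataFrame.replace` returns a new frame by default rather than modifying the original, so the result has to be assigned (or `inplace=True` passed) for the shortened names to persist. A small illustration on made-up data:

```python
import pandas as pd

toy = pd.DataFrame({"Source": ["Hydroelectric Conventional", "Coal"]})
mapping = {"Source": {"Hydroelectric Conventional": "Hydroelectric"}}

# Without assignment, the original frame is untouched.
toy.replace(mapping)
print(toy["Source"].tolist())

# Assigning the result keeps the shortened names.
renamed = toy.replace(mapping)
print(renamed["Source"].tolist())
```
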
Let's pivot our dataframe, so each Year / State combination is a row, and each generation source is a column with Gen MWh as the value:

In [None]:
dfg = pd.pivot_table(data=dfg,index=["Year", "State"], columns = "Source", values = "Gen MWh", fill_value=0)
dfg.reset_index(level=[0,1], inplace = True)
dfg.head()
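
As a small illustration of the long-to-wide reshape, the sketch below runs the same `pivot_table` and `reset_index` pattern on a made-up four-row frame. Source values missing for a given Year/State are filled with 0:

```python
import pandas as pd

toy = pd.DataFrame({
    "Year":    [1990, 1990, 1990, 1991],
    "State":   ["AK", "AK", "AL", "AK"],
    "Source":  ["Coal", "Wind", "Coal", "Coal"],
    "Gen MWh": [5, 1, 7, 6],
})

wide = pd.pivot_table(data=toy, index=["Year", "State"], columns="Source",
                      values="Gen MWh", fill_value=0)

# Move Year and State out of the index and back into ordinary columns.
wide = wide.reset_index()

print(wide.columns.tolist())
print(wide.shape)
```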

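"Wood and Wood Derived Fuels" and "Other Biomass" are both biomass sources, so we combine them into a single "Biomass" column:

In [None]:
dfg["Biomass"] = dfg["Wood and Wood Derived Fuels"] + dfg["Other Biomass"]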

We can also drop "Wood and Wood Derived Fuels" and "Other Biomass" now that we have the combined "Biomass" column:

In [None]:
dfg.drop(["Wood and Wood Derived Fuels", "Other Biomass"], axis = 1, inplace = True)

Changing our float display format will make the data easier to interpret visually:

In [None]:
pd.options.display.float_format = '{:,.0f}'.format
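
The format string can be tried on its own to see its effect. It adds thousands separators and rounds to whole numbers; as a pandas display option it should only change how floats are shown, not the stored values:

```python
fmt = '{:,.0f}'.format

print(fmt(5599506.0))  # 5,599,506
print(fmt(1234.56))    # 1,235
```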

In [None]:
dfg.head()

In [None]:
dfg.to_csv('./data/electricity-generation.csv')
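
By default `to_csv` also writes the row index as an unnamed first column. Whether that column is wanted depends on how the file is read back later, which is not shown here. A minimal sketch on a made-up one-row frame compares the two header lines:

```python
import io
import pandas as pd

toy = pd.DataFrame({"Year": [1990], "State": ["AK"], "Coal": [510573]})

# Default: the index is written as a leading, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
header_with_index = with_index.getvalue().splitlines()[0]
print(header_with_index)     # ,Year,State,Coal

# index=False omits it.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
header_without_index = without_index.getvalue().splitlines()[0]
print(header_without_index)  # Year,State,Coal
```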