In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [29]:
dfg = pd.read_excel('./data/annual_generation_state.xls')

In [30]:
# Resetting column headings
dfg.columns = dfg.iloc[0]
dfg.drop([0], inplace = True)

In [31]:
dfg.head()

Unnamed: 0,YEAR,STATE,TYPE OF PRODUCER,ENERGY SOURCE,GENERATION (Megawatthours)
1,1990,AK,Total Electric Power Industry,Total,5599506
2,1990,AK,Total Electric Power Industry,Coal,510573
3,1990,AK,Total Electric Power Industry,Hydroelectric Conventional,974521
4,1990,AK,Total Electric Power Industry,Natural Gas,3466261
5,1990,AK,Total Electric Power Industry,Petroleum,497116


The EIA data has no null values in any column:

In [32]:
dfg.isnull().sum()

0
YEAR                          0
STATE                         0
TYPE OF PRODUCER              0
ENERGY SOURCE                 0
GENERATION (Megawatthours)    0
dtype: int64
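
One caveat: `isnull()` counts only NaN/None values, so a cell holding nothing but spaces still counts as present. The sketch below uses a small made-up frame (the name `toy` and its values are illustrative, not taken from the EIA file) to show how whitespace-only strings could be counted separately.

```python
import pandas as pd

# Toy frame with hypothetical values; the middle STATE label is two spaces
toy = pd.DataFrame({"STATE": ["AK", "  ", "AL"],
                    "GENERATION": [10, 0, 20]})

# isnull() reports no missing values, because "  " is a valid string
null_counts = toy.isnull().sum()

# Count labels that are empty once surrounding whitespace is stripped
blank_count = toy["STATE"].str.strip().eq("").sum()

print(null_counts["STATE"], blank_count)
```

A zero from `isnull().sum()` is therefore a necessary check, but not proof that every label is meaningful.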

All-caps column names will be difficult to work with, so here we rename:

In [33]:
column_rename = {"YEAR": "Year", 
                 "STATE": "State", 
                 "TYPE OF PRODUCER": "Producer Type",
                 "ENERGY SOURCE": "Source", 
                 "GENERATION (Megawatthours)": "Gen MWh"}

dfg.rename(columns=column_rename, inplace=True)

However, not all of our data types are as expected. "Year" and "Gen MWh" should be numeric, but both are stored as object. We do not (yet) need Year as datetime, so we will convert both columns to int.

In [34]:
dfg.dtypes

0
Year             object
State            object
Producer Type    object
Source           object
Gen MWh          object
dtype: object

In [35]:
dfg['Gen MWh'] = dfg['Gen MWh'].astype(int)
dfg['Year'] = dfg['Year'].astype(int)
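
The object dtypes are probably a side effect of promoting the first data row to the header: each column was read with a string heading sitting on top of its numeric values. The sketch below reproduces that on a small made-up frame (`raw`, `toy`, and all values are hypothetical) and converts with `pd.to_numeric`, shown here as an alternative to `astype(int)`.

```python
import pandas as pd

# Toy raw sheet with hypothetical values: the real headers sit in row 0
raw = pd.DataFrame([["YEAR", "STATE", "GENERATION (Megawatthours)"],
                    [1990, "AK", 100],
                    [1990, "AL", 250]])

toy = raw.copy()
toy.columns = toy.iloc[0]      # promote the first row to column headings
toy = toy.drop(index=0)

# The numeric columns are object dtype: each one mixed a string with numbers
dtypes_before = toy.dtypes

# pd.to_numeric raises on unparseable values by default (errors="raise")
toy["YEAR"] = pd.to_numeric(toy["YEAR"])
toy["GENERATION (Megawatthours)"] = pd.to_numeric(toy["GENERATION (Megawatthours)"])

print(toy.dtypes)
```

Either conversion fails loudly if a non-numeric value is hiding in the column, which is the behavior we want at this stage.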

We have 29 years of data, 1990 to 2018, just like our rates data:

In [36]:
dfg['Year'].unique()

array([1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998, 1999, 2000,
       2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011,
       2012, 2013, 2014, 2015, 2016, 2017, 2018])
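
Reading the array by eye works for 29 values, but the same check can be made explicit. A minimal sketch, using a stand-in array in place of `dfg['Year'].unique()` so that it runs on its own:

```python
import numpy as np

# Stand-in for dfg['Year'].unique(); assumed to match the output above
years = np.arange(1990, 2019)

# Every year we would expect between the first and last observed year
expected = set(range(int(years.min()), int(years.max()) + 1))

# Years inside the range that never appear in the data
missing = sorted(expected - {int(y) for y in years})
n_years = len(years)

print(n_years, missing)
```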

There are 54 State categories, which is unexpected and worth investigation:

In [37]:
dfg['State'].nunique()

54

Beyond the 50 states plus DC, we have a blank label ('  ', two spaces) and two total US categories, "US-TOTAL" and "US-Total":

In [38]:
dfg['State'].unique()

array(['AK', 'AL', 'AR', 'AZ', 'CA', 'CO', 'CT', 'DC', 'DE', 'FL', 'GA',
       'HI', 'IA', 'ID', 'IL', 'IN', 'KS', 'KY', 'LA', 'MA', 'MD', 'ME',
       'MI', 'MN', 'MO', 'MS', 'MT', 'NC', 'ND', 'NE', 'NH', 'NJ', 'NM',
       'NV', 'NY', 'OH', 'OK', 'OR', 'PA', 'RI', 'SC', 'SD', 'TN', 'TX',
       'US-TOTAL', 'UT', 'VA', 'VT', 'WA', 'WI', 'WV', 'WY', '  ',
       'US-Total'], dtype=object)
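
With 54 labels the oddities are easy to spot by eye, but they can also be flagged programmatically. The sketch below, on a short made-up sample (`states` is hypothetical), strips whitespace and upper-cases each label, then reports normalized labels that came from more than one raw spelling.

```python
import pandas as pd

# Hypothetical sample of raw State labels
states = pd.Series(["AK", "US-TOTAL", "  ", "US-Total", "AL", "AK"])

# Strip whitespace and upper-case so variant spellings collapse together
normalized = states.str.strip().str.upper()

# Number of distinct raw spellings behind each normalized label
raw_per_label = states.groupby(normalized).nunique()
suspicious = raw_per_label[raw_per_label > 1]

# Labels that are empty after stripping
blank_count = (normalized == "").sum()

print(suspicious)
print(blank_count)
```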

All three rows with a blank State report 0 MWh of generation, which is not meaningful data. We lose nothing by dropping these rows.

In [39]:
dfg.loc[dfg['State'] == "  "]

Unnamed: 0,Year,State,Producer Type,Source,Gen MWh
20577,2003,,Total Electric Power Industry,Coal,0
20578,2003,,Total Electric Power Industry,Natural Gas,0
20579,2003,,Total Electric Power Industry,Petroleum,0


In [40]:
dfg = dfg[dfg['State'] != "  "]

We should also drop the US-TOTAL / US-Total data. We are not modeling the entire country, so the state data are all we need.

In [41]:
dfg = dfg[dfg['State'] != "US-TOTAL"]
dfg = dfg[dfg['State'] != "US-Total"]
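
The blank and US-total filters can also be expressed as one step with `isin`. This is a sketch on a small made-up frame (`toy` and its values are hypothetical), not a change to the cells above:

```python
import pandas as pd

# Toy frame with hypothetical values
toy = pd.DataFrame({"State": ["AK", "US-TOTAL", "  ", "US-Total", "AL"],
                    "Gen MWh": [10, 30, 0, 30, 20]})

# Labels that do not correspond to a state or DC
non_state_labels = ["  ", "US-TOTAL", "US-Total"]

# ~isin(...) keeps only rows whose State is not in the list
states_only = toy[~toy["State"].isin(non_state_labels)]

print(states_only)
```

Keeping the unwanted labels in one list makes it easier to see, and later extend, exactly what was excluded.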

It makes sense to boil our data down to the key rows: those with **Producer Type** "Total Electric Power Industry", which include one row per **Source** plus a **Source** "Total" row covering all reported generation sources. Doing this means dropping many rows, so let's first confirm that the sum of the individual sources equals the reported total.

In [42]:
def gen_totals(Year, State):
    
    # All "Total Electric Power Industry" rows for this year and State.
    # The first row is the reported "Total"; the rest are individual sources.
    sources = dfg[(dfg['Year'] == Year) & 
                  (dfg['State'] == State) & 
                  (dfg['Producer Type'] == "Total Electric Power Industry")]['Gen MWh']
    
    gen_total = int(sources.iloc[0])
    gen_sum = sources.iloc[1:].sum()
        
    # We allow a threshold of 10 MWh between the reported total and the sum.
    # Considering even a very small state generates millions of MWh/yr, 
    # anything this small represents a rounding error.
    return abs(gen_total - gen_sum) < 10

Here is an example of the output for one year/State:

In [43]:
gen_totals(2016, "RI")

True

The for loop below checks every year / State combination and prints any where the reported total and the sum of sources disagree. Any combination printed requires investigation (spoiler alert: there are none).<br><br> **NOTE:** this will take around 30 seconds to run on a newer computer:

In [20]:
for year in dfg['Year'].unique():
    for state in dfg['State'].unique():
        # Print only the combinations that fail the check
        if not gen_totals(year, state):
            print(year, state)
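
The nested loop filters the full frame once per year / State pair, which is why it takes a while. A grouped comparison should give the same answer in a single pass. The sketch below runs on made-up data (`toy` and its values are hypothetical, and the state "XX" is given a deliberately wrong total). Unlike `gen_totals`, it identifies the total row by its Source label "Total" rather than by position.

```python
import pandas as pd

# Toy data with hypothetical values; "XX" has a mismatched total on purpose
toy = pd.DataFrame({
    "Year": [2016] * 6,
    "State": ["RI", "RI", "RI", "XX", "XX", "XX"],
    "Producer Type": ["Total Electric Power Industry"] * 6,
    "Source": ["Total", "Natural Gas", "Wind", "Total", "Coal", "Wind"],
    "Gen MWh": [100, 90, 10, 500, 300, 100],
})

industry = toy[toy["Producer Type"] == "Total Electric Power Industry"]
is_total = industry["Source"] == "Total"

# Reported total per (Year, State)
reported = industry[is_total].groupby(["Year", "State"])["Gen MWh"].sum()
# Sum of the individual sources per (Year, State)
summed = industry[~is_total].groupby(["Year", "State"])["Gen MWh"].sum()

# Align on (Year, State); a group missing from one side is treated as 0
diff = reported.sub(summed, fill_value=0).abs()

# Same 10 MWh tolerance as gen_totals
mismatches = diff[diff >= 10]
print(mismatches)
```

On the toy data only (2016, "XX") is reported. An empty result on the real data would correspond to the loop printing nothing.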

Now we are ready to drop all "Producer Type" rows that do not represent "Total Electric Power Industry":

In [47]:
dfg = dfg[dfg["Producer Type"] == "Total Electric Power Industry"]

In [48]:
dfg.shape

(13809, 5)