Capstone 1 Wrangling

All companies that participate in wholesale energy markets are required to submit a quarterly report detailing each transaction to the Federal Energy Regulatory Commission (FERC). This information is then made publicly available for download on FERC’s website at https://eqrreportviewer.ferc.gov/.

I've downloaded all of the hourly data for Florida as a CSV file and saved it to my desktop. My first step is to load the file as a dataframe.

In [27]:
import pandas as pd
data = pd.read_csv(r'C:\Users\anhem44\Desktop\Capstone1\all_data_cap1.csv')
data.head()

Unnamed: 0,DATE,SELLER_COMPANY,SELLER_COMPANY_OLD,C_BUYER_NAME,C_BUYER_NAME_OLD,Region,Contract_Service_Agreement_id,TR_CONTRACT_ID,loc,TR_TIMEZONE,...,index_loc_seller,index_loc_oldseller,transcharge,price_minus_index,price_above_trans,price_above_trans_mwh,transaction_len,year,qtr,benchhub
0,1/1/2014,EXELON,"EXELON GENERATION COMPANY, LLC",THE ENERGY AUTHORITY,THE ENERGY AUTHORITY,Florida,12870,717562,FPL,EASTERNPREVAILING,...,30.0,30.0,8,-1.471054,0,0.0,Hourly,2014,1,FPL
1,1/1/2014,SOUTHERN COMPANY,"SOUTHERN COMPANY SERVICES, INC. (AS AGENT)",SEMINOLE ELECTRIC COOP,"SEMINOLE ELECTRIC COOPERATIVE, INC.",Florida,481,1888312,FPL,CENTRALSTANDARD,...,38.0,38.0,8,2.114532,0,0.0,Hourly,2014,1,FPL
2,1/1/2014,THE ENERGY AUTHORITY,"THE ENERGY AUTHORITY, INC.",J.P. MORGAN CHASE & COMPANY,JP MORGAN VENTURES ENERGY CORPORATION,Florida,30760,892022,FPL,EASTERNPREVAILING,...,31.75,31.75,8,0.893512,0,0.0,Hourly,2014,1,FPL
3,1/1/2014,THE ENERGY AUTHORITY,"THE ENERGY AUTHORITY, INC.",EXELON,"EXELON GENERATION COMPANY, LLC",Florida,30789,892011,FPL,EASTERNPREVAILING,...,31.5,31.5,8,1.272839,0,0.0,Hourly,2014,1,FPL
4,1/1/2014,THE ENERGY AUTHORITY,"THE ENERGY AUTHORITY, INC.",EXELON,"EXELON GENERATION COMPANY, LLC",Florida,30789,892011,FPL,EASTERNPREVAILING,...,31.451613,31.451613,8,1.407159,0,0.0,Hourly,2014,1,FPL


With the data loaded, I double-check the data types to ensure that there were no errors in the upload.

In [28]:
data.dtypes

DATE                              object
SELLER_COMPANY                    object
SELLER_COMPANY_OLD                object
C_BUYER_NAME                      object
C_BUYER_NAME_OLD                  object
Region                            object
Contract_Service_Agreement_id      int64
TR_CONTRACT_ID                     int64
loc                               object
TR_TIMEZONE                       object
TR_CLASS_NAME                     object
PRICEINDOLPERMWH                 float64
TR_DELV_SPEC_LOC                  object
TRADE_DATE                        object
HOUROFDAY                          int64
QUANTITYINMWH                    float64
HOURLYTRANSCHARGE                float64
HOUR_FREQ                          int64
weighted_pricemw                 float64
index_loc                        float64
index_bench                      float64
index_loc_seller                 float64
index_loc_oldseller              float64
transcharge                        int64
price_minus_inde
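
A check like this can also be done programmatically rather than by eye. The sketch below is only an illustration: the frame `sample` and its values are made up, the list `expected_datetime` is my own assumption about which columns should hold timestamps, and only the column names come from my data.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

# Hypothetical stand-in for the real frame; the values are invented for illustration.
sample = pd.DataFrame({
    "TRADE_DATE": ["1/1/2014 22:00", "1/1/2014 18:00"],
    "SELLER_COMPANY": ["EXELON", "SOUTHERN COMPANY"],
    "HOUROFDAY": [22, 18],
})

# Assumption: the columns that I expect to be datetimes.
expected_datetime = ["TRADE_DATE"]

# Collect the expected datetime columns that pandas did not parse as datetimes.
not_datetime = [
    col for col in expected_datetime
    if not is_datetime64_any_dtype(sample[col])
]
print(not_datetime)
```

Any column name that shows up in `not_datetime` still needs converting.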

Everything looks good, with the exception of the 'TRADE_DATE' column. This column contains the date and hour that the trade occurred and should be a datetime, but during upload it was classified as an object. So my next step is to convert it to a datetime and check the results.

In [41]:
data['TRADE_DATE'] = pd.to_datetime(data['TRADE_DATE'])      
data['TRADE_DATE'].head()

0   2014-01-01 22:00:00
1   2014-01-01 18:00:00
2   2014-01-01 02:00:00
3   2014-01-01 03:00:00
4   2014-01-01 04:00:00
Name: TRADE_DATE, dtype: datetime64[ns]
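
If some raw values could not be parsed, a plain conversion would raise an error. A more defensive variant, sketched here on a made-up series, turns unparseable entries into `NaT` and counts them. The raw string format `%m/%d/%Y %H:%M` is an assumption for this example, not something I confirmed against the file.

```python
import pandas as pd

# Made-up raw values, including one deliberately bad entry.
raw = pd.Series(["1/1/2014 22:00", "1/1/2014 18:00", "not a date"], name="TRADE_DATE")

# errors="coerce" replaces anything that does not match the format with NaT.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %H:%M", errors="coerce")

# Count the rows that failed to parse so they can be inspected before modelling.
n_bad = int(parsed.isna().sum())
print(parsed.head())
print("unparseable rows:", n_bad)
```

A non-zero count would point at the rows in the raw file that need a closer look.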

Finally, since the goal of this project was to predict prices for a FERC-specific time period and season, I used the following to create these variables in the data frame.

In [31]:
# Default every row to off-peak, then mark weekdays 0-5 (Monday-Saturday), hours 6-21, as peak.
# Assigning through .loc avoids the SettingWithCopyWarning raised by chained indexing.
data['FERC_time'] = 'off_peak'
data.loc[(data['TRADE_DATE'].dt.weekday <= 5) & (data['TRADE_DATE'].dt.hour >= 6) & (data['TRADE_DATE'].dt.hour <= 21), 'FERC_time'] = 'peak'

# Default every row to shoulder, then label December-February as winter and June-August as summer.
data['FERC_season'] = 'shoulder'
data.loc[(data['TRADE_DATE'].dt.month <= 2) | (data['TRADE_DATE'].dt.month >= 12), 'FERC_season'] = 'winter'
data.loc[(data['TRADE_DATE'].dt.month <= 8) & (data['TRADE_DATE'].dt.month >= 6), 'FERC_season'] = 'summer'
data.head()


Unnamed: 0,DATE,SELLER_COMPANY,SELLER_COMPANY_OLD,C_BUYER_NAME,C_BUYER_NAME_OLD,Region,Contract_Service_Agreement_id,TR_CONTRACT_ID,loc,TR_TIMEZONE,...,transcharge,price_minus_index,price_above_trans,price_above_trans_mwh,transaction_len,year,qtr,benchhub,FERC_time,FERC_season
0,1/1/2014,EXELON,"EXELON GENERATION COMPANY, LLC",THE ENERGY AUTHORITY,THE ENERGY AUTHORITY,Florida,12870,717562,FPL,EASTERNPREVAILING,...,8,-1.471054,0,0.0,Hourly,2014,1,FPL,off_peak,winter
1,1/1/2014,SOUTHERN COMPANY,"SOUTHERN COMPANY SERVICES, INC. (AS AGENT)",SEMINOLE ELECTRIC COOP,"SEMINOLE ELECTRIC COOPERATIVE, INC.",Florida,481,1888312,FPL,CENTRALSTANDARD,...,8,2.114532,0,0.0,Hourly,2014,1,FPL,peak,winter
2,1/1/2014,THE ENERGY AUTHORITY,"THE ENERGY AUTHORITY, INC.",J.P. MORGAN CHASE & COMPANY,JP MORGAN VENTURES ENERGY CORPORATION,Florida,30760,892022,FPL,EASTERNPREVAILING,...,8,0.893512,0,0.0,Hourly,2014,1,FPL,off_peak,winter
3,1/1/2014,THE ENERGY AUTHORITY,"THE ENERGY AUTHORITY, INC.",EXELON,"EXELON GENERATION COMPANY, LLC",Florida,30789,892011,FPL,EASTERNPREVAILING,...,8,1.272839,0,0.0,Hourly,2014,1,FPL,off_peak,winter
4,1/1/2014,THE ENERGY AUTHORITY,"THE ENERGY AUTHORITY, INC.",EXELON,"EXELON GENERATION COMPANY, LLC",Florida,30789,892011,FPL,EASTERNPREVAILING,...,8,1.407159,0,0.0,Hourly,2014,1,FPL,off_peak,winter
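
The same labels can be built without default-then-overwrite assignments. The sketch below uses `numpy.where` and `numpy.select` on a small made-up frame. The four timestamps are invented for illustration, while the peak and season rules are the ones from my cell above.

```python
import numpy as np
import pandas as pd

# Made-up timestamps chosen to hit each label at least once.
trades = pd.DataFrame({"TRADE_DATE": pd.to_datetime([
    "2014-01-01 22:00",  # Wednesday, after the peak window
    "2014-01-01 18:00",  # Wednesday, inside the peak window
    "2014-07-06 12:00",  # Sunday, so off-peak regardless of hour
    "2014-04-15 08:00",  # Tuesday morning in a shoulder month
])})

ts = trades["TRADE_DATE"].dt

# Peak: weekday 0-5 (Monday-Saturday) and hour 6-21 inclusive.
is_peak = (ts.weekday <= 5) & ts.hour.between(6, 21)
trades["FERC_time"] = np.where(is_peak, "peak", "off_peak")

# Season: December-February is winter, June-August is summer, the rest is shoulder.
trades["FERC_season"] = np.select(
    [ts.month.isin([12, 1, 2]), ts.month.isin([6, 7, 8])],
    ["winter", "summer"],
    default="shoulder",
)
print(trades)
```

Because each column is written in a single assignment, there is no intermediate state in which a row carries a placeholder label.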


Since I will require two data sets for this project, one for training my model and one for testing it, I split the data frame into two.

For simplicity's sake, I've exported both data frames to CSV so that I don't have to clean the data every time.

In [39]:
cap1_training = data['TRADE_DATE'].dt.year <= 2015
cap1_training_data = data[cap1_training]

cap1_testing = data['TRADE_DATE'].dt.year == 2016
cap1_testing_data = data[cap1_testing]

cap1_testing_data.to_csv(r"C:\Users\anhem44\Data\cap1_testing_data.csv")
cap1_training_data.to_csv(r"C:\Users\anhem44\Data\cap1_training_data.csv")

In [42]:
len(cap1_testing_data)

16622

In [43]:
len(cap1_training_data)

62495

I am left with two CSV files containing 62,495 rows for training and 16,622 rows for testing.
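
A year-based split like this is easy to sanity-check. The sketch below, run on a made-up frame, confirms that no row lands in both sets, that no row is left out, and that every training row predates every testing row. The names `overlap` and `unassigned` are my own, introduced only for this example.

```python
import pandas as pd

# Made-up timestamps straddling the 2015/2016 boundary.
trades = pd.DataFrame({"TRADE_DATE": pd.to_datetime([
    "2014-03-01 10:00",
    "2015-12-31 23:00",
    "2016-01-01 00:00",
    "2016-06-15 14:00",
])})

years = trades["TRADE_DATE"].dt.year
train_mask = years <= 2015
test_mask = years == 2016

# Rows in both sets, and rows in neither set; both counts should be zero.
overlap = int((train_mask & test_mask).sum())
unassigned = int((~(train_mask | test_mask)).sum())

train, test = trades[train_mask], trades[test_mask]
print(len(train), len(test), overlap, unassigned)
```

A non-zero `unassigned` count would mean the data contains years after 2016 that neither mask captures.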