# Pivot
Duncan Callaway

This notebook introduces the pandas `pivot_table` method.  It can be accompanied by a short set of PowerPoint slides on how pivoting works.

Pivot is used to examine aggregates with respect to two characteristics.  For example, you might construct a pivot of sales data to look at average sales broken down by year and market.

The pivot operation is essentially a `groupby` on two keys: the groups of one key become the rows *and the groups of the other become the columns.*

In [1]:
import pandas as pd
import numpy as np

### Warm-up example

In [2]:
df = pd.DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                   'key2' : ['one', 'two', 'one', 'two', 'one'],
                   'data1' : np.random.randn(5),
                   'data2' : np.random.randn(5)})
df

,key1,key2,data1,data2
0,a,one,-0.524345,1.242175
1,a,two,-1.46146,-0.133339
2,b,one,0.258816,0.829948
3,b,two,-0.86341,-0.390779
4,a,one,0.967477,-1.183011


We can do a `groupby` operation on the `data1` column by `key1`:

In [3]:
grouped = df['data1'].groupby(df['key1'])

We then aggregate.  For example, averaging all `data1` values that share the same `key1` gives:

In [4]:
grouped.mean()

key1
a   -0.339442
b   -0.302297
Name: data1, dtype: float64

Now let's turn to pivot.  It generalizes `groupby` by grouping along the rows and the columns at once:

In [5]:
df.pivot_table(
    values  = 'data1', # the entry to aggregate over
    index   = 'key1',  # the row grouping attributes
    columns = 'key2',  # the column grouping attributes
    aggfunc = 'mean'   # the aggregation function
)

key2,one,two
key1,,
a,0.221566,-1.46146
b,0.258816,-0.86341
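
One way to see the connection to `groupby` is to build the same table by grouping on both keys and then moving one key into the columns with `unstack`. The sketch below assumes a small frame with made-up values (not the random data above) so that the result is reproducible; the names `via_groupby` and `via_pivot` are illustrative.

```python
import pandas as pd

# Small deterministic frame; the data1 values are invented for illustration.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'key2': ['one', 'two', 'one', 'two', 'one'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Group by both keys, average, then move key2 from the row index to the columns.
via_groupby = df.groupby(['key1', 'key2'])['data1'].mean().unstack('key2')

# The same table in a single pivot_table call.
via_pivot = df.pivot_table(values='data1', index='key1', columns='key2', aggfunc='mean')

print(via_pivot)
print(via_pivot.equals(via_groupby))
```

Both routes produce one row per `key1` value and one column per `key2` value; `pivot_table` simply packages the group, aggregate, and reshape steps together.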


### Q: How do you think you might return the same info as in our groupby above, but using pivot?

In [6]:
df.pivot_table(
    values  = 'data1', # the entry to aggregate over
    index   = 'key1',  # the row grouping attributes
    #columns = 'key2',    # the column grouping attributes
    aggfunc = 'mean'   # the aggregation function
)

,data1
key1,
a,-0.339442
b,-0.302297


### A: Drop the `columns` argument, as above.
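
As a quick check of that answer, the sketch below compares the two approaches on a small frame with made-up values. One difference worth noticing is the return type: `groupby(...).mean()` on a single column gives a Series, while `pivot_table` gives a one-column DataFrame. The variable names here are illustrative.

```python
import pandas as pd

# Invented values so the averages are easy to verify by hand.
df = pd.DataFrame({
    'key1': ['a', 'a', 'b', 'b', 'a'],
    'data1': [1.0, 2.0, 3.0, 4.0, 5.0],
})

by_group = df.groupby('key1')['data1'].mean()                             # Series
by_pivot = df.pivot_table(values='data1', index='key1', aggfunc='mean')   # DataFrame

print(type(by_group).__name__, type(by_pivot).__name__)

# Selecting the single column of the pivot recovers the groupby result.
same = by_pivot['data1'].equals(by_group)
print(same)
```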

### Using pivot on a more detailed data set
First let's get our data in order.

In [7]:
cds = pd.read_csv('CAISO_2017to2018_stack.csv', index_col=0)
cds

,Source,MWh
2017-08-29 00:00:00,GEOTHERMAL,1181
2017-08-29 00:00:00,BIOMASS,340
2017-08-29 00:00:00,BIOGAS,156
2017-08-29 00:00:00,SMALL HYDRO,324
2017-08-29 00:00:00,WIND TOTAL,1551
...,...,...
2018-08-28 23:00:00,BIOGAS,235
2018-08-28 23:00:00,SMALL HYDRO,262
2018-08-28 23:00:00,WIND TOTAL,2921
2018-08-28 23:00:00,SOLAR PV,0


### Q: I'd like to organize the data by source and hour of day.  How should I do that?

We'll start by extracting the hour of day from the index.

A handy way to turn text that represents date/time information into true datetime data is the `pd.to_datetime` function:

In [8]:
cds_time = pd.to_datetime(cds.index)
type(cds_time)

pandas.core.indexes.datetimes.DatetimeIndex
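
An alternative, if you prefer to do the conversion at load time, is to ask `read_csv` to parse the index as dates. The sketch below uses a three-row in-memory stand-in for the CAISO file (the `csv_text` string and the third row's value are made up for illustration) rather than the real CSV.

```python
import io
import pandas as pd

# Hypothetical stand-in for the stacked CAISO file; not the real data.
csv_text = """,Source,MWh
2017-08-29 00:00:00,GEOTHERMAL,1181
2017-08-29 00:00:00,BIOMASS,340
2017-08-29 01:00:00,GEOTHERMAL,1175
"""

# parse_dates=True asks pandas to parse the index column as datetimes.
demo = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

print(type(demo.index))
print(demo.index.hour)
```

With this approach the index is already a `DatetimeIndex`, so no separate `pd.to_datetime` step is needed.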

Now we can extract the year, month, day, and so on from the `DatetimeIndex`:

In [9]:
cds_time[0].month

8

You can even do this for the entire object in one fell swoop:

In [10]:
cds_time.hour

Index([ 0,  0,  0,  0,  0,  0,  0,  1,  1,  1,
       ...
       22, 22, 22, 23, 23, 23, 23, 23, 23, 23],
      dtype='int32', length=61320)
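
The attributes above are available directly because `cds_time` is a `DatetimeIndex`. If the timestamps live in an ordinary column (a Series) instead, the same fields are reached through the `.dt` accessor. A minimal sketch, using three made-up timestamps:

```python
import pandas as pd

# Three illustrative timestamps held in a Series rather than an index.
times = pd.Series(pd.to_datetime([
    '2017-08-29 00:00:00',
    '2017-08-29 13:00:00',
    '2018-01-05 23:00:00',
]))

hours = times.dt.hour     # hour of day for every row
months = times.dt.month   # calendar month for every row
years = times.dt.year     # calendar year for every row

print(pd.DataFrame({'hour': hours, 'month': months, 'year': years}))
```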

### Q: Now that we have hours, what next?

### A: We can add hours from the `cds_time` object into the dataframe as follows:

In [11]:
cds['hour'] = cds_time.hour
cds.head()

,Source,MWh,hour
2017-08-29 00:00:00,GEOTHERMAL,1181,0
2017-08-29 00:00:00,BIOMASS,340,0
2017-08-29 00:00:00,BIOGAS,156,0
2017-08-29 00:00:00,SMALL HYDRO,324,0
2017-08-29 00:00:00,WIND TOTAL,1551,0


### Q: Try it for yourself: Create a pivot table of average generation by hour of day for each source

In [12]:
cds.pivot_table(
    values  = 'MWh',    # the entry to aggregate over
    index   = 'hour',   # the row grouping attributes
    columns = 'Source', # the column grouping attributes
    aggfunc = 'mean'    # the aggregation function
)

Source,BIOGAS,BIOMASS,GEOTHERMAL,SMALL HYDRO,SOLAR PV,SOLAR THERMAL,WIND TOTAL
hour,,,,,,,
0,225.591781,318.30137,958.720548,330.824658,0.679452,0.0,2173.268493
1,225.964384,318.369863,959.235616,322.421918,0.643836,0.0,2120.778082
2,225.953425,319.846575,959.367123,318.249315,0.635616,0.0,2051.832877
3,225.887671,320.567123,958.367123,316.909589,0.419178,0.0,1973.969863
4,225.753425,321.742466,956.347945,322.254795,0.413699,0.0,1881.463014
5,225.243836,323.863014,956.230137,375.180822,0.482192,0.021918,1772.484932
6,224.479452,330.808219,955.682192,426.931507,352.956164,4.372603,1646.630137
7,222.454795,333.178082,953.263014,422.564384,2489.268493,58.317808,1490.194521
8,221.536986,333.936986,949.024658,376.813699,5552.531507,208.106849,1363.40274
9,221.539726,332.273973,946.210959,343.756164,7174.468493,316.841096,1290.512329
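
To make the arithmetic behind each cell concrete, here is a minimal sketch on a made-up frame with two sources observed at two hours on two days (the `toy` name and all values are illustrative, not CAISO data). Each cell of the pivot is the mean over all days of the readings for that hour and source.

```python
import pandas as pd

# Illustrative data: WIND and SOLAR at 00:00 and 12:00 on two days.
toy = pd.DataFrame(
    {
        'Source': ['WIND', 'SOLAR'] * 4,
        'MWh': [100, 0, 80, 50, 120, 0, 60, 70],
    },
    index=pd.to_datetime(
        ['2018-01-01 00:00', '2018-01-01 00:00',
         '2018-01-01 12:00', '2018-01-01 12:00',
         '2018-01-02 00:00', '2018-01-02 00:00',
         '2018-01-02 12:00', '2018-01-02 12:00']
    ),
)
toy['hour'] = toy.index.hour

hourly = toy.pivot_table(values='MWh', index='hour', columns='Source', aggfunc='mean')
print(hourly)
```

For instance, the WIND entry for hour 0 averages the two midnight readings, 100 and 120, giving 110.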


### Q: In-class challenge:
Create a pivot table where the sources are the columns, the *months* are the rows, and the values are aggregated into maximums.

Hint: pass `'max'` as `aggfunc` to take the maximum.

In [13]:
cds['month'] = cds_time.month

In [14]:
cds_piv = cds.pivot_table(
    values  = 'MWh',
    index   = 'month',
    columns = 'Source',
    aggfunc = 'max'
)
cds_piv

Source,BIOGAS,BIOMASS,GEOTHERMAL,SMALL HYDRO,SOLAR PV,SOLAR THERMAL,WIND TOTAL
month,,,,,,,
1,249,376,999,585,8024,397,4015
2,248,374,1012,572,9369,441,4420
3,248,344,1012,564,9795,583,4108
4,248,336,967,681,10027,589,4531
5,253,359,1005,663,10050,604,4925
6,247,436,1009,659,10102,652,5006
7,240,482,1009,662,9997,636,4466
8,243,399,1212,652,9930,679,4675
9,178,421,1230,577,9044,667,3943
10,238,423,1009,583,8909,541,4426
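
To see how the choice of `aggfunc` changes the table, here is a minimal sketch on a made-up eight-row frame (the `toy` name and its values are illustrative, not CAISO data). Swapping the string `'max'` for `'std'` switches from monthly maxima to monthly sample standard deviations.

```python
import pandas as pd

# Illustrative data: two sources, two readings in each of two months.
toy = pd.DataFrame(
    {
        'Source': ['WIND', 'SOLAR'] * 4,
        'MWh': [100, 0, 80, 50, 120, 0, 60, 70],
    },
    index=pd.to_datetime(
        ['2018-01-01 00:00', '2018-01-01 00:00',
         '2018-01-01 12:00', '2018-01-01 12:00',
         '2018-02-01 00:00', '2018-02-01 00:00',
         '2018-02-01 12:00', '2018-02-01 12:00']
    ),
)
toy['month'] = toy.index.month

# Same rows and columns, different aggregation function.
monthly_max = toy.pivot_table(values='MWh', index='month', columns='Source', aggfunc='max')
monthly_std = toy.pivot_table(values='MWh', index='month', columns='Source', aggfunc='std')

print(monthly_max)
print(monthly_std)
```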
