# Data Aggregation and Group Operations

With pandas we can use *pivot tables* or *group by* operations to compute group
statistics for reporting or visualization, which lets us slice, dice,
and summarize datasets in a natural way.



## Index

- [How to Think About Group Operations](#how-to-think-about-group-operations)
    - [Iterating over Groups](#iterating-over-groups)
    - [Selecting a Column or Subset of Columns](#selecting-a-column-or-subset-of-columns)



In [3]:
import numpy as np 
import pandas as pd
#import seaborn as sns
#import matplotlib.pyplot as plt
import warnings
#from datetime import datetime 
from sinfo import sinfo

warnings.filterwarnings("ignore")

# matplotlib:
#%matplotlib inline
#plt.rc("figure", figsize=(16,8))

## How to Think About Group Operations

Group operations follow the *split-apply-combine* pattern:
1. The data in a DataFrame or Series is *split* into groups based on one or more *keys*.
    - Grouped on rows: `(axis="index")`
    - Grouped on columns: `(axis="columns")`
2. A function is *applied* to each group, producing a new value.
3. The results are *combined* into a new object.
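
The three steps above can be sketched by hand and compared with a single `groupby` call. This is a minimal illustration on a hypothetical toy DataFrame (the names `df`, `pieces`, `means`, and `manual` are not part of these notes), not a description of how pandas implements grouping internally.

```python
import pandas as pd

# Hypothetical toy data, not the DataFrame used in this chapter
df = pd.DataFrame({
    "key": ["a", "b", "a", "b", "a"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# 1. Split: collect the rows that belong to each distinct key
pieces = {k: df.loc[df["key"] == k, "value"] for k in df["key"].unique()}

# 2. Apply: reduce every piece to a single number
means = {k: piece.mean() for k, piece in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(means, name="value").rename_axis("key").sort_index()

# groupby performs the same three steps in one call
grouped = df.groupby("key")["value"].mean()
```

Both `manual` and `grouped` hold one mean per key, which is why `groupby` is usually the more convenient choice.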

When we call `groupby()`, the result is a special "GroupBy" object on which
we can then compute operations such as `mean()`.
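
As a small sketch of what a GroupBy object exposes before any aggregation runs, the example below inspects `ngroups` and `groups` and then computes a sum. The DataFrame `df` and the variable names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data for illustration
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [10, 20, 30]})

gb = df.groupby("key")  # no aggregate has been computed at this point

n_groups = gb.ngroups   # number of distinct keys
# Mapping from each key to the row labels that belong to it
row_labels = {k: list(v) for k, v in gb.groups.items()}

totals = gb["value"].sum()  # the aggregation is requested here
```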

In [4]:
data = pd.DataFrame({
    "key1" : ["a", "a", None, "b", "b", "a", None],
    "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
    "data1" : np.random.standard_normal(7),
    "data2" : np.random.standard_normal(7),
})
data

   key1  key2     data1     data2
0     a     1  1.916865  0.584766
1     a     2 -0.296464  0.020639
2  None     1  1.830957 -0.074832
3     b     2 -0.890768  1.733296
4     b     1 -1.957470  1.268994
5     a  <NA> -0.018124  0.692001
6  None     1 -2.178028  1.116266


***

Mean of the `data1` column, grouped by the `key1` labels

***

In [6]:
grpd1k1 = data["data1"].groupby(data["key1"])
grpd1k1

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001B7A8C3F750>


In [7]:
# Mean calculation on grouped variable

grpd1k1.mean()

key1
a    0.534092
b   -1.424119
Name: data1, dtype: float64

In [9]:
grpd1means = data["data1"].groupby([data["key1"], data["key2"]]).mean()
grpd1means

key1  key2
a     1       1.916865
      2      -0.296464
b     1      -1.957470
      2      -0.890768
Name: data1, dtype: float64

***
From a Series with a hierarchical index to an unstacked DataFrame
***

In [10]:
grpd1means.unstack()

key2         1         2
key1
a     1.916865 -0.296464
b    -1.957470 -0.890768
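
To make the reshaping easier to follow, here is a minimal sketch of `unstack()` on a Series with a two-level index. The values are hypothetical round numbers, not the random data above, and `wide` and `wide_k1` are illustrative names.

```python
import pandas as pd

# Hypothetical values, chosen only to make the reshaping easy to follow
idx = pd.MultiIndex.from_tuples(
    [("a", 1), ("a", 2), ("b", 1), ("b", 2)], names=["key1", "key2"]
)
s = pd.Series([10.0, 20.0, 30.0, 40.0], index=idx)

wide = s.unstack()           # the inner level (key2) becomes the columns
wide_k1 = s.unstack("key1")  # or choose the level to move by name
```

By default the innermost index level is moved to the columns; passing a level name selects a different one.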


***
Grouping keys can be any arrays with the same length as our data
***

In [11]:
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

# If the new arrays have the same length as the data, we can use them as keys for groupby

data["data1"].groupby([states, years]).mean()

CA  2005   -0.157294
    2006    1.830957
OH  2005    0.513049
    2006   -2.067749
Name: data1, dtype: float64
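
The sketch below illustrates the same idea on hypothetical data: external arrays are matched to the rows by position, so they must have one label per row. Naming the resulting index levels with `rename_axis` is an optional extra and is not something the cell above does.

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0])  # hypothetical measurements
region = np.array(["N", "S", "N", "S"])   # one label per row, by position
year = np.array([2020, 2020, 2021, 2021])

# The key arrays must be as long as the data they group
assert len(region) == len(values) == len(year)

by_region = values.groupby(region).mean()
# Array keys carry no names, so the index levels are labelled afterwards
by_both = values.groupby([region, year]).sum().rename_axis(["region", "year"])
```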

***
When the grouping information is in the same DataFrame, we can pass the column
names as keys; the remaining columns are aggregated
***

In [12]:
temp = data.groupby("key1").mean()
temp

      key2     data1     data2
key1
a      1.5  0.534092  0.432469
b      1.5 -1.424119  1.501145


In [14]:
temp = data.groupby(["key2", "key1"]).mean()
temp

              data1     data2
key2 key1
1    a     1.916865  0.584766
     b    -1.957470  1.268994
2    a    -0.296464  0.020639
     b    -0.890768  1.733296


***
The GroupBy `size()` method returns the number of rows in each group.

`count()` computes the number of non-null values in each group.
***

In [15]:
temp = data.groupby("key1", dropna=False).size()
temp

key1
a      3
b      2
NaN    2
dtype: int64

In [16]:
temp = data.groupby(["key1", "key2"], dropna=False).size()
temp


key1  key2
a     1       1
      2       1
      <NA>    1
b     1       1
      2       1
NaN   1       2
dtype: int64

In [17]:
temp = data.groupby("key1").count()
temp

      key2  data1  data2
key1
a        2      3      3
b        2      2      2
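
The difference between `size()` and `count()` only shows up when values are missing. The following sketch uses a hypothetical DataFrame with one missing key and one missing value to put the two side by side, along with the effect of `dropna=False`.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing key and one missing value
df = pd.DataFrame({
    "key": ["a", "a", None, "b"],
    "x": [1.0, np.nan, 3.0, 4.0],
})

sizes = df.groupby("key").size()                    # rows per group
counts = df.groupby("key")["x"].count()             # non-null x per group
sizes_all = df.groupby("key", dropna=False).size()  # keeps the missing key
```

Group `a` has two rows but only one non-null `x`, so `sizes` and `counts` disagree for it. With `dropna=False` the row whose key is missing forms its own group instead of being dropped.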


### Iterating over Groups

In [18]:
data

   key1  key2     data1     data2
0     a     1  1.916865  0.584766
1     a     2 -0.296464  0.020639
2  None     1  1.830957 -0.074832
3     b     2 -0.890768  1.733296
4     b     1 -1.957470  1.268994
5     a  <NA> -0.018124  0.692001
6  None     1 -2.178028  1.116266


In [21]:
for name, group in data.groupby("key1"):
    print(name)
    print(group)


a
  key1  key2     data1     data2
0    a     1  1.916865  0.584766
1    a     2 -0.296464  0.020639
5    a  <NA> -0.018124  0.692001
b
  key1  key2     data1     data2
3    b     2 -0.890768  1.733296
4    b     1 -1.957470  1.268994


In [22]:
for (k1, k2), group in data.groupby(["key1", "key2"]):
    print((k1, k2))
    print(group)

('a', 1)
  key1  key2     data1     data2
0    a     1  1.916865  0.584766
('a', 2)
  key1  key2     data1     data2
1    a     2 -0.296464  0.020639
('b', 1)
  key1  key2    data1     data2
4    b     1 -1.95747  1.268994
('b', 2)
  key1  key2     data1     data2
3    b     2 -0.890768  1.733296


***
It can be useful to build a dictionary of the data pieces
***


In [24]:
data_pieces = {name: group for name, group in data.groupby("key1")}
print(data_pieces["b"])
print("\n", data_pieces["a"], sep='')

  key1  key2     data1     data2
3    b     2 -0.890768  1.733296
4    b     1 -1.957470  1.268994

  key1  key2     data1     data2
0    a     1  1.916865  0.584766
1    a     2 -0.296464  0.020639
5    a  <NA> -0.018124  0.692001
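
When only one group is needed, building the whole dictionary may be more than necessary. As a hedged alternative, the sketch below compares the dictionary approach with the GroupBy `get_group()` method on a hypothetical toy DataFrame.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"key": ["a", "b", "a"], "value": [1, 2, 3]})
gb = df.groupby("key")

pieces = {name: group for name, group in gb}  # materializes every group
only_b = gb.get_group("b")                    # fetches a single group
```

Both routes return the same rows for `"b"`. The dictionary is convenient when several groups will be reused, while `get_group()` avoids iterating over all of them.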


### Selecting a Column or Subset of Columns