# Data Aggregation and Group Operations

With pandas we can use *pivot tables* and *group by* operations to compute
group statistics for reporting or visualization, which lets us slice, dice,
and summarize datasets in a natural way.



## Index

- [How to Think About Group Operations](#how-to-think-about-group-operations)
    - [Iterating over Groups](#iterating-over-groups)
    - [Selecting a Column or Subset of Columns](#selecting-a-column-or-subset-of-columns)
    - [Grouping with Dictionaries and Series](#grouping-with-dictionaries-and-series)
    - [Grouping by Index Levels](#grouping-by-index-levels)
- [Data Aggregation](#data-aggregation)
    - [Column-Wise and Multiple Function Application](#column-wise-and-multiple-function-application)
    - [Returning Aggregated Data Without Row Indexes](#returning-aggregated-data-without-row-indexes)
- [Apply: General split-apply-combine](#apply-general-split-apply-combine)
    - [Quantile and Bucket Analysis](#quantile-and-bucket-analysis)
    - [Example: Filling Missing Values with Group-Specific Values](#example-filling-missing-values-with-group-specific-values)
    

In [44]:
import numpy as np 
import pandas as pd
#import seaborn as sns
#import matplotlib.pyplot as plt
import warnings
#from datetime import datetime 
from sinfo import sinfo

warnings.filterwarnings("ignore")

# matplotlib:
#%matplotlib inline
#plt.rc("figure", figsize=(16,8))

## How to Think About Group Operations

The core idea is *split-apply-combine*:
1. Data in a DataFrame or Series is *split* into groups based on one or more *keys*.
    - Grouped on rows: `(axis="index")`
    - Grouped on columns: `(axis="columns")`
2. A function is *applied* to each group, producing a new value.
3. The results are *combined* into a new object.

When we call `groupby()`, the result is a special GroupBy object. It has not
computed anything yet; it holds the information needed to apply an operation
to each group.
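
The three steps can be written out by hand to see what `groupby` does for us. This is only a sketch on a small made-up frame (`df`, `pieces`, `applied` and `manual` are illustrative names, not part of this notebook's data), not how pandas implements it internally.

```python
import pandas as pd

# Hypothetical toy data, not the frame used in this notebook
df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                   "value": [1.0, 2.0, 3.0, 4.0, 5.0]})

# 1. Split: collect the values that belong to each key
pieces = {}
for key, value in zip(df["key"], df["value"]):
    pieces.setdefault(key, []).append(value)

# 2. Apply: reduce each piece to a single number (here, the mean)
applied = {key: sum(values) / len(values) for key, values in pieces.items()}

# 3. Combine: assemble the per-group results into one Series
manual = pd.Series(applied, name="value").sort_index()

# The same computation in a single groupby call
with_groupby = df.groupby("key")["value"].mean()

print(manual)
print(with_groupby)
```

Both Series hold one mean per key; `groupby` simply hides the bookkeeping of the split and combine steps.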

In [45]:
data = pd.DataFrame({
    "key1" : ["a", "a", None, "b", "b", "a", None],
    "key2" : pd.Series([1, 2, 1, 2, 1, None, 1], dtype="Int64"),
    "data1" : np.random.standard_normal(7),
    "data2" : np.random.standard_normal(7),
})
data

| | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | 1 | -0.408233 | -0.100022 |
| 1 | a | 2 | -0.530758 | 0.868868 |
| 2 | None | 1 | 0.315066 | -0.651356 |
| 3 | b | 2 | -1.056444 | -0.783801 |
| 4 | b | 1 | -0.364715 | -0.68781 |
| 5 | a | `<NA>` | -1.086281 | -0.212336 |
| 6 | None | 1 | -0.747556 | 1.778774 |


***

Mean of the `data1` column, grouped by the `key1` labels

***

In [46]:
grpd1k1 = data["data1"].groupby(data["key1"])
grpd1k1

<pandas.core.groupby.generic.SeriesGroupBy object at 0x000002937DC62A90>

In [47]:
# Mean calculation on grouped variable

grpd1k1.mean()

key1
a   -0.675091
b   -0.710579
Name: data1, dtype: float64

In [48]:
grpd1means = data["data1"].groupby([data["key1"], data["key2"]]).mean()
grpd1means

key1  key2
a     1      -0.408233
      2      -0.530758
b     1      -0.364715
      2      -1.056444
Name: data1, dtype: float64

***
Unstacking the Series with a hierarchical index into a DataFrame
***

In [49]:
grpd1means.unstack()

| key1 | key2=1 | key2=2 |
|---|---|---|
| a | -0.408233 | -0.530758 |
| b | -0.364715 | -1.056444 |


***
The group keys can be any arrays with the same length as the data
***

In [50]:
states = np.array(["OH", "CA", "CA", "OH", "OH", "CA", "OH"])
years = [2005, 2005, 2006, 2005, 2006, 2005, 2006]

# Any arrays with the same length as the data can be used as groupby keys

data["data1"].groupby([states, years]).mean()

CA  2005   -0.808520
    2006    0.315066
OH  2005   -0.732338
    2006   -0.556135
Name: data1, dtype: float64

***
When the grouping information is in the same DataFrame, we can pass the column
names as the group keys; the remaining columns are then aggregated
***

In [51]:
temp = data.groupby("key1").mean()
temp

| key1 | key2 | data1 | data2 |
|---|---|---|---|
| a | 1.5 | -0.675091 | 0.185504 |
| b | 1.5 | -0.710579 | -0.735805 |


In [52]:
temp = data.groupby(["key2", "key1"]).mean()
temp

| key2 | key1 | data1 | data2 |
|---|---|---|---|
| 1 | a | -0.408233 | -0.100022 |
| 1 | b | -0.364715 | -0.68781 |
| 2 | a | -0.530758 | 0.868868 |
| 2 | b | -1.056444 | -0.783801 |


***
The GroupBy `size()` method returns the size of each group.

`count()` computes the number of non-null values in each group.
***

In [53]:
temp = data.groupby("key1", dropna=False).size()
temp

key1
a      3
b      2
NaN    2
dtype: int64

In [54]:
temp = data.groupby(["key1", "key2"], dropna=False).size()
temp


key1  key2
a     1       1
      2       1
      <NA>    1
b     1       1
      2       1
NaN   1       2
dtype: int64

In [55]:
temp = data.groupby("key1").count()
temp

| key1 | key2 | data1 | data2 |
|---|---|---|---|
| a | 2 | 3 | 3 |
| b | 2 | 2 | 2 |


### Iterating over Groups

In [56]:
data

| | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | 1 | -0.408233 | -0.100022 |
| 1 | a | 2 | -0.530758 | 0.868868 |
| 2 | None | 1 | 0.315066 | -0.651356 |
| 3 | b | 2 | -1.056444 | -0.783801 |
| 4 | b | 1 | -0.364715 | -0.68781 |
| 5 | a | `<NA>` | -1.086281 | -0.212336 |
| 6 | None | 1 | -0.747556 | 1.778774 |


In [57]:
for name, group in data.groupby("key1"):
    print(name)
    print(group)


a
  key1  key2     data1     data2
0    a     1 -0.408233 -0.100022
1    a     2 -0.530758  0.868868
5    a  <NA> -1.086281 -0.212336
b
  key1  key2     data1     data2
3    b     2 -1.056444 -0.783801
4    b     1 -0.364715 -0.687810


In [58]:
for (k1, k2), group in data.groupby(["key1", "key2"]):
    print((k1, k2))
    print(group)

('a', 1)
  key1  key2     data1     data2
0    a     1 -0.408233 -0.100022
('a', 2)
  key1  key2     data1     data2
1    a     2 -0.530758  0.868868
('b', 1)
  key1  key2     data1    data2
4    b     1 -0.364715 -0.68781
('b', 2)
  key1  key2     data1     data2
3    b     2 -1.056444 -0.783801


***
It can be useful to build a dictionary of the data pieces
***


In [59]:
data_pieces = {name: group for name, group in data.groupby("key1")}
print(data_pieces["b"])
print("\n", data_pieces["a"], sep='')

  key1  key2     data1     data2
3    b     2 -1.056444 -0.783801
4    b     1 -0.364715 -0.687810

  key1  key2     data1     data2
0    a     1 -0.408233 -0.100022
1    a     2 -0.530758  0.868868
5    a  <NA> -1.086281 -0.212336


### Selecting a Column or Subset of Columns

With large datasets we often want to aggregate only a few columns. Indexing a
GroupBy object with a column name or a list of column names selects those
columns for aggregation.
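
The kind of indexing decides the type of the result. The sketch below uses a small made-up frame (`df`, `x` and `y` are illustrative names): a single label gives a Series, a list of labels gives a DataFrame.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"key": ["a", "a", "b"],
                   "x": [1.0, 2.0, 3.0],
                   "y": [10.0, 20.0, 30.0]})

grouped = df.groupby("key")

as_series = grouped["x"].mean()     # single label -> Series
as_frame = grouped[["x"]].mean()    # list of labels -> DataFrame

# Selecting the column first and grouping by the key Series gives the same numbers
same = df["x"].groupby(df["key"]).mean()

print(type(as_series).__name__, type(as_frame).__name__)
```

Only the selected columns are aggregated, which avoids computing statistics for columns we do not need.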

In [60]:
data.groupby(["key1", "key2"])[["data2"]].mean()

| key1 | key2 | data2 |
|---|---|---|
| a | 1 | -0.100022 |
| a | 2 | 0.868868 |
| b | 1 | -0.68781 |
| b | 2 | -0.783801 |


### Grouping with Dictionaries and Series

We can also group by passing a dictionary that maps labels to group names,
and then aggregate as usual, e.g. with mean, sum or count.
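
Grouping columns through `axis="columns"` is, as far as we know, deprecated in recent pandas releases (2.1 and later). One possible alternative, sketched here on a small made-up frame (`df`, `mapping` and `by_color` are illustrative names), is to transpose, group the rows, and transpose back.

```python
import pandas as pd

# Hypothetical toy frame; its column labels play the role of the keys to map
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
                  columns=["a", "b", "c"], index=["r1", "r2"])

# Keys that do not match any label ("z" here) are simply ignored
mapping = {"a": "red", "b": "red", "c": "blue", "z": "orange"}

# Transpose so the columns become rows, group those rows, transpose back
by_color = df.T.groupby(mapping).sum().T

print(by_color)
```

The result has one column per group name, just like a column-wise groupby would produce.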

In [61]:
people = pd.DataFrame(np.random.standard_normal((5, 5)),
                      columns=["a", "b", "c", "d", "e"],
                      index=["Joe", "Steve", "Wanda", "Jill", "Trey"])

people.iloc[2:3, [1, 2]] = np.nan

people

| | a | b | c | d | e |
|---|---|---|---|---|---|
| Joe | 0.105561 | 0.38663 | 0.971671 | 0.424418 | 1.757019 |
| Steve | -0.821874 | -1.800735 | 0.864665 | 0.628691 | 0.921311 |
| Wanda | 0.420627 | NaN | NaN | -0.923812 | 1.57322 |
| Jill | -0.148279 | -0.902405 | -0.525646 | -0.527887 | -0.051234 |
| Trey | -1.526856 | -0.155809 | -1.851783 | -0.505209 | 0.392157 |


In [62]:
mapping = {"a":"red", "b": "red", "c": "blue", 
           "d": "blue", "e": "red", "f" : "orange"}

sum_col = people.groupby(mapping, axis="columns")
sum_col.sum()

| | blue | red |
|---|---|---|
| Joe | 1.396089 | 2.24921 |
| Steve | 1.493356 | -1.701299 |
| Wanda | -0.923812 | 1.993847 |
| Jill | -1.053534 | -1.101918 |
| Trey | -2.356993 | -1.290508 |


In [63]:
sum_col.mean()

| | blue | red |
|---|---|---|
| Joe | 0.698045 | 0.749737 |
| Steve | 0.746678 | -0.5671 |
| Wanda | -0.923812 | 0.996923 |
| Jill | -0.526767 | -0.367306 |
| Trey | -1.178496 | -0.430169 |


In [64]:
map_ser = pd.Series(mapping)
map_ser

a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object

In [65]:
people.groupby(map_ser, axis="columns").count()

| | blue | red |
|---|---|---|
| Joe | 2 | 3 |
| Steve | 2 | 3 |
| Wanda | 1 | 2 |
| Jill | 2 | 3 |
| Trey | 2 | 3 |


### Grouping by Index Levels

To group by an index level, pass the level number or the level name with the
`level` keyword:

`df.groupby(level=1, axis=1)`.
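
The `level` keyword works the same way on a row MultiIndex. A minimal sketch with a made-up Series (`idx`, `s` and the values are illustrative) shows that a level can be addressed by name or by position:

```python
import pandas as pd

# Hypothetical two-level index: country and tenor
idx = pd.MultiIndex.from_arrays([["US", "US", "JP", "JP"], [1, 3, 1, 3]],
                                names=["cty", "tenor"])
s = pd.Series([10, 20, 30, 40], index=idx)

by_name = s.groupby(level="cty").sum()   # refer to the level by name
by_number = s.groupby(level=0).sum()     # or by position (0 is the outermost level)
by_tenor = s.groupby(level=1).sum()      # inner level

print(by_name)
print(by_tenor)
```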

In [66]:
cols = pd.MultiIndex.from_arrays([["US", "US", "US", "JP", "JP"],
                                 [1, 3, 7, 1, 3]], names=["cty", "tenor"])
hier_data = pd.DataFrame(np.random.standard_normal((4, 5)), columns=cols)
hier_data

| cty | US | US | US | JP | JP |
|---|---|---|---|---|---|
| tenor | 1 | 3 | 7 | 1 | 3 |
| 0 | 0.209412 | 1.719926 | 0.898809 | -1.036759 | 1.364985 |
| 1 | -0.778979 | 0.238919 | -0.369632 | -0.301138 | -1.154888 |
| 2 | 0.147042 | 0.16164 | -0.362939 | 1.6943 | -0.30855 |
| 3 | 0.904927 | 0.679221 | -0.912813 | 0.442267 | -0.23516 |


In [67]:
hier_data.groupby(level="cty", axis=1).count()

| cty | JP | US |
|---|---|---|
| 0 | 2 | 3 |
| 1 | 2 | 3 |
| 2 | 2 | 3 |
| 3 | 2 | 3 |


## Data Aggregation

Aggregations refer to any data transformation that produces scalar values from arrays.


*Optimized GroupBy methods*

| Function name| Description|
|--:|---|
|any, all |Return True if any (one or more values) or all non-NA values are “truthy”|
|count |Number of non-NA values|
|cummin, cummax |Cumulative minimum and maximum of non-NA values|
|cumsum |Cumulative sum of non-NA values|
|cumprod |Cumulative product of non-NA values|
|first, last |First and last non-NA values|
|mean |Mean of non-NA values|
|median |Arithmetic median of non-NA values|
|min, max |Minimum and maximum of non-NA values|
|nth |Retrieve value that would appear at position n with the data in sorted order|
|ohlc |Compute four “open-high-low-close” statistics for time series-like data|
|prod |Product of non-NA values|
|quantile |Compute sample quantile|
|rank |Ordinal ranks of non-NA values, like calling Series.rank|
|size |Compute group sizes, returning result as a Series|
|sum |Sum of non-NA values|
|std, var |Sample standard deviation and variance|

Custom aggregation functions are usually much slower than the optimized
methods in the table above, since they add overhead (function calls, data
rearrangement) for every group.
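
As a rough illustration, the sketch below computes the same group means twice on made-up random data: once through the optimized method and once through a Python function. The results agree; only the route differs. Timings are left out on purpose because they depend on the machine and the pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10,000 rows spread over at most 100 groups
rng = np.random.default_rng(0)
df = pd.DataFrame({"key": rng.integers(0, 100, size=10_000),
                   "value": rng.standard_normal(10_000)})
grouped = df.groupby("key")["value"]

fast = grouped.agg("mean")               # dispatches to the optimized method
slow = grouped.agg(lambda s: s.mean())   # Python function called once per group

same = np.allclose(fast, slow)
print(same)
```

In a notebook, `%timeit` on each of the two `agg` calls is a simple way to compare them.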

In [68]:
data

| | key1 | key2 | data1 | data2 |
|---|---|---|---|---|
| 0 | a | 1 | -0.408233 | -0.100022 |
| 1 | a | 2 | -0.530758 | 0.868868 |
| 2 | None | 1 | 0.315066 | -0.651356 |
| 3 | b | 2 | -1.056444 | -0.783801 |
| 4 | b | 1 | -0.364715 | -0.68781 |
| 5 | a | `<NA>` | -1.086281 | -0.212336 |
| 6 | None | 1 | -0.747556 | 1.778774 |


In [69]:
grouped = data.groupby("key1")

# Extracting the two smallest numbers per key
grouped["data1"].nsmallest(2)

key1   
a     5   -1.086281
      1   -0.530758
b     3   -1.056444
      4   -0.364715
Name: data1, dtype: float64

***
We can use our own aggregation function by passing any function that
aggregates an array to the `agg` method
***

In [71]:
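# Note: this name shadows Python's built-in range() for the rest of the notebook
def range(arr):
    # Spread between the largest and the smallest value
    return arr.max() - arr.min()

def trimean(series):
    # Tukey's trimean: weighted average of the quartiles and the median
    Q1 = series.quantile(0.25)
    median = series.median()
    Q3 = series.quantile(0.75)
    return (Q1 + 2 * median + Q3) / 4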

grouped.agg(trimean)

| key1 | key2 | data1 | data2 |
|---|---|---|---|
| a | 1.5 | -0.584883 | 0.00705 |
| b | 1.5 | -0.710579 | -0.735805 |


In [72]:
grouped.describe()

| | key2 | key2 | key2 | key2 | key2 | key2 | key2 | key2 | data1 | data1 | data1 | data1 | data1 | data2 | data2 | data2 | data2 | data2 | data2 | data2 | data2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| key1 | count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max |
| a | 2.0 | 1.5 | 0.707107 | 1.0 | 1.25 | 1.5 | 1.75 | 2.0 | 3.0 | -0.675091 | ... | -0.469496 | -0.408233 | 3.0 | 0.185504 | 0.59447 | -0.212336 | -0.156179 | -0.100022 | 0.384423 | 0.868868 |
| b | 2.0 | 1.5 | 0.707107 | 1.0 | 1.25 | 1.5 | 1.75 | 2.0 | 2.0 | -0.710579 | ... | -0.537647 | -0.364715 | 2.0 | -0.735805 | 0.067876 | -0.783801 | -0.759803 | -0.735805 | -0.711807 | -0.68781 |


### Column-Wise and Multiple Function Application

The resulting DataFrame will have hierarchical columns only if multiple
functions are applied to at least one column.
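
A small sketch on made-up data (`df`, `x` and `y` are illustrative names) makes the rule visible by inspecting the columns of each result:

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"key": ["a", "a", "b"],
                   "x": [1.0, 3.0, 5.0],
                   "y": [2.0, 4.0, 6.0]})
grouped = df.groupby("key")

flat = grouped.agg("mean")                             # one function: flat columns
nested = grouped.agg(["mean", "max"])                  # several functions: two levels
mixed = grouped.agg({"x": ["mean", "max"], "y": "sum"})  # one column is enough

print(flat.columns.nlevels)
print(nested.columns.tolist())
print(mixed.columns.tolist())
```

In `mixed`, only `x` receives several functions, yet every column ends up under a two-level label.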

In [73]:
with open("datasets/tips.csv") as file:
    data = pd.read_csv(file)

data.head()

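| | total_bill | tip | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 |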


In [74]:
# Adding percentage tip column

data["tip_pct"] = data["tip"] / data["total_bill"]
data.head()

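| | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.5 | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.13978 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |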


***
Applying multiple functions to grouped data
***

In [75]:
# Mean tip percentage per day and smoker
grouped = data.groupby(["day", "smoker"])

grouped_pct = grouped["tip_pct"]
grouped_pct.agg("mean")

day   smoker
Fri   No        0.151650
      Yes       0.174783
Sat   No        0.158048
      Yes       0.147906
Sun   No        0.160113
      Yes       0.187250
Thur  No        0.160298
      Yes       0.163863
Name: tip_pct, dtype: float64

In [77]:
grouped_pct.agg(["mean", trimean, "median", "std", range])

| day | smoker | mean | trimean | median | std | range |
|---|---|---|---|---|---|---|
| Fri | No | 0.15165 | 0.149843 | 0.149241 | 0.028123 | 0.067349 |
| Fri | Yes | 0.174783 | 0.172701 | 0.173913 | 0.051293 | 0.159925 |
| Sat | No | 0.158048 | 0.155115 | 0.150152 | 0.039767 | 0.235193 |
| Sat | Yes | 0.147906 | 0.147387 | 0.153624 | 0.061375 | 0.290095 |
| Sun | No | 0.160113 | 0.162074 | 0.161665 | 0.042347 | 0.193226 |
| Sun | Yes | 0.18725 | 0.147323 | 0.138122 | 0.154134 | 0.644685 |
| Thur | No | 0.160298 | 0.157392 | 0.153492 | 0.038774 | 0.19335 |
| Thur | Yes | 0.163863 | 0.162642 | 0.153846 | 0.039389 | 0.15124 |


In [80]:
# Renaming the result columns by passing (name, function) tuples to .agg()

grouped_pct.agg([("average", "mean"), ("trimean", trimean),
                 ("stdev", "std"), ("range", range)])

| day | smoker | average | trimean | stdev | range |
|---|---|---|---|---|---|
| Fri | No | 0.15165 | 0.149843 | 0.028123 | 0.067349 |
| Fri | Yes | 0.174783 | 0.172701 | 0.051293 | 0.159925 |
| Sat | No | 0.158048 | 0.155115 | 0.039767 | 0.235193 |
| Sat | Yes | 0.147906 | 0.147387 | 0.061375 | 0.290095 |
| Sun | No | 0.160113 | 0.162074 | 0.042347 | 0.193226 |
| Sun | Yes | 0.18725 | 0.147323 | 0.154134 | 0.644685 |
| Thur | No | 0.160298 | 0.157392 | 0.038774 | 0.19335 |
| Thur | Yes | 0.163863 | 0.162642 | 0.039389 | 0.15124 |


***
We can also define the list of functions beforehand and apply it to several
columns at once.

The list may contain ("name", function) tuples as well.
***
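
Besides (name, function) tuples, pandas also offers keyword-based *named aggregation*, where each keyword becomes an output column. The sketch below uses a small made-up frame that only borrows the `day` and `tip_pct` column names from the tips data; the values and the output names `average`, `spread` and `largest` are illustrative.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the tips dataset
df = pd.DataFrame({"day": ["Fri", "Fri", "Sat", "Sat"],
                   "tip_pct": [0.10, 0.20, 0.15, 0.25]})

# On a DataFrameGroupBy: keyword = (column, function)
named = df.groupby("day").agg(
    average=("tip_pct", "mean"),
    spread=("tip_pct", lambda s: s.max() - s.min()),
)

# On a SeriesGroupBy: keyword = function
named_series = df.groupby("day")["tip_pct"].agg(average="mean", largest="max")

print(named)
print(named_series)
```

This form keeps the output columns flat and avoids relying on function names for the labels.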

In [81]:
func = ["count", "mean", "max"]

result = grouped[["tip_pct", "total_bill"]].agg(func)
result

| | | tip_pct | tip_pct | tip_pct | total_bill | total_bill | total_bill |
|---|---|---|---|---|---|---|---|
| day | smoker | count | mean | max | count | mean | max |
| Fri | No | 4 | 0.15165 | 0.187735 | 4 | 18.42 | 22.75 |
| Fri | Yes | 15 | 0.174783 | 0.26348 | 15 | 16.813333 | 40.17 |
| Sat | No | 45 | 0.158048 | 0.29199 | 45 | 19.661778 | 48.33 |
| Sat | Yes | 42 | 0.147906 | 0.325733 | 42 | 21.276667 | 50.81 |
| Sun | No | 57 | 0.160113 | 0.252672 | 57 | 20.506667 | 48.17 |
| Sun | Yes | 19 | 0.18725 | 0.710345 | 19 | 24.12 | 45.35 |
| Thur | No | 45 | 0.160298 | 0.266312 | 45 | 17.113111 | 41.19 |
| Thur | Yes | 17 | 0.163863 | 0.241255 | 17 | 19.190588 | 43.11 |


***
We can also apply different functions to specific columns by passing a dict
that maps column names to functions:
***

In [84]:
grouped.agg({"tip_pct" : ["min", "max", trimean, "std"], 
             "size" : "sum"})

| | | tip_pct | tip_pct | tip_pct | tip_pct | size |
|---|---|---|---|---|---|---|
| day | smoker | min | max | trimean | std | sum |
| Fri | No | 0.120385 | 0.187735 | 0.149843 | 0.028123 | 9 |
| Fri | Yes | 0.103555 | 0.26348 | 0.172701 | 0.051293 | 31 |
| Sat | No | 0.056797 | 0.29199 | 0.155115 | 0.039767 | 115 |
| Sat | Yes | 0.035638 | 0.325733 | 0.147387 | 0.061375 | 104 |
| Sun | No | 0.059447 | 0.252672 | 0.162074 | 0.042347 | 167 |
| Sun | Yes | 0.06566 | 0.710345 | 0.147323 | 0.154134 | 49 |
| Thur | No | 0.072961 | 0.266312 | 0.157392 | 0.038774 | 112 |
| Thur | Yes | 0.090014 | 0.241255 | 0.162642 | 0.039389 | 40 |


### Returning Aggregated Data Without Row Indexes

We do not always want the group keys to become the index of the result. We can
disable this behaviour by passing `as_index=False` to `groupby`. The same
result can be obtained by calling `reset_index` on the result, but using
`as_index=False` from the start avoids some unnecessary computations.
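
The equivalence of the two routes can be checked on a small made-up frame (the column names mirror the tips data, the values are illustrative):

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"day": ["Fri", "Fri", "Sat"],
                   "smoker": ["No", "Yes", "No"],
                   "total_bill": [10.0, 20.0, 30.0]})

# Keys stay as ordinary columns
flat = df.groupby(["day", "smoker"], as_index=False)["total_bill"].mean()

# Keys go into the index first and are moved back afterwards
via_reset = df.groupby(["day", "smoker"])["total_bill"].mean().reset_index()

print(flat.equals(via_reset))
```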

In [89]:
temp = data.groupby(["day", "smoker"], as_index=False)
temp["total_bill"].mean()

| | day | smoker | total_bill |
|---|---|---|---|
| 0 | Fri | No | 18.42 |
| 1 | Fri | Yes | 16.813333 |
| 2 | Sat | No | 19.661778 |
| 3 | Sat | Yes | 21.276667 |
| 4 | Sun | No | 20.506667 |
| 5 | Sun | Yes | 24.12 |
| 6 | Thur | No | 17.113111 |
| 7 | Thur | Yes | 19.190588 |


In [90]:
temp = data.groupby(["day", "smoker"])
temp["total_bill"].mean()

day   smoker
Fri   No        18.420000
      Yes       16.813333
Sat   No        19.661778
      Yes       21.276667
Sun   No        20.506667
      Yes       24.120000
Thur  No        17.113111
      Yes       19.190588
Name: total_bill, dtype: float64

In [91]:
temp["total_bill"].mean().reset_index()

| | day | smoker | total_bill |
|---|---|---|---|
| 0 | Fri | No | 18.42 |
| 1 | Fri | Yes | 16.813333 |
| 2 | Sat | No | 19.661778 |
| 3 | Sat | Yes | 21.276667 |
| 4 | Sun | No | 20.506667 |
| 5 | Sun | Yes | 24.12 |
| 6 | Thur | No | 17.113111 |
| 7 | Thur | Yes | 19.190588 |


## Apply: General split-apply-combine

The `apply` method is one of the most widely used general-purpose methods
in GroupBy. It works as follows: it splits the object being manipulated into
pieces, invokes the passed function on each piece, and then attempts to
concatenate the pieces.
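
These steps can be reproduced by hand. The sketch below is an approximation on made-up data (`df`, `k`, `v` and this small `top` helper are illustrative), not pandas' actual implementation: it calls the function on each piece and concatenates the results, then does the same through `apply`.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"k": ["a", "b", "a", "b", "a"],
                   "v": [3, 9, 1, 7, 5]})

def top(piece, n=2, column="v"):
    # Keep the n rows with the largest values in `column`
    return piece.sort_values(column, ascending=False)[:n]

# Manual version: split, call the function on each piece, concatenate
pieces = {name: top(piece[["v"]]) for name, piece in df.groupby("k")}
manual = pd.concat(pieces)

# The same steps through apply; the column is selected explicitly so that
# the grouping column is not part of each piece
applied = df.groupby("k")[["v"]].apply(top)

print(manual)
print(applied)
```

Both results carry the group key as the outer index level and the original row labels as the inner one.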

In [92]:
data.head()

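| | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Sun | Dinner | 2 | 0.059447 |
| 1 | 10.34 | 1.66 | No | Sun | Dinner | 3 | 0.160542 |
| 2 | 21.01 | 3.5 | No | Sun | Dinner | 3 | 0.166587 |
| 3 | 23.68 | 3.31 | No | Sun | Dinner | 2 | 0.13978 |
| 4 | 24.59 | 3.61 | No | Sun | Dinner | 4 | 0.146808 |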


In [93]:
def top(df, n=5, column="tip_pct"):
    return df.sort_values(column, ascending=False)[:n]

top(data, n=6)

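| | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 172 | 7.25 | 5.15 | Yes | Sun | Dinner | 2 | 0.710345 |
| 178 | 9.6 | 4.0 | Yes | Sun | Dinner | 2 | 0.416667 |
| 67 | 3.07 | 1.0 | Yes | Sat | Dinner | 1 | 0.325733 |
| 232 | 11.61 | 3.39 | No | Sat | Dinner | 2 | 0.29199 |
| 183 | 23.17 | 6.5 | Yes | Sun | Dinner | 4 | 0.280535 |
| 109 | 14.31 | 4.0 | Yes | Sat | Dinner | 2 | 0.279525 |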


In [97]:
data.groupby(["smoker", "time"]).apply(top, n=3)

| smoker | time | | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|---|---|
| No | Dinner | 232 | 11.61 | 3.39 | No | Sat | Dinner | 2 | 0.29199 |
| No | Dinner | 51 | 10.29 | 2.6 | No | Sun | Dinner | 2 | 0.252672 |
| No | Dinner | 185 | 20.69 | 5.0 | No | Sun | Dinner | 5 | 0.241663 |
| No | Lunch | 149 | 7.51 | 2.0 | No | Thur | Lunch | 2 | 0.266312 |
| No | Lunch | 88 | 24.71 | 5.85 | No | Thur | Lunch | 2 | 0.236746 |
| No | Lunch | 87 | 18.28 | 4.0 | No | Thur | Lunch | 2 | 0.218818 |
| Yes | Dinner | 172 | 7.25 | 5.15 | Yes | Sun | Dinner | 2 | 0.710345 |
| Yes | Dinner | 178 | 9.6 | 4.0 | Yes | Sun | Dinner | 2 | 0.416667 |
| Yes | Dinner | 67 | 3.07 | 1.0 | Yes | Sat | Dinner | 1 | 0.325733 |
| Yes | Lunch | 221 | 13.42 | 3.48 | Yes | Fri | Lunch | 2 | 0.259314 |


In [98]:
# Suppressing the group keys with the 'group_keys=False' argument

data.groupby(["smoker", "time"], group_keys=False).apply(top, n=3)

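| | total_bill | tip | smoker | day | time | size | tip_pct |
|---|---|---|---|---|---|---|---|
| 232 | 11.61 | 3.39 | No | Sat | Dinner | 2 | 0.29199 |
| 51 | 10.29 | 2.6 | No | Sun | Dinner | 2 | 0.252672 |
| 185 | 20.69 | 5.0 | No | Sun | Dinner | 5 | 0.241663 |
| 149 | 7.51 | 2.0 | No | Thur | Lunch | 2 | 0.266312 |
| 88 | 24.71 | 5.85 | No | Thur | Lunch | 2 | 0.236746 |
| 87 | 18.28 | 4.0 | No | Thur | Lunch | 2 | 0.218818 |
| 172 | 7.25 | 5.15 | Yes | Sun | Dinner | 2 | 0.710345 |
| 178 | 9.6 | 4.0 | Yes | Sun | Dinner | 2 | 0.416667 |
| 67 | 3.07 | 1.0 | Yes | Sat | Dinner | 1 | 0.325733 |
| 221 | 13.42 | 3.48 | Yes | Fri | Lunch | 2 | 0.259314 |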


### Quantile and Bucket Analysis

We can combine `pandas.cut` or `pandas.qcut` with `groupby` to perform bucket
or quantile analysis on a dataset.
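
The two functions cut the data differently: `pandas.cut` makes bins of equal width, while `pandas.qcut` makes bins holding equal numbers of rows. The sketch below uses deliberately skewed made-up values (`values` and the other variable names are illustrative) so that the difference shows up in the group counts.

```python
import numpy as np
import pandas as pd

# Hypothetical skewed data: the squares 0, 1, 4, ..., 9801
values = pd.Series(np.arange(100) ** 2)

equal_width = pd.cut(values, 4)            # 4 bins of equal width
equal_count = pd.qcut(values, 4)           # 4 bins with equal numbers of rows
codes = pd.qcut(values, 4, labels=False)   # integer bin numbers instead of intervals

# Count the rows that fall in each bin
width_counts = values.groupby(equal_width, observed=False).count()
count_counts = values.groupby(equal_count, observed=False).count()

print(width_counts)
print(count_counts)
```

With skewed data the equal-width bins are unevenly populated, whereas the quantile bins each receive a quarter of the rows.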

In [99]:
data = pd.DataFrame({"data1": np.random.standard_normal(1000),
                     "data2": np.random.standard_normal(1000)})
data.head()

| | data1 | data2 |
|---|---|---|
| 0 | -0.456457 | 0.311566 |
| 1 | -0.988269 | 0.424796 |
| 2 | 0.242928 | -1.054064 |
| 3 | 1.246976 | 0.027102 |
| 4 | -1.083054 | 0.128056 |


In [101]:
quartiles = pd.cut(data["data1"], 4)

quartiles.head(6)

0    (-1.57, -0.0401]
1    (-1.57, -0.0401]
2     (-0.0401, 1.49]
3     (-0.0401, 1.49]
4    (-1.57, -0.0401]
5     (-0.0401, 1.49]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.106, -1.57] < (-1.57, -0.0401] < (-0.0401, 1.49] < (1.49, 3.02]]

In [102]:
def get_stats(group):
    return pd.DataFrame(
        {"min": group.min(), "max":group.max(),
         "count": group.count(), "mean":group.mean()}
    )


In [103]:
grouped = data.groupby(quartiles)

grouped.apply(get_stats)

| data1 | | min | max | count | mean |
|---|---|---|---|---|---|
| (-3.106, -1.57] | data1 | -3.099882 | -1.574652 | 76 | -2.018111 |
| (-3.106, -1.57] | data2 | -2.195629 | 3.04222 | 76 | -0.101893 |
| (-1.57, -0.0401] | data1 | -1.568625 | -0.041442 | 423 | -0.659082 |
| (-1.57, -0.0401] | data2 | -2.754118 | 2.842741 | 423 | 0.066365 |
| (-0.0401, 1.49] | data1 | -0.039729 | 1.484489 | 442 | 0.584653 |
| (-0.0401, 1.49] | data2 | -3.269956 | 3.17243 | 442 | -0.017231 |
| (1.49, 3.02] | data1 | 1.52713 | 3.019611 | 59 | 1.997996 |
| (1.49, 3.02] | data2 | -2.906824 | 2.401976 | 59 | 0.03938 |


In [104]:
grouped.agg(["min", "max", "count", "mean"])

| | data1 | data1 | data1 | data1 | data2 | data2 | data2 | data2 |
|---|---|---|---|---|---|---|---|---|
| data1 | min | max | count | mean | min | max | count | mean |
| (-3.106, -1.57] | -3.099882 | -1.574652 | 76 | -2.018111 | -2.195629 | 3.04222 | 76 | -0.101893 |
| (-1.57, -0.0401] | -1.568625 | -0.041442 | 423 | -0.659082 | -2.754118 | 2.842741 | 423 | 0.066365 |
| (-0.0401, 1.49] | -0.039729 | 1.484489 | 442 | 0.584653 | -3.269956 | 3.17243 | 442 | -0.017231 |
| (1.49, 3.02] | 1.52713 | 3.019611 | 59 | 1.997996 | -2.906824 | 2.401976 | 59 | 0.03938 |


***
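Generating equal-size bins (the same number of rows in each) with `qcut`,
together with their corresponding labels.

The labels can also be omitted (`labels=False`), in which case the integer bin
numbers are returned.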
***

In [106]:
qu_samp = pd.qcut(data["data1"], 4,
                  labels=[f"Q{i+1}" for i in np.arange(0,4)])

qu_samp.head()


0    Q2
1    Q1
2    Q3
3    Q4
4    Q1
Name: data1, dtype: category
Categories (4, object): ['Q1' < 'Q2' < 'Q3' < 'Q4']

In [108]:
grouped = data.groupby(qu_samp)
grouped.apply(get_stats)

| data1 | | min | max | count | mean |
|---|---|---|---|---|---|
| Q1 | data1 | -3.099882 | -0.723132 | 250 | -1.361963 |
| Q1 | data2 | -2.73069 | 3.04222 | 250 | 0.015431 |
| Q2 | data1 | -0.715537 | -0.039729 | 250 | -0.366868 |
| Q2 | data2 | -2.754118 | 2.702989 | 250 | 0.058493 |
| Q3 | data1 | -0.038684 | 0.627971 | 250 | 0.285919 |
| Q3 | data2 | -2.964407 | 3.17243 | 250 | -0.053369 |
| Q4 | data1 | 0.631146 | 3.019611 | 250 | 1.219433 |
| Q4 | data2 | -3.269956 | 2.564433 | 250 | 0.039589 |


In [112]:
# Now, qcut without labels

qu_samp = pd.qcut(data["data1"], 4,
                  labels=False)

grouped = data.groupby(qu_samp)
group_stats = grouped.apply(get_stats)

# Naming the index levels to identify them
group_stats.index.names = ["Quartile", "data_col"]
group_stats


| Quartile | data_col | min | max | count | mean |
|---|---|---|---|---|---|
| 0 | data1 | -3.099882 | -0.723132 | 250 | -1.361963 |
| 0 | data2 | -2.73069 | 3.04222 | 250 | 0.015431 |
| 1 | data1 | -0.715537 | -0.039729 | 250 | -0.366868 |
| 1 | data2 | -2.754118 | 2.702989 | 250 | 0.058493 |
| 2 | data1 | -0.038684 | 0.627971 | 250 | 0.285919 |
| 2 | data2 | -2.964407 | 3.17243 | 250 | -0.053369 |
| 3 | data1 | 0.631146 | 3.019611 | 250 | 1.219433 |
| 3 | data2 | -3.269956 | 2.564433 | 250 | 0.039589 |


### Example: Filling Missing Values with Group-Specific Values