In [5]:
import pandas as pd
import numpy as np

https://pandas.pydata.org/docs/user_guide/groupby.html

### Splitting: the data into groups based on some criteria.
### Applying: a function to each group independently.
### Combining: the results into a data structure.
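
The three steps above can be spelled out by hand. The following is a minimal sketch on a made-up frame (`toy`, `key`, and `value` are illustrative names, not part of this notebook); it only aims to show that a single `groupby(...).sum()` call gives the same result as an explicit split, apply, and combine.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook.
toy = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})

# Split: one sub-frame per distinct key.
pieces = {key: chunk for key, chunk in toy.groupby("key")}

# Apply: compute a statistic on each piece independently.
sums = {key: chunk["value"].sum() for key, chunk in pieces.items()}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(sums, name="value").rename_axis("key")

# The usual one-liner performs the same three steps internally.
direct = toy.groupby("key")["value"].sum()

print(manual.to_dict() == direct.to_dict())
```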


### Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
    - Compute group sums or means.
    - Compute group sizes / counts.

### Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
    - Standardize data (zscore) within a group.
    - Filling NAs within groups with a value derived from each group.

### Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:
    - Discard data that belongs to groups with only a few members.
    - Filter out data based on the group sum or mean.

### Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the above categories.
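
As a rough illustration of the three kinds of operation listed above, the sketch below runs one of each on a small made-up frame (the names `toy`, `key`, and `value` are assumptions for the example). Demeaning stands in for a full z-score to keep the numbers easy to check.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook.
toy = pd.DataFrame(
    {"key": ["a", "a", "b", "b", "b"], "value": [1.0, 3.0, 2.0, 4.0, 6.0]}
)
g = toy.groupby("key")["value"]

# Aggregation: one value per group.
means = g.mean()

# Transformation: result keeps the index of the input (here, demeaning per group).
centered = g.transform(lambda s: s - s.mean())

# Filtration: keep only the rows of groups that pass a group-wise test.
big = toy.groupby("key").filter(lambda d: len(d) > 2)

print(means)
print(centered)
print(big)
```

Note how the aggregation shrinks to one row per group, the transformation has as many rows as `toy`, and the filtration returns a subset of the original rows.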




In [6]:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 389.0),
        ("bird", "Psittaciformes", 24.0),
        ("mammal", "Carnivora", 80.2),
        ("mammal", "Primates", np.nan),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
df

,class,order,max_speed
falcon,bird,Falconiformes,389.0
parrot,bird,Psittaciformes,24.0
lion,mammal,Carnivora,80.2
monkey,mammal,Primates,
leopard,mammal,Carnivora,58.0


In [4]:
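# Group rows by the values of one column.
grouped = df.groupby("class")
print(grouped)

# Note: the `axis` argument of groupby is deprecated as of pandas 2.1;
# it is kept here as in the original example.
grouped = df.groupby("order", axis="columns")
print(grouped)

# Group rows by the combination of two columns.
grouped = df.groupby(["class", "order"])
print(grouped)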

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e649460>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e6499a0>
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x11e649460>


In [7]:
df = pd.DataFrame(
    {
        "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
        "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
        "C": np.random.randn(8),
        "D": np.random.randn(8),
    }
)

In [8]:
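# Group by a single column name so that each group key is a scalar.
grouped = df.groupby("A")
for name, group in grouped:
    print(name)
    print(group)
    print()

# numeric_only=True restricts the sum to the numeric columns C and D.
df.groupby("A").sum(numeric_only=True)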

bar
     A      B         C         D
1  bar    one  1.441084 -1.429318
3  bar  three  0.504004 -0.074721
5  bar    two  1.295684  0.877602

foo
     A      B         C         D
0  foo    one -0.418179 -1.070514
2  foo    two -1.329937 -1.080185
4  foo    two  0.193194 -1.790206
6  foo    one -1.173756  0.276146
7  foo  three  0.485690  0.946268



A,C,D
bar,3.240772,-0.626437
foo,-2.242987,-2.718491


In [9]:
df.groupby(["A", "B"]).sum()

A,B,C,D
bar,one,1.441084,-1.429318
bar,three,0.504004,-0.074721
bar,two,1.295684,0.877602
foo,one,-1.591935,-0.794369
foo,three,0.48569,0.946268
foo,two,-1.136743,-2.870391


In [59]:
grouped.size()

A
bar    3
foo    5
dtype: int64

In [60]:
grouped.describe()

,C,C,C,C,C,C,C,C,D,D,D,D,D,D,D,D
A,count,mean,std,min,25%,50%,75%,max,count,mean,std,min,25%,50%,75%,max
bar,3.0,0.223429,0.155849,0.050088,0.159156,0.268224,0.3101,0.351976,3.0,-0.553889,0.425876,-1.004151,-0.752067,-0.499983,-0.328758,-0.157533
foo,5.0,-0.091039,0.805698,-1.000637,-0.552773,-0.09618,0.045533,1.148861,5.0,-0.271918,0.852381,-1.181916,-0.906459,-0.298524,0.044332,0.982975


In [61]:
df.groupby(["A", "B"], as_index=False).sum()

,A,B,C,D
0,bar,one,0.268224,-0.157533
1,bar,three,0.351976,-0.499983
2,bar,two,0.050088,-1.004151
3,foo,one,0.148224,-2.088375
4,foo,three,0.045533,0.044332
5,foo,two,-0.648952,0.684451
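
With `as_index=False` the group keys come back as ordinary columns rather than as the index. A minimal sketch on made-up data (`toy` is an illustrative name) suggests this matches calling `reset_index()` on the default result:

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook.
toy = pd.DataFrame(
    {"A": ["foo", "bar", "foo"], "B": ["one", "one", "one"], "C": [1.0, 2.0, 3.0]}
)

# Group keys returned as regular columns.
flat = toy.groupby(["A", "B"], as_index=False).sum()

# Default behaviour (keys in a MultiIndex), flattened afterwards.
via_reset = toy.groupby(["A", "B"]).sum().reset_index()

print(flat)
print(flat.equals(via_reset))
```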


| Function | Description |
| --- | --- |
| mean() | Compute mean of groups |
| sum() | Compute sum of group values |
| size() | Compute group sizes |
| count() | Compute count of group |
| std() | Standard deviation of groups |
| var() | Compute variance of groups |
| sem() | Standard error of the mean of groups |
| describe() | Generates descriptive statistics |
| first() | Compute first of group values |
| last() | Compute last of group values |
| nth() | Take nth value, or a subset if n is a list |
| min() | Compute min of group values |
| max() | Compute max of group values |
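
A few of these functions are easy to confuse, in particular `size()` and `count()`. The sketch below, on made-up data with one missing value (`toy`, `key`, and `value` are illustrative names), shows that `size()` counts all rows in a group while `count()` counts only non-missing values.

```python
import pandas as pd

# Hypothetical toy data with one missing value in group "a".
toy = pd.DataFrame(
    {"key": ["a", "a", "b", "b", "b"], "value": [10.0, None, 1.0, 2.0, 3.0]}
)
g = toy.groupby("key")["value"]

sizes = g.size()     # rows per group, missing values included
counts = g.count()   # non-missing values per group
maxima = g.max()     # largest non-missing value per group

print(sizes)
print(counts)
print(maxima)
```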


In [2]:
grouped = df.groupby("A")
# String names select pandas' built-in group reductions.
grouped["C"].agg(["sum", "mean", "std"])


In [10]:
# Select the numeric columns first; mean and std are not defined for the string column B.
grouped[["C", "D"]].agg(["sum", "mean", "std"])

,C,C,C,D,D,D
A,sum,mean,std,sum,mean,std
bar,3.240772,1.080257,0.504318,-0.626437,-0.208812,1.159291
foo,-2.242987,-0.448597,0.804414,-2.718491,-0.543698,1.119294


https://rfriend.tistory.com/403

The transform method returns an object that is indexed the same (same size) as the one being grouped. The transform function must:

- Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk (e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
- Operate column-by-column on the group chunk. The transform is applied to the first group chunk using chunk.apply.
- Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes to a group chunk may produce unexpected results. For example, when using fillna, inplace must be False (grouped.transform(lambda x: x.fillna(inplace=False))).
- (Optionally) operate on the entire group chunk. If this is supported, a fast path is used starting from the second chunk.

Similar to Aggregations with User-Defined Functions, the resulting dtype will reflect that of the transformation function. If the results from different groups have different dtypes, then a common dtype will be determined in the same way as DataFrame construction.
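
The rules above can be seen in a short sketch. It uses made-up data (`toy`, `key`, and `value` are illustrative names) and shows two cases: a per-group scalar broadcast back to every row, and a group-wise `fillna` that returns a new object instead of modifying the chunk in place.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in group "a".
toy = pd.DataFrame(
    {"key": ["a", "a", "b", "b"], "value": [1.0, np.nan, 4.0, 8.0]}
)
g = toy.groupby("key")["value"]

# A scalar per group is broadcast to the size of that group.
group_mean = g.transform("mean")

# Fill missing values with the mean of their own group.
# fillna returns a new Series here; the group chunk itself is left untouched.
filled = g.transform(lambda s: s.fillna(s.mean()))

print(group_mean)
print(filled)
```

Both results have the same index as `toy`, so they can be assigned straight back as new columns.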

In [12]:
# A dict maps each column to its own aggregation; D uses a sample standard deviation (ddof=1).
grouped.agg({"C": "sum", "D": lambda x: np.std(x, ddof=1)})

A,C,D
bar,3.240772,1.159291
foo,-2.242987,1.119294


The filter method returns a subset of the original object. The function passed to filter is applied to each group as a whole and must return True or False. Suppose we want to keep only elements that belong to groups with more than two members.


In [15]:
dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})
dff

,A,B
0,0,a
1,1,a
2,2,b
3,3,b
4,4,b
5,5,b
6,6,c
7,7,c


In [16]:
grouped = dff.groupby("B")
# Keep only the rows of groups that have more than two members.
grouped.filter(lambda x: len(x) > 2)

,A,B
2,2,b
3,3,b
4,4,b
5,5,b
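
The filter condition can also be based on a group statistic rather than on group size. As a sketch, the block below rebuilds the same `dff` frame and keeps only groups whose sum of `A` is greater than 2; the threshold is chosen purely for illustration.

```python
import numpy as np
import pandas as pd

dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc")})

# Group sums of A: a -> 0 + 1 = 1, b -> 2 + 3 + 4 + 5 = 14, c -> 6 + 7 = 13.
# Only groups b and c pass the test, so the two rows of group a are dropped.
kept = dff.groupby("B").filter(lambda g: g["A"].sum() > 2)

print(kept)
```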


In [21]:
grouped = df.groupby("A")
grouped["C"].apply(lambda x: x.describe())

A         
bar  count    3.000000
     mean     1.080257
     std      0.504318
     min      0.504004
     25%      0.899844
     50%      1.295684
     75%      1.368384
     max      1.441084
foo  count    5.000000
     mean    -0.448597
     std      0.804414
     min     -1.329937
     25%     -1.173756
     50%     -0.418179
     75%      0.193194
     max      0.485690
Name: C, dtype: float64
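
When the applied function returns a Series, as `describe()` does, `apply` stacks the per-group results under a two-level index. A small sketch on made-up data (`toy` is an illustrative name) suggests that unstacking the inner level recovers the same table that calling `describe()` directly on the grouped column would give.

```python
import pandas as pd

# Hypothetical toy data, not taken from the notebook.
toy = pd.DataFrame(
    {"A": ["foo", "bar", "foo", "bar"], "C": [1.0, 2.0, 3.0, 6.0]}
)

# apply: one Series per group, stacked under a (group, statistic) MultiIndex.
stacked = toy.groupby("A")["C"].apply(lambda x: x.describe())

# Move the statistic level into columns to get one row per group.
wide = stacked.unstack()

# Built-in equivalent.
direct = toy.groupby("A")["C"].describe()

print(wide)
print(direct)
```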