# Groupby


By “group by” we are referring to a process involving one or more of the following steps:

**Splitting** the data into groups based on some criteria.

**Applying** a function to each group independently.

**Combining** the results into a data structure.

Of these, the split step is the most straightforward. In many situations we may wish to split the data set into groups and do something with those groups. In the apply step, we might wish to do one of the following:

**Aggregation**: compute a summary statistic (or statistics) for each group. Some examples:

- Compute group sums or means.
- Compute group sizes / counts.

**Transformation**: perform some group-specific computations and return a like-indexed object. Some examples:

- Standardize data (z-score) within a group.
- Fill NAs within groups with a value derived from each group.

**Filtration**: discard some groups, according to a group-wise computation that evaluates True or False. Some examples:

- Discard data that belongs to groups with only a few members.
- Filter out data based on the group sum or mean.

Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly combined result if it doesn’t fit into any of the three categories above.

[Source](https://pandas.pydata.org/pandas-docs/stable/groupby.html)
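The three kinds of apply step described above can be sketched on a small table. This is a minimal illustration, assuming a hypothetical `toy` DataFrame with made-up `key` and `value` columns; it is not the salary data used in the rest of this notebook.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
toy = pd.DataFrame({
    "key": ["a", "a", "b", "b", "b", "c"],
    "value": [1.0, 3.0, 2.0, 4.0, 6.0, 10.0],
})
grouped = toy.groupby("key")["value"]

# Aggregation: one summary value per group.
group_means = grouped.mean()

# Transformation: the result keeps the index of the input.
# Each value is centered on the mean of its own group.
centered = grouped.transform(lambda s: s - s.mean())

# Filtration: keep only rows that belong to groups with at least two members.
big_groups = toy.groupby("key").filter(lambda g: len(g) >= 2)

print(group_means)
print(centered)
print(big_groups)
```

Aggregation shrinks the data to one row per group, transformation returns one value per original row, and filtration returns a subset of the original rows.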

In [1]:
import numpy as np
import pandas as pd

# Create dataframe
# Source: https://www.cnblogs.com/kungfupanda/p/6578112.html
df = pd.DataFrame({'Company' : ['Amazon', 'Google', 'Microsoft', 'Amazon', 'Google', 'Microsoft', 'Amazon', 'Google', 'Microsoft'],
                   'Level' : ['Principal Software Engineer', 'T6', 'Principal SDE', 'SDE3', 'T5', 'Senior SDE', 'SDE2', 'T4', 'SDE2'],
                   'Salary' : [445000, 472500, 261000, 180000, 306500, 199000, 147500, 201000, 143000]})

In [2]:
df

|   | Company | Level | Salary |
|---|---|---|---|
| 0 | Amazon | Principal Software Engineer | 445000 |
| 1 | Google | T6 | 472500 |
| 2 | Microsoft | Principal SDE | 261000 |
| 3 | Amazon | SDE3 | 180000 |
| 4 | Google | T5 | 306500 |
| 5 | Microsoft | Senior SDE | 199000 |
| 6 | Amazon | SDE2 | 147500 |
| 7 | Google | T4 | 201000 |
| 8 | Microsoft | SDE2 | 143000 |


Now we can use the .groupby() method to group rows together based on a column name. For instance, let's group by **Company**. This creates a DataFrameGroupBy object:

In [3]:
df.groupby('Company')

<pandas.core.groupby.DataFrameGroupBy object at 0x7fd870a6af28>

We can save this object as a new variable:

In [4]:
by_Company = df.groupby("Company")
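Before aggregating, it can help to look inside the grouped object. The sketch below rebuilds the same data (the `Level` column is omitted for brevity) and uses the `ngroups` and `groups` attributes, `get_group()`, and plain iteration; the variable names are just illustrative.

```python
import pandas as pd

# Same companies and salaries as the notebook's df, without the Level column.
df = pd.DataFrame({
    "Company": ["Amazon", "Google", "Microsoft"] * 3,
    "Salary": [445000, 472500, 261000, 180000, 306500,
               199000, 147500, 201000, 143000],
})
by_company = df.groupby("Company")

# How many groups the split produced.
print(by_company.ngroups)

# Mapping from each group label to the row labels that belong to it.
amazon_rows = list(by_company.groups["Amazon"])
print(amazon_rows)

# Pull out the rows of a single group as an ordinary DataFrame.
google = by_company.get_group("Google")
print(google)

# Iterating yields (label, sub-DataFrame) pairs.
for name, group in by_company:
    print(name, len(group))
```

No computation on the values happens until an aggregate, transform, or filter method is called on the object.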

We can then call aggregate methods on the object:

In [5]:
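by_Company.mean(numeric_only=True)
# numeric_only=True skips string columns such as Level. Older pandas versions
# dropped them silently; pandas 2.0 and later raise a TypeError without this argument.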

| Company | Salary |
|---|---|
| Amazon | 257500.0 |
| Google | 326666.666667 |
| Microsoft | 201000.0 |


In [6]:
df.groupby('Company').mean(numeric_only=True)
# Same result without the intermediate variable

| Company | Salary |
|---|---|
| Amazon | 257500.0 |
| Google | 326666.666667 |
| Microsoft | 201000.0 |
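Another way to keep string columns out of a numeric aggregate is to select the column before aggregating. A minimal sketch, assuming the same data as above; selecting with a single label gives a Series, while a list of labels gives a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Amazon", "Google", "Microsoft"] * 3,
    "Level": ["Principal Software Engineer", "T6", "Principal SDE",
              "SDE3", "T5", "Senior SDE", "SDE2", "T4", "SDE2"],
    "Salary": [445000, 472500, 261000, 180000, 306500,
               199000, 147500, 201000, 143000],
})

# Single label: the result is a Series indexed by Company.
salary_mean = df.groupby("Company")["Salary"].mean()

# List of labels: the result is a DataFrame with one Salary column.
salary_mean_frame = df.groupby("Company")[["Salary"]].mean()

print(salary_mean)
print(salary_mean_frame)
```

Because only `Salary` is selected, the `Level` strings never reach `mean()`.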


More examples of aggregate methods

In [7]:
by_Company.std(numeric_only=True)
# As with mean(), only numeric columns are included

| Company | Salary |
|---|---|
| Amazon | 163190.839204 |
| Google | 136868.854504 |
| Microsoft | 59025.418253 |


In [8]:
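by_Company.min()
# Level appears here because strings can be ordered: min() returns the first value in
# lexicographic order. Each column is reduced independently, so the Level and Salary
# in one output row need not come from the same original row.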

| Company | Level | Salary |
|---|---|---|
| Amazon | Principal Software Engineer | 147500 |
| Google | T4 | 201000 |
| Microsoft | Principal SDE | 143000 |


In [9]:
by_Company.max()

| Company | Level | Salary |
|---|---|---|
| Amazon | SDE3 | 445000 |
| Google | T6 | 472500 |
| Microsoft | Senior SDE | 261000 |
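Note that min() and max() reduce each column separately, so a row such as Amazon / SDE3 / 445000 mixes values from different employees. To recover the complete row of the highest-paid level in each company, one option is `idxmax()` followed by `.loc`. This is a sketch of that approach, not the only way to do it.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Amazon", "Google", "Microsoft"] * 3,
    "Level": ["Principal Software Engineer", "T6", "Principal SDE",
              "SDE3", "T5", "Senior SDE", "SDE2", "T4", "SDE2"],
    "Salary": [445000, 472500, 261000, 180000, 306500,
               199000, 147500, 201000, 143000],
})

# Column-wise maximum: Level and Salary are maximized independently.
column_wise_max = df.groupby("Company").max()

# Row label of the largest Salary within each company.
top_labels = df.groupby("Company")["Salary"].idxmax()

# Use those labels to select the complete original rows.
top_rows = df.loc[top_labels]

print(column_wise_max)
print(top_rows)
```

Here `top_rows` pairs each maximum salary with the level that actually earns it.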


In [10]:
by_Company.count()

| Company | Level | Salary |
|---|---|---|
| Amazon | 3 | 3 |
| Google | 3 | 3 |
| Microsoft | 3 | 3 |
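In the table above every cell is 3 because the data has no missing values. count() reports non-null entries per column, whereas size() reports the number of rows per group. The difference shows up once a value is missing, as in this small sketch with a hypothetical `toy` table (the NaN is invented for illustration).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing salary.
toy = pd.DataFrame({
    "Company": ["Amazon", "Amazon", "Google"],
    "Salary": [180000, np.nan, 201000],
})

# Non-null values per column, per group.
counts = toy.groupby("Company").count()

# Number of rows per group, missing values included.
sizes = toy.groupby("Company").size()

print(counts)
print(sizes)
```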


In [11]:
by_Company.describe()

| Company | Salary count | Salary mean | Salary std | Salary min | Salary 25% | Salary 50% | Salary 75% | Salary max |
|---|---|---|---|---|---|---|---|---|
| Amazon | 3.0 | 257500.0 | 163190.839204 | 147500.0 | 163750.0 | 180000.0 | 312500.0 | 445000.0 |
| Google | 3.0 | 326666.666667 | 136868.854504 | 201000.0 | 253750.0 | 306500.0 | 389500.0 | 472500.0 |
| Microsoft | 3.0 | 201000.0 | 59025.418253 | 143000.0 | 171000.0 | 199000.0 | 230000.0 | 261000.0 |
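When only a few of these statistics are needed, agg() can compute a chosen subset instead of the full describe() output. The sketch below shows a list of function names and the keyword form known as named aggregation; the output column names `lowest` and `highest` are arbitrary choices for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Amazon", "Google", "Microsoft"] * 3,
    "Salary": [445000, 472500, 261000, 180000, 306500,
               199000, 147500, 201000, 143000],
})

# A list of function names gives one output column per function.
summary = df.groupby("Company")["Salary"].agg(["mean", "min", "max"])

# Named aggregation: output_name=(input_column, function).
named = df.groupby("Company").agg(
    lowest=("Salary", "min"),
    highest=("Salary", "max"),
)

print(summary)
print(named)
```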


In [12]:
by_Company.describe().transpose()

|   | Company | Amazon | Google | Microsoft |
|---|---|---|---|---|
| Salary | count | 3.0 | 3.0 | 3.0 |
| Salary | mean | 257500.0 | 326666.666667 | 201000.0 |
| Salary | std | 163190.839204 | 136868.854504 | 59025.418253 |
| Salary | min | 147500.0 | 201000.0 | 143000.0 |
| Salary | 25% | 163750.0 | 253750.0 | 171000.0 |
| Salary | 50% | 180000.0 | 306500.0 | 199000.0 |
| Salary | 75% | 312500.0 | 389500.0 | 230000.0 |
| Salary | max | 445000.0 | 472500.0 | 261000.0 |


In [13]:
by_Company.describe().transpose()['Google']

Salary  count         3.000000
        mean     326666.666667
        std      136868.854504
        min      201000.000000
        25%      253750.000000
        50%      306500.000000
        75%      389500.000000
        max      472500.000000
Name: Google, dtype: float64
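The same per-company statistics can be read without transposing, by selecting the row with `.loc`. A minimal sketch, assuming the same salary data; `Salary` is selected explicitly so that only the numeric column is described.

```python
import pandas as pd

df = pd.DataFrame({
    "Company": ["Amazon", "Google", "Microsoft"] * 3,
    "Salary": [445000, 472500, 261000, 180000, 306500,
               199000, 147500, 201000, 143000],
})

described = df.groupby("Company")[["Salary"]].describe()

# Row selection on the Company index.
via_loc = described.loc["Google"]

# Column selection after swapping rows and columns, as in the cell above.
via_transpose = described.transpose()["Google"]

print(via_loc)
print(via_loc.equals(via_transpose))
```

Both expressions return a Series indexed by (column, statistic) pairs, so an individual value can be read with a tuple key such as `via_loc[("Salary", "mean")]`.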