# Apply

The most general `GroupBy` method is `apply`. It splits the object to be processed, calls the passed function on each part and then tries to concatenate the results.

Suppose we want to select the five largest values of a column by group. To do this, we first write a function that selects the rows with the largest values in a particular column:

In [1]:
import numpy as np
import pandas as pd

In [2]:
df = pd.DataFrame(
    {
        "2021-12": [30134, 6073, 4873, None, 427, 95],
        "2022-01": [33295, 7716, 3930, None, 276, 226],
        "2022-02": [19651, 6547, 2573, None, 525, 157],
    },
    index=[
        [
            "Jupyter Tutorial",
            "Jupyter Tutorial",
            "PyViz Tutorial",
            "PyViz Tutorial",
            "Python Basics",
            "Python Basics",
        ],
        ["de", "en", "de", "en", "de", "en"],
    ],
)
df.index.names = ["Title", "Language"]

df

Title,Language,2021-12,2022-01,2022-02
Jupyter Tutorial,de,30134.0,33295.0,19651.0
Jupyter Tutorial,en,6073.0,7716.0,6547.0
PyViz Tutorial,de,4873.0,3930.0,2573.0
PyViz Tutorial,en,,,
Python Basics,de,427.0,276.0,525.0
Python Basics,en,95.0,226.0,157.0


In [3]:
def top(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]


top(df, n=3)

Title,Language,2021-12,2022-01,2022-02
Jupyter Tutorial,de,30134.0,33295.0,19651.0
Jupyter Tutorial,en,6073.0,7716.0,6547.0
PyViz Tutorial,de,4873.0,3930.0,2573.0


If we now group by titles, for example, and call `apply` with this function, we get the following:

In [4]:
grouped_titles = df.groupby("Title", as_index=False)

grouped_titles.apply(top)

,Title,Language,2021-12,2022-01,2022-02
0,Jupyter Tutorial,de,30134.0,33295.0,19651.0
0,Jupyter Tutorial,en,6073.0,7716.0,6547.0
1,PyViz Tutorial,de,4873.0,3930.0,2573.0
1,PyViz Tutorial,en,,,
2,Python Basics,de,427.0,276.0,525.0
2,Python Basics,en,95.0,226.0,157.0


What happened here? The `top` function is called for each group of rows of the `DataFrame`, and then the results are concatenated with [pandas.concat](https://pandas.pydata.org/docs/reference/api/pandas.concat.html), labelling the parts with the group keys. Because we passed `as_index=False`, these labels are consecutive group numbers instead of the titles. The result therefore has a hierarchical index whose inner levels contain the index values from the original `DataFrame`.
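
The split, apply and combine steps can also be reproduced by hand. The following is a minimal sketch, assuming a reduced version of the data above with only the `2021-12` column; the names `parts`, `manual` and `via_apply` are illustrative and not part of the original notebook:

```python
import pandas as pd

# Reduced version of the tutorial data with a single month column
df = pd.DataFrame(
    {"2021-12": [30134, 6073, 4873, None, 427, 95]},
    index=pd.MultiIndex.from_arrays(
        [
            ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
             "PyViz Tutorial", "Python Basics", "Python Basics"],
            ["de", "en", "de", "en", "de", "en"],
        ],
        names=["Title", "Language"],
    ),
)


def top(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]


# Split: iterate over the groups; apply: call top on each part
parts = {title: top(group, n=1) for title, group in df.groupby("Title")}

# Combine: concatenate the parts, labelling each one with its group key
manual = pd.concat(parts)

# The same steps carried out by GroupBy.apply
via_apply = df.groupby("Title").apply(top, n=1)

print(manual)
print(via_apply)
```

Both results should contain the same rows; only the name of the outermost index level is expected to differ, because `pd.concat` does not know the name of the grouping key.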

If you pass a function to `apply` that takes other arguments or keywords, you can pass them after the function:

In [5]:
grouped_titles = df.groupby("Title", as_index=False)

grouped_titles.apply(top, n=1)

,Title,Language,2021-12,2022-01,2022-02
0,Jupyter Tutorial,de,30134.0,33295.0,19651.0
1,PyViz Tutorial,de,4873.0,3930.0,2573.0
2,Python Basics,de,427.0,276.0,525.0


We have now seen the basic usage of `apply`. What happens inside the passed function is very versatile and up to you; it only has to return a pandas object or a single value. In the following, we will therefore mainly show examples that can give you ideas on how to solve various problems with `groupby`.

First, let’s look again at `describe`, called over the `GroupBy` object:

In [6]:
grouped_titles = df.groupby("Title")

result = grouped_titles.describe()

result

,2021-12,2021-12,2021-12,2021-12,2021-12,2021-12,2021-12,2021-12,2022-01,2022-01,2022-01,2022-01,2022-01,2022-02,2022-02,2022-02,2022-02,2022-02,2022-02,2022-02,2022-02
Title,count,mean,std,min,25%,50%,75%,max,count,mean,...,75%,max,count,mean,std,min,25%,50%,75%,max
Jupyter Tutorial,2.0,18103.5,17013.696262,6073.0,12088.25,18103.5,24118.75,30134.0,2.0,20505.5,...,26900.25,33295.0,2.0,13099.0,9265.927261,6547.0,9823.0,13099.0,16375.0,19651.0
PyViz Tutorial,1.0,4873.0,,4873.0,4873.0,4873.0,4873.0,4873.0,1.0,3930.0,...,3930.0,3930.0,1.0,2573.0,,2573.0,2573.0,2573.0,2573.0,2573.0
Python Basics,2.0,261.0,234.759451,95.0,178.0,261.0,344.0,427.0,2.0,251.0,...,263.5,276.0,2.0,341.0,260.215295,157.0,249.0,341.0,433.0,525.0


When you call a method like `describe` on a `GroupBy` object, it is actually just shorthand for:

In [7]:
f = lambda x: x.describe()
grouped_titles.apply(f)

Title,,2021-12,2022-01,2022-02
Jupyter Tutorial,count,2.0,2.0,2.0
Jupyter Tutorial,mean,18103.5,20505.5,13099.0
Jupyter Tutorial,std,17013.696262,18087.084356,9265.927261
Jupyter Tutorial,min,6073.0,7716.0,6547.0
Jupyter Tutorial,25%,12088.25,14110.75,9823.0
Jupyter Tutorial,50%,18103.5,20505.5,13099.0
Jupyter Tutorial,75%,24118.75,26900.25,16375.0
Jupyter Tutorial,max,30134.0,33295.0,19651.0
PyViz Tutorial,count,1.0,1.0,1.0
PyViz Tutorial,mean,4873.0,3930.0,2573.0
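
The two calls hold the same numbers but arrange them differently: `describe` on the `GroupBy` object puts the statistics into the columns, while `apply` stacks them into an inner index level. A small sketch, assuming a reduced two-title data set and the illustrative names `direct` and `via_apply`, looks up the same statistic in both results:

```python
import pandas as pd

# Reduced version of the tutorial data: two titles, one month column
df = pd.DataFrame(
    {"2021-12": [30134, 6073, 427, 95]},
    index=pd.MultiIndex.from_arrays(
        [
            ["Jupyter Tutorial", "Jupyter Tutorial",
             "Python Basics", "Python Basics"],
            ["de", "en", "de", "en"],
        ],
        names=["Title", "Language"],
    ),
)

grouped = df.groupby("Title")

# Statistics end up as an inner column level
direct = grouped.describe()

# Statistics end up as an inner index level
via_apply = grouped.apply(lambda x: x.describe())

# The mean of "Python Basics" in December 2021, looked up in both layouts
mean_direct = direct.loc["Python Basics", ("2021-12", "mean")]
mean_apply = via_apply.loc[("Python Basics", "mean"), "2021-12"]

print(mean_direct, mean_apply)
```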


## Suppression of the group keys

In the previous examples, you saw that the resulting object has a hierarchical index formed by the group keys together with the indices of the individual parts of the original object. You can disable this by passing `group_keys=False` to `groupby`:

In [8]:
grouped_lang = df.groupby("Language", group_keys=False)

grouped_lang.apply(top)

Title,Language,2021-12,2022-01,2022-02
Jupyter Tutorial,de,30134.0,33295.0,19651.0
PyViz Tutorial,de,4873.0,3930.0,2573.0
Python Basics,de,427.0,276.0,525.0
Jupyter Tutorial,en,6073.0,7716.0,6547.0
Python Basics,en,95.0,226.0,157.0
PyViz Tutorial,en,,,
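
The effect of `group_keys` can be seen most directly in the number of index levels. The following sketch assumes a reduced version of the data and a recent pandas version; `with_keys` and `without_keys` are illustrative names:

```python
import pandas as pd

# Reduced version of the tutorial data: two titles, one month column
df = pd.DataFrame(
    {"2021-12": [30134, 6073, 427, 95]},
    index=pd.MultiIndex.from_arrays(
        [
            ["Jupyter Tutorial", "Jupyter Tutorial",
             "Python Basics", "Python Basics"],
            ["de", "en", "de", "en"],
        ],
        names=["Title", "Language"],
    ),
)


def top(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]


# The group key is prepended as an additional outer index level
with_keys = df.groupby("Language", group_keys=True).apply(top, n=1)

# Only the original index levels (Title, Language) remain
without_keys = df.groupby("Language", group_keys=False).apply(top, n=1)

print(with_keys.index.nlevels, without_keys.index.nlevels)
```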


## Quantile and bucket analysis

As described in [discretisation and grouping](discretisation.ipynb), pandas has some tools, especially `cut` and `qcut`, to split data into buckets with bins of your choice or by sample quantiles. Combine these functions with `groupby` and you can conveniently perform bucket or quantile analysis on a dataset. Consider a simple random data set and a bucket categorisation of equal length with `cut`:

In [9]:
rng = np.random.default_rng()
df2 = pd.DataFrame(
    {
        "data1": rng.normal(size=1000),
        "data2": rng.normal(size=1000)
    }
)

quartiles = pd.cut(df2.data1, 4)

quartiles[:10]

0    (-1.985, -0.309]
1    (-3.667, -1.985]
2    (-1.985, -0.309]
3    (-1.985, -0.309]
4    (-1.985, -0.309]
5     (-0.309, 1.367]
6    (-1.985, -0.309]
7      (1.367, 3.043]
8    (-1.985, -0.309]
9     (-0.309, 1.367]
Name: data1, dtype: category
Categories (4, interval[float64, right]): [(-3.667, -1.985] < (-1.985, -0.309] < (-0.309, 1.367] < (1.367, 3.043]]

The categorical `Series` returned by `cut` can be passed directly to `groupby`. So we could calculate a set of group statistics for the four buckets (stored in `quartiles`, although equal-width bins are not true quartiles) as follows:

In [10]:
def stats(group):
    return pd.DataFrame(
        {
            "min": group.min(),
            "max": group.max(),
            "count": group.count(),
            "mean": group.mean(),
        }
    )


grouped_quart = df2.groupby(quartiles, observed=False)

grouped_quart.apply(stats)

data1,,min,max,count,mean
"(-3.667, -1.985]",data1,-3.660514,-1.997396,31,-2.42246
"(-3.667, -1.985]",data2,-1.729069,2.380827,31,0.193483
"(-1.985, -0.309]",data1,-1.974745,-0.310654,356,-0.941064
"(-1.985, -0.309]",data2,-2.772165,3.049964,356,0.00615
"(-0.309, 1.367]",data1,-0.30666,1.366418,525,0.39829
"(-0.309, 1.367]",data2,-3.678267,3.307415,525,-0.013217
"(1.367, 3.043]",data1,1.36927,3.042972,88,1.813557
"(1.367, 3.043]",data2,-2.885412,2.047945,88,0.04575


These were buckets of equal length; to calculate buckets of equal size based on sample quantiles, we can use `qcut`. I pass `labels=False` to get only quantile numbers:

In [11]:
quartiles_samp = pd.qcut(df2.data1, 4, labels=False)

grouped_quart_samp = df2.groupby(quartiles_samp)

grouped_quart_samp.apply(stats)

data1,,min,max,count,mean
0,data1,-3.660514,-0.743539,250,-1.365801
0,data2,-2.772165,2.830985,250,-0.040007
1,data1,-0.742822,-0.01969,250,-0.345565
1,data2,-2.209969,3.049964,250,0.146497
2,data1,-0.017901,0.627629,250,0.301905
2,data2,-2.764695,3.307415,250,-0.077263
3,data1,0.628714,3.042972,250,1.243782
3,data2,-3.678267,2.260646,250,-0.008129
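
The difference between the two functions shows up in the number of observations per bucket. The following sketch assumes a fixed seed, which the original notebook does not use, and the illustrative names `equal_width` and `equal_count`:

```python
import numpy as np
import pandas as pd

# Fixed seed so that the example is reproducible (an assumption of this sketch)
rng = np.random.default_rng(42)
values = pd.Series(rng.normal(size=1000))

# cut: four buckets of equal length, so the counts differ
equal_width = pd.cut(values, 4)

# qcut: four buckets based on sample quantiles, so the counts are equal
equal_count = pd.qcut(values, 4, labels=False)

width_counts = equal_width.value_counts(sort=False)
quantile_counts = equal_count.value_counts(sort=False)

print(width_counts)
print(quantile_counts)
```

With `cut`, most observations of normally distributed data fall into the middle buckets, whereas `qcut` places 250 of the 1000 values into each bucket.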


## Populating data with group-specific values

When cleaning missing data, in some cases you will remove observations with `dropna`, but in other cases you may want to fill the null values (`NA`) with a fixed value or a value derived from the data. `fillna` is the right tool for this; here, for example, I fill the null values with the mean:

In [12]:
s = pd.Series(rng.normal(size=8))
s[::3] = np.nan

s

0         NaN
1    0.682415
2   -0.463491
3         NaN
4    0.397419
5    0.607853
6         NaN
7    0.060891
dtype: float64

In [13]:
s.fillna(s.mean())

0    0.257018
1    0.682415
2   -0.463491
3    0.257018
4    0.397419
5    0.607853
6    0.257018
7    0.060891
dtype: float64

Suppose you want the fill value to vary by group. Our sample data in `df` is divided into German and English editions, so the fill values can be predefined per language. Since each group has an internal `name` attribute, you can look up the matching value with `apply`:

In [14]:
fill_values = {"de": 10632, "en": 3469}

fill_func = lambda g: g.fillna(fill_values[g.name])

df.groupby("Language").apply(fill_func)

Language,Title,Language,2021-12,2022-01,2022-02
de,Jupyter Tutorial,de,30134.0,33295.0,19651.0
de,PyViz Tutorial,de,4873.0,3930.0,2573.0
de,Python Basics,de,427.0,276.0,525.0
en,Jupyter Tutorial,en,6073.0,7716.0,6547.0
en,PyViz Tutorial,en,3469.0,3469.0,3469.0
en,Python Basics,en,95.0,226.0,157.0


You can also group the data and use `apply` with a function that calls `fillna` for each group, here with the mean of the respective group:

In [15]:
fill_mean = lambda g: g.fillna(g.mean())

df.groupby("Language").apply(fill_mean)

Language,Title,Language,2021-12,2022-01,2022-02
de,Jupyter Tutorial,de,30134.0,33295.0,19651.0
de,PyViz Tutorial,de,4873.0,3930.0,2573.0
de,Python Basics,de,427.0,276.0,525.0
en,Jupyter Tutorial,en,6073.0,7716.0,6547.0
en,PyViz Tutorial,en,3084.0,3971.0,3352.0
en,Python Basics,en,95.0,226.0,157.0
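
If the original index and row order should be kept, the group means can also be computed with `transform` and passed to `fillna` directly. This is an alternative sketch, not the approach of the notebook; it assumes a reduced data set with two month columns, and `group_means` and `filled` are illustrative names:

```python
import pandas as pd

# Reduced version of the tutorial data with two month columns
df = pd.DataFrame(
    {
        "2021-12": [30134, 6073, 4873, None, 427, 95],
        "2022-01": [33295, 7716, 3930, None, 276, 226],
    },
    index=pd.MultiIndex.from_arrays(
        [
            ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
             "PyViz Tutorial", "Python Basics", "Python Basics"],
            ["de", "en", "de", "en", "de", "en"],
        ],
        names=["Title", "Language"],
    ),
)

# Per-language column means, broadcast back to the shape of df
group_means = df.groupby("Language").transform("mean")

# Fill only the missing cells; index and row order stay unchanged
filled = df.fillna(group_means)

print(filled)
```

The missing English PyViz values are replaced by the means of the other English rows, as in the `apply` variant, but without the additional `Language` index level.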


## Group weighted average

Since operations between columns in a `DataFrame` or two `Series` are possible, we can calculate the group-weighted average, for example:

In [16]:
df3 = pd.DataFrame(
    {
        "category": ["de", "de", "de", "de", "en", "en", "en", "en"],
        "data": rng.integers(100000, size=8),
        "weights": rng.random(8),
    }
)

df3

,category,data,weights
0,de,30354,0.283774
1,de,28408,0.416312
2,de,21630,0.128192
3,de,75974,0.006033
4,en,14334,0.391982
5,en,56144,0.028325
6,en,46108,0.326988
7,en,33759,0.714362


The group average weighted by category would then be:

In [17]:
grouped_cat = df3.groupby("category")
get_wavg = lambda g: np.average(g["data"], weights=g["weights"])

grouped_cat.apply(get_wavg, include_groups=False)

category
de    28372.421657
en    31746.068449
dtype: float64
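
The weighted average of a group is the sum of `data * weights` divided by the sum of `weights`. The following sketch writes this out with small, hand-picked numbers instead of random data so that the result can be checked by hand; `via_apply`, `weighted_sum`, `weight_total` and `by_hand` are illustrative names:

```python
import numpy as np
import pandas as pd

# Hand-picked values instead of random data (an assumption of this sketch)
df3 = pd.DataFrame(
    {
        "category": ["de", "de", "en", "en"],
        "data": [10, 20, 30, 50],
        "weights": [1.0, 3.0, 1.0, 1.0],
    }
)

get_wavg = lambda g: np.average(g["data"], weights=g["weights"])

# Selecting the two value columns keeps the grouping column out of each group
via_apply = df3.groupby("category")[["data", "weights"]].apply(get_wavg)

# The same quantity written out: sum(data * weights) / sum(weights) per group
weighted_sum = (df3["data"] * df3["weights"]).groupby(df3["category"]).sum()
weight_total = df3["weights"].groupby(df3["category"]).sum()
by_hand = weighted_sum / weight_total

print(via_apply)
print(by_hand)
```

For `de` this gives (10 · 1 + 20 · 3) / 4 = 17.5, and for `en` (30 + 50) / 2 = 40.0.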

## Correlation

An interesting task could be to calculate a `DataFrame` consisting of the percentage changes.

For this purpose, we first create a function that calculates the pairwise correlation of each column with the `2021-12` column:

In [18]:
corr = lambda x: x.corrwith(x["2021-12"])

Next, we calculate the percentage change:

In [19]:
pcts = df.pct_change().dropna()

pcts

Title,Language,2021-12,2022-01,2022-02
Jupyter Tutorial,en,-0.798467,-0.768253,-0.666836
PyViz Tutorial,de,-0.197596,-0.490669,-0.606996
PyViz Tutorial,en,0.0,0.0,0.0
Python Basics,de,-0.912374,-0.929771,-0.795958
Python Basics,en,-0.777518,-0.181159,-0.700952


Finally, we group these percentage changes by language and apply the `corr` function to each group:

In [20]:
grouped_lang = pcts.groupby("Language")

grouped_lang.apply(corr)

Language,2021-12,2022-01,2022-02
de,1.0,1.0,1.0
en,1.0,0.699088,0.99781


In [21]:
grouped_lang.apply(lambda g: g["2021-12"].corr(g["2022-01"]))

Language
de    1.000000
en    0.699088
dtype: float64

## Performance problems with `apply`

`GroupBy.apply` calls the passed function once for each group, just as `Series.apply` calls it once for each value. With thousands of groups or values, the function is therefore called thousands of times. Unless the work is done by vectorised pandas or NumPy operations, this bypasses the fast vectorisation of pandas and runs slow Python code instead. For example, we previously grouped the data by title and then called our `top` function with `apply`. Let’s measure the time for this:

In [22]:
%%timeit
grouped_titles.apply(top)

422 μs ± 14.5 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


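For comparison, we call our `top` function directly on the whole `DataFrame` without `apply`. Note that this sorts all rows at once instead of sorting within each title, so it is a baseline for a single call rather than an identical result: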

In [23]:
%%timeit
top(df)

39.2 μs ± 586 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)


This calculation is roughly 11 times faster (39.2 μs compared with 422 μs).
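
If the per-group result is needed, the Python-level call per group can often be avoided with built-in `GroupBy` methods. The following sketch is one possible alternative, not the approach of the notebook: it sorts once and then uses `GroupBy.head` to pick the first rows of each group. It assumes a reduced data set with one month column; `via_apply` and `via_head` are illustrative names:

```python
import pandas as pd

# Reduced version of the tutorial data with a single month column
df = pd.DataFrame(
    {"2021-12": [30134, 6073, 4873, None, 427, 95]},
    index=pd.MultiIndex.from_arrays(
        [
            ["Jupyter Tutorial", "Jupyter Tutorial", "PyViz Tutorial",
             "PyViz Tutorial", "Python Basics", "Python Basics"],
            ["de", "en", "de", "en", "de", "en"],
        ],
        names=["Title", "Language"],
    ),
)


def top(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]


n = 1

# One Python-level call of top per group
via_apply = df.groupby("Title", group_keys=False).apply(top, n=n)

# Sort once, then let the built-in head pick the first n rows of each group
via_head = (
    df.sort_values(by="2021-12", ascending=False).groupby("Title").head(n)
)

print(via_apply)
print(via_head)
```

Both variants select the same rows; the row order can differ, because `head` keeps the order of the sorted `DataFrame` while `apply` returns the groups one after another.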

## Optimising `apply` with Cython

It is not always easy to find an alternative for `apply`. However, numerical operations like our `top` function can be made faster with [Cython](https://cython.org/). To use Cython in Jupyter, we use the following [IPython magic](../ipython/magics.ipynb):

In [24]:
%load_ext Cython

Then we can define our `top` function with Cython:

In [25]:
%%cython
def top_cy(df, n=5, column="2021-12"):
    return df.sort_values(by=column, ascending=False)[:n]

In [26]:
%%timeit
grouped_titles.apply(top_cy)

472 μs ± 27.6 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


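We haven’t gained anything with this yet: the function body consists only of pandas calls, which Cython cannot speed up on its own. A further optimisation would be to declare static C types in the Cython code, for example with `cdef` or `cpdef`. For this, however, we would have to modify our function, because a `DataFrame` cannot be passed as a C-typed argument.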