# A Pandas surprise: NaNs and GroupBy

In [10]:
import pandas as pd
import numpy as np
from pandas.errors import UnsupportedFunctionCall

I figured out something about pandas today that surprised me quite a bit.
When you aggregate with pandas `groupby`, NaNs are silently skipped.

This is intended, but sometimes you want the NaNs to show up in the summary, for example as a quick check of whether all the data are complete.

In [5]:
# Create a sample DataFrame:
DF = pd.DataFrame.from_dict({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'], 
                             'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                             'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

In this little example DF we would expect a `NaN` in group (`a`, `d`), but we get `3.0`.

In [7]:
DF.groupby(['g1', 'g2']).mean()

        d1
g1 g2
a  c   0.5
   d   3.0
b  c   4.5
   d   6.5
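For the quick data check mentioned earlier, one option is to count the missing values per group instead of relying on the mean. This is only a sketch: the names `df` and `nan_counts` are my own, and the frame simply repeats the sample data from above.

```python
import numpy as np
import pandas as pd

# Same sample data as above, repeated so the snippet runs on its own.
df = pd.DataFrame({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

# isna() gives a boolean Series; summing it per group counts the NaNs.
nan_counts = df['d1'].isna().groupby([df['g1'], df['g2']]).sum()
print(nan_counts)
```

Any group with a count above zero contains at least one missing value.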


The thing is, you cannot pass `skipna=False` to the groupby `mean`, and neither to the groupby `sum` (at least in the pandas version used here).

In [18]:
# This creates an error
try:
    DF.groupby(['g1', 'g2']).mean(skipna=False)
except UnsupportedFunctionCall:
    print('UnsupportedFunctionCall')

UnsupportedFunctionCall


One workaround that is often suggested is to use `.apply(np.mean)` instead of calling `.mean()` directly.

However:

In [19]:
DF.groupby(['g1', 'g2']).apply(np.mean)

        d1
g1 g2
a  c   0.5
   d   3.0
b  c   4.5
   d   6.5
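The NaN in group (`a`, `d`) is still skipped here. A minimal sketch on a single Series, outside of any groupby, shows the same effect; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# The two values of group (a, d) from the sample data.
s = pd.Series([np.nan, 3.0])

# np.mean on a pandas object ends up in the pandas mean, which skips NaN.
mean_on_series = np.mean(s)

# np.mean on a plain NumPy array keeps the NaN.
mean_on_array = np.mean(s.to_numpy())

print(mean_on_series, mean_on_array)
```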


Passing `np.mean` does not help: on a pandas object the call ends up in pandas' own `.mean()`, which uses `skipna=True`!
As far as I know, you need to write a new function (or use a partial, which I don't know much about) that works on a plain NumPy array instead:

In [25]:
def mean_w_nan(x):
    # Don't forget the np.array call!
    return np.mean(np.array(x))

DF.groupby(['g1', 'g2']).apply(mean_w_nan)

g1  g2
a   c     0.5
    d     NaN
b   c     4.5
    d     6.5
dtype: float64
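A possible alternative, sketched here under the assumption that only the `d1` column needs summarising: `Series.mean` itself accepts `skipna=False`, so it can be called per group through `agg`, either with a lambda or with the `functools.partial` mentioned above. The names `df`, `via_lambda` and `via_partial` are mine.

```python
from functools import partial

import numpy as np
import pandas as pd

# Same sample data as above, repeated so the snippet runs on its own.
df = pd.DataFrame({'g1': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'g2': ['c', 'c', 'd', 'd', 'c', 'c', 'd', 'd'],
                   'd1': [0, 1, np.nan, 3, 4, 5, 6, 7]})

grouped = df.groupby(['g1', 'g2'])['d1']

# Each group arrives as a Series, so skipna=False can be passed directly.
via_lambda = grouped.agg(lambda s: s.mean(skipna=False))

# The same idea with a partial instead of a lambda.
via_partial = grouped.agg(partial(pd.Series.mean, skipna=False))

print(via_lambda)
```

Both variants avoid the `np.array` conversion, because the NaN handling is switched off explicitly rather than by leaving pandas.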

In [None]:
# Links
https://stackoverflow.com/questions/26145585/pandas-aggregation-ignoring-nans
https://github.com/pandas-dev/pandas/issues/15674
https://stackoverflow.com/questions/54106112/pandas-groupby-mean-not-ignoring-nans