In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [5]:
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.random.randn(6,4),index=dates,columns=list('ABCD'))
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ['E'])
df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20130102'),
                    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
                    'D': np.array([3] * 4, dtype='int32'),
                    'E': pd.Categorical(["test", "train", "test", "train"]),
                    'F': 'foo'})
s = pd.Series([1,3,5,np.nan,6,8],index=dates).shift(2)

In [6]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.580407,-1.681435,1.338886,-1.695531
2013-01-02,-1.212343,0.810087,-0.883083,-0.158433
2013-01-03,-0.04379,2.915493,0.965412,0.074694
2013-01-04,-1.150028,-0.127283,0.996294,1.8854
2013-01-05,-0.369467,0.71423,-0.863941,0.324629
2013-01-06,-1.307036,-0.8854,-0.445092,-1.969323


#### Comparing array-like objects

You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:

In [8]:
pd.Series(['foo','bar','baz']) == 'foo'

0     True
1    False
2    False
dtype: bool

In [9]:
pd.Index(['foo','bar','baz']) == 'foo'

array([ True, False, False])

Pandas also handles element-wise comparisons between different array-like objects of the same length:

In [10]:
pd.Series(['foo','bar','baz']) == pd.Index(['foo','bar','qux'])

0     True
1     True
2    False
dtype: bool

In [11]:
pd.Series(['foo','bar','baz']) == np.array(['foo','bar','qux'])

0     True
1     True
2    False
dtype: bool

Trying to compare Index or Series objects of different lengths will raise a ValueError. Note that this is different from the NumPy behavior, where a comparison can be broadcast:

In [12]:
np.array([1,2,3]) == np.array([2])

array([False,  True, False])

or, if broadcasting cannot be done, it can return False (as in the NumPy version used here; newer NumPy releases may raise an error instead):

In [13]:
np.array([1,2,3]) == np.array([1,2])

False
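
The pandas side of this contrast can be sketched as follows. This is a minimal illustration, not output from the session above; the variable names are made up for the example, and the exact error message depends on the pandas version.

```python
import pandas as pd

left = pd.Series(['foo', 'bar', 'baz'])
right = pd.Series(['foo', 'bar'])

# Comparing pandas objects of different lengths is an error rather than
# a broadcast; the exact message varies between pandas versions.
try:
    left == right
    series_raised = False
except ValueError as err:
    series_raised = True
    print('Series:', err)

try:
    pd.Index(['foo', 'bar', 'baz']) == pd.Index(['foo', 'bar'])
    index_raised = False
except ValueError as err:
    index_raised = True
    print('Index:', err)
```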

### Combining overlapping data sets

In [14]:
df1 = pd.DataFrame({
    'A':[1.,np.nan,3.,5.,np.nan],
    'B':[np.nan,2.,3.,np.nan,6.]
})

In [15]:
df2 = pd.DataFrame({
    'A':[5.,2.,4.,np.nan,3.,7.],
    'B':[np.nan,np.nan,3.,4.,6.,8.]
})

In [16]:
df1

Unnamed: 0,A,B
0,1.0,
1,,2.0
2,3.0,3.0
3,5.0,
4,,6.0


In [17]:
df2

Unnamed: 0,A,B
0,5.0,
1,2.0,
2,4.0,3.0
3,,4.0
4,3.0,6.0
5,7.0,8.0


In [18]:
df1.combine_first(df2)

Unnamed: 0,A,B
0,1.0,
1,2.0,2.0
2,3.0,3.0
3,5.0,4.0
4,3.0,6.0
5,7.0,8.0


### General DataFrame combine

The combine_first() method above calls the more general DataFrame.combine(). This method takes another DataFrame and a combiner function, aligns the two DataFrames, and then passes pairs of Series (i.e., columns whose names are the same) to the combiner function.

So, for instance, to reproduce combine_first() as above:

In [19]:
def combiner(x,y):
    return np.where(pd.isna(x),y,x)
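
The cell above only defines the combiner. A self-contained sketch of actually passing it to DataFrame.combine(), using the same two frames as in the combine_first() example, might look like this:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [1., np.nan, 3., 5., np.nan],
                    'B': [np.nan, 2., 3., np.nan, 6.]})
df2 = pd.DataFrame({'A': [5., 2., 4., np.nan, 3., 7.],
                    'B': [np.nan, np.nan, 3., 4., 6., 8.]})

def combiner(x, y):
    # Take values from x, falling back to y where x is missing.
    return np.where(pd.isna(x), y, x)

# combine() aligns the frames, then calls combiner once per shared column.
combined = df1.combine(df2, combiner)
print(combined)
```

The result should hold the same values as df1.combine_first(df2).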

## 3.3.5 Descriptive statistics

A large number of methods are available for computing descriptive statistics and other related operations on Series and DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally speaking, these methods take an axis argument, just like ndarray.{sum, std, ...}, but the axis can be specified by name or integer:


• Series: no axis argument needed
• DataFrame: index (axis=0, default), columns (axis=1) 

For example:

In [20]:
df

Unnamed: 0,A,B,C,D
2013-01-01,-0.580407,-1.681435,1.338886,-1.695531
2013-01-02,-1.212343,0.810087,-0.883083,-0.158433
2013-01-03,-0.04379,2.915493,0.965412,0.074694
2013-01-04,-1.150028,-0.127283,0.996294,1.8854
2013-01-05,-0.369467,0.71423,-0.863941,0.324629
2013-01-06,-1.307036,-0.8854,-0.445092,-1.969323


In [21]:
df.mean(0)

A   -0.777178
B    0.290949
C    0.184746
D   -0.256427
dtype: float64

In [22]:
df.mean(1)

2013-01-01   -0.654622
2013-01-02   -0.360943
2013-01-03    0.977953
2013-01-04    0.401096
2013-01-05   -0.048638
2013-01-06   -1.151713
Freq: D, dtype: float64
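
The two calls above use integer axes. As a small sketch on a toy frame (not the random df used in this notebook), the same reductions can be written with axis names:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(np.arange(12.0).reshape(3, 4), columns=list('ABCD'))

# axis=0 and axis='index' both reduce down the rows: one result per column.
per_column = toy.mean(axis='index')

# axis=1 and axis='columns' both reduce across the columns: one result per row.
per_row = toy.mean(axis='columns')

print(per_column)
print(per_row)
```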

All such methods have a skipna option signaling whether to exclude missing data (True by default):

In [23]:
df.sum(0,skipna=False)

A   -4.663070
B    1.745691
C    1.108476
D   -1.538564
dtype: float64

In [24]:
df.sum(axis=1,skipna=True)

2013-01-01   -2.618487
2013-01-02   -1.443771
2013-01-03    3.911810
2013-01-04    1.604382
2013-01-05   -0.194550
2013-01-06   -4.606852
Freq: D, dtype: float64
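
Because df contains no missing values, the two calls above do not show any difference between the skipna settings. A minimal sketch on a hypothetical frame with one NaN makes the effect visible:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame({'A': [1.0, np.nan, 3.0],
                      'B': [4.0, 5.0, 6.0]})

# Default skipna=True: the NaN in column A is ignored.
skipped = df_na.sum()

# skipna=False: any NaN in a column makes that column's sum NaN.
strict = df_na.sum(skipna=False)

print(skipped)
print(strict)
```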

Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standardization (rendering data zero mean and standard deviation 1), very concisely:

In [25]:
ts_stand = ( df - df.mean() ) / df.std()

In [26]:
ts_stand.std()

A    1.0
B    1.0
C    1.0
D    1.0
dtype: float64

In [28]:
xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [29]:
xs_stand.std(1)

2013-01-01    1.0
2013-01-02    1.0
2013-01-03    1.0
2013-01-04    1.0
2013-01-05    1.0
2013-01-06    1.0
Freq: D, dtype: float64

Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different from expanding() and rolling(). For more details please see this note.

In [30]:
df.cumsum()

Unnamed: 0,A,B,C,D
2013-01-01,-0.580407,-1.681435,1.338886,-1.695531
2013-01-02,-1.79275,-0.871348,0.455803,-1.853963
2013-01-03,-1.836539,2.044145,1.421216,-1.779269
2013-01-04,-2.986567,1.916862,2.417509,0.106131
2013-01-05,-3.356034,2.631092,1.553568,0.430759
2013-01-06,-4.66307,1.745691,1.108476,-1.538564
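
Since df has no missing values, the NaN behavior mentioned above is not visible in this output. A small sketch on a hypothetical Series, assuming the default min_periods=1 for expanding(), shows the difference:

```python
import numpy as np
import pandas as pd

s_na = pd.Series([1.0, np.nan, 2.0])

# cumsum() keeps NaN where the input was NaN, then carries on accumulating.
cum = s_na.cumsum()

# expanding().sum() reports the running sum of the non-NaN values seen so far.
exp = s_na.expanding().sum()

print(cum)
print(exp)
```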


Here is a quick reference summary table of common functions. Each also takes an optional level parameter which applies only if the object has a hierarchical index.

Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:

In [32]:
np.mean(df['A'])

-0.7771783977404798

In [34]:
np.mean(df['A'].to_numpy())

-0.7771783977404798
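
The two results above agree because df['A'] has no missing values. With a NaN present, the Series input and the plain ndarray input behave differently, as this sketch on a hypothetical Series illustrates:

```python
import numpy as np
import pandas as pd

s_na = pd.Series([1.0, np.nan, 3.0])

# np.mean on a Series defers to Series.mean, which skips NaN by default.
via_series = np.mean(s_na)

# np.mean on the underlying ndarray propagates the NaN.
via_array = np.mean(s_na.to_numpy())

print(via_series, via_array)
```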

Series.nunique() will return the number of unique non-NA values in a Series:

In [35]:
series = pd.Series(np.random.randn(500))

In [36]:
series[20:500] = np.nan

In [37]:
series[10:20] = 5

In [38]:
series.nunique()

11

### Summarizing data: describe

There is a convenient describe() function which computes a variety of summary statistics about a Series or the columns of a DataFrame (excluding NAs of course):

In [39]:
series = pd.Series(np.random.randn(1000))

In [40]:
series[::2] = np.nan

In [41]:
series.describe()

count    500.000000
mean       0.021956
std        0.957585
min       -3.015134
25%       -0.593110
50%        0.050342
75%        0.649822
max        2.942714
dtype: float64

In [44]:
frame = pd.DataFrame(np.random.randn(1000,5),columns=['a','b','c','d','e'])

In [45]:
frame.iloc[::2] = np.nan

In [46]:
frame.describe()

Unnamed: 0,a,b,c,d,e
count,500.0,500.0,500.0,500.0,500.0
mean,-0.035739,0.061744,0.002215,-0.026693,-0.019077
std,0.952523,1.039449,1.016104,0.999068,1.02985
min,-2.870968,-3.269927,-2.911425,-2.453823,-2.634322
25%,-0.699435,-0.577695,-0.685504,-0.750976,-0.677547
50%,-0.076752,0.053241,-0.054535,-0.027162,-0.081331
75%,0.595129,0.801656,0.744086,0.575257,0.70542
max,2.842469,3.09773,2.943138,2.765271,3.349105


You can select specific percentiles to include in the output:

In [47]:
series.describe(percentiles=[.05,.25,.75,.85])

count    500.000000
mean       0.021956
std        0.957585
min       -3.015134
5%        -1.595899
25%       -0.593110
50%        0.050342
75%        0.649822
85%        0.976575
max        2.942714
dtype: float64

By default, the median is always included.

For a non-numerical Series object, describe() will give a simple summary of the number of unique values and most frequently occurring values:

In [48]:
s = pd.Series(['a','a','b','b','a','a',np.nan,'c','d','a'])

In [49]:
s.describe()

count     9
unique    4
top       a
freq      5
dtype: object

Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical columns or, if none are, only categorical columns:

In [51]:
frame = pd.DataFrame({
    'a':['Yes','Yes','No','No'],
    'b':range(4)
})

In [52]:
frame.describe()

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


This behavior can be controlled by providing a list of types as include/exclude arguments. The special value 'all' can also be used:

In [53]:
frame.describe(include=['object'])

Unnamed: 0,a
count,4
unique,2
top,Yes
freq,2


In [54]:
frame.describe(include=['number'])

Unnamed: 0,b
count,4.0
mean,1.5
std,1.290994
min,0.0
25%,0.75
50%,1.5
75%,2.25
max,3.0


In [55]:
frame.describe(include='all')

Unnamed: 0,a,b
count,4,4.0
unique,2,
top,Yes,
freq,2,
mean,,1.5
std,,1.290994
min,,0.0
25%,,0.75
50%,,1.5
75%,,2.25
max,,3.0


That feature relies on select_dtypes. Refer to its documentation for details about accepted inputs.
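
As a rough illustration of the column selection involved, the following sketch calls select_dtypes directly on the same small frame. It uses only the 'number' selector, since how string columns are classified can differ between pandas versions.

```python
import pandas as pd

frame = pd.DataFrame({'a': ['Yes', 'Yes', 'No', 'No'],
                      'b': range(4)})

# Keep only the numeric columns.
numeric = frame.select_dtypes(include=['number'])

# Keep everything except the numeric columns.
non_numeric = frame.select_dtypes(exclude=['number'])

print(list(numeric.columns), list(non_numeric.columns))
```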

### Index of min/max values

The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and maximum corresponding values:

In [57]:
s1 = pd.Series(np.random.randn(5))

In [58]:
s1

0    1.798021
1   -0.617379
2    1.372603
3    0.172300
4    1.435599
dtype: float64

In [60]:
s1.idxmin(),s1.idxmax()

(1, 0)

In [61]:
df1 = pd.DataFrame(np.random.randn(5,3),columns=['A','B','C'])

In [62]:
df1

Unnamed: 0,A,B,C
0,0.699162,-0.285348,-0.347659
1,-0.370816,-0.269597,0.207879
2,1.26102,1.208938,-0.700888
3,-1.565806,1.160954,1.036011
4,0.937124,-0.322472,-0.364248


In [63]:
df1.idxmin(axis=0)

A    3
B    4
C    2
dtype: int64

In [64]:
df1.idxmax(axis=1)

0    A
1    C
2    A
3    B
4    A
dtype: object

When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax() return the first matching index:

In [65]:
df3 = pd.DataFrame([2,1,1,3,np.nan],columns=['A'],index=list('edcba'))

In [66]:
df3

Unnamed: 0,A
e,2.0
d,1.0
c,1.0
b,3.0
a,


In [67]:
df3['A'].idxmin()

'd'

In [69]:
df3['A'].idxmax()

'b'


Note: idxmin and idxmax are called argmin and argmax in NumPy.
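
One difference worth keeping in mind is that the pandas methods return index labels, while the NumPy-style functions return integer positions. A minimal sketch on the same values as df3['A']:

```python
import numpy as np
import pandas as pd

s_lab = pd.Series([2, 1, 1, 3, np.nan], index=list('edcba'))

# idxmin() returns the label of the first minimum, skipping NaN.
label = s_lab.idxmin()

# Series.argmin() returns the integer position of that same element.
position = s_lab.argmin()

# np.nanargmin() gives the position on the raw array, ignoring NaN.
np_position = int(np.nanargmin(s_lab.to_numpy()))

print(label, position, np_position)
```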

### Value counts (histogramming) / mode

The value_counts() Series method and the top-level function of the same name compute a histogram of a 1D array of values. The top-level function can also be used on regular arrays:


In [70]:
data = np.random.randint(0,7,size=50)

In [71]:
data

array([0, 0, 2, 5, 4, 4, 0, 2, 6, 0, 4, 3, 2, 5, 0, 5, 2, 1, 3, 0, 6, 2,
       5, 0, 3, 3, 1, 4, 3, 4, 1, 3, 4, 0, 1, 2, 0, 3, 0, 2, 2, 3, 5, 3,
       1, 0, 1, 5, 6, 4])

In [72]:
data = np.random.randint(0,7,size=50)

In [73]:
data

array([4, 3, 5, 4, 0, 0, 4, 1, 1, 0, 5, 3, 0, 4, 3, 5, 4, 0, 3, 1, 0, 3,
       6, 1, 0, 2, 5, 1, 6, 1, 2, 6, 5, 6, 0, 5, 2, 3, 6, 1, 1, 3, 6, 1,
       3, 5, 6, 3, 4, 4])

In [74]:
s = pd.Series(data)

In [76]:
s.value_counts()

3    9
1    9
0    8
6    7
5    7
4    7
2    3
dtype: int64

In [77]:
pd.value_counts(data)

3    9
1    9
0    8
6    7
5    7
4    7
2    3
dtype: int64

Similarly, you can get the most frequently occurring value(s) (the mode) of the values in a Series or DataFrame:

In [78]:
s5 = pd.Series([1,1,3,3,3,5,5,7,7,7])

In [79]:
s5.mode()

0    3
1    7
dtype: int64

In [80]:
df5 = pd.DataFrame({
    'A':np.random.randint(0,7,size=50),
    'B':np.random.randint(-10,15,size=50)
})

In [81]:
df5.mode()

Unnamed: 0,A,B
0,2,-4.0
1,5,


### Discretization and quantiling

Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample quantiles) functions:

In [82]:
arr = np.random.randn(20)

In [83]:
factor = pd.cut(arr,4)

In [84]:
factor

[(0.181, 0.923], (0.923, 1.665], (0.181, 0.923], (-1.305, -0.561], (0.923, 1.665], ..., (0.923, 1.665], (-1.305, -0.561], (0.181, 0.923], (-0.561, 0.181], (-1.305, -0.561]]
Length: 20
Categories (4, interval[float64]): [(-1.305, -0.561] < (-0.561, 0.181] < (0.181, 0.923] < (0.923, 1.665]]

In [85]:
factor = pd.cut(arr,[-5,-1,0,1,5])

In [86]:
factor

[(0, 1], (1, 5], (0, 1], (-1, 0], (1, 5], ..., (1, 5], (-5, -1], (0, 1], (-1, 0], (-5, -1]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]

qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size quartiles like so:

In [87]:
arr = np.random.randn(30)

In [88]:
factor = pd.qcut(arr,[0,.25,.5,.75,1])

In [89]:
factor

[(0.612, 2.574], (-0.936, -0.582], (0.612, 2.574], (-0.936, -0.582], (0.612, 2.574], ..., (-0.936, -0.582], (-0.582, -0.0505], (-0.582, -0.0505], (-0.936, -0.582], (-0.936, -0.582]]
Length: 30
Categories (4, interval[float64]): [(-0.936, -0.582] < (-0.582, -0.0505] < (-0.0505, 0.612] < (0.612, 2.574]]

In [90]:
pd.value_counts(factor)

(0.612, 2.574]       8
(-0.936, -0.582]     8
(-0.0505, 0.612]     7
(-0.582, -0.0505]    7
dtype: int64
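
To make the contrast between the two functions concrete, here is a small sketch on hand-picked values with one outlier (the data is made up for illustration). cut() produces equal-width bins, so most values land in the first bin, while qcut() produces bins with equal counts.

```python
import numpy as np
import pandas as pd

values = np.array([1., 2., 3., 4., 5., 6., 7., 100.])

by_width = pd.cut(values, 4)      # four equal-width bins over [1, 100]
by_quantile = pd.qcut(values, 4)  # four bins split at the sample quartiles

# Count how many values fall in each bin, in bin order.
width_counts = np.bincount(by_width.codes, minlength=4)
quantile_counts = np.bincount(by_quantile.codes, minlength=4)

print(width_counts)     # [7 0 0 1]
print(quantile_counts)  # [2 2 2 2]
```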

We can also pass infinite values to define the bins:

In [91]:
arr = np.random.randn(20)

In [92]:
factor = pd.cut(arr,[-np.inf,0,np.inf])

In [93]:
factor

[(-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf], (0.0, inf], ..., (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]

## 3.3.6 Function application

To apply your own or another library's functions to pandas objects, you should be aware of the methods below. The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or Series, row- or column-wise, or elementwise.


1. Tablewise Function Application: pipe()
2. Row or Column-wise Function Application: apply()
3. Aggregation API: agg() and transform()
4. Applying Elementwise Functions: applymap()

### Tablewise function application


DataFrames and Series can of course just be passed into functions. However, if the function needs to be called in a chain, consider using the pipe() method. For example:

In [95]:
import statsmodels.formula.api as sm

In [96]:
bb = pd.read_csv('baseball.csv',index_col='id')

In [97]:
(bb.query('h>0')
     .assign(ln_h=lambda df:np.log(df.h))
     .pipe((sm.ols,'data'),'hr~ln_h+year+g+C(lg)')
     .fit()
     .summary()
)

0,1,2,3
Dep. Variable:,hr,R-squared:,0.685
Model:,OLS,Adj. R-squared:,0.665
Method:,Least Squares,F-statistic:,34.28
Date:,"Sun, 08 Mar 2020",Prob (F-statistic):,3.48e-15
Time:,19:00:30,Log-Likelihood:,-205.92
No. Observations:,68,AIC:,421.8
Df Residuals:,63,BIC:,432.9
Df Model:,4,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,-8484.7720,4664.146,-1.819,0.074,-1.78e+04,835.780
C(lg)[T.NL],-2.2736,1.325,-1.716,0.091,-4.922,0.375
ln_h,-1.3542,0.875,-1.547,0.127,-3.103,0.395
year,4.2277,2.324,1.819,0.074,-0.417,8.872
g,0.1841,0.029,6.258,0.000,0.125,0.243

0,1,2,3
Omnibus:,10.875,Durbin-Watson:,1.999
Prob(Omnibus):,0.004,Jarque-Bera (JB):,17.298
Skew:,0.537,Prob(JB):,0.000175
Kurtosis:,5.225,Cond. No.,14900000.0


The pipe method is inspired by Unix pipes and, more recently, dplyr and magrittr, which have introduced the popular (%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at home in Python.


We encourage you to view the source code of pipe().
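
As a minimal sketch of what pipe() buys you, the example below compares nested function calls with the equivalent chain. The functions add_total and scale are hypothetical helpers written for this illustration. The (callable, 'data') tuple form, as used with sm.ols above, tells pipe() which keyword argument should receive the DataFrame when it is not the function's first argument.

```python
import pandas as pd

def add_total(data):
    # Hypothetical helper: expects the DataFrame as its first argument.
    return data.assign(total=data['x'] + data['y'])

def scale(factor, data):
    # Hypothetical helper: the DataFrame is NOT the first argument.
    return data * factor

small = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# Nested calls read inside-out.
nested = scale(10, data=add_total(small))

# The same steps as a chain, read top to bottom.
piped = (small
         .pipe(add_total)
         .pipe((scale, 'data'), 10))

print(piped)
```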

### Row or column-wise function application

Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the descriptive statistics methods, takes an optional axis argument:

In [98]:
df.apply(np.mean)

A   -0.777178
B    0.290949
C    0.184746
D   -0.256427
dtype: float64

In [99]:
df.apply(np.mean,axis=1)

2013-01-01   -0.654622
2013-01-02   -0.360943
2013-01-03    0.977953
2013-01-04    0.401096
2013-01-05   -0.048638
2013-01-06   -1.151713
Freq: D, dtype: float64

In [100]:
df.apply(lambda x:x.max()-x.min())

A    1.263247
B    4.596929
C    2.221969
D    3.854723
dtype: float64

In [101]:
df.apply(np.cumsum)

Unnamed: 0,A,B,C,D
2013-01-01,-0.580407,-1.681435,1.338886,-1.695531
2013-01-02,-1.79275,-0.871348,0.455803,-1.853963
2013-01-03,-1.836539,2.044145,1.421216,-1.779269
2013-01-04,-2.986567,1.916862,2.417509,0.106131
2013-01-05,-3.356034,2.631092,1.553568,0.430759
2013-01-06,-4.66307,1.745691,1.108476,-1.538564


In [102]:
df.apply(np.exp)

Unnamed: 0,A,B,C,D
2013-01-01,0.559671,0.186107,3.814793,0.183502
2013-01-02,0.297499,2.248104,0.413506,0.853481
2013-01-03,0.957155,18.457914,2.62587,1.077555
2013-01-04,0.316628,0.880484,2.708226,6.588988
2013-01-05,0.691103,2.042613,0.421498,1.383517
2013-01-06,0.270621,0.412549,0.640765,0.139551


The apply() method will also dispatch on a string method name.

In [103]:
df.apply('mean')

A   -0.777178
B    0.290949
C    0.184746
D   -0.256427
dtype: float64

In [104]:
df.apply('mean',axis=1)

2013-01-01   -0.654622
2013-01-02   -0.360943
2013-01-03    0.977953
2013-01-04    0.401096
2013-01-05   -0.048638
2013-01-06   -1.151713
Freq: D, dtype: float64

The return type of the function passed to apply() affects the type of the final output from DataFrame.apply for the default behaviour:

• If the applied function returns a Series, the final output is a DataFrame. The columns match the index of the Series returned by the applied function.

• If the applied function returns any other type, the final output is a Series.

This default behaviour can be overridden using the result_type argument, which accepts three options: reduce,
broadcast, and expand. These determine how list-like return values expand (or not) to a DataFrame.
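
A small sketch of the expand option, on a hypothetical frame and helper function: a function that returns a list for each row yields a Series of lists by default, and a DataFrame when result_type='expand' is passed.

```python
import pandas as pd

small = pd.DataFrame({'A': [1, 4], 'B': [2, 5], 'C': [3, 6]})

def min_max(row):
    # Hypothetical helper returning a list-like for each row.
    return [row.min(), row.max()]

# Default: one list per row, collected into a Series.
as_series = small.apply(min_max, axis=1)

# result_type='expand': each list element becomes its own column.
as_frame = small.apply(min_max, axis=1, result_type='expand')

print(as_series)
print(as_frame)
```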

apply() combined with some cleverness can be used to answer many questions about a data set. For example,
suppose we wanted to extract the date where the maximum value for each column occurred:

In [105]:
tsdf = pd.DataFrame(np.random.randn(1000,3),columns=['A','B','C'],index=pd.date_range('1/1/2000',periods=1000))

In [106]:
tsdf.apply(lambda x:x.idxmax())

A   2001-01-31
B   2002-09-06
C   2001-01-06
dtype: datetime64[ns]

You may also pass additional arguments and keyword arguments to the apply() method. For instance, consider the following function you would like to apply:

In [107]:
def subtract_and_divide(x,sub,divide=1):
    return (x-sub)/divide

You may then apply this function as follows:

In [108]:
df.apply(subtract_and_divide,args=(5,),divide=3)

Unnamed: 0,A,B,C,D
2013-01-01,-1.860136,-2.227145,-1.220371,-2.231844
2013-01-02,-2.070781,-1.396638,-1.961028,-1.719478
2013-01-03,-1.681263,-0.694836,-1.344863,-1.641769
2013-01-04,-2.050009,-1.709094,-1.334569,-1.0382
2013-01-05,-1.789822,-1.42859,-1.954647,-1.558457
2013-01-06,-2.102345,-1.9618,-1.815031,-2.323108


Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:

In [109]:
tsdf

Unnamed: 0,A,B,C
2000-01-01,-2.184787,-1.306384,1.313990
2000-01-02,0.005431,0.713980,1.070233
2000-01-03,-2.214944,-1.642459,0.005254
2000-01-04,-0.167712,-2.448214,-0.928917
2000-01-05,0.005370,-0.192001,-1.114783
...,...,...,...
2002-09-22,-0.881799,2.667154,1.695512
2002-09-23,0.456797,-0.466448,-1.172211
2002-09-24,0.491561,-1.127058,-0.137953
2002-09-25,2.267376,0.246412,-0.369981


In [110]:
tsdf.apply(pd.Series.interpolate)

Unnamed: 0,A,B,C
2000-01-01,-2.184787,-1.306384,1.313990
2000-01-02,0.005431,0.713980,1.070233
2000-01-03,-2.214944,-1.642459,0.005254
2000-01-04,-0.167712,-2.448214,-0.928917
2000-01-05,0.005370,-0.192001,-1.114783
...,...,...,...
2002-09-22,-0.881799,2.667154,1.695512
2002-09-23,0.456797,-0.466448,-1.172211
2002-09-24,0.491561,-1.127058,-0.137953
2002-09-25,2.267376,0.246412,-0.369981


Finally, apply() takes an argument raw, which is False by default. With raw=False, each row or column is converted into a Series before the function is applied. When raw=True, the passed function instead receives an ndarray object, which can improve performance if you do not need the indexing functionality.
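
The following sketch, using a hypothetical helper that records the type of its input, shows what the function receives when raw=True:

```python
import numpy as np
import pandas as pd

small = pd.DataFrame({'A': [1.0, 4.0, 2.0], 'B': [10.0, 7.0, 9.0]})

seen_types = []

def value_range(values):
    # Record what kind of object apply() handed over, then reduce it.
    seen_types.append(type(values))
    return values.max() - values.min()

# With raw=True each column arrives as a plain ndarray, not a Series.
ranges = small.apply(value_range, raw=True)

print(seen_types)
print(ranges)
```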

### Aggregation API


New in version 0.20.0.


The aggregation API allows one to express possibly multiple aggregation operations in a single concise way. This API is similar across pandas objects, see groupby API, the window functions API, and the resample API. The entry point for aggregation is DataFrame.aggregate(), or the alias DataFrame.agg().


We will use a starting frame similar to the one above:

In [111]:
tsdf = pd.DataFrame(np.random.randn(10,3),columns=['A','B','D'],index=pd.date_range('1/1/2000',periods=10))

In [112]:
tsdf.iloc[3:7] = np.nan

In [113]:
tsdf

Unnamed: 0,A,B,D
2000-01-01,0.731055,1.788132,-0.947848
2000-01-02,-2.087354,-1.064578,0.780159
2000-01-03,1.408928,-0.931477,-1.683704
2000-01-04,,,
2000-01-05,,,
2000-01-06,,,
2000-01-07,,,
2000-01-08,1.519882,2.601054,-0.047401
2000-01-09,-0.470323,-0.776397,0.041087
2000-01-10,1.539887,-1.01027,-1.124332


Using a single function is equivalent to apply(). You can also pass named methods as strings. These will return a Series of the aggregated output:

In [114]:
tsdf.agg(np.sum)

A    2.642075
B    0.606464
D   -2.982039
dtype: float64

In [115]:
tsdf.agg('sum')

A    2.642075
B    0.606464
D   -2.982039
dtype: float64

In [118]:
# these are equivalent to a ``.sum()`` because we are aggregating 
# on a single function
tsdf.sum()

A    2.642075
B    0.606464
D   -2.982039
dtype: float64

A single aggregation on a Series will return a scalar value:

In [119]:
tsdf.A.agg('sum')

2.64207480696808

### Aggregating with multiple functions

You can pass multiple aggregation arguments as a list. The results of each of the passed functions will be a row in the resulting DataFrame. These are naturally named from the aggregation function.

In [120]:
tsdf.agg(['sum'])

Unnamed: 0,A,B,D
sum,2.642075,0.606464,-2.982039


Multiple functions yield multiple rows:

In [121]:
tsdf.agg(['sum','mean'])

Unnamed: 0,A,B,D
sum,2.642075,0.606464,-2.982039
mean,0.440346,0.101077,-0.497007


On a Series, multiple functions return a Series, indexed by the function names:

In [122]:
tsdf.A.agg(['sum','mean'])

sum     2.642075
mean    0.440346
Name: A, dtype: float64

Passing a lambda function will yield a <lambda> named row:

In [123]:
tsdf.A.agg(['sum',lambda x:x.mean()])

sum         2.642075
<lambda>    0.440346
Name: A, dtype: float64

Passing a named function will yield that name for the row:

In [124]:
def mymean(x):
    return x.mean()

In [125]:
tsdf.A.agg(['sum',mymean])

sum       2.642075
mymean    0.440346
Name: A, dtype: float64