# Descriptive statistics

pandas objects are equipped with a number of common mathematical and statistical methods. Most of them are reductions or summary statistics: methods that extract a single value (such as the sum or mean) from a Series, or a Series of values from the rows or columns of a DataFrame. Unlike the equivalent methods on NumPy arrays, they skip missing data by default.
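
The difference from NumPy can be seen in a minimal sketch. The values are made up for illustration and are not part of the example data used in the rest of this section:

```python
import numpy as np
import pandas as pd

values = [1.0, np.nan, 2.0]  # illustrative values, not from the text

# NumPy propagates the missing value through the reduction
numpy_total = np.array(values).sum()

# pandas skips missing values by default (skipna=True)
pandas_total = pd.Series(values).sum()

print(numpy_total)   # nan
print(pandas_total)  # 3.0
```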

In [1]:
import pandas as pd
import numpy as np

rng = np.random.default_rng()
df = pd.DataFrame(rng.normal(size=(7, 3)), index=pd.date_range("2022-02-02", periods=7))
new_index = pd.date_range("2022-02-03", periods=7)
df2 = df.reindex(new_index)

df2

| | 0 | 1 | 2 |
| :-- | --: | --: | --: |
| 2022-02-03 | 0.264503 | -0.345312 | 1.260343 |
| 2022-02-04 | -1.28644 | -1.107862 | 1.571945 |
| 2022-02-05 | -0.027269 | -0.138083 | 0.462049 |
| 2022-02-06 | -1.090419 | 0.798509 | 0.795627 |
| 2022-02-07 | 1.571553 | 1.519501 | 1.566992 |
| 2022-02-08 | 0.975667 | 0.309376 | 0.512775 |
| 2022-02-09 | NaN | NaN | NaN |


Calling the `pandas.DataFrame.sum` method returns a series containing column totals:

In [2]:
df2.sum()

0    0.407595
1    1.036129
2    6.169731
dtype: float64

Passing `axis='columns'` or `axis=1` instead sums across the columns, producing one total per row:

In [3]:
df2.sum(axis='columns')

2022-02-03    1.179534
2022-02-04   -0.822357
2022-02-05    0.296697
2022-02-06    0.503717
2022-02-07    4.658046
2022-02-08    1.797817
2022-02-09    0.000000
Freq: D, dtype: float64

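If a row or column consists entirely of NA values, the sum is `0`, because missing values are skipped by default. Passing `skipna=False` propagates NA instead, so any row or column with at least one missing value yields `NaN`: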

In [4]:
df2.sum(axis='columns', skipna=False)

2022-02-03    1.179534
2022-02-04   -0.822357
2022-02-05    0.296697
2022-02-06    0.503717
2022-02-07    4.658046
2022-02-08    1.797817
2022-02-09         NaN
Freq: D, dtype: float64
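
If the aim is only to avoid the misleading `0` for all-NA rows while still skipping isolated missing values, the `min_count` argument of `sum` may be a better fit. The following sketch uses a small hypothetical frame rather than `df2`:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one complete row, one partly missing, one entirely missing
frame = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [10.0, np.nan, np.nan]})

# Default: missing values are skipped, so the all-NA row sums to 0.0
default_sum = frame.sum(axis="columns")

# skipna=False: any missing value in a row makes the result NaN
strict_sum = frame.sum(axis="columns", skipna=False)

# min_count=1: NaN only when a row has no valid value at all
min_count_sum = frame.sum(axis="columns", min_count=1)

print(min_count_sum)
```

With `min_count=1`, the partly missing row still yields the sum of its valid values, and only the entirely missing row yields `NaN`.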

Some aggregations, such as `mean`, require at least one non-`NaN` value to yield a valid result; otherwise they return `NaN`:

In [5]:
df2.mean(axis='columns')

2022-02-03    0.393178
2022-02-04   -0.274119
2022-02-05    0.098899
2022-02-06    0.167906
2022-02-07    1.552682
2022-02-08    0.599272
2022-02-09         NaN
Freq: D, dtype: float64

## Options for reduction methods

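Option | Description
:----- | :----------
`axis` | the axis to reduce over: `0` (or `'index'`) reduces down the rows and gives one result per column; `1` (or `'columns'`) reduces across the columns and gives one result per row
`skipna` | exclude missing values; `True` by default
`level` | reduce grouped by level if the axis is hierarchically indexed (MultiIndex); deprecated in pandas 1.3 and removed in pandas 2.0 in favour of `groupby(level=...)`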
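
A reduction grouped by one level of a hierarchical index can also be written with `groupby(level=...)`, which works independently of the `level` option. This is a minimal sketch; the index names `key` and `num` and the values are hypothetical:

```python
import pandas as pd

# Hypothetical two-level index; the names "key" and "num" are made up
index = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["key", "num"])
series = pd.Series([1.0, 2.0, 3.0, 4.0], index=index)

# Sum the values separately for each label of the outer level
per_key = series.groupby(level="key").sum()

print(per_key)
```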

Some methods, such as `idxmin` and `idxmax`, provide indirect statistics such as the index value at which the minimum or maximum value is reached:

In [6]:
df2.idxmax()

0   2022-02-07
1   2022-02-07
2   2022-02-04
dtype: datetime64[ns]
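
The distinction between index labels and integer positions may be easier to see on a small Series. The following sketch uses made-up values and labels and contrasts `idxmax` with `argmax`:

```python
import pandas as pd

# Made-up values with string labels
s = pd.Series([3.0, 7.0, 5.0], index=["a", "b", "c"])

label = s.idxmax()     # index label of the maximum
position = s.argmax()  # integer position of the maximum

print(label, position)
```

The label is suited to `s.loc[...]`, the position to `s.iloc[...]`; both refer to the same element.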

Other methods are accumulations:

In [7]:
df2.cumsum()

| | 0 | 1 | 2 |
| :-- | --: | --: | --: |
| 2022-02-03 | 0.264503 | -0.345312 | 1.260343 |
| 2022-02-04 | -1.021937 | -1.453174 | 2.832288 |
| 2022-02-05 | -1.049206 | -1.591257 | 3.294337 |
| 2022-02-06 | -2.139625 | -0.792748 | 4.089963 |
| 2022-02-07 | -0.568072 | 0.726753 | 5.656955 |
| 2022-02-08 | 0.407595 | 1.036129 | 6.169731 |
| 2022-02-09 | NaN | NaN | NaN |
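
Accumulations also skip missing values by default: the missing position stays `NaN` in the result, but the running total continues after it. A minimal sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values with a gap in the second position
s = pd.Series([1.0, np.nan, 2.0, 3.0])

# The NaN is kept in place, and accumulation resumes with the next valid value
running = s.cumsum()

print(running)
```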


Some methods are neither reductions nor accumulations. `describe` is one example; it produces several summary statistics in one go:

In [8]:
df2.describe()

| | 0 | 1 | 2 |
| :-- | --: | --: | --: |
| count | 6.0 | 6.0 | 6.0 |
| mean | 0.067932 | 0.172688 | 1.028288 |
| std | 1.123268 | 0.919669 | 0.505989 |
| min | -1.28644 | -1.107862 | 0.462049 |
| 25% | -0.824631 | -0.293505 | 0.583488 |
| 50% | 0.118617 | 0.085646 | 1.027985 |
| 75% | 0.797876 | 0.676226 | 1.49033 |
| max | 1.571553 | 1.519501 | 1.571945 |
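
The quantiles reported by `describe` can be changed with its `percentiles` argument. The following sketch uses a hypothetical single-column frame and requests the 10%, 50% and 90% quantiles instead of the default quartiles:

```python
import pandas as pd

# Hypothetical frame with one numeric column
frame = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0, 5.0]})

# Request the 10%, 50% and 90% quantiles instead of 25%, 50% and 75%
summary = frame.describe(percentiles=[0.1, 0.5, 0.9])

print(summary)
```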


For non-numeric data, `describe` generates alternative summary statistics:

In [9]:
data = {'Code': ['U+0000', 'U+0001', 'U+0002', 'U+0003', 'U+0004', 'U+0005'],
        'Octal': ['001', '002', '003', '004', '004', '005']}
df3 = pd.DataFrame(data)

df3.describe()

| | Code | Octal |
| :-- | :-- | :-- |
| count | 6 | 6 |
| unique | 6 | 5 |
| top | U+0000 | 004 |
| freq | 1 | 2 |


Descriptive and summary statistics:

Method | Description
:----- | :----------
`count` | number of non-NA values
`describe` | calculation of a set of summary statistics for series or each DataFrame column
`min`, `max` | calculation of minimum and maximum values
`argmin`, `argmax` | calculation of the integer positions at which the minimum or maximum value is reached (Series only)
`idxmin`, `idxmax` | calculation of the index labels at which the minimum or maximum values were reached
`quantile` | calculation of the sample quantile in the range from 0 to 1
`sum` | sum of the values
`mean` | arithmetic mean of the values
`median` | arithmetic median (50% quantile) of the values
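`mad` | mean absolute deviation from the mean value (deprecated in pandas 1.5 and removed in pandas 2.0; `(s - s.mean()).abs().mean()` gives the same result for a Series `s`)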
`prod` | product of all values
`var` | sample variance of the values
`std` | sample standard deviation of the values
`skew` | sample skewness (third moment) of the values
`kurt` | sample kurtosis (fourth moment) of the values
`cumsum` | cumulative sum of the values
`cummin`, `cummax` | cumulative minimum and maximum of the values respectively
`cumprod` | cumulative product of the values
`diff` | calculation of the first arithmetic difference (useful for time series)
`pct_change` | calculation of the percentage changes
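
The last two entries are mainly used for time series. As a minimal sketch with made-up values, `diff` returns the absolute change from the previous element and `pct_change` the relative change; the first element has no predecessor and is therefore `NaN` in both:

```python
import pandas as pd

# Made-up series of three consecutive observations
prices = pd.Series([100.0, 110.0, 99.0])

changes = prices.diff()         # absolute change: NaN, 10.0, -11.0
relative = prices.pct_change()  # relative change: NaN, about 0.1, about -0.1

print(changes)
print(relative)
```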

## `pandas_profiling`

[pandas-profiling](https://pandas-profiling.ydata.ai/docs/master/index.html) generates profile reports from a pandas DataFrame. The pandas `df.describe()` method is handy, but a bit basic for exploratory data analysis. pandas-profiling extends the pandas DataFrame with `df.profile_report()`, which automatically generates a standardised report for understanding the data. The project has since been renamed ydata-profiling; the examples here use the original package name.

### Installation

```bash
$ pipenv install pandas_profiling[notebook]
…
✔ Success!
Updated Pipfile.lock (cbc5f7)!
Installing dependencies from Pipfile.lock (cbc5f7)...
  🐍   ▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉▉ 80/80 — 00:02:26
…
$ pipenv run jupyter nbextension enable --py widgetsnbextension
Enabling notebook extension jupyter-js-widgets/extension...
      - Validating: OK
```

### Example

In [10]:
from pandas_profiling import ProfileReport

profile = ProfileReport(df2, title="Pandas Profiling Report")

profile.to_widgets()
