# __[How to calculate summary statistics](https://pandas.pydata.org/docs/getting_started/intro_tutorials/06_calculate_statistics.html#how-to-calculate-summary-statistics)__

Import Libraries

In [None]:
import pandas as pd

Data used for this tutorial:

In [None]:
titanic_passenger_data = pd.read_csv("../data/titanic.csv")
titanic_passenger_data.head()

### Aggregating Statistics
![](../utility/calc_sum_stats_01.png)

#### What is the average age of the Titanic passengers?

In [None]:
titanic_passenger_data['Age'].mean()

Different statistics are available and can be applied to columns with numerical data. In general, these operations exclude missing data and operate across rows by default, which yields one value per column.
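
As a minimal sketch of those two defaults, the snippet below uses small made-up values (not the Titanic data) to show how `skipna` and `axis` change the result of `mean()`.

```python
import numpy as np
import pandas as pd

# Made-up ages with one missing value, for illustration only.
ages = pd.Series([22.0, 38.0, np.nan, 35.0])

mean_skipping_nan = ages.mean()           # NaN is ignored: (22 + 38 + 35) / 3
mean_with_nan = ages.mean(skipna=False)   # NaN propagates into the result

# Made-up two-column table to show the default reduction direction.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

per_column = df.mean()       # default axis=0: one value per column
per_row = df.mean(axis=1)    # axis=1: one value per row

print(mean_skipping_nan, mean_with_nan)
print(per_column)
print(per_row)
```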

![](../utility/calc_sum_stats_02.png)

#### What is the median age and ticket fare price of the Titanic passengers?

In [None]:
titanic_passenger_data[['Age', 'Fare']].median()

When the statistic is applied to multiple columns of a `DataFrame` (selecting two columns returns a `DataFrame`, see the __[subset data tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/03_subset_data.html#min-tut-03-subset)__), it is calculated for each numeric column.
<br>
The aggregating statistic can be calculated for multiple columns at the same time. Remember the `describe` function from the __[first tutorial](https://pandas.pydata.org/docs/getting_started/intro_tutorials/01_table_oriented.html#min-tut-01-tableoriented)__?

In [None]:
titanic_passenger_data[['Age', 'Fare']].describe()

Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the __[`DataFrame.agg()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.agg.html#pandas.DataFrame.agg)__ method:

In [None]:
titanic_passenger_data.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)

Details about descriptive statistics are provided in the user guide section on __[descriptive statistics](https://pandas.pydata.org/docs/user_guide/basics.html#basics-stats)__.
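
To see the shape of what `agg()` returns when it is given a dictionary, here is a small sketch on made-up values (not the Titanic data). The result has one row per requested statistic and one column per key, and a statistic that was not requested for a column shows up as `NaN`.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Fare": [10.0, 70.0, 10.0]})

# Different statistics per column: "max" only for Age, "mean" only for Fare.
summary = df.agg({"Age": ["min", "max"], "Fare": ["min", "mean"]})

print(summary)
```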

### Aggregating statistics grouped by category
![](../utility/calc_sum_stats_03.png)

#### What is the average age for male versus female Titanic passengers?

In [None]:
titanic_passenger_data[['Sex', 'Age']].groupby('Sex').mean()

As our interest is the average age for each gender, a sub-selection on these two columns is made first: `titanic_passenger_data[['Sex', 'Age']]`. Next, the __[`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)__ method is applied on the `Sex` column to make a group per category. The average age for each gender is then calculated and returned.

Calculating a given statistic (e.g. `mean` age) for each category in a column (e.g. male/female in the `Sex` column) is a common pattern. The `groupby` method supports this type of operation. More generally, it fits the `split-apply-combine` pattern:
- **Split** the data into groups
- **Apply** a function to each group independently
- **Combine** the results into a data structure

The apply and combine steps are typically done together in pandas.
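
The three steps can be spelled out by hand, which may help to see what `groupby` does in one go. This is only a sketch on made-up rows (not the Titanic data); the variable names `group_means`, `manual` and `direct` are chosen for illustration.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "Sex": ["male", "female", "female", "male"],
        "Age": [22.0, 38.0, 26.0, 35.0],
    }
)

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs.
# Apply: compute the mean age within each sub-DataFrame.
group_means = {key: group["Age"].mean() for key, group in df.groupby("Sex")}

# Combine: collect the per-group results into a single Series.
manual = pd.Series(group_means, name="Age").rename_axis("Sex")

# The usual one-step form, where apply and combine happen together.
direct = df.groupby("Sex")["Age"].mean()

print(manual)
print(direct)
```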

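In the previous example, we explicitly selected the 2 columns first. If we do not, the `mean` method is applied to every column. Passing `numeric_only=True` restricts it to the columns containing numerical data (without it, recent pandas versions raise an error on the text columns):

In [None]:
titanic_passenger_data.groupby("Sex").mean(numeric_only=True)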

It does not make much sense to get the average value of `Pclass` if we are only interested in the average age for each gender. Column selection (square brackets `[]`, as usual) is supported on the grouped data as well:

In [None]:
titanic_passenger_data.groupby('Sex')['Age'].mean()

![](../utility/calc_sum_stats_04.png)
<br>
The `Pclass` column contains numerical data but actually represents 3 categories (or factors), labeled ‘1’, ‘2’ and ‘3’ respectively. Calculating statistics on these does not make much sense. Therefore, pandas provides a `Categorical` data type to handle this type of data. More information is provided in the user guide __[Categorical data](https://pandas.pydata.org/docs/user_guide/categorical.html#categorical)__ section.
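
As a rough illustration of the `Categorical` type, the sketch below converts a made-up series of class labels (not the Titanic data) with `astype("category")`. Counting still works, while a numeric reduction such as `mean()` is rejected.

```python
import pandas as pd

# Made-up class labels for illustration only.
pclass = pd.Series([3, 1, 3, 2, 1])

# Convert the integers to a categorical column.
pclass_cat = pclass.astype("category")

categories = list(pclass_cat.cat.categories)  # the distinct labels
counts = pclass_cat.value_counts()            # counting categories is fine

# Averaging category labels is not meaningful, and pandas refuses to do it.
try:
    pclass_cat.mean()
    mean_is_defined = True
except TypeError:
    mean_is_defined = False

print(pclass_cat.dtype)
print(categories)
print(mean_is_defined)
```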

#### What is the mean ticket fare price for each of the sex and cabin class combinations?

In [None]:
titanic_passenger_data.groupby(['Sex', 'Pclass'])['Fare'].mean()

Grouping can be done by multiple columns at the same time. Provide the column names as a list to the __[`groupby()`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.groupby.html#pandas.DataFrame.groupby)__ method.
<br>
A full description of the split-apply-combine approach is provided in the user guide section on __[groupby operations](https://pandas.pydata.org/docs/user_guide/groupby.html#groupby)__.
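
Grouping by a list of columns returns a result indexed by every combination of the group keys (a `MultiIndex`). The sketch below, on made-up rows (not the Titanic data), shows how one combination can be looked up and how `unstack()` turns the result into a two-way table.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame(
    {
        "Sex": ["female", "female", "male", "male", "male"],
        "Pclass": [1, 3, 1, 3, 3],
        "Fare": [80.0, 8.0, 50.0, 7.0, 9.0],
    }
)

# One mean fare per (Sex, Pclass) combination.
fare_means = df.groupby(["Sex", "Pclass"])["Fare"].mean()

# Look up a single combination with a tuple of keys.
male_third_class = fare_means.loc[("male", 3)]

# Move the Pclass level to the columns: rows are Sex, columns are Pclass.
table = fare_means.unstack("Pclass")

print(fare_means)
print(table)
```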

### Count number of records by category
![](../utility/calc_sum_stats_05.png)

#### What is the number of passengers in each of the cabin classes?

In [None]:
titanic_passenger_data['Pclass'].value_counts()

The __[`value_counts()`](https://pandas.pydata.org/docs/reference/api/pandas.Series.value_counts.html#pandas.Series.value_counts)__ method counts the number of records for each category in a column.
<br>
The method is a shortcut: it corresponds to a groupby operation combined with counting the number of records within each group:

In [None]:
titanic_passenger_data.groupby("Pclass")["Pclass"].count()

Both `size` and `count` can be used in combination with `groupby`. Whereas `size` includes `NaN` values and just provides the number of rows (size of the table), `count` excludes the missing values. In the `value_counts` method, use the `dropna` argument to include or exclude the `NaN` values.
<br>
The user guide has a dedicated section on `value_counts`, see the page on __[discretization](https://pandas.pydata.org/docs/user_guide/basics.html#basics-discretization)__.
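
The difference between `size` and `count`, and the effect of `dropna` in `value_counts`, only becomes visible when values are missing. The following sketch uses made-up rows (not the Titanic data) with two missing ages.

```python
import numpy as np
import pandas as pd

# Made-up rows with missing ages, for illustration only.
df = pd.DataFrame(
    {
        "Pclass": [1, 1, 3, 3, 3],
        "Age": [38.0, np.nan, 22.0, np.nan, 26.0],
    }
)

grouped_age = df.groupby("Pclass")["Age"]

sizes = grouped_age.size()     # rows per group, NaN included
counts = grouped_age.count()   # non-missing values per group

# value_counts drops NaN by default; dropna=False keeps it as its own entry.
age_counts = df["Age"].value_counts()
age_counts_with_nan = df["Age"].value_counts(dropna=False)

print(sizes)
print(counts)
print(age_counts_with_nan)
```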

REMEMBER
- Aggregation statistics can be calculated on entire columns or rows
- `groupby` provides the power of the split-apply-combine pattern
- `value_counts` is a convenient shortcut to count the number of entries in each category of a variable
<br>

A full description of the split-apply-combine approach is provided in the user guide pages about __[groupby operations](https://pandas.pydata.org/docs/user_guide/groupby.html#groupby)__.
