## Summary Statistics

Summary statistics are numerical measures that condense a dataset into a few values. They describe its main characteristics: central tendency, dispersion, and shape. Common summary statistics include the mean, median, mode, standard deviation, variance, minimum, maximum, and quartiles. Together they show how the data is distributed and give a starting point for further analysis.
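
As a quick illustration of the measures listed above, here is a minimal sketch on a small, made-up series of numbers (the values are hypothetical and not taken from the Titanic data). It only uses standard pandas `Series` methods.

```python
import pandas as pd

# Hypothetical toy data, chosen only to keep the arithmetic easy to follow.
values = pd.Series([2, 4, 4, 5, 7, 9, 11])

# Central tendency
central = {
    "mean": values.mean(),
    "median": values.median(),
    "mode": values.mode().tolist(),  # mode() returns a Series; ties would give several values
}

# Dispersion
spread = {
    "std": values.std(),  # sample standard deviation (ddof=1 by default)
    "var": values.var(),  # sample variance (ddof=1 by default)
    "min": values.min(),
    "max": values.max(),
}

# Quartiles: 25th, 50th and 75th percentiles
quartiles = values.quantile([0.25, 0.5, 0.75])

print(central)
print(spread)
print(quartiles)
```

Note that pandas computes the sample (not population) standard deviation and variance unless `ddof` is changed.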

Let's first import the libraries and then read the data.

In [1]:
import numpy as np
import pandas as pd

titanic = pd.read_csv('datasets/titanic.csv')
titanic.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## Aggregating Statistics

Often we want a single number that summarizes a column, such as its mean or median. This is called an aggregating statistic.

![Agg stats](assets/aggregatestats.png)

For example: What is the average age of the Titanic passengers?

In [2]:
titanic["Age"].mean()

29.69911764705882

Different statistics are available and can be applied to columns with numerical data. In general, these operations exclude missing data and, by default, operate down the rows, returning one value per column.
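
To make the handling of missing data and the default direction concrete, here is a small sketch on a hypothetical three-row frame (the values are illustrative, not the full dataset). The `skipna` and `axis` parameters are standard arguments of the pandas reduction methods.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing age.
df = pd.DataFrame({"Age": [22.0, np.nan, 26.0], "Fare": [7.25, 71.28, 7.93]})

mean_default = df["Age"].mean()              # NaN is skipped: (22 + 26) / 2
mean_strict = df["Age"].mean(skipna=False)   # NaN propagates to the result

per_column = df.mean()        # default axis=0: one value per column
per_row = df.mean(axis=1)     # axis=1: one value per row

print(mean_default, mean_strict)
print(per_column)
print(per_row)
```

Averaging `Age` and `Fare` within a row is not meaningful for this data; `axis=1` is shown only to illustrate the direction of the calculation.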

![Multi stats](assets/multicolstats.png)

For example: What is the median age and ticket fare price of the Titanic passengers?

Selecting two columns returns a ``DataFrame``, and a statistic applied to a ``DataFrame`` is calculated separately for each numeric column:

In [3]:
titanic[["Age", "Fare"]].median()

Age     28.0000
Fare    14.4542
dtype: float64

Several aggregating statistics can also be calculated for multiple columns at the same time. Remember the ``describe`` method?

In [4]:
titanic[["Age", "Fare"]].describe()

|   | Age | Fare |
| --- | --- | --- |
| count | 714.0 | 891.0 |
| mean | 29.699118 | 32.204208 |
| std | 14.526497 | 49.693429 |
| min | 0.42 | 0.0 |
| 25% | 20.125 | 7.9104 |
| 50% | 28.0 | 14.4542 |
| 75% | 38.0 | 31.0 |
| max | 80.0 | 512.3292 |
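
When the fixed set of statistics from ``describe`` is more than we need, one option is the ``DataFrame.agg()`` method, which accepts a dictionary mapping each column to the statistics wanted for it. The sketch below uses only the five rows shown by ``titanic.head()`` earlier, so the numbers describe that small sample and not the full dataset.

```python
import pandas as pd

# The Age and Fare values of the five rows displayed by titanic.head().
sample = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# Choose a different set of statistics for each column.
custom = sample.agg({
    "Age": ["min", "max", "median"],
    "Fare": ["min", "max", "mean"],
})

print(custom)
```

Statistics that were not requested for a column (here the mean of ``Age`` and the median of ``Fare``) appear as ``NaN`` in the result.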


### Aggregating statistics grouped by category

Sometimes we want to calculate the statistics separately for each group in the data.

![groupby](assets/groupby.png)

For example: What is the average age for male versus female Titanic passengers?

As our interest is the average age for each gender, a subselection on these two columns is made first: ``titanic[["Sex", "Age"]]``. Next, the ``groupby()`` method is applied on the Sex column to make a group per category. The average age for each gender is calculated and returned.

In [5]:
titanic[["Sex", "Age"]].groupby("Sex").mean()

| Sex | Age |
| --- | --- |
| female | 27.915709 |
| male | 30.726645 |
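
Selecting the columns first also keeps text columns out of the calculation. In pandas 2.0 and later, taking the mean of a grouped frame that still contains text columns such as ``Name`` is expected to raise an error instead of silently dropping them; this depends on the installed version, so treat it as an assumption to check. A minimal sketch of one alternative, using the five rows shown by ``titanic.head()``, is to pass ``numeric_only=True``:

```python
import pandas as pd

# Four columns of the five rows displayed by titanic.head().
sample = pd.DataFrame({
    "Name": [
        "Braund, Mr. Owen Harris",
        "Cumings, Mrs. John Bradley (Florence Briggs Th...",
        "Heikkinen, Miss Laina",
        "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
        "Allen, Mr. William Henry",
    ],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

# numeric_only=True restricts the mean to the numeric columns (Age and Fare).
by_sex = sample.groupby("Sex").mean(numeric_only=True)

print(by_sex)
```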


Calculating a given statistic (e.g. ``mean`` age) for _each category in a column_ (e.g. male/female in the ``Sex`` column) is a common pattern. The ``groupby`` method supports this type of operation. It is an instance of the more general ``split-apply-combine`` pattern:

1. Split the data into groups

2. Apply a function to each group independently

3. Combine the results into a data structure

The apply and combine steps are typically done together in pandas.
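
To see what the three steps mean, the sketch below carries them out by hand on the five rows shown by ``titanic.head()`` and compares the result with the ``groupby`` one-liner. The manual loop is for illustration only and is not how pandas implements ``groupby`` internally.

```python
import pandas as pd

# Sex and Age of the five rows displayed by titanic.head().
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
})

pieces = {}
for sex in sample["Sex"].unique():
    group = sample[sample["Sex"] == sex]   # 1. split: rows belonging to one category
    pieces[sex] = group["Age"].mean()      # 2. apply: compute the statistic for that group

# 3. combine: collect the per-group results into one Series
manual = pd.Series(pieces, name="Age").sort_index()

# The same calculation with groupby, which handles all three steps
shortcut = sample.groupby("Sex")["Age"].mean()

print(manual)
print(shortcut)
```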

If we are only interested in the average age for each gender, we can also select the column on the grouped data, using square brackets ``[]`` as usual:

![grpbyadv](assets/groupbyadv.png)

In [6]:
titanic.groupby("Sex")["Age"].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

Another example: What is the mean ticket fare price for each of the sex and cabin class combinations?

Grouping can be done by multiple columns at the same time. Provide the column names as a _list_ to the ``groupby()`` method.

In [7]:
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64
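
Grouping by two columns gives a result indexed by both of them. As a small sketch on the five rows shown by ``titanic.head()`` (so the means differ from the full-dataset output above), a single combination can be looked up with a tuple, and ``reset_index()`` turns the group labels back into ordinary columns.

```python
import pandas as pd

# Sex, Pclass and Fare of the five rows displayed by titanic.head().
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

fare_by_group = sample.groupby(["Sex", "Pclass"])["Fare"].mean()

# Look up one (Sex, Pclass) combination with a tuple.
female_third = fare_by_group.loc[("female", 3)]

# Turn the two index levels back into regular columns.
flat = fare_by_group.reset_index()

print(fare_by_group)
print(female_third)
print(flat)
```

Only combinations that occur in the data appear in the result; this sample has no second-class passengers, for example.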

### Count number of records by category

Sometimes we want to count the number of records in each category, as the diagram below shows:

![countb](assets/countb.png)

For example: What is the number of passengers in each of the cabin classes?

The ``value_counts()`` method counts the number of records for each category in a column.

In [8]:
titanic["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

Interesting observation: ``value_counts()`` is a shortcut for a ``groupby`` operation combined with a count of the records in each group. The counts are identical, but ``value_counts()`` sorts the result by count in descending order, whereas ``groupby`` sorts it by group label:

In [9]:
titanic.groupby("Pclass")["Pclass"].count()

Pclass
1    216
2    184
3    491
Name: Pclass, dtype: int64
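
The relationship between the two approaches can be checked directly. The sketch below uses the ``Pclass`` values of the five rows shown by ``titanic.head()``; after sorting the ``value_counts()`` result by its index, both approaches list the same counts in the same order.

```python
import pandas as pd

# Pclass of the five rows displayed by titanic.head().
sample = pd.DataFrame({"Pclass": [3, 1, 3, 1, 3]})

by_value_counts = sample["Pclass"].value_counts()
by_groupby = sample.groupby("Pclass")["Pclass"].count()

# value_counts() orders by count (largest first); groupby orders by label.
print(by_value_counts.index.tolist())
print(by_groupby.index.tolist())

# Once both are ordered by label, the counts agree.
same_counts = by_value_counts.sort_index().tolist() == by_groupby.tolist()
print(same_counts)
```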

To wrap up, here is a summary of what you should remember from this notebook:

1. Aggregation statistics can be calculated on entire columns or rows.

2. ``groupby`` provides the power of the _split-apply-combine_ pattern.

3. ``value_counts`` is a convenient shortcut to count the number of entries in each category of a variable.