# Summary Statistics
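In this guide, we discuss the summary statistics functions in Pandas. The guide covers the following steps:
1. Import Pandas library
2. Load dataset
3. Calculate summary statistics
4. Aggregate statistics grouped by category
5. Count number of records by category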


We practice these commands on the Titanic train dataset.
The dataset can be downloaded from the Kaggle website:

https://www.kaggle.com/c/titanic/data?select=train.csv

---------

### List of methods and properties discussed in this notebook

**Load the data**
- pd.read_csv()

**Calculate summary statistics**
- df["column_name"].mean()
- df["column_name"].median()
- df[["column1","column2"]].median()
- df[["column1","column2"]].describe()

**Groupby and aggregate**
- df.groupby(["Sex","Pclass"])["Fare"].mean()

**Value counts**
- df["column_name"].value_counts()

**List of statistics functions**
- count()
- sum()
- mean()
- median()
- mode()
- std()
- min()
- max()
- abs()
- prod()
- cumsum()
- cumprod()
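
As a quick illustration of the functions listed above, here is a minimal sketch on a small, made-up Series (the values are hypothetical and not taken from the Titanic dataset). Note that `abs()`, `cumsum()` and `cumprod()` return a value per element rather than a single number.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
s = pd.Series([3, 1, 4, 1, 5])

print(s.count())            # number of non-missing values: 5
print(s.sum())              # total: 14
print(s.mean())             # arithmetic mean: 2.8
print(s.median())           # middle value: 3.0
print(s.mode().tolist())    # most frequent value(s); mode() returns a Series: [1]
print(s.std())              # sample standard deviation, about 1.79
print(s.min(), s.max())     # smallest and largest value: 1 5
print(s.prod())             # product of all values: 60

# Element-wise functions return a Series of the same length
print(s.cumsum().tolist())                   # running total: [3, 4, 8, 9, 14]
print(s.cumprod().tolist())                  # running product: [3, 3, 12, 12, 60]
print(pd.Series([-2, 3]).abs().tolist())     # absolute values: [2, 3]
```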

-----------


## 1. Import Pandas library

In [1]:
#First, import the Pandas library
import pandas as pd

## 2. Load dataset

In [2]:
#Next, let's load the dataset into a Pandas dataframe
df = pd.read_csv('train.csv') 

#Since the dataset is in a CSV file, we use pd.read_csv().
#Pandas provides different functions depending on the type of data we want to load into a dataframe.
#More details can be found here:
#https://pandas.pydata.org/pandas-docs/stable/reference/io.html

#The Pandas input/output reference covers the following formats (LaTeX is supported for writing only):
#Table, CSV, Clipboard, Excel, JSON, HTML, XML, LaTeX, HDFStore: PyTables (HDF5), Feather, Parquet, ORC, SAS, SPSS, SQL, Google BigQuery and Stata
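
If `train.csv` is not at hand, the loading step can still be tried out. The sketch below assumes a small, made-up CSV text (the rows are hypothetical, only the column names mirror a few of the Titanic columns) and passes it to `pd.read_csv()` through `io.StringIO`, since `read_csv` accepts file-like objects as well as file paths.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
101,0,3,male,22,7.25
102,1,1,female,38,71.0
103,1,3,female,,8.0
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # column types inferred by read_csv
print(sample)          # the empty Age field is read as NaN
```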

In [3]:
#Preview the first five rows of the dataset
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 3. Calculate Summary Statistics

In [8]:
#Calculate mean
df["Age"].mean()

29.69911764705882

In [9]:
#Calculate median
df["Age"].median()

28.0

In [10]:
#Calculate median of two columns
df[["Age", "Fare"]].median()

Age     28.0000
Fare    14.4542
dtype: float64

In [11]:
#Calculate the main summary statistics (count, mean, std, min, quartiles, max) on selected columns
df[["Age", "Fare"]].describe()

|       | Age | Fare |
|---|---|---|
| count | 714.0 | 891.0 |
| mean | 29.699118 | 32.204208 |
| std | 14.526497 | 49.693429 |
| min | 0.42 | 0.0 |
| 25% | 20.125 | 7.9104 |
| 50% | 28.0 | 14.4542 |
| 75% | 38.0 | 31.0 |
| max | 80.0 | 512.3292 |
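
In the output above, `count` is 714 for Age but 891 for Fare. `count` reports only non-missing values, which indicates that some Age entries are missing. Pandas statistics such as `mean()` skip missing values by default. The following sketch shows this behaviour on a small, made-up Series (the values are hypothetical):

```python
import pandas as pd

# Hypothetical ages with one missing entry (None becomes NaN)
ages = pd.Series([20.0, 30.0, None, 40.0])

n_total = len(ages)            # 4 rows, including the missing one
n_non_missing = ages.count()   # 3, count() ignores NaN

# By default (skipna=True) the missing value is left out of the calculation
mean_default = ages.mean()     # (20 + 30 + 40) / 3 = 30.0

# With skipna=False a single missing value makes the result NaN
mean_strict = ages.mean(skipna=False)

print(n_total, n_non_missing, mean_default, mean_strict)
```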


## 4. Aggregating statistics grouped by category

In [17]:
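#Calculate the mean of every numeric column for each sex
#(numeric_only=True is needed in pandas 2.0 and later, where non-numeric columns are no longer dropped automatically)
df.groupby("Sex").mean(numeric_only=True)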

| Sex | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| female | 431.028662 | 0.742038 | 2.159236 | 27.915709 | 0.694268 | 0.649682 | 44.479818 |
| male | 454.147314 | 0.188908 | 2.389948 | 30.726645 | 0.429809 | 0.235702 | 25.523893 |


In [19]:
#What is the average age for male versus female Titanic passengers?
df.groupby("Sex")["Age"].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

In [23]:
#What is the mean ticket fare for each combination of sex and passenger class?
df.groupby(["Sex","Pclass"])["Fare"].mean()

Sex     Pclass
female  1         106.125798
        2          21.970121
        3          16.118810
male    1          67.226127
        2          19.741782
        3          12.661633
Name: Fare, dtype: float64
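
Grouping by two columns returns a Series indexed by (Sex, Pclass) pairs. The sketch below reproduces the pattern on a small, made-up dataframe (all values are hypothetical) and shows how a single group can be looked up. As an optional extension that the notebook itself does not use, `agg()` can compute several statistics per group in one call.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Titanic dataset
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male"],
    "Pclass": [1, 3, 1, 3, 3],
    "Fare": [80.0, 10.0, 60.0, 8.0, 12.0],
})

# Mean fare for each (Sex, Pclass) combination
fare_by_group = toy.groupby(["Sex", "Pclass"])["Fare"].mean()
print(fare_by_group)

# Look up one group by its (Sex, Pclass) pair: (8 + 12) / 2 = 10.0
print(fare_by_group.loc[("male", 3)])

# Several statistics per group in one call
fare_stats = toy.groupby("Sex")["Fare"].agg(["mean", "median", "count"])
print(fare_stats)
```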

## 5. Count number of records by category

In [26]:
#How many passengers are in each passenger class?
df["Pclass"].value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

The `value_counts()` method counts the number of records for each category in a column.
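
A minimal sketch of `value_counts()` on a small, made-up Series (the values are hypothetical) shows that the result is sorted by count, largest first, as in the output above. The `normalize=True` option, which the notebook does not use, returns shares instead of counts.

```python
import pandas as pd

# Hypothetical passenger classes for six passengers
classes = pd.Series([3, 1, 3, 2, 3, 1])

# Number of records per class, most frequent class first
counts = classes.value_counts()
print(counts)

# Share of records per class instead of absolute counts
shares = classes.value_counts(normalize=True)
print(shares)
```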

The method is a shortcut: it is equivalent to a groupby operation combined with counting the number of records within each group: