# Descriptive Statistics
This guide covers the `summary statistics` functions in Pandas.

---------

### List of methods and properties discussed in this notebook

**Load the data**
- pd.read_csv()

**Calculate summary statistics**
- df["column_name"].mean()
- df["column_name"].median()
- df[["column1","column2"]].median()

**Groupby and aggregate**
- df.groupby(["Sex","Pclass"])["Fare"].mean()

**Value counts**
- df["column_name"].value_counts()

**List of statistics functions**
- count()
- sum()
- mean()
- median()
- mode()
- std()
- min()
- max()
- abs()
- prod()
- cumsum()
- cumprod()
- describe()

-----------


## 1. Import Pandas library

In [6]:
#First, import the Pandas and NumPy libraries
import pandas as pd
import numpy as np

from numpy import random

----------

## 2. Summarizing Descriptive Statistics

**2.1 Overview of Descriptive Statistics**

In [34]:
#Create a DataFrame
df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]], 
                  index = ['a','b','c','d'], 
                  columns=['one', 'two'])

In [35]:
#Preview the dataset
df

| | one | two |
|---|---|---|
| a | 1.4 | NaN |
| b | 7.1 | -4.5 |
| c | NaN | NaN |
| d | 0.75 | -1.3 |


In [36]:
#Calling sum() method
df.sum()

one    9.25
two   -5.80
dtype: float64

In [37]:
#Calling sum() across the columns: one total per row
df.sum(axis=1)

a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

In [38]:
#Calling sum() across the rows: one total per column (the default)
df.sum(axis=0)

one    9.25
two   -5.80
dtype: float64
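The `axis` argument is easy to mix up, so here is a minimal sketch of the same idea using the string aliases `"index"` and `"columns"` instead of `0` and `1`. The variable names (`per_column`, `per_row`, `row_means`) are illustrative only and do not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in section 2.1
df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# axis=0 (alias "index"): collapse the rows, one result per column
per_column = df.sum(axis="index")

# axis=1 (alias "columns"): collapse the columns, one result per row
per_row = df.sum(axis="columns")

# The same argument works for other reductions, for example mean()
row_means = df.mean(axis="columns")

print(per_column)
print(per_row)
print(row_means)
```

A quick way to remember it: the axis you name is the one that disappears from the result.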

----------

**2.2 skipna**

NA values are excluded by default. A row or column that is entirely NA therefore sums to 0, as row `c` does above. Passing `skipna=False` disables this behaviour, so any NA in the row or column makes the result NA:

In [39]:
df.sum(axis='columns', skipna=False)

a     NaN
b    2.60
c     NaN
d   -0.55
dtype: float64
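If an all-NA row should produce NaN rather than 0 while NA values are still skipped elsewhere, the `min_count` argument of `sum()` may be useful. The following is a small sketch, not part of the original notebook; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Default: NA values are skipped, so the all-NA row "c" sums to 0.0
default_sum = df.sum(axis="columns")

# min_count=1: require at least one non-NA value, otherwise return NaN
strict_sum = df.sum(axis="columns", min_count=1)

# skipna applies to other reductions too, for example mean()
mean_default = df.mean(axis="columns")
mean_no_skip = df.mean(axis="columns", skipna=False)

print(default_sum)
print(strict_sum)
print(mean_no_skip)
```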

----------

**2.3 Indirect statistics and accumulations - idxmax(), idxmin(), cumsum()**

Some methods, like `idxmin` and `idxmax`, return indirect statistics like the index value where the minimum or maximum values are attained:

In [40]:
#Finds index where the value is max in a column
df.idxmax()

one    b
two    d
dtype: object

In [41]:
#Finds index where the value is min in a column
df.idxmin()

one    d
two    b
dtype: object

In [42]:
#Cumulative sum
df.cumsum()

| | one | two |
|---|---|---|
| a | 1.4 | NaN |
| b | 8.5 | -4.5 |
| c | NaN | NaN |
| d | 9.25 | -5.8 |
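A common use of `idxmax` is to fetch the whole row where a column peaks. The sketch below assumes the same DataFrame as above and also shows how `cumsum` treats NaN on a single column; the names `label`, `row`, and `running` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# idxmax() returns an index label, which can be passed straight to .loc
label = df["one"].idxmax()
row = df.loc[label]

# cumsum() skips NaN when accumulating but keeps NaN at that position
running = df["one"].cumsum()

print(label)
print(row)
print(running)
```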


------------

**2.4 .describe()**

Another type of method is neither a `reduction` nor an `accumulation`. `describe` is one such example, producing multiple summary statistics in one shot:

In [43]:
df.describe()

| | one | two |
|---|---|---|
| count | 3.0 | 2.0 |
| mean | 3.083333 | -2.9 |
| std | 3.493685 | 2.262742 |
| min | 0.75 | -4.5 |
| 25% | 1.075 | -3.7 |
| 50% | 1.4 | -2.9 |
| 75% | 4.25 | -2.1 |
| max | 7.1 | -1.3 |


<img src="img/descriptive_statistics.png" alt="Descriptive Statistics in Pandas" class="bg-primary mb-1">
<h4><center>Descriptive Statistics in Pandas</center></h4>
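`describe` can be adjusted in a couple of ways. As a hedged illustration (not part of the original notebook), the sketch below requests custom percentiles through the `percentiles` argument and calls `describe` on a Series of strings, where it reports `count`, `unique`, `top`, and `freq` instead of numeric statistics. The `labels` Series is made up for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Ask for the 10th, 50th and 90th percentiles instead of the default quartiles
custom = df.describe(percentiles=[0.1, 0.5, 0.9])

# On non-numeric data, describe() summarizes counts rather than moments
labels = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])
label_summary = labels.describe()

print(custom)
print(label_summary)
```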

-----------

## 3. Descriptive statistics on a particular column

In [44]:
#Calculate mean
df["one"].mean()

3.0833333333333335

In [45]:
#Calculate median
df["one"].median()

1.4

In [46]:
#Calculate median of two columns
df[["one", "two"]].median()

one    1.4
two   -2.9
dtype: float64

In [47]:
#Calculate all the summary statistics on selected columns
df[["one", "two"]].describe()

| | one | two |
|---|---|---|
| count | 3.0 | 2.0 |
| mean | 3.083333 | -2.9 |
| std | 3.493685 | 2.262742 |
| min | 0.75 | -4.5 |
| 25% | 1.075 | -3.7 |
| 50% | 1.4 | -2.9 |
| 75% | 4.25 | -2.1 |
| max | 7.1 | -1.3 |
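When only a handful of statistics are needed, `agg` with a list of function names is one alternative to calling each method separately or running the full `describe`. This is a minimal sketch using the same DataFrame; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
    index=["a", "b", "c", "d"],
    columns=["one", "two"],
)

# Several statistics on one column: returns a Series indexed by statistic name
one_stats = df["one"].agg(["mean", "median", "std"])

# The same on several columns: returns a DataFrame, one row per statistic
both_stats = df[["one", "two"]].agg(["mean", "median"])

print(one_stats)
print(both_stats)
```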


---------

## 4. Unique Values, Value Counts, and Membership

In [48]:
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

**4.1 Unique**<br>
The first function is `unique`, which gives you an array of the `unique values` in a Series:

In [63]:
#Get unique values
uniques = obj.unique()
uniques

array(['c', 'a', 'd', 'b'], dtype=object)

In [64]:
#unique() returns values in order of first appearance, not sorted order.
#If needed, sort the resulting array in place afterwards:
uniques.sort()
uniques

array(['a', 'b', 'c', 'd'], dtype=object)
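Because `uniques.sort()` modifies the array in place, a non-mutating alternative can be handy. The sketch below uses Python's built-in `sorted`, which returns a new list, and `nunique()` to count distinct values; it is an illustration rather than part of the original notebook.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Order of first appearance is preserved by unique()
first_seen = list(obj.unique())

# sorted() returns a new list and leaves the unique values untouched
sorted_uniques = sorted(obj.unique())

# nunique() gives just the number of distinct values
n_unique = obj.nunique()

print(first_seen)
print(sorted_uniques)
print(n_unique)
```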

--------

**4.2 Value counts**

In [66]:
#Get the count of each value
obj.value_counts()

a    3
c    3
b    2
d    1
dtype: int64

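The resulting Series is sorted by count in `descending order` as a convenience. 
`value_counts` is also available as a `top-level pandas function` that can be used with any array or sequence. Note that `pd.value_counts` is deprecated as of pandas 2.1; in newer versions, prefer `pd.Series(values).value_counts()`: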

In [67]:
pd.value_counts(obj.values, sort=False)

d    1
c    3
a    3
b    2
dtype: int64
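Two options on the `value_counts` method are often useful: `normalize=True` returns relative frequencies instead of raw counts, and chaining `sort_index()` orders the result by label rather than by count. A short sketch, using the same `obj` Series:

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Raw counts, sorted by count in descending order
counts = obj.value_counts()

# Relative frequencies that sum to 1
shares = obj.value_counts(normalize=True)

# Counts ordered by label instead of by frequency
by_label = obj.value_counts().sort_index()

print(counts)
print(shares)
print(by_label)
```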

**4.3 isin()**<br>
`isin` performs a vectorized set membership check and can be useful in filtering a dataset down to a subset of values in a Series or column in a DataFrame:

In [68]:
mask = obj.isin(['b', 'c'])

In [69]:
mask

0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool

In [70]:
obj[mask]

0    c
5    b
6    b
7    c
8    c
dtype: object
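The same mask pattern carries over to a DataFrame column, and the `~` operator inverts it to exclude values instead. The `frame` DataFrame and its `position` column below are hypothetical, built only for this illustration.

```python
import pandas as pd

obj = pd.Series(["c", "a", "d", "a", "a", "b", "b", "c", "c"])

# Hypothetical DataFrame: the labels plus their original position
frame = pd.DataFrame({"label": obj, "position": range(len(obj))})

# Keep rows whose label is "b" or "c"
keep = frame[frame["label"].isin(["b", "c"])]

# Invert the mask with ~ to keep everything else
drop = frame[~frame["label"].isin(["b", "c"])]

print(keep)
print(drop)
```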
