# Standard Deviation & Variance

# UCIMLrepo
This Heart Disease dataset comes from the UC Irvine Machine Learning Repository and is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license. It combines 4 databases: Cleveland, Hungary, Switzerland, and VA Long Beach.

In [5]:
from ucimlrepo import fetch_ucirepo, list_available_datasets

import pandas as pd
import numpy as np

In [6]:
# Uncomment to run a list of datasets
#list_available_datasets()

In [7]:
heart_disease = fetch_ucirepo(id=45)
# fetch_ucirepo returns a dictionary-like object
# heart_disease.data.features is a DataFrame of 303 rows and 13 columns

In [8]:
hd_df = heart_disease.data.features

y = heart_disease.data.targets

hd_df.head()

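|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal |
|---|-----|-----|----|----------|------|-----|---------|---------|-------|---------|-------|-----|------|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0.0 | 6.0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3.0 | 3.0 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2.0 | 7.0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0.0 | 3.0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0.0 | 3.0 |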


## Standard Deviation

### ddof

As a side note, pandas and NumPy both accept a parameter called **"Delta Degrees of Freedom", or ddof**, but their defaults differ:
- pandas `ddof` defaults to 1 (sample standard deviation/variance)
- NumPy `ddof` defaults to 0 (population standard deviation/variance)

The divisor used in the calculation is ***N - ddof***, where ***N*** is the number of data points. The choice of ddof therefore changes the result, and the right value depends on whether the data is a whole population or a sample drawn from one.
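Written out, and assuming the usual definition with $\bar{x}$ as the mean of the data, the variance for a given ddof looks like this (the standard deviation is its square root):

$$ \mathrm{var} = \frac{1}{N - \mathrm{ddof}} \sum_{i=1}^{N} (x_i - \bar{x})^2 $$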

## Population vs. Sample

- **Population** When the data is the entire population, you divide by ***N*** (where ***N*** is the number of data points).
- **Sample** When the data is a sample drawn from a larger population, you divide by ***N - 1***. This corrects the bias in the variance estimate and is known as Bessel's correction.
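The bias that Bessel's correction removes can be seen with a small simulation. This is only a sketch under assumed conditions: the "population" is a standard normal distribution (true variance 1.0), and the sample size, trial count, and seed are arbitrary choices that are not part of the Heart Disease data.

```python
import random
import statistics

# Assumed setup: a standard normal "population" whose true variance is 1.0.
rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5                   # small samples make the bias easy to see
trials = 20_000

biased = []    # variance estimates that divide by N
unbiased = []  # variance estimates that divide by N - 1
for _ in range(trials):
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    biased.append(statistics.pvariance(sample))
    unbiased.append(statistics.variance(sample))

mean_biased = statistics.fmean(biased)
mean_unbiased = statistics.fmean(unbiased)
print(mean_biased, mean_unbiased)
```

Averaged over many samples, dividing by ***N*** should land near (N - 1) / N of the true variance (about 0.8 here), while dividing by ***N - 1*** should land near 1.0.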

### How it Works

- When `ddof = 0` (NumPy default): it calculates the population variance or standard deviation by dividing by ***N***
- When `ddof = 1` (pandas default): it calculates the sample variance or standard deviation by dividing by ***N - 1***
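To make the two divisors concrete, here is a minimal sketch on a small made-up array (the values are illustrative, not from the dataset). It computes the sum of squared deviations by hand and compares each divisor with what `np.var` and `np.std` return.

```python
import numpy as np

x = np.array([2, 4, 4, 4, 5, 5, 7, 9])  # made-up values, mean is 5
n = x.size
ss = ((x - x.mean()) ** 2).sum()        # sum of squared deviations (32 here)

manual_pop_var = ss / (n - 0)     # ddof = 0, divide by N
manual_sample_var = ss / (n - 1)  # ddof = 1, divide by N - 1

print(manual_pop_var, np.var(x))
print(manual_sample_var, np.var(x, ddof=1))
print(np.std(x), np.std(x, ddof=1))
```

With only 8 values the gap between the two results is easy to see; on the 303-row `chol` column the same gap is much smaller because N and N - 1 are nearly equal.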

### Numpy STD

In [9]:
np_std_ddof0 = np.std(hd_df.chol)
np_std_ddof1 = np.std(hd_df.chol, ddof=1)
np_std_ddof0, np_std_ddof1

(51.69140647264888, 51.776917542637015)

### Pandas STD

In [10]:
pandas_std_ddof0 = hd_df['chol'].std(ddof=0)
pandas_std_ddof1 = hd_df['chol'].std()
pandas_std_ddof0, pandas_std_ddof1

(51.69140647264888, 51.776917542637015)

### Numpy Variance

In [11]:
np_var_ddof0 = np.var(hd_df.chol)
np_var_ddof1 = np.var(hd_df.chol, ddof=1)
np_var_ddof0, np_var_ddof1

(2672.0015031206067, 2680.8491902170326)

### Pandas Variance

In [12]:
pd_var_ddof0 = hd_df.chol.var(ddof=0)
pd_var_ddof1 = hd_df.chol.var()
pd_var_ddof0, pd_var_ddof1

(2672.0015031206067, 2680.8491902170326)
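The two variance figures differ only by the factor N / (N - 1), and each standard deviation is the square root of the matching variance. The sketch below checks this with the numbers printed above, assuming `chol` has no missing values so that N is 303.

```python
import math

n = 303                         # assumed number of non-missing chol values
var_ddof0 = 2672.0015031206067  # population variance printed above
var_ddof1 = 2680.8491902170326  # sample variance printed above

# Rescaling the population variance by N / (N - 1) gives the sample variance.
rescaled = var_ddof0 * n / (n - 1)
print(rescaled, var_ddof1)

# The standard deviations are the square roots of the variances.
print(math.sqrt(var_ddof0), math.sqrt(var_ddof1))
```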

## Standard Deviation Equation

### Population

$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $$

In [13]:
def population_std(data):
    # Calculate the mean (denoted by mu in the equation)
    mean = sum(data) / len(data)

    # Subtract the mean from each element and square the result
    squared_difference = [(x - mean) ** 2 for x in data]

    # Compute the population variance: sum of squared differences divided by N
    population_variance = sum(squared_difference) / len(data)

    # The standard deviation is the square root of the population variance
    return population_variance ** 0.5

data = hd_df.chol
std = population_std(data)
print(std)

51.69140647264888


### Sample

$$ s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} $$

In [16]:
def sample_std(data):
    # Calculate the sample mean (denoted by x-bar in the equation)
    mean = sum(data) / len(data)

    # Subtract the mean from each element and square the result
    squared_difference = [(x - mean) ** 2 for x in data]

    # Compute the sample variance: sum of squared differences divided by n - 1
    variance = sum(squared_difference) / (len(data) - 1)

    # The standard deviation is the square root of the sample variance
    return variance ** 0.5

data = hd_df.chol
std = sample_std(data)
print(std)    

51.776917542637015
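Since the two functions differ only in the divisor, they could be folded into one that takes `ddof` as an argument, mirroring the NumPy and pandas parameter. The helper name `std_with_ddof` and the test values are hypothetical, and the standard library's `statistics.pstdev` and `statistics.stdev` are used here only as a cross-check.

```python
import statistics

def std_with_ddof(data, ddof=0):
    """Standard deviation with divisor N - ddof (hypothetical helper)."""
    values = list(data)
    n = len(values)
    if n - ddof <= 0:
        raise ValueError("need more data points than ddof")
    mean = sum(values) / n
    squared_difference = [(x - mean) ** 2 for x in values]
    return (sum(squared_difference) / (n - ddof)) ** 0.5

sample_values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up values
print(std_with_ddof(sample_values, ddof=0), statistics.pstdev(sample_values))
print(std_with_ddof(sample_values, ddof=1), statistics.stdev(sample_values))
```

Calling `std_with_ddof(hd_df.chol, ddof=0)` and `std_with_ddof(hd_df.chol, ddof=1)` should reproduce the population and sample results shown above.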
