# Udacity: Intro to Statistics

These are some notes and code about statistics, written to refresh my memory and practice a little mathematics.

In [2]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

## Probability

The complement rule: the probability of an event is one minus the probability of its opposite:

$ P(A) = 1 - P(\neg A) $
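As a small illustration of the complement rule, the sketch below computes the probability of getting at least one head in a few coin flips by going through the opposite event (no heads at all). The fair coin and the number of flips are assumptions chosen for the example.

```python
# Probability of at least one head, computed through the complement
# "no heads at all". Both values below are assumptions for illustration.
p_head = 0.5   # assumed fair coin
n_flips = 3    # assumed number of flips

p_no_heads = (1 - p_head) ** n_flips
p_at_least_one_head = 1 - p_no_heads

print(p_at_least_one_head)  # 0.875
```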

### Conditional probability

For example, when testing for a disease with a test that is not 100% accurate, the overall probability of a positive test follows from the law of total probability:

$ P(Test) = P(Test | Disease) * P(Disease) + P(Test | \neg Disease) * P(\neg Disease) $
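The formula above can be checked numerically. This is only a sketch: the prevalence and the two test rates are hypothetical numbers, not values from the course.

```python
# Hypothetical numbers, chosen only for illustration.
p_disease = 0.01            # P(Disease)
p_pos_given_disease = 0.9   # P(Test | Disease)
p_pos_given_healthy = 0.1   # P(Test | not Disease)

p_healthy = 1 - p_disease   # complement rule

# Total probability of a positive test
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * p_healthy
print(p_pos)  # approximately 0.108
```

With these numbers most positive tests come from healthy people, because the healthy group is so much larger.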

## Bayes' Rule

$ P(A|B) = \frac {P(B|A)P(A)} {P(B)} $

In [3]:
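# Priors for the two states R and G
pr = 0.5
pg = 0.5
# Probability of each observation given the true state
p_seeR_atR = 0.8
p_seeG_atR = 0.2
p_seeG_atG = 0.5
p_seeR_atG = 0.5
# P(see R), by total probability
normalize = pr * p_seeR_atR + pg * p_seeR_atG
# Posteriors P(at R | see R) and P(at G | see R)
print(str((pr * p_seeR_atR) / normalize))
print(str((pg * p_seeR_atG) / normalize))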

0.615384615385
0.384615384615
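The computation in the cell above can be wrapped in a small helper for the two-hypothesis case. This is a minimal sketch; the function name `posterior` and its parameter names are hypothetical.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Return P(A|B) from P(A), P(B|A) and P(B|not A)."""
    # P(B), by total probability over A and not A
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# Same numbers as the cell above: P(at R) = 0.5,
# P(see R | at R) = 0.8, P(see R | at G) = 0.5
p_atR_seeR = posterior(0.5, 0.8, 0.5)
print(p_atR_seeR)  # approximately 0.615
```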


In [4]:
print str(0.3 / 0.366)
print str(0.033 / 0.366)

0.819672131148
0.0901639344262


In [5]:
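# Priors
gone = 0.6
home = 0.4
# Probability of rain / no rain in each case
rain_home = 0.01
norain_home = 0.99
rain_gone = 0.3
norain_gone = 0.7
# P(rain), by total probability
norm = home * rain_home + gone * rain_gone
print(norm)
# Posterior P(home | rain)
print(str((home * rain_home) / norm))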

0.184
0.0217391304348


#### Programming Bayes' Rule

In [6]:
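def f(p):
    # 3 * p * (1 - p)^2 is the probability of exactly one success
    # in three independent trials with success probability p
    return 3 * p * (1 - p) * (1 - p)
print(str(f(0.5)))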

0.375


## Programming estimators

MLE vs. Laplace ???
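Assuming the note above refers to the maximum likelihood estimate (the plain relative frequency) and the Laplace, or add-one, estimator, the difference can be sketched as follows. The function names and the sample data are hypothetical.

```python
def mle(data, value):
    """Maximum likelihood estimate: relative frequency of `value`."""
    return data.count(value) / float(len(data))

def laplace(data, value, n_outcomes):
    """Laplace (add-one) estimate with `n_outcomes` possible outcomes."""
    return (data.count(value) + 1) / float(len(data) + n_outcomes)

flips = [1, 1, 1]  # hypothetical sample: three heads in a row

print(mle(flips, 1))         # 1.0
print(laplace(flips, 1, 2))  # 0.8
```

With very little data the MLE jumps to the extreme value 1.0, while the Laplace estimate stays away from 0 and 1 by adding one fake observation per possible outcome.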

#### Mean

In [7]:
data1=[49., 66, 24, 98, 37, 64, 98, 27, 56, 93, 68, 78, 22, 25, 11]

def mean(data):
    return sum(data) / len(data)

print mean(data1)

54.4


#### Median

In [8]:
# Only valid for an ODD number of data points
data1=[1,2,5,10,-20]

def median(data):
    sdata = sorted(data)
    # Integer division keeps the index an int
    return sdata[(len(data) - 1) // 2]

print(median(data1))

# If EVEN: we can either take one of the two middle numbers OR their mean

2
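For an even number of points, one of the options mentioned in the comment above is to average the two middle values. A minimal sketch of that variant, using the hypothetical name `median_any` so it does not replace the function above:

```python
def median_any(data):
    """Median for odd or even lengths (averages the two middle values)."""
    sdata = sorted(data)
    n = len(sdata)
    mid = n // 2
    if n % 2 == 1:
        return sdata[mid]
    return (sdata[mid - 1] + sdata[mid]) / 2.0

print(median_any([1, 2, 5, 10, -20]))  # 2
print(median_any([1, 2, 5, 10]))       # 3.5
```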


#### Mode

In [9]:
data1=[1,2,5,10,-20,5,5]

def mode(data):
    best_count = 0
    best_index = 0
    for i in range(len(data)):
        count = data.count(data[i])
        # Keep the first value with the highest count seen so far
        if count > best_count:
            best_index = i
            best_count = count

    return data[best_index]

print(mode(data1))

5


#### Variance

$ V(x) = \sigma^{2} ={\frac  1n}\sum _{{i=1}}^{n}(x_{i}-\mu)^{2} $

In [17]:
data3=[13.04, 1.32, 22.65, 17.44, 29.54, 23.22, 17.65, 10.12, 26.73, 16.43]

def variance(data):
    mu = mean(data)
    return mean([(x - mu) ** 2 for x in data])

print variance(data3)

62.572884


#### Standard deviation

$ \sigma = \sqrt {V(x)} $

In [30]:
from math import sqrt

def stddev(data):
    return sqrt(variance(data))

print(stddev(data3))

7.91030239624


#### Standard score

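The standard score (z-score) tells how many standard deviations a value $x$ lies from the mean:

$ z = \frac {x - \mu} {\sigma} $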
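The standard score can be computed directly from the mean and the standard deviation. The sketch below is self-contained; the function name `standard_score` and the sample values are hypothetical, picked so that the mean is 5 and the standard deviation is 2.

```python
from math import sqrt

def standard_score(x, data):
    """Number of standard deviations between x and the mean of data."""
    mu = sum(data) / float(len(data))
    sigma = sqrt(sum((v - mu) ** 2 for v in data) / len(data))
    return (x - mu) / sigma

sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, std 2

print(standard_score(9.0, sample))  # 2.0
print(standard_score(5.0, sample))  # 0.0
```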

#### Variance correction factor

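The variance formula above divides by $ n $, which tends to underestimate the variance when the data is only a sample of a larger population. Multiplying the result by the following factor corrects for this:

$ \frac {n} {n - 1} $

where $ n $ is the number of data points in the sample.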
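A short sketch of how the correction factor is applied, reusing the data from the variance cell. The function names are hypothetical, and the population variance is redefined here so the block runs on its own.

```python
def population_variance(data):
    """Variance with division by n."""
    mu = sum(data) / float(len(data))
    return sum((x - mu) ** 2 for x in data) / len(data)

def sample_variance(data):
    """Population variance multiplied by the factor n / (n - 1)."""
    n = len(data)
    return population_variance(data) * n / (n - 1.0)

data3 = [13.04, 1.32, 22.65, 17.44, 29.54, 23.22, 17.65, 10.12, 26.73, 16.43]

print(population_variance(data3))  # about 62.57, as in the variance cell
print(sample_variance(data3))      # larger by the factor 10 / 9
```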

## Outliers: ignoring data

In French, "outliers" translates to "valeurs aberrantes".

#### Quartile

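To limit the influence of extreme values, we split the sorted dataset into 4 groups separated by 3 cut values: the median in the center, plus one cut value below it and one above it. The split is clean when the dataset size has the form $ 4n + 3 $: three points are the cut values and the remaining $ 4n $ points form four groups of $ n $.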
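The split can be sketched in code for the clean case of $ 4n + 3 $ data points. The function name `quartiles` is hypothetical, and other dataset sizes are rejected rather than interpolated.

```python
def quartiles(data):
    """Return the 3 cut values (lower, median, upper) for 4n + 3 points."""
    sdata = sorted(data)
    size = len(sdata)
    if size % 4 != 3:
        raise ValueError("expected 4n + 3 data points")
    n = (size - 3) // 4
    # n points, cut, n points, cut, n points, cut, n points
    return sdata[n], sdata[2 * n + 1], sdata[3 * n + 2]

print(quartiles([1, 2, 3, 4, 5, 6, 7]))  # (2, 4, 6)
```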

#### Percentile

Here we remove a certain percentage of the dataset, e.g. the upper 20%, so that the most extreme values are ignored.
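Dropping the upper part of a dataset can be sketched as below. The function name `drop_upper` and the sample values are hypothetical, and the number of dropped points is simply rounded down.

```python
def drop_upper(data, percent):
    """Return the sorted data without the largest `percent` % of points."""
    sdata = sorted(data)
    n_drop = int(len(sdata) * percent / 100.0)  # rounded down
    return sdata[:len(sdata) - n_drop]

values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 100]

print(drop_upper(values, 20))  # [1, 1, 2, 3, 4, 5, 5, 6]
```

Here the two largest values, 9 and 100, are removed before any further statistics are computed.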

## Binomial distribution

For example:

Let's flip 10 coins ($ n = 10 $): how many outcomes contain exactly 5 heads ($ k = 5 $)?

The answer is: $ \frac {10 * 9 * 8 * 7 * 6} {5 * 4 * 3 * 2 * 1} = 252 $

The general formula is:
$ \frac {n!} {k! * (n - k)!} $
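The counting formula translates directly into code with `math.factorial`. The function name `choose` is a hypothetical choice for this sketch.

```python
from math import factorial

def choose(n, k):
    """Number of ways to pick k items out of n: n! / (k! * (n - k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

print(choose(10, 5))  # 252
```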

The _binomial distribution_:

$ \frac {n!} {k! * (n - k)!} * p^k * (1 - p)^{n-k} $

Where $k \le n$ is the number of successes (e.g. heads), $n$ is the number of trials, and $p$ is the probability of success on a single trial.
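A minimal sketch of the binomial distribution formula in code. The function name `binomial_pmf` is hypothetical; the second call reproduces the value printed earlier by `f(0.5)`, which corresponds to $k = 1$ and $n = 3$.

```python
from math import factorial

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success rate p."""
    coeff = factorial(n) // (factorial(k) * factorial(n - k))
    return coeff * p ** k * (1 - p) ** (n - k)

print(binomial_pmf(5, 10, 0.5))  # 0.24609375
print(binomial_pmf(1, 3, 0.5))   # 0.375
```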