# Statistics
- the mathematics and techniques with which we understand data.

## Central Tendencies
- Where is the 'center' of our data?

In [2]:
def mean(x):
    return sum(x) / len(x)

In [3]:
a_list_of_numbers=[1,2]
mean(a_list_of_numbers)

1.5

The median is another measure of central tendency. It is the middle value when the number of data points is odd, and the average of the two middle values when it is even.

In [None]:
def median(v):
    """finds the 'middle-most' value of v"""
    n = len(v)
    sorted_v = sorted(v)
    midpoint = n // 2
    if n % 2 == 1:
        # if odd, return the middle value
        return sorted_v[midpoint]
    else:
        # if even, return the average of the middle values
        lo = midpoint - 1
        hi = midpoint
        return (sorted_v[lo] + sorted_v[hi]) / 2

- The mean is very sensitive to outliers
- The median, not so much
- Thus, the mean can sometimes give us a misleading picture
- A generalisation of the median is the quantile, the value below which a given fraction of the data lies

In [None]:
def quantile(x, p):
    """returns the value at fraction p (0 <= p <= 1) of the sorted data"""
    p_index = min(int(p * len(x)), len(x) - 1)  # clamp so p = 1 is valid
    return sorted(x)[p_index]
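
As a rough illustration of the outlier sensitivity noted above, the sketch below compares the mean, the median, and a quantile before and after one extreme value is added. The `salaries` figures are made up for the example, and `mean` and `median` are taken from the standard library's `statistics` module rather than from the cells above.

```python
from statistics import mean, median

def quantile(x, p):
    """Value at fraction p of the sorted data (simple index-based rule)."""
    s = sorted(x)
    p_index = min(int(p * len(s)), len(s) - 1)  # clamp so p = 1 is valid
    return s[p_index]

salaries = [30, 32, 35, 38, 40]      # hypothetical values
with_outlier = salaries + [1000]     # the same data plus one extreme value

mean_before, mean_after = mean(salaries), mean(with_outlier)
median_before, median_after = median(salaries), median(with_outlier)

print(mean_before, mean_after)       # the mean jumps from 35 to about 195.8
print(median_before, median_after)   # the median only moves from 35 to 36.5
print(quantile(salaries, 0.5), quantile(with_outlier, 0.5))
```

Note that this index-based `quantile` does not average the two middle values, so for an even-length list `quantile(x, 0.5)` can differ slightly from `median(x)`.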

- Less commonly, we might want to look at the mode, the most common value (or values)

In [None]:
from collections import Counter

def mode(x):
    """returns a list, since there might be more than one mode"""
    counts = Counter(x)
    max_count = max(counts.values())
    # items() replaces the Python 2-only iteritems()
    return [x_i for x_i, count in counts.items()
            if count == max_count]
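
A small usage sketch, assuming Python 3.8 or later: the same counting approach applied to a made-up list with two modes, alongside the standard library's `statistics.multimode`, which should give the same answer.

```python
from collections import Counter
from statistics import multimode  # available from Python 3.8

scores = [1, 2, 2, 3, 3, 4]       # hypothetical data with two modes

counts = Counter(scores)
max_count = max(counts.values())
# keep every value that occurs as often as the most frequent one
modes = [value for value, count in counts.items() if count == max_count]

stdlib_modes = multimode(scores)

print(modes)         # [2, 3]
print(stdlib_modes)  # [2, 3]
```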

## Dispersion
- measures of how spread out our data is (not spread out vs very spread out)

In [None]:
def data_range(x):
    return max(x) - min(x)

The range is a poor measure of dispersion. It depends only on the minimum and maximum values and ignores how the rest of the data is distributed.

A more complex measure of dispersion is the variance.

In [None]:
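def sum_of_squares(v):
    """v_1 * v_1 + ... + v_n * v_n"""
    return sum(v_i ** 2 for v_i in v)

def de_mean(x):
    """translate x by subtracting its mean (so the result has mean 0)"""
    x_bar = mean(x)
    return [x_i - x_bar for x_i in x]

def variance(x):
    """assumes x has at least two elements"""
    n = len(x)
    deviations = de_mean(x)
    # divide by n - 1 (sample variance) rather than n
    return sum_of_squares(deviations) / (n - 1)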

Variance has units that are the square of the original units, which is not very intuitive.

Therefore, we often use its square root, the standard deviation, which has the same units as the data.

In [None]:
import math

def standard_deviation(x):
    return math.sqrt(variance(x))
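
To make the point about units concrete, here is a minimal self-contained sketch on made-up heights. The helper `sample_variance` is a hypothetical stand-in for the `variance` cell above (same `n - 1` divisor). Converting centimetres to metres divides the standard deviation by 100 but the variance by 100 squared.

```python
import math

def sample_variance(x):
    """Sample variance with the n - 1 divisor (needs at least two values)."""
    n = len(x)
    x_bar = sum(x) / n
    return sum((x_i - x_bar) ** 2 for x_i in x) / (n - 1)

heights_cm = [150, 160, 170, 180, 190]        # hypothetical data
heights_m = [h / 100 for h in heights_cm]     # same data in metres

var_cm2 = sample_variance(heights_cm)         # units: cm squared
std_cm = math.sqrt(var_cm2)                   # units: cm
var_m2 = sample_variance(heights_m)           # units: m squared
std_m = math.sqrt(var_m2)                     # units: m

print(var_cm2, std_cm)   # 250.0 and roughly 15.81
print(var_m2, std_m)     # roughly 0.025 and 0.158
```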

Both the range and the standard deviation have the same outlier problem as the mean.

A more robust alternative is the interquartile range: the difference between the 75th percentile value and the 25th percentile value.

In [None]:
def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)
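
The robustness claim can be checked with a small sketch on made-up numbers. It uses `statistics.stdev` from the standard library and a local copy of the index-based `quantile`, so it runs on its own. Replacing the largest value with an extreme one changes the range and the standard deviation a great deal, while the interquartile range stays put.

```python
from statistics import stdev

def quantile(x, p):
    """Value at fraction p of the sorted data (simple index-based rule)."""
    s = sorted(x)
    return s[min(int(p * len(s)), len(s) - 1)]

def interquartile_range(x):
    return quantile(x, 0.75) - quantile(x, 0.25)

clean = list(range(1, 11))            # 1, 2, ..., 10 (hypothetical data)
outlier = clean[:-1] + [1000]         # largest value replaced by 1000

range_clean = max(clean) - min(clean)
range_outlier = max(outlier) - min(outlier)
iqr_clean = interquartile_range(clean)
iqr_outlier = interquartile_range(outlier)

print(range_clean, range_outlier)     # 9 vs 999
print(stdev(clean), stdev(outlier))   # the second is far larger
print(iqr_clean, iqr_outlier)         # 5 vs 5
```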

## Correlation
- Used to investigate the relationship between metrics
- Look first at covariance (paired analogue of variance). 
 - Variance measures how a single variable deviates from its mean
 - Covariance measures how two variables vary in tandem from their means

In [None]:
def dot(v, w):
    return sum(v_i * w_i for v_i, w_i in zip(v, w))

def covariance(x, y):
    n = len(x)
    return dot(de_mean(x), de_mean(y)) / (n - 1)

- dot sums up the products of corresponding pairs of elements
- When corresponding elements of x and y are both above or both below their means, a positive number enters the sum.
- When one is above its mean and the other is below, a negative number enters the sum.
- Therefore, a 'large' positive covariance means that x tends to be large when y is large, and small when y is small.
- A 'large' negative covariance means the opposite: x tends to be small when y is large, and vice versa.

Covariance is hard to interpret:
- its units are the product of the units of x and y, which often makes little sense
- its size depends on the scale of x and y, so there is no standard for what counts as a 'large' covariance

Therefore, it is more common to look at the correlation, which divides out the standard deviations of both variables:

In [None]:
def correlation(x, y):
    stdev_x = standard_deviation(x)
    stdev_y = standard_deviation(y)
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    else:
        # if no variation, correlation is zero
        return 0

- Correlation is unitless and always lies between -1 and 1
 - it can be sensitive to outliers too
 - it is useful to analyse outliers and decide if they can be removed or not
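
A minimal sketch of why correlation is easier to compare than covariance, using made-up study data. The functions are compact local versions of the cells above so the block runs on its own. Expressing the same durations in minutes instead of hours multiplies the covariance by 60 but leaves the correlation unchanged.

```python
import math

def mean(x):
    return sum(x) / len(x)

def covariance(x, y):
    """Sample covariance with the n - 1 divisor."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)

def correlation(x, y):
    stdev_x = math.sqrt(covariance(x, x))   # covariance of x with itself is its variance
    stdev_y = math.sqrt(covariance(y, y))
    if stdev_x > 0 and stdev_y > 0:
        return covariance(x, y) / stdev_x / stdev_y
    return 0                                # no variation: treat correlation as zero

hours = [1, 2, 3, 4, 5]                     # hypothetical study time
score = [2, 4, 5, 4, 5]                     # hypothetical test scores
minutes = [60 * h for h in hours]           # same durations, different unit

cov_hours = covariance(hours, score)
cov_minutes = covariance(minutes, score)
corr_hours = correlation(hours, score)
corr_minutes = correlation(minutes, score)

print(cov_hours, cov_minutes)               # 1.5 vs 90.0
print(corr_hours, corr_minutes)             # both roughly 0.775
```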

## Simpson's Paradox
- correlation can be misleading when confounding variables are ignored.
- once you divide the data further by considering certain attributes, correlation might change 
- see page 65 (Data Science from Scratch book) for example
- The key issue: correlation measures the relationship between two variables assuming 'all else being equal'
- When there is a deeper pattern to how the data falls into classes, this assumption becomes unhelpful or even misleading.
- Important to know your data and check for possible confounding factors.
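
The effect can be reproduced with a tiny synthetic data set. All values below are invented purely for illustration and are not the book's example. Within each group, x and y are perfectly negatively correlated, yet pooling the groups gives a strongly positive correlation, because group membership acts as the confounding variable.

```python
import math

def correlation(x, y):
    """Pearson correlation; returns 0 if either variable has no variation."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0
    return sxy / math.sqrt(sxx * syy)

# synthetic groups: y falls as x rises inside each group
x_a, y_a = [1, 2, 3], [6, 5, 4]
x_b, y_b = [6, 7, 8], [11, 10, 9]

corr_a = correlation(x_a, y_a)                  # within group A
corr_b = correlation(x_b, y_b)                  # within group B
corr_all = correlation(x_a + x_b, y_a + y_b)    # groups pooled together

print(corr_a, corr_b)   # both -1.0
print(corr_all)         # roughly 0.81
```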

## Correlation and Causation
- When x is correlated with y, it could mean that
 - x causes y
 - y causes x
 - some third factor causes both
 - nothing at all (the correlation is coincidental)

- To be more confident about causality, conduct randomised trials.

## Further Exploration
- SciPy, Pandas, StatsModels

# References
- Data Science from Scratch