# Describing a single set of data

## Initial methods

- show the data
- chart the data
- find the largest and smallest values
- find values in specific positions
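
The non-charting steps above can be sketched with Python built-ins. This is a minimal illustration: the `values` list is made-up toy data, not the data set used elsewhere in these notes.

```python
from typing import List

# Hypothetical toy data standing in for a real data set
values: List[float] = [7, 1, 3, 9, 3, 5]

num_points = len(values)          # how many observations there are
largest = max(values)             # largest value
smallest = min(values)            # smallest value

# Sorting once makes "values in specific positions" simple index lookups
sorted_values = sorted(values)
second_smallest = sorted_values[1]
second_largest = sorted_values[-2]
```

Sorting first keeps the positional lookups readable, at the cost of an extra copy of the data.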

## Central Tendencies

**Where is our data centered?**

- typically the mean, but sometimes the median
- a generalization of the median is the *quantile*
    - this represents the value under which a certain percentage of the data lies (the median is the value under which 50% of the data lies)
- less commonly, you may use the mode (most common values)
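
One possible way to write these measures from scratch is sketched below. The function bodies are assumptions for illustration; in particular, `quantile` uses a simple index-based convention with no interpolation, which is only one of several common definitions.

```python
from collections import Counter
from typing import List

def mean(xs: List[float]) -> float:
    """Arithmetic mean: the sum divided by the number of elements"""
    return sum(xs) / len(xs)

def median(xs: List[float]) -> float:
    """Middle value of the sorted data (average of the two middle values if the length is even)"""
    s = sorted(xs)
    mid = len(s) // 2
    return s[mid] if len(s) % 2 == 1 else (s[mid - 1] + s[mid]) / 2

def quantile(xs: List[float], p: float) -> float:
    """Value at fraction p of the sorted data (assumed convention: no interpolation)"""
    p_index = min(int(p * len(xs)), len(xs) - 1)   # clamp so p = 1.0 stays in range
    return sorted(xs)[p_index]

def mode(xs: List[float]) -> List[float]:
    """All most-common values; a list, because there can be ties"""
    counts = Counter(xs)
    max_count = max(counts.values())
    return [x for x, count in counts.items() if count == max_count]
```

The mean shifts whenever any single value changes, while the median depends only on the middle of the sorted data, which is why the median is sometimes preferred when outliers are present.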

## Dispersion

**Dispersion** 

- refers to measures of how spread out the data is

**Range**
  
- difference between the largest and smallest elements
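
A minimal sketch of the range follows. The name `data_range` is chosen here only to avoid shadowing Python's built-in `range`.

```python
from typing import List

def data_range(xs: List[float]) -> float:
    """Difference between the largest and smallest elements"""
    return max(xs) - min(xs)
```

Because it looks only at the two extreme values, the range says nothing about how the rest of the data is distributed.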

**Variance**

- more complex measure of dispersion
- almost the average squared deviation from the mean
    - we divide by n-1 instead of n
    - x_bar is only an estimate of the actual mean, so on average `(x_i - x_bar) ** 2` underestimates x_i's squared deviation from the actual mean; dividing by n-1 instead of n corrects for this
    
    
    from typing import List
    from scratch.linear_algebra import sum_of_squares

    # assumes `mean` and the `num_friends` data are already defined

    def de_mean(xs: List[float]) -> List[float]:
        """Translate xs by subtracting its mean (so the result has mean 0)"""
        x_bar = mean(xs)
        return [x - x_bar for x in xs]

    def variance(xs: List[float]) -> float:
        """Almost the average squared deviation from the mean"""
        assert len(xs) >= 2, "variance requires at least two elements"

        n = len(xs)
        deviations = de_mean(xs)
        return sum_of_squares(deviations) / (n - 1)

    assert 81.54 < variance(num_friends) < 81.55
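
To make the n versus n-1 distinction concrete, the sketch below computes both versions on a small made-up sample and compares them with the standard library's `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1). The sample values are hypothetical and are not the `num_friends` data.

```python
import statistics

# Hypothetical toy sample, not the num_friends data
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

n = len(sample)
x_bar = statistics.mean(sample)
sum_sq_dev = sum((x - x_bar) ** 2 for x in sample)

divide_by_n = sum_sq_dev / n                  # population formula
divide_by_n_minus_1 = sum_sq_dev / (n - 1)    # sample formula, as in variance() above

# The standard library offers both conventions
population_variance = statistics.pvariance(sample)
sample_variance = statistics.variance(sample)
```

Dividing by n-1 always gives the larger of the two results, and the gap shrinks as n grows.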
    
**Standard deviation**

- the square root of the variance


    import math

    def standard_deviation(xs: List[float]) -> float:
        """The standard deviation is the square root of the variance"""
        return math.sqrt(variance(xs))

    assert 9.02 < standard_deviation(num_friends) < 9.04
   

Both the range and the standard deviation are sensitive to outliers, just as the mean is.

A more robust alternative computes the difference between the 75th percentile value and the 25th percentile value.

**Interquartile range**

- the difference between the 75%-ile and the 25%-ile



    def interquartile_range(xs: List[float]) -> float:
        """Returns the difference between the 75%-ile and the 25%-ile"""
        return quantile(xs, 0.75) - quantile(xs, 0.25)

    assert interquartile_range(num_friends) == 6
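
The robustness claim can be checked on toy data. In this sketch the data values are made up, and `quantile` is an assumed simple definition (index-based, no interpolation), since these notes do not show one.

```python
import statistics
from typing import List

def quantile(xs: List[float], p: float) -> float:
    """Simple quantile without interpolation (an assumed convention)"""
    p_index = min(int(p * len(xs)), len(xs) - 1)
    return sorted(xs)[p_index]

def interquartile_range(xs: List[float]) -> float:
    """Difference between the 75th and 25th percentile values"""
    return quantile(xs, 0.75) - quantile(xs, 0.25)

def data_range(xs: List[float]) -> float:
    """Difference between the largest and smallest elements"""
    return max(xs) - min(xs)

# Hypothetical toy data
clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
with_outlier = clean[:-1] + [1000]    # replace the largest value with an extreme one

iqr_clean = interquartile_range(clean)
iqr_outlier = interquartile_range(with_outlier)      # unchanged by the outlier

range_clean = data_range(clean)
range_outlier = data_range(with_outlier)             # grows with the outlier

stdev_clean = statistics.stdev(clean)
stdev_outlier = statistics.stdev(with_outlier)       # also grows with the outlier
```

The interquartile range ignores the top and bottom quarters of the data, so a single extreme value cannot move it, while the range and standard deviation both change substantially.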


## Correlation

**Covariance**

- the paired analogue of variance
- whereas variance measures how a single variable deviates from its mean, covariance measures how two variables vary in tandem from their means

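    from scratch.linear_algebra import dot

    def covariance(xs: List[float], ys: List[float]) -> float:
        """Paired analogue of variance: how xs and ys vary in tandem about their means"""
        assert len(xs) == len(ys), "xs and ys must have same number of elements"
        return dot(de_mean(xs), de_mean(ys)) / (len(xs) - 1)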

    assert 22.42 < covariance(num_friends, daily_minutes) < 22.43

    assert 22.42 / 60 < covariance(num_friends, daily_hours) < 22.43 / 60

Recall that `dot` sums the products of corresponding pairs of elements.

- When corresponding elements of x and y are either both above their means or both below their means, a positive number enters the sum. 
- When one is above and the other is below, a negative number enters the sum.
- A large positive covariance means that x tends to be large when y is large and small when y is small
- A large negative covariance means that x tends to be large when y is small and small when y is large
- A covariance close to zero means that no such relationship exists

- Covariance itself can be hard to interpret
    - its units are the product of the inputs' units, which can be hard to make sense of
    - it's hard to say what counts as a large covariance
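
The sign behaviour and the units problem can both be seen on toy data. This sketch defines its own `covariance` without helpers so that it runs on its own; the data values and variable names are made up for illustration.

```python
from typing import List

def covariance(xs: List[float], ys: List[float]) -> float:
    """Sample covariance, written without helpers so the sketch is self-contained"""
    assert len(xs) == len(ys), "xs and ys must have same number of elements"
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical toy data
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
rising = [10.0, 20.0, 30.0, 40.0, 50.0]     # moves with xs
falling = [50.0, 40.0, 30.0, 20.0, 10.0]    # moves against xs

cov_rising = covariance(xs, rising)         # positive
cov_falling = covariance(xs, falling)       # negative

# The same relationship expressed in different units gives a different covariance
rising_rescaled = [y / 10 for y in rising]
cov_rescaled = covariance(xs, rising_rescaled)
```

The sign is informative, but the magnitude changes by a factor of ten when the units change, even though the relationship between the variables is the same.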

For this reason, it's more common to look at **correlation**.

- divides out the standard deviations of both variables
- measures how much xs and ys vary in tandem about their means


    def correlation(xs: List[float], ys: List[float]) -> float:
        """Measures how much xs and ys vary in tandem about their means"""
        stdev_x = standard_deviation(xs)
        stdev_y = standard_deviation(ys)
        if stdev_x > 0 and stdev_y > 0:
            return covariance(xs, ys) / stdev_x / stdev_y
        else:
            return 0    # if no variation, correlation is zero

    assert 0.24 < correlation(num_friends, daily_minutes) < 0.25
    
    assert 0.24 < correlation(num_friends, daily_hours) < 0.25

Correlation is always unitless and always lies between -1 (perfect anticorrelation) and 1 (perfect correlation). 

- a number like 0.25 represents a relatively weak positive correlation
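
A short check of the unitless property, using made-up data. The `correlation` function here is a self-contained sketch computed directly from deviations; the n-1 factors in the covariance and the standard deviations cancel, so they are omitted.

```python
import math
from typing import List

def correlation(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation computed directly from deviations (self-contained sketch)"""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical toy data
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [2.0, 1.0, 4.0, 3.0, 5.0]

r_hours = correlation(hours, scores)

# Re-express the first variable in minutes: the correlation should not change
minutes = [h * 60 for h in hours]
r_minutes = correlation(minutes, scores)
```

This mirrors the `daily_minutes` and `daily_hours` assertions above: rescaling a variable changes its covariance with another variable but leaves the correlation as it was.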

Correlation can be very sensitive to outliers. Try to remove them where possible.
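
As an illustration of that sensitivity, the sketch below uses made-up data in which a single extreme point flips a perfect negative correlation into a strongly positive one. The `correlation` function is a self-contained stand-in for the one defined above.

```python
import math
from typing import List

def correlation(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation computed directly from deviations (self-contained sketch)"""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical toy data: ys falls steadily as xs rises
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
r_without_outlier = correlation(xs, ys)

# Add one extreme point far from the rest of the data
r_with_outlier = correlation(xs + [100.0], ys + [100.0])
```

Six of the seven points still follow the downward trend, yet the single extreme point dominates the result.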

## Simpson's paradox

- correlations can be misleading when confounding variables are ignored

Example: looking at number of friends by region, but neglecting to look at type of degree
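
The example can be made concrete with invented numbers. In this sketch the member records are hypothetical toy values chosen to produce the paradox; the region and degree labels are also assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Tuple

Member = Tuple[str, str, int]    # (region, degree, number of friends)

# Hypothetical toy records chosen to produce the paradox
members: List[Member] = [
    ("West", "PhD", 3),
    ("West", "no PhD", 10),
    ("West", "no PhD", 10),
    ("West", "no PhD", 10),
    ("East", "PhD", 4),
    ("East", "PhD", 4),
    ("East", "PhD", 4),
    ("East", "no PhD", 11),
]

def average_friends_by(key: Callable[[Member], Hashable]) -> Dict[Hashable, float]:
    """Average number of friends within each group defined by key"""
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for member in members:
        groups[key(member)].append(member[2])
    return {group: sum(friends) / len(friends) for group, friends in groups.items()}

by_region = average_friends_by(lambda m: m[0])
by_region_and_degree = average_friends_by(lambda m: (m[0], m[1]))
```

Here West has the higher average overall (8.25 versus 5.75), yet East has the higher average within each degree group (4 versus 3 for PhDs, 11 versus 10 otherwise). The reversal comes from the regions having different mixes of degree types.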

## Some other correlational caveats

- A correlation of zero indicates that there is no linear relationship between the two variables; other kinds of relationships may still exist
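
A small made-up example shows a zero correlation alongside an exact nonlinear relationship: each y is the absolute value of the corresponding x. The `correlation` function is a self-contained sketch, not necessarily the definition used above.

```python
import math
from typing import List

def correlation(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation computed directly from deviations (self-contained sketch)"""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]
    dy = [y - y_bar for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator if denominator > 0 else 0.0

# Hypothetical toy data: y is fully determined by x, but not linearly
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [abs(x) for x in xs]

r = correlation(xs, ys)    # zero up to floating-point error
```

Knowing x tells you y exactly, but the positive and negative halves cancel out in the sum, so correlation does not detect the relationship.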

## Correlation and causation

Correlation may reflect causation, but it does not imply it.

- One way to feel more confident about causality is by conducting randomized trials
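
The core of a randomized trial is random assignment to groups. A minimal sketch of that step follows; the function name `random_split`, the fixed seed, and the user ids are illustrative assumptions.

```python
import random
from typing import List, Tuple

def random_split(ids: List[int], seed: int = 0) -> Tuple[List[int], List[int]]:
    """Randomly assign ids to two groups of (nearly) equal size"""
    rng = random.Random(seed)    # fixed seed so the split is reproducible
    shuffled = ids[:]            # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

user_ids = list(range(10))       # hypothetical user ids
treatment, control = random_split(user_ids)
```

Because assignment is random, the two groups should be similar on average in everything except the treatment, so a difference in outcomes is harder to attribute to a confounding variable.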