# D8 Lec 25, Prof Sanchez 
## Center and Spread
#### Sean Villegas

[Reading](https://inferentialthinking.com/chapters/14/2/Variability.html)

#### Quantifying center and variability: 
- center: mean (average)
    - the mean quantifies the center of a dataset; imagine placing a ∆ under the histogram at the point where it would balance
    - The mean of a set of numbers does not need to be in the set 
    - The mean is between the min and max values
    - The mean is in the same units as the data!
- variability: the standard deviation 
- median is the 50th percentile of the data

**Both mean and median quantify center** 
- If the distribution is symmetric about a value:
    - then that value is both the mean and the median 
- If the histogram is skewed (asymmetric):
    - then the mean is pulled away from the median in the direction of the skew (left side or right side of tail)
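
A minimal sketch of the skew effect, using made-up numbers (both arrays are hypothetical and chosen only for illustration). Pushing the largest value far to the right creates a right tail; the median stays put while the mean follows the tail.

```python
import numpy as np

# Hypothetical data, symmetric about 5.
symmetric = np.array([3, 4, 5, 6, 7])
# Same data with the largest value pushed far out, creating a right tail.
right_skewed = np.array([3, 4, 5, 6, 70])

sym_mean, sym_median = np.mean(symmetric), np.median(symmetric)
skew_mean, skew_median = np.mean(right_skewed), np.median(right_skewed)

print(sym_mean, sym_median)    # mean and median agree for symmetric data
print(skew_mean, skew_median)  # mean is pulled toward the right tail
```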

**Quantifying variability** 
- focus on the variability of a typical observation
- The average describes a typical observation
- The standard deviation (SD) measures roughly how far the data are from their average
- The SD also has the same units as the data


#### Calculating Standard Deviation 
_the SD is built into NumPy as `np.std`, which handles the steps described below_

Definition and Steps: **The square root of the mean of the squared deviations from the average.** 

_It measures the spread of the data around the mean—larger SD means more variation, while a smaller SD means the data points are closer to the mean._ 
1. Deviations from Average: Subtract the mean from each data point. 
2. Square: Square each deviation to eliminate negatives. 
3. Mean: Take the average of the squared deviations. 
4. Square Root: Compute the square root of this average to get the final standard deviation. 

```python
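# Setup: imports and example data.
# The values below are hypothetical; any numeric array works.
import numpy as np
from datascience import *

any_numbers = make_array(1, 2, 2, 10)

# Step 1. The average.
mean = np.mean(any_numbers)
mean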

# Step 2. The deviations from average.
deviations = any_numbers - mean
calculation_steps = Table().with_columns(
        'Value', any_numbers,
        'Deviation from Average', deviations
        )
calculation_steps

# Step 3. The squared deviations from average

squared_deviations = deviations ** 2
calculation_steps = calculation_steps.with_column(
   'Squared Deviations from Average', squared_deviations
    )
calculation_steps

# Step 4. Variance = the mean squared deviation from average

variance = np.mean(squared_deviations)

# Step 5.
# Standard Deviation:    root mean squared deviation from average
# Steps of calculation:   5    4      3       2             1

sd = variance ** 0.5

```
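
As a quick sanity check, the hand calculation can be compared with `np.std`. This is a sketch on a hypothetical array `sample`; it relies on `np.std` dividing by n by default (`ddof=0`), which matches the definition above.

```python
import numpy as np

# Hypothetical sample values, chosen only for illustration.
sample = np.array([1, 2, 2, 10])

deviations = sample - np.mean(sample)            # deviations from average
squared_deviations = deviations ** 2             # square each deviation
variance_by_hand = np.mean(squared_deviations)   # mean of the squared deviations
sd_by_hand = variance_by_hand ** 0.5             # square root

# np.std uses ddof=0 by default, i.e. it divides by n, not n - 1.
sd_numpy = np.std(sample)

print(sd_by_hand, sd_numpy)
```

The deviations always sum to 0 (up to rounding), which is why they are squared before averaging.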


#### Chebyshev's Inequality 
_these bounds hold no matter the shape of the distribution_ 

```python
def chebyshev(num_SDs):
    '''returns the least proportion of the data in +/- num_SDs
    2 0.75
    3 0.888888888889
    4 0.9375
    '''
    z = num_SDs
    return 1 - 1/z**2  

for i in np.arange(2,5):
    print(i, chebyshev(i))


```
- the proportion of values in the range “mean ± z SDs” is at least 1 - 1/z<sup>2</sup>

| Range | Proportion | 
| ---  | --- |
| mean ± 2 SDs | at least 1 - 1/4 = 3/4    (75%)| 
| mean ± 3 SDs | at least 1 - 1/9 = 8/9   (88.88…%) | 
| mean ± 4 SDs | at least 1 - 1/16 = 15/16 (93.75%) | 
| mean ± 5 SDs | at least 1 - 1/25 = 24/25  (96%) | 
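
To see the bounds in action, here is a sketch that compares them with the observed proportions in a deliberately lopsided dataset. The array `skewed` and the helper names `chebyshev_bound` and `proportion_within` are hypothetical, introduced only for this illustration.

```python
import numpy as np

def chebyshev_bound(z):
    """Least proportion of values within z SDs of the mean."""
    return 1 - 1 / z**2

def proportion_within(values, z):
    """Observed proportion of values within z SDs of the mean."""
    mean, sd = np.mean(values), np.std(values)
    return np.mean(np.abs(values - mean) <= z * sd)

# Hypothetical, strongly right-skewed data: mostly small values plus a few large ones.
skewed = np.array([1] * 90 + [5] * 8 + [100] * 2)

for z in [2, 3, 4, 5]:
    # The observed proportion should never fall below the bound.
    print(z, chebyshev_bound(z), proportion_within(skewed, z))
```

The bound is a guaranteed floor, not an estimate: for most datasets the observed proportion is well above it.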

**Only if** the histogram is bell-shaped can you estimate the standard deviation by eye; for other shapes, the SD cannot be read off the visualization 

If a histogram is bell-shaped, then:

1. the average is at the center
2. the SD is the distance between the average and the points of inflection on either side
3. the ∆ markers sit under the points of inflection, where the bars rise or fall most steeply 
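
The second property can be checked numerically. The sketch below assumes the bell shape is a normal curve with hypothetical parameters `mean = 10` and `sd = 2`, and locates the points where the curvature changes sign.

```python
import numpy as np

mean, sd = 10.0, 2.0  # hypothetical parameters for illustration
x = np.linspace(mean - 4 * sd, mean + 4 * sd, 2000)
curve = np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

# Second differences approximate curvature; a sign change marks a point of inflection.
curvature = np.diff(curve, n=2)
sign_changes = np.where(np.diff(np.sign(curvature)) != 0)[0]
inflection_points = x[sign_changes + 1]

print(inflection_points)  # close to mean - sd and mean + sd
```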

#### Comparing distributions with different shapes
- try converting the distributions to a common (standard) set of units, using the mean and standard deviation 

**Converting to a standard set of units**
- Subtract the mean from each value in the distribution
    - The new average is 0
- Divide the resulting deviations by the SD
    - The new standard deviation is 1 

Formula: `standard units = (original value - mean) / SD` 
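
A minimal sketch of the conversion, assuming NumPy arrays as input. The function name `standard_units` and both datasets are hypothetical; the point is that two variables measured on different scales end up with mean 0 and SD 1.

```python
import numpy as np

def standard_units(values):
    """Convert an array to standard units: (value - mean) / SD."""
    return (values - np.mean(values)) / np.std(values)

# Hypothetical measurements on two different scales.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0, 190.0])
weights_kg = np.array([50.0, 55.0, 70.0, 85.0, 90.0])

heights_su = standard_units(heights_cm)
weights_su = standard_units(weights_kg)

# After conversion, each array has mean 0 and SD 1 (up to rounding).
print(np.mean(heights_su), np.std(heights_su))
print(np.mean(weights_su), np.std(weights_su))
```

A value in standard units reads as "how many SDs above or below the average", which is what makes the two distributions comparable.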




```python
%matplotlib inline
import numpy as np 
from datascience import * 

values = make_array(2, 3, 3, 9)  # (2 + 3 + 3 + 9)/4 == 4.25

np.sum(values)/len(values)  # 4.25

np.average(values)  # 4.25
np.mean(values)  # 4.25

# Same as weighting each distinct value by its proportion:
2*(1/4) + 3*(2/4) + 9*(1/4)  # 4.25
2*0.25 + 3*0.5 + 9*0.25  # 4.25

values_table = Table().with_columns('value', values)
bins_for_display = np.arange(0.5, 10.6, 1)
values_table.hist('value', bins = bins_for_display)
```



```python
import matplotlib.pyplot as plots  # the plotting cell below uses the name `plots`

# Make an array of 10 2s, 20 3s, and 10 9s.
# The average depends on the proportions of the values, not on how many items
# are in the collection, so this produces the same histogram as above.
print('Same histogram as above despite added values')
new_vals = make_array(2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
                      3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
                      9, 9, 9, 9, 9, 9, 9, 9, 9, 9)

# Draw the histogram and mark the average (4.25) as the balance point.
Table().with_column('value', new_vals).hist(bins = bins_for_display)
plots.ylim(-0.04, 0.5)
plots.plot([0, 10], [0, 0], color='grey', lw=2)
plots.scatter(4.25, -0.015, marker='^', color='red', s=100)
plots.title('Average as a Center of Mass');
```

A **weighted mean** is an average in which each value carries a specific weight, so some values count more than others. Instead of treating all values equally, the weights reflect how important or how frequent each value is.


Uses:
- Grading Systems – Some assignments/exams count more toward the final grade.
- Finance – Stock indices (e.g., S&P 500) use weighted averages based on market capitalization.
- Survey Analysis – Adjusting responses if some groups are overrepresented.
- Physics & Engineering – Calculating center of mass or weighted probabilities.
- Machine Learning & Data Science – Weighted loss functions, weighted sampling, or handling imbalanced data.



```python
# Weighted mean (optional)
np.average(make_array(2, 3, 9), weights=(1, 2, 1))  # 4.25
```
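
As an illustration of the grading use case, here is a sketch with hypothetical scores and weights (none of these numbers come from the lecture). It computes the weighted mean by hand and with `np.average`, then compares it with the plain mean.

```python
import numpy as np

# Hypothetical course: scores and the share of the final grade each one carries.
scores = np.array([80.0, 90.0, 70.0])   # homework, midterm, final
weights = np.array([0.2, 0.3, 0.5])     # weights sum to 1

# Weighted mean: sum of (value * weight) divided by the sum of the weights.
by_hand = np.sum(scores * weights) / np.sum(weights)
with_numpy = np.average(scores, weights=weights)
plain_mean = np.mean(scores)

print(by_hand, with_numpy, plain_mean)
```

Because the final exam carries the most weight and has the lowest score, the weighted mean lands below the plain mean.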