<a href="https://colab.research.google.com/github/alysolamon/Applied-Statistics-For-Business/blob/main/Week%202%20Statistics%20Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Measures of Variability + Variance/Std Dev + Coefficient of Variation

What I'm trying to do here is set up a mini "playground" where I can practice the same ideas on a few different tables (so it's not just one example):
- Table 8 (tuition by discipline) → range, IQR, variance, standard deviation
- Table 9 (Ontario home sales, monthly 2008) → treat as a **population** and compute variance/std dev
- Table 10 (earnings by sex) → coefficient of variation (also as population)

**Quick self-reminder:** I'm not trying to memorize formulas only; I want to be able to look at a dataset and know which lever to pull (mean vs median, range vs IQR, sample vs population, etc.).


## Notes before I start (why `pandas` + `numpy`)

I *think* the combo here is basically:
- **pandas**: good for tables with labels/columns (keeps me organized so I don't lose what each number *means*)
- **NumPy**: good for the actual calculations (arrays + stats functions)

**If I were to describe each of them in one sentence:**

> **pandas** is like the labeled spreadsheet layer.


> **NumPy** is like the math engine underneath.
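
To make the "labels layer vs math engine" idea concrete, here is a small sketch. The table and its column names (`item`, `price`) are made up for illustration and are not one of the course tables.

```python
import numpy as np
import pandas as pd

# Hypothetical mini-table, just to show the two layers working together.
df = pd.DataFrame({
    "item": ["A", "B", "C", "D"],
    "price": [10.0, 12.0, 9.0, 13.0],
})

# pandas keeps the labels; .to_numpy() hands the raw numbers to NumPy.
prices = df["price"].to_numpy()

mean_price = np.mean(prices)             # (10 + 12 + 9 + 13) / 4 = 11.0
pop_std_price = np.std(prices, ddof=0)   # population std = sqrt(2.5)

print(df)
print("mean:", mean_price, "| population std:", round(pop_std_price, 3))
```

The labels stay in the DataFrame, so I can always trace a number back to the row it came from.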


### My Take on this Tech-Stack
It's quite unorthodox compared to an Excel sheet, but it's really fun to interact with because you can manipulate data at scale and export it into whatever format you want, whether that's a Google Sheet, an Excel workbook, or some other dataset format; the data really does bend to your will. To be honest, it is quite confusing at the beginning. But once you see how flexible it is, it's worth it.


In [2]:
import numpy as np
import pandas as pd

## Helper functions (so I don't repeat myself)

I'm writing small helper functions so I can reuse the same logic across all tables (and so I'm not re-typing formulas).

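**Quartiles / IQR note (course-style):** I'm using the Tukey "median of halves" method:
- If \(n\) is odd, I exclude the overall median before splitting into lower/upper halves.
- If \(n\) is even, I split the sorted data straight down the middle.
- \(Q_1\) is the median of the lower half and \(Q_3\) is the median of the upper half.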
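
Here is a tiny worked example of the median-of-halves idea on made-up numbers, so I can check it by hand. The function name `median_of_halves` is just for this sketch, and for even \(n\) it assumes the sorted data is split straight in half. The last line shows that NumPy's default percentile (linear interpolation) can give a different \(Q_1\).

```python
import numpy as np

def median_of_halves(values):
    """Return (Q1, median, Q3) using the median-of-halves rule."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    half = n // 2
    lower = x[:half]
    # For odd n, skip the middle element so it is in neither half.
    upper = x[half + 1:] if n % 2 == 1 else x[half:]
    return np.median(lower), np.median(x), np.median(upper)

odd = [1, 3, 5, 7, 9, 11, 13]    # n = 7, the median 7 is left out of both halves
even = [1, 3, 5, 7, 9, 11]       # n = 6, two halves of three values each

q_odd = median_of_halves(odd)    # lower [1, 3, 5] -> 3, upper [9, 11, 13] -> 11
q_even = median_of_halves(even)  # lower [1, 3, 5] -> 3, upper [7, 9, 11] -> 9

numpy_q1 = np.percentile(odd, 25)  # default linear interpolation gives 4.0

print(q_odd, q_even, numpy_q1)
```

So if my answer is slightly off from a quiz key, the quartile convention is the first thing to check.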

**Population vs sample note:** in NumPy, this is controlled by `ddof`:
- `ddof=0` → population variance/std
- `ddof=1` → sample variance/std (uses \(n-1\))
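
A quick sketch of what `ddof` changes, on a small made-up dataset whose sum of squared deviations is easy to check by hand (mean 5, squared deviations add up to 32):

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical values

pop_var = np.var(data, ddof=0)      # 32 / 8 = 4.0
sample_var = np.var(data, ddof=1)   # 32 / 7, a bit larger

pop_std = np.std(data, ddof=0)      # sqrt(4.0) = 2.0
sample_std = np.std(data, ddof=1)   # sqrt(32 / 7)

print(pop_var, sample_var, pop_std, sample_std)
```

Same data, same squared deviations; the only thing that moves is the divisor.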

### Quick exercise (don't skip)
Before I run anything: for a skewed dataset, which do I think changes more if there's one extreme outlier: the range or the IQR?
- **My guess**:
> The range. It is computed straight from the minimum and maximum, so a single extreme value moves it immediately. The IQR only looks at the middle 50% of the data, so an outlier can shift it a little, but not nearly as much as it shifts the range.
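
To test that guess, here is a small sketch with made-up numbers where I swap the largest value for an extreme one. For brevity it uses `np.percentile` with NumPy's default interpolation rather than the Tukey method, so treat the exact quartile values as an assumption of this sketch.

```python
import numpy as np

base = np.array([10, 12, 13, 15, 16, 18, 20], dtype=float)           # hypothetical
with_outlier = np.array([10, 12, 13, 15, 16, 18, 200], dtype=float)  # 20 -> 200

def spread(x):
    """Return (range, IQR) using NumPy's default percentile method."""
    q1, q3 = np.percentile(x, [25, 75])
    return x.max() - x.min(), q3 - q1

range_base, iqr_base = spread(base)
range_out, iqr_out = spread(with_outlier)

print("range:", range_base, "->", range_out)
print("IQR:  ", iqr_base, "->", iqr_out)
```

In this tiny example the range jumps from 10 to 190 while the IQR does not move at all, because the outlier sits outside the middle 50%.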




## Topics + learning objectives (what I think Week 2 wants from me)

When I read the module outline, it *sounds like* the point is:

- **Central tendency** (where the "middle" is)
  - mean, median, mode
  - choosing mean vs median depending on skew / outliers
- **Variability** (how spread out things are)
  - range, IQR, variance, standard deviation
  - knowing when range is too sensitive and IQR is safer
- **Relative variability**
  - coefficient of variation (CV) when I need a "spread relative to the mean" comparison
- **Shape / skewness**
  - not just "is it skewed?", but "what does that do to mean vs median?"

### What I'm practicing in this notebook
- With **raw data** (like these tables): compute mean/median/mode and the spread measures.
- With **grouped data** (if I get a frequency table later): practice the grouped formulas.
- Look at the numbers and ask: "If I had to describe this dataset in one sentence... what measure would I pick and why?"
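
For the grouped-data case, here is a minimal sketch of what I expect the calculation to look like. It assumes a made-up frequency table and the usual class-midpoint approximation with a population divisor; the course's exact grouped formula may differ (for example, dividing by \(n-1\) for a sample).

```python
import numpy as np

# Hypothetical frequency table (not from the course tables).
lower_bounds = np.array([0.0, 10.0, 20.0])
upper_bounds = np.array([10.0, 20.0, 30.0])
freq = np.array([2, 5, 3])

midpoints = (lower_bounds + upper_bounds) / 2   # 5, 15, 25
n_total = freq.sum()                            # 10 observations in total

# Each class is treated as if all its observations sat at the midpoint.
grouped_mean = np.sum(freq * midpoints) / n_total
grouped_var = np.sum(freq * (midpoints - grouped_mean) ** 2) / n_total
grouped_std = np.sqrt(grouped_var)

print(grouped_mean, grouped_var, grouped_std)
```

With these numbers the mean works out to 16, the variance to 49, and the standard deviation to 7, which is easy to verify by hand.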

### Mini-checks I want to remember to ask myself
- If the distribution looks skewed, do I expect \(mean > median\) or \(mean < median\)? (I'll test it instead of guessing.)
- Is this a **sample** or a **population**? (That decides \(ddof\) / whether I'm using \(n\) or \(n-1\).)
- Am I comparing two groups with different scales? (That's where CV might be the more fair comparison.)
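
For the first mini-check, a quick sketch with two made-up datasets, one with a long right tail and one with a long left tail, to see which way the mean gets pulled:

```python
import numpy as np

# Hypothetical values: one extreme high value vs. one extreme low value.
right_skewed = np.array([30, 32, 35, 36, 38, 40, 150], dtype=float)
left_skewed = np.array([1, 110, 112, 115, 116, 118, 120], dtype=float)

mean_right, median_right = right_skewed.mean(), np.median(right_skewed)
mean_left, median_left = left_skewed.mean(), np.median(left_skewed)

print("right tail: mean", round(mean_right, 2), "median", median_right)
print("left tail:  mean", round(mean_left, 2), "median", median_left)
```

The mean gets dragged toward the tail, while the median stays with the bulk of the data. That is why the median is usually the safer "typical value" when there are outliers.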

### My own summary (fill in)
- Mean vs median (when I'd use each):

- Why IQR can be better than range:

- One sentence definition of variance vs standard deviation (in my own words):


# Digging Deeper into **Variance** and **Standard Deviation**

Before I dive into the numbers, I want to take a moment to really think about what variance and standard deviation *mean*. These two are often talked about together, and for good reason: one builds directly on the other.

**Quick conceptual reminder for myself:**
-  Both are measures of **spread** or **dispersion**.
-  Think: how "scattered" or "bunched up" is my data?

---

## Okay, Let's Break This Down...

### First, the Mean (Average)
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

**In plain English:** This is just the average: add everything up, divide by how many things you have. Nothing fancy.

> 🔑 **KEY:** `μ` (mu) = population mean

---

### Variance: The "Average of Squared Distances"

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

**What is this actually saying?**

1. Take each data point
2. Subtract the mean (`x_i - μ`): this is how far it is from the average (could be negative!)
3. **Square it**: this removes the negative signs (also makes big differences stand out more)
4. Add 'em all up, divide by `N`

**Translation:** Variance is the average of the *squared* distances from the mean. Squaring keeps the negative and positive deviations from cancelling each other out.

> 🔑 **KEY:** `σ²` (sigma squared) = population variance  
> 🔑 **KEY:** `N` = total population size
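
The four steps above can be sketched directly in NumPy. The numbers are made up so the arithmetic is easy to follow by hand, and the last line checks the manual result against `np.var`.

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 2.0])    # hypothetical values (step 1: each point)

mu = data.sum() / len(data)              # the mean: 20 / 4 = 5.0
deviations = data - mu                   # step 2: [-1, 3, 1, -3]
squared = deviations ** 2                # step 3: [1, 9, 1, 9]
manual_var = squared.sum() / len(data)   # step 4: 20 / 4 = 5.0

numpy_var = np.var(data, ddof=0)         # population variance from NumPy

print(manual_var, numpy_var)
```

Note that the raw deviations sum to zero, which is exactly why they have to be squared before averaging.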

---

### Standard Deviation: "Un-squaring" the Variance

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

**Here's the thing:** The variance is in *squared units*, like "dollars squared" or "meters squared." That's hard to interpret in real life.

So we take the square root to get back to the original units.

**Translation:** Standard deviation = variance but actually interpretable. It tells you roughly how far a typical point sits from the mean (strictly, it is the square root of the average squared distance, not the plain average distance).

> 🔑 **KEY:** `σ` (sigma) = population standard deviation
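
One thing worth checking for myself: the standard deviation is close to, but not the same as, the plain average distance from the mean. A small sketch on made-up numbers, comparing it with the mean absolute deviation:

```python
import numpy as np

data = np.array([4.0, 8.0, 6.0, 2.0])        # hypothetical values
mu = data.mean()                             # 5.0

sigma = np.std(data, ddof=0)                 # sqrt(5), about 2.236
mean_abs_dev = np.mean(np.abs(data - mu))    # (1 + 3 + 1 + 3) / 4 = 2.0

print("std dev:", round(sigma, 3), "| mean absolute deviation:", mean_abs_dev)
```

Squaring gives the larger deviations more weight, so the standard deviation comes out at least as large as the mean absolute deviation.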

---

## Wait: Sample vs. Population?

Okay, so here's where I get confused. If you have ALL the data → population. If you have a SLICE of the data → sample.

### Sample Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Same formula, different notation! (`n` instead of `N`, `x̄` instead of `μ`)

> 🔑 **KEY:** `x̄` (x-bar) = sample mean

---

### Sample Variance: The Tricky Part

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

**Why `n-1` instead of `n`?**

Oh man, this confused me for the longest time. Here's the ADHD breakdown:

-  Deviations are measured from the sample mean `x̄`, which sits closer to the sample's own points than the true mean `μ` does, so dividing by `n` tends to *underestimate* the true variance
-  Using `n-1` (called **Bessel's correction**) compensates for this bias
-  Basically: dividing by a slightly smaller number nudges the estimate up a little to account for not having all the data

**Translation:** It's a correction factor to make the estimate less wrong.

> 🔑 **KEY:** `s²` = sample variance  
> 🔑 **KEY:** `n-1` = degrees of freedom (one is "used up" by estimating the mean from the same data)
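
To convince myself the bias is real, here is a small sketch that avoids randomness altogether. It takes a made-up population of four values, lists every possible ordered sample of size 2 drawn with replacement, and averages the two kinds of variance estimate over all 16 samples.

```python
import itertools
import numpy as np

population = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical population
true_var = np.var(population, ddof=0)         # 1.25

# Every ordered sample of size 2, drawn with replacement (4 * 4 = 16 samples).
samples = [np.array(s) for s in itertools.product(population, repeat=2)]

avg_div_n = np.mean([np.var(s, ddof=0) for s in samples])          # divide by n
avg_div_n_minus_1 = np.mean([np.var(s, ddof=1) for s in samples])  # divide by n-1

print("true variance:       ", true_var)
print("average using n:     ", avg_div_n)
print("average using n - 1: ", avg_div_n_minus_1)
```

Averaged over all samples, dividing by `n` lands at half the true variance here (0.625 vs 1.25), while dividing by `n-1` lands on the true variance. That is what "unbiased" means in this context.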

---

### Sample Standard Deviation

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Same idea: just un-square the variance to get it back into interpretable units.

> 🔑 **KEY:** `s` = sample standard deviation

---

## Quick Reference Summary

| Measure | What It Is | Population | Sample |
|---------|------------|------------|--------|
| Mean | Average | `μ = (1/N)∑x_i` | `x̄ = (1/n)∑x_i` |
| Variance | Avg squared distance | `σ² = (1/N)∑(x_i-μ)²` | `s² = (1/(n-1))∑(x_i-x̄)²` |
| Std Dev | √variance | `σ = √σ²` | `s = √s²` |

---

## Mental Bookmark for Later

> **Variance** = fancy math way to say "how spread out is this data... squared"  
> **Standard Deviation** = variance, but actually useful (same units as data)  
> **Population** = you have everything  
> **Sample** = you have a slice, use `n-1` to be honest about uncertainty

---

*Standard deviation is in the same units as my data, making it easier to interpret as an 'average' deviation.*

In [3]:
def data_range(x):
    x = np.asarray(x, dtype=float)
    return np.max(x) - np.min(x)

def tukey_quartiles(x):
    """
    Tukey quartiles (median-of-halves).
    If n is odd: exclude the median from the halves.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)

    # median
    q2 = np.median(x)

    if n % 2 == 1:
        lower = x[:n//2]       # excludes median
        upper = x[n//2+1:]
    else:
        lower = x[:n//2]
        upper = x[n//2:]

    q1 = np.median(lower)
    q3 = np.median(upper)
    return q1, q2, q3

def tukey_iqr(x):
    q1, _, q3 = tukey_quartiles(x)
    return q3 - q1

def variance(x, population=True):
    x = np.asarray(x, dtype=float)
    ddof = 0 if population else 1
    return np.var(x, ddof=ddof)

def std_dev(x, population=True):
    x = np.asarray(x, dtype=float)
    ddof = 0 if population else 1
    return np.std(x, ddof=ddof)

def coef_of_variation(x, population=True):
    x = np.asarray(x, dtype=float)
    return (std_dev(x, population=population) / np.mean(x)) * 100

## Part 1 - Table 8: Tuition fees by discipline (Canada)

Goal (for **2009/2010**): I want to compute spread measures and see how "uneven" tuition is across disciplines.

What I'm computing:
- range
- IQR (Tukey)
- variance + standard deviation (population vs sample)

### Exercise (before running)
- Which measure do I *think* will look less "dramatic": range or IQR? Why?
- Do I expect the mean to be higher or lower than the median? (Just a guess; I'll verify.)


In [None]:
tuition = pd.DataFrame({
    "Discipline": [
        "Canada",
        "Agriculture, Natural Resources, and Conservation",
        "Architecture and Related Technologies",
        "Humanities",
        "Business Management and Public Administration",
        "Education",
        "Engineering",
        "Law",
        "Medicine",
        "Visual and Performing Arts and Communication Technologies",
        "Physical and Life Sciences and Technologies",
        "Math and Computer and Information Sciences",
        "Social and Behavioural Sciences",
        "Other Health, and Parks, Recreation, and Fitness",
        "Dentistry",
        "Nursing",
        "Pharmacy",
        "Veterinary medicine"
    ],
    "2006/2007r": [
        4400, 3869, 3839, 4336, 4195, 3373, 4943, 7155, 9659, 3991, 4270, 4650, 4041, 4996,
        np.nan, np.nan, np.nan, np.nan
    ],
    "2007/2008r": [
        4558, 4064, 3999, 4342, 4637, 3545, 5099, 7382, 10029, 4239, 4534, 4746, 4165, 4400,
        12516, 4267, 4215, 4296
    ],
    "2008/2009": [
        4747, 4366, 4503, 4364, 4978, 3652, 5319, 8030, 9821, 4377, 4679, 4987, 4251, 4539,
        13290, 4422, 8366, 4422
    ],
    "2009/2010": [
        4917, 4516, 4794, 4501, 5073, 3783, 5583, 8502, 10216, 4539, 4847, 5220, 4399, 4692,
        13988, 4558, 8792, 5110
    ],
})

tuition.head()

| | Discipline | 2006/2007r | 2007/2008r | 2008/2009 | 2009/2010 |
|---|---|---|---|---|---|
| 0 | Canada | 4400.0 | 4558 | 4747 | 4917 |
| 1 | Agriculture, Natural Resources, and Conservation | 3869.0 | 4064 | 4366 | 4516 |
| 2 | Architecture and Related Technologies | 3839.0 | 3999 | 4503 | 4794 |
| 3 | Humanities | 4336.0 | 4342 | 4364 | 4501 |
| 4 | Business Management and Public Administration | 4195.0 | 4637 | 4978 | 5073 |


## Extra stats I might add (if it helps me)

While I'm doing variability, I keep realizing I also want the "typical value" measures next to it (mean/median/mode), because otherwise I'm staring at spread with no center.

- **Mean**: average tuition across disciplines.
- **Median**: I already get this from my quartiles output.
- **Mode**: I'm not 100% sure mode is meaningful here (tuition is kind of "continuous-ish"), but it might still be a useful check.

### Exercise (for me)
- Add `mean` to the `results` dict using `np.mean(x)`.
- Try to compute a mode (if any). If it's messy / not unique, I'll write a note explaining why.
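
A possible starting point for the mode exercise, using `pandas.Series.mode()`. The values below are made up (they only look like tuition figures) to show the two situations I might run into: every value unique, or one value repeated.

```python
import pandas as pd

all_unique = pd.Series([4500, 4700, 5100, 8800])   # hypothetical: no repeats
one_repeat = pd.Series([4500, 4500, 4700, 5100])   # hypothetical: 4500 twice

# Series.mode() returns every value tied for the highest count.
modes_unique = all_unique.mode()    # all four values come back
modes_repeat = one_repeat.mode()    # just 4500

# A simple check for "is there a meaningful mode at all?"
has_real_mode = bool(all_unique.value_counts().max() > 1)

print(modes_unique.tolist(), modes_repeat.tolist(), has_real_mode)
```

If every value appears exactly once, `mode()` simply hands back the whole dataset, which is a sign that the mode isn't telling me anything useful for that column.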

### My notes (fill in)
- Did mean and median come out close or far apart? What does that make me think about skew/outliers?


In [None]:
# Use disciplines only (exclude "Canada" row for the quiz-style answers)
x = tuition.loc[tuition["Discipline"] != "Canada", "2009/2010"].to_numpy()

results = {
    "n": len(x),
    "min": np.min(x),
    "max": np.max(x),
    "range": data_range(x),
    "Q1 (Tukey)": tukey_quartiles(x)[0],
    "median": tukey_quartiles(x)[1],
    "Q3 (Tukey)": tukey_quartiles(x)[2],
    "IQR (Tukey)": tukey_iqr(x),
    "pop_variance": variance(x, population=True),
    "pop_std": std_dev(x, population=True),
    "sample_variance": variance(x, population=False),
    "sample_std": std_dev(x, population=False),
}

pd.Series(results).round(2)

| Statistic | Value |
|---|---|
| n | 17.0 |
| min | 3783.0 |
| max | 13988.0 |
| range | 10205.0 |
| Q1 (Tukey) | 4527.5 |
| median | 4847.0 |
| Q3 (Tukey) | 7042.5 |
| IQR (Tukey) | 2515.0 |
| pop_variance | 6978532.84 |
| pop_std | 2641.69 |


## Part 2 - Table 9: Ontario single-family dwelling sales (2008)

Note to myself: September shows ">8,196" in the screenshot, but the quiz options I saw match using **8196** exactly.

Also, I'm treating this year's 12 months like the **whole population** (not a sample), so I should use `ddof=0`.

### Quick exercise
Before I compute it: do I expect the standard deviation to be "big" compared to the mean? Why do I think that?
- My guess:


In [None]:
sales_2008 = pd.DataFrame({
    "month": ["January","February","March","April","May","June","July","August","September","October","November","December"],
    "sales": [2670, 4120, 6171, 8107, 9589, 10955, 9967, 8035, 8196, 8476, 5549, 5541],
})
sales_2008

In [None]:
# Population variance + std dev (exercise: try predicting first)
# (I'm using my helper functions so ddof is consistent.)

x_sales = sales_2008["sales"].to_numpy()

pop_var_sales = variance(x_sales, population=True)
pop_std_sales = std_dev(x_sales, population=True)

# Optional: uncomment to reveal the numbers
# pd.Series({"population_variance": pop_var_sales, "population_std_dev": pop_std_sales}).round(5)


## Part 3 - Table 10: Average earnings by sex (constant 2008 $)

What I'm trying to check here is *relative* variability.

I think this is why we use **coefficient of variation (CV)**:
- standard deviation by itself is in "dollars", so it's hard to compare groups if the means differ
- CV rescales it: \(CV = (\sigma/\mu)\times 100\%\)
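
A small sketch of why the rescaling matters, using two made-up groups on very different scales (these are not the earnings figures from Table 10). Both are treated as populations, so `ddof=0`.

```python
import numpy as np

small_scale = np.array([10.0, 12.0, 14.0])        # hypothetical group A
large_scale = np.array([1000.0, 1020.0, 1040.0])  # hypothetical group B

def cv_percent(x):
    """Population coefficient of variation, in percent."""
    return np.std(x, ddof=0) / np.mean(x) * 100

std_a, std_b = np.std(small_scale, ddof=0), np.std(large_scale, ddof=0)
cv_a, cv_b = cv_percent(small_scale), cv_percent(large_scale)

print("std: ", round(std_a, 2), "vs", round(std_b, 2))
print("CV %:", round(cv_a, 2), "vs", round(cv_b, 2))
```

Group B has the larger standard deviation in absolute terms, but group A is the more variable one relative to its own mean. That reversal is exactly what CV is meant to reveal.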

Assumption I'm making (because it matches the way the questions are usually phrased): treat these years as the **population** → use `ddof=0`.

### Exercise (before running)
- Do I *expect* women or men to have the higher CV here? (Not higher dollars, just higher variability relative to their mean.)
- My guess:

### My notes (fill in)
- If the CV is higher for one group, what does that mean in a sentence?


In [None]:
earnings = pd.DataFrame({
    "year":  [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008],
    "women": [27000,27500,27600,27900,27600,27900,28600,29000,29900,30200],
    "men":   [43000,44500,44400,44400,43800,44000,44700,44800,45500,46900],
}).set_index("year")

earnings


In [None]:
women = earnings["women"].to_numpy()
men = earnings["men"].to_numpy()

cv_women = coef_of_variation(women, population=True)
cv_men = coef_of_variation(men, population=True)

# Optional: uncomment to reveal the numbers
# pd.Series({"CV_women_% (population)": cv_women, "CV_men_% (population)": cv_men}).round(2)
