# Measures of Variability + Variance/Std Dev + Coefficient of Variation

What I’m trying to do here is set up a mini “playground” where I can practice the same ideas on a few different tables (so it’s not just one example):
- Table 8 (tuition by discipline) → range, IQR, variance, standard deviation
- Table 9 (Ontario home sales, monthly 2008) → treat as a **population** and compute variance/std dev
- Table 10 (earnings by sex) → coefficient of variation (also as population)

**Quick self-reminder:** I’m not just trying to memorize formulas. I want to be able to look at a dataset and know which lever to pull (mean vs median, range vs IQR, sample vs population, etc.).


## Notes before I start (why `pandas` + `numpy`)

I *think* the combo here is basically:
- **pandas**: good for tables with labels/columns (keeps me organized so I don’t lose what each number *means*)
- **NumPy**: good for the actual calculations (arrays + stats functions)

**My own words:** pandas is like the “labeled spreadsheet layer”, and NumPy is like the “math engine” underneath.

### My notes (fill in)
- What felt confusing about pandas the first time I saw it:
  - 
- One pandas thing I used today (and what it returned):
  - 


In [1]:
import numpy as np
import pandas as pd

## Topics + learning objectives (what I think Week 2 wants from me)

When I read the module outline, it *sounds like* the point is:

- **Central tendency** (where the “middle” is)
  - mean, median, mode
  - choosing mean vs median depending on skew / outliers
- **Variability** (how spread out things are)
  - range, IQR, variance, standard deviation
  - knowing when range is too sensitive and IQR is safer
- **Relative variability**
  - coefficient of variation (CV) when I need a “spread relative to the mean” comparison
- **Shape / skewness**
  - not just “is it skewed?”, but “what does that do to mean vs median?”

### What I’m practicing in this notebook
- With **raw data** (like these tables): compute mean/median/mode and the spread measures.
- With **grouped data** (if I get a frequency table later): practice the grouped formulas.
- Look at the numbers and ask: “If I had to describe this dataset in one sentence… what measure would I pick and why?”

### Mini-checks I want to remember to ask myself
- If the distribution looks skewed, do I expect \(mean > median\) or \(mean < median\)? (I’ll test it instead of guessing.)
- Is this a **sample** or a **population**? (That decides `ddof`, i.e. whether I divide by \(n\) or \(n-1\).)
- Am I comparing two groups with different scales? (That’s where CV might be the more fair comparison.)
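A quick way to test the skew question instead of guessing. This is only a sketch on made-up numbers (both arrays are hypothetical, not from any of the tables): one large value on the right should pull the mean above the median, and mirroring the data should flip that.

```python
import numpy as np

# Hypothetical right-skewed data: mostly small values, one very large one.
right_skewed = np.array([2, 3, 3, 4, 5, 6, 40], dtype=float)
# Mirror image of the same data, so the long tail is on the left.
left_skewed = -right_skewed

mean_r, median_r = np.mean(right_skewed), np.median(right_skewed)
mean_l, median_l = np.mean(left_skewed), np.median(left_skewed)

print("right-skewed: mean =", mean_r, "median =", median_r)
print("left-skewed:  mean =", mean_l, "median =", median_l)
```

In this toy case the mean moves toward the tail while the median stays near the bulk of the values.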

### My own summary (fill in)
- Mean vs median — when I’d use each:

- Why IQR can be better than range:

- One sentence definition of variance vs standard deviation (in my own words):


## Helper functions (so I don’t repeat myself)

I’m writing small helper functions so I can reuse the same logic across all tables (and so I’m not re-typing formulas).

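**Quartiles / IQR note (course-style):** I’m using what I’ve been calling the Tukey “median of halves” method:
- If \(n\) is odd, I exclude the overall median before splitting into lower/upper halves.
- If \(n\) is even, the sorted data just splits into two equal halves.
- Heads-up: some references reserve “Tukey’s hinges” for the variant that *includes* the median in both halves when \(n\) is odd, so my Q1/Q3 may differ slightly from other tools.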
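To see how much the odd-\(n\) rule matters, here is a small sketch on toy data. The function name `median_of_halves` and its `include_median` switch are hypothetical, made up just for this comparison; the helper I actually use is defined in the code cell further down.

```python
import numpy as np

def median_of_halves(values, include_median=False):
    """Return (Q1, median, Q3) using the median-of-halves rule.

    include_median is a hypothetical switch for this sketch:
    False -> drop the overall median from both halves when n is odd,
    True  -> keep it in both halves when n is odd.
    """
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    half = n // 2
    if n % 2 == 1 and include_median:
        lower, upper = x[:half + 1], x[half:]
    elif n % 2 == 1:
        lower, upper = x[:half], x[half + 1:]
    else:
        lower, upper = x[:half], x[half:]
    return np.median(lower), np.median(x), np.median(upper)

toy = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # odd n, so the two rules can disagree
exclusive = median_of_halves(toy, include_median=False)
inclusive = median_of_halves(toy, include_median=True)
print("exclude median:", exclusive)
print("include median:", inclusive)
```

For even \(n\) both settings give the same answer, so the choice only shows up when \(n\) is odd.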

**Population vs sample note:** in NumPy, this is controlled by `ddof`:
- `ddof=0` → population variance/std
- `ddof=1` → sample variance/std (uses \(n-1\))
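To convince myself what `ddof` actually does, here is a minimal check on a made-up array (the numbers are hypothetical). It compares `np.var` against the formulas written out by hand: sum of squared deviations divided by \(n\) or by \(n-1\).

```python
import numpy as np

# Hypothetical data, chosen so the mean is a whole number (5.0).
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)
n = len(data)

# Sum of squared deviations from the mean.
ss = np.sum((data - data.mean()) ** 2)

manual_pop_var = ss / n           # population: divide by n
manual_sample_var = ss / (n - 1)  # sample: divide by n - 1

pop_var = np.var(data, ddof=0)
sample_var = np.var(data, ddof=1)

print("population variance:", pop_var, "(manual:", manual_pop_var, ")")
print("sample variance:    ", sample_var, "(manual:", manual_sample_var, ")")
print("population std:     ", np.std(data, ddof=0))
```

The sample version is always a bit larger than the population version for the same data, because it divides by a smaller number.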

### Quick exercise (don’t skip)
Before I run anything: for a skewed dataset, which do I think changes more if there’s one extreme outlier — the range or the IQR?
- My guess:
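Once I’ve written my guess, I can check it with a small sketch. The data is made up, and I’m using `np.percentile` (NumPy’s default linear interpolation) for the quartiles here just to keep the block short, which is an assumption and not the same rule as the median-of-halves helper below.

```python
import numpy as np

def simple_range(x):
    return np.max(x) - np.min(x)

def simple_iqr(x):
    # Quartiles via NumPy's default (linear interpolation) percentile rule.
    q1, q3 = np.percentile(x, [25, 75])
    return q3 - q1

# Hypothetical data: identical except the largest value becomes an outlier.
base = np.array([10, 12, 13, 14, 15, 16, 18], dtype=float)
with_outlier = np.array([10, 12, 13, 14, 15, 16, 100], dtype=float)

print("range:", simple_range(base), "->", simple_range(with_outlier))
print("IQR:  ", simple_iqr(base), "->", simple_iqr(with_outlier))
```

In this toy case the outlier only changes the maximum, so the range jumps while the middle 50% of the data stays where it was.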


In [2]:
def data_range(x):
    x = np.asarray(x, dtype=float)
    return np.max(x) - np.min(x)

def tukey_quartiles(x):
    """
    Tukey quartiles (median-of-halves).
    If n is odd: exclude the median from the halves.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)

    # median
    q2 = np.median(x)

    if n % 2 == 1:
        lower = x[:n//2]       # excludes median
        upper = x[n//2+1:]
    else:
        lower = x[:n//2]
        upper = x[n//2:]

    q1 = np.median(lower)
    q3 = np.median(upper)
    return q1, q2, q3

def tukey_iqr(x):
    q1, _, q3 = tukey_quartiles(x)
    return q3 - q1

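def variance(x, population=True):
    """Variance of x: divide by n if population=True, else by n-1."""
    x = np.asarray(x, dtype=float)
    ddof = 0 if population else 1  # ddof=0 -> population, ddof=1 -> sample
    return np.var(x, ddof=ddof)

def std_dev(x, population=True):
    """Standard deviation of x (square root of the matching variance)."""
    x = np.asarray(x, dtype=float)
    ddof = 0 if population else 1
    return np.std(x, ddof=ddof)

def coef_of_variation(x, population=True):
    """Coefficient of variation as a percentage: (std / mean) * 100."""
    x = np.asarray(x, dtype=float)
    return (std_dev(x, population=population) / np.mean(x)) * 100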

## Part 1 — Table 8: Tuition fees by discipline (Canada)

Goal (for **2009/2010**): I want to compute spread measures and see how “uneven” tuition is across disciplines.

What I’m computing:
- range
- IQR (Tukey)
- variance + standard deviation (population vs sample)

### Exercise (before running)
- Which measure do I *think* will look less “dramatic”: range or IQR? Why?
- Do I expect the mean to be higher or lower than the median? (Just a guess — I’ll verify.)


In [3]:
tuition = pd.DataFrame({
    "Discipline": [
        "Canada",
        "Agriculture, Natural Resources, and Conservation",
        "Architecture and Related Technologies",
        "Humanities",
        "Business Management and Public Administration",
        "Education",
        "Engineering",
        "Law",
        "Medicine",
        "Visual and Performing Arts and Communication Technologies",
        "Physical and Life Sciences and Technologies",
        "Math and Computer and Information Sciences",
        "Social and Behavioural Sciences",
        "Other Health, and Parks, Recreation, and Fitness",
        "Dentistry",
        "Nursing",
        "Pharmacy",
        "Veterinary medicine"
    ],
    "2006/2007r": [
        4400, 3869, 3839, 4336, 4195, 3373, 4943, 7155, 9659, 3991, 4270, 4650, 4041, 4996,
        np.nan, np.nan, np.nan, np.nan
    ],
    "2007/2008r": [
        4558, 4064, 3999, 4342, 4637, 3545, 5099, 7382, 10029, 4239, 4534, 4746, 4165, 4400,
        12516, 4267, 4215, 4296
    ],
    "2008/2009": [
        4747, 4366, 4503, 4364, 4978, 3652, 5319, 8030, 9821, 4377, 4679, 4987, 4251, 4539,
        13290, 4422, 8366, 4422
    ],
    "2009/2010": [
        4917, 4516, 4794, 4501, 5073, 3783, 5583, 8502, 10216, 4539, 4847, 5220, 4399, 4692,
        13988, 4558, 8792, 5110
    ],
})

tuition.head()

|   | Discipline | 2006/2007r | 2007/2008r | 2008/2009 | 2009/2010 |
|---|---|---|---|---|---|
| 0 | Canada | 4400.0 | 4558 | 4747 | 4917 |
| 1 | Agriculture, Natural Resources, and Conservation | 3869.0 | 4064 | 4366 | 4516 |
| 2 | Architecture and Related Technologies | 3839.0 | 3999 | 4503 | 4794 |
| 3 | Humanities | 4336.0 | 4342 | 4364 | 4501 |
| 4 | Business Management and Public Administration | 4195.0 | 4637 | 4978 | 5073 |


## Extra stats I might add (if it helps me)

While I’m doing variability, I keep realizing I also want the “typical value” measures next to it (mean/median/mode), because otherwise I’m staring at spread with no center.

- **Mean**: average tuition across disciplines.
- **Median**: I already get this from my quartiles output.
- **Mode**: I’m not 100% sure mode is meaningful here (tuition is kind of “continuous-ish”), but it might still be a useful check.

### Exercise (for me)
- Add `mean` to the `results` dict using `np.mean(x)`.
- Try to compute a mode (if any). If it’s messy / not unique, I’ll write a note explaining why.
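For the mode part, one option I could try is pandas’ `Series.mode()`, which returns *all* tied modes rather than a single number. A sketch on made-up values (not the real tuition column) to see what “messy / not unique” looks like:

```python
import pandas as pd

# Hypothetical values with two tied modes.
tied = pd.Series([4500, 4500, 5100, 5100, 8800])
# Hypothetical values where nothing repeats.
all_unique = pd.Series([3783, 4399, 4501, 4516])

tied_modes = tied.mode()               # every value that shares the top count
unique_modes = all_unique.mode()       # with no repeats, every value comes back
top_count = all_unique.value_counts().max()  # 1 means no value repeats

print("tied modes:", tied_modes.tolist())
print("all-unique modes:", unique_modes.tolist())
print("highest count in all-unique data:", top_count)
```

If the highest count is 1, every value is technically a mode, which is my cue to write “no meaningful mode” in my notes.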

### My notes (fill in)
- Did mean and median come out close or far apart? What does that make me think about skew/outliers?


In [7]:
# Use disciplines only (exclude "Canada" row for the quiz-style answers)
x = tuition.loc[tuition["Discipline"] != "Canada", "2009/2010"].to_numpy()

# Compute the quartiles once and reuse them
q1, q2, q3 = tukey_quartiles(x)

results = {
    "n": len(x),
    "min": np.min(x),
    "max": np.max(x),
    "range": data_range(x),
    "Q1 (Tukey)": q1,
    "median": q2,
    "Q3 (Tukey)": q3,
    "IQR (Tukey)": tukey_iqr(x),
    "pop_variance": variance(x, population=True),
    "pop_std": std_dev(x, population=True),
    "sample_variance": variance(x, population=False),
    "sample_std": std_dev(x, population=False),
}

pd.Series(results).round(2)

|   | 0 |
|---|---|
| n | 17.0 |
| min | 3783.0 |
| max | 13988.0 |
| range | 10205.0 |
| Q1 (Tukey) | 4527.5 |
| median | 4847.0 |
| Q3 (Tukey) | 7042.5 |
| IQR (Tukey) | 2515.0 |
| pop_variance | 6978532.84 |
| pop_std | 2641.69 |


## Part 2 — Table 9: Ontario single-family dwelling sales (2008)

Note to myself: September shows ">8,196" in the screenshot, but the quiz options I saw match using **8196** exactly.

Also, I’m treating this year’s 12 months like the **whole population** (not a sample), so I should use `ddof=0`.

### Quick exercise
Before I compute it: do I expect the standard deviation to be “big” compared to the mean? Why do I think that?
- My guess:


In [None]:
sales_2008 = pd.DataFrame({
    "month": ["January","February","March","April","May","June","July","August","September","October","November","December"],
    "sales": [2670, 4120, 6171, 8107, 9589, 10955, 9967, 8035, 8196, 8476, 5549, 5541],
})
sales_2008

In [None]:
# Population variance + std dev (exercise: try predicting first)
# (I’m using my helper functions so ddof is consistent.)

x_sales = sales_2008["sales"].to_numpy()

pop_var_sales = variance(x_sales, population=True)
pop_std_sales = std_dev(x_sales, population=True)

# Optional: uncomment to reveal the numbers
# pd.Series({"population_variance": pop_var_sales, "population_std_dev": pop_std_sales}).round(5)


## Part 3 — Table 10: Average earnings by sex (constant 2008 $)

What I’m trying to check here is *relative* variability.

I think this is why we use **coefficient of variation (CV)**:
- standard deviation by itself is in “dollars”, so it’s hard to compare groups if the means differ
- CV rescales it: \(CV = (\sigma/\mu)\times 100\%\)

Assumption I’m making (because it matches the way the questions are usually phrased): treat these years as the **population** → use `ddof=0`.
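Before touching the real table, a toy check of why CV is useful. Both groups below are hypothetical numbers I made up: group B has the bigger standard deviation in raw units, but relative to its mean it is actually the steadier one. I’m using `ddof=0` to match the population assumption.

```python
import numpy as np

def cv_percent(values):
    """Population coefficient of variation, in percent: (sigma / mu) * 100."""
    values = np.asarray(values, dtype=float)
    return np.std(values, ddof=0) / np.mean(values) * 100

# Hypothetical groups on very different scales.
group_a = [10, 12, 14]
group_b = [1000, 1020, 1040]

std_a, std_b = np.std(group_a, ddof=0), np.std(group_b, ddof=0)
cv_a, cv_b = cv_percent(group_a), cv_percent(group_b)

print("std dev:", round(std_a, 2), "vs", round(std_b, 2))
print("CV (%): ", round(cv_a, 2), "vs", round(cv_b, 2))
```

So “bigger standard deviation” and “more variable relative to the mean” can point in opposite directions, which is the whole reason to compute CV for the earnings comparison.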

### Exercise (before running)
- Do I *expect* women or men to have the higher CV here? (Not higher dollars — higher variability relative to their mean.)
- My guess:

### My notes (fill in)
- If the CV is higher for one group, what does that mean in a sentence?


In [None]:
earnings = pd.DataFrame({
    "year":  [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008],
    "women": [27000,27500,27600,27900,27600,27900,28600,29000,29900,30200],
    "men":   [43000,44500,44400,44400,43800,44000,44700,44800,45500,46900],
}).set_index("year")

earnings


In [None]:
women = earnings["women"].to_numpy()
men = earnings["men"].to_numpy()

cv_women = coef_of_variation(women, population=True)
cv_men = coef_of_variation(men, population=True)

# Optional: uncomment to reveal the numbers
# pd.Series({"CV_women_% (population)": cv_women, "CV_men_% (population)": cv_men}).round(2)
