<a href="https://colab.research.google.com/github/alysolamon/Applied-Statistics-For-Business/blob/main/Week%202%20Statistics%20Practice.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Measures of Variability + Variance/Std Dev + Coefficient of Variation

What I'm trying to do here is set up a mini "playground" where I can practice the same ideas on a few different tables (so it's not just one example):
- Table 8 (tuition by discipline) → range, IQR, variance, standard deviation
- Table 9 (Ontario home sales, monthly 2008) → treat as a **population** and compute variance/std dev
- Table 10 (earnings by sex) → coefficient of variation (also as population)

**Quick self-reminder:** I'm not trying to only memorize formulas; I want to be able to look at a dataset and know which lever to pull (mean vs median, range vs IQR, sample vs population, etc.).


## Notes before I start (why `pandas` + `numpy`)

I *think* the combo here is basically:
- **pandas**: good for tables with labels/columns (keeps me organized so I don't lose what each number *means*)
- **NumPy**: good for the actual calculations (arrays + stats functions)

**If I were to describe each of them in one sentence:**

> **pandas** is like the labeled spreadsheet layer.

> **NumPy** is like the math engine underneath.


### My Take on this Tech-Stack
It's quite unorthodox compared to an Excel sheet, but it's really fun to interact with because you can manipulate data at scale and export it to whatever format you want: Google Sheets, Excel, or another dataset format. To be honest, it is quite confusing in the beginning. But once you see how easily the data bends to your will, it is worth it.


In [5]:
import numpy as np
import pandas as pd

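## Helper functions (so I don't repeat myself)

I'm writing small helper functions so I can reuse the same logic across all tables (and so I'm not re-typing formulas).

**Quartiles / IQR note (course-style):** I'm using the Tukey "median of halves" method:
- If \(n\) is odd, I exclude the overall median before splitting into lower/upper halves.
- If \(n\) is even, I split the sorted data into two equal halves.
- \(Q_1\) is the median of the lower half and \(Q_3\) is the median of the upper half.

Naming caveat: Tukey's own hinges include the median in both halves when \(n\) is odd. The course-style version here excludes it, so my results can differ slightly from software that implements hinges.

**Population vs sample note:** in NumPy, this is controlled by `ddof`:
- `ddof=0` → population variance/std (divides by \(n\))
- `ddof=1` → sample variance/std (divides by \(n-1\))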
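
A quick sanity check of the `ddof` switch, as a minimal sketch. The dataset is a toy example I made up (it is not one of the course tables), chosen so the mean is exactly 5 and the squared deviations sum to 32:

```python
import numpy as np

# Hypothetical toy data (not from the course tables); the mean is exactly 5
data = np.array([2, 4, 4, 4, 5, 5, 7, 9], dtype=float)

pop_var = np.var(data, ddof=0)     # divides by n     -> 32 / 8
sample_var = np.var(data, ddof=1)  # divides by n - 1 -> 32 / 7

print(f"population variance: {pop_var:.4f}")
print(f"sample variance:     {sample_var:.4f}")
```

The sample version is always a bit larger than the population version for the same numbers, and the gap shrinks as \(n\) grows.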

### Quick exercise (don't skip)
Before I run anything: for a skewed dataset, which do I think changes more if there's one extreme outlier, the range or the IQR?
- **My guess**:
> The range. It is computed from the minimum and maximum, so a single extreme value moves it immediately. The IQR only looks at the middle 50% of the data, so one outlier affects it much less, if at all.
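
One way to test this is a toy comparison. The sketch below uses made-up numbers and a small stand-in helper (`quartiles_median_of_halves` is my own name for illustration) that follows the same median-of-halves rule described above:

```python
import numpy as np

def quartiles_median_of_halves(values):
    """Q1 and Q3 as medians of the lower/upper halves (median excluded when n is odd)."""
    x = np.sort(np.asarray(values, dtype=float))
    n = len(x)
    lower = x[: n // 2]
    upper = x[(n + 1) // 2 :]
    return np.median(lower), np.median(upper)

def spread(values):
    """Range and IQR of a dataset, as plain floats."""
    q1, q3 = quartiles_median_of_halves(values)
    return {"range": float(np.max(values) - np.min(values)), "IQR": float(q3 - q1)}

# Hypothetical data: identical except that the largest value becomes an extreme outlier
clean = [10, 12, 13, 15, 16, 18, 20]
with_outlier = [10, 12, 13, 15, 16, 18, 200]

print("clean:       ", spread(clean))
print("with outlier:", spread(with_outlier))
```

In this particular example the range jumps from 10 to 190 while the IQR stays at 6, because the outlier never reaches the middle of either half.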




## Topics + learning objectives (what I think Week 2 wants from me)

When I read the module outline, it *sounds like* the point is:

- **Central tendency** (where the "middle" is)
  - mean, median, mode
  - choosing mean vs median depending on skew / outliers
- **Variability** (how spread out things are)
  - range, IQR, variance, standard deviation
  - knowing when range is too sensitive and IQR is safer
- **Relative variability**
  - coefficient of variation (CV) when I need a "spread relative to the mean" comparison
- **Shape / skewness**
  - not just "is it skewed?", but "what does that do to mean vs median?"

### What I'm practicing in this notebook
- With **raw data** (like these tables): compute mean/median/mode and the spread measures.
- With **grouped data** (if I get a frequency table later): practice the grouped formulas.
- Look at the numbers and ask: "If I had to describe this dataset in one sentence... what measure would I pick and why?"

### Mini-checks I want to remember to ask myself
- If the distribution looks skewed, do I expect \(mean > median\) or \(mean < median\)? (I'll test it instead of guessing.)
- Is this a **sample** or a **population**? (That decides `ddof`, i.e. whether I'm dividing by \(n\) or \(n-1\).)
- Am I comparing two groups with different scales? (That's where CV might be the fairer comparison.)

### My own summary (fill in)
- Mean vs median, and when I'd use each:

- Why IQR can be better than range:

- One sentence definition of variance vs standard deviation (in my own words):


# Digging Deeper into **Variance** and **Standard Deviation**

Before I dive into the numbers, I want to take a moment to really think about what variance and standard deviation *mean*. These two are often talked about together, and for good reason: one builds directly on the other.

**Quick conceptual reminder for myself:**
-  Both are measures of **spread** or **dispersion**.
-  Think: how "scattered" or "bunched up" is my data?

---

## Okay, Let's Break This Down...

### First, the Mean (Average)
$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

**In plain English:** This is just the average: add everything up, divide by how many things you have. Nothing fancy.

> 🔑 **KEY:** `μ` (mu) = population mean

---

### Variance: The "Average of Squared Distances"

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$

**What is this actually saying?**

1. Take each data point
2. Subtract the mean (`x_i - μ`): this is how far it is from the average (could be negative!)
3. **Square it**: this removes the negative signs (also makes big differences stand out more)
4. Add 'em all up, divide by `N`

**Translation:** Variance is the average of the *squared* distances from the mean. Squaring keeps the positive and negative deviations from cancelling each other out.

> 🔑 **KEY:** `σ²` (sigma squared) = population variance  
> 🔑 **KEY:** `N` = total population size
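
The four steps can be written out by hand and checked against `np.var`. This is a minimal sketch on a made-up four-point population (the numbers are mine, picked to keep the arithmetic easy):

```python
import numpy as np

# Hypothetical toy population (not from the course tables)
x = np.array([4.0, 8.0, 6.0, 2.0])

mu = x.sum() / len(x)               # the mean
deviations = x - mu                 # step 2: distance from the mean (can be negative)
squared = deviations ** 2           # step 3: square to remove the signs
sigma_sq = squared.sum() / len(x)   # step 4: average the squared distances

print("deviations:         ", deviations)
print("sum of deviations:  ", deviations.sum())  # cancels to zero, which is why we square
print("population variance:", sigma_sq)

# The hand calculation should agree with NumPy's population variance
assert np.isclose(sigma_sq, np.var(x, ddof=0))
```

The raw deviations always sum to zero, so averaging them directly would say "no spread" for every dataset. Squaring is what makes the measure informative.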

---

### Standard Deviation: "Un-squaring" the Variance

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$

**Here's the thing:** The variance is in *squared units*, like "dollars squared" or "meters squared." That's hard to interpret in real life.

So we take the square root to get back to the original units.

**Translation:** Standard deviation = variance, but actually interpretable. It tells you roughly how far a typical point sits from the mean (strictly, it is the square root of the mean squared distance, not the plain average distance).

> 🔑 **KEY:** `σ` (sigma) = population standard deviation

---

## Wait: Sample vs. Population?

Okay, so here's where I get confused. If you have ALL the data → population. If you have a SLICE of the data → sample.

### Sample Mean
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Same formula, different notation! (`n` instead of `N`, `x̄` instead of `μ`)

> 🔑 **KEY:** `x̄` (x-bar) = sample mean

---

### Sample Variance: The Tricky Part

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$

**Why `n-1` instead of `n`?**

Oh man, this confused me for the longest time. Here's the ADHD breakdown:

- The sample mean `x̄` is computed from the same data, so the points sit closer to `x̄` than they do to the true mean `μ`
- That means dividing the squared deviations by `n` tends to *underestimate* the true variance
- Dividing by `n-1` instead (called **Bessel's correction**) makes the estimate slightly larger and compensates for this bias

**Translation:** It's a correction factor that makes the estimate right on average.

> 🔑 **KEY:** `s²` = sample variance  
> 🔑 **KEY:** `n-1` = degrees of freedom (one is "used up" by estimating the mean from the same sample)
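
To see the bias instead of taking it on faith, here is a small simulation sketch. Everything in it is an assumption for illustration: a normal population with variance 4, samples of size 5, and a fixed seed so the run is repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

true_var = 4.0      # assumed population variance (standard deviation 2)
n = 5               # small samples make the bias easy to see
n_trials = 20_000   # number of simulated samples

# Each row is one sample of size n drawn from the assumed normal population
samples = rng.normal(loc=10.0, scale=np.sqrt(true_var), size=(n_trials, n))

# Average the variance estimates over all simulated samples
avg_div_n = np.var(samples, axis=1, ddof=0).mean()
avg_div_n_minus_1 = np.var(samples, axis=1, ddof=1).mean()

print(f"true variance:                   {true_var}")
print(f"average estimate, divide by n:   {avg_div_n:.3f}")
print(f"average estimate, divide by n-1: {avg_div_n_minus_1:.3f}")
```

With these settings the divide-by-`n` average should land near \((n-1)/n \times 4 = 3.2\), while the `n-1` version should land near 4.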

---

### Sample Standard Deviation

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$

Same idea: just un-square the variance to get it back into interpretable units.

> 🔑 **KEY:** `s` = sample standard deviation

---

## Quick Reference Summary

| Measure | What It Is | Population | Sample |
|---------|------------|------------|--------|
| Mean | Average | `μ = (1/N)∑x_i` | `x̄ = (1/n)∑x_i` |
| Variance | Avg squared distance | `σ² = (1/N)∑(x_i-μ)²` | `s² = (1/(n-1))∑(x_i-x̄)²` |
| Std Dev | √variance | `σ = √σ²` | `s = √s²` |

---

## Mental Bookmark for Later

> **Variance** = fancy math way to say "how spread out is this data... squared"  
> **Standard Deviation** = variance, but actually useful (same units as data)  
> **Population** = you have everything  
> **Sample** = you have a slice, use `n-1` to be honest about uncertainty

---

*Standard deviation puts the spread back into the units of my data, making it easier to interpret as an 'average' deviation.*

In [6]:
def data_range(x):
    x = np.asarray(x, dtype=float)
    return np.max(x) - np.min(x)

def tukey_quartiles(x):
    """
    Tukey quartiles (median-of-halves).
    If n is odd: exclude the median from the halves.
    """
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)

    # median
    q2 = np.median(x)

    if n % 2 == 1:
        lower = x[:n//2]       # excludes median
        upper = x[n//2+1:]
    else:
        lower = x[:n//2]
        upper = x[n//2:]

    q1 = np.median(lower)
    q3 = np.median(upper)
    return q1, q2, q3

def tukey_iqr(x):
    q1, _, q3 = tukey_quartiles(x)
    return q3 - q1

def variance(x, population=True):
    x = np.asarray(x, dtype=float)
    ddof = 0 if population else 1
    return np.var(x, ddof=ddof)

def std_dev(x, population=True):
    x = np.asarray(x, dtype=float)
    ddof = 0 if population else 1
    return np.std(x, ddof=ddof)

def coef_of_variation(x, population=True):
    x = np.asarray(x, dtype=float)
    return (std_dev(x, population=population) / np.mean(x)) * 100
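
One thing worth knowing about the quartile helper: it will not always agree with `np.percentile`, whose default method interpolates linearly between order statistics. A minimal sketch on made-up data (the values 1 through 9 are my own example):

```python
import numpy as np

x = np.arange(1, 10, dtype=float)  # hypothetical data: 1, 2, ..., 9

# Median of halves, excluding the overall median (5) because n is odd
lower, upper = x[:4], x[5:]
q1_halves, q3_halves = np.median(lower), np.median(upper)

# NumPy's default percentile method uses linear interpolation
q1_np, q3_np = np.percentile(x, [25, 75])

print("median of halves:", q1_halves, q3_halves)
print("np.percentile:   ", q1_np, q3_np)
```

Both conventions are legitimate. The point is to use whichever one the course expects and to stay consistent across the notebook.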

## Part 1 - Table 8: Tuition fees by discipline (Canada)

Goal (for **2009/2010**): I want to compute spread measures and see how "uneven" tuition is across disciplines.

What I'm computing:
- range
- IQR (Tukey)
- variance + standard deviation (population vs sample)

### Exercise (before running)
- Which measure do I *think* will look less "dramatic": range or IQR? Why?
- Do I expect the mean to be higher or lower than the median? (Just a guess; I'll verify.)


In [7]:
tuition = pd.DataFrame({
    "Discipline": [
        "Canada",
        "Agriculture, Natural Resources, and Conservation",
        "Architecture and Related Technologies",
        "Humanities",
        "Business Management and Public Administration",
        "Education",
        "Engineering",
        "Law",
        "Medicine",
        "Visual and Performing Arts and Communication Technologies",
        "Physical and Life Sciences and Technologies",
        "Math and Computer and Information Sciences",
        "Social and Behavioural Sciences",
        "Other Health, and Parks, Recreation, and Fitness",
        "Dentistry",
        "Nursing",
        "Pharmacy",
        "Veterinary medicine"
    ],
    "2006/2007r": [
        4400, 3869, 3839, 4336, 4195, 3373, 4943, 7155, 9659, 3991, 4270, 4650, 4041, 4996,
        np.nan, np.nan, np.nan, np.nan
    ],
    "2007/2008r": [
        4558, 4064, 3999, 4342, 4637, 3545, 5099, 7382, 10029, 4239, 4534, 4746, 4165, 4400,
        12516, 4267, 4215, 4296
    ],
    "2008/2009": [
        4747, 4366, 4503, 4364, 4978, 3652, 5319, 8030, 9821, 4377, 4679, 4987, 4251, 4539,
        13290, 4422, 8366, 4422
    ],
    "2009/2010": [
        4917, 4516, 4794, 4501, 5073, 3783, 5583, 8502, 10216, 4539, 4847, 5220, 4399, 4692,
        13988, 4558, 8792, 5110
    ],
})

tuition.head()

| | Discipline | 2006/2007r | 2007/2008r | 2008/2009 | 2009/2010 |
|---|------------|------------|------------|-----------|-----------|
| 0 | Canada | 4400.0 | 4558 | 4747 | 4917 |
| 1 | Agriculture, Natural Resources, and Conservation | 3869.0 | 4064 | 4366 | 4516 |
| 2 | Architecture and Related Technologies | 3839.0 | 3999 | 4503 | 4794 |
| 3 | Humanities | 4336.0 | 4342 | 4364 | 4501 |
| 4 | Business Management and Public Administration | 4195.0 | 4637 | 4978 | 5073 |


## Extra stats I might add (if it helps me)

While I'm doing variability, I keep realizing I also want the "typical value" measures next to it (mean/median/mode), because otherwise I'm staring at spread with no center.

- **Mean**: average tuition across disciplines.
- **Median**: I already get this from my quartiles output.
- **Mode**: I'm not 100% sure mode is meaningful here (tuition is kind of "continuous-ish"), but it might still be a useful check.

### Exercise (for me)
- Add `mean` to the `results` dict using `np.mean(x)`.
- Try to compute a mode (if any). If it's messy / not unique, I'll write a note explaining why.

### My notes (fill in)
- Did mean and median come out close or far apart? What does that make me think about skew/outliers?
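
A possible way to do this exercise, as a sketch rather than the official answer. The 17 values are the 2009/2010 column from Table 8 with the "Canada" row left out; the variable names are my own.

```python
import numpy as np
import pandas as pd

# 2009/2010 tuition for the 17 disciplines ("Canada" row excluded), copied from Table 8
tuition_2009 = pd.Series([
    4516, 4794, 4501, 5073, 3783, 5583, 8502, 10216, 4539,
    4847, 5220, 4399, 4692, 13988, 4558, 8792, 5110,
])

mean_tuition = tuition_2009.mean()
median_tuition = tuition_2009.median()

# value_counts() tallies how often each value occurs;
# a mode is only meaningful if at least one value repeats
counts = tuition_2009.value_counts()
has_mode = bool(counts.max() > 1)

print(f"mean:   {mean_tuition:.2f}")
print(f"median: {median_tuition:.2f}")
print("has a repeated value (a real mode)?", has_mode)
```

Every tuition value is unique, so there is no meaningful mode here. The mean also sits well above the median, which is what I would expect if a few expensive programs (Dentistry, Medicine) pull the average up.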


In [8]:
# Use disciplines only (exclude "Canada" row for the quiz-style answers)
x = tuition.loc[tuition["Discipline"] != "Canada", "2009/2010"].to_numpy()

results = {
    "n": len(x),
    "min": np.min(x),
    "max": np.max(x),
    "range": data_range(x),
    "Q1 (Tukey)": tukey_quartiles(x)[0],
    "median": tukey_quartiles(x)[1],
    "Q3 (Tukey)": tukey_quartiles(x)[2],
    "IQR (Tukey)": tukey_iqr(x),
    "pop_variance": variance(x, population=True),
    "pop_std": std_dev(x, population=True),
    "sample_variance": variance(x, population=False),
    "sample_std": std_dev(x, population=False),
}

pd.Series(results).round(2)

| Statistic | Value |
|-----------|-------|
| n | 17.0 |
| min | 3783.0 |
| max | 13988.0 |
| range | 10205.0 |
| Q1 (Tukey) | 4527.5 |
| median | 4847.0 |
| Q3 (Tukey) | 7042.5 |
| IQR (Tukey) | 2515.0 |
| pop_variance | 6978532.84 |
| pop_std | 2641.69 |


## Part 2 - Table 9: Ontario single-family dwelling sales (2008)

Note to myself: September shows ">8,196" in the screenshot, but the quiz options I saw match using **8196** exactly.

Also, I'm treating this year's 12 months like the **whole population** (not a sample), so I should use `ddof=0`.

### Quick exercise
Before I compute it: do I expect the standard deviation to be ‚Äúbig‚Äù compared to the mean? Why do I think that?
- My guess:


In [9]:
sales_2008 = pd.DataFrame({
    "month": ["January","February","March","April","May","June","July","August","September","October","November","December"],
    "sales": [2670, 4120, 6171, 8107, 9589, 10955, 9967, 8035, 8196, 8476, 5549, 5541],
})
sales_2008

| | month | sales |
|---|-------|-------|
| 0 | January | 2670 |
| 1 | February | 4120 |
| 2 | March | 6171 |
| 3 | April | 8107 |
| 4 | May | 9589 |
| 5 | June | 10955 |
| 6 | July | 9967 |
| 7 | August | 8035 |
| 8 | September | 8196 |
| 9 | October | 8476 |


In [10]:
# Population variance + std dev (exercise: try predicting first)
# (I'm using my helper functions so ddof is consistent.)

x_sales = sales_2008["sales"].to_numpy()

pop_var_sales = variance(x_sales, population=True)
pop_std_sales = std_dev(x_sales, population=True)

# Optional: uncomment to reveal the numbers
# pd.Series({"population_variance": pop_var_sales, "population_std_dev": pop_std_sales}).round(5)


## Part 3 - Table 10: Average earnings by sex (constant 2008 $)

What I‚Äôm trying to check here is *relative* variability.

I think this is why we use **coefficient of variation (CV)**:
- standard deviation by itself is in "dollars", so it's hard to compare groups if the means differ
- CV rescales it: \(CV = (\sigma/\mu)\times 100\%\)
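
To convince myself that the rescaling matters, here is a minimal sketch with two made-up groups that have exactly the same standard deviation but very different means. `cv_percent` is a stand-in name for illustration and uses the population form (`ddof=0`):

```python
import numpy as np

def cv_percent(values):
    """Population coefficient of variation, in percent."""
    values = np.asarray(values, dtype=float)
    return np.std(values, ddof=0) / np.mean(values) * 100

# Hypothetical groups: same spread in absolute terms, very different means
small_scale = [10, 12, 14]
large_scale = [1000, 1002, 1004]

for name, group in [("small_scale", small_scale), ("large_scale", large_scale)]:
    print(f"{name}: std = {np.std(group):.3f}, CV = {cv_percent(group):.3f}%")
```

A spread of about 1.6 is a big deal when the mean is 12 and almost nothing when the mean is 1002. The standard deviation cannot show that difference, but the CV can.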

Assumption I'm making (because it matches the way the questions are usually phrased): treat these years as the **population** → use `ddof=0`.

### Exercise (before running)
- Do I *expect* women or men to have the higher CV here? (Not higher dollars, but higher variability relative to their mean.)
- My guess:

### My notes (fill in)
- If the CV is higher for one group, what does that mean in a sentence?


In [11]:
earnings = pd.DataFrame({
    "year":  [1999,2000,2001,2002,2003,2004,2005,2006,2007,2008],
    "women": [27000,27500,27600,27900,27600,27900,28600,29000,29900,30200],
    "men":   [43000,44500,44400,44400,43800,44000,44700,44800,45500,46900],
}).set_index("year")

earnings


| year | women | men |
|------|-------|-----|
| 1999 | 27000 | 43000 |
| 2000 | 27500 | 44500 |
| 2001 | 27600 | 44400 |
| 2002 | 27900 | 44400 |
| 2003 | 27600 | 43800 |
| 2004 | 27900 | 44000 |
| 2005 | 28600 | 44700 |
| 2006 | 29000 | 44800 |
| 2007 | 29900 | 45500 |
| 2008 | 30200 | 46900 |


# Question List: Comparing Mean, Median, and Mode

What I'm trying to do here is work through a new kind of problem: one where I have to compare different measures of central tendency (mean, median, mode) across multiple datasets to identify specific patterns.

## The problem statement

I've got four sets of scores:

| Set 1 | Set 2 | Set 3 | Set 4 |
|-------|-------|-------|-------|
| 14    | 43    | 41    | 9     |
| 16    | 47    | 42    | 44    |
| 23    | 49    | 46    | 50    |
| 23    | 55    | 51    | 61    |
| 30    | 61    | 51    | 69    |
| 32    | 63    | 55    | 72    |
| 97    | 67    | 56    | 72    |

And I need to answer:

**Which sets have:**
- (a) The mean is greater than the median
- (b) The median and the mean are the same
- (c) The mode is greater than the median

---

## My approach (thinking through it)

Okay, so before I jump into the code, let me think about what each question is *really* asking...

### What does "mean > median" tell me?

From what I remember, this is about **skewness**:
- When the mean is greater than the median, the distribution is usually **right-skewed** (positive skew)
- This happens when there are outliers on the *high* end pulling the mean up
- The median is more "resistant" to outliers, so it stays closer to the center

**My guess before looking at the data:**
I'm betting Set 1 might show this pattern because I see that 97 at the end, which is pretty high compared to the other numbers. Let me verify this though...

---

### What does "mean = median" mean?

This typically indicates a **symmetric distribution** (or at least roughly symmetric).
- The data is balanced on both sides
- No strong skew in either direction
- In a perfect normal distribution, mean = median = mode

**My guess before looking at the data:**
I wonder if Set 2 or Set 3 might show this? They look more "balanced" when I scan the numbers, but I'm not sure. Let me actually calculate it to find out.

---

### What does "mode > median" mean?

This is the tricky one. The mode is the most frequent value. If the mode is greater than the median...
- The most common value is on the higher side
- But the median (middle value) is lower
- This could happen if there are repeated high values but enough lower values to pull the median down

**My guess before looking at the data:**
Set 4 has 72 appearing twice, so that might be the mode. And looking at the spread, I can imagine the median being lower. But I'm totally guessing here. Let me work through it step by step.

---

## Helper functions I'll need

I already have `tukey_quartiles` which gives me the median, so I can reuse that. But I also need a function to find the **mode**.

*Note:* NumPy has no "mode" function as straightforward as `np.mean()` or `np.median()`. I can use pandas' `Series.mode()`, `scipy.stats.mode`, or write a custom function.

### Quick self-check (before coding)

What do I *think* will happen?
- **Set 1**: Mean > median? (because of that 97 outlier)
- **Set 2**: Mean = median? (looks symmetric-ish)
- **Set 3**: Mean = median? (also looks pretty balanced)
- **Set 4**: Mode > median? (72 appears twice, might be the mode)

But let me actually calculate these instead of just guessing...
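
Here is one way to run the check, as a sketch. The `modes` helper is my own small function built on `collections.Counter` (not a library mode function), and it returns an empty list when no value repeats. The four score sets are copied from the table above.

```python
from collections import Counter
import numpy as np

def modes(values):
    """Return every value that occurs most often, or [] if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

score_sets = {
    "Set 1": [14, 16, 23, 23, 30, 32, 97],
    "Set 2": [43, 47, 49, 55, 61, 63, 67],
    "Set 3": [41, 42, 46, 51, 51, 55, 56],
    "Set 4": [9, 44, 50, 61, 69, 72, 72],
}

summary = {}
for name, scores in score_sets.items():
    summary[name] = {
        "mean": float(np.mean(scores)),
        "median": float(np.median(scores)),
        "modes": modes(scores),
    }
    print(name, summary[name])

# isclose guards against tiny floating-point differences when testing equality
mean_eq_median = [s for s, r in summary.items() if np.isclose(r["mean"], r["median"])]
mean_gt_median = [s for s, r in summary.items()
                  if r["mean"] > r["median"] and s not in mean_eq_median]
mode_gt_median = [s for s, r in summary.items()
                  if any(m > r["median"] for m in r["modes"])]

print("(a) mean > median:", mean_gt_median)
print("(b) mean = median:", mean_eq_median)
print("(c) mode > median:", mode_gt_median)
```

Worth noting against my guesses: Set 3 only *looks* balanced. Its mean (about 48.86) is below its median (51), so it does not belong in (b).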

---

In [None]:
women = earnings["women"].to_numpy()
men = earnings["men"].to_numpy()

cv_women = coef_of_variation(women, population=True)
cv_men = coef_of_variation(men, population=True)

# Optional: uncomment to reveal the numbers
# pd.Series({"CV_women_% (population)": cv_women, "CV_men_% (population)": cv_men}).round(2)


The operations manager of a plant that manufactures tires wants to compare the actual inner diameters of two grades of tires, each of which is expected to be 575 millimeters. Samples of five tires from each grade were selected, and the results representing the inner diameters of the tires, ranked from smallest to largest, are shown below. Complete parts (a) through (c) below.


In [None]:
# Define the grades data as a dictionary
grades_data = {
    "Grade X": [566, 572, 575, 578, 585],
    "Grade Y": [571, 574, 575, 579, 583],
}

Grade X - Mean: 575.2, Median: 575.0, Standard Deviation: 6.305553108173779
Grade Y - Mean: 576.4, Median: 575.0, Standard Deviation: 4.17612260356422


## A. For each of the two grades of tires, compute the mean, median, and standard deviation.

In [19]:
# Compute mean, median, and standard deviation for Grade X
mean_x = np.mean(grades_data["Grade X"])
median_x = np.median(grades_data["Grade X"])

# Note to self: I first computed the population standard deviation here, but the
# question asks for the sample standard deviation, so I need ddof=1.

std_dev_x = np.std(grades_data["Grade X"], ddof=1)


# NumPy provides np.mean(), np.median(), and np.std() for these three statistics.

# Compute mean, median, and standard deviation for Grade Y
mean_y = np.mean(grades_data["Grade Y"])
median_y = np.median(grades_data["Grade Y"])
std_dev_y = np.std(grades_data["Grade Y"], ddof=1)

# Display the results in a clearly structured way (rounded to two decimal places)
print("Grade X:")
print(f"  Mean:               {mean_x:.2f}")
print(f"  Median:             {median_x:.2f}")
print(f"  Standard Deviation: {std_dev_x:.2f}\n")

print("Grade Y:")
print(f"  Mean:               {mean_y:.2f}")
print(f"  Median:             {median_y:.2f}")
print(f"  Standard Deviation: {std_dev_y:.2f}")


Grade X:
  Mean:               575.20
  Median:             575.00
  Standard Deviation: 7.05

Grade Y:
  Mean:               576.40
  Median:             575.00
  Standard Deviation: 4.67
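
As a cross-check on the `ddof` mistake noted above, the same table can be built with pandas. This is just an alternative sketch: the `tires` DataFrame is my own name, and the key detail is that pandas' `std` uses `ddof=1` by default while NumPy's `np.std` uses `ddof=0`.

```python
import numpy as np
import pandas as pd

tires = pd.DataFrame({
    "Grade X": [566, 572, 575, 578, 585],
    "Grade Y": [571, 574, 575, 579, 583],
})

# pandas' std defaults to ddof=1, i.e. the sample standard deviation
summary = tires.agg(["mean", "median", "std"]).round(2)
print(summary)

# NumPy's default is ddof=0, so ddof=1 must be set explicitly to match pandas
assert np.isclose(tires["Grade X"].std(), np.std(tires["Grade X"].to_numpy(), ddof=1))
```

So the two libraries give different "default" standard deviations for the same data, which is exactly the trap I fell into the first time.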


## B. Which grade of tire is providing better quality? Explain. Choose the correct answer below.

In [None]:
# B. Which grade of tire is providing better quality?

# Step 1: What does "better quality" mean in this context?
#   Expected diameter = 575 mm, so quality depends on two things:
#     Accuracy    -> mean close to 575
#     Consistency -> smaller standard deviation
#   Both matter.
TARGET = 575  # expected inner diameter (mm), from the problem statement

# Step 2: Compute the mean and sample standard deviation for each grade
mean_x = np.mean(grades_data["Grade X"])
std_dev_x = np.std(grades_data["Grade X"], ddof=1)
mean_y = np.mean(grades_data["Grade Y"])
std_dev_y = np.std(grades_data["Grade Y"], ddof=1)

# Step 3: Compare accuracy (distance of the mean from the target) and consistency
if abs(mean_x - TARGET) < abs(mean_y - TARGET):
    print("Grade X has a mean closer to the target.")
else:
    print("Grade Y has a mean closer to the target.")

if std_dev_x < std_dev_y:
    print("Grade X has a smaller standard deviation.")
else:
    print("Grade Y has a smaller standard deviation.")

# Step 4: Conclusion
# Grade X's mean (575.20) is slightly closer to 575 than Grade Y's (576.40),
# but Grade Y's standard deviation (4.67) is clearly smaller than Grade X's (7.05).
# Grade Y's diameters are more consistent, so on that criterion Grade Y is providing better quality.


Grade X has a mean closer to the target.
Grade Y has a smaller standard deviation.


## C. What would be the effect on your answers in (a) and (b) if the last value for Grade Y were 588 instead of 583? Explain. Choose the correct answer below.

When the fifth Grade Y tire measures 588 mm rather than 583 mm, Grade Y's mean inner diameter becomes ____ mm, which is (smaller / larger) than Grade X's mean inner diameter, and Grade Y's standard deviation changes from ____ mm to ____ mm. In this case, Grade X's tires are providing (worse / better / the same) quality in terms of the mean inner diameter.

In [22]:
# Effect of changing the last Grade Y value from 583 to 588.
# I build a modified copy instead of editing grades_data in place, so re-running
# parts (a) and (b) still uses the original data.
TARGET = 575  # expected inner diameter (mm), from the problem statement

# Step 1: Copy Grade Y with the last value replaced by 588
grade_y_new = grades_data["Grade Y"][:-1] + [588]

# Step 2: Compute the mean and sample standard deviation for both grades
mean_x = np.mean(grades_data["Grade X"])
std_dev_x = np.std(grades_data["Grade X"], ddof=1)
std_dev_y_old = np.std(grades_data["Grade Y"], ddof=1)
mean_y = np.mean(grade_y_new)
std_dev_y = np.std(grade_y_new, ddof=1)

# Step 3: Display the results
print(f"Grade X mean: {mean_x:.2f}, std dev: {std_dev_x:.2f}")
print(f"Grade Y mean (with last value 588): {mean_y:.2f}, std dev: {std_dev_y:.2f}")

# Step 4: Answer the comparison asked in the question
mean_diff = mean_y - mean_x
comparison = "larger than" if mean_y > mean_x else "smaller than" if mean_y < mean_x else "the same as"

print(f"When the fifth Grade Y tire is 588 mm, Grade Y's mean diameter becomes {mean_y:.2f} mm, "
      f"which is {abs(mean_diff):.2f} mm {comparison} Grade X's mean diameter.")
print(f"Grade Y's standard deviation changes from {std_dev_y_old:.2f} mm to {std_dev_y:.2f} mm.")

# Step 5: Quality in terms of the mean = which mean is closer to the 575 mm target
if abs(mean_x - TARGET) < abs(mean_y - TARGET):
    print("Grade X's mean is closer to the target, so Grade X is better in terms of the mean inner diameter.")
elif abs(mean_x - TARGET) > abs(mean_y - TARGET):
    print("Grade Y's mean is closer to the target, so Grade Y is better in terms of the mean inner diameter.")
else:
    print("Both means are equally close to the target.")


Grade X mean: 575.20, std dev: 7.05
Grade Y mean (with last value 588): 577.40, std dev: 6.58
When the fifth Grade Y tire is 588 mm, Grade Y's mean diameter becomes 577.40 mm, which is 2.20 mm larger than Grade X's mean diameter.
Grade Y's standard deviation changes from 4.67 mm to 6.58 mm.
Grade X's mean is closer to the target, so Grade X is better in terms of the mean inner diameter.
