# Introduction to Statistics in Python — Structured Notes (Slides + Transcript)


---

## 1) What is statistics?

**Key ideas**

* **Statistics (the field):** the practice and study of **collecting** and **analyzing** data to learn from it and make decisions.
* **Summary statistic:** a **fact about a dataset** (e.g., an average, a count, a proportion, a median).

**Why this matters**

* Statistics gives us a language and set of tools to turn raw data into information we can interpret and act on.

---

## 2) What can statistics do?

**Examples from the slides/transcript**

* Estimate **purchase likelihood** and how it changes with different payment systems.
* Forecast **hotel occupancy** and optimize it.
* Decide how many **jean sizes** are needed to cover 95% of customers, and whether to produce equal quantities.
* Run **A/B tests** to learn which ad leads to more purchases.

**Takeaway**

* Statistics answers concrete, decision-oriented questions by quantifying uncertainty and comparing alternatives.

---

## 3) What **can’t** statistics do?

**Limits**

* “**Why** is *Game of Thrones* so popular?” — Even with data on views and violence, we **cannot conclude causation** from correlation alone.
* People might **misreport** reasons; other **unobserved factors** can drive popularity.

**Takeaway**

* Statistics can show **associations** (e.g., “more violent shows have more views”) but not necessarily **causal reasons** (“violence causes popularity”) without a proper causal design.

---

## 4) Types of statistics

* **Descriptive statistics:** describe/summarize the data you have.
  Example: From 4 friends: 50% drive, 25% bus, 25% bike.
* **Inferential statistics:** use a **sample** to make statements about a **population** (with quantified uncertainty).
  Example: Estimate the **percentage of all people** who drive to work based on your sample.

**Takeaway**

* Descriptive = summarize **sample**.
* Inferential = generalize beyond sample to **population** (often with confidence intervals or hypothesis tests).

---

## 5) Types of data

* **Numeric (Quantitative)**

  * **Continuous (measured):** speed, time, height, temperature.
  * **Discrete (counted):** number of pets, packages shipped.
* **Categorical (Qualitative)**

  * **Nominal (unordered):** marital status, country.
  * **Ordinal (ordered):** Likert scale (strongly disagree → strongly agree).

**Takeaway**

* Knowing the **type** of each variable guides which **summaries** and **visualizations** are appropriate.

---

## 6) Categorical data can be represented as numbers (but may still be categorical)

* **Nominal examples:** married/unmarried encoded as 1/0; countries encoded as 1, 2, …
* **Ordinal examples:** “strongly disagree”=1 … “strongly agree”=5
* **Important:** Encoding with numbers **does not automatically** make a variable “numeric” in the analytical sense. Treat it according to its **conceptual type**.
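
As a minimal sketch of treating a numerically encoded variable according to its conceptual type, the snippet below turns Likert codes (1 to 5) into an ordered pandas categorical. The responses and label names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical survey responses stored as integer codes 1..5
codes = pd.Series([1, 5, 4, 4, 2, 5, 3])
labels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]

# from_codes expects 0-based positions into `labels`, hence the "- 1"
likert = pd.Series(
    pd.Categorical.from_codes((codes - 1).to_numpy(), categories=labels, ordered=True)
)

print(likert.dtype)                       # category
print(likert.min(), "->", likert.max())   # order-aware: lowest and highest response
print(likert.value_counts(sort=False))    # counts listed in category order
```

Declared this way, the variable supports counts and order comparisons, while pandas refuses to compute `likert.mean()`, which matches the point that the codes carry order but not magnitude.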

---

## 7) Why does data type matter?

* **Numeric data:** summaries like **mean/median**, and visuals like **histograms** or **scatter plots** are appropriate.
* **Categorical data:** **counts/proportions** and visuals like **bar plots** are appropriate.
* Using the wrong tool (e.g., the mean of country codes) gives **nonsense**.

---

## 8) Code Examples from the Slides

Below are the two code snippets illustrated in the slides, shown with output and line-by-line explanations.

---

### 8.1) Numeric summary statistic: Mean speed

#### Code

```python
import numpy as np
np.mean(car_speeds['speed_mph'])
```

#### Expected Output

```
40.09062
```
![image.png](attachment:image.png)

#### Line-by-line explanation

1. `import numpy as np`

   * **What:** Imports the NumPy library and aliases it as `np`.
   * **Why:** NumPy provides efficient numerical functions; here we need `mean`.
   * **Result:** Makes `np.mean` available in our code.

2. `np.mean(car_speeds['speed_mph'])`

   * **What:** Computes the arithmetic **mean** of the `speed_mph` column in the `car_speeds` DataFrame/Series.
   * **Why:** The **mean** is a common **descriptive statistic** for the center of a numeric distribution.
   * **Expected result:** `40.09062`, the average miles per hour across observed cars.
   * **Interpretation:** On average, cars in this dataset travel about **40.09 mph**. This single number summarizes central tendency but **doesn’t show spread** (variation). Consider also reporting **median** and **standard deviation**, especially if the distribution is skewed or has outliers.

---

### 8.2) Categorical summary: Value counts of marriage status

#### Code

```python
demographics['marriage_status'].value_counts()
```

#### Expected Output

```
single      188
married     143
divorced    124
dtype: int64
```
![image-2.png](attachment:image-2.png)
#### Line-by-line explanation

1. `demographics['marriage_status']`

   * **What:** Selects the `marriage_status` column (a **categorical** variable) from the `demographics` DataFrame.
   * **Why:** We want to summarize how often each **category** appears.
   * **Result:** A pandas Series of marriage statuses.

2. `.value_counts()`

   * **What:** Counts the **frequency** of each distinct category (e.g., `single`, `married`, `divorced`).
   * **Why:** For **categorical** data, **counts** (and proportions) are the appropriate descriptive statistics.
   * **Expected result:** `single: 188`, `married: 143`, `divorced: 124`.
   * **Interpretation:** `single` is the most common status in this sample. For relative comparisons, you can use `value_counts(normalize=True)` to get **proportions**.

> **Significance:**
>
> * You wouldn’t compute a meaningful **mean** of `marriage_status`, because categories have no numeric magnitude.
> * The correct visualization here is a **bar plot** of counts or proportions, not a scatter plot.
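
The `normalize=True` option mentioned above can be sketched as follows. Since the `demographics` DataFrame itself is not available here, the snippet rebuilds a stand-in Series from the counts reported in the output (188, 143, 124); the variable name `status` is an assumption.

```python
import pandas as pd

# Stand-in for demographics['marriage_status'], rebuilt from the reported counts
status = pd.Series(["single"] * 188 + ["married"] * 143 + ["divorced"] * 124)

counts = status.value_counts()                      # absolute frequencies
proportions = status.value_counts(normalize=True)   # shares that sum to 1

print(counts)
print(proportions.round(3))
```

Proportions are usually easier to compare across samples of different sizes than raw counts.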

---

## 9) Plots (what fits which data type)

> The slides mention “Plots” to emphasize matching plot types to data types.

* **Numeric → Numeric** (e.g., `speed` vs. `time`): use **scatter plots**; consider regression lines.
* **Categorical → Counts/Proportions** (e.g., `marriage_status`): use **bar plots**.
* **Numeric (single variable):** use **histograms** or **box plots** for distribution and outliers.

---

## 10) Let’s practice!

**Checklist**

* Identify variable **types** (numeric continuous/discrete vs categorical nominal/ordinal).
* Choose **appropriate summary statistics**:

  * Numeric: mean/median/std, quantiles.
  * Categorical: counts/proportions, mode.
* Choose **appropriate plots**:

  * Numeric–numeric: scatter.
  * Numeric (one variable): histogram/box.
  * Categorical: bar plot.

---

### Final Takeaway

* **Know your variable types** → pick the **right summaries** and **right plots**.
* **Descriptive vs. Inferential:** describe your sample vs. generalize to a population.
* **Correlation ≠ Causation:** be cautious when interpreting associations as reasons.


### Descriptive and inferential statistics
Statistics can be used to answer lots of different types of questions, but being able to identify which type of statistics is needed is essential to drawing accurate conclusions. In this exercise, you'll sharpen your skills by identifying which type is needed to answer each question.

Instructions

Identify which questions can be answered with descriptive statistics and which questions can be answered with inferential statistics.

![image.png](attachment:image.png)


### Data type classification
In the video, you learned about two main types of data: numeric and categorical. Numeric variables can be classified as either discrete or continuous, and categorical variables can be classified as either nominal or ordinal. These characteristics of a variable determine which ways of summarizing your data will work best.

Instructions

Map each variable to its data type by dragging each item and dropping it into the correct data type.


![image-2.png](attachment:image-2.png)

# Measures of Center and Outliers: Summary Statistics in Python

---

## 1. Introduction: Measures of Center

- **Summary statistics** help us understand the "typical" value in a dataset.
- The three main measures of center:
    - **Mean** (average)
    - **Median** (middle value)
    - **Mode** (most frequent value)

---

## 2. Mammal Sleep Data Example

- We'll use the `msleep` dataset, which contains sleep data for various mammals.
- The main variable of interest: `sleep_total` (hours of sleep per day).

---

## 3. Visualizing Data: Histograms

- **Histograms** divide data into bins and count how many values fall into each bin.
- The height of each bar shows the number of data points in that bin.
- Useful for quickly understanding the distribution of data.

### Example: Histogram of Values

```python
# Import matplotlib for plotting
import matplotlib.pyplot as plt

# Assume 'data' is a DataFrame with a 'values' column
data['values'].hist()
plt.show()
```
![image.png](attachment:image.png)

#### Output

- A histogram plot appears, showing the distribution of the 'values' column.

#### Line-by-line Explanation

- `import matplotlib.pyplot as plt`  
    *Imports the plotting library for visualization.*
- `data['values'].hist()`  
    *Creates a histogram of the 'values' column in the DataFrame.*
- `plt.show()`  
    *Displays the histogram plot.*

#### Significance

- **Histograms** help you see the shape of the data—where values are concentrated, and if there are any outliers or skew.

---

## 4. Measures of Center: Mean, Median, Mode

### 4.1 Mean (Average)

- **Definition**: Add up all data points and divide by the number of data points.
- **Python Example:**

```python
import numpy as np

# Mean of sleep_total
mean_sleep = np.mean(msleep['sleep_total'])
print("Mean:", mean_sleep)
```

#### Output

```plaintext
Mean: 10.43
```

#### Explanation

- `np.mean(msleep['sleep_total'])`: Calculates the average of all values in the `sleep_total` column.
- Result: On average, mammals in the dataset sleep about 10.43 hours per day.

---

### 4.2 Median

- **Definition**: The middle value when all data points are sorted.
- **Python Example:**

```python
median_sleep = np.median(msleep['sleep_total'])
print("Median:", median_sleep)
```

#### Output

```plaintext
Median: 10.1
```

#### Explanation

- `np.median(msleep['sleep_total'])`: Finds the middle value of `sleep_total`.
- Result: Half of mammals sleep less than 10.1 hours, half sleep more.
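
To make the two definitions concrete, here is a small sketch that computes the mean and the median by hand and compares them with NumPy. The four sleep values are hypothetical (they are not taken from `msleep`), and an even count is used on purpose to show that the median is then the average of the two middle values.

```python
import numpy as np

# Hypothetical sleep totals in hours (not taken from msleep)
hours = [12.0, 7.0, 18.5, 10.5]

# Mean: sum of the values divided by how many there are
mean_by_hand = sum(hours) / len(hours)

# Median: sort, then take the middle value
# (with an even count, average the two middle values)
ordered = sorted(hours)
n = len(ordered)
mid = n // 2
if n % 2 == 1:
    median_by_hand = ordered[mid]
else:
    median_by_hand = (ordered[mid - 1] + ordered[mid]) / 2

print(mean_by_hand, np.mean(hours))      # 12.0 12.0
print(median_by_hand, np.median(hours))  # 11.25 11.25
```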

---

### 4.3 Mode

- **Definition**: The most frequently occurring value.
- **Usage**: Often used for categorical data, but can be used for numbers as well.

```python
from statistics import mode

# Mode of sleep_total
mode_sleep = mode(msleep['sleep_total'])
print("Mode (sleep_total):", mode_sleep)

# Mode of vore (diet type)
mode_vore = mode(msleep['vore'])
print("Mode (vore):", mode_vore)
```

#### Output

```plaintext
Mode (sleep_total): 12.5
Mode (vore): herbi
```

#### Explanation

- `mode(msleep['sleep_total'])`: Finds the most frequent sleep value (12.5 hours).
- `mode(msleep['vore'])`: Finds the most frequent diet type ("herbi" for herbivore).
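
One detail worth knowing is what happens when several values are tied for most frequent. The sketch below uses a hypothetical diet column with a tie; it assumes Python 3.8 or later, where `statistics.mode` returns the first tied value it encounters and `statistics.multimode` returns all of them.

```python
from statistics import mode, multimode

import pandas as pd

# Hypothetical diet labels with a tie between 'herbi' and 'carni'
vore = pd.Series(["herbi", "carni", "herbi", "omni", "carni"])

print(mode(vore))            # herbi  (first of the tied values encountered)
print(multimode(vore))       # ['herbi', 'carni']  (all tied values, in order seen)
print(vore.mode().tolist())  # ['carni', 'herbi']  (pandas returns all modes, sorted)
```

When ties are possible, `multimode` or the pandas `.mode()` method makes them visible instead of silently picking one value.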

---

## 5. Adding an Outlier: Effect on Mean and Median

### Subset: Insectivores Only

#### Before Adding an Outlier

```python
# Select only insectivores
insecti_sleep = msleep[msleep['vore'] == 'insecti']['sleep_total']

print(insecti_sleep)
# String names avoid the deprecation warning newer pandas emits for NumPy callables
print(insecti_sleep.agg(['mean', 'median']))
```

#### Output

```plaintext
22    19.7
43    19.9
62    18.1
84     8.4
Name: sleep_total, dtype: float64

mean      16.53
median    18.90
Name: sleep_total, dtype: float64
```

#### Explanation

- **Subset**: Four insectivores with sleep_totals: 19.7, 19.9, 18.1, 8.4.
- **Mean**: 16.53  
- **Median**: 18.90

---

### After Adding a New Outlier (Mystery Insectivore Who Never Sleeps)

```python
import pandas as pd

# Add a new mystery insectivore with 0 hours of sleep.
# pd.concat is used because Series.append was removed in pandas 2.0.
insecti_sleep_with_outlier = pd.concat([insecti_sleep, pd.Series([0.0])], ignore_index=True)

print(insecti_sleep_with_outlier)
print(insecti_sleep_with_outlier.agg(['mean', 'median']))
```

#### Output

```plaintext
0    19.7
1    19.9
2    18.1
3     8.4
4     0.0
dtype: float64

mean      13.22
median    18.10
dtype: float64
```

#### Explanation

- Added a value of `0.0` (an outlier) to the list.
- **Mean** drops from 16.5 to 13.2 (a big change).
- **Median** drops from 18.9 to 18.1 (a small change).

#### Significance

- The **mean** is highly sensitive to outliers.
- The **median** is robust and barely changes.

---

## 6. Skewness and Which Measure to Use

### Skewness

- **Left-skewed**: Data piles up on the right, tail on the left.
    - Mean < Median
    
- **Right-skewed**: Data piles up on the left, tail on the right.
    - Mean > Median
    
![image.png](attachment:image.png)


### Histograms and Skewness

- Use histograms to visually assess skewness.
- In skewed distributions, the mean and median will differ.
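
The relationship between skew direction and the order of mean and median can be checked on a toy example. The values below are hypothetical; `Series.skew()` is used only as a sign check (positive for a right tail, negative for a left tail).

```python
import pandas as pd

# Hypothetical right-skewed values: most are small, one is very large
right_skewed = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

print(right_skewed.mean())      # 6.7  (pulled toward the long right tail)
print(right_skewed.median())    # 3.0
print(right_skewed.skew() > 0)  # True

# Mirroring the values moves the tail to the left
left_skewed = -right_skewed
print(left_skewed.mean() < left_skewed.median())  # True
```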

---

## 7. Which Measure of Center Should You Use?

- **For symmetric distributions**: Mean and median are close; mean is appropriate.
- **For skewed distributions or data with outliers**: Median is better, as it is less affected by extreme values.
- **For categorical data**: Use the mode.

---

**Summary:**
- Use mean for symmetric data without outliers.
- Use median for skewed or outlier-prone data.
- Use mode for categorical data.
- Always visualize your data before summarizing!

### Exercise

**Mean and median**

In this chapter, you'll be working with the `food_consumption` dataset from the 2018 Food Carbon Footprint Index by nu3. The `food_consumption` dataset contains the number of kilograms of food consumed per person per year in each country and food category (`consumption`), and its carbon footprint (`co2_emission`) measured in kilograms of carbon dioxide, or CO2.

In this exercise, you'll compute measures of center to compare food consumption in the US and Belgium using your pandas and numpy skills.

`pandas` is imported as `pd` for you and `food_consumption` is pre-loaded.

**Instructions**

- Import `numpy` with the alias `np`.
- Subset `food_consumption` to get the rows where the country is `'USA'`.
- Calculate the mean of food consumption in the `usa_consumption` DataFrame, which is already created for you.
- Calculate the median of food consumption in the `usa_consumption` DataFrame.

```python
# Import numpy with alias np
import numpy as np

# Subset country for USA: usa_consumption
usa_consumption = food_consumption[food_consumption['country']=='USA']

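# Calculate mean consumption in USA
# (recent pandas versions may raise a TypeError here because of the non-numeric
#  columns; selecting usa_consumption[['consumption', 'co2_emission']] first avoids that)
print(usa_consumption.agg([np.mean]))

# Calculate median consumption in USA
print(usa_consumption.agg([np.median]))

# <script.py> output:
#           Unnamed: 0  consumption  co2_emission
#     mean        61.0        44.65        156.26
#             Unnamed: 0  consumption  co2_emission
#     median        61.0        14.58         15.34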

```

### Exercise

**Mean vs. median**

In the video, you learned that the mean is the sum of all the data points divided by the total number of data points, and the median is the middle value of the dataset where 50% of the data is less than the median, and 50% of the data is greater than the median. In this exercise, you'll compare these two measures of center.

`pandas` is loaded as `pd`, `numpy` is loaded as `np`, and `food_consumption` is available.

**Instructions 1/4**

- Import `matplotlib.pyplot` with the alias `plt`.
- Subset `food_consumption` to get the rows where `food_category` is `'rice'`.
- Create a histogram of `co2_emission` in the `rice_consumption` DataFrame and show the plot.

```python

# Import matplotlib.pyplot with alias plt
import matplotlib.pyplot as plt

# Subset for food_category equals rice
rice_consumption = food_consumption[food_consumption['food_category']== 'rice']

# Histogram of co2_emission for rice and show plot
rice_consumption['co2_emission'].hist()
plt.show()

```
![image.png](attachment:image.png)

**Instructions 2/4**

Question: Take a look at the histogram you just created of different countries' CO2 emissions for rice. Which of the following terms best describes the shape of the data?

Possible answers:

1. No skew
2. Left-skewed
3. Right-skewed (selected answer)

**Instructions 3/4**

- Use `.agg()` to calculate the mean and median of `co2_emission` for rice.


```python
# Subset for food_category equals rice
rice_consumption = food_consumption[food_consumption['food_category'] == 'rice']

# Calculate mean and median of co2_emission with .agg()
print(rice_consumption.agg([np.mean, np.median]))

# <script.py> output:
#             Unnamed: 0  consumption  co2_emission
#     mean         718.5       29.375        37.592
#     median       718.5       11.875        15.200
```
**Instructions 4/4**

Question: Given the skew of this data, what measure of central tendency best summarizes the kilograms of CO2 emissions per person per year for rice?

Possible answers:

1. Mean
2. Median (selected answer)
3. Both mean and median

# Introduction to Statistics in Python: Measures of Spread

---

## 1. Measures of Spread

- **Measures of spread** describe how far apart or close together the data points in a dataset are.
- Just as we have measures of center (like mean, median), we have several measures to describe the variability or "spread" of data.

---

![image.png](attachment:image.png)

## 2. Variance

### What is Variance?

- **Variance** measures the average squared distance of each data point from the mean.
- A higher variance means the data points are more spread out.

![image-2.png](attachment:image-2.png)

### Calculating Variance
The variance can be calculated in one step with `np.var()`:
```python
import numpy as np

# Calculate sample variance for 'sleep_total'
np.var(msleep['sleep_total'], ddof=1)
```

**Output:**
```
19.805677
```
![image-3.png](attachment:image-3.png)

#### Explanation (Line by Line):

- `import numpy as np`  
  *Imports the NumPy library for numerical operations.*

- `np.var(msleep['sleep_total'], ddof=1)`  
  *Calculates the **variance** of the 'sleep_total' column in the `msleep` DataFrame.*
  - `np.var()` computes the variance.
  - `ddof=1` sets the "Delta Degrees of Freedom" to 1, so the sum of squared differences is divided by n - 1, giving the **sample variance**. With the default `ddof=0`, `np.var()` divides by n and returns the population variance.
  - The result is in squared units (e.g., hours²).

**Significance:**
- The output tells us the average squared deviation from the mean sleep total across all mammals in the dataset.
- Units are squared (e.g., hours²), which can make interpretation less intuitive.
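
The one-step call can be unpacked into the steps it performs. The sketch below does so on a small hypothetical array (not the `msleep` data) and checks the result against `np.var()` for both `ddof=1` and the default `ddof=0`.

```python
import numpy as np

# Hypothetical sleep totals in hours (not the msleep data)
hours = np.array([8.0, 10.0, 12.0, 14.0])

# 1. Distance of each point from the mean
dists = hours - np.mean(hours)

# 2. Square each distance and add them up
sum_sq_dists = np.sum(dists ** 2)

# 3. Divide by n - 1 for the sample variance, or by n for the population variance
sample_var = sum_sq_dists / (len(hours) - 1)
population_var = sum_sq_dists / len(hours)

print(sample_var, np.var(hours, ddof=1))  # both are 20/3, about 6.667
print(population_var, np.var(hours))      # both are 5.0
```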

---

## 3. Standard Deviation

### What is Standard Deviation?

- **Standard deviation** is the square root of the variance.
- It brings the units back to the original data units (e.g., hours).
- It's often easier to interpret than variance.

### Calculating Standard Deviation

```python
# Calculate sample standard deviation for 'sleep_total'
np.sqrt(np.var(msleep['sleep_total'], ddof=1))
# or, equivalently:
np.std(msleep['sleep_total'], ddof=1)
```

**Output:**
```
4.450357
```

#### Explanation:

- `np.sqrt(np.var(msleep['sleep_total'], ddof=1))`  
  *Takes the square root of the sample variance to get the standard deviation.*

- `np.std(msleep['sleep_total'], ddof=1)`  
  *Directly computes the sample standard deviation.*

**Significance:**
- The standard deviation is in the same units as the data (hours).
- It's a common way to describe the "typical" distance from the mean.

---

## 4. Mean Absolute Deviation (MAD)

### What is Mean Absolute Deviation?

- **Mean Absolute Deviation (MAD)** is the average of the absolute differences between each data point and the mean.
- Unlike standard deviation, MAD treats all deviations equally (does not square them).

### Calculating MAD

```python
dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])
np.mean(np.abs(dists))
```

**Output:**
```
3.566701
```

#### Explanation:

- `dists = msleep['sleep_total'] - np.mean(msleep['sleep_total'])`  
  *Calculates the difference between each value and the mean (the deviation).*

- `np.abs(dists)`  
  *Takes the absolute value of each deviation.*

- `np.mean(np.abs(dists))`  
  *Calculates the mean of these absolute deviations.*

**Significance:**
- **MAD** gives an easy-to-interpret measure of spread in the original units of the data.
- It doesn't penalize large deviations as strongly as standard deviation (since it doesn't square them).
- **Which to use?** Standard deviation is more common; MAD gives extreme values less weight.

---

## 5. Standard Deviation vs. Mean Absolute Deviation

- **Standard deviation**: Squares distances, so large deviations have more influence (heavier penalty).
- **MAD**: Penalizes all deviations equally.
- Neither is "better"; each has its uses. Standard deviation is more commonly used in statistics.
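
One way to see the difference in "penalty" is to look at how much of each total a single far-away point accounts for. The sketch below uses hypothetical values with one point well above the rest and compares its share of the absolute deviations with its share of the squared deviations.

```python
import numpy as np

# Hypothetical values with one point far from the rest
values = np.array([9.0, 10.0, 11.0, 10.0, 10.0, 30.0])
dists = values - values.mean()

# Share of the total that each point contributes under each measure
abs_share = np.abs(dists) / np.abs(dists).sum()
sq_share = dists ** 2 / (dists ** 2).sum()

print(round(abs_share[-1], 2))  # 0.5   share of absolute deviations from the 30.0 point
print(round(sq_share[-1], 2))   # 0.83  share of squared deviations from the 30.0 point

sd = np.std(values, ddof=1)
mad = np.mean(np.abs(dists))
print(round(sd, 2), round(mad, 2))  # 8.19 5.56
```

Squaring lets the extreme point dominate the total behind the standard deviation, whereas it accounts for only half of the total behind the MAD.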

---

## 6. Quantiles

### What are Quantiles?

- **Quantiles** (called percentiles when expressed out of 100) are cut points that split the sorted data into groups holding equal shares of the observations.
- The 0.5 quantile is the **median**.

### Calculating Median (0.5 Quantile)

```python
np.quantile(msleep['sleep_total'], 0.5)
```

**Output:**
```
10.1
```

#### Explanation:

- `np.quantile(msleep['sleep_total'], 0.5)`  
  *Finds the value in 'sleep_total' below which 50% of the data falls (the median).*

**Significance:**
- Half the mammals sleep less than 10.1 hours, and half sleep more.

---

### Calculating Quartiles

```python
np.quantile(msleep['sleep_total'], [0, 0.25, 0.5, 0.75, 1])
```

**Output:**
```
array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])
```

#### Explanation:

- `[0, 0.25, 0.5, 0.75, 1]` specifies the 0th (min), 25th, 50th (median), 75th, and 100th (max) percentiles.
- The output tells us:
  - **Min**: 1.9 hours
  - **1st Quartile (Q1)**: 7.85 hours
  - **Median (Q2)**: 10.1 hours
  - **3rd Quartile (Q3)**: 13.75 hours
  - **Max**: 19.9 hours

**Significance:**
- These values are used to describe data distribution and to construct boxplots.

---

## 7. Boxplots and Quartiles

- **Boxplots** visually represent the quartiles of the data.
  - **Bottom of the box**: 1st quartile (Q1)
  - **Top of the box**: 3rd quartile (Q3)
  - **Middle line**: Median (Q2)

### Creating a Boxplot

```python
import matplotlib.pyplot as plt

plt.boxplot(msleep['sleep_total'])
plt.show()
```
![image-4.png](attachment:image-4.png)

#### Explanation:

- `import matplotlib.pyplot as plt`  
  *Imports Matplotlib's plotting module.*

- `plt.boxplot(msleep['sleep_total'])`  
  *Creates a boxplot for the 'sleep_total' data.*

- `plt.show()`  
  *Displays the plot.*

**Significance:**
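- The height of the box represents the **interquartile range (IQR)**. With Matplotlib's defaults, the whiskers extend to the most extreme data points within 1.5 × IQR of the box, and points beyond them (fliers) are drawn individually as possible outliers.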

---

## 8. Quantiles with `np.linspace()`

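- **`np.linspace(start, stop, num)`** creates `num` evenly spaced numbers from `start` to `stop`, with both endpoints included.
- This makes it convenient for generating evenly spaced quantile points: splitting the data into k pieces needs k + 1 points.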

### Example: 20%, 40%, 60%, 80% Quantiles

```python
np.quantile(msleep['sleep_total'], [0, 0.2, 0.4, 0.6, 0.8, 1])
```

**Output:**
```
array([ 1.9 ,  6.24,  9.48, 11.14, 14.4 , 19.9 ])
```

### Example: Quartiles via `np.linspace()`

```python
np.quantile(msleep['sleep_total'], np.linspace(0, 1, 5))
```

**Output:**
```
array([ 1.9 ,  7.85, 10.1 , 13.75, 19.9 ])
```

#### Explanation:

- `np.linspace(0, 1, 5)` creates `[0, 0.25, 0.5, 0.75, 1]` for quartiles.
- Passing this to `np.quantile` returns the quartiles.
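
The same pattern extends to other splits. As a small sketch, the snippet below builds quintile and decile points with `np.linspace()`; the data are simply the integers 0 to 100 (an assumption made so the results are easy to verify by eye).

```python
import numpy as np

# Hypothetical data: the integers 0..100, so each quantile is easy to check
data = np.arange(101)

quintile_points = np.linspace(0, 1, 6)   # 6 points -> 5 pieces
decile_points = np.linspace(0, 1, 11)    # 11 points -> 10 pieces

print(np.quantile(data, quintile_points))
print(np.quantile(data, decile_points))
```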

---

## 9. Interquartile Range (IQR)

### What is IQR?

- **Interquartile Range (IQR)** = Q3 - Q1  
  (distance between the 75th and 25th percentiles)
- Represents the "middle 50%" of the data; height of the box in a boxplot.

### Calculating IQR

#### Using `np.quantile`:

```python
np.quantile(msleep['sleep_total'], 0.75) - np.quantile(msleep['sleep_total'], 0.25)
```

**Output:**
```
5.9
```

#### Using `scipy.stats.iqr`:

```python
from scipy.stats import iqr

iqr(msleep['sleep_total'])
```

**Output:**
```
5.9
```

#### Explanation:

- First method subtracts the 25th percentile from the 75th.
- Second method uses a built-in function from SciPy.

**Significance:**
- The IQR is a robust measure of spread, less sensitive to outliers than standard deviation.

---

## 10. Outliers

### What is an Outlier?

- An **outlier** is a data point that is substantially different from the rest.
- Rule of thumb:
  - **Lower threshold:** Q1 − 1.5 × IQR
  - **Upper threshold:** Q3 + 1.5 × IQR

### Criteria:

- Any data point < Q1 − 1.5 × IQR or > Q3 + 1.5 × IQR is considered an outlier.
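
The rule of thumb above can be wrapped in a small helper. This is only a sketch: the function name `iqr_outlier_mask`, its parameter `k`, and the weights are hypothetical, and the quartiles use NumPy's default linear interpolation.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Boolean mask marking points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.quantile(values, [0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Hypothetical weights in kg with one very large value
weights = np.array([2.0, 3.0, 3.5, 4.0, 4.5, 5.0, 60.0])

# Q1 = 3.25, Q3 = 4.75, IQR = 1.5, so the thresholds are 1.0 and 7.0
print(weights[iqr_outlier_mask(weights)])  # [60.]
```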

---

## 11. Finding Outliers

### Example: Find Outliers in Body Weight

```python
import numpy as np
from scipy.stats import iqr

iqr_value = iqr(msleep['bodywt'])
lower_threshold = np.quantile(msleep['bodywt'], 0.25) - 1.5 * iqr_value
upper_threshold = np.quantile(msleep['bodywt'], 0.75) + 1.5 * iqr_value

msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]
```
![image-5.png](attachment:image-5.png)


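**Output:**  
A DataFrame of the mammals flagged as outliers (e.g., Cow, Asian elephant, Horse). All flagged rows are on the high side: with Q1 = 0.174 and Q3 = 41.75 (see the `.describe()` output in section 12), the lower threshold is negative, so no body weight can fall below it.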

#### Explanation:

- `iqr_value = iqr(msleep['bodywt'])`  
  *Calculates the IQR for body weight.*

- `lower_threshold = ...`  
  *Calculates the lower bound for outliers.*

- `upper_threshold = ...`  
  *Calculates the upper bound for outliers.*

- `msleep[(msleep['bodywt'] < lower_threshold) | (msleep['bodywt'] > upper_threshold)]`  
  *Selects rows in the DataFrame where body weight is an outlier.*

**Significance:**
- Identifying outliers is important for data cleaning and understanding data distribution.

---

## 12. Summary Statistics All in One

- The `.describe()` method in pandas gives **many summary statistics at once**.

### Example:

```python
msleep['bodywt'].describe()
```

**Output:**
```
count      83.000000
mean      166.136349
std       786.839732
min         0.005000
25%         0.174000
50%         1.670000
75%        41.750000
max      6654.000000
Name: bodywt, dtype: float64
```

#### Explanation:

- `.describe()` gives:
  - **count**: Number of non-null entries
  - **mean**: Mean of the data
  - **std**: Standard deviation
  - **min**: Minimum value
  - **25%**: First quartile (Q1)
  - **50%**: Median (Q2)
  - **75%**: Third quartile (Q3)
  - **max**: Maximum value

**Significance:**
- Quick overview of the data's central tendency, spread, and range.
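
The statistics `.describe()` reports depend on the column's data type, which ties back to the numeric versus categorical distinction. The sketch below uses a hypothetical four-row table (the values are made up; only the column names echo `msleep`).

```python
import pandas as pd

# Hypothetical mini-table mixing a numeric and a categorical column
df = pd.DataFrame({
    "bodywt": [0.5, 2.0, 35.0, 600.0],
    "vore": ["herbi", "carni", "herbi", "omni"],
})

numeric_summary = df["bodywt"].describe()    # count, mean, std, min, quartiles, max
categorical_summary = df["vore"].describe()  # count, unique, top, freq

print(numeric_summary)
print(categorical_summary)
```

For the text column, `top` is the mode and `freq` is how often it occurs, which is the categorical counterpart of a measure of center.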

---

## 13. Practice

- Now, practice measuring spread and finding outliers in your own datasets!

---

# Summary Table

| Measure                 | Function/Formula                                 | Purpose                                  |
|-------------------------|--------------------------------------------------|------------------------------------------|
| Variance                | `np.var(data, ddof=1)`                           | Avg. squared distance from mean          |
| Standard Deviation      | `np.std(data, ddof=1)` or `np.sqrt(variance)`    | Typical distance from mean (same units)  |
| Mean Absolute Deviation | `np.mean(np.abs(data - np.mean(data)))`          | Avg. absolute distance from mean         |
| Quantiles               | `np.quantile(data, q)`                           | Value below which a fraction q of the data falls |
| IQR                     | `iqr(data)` or Q3 - Q1                          | Range of middle 50% of data              |
| Outlier Thresholds      | Q1 - 1.5*IQR, Q3 + 1.5*IQR                      | Rule to identify outliers                |
| Summary Statistics      | `data.describe()`                                | All-in-one overview                      |

---

**Key Takeaways:**
- Use measures of spread alongside measures of center to fully understand your data.
- Each spread measure has its own strengths and best use-cases.
- Outlier detection helps improve data quality and insights.
- Pandas and NumPy make all these calculations easy in Python!

---


### Exercise

**Variance and standard deviation**

Variance and standard deviation are two of the most common ways to measure the spread of a variable, and you'll practice calculating these in this exercise. Spread is important since it can help inform expectations. For example, if a salesperson sells a mean of 20 products a day, but has a standard deviation of 10 products, there will probably be days where they sell 40 products, but also days where they only sell one or two. Information like this is important, especially when making predictions.

`pandas` has been imported as `pd`, `numpy` as `np`, and `matplotlib.pyplot` as `plt`; the `food_consumption` DataFrame is also available.

**Instructions**

- Calculate the variance and standard deviation of `co2_emission` for each `food_category` with the `.groupby()` and `.agg()` methods; compare the values of variance and standard deviation.
- Create a histogram of `co2_emission` for the `beef` food category and show the plot.
- Create a histogram of `co2_emission` for the `eggs` food category and show the plot.

```python
# Print variance and sd of co2_emission for each food_category
# (string names give the pandas sample variance and standard deviation, ddof=1)
print(food_consumption.groupby('food_category')['co2_emission'].agg(['var', 'std']))

# Create histogram of co2_emission for food_category 'beef'
food_consumption[food_consumption['food_category']=='beef']['co2_emission'].hist()
plt.show()

# Create histogram of co2_emission for food_category 'eggs'
plt.figure()
food_consumption[food_consumption['food_category']=='eggs']['co2_emission'].hist()
plt.show()

```

![image.png](attachment:image.png)

### Exercise

**Quartiles, quantiles, and quintiles**

Quantiles are a great way of summarizing numerical data since they can be used to measure center and spread, as well as to get a sense of where a data point stands in relation to the rest of the data set. For example, you might want to give a discount to the 10% most active users on a website.

In this exercise, you'll calculate quartiles, quintiles, and deciles, which split up a dataset into 4, 5, and 10 pieces, respectively.

Both `pandas` as `pd` and `numpy` as `np` are loaded and `food_consumption` is available.

**Instructions 1/3**

- Calculate the quartiles of the `co2_emission` column of `food_consumption`.

```python
# Calculate the quartiles of co2_emission
print(np.quantile(food_consumption['co2_emission'], [0, 0.25, 0.5, 0.75, 1]))


# <script.py> output:
#     [   0.        5.21     16.53     62.5975 1712.    ]
```

**Instructions 2/3**

- Calculate the six quantiles that split up the data into 5 pieces (quintiles) of the `co2_emission` column of `food_consumption`.



```python
# Calculate the quintiles of co2_emission
print(np.quantile(food_consumption['co2_emission'], [0, 0.2, 0.4, 0.6, 0.8, 1]))

# <script.py> output:
#     [   0.       3.54    11.026   25.59    99.978 1712.   ]

```
**Instructions 3/3**

- Calculate the eleven quantiles of `co2_emission` that split up the data into ten pieces (deciles).

```python
# Calculate the deciles of co2_emission
print(np.quantile(food_consumption['co2_emission'], [0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1]))

# <script.py> output:
#     [0.00000e+00 6.68000e-01 3.54000e+00 7.04000e+00 1.10260e+01 1.65300e+01
#      2.55900e+01 4.42710e+01 9.99780e+01 2.03629e+02 1.71200e+03]
```


### Exercise

![image.png](attachment:image.png)



```python
# Calculate total co2_emission per country: emissions_by_country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

print(emissions_by_country)

# <script.py> output:
#     country
#     Albania      1777.85
#     Algeria       707.88
#     Angola        412.99
#     Argentina    2172.40
#     Armenia      1109.93
#                   ...
#     Uruguay      1634.91
#     Venezuela    1104.10
#     Vietnam       641.51
#     Zambia        225.30
#     Zimbabwe      350.33
#     Name: co2_emission, Length: 130, dtype: float64
```


**Instructions 2/4**

- Compute the first and third quartiles of `emissions_by_country` and store these as `q1` and `q3`.
- Calculate the interquartile range of `emissions_by_country` and store it as `iqr`.

```python
# Calculate total co2_emission per country: emissions_by_country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

# Compute the first and third quartiles and IQR of emissions_by_country
q1 = np.quantile(emissions_by_country, 0.25)
q3 = np.quantile(emissions_by_country, 0.75)
iqr = q3 - q1
```


**Instructions 3/4**

- Calculate the lower and upper cutoffs for outliers of `emissions_by_country`, and store these as `lower` and `upper`.


```python
# Calculate total co2_emission per country: emissions_by_country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

# Compute the first and third quartiles and IQR of emissions_by_country
q1 = np.quantile(emissions_by_country, 0.25)
q3 = np.quantile(emissions_by_country, 0.75)
iqr = q3 - q1

# Calculate the lower and upper cutoffs for outliers (reusing q1 and q3)
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
```


**Instructions 4/4**

- Subset `emissions_by_country` to get countries with a total emission greater than the upper cutoff or a total emission less than the lower cutoff.

```python
# Calculate total co2_emission per country: emissions_by_country
emissions_by_country = food_consumption.groupby('country')['co2_emission'].sum()

# Compute the first and third quartiles and IQR of emissions_by_country
q1 = np.quantile(emissions_by_country, 0.25)
q3 = np.quantile(emissions_by_country, 0.75)
iqr = q3 - q1

# Calculate the lower and upper cutoffs for outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Subset emissions_by_country to find outliers
outliers = emissions_by_country[(emissions_by_country < lower) | (emissions_by_country > upper)]
print(outliers)

# <script.py> output:
#     country
#     Argentina    2172.4
#     Name: co2_emission, dtype: float64

```



