# Introduction to Statistics in Python: Correlation

---

## 1. Introduction & Overview

- **Topic:** Correlation & Experimental Design
- **Goal:** Understand how to measure and interpret relationships between two numeric variables in Python.

---

## 2. Relationships Between Two Variables

- **Visualizing relationships:**  
  - Often, we use **scatterplots** to visualize the relationship between two numeric variables.
- **Terminology:**
  - **Independent (Explanatory) variable:** Plotted on the x-axis.
  - **Dependent (Response) variable:** Plotted on the y-axis.
- **Example:**  
  - Relationship between `sleep_total` (total sleep time) and `sleep_rem` (REM sleep time) in mammals.
 
![image.png](attachment:1e118901-2328-46f1-a3ff-c140eafd9b71.png)

---

## 3. Correlation Coefficient

- **Definition:**  
  - The **correlation coefficient** quantifies the relationship between two variables.
  - **Range:** `-1` to `1`
    - **Magnitude** (absolute value): Strength of the relationship.
    - **Sign** (positive/negative): Direction of the relationship.

### Strength (Magnitude) of Correlation

- **Near-perfect/Very strong:**  
  - `r ≈ 0.99`  
    - Data points cluster closely around a line.
    - Knowing `x` gives a very good estimate of `y`.
- **Strong:**  
  - `r ≈ 0.75`  
    - Points are slightly more spread.
![image.png](attachment:6039eca6-eb96-4456-a5ab-94d0a36b1391.png)
- **Moderate:**  
  - `r ≈ 0.56`  
    - Moderate clustering.
- **Weak:**  
  - `r ≈ 0.21`  
    - Little clustering.
![image.png](attachment:8cf1620a-2276-4c0b-b75a-c4b91a31f9d7.png)
- **No relationship:**  
  - `r ≈ 0.04` or close to `0`  
    - Points are scattered randomly; knowing `x` tells us nothing about `y`.
![image.png](attachment:c03b5762-1ad8-4975-b1e2-468398e43e42.png)

### Direction (Sign) of Correlation

- **Positive correlation (`r > 0`):**  
  - As `x` increases, `y` increases.
- **Negative correlation (`r < 0`):**  
  - As `x` increases, `y` decreases.
![image.png](attachment:8a7ef456-6c01-45a2-9e6c-3f87a7195c5c.png)
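
Magnitude and sign can also be illustrated on simulated data. The sketch below uses synthetic values only (not the mammal sleep data), and `simulated_r` is a hypothetical helper introduced here for illustration: it adds noise of a chosen size to a straight-line relationship and returns Pearson's r from `np.corrcoef`.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # fixed seed so the run is repeatable
x = rng.normal(size=500)              # synthetic explanatory variable

def simulated_r(noise_sd, slope=1.0):
    """Return Pearson's r between x and slope * x + noise (hypothetical helper)."""
    y = slope * x + rng.normal(scale=noise_sd, size=x.size)
    return np.corrcoef(x, y)[0, 1]

r_tight = simulated_r(noise_sd=0.1)                  # little noise: |r| close to 1
r_loose = simulated_r(noise_sd=2.0)                  # much noise: |r| well below 1
r_negative = simulated_r(noise_sd=0.5, slope=-1.0)   # downward slope: r < 0

print(f"tight: {r_tight:.2f}, loose: {r_loose:.2f}, negative: {r_negative:.2f}")
```

More noise around the line lowers the magnitude of r, while flipping the slope flips its sign.
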

---

## 4. Visualizing Relationships with Python

### Using Seaborn to Create a Scatterplot

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()
```
![image.png](attachment:c0402bdd-ba9f-4443-8dc6-6d556634ed81.png)

#### **Line-by-Line Explanation:**

1. **`import seaborn as sns`**
    - Imports the `seaborn` library, a powerful package for data visualization in Python.
    - Aliased as `sns` for convenience.

2. **`import matplotlib.pyplot as plt`**
    - Imports the `pyplot` module from `matplotlib`, used for displaying plots.

3. **`sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)`**
    - Creates a scatterplot.
    - `x="sleep_total"`: The x-axis will show the total sleep time.
    - `y="sleep_rem"`: The y-axis will show REM sleep time.
    - `data=msleep`: Uses the `msleep` DataFrame (assumed to be loaded).

4. **`plt.show()`**
    - Displays the plot window.

#### **Expected Output:**

A scatterplot where each point represents a mammal species, with total sleep on the x-axis and REM sleep on the y-axis. The pattern of points shows their relationship.

#### **Significance:**

- **Purpose:**  
  - Visualizes the relationship between two variables.
- **Interpretation:**  
  - If points are trending upward, there may be a positive correlation; downward suggests negative correlation; a random scatter suggests no correlation.

---

## 5. Adding a Trendline

### Using Seaborn to Add a Linear Trendline

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()
```
![image.png](attachment:c3506687-2d8e-490c-aac3-ef2af3527824.png)

#### **Line-by-Line Explanation:**

1. **`import seaborn as sns`**  
   - As above; ensures the seaborn package is available.

2. **`import matplotlib.pyplot as plt`**  
   - As above; used for displaying plots.

3. **`sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)`**
    - Creates a scatterplot **with a linear regression (trend) line**.
    - `ci=None`: Removes the confidence interval shading (keeps the plot clean).

4. **`plt.show()`**
    - Displays the plot.

#### **Expected Output:**

A scatterplot with a straight line (trendline) through the data, showing the estimated linear relationship between total sleep and REM sleep.

#### **Significance:**

- **Purpose:**  
  - Makes it easier to see the direction and strength of the relationship.
- **Interpretation:**  
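  - The direction of the trendline (upward or downward) shows the sign of the relationship, and how tightly the points cluster around the line shows its strength. The steepness of the line depends on the units of the variables, so a steep line does not by itself mean a strong correlation.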
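
One caution when reading a trendline: its slope depends on the units of the variables, while r does not. The following is a minimal sketch on synthetic data (the variable names and numbers are made up for illustration). It fits a line with `np.polyfit` and shows that converting `y` from hours to minutes changes the slope but leaves the correlation untouched.

```python
import numpy as np

rng = np.random.default_rng(seed=1)
total_hours = rng.uniform(2, 20, size=200)                       # synthetic x
rem_hours = 0.2 * total_hours + rng.normal(scale=1.0, size=200)  # synthetic y
rem_minutes = rem_hours * 60                                     # same data, new units

# np.polyfit returns coefficients from highest degree down, so [0] is the slope
slope_hours = np.polyfit(total_hours, rem_hours, deg=1)[0]
slope_minutes = np.polyfit(total_hours, rem_minutes, deg=1)[0]

r_hours = np.corrcoef(total_hours, rem_hours)[0, 1]
r_minutes = np.corrcoef(total_hours, rem_minutes)[0, 1]

print(f"slope: {slope_hours:.3f} (hours) vs {slope_minutes:.3f} (minutes)")
print(f"r:     {r_hours:.3f} (hours) vs {r_minutes:.3f} (minutes)")
```

The slope in minutes is 60 times the slope in hours, yet both versions give the same r.
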

---

## 6. Computing Correlation in Python
To calculate the correlation coefficient between two Series, we can use the pandas `.corr()` method.

### Calculating Pearson Correlation with Pandas

```python
msleep['sleep_total'].corr(msleep['sleep_rem'])
```

#### **Expected Output:**

```
0.751755
```

### Calculating in the Opposite Order

```python
msleep['sleep_rem'].corr(msleep['sleep_total'])
```

#### **Expected Output:**

```
0.751755
```

#### **Line-by-Line Explanation:**

1. **`msleep['sleep_total'].corr(msleep['sleep_rem'])`**
    - Computes the **Pearson correlation coefficient** between the two columns.
    - `.corr()` is a pandas Series method.
    - Returns a float between `-1` and `1`.
    - In this case, `0.751755` indicates a **fairly strong positive correlation**.

2. **`msleep['sleep_rem'].corr(msleep['sleep_total'])`**
    - Same calculation, but order reversed.
    - **Pearson's r is symmetric:** correlation between x and y is the same as y and x.

#### **Significance of the Output:**

- **Interpretation:**  
  - `0.751755` means as total sleep increases, REM sleep tends to increase as well (strong positive relationship).
- **Order doesn't matter:**  
  - Correlation is symmetric.
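
Since `msleep` is assumed to be loaded already, here is a self-contained sketch of the same calls on a tiny made-up DataFrame (the values in `toy_sleep` are invented for illustration and are not the real mammal sleep data). It also shows `DataFrame.corr()`, which returns every pairwise correlation in one table.

```python
import pandas as pd

# Made-up values for illustration only (not the real msleep data)
toy_sleep = pd.DataFrame({
    "sleep_total": [12.1, 17.0, 14.4, 14.9, 4.0, 10.1, 8.7, 3.0],
    "sleep_rem":   [2.1, 1.8, 2.4, 2.3, 0.7, 2.9, 1.4, 0.6],
})

r_xy = toy_sleep["sleep_total"].corr(toy_sleep["sleep_rem"])
r_yx = toy_sleep["sleep_rem"].corr(toy_sleep["sleep_total"])
print(r_xy, r_yx)   # the two values agree, whichever Series comes first

# DataFrame.corr() computes all pairwise correlations at once
corr_matrix = toy_sleep.corr()
print(corr_matrix)
```

The diagonal of the matrix is 1 (each column is perfectly correlated with itself), and the off-diagonal entries repeat the pairwise value.
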

---

## 7. Many Ways to Calculate Correlation

- **Pearson product-moment correlation (r):**  
  - The most common method.
  - Measures **linear** relationships between two continuous variables.
  - Formula (not required to memorize):

    $$
    r = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\sigma_x \sigma_y}
    $$

    - $\bar{x}$, $\bar{y}$ = Means of x and y  
    - $\sigma_x$, $\sigma_y$ = Sample standard deviations of x and y (computed with $n-1$ in the denominator)

- **Other methods:**
    - **Kendall's tau:** A rank-based measure suited to ordinal data and to monotonic (not necessarily linear) relationships.
    - **Spearman's rho:** Pearson's r computed on the ranks of the data; suited to monotonic relationships.

> **Note:** Pearson's r is most useful for linear, continuous data.
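
The formula and the rank-based alternative can be checked numerically. The sketch below uses made-up data in which `y` rises with `x` along a curve. It computes Pearson's r directly from the formula (with sample standard deviations, `ddof=1`), compares it with `.corr()`, and then computes Spearman's rho as Pearson's r on the ranks.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = x ** 3                      # monotonic but curved (made-up data)

# Pearson's r from the formula, using sample standard deviations (ddof=1)
n = len(x)
r_manual = ((x - x.mean()) * (y - y.mean())).sum() / (
    (n - 1) * x.std(ddof=1) * y.std(ddof=1)
)
r_pandas = x.corr(y)            # default method is Pearson

# Spearman's rho is Pearson's r computed on the ranks of the data
rho = x.rank().corr(y.rank())

print(f"manual r = {r_manual:.4f}, pandas r = {r_pandas:.4f}, rho = {rho:.4f}")
```

Pearson's r falls short of 1 because the relationship is curved, while the rank-based measure reaches 1 because `y` always increases with `x`. pandas also accepts `method="spearman"` or `method="kendall"` in `.corr()`, which may require SciPy to be installed.
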

---

- **Key takeaways:**
    - Use scatterplots and trendlines to visualize relationships.
    - Use `.corr()` to quantify strength and direction.
    - Remember the difference between positive, negative, and no correlation.
    - Know which correlation method is appropriate for your data.

---

# Summary Table

| Correlation Coefficient (r) | Strength      | Direction         | Interpretation                                 |
|-----------------------------|--------------|-------------------|-----------------------------------------------|
| 0.99                        | Very strong  | Positive          | x and y almost perfectly related; as x ↑, y ↑  |
| 0.75                        | Strong       | Positive          | x and y strongly related; as x ↑, y ↑         |
| 0.56                        | Moderate     | Positive          | Moderate association; as x ↑, y ↑             |
| 0.21                        | Weak         | Positive          | Weak association; as x ↑, y ↑                 |
| 0.04                        | None         | No relationship   | x gives no info about y                       |
| -0.75                       | Strong       | Negative          | x and y strongly related; as x ↑, y ↓         |

---

# Key Python Code Examples

### Scatterplot

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x="sleep_total", y="sleep_rem", data=msleep)
plt.show()
```

**Output:**  
Scatterplot showing the relationship between total sleep and REM sleep.

---

### Scatterplot with Trendline

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.lmplot(x="sleep_total", y="sleep_rem", data=msleep, ci=None)
plt.show()
```

**Output:**  
Scatterplot with a linear trendline showing the direction and strength of the relationship.

---

### Calculating Correlation

```python
msleep['sleep_total'].corr(msleep['sleep_rem'])
```

**Output:**  
```
0.751755
```

**Interpretation:**  
Strong positive linear relationship.

---


### Exercise
Relationships between variables
In this chapter, you'll be working with a dataset world_happiness containing results from the 2019 World Happiness Report. The report scores various countries based on how happy people in that country are. It also ranks each country on various societal aspects such as social support, freedom, corruption, and others. The dataset also includes the GDP per capita and life expectancy for each country.

In this exercise, you'll examine the relationship between a country's life expectancy (life_exp) and happiness score (happiness_score) both visually and quantitatively. seaborn as sns, matplotlib.pyplot as plt, and pandas as pd are loaded and world_happiness is available.

Instructions 1/4

Create a scatterplot of happiness_score vs. life_exp (without a trendline) using seaborn.
Show the plot.

```python
# Create a scatterplot of happiness_score vs. life_exp and show
sns.scatterplot(x='life_exp', y='happiness_score',
                data=world_happiness)

# Show plot
plt.show()
```
![image.png](attachment:e03293cc-4548-46dd-8df4-a23ed36d7391.png)

Create a scatterplot of happiness_score vs. life_exp with a linear trendline using seaborn, setting ci to None.
Show the plot.

```python
# Create scatterplot of happiness_score vs life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score',  data=world_happiness, ci=None)

# Show plot
plt.show()

```
![image.png](attachment:c6abfd76-a1d4-4d90-a312-aacb88a9eb45.png)

Question
Based on the scatterplot, which is most likely the correlation between life_exp and happiness_score?

Possible answers (correct answer: c):

- a) 0.3
- b) -0.3
- c) 0.8
- d) -0.8


Calculate the correlation between life_exp and happiness_score. Save this as cor.

```python
# Create scatterplot of happiness_score vs life_exp with trendline
sns.lmplot(x='life_exp', y='happiness_score', data=world_happiness, ci=None)

# Show plot
plt.show()

# Correlation between life_exp and happiness_score
cor = world_happiness['life_exp'].corr(world_happiness['happiness_score'])

print(cor)  # Output: 0.7802249053272062
```

# Transformations, Correlation, and Causation in Python

---

## 1. Correlation Caveats

- **Correlation** is a powerful tool for quantifying relationships, but it has important limitations and should not be used blindly.

---

## 2. Non-linear Relationships

- The **correlation coefficient** ($r$) is designed to measure **linear relationships** only.
- Example:
    - Suppose $x$ and $y$ have a clear quadratic (curved) relationship.
![image.png](attachment:6537837c-69d9-493a-9010-56194113be7c.png)
    - The calculated correlation could be quite low (e.g., $r = 0.18$), even though the variables are related.
![image.png](attachment:33f3ddab-0c35-4a83-ad99-fe61e59cb5d2.png)
    - **Always visualize your data** (e.g., with scatterplots) to check for non-linearity.
![image.png](attachment:f5c1cdee-9465-44a5-bde0-43897530b435.png)
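
A small synthetic case makes the point concrete. In this sketch (made-up data, not the figures above), `y` is fully determined by `x`, but the relationship is a symmetric curve, so Pearson's r comes out at essentially zero.

```python
import numpy as np

x = np.linspace(-3, 3, 101)   # values symmetric around 0 (synthetic)
y = x ** 2                    # y is completely determined by x

r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.3f}")         # essentially zero despite the exact relationship
```

A scatterplot of these points would show a clear parabola, which is why plotting comes before trusting the coefficient.
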

---

## 3. Example: Mammal Sleep Data

### Body Weight vs. Awake Time

- When plotting **body weight** (`bodywt`) vs **awake time** for mammals:
    - The relationship is **not linear**.
    - The correlation is weak ($r \approx 0.3$).
![image.png](attachment:8f43802d-ca30-4a08-8378-b39ba8d06f9c.png)

---

## 4. Data Distributions and Transformations

### Highly Skewed Variables

- The distribution of **body weight** is highly **right-skewed** (many small values, a few very large ones).
![image.png](attachment:f51c523f-b82b-4423-b8a5-19e398688a4b.png)

### Log Transformation

- **Purpose:** To make skewed data more symmetric and relationships more linear.
- **How:** Take the logarithm of each value.

![image.png](attachment:91fcfc95-0a77-4260-a0da-49d011be0f4c.png)

#### Example: Applying a Log Transformation

```python
import numpy as np

# Create a new column 'log_bodywt' with the log of body weight
msleep['log_bodywt'] = np.log(msleep['bodywt'])
```

**Output:**  
A new column `log_bodywt` is added to the `msleep` DataFrame.  
For example, if `bodywt = 10`, then `log_bodywt = np.log(10) ≈ 2.30`.

#### **Line-by-Line Explanation:**

1. `import numpy as np`
    - Imports the NumPy library, which provides the `log` function for element-wise logarithms.
2. `msleep['log_bodywt'] = np.log(msleep['bodywt'])`
    - Takes the natural log of each value in the `bodywt` column.
    - Stores the result in a new column called `log_bodywt`.
    - **Purpose:** Reduces skew, spreads out small values, compresses large values.

#### **Significance of the Output:**

- **Result:**  
  - After transformation, the relationship between log(body weight) and awake time is **more linear**.
  - The new correlation is **higher** ($r \approx 0.57$), reflecting a stronger linear relationship.

---

### Visualizing the Effect of Transformation

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Plot log_bodywt vs. awake
sns.scatterplot(x='log_bodywt', y='awake', data=msleep)
plt.show()
```

**Output:**  
A scatterplot where the relationship between log-transformed body weight and awake time appears more linear.

#### **Line-by-Line Explanation:**

1. `import seaborn as sns`
    - Imports the Seaborn library for easier plotting.
2. `import matplotlib.pyplot as plt`
    - Imports Matplotlib for plot display.
3. `sns.scatterplot(x='log_bodywt', y='awake', data=msleep)`
    - Makes a scatterplot of log body weight vs. awake time.
4. `plt.show()`
    - Displays the plot.

**Significance:**  
- Shows how transformation can linearize relationships, making statistical methods more appropriate.

---

### Calculating Correlation Before and After Transformation

```python
# Before transformation
msleep['bodywt'].corr(msleep['awake'])

# After transformation
msleep['log_bodywt'].corr(msleep['awake'])
```

**Output:**
```
# Before transformation
0.3

# After transformation
0.57
```

#### **Line-by-Line Explanation:**

1. `msleep['bodywt'].corr(msleep['awake'])`
    - Calculates Pearson's r for original body weight and awake time.
    - Expected result: 0.3 (weak correlation).

2. `msleep['log_bodywt'].corr(msleep['awake'])`
    - Calculates Pearson's r for log-transformed body weight and awake time.
    - Expected result: 0.57 (moderate correlation).

**Significance:**  
- Transformation increased the linearity and strength of the relationship as measured by correlation.
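
The same effect can be reproduced without `msleep`. The sketch below builds a synthetic dataset under stated assumptions: `bodywt` is drawn from a log-normal distribution (right-skewed and always positive), and `awake` is constructed to be linear in the log of `bodywt` plus noise. The numbers will not match the mammal data; only the pattern is of interest.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n = 300

# Synthetic, right-skewed "body weights": log-normal, so always positive
log_wt = rng.normal(loc=0.0, scale=2.0, size=n)
toy = pd.DataFrame({"bodywt": np.exp(log_wt)})

# Hypothetical response that is linear in log(bodywt), plus noise
toy["awake"] = 12 + 0.8 * log_wt + rng.normal(scale=1.5, size=n)

# np.log is only defined for positive values, so check before transforming
assert (toy["bodywt"] > 0).all()
toy["log_bodywt"] = np.log(toy["bodywt"])

r_raw = toy["bodywt"].corr(toy["awake"])
r_log = toy["log_bodywt"].corr(toy["awake"])
skew_raw = toy["bodywt"].skew()
skew_log = toy["log_bodywt"].skew()

print(f"r:    raw = {r_raw:.2f}, log = {r_log:.2f}")
print(f"skew: raw = {skew_raw:.2f}, log = {skew_log:.2f}")
```

After the transformation the skewness drops toward zero and the correlation rises, because the relationship was linear on the log scale all along.
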

---

## 5. Other Data Transformations

- **Common transformations:**
    - **Logarithm:** `np.log(x)`
    - **Square root:** `np.sqrt(x)`
    - **Reciprocal:** `1 / x`
- **Combinations:**  
    - Transform `x` and/or `y` individually or together.
    - Example: `log(x)` vs. `log(y)`, or `sqrt(x)` vs. `1/y`
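
Which transformation helps depends on the shape of the relationship. As a hedged illustration with made-up numbers, suppose travel time equals a fixed distance divided by speed. The relationship is reciprocal, so transforming `x` with `1 / x` makes it exactly linear.

```python
import numpy as np

distance_km = 120.0
speed_kmh = np.array([20.0, 30.0, 40.0, 60.0, 80.0, 120.0])   # made-up speeds
travel_hours = distance_km / speed_kmh                        # y = distance / x

r_raw = np.corrcoef(speed_kmh, travel_hours)[0, 1]
r_reciprocal = np.corrcoef(1 / speed_kmh, travel_hours)[0, 1]

print(f"speed vs. time:   r = {r_raw:.3f}")
print(f"1/speed vs. time: r = {r_reciprocal:.3f}")
```

The untransformed correlation is negative but not perfect, while the reciprocal transformation gives a correlation of 1 because `travel_hours` is an exact multiple of `1 / speed_kmh`.
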

---

## 6. Why Use Transformations?

- **Certain statistical methods require linear relationships:**
    - Pearson’s correlation coefficient
    - Linear regression
- **Transformations** help meet these assumptions, making results more valid.

---

## 7. Correlation ≠ Causation

- **Key Principle:**  
    - Just because $x$ and $y$ are correlated does **not** mean $x$ causes $y$.
- **Spurious correlation:**  
    - Example:  
      - **Margarine consumption** (US) and **Maine divorce rates** are highly correlated ($r=0.99$), but there is no causal relationship.
![image.png](attachment:10d41316-5e88-4259-9f76-32235d615f21.png)

---

## 8. Confounding Variables

- **Confounding:** A third variable creates the illusion of a relationship between $x$ and $y$.
- **Example:**
    - **Coffee drinking** and **lung cancer** are correlated.
    - **Smoking** is a confounder:
        - Smoking is linked to more coffee drinking.
        - Smoking causes lung cancer.
    - The apparent link between coffee and lung cancer is **spurious**—due to the hidden variable, not causation.

- **Another Example:**  
    - **Retail sales** and **holidays**:
        - Sales go up around holidays.
        - But special deals/promotions (the confounder) also boost sales.
![image.png](attachment:1b191c4a-9cec-41b0-bce5-920139e86d7e.png)
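
The coffee and smoking example can be mimicked with a simulation. This is a sketch under loud assumptions: all numbers are invented, `cancer_risk` is an abstract score rather than a medical quantity, and coffee is given no direct effect on risk by construction. Smoking raises both variables, so they correlate overall, but once the confounder is held fixed (looking within smokers and within non-smokers separately) the correlation nearly vanishes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=7)
n = 5000

# Simulated population: smoking drives both coffee intake and cancer risk;
# coffee has no direct effect on risk in this simulation.
smoker = rng.random(n) < 0.3
people = pd.DataFrame({
    "smoker": smoker,
    "coffee_cups": 1 + 2 * smoker + rng.normal(size=n),
    "cancer_risk": 1 + 3 * smoker + rng.normal(size=n),
})

r_overall = people["coffee_cups"].corr(people["cancer_risk"])

# Hold the confounder fixed: correlation within smokers and within non-smokers
r_by_group = {
    is_smoker: group["coffee_cups"].corr(group["cancer_risk"])
    for is_smoker, group in people.groupby("smoker")
}

print(f"overall r = {r_overall:.2f}")
print({key: round(value, 2) for key, value in r_by_group.items()})
```
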

---

## 9. Summary

- **Always visualize relationships** to check for linearity.
- **Use transformations** (log, sqrt, reciprocal) to linearize relationships and reduce skew.
- **Correlation measures only linear relationships.**
- **Correlation does not imply causation.**
- **Watch for confounders** that can create misleading associations.

---

# Key Python Code Examples

### Log Transformation

```python
import numpy as np
msleep['log_bodywt'] = np.log(msleep['bodywt'])
```

### Scatterplot After Transformation

```python
import seaborn as sns
import matplotlib.pyplot as plt

sns.scatterplot(x='log_bodywt', y='awake', data=msleep)
plt.show()
```

### Correlation Before and After

```python
msleep['bodywt'].corr(msleep['awake'])      # Output: 0.3
msleep['log_bodywt'].corr(msleep['awake'])  # Output: 0.57
```


### Exercise
What can't correlation measure?
While the correlation coefficient is a convenient way to quantify the strength of a relationship between two variables, it's far from perfect. In this exercise, you'll explore one of the caveats of the correlation coefficient by examining the relationship between a country's GDP per capita (gdp_per_cap) and life expectancy (life_exp).

pandas as pd, matplotlib.pyplot as plt, and seaborn as sns are imported, and world_happiness is loaded.

Instructions 1/3

Create a seaborn scatterplot (without a trendline) showing the relationship between gdp_per_cap (on the x-axis) and life_exp (on the y-axis).
Show the plot

```python
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness)

# Show plot
plt.show()
```
![image.png](attachment:ba7c5fde-53f4-4788-98af-65d89a71a485.png)

Calculate the correlation between gdp_per_cap and life_exp and store as cor.
```python
# Scatterplot of gdp_per_cap and life_exp
sns.scatterplot(x='gdp_per_cap', y='life_exp', data=world_happiness)

# Show plot
plt.show()
  
# Correlation between gdp_per_cap and life_exp
cor = world_happiness['gdp_per_cap'].corr(world_happiness['life_exp'])

print(cor)  # Output: 0.7019547642148012
```
Question
The correlation between GDP per capita and life expectancy is 0.7. Why is correlation not the best way to measure the relationship between these two variables?

Possible answers (correct answer: b):

- a) Correlation measures how one variable affects another.
- b) Correlation only measures linear relationships.
- c) Correlation cannot properly measure relationships between numeric variables.


### Exercise
Transforming variables
When variables have skewed distributions, they often require a transformation in order to form a linear relationship with another variable so that correlation can be computed. In this exercise, you'll perform a transformation yourself.

pandas as pd, numpy as np, matplotlib.pyplot as plt, and seaborn as sns are imported, and world_happiness is loaded.

Instructions 1/2

Create a scatterplot of happiness_score versus gdp_per_cap and calculate the correlation between them.

```python
# Scatterplot of happiness_score vs. gdp_per_cap
sns.scatterplot(x='gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)  # Output: 0.727973301222298
```
![image.png](attachment:55f35dfa-ee67-4399-b874-a015df45a70c.png)

2. Add a new column to world_happiness called log_gdp_per_cap that contains the log of gdp_per_cap.
Create a seaborn scatterplot of happiness_score versus log_gdp_per_cap.
Calculate the correlation between log_gdp_per_cap and happiness_score.


```python
# Create log_gdp_per_cap column
world_happiness['log_gdp_per_cap'] = np.log(world_happiness['gdp_per_cap'])

# Scatterplot of happiness_score vs. log_gdp_per_cap
sns.scatterplot(x='log_gdp_per_cap', y='happiness_score', data=world_happiness)
plt.show()

# Calculate correlation
cor = world_happiness['log_gdp_per_cap'].corr(world_happiness['happiness_score'])
print(cor)

```
![image.png](attachment:37d4d395-62a2-411d-830f-251eeffaabb9.png)

### Exercise
Does sugar improve happiness?
A new column has been added to world_happiness called grams_sugar_per_day, which contains the average amount of sugar eaten per person per day in each country. In this exercise, you'll examine the effect of a country's average sugar consumption on its happiness score.

pandas as pd, matplotlib.pyplot as plt, and seaborn as sns are imported, and world_happiness is loaded.

Instructions 1/2

Create a seaborn scatterplot showing the relationship between grams_sugar_per_day (on the x-axis) and happiness_score (on the y-axis).
Calculate the correlation between grams_sugar_per_day and happiness_score.

```python
# Scatterplot of grams_sugar_per_day and happiness_score
sns.scatterplot(x='grams_sugar_per_day', y='happiness_score', data=world_happiness)
plt.show()

# Correlation between grams_sugar_per_day and happiness_score
cor = world_happiness['grams_sugar_per_day'].corr(world_happiness['happiness_score'])
print(cor)

```
![image.png](attachment:ca713ca7-9c05-4d30-960e-d4f90b4349a0.png)

Question
Based on this data, which statement about sugar consumption and happiness scores is true?

Possible answers (correct answer: c):

- a) Increased sugar consumption leads to a higher happiness score.
- b) Lower sugar consumption results in a lower happiness score.
- c) Increased sugar consumption is associated with a higher happiness score.
- d) Sugar consumption is not related to happiness.

# Design of Experiments

---

## 1. Introduction

- **Purpose of Experiments:**  
  Data is often generated by studies designed to answer specific questions.  
  How the data is generated (the study design) determines how it should be analyzed and interpreted.

---

## 2. Key Vocabulary

- **Experiment Question:**  
  Typically framed as:  
  *“What is the effect of the treatment on the response?”*
- **Treatment:**  
  - The explanatory or independent variable.
- **Response:**  
  - The response or dependent variable.
- **Example:**  
  - *What is the effect of an advertisement on the number of products purchased?*  
    - **Treatment:** Advertisement  
    - **Response:** Number of products purchased

---

## 3. Controlled Experiments

- **Definition:**  
  Participants are assigned by researchers to either a **treatment group** or a **control group**.
    - *Treatment group:* Receives the treatment (e.g., sees the advertisement)
    - *Control group:* Does not receive the treatment

- **Goal:**  
  Groups should be as similar as possible except for the treatment, so that any difference in response can be attributed to the treatment.

- **A/B Test Example:**  
  - Treatment group sees the ad.
  - Control group does not.

- **Confounding and Bias:**  
  - If groups differ in other ways (e.g., average age), this can **confound** results.
    - *Example:*  
      - Treatment group avg. age: 25  
      - Control group avg. age: 50  
      - If younger people buy more, age is a **potential confounder** and introduces bias.

---

## 4. The Gold Standard of Experiments

### Randomized Controlled Trial (RCT)

- **Random Assignment:**  
  Participants are assigned to treatment or control **randomly**, not based on any characteristics.
- **Purpose:**  
  Randomization helps ensure groups are comparable, reducing bias.
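
Why randomization makes groups comparable can be seen in a short simulation. The sketch below uses simulated ages (an assumption for illustration, echoing the age example above). It compares the average age gap between groups under random assignment with the gap under a deliberately non-random rule that gives the treatment to the youngest half.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
ages = rng.uniform(18, 70, size=1000)      # simulated participant ages

# Random assignment: shuffle indices, first half -> treatment, rest -> control
shuffled = rng.permutation(ages.size)
treatment_idx, control_idx = shuffled[:500], shuffled[500:]
gap_random = ages[treatment_idx].mean() - ages[control_idx].mean()

# Non-random assignment for contrast: the 500 youngest all get the treatment
by_age = np.argsort(ages)
gap_biased = ages[by_age[:500]].mean() - ages[by_age[500:]].mean()

print(f"age gap, random assignment: {gap_random:+.1f} years")
print(f"age gap, youngest-first:    {gap_biased:+.1f} years")
```

Under random assignment the groups end up with nearly the same average age, so age is unlikely to confound the comparison. The non-random rule produces a gap of decades.
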

### Placebo

- **Definition:**  
  A placebo resembles the treatment but has no effect.
- **Purpose:**  
  Participants do not know if they are in the treatment or control group, so any effect is due to the actual treatment, not the participant's expectations.
- **Example in Clinical Trials:**  
  The control group receives a sugar pill.

### Double-Blind Trial

- **Definition:**  
  Neither the participant **nor** the person administering the treatment knows who gets the real treatment or placebo.
- **Purpose:**  
  Prevents bias in both participant responses and researcher analysis.
- **Result:**  
  Fewer opportunities for bias → more reliable conclusions about causation.

---

## 5. Observational Studies

- **Definition:**  
  Participants **are not assigned randomly**; they self-select or are grouped by pre-existing characteristics.
- **When Used:**  
  - When random assignment is unethical or impossible.
    - *Example:* Cannot force someone to smoke or to have a disease.
    - Cannot assign people past behaviors.
- **Interpretation:**  
  Observational studies can establish **association**, **not causation**.
- **Confounding:**  
  - Because group assignment is not random, **confounders** may bias results.
    - *Example:* People who choose to smoke may differ in other ways from non-smokers.
- **Controlling for Confounders:**  
  Statistical methods can **control for confounders** to strengthen reliability of associations, but causality can rarely be proven.

---

## 6. Longitudinal vs. Cross-sectional Studies

### Longitudinal Study

- **What:**  
  Follows the same participants over time, measuring the effect of treatment on response at multiple points.
- **Advantage:**  
  Reduces confounding by generational or lifestyle changes.
- **Disadvantage:**  
  More expensive, takes longer.
- **Example:**  
  Measure the same people's heights as they age.

### Cross-sectional Study

- **What:**  
  Collects data from participants at a **single point in time**.
- **Advantage:**  
  Cheaper, faster, more convenient.
- **Disadvantage:**  
  More susceptible to confounding (e.g., by generation).
- **Example:**  
  Measure heights of people of all ages at one time; results confounded if newer generations are taller due to better nutrition.

---

## 7. Summary Table

| Study Design           | Random Assignment | Causality?        | Confounding Control | Examples                | Cost/Duration      |
|------------------------|-------------------|-------------------|---------------------|-------------------------|--------------------|
| Controlled Experiment  | Assigned by researchers (random in an RCT) | Yes, if well-designed | High                | Drug trials, A/B tests  | Medium to High     |
| Randomized Controlled Trial (RCT) | Yes      | Yes              | Very High           | Clinical trials         | High               |
| Double-Blind Trial     | Yes               | Yes              | Very High           | Placebo drug trials     | High               |
| Observational Study    | No                | No (association only) | Low to Medium      | Smoking studies, surveys| Low to Medium      |
| Longitudinal Study     | Varies            | Varies            | High                | Cohort studies          | High, Long-term    |
| Cross-sectional Study  | No                | No (association only) | Low                | Census, snapshots       | Low, Short-term    |

---

# Key Takeaways

- **Controlled experiments** with randomization and blinding are best for assessing causality.
- **Confounding** can bias both experiments and observational studies.
- **Observational studies** are necessary when controlled experiments are not feasible but can only show association.
- **Longitudinal studies** follow the same individuals over time; **cross-sectional studies** compare different individuals at one time point.
- **Fewer opportunities for bias → stronger conclusions.**

---

### Exercise
Study types
While controlled experiments are ideal, many situations and research questions are not conducive to a controlled experiment. In a controlled experiment, causation can likely be inferred if the control and test groups have similar characteristics and don't have any systematic difference between them. On the other hand, causation usually cannot be inferred from observational studies, so their results are often misinterpreted.

In this exercise, you'll practice distinguishing controlled experiments from observational studies.

Instructions

Determine if each study is a controlled experiment or observational study.

![image.png](attachment:bc8f7e85-ce8a-47a9-ae2e-4d1e0eb410ca.png)

## Longitudinal vs. cross-sectional studies
A company manufactures thermometers, and they want to study the relationship between a thermometer's age and its accuracy. To do this, they take a sample of 100 different thermometers of different ages and test how accurate they are. Is this data longitudinal or cross-sectional?

Possible answers (correct answer: b):

- a) Longitudinal
- b) Cross-sectional
- c) Both
- d) Neither

## Course Recap: Major Concepts by Chapter

### **Chapter 1: Descriptive Statistics**

- **What is statistics?**
    - The science of learning from data.
- **Measures of center:**
    - *Mean, median, mode* — ways to describe the "typical" value in a dataset.
- **Measures of spread:**
    - *Range, variance, standard deviation, interquartile range (IQR)* — ways to describe how spread out the data are.

---

### **Chapter 2: Probability and Distributions**

- **Measuring chance:**
    - Understanding probability and how it helps us reason about uncertainty.
- **Probability distributions:**
    - Mathematical functions describing the likelihood of different outcomes.
- **Binomial distribution:**
    - Models the probability of a certain number of successes in a fixed number of independent trials.

---

### **Chapter 3: More Distributions**

- **Normal distribution:**
    - The classic "bell curve"; many real-world phenomena are approximately normally distributed.
- **Central limit theorem:**
    - With a large enough sample size, the sampling distribution of the mean is approximately normal, regardless of the shape of the original distribution.
- **Poisson distribution:**
    - Models the probability of a given number of events occurring in a fixed interval of time or space.

---

### **Chapter 4: Relationships and Study Design**

- **Correlation:**
    - Quantifies the strength and direction of a linear relationship between two variables.
- **Controlled experiments:**
    - Participants are randomly assigned to groups; can establish causation if well designed.
- **Observational studies:**
    - No random assignment; can show association but not causation.

---

## Looking Ahead: Build on Your Skills

- **Next Steps:**  
  - There's much more to explore in statistics.
  - This course sets you up for **regression analysis** and linear modeling, which allow you to predict and explain outcomes using statistical models.
  - Recommended next step: *Introduction to Linear Modeling in Python*.

---

# Summary Table: Key Concepts by Chapter

| Chapter | Major Topics Covered                                  |
|---------|------------------------------------------------------|
|    1    | What is statistics, Measures of center & spread      |
|    2    | Probability, Probability distributions, Binomial     |
|    3    | Normal distribution, Central limit theorem, Poisson  |
|    4    | Correlation, Controlled experiments, Observational studies |

---