<a href="https://colab.research.google.com/github/chebil/stat/blob/main/part1/ch02_assignment.ipynb" target="_blank" rel="noopener noreferrer"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Chapter 2 Assignment: Analyzing Relationships in Data

## Overview

In this assignment, you will apply the concepts learned in Chapter 2 to analyze relationships between variables. You will:

1. **Load and explore** a multi-variable dataset
2. **Create 2D visualizations** (scatter plots, heatmaps)
3. **Calculate and interpret correlations**
4. **Make predictions** using linear relationships
5. **Identify correlation pitfalls** and draw conclusions

## Dataset: Auto MPG Dataset

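We will use the **Auto MPG** dataset, which describes cars by the following attributes (column names in parentheses):
- Miles per gallon (`mpg`)
- Number of cylinders (`cylinders`)
- Engine displacement (`displacement`)
- Horsepower (`horsepower`)
- Weight in lbs (`weight`)
- Acceleration (`acceleration`)
- Model year (`model_year`)
- Origin (`origin`)
- Car name (`name`)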

This dataset originates from the UCI Machine Learning Repository. The notebook loads a CSV copy hosted in the seaborn-data repository on GitHub.

**Source**: https://archive.ics.uci.edu/ml/datasets/auto+mpg

---

## Instructions

- Complete all the tasks marked with **TODO**
- Write your code in the provided cells
- Answer the questions in markdown cells
- Make sure your visualizations have proper labels and titles

---

## Part 1: Loading and Exploring the Data (10 points)

First, let's import the necessary libraries and load the dataset.

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

In [None]:
# Load the Auto MPG dataset from a public URL
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
df = pd.read_csv(url)

# Display the first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

### Task 1.1: Explore the Dataset (5 points)

**TODO**: Answer the following questions about the dataset:
1. How many rows and columns does the dataset have?
2. What are the numerical columns?
3. Are there any missing values?
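
As a minimal sketch of the inspection calls these questions rely on, the snippet below runs them on a small made-up DataFrame. The `toy` frame, its column names, and its values are invented for illustration and are not the Auto MPG data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the assignment dataset
toy = pd.DataFrame({
    "speed": [10.0, 12.5, np.nan, 15.0],
    "mass": [1.2, 1.5, 1.1, 1.9],
    "label": ["a", "b", "a", "c"],
})

n_rows, n_cols = toy.shape  # (number of rows, number of columns)
numeric_cols = toy.select_dtypes(include=[np.number]).columns.tolist()
missing_per_column = toy.isnull().sum()  # count of missing values per column

print(n_rows, n_cols)
print(numeric_cols)
print(missing_per_column)
```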

In [None]:
# TODO: Find the shape of the dataset
# Hint: Use df.shape


In [None]:
# TODO: Display the data types and identify numerical columns
# Hint: Use df.dtypes or df.select_dtypes(include=[np.number]).columns


In [None]:
# TODO: Check for missing values (you will handle them in Task 1.2)
# Hint: Use df.isnull().sum()


### Task 1.2: Clean the Data (5 points)

**TODO**: Create a clean dataset by:
1. Dropping rows with missing values
2. Selecting only numerical columns for correlation analysis
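
The two cleaning steps can be chained. The sketch below shows the pattern on a made-up frame (the `toy` data is hypothetical, not the assignment dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one text column
toy = pd.DataFrame({
    "speed": [10.0, 12.5, np.nan, 15.0],
    "mass": [1.2, 1.5, 1.1, 1.9],
    "label": ["a", "b", "a", "c"],
})

# dropna() returns a new frame without rows that contain any missing value;
# select_dtypes() then keeps only the numeric columns
toy_clean = toy.dropna().select_dtypes(include=[np.number])

print(toy_clean.shape)
print(toy_clean.columns.tolist())
```

Note that the order matters: dropping rows first also removes rows whose only missing value is in a non-numeric column, whereas selecting the numeric columns first would keep them.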

In [None]:
# TODO: Create a clean dataset
# Step 1: Drop rows with missing values
# Step 2: Select numerical columns: mpg, cylinders, displacement, horsepower, weight, acceleration, model_year

df_clean = None  # Replace with your code

# Verify the result
# print(f"Clean dataset shape: {df_clean.shape}")
# print(f"Columns: {df_clean.columns.tolist()}")

---

## Part 2: Scatter Plots and Visual Relationships (25 points)

Before calculating correlations, it's essential to visualize the relationships between variables.

### Task 2.1: Single Scatter Plot (10 points)

**TODO**: Create a scatter plot showing the relationship between **weight** (x-axis) and **mpg** (y-axis).

Requirements:
- Add a title: "Car Weight vs Fuel Efficiency"
- Label axes appropriately with units
- Add a trend line (optional but encouraged)
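
For the optional trend line, one possible approach is a degree-1 fit with `np.polyfit`. The sketch below uses synthetic data (the `x` and `y` arrays are generated for illustration) and only computes the line's endpoints; drawing them is left to your plot.

```python
import numpy as np

# Synthetic data: a noisy line with slope -0.5 and intercept 3.0
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 3.0 - 0.5 * x + rng.normal(scale=0.2, size=x.size)

# For deg=1, polyfit returns [slope, intercept] (highest power first)
slope, intercept = np.polyfit(x, y, deg=1)

# Two points are enough to draw a straight line
x_line = np.array([x.min(), x.max()])
y_line = slope * x_line + intercept

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
# In a plot: call plt.plot(x_line, y_line) after plt.scatter(x, y)
```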

In [None]:
# TODO: Create a scatter plot of weight vs mpg
# Step 1: Create figure with plt.figure()
# Step 2: Create scatter plot with plt.scatter()
# Step 3: Add labels and title
# Step 4 (optional): Add trend line using np.polyfit()

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()

### Task 2.2: Scatter Plot with Categories (10 points)

**TODO**: Create a scatter plot of **horsepower** vs **mpg**, colored by **origin** (USA, Europe, Japan).

Requirements:
- Different colors for each origin
- Add a legend
- Add appropriate title and labels

In [None]:
# TODO: Create a scatter plot colored by origin
# Hint: You'll need to use the original df (with 'origin' column)
# Use different colors for each origin group

plt.figure(figsize=(10, 6))

# Your code here
# You can use: for origin in df['origin'].unique(): plot each group
# Or use seaborn: sns.scatterplot(data=df, x='horsepower', y='mpg', hue='origin')


plt.tight_layout()
plt.show()

### Task 2.3: Pair Plot (5 points)

**TODO**: Create a pair plot (scatter plot matrix) for the variables: `mpg`, `horsepower`, `weight`, and `acceleration`.

In [None]:
# TODO: Create a pair plot
# Hint: Use seaborn's pairplot function
# sns.pairplot(df_clean[['mpg', 'horsepower', 'weight', 'acceleration']])

# Your code here


---

## Part 3: Correlation Analysis (30 points)

Now let's quantify the relationships we observed visually.

### Task 3.1: Calculate Single Correlation (10 points)

**TODO**: Calculate the Pearson correlation coefficient between **weight** and **mpg**.

Use both of the following methods and confirm that they agree:
1. NumPy's `np.corrcoef()`
2. SciPy's `stats.pearsonr()` (which also gives you the p-value)
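
As a sketch of how the two calls differ in what they return, the snippet below applies both to synthetic data (the arrays are generated for illustration). `np.corrcoef` returns a 2x2 matrix, so the coefficient is an off-diagonal entry, while `stats.pearsonr` returns the coefficient together with a p-value.

```python
import numpy as np
from scipy import stats

# Synthetic, positively related data
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.8 * x + rng.normal(scale=0.5, size=200)

r_numpy = np.corrcoef(x, y)[0, 1]        # off-diagonal entry of the 2x2 matrix
r_scipy, p_value = stats.pearsonr(x, y)  # coefficient and two-sided p-value

print(f"numpy: {r_numpy:.4f}")
print(f"scipy: {r_scipy:.4f} (p = {p_value:.2e})")
```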

In [None]:
# TODO: Calculate correlation between weight and mpg
# Method 1: np.corrcoef()
# Method 2: stats.pearsonr()

# Your code here


# Print results with interpretation
# print(f"Correlation (numpy): {r_numpy:.4f}")
# print(f"Correlation (scipy): {r_scipy:.4f}")
# print(f"P-value: {p_value:.2e}")

### Task 3.2: Correlation Matrix (10 points)

**TODO**: Create a correlation matrix for all numerical variables and visualize it as a heatmap.
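
A correlation matrix is symmetric with ones on the diagonal, so the informative entries are the off-diagonal ones. The sketch below builds a matrix from synthetic columns (`a`, `b`, `c` are invented for illustration) and shows one possible way to pick out the strongest pair by absolute value.

```python
import numpy as np
import pandas as pd

# Synthetic columns: b is built from a, c is unrelated noise
rng = np.random.default_rng(2)
a = rng.normal(size=300)
toy = pd.DataFrame({
    "a": a,
    "b": -a + rng.normal(scale=0.3, size=300),
    "c": rng.normal(size=300),
})

corr = toy.corr()  # Pearson correlation by default

# Blank out the diagonal, then find the pair with the largest |r|
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
strongest_pair = off_diag.abs().stack().idxmax()

print(corr.round(3))
print(strongest_pair, round(corr.loc[strongest_pair], 3))
```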

In [None]:
# TODO: Calculate the correlation matrix
# Hint: Use df_clean.corr()

# corr_matrix = ...

# Print the correlation matrix
# print("Correlation Matrix:")
# print(corr_matrix.round(3))

In [None]:
# TODO: Create a heatmap of the correlation matrix
# Hint: Use sns.heatmap(corr_matrix, annot=True, cmap='RdBu_r', center=0)

plt.figure(figsize=(10, 8))

# Your code here


plt.tight_layout()
plt.show()

### Task 3.3: Interpret the Correlations (10 points)

**TODO**: Based on the correlation matrix, answer the following questions in the markdown cell below:

1. Which variable has the **strongest positive correlation** with mpg?
2. Which variable has the **strongest negative correlation** with mpg?
3. Which two variables (other than mpg) have the **strongest correlation** with each other (largest absolute value)?
4. Are there any surprising correlations? Explain.

**Your Answers:**

1. Strongest positive correlation with mpg: *Write your answer here*

2. Strongest negative correlation with mpg: *Write your answer here*

3. Strongest correlation between other variables: *Write your answer here*

4. Surprising correlations: *Write your answer here*

---

## Part 4: Prediction Using Correlation (20 points)

Now let's use correlation to make predictions.

### Task 4.1: Implement Prediction Function (10 points)

**TODO**: Implement a function that predicts y from x using the formula:

$$\hat{y} = \bar{y} + r \frac{\sigma_y}{\sigma_x}(x - \bar{x})$$

Where:
- $r$ is the correlation coefficient
- $\bar{x}, \bar{y}$ are the means
- $\sigma_x, \sigma_y$ are the standard deviations
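
It may help to see that this is the least-squares line written in terms of the correlation. Collecting terms gives the slope-intercept form:

$$\hat{y} = a + b\,x, \qquad b = r \frac{\sigma_y}{\sigma_x}, \qquad a = \bar{y} - b\,\bar{x}$$

Two checks follow directly from the formula. At $x = \bar{x}$ the prediction is $\hat{y} = \bar{y}$, so the line passes through the point of means. In standardized units,

$$\frac{\hat{y} - \bar{y}}{\sigma_y} = r \, \frac{x - \bar{x}}{\sigma_x}$$

so an $x$ value one standard deviation above its mean is predicted to lie only $r$ standard deviations from the mean of $y$. One practical note: `np.std` defaults to `ddof=0`, while the pandas `.std()` method defaults to `ddof=1`. The ratio $\sigma_y / \sigma_x$ is the same under either convention as long as both standard deviations use the same one.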

In [None]:
# TODO: Implement the prediction function
def predict_from_correlation(x_new, x_data, y_data):
    """
    Predict y from x using correlation.
    
    Parameters:
    - x_new: the new x value(s) to predict for
    - x_data: array of x values (training data)
    - y_data: array of y values (training data)
    
    Returns:
    - predicted y value(s)
    """
    # Calculate correlation
    r = None  # TODO
    
    # Calculate means
    x_mean = None  # TODO
    y_mean = None  # TODO
    
    # Calculate standard deviations
    x_std = None  # TODO
    y_std = None  # TODO
    
    # Calculate prediction
    y_pred = None  # TODO: Use the formula
    
    return y_pred

# Test the function (uncomment after implementing)
# test_pred = predict_from_correlation(3000, df_clean['weight'], df_clean['mpg'])
# print(f"Predicted mpg for a 3000 lb car: {test_pred:.2f}")

### Task 4.2: Make Predictions (10 points)

**TODO**: Use your function to predict MPG for cars with the following weights:
- 2500 lbs
- 3000 lbs
- 3500 lbs
- 4000 lbs
- 4500 lbs

Then create a plot showing:
- The original scatter plot (weight vs mpg)
- The regression line, with your predicted points marked on it

In [None]:
# TODO: Make predictions for different weights
weights_to_predict = [2500, 3000, 3500, 4000, 4500]

# Your code here - make predictions


# Print predictions
# print("Weight (lbs) | Predicted MPG")
# print("-" * 30)
# for w, mpg in zip(weights_to_predict, predictions):
#     print(f"{w:12} | {mpg:.2f}")

In [None]:
# TODO: Create a plot with scatter points and regression line
plt.figure(figsize=(10, 6))

# Step 1: Plot original data as scatter
# Step 2: Plot regression line
# Step 3: Mark the predicted points

# Your code here


plt.tight_layout()
plt.show()

---

## Part 5: Correlation Pitfalls (15 points)

Understanding the limitations of correlation is crucial.

### Task 5.1: Non-linear Relationships (5 points)

**TODO**: The relationship between `acceleration` and `mpg` might not be perfectly linear. Create a scatter plot and calculate the Pearson correlation. Is the Pearson coefficient an appropriate summary of this relationship?
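
To see why the shape of the relationship matters, consider an extreme synthetic case (not the assignment data): `y` is completely determined by `x`, yet the Pearson coefficient is essentially zero because the relationship is symmetric rather than linear.

```python
import numpy as np

# Symmetric x values and a purely quadratic relationship
x = np.linspace(-3, 3, 61)
y = x ** 2  # y is fully determined by x, but not linearly

r = np.corrcoef(x, y)[0, 1]
print(f"Pearson r = {r:.3f}")
```

A coefficient near zero therefore rules out a linear trend, not a relationship in general, which is why the scatter plot should always be inspected alongside the number.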

In [None]:
# TODO: Analyze the acceleration vs mpg relationship
# Create scatter plot and calculate correlation

plt.figure(figsize=(10, 6))

# Your code here


plt.tight_layout()
plt.show()

# Calculate and print correlation
# r_accel_mpg = ...
# print(f"Correlation between acceleration and mpg: {r_accel_mpg:.3f}")

### Task 5.2: Correlation vs Causation (5 points)

**TODO**: Answer the following question in the markdown cell below:

We found a strong negative correlation between weight and mpg. Consider the following:
- Does this mean that making a car heavier **causes** it to have lower mpg?
- What are potential **confounding variables**?
- Can we conclude causation from this correlation alone?
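
As a hedged illustration of what a confounder can do, the simulation below uses purely synthetic variables (`z`, `x`, and `y` are invented and have no connection to the car data). Here `x` and `y` never influence each other, yet they are strongly correlated because both depend on the hidden variable `z`. Removing the linear effect of `z` from each makes the correlation vanish.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
z = rng.normal(size=n)            # hidden common cause
x = 2.0 * z + rng.normal(size=n)  # x depends on z only
y = 2.0 * z + rng.normal(size=n)  # y depends on z only, never on x

r_xy = np.corrcoef(x, y)[0, 1]

# Subtract the best linear fit on z from each variable,
# then correlate what is left over (the residuals)
x_res = x - np.polyval(np.polyfit(z, x, 1), z)
y_res = y - np.polyval(np.polyfit(z, y, 1), z)
r_after_removing_z = np.corrcoef(x_res, y_res)[0, 1]

print(f"corr(x, y)              = {r_xy:.3f}")
print(f"corr after removing z   = {r_after_removing_z:.3f}")
```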

**Your Answer:**

*Write your answer here discussing causation vs correlation, and potential confounding variables*


### Task 5.3: Pearson vs Spearman (5 points)

**TODO**: Calculate both Pearson and Spearman correlations for `horsepower` vs `mpg`. Which one is more appropriate and why?
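
The difference between the two coefficients is easiest to see on a relationship that is monotonic but curved. The sketch below uses a synthetic exponential curve (not the assignment data): Spearman works on ranks, so it reports a perfect monotonic association, while Pearson is pulled below 1 by the curvature.

```python
import numpy as np
from scipy import stats

# Strictly increasing but strongly curved relationship
x = np.linspace(0, 5, 40)
y = np.exp(x)

r_pearson, _ = stats.pearsonr(x, y)    # measures linear association
r_spearman, _ = stats.spearmanr(x, y)  # measures monotonic association via ranks

print(f"Pearson:  {r_pearson:.4f}")
print(f"Spearman: {r_spearman:.4f}")
```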

In [None]:
# TODO: Calculate Pearson and Spearman correlations
# Hint: Use stats.pearsonr() and stats.spearmanr()

# Your code here


# print(f"Pearson correlation:  {r_pearson:.4f}")
# print(f"Spearman correlation: {r_spearman:.4f}")

---

## Bonus Challenge (10 extra points)

For extra credit, complete the following challenge.

### Bonus: Multi-variable Analysis

**TODO**: Investigate whether the relationship between weight and mpg differs by origin (USA, Europe, Japan).

1. Calculate the correlation between weight and mpg **separately** for each origin
2. Create a visualization showing the different relationships
3. Explain your findings
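
One possible pattern for step 1 is to iterate over the groups produced by `groupby` and compute a correlation within each. The sketch below demonstrates it on synthetic data with two made-up groups (`g1`, `g2`) whose slopes differ by construction.

```python
import numpy as np
import pandas as pd

# Synthetic data: two hypothetical groups with different slopes
rng = np.random.default_rng(4)
frames = []
for group, slope in [("g1", -1.0), ("g2", -0.2)]:
    x = rng.uniform(0, 10, size=200)
    y = slope * x + rng.normal(scale=1.0, size=200)
    frames.append(pd.DataFrame({"group": group, "x": x, "y": y}))
toy = pd.concat(frames, ignore_index=True)

# Correlation within each group, plus the pooled value for comparison
r_by_group = {name: g["x"].corr(g["y"]) for name, g in toy.groupby("group")}
r_pooled = toy["x"].corr(toy["y"])

for name, r in r_by_group.items():
    print(f"{name}: r = {r:.3f}")
print(f"pooled: r = {r_pooled:.3f}")
```

Comparing the per-group values with the pooled one shows whether a single overall coefficient hides differences between groups.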

In [None]:
# BONUS: Analyze weight-mpg correlation by origin
# Calculate correlations for each origin group

# Your code here


In [None]:
# BONUS: Create visualization with separate regression lines for each origin

plt.figure(figsize=(12, 6))

# Your code here


plt.tight_layout()
plt.show()

### Bonus Question:

What do the different correlations by origin tell us? Is the weight-mpg relationship consistent across all origins, or are there differences?

**Your Answer:**

*Write your answer here*


---

## Submission Checklist

Before submitting, make sure you have:

- [ ] Completed all TODO tasks
- [ ] Run all cells from top to bottom without errors
- [ ] Added titles and labels to all visualizations
- [ ] Written answers to all analysis questions
- [ ] Saved your notebook

**Total Points: 100 (+ 10 bonus)**

| Section | Points |
|---------|--------|
| Part 1: Loading and Exploring | 10 |
| Part 2: Scatter Plots | 25 |
| Part 3: Correlation Analysis | 30 |
| Part 4: Prediction | 20 |
| Part 5: Correlation Pitfalls | 15 |
| Bonus | 10 |

---

**Good luck!** 🎉