
# Chapter 2 Assignment: Analyzing Relationships in Data

## **SOLUTION KEY** 🔑

---

## Overview

In this assignment, you will apply the concepts learned in Chapter 2 to analyze relationships between variables.

---

## Part 1: Loading and Exploring the Data (10 points)

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

In [None]:
# Load the Auto MPG dataset from a public URL
url = "https://raw.githubusercontent.com/mwaskom/seaborn-data/master/mpg.csv"
df = pd.read_csv(url)

# Display the first 10 rows
print("First 10 rows of the dataset:")
df.head(10)

### Task 1.1: Explore the Dataset (5 points)

In [None]:
# SOLUTION: Find the shape of the dataset
print(f"Dataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

In [None]:
# SOLUTION: Display the data types and identify numerical columns
print("Data types:")
print(df.dtypes)
print("\n" + "="*50)
print("\nNumerical columns:")
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print(numerical_cols)

In [None]:
# SOLUTION: Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")

### Task 1.2: Clean the Data (5 points)

In [None]:
# SOLUTION: Create a clean dataset
# Step 1: Drop rows with missing values
df_dropped = df.dropna()

# Step 2: Select numerical columns for correlation analysis
numerical_columns = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year']
df_clean = df_dropped[numerical_columns].copy()

# Verify the result
print(f"Original dataset shape: {df.shape}")
print(f"Clean dataset shape: {df_clean.shape}")
print(f"Rows removed: {df.shape[0] - df_clean.shape[0]}")
print(f"\nColumns: {df_clean.columns.tolist()}")

---

## Part 2: Scatter Plots and Visual Relationships (25 points)

### Task 2.1: Single Scatter Plot (10 points)

In [None]:
# SOLUTION: Create a scatter plot of weight vs mpg
plt.figure(figsize=(10, 6))

# Scatter plot
plt.scatter(df_clean['weight'], df_clean['mpg'], alpha=0.6, edgecolors='black', linewidth=0.5)

# Add trend line
z = np.polyfit(df_clean['weight'], df_clean['mpg'], 1)
p = np.poly1d(z)
x_line = np.linspace(df_clean['weight'].min(), df_clean['weight'].max(), 100)
plt.plot(x_line, p(x_line), 'r-', linewidth=2, label=f'Trend line (y = {z[0]:.4f}x + {z[1]:.2f})')

# Labels and title
plt.xlabel('Weight (lbs)', fontsize=12)
plt.ylabel('Miles per Gallon (mpg)', fontsize=12)
plt.title('Car Weight vs Fuel Efficiency', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Task 2.2: Scatter Plot with Categories (10 points)

In [None]:
# SOLUTION: Create a scatter plot colored by origin
# Need to use original df with 'origin' column (after dropping NaN)
df_with_origin = df.dropna()

plt.figure(figsize=(10, 6))

# Define colors for each origin
colors = {'usa': 'blue', 'europe': 'green', 'japan': 'red'}

# Plot each origin group
for origin in df_with_origin['origin'].unique():
    subset = df_with_origin[df_with_origin['origin'] == origin]
    plt.scatter(subset['horsepower'], subset['mpg'], 
                c=colors[origin], label=origin.capitalize(),
                alpha=0.6, edgecolors='black', linewidth=0.5)

plt.xlabel('Horsepower', fontsize=12)
plt.ylabel('Miles per Gallon (mpg)', fontsize=12)
plt.title('Horsepower vs Fuel Efficiency by Origin', fontsize=14)
plt.legend(title='Origin')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Task 2.3: Pair Plot (5 points)

In [None]:
# SOLUTION: Create a pair plot
sns.pairplot(df_clean[['mpg', 'horsepower', 'weight', 'acceleration']], 
             diag_kind='hist',
             plot_kws={'alpha': 0.5, 'edgecolor': 'black', 'linewidth': 0.3})
plt.suptitle('Pair Plot of Key Variables', y=1.02, fontsize=14)
plt.tight_layout()
plt.show()

---

## Part 3: Correlation Analysis (30 points)

### Task 3.1: Calculate Single Correlation (10 points)

In [None]:
# SOLUTION: Calculate correlation between weight and mpg

# Method 1: NumPy
r_numpy = np.corrcoef(df_clean['weight'], df_clean['mpg'])[0, 1]

# Method 2: SciPy (also gives p-value)
r_scipy, p_value = stats.pearsonr(df_clean['weight'], df_clean['mpg'])

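# Method 3: Manual calculation for educational purposes
# pandas .std() uses the sample formula (ddof=1) by default, so the summed
# product of z-scores is divided by n - 1 to agree with Methods 1 and 2
x = df_clean['weight']
y = df_clean['mpg']
z_x = (x - x.mean()) / x.std()
z_y = (y - y.mean()) / y.std()
r_manual = (z_x * z_y).sum() / (len(x) - 1)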

print("Correlation between Weight and MPG:")
print("="*45)
print(f"Correlation (numpy):  {r_numpy:.4f}")
print(f"Correlation (scipy):  {r_scipy:.4f}")
print(f"Correlation (manual): {r_manual:.4f}")
print(f"\nP-value: {p_value:.2e}")
print(f"\nInterpretation: Strong negative correlation")
print(f"Heavier cars tend to have lower fuel efficiency.")
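
The z-score route to the correlation coefficient only reproduces `np.corrcoef` when the standard deviation and the averaging step use the same convention. The sketch below illustrates this on synthetic stand-in data (the values and variable names are assumptions for illustration, not the Auto MPG columns). Note that NumPy's `.std()` defaults to `ddof=0`, whereas pandas' `Series.std()` defaults to `ddof=1`.

```python
import numpy as np

# Synthetic stand-in data (hypothetical values, not the Auto MPG dataset)
rng = np.random.default_rng(0)
x = rng.normal(3000, 800, size=50)
y = 45 - 0.007 * x + rng.normal(0, 3, size=50)
n = len(x)

r_ref = np.corrcoef(x, y)[0, 1]

# Population convention: ddof=0 standard deviations, average over n
zx0 = (x - x.mean()) / x.std(ddof=0)
zy0 = (y - y.mean()) / y.std(ddof=0)
r_pop = np.mean(zx0 * zy0)

# Sample convention: ddof=1 standard deviations, sum divided by n - 1
zx1 = (x - x.mean()) / x.std(ddof=1)
zy1 = (y - y.mean()) / y.std(ddof=1)
r_samp = np.sum(zx1 * zy1) / (n - 1)

# Mixed convention: ddof=1 z-scores averaged over n shrink r by (n - 1) / n
r_mixed = np.mean(zx1 * zy1)

print(f"np.corrcoef:        {r_ref:.6f}")
print(f"population z-score: {r_pop:.6f}")
print(f"sample z-score:     {r_samp:.6f}")
print(f"mixed convention:   {r_mixed:.6f}")
```

With a few hundred rows the mixed convention is off only in the third decimal place, which is why the discrepancy is easy to miss when results are rounded.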

### Task 3.2: Correlation Matrix (10 points)

In [None]:
# SOLUTION: Calculate the correlation matrix
corr_matrix = df_clean.corr()

print("Correlation Matrix:")
print("="*80)
print(corr_matrix.round(3))

In [None]:
# SOLUTION: Create a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))

sns.heatmap(corr_matrix, 
            annot=True, 
            fmt='.3f',
            cmap='RdBu_r', 
            center=0,
            square=True,
            linewidths=0.5,
            cbar_kws={'label': 'Correlation Coefficient'})

plt.title('Correlation Matrix Heatmap', fontsize=14)
plt.tight_layout()
plt.show()

### Task 3.3: Interpret the Correlations (10 points)

**SOLUTION:**

1. **Strongest positive correlation with mpg**: `model_year` (r ≈ 0.58)
   - Newer cars tend to have better fuel efficiency due to technological improvements.

2. **Strongest negative correlation with mpg**: `weight` (r ≈ -0.83)
   - Heavier cars require more energy to move, resulting in lower fuel efficiency.

3. **Highest correlation between other variables**: `displacement` and `cylinders` (r ≈ 0.95)
   - This makes physical sense: more cylinders generally means larger engine displacement.
   - Also very high: `weight` and `displacement` (r ≈ 0.93), `weight` and `cylinders` (r ≈ 0.90)

4. **Surprising correlations**:
   - `acceleration` has a moderate positive correlation with `mpg` (r ≈ 0.42). This can seem surprising until you recall that `acceleration` is the time in seconds to reach 60 mph, so larger values mean slower cars. Slower-accelerating cars tend to have smaller, more efficient engines.
   - `cylinders`, `displacement`, `horsepower`, and `weight` are all highly correlated with each other (all r > 0.84), suggesting they measure the same underlying "car size/power" characteristic.
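
Reading the strongest pairs off a heatmap by eye gets error-prone as the number of columns grows. One possible way to rank every pair programmatically is sketched below. It uses a small synthetic DataFrame (the column names `a` to `d` and the values are hypothetical), and the same steps should carry over to a matrix such as `corr_matrix`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in columns (hypothetical, not the Auto MPG data)
rng = np.random.default_rng(1)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base + rng.normal(scale=0.3, size=200),
    "b": base + rng.normal(scale=0.3, size=200),
    "c": -base + rng.normal(scale=1.0, size=200),
    "d": rng.normal(size=200),
})

corr = demo.corr()
cols = corr.columns

# Indices of the strict upper triangle: each pair once, diagonal excluded
i, j = np.triu_indices(len(cols), k=1)
pairs = pd.Series(
    corr.to_numpy()[i, j],
    index=[f"{cols[a]} ~ {cols[b]}" for a, b in zip(i, j)],
)

# Order the pairs by absolute correlation, strongest first
ranked = pairs.reindex(pairs.abs().sort_values(ascending=False).index)
print(ranked.round(3))
```

Sorting by absolute value keeps strong negative pairs near the top instead of pushing them to the bottom of the list.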

In [None]:
# SOLUTION: Additional analysis to support the answers
print("Correlations with MPG:")
print("="*40)
mpg_corr = corr_matrix['mpg'].drop('mpg').sort_values()
print(mpg_corr.round(3))

print("\n" + "="*40)
print("\nStrongest positive with mpg:", mpg_corr.idxmax(), f"({mpg_corr.max():.3f})")
print("Strongest negative with mpg:", mpg_corr.idxmin(), f"({mpg_corr.min():.3f})")

---

## Part 4: Prediction Using Correlation (20 points)

### Task 4.1: Implement Prediction Function (10 points)

In [None]:
# SOLUTION: Implement the prediction function
def predict_from_correlation(x_new, x_data, y_data):
    """
    Predict y from x using correlation.
    
    Parameters:
    - x_new: the new x value(s) to predict for
    - x_data: array of x values (training data)
    - y_data: array of y values (training data)
    
    Returns:
    - predicted y value(s)
    """
    # Calculate correlation
    r = np.corrcoef(x_data, y_data)[0, 1]
    
    # Calculate means
    x_mean = np.mean(x_data)
    y_mean = np.mean(y_data)
    
    # Calculate standard deviations
    x_std = np.std(x_data)
    y_std = np.std(y_data)
    
    # Calculate prediction using the formula:
    # y_pred = y_mean + r * (y_std / x_std) * (x_new - x_mean)
    y_pred = y_mean + r * (y_std / x_std) * (x_new - x_mean)
    
    return y_pred

# Test the function
test_pred = predict_from_correlation(3000, df_clean['weight'], df_clean['mpg'])
print(f"Predicted mpg for 3000 lbs car: {test_pred:.2f}")
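
The formula `y_pred = y_mean + r * (y_std / x_std) * (x_new - x_mean)` is the ordinary least-squares line written in terms of the correlation: the slope is `r * y_std / x_std` and the line passes through `(x_mean, y_mean)`. A minimal check of that equivalence against `np.polyfit` is sketched below, using a compact re-statement of the function and synthetic data (both are assumptions made so the block runs on its own).

```python
import numpy as np

def predict_from_correlation(x_new, x_data, y_data):
    """Compact re-statement of the correlation-based predictor."""
    r = np.corrcoef(x_data, y_data)[0, 1]
    slope = r * np.std(y_data) / np.std(x_data)
    return np.mean(y_data) + slope * (np.asarray(x_new) - np.mean(x_data))

# Synthetic stand-in data (hypothetical values, not the Auto MPG dataset)
rng = np.random.default_rng(2)
x = rng.uniform(1600, 5100, size=100)
y = 46 - 0.0076 * x + rng.normal(0, 4, size=100)

# Degree-1 least-squares fit for comparison
slope_ls, intercept_ls = np.polyfit(x, y, 1)

x_new = np.array([2500, 3000, 3500])
pred_corr = predict_from_correlation(x_new, x, y)
pred_ls = slope_ls * x_new + intercept_ls

print("correlation formula:", pred_corr.round(3))
print("np.polyfit line:    ", pred_ls.round(3))
```

Because the ratio `y_std / x_std` is unchanged as long as both standard deviations use the same `ddof`, the prediction does not depend on which convention is chosen.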

### Task 4.2: Make Predictions (10 points)

In [None]:
# SOLUTION: Make predictions for different weights
weights_to_predict = [2500, 3000, 3500, 4000, 4500]

predictions = [predict_from_correlation(w, df_clean['weight'], df_clean['mpg']) 
               for w in weights_to_predict]

# Print predictions
print("Weight (lbs) | Predicted MPG")
print("-" * 30)
for w, mpg in zip(weights_to_predict, predictions):
    print(f"{w:12} | {mpg:.2f}")

In [None]:
# SOLUTION: Create a plot with scatter points and regression line
plt.figure(figsize=(10, 6))

# Step 1: Plot original data as scatter
plt.scatter(df_clean['weight'], df_clean['mpg'], alpha=0.5, label='Actual Data', 
            edgecolors='black', linewidth=0.3)

# Step 2: Plot regression line
x_range = np.linspace(df_clean['weight'].min(), df_clean['weight'].max(), 100)
# The function is plain NumPy arithmetic, so it accepts the whole array at once
y_pred_line = predict_from_correlation(x_range, df_clean['weight'], df_clean['mpg'])
plt.plot(x_range, y_pred_line, 'r-', linewidth=2, label='Regression Line')

# Step 3: Mark the predicted points
plt.scatter(weights_to_predict, predictions, color='green', s=100, zorder=5, 
            marker='*', label='Predictions', edgecolors='black', linewidth=1)

# Add labels for predictions
for w, mpg in zip(weights_to_predict, predictions):
    plt.annotate(f'{mpg:.1f}', (w, mpg), textcoords="offset points", 
                 xytext=(5, 10), fontsize=9)

plt.xlabel('Weight (lbs)', fontsize=12)
plt.ylabel('Miles per Gallon (mpg)', fontsize=12)
plt.title('Weight vs MPG with Predictions', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

---

## Part 5: Correlation Pitfalls (15 points)

### Task 5.1: Non-linear Relationships (5 points)

In [None]:
# SOLUTION: Analyze the acceleration vs mpg relationship
plt.figure(figsize=(10, 6))

# Scatter plot
plt.scatter(df_clean['acceleration'], df_clean['mpg'], alpha=0.5, 
            edgecolors='black', linewidth=0.3)

# Add trend line
z = np.polyfit(df_clean['acceleration'], df_clean['mpg'], 1)
p = np.poly1d(z)
x_line = np.linspace(df_clean['acceleration'].min(), df_clean['acceleration'].max(), 100)
plt.plot(x_line, p(x_line), 'r-', linewidth=2, label='Linear fit')

plt.xlabel('Acceleration (seconds to 60 mph)', fontsize=12)
plt.ylabel('Miles per Gallon (mpg)', fontsize=12)
plt.title('Acceleration vs Fuel Efficiency', fontsize=14)
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate and print correlation
r_accel_mpg, p_val = stats.pearsonr(df_clean['acceleration'], df_clean['mpg'])
print(f"\nCorrelation between acceleration and mpg: {r_accel_mpg:.3f}")
print(f"P-value: {p_val:.2e}")
print(f"\nInterpretation: Moderate positive correlation.")
print(f"The relationship appears roughly linear, but with high variability.")
print(f"Correlation is appropriate here, but R² = {r_accel_mpg**2:.3f} shows")
print(f"only {r_accel_mpg**2*100:.1f}% of variance in mpg is explained by acceleration.")
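
For contrast, here is a small sketch of the pitfall this task is named after: a relationship that is perfectly deterministic yet has a Pearson correlation of essentially zero because it is not linear. The data are synthetic and chosen purely for illustration.

```python
import numpy as np

# y is fully determined by x, but the relationship is a symmetric parabola
x = np.linspace(-3, 3, 201)
y = x ** 2

r_full = np.corrcoef(x, y)[0, 1]

# Restricting to x >= 0 makes the same curve monotonic, and r becomes large
half = x >= 0
r_half = np.corrcoef(x[half], y[half])[0, 1]

print(f"Pearson r over the full range: {r_full:.3f}")
print(f"Pearson r for x >= 0 only:     {r_half:.3f}")
```

A near-zero `r` therefore means "no linear trend", not "no relationship", which is why a scatter plot should accompany any correlation coefficient.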

### Task 5.2: Correlation vs Causation (5 points)

**SOLUTION:**

We found a strong negative correlation (r ≈ -0.83) between weight and mpg.

**Does making a car heavier CAUSE lower mpg?**

In this case, there IS likely a **causal relationship**:
- Physics tells us that heavier objects require more force (and thus more energy/fuel) to accelerate and maintain speed
- This is a direct physical mechanism

**However, we should consider potential confounding variables:**

1. **Engine size**: Larger engines tend to be in heavier cars AND consume more fuel
2. **Car class/type**: SUVs and trucks are heavier AND designed for power over efficiency
3. **Luxury features**: More features add weight AND luxury cars may prioritize performance over efficiency
4. **Aerodynamics**: Larger (heavier) cars may have worse aerodynamics
5. **Year of manufacture**: Older cars may be heavier AND less fuel-efficient

**Can we conclude causation?**

While correlation alone never proves causation, in this specific case:
- There is a plausible physical mechanism (Newton's laws)
- The direction makes theoretical sense
- Controlled experiments (adding weight to the same car) confirm the effect

So while we can't prove causation from correlation alone, the combination of correlation + physical theory + domain knowledge suggests weight does causally affect fuel efficiency.
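
One way to probe a suspected confounder with data is a partial correlation: regress both variables on the confounder and correlate the residuals. The sketch below uses synthetic data deliberately built so that a hypothetical `engine` variable drives both `weight` and `mpg`, with no direct weight effect at all. It illustrates the technique and makes no claim about the Auto MPG data.

```python
import numpy as np

# Synthetic example: 'engine' (hypothetical confounder) drives both variables
rng = np.random.default_rng(3)
n = 2000
engine = rng.normal(size=n)
weight = 0.9 * engine + rng.normal(scale=0.4, size=n)
mpg = -0.8 * engine + rng.normal(scale=0.4, size=n)  # no direct weight term

def residuals(target, control):
    """Residuals of target after a least-squares line fitted on control."""
    slope, intercept = np.polyfit(control, target, 1)
    return target - (slope * control + intercept)

# Raw correlation versus correlation after removing the confounder's effect
r_raw = np.corrcoef(weight, mpg)[0, 1]
r_partial = np.corrcoef(residuals(weight, engine),
                        residuals(mpg, engine))[0, 1]

print(f"Raw correlation:     {r_raw:.3f}")
print(f"Partial correlation: {r_partial:.3f}")
```

In this constructed case the strong raw correlation vanishes once the confounder is held fixed. If a real effect of weight existed, the partial correlation would stay clearly away from zero.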

### Task 5.3: Pearson vs Spearman (5 points)

In [None]:
# SOLUTION: Calculate Pearson and Spearman correlations
r_pearson, p_pearson = stats.pearsonr(df_clean['horsepower'], df_clean['mpg'])
r_spearman, p_spearman = stats.spearmanr(df_clean['horsepower'], df_clean['mpg'])

print("Horsepower vs MPG Correlations:")
print("="*45)
print(f"Pearson correlation:  {r_pearson:.4f} (p = {p_pearson:.2e})")
print(f"Spearman correlation: {r_spearman:.4f} (p = {p_spearman:.2e})")
print(f"\nDifference: {abs(r_pearson - r_spearman):.4f}")

print("\n" + "="*45)
print("\nInterpretation:")
print("- Both correlations are strong and negative")
print("- Spearman is larger in magnitude, suggesting a monotonic")
print("  relationship that is not perfectly linear")
print("- When |Pearson| ≈ |Spearman|, the relationship is roughly linear")
print("- Spearman is more robust to outliers")
print("\nWhich is more appropriate?")
print("- For this data, both are appropriate")
print("- If concerned about outliers or non-linearity, use Spearman")
print("- If you need to make linear predictions, use Pearson")
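
Spearman's coefficient is Pearson's coefficient computed on the ranks of the data, which is why it responds to any monotonic trend rather than only a straight line. A minimal sketch on synthetic, curved but monotonic data follows (the data-generating formula is an assumption for illustration, not a model of the horsepower column).

```python
import numpy as np
from scipy import stats

# Synthetic monotone-decreasing, curved relationship (hypothetical data)
rng = np.random.default_rng(4)
x = rng.uniform(50, 230, size=300)
y = 4000 / x + rng.normal(0, 1.0, size=300)

r_pearson, _ = stats.pearsonr(x, y)
r_spearman, _ = stats.spearmanr(x, y)

# Spearman by hand: rank both variables, then apply Pearson to the ranks
r_rank = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]

print(f"Pearson:          {r_pearson:.4f}")
print(f"Spearman:         {r_spearman:.4f}")
print(f"Pearson on ranks: {r_rank:.4f}")
```

When the curve bends, ranking straightens it out, so Spearman tends to come out larger in magnitude than Pearson on data like these.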

---

## Bonus Challenge (10 extra points)

### Bonus: Multi-variable Analysis

In [None]:
# SOLUTION: Analyze weight-mpg correlation by origin
df_with_origin = df.dropna()

print("Correlation between Weight and MPG by Origin:")
print("="*50)

correlations_by_origin = {}
for origin in df_with_origin['origin'].unique():
    subset = df_with_origin[df_with_origin['origin'] == origin]
    r, p = stats.pearsonr(subset['weight'], subset['mpg'])
    correlations_by_origin[origin] = r
    print(f"{origin.capitalize():10}: r = {r:.4f}, n = {len(subset):3}, p = {p:.2e}")

print("\n" + "="*50)
print(f"\nOverall correlation: r = {np.corrcoef(df_with_origin['weight'], df_with_origin['mpg'])[0,1]:.4f}")

In [None]:
# SOLUTION: Create visualization with separate regression lines for each origin
plt.figure(figsize=(12, 6))

colors = {'usa': 'blue', 'europe': 'green', 'japan': 'red'}

for origin in df_with_origin['origin'].unique():
    subset = df_with_origin[df_with_origin['origin'] == origin]
    
    # Scatter plot
    plt.scatter(subset['weight'], subset['mpg'], 
                c=colors[origin], alpha=0.5, label=f'{origin.capitalize()} (r={correlations_by_origin[origin]:.2f})',
                edgecolors='black', linewidth=0.3)
    
    # Regression line for each origin
    z = np.polyfit(subset['weight'], subset['mpg'], 1)
    p = np.poly1d(z)
    x_line = np.linspace(subset['weight'].min(), subset['weight'].max(), 100)
    plt.plot(x_line, p(x_line), c=colors[origin], linewidth=2, linestyle='--')

plt.xlabel('Weight (lbs)', fontsize=12)
plt.ylabel('Miles per Gallon (mpg)', fontsize=12)
plt.title('Weight vs MPG by Country of Origin', fontsize=14)
plt.legend(title='Origin')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

### Bonus Question Solution:

**What do the different correlations by origin tell us?**

**Observations:**

1. **All origins show strong negative correlations** between weight and mpg:
   - USA: r ≈ -0.82
   - Europe: r ≈ -0.85
   - Japan: r ≈ -0.80

2. **The relationship is consistent across all origins** - heavier cars have worse fuel efficiency regardless of where they're made.

3. **Key differences:**
   - **USA cars** cluster in the high-weight, low-mpg region (heavier vehicles, lower efficiency)
   - **Japanese cars** cluster in the low-weight, high-mpg region (lighter, more efficient)
   - **European cars** fall in between

4. **The slope (regression line) is similar for all origins**, suggesting the physical relationship between weight and fuel efficiency is universal.

5. **The intercepts differ**, meaning at any given weight:
   - Japanese cars tend to be slightly more efficient
   - USA cars tend to be slightly less efficient
   - This could be due to other factors (engine technology, aerodynamics, design philosophy)

**Conclusion:** The weight-mpg relationship is consistent across origins, but manufacturers from different countries tend to produce cars in different weight/efficiency ranges. The correlation is NOT an artifact of mixing different groups (Simpson's paradox doesn't apply here).
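
To see what Simpson's paradox would look like if it did apply, the sketch below builds three hypothetical groups whose within-group correlations are all positive while the pooled correlation is negative. The group names, centres, and slopes are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical group centres arranged along a falling diagonal
centres = {"g1": (0.0, 10.0), "g2": (5.0, 5.0), "g3": (10.0, 0.0)}

groups = {}
for name, (cx, cy) in centres.items():
    # Within each group, y rises with x
    x = cx + rng.normal(scale=1.0, size=200)
    y = cy + 0.8 * (x - cx) + rng.normal(scale=0.5, size=200)
    groups[name] = (x, y)

# Correlation inside each group
within = {name: np.corrcoef(x, y)[0, 1] for name, (x, y) in groups.items()}

# Correlation after pooling all groups together
x_all = np.concatenate([x for x, _ in groups.values()])
y_all = np.concatenate([y for _, y in groups.values()])
pooled = np.corrcoef(x_all, y_all)[0, 1]

for name, r in within.items():
    print(f"{name}: r = {r:.3f}")
print(f"pooled: r = {pooled:.3f}")
```

The sign flip between the per-group and pooled values is the signature of the paradox. In the weight-mpg analysis above, the per-origin and overall correlations share the same sign, so no such reversal occurs.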

---

## Grading Rubric

**Total Points: 100 (+ 10 bonus)**

| Section | Points | Key Criteria |
|---------|--------|-------------|
| Part 1: Loading and Exploring | 10 | Correct data exploration and cleaning |
| Part 2: Scatter Plots | 25 | Proper visualization with labels, titles, colors |
| Part 3: Correlation Analysis | 30 | Accurate calculations, proper heatmap, good interpretation |
| Part 4: Prediction | 20 | Correct implementation of formula, accurate predictions |
| Part 5: Pitfalls | 15 | Understanding of non-linearity, causation, different measures |
| Bonus | 10 | Complete multi-variable analysis with insights |

---

**End of Solution Key** 🔑