# Introduction to Machine Learning: Supervised Learning

**Instructor:** Daniel Acuna, Ph.D.
**Position:** Associate Professor of Computer Science
**Institution:** University of Colorado Boulder

---

Lab 2: Regression model evaluation

---

## Setup (do not edit)

In [None]:
import pathlib
from typing import Tuple, Dict, List

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

RANDOM_STATE: int = 42  # global seed for full reproducibility
np.random.seed(RANDOM_STATE)

_DATA_PATH = pathlib.Path("auto.csv")
if not _DATA_PATH.exists():
    raise FileNotFoundError(
        "auto.csv is missing from the lab directory. Please download it or ask the TA "
        "for assistance."
    )

## 1. Load the Dataset *(5 points)*

Write a function `load_auto_data()` that reads `auto.csv` into a `pandas.DataFrame`. Store the shape of the DataFrame in a variable called **`q1_shape`**.

In [None]:
def load_auto_data() -> pd.DataFrame:
    """Load the Auto dataset from ``auto.csv``.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the auto dataset.
    """
    # Read the CSV through the path already validated in the setup cell.
    return pd.read_csv(_DATA_PATH)


# Compute the answer required by the autograder
q1_shape: Tuple[int, int] = load_auto_data().shape
print(q1_shape)

(392, 8)


In [7]:
# If all tests pass (there might be hidden tests), you will earn 5 points
# Test Cell: Question 1
assert isinstance(q1_shape, tuple), "q1_shape must be a tuple"
assert len(q1_shape) == 2, "q1_shape should have 2 elements (rows, cols)"
assert q1_shape[0] > 0 and q1_shape[1] > 0, "Shape values must be positive"
print(f"Dataset shape: {q1_shape}")

Dataset shape: (392, 8)


## 2. Prepare Data *(10 points)*

Create the feature matrix `X` and target vector `y`.

- `X` should contain the columns `displacement`, `horsepower`, `weight`, and `acceleration`.
- `y` should contain the `mpg` column.

Store the number of features in **`q2_num_features`**.

In [9]:
df = load_auto_data()
# Select the four predictor columns and the target column.
X = df[['displacement', 'horsepower', 'weight', 'acceleration']]
y = df['mpg']
q2_num_features = len(X.columns)
q2_num_features

4

In [10]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 2
assert "X" in locals() and "y" in locals(), "X and y must be defined."
assert isinstance(X, pd.DataFrame), "X should be a DataFrame"
assert isinstance(y, pd.Series), "y should be a Series"
assert isinstance(q2_num_features, int), "q2_num_features should be an integer"
assert q2_num_features > 0, "Number of features must be positive"
print(f"Number of features: {q2_num_features}")

Number of features: 4


## 3. Train/Test Split *(10 points)*

Split `X` and `y` into a training set (80% of the rows) and a test set (20%) with `train_test_split`, passing `random_state=RANDOM_STATE` so the split is reproducible. Store the row counts of each split in a tuple **`q3_split_counts = (n_train, n_test)`**.

In [13]:
# Hold out 20% of the rows for testing; the fixed seed makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)
n_train = len(X_train)
n_test = len(X_test)
q3_split_counts = (n_train, n_test)
q3_split_counts

(313, 79)

In [15]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 3
assert isinstance(q3_split_counts, tuple), "q3_split_counts must be a tuple"
assert len(q3_split_counts) == 2, "q3_split_counts should have 2 elements"
train_rows, test_rows = q3_split_counts
assert train_rows > 0 and test_rows > 0, "Both splits must have positive counts"
assert train_rows > test_rows, "Training set should be larger than test set"
print(f"Train rows: {train_rows}, Test rows: {test_rows}")

Train rows: 313, Test rows: 79
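
The counts above are not an exact 80/20 division, because 20% of 392 rows is 78.4. The sketch below reproduces the arithmetic on a synthetic stand-in (the arrays `X_demo` and `y_demo` are made up for illustration, not the lab data), assuming the documented behavior that a fractional `test_size` is rounded up and the training set receives the remaining rows.

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same number of rows as the lab dataset (392).
n_rows = 392
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# 0.2 * 392 = 78.4, which is rounded up for the test set;
# the training set gets whatever is left.
expected_test = math.ceil(0.2 * n_rows)
expected_train = n_rows - expected_test

print(len(X_tr), len(X_te))
print(expected_train, expected_test)
```

Both printed pairs should agree with the `(313, 79)` reported in this section.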


## 4. Train Linear Regression Model *(15 points)*

Train a `LinearRegression` model on the training data. Store the fitted model in a variable named **`model`**.

In [21]:
# Fit ordinary least squares on the training split only.
model = LinearRegression()
model.fit(X_train, y_train)
model.coef_

array([-0.00935113, -0.04847209, -0.00502316, -0.05942766])

In [22]:
# If all tests pass (there might be hidden tests), you will earn 15 points
# Test Cell: Question 4
assert "model" in locals(), "The 'model' variable is not defined."
assert hasattr(model, "fit"), "The model should have a 'fit' method."
assert hasattr(model, "predict"), "The model should have a 'predict' method."
assert hasattr(model, "coef_"), "The model has not been fitted yet."
print(f"Model trained successfully with {len(model.coef_)} coefficients")

Model trained successfully with 4 coefficients
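
As a sanity check on what `fit` stores, the following sketch trains `LinearRegression` on noise-free synthetic data generated from a known linear rule. The data, the rule `y = 4 + 2*x0 - 3*x1`, and the name `demo_model` are assumptions made for illustration; they are unrelated to the Auto dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data from a known rule: y = 4 + 2*x0 - 3*x1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 4.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1]

demo_model = LinearRegression().fit(X_demo, y_demo)

# coef_ holds one slope per feature column, in column order;
# intercept_ holds the constant term.
print(demo_model.coef_)
print(demo_model.intercept_)
```

With no noise, the fitted slopes and intercept should match the generating rule up to floating-point error. On real data such as the lab's, the coefficients are estimates rather than exact recoveries.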


## 5. Calculate Test MSE *(15 points)*

Use the trained model to make predictions on the test set, then compute the Mean Squared Error (MSE). Store the result in **`q5_mse`** (float, rounded to three decimals).

In [25]:
# Predict on the held-out rows, then average the squared errors.
y_pred = model.predict(X_test)
q5_mse = round(mean_squared_error(y_test, y_pred), 3)
q5_mse

18.066

In [26]:
# If all tests pass (there might be hidden tests), you will earn 15 points
# Test Cell: Question 5
assert isinstance(q5_mse, float), "MSE must be a float."
assert q5_mse >= 0, "MSE must be non-negative."
assert q5_mse < 1000, "MSE seems unreasonably large. Check your calculation."
print(f"Test MSE: {q5_mse:.3f}")

Test MSE: 18.066
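
To make the metric concrete, here is a minimal sketch that computes MSE by hand on four made-up values and compares it with `mean_squared_error`. The arrays `y_true_demo` and `y_pred_demo` are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true values and predictions.
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

# Errors are 1, 0, -2, -1; their squares are 1, 0, 4, 1; the mean is 6 / 4.
mse_manual = float(np.mean((y_true_demo - y_pred_demo) ** 2))
mse_sklearn = mean_squared_error(y_true_demo, y_pred_demo)

# The square root (RMSE) is expressed in the same units as the target.
rmse = mse_manual ** 0.5

print(mse_manual, mse_sklearn, rmse)
```

Because MSE is in squared units, the square root is often easier to read. For the test MSE of 18.066 reported above, it is roughly 4.25 mpg.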


## 6. Calculate Test R² *(15 points)*

Compute the R-squared (R²) value for the test set. Store the result in **`q6_r2`** (float, rounded to three decimals).

In [31]:
# Reuse y_pred from Question 5; r2_score takes the true values first, then the predictions.
q6_r2 = round(r2_score(y_test, y_pred), 3)
q6_r2

0.646

In [32]:
# If all tests pass (there might be hidden tests), you will earn 15 points
# Test Cell: Question 6
assert isinstance(q6_r2, float), "R-squared must be a float."
assert q6_r2 <= 1.0, "R-squared cannot exceed 1.0."
assert q6_r2 > -10, "R-squared value seems unreasonable. Check your calculation."
print(f"Test R²: {q6_r2:.3f}")

Test R²: 0.646
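
R² compares the model's squared error with that of a baseline that always predicts the mean of the true values: R² = 1 - SS_res / SS_tot. The sketch below works this out on hypothetical arrays (`y_true_demo`, `y_pred_demo`) and checks it against `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions.
y_true_demo = np.array([3.0, 5.0, 2.0, 7.0])
y_pred_demo = np.array([2.0, 5.0, 4.0, 8.0])

# Residual sum of squares: squared error of the model's predictions.
ss_res = float(np.sum((y_true_demo - y_pred_demo) ** 2))
# Total sum of squares: squared error of always predicting the mean.
ss_tot = float(np.sum((y_true_demo - y_true_demo.mean()) ** 2))

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true_demo, y_pred_demo)

# A constant prediction equal to the mean scores exactly zero.
baseline_pred = np.full_like(y_true_demo, y_true_demo.mean())
r2_baseline = r2_score(y_true_demo, baseline_pred)

print(r2_manual, r2_sklearn, r2_baseline)
```

Read this way, the test R² of 0.646 reported above means the model's squared error on the test set is about 65% lower than that of predicting the test-set mean for every car.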


## 7. Reusable Evaluation Function *(20 points)*

Implement a function **`evaluate_model(model, X_test, y_test)`** that returns the tuple `(MSE, R²)` for the given data, with each value rounded to three decimals.

In [35]:
def evaluate_model(model, X_test, y_test) -> Tuple[float, float]:
    """Return (MSE, R²) for ``model`` on the given data, rounded to three decimals."""
    # Predict once and reuse the result for both metrics.
    y_pred = model.predict(X_test)
    mse = round(mean_squared_error(y_test, y_pred), 3)
    r2 = round(r2_score(y_test, y_pred), 3)
    return mse, r2

In [36]:
# If all tests pass (there might be hidden tests), you will earn 20 points
# Test Cell: Question 7
test_mse, test_r2 = evaluate_model(model, X_test, y_test)
assert isinstance(test_mse, float) and isinstance(
    test_r2, float
), "Function must return two floats."
assert test_mse >= 0, "MSE must be non-negative."
assert test_r2 <= 1.0, "R-squared cannot exceed 1.0."
print(f"Function returned MSE: {test_mse:.3f}, R²: {test_r2:.3f}")

Function returned MSE: 18.066, R²: 0.646


## 8. Analyze Coefficients *(10 points)*

Identify the feature with the most negative coefficient (i.e., the one for which a one-unit increase is associated with the largest predicted drop in MPG, holding the other features fixed). Store its name as a string in **`q8_strongest_negative_feature`**.

In [38]:
# np.argmin returns the position of the smallest (most negative) coefficient.
q8_strongest_negative_feature = X.columns[np.argmin(model.coef_)]
q8_strongest_negative_feature

'acceleration'

In [39]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Question 8
assert isinstance(q8_strongest_negative_feature, str), "Answer must be a string."
assert len(q8_strongest_negative_feature) > 0, "Feature name cannot be empty."
assert (
    q8_strongest_negative_feature in X.columns
), "Feature name must be one of the columns in X."
print(f"Feature with most negative coefficient: {q8_strongest_negative_feature}")

Feature with most negative coefficient: acceleration
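
One caveat when reading this answer: a raw coefficient is measured per unit of its feature, so its size depends on the feature's units. The sketch below uses synthetic data (the columns `accel` and `weight`, their ranges, and the generating rule are all assumptions for illustration) to show that re-expressing one feature in thousands changes which coefficient is most negative, even though the fitted relationship is the same.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic features on very different scales (illustrative only).
rng = np.random.default_rng(0)
accel = rng.uniform(8, 25, size=100)
weight = rng.uniform(1500, 5000, size=100)
y_demo = 40.0 - 0.5 * accel - 0.002 * weight

names = ["accel", "weight"]

# Fit once with weight in its original units, once with weight in thousands.
coef_raw = LinearRegression().fit(np.column_stack([accel, weight]), y_demo).coef_
coef_rescaled = LinearRegression().fit(
    np.column_stack([accel, weight / 1000.0]), y_demo
).coef_

most_negative_raw = names[int(np.argmin(coef_raw))]
most_negative_rescaled = names[int(np.argmin(coef_rescaled))]

print(coef_raw, most_negative_raw)
print(coef_rescaled, most_negative_rescaled)
```

For the lab, the autograded answer is still the feature with the most negative raw coefficient. Comparing effects across features fairly would typically involve putting them on a common scale first, which is outside the scope of this question.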


## Next Steps

Congratulations on completing the assignment! Before submitting:

1. Make sure all your cells run without errors.
2. Ensure you've answered all parts of each question.
3. If any autograder tests fail, revisit your answers.
