# Introduction to Machine Learning: Supervised Learning

**Instructor:** Daniel Acuna, Ph.D.
**Position:** Associate Professor of Computer Science
**Institution:** University of Colorado Boulder

---

Lab 1: Introduction to Machine Learning: Supervised Learning

---

## Setup (do not edit)

In [2]:
import pathlib
from typing import Tuple, Dict, List

import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split, KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression

RANDOM_STATE: int = 42  # global seed for full reproducibility
np.random.seed(RANDOM_STATE)

_DATA_PATH = pathlib.Path("california_housing.csv")
if not _DATA_PATH.exists():
    raise FileNotFoundError(
        "california_housing.csv is missing from the lab directory. Please download it or ask the TA "
        "for assistance."
    )

## 1. Load the dataset *(10 points)*

Write a function `load_housing()` that reads the CSV into a `pandas.DataFrame`.  Store the shape of
that DataFrame in a variable called **`q1_shape`**.

In [5]:
def load_housing() -> pd.DataFrame:
    """Load the California Housing data from ``california_housing.csv``.

    Returns
    -------
    pd.DataFrame
        A DataFrame containing the 8 predictors plus the target column ``MedHouseVal``.
    """
    # your code here
    # Reuse the path that the setup cell already validated
    data = pd.read_csv(_DATA_PATH)
    return data

df = load_housing()
print(f"DataFrame shape: {df.shape}")

# Compute the answer required by the autograder
q1_shape: Tuple[int, int] = load_housing().shape

DataFrame shape: (20640, 9)


In [7]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Exercise 1
assert isinstance(q1_shape, tuple), "q1_shape must be a tuple"
assert len(q1_shape) == 2, "q1_shape should have 2 elements (rows, cols)"
assert q1_shape[0] > 0 and q1_shape[1] > 0, "Shape values must be positive"
print(f"Dataset shape: {q1_shape}")


Dataset shape: (20640, 9)


## 2. Income → Value gap *(10 points)*

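Using the ``MedInc`` feature, select the two extreme **income quartiles**:
* bottom 25% (Q1): rows at or below the 25th percentile of ``MedInc``
* top 25% (Q4): rows at or above the 75th percentile of ``MedInc``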

Compute the **mean** of the target variable ``MedHouseVal`` for each of those two groups and store
their *difference* (**top – bottom**) in **`q2_income_value_gap`** (float, rounded to three
decimals).
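One way to form the groups is `pandas.qcut`, which assigns each row to a quantile bin. The sketch below uses a made-up eight-row frame (`toy` and its values are invented for illustration, not taken from the dataset) and is only one possible approach; explicit percentile thresholds with `<=` and `>=` can differ from it slightly when values tie at a bin edge.

```python
import pandas as pd

# Made-up data for illustration only (not the housing dataset)
toy = pd.DataFrame(
    {
        "MedInc": [1, 2, 3, 4, 5, 6, 7, 8],
        "MedHouseVal": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0],
    }
)

# Label every row with its income quartile
quartile = pd.qcut(toy["MedInc"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean target value in the bottom and top quartiles
toy_bottom_mean = toy.loc[quartile == "Q1", "MedHouseVal"].mean()
toy_top_mean = toy.loc[quartile == "Q4", "MedHouseVal"].mean()

# Difference (top - bottom), converted to a plain float and rounded
toy_gap = round(float(toy_top_mean - toy_bottom_mean), 3)
print(f"Toy gap: {toy_gap}")
```

Wrapping the result in `float(...)` before rounding keeps the stored value a built-in `float`, which is what an `isinstance(..., float)` check expects.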

In [8]:
df = load_housing()

# your code here
# Calculate the 25th and 75th percentiles of MedInc
q1_threshold = df['MedInc'].quantile(0.25)  # bottom 25%
q4_threshold = df['MedInc'].quantile(0.75)  # top 25%

# Filter data into bottom and top quartiles
bottom_quartile = df[df['MedInc'] <= q1_threshold]
top_quartile = df[df['MedInc'] >= q4_threshold]

# Calculate mean MedHouseVal for each group
bottom_mean = bottom_quartile['MedHouseVal'].mean()
top_mean = top_quartile['MedHouseVal'].mean()

# Calculate the difference (top - bottom) and round to 3 decimals
q2_income_value_gap = round(top_mean - bottom_mean, 3)

print(f"Bottom 25% mean house value: {bottom_mean:.3f}")
print(f"Top 25% mean house value: {top_mean:.3f}")
print(f"Income-Value gap: {q2_income_value_gap:.3f}")

Bottom 25% mean house value: 1.230
Top 25% mean house value: 3.168
Income-Value gap: 1.938


In [9]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Exercise 2
assert isinstance(q2_income_value_gap, float), "q2_income_value_gap must be a float"
assert q2_income_value_gap > 0, "Gap should be positive (top - bottom)"
assert q2_income_value_gap < 10.0, "Gap seems unreasonably large"
print(f"Income-Value gap: {q2_income_value_gap:.3f}")

Income-Value gap: 1.938


## 3. Train/test split *(10 points)*

Perform an 80 / 20 split of the predictors ``X`` (all columns except ``MedHouseVal``) and the
target ``y`` (only ``MedHouseVal``).  Use ``random_state=RANDOM_STATE``.  Store the **row counts**
of each split in a tuple **`q3_split_counts = (n_train, n_test)`**.
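As a small illustration of what the split does, the sketch below runs `train_test_split` on ten made-up rows (`X_toy` and `y_toy` are invented for this example). It shows the 80/20 row counts, that features and targets stay aligned after shuffling, and that a fixed `random_state` reproduces the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: row i of X_toy is [2*i, 2*i + 1] and its target is i
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print(f"train rows: {len(X_tr)}, test rows: {len(X_te)}")

# Rows are shuffled together, so each feature row still matches its target
print("aligned:", np.array_equal(X_tr[:, 0] // 2, y_tr))

# Repeating the call with the same seed yields the same test rows
_, _, _, y_te_again = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)
print("reproducible:", np.array_equal(y_te, y_te_again))
```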

In [11]:
# your code here
# Load the dataset
df = load_housing()

# Separate features (X) and target (y)
X = df.drop('MedHouseVal', axis=1)  # All columns except MedHouseVal
y = df['MedHouseVal']  # Only the target column

# Perform 80/20 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE
)

# Store the row counts as required
q3_split_counts = (len(X_train), len(X_test))

print(f"Total dataset size: {len(df)}") 
print(f"Training set size: {len(X_train)}")
print(f"Test set size: {len(X_test)}")
print(f"Train/Test split: {q3_split_counts[0]} / {q3_split_counts[1]}")

Total dataset size: 20640
Training set size: 16512
Test set size: 4128
Train/Test split: 16512 / 4128


In [12]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Exercise 3
assert isinstance(q3_split_counts, tuple), "q3_split_counts must be a tuple"
assert len(q3_split_counts) == 2, "q3_split_counts should have 2 elements (train, test)"
assert all(
    isinstance(n, (int, np.integer)) for n in q3_split_counts
), "Both counts must be integers"
assert all(n > 0 for n in q3_split_counts), "Both counts must be positive"
print(f"Train/Test split: {q3_split_counts[0]} / {q3_split_counts[1]}")

Train/Test split: 16512 / 4128


## 4. k-NN with *k = 5* *(10 points)*

Train a ``KNeighborsRegressor`` with ``n_neighbors=5`` on the training data and compute the **test
RMSE**.  Store the scalar in **`q4_knn5_rmse`** rounded to **three** decimals.
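For reference, RMSE is the square root of the mean of the squared prediction errors, so it is expressed in the same units as the target. The sketch below computes it by hand on made-up numbers; the helper name `rmse_by_hand` and the values are hypothetical and only illustrate the formula.

```python
import math


def rmse_by_hand(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values for illustration only
toy_true = [3.0, 1.0, 2.0, 4.0]
toy_pred = [2.0, 1.0, 4.0, 4.0]

# Errors are 1, 0, -2, 0, so the mean squared error is 5 / 4 = 1.25
toy_rmse = rmse_by_hand(toy_true, toy_pred)
print(f"RMSE: {toy_rmse:.3f}")
```

Because the errors are squared before averaging, the single error of 2 contributes four times as much as the error of 1.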

In [15]:
# your code here
# Create a k-NN regressor with k=5
knn = KNeighborsRegressor(n_neighbors=5)

# Train the model on training data
knn.fit(X_train, y_train)

# Make predictions on test data
y_pred = knn.predict(X_test)

# Calculate RMSE (Root Mean Square Error)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Store the result rounded to 3 decimals
q4_knn5_rmse = round(rmse, 3)

print(f"k-NN (k=5) Test RMSE: {q4_knn5_rmse:.3f}")

k-NN (k=5) Test RMSE: 1.058


In [None]:
# If all tests pass (there might be hidden tests), you will earn 10 points
# Test Cell: Exercise 4
assert isinstance(q4_knn5_rmse, float), "RMSE must be a float"
assert q4_knn5_rmse >= 0, "RMSE must be non-negative"
assert q4_knn5_rmse < 10.0, "RMSE seems unreasonably large"
print(f"k-NN (k=5) Test RMSE: {q4_knn5_rmse:.3f}")

## 5. Reusable k-NN evaluator *(15 points)*

Implement a function **`knn_rmse(k: int) -> float`** that
1. trains a ``KNeighborsRegressor`` with the given *k* on **`X_train`, `y_train`** (from Exercise 3),
2. returns the **test RMSE**, computed on **`X_test`, `y_test`** and rounded to three decimals.

In [16]:
# your code here
def knn_rmse(k: int) -> float:
    """Train a k-NN regressor and return test RMSE.
    
    Parameters
    ----------
    k : int
        Number of neighbors for the KNeighborsRegressor
        
    Returns
    -------
    float
        Test RMSE rounded to 3 decimals
    """
    # Create k-NN regressor with the given k
    knn = KNeighborsRegressor(n_neighbors=k)
    
    # Train on training data
    knn.fit(X_train, y_train)
    
    # Make predictions on test data
    y_pred = knn.predict(X_test)
    
    # Calculate RMSE
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    
    # Return rounded to 3 decimals
    return round(rmse, 3)

# Test the function
print(f"knn_rmse(5) = {knn_rmse(5):.3f}")

knn_rmse(5) = 1.058


In [17]:
# If all tests pass (there might be hidden tests), you will earn 15 points
# Test Cell: Exercise 5
assert callable(knn_rmse), "knn_rmse should be a callable function"
# Test with k=5
test_rmse = knn_rmse(5)
assert isinstance(test_rmse, float), "knn_rmse should return a float"
assert test_rmse >= 0, "RMSE must be non-negative"
assert test_rmse < 10.0, "RMSE seems unreasonably large"
print(f"knn_rmse(5) = {test_rmse:.3f}")

knn_rmse(5) = 1.058


## 6. Linear Regression baseline *(15 points)*

Fit an ordinary least-squares ``LinearRegression`` model (with default settings) on the training
data and compute its **test RMSE**.  Store the value in **`q6_linreg_rmse`** rounded to three
decimals.
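To make concrete what the default ``LinearRegression`` fits, the sketch below trains it on made-up, noiseless data generated from `y = 2x + 1` (the data and variable names are invented for this example). With no noise, ordinary least squares should recover the slope and intercept up to floating-point error, and the RMSE on the same data should be essentially zero.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up noiseless data following y = 2x + 1
rng = np.random.default_rng(0)
x_toy = rng.uniform(0, 10, size=(50, 1))
y_line = 2.0 * x_toy[:, 0] + 1.0

# Default settings fit an intercept in addition to the coefficients
toy_model = LinearRegression().fit(x_toy, y_line)
print(f"slope: {toy_model.coef_[0]:.3f}, intercept: {toy_model.intercept_:.3f}")

# RMSE on the data the model was fit to
residuals = y_line - toy_model.predict(x_toy)
toy_fit_rmse = float(np.sqrt(np.mean(residuals ** 2)))
print(f"RMSE: {toy_fit_rmse:.3f}")
```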

In [18]:
# your code here
# Create a Linear Regression model with default settings
linreg = LinearRegression()

# Fit the model on training data
linreg.fit(X_train, y_train)

# Make predictions on test data
y_pred = linreg.predict(X_test)

# Calculate RMSE (Root Mean Square Error)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

# Store the result rounded to 3 decimals
q6_linreg_rmse = round(rmse, 3)

print(f"Linear Regression Test RMSE: {q6_linreg_rmse:.3f}")

Linear Regression Test RMSE: 0.746


In [19]:
# If all tests pass (there might be hidden tests), you will earn 15 points
# Test Cell: Exercise 6
assert isinstance(q6_linreg_rmse, float), "RMSE must be a float"
assert q6_linreg_rmse >= 0, "RMSE must be non-negative"
assert q6_linreg_rmse < 10.0, "RMSE seems unreasonably large"
print(f"Linear Regression Test RMSE: {q6_linreg_rmse:.3f}")

Linear Regression Test RMSE: 0.746


## 7. 5-fold CV for multiple *k* values *(15 points)*

Implement **`cross_val_knn(k_values: List[int]) -> Dict[int, float]`** that performs
5-fold cross-validation *only on the training split* for each *k* in `k_values` and returns a
dictionary mapping *k* → mean CV RMSE (rounded to three decimals).  Use
``KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)``.
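To see what the splitter produces, the sketch below applies the same ``KFold`` settings to ten made-up samples (`X_demo` is invented for this example). Each sample should land in the validation part of exactly one fold, and each fold trains on the remaining samples. Note also that scikit-learn's `'neg_mean_squared_error'` scorer returns negated MSE values, so they need a sign flip before taking the square root.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten made-up samples with a single feature
X_demo = np.arange(10).reshape(-1, 1)
demo_kfold = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
validation_indices = []
for fold, (train_idx, val_idx) in enumerate(demo_kfold.split(X_demo), start=1):
    print(f"fold {fold}: train={train_idx.tolist()} validation={val_idx.tolist()}")
    fold_sizes.append((len(train_idx), len(val_idx)))
    validation_indices.extend(val_idx.tolist())

# Across the five folds, every sample is validated exactly once
print("validated once each:", sorted(validation_indices) == list(range(10)))
```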

In [22]:
# your code here
def cross_val_knn(k_values: List[int]) -> Dict[int, float]:
    """Perform 5-fold cross-validation for k-NN with multiple k values.
    
    Parameters
    ----------
    k_values : List[int]
        List of k values to evaluate
        
    Returns
    -------
    Dict[int, float]
        Dictionary mapping k -> mean CV RMSE (rounded to 3 decimals)
    """
    # Create KFold splitter as specified
    kfold = KFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
    
    results = {}
    
    for k in k_values:
        # Create k-NN regressor with current k
        knn = KNeighborsRegressor(n_neighbors=k)
        
        # Perform cross-validation on training data only
        # Use negative mean squared error scoring
        cv_scores = cross_val_score(
            knn, X_train, y_train, cv=kfold, scoring='neg_mean_squared_error'
        )
        
        # Convert negative MSE to RMSE and take mean
        rmse_scores = np.sqrt(-cv_scores)  # Convert neg MSE to positive RMSE
        mean_rmse = np.mean(rmse_scores)
        
        # Store result rounded to 3 decimals
        results[k] = round(mean_rmse, 3)
    
    return results

# Test the function with a couple of k values
test_result = cross_val_knn([3, 5])
print(f"Sample cross-validation results: {test_result}")

Sample cross-validation results: {3: 1.108, 5: 1.081}


In [23]:
# If all tests pass (there might be hidden tests), you will earn 15 points
# Test Cell: Exercise 7
assert callable(cross_val_knn), "cross_val_knn should be a callable function"
test_vals = cross_val_knn([3, 5])
assert isinstance(test_vals, dict), "cross_val_knn should return a dictionary"
assert len(test_vals) == 2, "Dictionary should have one entry per k value"
assert all(isinstance(k, int) for k in test_vals.keys()), "Keys should be integers"
assert all(isinstance(v, float) for v in test_vals.values()), "Values should be floats"
assert all(v >= 0 for v in test_vals.values()), "RMSE values must be non-negative"
print(f"Sample CV scores: {test_vals}")

Sample CV scores: {3: 1.108, 5: 1.081}


## 8. Choose best *k* and evaluate on test set *(15 points)*

1. Use your `cross_val_knn` from Exercise 7 with the list `[1, 3, 5, 7, 9, 15, 25]`.
2. Identify the *k* that achieves the **lowest** cross-validated RMSE → **`q8_best_k`**.
3. Finally, use your `knn_rmse` function from Exercise 5 to compute the **test set RMSE** for this best *k* → **`q8_test_rmse`**.
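Picking the best *k* amounts to finding the dictionary key with the smallest value. The sketch below does this with `min` on made-up scores (`hypothetical_cv` is invented and does not reflect real results). On a tie, `min` returns the first key it encounters, which here is the smaller *k* because the dictionary was built in increasing order.

```python
# Hypothetical CV RMSE values, for illustration only
hypothetical_cv = {1: 0.9, 3: 0.7, 5: 0.6, 7: 0.6, 9: 0.8}

# Key whose value is smallest; ties resolve to the first key in insertion order
best_k_demo = min(hypothetical_cv, key=hypothetical_cv.get)
print(f"best k: {best_k_demo}, CV RMSE: {hypothetical_cv[best_k_demo]}")
```

Selecting *k* from cross-validation scores alone, and touching the test set only once afterwards, keeps the final test RMSE an honest estimate.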

In [None]:
# your code here
# Define the k candidates to evaluate
k_candidates = [1, 3, 5, 7, 9, 15, 25]

# Step 1: Use cross_val_knn to get CV RMSE for each k
cv_results = cross_val_knn(k_candidates)
print("Cross-validation results:")
for k, rmse in cv_results.items():
    print(f"k={k}: CV RMSE = {rmse:.3f}")

# Step 2: Find the k with the lowest CV RMSE
q8_best_k = min(cv_results, key=cv_results.get)
print(f"\nBest k (lowest CV RMSE): {q8_best_k}")
print(f"Best CV RMSE: {cv_results[q8_best_k]:.3f}")

# Step 3: Use knn_rmse to get test set performance for the best k
q8_test_rmse = knn_rmse(q8_best_k)
print(f"\nFinal test set RMSE for k={q8_best_k}: {q8_test_rmse:.3f}")

# Summary
print("\n=== FINAL RESULTS ===")
print(f"Best k: {q8_best_k}")
print(f"Test RMSE: {q8_test_rmse:.3f}")

Cross-validation results:
k=1: CV RMSE = 1.286
k=3: CV RMSE = 1.108
k=5: CV RMSE = 1.081
k=7: CV RMSE = 1.075
k=9: CV RMSE = 1.073
k=15: CV RMSE = 1.082
k=25: CV RMSE = 1.098

Best k (lowest CV RMSE): 9
Best CV RMSE: 1.073

Final test set RMSE for k=9: 1.050

=== FINAL RESULTS ===
Best k: 9
Test RMSE: 1.050


In [25]:
# If all tests pass (there might be hidden tests), you will earn 15 points
# Test Cell: Exercise 8
k_candidates = [1, 3, 5, 7, 9, 15, 25]
assert isinstance(q8_best_k, int), "q8_best_k must be an integer"
assert q8_best_k > 0, "k must be positive"
assert isinstance(q8_test_rmse, float), "q8_test_rmse must be a float"
assert q8_test_rmse >= 0, "RMSE must be non-negative"
assert q8_test_rmse < 10.0, "RMSE seems unreasonably large"
print(f"Best k: {q8_best_k}, Test RMSE: {q8_test_rmse:.3f}")

Best k: 9, Test RMSE: 1.050


## Next Steps

Congratulations on completing the assignment! Before submitting:

1. Make sure all your cells run without errors.
2. Ensure you've answered all parts of each question.
3. If any autograder tests fail, revisit your answers.
