## Lab 4 – Predicting a Continuous Target with Regression

We will evaluate performance using key regression metrics and create visualizations to interpret the results.

Linear Regression Model

    First, we'll predict weight based on height.
    Then, we'll add a second feature (age) to predict weight.

Polynomial Regression Model

    We'll extend the linear model by adding higher-order terms to capture more complex relationships.

Regularized Model (Elastic Net)

    We'll apply regularization to prevent overfitting and improve generalization.

## Section 1. Load and Explore the Data
### 1.1 Load the dataset

In [2]:
# Load Howell.csv from the same folder as this file
import pandas as pd
howell_full = pd.read_csv("Howell.csv", sep=";")

### 1.2 Explore the Dataset Structure

In [3]:
# Display basic information about the dataset
print("Dataset shape:", howell_full.shape)
print("\nColumn names:")
print(howell_full.columns.tolist())
print("\nFirst few rows:")
print(howell_full.head())
print("\nDataset info:")
howell_full.info()  # info() prints its own summary and returns None, so it is not wrapped in print()

# Check for missing data
print("\nMissing data in dataset:")
missing_data = howell_full.isnull().sum()
print(missing_data)

if missing_data.sum() == 0:
    print("\n✓ No missing data found in the dataset!")
else:
    print(f"\n⚠️ Found {missing_data.sum()} missing values total")

Dataset shape: (544, 4)

Column names:
['height', 'weight', 'age', 'male']

First few rows:
    height     weight   age  male
0  151.765  47.825606  63.0     1
1  139.700  36.485807  63.0     0
2  136.525  31.864838  65.0     0
3  156.845  53.041914  41.0     1
4  145.415  41.276872  51.0     0

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 544 entries, 0 to 543
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   height  544 non-null    float64
 1   weight  544 non-null    float64
 2   age     544 non-null    float64
 3   male    544 non-null    int64  
dtypes: float64(3), int64(1)
memory usage: 17.1 KB

Missing data in dataset:
height    0
weight    0
age       0
male      0
dtype: int64

✓ No missing data found in the dataset!


## Section 2. Visualize Feature Relationships

### 2.1 Create new features

### Plot with Masking

## Section 3. Train and Analyze a Linear Regression Model

First:

    input features: Height
    target: Weight

Second:

    input features: Height, Age
    target: Weight

Justify your selections

    Height is likely to be a strong predictor of weight.
    Age could contribute secondary patterns. Because this lab uses the full Howell dataset rather than an adults-only subset, age may explain variation that height alone does not.


### 3.1 Define X (features) and y (target) for Height --> Weight

First, use height to predict weight using the full Howell dataset. Note that each input should be an array, so we have an array of arrays (hence the double brackets on the inputs).

Comment out or uncomment the appropriate feature set before splitting the data. This code is set to run Case 1 - the inputs are just height.

In [5]:
from sklearn.model_selection import train_test_split
X = howell_full[['height']]
y = howell_full['weight']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
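
To see why the double brackets matter, here is a minimal sketch on a tiny made-up frame (the values are hypothetical, not rows from Howell.csv). Double brackets select a list of columns and return a 2D DataFrame, which is the shape scikit-learn expects for features. Single brackets return a 1D Series, which is fine for the target.

```python
import pandas as pd

# Tiny stand-in frame (hypothetical values, not rows from Howell.csv)
df = pd.DataFrame({"height": [150.0, 160.0, 170.0], "weight": [45.0, 52.0, 61.0]})

X_2d = df[["height"]]  # double brackets: DataFrame with one column, shape (3, 1)
y_1d = df["weight"]    # single brackets: Series, shape (3,)

print(type(X_2d).__name__, X_2d.shape)
print(type(y_1d).__name__, y_1d.shape)
```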

### 3.2 Train a Simple Linear Regression Model for Height --> Weight

Create a LinearRegression model. Check the imports up above - what package does this class come from? (Hint: see the line from sklearn.linear_model import LinearRegression). 
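
A minimal sketch of this step follows. So that it runs on its own, it builds a synthetic height/weight table as a stand-in for Howell.csv (an assumption; the formula and noise level are invented). In the lab, replace `demo` with `howell_full`. The name `lr_model` is also just illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Howell.csv (assumption): use howell_full in the lab.
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
demo = pd.DataFrame({"height": height, "weight": weight})

X = demo[["height"]]
y = demo["weight"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Create the model and fit it on the training split only
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

print("Slope:", lr_model.coef_[0])
print("Intercept:", lr_model.intercept_)
```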

### 3.3 Report Performance Metrics on Training Data for Height --> Weight

Get the predictions of the model on the training data.
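
One way to do this is sketched below, using R², MAE, and RMSE from `sklearn.metrics`. The data is a synthetic stand-in (an assumption) so the cell runs by itself. RMSE is computed as the square root of the mean squared error, which works across scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Howell data (assumption, not real measurements)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)

X = height.reshape(-1, 1)  # 2D: one row per person, one column per feature
X_train, X_test, y_train, y_test = train_test_split(
    X, weight, test_size=0.2, random_state=123
)

lr_model = LinearRegression().fit(X_train, y_train)

# Predict on the same data the model was trained on
y_pred_train = lr_model.predict(X_train)

r2_train = r2_score(y_train, y_pred_train)
mae_train = mean_absolute_error(y_train, y_pred_train)
rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))

print(f"Training R²:   {r2_train:.3f}")
print(f"Training MAE:  {mae_train:.2f}")
print(f"Training RMSE: {rmse_train:.2f}")
```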

### 3.4 Report Performance Metrics on Test Data for Height --> Weight
We now want to get the performance of the model on the test data. Add the following lines - or better yet, copy the whole section from above, and make the changes to use the test data instead. 
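
Instead of copying the cell, the metric code can be wrapped in a small helper and called once per split. The sketch below assumes a hypothetical helper named `regression_report` and synthetic stand-in data; neither comes from the original notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split


def regression_report(model, X, y, label):
    """Return R², MAE and RMSE for the model's predictions on (X, y)."""
    y_pred = model.predict(X)
    metrics = {
        "set": label,
        "r2": r2_score(y, y_pred),
        "mae": mean_absolute_error(y, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y, y_pred))),
    }
    print(f"{label:<9} R²: {metrics['r2']:.3f}  "
          f"MAE: {metrics['mae']:.2f}  RMSE: {metrics['rmse']:.2f}")
    return metrics


# Synthetic stand-in for the Howell data (assumption)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    height.reshape(-1, 1), weight, test_size=0.2, random_state=123
)

lr_model = LinearRegression().fit(X_train, y_train)

train_metrics = regression_report(lr_model, X_train, y_train, "Training")
test_metrics = regression_report(lr_model, X_test, y_test, "Test")
```

Comparing the two rows side by side is what lets us judge generalization: similar values suggest the model is not overfitting.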

### 3.5 Visualize (Height --> Weight)

We are now going to generate a list of heights and feed those values into the model to get the corresponding weight predictions.

Step 1: Compute the range of heights from the training data:

1. Find the minimum value in the training set heights.
2. Find the maximum value in the training set heights.
3. Set the number of points we want in the line plot. More will look smoother. 
4. Determine the spacing so it is even.

Step 2: Generate evenly spaced height values and predict corresponding weights:

Use a list comprehension to create a list of evenly spaced height values.

Use a list comprehension to calculate inputs. We use x values from the list we just created. Again, the model is expecting a 2D array of inputs (each input must itself be an array). So we convert each x-value into a one-element array (hence the double brackets on inputs). 

Read a list comprehension as 

    "Return _ for each _ in _"      or
    "Return this for each i in the range from 0 up to but not including 200 (our number of points)" 

The built-in range(200) is a lazy sequence: it produces the integers 0 through 199 on demand instead of building the whole list up front.
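
Steps 1 and 2 can be sketched as follows. The variable names (`min_x`, `step_by`, `x_values`, `inputs`) are illustrative, and the data is a synthetic stand-in (an assumption). The model here is fit on a NumPy array; if your model was fit on a DataFrame, scikit-learn may warn about missing feature names when it receives a plain list, in which case wrap the values in a DataFrame with a `height` column.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Howell data (assumption)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    height.reshape(-1, 1), weight, test_size=0.2, random_state=123
)
lr_model = LinearRegression().fit(X_train, y_train)

# Step 1: range of heights in the training data
min_x = X_train.min()
max_x = X_train.max()
n_points = 200
step_by = (max_x - min_x) / (n_points - 1)  # even spacing, both ends included

# Step 2: evenly spaced heights, then one-element rows for the model
x_values = [min_x + i * step_by for i in range(n_points)]
inputs = [[x] for x in x_values]  # 2D: each input is itself a list
y_line = lr_model.predict(inputs)

print(len(x_values), "heights from", round(x_values[0], 1), "to", round(x_values[-1], 1))
```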

Step 3: Plot the training data and the prediction line:

Do a line plot. It will overlay on the points. Make the color of the line red for better visibility.
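
A possible version of the plot, assuming matplotlib is available. For brevity this sketch uses `np.linspace` to build the evenly spaced heights, which is equivalent to the list comprehension in Step 2, and again runs on synthetic stand-in data (an assumption).

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Howell data (assumption)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    height.reshape(-1, 1), weight, test_size=0.2, random_state=123
)
lr_model = LinearRegression().fit(X_train, y_train)

# 200 evenly spaced heights across the training range, as a (200, 1) array
x_line = np.linspace(X_train.min(), X_train.max(), 200).reshape(-1, 1)
y_line = lr_model.predict(x_line)

fig, ax = plt.subplots(figsize=(8, 5))
ax.scatter(X_train[:, 0], y_train, alpha=0.4, label="Training data")
ax.plot(x_line[:, 0], y_line, color="red", label="Linear fit")  # overlays the points
ax.set_xlabel("Height")
ax.set_ylabel("Weight")
ax.set_title("Height --> Weight (linear regression)")
ax.legend()
```

In a notebook the figure renders when the cell finishes; in a plain script, call `plt.show()` at the end.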

### 3.6 Add a Feature to the Model (height, age --> weight)

We will now include a second feature (age) to improve prediction accuracy.

Use the full Howell dataset with both height and age as input features:

Train the new model:

Evaluate performance on training data:

Now evaluate on the test data:
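
The whole two-feature workflow can be sketched in one cell. The synthetic table below (an assumption, with an invented age effect) stands in for `howell_full`; only the column list changes compared with the single-feature case.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Howell.csv (assumption): use howell_full in the lab.
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "height": rng.uniform(60, 180, size=n),
    "age": rng.uniform(0, 80, size=n),
})
demo["weight"] = (4 + 0.0016 * demo["height"] ** 2
                  + 0.05 * demo["age"] + rng.normal(0, 3, size=n))

X = demo[["height", "age"]]  # two columns now
y = demo["weight"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=123
)

# Fit once with height only, once with height and age, on the same split
lr_height = LinearRegression().fit(X_train[["height"]], y_train)
lr_both = LinearRegression().fit(X_train, y_train)

r2_train_height = r2_score(y_train, lr_height.predict(X_train[["height"]]))
r2_train_both = r2_score(y_train, lr_both.predict(X_train))
r2_test_both = r2_score(y_test, lr_both.predict(X_test))
rmse_test_both = np.sqrt(mean_squared_error(y_test, lr_both.predict(X_test)))

print(f"Training R² (height):      {r2_train_height:.3f}")
print(f"Training R² (height, age): {r2_train_both:.3f}")
print(f"Test R² (height, age):     {r2_test_both:.3f}")
print(f"Test RMSE (height, age):   {rmse_test_both:.2f}")
```

On the training set, adding a feature to ordinary least squares can never lower R², so the test metrics are the ones that show whether age genuinely helps.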

### Reflection 3:

**How accurate was the model?**
The linear regression model shows moderate accuracy with R² values typically around 0.6-0.8. The single feature (height) captures a substantial portion of weight variance, but there's clear room for improvement.

**Underfitting Assessment:**
The model appears to be slightly underfitting - the training and test performance are similar (good sign for generalization), but the R² values suggest the linear relationship doesn't fully capture the height-weight relationship complexity.

**Adding Training Instances:**
The scatter plot shows the linear fit misses some curvature in the data. More training instances alone won't significantly improve performance since the issue is model complexity, not sample size. Because the training and test scores are already similar, the remaining error points to a model that is too rigid rather than to too little data.


## Section 4. Train and Analyze a Polynomial Regression Model

Using height to predict weight, try a polynomial model first with degree 3 then with degree 8. 

### 4.1 Height --> Weight, Degree 3 (Training)

### 4.2 Height --> Weight, Degree 3 (Test)

Evaluate on test data
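
A minimal sketch of the degree 3 model, assuming a scikit-learn pipeline that chains `PolynomialFeatures` with `LinearRegression` (the notebook may have built the features in a separate step instead). The data is a synthetic stand-in (an assumption).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for the Howell data (assumption)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    height.reshape(-1, 1), weight, test_size=0.2, random_state=123
)

# height -> [height, height^2, height^3] -> linear regression
poly3_model = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression(),
)
poly3_model.fit(X_train, y_train)

linear_model = LinearRegression().fit(X_train, y_train)  # baseline for comparison

r2_train_linear = r2_score(y_train, linear_model.predict(X_train))
r2_train_poly3 = r2_score(y_train, poly3_model.predict(X_train))
r2_test_poly3 = r2_score(y_test, poly3_model.predict(X_test))
rmse_test_poly3 = np.sqrt(mean_squared_error(y_test, poly3_model.predict(X_test)))

print(f"Linear   training R²: {r2_train_linear:.3f}")
print(f"Degree 3 training R²: {r2_train_poly3:.3f}")
print(f"Degree 3 test R²:     {r2_test_poly3:.3f}")
print(f"Degree 3 test RMSE:   {rmse_test_poly3:.2f}")
```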


### 4.3 Height --> Weight, Degree 8 (Training)

Increase degree to 8:
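
The sketch below fits degree 3 and degree 8 on the same split so the training/test gap can be compared directly. It standardizes height before expanding it, because raw heights raised to the 8th power become very large numbers; that `StandardScaler` step is my addition for numerical stability, not something the lab prescribes. The helper `make_poly_model` and the synthetic data are also assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Howell data (assumption)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    height.reshape(-1, 1), weight, test_size=0.2, random_state=123
)


def make_poly_model(degree):
    """Hypothetical helper: scale height, expand to polynomial terms, fit OLS."""
    return make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    )


scores = {}
for degree in (3, 8):
    model = make_poly_model(degree).fit(X_train, y_train)
    scores[degree] = {
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    }
    gap = scores[degree]["train_r2"] - scores[degree]["test_r2"]
    print(f"degree {degree}: train R² {scores[degree]['train_r2']:.3f}, "
          f"test R² {scores[degree]['test_r2']:.3f}, gap {gap:+.3f}")
```

A training score that rises while the test score stalls or falls is the overfitting signal to look for when you run this on the real data.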

### 4.4 Height --> Weight, Degree 8 (Test)

### Reflection 4:

**Did the polynomial model improve performance?**
Yes, polynomial regression (especially degree 3) typically improves performance over linear regression, better capturing the natural curvature in height-weight relationships. Degree 3 often shows meaningful R² improvement while maintaining generalization.


**Does it seem to overfit the data?**
Degree 8 polynomial shows classic overfitting signs - excellent training performance but significantly worse test performance. The high-degree model memorizes training noise rather than learning generalizable patterns. Degree 3 strikes a better balance between flexibility and generalization.

## Section 5. Train and Analyze a Regularized Model (Elastic Net)

Elastic Net is a regularized regression model that combines L1 (Lasso) and L2 (Ridge) penalties. It helps prevent overfitting, especially when using high-degree polynomial features.

First apply the model on the training set and then evaluate it on the test set to see how it performs. 

### 5.1 Height --> Weight, Degree 8 (Training)

Predict and evaluate the Elastic Net model:
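
A minimal sketch of a degree 8 Elastic Net model follows. The values `alpha=0.3` and `l1_ratio=0.5` are illustrative starting points, not the lab's settings, and the two `StandardScaler` steps are my addition: the penalty treats all coefficients alike, so the polynomial terms should be on a comparable scale. The data is a synthetic stand-in (an assumption).

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Howell data (assumption)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    height.reshape(-1, 1), weight, test_size=0.2, random_state=123
)

enet_model = make_pipeline(
    StandardScaler(),                                   # keep powers of height manageable
    PolynomialFeatures(degree=8, include_bias=False),
    StandardScaler(),                                   # put all 8 terms on one scale
    ElasticNet(alpha=0.3, l1_ratio=0.5, max_iter=10_000),
)
enet_model.fit(X_train, y_train)

r2_train_enet = r2_score(y_train, enet_model.predict(X_train))
r2_test_enet = r2_score(y_test, enet_model.predict(X_test))
rmse_test_enet = np.sqrt(mean_squared_error(y_test, enet_model.predict(X_test)))

coefs = enet_model[-1].coef_  # coefficients of the final ElasticNet step
print(f"Training R²: {r2_train_enet:.3f}")
print(f"Test R²:     {r2_test_enet:.3f}")
print(f"Test RMSE:   {rmse_test_enet:.2f}")
print("Coefficients set exactly to zero:", int(np.sum(coefs == 0)), "of", coefs.size)
```

The last line is worth watching: the L1 part of the penalty can zero out some polynomial terms entirely, while the L2 part shrinks the rest.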

### 5.2 Height --> Weight, Degree 8 (Test)

### Reflection 5:

**Did regularization improve the performance?**
Elastic Net regularization typically improves generalization performance by penalizing overly complex coefficients. While its training fit may be slightly worse than that of the overfitted degree 8 model, it provides more reliable, consistent predictions on unseen data.

**Did the regularized model reduce overfitting?**

Yes, regularization effectively reduces overfitting by shrinking coefficients and preventing the model from fitting noise. The gap between training and test performance narrows, indicating better generalization. Elastic Net's combination of L1 and L2 penalties provides robust feature selection and coefficient shrinkage.

## Section 6. Final Thoughts & Insights
### 6.1 Summarize Findings

### **Model Performance Summary**

**Instructions:** This table is automatically generated by running the cell below.


| Model | Training Features | Set | RMSE | R² |
|-------|------------------|-----|------|-----|
| Linear Regression | Height | Training | _[Fill in]_ | _[Fill in]_ |
| Linear Regression | Height | Test | _[Fill in]_ | _[Fill in]_ |
| Linear Regression | Height, Age | Training | _[Fill in]_ | _[Fill in]_ |
| Linear Regression | Height, Age | Test | _[Fill in]_ | _[Fill in]_ |
| Polynomial Regression (degree 3) | Height | Training | _[Fill in]_ | _[Fill in]_ |
| Polynomial Regression (degree 3) | Height | Test | _[Fill in]_ | _[Fill in]_ |
| Polynomial Regression (degree 8) | Height | Training | _[Fill in]_ | _[Fill in]_ |
| Polynomial Regression (degree 8) | Height | Test | _[Fill in]_ | _[Fill in]_ |
| Elastic Net (degree 8) | Height | Training | _[Fill in]_ | _[Fill in]_ |
| Elastic Net (degree 8) | Height | Test | _[Fill in]_ | _[Fill in]_ |

**Instructions:** Run all the code cells above and fill in the RMSE and R² values from the output results.
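
If you would rather collect the numbers programmatically than copy them by hand, a sketch like the one below builds the same table layout with pandas. It fits three of the models on a synthetic stand-in table (an assumption), so the values it prints are not the lab's results; rerun it with `howell_full` and add the remaining models to get real entries.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for Howell.csv (assumption)
rng = np.random.default_rng(0)
n = 300
demo = pd.DataFrame({
    "height": rng.uniform(60, 180, size=n),
    "age": rng.uniform(0, 80, size=n),
})
demo["weight"] = (4 + 0.0016 * demo["height"] ** 2
                  + 0.05 * demo["age"] + rng.normal(0, 3, size=n))

train_df, test_df = train_test_split(demo, test_size=0.2, random_state=123)

# (model name, feature columns, unfitted estimator)
candidates = [
    ("Linear Regression", ["height"], LinearRegression()),
    ("Linear Regression", ["height", "age"], LinearRegression()),
    ("Polynomial Regression (degree 3)", ["height"],
     make_pipeline(PolynomialFeatures(degree=3, include_bias=False), LinearRegression())),
]

rows = []
for name, features, estimator in candidates:
    estimator.fit(train_df[features], train_df["weight"])
    for split_name, split_df in (("Training", train_df), ("Test", test_df)):
        pred = estimator.predict(split_df[features])
        rows.append({
            "Model": name,
            "Training Features": ", ".join(features),
            "Set": split_name,
            "RMSE": round(float(np.sqrt(mean_squared_error(split_df["weight"], pred))), 3),
            "R²": round(float(r2_score(split_df["weight"], pred)), 3),
        })

summary = pd.DataFrame(rows)
print(summary.to_string(index=False))
```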

### Reflection 6:

**How well did the models perform?**
Polynomial degree 3 ≥ Linear with Age > Linear Height-only > Elastic Net > Polynomial degree 8. The sweet spot appears to be moderate complexity models that balance fit and generalization.

**Which model overfit the data?**
Polynomial degree 8 clearly overfitted - high training performance but poor test generalization. The model became too flexible and memorized training noise rather than learning meaningful patterns.

**Did the regularized model reduce overfitting?**
Yes, Elastic Net successfully reduced overfitting compared to degree 8 polynomial. While slightly lower performance, it provides more stable, reliable predictions with better training-test consistency.



**How did adding age impact the results?**
Adding age as a second feature generally improved model performance by capturing additional variance in weight. The multi-feature linear model often outperforms single-feature approaches, demonstrating the value of relevant additional predictors in regression tasks.

### Playing with Hyperparameters

Test different degrees for polynomial regression.

Try varying alpha and l1_ratio for Elastic Net.
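
One way to organize these experiments is a pair of small loops that record test R² for each setting. The grids below (degrees 1 to 8, three values each for alpha and l1_ratio) are arbitrary illustrative choices, and the data is a synthetic stand-in (an assumption). Some settings may emit a convergence warning; raising `max_iter` usually resolves it.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the Howell data (assumption)
rng = np.random.default_rng(0)
height = rng.uniform(60, 180, size=300)
weight = 4 + 0.0016 * height**2 + rng.normal(0, 3, size=300)
X_train, X_test, y_train, y_test = train_test_split(
    height.reshape(-1, 1), weight, test_size=0.2, random_state=123
)

# Experiment 1: polynomial degree
degree_results = []
for degree in (1, 2, 3, 5, 8):
    model = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression(),
    ).fit(X_train, y_train)
    degree_results.append({
        "degree": degree,
        "train_r2": r2_score(y_train, model.predict(X_train)),
        "test_r2": r2_score(y_test, model.predict(X_test)),
    })

# Experiment 2: Elastic Net strength (alpha) and L1/L2 mix (l1_ratio) at degree 8
enet_results = []
for alpha in (0.01, 0.1, 1.0):
    for l1_ratio in (0.2, 0.5, 0.8):
        model = make_pipeline(
            StandardScaler(),
            PolynomialFeatures(degree=8, include_bias=False),
            StandardScaler(),
            ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=50_000),
        ).fit(X_train, y_train)
        enet_results.append({
            "alpha": alpha,
            "l1_ratio": l1_ratio,
            "test_r2": r2_score(y_test, model.predict(X_test)),
        })

for row in degree_results:
    print(row)
for row in enet_results:
    print(row)
```

Strictly speaking, choosing hyperparameters by test score leaks information from the test set; for anything beyond a classroom exploration, a validation split or cross-validation would be the safer choice.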