# CS M148 Project Check-In 2

Due Date: October 17, 2025 at 11:59 P.M.

This notebook documents progress for the regression check-in:

1. Choose any numeric response variable from the dataset to model.
2. Choose one or more predictor variables.
3. Model a regression and compute evaluation metrics on training and validation splits.
4. Briefly discuss whether there is overfitting or underfitting.
5. Use one regularization technique and evaluate its performance.
6. Include code and explanations for all the above.

Dataset: Spotify Tracks (Hugging Face) — `maharshipandya/spotify-tracks-dataset`


In [2]:
# Imports
from datasets import load_dataset
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Display and plotting defaults
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (12, 6)

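# Load dataset from Hugging Face.
# Note: "Invalid pattern: '**' can only be an entire path component" usually points to
# an outdated `datasets` release paired with a newer `fsspec`; upgrading `datasets`
# (pip install -U datasets) typically resolves it.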
ds = load_dataset('maharshipandya/spotify-tracks-dataset')
df = ds['train'].to_pandas()

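# Basic cleaning: a leftover row-index column (if present) would make every row
# unique and turn drop_duplicates into a no-op, so drop it first, then reset index
df = df.drop(columns=['Unnamed: 0'], errors='ignore')
df = df.drop_duplicates().reset_index(drop=True)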

print(df.shape)
df.head()


ValueError: Invalid pattern: '**' can only be an entire path component

In [None]:
# Select target and predictors, split data

# Choose a numeric response. We'll predict 'popularity' (0-100).
# Select a subset of sensible numeric audio features as predictors.
num_features = [
    'danceability', 'energy', 'loudness', 'speechiness', 'acousticness',
    'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms'
]

available = [c for c in num_features if c in df.columns]
missing = sorted(set(num_features) - set(available))
if missing:
    print('Missing columns skipped:', missing)

# Keep rows with no NaNs in used columns
model_df = df.dropna(subset=available + ['popularity']).copy()

X = model_df[available]
y = model_df['popularity']

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

X_train.shape, X_val.shape


In [None]:
# Baseline Linear Regression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

baseline = Pipeline([
    ('scaler', StandardScaler(with_mean=True, with_std=True)),
    ('lr', LinearRegression())
])

baseline.fit(X_train, y_train)

pred_train = baseline.predict(X_train)
pred_val = baseline.predict(X_val)

mse_train = mean_squared_error(y_train, pred_train)
mse_val = mean_squared_error(y_val, pred_val)
rmse_train = np.sqrt(mse_train)
rmse_val = np.sqrt(mse_val)
r2_train = r2_score(y_train, pred_train)
r2_val = r2_score(y_val, pred_val)

print({'rmse_train': rmse_train, 'rmse_val': rmse_val, 'r2_train': r2_train, 'r2_val': r2_val})


NameError: name 'X_train' is not defined
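
One way to put the baseline's RMSE and R^2 in context is to compare it against a model that always predicts the training mean. The sketch below does this on synthetic stand-in data, not the Spotify dataset; the names `X_demo`, `y_demo`, and `rmse_r2` are hypothetical and only for illustration.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): weak linear signal buried in noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 5))
y_demo = 1.0 * X_demo[:, 0] + rng.normal(scale=3.0, size=2000)

Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

def rmse_r2(model, X, y):
    """Return (RMSE, R^2) of a fitted model on the given split."""
    pred = model.predict(X)
    return float(np.sqrt(mean_squared_error(y, pred))), float(r2_score(y, pred))

# Mean-only reference model versus ordinary least squares.
mean_model = DummyRegressor(strategy='mean').fit(Xd_train, yd_train)
linear_model = LinearRegression().fit(Xd_train, yd_train)

results = {}
for name, model in [('mean-only', mean_model), ('linear', linear_model)]:
    results[name] = {
        'train': rmse_r2(model, Xd_train, yd_train),
        'val': rmse_r2(model, Xd_val, yd_val),
    }
    print(name, results[name])
```

If the linear model's validation RMSE is only marginally below the mean-only model's, and train and validation scores are close, that pattern is consistent with underfitting rather than overfitting.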

In [None]:
# Ridge regularization with cross-validation
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

ridge_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('ridge', Ridge(random_state=42))
])

param_grid = {
    'ridge__alpha': np.logspace(-3, 3, 13)
}

cv = GridSearchCV(
    estimator=ridge_pipe,
    param_grid=param_grid,
    scoring='neg_mean_squared_error',
    cv=5,
    n_jobs=-1,
    refit=True
)

cv.fit(X_train, y_train)

best_model = cv.best_estimator_
pred_train_ridge = best_model.predict(X_train)
pred_val_ridge = best_model.predict(X_val)

mse_train_ridge = mean_squared_error(y_train, pred_train_ridge)
mse_val_ridge = mean_squared_error(y_val, pred_val_ridge)
rmse_train_ridge = np.sqrt(mse_train_ridge)
rmse_val_ridge = np.sqrt(mse_val_ridge)
r2_train_ridge = r2_score(y_train, pred_train_ridge)
r2_val_ridge = r2_score(y_val, pred_val_ridge)

print('Best alpha:', cv.best_params_['ridge__alpha'])
print({'rmse_train_ridge': rmse_train_ridge, 'rmse_val_ridge': rmse_val_ridge, 'r2_train_ridge': r2_train_ridge, 'r2_val_ridge': r2_val_ridge})
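
To see what the penalty strength does to the fitted coefficients, it can help to sweep `alpha` and watch the coefficient norm shrink. The following is a minimal sketch on synthetic data with two nearly collinear columns (an assumption chosen to make the effect visible); `X_demo`, `y_demo`, and `coef_norms` are hypothetical names, not part of the notebook above.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): two nearly collinear predictors
# plus one independent predictor that has no effect on the response.
rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
X_demo = np.column_stack([
    base,
    base + rng.normal(scale=0.05, size=n),
    rng.normal(size=n),
])
y_demo = 2.0 * base + rng.normal(scale=1.0, size=n)

alphas = np.logspace(-3, 3, 7)
coef_norms = []
for alpha in alphas:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('ridge', Ridge(alpha=alpha)),
    ])
    pipe.fit(X_demo, y_demo)
    coefs = pipe.named_steps['ridge'].coef_
    coef_norms.append(float(np.linalg.norm(coefs)))
    print(f'alpha={alpha:g}  coefs={np.round(coefs, 3)}  L2 norm={coef_norms[-1]:.3f}')
```

The L2 norm of the ridge coefficients does not increase as `alpha` grows. With many rows and few, weakly correlated predictors, the shrinkage at moderate `alpha` tends to be small, which would explain a ridge fit that looks almost identical to the unregularized baseline.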


In [None]:
# Residual plots and coefficients

# Residuals for baseline
residuals_train = y_train - pred_train
residuals_val = y_val - pred_val

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
sns.histplot(residuals_train, kde=True, ax=axes[0])
axes[0].set_title('Baseline residuals (train)')
sns.scatterplot(x=pred_val, y=residuals_val, s=10, ax=axes[1])
axes[1].axhline(0, color='red', linestyle='--')
axes[1].set_xlabel('Predicted popularity (val)')
axes[1].set_ylabel('Residuals')
axes[1].set_title('Baseline residuals vs prediction (val)')
plt.show()

# Coefficients for Ridge (fitted on standardized features, so magnitudes are comparable)
ridge = best_model.named_steps['ridge']
coefs = pd.Series(ridge.coef_, index=available)
coefs.sort_values().plot(kind='barh', figsize=(8,6), title='Ridge coefficients')
plt.tight_layout()
plt.show()
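
A residual-versus-prediction scatter can be hard to read with many points. A rough numeric complement is to bin the predictions and compare the residual standard deviation per bin. The sketch below uses synthetic residuals whose spread grows with the prediction (an assumption made purely for illustration); `pred_demo`, `resid_demo`, and `spread` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (assumption): noise whose standard deviation
# is proportional to the predicted value, i.e. heteroskedastic by design.
rng = np.random.default_rng(0)
pred_demo = rng.uniform(20, 50, size=5000)
resid_demo = rng.normal(scale=0.2 * pred_demo)

demo = pd.DataFrame({'pred': pred_demo, 'resid': resid_demo})

# Split predictions into five equal-count bins and measure residual spread in each.
demo['pred_bin'] = pd.qcut(demo['pred'], q=5)
spread = demo.groupby('pred_bin', observed=True)['resid'].std()
print(spread)
```

A spread that rises steadily from the lowest to the highest bin is the numeric counterpart of a fan-shaped residual plot. In the notebook, the same idea could be applied with `pred_val` and `residuals_val` in place of the synthetic arrays.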


### Explanation

- We model `popularity` using numeric audio features: danceability, energy, loudness, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_ms. Data are split 80/20 and features are standardized inside the pipelines.
- Baseline `LinearRegression` yields train RMSE ≈ 22.1 and validation RMSE ≈ 22.0, with R^2 ≈ 0.02 on both splits. Because train and validation errors are nearly identical, there is no sign of overfitting. The low R^2 means the model explains only about 2% of the variance in popularity, which points to underfitting: these audio features carry little linear signal.
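- Ridge with 5-fold cross-validation over α from 10^-3 to 10^3 selects α ≈ 100. Validation RMSE changes only slightly and R^2 remains ≈ 0.02, so regularization does not materially improve accuracy, which is expected when the unregularized model is not overfitting to begin with. Its main effect is to shrink the coefficients toward zero. On the standardized scale, the largest coefficients are negative for `instrumentalness` and positive for `danceability`; the others are small.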
- Residuals are centered around 0 but show increasing spread at higher predicted popularity, suggesting heteroskedasticity and missing nonlinear or categorical effects. Next steps could include adding richer features (e.g., release year, artist/genre indicators, playlist counts) and trying nonlinear models (polynomial terms, tree ensembles).
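
As a rough illustration of the "polynomial terms" next step, the sketch below compares a plain linear pipeline with one that adds degree-2 terms. It runs on synthetic data with a built-in quadratic effect (an assumption, so the gain shown here says nothing about the Spotify features); `X_demo`, `y_demo`, and `scores` are hypothetical names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data (assumption): the response depends on the square
# of the first feature, which a purely linear model cannot capture.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(2000, 3))
y_demo = X_demo[:, 0] ** 2 + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=2000)

Xd_train, Xd_val, yd_train, yd_val = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

linear = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LinearRegression()),
])
poly2 = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
    ('lr', LinearRegression()),
])

# Fit both pipelines and record (train R^2, validation R^2) for each.
scores = {}
for name, model in [('linear', linear), ('poly2', poly2)]:
    model.fit(Xd_train, yd_train)
    scores[name] = (
        float(r2_score(yd_train, model.predict(Xd_train))),
        float(r2_score(yd_val, model.predict(Xd_val))),
    )
    print(name, 'train R^2 = %.3f, val R^2 = %.3f' % scores[name])
```

Because polynomial expansion increases the number of features, the train/validation gap is worth rechecking afterwards, and regularization such as Ridge may become more useful than it was for the ten-feature baseline.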
