<h1 align="center">Overall Aim of this Notebook</h1>

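This notebook uses several ML algorithms (CatBoost, ridge regression, Linear GAM, a neural network, SVM, MARS and KNN) to predict sale price from most of the features in the dataset. The notebook is split into two halves. In the first half, I drop `original_price`, `broker_quote`, `booking_down_pymnt` and `emi_starts_from` and see how the algorithms perform. In the second half, I add `original_price` and `broker_quote` back in, leave out only the two perfectly correlated columns, and repeat the process.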

In [1]:
# for data manip
import numpy as np
import pandas as pd
import scipy.sparse as sp

# for preprocessing
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler, PolynomialFeatures, FunctionTransformer
from sklearn.compose import ColumnTransformer, make_column_selector as selector
from skrub import TableVectorizer

# modeling
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from catboost import CatBoostRegressor
from pygam import LinearGAM, s, f
from pygam.terms import TermList
from sklearn.neural_network import MLPRegressor
from sklearn.compose import TransformedTargetRegressor
from sklearn.svm import SVR
from pyearth import Earth
from sklearn.neighbors import KNeighborsRegressor

# for pipeline
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, RegressorMixin

# for utilities
from functools import reduce
import operator

In [2]:
df = pd.read_csv('Used_Car_Price_Prediction.csv')
df

Unnamed: 0,car_name,yr_mfr,fuel_type,kms_run,sale_price,city,times_viewed,body_type,transmission,variant,...,total_owners,broker_quote,original_price,car_rating,ad_created_on,fitness_certificate,emi_starts_from,booking_down_pymnt,reserved,warranty_avail
0,maruti swift,2015,petrol,8063,386399,noida,18715,hatchback,manual,lxi opt,...,2,397677,404177.0,great,2021-04-04T07:09:18.583,True,8975,57960,False,False
1,maruti alto 800,2016,petrol,23104,265499,noida,2676,hatchback,manual,lxi,...,1,272935,354313.0,great,2021-03-22T14:07:32.833,True,6167,39825,False,False
2,hyundai grand i10,2017,petrol,23402,477699,noida,609,hatchback,manual,sports 1.2 vtvt,...,1,469605,,great,2021-03-20T05:36:31.311,True,11096,71655,False,False
3,maruti swift,2013,diesel,39124,307999,noida,6511,hatchback,manual,vdi,...,1,294262,374326.0,great,2021-01-21T12:59:19.299,True,7154,46200,False,False
4,hyundai grand i10,2015,petrol,22116,361499,noida,3225,hatchback,manual,magna 1.2 vtvt,...,1,360716,367216.0,great,2021-04-01T13:33:40.733,True,8397,54225,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7395,honda amaze,2018,diesel,53486,604299,ghaziabad,2756,sedan,,1.5 v cvt i-dtec,...,1,630810,787750.0,great,2021-02-07T08:05:30.443,True,14036,90645,True,False
7396,maruti ignis,2018,petrol,8854,562599,chennai,640,hatchback,manual,delta 1.2 k12,...,1,549440,,great,2021-03-31T10:21:56.289,True,13068,84390,False,False
7397,honda amaze,2015,petrol,46300,400499,pune,795,sedan,manual,1.2 smt i vtec,...,1,383419,,great,2021-03-04T12:40:38.652,True,9303,60075,True,False
7398,maruti alto k10,2016,petrol,27245,284099,new delhi,1155,hatchback,manual,lxi,...,1,286515,369885.0,great,2021-03-16T13:31:39.766,True,6599,42615,False,False


### This section aims to look at the null values

In [3]:
print(df.isna().sum().sum())

4723


### Here I was interested in where the null values were:

In [4]:
df.isna().sum()

car_name                  0
yr_mfr                    0
fuel_type                 0
kms_run                   0
sale_price                0
city                      0
times_viewed              0
body_type               103
transmission            556
variant                   0
assured_buy               0
registered_city          10
registered_state         10
is_hot                    0
rto                       0
source                  126
make                      0
model                     0
car_availability        620
total_owners              0
broker_quote              0
original_price         3280
car_rating                9
ad_created_on             1
fitness_certificate       8
emi_starts_from           0
booking_down_pymnt        0
reserved                  0
warranty_avail            0
dtype: int64
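
The raw counts are easier to judge as a share of rows. Below is a minimal sketch on a small made-up frame; the `toy` data and the `missing_share` helper are illustrative assumptions, not part of the dataset or the analysis in this notebook.

```python
import numpy as np
import pandas as pd

def missing_share(frame: pd.DataFrame) -> pd.Series:
    """Return the fraction of missing values per column, largest first."""
    return frame.isna().mean().sort_values(ascending=False)

# Hypothetical toy data standing in for the real table
toy = pd.DataFrame({
    "original_price": [404177.0, np.nan, np.nan, 374326.0],
    "transmission": ["manual", None, "manual", "manual"],
    "kms_run": [8063, 23104, 23402, 39124],
})

shares = missing_share(toy)
print(shares)
```

Calling `missing_share(df)` on the real frame would rank the columns the same way as the counts above, but on a 0 to 1 scale.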

### I saw that the original_price column had a significant proportion of missing values (3280), so I tried dropping it later.
### Next, I wanted to see the correlations among the continuous columns:

In [5]:
# select continuous columns and find correlations
full_df_num_cols = df.select_dtypes(include=[np.number]).columns
correlation_matrix = df[full_df_num_cols].corr()

print(correlation_matrix)

                      yr_mfr   kms_run  sale_price  times_viewed  \
yr_mfr              1.000000 -0.395842    0.518973      0.059617   
kms_run            -0.395842  1.000000   -0.104727     -0.114795   
sale_price          0.518973 -0.104727    1.000000      0.091579   
times_viewed        0.059617 -0.114795    0.091579      1.000000   
total_owners       -0.301315  0.133000   -0.131306     -0.001361   
broker_quote        0.543350 -0.126691    0.963484      0.123785   
original_price      0.508361 -0.087615    0.986005      0.103439   
emi_starts_from     0.518972 -0.104728    1.000000      0.091579   
booking_down_pymnt  0.518973 -0.104727    1.000000      0.091579   

                    total_owners  broker_quote  original_price  \
yr_mfr                 -0.301315      0.543350        0.508361   
kms_run                 0.133000     -0.126691       -0.087615   
sale_price             -0.131306      0.963484        0.986005   
times_viewed           -0.001361      0.123785        0

### I saw that sale_price has a perfect correlation with emi_starts_from and booking_down_pymnt, so I dropped those columns to avoid data leakage. For the same reason, I dropped broker_quote (correlation 0.963 with sale_price). I also dropped original_price at this stage because of its missing values.
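
A correlation of 1.0 usually means a column is computed from the target. In the rows shown earlier, `booking_down_pymnt` is roughly 15% of `sale_price` (for example 57960 against 386399). The sketch below shows one way such columns could be flagged automatically; the `leakage_candidates` helper, the 0.95 threshold and the synthetic `toy` frame are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def leakage_candidates(frame, target, threshold=0.95):
    """List numeric columns whose |Pearson r| with the target exceeds threshold."""
    corr = frame.select_dtypes(include=[np.number]).corr()[target].drop(target)
    return sorted(corr[corr.abs() > threshold].index)

# Synthetic data: one column derived from the target, one unrelated column
rng = np.random.default_rng(26)
price = rng.uniform(100_000, 900_000, size=200)
toy = pd.DataFrame({
    "sale_price": price,
    "down_payment": 0.15 * price,                      # exact multiple of the target
    "kms_run": rng.uniform(1_000, 90_000, size=200),   # independent of the target
})

flagged = leakage_candidates(toy, "sale_price")
print(flagged)
```

A high correlation alone does not prove leakage, so a flagged column still needs a judgment call about whether it would be known at prediction time.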

In [6]:
df_reduced = df.drop(['original_price','broker_quote', 'booking_down_pymnt', 'emi_starts_from'], axis=1) 

In [7]:
df_reduced.columns

Index(['car_name', 'yr_mfr', 'fuel_type', 'kms_run', 'sale_price', 'city',
       'times_viewed', 'body_type', 'transmission', 'variant', 'assured_buy',
       'registered_city', 'registered_state', 'is_hot', 'rto', 'source',
       'make', 'model', 'car_availability', 'total_owners', 'car_rating',
       'ad_created_on', 'fitness_certificate', 'reserved', 'warranty_avail'],
      dtype='object')

In [8]:
full_df_num_cols = df_reduced.select_dtypes(include=[np.number]).columns
correlation_matrix = df_reduced[full_df_num_cols].corr()

print(correlation_matrix)

                yr_mfr   kms_run  sale_price  times_viewed  total_owners
yr_mfr        1.000000 -0.395842    0.518973      0.059617     -0.301315
kms_run      -0.395842  1.000000   -0.104727     -0.114795      0.133000
sale_price    0.518973 -0.104727    1.000000      0.091579     -0.131306
times_viewed  0.059617 -0.114795    0.091579      1.000000     -0.001361
total_owners -0.301315  0.133000   -0.131306     -0.001361      1.000000


### The correlations now seem to be in an acceptable range

In [9]:
# create target column and predictor matrix
X = df_reduced.drop(columns=["sale_price"])
y = df_reduced["sale_price"] 

In [10]:
# make train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=26       
)

In [11]:
# create catboost pipeline using a table vectorizer and knn imputation
pipe = Pipeline([
    ("tv", TableVectorizer()),
    ("knn", KNNImputer(n_neighbors=5)),
    ("catboost", CatBoostRegressor(
        depth=8, learning_rate=0.05, n_estimators=800,
        loss_function="RMSE", verbose=False
    ))
])

In [12]:
# fit the catboost regressor
pipe.fit(X_train, y_train)

In [13]:
# predict the catboost regressor on the x test set
y_pred = pipe.predict(X_test)

In [14]:
# find the RMSE and R^2 on the test set for the single catboost regressor model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 69404.599, R²: 0.943
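
For reference, both metrics can be written out by hand. This is a small sketch with made-up numbers; `rmse_and_r2` is a hypothetical helper, and the notebook itself keeps using `mean_squared_error` and `r2_score`. RMSE is in the same units as `sale_price`, while R² is unitless.

```python
import numpy as np

def rmse_and_r2(y_true, y_pred):
    """Compute RMSE and R^2 from their textbook definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    rmse = np.sqrt(np.mean(residuals ** 2))
    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    return float(rmse), float(1.0 - ss_res / ss_tot)

# Made-up values, only to exercise the function
rmse_toy, r2_toy = rmse_and_r2([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(f"RMSE: {rmse_toy:.3f}, R²: {r2_toy:.3f}")
```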


In [15]:
# select continuous and categorical columns for the ridge regression using polynomial features
num_cols = X.select_dtypes(include=[np.number]).columns
cat_cols = X.select_dtypes(exclude=[np.number]).columns

In [16]:
# set up the ridge regression pipeline with polynomial features for the continuous features
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler())
])
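
To make the `degree=2` setting concrete, here is a hand-rolled sketch of the expansion using only NumPy. `degree2_features` is a hypothetical stand-in written for illustration; the pipeline itself uses `PolynomialFeatures`. Each original column is kept, and every square and pairwise product is added.

```python
from itertools import combinations_with_replacement

import numpy as np

def degree2_features(X):
    """Return the original columns followed by all squares and pairwise products."""
    X = np.asarray(X, dtype=float)
    n = X.shape[1]
    cols = [X[:, i] for i in range(n)]
    cols += [X[:, i] * X[:, j]
             for i, j in combinations_with_replacement(range(n), 2)]
    return np.column_stack(cols)

# Two features a=2, b=3 expand to a, b, a^2, a*b, b^2
expanded = degree2_features([[2.0, 3.0]])
print(expanded)

# Four numeric inputs give 4 linear terms plus 10 second-order terms
n_out = degree2_features(np.ones((1, 4))).shape[1]
print(n_out)
```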

In [17]:
# set up the ridge regression pipeline for the categorical features (most-frequent imputation, then one-hot encoding)
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore",
                          min_frequency=0.01,
                          sparse_output=True))
])

In [18]:
# set up the preprocessing for the different types of features
pre = ColumnTransformer(
    transformers=[
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols),
    ],)

In [19]:
# set up the pipeline for the ridge regression
pipe = Pipeline([
    ("pre", pre),
    ("ridge", Ridge(solver="sag", random_state=26))
])

In [20]:
# set up the parameter grid for the ridge regression grid search
param_grid = {
    "pre__num__poly__degree": [1, 2], 
    "ridge__alpha": [1.0, 10.0, 100.0]
}

In [21]:
# set up the search across the parameter grid for the ridge regression
search = GridSearchCV(
    pipe, param_grid, cv=KFold(n_splits=5, shuffle=True, random_state=26),
    scoring="neg_root_mean_squared_error",
    n_jobs=1,   
    verbose=1
)

In [22]:
# fit the grid to the training data for the ridge regression
search.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


In [23]:
# find the best cross-validated RMSE (mean over the 5 folds) and the best parameters
print("Best RMSE:", -search.best_score_)
print("Best params:", search.best_params_)

Best RMSE: 144558.7630434305
Best params: {'pre__num__poly__degree': 2, 'ridge__alpha': 10.0}
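
The "Best RMSE" printed here is the mean validation RMSE across the five folds, not the error on the training rows. The sketch below mimics that procedure in plain NumPy on synthetic data. It swaps in a closed-form ridge solver instead of the `sag` solver, and `ridge_fit`, `cv_rmse` and the toy data are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge with an unpenalized intercept (via centering)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return coef, y_mean - x_mean @ coef

def cv_rmse(X, y, alpha, n_splits=5, seed=26):
    """Mean validation RMSE over shuffled K folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, n_splits)
    scores = []
    for k in range(n_splits):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        coef, intercept = ridge_fit(X[train], y[train], alpha)
        pred = X[val] @ coef + intercept
        scores.append(np.sqrt(np.mean((y[val] - pred) ** 2)))
    return float(np.mean(scores))

# Synthetic linear data with a little noise
rng = np.random.default_rng(26)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

scores = {alpha: cv_rmse(X_toy, y_toy, alpha) for alpha in (1.0, 10.0, 100.0)}
best_alpha = min(scores, key=scores.get)
print(scores, best_alpha)
```

Because each fold's score comes from rows the model did not see, this number is a fairer guide to test performance than a training error would be.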


In [24]:
# select the best performing parameters and predict the y using those
best_pipe = search.best_estimator_
y_pred = best_pipe.predict(X_test)

In [25]:
# find the test RMSE and R^2 for the ridge regression
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 156899.543, R²: 0.707


In [26]:
# make the preprocessing pipeline for the Linear GAM model
pre = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='mean')),
        ('scale', StandardScaler())
    ]), selector(dtype_include=np.number)),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
    ]), selector(dtype_exclude=np.number))
])

In [27]:
# make new training and testing matrices
X_tr = pre.fit_transform(X_train)
X_te = pre.transform(X_test)
y_tr = np.asarray(y_train).ravel()
n_cont = pre.named_transformers_['num'].n_features_in_
n_cat  = X_tr.shape[1] - n_cont

In [28]:
# fit the single Linear GAM model
term_pieces = [s(i, n_splines=10) for i in range(n_cont)] + \
              [f(n_cont + j) for j in range(n_cat)]
terms = reduce(operator.add, term_pieces)
gam = LinearGAM(terms, lam=1.0, max_iter=400, tol=1e-4)
gam.fit(X_tr, y_tr)

LinearGAM(callbacks=[Deviance(), Diffs()], fit_intercept=True, 
   max_iter=400, scale=None, 
   terms=s(0) + s(1) + s(2) + s(3) + f(4) + f(5) + f(6) + f(7) + f(8) + f(9) + f(10) + f(11) + f(12) + f(13) + f(14) + f(15) + f(16) + f(17) + f(18) + f(19) + f(20) + f(21) + f(22) + f(23) + intercept,
   tol=0.0001, verbose=False)

In [29]:
# deal with NaN numbers to allow prediction on the test data, then predict
cat_start = n_cont
cat_end = n_cont + n_cat

X_te[:, cat_start:cat_end] = np.nan_to_num(
    X_te[:, cat_start:cat_end], nan=0.0, posinf=0.0, neginf=0.0
)

mins = X_tr[:, cat_start:cat_end].min(axis=0)
maxs = X_tr[:, cat_start:cat_end].max(axis=0)
X_te[:, cat_start:cat_end] = np.clip(X_te[:, cat_start:cat_end], mins, maxs)

y_pred = gam.predict(X_te)
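
The clipping step matters because `OrdinalEncoder(..., unknown_value=-1)` writes -1 for any category that only appears in the test set, a level the fitted factor terms never saw. A small sketch with hypothetical codes for a single column shows what the clip does:

```python
import numpy as np

# Ordinal codes seen during training for one hypothetical categorical column
train_codes = np.array([0.0, 1.0, 2.0, 1.0, 0.0])
# Test codes: -1 marks a category that was absent from the training data
test_codes = np.array([2.0, -1.0, 1.0, -1.0])

# Clip the test codes into the range observed in training
lo, hi = train_codes.min(), train_codes.max()
clipped = np.clip(test_codes, lo, hi)

# Count how many test rows were silently remapped
n_remapped = int((test_codes != clipped).sum())
print(clipped, n_remapped)
```

One side effect worth keeping in mind: every unseen category is treated as the category with code 0, so its predicted effect is borrowed from an unrelated level.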


In [30]:
# find test RMSE and R^2 for the linear GAM
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 93708.590, R²: 0.896


In [31]:
# select continuous and categorical variables for the Neural Network
num_sel = selector(dtype_include=np.number)
cat_sel = selector(dtype_exclude=np.number)

# set up imputation for the continuous variables and also center/scale them
num_pipe = Pipeline([
    ("impute_num", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler())
])

# set up categorical imputation and one-hot encode them
cat_pipe = Pipeline([
    ("impute_cat", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

# set up separate preprocessing for the different types of variables
tab = ColumnTransformer([
    ("num", num_pipe, num_sel),
    ("cat", cat_pipe, cat_sel),
])

# set up the Neural network 
mlp = MLPRegressor(
    hidden_layer_sizes=(256, 128),
    activation="relu",
    solver="adam",
    learning_rate_init=1e-3,
    alpha=1e-3,            
    batch_size=256,
    max_iter=1000,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.15,
    random_state=26
)

# set up the overall pipeline
pipe = Pipeline([
    ("tab", tab),
    ("reg", TransformedTargetRegressor(
        regressor=mlp,
        transformer=StandardScaler()
    ))
])
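
Wrapping the network in `TransformedTargetRegressor` with a `StandardScaler` means the model trains on a standardized target and its predictions are mapped back to price units. A rough NumPy sketch of that round trip is below; it uses five sale prices from the table preview at the top of the notebook, and the scaled predictions are made-up placeholders rather than model output.

```python
import numpy as np

# Five sale prices taken from the table preview in this notebook
y = np.array([265499.0, 307999.0, 361499.0, 386399.0, 477699.0])

# Standardize: zero mean, unit (population) standard deviation
mu, sigma = y.mean(), y.std()
y_scaled = (y - mu) / sigma

# Placeholder predictions on the scaled axis (not real model output)
pred_scaled = np.array([0.0, 1.0])

# Inverse transform back to price units
pred = pred_scaled * sigma + mu
print(pred)
```

A scaled prediction of 0 maps back to the mean price, and 1 maps to one standard deviation above it.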

In [32]:
# fit the neural network
pipe.fit(X_train, y_train)

In [33]:
# predict the y values for the neural network on the test set
y_pred = pipe.predict(X_test)

In [34]:
# find the test RMSE and R^2 for the neural network
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 71143.759, R²: 0.940


In [35]:
# set up selectors for continuous and categorical data (for later SVM)
num_sel = selector(dtype_include=np.number)
cat_sel = selector(dtype_exclude=np.number)

# set up a continuous preprocessing pipeline
num_pipe = Pipeline([
    ("impute_num", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler())
])

# set up a categorical preprocessing pipeline
cat_pipe = Pipeline([
    ("impute_cat", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

# set up a transformer for the different types of variables
tab = ColumnTransformer([
    ("num", num_pipe, num_sel),
    ("cat", cat_pipe, cat_sel),
])

# function to make the array dense to help with the SVM fitting
to_dense = FunctionTransformer(
    lambda X: X.toarray() if sp.issparse(X) else X, accept_sparse=True
)

# set up the SVM with radial basis function and given cost
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")

# SVM pipeline
pipe_svr = Pipeline([
    ("tab", tab),
    ("dense", to_dense),  
    ("reg", TransformedTargetRegressor(
        regressor=svr,
        transformer=StandardScaler()
    ))
])

In [36]:
# fit the SVM model
pipe_svr.fit(X_train, y_train)

In [37]:
# predict the SVM y test values
y_pred = pipe_svr.predict(X_test)

In [38]:
# find the test RMSE and R^2 for the SVM model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 81217.178, R²: 0.922


In [43]:
# set up a pipeline for the MARS model
# make numerical and categorical selector
num_sel = selector(dtype_include=np.number)
cat_sel = selector(dtype_exclude=np.number)

# make a transformer for each type of variable
pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", KNNImputer(n_neighbors=5)),  
            ("scale", StandardScaler(with_mean=True))
        ]), num_sel),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
        ]), cat_sel),
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

# make main MARS pipeline
pipe = Pipeline([
    ("prep", pre),
    ("mars", Earth(
        max_degree=2,
        enable_pruning=True,
        penalty=3,
        max_terms=13,
        minspan_alpha=0.5,
        endspan_alpha=0.5
    ))
])

In [44]:
# fit the MARS model
pipe.fit(X_train, y_train)

  pruning_passer.run()
  coef, resid = np.linalg.lstsq(B, weighted_y[:, i])[0:2]


In [45]:
# predict the y values on the test set for the MARS model
y_pred = pipe.predict(X_test)

In [46]:
# find the test RMSE and R^2 for the MARS model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 151423.782, R²: 0.727


In [47]:
# set up the KNN pipeline
knn_pipe = Pipeline([
    ("tv", TableVectorizer()),                  
    ("impute", KNNImputer(n_neighbors=5)),      
    ("scale", StandardScaler()),                
    ("knn", KNeighborsRegressor(
        n_neighbors=10,
        weights="distance",     
        metric="minkowski", p=2
    ))
])

In [48]:
# fit the KNN model
knn_pipe.fit(X_train, y_train)

In [49]:
# predict the test y-values
y_pred = knn_pipe.predict(X_test)

In [50]:
# find the test RMSE and R^2
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 162813.168, R²: 0.685
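
With `weights="distance"`, closer neighbours count more towards the prediction than distant ones. The following is a minimal NumPy sketch of inverse-distance weighting with Euclidean distance; `knn_predict` and the toy points are illustrative assumptions, not the library's internals.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict one value as the inverse-distance weighted mean of k neighbours."""
    dists = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]
    d = dists[nearest]
    # Exact matches would get infinite weight, so average them directly
    if np.any(d == 0):
        return float(y_train[nearest][d == 0].mean())
    w = 1.0 / d
    return float(np.sum(w * y_train[nearest]) / np.sum(w))

# Toy one-dimensional training set
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([0.0, 10.0, 30.0])

pred_between = knn_predict(X_toy, y_toy, np.array([2.0]), k=2)
pred_exact = knn_predict(X_toy, y_toy, np.array([1.0]), k=2)
print(pred_between, pred_exact)
```

Because the distances are computed across all vectorized columns at once, a large number of encoded categorical columns can dominate them.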


In [51]:
# drop only the perfectly correlated columns
df = df.drop(['booking_down_pymnt', 'emi_starts_from'], axis=1)

In [52]:
# make the target (response) vector and the predictor matrix
X = df.drop(columns=["sale_price"])
y = df["sale_price"] 

In [53]:
# make the train test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=26       
)

In [54]:
# make the single catboost regressor pipeline
pipe = Pipeline([
    ("tv", TableVectorizer()),
    ("knn", KNNImputer(n_neighbors=5)),
    ("catboost", CatBoostRegressor(
        depth=8, learning_rate=0.05, n_estimators=800,
        loss_function="RMSE", verbose=False
    ))
])

In [55]:
# fit the single catboost regressor
pipe.fit(X_train, y_train)

In [56]:
# predict the single catboost regressor model
y_pred = pipe.predict(X_test)

In [57]:
# find the test RMSE and R^2 for the single catboost regressor model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 50013.797, R²: 0.970


In [58]:
# select the categorical and continuous columns from the X matrix
num_cols = X.select_dtypes(include=[np.number]).columns
cat_cols = X.select_dtypes(exclude=[np.number]).columns

In [59]:
# make a pipeline for the ridge regression on the continuous variables
num_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
    ("scale", StandardScaler())
])

In [60]:
# make a pipeline for the ridge regression on the categorical variables
cat_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore",
                          min_frequency=0.01,
                          sparse_output=True))
])

In [61]:
# make a preprocessing transformer for the variables for the ridge regression
pre = ColumnTransformer(
    transformers=[
        ("num", num_pipe, num_cols),
        ("cat", cat_pipe, cat_cols),
    ],)

In [62]:
# make a pipeline for the ridge regression
pipe = Pipeline([
    ("pre", pre),
    ("ridge", Ridge(solver="sag", random_state=26))
])

In [63]:
# make a parameter grid for the ridge regression
param_grid = {
    "pre__num__poly__degree": [1, 2], 
    "ridge__alpha": [1.0, 10.0, 100.0]
}

In [64]:
# search over the parameter grid for the ridge regression
search = GridSearchCV(
    pipe, param_grid, cv=KFold(n_splits=5, shuffle=True, random_state=26),
    scoring="neg_root_mean_squared_error",
    n_jobs=1,   
    verbose=1
)

In [65]:
# fit the grid search for the ridge regression on the training data
search.fit(X_train, y_train)

Fitting 5 folds for each of 6 candidates, totalling 30 fits


In [66]:
# show the best cross-validated RMSE and parameters for the ridge regression
print("Best RMSE:", -search.best_score_)
print("Best params:", search.best_params_)

Best RMSE: 62644.4119237985
Best params: {'pre__num__poly__degree': 2, 'ridge__alpha': 10.0}


In [67]:
# make the prediction of the ys using the best model
best_pipe = search.best_estimator_
y_pred = best_pipe.predict(X_test)

In [68]:
# find the RMSE and R^2 for the best model on the test set for the ridge regression model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 48097.907, R²: 0.972


In [69]:
# make the preprocessing pipeline for the Linear GAM model
pre = ColumnTransformer([
    ('num', Pipeline([
        ('impute', SimpleImputer(strategy='mean')),
        ('scale', StandardScaler())
    ]), selector(dtype_include=np.number)),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
        ('encode', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1))
    ]), selector(dtype_exclude=np.number))
])

In [70]:
# transform all of the data so that it works with the Linear GAM model
X_tr = pre.fit_transform(X_train)
X_te = pre.transform(X_test)
y_tr = np.asarray(y_train).ravel()
n_cont = pre.named_transformers_['num'].n_features_in_
n_cat  = X_tr.shape[1] - n_cont

In [71]:
# fit the single Linear GAM model
term_pieces = [s(i, n_splines=10) for i in range(n_cont)] + \
              [f(n_cont + j) for j in range(n_cat)]
terms = reduce(operator.add, term_pieces)
gam = LinearGAM(terms, lam=1.0, max_iter=400, tol=1e-4)
gam.fit(X_tr, y_tr)

LinearGAM(callbacks=[Deviance(), Diffs()], fit_intercept=True, 
   max_iter=400, scale=None, 
   terms=s(0) + s(1) + s(2) + s(3) + s(4) + s(5) + f(6) + f(7) + f(8) + f(9) + f(10) + f(11) + f(12) + f(13) + f(14) + f(15) + f(16) + f(17) + f(18) + f(19) + f(20) + f(21) + f(22) + f(23) + f(24) + f(25) + intercept,
   tol=0.0001, verbose=False)

In [72]:
# process the test data so that we can predict on it, then predict on it using the linear GAM model
cat_start = n_cont
cat_end = n_cont + n_cat

X_te[:, cat_start:cat_end] = np.nan_to_num(
    X_te[:, cat_start:cat_end], nan=0.0, posinf=0.0, neginf=0.0
)

mins = X_tr[:, cat_start:cat_end].min(axis=0)
maxs = X_tr[:, cat_start:cat_end].max(axis=0)
X_te[:, cat_start:cat_end] = np.clip(X_te[:, cat_start:cat_end], mins, maxs)

y_pred = gam.predict(X_te)

In [73]:
# find the test RMSE and R^2 using the linear GAM model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 56204.791, R²: 0.962


In [74]:
# make a selector for the continuous and categorical data
num_sel = selector(dtype_include=np.number)
cat_sel = selector(dtype_exclude=np.number)

# make a pipeline for the continuous data
num_pipe = Pipeline([
    ("impute_num", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler())
])

# make a pipeline for the categorical data
cat_pipe = Pipeline([
    ("impute_cat", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

# make an overall transformer for the variables
tab = ColumnTransformer([
    ("num", num_pipe, num_sel),
    ("cat", cat_pipe, cat_sel),
])

# define the parameters for the neural network
mlp = MLPRegressor(
    hidden_layer_sizes=(256, 128),
    activation="relu",
    solver="adam",
    learning_rate_init=1e-3,
    alpha=1e-3,            
    batch_size=256,
    max_iter=1000,
    early_stopping=True,
    n_iter_no_change=20,
    validation_fraction=0.15,
    random_state=26
)

# define the overall pipeline for the neural network
pipe = Pipeline([
    ("tab", tab),
    ("reg", TransformedTargetRegressor(
        regressor=mlp,
        transformer=StandardScaler()
    ))
])

In [75]:
# fit the neural network to the training data
pipe.fit(X_train, y_train)

In [76]:
# predict the y values on the test data for the neural network
y_pred = pipe.predict(X_test)

In [77]:
# find the test RMSE and R^2 for the neural network
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 44091.582, R²: 0.977


In [78]:
# make selectors for the continuous and categorical data for the SVM model
num_sel = selector(dtype_include=np.number)
cat_sel = selector(dtype_exclude=np.number)

# make the pipeline for the continuous data
num_pipe = Pipeline([
    ("impute_num", KNNImputer(n_neighbors=5)),
    ("scale", StandardScaler())
])

# make the pipeline for the categorical data
cat_pipe = Pipeline([
    ("impute_cat", SimpleImputer(strategy="most_frequent")),
    ("ohe", OneHotEncoder(handle_unknown="ignore"))
])

# make the transformer for the variables
tab = ColumnTransformer([
    ("num", num_pipe, num_sel),
    ("cat", cat_pipe, cat_sel),
])

# make the array dense to help with the SVM model
to_dense = FunctionTransformer(
    lambda X: X.toarray() if sp.issparse(X) else X, accept_sparse=True
)

# define the kernel for SVM given a cost and epsilon
svr = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")

# define the overall SVM pipeline
pipe_svr = Pipeline([
    ("tab", tab),
    ("dense", to_dense),  
    ("reg", TransformedTargetRegressor(
        regressor=svr,
        transformer=StandardScaler()
    ))
])

In [79]:
# fit the SVM model
pipe_svr.fit(X_train, y_train)

In [80]:
# predict the y values for the SVM model
y_pred = pipe_svr.predict(X_test)

In [81]:
# find the test RMSE and R^2 for the SVM model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 52378.372, R²: 0.967


In [82]:
# set up a pipeline for the MARS model
# make numerical and categorical selector
num_sel = selector(dtype_include=np.number)
cat_sel = selector(dtype_exclude=np.number)

# make a transformer for each type of variable
pre = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("impute", KNNImputer(n_neighbors=5)),  
            ("scale", StandardScaler(with_mean=True))
        ]), num_sel),
        ("cat", Pipeline([
            ("impute", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
        ]), cat_sel),
    ],
    remainder="drop",
    verbose_feature_names_out=False
)

# make main MARS pipeline
pipe = Pipeline([
    ("prep", pre),
    ("mars", Earth(
        max_degree=2,
        enable_pruning=True,
        penalty=3,
        max_terms=13,
        minspan_alpha=0.5,
        endspan_alpha=0.5
    ))
])

In [83]:
# fit the MARS model
pipe.fit(X_train, y_train)

  pruning_passer.run()
  coef, resid = np.linalg.lstsq(B, weighted_y[:, i])[0:2]


In [84]:
# predict the y values on the test set for the MARS model
y_pred = pipe.predict(X_test)

In [85]:
# find the test RMSE and R^2 for the MARS model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 41793.244, R²: 0.979


In [86]:
# define the KNN pipeline
knn_pipe = Pipeline([
    ("tv", TableVectorizer()),                  
    ("impute", KNNImputer(n_neighbors=5)),      
    ("scale", StandardScaler()),                
    ("knn", KNeighborsRegressor(
        n_neighbors=10,
        weights="distance",     
        metric="minkowski", p=2
    ))
])

In [87]:
# fit the KNN model using the training data
knn_pipe.fit(X_train, y_train)

In [88]:
# predict the y values using the KNN model
y_pred = knn_pipe.predict(X_test)

In [89]:
# find the test RMSE and R^2 using the KNN model
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse:.3f}, R²: {r2:.3f}")

RMSE: 145534.846, R²: 0.748
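
Collecting the test RMSE values printed in the cells above makes the two halves easier to compare. The numbers below are copied from those outputs; the dictionary names and the percentage calculation are only an illustrative way of laying them out.

```python
# Test RMSE with original_price and broker_quote dropped (first half)
rmse_reduced = {
    "CatBoost": 69404.599, "Ridge": 156899.543, "LinearGAM": 93708.590,
    "MLP": 71143.759, "SVR": 81217.178, "MARS": 151423.782, "KNN": 162813.168,
}
# Test RMSE with original_price and broker_quote kept (second half)
rmse_with_prices = {
    "CatBoost": 50013.797, "Ridge": 48097.907, "LinearGAM": 56204.791,
    "MLP": 44091.582, "SVR": 52378.372, "MARS": 41793.244, "KNN": 145534.846,
}

# Percentage change in RMSE per model (negative means lower error)
pct_change = {
    name: 100.0 * (rmse_with_prices[name] - rmse_reduced[name]) / rmse_reduced[name]
    for name in rmse_reduced
}
for name, change in sorted(pct_change.items(), key=lambda kv: kv[1]):
    print(f"{name:10s} {rmse_reduced[name]:12.3f} -> {rmse_with_prices[name]:12.3f} ({change:+.1f}%)")

best_reduced = min(rmse_reduced, key=rmse_reduced.get)
best_with_prices = min(rmse_with_prices, key=rmse_with_prices.get)
print(best_reduced, best_with_prices)
```

Every model's test RMSE drops once the two price-related columns are available. CatBoost has the lowest RMSE in the first half and MARS the lowest in the second.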
