## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
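A baseline pass over several of the listed model families could look like the sketch below. It runs on synthetic data from `make_regression` as a stand-in for the processed training files. The helper name `evaluate`, the model shortlist, and the settings are illustrative assumptions, not choices taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption): swap in X_train, y_train, X_test, y_test.
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)


def evaluate(model, X_tr, y_tr, X_te, y_te):
    """Fit one model with its current settings and score it on the held-out split."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return {
        "rmse": float(np.sqrt(mean_squared_error(y_te, pred))),
        "mae": float(mean_absolute_error(y_te, pred)),
        "r2": float(r2_score(y_te, pred)),
    }


# Hypothetical shortlist with default hyperparameters, no tuning at this stage.
models = {
    "dummy (mean)": DummyRegressor(strategy="mean"),
    "linear regression": LinearRegression(),
    "SVR": SVR(),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

baseline_results = {
    name: evaluate(model, X_tr, y_tr, X_te, y_te) for name, model in models.items()
}
for name, scores in baseline_results.items():
    print(f"{name:18s} RMSE={scores['rmse']:.1f}  MAE={scores['mae']:.1f}  R²={scores['r2']:.3f}")
```

Including `DummyRegressor` gives a floor: any model that cannot beat a constant mean prediction is not worth tuning further.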

In [1]:
# Import libraries, models, and metrics
import pandas as pd
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error


In [18]:
# Load the processed training set
dataframe_train = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/processed/train.csv')
X_train = dataframe_train.drop(columns=['description.sold_price']).values
y_train = dataframe_train['description.sold_price'].values
print("Training set shape:", X_train.shape, y_train.shape)

# Load the processed test set
dataframe_test = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/processed/test.csv')
X_test = dataframe_test.drop(columns=['description.sold_price']).values
y_test = dataframe_test['description.sold_price'].values
print("Test set shape:", X_test.shape, y_test.shape)

Training set shape: (902, 14) (902,)
Test set shape: (226, 14) (226,)


In [9]:
# Load the PCA-transformed training set
dataframe_train_pca = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/processed/train_pca.csv')
X_train_2 = dataframe_train_pca.drop(columns=['description.sold_price']).values
y_train_2 = dataframe_train_pca['description.sold_price'].values
print("Training set shape:", X_train_2.shape, y_train_2.shape)

# Load the PCA-transformed test set
dataframe_test_pca = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/processed/test_pca.csv')
X_test_2 = dataframe_test_pca.drop(columns=['description.sold_price']).values
y_test_2 = dataframe_test_pca['description.sold_price'].values
print("Test set shape:", X_test_2.shape, y_test_2.shape)

Training set shape: (902, 13) (902,)
Test set shape: (226, 13) (226,)


In [10]:
# Load the training set with polynomial features
dataframe_train_poly = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/processed/train_poly.csv')
X_train_3 = dataframe_train_poly.drop(columns=['description.sold_price']).values
y_train_3 = dataframe_train_poly['description.sold_price'].values
print("Training set shape:", X_train_3.shape, y_train_3.shape)

# Load the test dataset
dataframe_test_poly = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/processed/test_poly.csv')
X_test_3 = dataframe_test_poly.drop(columns=['description.sold_price']).values
y_test_3 = dataframe_test_poly['description.sold_price'].values
print("Test set shape:", X_test_3.shape, y_test_3.shape)

Training set shape: (902, 119) (902,)
Test set shape: (226, 119) (226,)


In [19]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test)
print(f"Best Parameters: {grid_search.best_params_}")

# Calculate metrics
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
# Adjusted R² penalizes R² for the feature count: n = test rows, p = features
adjusted_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

# Print metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R²: {r2}")
print(f"Adjusted R²: {adjusted_r2}")

Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 5, 'n_estimators': 300, 'subsample': 0.8}
Mean Squared Error (MSE): 6399959654.029679
Root Mean Squared Error (RMSE): 79999.74783728809
Mean Absolute Error (MAE): 55246.17209822343
R²: 0.816082243489204
Adjusted R²: 0.8038791695974924


In [13]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_2, y_train_2)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_2)
print(f"Best Parameters: {grid_search.best_params_}")

# Calculate metrics
mse = mean_squared_error(y_test_2, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_2, predictions)
r2 = r2_score(y_test_2, predictions)
adjusted_r2 = 1 - (1 - r2) * (len(y_test_2) - 1) / (len(y_test_2) - X_test_2.shape[1] - 1)

# Print metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R²: {r2}")
print(f"Adjusted R²: {adjusted_r2}")

Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300, 'subsample': 0.8}
Mean Squared Error (MSE): 8430636956.906597
Root Mean Squared Error (RMSE): 91818.50007981287
Mean Absolute Error (MAE): 65154.627682586564
R²: 0.7577259984607905
Adjusted R²: 0.742869573838103


In [14]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_3, y_train_3)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_3)
print(f"Best Parameters: {grid_search.best_params_}")

# Calculate metrics
mse = mean_squared_error(y_test_3, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_3, predictions)
r2 = r2_score(y_test_3, predictions)
adjusted_r2 = 1 - (1 - r2) * (len(y_test_3) - 1) / (len(y_test_3) - X_test_3.shape[1] - 1)

# Print metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R²: {r2}")
print(f"Adjusted R²: {adjusted_r2}")

Best Parameters: {'learning_rate': 0.1, 'max_depth': 3, 'min_samples_leaf': 1, 'min_samples_split': 5, 'n_estimators': 300, 'subsample': 0.8}
Mean Squared Error (MSE): 6229726431.574986
Root Mean Squared Error (RMSE): 78928.61605004224
Mean Absolute Error (MAE): 55822.453485942446
R²: 0.8209742918848149
Adjusted R²: 0.6199926006988995
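
The gap between R² and adjusted R² on the polynomial feature set follows directly from the formula used in the cell: with 226 test rows and 119 features, the penalty factor (n - 1) / (n - p - 1) is a little over 2. The short check below reproduces the printed adjusted value from the printed R². The function name `adjusted_r2_score` is an illustrative helper, not part of scikit-learn.

```python
def adjusted_r2_score(r2, n_samples, n_features):
    """Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)


r2_poly = 0.8209742918848149           # R² printed above for the polynomial features
penalty = (226 - 1) / (226 - 119 - 1)  # 225 / 106

adj_poly = adjusted_r2_score(r2_poly, n_samples=226, n_features=119)
# Same R², but with the 14 columns of the original feature set.
adj_if_14_features = adjusted_r2_score(r2_poly, n_samples=226, n_features=14)

print(f"penalty factor: {penalty:.3f}")
print(f"adjusted R² with 119 features: {adj_poly:.4f}")
print(f"same R² with 14 features:      {adj_if_14_features:.4f}")
```

Because the penalty depends on the number of rows being scored, the adjusted value computed on the 226-row test set is harsher than the same formula would be on the 902-row training set.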


In [20]:
# Define the model
model = XGBRegressor()

# Define the parameter grid (one value per parameter, so the search
# only cross-validates this single configuration)
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test)

# Calculate metrics
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)
adjusted_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

# Print metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R²: {r2}")
print(f"Adjusted R²: {adjusted_r2}")

Mean Squared Error (MSE): 6466147328.0
Root Mean Squared Error (RMSE): 80412.35805521437
Mean Absolute Error (MAE): 57161.2734375
R²: 0.8141801953315735
Adjusted R²: 0.8018509191924361


In [16]:
# Define the model
model = XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_2, y_train_2)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_2)

# Calculate metrics
mse = mean_squared_error(y_test_2, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_2, predictions)
r2 = r2_score(y_test_2, predictions)
adjusted_r2 = 1 - (1 - r2) * (len(y_test_2) - 1) / (len(y_test_2) - X_test_2.shape[1] - 1)

# Print metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R²: {r2}")
print(f"Adjusted R²: {adjusted_r2}")

Mean Squared Error (MSE): 8367230976.0
Root Mean Squared Error (RMSE): 91472.56952770049
Mean Absolute Error (MAE): 66028.4609375
R²: 0.7595481276512146
Adjusted R²: 0.7448034373656759


In [17]:
# Define the model
model = XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_3, y_train_3)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_3)

# Calculate metrics
mse = mean_squared_error(y_test_3, predictions)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_3, predictions)
r2 = r2_score(y_test_3, predictions)
adjusted_r2 = 1 - (1 - r2) * (len(y_test_3) - 1) / (len(y_test_3) - X_test_3.shape[1] - 1)

# Print metrics
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"Mean Absolute Error (MAE): {mae}")
print(f"R²: {r2}")
print(f"Adjusted R²: {adjusted_r2}")

Mean Squared Error (MSE): 6251127808.0
Root Mean Squared Error (RMSE): 79064.07406654429
Mean Absolute Error (MAE): 56277.46875
R²: 0.8203592896461487
Adjusted R²: 0.6186871714187119


Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars. Can we actually relate to that amount of error?
- Try root mean squared error so that the error is back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
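
One way to explore the outlier question is to compare MAE and RMSE on the same residuals with and without a single large miss. The sketch below uses made-up dollar residuals (an assumption for illustration, not values from this dataset) and plain Python so the arithmetic is easy to follow.

```python
import math


def mae(errors):
    """Mean absolute error: every miss counts in proportion to its size."""
    return sum(abs(e) for e in errors) / len(errors)


def rmse(errors):
    """Root mean squared error: squaring gives large misses extra weight."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical residuals in dollars.
typical = [10_000] * 10                   # ten equal misses
with_outlier = [10_000] * 9 + [200_000]   # nine typical misses and one large one

for label, errs in [("no outlier", typical), ("one outlier", with_outlier)]:
    print(f"{label:12s} MAE={mae(errs):,.0f}  RMSE={rmse(errs):,.0f}")
```

When all misses are the same size the two metrics agree; a single large miss pulls RMSE up much faster than MAE, which is worth keeping in mind if a few unusually expensive homes dominate the error.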

In [None]:
# gather evaluation metrics and compare results
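
A possible starting point for this cell is to collect the test-set scores printed above into one table and sort them. The values below are copied from the cell outputs and rounded; the column names and the `comparison` variable are illustrative choices.

```python
import pandas as pd

# Test-set scores copied (rounded) from the cell outputs above.
results = pd.DataFrame(
    [
        ("GradientBoostingRegressor", "train.csv", 79999.75, 55246.17, 0.8161, 0.8039),
        ("GradientBoostingRegressor", "train_pca.csv", 91818.50, 65154.63, 0.7577, 0.7429),
        ("GradientBoostingRegressor", "train_poly.csv", 78928.62, 55822.45, 0.8210, 0.6200),
        ("XGBRegressor", "train.csv", 80412.36, 57161.27, 0.8142, 0.8019),
        ("XGBRegressor", "train_pca.csv", 91472.57, 66028.46, 0.7595, 0.7448),
        ("XGBRegressor", "train_poly.csv", 79064.07, 56277.47, 0.8204, 0.6187),
    ],
    columns=["model", "features", "rmse", "mae", "r2", "adj_r2"],
)

# Lowest RMSE first; ties are unlikely at this precision.
comparison = results.sort_values("rmse").reset_index(drop=True)
print(comparison.to_string(index=False))
```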


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features.
- Refit your models with this reduced dimensionality. How does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain your reasoning.

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
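
As one example of the Lasso option, the sketch below keeps only the features whose Lasso coefficients are non-zero. It runs on synthetic data from `make_regression` (an assumption standing in for the processed training set), and the pipeline layout is one reasonable arrangement rather than a prescribed one.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): 20 features, only 5 carry signal.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=5.0, random_state=0
)

# Scale first so the L1 penalty treats all features on the same footing,
# then keep the features whose Lasso coefficients are effectively non-zero.
selector = make_pipeline(
    StandardScaler(),
    SelectFromModel(LassoCV(cv=5, random_state=0)),
)
X_reduced = selector.fit_transform(X, y)

kept = selector.named_steps["selectfrommodel"].get_support()
print(f"kept {kept.sum()} of {X.shape[1]} features:", np.flatnonzero(kept))
```

The reduced matrix can then be passed to the same models and metrics as before, so the comparison against the full feature set stays like for like. Fit the selector on the training split only and reuse it with `transform` on the test split.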



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)