## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
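
A baseline gives the later scores something to be compared against. The sketch below uses synthetic data rather than the project's CSV files and fits `DummyRegressor`, which this notebook imports, next to `LinearRegression`. The array names (`X_demo`, `y_demo`, and so on) and the synthetic target are assumptions made for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the project's data (assumption: not the real CSVs).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)
X_tr, X_te = X_demo[:150], X_demo[150:]
y_tr, y_te = y_demo[:150], y_demo[150:]

# The dummy model always predicts the training mean, so its test R² cannot exceed zero.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
linear = LinearRegression().fit(X_tr, y_tr)

baseline_r2 = r2_score(y_te, baseline.predict(X_te))
linear_r2 = r2_score(y_te, linear.predict(X_te))
print(f"Baseline R²: {baseline_r2:.3f}")
print(f"Linear R²:   {linear_r2:.3f}")
```

Any model worth keeping should clear the dummy score by a wide margin on the same split.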

In [3]:
# Import libraries, models, and metrics
import pandas as pd
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from xgboost import XGBRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error


In [4]:
# Load the training dataset
dataframe_train_poly = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/train_scaled.csv')
X_train = dataframe_train_poly.drop(columns=['description.sold_price']).values
y_train = dataframe_train_poly['description.sold_price'].values
print("Training set shape:", X_train.shape, y_train.shape)

# Load the test dataset
dataframe_test_poly = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/test_scaled.csv')
X_test = dataframe_test_poly.drop(columns=['description.sold_price']).values
y_test = dataframe_test_poly['description.sold_price'].values
print("Test set shape:", X_test.shape, y_test.shape)

# Load the PCA training dataset
dataframe_train_pca = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/train_pca.csv')
X_train_2 = dataframe_train_pca.drop(columns=['description.sold_price']).values
y_train_2 = dataframe_train_pca['description.sold_price'].values
print("Training set shape:", X_train_2.shape, y_train_2.shape)

# Load the PCA test dataset
dataframe_test_pca = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/test_pca.csv')
X_test_2 = dataframe_test_pca.drop(columns=['description.sold_price']).values
y_test_2 = dataframe_test_pca['description.sold_price'].values
print("Test set shape:", X_test_2.shape, y_test_2.shape)




Training set shape: (904, 209) (904,)
Test set shape: (226, 209) (226,)
Training set shape: (904, 57) (904,)
Test set shape: (226, 57) (226,)
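
The loading blocks in this notebook repeat the same three steps: read a CSV, drop the target column, and pull out NumPy arrays. A small helper could express that once. The function name `load_xy` is hypothetical, and the demonstration writes a tiny temporary CSV so that it runs without the project's files.

```python
import os
import tempfile

import pandas as pd

TARGET = "description.sold_price"  # target column used throughout this notebook


def load_xy(csv_path, target=TARGET):
    """Read a processed CSV and split it into a feature matrix and a target vector."""
    frame = pd.read_csv(csv_path)
    X = frame.drop(columns=[target]).values
    y = frame[target].values
    return X, y


# Demonstration on a small temporary file (assumption: stands in for train_scaled.csv).
demo = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [0.0, 1.0, 0.0], TARGET: [10.0, 20.0, 30.0]})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    demo.to_csv(demo_path, index=False)
    X_demo, y_demo = load_xy(demo_path)

print("Demo shape:", X_demo.shape, y_demo.shape)
```

With such a helper, each train/test pair would become two calls, one per CSV path.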


In [5]:
# Load the training dataset
dataframe_train_poly_no_lot = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/train_scaled_no_lot.csv')
X_train_3 = dataframe_train_poly_no_lot.drop(columns=['description.sold_price']).values
y_train_3 = dataframe_train_poly_no_lot['description.sold_price'].values
print("Training set shape:", X_train_3.shape, y_train_3.shape)

# Load the test dataset
dataframe_test_poly_no_lot = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/test_scaled_no_lot.csv')
X_test_3 = dataframe_test_poly_no_lot.drop(columns=['description.sold_price']).values
y_test_3 = dataframe_test_poly_no_lot['description.sold_price'].values
print("Test set shape:", X_test_3.shape, y_test_3.shape)

Training set shape: (904, 104) (904,)
Test set shape: (226, 104) (226,)


In [6]:
# Load the training dataset (train_scaled_1)
dataframe_train_scaled_1 = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/train_scaled_1.csv')
X_train_4 = dataframe_train_scaled_1.drop(columns=['description.sold_price']).values
y_train_4 = dataframe_train_scaled_1['description.sold_price'].values
print("Training set shape:", X_train_4.shape, y_train_4.shape)

# Load the test dataset (test_scaled_1)
dataframe_test_scaled_1 = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/test_scaled_1.csv')
X_test_4 = dataframe_test_scaled_1.drop(columns=['description.sold_price']).values
y_test_4 = dataframe_test_scaled_1['description.sold_price'].values
print("Test set shape:", X_test_4.shape, y_test_4.shape)

Training set shape: (903, 90) (903,)
Test set shape: (226, 90) (226,)


In [7]:
# Load the training dataset (train_pca_1)
dataframe_train_pca_1 = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/train_pca_1.csv')
X_train_5 = dataframe_train_pca_1.drop(columns=['description.sold_price']).values
y_train_5 = dataframe_train_pca_1['description.sold_price'].values
print("Training set shape:", X_train_5.shape, y_train_5.shape)

# Load the test dataset (test_pca_1)
dataframe_test_pca_1 = pd.read_csv('/Users/blairjdaniel/lighthouse/lighthouse/DS-Midterm-Project/DS-Midterm-Project/processed/test_pca_1.csv')
X_test_5 = dataframe_test_pca_1.drop(columns=['description.sold_price']).values
y_test_5 = dataframe_test_pca_1['description.sold_price'].values
print("Test set shape:", X_test_5.shape, y_test_5.shape)

Training set shape: (903, 21) (903,)
Test set shape: (226, 21) (226,)


In [8]:
# Define the models and their parameter grids
models = {
    'Ridge': (Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
    'Lasso': (Lasso(), {'alpha': [0.1, 1.0, 10.0]}),
    'ElasticNet': (ElasticNet(), {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.1, 0.5, 0.9]}),
    'DecisionTree': (DecisionTreeRegressor(), {'max_depth': [None, 10, 20, 30]}),
    'RandomForest': (RandomForestRegressor(), {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}),
    'GradientBoosting': (GradientBoostingRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'XGBoost': (XGBRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'SVR': (SVR(), {'C': [0.1, 1.0, 10.0], 'epsilon': [0.1, 0.2]}),
    'KNN': (KNeighborsRegressor(), {'n_neighbors': [3, 5, 7]})
}

# Iterate over the models and perform grid search
for name, (model, param_grid) in models.items():
    search = GridSearchCV(model, param_grid, cv=5)
    search.fit(X_train, y_train)
    print(f"Best parameters for {name}: {search.best_params_}")
    print(f"Best score for {name}: {search.best_score_}")

Best parameters for Ridge: {'alpha': 10.0}
Best score for Ridge: 0.8102474472492709


  model = cd_fast.enet_coordinate_descent(


Best parameters for Lasso: {'alpha': 10.0}
Best score for Lasso: 0.7498841921051502


  model = cd_fast.enet_coordinate_descent(


Best parameters for ElasticNet: {'alpha': 1.0, 'l1_ratio': 0.9}
Best score for ElasticNet: 0.8289757057748217
Best parameters for DecisionTree: {'max_depth': None}
Best score for DecisionTree: 0.6825669897095901
Best parameters for RandomForest: {'max_depth': 20, 'n_estimators': 100}
Best score for RandomForest: 0.8353826100993873
Best parameters for GradientBoosting: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for GradientBoosting: 0.8502465545182843
Best parameters for XGBoost: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for XGBoost: 0.8248134255409241
Best parameters for SVR: {'C': 10.0, 'epsilon': 0.1}
Best score for SVR: -0.04230721489289206
Best parameters for KNN: {'n_neighbors': 5}
Best score for KNN: 0.5943185594873445
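
`GridSearchCV` without a `scoring` argument uses the estimator's own `score` method, which for scikit-learn regressors is R², so each "Best score" above is a mean cross-validated R². Because the same search is run on several feature sets, the loop could be wrapped in one function and applied to a dictionary of datasets. The sketch below is one way to do that; `run_grid_searches`, the two-model dictionary, and the synthetic arrays are all assumptions for illustration, not the notebook's actual data or full model list.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor


def run_grid_searches(models, datasets, cv=5):
    """Return {(dataset_name, model_name): (best_params, best_score)}."""
    results = {}
    for data_name, (X, y) in datasets.items():
        for model_name, (model, param_grid) in models.items():
            # GridSearchCV clones the estimator, so one instance can be reused safely.
            search = GridSearchCV(model, param_grid, cv=cv)
            search.fit(X, y)
            results[(data_name, model_name)] = (search.best_params_, search.best_score_)
    return results


# Synthetic stand-ins for two feature sets (assumption: not the project's CSVs).
rng = np.random.default_rng(0)
X_a = rng.normal(size=(120, 4))
y_a = X_a @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(scale=0.1, size=120)
X_b = X_a[:, :2]  # a reduced feature set, loosely analogous to a PCA variant

demo_models = {
    "Ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "KNN": (KNeighborsRegressor(), {"n_neighbors": [3, 5, 7]}),
}
demo_datasets = {"full": (X_a, y_a), "reduced": (X_b, y_a)}

results = run_grid_searches(demo_models, demo_datasets)
for (data_name, model_name), (params, score) in results.items():
    print(f"{data_name:8s} {model_name:6s} {params} R²={score:.3f}")
```

Keeping the results in one dictionary also makes it easier to compare feature sets side by side instead of scrolling between cell outputs.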


In [9]:
# Define the models and their parameter grids
models = {
    'Ridge': (Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
    'Lasso': (Lasso(), {'alpha': [0.1, 1.0, 10.0]}),
    'ElasticNet': (ElasticNet(), {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.1, 0.5, 0.9]}),
    'DecisionTree': (DecisionTreeRegressor(), {'max_depth': [None, 10, 20, 30]}),
    'RandomForest': (RandomForestRegressor(), {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}),
    'GradientBoosting': (GradientBoostingRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'XGBoost': (XGBRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'SVR': (SVR(), {'C': [0.1, 1.0, 10.0], 'epsilon': [0.1, 0.2]}),
    'KNN': (KNeighborsRegressor(), {'n_neighbors': [3, 5, 7]})
}

# Iterate over the models and perform grid search
for name, (model, param_grid) in models.items():
    search = GridSearchCV(model, param_grid, cv=5)
    search.fit(X_train_2, y_train_2)
    print(f"Best parameters for {name}: {search.best_params_}")
    print(f"Best score for {name}: {search.best_score_}")

Best parameters for Ridge: {'alpha': 10.0}
Best score for Ridge: 0.8190465413629603
Best parameters for Lasso: {'alpha': 10.0}
Best score for Lasso: 0.8168401735979792
Best parameters for ElasticNet: {'alpha': 1.0, 'l1_ratio': 0.9}
Best score for ElasticNet: 0.8236889798947411
Best parameters for DecisionTree: {'max_depth': 10}
Best score for DecisionTree: 0.47008550043885355
Best parameters for RandomForest: {'max_depth': None, 'n_estimators': 200}
Best score for RandomForest: 0.723681126798099
Best parameters for GradientBoosting: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for GradientBoosting: 0.7849900890842006
Best parameters for XGBoost: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for XGBoost: 0.7442242622375488
Best parameters for SVR: {'C': 10.0, 'epsilon': 0.1}
Best score for SVR: -0.04217512529968741
Best parameters for KNN: {'n_neighbors': 7}
Best score for KNN: 0.5926665387287844


In [10]:
# Define the models and their parameter grids
models = {
    'Ridge': (Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
    'Lasso': (Lasso(), {'alpha': [0.1, 1.0, 10.0]}),
    'ElasticNet': (ElasticNet(), {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.1, 0.5, 0.9]}),
    'DecisionTree': (DecisionTreeRegressor(), {'max_depth': [None, 10, 20, 30]}),
    'RandomForest': (RandomForestRegressor(), {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}),
    'GradientBoosting': (GradientBoostingRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'XGBoost': (XGBRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'SVR': (SVR(), {'C': [0.1, 1.0, 10.0], 'epsilon': [0.1, 0.2]}),
    'KNN': (KNeighborsRegressor(), {'n_neighbors': [3, 5, 7]})
}

# Iterate over the models and perform grid search
for name, (model, param_grid) in models.items():
    search = GridSearchCV(model, param_grid, cv=5)
    search.fit(X_train_3, y_train_3)
    print(f"Best parameters for {name}: {search.best_params_}")
    print(f"Best score for {name}: {search.best_score_}")

Best parameters for Ridge: {'alpha': 10.0}
Best score for Ridge: 0.7883321298881075


  model = cd_fast.enet_coordinate_descent(


Best parameters for Lasso: {'alpha': 10.0}
Best score for Lasso: 0.769056338677365


  model = cd_fast.enet_coordinate_descent(


Best parameters for ElasticNet: {'alpha': 0.1, 'l1_ratio': 0.5}
Best score for ElasticNet: 0.7939846218445334
Best parameters for DecisionTree: {'max_depth': 10}
Best score for DecisionTree: 0.5924169229429385
Best parameters for RandomForest: {'max_depth': None, 'n_estimators': 200}
Best score for RandomForest: 0.7977941111621301
Best parameters for GradientBoosting: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for GradientBoosting: 0.8116344980082959
Best parameters for XGBoost: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for XGBoost: 0.7896558165550231
Best parameters for SVR: {'C': 10.0, 'epsilon': 0.1}
Best score for SVR: -0.042263799362407095
Best parameters for KNN: {'n_neighbors': 7}
Best score for KNN: 0.5984368271155651


In [11]:
# Define the models and their parameter grids
models = {
    'Ridge': (Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
    'Lasso': (Lasso(), {'alpha': [0.1, 1.0, 10.0]}),
    'ElasticNet': (ElasticNet(), {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.1, 0.5, 0.9]}),
    'DecisionTree': (DecisionTreeRegressor(), {'max_depth': [None, 10, 20, 30]}),
    'RandomForest': (RandomForestRegressor(), {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}),
    'GradientBoosting': (GradientBoostingRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'XGBoost': (XGBRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'SVR': (SVR(), {'C': [0.1, 1.0, 10.0], 'epsilon': [0.1, 0.2]}),
    'KNN': (KNeighborsRegressor(), {'n_neighbors': [3, 5, 7]})
}

# Iterate over the models and perform grid search
for name, (model, param_grid) in models.items():
    search = GridSearchCV(model, param_grid, cv=5)
    search.fit(X_train_4, y_train_4)
    print(f"Best parameters for {name}: {search.best_params_}")
    print(f"Best score for {name}: {search.best_score_}")

Best parameters for Ridge: {'alpha': 10.0}
Best score for Ridge: 0.7882452040892852


  model = cd_fast.enet_coordinate_descent(


Best parameters for Lasso: {'alpha': 10.0}
Best score for Lasso: 0.7583325687410637


  model = cd_fast.enet_coordinate_descent(


Best parameters for ElasticNet: {'alpha': 0.1, 'l1_ratio': 0.1}
Best score for ElasticNet: 0.7977860290009369
Best parameters for DecisionTree: {'max_depth': 10}
Best score for DecisionTree: 0.5923573898786824
Best parameters for RandomForest: {'max_depth': 10, 'n_estimators': 200}
Best score for RandomForest: 0.785243966588703
Best parameters for GradientBoosting: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for GradientBoosting: 0.8059852192792297
Best parameters for XGBoost: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for XGBoost: 0.7854320526123046
Best parameters for SVR: {'C': 10.0, 'epsilon': 0.1}
Best score for SVR: -0.041615078976521634
Best parameters for KNN: {'n_neighbors': 7}
Best score for KNN: 0.6104178979812304


In [12]:
# Define the models and their parameter grids (dataset 5: X_train_5)
models = {
    'Ridge': (Ridge(), {'alpha': [0.1, 1.0, 10.0]}),
    'Lasso': (Lasso(), {'alpha': [0.1, 1.0, 10.0]}),
    'ElasticNet': (ElasticNet(), {'alpha': [0.1, 1.0, 10.0], 'l1_ratio': [0.1, 0.5, 0.9]}),
    'DecisionTree': (DecisionTreeRegressor(), {'max_depth': [None, 10, 20, 30]}),
    'RandomForest': (RandomForestRegressor(), {'n_estimators': [100, 200], 'max_depth': [None, 10, 20]}),
    'GradientBoosting': (GradientBoostingRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'XGBoost': (XGBRegressor(), {'n_estimators': [100, 200], 'learning_rate': [0.01, 0.1]}),
    'SVR': (SVR(), {'C': [0.1, 1.0, 10.0], 'epsilon': [0.1, 0.2]}),
    'KNN': (KNeighborsRegressor(), {'n_neighbors': [3, 5, 7]})
}

best_model = {}

# Iterate over the models and perform grid search
for name, (model, param_grid) in models.items():
    search = GridSearchCV(model, param_grid, cv=5)
    search.fit(X_train_5, y_train_5)
    print(f"Best parameters for {name}: {search.best_params_}")
    print(f"Best score for {name}: {search.best_score_}")
    best_model[name] = search.best_estimator_

# Make predictions using the best model for each algorithm
predictions = {}
for name, model in best_model.items():
    predictions[name] = model.predict(X_test_5)

# Example: Calculate metrics for one of the models (e.g., GradientBoosting)
model_name = 'GradientBoosting'
mse = mean_squared_error(y_test_5, predictions[model_name])
mae = mean_absolute_error(y_test_5, predictions[model_name])
r2 = r2_score(y_test_5, predictions[model_name])
adjusted_r2 = 1 - (1 - r2) * (len(y_test_5) - 1) / (len(y_test_5) - X_test_5.shape[1] - 1)


print(f"Metrics for {model_name}:")
print(f"Mean Squared Error: {mse}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")
print(f"Adjusted R²: {adjusted_r2}")

Best parameters for Ridge: {'alpha': 10.0}
Best score for Ridge: 0.7926647789687568
Best parameters for Lasso: {'alpha': 0.1}
Best score for Lasso: 0.7926474124364626
Best parameters for ElasticNet: {'alpha': 0.1, 'l1_ratio': 0.9}
Best score for ElasticNet: 0.7926672065395989
Best parameters for DecisionTree: {'max_depth': 10}
Best score for DecisionTree: 0.4451038056966269
Best parameters for RandomForest: {'max_depth': 20, 'n_estimators': 200}
Best score for RandomForest: 0.7397418118387445
Best parameters for GradientBoosting: {'learning_rate': 0.1, 'n_estimators': 200}
Best score for GradientBoosting: 0.7564102495087539
Best parameters for XGBoost: {'learning_rate': 0.1, 'n_estimators': 100}
Best score for XGBoost: 0.7439904808998108
Best parameters for SVR: {'C': 10.0, 'epsilon': 0.1}
Best score for SVR: -0.04151069663904283
Best parameters for KNN: {'n_neighbors': 7}
Best score for KNN: 0.6216407382889019
Metrics for GradientBoosting:
Mean Squared Error: 9089463328.168728
Mean Ab
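
The mean squared error printed above is in squared target units; its square root, about 95,339, is in the same units as `description.sold_price` and is easier to read. The sketch below gathers the metrics from the cell above, plus RMSE, into one helper so they can be reported for every model rather than only one. The function name `regression_report` and the four-point example are assumptions for illustration; the adjusted R² uses the same formula as the cell above.

```python
import numpy as np


def regression_report(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, R² and adjusted R² for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    mae = float(np.mean(np.abs(residuals)))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - float(np.sum(residuals ** 2)) / ss_tot
    # Penalise R² by the number of features: 1 - (1 - R²)(n - 1)/(n - p - 1).
    adjusted_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mse": mse, "rmse": mse ** 0.5, "mae": mae, "r2": r2, "adjusted_r2": adjusted_r2}


# Tiny hand-checkable example (assumption: not project data).
report = regression_report([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0], n_features=1)
for key, value in report.items():
    print(f"{key}: {value:.4f}")
```

Called as `regression_report(y_test_5, predictions[name], X_test_5.shape[1])` inside a loop over `predictions`, it would give one row of metrics per model.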

In [13]:
# Define the model
model = LinearRegression()

# Define the parameter grid
param_grid = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'n_jobs': [None, -1],  # Adding 'n_jobs' parameter for parallel processing
    'positive': [True, False]  # Adding 'positive' parameter to enforce positive coefficients
}

# Set up the grid search
search = GridSearchCV(model, param_grid, cv=5)

# Execute the search
result = search.fit(X_train, y_train)




# Print the best parameters and best score
print(f"Best Parameters: {result.best_params_}")
print(f"Best Score: {result.best_score_}")

Best Parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': True}
Best Score: 0.8287020451389531
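
In the grid above, only `fit_intercept` and `positive` change the fitted model. `copy_X` controls whether the input array is copied and `n_jobs` controls parallelism, so neither should change the cross-validated score. A reduced grid would search 4 candidates instead of 16. The sketch below runs such a grid on synthetic data, which is an assumption for illustration and not the project's files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data with a non-zero intercept (assumption: not the project's CSVs).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([1.5, 2.0, 0.5]) + 4.0 + rng.normal(scale=0.1, size=100)

# Only the two options that alter the fitted coefficients are searched.
reduced_grid = {
    "fit_intercept": [True, False],
    "positive": [True, False],
}

search = GridSearchCV(LinearRegression(), reduced_grid, cv=5)
search.fit(X_demo, y_demo)

n_candidates = len(search.cv_results_["params"])
print(f"Candidates evaluated: {n_candidates}")
print(f"Best Parameters: {search.best_params_}")
print(f"Best Score: {search.best_score_:.4f}")
```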


In [14]:
# Define the model for dataset 2 (X_train_2)
model = LinearRegression()

# Define the parameter grid
param_grid = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'n_jobs': [None, -1],  # Adding 'n_jobs' parameter for parallel processing
    'positive': [True, False]  # Adding 'positive' parameter to enforce positive coefficients
}

# Set up the grid search
search = GridSearchCV(model, param_grid, cv=5)

# Execute the search
result = search.fit(X_train_2, y_train_2)

# Print the best parameters and best score
print(f"Best Parameters: {result.best_params_}")
print(f"Best Score: {result.best_score_}")

Best Parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
Best Score: 0.8165893741250898


In [15]:
# Define the model for dataset 3 (X_train_3)
model = LinearRegression()

# Define the parameter grid
param_grid = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'n_jobs': [None, -1],  # Adding 'n_jobs' parameter for parallel processing
    'positive': [True, False]  # Adding 'positive' parameter to enforce positive coefficients
}

# Set up the grid search
search = GridSearchCV(model, param_grid, cv=5)

# Execute the search
result = search.fit(X_train_3, y_train_3)


# Print the best parameters and best score
print(f"Best Parameters: {result.best_params_}")
print(f"Best Score: {result.best_score_}")

Best Parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': True}
Best Score: 0.7884045341350576


In [16]:
# Define the model for dataset 4 (X_train_4)
model = LinearRegression()

# Define the parameter grid
param_grid = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'n_jobs': [None, -1],  # Adding 'n_jobs' parameter for parallel processing
    'positive': [True, False]  # Adding 'positive' parameter to enforce positive coefficients
}

# Set up the grid search
search = GridSearchCV(model, param_grid, cv=5)

# Execute the search
result = search.fit(X_train_4, y_train_4)



# Print the best parameters and best score
print(f"Best Parameters: {result.best_params_}")
print(f"Best Score: {result.best_score_}")

Best Parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': True}
Best Score: 0.7946639837775161


In [17]:
# Define the model for dataset 5 (X_train_5)
model = LinearRegression()

# Define the parameter grid
param_grid = {
    'fit_intercept': [True, False],
    'copy_X': [True, False],
    'n_jobs': [None, -1],  # Adding 'n_jobs' parameter for parallel processing
    'positive': [True, False]  # Adding 'positive' parameter to enforce positive coefficients
}

# Set up the grid search
search = GridSearchCV(model, param_grid, cv=5)

# Execute the search
result = search.fit(X_train_5, y_train_5)



# Print the best parameters and best score
print(f"Best Parameters: {result.best_params_}")
print(f"Best Score: {result.best_score_}")

Best Parameters: {'copy_X': True, 'fit_intercept': True, 'n_jobs': None, 'positive': False}
Best Score: 0.792647499048425


In [18]:
# # Create a linear regression model
# model = LinearRegression()

# # Train the model
# model.fit(X_train_5, y_train_5)

# # Make predictions
# predictions = model.predict(X_test_5)

# # Calculate scores
# mae = mean_absolute_error(y_test_5, predictions)
# r2 = r2_score(y_test_5, predictions)

# print(f"Mean Absolute Error: {mae}")
# print(f"R² score: {r2}")

In [19]:
# # Create a linear regression model _2
# model = LinearRegression()

# # Train the model
# model.fit(X_train_2, y_train_2)

# # Make predictions
# predictions = model.predict(X_test_2)

# # Calculate scores
# mae = mean_absolute_error(y_test_2, predictions)
# r2 = r2_score(y_test_2, predictions)

# print(f"Mean Absolute Error: {mae}")
# print(f"R² score: {r2}")

In [20]:
# # Create a linear regression model _3
# model = LinearRegression()

# # Train the model
# model.fit(X_train_3, y_train_3)

# # Make predictions
# predictions = model.predict(X_test_3)

# # Calculate scores
# mae = mean_absolute_error(y_test_3, predictions)
# r2 = r2_score(y_test_3, predictions)

# print(f"Mean Absolute Error: {mae}")
# print(f"R² score: {r2}")

In [21]:
# # Create a linear regression model
# model = LinearRegression()

# # Train the model
# model.fit(X_train_4, y_train_4)

# # Make predictions
# predictions = model.predict(X_test_4)

# # Calculate scores
# mae = mean_absolute_error(y_test_4, predictions)
# r2 = r2_score(y_test_4, predictions)

# print(f"Mean Absolute Error: {mae}")
# print(f"R² score: {r2}")

In [22]:
# # Create a linear regression model
# model = LinearRegression()

# # Train the model
# model.fit(X_train_5, y_train_5)

# # Make predictions
# predictions = model.predict(X_test_5)

# # Calculate scores
# mae = mean_absolute_error(y_test_5, predictions)
# r2 = r2_score(y_test_5, predictions)

# print(f"Mean Absolute Error: {mae}")
# print(f"R² score: {r2}")

In [1]:
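# Define the model (the import and data-loading cells must be run first in this kernel session, otherwise this cell raises NameError)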
model = RandomForestRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test)

# Calculate scores
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

NameError: name 'RandomForestRegressor' is not defined
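
Once the imports are in place, the cost of this search is worth checking before running it. The grid above has 3 × 4 × 3 × 3 = 108 parameter combinations, so with `cv=5` it performs 540 cross-validation fits plus one final refit. The arithmetic can be confirmed with the standard library alone:

```python
from math import prod

# Same grid as the random forest cell above.
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
cv = 5

# Every combination of values is one candidate; each candidate is fit once per fold.
n_candidates = prod(len(values) for values in param_grid.values())
n_fits = n_candidates * cv
print(f"Candidates: {n_candidates}, cross-validation fits: {n_fits}")
```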

In [None]:
# Define the model _2
model = RandomForestRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_2, y_train_2)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_2)

# Calculate scores
mae = mean_absolute_error(y_test_2, predictions)
r2 = r2_score(y_test_2, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model _3
model = RandomForestRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_3, y_train_3)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_3)

# Calculate scores
mae = mean_absolute_error(y_test_3, predictions)
r2 = r2_score(y_test_3, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = RandomForestRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_4, y_train_4)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_4)

# Calculate scores
mae = mean_absolute_error(y_test_4, predictions)
r2 = r2_score(y_test_4, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = RandomForestRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_5, y_train_5)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_5)

# Calculate scores
mae = mean_absolute_error(y_test_5, predictions)
r2 = r2_score(y_test_5, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test)

# Calculate scores
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_2, y_train_2)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_2)

# Calculate scores
mae = mean_absolute_error(y_test_2, predictions)
r2 = r2_score(y_test_2, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_3, y_train_3)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_3)

# Calculate scores
mae = mean_absolute_error(y_test_3, predictions)
r2 = r2_score(y_test_3, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_4, y_train_4)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_4)

# Calculate scores
mae = mean_absolute_error(y_test_4, predictions)
r2 = r2_score(y_test_4, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = XGBRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1],
    'subsample': [0.8],
    'colsample_bytree': [1.0],
    'gamma': [0],
    'reg_alpha': [0.1],
    'reg_lambda': [1]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_5, y_train_5)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_5)

# Calculate scores
mae = mean_absolute_error(y_test_5, predictions)
r2 = r2_score(y_test_5, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test)

# Calculate scores
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_2, y_train_2)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_2)

# Calculate scores
mae = mean_absolute_error(y_test_2, predictions)
r2 = r2_score(y_test_2, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_3, y_train_3)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_3)

# Calculate scores
mae = mean_absolute_error(y_test_3, predictions)
r2 = r2_score(y_test_3, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_4, y_train_4)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_4)

# Calculate scores
mae = mean_absolute_error(y_test_4, predictions)
r2 = r2_score(y_test_4, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")

In [None]:
# Define the model
model = GradientBoostingRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [300],
    'max_depth': [3],
    'learning_rate': [0.1, 0.2],
    'subsample': [0.8, 1.0],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train_5, y_train_5)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test_5)

# Calculate scores
mae = mean_absolute_error(y_test_5, predictions)
r2 = r2_score(y_test_5, predictions)

print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² score: {r2}")
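
The cells above repeat the same fit, predict, and score steps for each dataset. As a minimal sketch, that pattern could be wrapped in one helper and looped over a dictionary of datasets. The names `run_grid_search` and `demo_datasets` are hypothetical, and the data here is synthetic (`make_regression` with a `Ridge` model) so the block runs on its own; in this notebook the dictionary would instead hold tuples such as `(X_train, y_train, X_test, y_test)`.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split


def run_grid_search(estimator, param_grid, datasets, cv=5, n_jobs=None):
    """Run one grid search per dataset and collect test-set scores.

    `datasets` maps a label to (X_train, y_train, X_test, y_test).
    """
    results = {}
    for name, (X_tr, y_tr, X_te, y_te) in datasets.items():
        # GridSearchCV clones the estimator, so it can be reused safely
        search = GridSearchCV(estimator, param_grid, cv=cv, scoring="r2", n_jobs=n_jobs)
        search.fit(X_tr, y_tr)
        preds = search.best_estimator_.predict(X_te)
        results[name] = {
            "best_params": search.best_params_,
            "mae": mean_absolute_error(y_te, preds),
            "r2": r2_score(y_te, preds),
        }
    return results


# Synthetic stand-in data (an assumption, not the housing data)
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)
demo_datasets = {"synthetic": (X_tr, y_tr, X_te, y_te)}

demo_results = run_grid_search(Ridge(), {"alpha": [0.1, 1.0, 10.0]}, demo_datasets)
for name, scores in demo_results.items():
    print(name, scores)
```

Keeping every result in one dictionary also makes it easier to compare datasets side by side, since `best_model` is no longer overwritten by each new cell.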

Consider what metrics you want to use to evaluate success.
- If you think about mean squared error, can we actually relate to the amount of error?
- Try root mean squared error so that the error is expressed in the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use

In [None]:
# # gather evaluation metrics and compare results
# # Make predictions
# predictions = best_model.predict(X_test)

# # Calculate metrics
# mse = mean_squared_error(y_test, predictions)
# rmse = np.sqrt(mse)
# mae = mean_absolute_error(y_test, predictions)
# r2 = r2_score(y_test, predictions)
# adjusted_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)

# # Print metrics
# print(f"Mean Squared Error (MSE): {mse}")
# print(f"Root Mean Squared Error (RMSE): {rmse}")
# print(f"Mean Absolute Error (MAE): {mae}")
# print(f"R²: {r2}")
# print(f"Adjusted R²: {adjusted_r2}")


In [None]:
# # Make predictions
# predictions = best_model.predict(X_test_2)

# # Calculate metrics
# mse = mean_squared_error(y_test_2, predictions)
# rmse = np.sqrt(mse)
# mae = mean_absolute_error(y_test_2, predictions)
# r2 = r2_score(y_test_2, predictions)
# adjusted_r2 = 1 - (1 - r2) * (len(y_test_2) - 1) / (len(y_test_2) - X_test_2.shape[1] - 1)

# # Print metrics
# print(f"Mean Squared Error (MSE): {mse}")
# print(f"Root Mean Squared Error (RMSE): {rmse}")
# print(f"Mean Absolute Error (MAE): {mae}")
# print(f"R²: {r2}")
# print(f"Adjusted R²: {adjusted_r2}")

In [None]:
# # Make predictions
# predictions = best_model.predict(X_test_3)

# # Calculate metrics
# mse = mean_squared_error(y_test_3, predictions)
# rmse = np.sqrt(mse)
# mae = mean_absolute_error(y_test_3, predictions)
# r2 = r2_score(y_test_3, predictions)
# adjusted_r2 = 1 - (1 - r2) * (len(y_test_3) - 1) / (len(y_test_3) - X_test_3.shape[1] - 1)

# # Print metrics
# print(f"Mean Squared Error (MSE): {mse}")
# print(f"Root Mean Squared Error (RMSE): {rmse}")
# print(f"Mean Absolute Error (MAE): {mae}")
# print(f"R²: {r2}")
# print(f"Adjusted R²: {adjusted_r2}")

In [None]:
# # gather evaluation metrics and compare results
# # Make predictions
# predictions = best_model.predict(X_test_4)

# # Calculate metrics
# mse = mean_squared_error(y_test_4, predictions)
# rmse = np.sqrt(mse)
# mae = mean_absolute_error(y_test_4, predictions)
# r2 = r2_score(y_test_4, predictions)
# adjusted_r2 = 1 - (1 - r2) * (len(y_test_4) - 1) / (len(y_test_4) - X_test_4.shape[1] - 1)

# # Print metrics
# print(f"Mean Squared Error (MSE): {mse}")
# print(f"Root Mean Squared Error (RMSE): {rmse}")
# print(f"Mean Absolute Error (MAE): {mae}")
# print(f"R²: {r2}")
# print(f"Adjusted R²: {adjusted_r2}")

### Metrics for Regression Evaluation

- **Mean Squared Error (MSE)**
    - Description: Measures the average squared difference between predicted and actual values.
    - Pros: Penalizes larger errors more than smaller ones.
    - Cons: The units are squared, making it less interpretable in the context of the original data.
- **Root Mean Squared Error (RMSE)**
    - Description: The square root of MSE, providing error in the same units as the target variable.
    - Pros: Easier to interpret than MSE because it's in the same units as the target variable (e.g., dollars).
    - Cons: Still sensitive to outliers, as it squares the errors before averaging.
- **Mean Absolute Error (MAE)**
    - Description: Measures the average absolute difference between predicted and actual values.
    - Pros: Less sensitive to outliers compared to MSE and RMSE. Provides a straightforward interpretation of the average error.
    - Cons: Does not penalize larger errors as much as MSE or RMSE.
- **R² (Coefficient of Determination)**
    - Description: Represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
    - Pros: Provides a measure of how well the model explains the variability of the target variable.
    - Cons: Can be misleading if used alone, especially with overfitting.
- **Adjusted R²**
    - Description: Adjusts the R² value based on the number of predictors in the model, penalizing for adding predictors that do not improve the model.
    - Pros: More reliable than R² for comparing models with different numbers of predictors.
    - Cons: Can be more complex to interpret.

### Choosing Metrics

- **RMSE**: Useful for understanding the model's error in the same units as the target variable. It is sensitive to outliers, which can be a drawback if the dataset contains significant outliers.
- **MAE**: Provides a straightforward interpretation of the average error and is less sensitive to outliers. It is a good metric for understanding the typical error magnitude.
- **R² and Adjusted R²**: Useful for understanding the proportion of variance explained by the model. Adjusted R² is particularly useful when comparing models with different numbers of predictors.
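
To make the outlier point concrete, here is a small worked sketch. The prices, the two sets of predictions, and the helper name `regression_report` are all made up for illustration, and `n_features=2` is an arbitrary assumption used only to show the adjusted R² formula. Both prediction sets have the same MAE, but the one that concentrates its error in a single sale has a much larger RMSE.

```python
import numpy as np


def regression_report(y_true, y_pred, n_features):
    """Return RMSE, MAE, R² and adjusted R² for one set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    n = len(y_true)

    rmse = np.sqrt(np.mean(errors ** 2))
    mae = np.mean(np.abs(errors))
    ss_res = np.sum(errors ** 2)                      # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"rmse": rmse, "mae": mae, "r2": r2, "adjusted_r2": adjusted_r2}


# Hypothetical sale prices in dollars
y_true = np.array([200_000, 250_000, 300_000, 350_000, 400_000, 450_000])

# Case 1: every prediction is off by $10,000
y_pred_even = y_true + np.array([10_000, -10_000, 10_000, -10_000, 10_000, -10_000])
# Case 2: five perfect predictions and one that is off by $60,000
y_pred_outlier = y_true + np.array([0, 0, 0, 0, 0, 60_000])

even = regression_report(y_true, y_pred_even, n_features=2)
outlier = regression_report(y_true, y_pred_outlier, n_features=2)

# MAE is 10,000 in both cases; RMSE is 10,000 vs roughly 24,495
for label, report in [("even errors", even), ("one outlier", outlier)]:
    print(label, {k: round(float(v), 4) for k, v in report.items()})
```

Reporting RMSE and MAE together is one way to spot this situation: a large gap between the two suggests that a few large misses dominate the error.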

## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended that you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
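
As a starting point, the sketch below compares two of the approaches named above, Lasso (through `SelectFromModel`) and RFE, against the full feature set. It is illustrative only: the data is synthetic (`make_regression` with 20 features, 5 of them informative), the choice of 5 features for RFE is an assumption, and `score_subset` is a hypothetical helper. For this project, the synthetic arrays would be swapped for one of the train/test sets loaded at the top of the notebook.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20 features, only 5 of which drive the target
X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                       noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Lasso: keep the features whose coefficients are not shrunk to (near) zero
lasso_selector = SelectFromModel(LassoCV(cv=5, random_state=0)).fit(X_tr, y_tr)
lasso_mask = lasso_selector.get_support()

# RFE: repeatedly drop the weakest feature until 5 remain
rfe_selector = RFE(LinearRegression(), n_features_to_select=5).fit(X_tr, y_tr)
rfe_mask = rfe_selector.get_support()


def score_subset(mask):
    """Refit a linear model on the selected columns and score it on the test set."""
    model = LinearRegression().fit(X_tr[:, mask], y_tr)
    preds = model.predict(X_te[:, mask])
    return mean_absolute_error(y_te, preds), r2_score(y_te, preds)


subsets = {
    "all features": np.ones(X_tr.shape[1], dtype=bool),
    "lasso": lasso_mask,
    "rfe": rfe_mask,
}
scores = {name: score_subset(mask) for name, mask in subsets.items()}

for name, mask in subsets.items():
    mae, r2 = scores[name]
    print(f"{name:>12}: {int(mask.sum()):2d} features, MAE={mae:.2f}, R2={r2:.3f}")
```

Both selectors are fitted on the training split only, so the test scores stay an honest estimate of how the reduced model generalizes.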



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)