## Model Selection

This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross-validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
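
A baseline comparison along these lines could look like the sketch below. It is only an illustration: it uses synthetic data from `make_regression` as a stand-in for the housing data (an assumption, so the block runs on its own), leaves every model at its default settings, and omits xgboost to avoid an extra dependency. In the notebook itself, the `X_train`/`y_train`/`X_test`/`y_test` split would take the place of the synthetic arrays.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (assumption: the real notebook
# would use its own train/test split instead).
X, y = make_regression(n_samples=300, n_features=10, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Default settings only: these are baselines, not tuned models.
baselines = {
    "linear_regression": LinearRegression(),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler.
    "svr": make_pipeline(StandardScaler(), SVR()),
    "random_forest": RandomForestRegressor(random_state=0),
}

# Fit each baseline and record the same metrics for every model.
results = {}
for name, estimator in baselines.items():
    estimator.fit(X_tr, y_tr)
    preds = estimator.predict(X_te)
    results[name] = {
        "mae": mean_absolute_error(y_te, preds),
        "r2": r2_score(y_te, preds),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['mae']:.2f}, R2={scores['r2']:.3f}")
```

Keeping the models in one dictionary makes it cheap to add another candidate later and guarantees that every model is scored the same way.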

In [6]:
# import models and fit
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, r2_score, mean_squared_error

In [7]:
train_df = pd.read_csv('../processed/train.csv')
test_df = pd.read_csv('../processed/test.csv')


In [8]:
train_df.head()

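| Unnamed: 0 | description.year_built | description.sold_price | description.baths | description.garage | description.stories | description.beds | num_days | central_air | dishwasher | fireplace | basement | price_per_sqft | median_value_per_sqft | description.type_encoded | city_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1952 | 429000 | 2 | 2 | 1 | 4 | 93 | 0 | 0 | 1 | 0 | 239 | 1 | 1 | 54 |
| 1 | 2002 | 685000 | 3 | 1 | 2 | 3 | 128 | 1 | 1 | 0 | 1 | 323 | 0 | 2 | 51 |
| 2 | 1923 | 138000 | 2 | 1 | 2 | 3 | 57 | 1 | 1 | 0 | 1 | 76 | 0 | 1 | 29 |
| 3 | 2015 | 609000 | 3 | 3 | 1 | 3 | 95 | 1 | 1 | 1 | 1 | 210 | 1 | 1 | 32 |
| 4 | 1952 | 270000 | 1 | 2 | 1 | 4 | 66 | 0 | 1 | 1 | 1 | 208 | 1 | 1 | 35 |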


In [9]:
X_train = train_df.drop(columns=['description.sold_price'])
y_train = train_df['description.sold_price']
X_test = test_df.drop(columns=['description.sold_price'])
y_test = test_df['description.sold_price']
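
The `Unnamed: 0` column in the `train_df.head()` output is what pandas produces when a CSV has an unnamed first column, which usually means a row index was written out by an earlier `to_csv` call. Because the cell above drops only `description.sold_price`, that column stays in `X_train` as a feature. A minimal sketch of one way to avoid this, assuming the column really is a saved index (the two-row CSV below is a made-up stand-in for the real files):

```python
import io

import pandas as pd

# Tiny stand-in for '../processed/train.csv' (assumption: the real file
# starts with an unnamed index column, like this one).
csv_text = """,description.sold_price,description.beds
0,429000,4
1,685000,3
"""

# Default read: the unnamed first column becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to treat the first column as the row index.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Writing the processed files with `to_csv(..., index=False)` in the earlier notebook would have the same effect without needing `index_col` here.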

Consider which metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars. Can we actually relate that to the size of a typical error?
- Try root mean squared error (RMSE), which puts the error back in the original units (dollars).
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use.
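
To make these questions concrete, the sketch below computes MAE, RMSE, R^2, and adjusted R^2 with NumPy and applies them to two sets of made-up predictions. The helper `regression_metrics` and all of the prices are hypothetical and exist only for illustration. Both prediction sets have the same MAE, but one concentrates all of its error in a single house.

```python
import numpy as np


def regression_metrics(y_true, y_pred, n_features):
    """Return MAE, RMSE, R^2 and adjusted R^2 (hypothetical helper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_true - y_pred
    n = y_true.shape[0]

    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))  # back in dollars, unlike MSE
    ss_res = np.sum(errors ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    # Same adjustment as used later in this notebook.
    adj_r2 = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)
    return {"mae": mae, "rmse": rmse, "r2": r2, "adj_r2": adj_r2}


# Made-up sale prices and two made-up sets of predictions.
y_true = np.array([200_000, 300_000, 400_000, 500_000, 600_000])
even_miss = y_true + np.array([20_000, -20_000, 20_000, -20_000, 20_000])
one_big_miss = y_true + np.array([0, 0, 0, 0, 100_000])

even = regression_metrics(y_true, even_miss, n_features=2)
skewed = regression_metrics(y_true, one_big_miss, n_features=2)

print(f"even errors:   MAE={even['mae']:.0f}  RMSE={even['rmse']:.0f}")
print(f"one big error: MAE={skewed['mae']:.0f}  RMSE={skewed['rmse']:.0f}")
```

Both cases have an MAE of 20,000 dollars, while the RMSE of the second case is more than twice that of the first, because squaring the errors gives the single large miss much more weight.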

In [10]:
# Define the model
model = RandomForestRegressor()

# Define the parameter grid
param_grid = {
    'n_estimators': [100, 200, 300],  # 3 values
    'max_depth': [None, 10, 20, 30],  # 4 values
    'min_samples_split': [2, 5, 10],  # 3 values
    'min_samples_leaf': [1, 2, 4],  # 3 values
    'max_features': ['sqrt', 'log2', 0.5],  # 3 values
    'bootstrap': [True, False],  # 2 values
    'oob_score': [True],  # 1 value; invalid with bootstrap=False, so those fits fail
    'ccp_alpha': [0.0, 0.001, 0.01],  # 3 values
    'max_leaf_nodes': [None, 50, 100],  # 3 values
    'min_impurity_decrease': [0.0, 0.0001, 0.001]  # 3 values
}

# Set up the grid search
grid_search = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, n_jobs=-1, scoring='r2')

# Execute the search
grid_search.fit(X_train, y_train)

# Get the best model
best_model = grid_search.best_estimator_

# Make predictions
predictions = best_model.predict(X_test)

# Calculate scores
mae = mean_absolute_error(y_test, predictions)
r2 = r2_score(y_test, predictions)

# Calculate Adjusted R²
n = X_test.shape[0]  # Number of samples
p = X_test.shape[1]  # Number of features
adj_r2 = 1 - (1 - r2) * ((n - 1) / (n - p - 1))

# Print results
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Mean Absolute Error: {mae}")
print(f"R² Score: {r2}")
print(f"Adjusted R² Score: {adj_r2}")

43740 fits failed out of a total of 87480.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
43740 fits failed with the following error:
Traceback (most recent call last):
  File "c:\anaconda\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "c:\anaconda\anaconda3\Lib\site-packages\sklearn\base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\anaconda\anaconda3\Lib\site-packages\sklearn\ensemble\_forest.py", line 450, in fit
    raise ValueError("Out of bag estimation only available if bootstrap=True")
ValueError: Out of bag estimation only available if bootstrap=True

Best Parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'max_depth': None, 'max_features': 0.5, 'max_leaf_nodes': None, 'min_impurity_decrease': 0.0001, 'min_samples_leaf': 1, 'min_samples_split': 2, 'n_estimators': 200, 'oob_score': True}
Mean Absolute Error: 64223.21017699115
R² Score: 0.8089344577018689
Adjusted R² Score: 0.7962571231418033
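
The failed fits reported in the output above come from pairing `oob_score=True` with `bootstrap=False`, which the traceback shows scikit-learn rejects. One way to avoid the wasted work, sketched here as a suggestion and not as what the cell above ran, is to pass `GridSearchCV` a list of grids so that only valid combinations are generated. The name `shared` is introduced purely for illustration.

```python
from sklearn.model_selection import ParameterGrid

# Settings that are valid regardless of the bootstrap choice.
shared = {
    'n_estimators': [100, 200, 300],
    'max_depth': [None, 10, 20, 30],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', 0.5],
    'ccp_alpha': [0.0, 0.001, 0.01],
    'max_leaf_nodes': [None, 50, 100],
    'min_impurity_decrease': [0.0, 0.0001, 0.001],
}

# Two sub-grids: out-of-bag scoring is requested only when bootstrapping.
param_grid = [
    {**shared, 'bootstrap': [True], 'oob_score': [True]},
    {**shared, 'bootstrap': [False], 'oob_score': [False]},
]

# ParameterGrid expands the list the same way GridSearchCV would.
candidates = ParameterGrid(param_grid)
n_candidates = len(candidates)
n_invalid = sum(
    1 for params in candidates
    if params['oob_score'] and not params['bootstrap']
)

print(f"candidates: {n_candidates}, invalid combinations: {n_invalid}")
```

This list can be passed as `param_grid` to `GridSearchCV` unchanged. The number of candidates is the same as in the original grid, but every one of them can actually be fitted, so the `bootstrap=False` settings are compared instead of being scored as NaN.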


## Feature Selection - STRETCH

> **This step doesn't need to be part of your Minimum Viable Product (MVP), but it's recommended you complete it if you have time!**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection):
- Perform feature selection to get a reduced subset of your original features.
- Refit your models with this reduced dimensionality. How does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain.

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)
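
One possible shape for this step is sketched below using recursive feature elimination (RFE). It is an illustration under stated assumptions, not a finished solution: the data are synthetic (`make_regression` with 15 features, 5 of them informative), the estimator is a plain linear regression, and keeping 5 features is an arbitrary choice. With the real data, `X_train` and `y_train` would replace the synthetic arrays.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in: 15 features, only 5 of which drive the target.
X, y = make_regression(
    n_samples=200, n_features=15, n_informative=5, noise=10.0, random_state=0
)

# Full feature set.
full_model = LinearRegression()

# Reduced feature set: RFE sits inside the pipeline so that features are
# re-selected on each training fold, not on the validation data.
reduced_model = make_pipeline(
    RFE(estimator=LinearRegression(), n_features_to_select=5),
    LinearRegression(),
)

# Compare the two on the same metric and the same folds.
full_r2 = cross_val_score(full_model, X, y, cv=5, scoring="r2").mean()
reduced_r2 = cross_val_score(reduced_model, X, y, cv=5, scoring="r2").mean()

# Fit once on all rows to see which column indices survive.
selector = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X, y)
selected = [i for i, keep in enumerate(selector.support_) if keep]

print(f"R2 with all features:      {full_r2:.3f}")
print(f"R2 with selected features: {reduced_r2:.3f}")
print(f"Selected column indices:   {selected}")
```

If the two scores are close, the reduced model is the simpler choice, which is the comparison the bullet points above ask for.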