This notebook should include preliminary and baseline modeling.
- Try as many different models as possible.
- Don't worry about hyperparameter tuning or cross validation here.
- Ideas include:
    - linear regression
    - support vector machines
    - random forest
    - xgboost
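
A minimal sketch of how several of the models listed above could be fit and compared in one loop. It uses synthetic data from `make_regression` as a stand-in for `model_df.csv`; the `baselines` dictionary, the variable names, and all settings are illustrative assumptions. XGBoost is left out so the sketch depends only on scikit-learn.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the housing data (assumption, not the real dataset)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=10, noise=10.0, random_state=42
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Untuned baselines; SVR is wrapped with a scaler because it is scale-sensitive
baselines = {
    "linear_regression": LinearRegression(),
    "svr": make_pipeline(StandardScaler(), SVR()),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=42),
}

baseline_scores = {}
for name, model in baselines.items():
    model.fit(Xtr, ytr)
    pred = model.predict(Xte)
    baseline_scores[name] = {
        "mae": mean_absolute_error(yte, pred),
        "r2": r2_score(yte, pred),
    }
    print(f"{name}: MAE={baseline_scores[name]['mae']:.2f}, "
          f"R2={baseline_scores[name]['r2']:.3f}")
```

Scoring every model on the same held-out split keeps the comparison fair even before any tuning.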

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import r2_score

In [2]:
df = pd.read_csv('model_df.csv')

In [3]:
X, y = df.drop(['description.sold_price'], axis=1), df['description.sold_price']
print(X.shape, y.shape)

(5250, 63) (5250,)


In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(X_train.shape, X_test.shape)


(4200, 63) (1050, 63)


In [5]:
# Import the random forest regressor (the target, sold price, is continuous)
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(n_estimators=200)

rfr.fit(X_train, y_train)

rfr_pred = rfr.predict(X_test)

In [6]:
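# Fit a linear regression with statsmodels, which needs an explicit intercept column
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm

lr = LinearRegression()  # instantiated here but not used by the statsmodels fit below
X = sm.add_constant(X)  # adds a 'const' column of ones to the full feature matrix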

X




Unnamed: 0.1,const,Unnamed: 0,description.lot_sqft,description.sqft,description.beds,basement,big_lot,big_yard,carport,central_air,...,description.type_condo,description.type_condo_townhome_rowhome_coop,description.type_condos,description.type_duplex_triplex,description.type_land,description.type_mobile,description.type_multi_family,description.type_other,description.type_single_family,description.type_townhomes
0,1.0,0,22651.0,1539.0,3.0,1,1,1,0,1,...,0,0,0,0,0,0,0,0,1,0
1,1.0,1,2614.0,2429.0,3.0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,1.0,2,13504.0,1120.0,3.0,1,1,0,1,1,...,0,0,0,0,0,0,0,0,1,0
3,1.0,3,2688.0,2400.0,3.0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
4,1.0,4,871.0,1478.0,3.0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5245,1.0,8148,9583.0,2160.0,4.0,1,0,1,0,1,...,0,0,0,0,0,0,0,0,1,0
5246,1.0,8150,14810.0,1852.0,4.0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
5247,1.0,8151,871.0,1824.0,3.0,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
5248,1.0,8152,3049.0,1504.0,2.0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0


In [45]:
# OLS is fit on the full dataset (not the training split), so the summary reports in-sample fit
lin_reg = sm.OLS(y, X)
lr_model = lin_reg.fit()
print_model = lr_model.summary()
print(print_model)

                              OLS Regression Results                              
Dep. Variable:     description.sold_price   R-squared:                       0.484
Model:                                OLS   Adj. R-squared:                  0.478
Method:                     Least Squares   F-statistic:                     83.91
Date:                    Wed, 28 Feb 2024   Prob (F-statistic):               0.00
Time:                            16:31:45   Log-Likelihood:                -75204.
No. Observations:                    5250   AIC:                         1.505e+05
Df Residuals:                        5191   BIC:                         1.509e+05
Df Model:                              58                                         
Covariance Type:                nonrobust                                         
                                                   coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------
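
The summary above is fit on all 5,250 rows, so its R-squared is an in-sample figure and is not directly comparable with metrics computed on `X_test`. A minimal sketch of a held-out evaluation follows, assuming scikit-learn's `LinearRegression` (which fits its own intercept) and synthetic stand-in data; the names `lr_demo`, `r2_train`, and `r2_test` are illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data (assumption, not the real dataset)
X_demo, y_demo = make_regression(
    n_samples=400, n_features=8, noise=25.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit on the training rows only; the intercept is handled by the estimator
lr_demo = LinearRegression()
lr_demo.fit(X_tr, y_tr)

# Compare in-sample and held-out fit
r2_train = r2_score(y_tr, lr_demo.predict(X_tr))
r2_test = r2_score(y_te, lr_demo.predict(X_te))
print(f"train R2: {r2_train:.3f}, test R2: {r2_test:.3f}")
```

A large gap between the two values would suggest overfitting; similar values suggest the in-sample figure is a fair guide.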

Consider what metrics you want to use to evaluate success.
- Mean squared error is measured in squared dollars; can we actually relate to that amount of error?
- Try root mean squared error so that error is closer to the original units (dollars)
- What does RMSE do to outliers?
- Is mean absolute error a good metric for this problem?
- What about R^2? Adjusted R^2?
- Briefly describe your reasons for picking the metrics you use
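
To make the questions above concrete, here is a small worked sketch using made-up prices. The helper functions `rmse`, `mae`, and `adjusted_r2` are illustrative, not part of any library. Two sets of predictions have the same MAE, but the set with a single large miss has double the RMSE, which shows how RMSE penalizes outliers. The adjusted R-squared helper uses the standard formula $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.abs(diff)))

def adjusted_r2(r2, n_obs, n_features):
    """Adjusted R^2 for n_obs observations and n_features predictors."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Toy sale prices in dollars (made-up values)
y_true = np.array([200_000, 250_000, 300_000, 350_000], dtype=float)

pred_even = y_true + 10_000        # every prediction is off by $10k
pred_outlier = y_true.copy()
pred_outlier[-1] += 40_000         # one prediction is off by $40k, the rest exact

print(mae(y_true, pred_even), rmse(y_true, pred_even))
print(mae(y_true, pred_outlier), rmse(y_true, pred_outlier))

# Values taken from the OLS summary: R^2 = 0.484, 5250 observations, Df Model = 58
print(round(adjusted_r2(0.484, 5250, 58), 3))
```

With the OLS summary's values, the last line reproduces the reported Adj. R-squared of 0.478.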

In [47]:
# Evaluation metrics for the random forest regressor
forest_mae = mean_absolute_error(y_test, rfr_pred)
forest_mse = mean_squared_error(y_test, rfr_pred)
forest_rmse = np.sqrt(forest_mse)  # avoids squared=False, which is deprecated in recent scikit-learn releases
forest_r2 = r2_score(y_test, rfr_pred)

print(f'Random Forest MAE: {forest_mae}')
print(f'Random Forest MSE: {forest_mse}')
print(f'Random Forest RMSE: {forest_rmse}')
print(f'Random Forest R-squared: {forest_r2}')

Random Forest MAE: 21233.536238095243
Random Forest MSE: 21150021720.1309
Random Forest RMSE: 145430.47039781898
Random Forest R-squared: 0.95506219393717




**STRETCH**

Even with all the preprocessing we did in Notebook 1, you probably still have a lot of features. Are they all important for prediction?

Investigate some feature selection algorithms (Lasso, RFE, Forward/Backward Selection)
- Perform feature selection to get a reduced subset of your original features
- Refit your models with this reduced dimensionality - how does performance change on your chosen metrics?
- Based on this, should you include feature selection in your final pipeline? Explain

Remember, feature selection often doesn't directly improve performance, but if performance remains the same, a simpler model is often preferable.
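
One possible shape for this experiment is sketched below, using Lasso-based selection via scikit-learn's `SelectFromModel`. The data is synthetic (`make_regression`), and `alpha=1.0` and the variable names are illustrative assumptions rather than recommended settings. RFE or forward/backward selection could be swapped in at the same step.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 20 features, only 5 of which drive the target (assumption)
X_demo, y_demo = make_regression(
    n_samples=500, n_features=20, n_informative=5, noise=5.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Lasso penalizes coefficient size, so features are put on a common scale first
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

# Keep the features whose Lasso coefficients are not shrunk to (near) zero
selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_tr_s, y_tr)
mask = selector.get_support()

# Refit the same model on the full and the reduced feature sets
full_model = LinearRegression().fit(X_tr_s, y_tr)
reduced_model = LinearRegression().fit(X_tr_s[:, mask], y_tr)

full_r2 = r2_score(y_te, full_model.predict(X_te_s))
reduced_r2 = r2_score(y_te, reduced_model.predict(X_te_s[:, mask]))

print(f"features kept: {int(mask.sum())} of {mask.size}")
print(f"test R2 (full): {full_r2:.3f}, test R2 (reduced): {reduced_r2:.3f}")
```

The selector is fit on the training rows only, so the test split stays untouched until the final comparison.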



In [None]:
# perform feature selection 
# refit models
# gather evaluation metrics and compare to the previous step (full feature set)