# Simple Regression model with target feature price 

Now that we have a clean dataframe with no multicollinearity, we can attempt to create a simple model first. Afterwards, we can attempt to create a multi-feature model. Again, our target feature is price.

**Main Goals for model fitting:**
* Fit an initial regression model. Using statistical analysis, look at the p-value of each feature to determine which features are important.
* Test for normality using the Jarque-Bera test. Test for heteroscedasticity.
* From the tests we can refine and improve our model. 

In [1]:
# import all necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
%matplotlib inline
plt.style.use('seaborn')

import statsmodels.api as sm
import scipy.stats as stats

import warnings
warnings.filterwarnings("ignore")

Three of the main assumptions for regression models are:
1. Linearity
> There needs to be a linear relationship between the target variable and the predictors being used.
2. Normality
> The residual errors from the model are assumed to be normally distributed. This can be checked using a Quantile-Quantile (Q-Q) plot.
3. Homoscedasticity
> The residual errors from the model should have a constant variance across the range of the target variable and the predictors. This can be viewed through a residual plot. The errors must be random, with no visible pattern.

In [6]:
# read cleaned dataframe
clean = pd.read_csv('../data/king_clean_df.csv', index_col=0)
clean.head()

Unnamed: 0,price_log,sqft_lot,sqft_above,sqft_garage,sqft_patio,bedrooms,bathrooms,floors,condition_num,extracted_grade_num,waterfront_YES,greenbelt_YES,basement_1.0,sewer_PUBLIC,heat_Electricity/Solar,heat_Gas,heat_Gas/Solar,heat_Oil
21177,12.422989,39808,1790,460,290,3,1.5,1.0,4,7,0,0,0,1,0,0,0,1
10844,13.319574,12866,2232,440,60,4,1.5,1.5,4,7,0,0,0,1,0,1,0,0
9292,13.835313,15156,1380,0,0,5,2.0,1.0,4,8,0,0,1,1,0,0,0,1
17878,13.321214,15552,1210,0,330,5,2.0,1.0,3,7,0,0,1,0,0,0,0,1
14450,13.458836,8620,1720,0,0,3,2.0,1.5,5,7,0,0,0,1,0,1,0,0


## Model 1 with sqft_garage, sqft_patio, and sqft_basement treated as a categorical variable using dummy variables

In [7]:
# create a copy of the cleaned data without the target variable
model1_df = clean.drop(['price_log'], axis=1)

# Specify the model parameters
X = model1_df
y = clean['price_log']

In [8]:
model1 = sm.OLS(y, sm.add_constant(X))
model1_results = model1.fit()
print(model1_results.summary())

                            OLS Regression Results                            
Dep. Variable:              price_log   R-squared:                       0.481
Model:                            OLS   Adj. R-squared:                  0.481
Method:                 Least Squares   F-statistic:                     1643.
Date:                Tue, 25 Jul 2023   Prob (F-statistic):               0.00
Time:                        14:56:00   Log-Likelihood:                -16540.
No. Observations:               30110   AIC:                         3.312e+04
Df Residuals:                   30092   BIC:                         3.327e+04
Df Model:                          17                                         
Covariance Type:            nonrobust                                         
                             coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------
const                     11

**Observations:**
* It seems that all the features are statistically significant. 
* The model explains only about 48% of the variance in price_log (R-squared = 0.481).

In [9]:
clean.corr()['price_log'].sort_values(ascending=False)

price_log                 1.000000
extracted_grade_num       0.614471
sqft_above                0.548116
bathrooms                 0.516526
bedrooms                  0.345930
sqft_patio                0.310323
sqft_garage               0.284608
floors                    0.234231
heat_Gas                  0.183325
waterfront_YES            0.138611
basement_1.0              0.136989
greenbelt_YES             0.097197
sqft_lot                  0.084713
heat_Gas/Solar            0.036587
sewer_PUBLIC              0.032915
condition_num             0.009101
heat_Electricity/Solar   -0.015479
heat_Oil                 -0.081566
Name: price_log, dtype: float64
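
The features with the strongest correlation to the target can also be selected programmatically instead of being read off the ranking above. This is a minimal sketch on synthetic stand-in data with hypothetical column names. It ranks by absolute correlation, which is an assumption that differs slightly from the signed sort used above, so that strongly negative relationships are not overlooked.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical, not the King County dataframe)
rng = np.random.default_rng(2)
n = 500
demo = pd.DataFrame({
    "feat_a": rng.normal(size=n),
    "feat_b": rng.normal(size=n),
    "feat_c": rng.normal(size=n),
    "feat_d": rng.normal(size=n),  # unrelated to the target
})
demo["target"] = (
    1.0 * demo["feat_a"] - 0.8 * demo["feat_b"] + 0.5 * demo["feat_c"]
    + rng.normal(scale=0.5, size=n)
)

# Correlation of every feature with the target, excluding the target itself
corr_with_target = demo.corr()["target"].drop("target")

# Keep the 3 features with the largest absolute correlation
top_features = corr_with_target.abs().nlargest(3).index.tolist()
print(top_features)
```

The resulting list can be passed directly as a column selector, for example `demo[top_features]`, to build the reduced feature matrix.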

## Model 2 with top 3 features

In [11]:
# Specify the model parameters
X = model1_df[['extracted_grade_num', 'sqft_above', 'bathrooms']]
y = clean['price_log']

In [13]:
model2 = sm.OLS(y, sm.add_constant(X))
model2_results = model2.fit()
print(model2_results.summary())

                            OLS Regression Results                            
Dep. Variable:              price_log   R-squared:                       0.414
Model:                            OLS   Adj. R-squared:                  0.414
Method:                 Least Squares   F-statistic:                     7088.
Date:                Tue, 25 Jul 2023   Prob (F-statistic):               0.00
Time:                        15:23:57   Log-Likelihood:                -18382.
No. Observations:               30110   AIC:                         3.677e+04
Df Residuals:                   30106   BIC:                         3.680e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
const                  11.7419    