# MODELING & EVALUATION
- Goal: develop a **regression model** that performs better than a **baseline**.

- You must **evaluate a baseline model** and show that your final model performs better than it does.

- `model.py` will contain the functions to fit, predict, and evaluate the model.

- Your notebook will contain the **various algorithms** and/or hyperparameters you tried before settling on the final algorithm, along with the evaluation code and results.

- Be sure to evaluate your model using the **standard techniques**: plotting the residuals, computing the evaluation metrics (SSE, MSE, and/or RMSE), comparing to the baseline, and plotting y against y_hat.
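
The metrics named in the list above can be sketched with NumPy alone. This is a minimal illustration, not the contents of `model.py`: the helper names `regression_errors` and `baseline_errors` are hypothetical, and the four-value arrays are made-up toy data.

```python
import numpy as np


def regression_errors(actual, predicted):
    """Return SSE, MSE, and RMSE for a set of predictions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted          # y - y_hat
    sse = float(np.sum(residuals ** 2))     # sum of squared errors
    mse = sse / len(actual)                 # mean squared error
    rmse = float(np.sqrt(mse))              # root mean squared error
    return {"SSE": sse, "MSE": mse, "RMSE": rmse}


def baseline_errors(actual):
    """Score a baseline that predicts the mean of the target for every row."""
    actual = np.asarray(actual, dtype=float)
    return regression_errors(actual, np.full_like(actual, actual.mean()))


# Toy data, made up purely for illustration
y_toy = [3.0, 5.0, 7.0, 9.0]
y_hat_toy = [2.5, 5.5, 6.5, 9.5]

model_scores = regression_errors(y_toy, y_hat_toy)
baseline_scores = baseline_errors(y_toy)
print(model_scores)
print(baseline_scores)
print("beats baseline:", model_scores["RMSE"] < baseline_scores["RMSE"])
```

RMSE is in the same units as the target, which makes it the easiest of the three to read next to a home value in dollars.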

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, explained_variance_score
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE

from split_scale import split_data
from feature_selection import select_feature

import warnings
warnings.filterwarnings("ignore")

In [5]:
train, test = split_data()
train, selector = select_feature(train, 2)
train.head()

index,home_size,bathroomcnt,home_value
5364,1449.0,2.0,363906.0
10814,3739.0,3.0,1203071.0
10863,1920.0,2.5,507625.0
12983,1574.0,3.0,569000.0
325,1992.0,3.0,592398.0


In [6]:
X = train.drop(columns = 'home_value')
y = train[['home_value']].copy()  # copy so columns can be added later without a SettingWithCopyWarning

## Linear Model

In [7]:
lm = LinearRegression()
lm.fit(X, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [8]:
y['prediction'] = lm.predict(X)
y.head()

index,home_value,prediction
5364,363906.0,328467.492686
10814,1203071.0,796524.148998
10863,507625.0,433088.277028
12983,569000.0,380845.614345
325,592398.0,461101.331428


In [10]:
RMSE_lm = np.sqrt(mean_squared_error(y.home_value, y.prediction))
RMSE_lm

253963.69196194154
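
The residual plot and the y by y_hat plot called for in the goals could look roughly like the sketch below. `plot_residuals` is a hypothetical helper, not a function from this project, and the demo uses only the five rows printed by `y.head()` above. In the notebook it would presumably be called as `plot_residuals(y.home_value, y.prediction)`.

```python
import numpy as np
import matplotlib.pyplot as plt


def plot_residuals(actual, predicted):
    """Draw residuals vs. predictions and actual vs. predicted, side by side."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted

    fig, (ax_resid, ax_fit) = plt.subplots(1, 2, figsize=(12, 5))

    # Residuals should scatter around zero with no obvious pattern
    ax_resid.scatter(predicted, residuals, alpha=0.5)
    ax_resid.axhline(0, color="gray", linestyle="--")
    ax_resid.set(xlabel="predicted home_value",
                 ylabel="residual (actual - predicted)",
                 title="Residuals")

    # Points on the dashed diagonal are perfect predictions
    ax_fit.scatter(actual, predicted, alpha=0.5)
    lims = [min(actual.min(), predicted.min()),
            max(actual.max(), predicted.max())]
    ax_fit.plot(lims, lims, color="gray", linestyle="--")
    ax_fit.set(xlabel="actual home_value (y)",
               ylabel="predicted home_value (y_hat)",
               title="y by y_hat")
    return fig, residuals


# The five rows shown by y.head() above, not the full training set
actual = [363906.0, 1203071.0, 507625.0, 569000.0, 592398.0]
predicted = [328467.492686, 796524.148998, 433088.277028,
             380845.614345, 461101.331428]

fig, residuals = plot_residuals(actual, predicted)
```

All five of these residuals are positive, but five rows are far too few to say whether the model underpredicts in general; the full plot is what would show that.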

## Baseline Model

In [12]:
y['baseline'] = y['home_value'].mean()
y.head()

index,home_value,prediction,baseline
5364,363906.0,328467.492686,383667.989531
10814,1203071.0,796524.148998,383667.989531
10863,507625.0,433088.277028,383667.989531
12983,569000.0,380845.614345,383667.989531
325,592398.0,461101.331428,383667.989531


In [13]:
RMSE_bl = np.sqrt(mean_squared_error(y.home_value, y.baseline))
RMSE_bl

299116.330842889
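
To make the comparison to baseline explicit, the two RMSE values reported above can be put side by side. The sketch below hard-codes those two numbers so it runs on its own; `relative_improvement` is a hypothetical helper name. In the notebook, `RMSE_lm` and `RMSE_bl` would be passed in directly.

```python
# RMSE values copied from the cell outputs above (training data)
rmse_lm = 253963.69196194154  # linear model
rmse_bl = 299116.330842889    # mean baseline


def relative_improvement(rmse_model, rmse_baseline):
    """Fraction by which the model's RMSE is lower than the baseline's."""
    return (rmse_baseline - rmse_model) / rmse_baseline


improvement = relative_improvement(rmse_lm, rmse_bl)
print(f"Model RMSE:    {rmse_lm:,.0f}")
print(f"Baseline RMSE: {rmse_bl:,.0f}")
print(f"Reduction vs. baseline: {improvement:.1%}")
```

On the training data the linear model lowers RMSE by roughly 15% relative to predicting the mean home value. Both figures are computed on `train`, so they do not yet show how the model performs on the held-out `test` split.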