## Modeling

Describe and justify the process for analyzing or modeling the data.

Questions to consider:

- How did you analyze the data to arrive at an initial approach?
- How did you iterate on your initial approach to make it better?
- Why are these choices appropriate given the data and the business problem?

## Evaluation

The evaluation of each model should accompany the creation of each model, and you should be sure to evaluate your models consistently.

Evaluate how well your work solves the stated business problem. 

Questions to consider:

- How do you interpret the results?
- How well does your model fit your data? How much better is this than your baseline model? Is it overfit or underfit?
- How well does your model/data fit any modeling assumptions?

For the final model, you might also consider:

- How confident are you that your results would generalize beyond the data you have?
- How confident are you that this model would benefit the business if put into use?
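One way to approach the generalization question is k-fold cross-validation, which reports a spread of scores rather than a single train/test number. The sketch below is only an illustration on made-up data: `X_demo` and `y_demo` are hypothetical stand-ins, not the King County features.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data (made up for illustration): y depends linearly on 3 features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

# 5-fold cross-validation: each fold is held out once while the model
# is fit on the remaining four folds.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=cv, scoring="r2")

print("R^2 per fold:", np.round(cv_r2, 3))
print("Mean R^2:", cv_r2.mean(), "Std:", cv_r2.std())
```

A small standard deviation across folds suggests the score does not depend heavily on which rows happened to land in the test set.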

### Baseline Understanding

- What does a baseline, model-less prediction look like?

In [1]:
# code here to arrive at a baseline prediction
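A common model-less baseline for regression is to predict the training mean for every house. The sketch below shows the idea with scikit-learn's `DummyRegressor`; the prices are synthetic (`price_demo` is a hypothetical stand-in for the real `price` column), so the printed numbers say nothing about this dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up sale prices standing in for the real `price` column.
rng = np.random.default_rng(42)
price_demo = rng.normal(loc=900_000, scale=250_000, size=1_000)
X_placeholder = np.zeros((price_demo.size, 1))  # the baseline ignores features

X_tr, X_te, y_tr, y_te = train_test_split(
    X_placeholder, price_demo, test_size=0.2, random_state=42
)

# Always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)
baseline_train_r2 = baseline.score(X_tr, y_tr)
baseline_test_mae = mean_absolute_error(y_te, baseline.predict(X_te))

print("Baseline train R^2:", baseline_train_r2)
print("Baseline test MAE:", baseline_test_mae)
```

By construction the baseline's train R^2 is 0, so any later model can be judged by how far its R^2 rises above 0 and how far its MAE falls below the baseline MAE.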

### Import our Final CSV from our Data Preparation and Cleaning Notebook

In [2]:
#Import libraries
# Standard Packages
import pandas as pd
import numpy as np

# Viz Packages
import seaborn as sns
import matplotlib.pyplot as plt

# Scipy Stats
import scipy.stats as stats 

# Statsmodel Api
import statsmodels.api as sm
from statsmodels.formula.api import ols

# SKLearn Modules
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.preprocessing import StandardScaler,  OrdinalEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
import sklearn.metrics as metrics

# Suppress future and deprecation warnings
import warnings
warnings.filterwarnings("ignore", category= FutureWarning)
warnings.filterwarnings("ignore", category=DeprecationWarning) 

In [3]:
final_kc_df = pd.read_csv('/Users/Aidan/Documents/Flatiron/Phase_2/King-County-House-Sales-/final_kc.csv')

In [4]:
#Reminder of what it looks like.
final_kc_df.head()

Unnamed: 0.1,Unnamed: 0,index,id,date,price,bedrooms,bathrooms,sqft_living,sqft_lot,floors,...,address,lat,long,Zip Code,DistrictName,ELA,Math,Science,District Test Score,zipcodes
0,0,0,7399300360,5/24/2022,675000.0,4,1.0,1180,7140,1.0,...,"2102 Southeast 21st Court, Renton, Washington ...",47.461975,-122.19052,98055,Renton School District,42.9,32.5,33.3,36.23,"['98006', '98031', '98032', '98055', '98056', ..."
1,1,4,2873000690,6/11/2021,680000.0,4,3.0,2130,7649,1.0,...,"20432 130th Place Southeast, Kent, Washington ...",47.418155,-122.16696,98031,Renton School District,42.9,32.5,33.3,36.23,"['98006', '98031', '98032', '98055', '98056', ..."
2,2,10,5469700570,6/23/2021,810000.0,5,3.0,3030,24759,1.0,...,"12605 Southeast 235th Street, Kent, Washington...",47.39079,-122.17303,98031,Renton School District,42.9,32.5,33.3,36.23,"['98006', '98031', '98032', '98055', '98056', ..."
3,3,28,7399301200,3/29/2022,728000.0,4,2.0,2170,7520,1.0,...,"1814 Aberdeen Avenue Southeast, Renton, Washin...",47.46393,-122.18974,98055,Renton School District,42.9,32.5,33.3,36.23,"['98006', '98031', '98032', '98055', '98056', ..."
4,4,50,9899200050,3/24/2022,565000.0,4,2.0,1400,10364,1.5,...,"3426 Shattuck Avenue South, Renton, Washington...",47.44845,-122.21243,98055,Renton School District,42.9,32.5,33.3,36.23,"['98006', '98031', '98032', '98055', '98056', ..."


In [5]:
#Remove uninformative columns 
final_kc_df= final_kc_df.drop(labels=['index','id', 'date','Unnamed: 0','address','lat','long', 'yr_renovated',\
                                     'zipcodes'], axis=1)

In [6]:
#Let's look at our shape after removing those columns.
final_kc_df.shape

(7545, 24)

In [7]:
final_kc_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7545 entries, 0 to 7544
Data columns (total 24 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   price                7545 non-null   float64
 1   bedrooms             7545 non-null   int64  
 2   bathrooms            7545 non-null   float64
 3   sqft_living          7545 non-null   int64  
 4   sqft_lot             7545 non-null   int64  
 5   floors               7545 non-null   float64
 6   waterfront           7545 non-null   object 
 7   greenbelt            7545 non-null   object 
 8   view                 7545 non-null   object 
 9   condition            7545 non-null   object 
 10  grade                7545 non-null   object 
 11  heat_source          7545 non-null   object 
 12  sewer_system         7545 non-null   object 
 13  sqft_above           7545 non-null   int64  
 14  sqft_basement        7545 non-null   int64  
 15  sqft_garage          7545 non-null   i

### First Model

Before going too far down the data preparation rabbit hole, be sure to check your work against a first 'substandard' model! What is the easiest way for you to find out how hard your problem is?

In [8]:
# code here for your first 'substandard' model

In [9]:
# Prepare data

#Using only numeric data for our first model and not using Science due to its Null values
X = final_kc_df.drop(['price','waterfront','greenbelt','view','condition', 'grade', 'heat_source',\
                      'sewer_system', 'DistrictName', 'Science'], axis=1)
y = final_kc_df['price']

In [10]:
# Test/Train Split: Train is 80% of Data, Test is 20% of Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

In [11]:
# Instantiate model
lr_simple_model = LinearRegression()

# Fit model
first_result = lr_simple_model.fit(X_train, y_train)

In [12]:
# code here to evaluate your first 'substandard' model

In [13]:
# Retrieve estimated slope coefficients (one per feature)
first_result.coef_

array([-2.20082591e+04,  3.81784694e+04,  1.21088887e+02,  2.88741002e-01,
        2.82172963e+04,  2.58275410e+01,  7.82431431e+00,  1.05334932e+01,
        8.14861195e+01, -9.14935234e+02,  1.05862858e+03,  4.05475033e+04,
        2.81938674e+04, -6.18437290e+04])

In [14]:
# Retrieve estimated y-intercept coefficient
first_result.intercept_

-102035003.15897562

In [15]:
#GIVES US OUR R**2
print("First Model Train:" , lr_simple_model.score(X_train, y_train))
print("First Model Test:" , lr_simple_model.score(X_test, y_test))

First Model Train: 0.41098826984635306
First Model Test: 0.41715467409033447


In [16]:
#MEAN ABSOLUTE ERROR
metrics.mean_absolute_error(y_test, lr_simple_model.predict(X_test))

168903.66087232338

### Modeling Iterations

Before we create another model, we should check for multicollinearity and confirm that our predictor variables are not strongly correlated with one another.
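One possible way to run this check is the variance inflation factor (VIF). The sketch below computes it by hand with `LinearRegression`; `vif_table` is a hypothetical helper and the three columns are synthetic, with `sqft_above` deliberately built from `sqft_living` so that the collinearity shows up.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_table(X: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
    regressing column j on all of the other columns."""
    vifs = {}
    for col in X.columns:
        others = X.drop(columns=col)
        r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
        vifs[col] = 1.0 / (1.0 - r2)
    return pd.Series(vifs, name="VIF").sort_values(ascending=False)

# Synthetic predictors (made up): sqft_above is nearly a multiple of sqft_living.
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.normal(2000, 500, n)
X_demo = pd.DataFrame({
    "sqft_living": sqft_living,
    "sqft_above": 0.8 * sqft_living + rng.normal(0, 50, n),
    "bedrooms": rng.integers(1, 6, n).astype(float),
})

vif = vif_table(X_demo)
print(vif)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign that a predictor is largely explained by the others and is a candidate for removal.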

#### Second Model

In [17]:
# code here to iteratively improve your models

In [18]:
# code here to evaluate your iterations

### 'Final' Model

In the end, you'll arrive at a 'final' model - aka the one you'll use to make your recommendations/conclusions. This likely blends any group work. It might not be the one with the highest scores, but instead might be considered 'final' or 'best' for other reasons.

In [19]:
# code here to show your final model

In [20]:
# Prepare data
X = final_kc_df.drop('price', axis=1)
y = final_kc_df['price']

In [21]:
# Test/Train Split - Train 80% of Data, Test 20% of Data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)

### Processing the Data

#### Continuous X/ Predictive Variables

In [22]:
#Separate Continuous X Variables from rest of X Variables

*Train Data*

In [23]:
#Define our columns with numeric data
numeric_cols = ['bathrooms','sqft_living','sqft_patio']

In [24]:
#Define our numeric training data
X_train_numeric = X_train[numeric_cols]

*Test Data*

In [25]:
#Define our numeric testing data
X_test_numeric = X_test[numeric_cols]

#### Discrete X/ Predictive Variables

#### Discrete Ordinal X/ Predictive Variables

In [26]:
#Separate Discrete Ordinal X Variables from rest of X Variables

*Train Data*

In [27]:
#Define our columns with ordinal data and create train data subset
ord_cat_selector = ['condition','grade']
X_train_ord_cat_subset = X_train[ord_cat_selector]

In [28]:
#View unique values in our ordinal columns
X_train_ord_cat_subset['condition'].unique()

array(['Average', 'Very Good', 'Good'], dtype=object)

In [29]:
X_train_ord_cat_subset['grade'].unique()

array(['8 Good', '7 Average', '10 Very Good', '9 Better', '6 Low Average',
       '5 Fair', '11 Excellent', '13 Mansion'], dtype=object)

In [30]:
# REPLACE WITH COLUMN'S UNIQUE VALUES IN ASCENDING ORDER --->
condition_list = ['Poor', 'Fair', 'Average', 'Good', 'Very Good'] 
grade_list = ['3 Poor', '5 Fair', '6 Low Average', '7 Average','8 Good','9 Better','10 Very Good','11 Excellent',\
             '12 Luxury','13 Mansion']

In [31]:
#Fit ordinal train data
o_enc = OrdinalEncoder(categories = [condition_list, grade_list])
o_enc.fit(X_train_ord_cat_subset)

OrdinalEncoder(categories=[['Poor', 'Fair', 'Average', 'Good', 'Very Good'],
                           ['3 Poor', '5 Fair', '6 Low Average', '7 Average',
                            '8 Good', '9 Better', '10 Very Good',
                            '11 Excellent', '12 Luxury', '13 Mansion']])

In [32]:
#Transform ordinal train data
o_enc.transform(X_train_ord_cat_subset)

array([[2., 4.],
       [4., 3.],
       [3., 6.],
       ...,
       [2., 5.],
       [2., 4.],
       [2., 3.]])
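To make the mapping concrete: each category is replaced by its position in the list we supplied. The short sketch below reuses the same `condition_list` and `grade_list` on three hand-written rows (`sample` is illustrative, not taken from the dataset).

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

condition_list = ['Poor', 'Fair', 'Average', 'Good', 'Very Good']
grade_list = ['3 Poor', '5 Fair', '6 Low Average', '7 Average', '8 Good',
              '9 Better', '10 Very Good', '11 Excellent', '12 Luxury', '13 Mansion']

# Three hand-written rows for illustration.
sample = pd.DataFrame({
    'condition': ['Average', 'Very Good', 'Good'],
    'grade': ['8 Good', '7 Average', '10 Very Good'],
})

enc = OrdinalEncoder(categories=[condition_list, grade_list])
encoded = enc.fit_transform(sample)

# 'Average' is at position 2 of condition_list, '8 Good' at position 4 of grade_list, etc.
print(encoded)
# [[2. 4.]
#  [4. 3.]
#  [3. 6.]]
```

Because the lists include levels that may not appear in the training rows (for example 'Poor' or '12 Luxury'), the encoder can still handle them if they show up later.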

In [33]:
#Turn ordinal train data back into a dataframe
X_train_ord = pd.DataFrame(o_enc.transform(X_train_ord_cat_subset),
                        columns = X_train_ord_cat_subset.columns)

In [34]:
#Preview ordinal data to confirm numerical values have replaced unique values
X_train_ord.head()

Unnamed: 0,condition,grade
0,2.0,4.0
1,4.0,3.0
2,3.0,6.0
3,3.0,3.0
4,3.0,3.0


*Test Data*

In [35]:
#Define ordinal test subset
X_test_ord_cat_subset = X_test[ord_cat_selector]

In [36]:
#Transform ordinal test data
o_enc.transform(X_test_ord_cat_subset)

array([[3., 4.],
       [3., 3.],
       [2., 3.],
       ...,
       [3., 3.],
       [2., 3.],
       [2., 4.]])

In [37]:
#Turn ordinal test data back into a dataframe
X_test_ord = pd.DataFrame(o_enc.transform(X_test_ord_cat_subset),
                        columns = X_test_ord_cat_subset.columns)

#### Discrete Nominal X/ Predictive Variables

In [38]:
#Separate Discrete Nominal X Variables from rest of X Variables

In [39]:
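# drop='first' removes one dummy column per feature to avoid perfectly collinear dummies
# Note: scikit-learn 1.2+ renames the `sparse` argument to `sparse_output`
onehot_enc = OneHotEncoder(drop = 'first', sparse = False)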

*Train Data*

In [40]:
#Define nominal columns
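# Earlier attempt using only the school district, which gave an R^2 of about 0.52:
# nominal_cols = ['DistrictName']
nominal_cols = ['waterfront','greenbelt','view','heat_source','sewer_system','yr_built','Zip Code']
# With these columns the R^2 is about 0.589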

In [41]:
#Fit and Transform nominal train data
X_train_nom_trans = onehot_enc.fit_transform(X_train[nominal_cols])

In [42]:
#View nominal train data shape
X_train_nom_trans.shape

(6036, 209)

In [43]:
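#Get column names
# Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out()
cols = onehot_enc.get_feature_names()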

In [44]:
#Turn nominal train data back into a dataframe
X_train_nom = pd.DataFrame(X_train_nom_trans, columns = cols)
#Preview nominal train dataframe
X_train_nom.head()

Unnamed: 0,x0_YES,x1_YES,x2_EXCELLENT,x2_FAIR,x2_GOOD,x2_NONE,x3_Electricity/Solar,x3_Gas,x3_Gas/Solar,x3_Oil,...,x6_98155,x6_98166,x6_98168,x6_98177,x6_98178,x6_98188,x6_98198,x6_98199,x6_98288,x6_98354
0,0.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


*Test Data*

In [45]:
# Transform nominal test data
X_test_nom_trans = onehot_enc.transform(X_test[nominal_cols])

In [46]:
#View nominal test data shape
X_test_nom_trans.shape

(1509, 209)

In [47]:
#Turn nominal test data back into a dataframe
X_test_nom = pd.DataFrame(X_test_nom_trans, columns = cols)

#### Combine Discrete/Categorical X/ Predictive Variables

*Train Data*

In [48]:
#Concatenating the ordinal and nominal training dataframes
X_train_cat = pd.concat([X_train_ord, X_train_nom],axis = 1)

*Test Data*

In [49]:
#Concatenating the ordinal and nominal testing dataframes
X_test_cat = pd.concat([X_test_ord, X_test_nom],axis = 1)

#### Combine Continuous and Discrete X/ Predictive Variables

*Train Data - Combine all processed X Variables*

In [50]:
#Resetting the index (and discarding the old one) so the rows line up with the encoded data
X_process_train = pd.concat([X_train_numeric.reset_index(drop = True),X_train_cat],axis = 1)

In [51]:
#Checking the new shape of our combined dataframe
X_process_train.shape

(6036, 214)

*Test Data - Combine all processed X Variables*

In [52]:
#Resetting the index (and discarding the old one) so the rows line up with the encoded data
X_process_test = pd.concat([X_test_numeric.reset_index(drop = True),X_test_cat],axis = 1)

#### Scale the Data

In [53]:
# Create a StandardScaler object to scale data
ss = StandardScaler()

*Train Data*

In [54]:
#Fit the StandardScaler on the train data, then transform the train data
ss.fit(X_process_train)
X_standard_process_train = ss.transform(X_process_train)

*Test Data*

In [55]:
#Transform the test data with the scaler already fitted on the train data (no refit, to avoid leakage)
X_standard_process_test = ss.transform(X_process_test)
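As an alternative to splitting, encoding, concatenating, and scaling by hand, the same steps could be bundled into a single scikit-learn `Pipeline` with a `ColumnTransformer`. This is not what the notebook does; it is a sketch of a swapped-in technique on a small synthetic frame (`toy` and all of its values are made up), using only a subset of the column types.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

# Small synthetic stand-in for the housing data (values are made up).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "sqft_living": rng.normal(2000, 500, n),
    "bathrooms": rng.integers(1, 4, n).astype(float),
    "condition": rng.choice(["Average", "Good", "Very Good"], n),
    "waterfront": rng.choice(["NO", "YES"], n),
})
toy["price"] = (300 * toy["sqft_living"]
                + 50_000 * (toy["waterfront"] == "YES")
                + rng.normal(0, 20_000, n))

X_demo, y_demo = toy.drop(columns="price"), toy["price"]
X_tr, X_te = X_demo.iloc[:160], X_demo.iloc[160:]
y_tr, y_te = y_demo.iloc[:160], y_demo.iloc[160:]

condition_list = ["Poor", "Fair", "Average", "Good", "Very Good"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", "passthrough", ["sqft_living", "bathrooms"]),
        ("ord", OrdinalEncoder(categories=[condition_list]), ["condition"]),
        ("nom", OneHotEncoder(drop="first"), ["waterfront"]),
    ],
    sparse_threshold=0,  # always return a dense array so StandardScaler can center it
)

model = Pipeline([
    ("preprocess", preprocess),
    ("scale", StandardScaler()),
    ("lr", LinearRegression()),
])

# fit() learns encoders and scaler from the train rows only; score() reuses them on test.
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)
print("Toy test R^2:", test_r2)
```

The main design benefit is that the encoders and scaler are fitted only inside `fit`, so the test rows cannot leak into preprocessing, and row alignment is handled without manual index resets.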

### Creating, Fitting, and Running the Model

In [56]:
# Instantiate model
lr_simple_model = LinearRegression()

# Fit model
final_result = lr_simple_model.fit(X_standard_process_train, y_train)

In [None]:
# code here to evaluate your final model

In [57]:
# Retrieve estimated slope coefficients (one per processed feature)
final_result.coef_

array([ 7.39012933e+03,  6.44555729e+04,  8.54966982e+03,  3.47766649e+04,
        6.13349001e+04, -1.54734397e+02,  7.03400392e+03,  6.47846279e+03,
        7.87056705e+03,  4.52953736e+03, -1.47390041e+04, -1.21345394e+04,
        3.31287351e+03,  2.19271848e+02,  2.31791239e+02,  4.17166020e+03,
        3.70280131e+03, -3.58739027e+03,  5.38795959e+03,  3.61891100e+03,
        1.14870068e+03, -3.53593592e+02,  3.06712627e+03, -1.24171965e+03,
       -2.23455927e+03,  2.09184359e+03,  1.16116404e+03,  5.64593337e+03,
        4.11500889e+03,  2.65512160e+03, -1.43369245e+02,  3.04004019e+03,
        5.67643801e+03,  2.60550248e+03,  5.77378010e+03,  5.60319083e+03,
        1.92358449e+03,  6.61480058e+03,  4.94702389e+02,  4.04175249e+03,
        3.54351190e+03,  6.51895048e+03,  1.69932355e+03,  8.37278177e+03,
       -1.77609224e+03,  4.26210802e+03,  3.59602976e+02,  4.45810715e+03,
        3.81234915e+02,  4.71715850e+03, -4.40138608e+03,  1.78383014e+03,
       -9.66299015e+02,  

In [58]:
# Retrieve estimated y-intercept coefficient
final_result.intercept_

904465.635520212

In [59]:
#GIVES US OUR R**2
print("Final Model Train:" , lr_simple_model.score(X_standard_process_train, y_train))
print("Final Model Test:" , lr_simple_model.score(X_standard_process_test, y_test))

Final Model Train: 0.6197893726751975
Final Model Test: 0.5894612277636274


In [62]:
#MEAN ABSOLUTE ERROR
metrics.mean_absolute_error(y_test, lr_simple_model.predict(X_standard_process_test))

131819.640483476
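To put the two models side by side, the short calculation below takes the scores printed above for the first and final models and computes the change in test R^2, the relative drop in test MAE, and the train/test gap of the final model. The dictionary names are just illustrative.

```python
# Scores copied from the notebook outputs above.
first = {"train_r2": 0.41098826984635306, "test_r2": 0.41715467409033447,
         "test_mae": 168903.66087232338}
final = {"train_r2": 0.6197893726751975, "test_r2": 0.5894612277636274,
         "test_mae": 131819.640483476}

# Relative reduction in mean absolute error on the test set.
mae_reduction = (first["test_mae"] - final["test_mae"]) / first["test_mae"]
# Absolute gain in test R^2.
r2_gain = final["test_r2"] - first["test_r2"]
# Train minus test R^2 for the final model; a large gap would hint at overfitting.
final_gap = final["train_r2"] - final["test_r2"]

print(f"Test MAE reduction: {mae_reduction:.1%}")
print(f"Test R^2 gain:      {r2_gain:.3f}")
print(f"Final train-test R^2 gap: {final_gap:.3f}")
```

The final model cuts the test MAE by a little over a fifth and raises test R^2 by about 0.17, while its train score exceeds its test score by about 0.03, a modest gap that is worth monitoring given the 214 processed features.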

## Conclusions

Provide your conclusions about the work you've done, including any limitations or next steps.

Questions to consider:

- What would you recommend the business do as a result of this work?
- What are some reasons why your analysis might not fully solve the business problem?
- What else could you do in the future to improve this project (future work)?
