***
# House Price Prediction - Model
***

***
# Model Prediction 
***

Building on the work completed in the EDA process, we need to do a little more work on the data to prepare it for model training. We will drop
strongly correlated features, encode the categorical features, and scale the data.  

***
# Index
***

1. [Import Data](#data_import)
2. [Drop Correlated Features](#drop)
3. [Encoding Categorical Features](#encoding)
4. [Feature Scaling](#feature_scaling)
5. [Train, Test Split](#train_test_split)
6. [Decision Tree Prediction](#tree)
7. [Logistic Regression](#logit)

<a id="data_import"></a>
<div class="alert alert-block alert-info">
    <font size="+3"><b>Encoding & Feature Scaling</b></font>
</div>

In [183]:
# Imports 
import pandas as pd
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import datetime as dt
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn import tree
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler

In [184]:
# Custom Helper Functions 

def convert_categorical(dataframe, col_name, order_rank):
    '''Takes a dataframe, the name of the column to convert and an array-like object
    containing the rank/order of values. Converts that feature, in place, to an
    ORDERED Categorical data type.'''
    # assign on the dataframe that was passed in, not on a global df
    dataframe[col_name] = pd.Categorical(dataframe[col_name], categories=order_rank,
                                         ordered=True)
    
def scale_convert(list_int):
    '''Scale converter takes a list of integers and converts the values using Numpy
    log10(). '''
    np_list_int = np.array(list_int)
    log_values = np.log10(np_list_int)
    converted_values_log10 = [round(i,2) for i in log_values]
    return converted_values_log10

def convert_thousands(list_to_convert):
    '''Convert thousands takes a list of values and divides each by 1000, truncating to int.'''
    np_list_to_convert = np.array(list_to_convert)
    converted = np_list_to_convert / 1000
    converted_values = [int(i) for i in converted]
    return converted_values

<a id="data_import"></a>
## Data Import
Import the data that was modified in the EDA process.  
1. `df_test` - Testing data only for final validation
2. `df` - Training data for model training 

In [185]:
# Need to add importing of test data
# df_test = pd.read_csv('house_price_test.csv')
df = pd.read_csv('predicting_house_prices_20200412.csv')
df.head(2)

Unnamed: 0,LotArea,Street,Utilities,LotConfig,Neighborhood,BldgType,OverallQual,OverallCond,ExterCond,Foundation,1stFlrSF,2ndFlrSF,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,home_sqft
0,8450,Pave,AllPub,Inside,CollgCr,1Fam,7,5,TA,PConc,856,854,,0,2,2008,WD,Normal,208500,1710
1,9600,Pave,AllPub,FR2,Veenker,1Fam,6,8,TA,CBlock,1262,0,,0,5,2007,WD,Normal,181500,1262


<a id="drop"></a>
<div class="alert alert-block alert-info">
    <font size="+3"><b>Drop Correlated Features</b></font>
</div>

### Features to Drop

* `1stFlrSF` and `2ndFlrSF`: We can drop these two features to reduce potential overfitting. The total `home_sqft` feature was created and will be used in place of those two columns.
* `MiscFeature` and `MiscVal`: `MiscVal` is the value associated with `MiscFeature`, and there are only about 60 such values. We will drop `MiscFeature` and keep `MiscVal`, which does not need to be encoded, only scaled.
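A quick way to see why the two floor columns are redundant is to look at their correlation with the total. The sketch below is only an illustration: the rows are made-up toy values, and it assumes `home_sqft` is the sum of the two floor areas, which is consistent with the two rows displayed above.

```python
import pandas as pd

# Toy rows only; column names mirror the notebook, values are illustrative.
toy = pd.DataFrame({
    "1stFlrSF": [856, 1262, 920, 961, 1145],
    "2ndFlrSF": [854, 0, 866, 756, 1053],
})
# Assumption: total square footage is the sum of both floors.
toy["home_sqft"] = toy["1stFlrSF"] + toy["2ndFlrSF"]

# Pairwise Pearson correlation between the three columns
corr = toy.corr()
print(corr.round(2))
```

A high value in the `home_sqft` row suggests the total already carries most of the information in the per-floor columns.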

In [186]:
# check number of features before drop
print(f"Feature Count before drop: {len(df.columns)}")

Feature Count before drop: 20


In [187]:
df.drop(columns=['1stFlrSF', '2ndFlrSF', 'MiscFeature'],inplace=True)
print(f"Feature count after drop should be 17: {len(df.columns)}")

Feature count after drop should be 17: 17


<a id="encoding"></a>
<div class="alert alert-block alert-info">
    <font size="+3"><b>Encoding Categorical Features</b></font>
</div>

## Encoding Categorical Features 
We are going to use **OneHotEncoder** from `sklearn.preprocessing`. `OneHotEncoder` returns one column per class: if a feature has two classes, it returns two columns, with a **1** marking the "active" value for each sample. There are **9** features to convert, which adds a total of 65 columns to the data set. We can then drop the original text columns to remove the duplicated information. Later we will need to create a **second data set** in which one column is dropped from each encoded feature. If done correctly, that data set will have `9 fewer columns`.  

* 2 'Street'
* 2 'Utilities'
* 5 'LotConfig'
* 25 'Neighborhood'
* 5 'BldgType'
* 5 'ExterCond'
* 6 'Foundation'
* 9 'SaleType'
* 6 'SaleCondition'

Overall quality (*OverallQual*) and overall condition (*OverallCond*) have a ranked integer value from 1 to 10. These values can remain as is. However, exterior condition (*ExterCond*) will need to be updated:

* Current values of 'ExterCond', ranked high to low: `['Ex', 'Gd', 'TA', 'Fa', 'Po']`
* Convert to a scale from 5 (`Ex`) down to 1 (`Po`)


[OneHotEncoder](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html#sklearn.preprocessing.OneHotEncoder)

[get_dummies or OneHotEncoder](https://stackoverflow.com/questions/36631163/pandas-get-dummies-vs-sklearns-onehotencoder-what-are-the-pros-and-cons)
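The "second data set" idea above (one column dropped per encoded feature) can be sketched with `pd.get_dummies`, which the second link compares with `OneHotEncoder`. This is a minimal illustration on a made-up two-column frame, not the notebook's data; the category labels are borrowed from the lists above.

```python
import pandas as pd

# Toy frame; labels come from the notebook, the rows are made up.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "LotConfig": ["Inside", "FR2", "Corner", "Inside"],
})

full = pd.get_dummies(toy)                      # one column per class
reduced = pd.get_dummies(toy, drop_first=True)  # first class of each feature dropped

print(list(full.columns))
print(list(reduced.columns))
print(full.shape[1] - reduced.shape[1])  # one column fewer per encoded feature
```

With 9 encoded features, the same rule gives the 9 fewer columns mentioned above.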

### Update Feature ExterCond
Here we will convert the text values to integers.

In [188]:
# Quick look at ExterCond before converting values from alpha to numeric
df['ExterCond'].value_counts()

TA    1282
Gd     146
Fa      28
Ex       3
Po       1
Name: ExterCond, dtype: int64

In [189]:
# order high to low ['Ex', 'Gd', 'TA', 'Fa', 'Po'] maps to 5, 4, 3, 2, 1
exter_cond_rank = {'Ex': 5, 'Gd': 4, 'TA': 3, 'Fa': 2, 'Po': 1}
for label, rank in exter_cond_rank.items():
    # .loc assigns on the dataframe itself, avoiding the chained-assignment warning
    df.loc[df['ExterCond'] == label, 'ExterCond'] = rank


In [190]:
# use value counts to check updates are correct
df['ExterCond'].value_counts()

3    1282
4     146
2      28
5       3
1       1
Name: ExterCond, dtype: int64

### Encoding Month - Cyclical Feature 
For the month sold feature (`MoSold`) we use `sin` and `cos` to convert the values, so that December (12) and January (1) end up close together instead of 11 units apart. 
We will create two new columns:

* `df['MoSold_sin'] = np.sin((df['MoSold'] - 1) * (2.0 * np.pi / 12))`
* `df['MoSold_cos'] = np.cos((df['MoSold'] - 1) * (2.0 * np.pi / 12))`

[Link Encoding Cyclical Features](http://blog.davidkaleko.com/feature-engineering-cyclical-features.html)
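To see what the sin/cos pair buys us, the sketch below encodes three months with the same formula and compares distances between the encoded points. The helper name `encode_month` is hypothetical and used only for this illustration.

```python
import numpy as np

def encode_month(month):
    """Map a month number (1-12) to a (sin, cos) point on the unit circle."""
    angle = (month - 1) * (2.0 * np.pi / 12)
    return np.array([np.sin(angle), np.cos(angle)])

dec, jan, jul = encode_month(12), encode_month(1), encode_month(7)

# Euclidean distance between encoded months
dist_dec_jan = np.linalg.norm(dec - jan)  # neighbours on the circle, roughly 0.52
dist_jan_jul = np.linalg.norm(jan - jul)  # opposite sides of the circle, 2.0
print(dist_dec_jan, dist_jan_jul)
```

On the raw 1 to 12 scale, December and January would be the two most distant months; after encoding they are neighbours.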

In [191]:
# convert MoSold
df['MoSold_sin'] = np.sin((df['MoSold'] - 1) * (2.0 * np.pi / 12))
df['MoSold_cos'] = np.cos((df['MoSold'] - 1)* (2.0 * np.pi / 12))

In [192]:
df.filter(items=['MoSold','MoSold_sin','MoSold_cos'])

Unnamed: 0,MoSold,MoSold_sin,MoSold_cos
0,2,0.500000,8.660254e-01
1,5,0.866025,-5.000000e-01
2,9,-0.866025,-5.000000e-01
3,2,0.500000,8.660254e-01
4,12,-0.500000,8.660254e-01
...,...,...,...
1455,8,-0.500000,-8.660254e-01
1456,2,0.500000,8.660254e-01
1457,5,0.866025,-5.000000e-01
1458,4,1.000000,6.123234e-17


In [193]:
# calculate the total number of features added, to validate the encoding later
features =  ['Street', 'Utilities','LotConfig', 'Neighborhood','BldgType',
                   'ExterCond', 'Foundation', 'SaleType', 'SaleCondition']
count = 0
for col in features:
    count += len(df[col].unique())
print(f"Total features added: {count}")

Total features added: 65


### OneHotEncoder 
Use the One Hot Encoder to encode select features. 
1. First, create the object from the `OneHotEncoder` class
2. Set `sparse=False` so that a NumPy array is returned
3. Call `fit_transform` on the selected features from the dataframe

In [194]:
# list of all the features in the dataframe to encode 
features =  ['Street', 'Utilities','LotConfig', 'Neighborhood','BldgType',
                   'ExterCond', 'Foundation', 'SaleType', 'SaleCondition']
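# instantiate object; sparse=False returns a dense array
# (scikit-learn 1.2 and later name this argument sparse_output)
OHE = OneHotEncoder(sparse=False)
# call the fit_transform() method
features_OHE = OHE.fit_transform(df[features])
# note: because sparse=False, OHE returns a numpy array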
features_OHE


array([[0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       ...,
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.],
       [0., 1., 1., ..., 0., 1., 0.]])

In [195]:
# list all the converted categories 
OHE.categories_

[array(['Grvl', 'Pave'], dtype=object),
 array(['AllPub', 'NoSeWa'], dtype=object),
 array(['Corner', 'CulDSac', 'FR2', 'FR3', 'Inside'], dtype=object),
 array(['Blmngtn', 'Blueste', 'BrDale', 'BrkSide', 'ClearCr', 'CollgCr',
        'Crawfor', 'Edwards', 'Gilbert', 'IDOTRR', 'MeadowV', 'Mitchel',
        'NAmes', 'NPkVill', 'NWAmes', 'NoRidge', 'NridgHt', 'OldTown',
        'SWISU', 'Sawyer', 'SawyerW', 'Somerst', 'StoneBr', 'Timber',
        'Veenker'], dtype=object),
 array(['1Fam', '2fmCon', 'Duplex', 'Twnhs', 'TwnhsE'], dtype=object),
 array([1, 2, 3, 4, 5], dtype=object),
 array(['BrkTil', 'CBlock', 'PConc', 'Slab', 'Stone', 'Wood'], dtype=object),
 array(['COD', 'CWD', 'Con', 'ConLD', 'ConLI', 'ConLw', 'New', 'Oth', 'WD'],
       dtype=object),
 array(['Abnorml', 'AdjLand', 'Alloca', 'Family', 'Normal', 'Partial'],
       dtype=object)]

In [196]:
# wrap the encoded array in a dataframe; columns get default integer labels 0-64
df_OHE = pd.DataFrame(features_OHE)
df_OHE.shape

(1460, 65)

In [197]:
df_train = pd.concat([df,df_OHE],axis=1)
df_train.head()

Unnamed: 0,LotArea,Street,Utilities,LotConfig,Neighborhood,BldgType,OverallQual,OverallCond,ExterCond,Foundation,...,55,56,57,58,59,60,61,62,63,64
0,8450,Pave,AllPub,Inside,CollgCr,1Fam,7,5,3,PConc,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,9600,Pave,AllPub,FR2,Veenker,1Fam,6,8,3,CBlock,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,11250,Pave,AllPub,Inside,CollgCr,1Fam,7,5,3,PConc,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,9550,Pave,AllPub,Corner,Crawfor,1Fam,7,5,3,BrkTil,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,14260,Pave,AllPub,FR2,NoRidge,1Fam,8,5,3,PConc,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [198]:
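# drop the original text features converted using OHE
# ExterCond is left out of this list, so its ordinal 1-5 column is kept alongside its one-hot columns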
features =  ['Street', 'Utilities','LotConfig', 'Neighborhood','BldgType',
             'Foundation', 'SaleType', 'SaleCondition']

df_train = df_train.drop(columns=features)

In [199]:
df_train.head()

Unnamed: 0,LotArea,OverallQual,OverallCond,ExterCond,MiscVal,MoSold,YrSold,SalePrice,home_sqft,MoSold_sin,...,55,56,57,58,59,60,61,62,63,64
0,8450,7,5,3,0,2,2008,208500,1710,0.5,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,9600,6,8,3,0,5,2007,181500,1262,0.866025,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,11250,7,5,3,0,9,2008,223500,1786,-0.866025,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,9550,7,5,3,0,2,2006,140000,1717,0.5,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,14260,8,5,3,0,12,2008,250000,2198,-0.5,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


In [200]:
df_train.columns

Index([    'LotArea', 'OverallQual', 'OverallCond',   'ExterCond',
           'MiscVal',      'MoSold',      'YrSold',   'SalePrice',
         'home_sqft',  'MoSold_sin',  'MoSold_cos',             0,
                   1,             2,             3,             4,
                   5,             6,             7,             8,
                   9,            10,            11,            12,
                  13,            14,            15,            16,
                  17,            18,            19,            20,
                  21,            22,            23,            24,
                  25,            26,            27,            28,
                  29,            30,            31,            32,
                  33,            34,            35,            36,
                  37,            38,            39,            40,
                  41,            42,            43,            44,
                  45,            46,            47,           

<a id="feature_scaling"></a>
<div class="alert alert-block alert-info">
    <font size="+3"><b>Feature Scaling</b></font>
</div>

## Feature Scaling
We will use the `MinMaxScaler` on select columns to convert our data to a range of 0 to 1. The features to scale are listed below.
1. `LotArea`
2. `OverallQual`
3. `OverallCond`
4. `ExterCond`
5. `YrSold`
6. `home_sqft`
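With its default settings, `MinMaxScaler` rescales each column as `(x - min) / (max - min)`. The short check below, on a made-up column of ratings from 1 to 10, compares the scaler against that formula computed by hand.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy ratings in a single column (illustrative values only)
overall_qual = np.array([[1.0], [4.0], [7.0], [10.0]])

scaled = MinMaxScaler().fit_transform(overall_qual)
# Same transformation written out by hand
manual = (overall_qual - overall_qual.min()) / (overall_qual.max() - overall_qual.min())

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

A rating of 7 on a column spanning 1 to 10 becomes (7 - 1) / 9, about 0.667.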

In [201]:
# features to scale
mm_features = ['LotArea', 'OverallQual','OverallCond',
               'ExterCond','YrSold','home_sqft' ]
# create object mm_scale
mm_scale = MinMaxScaler()

In [202]:
for m in mm_features:
    df_train[m] = mm_scale.fit_transform(df_train[[m]])
    
df.head()

Unnamed: 0,LotArea,Street,Utilities,LotConfig,Neighborhood,BldgType,OverallQual,OverallCond,ExterCond,Foundation,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice,home_sqft,MoSold_sin,MoSold_cos
0,8450,Pave,AllPub,Inside,CollgCr,1Fam,7,5,3,PConc,0,2,2008,WD,Normal,208500,1710,0.5,0.866025
1,9600,Pave,AllPub,FR2,Veenker,1Fam,6,8,3,CBlock,0,5,2007,WD,Normal,181500,1262,0.866025,-0.5
2,11250,Pave,AllPub,Inside,CollgCr,1Fam,7,5,3,PConc,0,9,2008,WD,Normal,223500,1786,-0.866025,-0.5
3,9550,Pave,AllPub,Corner,Crawfor,1Fam,7,5,3,BrkTil,0,2,2006,WD,Abnorml,140000,1717,0.5,0.866025
4,14260,Pave,AllPub,FR2,NoRidge,1Fam,8,5,3,PConc,0,12,2008,WD,Normal,250000,2198,-0.5,0.866025


In [203]:
df_train.head()

Unnamed: 0,LotArea,OverallQual,OverallCond,ExterCond,MiscVal,MoSold,YrSold,SalePrice,home_sqft,MoSold_sin,...,55,56,57,58,59,60,61,62,63,64
0,0.03342,0.666667,0.5,0.5,0,2,0.5,208500,0.259231,0.5,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
1,0.038795,0.555556,0.875,0.5,0,5,0.25,181500,0.17483,0.866025,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
2,0.046507,0.666667,0.5,0.5,0,9,0.5,223500,0.273549,-0.866025,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,0.038561,0.666667,0.5,0.5,0,2,0.0,140000,0.26055,0.5,...,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0
4,0.060576,0.777778,0.5,0.5,0,12,0.5,250000,0.351168,-0.5,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0


<a id="train_test_split"></a>
<div class="alert alert-block alert-info">
    <font size="+3"><b>Train, Test Split</b></font>
</div>

## Splitting & Training Data
We will set aside 30% of the data for testing the models created.  

In [204]:
X = df_train.drop(columns=['SalePrice'])
y = df_train['SalePrice']

In [205]:
# split the data holding back 30% of data for testing 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, 
                                                    random_state=43)

In [206]:
# 30% of 1460 is 438 
print(f"Shape of X_train: {X_train.shape}")
print(f"Shape of X_test: {X_test.shape}")

print(f"Shape of y_train: {y_train.shape}")
print(f"Shape of y_test: {y_test.shape}")

Shape of X_train: (1022, 75)
Shape of X_test: (438, 75)
Shape of y_train: (1022,)
Shape of y_test: (438,)


<a id="tree"></a>
<div class="alert alert-block alert-info">
    <font size="+3"><b>Decision Tree Prediction</b></font>
</div>

In [207]:
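# create the classifier with default settings
# NOTE: SalePrice is continuous, so the classifier treats each distinct price as a separate class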
clf_dt = tree.DecisionTreeClassifier()

# Apply the training data to the classifier
clf_dt.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [208]:
# pass in the features to the model for a prediction
results = clf_dt.predict(X_test)
len(results)

438

In [209]:
# absolute difference between each of the 438 predictions and the actual y_test value
diff = np.abs(y_test.to_numpy() - results)

## Absolute Difference 
Using the absolute difference to score the initial decision tree predictor indicates that this is 
not a good model, or that a better job is required on feature selection. 

In [210]:
# calculate the share of predictions in each absolute-difference band
# a difference of 0 indicates a perfect prediction
less_1k = diff < 999
less_5k = (diff >= 1000) & (diff < 4999)
less_10k = (diff >= 5000) & (diff < 9999)
less_20k = (diff >= 10000) & (diff < 19999)
less_30k = (diff >= 20000) & (diff < 29999)
less_40k = (diff >= 30000) & (diff < 39999)
less_50k = (diff >= 40000) & (diff < 49999)
less_100k = (diff >= 50000) & (diff < 99999)
less_100k_plus = diff >= 100000
n = len(diff)
print("Decision Tree Prediction Range")
print(f"Less than $1k:\t\t {round(sum(less_1k)/n*100,2)} Percent")
# the $1k - $5k band (less_5k) is computed above but not printed
print(f"Between $5k - $10k:\t {round(sum(less_10k)/n*100,2)} Percent")
print(f"Between $10k - $20k:\t {round(sum(less_20k)/n*100,2)} Percent")
print(f"Between $20k - $30k:\t {round(sum(less_30k)/n*100,2)} Percent")
print(f"Between $30k - $40k:\t {round(sum(less_40k)/n*100,2)} Percent")
print(f"Between $40k - $50k:\t {round(sum(less_50k)/n*100,2)} Percent")
print(f"Between $50k - $100k:\t {round(sum(less_100k)/n*100,2)} Percent")
print(f"Greater than $100k:\t {round(sum(less_100k_plus)/n*100,2)} Percent")

Decision Tree Prediction Range
Less than $1k:		 2.51 Percent
Between $5k - $10k:	 10.73 Percent
Between $10k - $20k:	 18.72 Percent
Between $20k - $30k:	 9.82 Percent
Between $30k - $40k:	 12.79 Percent
Between $40k - $50k:	 8.9 Percent
Between $50k - $100k:	 18.26 Percent
Greater than $100k:	 6.39 Percent
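Because `SalePrice` is a continuous value, a regression tree is the more usual tool than a classification tree. The sketch below is not the notebook's method: it swaps in `DecisionTreeRegressor` and runs it on synthetic data (the feature names, coefficients and noise level are all assumptions) to show the shape of that alternative.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the housing data (assumption, not the notebook's dataset)
rng = np.random.default_rng(0)
sqft = rng.uniform(500, 3500, size=400)
quality = rng.integers(1, 11, size=400)
price = 60 * sqft + 12000 * quality + rng.normal(0, 10000, size=400)

X_toy = np.column_stack([sqft, quality])
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, price, test_size=0.3, random_state=43)

# Regression tree: each leaf predicts the mean price of its training samples
reg_dt = DecisionTreeRegressor(max_leaf_nodes=30, random_state=42)
reg_dt.fit(X_tr, y_tr)

mae_toy = mean_absolute_error(y_te, reg_dt.predict(X_te))
# Baseline: always predict the mean training price
mae_baseline = mean_absolute_error(y_te, np.full_like(y_te, y_tr.mean()))
print(f"Tree MAE: {mae_toy:.0f} | Baseline MAE: {mae_baseline:.0f}")
```

Comparing against a mean-price baseline gives a quick sense of whether the tree has learned anything useful.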


### Decision Tree Parameter Adjustments

In [211]:
mae_list = []
low_list = []
lowest_score = float('inf')
# max_leaf_nodes from 4 to 35 inclusive
for nodes in range(4, 36):
    # min_samples_split: start 2, step 2, stop 37 (not inclusive), i.e. 2 to 36
    for i in range(2, 37, 2):
        clf_dt = tree.DecisionTreeClassifier(min_samples_split=i,
                                             random_state=42,
                                             max_leaf_nodes=nodes)
        clf_dt.fit(X_train, y_train)

        # Use the predict method on the test data
        predictions = clf_dt.predict(X_test)
        mae = round(mean_absolute_error(y_test, predictions), 4)
        print(f"MAE = {mae} | Nodes = {nodes}  | Split = {i}")

        # keep the most recent combination with the lowest MAE
        if mae <= lowest_score:
            lowest_score = mae
            # mae, nodes, split
            low_list = [(mae, nodes, i)]
        mae_list.append(mae)

        print('--'*25)

MAE = 41935.5571 | Nodes = 4  | Split = 2
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 4
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 6
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 8
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 10
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 12
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 14
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 16
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 18
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 20
--------------------------------------------------
MAE = 41935.5571 | Nodes = 4  | Split = 22
---------------------

MAE = 42884.8721 | Nodes = 9  | Split = 14
--------------------------------------------------
MAE = 42884.8721 | Nodes = 9  | Split = 16
--------------------------------------------------
MAE = 42884.8721 | Nodes = 9  | Split = 18
--------------------------------------------------
MAE = 42642.863 | Nodes = 9  | Split = 20
--------------------------------------------------
MAE = 42642.863 | Nodes = 9  | Split = 22
--------------------------------------------------
MAE = 42382.589 | Nodes = 9  | Split = 24
--------------------------------------------------
MAE = 42382.589 | Nodes = 9  | Split = 26
--------------------------------------------------
MAE = 42382.589 | Nodes = 9  | Split = 28
--------------------------------------------------
MAE = 41290.7717 | Nodes = 9  | Split = 30
--------------------------------------------------
MAE = 41290.7717 | Nodes = 9  | Split = 32
--------------------------------------------------
MAE = 41290.7717 | Nodes = 9  | Split = 34
----------------------

MAE = 39207.3242 | Nodes = 14  | Split = 28
--------------------------------------------------
MAE = 39327.4155 | Nodes = 14  | Split = 30
--------------------------------------------------
MAE = 39327.4155 | Nodes = 14  | Split = 32
--------------------------------------------------
MAE = 39327.4155 | Nodes = 14  | Split = 34
--------------------------------------------------
MAE = 39327.4155 | Nodes = 14  | Split = 36
--------------------------------------------------
MAE = 43067.5205 | Nodes = 15  | Split = 2
--------------------------------------------------
MAE = 42667.9772 | Nodes = 15  | Split = 4
--------------------------------------------------
MAE = 42667.9772 | Nodes = 15  | Split = 6
--------------------------------------------------
MAE = 41605.3836 | Nodes = 15  | Split = 8
--------------------------------------------------
MAE = 41605.3836 | Nodes = 15  | Split = 10
--------------------------------------------------
MAE = 41605.3836 | Nodes = 15  | Split = 12
----------

MAE = 38379.0137 | Nodes = 19  | Split = 36
--------------------------------------------------
MAE = 42516.1507 | Nodes = 20  | Split = 2
--------------------------------------------------
MAE = 41315.8858 | Nodes = 20  | Split = 4
--------------------------------------------------
MAE = 41262.9178 | Nodes = 20  | Split = 6
--------------------------------------------------
MAE = 39504.5845 | Nodes = 20  | Split = 8
--------------------------------------------------
MAE = 39504.5845 | Nodes = 20  | Split = 10
--------------------------------------------------
MAE = 40597.7352 | Nodes = 20  | Split = 12
--------------------------------------------------
MAE = 40597.7352 | Nodes = 20  | Split = 14
--------------------------------------------------
MAE = 39754.3562 | Nodes = 20  | Split = 16
--------------------------------------------------
MAE = 39204.5845 | Nodes = 20  | Split = 18
--------------------------------------------------
MAE = 39903.2146 | Nodes = 20  | Split = 20
----------

MAE = 41327.8721 | Nodes = 25  | Split = 14
--------------------------------------------------
MAE = 42237.6895 | Nodes = 25  | Split = 16
--------------------------------------------------
MAE = 39000.4749 | Nodes = 25  | Split = 18
--------------------------------------------------
MAE = 38933.5799 | Nodes = 25  | Split = 20
--------------------------------------------------
MAE = 38933.5799 | Nodes = 25  | Split = 22
--------------------------------------------------
MAE = 39114.4018 | Nodes = 25  | Split = 24
--------------------------------------------------
MAE = 39114.4018 | Nodes = 25  | Split = 26
--------------------------------------------------
MAE = 38297.5068 | Nodes = 25  | Split = 28
--------------------------------------------------
MAE = 37952.3014 | Nodes = 25  | Split = 30
--------------------------------------------------
MAE = 37952.3014 | Nodes = 25  | Split = 32
--------------------------------------------------
MAE = 37952.3014 | Nodes = 25  | Split = 34
------

MAE = 40358.2374 | Nodes = 30  | Split = 14
--------------------------------------------------
MAE = 39757.7808 | Nodes = 30  | Split = 16
--------------------------------------------------
MAE = 39180.3836 | Nodes = 30  | Split = 18
--------------------------------------------------
MAE = 38952.0731 | Nodes = 30  | Split = 20
--------------------------------------------------
MAE = 38447.5068 | Nodes = 30  | Split = 22
--------------------------------------------------
MAE = 38206.6393 | Nodes = 30  | Split = 24
--------------------------------------------------
MAE = 38206.6393 | Nodes = 30  | Split = 26
--------------------------------------------------
MAE = 37518.3973 | Nodes = 30  | Split = 28
--------------------------------------------------
MAE = 37067.0274 | Nodes = 30  | Split = 30
--------------------------------------------------
MAE = 37067.0274 | Nodes = 30  | Split = 32
--------------------------------------------------
MAE = 37067.0274 | Nodes = 30  | Split = 34
------

MAE = 38428.6712 | Nodes = 35  | Split = 22
--------------------------------------------------
MAE = 37992.5982 | Nodes = 35  | Split = 24
--------------------------------------------------
MAE = 37781.411 | Nodes = 35  | Split = 26
--------------------------------------------------
MAE = 37121.137 | Nodes = 35  | Split = 28
--------------------------------------------------
MAE = 36982.1324 | Nodes = 35  | Split = 30
--------------------------------------------------
MAE = 36982.1324 | Nodes = 35  | Split = 32
--------------------------------------------------
MAE = 36980.5342 | Nodes = 35  | Split = 34
--------------------------------------------------
MAE = 36980.5342 | Nodes = 35  | Split = 36
--------------------------------------------------


### Min Mean Absolute Error 
When measuring the MAE we are looking for the lowest possible value. In terms
of predicting the value of a home, we want the combination of parameters where `MAE` is
closest to zero. The lowest MAE was 36,842.8265, found when 
`max_leaf_nodes` was set to 34 and `min_samples_split` was set to 32.

In [212]:
print(f"What is the MIN MAE Value: {min(mae_list)}")
print(f"What is the node, split combination: {low_list}")

What is the MIN MAE Value: 36842.8265
What is the node, split combination: [(36842.8265, 34, 32)]
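The same kind of search can also be written with scikit-learn's `GridSearchCV`, which scores each combination by cross-validation on the training data rather than on the held-out test set. The sketch below is an alternative under stated assumptions, not the approach used above: it uses synthetic data, a `DecisionTreeRegressor` stand-in, and a much smaller grid.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): three features, the first two drive the target
rng = np.random.default_rng(1)
X_toy = rng.uniform(0, 1, size=(300, 3))
y_toy = 200000 * X_toy[:, 0] + 50000 * X_toy[:, 1] + rng.normal(0, 5000, size=300)

param_grid = {
    "max_leaf_nodes": [4, 12, 20, 35],
    "min_samples_split": [2, 12, 24, 36],
}
search = GridSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # scores are maximized, so MAE is negated
    cv=5,
)
search.fit(X_toy, y_toy)

best_mae = -search.best_score_  # flip the sign back to a positive MAE
print(search.best_params_, round(best_mae, 2))
```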


In [213]:
# tune the params based on the lowest score from MAE  
clf_dt = tree.DecisionTreeClassifier(min_samples_split = 32, 
                                               random_state = 42,
                                               max_leaf_nodes=34)

# Apply the training data to the classifier
clf_dt.fit(X_train,y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=34,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=32,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=42, splitter='best')

In [214]:
# pass in the features to the model for a prediction
results = clf_dt.predict(X_test)
len(results)

438

In [215]:
# absolute difference between each of the 438 predictions and the actual y_test value
diff = np.abs(y_test.to_numpy() - results)

In [216]:
# DT with adjustments: calculate the share of predictions in each absolute-difference band
# a difference of 0 indicates a perfect prediction
less_1k = diff < 999
less_5k = (diff >= 1000) & (diff < 4999)
less_10k = (diff >= 5000) & (diff < 9999)
less_20k = (diff >= 10000) & (diff < 19999)
less_30k = (diff >= 20000) & (diff < 29999)
less_40k = (diff >= 30000) & (diff < 39999)
less_50k = (diff >= 40000) & (diff < 49999)
less_100k = (diff >= 50000) & (diff < 99999)
less_100k_plus = diff >= 100000
n = len(diff)
print("Decision Tree Prediction Range")
print(f"Less than $1k:\t\t {round(sum(less_1k)/n*100,2)} Percent")
# the $1k - $5k band (less_5k) is computed above but not printed
print(f"Between $5k - $10k:\t {round(sum(less_10k)/n*100,2)} Percent")
print(f"Between $10k - $20k:\t {round(sum(less_20k)/n*100,2)} Percent")
print(f"Between $20k - $30k:\t {round(sum(less_30k)/n*100,2)} Percent")
print(f"Between $30k - $40k:\t {round(sum(less_40k)/n*100,2)} Percent")
print(f"Between $40k - $50k:\t {round(sum(less_50k)/n*100,2)} Percent")
print(f"Between $50k - $100k:\t {round(sum(less_100k)/n*100,2)} Percent")
print(f"Greater than $100k:\t {round(sum(less_100k_plus)/n*100,2)} Percent")

Decision Tree Prediction Range
Less than $1k:		 1.6 Percent
Between $5k - $10k:	 14.38 Percent
Between $10k - $20k:	 21.92 Percent
Between $20k - $30k:	 13.7 Percent
Between $30k - $40k:	 8.68 Percent
Between $40k - $50k:	 7.31 Percent
Between $50k - $100k:	 16.44 Percent
Greater than $100k:	 7.08 Percent
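The band report is repeated for every model, so it could be wrapped in a small helper. The sketch below is one possible shape for it, using `np.digitize` so that the bands are contiguous (a difference of exactly 999 or 4,999 still lands in a band). The function name `band_percentages` and the toy differences are hypothetical.

```python
import numpy as np

def band_percentages(abs_diff, edges):
    """Percent of values in each band: below edges[0], between consecutive
    edges (lower bound inclusive), and at or above edges[-1]."""
    abs_diff = np.asarray(abs_diff)
    band_index = np.digitize(abs_diff, edges)  # 0 .. len(edges)
    counts = np.bincount(band_index, minlength=len(edges) + 1)
    return np.round(counts / len(abs_diff) * 100, 2)

edges = [1000, 5000, 10000, 20000, 30000, 40000, 50000, 100000]
labels = ["< $1k", "$1k - $5k", "$5k - $10k", "$10k - $20k", "$20k - $30k",
          "$30k - $40k", "$40k - $50k", "$50k - $100k", ">= $100k"]

# Toy absolute differences, including values that sit right on a boundary
toy_diff = [0, 999, 1000, 4999, 5000, 45000, 250000]
toy_pct = band_percentages(toy_diff, edges)

for label, pct in zip(labels, toy_pct):
    print(f"{label}:\t {pct} Percent")
```

Because every value falls in exactly one band, the percentages add up to 100 (up to rounding).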


<a id="logit"></a>
<div class="alert alert-block alert-info">
    <font size="+3"><b>Logistic Regression</b></font>
</div>

In [217]:
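# create the classifier with default settings
# NOTE: LogisticRegression is a classifier, so each distinct SalePrice is treated as a separate class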
clf_log = LogisticRegression()

# Apply the training data to the classifier
clf_log.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [218]:
# pass in the features to the model for a prediction
results = clf_log.predict(X_test)
len(results)

438

In [219]:
# absolute difference between each of the 438 predictions and the actual y_test value
diff = np.abs(y_test.to_numpy() - results)

In [220]:
# calculate the share of predictions in each absolute-difference band
# a difference of 0 indicates a perfect prediction
less_1k = diff < 999
less_5k = (diff >= 1000) & (diff < 4999)
less_10k = (diff >= 5000) & (diff < 9999)
less_20k = (diff >= 10000) & (diff < 19999)
less_30k = (diff >= 20000) & (diff < 29999)
less_40k = (diff >= 30000) & (diff < 39999)
less_50k = (diff >= 40000) & (diff < 49999)
less_100k = (diff >= 50000) & (diff < 99999)
less_100k_plus = diff >= 100000
n = len(diff)
print("Logistic Regression Prediction Range")
print(f"Less than $1k:\t\t {round(sum(less_1k)/n*100,2)} Percent")
# the $1k - $5k band (less_5k) is computed above but not printed
print(f"Between $5k - $10k:\t {round(sum(less_10k)/n*100,2)} Percent")
print(f"Between $10k - $20k:\t {round(sum(less_20k)/n*100,2)} Percent")
print(f"Between $20k - $30k:\t {round(sum(less_30k)/n*100,2)} Percent")
print(f"Between $30k - $40k:\t {round(sum(less_40k)/n*100,2)} Percent")
print(f"Between $40k - $50k:\t {round(sum(less_50k)/n*100,2)} Percent")
print(f"Between $50k - $100k:\t {round(sum(less_100k)/n*100,2)} Percent")
print(f"Greater than $100k:\t {round(sum(less_100k_plus)/n*100,2)} Percent")

Logistic Regression Prediction Range
Less than $1k:		 1.83 Percent
Between $5k - $10k:	 7.31 Percent
Between $10k - $20k:	 11.64 Percent
Between $20k - $30k:	 9.13 Percent
Between $30k - $40k:	 12.79 Percent
Between $40k - $50k:	 9.59 Percent
Between $50k - $100k:	 23.74 Percent
Greater than $100k:	 19.41 Percent
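Logistic regression predicts class labels, which is a poor fit for a continuous price. For comparison, the sketch below shows what a plain linear regression looks like with the same split-fit-score pattern. It is an alternative technique run on synthetic data (all coefficients and the noise level are assumptions), not a result from this notebook.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): price is linear in two scaled features plus noise
rng = np.random.default_rng(2)
X_toy = rng.uniform(0, 1, size=(300, 2))
y_toy = 150000 * X_toy[:, 0] + 40000 * X_toy[:, 1] + 50000 + rng.normal(0, 5000, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=43)

# Ordinary least squares predicts a continuous value directly
reg_lin = LinearRegression().fit(X_tr, y_tr)
mae_toy = mean_absolute_error(y_te, reg_lin.predict(X_te))
print(f"Toy MAE: {mae_toy:.0f}")
print(f"Coefficients: {reg_lin.coef_.round(0)}")
```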
