## House Price Prediction Using Decision Tree Regressors

This dataset was downloaded from https://www.kaggle.com/jcalvarezj/house-price-regression-prediction

It has a total of 16 columns: 15 attributes that are used as predictors, plus the target, the price of the house. There are exactly 500,000 records (rows).

Since the output (target) variable is continuous, this is a regression problem.

Features include house attributes such as area in square feet, the number of garages, the number of bathrooms, the number of floors, and whether the house is solar or electric powered. All of them are used as predictors for the target variable, Prices.

In [1]:
#import the libraries needed
import numpy as np
import pandas as pd
#standard-library StringIO (sklearn.externals.six is not available in recent scikit-learn releases)
from io import StringIO
from IPython.display import Image
from pydot import graph_from_dot_data
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.tree import export_graphviz



In [2]:
#load the house prices dataset into a pandas dataframe object
df = pd.read_csv("Datasets/HousePrices_HalfMil.csv")

In [4]:
#display the first five rows of the dataset
df.head()

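| | Area | Garage | FirePlace | Baths | White Marble | Black Marble | Indian Marble | Floors | City | Solar | Electric | Fiber | Glass Doors | Swiming Pool | Garden | Prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 164 | 2 | 0 | 2 | 0 | 1 | 0 | 0 | 3 | 1 | 1 | 1 | 1 | 0 | 0 | 43800 |
| 1 | 84 | 2 | 0 | 4 | 0 | 0 | 1 | 1 | 2 | 0 | 0 | 0 | 1 | 1 | 1 | 37550 |
| 2 | 190 | 2 | 4 | 4 | 1 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 0 | 0 | 0 | 49500 |
| 3 | 75 | 2 | 4 | 4 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 50075 |
| 4 | 148 | 1 | 4 | 2 | 1 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 52400 |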


In [5]:
#information on the columns
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 16 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Area           500000 non-null  int64
 1   Garage         500000 non-null  int64
 2   FirePlace      500000 non-null  int64
 3   Baths          500000 non-null  int64
 4   White Marble   500000 non-null  int64
 5   Black Marble   500000 non-null  int64
 6   Indian Marble  500000 non-null  int64
 7   Floors         500000 non-null  int64
 8   City           500000 non-null  int64
 9   Solar          500000 non-null  int64
 10  Electric       500000 non-null  int64
 11  Fiber          500000 non-null  int64
 12  Glass Doors    500000 non-null  int64
 13  Swiming Pool   500000 non-null  int64
 14  Garden         500000 non-null  int64
 15  Prices         500000 non-null  int64
dtypes: int64(16)
memory usage: 61.0 MB


In [6]:
# number of rows x number of columns
df.shape

(500000, 16)

In [7]:
#summary statistics
df.describe()

| | Area | Garage | FirePlace | Baths | White Marble | Black Marble | Indian Marble | Floors | City | Solar | Electric | Fiber | Glass Doors | Swiming Pool | Garden | Prices |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 | 500000.0 |
| mean | 124.929554 | 2.00129 | 2.003398 | 2.998074 | 0.332992 | 0.33269 | 0.334318 | 0.499386 | 2.00094 | 0.498694 | 0.50065 | 0.500468 | 0.49987 | 0.500436 | 0.501646 | 42050.13935 |
| std | 71.795363 | 0.817005 | 1.414021 | 1.414227 | 0.471284 | 0.471177 | 0.471752 | 0.5 | 0.816209 | 0.499999 | 0.5 | 0.5 | 0.5 | 0.5 | 0.499998 | 12110.237201 |
| min | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 7725.0 |
| 25% | 63.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 33500.0 |
| 50% | 125.0 | 2.0 | 2.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 41850.0 |
| 75% | 187.0 | 3.0 | 3.0 | 4.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 50750.0 |
| max | 249.0 | 3.0 | 4.0 | 5.0 | 1.0 | 1.0 | 1.0 | 1.0 | 3.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 77975.0 |


In [9]:
#check for missing values
df.isna().sum()

Area             0
Garage           0
FirePlace        0
Baths            0
White Marble     0
Black Marble     0
Indian Marble    0
Floors           0
City             0
Solar            0
Electric         0
Fiber            0
Glass Doors      0
Swiming Pool     0
Garden           0
Prices           0
dtype: int64

In [10]:
#Y is our target variable
y = df["Prices"]

In [11]:
#X is all of our attributes or features
X = df.drop(columns=["Prices"])

In [12]:
#split the dataset into random training and testing sets, 80% is used for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

In [13]:
X_train.shape

(400000, 15)

In [14]:
X_test.shape

(100000, 15)

In [15]:
y_train.shape

(400000,)

In [16]:
y_test.shape

(100000,)

In [17]:
#construct a DecisionTreeRegressor model and fit it to the training data
#with the default criterion shown in the output below ('mse'), each split is chosen
#to give the largest reduction in the mean squared error (variance) of the target
dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)

DecisionTreeRegressor(ccp_alpha=0.0, criterion='mse', max_depth=None,
                      max_features=None, max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, presort='deprecated',
                      random_state=None, splitter='best')
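
To make the `criterion='mse'` setting concrete, here is a minimal sketch of how one split on one feature can be scored: every candidate threshold is rated by the size-weighted variance of the target in the two resulting groups, and the lowest score wins. The helper `best_split` and the toy values are assumptions made up for illustration; this is not scikit-learn's actual implementation, which is considerably more optimized.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, weighted_mse) for the best split of the form x <= threshold.

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    best_threshold, best_score = None, np.inf
    values = np.unique(x)  # sorted distinct feature values
    # candidate thresholds are midpoints between consecutive distinct values
    for lo, hi in zip(values[:-1], values[1:]):
        threshold = (lo + hi) / 2.0
        left, right = y[x <= threshold], y[x > threshold]
        # size-weighted MSE of the children (variance around each child's mean)
        score = (left.size * left.var() + right.size * right.var()) / y.size
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score

# toy values, not taken from the dataset
area = np.array([50, 60, 70, 200, 210, 220])
price = np.array([30000, 31000, 32000, 60000, 61000, 62000])

threshold, score = best_split(area, price)
print("parent MSE:", price.var())
print("best threshold:", threshold, "weighted child MSE:", score)
```

On this toy data the best threshold falls between the small and the large houses, and the weighted child MSE is far below the parent MSE. A full tree repeats this search over every feature and then recurses into each child node.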

In [18]:
#predict prices for the test set with the fitted decision tree model
y_pred = dt.predict(X_test)
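
A tree grown with the defaults (`max_depth=None`, `min_samples_leaf=1`) can keep splitting until it reproduces the training targets, so it can be useful to compare training and test error. The sketch below illustrates this on synthetic data; the data, the `_demo` variable names, and the depth of 5 are assumptions chosen for illustration and say nothing about the house price dataset itself.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# synthetic data (an assumption): one feature with distinct values, linear trend plus noise
rng = np.random.default_rng(0)
X_demo = (np.arange(2000) * 0.125).reshape(-1, 1)
y_demo = 100 * X_demo[:, 0] + rng.normal(0, 500, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

results = {}
for depth in (None, 5):  # None = grow the tree without a depth limit
    model = DecisionTreeRegressor(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_mae = mean_absolute_error(y_tr, model.predict(X_tr))
    test_mae = mean_absolute_error(y_te, model.predict(X_te))
    results[depth] = (train_mae, test_mae)
    print("max_depth =", depth, "train MAE:", train_mae, "test MAE:", test_mae)
```

With distinct feature values the unlimited tree fits the training set essentially perfectly, so only the test error says anything about generalization. Whether a depth limit helps depends on the data and is best checked the same way.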

In [19]:
#model evaluation metrics for decisiontreeregressor
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Mean Absolute Error: 234.216
Mean Squared Error: 107644.9375
Root Mean Squared Error: 328.0928793802145
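
The three metrics are closely related: MAE averages the absolute errors, MSE averages the squared errors, and RMSE is the square root of MSE, which brings it back to the units of the target. A minimal sketch with NumPy follows; the true prices are the first four rows shown by `df.head()`, while the predictions are made-up values for illustration, not output from the fitted model.

```python
import numpy as np

# first four prices from df.head(); predictions are made up for illustration
y_true_demo = np.array([43800, 37550, 49500, 50075], dtype=float)
y_pred_demo = np.array([43500, 37550, 50000, 50275], dtype=float)

errors = y_true_demo - y_pred_demo   # [300, 0, -500, -200]
mae = np.mean(np.abs(errors))        # mean absolute error
mse = np.mean(errors ** 2)           # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error, in price units

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

For scale, the RMSE of about 328 reported above is small next to the standard deviation of Prices (about 12,110) in the summary statistics.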
