# Data Project 6 - Support Vector Machines

## House Price Prediction Dataset

This dataset was downloaded from https://www.kaggle.com/jcalvarezj/house-price-regression-prediction

The dataset has 16 columns in total: 15 attributes used as predictors and the target, the house price (Prices). It contains exactly 500,000 records (rows), that is, half a million.

Since the output (target) variable is continuous, this is a regression problem.

The predictors are house attributes such as area in square feet, number of garages, number of bathrooms, number of floors, and whether the house has solar or electric power. All of them are used to predict the target variable, Prices.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [2]:
#load the house prices dataset into a pandas dataframe object
df = pd.read_csv("Datasets/HousePrices_HalfMil.csv")

In [3]:
df.head()

,Area,Garage,FirePlace,Baths,White Marble,Black Marble,Indian Marble,Floors,City,Solar,Electric,Fiber,Glass Doors,Swiming Pool,Garden,Prices
0,164,2,0,2,0,1,0,0,3,1,1,1,1,0,0,43800
1,84,2,0,4,0,0,1,1,2,0,0,0,1,1,1,37550
2,190,2,4,4,1,0,0,0,2,0,0,1,0,0,0,49500
3,75,2,4,4,0,0,1,1,1,1,1,1,1,1,1,50075
4,148,1,4,2,1,0,0,1,2,1,0,0,1,1,1,52400


In [4]:
df.info()

#all are int64 datatypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500000 entries, 0 to 499999
Data columns (total 16 columns):
 #   Column         Non-Null Count   Dtype
---  ------         --------------   -----
 0   Area           500000 non-null  int64
 1   Garage         500000 non-null  int64
 2   FirePlace      500000 non-null  int64
 3   Baths          500000 non-null  int64
 4   White Marble   500000 non-null  int64
 5   Black Marble   500000 non-null  int64
 6   Indian Marble  500000 non-null  int64
 7   Floors         500000 non-null  int64
 8   City           500000 non-null  int64
 9   Solar          500000 non-null  int64
 10  Electric       500000 non-null  int64
 11  Fiber          500000 non-null  int64
 12  Glass Doors    500000 non-null  int64
 13  Swiming Pool   500000 non-null  int64
 14  Garden         500000 non-null  int64
 15  Prices         500000 non-null  int64
dtypes: int64(16)
memory usage: 61.0 MB


In [5]:
df.shape

(500000, 16)

In [6]:
df.describe()

,Area,Garage,FirePlace,Baths,White Marble,Black Marble,Indian Marble,Floors,City,Solar,Electric,Fiber,Glass Doors,Swiming Pool,Garden,Prices
count,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0,500000.0
mean,124.929554,2.00129,2.003398,2.998074,0.332992,0.33269,0.334318,0.499386,2.00094,0.498694,0.50065,0.500468,0.49987,0.500436,0.501646,42050.13935
std,71.795363,0.817005,1.414021,1.414227,0.471284,0.471177,0.471752,0.5,0.816209,0.499999,0.5,0.5,0.5,0.5,0.499998,12110.237201
min,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,7725.0
25%,63.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,33500.0
50%,125.0,2.0,2.0,3.0,0.0,0.0,0.0,0.0,2.0,0.0,1.0,1.0,0.0,1.0,1.0,41850.0
75%,187.0,3.0,3.0,4.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,50750.0
max,249.0,3.0,4.0,5.0,1.0,1.0,1.0,1.0,3.0,1.0,1.0,1.0,1.0,1.0,1.0,77975.0


In [7]:
df.isna().sum()

Area             0
Garage           0
FirePlace        0
Baths            0
White Marble     0
Black Marble     0
Indian Marble    0
Floors           0
City             0
Solar            0
Electric         0
Fiber            0
Glass Doors      0
Swiming Pool     0
Garden           0
Prices           0
dtype: int64

In [11]:
df.groupby('Floors')['Prices'].mean()
#average price for each value of Floors (0 or 1)

Floors
0    34557.656598
1    49561.046265
Name: Prices, dtype: float64

In [12]:
df.groupby('Garage')['Prices'].mean()
#average price for houses with different garage sizes

Garage
1    40567.213093
2    42036.961281
3    43540.448393
Name: Prices, dtype: float64

In [7]:
#X holds all predictor attributes (features); y is the target, Prices
X = df.drop(columns=["Prices"])
y = df["Prices"]

#use a smaller subset (the first 50,000 rows): training the SVM regressor
#on all 500,000 samples takes too long

X = X.iloc[:50000, :]
y = y.iloc[:50000]
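
The cell above keeps the first 50,000 rows. If the file happened to be ordered in some way, a random subset might be more representative. A minimal sketch of that alternative, assuming a small synthetic stand-in frame (`demo_df`, `X_small`, and `y_small` are hypothetical names, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df so the snippet runs without the CSV file
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({
    "Area": rng.integers(1, 250, size=1000),
    "Garage": rng.integers(1, 4, size=1000),
    "Prices": rng.integers(7725, 77976, size=1000),
})

# Draw a reproducible random subset instead of slicing the first n rows
subset = demo_df.sample(n=100, random_state=1)

# Split the subset into predictors and target; both keep the same index
X_small = subset.drop(columns=["Prices"])
y_small = subset["Prices"]

print(X_small.shape, y_small.shape)  # (100, 2) (100,)
```

On the real data the same call would be `df.sample(n=50000, random_state=1)`; whether it changes the results here has not been checked.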

In [8]:
#split the dataset into random training and testing sets, 80% is used for training and 20% for testing
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1, test_size=0.2)

In [9]:
print('Training Datasets Dimensions\n')

print('X_train:', X_train.shape)
print('y_train:', y_train.shape)

print('\nTesting Datasets Dimensions\n')

print('X_test:', X_test.shape)
print('y_test:', y_test.shape)

Training Datasets Dimensions

X_train: (40000, 15)
y_train: (40000,)

Testing Datasets Dimensions

X_test: (10000, 15)
y_test: (10000,)


In [10]:
#feature scaling: fit the scaler on the training data only, then reuse it for the test data

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)

X_test = scaler.transform(X_test)
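
The fit-on-train, transform-on-test pattern above can also be bundled with the model so the two steps cannot drift apart. A minimal sketch using scikit-learn's `make_pipeline`, run on synthetic data (`X_demo`, `y_demo`, and `model` are hypothetical names for illustration only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data: 300 samples, 3 features, roughly linear target
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=300)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# The pipeline fits the scaler on the training data inside fit()
# and applies the same fitted scaler inside predict()
model = make_pipeline(StandardScaler(), SVR())
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
print(pred.shape)  # (60,)
```

With a pipeline, the unscaled `X_train` and `X_test` are passed in directly, which also makes cross-validation safer because the scaler is refit on each training fold.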

In [11]:
X_train

array([[ 9.71874793e-01,  1.20903379e+00, -7.07517987e-01, ...,
         9.91833346e-01,  9.87873520e-01,  9.94713971e-01],
       [ 1.66886701e+00, -1.25517918e-02, -9.36220840e-04, ...,
         9.91833346e-01,  9.87873520e-01,  9.94713971e-01],
       [-3.24530727e-01, -1.23413737e+00,  1.41222731e+00, ...,
         9.91833346e-01, -1.01227534e+00,  9.94713971e-01],
       ...,
       [ 1.18097246e+00, -1.23413737e+00, -7.07517987e-01, ...,
         9.91833346e-01,  9.87873520e-01,  9.94713971e-01],
       [ 1.08339355e+00, -1.23413737e+00, -9.36220840e-04, ...,
        -1.00823390e+00, -1.01227534e+00,  9.94713971e-01],
       [ 2.39653803e-02, -1.25517918e-02,  7.05645545e-01, ...,
        -1.00823390e+00, -1.01227534e+00, -1.00531412e+00]])

In [12]:
from sklearn.svm import SVR
regr = SVR()
regr.fit(X_train, y_train)

SVR()

In [14]:
from sklearn.metrics import mean_absolute_error

y_pred = regr.predict(X_test)

print('Mean Absolute Error:$', mean_absolute_error(y_test, y_pred))

Mean Absolute Error:$ 9221.95151239194
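
For reference, the mean absolute error is the average of the absolute differences between actual and predicted values, so it is expressed in the same units as Prices. A small sketch checking this by hand; the actual values are the first four prices from `df.head()`, while the predictions are made-up numbers for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# First four prices from df.head(); the predictions are hypothetical
y_true_demo = np.array([43800, 37550, 49500, 50075])
y_pred_demo = np.array([42000, 40000, 45000, 52000])

# MAE by hand: mean of |actual - predicted|
mae_manual = np.mean(np.abs(y_true_demo - y_pred_demo))

# Same quantity from scikit-learn
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print(mae_manual, mae_sklearn)  # 2668.75 2668.75
```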


In [16]:
print('Coefficient of Determination(r^2):', regr.score(X_test, y_test))

Coefficient of Determination(r^2): 0.12405639277683389
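
An r^2 of about 0.12 means the model explains little of the variance in Prices. One plausible cause, not verified on this dataset, is that `SVR()` uses its defaults (`C=1.0`, `epsilon=0.1`) while the target is in the tens of thousands, so the regularisation keeps predictions close to a constant. A hedged sketch of one common remedy, scaling the target with `TransformedTargetRegressor`, on synthetic data whose target is on a similar scale (`X_demo`, `y_demo`, `plain`, and `scaled` are hypothetical names):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic data with a target on roughly the same scale as Prices
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 3))
y_demo = (
    42000
    + 12000 * (X_demo @ np.array([0.6, -0.5, 0.3]))
    + rng.normal(scale=500, size=400)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=1
)

# Baseline: scaled features, raw target, default SVR
plain = make_pipeline(StandardScaler(), SVR()).fit(X_tr, y_tr)

# Same model, but the target is standardised before fitting and
# predictions are transformed back to the original units
scaled = TransformedTargetRegressor(
    regressor=make_pipeline(StandardScaler(), SVR()),
    transformer=StandardScaler(),
).fit(X_tr, y_tr)

print("r^2, raw target:   ", plain.score(X_te, y_te))
print("r^2, scaled target:", scaled.score(X_te, y_te))
```

Raising `C` in `SVR` or tuning `C`, `epsilon`, and `gamma` with a grid search are alternatives; any gain on the house price data would need to be measured rather than assumed.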
