# Housing: SVM

Goal: Use Boston city housing data to compare different regressors. The columns are:
- Value: Median value of owner-occupied homes in \$1000's
- Crime: per capita crime rate by town
- Zone: proportion of residential land zoned for lots over 25,000 sq.ft.
- Industrial: proportion of non-retail business acres per town
- River: Charles River dummy (= 1 if tract bounds river; 0 otherwise)
- Nitric: nitric oxides concentration (parts per 10 million)
- Room: average number of rooms per dwelling
- Owner: proportion of owner-occupied units built prior to 1940
- Distance: weighted distances to five Boston employment centres
- Highway: index of accessibility to radial highways
- Tax: full-value property-tax rate per \$10,000
- Teacher: pupil-teacher ratio
- Low: percentage of the lower status of the population

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

In [2]:
df = pd.read_csv('Housing Boston city.csv')
df.head()

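| Unnamed: 0 | value | crime | zone | industrial | river | nitric | room | owner | distance | highway | tax | teacher | low |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24.0 | 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.09 | 1 | 296 | 15.3 | 4.98 |
| 1 | 21.6 | 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 9.14 |
| 2 | 34.7 | 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 |
| 3 | 33.4 | 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 2.94 |
| 4 | 36.2 | 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 5.33 |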


In [3]:
# Split the data into features and target variable
x = df.drop('value', axis=1)
y = df.value

# Splitting the dataset into the Training set and Test set
xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=42)

# Feature scaling
xtrain = StandardScaler().fit_transform(xtrain)
xtest = StandardScaler().fit_transform(xtest)

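- `x = df.drop('value', axis=1)` creates a new DataFrame `x` containing every column except `value`.
    - `axis=1` means the label refers to a column rather than a row.
    - Because the CSV's index column was read in as `Unnamed: 0`, it stays in `x` as a feature. It is only a row counter, so reading the file with `index_col=0` or dropping the column would exclude it.
- `y = df.value` assigns the `value` column to `y` as the target.
- `xtrain, xtest, ytrain, ytest = train_test_split(x, y, random_state=42)` splits the data into training and test sets. No `test_size` is given, so the default of 25% of the rows goes to the test set; `random_state=42` makes the split reproducible.
- `xtrain = StandardScaler().fit_transform(xtrain)` standardizes the training features (`xtrain`) so that each has a mean of 0 and a standard deviation of 1.
    - `fit_transform` fits the scaler to the training data (`xtrain`) and then transforms it.
    - The test set is scaled here by a second scaler fitted on `xtest` itself. The usual practice is to fit one scaler on `xtrain` and apply its `transform` to `xtest`, so that both sets share the training mean and standard deviation.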
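
Fitting the scaler once on the training features and reusing it for the test features can be sketched as follows. This is a minimal illustration on synthetic data; the names `X_train`, `X_test`, and `scaler`, and the random data itself, are assumptions made for the example and are not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the housing features (illustrative only)
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Fit the scaler on the training features only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Reuse the training mean and standard deviation for the test features
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # close to 0 by construction
print(X_test_scaled.mean(axis=0).round(3))   # generally not exactly 0
```

The test features will usually not have a mean of exactly 0 after this transform, which is expected: they are expressed on the scale the model was trained on.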

In [4]:
# Decision Tree regressor

dt = DecisionTreeRegressor(max_depth=5, random_state=1).fit(xtrain, ytrain)
pred1 = dt.predict(xtest)
mse1 = mean_squared_error(ytest, pred1)

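- `dt = DecisionTreeRegressor(max_depth=5, random_state=1).fit(xtrain, ytrain)` creates a DecisionTreeRegressor object and fits it.
    - `max_depth=5` limits the depth of the tree to 5, which restrains overfitting.
    - `random_state=1` makes the fitted tree reproducible.
    - `fit` fits the model to the training data.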
- `pred1 = dt.predict(xtest)` generates predictions (`pred1`) for the target variable using the trained Decision Tree regressor model (`dt`) on the testing features (`xtest`).
- `mse1 = mean_squared_error(ytest, pred1)` calculates the mean squared error (MSE) between the actual target values (`ytest`) and the predicted values (`pred1`).
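
MSE is the average of the squared differences between actual and predicted values. The short sketch below computes it by hand and checks the result against `mean_squared_error`; the four target values and predictions are made-up numbers chosen for easy arithmetic, not values from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values (not from the dataset)
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

# Errors are 1, 0, -2, -1; their squares are 1, 0, 4, 1
manual_mse = np.mean((y_true - y_pred) ** 2)
sklearn_mse = mean_squared_error(y_true, y_pred)

print(manual_mse)   # 1.5
print(sklearn_mse)  # 1.5
```

Because the errors are squared, a single large miss raises the MSE much more than several small ones.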

In [5]:
# Random Forest Regressor

rf = RandomForestRegressor(max_features=10, random_state=1).fit(xtrain, ytrain)
pred2 = rf.predict(xtest)
mse2 = mean_squared_error(ytest, pred2)

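- `rf = RandomForestRegressor(max_features=10, random_state=1).fit(xtrain, ytrain)` creates a RandomForestRegressor object that considers at most 10 randomly chosen features when searching for the best split at each node, then fits it to the training data. The number of trees is left at the library default.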

In [6]:
# SVM Regressor with linear kernel

svm1 = SVR(kernel='linear').fit(xtrain, ytrain)
pred3 = svm1.predict(xtest)
mse3 = mean_squared_error(ytest, pred3)

- `svm1 = SVR(kernel='linear').fit(xtrain, ytrain)` creates an SVR (Support Vector Regressor) object with a linear kernel.

In [7]:
# SVM Regressor with polynomial kernel

svm2 = SVR(kernel='poly', degree=3).fit(xtrain, ytrain)
pred4 = svm2.predict(xtest)
mse4 = mean_squared_error(ytest, pred4)

- `svm2 = SVR(kernel='poly', degree=3).fit(xtrain, ytrain)` creates an SVR (Support Vector Regressor) object with a polynomial kernel of degree 3.

In [8]:
# SVM Regressor with RBF kernel

svm3 = SVR(kernel='rbf').fit(xtrain, ytrain)
pred5 = svm3.predict(xtest)
mse5 = mean_squared_error(ytest, pred5)

- `svm3 = SVR(kernel='rbf').fit(xtrain, ytrain)` creates an SVR (Support Vector Regressor) object with an RBF kernel.
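
The three SVR cells differ only in the `kernel` argument, so the comparison can also be written as a loop. The sketch below does this on synthetic data from `make_regression`; the dataset, the dictionary name `kernel_mse`, and the resulting scores are illustrative assumptions and say nothing about how the kernels rank on the housing data.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Scale with statistics from the training set only
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)

# Fit one SVR per kernel and record its test MSE
kernel_mse = {}
for kernel in ("linear", "poly", "rbf"):
    model = SVR(kernel=kernel).fit(X_train, y_train)  # degree defaults to 3 for 'poly'
    kernel_mse[kernel] = mean_squared_error(y_test, model.predict(X_test))

for kernel, mse in kernel_mse.items():
    print(f"SVR ({kernel}) MSE: {mse:.2f}")
```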

In [9]:
print("Decision Tree MSE:", mse1)
print("Random Forest MSE:", mse2)
print("SVM Linear MSE:", mse3)
print("SVM Polynomial MSE:", mse4)
print("SVM RBF MSE:", mse5)

Decision Tree MSE: 16.911385263259735
Random Forest MSE: 11.561203433070865
SVM Linear MSE: 25.114979899031418
SVM Polynomial MSE: 20.380428314798667
SVM RBF MSE: 26.277714829158647
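
MSE is expressed in squared units of the target, here (\$1000)². Taking the square root gives the root mean squared error (RMSE), which is in \$1000s and easier to read. The sketch below converts the printed values; the dictionary name `mse_by_model` is an assumption for the example, and the numbers are copied from the output above.

```python
import math

# MSE values copied from the printed output
mse_by_model = {
    "Decision Tree": 16.911385263259735,
    "Random Forest": 11.561203433070865,
    "SVM Linear": 25.114979899031418,
    "SVM Polynomial": 20.380428314798667,
    "SVM RBF": 26.277714829158647,
}

# RMSE is the square root of MSE, in units of $1000
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

# Print from lowest to highest error
for name, rmse in sorted(rmse_by_model.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")

best_model = min(rmse_by_model, key=rmse_by_model.get)
print("Lowest test error:", best_model)
```

On this split the random forest has the lowest error, with an RMSE of about 3.40, that is, a typical miss of roughly \$3,400 on the median home value.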


### Hyperparameter tuning

In [10]:
# Define parameter grid
param = {'C': [0.1, 1, 10, 100, 1000]}

# Create the SVR model with RBF kernel
svr = SVR(kernel='rbf')

# Grid search with cross-validation
search = GridSearchCV(svr, param, cv=5, scoring='neg_mean_squared_error').fit(xtrain, ytrain)

# Best parameters and best score
print("Best parameter:", search.best_params_)
print("Best MSE:", -search.best_score_)

Best parameter: {'C': 100}
Best MSE: 9.984685618908673


- `param` specifies the hyperparameters to tune and the values to try for each.
    - `C`: regularization parameter. The strength of the regularization is inversely proportional to `C`, so larger values fit the training data more closely.
- `svr = SVR(kernel='rbf')` instantiates an SVR model with a radial basis function (RBF) kernel.
- `GridSearchCV` exhaustively searches the specified grid, evaluating each combination of hyperparameters with cross-validation.
    - `estimator`: the model to tune, here the SVR with RBF kernel (`svr`).
    - `cv`: the number of folds for cross-validation, set to 5 (`cv=5`).
    - `scoring`: the metric used to select the best parameters, here negative mean squared error. scikit-learn scorers follow a higher-is-better convention, which is why the MSE is negated.
- After fitting, the best combination of hyperparameters (`best_params_`) and its mean cross-validated score (`best_score_`) are printed.
    - The `-` sign converts the negative score back to an MSE.
    - This MSE is averaged over validation folds drawn from the training data, and `xtest` is not used in this cell, so the value is not directly comparable with the test-set MSEs printed earlier.
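
To compare the tuned model with the earlier regressors, it would need to be scored on the held-out test set. The sketch below shows one way to do that, and also places the scaler inside a pipeline so it is refit on the training part of each fold. It runs on synthetic data; the dataset, the names `pipe`, `cv_mse`, and `test_mse`, and the choice of a pipeline are assumptions for illustration, not the notebook's method.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic regression problem (illustrative only)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The scaler is refit on the training part of every CV fold
pipe = make_pipeline(StandardScaler(), SVR(kernel="rbf"))

# make_pipeline names each step after its lowercased class name, hence "svr__C"
param_grid = {"svr__C": [0.1, 1, 10, 100, 1000]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="neg_mean_squared_error")
search.fit(X_train, y_train)

# best_score_ is a cross-validated estimate computed on the training data
cv_mse = -search.best_score_

# search.predict uses the best pipeline refit on the full training set
test_mse = mean_squared_error(y_test, search.predict(X_test))

print("Best parameter:", search.best_params_)
print("Cross-validated MSE:", cv_mse)
print("Test MSE:", test_mse)
```

The grid here only varies `C`, as in the cell above; other SVR parameters such as `gamma` and `epsilon` could be added to `param_grid` in the same `svr__` form.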