# Regression Model For Predicting House Prices

Author: Nishant Sahni

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.kernel_ridge import KernelRidge
from sklearn.preprocessing import normalize
from sklearn.preprocessing import StandardScaler

The data is first loaded into a pandas DataFrame and some summary statistics are printed. `data.describe()` reports the count, mean, standard deviation, minimum, maximum, and quartiles of each numeric column, `data.corr()` gives the pairwise correlation between columns, and `data.isna()` flags missing values.

In [None]:
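data = pd.read_csv('/Users/Nishant/Desktop/Machine Learning/Exam/regression_housing_prices.csv')
print(data.head())
print(data.keys())
print("")
print(data.describe())
print("")
# Correlate numeric columns only; recent pandas versions raise an error on non-numeric columns.
print(data.select_dtypes(include='number').corr())
# Count missing values per column instead of printing the full boolean frame.
print(data.isna().sum())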

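After exploring the output above, the columns whose correlation with the target (price) falls below a threshold of 0.25 are dropped, since they are expected to have little impact on the prediction. The comparison uses the signed correlation, so a column with a strong negative correlation would also be dropped. The final list of features is then printed.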

In [None]:
featurelist = []
threshold = 0.25

for item in data:
	if item != 'date' and item != 'price':
		corr = float(data[item].corr(data['price']))
		print(item, corr)
		if corr >= threshold:
			featurelist.append(item)

print("")
print("Final Feature List: ", featurelist)
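The selection rule above can also be written without an explicit loop. The following is a minimal sketch on synthetic data: the column names (`sqft`, `age`, `noise_col`) and all values are made up for illustration and are not taken from the housing file. It also shows how filtering on the absolute correlation, a variant that the notebook does not use, would treat a negatively correlated feature.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical columns and values).
rng = np.random.default_rng(0)
n = 200
sqft = rng.normal(2000, 500, n)
age = rng.normal(30, 10, n)
noise_col = rng.normal(0, 1, n)
price = 150 * sqft - 4000 * age + rng.normal(0, 20000, n)
toy = pd.DataFrame({"sqft": sqft, "age": age, "noise_col": noise_col, "price": price})

# Pearson correlation of every feature with the target, in one call.
target_corr = toy.drop(columns="price").corrwith(toy["price"])

threshold = 0.25
# Rule used in the notebook: keep features with signed correlation >= threshold.
positive_only = target_corr[target_corr >= threshold].index.tolist()
# Variant (assumption, not the notebook's rule): keep features with |correlation| >= threshold.
by_magnitude = target_corr[target_corr.abs() >= threshold].index.tolist()

print(target_corr)
print("corr >= threshold:  ", positive_only)
print("|corr| >= threshold:", by_magnitude)
```

In this toy data `age` lowers the price, so it survives only the absolute-value filter.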

The X and y values are then extracted as NumPy arrays as follows.

In [None]:
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() is the replacement.
X = data[featurelist].to_numpy()
y = data[['price']].to_numpy()

The data is then split into training and testing sets with an 80:20 split, so that the model is fit on the training data and its predictive performance is measured on held-out test data.
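The plotting cell that follows refers to `x_train` and `y_train`, so the split has to run before it. Below is a minimal sketch of the call, using the same `test_size` and `random_state` as the `train_test_split` call that appears later in this notebook. The arrays `X_demo` and `y_demo` are synthetic stand-ins with the same layout as `X` and `y`.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays with the same layout as X and y (values are synthetic).
X_demo = np.arange(40, dtype=float).reshape(20, 2)   # 20 samples, 2 features
y_demo = np.arange(20, dtype=float).reshape(-1, 1)   # column vector, like y

# test_size=0.20 holds out 20% of the rows; random_state makes the shuffle repeatable.
x_train_demo, x_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=10
)
print(x_train_demo.shape, x_test_demo.shape)
```

On the real data the equivalent call would be `x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)`.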

A series of plots is obtained for each feature with the target to further examine their correlation.

In [None]:
for i, feature in enumerate(featurelist):
	plt.title("Scatter Plot - %s vs price" % feature)
	plt.xlabel(feature)
	plt.ylabel('price')
	plt.scatter(x_train[:, i], y_train[:, 0], c='red')
	plt.show()

# Ordinary Least Squares Regression

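We start with ordinary least squares linear regression. Because the features span very different ranges, they are standardized before fitting. Older scikit-learn releases offered a `normalize=True` argument for this; it was removed in version 1.2, so a `StandardScaler` step in a pipeline is used here. For plain least squares with an intercept, this rescaling does not change the predictions, but it puts the coefficients on a comparable scale.

In [None]:
from sklearn.pipeline import make_pipeline
linreg = make_pipeline(StandardScaler(), LinearRegression())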

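GridSearchCV is used to carry out 5-fold cross-validation. The training data is divided into five folds; the model is fit on four of them and scored on the fifth, rotating until every fold has served as the validation set. The averaged score is a less optimistic estimate of generalization than the training score and helps reveal overfitting. Since the parameter grid is empty, no hyperparameters are tuned at this stage.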

In [None]:
ols = GridSearchCV(linreg, cv=5, param_grid={})
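With an empty parameter grid, `GridSearchCV` evaluates a single candidate, so its `best_score_` should match the mean of the per-fold scores from `cross_val_score`. The sketch below checks this on synthetic data; `X_demo`, `y_demo`, and the coefficients used to generate them are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic regression problem (not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# One R^2 value per fold: fit on four folds, score on the remaining one.
fold_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

# Same 5-fold procedure through GridSearchCV with nothing to tune.
search = GridSearchCV(LinearRegression(), cv=5, param_grid={})
search.fit(X_demo, y_demo)

print("Per-fold R^2:", fold_scores)
print("Mean of folds:", fold_scores.mean())
print("best_score_:  ", search.best_score_)
```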

We then fit the model on the training data.

In [None]:
ols.fit(x_train, np.ravel(y_train))

A set of scores is obtained for the model, including the cross-validation score. For a regressor, `score` returns the coefficient of determination (R^2), not a classification accuracy.

In [None]:
print("")
print("SCORES FOR ORDINARY LEAST SQUARES:")
print("")
print("Gridsearch CV score: ", ols.best_score_)
print("Training set score: ", ols.score(x_train, y_train))
print("Test set score: ", ols.score(x_test, y_test))

We then use the trained model to predict the test data.

In [None]:
ols_predictions = ols.predict(x_test)

The mean square error and r^2 score are obtained for the predictions.

In [None]:
print("Mean squared error: ", mean_squared_error(y_test, ols_predictions))
print('r^2 score: ', r2_score(y_test, ols_predictions))
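To make the two metrics concrete, the sketch below computes them by hand with NumPy on four made-up values. The mean squared error is the average squared residual, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the actual values.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals.
mse = np.mean(residuals ** 2)

# R^2 = 1 - SS_res / SS_tot.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

print("Mean squared error:", mse)   # 0.375
print("r^2 score:", r2)             # 0.925
```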

A graph is then plotted between predicted prices and actual prices.

In [None]:
plt.scatter(ols_predictions, y_test, c='red')
plt.xlabel('Predicted Prices')
plt.ylabel('Actual Prices')
plt.show()

# Lasso Regression 

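We then move on to Lasso regression. Because the features span very different ranges, they are standardized first with a `StandardScaler` step in a pipeline (the `normalize=True` argument is not available in scikit-learn 1.2 and later). Scaling matters more here than for least squares, because the L1 penalty acts on the coefficient magnitudes, which depend on the feature scales.

In [None]:
from sklearn.pipeline import make_pipeline
lassreg = make_pipeline(StandardScaler(), Lasso())

GridSearchCV is used to conduct 5-fold cross-validation and to select the alpha parameter. Inside a pipeline, the parameter is addressed as `lasso__alpha`. The model is then fit.

In [None]:
lasso = GridSearchCV(lassreg, cv=5, param_grid={'lasso__alpha': [0.001, 0.01, 0.1, 1, 2]})
lasso.fit(x_train, np.ravel(y_train))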
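The role of alpha is easier to see on a small example. Lasso's L1 penalty shrinks coefficients and can set some of them exactly to zero as alpha grows. The sketch below uses synthetic data in which only the first two of five features affect the target; the data, the alpha values, and the `StandardScaler` pipeline are assumptions for illustration and do not reproduce the notebook's results.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only features 0 and 1 influence the target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

nonzero_counts = {}
for alpha in [0.001, 0.1, 1.0]:
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    model.fit(X_demo, y_demo)
    # make_pipeline names each step after its lowercased class name.
    coefs = model.named_steps["lasso"].coef_
    nonzero_counts[alpha] = int(np.sum(coefs != 0))
    print("alpha =", alpha, "coefficients:", np.round(coefs, 3))

print("Non-zero coefficients per alpha:", nonzero_counts)
```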

Some scores are then obtained. Also, the best parameters selected by GridSearchCV can be found below.

In [None]:
print("")
print("SCORES FOR LASSO:")
print("")
print("Gridsearch CV score: ", lasso.best_score_)
print("Training set score: ", lasso.score(x_train, y_train))
print("Test set score: ", lasso.score(x_test, y_test))
print("Best Parameters Selected: ", lasso.best_params_)

In the run reported here, the best value of alpha was 2. This is the largest value in the grid, so a wider grid might select a larger alpha still. We now use the fitted model to predict our test data.

In [None]:
lasso_predictions = lasso.predict(x_test)

The mean square error and r^2 score are then obtained as follows.

In [None]:
print("Mean squared error: ", mean_squared_error(y_test, lasso_predictions))
print('r^2 score: ', r2_score(y_test, lasso_predictions))

A graph is then plotted showing the predicted values versus the actual values of the price.

In [None]:
plt.scatter(lasso_predictions, y_test, c='red')
plt.xlabel('Predicted Prices')
plt.ylabel('Actual Prices')
plt.show()

# Kernelized Ridge Regression

Kernelized ridge regression is attempted next, using an 80:20 split of the data created with a fixed `random_state` for reproducibility.

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=10)

In [None]:
kernreg = KernelRidge()

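Since the features span very different ranges, the data is rescaled before fitting. Note that `normalize(..., norm='l2')` rescales each sample (row) to unit L2 norm; it does not bring each feature (column) to a common scale, which is what `StandardScaler` would do. With unit-norm rows, the dot products that feed the polynomial kernel stay in a bounded range, which helps keep the kernel values numerically well behaved.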

In [None]:
x_train_normalized = normalize(x_train, norm='l2')
x_test_normalized = normalize(x_test, norm='l2')
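The difference between row-wise normalization and per-feature standardization can be checked directly. The sketch below uses a tiny made-up array with two columns on very different scales. `StandardScaler` appears here only for comparison and is not what the cell above uses.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, normalize

# Two features on very different scales (synthetic values).
X_demo = np.array([[1000.0, 2.0],
                   [2000.0, 3.0],
                   [3000.0, 5.0]])

# Row-wise: every sample is divided by its own L2 norm.
row_normalized = normalize(X_demo, norm="l2")
row_norms = np.linalg.norm(row_normalized, axis=1)

# Column-wise: every feature is shifted to mean 0 and scaled to unit variance.
col_standardized = StandardScaler().fit_transform(X_demo)
col_means = col_standardized.mean(axis=0)
col_stds = col_standardized.std(axis=0)

print("Row norms after normalize:", row_norms)
print("Column means after StandardScaler:", col_means)
print("Column stds after StandardScaler:", col_stds)
```

After `normalize`, the first column still dominates every row, since each row is scaled as a whole.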

GridSearchCV is then used to conduct 5-fold cross-validation and to select the best kernel and parameters from the ones specified in the question. The model is then fit on the normalized training data.

In [None]:
param_grid = [
	{'kernel': ['linear']},
	{'alpha': [1], 'kernel': ['poly'], 'gamma': [1], 'degree': [2, 4, 7]},
	{'kernel': ['rbf'], 'gamma': [0.1, 0.5, 1, 2, 4]},
]
ridge = GridSearchCV(kernreg, cv=5, param_grid=param_grid)
ridge.fit(x_train_normalized, np.ravel(y_train))

The cross validation score and a few other statistics are obtained as follows.

In [None]:
print("")
print("SCORES FOR KERNELIZED RIDGE:")
print("")
print("Gridsearch CV score: ", ridge.best_score_)
print("Training set score: ", ridge.score(x_train_normalized, y_train))
print("Test set score: ", ridge.score(x_test_normalized, y_test))
print("Best Parameters Selected: ", ridge.best_params_)
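Beyond the single best combination, the fitted search object keeps the mean cross-validation score of every candidate in `cv_results_`, which allows the kernels to be compared side by side. The sketch below runs the same grid on a small synthetic problem; the data and the names `X_demo`, `y_demo`, and `search` are assumptions for illustration, and the resulting ranking says nothing about the housing data.

```python
import numpy as np
import pandas as pd
from sklearn.kernel_ridge import KernelRidge
from sklearn.model_selection import GridSearchCV

# Small synthetic non-linear regression problem (not the housing data).
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(120, 2))
y_demo = np.sin(3 * X_demo[:, 0]) + X_demo[:, 1] ** 2 + rng.normal(scale=0.05, size=120)

# Same grid as in the notebook: 1 linear + 3 polynomial + 5 RBF candidates.
param_grid = [
    {"kernel": ["linear"]},
    {"alpha": [1], "kernel": ["poly"], "gamma": [1], "degree": [2, 4, 7]},
    {"kernel": ["rbf"], "gamma": [0.1, 0.5, 1, 2, 4]},
]
search = GridSearchCV(KernelRidge(), cv=5, param_grid=param_grid)
search.fit(X_demo, y_demo)

# One row per candidate, ordered from best to worst mean validation score.
results = pd.DataFrame(search.cv_results_)[["params", "mean_test_score", "rank_test_score"]]
print(results.sort_values("rank_test_score").to_string(index=False))
```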

The best parameters selected by GridSearchCV can be observed above. The trained model is then used to predict the test values.

In [None]:
ridge_predictions = ridge.predict(x_test_normalized)

The mean square error and r^2 scores are then obtained.

In [None]:
print("Mean squared error: ", mean_squared_error(y_test, ridge_predictions))
print('r^2 score: ', r2_score(y_test, ridge_predictions))

A graph is then plotted to show the predicted results versus actual prices.