 ## Dataset Attributes

 The dataset contains information about houses in Ames, Iowa. The data was collected by the Ames City Assessor’s Office and describes 2930 property sales that took place in Ames between 2006 and 2010. The dataset, containing 81 variables, was compiled and published by De Cock in 2011.

 Some of the variables contained in the original dataset have been removed from the dataset provided to you.
 The dataset provided to you contains the following variables:
* **Year_Built:** year that the house was originally constructed
* **Year_Remod_Add:** year that the house was last remodelled
* **Total_Bsmt_SF:** total size of basement area in square feet
* **First_Flr_SF:** size of the first floor in square feet
* **Second_Flr_SF:** size of the second floor in square feet
* **Gr_Liv_Area:** above-grade (ground) living area in square feet
* **Full_Bath:** number of full above grade bathrooms in the house
* **Half_Bath:** number of half above grade bathrooms in the house
* **Bedroom_AbvGr:** number of above grade bedrooms (does not include basement bedrooms)
* **Kitchen_AbvGr:** number of above grade kitchens
* **TotRms_AbvGrd:** total number of above grade rooms (does not include bathrooms)
* **Fireplaces:** number of fireplaces in the house
* **Garage_Area:** size of garage in square feet
* **Sale_Price:** sale price of the house in dollars


*De Cock, D. (2011). "Ames, Iowa: Alternative to the Boston Housing Data as an End of Semester
Regression Project," Journal of Statistics Education, Volume 19, Number 3.*

- https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt
- http://ww2.amstat.org/publications/jse/v19n3/decock.pdf


 ## Objective

 The goal of this task is to analyse the relationship between these variables and build a multiple linear regression model that predicts the sale price from the `Gr_Liv_Area` and `Garage_Area` variables.


In [826]:
# Import libraries
import numpy as np 
import pandas as pd 

# Plotting modules
import matplotlib.pyplot as plt
import seaborn as sns

# Regression module
from sklearn.linear_model import LinearRegression

# Scaling module
from sklearn.preprocessing import MinMaxScaler

# Training module 
from sklearn.model_selection import train_test_split


# Ensures the same random data is used each time you execute the code
np.random.seed(0)

In [827]:
# Read in the data set
df = pd.read_csv('ames.csv')

In [None]:
# Getting number of rows and columns in dataframe.
df.shape

In [None]:
# Printing the first 5 rows of dataframe.
df.head()


In [None]:
# Clean and pre-process data if necessary.
missing_values_count = df.isnull().sum()
print(missing_values_count)

No column has missing values, so no cleaning is needed.
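The same check can be condensed into a single flag plus a list of affected columns, which is convenient when a dataset has many columns. The sketch below uses a small hypothetical frame (`toy`, with one deliberately missing value) rather than `ames.csv`, so the numbers are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Ames data: one value is deliberately missing.
toy = pd.DataFrame({
    "Gr_Liv_Area": [1656.0, 896.0, np.nan, 2110.0],
    "Garage_Area": [528.0, 730.0, 312.0, 522.0],
})

# Per-column count of missing values, as in the cell above.
missing = toy.isnull().sum()

# Names of the columns that contain at least one missing value.
cols_with_gaps = missing[missing > 0].index.tolist()

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())

print(cols_with_gaps, has_missing)
```

On the real data, `df` would take the place of `toy`.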

In [None]:
# Exploring the relationship between dependent and independent variables.
selected_columns_1 = ['Year_Built', 'Year_Remod_Add', 'Total_Bsmt_SF', 'First_Flr_SF', 'Second_Flr_SF', 'Sale_Price']
sns.pairplot(data=df[selected_columns_1])

In [None]:
selected_columns_2 = ['Gr_Liv_Area', 
'Full_Bath', 'Half_Bath', 'Bedroom_AbvGr', 'Kitchen_AbvGr', 'TotRms_AbvGrd', 'Fireplaces', 'Garage_Area', 'Sale_Price']
sns.pairplot(data=df[selected_columns_2])

Several of the features above are visibly skewed rather than Gaussian, so the data is normalised before modelling. The pairplots also suggest that total basement area has the strongest positive correlation with sale price, and that first-floor area is strongly positively correlated with it as well. The total number of above-grade rooms is among the variables with a weaker correlation to sale price.
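Visual impressions from a pairplot can be backed up with numbers. The sketch below computes sample skewness and ranks features by their Pearson correlation with the target. It runs on synthetic data (`toy`) generated under assumed distributions, so it shows the computation only and says nothing about the actual Ames values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in for the notebook's df (assumed shapes, not real data).
area = rng.lognormal(mean=7.2, sigma=0.3, size=n)    # right-skewed, like floor areas
rooms = rng.integers(3, 10, size=n).astype(float)    # roughly uniform room counts
price = 100 * area + 2000 * rooms + rng.normal(0, 20000, size=n)

toy = pd.DataFrame({
    "Gr_Liv_Area": area,
    "TotRms_AbvGrd": rooms,
    "Sale_Price": price,
})

# Sample skewness per column: values well above 0 indicate a long right tail.
skewness = toy.skew()

# Pearson correlation of each feature with the target, strongest first.
corr_with_price = (
    toy.corr()["Sale_Price"].drop("Sale_Price").sort_values(ascending=False)
)

print(skewness)
print(corr_with_price)
```

Applied to the notebook's `df`, the same two calls would give a numeric ranking to compare against the pairplots.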

In [None]:
# Explore the data with visualisations such as histograms and correlation matrices
plt.figure(figsize=(10,6))
corr_coeff_mat = df.corr()
sns.heatmap(corr_coeff_mat, annot=True)
plt.show()
plt.close()

In [834]:
# Splitting data into dependent and independent variables.

# Independent variables and dependent variables
X = df[['Gr_Liv_Area', 'Garage_Area']].values # Independent variables.
y = df['Sale_Price'].values # Dependent variables.

# Reshaping for fitting.
y = y.reshape(-1, 1)
X = X.reshape(-1, X.shape[1])


In [None]:
# Generating plots to explore relationship between independent and dependent variables.
sns.pairplot(data=df[['Gr_Liv_Area', 'Garage_Area', 'Sale_Price']])


The above-grade living area has a strong positive correlation with the sale price of the house. The garage area is also positively correlated with the sale price, but less strongly. Not all the features follow a Gaussian distribution, so normalisation is applied to the training data later.

In [836]:
# Create a training and test set with a 75:25 split ratio
rseed = 23 # Use the same random seed for learning purposes to get the same result
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=rseed)

In [837]:
# Building a multiple linear regression model using a training set with all independent variables.

# Normalising the data using a MinMaxScaler.

# Fit the scaler on train data
sc = MinMaxScaler()
sc.fit(X_train)

# Apply the scaler on train and test data
X_train = sc.transform(X_train)
X_test = sc.transform(X_test)
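
The scaler is fitted on the training set only, so the test set is rescaled with the training minimum and maximum. A minimal sketch with made-up numbers (`X_train_toy` and `X_test_toy` are hypothetical) shows the formula `(x - min) / (max - min)`, and that a test value beyond the training range lands outside [0, 1].

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: columns play the role of Gr_Liv_Area and Garage_Area.
X_train_toy = np.array([[1000.0, 200.0],
                        [1500.0, 400.0],
                        [2000.0, 600.0]])
X_test_toy = np.array([[2500.0, 300.0]])  # 2500 exceeds the training maximum

# Fit on the training data only, as in the cell above.
sc_toy = MinMaxScaler().fit(X_train_toy)
scaled = sc_toy.transform(X_test_toy)

# The same transformation written out by hand with training statistics.
train_min = X_train_toy.min(axis=0)
train_max = X_train_toy.max(axis=0)
manual = (X_test_toy - train_min) / (train_max - train_min)

print(scaled)
print(np.allclose(scaled, manual))
```

Fitting on the training set alone keeps information about the test set out of the preprocessing step.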


In [None]:
# Fit a model.
lm = LinearRegression()
model = lm.fit(X_train, y_train)
predictions = lm.predict(X_test)

# Create line coordinates.
X_line = np.empty(X_test.shape)
for i in range(X_test.shape[1]):
    X_line[:, i] = np.linspace(np.min(X_test[:,i]), np.max(X_test[:,i]), num=X_test.shape[0])
y_line = lm.predict(X_line)

# Printing the intercept and gradient values.
print('Intercept: \n', lm.intercept_)
print('Coefficients: \n', lm.coef_)

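On the min-max scaled features, the coefficient for above-grade living area is almost twice the coefficient for garage area, so living area has the larger effect on the predicted sale price across its range.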

In [None]:
# Generating predictions for test set.
print(predictions[0:20])

In [None]:
# Evaluating model performance using the R2 score (coefficient of determination).
print("R2 Score:", round(model.score(X_test, y_test), 4))

56% of the variance in the sale price is explained by the independent variables in the fitted model.
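For reference, the R2 score reported by `model.score` is `1 - SS_res / SS_tot`. The sketch below reproduces it by hand on four made-up prices (`y_true` and `y_pred` are hypothetical, not notebook output) and also computes the mean squared error and its square root, which express the typical error in dollars.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted sale prices.
y_true = np.array([200_000.0, 150_000.0, 320_000.0, 180_000.0])
y_pred = np.array([210_000.0, 140_000.0, 280_000.0, 190_000.0])

# R2 = 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Mean squared error and its square root (RMSE, in dollars).
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)

print(round(r2_manual, 4), round(r2_score(y_true, y_pred), 4))
print(mse, rmse)
```

In the notebook, `y_test` and `predictions` would take the place of the toy arrays.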

In [None]:
# Plot the errors
fig, ax = plt.subplots(1, X_test.shape[1], sharey=True, sharex=True, figsize=(20,14))
fig.suptitle("Sale Price vs. scaled independent variables")
ax[0].set_ylabel('Sale Price')

# Get values for the error bar
error_bar_values = np.abs((y_test-predictions)[:,0])

# Plot data, predicted values, and error bars
for i in range(X_test.shape[1]):
    ax[i].errorbar(X_test[:, i], y_test[:, 0], yerr=error_bar_values, fmt='.k', ecolor='red', label='True')
    ax[i].scatter(X_test[:,i], predictions[:,0], c='b', marker='.', label='Predicted')
    ax[i].legend(loc='best', fontsize='x-small')
    

**Interpret coefficients in the context of the prediction:**
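With both features scaled to the [0, 1] range of the training data, the coefficient for above-grade living area (about 403,197) is almost twice the coefficient for garage area (about 213,495). Each coefficient is the predicted change in sale price, in dollars, as that feature moves from its training-set minimum to its maximum while the other feature is held fixed. Living area therefore has the larger effect across its observed range. Note that the coefficients measure effect size under this scaling, not correlation.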
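Because the model was fitted on min-max scaled inputs, its coefficients are not in dollars per square foot. One way to recover per-unit slopes is to multiply each coefficient by the scaler's `scale_` attribute and fold `min_` into the intercept. The sketch below checks this on synthetic, noise-free data with known slopes (100 and 80 are assumptions chosen for the example, not estimates from the Ames data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)

# Synthetic features standing in for Gr_Liv_Area and Garage_Area.
X_raw = np.column_stack([
    rng.uniform(500, 3000, 200),
    rng.uniform(0, 1000, 200),
])
# Noise-free target with known slopes, so the result can be verified.
y_toy = 50_000 + 100 * X_raw[:, 0] + 80 * X_raw[:, 1]

# Same pipeline as the notebook: scale, then fit.
sc_toy = MinMaxScaler().fit(X_raw)
lm_toy = LinearRegression().fit(sc_toy.transform(X_raw), y_toy)

# MinMaxScaler computes X_scaled = X * scale_ + min_, hence:
coef_per_unit = lm_toy.coef_ * sc_toy.scale_                     # dollars per square foot
intercept_raw = lm_toy.intercept_ + lm_toy.coef_ @ sc_toy.min_   # intercept on raw inputs

print(coef_per_unit, intercept_raw)
```

With the notebook's objects, `lm.coef_ * sc.scale_` would give the corresponding per-square-foot slopes for the two features.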

**Summarise findings**
According to the correlation matrix, above-grade living area, first-floor area and garage area have the strongest positive correlations with sale price. The number of above-grade kitchens is negatively correlated with sale price. The number of above-grade bedrooms has the weakest positive correlation with sale price.

The linear regression model, fitted on the training set, explains about 56% of the variance in sale price on the test set.

For above-grade living area, the error plot shows that the predicted values (blue points) closely match the true values (black points) for cheaper houses, indicating that the model performs well at the low end of the price range.
As house prices increase, however, the errors (red bars) become larger, suggesting that the model struggles to predict high-priced houses accurately. The same conclusions can be drawn from the error plot of sale price against garage area.
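The observation that errors grow with price can be quantified by grouping the absolute error into price quartiles. The sketch below uses synthetic data in which the error spread is built to grow with price (an assumption made purely for illustration), so it demonstrates the calculation and not the notebook's result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic true prices and predictions (hypothetical values).
y_true = rng.uniform(80_000, 500_000, 400)
# Assumed error pattern: noise whose spread is 10% of the true price.
y_pred = y_true + rng.normal(0, 0.1 * y_true)

errors = pd.DataFrame({
    "price": y_true,
    "abs_error": np.abs(y_true - y_pred),
})

# Split houses into four equally sized price bands (quartiles).
errors["price_band"] = pd.qcut(errors["price"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Mean absolute error within each band.
mae_by_band = errors.groupby("price_band", observed=True)["abs_error"].mean()

print(mae_by_band)
```

In the notebook, `y_test[:, 0]` and `predictions[:, 0]` would take the place of the toy arrays. A clearly higher mean absolute error in the top band would support the reading of the error plot.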