### LSE Data Analytics Online Career Accelerator 

# Course 301: Advanced Analytics for Organisational Impact

## My model: Linear regression to predict house prices

As part of my exams, I built a model to predict house prices for a real estate company, so that its customers can plan a purchase around a predicted price range. I used two variables to make the prediction: the average number of rooms per property and the weighted distance of each property from five employment hubs in Cape Town, South Africa.
The steps I took were:

- install the Python packages and examine the data
- set the variables, fit the model, and call the predictions for X
- check the value of R-squared, the intercept, and the coefficients
- make some predictions
- split the data into train and test subsets for multiple linear regression (MLR)
- run a regression test
- check for multicollinearity
- evaluate the model.

## 1. Import the libraries

In [1]:
import statsmodels.api as sm
from sklearn import datasets 
import numpy as np
from sklearn import linear_model
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
import sklearn
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Suppress warnings, which flag situations that aren’t necessarily exceptions.
import warnings  
warnings.filterwarnings('ignore')  

## 2. Import the data set

In [3]:
# Load the CSV file (house_prices.csv).
hp = pd.read_csv('house_prices.csv')  

# Print the first five rows of the DataFrame.
hp.head() 

   Rooms  Distance  Value
0  6.575    4.09     24.0
1  6.421    4.9671   21.6
2  7.185    4.9671   34.7
3  6.998    6.0622   33.4
4  7.147    6.0622   36.2


In [4]:
# View the DataFrame's metadata (columns, non-null counts, and data types).
hp.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Rooms     506 non-null    float64
 1   Distance  506 non-null    float64
 2   Value     506 non-null    float64
dtypes: float64(3)
memory usage: 12.0 KB


## 3. Set the variables, fit the model, and call the predictions for X

In [5]:
# Define the dependent variable.
y = hp['Value']  

# Define the independent variables.
X = hp[['Rooms', 'Distance']] 

In [6]:
# Fit the regression model.
mlr = linear_model.LinearRegression()
mlr.fit(X, y) 

LinearRegression()

In [8]:
# Call the predictions for X (array).
mlr.predict(X) 

array([25.23262311, 24.30597474, 31.03025338, 29.9197274 , 31.23113776,
       24.92052548, 20.99628003, 22.59515685, 17.89792552, 21.43016488,
       24.59312806, 21.29554669, 19.86012857, 20.02480328, 21.19854962,
       18.91052046, 19.79946305, 20.16587486, 15.24036623, 17.62554884,
       16.24441157, 19.82577837, 21.36632302, 18.52848931, 19.65425152,
       16.82067934, 18.81534563, 20.76312523, 24.70679323, 26.17680132,
       17.71571146, 20.84706509, 19.68285587, 17.39216584, 20.85532906,
       19.22540394, 18.42427779, 18.77543693, 19.75391977, 26.04958067,
       29.82538634, 27.7461615 , 22.45651299, 22.82617229, 21.57637181,
       17.86689491, 18.78224174, 21.21771802, 15.7523132 , 17.64542212,
       21.17812468, 22.51593928, 26.00129836, 21.48617409, 20.7648873 ,
       33.41670435, 26.03470634, 29.42393915, 23.26887906, 20.91861579,
       19.42498135, 21.20638654, 25.71803969, 28.7805479 , 32.39778062,
       23.95685233, 19.52974218, 20.27518634, 17.77558538, 20.33

## 4. Check the value of R-squared, the intercept, and the coefficients

In [9]:
# Print the R-squared value.
print("R-squared: ", mlr.score(X,y))  

# Print the intercept.
print("Intercept: ", mlr.intercept_) 

# Print the coefficients.
print("Coefficients:")  

# Pair each column name in X with its coefficient.
list(zip(X, mlr.coef_))  

R-squared:  0.4955246476058477
Intercept:  -34.636050175473315
Coefficients:


[('Rooms', 8.801411828632594), ('Distance', 0.48884853656712307)]

### R-squared analysis

About 49.6% of the variation in house prices is explained by the two variables in this model: the number of rooms and the distance from the employment hubs. The intercept is the predicted value of y when all of the independent variables are 0. It is not always analytically relevant; here it is negative (about -34.64), which has no practical meaning for a property.

Holding distance constant, a one-unit increase in rooms raises the predicted value by about 8.8 units. Holding rooms constant, a one-unit increase in distance raises the predicted value by about 0.49 units, because the Distance coefficient is positive. Each coefficient measures how sensitive the dependent variable is to a change in that independent variable.

Adjusted R-squared is useful for comparing this model with more complicated models, which don't necessarily explain more of the behaviour.
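
To make the adjusted R-squared remark concrete, here is a minimal sketch of the standard formula, applied to the R-squared printed above. The function name `adjusted_r_squared` is my own and is not part of scikit-learn; the inputs are the 506 rows and two predictors used in this notebook.

```python
def adjusted_r_squared(r_squared, n_obs, n_predictors):
    """Penalise R-squared for the number of predictors in the model."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_predictors - 1)


# R-squared from mlr.score(X, y): 506 observations, 2 predictors.
adj_full = adjusted_r_squared(0.4955246476058477, n_obs=506, n_predictors=2)
print(f"Adjusted R-squared (all rows): {adj_full:.4f}")
```

Because the penalty grows with the number of predictors, a larger model only scores higher on this measure if the extra variables explain enough additional variation.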

## 5. Make predictions

In [10]:
# Create a variable 'New_Rooms' and define it as 5.75.
New_Rooms = 5.75

# Create 'New_Distance' and define it as 15.2.
New_Distance = 15.2  

# Print the predicted value. 
print ("Predicted Value: \n", mlr.predict([[New_Rooms ,New_Distance]]))  

Predicted Value: 
 [23.40256559]


In [11]:
# Create a variable 'New_Rooms' and define it as 6.75.
New_Rooms = 6.75

# Create 'New_Distance' and define it as 15.2.
New_Distance = 15.2  

# Print the predicted value. 
print ("Predicted Value: \n", mlr.predict([[New_Rooms ,New_Distance]]))  

Predicted Value: 
 [32.20397742]
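
The two predictions can be checked by hand, since a linear regression prediction is just the intercept plus each coefficient multiplied by its input. The sketch below hard-codes the intercept and coefficients printed in section 4; the helper name `predict_value` is my own.

```python
# Intercept and coefficients copied from the output in section 4.
intercept = -34.636050175473315
coef_rooms = 8.801411828632594
coef_distance = 0.48884853656712307


def predict_value(rooms, distance):
    """Apply the fitted equation: intercept + b1 * rooms + b2 * distance."""
    return intercept + coef_rooms * rooms + coef_distance * distance


first = predict_value(5.75, 15.2)
second = predict_value(6.75, 15.2)

print("Prediction for 5.75 rooms:", first)
print("Prediction for 6.75 rooms:", second)

# One extra room at the same distance changes the prediction by the Rooms coefficient.
print("Difference:", second - first)
```

The difference between the two predictions equals the Rooms coefficient, which is what "holding distance constant" means in practice.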


## 6. Train and test subsets with multiple linear regression (MLR)

In [12]:
# Split the data in 'train' (80%) and 'test' (20%) sets.
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, y,
                                                                            test_size = 0.20,
                                                                            random_state = 5)
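
As a rough check on the split, the sketch below works out how many rows land in each subset. It assumes the test share is rounded up to a whole row and the training set takes the remainder, which is how I understand scikit-learn treats a fractional `test_size`; the helper `split_sizes` is my own and not a library function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test), assuming the test share is rounded up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test


# 506 rows in the data set, 20% held back for testing.
n_train, n_test = split_sizes(506, 0.20)
print("Training rows:", n_train)
print("Test rows:", n_test)
```

The training count can be compared with the `No. Observations` figure in the OLS summary.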

In [13]:
# Train the model using the 'statsmodels' OLS class.
# Fit the model with the added constant.
model = sm.OLS(Y_train, sm.add_constant(X_train)).fit()

# Set the predicted response vector.
Y_pred = model.predict(sm.add_constant(X_test)) 

# Call a summary of the model.
print_model = model.summary()

# Print the summary.
print(print_model)  

                            OLS Regression Results                            
Dep. Variable:                  Value   R-squared:                       0.449
Model:                            OLS   Adj. R-squared:                  0.446
Method:                 Least Squares   F-statistic:                     163.3
Date:                Sat, 05 Nov 2022   Prob (F-statistic):           1.33e-52
Time:                        16:06:03   Log-Likelihood:                -1352.5
No. Observations:                 404   AIC:                             2711.
Df Residuals:                     401   BIC:                             2723.
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -32.8597      3.141    -10.462      0.0

## 7. Run a regression test

> First train the model and then test the model.

In [14]:
# Specify the model.
mlr = LinearRegression()  

# Fit the model. We can only fit the model with the training data set.
mlr.fit(X_train, Y_train)  

LinearRegression()

In [15]:
# Call the predictions for X in the training set.
y_pred_mlr = mlr.predict(X_train)  

# Print the predictions.
print("Prediction for training set: {}".format(y_pred_mlr)) 

Prediction for training set: [17.63626948 37.66571679 18.69656171 20.35132516 22.39778381 10.05828849
 22.93899112 10.74873225 27.16871507 19.94730086 23.01486764 31.53962547
 24.92883571 19.94963594 16.36537912 21.49365062 21.77879406 19.05491856
 20.47609686 31.83685414 24.30035564 19.27743482 21.85259968 26.27861126
 21.48519624 19.25449933 19.76896032 37.12586674 34.48515006 23.10082167
 17.55275278 20.26766924 23.52707578 21.25246102 14.55081093  7.43546483
 25.96897982 27.95249473 19.83445037 20.31918341 15.5626069  19.85303113
 32.95665327 17.62371179 25.22198181 19.90200838 23.43654839 25.36457553
 17.77296547 27.34610874 23.0599489  20.77358438 26.28203607 33.52195834
 14.43876093 26.32793967 29.91619053 29.72143049 25.70873954 30.61253761
 16.24067703 29.20361136 22.69074152 18.82055141 19.08794386 24.57907765
 28.51315612 -1.74008454 25.03097868 22.68018779 16.08063529 20.16361279
 26.04840862 10.57025458 19.50502784 27.68868803 22.68918536 19.50019423
 29.12005629 34.92971235 1

> Test the model

In [16]:
# Call the predictions for X in the test set.
y_pred_mlr = mlr.predict(X_test)  

# Print the predictions.
print("Prediction for test set: {}".format(y_pred_mlr))  

Prediction for test set: [37.16294771 26.26390003 23.32233629  9.42654469 34.84637649 13.06883011
 27.54742048 26.25779081 25.61852352 23.38233874 32.02478151 20.6203452
 19.55213869 30.79259805 24.81138243 21.67579149  7.2803958  13.88603048
 14.44348739 16.78124002 11.3765575  22.51194954 38.90058634 23.27198665
 31.37745427 17.87469188 23.78577475 21.07447304 24.35126682 28.96509876
 17.9190036  14.06265518 18.09111236 31.24553727 25.32850094 22.0900311
 25.72964865 18.23891395 39.68116278 29.76640908 20.69268094 10.53395631
 25.12859787 19.29367293 26.04930156 29.82778882  6.41394588 18.88978814
 19.99319269 20.43100524 20.31822044 21.79013133 23.00092875 20.97257056
 16.49172001 26.44887238 35.45280085 23.87091819 27.14389189 21.55561781
 19.68025144 20.21289567 16.44845796 27.10481774 20.65455239 13.48164705
 24.74749771 23.04711977 20.68288625 19.01153213 23.4499095  20.56595991
 18.08514336 26.1019444  15.44534286 31.02285181 18.34999384 15.48675139
 30.47602818 19.152676   21.

In [17]:
# Print the R-squared value for the test set as a percentage.
print(mlr.score(X_test, Y_test)*100)  

69.27370456571789
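
For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below applies that definition to the first five rows only, using the actual values from `hp.head()` and the first five predictions from section 3. It is an illustration of the formula, not a reproduction of the test-set score above.

```python
# First five actual values (hp.head()) and first five predictions (section 3).
actual = [24.0, 21.6, 34.7, 33.4, 36.2]
predicted = [25.23262311, 24.30597474, 31.03025338, 29.9197274, 31.23113776]

mean_actual = sum(actual) / len(actual)

# Residual sum of squares: variation the model leaves unexplained.
ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))

# Total sum of squares: variation around the mean of the actual values.
ss_tot = sum((a - mean_actual) ** 2 for a in actual)

r_squared = 1 - ss_res / ss_tot
print("R-squared on the five sample rows:", r_squared)
```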


## 8. Check for multicollinearity with Python

In [None]:
# Add a constant.
x_temp = sm.add_constant(X_train)  

# Create an empty DataFrame. 
vif = pd.DataFrame() 

# Calculate the VIF for each column.
vif["VIF Factor"] = [variance_inflation_factor(x_temp.values, 
                                               i) for i in range(x_temp.values.shape[1])]  


# Create the feature columns.
vif['features'] = x_temp.columns  

# Print the values to two decimal points.
print(vif.round(2))  
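
With only two predictors, the VIF has a simple closed form: regressing one predictor on the other gives an R-squared equal to their squared correlation, so both predictors share the value 1 / (1 - r²). The sketch below shows this on the five rows from `hp.head()` purely as an illustration; five rows are far too few to judge multicollinearity in the full data set, and `pearson_r` is my own helper.

```python
import math

# First five rows of the two predictors, taken from hp.head().
rooms = [6.575, 6.421, 7.185, 6.998, 7.147]
distance = [4.09, 4.9671, 4.9671, 6.0622, 6.0622]


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


r = pearson_r(rooms, distance)

# With two predictors, each VIF is 1 / (1 - r^2).
vif_two_predictors = 1 / (1 - r ** 2)
print("Correlation:", r)
print("VIF for either predictor:", vif_two_predictors)
```

A VIF of 1 means the predictors are uncorrelated; the value grows as the correlation between them strengthens. The VIF reported for the `const` column in the cell above is not usually interpreted.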

## 9. Evaluate the model

In [None]:
# Call the ‘metrics.mean_absolute_error’ function on the test-set predictions (Y_pred from the OLS model in section 6).
print('Mean Absolute Error (Final):', metrics.mean_absolute_error(Y_test, Y_pred))  

# Call the ‘metrics.mean_squared_error’ function.
print('Mean Square Error (Final):', metrics.mean_squared_error(Y_test, Y_pred))  
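
To show what these two metrics measure, here is a minimal sketch that computes them by hand. It uses the first five actual values from `hp.head()` and the first five predictions from section 3, so the numbers illustrate the arithmetic only and are not the test-set results. The root mean squared error is my own addition, included because it is in the same units as the house values.

```python
import math

# First five actual values (hp.head()) and first five predictions (section 3).
actual = [24.0, 21.6, 34.7, 33.4, 36.2]
predicted = [25.23262311, 24.30597474, 31.03025338, 29.9197274, 31.23113776]

errors = [a - p for a, p in zip(actual, predicted)]

# Mean absolute error: average size of the errors, ignoring their sign.
mae = sum(abs(e) for e in errors) / len(errors)

# Mean squared error: average of the squared errors, so large misses weigh more.
mse = sum(e ** 2 for e in errors) / len(errors)

# Root mean squared error: square root of the MSE, back in the original units.
rmse = math.sqrt(mse)

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```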

## 10. Conclusion

> Add your own notes here.