# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is one of those examples that every regression course goes through, as it is extremely easy to understand and there is almost always a certain causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Load the data

In [2]:
data=pd.read_csv("real_estate_price_size_year.csv")

In [3]:
data.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


## Create the regression

In [4]:
from sklearn.linear_model import LinearRegression

### Declare the dependent and the independent variables

In [44]:
X=data[["size","year"]]
Y=data["price"]

### Scale the inputs

In [45]:
from sklearn.preprocessing import StandardScaler

In [46]:
scaler=StandardScaler()

In [47]:
scaler.fit(X)

StandardScaler()

In [48]:
X_scaled=scaler.transform(X)

In [49]:
X_scaled

array([[-0.70816415,  0.51006137],
       [-0.66387316, -0.76509206],
       [-1.23371919,  1.14763808],
       [ 2.19844528,  0.51006137],
       [ 1.42498884, -0.76509206],
       [-0.937209  , -1.40266877],
       [-0.95171405,  0.51006137],
       [-0.78328682, -1.40266877],
       [-0.57603328,  1.14763808],
       [-0.53467702, -0.76509206],
       [ 0.69939906, -0.76509206],
       [ 3.33780001, -0.76509206],
       [-0.53467702,  0.51006137],
       [ 0.52699137,  1.14763808],
       [ 1.51100715, -1.40266877],
       [ 1.77668568, -1.40266877],
       [-0.54810263,  1.14763808],
       [-0.77276222, -1.40266877],
       [-0.58004747, -1.40266877],
       [ 0.58943055,  1.14763808],
       [-0.78365788,  0.51006137],
       [-1.02322731,  0.51006137],
       [ 1.19557293,  0.51006137],
       [-1.12884431,  0.51006137],
       [-1.10378093, -0.76509206],
       [ 0.84424715,  1.14763808],
       [-0.95171405,  1.14763808],
       [ 1.62279723,  0.51006137],
       [-0.58004747,
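
As a rough illustration of what the scaler does, the sketch below standardizes a small made-up array (the values in `X_toy` are hypothetical, not the exercise data) and checks that `StandardScaler` matches the manual formula: subtract the column mean and divide by the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy inputs (size, year); NOT the exercise data.
X_toy = np.array([[600.0, 2006.0],
                  [750.0, 2009.0],
                  [900.0, 2015.0],
                  [1200.0, 2018.0]])

scaler_toy = StandardScaler().fit(X_toy)
X_toy_scaled = scaler_toy.transform(X_toy)

# Manual standardization: (x - mean) / std, using the population
# standard deviation (ddof=0), which is NumPy's default.
X_toy_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_toy_scaled, X_toy_manual))
# Each standardized column has mean 0 and standard deviation 1.
print(X_toy_scaled.mean(axis=0).round(6), X_toy_scaled.std(axis=0).round(6))
```

After scaling, both features are on the same scale, which is what makes the regression coefficients comparable to each other.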

### Regression

In [50]:
reg=LinearRegression()

In [51]:
reg.fit(X_scaled,Y)

LinearRegression()

In [52]:
reg.score(X_scaled,Y)

0.7764803683276793

### Find the intercept

In [53]:
reg.intercept_

292289.4701599997

### Find the coefficients

In [54]:
reg.coef_

array([67501.57614152, 13724.39708231])
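
Because the inputs were standardized, each coefficient is the change in price per one standard deviation of the feature, and the intercept equals the mean of the target. The sketch below, on synthetic data (all names and numbers are assumptions for illustration), shows one way to convert such coefficients back to original units using the scaler's `mean_` and `scale_` attributes.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only (not the exercise data).
rng = np.random.default_rng(0)
size_demo = rng.uniform(400, 1500, 50)
year_demo = rng.choice([2006, 2009, 2015, 2018], 50).astype(float)
X_demo = np.column_stack([size_demo, year_demo])
y_demo = 200.0 * size_demo + 3000.0 * year_demo + rng.normal(0, 1e4, 50)

scaler_demo = StandardScaler().fit(X_demo)
reg_demo = LinearRegression().fit(scaler_demo.transform(X_demo), y_demo)

# Per-unit coefficients: divide by each feature's standard deviation.
coef_original = reg_demo.coef_ / scaler_demo.scale_
# Intercept in original units: undo the mean-centering of the inputs.
intercept_original = reg_demo.intercept_ - np.sum(coef_original * scaler_demo.mean_)

# Cross-check against a regression fitted on the unscaled inputs.
reg_raw = LinearRegression().fit(X_demo, y_demo)
print(np.allclose(coef_original, reg_raw.coef_))
print(np.isclose(intercept_original, reg_raw.intercept_))
```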

### Calculate the R-squared

In [55]:
R2=reg.score(X_scaled,Y)

In [56]:
R2

0.7764803683276793
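
For reference, `score` on a `LinearRegression` returns the coefficient of determination, 1 - SS_res / SS_tot. A minimal sketch on synthetic data (the arrays below are made up for illustration) reproduces it by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration only.
rng = np.random.default_rng(1)
X_r2 = rng.normal(size=(40, 2))
y_r2 = 3.0 * X_r2[:, 0] + 0.5 * X_r2[:, 1] + rng.normal(size=40)

model_r2 = LinearRegression().fit(X_r2, y_r2)

# Residual sum of squares and total sum of squares.
ss_res = np.sum((y_r2 - model_r2.predict(X_r2)) ** 2)
ss_tot = np.sum((y_r2 - y_r2.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, model_r2.score(X_r2, y_r2)))
```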

### Calculate the Adjusted R-squared

In [57]:
n=X_scaled.shape[0]

In [58]:
n

100

In [59]:
p=X_scaled.shape[1]

In [60]:
p

2

In [61]:
# Adjusted R-squared: the denominator is n-p-1 (n observations, p predictors)
adj_R2=1-((1-R2)*(n-1)/(n-p-1))

In [62]:
adj_R2

0.77187171612825
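
As a cross-check, the standard definition of the adjusted R-squared, 1 - (1 - R²)(n - 1)/(n - p - 1), can be wrapped in a small helper. The function name `adjusted_r2` is illustrative; the inputs are the R-squared, n, and p reported in this notebook.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared, n and p as reported above for the two-feature model.
print(round(adjusted_r2(0.7764803683276793, 100, 2), 4))  # 0.7719
```

With p = 0 the helper returns the R-squared unchanged, and the penalty grows as more predictors are added for a fixed n.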

## Making a prediction

In [63]:
# The exercise asks about a 750 sq.ft. apartment from 2009
new_data=[[750,2009]]
new_data_scaled=scaler.transform(new_data)

In [64]:
new_data_scaled

array([[-0.34752816, -0.76509206]])

In [65]:
reg.predict(new_data_scaled)

array([258330.34465995])
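
The scale-then-predict steps can also be bundled so that new observations are scaled automatically. The following is a sketch using scikit-learn's `make_pipeline` on synthetic training data (the arrays and numbers are assumptions, not the exercise data); it checks that the pipeline gives the same prediction as scaling by hand.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic training data for illustration only.
rng = np.random.default_rng(2)
X_train = np.column_stack([
    rng.uniform(400, 1500, 60),
    rng.choice([2006, 2009, 2015, 2018], 60),
]).astype(float)
y_train = 220.0 * X_train[:, 0] + 2500.0 * X_train[:, 1] + rng.normal(0, 1e4, 60)

# The pipeline scales the inputs and then fits the regression.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_train, y_train)

# New observation in ORIGINAL units: 750 sq.ft., year 2009.
apartment = np.array([[750.0, 2009.0]])
pred_pipe = pipe.predict(apartment)

# Same prediction with the manual two-step approach.
scaler_manual = StandardScaler().fit(X_train)
reg_manual = LinearRegression().fit(scaler_manual.transform(X_train), y_train)
pred_manual = reg_manual.predict(scaler_manual.transform(apartment))

print(np.allclose(pred_pipe, pred_manual))
```

The main benefit of this design is that it becomes hard to forget to scale new data, or to scale it with a different scaler than the one used for training.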

### Compare the R-squared and the Adjusted R-squared

It seems the R-squared is only slightly larger than the Adjusted R-squared, implying that we were not penalized much for including 2 independent variables.

## Let's create a simple linear regression for the given data

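In [28]:
# Use only the (already standardized) 'size' column, kept as a 2D array
X_simple=X_scaled[:,0].reshape(-1,1)

In [29]:
# Separate scaler and model objects, so the two-feature scaler and reg stay intact
scaler_simple=StandardScaler()
scaler_simple.fit(X_simple)

StandardScaler()

In [30]:
X_simple_matrix=scaler_simple.transform(X_simple)

In [32]:
reg_simple=LinearRegression()
reg_simple.fit(X_simple_matrix,Y)

LinearRegression()

In [33]:
reg_simple.score(X_simple_matrix,Y)

0.7447391865847587

In [34]:
reg_simple.coef_

array([66161.00300583])

In [35]:
reg_simple.intercept_

292289.47015999997

In [36]:
# Adjusted R-squared with a single predictor (p=1)
n_simple=X_simple_matrix.shape[0]
adj_r2=1-(1-reg_simple.score(X_simple_matrix,Y))*(n_simple-1)/(n_simple-1-1)

In [62]:
round(adj_r2,4)

0.7421

In [39]:
new_simple_data=new_data_scaled[:,0].reshape(-1,1)
reg_simple.predict(new_simple_data)

array([269296.65874718])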

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

Comparing the Adjusted R-squared of this model with the R-squared of the simple linear regression (about 0.745, when only 'size' was used - a couple of lectures ago), we see only a small improvement, so 'year' adds little explanatory value.

### Calculate the univariate p-values of the variables

In [66]:
from sklearn.feature_selection import f_regression

In [70]:
f_regression(X_scaled,Y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [72]:
p_values=f_regression(X_scaled,Y)[1]

In [76]:
p_values=p_values.round(3)
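
Note that `f_regression` tests each feature on its own: for every column it computes an F-statistic from the correlation between that column and the target. The sketch below, on synthetic data (all names and values are illustrative assumptions), shows that the resulting p-values match those of a Pearson correlation test from SciPy.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import f_regression

# Synthetic data: one informative feature and one pure-noise feature.
rng = np.random.default_rng(3)
x_strong = rng.normal(size=80)
x_weak = rng.normal(size=80)
y_f = 2.0 * x_strong + rng.normal(size=80)
X_f = np.column_stack([x_strong, x_weak])

f_stats, p_univariate = f_regression(X_f, y_f)

# Same quantities from the Pearson correlation of each column with y.
n_obs = X_f.shape[0]
r_values, p_corr = [], []
for j in range(X_f.shape[1]):
    r, p = pearsonr(X_f[:, j], y_f)
    r_values.append(r)
    p_corr.append(p)
r_values = np.array(r_values)

# F = r^2 / (1 - r^2) * (n - 2) for a single centered predictor.
f_manual = r_values ** 2 / (1 - r_values ** 2) * (n_obs - 2)

print(np.allclose(f_stats, f_manual))
print(np.allclose(p_univariate, p_corr))
```

Since these tests ignore the other features, a variable can look insignificant here and still matter in a multivariate model, or the other way around.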

### Create a summary table with your findings

In [81]:
reg_summary=pd.DataFrame(data=X.columns.values,columns=["features"])

In [82]:
reg_summary

| | features |
|---|---|
| 0 | size |
| 1 | year |


In [86]:
reg_summary["Coefficients"]=reg.coef_
reg_summary["p_values"]=p_values

In [87]:
reg_summary

| | features | Coefficients | p_values |
|---|---|---|---|
| 0 | size | 67501.576142 | 0.0 |
| 1 | year | 13724.397082 | 0.357 |


Answer...