# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 
 

You are expected to create a multiple linear regression using the data. Standardize the data.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings


## Import the relevant libraries

In [152]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression  # Ordinary least squares linear regression
from sklearn.preprocessing import StandardScaler  # Standardizes features (independent variables) by removing the mean and scaling to unit variance
from sklearn.feature_selection import f_regression  # Univariate linear regression tests (F-statistics and p-values)

## Load the data

In [153]:
pwd #Check directory

'/Users/gabriel/Downloads'

In [154]:
data = pd.read_csv('real_estate_price_size_year.csv')
data.head()

| | price | size | year |
|---|---|---|---|
| 0 | 234314.144 | 643.09 | 2015 |
| 1 | 228581.528 | 656.22 | 2009 |
| 2 | 281626.336 | 487.29 | 2018 |
| 3 | 401255.608 | 1504.75 | 2015 |
| 4 | 458674.256 | 1275.46 | 2009 |


In [155]:
data.describe()

| | price | size | year |
|---|---|---|---|
| count | 100.0 | 100.0 | 100.0 |
| mean | 292289.47016 | 853.0242 | 2012.6 |
| std | 77051.727525 | 297.941951 | 4.729021 |
| min | 154282.128 | 479.75 | 2006.0 |
| 25% | 234280.148 | 643.33 | 2009.0 |
| 50% | 280590.716 | 696.405 | 2015.0 |
| 75% | 335723.696 | 1029.3225 | 2018.0 |
| max | 500681.128 | 1842.51 | 2018.0 |


## Linear Regression

In [156]:
x = data[['size','year']]
y = data['price']
x.shape #100 observations, 2 variables (size and year)

(100, 2)

### Inputs scaling

In [157]:
scaler = StandardScaler().fit(x)  # .fit computes the mean and std of each feature for later scaling
x_scaled = scaler.transform(x)  # .transform standardizes the data by centering and scaling with those values
x_scaled


array([[-0.70816415,  0.51006137],
       [-0.66387316, -0.76509206],
       [-1.23371919,  1.14763808],
       [ 2.19844528,  0.51006137],
       [ 1.42498884, -0.76509206],
       [-0.937209  , -1.40266877],
       [-0.95171405,  0.51006137],
       [-0.78328682, -1.40266877],
       [-0.57603328,  1.14763808],
       [-0.53467702, -0.76509206],
       [ 0.69939906, -0.76509206],
       [ 3.33780001, -0.76509206],
       [-0.53467702,  0.51006137],
       [ 0.52699137,  1.14763808],
       [ 1.51100715, -1.40266877],
       [ 1.77668568, -1.40266877],
       [-0.54810263,  1.14763808],
       [-0.77276222, -1.40266877],
       [-0.58004747, -1.40266877],
       [ 0.58943055,  1.14763808],
       [-0.78365788,  0.51006137],
       [-1.02322731,  0.51006137],
       [ 1.19557293,  0.51006137],
       [-1.12884431,  0.51006137],
       [-1.10378093, -0.76509206],
       [ 0.84424715,  1.14763808],
       [-0.95171405,  1.14763808],
       [ 1.62279723,  0.51006137],
       [-0.58004747,
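
As a quick sanity check on what `StandardScaler` does, the sketch below standardizes a small array by hand and compares it with the scaler's output. The five rows are the ones shown by `data.head()` above; the names `x_demo`, `demo_scaler` and `x_demo_manual` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# First five (size, year) rows shown by data.head(); used only as a toy input
x_demo = np.array([[643.09, 2015.0],
                   [656.22, 2009.0],
                   [487.29, 2018.0],
                   [1504.75, 2015.0],
                   [1275.46, 2009.0]])

demo_scaler = StandardScaler().fit(x_demo)
x_demo_scaled = demo_scaler.transform(x_demo)

# Manual standardization: subtract the column mean and divide by the
# population standard deviation (ddof=0), which is what StandardScaler uses
x_demo_manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

same_result = np.allclose(x_demo_scaled, x_demo_manual)
print(same_result)
```

Because the scaler uses the population standard deviation, the result differs slightly from `pandas.DataFrame.std()`, which defaults to the sample standard deviation (`ddof=1`).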

### Regression

In [158]:
reg = LinearRegression()
reg.fit(x_scaled,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

### Intercept

In [159]:
reg.intercept_ #bias

292289.4701599997

### Coefficients

In [160]:
reg.coef_  #weights of size and year respectively

array([67501.57614152, 13724.39708231])
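
These weights are expressed per standard deviation of each feature, not per square foot or per year. The sketch below shows one way to convert them back to original units, by dividing each coefficient by the scaler's `scale_` and adjusting the intercept. It runs on synthetic data generated for illustration only (the real CSV is not loaded here), and all variable names are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real estate data (NOT the actual dataset)
rng = np.random.default_rng(0)
size = rng.uniform(450, 1850, 100)
year = rng.choice([2006, 2009, 2012, 2015, 2018], 100).astype(float)
X_demo = np.column_stack([size, year])
y_demo = 100_000 + 220 * size + 2_500 * (year - 2006) + rng.normal(0, 20_000, 100)

# Fit on standardized inputs, as in the notebook
sc = StandardScaler().fit(X_demo)
model_scaled = LinearRegression().fit(sc.transform(X_demo), y_demo)

# Convert the standardized weights back to original units
coef_original = model_scaled.coef_ / sc.scale_
intercept_original = model_scaled.intercept_ - np.sum(coef_original * sc.mean_)

# Reference: the same regression fitted directly on the unscaled inputs
model_raw = LinearRegression().fit(X_demo, y_demo)
print(coef_original, model_raw.coef_)
```

With standardized inputs the intercept equals the mean of the target, which is why the intercept above matches the mean price reported by `data.describe()`.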

### R-squared

In [161]:
r_squared = reg.score(x_scaled,y)
r_squared

0.7764803683276793

### Adjusted R-squared

$\bar{R}^2 = 1-(1-R^2)\frac{n-1}{n-p-1}$, where $n$ is the number of observations and $p$ is the number of features.

In [162]:
def r_adjusted(x, y):  # Adjusted R-squared of the fitted model `reg` (multiple features)
    r2 = reg.score(x, y)
    n = x.shape[0]  # Number of observations (100 here)
    p = x.shape[1]  # Number of features (2 here)
    r_adj2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # Adjusted R-squared formula
    return r_adj2

In [163]:
r_adjusted(x_scaled,y)

0.77187171612825

### R-squared and Adjusted R-squared Comparison

R-squared = 0.7765 and adjusted R-squared = 0.7719.
The difference is only about 0.0046, so the penalty for using two features is small relative to the 100 observations.
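
The exercise also asks for a comparison with the simple linear regression that uses only 'size'. A possible way to set that up is sketched below. It uses synthetic data as a stand-in for the CSV, so the printed numbers are not the results for the real dataset, and every name in it is an assumption.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real estate data (NOT the actual dataset)
rng = np.random.default_rng(42)
n_obs = 100
size_demo = rng.uniform(450, 1850, n_obs)
year_demo = rng.choice([2006, 2009, 2012, 2015, 2018], n_obs).astype(float)
price_demo = (100_000 + 220 * size_demo + 2_500 * (year_demo - 2006)
              + rng.normal(0, 30_000, n_obs))

X_both = StandardScaler().fit_transform(np.column_stack([size_demo, year_demo]))
X_size = X_both[:, [0]]  # Keep only the standardized 'size' column

def adj_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Simple regression (size only) versus multiple regression (size and year)
r2_simple = LinearRegression().fit(X_size, price_demo).score(X_size, price_demo)
r2_multi = LinearRegression().fit(X_both, price_demo).score(X_both, price_demo)

adj_simple = adj_r2(r2_simple, n_obs, 1)
adj_multi = adj_r2(r2_multi, n_obs, 2)
print(r2_simple, r2_multi, adj_simple, adj_multi)
```

On the training data, R-squared can never decrease when a feature is added, so the adjusted values are the fairer basis for deciding whether 'year' earns its place in the model.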

### Predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [164]:
new_data = [[750,2009]]
new_data_scaled = scaler.transform(new_data)
reg.predict(new_data_scaled).round(2)

array([258330.34])
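
New observations must be transformed with the scaler fitted on the training data, as done above. One way to make that step hard to forget is to bundle the scaler and the regression into a scikit-learn pipeline. The sketch below is an alternative to the notebook's approach, not a replacement for it, and uses synthetic training data with illustrative names.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real estate data (NOT the actual dataset)
rng = np.random.default_rng(1)
X_train = np.column_stack([
    rng.uniform(450, 1850, 100),
    rng.choice([2006, 2009, 2012, 2015, 2018], 100).astype(float),
])
y_train = (100_000 + 220 * X_train[:, 0] + 2_500 * (X_train[:, 1] - 2006)
           + rng.normal(0, 20_000, 100))

# The pipeline scales the inputs and fits the regression in one object
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

new_apartment = np.array([[750.0, 2009.0]])  # 750 sq.ft., built in 2009
pred_pipe = pipe.predict(new_apartment)

# Equivalent manual route, mirroring the notebook's two-step approach
manual_scaler = StandardScaler().fit(X_train)
manual_reg = LinearRegression().fit(manual_scaler.transform(X_train), y_train)
pred_manual = manual_reg.predict(manual_scaler.transform(new_apartment))
print(pred_pipe, pred_manual)
```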

### Univariate p-values of the variables

In [165]:
f_regression(x,y) #Provides 2 arrays, 1st with F-statistics and 2nd with p-values

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [166]:
f_regression(x,y)[0] #F-statistics

array([285.92105192,   0.85525799])

In [167]:
p_values = f_regression(x,y)[1] 
p_values.round(3)

array([0.   , 0.357])
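
To see where these numbers come from: for each feature, `f_regression` computes the correlation $r$ with the target, converts it to $F = \frac{r^2}{1-r^2}(n-2)$, and reads the p-value from an F distribution with 1 and $n-2$ degrees of freedom. The sketch below reproduces that by hand on synthetic data (not the real dataset); the variable names are illustrative.

```python
import numpy as np
from scipy import stats
from sklearn.feature_selection import f_regression

# Synthetic stand-in for the real estate data (NOT the actual dataset)
rng = np.random.default_rng(7)
n_obs = 100
X_demo = np.column_stack([
    rng.uniform(450, 1850, n_obs),
    rng.choice([2006, 2009, 2012, 2015, 2018], n_obs).astype(float),
])
y_demo = (100_000 + 220 * X_demo[:, 0] + 2_500 * (X_demo[:, 1] - 2006)
          + rng.normal(0, 30_000, n_obs))

f_sklearn, p_sklearn = f_regression(X_demo, y_demo)

# Manual version: one Pearson correlation per feature, then the F-statistic
dof = n_obs - 2
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1]
              for j in range(X_demo.shape[1])])
f_manual = r**2 / (1 - r**2) * dof
p_manual = stats.f.sf(f_manual, 1, dof)  # Upper-tail probability of F(1, dof)
print(p_sklearn, p_manual)
```

Since the test depends only on the correlation, it gives the same p-values for the raw and the standardized features.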

### Summary table

In [168]:
regression_summary = pd.DataFrame(data=x.columns, columns=['Features'])  # One row per feature in x (size and year)
regression_summary['Coefficients'] = reg.coef_.round(2)  # Add the coefficients (weights)
regression_summary['p-values'] = p_values.round(3)  # Add the univariate p-values
regression_summary

| | Features | Coefficients | p-values |
|---|---|---|---|
| 0 | size | 67501.58 | 0.0 |
| 1 | year | 13724.4 | 0.357 |


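With a univariate p-value of 0.357, well above the usual 0.05 threshold, 'year' is not statistically significant on its own and is a candidate for removal from the model. 'size', whose p-value rounds to 0.000, is highly significant. Keep in mind that f_regression tests each feature separately, so these p-values do not account for the other variable.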