
# Feature scaling with sklearn - Exercise

You are given a real estate dataset.

Real estate is an example that nearly every regression course covers, because it is easy to understand and there is almost always a causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'.

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [2]:
import google.colab
google.colab.files.upload()

Saving real_estate_price_size_year.csv to real_estate_price_size_year.csv



In [4]:
data = pd.read_csv('real_estate_price_size_year.csv')
data

Unnamed: 0,price,size,year
0,234314.144,643.09,2015
1,228581.528,656.22,2009
2,281626.336,487.29,2018
3,401255.608,1504.75,2015
4,458674.256,1275.46,2009
...,...,...,...
95,252460.400,549.80,2009
96,310522.592,1037.44,2009
97,383635.568,1504.75,2006
98,225145.248,648.29,2015


In [6]:
data.describe()

Unnamed: 0,price,size,year
count,100.0,100.0,100.0
mean,292289.47016,853.0242,2012.6
std,77051.727525,297.941951,4.729021
min,154282.128,479.75,2006.0
25%,234280.148,643.33,2009.0
50%,280590.716,696.405,2015.0
75%,335723.696,1029.3225,2018.0
max,500681.128,1842.51,2018.0


## Create the regression

### Declare the dependent and the independent variables

In [8]:
y = data['price']
x = data[['size', 'year']]

### Scale the inputs

In [9]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x)
x_scaled = scaler.transform(x)
x_scaled

array([[-0.70816415,  0.51006137],
       [-0.66387316, -0.76509206],
       [-1.23371919,  1.14763808],
       [ 2.19844528,  0.51006137],
       [ 1.42498884, -0.76509206],
       [-0.937209  , -1.40266877],
       [-0.95171405,  0.51006137],
       [-0.78328682, -1.40266877],
       [-0.57603328,  1.14763808],
       [-0.53467702, -0.76509206],
       [ 0.69939906, -0.76509206],
       [ 3.33780001, -0.76509206],
       [-0.53467702,  0.51006137],
       [ 0.52699137,  1.14763808],
       [ 1.51100715, -1.40266877],
       [ 1.77668568, -1.40266877],
       [-0.54810263,  1.14763808],
       [-0.77276222, -1.40266877],
       [-0.58004747, -1.40266877],
       [ 0.58943055,  1.14763808],
       [-0.78365788,  0.51006137],
       [-1.02322731,  0.51006137],
       [ 1.19557293,  0.51006137],
       [-1.12884431,  0.51006137],
       [-1.10378093, -0.76509206],
       [ 0.84424715,  1.14763808],
       [-0.95171405,  1.14763808],
       [ 1.62279723,  0.51006137],
       [-0.58004747,
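To make the scaling step concrete, the sketch below reproduces the first two rows of the scaled array by hand. It assumes the summary statistics printed by `data.describe()` above, and it relies on the fact that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas reports the sample standard deviation (`ddof=1`). The helper name `standardize` is made up for this illustration.

```python
import numpy as np

# Summary statistics copied from the data.describe() output above.
n = 100
mean = np.array([853.0242, 2012.6])            # size, year
sample_std = np.array([297.941951, 4.729021])  # pandas default: ddof=1

# StandardScaler uses the population standard deviation (ddof=0),
# so convert the sample figure before dividing.
population_std = sample_std * np.sqrt((n - 1) / n)

def standardize(rows):
    """Return (rows - mean) / population_std, column by column."""
    return (np.asarray(rows, dtype=float) - mean) / population_std

# First two observations of the dataset: [size, year]
first_two = standardize([[643.09, 2015], [656.22, 2009]])
print(first_two.round(4))
```

The result should agree with the first two rows of `x_scaled` up to the rounding of the printed statistics.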

### Regression

In [10]:
reg = LinearRegression()
reg.fit(x_scaled, y)

### Find the intercept

In [11]:
reg.intercept_

np.float64(292289.4701599997)

### Find the coefficients

In [12]:
reg.coef_

array([67501.57614152, 13724.39708231])
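Because the inputs were standardized, each coefficient is the change in price per one standard deviation of the feature, not per square foot or per year. As a rough sketch, dividing by the standard deviations the scaler used (available as `scaler.scale_` on a fitted `StandardScaler`) converts them back to original units. Here the standard deviations are rebuilt from the `data.describe()` output, so the figures are approximate.

```python
import numpy as np

n = 100
coef_scaled = np.array([67501.57614152, 13724.39708231])  # size, year
sample_std = np.array([297.941951, 4.729021])             # from data.describe()

# Population standard deviation (ddof=0), which is what StandardScaler uses.
population_std = sample_std * np.sqrt((n - 1) / n)

# Price change per sq.ft. and per year, respectively.
coef_original = coef_scaled / population_std
print(coef_original.round(2))
```

Read this way, one extra square foot adds roughly 228 to the predicted price, and one extra year roughly 2,917.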

### Calculate the R-squared

In [13]:
reg.score(x_scaled, y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [14]:
r2 = reg.score(x_scaled, y)
n = x_scaled.shape[0]
p = x_scaled.shape[1]

In [15]:
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2

0.77187171612825
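The formula used above is $R^2_{adj} = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ the number of predictors. A small reusable helper might look like the sketch below; the function name `adjusted_r_squared` is an assumption for this example, not part of sklearn.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    if n - p - 1 <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Values taken from the cells above: R-squared, 100 rows, 2 features.
adj = adjusted_r_squared(0.7764803683276793, n=100, p=2)
print(adj)
```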

### Compare the R-squared and the Adjusted R-squared

Answer... r2 = 0.7764803683276793 and adjusted_r2 = 0.77187171612825. The two values are very close, which means the model is barely penalized for the number of independent variables it uses.

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

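Answer... simple_r2 = 0.40600391479679754, multi_r2 = 0.7764803683276793 and adjusted_r2 = 0.77187171612825.
An R-squared closer to 1 (not 0) indicates a better fit, so on these figures the multiple regression explains far more of the variability in price than the size-only model. The recorded simple_r2 is worth re-checking, though: the univariate F-statistic for 'size' reported below (285.92 with n = 100) implies a size-only R-squared of F / (F + n - 2) ≈ 0.745, in which case adding 'year' improves the fit only slightly.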

### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [16]:
new_data = pd.DataFrame([[750, 2009]], columns=['size', 'year'])
new_data

Unnamed: 0,size,year
0,750,2009


In [17]:
new_data_scaled = scaler.transform(new_data)
new_data_scaled

array([[-0.34752816, -0.76509206]])

In [18]:
reg.predict(new_data_scaled)

array([258330.34465995])
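The prediction is simply the intercept plus the dot product of the coefficients with the standardized inputs. The sketch below redoes that arithmetic with the numbers printed in the cells above; since those are rounded, the result matches `reg.predict` only approximately.

```python
import numpy as np

intercept = 292289.4701599997
coef = np.array([67501.57614152, 13724.39708231])      # size, year
new_data_scaled = np.array([-0.34752816, -0.76509206])  # 750 sq.ft., 2009

# Linear model: y_hat = intercept + sum(coef_i * x_i)
predicted_price = intercept + coef @ new_data_scaled
print(round(predicted_price, 2))
```

Note that the raw values (750, 2009) must go through the same fitted `scaler` first; feeding unscaled inputs to a model trained on standardized data would give a meaningless price.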

### Calculate the univariate p-values of the variables

In [23]:
from sklearn.feature_selection import f_regression

In [24]:
f_regression(x_scaled, y)

(array([285.92105192,   0.85525799]), array([8.12763222e-31, 3.57340758e-01]))

In [25]:
p_values = f_regression(x_scaled, y)[1]
p_values

array([8.12763222e-31, 3.57340758e-01])

In [26]:
p_values.round(3)

array([0.   , 0.357])
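`f_regression` tests each feature on its own: it takes the correlation $r$ between the feature and the target and forms $F = \frac{r^2}{1 - r^2}(n - 2)$, with the p-value read from an F distribution with 1 and $n - 2$ degrees of freedom. Because correlation does not change under standardization, scaled and unscaled inputs give the same result, which is why the two outputs above are identical. The sketch below illustrates this with only the first five rows of the dataset, so its numbers differ from the full 100-row output; `univariate_f` is a hypothetical helper name.

```python
import numpy as np

def univariate_f(x, y):
    """F-statistic of a one-feature regression, from the correlation."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    return r**2 / (1 - r**2) * (x.size - 2)

# First five rows of the dataset shown earlier (illustration only).
size = np.array([643.09, 656.22, 487.29, 1504.75, 1275.46])
price = np.array([234314.144, 228581.528, 281626.336, 401255.608, 458674.256])

f_raw = univariate_f(size, price)
f_scaled = univariate_f((size - size.mean()) / size.std(), price)

# Standardizing the feature leaves the F-statistic unchanged.
print(np.isclose(f_raw, f_scaled))
```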

### Create a summary table with your findings

In [28]:
# Use a separate DataFrame instead of attaching an attribute to the estimator
reg_summary = pd.DataFrame(data = x.columns.values, columns=['Features'])
reg_summary['Coefficients'] = reg.coef_
reg_summary['p-values'] = p_values.round(3)
reg_summary

Unnamed: 0,Features,Coefficients,p-values
0,size,67501.576142,0.0
1,year,13724.397082,0.357


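Answer... Size definitely has more weight than year: its standardized coefficient is about five times larger. With a univariate p-value of 0.357, year is not statistically significant on its own, so it is a candidate for removal from the model.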