# Feature scaling with sklearn - Exercise

You are given a real estate dataset. 

Real estate is an example that nearly every regression course covers, because it is easy to understand and there is almost always a causal relationship to be found.

The data is located in the file: 'real_estate_price_size_year.csv'. 

You are expected to create a multiple linear regression (similar to the one in the lecture), using the new data. This exercise is very similar to a previous one. This time, however, **please standardize the data**.

Apart from that, please:
-  Display the intercept and coefficient(s)
-  Find the R-squared and Adjusted R-squared
-  Compare the R-squared and the Adjusted R-squared
-  Compare the R-squared of this regression and the simple linear regression where only 'size' was used
-  Using the model, make a prediction about an apartment with size 750 sq.ft. from 2009
-  Find the univariate (or multivariate if you wish - see the article) p-values of the two variables. What can you say about them?
-  Create a summary table with your findings

In this exercise, the dependent variable is 'price', while the independent variables are 'size' and 'year'.

Good luck!

## Import the relevant libraries

In [24]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from sklearn.linear_model import LinearRegression

## Load the data

In [85]:
data=pd.read_csv("real_estate_price_size_year.csv")
data.head()
print(list(data))

['price', 'size', 'year']


In [86]:
# Build the feature DataFrame (the independent variables)
x=data[['size', 'year']]
x.head()
print(x.shape)

(100, 2)


In [87]:
y=data['price']
y.head()

0    234314.144
1    228581.528
2    281626.336
3    401255.608
4    458674.256
Name: price, dtype: float64

## Create the regression

### Declare the dependent and the independent variables

In [88]:
reg=LinearRegression()
reg.fit(x,y)


LinearRegression()


In [89]:
reg.coef_

array([ 227.70085401, 2916.78532684])

In [90]:
reg.intercept_

np.float64(-5772267.017463281)
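
As a rough sketch of how to read these two outputs, the fitted equation is `price = intercept + coef[0] * size + coef[1] * year`. The snippet below applies it by hand to the apartment from the exercise statement (750 sq.ft., 2009). The variable names `apartment` and `predicted_price` are illustrative and not part of the original notebook.

```python
import numpy as np

# Values copied from the two outputs above (model fitted on unscaled inputs).
intercept = -5772267.017463281
coef = np.array([227.70085401, 2916.78532684])  # order: size, year

# Query taken from the exercise statement: 750 sq.ft., built in 2009.
apartment = np.array([750, 2009])

# Linear model by hand: intercept plus the dot product of coefficients and features.
predicted_price = intercept + coef @ apartment
print(round(float(predicted_price), 2))
```

This comes out at roughly 258,330. Ordinary least squares predictions do not depend on how the inputs are scaled, so the standardized model should give the same figure for this apartment, provided the new row is transformed with the scaler that was fitted on the training data.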

### Scale the inputs

In [123]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

scalar=StandardScaler()
x_scaled=scalar.fit_transform(x)
# The scaler returns a NumPy array rather than a DataFrame, as the output below shows
print(type(x_scaled))
print(x_scaled.shape)

<class 'numpy.ndarray'>
(100, 2)
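
To make the transformation concrete, here is a minimal sketch of what `StandardScaler` computes: each column has its mean subtracted and is divided by its population standard deviation. The small matrix `x_demo` is made up for illustration and is not the exercise data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small made-up feature matrix (size, year); not the exercise data.
x_demo = np.array([[650.0, 2006], [800.0, 2009], [1200.0, 2015], [950.0, 2018]])

scaler = StandardScaler()
x_demo_scaled = scaler.fit_transform(x_demo)

# Same computation by hand. np.std defaults to ddof=0 (population std),
# which is what StandardScaler uses.
manual = (x_demo - x_demo.mean(axis=0)) / x_demo.std(axis=0)

print(np.allclose(x_demo_scaled, manual))
print(x_demo_scaled.mean(axis=0).round(6), x_demo_scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which is why the coefficients of a model fitted on scaled inputs are comparable across features.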


### Regression

In [28]:
# Refit the regression, this time on the standardized inputs
reg.fit(x_scaled,y)

LinearRegression()


### Find the intercept

In [29]:
reg.intercept_

np.float64(292289.4701599997)

### Find the coefficients

In [30]:
reg.coef_

array([67501.57614152, 13724.39708231])
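
The standardized coefficients are not new information. Under ordinary least squares, each one should equal the unscaled coefficient multiplied by that feature's standard deviation, and the intercept should equal the mean of the target. The sketch below checks this on synthetic stand-in data (the real CSV is not loaded here), so all numbers and names in it are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the exercise data.
size = rng.uniform(500, 1500, 100)
year = rng.integers(2006, 2019, 100).astype(float)
x_demo = np.column_stack([size, year])
y_demo = 200 * size + 3000 * year + rng.normal(0, 20000, 100)

scaler = StandardScaler()
x_demo_scaled = scaler.fit_transform(x_demo)

raw = LinearRegression().fit(x_demo, y_demo)
std = LinearRegression().fit(x_demo_scaled, y_demo)

# Scaled coefficient = raw coefficient * feature standard deviation.
print(np.allclose(std.coef_, raw.coef_ * scaler.scale_))
# With centered inputs, the intercept is the mean of the target.
print(np.isclose(std.intercept_, y_demo.mean()))
# The fit itself is unchanged, so R-squared is identical.
print(np.isclose(raw.score(x_demo, y_demo), std.score(x_demo_scaled, y_demo)))
```

Applied to the notebook's own numbers, 67501.58 / 227.70 is about 296, which under this relationship would be the standard deviation of `size` in the dataset.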

### Calculate the R-squared

In [31]:
reg.score(x_scaled, y)

0.7764803683276793

### Calculate the Adjusted R-squared

In [35]:
# Let's use the handy function we created
def adj_r2(x,y):
    r2 = reg.score(x,y)
    n = x.shape[0]
    p = x.shape[1]
    adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
    return adjusted_r2

adj_r2_scaled_features=adj_r2(x_scaled, y)
print(adj_r2_scaled_features)

0.77187171612825
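
The `adj_r2` helper reads the fitted `reg` object from the enclosing scope, so its result depends on which model was fitted last. A possible alternative, sketched here under the hypothetical name `adjusted_r2`, takes the three quantities explicitly and can be checked against the figures already reported in this notebook.

```python
def adjusted_r2(r2: float, n: int, p: int) -> float:
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# R-squared, sample count and feature count reported earlier in the notebook.
adj = adjusted_r2(0.7764803683276793, n=100, p=2)
print(adj)
```

The result agrees with the value printed above to the displayed precision.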


### Compare the R-squared and the Adjusted R-squared

**Interpretation when adjusted R² < R²**

The adjusted R² is always slightly lower than the R² (for any model with at least one predictor and R² < 1), because it penalizes the number of variables (p) relative to the sample size (n).

If the difference is very small, nothing is wrong; that is the normal case.

If the difference is large, at least one of the added variables is probably contributing little real information and may only be adding noise.

**What you can deduce**

If you add a variable and R² rises but adjusted R² falls, that variable does not contribute enough to the model to justify its inclusion.

That variable could probably be removed without losing predictive power.

In statistical terms, its p-value will usually be high, indicating that it is not significant.

Here the R² is 0.776 and the adjusted R² is 0.772, so the gap is small and the penalty for using two features is negligible.
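
The effect described above can be illustrated with a small experiment on synthetic data: adding a column that is unrelated to the target can never lower R², while the adjusted R² is penalized for the extra predictor. Everything in this sketch (data, seed, and the helper `r2_and_adjusted`) is made up for illustration. Whether the adjusted R² actually drops depends on the particular random draw.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 100
signal = rng.normal(size=(n, 1))   # genuinely related to y
noise = rng.normal(size=(n, 1))    # unrelated to y by construction
y_demo = 3 * signal[:, 0] + rng.normal(size=n)

def r2_and_adjusted(x, y):
    """Fit OLS and return (R-squared, adjusted R-squared)."""
    r2 = LinearRegression().fit(x, y).score(x, y)
    n_obs, p = x.shape
    return r2, 1 - (1 - r2) * (n_obs - 1) / (n_obs - p - 1)

r2_base, adj_base = r2_and_adjusted(signal, y_demo)
r2_more, adj_more = r2_and_adjusted(np.hstack([signal, noise]), y_demo)

print(f"signal only:    R2={r2_base:.4f}  adj R2={adj_base:.4f}")
print(f"signal + noise: R2={r2_more:.4f}  adj R2={adj_more:.4f}")
```

Comparing the two printed lines shows how much of the R² gain from the extra column survives the penalty.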

### Compare the Adjusted R-squared with the R-squared of the simple linear regression

In [None]:
print(type(x_scaled))
x_scaled.shape

# Keep only the first column (size)
x_scaled_linear=x_scaled[:,0]
print(x_scaled_linear)
print(x_scaled_linear.shape)
# Selecting a single column this way returns a 1D array of shape (100,); sklearn expects a 2D array

<class 'numpy.ndarray'>
[-0.70816415 -0.66387316 -1.23371919  2.19844528  1.42498884 -0.937209
 -0.95171405 -0.78328682 -0.57603328 -0.53467702  0.69939906  3.33780001
 -0.53467702  0.52699137  1.51100715  1.77668568 -0.54810263 -0.77276222
 -0.58004747  0.58943055 -0.78365788 -1.02322731  1.19557293 -1.12884431
 -1.10378093  0.84424715 -0.95171405  1.62279723 -0.58004747  2.17014356
  0.5306345  -0.58004747 -0.8606021  -1.10378093  0.015233   -0.77603429
 -0.10057126 -0.95387294 -0.56517136 -0.5219598   0.56983186 -0.57603328
 -0.10057126  1.62279723  0.69939906 -0.5219598  -0.7415595  -0.5219598
 -0.7415595  -0.79600403 -0.69328805  0.56983186  0.56983186 -0.42214483
 -0.69328805  2.21224194  0.6039356   1.45329055 -0.08495304 -0.95751607
 -0.08387359 -0.52125142  1.18939985  0.56983186 -0.56517136 -0.08748299
  0.52699137 -1.02285625 -0.56517136  2.17014356  0.56983186 -0.70708471
 -0.66387316 -1.02285625 -0.56517136 -0.56517136  1.11464825  1.62279723
 -0.57603328  1.13205431 -0.58

In [None]:
# Reshape the single column into a 2D array with one feature
x_scaled_linear_2d=x_scaled_linear.reshape(-1,1)
print(x_scaled_linear_2d.shape)

(100, 1)
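
The intermediate 1D array can be avoided altogether. As a small sketch, indexing with a list or a slice keeps the column dimension, so no `reshape` is needed. The array `arr` below is a stand-in for `x_scaled`.

```python
import numpy as np

arr = np.arange(6, dtype=float).reshape(3, 2)  # stand-in for x_scaled

col_1d = arr[:, 0]                    # integer index drops the axis: shape (3,)
col_reshaped = col_1d.reshape(-1, 1)  # the approach used above: shape (3, 1)
col_list_index = arr[:, [0]]          # list index keeps the axis: shape (3, 1)
col_slice = arr[:, 0:1]               # slice keeps the axis: shape (3, 1)

print(col_1d.shape, col_reshaped.shape, col_list_index.shape, col_slice.shape)
```

All three 2D variants hold the same values; which one to use is a matter of taste.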


In [67]:
reg.fit(x_scaled_linear_2d,y)

LinearRegression()


In [68]:
reg.score(x_scaled_linear_2d,y)

0.7447391865847586

In [71]:
r2_adjusted_linear=adj_r2(x_scaled_linear_2d,y)
print(r2_adjusted_linear)

0.742134484407052
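
To put the two models side by side, the reported figures can be collected in one small table. This is only a convenience sketch; the numbers are copied from the outputs in this notebook and the row labels are illustrative.

```python
import pandas as pd

# Figures copied from the outputs in this notebook.
comparison = pd.DataFrame(
    {
        "model": ["size only", "size + year"],
        "r2": [0.7447391865847586, 0.7764803683276793],
        "adjusted_r2": [0.742134484407052, 0.77187171612825],
    }
).set_index("model")

# Improvement from adding 'year' to the model.
gain = comparison.loc["size + year"] - comparison.loc["size only"]
print(comparison.round(4))
print(gain.round(4))
```

The adjusted R² rises by about 0.03 when `year` is added, which suggests that the extra variable adds explanatory power beyond what the penalty takes away.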


### Making predictions

Find the predicted price of an apartment that has a size of 750 sq.ft. from 2009.

In [113]:
new_data=np.array([[750,2009]])
#new_data_reshape=new_data.T
print(new_data)
print(new_data.shape)
new_data_df=pd.DataFrame(new_data, columns=["size",'year'])
new_data_df

[[ 750 2009]]
(1, 2)


| | size | year |
|---|---|---|
| 0 | 750 | 2009 |


In [124]:
reg=LinearRegression()
reg.fit(x_scaled,y)



LinearRegression()


In [125]:
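# Reuse the scaler fitted on the training inputs. Calling fit_transform here
# would refit it on this single row and map both features to 0.
scaled_new_data=scalar.transform(new_data_df)
print(type(scaled_new_data))
print(scaled_new_data.shape)


<class 'numpy.ndarray'>
(1, 2)


In [126]:
reg.predict(scaled_new_data).round(2)

array([258330.34])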
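
One pitfall at this step is worth illustrating. A scaler that is fitted on the new observation itself maps that observation to zeros, so the model returns its intercept regardless of the input. The sketch below shows the effect on synthetic stand-in data and also shows one possible safeguard, `make_pipeline`, which keeps the scaler and the model together. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Synthetic stand-in for the exercise data (columns: size, year).
x_demo = np.column_stack([rng.uniform(500, 1500, 100), rng.integers(2006, 2019, 100)])
y_demo = 200 * x_demo[:, 0] + 3000 * x_demo[:, 1] + rng.normal(0, 20000, 100)
new_row = np.array([[750, 2009]])

scaler = StandardScaler().fit(x_demo)
model = LinearRegression().fit(scaler.transform(x_demo), y_demo)

# Pitfall: a scaler fitted on a single row maps that row to zeros,
# so the prediction collapses to the intercept (the mean of y).
refit_row = StandardScaler().fit_transform(new_row)
pitfall = model.predict(refit_row)[0]

# Reusing the training statistics gives the intended prediction.
correct = model.predict(scaler.transform(new_row))[0]

# A pipeline bundles scaler and model, so predict() always applies
# the training mean and standard deviation.
pipe = make_pipeline(StandardScaler(), LinearRegression()).fit(x_demo, y_demo)
piped = pipe.predict(new_row)[0]

print(refit_row, pitfall, model.intercept_)
print(correct, piped)
```

With the pipeline, raw values such as `[[750, 2009]]` can be passed straight to `predict`, which removes the opportunity to scale new data incorrectly.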

### Calculate the univariate p-values of the variables
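
This section was left empty. One way to obtain univariate p-values with scikit-learn is `f_regression` from `sklearn.feature_selection`, which runs a separate simple regression of the target on each feature and returns the F-statistics and p-values. The sketch below uses synthetic stand-in data rather than the exercise CSV, so its numbers say nothing about the real dataset. In the notebook, the call would presumably be `f_regression(x, y)`.

```python
import numpy as np
from sklearn.feature_selection import f_regression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Synthetic stand-in for the exercise data (columns: size, year).
size = rng.uniform(500, 1500, 100)
year = rng.integers(2006, 2019, 100).astype(float)
x_demo = np.column_stack([size, year])
y_demo = 200 * size + 3000 * year + rng.normal(0, 20000, 100)

# One F-test per feature; returns (F-statistics, p-values).
f_stats, p_values = f_regression(x_demo, y_demo)
print(p_values.round(3))

# The test is based on correlation, so standardizing the inputs
# leaves the p-values unchanged.
x_demo_scaled = StandardScaler().fit_transform(x_demo)
_, p_values_scaled = f_regression(x_demo_scaled, y_demo)
print(np.allclose(p_values, p_values_scaled))
```

A p-value below a chosen threshold (0.05 is a common convention) suggests that the feature, taken on its own, is linearly related to the target. These are univariate tests, so they ignore any interplay between `size` and `year`.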

### Create a summary table with your findings

In [127]:
reg.coef_

array([67501.57614152, 13724.39708231])

In [160]:
tabla=pd.DataFrame(data=[reg.coef_[0],reg.coef_[1]], columns=["coefficient"])
tabla.head()

| | coefficient |
|---|---|
| 0 | 67501.576142 |
| 1 | 13724.397082 |


In [162]:
tabla["features"]=["size", "year"]
tabla.head()

| | coefficient | features |
|---|---|---|
| 0 | 67501.576142 | size |
| 1 | 13724.397082 | year |
