## For a regression equation log y = 1 + 60 log x, how do changes in y associate with changes in x?

1. $(x_{new}, y_{new})$:  $\log y_{new} = 1 + 60 \log x_{new}$

2. $(x_{old}, y_{old})$:  $\log y_{old} = 1 + 60 \log x_{old}$

Subtracting equation 2 from equation 1 (the intercept cancels):

$\log y_{new} - \log y_{old} = 60 (\log x_{new} - \log x_{old})$

$\log (y_{new}/y_{old}) = 60 \log (x_{new}/x_{old})$

$\log (y_{new}/y_{old}) = \log \left( (x_{new}/x_{old})^{60} \right)$

$y_{new}/y_{old} = (x_{new}/x_{old})^{60}$

**Answer:** This is a log-log model. If $x_{new}$ is 1% larger than $x_{old}$, then $x_{new}/x_{old} = 1.01$, so $y_{new}/y_{old} = (1.01)^{60} \approx 1.8167$. A 1% increase in $x$ is therefore associated with an increase in $y$ of about 81.67%.
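The arithmetic can be checked numerically. The following is a minimal sketch, not part of the original notebook: the helper `y_ratio` and the sample values `x_old = 2.0` and `x_new = 2.02` are hypothetical names and numbers chosen only for illustration.

```python
import math

def y_ratio(pct_change_x, slope=60):
    """Return y_new / y_old for a given percent change in x under log y = a + slope * log x."""
    return (1 + pct_change_x / 100) ** slope

# A 1% increase in x multiplies y by (1.01)^60.
ratio = y_ratio(1.0)
print(f"y_new / y_old = {ratio:.4f}")
print(f"percent change in y = {(ratio - 1) * 100:.2f}%")

# Cross-check against the regression equation itself (illustrative x values).
# The intercept 1 cancels in the ratio, so only x_new / x_old matters.
x_old, x_new = 2.0, 2.02
y_old = math.exp(1 + 60 * math.log(x_old))
y_new = math.exp(1 + 60 * math.log(x_new))
print(f"ratio from the equation = {y_new / y_old:.4f}")
```

Because only the ratio $x_{new}/x_{old}$ enters the result, any starting value of $x$ gives the same percentage change in $y$.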


## Before running the regression, normalize the elements, ptratio and rm.

In [1]:
import pandas as pd
housing = pd.read_csv("Housing.csv")
housing

Unnamed: 0,crim,zn,river,rm,ptratio,medv
0,0.00632,18.0,0,6.575,15.3,24.0
1,0.02731,0.0,0,6.421,17.8,21.6
2,0.02729,0.0,0,7.185,17.8,34.7
3,0.03237,0.0,0,6.998,18.7,33.4
4,0.06905,0.0,0,7.147,18.7,36.2
...,...,...,...,...,...,...
501,0.06263,0.0,0,6.593,21.0,22.4
502,0.04527,0.0,0,6.120,21.0,20.6
503,0.06076,0.0,0,6.976,21.0,23.9
504,0.10959,0.0,0,6.794,21.0,22.0


In [2]:
housing.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   crim     506 non-null    float64
 1   zn       506 non-null    float64
 2   river    506 non-null    int64  
 3   rm       506 non-null    float64
 4   ptratio  506 non-null    float64
 5   medv     506 non-null    float64
dtypes: float64(5), int64(1)
memory usage: 23.8 KB


In [3]:
# normalize ptratio and rm

from sklearn.preprocessing import MinMaxScaler

mms = MinMaxScaler()
housing['ptratio_normalize'] = mms.fit_transform(housing[['ptratio']])
housing['rm_normalize'] = mms.fit_transform(housing[['rm']])

In [4]:
# scikit-learn method:
import numpy as np
from sklearn.linear_model import LinearRegression

X2 = housing[['ptratio_normalize','rm_normalize']].values
y2 = housing['medv'].values

linear_model_2 = LinearRegression()
linear_model_2.fit(X2,y2)

r_sq_housing = linear_model_2.score(X2,y2)
print(f'Coefficient of Determination: {r_sq_housing}')

Coefficient of Determination: 0.5612534621272917


**Answer:** The $R^2$ after min-max normalizing the features (ptratio and rm) is 0.5612534621272917.
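Min-max scaling is an affine transformation of each feature, so an ordinary least-squares fit with an intercept should give the same $R^2$ before and after scaling; only the coefficients and intercept change. The sketch below illustrates this on synthetic data, not on Housing.csv. All variable names and the data-generating numbers are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-ins for two features on different scales (not the Housing data).
rng = np.random.default_rng(0)
X_raw = rng.normal(loc=[18.0, 6.0], scale=[2.0, 0.7], size=(200, 2))
y_demo = 30.0 - 1.0 * X_raw[:, 0] + 8.0 * X_raw[:, 1] + rng.normal(scale=3.0, size=200)

# Rescale each column to the [0, 1] range.
X_scaled = MinMaxScaler().fit_transform(X_raw)

# Fit the same model on raw and scaled features and compare in-sample R^2.
model_raw = LinearRegression().fit(X_raw, y_demo)
model_scaled = LinearRegression().fit(X_scaled, y_demo)
r2_raw = model_raw.score(X_raw, y_demo)
r2_scaled = model_scaled.score(X_scaled, y_demo)

print(f"R^2 on raw features:    {r2_raw}")
print(f"R^2 on scaled features: {r2_scaled}")
print(f"coefficients (raw):     {model_raw.coef_}")
print(f"coefficients (scaled):  {model_scaled.coef_}")
```

The two $R^2$ values should agree up to floating-point error, while the coefficients differ because they are expressed per unit of the rescaled features.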

## Randomly split the Housing data into two parts with 30% as test data. Use random_state = 1 in this split. Because this is a regression problem, do not use the stratify = y part of the code from our Python tutorial. Regress medv on river and rm using the training data. Compute $R^2$ on both the training data and the test data (i.e., in-sample $R^2$ and out-of-sample $R^2$).

In [5]:
housing

Unnamed: 0,crim,zn,river,rm,ptratio,medv,ptratio_normalize,rm_normalize
0,0.00632,18.0,0,6.575,15.3,24.0,0.287234,0.577505
1,0.02731,0.0,0,6.421,17.8,21.6,0.553191,0.547998
2,0.02729,0.0,0,7.185,17.8,34.7,0.553191,0.694386
3,0.03237,0.0,0,6.998,18.7,33.4,0.648936,0.658555
4,0.06905,0.0,0,7.147,18.7,36.2,0.648936,0.687105
...,...,...,...,...,...,...,...,...
501,0.06263,0.0,0,6.593,21.0,22.4,0.893617,0.580954
502,0.04527,0.0,0,6.120,21.0,20.6,0.893617,0.490324
503,0.06076,0.0,0,6.976,21.0,23.9,0.893617,0.654340
504,0.10959,0.0,0,6.794,21.0,22.0,0.893617,0.619467


In [6]:
cols = housing.columns
cols

Index(['crim', 'zn', 'river', 'rm', 'ptratio', 'medv', 'ptratio_normalize',
       'rm_normalize'],
      dtype='object')

In [7]:
from sklearn.model_selection import train_test_split

X,y = housing.iloc[:, 2:4], housing.iloc[:, 5]

X_train, X_test, y_train, y_test = train_test_split(X,y,
                                                   test_size = 0.3,
                                                   random_state=1)
# X: columns river and rm (positions 2 and 3)
# y: column medv (position 5)
# test_size: 0.3 sends 30% of the rows to X_test and y_test (the default is 0.25)
# random_state: 1
# no stratify=y


In [8]:
# Regress with training data using the scikit-learn method:
import numpy as np
from sklearn.linear_model import LinearRegression

linear_model_housing = LinearRegression()
linear_model_housing.fit(X_train,y_train)

r_sq_housing_training = linear_model_housing.score(X_train,y_train)
print(f'Coefficient of Determination for Training Data: {r_sq_housing_training}')

Coefficient of Determination for Training Data: 0.43328174963537236


In [9]:
r_sq_housing_testing = linear_model_housing.score(X_test,y_test)
print(f'Coefficient of Determination for Testing Data: {r_sq_housing_testing}')

Coefficient of Determination for Testing Data: 0.6147469815435613


**Answer:** Training Data's $R^2$: 0.43328174963537236

**Answer:** Testing Data's $R^2$: 0.6147469815435613
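For reference, the value returned by `LinearRegression.score` is $R^2 = 1 - SS_{res}/SS_{tot}$, computed on whichever data set is passed in. The sketch below reproduces it by hand for both a training and a test split. It uses synthetic data rather than Housing.csv, and the helper `r_squared` and the `_demo` variables are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic regression data (not the Housing data).
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 2.0 + 3.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + rng.normal(size=100)

# Same split settings as in the notebook: 30% test data, random_state=1, no stratify.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

model_demo = LinearRegression().fit(X_tr, y_tr)

# In-sample and out-of-sample R^2, computed by hand.
r2_train = r_squared(y_tr, model_demo.predict(X_tr))
r2_test = r_squared(y_te, model_demo.predict(X_te))

print(f"in-sample R^2:     {r2_train}")
print(f"out-of-sample R^2: {r2_test}")
```

Out-of-sample $R^2$ is measured against the mean of the test targets, so it is not guaranteed to be lower than the in-sample value, and it can even be negative for a poor model.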

## Use the first 30 rows of the training set you got in problem 4 as the new training set, regress medv on river and rm again. How much is the R2 on this training set? How much is R2 if you evaluate this new model’s performance using the test data in problem 4?

In [10]:
housing_training = pd.DataFrame(X_train)
housing_training

Unnamed: 0,river,rm
13,0,5.949
61,0,5.966
377,0,6.794
39,0,6.595
365,0,3.561
...,...,...
255,0,5.876
72,0,6.065
396,0,6.405
235,0,6.086


In [11]:
housing_training_y = pd.DataFrame(y_train)
housing_training_y

Unnamed: 0,medv
13,20.4
61,16.0
377,13.3
39,30.8
365,27.5
...,...
255,20.9
72,22.8
396,12.5
235,24.0


In [12]:
housing_training = housing_training.assign(medv = y_train.values)
housing_training

Unnamed: 0,river,rm,medv
13,0,5.949,20.4
61,0,5.966,16.0
377,0,6.794,13.3
39,0,6.595,30.8
365,0,3.561,27.5
...,...,...,...
255,0,5.876,20.9
72,0,6.065,22.8
396,0,6.405,12.5
235,0,6.086,24.0


In [13]:
first_30_housing_training = housing_training.head(30)
first_30_housing_training

Unnamed: 0,river,rm,medv
13,0,5.949,20.4
61,0,5.966,16.0
377,0,6.794,13.3
39,0,6.595,30.8
365,0,3.561,27.5
272,0,6.538,24.4
208,1,6.064,24.4
236,1,6.631,25.1
98,0,7.82,43.8
364,1,8.78,21.9


In [14]:
# scikit-learn method:
import numpy as np
from sklearn.linear_model import LinearRegression

X_2 = first_30_housing_training[['river','rm']]  # keep a DataFrame so feature names match X_test when scoring
y_2 = first_30_housing_training['medv'].values

linear_model_2 = LinearRegression()
linear_model_2.fit(X_2,y_2)

r_sq_first_30_training = linear_model_2.score(X_2,y_2)
print(f'Coefficient of Determination for New Model Training Data: {r_sq_first_30_training}')

Coefficient of Determination for New Model Training Data: 0.1478549316471307


In [15]:
r_sq_first_30_testing = linear_model_2.score(X_test,y_test)
print(f'Coefficient of Determination for New Model Testing Data: {r_sq_first_30_testing}')

Coefficient of Determination for New Model Testing Data: 0.3416852369553084


**Answer:** Training's $R^2$: 0.1478549316471307

**Answer:** Testing's $R^2$: 0.3416852369553084
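The same subset can also be taken directly from the split outputs with `.iloc[:30]`, which keeps the feature rows and targets aligned by index without rebuilding a combined DataFrame. The following is a minimal sketch on synthetic data. The DataFrame `demo`, its generated values, and the other variable names are assumptions for illustration and do not reproduce the Housing results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same column names as the notebook (not the Housing data).
rng = np.random.default_rng(2)
demo = pd.DataFrame({
    "river": rng.integers(0, 2, size=200),
    "rm": rng.normal(6.3, 0.7, size=200),
})
demo["medv"] = -30.0 + 8.0 * demo["rm"] + 4.0 * demo["river"] + rng.normal(scale=5.0, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo[["river", "rm"]], demo["medv"], test_size=0.3, random_state=1
)

# First 30 rows of the training split; X and y share the same shuffled index.
X_small, y_small = X_tr.iloc[:30], y_tr.iloc[:30]

small_model = LinearRegression().fit(X_small, y_small)
r2_small_train = small_model.score(X_small, y_small)
r2_small_test = small_model.score(X_te, y_te)

print(f"R^2 on the 30-row training set: {r2_small_train}")
print(f"R^2 on the test set:            {r2_small_test}")
```

With only 30 observations the fitted coefficients are noisier, so both $R^2$ values can move substantially from one random split to another.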