# Multiple linear regression

Regression with two or more independent variables is called multiple linear regression.
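
Written out for $p$ independent variables $x_1, \dots, x_p$, the fitted model has the form below. The symbol names ($b_0$ for the intercept, $b_1, \dots, b_p$ for the coefficients) are notation chosen here for illustration.

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_p x_p$

In this notebook $p = 2$: $x_1$ is `SAT`, $x_2$ is `Rand 1,2,3`, and $\hat{y}$ is the predicted `GPA`.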

### Import the required libraries

In [1]:
# For these lessons we will need NumPy, pandas and matplotlib
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# and of course the actual regression (machine learning) module
from sklearn.linear_model import LinearRegression

## Load the data

In [2]:
# Load the data from a .csv file (the path is relative to this notebook)
data = pd.read_csv('../../data/1.02. Multiple linear regression.csv')

# Let's explore the top 5 rows of the df
data.head()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
0,1714,1,2.4
1,1664,3,2.52
2,1760,3,2.54
3,1685,3,2.74
4,1693,2,2.83


In [3]:
# This method gives us very nice descriptive statistics. We don't need this for now, but will later on!
data.describe()

Unnamed: 0,SAT,"Rand 1,2,3",GPA
count,84.0,84.0,84.0
mean,1845.27381,2.059524,3.330238
std,104.530661,0.855192,0.271617
min,1634.0,1.0,2.4
25%,1772.0,1.0,3.19
50%,1846.0,2.0,3.38
75%,1934.0,3.0,3.5025
max,2050.0,3.0,3.81


## Create the multiple linear regression

### Declare the dependent and independent variables
Declare the dependent and independent variables.

In [4]:
# There are two independent variables: 'SAT' and 'Rand 1,2,3'
x = data[['SAT','Rand 1,2,3']]

# and a single dependent variable: 'GPA'
y = data['GPA']

### Create the regression model

In [5]:
# We start by creating a linear regression object
reg = LinearRegression()

# The whole learning process boils down to fitting the regression
reg.fit(x,y)

LinearRegression()
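
To make the fitting step less of a black box, here is a minimal sketch of an ordinary least-squares fit written with NumPy only. It assumes synthetic, noise-free data (not the GPA dataset), and the names `X_design` and `w` are illustrative. It shows the idea behind fitting a linear regression, not necessarily how scikit-learn implements it internally.

```python
import numpy as np

# Synthetic, noise-free data (hypothetical values, not the GPA dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 0.5 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones so the first fitted weight plays the role of the intercept
X_design = np.column_stack([np.ones(len(X)), X])

# Solve the least-squares problem: minimize ||X_design @ w - y||^2
w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, coef = w[0], w[1:]

print(intercept, coef)
```

Because the data contain no noise, the recovered intercept and coefficients match the values used to generate `y` up to floating-point error.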

In [6]:
# Getting the coefficients of the regression
reg.coef_
# Note that the output is an array

array([ 0.00165354, -0.00826982])

In [7]:
# Getting the intercept of the regression
reg.intercept_
# Note that the result is a float as we usually expect a single value

0.29603261264909353
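
The coefficients and the intercept together define the prediction equation. As a small worked sketch, the values printed above can be plugged in by hand for the first row of the dataset (SAT = 1714, `Rand 1,2,3` = 1). The variable names `x_new` and `gpa_hat` are illustrative.

```python
import numpy as np

# Values copied from the outputs of the cells above
intercept = 0.29603261264909353
coef = np.array([0.00165354, -0.00826982])

# First row of the dataset: SAT = 1714, 'Rand 1,2,3' = 1 (the actual GPA is 2.4)
x_new = np.array([1714, 1])

# Prediction = intercept + sum of coefficient * feature value
gpa_hat = intercept + x_new @ coef
print(round(gpa_hat, 2))  # approximately 3.12
```

With the fitted model in memory, `reg.predict` applied to the same row should give essentially the same number, up to the rounding of the printed coefficients.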

### Model evaluation - Calculating the R-squared

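Check the coefficient of determination ($R^2$), which measures how much of the variation in the dependent variable the model explains.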

In [8]:
# Get the R-squared of the regression
reg.score(x,y)

0.4066811952814282
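
For a regressor, `score` reports the coefficient of determination, defined as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below computes that definition by hand on hypothetical toy values; the helper name `r_squared` is an assumption made for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # variation left unexplained
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy values
y_true = [1.0, 2.0, 3.0, 4.0]
print(r_squared(y_true, y_true))                # perfect predictions -> 1.0
print(r_squared(y_true, [2.5, 2.5, 2.5, 2.5]))  # always predicting the mean -> 0.0
```

A value of about 0.41, as obtained above, therefore means the model explains roughly 41% of the variation in GPA on this data.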

### Model evaluation - Formula for Adjusted $R^2$

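Adding more independent variables never lowers $R^2$ on the data the model was fitted to, even when the added variable has no real explanatory power.  
Adjusted $R^2$ corrects for this by penalizing each additional predictor.  

Adjusted $R^2$ is computed with the following formula, where $n$ is the number of observations and $p$ is the number of predictors:  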

$R^2_{adj.} = 1 - (1-R^2) \cdot \frac{n-1}{n-p-1}$

In [9]:
# Get the shape of x, to facilitate the creation of the Adjusted R^2 metric
x.shape

(84, 2)

In [10]:
# The Adjusted R-squared needs three inputs: the R-squared, the number of observations and the number of features
r2 = reg.score(x,y)
# Number of observations is the shape along axis 0
n = x.shape[0]
# Number of features (predictors, p) is the shape along axis 1
p = x.shape[1]

# We find the Adjusted R-squared using the formula
adjusted_r2 = 1-(1-r2)*(n-1)/(n-p-1)
adjusted_r2

0.39203134825134
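
The effect that Adjusted $R^2$ corrects for can be reproduced on synthetic data. The sketch below assumes a made-up dataset with one informative predictor and one purely random column (similar in spirit to `Rand 1,2,3`), and fits both models with NumPy least squares. The helper names `fit_r2` and `adjusted_r2` are illustrative.

```python
import numpy as np

def fit_r2(X, y):
    """Fit OLS with an intercept via least squares and return R^2 on the same data."""
    X_design = np.column_stack([np.ones(len(X)), X])
    w, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    residuals = y - X_design @ w
    return 1.0 - np.sum(residuals ** 2) / np.sum((y - y.mean()) ** 2)

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for n observations and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

rng = np.random.default_rng(42)
n = 84
signal = rng.normal(size=n)             # predictor that actually drives y
noise_col = rng.integers(1, 4, size=n)  # random 1, 2 or 3, unrelated to y
y = 3.0 + 0.5 * signal + rng.normal(scale=0.5, size=n)

X1 = signal.reshape(-1, 1)                  # informative predictor only
X2 = np.column_stack([signal, noise_col])   # informative predictor + random column

r2_1, r2_2 = fit_r2(X1, y), fit_r2(X2, y)
print("R^2:    ", r2_1, r2_2)
print("adj R^2:", adjusted_r2(r2_1, n, 1), adjusted_r2(r2_2, n, 2))
```

The second $R^2$ is never smaller than the first, because the larger model can always reproduce the smaller one by giving the extra column a zero weight. Whether Adjusted $R^2$ goes up or down depends on whether that small gain outweighs the penalty for the extra predictor.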

### Model evaluation - MSE

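The smaller the difference between the model's predicted values and the actual values, the better the model. The mean squared error (MSE) summarizes that difference as the average of the squared prediction errors.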

In [11]:
from sklearn.metrics import mean_squared_error

y_true = y
y_pred = reg.predict(x)

mse = mean_squared_error(y_true, y_pred) 
# RMSE (root mean squared error) is the square root of MSE; computed directly because squared=False is deprecated in recent scikit-learn releases
rmse = np.sqrt(mse)

print("MSE = ", mse, "\tRMSE = ", rmse)

MSE =  0.043251494565310245 	RMSE =  0.20796993668631591
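
As a small worked example of what these two numbers mean, the sketch below computes MSE and RMSE by hand on two hypothetical observations.

```python
import numpy as np

# Hypothetical actual and predicted values
y_true = np.array([3.0, 5.0])
y_pred = np.array([2.0, 7.0])

errors = y_true - y_pred    # [1.0, -2.0]
mse = np.mean(errors ** 2)  # (1 + 4) / 2 = 2.5
rmse = np.sqrt(mse)         # about 1.58, in the same units as y

print(mse, rmse)
```

Because RMSE is expressed in the units of the dependent variable, the RMSE of about 0.21 above can be read as a typical prediction error of roughly 0.21 GPA points.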


## Exercise

1. Load the sample dataset as follows.
```
# Note: load_boston was removed in scikit-learn 1.2, so this requires an older version
from sklearn.datasets import load_boston
data = load_boston()
x = data.data
y = data.target
```

2. Build a regression model that predicts the Boston house prices you loaded.
3. Evaluate the model you built.

In [16]:
import pandas as pd
data = pd.read_csv('../data/housing.data.txt')
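x = data.iloc[:, :-1]  # every column except the last (assumes the target is the last column)
y = data.iloc[:, -1]   # last column as the target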

FileNotFoundError: [Errno 2] No such file or directory: '../data/housing.data.txt'

In [13]:
print(data.DESCR)

.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pu