## Practice - Multiple Linear Regression

- This time we implement multiple linear regression on a real dataset.

### Boston Housing Price Data

This is the 1978 Boston (USA) housing price dataset.
It is loaded with `load_boston()` and is organized as follows.

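* Target
 * `MEDV`: median value of owner-occupied homes in each of the 506 towns (in units of $1,000)

* Features
 * `CRIM`: per capita crime rate
 * `ZN`: proportion of residential land zoned for lots over 25,000 sq. ft.
 * `INDUS`: proportion of non-retail business acres
 * `CHAS`: 1 if the tract bounds the Charles River, 0 otherwise
 * `NOX`: nitric oxides concentration
 * `RM`: average number of rooms per dwelling
 * `AGE`: proportion of owner-occupied units built before 1940
 * `DIS`: weighted distance to Boston employment centres
 * `RAD`: index of accessibility to radial highways
 * `TAX`: property-tax rate
 * `PTRATIO`: pupil-teacher ratio
 * `B`: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
 * `LSTAT`: proportion of lower-status population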

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim

In [2]:
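# Note: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2,
# so this cell requires an older scikit-learn version.
from sklearn.datasets import load_boston
boston = load_boston()
print(boston.DESCR)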

Boston House Prices dataset

Notes
------
Data Set Characteristics:  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive
    
    :Median Value (attribute 14) is usually the target

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
      

In [9]:
import pandas as pd

dfX = pd.DataFrame(boston.data, columns=boston.feature_names)
dfy = pd.DataFrame(boston.target, columns=["MEDV"])
df = pd.concat([dfX, dfy], axis=1)

# The dataset has 506 rows (index 0 to 505).
df.tail()

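| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 501 | 0.06263 | 0.0 | 11.93 | 0.0 | 0.573 | 6.593 | 69.1 | 2.4786 | 1.0 | 273.0 | 21.0 | 391.99 | 9.67 | 22.4 |
| 502 | 0.04527 | 0.0 | 11.93 | 0.0 | 0.573 | 6.12 | 76.7 | 2.2875 | 1.0 | 273.0 | 21.0 | 396.9 | 9.08 | 20.6 |
| 503 | 0.06076 | 0.0 | 11.93 | 0.0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1.0 | 273.0 | 21.0 | 396.9 | 5.64 | 23.9 |
| 504 | 0.10959 | 0.0 | 11.93 | 0.0 | 0.573 | 6.794 | 89.3 | 2.3889 | 1.0 | 273.0 | 21.0 | 393.45 | 6.48 | 22.0 |
| 505 | 0.04741 | 0.0 | 11.93 | 0.0 | 0.573 | 6.03 | 80.8 | 2.505 | 1.0 | 273.0 | 21.0 | 396.9 | 7.88 | 11.9 |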


In [10]:
df.describe()

| | CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 | 506.0 |
| mean | 3.593761 | 11.363636 | 11.136779 | 0.06917 | 0.554695 | 6.284634 | 68.574901 | 3.795043 | 9.549407 | 408.237154 | 18.455534 | 356.674032 | 12.653063 | 22.532806 |
| std | 8.596783 | 23.322453 | 6.860353 | 0.253994 | 0.115878 | 0.702617 | 28.148861 | 2.10571 | 8.707259 | 168.537116 | 2.164946 | 91.294864 | 7.141062 | 9.197104 |
| min | 0.00632 | 0.0 | 0.46 | 0.0 | 0.385 | 3.561 | 2.9 | 1.1296 | 1.0 | 187.0 | 12.6 | 0.32 | 1.73 | 5.0 |
| 25% | 0.082045 | 0.0 | 5.19 | 0.0 | 0.449 | 5.8855 | 45.025 | 2.100175 | 4.0 | 279.0 | 17.4 | 375.3775 | 6.95 | 17.025 |
| 50% | 0.25651 | 0.0 | 9.69 | 0.0 | 0.538 | 6.2085 | 77.5 | 3.20745 | 5.0 | 330.0 | 19.05 | 391.44 | 11.36 | 21.2 |
| 75% | 3.647423 | 12.5 | 18.1 | 0.0 | 0.624 | 6.6235 | 94.075 | 5.188425 | 24.0 | 666.0 | 20.2 | 396.225 | 16.955 | 25.0 |
| max | 88.9762 | 100.0 | 27.74 | 1.0 | 0.871 | 8.78 | 100.0 | 12.1265 | 24.0 | 711.0 | 22.0 | 396.9 | 37.97 | 50.0 |


### Multiple Linear Regression in Statistics

- For later comparison, we first fit a multiple linear regression with OLS (Ordinary Least Squares), the method taught in statistics.
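
As a rough illustration of what OLS computes, the sketch below solves a least-squares problem directly with NumPy. The data, `true_w`, and `true_b` are synthetic values made up for this example and are not taken from the Boston dataset; the intercept is handled by appending a column of ones, which is one common convention rather than the only one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 100 samples, 3 features, known coefficients.
true_w = np.array([2.0, -1.0, 0.5])
true_b = 4.0
X = rng.normal(size=(100, 3))
y = X @ true_w + true_b + rng.normal(scale=0.01, size=100)

# Append a column of ones so the intercept is estimated as one more coefficient.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# OLS: choose beta minimizing ||X_aug @ beta - y||^2.
beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = beta[:-1], beta[-1]

print("estimated weights  :", w_hat)
print("estimated intercept:", b_hat)
```

With this little noise, the estimates should land very close to `true_w` and `true_b`.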

In [11]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.3, random_state=777)
print(len(y_train))
print(len(y_test))

354
152
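
The 354/152 split follows from simple arithmetic on `test_size=0.3`. The sketch below assumes the test-set size is rounded up and the training set takes the remainder, which matches the counts printed above; the variable names are illustrative.

```python
import math

n_samples = 506
test_size = 0.3

# Assumption: the fractional test size is rounded up, and the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 354 152
```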


In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

model = LinearRegression()
model.fit(X_train, y_train)

# Error on the training set
y_pred = model.predict(X_train)
mse = mean_squared_error(y_train, y_pred)
print("Mean Squared Error :",mse)

# Error and R^2 on the held-out test set
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
r_square = r2_score(y_test, y_pred)
print()
print("Mean Squared Error :",mse)
print("R^2 :",r_square)

Mean Squared Error : 21.025226779107708

Mean Squared Error : 25.357256011213824
R^2 : 0.6999392625722742
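
For reference, both metrics can be computed from their definitions: MSE is the mean of the squared residuals, and R^2 is one minus the ratio of the residual sum of squares to the total sum of squares. The helper `mse_and_r2` and the toy values below are hypothetical, written only to illustrate the formulas.

```python
import numpy as np

def mse_and_r2(y_true, y_pred):
    """Return (MSE, R^2) computed directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return ss_res / len(y_true), 1.0 - ss_res / ss_tot

# Toy values chosen for illustration only.
toy_mse, toy_r2 = mse_and_r2([3, 5, 7, 9], [2.5, 5.0, 7.5, 9.0])
print(toy_mse, toy_r2)
```

Here the residuals are 0.5, 0, -0.5, 0, so the residual sum of squares is 0.5 and the total sum of squares is 20, giving an MSE of 0.125 and an R^2 of about 0.975.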


### Multiple Linear Regression with Neural Net

- This time we implement multiple linear regression as a neural net.
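
In equation form, a single linear layer with one output and a mean-squared-error loss can be written as below. The symbols $\mathbf{w}$, $b$, $d$, and $N$ are notation introduced here for illustration: $d = 13$ is the number of features and $N$ is the number of samples in a batch.

```latex
\hat{y}_i = \mathbf{w}^\top \mathbf{x}_i + b, \qquad \mathbf{x}_i \in \mathbb{R}^{d}, \; d = 13

\mathcal{L}(\mathbf{w}, b) = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2
```

This is the same model form and objective that OLS uses; the neural-net version reaches its parameters by iterative gradient steps rather than by solving for them directly.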

In [13]:
from torch.utils.data import Dataset
from torch.utils.data import DataLoader

class makeData(Dataset):
    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data
        
    def __getitem__(self, index):
        return (self.X_data[index], self.y_data[index])
    
    def __len__(self):
        return len(self.y_data)

train_data = makeData(X_train, y_train)
test_data = makeData(X_test, y_test)

In [15]:
class MultipleLinearRegression(nn.Module):
    
    def __init__(self, feature_size):
        super(MultipleLinearRegression, self).__init__()
        
        # Because x now has multiple features, the only change from
        # simple linear regression is the input size: 1 becomes feature_size.
        self.Layer = nn.Linear(feature_size, 1) 

    def forward(self, inputs):
        x = self.Layer(inputs)
        return x.squeeze(1)
    
    def predict(self, test_input):
        x = self.Layer(test_input)
        return x

In [23]:
# Train on the 354 training samples with batch_size = 200 for 5000 epochs.

EPOCHS = 5000
BATCH_SIZE = 200
FEATURE_SIZE = len(boston.data[0])

model = MultipleLinearRegression(FEATURE_SIZE)
criterion = nn.MSELoss()

optimizer = optim.Adam(model.parameters(), lr=0.01)

train_batch = DataLoader(dataset=train_data, batch_size=BATCH_SIZE, shuffle=True)
for epoch in range(EPOCHS):
    for X_batch, y_batch in train_batch:
        
        # The DataLoader yields float64 tensors built from the NumPy arrays; cast to float32.
        inputs = X_batch.float()
        targets = y_batch.float()
        optimizer.zero_grad()
        y_pred = model(inputs)
        loss = criterion(y_pred, targets)
        loss.backward()
        optimizer.step()
    
    if epoch % 500 == 0:
        print(loss)

tensor(6673.9082, grad_fn=<MseLossBackward>)
tensor(39.9001, grad_fn=<MseLossBackward>)
tensor(32.1195, grad_fn=<MseLossBackward>)
tensor(27.0666, grad_fn=<MseLossBackward>)
tensor(32.8281, grad_fn=<MseLossBackward>)
tensor(21.0514, grad_fn=<MseLossBackward>)
tensor(22.8049, grad_fn=<MseLossBackward>)
tensor(29.4024, grad_fn=<MseLossBackward>)
tensor(30.7984, grad_fn=<MseLossBackward>)
tensor(30.0060, grad_fn=<MseLossBackward>)
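
To make the training schedule concrete, the short calculation below counts the parameter updates implied by the settings in this cell (354 training samples, `BATCH_SIZE = 200`, `EPOCHS = 5000`). The variable names are illustrative. It relies on `DataLoader` keeping the partial last batch, which is its default behavior.

```python
import math

n_train = 354      # training samples after the 70/30 split
batch_size = 200
epochs = 5000

# DataLoader keeps the smaller final batch by default (drop_last=False).
batches_per_epoch = math.ceil(n_train / batch_size)
last_batch_size = n_train - (batches_per_epoch - 1) * batch_size
total_updates = batches_per_epoch * epochs

print(batches_per_epoch, last_batch_size, total_updates)  # 2 154 10000
```

Because `print(loss)` reports only the loss of the final batch in an epoch, and batches are reshuffled every epoch, some fluctuation in the printed values is to be expected.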


In [22]:
y_pred = model.predict(torch.Tensor(X_test)).detach().numpy()
mse = mean_squared_error(y_test, y_pred)
r_square = r2_score(y_test, y_pred)

print("Mean Squared Error :",mse)
print("R^2 :",r_square)

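# On this test split, R^2 (0.727) is higher than the OLS result (0.700).
# Both models have the same linear form, so the gap reflects where Adam stopped
# rather than a more expressive model, and it may not hold for other splits.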

Mean Squared Error : 23.08167879183166
R^2 : 0.726866915084028
