# **Parking Demand Prediction**
# Step 3: Modeling

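## 0. Mission

* 1) Modeling
    * Build models with at least four algorithms
    * Tune the performance of each model
    * Evaluate and compare performance
        * Collect the results in a DataFrame for comparison.
        * Select the best model based on the comparison
        * Performance guide
            * MAE : 115~178
            * MAPE : 0.69~1.51

* 2) Pipeline modeling
    * Create a function
        * Input : receives new data,
        * runs the preprocessing,
        * predicts with the selected model
        * Output : the prediction results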

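## 1. Environment setup

* Detailed requirements
    - Path setup : choose one of the following two options to prepare the folder and load the data.
        * 1) Local run (Anaconda)
            * Download the provided archive and extract it,
            * create a project folder in the Anaconda root directory (usually C:/Users/< ID >) and copy the files into it.
        * 2) Google Colab
            * Create a project folder at the top level of Google Drive and
            * copy the data files into it.
    - Library installation and loading
        * Install the libraries from the requirements.txt file
    - Code that imports the basic required libraries is already provided.
        * Add any other libraries you consider necessary.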


### (1) Path setup

#### 1) Local run (Anaconda)
* If you placed the required files in the project folder and opened this file from there, no separate path setting is needed.

In [113]:
path = 'C:/Users/User/program/mini_pjt/mini_3/실습파일_에이블러용/데이터/'

#### 2) Google Colab run

* Connect Google Drive

In [114]:
# from google.colab import drive
# drive.mount('/content/drive')

In [115]:
#path = '/content/drive/MyDrive/project/'

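### (2) Installing and loading libraries

#### 1) Installation

* Place requirements.txt in the location below and run the following code.
    * Local : run the next code cell
    * Google Colab : copy requirements.txt into the [Files] tab on the left, then run the next code cell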

In [116]:
#!pip install -r requirements.txt

#### 2) Loading libraries

* **Detailed requirements**
    - Code that imports the basic required libraries is already provided.
    - Add any other libraries you consider necessary.

In [117]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import joblib 

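# Load the required libraries
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
# Import only the metrics used below instead of a wildcard import
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_absolute_percentage_error, r2_score)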

#### 3) Creating functions

* Plot comparing actual values with model predictions

In [118]:
def model_plot(y, pred) : 
    plt.figure(figsize = (12,8))
    plt.scatter(y, pred, alpha=0.4)

    x_l = np.linspace( y.min(), y.max(), 100)
    plt.plot(x_l, x_l, color = 'black', alpha=0.4)

    plt.xlabel("Actual")
    plt.ylabel("Predicted")
    plt.grid()
    plt.show()

### (3) Loading the data

* **Detailed requirements**
    - Load the file saved during the exploratory data analysis step.

In [119]:
train = joblib.load(path + 'train2.pkl')

In [120]:
train.head()

Unnamed: 0,단지코드,총세대수,전용면적별세대수,준공일자,건물형태,난방방식,승강기설치여부,전용면적,공급면적(공용),임대보증금,...,40이하,50이하,60이하,70이하,80이하,90이하,100이하,110이하,120이하,121이상
0,C0001,78,35,2013,계단식,가스난방,설치,51.89,19.2603,50758000,...,0.0,0.0,4393.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,C0001,78,43,2013,계단식,가스난방,설치,59.93,22.2446,63166000,...,0.0,0.0,4393.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,C0002,35,26,2013,복도식,가스난방,설치,27.75,16.5375,63062000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,C0002,35,9,2013,복도식,가스난방,설치,29.08,17.3302,63062000,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,C0003,88,7,2013,계단식,가스난방,설치,59.47,21.9462,72190000,...,0.0,0.0,5244.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0


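## 2. Modeling

* **Detailed requirements**
    * Preprocessing for modeling : NaN handling, data splitting, scaling, dummy encoding, etc.
    * Build models with at least four algorithms
    * Tune the performance of each model
        * Linear regression : also use one or two of Ridge, Lasso, and ElasticNet and compare performance
        * Random forest : model with the default settings
        * Other algorithms : optimize performance with grid search tuning
    * Select the best model through performance comparison
        * Validation performance guideline : MAE around 120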
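
The grid search tuning named in the requirements could look like the minimal sketch below. It uses synthetic data from `make_regression` as a stand-in for the parking data, and the grid values and the names `X_demo`, `y_demo`, `grid` are illustrative assumptions, not recommended settings.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): replace with the real x, y
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2024)

# Illustrative grid; the candidate values are assumptions
param_grid = {'n_neighbors': [3, 5, 7, 9], 'weights': ['uniform', 'distance']}

grid = GridSearchCV(
    KNeighborsRegressor(),
    param_grid,
    cv=5,
    scoring='neg_mean_absolute_error',  # scikit-learn maximizes scores, so MAE is negated
)
grid.fit(X_tr, y_tr)

best_params = grid.best_params_
# With the default refit=True, predict() uses the best estimator refit on all of X_tr
val_mae = mean_absolute_error(y_va, grid.predict(X_va))
print(best_params, val_mae)
```

Scoring on negated MAE keeps the search aligned with the MAE guideline rather than the default R2 score.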

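### (1) Data preprocessing


* **Detailed requirements**
    * Perform the preprocessing needed for modeling.
        * x, y split (remove the 단지코드 column, which acts as an identifier, then split into x and y)
        * Remove 단지코드
        * Dummy encoding
        * Scaling (if needed)
        * train : validation
            * Use an appropriate ratio
            * Set random_state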

#### 1) x, y split

In [121]:
train.columns

Index(['단지코드', '총세대수', '전용면적별세대수', '준공일자', '건물형태', '난방방식', '승강기설치여부', '전용면적',
       '공급면적(공용)', '임대보증금', '임대료', '실차량수', '30이하', '40이하', '50이하', '60이하',
       '70이하', '80이하', '90이하', '100이하', '110이하', '120이하', '121이상'],
      dtype='object')

In [122]:
target = '실차량수'

In [123]:
train.drop('단지코드', axis=1, inplace=True)

In [124]:
x = train.drop(target, axis=1)
y = train.loc[:, target]

#### 2) NaN handling

In [125]:
# Already handled in step 1 (data preprocessing)
train.isna().sum() 

총세대수        0
전용면적별세대수    0
준공일자        0
건물형태        0
난방방식        0
승강기설치여부     0
전용면적        0
공급면적(공용)    0
임대보증금       0
임대료         0
실차량수        0
30이하        0
40이하        0
50이하        0
60이하        0
70이하        0
80이하        0
90이하        0
100이하       0
110이하       0
120이하       0
121이상       0
dtype: int64

#### 3) Dummy encoding

In [126]:
train

Unnamed: 0,총세대수,전용면적별세대수,준공일자,건물형태,난방방식,승강기설치여부,전용면적,공급면적(공용),임대보증금,임대료,...,40이하,50이하,60이하,70이하,80이하,90이하,100이하,110이하,120이하,121이상
0,78,35,2013,계단식,가스난방,설치,51.89,19.2603,50758000,620370,...,0.00,0.0,4393.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,78,43,2013,계단식,가스난방,설치,59.93,22.2446,63166000,665490,...,0.00,0.0,4393.14,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,35,26,2013,복도식,가스난방,설치,27.75,16.5375,63062000,458640,...,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,35,9,2013,복도식,가스난방,설치,29.08,17.3302,63062000,481560,...,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,88,7,2013,계단식,가스난방,설치,59.47,21.9462,72190000,586540,...,0.00,0.0,5244.69,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152,956,956,1994,복도식,가스난방,설치,26.37,12.7500,9931000,134540,...,0.00,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1153,120,66,2020,복도식,난방,설치,24.83,15.1557,2129000,42350,...,1827.36,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1154,120,54,2020,복도식,난방,설치,33.84,20.6553,2902000,57730,...,1827.36,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1155,447,149,1994,복도식,유류난방,설치,26.37,13.3800,7134000,118880,...,9333.36,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [127]:
drop_dumm = ['건물형태', '난방방식', '승강기설치여부']
x = pd.get_dummies(x, columns=drop_dumm, drop_first=True, dtype=int)

In [128]:
x

Unnamed: 0,총세대수,전용면적별세대수,준공일자,전용면적,공급면적(공용),임대보증금,임대료,30이하,40이하,50이하,...,100이하,110이하,120이하,121이상,건물형태_복도식,건물형태_혼합식,난방방식_난방,"난방방식_난방,",난방방식_유류난방,승강기설치여부_설치
0,78,35,2013,51.89,19.2603,50758000,620370,0.00,0.00,0.0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,1
1,78,43,2013,59.93,22.2446,63166000,665490,0.00,0.00,0.0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,1
2,35,26,2013,27.75,16.5375,63062000,458640,983.22,0.00,0.0,...,0.0,0.0,0.0,0.0,1,0,0,0,0,1
3,35,9,2013,29.08,17.3302,63062000,481560,983.22,0.00,0.0,...,0.0,0.0,0.0,0.0,1,0,0,0,0,1
4,88,7,2013,59.47,21.9462,72190000,586540,0.00,0.00,0.0,...,0.0,0.0,0.0,0.0,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1152,956,956,1994,26.37,12.7500,9931000,134540,25209.72,0.00,0.0,...,0.0,0.0,0.0,0.0,1,0,0,0,0,1
1153,120,66,2020,24.83,15.1557,2129000,42350,1638.78,1827.36,0.0,...,0.0,0.0,0.0,0.0,1,0,1,0,0,1
1154,120,54,2020,33.84,20.6553,2902000,57730,1638.78,1827.36,0.0,...,0.0,0.0,0.0,0.0,1,0,1,0,0,1
1155,447,149,1994,26.37,13.3800,7134000,118880,3929.13,9333.36,0.0,...,0.0,0.0,0.0,0.0,1,0,0,0,1,1


#### 4) train : val split

In [129]:
x_train, x_val, y_train, y_val = train_test_split(x, y, test_size=0.2, random_state=2024)

#### 5) Scaling

In [130]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_s = scaler.transform(x_train)
x_val_s = scaler.transform(x_val)
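
The cell above fits the scaler on the training split only. One way to keep that guarantee automatically, for example inside cross-validation, is to wrap the scaler and the model in a scikit-learn pipeline. The sketch below is an assumption about how this could be organized; it uses synthetic data and the hypothetical names `knn_pipe` and `demo_scaler`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): replace with the real x, y
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2024)

# The pipeline fits MinMaxScaler on the training data, then fits KNN on the scaled data
knn_pipe = make_pipeline(MinMaxScaler(), KNeighborsRegressor())
knn_pipe.fit(X_tr, y_tr)
pred = knn_pipe.predict(X_va)  # validation data is scaled with the training min/max

# Manual equivalent of the two steps, for comparison
demo_scaler = MinMaxScaler().fit(X_tr)
manual = (
    KNeighborsRegressor()
    .fit(demo_scaler.transform(X_tr), y_tr)
    .predict(demo_scaler.transform(X_va))
)
```

Both routes give the same predictions; the pipeline simply bundles the two steps into one object that can be saved with `joblib` and reused on new data.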

### (2) Algorithm 1

In [131]:
model_lr = LinearRegression()
model_rf = RandomForestRegressor()
model_knn = KNeighborsRegressor()
model_dt = DecisionTreeRegressor()

In [132]:
# MAE : 115~178
# MAPE : 0.69~1.51
result = {}
def acc_print(y_, p_):
    print('MAE:', mean_absolute_error(y_, p_)) # mean absolute error; lower is better
    print('='*60)
    print('MSE:', mean_squared_error(y_, p_)) # mean of the squared errors; lower is better
    print('='*60)
    print('MAPE:', mean_absolute_percentage_error(y_, p_)) # mean absolute percentage error, as a fraction; lower is better
    print('='*60)
    print('R2:', r2_score(y_, p_)) # coefficient of determination; closer to 1 is better
    print('='*60)

    return r2_score(y_, p_)
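
As a small worked example of the two guideline metrics, the sketch below computes MAE and MAPE by hand on three made-up values and compares them with the scikit-learn functions. The numbers are illustrative only.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error

# Made-up values for illustration
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# MAE: mean of |error| = (10 + 20 + 0) / 3
mae_manual = np.mean(np.abs(y_true - y_pred))
# MAPE: mean of |error| / |actual| = (0.1 + 0.1 + 0) / 3, a fraction rather than a percent
mape_manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(mae_manual, mean_absolute_error(y_true, y_pred))
print(mape_manual, mean_absolute_percentage_error(y_true, y_pred))
```

Because scikit-learn reports MAPE as a fraction, a value such as 0.43 corresponds to an average error of 43% of the actual value.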

In [133]:
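def modle_mk(modle_select, x_train, y_train, x_val, y_val):
    # Map each model object to the display name used as the key in `result`
    if modle_select is model_rf:
        name = 'Random Forest'
    elif modle_select is model_lr:
        name = 'LinearRegression'
    elif modle_select is model_knn:
        name = 'KNeighborsRegressor'
    elif modle_select is model_dt:
        name = 'DecisionTreeRegressor'
    else:
        # Fall back to the class name so `name` is always defined
        name = type(modle_select).__name__

    print("model_name:", name)
    model = modle_select
    model.fit(x_train, y_train)
    p1 = model.predict(x_val)

    # Print the metrics and store the validation R2 under the model name
    r = acc_print(y_val, p1)
    result[name] = r

    return result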

In [134]:
# LinearRegression 
modle_mk(model_lr, x_train, y_train, x_val, y_val)

model_name: LinearRegression
MAE: 135.93477162223348
MSE: 37976.11884223037
MAPE: 0.4340330002480123
R2: 0.7487593909438536


{'LinearRegression': 0.7487593909438536}
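
The requirements also ask for linear regression to be compared with one or two regularized variants. A minimal sketch of that comparison follows, assuming synthetic stand-in data and `alpha=1.0` as an untuned placeholder value.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): replace with the real x, y
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=15.0, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2024)

# alpha=1.0 is a placeholder; in practice it would be tuned
linear_models = {
    'LinearRegression': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=1.0),
}

# Fit each model on the training split and record its validation MAE
linear_mae = {}
for name, m in linear_models.items():
    m.fit(X_tr, y_tr)
    linear_mae[name] = mean_absolute_error(y_va, m.predict(X_va))

print(linear_mae)
```

Ridge and Lasso penalize coefficient size, so they are usually fit on scaled features such as `x_train_s` so that the penalty treats all columns comparably.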

### (3) Algorithm 2

In [135]:
# Random Forest
modle_mk(model_rf, x_train, y_train, x_val, y_val) 

model_name: Random Forest
MAE: 40.421293103448285
MSE: 7094.626378448276
MAPE: 0.10911873855382453
R2: 0.9530637067007224


{'LinearRegression': 0.7487593909438536, 'Random Forest': 0.9530637067007224}

### (4) Algorithm 3

In [136]:
# DecisionTreeRegressor
modle_mk(model_dt, x_train, y_train, x_val, y_val) 

model_name: DecisionTreeRegressor
MAE: 26.30603448275862
MSE: 10762.081896551725
MAPE: 0.07169048464391965
R2: 0.9288007281198254


{'LinearRegression': 0.7487593909438536,
 'Random Forest': 0.9530637067007224,
 'DecisionTreeRegressor': 0.9288007281198254}

### (5) Algorithm 4

In [137]:
#KNeighborsRegressor
modle_mk(model_knn, x_train_s, y_train, x_val_s, y_val) 

model_name: KNeighborsRegressor
MAE: 99.89224137931035
MSE: 24580.13155172414
MAPE: 0.26363162217267744
R2: 0.8373839294270367


{'LinearRegression': 0.7487593909438536,
 'Random Forest': 0.9530637067007224,
 'DecisionTreeRegressor': 0.9288007281198254,
 'KNeighborsRegressor': 0.8373839294270367}

### (6) Performance comparison

* Detailed requirements
    - Measure each model's performance on the test data, store the results in a DataFrame, and compare them.

In [138]:
result

{'LinearRegression': 0.7487593909438536,
 'Random Forest': 0.9530637067007224,
 'DecisionTreeRegressor': 0.9288007281198254,
 'KNeighborsRegressor': 0.8373839294270367}
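
The `result` dictionary holds only R2, while the mission asks for a DataFrame comparison that includes MAE and MAPE. One possible way to build such a table is sketched below on synthetic data; the names `candidates` and `result_df` are hypothetical.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption): replace with the real x, y
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
y_demo = y_demo - y_demo.min() + 1.0  # shift the target to be positive so MAPE is well defined
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2024)

candidates = {
    'LinearRegression': LinearRegression(),
    'DecisionTreeRegressor': DecisionTreeRegressor(random_state=0),
    'Random Forest': RandomForestRegressor(random_state=0),
}

# One row of metrics per model
rows = []
for name, m in candidates.items():
    pred = m.fit(X_tr, y_tr).predict(X_va)
    rows.append({
        'model': name,
        'MAE': mean_absolute_error(y_va, pred),
        'MAPE': mean_absolute_percentage_error(y_va, pred),
        'R2': r2_score(y_va, pred),
    })

# Sort so the model with the lowest MAE comes first
result_df = pd.DataFrame(rows).set_index('model').sort_values('MAE')
print(result_df)
```

Sorting by MAE puts the candidate closest to the MAE guideline at the top of the table.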

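## 3. Building the pipeline

* **Detailed requirements**
    * new data : read data02_test.csv and store it
    * Create the pipeline functions
        * data pipeline function
        * ML pipeline function

### (1) Loading the new data
* **Detailed requirements**
    * Read test.xlsx and store it as new_data
    * This data has the same structure as the original data. Use it to run the preprocessing and prediction.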

In [139]:
new_data = joblib.load(path + 'test2.pkl')

In [140]:
new_data

Unnamed: 0,단지코드,총세대수,전용면적별세대수,지역,준공일자,건물형태,난방방식,승강기설치여부,전용면적,공급면적(공용),임대보증금,임대료,실차량수
0,C0005,20,6,서울,2013,복도식,개별가스난방,전체동 설치,17.53,11.7251,50449000,263710,21
1,C0005,20,10,서울,2013,복도식,개별가스난방,전체동 설치,24.71,16.5275,52743000,321040,21
2,C0005,20,4,서울,2013,복도식,개별가스난방,전체동 설치,26.72,17.8720,53890000,332510,21
3,C0017,822,228,대구경북,2013,계단식,지역난방,전체동 설치,51.87,20.9266,29298000,411200,797
4,C0017,822,56,대구경북,2013,계단식,지역난방,전체동 설치,59.85,24.1461,38550000,462600,797
...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,C0353,768,90,대전충남,2014,복도식,중앙난방,전체동 설치,40.32,16.5100,8848000,122290,123
100,C0360,588,98,서울,2014,복도식,지역난방,전체동 설치,51.37,21.5569,183228000,0,559
101,C0360,588,186,서울,2013,복도식,지역난방,전체동 설치,51.39,21.5652,183228000,0,559
102,C0360,588,102,서울,2013,복도식,지역난방,전체동 설치,59.76,25.0776,215057000,0,559


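### (2) Building the data pipeline
* **Detailed requirements**
    * Create the data pipeline function.
        * Input : new_data
        * Processing :
            * Apply, in order, the preprocessing code from steps 1 (data preprocessing) and 2 (exploratory data analysis).
            * Preprocessing for modeling : remove the target, handle NaN, dummy encoding, (scaling), etc.
        * Output : the fully preprocessed DataFrame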
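
The data pipeline described above can be sketched as a single function. This is a hypothetical helper, not the notebook's own implementation: the name `data_pipeline`, its parameters, and the toy rows are assumptions. It aligns the encoded new data to the training columns with `reindex`, and it assumes the category labels in the new data have already been cleaned to match those used in training.

```python
import pandas as pd

def data_pipeline(new_df, train_columns, drop_cols, dummy_cols, target=None):
    """Hypothetical helper: drop unused columns, one-hot encode, align to training columns."""
    df = new_df.drop(columns=[c for c in drop_cols if c in new_df.columns])
    if target is not None and target in df.columns:
        df = df.drop(columns=target)
    # No drop_first here: the training column list decides which dummies are kept
    df = pd.get_dummies(df, columns=dummy_cols, dtype=int)
    # Dummies absent from the new data are added as 0; columns unseen in training are dropped
    return df.reindex(columns=train_columns, fill_value=0)

# Toy example with made-up rows
train_columns = ['총세대수', '건물형태_복도식', '건물형태_혼합식']
toy = pd.DataFrame({
    '단지코드': ['C1', 'C2'],
    '총세대수': [20, 822],
    '건물형태': ['복도식', '계단식'],
    '실차량수': [21, 797],
})
out = data_pipeline(toy, train_columns, drop_cols=['단지코드'],
                    dummy_cols=['건물형태'], target='실차량수')
print(out)
```

Aligning to the saved training columns keeps the feature order and count identical to what the model saw at fit time, even when a category is missing from the new data.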
    

In [141]:
drop_cols = ['단지코드', '지역']
new_data.drop(drop_cols, axis=1, inplace=True)

In [142]:
new_data

Unnamed: 0,총세대수,전용면적별세대수,준공일자,건물형태,난방방식,승강기설치여부,전용면적,공급면적(공용),임대보증금,임대료,실차량수
0,20,6,2013,복도식,개별가스난방,전체동 설치,17.53,11.7251,50449000,263710,21
1,20,10,2013,복도식,개별가스난방,전체동 설치,24.71,16.5275,52743000,321040,21
2,20,4,2013,복도식,개별가스난방,전체동 설치,26.72,17.8720,53890000,332510,21
3,822,228,2013,계단식,지역난방,전체동 설치,51.87,20.9266,29298000,411200,797
4,822,56,2013,계단식,지역난방,전체동 설치,59.85,24.1461,38550000,462600,797
...,...,...,...,...,...,...,...,...,...,...,...
99,768,90,2014,복도식,중앙난방,전체동 설치,40.32,16.5100,8848000,122290,123
100,588,98,2014,복도식,지역난방,전체동 설치,51.37,21.5569,183228000,0,559
101,588,186,2013,복도식,지역난방,전체동 설치,51.39,21.5652,183228000,0,559
102,588,102,2013,복도식,지역난방,전체동 설치,59.76,25.0776,215057000,0,559


In [143]:
dumm_cols = ['건물형태', '난방방식', '승강기설치여부']
new_data = pd.get_dummies(new_data, columns=dumm_cols, drop_first=True, dtype=int)
new_data.drop('실차량수', axis=1, inplace=True)
new_data

Unnamed: 0,총세대수,전용면적별세대수,준공일자,전용면적,공급면적(공용),임대보증금,임대료,건물형태_복도식,건물형태_혼합식,난방방식_개별난방,난방방식_중앙난방,난방방식_지역가스난방,난방방식_지역난방
0,20,6,2013,17.53,11.7251,50449000,263710,1,0,0,0,0,0
1,20,10,2013,24.71,16.5275,52743000,321040,1,0,0,0,0,0
2,20,4,2013,26.72,17.8720,53890000,332510,1,0,0,0,0,0
3,822,228,2013,51.87,20.9266,29298000,411200,0,0,0,0,0,1
4,822,56,2013,59.85,24.1461,38550000,462600,0,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
99,768,90,2014,40.32,16.5100,8848000,122290,1,0,0,1,0,0
100,588,98,2014,51.37,21.5569,183228000,0,1,0,0,0,0,1
101,588,186,2013,51.39,21.5652,183228000,0,1,0,0,0,0,1
102,588,102,2013,59.76,25.0776,215057000,0,1,0,0,0,0,1


In [144]:
drop_cols = ['30이하', '40이하', '50이하', '60이하', '70이하', '80이하', '90이하', '100이하', '110이하', '120이하', '121이상']
x_train = x_train.drop(drop_cols, axis=1)
x_train

Unnamed: 0,총세대수,전용면적별세대수,준공일자,전용면적,공급면적(공용),임대보증금,임대료,건물형태_복도식,건물형태_혼합식,난방방식_난방,"난방방식_난방,",난방방식_유류난방,승강기설치여부_설치
582,1308,177,2006,46.760,19.0907,24665000,135960,1,0,0,0,0,1
601,384,42,2007,39.850,13.3232,19819000,110110,0,1,0,0,0,1
1026,60,31,2001,59.501,13.0100,10405000,86700,0,0,0,0,0,1
63,532,70,2016,59.670,23.6057,41054000,509410,0,0,1,0,0,1
548,714,224,2003,51.490,21.1473,17972000,234890,0,1,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...
183,307,147,2011,84.560,24.0516,0,0,0,0,0,0,0,1
446,291,91,2012,51.550,28.6965,50775000,422950,1,0,0,0,0,1
539,302,108,2005,59.570,18.1137,26082000,234340,0,0,0,0,0,1
640,496,206,2004,46.220,21.0120,18319000,102660,1,0,0,0,0,1


### (3) test
* **Detailed requirements**
    * Preprocess new_data and print the prediction results.

In [145]:
modle_mk(model_dt, x_train, y_train, new_data, y_val) 

model_name: DecisionTreeRegressor


ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- 난방방식_개별난방
- 난방방식_중앙난방
- 난방방식_지역가스난방
- 난방방식_지역난방
Feature names seen at fit time, yet now missing:
- 난방방식_난방
- 난방방식_난방,
- 난방방식_유류난방
- 승강기설치여부_설치
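
The error lists two groups of feature names: those unseen at fit time and those seen at fit time but now missing. The heating labels in new_data (for example 개별가스난방, 지역난방) appear to differ from the cleaned labels in the training data (가스난방, 난방, 유류난방). In addition, 승강기설치여부 seems to hold a single value in new_data, so `drop_first=True` leaves no dummy column for it. The sketch below is a hypothetical helper, `compare_columns`, that reproduces the two lists before calling `predict`; the column lists are shortened for illustration.

```python
def compare_columns(fit_columns, new_columns):
    """Hypothetical helper: list feature names that differ between fit and predict."""
    fit_set, new_set = set(fit_columns), set(new_columns)
    unseen = sorted(new_set - fit_set)   # present now, not seen at fit time
    missing = sorted(fit_set - new_set)  # seen at fit time, missing now
    return unseen, missing

# Shortened column lists for illustration
fit_cols = ['총세대수', '난방방식_난방', '난방방식_유류난방', '승강기설치여부_설치']
new_cols = ['총세대수', '난방방식_중앙난방', '난방방식_지역난방']

unseen, missing = compare_columns(fit_cols, new_cols)
print('unseen:', unseen)
print('missing:', missing)
```

Both lists need to be empty before prediction. Note also that `y_val` comes from the training split, so it does not correspond to the rows of new_data; evaluating on new_data would need its own 실차량수 column, saved before that column is dropped.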
