# House Price Prediction Competition

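#### Findings from earlier EDA and shared code
- No missing values
- Few samples and few features, so creating derived features is likely to matter
- The `target` variable is skewed, so log-transforming it before training should make modeling easier
- The dataset consists of 4 quality features, 3 year features, and 6 area/count features
- With so few samples (in relative terms), it is probably better not to remove outliers
- The categories of the quality variables need to be mapped to numbers via label encoding (so that they can be compared by magnitude)
- Stacking models work well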
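
The skewness point above can be illustrated with a small, dependency-free sketch. The prices below are hypothetical (not taken from the competition data), and `skewness` is a helper written only for this illustration; `math.log1p` and `math.expm1` behave like the NumPy functions of the same name used later in this notebook.

```python
import math

def skewness(values):
    """Population skewness: mean of cubed z-scores (0 for a symmetric sample)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / std) ** 3 for v in values) / n

# Hypothetical sale prices with a long right tail (assumption, for illustration only).
prices = [90_000, 110_000, 125_000, 140_000, 160_000, 185_000, 230_000, 320_000, 610_000]

log_prices = [math.log1p(p) for p in prices]    # forward transform applied to the target
restored = [math.expm1(v) for v in log_prices]  # inverse transform applied to predictions

print(f"skew before log1p: {skewness(prices):.3f}")
print(f"skew after  log1p: {skewness(log_prices):.3f}")
```

The transformed values are noticeably less skewed, and `expm1` recovers the original prices, which is why predictions can be scored on the original scale afterwards.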

#### Variable groups
1) Quality variables  
- OverallQual : overall material and finish quality  
- ExterQual : exterior material quality  
- KitchenQual : kitchen quality  
- BsmtQual : basement height (quality) 


2) Year variables  
- YearBuilt : year of construction  
- YearRemodAdd : year of remodeling  
- GarageYrBlt : year the garage was built  

3) Area and count variables  
- TotalBsmtSF : basement area   
- 1stFlrSF : 1st-floor area   
- GrLivArea : above-ground living area  
- FullBath : number of full bathrooms above ground  
- GarageCars: garage capacity in cars  
- GarageArea: garage area   
 


#### Main reference code 
- 기세현, 〈[GB + RF + CB + NGB / Public : 0.09599](https://dacon.io/competitions/official/235869/codeshare/4266?page=1&dtype=recent)〉
- 다복, 〈[Comparing RandomForest, GBM, DecisionTree, XGB, LGB in one go](https://dacon.io/competitions/official/235869/codeshare/4256?page=1&dtype=recent)〉
- yun99, 〈[Simple EDA + pycaret ensemble / Public 0.09795](https://dacon.io/competitions/official/235869/codeshare/4267?page=1&dtype=recent)〉

Thanks as well to everyone else who shared their code 🙇‍♀️



# 1. Loading Data and Libraries

In [None]:
import pandas as pd
import os
import os.path as osp
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import warnings

warnings.filterwarnings('ignore')

In [4]:
data_dir = 'D:/_data/dacon/housing/'

train = pd.read_csv(osp.join(data_dir, 'train.csv'))
test = pd.read_csv(osp.join(data_dir, 'test.csv'))

train.drop('id', axis=1, inplace=True) # drop the id column
test.drop('id', axis=1, inplace=True) # drop the id column
print(train.shape, test.shape)

train.head()


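# 2. Preprocessing
CHECK
- The training set has 1,350 samples and the test set has 1,350, which is not many.
- There are 13 features in total; the training set also contains the `target` variable to predict.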

In [None]:
# Remove duplicate rows
print("Before :", train.shape)
train = train.drop_duplicates()
print("After :", train.shape)

Before : (1350, 14)
After : (1349, 14)


- Row 254 was found to have an outlier in `Garage Yr Blt`; its value 2207 was corrected to 2007

In [None]:
# train[train['Garage Yr Blt']> 2050] # 254
train.loc[254, 'Garage Yr Blt'] = 2007

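- The quality categories are ordered Poor (Po) → Fair (Fa) → Typical/Average (TA) → Good (Gd) → Excellent (Ex), so they are mapped to the values 1 to 5
- `sklearn.preprocessing.LabelEncoder` could be used, but to make sure better quality gets a higher value, the mapping is done directly with the `map` function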
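
To see why an explicit mapping is preferred, the sketch below compares it with alphabetical coding, which is how an encoder that sorts its class labels (as `LabelEncoder` does) would number the same grades. It uses plain dictionaries instead of pandas so that it stays self-contained; the variable names are illustrative only.

```python
# Quality grades ordered from worst to best, as listed above.
grades = ['Po', 'Fa', 'TA', 'Gd', 'Ex']

# Explicit ordinal mapping: a higher code means better quality.
ordinal = {grade: rank for rank, grade in enumerate(grades, start=1)}

# Alphabetical coding: labels are sorted first, then numbered from 0.
alphabetical = {grade: code for code, grade in enumerate(sorted(grades))}

print(ordinal)       # {'Po': 1, 'Fa': 2, 'TA': 3, 'Gd': 4, 'Ex': 5}
print(alphabetical)  # {'Ex': 0, 'Fa': 1, 'Gd': 2, 'Po': 3, 'TA': 4}
```

With alphabetical codes, Excellent receives the lowest value and Poor ranks above Good, so the numeric order no longer reflects quality.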

In [None]:
# Map the quality variables to numbers
qual_cols = train.select_dtypes(include='object').columns
def label_encoder(df_, qual_cols):
  df = df_.copy()
  mapping={
      'Ex':5, 'Gd':4, 'TA':3, 'Fa':2, 'Po':1
  }
  for col in qual_cols :
    df[col] = df[col].map(mapping)
  return df

train = label_encoder(train, qual_cols)
test = label_encoder(test, qual_cols)
train.head()

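| | Overall Qual | Gr Liv Area | Exter Qual | Garage Cars | Garage Area | Kitchen Qual | Total Bsmt SF | 1st Flr SF | Bsmt Qual | Full Bath | Year Built | Year Remod/Add | Garage Yr Blt | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 2392 | 5 | 3 | 968 | 5 | 2392 | 2392 | 5 | 2 | 2003 | 2003 | 2003 | 386250 |
| 1 | 7 | 1352 | 4 | 2 | 466 | 4 | 1352 | 1352 | 5 | 2 | 2006 | 2007 | 2006 | 194000 |
| 2 | 5 | 900 | 3 | 1 | 288 | 3 | 864 | 900 | 3 | 1 | 1967 | 1967 | 1967 | 123000 |
| 3 | 5 | 1174 | 3 | 2 | 576 | 4 | 680 | 680 | 3 | 1 | 1900 | 2006 | 2000 | 135000 |
| 4 | 7 | 1958 | 4 | 3 | 936 | 4 | 1026 | 1026 | 4 | 2 | 2005 | 2005 | 2005 | 250000 |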


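## How the main derived-feature ideas came about


#### 1. What is the difference between above-ground living area and 1st-floor area?   
- US home areas are measured in square feet (SF)  
- In every row, the above-ground living area is at least as large as the 1st-floor area.   
- Above-ground living area (`Gr Liv Area`) is the total floor area above ground, calculated <U>excluding the garage and basement</U>. 
- Therefore, when `Gr Liv Area - 1st Flr SF > 0` the house has a 2nd floor, so let's create the 2nd-floor area and the presence of a 2nd floor as new features.

cf. Reference: calculating US house sizes [link](https://blog.daum.net/jk26922/28)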

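#### 2. Area-related features look important for predicting house prices
- Above-ground living area excludes the basement and garage, so let's add them back to create a derived feature for the total area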

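#### 3. Some houses have identical above-ground living area and 1st-floor area, and others do not. What does this mean?
The garage is usually on the 1st floor (there may be exceptions, but thinking of a typical US house...)  
If the above-ground area excluding the garage equals the 1st-floor area, might the garage be outside the house?  
Let's create a derived feature that is 1 when the above-ground living area and the 1st-floor area differ and 0 when they are equal (the case in which, by the reasoning above, the garage would be outside)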

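## Derived features
- 2nd-floor area `2nd flr SF` = above-ground living area - 1st-floor area
- 2nd-floor flag `2nd flr` = 1 if (above-ground living area - 1st-floor area > 0), otherwise 0
- Total area `Total SF` = above-ground living area + basement area + garage area
- Garage inside/outside `Garage InOut` = 1 if (above-ground living area != 1st-floor area), 0 if (above-ground living area == 1st-floor area) 
- Remodeling year gap `Year Gap Remod` = remodeling year - construction year
- Area per garage slot `Car Area` = garage area / garage capacity in cars
- Quality sum `Sum Qual` = overall + kitchen + exterior quality (`Bsmt Qual` is not part of the sum in the function below) 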
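
As a worked example, the formulas above can be applied by hand to row 3 of the encoded training preview (`train.head()`). This is only an illustrative sketch using a plain dictionary rather than the DataFrame; `Sum Qual` here adds the three columns that `feature_eng` sums.

```python
# Row 3 of the encoded training preview (values copied from the train.head() output).
row = {
    'Overall Qual': 5, 'Gr Liv Area': 1174, 'Exter Qual': 3,
    'Garage Cars': 2, 'Garage Area': 576, 'Kitchen Qual': 4,
    'Total Bsmt SF': 680, '1st Flr SF': 680, 'Bsmt Qual': 3,
    'Year Built': 1900, 'Year Remod/Add': 2006,
}

second_floor_sf = row['Gr Liv Area'] - row['1st Flr SF']

derived = {
    '2nd flr SF': second_floor_sf,
    '2nd flr': 1 if second_floor_sf > 0 else 0,
    'Total SF': row['Gr Liv Area'] + row['Garage Area'] + row['Total Bsmt SF'],
    'Garage InOut': 1 if row['Gr Liv Area'] != row['1st Flr SF'] else 0,
    'Year Gap Remod': row['Year Remod/Add'] - row['Year Built'],
    'Car Area': row['Garage Area'] / row['Garage Cars'],
    'Sum Qual': row['Exter Qual'] + row['Kitchen Qual'] + row['Overall Qual'],
}

for name, value in derived.items():
    print(f'{name:<15} {value}')
```

For this house the 2nd-floor area is 494 SF, the total area is 2430 SF, and the remodel came 106 years after construction.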

In [None]:
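def feature_eng(data_):
  data = data_.copy()
  # Years between construction and the last remodel
  data['Year Gap Remod'] = data['Year Remod/Add'] - data['Year Built']
  # Garage area per car slot (inf/NaN if 'Garage Cars' is 0)
  data['Car Area'] = data['Garage Area'] / data['Garage Cars']
  # Living area above the 1st floor, and whether any exists
  data['2nd flr SF'] = data['Gr Liv Area'] - data['1st Flr SF']
  data['2nd flr'] = (data['2nd flr SF'] > 0).astype(int)
  # Above-ground living area plus garage and basement
  data['Total SF'] = data[['Gr Liv Area', 'Garage Area', 'Total Bsmt SF']].sum(axis=1)
  # Sum of the exterior, kitchen, and overall quality scores
  data['Sum Qual'] = data[['Exter Qual', 'Kitchen Qual', 'Overall Qual']].sum(axis=1)
  # 1 when the above-ground living area differs from the 1st-floor area
  data['Garage InOut'] = (data['Gr Liv Area'] != data['1st Flr SF']).astype(int)
  return data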

train = feature_eng(train)
test = feature_eng(test)
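
`Car Area` divides by `Garage Cars`. Whether this dataset contains houses with zero garage slots is not stated here; if it did, the pandas division would produce `inf` or `NaN`, which most scikit-learn estimators reject. A minimal guard could look like the sketch below, where the helper name `car_area` and the fallback value 0.0 are assumptions made for illustration.

```python
def car_area(garage_area, garage_cars):
    """Garage area per car slot; falls back to 0.0 when there are no slots."""
    if garage_cars == 0:
        return 0.0  # assumed fallback: no garage capacity means no per-slot area
    return garage_area / garage_cars

print(car_area(576, 2))  # 288.0
print(car_area(0, 0))    # 0.0
```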

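# 3. Modeling
- Starting from the modeling shared by 기세현, linear models (LinearRegression, Ridge, Lasso, ElasticNet) were added, and the final result was produced after comparing performance on the validation sets. 
 
Thanks again to 기세현 for sharing the code!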

In [None]:
# ! pip install catboost
# ! pip install ngboost

In [None]:
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from catboost import CatBoostRegressor, Pool
from ngboost import NGBRegressor
from sklearn.metrics import make_scorer
from sklearn.model_selection import KFold

In [None]:
# Define the evaluation metric (NMAE: MAE divided by the mean absolute true value)
def NMAE(true, pred):
    mae = np.mean(np.abs(true-pred))
    score = mae / np.mean(np.abs(true))
    return score
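
As a quick check of the metric, NMAE is the mean absolute error divided by the mean absolute true value. The sketch below re-implements it in plain Python (the lowercase name `nmae` is only for this illustration) and evaluates it on hypothetical prices chosen for easy arithmetic.

```python
def nmae(true, pred):
    """Mean absolute error divided by the mean absolute true value."""
    mae = sum(abs(t - p) for t, p in zip(true, pred)) / len(true)
    return mae / (sum(abs(t) for t in true) / len(true))

# Hypothetical prices and predictions (not from the competition data).
true = [100_000, 200_000, 300_000]
pred = [110_000, 190_000, 330_000]

score = nmae(true, pred)
# MAE = (10000 + 10000 + 30000) / 3, mean(true) = 200000
print(round(score, 4))  # 0.0833
```

A score of about 0.083 means the typical error is roughly 8% of the average price, which is the same scale as the fold scores reported below.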

In [None]:
nmae_score = make_scorer(NMAE, greater_is_better=False)
kf = KFold(n_splits = 10, random_state = 42, shuffle = True)
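
For orientation, the sketch below works out how many rows land in each validation fold. It follows the sizing rule given in the scikit-learn `KFold` documentation (the first `n_samples % n_splits` folds get one extra sample); `fold_sizes` is a hypothetical helper, and 1349 is the training row count after duplicate removal.

```python
def fold_sizes(n_samples, n_splits):
    """Validation-fold sizes under the documented KFold sizing rule."""
    base, extra = divmod(n_samples, n_splits)
    return [base + 1 if i < extra else base for i in range(n_splits)]

sizes = fold_sizes(1349, 10)
print(sizes)       # nine folds of 135 rows and one of 134
print(sum(sizes))  # 1349
```

Each model is therefore trained on about 1,214 rows and validated on about 135 rows per fold, which helps explain the fold-to-fold spread in NMAE.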

In [None]:
X = train.drop(['target'], axis = 1)
y = np.log1p(train.target)  # train on log1p(price) to reduce skew

target = test[X.columns]    # test features in the same column order as X

In [None]:
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet

# LinearRegression
lr_pred = np.zeros(target.shape[0])
lr_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    lr = LinearRegression()
    lr.fit(tr_x, tr_y)
    
    val_pred = np.expm1(lr.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    lr_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    # Accumulate this fold's share of the test prediction (log1p scale)
    fold_pred = lr.predict(target) / 10
    lr_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(lr_val)} & std = {np.std(lr_val)}')

1 FOLD Training.....
1 FOLD NMAE = 0.0855513451370252

2 FOLD Training.....
2 FOLD NMAE = 0.1022031257206335

3 FOLD Training.....
3 FOLD NMAE = 0.09229735498844728

4 FOLD Training.....
4 FOLD NMAE = 0.11822494042457903

5 FOLD Training.....
5 FOLD NMAE = 0.08173765845290676

6 FOLD Training.....
6 FOLD NMAE = 0.10941202661958242

7 FOLD Training.....
7 FOLD NMAE = 0.097669462184245

8 FOLD Training.....
8 FOLD NMAE = 0.08952184236852667

9 FOLD Training.....
9 FOLD NMAE = 0.1010872166867319

10 FOLD Training.....
10 FOLD NMAE = 0.09907157927292878

10FOLD Mean of NMAE = 0.09767765518556065 & std = 0.010442845840448132


In [None]:
# Ridge
rg_pred = np.zeros(target.shape[0])
rg_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    rg = Ridge()
    rg.fit(tr_x, tr_y)
    
    val_pred = np.expm1(rg.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    rg_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    # Accumulate this fold's share of the test prediction (log1p scale)
    fold_pred = rg.predict(target) / 10
    rg_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(rg_val)} & std = {np.std(rg_val)}')

1 FOLD Training.....
1 FOLD NMAE = 0.08581885877289773

2 FOLD Training.....
2 FOLD NMAE = 0.10351488017384623

3 FOLD Training.....
3 FOLD NMAE = 0.09230885944304232

4 FOLD Training.....
4 FOLD NMAE = 0.11810498652284367

5 FOLD Training.....
5 FOLD NMAE = 0.0820310468874689

6 FOLD Training.....
6 FOLD NMAE = 0.10861654257511388

7 FOLD Training.....
7 FOLD NMAE = 0.09759836761926957

8 FOLD Training.....
8 FOLD NMAE = 0.08978354982132353

9 FOLD Training.....
9 FOLD NMAE = 0.10026834042526893

10 FOLD Training.....
10 FOLD NMAE = 0.0985271802984888

10FOLD Mean of NMAE = 0.09765726125395635 & std = 0.010271475354535207


In [None]:
# Lasso
ls_pred = np.zeros(target.shape[0])
ls_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    ls = Lasso()
    ls.fit(tr_x, tr_y)
    
    val_pred = np.expm1(ls.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    ls_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    # Accumulate this fold's share of the test prediction (log1p scale)
    fold_pred = ls.predict(target) / 10
    ls_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(ls_val)} & std = {np.std(ls_val)}')

1 FOLD Training.....
1 FOLD NMAE = 0.10720303142887724

2 FOLD Training.....
2 FOLD NMAE = 0.12959674392596052

3 FOLD Training.....
3 FOLD NMAE = 0.1166612396192025

4 FOLD Training.....
4 FOLD NMAE = 0.1458172835882167

5 FOLD Training.....
5 FOLD NMAE = 0.11665789637732302

6 FOLD Training.....
6 FOLD NMAE = 0.11807020226297728

7 FOLD Training.....
7 FOLD NMAE = 0.11704670524793234

8 FOLD Training.....
8 FOLD NMAE = 0.09876644780019046

9 FOLD Training.....
9 FOLD NMAE = 0.11797180876405988

10 FOLD Training.....
10 FOLD NMAE = 0.13774710368962914

10FOLD Mean of NMAE = 0.1205538462704369 & std = 0.013130211758909027


In [None]:
# ElasticNet
el_pred = np.zeros(target.shape[0])
el_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    el = ElasticNet()
    el.fit(tr_x, tr_y)
    
    val_pred = np.expm1(el.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    el_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    # Accumulate this fold's share of the test prediction (log1p scale)
    fold_pred = el.predict(target) / 10
    el_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(el_val)} & std = {np.std(el_val)}')

1 FOLD Training.....
1 FOLD NMAE = 0.10305093143634164

2 FOLD Training.....
2 FOLD NMAE = 0.12558173918842255

3 FOLD Training.....
3 FOLD NMAE = 0.11322009529909433

4 FOLD Training.....
4 FOLD NMAE = 0.143886981680882

5 FOLD Training.....
5 FOLD NMAE = 0.1082342823229339

6 FOLD Training.....
6 FOLD NMAE = 0.11503659879679513

7 FOLD Training.....
7 FOLD NMAE = 0.10861234914793798

8 FOLD Training.....
8 FOLD NMAE = 0.0961107004222138

9 FOLD Training.....
9 FOLD NMAE = 0.11662096556348844

10 FOLD Training.....
10 FOLD NMAE = 0.13057396637280907

10FOLD Mean of NMAE = 0.11609286102309188 & std = 0.013300359837437057


In [None]:
# GradientBoostingRegressor
gbr_pred = np.zeros(target.shape[0])
gbr_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    gbr = GradientBoostingRegressor(random_state = 42, max_depth = 4, learning_rate = 0.05, n_estimators = 1000)
    gbr.fit(tr_x, tr_y)
    
    val_pred = np.expm1(gbr.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    gbr_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    fold_pred = gbr.predict(target) / 10
    gbr_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(gbr_val)} & std = {np.std(gbr_val)}')

1 FOLD Training.....
1 FOLD NMAE = 0.08559571468989596

2 FOLD Training.....
2 FOLD NMAE = 0.09890395803205482

3 FOLD Training.....
3 FOLD NMAE = 0.09687086992126713

4 FOLD Training.....
4 FOLD NMAE = 0.12149704793725016

5 FOLD Training.....
5 FOLD NMAE = 0.09930487557858571

6 FOLD Training.....
6 FOLD NMAE = 0.10072738591779712

7 FOLD Training.....
7 FOLD NMAE = 0.0952391814814313

8 FOLD Training.....
8 FOLD NMAE = 0.09444487669990737

9 FOLD Training.....
9 FOLD NMAE = 0.09276499577761807

10 FOLD Training.....
10 FOLD NMAE = 0.11465257508135572

10FOLD Mean of NMAE = 0.10000014811171634 & std = 0.01001089200236599


In [None]:
# RandomForestRegressor
rf_pred = np.zeros(target.shape[0])
rf_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    # Split criterion: mean absolute error
    rf = RandomForestRegressor(random_state = 42, criterion = 'absolute_error')
    rf.fit(tr_x, tr_y)
    
    val_pred = np.expm1(rf.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    rf_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    fold_pred = rf.predict(target) / 10
    rf_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(rf_val)} & std = {np.std(rf_val)}')

1 FOLD Training.....
1 FOLD NMAE = 0.0894992847020689

2 FOLD Training.....
2 FOLD NMAE = 0.09652474280361749

3 FOLD Training.....
3 FOLD NMAE = 0.0968531737995619

4 FOLD Training.....
4 FOLD NMAE = 0.11796586961484536

5 FOLD Training.....
5 FOLD NMAE = 0.09132017185584299

6 FOLD Training.....
6 FOLD NMAE = 0.09886352213176591

7 FOLD Training.....
7 FOLD NMAE = 0.08828140015484487

8 FOLD Training.....
8 FOLD NMAE = 0.08553885538522918

9 FOLD Training.....
9 FOLD NMAE = 0.09595562577270407

10 FOLD Training.....
10 FOLD NMAE = 0.10741373337746232

10FOLD Mean of NMAE = 0.0968216379597943 & std = 0.009210859123101276


In [None]:
# NGBRegressor
ngb_pred = np.zeros(target.shape[0])
ngb_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    ngb = NGBRegressor(random_state = 42, n_estimators = 1000, verbose = 0, learning_rate = 0.03)
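    # Early stopping compares against the training target, so the validation target must stay on the log1p scale
    ngb.fit(tr_x, tr_y, val_x, y.iloc[val_idx], early_stopping_rounds = 300)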
    
    val_pred = np.expm1(ngb.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    ngb_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    # Accumulate this fold's share of the test prediction (log1p scale)
    fold_pred = ngb.predict(target) / 10
    ngb_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(ngb_val)} & std = {np.std(ngb_val)}')

1 FOLD Training.....
1 FOLD NMAE = 0.08007248450195956

2 FOLD Training.....
2 FOLD NMAE = 0.0983957884568228

3 FOLD Training.....
3 FOLD NMAE = 0.08982578975551057

4 FOLD Training.....
4 FOLD NMAE = 0.11179768332157361

5 FOLD Training.....
5 FOLD NMAE = 0.0917275744128398

6 FOLD Training.....
6 FOLD NMAE = 0.10195164470116333

7 FOLD Training.....
7 FOLD NMAE = 0.09615107996093228

8 FOLD Training.....
8 FOLD NMAE = 0.0874281780226981

9 FOLD Training.....
9 FOLD NMAE = 0.09348281835446484

10 FOLD Training.....
10 FOLD NMAE = 0.11067237557084522

10FOLD Mean of NMAE = 0.09615054170588101 & std = 0.00946401635263995


In [None]:
# Catboost
cb_pred = np.zeros(target.shape[0])
cb_val = []
for n, (tr_idx, val_idx) in enumerate(kf.split(X, y)) :
    print(f'{n + 1} FOLD Training.....')
    tr_x, tr_y = X.iloc[tr_idx], y.iloc[tr_idx]
    val_x, val_y = X.iloc[val_idx], np.expm1(y.iloc[val_idx])
    
    tr_data = Pool(data = tr_x, label = tr_y)
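    # Eval labels stay on the log1p scale used for training, so the eval MAE and early stopping are comparable
    val_data = Pool(data = val_x, label = y.iloc[val_idx])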
    
    cb = CatBoostRegressor(depth = 4, random_state = 42, loss_function = 'MAE', n_estimators = 3000, learning_rate = 0.03, verbose = 0)
    cb.fit(tr_data, eval_set = val_data, early_stopping_rounds = 750, verbose = 1000)
    
    val_pred = np.expm1(cb.predict(val_x))
    val_nmae = NMAE(val_y, val_pred)
    cb_val.append(val_nmae)
    print(f'{n + 1} FOLD NMAE = {val_nmae}\n')
    
    # Accumulate this fold's share of the test prediction (log1p scale)
    fold_pred = cb.predict(target) / 10
    cb_pred += fold_pred
print(f'10FOLD Mean of NMAE = {np.mean(cb_val)} & std = {np.std(cb_val)}')

1 FOLD Training.....
0:	learn: 0.2927358	test: 187886.6143316	best: 187886.6143316 (0)	total: 47.9ms	remaining: 2m 23s
Stopped by overfitting detector  (750 iterations wait)

bestTest = 187886.5529
bestIteration = 249

Shrink model to first 250 iterations.
1 FOLD NMAE = 0.0868885625309975

2 FOLD Training.....
0:	learn: 0.2949061	test: 183672.7371379	best: 183672.7371379 (0)	total: 1.4ms	remaining: 4.2s
Stopped by overfitting detector  (750 iterations wait)

bestTest = 183672.7238
bestIteration = 223

Shrink model to first 224 iterations.
2 FOLD NMAE = 0.09827366837706818

3 FOLD Training.....
0:	learn: 0.2877837	test: 190826.8666450	best: 190826.8666450 (0)	total: 1.39ms	remaining: 4.18s
1000:	learn: 0.0665381	test: 190826.7975793	best: 190826.7975499 (975)	total: 1.22s	remaining: 2.44s
2000:	learn: 0.0584153	test: 190826.7980664	best: 190826.7970898 (1325)	total: 2.42s	remaining: 1.21s
Stopped by overfitting detector  (750 iterations wait)

bestTest = 190826.7971
bestIteration = 1325

In [None]:
# Check validation performance (mean NMAE per model, in val_list order)
val_list = [lr_val, rg_val, ls_val, el_val, gbr_val, rf_val, ngb_val, cb_val]
for val in val_list :
  print("{:.8f}".format(np.mean(val))) 

0.09767766
0.09765726
0.12055385
0.11609286
0.10000015
0.09682164
0.09615054
0.10626885
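
The bare numbers above are easier to compare once they are labeled and sorted. The sketch below copies the printed means into a dictionary in `val_list` order; the model labels are added here for readability and are not part of the original output.

```python
# Mean validation NMAE per model, copied from the output above (val_list order).
scores = {
    'LinearRegression': 0.09767766,
    'Ridge': 0.09765726,
    'Lasso': 0.12055385,
    'ElasticNet': 0.11609286,
    'GradientBoosting': 0.10000015,
    'RandomForest': 0.09682164,
    'NGBoost': 0.09615054,
    'CatBoost': 0.10626885,
}

# Lower NMAE is better, so sort in ascending order.
ranking = sorted(scores, key=scores.get)
for name in ranking:
    print(f'{name:<17} {scores[name]:.8f}')
```

NGBoost and RandomForest have the lowest mean NMAE, while Lasso and ElasticNet with default settings score worst.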


In [None]:
# Fill the submission file: average the four models' log1p-scale predictions, then invert with expm1
sub = pd.read_csv(osp.join(data_dir, 'sample_submission.csv'))
sub['target'] = np.expm1((ngb_pred + rf_pred + rg_pred + gbr_pred) / 4)
sub['target']

0       335327.454132
1       127869.079338
2       175267.374659
3       260383.783974
4       131900.302694
            ...      
1345    336105.645984
1346    122929.347797
1347     88532.731291
1348    187209.476042
1349    132793.228312
Name: target, Length: 1350, dtype: float64
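
Because the model predictions are averaged on the log1p scale before `expm1` is applied, the blend behaves like a geometric mean rather than an arithmetic mean of prices. The sketch below illustrates this with hypothetical per-model predictions for a single house; none of the numbers come from the competition.

```python
import math

# Hypothetical price predictions for one house from four models.
prices = [180_000, 190_000, 200_000, 230_000]

log_preds = [math.log1p(p) for p in prices]          # what each model outputs
blend = math.expm1(sum(log_preds) / len(log_preds))  # average in log space, then invert

arithmetic_mean = sum(prices) / len(prices)
geometric_mean = math.prod(p + 1 for p in prices) ** (1 / len(prices)) - 1

print(f'blend           = {blend:,.2f}')
print(f'geometric mean  = {geometric_mean:,.2f}')
print(f'arithmetic mean = {arithmetic_mean:,.2f}')
```

The blend matches the geometric mean of `price + 1` (minus 1) and sits slightly below the arithmetic mean, so a single high prediction pulls the ensemble up less than it would in a plain average of prices.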

In [None]:
# Export to a CSV file
sub_dir = './housing/sub'
os.makedirs(sub_dir, exist_ok=True)  # create the output folder if it does not exist
sub.to_csv(osp.join(sub_dir, 'baseline.csv'), index=False) 