<a href="https://colab.research.google.com/github/dajeong25/boostcourse/blob/main/Regressor/2019_2nd_ML_month_with_KaKR.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## 2nd ML Competition with Kaggle Korea - House Price Prediction
https://www.kaggle.com/competitions/2019-2nd-ml-month-with-kakr

In [1]:
# Download the data
!wget --no-check-certificate 'https://docs.google.com/uc?export=download&id=1IVvuG3SMlarSSGmcliGFjq1fMxZtksE0' -O kaggle-kakr-housing-data.zip

--2023-05-15 06:11:11--  https://docs.google.com/uc?export=download&id=1IVvuG3SMlarSSGmcliGFjq1fMxZtksE0
Resolving docs.google.com (docs.google.com)... 142.250.141.113, 142.250.141.138, 142.250.141.101, ...
Connecting to docs.google.com (docs.google.com)|142.250.141.113|:443... connected.
HTTP request sent, awaiting response... 303 See Other
Location: https://doc-00-b8-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/c5l8hdekohfn0dq21rmdsv92krqi5kot/1684131000000/17597719433809694239/*/1IVvuG3SMlarSSGmcliGFjq1fMxZtksE0?e=download&uuid=6bc40912-a417-40e1-8925-d69d9b30290c [following]
--2023-05-15 06:11:12--  https://doc-00-b8-docs.googleusercontent.com/docs/securesc/ha0ro937gcuc7l7deffksulhg5h7mbp1/c5l8hdekohfn0dq21rmdsv92kr

Next, unzip the archive.

In [2]:
# Unzip the downloaded zip file
!unzip -qq ./kaggle-kakr-housing-data.zip

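## Exploring the Data
1. ID : identifier for each house
2. date : date the house was purchased
3. **price : price of the house (the target variable)**
4. bedrooms : number of bedrooms
5. bathrooms : number of bathrooms per bedroom
6. sqft_living : living space, in square feet
7. sqft_lot : lot size, in square feet
8. floors : number of floors
9. waterfront : whether a river runs in front of the house (a.k.a. river view)
10. view : how good the house looks
11. condition : overall condition of the house
12. grade : grade assigned to the house under the King County grading system
13. sqft_above : square feet excluding the basement
14. sqft_basement : basement size, in square feet
15. yr_built : year the house was built
16. yr_renovated : year the house was renovated
17. zipcode : zip code
18. lat : latitude
19. long : longitude
20. sqft_living15 : living space of the nearest 15 households, in square feet
21. sqft_lot15 : lot size of the nearest 15 households, in square feet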

> Baseline reference
https://www.kaggle.com/code/kcs93023/2019-ml-month-2nd-baseline/notebook

In [3]:
import warnings
warnings.filterwarnings("ignore")

import os
from os.path import join

import pandas as pd
import numpy as np

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score
import xgboost as xgb
import lightgbm as lgb

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

#--------------------------------------------------------------------------------------------------------------------------------------

train_data_path = join('./data', 'train.csv')
sub_data_path = join('./data', 'test.csv')

data = pd.read_csv(train_data_path)
sub = pd.read_csv(sub_data_path)

# Separate the target values
y = data['price'] 
del data['price']

train_len = len(data)
data = pd.concat((data, sub), axis=0)

# Clean up the id and date variables (date keeps only the leading YYYYMM digits)
sub_id = data['id'][train_len:]
del data['id']
data['date'] = data['date'].apply(lambda x : str(x[:6])).astype(int)

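# After checking the distributions, reduce the skew of skewed columns (log transform)
skew_columns = ['bedrooms', 'sqft_living', 'sqft_lot', 'sqft_above', 'sqft_basement']

for c in skew_columns:
    data[c] = np.log1p(data[c].values)

# Log-transformed target (computed here; the models below are fit on the raw y)
y_log_transformation = np.log1p(y)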

# Split back into train and test data
sub = data.iloc[train_len:, :] # test data
x = data.iloc[:train_len, :] # train data

print(x.shape)
print(sub.shape)

(15035, 19)
(6468, 19)
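
The log transform in the cell above can be illustrated on its own. The following is a minimal sketch, assuming synthetic right-skewed values in place of a real column such as `sqft_lot`; the `skewness` helper and the generated data are hypothetical and not part of the notebook. It shows that `np.log1p` pulls a long right tail toward symmetry, and that `np.expm1` inverts it, which is what would be needed to turn predictions back into prices if a model were trained on `y_log_transformation`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic right-skewed values standing in for a column such as sqft_lot (assumption).
values = rng.lognormal(mean=9.0, sigma=1.0, size=10_000)

def skewness(a):
    """Sample skewness: the third standardized moment."""
    a = np.asarray(a, dtype=float)
    centered = a - a.mean()
    return (centered ** 3).mean() / (centered ** 2).mean() ** 1.5

logged = np.log1p(values)    # log(1 + x); maps 0 to 0, so zero-valued entries are safe
restored = np.expm1(logged)  # exp(x) - 1; the exact inverse of log1p

print("skew before:", skewness(values))
print("skew after :", skewness(logged))
print("round trip ok:", np.allclose(restored, values))
```

`log1p` is used rather than a plain `log` because columns such as `sqft_basement` can contain zeros, for which `log` is undefined.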


## Modeling

In [4]:
gboost = GradientBoostingRegressor(random_state=2023)
xgboost = xgb.XGBRegressor(random_state=2023)
lightgbm = lgb.LGBMRegressor(random_state=2023)

models = [{'model':gboost, 'name':'GradientBoosting'}, {'model':xgboost, 'name':'XGBoost'},
          {'model':lightgbm, 'name':'LightGBM'}]

### Cross Validation
Evaluate model performance

In [15]:
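def get_cv_score(models):
    # Unshuffled 5-fold splitter, passed explicitly to cross_val_score via cv
    kfold = KFold(n_splits=5)
    for m in models:
        # For regressors, cross_val_score reports R^2 by default
        scores = cross_val_score(m['model'], x.values, y, cv=kfold)
        print("Model {} CV score : {:.4f}".format(m['name'], np.mean(scores)))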

Now let's pass our models to the `get_cv_score` function and test them.

In [16]:
get_cv_score(models)

Model GradientBoosting CV score : 0.8609
Model XGBoost CV score : 0.8861
Model LightGBM CV score : 0.8819
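
The scores above are R² values, since that is the default scorer `cross_val_score` uses for regressors. As a hedged sketch of how the same check could report an error in the target's units instead, the example below passes a shuffled `KFold` and the `neg_root_mean_squared_error` scoring string. The data comes from `make_regression` as a synthetic stand-in for the housing features (an assumption), and the `n_estimators=50` setting is chosen only to keep the example fast.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for the housing features and prices (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Shuffled 5-fold splitter with a fixed seed so the folds are reproducible.
cv = KFold(n_splits=5, shuffle=True, random_state=2023)
model = GradientBoostingRegressor(n_estimators=50, random_state=2023)

# Default scorer for a regressor: R^2 (higher is better, at most 1.0).
r2_scores = cross_val_score(model, X_demo, y_demo, cv=cv)

# scikit-learn maximizes scores, so RMSE is exposed as a negated value.
neg_rmse = cross_val_score(model, X_demo, y_demo, cv=cv,
                           scoring="neg_root_mean_squared_error")
rmse_scores = -neg_rmse

print("mean R^2 : {:.4f}".format(np.mean(r2_scores)))
print("mean RMSE: {:.4f}".format(np.mean(rmse_scores)))
```

Shuffling matters when the rows are ordered in some meaningful way; with unshuffled folds, each fold is a contiguous block of rows.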


## Make Submission
[Regression models] The `cross_val_score` function returns R² by default.

In [None]:
def AveragingBlending(models, x, y, sub_x):
    # Train each model
    for m in models:
        m['model'].fit(x.values, y)

    # Predict with each model; one column per model
    predictions = np.column_stack([
        m['model'].predict(sub_x.values) for m in models])

    # Return the mean of the models' predictions for each row
    return np.mean(predictions, axis=1)

In [None]:
y_pred = AveragingBlending(models, x, y, sub)
print(len(y_pred))
y_pred

6468


array([ 529966.66304912,  430726.21272617, 1361676.91242777, ...,
        452081.69137012,  341572.97685942,  421725.1231835 ])
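
The averaging step inside `AveragingBlending` is just a row-wise mean over a matrix with one column per model. The sketch below shows that mechanic with hand-written numbers; the three prediction arrays and the weights are hypothetical values for illustration, not outputs of the models in this notebook. It also shows `np.average` with `weights` as one possible variation if some models were trusted more than others.

```python
import numpy as np

# Hypothetical predictions from three models for the same four houses.
pred_gboost = np.array([500_000.0, 420_000.0, 1_300_000.0, 330_000.0])
pred_xgboost = np.array([540_000.0, 440_000.0, 1_400_000.0, 340_000.0])
pred_lightgbm = np.array([550_000.0, 430_000.0, 1_380_000.0, 335_000.0])

# One row per house, one column per model: shape (4, 3).
stacked = np.column_stack([pred_gboost, pred_xgboost, pred_lightgbm])

# Simple blend: the unweighted mean across models for each house.
simple_mean = stacked.mean(axis=1)

# Variation: a weighted blend (weights are hypothetical and sum to 1).
weights = [0.2, 0.4, 0.4]
weighted_mean = np.average(stacked, axis=1, weights=weights)

print(stacked.shape)
print(simple_mean)
print(weighted_mean)
```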

In [None]:
# Check the sample submission file
submission = pd.read_csv('./data/sample_submission.csv')
submission.head()

| | id | price |
|---|---|---|
| 0 | 15035 | 100000 |
| 1 | 15036 | 100000 |
| 2 | 15037 | 100000 |
| 3 | 15038 | 100000 |
| 4 | 15039 | 100000 |


In [None]:
# Build the submission DataFrame
result = pd.DataFrame({ 'id' : sub_id, 'price' : y_pred})
result.head()

| | id | price |
|---|---|---|
| 0 | 15035 | 529966.7 |
| 1 | 15036 | 430726.2 |
| 2 | 15037 | 1361677.0 |
| 3 | 15038 | 333803.6 |
| 4 | 15039 | 308900.6 |


In [None]:
result.to_csv('./data/submission.csv', index=False) 
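
The `index=False` argument is what keeps the file limited to the `id` and `price` columns of the sample submission. As a small sketch with made-up ids and prices (hypothetical values, written to in-memory strings rather than to `./data/submission.csv`), the example below compares the columns read back with and without the index.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical ids and prices standing in for sub_id and y_pred.
result_demo = pd.DataFrame({
    "id": np.arange(15035, 15038),
    "price": np.array([529966.7, 430726.2, 1361677.0]),
})

# With no path given, to_csv returns the CSV text as a string.
csv_with_index = result_demo.to_csv()               # index is written by default
csv_without_index = result_demo.to_csv(index=False)

cols_with = list(pd.read_csv(io.StringIO(csv_with_index)).columns)
cols_without = list(pd.read_csv(io.StringIO(csv_without_index)).columns)

print(cols_with)     # includes an extra unnamed index column
print(cols_without)  # only id and price
```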

```
# First score: 119927.51348
```