# Kaggle Competition
- https://www.kaggle.com/competitions/2019-2nd-ml-month-with-kakr/overview
- Note: [Link](https://www.notion.so/parkjaeyoung/Kaggle-4147f4c9dd0b43e284d697c1cb6d7875?pvs=4)

## Library and Data Load

In [1]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np

import missingno as msno
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor

from sklearn.model_selection import KFold, cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import RobustScaler

import xgboost as xgb
from xgboost import XGBRegressor

import lightgbm as lgb
from lightgbm import LGBMRegressor


from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline


In [2]:
pd.set_option("display.max_rows", 100)

## Data Load

In [27]:
sub = pd.read_csv('sub.csv',index_col=0)
x = pd.read_csv('x.csv',index_col=0)
y = pd.read_csv('y.csv',index_col=0)   # Log Scaled
test_id = pd.read_csv('test_id.csv',index_col=0)

y = y['price_logscaled'].to_list()
test_id = test_id['id'].to_list()

In [32]:
print(x.shape)
print(len(y))
print(sub.shape)

(15035, 16)
15035
(6468, 16)


## Grid Search (XGBoost)

#### XGBoost
 - Best Parameters (round 1): {'colsample_bytree': 0.6, 'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 500, 'reg_alpha': 0.5, 'reg_lambda': 0.5, 'subsample': 0.8}
 - Best Score (MSE): 0.02539927288652839

In [50]:

# Build the parameter grid to search
param_grid = {
    'learning_rate': [0.04, 0.05, 0.06],
    'max_depth': [6, 7, 8],
    'subsample': [0.6,0.7,0.8],
    'colsample_bytree': [0.5,0.6, 0.7],
    'n_estimators': [400,500,600],
    'reg_alpha': [0.4,0.5,0.7],
    'reg_lambda': [0.4,0.5,0.7]
}


In [51]:
xgb_model = xgb.XGBRegressor()

In [52]:
xgb_grid_search = GridSearchCV(estimator=xgb_model, param_grid=param_grid, cv=5, scoring='neg_mean_squared_error',n_jobs=-1,verbose=3)
xgb_grid_search.fit(x,y)

Fitting 5 folds for each of 2187 candidates, totalling 10935 fits
[CV 4/5] END colsample_bytree=0.5, learning_rate=0.04, max_depth=6, n_estimators=400, reg_alpha=0.4, reg_lambda=0.4, subsample=0.6;, score=-0.027 total time=   2.9s
[CV 2/5] END colsample_bytree=0.5, learning_rate=0.04, max_depth=6, n_estimators=400, reg_alpha=0.4, reg_lambda=0.5, subsample=0.6;, score=-0.028 total time=   2.9s
[CV 3/5] END colsample_bytree=0.5, learning_rate=0.04, max_depth=6, n_estimators=400, reg_alpha=0.4, reg_lambda=0.5, subsample=0.8;, score=-0.028 total time=   2.7s
[CV 1/5] END colsample_bytree=0.5, learning_rate=0.04, max_depth=6, n_estimators=400, reg_alpha=0.4, reg_lambda=0.7, subsample=0.7;, score=-0.026 total time=   2.8s
[CV 5/5] END colsample_bytree=0.5, learning_rate=0.04, max_depth=6, n_estimators=400, reg_alpha=0.4, reg_lambda=0.7, subsample=0.8;, score=-0.023 total time=   2.8s
[CV 5/5] END colsample_bytree=0.5, learning_rate=0.04, max_depth=6, n_estimators=400, reg_alpha=0.5, reg_lamb
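
The "2187 candidates, totalling 10935 fits" line in the log follows directly from the grid: seven hyperparameters with three values each, evaluated on five folds. A minimal sketch of that arithmetic, using only the standard library (the helper name `count_fits` is illustrative, not part of the notebook):

```python
from itertools import product
from math import prod

# Same grid as the notebook: seven hyperparameters, three values each.
param_grid = {
    "learning_rate": [0.04, 0.05, 0.06],
    "max_depth": [6, 7, 8],
    "subsample": [0.6, 0.7, 0.8],
    "colsample_bytree": [0.5, 0.6, 0.7],
    "n_estimators": [400, 500, 600],
    "reg_alpha": [0.4, 0.5, 0.7],
    "reg_lambda": [0.4, 0.5, 0.7],
}


def count_fits(grid, cv):
    """Return (candidates, total fits) for an exhaustive grid search."""
    n_candidates = prod(len(values) for values in grid.values())
    return n_candidates, n_candidates * cv


n_candidates, n_fits = count_fits(param_grid, cv=5)

# Cross-check against an explicit enumeration of the Cartesian product.
assert n_candidates == sum(1 for _ in product(*param_grid.values()))

print(n_candidates, n_fits)  # 2187 10935
```

Running this kind of count before launching a search gives a quick estimate of the cost: at roughly 3 seconds per fit, as in the log above, 10935 fits is several hours of compute spread across the available cores.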

In [53]:
# Print the best parameters and the best score
print("Best Parameters:", xgb_grid_search.best_params_)
print("Best Score:", -xgb_grid_search.best_score_)

Best Parameters: {'colsample_bytree': 0.5, 'learning_rate': 0.04, 'max_depth': 8, 'n_estimators': 600, 'reg_alpha': 0.7, 'reg_lambda': 0.7, 'subsample': 0.8}
Best Score: 0.02525912775278597


 - Round 1
    - Best Parameters: {'colsample_bytree': 0.6, 'learning_rate': 0.05, 'max_depth': 7, 'n_estimators': 500, 'reg_alpha': 0.5, 'reg_lambda': 0.5, 'subsample': 0.8}
    - Best Score: 0.02539927288652839
 - Round 2 (winner)
    - Best Parameters: {'colsample_bytree': 0.5, 'learning_rate': 0.04, 'max_depth': 8, 'n_estimators': 600, 'reg_alpha': 0.7, 'reg_lambda': 0.7, 'subsample': 0.8}
    - Best Score: 0.02525912775278597

In [57]:
#xgb_grid_search.cv_results_
xgb_grid_search.cv_results_['rank_test_score']

array([2179, 2160, 2168, ..., 2137, 1910, 1794], dtype=int32)
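
The raw `rank_test_score` array is hard to read on its own, because each entry is the rank of the candidate at that position (1 is best). A small sketch of turning ranks into an ordering, using a hypothetical five-element array in place of the real 2187-element one:

```python
import numpy as np

# Hypothetical miniature stand-in for cv_results_["rank_test_score"].
rank_test_score = np.array([4, 2, 5, 1, 3], dtype=np.int32)

# Candidate positions ordered from best (rank 1) to worst.
order = np.argsort(rank_test_score)

best_index = int(order[0])   # position of the rank-1 candidate
top3 = order[:3].tolist()    # positions of the three best candidates

print(best_index, top3)  # 3 [3, 1, 4]
```

With a fitted `GridSearchCV`, the same positions can be used to index `cv_results_['params']` and `cv_results_['mean_test_score']` to see which settings came close to the best one.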

## Modeling

In [None]:
#best_xgb_model = xgb_grid_search.best_estimator_

In [None]:
best_xgb_parameter= {'colsample_bytree': 0.6,
                     'learning_rate': 0.05,
                     'max_depth': 7, 
                     'n_estimators': 500,
                     'reg_alpha': 0.5,
                     'reg_lambda': 0.5,
                     'subsample': 0.8}

xgb_model = xgb.XGBRegressor(random_state=36, n_jobs=-1, **best_xgb_parameter)

In [74]:
#challenge
best_xgb_parameter= {'colsample_bytree': 0.5, 'learning_rate': 0.04, 'max_depth': 8, 'n_estimators': 600, 'reg_alpha': 0.7, 'reg_lambda': 0.7, 'subsample': 0.8}

xgb_model = xgb.XGBRegressor(random_state=36, n_jobs=-1, **best_xgb_parameter)

In [75]:
xgb_model.fit(x, y)

In [76]:
# Run cross-validation
def evaluate_cross_validation(model, X, y, cv=5, scoring='neg_mean_squared_error'):
    # cross_val_score clones and refits the model on each fold,
    # so no separate fit or predict call is needed here.
    # The sign flip assumes a negated-error scorer such as neg_mean_squared_error.
    mse_scores = -cross_val_score(model, X, y, cv=cv, scoring=scoring)
    rmse_scores = np.sqrt(mse_scores)
    print("MSE Scores:", mse_scores)
    print("Mean MSE:", np.mean(mse_scores))
    print("RMSE Scores:", rmse_scores)
    print("Mean RMSE:", np.mean(rmse_scores))

In [77]:
evaluate_cross_validation(xgb_model, x, y)

MSE Scores: [0.02595852 0.0266906  0.02700997 0.02555234 0.02192547]
Mean MSE: 0.02542737762315029
RMSE Scores: [0.16111647 0.16337257 0.1643471  0.15985098 0.14807252]
Mean RMSE: 0.1593519269627882


 - Round 1
    - MSE Scores: [0.02596264 0.02701214 0.02728797 0.02567725 0.02181418]
    - Mean MSE: 0.025550838971678303
    - RMSE Scores: [0.16112928 0.16435371 0.16519072 0.16024124 0.14769624]
    - Mean RMSE: 0.15972223781080652
 - Round 2 (winner)
    - MSE Scores: [0.02595852 0.0266906  0.02700997 0.02555234 0.02192547]
    - Mean MSE: 0.02542737762315029
    - RMSE Scores: [0.16111647 0.16337257 0.1643471  0.15985098 0.14807252]
    - Mean RMSE: 0.1593519269627882
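
Note that the reported "Mean RMSE" is the average of the per-fold RMSE values, which is not the same number as the square root of the mean MSE. A short sketch of the difference, using the per-fold MSE values printed by the `evaluate_cross_validation(xgb_model, x, y)` call above (the variable names are illustrative):

```python
from math import sqrt
from statistics import mean

# Per-fold MSE values copied from the cross-validation output above.
mse_scores = [0.02595852, 0.0266906, 0.02700997, 0.02555234, 0.02192547]

# What the notebook prints as "Mean RMSE": average of the per-fold RMSEs.
mean_of_fold_rmse = mean([sqrt(m) for m in mse_scores])

# Alternative aggregation: RMSE computed from the averaged MSE.
rmse_of_mean_mse = sqrt(mean(mse_scores))

print(round(mean_of_fold_rmse, 5))  # 0.15935
print(round(rmse_of_mean_mse, 5))   # 0.15946
```

The second value is never smaller than the first, because the square root is concave. The gap is small here, but it matters when comparing against a score computed once over all rows, so it helps to state which aggregation a number uses.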

Note: round 2 won on the cross-validation score, but round 1 earned the higher Kaggle score.

## Predict

In [78]:
y_pred = xgb_model.predict(sub)

In [79]:
y_pred = np.expm1(y_pred)

In [80]:
y_pred

array([ 504950.38,  486032.7 , 1262379.2 , ...,  480380.78,  321000.22,
        457577.1 ], dtype=float32)
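
Because the model was trained on a log-scaled target, its raw predictions are on the log scale and `np.expm1` maps them back to prices. The sketch below assumes the target was created with `np.log1p`, which is the exact inverse of `np.expm1`; the notebook does not show that step, so treat it as an assumption.

```python
import numpy as np

# First three predicted prices from the output above.
prices = np.array([504950.38, 486032.7, 1262379.2])

# Assumed forward transform applied to the target before training.
log_prices = np.log1p(prices)   # log(1 + price)

# Inverse transform applied to predictions, as in the notebook.
restored = np.expm1(log_prices)  # exp(value) - 1

print(np.allclose(restored, prices))  # True
```

If the target had been built with a plain `np.log` instead, the matching inverse would be `np.exp`, and using `np.expm1` would shift every price by one unit.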

### Make Submission

For regression models, cross_val_score returns R<sup>2</sup> by default.<br>
The closer R<sup>2</sup> is to 1, the better the model fits the data. The three tree models perform reasonably well on the training data.<br> We train the three models on the training set and build the submission with average blending.
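
Average blending, as mentioned above, is an element-wise mean of the predictions from several models. The cells in this notebook predict with a single XGBoost model, so the following is only a sketch with hypothetical log-scale predictions from three models; `average_blend` and the arrays are illustrative names, and the `np.expm1` step assumes a `log1p`-scaled target.

```python
import numpy as np


def average_blend(predictions):
    """Return the element-wise mean of equally weighted model predictions."""
    return np.mean(np.vstack(predictions), axis=0)


# Hypothetical log-scale predictions from three models for the same four rows.
pred_a = np.array([13.10, 12.90, 14.00, 12.50])
pred_b = np.array([13.20, 13.00, 14.10, 12.40])
pred_c = np.array([13.00, 13.10, 13.90, 12.60])

blended_log = average_blend([pred_a, pred_b, pred_c])

# Convert back to the price scale after blending.
blended_price = np.expm1(blended_log)

print(blended_log)
```

Blending on the log scale and converting afterwards is one possible design choice; averaging the back-transformed prices instead gives slightly different numbers.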

In [81]:
sub_final = pd.DataFrame(data={'id':test_id,'price':y_pred})

In [82]:
sub_final.to_csv('submission.csv', index=False)

In [83]:
sub_final

| | id | price |
|---|---|---|
| 0 | 15035 | 5.049504e+05 |
| 1 | 15036 | 4.860327e+05 |
| 2 | 15037 | 1.262379e+06 |
| 3 | 15038 | 2.980670e+05 |
| 4 | 15039 | 3.313993e+05 |
| ... | ... | ... |
| 6463 | 21498 | 2.321562e+05 |
| 6464 | 21499 | 4.245769e+05 |
| 6465 | 21500 | 4.803808e+05 |
| 6466 | 21501 | 3.210002e+05 |
