###  <span style="color : pink"> Intentio Voluntatis </span>

 ## Difference between `model` and `grid_model`

  ---

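  ### 1. `model` - the base machine learning model

  ```python
  model = LGBMRegressor(random_state=42)
  ```

  - `LGBMRegressor`: the LightGBM regression model (default settings)
  - Its hyperparameters have not been tuned yet
  - Only the "frame" exists so far; the best settings are still unknown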

  ---
  ### 2. `grid_model` - the hyperparameter search tool

  `grid_model = GridSearchCV(model, param_grid=param_grid, ...)`

  - `GridSearchCV`: a tool that automatically tests multiple hyperparameter combinations
  - A wrapper around `model` that finds the best settings for it

  ---
  ### 3. Overall process flow

  Step 1: Define the parameter ranges to search
  `param_grid = {'n_estimators': [50, 100], 'max_depth': [1, 10]}`
  → 4 combinations in total (2 × 2)

  ⬇️

  Step 2: Create the base model
  `model = LGBMRegressor()`: not trained yet, all settings at their defaults

  ⬇️

  Step 3: Wrap it with GridSearchCV
  `grid_model = GridSearchCV(model, param_grid, cv=5)`
  → 4 combinations × 5-fold cross-validation = 20 fit/evaluate runs, followed by one final refit of the best combination on the full data (`refit=True` is the default)

  ⬇️

  Step 4: Run the search
  `grid_model.fit(train, y)`

  | Experiment | n_estimators | max_depth | Result         |
  |------------|--------------|-----------|----------------|
  | 1          | 50           | 1         | score measured |
  | 2          | 100          | 1         | score measured |
  | 3          | 50           | 10        | score measured |
  | 4          | 100          | 10        | ⭐ best!       |

  ⬇️

  Step 5: Check the results
  - `grid_model.best_params_` → `{'max_depth': 10, 'n_estimators': 100}`
  - `grid_model.best_score_` → the best score
  - `grid_model.cv_results_` → the results of every experiment

  ---
  ### 4. Understanding through an analogy

  | Item               | Analogy                                    |
  |--------------------|--------------------------------------------|
  | `model`            | A chef (no recipe yet)                     |
  | `param_grid`       | The list of recipes to try                 |
  | `grid_model`       | The judging system of a cooking contest    |
  | `grid_model.fit()` | Running the contest (every combination)    |
  | `best_params_`     | The winning recipe                         |

  ---
  ### 5. Key summary

  | `model`                            | `grid_model`                         |
  |------------------------------------|--------------------------------------|
  | A model with a single setting      | Compares many settings automatically |
  | Default or manually chosen values  | A tool that finds the best setting   |

  ---
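
The "2 × 2 = 4 combinations" and "20 fits" counts above can be reproduced with a few lines of plain Python. This is only a sketch of how a grid is enumerated, not scikit-learn's own implementation; the variable names `names`, `candidates`, and `total_fits` are made up for illustration. The keys are sorted so that the order matches the candidate order shown in `cv_results_` later in this notebook.

```python
from itertools import product

# Same grid as in the text above.
param_grid = {'n_estimators': [50, 100], 'max_depth': [1, 10]}
cv = 5  # number of cross-validation folds

# Sort the keys, then take the Cartesian product of the value lists.
names = sorted(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*(param_grid[name] for name in names))]

for i, candidate in enumerate(candidates, start=1):
    print(i, candidate)

# Every candidate is fitted once per fold.
total_fits = len(candidates) * cv
print(f"{len(candidates)} candidates x {cv} folds = {total_fits} fits")
```

Adding a third value to either list would raise the count to 6 candidates and 30 fits, which is why grids grow expensive quickly.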


In [109]:
from lightgbm import LGBMRegressor
from sklearn.model_selection import GridSearchCV


print("PLENIS VELIS")

PLENIS VELIS


In [110]:
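param_grid = {
    'n_estimators': [50, 100],  # number of trees to use
    # Maximum tree depth, i.e. how many levels of questions each tree may ask.
    # There is no fixed upper limit (in practice the amount of data bounds it),
    # but setting it too high risks overfitting.
    'max_depth': [1, 10],
}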

Ensemble model: a technique that combines several models (machine learning algorithms) to get stronger performance.
Tree: each of the individual models that make up the ensemble.
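
To make the roles of `n_estimators` and `max_depth` concrete, here is a toy boosting loop in plain Python. It is a deliberately simplified sketch, not LightGBM's actual algorithm: every tree is a depth-1 "stump" (comparable to `max_depth=1`), each new stump is fitted to the errors left by the previous ones, and the data, function names, and learning rate are all made up for illustration.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 tree: pick the threshold whose two leaf means minimize squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # split rule is x <= t, so both sides stay non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, t, left_mean, right_mean)
    return best[1:]


def boosted_mse(xs, ys, n_estimators, learning_rate=0.1):
    """Add n_estimators stumps one after another and return the training MSE."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean of the targets
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


xs = list(range(1, 9))    # made-up inputs
ys = [x * x for x in xs]  # made-up targets
mse_0 = boosted_mse(xs, ys, 0)
mse_50 = boosted_mse(xs, ys, 50)
mse_100 = boosted_mse(xs, ys, 100)
print(mse_0, mse_50, mse_100)
```

The training error keeps shrinking as stumps are added, which is also why more or deeper trees can overfit: the numbers here are measured on the training data only, whereas cross-validation measures them on held-out folds.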

In [111]:
# !pip install lightgbm
from lightgbm import LGBMRegressor
random_state = 42
model = LGBMRegressor(random_state=random_state)
print("Intentio Voluntatis! Plenis Velis!")

Intentio Voluntatis! Plenis Velis!


In [112]:
import pandas as pd
import numpy as np

# Load and preprocess the train data
train = pd.read_csv("/Aiffel/jan/data/train.csv")
y = train['price']
y = np.log1p(y)  # log transform (reduces the skew of the price distribution)

del train['price']
del train['id']  # drop the id column (not useful for prediction)
train['date'] = train['date'].apply(lambda i: i[:6]).astype(int)  # keep the first 6 characters, then cast to int

print(f"train shape: {train.shape}")
print(f"y example before transform: 221900 -> after: {np.log1p(221900):.4f}")

train shape: (15035, 19)
y example before transform: 221900 -> after: 12.3100
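
The effect of `log1p` on a skewed target can be seen with a handful of numbers. The sketch below uses only the standard library; apart from 221900, which appears in the output above, the prices are hypothetical values chosen for illustration.

```python
import math

# Hypothetical prices for illustration; only 221900 comes from the notebook.
prices = [221900, 538000, 1225000, 7700000]

# log1p(x) = log(1 + x), which stays well-defined at x = 0.
logged = [math.log1p(p) for p in prices]

for p, l in zip(prices, logged):
    print(f"{p:>9} -> {l:.4f}")

# Compare how far apart the smallest and largest values are, before and after.
raw_ratio = max(prices) / min(prices)
log_ratio = max(logged) / min(logged)
print(f"max/min before: {raw_ratio:.1f}, after: {log_ratio:.2f}")
```

Very expensive houses no longer dominate the squared error once the target is on the log scale, which is the motivation for transforming `y` before fitting.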


#### <span style = "color : pink"> Finding the best settings with GridSearchCV: automatically testing multiple parameter combinations </span>

In [113]:
grid_model = GridSearchCV(model, param_grid=param_grid,
                        scoring='neg_mean_squared_error',
                        cv=5, verbose=1, n_jobs=5)

grid_model.fit(train, y)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002017 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2298
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001884 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2327
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001998 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2296
[LightGBM] [Info] Number of data points in the train set: 12028, number of used features: 19
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overh

0,1,2
,estimator,LGBMRegressor(random_state=42)
,param_grid,"{'max_depth': [1, 10], 'n_estimators': [50, 100]}"
,scoring,'neg_mean_squared_error'
,n_jobs,5
,refit,True
,cv,5
,verbose,1
,pre_dispatch,'2*n_jobs'
,error_score,
,return_train_score,False

0,1,2
,boosting_type,'gbdt'
,num_leaves,31
,max_depth,10
,learning_rate,0.1
,n_estimators,100
,subsample_for_bin,200000
,objective,
,class_weight,
,min_split_gain,0.0
,min_child_weight,0.001


In [114]:
# Check the best parameters
print(grid_model.best_params_)

# Check the best score
print(grid_model.best_score_)

{'max_depth': 10, 'n_estimators': 100}
-0.027027144840492612


Calling `grid_model.fit` has now run the experiments for all 4 combinations.<br/>
The results are stored in `grid_model.cv_results_`, as shown below.

In [115]:
grid_model.cv_results_

{'mean_fit_time': array([0.12140021, 0.21631498, 1.31875019, 2.34646344]),
 'std_fit_time': array([0.00021997, 0.00727634, 0.10315398, 0.09445354]),
 'mean_score_time': array([0.0020102 , 0.00148726, 0.00224719, 0.00321159]),
 'std_score_time': array([0.0002149 , 0.00014009, 0.00012221, 0.00044873]),
 'param_max_depth': masked_array(data=[1, 1, 10, 10],
              mask=[False, False, False, False],
        fill_value=999999),
 'param_n_estimators': masked_array(data=[50, 100, 50, 100],
              mask=[False, False, False, False],
        fill_value=999999),
 'params': [{'max_depth': 1, 'n_estimators': 50},
  {'max_depth': 1, 'n_estimators': 100},
  {'max_depth': 10, 'n_estimators': 50},
  {'max_depth': 10, 'n_estimators': 100}],
 'split0_test_score': array([-0.0756974 , -0.05555652, -0.02885847, -0.02665428]),
 'split1_test_score': array([-0.07666447, -0.057876  , -0.03041465, -0.02795896]),
 'split2_test_score': array([-0.07354904, -0.05546079, -0.03068533, -0.02834112]),
 'spl

In [116]:
params = grid_model.cv_results_['params']
params

[{'max_depth': 1, 'n_estimators': 50},
 {'max_depth': 1, 'n_estimators': 100},
 {'max_depth': 10, 'n_estimators': 50},
 {'max_depth': 10, 'n_estimators': 100}]

In [117]:
score = grid_model.cv_results_['mean_test_score']
score

array([-0.07339447, -0.05502043, -0.02917734, -0.02702714])
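
The scores are negative because scikit-learn scorers follow a "higher is better" convention, so the mean squared error is reported with its sign flipped. The small sketch below, using the values printed above and only plain Python, shows how the best candidate follows from that convention; `best_index` is an illustrative name, not something read from `grid_model`.

```python
# Values copied from the cv_results_ output above.
params = [
    {'max_depth': 1, 'n_estimators': 50},
    {'max_depth': 1, 'n_estimators': 100},
    {'max_depth': 10, 'n_estimators': 50},
    {'max_depth': 10, 'n_estimators': 100},
]
mean_test_score = [-0.07339447, -0.05502043, -0.02917734, -0.02702714]

# Scores are negated MSE, so the largest value (closest to zero) is the best.
best_index = max(range(len(mean_test_score)), key=mean_test_score.__getitem__)

print(params[best_index], mean_test_score[best_index])
```

The selected entry agrees with the `best_params_` and `best_score_` values printed earlier.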

In [118]:

results = pd.DataFrame(params)
results['score'] = score

results

Unnamed: 0,max_depth,n_estimators,score
0,1,50,-0.073394
1,1,100,-0.05502
2,10,50,-0.029177
3,10,100,-0.027027


In [119]:
import numpy as np
results['RMSE'] = np.sqrt(-1 * results['score'])
results

Unnamed: 0,max_depth,n_estimators,score,RMSE
0,1,50,-0.073394,0.270914
1,1,100,-0.05502,0.234564
2,10,50,-0.029177,0.170814
3,10,100,-0.027027,0.164399


In [120]:
results = results.rename(columns={'RMSE': 'RMSLE'})
results

Unnamed: 0,max_depth,n_estimators,score,RMSLE
0,1,50,-0.073394,0.270914
1,1,100,-0.05502,0.234564
2,10,50,-0.029177,0.170814
3,10,100,-0.027027,0.164399
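
The rename from RMSE to RMSLE is justified because `y` was transformed with `np.log1p` before fitting: the root mean squared error measured on the log-scale target is the root mean squared log error on the original prices. The sketch below checks that identity with the standard library; the true and predicted prices are hypothetical.

```python
import math

# Hypothetical true prices and predictions, purely for illustration.
y_true = [221900.0, 538000.0, 180000.0]
y_pred = [250000.0, 500000.0, 200000.0]


def rmsle(actual, predicted):
    """RMSLE computed directly from values on the original price scale."""
    squared = [(math.log1p(p) - math.log1p(a)) ** 2
               for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# What the notebook does: move to log space first, take the MSE, then the square root.
log_true = [math.log1p(v) for v in y_true]
log_pred = [math.log1p(v) for v in y_pred]
mse_log = sum((p - a) ** 2 for a, p in zip(log_true, log_pred)) / len(log_true)
rmse_log = math.sqrt(mse_log)

print(rmsle(y_true, y_pred), rmse_log)
```

One caveat: the notebook takes the square root of the fold-averaged MSE, which is generally not identical to averaging the per-fold RMSLE values, although the two are usually close.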


In [121]:
# Sort the table above in ascending order of `RMSLE`.

results = results.sort_values('RMSLE')
results

Unnamed: 0,max_depth,n_estimators,score,RMSLE
3,10,100,-0.027027,0.164399
2,10,50,-0.029177,0.170814
1,1,100,-0.05502,0.234564
0,1,50,-0.073394,0.270914


In [122]:
"""
다음과 같은 과정을 진행할 수 있는 `my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5)` 함수를 구현해 보세요.

1. GridSearchCV 모델로 `model`을 초기화합니다.
2. 모델을 fitting 합니다.
3. params, score에 각 조합에 대한 결과를 저장합니다.
4. 데이터 프레임을 생성하고, RMSLE 값을 추가한 후 점수가 높은 순서로 정렬한 `results`를 반환합니다.
"""

def my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5):
    # GridSearchCV 모델로 초기화
    grid_model = GridSearchCV(model, param_grid=param_grid, scoring='neg_mean_squared_error', \
                              cv=5, verbose=verbose, n_jobs=n_jobs)

    # 모델 fitting
    grid_model.fit(train, y)

    # 결과값 저장
    params = grid_model.cv_results_['params']
    score = grid_model.cv_results_['mean_test_score']

    # 데이터 프레임 생성
    results = pd.DataFrame(params)
    results['score'] = score

    # RMSLE 값 계산 후 정렬
    results['RMSLE'] = np.sqrt(-1 * results['score'])
    results = results.sort_values('RMSLE')

    return results
"""
cv = Cross-Validation (교차 검증) 의 약자입니다.
cv=5는 데이터를 5등분해서 검증한다는 의미입니다.
"""

'\ncv = Cross-Validation (교차 검증) 의 약자입니다.\ncv=5는 데이터를 5등분해서 검증한다는 의미입니다.\n'
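
What "split into 5 folds" means in practice can be sketched without scikit-learn. The generator below is a simplified stand-in for an unshuffled K-fold split (the function name `kfold_indices` is made up here): each fold is held out once for validation while the other four are used for training.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread any remainder over the first folds
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size


n_samples = 15035  # number of rows in train, from the shape printed earlier
for fold, (train_idx, val_idx) in enumerate(kfold_indices(n_samples, 5)):
    print(fold, len(train_idx), len(val_idx))
```

With 15035 rows, each fold trains on 12028 rows, the figure that appears in the "Number of data points in the train set" lines of the LightGBM log.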

In [123]:
print("최적 파라미터:", grid_model.best_params_)
print("최적 점수:", grid_model.best_score_)

최적 파라미터: {'max_depth': 10, 'n_estimators': 100}
최적 점수: -0.027027144840492612


### 12. Submitting, too: quick and clean<br/>
#### <span style = "color : pink"> Using LGBMRegressor, a model that learns patterns from data and makes predictions </span>


In [124]:
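param_grid = {
    'n_estimators': [50, 100],
    # NOTE: 'man_mepth' is a typo for 'max_depth'. LightGBM does not recognize this key,
    # so depth is never varied and the paired rows in the output below score identically.
    # Use 'max_depth' to actually search over tree depth.
    'man_mepth': [1, 10],
}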

model = LGBMRegressor(random_state=42)
my_GridSearch(model, train, y, param_grid, verbose=2, n_jobs=5)

Fitting 5 folds for each of 4 candidates, totalling 20 fits
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002006 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2327
[LightGBM] [Info] Number of data points in the train set: 12028, number of used features: 19
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001164 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2296
[LightGBM] [Info] Start training from score 13.052839
[LightGBM] [Info] Number of data points in the train set: 12028, number of used features: 19
[LightGBM] [Info] Start training from score 13.050187
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001289 seconds.
You can set `force_row

Unnamed: 0,man_mepth,n_estimators,score,RMSLE
1,1,100,-0.027051,0.164472
3,10,100,-0.027051,0.164472
0,1,50,-0.029198,0.170875
2,10,50,-0.029198,0.170875


In [125]:
# For clarity, review the ranks and parameters from the earlier grid_model run
rank_df = pd.DataFrame({
      'rank': grid_model.cv_results_['rank_test_score'],
      'max_depth': grid_model.cv_results_['param_max_depth'],
      'n_estimators': grid_model.cv_results_['param_n_estimators'],
      'mean_score': grid_model.cv_results_['mean_test_score']
})
rank_df.sort_values('rank')

Unnamed: 0,rank,max_depth,n_estimators,mean_score
3,1,10,100,-0.027027
2,2,10,50,-0.029177
1,3,1,100,-0.05502
0,4,1,50,-0.073394


#### <span style = "color:pink"> Apply np.expm1() to the predictions to return them to the original price scale </span>

In [126]:
# Load and preprocess the test data
test = pd.read_csv("/Aiffel/jan/data/test.csv")
test_id = test['id']  # keep the ids for the submission
del test['id']
test['date'] = test['date'].apply(lambda i: i[:6]).astype(int)

print(f"test shape: {test.shape}")

# Train with the best parameters and predict
model = LGBMRegressor(max_depth=10, n_estimators=100, random_state=random_state)
model.fit(train, y)

# Predict (log scale)
prediction = model.predict(test)

# Restore the original scale (np.expm1)
prediction = np.expm1(prediction)

print(f"Sample predictions: {prediction[:5]}")

test shape: (6468, 19)
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000691 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2338
[LightGBM] [Info] Number of data points in the train set: 15035, number of used features: 19
[LightGBM] [Info] Start training from score 13.048122
Sample predictions: [ 506766.66784595  479506.10405112 1345155.15609376  312257.88179592
  333864.49141891]
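
Because the model was trained on `log1p(price)`, its raw predictions are log-scale numbers around 12 to 14, and `expm1` is the exact inverse that turns them back into prices. The round trip can be checked with the standard library; the second value is the "Start training from score" figure from the LightGBM log above, used here only as an example of a typical log-scale number.

```python
import math

price = 221900                 # example price used earlier in the notebook
logged = math.log1p(price)     # the scale the model is trained on
restored = math.expm1(logged)  # the scale the submission needs

print(f"{price} -> {logged:.4f} -> {restored:.1f}")

# A log-scale value of this size maps back to a price in the hundreds of thousands.
log_scale_value = 13.048122
print(f"{log_scale_value} -> {math.expm1(log_scale_value):,.0f}")
```

Skipping the inverse transform would submit values near 13 instead of prices, so the call to `np.expm1` has to stay paired with the `np.log1p` applied to `y`.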


In [127]:
# Load the sample_submission.csv file

from os.path import join
data_dir = "/Aiffel/jan/data/"

submission_path = join(data_dir, 'sample_submission.csv')
submission = pd.read_csv(submission_path)
submission.head()

Unnamed: 0,id,price
0,15035,100000
1,15036,100000
2,15037,100000
3,15038,100000
4,15039,100000


In [128]:
# Overwriting the price column of this DataFrame with our model's predictions completes the data to submit!
submission['price'] = prediction
submission.head()

Unnamed: 0,id,price
0,15035,506766.7
1,15036,479506.1
2,15037,1345155.0
3,15038,312257.9
4,15039,333864.5


In [129]:
# Save the data above as a csv file.
# Many more experiments will follow, so putting the model type and the RMSLE value found above in the file name keeps the submission files tidy.
submission_csv_path = join(data_dir, 'submission_{}_RMSLE_{}.csv'.format('lgbm', '0.164399'))
submission.to_csv(submission_csv_path, index=False)
print(submission_csv_path)

/Users/macminim4/PyCharmMiscProject/data/submission_lgbm_RMSLE_0.164399.csv
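
Building the file path with a join helper rather than string formatting avoids a doubled slash when the directory already ends with `/`. A minimal sketch follows; the helper name `submission_path` is hypothetical, and `posixpath` is used instead of `os.path` only so that the result looks the same on every operating system.

```python
import posixpath  # POSIX-style joins, so the output is identical on any OS


def submission_path(data_dir, model_name, rmsle):
    """Build the submission file path (hypothetical helper for illustration)."""
    filename = f"submission_{model_name}_RMSLE_{rmsle:.6f}.csv"
    return posixpath.join(data_dir, filename)


# The result is the same with or without a trailing slash on the directory.
with_slash = submission_path("/Aiffel/jan/data/", "lgbm", 0.164399)
without_slash = submission_path("/Aiffel/jan/data", "lgbm", 0.164399)
print(with_slash)
print(without_slash)
```

Formatting the score with a fixed number of decimals also keeps file names consistent when the RMSLE is passed as a float rather than a string.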


#### Wrapping the steps above in a function

In [130]:
"""
아래의 과정을 수행하는 `save_submission(model, train, y, test, model_name, rmsle)` 함수를 구현해 주세요.
1. 모델을 `train`, `y`로 학습시킵니다.
2. `test`에 대해 예측합니다.
3. 예측값을 `np.expm1`으로 변환하고, `submission_model_name_RMSLE_100000.csv` 형태의 `csv` 파일을 저장합니다.
"""
from os.path import join

def save_submission(model, train, y, test, model_name, rmsle=None):
    model.fit(train, y)
    prediction = model.predict(test)
    prediction = np.expm1(prediction)
    
    # 경로 수정
    data_dir = '/Aiffel/jan/data'
    
    submission_path = join(data_dir, 'sample_submission.csv')
    submission = pd.read_csv(submission_path)
    submission['price'] = prediction
    submission_csv_path = '{}/submission_{}_RMSLE_{}.csv'.format(data_dir, model_name, rmsle)
    submission.to_csv(submission_csv_path, index=False)
    print('{} saved!'.format(submission_csv_path))

In [131]:
save_submission(model, train, y, test, 'lgbm', rmsle='0.164399')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000564 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2338
[LightGBM] [Info] Number of data points in the train set: 15035, number of used features: 19
[LightGBM] [Info] Start training from score 13.048122
/Users/macminim4/PyCharmMiscProject/data/submission_lgbm_RMSLE_0.164399.csv saved!
