In [1]:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import KFold
from numpy.random import RandomState
from sklearn.metrics import mean_squared_error
from hyperopt import hp, fmin, tpe

In [5]:
train = pd.read_csv('preprocess/train.csv')
test = pd.read_csv('preprocess/test.csv')

---


## <center>**Wrapper Feature Selection + LightGBM Modeling + TPE Tuning**

### 1. Wrapper Feature Selection

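&emsp;&emsp;Next comes feature selection. Here we use a wrapper-style approach: train a LightGBM model on all of the features, inspect the resulting feature importances, and keep the 300 most important features. To make the selection more reliable, we repeat this over several cross-validation folds. The multi-round procedure is simple: record the feature importances from each fold and sum them at the end. The following function implements this process: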

In [84]:
def feature_select_wrapper(train, test):
    """
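    Feature selection based on LightGBM feature importance
    :param train: training dataset
    :param test: test dataset
    :return: training and test sets restricted to the selected features
    """
    
    # Part 1. Collect feature names, dropping the ID and label columns
    print('feature_select_wrapper...')
    label = 'target'
    features = train.columns.tolist()
    features.remove('card_id')
    features.remove('target')

    # Part 2. Configure the lgb parameters
    # Model parameters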
    params_initial = {
        'num_leaves': 31,
        'learning_rate': 0.1,
        'boosting': 'gbdt',
        'min_child_samples': 20,
        'bagging_seed': 2020,
        'bagging_fraction': 0.7,
        'bagging_freq': 1,
        'feature_fraction': 0.7,
        'max_depth': -1,
        'metric': 'rmse',
        'reg_alpha': 0,
        'reg_lambda': 1,
        'objective': 'regression'
    }
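    # Control parameters
    # Early-stopping patience (rounds without improvement on the validation set)
    ESR = 30
    # Maximum number of boosting rounds
    NBR = 10000
    # Logging interval (in rounds)
    VBE = 50
    
    # Part 3. Cross-validation
    # Instantiate the K-fold splitter
    kf = KFold(n_splits=5, random_state=2020, shuffle=True)
    # Accumulator for feature importances, initialised to zero
    fse = pd.Series(0, index=features)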
    
    for train_part_index, eval_index in kf.split(train[features], train[label]):
        # Wrap the training fold
        train_part = lgb.Dataset(train[features].loc[train_part_index],
                                 train[label].loc[train_part_index])
        # Wrap the validation fold (named eval_part to avoid shadowing the built-in eval)
        eval_part = lgb.Dataset(train[features].loc[eval_index],
                                train[label].loc[eval_index])
        # Train on the training fold while monitoring the validation fold
        bst = lgb.train(params_initial, train_part, num_boost_round=NBR,
                        valid_sets=[train_part, eval_part],
                        valid_names=['train', 'valid'],
                        early_stopping_rounds=ESR, verbose_eval=VBE)
        # Add this fold's feature importances to the running total
        fse += pd.Series(bst.feature_importance(), features)
    
    # Part 4. Keep the 300 most important features
    feature_select = ['card_id'] + fse.sort_values(ascending=False).index.tolist()[:300]
    print('done')
    return train[feature_select + ['target']], test[feature_select]

In [8]:
train_LGBM, test_LGBM = feature_select_wrapper(train, test)

feature_select_wrapper...




You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 227016
[LightGBM] [Info] Number of data points in the train set: 161533, number of used features: 1626
[LightGBM] [Info] Start training from score -0.390986
Training until validation scores don't improve for 30 rounds
[50]	train's rmse: 3.43695	valid's rmse: 3.70629
Early stopping, best iteration is:
[66]	train's rmse: 3.39251	valid's rmse: 3.70281
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 227122
[LightGBM] [Info] Number of data points in the train set: 161533, number of used features: 1629
[LightGBM] [Info] Start training from score -0.396781
Training until validation scores don't improve for 30 rounds
[50]	train's rmse: 3.45546	valid's rmse: 3.67176
[100]	train's rmse: 3.33017	valid's rmse: 3.67221
Early stopping, best iteration is:
[82]	train's rmse: 3.37387	valid's rmse: 3.66794
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info

Check the final output:

In [9]:
train_LGBM.shape

(201917, 302)
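
The aggregation step inside `feature_select_wrapper` (sum the per-fold importances, then keep the top k) can be illustrated without LightGBM. The sketch below is only a toy under stated assumptions: absolute Pearson correlation stands in for the model's feature importance, the data are synthetic, and the helper names `k_fold_indices`, `abs_correlation`, and `select_top_k` are hypothetical and not part of the original notebook.

```python
import random

def k_fold_indices(n_rows, n_splits, seed):
    """Shuffle row indices and deal them into n_splits validation folds."""
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    return [indices[i::n_splits] for i in range(n_splits)]

def abs_correlation(xs, ys):
    """Absolute Pearson correlation; a toy stand-in for a model's feature importance."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def select_top_k(columns, target, k, n_splits=5, seed=2020):
    """Sum per-fold importances and keep the k highest-scoring feature names."""
    totals = {name: 0.0 for name in columns}
    for held_out in k_fold_indices(len(target), n_splits, seed):
        held_out = set(held_out)
        # Importance is computed on the training part of each fold only
        train_rows = [i for i in range(len(target)) if i not in held_out]
        y = [target[i] for i in train_rows]
        for name, values in columns.items():
            totals[name] += abs_correlation([values[i] for i in train_rows], y)
    ranked = sorted(totals, key=totals.get, reverse=True)
    return ranked[:k]

# Synthetic data: two informative columns and two pure-noise columns
rng = random.Random(0)
n = 200
signal_a = [rng.gauss(0, 1) for _ in range(n)]
signal_b = [rng.gauss(0, 1) for _ in range(n)]
columns = {
    'noise_1': [rng.gauss(0, 1) for _ in range(n)],
    'signal_a': signal_a,
    'noise_2': [rng.gauss(0, 1) for _ in range(n)],
    'signal_b': signal_b,
}
target = [2 * a - 3 * b + rng.gauss(0, 0.1) for a, b in zip(signal_a, signal_b)]

selected = select_top_k(columns, target, k=2)
print(selected)
```

Summing over folds means a feature has to rank well on most folds to be kept, which is the motivation the text gives for repeating the importance calculation under cross-validation.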

&emsp;&emsp;Next, we can build models on the selected features.

### 2. LightGBM Model Training and TPE Hyperparameter Optimization

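&emsp;&emsp;Next we train the LightGBM model. As with the earlier random forest, we search for good hyperparameters while training. To work more smoothly with hyperopt, we use LightGBM's native API here and wrap the whole modeling workflow in several functions.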

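- Parameter callback function

&emsp;&emsp;First, not every lgb hyperparameter needs to be searched. To prevent some hyperparameters from silently falling back to their defaults each time a model is instantiated, we first create a parameter callback function that re-declares the fixed values of these parameters on every instantiation: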

In [12]:
def params_append(params):
    """
    Parameter callback: re-applies the fixed settings; params is treated as a dict
    :param params: lgb parameter dict
    :return params: the updated lgb parameter dict
    """
    params['feature_pre_filter'] = False
    params['objective'] = 'regression'
    params['metric'] = 'rmse'
    params['bagging_seed'] = 2020
    return params
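
Note that `params_append` writes into the dictionary it receives and returns that same object, which is why the fixed keys show up in every parameter set printed during the search. If the caller's dictionary should stay untouched, a copying variant is one option. The sketch below is a hypothetical alternative, not the notebook's function; the name `params_append_copy` is an assumption introduced for illustration.

```python
def params_append_copy(params):
    """Return a new dict with the fixed LightGBM settings applied; the input is left untouched."""
    fixed = {
        'feature_pre_filter': False,
        'objective': 'regression',
        'metric': 'rmse',
        'bagging_seed': 2020,
    }
    # Later keys win, so the fixed settings override any same-named searched values
    return {**params, **fixed}

searched = {'learning_rate': 0.05, 'num_leaves': 31}
full = params_append_copy(searched)
print(full)
```

Either form works with the search below; the in-place version is simply more compact.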

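- Model training and parameter optimization function

&emsp;&emsp;Next comes the more involved step of model training and hyperparameter tuning. Unlike tuning inside sklearn, this step coordinates several libraries, and the lgb parameters are themselves fairly complex, so the overall procedure is more complicated. The following function carries it out: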

In [13]:
def param_hyperopt(train):
    """
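    Hyperparameter search and optimization
    :param train: training dataset
    :return params_best: best lgb parameters found
    """
    # Part 1. Collect feature names, dropping the ID and label columns
    label = 'target'
    features = train.columns.tolist()
    features.remove('card_id')
    features.remove('target')
    
    # Part 2. Wrap the training data
    train_data = lgb.Dataset(train[features], train[label])
    
    # Part 3. Inner objective: maps a set of hyperparameters to a loss value
    def hyperopt_objective(params):
        """
        Take a set of hyperparameters and return the corresponding loss
        :param params: hyperparameter dict sampled by hyperopt
        :return: minimum mean RMSE across boosting rounds
        """
        # Add the fixed parameters
        params = params_append(params)
        print(params)
        
        # Use lgb.cv to get the lowest loss reached with this set of hyperparameters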
        res = lgb.cv(params, train_data, 1000,
                     nfold=2,
                     stratified=False,
                     shuffle=True,
                     metrics='rmse',
                     early_stopping_rounds=20,
                     verbose_eval=False,
                     show_stdv=False,
                     seed=2020)
        return min(res['rmse-mean']) # res is a dict

    # Part 4. lgb hyperparameter search space
    params_space = {
        'learning_rate': hp.uniform('learning_rate', 1e-2, 5e-1),
        'bagging_fraction': hp.uniform('bagging_fraction', 0.5, 1),
        'feature_fraction': hp.uniform('feature_fraction', 0.5, 1),
        'num_leaves': hp.choice('num_leaves', list(range(10, 300, 10))),
        'reg_alpha': hp.randint('reg_alpha', 0, 10),
        'reg_lambda': hp.uniform('reg_lambda', 0, 10),
        'bagging_freq': hp.randint('bagging_freq', 1, 10),
        'min_child_samples': hp.choice('min_child_samples', list(range(1, 30, 5)))
    }
    
    # Part 5. TPE hyperparameter search
    params_best = fmin(
        hyperopt_objective,
        space=params_space,
        algo=tpe.suggest,
        max_evals=30,
        rstate=RandomState(2020))
    
    # Return the best parameters
    return params_best
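
The objective above reduces the `lgb.cv` result to a single number with `min(res['rmse-mean'])`, that is, the best mean validation RMSE reached at any boosting round. The sketch below shows that reduction on hand-written numbers and also returns the round at which the minimum occurs. Two assumptions are made: the helper name `best_cv_rmse` is hypothetical, and the suffix match on the key is there because, to my understanding, some LightGBM releases prefix the key (for example `'valid rmse-mean'`).

```python
def best_cv_rmse(cv_result):
    """Return (best mean RMSE, 1-based boosting round) from an lgb.cv-style result dict."""
    # Accept both 'rmse-mean' and prefixed variants such as 'valid rmse-mean'
    key = next(k for k in cv_result if k.endswith('rmse-mean'))
    scores = cv_result[key]
    best_round = min(range(len(scores)), key=scores.__getitem__)
    return scores[best_round], best_round + 1

# Hand-written numbers for illustration only, not real training output
toy_result = {'rmse-mean': [3.90, 3.75, 3.71, 3.72, 3.74],
              'rmse-stdv': [0.02, 0.02, 0.03, 0.03, 0.03]}
loss, n_rounds = best_cv_rmse(toy_result)
print(loss, n_rounds)
```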

Next we pass in the training data and try the function out:

In [14]:
best_clf = param_hyperopt(train_LGBM)

{'bagging_fraction': 0.7253952770621912, 'bagging_freq': 5, 'feature_fraction': 0.6972128940985931, 'learning_rate': 0.43437628238508774, 'min_child_samples': 6, 'num_leaves': 60, 'reg_alpha': 0, 'reg_lambda': 1.6139256132729207, 'feature_pre_filter': False, 'objective': 'regression', 'metric': 'rmse', 'bagging_seed': 2020}
  0%|          | 0/30 [00:00<?, ?trial/s, best loss=?]




You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65975                    
[LightGBM] [Info] Number of data points in the train set: 100958, number of used features: 300
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65975                    
[LightGBM] [Info] Number of data points in the train set: 100958, number of used features: 300
[LightGBM] [Info] Start training from score -0.396931 
[LightGBM] [Info] Start training from score -0.390344 
  0%|          | 0/30 [00:02<?, ?trial/s, best loss=?]




{'bagging_fraction': 0.557619162794617, 'bagging_freq': 6, 'feature_fraction': 0.768520768296847, 'learning_rate': 0.4484899481964635, 'min_child_samples': 1, 'num_leaves': 250, 'reg_alpha': 1, 'reg_lambda': 1.9478998979854978, 'feature_pre_filter': False, 'objective': 'regression', 'metric': 'rmse', 'bagging_seed': 2020}
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65975                                             
[LightGBM] [Info] Number of data points in the train set: 100958, number of used features: 300
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65975                                             
[LightGBM] [Info] Number of data points in the train set: 100958, number of used features: 300
[LightGBM] [Info] Start training from score -0.396931                          
[LightGBM] [Info] Start training from score -0.390344                          
{'bagging_fraction': 0.6089577218903197, 'bagging_

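best_clf now holds the parameter set returned by the search (for parameters defined with `hp.choice`, `fmin` reports the index of the selected option rather than the option itself):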

In [15]:
best_clf

{'bagging_fraction': 0.9022336069269954,
 'bagging_freq': 2,
 'feature_fraction': 0.9373662317255621,
 'learning_rate': 0.014947332175194025,
 'min_child_samples': 5,
 'num_leaves': 7,
 'reg_alpha': 2,
 'reg_lambda': 3.5907566887206896}
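
One detail of the output above deserves care: `num_leaves: 7` is not among the candidates 10, 20, ..., 290 declared in `params_space`. Read as an index into the `hp.choice` list, 7 points to 80, and `min_child_samples: 5` points to 26. The cells that follow pass `best_clf` on unchanged, so the scores reported in this section correspond to the raw values 7 and 5. The sketch below shows one way to decode such indices; `decode_choice_indices` and `CHOICE_OPTIONS` are hypothetical names introduced for illustration, and hyperopt's own `space_eval(params_space, params_best)` is the library route to the same mapping.

```python
# Candidate lists copied from params_space in param_hyperopt
CHOICE_OPTIONS = {
    'num_leaves': list(range(10, 300, 10)),
    'min_child_samples': list(range(1, 30, 5)),
}

def decode_choice_indices(best, choice_options):
    """Replace hp.choice indices in an fmin result with the values they point to."""
    decoded = dict(best)  # copy so the fmin result itself is not modified
    for name, options in choice_options.items():
        decoded[name] = options[decoded[name]]
    return decoded

# Subset of the best_clf output shown above
best_from_fmin = {'num_leaves': 7, 'min_child_samples': 5, 'reg_alpha': 2}
decoded = decode_choice_indices(best_from_fmin, CHOICE_OPTIONS)
print(decoded)
```

Parameters drawn with `hp.uniform` or `hp.randint` come back as actual values, so only the `hp.choice` entries need this step.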

### 3. LightGBM Prediction and Leaderboard Results

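&emsp;&emsp;With the best parameters found, we can move on to prediction. As before, there are two approaches. The first is single-model prediction: predict on the test set directly and submit the result. The second is to submit the average of the predictions produced during cross-validation, which also preserves the data needed later for stacking.

- Single-model prediction

&emsp;&emsp;First, test how a single model performs on the test set: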

In [16]:
# Re-declare the fixed parameters
best_clf = params_append(best_clf)

# Prepare the data
label = 'target'
features = train_LGBM.columns.tolist()
features.remove('card_id')
features.remove('target')

# Wrap the data
lgb_train = lgb.Dataset(train_LGBM[features], train_LGBM[label])

In [18]:
# Train the model on the full training set
bst = lgb.train(best_clf, lgb_train)

You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65975
[LightGBM] [Info] Number of data points in the train set: 201917, number of used features: 300
[LightGBM] [Info] Start training from score -0.393636


In [19]:
# Predict on the training set
bst.predict(train_LGBM[features])

array([-0.24511938, -2.01595283,  0.1053809 , ..., -0.18190679,
       -1.11870804, -0.24511938])

In [20]:
# Quick look at the training-set RMSE
np.sqrt(mean_squared_error(train_LGBM[label], bst.predict(train_LGBM[features])))

3.7213768397255365
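
The value above is an in-sample figure: it is computed on the same rows the model was trained on, so it says little about generalization. For reference, the calculation behind `np.sqrt(mean_squared_error(...))` can be written out by hand. The sketch below is a plain-Python equivalent on two made-up points; the function name `rmse` is introduced here for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Residuals of 3 and 4 give sqrt((9 + 16) / 2)
value = rmse([0.0, 0.0], [3.0, 4.0])
print(value)
```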

Next, predict on the test set and write the results to a local file:

In [21]:
test_LGBM['target'] = bst.predict(test_LGBM[features])
test_LGBM[['card_id', 'target']].to_csv("result/submission_LGBM.csv", index=False)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test_LGBM['target'] = bst.predict(test_LGBM[features])


In [23]:
test_LGBM[['card_id', 'target']].head(5)

| | card_id | target |
| ------ | ------ | ------ |
| 0 | C_ID_0ab67a22ab | -2.418856 |
| 1 | C_ID_130fd0cbdd | -0.752626 |
| 2 | C_ID_b709037bc5 | -0.030933 |
| 3 | C_ID_d27d835a9f | -0.245119 |
| 4 | C_ID_2b5e3df5c2 | -0.36699 |


Submitting this result gives the following public and private leaderboard scores:

<center><img src="https://s2.loli.net/2021/12/09/GowrHnvJWMOpxB4.png" alt="image-20211209172112868" style="zoom:33%;" />

Compared with the two earlier random forest submissions, the results are summarized below:

| Model | Private Score | Public Score |
| ------ | ------ | ------ |
| randomforest | 3.65455 | 3.74969 |
| randomforest+validation | 3.65173 | 3.74954 |
| LightGBM | 3.69723 | 3.80436 |

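In the single-model setting, lgb is slightly weaker than rf. Next we use cross-validation to improve the lgb predictions.

- Prediction with cross-validation

&emsp;&emsp;As with the cross-validated random forest, lgb follows the workflow below for training and prediction. It produces both the datasets needed for later ensembling and the average of the per-fold predictions, which is used as the final prediction.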

<center><img src="https://s2.loli.net/2021/12/08/ALF3cfuSwmB7b8z.png" alt="image-20211208192640281" style="zoom:33%;" />

The implementation is as follows:

In [24]:
def train_predict(train, test, params):
    """

    Train with cross-validation and predict
    :param train: training dataset
    :param test: test dataset
    :param params: lgb parameter dict
    :return: None; predictions are written to local files
    """
    # Part 1. Select features
    label = 'target'
    features = train.columns.tolist()
    features.remove('card_id')
    features.remove('target')
    
    # Part 2. Re-declare the fixed parameters and the iteration controls
    params = params_append(params)
    ESR = 30
    NBR = 10000
    VBE = 50
    
    # Part 3. Create containers for the results
    # Accumulator for test-set predictions, later saved to a local file
    prediction_test = 0
    # Validation scores per fold, for display
    cv_score = []
    # Container for validation-fold predictions, later saved to a local file
    prediction_train = pd.Series(dtype='float64')
    
    # Part 4. Cross-validation
    kf = KFold(n_splits=5, random_state=2020, shuffle=True)
    for train_part_index, eval_index in kf.split(train[features], train[label]):
        # Wrap the training fold
        train_part = lgb.Dataset(train[features].loc[train_part_index],
                                 train[label].loc[train_part_index])
        # Wrap the validation fold (named eval_part to avoid shadowing the built-in eval)
        eval_part = lgb.Dataset(train[features].loc[eval_index],
                                train[label].loc[eval_index])
        # Train with early stopping monitored on the validation fold
        bst = lgb.train(params, train_part, num_boost_round=NBR,
                        valid_sets=[train_part, eval_part],
                        valid_names=['train', 'valid'],
                        early_stopping_rounds=ESR, verbose_eval=VBE)
        # Predict on the test set and add to the prediction_test accumulator
        prediction_test += bst.predict(test[features])
        # Predict on the validation fold
        eval_pre = bst.predict(train[features].loc[eval_index])
        # Store the validation-fold predictions in prediction_train
        prediction_train = pd.concat([prediction_train, pd.Series(eval_pre, index=eval_index)])
        # Compute the score on the validation fold
        score = np.sqrt(mean_squared_error(train[label].loc[eval_index].values, eval_pre))
        # Record it in cv_score
        cv_score.append(score)
        
    # Part 5. Print / save the results
    # Print the per-fold validation scores and their mean
    print(cv_score, sum(cv_score) / 5)
    # Write the validation-fold predictions to a local file
    pd.Series(prediction_train.sort_index().values).to_csv("preprocess/train_lightgbm.csv", index=False)
    # Write the averaged test-set predictions to a local file
    pd.Series(prediction_test / 5).to_csv("preprocess/test_lightgbm.csv", index=False)
    # Use the fold-averaged test predictions as the final prediction
    test['target'] = prediction_test / 5
    # Save the test predictions in the competition's submission format
    test[['card_id', 'target']].to_csv("result/submission_lightgbm.csv", index=False)
    return
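
The bookkeeping in `train_predict` follows the usual out-of-fold pattern: every training row is predicted exactly once, by the model that did not see it, while the test set is predicted by all five models and the results are averaged. The sketch below isolates that pattern in plain Python. It is a toy under stated assumptions: the "model" simply predicts the mean target of its fitting rows, and the name `out_of_fold_predictions` is hypothetical.

```python
import random

def out_of_fold_predictions(y_train, n_test, n_splits=5, seed=2020):
    """Collect one out-of-fold prediction per training row and fold-averaged test predictions."""
    indices = list(range(len(y_train)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_splits] for i in range(n_splits)]

    oof = [None] * len(y_train)   # filled fold by fold, one value per training row
    test_sum = [0.0] * n_test     # running sum of per-fold test predictions

    for held_out in folds:
        held_out_set = set(held_out)
        fit_rows = [i for i in indices if i not in held_out_set]
        # Stand-in "model": predict the mean target of the rows used for fitting
        fold_mean = sum(y_train[i] for i in fit_rows) / len(fit_rows)
        for i in held_out:
            oof[i] = fold_mean
        test_sum = [s + fold_mean for s in test_sum]

    test_pred = [s / n_splits for s in test_sum]
    return oof, test_pred

y = [float(v) for v in range(10)]
oof, test_pred = out_of_fold_predictions(y, n_test=3)
print(oof)
print(test_pred)
```

Writing each prediction at its row position gives the same ordering that the function above obtains by collecting the fold predictions and calling `sort_index()`. The out-of-fold column can then serve as a training feature for stacking without leaking the target.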

In [25]:
train_LGBM, test_LGBM = feature_select_wrapper(train, test)
best_clf = param_hyperopt(train_LGBM)
train_predict(train_LGBM, test_LGBM, best_clf)


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65666
[LightGBM] [Info] Number of data points in the train set: 161533, number of used features: 300
[LightGBM] [Info] Start training from score -0.390986
Training until validation scores don't improve for 30 rounds
[50]	train's rmse: 3.75892	valid's rmse: 3.77103
[100]	train's rmse: 3.71868	valid's rmse: 3.73851
[150]	train's rmse: 3.69462	valid's rmse: 3.721
[200]	train's rmse: 3.67853	valid's rmse: 3.71087
[250]	train's rmse: 3.66669	valid's rmse: 3.70432
[300]	train's rmse: 3.65751	valid's rmse: 3.69995
[350]	train's rmse: 3.65049	valid's rmse: 3.69762
[400]	train's rmse: 3.64309	valid's rmse: 3.69551
[450]	train's rmse: 3.63597	valid's rmse: 3.69377
[500]	train's rmse: 3.63008	valid's rmse: 3.6925
[550]	train's rmse: 3.62462	valid's rmse: 3.69105
[600]	train's rmse: 3.6198	valid's rmse: 3.69013
[650]	train's rmse: 3.61443	valid's rmse: 3.68931
[700]	train's rmse: 3.61005	valid's rmse: 3.68884
[7



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65677
[LightGBM] [Info] Number of data points in the train set: 161533, number of used features: 300
[LightGBM] [Info] Start training from score -0.396781
Training until validation scores don't improve for 30 rounds
[50]	train's rmse: 3.76719	valid's rmse: 3.73854
[100]	train's rmse: 3.72832	valid's rmse: 3.70102
[150]	train's rmse: 3.70615	valid's rmse: 3.68157
[200]	train's rmse: 3.68981	valid's rmse: 3.6711
[250]	train's rmse: 3.67731	valid's rmse: 3.6642
[300]	train's rmse: 3.6682	valid's rmse: 3.66021
[350]	train's rmse: 3.66021	valid's rmse: 3.65732
[400]	train's rmse: 3.653	valid's rmse: 3.65522
[450]	train's rmse: 3.6464	valid's rmse: 3.65337
[500]	train's rmse: 3.63997	valid's rmse: 3.65184
[550]	train's rmse: 3.63353	valid's rmse: 3.6508
[600]	train's rmse: 3.62768	valid's rmse: 3.64985
[650]	train's rmse: 3.62235	valid's rmse: 3.64887
[700]	train's rmse: 3.61719	valid's rmse: 3.64818
[750]



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65684
[LightGBM] [Info] Number of data points in the train set: 161534, number of used features: 300
[LightGBM] [Info] Start training from score -0.390348
Training until validation scores don't improve for 30 rounds
[50]	train's rmse: 3.75231	valid's rmse: 3.78794
[100]	train's rmse: 3.71093	valid's rmse: 3.75345
[150]	train's rmse: 3.68701	valid's rmse: 3.73658
[200]	train's rmse: 3.67191	valid's rmse: 3.72698
[250]	train's rmse: 3.66002	valid's rmse: 3.72123
[300]	train's rmse: 3.65114	valid's rmse: 3.71716
[350]	train's rmse: 3.64347	valid's rmse: 3.71463
[400]	train's rmse: 3.63661	valid's rmse: 3.71266
[450]	train's rmse: 3.63053	valid's rmse: 3.71135
[500]	train's rmse: 3.62438	valid's rmse: 3.71013
[550]	train's rmse: 3.61892	valid's rmse: 3.70927
[600]	train's rmse: 3.61344	valid's rmse: 3.70822
[650]	train's rmse: 3.60779	valid's rmse: 3.70713
[700]	train's rmse: 3.60292	valid's rmse: 3.7066



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65642
[LightGBM] [Info] Number of data points in the train set: 161534, number of used features: 300
[LightGBM] [Info] Start training from score -0.391392
Training until validation scores don't improve for 30 rounds
[50]	train's rmse: 3.73315	valid's rmse: 3.86928
[100]	train's rmse: 3.69309	valid's rmse: 3.83508
[150]	train's rmse: 3.66977	valid's rmse: 3.81704
[200]	train's rmse: 3.65402	valid's rmse: 3.80625
[250]	train's rmse: 3.64247	valid's rmse: 3.79953
[300]	train's rmse: 3.63288	valid's rmse: 3.79472
[350]	train's rmse: 3.62507	valid's rmse: 3.79137
[400]	train's rmse: 3.61782	valid's rmse: 3.78885
[450]	train's rmse: 3.61148	valid's rmse: 3.78654
[500]	train's rmse: 3.60518	valid's rmse: 3.78439
[550]	train's rmse: 3.59867	valid's rmse: 3.78269
[600]	train's rmse: 3.59283	valid's rmse: 3.78127
[650]	train's rmse: 3.58721	valid's rmse: 3.77968
[700]	train's rmse: 3.58204	valid's rmse: 3.7786



You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 65695
[LightGBM] [Info] Number of data points in the train set: 161534, number of used features: 300
[LightGBM] [Info] Start training from score -0.398675
Training until validation scores don't improve for 30 rounds
[50]	train's rmse: 3.78477	valid's rmse: 3.65732
[100]	train's rmse: 3.74396	valid's rmse: 3.62525
[150]	train's rmse: 3.72006	valid's rmse: 3.6094
[200]	train's rmse: 3.7039	valid's rmse: 3.59961
[250]	train's rmse: 3.69286	valid's rmse: 3.59368
[300]	train's rmse: 3.68357	valid's rmse: 3.59002
[350]	train's rmse: 3.67565	valid's rmse: 3.58705
[400]	train's rmse: 3.66878	valid's rmse: 3.58453
[450]	train's rmse: 3.66276	valid's rmse: 3.5828
[500]	train's rmse: 3.65653	valid's rmse: 3.58089
[550]	train's rmse: 3.65107	valid's rmse: 3.57947
[600]	train's rmse: 3.64527	valid's rmse: 3.57869
[650]	train's rmse: 3.64029	valid's rmse: 3.57741
[700]	train's rmse: 3.63434	valid's rmse: 3.57659
[

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  test['target'] = prediction_test / 5


The predictions can now be submitted on the competition page. The final public and private leaderboard scores are:

<center><img src="https://s2.loli.net/2021/12/09/EnekwUaMIVKQfDt.png" alt="image-20211209173249232" style="zoom:50%;" />

Compared with the earlier results:

| Model | Private Score | Public Score |
| ------ | ------ | ------ |
| randomforest | 3.65455 | 3.74969 |
| randomforest+validation | 3.65173 | 3.74954 |
| LightGBM | 3.69723 | 3.80436 |
| LightGBM+validation | 3.64403 | 3.73875 |

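The fold-averaged predictions from cross-validation score noticeably better than the single-model LightGBM submission (private score 3.64403 versus 3.69723), and this is the best result we have obtained so far. With the value of cross-validation now well demonstrated, subsequent models will only report model + cross-validation results, and single-model outputs will no longer be produced.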