### Training data format
LightGBM accepts input data files in CSV, TSV, and LibSVM formats.
By default, the label is the first column and the file contains no header row.

### Categorical feature support
LightGBM can use categorical features directly, with no separate encoding step such as one-hot encoding.

In [19]:
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score

In [7]:
iris = load_iris()
data = iris.data
target = iris.target

In [12]:
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

In [13]:
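lgb_train = lgb.Dataset(X_train, y_train)  # training Dataset (it can also be saved to a LightGBM binary file for faster loading)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)  # validation Dataset, binned consistently with the training set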

In [14]:
params = {
    'task':'train',
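    'boosting_type':'gbdt', # boosting type
    'objective':'regression', # objective function
    'metric':{'l2','auc'}, # evaluation metrics
    'num_leaves':31,  # maximum number of leaves per tree
    'learning_rate':0.05, # learning rate
    'feature_fraction':0.9, # fraction of features sampled for each tree
    'bagging_fraction':0.8, # fraction of rows sampled for each tree
    'bagging_freq':5, # k means bagging is performed every k iterations
    'verbose':1 # <0: fatal only; =0: errors and warnings; >0: info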
}

In [15]:
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=[lgb_train, lgb_eval], early_stopping_rounds=5)  # train() takes the parameter dict and the training Dataset


You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 87
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score 1.008333
[1]	training's l2: 0.644495	training's auc: 0.99359	valid_1's l2: 0.455037	valid_1's auc: 0.954545
Training until validation scores don't improve for 5 rounds
[2]	training's l2: 0.586588	training's auc: 0.99359	valid_1's l2: 0.413128	valid_1's auc: 0.954545
[3]	training's l2: 0.534155	training's auc: 0.99359	valid_1's l2: 0.375581	valid_1's auc: 0.954545
[4]	training's l2: 0.486845	training's auc: 0.99359	valid_1's l2: 0.341896	valid_1's auc: 0.954545
[5]	training's l2: 0.443975	training's auc: 0.99359	valid_1's l2: 0.311748	valid_1's auc: 0.954545
[6]	training's l2: 0.404842	training's auc: 0.99359	valid_1's l2: 0.283838	valid_1's auc: 0.954545
Early stopping, best iteration is:
[1]	training's l2: 0.644495	training's auc: 0.99359	valid_1's l2: 0.455037	val



In [16]:
gbm.save_model('model.txt') 

<lightgbm.basic.Booster at 0x7f96187d6e20>

In [17]:
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

In [18]:
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)  # root mean squared error between true and predicted values

The rmse of prediction is: 0.6745642146042917
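
The RMSE is the square root of the mean squared difference between the true and predicted values. The toy numbers below are invented for illustration and are not taken from the run above. They show the hand computation next to the `mean_squared_error(...) ** 0.5` expression used in the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy values, not taken from the run above.
y_true = np.array([0.0, 1.0, 2.0])
y_hat = np.array([1.0, 1.0, 1.0])

mse = np.mean((y_true - y_hat) ** 2)  # (1 + 0 + 1) / 3
rmse = np.sqrt(mse)

# The expression used in the notebook gives the same number.
rmse_sklearn = mean_squared_error(y_true, y_hat) ** 0.5
print(round(rmse, 4), round(rmse_sklearn, 4))
```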


In [22]:
y_pred

array([0.95938726, 0.95938726, 1.05278846, 1.05278846, 0.95938726,
       1.00983974, 1.00983974, 0.95938726, 1.05278846, 1.00983974,
       1.00983974, 0.95938726, 0.95938726, 0.95938726, 1.05278846,
       1.00983974, 1.05278846, 1.00983974, 1.00983974, 1.05278846,
       1.00983974, 1.05278846, 1.00983974, 1.00983974, 0.95938726,
       0.95938726, 0.95938726, 1.05278846, 1.00983974, 1.00983974])

In [23]:
y_test

array([0, 0, 2, 2, 0, 1, 1, 0, 2, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 2, 1, 2,
       1, 1, 0, 0, 0, 2, 1, 1])

In [1]:
import lightgbm as lgb
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
iris = load_iris()
data = iris.data
target = iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)
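# Wrap the data in LightGBM Dataset objects
lgb_train = lgb.Dataset(X_train, y_train)  # training Dataset (it can also be saved to a LightGBM binary file for faster loading)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)  # validation Dataset, binned consistently with the training set

# Define the parameters as a dict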
params = {
    'task':'train',
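    'boosting_type':'gbdt', # boosting type
    'objective':'regression', # objective function
    'metric':{'l2','auc'}, # evaluation metrics
    'num_leaves':31,  # maximum number of leaves per tree
    'learning_rate':0.05, # learning rate
    'feature_fraction':0.9, # fraction of features sampled for each tree
    'bagging_fraction':0.8, # fraction of rows sampled for each tree
    'bagging_freq':5, # k means bagging is performed every k iterations
    'verbose':1 # <0: fatal only; =0: errors and warnings; >0: info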
}

print('Start training...')
# Train the model
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=[lgb_train, lgb_eval], early_stopping_rounds=5)  # train() takes the parameter dict and the training Dataset
print('Save model...') 
gbm.save_model('model.txt') 
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)

Start training...
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 88
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score 0.983333
[1]	training's l2: 0.605777	training's auc: 0.987342	valid_1's l2: 0.608945	valid_1's auc: 0.97619
Training until validation scores don't improve for 5 rounds
[2]	training's l2: 0.551047	training's auc: 0.987342	valid_1's l2: 0.554966	valid_1's auc: 0.97619
[3]	training's l2: 0.501629	training's auc: 0.987342	valid_1's l2: 0.506601	valid_1's auc: 0.97619
[4]	training's l2: 0.455715	training's auc: 0.999691	valid_1's l2: 0.463858	valid_1's auc: 0.997354
[5]	training's l2: 0.413831	training's auc: 0.999691	valid_1's l2: 0.423967	valid_1's auc: 0.997354
[6]	training's l2: 0.376126	training's auc: 0.999691	valid_1's l2: 0.388316	valid_1's auc: 0.997354
[7]	training's l2: 0.342107	training's auc: 0.999691	valid_1's l2: 0.35639	valid_1's auc: 0.



In [2]:
y_pred

array([1.17191864, 0.80939951, 0.80939951, 1.17191864, 1.17191864,
       0.99632255, 0.85422191, 0.99632255, 0.85422191, 0.99632255,
       0.80939951, 0.80939951, 1.17191864, 0.99632255, 1.17191864,
       0.80939951, 1.17191864, 1.17191864, 1.17191864, 0.99632255,
       0.99632255, 1.17191864, 0.99632255, 0.80939951, 1.17191864,
       1.17191864, 0.80939951, 0.99632255, 0.80939951, 1.17191864])

In [3]:
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)  # root mean squared error between true and predicted values

NameError: name 'mean_squared_error' is not defined

In [18]:
lgb_eval 

<lightgbm.basic.Dataset at 0x7f9d88dfcb20>

In [14]:
data.shape

(150, 4)

In [10]:
target

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [1]:
import json
import lightgbm as lgb
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

iris = load_iris()  # load the iris dataset
data = iris.data
target = iris.target
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2)

# Load your own data
# print('Load data...')
# df_train = pd.read_csv('../regression/regression.train', header=None, sep='\t')
# df_test = pd.read_csv('../regression/regression.test', header=None, sep='\t')
#
# y_train = df_train[0].values
# y_test = df_test[0].values
# X_train = df_train.drop(0, axis=1).values
# X_test = df_test.drop(0, axis=1).values

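# Wrap the data in LightGBM Dataset objects
lgb_train = lgb.Dataset(X_train, y_train)  # training Dataset (it can also be saved to a LightGBM binary file for faster loading)
lgb_eval = lgb.Dataset(X_test, y_test, reference=lgb_train)  # validation Dataset, binned consistently with the training set

# Define the parameters as a dict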
params = {
    'task':'train',
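    'boosting_type':'gbdt', # boosting type
    'objective':'regression', # objective function
    'metric':{'l2','auc'}, # evaluation metrics
    'num_leaves':31,  # maximum number of leaves per tree
    'learning_rate':0.05, # learning rate
    'feature_fraction':0.9, # fraction of features sampled for each tree
    'bagging_fraction':0.8, # fraction of rows sampled for each tree
    'bagging_freq':5, # k means bagging is performed every k iterations
    'verbose':1 # <0: fatal only; =0: errors and warnings; >0: info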
}

print('Start training...')
# Train the model
gbm = lgb.train(params, lgb_train, num_boost_round=20, valid_sets=lgb_eval, early_stopping_rounds=5)  # train() takes the parameter dict and the training Dataset
 
print('Save model...')
gbm.save_model('model.txt')  # save the trained model to a file

print('Start predicting...')
# Predict on the test set
y_pred = gbm.predict(X_test, num_iteration=gbm.best_iteration)  # if early stopping was enabled during training, best_iteration selects the best iteration for prediction

# Evaluate the model
print('The rmse of prediction is:', mean_squared_error(y_test, y_pred) ** 0.5)  # root mean squared error between true and predicted values

Start training...
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 86
[LightGBM] [Info] Number of data points in the train set: 120, number of used features: 4
[LightGBM] [Info] Start training from score 0.958333
[1]	valid_0's auc: 0.977273	valid_0's l2: 0.6505
Training until validation scores don't improve for 5 rounds
[2]	valid_0's auc: 0.977273	valid_0's l2: 0.591663
[3]	valid_0's auc: 0.997159	valid_0's l2: 0.539564
[4]	valid_0's auc: 1	valid_0's l2: 0.493325
[5]	valid_0's auc: 1	valid_0's l2: 0.45178
[6]	valid_0's auc: 1	valid_0's l2: 0.412135
[7]	valid_0's auc: 1	valid_0's l2: 0.378125
[8]	valid_0's auc: 1	valid_0's l2: 0.343605
[9]	valid_0's auc: 1	valid_0's l2: 0.315658
Early stopping, best iteration is:
[4]	valid_0's auc: 1	valid_0's l2: 0.493325
Save model...
Start predicting...
The rmse of prediction is: 0.7023709058255572


