# Sample LightGBM run using Julia

## General setup

Suggested steps before running this notebook:
- Set up a Python virtual environment first (see "Python LightGBM.ipynb" in the same directory as this file)
- Start Julia from the root of this repo (`cd` there first)
  - `julia --project=.`
- Install dependencies (the LightGBM and PyCall versions are pinned for easier comparison)
    ```
    using Pkg
    Pkg.add("IJulia") # enables Julia kernels in Jupyter notebooks
    
    # Point PyCall at the venv set up in the Python notebook; this must be done before building PyCall
    ENV["PYTHON"] = joinpath(pwd(), "venv/bin/python")
    Pkg.add(Pkg.PackageSpec(;name="PyCall", version="1.92.2"))
    Pkg.build("PyCall")

    Pkg.add(Pkg.PackageSpec(;name="LightGBM", version="0.4.2"))
    Pkg.build("LightGBM")
    exit()
    ```
- Run the notebook
  - `jupyter notebook`
  - Jupyter should then offer the option to create a Julia notebook

In [1]:
using DelimitedFiles
using LightGBM

In [2]:
datapath = "https://github.com/microsoft/LightGBM/raw/master/examples/binary_classification/binary.train"
train_data = readdlm(download(datapath), '\t')
x_train = train_data[:, 2:end]

7000×28 Array{Float64,2}:
 0.869  -0.635   0.226  0.327  -0.69   …  0.978  0.92   0.722  0.989  0.877
 0.908   0.329   0.359  1.498  -0.313     0.986  0.978  0.78   0.992  0.798
 0.799   1.471  -1.636  0.454   0.426     0.986  0.951  0.803  0.866  0.78
 1.344  -0.877   0.936  1.992   0.882     0.999  0.728  0.869  1.027  0.958
 1.105   0.321   1.522  0.883  -1.205     0.987  0.838  1.133  0.872  0.808
 1.596  -0.608   0.007  1.818  -0.112  …  0.972  0.789  0.431  0.961  0.958
 0.409  -1.885  -1.027  1.672  -1.605     1.001  0.545  0.699  0.977  0.829
 0.934   0.629   0.528  0.238  -0.967     0.98   0.783  0.849  0.894  0.775
 1.405   0.537   0.69   1.18   -0.11      1.176  1.045  1.543  3.535  2.741
 1.177   0.104   1.397  0.48    0.266     0.986  1.104  0.849  0.937  0.812
 0.946   1.111   1.218  0.908   0.822  …  0.994  0.908  0.776  0.783  0.725
 0.739  -0.178   0.83   0.505  -0.13      0.982  0.542  1.251  0.83   0.761
 1.384   0.117  -1.18   0.763  -0.08      1.192  1.221  0.861  

In [3]:
y_train = train_data[:, 1] # binary labels/targets

7000-element Array{Float64,1}:
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 ⋮
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 0.0
 0.0
 1.0
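
The two slicing steps above rely on `binary.train` keeping the label in the first column and the features in the remaining ones. A minimal sketch of the same split on a tiny in-memory stand-in; the two rows are shortened from the printed output, and the names `sample`, `sample_data`, `sample_x` and `sample_y` are illustrative only, not part of the notebook:

```julia
using DelimitedFiles

# Hypothetical stand-in for binary.train: tab-separated, label in the first column.
sample = "1\t0.869\t-0.635\t0.226\n0\t1.344\t-0.877\t0.936\n"

# readdlm accepts any IO source, so an IOBuffer avoids touching the network.
sample_data = readdlm(IOBuffer(sample), '\t')

sample_y = sample_data[:, 1]       # first column: binary labels
sample_x = sample_data[:, 2:end]   # remaining columns: features

@show size(sample_x) sample_y
```

The same indexing applied to the real file gives the 7000×28 feature matrix and the 7000-element label vector shown above.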

## Using Julia's LightGBM directly

In [4]:
booster = LightGBM.LGBMClassification(;
    objective="binary", 
    metric=["average_precision"], 
    num_leaves=1000, 
    learning_rate=0.2, 
    max_bin=255, 
    max_depth=10, 
    min_data_in_leaf=50, 
    num_iterations=5, 
    num_class=1, 
    use_missing=true, 
    min_sum_hessian_in_leaf=1.
)

LGBMClassification(LightGBM.Booster(Ptr{Nothing} @0x0000000000000000, LightGBM.Dataset[]), "", "binary", "gbdt", 5, 0.2, 1000, 10, "serial", 6, -1.0, 50, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 2, 1.0, 1.0, 1.0, 0, 3, 0, false, 6, 255, 200000, 1, "", true, false, Int64[], true, false, true, 1.0, 1.0, 0.1, 50, 0.5, false, false, 4, 0.2, 0.1, 100, 32, 10.0, 10.0, ["average_precision"], 1, false, Int64[], 1, 12400, 120, "", 1, "cpu", false, false)

In [5]:
fit!(booster, x_train, y_train)

[LightGBM] [Info] Number of positive: 3716, number of negative: 3284
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 6132
[LightGBM] [Info] Number of data points in the train set: 7000, number of used features: 28
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.530857 -> initscore=0.123586
[LightGBM] [Info] Start training from score 0.123586


Dict{String,Dict{String,Array{Float64,1}}}()

In [6]:
# Print the booster model details
for line in split(LightGBM.LGBM_BoosterSaveModelToString(booster.booster),"\n")
    println(line)
end

tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=27
objective=binary sigmoid:1
feature_names=Column_0 Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18 Column_19 Column_20 Column_21 Column_22 Column_23 Column_24 Column_25 Column_26 Column_27
feature_infos=[0.27500000000000002:6.6950000000000003] [-2.4169999999999998:2.4300000000000002] [-1.7430000000000001:1.7430000000000001] [0.019:5.7000000000000002] [-1.7430000000000001:1.7430000000000001] [0.159:4.1900000000000004] [-2.9409999999999998:2.9700000000000002] [-1.7410000000000001:1.7410000000000001] [0:2.173] [0.19:5.1929999999999996] [-2.9039999999999999:2.9089999999999998] [-1.742:1.7430000000000001] [0:2.2149999999999999] [0.26400000000000001:6.5229999999999997] [-2.7280000000000002:2.7269999999999999] [-1.742:1.742] [0:2.548] [0.36499999999999999:6.0679999999999996] [-2.495000000

internal_value=0.123586 0.179273 0.239838 0.272177 -0.0137022 0.344561 0.0407009 0.0409018 0.246815 0.0587972 0.17129 0.0900001 -0.0131273 0.0292739 0.0316396 0.221723 0.0136722 -0.105969 -0.0144908 0.183978 -0.0932302 -0.0860628 0.00584062 -0.125011 0.21395 0.314851 -0.147678 -0.0487212 0.259001 0.120204 0.33745 0.392731 0.349189 0.0926909 -0.0434947 0.0144077 -0.087411 0.147818 0.374639 0.0915924 0.0936577 0.136686 0.104232 0.0833621 0.0289744 -0.0449854 0.191914 -0.0608086 0.227295 -0.153584 0.0778422 0.0384721 0.108143 0.0988055 0.0340427 0.247063 -0.0198281 0.165284 -0.0913926 0.355301 0.0235188 -0.113769 -0.081795 0.0559238 -0.116447 0.213684 0.137663 0.252477 0.162205 0.283908 -0.133404 0.16699 0.292289 0.39659 0.431814 0.356035 -0.0988384 0.434676 -0.0454331 -0.075443 0.00914365 -0.0556288 -0.0472051 0.00389861 0.0575268 0.011254 0.396525 0.0643887 0.391277 -0.0702594 0.452993 0.0601969 0.372989 -0.193834 0.322667 0.388605 -0.241422 -0.23524 0.474581 0.468531 0.483309
internal_

internal_value=0 0.0364041 0.0746766 0.149361 -0.0060728 -0.061535 -0.0598901 0.0920127 0.124371 -0.141463 -0.174016 -0.0745025 0.0618394 0.148843 0.0992519 -0.0872726 -0.117749 0.213632 0.236016 -0.0712347 -0.038553 0.0226861 -0.0237392 -0.0435668 -0.0594982 -0.147946 -0.0140768 0.0495468 -0.04872 -0.11747 0.139269 -0.227208 0.0882162 -0.103389 0.0455737 0.0581958 -0.190142 0.0699146 0.000500648 -0.157958 0.0952179 -0.0268813 0.0106374 -0.0521165 0.0221374 0.145129 0.0732842 0.117438 -0.179068 -0.158909 -0.108315 0.010152 0.212166 0.167866 0.0248727 -0.0732215 -0.0276035 -0.029276 0.125221 -0.216887 -0.16788 -0.0869126 -0.12478 0.237066 -0.180679 -0.0573217 -0.0392856 0.0639616 -0.265537 0.14702 0.20215 -0.189293 0.0337239 -0.237443 -0.0335503 -0.227871 -0.0279328 -0.223601 0.0346788 0.0678637 0.208842 0.143989 -0.0984065 0.161395 0.263275 0.233721 0.271289 0.157724 0.187162 -0.256405 -0.0682909 -0.235827 0.261512 -0.297611 0.292642 0.150555 0.283247 -0.324667 0.30552 0.316818 -0.3474

internal_weight=0 1195.23 851.873 719.131 132.742 492.563 411.899 145.702 311.178 120.795 343.362 320.27 307.232 210.196 277.808 206.33 147.554 215.293 135.269 110.074 154.21 107.401 266.197 46.8086 55.9861 181.385 94.1493 52.5724 31.8792 81.1297 110.333 93.5237 41.5769 80.024 32.6976 47.3264 123.234 35.9796 249.793 75.6721 61.3347 44.6737 96.7624 84.5143 233.083 165.452 113.599 27.8579 51.8527 36.9483 28.7607 104.847 73.2335 43.131 58.7759 27.5466 212 28.9447 47.5617 35.0133 31.6138 62.5154 42.4506 36.1858 28.1504 24.9064 92.0347 37.7749 35.1768 30.1025 25.1648 45.9529 24.1864 53.3321 29.4237 39.6804 27.3214 54.2597 85.7411 50.0453 31.1822 28.3401 35.6957 197.695 33.6054 24.9362 37.9115 31.2358 24.5527 24.6183 31.2293
internal_count=7000 4980 3584 3044 540 2020 1776 607 1269 506 1396 1301 1268 854 1148 843 600 887 560 447 626 436 1169 190 228 751 382 213 129 330 457 387 169 327 134 193 501 147 1102 308 250 182 393 343 1031 682 470 115 212 151 117 426 296 174 243 111 938 117 193 142 13

internal_weight=0 1333.33 951.398 484.763 466.635 256.698 260.427 381.928 311.341 231.801 230.583 169.324 85.5016 28.7646 209.937 358.547 209.311 145.081 87.3738 29.8443 194.132 53.145 91.4219 143.475 70.4315 140.987 99.4346 82.0773 79.5401 61.2641 48.1612 27.3228 195.131 180.525 132.874 47.6512 63.5939 48.5319 149.236 112.035 41.5521 27.3276 224.336 143.223 82.4681 66.5822 89.5934 33.9162 116.935 95.2212 56.737 37.201 27.5041 69.2804 117.759 96.2799 51.3351 74.8931 59.3528 29.6564 27.0805 41.0389 140.379 105.647 48.4353 59.3451 35.8762 25.8487 53.9911 35.9024 35.8265 35.5085 57.5039 34.7314 42.7566 28.3722 37.4308 82.559 31.2002 26.5398 60.7547 39.8916 24.2695 23.3812 27.4051 81.113 66.9408 25.7874 49.3594 36.2632 22.5148 39.0781 57.212 42.7402 29.116 23.456 41.1534 22.0256
internal_count=7000 5684 4101 2154 1947 1073 1129 1583 1316 990 1007 704 362 118 874 1483 870 645 369 122 810 223 380 596 298 587 412 339 326 251 199 112 812 751 552 199 264 203 613 460 175 114 1025 652 373 306 368

internal_count=7000 5684 4442 2167 2275 1665 1043 1282 868 610 1316 710 688 435 1242 621 383 1203 1039 485 1124 253 978 431 889 594 847 459 268 151 284 202 395 323 216 158 736 155 101 581 275 791 512 397 182 366 288 116 164 491 411 176 152 235 110 106 175 922 524 398 332 155 177 306 254 148 191 279 176 303 130 195 102 106 103 115 166 136 107 143 141 150 115 103 112 111 381 100 113 103 330 100 274
shrinkage=0.2


end of trees

feature_importances:
Column_25=56
Column_24=41
Column_27=37
Column_5=33
Column_21=31
Column_22=31
Column_9=29
Column_26=28
Column_0=25
Column_3=20
Column_13=18
Column_18=18
Column_4=15
Column_1=13
Column_2=11
Column_14=11
Column_17=10
Column_19=9
Column_11=8
Column_6=7
Column_7=7
Column_15=7
Column_10=6
Column_23=6
Column_8=3
Column_12=2
Column_16=2

parameters:
[boosting: gbdt]
[objective: binary]
[metric: average_precision]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 100]
[learning_rate: 0.2]
[num_leaves: 1000]
[num_threads: 6]
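
The `parameters:` block at the end of the dump is worth checking against the values passed to the constructor: the booster above was created with `num_iterations=5`, while this dump lists `[num_iterations: 100]`. The following is a rough sketch for reading such `[key: value]` lines into a `Dict`; the helper name `model_parameters` and the shortened `sample_dump` are illustrative, not part of LightGBM.jl:

```julia
# Collect every "[key: value]" line of a LightGBM model string into a Dict.
# Values are kept as strings; parsing them is left to the caller.
function model_parameters(model::AbstractString)
    params = Dict{String,String}()
    for line in split(model, '\n')
        m = match(r"^\[(\w+): (.*)\]$", strip(line))
        m === nothing && continue   # skip anything that is not a parameter line
        params[m[1]] = m[2]
    end
    return params
end

# Shortened excerpt of the dump printed above.
sample_dump = """
parameters:
[boosting: gbdt]
[objective: binary]
[data: ]
[num_iterations: 100]
[learning_rate: 0.2]
"""

dumped = model_parameters(sample_dump)
println("num_iterations in dump: ", dumped["num_iterations"])
```

On the real model, the same function could be applied to the string returned by `LightGBM.LGBM_BoosterSaveModelToString(booster.booster)`.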


In [7]:
open("../booster_data/pure_julia_lightgbm_booster.txt", "w") do io
    write(io, LightGBM.LGBM_BoosterSaveModelToString(booster.booster))
end

55340

## Using python LightGBM via Julia's PyCall

In [8]:
using PyCall

py_numpy = PyCall.pyimport("numpy")
py_lightgbm = PyCall.pyimport("lightgbm") 

# The printed module path should point into the venv set up earlier

PyObject <module 'lightgbm' from '/home/e/Documents/draft-lightgbm-blog/venv/lib/python3.8/site-packages/lightgbm/__init__.py'>

In [9]:
# check the python lightgbm package version
py_lightgbm.__version__

"3.1.0"

In [10]:
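# Pass the training data through numpy. Note that PyCall converts the returned
# numpy arrays back to Julia Arrays (see the output below) and converts them to
# numpy again when they are handed to Python functions.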
x_train_py = py_numpy.asarray(x_train)
y_train_py = py_numpy.asarray(y_train)

7000-element Array{Float64,1}:
 1.0
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 1.0
 1.0
 0.0
 1.0
 ⋮
 1.0
 1.0
 0.0
 1.0
 0.0
 1.0
 1.0
 1.0
 0.0
 0.0
 0.0
 1.0

In [11]:
# Use the same params as the other runs and load the dataset.
# Note: max_bin is "1023" (a string) here, whereas the Julia run above sets max_bin=255.

params_py = Dict(
    "objective" => "binary",
    "metric" => "average_precision",
    "num_leaves" => 1000,
    "learning_rate" => 0.2,
    "max_bin" => "1023",
    "max_depth" => 10,
    "min_data_in_leaf" => 50,
    "num_iterations" => 5,
    "num_class" => 1,
    "use_missing" => true,
    "min_sum_hessian_in_leaf" => 1.,
)

train_ds = py_lightgbm.Dataset(
    x_train_py, y_train_py, params=params_py
)

PyObject <lightgbm.basic.Dataset object at 0x7f7d3ad8a760>
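
Because the parameters are typed out twice (once as keyword arguments in In [4], once as a `Dict` here), the two lists can drift apart. One possible way to keep them in sync is to define the shared values once as a NamedTuple and derive the Python-side `Dict` from it. This is only a sketch: the name `shared_params` is illustrative, `max_bin = 255` is taken from the Julia cell (which value is intended is not stated), and the splatting into the constructor is an assumption based on the keyword form used in In [4]:

```julia
# Define the shared settings once (values copied from the cells above).
shared_params = (
    objective = "binary",
    num_leaves = 1000,
    learning_rate = 0.2,
    max_bin = 255,
    max_depth = 10,
    min_data_in_leaf = 50,
    num_iterations = 5,
)

# Python side: a Dict with String keys, which PyCall passes on as a Python dict.
shared_params_py = Dict{String,Any}(String(k) => v for (k, v) in pairs(shared_params))

# Julia side (assumption: every key is accepted as a constructor keyword, as in In [4]):
# booster = LightGBM.LGBMClassification(; shared_params...)

println(shared_params_py["max_bin"])
```

Parameters that take different shapes on each side, such as `metric` (a vector in the Julia call, a string in the Python dict), would still need to be handled separately.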

In [12]:
booster_by_pycall = py_lightgbm.train(params_py, train_ds)

PyObject <lightgbm.basic.Booster object at 0x7f7d30292340>

In [13]:
# Print the model details here
for line in split(booster_by_pycall.model_to_string(), '\n')
    println(line)
end

tree
version=v3
num_class=1
num_tree_per_iteration=1
label_index=0
max_feature_idx=27
objective=binary sigmoid:1
feature_names=Column_0 Column_1 Column_2 Column_3 Column_4 Column_5 Column_6 Column_7 Column_8 Column_9 Column_10 Column_11 Column_12 Column_13 Column_14 Column_15 Column_16 Column_17 Column_18 Column_19 Column_20 Column_21 Column_22 Column_23 Column_24 Column_25 Column_26 Column_27
feature_infos=[0.27500000000000002:6.6950000000000003] [-2.4169999999999998:2.4300000000000002] [-1.7430000000000001:1.7430000000000001] [0.019:5.7000000000000002] [-1.7430000000000001:1.7430000000000001] [0.159:4.1900000000000004] [-2.9409999999999998:2.9700000000000002] [-1.7410000000000001:1.7410000000000001] [0:2.173] [0.19:5.1929999999999996] [-2.9039999999999999:2.9089999999999998] [-1.742:1.7430000000000001] [0:2.2149999999999999] [0.26400000000000001:6.5229999999999997] [-2.7280000000000002:2.7269999999999999] [-1.742:1.742] [0:2.548] [0.36499999999999999:6.0679999999999996] [-2.495000000

internal_value=0.123586 0.179531 0.239269 0.272523 -0.0137654 0.344118 0.03807 0.0410876 0.055625 0.247783 0.172146 0.111393 -0.0127565 0.03199 0.293094 0.0274494 -0.109294 0.0096142 0.0429964 0.18465 0.204672 -0.0893739 0.00432805 -0.130394 0.125267 -0.139942 0.241237 -0.0695777 -0.0188841 0.0349581 0.249537 0.0773498 0.123336 0.0803251 0.145912 0.108481 -0.0618062 -0.051768 0.39302 0.351282 -0.0693563 0.0703507 0.146915 0.10552 -0.158331 0.192534 -0.0929452 0.374263 0.18506 -0.103068 0.0679187 -0.0567871 0.019672 -0.0948122 0.222353 0.153066 -0.0209488 0.0934518 0.32021 0.238597 -0.114461 -0.079231 -0.114027 0.013441 -0.138377 -0.179555 0.133455 0.0876519 0.282362 0.121534 0.270125 0.194052 -0.126569 0.430563 0.354324 0.34375 0.057909 0.00614493 0.393905 0.246427 -0.0495071 0.387508 0.0391725 -0.0398177 0.282286 0.38837 0.366492 -0.0732785 0.193283 0.452681 0.426369 0.417344 -0.162507 -0.226242 0.388605 -0.248381 0.474354 0.483188
internal_weight=0 1238.76 835.805 717.258 504.571 418

internal_value=0 0.0347946 0.0715872 0.145284 -0.00696854 -0.0622951 0.0854415 -0.0609065 0.110437 -0.148282 -0.178994 -0.0763659 0.131654 0.0599304 -0.0884171 -0.0920204 0.212158 0.235304 0.0616335 0.0204226 0.0471534 -0.116198 -0.0716748 -0.0363223 -0.114091 -0.188294 -0.023334 -0.0447118 0.0764274 0.0947899 0.0446334 -0.0553495 -0.171431 0.0979856 0.0713182 0.0107026 0.00659729 0.0192184 0.0877297 0.171749 0.190005 0.133558 0.108972 -0.0556115 -0.0922373 0.129588 0.069827 -0.233427 -0.0673364 -0.00822441 0.215469 0.00435546 -0.0625334 -0.150806 -0.0313622 -0.114883 -0.0699433 -0.012742 0.160323 -0.260317 0.0905182 -0.156297 -0.127897 -0.0659312 0.154673 -0.214355 -0.0127207 -0.126897 0.0542069 -0.219233 -0.247613 0.0692413 -0.0465177 -0.212174 -0.171234 -0.115747 0.157967 0.0711809 0.150837 -0.102647 -0.100431 -0.289143 0.0738265 0.202151 0.265425 0.219628 -0.182777 0.215519 0.172077 0.282864 0.238725 0.273727 0.301131 0.259683 -0.320718 -0.340347 -0.334186 0.323865
internal_weight=

internal_value=0 0.0324125 0.061852 0.00672595 0.105421 0.0450851 0.144487 0.011709 0.0821214 -0.109357 -0.0640544 -0.0953723 -0.0550383 -0.0906527 0.0482635 0.0094352 0.042786 -0.058607 0.0758246 0.00767248 -0.0236225 0.140024 0.068146 0.0350451 -0.119237 -0.0975106 -0.0569102 -0.0890362 -0.0129961 0.050194 -0.141935 -0.14154 0.107616 0.172517 0.188938 0.105248 0.110757 0.0535977 -0.0725013 0.169625 0.134675 -0.116396 -0.139783 -0.162149 0.182366 -0.169521 0.0684418 0.144323 -0.108303 -0.145607 0.0587717 0.0100148 0.0474748 -0.166575 -0.106565 -0.141669 -0.081454 -0.0288136 -0.169297 -0.217202 -0.0797524 0.0100233 -0.172363 0.089432 -0.107521 -0.216757 -0.0344233 0.138099 -0.203724 -0.177692 -0.152402 -0.17919 -0.147278 -0.0962814 -0.183276 -0.12754 -0.146456 0.0402518 0.137372 -0.242022 -0.141298 0.21114 0.185058 0.118737 -0.263442 -0.0913824 -0.223082 0.153295 0.238739 -0.226712 0.216954 -0.193457 -0.239917 0.218218 -0.248145 -0.273713 0.256194 0.25973
internal_weight=0 1305.69 1000

internal_count=7000 4974 3584 3046 538 2026 1776 610 1287 240 1270 853 1390 631 449 246 1166 981 888 633 255 580 739 537 391 1094 941 870 783 634 468 166 406 118 189 139 527 273 217 144 370 434 337 282 163 119 222 1040 203 152 173 394 344 670 267 403 186 301 219 351 353 977 158 220 104 149 120 157 107 289 201 136 915 125 109 164 114 117 171 146 301 221 131 121 116
shrinkage=0.2


Tree=4
num_leaves=99
num_cat=0
split_feature=25 26 27 5 22 24 25 27 5 9 25 24 22 27 26 18 3 4 27 24 21 22 22 26 10 13 24 26 3 2 24 9 0 22 26 13 5 25 25 17 9 11 21 21 6 22 24 12 25 14 25 2 8 24 4 10 9 3 27 25 21 7 10 0 9 10 22 21 5 9 27 27 4 14 19 15 25 25 17 3 25 19 8 9 27 5 18 18 5 25 13 7 25 13 14 11 26 21
split_gain=69.8665 41.4208 55.603 27.7056 22.6298 37.1592 13.9939 13.4894 14.4108 13.8388 12.1759 10.7089 10.5146 11.5325 14.2279 10.3415 11.0509 10.7695 10.34 14.9412 10.2448 9.24762 9.23671 8.86486 8.5812 8.08927 7.84502 7.79299 7.37539 7.59278 7.34004 8.99907 7.06314 6.97763 6.78389 6.49627 6.53043 7.46

internal_count=7000 5689 4122 2143 1979 1093 1125 1567 938 435 629 385 1050 830 212 575 298 277 1311 986 220 708 243 503 325 249 372 618 599 151 886 825 280 311 140 1018 646 368 310 163 137 484 175 117 210 502 144 269 714 591 148 123 111 141 443 389 367 296 232 182 116 190 448 334 262 254 186 408 346 109 312 117 166 108 172 131 171 119 102 114 111 117 168 107 108 372 219 157 278 199 131 161 166 153 109 103 131 140
shrinkage=0.2


end of trees

feature_importances:
Column_25=55
Column_27=41
Column_5=37
Column_24=36
Column_22=33
Column_26=32
Column_9=30
Column_21=29
Column_3=19
Column_0=18
Column_14=17
Column_4=16
Column_13=16
Column_18=13
Column_15=11
Column_19=11
Column_10=10
Column_2=7
Column_7=7
Column_11=7
Column_17=7
Column_1=6
Column_6=6
Column_8=4
Column_23=4
Column_12=2
Column_16=2
Column_20=1

parameters:
[boosting: gbdt]
[objective: binary]
[metric: average_precision]
[tree_learner: serial]
[device_type: cpu]
[data: ]
[valid: ]
[num_iterations: 5]
[learning_rate: 0.2]
[num_lea

In [14]:
open("../booster_data/julia_pycall_lightgbm_booster.txt", "w") do io
    write(io, booster_by_pycall.model_to_string())
end

54628
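
With both dumps saved, a quick way to compare the two runs is to diff their `feature_importances:` blocks. The sketch below works on short excerpts of the outputs printed above; the helper name `feature_importances` and the variable names are illustrative and not part of either LightGBM package:

```julia
# Parse the "feature_importances:" block of a LightGBM model string
# into a Dict mapping feature name => listed count.
function feature_importances(model::AbstractString)
    importances = Dict{String,Int}()
    in_block = false
    for raw_line in split(model, '\n')
        line = strip(raw_line)
        if line == "feature_importances:"
            in_block = true
        elseif in_block
            isempty(line) && break          # a blank line ends the block
            name, value = split(line, '=')
            importances[name] = parse(Int, value)
        end
    end
    return importances
end

# Excerpts of the two dumps shown in this notebook (first three features only).
julia_excerpt = """
feature_importances:
Column_25=56
Column_24=41
Column_27=37

parameters:
"""

pycall_excerpt = """
feature_importances:
Column_25=55
Column_27=41
Column_24=36

parameters:
"""

julia_imp = feature_importances(julia_excerpt)
pycall_imp = feature_importances(pycall_excerpt)

# Difference per feature (Julia run minus PyCall run); features missing from
# the PyCall dump count as 0.
importance_diff = Dict(k => julia_imp[k] - get(pycall_imp, k, 0) for k in keys(julia_imp))

for (feature, delta) in sort(collect(importance_diff))
    println(feature, ": ", delta)
end
```

To run this on the full models, the excerpts can be replaced with `read("../booster_data/pure_julia_lightgbm_booster.txt", String)` and `read("../booster_data/julia_pycall_lightgbm_booster.txt", String)`.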