# LightGBM Imputation Methods

[Resource](https://lightgbm.readthedocs.io/en/stable/Python-Intro.html)

A bit about LightGBM:

LightGBM, short for **Light Gradient-Boosting Machine**, is a gradient-boosting framework developed by Microsoft. It is based on **decision tree** algorithms and is used for ranking, classification, and other machine learning tasks. Its development focuses on performance and scalability.

Typically, in gradient descent, one uses the whole set of data to calculate the valley's slopes. However, this commonly used method assumes that every data point is equally informative.

By contrast, Gradient-Based One-Side Sampling (GOSS) doesn't rely on that assumption. Instead, it keeps every data point with a large gradient and randomly drops a share of those with smaller gradients (shallower slopes), which it treats as less informative because the model already fits them well. This reduces the amount of data each boosting iteration has to scan, while the remaining points still capture the underlying relationships in the data.
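To make the sampling rule concrete, here is a minimal NumPy sketch of a GOSS-style selection step: keep the top fraction of rows by absolute gradient, randomly sample a smaller fraction from the rest, and up-weight the sampled rows by `(1 - top_rate) / other_rate` to compensate for the ones dropped. The function name `goss_sample` and the rates used are illustrative assumptions, not LightGBM's internal API.

```python
import numpy as np

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Hypothetical helper: return (row indices, row weights) picked GOSS-style."""
    rng = np.random.default_rng(seed)
    n = len(gradients)
    n_top = int(top_rate * n)
    n_other = int(other_rate * n)
    # Sort rows by absolute gradient, largest first.
    order = np.argsort(-np.abs(gradients))
    top_idx = order[:n_top]
    # Randomly sample from the remaining small-gradient rows.
    other_idx = rng.choice(order[n_top:], size=n_other, replace=False)
    weights = np.ones(n_top + n_other)
    # Up-weight the sampled rows to compensate for the rows that were dropped.
    weights[n_top:] = (1.0 - top_rate) / other_rate
    return np.concatenate([top_idx, other_idx]), weights

grads = np.random.default_rng(1).normal(size=1000)
idx, w = goss_sample(grads)
```

With these rates, 300 of the 1000 rows survive: the 200 with the largest gradients plus 100 sampled from the remaining 800.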

In [52]:
import lightgbm as lgb
import numpy as np
import pandas as pd

# Introduction

## Data Interface

The LightGBM Python module can load data from:
* LibSVM (zero-based) / TSV / CSV format text file
* NumPy 2D array(s), pandas DataFrame, SciPy sparse matrix
* LightGBM binary file
* LightGBM `Sequence` object(s)

The data is stored in a `Dataset` object.

## Load a numpy array into Dataset:

In [53]:
rng = np.random.default_rng()
data = rng.uniform(size=(500, 10)) # 500 entities, each contains 10 features
label = rng.integers(low=0, high=2, size=(500, )) # Binary target
train_data = lgb.Dataset(data, label=label)

## Load from Sequence Objects

We can implement the `Sequence` interface to read binary files. The following example reads an HDF5 file with `h5py`.

In [54]:
import h5py

class HDFSequence(lgb.Sequence):
    def __init__(self, hdf_dataset, batch_size):
        self.data = hdf_dataset
        self.batch_size = batch_size
    
    def __getitem__(self, idx):
        return self.data[idx]
    
    def __len__(self):
        return len(self.data)
    
#f = h5py.File('train.hdf5', 'r')
#train_data = lgb.Dataset(HDFSequence(f['X'], 8192), label=f['Y'][:])

Features of the `Sequence` interface:
* Data sampling uses random access, so it doesn't go through the whole dataset
* Data is read in batches, which saves memory when constructing the `Dataset` object
* Supports creating a `Dataset` from multiple data files

## Saving Dataset into a LightGBM Binary File

**This will make loading faster.**

`train_data = lgb.Dataset("train.svm.txt")`

`train_data.save_binary("train.bin")`

## Create Validation Data

`validation_data = train_data.create_valid("validation.svm")`

or

`validation_data = lgb.Dataset("validation.svm", reference=train_data)`

In LightGBM, the validation data should be aligned with training data.

## Specific Feature Names and Categorical Features

In [55]:
# One name per column (data has 10 features); "c3" is treated as categorical
train_data = lgb.Dataset(data, label=label, feature_name=[f"c{i}" for i in range(1, 11)], categorical_feature=["c3"])

**LightGBM can use categorical features directly as input. They do not need to be one-hot encoded, and training is much faster than with one-hot encoding.**

**Note:** You should convert your categorical features to `int` type before you construct `Dataset`.

(Incredible - This is why we read the docs too.)

## Weights can be set when needed

In [56]:
rng = np.random.default_rng()
w = rng.uniform(size=(500, ))
train_data = lgb.Dataset(data, label=label, weight=w)

or

In [57]:
train_data = lgb.Dataset(data, label=label)
rng = np.random.default_rng()
w = rng.uniform(size=(500, ))
train_data.set_weight(w)

<lightgbm.basic.Dataset at 0x11b4a70b0>

You can also use `Dataset.set_init_score()` to set the initial score, and `Dataset.set_group()` to set group/query data for ranking tasks.

## Memory Efficient Usage

The `Dataset` object in LightGBM is very memory-efficient because it only needs to store discrete bins. However, **NumPy arrays and pandas objects are memory-expensive**. If you are concerned about memory consumption, you can save memory by:
1. Setting `free_raw_data=True` (default is `True`) when constructing the `Dataset`
1. Explicitly setting `raw_data=None` after the `Dataset` has been constructed
1. Calling `gc`

## Setting Parameters

LightGBM can use a dictionary to set parameters. For instance:
* Booster parameters:

In [58]:
param = {"num_leaves": 31, "objective": "binary"}
param["metric"] = "auc"

* You can also specify multiple eval metrics:

In [59]:
param["metric"] = ["auc", "binary_logloss"]

## Training

Training a model requires a parameter list and data set:

`num_round = 10`

`bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data])`

After training, the model can be saved:

`bst.save_model('model.txt')`

The trained model can also be dumped to JSON format:

`json_model = bst.dump_model()`

A saved model can be loaded:

`bst = lgb.Booster(model_file='model.txt') # init model`

## CV 

Training with 5-fold CV:

`lgb.cv(param, train_data, num_round, nfold=5)`

## Early Stopping

If you have a validation set, you can use early stopping to find the optimal number of boosting rounds. Early stopping requires at least one set in `valid_sets`. If there is more than one, it will use all of them except the training data:

`bst = lgb.train(param, train_data, num_round, valid_sets=[validation_data], callbacks=[lgb.early_stopping(stopping_rounds=5)])`

`bst.save_model('model.txt', num_iteration=bst.best_iteration)`

The model will train until the validation score stops improving. The validation score needs to improve at least once every `stopping_rounds` rounds for training to continue.

The index of the iteration with the best performance is saved in the `best_iteration` field if early stopping is enabled by setting the `early_stopping` callback. Note that `train()` will return a model from the best iteration.

This works with both metrics to minimize (L2, log loss, etc.) and to maximize (NDCG, AUC, etc.). Note that if you specify more than one evaluation metric, all of them will be used for early stopping. However, you can change this behavior and make LightGBM check only the first metric for early stopping by passing `first_metric_only=True` in `early_stopping` callback constructor.

Very nice.

## Prediction

A model that has been trained or loaded can perform predictions on datasets:

`# 7 entities, each contains 10 features`

`rng = np.random.default_rng()`

`data = rng.uniform(size=(7, 10))`

`ypred = bst.predict(data)`

If early stopping is enabled during training, you can get predictions from the best iteration with `bst.best_iteration`:

`ypred = bst.predict(data, num_iteration=bst.best_iteration)`

This library seems pretty nuts. DULY noted for later learning.