<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->



In the following we define the classes [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) and [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn), where KDE is short for 'Kernel Density Estimator' and the 'x' signals that both classes can be built on top of any point predictor. The name 'LevelSet' stems from the fact that every approach presented in this notebook interprets the values of the point forecasts as a similarity measure between samples. The point predictor is specified by the argument `estimator`; it must have a `.predict()` method and should have been trained beforehand.

Both classes [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) and [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn) fulfill the same task: by first running `.fit(XTrain, yTrain)` and then calling `.getWeights(XTest)`, they both output an estimate of the conditional density of every sample specified by `XTest`. The basic idea of both approaches is also identical. Suppose we have a single test sample at hand. First, we compare the point prediction of this sample with the point predictions of the training samples, computed via `estimator.predict(XTest)` and `estimator.predict(XTrain)`, respectively. Based on this comparison, we select `binSize`-many training samples that we deem most similar to the test sample at hand. The concrete way of selecting these training samples constitutes the only difference between [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) and [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn). Finally, the empirical distribution of the y-values of these training samples acts as our estimate of the conditional distribution.

Further details on how both approaches work can be found below.
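
As a rough illustration of the final step described above, the following minimal sketch shows how a set of selected training samples can be turned into an empirical conditional distribution, and how a quantile can be read off it. It is not the `dddex` implementation: the helper names `empirical_weights` and `weighted_quantile` and the toy data are assumptions made only for this example.

```python
def empirical_weights(selected_indices):
    """Give every selected training sample the same probability mass."""
    n = len(selected_indices)
    return {idx: 1.0 / n for idx in selected_indices}


def weighted_quantile(y_train, weights, prob):
    """Smallest y-value whose cumulative weight reaches `prob`."""
    # Sort the selected y-values and accumulate their weights.
    pairs = sorted((y_train[idx], w) for idx, w in weights.items())
    cumulative = 0.0
    for y_value, w in pairs:
        cumulative += w
        if cumulative >= prob - 1e-12:  # small tolerance for float rounding
            return y_value
    return pairs[-1][0]


# Toy data (hypothetical): y-values of six training samples.
y_train = [12.0, 7.0, 9.0, 15.0, 8.0, 11.0]
# Suppose samples 1, 2, 4 and 5 were deemed most similar to the test sample.
weights = empirical_weights([1, 2, 4, 5])
median = weighted_quantile(y_train, weights, prob=0.5)
```

The two classes differ only in how the list of selected indices is obtained; once it is known, the weights and any quantile follow in the same way.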

## LSx Bin Building

In [1]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L33){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx

>      LevelSetKDEx (estimator, binSize:int=100, weightsByDistance:bool=False)

*[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) turns any point forecasting model into an estimator of the underlying conditional density.
The name 'LevelSet' stems from the fact that this approach interprets the values of the point forecasts
as a similarity measure between samples. The point forecasts of the training samples are sorted and 
recursively assigned to a bin until the size of the current bin reaches `binSize` many samples. Then
a new bin is created and so on. For a new test sample we check into which bin its point prediction
would have fallen and interpret the training samples of that bin as the empirical distribution function
of this test sample.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| binSize | int | 100 | Size of the bins created while running fit. |
| weightsByDistance | bool | False | Determines behaviour of method `getWeights`. If False, all weights receive the same  <br>value. If True, the distance of the point forecasts is taken into account. |

### Generate Bins

In [2]:
#| echo: false
#| output: asis
show_doc(generateBins)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L394){target="_blank" style="float:right; font-size:smaller"}

### generateBins

>      generateBins (binSize:int, yPred:numpy.ndarray)

*Used to generate the bin-structure used by [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) to compute density estimations.*

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| binSize | int | Size of the bins of values of `yPred` being grouped together. |
| yPred | ndarray | 1-dimensional array of predicted values. |
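
The bin-building idea described for `LevelSetKDEx` (sort the training predictions, fill bins of `binSize` samples, then look up the bin of a test prediction) can be sketched in plain Python as follows. This is a simplified sketch, not the library's `generateBins`: the helpers `sketch_generate_bins` and `sketch_find_bin` are hypothetical, and merging a short trailing bin into its predecessor is an assumption, since the handling of remainders and ties is not documented here.

```python
from bisect import bisect_right


def sketch_generate_bins(bin_size, y_pred):
    """Group training indices into bins of `bin_size` by sorted prediction."""
    # Indices of the training samples, ordered by their point prediction.
    order = sorted(range(len(y_pred)), key=lambda i: y_pred[i])
    bins = [order[start:start + bin_size] for start in range(0, len(order), bin_size)]
    # Assumption: a short trailing bin is merged into its predecessor.
    if len(bins) > 1 and len(bins[-1]) < bin_size:
        bins[-2].extend(bins.pop())
    # Lower bound of a bin = smallest prediction it contains.
    lower_bounds = [y_pred[members[0]] for members in bins]
    return bins, lower_bounds


def sketch_find_bin(lower_bounds, y_pred_test):
    """Index of the bin into which a test prediction falls."""
    # Predictions below the first lower bound are assigned to the first bin.
    return max(bisect_right(lower_bounds, y_pred_test) - 1, 0)


y_pred_train = [5.0, 1.0, 3.0, 9.0, 7.0, 2.0, 8.0]
bins, lower_bounds = sketch_generate_bins(bin_size=3, y_pred=y_pred_train)
test_bin = sketch_find_bin(lower_bounds, y_pred_test=6.2)
```

The training samples listed in `bins[test_bin]` would then serve as the empirical distribution for that test sample.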

## LSx Neighbors-Based

In [3]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_NN)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L435){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_NN

>      LevelSetKDEx_NN (estimator, binSize:int=100, efficientRAM:bool=False)

*TBD.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| binSize | int | 100 | Size of the bins created while running fit. |
| efficientRAM | bool | False | Setting 'efficientRAM = True' is only necessary when there are roughly more than 200k training observations, to avoid<br>excessive RAM usage. With this setting, the run-time of the weights computation depends linearly on <br>'binSize'. Because of that, the algorithm becomes quite slow for 'binSize' > 10k. |

### Get Neighbors

In [4]:
#| echo: false
#| output: asis
show_doc(getNeighbors)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L565){target="_blank" style="float:right; font-size:smaller"}

### getNeighbors

>      getNeighbors (binSize:int, yPred:numpy.ndarray)

*Used to generate the neighborhoods used by [`LevelSetKDEx_NN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_nn) to compute density estimations.*

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| binSize | int | Size of the bins of values of `yPred` being grouped together. |
| yPred | ndarray | 1-dimensional array of predicted values. |

### Get Neighbor Test

In [5]:
#| echo: false
#| output: asis
show_doc(getNeighborsTest)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L779){target="_blank" style="float:right; font-size:smaller"}

### getNeighborsTest

>      getNeighborsTest (binSize:int, yPred:numpy.ndarray,
>                        yPredTrain:numpy.ndarray, neighborsDictTrain:dict)

*Used to generate the neighborhoods used by [`LevelSetKDEx_NN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_nn) to compute density estimations.*

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| binSize | int | Size of the bins of values of `yPred` being grouped together. |
| yPred | ndarray | 1-dimensional array of predicted values. |
| yPredTrain | ndarray | 1-dimensional array of predicted train values. |
| neighborsDictTrain | dict | Dict containing the neighbors of all train samples. Keys are the train predictions. |

### Get Kernel Values

In [6]:
#| echo: false
#| output: asis
show_doc(getKernelValues)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L880){target="_blank" style="float:right; font-size:smaller"}

### getKernelValues

>      getKernelValues (yPred, yPredTrain, neighborsDictTest,
>                       neighborsDictTrain, neighborsRemoved, neighborsAdded,
>                       binSize, efficientRAM=False)

## LSx kNN

In [7]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_kNN)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L1003){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_kNN

>      LevelSetKDEx_kNN (estimator, binSize:int=100,
>                        weightsByDistance:bool=False)

*[`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn) turns any point predictor that has a .predict-method 
into an estimator of the conditional density of the underlying distribution.
The basic idea of each level-set based approach is to interpret the point forecast
generated by the underlying point predictor as a similarity measure of samples.
In the case of the [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex_knn) defined here, for every new sample
the 'binSize'-many training samples are computed whose point forecasts are closest
to the point forecast of the new sample.
The resulting empirical distribution of these 'nearest' training samples is 
viewed as our estimate of the conditional distribution of the new sample 
at hand.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| binSize | int | 100 | Size of the bins created while running fit. |
| weightsByDistance | bool | False | Determines behaviour of method `getWeights`. If False, all weights receive the same  <br>value. If True, the distance of the point forecasts is taken into account. |
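
The nearest-neighbor selection in prediction space can be illustrated with a small, self-contained sketch. It is not the code of `LevelSetKDEx_kNN`: the function `sketch_knn_weights` is hypothetical, and the inverse-distance formula used for `by_distance=True` is only an assumption about what "taking the distance into account" could look like.

```python
def sketch_knn_weights(y_pred_train, y_pred_test, bin_size, by_distance=False):
    """Weights over training samples for one test prediction."""
    # Distance of every training prediction to the test prediction.
    distances = [abs(p - y_pred_test) for p in y_pred_train]
    # Indices of the `bin_size` closest training samples (ties broken by index).
    nearest = sorted(range(len(distances)), key=lambda i: (distances[i], i))[:bin_size]
    if by_distance:
        # Assumption: inverse-distance weighting; the library may differ.
        raw = {i: 1.0 / (distances[i] + 1e-8) for i in nearest}
    else:
        raw = {i: 1.0 for i in nearest}
    total = sum(raw.values())
    return {i: w / total for i, w in raw.items()}


y_pred_train = [1.0, 2.0, 4.0, 7.0, 11.0]
uniform = sketch_knn_weights(y_pred_train, y_pred_test=3.5, bin_size=3)
weighted = sketch_knn_weights(y_pred_train, y_pred_test=3.5, bin_size=3, by_distance=True)
```

With uniform weights, each of the three nearest training samples receives mass 1/3; in the distance-weighted variant, the sample whose prediction is closest to 3.5 receives the largest share.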

## LSx kMeans

In [8]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_kMeans)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L1187){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_kMeans

>      LevelSetKDEx_kMeans (estimator, nClusters:int=10)

*TBD.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| nClusters | int | 10 | Number of clusters to form as well as number of centroids to generate. |

## LSx Gaussian Kernel

In [9]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_RBF)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_univariate.py#L1324){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_RBF

>      LevelSetKDEx_RBF (estimator, lengthScale:float=1)

*TBD.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| lengthScale | float | 1 | Length scale of the Gaussian (RBF) kernel. |
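
Since the class description is still marked TBD, the following is only a guess at the general idea suggested by the name: instead of a hard selection of training samples, every training sample could receive a weight from a Gaussian (RBF) kernel applied to the distance between point predictions. The function `sketch_rbf_weights` and the exact kernel formula are assumptions for illustration and may not match what `LevelSetKDEx_RBF` actually does.

```python
import math


def sketch_rbf_weights(y_pred_train, y_pred_test, length_scale=1.0):
    """Normalized Gaussian-kernel weights in prediction space."""
    # Standard RBF kernel: exp(-d^2 / (2 * length_scale^2)).
    kernel = [
        math.exp(-((p - y_pred_test) ** 2) / (2.0 * length_scale ** 2))
        for p in y_pred_train
    ]
    total = sum(kernel)
    return [k / total for k in kernel]


y_pred_train = [1.0, 2.0, 4.0, 7.0]
narrow = sketch_rbf_weights(y_pred_train, y_pred_test=2.0, length_scale=0.5)
wide = sketch_rbf_weights(y_pred_train, y_pred_test=2.0, length_scale=5.0)
```

A small length scale concentrates almost all weight on training samples with nearly identical predictions, while a large one spreads the weight more evenly, which is loosely analogous to choosing a small or large `binSize` in the other classes.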

# Test Code

In [None]:
# #| hide

# data, XTrain, yTrain, XTest, yTest = loadDataBakery(returnXY = True)

# LGBM = LGBMRegressor(n_jobs = 1).fit(XTrain, yTrain)

In [None]:
# #| hide

# LSKDEx_drf = LevelSetKDEx_DRF(estimator = LGBM, binSize = 100)
# LSKDEx_drf.fit(XTrain, yTrain)

# weightsDataList = LSKDEx_drf.getWeights(XTest)

In [None]:
# #| hide

# import time
# start = time.time()
# LSKDEx = LevelSetKDEx(estimator = LGBM, binSize = 100)

# LSKDEx.fit(XTrain, yTrain)
# weights = LSKDEx.getWeights(XTest)
# print(time.time() - start)

## LSx based on DRF

In [None]:
# #| hide

# import ipdb
# from lightgbm import LGBMRegressor
# from xgboost import XGBRegressor
# from dddex.loadData import *
# from dddex.wSAA import RandomForestWSAA, SampleAverageApproximation
# import time 
# import psutil
# import os
# import sys

# from drf import drf

In [None]:
# data, XTrain, yTrain, XTest, yTest = loadDataBakery(returnXY = True)
# LGBM = LGBMRegressor(n_jobs = 1).fit(XTrain, yTrain)

# LSKDEx = LevelSetKDEx_RBF(estimator = LGBM, lengthScale = 1)

In [None]:
# LSKDEx.fit(XTrain, yTrain)

In [None]:
# weightsData = LSKDEx.getWeights(XTest[100:200], outputType = 'summarized')

In [None]:
# #| hide

# data, XTrain, yTrain, XTest, yTest = loadDataYaz(returnXY = True)

# LGBM = LGBMRegressor(n_jobs = 1).fit(XTrain, yTrain)
# yPredTrain = LGBM.predict(XTrain)
# yPredTest = LGBM.predict(XTest)

# yPredTrain = pd.DataFrame(yPredTrain)
# yPredTest = pd.DataFrame(yPredTest)
# yTrain = pd.Series(yTrain)

In [None]:
# #| hide

# DRF = drf(min_node_size = 100, num_trees = 500, num_features = 1, honesty = False, sample_fraction = 0.5, response_scaling = False, mtry = 1, num_threads = 1)

In [None]:
# #| hide

# DRF.fit(yPredTrain, yTrain)
# weights = DRF.predict(yPredTest).weights

In [None]:
# #| hide

# # Get statistics of the weights of the first row
# weightsRow = weights[0]
# pd.Series(weightsRow[weightsRow > 0]).describe()

count    366.000000
mean       0.002732
std        0.002423
min        0.000020
25%        0.000590
50%        0.001994
75%        0.004667
max        0.009120
dtype: float64

In [None]:
# import time
# start = time.time()
# LSKDEx = LevelSetKDEx_clustering2(estimator = LGBM, nClusters = 100)

# LSKDEx.fit(XTrain, yTrain)
# weights = LSKDEx.getWeights(XTest)
# print(time.time() - start)

In [None]:
# path = '/home/kagu/SID/data/dataSID.csv'
# data = pd.read_csv(path)

# ids = data.id.unique()[0:30]
# filtering = [ID in ids for ID in data.id]
# data = data[filtering]

# X = np.array(data.drop(['demand', 'date', 'id', 'label'], axis = 1))
# Y = np.array(data['demand'])

# indicesTrain = data['label'] == 'train'
# indicesTest = data['label'] == 'test'

# XTrain = X[indicesTrain]
# yTrain = Y[indicesTrain]

# XTest = X[indicesTest]
# yTest = Y[indicesTest]

# dataTrain = data[indicesTrain]
# dataTest = data[indicesTest]

# scalingList = dataTest['scalingValue'].tolist()

In [None]:
# data.shape

In [None]:
# process = psutil.Process(os.getpid())
# print(f"Memory used by Jupyter notebook: {process.memory_info().rss / 2**20:.2f} MB")

In [None]:
# LGBM = LGBMRegressor(boosting_type = 'gbdt',
#                      n_jobs = 1)

# LGBM.fit(X = XTrain, y = yTrain)

In [None]:
# start = time.time()
# LSKDEx = LevelSetKDEx(estimator = LGBM, binSize = 5000)
# LSKDEx.fit(XTrain, yTrain)
# print(time.time() - start)

# yPredTrain = LSKDEx.yPredTrain

In [None]:
# yPredTrain[LSKDEx.neighborsDictTrain[list(LSKDEx.neighborsDictTrain.keys())[100]]]

In [None]:
# yPredTrain[LSKDEx.neighborsDictTrain[np.array(list(LSKDEx.neighborsDictTrain.keys()))[-1]]]

In [None]:
# weights = LSKDEx.getWeights(XTest, efficientRAM = True)

In [None]:
# res = LSKDEx.predict(XTest,
#                      probs = [0.1, 0.5], 
#                      scalingList = scalingList)

In [None]:
# process = psutil.Process(os.getpid())
# print(f"Memory used by Jupyter notebook: {process.memory_info().rss / 2**20:.2f} MB")

In [None]:
# XTrainMod = XTrain[0:10000]
# yTrainMod = yTrain[0:10000]
# XTestMod = XTest[0:10]

# yPredTrainMod = LGBM.predict(XTrainMod)
# yPredTestMod = LGBM.predict(XTestMod)

In [None]:
# #| hide

# LSKDEx = LevelSetKDEx(estimator = LGBM, binSize = 100, weightsByDistance = False)
# LSKDEx.fit(XTrainMod, yTrainMod)

In [None]:
# %%timeit
# #| hide

# res = LSKDEx.solveKernelGLS(X = XTrainMod, sigma = 0.5, c = yPredTrainMod)

In [None]:
# %%timeit
# #| hide
# res = LSKDEx.getKernelVectorProduct(X1 = XTrainMod, c = yPredTrainMod)

In [None]:
# %%timeit

# mean, cov = LSKDEx.getGaussianPosterior(XTrain = XTrainMod, 
#                                         XTest = XTestMod,
#                                         yTrain = yTrainMod,
#                                         sigma = 0.5)

In [None]:
# yPredTest = LSKDEx.estimator.predict(XTest)
# binPerPredTest = np.searchsorted(a = LSKDEx.lowerBoundPerBin, v = yPredTest, side = 'right') - 1

# binVectorsTest = [(binPerPredTest == i).reshape(-1, 1) * 1 for i in range(len(LSKDEx.lowerBoundPerBin))]
# binVectorsToSliceTest = [np.where(binVector)[0] for binVector in binVectorsTest]

In [None]:
# v = binVectorsToSliceTest[2]
# cov[v[:, None],  v]

In [None]:
# v

In [None]:
# pd.Series(np.ravel(mean)).describe()

In [None]:
# pd.Series(yTest).describe()

In [None]:
# yPred = np.concatenate([np.arange(5000)] * 2, axis = 0)
# yPredTrain = np.concatenate([np.arange(50000)] * 2, axis = 0)
# binSize = 200

# neighborsDictTrain, neighborsRemoved, neighborsAdded = generateNeighborhoodsUnique(binSize = binSize,
#                                                                                yPred = yPredTrain)

# neighborsDictTest = generateNeighborhoodsTestUnique(binSize = binSize,
#                                                 yPred = yPred,
#                                                 yPredTrain = yPredTrain,
#                                                 neighborsDictTrain = neighborsDictTrain)

# start = time.time()
# kernelValuesList = getKernelValues(binSize = binSize,
#                                    yPred = yPred,
#                                    yPredTrain = yPredTrain,
#                                    neighborsDictTest = neighborsDictTest,
#                                    neighborsDictTrain = neighborsDictTrain,
#                                    neighborsRemoved = neighborsRemoved,
#                                    neighborsAdded = neighborsAdded)
# print(time.time() - start)