<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

In the following we define the classes [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex) and [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex_knn), where KDE is short for 'Kernel Density Estimator' and the 'x' signals that both classes can be built on top of an arbitrary point predictor. The name 'LevelSet' stems from the fact that every approach presented in this notebook interprets the values of the point forecasts as a similarity measure between samples. The point predictor is specified by the argument `estimator`; it must have a `.predict()` method and should have been trained beforehand.

Both classes [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex) and [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex_knn) fulfill the same task: by first running `.fit(XTrain, yTrain)` and then calling `.getWeights(XTest)`, they output an estimate of the conditional density for every sample in 'XTest'. The basic idea of both approaches is also identical. Suppose we have a single test sample at hand. First, we compare the point prediction of this sample with the point predictions of the training samples, computed via `estimator.predict(XTest)` and `estimator.predict(XTrain)`, respectively. Based on this comparison, we select 'binSize'-many training samples that we deem most similar to the test sample. The concrete way in which these training samples are selected is the only difference between [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex) and [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex_knn). Finally, the empirical distribution of the y-values of the selected training samples serves as our estimate of the conditional distribution.

Further details on how both approaches work can be found below.
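
To make the final step of the idea concrete, here is a minimal sketch of how the y-values of the selected training samples can be turned into an empirical distribution and a quantile estimate. It is not the dddex implementation: the function name `empiricalQuantile` and the toy numbers are assumptions made purely for illustration.

```python
import numpy as np

def empiricalQuantile(yNeighbors, prob):
    """Smallest y-value whose empirical CDF over the selected samples reaches 'prob'."""
    ySorted = np.sort(np.asarray(yNeighbors, dtype=float))
    # Every selected training sample receives the same weight 1 / binSize.
    cdf = np.arange(1, len(ySorted) + 1) / len(ySorted)
    # Small tolerance guards against floating point error in the CDF.
    return ySorted[np.searchsorted(cdf, prob - 1e-12, side="left")]

# Hypothetical y-values of binSize = 5 training samples selected for one test sample.
yNeighbors = [12.0, 7.0, 9.0, 15.0, 10.0]
median = empiricalQuantile(yNeighbors, prob=0.5)
upper = empiricalQuantile(yNeighbors, prob=0.9)
```

With equal weights the estimate depends only on which training samples were selected, which is why the selection rule is the part that distinguishes the two classes.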

## Level-Set Approach based on Bin Building

In [1]:
#|output: asis
#| echo: false
show_doc(LevelSetKDEx)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx.py#L22){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx

>      LevelSetKDEx (estimator, binSize:int=None)

[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | (Fitted) object with a .predict-method. |
| binSize | int | None | Size of the bins created to group the training samples. |

In [None]:
# show_doc(LevelSetKDEx)

In [None]:
# show_doc(LevelSetKDEx.fit)

In [None]:
# show_doc(LevelSetKDEx.getWeights)

#### Generate Bins

In [2]:
#|output: asis
#| echo: false
show_doc(generateBins)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx.py#L138){target="_blank" style="float:right; font-size:smaller"}

### generateBins

>      generateBins (binSize:int, yPred:numpy.ndarray)

Used to generate the bin-structure induced by the Level-Set-Forecaster algorithm

|    | **Type** | **Details** |
| -- | -------- | ----------- |
| binSize | int | Size of the bins of values being grouped together. |
| yPred | np.ndarray | 1-dimensional array of predicted values. |
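
One way to picture such a bin structure is to sort the training samples by their point prediction and cut the sorted sequence into consecutive groups of 'binSize' samples; a test sample then belongs to the bin whose prediction range contains its own prediction. The sketch below follows that reading. It is not the code of `generateBins`: the helper names `buildBinsSketch` and `assignBinSketch`, the midpoint boundaries, and the choice to merge leftover samples into the last bin are all assumptions.

```python
import numpy as np

def buildBinsSketch(binSize, yPred):
    """Group training indices into consecutive bins of at least 'binSize' sorted predictions."""
    yPred = np.asarray(yPred, dtype=float)
    order = np.argsort(yPred, kind="stable")
    nBins = max(len(yPred) // binSize, 1)
    # Assumption: leftover samples are merged into the last bin.
    splitPoints = [binSize * i for i in range(1, nBins)]
    bins = np.split(order, splitPoints)
    # Upper boundary of every bin except the last one:
    # midpoint between its largest prediction and the next bin's smallest prediction.
    upperBounds = np.array([(yPred[bins[i][-1]] + yPred[bins[i + 1][0]]) / 2
                            for i in range(nBins - 1)])
    return bins, upperBounds

def assignBinSketch(upperBounds, yPredTest):
    """Index of the bin whose prediction range contains each test prediction."""
    return np.searchsorted(upperBounds, np.asarray(yPredTest, dtype=float), side="left")

yPredTrain = [3.0, 1.0, 2.0, 10.0, 11.0, 12.0, 20.0]
bins, upperBounds = buildBinsSketch(binSize=3, yPred=yPredTrain)
binOfTest = assignBinSketch(upperBounds, yPredTest=[2.5, 30.0])
```

In this toy example the seven training samples form two bins, and the training samples sharing a bin with a test sample are the ones whose y-values would make up its empirical distribution.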

## Level-Set Approach based on kNN

In [3]:
#|output: asis
#| echo: false
show_doc(LevelSetKDEx_kNN)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx.py#L179){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_kNN

>      LevelSetKDEx_kNN (estimator, binSize:int|None=None)

[`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex_knn) turns any point predictor that has a .predict-method
into an estimator of the conditional density of the underlying distribution.
The basic idea of each level-set based approach is to interpret the point forecast
generated by the underlying point predictor as a similarity measure of samples.
In the case of the [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex_knn) defined here, for every new sample
the 'binSize'-many training samples are determined whose point forecasts are closest
to the point forecast of the new sample.
The empirical distribution of the y-values of these 'nearest' training samples is
viewed as our estimate of the conditional distribution of the new sample
at hand.

NOTE 1: The [`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex_knn) class can only be applied to estimators that 
have been fitted already.

NOTE 2: In contrast to the standard [`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex), it is possible to apply
[`LevelSetKDEx_kNN`](https://kaiguender.github.io/dddex/levelsetkdex.html#levelsetkdex_knn) to arbitrary dimensional point predictors.

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Object with a .predict-method (fitted). |
| binSize | int \| None | None | Number of neighbors considered to compute the conditional density. |
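
The neighbor selection described above can be sketched with plain NumPy for the one-dimensional case. This is an illustrative sketch only, not the class's actual code: the function name `nearestByPredictionSketch`, the absolute-difference distance, and the tie-breaking by training index are assumptions.

```python
import numpy as np

def nearestByPredictionSketch(yPredTrain, yPredTest, binSize):
    """For each test sample, indices of the 'binSize' training samples with the closest point forecast."""
    yPredTrain = np.asarray(yPredTrain, dtype=float)
    yPredTest = np.asarray(yPredTest, dtype=float)
    # Pairwise absolute differences of point forecasts, shape (nTest, nTrain).
    dist = np.abs(yPredTest[:, None] - yPredTrain[None, :])
    # Stable sort so that ties are resolved by training index.
    return np.argsort(dist, axis=1, kind="stable")[:, :binSize]

yPredTrain = [1.0, 2.0, 3.0, 10.0, 11.0]
yTrain = np.array([0.0, 4.0, 2.0, 9.0, 14.0])
neighbors = nearestByPredictionSketch(yPredTrain, yPredTest=[2.2, 10.4], binSize=2)
# y-values whose empirical distribution estimates the conditional distribution.
yNeighbors = yTrain[neighbors]
```

Unlike a fixed bin structure, the neighborhood here is centered on each test sample, so two test samples with slightly different forecasts can end up with different sets of training samples.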

In [None]:
# show_doc(LevelSetKDEx_kNN)

In [None]:
# show_doc(LevelSetKDEx_kNN.fit)

In [None]:
# show_doc(LevelSetKDEx_kNN.getWeights)

## Bin-Size CV

In [4]:
#|output: asis
#| echo: false
show_doc(binSizeCV)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx.py#L330){target="_blank" style="float:right; font-size:smaller"}

### binSizeCV

>      binSizeCV (estimator, cvFolds, LSF_type:"'LSF'|'LSF_kNN'",
>                 weightsByDistance:bool=False, binSizeGrid:list|np.ndarray=[4,
>                 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 100, 125, 150, 200,
>                 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 1250,
>                 1500, 1750, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000,
>                 9000, 10000], probs:list|np.ndarray=[0.01, 0.02, 0.03, 0.04,
>                 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14,
>                 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24,
>                 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34,
>                 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43, 0.44,
>                 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54,
>                 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64,
>                 0.65, 0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74,
>                 0.75, 0.76, 0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83, 0.84,
>                 0.85, 0.86, 0.87, 0.88, 0.89, 0.9, 0.91, 0.92, 0.93, 0.94,
>                 0.95, 0.96, 0.97, 0.98, 0.99], refitPerProb:bool=False,
>                 n_jobs:int|None=None)

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Object with a .predict-method (fitted). |
| cvFolds |  |  | Specifies cross-validation-splits. Identical to 'cv' used for cross-validation in sklearn. |
| LSF_type | 'LSF' \| 'LSF_kNN' |  | Specifies which LSF-Object we work with during cross-validation. |
| weightsByDistance | bool | False |  |
| binSizeGrid | list \| np.ndarray | [4, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80, 100, 125, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900, 1000, 1250, 1500, 1750, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000] | binSize (int) values being evaluated. |
| probs | list \| np.ndarray | [0.01, 0.02, 0.03, 0.04, 0.05, 0.06, 0.07, 0.08, 0.09, 0.1, 0.11, 0.12, 0.13, 0.14, 0.15, 0.16, 0.17, 0.18, 0.19, 0.2, 0.21, 0.22, 0.23, 0.24, 0.25, 0.26, 0.27, 0.28, 0.29, 0.3, 0.31, 0.32, 0.33, 0.34, 0.35, 0.36, 0.37, 0.38, 0.39, 0.4, 0.41, 0.42, 0.43, 0.44, 0.45, 0.46, 0.47, 0.48, 0.49, 0.5, 0.51, 0.52, 0.53, 0.54, 0.55, 0.56, 0.57, 0.58, 0.59, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.66, 0.67, 0.68, 0.69, 0.7, 0.71, 0.72, 0.73, 0.74, 0.75, 0.76, 0.77, 0.78, 0.79, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85, 0.86, 0.87, 0.88, 0.89, 0.9, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99] | list or array of floats between 0 and 1. p-quantiles being predicted to evaluate performance of LSF. |
| refitPerProb | bool | False | If True, for each p-quantile a fitted LSF with best binSize to predict it is returned. Otherwise only one LSF is returned that is best over all probs. |
| n_jobs | int \| None | None | Number of folds computed in parallel. |
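
The parameters above suggest a grid search: every bin size in 'binSizeGrid' is scored on each cross-validation split for each quantile level in 'probs', and the best-scoring bin size is kept. The following self-contained sketch illustrates that loop under simplifying assumptions. It scores with the plain pinball loss and uses a nearest-neighbor selection on synthetic data; the helper names (`pinballLoss`, `knnQuantiles`, `selectBinSizeSketch`) are hypothetical and none of this is taken from the `binSizeCV` source.

```python
import numpy as np

def pinballLoss(yTrue, yQuantile, prob):
    """Mean pinball (quantile) loss of the quantile forecasts 'yQuantile'."""
    diff = np.asarray(yTrue, dtype=float) - np.asarray(yQuantile, dtype=float)
    return float(np.mean(np.maximum(prob * diff, (prob - 1) * diff)))

def knnQuantiles(yPredTrain, yTrain, yPredTest, binSize, prob):
    """Quantile of the y-values of the 'binSize' training samples with the closest prediction."""
    dist = np.abs(yPredTest[:, None] - yPredTrain[None, :])
    neighbors = np.argsort(dist, axis=1, kind="stable")[:, :binSize]
    return np.quantile(yTrain[neighbors], prob, axis=1)

def selectBinSizeSketch(yPred, y, cvFolds, binSizeGrid, probs):
    """Average the loss over folds and quantile levels; return the best bin size and all scores."""
    scores = {}
    for binSize in binSizeGrid:
        losses = [pinballLoss(y[testIdx],
                              knnQuantiles(yPred[trainIdx], y[trainIdx], yPred[testIdx], binSize, prob),
                              prob)
                  for trainIdx, testIdx in cvFolds for prob in probs]
        scores[binSize] = float(np.mean(losses))
    return min(scores, key=scores.get), scores

rng = np.random.default_rng(0)
yPred = rng.uniform(0, 10, size=200)                  # stand-in for estimator.predict(X)
y = yPred + rng.normal(0, 1 + yPred / 5, size=200)    # noise level depends on the prediction
idx = np.arange(200)
cvFolds = [(idx[:150], idx[150:]), (idx[50:], idx[:50])]  # (train, test) index pairs, as in sklearn
bestBinSize, scores = selectBinSizeSketch(yPred, y, cvFolds,
                                          binSizeGrid=[5, 20, 80], probs=[0.1, 0.5, 0.9])
```

Selecting one bin size over all quantile levels corresponds to the `refitPerProb = False` case; keeping the per-quantile scores separate instead would give one best bin size per entry of 'probs'.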

In [None]:
# show_doc(binSizeCV)

In [5]:
#|output: asis
#| echo: false
show_doc(binSizeCV.fit)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx.py#L381){target="_blank" style="float:right; font-size:smaller"}

### binSizeCV.fit

>      binSizeCV.fit (X, y)

In [None]:
# show_doc(binSizeCV.fit)

#### Scores for Single Fold

In [6]:
#|output: asis
#| echo: false
show_doc(scoresForFold)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx.py#L459){target="_blank" style="float:right; font-size:smaller"}

### scoresForFold

>      scoresForFold (cvFold, binSizeGrid, probs, estimator, LSF_type,
>                     weightsByDistance, y, X)

##### Get Cost Ratio

In [7]:
#|output: asis
#| echo: false
show_doc(getCostRatio)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx.py#L522){target="_blank" style="float:right; font-size:smaller"}

### getCostRatio

>      getCostRatio (decisions, decisionsSAA, yTest, prob)
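
No description accompanies this signature, so the following is only one plausible reading of it: the cost of the model's decisions for the quantile level 'prob' divided by the cost of baseline decisions ('decisionsSAA', presumably from sample average approximation). The cost function, the handling of a zero-cost baseline, and the names `newsvendorCost` and `costRatioSketch` are assumptions, not the library's code.

```python
import numpy as np

def newsvendorCost(decisions, yTest, prob):
    """Total cost: 'prob' per unit of under-estimation, (1 - prob) per unit of over-estimation."""
    diff = np.asarray(yTest, dtype=float) - np.asarray(decisions, dtype=float)
    return float(np.sum(np.where(diff > 0, prob * diff, (prob - 1) * diff)))

def costRatioSketch(decisions, decisionsSAA, yTest, prob):
    """Cost of the model's decisions relative to the cost of the baseline decisions."""
    cost = newsvendorCost(decisions, yTest, prob)
    costSAA = newsvendorCost(decisionsSAA, yTest, prob)
    if costSAA == 0:
        # Assumption: the ratio is undefined for a zero-cost baseline;
        # report 0 if the model is also cost-free and 1 otherwise.
        return 0.0 if cost == 0 else 1.0
    return cost / costSAA

yTest = [10.0, 12.0, 8.0]
decisions = [11.0, 12.0, 7.0]        # hypothetical model decisions
decisionsSAA = [14.0, 14.0, 14.0]    # hypothetical baseline: one decision for all samples
ratio = costRatioSketch(decisions, decisionsSAA, yTest, prob=0.5)
```

Under this reading a ratio below 1 means the model's decisions are cheaper than the baseline's for the given quantile level.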

In [None]:
## Bin-Size CV 2

In [None]:
# #| export

# class binSizeCV2:

#     def __init__(self,
#                  estimator, # Object with a .predict-method (fitted).
#                  cvFolds, # Specifies cross-validation-splits. Identical to 'cv' used for cross-validation in sklearn.
#                  LSF_type: "'LSF' | 'LSF_kNN'", # Specifies which LSF-Object we work with during cross-validation.
#                  paramGrid = None,
#                  binSizeGrid: list | np.ndarray = [4, 7, 10, 15, 20, 30, 40, 50, 60, 70, 80,
#                                                    100, 125, 150, 200, 250, 300, 350, 400, 450, 500, 600, 700, 800, 900,
#                                                    1000, 1250, 1500, 1750, 2000, 2500, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000], # binSize (int) values being evaluated.
#                  probs: list | np.ndarray = [i / 100 for i in range(1, 100, 1)], # list or array of floats between 0 and 1. p-quantiles being predicted to evaluate performance of LSF.
#                  refitPerProb: bool = False, # If True, for each p-quantile a fitted LSF with best binSize to predict it is returned. Otherwise only one LSF is returned that is best over all probs.
#                  n_jobs: int | None = None, # number of folds being computed in parallel.
#                  ):
        
#         # CHECKS
        
#         if isinstance(estimator, (LevelSetKDEx, LevelSetKDEx_kNN)):
#             raise ValueError("'estimator' has to be a point predictor and not a LevelSetKDEx-Object!")   
#         elif not (hasattr(estimator, 'predict') and callable(estimator.predict)):
#             raise ValueError("'estimator' has to have a 'predict'-method!")
#         else:
#             self.estimator = estimator
            
#         if LSF_type is None or not LSF_type in ["LSF", "LSF_kNN"]:
#             raise ValueError("LSF_type must be specified and must either be 'LSF' or 'LSF_kNN'!")
#         else:
#             self.LSF_type = LSF_type
            
#         if np.any(np.array(probs) > 1) or np.any(np.array(probs) < 0): 
#             raise ValueError("probs must only contain numbers between 0 and 1!")
#         else:
#             self.probs = probs
        
#         #---
        
#         self.binSizeGrid = binSizeGrid        
#         self.cvFolds = cvFolds
#         self.refitPerProb = refitPerProb
#         self.n_jobs = n_jobs
        
#         self.best_binSize = None
#         self.best_binSize_perProb = None
#         self.best_estimatorLSx = None
#         self.cv_results = None
#         self.cv_results_raw = None

In [None]:
# #| export

# @patch
# def fit(self: binSizeCV2, 
#         X, 
#         y):
    
#     scoresPerFold = Parallel(n_jobs = self.n_jobs)(delayed(scoresForFold)(cvFold = cvFold,
#                                                                           binSizeGrid = self.binSizeGrid,
#                                                                           probs = self.probs,
#                                                                           estimator = self.estimator,
#                                                                           LSF_type = self.LSF_type,
#                                                                           y = y,
#                                                                           X = X) for cvFold in self.cvFolds)

#     self.cv_results_raw = scoresPerFold

#     #---

#     nvCostsMatrix = scoresPerFold[0]

#     for i in range(1, len(scoresPerFold)):
#         nvCostsMatrix = nvCostsMatrix + scoresPerFold[i]

#     nvCostsMatrix = nvCostsMatrix / len(self.cvFolds)

#     self.cv_results = nvCostsMatrix

#     #---

#     meanCostsDf = nvCostsMatrix.mean(axis = 1)
#     binSizeBestOverall = meanCostsDf.index[np.argmax(meanCostsDf)]
#     self.best_binSize = binSizeBestOverall

#     binSizeBestPerProb = nvCostsMatrix.idxmax(axis = 0)
#     self.best_binSize_perProb = binSizeBestPerProb

#     #---

#     if self.refitPerProb:

#         LSFDict = dict()
#         for binSize in binSizeBestPerProb.unique():

#             if self.LSF_type == 'LSF':
#                 LSF = LevelSetKDEx(estimator = self.estimator, 
#                                          binSize = binSize)
#             else:
#                 LSF = LevelSetKDEx_kNN(estimator = self.estimator, 
#                                              binSize = binSize)

#             LSF.fit(X = X, y = y)
#             LSFDict[binSize] = LSF

#         self.best_estimatorLSx = {prob: LSFDict[binSizeBestPerProb.loc[prob]] 
#                                   for prob in binSizeBestPerProb.index}

#     else:
#         if self.LSF_type == 'LSF':
#             LSF = LevelSetKDEx(estimator = self.estimator, 
#                                      binSize = binSizeBestOverall)
#         else:
#             LSF = LevelSetKDEx_kNN(estimator = self.estimator, 
#                                          binSize = binSizeBestOverall)

#         LSF.fit(X = X, y = y)

#         self.best_estimatorLSx = LSF

## Test Code

In [None]:
# #| hide

# from lightgbm import LGBMRegressor
# from dddex.loadData import *

# data, XTrain, yTrain, XTest, yTest = loadDataYaz(testDays = 14, 
#                                                  returnXY = True,
#                                                  daysToCut = 0)

In [None]:
# #| hide

# LGBM = LGBMRegressor(boosting_type = 'gbdt',
#                      n_jobs = 1)

# LGBM.fit(X = XTrain, y = yTrain)

In [None]:
# #| hide

# LS_KDEx_kNN = LevelSetKDEx_kNN(estimator = LGBM, binSize = 100)
# LS_KDEx_kNN.fit(XTrain, yTrain)

# LS_KDEx = LevelSetKDEx(estimator = LGBM, binSize = 100)
# LS_KDEx.fit(XTrain, yTrain)

In [None]:
# #| hide
# test = LS_KDEx_kNN.predictQ(X = XTest, weightsByDistance = True, outputAsDf = True)
# test2 = LS_KDEx_kNN.predictQ(X = XTest, weightsByDistance = False, outputAsDf = True)

In [None]:
# #| hide
# test3 = LS_KDEx.predictQ(X = XTest, weightsByDistance = True, outputAsDf = True)
# test4 = LS_KDEx.predictQ(X = XTest, weightsByDistance = False, outputAsDf = True)

In [None]:
# #| hide
# from dddex.utils import groupedTimeSeriesSplit

# dataTrain = data[data.label == 'train']

# cvFolds = groupedTimeSeriesSplit(data = dataTrain, 
#                                  kFolds = 3, 
#                                  testLength = 28, 
#                                  groupFeature = 'id', 
#                                  timeFeature = 'dayIndex')

# CV = binSizeCV(estimator = LGBM,
#                cvFolds = cvFolds,
#                LSF_type = 'LSF',
#                weightsByDistance = True,
#                binSizeGrid = [10, 100, 1000],
#                probs = [0.001, 0.5, 0.999])

# CV.fit(X = XTrain, y = yTrain)

# CV2 = binSizeCV(estimator = LGBM,
#                 cvFolds = cvFolds,
#                 LSF_type = 'LSF',
#                 weightsByDistance = False,
#                 binSizeGrid = [10, 100, 1000],
#                 probs = [0.001, 0.5, 0.999])

# CV2.fit(X = XTrain, y = yTrain)