<!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->

## Level-Set Approach based on Clusters

In [1]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_multivariate)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_multivariate.py#L33){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_multivariate

>      LevelSetKDEx_multivariate (estimator, binSize:int=None,
>                                 equalBins:bool=False)

*[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) turns any point forecasting model into an estimator of the underlying conditional density.
The name 'LevelSet' stems from the fact that this approach interprets the values of the point forecasts
as a similarity measure between samples. The point forecasts of the training samples are sorted and 
recursively assigned to a bin until the size of the current bin reaches `binSize` many samples. Then
a new bin is created and so on. For a new test sample we check into which bin its point prediction
would have fallen and interpret the training samples of that bin as the empirical distribution function
of this test sample.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| binSize | int | None | Size of the bins created while running fit. |
| equalBins | bool | False | Determines behaviour of method `getWeights`. If False, all weights receive the same  <br>value. If True, the distance of the point forecasts is taken into account. |
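
The bin-building idea described in the docstring above can be sketched for scalar point forecasts as follows. This is a minimal illustration in plain Python, not the code used by `dddex`: the helper names `build_bins` and `get_weights`, the merging of a too-small trailing bin into its predecessor, and the use of the smallest forecast of each bin as its boundary are all assumptions made for the example.

```python
from bisect import bisect_right

def build_bins(y_pred_train, bin_size):
    """Sort the training point forecasts and cut them into consecutive bins."""
    # Training indices ordered by their point forecast
    order = sorted(range(len(y_pred_train)), key=lambda i: y_pred_train[i])
    bins = [order[s:s + bin_size] for s in range(0, len(order), bin_size)]
    # Assumption: a trailing bin with fewer than bin_size samples joins its predecessor
    if len(bins) > 1 and len(bins[-1]) < bin_size:
        last = bins.pop()
        bins[-1].extend(last)
    # Smallest forecast of every bin except the first serves as a bin boundary
    lower_bounds = [y_pred_train[b[0]] for b in bins[1:]]
    return bins, lower_bounds

def get_weights(y_pred_new, bins, lower_bounds):
    """Equal weights on the training samples that share the bin of a new forecast."""
    members = bins[bisect_right(lower_bounds, y_pred_new)]
    return {i: 1.0 / len(members) for i in members}

y_pred_train = [5.0, 1.0, 3.0, 2.0, 4.0, 6.0, 7.0]
bins, lower_bounds = build_bins(y_pred_train, bin_size=3)
weights = get_weights(2.5, bins, lower_bounds)
```

The returned dictionary maps training indices to weights. Paired with the corresponding training targets, it plays the role of the empirical distribution assigned to the test sample.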

## LSx Multivariate Version with Theoretical Asymptotic Optimality

In [2]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_multivariate_opt)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_multivariate.py#L264){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_multivariate_opt

>      LevelSetKDEx_multivariate_opt (estimator, nClusters:int=None,
>                                     minClusterSize:int=None)

*[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) turns any point forecasting model into an estimator of the underlying conditional density.
The name 'LevelSet' stems from the fact that this approach interprets the values of the point forecasts
as a similarity measure between samples. 
In this version of the LSx algorithm, we are grouping the point predictions of the samples specified via `X`
based on a k-means clustering algorithm. The number of clusters is determined by the `nClusters` parameter.  
In order to ensure theoretical asymptotic optimality of the algorithm, it has to be ensured that the number
of training observations receiving positive weight is at least minClusterSize, while minClusterSize has to be 
an element of o(N) meaning minClusterSize / N -> 0 as N -> infinity.
To ensure this, each cluster is checked for its size and clusters being smaller than minClusterSize have to be
modified. For every cluster that is too small, we are recursively searching for the closest other cluster until
the size of the combined cluster is at least minClusterSize. The clusters are not actually merged in the traditional
sense, though. Instead, we are creating new overlapping sets of samples that are used to compute the weights. 
Let's say we have three clusters A, B and C, minClusterSize = 10, the sizes of the clusters are 4, 4 and 20. Furthermore,
assume B is closest to A and C closest to B. The set of indices are given then as follows:
A: A + B + C
B: B + C
C: C
This way it is ensured that the number of training observations receiving positive weight is at least 10 for every cluster. 
At the same time, the above algorithm ensures that the distance of the samples receiving positive weight*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| nClusters | int | None | Number of clusters being created while running fit. |
| minClusterSize | int | None | Minimum size of a cluster. If a cluster is smaller than this value, it will be merged with another cluster. |
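
The A, B, C example from the docstring above can be reproduced with a small sketch. It is an illustration under stated assumptions, not the `dddex` implementation: the function name `overlapping_sets` and the cluster centers are hypothetical, and neighbouring clusters are added in order of their distance to the center of the cluster that is too small, which is one possible reading of "searching for the closest other cluster".

```python
from math import dist

def overlapping_sets(centers, members, min_cluster_size):
    """For each cluster, add the nearest other clusters until enough samples are covered."""
    result = {}
    for name, center in centers.items():
        # Other clusters ordered by Euclidean distance to this cluster's center
        others = sorted((c for c in centers if c != name),
                        key=lambda c: dist(centers[c], center))
        chosen = [name]
        size = len(members[name])
        for other in others:
            if size >= min_cluster_size:
                break
            chosen.append(other)
            size += len(members[other])
        result[name] = chosen
    return result

# Hypothetical centers chosen so that B is closest to A and C is closest to B
centers = {"A": (0.0, 0.0), "B": (3.0, 0.0), "C": (5.0, 0.0)}
# Cluster sizes 4, 4 and 20, as in the docstring example
members = {"A": list(range(0, 4)), "B": list(range(4, 8)), "C": list(range(8, 28))}

sets_per_cluster = overlapping_sets(centers, members, min_cluster_size=10)
sizes = {k: sum(len(members[c]) for c in v) for k, v in sets_per_cluster.items()}
```

The original clusters stay untouched. Only the index sets used for computing the weights overlap, so a test sample falling into A draws on the samples of A, B and C, while a test sample falling into C draws on C alone.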

## Level-Set Approach based on Decision Tree

In [3]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_DT)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_multivariate.py#L492){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_DT

>      LevelSetKDEx_DT (estimator, max_depth:int=8, min_samples_leaf:int=100)

*[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) turns any point forecasting model into an estimator of the underlying conditional density.
The name 'LevelSet' stems from the fact that this approach interprets the values of the point forecasts
as a similarity measure between samples. 
TBD.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| max_depth | int | 8 | Maximum depth of the decision tree used to generate the bins. |
| min_samples_leaf | int | 100 | Minimum number of samples required to be in a bin. |

## LSx Multivariate Gessaman Rule

In [4]:
#| echo: false
#| output: asis
show_doc(LevelSetKDEx_multivariate_bin)

---

[source](https://github.com/kaiguender/dddex/blob/main/dddex/levelSetKDEx_multivariate.py#L636){target="_blank" style="float:right; font-size:smaller"}

### LevelSetKDEx_multivariate_bin

>      LevelSetKDEx_multivariate_bin (estimator, nBinsPerDim:int=None)

*[`LevelSetKDEx`](https://kaiguender.github.io/dddex/levelsetkdex_univariate.html#levelsetkdex) turns any point forecasting model into an estimator of the underlying conditional density.
The name 'LevelSet' stems from the fact that this approach interprets the values of the point forecasts
as a similarity measure between samples. 
In this version of the LSx algorithm, we are applying the so-called Gessaman rule to create statistically
equivalent blocks of samples. In essence, the algorithm is a multivariate extension of the univariate
LevelSetKDEx algorithm based on bin-building. 
We are creating equally sized bins of samples based on the point predictions of the samples specified via `X`
for every coordinate axis. Every bin of one axis is combined with the bins of all other axes resulting in
a total of nBinsPerDim^dim many bins. 
Example: Let's say we have 100000 samples, the binSize is given as 20 and the number of dimensions
is 3. As the binSize is given as 20, we want to create 100000 / 20 = 5000 bins altogether. Hence, there have to be
5000^(1/dim) = 5000^(1/3) ≈ 17 bins per dimension. 
IMPORTANT NOTE: The getWeights function is not yet finished and has to be completed.*

|    | **Type** | **Default** | **Details** |
| -- | -------- | ----------- | ----------- |
| estimator |  |  | Model with a .fit and .predict-method (implementing the scikit-learn estimator interface). |
| nBinsPerDim | int | None | Number of bins created along each dimension (coordinate axis) of the point predictions. |
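
The arithmetic of the example in the docstring above can be checked with a few lines of Python. The helper `bins_per_dim` is hypothetical, and rounding to the nearest integer is an assumption, since the docstring does not say whether the library rounds or truncates.

```python
def bins_per_dim(n_samples, bin_size, dim):
    """Number of bins per coordinate axis so that bins hold roughly bin_size samples."""
    n_bins_total = n_samples / bin_size
    # Assumption: round the dim-th root to the nearest integer, with at least one bin
    return max(1, round(n_bins_total ** (1 / dim)))

n_bins = bins_per_dim(n_samples=100_000, bin_size=20, dim=3)
total_bins = n_bins ** 3
avg_samples_per_bin = 100_000 / total_bins
```

Because the number of bins per axis has to be an integer, the resulting total of `n_bins ** dim` bins only approximates the 5000 bins aimed for, so the average number of samples per bin ends up close to, but not exactly at, the requested binSize.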

# Test Code

In [None]:
# #| hide

# import ipdb
# import numpy as np
# from scipy.spatial.distance import cdist
# from lightgbm import LGBMRegressor
# from sklearn.ensemble import RandomForestRegressor
# from datasetsDynamic.loadDataYaz import loadDataYaz

In [None]:
# #| hide

# data, XTrain, yTrain, XTest, yTest = loadDataYaz(testDays = 14,
#                                                  daysToCut = 0,
#                                                  normalizeDemand = True,
#                                                  unstacked = True,
#                                                  returnXY = True)

# RF = RandomForestRegressor(n_estimators = 10, n_jobs = 1)
# RF.fit(X = XTrain, y = yTrain)

# # Duplicate XTrain and yTrain m times
# m = 1000
# XTrain = np.vstack([XTrain for i in range(m)])
# yTrain = np.vstack([yTrain for i in range(m)])

# print(XTrain.shape)
# print(yTrain.shape)

# # Add Gaussian noise to XTrain and yTrain
# XTrain = XTrain + np.random.normal(0, 0.1, XTrain.shape)
# yTrain = yTrain + np.random.normal(0, 0.1, yTrain.shape)

In [None]:
# LSKDEx = LevelSetKDEx_multivariate_opt(estimator = RF, nClusters = 100, minClusterSize = 20)
# LSKDEx.fit(X = XTrain, y = yTrain)

# yPred = LSKDEx.estimator.predict(XTest).astype(np.float32)
# clusters = LSKDEx.kmeans.assign(yPred)[1]

# weightsDataList = LSKDEx.getWeights(X = XTest, outputType='onlyPositiveWeights')

In [None]:
# centers = LSKDEx.centers
# yPred = LSKDEx.yPredTrain

# distances = cdist(yPred, centers, metric = 'euclidean')

# minCenters = np.argmin(distances, axis = 1)

In [None]:
# nPosValues = np.array([len(weightsDataList[i][0]) for i in range(len(weightsDataList))])
# print(nPosValues)

# lenIndices = np.array([len(LSKDEx.indicesPerBin[i]) for i in range(len(LSKDEx.indicesPerBin))])
# print(min(lenIndices))
# print(max(lenIndices))

[20 20 20 20 30 28 38 22 30 20 36 36 22 38]
20
48
