# Divergence Measure
The main objective of this notebook is to implement the Cauchy-Schwarz divergence measure between two probability density functions (PDFs). To make the code easier to follow, we first walk through the choices behind it.

**Table of Contents**

- [Estimating probability density functions](#Estimating-probability-density-functions)
  - [The multivariate kernel density estimator](#The-multivariate-kernel-density-estimator)
- [The Cauchy-Schwarz divergence measure](#The-Cauchy--Schawartz-divergence-measure)
  - [Approximation 1](#Aproximation-1)
  - [Approximation 2](#Aproximation-2)
  - [Approximation 3](#Aproximation-3)
  - [Approximation 4](#Aproximation-4)
- [Divergence measure function implementation](#Divergence-measure-function-implementation)
  - [Syntax](#Syntax)
  - [Input arguments](#Input-arguments)
  - [Output arguments](#Output-arguments)

## Estimating probability density functions

Let $X = [x_{1}; x_{2}; \dots; x_{N}]$ be an $N \times d$ matrix and $y = [y_{1}; y_{2}; \dots; y_{N}]$ be an $N \times 1$ vector, where $x_{i}$ is a $d$-dimensional observation and $y_{i} \in \{1, 2, \dots, n_{cl}\}$ is the class label of $x_{i}$.

Suppose the observations are partitioned into $k$ clusters, denoted $S_{1}, S_{2}, \dots, S_{k}$, so that $S_{c}$ consists of the rows of $X$ that belong to the $c$-th cluster.

Let $S \in \{S_{1}, S_{2}, \dots, S_{k}\}$ be an arbitrary fixed cluster. Furthermore, given two class labels $a < b$ in $\{1, 2, \dots, n_{cl}\}$, we define:

- $A = [x_{a_{1}}; x_{a_{2}}; \dots; x_{a_{N_{a}}}]$, the $N_{a}$ observations from $X$ that are in cluster $S$ and have class label $a$.
- $B = [x_{b_{1}}; x_{b_{2}}; \dots; x_{b_{N_{b}}}]$, the $N_{b}$ observations from $X$ that are in cluster $S$ and have class label $b$.
- $V_{a} = \mathrm{var}(A)$, a row vector containing the variance of each column of the matrix $A$, i.e., $[V_{a}]_{i} = [\mathrm{cov}(A)]_{ii}$.
- $V_{b} = \mathrm{var}(B)$, a row vector containing the variance of each column of the matrix $B$, i.e., $[V_{b}]_{i} = [\mathrm{cov}(B)]_{ii}$.
- $p_{a}(x)$ and $p_{b}(x)$, the probability density functions (PDFs) of the processes generating the $d$-dimensional samples in $A$ and $B$, respectively.
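
The definitions above can be sketched in code. The following is a minimal illustration in Python with NumPy, which is an assumption since the notebook's implementation language is not shown here. The function name `class_subsets` and the `clusters` vector (one cluster index per row of $X$) are hypothetical, and the variance uses the unbiased $N-1$ normalization so that it matches the diagonal of the sample covariance; the text does not state which normalization is intended.

```python
import numpy as np

def class_subsets(X, y, clusters, c, a, b):
    """Return A, B, V_a, V_b for cluster index c and class labels a, b.

    `clusters` is a hypothetical length-N vector giving the cluster
    index of each row of X; it is not defined in the original text.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    clusters = np.asarray(clusters)

    in_cluster = clusters == c
    A = X[in_cluster & (y == a)]   # rows in the cluster with class label a
    B = X[in_cluster & (y == b)]   # rows in the cluster with class label b

    # Assumption: ddof=1 (unbiased variance), which equals diag(cov(A)).
    V_a = A.var(axis=0, ddof=1)
    V_b = B.var(axis=0, ddof=1)
    return A, B, V_a, V_b
```

Each of `A` and `B` needs at least two rows for the variances to be defined under this normalization.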

### The multivariate kernel density estimator
A multivariate kernel density estimator for $p_{a}(x)$, using a Gaussian kernel function, can be expressed as:

$\hat{p}_{a}(x) = \frac{1}{N_{a}} \sum_{i=1}^{N_{a}} W_{H_{a}}(x, x_{a_{i}})$, with $W_{H_{a}}(x, x_{a_{i}}) = \frac{1}{\sqrt{(2\pi)^{d} |H_{a}|}} \exp\left(-\frac{(x - x_{a_{i}})^{T} H_{a}^{-1} (x - x_{a_{i}})}{2}\right)$, where $x_{a_{i}}$ denotes the $i$-th row of $A$.
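
As a rough illustration of this estimator, the sketch below evaluates it at a single point for an arbitrary symmetric positive-definite matrix $H_{a}$. It is written in Python with NumPy, which is an assumption about the notebook's language, and the function name `gaussian_kde_estimate` is hypothetical. It is meant to mirror the formula term by term rather than to be an efficient implementation.

```python
import numpy as np

def gaussian_kde_estimate(x, A, H):
    """Evaluate the Gaussian kernel density estimate at a single point x.

    A is the (N_a, d) sample matrix and H the (d, d) matrix H_a,
    assumed symmetric positive definite.
    """
    A = np.atleast_2d(np.asarray(A, dtype=float))
    x = np.asarray(x, dtype=float)
    H = np.asarray(H, dtype=float)
    n, d = A.shape

    diff = x - A                                  # row i holds x - x_{a_i}
    solved = np.linalg.solve(H, diff.T).T         # row i holds H^{-1} (x - x_{a_i})
    quad = np.einsum("ij,ij->i", diff, solved)    # quadratic form for each sample

    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(H))
    return np.exp(-0.5 * quad).sum() / (n * norm)
```

For instance, with a single sample at the origin and $H_{a} = I_{2}$, evaluating at the origin gives $1/(2\pi)$, the peak of a standard bivariate Gaussian. Solving the linear system instead of forming $H_{a}^{-1}$ explicitly is a design choice made here for numerical stability.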

Different choices of the $d \times d$ bandwidth matrix $H_{a}$ lead to the different definitions of $\hat{p}_{a}(x)$ found in the literature.

Here are the cases we will explore: 
- By considering $H_{a} = h_{a}^{2} \, \mathrm{mean}(V_{a}) \, I_{d}$, with $h_{a}^{2} = \left(\frac{4}{(2d + 1) N_{a}}\right)^{\frac{2}{d+4}}$
- By considering $H_{a} = h_{a}^{2} \, \mathrm{diag}(V_{a})$, with $h_{a}^{2} = \left(\frac{4}{(d + 2) N_{a}}\right)^{\frac{2}{d+4}}$
- By conse