
## 1-D Gaussian Kernel Density Estimator

This notebook plots a 1-D kernel density estimate (KDE) for `N=100` data points drawn from a mixture of two Gaussians.

The free parameters of kernel density estimation are the `kernel`, which specifies the shape of the distribution placed at each point, and the `kernel bandwidth`, which controls the size of the kernel at each point.

In this case, the kernel used is a `Gaussian kernel` and `bandwidth` is a tunable parameter.
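
To make the roles of the kernel and the bandwidth concrete, here is a minimal NumPy sketch of a Gaussian KDE written out by hand: the estimate at `x` is the average of Gaussian bumps of width `h`, one centred on each data point. The helper name `gaussian_kde_1d`, the random generator, and the sample data are illustrative assumptions; the notebook itself relies on scikit-learn's `KernelDensity`.

```python
import numpy as np

def gaussian_kde_1d(x_eval, samples, h):
    """Hypothetical helper: Gaussian KDE evaluated at the points x_eval.

    density(x) = 1 / (n * h) * sum_i phi((x - x_i) / h),
    where phi is the standard normal pdf and h is the bandwidth.
    """
    x_eval = np.asarray(x_eval, dtype=float)
    samples = np.asarray(samples, dtype=float)
    # Pairwise scaled distances, shape (len(x_eval), len(samples))
    u = (x_eval[:, None] - samples[None, :]) / h
    phi = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return phi.sum(axis=1) / (samples.size * h)

# Illustrative data: a 30% / 70% mixture of N(0, 1) and N(5, 1)
rng = np.random.default_rng(0)
samples = np.concatenate((rng.normal(0, 1, 30), rng.normal(5, 1, 70)))

grid = np.linspace(-10, 15, 2001)
density = gaussian_kde_1d(grid, samples, h=0.5)

# A valid density integrates to approximately 1 over a wide enough grid
area = np.sum(density) * (grid[1] - grid[0])
print("Area under the estimate:", round(area, 3))
```

Changing `h` rescales every bump at once: smaller values give narrower, taller bumps, and larger values give wider, flatter ones, while the total area stays at 1.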

The choice of `bandwidth` within KDE is extremely important to finding a suitable density estimate, and is the knob that controls the bias–variance trade-off in the estimate of density: too narrow a `bandwidth` leads to a high-variance estimate (i.e., over-fitting), where the presence or absence of a single point makes a large difference. Too wide a `bandwidth` leads to a high-bias estimate (i.e., under-fitting) where the structure in the data is washed out by the wide kernel.
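
One rough way to see this trade-off numerically is to count the local maxima of the estimate: an under-smoothed estimate shows many spurious peaks, while an over-smoothed one can merge the two true modes into a single bump. The sketch below assumes a hand-written Gaussian KDE and hypothetical helper names (`kde`, `count_peaks`); the bandwidth values are chosen only for illustration.

```python
import numpy as np

def kde(x_eval, samples, h):
    # Gaussian KDE: average of normal pdfs with standard deviation h
    u = (x_eval[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (samples.size * h * np.sqrt(2.0 * np.pi))

def count_peaks(y):
    # Number of strict interior local maxima of a sampled curve
    return int(np.sum((y[1:-1] > y[:-2]) & (y[1:-1] > y[2:])))

# Illustrative data: a 30% / 70% mixture of N(0, 1) and N(5, 1)
rng = np.random.default_rng(0)
samples = np.concatenate((rng.normal(0, 1, 30), rng.normal(5, 1, 70)))
grid = np.linspace(-5, 10, 3000)

for h in (0.05, 0.5, 3.0):
    print(f"bandwidth={h}: {count_peaks(kde(grid, samples, h))} local maxima")

peaks_narrow = count_peaks(kde(grid, samples, 0.05))
peaks_wide = count_peaks(kde(grid, samples, 3.0))
```

The peak count is only a crude proxy for variance, but it moves in the direction described above as the bandwidth grows.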


In [None]:
# Import libraries

import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn.neighbors import KernelDensity
from scipy.stats import norm
from ipywidgets import interactive_output
from IPython.display import display
import ipywidgets as widgets
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import LeaveOneOut
import seaborn as sns; sns.set()

# Generating the data points
N = 100
np.random.seed(1)
# Mixture of two Gaussians: 30% of the points from N(0, 1), 70% from N(5, 1)
X = np.concatenate((np.random.normal(0, 1, int(0.3 * N)),
                    np.random.normal(5, 1, int(0.7 * N))))[:, np.newaxis]   # shape (N, 1)

def plot_portfolio(X=X, bandwidth=0.5):
  '''
  Plot the true density (a mixture of two Gaussians) together with the
  Gaussian kernel density estimate fitted to the samples in X.

  Input:
  X         : array of shape (N, 1) holding the data points
  bandwidth : smoothing parameter of the kernel.
              Too small (undersmoothing): the estimate contains many spurious data artifacts.
              Too large (oversmoothing): the estimate obscures much of the underlying structure.
              The bandwidth is well chosen when the density estimate is close to the true density.

  '''
  # Plot a 1D density example


  X_plot = np.linspace(-5, 10, 1000)[:, np.newaxis]

  # Note : scipy.stats.norm(mu,sigma).pdf(x) computes the value of the pdf at point x for a given mu, sigma
  true_dens = (0.3 * norm(0, 1).pdf(X_plot[:, 0])
              + 0.7 * norm(5, 1).pdf(X_plot[:, 0]))

  plt.figure(figsize=(7,6))
  # Plot true distribution
  
  plt.fill(X_plot[:, 0], true_dens, fc='black', alpha=0.2,
          label='Input Distribution') 
  
  # Plot KDE for given bandwidth
  kde = KernelDensity(kernel='gaussian', bandwidth=bandwidth).fit(X)

  # Evaluate the log-density of the fitted model on the plotting grid
  log_dens = kde.score_samples(X_plot)
  # score_samples returns log-densities, so exponentiate before plotting the pdf
  plt.plot(X_plot[:, 0], np.exp(log_dens), color='navy', lw=2,
          linestyle='-', label='Gaussian Estimate')

  plt.text(6.5, 0.38, "N={0} points".format(X.shape[0]))
  # Rug plot: mark each data point just below the x-axis
  plt.plot(X[:, 0], np.full_like(X[:, 0], -0.01), '|k', markeredgewidth=1, label='Datapoints')

  plt.legend(loc='upper left')
  #plt.plot(X[:, 0], -0.005 - 0.01 * np.random.random(X.shape[0]), '+k')


  plt.xlim(-3.5, 9.5)
  plt.ylim(-0.02, 0.4)
  plt.show()





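# Choose the initial bandwidth by cross-validated grid search over
# 100 log-spaced candidates between 0.1 and 10. GridSearchCV ranks them with
# KernelDensity.score, i.e. the total log-likelihood of the held-out points.
bandwidths = 10 ** np.linspace(-1, 1, 100)
grid = GridSearchCV(KernelDensity(kernel='gaussian'),
                    {'bandwidth': bandwidths}).fit(X)
bandwidth = grid.best_params_['bandwidth']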


print("Initial bandwidth=",bandwidth)

bandwidth_slider=widgets.FloatSlider(value=bandwidth,
                                 min=0.1,
                                 max=2,
                                 step=.1,
                                 description='Bandwidth',
                                 continuous_update=False)

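# BoundedFloatText enforces min/max; a plain FloatText does not apply these bounds,
# so a zero or negative bandwidth could otherwise be typed in
bandwidth_text=widgets.BoundedFloatText(value=bandwidth,
                                 min=0.1,
                                 max=2,
                                 step=.1,
                                 description='Bandwidth',
                                 continuous_update=False)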

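# Keep the slider and the text box in sync, and show them side by side
widgets.link((bandwidth_slider, 'value'), (bandwidth_text, 'value'))
bandwidth_widget = widgets.HBox([bandwidth_slider, bandwidth_text])

# Redraw the plot whenever the bandwidth value changes
out = interactive_output(plot_portfolio, {"bandwidth": bandwidth_text})
display(bandwidth_widget, out)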

Initial bandwidth= 0.6135907273413174
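
The initial bandwidth printed above comes from `GridSearchCV` with its default K-fold split. Because `X` is built by concatenating the two components without shuffling, contiguous folds may hold out points from only one component, which can bias the selected value. One alternative is leave-one-out cross-validation; with scikit-learn this would mean passing `cv=LeaveOneOut()` to `GridSearchCV` (the notebook already imports `LeaveOneOut`). The sketch below shows the same idea in plain NumPy under stated assumptions: the helper `loo_log_likelihood` and the sample data are hypothetical, and its result is not expected to match the value printed above.

```python
import numpy as np

def loo_log_likelihood(samples, h):
    """Hypothetical helper: leave-one-out log-likelihood of a Gaussian KDE."""
    n = samples.size
    u = (samples[:, None] - samples[None, :]) / h
    k = np.exp(-0.5 * u ** 2) / (h * np.sqrt(2.0 * np.pi))
    np.fill_diagonal(k, 0.0)  # each point is scored by a KDE fitted without it
    loo_density = k.sum(axis=1) / (n - 1)
    # Guard against log(0) when a point has no neighbours within reach of h
    return np.sum(np.log(np.maximum(loo_density, 1e-300)))

# Illustrative data: a 30% / 70% mixture of N(0, 1) and N(5, 1)
rng = np.random.default_rng(0)
samples = np.concatenate((rng.normal(0, 1, 30), rng.normal(5, 1, 70)))

# Same candidate grid as the notebook: 100 log-spaced values in [0.1, 10]
bandwidths = 10 ** np.linspace(-1, 1, 100)
scores = np.array([loo_log_likelihood(samples, h) for h in bandwidths])
best_bandwidth = bandwidths[np.argmax(scores)]
print("Leave-one-out bandwidth:", best_bandwidth)
```

Leave-one-out does not depend on the order of the data, at the cost of one held-out evaluation per point; for `N=100` in one dimension that cost is negligible.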

