## Kernel density estimation



Kernel density estimation (KDE) is a non-parametric way of estimating an unknown probability density function from a sample: a kernel function $K(u)$ is centered on each observation and the results are averaged.
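To make the definition concrete, here is a minimal, dependency-free sketch of the standard estimator $\hat{f}_h(x) = \frac{1}{nh}\sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$ with a Gaussian kernel. The sample values, the bandwidth `h = 0.4`, and the function names are assumptions chosen only for illustration; they are not what `statsmodels` uses internally.

```python
import math


def gaussian_kernel(u):
    """Standard normal density, a common choice for K(u)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_at(x, samples, bandwidth):
    """Evaluate (1 / (n h)) * sum_i K((x - x_i) / h) at a single point x."""
    n = len(samples)
    total = sum(gaussian_kernel((x - xi) / bandwidth) for xi in samples)
    return total / (n * bandwidth)


samples = [-1.2, -0.9, 0.8, 1.0, 1.1, 1.3]  # made-up observations
bandwidth = 0.4  # arbitrary smoothing parameter h (an assumption)

# Evaluate the estimate on a grid over [-5, 5]
step = 0.01
grid = [-5.0 + step * i for i in range(1001)]
density = [kde_at(x, samples, bandwidth) for x in grid]

# A density should integrate to 1; check with a simple Riemann sum
area = sum(density) * step
print(round(area, 3))
```

A larger bandwidth gives a smoother estimate, while a smaller one follows the individual observations more closely.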

In the following example, we will estimate the PDF of a bimodal distribution: a mixture of two normal distributions centered at -1 and 1.



In [None]:
import numpy as np
import statsmodels.api as sm
from scipy import stats
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Location, scale and weight for the two distributions
dist1_loc, dist1_scale, weight1 = -1, 0.5, 0.25
dist2_loc, dist2_scale, weight2 = 1, 0.5, 0.75

In [None]:
from statsmodels.distributions.mixture_rvs import mixture_rvs
# Sample from a mixture of distributions
obs_dist = mixture_rvs(prob=[weight1, weight2], size=250, 
                        dist=[stats.norm, stats.norm],
                        kwargs = (dict(loc=dist1_loc, scale=dist1_scale),
                                  dict(loc=dist2_loc, scale=dist2_scale)))

The simplest non-parametric technique for density estimation is the histogram.
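As a rough illustration of why a histogram counts as a density estimate, the sketch below divides each bin count by `n * width`, so the bar areas sum to 1. This is the same normalization Matplotlib applies when `hist` is called with `density=True`. The sample values, the number of bins, and the helper name `histogram_density` are assumptions made for this example.

```python
def histogram_density(samples, n_bins):
    """Return bar heights normalized so that the total bar area is 1."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for x in samples:
        # The maximum lies on the right edge, so clamp it into the last bin
        idx = min(int((x - lo) / width), n_bins - 1)
        counts[idx] += 1
    n = len(samples)
    heights = [c / (n * width) for c in counts]
    return heights, width


samples = [-1.2, -0.9, 0.8, 1.0, 1.1, 1.3]  # made-up observations
heights, width = histogram_density(samples, n_bins=5)

# Total area under the bars
area = sum(h * width for h in heights)
print(heights, area)
```

The estimate is piecewise constant and depends on where the bin edges fall, which is the main motivation for smoother estimators.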



In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)
ax.scatter(obs_dist, np.abs(np.random.randn(obs_dist.size)), 
            zorder=15, color='red', marker='x', alpha=0.5, label='Samples')
lines = ax.hist(obs_dist, bins=20, edgecolor='k', label='Histogram')
ax.legend(loc='best')
ax.grid(True, zorder=-5)

Now let's fit a kernel density estimate, which gives a continuous probability density function.



In [None]:
kde = sm.nonparametric.KDEUnivariate(obs_dist)
kde.fit()

Plotting the results



In [None]:
fig = plt.figure()
ax = fig.add_subplot(111)

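# Plot the histogram, normalized so that the bar areas sum to 1
# (`density=True` replaces the `normed` argument removed from Matplotlib)
ax.hist(obs_dist, bins=20, density=True, label='Histogram from samples',
        zorder=5, edgecolor='k', alpha=0.5)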

# Plot the KDE as fitted using the default arguments
ax.plot(kde.support, kde.density, lw=3, label='KDE from samples', zorder=10)

# Plot the true distribution
true_values = (stats.norm.pdf(loc=dist1_loc, scale=dist1_scale, x=kde.support)*weight1
              + stats.norm.pdf(loc=dist2_loc, scale=dist2_scale, x=kde.support)*weight2)
ax.plot(kde.support, true_values, lw=3, label='True distribution', zorder=15)

# Plot the samples
ax.scatter(obs_dist, np.abs(np.random.randn(obs_dist.size))/40, 
           marker='x', color='red', zorder=20, label='Samples', alpha=0.5)

ax.legend(loc='best')
ax.grid(True, zorder=-5)

## Machine learning



-   `scikit-learn` is the most widely used ML library in Python
-   standard supervised and unsupervised machine learning methods
-   models that can be used for classification, clustering, prediction



![img](images/sklearn.png)



## Let's download some datasets!



Go to [https://www.sciencebase.gov/catalog/item/5a58af4fe4b00b291cd6a5fb](https://www.sciencebase.gov/catalog/item/5a58af4fe4b00b291cd6a5fb) and download these files

-   `BASIN_CHARACTERISTICS.csv`
-   `MON_P_CRU_19012015.csv`
-   `MON_T_CRU_19012015.csv`



and load them up



In [None]:
import pandas as pd
b = pd.read_csv("examples/BASIN_CHARACTERISTICS.csv")
p = pd.read_csv("examples/MON_P_CRU_19012015.csv")
t = pd.read_csv("examples/MON_T_CRU_19012015.csv")

## Clustering



-   One of the simplest unsupervised classification methods
-   We will focus on the **k-means** clustering algorithm



In [None]:
from sklearn.cluster import KMeans

## K-means clustering



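K-means clusters data by trying to separate the $n$ samples into $k$ groups of equal variance, choosing the centroids $\mu_{j}$ of the clusters $C$ so as to minimize the inertia:

$$\sum_{i=1}^{n} \min\limits_{\mu_{j} \in C} \left(||x_{i} - \mu_{j}||^{2}\right)$$

Inertia, or the within-cluster sum-of-squares criterion, is a measure of how internally coherent clusters are: the lower it is, the closer the samples are to their assigned centroid. Let's look at an example.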
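As a quick check of the formula first, the inertia can be computed by hand for a toy dataset. The sketch below is plain Python rather than scikit-learn code; the four points and the two candidate sets of centroids are made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def inertia(points, centroids):
    """Sum over samples of the squared distance to the nearest centroid."""
    return sum(min(squared_distance(x, mu) for mu in centroids) for x in points)


# Two obvious groups: one near x = 0 and one near x = 10
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

good_centroids = [(0.0, 1.0), (10.0, 1.0)]  # one centroid per group
poor_centroids = [(5.0, 0.0), (5.0, 2.0)]   # centroids placed between the groups

print(inertia(points, good_centroids))  # 4.0
print(inertia(points, poor_centroids))  # 100.0
```

The k-means algorithm searches for the centroids that make this number as small as possible for a given number of clusters.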



In [None]:
from sklearn.datasets import make_blobs
# Generate synthetic 2-D samples drawn from a few Gaussian blobs
n_samples = 1500
random_state = 170
X, y = make_blobs(n_samples=n_samples, random_state=random_state)

Now let's cluster the data and predict the label of each sample



In [None]:
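import matplotlib.pyplot as plt
%matplotlib inline
# Fit k-means with three clusters and assign each sample to a cluster
model = KMeans(n_clusters=3, random_state=random_state).fit(X)
y_pred = model.predict(X)
# Color each point by its predicted cluster
plt.figure(figsize=(12, 12))
plt.scatter(X[:, 0], X[:, 1], c=y_pred)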

## Let's try to classify the basins according to mean temperature and precipitation!



-   Create the appropriate arrays to be used as inputs to the clustering algorithm
-   Cluster basins according to precipitation
-   Cluster basins according to temperature and precipitation
-   Do you need to add any more features?



## How do we select the number of clusters?



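When the true labels are unknown, we can use silhouette analysis to evaluate the number of clusters.

The silhouette coefficient of a single sample is defined as
$$s = \dfrac{b-a}{\max(a,b)}$$
where $a$ is the mean distance between the sample and all other points in the same cluster, and $b$ is the mean distance between the sample and all points in the next nearest cluster. It ranges from -1 (likely assigned to the wrong cluster) to 1 (well inside its own cluster); the score for a whole clustering is the mean over all samples.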
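To see how $a$ and $b$ enter the coefficient, here is a small plain-Python sketch for a single sample. The point coordinates and the helper names are made up for illustration; `math.dist` requires Python 3.8 or later.

```python
import math


def mean_distance(point, others):
    """Mean Euclidean distance from `point` to each point in `others`."""
    return sum(math.dist(point, o) for o in others) / len(others)


def silhouette(point, own_cluster, other_clusters):
    """Silhouette coefficient s = (b - a) / max(a, b) for one sample.

    `own_cluster` holds the other members of the sample's cluster;
    `other_clusters` is a list of the remaining clusters.
    """
    a = mean_distance(point, own_cluster)
    b = min(mean_distance(point, c) for c in other_clusters)
    return (b - a) / max(a, b)


sample = (0.0, 0.0)
own = [(0.0, 1.0), (1.0, 0.0)]      # a = 1
near = [(4.0, 0.0), (6.0, 0.0)]     # mean distance 5, so b = 5
far = [(0.0, 10.0), (0.0, 12.0)]    # mean distance 11, not the nearest

s = silhouette(sample, own, [near, far])
print(s)  # 0.8
```

A value close to 1 means the sample is much closer to its own cluster than to the nearest neighboring one.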



### Silhouette analysis example



In [None]:
from sklearn import metrics
labels = model.labels_
metrics.silhouette_score(X, labels, metric='euclidean')

### Use Silhouette Analysis to evaluate the number of clusters for the basins



-   Perform the clustering on the basins data and calculate the silhouette coefficient as a function of the number of clusters
-   Plot the relationship between them.
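One possible way to organize this exercise is sketched below on synthetic blobs rather than on the basin data, since the basin columns to use are left to the reader. The array `X_demo`, the range of cluster counts, and `n_init=10` are assumptions made for this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the basin features (an assumption, not the real data)
X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The silhouette score needs at least two clusters, so start at k = 2
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels, metric="euclidean")

# Candidate number of clusters: the k with the highest mean silhouette
best_k = max(scores, key=scores.get)
print(scores, best_k)
```

The relationship can then be plotted with, for example, `plt.plot(list(scores), list(scores.values()), marker="o")`.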



## Support Vector Machines



Support vector machines (SVMs) are a set of supervised learning methods used for classification, regression and outlier detection.



The advantages of support vector machines are:

-   Effective in high dimensional spaces
-   Still effective in cases where the number of dimensions is greater than the number of samples
-   Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient
-   Versatile: different kernel functions can be specified for the decision function



### We will focus on regression



Let's try an example



In [None]:
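from sklearn import svm
# Two training samples with two features each, and their target values
X = [[0, 0], [2, 2]]
y = [0.5, 2.5]
# Support vector regression with the default (RBF) kernel
clf = svm.SVR()
clf.fit(X, y)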

In [None]:
clf.predict([[1, 1]])

### Now let's try to use SVMs to predict mean temperature as a function of elevation



-   Build a regression using SVMs on the basin data
-   Split the dataset into training and validation subsets
-   Is the regression model any good?



In [None]:
from sklearn.model_selection import train_test_split
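A possible workflow for this exercise is sketched below on synthetic data, because the relevant column names in the basin files are not given here. The elevation range, the linear temperature trend, and the noise level are all made up for illustration. Scaling the feature before fitting is a design choice of this sketch, motivated by the sensitivity of SVMs to feature scale.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in for the basin data (assumed values, not real observations)
rng = np.random.default_rng(0)
elevation = rng.uniform(0.0, 3000.0, size=200)  # metres
temperature = 15.0 - 0.0065 * elevation + rng.normal(0.0, 1.0, size=200)

# scikit-learn expects a 2-D feature array of shape (n_samples, n_features)
X_elev = elevation.reshape(-1, 1)

# Hold out 25% of the samples for validation
X_train, X_val, y_train, y_val = train_test_split(
    X_elev, temperature, test_size=0.25, random_state=0
)

# Standardize the feature, then fit a support vector regression
svr_model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
svr_model.fit(X_train, y_train)

# Coefficient of determination (R^2) on data the model has not seen
r2 = svr_model.score(X_val, y_val)
print(r2)
```

An $R^2$ close to 1 on the validation subset suggests the model generalizes; a value much lower than the training score suggests overfitting.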

## Homework



![img](https://media.makeameme.org/created/no-homework-brwfr0.jpg)