# Temporal and spatial data mining

## Time Series [Clustering](https://scikit-learn.org/stable/modules/clustering.html#)

### Task 1

**a)** What is "clustering" and why is it an unsupervised learning process?

**b)** Which aspects can influence the outcome of the clusters at the pre-processing stage?

**c)** In which application areas could it be necessary to cluster time series data? Think of a use case.

**d)** Why is hierarchical clustering considered advantageous or disadvantageous?

**e)** What is a dendrogram?

**f)** What are the differences between the "linkage" concepts and what are they used for?

**g)** Can the three "linkage" concepts lead to different results? Why?



**h)** What is $k$-means or $c$-means clustering? How does it work? What do $k$-means and *Expectation Maximization* have to do with each other?


 **---  Your Text Here ----** 


In [1]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
plt.style.use('ggplot')



### Task 2: $k$-Means (Bonus)

**a)** Read the following data set `'./data/chart_data.h5'`

>Plot the data, using the index `TSID` to tell the individual sequences apart, and save the label of each time series sequence in a single array.

In [41]:
df = pd.read_hdf('chart_data.h5')
df_1 = df.loc[df['label']==2]
df['label']  # show the label column; call df['label'].unique() to list the distinct labels

TSID
1      0.0
1      0.0
1      0.0
1      0.0
1      0.0
      ... 
823    3.0
823    3.0
823    3.0
823    3.0
823    3.0
Name: label, Length: 273236, dtype: float64

**b)** Is any preprocessing required? Briefly explain the preprocessing techniques you apply in Python, if any.

**c)** Does it make sense to reduce the dimensionality of the data? Implement a suitable technique in Python if required.

>Find a two-dimensional representation of each time series, e.g. by mapping the complete sequence into a two-dimensional space.
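
One possible approach is principal component analysis (PCA), sketched here on synthetic data rather than on `chart_data.h5`. The sketch assumes that all sequences have the same length and are already stacked into a matrix with one row per series. The name `mySeries` and the sine-wave data are illustrative assumptions; the result name `mySeries_transformed` matches the plotting cell that follows. PCA is only one option among several.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 30 series of length 50,
# built from three sine frequencies plus a little noise.
t = np.linspace(0, 1, 50)
mySeries = np.vstack([
    np.sin(2 * np.pi * f * t) + 0.1 * rng.standard_normal(t.size)
    for f in (1, 2, 3)
    for _ in range(10)
])

# PCA via SVD: center each time step, then project every series
# onto the first two right-singular vectors (principal directions).
centered = mySeries - mySeries.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
mySeries_transformed = centered @ Vt[:2].T

# Share of the total variance kept by the two components.
explained = (S[:2] ** 2).sum() / (S ** 2).sum()
```

`explained` indicates how much of the variance survives the projection; a low value would suggest that two components are not enough to separate the series.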

In [None]:
####################
# Your Code Here   #
####################

In [None]:
plt.figure(figsize=(8,6))
plt.scatter(mySeries_transformed[:,0],mySeries_transformed[:,1], s=100, edgecolor='black')
plt.show()

**d)** Apply $k$-means to cluster the time series data. How many clusters would you use?

>Many evaluation measures exist for estimating the quality of a clustering. Two of the most common are the [Davies-Bouldin index](https://scikit-learn.org/stable/modules/clustering.html#davies-bouldin-index) ([wiki](https://en.wikipedia.org/wiki/Davies%E2%80%93Bouldin_index)) and the average [intra-cluster distance](https://scikit-learn.org/stable/modules/clustering.html#silhouette-coefficient) ([geeksforgeeks](https://www.geeksforgeeks.org/ml-intercluster-and-intracluster-distance/)). Familiarize yourself with both measures.

>**Davies-Bouldin-Index**

>\begin{align}
    \mu_i = & \frac{1}{|C_i|} \sum_{y \in C_i}{y} \text{ (cluster centers)} \\
    d_i = & \frac{1}{|C_i|}\sum_{y \in C_i} d(y,\mu_i) \text{ (average distance to the cluster center)}\\
    R_{ij} = & \frac{d_i + d_j}{d (\mu_i, \mu_j)}\\
    R_i = & \max\{R_{ij} \mid 1 \leq j \leq K,\ i \neq j\}\\
    DB = &\frac{1}{K}\sum_{i=1}^{K}R_i
\end{align}

>$R_{ij}$ : Compactness of two clusters in relation to their distance from each other (the smaller $R_{ij}$ the better $C_i$ and $C_j$ are separated from each other)

>$R_i$: how well is $C_i$ separated from other clusters in the worst case?

>$DB$: Average of all $R_i$ (the smaller the better)
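
The definitions above translate almost line by line into NumPy. The following is a minimal sketch, assuming the Euclidean distance for $d$; the function name `davies_bouldin` and the toy data are chosen for illustration only. scikit-learn provides a ready-made `sklearn.metrics.davies_bouldin_score` for real use.

```python
import numpy as np

def davies_bouldin(X, labels):
    """Davies-Bouldin index with Euclidean distance (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    ids = np.unique(labels)

    # mu_i: cluster centers
    centers = np.array([X[labels == k].mean(axis=0) for k in ids])
    # d_i: average distance of the cluster members to their center
    scatter = np.array([
        np.linalg.norm(X[labels == k] - c, axis=1).mean()
        for k, c in zip(ids, centers)
    ])
    # d(mu_i, mu_j): pairwise distances between the centers
    center_dist = np.linalg.norm(
        centers[:, None, :] - centers[None, :, :], axis=2
    )
    # Setting the diagonal to infinity turns R_ii into 0, which excludes i == j
    np.fill_diagonal(center_dist, np.inf)

    R = (scatter[:, None] + scatter[None, :]) / center_dist  # R_ij
    return R.max(axis=1).mean()                              # mean of R_i

# Toy check: centers (0, 1) and (10, 1), d_i = 1 each, center distance 10,
# so R_01 = (1 + 1) / 10 and DB = 0.2.
X_toy = np.array([[0, 0], [0, 2], [10, 0], [10, 2]])
labels_toy = np.array([0, 0, 1, 1])
db_toy = davies_bouldin(X_toy, labels_toy)
```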

>**Average intra-cluster distance (ICD):**

>The average intra-cluster distance corresponds to the average distance between the points in a cluster multiplied by the number of points minus 1. It becomes smaller as the number of clusters increases.

**Hint: Plot the DB score/ICD for several numbers of clusters and find the optimal value.**
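
The sweep over the number of clusters could look roughly like the sketch below. It runs on synthetic blobs, not on the exercise data, and makes two simplifying assumptions that should be read as such: the `kmeans` helper is a plain Lloyd iteration with a farthest-first initialization (not what scikit-learn's `KMeans` does), and the score is the mean distance of each point to its cluster center, used here as a simple stand-in for the ICD.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd iteration with farthest-first initialization (illustrative)."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        # Distance of every point to its nearest already-chosen center
        d = np.linalg.norm(
            X[:, None, :] - np.array(centers)[None, :, :], axis=2
        ).min(axis=1)
        centers.append(X[d.argmax()])
    centers = np.array(centers)

    for _ in range(n_iter):
        # Assignment step: nearest center for every point
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: mean of the assigned points (empty clusters keep their center)
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def mean_distance_to_center(X, labels, centers):
    """Stand-in compactness score: average distance to the assigned center."""
    return np.linalg.norm(X - centers[labels], axis=1).mean()

# Synthetic data: three well-separated blobs in 2D
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(loc=c, scale=0.3, size=(40, 2))
    for c in ((0, 0), (4, 0), (2, 4))
])

scores = {}
for k in range(2, 7):
    labels, centers = kmeans(X, k)
    scores[k] = mean_distance_to_center(X, labels, centers)
```

Plotting `list(scores.values())` against `list(scores)` with `plt.plot` shows how the score falls as $k$ grows; the number of clusters is usually read off where the curve flattens. Any other index can be scored in the same loop.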

In [None]:
####################
# Your Code Here   #
####################

In [None]:
####################
# Your Code Here   #
####################

**f)** Does a scaling procedure at the preprocessing stage influence the results? Discuss.

### Task 3: DBSCAN (Bonus)

**a)** Now apply DBSCAN to cluster the time series data given in **Task 2**.

>Repeat all the steps above and find the best parameters for DBSCAN. How many clusters do you recognize? What is the difference?
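
The surface-plot cells in this task use `eps_range`, `min_samples_range` and `DBI_row_list`, which the sheet does not define. One way they might be built is sketched below; the synthetic data `X`, the parameter ranges and the handling of noise points are all assumptions made for illustration. Noise points (label `-1`) are dropped before scoring, and parameter pairs that yield fewer than two clusters are stored as `nan`.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import davies_bouldin_score

# Synthetic stand-in for the 2D representation of the series
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=c, scale=0.2, size=(50, 2))
    for c in ((0, 0), (3, 0), (0, 3))
])

eps_range = np.linspace(0.1, 0.6, 6)      # assumed search range
min_samples_range = np.arange(3, 12, 2)   # assumed search range

DBI_row_list = []
for eps in eps_range:
    row = []
    for min_samples in min_samples_range:
        labels = DBSCAN(eps=eps, min_samples=int(min_samples)).fit_predict(X)
        keep = labels != -1                      # drop noise points
        n_clusters = len(set(labels[keep]))
        if 2 <= n_clusters < keep.sum():         # the score needs at least 2 clusters
            row.append(davies_bouldin_score(X[keep], labels[keep]))
        else:
            row.append(np.nan)
    DBI_row_list.append(row)

# One row per min_samples value, one column per eps value,
# which matches np.meshgrid(eps_range, min_samples_range)
DBI_grid = np.vstack(DBI_row_list).transpose()
```

An `ICD_row_list` can be filled in the same loop with whichever intra-cluster measure is used.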

In [None]:
plt.figure(figsize=(10,8))
plt.scatter(mySeries_transformed[:,0],mySeries_transformed[:,1], s=50)
plt.show()

In [None]:
####################
# Your Code Here   #
####################

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(30, 60)
X, Y = np.meshgrid(eps_range,min_samples_range)
ax.plot_surface(X, Y, np.vstack(DBI_row_list).transpose(), cmap="plasma")

In [None]:
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111, projection='3d')
ax.view_init(30, 225)
X, Y = np.meshgrid(eps_range,min_samples_range)
ax.plot_surface(X, Y, np.vstack(ICD_row_list).transpose(), cmap="plasma")

In [None]:
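### SOLUTION ###
from sklearn.cluster import DBSCAN

# Fit DBSCAN on the two-dimensional representation of the series
db = DBSCAN(eps=0.2, min_samples=10).fit(mySeries_transformed)
# Boolean mask that marks the core samples
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
# Cluster label per sample; -1 marks noise
labels = db.labels_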

**d)** Plot the data as a scatter plot, coloring each point by its cluster label.

In [None]:
####################
# Your Code Here   #
####################