# dislib hands-on exercise

This notebook includes some exercises to learn the basics of using [dislib](https://dislib.bsc.es).

## Requirements

Apart from dislib, this notebook requires [PyCOMPSs 2.8 or higher](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/).


## Setup


First, we need to start an interactive PyCOMPSs session:

In [None]:
import pycompss.interactive as ipycompss
import os

os.environ["ComputingUnits"] = "1"

if 'BINDER_SERVICE_HOST' in os.environ:
    ipycompss.start(graph=True,
                    project_xml='../xml/project.xml',
                    resources_xml='../xml/resources.xml')
else:
    ipycompss.start(graph=True, monitor=1000)

Next, we import dislib and we are all set to start working!

In [None]:
import dislib as ds

## Machine learning with dislib

Dislib provides an estimator-based API very similar to [scikit-learn](https://scikit-learn.org/stable/). An estimator is anything that learns from data. To illustrate how an estimator works, let's first generate some data:

In [None]:
from sklearn.datasets import make_blobs

x_np, y = make_blobs(n_samples=1500, random_state=170)

`x_np` and `y` are random samples and labels. Samples are vectors, and labels are numbers that represent the category of each sample. In this example, we are going to run clustering algorithms, which are useful for understanding **unlabeled** data, so we will not use `y`.

Since the samples in `x_np` are 2-dimensional, we can plot them and see that there are 3 clusters in our data:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(x_np[:, 0], x_np[:, 1])

To use dislib, we first need to convert `x_np` to a ds-array:

In [None]:
x = ds.array(x_np, block_size=(300, 2))
x
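
The `block_size=(300, 2)` argument controls how the ds-array is partitioned. The following is a rough sketch in plain Python, not a dislib call: the helper `count_blocks` is a made-up name, and it assumes blocks form a regular grid in which the last block along each axis may be smaller. Under that assumption, the 1500-by-2 array above is divided as follows:

```python
import math

def count_blocks(shape, block_size):
    """Hypothetical helper: number of blocks along the rows and the columns,
    assuming a regular grid whose last block may be smaller."""
    n_rows, n_cols = shape
    block_rows, block_cols = block_size
    return math.ceil(n_rows / block_rows), math.ceil(n_cols / block_cols)

row_blocks, col_blocks = count_blocks((1500, 2), (300, 2))
print(row_blocks, col_blocks)  # 5 blocks of 300 rows, 1 block of 2 columns
```

Smaller blocks mean more, finer-grained pieces of work; larger blocks mean fewer, coarser ones.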

## Using DBSCAN

We have already seen how K-means behaves. K-means is a simple yet effective clustering method. However, it has a major drawback: the number of clusters needs to be defined beforehand. This is not always possible, and other clustering methods have attempted to address this limitation.

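An example is DBSCAN, which is a density-based clustering algorithm. In DBSCAN, users define density using two parameters: `eps`, the radius of the neighborhood around a sample, and `min_samples`, the number of samples that must fall inside that neighborhood for the sample to count as part of a dense region. The algorithm then finds an arbitrary number of clusters based on these two parameters.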
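
To make the role of the two parameters concrete, here is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is not dislib's implementation, which is distributed; the function names are chosen for illustration, and marking noise with `-1` follows the scikit-learn convention.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # only core points expand the cluster
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_samples=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Note that the number of clusters (two here, plus one noise point) is an output of the algorithm rather than an input.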

Your task now is to experiment with different `eps` and `min_samples` values to see how DBSCAN performs with the blob data!

**WARNING:** DBSCAN works a bit differently from K-means. You can find its API reference [here](https://dislib.readthedocs.io/en/stable/dislib.cluster.dbscan.html#dislib.cluster.dbscan.base.DBSCAN).

In [None]:
from dislib.cluster import DBSCAN

# fill in the values for eps and min_samples
dbscan = DBSCAN(eps=, min_samples=)

In [None]:
# fit and predict the labels for x
y_pred = 

When you are done, you can plot the results with the following code (assuming predicted labels are in `y_pred`):

In [None]:
# set the color of each sample to the predicted label
plt.scatter(x_np[:, 0], x_np[:, 1], c=y_pred.collect())

Now let's try with different data.

In [None]:
from sklearn.datasets import make_circles
x_np, _ = make_circles(n_samples=1500, factor=.5, noise=.05, random_state=170)

Use K-means and DBSCAN to cluster the data in `x_np`. Which algorithm performs better?

In [None]:
# Create ds-array
x = 

from dislib.cluster import KMeans

# Create estimators
kmeans =  
dbscan =  

# Fit and predict labels
y_dbscan =  
y_km =  

In [None]:
# Use this to plot the results

ax = plt.subplot(121)
ax.title.set_text("DBSCAN")
ax.scatter(x_np[:, 0], x_np[:, 1], c=y_dbscan.collect())
ax = plt.subplot(122)
ax.title.set_text("K-means")
ax.scatter(x_np[:, 0], x_np[:, 1], c=y_km.collect())

## Classification

Now we will solve an exercise using the digits data set from scikit-learn. Samples in this data set represent images of handwritten digits (0 to 9), where each feature represents a pixel in the image.

First, we load the data set:

In [None]:
from sklearn.datasets import load_digits

digits = load_digits()
x_np = digits.data
y = digits.target.reshape(-1, 1)

x_np.shape

`x_np` contains 1797 samples of 64 features each, and `y` contains the labels, which in this case are the handwritten digits themselves.

In [None]:
digit = x_np[23]
digit

We can actually see the digit by reshaping the vector and plotting it:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

digit = digit.reshape((8,8))
plt.imshow(digit)

...and the corresponding label should be a 3:

In [None]:
y[23]

Although the original data set has 10 different labels, we can simplify the problem by converting it into a binary classification problem, where even numbers have label=0 and odd numbers have label=1:

In [None]:
y = y % 2

Now `y[23]` should be 1:

In [None]:
y[23]
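
As a quick check of the mapping in plain Python, independent of the data set, the ten digit classes collapse into two classes of five digits each:

```python
digit_classes = list(range(10))
binary_labels = [d % 2 for d in digit_classes]  # 0 for even, 1 for odd

for digit, label in zip(digit_classes, binary_labels):
    print(digit, "->", label)

# digit 3 is odd, so its binary label is 1
print(binary_labels[3])
```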

Classification is different from clustering in that labels are also used for the fitting process. Once a classifier is fitted, we can use it to label unlabeled data.

To simulate having labeled (training) and unlabeled (test) data, we split the digits data set:

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x_np, y)

`x_train` and `x_test` now contain 75% and 25% of the samples in `x_np`, respectively.
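
The exact sizes can be worked out by hand. This sketch assumes the default 25% test fraction and that the test count is rounded up, which is our understanding of how scikit-learn computes it; checking `x_train.shape` and `x_test.shape` in the notebook confirms the real values.

```python
import math

n_samples = 1797
test_fraction = 0.25  # assumed default when no sizes are passed

n_test = math.ceil(test_fraction * n_samples)  # assumption: rounded up
n_train = n_samples - n_test
print(n_train, n_test)
```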

Your task is to use `CascadeSVM` and `RandomForestClassifier` to classify the digits data, and to compute the accuracy of each!

### Hints:

- You can find dislib's API reference [here](https://dislib.readthedocs.io/en/stable/api-reference.html).
- Remember to convert data to ds-arrays before passing them to the classifiers.
- Do not worry too much about the classifiers' parameters; you can use the default values.
- Use the training data to fit the estimators, and the test data to check the accuracy.
- Accuracy can be obtained using the `score` method.
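
For reference, accuracy is the fraction of samples whose predicted label matches the true label. The sketch below computes it by hand in plain Python on made-up labels; the `accuracy` function is illustrative only and assumes that `score` reports this same quantity, as it does for scikit-learn classifiers.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# made-up labels for illustration: 4 of the 5 predictions are correct
y_true_toy = [0, 1, 1, 0, 1]
y_pred_toy = [0, 1, 0, 0, 1]
print(accuracy(y_true_toy, y_pred_toy))  # 0.8
```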

In [None]:
# convert your data to ds-arrays
x_ds_train = ds.array(x_train, block_size=(300, 64))
y_ds_train = ds.array(y_train, block_size=(300, 1))

x_ds_test = 
y_ds_test =  

In [None]:
from dislib.classification import CascadeSVM

# create CascadeSVM estimator
csvm =  


# fit the estimator with training data
 


# print the accuracy on the test data

from pycompss.api.api import compss_wait_on



In [None]:
# now do the same as above using the RandomForestClassifier :)

 




Which classifier gets better results?

## Hyperparameter optimization

A classifier's performance is highly sensitive to its initialization parameters (or hyperparameters), and it is difficult to know beforehand which values are optimal. Thankfully, there are hyperparameter optimization techniques that allow us to find good parameters for our classification problem.

One of these techniques is grid search with cross-validation. This model selection algorithm performs an exhaustive search over a predefined set of hyperparameter values and keeps the combination that obtains the best cross-validated score.
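
The "exhaustive" part can be sketched in a few lines of plain Python. The grid values mirror the search space used in this exercise, but `toy_score` is a made-up stand-in for a cross-validated score: its peak is arbitrary and says nothing about which parameters actually work best for `CascadeSVM`.

```python
import itertools
import math

param_grid = {"gamma": (0.1, 0.01, 0.0001), "c": (1, 10, 100)}

def toy_score(gamma, c):
    """Made-up score for illustration only; it peaks at gamma=0.01, c=10."""
    return -((math.log10(gamma) + 2) ** 2) - (math.log10(c) - 1) ** 2

# every combination of values in the grid is one candidate
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in itertools.product(*param_grid.values())]

# exhaustive search: score every candidate and keep the best one
best = max(candidates, key=lambda params: toy_score(**params))
print(len(candidates))  # 3 gamma values x 3 c values = 9 candidates
print(best)
```

In a real grid search, each candidate is fitted and scored once per cross-validation fold, so the total number of fits is the number of candidates times the number of folds.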

Try to improve the score of the CascadeSVM classifier using grid search! You can find the GridSearchCV reference [here](https://dislib.readthedocs.io/en/stable/dislib.model_selection.html#dislib.model_selection.GridSearchCV).

In [None]:
from dislib.model_selection import GridSearchCV

# hyperparameter search space
params = {
    "gamma" : (0.1, 0.01, 0.0001), 
    "c" : (1, 10, 100)
}

csvm = CascadeSVM()

# use grid search with your training data (it might take a while, be patient)
searcher =  
 



In [None]:
# Use this cell to print the results in a nice way
import pandas as pd
pd.DataFrame(searcher.cv_results_)[["params", "mean_test_score"]]

Now, try the optimal parameters with the test data:

In [None]:
# Set the optimal parameters here
csvm = CascadeSVM(gamma= , c= )

csvm.fit(x_ds_train, y_ds_train)

from pycompss.api.api import compss_wait_on
compss_wait_on(csvm.score(x_ds_test, y_ds_test))

Did the results improve?

## Close the session

To finish the session, we need to stop PyCOMPSs:

In [None]:
ipycompss.stop()