# dislib tutorial

This tutorial shows the basics of using [dislib](https://dislib.bsc.es).

## Requirements

Apart from dislib, this notebook requires [PyCOMPSs 2.8](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/).


## Setup


First, we need to start an interactive PyCOMPSs session:

In [None]:
import pycompss.interactive as ipycompss
import os

os.environ["ComputingUnits"] = "1"

if 'BINDER_SERVICE_HOST' in os.environ:
    ipycompss.start(graph=True,
                    project_xml='../xml/project.xml',
                    resources_xml='../xml/resources.xml')
else:
    ipycompss.start(graph=True, monitor=1000)

Next, we import dislib and we are all set to start working!

In [None]:
import dislib as ds

## Distributed arrays

The main data structure in dislib is the distributed array (or ds-array). A ds-array is a distributed representation of a 2-dimensional array that can be operated on like a regular Python object. Usually, rows in the array represent samples, while columns represent features.

To create a random array we can run the following NumPy-like command:

In [None]:
x = ds.random_array(shape=(500, 500), block_size=(100, 100))
print(x.shape)
x

Now `x` is a 500x500 ds-array of random numbers stored in blocks of 100x100 elements. Note that `x` is not stored in local memory. Instead, `random_array` generates the contents of the array in tasks that are usually executed remotely, which makes it possible to create very large arrays.

The content of `x` is a grid of `Futures`, one per block, that represent the actual data (wherever it is stored).

To see this, we can access the `_blocks` field of `x`:

In [None]:
x._blocks[0][0]

`block_size` controls the granularity of dislib algorithms: smaller blocks produce more, finer-grained tasks, while larger blocks produce fewer, coarser ones.
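
As a rough illustration of granularity, the following plain-Python sketch counts how many blocks a given `block_size` yields for a 500x500 array. The helper `block_grid` is hypothetical (it is not part of dislib), and the sketch assumes that the array is tiled left to right and top to bottom, with possibly smaller blocks at the edges.

```python
import math


def block_grid(shape, block_size):
    """Return how many blocks are needed along each dimension."""
    n_rows, n_cols = shape
    b_rows, b_cols = block_size
    # Assumption: the last block in a dimension may be smaller, hence the ceiling.
    return math.ceil(n_rows / b_rows), math.ceil(n_cols / b_cols)


shape = (500, 500)
for block_size in [(100, 100), (250, 250), (500, 500)]:
    grid = block_grid(shape, block_size)
    print(block_size, "->", grid, "grid,", grid[0] * grid[1], "blocks")
```

With 100x100 blocks the array is split into a 5x5 grid, so there are 25 blocks that tasks can work on independently.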

To retrieve the actual contents of `x`, we use `collect`, which synchronizes the data and returns the equivalent NumPy array:

In [None]:
x.collect()
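
The relation between block futures and `collect` can be imitated with the standard library's `concurrent.futures`. This is only an analogy: it says nothing about how PyCOMPSs schedules tasks internally, and the names `make_block` and `collect` below are hypothetical helpers written for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def make_block(i, j, rows=2, cols=2):
    """Build one block; stands in for a task that could run remotely."""
    return [[10 * i + j] * cols for _ in range(rows)]


def collect(blocks):
    """Wait for every block and stitch the blocks into one list of rows."""
    result = []
    for block_row in blocks:
        ready = [future.result() for future in block_row]  # synchronize
        for r in range(len(ready[0])):
            # Concatenate row r of every block in this block row.
            result.append([value for block in ready for value in block[r]])
    return result


with ThreadPoolExecutor() as pool:
    # A 2x2 grid of futures, one per block; nothing is gathered yet.
    blocks = [[pool.submit(make_block, i, j) for j in range(2)] for i in range(2)]
    full = collect(blocks)

print(full)
```

Until `collect` is called, the program only holds handles to the blocks, which is why creating the array is cheap and retrieving it is the step that forces synchronization.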

Another way of creating ds-arrays is from array-like structures such as NumPy arrays or lists:

In [None]:
x1 = ds.array([[1, 2, 3], [4, 5, 6]], block_size=(1, 3))
x1

Distributed arrays can also store sparse data in CSR format:

In [None]:
from scipy.sparse import csr_matrix

sp = csr_matrix([[0, 0, 1], [1, 0, 1]])
x_sp = ds.array(sp, block_size=(1, 3))
x_sp

In this case, `collect` returns a CSR matrix as well:

In [None]:
x_sp.collect()
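
For readers unfamiliar with CSR (compressed sparse row), the sketch below builds the three CSR arrays for the same small matrix in plain Python. It only illustrates the storage layout; `to_csr` is a hypothetical helper, not a dislib or SciPy function.

```python
def to_csr(dense):
    """Convert a list of rows into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)   # the nonzero value itself
                indices.append(col)  # the column it lives in
        # indptr[i + 1] marks where row i ends inside data/indices.
        indptr.append(len(data))
    return data, indices, indptr


data, indices, indptr = to_csr([[0, 0, 1], [1, 0, 1]])
print(data, indices, indptr)
```

Only the three nonzero entries are stored, together with their column indices and the row boundaries.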

### Loading data

A typical way of creating ds-arrays is to load data from disk. Dislib currently supports reading data in CSV and SVMLight formats like this:

In [None]:
x, y = ds.load_svmlight_file(
    "./files/libsvm/1",
    block_size=(20, 100),
    n_features=780,
    store_sparse=True,
)
print(x)

In [None]:
csv = ds.load_txt_file("./files/csv/1", block_size=(500, 122))
print(csv)
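
SVMLight files store one sample per line: a label followed by `index:value` pairs for the nonzero features. The sketch below parses a single made-up line (it is not taken from the tutorial's data files) to show the layout. `parse_svmlight_line` is a hypothetical helper that ignores optional fields such as `qid`; it is not how dislib reads these files.

```python
def parse_svmlight_line(line):
    """Split one SVMLight line into (label, {feature_index: value})."""
    line = line.split("#", 1)[0]  # drop an optional trailing comment
    tokens = line.split()
    label = float(tokens[0])
    features = {}
    for token in tokens[1:]:
        index, value = token.split(":")
        features[int(index)] = float(value)  # index kept exactly as written
    return label, features


label, features = parse_svmlight_line("1 3:0.5 10:2.0 780:1.0")
print(label, features)
```

Because a line lists only nonzero features, the total number of columns cannot always be inferred from the file, which is presumably why `n_features` is passed explicitly in the call above.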

### Slicing

As in NumPy, ds-arrays support several types of slicing. Note that slicing a ds-array creates a new ds-array.

In [None]:
x = ds.random_array((50, 50), (10, 10))

Get a single row:

In [None]:
x[4].shape

Get a single element:

In [None]:
x[2, 3].collect()

Get a set of rows or a set of columns:

In [None]:
# Consecutive rows
print(x[10:20])

# Consecutive columns
print(x[:, 10:20])

# Non consecutive rows
print(x[[3, 7, 22]])

# Non consecutive columns
print(x[:, [5, 9, 48]])

Get a rectangular block of elements by slicing rows and columns at once:

In [None]:
x[0:5, 40:45]
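
To build intuition for how a slice relates to the underlying blocks, the sketch below computes which blocks of a 50x50 array with 10x10 blocks overlap the slice `[0:5, 40:45]`. This is plain index arithmetic under the assumption of regular block tiling; `blocks_touched` is a hypothetical helper and does not reflect dislib's actual slicing code.

```python
def blocks_touched(start, stop, block):
    """Indices of the blocks along one axis that overlap [start, stop)."""
    first = start // block
    last = (stop - 1) // block
    return list(range(first, last + 1))


# Slice x[0:5, 40:45] on a 50x50 array with 10x10 blocks.
row_blocks = blocks_touched(0, 5, 10)
col_blocks = blocks_touched(40, 45, 10)
print("block rows:", row_blocks, "block columns:", col_blocks)
```

Here the slice falls entirely inside one block, whereas a slice such as rows 5 to 25 would span three block rows.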

### Other functions

ds-arrays also provide other useful operations, such as `mean` and `transpose`:

In [None]:
x.mean(axis=0).collect()

In [None]:
x.transpose().collect()
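
A reduction such as a column-wise mean can be computed block by block and then combined, which is what makes it suitable for distributed data. The plain-Python sketch below shows the idea with two row blocks; `column_mean` is a hypothetical helper and is not dislib's implementation.

```python
def column_mean(row_blocks):
    """Mean of each column, combining per-block partial sums."""
    totals, count = None, 0
    for block in row_blocks:
        # Partial result for this block: column sums and number of rows.
        sums = [sum(col) for col in zip(*block)]
        totals = sums if totals is None else [t + s for t, s in zip(totals, sums)]
        count += len(block)
    return [t / count for t in totals]


blocks = [
    [[1.0, 2.0], [3.0, 4.0]],  # rows 0-1
    [[5.0, 6.0]],              # row 2 (a smaller final block)
]
means = column_mean(blocks)
print(means)
```

Each block contributes only its column sums and its row count, so the full array never has to be gathered in one place.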

## Machine learning with dislib

Dislib provides an estimator-based API very similar to [scikit-learn](https://scikit-learn.org/stable/). An estimator is anything that learns from data. To illustrate how an estimator works, let's first generate some data:

In [None]:
from sklearn.datasets import make_blobs

x_np, y = make_blobs(n_samples=1500, random_state=170)

`x_np` and `y` are random samples and labels. Samples are vectors and labels are numbers that represent the category of each sample. In this example, we are going to run K-means clustering, which is useful for understanding **unlabeled** data, and thus we will not use `y`.

Since the samples in `x_np` are 2-dimensional, we can plot them and see that there are 3 clusters in our data:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

plt.scatter(x_np[:, 0], x_np[:, 1])

Now, let's see how good K-means is at detecting these three clusters.

To use dislib, we first need to convert `x_np` to a ds-array:

In [None]:
x = ds.array(x_np, block_size=(300, 2))
x

Now, we create a K-means instance. In K-means, we need to define the number of clusters in our data. Since this example is simple, the obvious value here is 3.

In [None]:
from dislib.cluster import KMeans

km = KMeans(n_clusters=3)

Next, we need to fit our estimator with training data (i.e., `x`):

In [None]:
km.fit(x)

The `fit` method does the main part of the computational work. In the case of K-means, fitting means iteratively computing cluster centers from the training data. K-means will find as many cluster centers as specified in the constructor (i.e., 3).

We can check the computed centers like this:

In [None]:
km.centers

After an estimator is fitted, we can compute the labels of unlabeled data. For example, we can compute labels for `x` using the `predict` method:

In [None]:
y_pred = km.predict(x)
y_pred

`y_pred` is a ds-array of predicted labels for `x`. The prediction process depends on the cluster centers computed in the `fit` step.
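
To make the roles of `fit` and `predict` concrete, here is a minimal single-machine sketch of Lloyd's algorithm, the classic K-means iteration, in plain Python. It is not dislib's distributed implementation: the function names are hypothetical, the data is a tiny hand-made set, and the initial centers are chosen by hand instead of at random.

```python
import math


def nearest(point, centers):
    """Index of the center closest to point (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(point, centers[k]))


def fit(points, init_centers, max_iter=10):
    """Lloyd's algorithm: alternate label assignment and center update."""
    centers = [tuple(c) for c in init_centers]
    for _ in range(max_iter):
        labels = [nearest(p, centers) for p in points]
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                # New center is the coordinate-wise mean of its members.
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(old)  # keep a center that lost all points
        if new_centers == centers:
            break  # converged: centers no longer move
        centers = new_centers
    return centers


def predict(points, centers):
    """Label each point with the index of its nearest center."""
    return [nearest(p, centers) for p in points]


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0), (10.0, 1.0)]
centers = fit(points, init_centers=[points[0], points[2], points[4]])
labels = predict(points, centers)
print(centers)
print(labels)
```

As in the tutorial, prediction needs only the centers produced by fitting: each sample receives the index of its closest center.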

Finally, we can plot the samples in `x`, using the predicted labels as colors (we also plot the cluster centers in red).

In [None]:
centers = km.centers

# set the color of each sample to the predicted label
plt.scatter(x_np[:, 0], x_np[:, 1], c=y_pred.collect())

# plot the computed centers in red
plt.scatter(centers[:, 0], centers[:, 1], c='red')

Note that we need to call `y_pred.collect()` to retrieve the actual labels and plot them.

## Close the session

To finish the session, we need to stop PyCOMPSs:

In [None]:
ipycompss.stop()