# dislib tutorial

This tutorial shows the basics of using [dislib](https://dislib.bsc.es).

## Requirements

Apart from dislib, this notebook requires [PyCOMPSs 2.5](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/).


## Setup


First, we need to start an interactive PyCOMPSs session:

In [None]:
import pycompss.interactive as ipycompss
ipycompss.start(graph=True, monitor=1000)

Next, we import dislib and we are all set to start working!

In [None]:
import dislib as ds

## Distributed arrays

The main data structure in dislib is the distributed array (or ds-array). A ds-array is a distributed representation of a 2-dimensional array that can be manipulated like a regular Python object. Usually, rows in the array represent samples, while columns represent features.

To create a random array we can run the following NumPy-like command:

In [None]:
x = ds.random_array(shape=(500, 500), block_size=(100, 100))
print(x.shape)
x

Now `x` is a 500x500 ds-array of random numbers stored in blocks of 100x100 elements. Note that `x` is not held in the memory of the local process. Instead, `random_array` generates the contents of the array in tasks that are usually executed remotely. This makes it possible to create very large arrays.

The content of `x` is a nested list of Futures, one per block, that represent the actual data (wherever it is stored).

To see this, we can access the `_blocks` field of `x`:

In [None]:
x._blocks[0][0]

`block_size` determines how the array is partitioned into blocks, which in turn controls the granularity of dislib algorithms.
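To make the effect of `block_size` concrete, the following plain-Python sketch computes how many blocks a given shape is split into and the shape of each block. The helpers `block_grid` and `block_shapes` are hypothetical names used only for illustration; they are not part of dislib, and the assumption that edge blocks are simply smaller when the shape is not a multiple of the block size is ours.

```python
import math


def block_grid(shape, block_size):
    """Number of blocks along each axis (ceiling division)."""
    return tuple(math.ceil(n / b) for n, b in zip(shape, block_size))


def block_shapes(shape, block_size):
    """Shape of every block as a nested list; edge blocks may be smaller.

    Assumption for illustration: leftover rows/columns form smaller blocks.
    """
    n_rows, n_cols = shape
    b_rows, b_cols = block_size
    return [
        [(min(b_rows, n_rows - i), min(b_cols, n_cols - j))
         for j in range(0, n_cols, b_cols)]
        for i in range(0, n_rows, b_rows)
    ]


print(block_grid((500, 500), (100, 100)))  # (5, 5)
print(block_shapes((5, 3), (2, 3)))        # [[(2, 3)], [(2, 3)], [(1, 3)]]
```

Smaller blocks mean more, finer-grained pieces of work; larger blocks mean fewer, coarser ones.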

To retrieve the actual contents of `x`, we use `collect`, which synchronizes the data and returns the equivalent NumPy array:

In [None]:
x.collect()
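As a rough analogy for what "a grid of Futures that is synchronized on `collect`" means, the sketch below uses the standard library's `concurrent.futures` instead of PyCOMPSs. It is not how dislib or PyCOMPSs are implemented; `make_block` and `collect` here are hypothetical stand-ins that only mimic the pattern of creating blocks in background tasks and stitching them together on demand.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_block(n_rows, n_cols, seed):
    """Generate one block of random numbers (stand-in for a remote task)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_cols)] for _ in range(n_rows)]


def collect(blocks):
    """Wait for every future and stitch the blocks into one nested list."""
    rows = []
    for block_row in blocks:
        resolved = [future.result() for future in block_row]
        for r in range(len(resolved[0])):
            rows.append([value for block in resolved for value in block[r]])
    return rows


with ThreadPoolExecutor() as pool:
    # A 2x2 grid of futures, each representing a 3x3 block.
    blocks = [[pool.submit(make_block, 3, 3, seed=2 * i + j) for j in range(2)]
              for i in range(2)]
    first = blocks[0][0]    # a Future object, not the data itself
    full = collect(blocks)  # a 6x6 nested list with the actual values

print(type(first).__name__)     # Future
print(len(full), len(full[0]))  # 6 6
```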

Another way of creating ds-arrays is using array-like structures like NumPy arrays or lists:

In [None]:
x1 = ds.array([[1, 2, 3], [4, 5, 6]], block_size=(1, 3))
x1

Distributed arrays can also store sparse data in CSR format:

In [None]:
from scipy.sparse import csr_matrix

sp = csr_matrix([[0, 0, 1], [1, 0, 1]])
x_sp = ds.array(sp, block_size=(1, 3))
x_sp

In this case, `collect` returns a CSR matrix as well:

In [None]:
x_sp.collect()
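For readers unfamiliar with CSR (compressed sparse row), the following plain-Python sketch builds the three arrays that the format stores for the matrix used above. The function `dense_to_csr` is a hypothetical helper written for this illustration; in practice `scipy.sparse.csr_matrix` does this work.

```python
def dense_to_csr(rows):
    """Convert a dense nested list into CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)    # the non-zero value
                indices.append(col)   # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr


data, indices, indptr = dense_to_csr([[0, 0, 1], [1, 0, 1]])
print(data)     # [1, 1, 1]
print(indices)  # [2, 0, 2]
print(indptr)   # [0, 1, 3]
```

Row `i` owns the entries `data[indptr[i]:indptr[i + 1]]`, so only the non-zero values are stored.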

### Slicing

Similar to NumPy, ds-arrays support several types of slicing, shown in the cells that follow. Note that slicing a ds-array creates a new ds-array.

In [None]:
x = ds.random_array((50, 50), (10, 10))

Get a single row:

In [None]:
x[4]

Get a single element:

In [None]:
x[2, 3]

Get a set of rows or a set of columns:

In [None]:
# Consecutive rows
print(x[10:20])

# Consecutive columns
print(x[:, 10:20])

# Non-consecutive rows
print(x[[3, 7, 22]])

# Non-consecutive columns
print(x[:, [5, 9, 48]])

Get a rectangular subset of elements:

In [None]:
x[0:5, 40:45]
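Conceptually, indexing a blocked array means translating global indices into block coordinates plus an offset inside the block. The sketch below shows that arithmetic for the 50x50 array with 10x10 blocks used above. `locate` and `blocks_touched` are hypothetical helpers for illustration only; this is not a description of dislib's internal implementation.

```python
def locate(index, block_size):
    """Map a global (row, col) index to (block coordinates, offset in block)."""
    block_row, local_row = divmod(index[0], block_size[0])
    block_col, local_col = divmod(index[1], block_size[1])
    return (block_row, block_col), (local_row, local_col)


def blocks_touched(rows, cols, block_size):
    """Block coordinates overlapped by the half-open ranges `rows` x `cols`."""
    first_r, last_r = rows[0] // block_size[0], (rows[1] - 1) // block_size[0]
    first_c, last_c = cols[0] // block_size[1], (cols[1] - 1) // block_size[1]
    return [(r, c)
            for r in range(first_r, last_r + 1)
            for c in range(first_c, last_c + 1)]


# x[2, 3] lives in block (0, 0) at local position (2, 3)
print(locate((2, 3), (10, 10)))                    # ((0, 0), (2, 3))

# x[0:5, 40:45] only overlaps the block in row 0, column 4
print(blocks_touched((0, 5), (40, 45), (10, 10)))  # [(0, 4)]
```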

### Other functions

ds-arrays also provide other useful operations, such as `mean` and `transpose`:

In [None]:
x.mean(axis=0).collect()

In [None]:
x.transpose().collect()
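A reduction such as `mean(axis=0)` can be computed without assembling the whole array, by accumulating partial sums block by block. The plain-Python sketch below illustrates that idea on a tiny 2x3 matrix split into blocks; `column_means` is a hypothetical helper, and the approach is one possible strategy, not necessarily the one dislib uses.

```python
def column_means(blocks):
    """Column means of a blocked matrix.

    `blocks[i][j]` is the block in block-row i and block-column j,
    stored as a nested list of rows.
    """
    n_rows = sum(len(block_row[0]) for block_row in blocks)
    means = []
    for block_col in range(len(blocks[0])):
        width = len(blocks[0][block_col][0])
        partial = [0.0] * width
        # Accumulate column sums over all blocks in this block-column.
        for block_row in blocks:
            for row in block_row[block_col]:
                for k, value in enumerate(row):
                    partial[k] += value
        means.extend(total / n_rows for total in partial)
    return means


# The matrix [[1, 2, 3], [5, 6, 7]] split into blocks of (at most) 1x2 elements.
blocks = [
    [[[1, 2]], [[3]]],
    [[[5, 6]], [[7]]],
]
print(column_means(blocks))  # [3.0, 4.0, 5.0]
```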

## Machine learning with dislib

Dislib provides an estimator-based API very similar to [scikit-learn](https://scikit-learn.org/stable/). To run an algorithm, we first create an estimator. For example, a K-means estimator:

In [None]:
from dislib.cluster import KMeans

km = KMeans(n_clusters=3)

Now, we create a ds-array with some blob data, and fit the estimator:

In [None]:
from sklearn.datasets import make_blobs

# generate blob data with scikit-learn
x, y = make_blobs(n_samples=1500)

# create a ds-array from the NumPy array
x_ds = ds.array(x, block_size=(500, 2))

km.fit(x_ds)

Finally, we can make predictions on new (or the same) data:

In [None]:
y_pred = km.predict(x_ds)
y_pred

`y_pred` is a ds-array of predicted labels for `x_ds`.
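For intuition about what `fit` and `predict` compute, here is a minimal, single-machine sketch of the standard K-means (Lloyd) iteration in plain Python. It is not dislib's distributed implementation; the functions `assign`, `update`, and `kmeans`, the fixed initial centers, and the stopping rule are all assumptions made for this illustration.

```python
def assign(points, centers):
    """Label each point with the index of its nearest center."""
    labels = []
    for p in points:
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
        labels.append(dists.index(min(dists)))
    return labels


def update(points, labels, centers):
    """Move each center to the mean of its points (unchanged if it has none)."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == k]
        if members:
            new_centers.append(
                tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centers.append(old)
    return new_centers


def kmeans(points, centers, max_iter=10):
    """Alternate assignment and update until the centers stop changing."""
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = update(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign(points, centers)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, labels = kmeans(points, centers=[(0.0, 0.0), (10.0, 10.0)])
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
print(labels)   # [0, 0, 1, 1]
```

In this sketch, `kmeans` plays the role of `fit` (it finds the centers) and `assign` plays the role of `predict` (it labels samples given the centers).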

Let's plot the results:

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt


centers = km.centers

# set the color of each sample to the predicted label
plt.scatter(x[:, 0], x[:, 1], c=y_pred.collect())

# plot the computed centers in red
plt.scatter(centers[:, 0], centers[:, 1], c='red')

Note that we need to call `y_pred.collect()` to retrieve the actual labels and plot them. The rest is the same as if we were using scikit-learn.

To finish the session, we need to stop PyCOMPSs:

In [None]:
ipycompss.stop()