# Machine learning with dislib

This tutorial gives an overview of the algorithms available in [dislib](https://dislib.bsc.es) and applies several of them to the MNIST dataset.

## Requirements

Apart from dislib, this notebook requires [PyCOMPSs 2.8](https://www.bsc.es/research-and-development/software-and-apps/software-list/comp-superscalar/).


## Setup


First, we need to start an interactive PyCOMPSs session:

In [None]:
import pycompss.interactive as ipycompss
ipycompss.start(graph=True, monitor=1000)

Next, we import dislib and we are all set to start working!

In [None]:
import dislib as ds

## Download the MNIST dataset

The datasets (train and test) can be downloaded from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#mnist.

The next cell downloads them automatically:

In [None]:
%%bash
# Check each file separately so that a missing test set is still downloaded.
for name in mnist mnist.t; do
    if [ -f "$name" ]; then
        echo "$name already downloaded."
    else
        echo "Downloading $name... Please wait."
        wget "https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/$name.bz2"
        bzip2 -d "$name.bz2"
    fi
done

## Load the MNIST dataset

In [None]:
import os
x, y = ds.load_svmlight_file(os.getcwd() + '/mnist',
                             block_size=(10000, 784), n_features=784, store_sparse=False)
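
The downloaded files use the sparse LIBSVM/svmlight text format, where each line holds a label followed by `index:value` pairs for the non-zero features. The sketch below is not how dislib parses the file; it is a minimal illustration, with a hypothetical helper `parse_svmlight_line` and a made-up sample line, of what loading with `store_sparse=False` and `n_features=784` conceptually produces. It assumes 1-based feature indices.

```python
def parse_svmlight_line(line, n_features):
    """Parse one 'label index:value ...' line into (label, dense_row)."""
    parts = line.split()
    label = float(parts[0])
    row = [0.0] * n_features
    for item in parts[1:]:
        index, value = item.split(":")
        # Assumption: indices in the file are 1-based.
        row[int(index) - 1] = float(value)
    return label, row


# Illustrative line, not copied from the dataset.
label, row = parse_svmlight_line("5 153:3 154:18 155:18", n_features=784)
non_zero = sum(1 for v in row if v != 0.0)
print(label, len(row), non_zero)  # 5.0 784 3
```

Features that do not appear in a line are zero, which is why the dense row always has `n_features` entries.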

In [None]:
x.shape

In [None]:
y.shape
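
The `block_size` argument controls how the data is partitioned into blocks. As a rough sketch, assuming the standard 60,000-sample MNIST training set and a regular partitioning in which only the last block along each axis may be smaller, the number of blocks can be estimated as follows. The helper `block_grid` is hypothetical and not part of dislib.

```python
import math


def block_grid(shape, block_size):
    """Number of blocks along each axis for a regular partitioning."""
    return tuple(math.ceil(n / b) for n, b in zip(shape, block_size))


print(block_grid((60000, 784), (10000, 784)))  # (6, 1)
print(block_grid((60000, 784), (7000, 784)))   # (9, 1)
```

Smaller blocks generally mean more tasks and more parallelism, at the cost of more scheduling overhead.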

In [None]:
y_array = y.collect()
y_array

In [None]:
img = x[0].collect().reshape(28,28)

In [None]:
import matplotlib.pyplot as plt
plt.imshow(img)

In [None]:
int(y[0].collect())

## dislib algorithms

### Preprocessing

In [None]:
from dislib.preprocessing import StandardScaler
from dislib.decomposition import PCA

### Clustering

In [None]:
from dislib.cluster import KMeans
from dislib.cluster import DBSCAN
from dislib.cluster import GaussianMixture

### Classification

In [None]:
from dislib.classification import CascadeSVM
from dislib.classification import RandomForestClassifier

### Recommendation

In [None]:
from dislib.recommendation import ALS

### Model selection

In [None]:
from dislib.model_selection import GridSearchCV

### Others

In [None]:
from dislib.regression import LinearRegression
from dislib.neighbors import NearestNeighbors

## Examples
### KMeans

In [None]:
kmeans = KMeans(n_clusters=10)
pred_clusters = kmeans.fit_predict(x).collect()

Get the number of images of each class in cluster 0:

In [None]:
from collections import Counter
Counter(y_array[pred_clusters==0])
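
The same count can be extended to every cluster and summarised as a purity score, that is, the fraction of samples that belong to the majority class of their cluster. The following is a minimal sketch on toy data; `cluster_composition` and `purity` are hypothetical helpers, not dislib functions.

```python
from collections import Counter


def cluster_composition(labels, clusters):
    """Map each cluster id to a Counter of the true labels it contains."""
    composition = {}
    for label, cluster in zip(labels, clusters):
        composition.setdefault(cluster, Counter())[label] += 1
    return composition


def purity(labels, clusters):
    """Fraction of samples that match the majority label of their cluster."""
    composition = cluster_composition(labels, clusters)
    majority = sum(c.most_common(1)[0][1] for c in composition.values())
    return majority / len(labels)


toy_labels = [0, 0, 1, 1, 1, 2, 2, 2]
toy_clusters = [0, 0, 0, 1, 1, 2, 2, 1]
print(cluster_composition(toy_labels, toy_clusters)[0])  # Counter({0: 2, 1: 1})
print(purity(toy_labels, toy_clusters))  # 0.75
```

With the notebook's data, flattened versions of `y_array` and `pred_clusters` could be passed in place of the toy lists.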

### GaussianMixture

Fit a GaussianMixture to the coordinates of the painted pixels (intensity above 10) of a single image:

In [None]:
import numpy as np
img_filtered_pixels = np.stack([np.array([i, j]) for i in range(28) for j in range(28) if img[i,j] > 10])
img_pixels = ds.array(img_filtered_pixels, block_size=(50,2))
gm = GaussianMixture(n_components=7, random_state=0)
gm.fit(img_pixels)

Get the parameters that define the Gaussian components:

In [None]:
from pycompss.api.api import compss_wait_on
means = compss_wait_on(gm.means_)
covariances = compss_wait_on(gm.covariances_)
weights = compss_wait_on(gm.weights_)

Use the Gaussian mixture model to sample random pixels replicating the original distribution:

In [None]:
samples = np.concatenate([np.random.multivariate_normal(means[i], covariances[i], int(weights[i]*1000)) for i in range(7)])
plt.scatter(samples[:,1], samples[:,0])
plt.gca().set_aspect('equal', adjustable='box')
plt.gca().invert_yaxis()
plt.draw()
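
Note that `int(weights[i]*1000)` truncates, so the total number of samples can be slightly below 1000. One alternative, sketched here under the assumption that the means, covariances and weights are plain NumPy-compatible arrays, is to draw the per-component counts from a multinomial distribution. The function `sample_mixture` and the toy parameters are illustrative only.

```python
import numpy as np


def sample_mixture(means, covariances, weights, n_samples, seed=0):
    """Draw exactly n_samples points from a Gaussian mixture."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    # Number of points per component, proportional to its weight.
    counts = rng.multinomial(n_samples, weights / weights.sum())
    parts = [rng.multivariate_normal(m, c, size=k)
             for m, c, k in zip(means, covariances, counts)]
    return np.concatenate(parts)


# Toy two-component mixture in 2D.
toy_means = [[0.0, 0.0], [10.0, 10.0]]
toy_covs = [np.eye(2), np.eye(2)]
toy_samples = sample_mixture(toy_means, toy_covs, [0.25, 0.75], 1000)
print(toy_samples.shape)  # (1000, 2)
```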

### PCA

In [None]:
pca = PCA()
pca.fit(x)

Calculate the fraction of the total variance explained by the first 10 eigenvectors:

In [None]:
explained_variance = pca.explained_variance_.collect()
sum(explained_variance[0:10])/sum(explained_variance)
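
The same ratio can be turned around to ask how many components are needed to reach a target fraction of the variance. This is a small sketch with a hypothetical helper and toy eigenvalues; it assumes the explained variances are sorted in descending order.

```python
def components_for_variance(explained_variance, threshold):
    """Smallest k whose first k components explain at least `threshold`."""
    total = sum(explained_variance)
    cumulative = 0.0
    for k, value in enumerate(explained_variance, start=1):
        cumulative += value
        if cumulative / total >= threshold:
            return k
    return len(explained_variance)


toy_variance = [4.0, 3.0, 2.0, 1.0]
print(components_for_variance(toy_variance, 0.5))  # 2
print(components_for_variance(toy_variance, 0.8))  # 3
```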

Show the absolute weights of the first eigenvector as a 28x28 image:

In [None]:
plt.imshow(np.abs(pca.components_.collect()[0]).reshape(28,28))

### RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=5, max_depth=3)
rf.fit(x, y)

Use the test dataset to get an accuracy score:

In [None]:
x_test, y_test = ds.load_svmlight_file(os.getcwd() + '/mnist.t',
                                       block_size=(10000, 784), n_features=784, store_sparse=False)
score = rf.score(x_test, y_test)
print(compss_wait_on(score))
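
Assuming the score is the usual classification accuracy, it corresponds to the fraction of test samples whose predicted label matches the true label. A minimal sketch on toy labels, with a hypothetical `accuracy` helper:

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


print(accuracy([3, 1, 4, 1, 5], [3, 1, 4, 2, 5]))  # 0.8
```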

### GridSearchCV

Run a grid search with 5-fold cross-validation over 4 parameter combinations:

In [None]:
parameters = {'n_estimators': (5, 10),
              'max_depth': range(3, 5)}
gs = GridSearchCV(rf, parameters, cv=5)
gs.fit(x,y)
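
To see where the 4 combinations come from, the grid can be enumerated by hand as the Cartesian product of the parameter values. This sketch only illustrates the counting and does not reproduce how `GridSearchCV` is implemented; the fit count assumes one fit per combination and fold plus one final refit of the best estimator.

```python
from itertools import product

parameters = {'n_estimators': (5, 10),
              'max_depth': range(3, 5)}

names = list(parameters)
# One dict per combination of parameter values.
grid = [dict(zip(names, values))
        for values in product(*(parameters[name] for name in names))]

cv = 5
n_fits = len(grid) * cv + 1  # cross-validation fits plus the final refit

for combination in grid:
    print(combination)
print(len(grid), n_fits)  # 4 21
```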

Explore the results:

In [None]:
import pandas as pd
pd_df = pd.DataFrame.from_dict(gs.cv_results_)
print(pd_df[['params', 'mean_test_score']])

The estimator with the best results is refitted on the whole training dataset:

In [None]:
gs.best_estimator_

Get the accuracy score of the best estimator:

In [None]:
score = gs.best_estimator_.score(x_test, y_test)
print(compss_wait_on(score))

## Close the session

To finish the session, we need to stop PyCOMPSs:

In [None]:
ipycompss.stop()