# Activities

Please forgive the multiple meanings of the word "cluster" ...

* We will generate some (X, Y) point data and send it to an HPC cluster to train a clustering model (KMeans);
* We will fetch the cluster centers of the remotely trained model and send them back to the notebook to initialize a new model;
* We will use the locally initialized model in the notebook to predict data clusters.

# Preparation

## On this machine ...

For this notebook to be able to generate and plot data, make sure that all of the following packages are installed first (preferably in a virtual environment, see README):
```
pip install numpy
pip install pandas
pip install matplotlib
pip install scikit-learn
```
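
Before running the remaining cells, it can help to confirm that the packages above are importable. The following is a minimal sketch that uses only the standard library; the helper name `missing_packages` is made up for this example. Note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util


def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names of the packages this notebook needs (scikit-learn is imported as sklearn)
required = ["numpy", "pandas", "matplotlib", "sklearn"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

The same check can be run inside the activated virtual environment on the cluster, with the list reduced to the packages installed there.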

## On the cluster ...

Worker machines on a cluster often do not have full network access, so it's important to install the Python packages we need up front into a virtual environment:

```
module load python/3.7
virtualenv --no-download ~/virtualenv_jobservant_kmeans
source ~/virtualenv_jobservant_kmeans/bin/activate
pip install numpy
pip install scikit-learn
```

We will store the location of this virtual environment (`~/virtualenv_jobservant_kmeans`) in a variable below.

# Imports

The last import is for some helper routines in a separate file (to keep this demo short and focused on jobservant).

In [None]:
import os
import sys

# Make jobservant importable: add the parent of the current working directory to the module search path
parent_dir = os.path.dirname(os.path.realpath('.'))
sys.path.append(parent_dir)

# Import jobservant
from jobservant.cluster_account import ClusterAccount
from jobservant.jupyter.job_presenter import JobPresenter

from jobservant_kmeans_demo_helper import get_points_and_labels, plot_clusters, numpy_to_csv, output_to_numpy

# Create dataset
Points are sampled from three Gaussian distributions; the labels indicate which distribution each point was taken from.

This is basically a wrapper around `make_blobs` from `scikit-learn`.

In [None]:
xy_points, labels = get_points_and_labels(initialize_seed=True)
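
The helper itself lives in the separate helper file and is not shown here. As a rough idea of what a wrapper around `make_blobs` could look like, here is a minimal sketch; the function name, the sample count, the cluster spread, and the `seed` parameter are assumptions for illustration and are not taken from the actual helper.

```python
from sklearn.datasets import make_blobs


def sketch_points_and_labels(n_samples=300, seed=None):
    """Sample 2-D points from three isotropic Gaussian blobs.

    Hypothetical stand-in for get_points_and_labels; all parameter values are assumed.
    """
    xy_points, labels = make_blobs(
        n_samples=n_samples,  # total number of points, split evenly across the blobs
        n_features=2,         # (X, Y) coordinates
        centers=3,            # three Gaussian distributions with random centers
        cluster_std=1.0,      # standard deviation of each blob
        random_state=seed,    # pass an integer for reproducible data
    )
    return xy_points, labels


xy_demo, labels_demo = sketch_points_and_labels(seed=0)
print(xy_demo.shape)
print(sorted(set(labels_demo.tolist())))
```

Fixing the seed gives the same points on every run, which makes the plots comparable between sessions.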

# Visualize dataset

In [None]:
print('First 10 xy_points: \n', xy_points[:10])
plot_clusters('Unlabeled clusters', xy_points)
print('First 10 labels: \n', labels[:10])
plot_clusters('Labeled clusters', xy_points, labels)

# Python script and job script for cluster detection
If you wanted to run the clustering algorithm locally, you would just do:
```
from sklearn.cluster import KMeans
kmeans_labels = KMeans(n_clusters=3, n_init=1000).fit(xy_points).labels_
```

In [None]:
python_virtualenv = '~/virtualenv_jobservant_kmeans'


kmeans_py = """
import numpy
from sklearn.cluster import KMeans

# Load data
xy_points = numpy.loadtxt("xy_points.csv", delimiter=",")

# Calculate clusters: look for three clusters, best model out of a thousand attempts
kmeans = KMeans(n_clusters=3, n_init=1000)
kmeans.fit(xy_points)

# Print one cluster center per line as "x,y" so that the job output can be parsed as CSV
for x_y in kmeans.cluster_centers_:
    print('%f,%f' % (x_y[0], x_y[1]))
"""


job_script = """
module load python/3.7 > /dev/null 2>&1
source {virtualenv}/bin/activate

python kmeans.py
""".format(virtualenv=python_virtualenv)

# Send data and scripts to the HPC cluster for cluster detection

In [None]:
# Setup cluster account (note: use keyword 'username' to override default)
cluster_name = os.environ.get('JOBSERVANT_CLUSTER') or 'cluster.name.here.please'
cluster_account = ClusterAccount(server=cluster_name, log_level='info')

# Create (but don't submit!) job
accounting_group = os.environ.get('JOBSERVANT_ACCOUNTING_GROUP') or 'accounting-group-here-please'
job_params = {
    'account': accounting_group,
    'time': '00:03:00',
}
job = cluster_account.create_job(text=job_script, **job_params)

# Create the data in the work directory on the remote server 
job.create_remote_file('xy_points.csv', numpy_to_csv(xy_points))

# Create the python script in the work directory on the remote server 
job.create_remote_file('kmeans.py', kmeans_py)
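
The helpers `numpy_to_csv` and `output_to_numpy` are defined in the helper file and are not shown here. Purely as an illustration of the round trip (array to CSV text on the way out, printed `x,y` lines back to an array on the way in), here is a minimal sketch. The `sketch_` names are hypothetical, this is not the helpers' actual implementation, and it assumes that the job output arrives as plain text.

```python
import io

import numpy


def sketch_numpy_to_csv(array):
    """Serialize a 2-D array as CSV text, one row per line (assumed format)."""
    return "\n".join(",".join("%f" % value for value in row) for row in array) + "\n"


def sketch_output_to_numpy(text):
    """Parse lines of comma-separated numbers back into a 2-D array."""
    return numpy.loadtxt(io.StringIO(text), delimiter=",", ndmin=2)


points = numpy.array([[1.5, -2.0], [0.25, 3.0]])
csv_text = sketch_numpy_to_csv(points)
restored = sketch_output_to_numpy(csv_text)
print(csv_text)
print(numpy.allclose(points, restored))
```

The `%f` format keeps six decimal places, so values are only preserved to that precision.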


# Submit the job and monitor its progress

In [None]:
job.submit()

job_presenter = JobPresenter(job)

# Note: final output will not be printed
job_presenter.progress()

# Get output and visualize results

In [None]:
from sklearn.cluster import KMeans

# Create an unfitted local model and set its cluster centers from the remote job's output
kmeans = KMeans(n_clusters=3)
kmeans.cluster_centers_ = output_to_numpy(job.fetch_output())

kmeans_labels = kmeans.predict(xy_points)
print('First 10 computed labels: ', kmeans_labels[:10])

plot_clusters('Re-plot of original clusters', xy_points, labels)
plot_clusters('Calculated clusters', xy_points, kmeans_labels)
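
The `predict` step of KMeans assigns each point to its nearest cluster center. Setting `cluster_centers_` on an estimator that was never fitted depends on scikit-learn internals and may not behave the same way in every release, so a version-independent alternative is to do the assignment directly with NumPy. The sketch below assumes the centers form an array with one `(x, y)` row per cluster; `nearest_center_labels` is a name invented for this example.

```python
import numpy


def nearest_center_labels(points, centers):
    """Label each point with the index of its nearest center (Euclidean distance)."""
    points = numpy.asarray(points, dtype=float)
    centers = numpy.asarray(centers, dtype=float)
    # Pairwise distances, shape (n_points, n_centers), computed via broadcasting
    distances = numpy.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return distances.argmin(axis=1)


demo_centers = [[0.0, 0.0], [10.0, 10.0]]
demo_points = [[1.0, 1.0], [9.0, 9.0], [0.0, -1.0]]
print(nearest_center_labels(demo_points, demo_centers))
```

In the notebook, this could be called with `xy_points` and the centers parsed from the job output in place of `kmeans.predict(xy_points)`.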

# Generate some new data and run the model on it to label clusters

In [None]:
xy_points2, labels2 = get_points_and_labels()
kmeans_labels2 = kmeans.predict(xy_points2)

plot_clusters('New clusters', xy_points2, labels2)
plot_clusters('New predicted clusters', xy_points2, kmeans_labels2)
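
KMeans numbers its clusters arbitrarily, so a predicted label of 0 need not correspond to an original label of 0 even when the grouping is identical; the plots may show the same groups in different colors. The sketch below is one possible way to compare two labelings numerically: it tries every relabeling of the predicted clusters and reports the best fraction of matching points. The helper name `best_label_agreement` is hypothetical, and the approach assumes a small number of clusters with integer labels starting at 0.

```python
from itertools import permutations

import numpy


def best_label_agreement(true_labels, predicted_labels, n_clusters=3):
    """Return the highest fraction of matching labels over all cluster relabelings."""
    true_labels = numpy.asarray(true_labels)
    predicted_labels = numpy.asarray(predicted_labels)
    best = 0.0
    for mapping in permutations(range(n_clusters)):
        # mapping[i] is the label that predicted cluster i is renamed to
        remapped = numpy.asarray(mapping)[predicted_labels]
        best = max(best, float(numpy.mean(remapped == true_labels)))
    return best


# Same grouping, different numbering: agreement is perfect after relabeling
print(best_label_agreement([0, 0, 1, 1, 2, 2], [2, 2, 0, 0, 1, 1]))
```

For example, `best_label_agreement(labels2, kmeans_labels2)` would give the fraction of new points whose predicted cluster matches the distribution they were sampled from, assuming both label arrays use the integers 0 to 2.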