# Preparation

## On this machine ...

For this notebook to generate and plot data, first install the required packages (preferably in a virtual environment):
```
pip install numpy
pip install pandas
pip install matplotlib
pip install scikit-learn
```
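
To confirm that the installation worked, you can ask Python whether each package is importable. The sketch below uses only the standard library. The helper name `find_missing_packages` is made up for this example, and note that scikit-learn is imported under the name `sklearn`.

```python
from importlib.util import find_spec


def find_missing_packages(import_names):
    """Return the import names that cannot be found in the current environment."""
    return [name for name in import_names if find_spec(name) is None]


# Import names used by this notebook (scikit-learn is imported as "sklearn").
required = ["numpy", "pandas", "matplotlib", "sklearn"]
missing = find_missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are available.")
```

Run it from inside the activated virtual environment so that it checks the same interpreter the notebook will use.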

## On the cluster ...

Worker machines on a cluster often lack full network access, so it's important to install the Python packages we need up front into a virtual environment:

```
module load python/3.7
virtualenv --no-download ~/virtualenv_jobservant_kmeans
source ~/virtualenv_jobservant_kmeans/bin/activate
pip install numpy
pip install scikit-learn
```

We will store the location of this virtual environment (`~/virtualenv_jobservant_kmeans`) in a variable below.

# Imports

The last import brings in some helper routines from a separate file (to keep this demo short and focused on jobservant).

In [None]:
from importlib import reload
import os
import sys

# Find jobservant: add the parent of the current working directory to the import path
parent_dir = os.path.dirname(os.path.realpath('.'))
sys.path.append(parent_dir)

# Load / reload jobservant
import jobservant
import jobservant.cluster_account
import jobservant.jupyter.job_display
reload(jobservant)
reload(jobservant.cluster_account)
reload(jobservant.jupyter.job_display)

from jobservant.cluster_account import ClusterAccount
from jobservant.jupyter.job_display import JobDisplay

import jobservant_kmeans_demo_helper
reload(jobservant_kmeans_demo_helper)
from jobservant_kmeans_demo_helper import get_points_and_labels, plot_clusters, numpy_to_csv, output_to_numpy

# Create dataset
Points are sampled from three Gaussian distributions. Each label indicates which distribution the point was drawn from.
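
The helper `get_points_and_labels` lives in the separate helper file and is not shown here. As a rough illustration of the idea only, the sketch below samples 2-D points around three centers with NumPy. The function name `sample_gaussian_clusters`, the centers, the standard deviation, and the point counts are all assumptions made for this example, not the helper's actual values.

```python
import numpy


def sample_gaussian_clusters(centers, points_per_cluster=100, std=0.5, seed=0):
    """Sample 2-D points around each center; the label is the index of the center."""
    rng = numpy.random.default_rng(seed)
    points = []
    labels = []
    for label, center in enumerate(centers):
        # Isotropic Gaussian around this center (hypothetical spread).
        points.append(rng.normal(loc=center, scale=std, size=(points_per_cluster, 2)))
        labels.append(numpy.full(points_per_cluster, label))
    return numpy.concatenate(points), numpy.concatenate(labels)


# Three hypothetical cluster centers.
demo_points, demo_labels = sample_gaussian_clusters([(0, 0), (4, 0), (2, 3)])
print(demo_points.shape, demo_labels.shape)
```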

In [None]:
xy_points, labels = get_points_and_labels()

# Visualize dataset

In [None]:
print('First 10 xy_points: ', xy_points[:10])
print('First 10 labels: ', labels[:10])
plot_clusters('Labeled clusters', xy_points, labels)

# Python script and job script for cluster detection
To run the clustering algorithm locally, you would only need:
```
from sklearn.cluster import KMeans
kmeans_labels = KMeans(n_clusters=3, n_init=1000).fit(xy_points).labels_
```

In [None]:
python_virtualenv = '~/virtualenv_jobservant_kmeans'

kmeans_py = """
import numpy
from sklearn.cluster import KMeans

# Load data (loadtxt opens and closes the file itself)
xy_points = numpy.loadtxt("xy_points.csv", delimiter=",")

# Compute cluster labels
kmeans_labels = KMeans(n_clusters=3, n_init=1000).fit(xy_points).labels_

# Print output
for row in kmeans_labels:
    print(row)
"""

job_script = """
module load python/3.7 > /dev/null 2>&1
source %s/bin/activate

python kmeans.py
""" % (python_virtualenv)

# Send data and scripts to the HPC cluster for cluster detection
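
The cell below uploads the points as `xy_points.csv` using the `numpy_to_csv` helper, and `kmeans.py` reads that file back with `numpy.loadtxt(..., delimiter=",")`. The helper itself is not shown in this notebook. A minimal sketch of what such a conversion might look like, under the assumption that it returns the CSV as a string, is given here. The name `array_to_csv_text` is hypothetical.

```python
import io

import numpy


def array_to_csv_text(array):
    """Serialize a 2-D array to comma-separated text, one row per line."""
    buffer = io.StringIO()
    numpy.savetxt(buffer, array, delimiter=",")
    return buffer.getvalue()


# Round trip: the text can be read back the same way kmeans.py reads its input.
csv_text = array_to_csv_text(numpy.array([[1.0, 2.0], [3.5, -4.25]]))
round_trip = numpy.loadtxt(io.StringIO(csv_text), delimiter=",")
print(round_trip)
```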

In [None]:
cluster_name = os.environ.get('JOBSERVANT_CLUSTER') or 'cluster.name.here.please'
accounting_group = os.environ.get('JOBSERVANT_ACCOUNTING_GROUP') or 'accounting-group-here-please'

job_params = {
    'account': accounting_group,
    'time': '00:03:00',
}

cluster_account = ClusterAccount(server=cluster_name, log_level='info')
job = cluster_account.create_job(text=job_script, **job_params)

job.create_remote_file('xy_points.csv', numpy_to_csv(xy_points))
job.create_remote_file('kmeans.py', kmeans_py)


# Submit the job and monitor its progress

In [None]:
job.submit()

job_display = JobDisplay(job)
job_display.progress(suppress_output=True)

# Get output and visualize results
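
Since `kmeans.py` prints one label per line, the job output has to be turned back into an array before plotting. The `output_to_numpy` helper does this in the cell below. Its implementation is not shown, so the following is only a sketch that assumes the output arrives as a single string. The name `labels_from_output` is hypothetical.

```python
import numpy


def labels_from_output(output_text):
    """Parse job output with one integer label per line into a 1-D array."""
    lines = [line.strip() for line in output_text.splitlines()]
    # Skip blank lines, which may appear at the end of the output.
    return numpy.array([int(line) for line in lines if line], dtype=int)


parsed = labels_from_output("0\n2\n1\n\n2\n")
print(parsed)
```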

In [None]:
out = job.output()
kmeans_labels = output_to_numpy(out)
print('First 10 computed labels: ', kmeans_labels[:10])

plot_clusters('Calculated clusters', xy_points, kmeans_labels)