# TPOT on Dask on CDSW Workers

## Setup

First we install dependencies.

In [None]:
!pip3 install --upgrade \
    dask[complete]==2021.2.0 \
    dask-ml==1.8.0 \
    numpy==1.19.5 \
    TPOT==0.11.7 \
    scikit-learn==0.24.1

Then we import dependencies.

In [None]:
import os
import time

import cdsw
from dask.distributed import Client
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

Finally, we create two directories that Dask needs. Dask uses them to share network information between the scheduler and the workers. As a user, you only need to create them once; you never have to touch them again.

In [None]:
os.makedirs("_scheduler_", exist_ok=True)
os.makedirs("_worker_", exist_ok=True)

## Start Dask scheduler

We start a Dask scheduler as a CDSW worker process. The scheduler is responsible for coordinating work between the workers. Later we'll start a client in this notebook. The client talks to the scheduler, and the scheduler talks to the workers.

In [None]:
dask_scheduler = cdsw.launch_workers(
  n=1,
  cpu=1,
  memory=2,
  kernel="python3",
  code="!dask-scheduler --host 0.0.0.0 --dashboard-address 127.0.0.1:8090 --scheduler-file /home/cdsw/_scheduler_/dask.log"
)

# Wait for the scheduler to start.
time.sleep(10)
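
A fixed `time.sleep(10)` can be too short on a busy cluster. As a minimal sketch of an alternative, we can poll for the file named by `--scheduler-file` instead. This assumes the project directory is visible from this session and that the file is JSON containing an `"address"` entry, which is how Dask scheduler files are normally laid out. The helper name `wait_for_scheduler_address` is hypothetical, not part of Dask or `cdsw`.

```python
import json
import os
import time


def wait_for_scheduler_address(path, timeout=60.0, poll_interval=0.5):
    """Poll until the scheduler file exists and holds an address, then return it.

    Assumption: the file is JSON with an "address" key.
    """
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):
            try:
                with open(path) as f:
                    info = json.load(f)
                if isinstance(info, dict) and "address" in info:
                    return info["address"]
            except (json.JSONDecodeError, OSError):
                pass  # The file may be only partially written; retry.
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no scheduler address in {path!r} after {timeout}s")
        time.sleep(poll_interval)
```

Calling `wait_for_scheduler_address("/home/cdsw/_scheduler_/dask.log")` in place of the sleep would return as soon as the scheduler has written its file, and would fail loudly if it never does.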

We need the IP address of the CDSW worker that hosts the scheduler, so that we can connect the Dask workers to it. The IP is not returned in the `dask_scheduler` object (it is not yet known when the scheduler is launched), so we scan the worker list for the worker whose `id` matches the scheduler's. The list comprehension returns a list, but it should contain exactly one entry.

In [None]:
scheduler_workers = cdsw.list_workers()
scheduler_id = dask_scheduler[0]['id']
scheduler_ip = [worker['ip_address'] for worker in scheduler_workers
                if worker['id'] == scheduler_id][0]

scheduler_url = f"tcp://{scheduler_ip}:8786"

scheduler_url
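
If the scheduler worker has not been assigned an IP yet, the `[0]` above raises a bare `IndexError`. The following is a sketch of the same lookup with an explicit check. The helper name `find_worker_ip` and the sample records are hypothetical; only the `id` and `ip_address` keys come from the cell above.

```python
def find_worker_ip(workers, worker_id):
    """Return the IP of the single worker whose id matches worker_id."""
    matches = [
        worker["ip_address"]
        for worker in workers
        if worker["id"] == worker_id and worker.get("ip_address")
    ]
    if len(matches) != 1:
        raise LookupError(
            f"expected exactly one IP for worker {worker_id!r}, found {len(matches)}"
        )
    return matches[0]


# Made-up records shaped like the entries used above.
sample_workers = [
    {"id": "abc123", "ip_address": "10.0.0.5"},
    {"id": "def456", "ip_address": "10.0.0.6"},
]
print(f"tcp://{find_worker_ip(sample_workers, 'abc123')}:8786")
```

In the notebook this would be called as `find_worker_ip(cdsw.list_workers(), dask_scheduler[0]['id'])`.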

## Start Dask workers

Start some CDSW workers, each running one Dask worker process. We pass each worker the scheduler URL we just found so that it can register with the scheduler and receive work.

In [None]:
dask_workers = cdsw.launch_workers(
  n=10,
  cpu=1,
  memory=2,
  kernel="python3",
  code=f"!dask-worker {scheduler_url} --local-directory /home/cdsw/_worker_"
)

# Wait for the workers to start.
time.sleep(10)

## Connect Dask client

Start a local client and connect it to our scheduler. This is how we'll talk to the Dask cluster.

In [None]:
client = Client(scheduler_url)
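
Once the client is connected, we can check that the workers have actually registered rather than trusting the fixed sleep. The sketch below polls `client.scheduler_info()["workers"]` until the expected number of workers appears. The helper name `wait_for_worker_count` is hypothetical; `dask.distributed` also offers `Client.wait_for_workers`, which serves the same purpose.

```python
import time


def wait_for_worker_count(client, n_workers, timeout=120.0, poll_interval=1.0):
    """Block until at least n_workers are registered with the scheduler."""
    deadline = time.monotonic() + timeout
    while True:
        # scheduler_info() returns a dict whose "workers" entry maps
        # worker addresses to their details.
        connected = len(client.scheduler_info()["workers"])
        if connected >= n_workers:
            return connected
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"only {connected} of {n_workers} workers connected after {timeout}s"
            )
        time.sleep(poll_interval)
```

For the cluster above, `wait_for_worker_count(client, n_workers=10)` would return once all ten workers have joined.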

We can view some stats about the Dask cluster.

In [None]:
client

Construct the URL of the Dask dashboard, which is served from the CDSW worker that runs the scheduler.

In [None]:
print(dask_scheduler[0]['app_url'] + 'status')
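
Appending `'status'` directly only works if `app_url` ends with a slash. A small sketch that tolerates either form is shown here; the helper name `dashboard_status_url` and the example URL are hypothetical.

```python
def dashboard_status_url(app_url):
    """Join app_url and the dashboard's 'status' page with exactly one slash."""
    return app_url.rstrip("/") + "/status"


# Both spellings of a made-up base URL give the same result.
print(dashboard_status_url("http://example.invalid/app/"))
print(dashboard_status_url("http://example.invalid/app"))
```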

## Load data

We load scikit-learn's digits dataset. The point here is setting up the pipeline search, so the choice of data isn't important.

In [None]:
digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25)

## Define estimator (using Dask!)

We define a TPOT classifier. TPOT is rather sophisticated, and will search over many possible pipelines of sklearn preprocessors and estimators. All we have to do to use the Dask cluster is pass the `use_dask=True` flag, and TPOT will connect via the client we defined. We do not need to pass the client explicitly, and in fact cannot.

In [None]:
estimator = TPOTClassifier(generations=5, population_size=20, use_dask=True, verbosity=2, n_jobs=-1)

## Fit estimator (using Dask workers!)

Fit the `TPOTClassifier`. TPOT evaluates `population_size` candidate pipelines, collects the results, and uses them to choose the next set of candidates (it's an evolutionary algorithm). It repeats this for `generations` rounds. Each pipeline is scored with k-fold cross-validation (5-fold unless `cv` is set). This is a lot of compute (a proper search can take hours or days), so we have restricted the search to a mere 5 generations, each with a population of 20. We can stop the process at any point, and TPOT will output the best-performing pipeline found up to that point.

In [None]:
estimator.fit(X_train, y_train)
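
To get a feel for how much work the fit distributes, here is a back-of-the-envelope sketch. It relies on the TPOT documentation's statement that a run evaluates `population_size + generations * offspring_size` pipelines in total, with `offspring_size` defaulting to `population_size`. The helper name `tpot_evaluation_budget` is hypothetical, and the result is an upper bound, since TPOT skips pipelines it has already evaluated.

```python
def tpot_evaluation_budget(generations, population_size, offspring_size=None, cv=5):
    """Return (pipelines evaluated, model fits) as an upper bound for a TPOT run."""
    if offspring_size is None:
        # Assumption: offspring_size defaults to population_size, as documented.
        offspring_size = population_size
    pipelines = population_size + generations * offspring_size
    # Each pipeline is fitted once per cross-validation fold.
    return pipelines, pipelines * cv


pipelines, fits = tpot_evaluation_budget(generations=5, population_size=20)
print(pipelines, fits)  # 120 600
```

With ten single-CPU workers, those fits are spread across the cluster instead of running one after another in the notebook session.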

We can now use this object exactly like a scikit-learn estimator.

In [None]:
estimator.predict(X_train)

In [None]:
estimator.score(X_test, y_test)

Exporting the estimator generates a short template Python script that builds the selected pipeline from its raw sklearn components.

In [None]:
estimator.export("tpot_estimator.py")

## Close workers

Stop the Dask workers. We stop only the workers that we started, not every worker on the cluster, since others may be using those.

In [None]:
cdsw.stop_workers(*[worker['id'] for worker in dask_workers])

Stop scheduler.

In [None]:
cdsw.stop_workers(*[worker['id'] for worker in dask_scheduler])
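
If a cell fails partway through the notebook, the cleanup cells above never run and the workers keep their resources. One way to guard against that, sketched under the assumption that the launch and stop functions behave like `cdsw.launch_workers` and `cdsw.stop_workers` as used in this notebook, is a small context manager. The name `launched_workers` is hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def launched_workers(launch, stop, **launch_kwargs):
    """Launch workers, yield them, and always stop them afterwards."""
    workers = launch(**launch_kwargs)
    try:
        yield workers
    finally:
        # Stop only the workers this block started, even if the body raised.
        stop(*[worker["id"] for worker in workers])
```

In the notebook, `launch` and `stop` would be `cdsw.launch_workers` and `cdsw.stop_workers`, with the same keyword arguments used in the launch cells, and the fitting code would go inside the `with` block.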