# Machine Learning Pipeline with KubeDirector - Lab 2
## Deploy a local Jupyter Notebook cluster to interact with a tenant-shared training cluster

### **Lab workflow**

In this lab:

1. As a tenant user, you will first create a local (lightweight) Jupyter Notebook application KubeDirector cluster to develop your model. You will attach your Jupyter Notebook cluster to a remote tenant-shared training cluster to train your model. The shared training cluster **training-engine-shared** includes the open source ML toolkits, libraries and frameworks for developing and training models. It has already been deployed by the tenant administrator for your tenant. The shared training cluster lets you train your model faster, using more compute and memory resources than your local lightweight Jupyter Notebook cluster.

2. You will then access your local Jupyter Notebook web UI via the gateway network service port and train your model on the remote tenant-shared training cluster.

**Recommended blog reading:**

* [Building Dynamic Machine Learning Pipelines with KubeDirector](https://developer.hpe.com/blog/building-dynamic-machine-learning-pipelines-with-kubedirector)

**Definitions:**

- *KubeDirector:* also known as Kubernetes Director. KubeDirector is an **open-source** project initiated and led by HPE that addresses stateful scaleout application deployment in standard Kubernetes clusters with a focus on non-cloud native stateful analytics workloads (AI/ML, data processing and Big Data apps). These applications are generally referred to as a distributed, single-node or multi-node application **virtual cluster** where each application virtual cluster node runs as **a container** in the Kubernetes cluster.

- *Training:* Input datasets are processed to create a Machine Learning model. Data scientists can use a local Jupyter Notebook to build and train their models. They can also interact with a remote, larger capacity training cluster to train their models faster.

- *Cloud native application:* Also known as the [12-Factor app](https://www.mirantis.com/blog/how-do-you-build-12-factor-apps-using-kubernetes/), a modern application that leverages a microservices architecture with loosely coupled services. The microservice architectural style is an approach to developing a single application as a suite of small independently deployed services.

- *Non-cloud native application:* A multi-tier application with tightly coupled and interdependent services.

- *Stateless application:* An application that requires no persistence of data or application state.

- *Stateful application:* An application that typically requires certain mountpoints to persist across rescheduling, restarts, upgrades and rollbacks of the application cluster nodes. A stateful application may also need a persistent network identity (i.e., hostname).

### **1- Initialize the environment**

Let's first define the environment variables needed to execute this part of the lab.

In [None]:
#
# environment variables
#
username="student{{ STDID }}" 
password="{{ PASSSTU }}"
studentId="student{{ STDID }}" # your Jupyter Notebook student Identifier (i.e.: student<xx>)

#
gateway_host="{{ HPEECPGWNAME }}"
Internet_access="{{ JPHOSTEXT }}"

kc_secret="kc-secret-students.yaml" #the kubeconfig secret object.
JupyterNotebookApp="cr-cluster-jupyter-notebook.yaml" # the Jupyter Notebook KD App manifest you will deploy to build your model
DeploymentEngineApp="cr-cluster-endpoint-wrapper.yaml" # the Deployment engine KD App manifest you will deploy to query your model for answers

echo "Your studentId is: "$studentId
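
Before moving on, it can help to confirm that every variable is set and that no `{{ ... }}` template placeholder was left unrendered. The sketch below is one way to do that. `check_vars` is a hypothetical helper written for illustration; it is not part of the lab tooling.

```shell
# check_vars NAME...: hypothetical helper that reports variables which are
# empty or still contain an unrendered "{{ ... }}" template placeholder.
# Returns 0 when every named variable looks usable, 1 otherwise.
check_vars() {
  cv_status=0
  for cv_name in "$@"; do
    eval "cv_value=\${$cv_name:-}"      # indirect lookup of the variable's value
    case "$cv_value" in
      "")     echo "missing: $cv_name";    cv_status=1 ;;
      *"{{"*) echo "unrendered: $cv_name"; cv_status=1 ;;
    esac
  done
  return "$cv_status"
}

# Usage in the notebook:
# check_vars username password studentId gateway_host Internet_access
```

The helper prints one line per problem variable, so an empty output means the environment is ready for the next steps.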

### **2- List the registered KubeDirector applications**
You can get the list of KubeDirector applications (kdapp) registered with the Kubernetes cluster for your tenant using the `kubectl get kdapp` command. A KubeDirector application (kdapp) is a _template or a blueprint_ for the application. It describes an application's **metadata** (service roles, Docker images, configuration packages, service ports, persistent storage). A KubeDirector cluster (kdcluster) is a running instance of a KubeDirector application.

In this workshop, you will be using the KubeDirector application _jupyter-notebook_ to create your local Jupyter Notebook cluster.

In [None]:
kubectl get kdapp

List the instance of the training engine kdapp already deployed for your tenant by the tenant administrator:

In [None]:
kubectl get kdcluster training-engine-shared

### **3- Deploying your local Jupyter Notebook cluster with _Connection_ to a remote tenant-shared training cluster**

You will deploy an instance of the **jupyter-notebook** kdapp by creating a KubeDirector virtual cluster (kdcluster). The kdcluster identifies the desired kdapp and specifies runtime configuration parameters, such as the size and resource requirements of the virtual cluster.

> **Note:** _The Jupyter Notebook kdapp includes the open source machine learning toolkits, software libraries and frameworks for developing and training models such as TensorFlow, scikit-learn, keras, XGBoost, matplotlib, Jupyter Notebook, Numpy, Scipy, Pandas, etc._

#### Create the manifest file and deploy an instance of the Jupyter Notebook KubeDirector application:
Like any other containerized application deployment on Kubernetes, the `kubectl apply -f ManifestAppFile` command is used to deploy the kdcluster. The application manifest is a YAML file that describes the attributes of the KubeDirector virtual cluster.

> **Note:** _One of the most interesting parts of the kdcluster specification is the **Connections** stanza (a related group of attributes), which identifies other resources of interest to that kdcluster. Here, you simply connect your local Jupyter Notebook cluster to the tenant-shared training cluster **training-engine-shared** already deployed by the tenant administrator.
> You also attach a Kubernetes Secret that contains the **base64-encoded form of the user-specific kubeconfig file**. This allows the user to execute kubectl commands from within the Notebook._

* So, let's first create the base64-encoded form of the user-specific kubeconfig file.

In [None]:
# Create the base64-encoded form of the user-specific kubeconfig file.
# base64 wraps long output across lines, so strip newlines, carriage returns and spaces.
kcConfig=$(base64 < ~/.kube/config | tr -d '\r\n ')
#
cat > $kc_secret << EOF
apiVersion: v1
kind: Secret
metadata:
  name: kc-secret-$studentId
  labels:
    kubedirector.hpe.com/secretType: kubeconfig
    kubedirector.hpe.com/username: $studentId
data:
  config: $kcConfig
EOF

#cat $kc_secret
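
The `data.config` value in the Secret is only useful if it decodes back to the original kubeconfig. The sketch below checks that round trip locally. `kubeconfig_roundtrip` is a hypothetical helper, and it assumes a `base64` command that accepts `-d` for decoding (true of GNU coreutils and BusyBox).

```shell
# kubeconfig_roundtrip FILE: hypothetical helper. Base64-encodes FILE with all
# whitespace stripped, decodes it again, and succeeds only if the decoded
# bytes are identical to FILE.
kubeconfig_roundtrip() {
  kr_encoded=$(base64 < "$1" | tr -d '\r\n ')
  printf '%s' "$kr_encoded" | base64 -d | cmp -s - "$1"
}

# Usage in the notebook:
# kubeconfig_roundtrip ~/.kube/config && echo "kubeconfig encodes cleanly"
```

Stripping whitespace is safe here because newlines and spaces are not part of the base64 alphabet; they only appear as line wrapping.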

In [None]:
kubectl apply -f $kc_secret

* And now deploy the local Jupyter Notebook KubeDirector application cluster: 

In [None]:
cat $JupyterNotebookApp
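
In the manifest, the part worth a closer look is the Connections stanza described in the note earlier. The sketch below pulls just that block out of a manifest file. `show_connections` is a hypothetical helper, and it assumes the stanza appears as a `connections:` key with its entries indented beneath it in block-style YAML; the manifest shipped with the lab remains the authority on the actual field names.

```shell
# show_connections FILE: hypothetical helper that prints the "connections:"
# line of a YAML manifest plus every following line indented deeper than it.
show_connections() {
  awk '
    /^ *connections:/ {
      match($0, /^ */); indent = RLENGTH   # remember how far the key is indented
      show = 1; print; next
    }
    show {
      match($0, /^ */)
      if (RLENGTH > indent) print          # still inside the stanza
      else show = 0                        # back at the parent level: stop
    }
  ' "$1"
}

# Usage in the notebook:
# show_connections "$JupyterNotebookApp"
```

For this lab, the output should mention both the **training-engine-shared** cluster and the kubeconfig Secret created above.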

In [None]:
kubectl apply -f $JupyterNotebookApp

After a few seconds, you should get the response message: *kubedirectorcluster/Your-instance-name created*.  

### **4- Inspect the deployed KubeDirector Application instance**
Your application will be represented in the Kubernetes cluster by a custom resource of type **KubeDirectorCluster (kdcluster)**, with the name that was indicated inside the YAML file used to create it. Use the command `kubectl get kdcluster YourClustername` to list your kdcluster.

In [None]:
clusterName="jupyter-notebook-${studentId}"
kubectl get kdcluster $clusterName

After creating the instance of the KubeDirector application, you can use the `kubectl describe kdcluster` command below to observe its status and any events logged against it.

The virtual cluster status indicates its overall "state" (top-level property of the status object). It should have a value of **"configured"**. 

> **Note:** _The first time a virtual cluster of a given KubeDirector application type is created, it may take several minutes to reach its **"configured"** state, as the relevant Docker image must be downloaded and imported._ 

**>Run the `kubectl describe` command below and scroll down to the `Events` section to check the overall state of your kdcluster.**

**>Regularly repeat (every minute or so) the command below until the kdcluster is in the state "_configured_".**

In [None]:
kubectl describe kdcluster $clusterName
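
Instead of re-running the command by hand, the check can be wrapped in a polling loop. The sketch below reads the top-level `state` property of the status object with kubectl's `-o jsonpath` output option. `wait_for_configured` is a hypothetical helper, and the default of 20 attempts, 30 seconds apart, is an arbitrary assumption.

```shell
# wait_for_configured NAME [TRIES] [DELAY]: hypothetical helper that polls the
# kdcluster's status.state until it reads "configured".
# Returns 0 on success, 1 if the state was not reached within TRIES polls.
wait_for_configured() {
  wfc_name=$1
  wfc_tries=${2:-20}      # number of polls before giving up (assumed default)
  wfc_delay=${3:-30}      # seconds between polls (assumed default)
  wfc_i=1
  while [ "$wfc_i" -le "$wfc_tries" ]; do
    wfc_state=$(kubectl get kdcluster "$wfc_name" \
      -o jsonpath='{.status.state}' 2>/dev/null) || wfc_state=""
    echo "attempt $wfc_i: state=${wfc_state:-unknown}"
    if [ "$wfc_state" = "configured" ]; then
      return 0
    fi
    sleep "$wfc_delay"
    wfc_i=$((wfc_i + 1))
  done
  return 1
}

# Usage in the notebook:
# wait_for_configured "$clusterName"
```

If the loop gives up, `kubectl describe kdcluster $clusterName` and its `Events` section are still the place to look for the reason.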

You can use a form of the `kubectl get pod,service,statefulset` command that matches against a value of the **kubedirector.hpe.com/kdcluster=YourClusterApplicationName** label to observe the standard Kubernetes resources that compose the application virtual cluster:

In [None]:
kubectl get pod,service,statefulset -l kubedirector.hpe.com/kdcluster=$clusterName

Your instance of the KubeDirector Application virtual cluster is made up of a **StatefulSet**, a **POD** (a cluster node) and a **NodePort Service** per service role member (Controller), and a **headless service** for the application cluster.   

* The ClusterIP service is the headless service that a Kubernetes StatefulSet requires. It maintains a stable POD network identity (i.e., the hostname of each POD persists when the POD is rescheduled).
* The NodePort service exposes the Notebook application service with token-based authorization outside the Kubernetes cluster. 
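
To tell the two kinds of service apart at a glance, the listing can be reduced to name, type and cluster IP. A headless service is a ClusterIP service whose cluster IP is `None`. The sketch below labels each row accordingly; `classify_services` is a hypothetical helper, and the column layout it expects is the one produced by the usage line in the comment.

```shell
# classify_services: hypothetical helper. Reads "NAME TYPE CLUSTER-IP" rows on
# stdin and labels a ClusterIP service with no cluster IP as headless.
classify_services() {
  while read -r cs_name cs_type cs_ip; do
    [ -n "$cs_name" ] || continue          # skip blank rows
    if [ "$cs_type" = "ClusterIP" ] && [ "$cs_ip" = "None" ]; then
      echo "$cs_name headless"
    else
      echo "$cs_name $cs_type"
    fi
  done
}

# Usage in the notebook:
# kubectl get service -l kubedirector.hpe.com/kdcluster=$clusterName --no-headers \
#   -o custom-columns=NAME:.metadata.name,TYPE:.spec.type,IP:.spec.clusterIP \
#   | classify_services
```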

### **5- Get your local Jupyter Notebook's service endpoint to connect to it**
To get a report on all services related to a specific virtual KubeDirector cluster, you can use a form of **kubectl describe** that matches against a value of the **kubedirector.hpe.com/kdcluster=YourClusterApplicationName** label.

In [None]:
#
# Getting the service endpoint URL:
#
JupyterAppURL=$(kubectl describe service -l  kubedirector.hpe.com/kdcluster=${clusterName} | grep gateway/8000 | awk '{print $2}')
JupyterAppPort=$(echo $JupyterAppURL | cut -d':' -f 2) # extract the gateway re-mapped port value.
myJupyterApp_endpoint="https://$gateway_host:$JupyterAppPort"
echo "Your application service endpoint re-mapped port is: "$JupyterAppPort
#echo "Your Intranet application service endpoint is: "$myJupyterApp_endpoint
echo "Your Jupyter Notebook service endpoint URL is: https://"$Internet_access:$JupyterAppPort
echo "Your local Jupyter Notebook web UI login credentials are: $username / $password"
#
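
The port extraction above relies on the service description containing a line whose key ends in `gateway/8000` and whose value has the form `host:port`. The sketch below applies the same parsing to text on standard input and rejects anything that is not a single numeric port. `gateway_port` is a hypothetical helper, and the exact layout of the description line is an assumption inferred from the pipeline above.

```shell
# gateway_port: hypothetical helper. Reads `kubectl describe service` text on
# stdin, takes the second field of the line mentioning gateway/8000, and
# prints the part after the colon. Fails if that is not a single number.
gateway_port() {
  gp_port=$(awk '/gateway\/8000/ { print $2 }' | cut -d':' -f 2)
  case "$gp_port" in
    ''|*[!0-9]*) echo "no numeric gateway port found" >&2; return 1 ;;
    *)           echo "$gp_port" ;;
  esac
}

# Usage in the notebook:
# JupyterAppPort=$(kubectl describe service -l kubedirector.hpe.com/kdcluster=${clusterName} | gateway_port)
```

Failing loudly when no port is found avoids printing an endpoint URL with an empty port, which can happen if the cell is run before the kdcluster reaches the "configured" state.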

### **6- Download the Python code files to your PC/laptop**

Download the files below to your local PC/laptop. You will use them in the next part of the lab, from the local Jupyter Notebook cluster you have just created.

- ***3-WKSHP-K8s-ML-Pipeline-Model-Training.ipynb***
- ***XGB_Scoring.py***
- ***XGB_Scoringv2.py***
- ***ML-Workflow.jpg***

From the left side panel of your JupyterHub account, navigate to the **Code/NYCTaxi** folder: double-click the **Code** folder, then the **NYCTaxi** folder. Right-click each file (or select all the files, then right-click) and choose **Download**.

You can click the ellipsis **(...)** to go back to the root of your repository in JupyterHub.

### **7- Connect to your local Jupyter Notebook web UI and upload code files**

Click the **_service endpoint URL_** from Step 5 above to connect to your Jupyter Notebook sandbox. This opens a Jupyter Notebook login screen in a new browser tab. **Use the login credentials from step 5 above to authenticate.**

* Log in and follow the instructions below to upload the code files from your local PC/laptop to your local Jupyter Notebook server.

> <font color="red"> **Note:** On a Windows PC, ***Firefox*** is the recommended browser to connect to your local Jupyter Notebook UI. With **Chrome**, you may see the message _"Server Not Running"_, in which case just click the **Restart** button.</font>

> <font color="red"> **Note:** If you see a security warning about the certificate while connecting to the local Jupyter Notebook web UI, accept the risk and continue.</font>

* Click the **Upload Files** button at the top left of your local Jupyter Notebook server.

<img src="Pictures/Jupyter-Notebook-Upload.png" height="800" width="800">

* Select the 4 files you previously downloaded (**3-WKSHP-K8s-ML-Pipeline-Model-Training.ipynb**, **XGB_Scoring.py**, **XGB_Scoringv2.py** and **ML-Workflow.jpg**) and upload them.

### <font color="red">Now, from your local Jupyter Notebook, open the notebook **3-WKSHP-K8s-ML-Pipeline-Model-Training.ipynb** and follow the instructions from the notebook to build, train and test the model.</font>

Once your model is trained and saved to a file, follow the instructions in Lab 4 to deploy your trained model:

* [Lab 4 Model Registry and Deployment](4-WKSHP-K8s-ML-Pipeline-Register-Model-Deployment.ipynb)

## Summary

In this lab, we have shown you how to deploy a local Jupyter Notebook virtual cluster and attach it to a shared distributed training cluster using the _**Connections**_ stanza in a KubeDirector cluster manifest YAML file. This local Jupyter Notebook will be used in the next lab to train the model on the tenant-shared training cluster.