# Creating an Apache Spark cluster using pykube-ng

The purpose of this notebook is to introduce the use of Apache Spark in a single-node Kubernetes cluster by deploying an Apache Spark master and workers.

> Due to connection problems, this issue was divided into two notebooks, each of which has to be run in its own pod.

> This notebook must be run using the Jupyter notebook server inside the pod generated by the issue5 "notebook" deployment.

> For these notebooks to work, your Kubernetes cluster must meet the following requirement:
- ports 8889, 4040, 4041, 7077, 8080 and 7078 must be free
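
A quick way to check this requirement is to probe each port before creating anything. The sketch below is only an illustration: `port_is_free` and `busy_ports` are hypothetical helper names, and it assumes that it runs on the machine hosting the cluster and that the ports would be bound on `127.0.0.1`.

```python
import socket

# Ports listed in the requirement above
REQUIRED_PORTS = [8889, 4040, 4041, 7077, 8080, 7078]


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(0.5)
        # connect_ex returns 0 only when a listener accepted the connection
        return sock.connect_ex((host, port)) != 0


def busy_ports(ports, host="127.0.0.1"):
    """Return the subset of ports that already have a listener."""
    return [port for port in ports if not port_is_free(port, host)]


busy = busy_ports(REQUIRED_PORTS)
print("Ports already in use:", busy if busy else "none")
```

An empty result means that no local process is listening on the required ports. It does not prove that Kubernetes will be able to bind them.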

## 1 Creating the master and workers

As in previous issues, we will use pykube-ng and .yaml manifests to create the necessary deployments. The .yaml files are in the ["issue8"](http://localhost:8888/tree/issue8) folder, located in the home directory of this Jupyter server.
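
The `constroy` and `destroy` helpers imported from `issue8.kube` are not shown in this notebook. As a rough idea of what such helpers can look like with pykube-ng, the sketch below maps a manifest's `kind` to the matching pykube class and calls `create()` or `delete()` on it. The function names `create_from_spec` and `delete_from_spec` are hypothetical, and only the three kinds used in this notebook are handled.

```python
def _resource_for(api, spec):
    """Wrap a parsed manifest (a dict) in the pykube class matching its kind."""
    import pykube  # pykube-ng client library, imported on first use

    kinds = {
        "Deployment": pykube.Deployment,
        "Service": pykube.Service,
        "Pod": pykube.Pod,
    }
    try:
        resource_class = kinds[spec["kind"]]
    except KeyError:
        raise ValueError(f"unsupported manifest kind: {spec.get('kind')!r}") from None
    return resource_class(api, spec)


def create_from_spec(api, spec):
    """Create the object described by the manifest in the cluster."""
    _resource_for(api, spec).create()


def delete_from_spec(api, spec):
    """Delete the object described by the manifest from the cluster."""
    _resource_for(api, spec).delete()
```

Dispatching on `kind` keeps the calling code identical for deployments, services and pods, which is how the cells in this notebook use their helpers.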

Run the cells below to create the deployments and services:

In [1]:
import pykube
import yaml
from issue8.kube import *

api = pykube.HTTPClient(pykube.KubeConfig.from_file("k3s.yaml"))

In [2]:
def load_manifests(*paths):
    """Parse each YAML manifest file into a Python dict."""
    specs = []
    for path in paths:
        # the with-block closes the file even if parsing fails
        with open(path, "r") as manifest_file:
            specs.append(yaml.load(manifest_file, Loader=yaml.FullLoader))
    return specs

# load the master manifests
master_specs = load_manifests(
    "issue8/spark-master-deployment.yaml",
    "issue8/spark-master-service.yaml",
)
# load the worker manifests
worker_specs = load_manifests(
    "issue8/spark-worker-deployment.yaml",
    "issue8/spark-worker-service.yaml",
)
# load the entrypoint Jupyter manifests
notebook_specs = load_manifests(
    "issue8/pyspark-notebook-pod.yaml",
    "issue8/pyspark-notebook-service.yaml",
)

> These files create three pods and configure them as follows:
- A [bde2020/spark-master](https://hub.docker.com/r/bde2020/spark-master/) pod;
  - Expose the pod at ports 8080 and 7077 (web UI and master ports, respectively)
  - Create the PYSPARK_PYTHON environment variable to set python3 as the default Spark Python
- A [bde2020/spark-worker](https://hub.docker.com/r/bde2020/spark-worker/) pod;
  - Expose the pod at port 7078
  - Create the PYSPARK_PYTHON environment variable to set python3 as the default Spark Python
  - Create the SPARK_MASTER environment variable to define the Spark master address to connect to
  - Create the MASTER_HOST environment variable to define the Spark master address inside Kubernetes
- A [jupyter/pyspark-notebook:ubuntu-18.04](https://hub.docker.com/r/jupyter/pyspark-notebook/) pod;
  - Expose the pod at ports 8889, 4040 and 4041 (Jupyter, web UI and communication, respectively)
  - Configure the pod to start the notebook with no authentication on port 8889
  - Create the PYSPARK_PYTHON environment variable to set python3 as the default Spark Python
  - Configure the pod workdir to "/home/jovyan/work"
  - Mount the same volume used in this notebook inside the pyspark-notebook pod workdir
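
For orientation, a worker Deployment manifest matching the description above could look roughly like the Python dict below, which is the form `yaml.load` returns and pykube-ng accepts. This is a sketch, not the contents of `issue8/spark-worker-deployment.yaml`: the labels, the `spark-master` host name and the environment variable values are assumptions.

```python
import json

# Hypothetical values: the real manifest in issue8/ may use different names.
MASTER_HOST = "spark-master"  # assumed name of the master Service
MASTER_PORT = 7077            # master port listed above

worker_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "spark-worker", "namespace": "default"},
    "spec": {
        "replicas": 1,
        # the selector must match the labels of the pod template
        "selector": {"matchLabels": {"app": "spark-worker"}},
        "template": {
            "metadata": {"labels": {"app": "spark-worker"}},
            "spec": {
                "containers": [
                    {
                        "name": "spark-worker",
                        "image": "bde2020/spark-worker",
                        "ports": [{"containerPort": 7078}],
                        "env": [
                            {"name": "PYSPARK_PYTHON", "value": "python3"},
                            {
                                "name": "SPARK_MASTER",
                                "value": f"spark://{MASTER_HOST}:{MASTER_PORT}",
                            },
                            {"name": "MASTER_HOST", "value": MASTER_HOST},
                        ],
                    }
                ]
            },
        },
    },
}

print(json.dumps(worker_deployment, indent=2))
```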

In [3]:
import time

# create the manifest objects (constroy comes from issue8.kube)
for spec in master_specs + worker_specs + notebook_specs:
    constroy(api, spec)

# wait until every pod reports ready
print("The cluster is starting ...")
while True:
    time.sleep(1)
    if all(pod.ready for pod in pykube.Pod.objects(api)):
        break
print("The cluster is ready")

The cluster is starting ...
The cluster is ready
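
The loop above waits forever if a pod never becomes ready, for example when an image cannot be pulled. One possible variation, sketched here with a hypothetical `wait_until` helper, polls a condition and gives up after a timeout. The 300 second default is an arbitrary choice.

```python
import time


def wait_until(condition, timeout=300.0, interval=1.0):
    """Poll condition() until it returns True or timeout seconds elapse.

    Returns True if the condition was met, False if the timeout expired.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Usage with the objects from this notebook (needs a running cluster):
# ready = wait_until(lambda: all(pod.ready for pod in pykube.Pod.objects(api)))
# print("The cluster is ready" if ready else "Timed out waiting for the pods")
```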


## 2 Testing the cluster

Once our cluster is ready to use, go to the [entrypoint notebook](http://localhost:8889/notebooks/Issue8(pyspark-test).ipynb) and run a simple task to test the cluster.

> Run the [entrypoint notebook](http://localhost:8889/notebooks/Issue8(pyspark-test).ipynb) code twice, once before and once after scaling the cluster to 2 workers, to see the difference.

In [17]:
scale_cluster(api, "spark-worker", 2, "default")

# wait until every pod, including the new worker, reports ready
print("The cluster is scaling ...")
while True:
    time.sleep(1)
    if all(pod.ready for pod in pykube.Pod.objects(api)):
        break
print("The cluster is ready")

default spark-worker replicas 2
The cluster is starting ...
The cluster is ready
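
`scale_cluster` comes from `issue8.kube` and its source is not shown here. With pykube-ng, a helper of this kind can be sketched as below: look up the Deployment, change `replicas` and push the update. The name `scale_deployment` is hypothetical, and the printed line only imitates the output shown above.

```python
def scale_deployment(api, name, replicas, namespace="default"):
    """Set the replica count of a Deployment and send the change to the API server."""
    import pykube  # pykube-ng client library, imported on first use

    deployment = (
        pykube.Deployment.objects(api)
        .filter(namespace=namespace)
        .get(name=name)
    )
    deployment.replicas = replicas  # edits spec.replicas on the local copy
    deployment.update()             # sends the change to the cluster
    print(namespace, name, "replicas", deployment.replicas)


# Usage (needs a running cluster and the api client created earlier):
# scale_deployment(api, "spark-worker", 2, "default")
```

The call returns as soon as the API server accepts the change, so a readiness wait like the one in the cell above is still needed before using the new worker.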


## 3 Destroying the cluster

In [16]:
for spec in worker_specs:
    destroy(api,spec)
for spec in master_specs:
    destroy(api,spec)
for spec in notebook_specs:
    destroy(api,spec)