# Google Cloud Platform Project Creation Workbook 
 
Use this workbook to create a google cloud project with everything needed to create a Kubernetes Engine Cluster and deploy a spark image to be used to create a cluster mode SparkSession. 
 
Prerequisites:  
+ Create Google user account  <br><br>
+ Create your own personal Google Cloud Project and Enable Billing
    - Enable Free Tier account by seleting "Try it Free" here: [Try Google Cloud Platform for free](https://cloud.google.com/cloud-console)
    - Follow steps to activate billing found here: [Create New Billing Account](https://cloud.google.com/billing/docs/how-to/manage-billing-account#create_a_new_billing_account)
        - Billing account is required for APIs used in this project
        - You will not exceed the $300 free trial setting up this project but make sure to delete the project if you do not want to be charged
        - Take note of project name created because this billing account will be used with the new project <br><br>
+ Install and initialize Google Cloud SDK by following instructions found here: [Cloud SDK Quickstart](https://cloud.google.com/sdk/docs/quickstart) <br><br>

## Step 1 - Check Prequisites Successfully Completed
Check that you have successfully installed and enabled Cloud SDK by running the config list command. If you get an error please refer to Troubleshooting steps found here [Cloud SDK Quickstart](https://cloud.google.com/sdk/docs/quickstart).  
You should see an output that includes your account along with any other configuration setup when using gcloud init

In [1]:
!gcloud config list

[accessibility]
screen_reader = False
[compute]
region = us-central1
[core]
account = cwilbar@alumni.nd.edu
disable_usage_reporting = False
project = spark-on-kubernetes-testing

Your active configuration is: [default]


In [None]:
#!gcloud auth login

## Step 2 - Create GCP Project

In [None]:
###### TO DO: Enter name for new project
###### Note: Proect name must be unique across GCP. If you get error when creating project please change the project name here and try again.

new_project_id = 'spark-on-kubernetes-demo'

In [None]:
!gcloud projects create {new_project_id}

In [None]:
!gcloud config set project {new_project_id}

#### IMPORTANT
*****TO DO: Navigate to [Cloud Console](https://console.cloud.google.com/), Change to new project, and enable billing following instructions found here: [Enable Billing](https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project)***


## Step 3 - Enable Necessary Cloud Services

This project uses:
+ Google Kubernetes Engine for a kubernetes cluster manager
+ Google Container Registry to store spark Docker container images
  
List below contains all services needed at time of creation of this workbook. Please add/remove from this list if the names/necessary services have changed.

In [5]:
enable_services_list = [
    'bigquery.googleapis.com',
    'bigquerystorage.googleapis.com',
    'cloudapis.googleapis.com',
    'cloudbuild.googleapis.com',
    'clouddebugger.googleapis.com',
    'cloudtrace.googleapis.com',
    'compute.googleapis.com',
    'container.googleapis.com',
    'containeranalysis.googleapis.com',
    'containerregistry.googleapis.com',
    'iam.googleapis.com ',
    'iamcredentials.googleapis.com ',
    'oslogin.googleapis.com',
    'servicemanagement.googleapis.com',
    'serviceusage.googleapis.com',
    'sql-component.googleapis.com',
    'storage-api.googleapis.com',
    'storage-component.googleapis.com',
    'storage.googleapis.com'    
]

In [6]:
## Services can only be enabled 20 at a time at the time of workbook creation. Use this loop to enable 20 at a time.
for x in range(0,len(enable_services_list),20):
    !gcloud services enable {' '.join(enable_services_list[x:(x+20)])} --project={new_project_id}   

Operation "operations/acf.p2-601703040934-6457616c-5804-4e6e-9c54-28937d0a7e85" finished successfully.


In [7]:
# Check that services were enabled
!gcloud services list --project=simple-webapp-dev

NAME                              TITLE
automl.googleapis.com             Cloud AutoML API
bigquery.googleapis.com           BigQuery API
bigquerystorage.googleapis.com    BigQuery Storage API
cloudapis.googleapis.com          Google Cloud APIs
clouddebugger.googleapis.com      Cloud Debugger API
cloudtrace.googleapis.com         Cloud Trace API
containerregistry.googleapis.com  Container Registry API
datastore.googleapis.com          Cloud Datastore API
language.googleapis.com           Cloud Natural Language API
logging.googleapis.com            Cloud Logging API
monitoring.googleapis.com         Cloud Monitoring API
pubsub.googleapis.com             Cloud Pub/Sub API
run.googleapis.com                Cloud Run Admin API
servicemanagement.googleapis.com  Service Management API
serviceusage.googleapis.com       Service Usage API
sql-component.googleapis.com      Cloud SQL
storage-api.googleapis.com        Google Cloud Storage JSON API
storage-component.googleapis.com  Cloud Storage
st

## Step 4 - Create Necessary Service Accounts

There are two primary service accounts used in this project:  
- **Deployment Service Account**
    - We create this and add necessary roles below using the Cloud SDK
    - deployer-sa@your_project_name.iam.gserviceaccount.com
    - This account is used to deploy and test docker container and kubernetes cluster<br><br>
- **BigQuery Service Account**
    - We create this and add necessary roles below using the Cloud SDK
    - bigquery-sa@your_project_name.iam.gserviceaccount.com
    - This account is used in the container for access to big query

Check what service ccounts are already created (should be the two default ones described above)

In [8]:
!gcloud iam service-accounts list --project={new_project_id}

DISPLAY NAME                            EMAIL                                               DISABLED
Compute Engine default service account  601703040934-compute@developer.gserviceaccount.com  False


In [9]:
!gcloud iam service-accounts create deployer-sa \
    --display-name="Deployment Service Account" \
    --description="Account used to deploy to Google Cloud Project" \
    --project={new_project_id}

Created service account [deployer-sa].


In [10]:
!gcloud iam service-accounts create bigquery-sa \
    --display-name="BigQuery Service Account" \
    --description="Account used by Spark Containers to Connect to BigQuery" \
    --project={new_project_id}

Created service account [bigquery-sa].


Check service accounts were created successfully

In [11]:
!gcloud iam service-accounts list --project={new_project_id}

DISPLAY NAME                            EMAIL                                                         DISABLED
Compute Engine default service account  601703040934-compute@developer.gserviceaccount.com            False
BigQuery Service Account                bigquery-sa@spark-on-kubernetes-demo.iam.gserviceaccount.com  False
Deployment Service Account              deployer-sa@spark-on-kubernetes-demo.iam.gserviceaccount.com  False


Programatically update the roles for the new service accounts using the guide found here: [Programatic Change Access](https://cloud.google.com/iam/docs/granting-changing-revoking-access#programmatic)

In [12]:
# Save policy file in directory above where the repo is saved so that it is not stored to github
file_directory = '..\..\policy.json'

In [13]:
# Write current policy to file directory
!gcloud projects get-iam-policy {new_project_id} --format json > {file_directory}

**If running jupyter notebook run below cell to load and modify policy file.**

In [14]:
import json

with open('..\..\policy.json') as f:
    policy = json.load(f)

def modify_policy_add_role(policy, role, member):
    """Adds a new role binding to a policy."""

    binding = {"members": [member],"role": role }
    policy["bindings"].append(binding)
    return policy

members = [f'serviceAccount:deployer-sa@{new_project_id}.iam.gserviceaccount.com', 
           f'serviceAccount:bigquery-sa@{new_project_id}.iam.gserviceaccount.com']
roles = {
        members[0]:['roles/editor','roles/container.admin'],
        members[1]:['roles/bigquery.dataEditor','roles/run.serviceAgent', 'roles/bigquery.user',
                    'roles/storage.admin']}

for member in members:
    for role in roles[member]:
        policy = modify_policy_add_role(policy, role, member)

with open('..\..\policy.json', 'w') as json_file:
    json.dump(policy, json_file)

In [15]:
!gcloud projects set-iam-policy {new_project_id} {file_directory}

bindings:
- members:
  - serviceAccount:bigquery-sa@spark-on-kubernetes-demo.iam.gserviceaccount.com
  role: roles/bigquery.dataEditorUpdated IAM policy for project [spark-on-kubernetes-demo].

- members:
  - serviceAccount:bigquery-sa@spark-on-kubernetes-demo.iam.gserviceaccount.com
  role: roles/bigquery.user
- members:
  - serviceAccount:601703040934@cloudbuild.gserviceaccount.com
  role: roles/cloudbuild.builds.builder
- members:
  - serviceAccount:service-601703040934@gcp-sa-cloudbuild.iam.gserviceaccount.com
  role: roles/cloudbuild.serviceAgent
- members:
  - serviceAccount:service-601703040934@compute-system.iam.gserviceaccount.com
  role: roles/compute.serviceAgent
- members:
  - serviceAccount:deployer-sa@spark-on-kubernetes-demo.iam.gserviceaccount.com
  role: roles/container.admin
- members:
  - serviceAccount:service-601703040934@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent
- members:
  - serviceAccount:service-601703040934@container-

In [16]:
# Remove policy file 
!del {file_directory}

## Step 5 - Create Kubernetes Engine Cluster

In order to deploy a container to kubernetes to run an application you first need to create a kubernetes engine cluster

In [17]:
## TO DO: Change region  to your default region
COMPUTE_REGION = 'us-central1'
CLUSTER_NAME = 'spark-cluster'
# COMPUTE_ZONE = 'us-central1-c'

In [None]:
#!gcloud compute regions list

In [18]:
!gcloud config set compute/region {COMPUTE_REGION}

Updated property [compute/region].


In [None]:
# !gcloud config set compute/zone {COMPUTE_ZONE}

In [19]:
# Create cluster with default settings. This may take serveral minutes
!gcloud container clusters create-auto {CLUSTER_NAME} \
    --project={new_project_id}

NAME           LOCATION     MASTER_VERSION   MASTER_IP       MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
spark-cluster  us-central1  1.19.9-gke.1400  35.224.220.234  e2-medium     1.19.9-gke.1400  3          RUNNING
Creating cluster spark-cluster in us-central1...
..........................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

In [20]:
# Get credentials to use when deploying to cluster
!gcloud container clusters get-credentials {CLUSTER_NAME}

Fetching cluster endpoint and auth data.
kubeconfig entry generated for spark-cluster.


In [21]:
!kubectl cluster-info

Kubernetes master is running at https://35.224.220.234
GLBCDefaultBackend is running at https://35.224.220.234/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy
KubeDNS is running at https://35.224.220.234/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
KubeDNSUpstream is running at https://35.224.220.234/api/v1/namespaces/kube-system/services/kube-dns-upstream:dns/proxy
Metrics-server is running at https://35.224.220.234/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


In [22]:
account = f'bigquery-sa@{new_project_id}.iam.gserviceaccount.com' 

In [23]:
# Download bigquery service account json file
!gcloud iam service-accounts keys create sa.json \
    --iam-account={account}

created key [80b06e0d229e20b45f669f832ff5dd491abdf7dd] of type [json] as [sa.json] for [bigquery-sa@spark-on-kubernetes-demo.iam.gserviceaccount.com]


In [24]:
# Create Kubernetes Secret from file
!kubectl create secret generic bigquery-credentials \
  --from-file ./sa.json

secret/bigquery-credentials created


In [25]:
# Remove service account file from local system now that Kubernetes Secret
!del sa.json

In [26]:
# Create spark service account on Kubernetes
!kubectl create serviceaccount spark

serviceaccount/spark created


In [27]:
# Create role for service account to enable edit
!kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

clusterrolebinding.rbac.authorization.k8s.io/spark-role created


## Step 7 - Build and Push Container to GCR



In [28]:
!docker build -t jupyterlab-pyspark cluster-mode-standalone/jupyterlab

#1 [internal] load build definition from Dockerfile
#1 sha256:f9f399f3d4f6dd7181072e865f382bad519a64a3cb52e8fe789e22c642848889
#1 transferring dockerfile: 32B 0.0s done
#1 DONE 0.3s

#2 [internal] load .dockerignore
#2 sha256:72472a21fa9b0f37e1b807e671642cf3d43abfea18f554e02289d987472c31e2
#2 transferring context: 2B done
#2 DONE 0.3s

#3 [internal] load metadata for docker.io/library/base:latest
#3 sha256:f9acfca7c619f83f3a6772cbdc63caa09becc43fe1028edcee2a2990821935c7
#3 DONE 0.0s

#4 [1/9] FROM docker.io/library/base
#4 sha256:46b25337b6497e2a04bb43db7e2ddfd626590fa67e24801204556a1c358dfb18
#4 DONE 0.0s

#9 [internal] load build context
#9 sha256:a041496ec171bb8076518906f4aa587bd72e2c26fec588600960d56095e00894
#9 transferring context: 79B done
#9 DONE 0.1s

#11 [7/9] RUN pip3 install --upgrade pip &&     pip3 install -r requirements.txt &&     jupyter lab clean
#11 sha256:32d4e225f61b4e1291bd8a6970ee752b756d610a49e3e26a967cf9c9a9fab2d9
#11 CACHED

#10 [6/9] COPY requirements.txt req

In [29]:
!docker build -t spark-base cluster-mode-standalone\spark-base

#1 [internal] load build definition from Dockerfile
#1 sha256:61651b4a19e1fc013b56a6706fce259d726c27281e6fdc914f8d94798e4910fe
#1 transferring dockerfile: 32B done
#1 DONE 0.1s

#2 [internal] load .dockerignore
#2 sha256:c19ceea3664cea9970b438ae3c6e9b6f3b89c5b3928e14b0746c32a5277e93e0
#2 transferring context: 2B done
#2 DONE 0.1s

#3 [internal] load metadata for docker.io/library/base:latest
#3 sha256:f9acfca7c619f83f3a6772cbdc63caa09becc43fe1028edcee2a2990821935c7
#3 DONE 0.0s

#4 [ 1/10] FROM docker.io/library/base
#4 sha256:46b25337b6497e2a04bb43db7e2ddfd626590fa67e24801204556a1c358dfb18
#4 DONE 0.0s

#8 https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.0.jar
#8 sha256:173f6ef3aef50bc006d358109ed87a9ceb38ec7b8570fecb583413853f5055c7
#8 DONE 0.5s

#11 https://storage.googleapis.com/spark-lib/bigquery/spark-bigquery-with-dependencies_2.12-0.20.0.jar
#11 sha256:89b5ffef7e782b56dfd9b6af19658592f873337a5886ddeb5a48695eb67d8fbf
#11 DONE 0.6s

#12 [ 7/10] ADD https://

In [30]:
!docker tag spark-base:latest gcr.io/{new_project_id}/spark-cluster-spark-base:latest

In [31]:
# Push to Google Container Registry - This may take a few minutes.
!docker push gcr.io/{new_project_id}/spark-cluster-spark-base:latest

The push refers to repository [gcr.io/spark-on-kubernetes-demo/spark-cluster-spark-base]
5f70bf18a086: Preparing
de8fab2b7fdf: Preparing
fde9e65a648f: Preparing
838aabcb60dd: Preparing
eb650d0160b4: Preparing
564be568a877: Preparing
5f70bf18a086: Preparing
1427caa01bbe: Preparing
5f70bf18a086: Preparing
c0843df22a5a: Preparing
346be19f13b0: Preparing
935f303ebf75: Preparing
0e64bafdc7ee: Preparing
564be568a877: Waiting
935f303ebf75: Waiting
0e64bafdc7ee: Waiting
1427caa01bbe: Waiting
c0843df22a5a: Waiting
346be19f13b0: Waiting
5f70bf18a086: Layer already exists
eb650d0160b4: Mounted from spark-on-kubernetes-testing/spark-cluster-spark-base
838aabcb60dd: Mounted from spark-on-kubernetes-testing/spark-cluster-spark-base
de8fab2b7fdf: Mounted from spark-on-kubernetes-testing/spark-cluster-spark-base
fde9e65a648f: Mounted from spark-on-kubernetes-testing/spark-cluster-spark-base
346be19f13b0: Layer already exists
564be568a877: Mounted from spark-on-kubernetes-testing/spark-cluster-spark-ba

## Step 8 - Run Jupyterlab Image and Test Code

You now how a local Jupyterlab Docker image. Run the code below in the command line outside of this interactive session to launch Jupyer Lab with a starter notebook with the code to create a SparkSession to your new server. You will first need to create a new file named untitled.txt and copy the output of sa.json. Make sure you don't upload the sa.json file to github and you delete it once you are done with it.

Once you have the file, run the first three cells and find the Kubernetes IP in the "Kubernetes control plane is running at:" line and copy this in to the SparkSession command as the spark master IP. 
  
One weird bug I have found is that you first need to create a SparkSession that is not a cluster mode kubernetes session and stop it. Then you can create a Kubernetes mode spark session.

At this point navigate to the Cloud Console and check out Kubernetes Engine Workloads. You will see pods attempt to be created but continuously failing. This is as far as I was able to get to get this working. Make sure you stop the session and then delete the project.

## COPY AND RUN IN CMD SESSION NOT IN JUPYTER NOTEBOOK ##
docker run -it --name jupyterlab-pyspark --rm -p 8888:8888 jupyterlab-pyspark
  
Click on 127.0.0.0 link to launch

In [None]:
# Run this code to download the service account json for the deployer account
account = f'deployer-sa@{new_project_id}.iam.gserviceaccount.com' 

In [None]:
# Download bigquery service account json file
!gcloud iam service-accounts keys create sa.json \
    --iam-account={account}

## Optional - Delete Project

To avoid on-going charges for everything created in this workbook run the below command to delete the project that you just created. Note it will take approximately 30 days for full completion and you will stil be charged for any charges accrued during this walkthrough. Check out [Deleting GCP Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects?visit_id=637510410447506984-2569255859&rd=1#shutting_down_projects) for more information.

In [None]:
### Uncomment code to delete project
!gcloud projects delete {new_project_id}