# Google Cloud Platform Project Creation Workbook 
 
Use this workbook to create a google cloud project with everything needed to collect new data and host your own web app. 
 
Prerequisites:  
+ Create Google user account  <br><br>
+ Create your own personal Google Cloud Project and Enable Billing
    - Enable Free Tier account by seleting "Try it Free" here: [Try Google Cloud Platform for free](https://cloud.google.com/cloud-console)
    - Follow steps to activate billing found here: [Create New Billing Account](https://cloud.google.com/billing/docs/how-to/manage-billing-account#create_a_new_billing_account)
        - Billing account is required for APIs used in this project
        - You will not exceed the $300 free trial setting up this project but make sure to delete the project if you do not want to be charged
        - Take note of project name created because this billing account will be used with the new project <br><br>
+ Install and initialize Google Cloud SDK by following instructions found here: [Cloud SDK Quickstart](https://cloud.google.com/sdk/docs/quickstart) <br><br>

## Step 1 - Check Prequisites Successfully Completed
Check that you have successfully installed and enabled Cloud SDK by running the config list command. If you get an error please refer to Troubleshooting steps found here [Cloud SDK Quickstart](https://cloud.google.com/sdk/docs/quickstart).  
You should see an output that includes your account along with any other configuration setup when using gcloud init

In [11]:
!gcloud config list

[accessibility]
screen_reader = False
[compute]
region = us-central1
zone = us-central1-c
[core]
account = cwilbar@alumni.nd.edu
disable_usage_reporting = False
project = spark-on-kubernetes-testing

Your active configuration is: [default]


In [None]:
#!gcloud auth login

## Step 2 - Create GCP Project

In [5]:
###### TO DO: Enter name for new project
###### Note: Proect name must be unique across GCP. If you get error when creating project please change the project name here and try again.

new_project_id = 'spark-on-kubernetes-testing'

In [6]:
!gcloud projects create {new_project_id}

Create in progress for [https://cloudresourcemanager.googleapis.com/v1/projects/spark-on-kubernetes-testing].
Waiting for [operations/cp.6139311934009095807] to finish...
..done.
Enabling service [cloudapis.googleapis.com] on project [spark-on-kubernetes-testing]...
Operation "operations/acf.p2-291725804806-a8596c4a-3df6-43a4-aa82-cbebc437582a" finished successfully.


In [7]:
!gcloud config set project {new_project_id}

Updated property [core/project].


#### IMPORTANT
*****TO DO: Navigate to [Cloud Console](https://console.cloud.google.com/), Change to new project, and enable billing following instructions found here: [Enable Billing](https://cloud.google.com/billing/docs/how-to/modify-project#enable_billing_for_a_project)***


## Step 3 - Enable Necessary Cloud Services

This project uses:
+ Google Kubernetes Engine for a kubernetes cluster manager
+ Google Container Registry to store spark Docker container images
  
List below contains all services needed at time of creation of this workbook. Please add/remove from this list if the names/necessary services have changed.

In [15]:
enable_services_list = [
    'bigquery.googleapis.com',
    'bigquerystorage.googleapis.com',
    'cloudapis.googleapis.com',
    'cloudbuild.googleapis.com',
    'clouddebugger.googleapis.com',
    'cloudtrace.googleapis.com',
    'compute.googleapis.com',
    'container.googleapis.com',
    'containeranalysis.googleapis.com',
    'containerregistry.googleapis.com',
    'iam.googleapis.com ',
    'iamcredentials.googleapis.com ',
    'language.googleapis.com',
    'oslogin.googleapis.com',
    'servicemanagement.googleapis.com',
    'serviceusage.googleapis.com',
    'sql-component.googleapis.com',
    'storage-api.googleapis.com',
    'storage-component.googleapis.com',
    'storage.googleapis.com'    
]

In [16]:
## Services can only be enabled 20 at a time at the time of workbook creation. Use this loop to enable 20 at a time.
for x in range(0,len(enable_services_list),20):
    !gcloud services enable {' '.join(enable_services_list[x:(x+20)])} --project={new_project_id}   

Operation "operations/acf.p2-291725804806-1bda1b9f-92f0-426e-a346-bc333af49b02" finished successfully.


In [17]:
# Check that services were enabled
!gcloud services list --project=simple-webapp-dev

NAME                              TITLE
automl.googleapis.com             Cloud AutoML API
bigquery.googleapis.com           BigQuery API
bigquerystorage.googleapis.com    BigQuery Storage API
cloudapis.googleapis.com          Google Cloud APIs
clouddebugger.googleapis.com      Cloud Debugger API
cloudtrace.googleapis.com         Cloud Trace API
containerregistry.googleapis.com  Container Registry API
datastore.googleapis.com          Cloud Datastore API
language.googleapis.com           Cloud Natural Language API
logging.googleapis.com            Cloud Logging API
monitoring.googleapis.com         Cloud Monitoring API
pubsub.googleapis.com             Cloud Pub/Sub API
run.googleapis.com                Cloud Run Admin API
servicemanagement.googleapis.com  Service Management API
serviceusage.googleapis.com       Service Usage API
sql-component.googleapis.com      Cloud SQL
storage-api.googleapis.com        Google Cloud Storage JSON API
storage-component.googleapis.com  Cloud Storage
st

## Step 4 - Create Necessary Service Accounts

There are two primary service accounts used in this project:  
- **Deployment Service Account**
    - We create this and add necessary roles below using the Cloud SDK
    - deployer-sa@your_project_name.iam.gserviceaccount.com
    - This account is used to deploy and test docker container and kubernetes cluster<br><br>
- **BigQuery Service Account**
    - We create this and add necessary roles below using the Cloud SDK
    - bigquery-sa@your_project_name.iam.gserviceaccount.com
    - This account is used in the container for access to big query

Check what service ccounts are already created (should be the two default ones described above)

In [18]:
!gcloud iam service-accounts list --project={new_project_id}

DISPLAY NAME                            EMAIL                                               DISABLED
Compute Engine default service account  291725804806-compute@developer.gserviceaccount.com  False


In [19]:
!gcloud iam service-accounts create deployer-sa \
    --display-name="Deployment Service Account" \
    --description="Account used to deploy to Google Cloud Project" \
    --project={new_project_id}

Created service account [deployer-sa].


In [20]:
!gcloud iam service-accounts create bigquery-sa \
    --display-name="BigQuery Service Account" \
    --description="Account used by Spark Containers to Connect to BigQuery" \
    --project={new_project_id}

Created service account [bigquery-sa].


Check service accounts were created successfully

In [21]:
!gcloud iam service-accounts list --project={new_project_id}

DISPLAY NAME                            EMAIL                                               DISABLED
Compute Engine default service account  291725804806-compute@developer.gserviceaccount.com  False


Programatically update the roles for the new service accounts using the guide found here: [Programatic Change Access](https://cloud.google.com/iam/docs/granting-changing-revoking-access#programmatic)

In [22]:
# Save policy file in directory above where the repo is saved so that it is not stored to github
file_directory = '..\..\policy.json'

In [23]:
# Write current policy to file directory
!gcloud projects get-iam-policy {new_project_id} --format json > {file_directory}

**If running jupyter notebook run below cell to load and modify policy file.**

In [24]:
import json

with open('..\..\policy.json') as f:
    policy = json.load(f)

def modify_policy_add_role(policy, role, member):
    """Adds a new role binding to a policy."""

    binding = {"members": [member],"role": role }
    policy["bindings"].append(binding)
    return policy

members = [f'serviceAccount:deployer-sa@{new_project_id}.iam.gserviceaccount.com', 
           f'serviceAccount:bigquery-sa@{new_project_id}.iam.gserviceaccount.com']
roles = {
        members[0]:['roles/editor','roles/container.admin'],
        members[1]:['roles/bigquery.dataEditor','roles/run.serviceAgent', 'roles/bigquery.user',
                    'roles/storage.admin']}

for member in members:
    for role in roles[member]:
        policy = modify_policy_add_role(policy, role, member)

with open('..\..\policy.json', 'w') as json_file:
    json.dump(policy, json_file)

In [25]:
!gcloud projects set-iam-policy {new_project_id} {file_directory}

bindings:
- members:
  - serviceAccount:bigquery-sa@spark-on-kubernetes-testing.iam.gserviceaccount.com
  role: roles/bigquery.dataEditor
- members:
  - serviceAccount:bigquery-sa@spark-on-kubernetes-testing.iam.gserviceaccount.com
  role: roles/bigquery.user
- members:
  - serviceAccount:291725804806@cloudbuild.gserviceaccount.com
  role: roles/cloudbuild.builds.builder
- members:
  - serviceAccount:service-291725804806@gcp-sa-cloudbuild.iam.gserviceaccount.com
  role: roles/cloudbuild.serviceAgent
- members:
  - serviceAccount:service-291725804806@compute-system.iam.gserviceaccount.com
  role: roles/compute.serviceAgentUpdated IAM policy for project [spark-on-kubernetes-testing].

- members:
  - serviceAccount:service-291725804806@container-engine-robot.iam.gserviceaccount.com
  role: roles/container.serviceAgent
- members:
  - serviceAccount:service-291725804806@container-analysis.iam.gserviceaccount.com
  role: roles/containeranalysis.ServiceAgent
- members:
  - serviceAccount:serv

In [26]:
# Remove policy file 
!del {file_directory}

## Step 5 - Create Kubernetes Engine Cluster

In order to deploy a container to kubernetes to run an application you first need to create a kubernetes engine cluster

In [27]:
## TO DO: Change region  to your default region
COMPUTE_REGION = 'us-central1'
CLUSTER_NAME = 'spark-cluster'
# COMPUTE_ZONE = 'us-central1-c'

In [17]:
#!gcloud compute regions list

In [28]:
!gcloud config set compute/region {COMPUTE_REGION}

Updated property [compute/region].


In [29]:
# !gcloud config set compute/zone {COMPUTE_ZONE}

Updated property [compute/zone].


In [33]:
# Create cluster with default settings. This may take serveral minutes
!gcloud container clusters create-auto {CLUSTER_NAME} \
    --project={new_project_id}

NAME           LOCATION     MASTER_VERSION   MASTER_IP      MACHINE_TYPE  NODE_VERSION     NUM_NODES  STATUS
Creating cluster spark-cluster in us-central1...spark-cluster  us-central1  1.19.9-gke.1400  35.239.173.65  e2-medium     1.19.9-gke.1400  3          RUNNING

............................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

In [34]:
# Get credentials to use when deploying to cluster
!gcloud container clusters get-credentials {CLUSTER_NAME}

Fetching cluster endpoint and auth data.
kubeconfig entry generated for spark-cluster.


In [35]:
!kubectl cluster-info

Kubernetes master is running at https://35.239.173.65
GLBCDefaultBackend is running at https://35.239.173.65/api/v1/namespaces/kube-system/services/default-http-backend:http/proxy
KubeDNS is running at https://35.239.173.65/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
KubeDNSUpstream is running at https://35.239.173.65/api/v1/namespaces/kube-system/services/kube-dns-upstream:dns/proxy
Metrics-server is running at https://35.239.173.65/api/v1/namespaces/kube-system/services/https:metrics-server:/proxy

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.


In [25]:
account = f'bigquery-sa@{new_project_id}.iam.gserviceaccount.com' 

In [36]:
account = f'deployer-sa@{new_project_id}.iam.gserviceaccount.com' 

In [37]:
# Download bigquery service account json file
!gcloud iam service-accounts keys create sa.json \
    --iam-account={account}

created key [9e41f235b4e9fdce3b0a37c10add384e52e86d0f] of type [json] as [sa.json] for [deployer-sa@spark-on-kubernetes-testing.iam.gserviceaccount.com]


In [27]:
# Create Kubernetes Secret from file
!kubectl create secret generic bigquery-credentials \
  --from-file ./sa.json

secret/bigquery-credentials created


In [28]:
# Remove service account file from local system now that Kubernetes Secret
!del sa.json

In [38]:
# Create spark service account on Kubernetes
!kubectl create serviceaccount spark

serviceaccount/spark created


In [39]:
# Create role for service account to enable edit
!kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

clusterrolebinding.rbac.authorization.k8s.io/spark-role created


## Step 6 - Create BigQuery Dataset

Your new project will need a dataset to store the data if you plan on copying/creating your own repository of data.  

This has to be a unique name per project.  

In my workflows I have named the dataset 'nba' but feel free to change it. Note that if you do change it, then you will also need to change the dataset name in any of the other python scripts in this project appropriately. 

In [29]:
dataset_name = 'amazon_reviews'

In [30]:
#Stop and re-run if this takes more than a minute
!bq --location=US mk --dataset \
--description "Stores transformed amazon review data orginally found at https://nijianmo.github.io/amazon/index.html" \
{new_project_id}:{dataset_name}  

Dataset 'spark-container-testing-6:amazon_reviews' successfully created.


## Step 7 - Build and Push Container to GCR



In [31]:
!docker build -t jupyterlab-pyspark cluster-mode-standalone/jupyterlab

#1 [internal] load build definition from Dockerfile
#1 sha256:de2dd75818e130241f519f771f57058d587994ce36d11b60328f8ee1de3fa7fc
#1 transferring dockerfile: 32B done
#1 DONE 0.0s

#2 [internal] load .dockerignore
#2 sha256:3581d49c95519dc794417c23b5021871ef8eaa4ce3c3426f1dd9f86c35553cc6
#2 transferring context: 35B done
#2 DONE 0.1s

#3 [internal] load metadata for docker.io/library/ubuntu:latest
#3 sha256:8c6bdfb121a69744f11ffa1fedfc68ec20085c2dcce567aac97a3ff72e53502d
#3 DONE 0.0s

#4 [ 1/16] FROM docker.io/library/ubuntu:latest
#4 sha256:0a5f349eacf4edfd2fc1577c637ef52a2ed3280d9d5c0ab7f2e4c4052e7d6c9f
#4 DONE 0.0s

#10 [internal] load build context
#10 sha256:f87e9bdbcbfc66eeb7e110c17042b648c026980daed0bf0ff55a0c6a2df62bfc
#10 transferring context: 238B done
#10 DONE 0.0s

#14 https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-2.2.0.jar
#14 sha256:173f6ef3aef50bc006d358109ed87a9ceb38ec7b8570fecb583413853f5055c7
#14 DONE 0.4s

#17 https://storage.googleapis.com/spark-l

In [32]:
!docker tag jupyterlab-pyspark:latest gcr.io/{new_project_id}/jupyterlab-pyspark:latest

In [33]:
# Push to Google Container Registry - This may take a few minutes. See end of README for instuctions on how to authenticate if you get an error with the push
!docker push gcr.io/{new_project_id}/jupyterlab-pyspark:latest

The push refers to repository [gcr.io/spark-container-testing-6/client-mode-spark-notebook]
84657af47b81: Preparing
5bb2d99bbe83: Preparing
4776f7e3e7a4: Preparing
c321536c1000: Preparing
62b4a933f82f: Preparing
3539f8208cfb: Preparing
e78f6d846b48: Preparing
5f70bf18a086: Preparing
73699085186c: Preparing
234a7be13f6b: Preparing
fd1892e6bbf9: Preparing
b365e4125915: Preparing
8b167cb6e70a: Preparing
5f70bf18a086: Preparing
420aaa2ef3d1: Preparing
346be19f13b0: Preparing
935f303ebf75: Preparing
0e64bafdc7ee: Preparing
fd1892e6bbf9: Waiting
b365e4125915: Waiting
3539f8208cfb: Waiting
e78f6d846b48: Waiting
8b167cb6e70a: Waiting
5f70bf18a086: Waiting
420aaa2ef3d1: Waiting
73699085186c: Waiting
234a7be13f6b: Waiting
346be19f13b0: Waiting
935f303ebf75: Waiting
0e64bafdc7ee: Waiting
84657af47b81: Pushed
5bb2d99bbe83: Pushed
4776f7e3e7a4: Pushed
5f70bf18a086: Layer already exists
e78f6d846b48: Pushed
234a7be13f6b: Pushed
3539f8208cfb: Pushed
fd1892e6bbf9: Pushed
b365e4125915: Pushed
c321536c1

## Step 8 - Deploy App to Cluster and Expose


In [34]:
APP_NAME = 'spark-server'

In [35]:
template = f'''
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-server
  labels:
    app: spark-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: spark-server
  template:
    metadata:
      labels:
        app: spark-server
    spec:
      # The secret data is exposed to Containers in the Pod through a Volume.
      volumes:
      - name: secret-volume
        secret:
          secretName: bigquery-credentials
      containers:
      - image: gcr.io/{new_project_id}/client-mode-spark-notebook:latest
        name: client-mode-spark-notebook
        volumeMounts:
          # name must match the volume name below
          - name: secret-volume
            mountPath: /var/secrets/google    
'''

In [36]:
with open('deployment.yaml', 'w') as file:
    file.write(template)  

In [37]:
!kubectl apply -f deployment.yaml

deployment.apps/spark-server created


In [38]:
!kubectl expose deployment {APP_NAME} --type LoadBalancer --port 80 --target-port 8888

service/spark-server exposed


In [41]:
!kubectl describe services

Name:              kubernetes
Namespace:         default
Labels:            component=apiserver
                   provider=kubernetes
Annotations:       <none>
Selector:          <none>
Type:              ClusterIP
IP:                10.3.240.1
Port:              https  443/TCP
TargetPort:        443/TCP
Endpoints:         34.136.228.198:443
Session Affinity:  None
Events:            <none>


Name:                     spark-server
Namespace:                default
Labels:                   app=spark-server
Annotations:              <none>
Selector:                 app=spark-server
Type:                     LoadBalancer
IP:                       10.3.244.39
LoadBalancer Ingress:     34.71.125.54
Port:                     <unset>  80/TCP
TargetPort:               8888/TCP
NodePort:                 <unset>  30851/TCP
Endpoints:                10.0.0.7:8888
Session Affinity:         None
External Traffic Policy:  Cluster
Events:
  Type    Reason                Age   From                Me

## Step 9 - Create Cloud Storage Bucket

In order to load data to BigQuery with Spark we need a temporary Cloud Storage Bucket we create below.

Bucket names must be globally unique so we include the project name.

When running the Jupyter Notebook to load the data you will have to set the project ID so that it uses this bucket.

In [41]:
!gsutil mb gs://amazon_reviews_bucket-{new_project_id}

## Optional - Delete Project

To avoid on-going charges for everything created in this workbook run the below command to delete the project that you just created. Note it will take approximately 30 days for full completion and you will stil be charged for any charges accrued during this walkthrough. Check out [Deleting GCP Project](https://cloud.google.com/resource-manager/docs/creating-managing-projects?visit_id=637510410447506984-2569255859&rd=1#shutting_down_projects) for more information.

In [40]:
### Uncomment code to delete project
!gcloud projects delete {new_project_id}

^C
