# [INFO-H515 - Big Data Scalable Analytics](https://uv.ulb.ac.be/course/view.php?id=85246?username=guest)

## TP 7 - Spark, HDFS and Jupyter notebooks on Google Cloud

*Materials originally developed by Yann-Aël Le Borgne and Gianluca Bontempi*

#### *Theo Verhelst, Daniele Lunghi and Gianluca Bontempi*

####  Date: TBD



This class aims at illustrating how a cloud service provider such as Google can be used to launch a Spark/HDFS cluster and interact with it with a Jupyter notebook.

### Class objectives:

* Set up a Google Cloud account
* Launch a Spark/HDFS cluster with DataProc
* Start Spark jobs from a Jupyter notebook
* Monitor jobs and cluster usage through the Spark, Yarn and HDFS web monitoring interfaces

# 1) Introduction

Cloud service providers such as Google Cloud, Amazon Web Services, or Microsoft Azure all provide different solutions for setting up clusters with Big Data services such as Spark, Hadoop, or Yarn. Because user needs vary widely, cloud service providers offer many ways to set up and configure clusters, ranging from fine-grained solutions with detailed control over hardware and software configurations, to high-level services that spin up preconfigured clusters with just a few mouse clicks.

In this class, we will rely on Google Cloud and its [DataProc](https://cloud.google.com/dataproc) service. Google Cloud was chosen since, at the time of writing, it provided the safest (no charges unless explicitly authorized) and most attractive offer in terms of free credits (300 dollars over a three-month period) for testing the platform. The DataProc service was chosen since it makes it easy to spin up a Spark/HDFS cluster and interact with it from a Jupyter notebook. Note that there are other ways to launch Spark/HDFS clusters with Google Cloud, and that similar solutions exist with other cloud service providers such as AWS or Microsoft Azure.



# 2) Account creation 


All cloud service providers require you to create an account, verify it with your phone, and provide credit card details even if you only rely on the free credits. 

To access Google Cloud services, go to https://console.cloud.google.com. You will need to create a Google account if you don't have one.

Once signed in, you will have to accept the terms of service.

<img src="images/Setting_Up_Account_1.jpg"  width="900"/> 

Note the banner at the top, which invites you to activate your free credits. 

<img src="images/Setting_Up_Account_2.jpg"  width="900"/> 

Click 'Activate' in order to obtain the 300 dollars of free credits. Three steps will need to be taken to obtain the credits:

* First step: Select 'Belgium' as the country, and 'Other' as the organization. Agree to the terms of service.

<img src="images/Setting_Up_Account_3.jpg"  width="900"/> 

* Second step: Enter your phone number and click on 'Send code' to obtain a validation code on your phone. Verify your account with the 6-digit code sent to your phone.

* Third step: Verify your payment information. For account 'type', select 'individual'. Enter your address, and choose your payment method. Finally, click on 'Start my free trial'. 

You may get a pop-up window asking you for further verification (e.g., 3D-secure verification). Follow the verification procedure by clicking on 'Continue'.

<img src="images/Setting_Up_Account_4.jpg"  width="500"/> 


**Important notes regarding billing:** 

* **Important note 1:** In our experience, Google will not charge you unless you explicitly authorize it to do so. That is, once you run out of free credits, access to paid services will simply be suspended. **Keep in mind that this may change in the future, and do not take for granted that you will not be billed**.    

* **Important note 2:** Billing concepts for cloud services are very complex. See here for an overview: https://cloud.google.com/billing/docs/concepts?hl=en-GB. 

* **Important note 3:** You may (and will) have bad surprises regarding billing. It is common that starting one service also activates other services without you noticing. For example, starting a Spark cluster implicitly creates a Google Cloud Storage bucket, for which you will also be charged. If you delete your cluster but forget to delete the bucket, you will be charged until you delete the bucket. Another example concerns queries on big datasets, using for example the Google BigQuery service. In that case, billing depends on the amount of data that has to be processed to answer your query. Even if your query seems simple (like selecting something from a table), charges may be significant if the table is huge. There are many other examples of unexpected charges. **Therefore, always read the documentation carefully for the services you try**.

* **Important note 4:** **Always delete everything you created after using a service, unless you want to keep it (and therefore pay for it)**. The last step of this class consists in properly deleting all of the services that were used. **Follow it carefully**.

Useful links regarding billing:

* Google Cloud free program: https://cloud.google.com/free/docs/gcp-free-tier
* Overview of Cloud Billing concepts: https://cloud.google.com/billing/docs/concepts?hl=en-GB
* Understanding Google Cloud costs: https://www.cloudskillsboost.google/quests/90?qlcampaign=1q-cloudent-061&utm_source=qwiklabs&utm_medium=console&utm_campaign=pilot&hl=en-GB
* Quota requests: https://support.google.com/cloud/answer/6330231

Your billing details can be found in the navigation menu, under 'Billing'. Go to 'Overview'. On the dashboard, your free remaining credits should appear on the left side of the screen, at the bottom. 

<img src="images/Setting_Up_Account_5.jpg"  width="350"/> 


**Check the dashboard regularly, even when not using services, to avoid being charged for a service you forgot to shut down properly**.




# 3) Set up Google Dataproc

[Google Dataproc](https://cloud.google.com/dataproc) is the Google service for launching Spark/HDFS clusters and running Spark jobs. In the following, we show how to create and configure a new project on Dataproc for launching a Spark/HDFS cluster.

The total cost to run this lab on Google Cloud is about 2 dollars. See the [Dataproc pricing details here](https://cloud.google.com/dataproc/pricing?authuser=2). The last section of this class will detail how to clean up your project. 

### Create a new project

Sign in to the Google Cloud Platform console at [console.cloud.google.com](https://console.cloud.google.com) and create a new project:

<table>
    <tr>
        <td>1)</td>
        <td>
            <img src="images/Create_Project_1.png"  width="550"/> 
        </td>
    </tr>
    <tr>
        <td>2)</td>
        <td>
            <img src="images/Create_Project_2a.jpg"  width="700"/> 
        </td>
    </tr>
    <tr>
        <td>3)</td>
        <td>
            <img src="images/Create_Project_3a.jpg"  width="600"/> 
        </td>
    </tr>
</table>






### Set up your environment

First, open up Cloud Shell by clicking the button in the top right-hand corner of the cloud console:

<img src="images/Set_up_environment_1.png"  width="600"/> 

After the Cloud Shell loads, run the following command to set the project ID from the previous step:

```
gcloud config set project <project_id>
```

The project ID can also be found by clicking on your project in the top left of the cloud console:

<img src="images/Set_up_environment_2.png"  width="500"/> 

<img src="images/Set_up_environment_3.png"  width="600"/> 

You may be asked to authorize Google to make a GCP API call. If so, authorize.

Next, enable the Dataproc, Compute Engine and Storage Components APIs.

```
gcloud services enable dataproc.googleapis.com \
  compute.googleapis.com \
  storage-component.googleapis.com 
```

Enabling APIs can take up to one minute.

Alternatively, this can be done in the Cloud Console. Click on the menu icon in the top left of the screen.

<img src="images/Set_up_environment_4.png"  width="300"/> 


Select API Manager from the drop down.


<img src="images/Set_up_environment_5.png"  width="300"/> 

Click on Enable APIs and Services.

<img src="images/Set_up_environment_6.png"  width="500"/> 

Search for and enable the following APIs:

* Compute Engine API
* Dataproc API
* Storage Components


# 4) Create a Spark/HDFS cluster

## Creating your cluster

Set the environment variables for your cluster (in the Cloud shell):

```
REGION=europe-west1
CLUSTER_NAME=<project_id>
```

Then run this gcloud command to create your cluster with all the necessary components to work with Jupyter on your cluster. We will use `n1-standard-8` machines for the master and the two workers, which feature [8 vCPUs and 30GB RAM](https://cloud.google.com/compute/all-pricing) and cost around 0.4 dollars/hour.

```
gcloud beta dataproc clusters create ${CLUSTER_NAME} \
 --region=${REGION} \
 --image-version=1.5 \
 --master-machine-type=n1-standard-8 \
 --worker-machine-type=n1-standard-8 \
 --num-workers=2 \
 --optional-components=ANACONDA,JUPYTER \
 --enable-component-gateway 
```

You should see the following output while your cluster is being created:

```
Waiting on operation [projects/cluster-hdfs-spark/regions/europe-west1/operations/abcd123456].
Waiting for cluster creation operation...
```

It should take about 90 seconds to create your cluster. Once it is ready, you will be able to access it from the [Dataproc Cloud console UI](https://console.cloud.google.com/dataproc/clusters?authuser=2).

While you are waiting, you can carry on reading below to learn more about the flags used in the gcloud command (all the flags are described at https://cloud.google.com/sdk/gcloud/reference/beta/dataproc/clusters/create).

You should see the following output once the cluster is created:

```
Created [https://dataproc.googleapis.com/v1beta2/projects/project-id/regions/europe-west1/clusters/spark-jupyter] Cluster placed in zone [europe-west1-c].
```

#### Flags used in gcloud dataproc create command

Here is a breakdown of the flags used in the gcloud dataproc create command:

```
--region=${REGION}
```

Specifies the region where the cluster will be created. You can see the list of available regions [here](https://cloud.google.com/compute/docs/regions-zones).

```
--image-version=1.5
```

The image version to use in your cluster. You can see the list of available versions [here](https://cloud.google.com/dataproc/docs/concepts/versioning/dataproc-versions).


```
--master-machine-type=n1-standard-8
--worker-machine-type=n1-standard-8
```

The machine types to use for your Dataproc cluster. You can see a list of available machine types [here](https://cloud.google.com/compute/docs/machine-types), and pricing details [here](https://cloud.google.com/compute/all-pricing).

By default, 1 master node and 2 worker nodes are created if you do not set the `--num-workers` flag.
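
As a rough check on what such a cluster costs, the sketch below multiplies the number of nodes by the hourly price quoted in this class. It assumes that the price of around 0.4 dollars/hour applies to each node, and the 1.5-hour session length and the function name `estimate_cluster_cost` are hypothetical values chosen for illustration. The estimate ignores the Dataproc service fee, disks, and storage, so treat it as an order of magnitude only.

```python
def estimate_cluster_cost(num_masters, num_workers, price_per_node_hour, hours):
    """Return a rough compute cost in dollars for one cluster session."""
    num_nodes = num_masters + num_workers
    return round(num_nodes * price_per_node_hour * hours, 2)


# 1 master + 2 workers at around 0.4 dollars/hour each (assumed per node),
# for a hypothetical 1.5-hour lab session.
cost = estimate_cluster_cost(num_masters=1, num_workers=2,
                             price_per_node_hour=0.4, hours=1.5)
print(f"Estimated compute cost: {cost} dollars")
```

Under these assumptions the estimate stays slightly below 2 dollars, in line with the total cost announced for this lab. Leaving the cluster running overnight would multiply this figure accordingly.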

```
--optional-components=ANACONDA,JUPYTER
```

Setting these values for optional components will install all the necessary libraries for Jupyter and Anaconda (which is required for Jupyter notebooks) on your cluster.

```
--enable-component-gateway
```

Enabling Component Gateway creates an App Engine link using Apache Knox and Inverting Proxy, which gives easy, secure and authenticated access to the Jupyter and JupyterLab web interfaces. This means you no longer need to create SSH tunnels.

It will also create links for other tools on the cluster, including the Yarn ResourceManager and the Spark History Server, which are useful for seeing the performance of your jobs and cluster usage patterns.



# 5) Upload a Spark notebook

#### Accessing the JupyterLab web interface

Once the cluster is ready you can find the Component Gateway link to the JupyterLab web interface by going to [Dataproc Clusters - Cloud console](https://console.cloud.google.com/dataproc/clusters?authuser=2), clicking on the cluster you created and going to the Web Interfaces tab. The Dataproc cluster console can also be found under 'Big Data'/'Dataproc'/'Clusters' in the Cloud console main menu (top left of the UI).

<img src="images/DataProc_UI.gif"  width="900"/> 

You will notice that you have access to Jupyter, which is the classic notebook interface, and to JupyterLab, which is described as the next-generation UI for Project Jupyter.

JupyterLab offers many new UI features. If you are new to notebooks or looking for the latest improvements, JupyterLab is the recommended choice, since according to the official docs it will eventually replace the classic Jupyter interface.

However, JupyterLab in DataProc does not easily support plotting with Matplotlib. Therefore, for this class where we require plotting some results, select the Jupyter interface.

Notice that this web interface page also gives you access to the Spark history server, the YARN ResourceManager, and the HDFS NameNode web interface. 

#### Upload a notebook

From the Jupyter interface, **go to the GCS folder**, and upload the 7-FeatureSelection-GoogleCloud notebook (if you do not go to the GCS folder, the upload will hang since you do not have write permission on the root folder).

The notebook is an adapted version of the notebook used for the class on feature selection. The main differences are:

* A dataset with 1000 observations and 30000 features is used
* The sections for data visualization and ranking with correlation are removed. Experiments are made with mRMR
* The master is changed to 'yarn' for Spark, where you can vary the number of instances and cores
* We finally show how data can be stored in HDFS and then loaded from it for faster execution

<img src="images/Jupyter_UI.gif"  width="900"/> 
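
The third difference listed above, running Spark with the 'yarn' master and a variable number of instances and cores, can be sketched as follows. This is an illustration and not the notebook's actual code: the helper names `yarn_settings` and `build_session` and the application name are hypothetical, and the sketch assumes that "instances" and "cores" map to the standard Spark properties `spark.executor.instances` and `spark.executor.cores`.

```python
def yarn_settings(num_instances, cores_per_instance):
    """Build Spark configuration entries for a session managed by YARN."""
    return {
        "spark.master": "yarn",
        "spark.executor.instances": str(num_instances),
        "spark.executor.cores": str(cores_per_instance),
    }


def build_session(settings, app_name="feature-selection"):
    """Create a SparkSession from a dict of configuration entries."""
    # Imported lazily so that yarn_settings can be used without PySpark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in settings.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


settings = yarn_settings(num_instances=2, cores_per_instance=8)
print(settings)
# On the cluster, the session would then be created with:
# spark = build_session(settings)
```

Since `getOrCreate()` reuses a session that is already running, an existing session would likely need to be stopped before a different number of instances or cores can take effect.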




# 6) Feature selection with mRMR


As in TP 6, import the required libraries, and generate an artificial dataset (Section 1 of the notebook). The dataset contains 1000 observations and 30000 features in total: 10000 informative features, 10000 noisy features and 10000 redundant features.

##  Centralized approach

Run the centralized approach. Note that the time to select a feature increases linearly with the number of features already selected, each step taking on average 2.5 seconds more than the previous step. The first step takes around 2.5 seconds, while the tenth step takes around 25 seconds.

<img src="images/Execution_times_1.jpg"  width="600"/> 
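
The linear growth described above can be written as a simple timing model, which also gives the total time spent over several steps. This is only a sketch built on the approximate figures quoted in the text. The function names and the optional `overhead` parameter are illustrative additions, not part of the notebook.

```python
def step_time(step, seconds_per_step=2.5, overhead=0.0):
    """Approximate duration in seconds of the given selection step (1-based)."""
    return overhead + seconds_per_step * step


def total_time(num_steps, seconds_per_step=2.5, overhead=0.0):
    """Approximate total duration in seconds of the first num_steps steps."""
    return sum(step_time(k, seconds_per_step, overhead)
               for k in range(1, num_steps + 1))


print(step_time(1))    # first step
print(step_time(10))   # tenth step
print(total_time(10))  # all ten steps together
```

Because each step costs more than the previous one, the total time grows quadratically with the number of selected features. Under this model, the ten steps together take a little over two minutes.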

## Spark and Map/Reduce

### 1 instance, 1 core

With 1 instance and 1 core, the overhead of Spark and Yarn makes the execution around 15 seconds longer for each step of the algorithm. The increase of execution times for each new selected feature is however the same as the centralized approach, that is, around 2.5 seconds. 

<img src="images/Execution_times_2.jpg"  width="600"/> 

You can check in the Spark History server how tasks were distributed during the last step of the feature selection. Since there is only one instance with one core, the 16 partitions are processed one after another.

<img src="images/SparkHS_UI_1.gif"  width="900"/> 
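
The sequential processing described above follows from a simple count: with one task per partition, the number of rounds needed is the number of partitions divided by the number of available cores, rounded up. The sketch below assumes that each core runs one task at a time and that tasks have similar durations. The function name `num_waves` is hypothetical.

```python
import math


def num_waves(num_partitions, num_instances, cores_per_instance):
    """Number of sequential rounds needed to run one task per partition."""
    total_cores = num_instances * cores_per_instance
    return math.ceil(num_partitions / total_cores)


# 16 partitions on 1 instance with 1 core
print(num_waves(16, num_instances=1, cores_per_instance=1))
# 16 partitions on 2 instances with 8 cores each
print(num_waves(16, num_instances=2, cores_per_instance=8))
```

With a single core the 16 tasks run in 16 successive rounds, whereas 16 cores can process all partitions in a single round.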





### 2 instances, 8 cores

The tasks are now distributed on the 16 cores available on the two instances. The overhead of Spark and Yarn is reduced. The increase of execution time for each new selected feature is divided by a factor of 10, down to around 0.25 seconds.

<img src="images/Execution_times_3.jpg"  width="600"/> 
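
The figures above can be turned into a speedup and a parallel efficiency, which shows how far the run is from ideal scaling. This is a back-of-the-envelope sketch based on the approximate per-step increases quoted in the text. It ignores the fixed Spark and Yarn overhead, and the function names are illustrative.

```python
def speedup(baseline_seconds, parallel_seconds):
    """Ratio between the baseline and the parallel per-step increase."""
    return baseline_seconds / parallel_seconds


def parallel_efficiency(speedup_factor, num_cores):
    """Fraction of the ideal (linear) speedup that is actually achieved."""
    return speedup_factor / num_cores


s = speedup(2.5, 0.25)
e = parallel_efficiency(s, num_cores=16)
print(f"speedup: {s:.1f}x, efficiency: {e:.1%}")
```

A speedup of 10 on 16 cores is below the ideal factor of 16, which is expected since part of the work (scheduling, data transfer, collecting results) does not parallelize.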

You can check in the Spark UI how tasks were distributed during the last step of the feature selection. Since there are 2 instances with 8 cores, the 16 partitions are processed in parallel.

Note: Since the Spark session is still active, we need to go to the Spark UI instead of the Spark history server. The Spark UI can be accessed from the Yarn ResourceManager.

<img src="images/Spark_UI_1.gif"  width="900"/> 


### 2 instances, 8 cores, using HDFS

Using HDFS reduces the overhead due to data transfer. The increase of execution time for each new selected feature remains at around 0.25 seconds. 

<img src="images/Execution_times_4.jpg"  width="600"/> 

We can use the HDFS NameNode UI to see how much space was used on HDFS, and how many replicas exist for the data (2 in this setup).

<img src="images/HDFS_UI_1.gif"  width="900"/> 
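
To relate the space reported by the NameNode UI to the dataset, the sketch below estimates the raw size of the data and the space it occupies once replicated. It assumes that values are stored as 8-byte floats with no compression or format overhead, which may not match the file format actually written by the notebook. The function name `hdfs_footprint_bytes` is hypothetical.

```python
def hdfs_footprint_bytes(num_rows, num_columns, bytes_per_value=8, replication=2):
    """Return (logical size, size including replicas) in bytes."""
    logical = num_rows * num_columns * bytes_per_value
    return logical, logical * replication


# 1000 observations x 30000 features, replicated twice as in this setup
logical, physical = hdfs_footprint_bytes(1000, 30000)
print(f"logical: {logical / 1e6:.0f} MB, with replicas: {physical / 1e6:.0f} MB")
```

The size shown in the UI can differ noticeably from this estimate depending on the serialization format, but the replicated size should remain about twice the logical size.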

# 7) Clean up your resources

To avoid incurring unnecessary charges to your GCP account after completion of this lab:

* [Delete the Cloud Storage bucket](https://cloud.google.com/storage/docs/deleting-buckets) that was created for the environment
* [Delete the Dataproc environment](https://cloud.google.com/dataproc/docs/guides/manage-cluster).

If you created a project just for this lab, you can also optionally delete the project:

* In the GCP Console, go to the Projects page.
* In the project list, select the project you want to delete and click Delete.
* In the box, type the project ID, and then click Shut down to delete the project.

Caution: Deleting a project has the following effects:

* Everything in the project is deleted. If you used an existing project for this tutorial, when you delete it, you also delete any other work you've done in the project.
* Custom project IDs are lost. When you created this project, you might have created a custom project ID that you want to use in the future. To preserve the URLs that use the project ID, such as an appspot.com URL, delete selected resources inside the project instead of deleting the whole project.


## Acknowledgements

* Part of this lab was inspired by the following tutorial: Apache Spark and Jupyter Notebooks on Cloud Dataproc - https://codelabs.developers.google.com/codelabs/spark-jupyter-dataproc#0