# Class exercise: Version control in practice

*Tuesday, 17 September*

**Requirements:** Access to the Internet

## Exercise: Create a project with your partner



1.   Create a repository.
2.   Create a text file with your name in it.
3.   Commit the text file.
4.   Your partner forks your repository.
5.   Your partner adds his/her name to the file.
6.   Your partner commits the text file and opens a pull request.
7.   Merge the pull request.
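
The same seven steps can be rehearsed entirely on your own machine before trying them on GitHub. The following is only a sketch under stated assumptions: forking and pull requests are GitHub features, so here a plain local clone stands in for the fork and a fetch followed by a merge stands in for merging the pull request. The file name `names.txt` and the two identities are made up for illustration.

```shell
set -e
work=$(mktemp -d)                 # scratch directory, safe to delete afterwards

# Steps 1-3: create a repository, add a text file with your name, commit it.
git init -q "$work/owner"
cd "$work/owner"
git config user.name "Owner"
git config user.email "owner@example.com"
echo "Owner" > names.txt
git add names.txt
git commit -q -m "Add names.txt with my name"

# Step 4: the partner "forks" the repository (assumption: a local clone).
git clone -q "$work/owner" "$work/partner"

# Steps 5-6: the partner adds a name and commits the change.
cd "$work/partner"
git config user.name "Partner"
git config user.email "partner@example.com"
echo "Partner" >> names.txt
git commit -q -a -m "Add partner's name"

# Step 7: the owner merges the partner's work
# (assumption: fetch + merge stands in for merging a pull request).
cd "$work/owner"
git fetch -q "$work/partner" HEAD
git merge -q FETCH_HEAD

cat names.txt                     # now lists both names
```

On GitHub itself, steps 4, 6 and 7 are done through the web interface instead.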



## Using the `git` command line

Why use the command line?



*   GitHub Desktop is not always available, for example on Linux systems or cloud computing platforms.
*   You may be programming on a virtual machine, a remote server, etc.


### Getting & Creating Projects

| Command | Description |
| ------- | ----------- |
| `git init` | Initialize a local Git repository |
| `git clone ssh://git@github.com/[username]/[repository-name].git` | Create a local copy of a remote repository |

### Basic Snapshotting

| Command | Description |
| ------- | ----------- |
| `git status` | Check status |
| `git add [file-name.txt]` | Add a file to the staging area |
| `git add -A` | Add all new and changed files to the staging area |
| `git commit -m "[commit message]"` | Commit changes |
| `git rm -r [file-name.txt]` | Remove a file (or folder) |

### Branching & Merging

| Command | Description |
| ------- | ----------- |
| `git branch` | List branches (the asterisk denotes the current branch) |
| `git branch -a` | List all branches (local and remote) |
| `git branch [branch name]` | Create a new branch |
| `git branch -d [branch name]` | Delete a branch |
| `git push origin --delete [branch name]` | Delete a remote branch |
| `git checkout -b [branch name]` | Create a new branch and switch to it |
| `git checkout -b [branch name] origin/[branch name]` | Create a local branch from a remote branch and switch to it |
| `git checkout [branch name]` | Switch to a branch |
| `git checkout -` | Switch to the branch last checked out |
| `git checkout -- [file-name.txt]` | Discard changes to a file |
| `git merge [branch name]` | Merge a branch into the active branch |
| `git checkout [target branch]`, then `git merge [source branch]` | Merge a source branch into a target branch (`git merge` always merges into the active branch) |
| `git stash` | Stash changes in a dirty working directory |
| `git stash clear` | Remove all stashed entries |
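
To see how the branching commands fit together, here is a minimal sketch that runs in a throwaway repository. The file name `notes.txt`, the branch name `feature` and the identity are placeholders; the name of the initial branch (`master` or `main`) depends on your Git configuration, so the script looks it up rather than assuming it.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Student"
git config user.email "student@example.com"

echo "first line" > notes.txt
git add notes.txt
git commit -q -m "Initial commit"
base=$(git rev-parse --abbrev-ref HEAD)   # name of the initial branch

git checkout -q -b feature                # create a new branch and switch to it
echo "second line" >> notes.txt
git commit -q -a -m "Add a second line on the feature branch"

git checkout -q "$base"                   # switch back to the initial branch
git merge -q feature                      # merge feature into the active branch
git branch -d feature                     # delete the branch, now that it is merged
git --no-pager branch                     # list the remaining branches
```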

### Sharing & Updating Projects

| Command | Description |
| ------- | ----------- |
| `git push origin [branch name]` | Push a branch to your remote repository |
| `git push -u origin [branch name]` | Push changes to remote repository (and remember the branch) |
| `git push` | Push changes to remote repository (remembered branch) |
| `git push origin --delete [branch name]` | Delete a remote branch |
| `git pull` | Update local repository to the newest commit |
| `git pull origin [branch name]` | Pull changes from remote repository |
| `git remote add origin ssh://git@github.com/[username]/[repository-name].git` | Add a remote repository |
| `git remote set-url origin ssh://git@github.com/[username]/[repository-name].git` | Change the URL of the `origin` remote (here, to an SSH URL) |
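
The push commands can be tried without a GitHub account. In this sketch a local bare repository plays the role of the remote; that substitution, together with the file name and identity, is an assumption made for illustration. On GitHub, the argument to `git remote add origin` would be the `ssh://` URL from the table.

```shell
set -e
work=$(mktemp -d)
git init -q --bare "$work/remote.git"      # stand-in for the repository on GitHub

git init -q "$work/local"
cd "$work/local"
git config user.name "Student"
git config user.email "student@example.com"
echo "hello" > README.txt
git add README.txt
git commit -q -m "Initial commit"
branch=$(git rev-parse --abbrev-ref HEAD)  # 'master' or 'main', depending on your setup

git remote add origin "$work/remote.git"   # add a remote repository
git push -q -u origin "$branch"            # first push: remember the upstream branch

echo "more" >> README.txt
git commit -q -a -m "Second commit"
git push -q                                # later pushes use the remembered branch

git remote -v                              # show where origin points
```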

### Inspection & Comparison

| Command | Description |
| ------- | ----------- |
| `git log` | View changes |
| `git log --summary` | View changes (detailed) |
| `git diff [source branch] [target branch]` | Preview changes before merging |
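
As a small illustration of the inspection commands, the sketch below creates two branches that differ by one line and then looks at the history and the difference. The file name `data.txt` and the branch name `experiment` are placeholders.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Student"
git config user.email "student@example.com"

echo "alpha" > data.txt
git add data.txt
git commit -q -m "Add data.txt"
base=$(git rev-parse --abbrev-ref HEAD)     # name of the initial branch

git checkout -q -b experiment
echo "beta" >> data.txt
git commit -q -a -m "Append beta"

git --no-pager log --summary                # history of the current branch
git --no-pager diff "$base" experiment      # what merging experiment would change
```

The diff shows `+beta`, the single line that exists on `experiment` but not on the initial branch.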



# Online/cloud computing resources

## Why cloud computing?


*   A personal computer cannot always meet the computational demands.
*   You may not want to buy expensive CPUs, GPUs, etc.
*   It makes collaborative work easier.



## Cloud computing platforms



*   Google Cloud Platform
*   Amazon Web Services



# Google Cloud Platform

In this class, we are going to learn how to set up Google Cloud Platform (GCP) and perform some simple tasks.
Please make sure that you get everything to work, as we will use GCP at various points during the course.

First, we will look at some basic commands to use the GCP.
Then, we will set up a Dataproc cluster. Dataproc is Google's user-friendly platform for data processing, analytics and machine learning.
Finally, we will learn how to run a Jupyter notebook on this cluster.


## 1. Preparation
In this section we will look into setting up a remote cluster using the Google Cloud Platform.
Similar functionality is also available through other Infrastructure-as-a-Service providers, such as Amazon Web Services and Microsoft Azure.

GCP can be accessed via the Google Cloud Console (in your favourite browser) and through Google Cloud SDK, a command-line interface for GCP. Please make sure you can access the Google Cloud Console and install the Google Cloud SDK on your computer.

### Google Cloud Console
Google Cloud Console is a web user-interface for GCP, for configuring and using various services and monitoring usage activity.
You can access Google Cloud Console here: https://console.cloud.google.com. You will need to register with your Google account (create one if you do not have one yet). Registering gives you access to a 12-month free trial, with $300 worth of computing credit.

 ![alt text](https://raw.githubusercontent.com/lse-st446/lectures2019/master/week01/class/figs/gcp.png?token=AGU4ZEJKEGHASOCG2O3I2R25JVZRS)

### Google Cloud SDK
Google Cloud SDK (Software Development Kit) is a command line tool that allows you to manage resources and services hosted on GCP.
For example, you can list, copy, and remove files using `gsutil ls`, `gsutil cp`, and `gsutil rm`, respectively.
Please check out this more detailed [overview](https://cloud.google.com/sdk/docs/overview).

You need to install Google Cloud SDK before using it. Please follow the [installation instructions](https://cloud.google.com/sdk/docs/quickstarts). Once you have installed it (and your Google Cloud account is set up), run the provided examples to check that your installation works as it should.

(You might need to use Python 2.7 to install Google Cloud SDK on your computer. In Anaconda, you can set up an environment for this with `conda create -n py27 python=2.7 anaconda`, without uninstalling other versions of Python. Anaconda will let you switch between environments.)

Once Google Cloud SDK is installed, you can get basic information about your configuration using the terminal/command line command:
```
gcloud config list
```
Note: You will need either to be in the folder where `gcloud` resides or to have the directory where `gcloud` is located added to your PATH environment variable.

Example output:
```
LSE021353:~ vojnovic$ gcloud config list
[core]
account = milanvojnov@gmail.com
disable_usage_reporting = True
project = integral-linker-185619

Your active configuration is: [default]
```
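
If `gcloud config list` fails with a "command not found" error, the PATH issue from the note above is the likely cause. A quick way to check on Mac/Linux is sketched here; `check_tool` is a helper name made up for this example, and `command -v` is the standard shell way of asking where a program would be found.

```shell
# Report whether a command-line tool can be found on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found on PATH"
    fi
}

check_tool gcloud
check_tool gsutil
```

If either tool is reported as not found, add the SDK's `bin` directory to PATH (the installer normally offers to do this) and open a new terminal.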

## 2. Upload and download files into GCP

A bucket works very similarly to a regular directory/folder on your computer.

In order to upload and download files, we first need to create a bucket and then upload and download the files to/from the bucket.
This can be done using either Google Cloud Console or Google Cloud SDK.
We provide you with resources on how to do it in Google Cloud Console and provide you with the code for Google Cloud SDK.

### Google Cloud Console approach:

* Create bucket: https://cloud.google.com/storage/docs/creating-buckets
* Upload, download and delete a file: https://cloud.google.com/storage/docs/object-basics

### Google Cloud SDK approach:
All the commands are very similar to the usual commands you use in the command line interface of a Mac/Linux computer.

On Windows, once you have installed Google Cloud SDK, a 'Google Cloud SDK Shell' shortcut will appear on the desktop or in the Start menu.
Use this shell instead of PowerShell or the Command Prompt.

* See what buckets you have:

```
gsutil ls
```
* Create a bucket (change `my-bucket` to your own bucket name):

```
gsutil mb gs://my-bucket/
```
Example:
```
LSE021353:~ vojnovic$ gsutil mb gs://my-bucket-01/
Creating gs://my-bucket-01/...
```
See [here](https://cloud.google.com/storage/docs/gsutil/commands/mb) for more on creating buckets.

Once your bucket is created, you should be able to see it in the Google Cloud Console:

![alt text](https://raw.githubusercontent.com/lse-st446/lectures2019/master/week01/class/figs/gcp-bucket0.png?token=AGU4ZEJCS4SJPSMVMZI3SPK5JVZRS)

Here are some details about the created bucket:

![alt text](https://raw.githubusercontent.com/lse-st446/lectures2019/master/week01/class/figs/gcp-bucket.png?token=AGU4ZENF2LFC2NLAOZMO5WC5JVZRS)



* Upload a file (you can get `helloworld.py` [here](helloworld.py)):

```
gsutil cp helloworld.py gs://my-bucket
```

Example:
```
LSE021353:files vojnovic$ gsutil cp helloworld.py gs://my-bucket-01
Copying file://helloworld.py [Content-Type=text/x-python]...
\ [1 files][  147.0 B/  147.0 B]                                                
Operation completed over 1 objects/147.0 B.                    
```

You can check if your file has been uploaded by checking what's inside your bucket:
```
gsutil ls gs://my-bucket
```

Example:
```
LSE021353:files vojnovic$ gsutil ls gs://my-bucket-01
gs://my-bucket-01/helloworld.py
```

* Download a file:

```
gsutil cp gs://my-bucket/helloworld.py Desktop
```
* Remove a file:

```
gsutil rm gs://my-bucket/helloworld.py
```
(You may not want to do this until the end of the exercise.)
* Remove the whole bucket:

```
gsutil rm -r gs://my-bucket
```
(You may not want to do this until the end of the exercise.)

Example:

```
LSE021353:files vojnovic$ gsutil rm -r gs://my-bucket-01
Removing gs://my-bucket-01/helloworld.py#1515399609789091...
Removing gs://my-bucket-01/google-cloud-dataproc-metainfo/646f3404-0769-4f92-9809-333074bbb661/cluster.properties#1515399984493936...
Removing gs://my-bucket-01/google-cloud-dataproc-metainfo/646f3404-0769-4f92-9809-333074bbb661/my-cluster-01-m/dataproc-initialization-script-0_output#1515400127345120...
Removing gs://my-bucket-01/google-cloud-dataproc-metainfo/646f3404-0769-4f92-9809-333074bbb661/my-cluster-01-m/dataproc-startup-script_output#1515400044457041...
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m -o ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://my-bucket-01/google-cloud-dataproc-metainfo/646f3404-0769-4f92-9809-333074bbb661/my-cluster-01-w-0/dataproc-initialization-script-0_output#1515400027778604...
Removing gs://my-bucket-01/google-cloud-dataproc-metainfo/646f3404-0769-4f92-9809-333074bbb661/my-cluster-01-w-0/dataproc-startup-script_output#1515400015951859...
Removing gs://my-bucket-01/google-cloud-dataproc-metainfo/646f3404-0769-4f92-9809-333074bbb661/my-cluster-01-w-1/dataproc-initialization-script-0_output#1515400029415869...
Removing gs://my-bucket-01/google-cloud-dataproc-metainfo/646f3404-0769-4f92-9809-333074bbb661/my-cluster-01-w-1/dataproc-startup-script_output#1515400015954649...
/ [8 objects]                                                                   
Operation completed over 8 objects.                                              
Removing gs://my-bucket-01/...

```


Reference: https://cloud.google.com/storage/docs/quickstart-gsutil
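
The bucket commands in this section can be collected into one small script. The version below is a sketch, not a definitive implementation: the `run` helper, the `DRY_RUN` switch and the function name are inventions for this example, and `my-bucket` is a placeholder that you must replace with your own bucket name. By default the script only prints the commands, so it is safe to run without a GCP account; set `DRY_RUN=0` to execute them for real.

```shell
DRY_RUN=${DRY_RUN:-1}          # 1: only print the commands, 0: execute them
BUCKET=${BUCKET:-my-bucket}    # placeholder bucket name, replace with your own

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

bucket_round_trip() {
    run gsutil mb "gs://$BUCKET/"                        # create the bucket
    run gsutil cp helloworld.py "gs://$BUCKET"           # upload a file
    run gsutil ls "gs://$BUCKET"                         # check that it arrived
    run gsutil cp "gs://$BUCKET/helloworld.py" Desktop   # download it again
}

bucket_round_trip
```

The removal commands are left out on purpose, since the bucket is still needed in the next section.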

## 3. Running jobs on a Google Cloud Dataproc cluster
In this section, we set up a new Dataproc cluster, run some PySpark code, and delete the cluster.

Google Cloud Dataproc is a managed Apache Spark and Apache Hadoop service. It allows you to run Apache Spark and Apache Hadoop clusters on GCP. You may find more information [in the documentation](https://cloud.google.com/dataproc/docs/).

In the following, we show you how to set up a cluster on Google Cloud Dataproc and run a PySpark job on this cluster.
You can also create a cluster via Google Cloud Console. You should be able to do this yourself by following the references at the end of this document.

### 3a. Create a Google Cloud Dataproc cluster

1\. Create cluster

Use the following command to set up a cluster.

Set `<projectid>` to your own project ID and `<bucketname>` to the name of the bucket you created in Section 2. You can choose your own `<clustername>`.

```
gcloud dataproc clusters create <clustername> --project <projectid> --bucket <bucketname>
```

Example:

```
LSE021353:files vojnovic$ gcloud dataproc clusters create my-cluster-01 --project integral-linker-185619 --bucket my-bucket-01 --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh
For the following cluster:
 - [my-cluster-01]
choose a zone:
 [1] asia-east1-a
 [2] asia-east1-b
 [3] asia-east1-c
 [4] asia-northeast1-a
 [5] asia-northeast1-b
 [6] asia-northeast1-c
 [7] asia-south1-a
 [8] asia-south1-b
 [9] asia-south1-c
 [10] asia-southeast1-a
 [11] asia-southeast1-b
 [12] australia-southeast1-a
 [13] australia-southeast1-b
 [14] australia-southeast1-c
 [15] europe-west1-b
 [16] europe-west1-c
 [17] europe-west1-d
 [18] europe-west2-a
 [19] europe-west2-b
 [20] europe-west2-c
 [21] europe-west3-a
 [22] europe-west3-b
 [23] europe-west3-c
 [24] southamerica-east1-a
 [25] southamerica-east1-b
 [26] southamerica-east1-c
 [27] us-central1-a
 [28] us-central1-b
 [29] us-central1-c
 [30] us-central1-f
 [31] us-east1-b
 [32] us-east1-c
 [33] us-east1-d
 [34] us-east4-a
 [35] us-east4-b
 [36] us-east4-c
 [37] us-west1-a
 [38] us-west1-b
 [39] us-west1-c
Please enter your numeric choice:  31

Waiting on operation [projects/integral-linker-185619/regions/global/operations/159ff2e2-c2ee-4552-a4ed-48060fc36956].
Waiting for cluster creation operation...done.                                 
Created [https://dataproc.googleapis.com/v1/projects/integral-linker-185619/regions/global/clusters/my-cluster-01] Cluster placed in zone [us-east1-b].
```

Again, once the cluster is created, you can see it in the Google Cloud Console:

![alt text](https://raw.githubusercontent.com/lse-st446/lectures2019/master/week01/class/figs/gcp-cluster0.png?token=AGU4ZELSE2G42AAIZ3K2J7C5JVZRS)

Here are some details about the cluster nodes:

![alt text](https://raw.githubusercontent.com/lse-st446/lectures2019/master/week01/class/figs/gcp-cluster.png?token=AGU4ZEKIJ2F4QBYC47ZBGSK5JVZRS)


Here are some further details about the master node:

![alt text](https://raw.githubusercontent.com/lse-st446/lectures2019/master/week01/class/figs/gcp-cluster-master.png?token=AGU4ZEOJDVK4C45S3KGU7LK5JVZRS)

### 3b. Submit a job

Once your cluster is ready, you can submit a job using the `gcloud dataproc jobs submit` command. Here we submit a simple hello world PySpark job.
```
gcloud dataproc jobs submit pyspark --cluster <clustername> helloworld.py
```
(Don't worry about the details yet, we will re-visit `helloworld.py`.)

### 3c. Delete a cluster

You may delete your cluster as follows. Once you have deleted your cluster, make sure to also delete your bucket, so that we don't waste our allocated credit.
```
gcloud dataproc clusters delete <clustername>
```

Example:
```
LSE021353:files vojnovic$ gcloud dataproc clusters delete my-cluster-01
The cluster 'my-cluster-01' and all attached disks will be deleted.

Do you want to continue (Y/n)?  y

Waiting on operation [projects/integral-linker-185619/regions/global/operations/ef21874e-3863-46a0-aa2d-c030c51e8750].
Waiting for cluster deletion operation...done.                                 
Deleted [https://dataproc.googleapis.com/v1/projects/integral-linker-185619/regions/global/clusters/my-cluster-01].
```

Delete the bucket using:
```
gsutil rm -r gs://<bucket_name>
```
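
The whole of Section 3 (create a cluster, submit a job, delete the cluster and the bucket) can be sketched as one script. Everything configurable here is a placeholder: the cluster, project and bucket names, the `run` helper and the `DRY_RUN` switch are inventions for this example. The `--region` flag and the value `europe-west2` are additions that do not appear in the commands above; recent `gcloud` releases generally expect a Dataproc region to be given, so treat that part as an assumption to check against your own installation. By default the script only prints the commands.

```shell
DRY_RUN=${DRY_RUN:-1}              # 1: only print the commands, 0: execute them
CLUSTER=${CLUSTER:-my-cluster}     # placeholder cluster name
PROJECT=${PROJECT:-my-project-id}  # placeholder project ID
BUCKET=${BUCKET:-my-bucket}        # placeholder bucket name
REGION=${REGION:-europe-west2}     # assumption: pick a region close to you

# Print the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

cluster_lifecycle() {
    # 3a: create the cluster
    run gcloud dataproc clusters create "$CLUSTER" \
        --project "$PROJECT" --bucket "$BUCKET" --region "$REGION"
    # 3b: submit the PySpark job
    run gcloud dataproc jobs submit pyspark \
        --project "$PROJECT" --cluster "$CLUSTER" --region "$REGION" helloworld.py
    # 3c: delete the cluster, then the bucket
    run gcloud dataproc clusters delete "$CLUSTER" \
        --project "$PROJECT" --region "$REGION"
    run gsutil rm -r "gs://$BUCKET"
}

cluster_lifecycle
```

Keeping the delete steps in the same script as the create step makes it harder to forget a running cluster, which would keep consuming credit.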

## 4. Running Jupyter notebooks on Google Cloud Dataproc clusters

In this section, we will set up a Dataproc cluster for PySpark and connect it to a Jupyter notebook.
Please try to get it to run by next week.

If you have difficulties setting it up, please get in contact with the teaching assistant.

The following instructions worked for me on macOS with Chrome, ignoring all the warnings that Chrome shows:
https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook

Windows and Linux use slightly different syntax; please refer to the first link in the References for the correct setup on your own machine.

### 4a. Create a cluster on which to run Jupyter notebook
This is very similar to what we did in Section 3a, except that we add the argument `--initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh`.


```
gcloud dataproc clusters create <clustername> --project <projectid> --bucket <bucketname> --initialization-actions gs://dataproc-initialization-actions/jupyter/jupyter.sh
```

This should yield output similar to:
```
Created [https://dataproc.googleapis.com/v1/projects/<project-name>/regions/global/clusters/<clustername>] Cluster placed in zone [europe-west2-c].
```

### 4b. Set up an SSH tunnel:

Replace `<clustername>` with the name of your cluster and `<zone>` with the zone in which your cluster was placed (e.g., `europe-west2-c`):
```
gcloud compute ssh --zone=<zone> --ssh-flag="-D" --ssh-flag="10000" --ssh-flag="-N" --ssh-flag="-n" "<clustername>-m" &
```

For me (Simon), the following worked:
```
gcloud compute --project "<project-name>" ssh --zone "europe-west2-c" "<cluster-name>-m" &
```
(Also try this command without the trailing `&`. In that way, you will get access to a command line on a remote machine on which you can run commands. You can exit it by using the command `exit`.)

### 4c. Configure your browser to run a Jupyter notebook GUI:
```
/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome http://<clustername>-m:8123 --proxy-server="socks5://localhost:10000" --host-resolver-rules="MAP * 0.0.0.0 , EXCLUDE localhost" --user-data-dir=/tmp/
```
For other browsers, you will need to figure this out in detail. Check out https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook and links therein for details. It is highly recommended that you follow the instructions on this website.

The following worked for me (Simon, cf. https://cloud.google.com/dataproc/docs/concepts/accessing/cluster-web-interfaces#create_an_ssh_tunnel):

```
export PORT=8123
export HOSTNAME=<cluster-name>-m
export PROJECT="project-simon"
export ZONE="europe-west2-c"

gcloud compute ssh ${HOSTNAME} \
    --project=${PROJECT} --zone=${ZONE}  -- \
    -D ${PORT} -N &

"/Applications/Google Chrome.app/Contents/MacOS/Google Chrome" \
      --proxy-server="socks5://localhost:${PORT}" \
      --user-data-dir=/tmp/${HOSTNAME}
```

Now you can open a Jupyter notebook, write your code and run it.
Below is a short list of available ports:

| Web UI | Port | URL |
| ------ | ---- | --- |
| Jupyter Notebook | 8123 | http://master-host-name:8123, i.e., in practice http://cluster-name-m:8123 |
| YARN ResourceManager | 8088 | http://master-host-name:8088 |
| HDFS NameNode | 9870* | http://master-host-name:9870 |
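
Since the master node is named after the cluster with `-m` appended, the addresses in the table can be generated from the cluster name. A small sketch follows; `ui_url` is a helper name made up for this example and `my-cluster` is a placeholder.

```shell
# Build the web UI address for a given cluster name and port.
ui_url() {
    echo "http://$1-m:$2"
}

CLUSTER=${CLUSTER:-my-cluster}   # placeholder cluster name
echo "Jupyter Notebook:     $(ui_url "$CLUSTER" 8123)"
echo "YARN ResourceManager: $(ui_url "$CLUSTER" 8088)"
echo "HDFS NameNode:        $(ui_url "$CLUSTER" 9870)"
```

These addresses only resolve in a browser that was started with the SOCKS proxy settings from Section 4c.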

# References

If you have any difficulties running the code above, have a look at the following links:

* Tutorial: https://cloud.google.com/dataproc/docs/tutorials/jupyter-notebook
* Blog post: https://cloud.google.com/blog/big-data/2017/02/google-cloud-platform-for-data-scientists-using-jupyter-notebooks-with-apache-spark-on-google-cloud
* FAQ: https://cloud.google.com/dataproc/docs/resources/faq
* Create cluster: https://cloud.google.com/dataproc/docs/guides/create-cluster
* Submit jobs: https://cloud.google.com/dataproc/docs/guides/submit-job
* Using the Python Client Library: https://cloud.google.com/dataproc/docs/tutorials/python-library-example
* Initialization actions: https://cloud.google.com/dataproc/docs/concepts/configuring-clusters/init-actions and https://github.com/GoogleCloudPlatform/dataproc-initialization-actions