### 0. Preliminary - Setting up the working environment of Virtual Machines on the okeanos-knossos platform

Among the prerequisites for the semester's assignment is the creation of Virtual Machines (VMs) and configuring the working environment using the Apache Hadoop and Apache Spark frameworks. Below is a detailed guide for the students regarding the configuration of this particular environment.

#### 0.1 Creating an account on the cloud service *~okeanos-knossos*
*~okeanos-knossos* is an Infrastructure-as-a-Service (IaaS) platform that allows users to create and manage virtual computing systems. In recent years, similar services such as Amazon AWS, Google Cloud, and Microsoft Azure have dominated the market because they give customers access to computational resources without the burden and cost of maintaining the hardware infrastructure. In contrast to these commercial services, *~okeanos-knossos* was developed for research purposes and is provided to students free of charge.

As part of this assignment, students are initially requested to create an account on the *~okeanos-knossos* service by following the link:  https://okeanos-knossos.grnet.gr/home/.

By choosing registration using their academic account at the National Technical University of Athens (NTUA), students can complete the creation of their academic accounts on this service.


#### 0.2 Enrollment in the course project
Next, in order to allocate the necessary computing resources, students must enroll in the project corresponding to the **Advanced Topics in Databases course**. Specifically, by following the link: https://astakos.okeanos-knossos.grnet.gr/ui/projects/search, students should search for the *project* `advancedDB.dblab.ntua.gr` and create a registration request.

#### 0.3 Creating a pair of cryptographic keys and registering it with the *~okeanos-knossos* service
Before creating the Virtual Machines, make sure that a pair of cryptographic keys exists on the computer from which you will remotely access your resources in the *~okeanos-knossos* service. To create a new pair of cryptographic keys, it is sufficient to execute the following command (in a Linux environment) and follow the instructions: `$ ssh-keygen`.

- Alternatively, a complete guide on how to create a key pair is available at the link: https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys-on-ubuntu-20-04.

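Upon completing the process, the contents of the public key can be displayed on the screen with the command:

`$ cat ~/.ssh/id_rsa.pub`

Note that `~/.ssh/id_rsa` (without the `.pub` suffix) is the private key and must never be shared.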

Next, you need to register the public key in your ~okeanos-knossos account. By following the link https://cyclades.okeanos-knossos.grnet.gr/ui/#public-keys/ and choosing to create a new key pair, users can import their key and associate it with their account using a chosen name.
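
Before pasting a key into the web form, it can help to confirm that the file really holds the public half of the pair. The following is a minimal sketch; `key_kind` is a hypothetical helper name introduced here for illustration, and it only inspects the first line of the file.

```bash
# Hypothetical helper (not part of the okeanos tooling): report whether a file
# looks like an OpenSSH public key, a private key, or something else.
key_kind() {
    local first_line
    # Both key formats can be recognised from their first line alone.
    IFS= read -r first_line < "$1" || true
    case "$first_line" in
        ssh-rsa\ *|ssh-ed25519\ *|ecdsa-sha2-*\ *) echo "public" ;;
        -----BEGIN*PRIVATE\ KEY-----*)            echo "private" ;;
        *)                                        echo "unknown" ;;
    esac
}
```

For example, `key_kind ~/.ssh/id_rsa.pub` should print `public`; if it prints `private`, the wrong file is about to be uploaded.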

#### 0.4 Virtual Machine Creation
The next step is the creation of the Virtual Machines (VMs), where the software infrastructure will be installed. By following the link https://cyclades.okeanos-knossos.grnet.gr/ui/#machines/icon/, users will have the ability to create new VMs. You are encouraged to create 2 VMs with the following characteristics:
 - Ubuntu Server LTS 16.04 OS
 - 4 CPUs
 - 8GB RAM
 - 30GB disk capacity

After initiating the creation of a new VM, students should make the above selections in the menus that follow.

**Caution!**
 - It is important, during the process of creating the Virtual Machines, to import the public key generated in the previous step; otherwise, access to the VMs will not be possible.
 - After the VMs are created, it's crucial to store the automatically generated system password for the user because, in case of loss, the process will need to be repeated.


Once the virtual machines are successfully created, users can access them with the command:
```
$ ssh user@snf-****-ok-kno.grnetcloud.net
```

Since this is the first time remote connection to the specific system is requested, the local operating system will prompt for confirmation of the action:
```
Are you sure you want to continue connecting (yes/no/[fingerprint])? yes
```

Subsequently, it is recommended to change the automatically generated password using the command:
```
$ passwd
```

The same steps should be followed for creating the second VM.

#### 0.5 Configuration of the public and private network
The *~okeanos-knossos* service provides us with a unique public IPv4 address. Network resource management can be accessed through the link: https://cyclades.okeanos-knossos.grnet.gr/ui/#networks/. It is recommended for students to assign this specific IP to one of their VMs (the master). Subsequently, in the following steps, the public IP will need to be disconnected from the specified VM and connected to the other one to successfully complete the operating system upgrade process (see 1.2 below).

- **Be very cautious with the use of the public IP. You can use a firewall, disconnect the IP when not in use, or even disable your virtual machines entirely. In the event of an attack, there is a possibility that your IP may be automatically classified as vulnerable, and you could lose access to the IPv4 address space and, of course, access to your cluster's web applications.**
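
As one possible way to apply the firewall advice above, the sketch below prints a restrictive `ufw` rule set without executing it, so that it can be reviewed first. `firewall_plan` is a hypothetical helper, the choice of rules is an assumption rather than a course requirement, and the ports are the web UI ports used later in this guide (9870, 8088, 18080).

```bash
# Hypothetical helper: print (but do not run) a restrictive ufw rule set.
# $1 = the public IP you browse from, $2 = the private subnet of the two VMs.
firewall_plan() {
    local admin_ip="$1" private_net="$2" port
    echo "ufw default deny incoming"
    echo "ufw default allow outgoing"
    echo "ufw allow 22/tcp"                 # keep SSH reachable
    echo "ufw allow from $private_net"      # master <-> worker traffic
    # Web UIs used later in this guide: HDFS, YARN, Spark history server.
    for port in 9870 8088 18080; do
        echo "ufw allow from $admin_ip to any port $port proto tcp"
    done
    echo "ufw enable"
}
```

Running `firewall_plan 203.0.113.5 192.168.0.0/24` (both values are placeholders) lists the commands; once they look right, each can be run with `sudo` on the VM that holds the public IP. Keep the SSH rule in place before enabling the firewall, otherwise the VM may become unreachable.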

Furthermore, for your convenience, it is recommended to change the hostnames of the VMs. Run the first command on the master and the second on the worker:

```bash
$ sudo hostnamectl set-hostname okeanos-master   # on the master
$ sudo hostnamectl set-hostname okeanos-worker   # on the worker
```

Next is the configuration of the private network through which the two VMs will communicate. By following the link: https://cyclades.okeanos-knossos.grnet.gr/ui/#networks/, users should create a private IPv4 network and assign an IP address from this network to each VM.

Afterwards, you should add the IP of your private network to both machines with the corresponding name in the /etc/hosts file. The final image of this specific file should be as follows for both nodes:

```text
127.0.0.1       localhost
192.168.0.xxx   okeanos-master
192.168.0.xxx   okeanos-worker


# The following lines are desirable for IPv6 capable hosts
::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters
```
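
A quick way to check that both names were entered correctly is to look them up in the file. The sketch below uses a hypothetical helper, `hosts_lookup`, which reads a hosts-style file and prints the address mapped to a name; it is an illustration, not a replacement for the system resolver.

```bash
# Hypothetical helper: print the address that a hosts-style file maps to a name.
# $1 = hostname to look up, $2 = file to read (defaults to /etc/hosts).
hosts_lookup() {
    awk -v name="$1" '
        /^[ \t]*#/ { next }                  # skip comment lines
        { for (i = 2; i <= NF; i++)          # field 1 is the IP, the rest are names
              if ($i == name) { print $1; exit } }
    ' "${2:-/etc/hosts}"
}
```

On each VM, `hosts_lookup okeanos-master` and `hosts_lookup okeanos-worker` should print the private addresses assigned above; empty output means the entry is missing or misspelled.
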
Next, we need to restart both VMs with the command:
```
$ sudo reboot
```
After the VMs restart, we connect to the master and create a new cryptographic key that will be used for this node to access both itself and the worker through SSH. Similar to the previous steps (see step 0.3):
```bash
$ ssh-keygen -t rsa #press [ENTER] as many times as prompted
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
```
We print the contents of the key in the terminal with the following command and then insert it into the worker VM:
```bash
$ cat ~/.ssh/id_rsa.pub #on the master
$ vim ~/.ssh/authorized_keys #on the worker: copy above output and paste it here
```
We confirm SSH access from the master to both VMs.
```bash
$ ssh okeanos-master
$ ssh okeanos-worker
```

### 1. Optional (**RECOMMENDED!**): Updating and upgrading the VMs' operating system.

We recommend updating and upgrading the software on both of your VMs to ensure full compatibility with the latest versions of all the software you may want to use.
#### 1.1 Ubuntu 16.04 -> Ubuntu 18.04
First, connect to your VM:
```bash
$ ssh user@snf-****-ok-kno.grnetcloud.net
```
Next, update + upgrade your existing OS software:
```bash
$ sudo apt update && sudo apt upgrade -y
```
Enter the password whenever prompted and accept all changes. When asked about installing new versions of configuration files, please select '`Install the package maintainer's version`' (unlike what you see in the picture below :D ).
![Select the 'Install the package maintainer's version' option](https://onedrive.live.com/embed?resid=23805F5DB37ABB76%212098&authkey=%21AAcRhT7t6gVEXKY&width=1194&height=336)

Restart the VM:
```bash
$ sudo reboot
```
After giving it a few seconds to reboot, reconnect (via SSH) and enter the following command to initiate the version upgrade:
```bash
$ sudo do-release-upgrade
```
Following the instructions, we accept the upgrade initially by choosing `'y'`. The next prompt asks us to press `[ENTER]`, and then another `'y'` to proceed with the process. The next dialogue prompts us to select our keyboard configuration, and then we are asked again if we want to keep previous system settings. We should once again select `Install the package maintainer's version`.

![Select the 'Install the package maintainer's version' option](https://onedrive.live.com/embed?resid=23805F5DB37ABB76%212099&authkey=%21ALpqz7AEdZ1O7Gk&width=483&height=198)

Next, a dialogue will appear asking us to select a system disk for the installation of the GRUB (boot loader). We should choose `/dev/vda` by pressing `[SPACE]` and then `[ENTER]`.

**Caution**: If you choose the wrong option at this step, your VMs will fail to boot and the whole process will have to be repeated from the start!

PS. Of course, anyone interested can find more information about GRUB on the internet (https://en.wikipedia.org/wiki/GNU_GRUB).
![Pick '/dev/vda'](https://onedrive.live.com/embed?resid=23805F5DB37ABB76%212100&authkey=%21AH0yW9WfYKfCWRA&width=1251&height=378)

As before, we are asked once again if we want to keep previous system settings. Once again, we should select 'Install the package maintainer's version'.

![Select the 'Install the package maintainer's version' option](https://onedrive.live.com/embed?resid=23805F5DB37ABB76%212101&authkey=%21AMO4abNJt-MmMLs&width=482&height=198)

Finally, another prompt will ask us to remove installed packages that are no longer necessary. After we accept, the system will ask us to proceed with a restart and will then boot into the new version of the operating system.

#### 1.2 Ubuntu 18.04 -> Ubuntu 20.04
We wait for a few seconds for the reboot and then connect to the VM via SSH. We repeat the command to upgrade the version of the Ubuntu operating system:
```bash
$ sudo do-release-upgrade
```
Following the instructions, we accept the upgrade initially by choosing `'y'`. The next prompt asks us to press `[ENTER]`, and then another `'y'` to proceed with the process. Then, we will be asked for our permission to restart certain services that the system is running. We grant permission.
![Allow the system to restart services](https://onedrive.live.com/embed?resid=23805F5DB37ABB76%212102&authkey=%21AEoQG9snCgmMS2Q&width=1251&height=233)
We accept the upgrade of the LXD snap package to version `4.0`.
![Upgrade LXD snap to version 4.0](https://onedrive.live.com/embed?resid=23805F5DB37ABB76%212103&authkey=%21AOXa1TPtYBxHEAA&width=1203&height=325)
Similarly to previous steps, another prompt will ask us to remove installed packages that are no longer necessary. After accepting it, the system will prompt us to proceed with a restart and will start loading the latest version of the operating system.

- *Note*: Installing the LXD snap requires a network with a public IPv4 address. Therefore, if you want to install it on both VMs in your team, you should disconnect the public IP from the master after its upgrade, connect it to the worker, and after the worker is also upgraded, return it to the master.

#### 1.3 Ubuntu 20.04 -> Ubuntu 22.04
We wait for a few seconds for the reboot and then connect to the VM with SSH. We repeat the command to upgrade the version of the Ubuntu operating system:
```bash
$ sudo do-release-upgrade
```
Just like the previous times, we accept the upgrade initially by choosing `'y'`. The next prompt asks us to press `[ENTER]`, and then another `'y'` to proceed with the process. We accept the removal of unnecessary software packages and restart as in the previous steps.
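
After each upgrade and reboot, it may be worth confirming which release the VM actually booted into before moving on. A minimal sketch, assuming the standard `/etc/os-release` file; `release_id` is a hypothetical helper name:

```bash
# Hypothetical helper: print VERSION_ID from an os-release file.
# $1 = file to read (defaults to /etc/os-release).
release_id() (
    # The subshell body keeps the sourced variables out of the calling shell.
    . "${1:-/etc/os-release}"
    echo "$VERSION_ID"
)
```

Calling `release_id` on a VM should print `18.04`, `20.04`, and finally `22.04` as the three upgrade steps are completed.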

### 2. Installation and configuration of Apache Spark over YARN on the small cluster of Virtual Machines
**Note:** The following instructions assume that the actions described above in the same guide have been followed faithfully. Students are free to change the given pieces of code as per their own preferences in the configuration.

#### 2.1 Java Installation
We continue with the installation of Apache Hadoop on our new cluster. Initially, we need to install the Java version supported by the official repositories of our operating system on all nodes:
```
$ sudo apt install default-jdk -y
```
Successful installation can be verified with the following command:
```
$ java -version
```
#### 2.2 Installation of Hadoop and Apache Spark
In order to install the latest versions of Apache Spark and Hadoop, we create the `~/opt` directory where we will store the executables of the latest versions of the two Apache projects.
```bash
$ cd ~
$ mkdir -p ./opt/bin
```
Next, we download the compressed files from the official websites of both projects, extract their contents, move them to the `~/opt/bin` directory, and create links in the main  `~/opt` directory.
```bash
$ wget https://dlcdn.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
$ tar -xvzf hadoop-3.3.6.tar.gz
$ mv hadoop-3.3.6 ./opt/bin
$ wget https://dlcdn.apache.org/spark/spark-3.5.0/spark-3.5.0-bin-hadoop3.tgz
$ tar -xvzf spark-3.5.0-bin-hadoop3.tgz
$ mv ./spark-3.5.0-bin-hadoop3 ./opt/bin/
$ cd ./opt
$ ln -s ./bin/hadoop-3.3.6/ ./hadoop
$ ln -s ./bin/spark-3.5.0-bin-hadoop3/ ./spark
$ cd
$ rm hadoop-3.3.6.tar.gz
$ rm spark-3.5.0-bin-hadoop3.tgz
$ mkdir ~/opt/data
$ mkdir ~/opt/data/hadoop
$ mkdir ~/opt/data/hdfs
```
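
Since the archives are fetched over the network, students may want to check them against the SHA-512 digests that Apache projects typically publish alongside their release files, before deleting the downloads. The sketch below assumes the expected digest has been copied by hand from the project's download page; `sha512_matches` is a hypothetical helper.

```bash
# Hypothetical helper: compare a file's SHA-512 digest with an expected value.
# $1 = file, $2 = expected hex digest (copied from the project's download page).
sha512_matches() {
    local actual expected
    actual=$(sha512sum "$1" | awk '{ print $1 }')
    # Compare case-insensitively, since published digests vary in letter case.
    expected=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    [ "$actual" = "$expected" ]
}
```

A call such as `sha512_matches hadoop-3.3.6.tar.gz "<digest>" && echo OK` prints `OK` only when the two digests agree.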

#### 2.3 Configuration of environment variables
To configure environment variables, you will initially edit the `~/.bashrc` file using your preferred text editor. This file is executed every time a terminal connection to your virtual machines is opened. You can add the following lines to it:
```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64  #Value should match: dirname $(dirname $(readlink -f $(which java)))
export HADOOP_HOME=/home/user/opt/hadoop
export SPARK_HOME=/home/user/opt/spark
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin:$SPARK_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
export PYSPARK_PYTHON=python3
```
Make sure to save the changes to the ~/.bashrc file after adding these lines.
To update the environment, execute the following command to apply the changes from the script:
```bash
$ source ~/.bashrc
```
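
The comment next to `JAVA_HOME` above gives a one-line recipe for finding the right value. The same steps can be sketched as a small function; `java_home_of` is a hypothetical name, and the sketch assumes GNU `readlink -f`, which is available on Ubuntu.

```bash
# Hypothetical helper: derive a JAVA_HOME candidate from a java executable.
# $1 = path to java (defaults to whatever `java` resolves to on PATH).
java_home_of() {
    local java_bin
    java_bin=$(readlink -f "${1:-$(command -v java)}")   # follow the symlink chain
    dirname "$(dirname "$java_bin")"                     # strip the trailing /bin/java
}
```

If the output of `java_home_of` differs from the `JAVA_HOME` line in `~/.bashrc`, the exported value should be adjusted to match it.
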
#### 2.4 Hadoop Distributed File System configuration
The Hadoop configuration files are located in the `/home/user/opt/hadoop/etc/hadoop` directory. Since we've already set the `HADOOP_HOME` variable to point to the `/home/user/opt/hadoop` directory, we refer to the same path by writing: `$HADOOP_HOME/etc/hadoop/`. Below, we record the changes that need to be made:

We start with `$HADOOP_HOME/etc/hadoop/hadoop-env.sh`, where the following line needs to be added:
```bash
export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64
```

We edit `$HADOOP_HOME/etc/hadoop/core-site.xml` so that it contains the following:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>/home/user/opt/data/hadoop</value>
        <description>Parent directory for other temporary directories.</description>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://okeanos-master:54310</value>
        <description>The name of the default file system. </description>
    </property>
</configuration>
```
We also edit the file `$HADOOP_HOME/etc/hadoop/hdfs-site.xml` so that its contents are the following:
```xml
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Default block replication.</description>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/home/user/opt/data/hdfs</value>
    </property>
</configuration>
```
Finally, we create a new file, `$HADOOP_HOME/etc/hadoop/workers` and edit it to contain the next two lines:
 ```text
 okeanos-master
 okeanos-worker
 ```
We register both VMs as workers, as the master will also act as one.
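
Typos in the XML files are a common source of startup failures, so it can be useful to read a value back after editing. The following is a rough sketch, not a general XML parser: `hadoop_prop` is a hypothetical helper that assumes each `<name>` and `<value>` tag sits on its own line, as in the files above.

```bash
# Hypothetical helper: print the <value> that follows a given <name> in a
# Hadoop *-site.xml file. Assumes one tag per line; not a general XML parser.
# $1 = property name, $2 = path to the XML file.
hadoop_prop() {
    awk -v want="$1" '
        /<name>/ {
            line = $0
            sub(/.*<name>[ \t]*/, "", line)      # drop everything up to the name
            sub(/[ \t]*<\/name>.*/, "", line)    # drop the closing tag and beyond
            found = (line == want)
            next
        }
        found && /<value>/ {
            line = $0
            sub(/.*<value>/, "", line)
            sub(/<\/value>.*/, "", line)
            print line
            exit
        }
    ' "$2"
}
```

For instance, `hadoop_prop fs.defaultFS "$HADOOP_HOME/etc/hadoop/core-site.xml"` should print `hdfs://okeanos-master:54310`. Hadoop's own `hdfs getconf -confKey fs.defaultFS` reports the value the framework actually loaded and is a useful cross-check.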


#### 2.5 Starting and experimenting with HDFS
We start HDFS using the following commands:
 ```bash
 $ $HADOOP_HOME/bin/hdfs namenode -format
 $ start-dfs.sh
 ```
We confirm that the process was successful by accessing the web interface on the master's public IP using a browser: `http://83.212.xxx.xxx:9870`. We should see 2 available live nodes.
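
The same check can be made from the terminal, which helps when the public IP is detached. The sketch below assumes the usual `Live datanodes (N):` line in the output of `hdfs dfsadmin -report`; `live_datanodes` is a hypothetical helper that extracts the number from it.

```bash
# Hypothetical helper: read `hdfs dfsadmin -report` output on stdin and print
# the number from its "Live datanodes (N):" line.
live_datanodes() {
    sed -n 's/^Live datanodes (\([0-9][0-9]*\)):.*/\1/p'
}
```

On the master, `hdfs dfsadmin -report | live_datanodes` should print `2`. Running `jps` on each VM additionally lists the Java daemons (for example `NameNode` and `DataNode`) that are currently up.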

#### 2.6 Hadoop YARN configuration
We edit `$HADOOP_HOME/etc/hadoop/yarn-site.xml`, the contents of which need to be the following:

```xml
<?xml version="1.0"?>
<configuration>
<!-- Site specific YARN configuration properties -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>okeanos-master</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <!--Insert the public IP of your master machine here-->
        <value>83.212.xxx.xxx:8088</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>6144</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>6144</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>128</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle,spark_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
        <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
        <value>org.apache.spark.network.yarn.YarnShuffleService</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services.spark_shuffle.classpath</name>
        <value>/home/user/opt/spark/yarn/*</value>
    </property>
</configuration>

```
We start YARN using the following command:
 ```bash
 $ start-yarn.sh
 ```
We confirm that the process was successful by accessing the web interface on the master's public IP using a browser: `http://83.212.xxx.xxx:8088/cluster`. We should see 2 available live nodes.


#### 2.7 Spark configuration
We create the file `$SPARK_HOME/conf/spark-defaults.conf`, where we define the basic properties of our job execution environment, and we copy the following contents:
```text
spark.eventLog.enabled          true
spark.eventLog.dir              hdfs://okeanos-master:54310/spark.eventLog
spark.history.fs.logDirectory   hdfs://okeanos-master:54310/spark.eventLog
spark.master                    yarn
spark.submit.deployMode         client
spark.driver.memory             1g
spark.executor.memory           1g
spark.executor.cores            1
```
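
To see how `spark.executor.memory` relates to the YARN limits set earlier, the arithmetic can be sketched as follows. This is an estimate under two assumptions: Spark's default memory overhead rule (10% of executor memory, with a floor of 384 MB) and YARN rounding each request up to a multiple of `yarn.scheduler.minimum-allocation-mb`. `executor_container_mb` is a hypothetical helper.

```bash
# Hypothetical helper: estimate the YARN container size for one Spark executor.
# $1 = spark.executor.memory in MB, $2 = yarn.scheduler.minimum-allocation-mb.
executor_container_mb() {
    local exec_mb="$1" min_alloc="$2" overhead total
    overhead=$(( exec_mb / 10 ))            # assumed default: 10% of executor memory
    if [ "$overhead" -lt 384 ]; then
        overhead=384                        # assumed default floor of 384 MB
    fi
    total=$(( exec_mb + overhead ))
    # Round up to the next multiple of the minimum allocation.
    echo $(( (total + min_alloc - 1) / min_alloc * min_alloc ))
}
```

With the values in this guide, `executor_container_mb 1024 128` gives 1408, so a node offering 6144 MB has room for roughly 6144 / 1408 = 4 such containers, before accounting for the application master's own container.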

We create the directory in HDFS where historical data of our jobs will be stored and start the Spark history server:
```bash
$ hadoop fs -mkdir /spark.eventLog
$ $SPARK_HOME/sbin/start-history-server.sh
```

We confirm that the process has been successful by accessing the web interface at the public IP of the master: `http://83.212.xxx.xxx:18080`.

#### 2.8 Running a sample Spark application to verify correctness
We are ready to use our fresh infrastructure. Execute:
```bash
$ spark-submit --class org.apache.spark.examples.SparkPi ~/opt/spark/examples/jars/spark-examples_2.12-3.5.0.jar 100
```
You should be able to monitor the progress of the job in the YARN web application (`http://83.212.xxx.xxx:8088`) while it runs and, after it has completed, review it in the history server (`http://83.212.xxx.xxx:18080`).
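
In client mode the result is printed in the terminal, but it is easy to miss among the log messages. A small sketch for isolating it, assuming the example's usual `Pi is roughly ...` output line; `pi_line` is a hypothetical helper:

```bash
# Hypothetical helper: pick SparkPi's result line out of mixed driver output
# read from stdin.
pi_line() {
    grep -o 'Pi is roughly [0-9.]*'
}
```

Appending `2>/dev/null | pi_line` to the `spark-submit` command above should leave a single line with the estimate; no output at all suggests the job did not finish successfully.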




### 3. Homework!

*   Find the differences between client and cluster mode execution in YARN. What will you use for development and what in production?
*   Find out how you can change the number of Spark executors by running `spark-submit` from the command line. Experiment with different numbers of executors. What do you observe (and why)?


