# 0. Jupyter 101

Let's take a quick look around the Jupyter environment.
* **What is a cell?** There are (at least) two types of cells: `Markdown` and `code`. You will intuitively get what they mean.
* **How do you run a cell?** Select a cell below and run it by pressing `Ctrl (or Cmd) + Return` or `Shift + Return`.

# 1. Setup Network Configurations

* To use Hadoop (and most software in the Hadoop ecosystem), we need to configure
    * Hosts and hostname to let all machines be accessible
    * SSH keys since most programs are executed in remote machines through SSH
    * Profiles such as paths, environment variables
* You will modify `/etc/hosts`, `/etc/hostname`, `~/.ssh/authorized_keys`, `/etc/environment`


## 1) /etc/hosts

Referring to each machine by a hostname (e.g., master, worker1, worker2) is much handier than using its IP address. We need to modify `/etc/hosts` to register the IP addresses and their hostnames.

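**[TODO]** Replace the placeholders `$MASTER_IP`, `$WORKER1_IP`, and `$WORKER2_IP` in the cell below with the *Private IPs* shown on the instances page. Type the addresses directly into the commands: because of the single quotes, the shell would not expand a variable there.
 You may use public IPs instead, but be aware that you would need to update the entries whenever you restart the VMs, since AWS changes the public IPs assigned to VMs on restart.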



In [None]:
%%bash
sudo cp ~/hosts.bak /etc/hosts # Restore the original file in case this cell is run more than once.
echo '$MASTER_IP master' | sudo tee -a /etc/hosts
echo '$WORKER1_IP worker1' | sudo tee -a /etc/hosts
echo '$WORKER2_IP worker2' | sudo tee -a /etc/hosts

Make sure all entries (master, worker1, worker2) are set properly.
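
One way to check this without reading the file by eye is to test each hostname explicitly. The sketch below is only an illustration: it works on a temporary file rather than the real `/etc/hosts`, and the `172.31.0.x` addresses are made-up placeholders, not the IPs of your instances.

```shell
# Build a throwaway hosts file (the IPs are hypothetical placeholders).
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
127.0.0.1 localhost
172.31.0.10 master
172.31.0.11 worker1
172.31.0.12 worker2
EOF

# Report any hostname that has no entry in the file.
missing=0
for host in master worker1 worker2
do
    if grep -Eq "[[:space:]]${host}([[:space:]]|\$)" "$hosts_file"
    then
        echo "$host: found"
    else
        echo "$host: MISSING"
        missing=1
    fi
done
```

To run the same check on a real machine, set `hosts_file=/etc/hosts` instead of creating the temporary file.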

**[Note]** In the SSH commands below,
* the `-o "StrictHostKeyChecking no"` option skips the interactive host-key confirmation (the `yes` prompt), which cannot be answered from Jupyter, and
* the `-i $KEY_FILE` option uses the given `pem` file as the credential.

The key file was created when I launched the initial instance used to build the AMI. We will use it temporarily until we add our own SSH credentials below.

In [None]:
%%bash

for host in master worker1 worker2
do
    # Send an SSH request that echoes a string on the remote host.
    ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host "echo 'hi from $host'"
done
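
Note that the double quotes around the remote command matter: `$host` is substituted on the local machine before the command is sent over SSH. A minimal sketch of the difference, using plain strings and no SSH (the variable names `expanded` and `literal` are just for illustration):

```shell
host=worker1

# Double quotes: the local shell substitutes $host when the string is built.
expanded="echo 'hi from $host'"

# Single quotes: the text is kept literally, so $host is not substituted.
literal='echo "hi from $host"'

echo "$expanded"
echo "$literal"
```

The same rule explains why the `echo '$MASTER_IP master'` lines earlier write the placeholder text as-is unless you replace it by hand.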

So that the other machines can also resolve each other's hostnames (or at least the master's), let's replicate `/etc/hosts` to them.

In [None]:
%%bash
for host in worker1 worker2
do
    cat /etc/hosts | ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host 'sudo tee /etc/hosts'
done

## 2) /etc/hostname

By default, AWS sets a VM's hostname based on its private IP (e.g., `ip-172-31-22-59`), which we don't want to use. Instead, let's change it to the names that we (and, more precisely, Hadoop) will use (e.g., master, worker1, worker2). You can set it by running `hostname $HOST_NAME`, but the change is reverted when the machine is rebooted, so you would need to apply it again, much like the public IP. The more durable solution is to write the name to `/etc/hostname`.

In [None]:
%%bash
for host in master worker1 worker2
do
    ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host "echo '$host' | sudo tee /etc/hostname"
done

The change in `/etc/hostname` takes effect only after a reboot. Instead of rebooting, let's also change the hostname of the currently running system.

In [None]:
%%bash
for host in master worker1 worker2
do
    ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host "sudo hostname $host"
done

## 3) SSH setup for passwordless login

To use Hadoop, the machines must be able to log in to each other without a password. Let's create an SSH key pair on each machine and register the public keys in `authorized_keys`.


In [None]:
%%bash

# Create an SSH key pair on every node (master and workers)
for host in master worker1 worker2
do
    ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host "ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa"
done

In [None]:
%%bash
for host in master worker1 worker2
do
    ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host cp ~/authorized_keys.bak ~/.ssh/authorized_keys
    ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host cp ~/known_hosts.bak ~/.ssh/known_hosts

    for host_to_scan in master worker1 worker2
    do
        ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host_to_scan cat ~/.ssh/id_rsa.pub | ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host "tee -a ~/.ssh/authorized_keys"        
        ssh -o "StrictHostKeyChecking no" -i ~/ssds2-2018.pem $host "ssh-keyscan -t rsa $host_to_scan >> ~/.ssh/known_hosts"
    done
done
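
The cell above first restores `authorized_keys` from a backup, so running it twice does not pile up duplicate keys. An alternative, shown here only as a sketch, is to append a key only when it is not already present. The key string and the temporary file are hypothetical stand-ins for a real public key and `~/.ssh/authorized_keys`.

```shell
auth_file=$(mktemp)                              # stand-in for ~/.ssh/authorized_keys
pub_key='ssh-rsa AAAAEXAMPLEKEY ubuntu@master'   # hypothetical key, not a real one

add_key() {
    # -x matches the whole line, -F treats the key as a fixed string.
    grep -qxF -- "$1" "$auth_file" || echo "$1" >> "$auth_file"
}

add_key "$pub_key"
add_key "$pub_key"    # second call finds the key and appends nothing

# Count the entries in the file.
grep -c . "$auth_file"
```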

# 2. Setup Hadoop

## 1) Download and extract the hadoop binaries

We downloaded the Hadoop binary distribution (from http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-3.0.1/hadoop-3.0.1.tar.gz) and extracted it to the home directory (`/home/ubuntu/hadoop`).

In [None]:
%%bash
ls /home/ubuntu/hadoop # Equivalently ~/hadoop

## 2) Setup environment variables

Hadoop processes (and related programs like Spark) locate Hadoop's directory by looking up the `HADOOP_HOME` environment variable, so we need to set it in the host environment.
Besides, to make the Hadoop commands executable from any location, their path should be added to the `PATH` environment variable.
To save you the effort, I already modified `/etc/environment`. Let's look at the file and check whether the Hadoop settings were added correctly.

In [None]:
%%bash
cat /etc/environment

**[Note]** You may ask why the variables are not set in a user-level profile (e.g., `.profile`, `.bash_profile`). This is because the variables are also needed by commands run via SSH, and in many cases they were not found when set in a user-level profile. Although you may not always be able to change system-wide configuration files (e.g., `/etc/environment`), I assumed that you have admin privileges, since you need them to install frameworks like Hadoop or Spark on your cluster.
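
To make the idea concrete, here is a rough sketch of the two settings involved. It assumes Hadoop lives in `/home/ubuntu/hadoop` as described above, with its commands and start scripts in the `bin` and `sbin` subdirectories. The actual contents of your `/etc/environment` may differ, and that file holds plain `KEY=value` lines rather than shell code, so this snippet only mimics its effect inside a shell.

```shell
# Mimic the effect of the system-wide settings in the current shell.
HADOOP_HOME=/home/ubuntu/hadoop
PATH="$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin"
export HADOOP_HOME PATH

# Check that the Hadoop bin directory is on the PATH.
case ":$PATH:" in
    *":$HADOOP_HOME/bin:"*) on_path=yes ;;
    *)                      on_path=no ;;
esac
echo "HADOOP_HOME=$HADOOP_HOME (bin on PATH: $on_path)"
```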

If everything is set, `hadoop version` should work.

In [None]:
%%bash
hadoop version

## 3) Setup Hadoop configuration files

Hadoop is configured via XML files in `$HADOOP_HOME/etc/hadoop`.

To save time, we prepared the configuration files in your home directory.
Just take a look at what the files look like.

### core-site.xml

`core-site.xml` specifies the high-level configuration for the entire cluster, such as file systems, security, high availability, etc. Most importantly, this configuration file contains the location of the HDFS NameNode.

In [None]:
cat ~/core-site.xml
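
For orientation, a minimal `core-site.xml` might look like the sketch below. It assumes the NameNode runs on `master` and listens on port 9000, which matches the `hdfs://master:9000/` URI used later in this notebook; the prepared file in your home directory may contain more settings. The file is written to a temporary location so that nothing real is overwritten.

```shell
core_site=$(mktemp)
cat > "$core_site" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <!-- URI of the default file system, i.e. the HDFS NameNode -->
    <name>fs.defaultFS</name>
    <value>hdfs://master:9000</value>
  </property>
</configuration>
EOF

# Show the line that tells clients where the NameNode is.
grep '<value>' "$core_site"
```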

### hdfs-site.xml
`hdfs-site.xml` specifies the HDFS-specific configuration. For example, we can configure where the actual file blocks are stored (from the operating system's point of view), the replication factor, etc. Here we specify the location of the files and the address of the HDFS web UI.

In [None]:
cat ~/hdfs-site.xml
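
Below is a sketch of the kind of settings described above, again written to a temporary file. The property names are standard HDFS ones, but every value is an assumption for illustration: the directories under `/home/ubuntu/hdfs` are hypothetical, the replication factor of 2 is chosen arbitrarily, and the web UI address uses port 50070 only because that is the port this notebook opens later.

```shell
hdfs_site=$(mktemp)
cat > "$hdfs_site" <<'EOF'
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>       <!-- NameNode metadata -->
    <value>/home/ubuntu/hdfs/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>       <!-- file blocks on each worker -->
    <value>/home/ubuntu/hdfs/datanode</value>
  </property>
  <property>
    <name>dfs.replication</name>             <!-- copies kept of each block -->
    <value>2</value>
  </property>
  <property>
    <name>dfs.namenode.http-address</name>   <!-- HDFS web UI -->
    <value>master:50070</value>
  </property>
</configuration>
EOF

# Count the properties defined in the file.
grep -c '<name>' "$hdfs_site"
```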

### workers
`workers` (previously `slaves`) specifies which nodes will run as workers (i.e., DataNode in HDFS and NodeManager in YARN). You simply list the hostnames, one per line.

In [None]:
cat ~/workers

In [None]:
%%bash 

for host in master worker1 worker2
do
    ssh $host cp ~/core-site.xml ~/hadoop/etc/hadoop/core-site.xml
    ssh $host cp ~/hdfs-site.xml ~/hadoop/etc/hadoop/hdfs-site.xml
    ssh $host cp ~/workers ~/hadoop/etc/hadoop/workers
done

## 4) Format Namenode

Now it's time to format the NameNode to initialize its metadata. The only thing you need to do is run this one-line command:

In [None]:
%%bash
hdfs namenode -format
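
Formatting wipes the NameNode's metadata, so running it again on a cluster that already holds data would lose track of the existing files. One possible guard, sketched here, is to look for the `current/VERSION` file that a formatted NameNode keeps in its metadata directory. The `name_dir` below is a temporary directory standing in for whatever `dfs.namenode.name.dir` is set to in your `hdfs-site.xml`.

```shell
name_dir=$(mktemp -d)    # stand-in for the real dfs.namenode.name.dir

is_formatted() {
    # A formatted NameNode directory contains current/VERSION.
    [ -f "$1/current/VERSION" ]
}

if is_formatted "$name_dir"
then
    first_check="already formatted"
else
    first_check="not formatted yet"   # here you would run: hdfs namenode -format
fi
echo "$first_check"

# Simulate what the directory looks like after formatting.
mkdir -p "$name_dir/current" && touch "$name_dir/current/VERSION"
is_formatted "$name_dir" && echo "now formatted"
```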

## 5) Start HDFS daemons

In [None]:
%%bash
start-dfs.sh

## 6) Explore HDFS Web UI!

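Go to `http://<MASTER_IP>:50070` and enjoy the nice web UI provided by HDFS. (The port presumably comes from the web UI address set in `hdfs-site.xml` above; an unconfigured Hadoop 3 installation serves the NameNode UI on port 9870.)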

# 3. Use HDFS Commands

## 1) Which commands can you use in HDFS?

HDFS provides many commands (remember the `hdfs namenode -format` command that you used above).
You will mostly use the `hdfs dfs` command, which is the file system interface (similar to `ls`, `cp`, `mv`, `rm`).

You can see the list of file system commands by running `hdfs dfs`:

In [None]:
%%bash
hdfs dfs 

## 2) [TODO] Let's upload a file to HDFS!

Under your home directory (in `/home/ubuntu/spark_inputs`), you can find the dataset files that we downloaded. Let's upload one of them using an HDFS command.

In [None]:
%%bash

### Please replace the arguments appropriately in the below command
# Hint: 1) there are variations: you can use either copyFromLocal or put (or else)
#       2) the simplest target directory is '/' (Advanced: Try to create a directory and put the file there)
hdfs dfs -put ~/spark_inputs/pagecounts-20160101-000000 hdfs://master:9000/
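
For the advanced variant mentioned in the hint, one possible sequence is sketched below. The `/inputs` directory name is an arbitrary choice, and the snippet only prints the commands when `hdfs` is not on the `PATH`, so it can be dry-run anywhere.

```shell
target_dir=hdfs://master:9000/inputs          # /inputs is a hypothetical directory name
src=~/spark_inputs/pagecounts-20160101-000000

# Use the real hdfs command when available; otherwise just print what would run.
if command -v hdfs > /dev/null 2>&1
then
    run=""
else
    run="echo"
fi

$run hdfs dfs -mkdir -p "$target_dir"      # create the directory (no error if it exists)
$run hdfs dfs -put "$src" "$target_dir/"   # upload the file into it
$run hdfs dfs -ls "$target_dir"            # list the directory to confirm the upload
```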

# 4. Setup YARN

We need one more configuration file: `yarn-site.xml`. The only configuration we will add is the address of Resource Manager.

In [None]:
%%bash
cat ~/yarn-site.xml
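
As a point of comparison, a `yarn-site.xml` that only names the ResourceManager host could be as short as the sketch below. It assumes the ResourceManager runs on `master`; `yarn.resourcemanager.hostname` is a standard property for this, but the prepared file may use a different one (for example, a full address property).

```shell
yarn_site=$(mktemp)
cat > "$yarn_site" <<'EOF'
<configuration>
  <property>
    <!-- Host that runs the YARN ResourceManager -->
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
EOF

# Show the property name together with its value on the next line.
grep -A 1 'yarn.resourcemanager.hostname' "$yarn_site"
```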

As above, let's copy `yarn-site.xml` into the Hadoop configuration directory on every node.

In [None]:
%%bash 
for host in master worker1 worker2
do
    ssh $host cp ~/yarn-site.xml ~/hadoop/etc/hadoop/yarn-site.xml
done

In [None]:
%%bash
start-yarn.sh

Go to `http://<MASTER_IP>:8088` and enjoy the nice web UI provided by YARN.

# 5. Running YARN Examples

## 1) Run one example application in Hadoop distribution

In [None]:
%%bash
yarn jar ~/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.1.jar \
  -jar ~/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-3.0.1.jar \
  -shell_command 'sleep 120; echo hello yarn'

## 2) Find the output files in HDFS and see the contents