# Building a Hadoop cluster on EC2

This note shows how to build a Hadoop cluster on AWS EC2 from a Jupyter notebook. Most of the tasks can be completed entirely within this notebook. However, you will need the following in place first:
* Python 3.3+ on Linux to run this notebook
* An AWS account with an `aws_access_key_id` and an `aws_secret_access_key`
* The AWS CLI, installed and configured
* boto3
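
A quick check of these prerequisites might look like the sketch below. It only inspects the local machine and does not contact AWS; the helper name `check_prerequisites` is made up for this note.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict of requirement name -> True/False for the local machine."""
    return {
        "python>=3.3": sys.version_info >= (3, 3),
        # the AWS CLI installs an executable called `aws`
        "aws cli on PATH": shutil.which("aws") is not None,
        # find_spec returns None when the package cannot be imported
        "boto3 importable": importlib.util.find_spec("boto3") is not None,
    }


for name, ok in check_prerequisites().items():
    print("{:<18} {}".format(name, "ok" if ok else "MISSING"))
```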

ref: https://blog.insightdatascience.com/spinning-up-a-free-hadoop-cluster-step-by-step-c406d56bae42#.j3zrw1yua

###  1. Install boto3

boto3 is the AWS SDK for Python. We use it to build a simple cluster.

In [44]:
# !pip install awscli
# then run 'aws configure' in a terminal

In [2]:
!pip install boto3

In [3]:
import boto3

**Check your AWS account credentials**

In [None]:
%load ~/.aws/credentials
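
Loading the credentials file into a cell leaves the secret key in the saved notebook. As an alternative, the sketch below lists the profiles and shows only the last four characters of each access key id. `summarize_credentials` is a hypothetical helper, and the sample text is made up in the format that `aws configure` writes.

```python
import configparser


def summarize_credentials(text):
    """Map each profile name to a masked access key id (last 4 characters kept)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    summary = {}
    for profile in parser.sections():
        key_id = parser.get(profile, "aws_access_key_id", fallback="")
        summary[profile] = "*" * max(len(key_id) - 4, 0) + key_id[-4:]
    return summary


# Made-up example content, not real credentials
sample = """[default]
aws_access_key_id = AKIAEXAMPLEEXAMPLE12
aws_secret_access_key = not-a-real-secret
"""
print(summarize_credentials(sample))
```

To run it on the real file, pass in the text read from `~/.aws/credentials`.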

### 2. Setting up the clusters

First we define two helper functions that we use often: `check_status` and `terminate`.

In [5]:
def check_status(ec2, instances):
    """Print the current state of each instance."""
    for instance in instances:
        # look the instance up again so that we report its latest state
        print("{} is {}".format(instance.id,
                                ec2.Instance(id=instance.id).state['Name']))

def terminate(instances):
    """Terminate every instance in the list."""
    for instance in instances:
        instance.terminate()

#### Configurations 

In [6]:
# configurations
ubuntu14lts = 'ami-5ac2cd4d'  # image Ubuntu Server 14.04 LTS (HVM), SSD Volume Type
num_nodes = 4
instance_type = 't2.micro'
security_group_name = 'jupyter-users-open' # security group name
key_pair_name = 'jupyter' # key pair name; the .pem file will be stored in ~/.aws/

In [7]:
ec2 = boto3.resource('ec2')
client = boto3.client('ec2')

#### Security group: open to all

In [8]:
# we will create a special security group 
response = ec2.create_security_group(GroupName=security_group_name,
                                     Description='security group for jupyter notebooks traffic')
# allow port 22 for ssh
client.authorize_security_group_ingress(GroupId=response.group_id,
                                     IpProtocol="tcp",
                                     CidrIp="0.0.0.0/0",
                                     FromPort=22,
                                     ToPort=22) 
# allow port 50070 (NameNode web UI)
client.authorize_security_group_ingress(GroupId=response.group_id,
                                     IpProtocol="tcp",
                                     CidrIp="0.0.0.0/0",
                                     FromPort=50070,
                                     ToPort=50070) 

{'ResponseMetadata': {'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8',
   'date': 'Wed, 21 Dec 2016 18:22:48 GMT',
   'server': 'AmazonEC2',
   'transfer-encoding': 'chunked',
   'vary': 'Accept-Encoding'},
  'HTTPStatusCode': 200,
  'RequestId': '70423049-1824-4c57-a764-1720571de12f',
  'RetryAttempts': 0}}
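
The rules above open ports 22 and 50070 to every address. If that is broader than you want, one option is to build the `IpPermissions` list that `authorize_security_group_ingress` also accepts and limit it to a single CIDR block. This is a sketch, not what the notebook does: `build_ingress_rules` is a hypothetical helper and `203.0.113.7/32` is a documentation-range placeholder for your own address.

```python
import ipaddress


def build_ingress_rules(ports, cidr):
    """Build an IpPermissions list allowing TCP `ports` from a single CIDR block."""
    # raises ValueError if `cidr` is not a valid network such as '203.0.113.7/32'
    network = ipaddress.ip_network(cidr)
    return [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": str(network)}],
        }
        for port in ports
    ]


rules = build_ingress_rules([22, 50070], "203.0.113.7/32")
print(rules)
# Assumed usage with the objects defined earlier in this notebook:
# client.authorize_security_group_ingress(GroupId=response.group_id, IpPermissions=rules)
```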

#### Create a new pem file

In [46]:
# create a new key pair (.pem)
from os.path import expanduser
from pathlib import Path
kp_path = expanduser("~") + '/.aws/'+key_pair_name + '.pem'
if not Path(kp_path).is_file():
    key_pair = client.create_key_pair(KeyName=key_pair_name)
    with open(kp_path,'w') as wt:
        wt.write(key_pair['KeyMaterial'])
else:
    print("{} already exists.".format(kp_path))

In [10]:
!chmod 600 {kp_path}
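
Between writing the file and running `chmod`, the key is briefly readable under the default permissions. A possible alternative is to create the file with mode 0600 from the start. The sketch below assumes a POSIX system; `write_private_key` is a hypothetical helper and the demonstration writes placeholder text, not a real key.

```python
import os
import stat
import tempfile


def write_private_key(path, key_material):
    """Create `path` with mode 0600 and write the key; fail if it already exists."""
    # O_EXCL makes the call fail rather than overwrite an existing key file
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(key_material)


# Demonstration in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.pem")
    write_private_key(demo_path, "placeholder key material\n")
    print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
```

In the notebook this would be called as `write_private_key(kp_path, key_pair['KeyMaterial'])`.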

#### Commands to run at launch for all nodes

All nodes run the following commands at launch. <br>
The script performs these tasks:
* Install Java 8 (OpenJDK)
* Download and install Hadoop 2.7.1
* Append the environment variables, including PATH, to ~/.profile
* Write the Hadoop config files with a NAMENODE_PUBLIC_DNS placeholder, which we fill in later once the public DNS names are known

In [11]:
user_data = '''#!/bin/bash
sudo add-apt-repository -y ppa:openjdk-r/ppa
sudo apt-get update 
sudo apt-get install -y openjdk-8-jdk
wget http://apache.mirrors.tds.net/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz -P ~/Downloads
sudo tar zxvf ~/Downloads/hadoop-* -C /usr/local
sudo mv /usr/local/hadoop-* /usr/local/hadoop

sudo cat << EOF >> /home/ubuntu/.profile
export JAVA_HOME=/usr
export PATH=DOLLAR_SIGNPATH:DOLLAR_SIGNJAVA_HOME/bin
export HADOOP_HOME=/usr/local/hadoop
export PATH=DOLLAR_SIGNPATH:DOLLAR_SIGNHADOOP_HOME/bin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
EOF

sed -i -e 's/DOLLAR_SIGN/$/g' /home/ubuntu/.profile

source /home/ubuntu/.profile
echo "export JAVA_HOME=/usr" >> $HADOOP_CONF_DIR/hadoop-env.sh

sudo sed -i -e "/configuration>/d" $HADOOP_CONF_DIR/core-site.xml
sudo cat << EOF >> $HADOOP_CONF_DIR/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://NAMENODE_PUBLIC_DNS:9000</value>
  </property>
</configuration>
EOF

sudo sed -i -e "/configuration>/d" $HADOOP_CONF_DIR/yarn-site.xml
sudo sed -i -e "/YARN configuration properties/d" $HADOOP_CONF_DIR/yarn-site.xml
sudo cat << EOF >> $HADOOP_CONF_DIR/yarn-site.xml
<configuration>
<!-- Site specific YARN configuration properties -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property> 
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>NAMENODE_PUBLIC_DNS</value>
  </property>
</configuration>
EOF

sudo cp $HADOOP_CONF_DIR/mapred-site.xml.template $HADOOP_CONF_DIR/mapred-site.xml
sudo sed -i -e "/configuration>/d" $HADOOP_CONF_DIR/mapred-site.xml
sudo cat << EOF >> $HADOOP_CONF_DIR/mapred-site.xml
<configuration>
  <property>
    <name>mapreduce.jobtracker.address</name>
    <value>NAMENODE_PUBLIC_DNS:54311</value>
  </property>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF

'''
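
EC2 limits user data to 16 KB in raw form, and a shell script passed as user data has to begin with `#!`. A small pre-flight check might look like the sketch below; `validate_user_data` is a hypothetical helper, not part of boto3.

```python
MAX_USER_DATA_BYTES = 16 * 1024  # EC2 limit on raw (pre-base64) user data


def validate_user_data(script):
    """Return a list of problems found in a user-data shell script (empty if none)."""
    problems = []
    if not script.startswith("#!"):
        problems.append("script must start with a shebang line such as #!/bin/bash")
    size = len(script.encode("utf-8"))
    if size > MAX_USER_DATA_BYTES:
        problems.append(
            "script is {} bytes; the limit is {}".format(size, MAX_USER_DATA_BYTES)
        )
    return problems


print(validate_user_data("#!/bin/bash\necho hello\n"))
print(validate_user_data("echo hello\n"))
```

Calling `validate_user_data(user_data)` before creating the instances should return an empty list for the script above.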

### 3. Send out request to create instances, finally!!

For details on other options (for example, spot instances), please refer to the boto3 documentation linked below.

http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances

In [12]:
# now create instances
instances = ec2.create_instances(ImageId = ubuntu14lts,
                                 SecurityGroups = [ security_group_name ],
                                 InstanceType = instance_type, 
                                 KeyName = key_pair_name,
                                 MinCount = num_nodes,
                                 MaxCount = num_nodes,
                                 UserData = user_data)

#### Run this a few times until all instances are running (or terminated)

In [18]:
check_status(ec2, instances)

i-080f8e2a03ebde2be is running
i-0d0ddc30606e9c99c is running
i-0ed39d968db0c0e2e is running
i-00e466e926beb2587 is running
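
Instead of re-running the cell by hand, the check can be wrapped in a polling loop. The sketch below is one way to do it under stated assumptions: `wait_for_state` is a hypothetical helper that takes any callable returning a list of state names, so it can be demonstrated here with canned states rather than real EC2 calls.

```python
import time


def wait_for_state(get_states, target="running", timeout=300, interval=5,
                   sleep=time.sleep):
    """Poll `get_states()` until every state equals `target` or `timeout` seconds pass.

    Returns True if all instances reached `target`, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        states = get_states()
        if states and all(state == target for state in states):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Demonstration with canned states instead of real EC2 calls
canned = iter([["pending", "pending"], ["running", "pending"], ["running", "running"]])
print(wait_for_state(lambda: next(canned), interval=0))
# Assumed usage with the objects defined in this notebook:
# wait_for_state(lambda: [ec2.Instance(id=i.id).state['Name'] for i in instances])
```

boto3 instance resources also provide `wait_until_running()`, which may be simpler when you only need to block on one instance at a time.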


In [222]:
# BE CAUTIOUS!! This will kill the whole cluster
terminate(instances)

In [19]:
# collect the DNS names and instance ids that the steps below need
# the first node will be the master; the others are slaves
pub_dns = []
ins_ids = []
pri_dns = []
for instance in instances:
    #print("{} is {}".format(instance.id, ec2.Instance(id=instance.id).state['Name']))
    pub_dns.append(ec2.Instance(id=instance.id).public_dns_name)
    ins_ids.append(instance.id)
    pri_dns.append(ec2.Instance(id=instance.id).private_dns_name.split('.')[0])
    print(ec2.Instance(id=instance.id).public_dns_name)

ec2-107-20-1-214.compute-1.amazonaws.com
ec2-54-210-60-253.compute-1.amazonaws.com
ec2-107-20-27-163.compute-1.amazonaws.com
ec2-54-145-8-225.compute-1.amazonaws.com
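
Two details of the loop above are easy to miss: the short hostname is simply the text before the first dot of the private DNS name, and the public DNS name is typically an empty string until an instance is running. The sketch below shows both; `short_hostname` and `missing_public_dns` are hypothetical helpers and the sample names are made up.

```python
def short_hostname(private_dns_name):
    """Return the text before the first dot of a private DNS name."""
    return private_dns_name.split(".")[0]


def missing_public_dns(public_dns_names):
    """Return the indexes of instances whose public DNS name is still empty."""
    return [i for i, name in enumerate(public_dns_names) if not name]


print(short_hostname("ip-172-31-8-62.ec2.internal"))
print(missing_public_dns(["ec2-a.example.com", "", "ec2-c.example.com"]))
```

If `missing_public_dns(pub_dns)` is not empty, wait for the instances to reach the running state and collect the names again.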


### 4. Now we need to modify the Hadoop config files

In [21]:
# first, upload a copy of the pem file to the master
!scp -i {kp_path} -o 'StrictHostKeyChecking no' {kp_path} ubuntu@{pub_dns[0]}:/home/ubuntu/.ssh

jupyter.pem                                   100% 1670     1.6KB/s   00:00    


#### Now configure ssh on the master node so that it has full ssh access to the others

In [41]:
# ssh_cmd: shell commands that write the ssh config and copy the master's public key to the slaves

pem_key_file = '/home/ubuntu/.ssh/'+key_pair_name+'.pem'
ssh_cmd = '''
cat << EOF | tee /home/ubuntu/.ssh/config
Host namenode
  HostName {}
  User ubuntu
  IdentityFile {}
'''.format( pub_dns[0], pem_key_file)
for i,dns in enumerate(pub_dns[1:]):
    ssh_cmd += '''
Host datanode{}
  HostName {}
  User ubuntu
  IdentityFile {}
    '''.format(i+1, dns, pem_key_file)
ssh_cmd += '''
EOF

ssh-keygen -f /home/ubuntu/.ssh/id_rsa -t rsa -P ''
cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys
'''
for i,dns in enumerate(pub_dns[1:]):
    ssh_cmd += '''
    cat /home/ubuntu/.ssh/id_rsa.pub | ssh -o 'StrictHostKeyChecking no' datanode{} 'cat >> /home/ubuntu/.ssh/authorized_keys'
    '''.format(i+1)
#ssh_cmd
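
The ssh config portion of the command can also be generated by a small function, which makes the naming rule explicit: the first host becomes `namenode` and the rest become `datanode1` to `datanodeN`. This is a sketch only; `build_ssh_config` is a hypothetical helper and the host names in the demonstration are placeholders.

```python
def build_ssh_config(public_dns_names, identity_file):
    """Return ssh config text: first host is `namenode`, the rest `datanode1..N`."""
    blocks = []
    for index, dns in enumerate(public_dns_names):
        alias = "namenode" if index == 0 else "datanode{}".format(index)
        blocks.append(
            "Host {}\n  HostName {}\n  User ubuntu\n  IdentityFile {}\n".format(
                alias, dns, identity_file
            )
        )
    return "\n".join(blocks)


print(build_ssh_config(
    ["master.example.com", "worker1.example.com"],
    "/home/ubuntu/.ssh/jupyter.pem",
))
```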

In [25]:
# update the ssh config at master
!ssh -i {kp_path} -o "StrictHostKeyChecking no" "ubuntu@"{pub_dns[0]} "{ssh_cmd}"

Host namenode
  HostName ec2-107-20-1-214.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem

Host datanode1
  HostName ec2-54-210-60-253.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem
    
Host datanode2
  HostName ec2-107-20-27-163.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem
    
Host datanode3
  HostName ec2-54-145-8-225.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem
    
Generating public/private rsa key pair.
Your identification has been saved in /home/ubuntu/.ssh/id_rsa.
Your public key has been saved in /home/ubuntu/.ssh/id_rsa.pub.
The key fingerprint is:
b7:0f:97:51:af:a5:64:d5:a3:15:27:fa:0c:a7:a7:14 ubuntu@ip-172-31-8-62
The key's randomart image is:
+--[ RSA 2048]----+
|              ...|
|             . .+|
|            E ooo|
|             Ooo.|
|        S . +.* o|
|         . o B + |
|          o + o  |
|           +     |
|      

In [34]:
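# update_cmd: command that fills in the namenode address in every node's Hadoop configuration
# the backslash before $ keeps the local shell from expanding the variable before ssh sends it
update_cmd = 'source /home/ubuntu/.profile; '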
update_cmd += 'sudo sed -i "s/NAMENODE_PUBLIC_DNS/{}/g" \$HADOOP_CONF_DIR/core-site.xml;'.format(pub_dns[0])
update_cmd += 'sudo sed -i "s/NAMENODE_PUBLIC_DNS/{}/g" \$HADOOP_CONF_DIR/yarn-site.xml;'.format(pub_dns[0])
update_cmd += 'sudo sed -i "s/NAMENODE_PUBLIC_DNS/{}/g" \$HADOOP_CONF_DIR/mapred-site.xml;'.format(pub_dns[0])
#update_cmd
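
Each `sed` call replaces every occurrence of the `NAMENODE_PUBLIC_DNS` placeholder with the master's public DNS name. The same substitution in plain Python, as a sketch for checking the result locally, could look like this; `fill_namenode` is a hypothetical helper and `namenode.example.com` is a placeholder host.

```python
def fill_namenode(template, namenode_dns, placeholder="NAMENODE_PUBLIC_DNS"):
    """Replace every occurrence of the placeholder, like sed 's/PLACEHOLDER/dns/g'."""
    return template.replace(placeholder, namenode_dns)


core_site = """<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://NAMENODE_PUBLIC_DNS:9000</value>
  </property>
</configuration>
"""
print(fill_namenode(core_site, "namenode.example.com"))
```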

In [29]:
# update each node (including the master)
for everybody in pub_dns: # public DNS of every node
    print(everybody)
    !ssh -i {kp_path} -o "StrictHostKeyChecking no" "ubuntu@"{everybody} "{update_cmd}"

ec2-107-20-1-214.compute-1.amazonaws.com
ec2-54-210-60-253.compute-1.amazonaws.com
ec2-107-20-27-163.compute-1.amazonaws.com
ec2-54-145-8-225.compute-1.amazonaws.com


#### Now update the master's Hadoop config

In [35]:
# master_cmd: modify master 

master_cmd = '''cat << EOF | sudo tee /etc/hosts
127.0.0.1 localhost
'''
for instance in instances:
    master_cmd += '{} {}\n'.format(ec2.Instance(id=instance.id).public_dns_name, 
                            ec2.Instance(id=instance.id).private_dns_name.split('.')[0])
master_cmd += '''
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
EOF

source /home/ubuntu/.profile

sudo sed -i -e '/configuration>/d' \$HADOOP_CONF_DIR/hdfs-site.xml
sudo cat << EOF | sudo tee --append \$HADOOP_CONF_DIR/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value> {} </value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
</configuration>
EOF

sudo mkdir -p \$HADOOP_HOME/hadoop_data/hdfs/namenode

'''.format(num_nodes-1)

master_cmd += ' echo "{}" | sudo tee \$HADOOP_CONF_DIR/masters; '.format(pri_dns[0])
master_cmd += ' sudo rm -rf \$HADOOP_CONF_DIR/slaves;'
for slave in pri_dns[1:]:
    master_cmd += 'sudo echo "{}" | sudo tee --append \$HADOOP_CONF_DIR/slaves; '.format(slave)

master_cmd += 'sudo chown -R ubuntu \$HADOOP_HOME'
#master_cmd
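
The `hdfs-site.xml` content is a flat list of name/value properties inside a `<configuration>` element. As an alternative to writing the XML by hand in a heredoc, the sketch below generates it with the standard library; `build_site_xml` is a hypothetical helper, not something Hadoop or boto3 provides.

```python
import xml.etree.ElementTree as ET


def build_site_xml(properties):
    """Serialize a dict of Hadoop property name -> value as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = build_site_xml({
    "dfs.replication": 3,
    "dfs.namenode.name.dir": "file:///usr/local/hadoop/hadoop_data/hdfs/namenode",
})
print(xml_text)
```

The output is a single unindented line, which is still well-formed XML.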

In [31]:
# modify master
!ssh -i {kp_path} -o "StrictHostKeyChecking no" "ubuntu@"{pub_dns[0]} "{master_cmd}"

127.0.0.1 localhost
ec2-107-20-1-214.compute-1.amazonaws.com ip-172-31-8-62
ec2-54-210-60-253.compute-1.amazonaws.com ip-172-31-12-203
ec2-107-20-27-163.compute-1.amazonaws.com ip-172-31-7-232
ec2-54-145-8-225.compute-1.amazonaws.com ip-172-31-15-211

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters
ff02::3 ip6-allhosts
<configuration>
  <property>
    <name>dfs.replication</name>
    <value> 3 </value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
  </property>
</configuration>
ip-172-31-8-62
ip-172-31-12-203
ip-172-31-7-232
ip-172-31-15-211


In [32]:
# slave_cmd: modify slave config

slave_cmd = '''
source /home/ubuntu/.profile

sudo sed -i -e '/configuration>/d' \$HADOOP_CONF_DIR/hdfs-site.xml
sudo cat << EOF | sudo tee --append \$HADOOP_CONF_DIR/hdfs-site.xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value> {} </value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
EOF

sudo mkdir -p \$HADOOP_HOME/hadoop_data/hdfs/datanode

sudo chown -R ubuntu \$HADOOP_HOME

'''.format(num_nodes-1)
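
Both commands set `dfs.replication` to `num_nodes-1`, that is, one replica per datanode. Hadoop's default replication factor is 3, so on a larger cluster you may prefer to cap the value rather than copy every block to every datanode. The sketch below shows one such rule; `choose_replication` is a hypothetical helper and is not what this notebook uses.

```python
def choose_replication(num_nodes, cap=3):
    """Replicas per block: one per datanode, but never more than `cap` or fewer than 1."""
    datanodes = num_nodes - 1  # the first node is the master
    return max(1, min(cap, datanodes))


for nodes in (2, 4, 10):
    print(nodes, choose_replication(nodes))
```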

In [33]:
# modify each slave
for slave in pub_dns[1:]: # slave public dns
    print(slave)
    !ssh -i {kp_path} -o "StrictHostKeyChecking no" "ubuntu@"{slave} "{slave_cmd}"

ec2-54-210-60-253.compute-1.amazonaws.com
<configuration>
  <property>
    <name>dfs.replication</name>
    <value> 3 </value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
ec2-107-20-27-163.compute-1.amazonaws.com
<configuration>
  <property>
    <name>dfs.replication</name>
    <value> 3 </value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>
ec2-54-145-8-225.compute-1.amazonaws.com
<configuration>
  <property>
    <name>dfs.replication</name>
    <value> 3 </value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
  </property>
</configuration>


## Hurrah! You are ready to run your hadoop cluster!

It is recommended to start the cluster from a terminal. <br>
Open a terminal, ssh to your master node, <br>
and run the commands:

```bash
master$ hdfs namenode -format
master$ $HADOOP_HOME/sbin/start-dfs.sh
```

For more details, check out Ouyang's post: <br>
https://blog.insightdatascience.com/spinning-up-a-free-hadoop-cluster-step-by-step-c406d56bae42#.j3zrw1yua

Now open the following URL in a browser to check the cluster status

In [42]:
'http://{}:50070'.format(pub_dns[0])

'http://ec2-107-20-1-214.compute-1.amazonaws.com:50070'
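
If the page does not load, it helps to check whether the port is reachable at all before debugging Hadoop itself. The sketch below tries a plain TCP connection; `port_is_open` is a hypothetical helper, and a True result only shows that something is listening, not that the NameNode is healthy.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # covers refused connections, timeouts and name resolution failures
        return False


# Assumed usage with the list built earlier in this notebook:
# port_is_open(pub_dns[0], 50070)
```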

## Don't forget to terminate

In [38]:
terminate(instances)

In [39]:
check_status(ec2, instances)

i-080f8e2a03ebde2be is terminated
i-0d0ddc30606e9c99c is terminated
i-0ed39d968db0c0e2e is terminated
i-00e466e926beb2587 is terminated
