# Building a Spark cluster on EC2

This note shows how to build a Spark cluster on AWS EC2 from a Jupyter notebook. Most of the steps can be completed entirely within this notebook. However, a few requirements need to be in place first:
* Python 3.3+ on Linux to run this notebook
* An AWS account with an `aws_access_key_id` and an `aws_secret_access_key`
* The AWS CLI (`awscli`), installed and configured
* boto3

ref: http://blog.insightdatalabs.com/spark-cluster-step-by-step/

###  1. Install boto3

boto3 is the AWS SDK for Python. We use it to create and manage the EC2 resources for a simple cluster.

In [44]:
# !pip install awscli
# then run 'aws configure' in a terminal

In [38]:
!pip install boto3

In [4]:
import boto3

**Check your AWS account credentials**

In [None]:
# %load ~/.aws/credentials
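
Loading the credentials file into a cell also prints the secret key into the notebook. A minimal sketch of a check that only reports which profiles are complete, assuming the usual INI layout of `~/.aws/credentials`; the helper name `complete_profiles` and the sample values are hypothetical:

```python
import configparser

REQUIRED_KEYS = ("aws_access_key_id", "aws_secret_access_key")

def complete_profiles(credentials_text):
    """Return the names of profiles that define both required keys."""
    # interpolation is disabled so a '%' inside a secret cannot cause errors
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(credentials_text)
    return [name for name in parser.sections()
            if all(parser.has_option(name, key) for key in REQUIRED_KEYS)]

# made-up sample content; only profile names are reported, never key values
sample = """
[default]
aws_access_key_id = AKIAEXAMPLE
aws_secret_access_key = not-a-real-secret

[broken]
aws_access_key_id = AKIAEXAMPLE
"""
print(complete_profiles(sample))  # ['default']
```

For the real file, the text could be read with `Path('~/.aws/credentials').expanduser().read_text()` from `pathlib` and passed to the same function.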

### 2. Setting up the clusters

First we define two helper functions that we use often: `check_status` and `terminate`.

In [6]:
def check_status(ec2, instances):
    """Print the current state of each instance."""
    for instance in instances:
        # look the instance up again so the state is fresh, not cached
        print("{} is {}".format(instance.id,
                                ec2.Instance(id=instance.id).state['Name']))

def terminate(instances):
    """Terminate every instance in the list."""
    for instance in instances:
        instance.terminate()

#### Configurations 

In [39]:
# configurations
ubuntu14lts = 'ami-5ac2cd4d'  # image Ubuntu Server 14.04 LTS (HVM), SSD Volume Type
num_nodes = 4
instance_type = 'm4.large'
security_group_name = 'jupyter-spark' # security group name
key_pair_name = 'jupyter' # name for your pem file, will store in ~/.aws/

In [40]:
ec2 = boto3.resource('ec2')
client = boto3.client('ec2')

#### Security group: open to all

For simplicity, we allow all inbound traffic from anywhere. For anything beyond a short experiment, tighten the rules so that worker nodes only accept traffic from inside the cluster.

In [9]:
# we will create a special security group 
response = ec2.create_security_group(GroupName=security_group_name,
                                     Description='security group for jupyter notebooks traffic')
# allow port 22 for ssh
client.authorize_security_group_ingress(GroupId=response.group_id,
                                     IpProtocol="tcp",
                                     CidrIp="0.0.0.0/0",
                                     FromPort=22,
                                     ToPort=22) 
# allow port 8080  
client.authorize_security_group_ingress(GroupId=response.group_id,
                                     IpProtocol="tcp",
                                     CidrIp="0.0.0.0/0",
                                     FromPort=8080,
                                     ToPort=8080) 

# allow every TCP port (this makes the two rules above redundant)
client.authorize_security_group_ingress(GroupId=response.group_id,
                                     IpProtocol="tcp",
                                     CidrIp="0.0.0.0/0",
                                     FromPort=0,
                                     ToPort=65535)

{'ResponseMetadata': {'HTTPHeaders': {'content-type': 'text/xml;charset=UTF-8',
   'date': 'Wed, 21 Dec 2016 20:47:39 GMT',
   'server': 'AmazonEC2',
   'transfer-encoding': 'chunked',
   'vary': 'Accept-Encoding'},
  'HTTPStatusCode': 200,
  'RequestId': 'a48420c0-0c77-4507-b82a-3214bf7429c7',
  'RetryAttempts': 0}}
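
Opening every port to `0.0.0.0/0` is convenient but exposes the whole cluster. A tighter variant is sketched below. It assumes that only your own address needs ssh and the web UI, while nodes may reach each other freely inside the group. `build_ingress_rules`, the group id, and the example CIDR are hypothetical; the returned list is shaped for the `IpPermissions` argument of `authorize_security_group_ingress`, and it has not been tested against this cluster:

```python
def build_ingress_rules(group_id, admin_cidr, public_ports=(22, 8080)):
    """Build an IpPermissions list: a few ports for one CIDR, all TCP within the group."""
    rules = [{"IpProtocol": "tcp",
              "FromPort": port,
              "ToPort": port,
              "IpRanges": [{"CidrIp": admin_cidr}]}
             for port in public_ports]
    # members of the same security group may reach each other on any TCP port
    rules.append({"IpProtocol": "tcp",
                  "FromPort": 0,
                  "ToPort": 65535,
                  "UserIdGroupPairs": [{"GroupId": group_id}]})
    return rules

# 203.0.113.7/32 is a documentation address; substitute your own public IP
rules = build_ingress_rules("sg-0123456789abcdef0", "203.0.113.7/32")
for rule in rules:
    print(rule["FromPort"], rule["ToPort"])
```

With the objects from the cell above, the list would be passed on as `client.authorize_security_group_ingress(GroupId=response.group_id, IpPermissions=rules)`.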

#### Create a new pem file

In [10]:
# create a new key pair (.pem)
from os.path import expanduser
from pathlib import Path
kp_path = expanduser("~") + '/.aws/'+key_pair_name + '.pem'
if not Path(kp_path).is_file():
    key_pair = client.create_key_pair(KeyName=key_pair_name)
    with open(kp_path,'w') as wt:
        wt.write(key_pair['KeyMaterial'])
else:
    print("{} already exists.".format(kp_path))

/home/ddu/.aws/jupyter.pem already exists.


In [11]:
!chmod 600 {kp_path}
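
With a typical umask, the key file is readable by other users between the `open(...)` call and the `chmod`. A small sketch that creates the file with owner-only permissions from the start, assuming a POSIX filesystem; `write_private_key` is a hypothetical helper, not part of boto3, and the demo writes a dummy string to a throwaway directory:

```python
import os
import stat
import tempfile

def write_private_key(path, key_material):
    """Create path with mode 600 and write the key; fail if the file already exists."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_EXCL
    fd = os.open(path, flags, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write(key_material)

# demo on a temporary directory instead of ~/.aws
demo_path = os.path.join(tempfile.mkdtemp(), "demo.pem")
write_private_key(demo_path, "dummy key material\n")
print(oct(stat.S_IMODE(os.stat(demo_path).st_mode)))
```

In the cell above, the call would be `write_private_key(kp_path, key_pair['KeyMaterial'])`.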

#### Commands to run at launch for all nodes

Every node runs the following script when it first boots. <br>
It performs the following tasks:
* Install Java 8 (OpenJDK) and Scala
* Download Spark 1.6.3 and install it under /usr/local/spark
* Append SPARK_HOME and PATH to /home/ubuntu/.profile
* Write basic settings to spark-env.sh

In [41]:
user_data = '''#!/bin/bash
sudo add-apt-repository -y ppa:openjdk-r/ppa
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk
sudo apt-get install -y scala
wget http://apache.mirrors.tds.net/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.4.tgz -P ~/Downloads
sudo tar zxvf ~/Downloads/spark-* -C /usr/local
sudo mv /usr/local/spark-* /usr/local/spark

sudo cat << EOF >> /home/ubuntu/.profile
export SPARK_HOME=/usr/local/spark
export PATH=DOLLAR_SIGNPATH:DOLLAR_SIGNSPARK_HOME/bin
EOF

sed -i -e 's/DOLLAR_SIGN/$/g' /home/ubuntu/.profile
source /home/ubuntu/.profile
sudo chown -R ubuntu $SPARK_HOME

echo "#!/usr/bin/env bash" >> $SPARK_HOME/conf/spark-env.sh
echo "export JAVA_HOME=/usr" >> $SPARK_HOME/conf/spark-env.sh
echo "export SPARK_PUBLIC_DNS='current_node_public_dns' " >> $SPARK_HOME/conf/spark-env.sh
echo "export SPARK_WORKER_CORES={}" >> $SPARK_HOME/conf/spark-env.sh
'''.format( (num_nodes - 1)*2 )
user_data

'#!/bin/bash\nsudo add-apt-repository -y ppa:openjdk-r/ppa\nsudo apt-get update\nsudo apt-get install -y openjdk-8-jdk\nsudo apt-get install -y scala\nwget http://apache.mirrors.tds.net/spark/spark-1.6.3/spark-1.6.3-bin-hadoop2.4.tgz -P ~/Downloads\nsudo tar zxvf ~/Downloads/spark-* -C /usr/local\nsudo mv /usr/local/spark-* /usr/local/spark\n\nsudo cat << EOF >> /home/ubuntu/.profile\nexport SPARK_HOME=/usr/local/spark\nexport PATH=DOLLAR_SIGNPATH:DOLLAR_SIGNSPARK_HOME/bin\nEOF\n\nsed -i -e \'s/DOLLAR_SIGN/$/g\' /home/ubuntu/.profile\nsource /home/ubuntu/.profile\nsudo chown -R ubuntu $SPARK_HOME\n\necho "#!/usr/bin/env bash" >> $SPARK_HOME/conf/spark-env.sh\necho "export JAVA_HOME=/usr" >> $SPARK_HOME/conf/spark-env.sh\necho "export SPARK_PUBLIC_DNS=\'current_node_public_dns\' " >> $SPARK_HOME/conf/spark-env.sh\necho "export SPARK_WORKER_CORES=6" >> $SPARK_HOME/conf/spark-env.sh\n'
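
The `DOLLAR_SIGN` placeholder is there because an unquoted here-document expands `$PATH` while the script runs. Quoting the delimiter (`<< 'EOF'`) switches that expansion off, so the `sed` pass would not be needed. A sketch of that variant for the `.profile` step only; `profile_snippet` is a hypothetical name, and this has not been run on the AMI used here:

```python
def profile_snippet(spark_home="/usr/local/spark", profile="/home/ubuntu/.profile"):
    """Return bash lines that append SPARK_HOME and PATH exports to profile."""
    lines = [
        # quoted delimiter: the here-document body is written literally
        "cat << 'EOF' >> {}".format(profile),
        "export SPARK_HOME={}".format(spark_home),
        "export PATH=$PATH:$SPARK_HOME/bin",
        "EOF",
    ]
    return "\n".join(lines) + "\n"

print(profile_snippet())
```

The returned text could be spliced into `user_data` in place of the here-document and the `sed` line.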

### 3. Send out request to create instances, finally!!

For more details on other options (for example, spot instances), please refer to the boto3 documentation linked below.

http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances

In [42]:
volumns = { 'DeviceName': '/dev/sda1',
            'Ebs': { 'VolumeSize': 200,
                     'DeleteOnTermination': True,
                     'VolumeType': 'standard', # Magnetic
                    },
          }
volumns

{'DeviceName': '/dev/sda1',
 'Ebs': {'DeleteOnTermination': True,
  'VolumeSize': 200,
  'VolumeType': 'standard'}}

In [22]:
# now create instances
instances = ec2.create_instances(ImageId = ubuntu14lts,
                                 SecurityGroups = [ security_group_name ],
                                 InstanceType = instance_type, 
                                 KeyName = key_pair_name,
                                 MinCount = num_nodes,
                                 MaxCount = num_nodes,
                                 UserData = user_data,    
                                 BlockDeviceMappings=[volumns])

#### Run this a few times until all instances are running (or terminated)

In [43]:
check_status(ec2, instances)

i-0de4771761532efa7 is terminated
i-033e623d21d818abb is terminated
i-0ae5a7a81aad61775 is terminated
i-0523d0de5abaf9c2b is terminated
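
Re-running the cell by hand can be replaced by a small polling loop. A sketch, assuming a callable that returns the current state names; `wait_for_state` is a hypothetical helper, and the fake state source below only stands in for the EC2 calls so the loop can be tried offline:

```python
import time

def wait_for_state(get_states, target="running", interval=5.0, max_checks=60):
    """Poll get_states() until every state equals target; return True on success."""
    for _ in range(max_checks):
        states = get_states()
        if states and all(state == target for state in states):
            return True
        time.sleep(interval)
    return False

# offline stand-in: two instances that are both 'running' on the third poll
history = iter([["pending", "pending"],
                ["running", "pending"],
                ["running", "running"]])
print(wait_for_state(lambda: next(history), interval=0.01))  # True
```

With the notebook's objects, the callable could be `lambda: [ec2.Instance(id=i.id).state['Name'] for i in instances]`. boto3 instance resources also provide `wait_until_running()`, which blocks until a single instance is up.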


In [222]:
# BE CAUTIOUS!! This will kill the whole cluster
terminate(instances)

In [26]:
# collect the DNS names and instance ids used in the steps below
# the first node becomes the master; the others are slaves (workers)
pub_dns = []
ins_ids = []
pri_dns = []
for instance in instances:
    #print("{} is {}".format(instance.id, ec2.Instance(id=instance.id).state['Name']))
    pub_dns.append(ec2.Instance(id=instance.id).public_dns_name)
    ins_ids.append(instance.id)
    pri_dns.append(ec2.Instance(id=instance.id).private_dns_name.split('.')[0])
    print(ec2.Instance(id=instance.id).public_dns_name)

ec2-54-88-158-48.compute-1.amazonaws.com
ec2-54-196-73-124.compute-1.amazonaws.com
ec2-54-152-115-105.compute-1.amazonaws.com
ec2-54-145-234-70.compute-1.amazonaws.com


### 4. Now, we need to do a bunch of modifications to the spark config files

In [31]:
# first, upload a copy of pem file to master
!scp -i {kp_path} -o 'StrictHostKeyChecking no' {kp_path} ubuntu@{pub_dns[0]}:/home/ubuntu/.ssh

jupyter.pem                                   100% 1670     1.6KB/s   00:00


In [44]:
# ssh_cmd: shell commands that write ~/.ssh/config on the master
# and set up passwordless ssh from the master to every worker

pem_key_file = '/home/ubuntu/.ssh/'+key_pair_name+'.pem'
ssh_cmd = '''
cat << EOF | tee /home/ubuntu/.ssh/config
Host namenode
  HostName {}
  User ubuntu
  IdentityFile {}
'''.format( pub_dns[0], pem_key_file)
for i,dns in enumerate(pub_dns[1:]):
    ssh_cmd += '''
Host datanode{}
  HostName {}
  User ubuntu
  IdentityFile {}
    '''.format(i+1, dns, pem_key_file)
ssh_cmd += '''
EOF

ssh-keygen -f /home/ubuntu/.ssh/id_rsa -t rsa -P ''
cat /home/ubuntu/.ssh/id_rsa.pub >> /home/ubuntu/.ssh/authorized_keys
'''
for i,dns in enumerate(pub_dns[1:]):
    ssh_cmd += '''
    cat /home/ubuntu/.ssh/id_rsa.pub | ssh -o 'StrictHostKeyChecking no' datanode{} 'cat >> /home/ubuntu/.ssh/authorized_keys'
    '''.format(i+1)
#ssh_cmd
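
The cell above mixes the config file text with the shell commands that install it. The config part on its own can be sketched as a pure function, which makes it easy to inspect before anything is sent to the master; `build_ssh_config` is a hypothetical helper and the host names are placeholders:

```python
def build_ssh_config(dns_names, identity_file):
    """Return ssh config text: first host is 'namenode', the rest 'datanode1', 'datanode2', ..."""
    blocks = []
    for index, dns in enumerate(dns_names):
        alias = "namenode" if index == 0 else "datanode{}".format(index)
        blocks.append("Host {}\n  HostName {}\n  User ubuntu\n  IdentityFile {}\n"
                      .format(alias, dns, identity_file))
    return "\n".join(blocks)

print(build_ssh_config(["master.example.com", "worker1.example.com"],
                       "/home/ubuntu/.ssh/jupyter.pem"))
```

Called as `build_ssh_config(pub_dns, pem_key_file)`, it should yield the same host aliases as the here-document in `ssh_cmd`.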

In [34]:
# update the ssh config at master
!ssh -i {kp_path} -o "StrictHostKeyChecking no" "ubuntu@"{pub_dns[0]} "{ssh_cmd}"

Host namenode
  HostName ec2-54-88-158-48.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem

Host datanode1
  HostName ec2-54-196-73-124.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem
    
Host datanode2
  HostName ec2-54-152-115-105.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem
    
Host datanode3
  HostName ec2-54-145-234-70.compute-1.amazonaws.com
  User ubuntu
  IdentityFile /home/ubuntu/.ssh/jupyter.pem
    
Generating public/private rsa key pair.
Your identification has been saved in /home/ubuntu/.ssh/id_rsa.
Your public key has been saved in /home/ubuntu/.ssh/id_rsa.pub.
The key fingerprint is:
96:d4:68:c7:ed:aa:45:36:78:9b:a2:1c:a4:0f:0e:68 ubuntu@ip-172-31-28-152
The key's randomart image is:
+--[ RSA 2048]----+
|                 |
|         + .     |
|        + + .    |
|       o + .     |
|      . S = .    |
| .   o . + =     |
|.E. o . . =      |
|.  o + o +       |
|  

#### Now update the master node config

In [29]:
# master_cmd: rebuild $SPARK_HOME/conf/slaves on the master, one worker per line

# load SPARK_HOME first: a non-interactive ssh session does not read ~/.profile
master_cmd = ' source /home/ubuntu/.profile; '
master_cmd += ' sudo rm -rf \\$SPARK_HOME/conf/slaves; '
for slave in pub_dns[1:]:
    master_cmd += 'sudo echo "{}" | sudo tee --append \\$SPARK_HOME/conf/slaves; '.format(slave)
master_cmd

' source /home/ubuntu/.profile;  sudo rm -rf \\$SPARK_HOME/conf/slaves; sudo echo "ec2-54-196-73-124.compute-1.amazonaws.com" | sudo tee --append \\$SPARK_HOME/conf/slaves; sudo echo "ec2-54-152-115-105.compute-1.amazonaws.com" | sudo tee --append \\$SPARK_HOME/conf/slaves; sudo echo "ec2-54-145-234-70.compute-1.amazonaws.com" | sudo tee --append \\$SPARK_HOME/conf/slaves; '

In [30]:
# modify master
!ssh -i {kp_path} -o "StrictHostKeyChecking no" "ubuntu@"{pub_dns[0]} "{master_cmd}"

ec2-54-196-73-124.compute-1.amazonaws.com
ec2-54-152-115-105.compute-1.amazonaws.com
ec2-54-145-234-70.compute-1.amazonaws.com


## Hurrah! You are ready to run your spark cluster!

It is recommended to start the cluster from a terminal. <br>
Open a terminal, ssh to your master node, <br>
and run:

```bash
master$ $SPARK_HOME/sbin/start-all.sh
```

To stop the cluster:

```bash
master$ $SPARK_HOME/sbin/stop-all.sh
```

For more details, check out Ouyang's post: <br>
http://blog.insightdatalabs.com/spark-cluster-step-by-step/

Now open the following address in a browser to see the cluster status:

In [45]:
# the Spark master web UI is served over plain HTTP by default
'http://{}:8080'.format(pub_dns[0])

'http://ec2-54-88-158-48.compute-1.amazonaws.com:8080'
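
Before opening the page it can help to confirm that the port answers at all, since the master UI only comes up after `start-all.sh` has run. A sketch using only the standard library; `port_is_open` is a hypothetical helper, and the demo probes a throwaway local socket rather than the cluster:

```python
import socket

def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# offline demo: listen on a free local port, probe it, then release it
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
local_port = server.getsockname()[1]
print(port_is_open("127.0.0.1", local_port))
server.close()
```

Against the cluster, the call would be `port_is_open(pub_dns[0], 8080)`.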

## Don't forget to terminate

In [36]:
terminate(instances)

In [37]:
check_status(ec2, instances)

i-0de4771761532efa7 is shutting-down
i-033e623d21d818abb is shutting-down
i-0ae5a7a81aad61775 is shutting-down
i-0523d0de5abaf9c2b is shutting-down
