**Description:** <br>
This document aims to help a beginner with their first attempt to run Spark on a standalone cluster. It assumes that readers have already signed up for an EC2 account on the Amazon Web Services site and have some knowledge of Spark, Python, and bash.<br>
Most of the content here comes from the *Distributed Computing* course taught by Diane Woodbridge, which is part of the MS in Analytics program at the University of San Francisco. I created this document mainly as a set of notes. If you find any errors or have any comments, feel free to give me feedback. Any help improving this document would be much appreciated.<br>

# Step 1: Create an IAM user
*An IAM user is an entity that you create in AWS. It can give you the ability to sign in to the AWS Management Console for interactive tasks and to make programmatic requests to AWS services using the API or CLI. You can check <a href="http://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_create.html">this link</a> for more details.*
#### 1. Open your AWS console.
#### 2. Put your cursor on your username (beside the bell icon).
#### 3. Click "My Security Credentials".
#### 4. Click "Users" on the sidebar.
#### 5. Choose "Add user" (the blue button).
<img src="image/step1/1.png" width="800" height="500">
#### 6. Choose a user name and access type, then click "Next".
<img src="image/step1/2.png" width="800" height="500">
#### 7. Choose "Attach existing policies directly" and "AmazonEC2FullAccess", then click "Next".
<img src="image/step1/3.png" width="800" height="500">
#### 8. Click "Create user".
<img src="image/step1/4.png" width="800" height="500">
#### 9. A new user has been successfully created. Now we can copy the "Access key ID" and "Secret access key", or download the .csv file, which contains the same information.
<img src="image/step1/5.png" width="800" height="500">
#### 10. Now add the "Access key ID" and "Secret access key" to your `~/.profile` or `~/.bash_profile`. You can use `vim` to edit the file; remember to use `:w` to save the change. After you reload the profile (for example, by opening a new terminal), you can make programmatic requests to AWS as this user from the terminal.
<img src="image/step1/6.png" width="800" height="500">
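
As a quick sanity check after editing your profile, the sketch below reports whether the two credentials are visible to programs started from your shell. It assumes the keys were exported under the names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`, which are the names the `spark-ec2` README asks for; `missing_aws_credentials` is a hypothetical helper name, not part of any AWS tool.

```python
import os

# Assumed variable names: the ones the spark-ec2 README asks you to export.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(environ=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_aws_credentials()
    if missing:
        print("Not set: " + ", ".join(missing))
    else:
        # Never print the secret itself; confirming its presence is enough.
        print("Both AWS credential variables are set.")
```

If a variable is reported as not set, open a new terminal (or reload your profile) and run the check again.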

# Step 2: Create key pairs
*Amazon EC2 uses public-key cryptography to encrypt and decrypt login information. To log in to your instance, you must create a key pair, specify the name of the key pair when you launch the instance, and provide the private key when you connect to the instance. You can check <a href="http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html">this link</a> for more details.*
1. Go to the "EC2 Dashboard" and choose "Key Pairs" to create a new key pair. Note that each region has its own "Key Pairs" and "Instances", and different regions are separate from one another. Here we choose "US West (Oregon)".
<img src="image/step2/1.png" width="800" height="500"> 
2. Click "Create Key Pair" and choose your "Key pair name". Then click "Create".
<img src="image/step2/2.png" width="800" height="500"> 
3. Now your key pair has been created and a private key file (i.e., a .pem file) will be automatically downloaded.
<img src="image/step2/3.png" width="800" height="500"> 
4. If the suffix of the file is ".pem.txt", change it to ".pem". Store the .pem file in a safe place on your machine.
<img src="image/step2/4.png"> 
5. Set the permissions for the .pem file to 600 (i.e., only you can read and write it) so that `ssh` will work.
<img src="image/step2/5.png" width="800" height="500"> 
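
The permission change in step 5 can also be done and verified from Python. This is a minimal sketch using only the standard library; `restrict_key_file` and the script name in the usage comment are hypothetical, chosen for illustration.

```python
import os
import stat
import sys


def restrict_key_file(path):
    """Set the file mode to 600 (owner read/write only) and return the new mode."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)  # same bits as octal 600
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Usage: python restrict_key.py /path/to/example.pem
    if len(sys.argv) == 2:
        print(oct(restrict_key_file(sys.argv[1])))
    else:
        print("usage: restrict_key.py <path-to-pem-file>")
```

The helper returns the mode it reads back from the file, so you can confirm the change took effect before trying `ssh`.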

# Step 3: Use `spark-ec2`
*`spark-ec2` allows you to launch, manage and shut down Apache Spark clusters on Amazon EC2. It automatically sets up Apache Spark and HDFS on the cluster for you. Check <a href="https://github.com/amplab/spark-ec2">this link</a> for more details.*
1. Clone AMPLab's code. <br>
`$git clone https://github.com/amplab/spark-ec2`
2. Go to the `spark-ec2` directory. The default branch is 1.6. We need to switch to branch 2.0 using `$git checkout branch-2.0`. We can check whether the switch succeeded using `$git branch`.<br>
`$cd spark-ec2`<br>
`$git checkout branch-2.0`<br>
`$git branch`<br>
<img src="image/step3/1.png" width="800" height="500"> 
This means we've switched successfully.
3. Now we can use `./spark-ec2` to manage our EC2 instances. Before we start, let's get familiar with available commands and arguments first.<br>

|Command|Description|
|:------:|:----------:|
|`launch`|Launches EC2 instances, installs the required software packages, and starts the master and slaves.|
|`login`|Logs in to the instance running the Spark master.|
|`stop`|Stops all the cluster instances.|
|`start`|Starts all the cluster instances and reconfigures the cluster.|
|`get-master`|Returns the address of the instance where the master is running.|
|`destroy`|An unrecoverable action that terminates the EC2 instances and destroys the cluster.|

|Name|Arguments|Description|
|:---:|:---:|:---:|
|key-pair|`-k`|The name of your EC2 key pair.|
|identity-file|`-i`|The private key file for your key pair.|
|region|`-r`|The EC2 region in which to launch instances.|
|zone|`-z`|The EC2 availability zone in which to launch instances.|
|slave|`-s`|The number of slaves to launch.|

$4.$ Now we can use `./spark-ec2` command to launch EC2 instances in the terminal.<br>
`$./spark-ec2 -k example -i example.pem -s 1 -r us-west-2 -z us-west-2a launch instance-name`<br>
Note:
* `-k example`: here `example` is the name of your EC2 key pair 
* `-i example.pem`: here `example.pem` is the private key file for your key pair. If it is not stored in the `spark-ec2` directory, write the absolute path of the .pem file here (e.g., `/path/to/example.pem`)
* `-s 1`: here 1 is the number of slaves in the cluster
* `-r us-west-2`: here `us-west-2` is the region of the instance. It should be the same as the region of the key pair. Below is the corresponding code for each region. For example, if the region of your key pair is US East (Ohio), you should write `-r us-east-2`

|Code|Name
|:--:|:--:
|us-east-1|US East (N. Virginia)
|us-east-2|US East (Ohio)
|us-west-1|US West (N. California)
|us-west-2|US West (Oregon)
|ca-central-1|Canada (Central)
|eu-west-1|EU (Ireland)
|eu-central-1|EU (Frankfurt)
|eu-west-2|EU (London)
|ap-northeast-1|Asia Pacific (Tokyo)
|ap-northeast-2|Asia Pacific (Seoul)
|ap-southeast-1|Asia Pacific (Singapore)
|ap-southeast-2|Asia Pacific (Sydney)
|ap-south-1|Asia Pacific (Mumbai)
|sa-east-1|South America (São Paulo)

* `-z us-west-2a`: here `us-west-2a` is the availability zone of the instance. Usually there are three availability zones for each region. You can check the status of each zone in the "EC2 Dashboard"
<img src="image/step3/2.png" width="800" height="500"> 
* `launch`: command
* `instance-name`: the name of your instance. You can choose any name you like
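
To keep the flags described above in one place, the launch command can be assembled by a small helper. This is only a sketch: `build_launch_command` is a hypothetical name, the values are the placeholders used in this document, and the zone check assumes that an availability zone name starts with its region code (as in `us-west-2a`).

```python
import shlex


def build_launch_command(cluster_name, key_pair, identity_file, region, zone, slaves=1):
    """Assemble the spark-ec2 launch command as an argument list."""
    if not zone.startswith(region):
        # Assumption: zone names are the region code plus a letter suffix.
        raise ValueError("zone %r is not in region %r" % (zone, region))
    return [
        "./spark-ec2",
        "-k", key_pair,        # name of the EC2 key pair
        "-i", identity_file,   # path to the matching .pem file
        "-s", str(slaves),     # number of slave (worker) instances
        "-r", region,          # region code, e.g. us-west-2
        "-z", zone,            # availability zone inside that region
        "launch", cluster_name,
    ]


if __name__ == "__main__":
    cmd = build_launch_command("instance-name", "example", "example.pem",
                               "us-west-2", "us-west-2a", slaves=1)
    # Print only. To really launch, run the printed command (or pass `cmd`
    # to subprocess.run) from inside the spark-ec2 directory.
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the command instead of running it lets you review every flag before any instance is started and billed.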

$5.$ After you launch your instance, you will see information like that shown below in your terminal. Notice that 1 slave was launched, as we specified in the command. Therefore you will see two instances in your AWS console: one is the master and the other is the slave.<br>
If you see `SSH connection error`, don't panic. This is **temporary**. Just wait patiently and don't interrupt the process. You will get there eventually.
<img src="image/step3/4.png" width="800" height="500">
<img src="image/step3/3.png" width="800" height="500">
$6.$ If you see the information below, your instances have launched successfully. You can use the highlighted URL to open the web UI.
<img src="image/step3/5.png" width="800" height="500">
<img src="image/step3/6.png" width="800" height="500">
Click the "Worker Id" to check the running executors.
<img src="image/step3/7.png" width="800" height="500">

$7.$ Now log in to the cluster<br>
`$./spark-ec2 -k example -i example.pem -r us-west-2 -z us-west-2a login instance-name`<br>
The prompt in your terminal will change to show the remote machine.
<img src="image/step3/8.png">
**(1) Run pyspark**<br>
Use the command `$~/spark/bin/pyspark` to open the pyspark shell. Enter `sc` in the shell to see the `pyspark.context.SparkContext` object that the shell has created for you.<br>
<img src="image/step3/9.png" width="800" height="500">
Now if you check your web UI again, you can see a new application is running.
<img src="image/step3/10.png" width="800" height="500">
Make sure you quit the shell properly, using `ctrl+D` or `quit()`. Don't use `ctrl+C`; otherwise the shell will keep holding all the resources on the instance. After you quit, the application will appear under "Completed Applications".<br>
<img src="image/step3/11.png" width="800" height="500">

**(2) Run spark-submit**<br>
To run `spark-submit`, you need a Python script (i.e., a .py file) on the remote machine. You can either write the script directly on the remote machine using the terminal or upload a file from your local machine to the remote machine. Here we give an example of how to upload a .py file to the remote machine.
a. Write a .py file on your local machine. The master should be set to the same address as the one shown in the web UI. The app name can be anything you like.
<img src="image/step3/12.png" width="800" height="500">
<img src="image/step3/13.png" width="800" height="500">
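
For readers who cannot see the screenshots, here is a minimal sketch of what such a script could look like. It is not the script from the course: the line-count job, the command-line arguments, and the helper name `master_url` are assumptions made for illustration. Port 7077 is the default port of a standalone Spark master, and the app name `Example` matches the one used in this walkthrough.

```python
import sys


def master_url(host, port=7077):
    """Build a standalone-mode master URL; 7077 is Spark's default master port."""
    return "spark://{}:{}".format(host, port)


def main(host, input_path):
    # Imported here so the helper above can be used without Spark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master_url(host)).setAppName("Example")
    sc = SparkContext(conf=conf)
    try:
        # A deliberately small job: count the lines of the input file.
        print(sc.textFile(input_path).count())
    finally:
        sc.stop()  # release the executors so the application is marked completed


if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("usage: example.py <master-host> <input-path>")
    else:
        main(sys.argv[1], sys.argv[2])
```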
b. On the remote machine (the master node), create a new directory.<br>
`$mkdir example_code`
c. On your local machine (open a new terminal), use the `scp` command to upload the .py file to the remote machine.<br>
`$scp -i example.pem -r LocalDir root@ec2-35-161-213-27.us-west-2.compute.amazonaws.com:~/RemoteDir`
    * `-i example.pem`: if you are not in the directory that contains the key pair file, make sure you write the absolute path of the .pem file
    * `-r LocalDir`: the directory on your local machine where you store your .py file
    * `root@ec2-35-161-213-27.us-west-2.compute.amazonaws.com:`: `root` is the login user, followed by the Public DNS of your master instance. You can find the Public DNS on your AWS console.
    <img src="image/step3/14.png" width="800" height="500">
    * `~/RemoteDir`: the directory on the remote machine where you want to store the .py file. In this case, we use `~/example_code`, which is the directory we just created.
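
The pieces of the `scp` command described above can be put together as follows. This is a sketch with a hypothetical helper name, `build_scp_command`; it only prints the command, and it uses plain ASCII hyphens for the flags, which matters if you copy the command from a formatted page.

```python
import shlex


def build_scp_command(identity_file, local_dir, master_dns, remote_dir, user="root"):
    """Assemble an scp command that recursively copies local_dir to the master."""
    destination = "{}@{}:{}".format(user, master_dns, remote_dir)
    return ["scp", "-i", identity_file, "-r", local_dir, destination]


if __name__ == "__main__":
    cmd = build_scp_command(
        "example.pem",
        "LocalDir",
        "ec2-35-161-213-27.us-west-2.compute.amazonaws.com",  # Public DNS from this example
        "~/example_code",
    )
    # Quote each part so the line is safe to paste into a shell; note that
    # quoting keeps ~ literal locally, and the remote side resolves it.
    print(" ".join(shlex.quote(part) for part in cmd))
```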
d. In the remote machine, you can check if you've uploaded the .py file successfully.
<img src="image/step3/15.png" width="800" height="500">
e. After we upload the .py file to the master node, we need to copy it to all the slave (worker) nodes. We do this on the remote machine using the command below.<br>
*If you are in `example_code` directory, `$cd ~` first.*<br>
`$~/spark-ec2/copy-dir ~/RemoteDir`
   * `~/spark-ec2/copy-dir`: command
   * `~/RemoteDir`: the directory on the remote machine where you store the .py file. In this case, we use `~/example_code`. Note that after we copy the .py file to the worker nodes, we are in the `example_code` directory automatically.<br>
f. Now we can run `spark-submit`.<br>
`$~/spark/bin/spark-submit --deploy-mode client example.py > output.txt`<br>
 After we run `spark-submit`, `output.txt` is created in the `example_code` directory. We check the content of `output.txt` using `$head output.txt` and find that it contains error information. This is because the .py file couldn't find `"file:///root/example/input_2.txt"`. This means the `spark-submit` command works, but the Python script it executes has a bug.
<img src="image/step3/16.png" width="800" height="500">
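
One way to catch a missing input file before submitting is to check the path behind the `file://` URI. The sketch below is an assumption-laden helper, not part of Spark: `local_path_from_uri` is a hypothetical name, and the check only covers the node it runs on. With a `file://` path, the file needs to be present at the same location on the worker nodes too.

```python
import os

FILE_PREFIX = "file://"


def local_path_from_uri(uri):
    """Return the filesystem path of a file:// URI, or raise for other schemes."""
    if not uri.startswith(FILE_PREFIX):
        raise ValueError("expected a file:// URI, got: " + uri)
    return uri[len(FILE_PREFIX):]


if __name__ == "__main__":
    uri = "file:///root/example/input_2.txt"  # the path from the error above
    path = local_path_from_uri(uri)
    state = "exists" if os.path.exists(path) else "is missing on this node"
    print(path + " " + state)
```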
g. Now if you refresh your web UI again, you will find a new completed application named 'Example'.
<img src="image/step3/17.png" width="800" height="500">
<img src="image/step3/18.png" width="800" height="500">

$8.$ When you finish, don't forget to stop or terminate your instances, or you will continue to be charged. You can do this either manually from the AWS console or with commands in your terminal. Below are some examples that stop/start/terminate the cluster from the terminal.<br>
These commands need to be executed in the `spark-ec2` directory on the local machine, so `$cd spark-ec2` first. If the key pair file is not in the same directory, use its absolute path.<br>
    `$./spark-ec2 -r us-west-2 -z us-west-2a stop instance-name`<br>
    `$./spark-ec2 -k example -i example.pem -r us-west-2 -z us-west-2a start instance-name`<br>
    `$./spark-ec2 -k example -i example.pem -r us-west-2 -z us-west-2a destroy instance-name`<br>
$9.$ Double-check in your AWS console that your instances are truly stopped or terminated before you close it.