# Running Distributed TensorFlow on EC2

## AWS Setup 
### Navigating to the Console
- Go to the EC2 console page by clicking "Services" (top left), then "EC2" in the dropdown.
    - This page shows statistics such as how many instances are running.
- Click the "Instances" tab on the left.
    - This page lists all of your instances, whether running, stopped, or terminated.
        - Stopping an instance shuts the machine down but saves its state (files, etc. from previous runs are left unchanged), so it can be started again later.
        - Terminating an instance shuts it down permanently; the machine is gone for good.
        - A running instance is currently alive.
               
### Launching your first EC2
- Click the blue "Launch Instance" button at the top.
- Make sure you are in the "US West (Oregon)" region, using the region selector in the navigation bar at the top of the page.
- Select the "Ubuntu" AMI.
    - Note: AMI = Amazon Machine Image = a saved machine state (containing installed software, etc.)
        - Amazon has a "marketplace" of AMIs that let you launch instances with specific software packages pre-installed. You can also create your own AMI after configuring a machine with the software you need, and share it.
- Select t2.micro (free tier) and click "Next:Configure..."
- Leave this page unchanged (skip the "Configure Instance Details" page). Click next.
- Leave this page unchanged (skip the "Add Storage" page). Click next.
- Leave this page unchanged (skip the "Tag Instance" page). Click next.
- On "Step 6: Configure Security Group", make sure SSH can be accessed from a source of "Anywhere". Click Review and Launch.
- Click Launch
    - A popup will appear asking you to select a key pair. Since this is our first time launching an instance, select "Create a new key pair", name it, and download it to a safe location on your machine. On later launches, you can reuse an existing key pair.
    - Select Launch Instance
    
### SSH'ing into your EC2 machine
- Navigate to the instances tab of the EC2 console.
- After following the steps for launching the EC2 machine, you should see a new entry in the instance page.
- Selecting the checkbox next to it will show details regarding the machine
    - Click the Connect button at the top. It will tell you how to ssh into that instance.
    - Note: the .pem they refer to is the key file you downloaded when launching the EC2.
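
The command shown by the Connect button generally has the form `ssh -i <key file> <user>@<public DNS>`. As a minimal sketch, the helper below assembles that command in Python. The function name, key file name, and hostname are hypothetical placeholders, and `ubuntu` is assumed as the login user because it is the default on Ubuntu AMIs; copy the real values from the Connect dialog.

```python
import shlex


def build_ssh_command(key_path, public_dns, user="ubuntu"):
    """Return the argv list for SSH'ing into an instance with a .pem key.

    key_path and public_dns are placeholders in this sketch; take the
    real values from the Connect dialog. "ubuntu" is the default login
    user on Ubuntu AMIs (an assumption if you picked a different AMI).
    """
    return ["ssh", "-i", key_path, "{}@{}".format(user, public_dns)]


# Hypothetical values, for illustration only.
cmd = build_ssh_command(
    "my-key.pem", "ec2-203-0-113-10.us-west-2.compute.amazonaws.com"
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed directly to `subprocess.run`. Note that `ssh` rejects private key files that other users can read, so restrict the key first (for example with `chmod 400` on the .pem file).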
   
### Distributed TensorFlow Inception Model with EC2
#### Basic Info
- In distributed TensorFlow there are two types of machines:
    - Worker machines, which do gradient computation
    - Parameter servers, which hold the model
    - See https://www.tensorflow.org/versions/r0.11/how_tos/distributed/index.html for more info
- Running distributed TensorFlow means
    - Running TensorFlow individually on each of multiple machines (e.g., with 10 machines, TensorFlow will be running on every one of them)
    - Telling TensorFlow which machines are workers and which are parameter servers, by
        - Providing the private IPs of the machines that make up the cluster (this is done through command line args)
        - Providing the machine's role (worker or ps) and its index within that role (this is done through command line args)
        - Example:
            - `./bazel-bin/inception/imagenet_distributed_train [... other_args here ...] --worker_hosts='172.31.7.97:1234,172.31.14.51:1234,172.31.9.233:1234,172.31.8.86:1234,172.31.7.247:1234' --ps_hosts='172.31.13.165:1234' --task_id=0 --job_name='worker'`
            - worker_hosts is a string containing a comma-separated list of the private IPs (with ports) of the worker machines in the cluster
            - ps_hosts is a string containing a comma-separated list of the private IPs (with ports) of the parameter servers in the cluster
            - job_name is a string, either "worker" or "ps", specifying the type of machine
            - task_id is an integer (0-indexed) specifying the index of the machine in worker_hosts if the machine is a worker, or in ps_hosts if the machine is a ps
    - So if you want a cluster of 10 machines to run the Inception model, you need to launch 10 instances on EC2, SSH into each one, and run the appropriate command. I have written some scripts (though they are quite ugly) to make this easier.
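
Every machine runs the same binary with the same host lists, and only job_name and task_id differ, so the per-machine commands can be generated from the two lists. The following is a minimal sketch of that idea and is not the author's scripts: the function name `build_commands` is hypothetical, the binary path and flags are copied from the example above, and the remaining arguments are left elided exactly as in that example.

```python
BINARY = "./bazel-bin/inception/imagenet_distributed_train"


def build_commands(worker_hosts, ps_hosts,
                   other_args="[... other_args here ...]"):
    """Map each host ("private_ip:port") to the command it should run.

    worker_hosts and ps_hosts are lists of "private_ip:port" strings.
    task_id is the host's 0-based position in the list for its job.
    other_args is a placeholder for the elided training arguments.
    """
    # These two flags are identical on every machine in the cluster.
    shared = "--worker_hosts='{}' --ps_hosts='{}'".format(
        ",".join(worker_hosts), ",".join(ps_hosts)
    )
    commands = {}
    for job_name, hosts in (("ps", ps_hosts), ("worker", worker_hosts)):
        for task_id, host in enumerate(hosts):
            commands[host] = "{} {} {} --task_id={} --job_name='{}'".format(
                BINARY, other_args, shared, task_id, job_name
            )
    return commands


# Private IPs and ports taken from the example above.
workers = ["172.31.7.97:1234", "172.31.14.51:1234", "172.31.9.233:1234",
           "172.31.8.86:1234", "172.31.7.247:1234"]
ps = ["172.31.13.165:1234"]

for host, command in build_commands(workers, ps).items():
    print(host, "->", command)
```

Each printed command is the one to run after SSH'ing into the machine with that private IP. Keeping the host lists in one place guarantees that every machine sees the same ordering, which matters because task_id is just a position in that list.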
    
