# Running Distributed TensorFlow on EC2

## AWS Setup 
### Navigating to the Console
- Go to the EC2 console page by clicking "Services" (top left), then "EC2" in the dropdown.
    - On this page you'll see statistics about how many "instances" are running, etc.
- Click the "Instances" tab on the left
    - This page shows you all the instances that are running / terminated / stopped
        - Stopping an instance saves the state of the machine (files, etc. from previous runs are left unchanged), but the machine is not running.
        - Terminating an instance shuts it down permanently; the machine is gone for good.
        - A running instance is currently alive.
               
### Launching your first EC2
- Click the blue "Launch Instance" button at the top
- Make sure you are in the "US West (Oregon)" region, using the region selector in the top right corner
- Select the "Ubuntu" AMI 
    - Note: AMI = Amazon Machine Image = a saved machine state (containing installed software, etc.)
        - Amazon has a "marketplace" of AMIs that allow launching instances pre-installed with specific software packages. Additionally, you can create your own AMI after configuring a machine with installed software, and share it.
- Select t2.micro (free tier) and click "Next:Configure..."
- Leave this page unchanged (skip the "Configure Instance Details" page). Click next.
- Leave this page unchanged (skip the "Add Storage" page). Click next.
- Leave this page unchanged (skip the "Tag Instance" page). Click next.
- On "Step 6: Configure Security Group", make sure SSH can be accessed from a source of "Anywhere". Click Review and Launch.
- Click Launch
    - A popup will appear telling you to select a key pair. Since this is our first time launching an instance, select "Create a new key pair", name it, and download it to a safe location on your machine. On later launches, you can reuse an existing key pair.
    - Select Launch Instance
    
### SSH'ing into your EC2 machine
- Navigate to the instances tab of the EC2 console.
- After following the steps for launching the EC2 machine, you should see a new entry in the instance page.
- Selecting the checkbox next to it will show details regarding the machine
    - Click the Connect button at the top. It will tell you how to ssh into that instance.
    - Note: the .pem they refer to is the key file you downloaded when launching the EC2.
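
The command shown by the Connect dialog generally has the shape assembled below. This is only a minimal sketch: `build_ssh_command` is a hypothetical helper, the key file name and host name are placeholders, and `ubuntu` is assumed as the login user because the Ubuntu AMI was selected above.

```python
import shlex


def build_ssh_command(key_path, public_dns, user="ubuntu"):
    """Return the ssh argument list for logging into an EC2 instance."""
    # -i selects the private key (.pem) that was downloaded at launch time.
    return ["ssh", "-i", key_path, "{}@{}".format(user, public_dns)]


if __name__ == "__main__":
    # Placeholder values; substitute the ones shown in the Connect dialog.
    cmd = build_ssh_command(
        "my-key.pem", "ec2-198-51-100-1.us-west-2.compute.amazonaws.com"
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

If ssh rejects the key because its permissions are too open, restricting the file to your own user (for example with chmod 400) usually resolves it.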
   
### Distributed TensorFlow Inception Model with EC2
#### Basic Info
- In distributed tensorflow there are 2 types of machines
    - Worker machines, which do gradient computation
    - Parameter servers, which hold the model parameters
    - See https://www.tensorflow.org/versions/r0.11/how_tos/distributed/index.html for more info
- Running distributed tensorflow means
    - Running tensorflow individually on every machine (e.g., with 10 machines, tensorflow will be running on each and every one of them)
    - Telling tensorflow which machines are workers and which are parameter servers, by
        - Providing the private ips of the machines that make up the cluster (this is done through command line args)
        - Providing the type of the machine and its index within that type (this is done through command line args)
        - Example: 
            - ./bazel-bin/inception/imagenet_distributed_train [... other_args here ...] --worker_hosts='172.31.7.97:1234,172.31.14.51:1234,172.31.9.233:1234,172.31.8.86:1234,172.31.7.247:1234' --ps_hosts='172.31.13.165:1234' --task_id=0 --job_name='worker'
            - worker_hosts is a string containing a comma-separated list of the private ip:port addresses of the worker machines in the cluster
            - ps_hosts is a string containing a comma-separated list of the private ip:port addresses of the parameter servers in the cluster
            - job_name is a string, either "worker" or "ps", specifying the type of this machine
            - task_id is an integer (0-indexed) specifying the index of this machine in worker_hosts if the machine is a worker, or in ps_hosts if the machine is a ps
    - So if you want a cluster of 10 machines to run the inception model, you need to launch 10 instances on EC2, ssh into each one, and run the appropriate command. I have written some scripts (though they are quite ugly) to make this easier. There are probably better solutions to managing this, but I did not seem to find many that were simple and widely used. Definitely let me know if you find any better ways to manage these jobs!
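
To make the flag layout above concrete, here is a minimal sketch that generates the command each machine would run. `build_train_commands` is a hypothetical helper (it is not part of the scripts described in this document), the port default of 1234 simply mirrors the example, and `extra_args` stands in for the other arguments elided in that example.

```python
def build_train_commands(worker_ips, ps_ips, port=1234, extra_args=""):
    """Return a dict mapping each private ip to the command it should run."""
    # Every machine receives the same two host lists.
    worker_hosts = ",".join("{}:{}".format(ip, port) for ip in worker_ips)
    ps_hosts = ",".join("{}:{}".format(ip, port) for ip in ps_ips)

    commands = {}
    for job_name, ips in (("worker", worker_ips), ("ps", ps_ips)):
        # task_id is the 0-based position of the machine within its own list.
        for task_id, ip in enumerate(ips):
            parts = ["./bazel-bin/inception/imagenet_distributed_train"]
            if extra_args:
                parts.append(extra_args)
            parts += [
                "--worker_hosts='{}'".format(worker_hosts),
                "--ps_hosts='{}'".format(ps_hosts),
                "--task_id={}".format(task_id),
                "--job_name='{}'".format(job_name),
            ]
            commands[ip] = " ".join(parts)
    return commands


if __name__ == "__main__":
    cmds = build_train_commands(["172.31.7.97", "172.31.14.51"], ["172.31.13.165"])
    for ip, cmd in cmds.items():
        print(ip, "->", cmd)
```

Only --task_id and --job_name differ between machines; the host lists are identical everywhere, which is what lets each process locate the rest of the cluster.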
    
### Launching
- First you need to install the AWS command line interface (CLI)
    - https://aws.amazon.com/cli/
    - http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-set-up.html
- I've provided some very basic example scripts to launch a bunch of machines and run the inception model.
    - launch.sh
        - https://github.com/agnusmaximus/models/blob/9b4021c95bea4f8ab146cb7af1f5e6ec4a96fd4a/inception/tools/launch.sh
        - sh launch.sh [machine_tier] [n_instances]
        - WARNING: Kills all live instances, spot requests, spot instances
        - launches n_instances spot machines of tier machine_tier
            - Spot machines are machines where users bid on the price. This means that if enough other people bid a higher price, your machine may be killed and taken away from you. But the cost is usually far lower than for on-demand instances.
            - Set price with spot_price variable in the script. (.1 = 10 cents per hour)
            - See price history by navigating to the Spot Requests tab in the AWS EC2 console.
        - Note: Uses a customized AMI with tensorflow installed, inception pulled from the github repository, flowers dataset downloaded.
    - run_distributed.sh
        -  https://github.com/agnusmaximus/models/blob/9b4021c95bea4f8ab146cb7af1f5e6ec4a96fd4a/inception/tools/run_distributed.sh
        - Note this relies on the python script - https://github.com/agnusmaximus/models/blob/9b4021c95bea4f8ab146cb7af1f5e6ec4a96fd4a/inception/tools/extract_workers_ps.py
        - If there are N instances running, assigns 1 to be PS and the rest to be workers.
            - First instance is master
            - Last instance is PS
        - sh run_distributed.sh [batch_size]
            - SSH's into all the live instances
            - Kills any existing python jobs
            - Cds into the inception directory and pulls from my github url (you might want to remove this)
            - Runs ./bazel-bin/inception/imagenet_distributed_train...
                - This is printed to the console so you know what is running on each machine
            - Prints the public ip address and corresponding imagenet_distributed_train command executed on them
                - This is useful as you can ssh into that instance with the ip address
    - My general workflow is
        - To launch machines and run
            - sh tools/launch.sh m4.2xlarge 6 && sleep 180 && sh tools/run_distributed.sh 100
                - batchsize=100, 6 instances.
                - sleep in between since it takes some time for the machines to set up. Without the sleep sometimes you get ssh errors.
            - Then I usually SSH into the master machine via the public ip address printed by the script, cd into models/inception, then run "more out0" to see the output of the imagenet_distributed_train command.
                - This contains details as to the loss computed by the machines, etc.
        - To rerun the inception model training with existing instances
            - sh tools/run_distributed.sh [batch_size]
                - This will kill all python processes before running the distributed training
        - To shut everything down
            - sh tools/launch.sh
                - Will prompt whether you want to shut everything down. Type y and enter.
                - Will then prompt whether you want to launch instances. Type N and enter.
        - You may want to modify these scripts to be adaptable for your own purposes.
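
The role assignment described for run_distributed.sh (last instance is the PS, all others are workers, first instance is the master) can be sketched as follows. This is an illustration of that rule only, not the actual contents of extract_workers_ps.py; `assign_roles` is a hypothetical name, the port default mirrors the earlier example, and treating the master as the first worker is an assumption.

```python
def assign_roles(private_ips, port=1234):
    """Split an ordered list of instance ips into worker and ps host strings."""
    if len(private_ips) < 2:
        raise ValueError("need at least one worker and one parameter server")

    # Last instance becomes the parameter server; the rest are workers.
    workers, ps = private_ips[:-1], private_ips[-1:]
    return {
        # Assumption: the master is the first worker in the list.
        "master": workers[0],
        "worker_hosts": ",".join("{}:{}".format(ip, port) for ip in workers),
        "ps_hosts": ",".join("{}:{}".format(ip, port) for ip in ps),
    }


if __name__ == "__main__":
    roles = assign_roles(["172.31.7.97", "172.31.14.51", "172.31.13.165"])
    print(roles["worker_hosts"])
    print(roles["ps_hosts"])
```

The two strings it returns are in the format expected by the --worker_hosts and --ps_hosts flags, so a modified launcher could pass them straight through.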
            
        