# FloydHub SDK Demo

This notebook shows how to use the FloydHub Python SDK to automate your FloydHub workflow. Every operation you perform on the CLI can also be done programmatically through the SDK; in fact, the CLI itself uses the SDK to communicate with the FloydHub server. The SDK ships with the `floyd-cli` package, which you can install with pip.

The best way to run this notebook is to create a new directory, copy the notebook into it, and then populate that directory with some files.

In [None]:
# Install sdk
!pip install -q floyd-cli

# Create some files for testing purposes
!echo "hello" > ./hello.txt
!echo "print (\"Hello world\")" > ./hello_world.py

# Authentication with username / password

The first step is to authenticate yourself against the FloydHub server. You can use your username and password to get an access token from the server.

The token is saved by the `AuthConfigManager` and automatically used in subsequent SDK calls. It is stored at `~/.floydconfig`.

In [None]:
from floyd.client.auth import AuthClient
from floyd.log import configure_logger
from floyd.model.access_token import AccessToken
from floyd.model.credentials import Credentials
from floyd.manager.auth_config import AuthConfigManager

# Initialize logger
configure_logger(verbose=False)

# Login using credentials (replace with your credentials)
login_credentials = Credentials(username="your_username", password="your_password")
access_code = AuthClient().login(login_credentials)
user = AuthClient().get_user(access_code)
access_token = AccessToken(username=user.username,
                           token=access_code)

# Auth token is stored and automatically used in subsequent sdk calls
AuthConfigManager.set_access_token(access_token)
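
If you automate the login step, you probably do not want a password sitting in the notebook. The following is a minimal sketch, not part of the SDK, that reads the credentials from the environment instead. The helper name `read_login_from_env` and the variable names `FLOYD_USERNAME` and `FLOYD_PASSWORD` are assumptions made for this example; the SDK does not read them on its own.

```python
import os


def read_login_from_env(environ=None):
    """Return (username, password) read from environment variables.

    FLOYD_USERNAME and FLOYD_PASSWORD are hypothetical names chosen for
    this sketch; they are not defined or consumed by the SDK itself.
    """
    environ = os.environ if environ is None else environ
    required = ("FLOYD_USERNAME", "FLOYD_PASSWORD")

    # Fail early with a clear message instead of sending empty credentials
    missing = [key for key in required if not environ.get(key)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))

    return environ["FLOYD_USERNAME"], environ["FLOYD_PASSWORD"]
```

The returned pair can then be passed to `Credentials(username=..., password=...)` in place of the literal strings used in the cell above.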

# Authentication with API Key

Alternatively, you can get an API key for your account at https://www.floydhub.com/settings/apikey. You can set an expiration for the key and use it for authentication.

In [None]:
AuthConfigManager.set_apikey(username="your_username", apikey="apikey_from_floydhub")

# Data

FloydHub manages data separately from code. You need to create a dataset directly from the [website](https://www.floydhub.com/datasets/create). Then use the dataset name in the section below to upload the contents of the current directory to FloydHub as a dataset. You will later mount this data into a job.

In [1]:
from floyd.client.data import DataClient
from floyd.client.dataset import DatasetClient
from floyd.manager.auth_config import AuthConfigManager
from floyd.manager.data_config import DataConfig
from floyd.cli.data_upload_utils import initialize_new_upload, complete_upload
from floyd.cli.utils import get_namespace_from_name

# Get access token from the stored config file
# Or re-authenticate from the previous step
access_token = AuthConfigManager.get_access_token()

# Replace with your dataset name
dataset_name = "floydlabs/test11"
dataset = DatasetClient().get_by_name(dataset_name)

namespace, name = get_namespace_from_name(dataset_name)
data_config = DataConfig(name=name,
                         namespace=namespace,
                         family_id=dataset.id)

# This is the actual upload step
initialize_new_upload(data_config, access_token, "new upload")
complete_upload(data_config)

Waiting for unpack....


In [2]:
from floyd.manager.data_config import DataConfigManager
from floyd.cli.utils import normalize_data_name

# Get the uploaded data name
data_config = DataConfigManager.get_config()
data_name = normalize_data_name(data_config.data_name)

## Dataset info & status

We can retrieve the name and info of the datasets we have uploaded. 

In [3]:
from floyd.cli.data import get_data_object
from floyd.client.data import DataClient
from tabulate import tabulate

def print_data(data_sources):
    """
    Print dataset information in tabular form
    """
    if not data_sources:
        return

    headers = ["DATA NAME", "CREATED", "STATUS", "DISK USAGE"]
    data_list = []
    for data_source in data_sources:
        data_list.append([data_source.name,
                          data_source.created_pretty,
                          data_source.state, data_source.size])
    print(tabulate(data_list, headers=headers))

# This will retrieve the info for all the datasets under floydlabs/test11
data_sources = DataClient().get_all()
print_data(data_sources)

DATA NAME                     CREATED         STATUS    DISK USAGE
----------------------------  --------------  --------  ------------
floydlabs/datasets/test11/22  13 seconds ago  valid     341.0 KB
floydlabs/datasets/test11/18  1 months ago    valid     53.0 KB
floydlabs/datasets/test11/17  4 months ago    valid     11.18 MB
floydlabs/datasets/test11/15  7 months ago    valid     795.42 MB
floydlabs/datasets/test11/14  7 months ago    valid     795.47 MB
floydlabs/datasets/test11/13  7 months ago    valid     795.46 MB
floydlabs/datasets/test11/12  7 months ago    valid     795.4 MB
floydlabs/datasets/test11/11  8 months ago    valid     769.06 MB
floydlabs/datasets/test11/8   8 months ago    valid     278.07 MB
floydlabs/datasets/test11/9   8 months ago    valid     278.07 MB
floydlabs/datasets/test11/10  8 months ago    valid     62.77 MB
floydlabs/datasets/test11/7   8 months ago    valid     20.0 KB
floydlabs/datasets/test11/6   10 months ago   valid     40.0 KB
floydlabs/datase

In [4]:
# or we can get the status of a single entry
dataset_name = "floydlabs/test11"

data_source = get_data_object(dataset_name, use_data_config=False)
print_data([data_source] if data_source else [])

DATA NAME                     CREATED         STATUS    DISK USAGE
----------------------------  --------------  --------  ------------
floydlabs/datasets/test11/22  17 seconds ago  valid     341.0 KB


## Delete a dataset version

You can delete one or more dataset versions. Be careful with this, especially if you are automating the process.

In [5]:
# We will remove the last dataset version we have just created
dataset_to_remove = 'floydlabs/datasets/test11/22'

data_source = get_data_object(dataset_to_remove, use_data_config=True)
if not DataClient().delete(data_source.id):
    print("Error!")
else:
    print("Data Deleted: ", dataset_to_remove)

Data Deleted:  floydlabs/datasets/test11/22
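
Since deletion cannot be undone from the notebook, it can help to check the name before calling `DataClient().delete`. The sketch below is one possible guard, not something the SDK provides: it accepts only names shaped like the full version names printed in the listing above (`<namespace>/datasets/<name>/<version>`). The pattern is inferred from that output and the helper name `is_dataset_version` is made up for this example.

```python
import re

# Pattern inferred from the listing output, e.g. "floydlabs/datasets/test11/22".
# This is an assumption about the naming scheme, not an SDK guarantee.
VERSION_NAME = re.compile(r"^[^/\s]+/datasets/[^/\s]+/\d+$")


def is_dataset_version(name):
    """Return True only if `name` looks like a full, versioned dataset name."""
    return bool(VERSION_NAME.match(name))


print(is_dataset_version("floydlabs/datasets/test11/22"))  # full version name
print(is_dataset_version("floydlabs/test11"))              # dataset without version
```

An automated cleanup script could skip any name for which this check returns `False`, so that a short name such as `floydlabs/test11` is never passed to the delete call by accident.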


# Job

You can kick off a training job, monitor it, and download the output, all using the SDK. The next section shows how to run a job under a specific project. Create the project from the FloydHub [website](https://www.floydhub.com/projects/create) and use the project name in the next section.

In [None]:
from floyd.client.project import ProjectClient
from floyd.manager.experiment_config import ExperimentConfigManager
from floyd.manager.floyd_ignore import FloydIgnoreManager
from floyd.model.experiment_config import ExperimentConfig
from floyd.cli.utils import get_namespace_from_name

# Replace with your project name
project_name = "floydlabs/private-proj"
project = ProjectClient().get_by_name(project_name)

namespace, name = get_namespace_from_name(project_name)
experiment_config = ExperimentConfig(name=name,
                                     namespace=namespace,
                                     family_id=project.id)
ExperimentConfigManager.set_config(experiment_config)
FloydIgnoreManager.init()

# Mounting Data

You can mount any data on FloydHub (that you have access to) into your job at the path you specify. In this case we mount the dataset we created above at the `/training` path. You also need to specify the FloydHub instance type and the [environment](https://docs.floydhub.com/guides/environments/) you want to use.

Running a job is currently a two-step process: you first upload the code and then run the experiment (or job).

In [9]:
from floyd.client.experiment import ExperimentClient
from floyd.client.module import ModuleClient
from floyd.constants import INSTANCE_ARCH_MAP
from floyd.model.experiment import ExperimentRequest
from floyd.model.module import Module

# Run a job
# Get the data mount id (data_name comes from the previous step)
data_obj = DataClient().get(normalize_data_name(data_name))
data_ids = ["{}:{}".format(data_obj.id, "/training")]

# Define the data mount point for data
module_inputs = {
    "name": "/training",
    "type": "dir" # Always use dir here
}
    
# First create a module and then use it in the experiment create step

experiment_name = project_name
instance_type = "c1" # You can use c1 for cpu, c2 for cpu2, g1 for gpu and g2 for gpu2
project_id = project.id

# Get env value
arch = INSTANCE_ARCH_MAP[instance_type]
env = "tensorflow-1.5"  # Choose env that you need

module = Module(name=experiment_name,
                description='foo',
                command="ls /training",
                mode='command',
                family_id=project_id,
                inputs=module_inputs,
                env=env,
                arch=arch)

module_id = ModuleClient().create(module)
    
experiment_request = ExperimentRequest(name=experiment_name,
                                       description='foo',
                                       full_command='ls /training',
                                       module_id=module_id,
                                       env=env,
                                       data_ids=data_ids,
                                       family_id=project_id,
                                       instance_type=instance_type)
expt_info = ExperimentClient().create(experiment_request)

Creating project run. Total upload size: 29.3KiB
Syncing code ...
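
The cell above builds each entry of `data_ids` as `"<data id>:<mount path>"`. If you mount more than one dataset, a small helper keeps that formatting in one place. This is a sketch under assumptions: `build_data_ids` is a hypothetical helper, the ids in the example are placeholders, and the duplicate-path check is a precaution added here rather than documented server behavior.

```python
def build_data_ids(mounts):
    """Format (data_id, mount_path) pairs as "id:path" strings.

    Raises ValueError if two datasets are given the same mount path,
    which is a precaution of this sketch, not an SDK rule.
    """
    seen_paths = set()
    data_ids = []
    for data_id, mount_path in mounts:
        if mount_path in seen_paths:
            raise ValueError("Duplicate mount path: " + mount_path)
        seen_paths.add(mount_path)
        # Same "{id}:{path}" format used in the cell above
        data_ids.append("{}:{}".format(data_id, mount_path))
    return data_ids


# Placeholder ids for illustration only
print(build_data_ids([("abc123", "/training"), ("def456", "/validation")]))
```

With a single dataset, `build_data_ids([(data_obj.id, "/training")])` produces the same list as the inline `format` call used above.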


# Tracking an experiment

You can poll an experiment periodically and wait for it to finish. You can also set up a [notification webhook](https://docs.floydhub.com/guides/notifications/) to be notified when jobs finish, and you can programmatically download the output of your training job.

In [12]:
from floyd.client.experiment import ExperimentClient
from floyd.client.resource import ResourceClient

# Track experiment
job_id = expt_info['id']
experiment = ExperimentClient().get(job_id)
print(experiment.state)

# Stop running job (works only if the job is queued or running)
# ExperimentClient().stop(job_id)

success
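
The cell above checks the state once. To wait for a job to finish, you can wrap that check in a polling loop. The following is a generic sketch and not an SDK function: `wait_for_state` takes any callable that returns the current state, so it does not depend on SDK internals. Of the state names, only `success` appears in the output above and `queued` and `running` are mentioned in the comment; any other terminal state you pass in (for example `failed`) is an assumption you should verify against your own jobs.

```python
import time


def wait_for_state(get_state, terminal_states, interval=30.0, timeout=3600.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Call get_state() until it returns a value in terminal_states.

    Returns the terminal state, or raises TimeoutError once `timeout`
    seconds have passed without reaching one. `sleep` and `clock` are
    injectable so the loop can be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        state = get_state()
        if state in terminal_states:
            return state
        if clock() >= deadline:
            raise TimeoutError("Still in state {!r} after {} seconds".format(state, timeout))
        sleep(interval)


# Dry run with a fake state sequence instead of a real job
fake_states = iter(["queued", "running", "success"])
print(wait_for_state(lambda: next(fake_states), {"success"}, sleep=lambda s: None))
```

For a real job, the callable could be `lambda: ExperimentClient().get(job_id).state`, reusing the call from the cell above. Keep the interval generous so the loop does not send requests to the server more often than it needs to.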


In [14]:
# Get logs
log_resource_id = experiment.instance_log_id
logs = ResourceClient().get_content(log_resource_id)
print(logs)

2019-01-15 10:59:43,547 INFO - Preparing to run TaskInstance <TaskInstance: floydlabs/projects/private-proj/95 (id: MojAv2Wf9kGjENAfqhDEUV)
2019-01-15 10:59:43,573 INFO - Starting attempt 1
2019-01-15 10:59:43,590 INFO - Downloading and setting up data sources
2019-01-15 10:59:43,602 INFO - Downloading and mounting training. ETA: 2 seconds
2019-01-15 10:59:43,990 INFO - Using Docker image: floydhub/tensorflow:1.5.0-py3_aws.35
2019-01-15 10:59:44,121 INFO - Starting container...
2019-01-15 10:59:44,329 INFO - 
################################################################################

2019-01-15 10:59:44,330 INFO - Run Output:
2019-01-15 10:59:44,344 INFO - Starting services.
2019-01-15 10:59:44,493 INFO - demo
2019-01-15 10:59:44,493 INFO - floydhub_sdk_demo.ipynb
2019-01-15 10:59:44,493 INFO - hello.txt
2019-01-15 10:59:44,493 INFO - hello_world.py
2019-01-15 10:59:44,545 INFO - 
################################################################################

2019-01-15 10:59:4

In [13]:
# Download an output model file
output_id = experiment.output_id
data_url = "https://www.floydhub.com/api/v1/resources/{}?content=true&download=true".format(output_id)
DataClient().download_tar(url=data_url,
                          untar=True,
                          delete_after_untar=True)

Downloading the tar file to the current directory ...
Untarring the contents of the file ...
Cleaning up the tar file ...


'output.tar'

In [None]:
## Get detailed info about the experiment by directly parsing the job info json.
## Note: Some of these fields are for internal use and can change without warning.

from floyd.client.experiment import ExperimentClient

ExperimentClient().request("GET", "/experiments/" + experiment.id).json()
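
Because some fields in the job info JSON can change without warning, it is safer to copy out only the keys you need and tolerate missing ones. A minimal sketch, assuming the response is a flat dictionary at the top level; `pick_fields` is a hypothetical helper and the keys in the example are placeholders, not a documented schema.

```python
def pick_fields(info, keys, default=None):
    """Return a new dict with only `keys`, using `default` for absent ones."""
    return {key: info.get(key, default) for key in keys}


# Placeholder payload for illustration; real responses contain other fields
sample = {"id": "abc123", "state": "success", "internal_flag": True}
print(pick_fields(sample, ["id", "state", "started"]))
```

Code written this way keeps working if a field disappears from the response, and a `None` value signals that it needs attention.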

# Support

This SDK is in beta. If you have any questions or are interested in adopting it for your workflow, please contact us at support@floydhub.com. We are happy to support you and work with you on automating your training.