Provisioning of distributed infrastructure for learning as an optimization problem. An implementation of proof-of-concept iteration components is provided.

alfin3/infrastructure-as-opt

infrastructure-as-opt

The need for faster processing of large amounts of data, in some cases with a notion of locality, drives the design of distributed approaches to machine learning. Two common considerations are (i) which parts of a computation to distribute and then reduce or synchronize, and (ii) how to minimize communication latency.

These considerations pose an optimization problem in the “infrastructure vs. model architecture” search space.

To effectively traverse this search space, it is necessary to:

  1. iterate across infrastructures and model architectures, and
  2. measure the “fit” between an infrastructure and a model architecture.
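The two steps above can be illustrated with a minimal sketch, assuming a small discrete search space and a scalar fit score where lower is better. The names `search`, `toy_fit`, and `TOY_STEP_TIME`, as well as the infrastructure and architecture labels and the numbers, are hypothetical placeholders for illustration and are not part of this repository.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Pair = Tuple[str, str]


def search(
    infrastructures: Iterable[str],
    architectures: Iterable[str],
    fit: Callable[[str, str], float],
) -> Tuple[Pair, float]:
    """Score every (infrastructure, architecture) pair and return the best one.

    Assumes that a lower fit score is better (e.g. a latency-based cost).
    """
    scored: Dict[Pair, float] = {
        (infra, arch): fit(infra, arch)
        for infra, arch in product(infrastructures, architectures)
    }
    best = min(scored, key=scored.get)
    return best, scored[best]


# Placeholder scores, invented for illustration only (not measurements).
TOY_STEP_TIME: Dict[Pair, float] = {
    ("2-node-cpu", "small-mlp"): 0.8,
    ("2-node-cpu", "large-cnn"): 5.0,
    ("4-node-gpu", "small-mlp"): 0.9,
    ("4-node-gpu", "large-cnn"): 1.2,
}


def toy_fit(infra: str, arch: str) -> float:
    """Hypothetical fit function: look up a placeholder score for the pair."""
    return TOY_STEP_TIME[(infra, arch)]


if __name__ == "__main__":
    pair, score = search(
        ["2-node-cpu", "4-node-gpu"], ["small-mlp", "large-cnn"], toy_fit
    )
    print(pair, score)
```

In practice the fit function would come from measurements on a provisioned infrastructure rather than a lookup table, and an exhaustive sweep would likely be replaced by a cheaper search strategy once the space grows.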

Towards this goal, as an Insight Fellow, I developed:

  1. one-step provisioning of an on-premises-capable Kubernetes (k8s) infrastructure, to streamline infrastructure iteration, and
  2. a set of proof-of-concept modules for constructing objective functions based on latency tracing.
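As one possible reading of "objective functions based on latency tracing", the sketch below turns a list of traced spans into a single score: the fraction of traced time spent in communication. The `Span` type, its `kind` labels, and `communication_fraction` are assumptions made for illustration; they do not describe the modules in this repository.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Span:
    """A traced interval; `kind` is assumed to be "compute" or "comm"."""

    name: str
    kind: str
    start_s: float
    end_s: float

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s


def communication_fraction(spans: Iterable[Span]) -> float:
    """Return the share of total traced time spent in "comm" spans.

    Lower values suggest a better fit between the infrastructure and the
    model architecture, under the assumption that communication is overhead.
    """
    spans = list(spans)
    total = sum(s.duration_s for s in spans)
    if total <= 0:
        raise ValueError("trace contains no measurable time")
    comm = sum(s.duration_s for s in spans if s.kind == "comm")
    return comm / total


if __name__ == "__main__":
    # Hypothetical trace of one training step (timestamps in seconds).
    trace = [
        Span("forward_backward", "compute", 0.0, 3.0),
        Span("gradient_sync", "comm", 3.0, 4.0),
    ]
    print(communication_fraction(trace))
```

This sketch sums span durations and ignores overlap between compute and communication; a trace with overlapping spans would need interval merging before the ratio is meaningful.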

infrastructure-as-opt/infrastructure

infra_example

See further instructions in infrastructure-as-opt/infrastructure.

infrastructure-as-opt/tracing-ml

latencies_explorer

See further instructions in infrastructure-as-opt/{cpu, gpu}-build-jobs.
