DistributedML/InsuLearn

InsuLearn

InsuLearn is an intuitive and robust distributed system designed to perform regression and classification on medical data, while preserving data security and privacy. You can read a published paper that describes the project:

Scalable and Fault Tolerant Platform for Distributed Learning on Private Medical Data. Alborz Amir-Khalili, Soheil Kianzad, Rafeef Abugharbieh, Ivan Beschastnikh. MICCAI MLMI 2017

@inproceedings{AmirKhalili2017,
  author    = {Alborz Amir{-}Khalili and Soheil Kianzad and Rafeef Abugharbieh and Ivan Beschastnikh},
  title     = {{Scalable and Fault Tolerant Platform for Distributed Learning on Private Medical Data}},
  booktitle = {Machine Learning in Medical Imaging - 8th International Workshop,
               {MLMI} 2017, Held in Conjunction with {MICCAI} 2017},
  series    = {Lecture Notes in Computer Science},
  volume    = {10541},
  pages     = {176--184},
  publisher = {Springer},
  year      = {2017},
  url       = {https://doi.org/10.1007/978-3-319-67389-9\_21},
}

InsuLearn is built on ensemble learning, in which statistical models are developed at each institution independently and combined at secure coordinator nodes. InsuLearn protocols are designed such that the liveness of the system is guaranteed as institutions join and leave the network. Coordination is implemented as a cluster of replicated state machines, making it tolerant to individual node failures. Fault-tolerant replication is achieved using the Raft consensus algorithm.
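
As a rough illustration of the ensemble idea described above, the following sketch combines per-institution class predictions with a weighted majority vote. It is not the Bayesian aggregation scheme implemented in bclass; the `LocalModel` type, its `Weight` field, and the `Aggregate` and `threshold` functions are hypothetical names chosen for this example.

```go
package main

import "fmt"

// LocalModel is a hypothetical stand-in for a model trained at one institution.
// Predict returns a class label (0 or 1); Weight expresses how much the
// coordinator trusts the model (for example, its validation accuracy).
type LocalModel struct {
	Name    string
	Weight  float64
	Predict func(features []float64) int
}

// Aggregate combines the local predictions with a weighted majority vote.
// It only looks at the models passed in, so institutions can join or leave
// between calls without any other state being updated.
func Aggregate(models []LocalModel, features []float64) (int, error) {
	if len(models) == 0 {
		return 0, fmt.Errorf("no local models available")
	}
	var votes [2]float64
	for _, m := range models {
		label := m.Predict(features)
		if label != 0 && label != 1 {
			return 0, fmt.Errorf("model %q returned invalid label %d", m.Name, label)
		}
		votes[label] += m.Weight
	}
	if votes[1] > votes[0] {
		return 1, nil
	}
	return 0, nil
}

// threshold builds a toy classifier that predicts 1 when the first feature exceeds t.
func threshold(t float64) func([]float64) int {
	return func(x []float64) int {
		if len(x) > 0 && x[0] > t {
			return 1
		}
		return 0
	}
}

func main() {
	// Illustrative models only; names and weights are made up.
	models := []LocalModel{
		{Name: "hospitalA", Weight: 0.9, Predict: threshold(0.5)},
		{Name: "hospitalB", Weight: 0.6, Predict: threshold(0.7)},
		{Name: "hospitalC", Weight: 0.4, Predict: threshold(0.2)},
	}
	label, err := Aggregate(models, []float64{0.6})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("global prediction:", label)
}
```

Only the trained models reach the coordinator in this sketch; the feature data stays with the caller, which mirrors the privacy goal stated above.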

  • bclass/ : A simple classification library and Bayesian aggregation scheme implemented in Go
  • client/ : Examples of different client implementations using the bclass or distmlMatlab libraries with and without replication, some of which are instrumented with GoVector
  • distmlMatlab/ : Library for interfacing with built-in MATLAB classification, regression, and ensemble techniques
  • server/ : Examples of different server implementations using the bclass or distmlMatlab libraries with and without replication, some of which are instrumented with GoVector
  • testdata/ : Example training and testing data for an execution with 5 nodes.

Compatibility

The Go-only implementations, which use the bclass library, are cross-platform compatible and have been tested on 64-bit Windows, Linux, and OS X. The MATLAB-based implementations (distmlMatlab library) interface with MATLAB through the MATLAB Engine API for C and require non-shared, single-session engine instances, which are only supported on 64-bit Windows platforms up to MATLAB version R2016b. The distmlMatlab implementations have been tested on 64-bit Windows 7, 8.1, and Server 2012 with MATLAB versions R2011b, R2014b, R2015a, and R2016b.

Installation

To use InsuLearn you must have a correctly configured Go development environment; see How to Write Go Code.

Once you set up your environment, InsuLearn can be installed into the designated GOPATH directory.

bclass Dependencies (Cross-Platform 64 bit)

The bclass polynomial classification library requires the gonum matrix library, which can be installed with the go tool:

go get github.com/gonum/matrix/mat64

distmlMatlab Dependencies (64 bit Windows Only)

The following steps are required to enable communication between Go and MATLAB via cgo and MATLAB Engine API.

  1. Install 64 bit MATLAB (R2011b or newer) into default directory

  2. Set MATLAB run-time library path (see instructions)

  3. Install a 64-bit gcc compiler for Windows (TDM-GCC-64 5.1.0)

  4. Create the following symbolic links, because the cgo linker does not handle Windows paths well (this must be done with administrative privileges):

    mklink /D <InsuLearn_root_directory>\include <matlab_install_directory>\extern\include

    mklink /D <InsuLearn_root_directory>\distmlMatlab\include <matlab_install_directory>\extern\include

    mklink /D <InsuLearn_root_directory>\lib <matlab_install_directory>\bin\win64

    mklink /D <InsuLearn_root_directory>\distmlMatlab\lib <matlab_install_directory>\bin\win64

  5. Ensure that the distmlMatlab\matlabfunctions\ folder has write privileges

  6. Update MATLAB_FUNCTION_PATH in distmlMatlab\matlabfun.c to the distmlMatlab\matlabfunctions folder

  7. Update the batch files with the appropriate paths for the MATLAB run-time and the gcc compiler
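
Before building, it can help to confirm that the symbolic links from step 4 resolve to a MATLAB installation. The sketch below is a hypothetical preflight check and not part of InsuLearn: it looks for `engine.h`, `libeng.dll`, and `libmx.dll` (files that ship with the MATLAB Engine API for C) behind the `include` and `lib` links. Treating exactly these three files as required is an assumption, as are the names `MissingFiles` and `fileExists`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// requiredFiles lists files expected behind the links created in step 4.
// Assumption: these three files are enough to indicate a usable setup.
var requiredFiles = []string{
	filepath.Join("include", "engine.h"),
	filepath.Join("lib", "libeng.dll"),
	filepath.Join("lib", "libmx.dll"),
}

// MissingFiles returns every required path that exists neither under root
// nor under root/distmlMatlab, according to the supplied exists function.
func MissingFiles(root string, exists func(string) bool) []string {
	var missing []string
	for _, base := range []string{root, filepath.Join(root, "distmlMatlab")} {
		for _, rel := range requiredFiles {
			p := filepath.Join(base, rel)
			if !exists(p) {
				missing = append(missing, p)
			}
		}
	}
	return missing
}

// fileExists reports whether path can be stat'ed; os.Stat follows symbolic links.
func fileExists(path string) bool {
	_, err := os.Stat(path)
	return err == nil
}

func main() {
	root := "." // run from the InsuLearn root directory, or pass it as an argument
	if len(os.Args) > 1 {
		root = os.Args[1]
	}
	missing := MissingFiles(root, fileExists)
	for _, p := range missing {
		fmt.Println("missing:", p)
	}
	if len(missing) == 0 {
		fmt.Println("all expected MATLAB Engine files found")
	}
}
```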

Other Dependencies

The following libraries are required for Raft replication and GoVector instrumentation:

go get github.com/arcaneiceman/GoVector/govec
go get github.com/coreos/etcd/raft

Client-Side Commands

The implementation of the client prompts the user for the following commands.

  • read : Reads data from disk.
  • push : Pushes trained model to server.
  • pull : Requests the global model from the server.
  • train : Trains the local model from local data (reports error).
  • valid : Validates the global model with local data.
  • test : Tests the local model with test data.
  • testg : Tests the global model with test data.
  • who : Prints the node name.
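
The prompt loop behind these commands can be pictured roughly as follows. This is a minimal sketch, not the client's actual code: the `actions` table and `Dispatch` function are hypothetical, the actions are only printed rather than executed, and the case-insensitive matching is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// actions maps each client command to a short description of what it does.
var actions = map[string]string{
	"read":  "read data from disk",
	"push":  "push the trained model to the server",
	"pull":  "request the global model from the server",
	"train": "train the local model from local data",
	"valid": "validate the global model with local data",
	"test":  "test the local model with test data",
	"testg": "test the global model with test data",
	"who":   "print the node name",
}

// Dispatch normalizes one line of user input and looks up the matching action.
// Assumption: commands are matched case-insensitively and ignore surrounding spaces.
func Dispatch(line string) (string, error) {
	cmd := strings.ToLower(strings.TrimSpace(line))
	action, ok := actions[cmd]
	if !ok {
		return "", fmt.Errorf("unknown command %q", cmd)
	}
	return action, nil
}

func main() {
	scanner := bufio.NewScanner(os.Stdin)
	fmt.Print("> ")
	for scanner.Scan() {
		if action, err := Dispatch(scanner.Text()); err != nil {
			fmt.Println(err)
		} else {
			fmt.Println("would", action)
		}
		fmt.Print("> ")
	}
}
```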

Examples

Here is a list of the command-line arguments that must be passed to each example program.

client

  • name : A string representing the unique name of the node in the system
  • ip:port : Address that the client uses to listen to the server
  • ip:port : Address of the server
  • train_data.txt : Name of the file containing the features of training data used to train the local model
  • train_label.txt : Name of the file containing the labels of training data used to train the local model
  • test_data.txt : Name of the file containing the features of testing data used to test the local and global models
  • test_label.txt : Name of the file containing the labels of testing data used to test the local and global models
  • id : A string representing the name of the node for GoVec log
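
To make the argument list concrete, here is a sketch of how the eight client arguments could be collected and checked. It assumes the arguments are positional and given in the order listed above; `ClientConfig` and `ParseClientArgs` are hypothetical names, and the sample values in `main` are made up.

```go
package main

import (
	"fmt"
	"net"
	"os"
)

// ClientConfig holds the client arguments in the order listed above.
type ClientConfig struct {
	Name       string // unique node name
	ListenAddr string // ip:port the client listens on
	ServerAddr string // ip:port of the server
	TrainData  string
	TrainLabel string
	TestData   string
	TestLabel  string
	LogID      string // node name used in the GoVec log
}

// ParseClientArgs validates the argument count and both ip:port values.
func ParseClientArgs(args []string) (ClientConfig, error) {
	if len(args) != 8 {
		return ClientConfig{}, fmt.Errorf("expected 8 arguments, got %d", len(args))
	}
	for _, addr := range []string{args[1], args[2]} {
		if _, _, err := net.SplitHostPort(addr); err != nil {
			return ClientConfig{}, fmt.Errorf("invalid ip:port %q: %v", addr, err)
		}
	}
	return ClientConfig{
		Name: args[0], ListenAddr: args[1], ServerAddr: args[2],
		TrainData: args[3], TrainLabel: args[4],
		TestData: args[5], TestLabel: args[6], LogID: args[7],
	}, nil
}

func main() {
	args := os.Args[1:]
	if len(args) == 0 {
		// Illustrative values only; none of these come from the repository.
		args = []string{"hospitalA", "127.0.0.1:6000", "127.0.0.1:5000",
			"train_data.txt", "train_label.txt", "test_data.txt", "test_label.txt", "A"}
	}
	cfg, err := ParseClientArgs(args)
	if err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```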

client_raft

  • name : A string representing the unique name of the node in the system
  • ip:port : Address that the client uses to listen to the server
  • serverlist.txt : Name of the file containing addresses of the servers (ip:port)
  • train_data.txt : Name of the file containing the features of training data used to train the local model
  • train_label.txt : Name of the file containing the labels of training data used to train the local model
  • test_data.txt : Name of the file containing the features of testing data used to test the local and global models
  • test_label.txt : Name of the file containing the labels of testing data used to test the local and global models
  • id : A string representing the name of the node for GoVec log
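
The layout of serverlist.txt is not spelled out here beyond its ip:port entries. The sketch below assumes one address per line and shows how such a file could be read and validated; `ParseAddrList` is a hypothetical helper, not a function from the repository.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// ParseAddrList extracts ip:port entries from the contents of a list file.
// Assumption: one address per line; blank lines are ignored.
func ParseAddrList(content string) ([]string, error) {
	var addrs []string
	for i, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line) // also strips the \r of Windows line endings
		if line == "" {
			continue
		}
		if _, _, err := net.SplitHostPort(line); err != nil {
			return nil, fmt.Errorf("line %d: %v", i+1, err)
		}
		addrs = append(addrs, line)
	}
	if len(addrs) == 0 {
		return nil, fmt.Errorf("no addresses found")
	}
	return addrs, nil
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: go run main.go serverlist.txt")
		return
	}
	data, err := os.ReadFile(os.Args[1])
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	addrs, err := ParseAddrList(string(data))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, a := range addrs {
		fmt.Println("server:", a)
	}
}
```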

server

  • ip:port : Address that the server uses to listen to clients
  • id : A string representing the name of the server for GoVec log

server_raft

  • ip:port : Address that the server uses to listen to clients
  • raftlist.txt : Name of the file containing addresses that the raft servers use to communicate to each other (ip:port)
  • index : Integer number mapping this server to an entry in the raftlist.txt
  • id : A string representing the name of the server for GoVec log
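
The index argument selects this server's own entry from raftlist.txt, and the remaining entries are its Raft peers. The sketch below illustrates that mapping and the majority size Raft needs to make progress. It assumes a zero-based index, which this README does not confirm; `SplitPeers`, `MajoritySize`, and the sample addresses are hypothetical.

```go
package main

import "fmt"

// SplitPeers returns the address at index and the remaining peer addresses.
// Assumption: index is zero-based.
func SplitPeers(addrs []string, index int) (string, []string, error) {
	if index < 0 || index >= len(addrs) {
		return "", nil, fmt.Errorf("index %d out of range for %d entries", index, len(addrs))
	}
	peers := make([]string, 0, len(addrs)-1)
	peers = append(peers, addrs[:index]...)
	peers = append(peers, addrs[index+1:]...)
	return addrs[index], peers, nil
}

// MajoritySize is the number of servers that must agree for a Raft cluster
// of n servers to commit an entry; the cluster tolerates n - MajoritySize(n) failures.
func MajoritySize(n int) int {
	return n/2 + 1
}

func main() {
	// Illustrative addresses only.
	addrs := []string{"127.0.0.1:2700", "127.0.0.1:2701", "127.0.0.1:2702"}
	self, peers, err := SplitPeers(addrs, 1)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("self:", self)
	fmt.Println("peers:", peers)
	fmt.Println("majority:", MajoritySize(len(addrs)))
}
```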

Scripts

bclass (Cross-Platform 64 bit)

Four bash and batch scripts are provided to start a regular run or a Raft-replicated run of the system on either Windows or Linux.

To execute a normal run of the system, run:

sh run.sh

or

run.bat

To execute a Raft replicated run of the system, run:

sh run_raft.sh

or

run_raft.bat

distmlMatlab (64 bit Windows Only)

To execute a normal run of the system, run:

run_matlab.bat

To execute a Raft replicated run of the system, run:

run_matlab_raft.bat

distmlMatlab Azure Scripts (64 bit Windows Only)

The following scripts are used to deploy an automated version of the client and server on Microsoft Azure Cloud:

  • Azure_client.bat
  • Azure_server.bat
  • Azure_raftserver.bat

Processing Log Files

The following scripts are used to process GoVector generated log files:

  • concat.bat
  • parse-log.bat
