In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Introduction

CorgiPile is a novel two-level hierarchical data shuffle mechanism for efficient SGD in DB systems. CorgiPile first samples and shuffles data at the block-level, and then shuffles data at the tuple-level within the sampled data blocks, i.e., firstly shuffling data blocks (a batch of table pages per block), and then merging sampled blocks in a buffer, and finally shuffling tuples in the buffer for SGD. Compared with existing mechanisms, CorgiPile can avoid the full shuffle while maintaining comparable convergence rate as if a full shuffle was performed.

We have implemented CorgiPile inside PostgreSQL, with three new inherent physical operators. Extensive experimental results show that our CorgiPile in PostgreSQL can achieve comparable convergence rate with the full-shuffle based SGD, and 2.X-12.8X faster than the state-of-the-art in-DB ML systems including Apache MADlib and Bismarck, on both HDD and SSD.

Implementation in PostgreSQL

The following figure illustrates the implementation of CorgiPile with new operators and double-buffering optimization, in PostgreSQL.

The query, control flow and iteration paradigm of CorgiPile in PostgreSQL is as follows.

Query: SELECT * FROM table_name TRAIN BY model_name WITH params = args;

// Query plan: ExecSGD => ExecTupleShuffle => ExecBlockShuffle 

/************************
 *  SGD operator 
************************/
void ExecInit():
   initialize the ML model;
    
Model ExecSGD():
   for each epoch:
      /* training */
      while(true):
         pull a tuple i from TupleShuffle operator;
         if (tuple i is not null):
            compute the gradient g of tuple i;
            update model with g;
         else:
            break;
      ExecReScan();
   return trained model;
    
void ExecReScan():
   invoke TupleShuffle.ExecReScan();

/************************
 * TupleShuffle operator
************************/
void ExecInit(): 
   initialize buffer and I/O offsets;
   
Tuple ExecTupleShuffle():
   if (buffer is empty):
      pull tuples from previous BlockShuffle operator
           one by one until the buffer is filled;
      shuffle tuples in the buffer;
   else:
      return buffer[offset++];
      
void ExecReScan():
   clear the buffer and I/O offsets;
   invoke BlockShuffle.ExecReScan();

/************************
 * BlockShuffle operator
************************/
void ExecInit(): 
   compute the total block number;
   shuffle the block indexes;
   
Tuple ExecBlockShuffle():
   for each shuffled block i:
      for each page j in the block i:
         read a tuple k from the page j;
         return tuple k;
            
void ExecReScan():
   re-shuffle the block indexes;
   clear I/O offsets;

In our current implemention, the source code of the three operators are available at SGD operator with SGD computation, TupleShuffle operator with its TupleShuffle implementation, and BlockShuffle operator with its BlockShuffle implementation.

Compile, install, and execute CorgiPile

Please refer to compile-install.md.

CorgiPile Performance

Datasets

The datasets used in our experiments are avaialbe here.

End-to-end exeuction time

The end-to-end execution time of SGD with different data shuffle strategies in PostgreSQL, for clustered datasets on HDD and SSD.

From this Figure, we can observe that CorgiPile converges the fastest among all systems, and can simultaneously achieve comparable converged accuracy to the best Shuffle Once baseline, usually within 1-3 epochs due to the large number of data tuples. Compared with the Shuffle Once strategy in MADlib and Bismarck, CorgiPile converges 2.9X-12.8X faster than MADlib and 2X-4.7X faster than Bismarck, on HDD and SSD. This is due to the eliminated data shuffling time.

Convergence rates

The following Figure shows the convergence behavior of all strategies, where Sliding-Window Shuffle, MRS Shuffle, and CorgiPile all use the same buffer size (10% of the dataset). As shown in this Figure, Sliding-Window Shuffle suffers from a lower converged accuracy, whereas MRS Shuffle can achieve an accuracy comparable to Shuffle Once only on yfcc but suffers from a lower accuracy on the other datasets.

Per-epoch exeuction time

The following Figure presents the results of No Shuffle and CorgiPile, as well as the single-buffer version of CorgiPile.

We make the following three observations:

For small datasets with in-memory I/O bandwidth, the average time per epoch of CorgiPile is comparable to that of No Shuffle.
For large datasets with disk I/O bandwidth, the average per-epoch time of CorgiPile is up to 1.1X slower than that of No Shuffle, i.e., it incurs at most an additional 11% overhead, due to buffer copy and tuple shuffle.
By using double-buffering optimization, CorgiPile can achieve up to 23.6% shorter per-epoch execution time.

Name		Name	Last commit message	Last commit date
Latest commit History 35,559 Commits
.vscode		.vscode
config		config
contrib		contrib
corgipile-docs		corgipile-docs
doc		doc
src		src
.gitignore		.gitignore
COPYRIGHT		COPYRIGHT
GNUmakefile.in		GNUmakefile.in
HISTORY		HISTORY
IO-perf.pdf		IO-perf.pdf
Makefile		Makefile
Proof.pdf		Proof.pdf
README.md		README.md
aclocal.m4		aclocal.m4
compile-install.md		compile-install.md
configure		configure
configure.in		configure.in
datasets_intro.md		datasets_intro.md

DS3Lab/CorgiPile-PostgreSQL

Folders and files

Latest commit

History

Repository files navigation

In-Database Machine Learning with CorgiPile: Stochastic Gradient Descent without Full Data Shuffle

Introduction

Implementation in PostgreSQL

Compile, install, and execute CorgiPile

CorgiPile Performance

Datasets

End-to-end exeuction time

Convergence rates

Per-epoch exeuction time

About

Resources

Stars

Watchers

Forks

Languages