Added a diagram. Used package pgfplotstable. You probably have to update this package on your system, because it is very new. But it is amazing!
@@ -15,14 +15,10 @@
 \usepackage{dcolumn} % Decimal point alignment in tables
 \usepackage{xcolor} % for color comments
 \usepackage[
-% bookmarks,
-% colorlinks=false,
-% linkcolor=blue,
-% citecolor=blue,
-% pagebackref=false,
 pdftitle={Introducing MADlib (MAD Skills, the SQL)},
 pdfauthor={}
-]{hyperref} % Für PDF-Features
+]{hyperref}
+\usepackage{pgfplotstable} % Generating pretty-printed tables and plots
 \makeatletter
 % The vldb style file is an anachronsism. So no need to worry about hacks anyway.
@@ -884,7 +880,7 @@ \subsection{Initial Performance Results}
 \begin{math}
 O(k^3 + (n \cdot k^2)/p)
 \end{math}
-where $k$ is the number of independent variables, $n$ is the number of observations, and $p$ is the number of segments. The $k^3$ time is needed for the matrix inversion, and the $k^2$ is needed for computing each outer product $\vec x_i \vec x_i^T$ and adding it to the running sum. It turns out that our runtime measurements fit these expectations quite well, and the constant factors are relatively small. See Figure~\ref{fig:regression}. In particular we note:
+where $k$ is the number of independent variables, $n$ is the number of observations, and $p$ is the number of segments. The $k^3$ time is needed for the matrix inversion, and the $k^2$ is needed for computing each outer product $\vec x_i \vec x_i^T$ and adding it to the running sum. It turns out that our runtime measurements fit these expectations quite well, and the constant factors are relatively small. See Figures~\ref{fig:regression} and \ref{fig:regression-diagram}. In particular we note:
 \begin{itemize}
 \item The overhead for a single query is very low and only a fraction of a second. This also implies that we lose little in implementating iterative algorithms using driver functions that run multiple SQL queries.
 \item Given the previous points, the Greenplum database achieves perfect linear speedup in the example shown.
@@ -901,7 +897,7 @@ \subsection{Initial Performance Results}
 Other noteworthy results during our performance studies included that there are no measurable performance differences between PostgreSQL 9.1.1 (both in single and multi-user mode) and GP 4.1 in running the aggregate function on a single core. Moreover, while testing linear/logistic-regression execution times, single-core performance of even laptop CPUs (like the Core i5 540M) did not differ much from today's server CPUs (like the Xeon family). Typically the differences were even less than what the difference in clock speeds might have suggested. \jmh{Care to speculate why?} \fs{Pure speculation at this stage: Memory speed does not scale similarly, non-linear parts of the code/branches, different pipeline lengths, different number of clock cycles per FP operation, ...?}
 \begin{comment}
-% The following script was used for testing linear regression:
+% The following script was used for testing linear regression.
 % --8<--
 #!/usr/bin/env bash
 PORT=5555
@@ -955,50 +951,63 @@ \subsection{Initial Performance Results}
 \begin{figure*}
 \centering
-\begin{tabular}{+r^r^rD{.}{.}{3.3}D{.}{.}{3.3}D{.}{.}{3.4}}
-\toprule\rowstyle{\bfseries}
-\# segments & \# variables & \multicolumn{1}{^c}{\# rows}
-    & \multicolumn{1}{^c}{v0.3}
-    & \multicolumn{1}{^c}{v0.2.1beta}
-    & \multicolumn{1}{^c}{v0.1alpha}\\
-    & & \multicolumn{1}{^c}{(million)}
-    & \multicolumn{1}{^c}{(s)}
-    & \multicolumn{1}{^c}{(s)}
-    & \multicolumn{1}{^c}{(s)} \\
-\otoprule
-  6 &  10 & 10 & 4.447 & 9.501 & 1.337 \\
-  6 &  20 & 10 & 4.688 & 11.60 & 1.874 \\
-  6 &  40 & 10 & 6.843 & 17.96 & 3.828 \\
-  6 &  80 & 10 & 13.28 & 52.94 & 12.98 \\
-  6 & 160 & 10 & 35.66 & 181.4 & 51.20 \\
-  6 & 320 & 10 & 186.2 & 683.8 & 333.4 \\
-\midrule
- 12 &  10 & 10 & 2.115 & 4.756 & 0.9600\\
- 12 &  20 & 10 & 2.432 & 5.760 & 1.212 \\
- 12 &  40 & 10 & 3.420 & 9.010 & 2.046 \\
- 12 &  80 & 10 & 6.797 & 26.48 & 6.469 \\
- 12 & 160 & 10 & 17.71 & 90.95 & 25.67 \\
- 12 & 320 & 10 & 92.41 & 341.5 & 166.6 \\
-\midrule
- 18 &  10 & 10 & 1.418 & 3.206 & 0.6197\\
- 18 &  20 & 10 & 1.648 & 3.805 & 1.003 \\
- 18 &  40 & 10 & 2.335 & 5.994 & 1.183 \\
- 18 &  80 & 10 & 4.461 & 17.73 & 4.314 \\
- 18 & 160 & 10 & 11.90 & 60.58 & 17.14 \\
- 18 & 320 & 10 & 61.66 & 227.7 & 111.4 \\
-\midrule
- 24 &  10 & 10 & 1.197 & 2.383 & 0.3904\\
- 24 &  20 & 10 & 1.276 & 2.869 & 0.4769\\
- 24 &  40 & 10 & 1.698 & 4.475 & 1.151 \\
- 24 &  80 & 10 & 3.363 & 13.35 & 3.263 \\
- 24 & 160 & 10 & 8.840 & 45.48 & 13.10 \\
- 24 & 320 & 10 & 46.18 & 171.7 & 84.59 \\
-\bottomrule
-\end{tabular}
+\pgfplotstabletypeset[
+    every nth row={6}{before row=\midrule},
+    columns/segments/.style={
+        column type=r,column name={},
+    },
+    columns/indep/.style={
+        column type=r,column name={},
+    },
+    columns/rows/.style={
+        column type=r,column name={(million)},
+    },
+    columns/v0.3/.style={
+        dcolumn={D{.}{.}{3.3}}{c},column name={(s)},precision=4
+    },
+    columns/v0.2/.style={
+        dcolumn={D{.}{.}{3.3}}{c},column name={(s)},precision=4
+    },
+    columns/v0.1/.style={
+        dcolumn={D{.}{.}{3.4}}{c},column name={(s)},precision=4
+    },
+    every head row/.style={before row={%
+        \toprule
+        \multicolumn{1}{c}{\textbf{\# Segments}} &
+        \multicolumn{1}{c}{\textbf{\# Rows}} &
+        \multicolumn{1}{c}{\textbf{\# Variables}} &
+        \multicolumn{1}{c}{\textbf{v0.3}} &
+        \multicolumn{1}{c}{\textbf{v0.2.1beta}} &
+        \multicolumn{1}{c}{\textbf{v0.1alpha}}
+        \\%
+    }, after row=\midrule},
+    every last row/.style={after row=\bottomrule}
+]{Tables/linregr.dat}
 \caption{Linear-regression execution times}
 \label{fig:regression}
 \end{figure*}
+\begin{figure}
+\begin{tikzpicture}
+\begin{axis}[height=5cm,width=8cm,xlabel={\# independent variables},ylabel={execution time (s)}]
+    \addplot table[
+        x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{6:6}
+    ]{Tables/linregr.dat};
+    \addplot table[
+        x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{12:12}
+    ]{Tables/linregr.dat};
+    \addplot table[
+        x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{18:18}
+    ]{Tables/linregr.dat};
+    \addplot table[
+        x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{24:24}
+    ]{Tables/linregr.dat};
+\end{axis}
+\end{tikzpicture}
+\caption{Linear-regression execution times using MADlib v0.3 on Greenplum Database 4.2.0, 10 million rows}
+\label{fig:regression-diagram}
+\end{figure}
+
 % \subsection{Using MADlib}
 %
 % Typical User Workflow \fs{(needs to be substantiated)}:
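Note for anyone trying the new figure locally: the data file Tables/linregr.dat is not part of this diff. The standalone sketch below may help check that an installed pgfplotstable is recent enough before building the whole paper. It is only an illustration under stated assumptions: the column names (segments, indep, rows, v0.3) are taken from the keys used in the diff, their order in the file is a guess, the macro name \linregr and the compat level are arbitrary choices, and the numbers are the v0.3 timings for 6 and 24 segments copied from the removed tabular. The filter syntax mirrors the \addplot calls in the diff.

```latex
\documentclass{article}
\usepackage{pgfplotstable} % also loads pgfplots and TikZ
\pgfplotsset{compat=1.3}   % assumption: any reasonably recent compat level should do

% Inline stand-in for Tables/linregr.dat (hypothetical column order).
% Values are the v0.3 timings from the old tabular, 10 million rows each.
\pgfplotstableread{
segments indep rows v0.3
6   10 10 4.447
6   20 10 4.688
6   40 10 6.843
6   80 10 13.28
6  160 10 35.66
6  320 10 186.2
24  10 10 1.197
24  20 10 1.276
24  40 10 1.698
24  80 10 3.363
24 160 10 8.840
24 320 10 46.18
}\linregr

\begin{document}
\begin{tikzpicture}
\begin{axis}[height=5cm,width=8cm,
    xlabel={\# independent variables},ylabel={execution time (s)}]
  % Keep only the rows whose "segments" column equals 6.
  \addplot table[
    x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{6:6}
  ]{\linregr};
  % Same filter for 24 segments.
  \addplot table[
    x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{24:24}
  ]{\linregr};
\end{axis}
\end{tikzpicture}
\end{document}
```

If this compiles and shows two curves, the same keys should work when pointed at the real data file. The two series also give a quick sanity check of the speedup claim: at 320 variables, 186.2 s on 6 segments versus 46.18 s on 24 segments is a ratio of about 4.03, close to the factor of 4 in segment count.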