
Added a diagram. Used package pgfplotstable. You probably have to update this package on your system, because it is very new. But it is amazing!
1 parent 9918db8 commit 309b114dc04ffbd3b99280642cc4bc55ae8d9f70 Florian Schoppmann committed Mar 29, 2012
Showing with 83 additions and 48 deletions.
  1. +26 −0 Tables/linregr.dat
  2. +57 −48 madlib_the_sql.tex
@@ -0,0 +1,26 @@
+# Linear regression runtime measurements
+segments indep rows v0.3 v0.2 v0.1
+6 10 10 4.447 9.501 1.337
+6 20 10 4.688 11.60 1.874
+6 40 10 6.843 17.96 3.828
+6 80 10 13.28 52.94 12.98
+6 160 10 35.66 181.4 51.20
+6 320 10 186.2 683.8 333.4
+12 10 10 2.115 4.756 0.9600
+12 20 10 2.432 5.760 1.212
+12 40 10 3.420 9.010 2.046
+12 80 10 6.797 26.48 6.469
+12 160 10 17.71 90.95 25.67
+12 320 10 92.41 341.5 166.6
+18 10 10 1.418 3.206 0.6197
+18 20 10 1.648 3.805 1.003
+18 40 10 2.335 5.994 1.183
+18 80 10 4.461 17.73 4.314
+18 160 10 11.90 60.58 17.14
+18 320 10 61.66 227.7 111.4
+24 10 10 1.197 2.383 0.3904
+24 20 10 1.276 2.869 0.4769
+24 40 10 1.698 4.475 1.151
+24 80 10 3.363 13.35 3.263
+24 160 10 8.840 45.48 13.10
+24 320 10 46.18 171.7 84.59
@@ -15,14 +15,10 @@
\usepackage{dcolumn} % Decimal point alignment in tables
\usepackage{xcolor} % for color comments
\usepackage[
-% bookmarks,
-% colorlinks=false,
-% linkcolor=blue,
-% citecolor=blue,
-% pagebackref=false,
pdftitle={Introducing MADlib (MAD Skills, the SQL)},
pdfauthor={}
-]{hyperref} % Für PDF-Features
+]{hyperref}
+\usepackage{pgfplotstable} % Generating pretty-printed tables and plots
\makeatletter
% The vldb style file is an anachronism. So no need to worry about hacks anyway.
@@ -884,7 +880,7 @@ \subsection{Initial Performance Results}
\begin{math}
O(k^3 + (n \cdot k^2)/p)
\end{math}
-where $k$ is the number of independent variables, $n$ is the number of observations, and $p$ is the number of segments. The $k^3$ time is needed for the matrix inversion, and the $k^2$ is needed for computing each outer product $\vec x_i \vec x_i^T$ and adding it to the running sum. It turns out that our runtime measurements fit these expectations quite well, and the constant factors are relatively small. See Figure~\ref{fig:regression}. In particular we note:
+where $k$ is the number of independent variables, $n$ is the number of observations, and $p$ is the number of segments. The $k^3$ time is needed for the matrix inversion, and the $k^2$ is needed for computing each outer product $\vec x_i \vec x_i^T$ and adding it to the running sum. It turns out that our runtime measurements fit these expectations quite well, and the constant factors are relatively small. See Figures~\ref{fig:regression} and \ref{fig:regression-diagram}. In particular we note:
\begin{itemize}
\item The overhead for a single query is very low and only a fraction of a second. This also implies that we lose little in implementing iterative algorithms using driver functions that run multiple SQL queries.
\item Given the previous points, the Greenplum database achieves perfect linear speedup in the example shown.
@@ -901,7 +897,7 @@ \subsection{Initial Performance Results}
Other noteworthy results during our performance studies included that there are no measurable performance differences between PostgreSQL 9.1.1 (both in single and multi-user mode) and GP 4.1 in running the aggregate function on a single core. Moreover, while testing linear/logistic-regression execution times, single-core performance of even laptop CPUs (like the Core i5 540M) did not differ much from today's server CPUs (like the Xeon family). Typically the differences were even less than what the difference in clock speeds might have suggested. \jmh{Care to speculate why?} \fs{Pure speculation at this stage: Memory speed does not scale similarly, non-linear parts of the code/branches, different pipeline lengths, different number of clock cycles per FP operation, ...?}
\begin{comment}
-% The following script was used for testing linear regression:
+% The following script was used for testing linear regression.
% --8<--
#!/usr/bin/env bash
PORT=5555
@@ -955,50 +951,63 @@ \subsection{Initial Performance Results}
\begin{figure*}
\centering
-\begin{tabular}{+r^r^rD{.}{.}{3.3}D{.}{.}{3.3}D{.}{.}{3.4}}
-\toprule\rowstyle{\bfseries}
-\# segments & \# variables & \multicolumn{1}{^c}{\# rows}
- & \multicolumn{1}{^c}{v0.3}
- & \multicolumn{1}{^c}{v0.2.1beta}
- & \multicolumn{1}{^c}{v0.1alpha}\\
- & & \multicolumn{1}{^c}{(million)}
- & \multicolumn{1}{^c}{(s)}
- & \multicolumn{1}{^c}{(s)}
- & \multicolumn{1}{^c}{(s)} \\
-\otoprule
- 6 & 10 & 10 & 4.447 & 9.501 & 1.337 \\
- 6 & 20 & 10 & 4.688 & 11.60 & 1.874 \\
- 6 & 40 & 10 & 6.843 & 17.96 & 3.828 \\
- 6 & 80 & 10 & 13.28 & 52.94 & 12.98 \\
- 6 & 160 & 10 & 35.66 & 181.4 & 51.20 \\
- 6 & 320 & 10 & 186.2 & 683.8 & 333.4 \\
-\midrule
- 12 & 10 & 10 & 2.115 & 4.756 & 0.9600\\
- 12 & 20 & 10 & 2.432 & 5.760 & 1.212 \\
- 12 & 40 & 10 & 3.420 & 9.010 & 2.046 \\
- 12 & 80 & 10 & 6.797 & 26.48 & 6.469 \\
- 12 & 160 & 10 & 17.71 & 90.95 & 25.67 \\
- 12 & 320 & 10 & 92.41 & 341.5 & 166.6 \\
-\midrule
- 18 & 10 & 10 & 1.418 & 3.206 & 0.6197\\
- 18 & 20 & 10 & 1.648 & 3.805 & 1.003 \\
- 18 & 40 & 10 & 2.335 & 5.994 & 1.183 \\
- 18 & 80 & 10 & 4.461 & 17.73 & 4.314 \\
- 18 & 160 & 10 & 11.90 & 60.58 & 17.14 \\
- 18 & 320 & 10 & 61.66 & 227.7 & 111.4 \\
-\midrule
- 24 & 10 & 10 & 1.197 & 2.383 & 0.3904\\
- 24 & 20 & 10 & 1.276 & 2.869 & 0.4769\\
- 24 & 40 & 10 & 1.698 & 4.475 & 1.151 \\
- 24 & 80 & 10 & 3.363 & 13.35 & 3.263 \\
- 24 & 160 & 10 & 8.840 & 45.48 & 13.10 \\
- 24 & 320 & 10 & 46.18 & 171.7 & 84.59 \\
-\bottomrule
-\end{tabular}
+\pgfplotstabletypeset[
+ every nth row={6}{before row=\midrule},
+ columns/segments/.style={
+ column type=r,column name={},
+ },
+ columns/indep/.style={
+ column type=r,column name={},
+ },
+ columns/rows/.style={
+ column type=r,column name={(million)},
+ },
+ columns/v0.3/.style={
+ dcolumn={D{.}{.}{3.3}}{c},column name={(s)},precision=4
+ },
+ columns/v0.2/.style={
+ dcolumn={D{.}{.}{3.3}}{c},column name={(s)},precision=4
+ },
+ columns/v0.1/.style={
+ dcolumn={D{.}{.}{3.4}}{c},column name={(s)},precision=4
+ },
+ every head row/.style={before row={%
+ \toprule
+ \multicolumn{1}{c}{\textbf{\# Segments}} &
+ \multicolumn{1}{c}{\textbf{\# Variables}} &
+ \multicolumn{1}{c}{\textbf{\# Rows}} &
+ \multicolumn{1}{c}{\textbf{v0.3}} &
+ \multicolumn{1}{c}{\textbf{v0.2.1beta}} &
+ \multicolumn{1}{c}{\textbf{v0.1alpha}}
+ \\%
+ }, after row=\midrule},
+ every last row/.style={after row=\bottomrule}
+]{Tables/linregr.dat}
\caption{Linear-regression execution times}
\label{fig:regression}
\end{figure*}
+\begin{figure}
+\begin{tikzpicture}
+\begin{axis}[height=5cm,width=8cm,xlabel={\# independent variables},ylabel={execution time (s)}]
+ \addplot table[
+ x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{6:6}
+ ]{Tables/linregr.dat};
+ \addplot table[
+ x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{12:12}
+ ]{Tables/linregr.dat};
+ \addplot table[
+ x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{18:18}
+ ]{Tables/linregr.dat};
+ \addplot table[
+ x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{24:24}
+ ]{Tables/linregr.dat};
+\end{axis}
+\end{tikzpicture}
+\caption{Linear-regression execution times using MADlib v0.3 on Greenplum Database 4.2.0, 10 million rows}
+\label{fig:regression-diagram}
+\end{figure}
+
% \subsection{Using MADlib}
%
% Typical User Workflow \fs{(needs to be substantiated)}:
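
The linear-speedup claim in the performance hunk can be checked directly against the v0.3 column of Tables/linregr.dat. The following is a worked check only: the symbols $S(p)$ and $T_p(k)$ are notation introduced here, not taken from the paper, and the block assumes `amsmath` is loaded for `align*`.

```latex
% Worked check (illustrative): speedup relative to 6 segments,
% S(p) = T_6(k) / T_p(k), for v0.3 with k = 320 variables and 10 million rows.
% All runtimes are copied from Tables/linregr.dat.
\begin{align*}
S(12) &= \frac{186.2}{92.41} \approx 2.01 && \text{(ideal: } 12/6 = 2\text{)} \\
S(18) &= \frac{186.2}{61.66} \approx 3.02 && \text{(ideal: } 18/6 = 3\text{)} \\
S(24) &= \frac{186.2}{46.18} \approx 4.03 && \text{(ideal: } 24/6 = 4\text{)}
\end{align*}
% Doubling k from 160 to 320 on 6 segments:
\begin{equation*}
\frac{T_6(320)}{T_6(160)} = \frac{186.2}{35.66} \approx 5.22,
\qquad 2^2 = 4 < 5.22 < 8 = 2^3 .
\end{equation*}
```

The first three ratios sit within about one percent of the ideal values. The last ratio lies between the factors that a pure $k^2$ term and a pure $k^3$ term would give, which is at least consistent with the stated bound $O(k^3 + (n \cdot k^2)/p)$.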

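
The plot added in this commit draws four curves without a legend, so the segment counts cannot be told apart in the output. A possible variant is sketched below. It is not part of the commit: the log-log axes, the legend text, and the legend placement are choices made here for illustration, and it assumes the same Tables/linregr.dat and the `pgfplotstable` package (which loads `pgfplots`).

```latex
% Sketch only: log-log variant of the v0.3 plot with one legend entry per
% segment count. Column names (segments, indep, v0.3) are those of
% Tables/linregr.dat; everything else is an illustrative choice.
\begin{tikzpicture}
\begin{loglogaxis}[
    height=5cm,width=8cm,
    log basis x=2,                      % x values double from 10 to 320
    xlabel={\# independent variables},
    ylabel={execution time (s)},
    legend pos=north west,
    legend style={font=\footnotesize},
]
  \addplot table[
    x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{6:6}
  ]{Tables/linregr.dat};
  \addlegendentry{6 segments}
  \addplot table[
    x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{12:12}
  ]{Tables/linregr.dat};
  \addlegendentry{12 segments}
  \addplot table[
    x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{18:18}
  ]{Tables/linregr.dat};
  \addlegendentry{18 segments}
  \addplot table[
    x=indep,y=v0.3,restrict expr to domain={\thisrow{segments}}{24:24}
  ]{Tables/linregr.dat};
  \addlegendentry{24 segments}
\end{loglogaxis}
\end{tikzpicture}
```

On log-log axes, a constant speedup factor should show up as a roughly constant vertical offset between the curves, and polynomial growth in the number of variables as a roughly straight line.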