
Table vs numpy.matrix speed #48

Closed
pxinghao opened this issue Aug 19, 2015 · 10 comments

Comments

@pxinghao

Below is a benchmark for comparing Table vs numpy.matrix on an access pattern that I expect is very common in data science applications. Essentially, all I'm doing is treating each row as a vector, and attempting to compute pairwise distances between rows / vectors by iterating over all their values.

from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250
countsTable = Table([[0 for i in range(0,numDatapoints)] for j in range(0,numFeatures)], [str(j) for j in range(0,numFeatures)])

countsMatrix = countsTable.matrix().transpose()



t0 = time.clock()
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s



t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s



t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s

Running this code shows that iteration over numpy.matrix is about 100x faster than iterating over Table.columns, which in turn is 100x faster than using Table.rows.
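For anyone rerunning this benchmark on a newer Python: time.clock() was deprecated in 3.3 and removed in 3.8. A minimal sketch of the same timing pattern using time.perf_counter() as a drop-in replacement (zero-filled data, same loop shape):

```python
import time
import numpy as np

num_datapoints, num_features = 10, 250
counts_matrix = np.zeros((num_datapoints, num_features))

t0 = time.perf_counter()
# Same access pattern as the benchmark above: L1 distance of row 0 to every row.
[sum(abs(counts_matrix[0, k] - counts_matrix[j, k]) for k in range(num_features))
 for j in range(num_datapoints)]
t1 = time.perf_counter()
elapsed = t1 - t0
```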

@pxinghao
Author

Looking at the Table code, it seems like doing a .columns essentially does a complete dump of the Table, before [i] indexes into the dump. Would it be possible to instead have a function like .value(i,j) that returns the value of a cell without dumping all columns or even the column being accessed? Or amortize the cost by storing the dumped columns somewhere?

For example, a quick fix was to do the dump once:

columns = countsTable.columns
t0 = time.clock()
[sum([abs(columns[k][0] - columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns once, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns once, took 0.0015499999999804004s

@deculler
Contributor

mytable[col][row] is the value.

The primary design point for tables treats each column as a vector. A row
is really more like a record. A table is definitely not a matrix. (And it
isn't C-style, i.e., row-major, although the numpy arrays down inside are
implemented that way and are fast.)

The fundamental unit to be working with is mytable[column_label]. If you
care about performance, that is what you care about. And you want to
operate on columns.

I don't follow what you mean by "complete dump" in this context. All of
this is in memory. But perhaps you mean making copies. All the columns
are numpy arrays down inside. Rows are not.

It would be great if you'd do the timing for columns (not lists constructed
from cells of columns or rows), rows, and cells. Comparisons with Pandas and
Numpy would be great too.

David E. Culler
Friesen Professor of Computer Science
Electrical Engineering and Computer Sciences
University of California, Berkeley


@pxinghao
Author

Using __getitem__ is as fast as using numpy.matrix:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took 0.006763999999999992s

Running either column- or row-first doesn't seem to make much of a difference on the small examples I tried:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for j in range(0, numDatapoints)]) for k in range(0, numFeatures)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took 0.00734900000000005s

I'm not sure what is meant by "cell" though.

The only (minor) issue I have with __getitem__ is the need to dereference by column label. Is there a direct, natural way to dereference by column number instead? (I could iterate over column_labels, but using [str][int] still feels less natural than the usual [int][int] indexing.)
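A sketch of the lookup I have in mind, with a plain dict standing in for the Table (the value(i, j) helper is hypothetical, echoing the earlier suggestion; the real column_labels attribute is assumed, not shown):

```python
# Stand-in for a Table whose __getitem__ takes a string label,
# as in countsTable[str(k)] above: 250 zero-filled columns of 10 rows.
table = {str(k): [0.0] * 10 for k in range(250)}

# Build the label list once (analogous to Table.column_labels),
# then dereference by column number through it.
labels = list(table.keys())

def value(i, j):
    """Hypothetical value(i, j) helper: row i of column number j."""
    return table[labels[j]][i]
```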

@papajohn
Contributor

What if instead of supporting tables with many columns that are one-column-per-feature, we model this project using four columns:

  • song name
  • song genre
  • artist name
  • word counts: numpy.array valued column

The word-counts column could be represented as a 2-d array, which we know is fast.

The only headache in achieving this design is fiddling with the read_table function so that it's possible to produce such a structure.
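A minimal sketch of that design, using a plain dict of numpy arrays in place of the real Table and read_table machinery (all column names and sizes here are illustrative):

```python
import numpy as np

n_songs, n_words = 3, 5

# Four logical columns; the word-counts column is itself a 2-d numpy array.
project = {
    "song name":   np.array(["a", "b", "c"]),
    "song genre":  np.array(["pop", "rock", "pop"]),
    "artist name": np.array(["x", "y", "z"]),
    "word counts": np.zeros((n_songs, n_words)),
}

# Row comparisons on the word counts stay inside numpy, which is fast:
# L1 distance of the first song's counts to every song's counts.
wc = project["word counts"]
dists = np.abs(wc - wc[0]).sum(axis=1)
```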

@papajohn
Contributor

Regarding this issue in particular, we're not going to support numpy-speed row comparison operations on Tables this semester. The best approximation would be to convert first to a numpy.matrix.

@pxinghao
Author

A possible workaround is to first read the Table from CSV as-is into a 5004-column data structure, then use a simple wrapper to convert it into a 5-column Table.

For example,

largeTable = Table.read_table(...)
wordCountsMatrix = largeTable.drop(['UID', 'Title', 'Artist', 'Genre']).matrix()
smallTable = Table(largeTable.select(['UID', 'Title', 'Artist', 'Genre']).columns + wordCountsMatrix.tolist(), ['UID', 'Title', 'Artist', 'Genre', 'Counts'])

@SamLau95
Member

This looks more relevant now as project 2 is coming up.

@pxinghao
Author

Turns out that the fastest way I've found is to use numpy.tile:

(abs(countsMatrix - np.tile(countsMatrix[0,],[numDatapoints, 1]))).sum(1)

which is faster than manually computing pairwise distances, even using numpy:

[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
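Worth noting: the np.tile copy isn't strictly needed, since numpy broadcasting subtracts a row from every row directly. A quick sketch showing the two agree (random data, sizes from the benchmark above):

```python
import numpy as np

rng = np.random.default_rng(0)
counts_matrix = rng.integers(0, 5, size=(10, 250)).astype(float)
num_datapoints = counts_matrix.shape[0]

# The tile-based version: explicitly copy row 0 ten times, then subtract.
tiled = np.abs(counts_matrix
               - np.tile(counts_matrix[0, ], [num_datapoints, 1])).sum(1)

# Broadcasting performs the same row-0 subtraction without the explicit copy.
broadcast = np.abs(counts_matrix - counts_matrix[0]).sum(axis=1)
```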

@papajohn
Contributor

Perhaps we just wrap this call to np.tile into a more intuitive name, such
as pairwise_distance.


@pxinghao
Author

For project 2, we might want students to try to write code for computing the distances, so I would vote against putting it in the Table codebase, for now at least.
