
Table vs numpy.matrix speed #48

Closed
pxinghao opened this issue Aug 19, 2015 · 10 comments

Comments

@pxinghao

Below is a benchmark for comparing Table vs numpy.matrix on an access pattern that I expect is very common in data science applications. Essentially, all I'm doing is treating each row as a vector, and attempting to compute pairwise distances between rows / vectors by iterating over all their values.

from datascience import *
import numpy as np
import time

numDatapoints = 10
numFeatures = 250
countsTable = Table([[0 for i in range(0,numDatapoints)] for j in range(0,numFeatures)], [str(j) for j in range(0,numFeatures)])

countsMatrix = countsTable.matrix().transpose()



t0 = time.clock()
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using numpy.matrix, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using numpy.matrix, took 0.007395999999999958s



t0 = time.clock()
[sum([abs(countsTable.columns[k][0] - countsTable.columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns, took 0.4431849999999997s



t0 = time.clock()
[sum([abs(countsTable.rows[0][k] - countsTable.rows[j][k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.rows, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.rows, took 31.142619999999994s

Running this code shows that iteration over numpy.matrix is about 100x faster than iterating over Table.columns, which in turn is 100x faster than using Table.rows.
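For anyone rerunning this benchmark on a newer Python: time.clock() was deprecated in 3.3 and removed in 3.8. A minimal sketch of the same timing pattern using time.perf_counter() as a drop-in replacement (zero-filled data, same loop shape):

```python
import time
import numpy as np

num_datapoints, num_features = 10, 250
counts_matrix = np.zeros((num_datapoints, num_features))

t0 = time.perf_counter()
# Same access pattern as the benchmark above: L1 distance of row 0 to every row.
[sum(abs(counts_matrix[0, k] - counts_matrix[j, k]) for k in range(num_features))
 for j in range(num_datapoints)]
t1 = time.perf_counter()
elapsed = t1 - t0
```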

@pxinghao
Author

Looking at the Table code, it seems like doing a .columns essentially does a complete dump of the Table, before [i] indexes into the dump. Would it be possible to instead have a function like .value(i,j) that returns the value of a cell without dumping all columns or even the column being accessed? Or amortize the cost by storing the dumped columns somewhere?

For example, a quick fix was to do the dump once:

columns = countsTable.columns
t0 = time.clock()
[sum([abs(columns[k][0] - columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns once, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns once, took 0.0015499999999804004s

@deculler
Contributor

mytable[col][row] is the value.

The primary design point for tables treats each column as a vector. A row
is really more like a record. A table is definitely not a matrix. (And it
isn't C-style, i.e., row-major, although the numpy arrays down inside are
implemented that way and are fast.)

The fundamental unit to be working with is mytable[column_label]. If you
care about performance, that is what you care about. And you want to
operate on columns.

I don't follow what you mean by "complete dump" in this context. All of
this is in memory. But perhaps you mean making copies. All the columns
are numpy arrays down inside. Rows are not.

It would be great if you'd do the timing for columns (not lists constructed
from cells of columns or rows), rows, and cells. Comparisons with Pandas and
Numpy would be great too.

David E. Culler
Friesen Professor of Computer Science
Electrical Engineering and Computer Sciences
University of California, Berkeley


@pxinghao
Author

Using __getitem__ is as fast as using numpy.matrix:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took 0.006763999999999992s

Running either column- or row-first doesn't seem to make much of a difference on the small examples I tried:

t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for j in range(0, numDatapoints)]) for k in range(0, numFeatures)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took 0.00734900000000005s

I'm not sure what is meant by "cell" though.

The only (minor) issue I have with __getitem__ is the need to dereference by column label. Is there a direct, natural way to dereference by column number instead? (I could iterate over column_labels, but using [str][int] still feels less natural than the usual [int][int] indexing.)
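A sketch of the lookup I have in mind, with a plain dict standing in for the Table (the value(i, j) helper is hypothetical, echoing the earlier suggestion; the real column_labels attribute is assumed, not shown):

```python
# Stand-in for a Table whose __getitem__ takes a string label,
# as in countsTable[str(k)] above: 250 zero-filled columns of 10 rows.
table = {str(k): [0.0] * 10 for k in range(250)}

# Build the label list once (analogous to Table.column_labels),
# then dereference by column number through it.
labels = list(table.keys())

def value(i, j):
    """Hypothetical value(i, j) helper: row i of column number j."""
    return table[labels[j]][i]
```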

@papajohn
Contributor

What if instead of supporting tables with many columns that are one-column-per-feature, we model this project using four columns:

  • song name
  • song genre
  • artist name
  • word counts: numpy.array valued column

The word-counts column could be represented as a 2-d array, which we know is fast.

The only headache in achieving this design is fiddling with the read_table function so that it's possible to produce such a structure.
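A minimal sketch of that design, using a plain dict of numpy arrays in place of the real Table and read_table machinery (all column names and sizes here are illustrative):

```python
import numpy as np

n_songs, n_words = 3, 5

# Four logical columns; the word-counts column is itself a 2-d numpy array.
project = {
    "song name":   np.array(["a", "b", "c"]),
    "song genre":  np.array(["pop", "rock", "pop"]),
    "artist name": np.array(["x", "y", "z"]),
    "word counts": np.zeros((n_songs, n_words)),
}

# Row comparisons on the word counts stay inside numpy, which is fast:
# L1 distance of the first song's counts to every song's counts.
wc = project["word counts"]
dists = np.abs(wc - wc[0]).sum(axis=1)
```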

@papajohn
Contributor

Regarding this issue in particular, we're not going to support numpy-speed row comparison operations on Tables this semester. The best approximation would be to convert first to a numpy.matrix.

@pxinghao
Author

A possible workaround is to first read the Table from CSV as-is into a 5004-column data structure, then use a simple wrapper to convert it into a 5-column Table.

For example,

largeTable = Table.read_table(...)
wordCountsMatrix = largeTable.drop(['UID', 'Title', 'Artist', 'Genre']).matrix()
smallTable = Table(largeTable.select(['UID', 'Title', 'Artist', 'Genre']).columns + wordCountsMatrix.tolist(), ['UID', 'Title', 'Artist', 'Genre', 'Counts'])

@SamLau95
Member

This looks more relevant now as project 2 is coming up.

@pxinghao
Author

Turns out that the fastest way I've found is to use numpy.tile:

(abs(countsMatrix - np.tile(countsMatrix[0,],[numDatapoints, 1]))).sum(1)

which is faster than manually computing pairwise distances, even using numpy:

[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
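Worth noting: the np.tile copy isn't strictly needed, since numpy broadcasting subtracts a row from every row directly. A quick sketch showing the two agree (random data, sizes from the benchmark above):

```python
import numpy as np

rng = np.random.default_rng(0)
counts_matrix = rng.integers(0, 5, size=(10, 250)).astype(float)
num_datapoints = counts_matrix.shape[0]

# The tile-based version: explicitly copy row 0 ten times, then subtract.
tiled = np.abs(counts_matrix
               - np.tile(counts_matrix[0, ], [num_datapoints, 1])).sum(1)

# Broadcasting performs the same row-0 subtraction without the explicit copy.
broadcast = np.abs(counts_matrix - counts_matrix[0]).sum(axis=1)
```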

@papajohn
Contributor

Perhaps we just wrap this call to np.tile into a more intuitive name, such
as pairwise_distance.


@pxinghao
Author

For project 2, we might want students to try to write code for computing the distances, so I would vote against putting it in the Table codebase, for now at least.
