Table vs numpy.matrix speed #48
Looking at the Table code, it seems like doing a `.columns` essentially does a complete dump of the Table, before `[i]` indexes into the dump. Would it be possible to instead have a function like `.value(i, j)` that returns the value of a cell without dumping all columns, or even the column being accessed? Or amortize the cost by storing the dumped columns somewhere? For example, a quick fix was to do the dump once:

```python
columns = countsTable.columns
t0 = time.clock()
[sum([abs(columns[k][0] - columns[k][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table.columns once, took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table.columns once, took 0.0015499999999804004s
```
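The amortization idea above can be sketched without the Table library at all: dump the columns once into a list-of-lists, then index into the cached structure for every access. This is a minimal illustration, assuming synthetic data; `make`-style names here are not part of the actual Table API.

```python
# Synthetic stand-in for Table.columns: a list of columns, each a list of
# row values. The point is that the dump happens once, not per access.
num_features, num_datapoints = 50, 200
columns = [[(i * j) % 7 for j in range(num_datapoints)]
           for i in range(num_features)]

def l1_to_first_row(cols, n_feat, n_pts):
    """L1 distance of the first row to every row, reading the cached columns."""
    return [sum(abs(cols[k][0] - cols[k][j]) for k in range(n_feat))
            for j in range(n_pts)]

dists = l1_to_first_row(columns, num_features, num_datapoints)
print(dists[0])  # distance of row 0 to itself is 0
```

The inner loops touch only the cached Python lists, so the cost of materializing the columns is paid once rather than on every cell access.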
`mytable[col][row]` is the value. The primary design point for tables treats each column as a vector. A row … The fundamental unit to be working with is `mytable[column_label]`. If you … I don't follow what you mean by "complete dump" in this context. All of … It would be great if you'd do the timing for columns (not lists constructed …)

— David E. Culler (replying on Tue, Aug 18, 2015)
Using `__getitem__` is as fast as using numpy.matrix:

```python
t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (column-wise), took 0.006763999999999992s
```

Running either column- or row-first doesn't seem to make much of a difference on the small examples I tried:

```python
t0 = time.clock()
[sum([abs(countsTable[str(k)][0] - countsTable[str(k)][j]) for j in range(0, numDatapoints)]) for k in range(0, numFeatures)]
t1 = time.clock()
print('Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took ', t1-t0, 's', sep='')
# Compute L1 distance of first row to all rows, using Table __getitem__ (row-wise), took 0.00734900000000005s
```

I'm not sure what is meant by "cell" though. The only (minor) issue I have with `__getitem__` is the need to dereference by column label. Is there a direct, natural way to dereference by column number instead? (I could iterate over `column_labels`, but using `[str][int]` still feels less natural than the usual `[int][int]` indexing.)
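One workaround for positional indexing can be sketched with plain dicts: keep an ordered list of labels and translate a column number into a label before the lookup. The `labels` list and `cell` helper here are illustrative assumptions, not part of the Table API (the real attribute may be `column_labels`).

```python
# A dict stands in for a label-keyed Table; `labels` preserves column order.
table = {"0": [1, 2, 3], "1": [4, 5, 6], "2": [7, 8, 9]}
labels = ["0", "1", "2"]

def cell(t, i, j):
    """Return the value in column i, row j, addressed by position."""
    return t[labels[i]][j]

print(cell(table, 2, 1))  # 8
```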
What if, instead of supporting tables with one column per feature, we model this project using four columns: …

The word-counts column could be represented as a 2-d array, which we know is fast. The only headache in achieving this design is fiddling with the …
Regarding this issue in particular, we're not going to support numpy-speed row comparison operations on …
A possible workaround is to first read the Table from CSV as-is into a 5004-column data structure, then use a simple wrapper to convert it into a 5-column Table. For example, …
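The conversion step could look roughly like the sketch below, which packs many per-feature columns into a single 2-d "counts" column. A dict stands in for the Table, and all names and sizes are illustrative assumptions, not the actual wrapper proposed above.

```python
import numpy as np

# A wide, one-column-per-feature structure (here tiny, for illustration).
num_features, num_datapoints = 4, 3
wide = {str(k): [k + j for j in range(num_datapoints)]
        for k in range(num_features)}
wide["doc_id"] = list(range(num_datapoints))

def pack_counts(table, feature_labels):
    """Stack per-feature columns into one (n_rows, n_features) array."""
    counts = np.column_stack([table[label] for label in feature_labels])
    return {"doc_id": table["doc_id"], "counts": counts}

narrow = pack_counts(wide, [str(k) for k in range(num_features)])
print(narrow["counts"].shape)  # (3, 4)
```

Row operations then run on the dense `counts` array at numpy speed, while the remaining columns keep the table semantics.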
This looks more relevant now as project 2 is coming up.
Turns out that the fastest way I've found is to use `numpy.tile`:

```python
(abs(countsMatrix - np.tile(countsMatrix[0,], [numDatapoints, 1]))).sum(1)
```

which is faster than manually computing pairwise distances, even using numpy:

```python
[sum([abs(countsMatrix[0,k] - countsMatrix[j,k]) for k in range(0, numFeatures)]) for j in range(0, numDatapoints)]
```
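Worth noting: NumPy broadcasting makes the explicit `np.tile` unnecessary. Subtracting the first row from the whole matrix broadcasts that row across every row automatically and gives the same result. The `countsMatrix` below is synthetic data standing in for the real table.

```python
import numpy as np

num_datapoints, num_features = 5, 3
countsMatrix = np.arange(num_datapoints * num_features).reshape(
    num_datapoints, num_features)

# Explicit tiling, as in the comment above.
tiled = np.abs(countsMatrix - np.tile(countsMatrix[0], [num_datapoints, 1])).sum(1)
# Broadcasting: countsMatrix[0] is stretched across all rows implicitly.
broadcast = np.abs(countsMatrix - countsMatrix[0]).sum(1)

print(np.array_equal(tiled, broadcast))  # True
```

Broadcasting avoids materializing the tiled copy, so it is also no slower (and often faster) than the `np.tile` version.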
Perhaps we just wrap this call to `np.tile` into a more intuitive name, such as …
For project 2, we might want students to try to write code for computing the distances, so I would vote against putting it in the Table codebase, for now at least.
Below is a benchmark for comparing Table vs numpy.matrix on an access pattern that I expect is very common in data science applications. Essentially, all I'm doing is treating each row as a vector, and attempting to compute pairwise distances between rows / vectors by iterating over all their values.
Running this code shows that iteration over numpy.matrix is about 100x faster than iterating over Table.columns, which in turn is 100x faster than using Table.rows.
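The benchmark described above can be sketched as follows. The data and sizes are synthetic, absolute timings will vary by machine, and the list-of-lists structure stands in for the dumped `Table.columns`; only the relative comparison is the point.

```python
import time
import numpy as np

num_datapoints, num_features = 200, 50
matrix = np.random.rand(num_datapoints, num_features)
# Column-major Python lists, mimicking a dump of Table.columns.
columns = [list(matrix[:, k]) for k in range(num_features)]

def bench(fn):
    """Return (elapsed seconds, result) for a zero-argument callable."""
    t0 = time.perf_counter()
    result = fn()
    return time.perf_counter() - t0, result

# Vectorized numpy: broadcast the first row across all rows.
t_numpy, d_numpy = bench(lambda: np.abs(matrix - matrix[0]).sum(1))
# Pure-Python iteration over the cached columns.
t_lists, d_lists = bench(lambda: [
    sum(abs(columns[k][0] - columns[k][j]) for k in range(num_features))
    for j in range(num_datapoints)])

print(np.allclose(d_numpy, d_lists))  # same distances; timings differ
```

On typical hardware the vectorized version is orders of magnitude faster, consistent with the 100x gaps reported above, though the exact ratio depends on the data size.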