@stefanv
Contributor

@stefanv stefanv commented Oct 7, 2015

This lets Tables behave similarly to other Python containers, such that:

In [1]: import datascience as ds

In [2]: t = ds.Table((['a','b','c','d'], [1, 2, 3, 4]), ('names', 'counts'))

In [3]: print(t[:2])
names | counts
a     | 1
b     | 2

In [4]: print(t[-1])
names | counts
d     | 4

In [5]: print(t[1:-1])
names | counts
b     | 2
c     | 3

@SamLau95
Contributor

SamLau95 commented Oct 7, 2015

This syntax is a tad confusing because we've only had stuff like t['label'] returning an array of the values of that column up to this point. @papajohn any thoughts?

@stefanv
Contributor Author

stefanv commented Oct 7, 2015 via email

@papajohn
Contributor

papajohn commented Oct 7, 2015

Nice suggestion, but tables are indexed by column name rather than number. You could change t.columns() and t.rows() to special containers that give you a selected table upon slicing, rather than a list of arrays.

Currently take gives you a few rows, and select gives you a few columns.

@SamLau95
Contributor

SamLau95 commented Oct 7, 2015

@stefanv There's Table.take which students have seen before.

@stefanv
Contributor Author

stefanv commented Oct 7, 2015 via email

@stefanv
Contributor Author

stefanv commented Oct 7, 2015 via email

@papajohn
Contributor

papajohn commented Oct 7, 2015

In our course, the first table manipulation is to create new columns from existing columns. Named columns make the resulting expressions fairly easy to interpret.

E.g. http://data8.org/text/1_data.html#tables

Most of our examples don't pick out rows based on index, but instead based on their contents (e.g., using where) or by sampling.

@papajohn
Contributor

papajohn commented Oct 7, 2015

@stefanv take can take a range rather than a slice. (We actually teach np.arange instead b/c Python 3 ranges take extra explanation.)
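
For readers following along, here is a rough sketch of the two row-selection styles discussed in this thread: picking rows by position (as take does) and by contents (as where does). It uses a plain dict of NumPy arrays rather than the datascience API; the helper names take_rows and where_rows are hypothetical stand-ins, not library functions.

```python
import numpy as np

# Hypothetical stand-in for a table: a dict mapping column labels to arrays.
columns = {
    "names": np.array(["a", "b", "c", "d"]),
    "counts": np.array([1, 2, 3, 4]),
}


def take_rows(cols, indices):
    """Select rows by position; `indices` may be a range or an integer array."""
    idx = np.asarray(indices)
    return {label: values[idx] for label, values in cols.items()}


def where_rows(cols, label, value):
    """Select rows whose entry in column `label` equals `value`."""
    mask = cols[label] == value
    return {lab: values[mask] for lab, values in cols.items()}


first_two = take_rows(columns, np.arange(2))  # same rows as range(2)
only_c = where_rows(columns, "names", "c")
```

Both helpers return a new dict of columns, mirroring the point that take and where each produce a new table.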

@deculler
Contributor

deculler commented Oct 7, 2015

take by index
where by value

Both produce a new table.
On Oct 6, 2015 7:12 PM, "Stefan van der Walt" notifications@github.com wrote:

Is there currently another way of easily grabbing a few rows? This is a
common enough operation that I presumed I must have missed it.

@stefanv
Contributor Author

stefanv commented Oct 7, 2015

@papajohn Thanks for the link to the notes. I see now that the column syntax is often used for operations such as table['diff'] = table['2015'] - table['2012'].

One way to do this would be to repurpose .rows. We could, e.g., do t.rows[:15] and get a table back. t.rows[0] (the current usage) will still be supported.
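
A minimal sketch of what such a .rows accessor could look like, assuming a toy table backed by a dict of lists. MiniTable and Rows are hypothetical names made up for this illustration and do not reflect how the datascience Table is implemented.

```python
class Rows:
    """Hypothetical row accessor: integer index -> one row, slice -> new table."""

    def __init__(self, table):
        self._table = table

    def __len__(self):
        return self._table.num_rows

    def __getitem__(self, key):
        cols = self._table.columns
        if isinstance(key, slice):
            # Slicing returns a new table holding only the selected rows.
            return MiniTable({label: values[key] for label, values in cols.items()})
        # Integer indexing keeps returning a single row, here as a tuple.
        return tuple(values[key] for values in cols.values())


class MiniTable:
    """Toy table backed by a dict of equal-length lists (illustration only)."""

    def __init__(self, columns):
        self.columns = {label: list(values) for label, values in columns.items()}
        lengths = {len(values) for values in self.columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.num_rows = lengths.pop() if lengths else 0

    @property
    def rows(self):
        return Rows(self)

    def __getitem__(self, label):
        # Column access by label, as in t['label'].
        return self.columns[label]


t = MiniTable({"names": ["a", "b", "c", "d"], "counts": [1, 2, 3, 4]})
middle = t.rows[1:-1]  # a MiniTable with rows b and c
last = t.rows[-1]      # a single row
```

With this split, t['label'] stays reserved for columns, while slicing only happens through t.rows.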

Indexing by range is probably fine for smaller queries, but becomes expensive for larger ones:

In [1]: import numpy as np

In [2]: x = np.random.random(100000)

In [3]: %timeit x[:50000]
The slowest run took 122.46 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 259 ns per loop

In [4]: z = np.arange(50000)

In [5]: %timeit x[z]
The slowest run took 7.63 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 97.4 µs per loop

Has a design decision been made on whether a take operation should always copy data, or whether tables can re-use underlying column storage? Again, a trade-off between simplicity and memory usage (which can be limiting when working on large datasets).
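
One likely reason for the gap in the timings above is that basic slicing of a NumPy array returns a view onto the same buffer, while indexing with an integer array builds a new array. The sketch below checks this with np.shares_memory. It describes standard NumPy behaviour only and makes no claim about how Table.take handles its columns.

```python
import numpy as np

x = np.random.random(100000)

sliced = x[:50000]              # basic slicing: a view onto x's buffer
indexed = x[np.arange(50000)]   # integer-array indexing: a freshly allocated copy

print(np.shares_memory(x, sliced))   # True
print(np.shares_memory(x, indexed))  # False

# Writing through the view changes x; the copy is unaffected.
sliced[0] = -1.0
```

A table that re-used column storage on take would save memory in the same way, at the cost that writes through one table could become visible in another.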

@stefanv
Contributor Author

stefanv commented Oct 10, 2015

See #120.

@stefanv stefanv closed this Oct 10, 2015