loop over all rows of a column family #10

Closed
neonknight opened this Issue Feb 28, 2012 · 1 comment


I'm trying to loop over all rows in a column family. In pycassa this performs well and scales nicely:

for row in mycolfam.get_range():

My test data set of ~200000 rows is processed in 20 s (10000 rows/s) on a weak development virtual machine.

However, I haven't found a working solution in cassandra-simple. Using something like

$conn->get_range('mycolfam', {'row_count'=>$maxrows});

will start loading all tokens from the database into memory. This works for a row_count of up to 10000. A larger row_count makes the request take an enormous amount of time: several hours for 200000 rows.

Is there a scalable way to do this in cassandra-simple that I missed? Or is it impossible at the moment?

@ghost ghost assigned fmgoncalves Feb 28, 2012

Owner

fmgoncalves commented Feb 28, 2012

That feature in pycassa derives from a very nice Python feature. Since ColumnFamily.get_range implements an iterator, the for loop is basically syntactic sugar for

rows = mycolfam.get_range()
while len(rows) > 0:
  # do something with rows
  # resume from the last row key seen; start selects row keys, column_start would select columns
  rows = mycolfam.get_range(start=rows.keys()[-1])

In that example, get_range probably uses the default row_count 100 and repeats it while possible.

To support this kind of logic (without the syntactic sugar), Cassandra::Simple would probably need something like Tie::Hash::DxHash so that get_range returns its keys in order and they can be iterated properly (pycassa uses the Python native OrderedDict).
I'll look into this to see whether the extra functionality is worth the added complexity. I think it is, since this is important functionality.
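
For illustration, the paging idea described above can be sketched in plain Python. This is only a sketch under stated assumptions: FakeColumnFamily is a hypothetical in-memory stand-in (not pycassa or Cassandra::Simple), it returns rows sorted by key, and its start argument is assumed to be inclusive, so each page after the first begins with a row that was already seen and has to be skipped.

```python
from collections import OrderedDict


class FakeColumnFamily:
    """Hypothetical in-memory stand-in for a column family (illustration only)."""

    def __init__(self, rows):
        # Keep rows sorted by key so "the last key of a page" is well defined.
        self._rows = OrderedDict(sorted(rows.items()))

    def get_range(self, start="", row_count=100):
        """Return up to row_count rows whose key is >= start (start is inclusive)."""
        page = OrderedDict()
        for key, columns in self._rows.items():
            if key >= start:
                page[key] = columns
                if len(page) == row_count:
                    break
        return page


def iter_all_rows(cf, page_size=100):
    """Yield every (key, columns) pair while holding only one page in memory."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 because pages overlap by one row")
    start = ""
    first_page = True
    while True:
        page = cf.get_range(start=start, row_count=page_size)
        items = list(page.items())
        if not first_page:
            # start is inclusive, so drop the row repeated from the previous page.
            items = items[1:]
        for key, columns in items:
            yield key, columns
        if len(page) < page_size:
            # A short page means the end of the range was reached.
            return
        start = items[-1][0]
        first_page = False


cf = FakeColumnFamily({"row%04d" % i: {"n": i} for i in range(250)})
keys = [key for key, _ in iter_all_rows(cf, page_size=100)]
```

The ordered page is what makes this work: without a reliable "last key", there is no safe place to resume from. A real cluster may return rows in token order rather than key order, so only the order within the returned page matters for resuming.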
