loop over all rows of a column family #10

Closed
neonknight opened this Issue · 1 comment

2 participants

@neonknight

I'm trying to loop over all rows in a column family. In pycassa this performs well and scales well:

for row in mycolfam.get_range():

My test data set of ~200,000 rows is processed in 20 s (10,000 rows/s) on a weak development virtual machine.

However, I haven't found a working solution in cassandra-simple. Using something like

$conn->get_range('mycolfam', {'row_count'=>$maxrows});

will start loading all tokens from the database into memory. This works for a row_count of up to 10000. A larger row_count makes the request take an enormous amount of time to finish: several hours for 200000 rows.

Is there a scalable way in cassandra-simple that I missed? Or is this impossible at the moment?

@fmgoncalves fmgoncalves was assigned
@fmgoncalves
Owner

That behaviour in pycassa derives from a very nice Python feature. Since ColumnFamily.get_range returns an iterator, the loop is basically syntactic sugar for

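start = ''
while True:
  rows = list(mycolfam.get_range(start=start, row_count=100))
  # do something with each (key, columns) pair in rows
  if len(rows) < 100:
    break
  start = rows[-1][0]  # start is inclusive: the next page begins with this row again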

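In that example, get_range probably fetches a fixed number of rows per request (the default row_count of 100) and repeats the request, starting from the last key seen, for as long as rows come back.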
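To make the paging idea concrete, here is a minimal, self-contained sketch. It does not talk to Cassandra: `fetch_page` is a hypothetical stand-in for a single range request, and the `store` dict sorted by key is an assumption standing in for a column family (a real cluster with a random partitioner returns rows in token order, not key order). `iter_rows` and `page_size` are likewise illustrative names, not part of pycassa or Cassandra::Simple.

```python
from bisect import bisect_left


def fetch_page(store, start, row_count):
    """Hypothetical stand-in for one range request: return up to
    row_count (key, columns) pairs whose key is >= start, in key order."""
    keys = sorted(store)
    first = bisect_left(keys, start)
    return [(key, store[key]) for key in keys[first:first + row_count]]


def iter_rows(store, page_size=100):
    """Yield every row while holding at most one page in memory."""
    if page_size < 2:
        # With an inclusive start key, a one-row page could never advance.
        raise ValueError("page_size must be at least 2")
    start = ""
    skip_first = False
    while True:
        page = fetch_page(store, start, page_size)
        # The start key is inclusive, so every page after the first begins
        # with the row that ended the previous page; drop that duplicate.
        for key, columns in (page[1:] if skip_first else page):
            yield key, columns
        if len(page) < page_size:
            return  # a short page means the range is exhausted
        start = page[-1][0]
        skip_first = True


# Illustrative data set: 1050 rows, read in pages of 100.
store = {"row%06d" % i: {"col": i} for i in range(1050)}
seen = [key for key, _ in iter_rows(store, page_size=100)]
```

The caller only ever sees a flat stream of rows, which is the effect the `for row in mycolfam.get_range():` loop has. Memory use is bounded by the page size rather than by the total number of rows.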

To support this kind of logic, Cassandra::Simple (without the syntactic sugar) would probably have to use something like Tie::Hash::DxHash so that get_range returns its keys in order and they can be iterated properly (pycassa uses Python's native OrderedDict).
I'll look into this to see if the extra functionality is worth the added complexity (but I think it is, since this is important functionality).
