I'm trying to loop over all rows in a column family. This performs well and scales nicely in pycassa:
for row in mycolfam.get_range():
My test data set of ~200,000 rows is processed in 20 s (10,000 rows/s) on a weak development virtual machine.
However, I haven't found a working solution in cassandra-simple. A single get_range call with a large row_count starts loading all tokens from the database into memory. This works for a row_count of up to about 10,000; a bigger row_count leads to enormous request times, several hours for 200,000 rows.
Is there a scalable way in cassandra-simple that I missed? Or is this impossible at the moment?
That feature in pycassa relies on a very nice Python feature: ColumnFamily.get_range returns an iterator, which is basically syntactic sugar for
rows = mycolfam.get_range()
while len(rows) > 0:
    # process the current batch of rows ...
    rows = mycolfam.get_range(start=rows.keys()[-1])
In that example, get_range presumably uses the default row_count of 100 and repeats the query until no more rows are returned.
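The paging logic can be sketched in plain Python without a live Cassandra cluster. The FakeColumnFamily class below is an illustrative in-memory stand-in (not pycassa's API); it only mimics a range query over sorted row keys, and iterate_all_rows shows the resume-from-last-key loop that the iterator hides:

```python
from collections import OrderedDict

# Illustrative in-memory stand-in for a column family. pycassa itself talks
# to Cassandra over Thrift, but the paging logic is the same.
class FakeColumnFamily:
    def __init__(self, data):
        # Rows kept sorted by key, as an order-preserving partitioner would
        # return them from a range query.
        self._data = OrderedDict(sorted(data.items()))

    def get_range(self, start='', row_count=100):
        """Return up to row_count rows with key >= start, in key order."""
        out = OrderedDict()
        for key, columns in self._data.items():
            if key >= start:
                out[key] = columns
                if len(out) == row_count:
                    break
        return out

def iterate_all_rows(cf, batch_size=100):
    """Yield every (key, columns) pair by paging through key ranges."""
    batch = cf.get_range(row_count=batch_size)
    while batch:
        for key, columns in batch.items():
            yield key, columns
        # Resume at the last key seen; the new page includes that key again,
        # so drop it to avoid yielding a duplicate.
        last_key = next(reversed(batch))
        batch = cf.get_range(start=last_key, row_count=batch_size)
        batch.pop(last_key, None)
```

Because each request fetches only batch_size rows, memory stays bounded no matter how many rows the column family holds.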
To offer this kind of logic (without the syntactic sugar), Cassandra::Simple would probably have to use something like Tie::IxHash so that get_range returns ordered keys that can be iterated properly (pycassa uses Python's OrderedDict).
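Why ordered keys matter is easy to see in plain Python: resuming from "the last key fetched" is only meaningful if the result mapping preserves key order, which OrderedDict guarantees (the row keys and values here are made up for illustration):

```python
from collections import OrderedDict

# Rows as returned by a range query, in key order.
rows = OrderedDict([('alice', {'age': 30}),
                    ('bob',   {'age': 25}),
                    ('carol', {'age': 35})])

# The last key is the resume point for the next get_range call.
# With an unordered hash/dict there is no well-defined "last" key.
last_key = next(reversed(rows))  # -> 'carol'
```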
I'll look into this to see if the extra functionality is worth the added complexity (but I think it is, since this is important functionality).
Result hashes are now ordered, allowing iteration. fixes #10