Ruby ORM for Cassandra with CQL3

Cequel

Cequel is a Ruby ORM for Cassandra using CQL3.

Cequel::Record is an ActiveRecord-like domain model layer that exposes the robust data modeling capabilities of CQL3, including parent-child relationships via compound primary keys and collection columns.

The lower-level Cequel::Metal layer provides a CQL query builder interface inspired by the excellent Sequel library.

Installation

Add it to your Gemfile:

gem 'cequel'

If you use Rails 5, add this:

gem 'activemodel-serializers-xml'

Rails integration

Cequel does not require Rails, but if you are using Rails, you will need version 3.2+. Cequel::Record will read from the configuration file config/cequel.yml if it is present. You can generate a default configuration file with:

rails g cequel:configuration

Once you've got things configured (or decided to accept the defaults), run this to create your keyspace (database):

rake cequel:keyspace:create

Setting up Models

Unlike in ActiveRecord, models declare their properties inline. We'll start with a simple Blog model:

class Blog
  include Cequel::Record

  key :subdomain, :text
  column :name, :text
  column :description, :text
end

Unlike a relational database, Cassandra does not have auto-incrementing primary keys, so you must explicitly set the primary key when you create a new model. For blogs, we use a natural key, which is the subdomain. Another option is to use a UUID.

Compound keys and parent-child relationships

While Cassandra is not a relational database, compound keys do naturally map to parent-child relationships. Cequel supports this explicitly with the has_many and belongs_to relations. Let's create a model for posts that acts as the child of the blog model:

class Post
  include Cequel::Record
  belongs_to :blog
  key :id, :timeuuid, auto: true
  column :title, :text
  column :body, :text
end

The auto option for the key declaration means Cequel will initialize new records with a UUID already generated. This option is only valid for :uuid and :timeuuid key columns.

The belongs_to association accepts a :foreign_key option which allows you to specify the attribute used as the partition key.

Note that the belongs_to declaration must come before the key declaration. This is because belongs_to defines the partition key; the id column is the clustering column.

Practically speaking, this means that posts are accessed using both the blog_subdomain (automatically defined by the belongs_to association) and the id. The most natural way to represent this type of lookup is using a has_many association. Let's add one to Blog:

class Blog
  include Cequel::Record

  key :subdomain, :text
  column :name, :text
  column :description, :text

  has_many :posts
end

Now we might do something like this:

class PostsController < ActionController::Base
  def show
    Blog.find(current_subdomain).posts.find(params[:id])
  end
end

A parent-child relationship between namespaced models can be defined with the class_name option of belongs_to:

module Blogger
  class Blog
    include Cequel::Record

    key :subdomain, :text
    column :name, :text
    column :description, :text

    has_many :posts
  end
end

module Blogger
  class Post
    include Cequel::Record

    belongs_to :blog, class_name: 'Blogger::Blog'
    key :id, :timeuuid, auto: true
    column :title, :text
    column :body, :text
  end
end

Timestamps

If your final primary key column is a timeuuid with the :auto option set, the created_at method will return the time that the UUID key was generated.
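
A creation time can be derived from the key because a version 1 (time-based) UUID embeds a 60-bit timestamp, counted in 100-nanosecond intervals since 15 October 1582. The following is a minimal plain-Ruby sketch of that encoding, not Cequel's implementation; timeuuid_for and time_from_timeuuid are hypothetical helper names, and the clock sequence and node are fixed at zero for simplicity:

```ruby
# Number of 100-nanosecond intervals between the UUID epoch (1582-10-15)
# and the Unix epoch (1970-01-01).
GREGORIAN_OFFSET = 122_192_928_000_000_000

# Hypothetical helper: builds a version 1 UUID string for the given time.
def timeuuid_for(time, clock_seq = 0, node = 0)
  ticks = (time.to_i * 10_000_000) + (time.nsec / 100) + GREGORIAN_OFFSET
  time_low = ticks & 0xFFFFFFFF
  time_mid = (ticks >> 32) & 0xFFFF
  time_hi = ((ticks >> 48) & 0x0FFF) | 0x1000 # version 1 marker
  clock = (clock_seq & 0x3FFF) | 0x8000       # RFC 4122 variant bits
  format('%08x-%04x-%04x-%04x-%012x', time_low, time_mid, time_hi, clock, node)
end

# Hypothetical helper: recovers the embedded timestamp from a version 1 UUID.
def time_from_timeuuid(uuid)
  time_low, time_mid, time_hi = uuid.split('-').first(3).map(&:hex)
  ticks = ((time_hi & 0x0FFF) << 48) | (time_mid << 32) | time_low
  Time.at(Rational(ticks - GREGORIAN_OFFSET, 10_000_000)).utc
end

created = Time.utc(2016, 10, 25, 12, 0, 0)
uuid = timeuuid_for(created)
puts uuid
puts time_from_timeuuid(uuid) == created # true
```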

To add timestamp columns, simply use the timestamps class macro:

class Blog
  include Cequel::Record

  key :subdomain, :text
  column :name, :text
  timestamps
end

This will automatically define created_at and updated_at columns, and populate them appropriately on save.

If the creation time can be extracted from the primary key as outlined above, this method will be preferred and no created_at column will be defined.

Schema synchronization

Cequel will automatically synchronize the schema stored in Cassandra to match the schema you have defined in your models. If you're using Rails, you can synchronize your schemas for everything in app/models by invoking:

rake cequel:migrate

Record sets

Record sets are lazy-loaded collections of records that correspond to a particular CQL query. They behave similarly to ActiveRecord scopes:

Post.select(:id, :title).reverse.limit(10)

To scope a record set to a primary key value, use the [] operator. This will define a scoped value for the first unscoped primary key in the record set:

Post['bigdata'] # scopes posts with blog_subdomain="bigdata"

You can pass multiple arguments to the [] operator, which will generate an IN query:

Post['bigdata', 'nosql'] # scopes posts with blog_subdomain IN ("bigdata", "nosql")

To select ranges of data, use before, after, from, upto, and in. Like the [] operator, these methods operate on the first unscoped primary key:

Post['bigdata'].after(last_id) # scopes posts with blog_subdomain="bigdata" and id > last_id

You can also use where to scope to primary key columns, but a primary key column can only be scoped if all the columns that come before it are also scoped. The examples below assume a Post model whose second key column is permalink rather than id:

Post.where(blog_subdomain: 'bigdata') # this is fine
Post.where(blog_subdomain: 'bigdata', permalink: 'cassandra') # also fine
Post.where(blog_subdomain: 'bigdata').where(permalink: 'cassandra') # also fine
Post.where(permalink: 'cassandra') # bad: can't use permalink without blog_subdomain
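
The prefix rule can be sketched in plain Ruby. This only illustrates the constraint and is not how Cequel checks it; valid_key_scope? is a hypothetical helper, and the key list assumes a Post model keyed by blog_subdomain and permalink, as in the where examples above:

```ruby
# Hypothetical helper: true when the scoped columns form an unbroken prefix
# of the primary key columns (the order of the scoped list does not matter).
def valid_key_scope?(key_columns, scoped_columns)
  prefix = key_columns.take_while { |column| scoped_columns.include?(column) }
  prefix.length == scoped_columns.uniq.length
end

POST_KEYS = %i[blog_subdomain permalink].freeze

puts valid_key_scope?(POST_KEYS, %i[blog_subdomain])           # true
puts valid_key_scope?(POST_KEYS, %i[blog_subdomain permalink]) # true
puts valid_key_scope?(POST_KEYS, %i[permalink])                # false
```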

Note that record sets always load records in batches; Cassandra does not support result sets of unbounded size. This process is transparent to you but you'll see multiple queries in your logs if you're iterating over a huge result set.
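
To give a feel for what batched loading looks like, here is a minimal plain-Ruby sketch that pages through an in-memory array by key. It is an illustration under stated assumptions, not Cequel's internal paging logic: ROWS, fetch_page, and each_row are hypothetical names, and each call to fetch_page stands in for one query:

```ruby
# Stand-in for a table partition: rows sorted by their clustering key.
ROWS = (1..7).map { |id| { id: id, title: "Post #{id}" } }.freeze

# Hypothetical page fetcher: returns up to `limit` rows with id > after_id.
def fetch_page(after_id, limit)
  ROWS.select { |row| after_id.nil? || row[:id] > after_id }.first(limit)
end

# Yields every row while issuing one "query" per batch.
def each_row(batch_size)
  return enum_for(:each_row, batch_size) unless block_given?

  last_id = nil
  loop do
    page = fetch_page(last_id, batch_size)
    page.each { |row| yield row }
    break if page.length < batch_size # a short page means we reached the end

    last_id = page.last[:id]
  end
end

puts each_row(3).map { |row| row[:id] }.inspect # [1, 2, 3, 4, 5, 6, 7]
```

The caller iterates as if over a single collection; only the number of fetch_page calls reveals that the data arrived in three batches.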

Time UUID Queries

CQL has special handling for the timeuuid type, which allows you to return rows whose UUID keys correspond to a range of timestamps.

Cequel automatically constructs timeuuid range queries if you pass a Time value for a range over a timeuuid column. So, if you want to get the posts from the last day, you can run:

Blog['myblog'].posts.from(1.day.ago)

Updating records

When you update an existing record, Cequel will only write statements to the database that correspond to explicit modifications you've made to the record in memory. So, in this situation:

@post = Blog.find(current_subdomain).posts.find(params[:id])
@post.update_attributes!(title: "Announcing Cequel 1.0")

Cequel will only update the title column. Note that this is not full dirty tracking; simply setting the title on the record will signal to Cequel that you want to write that attribute to the database, regardless of its previous value.
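
The difference between "write what was assigned" and full dirty tracking can be sketched in plain Ruby. SketchRecord is a hypothetical stand-in, not Cequel's implementation, and the generated statement is only an approximation of what would be sent:

```ruby
# Plain-Ruby stand-in for a record that remembers which attributes were set.
class SketchRecord
  def initialize(table, key, attributes)
    @table = table
    @key = key
    @attributes = attributes
    @written = {}
  end

  # Every assignment is remembered, even if the value did not change.
  def write(name, value)
    @attributes[name] = value
    @written[name] = value
  end

  # Builds an UPDATE that mentions only the assigned columns.
  def update_statement
    return nil if @written.empty?

    assignments = @written.keys.map { |name| "#{name} = ?" }.join(', ')
    conditions = @key.keys.map { |name| "#{name} = ?" }.join(' AND ')
    ["UPDATE #{@table} SET #{assignments} WHERE #{conditions}",
     @written.values + @key.values]
  end
end

post = SketchRecord.new(:posts, { blog_subdomain: 'bigdata', id: 42 },
                        { title: 'Draft', body: 'Hello' })
post.write(:title, 'Announcing Cequel 1.0')
cql, binds = post.update_statement
puts cql # UPDATE posts SET title = ? WHERE blog_subdomain = ? AND id = ?
```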

Unloaded models

In the above example, we call the familiar find method to load a blog and then one of its posts, but we didn't actually do anything with the data in the Blog model; it was simply a convenient object-oriented way to get a handle to the blog's posts. Cequel supports unloaded models via the [] operator; this will return an unloaded blog instance, which knows the value of its primary key, but does not read the row from the database. So, we can refactor the example to be a bit more efficient:

class PostsController < ActionController::Base
  def show
    @post = Blog[current_subdomain].posts.find(params[:id])
  end
end

If you attempt to access a data attribute on an unloaded instance, it will lazy-load the row from the database and become a normal loaded instance.
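
The lazy-loading behavior can be pictured with a small plain-Ruby sketch. LazyBlog and STORE are hypothetical stand-ins (a Hash plays the role of the database), so this shows the idea rather than Cequel's actual code:

```ruby
# An instance that knows its key but reads its row only on first access.
class LazyBlog
  attr_reader :subdomain, :reads

  def initialize(subdomain, store)
    @subdomain = subdomain
    @store = store
    @row = nil
    @reads = 0
  end

  # Accessing a data attribute triggers the load.
  def name
    load_row[:name]
  end

  private

  # Reads the row once and caches it for later attribute access.
  def load_row
    @row ||= begin
      @reads += 1
      @store.fetch(@subdomain)
    end
  end
end

STORE = { 'bigdata' => { name: 'Big Data' } }.freeze

blog = LazyBlog.new('bigdata', STORE)
puts blog.reads # 0
puts blog.name  # Big Data
puts blog.name  # Big Data
puts blog.reads # 1
```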

You can generate a collection of unloaded instances by passing multiple arguments to []:

class BlogsController < ActionController::Base
  def recommended
    @blogs = Blog['cassandra', 'nosql']
  end
end

The above will not generate a CQL query, but when you access a property on any of the unloaded Blog instances, Cequel will load data for all of them with a single query. Note that CQL does not allow selecting collection columns when loading multiple records by primary key; only scalar columns will be loaded.

There is another use for unloaded instances: you may set attributes on an unloaded instance and call save without ever actually reading the row from Cassandra. Because Cassandra is optimized for writing data, this "write without reading" pattern gives you maximum efficiency, particularly if you are updating a large number of records.

Collection columns

Cassandra supports three types of collection columns: lists, sets, and maps. Collection columns can be manipulated using atomic collection mutation; e.g., you can add an element to a set without knowing the existing elements. Cequel supports this by exposing collection objects that keep track of their modifications, and which then persist those modifications to Cassandra on save.

Let's add a category set to our post model:

class Post
  include Cequel::Record

  belongs_to :blog
  key :id, :uuid
  column :title, :text
  column :body, :text
  set :categories, :text
end

If we were to then update a post like so:

@post = Blog[current_subdomain].posts[params[:id]]
@post.categories << 'Kittens'
@post.save!

Cequel would send the CQL equivalent of "Add the category 'Kittens' to the post at the given (blog_subdomain, id)", without ever reading the saved value of the categories set.
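
A rough plain-Ruby sketch of a collection that records its own modifications is shown below. TrackedSet is a hypothetical class, not Cequel's; it inlines the literal for readability, whereas real code would normally use bind parameters:

```ruby
require 'set'

# Tracks elements added to a set column without knowing the stored contents.
class TrackedSet
  def initialize(column)
    @column = column
    @added = Set.new
  end

  def <<(element)
    @added << element
    self
  end

  # CQL fragment that appends the tracked elements to the stored set.
  def update_fragment
    return nil if @added.empty?

    literals = @added.map do |element|
      escaped = element.gsub("'", "''") # CQL escapes a quote by doubling it
      "'#{escaped}'"
    end
    "#{@column} = #{@column} + {#{literals.join(', ')}}"
  end
end

categories = TrackedSet.new(:categories)
categories << 'Kittens'
puts "UPDATE posts SET #{categories.update_fragment} WHERE blog_subdomain = ? AND id = ?"
# UPDATE posts SET categories = categories + {'Kittens'} WHERE blog_subdomain = ? AND id = ?
```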

Secondary indexes

Cassandra supports secondary indexes, although with notable restrictions:

  • Only scalar data columns can be indexed; key columns and collection columns cannot.
  • A secondary index consists of exactly one column.
  • Though you can have more than one secondary index on a table, you can only use one in any given query.

Cequel supports the :index option to add secondary indexes to column definitions:

class Post
  include Cequel::Record

  belongs_to :blog
  key :id, :uuid
  column :title, :text
  column :body, :text
  column :author_id, :uuid, :index => true
  set :categories, :text
end

Defining a column with a secondary index adds several "magic methods" for using the index:

Post.with_author_id(id) # returns a record set scoped to that author_id
Post.find_by_author_id(id) # returns the first post with that author_id
Post.find_all_by_author_id(id) # returns an array of all posts with that author_id

You can also call the where method directly on record sets:

Post.where(author_id: id)

Consistency tuning

Cassandra supports tunable consistency, allowing you to choose the right balance between query speed and consistent reads and writes. Cequel supports consistency tuning for reads and writes:

Post.new(id: 1, title: 'First post!').save!(consistency: :all)

Post.consistency(:one).find_each { |post| puts post.title }

Both read and write consistency default to QUORUM.
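
As a quick worked example of the trade-off: a quorum is a majority of replicas, and a read is guaranteed to overlap the latest write whenever the number of replicas read plus the number written exceeds the replication factor. The sketch below assumes a replication factor of 3; quorum and overlapping? are hypothetical helper names:

```ruby
# A quorum is a strict majority of the replicas.
def quorum(replication_factor)
  (replication_factor / 2) + 1
end

# True when every read replica set must intersect every write replica set.
def overlapping?(read_replicas, write_replicas, replication_factor)
  read_replicas + write_replicas > replication_factor
end

rf = 3 # assumed replication factor
puts quorum(rf)                               # 2
puts overlapping?(quorum(rf), quorum(rf), rf) # true
puts overlapping?(1, 1, rf)                   # false
```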

Compression

Cassandra supports frame compression, which can give you a performance boost if your requests or responses are big. To enable it, specify the client_compression option in config/cequel.yml:

development:
  host: '127.0.0.1'
  port: 9042
  keyspace: Blog
  client_compression: :lz4

ActiveModel Support

Cequel supports ActiveModel functionality, such as callbacks, validations, dirty attribute tracking, naming, and serialization. If you're using Rails 3, mass-assignment protection works as usual, and in Rails 4, strong parameters are treated correctly. So we can add some extra ActiveModel goodness to our post model:

class Post
  include Cequel::Record

  belongs_to :blog
  key :id, :uuid
  column :title, :text
  column :body, :text

  validates :body, presence: true

  after_save :notify_followers
end

Note that validations or callbacks that need to read data attributes will cause unloaded models to load their row during the course of the save operation, so if you are following a write-without-reading pattern, you will need to be careful.

Dirty attribute tracking is only enabled on loaded models.

Upgrading from Cequel 0.x

Cequel 0.x targeted CQL2, which has a substantially different data representation from CQL3. Accordingly, upgrading from Cequel 0.x to Cequel 1.0 requires some changes to your data models.

Upgrading a Cequel::Model

Upgrading from a Cequel::Model class is fairly straightforward; simply add the compact_storage directive to your class definition:

# Model definition in Cequel 0.x
class Post
  include Cequel::Model

  key :id, :uuid
  column :title, :text
  column :body, :text
end

# Model definition in Cequel 1.0
class Post
  include Cequel::Record

  key :id, :uuid
  column :title, :text
  column :body, :text

  compact_storage
end

Note that the semantics of belongs_to and has_many are completely different between Cequel 0.x and Cequel 1.0; if you have data columns that reference keys in other tables, you will need to hand-roll those associations for now.

Upgrading a Cequel::Model::Dictionary

CQL3 does not have a direct "wide row" representation like CQL2, so the Dictionary class does not have a direct analog in Cequel 1.0. Instead, each (row key, map key, value) tuple in a Dictionary corresponds to a single row in CQL3. Upgrading a Dictionary to Cequel 1.0 involves defining two primary keys and a single data column, again using the compact_storage directive:

# Dictionary definition in Cequel 0.x
class BlogPosts < Cequel::Model::Dictionary
  key :blog_id, :uuid
  maps :uuid => :text

  private

  def serialize_value(column, value)
    value.to_json
  end

  def deserialize_value(column, value)
    JSON.parse(value)
  end
end

# Equivalent model in Cequel 1.0
class BlogPost
  include Cequel::Record

  key :blog_id, :uuid
  key :id, :uuid
  column :data, :text

  compact_storage

  def data
    JSON.parse(read_attribute(:data))
  end

  def data=(new_data)
    write_attribute(:data, new_data.to_json)
  end
end

Cequel::Model::Dictionary did not infer a pluralized table name, as Cequel::Model did and Cequel::Record does. If your legacy Dictionary table has a singular table name, add self.table_name = :blog_post to the model definition.

Note that you will want to run ::synchronize_schema on your models when upgrading; this will not change the underlying data structure, but will add some CQL3-specific metadata to the table definition which will allow you to query it.

CQL Gotchas

CQL is designed to be immediately familiar to those of us who are used to working with SQL, which is all of us. Cequel advances this spirit by providing an ActiveRecord-like mapping for CQL. However, Cassandra is very much not a relational database, so some behaviors can come as a surprise. Here's an overview.

Counting

Counting is not the same as in a relational database: it can have a much longer runtime and can put unexpected load on your cluster. As a result, Cequel does not support this feature. If you need counts, you can still execute raw CQL:

MyModel.connection.execute('select count(*) from table_name;').first['count']

Upserts

Perhaps the most surprising fact about CQL is that INSERT and UPDATE are essentially the same thing: both simply persist the given column data at the given key(s). So, you may think you are creating a new record, but in fact you're overwriting data at an existing record:

# I'm just creating a blog here.
blog1 = Blog.create!(
  subdomain: 'big-data',
  name: 'Big Data',
  description: 'A blog about all things big data')

# And another new blog.
blog2 = Blog.create!(
  subdomain: 'big-data',
  name: 'The Big Data Blog')

Living in a relational world, we'd expect the second statement to throw an error because the row with key 'big-data' already exists. But not Cassandra: the above code will just overwrite the name in that row. Note that the description will not be touched by the second statement; upserts only work on the columns that are given.
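
The upsert behavior is easy to model with a plain Ruby Hash, where writing to a key merges the given columns into whatever is already stored. This is only an analogy for the semantics described above; the upsert helper is hypothetical and no Cassandra connection is involved:

```ruby
# Hypothetical helper: merges the given columns into the row at `key`,
# creating the row if it does not exist yet.
def upsert(table, key, columns)
  table[key] = (table[key] || {}).merge(columns)
end

table = {}
upsert(table, 'big-data', { name: 'Big Data',
                            description: 'A blog about all things big data' })
upsert(table, 'big-data', { name: 'The Big Data Blog' })

puts table.length                   # 1
puts table['big-data'][:name]        # The Big Data Blog
puts table['big-data'][:description] # A blog about all things big data
```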

Compatibility

Rails

  • 5.0
  • 4.2
  • 4.1
  • 4.0

Ruby

  • Ruby 2.3, 2.2, 2.1, 2.0

Cassandra

  • 2.1.x
  • 2.2.x
  • 3.0.x

Breaking API changes

2.0

  • Dropped support for JRuby, due to bugs in JRuby that are difficult to work around. PRs to restore JRuby compatibility are welcome.

Support & Bugs

If you find a bug, feel free to open an issue on GitHub. Pull requests are most welcome.

For questions or feedback, hit up our mailing list at cequel@groups.google.com or find outoftime in the #cassandra IRC channel on Freenode.

Contributing

See CONTRIBUTING.md

Credits

Cequel was written by:

  • Mat Brown
  • Aubrey Holland
  • Keenan Brock
  • Insoo Buzz Jung
  • Louis Simoneau
  • Peter Williams
  • Kenneth Hoffman
  • Antti Tapio
  • Ilya Bazylchuk
  • Dan Cardamore
  • Kei Kusakari
  • Oleh Novosad
  • John Smart
  • Angelo Lakra
  • Olivier Lance
  • Tomohiro Nishimura
  • Masaki Takahashi
  • G Gordon Worley III
  • Clark Bremer
  • Tamara Temple
  • Long On
  • Lucas Mundim
  • Luke Duncalfe
  • Eric Betts
  • Maxim Dobryakov
  • Yi-Cyuan Chen
  • Justin Hannus

Special thanks to Brewster, which supported the 0.x releases of Cequel.

Shameless Self-Promotion

If you're new to Cassandra, check out Learning Apache Cassandra, a hands-on guide to Cassandra application development by example, written by the creator of Cequel.

License

Cequel is distributed under the MIT license. See the attached LICENSE for all the sordid details.