Tool to help users migrate large relational databases into Hadoop clusters.
Switch branches/tags
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.


cascading-dbmigrate makes it easy to run Cascading flows on sql tables with a primary key of an int or a long. We use it at BackType to migrate data from our databases to HDFS.

Cascading-DBMigrate is available on Clojars.


To read data from a database in a Cascading flow, use DBMigrateTap. DBMigrateTap's constructor has the following signature:

  int numChunks,        // The number of splits to create of the database.
                        // This will correspond to the number of mappers
                        // created to read the database.
  String dbDriver,      // For example, "com.mysql.jdbc.Driver"
  String dbUrl,         // For example, "jdbc:mysql://localhost:3306/mydb"
  String username,      // Username to connect to your database.
  String pwd,           // Password to connect to your database.
  String tableName,     // The table to read during the flow.
  String pkColumn,      // The name of the primary key column of the table.
  String[] columnNames, // The names of the columns to read into the flow.
  Options ops           // Optional, can provide min/max values to read.

The tap will emit tuples containing one field for each column read, the field names being the column names.



 (defn db-range [min max]
   (let [opts (new cascading.dbmigrate.tap.DBMigrateTap$Options)]
     (set! (. opts :minId) min))
     (set! (. opts :maxId) max))

 (defn db-tap [table]
     (into-array ["id" "name"])
     (db-range 1 100))) ;; Only load first 100 records

 (?<- (stdout)
      [?id ?name]
      ((db-tap "users") ?id ?name))


To build cascading-dbmigrate, follow these instructions:

  1. Set HADOOP_HOME environment variable to the root directory of your hadoop distribution.
  2. Set CASCADING_HOME environment variable to the root directory of your cascading distribution.
  3. ant jar

This will produce a single jar called cascading_dbmigrate.jar in the build/ directory.

Thanks to Chris Wensel for his help in developing this project.