A Sequel-based DSL to migrate data from a legacy database to a new home.



data-import is a data-migration framework. The goal of the project is to provide a simple API to migrate data from a legacy schema into a new one. It's based on jeremyevans/sequel.


Installation

Add the gem to your Gemfile:

gem 'data-import'

You can put your migration configuration in any file you like; we suggest something like mapping.rb. The various ways to connect to a database are described in the Sequel docs.

source 'sqlite://legacy_blog.db'
target :adapter => :postgres, :host => 'localhost', :user => 'user', :password => 'password', :database => 'blog'

import 'Animals' do
  from 'tblAnimal', :primary_key => 'sAnimalID'
  to 'animals'

  mapping 'sAnimalID' => 'id'
  mapping 'strAnimalTitleText' => 'name'
  mapping 'sAnimalAge' => 'age'
  mapping 'convert threat to danger rating' do
    rating = ['none', 'medium', 'big'].index(row[:strThreat]) + 1
    {:danger_rating => rating}
  end
end

To run the import, just execute:

  mapping_path = Rails.root + 'mapping.rb'
  DataImport.run_config! mapping_path

If you execute the import frequently you can create a Rake task:

desc "Imports the data from the source database"
task :import do
  mapping_path = Rails.root + 'mapping.rb'
  options = {}
  options[:only] = ENV['RUN_ONLY'].split(',') if ENV['RUN_ONLY'].present?

  DataImport.run_config! mapping_path, options
end

You can set the import to fail fast if your definitions contain mapping errors like referencing non-existent columns:

  options = { strict_mode: true }
  DataImport.run_config! mapping_path, options


data-import provides a clean DSL to define your mappings from the legacy schema to the new one.

Before Filter

data-import allows you to define a global filter. This filter can be used for global transformations such as encoding fixes. For example, you can define a filter which downcases every string like so:

before_filter do |row|
  row.each do |k, v|
    row[k] = v.downcase if v.respond_to?(:downcase)
  end
end
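For illustration, the same downcasing logic can be exercised as a standalone lambda outside the DSL (the row hash below is made up):

```ruby
# the downcasing filter as a standalone lambda, using a made-up row
downcase_filter = lambda do |row|
  row.each do |key, value|
    # only values that respond to #downcase (i.e. strings) are transformed
    row[key] = value.downcase if value.respond_to?(:downcase)
  end
end

row = {:strName => 'LION', :sAge => 7}
downcase_filter.call(row)
row # => {:strName => "lion", :sAge => 7}
```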

Simple Mappings

You've already seen a very basic example of the DSL in the Installation section. This part shows off the features of the mapping DSL.


Every mapping starts with a call to import followed by the name of the mapping. You can name mappings however you like. The block passed to import contains the mapping itself. You supply the source table with from and the target table with to. Make sure you set the primary key on the source table; otherwise pagination will not work properly and the migration will fill up your RAM.

import 'Users' do
  from 'tblUser', :primary_key => 'sUserID'
  to 'users'
end

Data source

In simple cases you would read from one table and write to another.

import 'Items' do
  from 'tblItem'
  to 'items'
end

This is not always sufficient. This gem allows you to specify a custom data source using the Sequel syntax or plain SQL.

import 'Items' do
  from 'items' do |sequel|
    sequel[:tblItems].join(:tblOrderItems, :sItemID => :sID)
  end
  to 'items'
end

The same data source expressed as plain SQL:

import 'Items' do
  from 'items' do |sequel|
    sequel['SELECT * FROM tblItems INNER JOIN tblOrderItems ON tblOrderItems.sItemID = tblItems.sID']
  end
  to 'items'
end

Data output

By default a new record will be inserted for each record read from the source. This behaviour can be changed; for example, you may want to update existing records.

import 'Article Authors' do
  from 'tblArticleAbout', :primary_key => 'sID'
  to 'articles', :mode => :update

  mapping 'lArticleId' => 'id'
  mapping 'strWho' => 'author'
end

With :mode => :update you tell data-import to update an existing record instead of inserting a new one. You have to specify a mapping for the primary key of the target table. If there is no value for the primary key, nothing will be updated.

There is also a unique writer to filter out duplicate records.

import 'Cities' do
  from 'tblCities', :primary_key => 'sID'
  to 'cities', :mode => [:unique, :columns => [:name, :zip]]

  mapping 'strName' => 'name'
  mapping 'sZip' => 'zip'
  mapping 'strShort' => 'short_name'
end

Passing the option :mode => [:unique, :columns => [:name, :zip]] makes data-import use a unique writer. :columns must be an array of columns which will be used to identify a duplicate record. Before a new record is inserted, data-import performs a select on the target table with the defined columns. Depending on the size of your table, you may therefore consider adding a (unique) index on those columns to speed up the import.
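As a sketch, such an index could be added with Sequel itself (connection details as in the configuration example above; `cities`, `name` and `zip` match the mapping shown here):

```ruby
require 'sequel'

# connect to the target database and add a unique index on the
# columns the unique writer compares before each insert
DB = Sequel.connect(:adapter => :postgres, :host => 'localhost',
                    :user => 'user', :password => 'password', :database => 'blog')
DB.add_index :cities, [:name, :zip], :unique => true
```

This requires a live database connection, so it is best placed in a schema migration rather than in mapping.rb.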


You can create simple name-mappings with a call to mapping:

mapping 'sUserID' => 'id'
mapping 'strEmail' => 'email'
mapping 'strUsername' => 'username'

If you need to process a column you can add a block. Inside it you have access to the entire record read from the source database. The return value of the block must be a hash or nil: nil means no mapping at all, and in the case of a hash you have to use the column names of the target table as keys.

mapping 'convert threat to danger rating' do
  rating = ['none', 'medium', 'big'].index(row[:strThreat]) + 1
  {:danger_rating => rating}
end
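The nil case can be seen by extracting the block body as a standalone lambda (the column names :strEmail and :email are hypothetical):

```ruby
# a sketch of a nil-returning mapping, written as a lambda for illustration;
# :strEmail and :email are hypothetical column names
email_mapping = lambda do |row|
  email = row[:strEmail].to_s.strip
  # returning nil skips the mapping for this row entirely
  {:email => email} unless email.empty?
end

email_mapping.call(:strEmail => ' admin@example.com ') # => {:email => "admin@example.com"}
email_mapping.call(:strEmail => '   ')                 # => nil
```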

Seed data

If you have static data that needs to be inserted, you can use the following feature:

import 'managers' do
  from 'tblManagers'
  to 'managers'

  seed :earns_much_money => true, :has_employees => true
end

After row blocks

Use after_row blocks to specify logic that will be executed after a row has been inserted.

import 'Sales Leads' do
  from 'SL_NewLeads', :primary_key => 'slNewLeadsID'
  to 'sales_leads'

  mapping 'slNewLeadsID' => :id

  after_row do
    target_database[:contacts].insert(:firstname => row[:slName1],
                                      :lastname => row[:slName2])
  end
end

Row validation

Rows can be validated before insertion.

import 'People' do
  from 'Person'
  to 'females'

  mapping 'Name' => :name
  mapping 'Gender' => :gender

  validate_row do
    if mapped_row[:gender] == 'f'
      true
    else
      logger.info "Row #{row} skipped since the gender is male"
      false
    end
  end
end
Inside the validation block you have access to the row read from the data source (row) and to the row with all mappings applied (mapped_row). If the result of the validation block evaluates to true, the row will be inserted into the target table. If the result is false, the insertion will be skipped.

Script mappings

If you have a more complex mapping than just reading from one source and writing each record to another, you can define script blocks. Inside a script block you can write ruby code that does your data conversion. The whole block runs in a transaction to ensure consistency.

script 'my complex conversion' do
  body do
    log_texts = source_database.db[:Logger].map { |record| record[:strText] }
    # further processing of log_texts goes here
  end
end


source_database.db and target_database.db are Sequel database objects. Look at the Sequel docs for more information.
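A script body might, for example, read from the source and write rows to the target; a minimal sketch, where :Logger, :strText and :log_entries are invented table and column names:

```ruby
# a sketch of a script block using both Sequel database objects;
# :Logger, :strText and :log_entries are invented names
script 'copy log texts' do
  body do
    source_database.db[:Logger].each do |record|
      target_database.db[:log_entries].insert(:text => record[:strText])
    end
  end
end
```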


Dependencies

You can specify dependencies between definitions. Dependencies are always run before a given definition is executed. Declaring all necessary dependencies also allows you to run a subset of definitions instead of everything.

import 'Roles' do
  from 'tblRole', :primary_key => 'sRoleID'
  to 'roles'
end

import 'SubscriptionPlans' do
  from 'tblSubcriptionCat', :primary_key => 'sSubscriptionCatID'
  to 'subscription_plans'
end

import 'Users' do
  from 'tblUser', :primary_key => 'sUserID'
  to 'users'
  dependencies 'SubscriptionPlans'
end

import 'Permissions' do
  from 'tblUserRoles'
  to 'permissions'
  dependencies 'Users', 'Roles'
end

You can now run parts of your mappings using the :only option:

DataImport.run_config! 'mappings.rb', :only => ['Users'] # => imports SubscriptionPlans then Users
DataImport.run_config! 'mappings.rb', :only => ['Roles'] # => imports Roles only
DataImport.run_config! 'mappings.rb', :only => ['Permissions'] # => imports Roles, SubscriptionPlans, Users and then Permissions


Lookup tables

If you have tables that are referenced by fields other than their primary key, you need to perform lookups while migrating the data. For this, data-import provides lookup tables, which essentially build an index from any given field to the migrated primary key.

The following example shows a table People which is linked to the table Organizations. Sadly, the legacy schema used the field strCode of the Organizations table as the reference.

import 'Organizations' do
  # define a lookup-table on the :sOrgId attribute named :sOrgId
  lookup_for :sOrgId

  # define a lookup-table on the :strCode attribute named :code
  lookup_for :code, :column => 'strCode'
end

import 'People' do
  dependencies 'Organizations'

  # you can then use the previously defined lookup-table on :code to get the primary-key
  reference 'Organizations', 'OrganizationCode' => :org_id, :lookup => :code

  # or you can do the same thing manually (this also works in script blocks)
  mapping 'organization code' do
    {:org_id => definition('Organizations').identify_by(:code, row['strCode'])}
  end
end

If you don't specify the option :lookup then data-import uses the lookup table called :id.


Logging

In every block, be it mapping, after_row, script body, validation, etc., you have access to a logger, which can be used as follows:

import 'Animals' do
  # source and target config

  # mapping config

  validate_row do
    logger.info "animal name has been mapped from #{row[:strName]} to #{mapped_row[:name]}"
    true
  end
end

The logger supports the standard log levels debug, info, warn, error and fatal.
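For example, inside an after_row block you could log at different levels (the column names :sUserID and :strEmail are invented):

```ruby
# a sketch: logging at different levels from an after_row block;
# :sUserID and :strEmail are invented column names
after_row do
  logger.debug "imported user #{row[:sUserID]}"
  logger.warn "user #{row[:sUserID]} has no email" if row[:strEmail].nil?
end
```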

This gem supports two different kinds of logging. The full logger records every bit of information; by default it writes to a file called import.log in the project root. The important logger only handles messages of levels warn, error and fatal, which are displayed on STDOUT by default. The file import.log will therefore hold every message you log during the data migration, while STDOUT only shows severe messages beside the progress bar. The reason for this distinction is that you don't want STDOUT to be flooded by debug messages.

Full and important logger can be configured as follows (for example with Ruby's standard Logger):

DataImport.full_logger = Logger.new('import.log')
DataImport.important_logger = Logger.new(STDOUT)

You can assign any object that provides the methods debug, info, warn, error and fatal.
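As a sketch, a plain Ruby object with those five methods will do; here is a hypothetical in-memory logger:

```ruby
# a hypothetical in-memory logger; it only needs to respond to
# debug, info, warn, error and fatal
class ArrayLogger
  attr_reader :messages

  def initialize
    @messages = []
  end

  # define one method per log level, each recording [level, message]
  [:debug, :info, :warn, :error, :fatal].each do |level|
    define_method(level) { |message| @messages << [level, message] }
  end
end

logger = ArrayLogger.new
logger.warn 'missing email'
logger.messages # => [[:warn, "missing email"]]
```

It could then be assigned with DataImport.important_logger = ArrayLogger.new, for example to capture log output in tests.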


You can learn a lot from the acceptance specs.


Got a question?

Just send me a message and I'll try to get back to you as soon as possible.

Found a bug?

Please submit a new issue.

Fixed something?

  1. Fork data-import
  2. Create a topic branch - git checkout -b my_branch
  3. Make your changes and update the History.txt file
  4. Push to your branch - git push origin my_branch
  5. Send me a pull-request for your topic branch
  6. That's it!