Data Engineering Challenge
My attempt uses Rails and a RESTfult style for the various interactions. See the thoughts section below for an idea of how (and why) the code is organized.
- Live demo, using heroku, at http://arnab-ls-data-challenge.herokuapp.com/
- Note that importing files requires a sign in/up.
- Backlog and stories, using Pivotal Tracker, at https://www.pivotaltracker.com/projects/507765
- Continuous integration, using travis.ci, at http://travis-ci.org/arnab/data-engineering
- Get rvm if you don't already have it.
- Install Ruby 1.9 inside rvm:
rvm install 1.9.2. This package works with Ruby 1.9.2+ (or rubinius in 1.9 mode).
- Create a gemset under Ruby 1.9:
rvm gemset create <name> && rvm use 1.9.2@<name>
- Typical rails (rake will run tests, rails s will run the server etc. etc.).
This is a Rails app built using a RESTful model when it comes to interacting with the app. If you know Rails the following will be a breeze. Otherwise, a little brushing up on REST and Rails will be awesome!
Most of the interaction the user does is with the DataFile resource. That's what they see as the GET as the new form (in which to upload a data file) and POST it back to the server. The DataFile model is not persisted but it does follow a ActiveModel interface. This resource manages the persisted data indirectly (through the persisted, ActiveRecord backed models).
Data modeling and normalization
Looking at the example input and the main requirement (normalize, persist and show the gross revenue represented by the uploaded file) the most natural normalized data models that come to mind are:
Option A) Deals and Purchases
- Deals: merchant name, merchant address, deal description, price
- Purchase: purchaser name, quantity purchased
- Relations: A Deal has many Purchases and a Purchase belongs to a Deal
Option B) Merchants, Deals, Purchasers and Purchases
- Relations: A Merchant has many Deals; A Deal has many Purchases; A Purchase belongs to a Purchaser; A Purchaser has many Purchases.
Comparison & Conclusion
Option A is simpler while B is naturally more normalized (no duplication of merchants and purchasers). Both can answer with ease, apart from the main requirement (calculating the gross), questions such as:
What is LivingSocial's total revenue from this merchant? Option A can be used to find the merchant by name in the Deal table, find all of it's purchases and multiply the quantity by the price. Option B will yield: merchant.deals.map(price) * merchant.deals.purchases.map(quantity)
How many times has this customer bought from LivingSocial? Similar to above, in option A you start by querying the purchases model.
The normalization of option B) can easily fall apart with scale though (and obviously Option A scales even less): think about merchants having different locations and now we need to normalize the merchant table further into merchants and merchant_locations. I am not going to attempt to trivialize the data model LivingSocial probably by trying to come up with the most normalized form here, and am instead going to concentrate on simplicity, retaining just enough complexity to be able to easily solve the given requirements (and a few others that I dreamed up above). Hence, my choice at this point is Option A.
File format validations
Although it's mentioned that I can assume the file's columns to always remain in the same order and that all the fields will be present, for the sake of completeness I added validations to my models. In doing so, I saw that it was easy enough to make the fields order-independent and thus much more flexible for end-users. Technically, it goes beyond the requirements, but I hope that's ok. Take a look at the cucumber features for examples of such validations.
Like any project, this is work in progress. As can be seen in the Pivotal Tracker backlog here are a few things that can be done:
Right now the POST of the file renders the view. This can confuse users if they try to refresh (as it's a form POST the browser shows a warning). We can do a redirect, but to where is an interesting question. We'd probably need either a index page for our deals.
Right now, everything is done in the HTTP request. Meaning the experience will be bad if the file is huge. The first step will be using something like pjax/ajax. Eventually the file should be processed asynchronously: using something like background-job or resque (or perhaps SQS messages that talk to the processing apps that are running separately). But then again, we don't want to prematurely optimize.
Eventually the DataFile model has become kind of big. Perhaps we should create a separate DataFileLine class and have that hold one Deal and one Purchase and the DataFile model holing a collection of DataFileLines. Just an idea.