The code I used to get in the top #150 in the Netflix Prize
C Python
Latest commit 5d8d94d Aug 27, 2012 @alexbw Merge pull request #1 from kencochrane/master
Cleaned up the .pyc files

My Code for the Netflix Prize

I'm not aware of folks having published their code for the Netflix Prize. Here's mine.
Under the team name "Hi!", I competed alone in college. I did it mostly for fun, and to learn modern machine learning techniques. It was an incredibly valuable, but strenuous, time. Well worth it on all fronts, though.
I peaked out at #45 or so, and then dropped out to work on my senior thesis, and came in #145 or so.
What I learned in the process was that smarter wasn't always better -- make an algorithm, and then scale it up, and then make a dozen tweaks to it, and then average all of the results together. That's how you climbed the leaderboard.

Anyhoo, I haven't touched this code in awhile, but perhaps it'll be useful to folks interested in competitive data mining.
Specifically, the lessons I learned:

  • Get the raw data into a saved and manageable format fast. The easier it is to load your data in and start mutating it, the better.
  • If doing simple pivots on your data is hard, and slows you down from visualizing whats in your data, spend time making data structures which make that easy.
  • Generalize. Iterate. If you have a method you think will work, but it has a lot of knobs, and you don't know the best way to set those knobs, make it easy for you to try every possible iteration. There is often not a good way to figure out what the best approach is. You will have to try many of them in order to build up an intuition. Specifically, that means (for me) a pluggable architecture. If there's ten ways to try a particular step, make sure you write your overarching algorithm so that it takes a function that you can pass to it, as opposed to having a method hardwired in the code. That way, you can hotswap all your ideas.
  • Speed is a feature. Of course you make sure it works first. But your goal is to see if something works. If an algorithm takes a day to run, but you can spend five hours making it run in 1/3 of a day, do it. You'll be running it over and over again, and you'll learn more if you can iterate.

As for the technical nitty-gritty, everything that's speed sensitive is written in Cython, which was the best balance of speed and convenience in 2009. If I were to do it al again, I would use (Numba)[].

The original data is gone, I believe, but I might have it stored somewhere. I'll look for that.