Skip to content


Subversion checkout URL

You can clone with HTTPS or Subversion.

Download ZIP

Comparing changes

Choose two branches to see what's changed or to start a new pull request. If you need to, you can also compare across forks.

Open a pull request

Create a new pull request by comparing changes across two branches. If you need to, you can also compare across forks.
base fork: archie/
base: 9195e15876
head fork: archie/
compare: bb11562052
Checking mergeability… Don't worry, you can still create the pull request.
  • 2 commits
  • 1 file changed
  • 0 commit comments
  • 1 contributor
Showing with 42 additions and 0 deletions.
  1. +42 −0 _posts/2012-04-19-testing_with_real_data.textile
42 _posts/2012-04-19-testing_with_real_data.textile
@@ -0,0 +1,42 @@
+layout: default
+title: Towards real-world testing
+category: notes
+Up until now I've mostly used fictitious examples to test the recommender system that's being developed. This data was generated with random numbers without any correlation whatsoever to real world events. Today, however, that changed.
+One of the advantages of doing a thesis with a social network is that I can test with _real data_. This means true anonymised logs from users interacting with content on the site. In my case that is two months of click data from November and December last year, in total about 2 Gb worth of raw data for videos played.
+In order to use the data there are a number of steps to be taken. As mentioned before the recommender system scales on the premise that we can divide the item factors for 40 million+ videos into clusters. Here is the bird's eye view of the process:
+# count the number of plays per user-video pair (python)
+# run matrix factorisation and store the intermediate item and user vectors separately (java)
+# cluster the item vectors produced in step 2 - currently using @scipy.cluster.vq.kmeans2@ to do this (python)
+# load the data into the online recommendation system and start serving recommendations (scala)
+The data comes in the following format: @... userid videoid ...@ (the dots indicate play related data) and there's one line per play. A simple python script does the trick.
+<script src=""> </script>
+Since I need to verify the accuracy of the recommendation I create two sets: a training set used to build the model, and a test set which will be used to verify the results. The sets are the end-result of this process.
+*Model generation*
+The implementation of the matrix factorisation algorithm is beyond the scope of my thesis (except for an understanding of how it works). I'm lucky to be able to use an existing implementation of "Koren's algorithm for implicit feedback": provided by Linas and his team at Telefonica. It is part of a larger suite of recommendation algorithms which they plan to open source, and I hope to be able to assist them in that process. For now, it's enough to say that the result of this step consists of two files: one with all items' latent factors and one with all users' latent factors.
+I'm really focusing on scalability more than the accuracy of the algorithms used. This is both an advantage and a drawback of my work. The advantage is I can be less picky about the model and how the data is supplied to the online recommendation system. Clustering is thus something I do not have to implement myself and, browsing a little for easy-to-use clustering implementations I finally settled on "scipy's k-means": The "k-means algorithm": is one of the easiest and works by grouping items into K clusters such that the item joins the cluster with the nearest mean of the items already in the cluster (see the Wikipedia article for a full explanation of the algorithm). Clustering with scipy is a breeze and the gist of the code is only four lines:
+<script src=""> </script>
+The output is one file per cluster with the respective item vectors and a signature file with the centroids. The latter is used to initialise the last online recommendation system.
+The recommender system reads the centroid file and spawns one process per centroid/cluster distributed across a set of nodes[1]. Each process loads its cluster into memory and registers itself as ready to accept recommendation requests. "Routers" on every node routes requests based on cosine similarity between the signature and the user's request. The request contains the user's latent factors and possibly contextual information such as the last video played. Once the best cluster is determined the request is forwarded to the process responsible for that cluster which in turn computes the recommendations. Finally, the top K recommendations are returned to the user.
+Easy peasy.
+There are still some hurdles to tackle before I can test the system fully and say that it _really_ works. But today's progress is a good step in the direction of ensuring some form of qualitative results from my thesis.
+fn1. There are still some details missing before it is fully distributed. It's coming.

No commit comments for this range

Something went wrong with that request. Please try again.