# Million Song Dataset
IS622 Final Project  
Aaron Palumbo | December 2015

## About the Data

The <a href="http://labrosa.ee.columbia.edu/millionsong/tasteprofile">data</a> are provided by The Echo Nest.

From the website:

> Welcome to the Taste Profile subset, the official user dataset of the Million Song Dataset.

> The Echo Nest is committed to giving back to the research community (for instance by creating the MSD!), and they prove it again by releasing the Taste Profile dataset. The dataset contains real user - play counts from undisclosed partners, all songs already matched to the MSD. if you were looking for the right collaborative filtering dataset with audio features, this might be for you! Plus, you can link that user data to lyrics, tags and Last.fm's similar songs, thus you have many viewpoint for explaining the data.

The Million Song Dataset Challenge, B. McFee, T. Bertin-Mahieux, D. Ellis and G. Lanckriet, AdMIRe '12

The listening data from The Echo Nest comes as one large text file. Each line contains three tab-separated fields: user, song, and play count.

We can see the file on disk:

In [1]:
%ls -lh ../data/train_triplets.txt

-rw-r--r-- 1 apalumbo apalumbo 2.8G Dec 19  2011 ../data/train_triplets.txt


We can copy this to HDFS with the command line tool:

    hdfs dfs -put {{ fileLoc }} {{ fileHDFS }}

Here I am using the <a href="http://jinja.pocoo.org/Jinja">Jinja2</a> syntax to reference variables.
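
As a rough illustration of how the placeholders in the command above get filled in, the sketch below substitutes `{{ name }}` markers using only the standard library. It is a regex stand-in for this one feature, not the Jinja2 library itself, and the `render` helper is a hypothetical name chosen for the example. The paths mirror the ones used elsewhere in this notebook.

```python
import re
import shlex

def render(template, variables):
    """Replace each {{ name }} placeholder with str(variables[name])."""
    def substitute(match):
        return str(variables[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)

# Assumed values for the two template variables
variables = {
    "fileLoc": "../data/train_triplets.txt",
    "fileHDFS": "/user/apalumbo/final/train_triplets.txt",
}

command = render("hdfs dfs -put {{ fileLoc }} {{ fileHDFS }}", variables)
args = shlex.split(command)  # argument list, e.g. for subprocess.call(args)
print(command)
```

Splitting the rendered string with `shlex.split` gives a list that can be passed to `subprocess.call` without going through a shell.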

In [2]:
# show file in hadoop
import pydoop.hdfs as hdfs
hdfs.lsl("/user/apalumbo/final/train_triplets.txt")

[{'block_size': 134217728,
  'group': 'supergroup',
  'kind': 'file',
  'last_access': 1450645260,
  'last_mod': 1450543445,
  'name': u'hdfs://localhost:9000/user/apalumbo/final/train_triplets.txt',
  'owner': 'apalumbo',
  'path': u'hdfs://localhost:9000/user/apalumbo/final/train_triplets.txt',
  'permissions': 420,
  'replication': 1,
  'size': 3001659271L}]
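
The listing can be cross-checked against the `ls -lh` output with a little arithmetic. The sketch below uses only the values shown above; the variable names are chosen for the example.

```python
import math

# Values copied from the hdfs.lsl output above
size = 3001659271         # bytes
block_size = 134217728    # bytes
permissions = 420         # reported in decimal

size_gib = size / float(2 ** 30)                     # ls -lh reports binary units
block_mib = block_size // 2 ** 20
n_blocks = int(math.ceil(size / float(block_size)))  # last block is partially filled

print("size: %.1f GiB" % size_gib)                   # agrees with the 2.8G from ls -lh
print("block size: %d MiB" % block_mib)
print("blocks: %d" % n_blocks)
print("permissions: %s" % oct(permissions))          # octal 644, i.e. rw-r--r--
```

The decimal `420` is the same mode as the `-rw-r--r--` shown by `ls`, and the file spans 23 HDFS blocks of 128 MiB each.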

## Objective

The data consists of:

* 1,019,318 unique users
* 384,546 unique MSD songs
* 48,373,586 user - song - play count triplets
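
These counts imply a very sparse user-song matrix. The short calculation below uses only the three figures listed above and assumes that each user-song pair appears in at most one triplet.

```python
n_users = 1019318
n_songs = 384546
n_triplets = 48373586

possible_pairs = n_users * n_songs
density = n_triplets / float(possible_pairs)    # fraction of user-song pairs observed
songs_per_user = n_triplets / float(n_users)    # mean number of songs per user
users_per_song = n_triplets / float(n_songs)    # mean number of users per song

print("density: %.4f%%" % (100 * density))
print("songs per user: %.1f" % songs_per_user)
print("users per song: %.1f" % users_per_song)
```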

Our goal is to compare three tools for analyzing this data:

* pandas
* Spark
* **Hadoop**


We will base this comparison on common tasks encountered when working with data of this type, and try to draw some conclusions about the appropriateness of each tool. The first criterion is feasibility. If a task is feasible in all three tools, we then compare complexity and time. Complexity is somewhat subjective, while time is more objective. In our conclusions we will also discuss how each of these methods will scale.

> _Notes_
> * Apache Spark 1.5.1
> * Hadoop 2.7.1, accessed from Python with pydoop 1.1.0
> * pandas 0.17.1
> * We will exercise the tools one at a time and confirm that memory has been released between runs, so that the machine's resources are dedicated to the tool at hand.

## Hadoop

Now let's see how we would do the same tasks with Hadoop.

### Setup

In [None]:
# Clear the namespace
%reset -f

In [None]:
%%bash
cat /proc/meminfo | grep Mem
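
To compare memory before and after each tool runs, it may be more convenient to read the same figures from Python. The sketch below parses `/proc/meminfo`-style text. The `parse_meminfo` helper is a hypothetical name, and the sample text is made up for illustration rather than measured on the project machine.

```python
def parse_meminfo(text):
    """Return {field: kB} for the 'Mem*' lines of /proc/meminfo-style text."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        if name.startswith("Mem") and rest.strip():
            values[name] = int(rest.split()[0])   # first token is the size in kB
    return values

# Illustrative sample only, not measurements from the project machine
sample = """MemTotal:       16000000 kB
MemFree:         4000000 kB
MemAvailable:    9000000 kB
Cached:          3000000 kB
"""

mem = parse_meminfo(sample)
print(mem)
```

On Linux, the real file can be passed in with `parse_meminfo(open("/proc/meminfo").read())`.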

### Dependencies

In [None]:
import pydoop.hdfs as hdfs
import pydoop.mapreduce.api as api
import os
import sys
from pyechonest import song
from subprocess import call

# Paths
fileHDFS = "hdfs:///user/apalumbo/final/train_triplets.txt"
# use for testing
# fileHDFS = "hdfs:///user/apalumbo/final/train_triplets_100.txt"
fileOutput = "hdfs:///user/apalumbo/final/hadoop_output.txt"

In [None]:
hdfs.lsl(fileHDFS)

In [None]:
colnames = ["user", "song", "playCount"]

### Loading Data

In [None]:
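def splitFun(line):
    """Split a tab-separated line into fields, converting numeric fields to int."""
    row = []
    # Drop the trailing newline so it does not end up in the last field
    for field in line.rstrip("\n").split("\t"):
        try:
            row.append(int(field))
        except ValueError:
            # Non-numeric fields (user and song IDs) stay as strings
            row.append(str(field))
    return row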

In [None]:
from itertools import islice

def hadoop_take(file_path, take_lines):
    """Return the first take_lines lines of an HDFS file."""
    with hdfs.open(file_path, "r") as f:
        # islice stops reading as soon as enough lines have been collected
        return list(islice(f, take_lines))

songs_hadoop = hadoop_take(fileHDFS, 10)
[splitFun(x) for x in songs_hadoop]
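
The same take-then-parse pattern can be tried without a Hadoop cluster. The sketch below is a local stand-in, not the pydoop code path: it reads from an in-memory buffer instead of `hdfs.open`, the helper names `split_fields` and `take` are chosen for the example, and the user and song identifiers in the sample lines are made-up placeholders.

```python
import io
from itertools import islice

def split_fields(line):
    """Split a tab-separated line, converting numeric fields to int."""
    row = []
    for field in line.rstrip("\n").split("\t"):
        try:
            row.append(int(field))
        except ValueError:
            row.append(field)
    return row

def take(lines, n):
    """Return the first n items of any iterable of lines."""
    return list(islice(lines, n))

# Placeholder triplets in the user <TAB> song <TAB> play count layout
sample = io.StringIO(
    u"user_a\tsong_x\t1\n"
    u"user_a\tsong_y\t5\n"
    u"user_b\tsong_x\t2\n"
)

rows = [split_fields(line) for line in take(sample, 2)]
print(rows)
```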

In [None]:
%%timeit
songs_hadoop = hadoop_take(fileHDFS, 10)
[splitFun(x) for x in songs_hadoop]

Although this takes more code than the Spark version, reading the first few lines directly from HDFS is faster than it was with Spark.

### Dataset statistics

I spent time trying to use the Python libraries mrjob and pydoop for this step and was unable to get them working. I have not been able to isolate the problem. Instead, I will do the Hadoop part in R.
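
For reference, the kind of job this section calls for can be expressed as a mapper and a reducer. The sketch below simulates both phases locally in plain Python to total play counts per user. It is not the mrjob or pydoop API, nor the R implementation used in this project; the function names and the sample records are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit (user, play_count) for one 'user<TAB>song<TAB>count' line."""
    user, _song, count = line.rstrip("\n").split("\t")
    yield user, int(count)

def reducer(user, counts):
    """Sum the play counts emitted for one user."""
    yield user, sum(counts)

def run_job(lines):
    """Map, shuffle (sort and group by key), then reduce."""
    pairs = [pair for line in lines for pair in mapper(line)]
    pairs.sort(key=itemgetter(0))                 # stands in for the shuffle
    results = []
    for user, group in groupby(pairs, key=itemgetter(0)):
        results.extend(reducer(user, (count for _, count in group)))
    return results

# Placeholder records, not rows from the real dataset
lines = ["user_a\tsong_x\t1", "user_b\tsong_x\t2", "user_a\tsong_y\t5"]
totals = run_job(lines)
print(totals)
```

In a real Hadoop job the sort and grouping are done by the framework between the map and reduce phases; here a local sort followed by `groupby` plays that role.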

## Appendix

In [None]:
# used to connect a console to the notebook
%connect_info