# Million Song Dataset
IS622 Final Project  
Aaron Palumbo | December 2015

## About the Data

The <a href="http://labrosa.ee.columbia.edu/millionsong/tasteprofile">data</a> are provided by The Echo Nest.

From the website:

> Welcome to the Taste Profile subset, the official user dataset of the Million Song Dataset.

> The Echo Nest is committed to giving back to the research community (for instance by creating the MSD!), and they prove it again by releasing the Taste Profile dataset. The dataset contains real user - play counts from undisclosed partners, all songs already matched to the MSD. if you were looking for the right collaborative filtering dataset with audio features, this might be for you! Plus, you can link that user data to lyrics, tags and Last.fm's similar songs, thus you have many viewpoint for explaining the data.

The Million Song Dataset Challenge, B. McFee, T. Bertin-Mahieux, D. Ellis and G. Lanckriet, AdMIRe '12

The listening data from The Echo Nest comes as one big text file. Each line contains three tab-separated fields: user, song, play count.
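
Because the file is large, it can help to peek at a few records without loading everything. The following is a minimal sketch, assuming tab-separated fields (as the split function later in this notebook does); `read_triplets` is a hypothetical helper and the in-memory sample stands in for the real file.

```python
import io
from itertools import islice


def read_triplets(handle, limit=5):
    """Yield up to `limit` (user, song, play_count) tuples from an open text file."""
    for line in islice(handle, limit):
        user, song, play_count = line.rstrip("\n").split("\t")
        yield user, song, int(play_count)


# Two sample lines in the same layout as train_triplets.txt.
sample = io.StringIO(
    u"b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1\n"
    u"b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2\n"
)
triplets = list(read_triplets(sample))
```

On the real data, the same function could be given `open("../data/train_triplets.txt")` in place of the sample; only the first `limit` lines are read.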

We can see the file on disk:

In [1]:
%ls -lh ../data/train_triplets.txt

-rw-r--r-- 1 apalumbo apalumbo 2.8G Dec 19  2011 ../data/train_triplets.txt


We can copy this to HDFS with the command line tool:

    hdfs dfs -put {{ fileLoc }} {{ fileHDFS }}

Here I am using the <a href="http://jinja.pocoo.org/Jinja">Jinja2</a> syntax to reference variables.
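
As an illustration of what the template expands to, here is a small sketch that fills in the two variables from Python. The paths are the ones shown elsewhere in this notebook; `hdfs_put_command` is a hypothetical helper, and the sketch only builds the command rather than running it.

```python
# Paths taken from this notebook; adjust them for your own setup.
fileLoc = "../data/train_triplets.txt"
fileHDFS = "/user/apalumbo/final/train_triplets.txt"


def hdfs_put_command(local_path, hdfs_path):
    """Return the argument list for `hdfs dfs -put` without executing it."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_path]


cmd = hdfs_put_command(fileLoc, fileHDFS)
print(" ".join(cmd))

# To actually run the copy (requires a working Hadoop installation):
# import subprocess
# subprocess.check_call(cmd)
```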

In [2]:
# show file in hadoop
import pydoop.hdfs as hdfs
hdfs.lsl("/user/apalumbo/final/train_triplets.txt")

[{'block_size': 134217728,
  'group': 'supergroup',
  'kind': 'file',
  'last_access': 1450645260,
  'last_mod': 1450543445,
  'name': u'hdfs://localhost:9000/user/apalumbo/final/train_triplets.txt',
  'owner': 'apalumbo',
  'path': u'hdfs://localhost:9000/user/apalumbo/final/train_triplets.txt',
  'permissions': 420,
  'replication': 1,
  'size': 3001659271L}]
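
The numbers in this listing can be cross-checked against the `ls` output above. A short sketch of the arithmetic, using only the values reported by `hdfs.lsl`:

```python
size_bytes = 3001659271   # 'size' from the listing
block_size = 134217728    # 'block_size' from the listing (128 MiB)
permissions = 420         # 'permissions' from the listing (decimal)

# Size in GiB, the unit `ls -lh` rounds to.
size_gib = size_bytes / float(2 ** 30)

# Number of HDFS blocks needed to hold the file (ceiling division).
n_blocks = -(-size_bytes // block_size)

# The permission bits are easier to read in octal.
mode_octal = format(permissions, "o")

print("{:.1f} GiB, {} blocks, mode {}".format(size_gib, n_blocks, mode_octal))
```

The size rounds to the 2.8G that `ls -lh` reports, and mode 644 corresponds to the `-rw-r--r--` permissions on the local file.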

## Objective

The data consists of:

* 1,019,318 unique users
* 384,546 unique MSD songs
* 48,373,586 user - song - play count triplets
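
These three counts already say something about the shape of the data. The sketch below derives a few summary ratios, assuming each (user, song) pair appears in at most one triplet; the variable names are illustrative.

```python
n_users = 1019318
n_songs = 384546
n_triplets = 48373586

# Average number of distinct songs per user and users per song.
songs_per_user = n_triplets / float(n_users)
users_per_song = n_triplets / float(n_songs)

# Fraction of the full user x song matrix that is actually observed.
density = n_triplets / (float(n_users) * n_songs)

print("songs per user: {:.1f}".format(songs_per_user))
print("users per song: {:.1f}".format(users_per_song))
print("matrix density: {:.6f}".format(density))
```

A density this low means the user-song matrix is very sparse, which is typical for collaborative filtering data.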

Our goal is to compare three tools for analyzing this data:

* pandas
* Spark
* Hadoop


We will base this comparison on typical tasks encountered while working with data of this type and try to draw some conclusions about the appropriateness of each tool. The first criterion is feasibility. Where a task is feasible in all three tools, we then compare complexity and time. Complexity is somewhat subjective, while time is more objective. In our conclusions we will also discuss how each of these methods scales.

> _Notes_
> * We will be using Apache Spark 1.5.1.
> * Hadoop 2.7.1 is accessed from Python with pydoop 1.1.0.
> * pandas 0.17.1
> * We will exercise the tools sequentially, confirming that memory has been released between runs, so that the machine's resources are dedicated to the tool at hand.

## Spark

### Setup

In [3]:
# Reset the namespace
%reset -f

In [4]:
%%bash
cat /proc/meminfo | grep Mem

MemTotal:       14361144 kB
MemFree:         6778668 kB
MemAvailable:   12892156 kB
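
The same check can be done from Python, which makes it easier to compare memory before and after each tool runs. This is a minimal sketch; `parse_meminfo` is a hypothetical helper, and the sample text is the output shown above rather than a live reading.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo field name to its numeric value (kB where a unit is given)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[name.strip()] = int(parts[0])
    return info


sample = """MemTotal:       14361144 kB
MemFree:         6778668 kB
MemAvailable:   12892156 kB"""

mem = parse_meminfo(sample)
available_fraction = mem["MemAvailable"] / float(mem["MemTotal"])
print("available: {:.0%}".format(available_fraction))
```

On a Linux machine, the live values could be read with `parse_meminfo(open("/proc/meminfo").read())`.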


In [5]:
import os
import sys
from pyechonest import song

In [6]:
spark_home = "/home/apalumbo/workspace/cuny_msda_is622/spark-1.5.1-bin-hadoop2.6/"

# Path for Spark source folder
os.environ['SPARK_HOME'] = spark_home

# Append pyspark to Python Path
sys.path.append(spark_home + "python/")

# Append py4j to Python Path
sys.path.append(spark_home + "python/lib/py4j-0.8.2.1-src.zip")

# Launch Spark
execfile(spark_home + "python/pyspark/shell.py")

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.11 (default, Dec  6 2015 18:08:32)
SparkContext available as sc, HiveContext available as sqlContext.


### Dependencies

In [7]:
# Libraries 
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *

import IPython.display as dis
from pyechonest import config
import json

sqlCtx = SQLContext(sc)

# Paths
fileHDFS = "hdfs:///user/apalumbo/final/train_triplets.txt"
# use for testing
# fileHDFS = "hdfs:///user/apalumbo/final/train_triplets_100.txt"

### Loading Data

Using Spark to load the data and look at the first few records is fast and easy.

First we need a split function:

In [8]:
def splitFun(line):
    row = []
    for field in line.split("\t"):
        try:
            row.append(int(field))
        except ValueError:
            row.append(str(field))
    return row
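
To see what this function returns, here is a small self-contained check on sample lines. The function is repeated so the block runs on its own; the all-digit identifier in the second call is hypothetical and only illustrates a caveat.

```python
def splitFun(line):
    row = []
    for field in line.split("\t"):
        try:
            row.append(int(field))
        except ValueError:
            row.append(str(field))
    return row


# A typical line: user and song stay strings, the play count becomes an int.
typical = splitFun("b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2")

# Caveat: any field that int() accepts is converted, so an all-digit
# identifier (hypothetical here) would also come back as an int.
numeric_id = splitFun("12345\tSOBBMDR12A8C13253B\t2")
```

If every column must keep a fixed type, converting only the third field would be a safer variant.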

In [9]:
%%time
songs_spark_ref = sc.textFile(fileHDFS)

songs_spark = songs_spark_ref.map(lambda line: splitFun(line))

dis.display(songs_spark.take(10))

[['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOAKIMP12A8C130995', 1],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOAPDEY12A81C210A9', 1],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBBMDR12A8C13253B', 2],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBFNSP12AF72A0E22', 1],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBFOVM12A58A7D494', 1],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBNZDC12A6D4FC103', 1],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBSUJE12A6D4F8CF5', 2],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBVFZR12A6D4F8AE3', 1],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBXALG12A8C13C108', 1],
 ['b80344d063b5ccb3212f76538f3d9e43d87dca9e', 'SOBXHDL12A81C204C0', 1]]

CPU times: user 20 ms, sys: 0 ns, total: 20 ms
Wall time: 8.03 s


We can use the %%timeit magic to measure how fast this operation is.

In [10]:
%%timeit
songs_spark_ref.map(lambda line: splitFun(line)).take(10)

10 loops, best of 3: 54 ms per loop


### Dataset statistics

The first thing we would like to do is to determine some basic information about the data. We can start with the overall size.

The SQLContext provides a nice tool for this: the Spark DataFrame.

We create a schema and convert the RDD to a DataFrame:

In [11]:
schema = StructType([StructField("user", StringType()), 
                     StructField("song", StringType()), 
                     StructField("playCount", IntegerType())])

# Convert the RDD to a spark DataFrame
sdf = songs_spark.toDF(schema)
sdf.show(10)

+--------------------+------------------+---------+
|                user|              song|playCount|
+--------------------+------------------+---------+
|b80344d063b5ccb32...|SOAKIMP12A8C130995|        1|
|b80344d063b5ccb32...|SOAPDEY12A81C210A9|        1|
|b80344d063b5ccb32...|SOBBMDR12A8C13253B|        2|
|b80344d063b5ccb32...|SOBFNSP12AF72A0E22|        1|
|b80344d063b5ccb32...|SOBFOVM12A58A7D494|        1|
|b80344d063b5ccb32...|SOBNZDC12A6D4FC103|        1|
|b80344d063b5ccb32...|SOBSUJE12A6D4F8CF5|        2|
|b80344d063b5ccb32...|SOBVFZR12A6D4F8AE3|        1|
|b80344d063b5ccb32...|SOBXALG12A8C13C108|        1|
|b80344d063b5ccb32...|SOBXHDL12A81C204C0|        1|
+--------------------+------------------+---------+
only showing top 10 rows



In [12]:
%%time
print "Num Rows: {}".format(sdf.select("user").count())

Num Rows: 48373586
CPU times: user 296 ms, sys: 156 ms, total: 452 ms
Wall time: 11min 49s


In [13]:
%%time
print "Num Unique Users: {}".format(sdf.select("user").distinct().count())

Num Unique Users: 1019318
CPU times: user 308 ms, sys: 160 ms, total: 468 ms
Wall time: 11min 38s


In [14]:
%%time
print "Num Unique Songs: {}".format(sdf.select("song").distinct().count())

Num Unique Songs: 384546
CPU times: user 240 ms, sys: 228 ms, total: 468 ms
Wall time: 11min 40s


The overall size, number of unique users, and number of unique songs are shown above.
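
Each of the three queries takes roughly the same time, which is consistent with every query rescanning the input. For comparison, the sketch below computes all three statistics in a single pass over a tiny in-memory sample. It is plain Python rather than Spark, and the user ids and the `dataset_stats` helper are made up for illustration.

```python
def dataset_stats(rows):
    """Return (row count, unique users, unique songs) in one pass over the rows."""
    users, songs, n_rows = set(), set(), 0
    for user, song, _ in rows:
        users.add(user)
        songs.add(song)
        n_rows += 1
    return n_rows, len(users), len(songs)


# Hypothetical sample triplets: (user, song, play count).
rows = [
    ("u1", "SOAKIMP12A8C130995", 1),
    ("u1", "SOBBMDR12A8C13253B", 2),
    ("u2", "SOAKIMP12A8C130995", 5),
]

num_rows, num_users, num_songs = dataset_stats(rows)
```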

### Most Popular Songs

One of our objectives with this data set is to build a recommendation engine. One simple approach is to recommend the most popular songs. Let's see how to do this in Spark.

We can group by song and sum the play counts to get a measure of the most popular songs (this takes about 11.5 minutes):

In [15]:
%%time
groupedSongs = sdf.groupBy('song')
songsByPlayCount = groupedSongs.sum('playCount') \
        .sort('sum(playCount)', ascending=False)
topSongs = songsByPlayCount.take(10)
dis.display([list(line) for line in topSongs])

[[u'SOBONKR12A58A7A7E0', 726885],
 [u'SOAUWYT12A81C206F1', 648239],
 [u'SOSXLTC12AF72A7F54', 527893],
 [u'SOFRQTD12A81C233C0', 425463],
 [u'SOEGIYH12A6D4FC0E3', 389880],
 [u'SOAXGDH12A8C13F8A1', 356533],
 [u'SONYKOW12AB01849C9', 292642],
 [u'SOPUCYA12A8C13A694', 274627],
 [u'SOUFTBI12AB0183F65', 268353],
 [u'SOVDSJC12A58A7A271', 244730]]

CPU times: user 228 ms, sys: 224 ms, total: 452 ms
Wall time: 11min 39s
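
The group, sum, and sort steps can be illustrated on a small scale with the standard library. This is only a sketch of the same logic in plain Python, not the Spark implementation; the song ids are taken from the output above, while the users and play counts are made up.

```python
from collections import Counter

# Hypothetical sample triplets: (user, song, play count).
rows = [
    ("u1", "SOBONKR12A58A7A7E0", 3),
    ("u2", "SOBONKR12A58A7A7E0", 4),
    ("u1", "SOAUWYT12A81C206F1", 5),
    ("u3", "SOSXLTC12AF72A7F54", 1),
]

# Group by song and sum the play counts.
play_totals = Counter()
for _, song, play_count in rows:
    play_totals[song] += play_count

# Sort by total plays, descending, and keep the top entries.
top_two = play_totals.most_common(2)
```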


In [50]:
# Use echonest API to look up user/song information
echonestAPI = json.load(open("../echonest_info.json", "rb"))
config.ECHO_NEST_API_KEY = echonestAPI['api_key']

for i in topSongs:
    try:
        s = song.Song(i.song)
        # Encode as UTF-8 so non-ASCII names display correctly in the notebook
        artist = s.artist_name.encode("utf-8")
        title = s.title.encode("utf-8")
        print "artist: {}, title: {}".format(artist, title)
    except IndexError:
        print "{} not found".format(i)

Row(song=u'SOBONKR12A58A7A7E0', sum(playCount)=726885) not found
artist: Björk, title: Undo (Live - Vespertine World Tour 2001)
artist: Kings of Leon, title: Revelry
Row(song=u'SOFRQTD12A81C233C0', sum(playCount)=425463) not found
artist: Barry Tuckwell, title: Horn Concerto No. 4 in E Flat, K.495: II. Romance (Andante cantabile)
Row(song=u'SOAXGDH12A8C13F8A1', sum(playCount)=356533) not found
Row(song=u'SONYKOW12AB01849C9', sum(playCount)=292642) not found
artist: Five Iron Frenzy, title: Canada
artist: Tub Ring, title: Invalid
Row(song=u'SOVDSJC12A58A7A271', sum(playCount)=244730) not found


