# Million Song Dataset
IS622 Final Project  
Aaron Palumbo | December 2015

## About the Data

The <a href="http://labrosa.ee.columbia.edu/millionsong/tasteprofile">data</a> are provided by The Echo Nest.

From the website:

> Welcome to the Taste Profile subset, the official user dataset of the Million Song Dataset.

> The Echo Nest is committed to giving back to the research community (for instance by creating the MSD!), and they prove it again by releasing the Taste Profile dataset. The dataset contains real user - play counts from undisclosed partners, all songs already matched to the MSD. if you were looking for the right collaborative filtering dataset with audio features, this might be for you! Plus, you can link that user data to lyrics, tags and Last.fm's similar songs, thus you have many viewpoint for explaining the data.

The Million Song Dataset Challenge, B. McFee, T. Bertin-Mahieux, D. Ellis and G. Lanckriet, AdMIRe '12

The listening data from The Echo Nest come as one big text file. Each line contains three tab-separated fields: user, song, and play count.
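
As a minimal sketch of that layout, the snippet below parses a few rows with the standard library. The rows are hypothetical placeholders, not values from the dataset, and the helper name `read_triplets` is an assumption made for illustration. It is written for Python 3, whereas the notebook cells themselves run on Python 2.

```python
import csv
import io

# Hypothetical sample rows in the user<TAB>song<TAB>play-count layout.
# The IDs are made up for illustration and do not come from the dataset.
sample = io.StringIO(
    "userA\tSONG0001\t1\n"
    "userA\tSONG0002\t2\n"
    "userB\tSONG0001\t5\n"
)


def read_triplets(handle):
    """Yield (user, song, play_count) tuples from a tab-separated stream."""
    for user, song, count in csv.reader(handle, delimiter="\t"):
        yield user, song, int(count)


triplets = list(read_triplets(sample))
print(triplets)
```

Because `read_triplets` is a generator, the same function could be pointed at an open file handle and consumed one row at a time without holding the whole file in memory.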

We can see the file on disk:

In [1]:
%ls -lh ../data/train_triplets.txt

-rw-r--r-- 1 apalumbo apalumbo 2.8G Dec 19  2011 ../data/train_triplets.txt


We can copy this to HDFS with the command line tool:

    hdfs dfs -put {{ fileLoc }} {{ fileHDFS }}

Here I am using <a href="http://jinja.pocoo.org/Jinja">Jinja2</a> syntax to reference variables: `fileLoc` is the local path and `fileHDFS` is the destination path in HDFS.

In [2]:
# show file in hadoop
import pydoop.hdfs as hdfs
hdfs.lsl("/user/apalumbo/final/train_triplets.txt")

[{'block_size': 134217728,
  'group': 'supergroup',
  'kind': 'file',
  'last_access': 1450648897,
  'last_mod': 1450543445,
  'name': u'hdfs://localhost:9000/user/apalumbo/final/train_triplets.txt',
  'owner': 'apalumbo',
  'path': u'hdfs://localhost:9000/user/apalumbo/final/train_triplets.txt',
  'permissions': 420,
  'replication': 1,
  'size': 3001659271L}]
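
A few of these fields are easier to read after a little arithmetic. The sketch below uses only the values printed in the listing above; the variable names are chosen for illustration, and the block count assumes the file is split into fixed-size blocks of `block_size` bytes, with the last block partially filled.

```python
import math

# Values copied from the hdfs.lsl() listing.
size_bytes = 3001659271
block_size = 134217728
permissions = 420

# Permissions are reported as a decimal integer; octal is the familiar form.
# 420 decimal is 644 octal, which matches -rw-r--r-- in the ls output.
perm_octal = oct(permissions)
print(perm_octal)

# File size in GiB, the unit that `ls -lh` abbreviates as "G".
size_gib = size_bytes / 2.0 ** 30
print(round(size_gib, 1))

# Block size in MiB and the number of blocks needed to hold the file.
block_mib = block_size // 2 ** 20
n_blocks = int(math.ceil(size_bytes / float(block_size)))
print(block_mib, n_blocks)
```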

## Objective

The data consist of:

* 1,019,318 unique users
* 384,546 unique MSD songs
* 48,373,586 user - song - play count triplets

Our goal is to compare three tools for analyzing this data:

* pandas
* Spark
* **Hadoop**


We will base this comparison on common tasks encountered when working with data of this type, and try to draw some conclusions about the appropriateness of each tool. The first criterion is feasibility: can the tool complete the task at all? Assuming the task is feasible in all three tools, we then compare complexity and time. Complexity is somewhat subjective, while time is more objective. In our conclusions we will also discuss how each of these methods scales.
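
The notebook cells rely on the IPython `%%time` and `%%timeit` magics for the time measurements. Outside a notebook, a comparable wall-clock measurement could be taken with a small helper like the one sketched below. The name `time_task` and the stand-in task are assumptions for illustration, not part of the project code, and the sketch targets Python 3.

```python
import time


def time_task(label, func, *args, **kwargs):
    """Run func once and return (label, result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return label, result, elapsed


# Hypothetical stand-in task; a real run would load or aggregate the dataset.
def sum_of_squares(n):
    return sum(i * i for i in range(n))


label, result, elapsed = time_task("sum of squares", sum_of_squares, 1000)
print("{}: {:.6f} s".format(label, elapsed))
```

A single run like this is noisy, so repeating the task and reporting the best or median time gives a fairer comparison between tools.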

> _Notes_
> * we will be using Apache Spark 1.5.1
> * Hadoop 2.7.1 accessed from Python with pydoop 1.1.0
> * pandas 0.17.1
> * we will exercise the tools sequentially and confirm that memory has been released between runs, to ensure the resources of the machine are dedicated to the tool at hand.
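
One way to check that memory has been released between runs is sketched below, under some loud assumptions: it uses `tracemalloc`, which exists only on Python 3.4+ (the notebook itself runs Python 2), and it tracks only allocations made through Python's allocator, not the resident size of the whole process. The list of strings is a hypothetical stand-in for a loaded dataset.

```python
import gc
import tracemalloc

tracemalloc.start()

# Hypothetical stand-in for a large in-memory dataset.
data = [str(i) for i in range(200000)]
held, _ = tracemalloc.get_traced_memory()

# Drop the only reference and force a garbage collection pass.
del data
gc.collect()
released, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("traced while held:    {} bytes".format(held))
print("traced after release: {} bytes".format(released))
```

For a process-level view, an operating system tool such as `top` or `free` would be the more direct check.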

## Pandas

In [37]:
import pandas as pd
import IPython.display as dis
from pyechonest import song
from pyechonest import config
import json

In [4]:
colnames = ["user", "song", "playCount"]

fileLoc  = "file:///home/apalumbo/is622/final_project/data/train_triplets.txt"

### Loading the Data

In [5]:
%%time
songs_pandas = pd.read_csv(fileLoc, sep="\t", header=None, names=colnames)

CPU times: user 50 s, sys: 2.25 s, total: 52.2 s
Wall time: 52.5 s


In [6]:
%%timeit
songs_pandas.head(10)

10000 loops, best of 3: 183 µs per loop


### Data Statistics

In [16]:
%%time
# Size of data
print len(songs_pandas)
# unique users
print len(songs_pandas.user.unique())
# unique songs
print len(songs_pandas.song.unique())

48373586
1019318
384546
CPU times: user 12.9 s, sys: 244 ms, total: 13.2 s
Wall time: 13.2 s
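
For comparison, the same three statistics can be computed in a single streaming pass with plain Python sets. This is a sketch of what the pandas calls above compute, not the method used in this notebook, and the rows are hypothetical placeholders.

```python
# Hypothetical (user, song, play_count) rows; the IDs are made up.
rows = [
    ("userA", "SONG0001", 1),
    ("userA", "SONG0002", 2),
    ("userB", "SONG0001", 5),
]

users = set()
songs = set()
n_rows = 0

# One pass over the data: count rows and collect the distinct IDs.
for user, song, _count in rows:
    n_rows += 1
    users.add(user)
    songs.add(song)

print(n_rows)
print(len(users))
print(len(songs))
```

This approach needs memory proportional to the number of distinct users and songs rather than the number of rows.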


In [35]:
%%time
playcounts = songs_pandas.groupby("song").sum()
topSongs = playcounts.sort_values('playCount', ascending=False).head(10)
dis.display(topSongs)

| song | playCount |
|---|---|
| SOBONKR12A58A7A7E0 | 726885 |
| SOAUWYT12A81C206F1 | 648239 |
| SOSXLTC12AF72A7F54 | 527893 |
| SOFRQTD12A81C233C0 | 425463 |
| SOEGIYH12A6D4FC0E3 | 389880 |
| SOAXGDH12A8C13F8A1 | 356533 |
| SONYKOW12AB01849C9 | 292642 |
| SOPUCYA12A8C13A694 | 274627 |
| SOUFTBI12AB0183F65 | 268353 |
| SOVDSJC12A58A7A271 | 244730 |


CPU times: user 16.8 s, sys: 668 ms, total: 17.5 s
Wall time: 17.5 s
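
The group-by-and-sort step above sums play counts per song and keeps the largest totals. A dependency-free sketch of the same aggregation with `collections.Counter` follows. The rows are hypothetical placeholders, and this is an illustration of the computation rather than the code that was timed here.

```python
from collections import Counter

# Hypothetical (user, song, play_count) rows; the IDs are made up.
rows = [
    ("userA", "SONG0001", 1),
    ("userA", "SONG0002", 2),
    ("userB", "SONG0001", 5),
    ("userB", "SONG0003", 4),
]

# Sum play counts per song, the equivalent of groupby("song").sum().
play_counts = Counter()
for _user, song, count in rows:
    play_counts[song] += count

# Keep the songs with the largest totals, highest first.
top_songs = play_counts.most_common(2)
print(top_songs)
```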


In [45]:
# Use the Echo Nest API to look up the artist and title of each top song
with open("../echonest_info.json", "rb") as f:
    echonestAPI = json.load(f)
config.ECHO_NEST_API_KEY = echonestAPI['api_key']

for i in topSongs.index:
    try:
        s = song.Song(i)
        # encode as UTF-8 so names with accented characters print correctly
        artist = s.artist_name.encode("utf-8")
        title = s.title.encode("utf-8")
        print "artist: {}, title: {}".format(artist, title)
    except IndexError:
        # the lookup returned no result for this song ID
        print "{} not found".format(i)

SOBONKR12A58A7A7E0 not found
artist: Björk, title: Undo (Live - Vespertine World Tour 2001)
artist: Kings of Leon, title: Revelry
SOFRQTD12A81C233C0 not found
artist: Barry Tuckwell, title: Horn Concerto No. 4 in E Flat, K.495: II. Romance (Andante cantabile)
SOAXGDH12A8C13F8A1 not found
SONYKOW12AB01849C9 not found
artist: Five Iron Frenzy, title: Canada
artist: Tub Ring, title: Invalid
SOVDSJC12A58A7A271 not found


## Appendix

In [46]:
# used to connect a console to the notebook
%connect_info

{
  "stdin_port": 33324, 
  "ip": "127.0.0.1", 
  "control_port": 51460, 
  "hb_port": 44358, 
  "signature_scheme": "hmac-sha256", 
  "key": "211f60b4-f7cb-4402-895c-9e6837da0a9f", 
  "shell_port": 48229, 
  "transport": "tcp", 
  "iopub_port": 59232
}

Paste the above JSON into a file, and connect with:
    $> ipython <app> --existing <file>
or, if you are local, you can connect with just:
    $> ipython <app> --existing kernel-3e8d1bf4-6967-4a71-beab-d6c0a4b0567c.json 
or even just:
    $> ipython <app> --existing 
if this is the most recent IPython session you have started.
