# Graph Model of the Million Song Dataset

*Andrea Soto*  
*MIDS W205 Final Project*

---

# Notebook Overview

This notebook processes the Million Song Dataset Subset, which contains a sample of 10,000 songs from the main dataset. **The goal is to become familiar with the data and to develop and test the code on the smaller dataset.** The code developed here was then compiled into scripts that were used to process the full dataset.

Two sources of data were used:

1. **Million Song Dataset (MSD):** contains track, song, artist, and album metadata, as well as artist similarity and artist tags. This was the main data source for this project. The data is stored in HDF5 format, with one file per song.
A detailed description of the MSD project can be found [here](http://labrosa.ee.columbia.edu/millionsong/) and the field list can be found [here](http://labrosa.ee.columbia.edu/millionsong/faq).
2. **Last.fm Dataset:** contains information about song similarity and song tags. This information was used to complement the MSD. The data is stored in JSON format, with one file per song. A detailed description of the Last.fm data can be found [here](http://labrosa.ee.columbia.edu/millionsong/lastfm).

For reference, a small sample of the tree structure of each dataset is shown below.

In [None]:
data/MillionSongSubset/data/
|-- A
|   |-- A
|   |   |-- A
|   |   |   |-- TRAAAAW128F429D538.h5
|   |   |   |-- TRAAABD128F429CF47.h5
|   |   |   |-- TRAAADZ128F9348C2E.h5
|   |   |   |-- ...
|   |   |-- B
|   |   |   |-- TRAABCL128F4286650.h5
|   |   |   |-- TRAABDL12903CAABBA.h5
|   |   |   |-- TRAABJL12903CDCF1A.h5
|   |   |   |-- ...
|   |   |-- C
|   |   |   |-- TRAACCG128F92E8A55.h5
|   |   |   |-- TRAACER128F4290F96.h5
|   |   |   |-- TRAACFV128F935E50B.h5
|   |   |   |-- ...
|   |   |-- D
|   |   |-- ...
|   |   |-- X
|   |   |-- Y
|   |   `-- Z
|   |-- B
|   |   |-- A
|   |   |-- ...
|   |   `-- Z
|   |-- C
|   |-- ...
|   |-- Z
`-- B
    |-- A
    |   |-- A
    |   |-- B
    |   |-- ...
    |   |-- Z
    |-- B
    |-- ...
    |-- Z

In [None]:
data/lastfm_subset/
|-- A
|   |-- A
|   |   |-- A
|   |   |   |-- TRAAAAW128F429D538.json
|   |   |   |-- TRAAABD128F429CF47.json
|   |   |   |-- TRAAADZ128F9348C2E.json
|   |   |   |-- ...
|   |   |-- B
|   |   |   |-- TRAABDL12903CAABBA.json
|   |   |   |-- TRAABJL12903CDCF1A.json
|   |   |   |-- TRAABJV128F1460C49.json
|   |   |   |-- ...
|   |   |-- C
|   |   |   |-- TRAACCG128F92E8A55.json
|   |   |   |-- TRAACER128F4290F96.json
|   |   |   |-- TRAACFV128F935E50B.json
|   |   |   |-- ...
|   |   |-- D
|   |   |-- ...
|   |   |-- X
|   |   |-- Y
|   |   `-- Z
|   |-- B
|   |   |-- A
|   |   |-- ...
|   |   `-- Z
|   |-- C
|   |-- ...
|   |-- Z
`-- B
    |-- A
    |   |-- A
    |   |-- B
    |   |-- ...
    |   |-- Z
    |-- B
    |-- ...
    |-- Z
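
Both trees follow the same layout: the three directory levels appear to be the third, fourth, and fifth characters of the track ID (for example, `TRAAAAW128F429D538` sits under `A/A/A`). A minimal sketch of that mapping, inferred only from the listings above; the helper name `track_path` is hypothetical and not part of either dataset's tooling:

```python
def track_path(root, track_id, ext):
    """Build the expected path of a track file under `root`.

    Assumes the directory levels are characters 3-5 of the track ID,
    as inferred from the sample listings (an assumption, not a spec).
    """
    levels = [track_id[2], track_id[3], track_id[4]]
    return '/'.join([root] + levels + [track_id + ext])
```

For instance, `track_path('data/lastfm_subset', 'TRAAAAW128F429D538', '.json')` gives the first JSON file shown in the Last.fm listing.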

---
# Workflow

The Million Song Dataset is stored in HDF5 files. The data was transformed into CSV files containing the node and relationship structure for Neo4j. These CSV files were created to leverage the `LOAD CSV` functionality of Neo4j, which makes loading large graphs into Neo4j faster and more scalable.

The steps followed in this notebook were:

1. Create a list of the song files with the full path to each file
2. In Spark, read each file in the list and extract the information
3. In Spark, transform the extracted data to create CSV files for each node and relationship type. Separate transformations were required for each node and relationship type
4. Save the transformed data using Spark's `saveAsTextFile()` operation
5. Since the `saveAsTextFile()` operation generates several files named 'part-000xx', the data was merged into a single .csv file and headers were added to ease readability. This was done with bash commands

The MSD team found some matching errors between tracks and songs in the data. They created a list of (song id, track id) pairs that are not trusted, and they suggest removing these pairs from the data. These mismatches were removed from the data as part of the transformation process.

For more details see:
- http://labrosa.ee.columbia.edu/millionsong/blog/12-1-2-matching-errors-taste-profile-and-msd
- http://labrosa.ee.columbia.edu/millionsong/blog/12-2-12-fixing-matching-errors


---

#### LINK TO DOWNLOAD NEO4J

http://neo4j.com/artifact.php?name=neo4j-community-2.3.1-unix.tar.gz
 
#### AWS SAMPLE CODE

https://alestic.com/2013/11/aws-cli-query/

# Attempt to automate configuration

In [None]:
#'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# W205 Final Project: Million Song Dataset (MSD)

# Requirements: W205 AMI with Hadoop and Spark
#               aws cli installed and configured ()
# This configuration script is run from within the EC2 instance.
# It assumes that the instance DOES NOT have any volume attached and that the mount
# point /data is available
 
# Python Libraries: py2neo,
#'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

# AS ROOT
# === Installations ===
sudo yum install jq
pip install awscli
 
# === Install ec2-metadata tool to get information about this instance ===
wget http://s3.amazonaws.com/ec2metadata/ec2-metadata
chmod a+x ec2-metadata
mv ec2-metadata /usr/bin
 
# ============================================================================================================================
#'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# AWS Setup - Attach 2 volumes to this instance
# Main Volume: project main working volume, mounted at /data
# MSD Volume: holds the Million Song Dataset (MSD), mounted at /msong_data
#'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

# === Save instance info in environment variables ===
# Get instance id
INSTANCE_ID=$(ec2-metadata -i | cut -d:  -f2| cut -d' ' -f2)
# Get instance public hostname
INSTANCE_PDNS=$(ec2-metadata -p | cut -d:  -f2| cut -d' ' -f2)
# Get instance availability zone
INSTANCE_ZONE=$(ec2-metadata -z | cut -d:  -f2| cut -d' ' -f2)


export INSTANCE_ID
export INSTANCE_PDNS
export INSTANCE_ZONE

echo 'export INSTANCE_ID='$INSTANCE_ID >> ~/.bashrc
echo 'export INSTANCE_PDNS='$INSTANCE_PDNS >> ~/.bashrc
echo 'export INSTANCE_ZONE='$INSTANCE_ZONE >> ~/.bashrc

source ~/.bashrc

# === Create and Attach Volumes ===
mkdir aws-info
 
### Create project main working volume
aws ec2 create-volume --size 100 --availability-zone $INSTANCE_ZONE --volume-type gp2 > aws-info/main-volume.json
wait
# Read the volume id from the saved JSON (-r strips the surrounding quotes)
MAIN_VOL_ID=$(jq -r '.VolumeId' aws-info/main-volume.json)
 
aws ec2 attach-volume --volume-id $MAIN_VOL_ID --instance-id $INSTANCE_ID --device /dev/xvdf
wait
sudo mkdir -p /data
sudo mkfs -t ext4 /dev/xvdf
sudo mount /dev/xvdf /data
 
### Create volume from AWS snapshot of Million Song Dataset (full dataset)
aws ec2 create-volume --availability-zone $INSTANCE_ZONE --snapshot-id snap-5178cf30 --volume-type gp2 > aws-info/msd-volume.json
wait
# Read the volume id from the saved JSON (-r strips the surrounding quotes)
MSD_VOL_ID=$(jq -r '.VolumeId' aws-info/msd-volume.json)
 
aws ec2 attach-volume --volume-id $MSD_VOL_ID --instance-id $INSTANCE_ID --device /dev/xvdg
wait
sudo mkdir -p /msong_data
sudo mount /dev/xvdg /msong_data
 
# ============================================================================================================================
# === Install Neo4j in main directory ===
 
cd /data
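# Quote the URL (it contains '?') and name the output file so the tar step finds it
wget -O neo4j-community-2.3.1-unix.tar.gz "http://neo4j.com/artifact.php?name=neo4j-community-2.3.1-unix.tar.gz"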
tar -xf neo4j-community-2.3.1-unix.tar.gz
export NEO4J_HOME="/data/neo4j"

# Download the Subset Data - 10,000 songs

In [None]:
#Create a project directory
!mkdir msd_project
# Use the %cd magic: a '!cd' runs in a subshell and does not change the notebook's directory
%cd msd_project

In [None]:
%%writefile download_subsetdata.sh
#!/usr/bin/env bash

#Create a directory for the data
mkdir data
cd data

# Download data subset of 10,000 songs, ~1GB to develop and test code
wget http://static.echonest.com/millionsongsubset_full.tar.gz
wait

tar xvzf millionsongsubset_full.tar.gz
wait

# Download list of all artist ID 
# The format is: artist id<SEP>artist mbid<SEP>track id<SEP>artist name
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_artists.txt
wait
wc -l unique_artists.txt #44745 unique_artists.txt

# Download list of all unique artist terms (Echo Nest tags) 
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_terms.txt
wait
wc -l unique_terms.txt #7643 unique_terms.txt
    
# Download list of all unique artist musicbrainz tags
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/AdditionalFiles/unique_mbtags.txt
wait
wc -l unique_mbtags.txt #2321 unique_mbtags.txt

# Download last-fm data with song similarities
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_subset.zip
unzip lastfm_subset.zip

cd ..

# Download untrusted Songs to be filtered out

In [4]:
!wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/tasteprofile/sid_mismatches.txt

--2015-12-12 21:37:41--  http://labrosa.ee.columbia.edu/millionsong/sites/default/files/tasteprofile/sid_mismatches.txt
Resolving labrosa.ee.columbia.edu... 128.59.66.11
Connecting to labrosa.ee.columbia.edu|128.59.66.11|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2026182 (1.9M) [text/plain]
Saving to: `sid_mismatches.txt'


2015-12-12 21:37:41 (8.42 MB/s) - `sid_mismatches.txt' saved [2026182/2026182]



# Start Spark Context in Notebook

In [2]:
import os
import sys
#Escape L for line numbers
spark_home = os.environ['SPARK_HOME'] = '/data/spark15'
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home,'python/pyspark/shell.py'))

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.0
      /_/

Using Python version 2.7.10 (default, Sep 15 2015 14:50:01)
SparkContext available as sc, HiveContext available as sqlContext.


In [3]:
# Check spark context exists
sc

<pyspark.context.SparkContext at 0x7f0ee8089710>

---
# Data Preparation

### Python Libraries

In [20]:
import os
import glob
import shutil

import numpy as np

import h5py
import json

### Create list of HDF5 files and JSON files for the Subset Data

In [21]:
# Run from 'msd_project' directory
# List the files with their full path and store the list in a .txt file
# This list of files will then be read in Spark to parse the actual files

# '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 
# Million Song Dataset
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Path to list of song files - Million Song Subset - 
song_paths = 'data/list_hdf5_files.txt'

# If the file does not exist, create it
if not os.path.exists(song_paths):

    # List the paths of all song files and save them to the list file
    get_song_paths = glob.glob('./data/MillionSongSubset/data/*/*/*/*.h5')
    
    # The 'with' block closes the file, so no explicit close() is needed
    with open(song_paths,'w') as f:
        f.write('\n'.join(get_song_paths))

        
# '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 
# Last.fm Dataset
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Path to list of song files - LastFM Song Similarity and Tags -
lastfm_paths = 'data/list_lastfm_files.txt'

# If the file does not exist, create it
if not os.path.exists(lastfm_paths):

    # List the paths of all song files and save them to the list file
    get_song_paths = glob.glob('./data/lastfm_subset/*/*/*/*.json')
    
    # The 'with' block closes the file, so no explicit close() is needed
    with open(lastfm_paths,'w') as f:
        f.write('\n'.join(get_song_paths))
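
The glob patterns above produce paths relative to the notebook's working directory. A variant that records absolute paths, which may be more robust if the list is later read from a different directory, could look like the following sketch; `list_files` is a hypothetical helper, not part of the notebook:

```python
import os

def list_files(root, ext):
    """Return the sorted absolute paths of all files under `root` ending in `ext`."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(ext):
                found.append(os.path.abspath(os.path.join(dirpath, name)))
    return sorted(found)
```

Under that assumption, `list_files('data/MillionSongSubset/data', '.h5')` would stand in for the first glob call.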

### Read list of files to Spark

In [43]:
cdir = os.getcwd()

# Create RDD with the list of HDF5 song files
path = os.path.join(cdir, song_paths)
song_pathsRDD = sc.textFile('file://'+path)

# Create RDD with the list of JSON song files
path = os.path.join(cdir, lastfm_paths)
lastfm_pathsRDD = sc.textFile('file://'+path)

**Sample list of MSD song files**

In [12]:
song_pathsRDD.take(3)

[u'./data/MillionSongSubset/data/B/B/O/TRBBOPX12903D106F7.h5',
 u'./data/MillionSongSubset/data/B/B/O/TRBBOKQ128F933AE7C.h5',
 u'./data/MillionSongSubset/data/B/B/O/TRBBOPV12903CFB50F.h5']

**Sample list of Last.fm song files**

In [24]:
lastfm_pathsRDD.take(3)

[u'./data/lastfm_subset/B/B/O/TRBBOBO128F425FDFC.json',
 u'./data/lastfm_subset/B/B/O/TRBBOPX12903D106F7.json',
 u'./data/lastfm_subset/B/B/O/TRBBOBQ12903CC5186.json']

## Load the (song id, track id) pair mismatches

The raw mismatched file has the following general structure:

In [None]:
ERROR: <'songID', 'trackID'> 'description showing mismatch'

Some example lines from the file are:

In [None]:
ERROR: <SOVWUNG12A8C137891 TRMGMLW128F426A200> Warlock  -  Copy of a Copy  !=  Sickboy  -  She's out of way  
ERROR: <SOJTFZA12A8C13704E TRMGGOK128F426FDEB> Bike  -  Circus Kids  !=  Slut  -  Gloom  
ERROR: <SOZZXCP12A8C13832E TRMGQMW128F9311251> Musiq  -  Solong  !=  Suthun Boy  -  Full Blown

In [15]:
def parse_mismatches(line):
    '''
    This function extracts the songID and trackID of the mismatched records.
    Returned value: ('songID', 'trackID')
    '''
    return line[8:45].split()

In [None]:
# Create an RDD with the song-track pairs that need to be removed
toRemoveRDD = sc.textFile('file://'+cdir+'/sid_mismatches.txt').map(parse_mismatches)
songsToRemove = sc.broadcast(toRemoveRDD.collect())
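
The broadcast value above is a list of `[song_id, track_id]` lists, so every membership test in a later filter has to scan that list. A possible alternative, sketched here in plain Python with hypothetical helper names (`parse_mismatch_pair`, `is_trusted`), is to store the pairs as tuples in a set so that lookups take constant time on average:

```python
def parse_mismatch_pair(line):
    """Return (song_id, track_id) from one line of sid_mismatches.txt."""
    song_id, track_id = line[8:45].split()
    return (song_id, track_id)

def is_trusted(record, untrusted_pairs):
    """True when the record's (song_id, track_id) pair is not flagged."""
    return (record['song_id'], record['track_id']) not in untrusted_pairs

# One of the example lines from the mismatch file
sample = "ERROR: <SOVWUNG12A8C137891 TRMGMLW128F426A200> Warlock  -  Copy of a Copy  !=  Sickboy  -  She's out of way"
untrusted = set([parse_mismatch_pair(sample)])
```

In Spark, the set could presumably be broadcast in place of the list, for example `sc.broadcast(set(toRemoveRDD.map(tuple).collect()))`, with the filter comparing a tuple instead of a list.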

**Number of (song,track) pairs to remove**

In [29]:
toRemoveRDD.count()

19094

## Extract MSD song information

The MSD song data was extracted into an RDD where each song is represented by a Python dictionary. The keys of the dictionary are set by the `get_h5_info` function below.

**Function to read MSD HDF5 files and extract data**

In [44]:
def get_h5_info(path):
    '''
    Takes a path to a song stored as an HDF5 file and returns a dictionary with the 
    information that will be included in the graph
    ''' 
    d = {}
    with h5py.File(path, 'r') as f:

        # --- Artist Info -----------------------------
        d.setdefault('artist_id', f['metadata']['songs']['artist_id'][0])
        d.setdefault('artist_mbid', f['metadata']['songs']['artist_mbid'][0])
        d.setdefault('artist_7did', f['metadata']['songs']['artist_7digitalid'][0])
        d.setdefault('artist_name', f['metadata']['songs']['artist_name'][0])

        # --- Song Info -----------------------------
        d.setdefault('song_id', f['metadata']['songs']['song_id'][0])
        d.setdefault('track_id', f['analysis']['songs']['track_id'][0])
        d.setdefault('title', f['metadata']['songs']['title'][0])
        d.setdefault('dance', f['analysis']['songs']['danceability'][0])
        d.setdefault('dur', f['analysis']['songs']['duration'][0])
        d.setdefault('energy', f['analysis']['songs']['energy'][0])
        d.setdefault('loudness', f['analysis']['songs']['loudness'][0])

        # --- Year -----------------------------
        d.setdefault('year', f['musicbrainz']['songs']['year'][0])

        # --- Album -----------------------------
        d.setdefault('album', f['metadata']['songs']['release'][0])

        # --- Similar Artist -----------------------------
        d.setdefault('a_similar', np.array(f['metadata']['similar_artists']))

        # --- Artist Terms -----------------------------
        d.setdefault('a_terms', np.array(f['metadata']['artist_terms']))
        d.setdefault('a_tfrq', np.array(f['metadata']['artist_terms_freq']))
        d.setdefault('a_tw', np.array(f['metadata']['artist_terms_weight']))

        return d

**Extract MSD song data**

In [45]:
# Extract song information
songsRDD = song_pathsRDD.map(get_h5_info).filter(
    lambda x: [x['song_id'],x['track_id']] not in songsToRemove.value).cache()

**Sample song extracted**

In [None]:
songsRDD.take(1)

## Extract Last.fm song information

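Each Last.fm file is a JSON object with the keys `artist`, `title`, `track_id`, `timestamp`, `similars`, and `tags`, as the sample record further below shows.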

**Function to read Last.fm JSON files and extract data**

In [31]:
def get_json_info(path):
    with open(path) as data_file:    
        return json.load(data_file)

In [27]:
# Extract song information
lastfmRDD = lastfm_pathsRDD.map(get_json_info).cache()

**Sample song**

In [30]:
lastfmRDD.take(1)

[{u'artist': u'DeGarmo & Key',
  u'similars': [],
  u'tags': [],
  u'timestamp': u'2011-09-08 01:41:45.776631',
  u'title': u'Jericho  (Straight On Album Version)',
  u'track_id': u'TRBBOBO128F425FDFC'}]

## Create Nodes

The following table shows the nodes that will be created

|Node Label|File Name| Format |
|:--|:--|:--|
|Artists|nodes_artists.csv| 'artist_id', 'artist_mbid', 'artist_7did', 'artist_name'|
|Songs|nodes_songs.csv|'song_id', 'track_id', 'title', 'dance', 'dur', 'energy','loudness'|
|Albums|nodes_albums.csv| 'album_name'|
|Year|nodes_years.csv| 'year'|
|Tags|nodes_tags.csv| 'tag'|

### Artist Nodes

In [82]:
# Create artist nodes
fields = ['artist_id', 'artist_mbid', 'artist_7did', 'artist_name']
fieldsBrC = sc.broadcast(fields)

outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/nodes_artists')
# If directory already exists, delete it
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.map(makeCSVline).distinct().saveAsTextFile('file://'+outputfile)
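
`makeCSVline` is called above, but its definition does not appear in this notebook. A minimal sketch of what such a helper might look like, assuming it reads the field names from the broadcast variable and joins the values with commas. The quoting of values that contain commas or double quotes is an added assumption, since the other cells join fields with a bare `","`; the names `quote_field` and `make_csv_line` are hypothetical:

```python
def quote_field(value):
    """Render one value as a CSV field, quoting it only when needed."""
    text = str(value)
    if ',' in text or '"' in text or '\n' in text:
        text = '"' + text.replace('"', '""') + '"'
    return text

def make_csv_line(record, fields):
    """Join the selected fields of a song dictionary into one CSV line."""
    return ','.join(quote_field(record[name]) for name in fields)

# Made-up record for illustration; real records come from get_h5_info
example = {'artist_id': 'AR0000000000000000', 'artist_name': 'Earth, Wind & Fire'}
line = make_csv_line(example, ['artist_id', 'artist_name'])
```

Inside Spark this could be wired up as `songsRDD.map(lambda r: make_csv_line(r, fieldsBrC.value))`.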

### Song Nodes

In [84]:
# Create song nodes
fields = ['song_id', 'track_id', 'title', 'dance', 'dur', 'energy','loudness']
fieldsBrC = sc.broadcast(fields)

outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/nodes_songs')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.map(makeCSVline).distinct().saveAsTextFile('file://'+outputfile)

### Album Nodes

In [86]:
# Create album nodes
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/nodes_albums')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.map(lambda x: x['album']).distinct().saveAsTextFile('file://'+outputfile)

### Year Nodes

In [87]:
# Create year nodes
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/nodes_years')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['year']).filter(lambda x: int(x) > 0).distinct().saveAsTextFile('file://'+outputfile)

### Tag Nodes

The MSD data contains artist tags and the Last.fm data contains song tags.

The tags in the dataset have a large overlap. For example, the tags 'pop' and 'rock' are used to describe both artists and songs. Since the tags can be the same and convey the same information, I decided to model tags as one type of node. 

The tags were merged together and then the list of tags was created. 

In [42]:
# Tag nodes
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/nodes_tags')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
#songsRDD.flatMap(lambda x: x['a_terms']).distinct().saveAsTextFile('file://'+outputfile)

artistTags = songsRDD.flatMap(lambda x: x['a_terms']).distinct()

NameError: name 'songsRDD' is not defined

In [41]:
songTags = lastfmRDD.flatMap(lambda x: x['tags']).map(lambda x: x[0]).distinct()

In [35]:
lastfmRDD.filter(lambda x: x['tags']<>[]).flatMap(lambda x: x['tags']).take(4)

[[u'indie', u'100'],
 [u'chinese female vocal', u'25'],
 [u'just4lov', u'25'],
 [u'of christmas past', u'25']]

In [40]:
lastfmRDD.flatMap(lambda x: x['tags']).map(lambda x: x[0]).take(10)

[u'indie',
 u'chinese female vocal',
 u'just4lov',
 u'of christmas past',
 u'sweetodd',
 u'in china gibt es doch musik',
 u'suicide on your stereo set',
 u'Volltonfarbes Lieblingslieder',
 u'popLove',
 u'dinlemeye kiyamiyorum']
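
The cells above build the artist tags and the song tags separately, and the merge itself is not shown. A minimal sketch of the merge in plain Python, using a hypothetical `merge_tags` helper; in Spark the equivalent would presumably be `artistTags.union(songTags).distinct()`:

```python
def merge_tags(artist_terms, song_tag_pairs):
    """Combine MSD artist terms and Last.fm [tag, count] pairs into one tag list."""
    tags = set(artist_terms)
    # Last.fm tags are [name, count] pairs; only the name becomes a node
    tags.update(pair[0] for pair in song_tag_pairs)
    return sorted(tags)

merged = merge_tags(['pop', 'rock'], [['indie', '100'], ['pop', '25']])
```

This sketch compares tags exactly as written, so tags that differ only in case (such as `popLove` and `poplove`) would remain separate nodes.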

## Relationships

The following table shows the relationships that will be created

|Relationship Structure|File Name| Format |
|:--|:--|:--|
|(ARTIST) - [SIMILAR_TO] -> (ARTIST)|rel_similar_artists.csv|'from_artist_id', 'to_artist_id'|
|(ARTIST) - [PERFORMS] -> (SONG)|rel_performs.csv|'artist_id', 'song_id'|
|(ARTIST) - [HAS_ALBUM] -> (ALBUM)|rel_artist_has_album.csv|'artist_id', 'album_name'|
|(SONG) - [IN_ALBUM] -> (ALBUM)|rel_song_in_album.csv|'song_id', 'album_name'|
|(ARTIST) - [HAS_TAG] -> (TAG)|rel_artist_tag.csv|'artist_id', 'tag_name', 'normalized_frq', 'normalized_weight'|
|(SONG) - [SIMILAR_TO] -> (SONG)| rel_similar_songs.csv|'from_song_id', 'to_song_id', 'similarity_weight'|

# Missing Relationships

|Relationship Structure|File Name| Format |
|:--|:--|:--|
|(SONG) - [HAS_TAG] -> (TAG)| rel_song_tag.csv|'song_id', 'tag_name', 'normalized_weight'|
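
This relationship is not built in the notebook. A possible starting point, sketched in plain Python under two assumptions that are not confirmed here: the Last.fm tag counts are on a 0-100 scale (so dividing by 100 normalizes them), and rows are keyed by `track_id`, which is the identifier the Last.fm records carry. The helper name `song_tag_lines` is hypothetical:

```python
def song_tag_lines(record):
    """Turn one Last.fm record into 'track_id,tag,weight' CSV lines."""
    lines = []
    for tag, count in record['tags']:
        weight = float(count) / 100.0  # assumption: counts are on a 0-100 scale
        lines.append(record['track_id'] + ',' + tag + ',' + str(weight))
    return lines

# Made-up record for illustration (placeholder track id)
example = {'track_id': 'TRXXXXX', 'tags': [['indie', '100'], ['chinese female vocal', '25']]}
rows = song_tag_lines(example)
```

Linking these rows to song nodes would still require mapping each `track_id` to its `song_id` from the MSD records.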

### HAS_TAG relationship between artist and tags

In [48]:
def artistToTags(record):
    '''
    Concatenate artist with each tag
    Normalize tag frequency and weight
    '''
    normalize_frq = record['a_tfrq'] / sum(record['a_tfrq'])
    normalize_w = record['a_tw'] / sum(record['a_tw'])
    terms = record['a_terms']
    artist = record['artist_id']
    
    result = []
    for i in range(len(terms)):
        result.append( artist +","+ terms[i] +","+ str(normalize_frq[i]) +","+ str(normalize_w[i]))
    
    return result

In [50]:
# Artist Has Tags (edge has properties)
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/rel_artist_tag')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.flatMap(artistToTags).distinct().saveAsTextFile('file://'+outputfile)
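
`artistToTags` divides each frequency and weight by the corresponding sum, so an artist whose values are all zero would produce a division by zero (with NumPy arrays, typically `nan` values and a runtime warning). Whether such records occur in the dataset is not established here. A defensive variant of the normalization, sketched in plain Python with a hypothetical `normalize` helper:

```python
def normalize(values):
    """Scale values so they sum to 1; return zeros if the total is zero."""
    total = float(sum(values))
    if total == 0.0:
        return [0.0 for _ in values]
    return [v / total for v in values]
```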

### SIMILAR_TO relationship between artist and artist

In [98]:
# Similar Artist to Artist (directional, no properties)
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/rel_similar_artists')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
similarArtistsRDD = songsRDD.map(lambda x: (x['artist_id'],x['a_similar'])).flatMapValues(lambda x: x)
similarArtistsRDD.distinct().map(lambda x: x[0]+","+x[1]).saveAsTextFile('file://'+outputfile)

### PERFORMS relationship between artist and song

In [99]:
# Artist Performs Song
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/rel_performs')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['artist_id']+","+x['song_id']).distinct().saveAsTextFile('file://'+outputfile)

### HAS_ALBUM relationship between artist and album

In [100]:
# Artist Has Album
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/rel_artist_has_album')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['artist_id']+","+x['album']).distinct().saveAsTextFile('file://'+outputfile)

### IN_ALBUM relationship between song and album

In [101]:
# Song In Album
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/rel_song_in_album')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['song_id']+","+x['album']).distinct().saveAsTextFile('file://'+outputfile)

### SIMILAR_TO relationship between song and song

In [None]:
# Similar Song to Song
# .csv format: from_track_id, to_track_id, similarity_measure
outputfile = os.path.join(cdir,'data/MillionSongSubset/tmp/rel_similar_songs')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
similarSongsRDD = lastfmRDD.filter(
    lambda x: x['similars']<>[]).map(
    lambda x: (x['track_id'],x['similars'])).flatMapValues(lambda x: x)
similarSongsRDD.map(lambda x: x[0]+","+x[1][0]+","+str(x[1][1])).saveAsTextFile('file://'+outputfile)
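
For readers less familiar with `flatMapValues`, the chain above can be read as the following plain-Python sketch for a single record. It assumes, as the Spark code does, that each entry in `similars` is a `[track_id, similarity]` pair; the helper name `similar_song_lines` and the sample record are made up for illustration:

```python
def similar_song_lines(record):
    """Turn one Last.fm record into 'from_track_id,to_track_id,similarity' lines."""
    lines = []
    for other_id, score in record['similars']:
        lines.append(record['track_id'] + ',' + other_id + ',' + str(score))
    return lines

# Made-up record with placeholder track ids
example = {'track_id': 'TRAAA', 'similars': [['TRBBB', 0.5], ['TRCCC', 1]]}
rows = similar_song_lines(example)
```

A record with an empty `similars` list yields no lines, which mirrors the filter in the Spark version.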

## Combine Spark output 

Spark outputs several files named 'part-000xx' which cannot be read into Neo4j. To load the data to Neo4j, I combined the Spark output into a .csv file which can be imported to Neo4j. Additionally, a header line was added for better readability.
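
The merge and header steps can also be done in one pass from Python. The following is a sketch only, with a hypothetical `merge_parts` helper; it assumes every part file ends with a newline, which is how `saveAsTextFile()` normally writes records:

```python
import glob
import os

def merge_parts(part_dir, out_csv, header):
    """Concatenate Spark 'part-*' files into one CSV with a header line."""
    parts = sorted(glob.glob(os.path.join(part_dir, 'part-*')))
    with open(out_csv, 'w') as out:
        out.write(header + '\n')
        for part in parts:
            with open(part) as src:
                for line in src:
                    out.write(line)
    return len(parts)
```

For example, `merge_parts('data/MillionSongSubset/tmp/nodes_albums', 'data/MillionSongSubset/graph/nodes_albums.csv', 'name')` would write the album nodes with their header.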

In [94]:
os.system('cat data/MillionSongSubset/tmp/nodes_artists/part-* > data/MillionSongSubset/graph/nodes_artists.csv')
os.system('cat data/MillionSongSubset/tmp/nodes_songs/part-* > data/MillionSongSubset/graph/nodes_songs.csv')
os.system('cat data/MillionSongSubset/tmp/nodes_albums/part-* > data/MillionSongSubset/graph/nodes_albums.csv')
os.system('cat data/MillionSongSubset/tmp/nodes_years/part-* > data/MillionSongSubset/graph/nodes_years.csv')
os.system('cat data/MillionSongSubset/tmp/nodes_tags/part-* > data/MillionSongSubset/graph/nodes_tags.csv')


0

In [152]:
os.system('cat data/MillionSongSubset/tmp/rel_artist_tag/part-* > data/MillionSongSubset/graph/rel_artist_tag.csv')
os.system('cat data/MillionSongSubset/tmp/rel_similar_artists/part-* > data/MillionSongSubset/graph/rel_similar_artists.csv')
os.system('cat data/MillionSongSubset/tmp/rel_performs/part-* > data/MillionSongSubset/graph/rel_performs.csv')
os.system('cat data/MillionSongSubset/tmp/rel_artist_has_album/part-* > data/MillionSongSubset/graph/rel_artist_has_album.csv')
os.system('cat data/MillionSongSubset/tmp/rel_song_in_album/part-* > data/MillionSongSubset/graph/rel_song_in_album.csv')

0

# CHECK THAT HEADERS ARE THE SAME AS USED TO IMPORT DATA TO NEO

In [None]:
sed -i '1iname' nodes_albums.csv
sed -i '1iid,mbid,7did,name' nodes_artists.csv
sed -i '1iid,trackid,name,dance,dur,energy,loudness' nodes_songs.csv
sed -i '1itag' nodes_tags.csv
sed -i '1iyear' nodes_years.csv

sed -i '1iartistId,album' rel_artist_has_album.csv  
sed -i '1iartist_id,tag_name,frq,weight' rel_artist_tag.csv  
sed -i '1iartist,song' rel_performs.csv  
sed -i '1ifrom,to' rel_similar_artists.csv  
sed -i '1isongID,album' rel_song_in_album.csv

---
# Structure for a general script where the SparkContext has to be created and run with `spark-submit`

In [None]:
%%writefile test_code/count_h5.py
#!/usr/bin/env python
from pyspark import SparkContext
import time
import h5py

def read_h5_file(path):
    with h5py.File(path, 'r') as f:
        return f['metadata']['songs']['title'][0]
#Start Time
t1 = time.time()

# --- Process files ----
sc = SparkContext(appName="SparkHDF5")
file_paths = sc.textFile('file:///data/asoto/projectW205/data/list_files.txt')

songs = file_paths.map(read_h5_file)
songs.count()
# ----------------------

#End Time
t2 = time.time()
sec = t2-t1

print "Run Time: %0.2f sec = %.2f min = %.2f h"%(sec,sec/60.0,sec/3600.0)
sc.stop()

In [None]:
!spark-submit test_code/count_h5.py