# Pre-Processing the 10,000 Song Subset

*Andrea Soto*  
*MIDS W205 Final Project*  
*Project Name: Graph Model of the Million Song Dataset*

---

# Notebook Overview

This notebook processess the Million Song Dataset Subset which contains a sample of 10,000 songs from the main dataset. **The goal is to get familiarized with the data, and to develop and test the code on the smaller dataset.** The code developed was then compiled in scripts that were used to process the full dataset.

This notebook assumes that the instance has been configured according to the notebook [Step 1 - Configuration.ipynb](./Step 1 - Step 1 - Configuration.ipynb).

Two sources of data were used:

1. **Million Song Dataset (MSD):** contains track, song, artist, and album metadata. It also contains artist similarity and artist tags. This was the main datasource for this project. The data is stored in HDF5 format, with one file per song.
A detail description of the MSD project can be found [here](http://labrosa.ee.columbia.edu/millionsong/) and the filed list can be found [here](http://labrosa.ee.columbia.edu/millionsong/faq).
2. **Last.fm Dataset:** contains information about song similarity and song tags. This information was used to complement the MSD. This data is stored in JSON format, with one file per song. A detail desctiption of the Last.fm data can be found [here](http://labrosa.ee.columbia.edu/millionsong/lastfm).

For reference, a small sample of the tree structue of each dataset is shown below.   

In [None]:
MillionSongSubset/data/
|-- A
|   |-- A
|   |   |-- A
|   |   |   |-- TRAAAAW128F429D538.h5
|   |   |   |-- TRAAABD128F429CF47.h5
|   |   |   |-- TRAAADZ128F9348C2E.h5
|   |   |   |-- ...
|   |   |-- B
|   |   |   |-- TRAABCL128F4286650.h5
|   |   |   |-- TRAABDL12903CAABBA.h5
|   |   |   |-- TRAABJL12903CDCF1A.h5
|   |   |   |-- ...
|   |   |-- C
|   |   |   |-- TRAACCG128F92E8A55.h5
|   |   |   |-- TRAACER128F4290F96.h5
|   |   |   |-- TRAACFV128F935E50B.h5
|   |   |   |-- ...
|   |   |-- D
|   |   |-- ...
|   |   |-- X
|   |   |-- Y
|   |   `-- Z
|   |-- B
|   |   |-- A
|   |   |-- ...
|   |   `-- Z
|   |-- C
|   |-- ...
|   `-- Z
`-- B
    |-- A
    |   |-- A
    |   |-- B
    |   |-- ...
    |   `-- Z
    |-- B
    |-- ...
    `-- Z

In [None]:
MillionSongSubset/lastfm_subset/
|-- A
|   |-- A
|   |   |-- A
|   |   |   |-- TRAAAAW128F429D538.json
|   |   |   |-- TRAAABD128F429CF47.json
|   |   |   |-- TRAAADZ128F9348C2E.json
|   |   |   |-- ...
|   |   |-- B
|   |   |   |-- TRAABDL12903CAABBA.json
|   |   |   |-- TRAABJL12903CDCF1A.json
|   |   |   |-- TRAABJV128F1460C49.json
|   |   |   |-- ...
|   |   |-- C
|   |   |   |-- TRAACCG128F92E8A55.json
|   |   |   |-- TRAACER128F4290F96.json
|   |   |   |-- TRAACFV128F935E50B.json
|   |   |   |-- ...
|   |   |-- D
|   |   |-- ...
|   |   |-- X
|   |   |-- Y
|   |   `-- Z
|   |-- B
|   |   |-- A
|   |   |-- ...
|   |   `-- Z
|   |-- C
|   |-- ...
|   `-- Z
`-- B
    |-- A
    |   |-- A
    |   |-- B
    |   |-- ...
    |   `-- Z
    |-- B
    |-- ...
    `-- Z

---
# Workflow

The data was transformed into CSV files containing the nodes and realtionship structure for Neo4j. This CSV files where created to leaverage the `LOAD CVS` functionality of Neo4j, which makes loading large graphs into Neo4j faster and scalable. 

The Million Song Dataset stores each song in HDF5 format and the Last.fm dataset stores each song in JSON format. 

The steps followed in this notebook were:

1. Create a list of the song files with the full path to each file
2. In Spark, read each file in the list and extract the information
3. In Spark, transform the extracted data to create CSV files for each node and relationsip type. Separate transformations were requiered for each node and relationship type
4. Save the transformed data using Spark's `saveAsTextFile()` operation
5. Since the `saveAsTextFile()` operation generates several files named 'part-000xx', the data was merged into a single .csv file and headers were added to ease readability. This was done with bash commands

# Matching Errors

The MSD team found some matching errors between tracks and songs in the data. They created a list of (song id, tack id) pairs that are not trusted and they suggest removing this pairs from the data. These missmatches were removed from the data as part of the transformation process.

For more details see:
- http://labrosa.ee.columbia.edu/millionsong/blog/12-1-2-matching-errors-taste-profile-and-msd
- http://labrosa.ee.columbia.edu/millionsong/blog/12-2-12-fixing-matching-errors

### The following tree shows the final directory structure after processing the Subset data

In [3]:
!tree -L 2 /data/asoto/projectW205/MillionSongSubset/

/data/asoto/projectW205/MillionSongSubset/
|-- AdditionalFiles
|   |-- LICENSE
|   |-- README
|   |-- subset_artist_location.txt
|   |-- subset_artist_similarity.db
|   |-- subset_artist_term.db
|   |-- subset_msd_summary_file.h5
|   |-- subset_track_metadata.db
|   |-- subset_tracks_per_year.txt
|   |-- subset_unique_artists.txt
|   |-- subset_unique_mbtags.txt
|   |-- subset_unique_terms.txt
|   `-- subset_unique_tracks.txt
|-- data
|   |-- A
|   `-- B
|-- graph
|   |-- nodes_albums.csv
|   |-- nodes_artists.csv
|   |-- nodes_songs.csv
|   |-- nodes_tags.csv
|   |-- nodes_years.csv
|   |-- rel_artist_has_album.csv
|   |-- rel_artist_has_tag.csv
|   |-- rel_performs.csv
|   |-- rel_similar_artists.csv
|   |-- rel_similar_songs.csv
|   |-- rel_song_has_tag.csv
|   |-- rel_song_in_album.csv
|   `-- rel_song_year.csv
|-- lastfm_subset
|   |-- A
|   `-- B
|-- list_hdf5_files.txt
|-- list_lastfm_files.txt
`-- tmp
    |-- nodes_albums
    |-- nodes_arti

- The directories **'AdditionalFiles'**, **'data'**  and **'lastfm_subset'** are input folders downloaded from the MSD website
- The directories **'graph'** and **'tmp'** will be created as part of the extract and transfor process. The intermediate output from spark will be saved in **'tmp'** and the results will be consolidated into CSV files with headers in the **'graph'** folder.
- The files **'list_hdf5_files.txt'** and **'list_lastfm_files.txt'** are the list of paths to the song files that will be used to read within Spark. These are the first output created in the dataflow.

# Download the Subset Data - 10,000 songs

The **'download_subsetdata.sh'** script downloads the MSD and Last.fm datasets into the current directory. The Last.fm is downloaded inside the MillionSongSubset folder.

The final directory structure is:

(current directory)  
| -- MillionSongSubset  
| --  -- | -- Additional files  
| --  -- | -- data  
| --  -- | -- lastfm_subset   

All the outputs of processing the Subset dataset will be stored under the main directoyr 'MillionSongSubset'

In [6]:
%%writefile scripts/download_subsetdata.sh
#!/usr/bin/env bash

# Download data subset of 10,000 songs, ~1GB to develop and test code
wget http://static.echonest.com/millionsongsubset_full.tar.gz
wait
tar xvzf millionsongsubset_full.tar.gz
wait
rm millionsongsubset_full.tar.gz

cd MillionSongSubset

# Download last-fm data with song similarities
wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_subset.zip
unzip lastfm_subset.zip
rm lastfm_subset.zip

cd ..

Writing scripts/download_subsetdata.sh


# Download untrusted Songs to be filtered out

Download a copy to process with the subset data

In [10]:
!wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/tasteprofile/sid_mismatches.txt

--2015-12-14 07:56:28--  http://labrosa.ee.columbia.edu/millionsong/sites/default/files/tasteprofile/sid_mismatches.txt
Resolving labrosa.ee.columbia.edu... 128.59.66.11
Connecting to labrosa.ee.columbia.edu|128.59.66.11|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2026182 (1.9M) [text/plain]
Saving to: `sid_mismatches.txt'


2015-12-14 07:56:28 (7.93 MB/s) - `sid_mismatches.txt' saved [2026182/2026182]



# Start Spark Context in Notebook

In [1]:
import os
import sys
#Escape L for line numbers
spark_home = os.environ['SPARK_HOME'] = '/data/spark15'
if not spark_home:
    raise ValueError('SPARK_HOME enviroment variable is not set')

sys.path.insert(0,os.path.join(spark_home,'python'))
sys.path.insert(0,os.path.join(spark_home,'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home,'python/pyspark/shell.py'))

Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.0
      /_/

Using Python version 2.7.10 (default, Sep 15 2015 14:50:01)
SparkContext available as sc, HiveContext available as sqlContext.


In [2]:
# Check spark context exists
sc

<pyspark.context.SparkContext at 0x7f1690c88910>

---
# Data Preparation

### Python Libraries

In [3]:
import os
import glob
import shutil

import numpy as np

import h5py
import json

### Create list of HDF5 files and JSON files for the Subset Data

In [4]:
# Run from 'msd_project' directory
# List the files with their full path and store the list in a .txt file
# This list of files will then be read in Spark to parse the actual files

# '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 
# Million Song Dataset
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Path to list of song files - Million Song Subset - 
song_paths = 'MillionSongSubset/list_hdf5_files.txt'

# If file does not exits, create it
if not os.path.exists(song_paths):

    # List all paths of songs and save them to 
    get_song_paths = glob.glob('./MillionSongSubset/data/*/*/*/*.h5')
    
    with open(song_paths,'w') as f:
        f.writelines('\n'.join(p for p in get_song_paths))
        f.close()

        
# '''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' 
# Last.fm Dataset
# ''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
# Path to list of song files - LastFM Song Similarity and Tags -
lastfm_paths = 'MillionSongSubset/list_lastfm_files.txt'

# If file does not exits, create it
if not os.path.exists(lastfm_paths):

    # List all paths of songs and save them to 
    get_song_paths = glob.glob('./MillionSongSubset/lastfm_subset/*/*/*/*.json')
    
    with open(lastfm_paths,'w') as f:
        f.writelines('\n'.join(p for p in get_song_paths))
        f.close()

### Read list of files to Spark

In [5]:
cdir = os.getcwd()

# Create RDD with the list of HDF5 song files
path = os.path.join(cdir, song_paths)
song_pathsRDD = sc.textFile('file://'+path)

# Create RDD with the list of JSON song files
path = os.path.join(cdir, lastfm_paths)
lastfm_pathsRDD = sc.textFile('file://'+path)

**Sample list of MSD song files**

In [10]:
song_pathsRDD.take(5)

[u'./MillionSongSubset/data/B/B/O/TRBBOPX12903D106F7.h5',
 u'./MillionSongSubset/data/B/B/O/TRBBOKQ128F933AE7C.h5',
 u'./MillionSongSubset/data/B/B/O/TRBBOPV12903CFB50F.h5',
 u'./MillionSongSubset/data/B/B/O/TRBBOJM12903CD1BDD.h5',
 u'./MillionSongSubset/data/B/B/O/TRBBOBQ12903CC5186.h5']

**Sample list of Last.fm song files**

In [11]:
lastfm_pathsRDD.take(5)

[u'./MillionSongSubset/lastfm_subset/B/B/O/TRBBOBO128F425FDFC.json',
 u'./MillionSongSubset/lastfm_subset/B/B/O/TRBBOPX12903D106F7.json',
 u'./MillionSongSubset/lastfm_subset/B/B/O/TRBBOBQ12903CC5186.json',
 u'./MillionSongSubset/lastfm_subset/B/B/O/TRBBOME12903CC3862.json',
 u'./MillionSongSubset/lastfm_subset/B/B/O/TRBBOFH128F14A2A46.json']

## Load the (song id, track id) pair mismatches

The raw mismatched file has the following general structure:

In [None]:
ERROR: <'songID', 'trackID'> 'descrption showing mismatch'

Some example lines from the file are:

In [None]:
ERROR: <SOVWUNG12A8C137891 TRMGMLW128F426A200> Warlock  -  Copy of a Copy  !=  Sickboy  -  She's out of way  
ERROR: <SOJTFZA12A8C13704E TRMGGOK128F426FDEB> Bike  -  Circus Kids  !=  Slut  -  Gloom  
ERROR: <SOZZXCP12A8C13832E TRMGQMW128F9311251> Musiq  -  Solong  !=  Suthun Boy  -  Full Blown

In [7]:
def parse_mismatches(line):
    '''
    This function extracts the songID and trackID of the mismatched records.
    Returned value: ('songID', 'trackID')
    '''
    return line[8:45].split()

In [14]:
# Create an RDD with the song-track pairs that need to be removed
toRemoveRDD = sc.textFile('file://'+cdir+'/sid_mismatches.txt').map(parse_mismatches)
songsToRemove = sc.broadcast(toRemoveRDD.collect())

**Number of (song,track) pairs to remove**

In [29]:
toRemoveRDD.count()

19094

## Extract MSD song information

The MSD song data was extracted into an RDD were each song is represendted with a python dictionary having the following structure:

In [None]:
{   'a_similar': array(['artistId', 'artistId', ... , 'artistId']),  
      'a_terms': array(['term1', 'term2', ..., 'termN']),  
       'a_tfrq': array([ ]),  
         'a_tw': array([ ]),  
        'album': 'album name',  
  'artist_7did': '7digit artist id',  
    'artist_id': 'Echo Nest artist id',  
  'artist_mbid': 'Music Brain artist id',  
  'artist_name': 'Artist name',  
        'dance': 0.0,  
          'dur': 125.7,  
       'energy': 0.0,  
     'loudness': -9.3,  
      'song_id': 'Echo Nest song id',  
        'title': 'Song title',  
     'track_id': 'Echo Nest trach id',  
         'year': 1990  
}

**Function to read MSD HDF5 files and extract data**

In [15]:
def get_h5_info(path):
    '''
    Takes a path to a song stored as an HDF5 file and returns a dictionary with the 
    information that will be included in the graph
    ''' 
    d = {}
    with h5py.File(path, 'r') as f:
        song_id = f['metadata']['songs']['song_id'][0]
        track_id = f['analysis']['songs']['track_id'][0]
        
        if (song_id, track_id) not in songsToRemove.value:

            # --- Artist Info -----------------------------
            d.setdefault('artist_id', f['metadata']['songs']['artist_id'][0])
            d.setdefault('artist_mbid', f['metadata']['songs']['artist_mbid'][0])
            d.setdefault('artist_7did', f['metadata']['songs']['artist_7digitalid'][0])
            d.setdefault('artist_name', f['metadata']['songs']['artist_name'][0])

            # --- Song Info -----------------------------
            d.setdefault('song_id', song_id)
            d.setdefault('track_id', track_id)
            d.setdefault('title', f['metadata']['songs']['title'][0])
            d.setdefault('dance', f['analysis']['songs']['danceability'][0])
            d.setdefault('dur', f['analysis']['songs']['duration'][0])
            d.setdefault('energy', f['analysis']['songs']['energy'][0])
            d.setdefault('loudness', f['analysis']['songs']['loudness'][0])

            # --- Year -----------------------------
            d.setdefault('year', f['musicbrainz']['songs']['year'][0])

            # --- Album -----------------------------
            d.setdefault('album', f['metadata']['songs']['release'][0])

            # --- Similar Artist -----------------------------
            d.setdefault('a_similar', np.array(f['metadata']['similar_artists']))

            # --- Artist Terms -----------------------------
            d.setdefault('a_terms', np.array(f['metadata']['artist_terms']))
            d.setdefault('a_tfrq', np.array(f['metadata']['artist_terms_freq']))
            d.setdefault('a_tw', np.array(f['metadata']['artist_terms_weight']))

            return d
        else: 
            pass

**Extract MSD song data**

In [16]:
# Extract song information
songsRDD = song_pathsRDD.map(get_h5_info).cache()

**Sample song extracted**

In [17]:
songsRDD.take(1)

[{'a_similar': array(['ARRGFFD1187B9AF330', 'ARIVAXF122BCFCACF3', 'AR6LT5K1187FB562A9',
         'ARI8PQM1187B99577F', 'ARHYS6D1187FB5BBA4', 'AR1XPEO1187B9B560E',
         'AREUFRU1187FB49BEF', 'AR41E9U1187FB5573B', 'AR6AD5N1187FB52F22',
         'ARCF9FU119B866967B', 'ARBVIM21187FB520A2', 'ARISRD71187FB57AE8',
         'ARAMB6Q1187B99DE68', 'ARE3JFT1187FB589B6', 'ARJMAW61187B9A6148',
         'ARP6QCL1187FB36142', 'ARJ41O41187B9A0F53', 'AR1P7OW1187FB5B3E1',
         'ARVMRVW1187FB392FF', 'ARA8DDQ1187B9AE3A0', 'AR3QE2N1187FB588CA',
         'AROF8OV1187FB55B85', 'AR9JJ761187B9AF496', 'ARWCIR91187FB55D30',
         'ARXWXEB1187B9A8592', 'AR0WGKH11C8A414A0F', 'ARJ8S571187FB4550A',
         'ARWY36G11A348EFDFC', 'AR5SZEA1187B9BA0AA', 'ARPFC0M1187B9B969D',
         'ARAEZVZ1187FB573A8', 'AR52O1K1187FB4C98D', 'ARDEOJT1187B990229',
         'ARKWACN11A348F0476', 'ARL26PR1187FB576E5', 'ARE3RNX1187B9ADD8B',
         'AROLJZM1187B994C58', 'ARXOPQ911C8A41568B', 'ARAFF5A1187FB56142',
         'AR

In [20]:
songsRDD.count()

10000

## Extract Last.fm song information

The Last.fm data has the following structure:

In [None]:
{    'artist': u'DeGarmo & Key',
   'similars': [['song id', similarity measure], ['song id', similarity measure]],
       'tags': [['tag one', weight],['some tag', weight]],
  'timestamp': '2011-09-08 01:41:45.776631',
      'title': 'Jericho  (Straight On Album Version)',
   'track_id': 'TRBBOBO128F425FDFC'
}

**Function to read Last.fm JSON files and extract data**

In [18]:
def get_json_info(path):
    with open(path) as data_file:    
        return json.load(data_file)

In [19]:
# Extract song information
lastfmRDD = lastfm_pathsRDD.map(get_json_info).cache()

**Sample song**

In [30]:
lastfmRDD.take(1)

[{u'artist': u'DeGarmo & Key',
  u'similars': [],
  u'tags': [],
  u'timestamp': u'2011-09-08 01:41:45.776631',
  u'title': u'Jericho  (Straight On Album Version)',
  u'track_id': u'TRBBOBO128F425FDFC'}]

---
# Transform and Export Data in CSV format

## Create Nodes

The following table shows the nodes that will be created

|No.|Node Label|File Name| Format |
|:--:|:--|:--|:--|
|1|Artists|nodes_artists.csv| 'artist_id', 'artist_mbid', 'artist_7did', 'artist_name'|
|2|Songs|nodes_songs.csv|'song_id', 'track_id', 'title', 'dance', 'dur', 'energy','loudness'|
|3|Albums|nodes_albums.csv| 'album_name'|
|4|Year|nodes_years.csv| 'year'|
|5|Tags|nodes_tags.csv| 'tag'|

**Function to create CSV format from fields of dictionary**

In [20]:
def makeCSVline(line):
    return ','.join(str(line[field]) for field in fieldsBrC.value)

### Artist Nodes

*CSV Format: artist_id, artist_mb_id, artist_7d_id, artist_name*

In [82]:
# Create artist nodes
fields = ['artist_id', 'artist_mbid', 'artist_7did', 'artist_name']
fieldsBrC = sc.broadcast(fields)

outputfile = os.path.join(cdir,'MillionSongSubset/tmp/nodes_artists')
# If directory already exists, delete it
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.map(makeCSVline).distinct().saveAsTextFile('file://'+outputfile)

### Song Nodes

*CSV Format: song_id, track_id, song_title, danceability, duration, energy, loudness*

In [84]:
# Create song nodes
fields = ['song_id', 'track_id', 'title', 'dance', 'dur', 'energy','loudness']
fieldsBrC = sc.broadcast(fields)

outputfile = os.path.join(cdir,'MillionSongSubset/tmp/nodes_songs')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.map(makeCSVline).distinct().saveAsTextFile('file://'+outputfile)

### Album Nodes

*CSV Format: album_name*

In [86]:
# Create album nodes
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/nodes_albums')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.map(lambda x: x['album']).distinct().saveAsTextFile('file://'+outputfile)

### Year Nodes

*CSV Format: year*

In [87]:
# Create year nodes
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/nodes_years')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['year']).filter(lambda x: int(x) > 0).distinct().saveAsTextFile('file://'+outputfile)

### Tag Nodes

*CSV Format: tag_name*

The MSD data containes artist tags and  the Last.fm data containes song tags.

The tags in the dataset have a large overlap. For example, the tags 'pop' and 'rock' are used to describe both artists and songs. Since the tags can be the same and convey the same information, I decided to model tags as one type of node. 

The tags were merged together and then the list of tags was created. 

In [64]:
# Tag nodes
artistTags = songsRDD.flatMap(lambda x: x['a_terms']).distinct()
songTags = lastfmRDD.flatMap(lambda x: x['tags']).map(lambda x: x[0]).distinct()

outputfile = os.path.join(cdir,'MillionSongSubset/tmp/nodes_tags')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

allTags = songTags.union(artistTags).distinct()
allTags.saveAsTextFile('file://'+outputfile)

**Number of distinct tags in each dataset**

In [56]:
cnt_artistTags = artistTags.count()
cnt_songTags = songTags.count()
cnt_combined = allTags.count()

In [63]:
print 'Artist tags:\t {:,}'.format(cnt_artistTags)
print 'Song tags:  \t{:,}'.format(cnt_songTags)
print 'Unique tags:\t{:,}'.format(cnt_combined)

Artist tags:	 3,502
Song tags:  	33,355
Unique tags:	35,541


**Sample tags**

In [32]:
artistTags.take(5)

['fabric saturdays venue',
 'suicidal',
 'protest',
 'memphisunderground',
 'technical progressive death metal']

In [33]:
songTags.take(5)

[u'great solo',
 u'river ssss',
 u'sacramental imagery',
 u'Yuri Gagarin',
 u'songs about birds']

## Relationships

The following table shows the relationships that will be created

|No.|Relationship Structure|File Name| Format |
|:--:|:--|:--|:--|
|1|(ARTIST) - [SIMILAR_TO] -> (ARTIST)|rel_similar_artists.csv|'from_artist_id', 'to_artist_id'|
|2|(ARTIST) - [PERFORMS] -> (SONG)|rel_performs.csv|'artist_id', 'song_id'|
|3|(ARTIST) - [HAS_ALBUM] -> (ALBUM)|rel_artist_has_album.csv|'artist_id', 'album_name'|
|4|(ARTIST) - [HAS_TAG] -> (TAG)|rel_artist_has_tag.csv|'artist_id', 'tag_name', 'normalized_frq', 'normalized_weight'|
|5|(SONG) - [IN_ALBUM] -> (ALBUM)|rel_song_in_album.csv|'song_id', 'album_name'|
|6|(SONG) - [SIMILAR_TO] -> (SONG)| rel_similar_songs.csv|'from_song_id', 'to_song_id', 'similarity_weight'|
|7|(SONG) - [HAS_TAG] -> (TAG)| rel_song_has_tag.csv|'from_song_id', 'to_song_id', 'normalized_weight'|
|8|(SONG) - [RELEASED_ON] -> (YEAR)| rel_song_year.csv|'song_id', 'year'|

### SIMILAR_TO relationship between artist and artist

*CSV Format: from_artist_id, to_artist_id*

In [98]:
# Similar Artist to Artist (directional, no properties)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_similar_artists')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
similarArtistsRDD = songsRDD.map(lambda x: (x['artist_id'],x['a_similar'])).flatMapValues(lambda x: x)
similarArtistsRDD.distinct().map(lambda x: x[0]+","+x[1]).saveAsTextFile('file://'+outputfile)

### PERFORMS relationship between artist and song

*CSV Format: artist_id, song_id*

In [99]:
# Artist Performs Song (directional, no properties)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_performs')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['artist_id']+","+x['song_id']).distinct().saveAsTextFile('file://'+outputfile)

### HAS_ALBUM relationship between artist and album

CSV Format: artist_id, album_name

In [100]:
# Artist Has Album (directional, no properties)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_artist_has_album')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['artist_id']+","+x['album']).distinct().saveAsTextFile('file://'+outputfile)

### HAS_TAG relationship between artist and tags

*CSV Format: artist_id, tag_name, tag_frequency, tag_weight*

In [76]:
def artistToTags(record):
    '''
    Concatenate artist with each tag
    Normalize tag frequency and weight
    '''
    normalize_frq = record['a_tfrq'] / sum(record['a_tfrq'])
    normalize_w = record['a_tw'] / sum(record['a_tw'])
    terms = record['a_terms']
    artist = record['artist_id']
    
    result = []
    for i in range(len(terms)):
        result.append( artist +","+ terms[i] +","+ str(normalize_frq[i]) +","+ str(normalize_w[i]))
    
    return result

In [None]:
# Artist Has Tags (directional, has properties frequency and weight)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_artist_has_tag')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.flatMap(artistToTags).distinct().saveAsTextFile('file://'+outputfile)

### IN_ALBUM relationship between song and album

*CSV Format: song_id, album_name*

In [101]:
# Song In Album (direction, no properties)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_song_in_album')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
songsRDD.map(lambda x: x['song_id']+","+x['album']).distinct().saveAsTextFile('file://'+outputfile)

### SIMILAR_TO relationship between song and song

*CSV Format: from_track_id, to_track_id, similarity_measure*

Uses the track_id instead of the song_id to create the similarity relationship

In [None]:
# Similar Song to Song (directional, with property similarity measure)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_similar_songs')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)
    
similarSongsRDD = lastfmRDD.filter(
    lambda x: x['similars']<>[]).map(
    lambda x: (x['track_id'],x['similars'])).flatMapValues(lambda x: x)
similarSongsRDD.map(lambda x: x[0]+","+x[1][0]+","+str(x[1][1])).saveAsTextFile('file://'+outputfile)

### HAS_TAG relationship between song and tags

*CSV Format: track_id, tag_name, tag_weight*

Uses the track_id instead of song_id to identify songs

In [89]:
def songToTags(record):
    '''
    Concatenate song with each tag
    '''
    tags = record['tags']
    total_weight = sum(float(w[1]) for w in tags)
    track_id = record['track_id']
    
    result = []
    for i in range(len(tags)):
        result.append( track_id +","+ tags[i][0] +","+ str(float(tags[i][1])/total_weight))
    
    return result

In [91]:
# Song Has Tags (directional, edge with property weight)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_song_has_tag')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

lastfmRDD.flatMap(songToTags).saveAsTextFile('file://'+outputfile)

### RELEASED_ON relationship between song and year

*CSV Format: song_id, year*

In [82]:
# Song Released in Year (directional, no properties)
outputfile = os.path.join(cdir,'MillionSongSubset/tmp/rel_song_year')
if os.path.exists(outputfile):
    shutil.rmtree(outputfile)

songsRDD.filter(lambda x: int(x['year'])<>0).map(
    lambda x: x['song_id']+","+str(x['year'])).saveAsTextFile('file://'+outputfile)

**Number of songs with non-zero year**

In [81]:
songsRDD.filter(lambda x: int(x['year'])<>0).count()

4680

---
# Merge Spark output 

Spark outputs several files named 'part-000xx' which cannot be read into Neo4j. To load the data to Neo4j, I combined the Spark output into a .csv file which can be imported to Neo4j. I also added a header line for better readability.

In [77]:
%%bash

cd MillionSongSubset

# Combine node files
cat tmp/nodes_artists/part-* > graph/nodes_artists.csv
cat tmp/nodes_songs/part-*   > graph/nodes_songs.csv
cat tmp/nodes_albums/part-*  > graph/nodes_albums.csv
cat tmp/nodes_years/part-*   > graph/nodes_years.csv
cat tmp/nodes_tags/part-*    > graph/nodes_tags.csv

# Add headers to nodes
sed -i '1iartist_id,artist_mbid,artist_7did,artist_name' graph/nodes_artists.csv
sed -i '1isong_id,track_id,song_name,danceability,duration,energy,loudness' graph/nodes_songs.csv
sed -i '1ialbum_name' graph/nodes_albums.csv
sed -i '1iyear' graph/nodes_years.csv
sed -i '1itag_name' graph/nodes_tags.csv


# Combine relationship files
cat tmp/rel_artist_has_album/part-* > graph/rel_artist_has_album.csv
cat tmp/rel_artist_has_tag/part-*   > graph/rel_artist_has_tag.csv
cat tmp/rel_performs/part-*         > graph/rel_performs.csv
cat tmp/rel_similar_artists/part-*  > graph/rel_similar_artists.csv
cat tmp/rel_similar_songs/part-*    > graph/rel_similar_songs.csv
cat tmp/rel_song_has_tag/part-*     > graph/rel_song_has_tag.csv
cat tmp/rel_song_in_album/part-*    > graph/rel_song_in_album.csv
cat tmp/rel_song_year/part-*        > graph/rel_song_year.csv

# Add headers to relationships
sed -i '1iartist_id,album_name' graph/rel_artist_has_album.csv
sed -i '1iartist_id,tag_name,tag_frq,tag_w' graph/rel_artist_has_tag.csv
sed -i '1iartist_id,song_id' graph/rel_performs.csv
sed -i '1ifrom_artist,to_artist' graph/rel_similar_artists.csv
sed -i '1ifrom_track,to_track,sim_measure' graph/rel_similar_songs.csv
sed -i '1itrack_id,tag_name,tag_w' graph/rel_song_has_tag.csv
sed -i '1isong_id,album_name' graph/rel_song_in_album.csv
sed -i '1isong_id,year' graph/rel_song_year.csv

---
# Load CSV files to Neo4j

Loading the data into Neo4j was done in the notebook [Step 3 - Load Subset to Neo.ipynb](./Step 3 - Load Subset to Neo.ipynb).

---
# Final Scripts

The notebook [Step 4 - Process Entire Dataset.ipynb](./Step 4 - Process Entire Dataset.ipynb) has the final scripts developed to process the entire Million Song Dataset and a description of how to run those scripts.