# Import entire 1,000,000 song dataset to Neo4j

*Andrea Soto*  
*MIDS W205 Final Project*  
*Project Name: Graph Model of the Million Song Dataset*

---

# Notebook Overview

This notebook continues [Step 4 - Process Entire Dataset.ipynb](./Step 4 - Process Entire Dataset.ipynb), which creates the files used here.

Spark writes one 'part-000xx' output file per partition. These files were grouped into folders named after the node or relationship type they contain. To make the files easier to list and reference, they were renamed and copied from the folders inside '/graph/import/tmp' to the '/graph/neo4j/data/import/' folder. A header file was also created for each node and relationship type.

The table below summarizes the current folders and how the 'part-000xx' files inside them will be renamed.

|Folder (under /graph/import/tmp)|Target file names (name-0xx)|Header file|
|:--|:--|:--|
|nodes_artists|artist|hdr-artist.csv|
|nodes_songs |song|hdr-song.csv|
|nodes_albums|album|hdr-album.csv|
|nodes_years|year|hdr-year.csv|
|nodes_tags|tag|hdr-tag.csv|
|rel_similar_artists|sim-a|hdr-sim-a.csv|
|rel_performs|perf|hdr-perf.csv|
|rel_artist_has_album|has-album|hdr-has-album.csv|
|rel_artist_has_tag|a-has-tag|hdr-a-has-tag.csv|
|rel_song_in_album|in-album|hdr-in-album.csv|
|rel_similar_songs|sim-s|hdr-sim-s.csv|
|rel_song_has_tag|s-has-tag|hdr-s-has-tag.csv|
|rel_song_year|release|hdr-release.csv|


Finally, Neo4j's [import tool](http://neo4j.com/docs/stable/import-tool.html) was used to load the graph.
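
The naming convention in the table above can be illustrated with bash parameter expansion. This is only a sketch: `new_name` is a hypothetical helper, not part of the project scripts, and it assumes part numbers below 1000 so that every file name starts with `part-00`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map a Spark part file name to its target name.
new_name() {
    local part_file=$1   # e.g. part-00012
    local prefix=$2      # e.g. artist
    # Replace the leading "part-00" with "<prefix>-", keeping the last three digits.
    echo "${part_file/#part-00/${prefix}-}"
}

new_name part-00000 artist   # prints artist-000
new_name part-00012 sim-a    # prints sim-a-012
```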


---

# Rename, Copy, and Create Headers

**Script Path and Name:** scripts/rename-and-move.sh  
**Script Description:** Renames the 'part-000xx' files to 'name-0xx', where the name is based on the node or relationship type of the file. The renamed files are then copied from their current location to the folder '/graph/neo4j/data/import'. Finally, the header files are created.

In [7]:
%%writefile scripts/rename-and-move.sh
#!/usr/bin/env bash

folders=(nodes_artists nodes_songs nodes_albums nodes_years nodes_tags rel_similar_artists rel_performs rel_artist_has_album rel_artist_has_tag rel_song_in_album rel_similar_songs rel_song_has_tag rel_song_year)
newName=(artist song album year tag sim-a perf has-album a-has-tag in-album sim-s s-has-tag release)

# Rename files: part-00xxx -> <name>-xxx
# (uses the util-linux syntax: rename <from> <to> <files>)
echo "LOG: Renaming files..."
for i in {0..12}
do
    rename part-00 "${newName[$i]}-" "/graph/import/tmp/${folders[$i]}"/part*
done

# Copy the renamed files to the Neo4j import folder
echo "LOG: Copying files to /graph/neo4j/data/import..."
for i in {0..12}
do
    cp "/graph/import/tmp/${folders[$i]}/${newName[$i]}"* /graph/neo4j/data/import/
done

echo LOG: Creating headers...
# Create header CSV files
echo "id:ID(artist),idmb,id7d,name" > /graph/neo4j/data/import/hdr-artist.csv
echo "songid,trackid:ID(song),title,danceability:FLOAT,duration:FLOAT,energy:FLOAT,loudness:FLOAT" > /graph/neo4j/data/import/hdr-song.csv
echo "name:ID(album)" > /graph/neo4j/data/import/hdr-album.csv
echo "year:ID(year)" > /graph/neo4j/data/import/hdr-year.csv
echo "tag:ID(tag)" > /graph/neo4j/data/import/hdr-tag.csv
 
echo ":START_ID(artist),:END_ID(artist)" > /graph/neo4j/data/import/hdr-sim-a.csv
echo ":START_ID(artist),:END_ID(song)" > /graph/neo4j/data/import/hdr-perf.csv
echo ":START_ID(artist),:END_ID(album)" > /graph/neo4j/data/import/hdr-has-album.csv
echo ":START_ID(artist),:END_ID(tag),frq,weight" > /graph/neo4j/data/import/hdr-a-has-tag.csv
echo ":START_ID(song),:END_ID(album)" > /graph/neo4j/data/import/hdr-in-album.csv
echo ":START_ID(song),:END_ID(song),weight" > /graph/neo4j/data/import/hdr-sim-s.csv
echo ":START_ID(song),:END_ID(tag),weight" > /graph/neo4j/data/import/hdr-s-has-tag.csv
echo ":START_ID(song),:END_ID(year)" > /graph/neo4j/data/import/hdr-release.csv

Overwriting scripts/rename-and-move.sh
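
Because the data files have no header row of their own, each header written above has to list exactly as many fields as the rows of its data files. A minimal sketch of such a check follows. `field_count` and `same_columns` are hypothetical helpers, and the comma split is naive: it assumes no quoted commas inside field values.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of comma-separated fields on the first line of a file.
# Naive split: a quoted value containing a comma would be counted as two fields.
field_count() {
    head -n 1 "$1" | awk -F',' '{ print NF }'
}

# Hypothetical helper: succeeds when the header and the first data row
# have the same number of fields.
same_columns() {
    local header=$1 data=$2
    [ "$(field_count "$header")" -eq "$(field_count "$data")" ]
}
```

For example, `same_columns /graph/neo4j/data/import/hdr-artist.csv /graph/neo4j/data/import/artist-000 || echo "column mismatch"` would flag a header that does not line up with its first data file.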


In [8]:
!chmod a+x scripts/rename-and-move.sh

In [9]:
!time scripts/rename-and-move.sh

LOG: Renaming files...
LOG: Copying files to /graph/neo4j/data/import...
LOG: Creating headers...

real	0m7.020s
user	0m0.064s
sys	0m4.862s


**Check header files exist**

In [11]:
!ls -l /graph/neo4j/data/import/hdr*

-rw-rw-r-- 1 asoto asoto 54 Dec 20 19:42 /graph/neo4j/data/import/hdr-a-has-tag.csv
-rw-rw-r-- 1 asoto asoto 15 Dec 20 19:42 /graph/neo4j/data/import/hdr-album.csv
-rw-rw-r-- 1 asoto asoto 29 Dec 20 19:42 /graph/neo4j/data/import/hdr-artist.csv
-rw-rw-r-- 1 asoto asoto 33 Dec 20 19:42 /graph/neo4j/data/import/hdr-has-album.csv
-rw-rw-r-- 1 asoto asoto 31 Dec 20 19:42 /graph/neo4j/data/import/hdr-in-album.csv
-rw-rw-r-- 1 asoto asoto 32 Dec 20 19:42 /graph/neo4j/data/import/hdr-perf.csv
-rw-rw-r-- 1 asoto asoto 30 Dec 20 19:42 /graph/neo4j/data/import/hdr-release.csv
-rw-rw-r-- 1 asoto asoto 42 Dec 20 19:42 /graph/neo4j/data/import/hdr-s-has-tag.csv
-rw-rw-r-- 1 asoto asoto 34 Dec 20 19:42 /graph/neo4j/data/import/hdr-sim-a.csv
-rw-rw-r-- 1 asoto asoto 43 Dec 20 19:42 /graph/neo4j/data/import/hdr-sim-s.csv
-rw-rw-r-- 1 asoto asoto 92 Dec 20 19:42 /graph/neo4j/data/import/hdr-song.csv
-rw-rw-r-- 1 asoto asoto 12 Dec 20 19:42 /graph/neo4j/data/import/hdr-tag.csv
-rw-rw-r-- 1 a
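
The listing above is checked by eye. As a sketch of an automated alternative, the hypothetical helper below prints the name of every expected header file that is missing from a directory. The list of names is copied from the `newName` array in scripts/rename-and-move.sh.

```shell
#!/usr/bin/env bash
# Names taken from the newName array in scripts/rename-and-move.sh.
names=(artist song album year tag sim-a perf has-album a-has-tag in-album sim-s s-has-tag release)

# Hypothetical helper: print each expected header file that is missing from a directory.
missing_headers() {
    local dir=$1 name
    for name in "${names[@]}"; do
        [ -f "$dir/hdr-$name.csv" ] || echo "hdr-$name.csv"
    done
}
```

Running `missing_headers /graph/neo4j/data/import` prints nothing when all thirteen header files are present.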

**Check content files exist (only showing files ending in 001)**

In [14]:
!ls -l /graph/neo4j/data/import/*001

-rw-r--r-- 1 asoto asoto  2065171 Dec 20 19:42 /graph/neo4j/data/import/a-has-tag-001
-rw-r--r-- 1 asoto asoto   102126 Dec 20 19:42 /graph/neo4j/data/import/album-001
-rw-r--r-- 1 asoto asoto   179420 Dec 20 19:42 /graph/neo4j/data/import/artist-001
-rw-r--r-- 1 asoto asoto   283378 Dec 20 19:42 /graph/neo4j/data/import/has-album-001
-rw-r--r-- 1 asoto asoto  1204867 Dec 20 19:42 /graph/neo4j/data/import/in-album-001
-rw-r--r-- 1 asoto asoto  1165194 Dec 20 19:42 /graph/neo4j/data/import/perf-001
-rw-r--r-- 1 asoto asoto   374808 Dec 20 19:42 /graph/neo4j/data/import/release-001
-rw-r--r-- 1 asoto asoto 12026173 Dec 20 19:42 /graph/neo4j/data/import/s-has-tag-001
-rw-r--r-- 1 asoto asoto  5229788 Dec 20 19:42 /graph/neo4j/data/import/sim-a-001
-rw-r--r-- 1 asoto asoto 85827001 Dec 20 19:42 /graph/neo4j/data/import/sim-s-001
-rw-r--r-- 1 asoto asoto  2530473 Dec 20 19:42 /graph/neo4j/data/import/song-001
-rw-r--r-- 1 asoto asoto   143381 Dec 20 19:42 /graph/neo4j/data/import

**Sample content files for Song nodes**

In [15]:
!ls /graph/neo4j/data/import/song-*

/graph/neo4j/data/import/song-000  /graph/neo4j/data/import/song-016
/graph/neo4j/data/import/song-001  /graph/neo4j/data/import/song-017
/graph/neo4j/data/import/song-002  /graph/neo4j/data/import/song-018
/graph/neo4j/data/import/song-003  /graph/neo4j/data/import/song-019
/graph/neo4j/data/import/song-004  /graph/neo4j/data/import/song-020
/graph/neo4j/data/import/song-005  /graph/neo4j/data/import/song-021
/graph/neo4j/data/import/song-006  /graph/neo4j/data/import/song-022
/graph/neo4j/data/import/song-007  /graph/neo4j/data/import/song-023
/graph/neo4j/data/import/song-008  /graph/neo4j/data/import/song-024
/graph/neo4j/data/import/song-009  /graph/neo4j/data/import/song-025
/graph/neo4j/data/import/song-010  /graph/neo4j/data/import/song-026
/graph/neo4j/data/import/song-011  /graph/neo4j/data/import/song-027
/graph/neo4j/data/import/song-012  /graph/neo4j/data/import/song-028
/graph/neo4j/data/import/song-013  /graph/neo4j/data/import/song-029
/graph/neo4j/data/im
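
Before importing, it can be useful to know how many rows each node or relationship type contributes. The sketch below totals the lines across all data files of one type. `count_rows` is a hypothetical helper and assumes one record per line, with every file ending in a newline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: total number of lines across all data files of one type.
# The pattern <prefix>-[0-9]* skips the hdr-*.csv header files.
count_rows() {
    local dir=$1 prefix=$2
    cat "$dir/$prefix"-[0-9]* | wc -l | tr -d ' '
}
```

For example, `count_rows /graph/neo4j/data/import song` gives the number of song rows that will be offered to the import tool.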

---

# Import to Neo4j

**Script Path and Name:** scripts/importToNeo.sh  
**Script Description:** Imports the graph into Neo4j using the `neo4j-import` tool.

In [47]:
%%writefile scripts/importToNeo.sh
#!/usr/bin/env bash

cd /graph/neo4j/data/import

$NEO4J_HOME/neo4j-import --into /graph/neo4j/data/graph.db \
--skip-duplicate-nodes true --skip-bad-relationships true \
--ignore-empty-strings true \
--bad-tolerance 9000000 \
--processors $(nproc) \
--id-type string \
--nodes:ARTIST hdr-artist.csv,$(echo $(ls art*) | tr ' ' ,)  \
--nodes:SONG   hdr-song.csv,$(echo $(ls song*) | tr ' ' ,)  \
--nodes:ALBUM  hdr-album.csv,$(echo $(ls album*) | tr ' ' ,)  \
--nodes:YEAR   hdr-year.csv,$(echo $(ls year*) | tr ' ' ,)  \
--nodes:TAG    hdr-tag.csv,$(echo $(ls tag*) | tr ' ' ,)  \
--relationships:SIMILAR_TO  hdr-sim-a.csv,$(echo $(ls sim-a*) | tr ' ' ,)  \
--relationships:PERFORMS    hdr-perf.csv,$(echo $(ls perf*) | tr ' ' ,)  \
--relationships:HAS_ALBUM   hdr-has-album.csv,$(echo $(ls has-album*) | tr ' ' ,)  \
--relationships:HAS_TAG     hdr-a-has-tag.csv,$(echo $(ls a-has-tag*) | tr ' ' ,)  \
--relationships:IN_ALBUM    hdr-in-album.csv,$(echo $(ls in-album*) | tr ' ' ,)  \
--relationships:SIMILAR_TO  hdr-sim-s.csv,$(echo $(ls sim-s*) | tr ' ' ,)  \
--relationships:HAS_TAG     hdr-s-has-tag.csv,$(echo $(ls s-has-tag*) | tr ' ' ,)  \
--relationships:RELEASED_ON hdr-release.csv,$(echo $(ls release*) | tr ' ' ,) 

Overwriting scripts/importToNeo.sh
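
The script builds each comma-separated file list with `$(echo $(ls art*) | tr ' ' ,)`, which works as long as no file name contains a space. As a sketch of an alternative that avoids parsing `ls` output, the hypothetical helper below joins a header and the matching data files using a glob and `IFS`. If nothing matches the prefix, the unexpanded pattern is returned as-is, so this assumes the files exist.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "<header>,<file1>,<file2>,..." for all files
# in the current directory whose names start with the given prefix.
csv_list() {
    local header=$1 prefix=$2
    local files=("$prefix"*)   # glob results come back in sorted order
    local IFS=,                # "${files[*]}" joins on the first character of IFS
    echo "$header,${files[*]}"
}
```

Inside the import script this could be used as `--nodes:SONG "$(csv_list hdr-song.csv song-)"`, after the `cd /graph/neo4j/data/import` line.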


In [45]:
!chmod a+x scripts/importToNeo.sh

The data is loaded into the default location **data/graph.db**. 

If this graph already contains data, either rename the folder or remove the graph by running one of the following commands:

> `mv /graph/neo4j/data/graph.db /graph/neo4j/data/new_name.db`  
> `rm -r /graph/neo4j/data/graph.db/`
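
The rename option can be wrapped in a small guard so that an existing store is never overwritten by accident. This is a sketch only: `backup_graph_db` is a hypothetical helper, and the timestamped `.bak` suffix is an arbitrary naming choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: move an existing store aside under a timestamped name
# and print the new location. Does nothing if the directory is absent.
backup_graph_db() {
    local db=${1:-/graph/neo4j/data/graph.db}
    if [ -d "$db" ]; then
        local backup="${db%/}.$(date +%Y%m%d-%H%M%S).bak"
        mv "$db" "$backup" && echo "$backup"
    fi
}
```

Calling `backup_graph_db` with no argument before running the import script acts on the default location used in this notebook.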

In [54]:
!time scripts/importToNeo.sh

Importing the contents of these files into /graph/neo4j/data/graph.db:
Nodes:
  :ARTIST
  /graph/neo4j/data/import/hdr-artist.csv
  /graph/neo4j/data/import/artist-000
  /graph/neo4j/data/import/artist-001
  /graph/neo4j/data/import/artist-002
  /graph/neo4j/data/import/artist-003
  /graph/neo4j/data/import/artist-004
  /graph/neo4j/data/import/artist-005
  /graph/neo4j/data/import/artist-006
  /graph/neo4j/data/import/artist-007
  /graph/neo4j/data/import/artist-008
  /graph/neo4j/data/import/artist-009
  /graph/neo4j/data/import/artist-010
  /graph/neo4j/data/import/artist-011
  /graph/neo4j/data/import/artist-012
  /graph/neo4j/data/import/artist-013
  /graph/neo4j/data/import/artist-014
  /graph/neo4j/data/import/artist-015
  /graph/neo4j/data/import/artist-016
  /graph/neo4j/data/import/artist-017
  /graph/neo4j/data/import/artist-018
  /graph/neo4j/data/import/artist-019
  /graph/neo4j/data/import/artist-020
  /graph/neo4j/data/import/artist-021
  /graph/neo4j/data/import/artist-

**Final Size of Million Song Graph**

In [56]:
!du -h /graph/neo4j/data/graph.db

1.9M	/graph/neo4j/data/graph.db/schema/label/lucene
1.9M	/graph/neo4j/data/graph.db/schema/label
1.9M	/graph/neo4j/data/graph.db/schema
6.4G	/graph/neo4j/data/graph.db


---
# Analysis

Some basic analysis of the data loaded into Neo4j was done in the notebook [Step 6 - Analysis.ipynb](./Step 6 - Analysis.ipynb).