A small set of tools for parsing the Discogs XML data dumps. It started as a couple of Ruby scripts, then moved to Scala and Apache Spark because I wanted to explore the two.
It's very much a work in progress and exploratory code.
A couple of Ruby scripts to parse the dumps and manipulate the data.
Scala code for parsing the dumps with Apache Spark. The main classes are DeduplicateTracks (a utility that reduces the total number of tracks based on their artists, name and remixers) and ProcessDiscogs (which parses the CSV files generated by a Ruby script and extracts nodes and relationships, ready to be imported into Neo4j).
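
To illustrate the deduplication idea, here is a minimal sketch in plain Ruby. It is not the code of DeduplicateTracks, which runs on Spark. The Track struct, its field names and the normalisation rules (trimming, lower-casing, ignoring the order of artists and remixers) are all assumptions made for this example.

```ruby
# Hypothetical track record; the field names are assumptions,
# not the schema used by the real scripts.
Track = Struct.new(:artists, :name, :remixers)

# Build a comparison key from artists, name and remixers.
# Assumed normalisation: trim whitespace, lower-case, and sort
# the artist and remixer lists so their order does not matter.
def dedup_key(track)
  normalise = ->(s) { s.strip.downcase }
  [
    track.artists.map(&normalise).sort,
    normalise.call(track.name),
    track.remixers.map(&normalise).sort
  ]
end

# Keep only the first track seen for each key.
def deduplicate(tracks)
  tracks.uniq { |t| dedup_key(t) }
end

tracks = [
  Track.new(["Artist A"], "Song One", []),
  Track.new(["artist a "], "song one", []),         # same track, different spelling
  Track.new(["Artist A"], "Song One", ["Remixer B"]) # a remix counts as a separate track
]

unique = deduplicate(tracks)
puts unique.length
```

Including the remixers in the key keeps a remix distinct from the original track, even when artists and name match.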