
A library to parse Discogs data dumps with Scala and Apache Spark. It's specific to my use case but it's a start.


alexquintino/discogs-parser


A small set of tools to parse the Discogs XML data dumps. It started as a couple of Ruby scripts and then moved to Scala and Apache Spark because I wanted to explore the two.

It's very much a work in progress and exploratory code.

Artifacts

discogs-parser-ruby

A couple of Ruby scripts to parse the dumps and manipulate the data.

discogs-parser-spark

Scala code for parsing the dumps with Apache Spark. The main classes are DeduplicateTracks (a utility that reduces the total number of tracks based on the artists, name and remixers) and ProcessDiscogs (which parses CSV files generated by a Ruby script and extracts nodes and relationships, ready to be imported into Neo4j).
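
To give a rough idea of what deduplicating on artists, name and remixers can look like, here is a minimal sketch in plain Ruby rather than Spark, so it runs without a cluster. It is not the code in DeduplicateTracks: the `Track` struct, its field names and the normalization rules (ignoring case, surrounding whitespace and the order of artists and remixers) are all assumptions made for illustration.

```ruby
# Hypothetical track record; the field names are assumptions,
# not the schema used by the repository.
Track = Struct.new(:id, :artists, :name, :remixers)

# Build a comparison key from artists, name and remixers.
# Case, surrounding whitespace and the order of artists/remixers
# are ignored (an assumed normalization, chosen for this sketch).
def dedup_key(track)
  normalize = ->(s) { s.strip.downcase }
  [
    track.artists.map(&normalize).sort,
    normalize.call(track.name),
    track.remixers.map(&normalize).sort
  ]
end

# Keep the first track seen for each key.
def deduplicate(tracks)
  tracks.uniq { |t| dedup_key(t) }
end

tracks = [
  Track.new(1, ["Artist A", "Artist B"], "Some Track", []),
  Track.new(2, ["artist b", "Artist A"], "some track ", []),
  Track.new(3, ["Artist A", "Artist B"], "Some Track", ["Remixer C"])
]

deduplicate(tracks).each { |t| puts "#{t.id}: #{t.name}" }
```

Tracks 1 and 2 collapse into one entry because they differ only in case, whitespace and artist order, while track 3 is kept because it has a remixer. In Spark the same idea would probably be expressed by keying each record and reducing by key, but that is a guess at the shape, not a description of the actual class.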
