GitHub - alexquintino/discogs-parser: A library to parse Discogs data dumps with Scala and Apache Spark. It's specific to my use case but it's a start.

A small set of tools to parse the Discogs XML data dumps. It started as a couple of Ruby scripts but then moved to Scala and Apache Spark because I wanted to explore the two.

It's very much a work in progress and exploratory code.

Artifacts

discogs-parser-ruby

A couple of Ruby scripts to parse the dumps and manipulate the data.

discogs-parser-spark

Scala code for parsing the dumps with Apache Spark. The main classes are DeduplicateTracks (a utility that reduces the total number of tracks based on artists, name, and remixers) and ProcessDiscogs (which parses the CSV files generated by a Ruby script and extracts nodes and relationships, ready to be imported into Neo4j).
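
To illustrate the idea behind deduplicating tracks by artists, name, and remixers, here is a minimal sketch in plain Ruby. It is not the actual DeduplicateTracks code, which is written in Scala and runs on Spark. The `Track` struct, its field names, and the normalization rules (ignoring case, surrounding whitespace, and the order of artists and remixers) are all assumptions made for this example.

```ruby
# Hypothetical track record; the field names are assumptions for illustration.
Track = Struct.new(:id, :artists, :name, :remixers)

# Build a comparison key from artists, name and remixers.
# Assumed normalization: ignore case, surrounding whitespace,
# and the order in which artists and remixers are listed.
def dedup_key(track)
  normalize = ->(s) { s.strip.downcase }
  [
    track.artists.map(&normalize).sort,
    normalize.call(track.name),
    track.remixers.map(&normalize).sort
  ]
end

# Keep the first track seen for each key.
def deduplicate(tracks)
  tracks.uniq { |t| dedup_key(t) }
end

tracks = [
  Track.new(1, ["Artist A", "Artist B"], "Song", []),
  Track.new(2, ["artist b", "Artist A"], "song ", []),
  Track.new(3, ["Artist A", "Artist B"], "Song", ["Remixer C"])
]

puts deduplicate(tracks).map(&:id).inspect  # prints [1, 3]
```

Tracks 1 and 2 collapse into one entry because they differ only in case, whitespace, and artist order, while track 3 is kept because it has a remixer. On Spark the same key-based grouping would be done across partitions rather than in memory.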
