
Public Datasets For Recommender Systems

This repository collects high-quality, topic-centric public data sources for Recommender Systems (RS). They were collected and tidied from Stack Overflow, articles, recommender sites and academic experiments. Most of the datasets presented here are free and have open source licenses; some are not, and for those you need to ask permission to use them or cite the authors' work.

In addition, this repository contains some datasets that have been pre-processed for academic experiments.

Links and dataset descriptions

Book

  • Book Crossing:: The BookCrossing (BX) dataset was collected by Cai-Nicolas Ziegler in a 4-week crawl (August / September 2004) from the Book-Crossing community.

Dating

  • Dating Agency:: This dataset contains 17,359,346 anonymous ratings of 168,791 profiles made by 135,359 LibimSeTi users as dumped on April 4, 2006.
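The counts quoted above imply a very sparse rating matrix, which is worth knowing before choosing a model. A minimal sketch of that arithmetic, using only the figures given in this entry (the function name `density` is illustrative and not part of any dataset tooling):

```python
def density(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that holds an observed rating."""
    return n_ratings / (n_users * n_items)


# Figures quoted for the Dating Agency dataset above:
# ratings, rating users, and rated profiles (treated here as the items).
dating = density(17_359_346, n_users=135_359, n_items=168_791)
print(f"{dating:.4%}")
```

Under these numbers, well under 0.1% of the possible user-profile pairs carry a rating.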

E-commerce

  • Amazon:: This dataset contains product reviews and metadata from Amazon, including 142.8 million reviews spanning May 1996 - July 2014
  • Retailrocket recommender system dataset:: The dataset consists of three files: a file with behaviour data (events.csv), a file with item properties (item_properties.csv) and a file describing the category tree (category_tree.csv). The data was collected from a real-world e-commerce website.

Music

  • Amazon Music:: This digital music dataset contains reviews and metadata from Amazon
  • Yahoo Music:: This dataset represents a snapshot of the Yahoo! Music community's preferences for various musical artists.
  • LastFM (Implicit):: This dataset contains social networking, tagging, and music artist listening information from a set of 2K users from Last.fm online music system.
  • Million Song Dataset:: The Million Song Dataset is a freely-available collection of audio features and metadata for a million contemporary popular music tracks.

Movies

  • MovieLens:: GroupLens Research has collected and made available rating datasets from their movie web site
  • Yahoo Movies:: This dataset contains ratings for songs collected from two different sources. The first source consists of ratings supplied by users during normal interaction with Yahoo! Music services.
  • CiaoDVD:: CiaoDVD is a dataset crawled from the entire category of DVDs from the dvd.ciao.co.uk website in December, 2013
  • FilmTrust:: FilmTrust is a small dataset crawled from the entire FilmTrust website in June, 2011
  • Netflix:: This is the official data set used in the Netflix Prize competition.

Games

  • Steam Video Games:: This dataset is a list of user behaviors, with columns: user-id, game-title, behavior-name, value. The behaviors included are 'purchase' and 'play'. The value indicates the degree to which the behavior was performed - in the case of 'purchase' the value is always 1, and in the case of 'play' the value represents the number of hours the user has played the game.
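The column layout described above can be read with the standard library alone. The sketch below is a hedged illustration, not the dataset's official loader: the sample rows are invented, the helper name `load_behaviors` is hypothetical, and it assumes comma-separated rows without a header whose first four fields are user-id, game-title, behavior-name and value.

```python
import csv
import io
from collections import defaultdict

# Invented sample rows in the documented column order:
# user-id, game-title, behavior-name, value
SAMPLE = """\
1,Game A,purchase,1.0
1,Game A,play,12.5
1,Game B,purchase,1.0
2,Game A,purchase,1.0
2,Game A,play,0.5
"""


def load_behaviors(handle):
    """Return purchased (user, game) pairs and hours played per pair."""
    purchases = set()
    hours = defaultdict(float)
    for row in csv.reader(handle):
        # Only the first four fields are used; extra fields are ignored.
        user_id, title, behavior, value = row[:4]
        key = (user_id, title)
        if behavior == "purchase":
            purchases.add(key)        # value is always 1 for purchases
        elif behavior == "play":
            hours[key] += float(value)  # value is hours played
    return purchases, dict(hours)


purchases, hours = load_behaviors(io.StringIO(SAMPLE))
```

Keeping purchases and play time separate makes it easy to treat a purchase without any recorded play as a weaker implicit signal than many hours played.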

Jokes

  • Jester:: This Joke dataset contains 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,496 users
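Because Jester ratings are continuous on a -10.00 to +10.00 scale, an experiment that also uses star-style ratings may need them on a common range first. A minimal sketch of a linear rescale; the target range of 1 to 5 and the function name `rescale` are assumptions chosen for illustration, not something the dataset prescribes:

```python
def rescale(rating, lo=-10.0, hi=10.0, new_lo=1.0, new_hi=5.0):
    """Linearly map a rating from [lo, hi] onto [new_lo, new_hi]."""
    if not lo <= rating <= hi:
        raise ValueError(f"rating {rating} is outside [{lo}, {hi}]")
    return new_lo + (rating - lo) * (new_hi - new_lo) / (hi - lo)


lowest = rescale(-10.0)
neutral = rescale(0.0)
highest = rescale(10.0)
```

The mapping preserves the ordering and relative spacing of ratings, so rank-based metrics are unaffected while error metrics become comparable across datasets.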

Food

  • Chicago Entree:: This dataset contains a record of user interactions with the Entree Chicago restaurant recommendation system.

Anime

  • Anime Recommendations Database:: This data set contains information on user preference data from 73,516 users on 12,294 anime. Each user is able to add anime to their completed list and give it a rating and this data set is a compilation of those ratings.

Other datasets

You can find more datasets in:

  • GroupLens Datasets link
  • LibRec Datasets link
  • Yahoo Research link
  • Datasets for Machine Learning link
  • Stanford Large Network Dataset Collection link

Usage and License

Before using these data sets, please review their README files or sites for the usage licenses, acknowledgments and other details.

Note: If you have difficulty downloading any of these datasets, please contact me. I have backups of all datasets.

Recommender Tools

Contributors

Arthur Fortes da Costa {fortes [dot] arthur [at] gmail [dot] com} (ICMC - USP) [Editor]