eifuentes/lastfm-dataset-1K

Last.FM 1K User Dataset

This repository hosts the Last.fm Dataset - 1K users under the same license and terms as in the official README, copied for convenience below.

The dataset has been preprocessed and hosted for easier use in the standard PyData set of tools. In addition, an educational/developer friendly subset is hosted for quality assurance tasks and experimentation. See the releases section of this repository to download.

Release Files

  • userid-timestamp-artid-artname-traid-traname.tsv.zip (original ~1,000 user level event dataset from lastfm-dataset-1K.tar.gz)
  • userid-profile.tsv.zip (original ~1,000 user profile dataset from lastfm-dataset-1K.tar.gz)
  • README.txt (original README from lastfm-dataset-1K.tar.gz, see below as well)
  • lastfm-dataset-1k.snappy.parquet (processed userid-timestamp-artid-artname-traid-traname.tsv.zip)
  • lastfm-dataset-50.snappy.parquet (processed userid-timestamp-artid-artname-traid-traname.tsv.zip with 50 users sampled)

Preprocessing

The preprocessing done in the preprocessing.ipynb notebook consisted of the following steps.

  1. Load userid-timestamp-artid-artname-traid-traname.tsv.zip as a pandas DataFrame
  2. Remove malformed rows
  3. Convert timestamp string to a proper UTC datetime object
  4. Sort records by user_id and timestamp
  5. Save the original and sampled datasets as single Snappy-compressed Parquet files

The column headers were renamed to be user_id, timestamp, artist_id, artist_name, track_id, track_name.
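The steps above can be sketched roughly as follows. This is not the code from preprocessing.ipynb. It is a minimal illustration that assumes pandas 1.3 or newer, treats rows with too many fields or an unparseable timestamp as "malformed", and uses hypothetical helper names (`preprocess_events`, `sample_users`).

```python
import csv

import pandas as pd

# Column names as renamed in this repository.
COLUMNS = ["user_id", "timestamp", "artist_id", "artist_name", "track_id", "track_name"]


def preprocess_events(source):
    """Load the raw TSV (path, .zip path, or file-like object) and clean it."""
    df = pd.read_csv(
        source,
        sep="\t",
        header=None,             # the raw file has no header row
        names=COLUMNS,
        dtype=str,
        quoting=csv.QUOTE_NONE,  # track names may contain quote characters
        on_bad_lines="skip",     # assumption: rows with extra fields are malformed
    )
    # Convert the timestamp strings to UTC datetimes; bad values become NaT.
    df["timestamp"] = pd.to_datetime(df["timestamp"], utc=True, errors="coerce")
    # Assumption: rows without a user or a valid timestamp are also malformed.
    df = df.dropna(subset=["user_id", "timestamp"])
    return df.sort_values(["user_id", "timestamp"], kind="stable").reset_index(drop=True)


def sample_users(df, n_users, seed=0):
    """Keep all events for a random subset of users (hypothetical sampling scheme)."""
    users = pd.Series(df["user_id"].unique())
    chosen = users.sample(n=min(n_users, len(users)), random_state=seed)
    return df[df["user_id"].isin(chosen)].reset_index(drop=True)
```

A cleaned frame could then be written with `df.to_parquet("lastfm-dataset-1k.snappy.parquet", compression="snappy")`, which requires a Parquet engine such as pyarrow to be installed.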


README

Version 1.0, May 2010

What is this?

This dataset contains <user, timestamp, artist, song> tuples collected from the Last.fm API, using the user.getRecentTracks() method.

This dataset represents the complete listening habits (until May 5th, 2009) of nearly 1,000 users.

Files

Filename MD5
userid-timestamp-artid-artname-traid-traname.tsv 64747b21563e3d2aa95751e0ddc46b68
userid-profile.tsv c53608b6b445db201098c1489ea497df
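To check a download against the checksums listed above, something like the following sketch may help. It uses only the Python standard library. The helper names (`md5_of_file`, `verify`) are illustrative, and the checksums are assumed to apply to the extracted .tsv files rather than to the .zip archives in the releases section.

```python
import hashlib

# MD5 checksums as listed in the README for the extracted .tsv files.
EXPECTED_MD5 = {
    "userid-timestamp-artid-artname-traid-traname.tsv": "64747b21563e3d2aa95751e0ddc46b68",
    "userid-profile.tsv": "c53608b6b445db201098c1489ea497df",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, name):
    """Compare a local file against the checksum listed for `name`."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

Reading in chunks keeps memory use flat, which matters for the events file with its roughly 19 million lines.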

Data Statistics

userid-timestamp-artid-artname-traid-traname.tsv

Element Statistic
Total Lines 19,150,868
Unique Users 992
Artists with MBID 107,528
Artists without MBID 69,420

Data Format

The data is formatted one entry per line as follows (tab separated, \t)

userid-timestamp-artid-artname-traid-traname.tsv

userid \t timestamp \t musicbrainz-artist-id \t artist-name \t musicbrainz-track-id \t track-name

userid-profile.tsv

userid \t gender ('m'|'f'|empty) \t age (int|empty) \t country (str|empty) \t signup (date|empty)

Example:

userid-timestamp-artid-artname-traid-traname.tsv

user_000639 \t 2009-04-08T01:57:47Z \t MBID \t The Dogs D'Amour \t MBID \t Fall in Love Again?
user_000639 \t 2009-04-08T01:53:56Z \t MBID \t The Dogs D'Amour \t MBID \t Wait Until I'm Dead
...

userid-profile.tsv

user_000639 \t m \t \t Mexico \t Apr 27, 2005
...
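As an illustration of the two line formats, here is a minimal standard-library parser. It rests on several assumptions: the files contain raw tab characters with no surrounding spaces (the spaces around \t above are read as presentational), an empty value appears as an empty field between two tabs, and signup dates follow the "Apr 27, 2005" pattern with English month abbreviations. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Optional


@dataclass
class Event:
    user_id: str
    timestamp: datetime
    artist_mbid: Optional[str]
    artist_name: str
    track_mbid: Optional[str]
    track_name: str


@dataclass
class Profile:
    user_id: str
    gender: Optional[str]
    age: Optional[int]
    country: Optional[str]
    signup: Optional[date]


def parse_event(line):
    """Parse one line of userid-timestamp-artid-artname-traid-traname.tsv."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    user_id, ts, artist_mbid, artist_name, track_mbid, track_name = fields
    # Timestamps look like 2009-04-08T01:57:47Z; the trailing Z means UTC.
    timestamp = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return Event(user_id, timestamp, artist_mbid or None, artist_name,
                 track_mbid or None, track_name)


def parse_profile(line):
    """Parse one line of userid-profile.tsv; empty fields become None."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    user_id, gender, age, country, signup = fields
    return Profile(
        user_id,
        gender or None,
        int(age) if age else None,
        country or None,
        datetime.strptime(signup, "%b %d, %Y").date() if signup else None,
    )
```

Raising on an unexpected field count makes malformed rows visible instead of silently shifting columns.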

License

The data contained in lastfm-dataset-1K.tar.gz is distributed with permission of Last.fm.

The data is made available for non-commercial use.

Those interested in using the data or web services in a commercial context should contact partners [at] last [dot] fm.

For more information see Last.fm terms of service.

Acknowledgements

Thanks to Last.fm for providing access to this data via their web services.

Special thanks to Norman Casagrande.

References

When using this dataset you must reference the Last.fm webpage.

Optionally (not mandatory at all!), you can cite Chapter 3 of this book

@book{Celma:Springer2010,
  author = {Celma, O.},
  title = {{Music Recommendation and Discovery in the Long Tail}},
  publisher = {Springer},
  year = {2010}
}

Contact

This data was collected by Òscar Celma @ MTG/UPF