Exploratory data analysis of the Million Song Dataset, visualized
.gitignore
README.md
grab_data.py
model.py
run.py
scrape_categories.py

Million Song Dataset Data Exploration

Exploratory data analysis and visualization of the “Million Song Dataset”, mainly using Python, Pandas and Matplotlib.

Notes:

Setup

If you want to run the code:

>>> git clone https://github.com/eltonlaw/msd_data_exploration.git

Then download the dataset, unzip it, and move the resulting folder into the repo:

>>> mv MillionSongSubset msd_data_exploration/MillionSongSubset

You also need the HDF5 helper functions from Thierry Bertin-Mahieux's GitHub repo:

>>> cd msd_data_exploration
>>> git clone https://github.com/tbertinmahieux/MSongsDB

Make a temp folder for the output graphs:

>>> mkdir temp

Your directory should look something like this now:

msd_data_exploration
│   grab_data.py
│   MillionSongSubset
│   model.py
├───MSongsDB
│   │   ...
│   └───PythonSrc
│       │   hdf5_descriptors.py
│       │   hdf5_getters.py
│       │   ...
│   README.md
│   run.py
│   scrape_categories.py
└───temp
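
If you want to double-check the layout before running anything, a small script along these lines can help. It is only a sketch and not part of the repo: the names `EXPECTED`, `missing_paths` and `count_h5_files` are made up for illustration, and the `MSongsDB/PythonSrc` path is an assumption about how the helper repo is laid out.

```python
from pathlib import Path

# Paths the setup steps above are expected to produce, relative to the repo root.
# "MSongsDB/PythonSrc" is an assumption about the helper repo's folder name.
EXPECTED = [
    "MillionSongSubset",
    "MSongsDB/PythonSrc/hdf5_getters.py",
    "temp",
    "run.py",
]


def missing_paths(repo_root="."):
    """Return the expected paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED if not (root / p).exists()]


def count_h5_files(data_dir="MillionSongSubset"):
    """Count the .h5 song files found anywhere below data_dir."""
    return sum(1 for _ in Path(data_dir).rglob("*.h5"))


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Layout OK,", count_h5_files(), "song files found")
```

Run it from inside `msd_data_exploration`; anything listed as missing points to a setup step that was skipped.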

Now you should be able to run the analysis with this command:

>>> python run.py

Categories

>>> python scrape_categories.py 

Columbia hosts an example datapoint for the dataset. I wrote a simple web scraper using Beautiful Soup that prints the categories and their descriptions, which avoids the hassle of going to the website each time. Example terminal output:

scraper

Initial Analysis

basic_info(categories=["tempo","duration","key","time_signature","song_hotttnesss"])

Prints the skew, distribution and pairwise correlations for the following five categories: tempo, duration, key, time_signature, song_hotttnesss.

Pairwise Correlations

correlation

The categories tested show no strong pairwise linear correlation.

Distribution

distribution

Skew

skew
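
For readers who want to see what the skew and pairwise correlation figures measure, here is a minimal standard-library sketch. It is not the code behind `basic_info`: the function names and the sample values are made up for illustration, and the skewness shown is the plain population version.

```python
from math import sqrt


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up values for illustration only, not taken from the dataset.
songs = {
    "tempo": [92.0, 120.5, 140.1, 100.3, 128.0],
    "duration": [210.0, 185.4, 252.7, 301.2, 198.9],
}
for name, values in songs.items():
    print(name, "skew:", round(skewness(values), 3))
print("tempo vs duration r:", round(pearson(songs["tempo"], songs["duration"]), 3))
```

With the data in a Pandas DataFrame, `DataFrame.skew()` and `DataFrame.corr()` cover the same ground. Pandas applies a bias correction to the skew, so its numbers will differ slightly from this sketch.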

Artist Locations from latitude/longitude

world_plot(lat="artist_latitude",lon="artist_longitude")

Plot the latitude and longitude of each artist.

artist location

Most of the datapoints come from North America and the EU, so the subset is not geographically representative of the population.

Normalized % Frequency for each year

freq_plot(category="year")

Plots the normalized frequency of songs for each year, with years in ascending order.

year frequency
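
The normalization behind a plot like this is simply each year's song count divided by the total number of songs. A minimal sketch, assuming the years are available as a plain list (`normalized_frequency` is a hypothetical helper, not the repo's `freq_plot`):

```python
from collections import Counter


def normalized_frequency(values):
    """Map each distinct value to its share of all values, sorted ascending by value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: counts[value] / total for value in sorted(counts)}


# Made-up years for illustration only.
years = [1999, 2005, 2005, 2007, 1999, 2005]
for year, share in normalized_frequency(years).items():
    print(year, f"{share:.1%}")
```

The shares sum to 1, which makes plots from subsets of different sizes comparable.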

Duration of Songs

stacked_bar_plot(full="duration",head_end="end_of_fade_in",tail_start="start_of_fade_out")

Plots the song duration in black and overlays the end of the fade in and start of fade out in red.

duration_fade_in-out

Average 'artist_hotttnesss' Over Time

compare_to_average(x_cat="year",y_cat="artist_hotttnesss")

Plots the raw y values and the average y for each x, then highlights the ranges where the per-x average is above or below the overall average.

Average Artist Hotness Over Time

Each raw datapoint represents a song. The blue line at the bottom represents the average "Artist Hotness" for each year that has raw datapoints. Because some years in between are missing, the line is discontinuous. The dotted black line represents the average of the yearly averages. Green areas mark year ranges where the yearly average is above that overall average. Red areas mark the opposite, where the yearly average is below it.
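
The comparison described above can be sketched in a few lines of standard-library Python. This is an illustration only: `yearly_averages` and `split_by_overall` are hypothetical names, and the repo's `compare_to_average` may compute things differently.

```python
from collections import defaultdict
from statistics import mean


def yearly_averages(years, values):
    """Average the values for each year; returns {year: mean}, sorted by year."""
    groups = defaultdict(list)
    for year, value in zip(years, values):
        groups[year].append(value)
    return {year: mean(groups[year]) for year in sorted(groups)}


def split_by_overall(averages):
    """Compare each yearly average with the mean of all yearly averages."""
    overall = mean(list(averages.values()))
    above = [year for year, avg in averages.items() if avg > overall]
    below = [year for year, avg in averages.items() if avg < overall]
    return overall, above, below


# Made-up (year, hotness) pairs for illustration only.
years = [2000, 2000, 2001, 2003]
hotness = [0.25, 0.75, 0.25, 0.75]
overall, above, below = split_by_overall(yearly_averages(years, hotness))
print("overall:", overall, "above:", above, "below:", below)
```

Note that the overall value is the mean of the yearly means, so a year with one song counts as much as a year with hundreds.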

Segment Max Loudness

error_bar(categories=["segments_loudness_max","segments_confidence"],data_start=[0,1],sec_i=[0,100])

Plots error bars for max loudness.

Error Bar for Max Loudness in each Segment

The full "segments_loudness_max" array contains 791 values for datapoint 0. This image shows the first 100 and the associated confidence values.

Dimensionality Reduction

dr(x_categories=["key","loudness","mode","tempo","year"],y_category=["artist_mbtags","artist_mbtags_count"])

Plots the data after reducing its dimensionality with t-SNE and PCA.

PCA TSNE

The result of going from 5 dimensions to 2 using the following categories: "key", "loudness", "mode", "tempo", "year". Both Principal Component Analysis (PCA) and t-distributed Stochastic Neighbor Embedding (t-SNE) were used.
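
To give a feel for the PCA half of this step, here is a rough NumPy sketch of projecting five features onto two principal components. It is not the repo's `dr` function: `pca_2d` is a hypothetical name, the input is random stand-in data, and standardizing the columns first is an assumption about a sensible preprocessing step rather than something the analysis is known to do. t-SNE is left out because it has no comparably short formulation.

```python
import numpy as np


def pca_2d(X):
    """Project the rows of X onto their first two principal components."""
    X = np.asarray(X, dtype=float)
    # Standardize each column (an assumption) so that features on large scales,
    # such as year, do not dominate features on small scales, such as mode.
    # A constant column would have zero standard deviation and must be dropped first.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T


# Random stand-in data: 100 songs x 5 features (not the real dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
print(pca_2d(X).shape)
```

Each output row is one song in the reduced two-dimensional space, ready to be passed to a scatter plot.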

Citations

Thierry Bertin-Mahieux, Daniel P.W. Ellis, Brian Whitman, and Paul Lamere. The Million Song Dataset. In Proceedings of the 12th International Society for Music Information Retrieval Conference (ISMIR 2011), 2011.

TO DO's

  • Set up a .tar.gz unzipper
  • Currently the entire dataset needs to be loaded into memory before any analysis can be done
  • Write hdf5 helper functions