# Explore the HDF5 files of the MSS dataset

The `pandas` module provides the `HDFStore` class, which reads data from HDF5 files such as the 10,000 files that make up the Million Song Subset (MSS) dataset. `HDFStore` depends on the `PyTables` package. To install this package from a console:

> `$ conda install --name python3 PyTables`

Or you can run the Docker container `dataspace/datalab-notebook`:

> `$ docker run -v /Users/david:/home/jovyan/work -it --rm -p 8888:8888 dataspace/datalab-notebook`

Make sure to replace `/Users/david` with your home folder.

### Load libraries

- `pandas`: dataframes
- `numpy`: arrays
- `itertools`: iterate through data structures
- `os`: access the host operating system
- `re`: regular expressions 

In [1]:
import pandas    as pd
import numpy     as np
import itertools as it
import os 
import re

### Define utility functions `get_filenames` and `unlist`

The `get_filenames` function recursively obtains the names of all files in the `path` directory and all of its subdirectories. It returns a multi-level (nested) list if `path` contains subdirectories. The `unlist` function flattens a list by removing one level of nesting.

In [2]:
def get_filenames(path):
    return([get_filenames(path+"/"+entry.name)
            if entry.is_dir() 
            else path+"/"+entry.name 
            for entry 
            in os.scandir(path)
           ])

In [3]:
def unlist(alist):
    return(list(it.chain.from_iterable(alist)
               )
          )
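As a possible alternative to the two helpers above, the standard-library function `os.walk` yields a flat list directly, whatever the depth of the tree. This is only a sketch: the name `list_files` and the temporary directory tree are made up for illustration and are not part of the notebook.

```python
import os
import tempfile

def list_files(path):
    """Return a flat, sorted list of every file under `path`, at any depth."""
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(path)
        for name in filenames
    )

# Demonstration on a throwaway directory tree (file names are made up).
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "A", "B"))
    for relative in ("top.h5",
                     os.path.join("A", "mid.h5"),
                     os.path.join("A", "B", "deep.h5")):
        open(os.path.join(root, relative), "w").close()
    found = [os.path.relpath(f, root) for f in list_files(root)]
    print(found)
```

Because the result is already flat, the number of `unlist` calls no longer has to match the depth of the directory tree.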

### Create file list

The `/home/jovyan/work` directory (inside the container) mirrors `/Users/david` (outside the container, on the host, which is your laptop). This mapping of directories is set up in the command that starts Jupyter Notebook:

> `$ docker run -v /Users/david:/home/jovyan/work -it --rm -p 8888:8888 dataspace/datalab-notebook`

The `path` variable stores the root of the directory tree containing the `10,000` song files. 

In [125]:
path = "/home/jovyan/work/Dropbox/Data/MillionSongSubset/data"

Now use the two functions above (`get_filenames` and `unlist`) to create the list of 10,000 files.

The function `get_filenames` returns a multi-level list. Each call to `unlist` removes one level, so three calls flatten it into a list of full-path filenames, which is stored in the variable `filenames`.

In [5]:
filenames = unlist(unlist(unlist(get_filenames(path))))
filenames[0:2]

['/home/jovyan/work/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/home/jovyan/work/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

### Store in `filenames` only the songs files

The song files contain the string "/TR" and end with extension `.h5`.

In [5]:
p = re.compile(r"/TR.*\.h5$")  # raw string, so "\." reaches the regex engine unchanged
filenames = [filename 
             for filename 
             in filenames 
             if p.search(filename)
            ]
filenames[0:2]

['/home/jovyan/work/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/home/jovyan/work/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']
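The filter can be tried on a few made-up paths without touching the dataset. The sketch below uses a slightly stricter variant of the pattern (`[^/]*` instead of `.*`), which is an assumption on my part: it requires `TR` to start the file name itself rather than any directory along the path.

```python
import re

# Stricter variant (assumption): "TR" must begin the final path component.
song_pattern = re.compile(r"/TR[^/]*\.h5$")

# Made-up paths for illustration only.
candidates = [
    "/data/A/A/A/TRAAAAW128F429D538.h5",
    "/data/A/A/A/.DS_Store",
    "/data/AdditionalFiles/subset_track_metadata.db",
    "/data/TRASH/notes.h5",
]

songs = [name for name in candidates if song_pattern.search(name)]
print(songs)
```

The last candidate would pass the notebook's original pattern, because `.*` can match across `/`; whether that matters depends on what else lives in the data directory.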

### Verify we have the `10,000` files we expected

In [6]:
len(filenames)

10000

## Investigate the data stored in a song file

In [11]:
filenames[0:2]

['/home/jovyan/work/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/home/jovyan/work/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

Use `HDFStore` to open one of the HDF5 files, here the file at index `1000` of `filenames`.

In [124]:
store=pd.HDFStore(filenames[1000])

List the three `Groups` in the file. 

In [76]:
store.root

/ (RootGroup) 'H5 Song File'
  children := ['musicbrainz' (Group), 'metadata' (Group), 'analysis' (Group)]

We focus on two groups 

- `metadata` which contains data produced by people
- `analysis` which contains data produced by computers

and ignore the `musicbrainz` group (for now).

### Objects of the `metadata` group

This group contains five objects:

- One object of type `Table` named `songs`
- Four objects of type `EArray` named `artist_terms_freq`, `artist_terms_weight`, `artist_terms` and `similar_artists`

In [77]:
store.root.metadata

/metadata (Group) 'metadata about the song'
  children := ['songs' (Table), 'artist_terms_freq' (EArray), 'artist_terms_weight' (EArray), 'artist_terms' (EArray), 'similar_artists' (EArray)]

Each one of these objects will be inspected below. The `read` method will retrieve the data. 

### The `songs` object of the `metadata` group

Looks like a table based on the field names.

In [78]:
store.root.metadata.songs

/metadata/songs (Table(1,), shuffle, zlib(1)) 'table of metadata for one song'
  description := {
  "analyzer_version": StringCol(itemsize=32, shape=(), dflt=b'', pos=0),
  "artist_7digitalid": Int32Col(shape=(), dflt=0, pos=1),
  "artist_familiarity": Float64Col(shape=(), dflt=0.0, pos=2),
  "artist_hotttnesss": Float64Col(shape=(), dflt=0.0, pos=3),
  "artist_id": StringCol(itemsize=32, shape=(), dflt=b'', pos=4),
  "artist_latitude": Float64Col(shape=(), dflt=0.0, pos=5),
  "artist_location": StringCol(itemsize=1024, shape=(), dflt=b'', pos=6),
  "artist_longitude": Float64Col(shape=(), dflt=0.0, pos=7),
  "artist_mbid": StringCol(itemsize=40, shape=(), dflt=b'', pos=8),
  "artist_name": StringCol(itemsize=1024, shape=(), dflt=b'', pos=9),
  "artist_playmeid": Int32Col(shape=(), dflt=0, pos=10),
  "genre": StringCol(itemsize=1024, shape=(), dflt=b'', pos=11),
  "idx_artist_terms": Int32Col(shape=(), dflt=0, pos=12),
  "idx_similar_artists": Int32Col(shape=(), dflt=0, pos=13),
  "rel

The `read` method retrieves the data, which is stored in the variable `x` so we can check its type and display it.

In [79]:
x = store.root.metadata.songs.read()
print("type(x): ",type(x))
print("x      : ",x)

type(x):  <class 'numpy.ndarray'>
x      :  [ (b'', 4550, 0.7478153693139534, 0.459189194680227, b'ARTWPGO1187FB52DC3', nan, b'Kingston, Jamaica', nan, b'cd5f7bd9-ee6f-4a99-8dd9-dc3afd9da736', b'Elephant Man', 535, b'', 0, 0, b'Riddim Driven: Power Cuts', 74571, 0.23962909777363708, b'SOLMCUV12A6BD54773', b'Loud And Clear', 769111)]


Notice:

- `artist_name`: Elephant Man
- `title`: Loud and Clear
- `genre`: [blank, missing]
- `release`: Riddim Driven: Power Cuts

YouTube: https://www.youtube.com/watch?v=P26QXy01JGY

- Duration: 2:08 (see [here](https://www.beatport.com/release/riddim-driven-power-cut/138137))

### The `artist_terms` object of the `metadata` group

In [80]:
store.root.metadata.artist_terms

/metadata/artist_terms (EArray(10,), shuffle, zlib(1)) 'array of terms (Echo Nest tags) for an artist'
  atom := StringAtom(itemsize=256, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := (32,)

It is an array with `10` elements for this song.

In [81]:
x = store.root.metadata.artist_terms.read()

print("type(x): ",type(x))
print("x      : ",x)

type(x):  <class 'numpy.ndarray'>
x      :  [b'dancehall' b'reggae' b'hip hop' b'jamaica' b'raga' b'kingston' b'urban'
 b'energetic' b'acoustic' b'pop']


They look like _terms_ that describe the _artist_.

### The `artist_terms_freq` object of the `metadata` group

In [82]:
store.root.metadata.artist_terms_freq

/metadata/artist_terms_freq (EArray(10,), shuffle, zlib(1)) 'array of term (Echo Nest tags) frequencies for an artist'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)

Another array of 10 elements, one frequency per term.

In [83]:
x = store.root.metadata.artist_terms_freq.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([ 1.        ,  0.94735746,  0.94730019,  0.6848307 ,  0.65292715,
        0.44654665,  0.23787286,  0.1577559 ,  0.27394067,  0.31898712])

The difference between these frequencies and the weights below is unclear.

### The `artist_terms_weight` object of the `metadata` group

In [23]:
store.root.metadata.artist_terms_weight

/metadata/artist_terms_weight (EArray(38,), shuffle, zlib(1)) 'array of term (Echo Nest tags) weights for an artist'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)

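Another array, here with 38 elements. That differs from the 10 elements reported for `artist_terms` and `artist_terms_freq`, so this output was probably captured from a different song file.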

In [30]:
x = store.root.metadata.artist_terms_weight.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([ 1.        ,  0.8459884 ,  0.83068957,  0.79929112,  0.7882742 ,
        0.78409474,  0.78162137,  0.77157135,  0.76701044,  0.76259756,
        0.74642957,  0.72313246,  0.71631195,  0.69145399,  0.68487963,
        0.68092924,  0.66818023,  0.65671327,  0.64044927,  0.62919326,
        0.62628879,  0.62208271,  0.60905077,  0.60905061,  0.60905036,
        0.60905025,  0.6090495 ,  0.60681945,  0.60252961,  0.59761364,
        0.56556785,  0.55928456,  0.55928206,  0.5592817 ,  0.55928128,
        0.55631773,  0.54045413,  0.53544559])

### The `similar_artists` object of the `metadata` group

In [25]:
store.root.metadata.similar_artists

/metadata/similar_artists (EArray(100,), shuffle, zlib(1)) 'array of similar artists Echo Nest id'
  atom := StringAtom(itemsize=20, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := (409,)

I need to check these array lengths for other songs. Is `100` a maximum?

In [32]:
x = store.root.metadata.similar_artists.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([b'ARSZWK21187B9B26D7', b'ARLDW2Y1187B9B544F', b'ARG0TXR1187FB4E708',
       b'AR6Z8OF1187FB5216E', b'ARUAG4R1187FB53500', b'AR4M1NA1187FB54703',
       b'ARACWDD11F4C83EFB0', b'ARUSW6X1187FB3CF6E', b'ARJ4LIU1187B98FF60',
       b'ARN0DMU1187FB5B63A', b'ARGPHQO11F4C8463D3', b'ARUPZB41187FB52CEE',
       b'ARGS67J1187FB3FC1A', b'AR71RV81187B9A3A46', b'ARFN3551187FB4C930',
       b'ARQVGOX11F4C83D683', b'ARHSO041187FB3D06D', b'ART5NIQ1187FB40252',
       b'ARSPBQG1187B9A60B5', b'AR40GVR1187B9B6C87', b'ARYGOXC1187B98F69A',
       b'ARXRNDO1187FB42BE5', b'ARKVEIM1187FB4D597', b'ARA97CU1187FB364DA',
       b'ARR964E1187B9B95D5', b'AR74PG51187B9AAFA0', b'ARFSZGA11F4C846FF3',
       b'ARAPFN61187B990019', b'ARS73GR1187FB4AE00', b'AR39HYB1187FB56A08',
       b'ARGSEQR1187B9B48F6', b'ARYKJBJ12454A3DD19', b'ARX0RC51187B9A4056',
       b'AROPXVN1187FB4E8F0', b'ARUXSE51187B997E35', b'ARFF0FC1187B9AFEAF',
       b'ARCGVQJ11F50C4D0F3', b'AR5XW991187B99B270', b'ARN0IWD1187FB47BD1',
       b'ARP

The `metadata` portion of this song file contains:

1. a `songs` object with variables: `artist_id`, `song_id`, `artist_name`, `title`, `artist_mbid` (musicbrainz), `genre` (?), and others
1. a `similar_artists` object, a list of similar artists (the entries probably match `artist_id`)
1. an `artist_terms` object, a list of tags describing the artist (e.g. "male", "canada", "blues", "70s")

Even at this point we encounter the question: 
> How can we put this information (for a single song) into a single row of a table? 

The variable-length `EArray` data about similar artists and artist terms doesn't easily fit into a table.
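One possible way to approach the question, sketched here under stated assumptions, is to keep the scalar fields as they are and reduce each variable-length array to a count plus a delimited string. The function name `song_to_row` and the `|` separator are hypothetical; the sample values are copied from the outputs above.

```python
def song_to_row(scalars, arrays, sep="|"):
    """Flatten one song into a single dict (one table row).

    scalars: dict of fixed fields (e.g. from the `songs` table)
    arrays:  dict of variable-length lists (e.g. `artist_terms`)
    """
    row = dict(scalars)
    for name, values in arrays.items():
        row[name + "_count"] = len(values)                # how many entries
        row[name] = sep.join(str(v) for v in values)      # entries as one string
    return row

row = song_to_row(
    {"artist_name": "Elephant Man", "title": "Loud And Clear"},
    {"artist_terms": ["dancehall", "reggae", "hip hop"]},
)
print(row)
```

This keeps one row per song but is awkward for numeric arrays. Another option would be a separate long table with one row per (song, term) pair.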

But it gets worse. Keep reading. 

## Objects of the `analysis` group

Data in this group is produced by "_Echo Nest analysis of the song_."

All of the data in this group are of type `EArray`, except the `songs` object which is of type `Table`. 

In [21]:
store.root.analysis

/analysis (Group) 'Echo Nest analysis of the song'
  children := ['tatums_confidence' (EArray), 'beats_confidence' (EArray), 'sections_confidence' (EArray), 'segments_loudness_start' (EArray), 'bars_confidence' (EArray), 'sections_start' (EArray), 'segments_confidence' (EArray), 'tatums_start' (EArray), 'bars_start' (EArray), 'segments_loudness_max' (EArray), 'segments_start' (EArray), 'segments_timbre' (EArray), 'songs' (Table), 'beats_start' (EArray), 'segments_pitches' (EArray), 'segments_loudness_max_time' (EArray)]

We will find that some of the `EArray` data has two dimensions (that is, it forms a table).


We will inspect some, but not all, of these objects below.

### `songs` object of the `analysis` group

Happily, the `songs` object seems to contain scalar variables, which we can place alongside the variables from the `songs` object of the `metadata` group.

Notice that the values of the array are followed by datatypes.

In [46]:
x = store.root.analysis.songs.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([ (22050, b'bb9771eeef3d5b204a3c55e690f52a91', 0.0, 148.03546, 0.148, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0.169, -9.843, 0, 0.43, 137.915, 121.274, 4, 0.384, b'TRAAABD128F429CF47')], 
      dtype=[('analysis_sample_rate', '<i4'), ('audio_md5', 'S32'), ('danceability', '<f8'), ('duration', '<f8'), ('end_of_fade_in', '<f8'), ('energy', '<f8'), ('idx_bars_confidence', '<i4'), ('idx_bars_start', '<i4'), ('idx_beats_confidence', '<i4'), ('idx_beats_start', '<i4'), ('idx_sections_confidence', '<i4'), ('idx_sections_start', '<i4'), ('idx_segments_confidence', '<i4'), ('idx_segments_loudness_max', '<i4'), ('idx_segments_loudness_max_time', '<i4'), ('idx_segments_loudness_start', '<i4'), ('idx_segments_pitches', '<i4'), ('idx_segments_start', '<i4'), ('idx_segments_timbre', '<i4'), ('idx_tatums_confidence', '<i4'), ('idx_tatums_start', '<i4'), ('key', '<i4'), ('key_confidence', '<f8'), ('loudness', '<f8'), ('mode', '<i4'), ('mode_confidence', '<f8'), ('start_of_fade_out',

Individual fields of the record can be retrieved by position.

In [52]:
print(x[0])
x[0][1], x[0][-1]

(22050, b'bb9771eeef3d5b204a3c55e690f52a91', 0.0, 148.03546, 0.148, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0.169, -9.843, 0, 0.43, 137.915, 121.274, 4, 0.384, b'TRAAABD128F429CF47')


(b'bb9771eeef3d5b204a3c55e690f52a91', b'TRAAABD128F429CF47')
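Since `read()` returns a NumPy structured array, fields can also be selected by name, which is less fragile than counting positions. The sketch below builds a small stand-in array (`x_demo`, a made-up name) with three of the field names and values printed above, so it runs without the HDF5 file.

```python
import numpy as np

# One-row structured array with three of the fields shown above;
# the values are copied from the printed record.
record_dtype = np.dtype([("audio_md5", "S32"),
                         ("duration", "<f8"),
                         ("track_id", "S32")])
x_demo = np.array(
    [(b"bb9771eeef3d5b204a3c55e690f52a91", 148.03546, b"TRAAABD128F429CF47")],
    dtype=record_dtype,
)

# Select a column by field name, then take the first (only) row.
track_id = x_demo["track_id"][0].decode("ascii")  # bytes -> str
duration = float(x_demo["duration"][0])

print(x_demo.dtype.names)
print(track_id, duration)
```

The same indexing should apply to the real `x`, for example `x["track_id"][0]`.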

In [47]:
print(x)
x.dtype

[ (22050, b'bb9771eeef3d5b204a3c55e690f52a91', 0.0, 148.03546, 0.148, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0.169, -9.843, 0, 0.43, 137.915, 121.274, 4, 0.384, b'TRAAABD128F429CF47')]


dtype([('analysis_sample_rate', '<i4'), ('audio_md5', 'S32'), ('danceability', '<f8'), ('duration', '<f8'), ('end_of_fade_in', '<f8'), ('energy', '<f8'), ('idx_bars_confidence', '<i4'), ('idx_bars_start', '<i4'), ('idx_beats_confidence', '<i4'), ('idx_beats_start', '<i4'), ('idx_sections_confidence', '<i4'), ('idx_sections_start', '<i4'), ('idx_segments_confidence', '<i4'), ('idx_segments_loudness_max', '<i4'), ('idx_segments_loudness_max_time', '<i4'), ('idx_segments_loudness_start', '<i4'), ('idx_segments_pitches', '<i4'), ('idx_segments_start', '<i4'), ('idx_segments_timbre', '<i4'), ('idx_tatums_confidence', '<i4'), ('idx_tatums_start', '<i4'), ('key', '<i4'), ('key_confidence', '<f8'), ('loudness', '<f8'), ('mode', '<i4'), ('mode_confidence', '<f8'), ('start_of_fade_out', '<f8'), ('tempo', '<f8'), ('time_signature', '<i4'), ('time_signature_confidence', '<f8'), ('track_id', 'S32')])

The datatype codes indicate:

- `<i4`: 4-byte (32-bit) little-endian integers
- `<f8`: 8-byte (64-bit) little-endian floating-point numbers
- `S32`: byte strings of up to 32 bytes
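The meaning of these codes can be checked with NumPy itself. This short sketch asks `np.dtype` for the kind and size of each code; it does not touch the song file.

```python
import numpy as np

codes = ("<i4", "<f8", "S32")

# kind: 'i' = signed integer, 'f' = floating point, 'S' = byte string
kinds = {code: np.dtype(code).kind for code in codes}
# itemsize: number of bytes used by one value
sizes = {code: np.dtype(code).itemsize for code in codes}

for code in codes:
    print(code, kinds[code], sizes[code])
```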

### Types of objects of the `analysis` group

Five types:
- tatums
- segments
- beats
- bars
- sections

Assuming that the values of each type partition the song, the list above runs roughly from shortest to longest duration. The counts below show one exception for this song: there are more segments (599) than tatums (546), so segments are slightly shorter on average.

In [84]:
store.root.analysis.tatums_start.read().shape

(546,)

In [85]:
store.root.analysis.segments_start.read().shape

(599,)

In [86]:
store.root.analysis.beats_start.read().shape

(273,)

In [87]:
store.root.analysis.bars_start.read().shape

(67,)

In [88]:
store.root.analysis.sections_start.read().shape

(9,)
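The element counts above give a rough average length for each type, assuming each type partitions the song. The 128-second duration is an assumption taken from the `2:08` noted earlier.

```python
# Element counts copied from the shapes printed above.
counts = {"tatums": 546, "segments": 599, "beats": 273, "bars": 67, "sections": 9}
approx_duration = 128.0  # seconds; assumption based on the 2:08 duration

# Average length of one unit of each type, if the units tile the whole song.
mean_length = {name: approx_duration / n for name, n in counts.items()}

# Names ordered from shortest to longest average unit.
ordering = sorted(mean_length, key=mean_length.get)
for name in ordering:
    print(f"{name:9s} {mean_length[name]:6.2f} s")
```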

The `tatums`, `beats`, `bars` and `sections` contain only `start` and `confidence` information.
- `tatums_start`
- `tatums_confidence`
- `beats_start`
- `beats_confidence`
- `bars_start`
- `bars_confidence`
- `sections_start`
- `sections_confidence`

The `segments` contain, in addition, `pitch`, `timbre` and `loudness`:
- `segments_start` 
- `segments_confidence`
- `segments_pitches`
- `segments_timbre`
- `segments_loudness_max_time`
- `segments_loudness_start`
- `segments_loudness_max`



Check the first three and last three `start` times of each type. They all seem to be consistent with a duration of `2:08` and so seem to have units in seconds.

In [94]:
store.root.analysis.segments_start.read()[0:3,], store.root.analysis.segments_start.read()[-3:,]

(array([ 0.     ,  0.49075,  0.94331]),
 array([ 127.33542,  127.50345,  127.76454]))

In [93]:
store.root.analysis.tatums_start.read()[0:3,], store.root.analysis.tatums_start.read()[-3:,]

(array([ 0.30847,  0.53882,  0.7636 ]),
 array([ 127.0784 ,  127.32034,  127.56108]))

In [95]:
store.root.analysis.sections_start.read()[0:3,], store.root.analysis.sections_start.read()[-3:,]

(array([  0.     ,   8.56912,  18.2259 ]),
 array([  82.59045,   96.2478 ,  113.05733]))

In [96]:
store.root.analysis.beats_start.read()[0:3,], store.root.analysis.beats_start.read()[-3:,]

(array([ 0.53882,  0.9906 ,  1.43869]),
 array([ 126.59472,  127.0784 ,  127.56108]))

In [97]:
store.root.analysis.bars_start.read()[0:3,], store.root.analysis.bars_start.read()[-3:,]

(array([ 0.9906 ,  2.73619,  4.67471]),
 array([ 120.32298,  122.275  ,  124.20791]))
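A quick plausibility check on the unit: if the `start` values are seconds, the spacing between consecutive beats should correspond to a reasonable tempo. The sketch below uses only the first three `beats_start` values printed above, so the result is a rough estimate.

```python
beats_start = [0.53882, 0.9906, 1.43869]  # first three values printed above

# Time between consecutive beats.
intervals = [later - earlier
             for earlier, later in zip(beats_start, beats_start[1:])]
mean_interval = sum(intervals) / len(intervals)

# Beats per minute, assuming the start times are in seconds.
bpm = 60.0 / mean_interval
print(f"mean beat interval: {mean_interval:.3f} s, about {bpm:.0f} beats per minute")
```

A value in the low hundreds of beats per minute is typical for popular music, which supports reading the unit as seconds.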

### The `confidence` objects of the `analysis` group

Use the `stats.describe` function from the `scipy` library to summarize each array.

In [114]:
from scipy import stats

#### `tatums_confidence`

In [117]:
stats.describe(store.root.analysis.tatums_confidence.read())

DescribeResult(nobs=546, minmax=(0.0, 0.53300000000000003), mean=0.25041941391941391, variance=0.018323634778371477, skewness=-0.8054550877871492, kurtosis=-0.5684787689414659)

#### `segments_confidence`

In [118]:
stats.describe(store.root.analysis.segments_confidence.read())

DescribeResult(nobs=599, minmax=(0.012, 1.0), mean=0.61527879799666119, variance=0.056756408763770166, skewness=-0.4871032551587329, kurtosis=-0.5046147365116327)

#### `beats_confidence`

In [119]:
stats.describe(store.root.analysis.beats_confidence.read())

DescribeResult(nobs=273, minmax=(0.0, 0.63300000000000001), mean=0.31813186813186811, variance=0.017093122252747252, skewness=-0.32171308627692596, kurtosis=-0.07362392751793756)

#### `bars_confidence`

In [120]:
stats.describe(store.root.analysis.bars_confidence.read())

DescribeResult(nobs=67, minmax=(0.0, 0.63300000000000001), mean=0.1247462686567164, variance=0.018039222523744909, skewness=1.5777257042176682, kurtosis=2.6614389928183817)

#### `sections_confidence`

In [121]:
stats.describe(store.root.analysis.sections_confidence.read())

DescribeResult(nobs=9, minmax=(0.33600000000000002, 1.0), mean=0.56222222222222218, variance=0.057306194444444444, skewness=0.9111747459055977, kurtosis=-0.6801877987252616)
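For readers without `scipy`, part of what `stats.describe` reports can be approximated with the standard library. This is a sketch on made-up values, not the song data; it uses the sample variance (dividing by n - 1), which to my knowledge matches the default `ddof=1` of `stats.describe`.

```python
import statistics

confidence = [0.0, 0.2, 0.4, 0.6, 1.0]  # made-up values for illustration

summary = {
    "nobs": len(confidence),
    "minmax": (min(confidence), max(confidence)),
    "mean": statistics.mean(confidence),
    "variance": statistics.variance(confidence),  # sample variance (n - 1)
}
print(summary)
```

Skewness and kurtosis are not covered by the `statistics` module and are left out here.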

### Check the `segments_timbre` array 

It is two-dimensional, with 599 rows (one per segment) and 12 columns.

In [112]:
store.root.analysis.segments_timbre.read().shape

(599, 12)

### Check the `segments_pitches` array 

It is two-dimensional, with 599 rows (one per segment) and 12 columns.

In [113]:
store.root.analysis.segments_pitches.read().shape

(599, 12)
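Two-dimensional arrays such as `segments_timbre` and `segments_pitches` have a different number of rows for every song, so they cannot go into a one-row-per-song table as they are. One common workaround, sketched here on synthetic data of the same shape, is to summarize each of the 12 columns. The names `timbre_demo` and `features` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
timbre_demo = rng.normal(size=(599, 12))  # synthetic stand-in, same shape as above

# Collapse the segment axis: one mean and one standard deviation per column.
timbre_mean = timbre_demo.mean(axis=0)
timbre_std = timbre_demo.std(axis=0)

# 24 fixed-length features, regardless of how many segments the song has.
features = np.concatenate([timbre_mean, timbre_std])
print(features.shape)
```

Such a summary discards the ordering of the segments, so it suits song-level comparisons better than analyses of how a song evolves over time.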

### Close the HDF5 file

In [31]:
store.close()

#### The end