# Explore the HDF5 files that store the Million Song Subset (MSS) information

The `pandas` module is used to open and inspect the file, but it relies on the _PyTables_ package for HDF5 access. To install this package from a console:

> `$ conda install --name python3 PyTables`

### Load libraries

In [3]:
import pandas as pd
import numpy as np
import itertools as it
import os 
import re

### Define utility functions `get_filenames` and `unlist`

The `get_filenames` function recursively obtains the names of all files in the `path` directory and all of its subdirectories. The function returns a multi-level list if `path` contains subdirectories. The `unlist` function flattens the list by removing one level. 

In [4]:
def get_filenames(path):
    return([get_filenames(path+"/"+entry.name)
            if entry.is_dir() 
            else path+"/"+entry.name 
            for entry 
            in os.scandir(path)
           ])

def unlist(alist):
    return(list(it.chain.from_iterable(alist)
               )
          )
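Because `get_filenames` nests one list per directory level, `unlist` has to be applied once per level, and it assumes every entry at that level is itself a list (a plain file name at that level would be split into characters). As a minimal sketch of an alternative, `os.walk` from the standard library visits every directory in the tree, so a flat list can be built in one pass. The name `get_filenames_flat` is hypothetical and not part of the original notebook.

```python
import os

def get_filenames_flat(path):
    """Return a flat, sorted list of full paths for every file under `path`."""
    return sorted(
        os.path.join(dirpath, name)
        for dirpath, _dirnames, filenames in os.walk(path)
        for name in filenames
    )
```

The result is sorted because the order in which the operating system lists directory entries is not guaranteed.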

## Use these two functions to get the list of 10,000 files

The `path` variable stores the root of the directory tree containing all of the song files. The function `get_filenames` returns a multi-level list, which is flattened using `unlist` and stored in variable `filenames` as a list of full-path filenames.

In [9]:
path = "/Users/David/Dropbox/Data/MillionSongSubset/data"
filenames = unlist(unlist(unlist(get_filenames(path))))
filenames[0:2]

['/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/original1.TRAAAAW128F429D538.h5',
 '/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/original1.TRAAABD128F429CF47.h5']

In [10]:
list(os.scandir(path))

[<DirEntry 'A'>, <DirEntry 'B'>]

### Keep in `filenames` only the `.h5` files whose names start with `TR`

In [23]:
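# Raw string: the backslash reaches the regex engine unchanged,
# and "/" needs no escaping in a Python regular expression.
p = re.compile(r"/TR.*\.h5$")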
filenames = [filename 
             for filename 
             in filenames 
             if p.search(filename)
            ]
filenames[0:2]

['/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

In [24]:
len(filenames)

10000
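To see what the filter keeps, the pattern can be tried on a few made-up paths. This is a small sketch: the paths are illustrative, not real dataset entries, and the pattern uses `[^/]*` instead of `.*` so that the `TR` prefix must belong to the file name itself rather than to a parent directory. That is a slightly stricter variant than the pattern in the cell above.

```python
import re

# Keep paths whose final component starts with "TR" and ends with ".h5".
pattern = re.compile(r"/TR[^/]*\.h5$")

# Illustrative paths only (hypothetical, not real dataset entries).
candidates = [
    "/data/A/A/A/TRAAAAW128F429D538.h5",
    "/data/A/A/A/original1.TRAAAAW128F429D538.h5",
    "/data/A/A/A/TRAAAAW128F429D538.txt",
]

kept = [name for name in candidates if pattern.search(name)]
print(kept)
```

Only the first path is kept: the second has a file name that starts with `original1.`, and the third does not end in `.h5`.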

## Investigate the data stored in the first file `filenames[0]`

In [27]:
filenames[0]

'/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5'

Use the `HDFStore` class to open the HDF5 file. Note that the cell below opens `filenames[1]`, the second file in the list, not `filenames[0]`.

In [28]:
store=pd.HDFStore(filenames[1])

List the three `Groups` in the file. 

Notice the `metadata` and `analysis` groups:

- `metadata` contains data produced by people
- `analysis` contains data produced by programs

In [30]:
store.root

/ (RootGroup) 'H5 Song File'
  children := ['metadata' (Group), 'analysis' (Group), 'musicbrainz' (Group)]

The `RootGroup` contains three groups. We will look into the `metadata` group and the `analysis` group.

## Objects of the `metadata` group

This group contains objects of type `Table` and `EArray`:

- One object of type `Table` named `songs`
- Four objects of type `EArray` named `artist_terms_freq`, `artist_terms_weight`, `artist_terms` and `similar_artists`

In [32]:
store.root.metadata

/metadata (Group) 'metadata about the song'
  children := ['artist_terms_freq' (EArray), 'artist_terms_weight' (EArray), 'artist_terms' (EArray), 'similar_artists' (EArray), 'songs' (Table)]

Each one of these objects will be inspected below. The `read` method will retrieve the data. 

First, the `Table` object named `songs` looks like a single row of a table as it has variable names associated with data of different types. 

In [45]:
store.root.metadata.songs

/metadata/songs (Table(1,), shuffle, zlib(1)) 'table of metadata for one song'
  description := {
  "analyzer_version": StringCol(itemsize=32, shape=(), dflt=b'', pos=0),
  "artist_7digitalid": Int32Col(shape=(), dflt=0, pos=1),
  "artist_familiarity": Float64Col(shape=(), dflt=0.0, pos=2),
  "artist_hotttnesss": Float64Col(shape=(), dflt=0.0, pos=3),
  "artist_id": StringCol(itemsize=32, shape=(), dflt=b'', pos=4),
  "artist_latitude": Float64Col(shape=(), dflt=0.0, pos=5),
  "artist_location": StringCol(itemsize=1024, shape=(), dflt=b'', pos=6),
  "artist_longitude": Float64Col(shape=(), dflt=0.0, pos=7),
  "artist_mbid": StringCol(itemsize=40, shape=(), dflt=b'', pos=8),
  "artist_name": StringCol(itemsize=1024, shape=(), dflt=b'', pos=9),
  "artist_playmeid": Int32Col(shape=(), dflt=0, pos=10),
  "genre": StringCol(itemsize=1024, shape=(), dflt=b'', pos=11),
  "idx_artist_terms": Int32Col(shape=(), dflt=0, pos=12),
  "idx_similar_artists": Int32Col(shape=(), dflt=0, pos=13),
  "rel

In [46]:
x = store.root.metadata.songs.read()
type(x), x

(numpy.ndarray,
 array([ (b'', 1998, 0.6306300375898077, 0.4174996449709784, b'ARMJAGH1187FB546F3', 35.14968, b'Memphis, TN', -90.04892, b'1c78ab62-db33-4433-8d0b-7c8dcf1849c2', b'The Box Tops', 22066, b'', 0, 0, b'Dimensions', 300822, nan, b'SOCIWDW12A8C13D406', b'Soul Deep', 3400270)], 
       dtype=[('analyzer_version', 'S32'), ('artist_7digitalid', '<i4'), ('artist_familiarity', '<f8'), ('artist_hotttnesss', '<f8'), ('artist_id', 'S32'), ('artist_latitude', '<f8'), ('artist_location', 'S1024'), ('artist_longitude', '<f8'), ('artist_mbid', 'S40'), ('artist_name', 'S1024'), ('artist_playmeid', '<i4'), ('genre', 'S1024'), ('idx_artist_terms', '<i4'), ('idx_similar_artists', '<i4'), ('release', 'S1024'), ('release_7digitalid', '<i4'), ('song_hotttnesss', '<f8'), ('song_id', 'S32'), ('title', 'S1024'), ('track_7digitalid', '<i4')]))

Notice:

- `artist_name`: The Box Tops
- `title`: Soul Deep
- `genre`: [missing]
- `release`: Dimensions (looks like album name)

See http://www.allmusic.com/album/dimensions-mw0000605601

- Duration: 2:27
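The record returned by `read` is a NumPy structured array whose string fields are stored as bytes. A minimal sketch of turning one such record into a plain dictionary with decoded strings follows. The array is a hand-built stand-in holding four of the fields shown above, and `record_to_dict` is a hypothetical helper, not part of pandas or PyTables.

```python
import numpy as np

# Stand-in for store.root.metadata.songs.read(), with only a few fields.
songs = np.array(
    [(b"The Box Tops", b"Soul Deep", b"", 35.14968)],
    dtype=[("artist_name", "S1024"), ("title", "S1024"),
           ("genre", "S1024"), ("artist_latitude", "<f8")],
)

def record_to_dict(record):
    """Convert one structured-array row to a dict, decoding bytes to str."""
    out = {}
    for name in record.dtype.names:
        value = record[name]
        if isinstance(value, bytes):
            value = value.decode("utf-8")
        else:
            value = value.item()  # NumPy scalar -> plain Python scalar
        out[name] = value
    return out

row = record_to_dict(songs[0])
print(row)
```

A missing value such as `genre` comes back as an empty string, so it may need to be mapped to a proper missing-value marker before analysis.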

The `artist_terms` object:

In [47]:
store.root.metadata.artist_terms

/metadata/artist_terms (EArray(38,), shuffle, zlib(1)) 'array of terms (Echo Nest tags) for an artist'
  atom := StringAtom(itemsize=256, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := (32,)

In [48]:
x = store.root.metadata.artist_terms.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([b'blue-eyed soul', b'pop rock', b'blues-rock', b'beach music',
       b'soft rock', b'soul', b'classic rock', b'oldies', b'power pop',
       b'psychedelic rock', b'rock', b'sunshine pop', b'blues',
       b'singer-songwriter', b'pop', b'united states', b'male vocalist',
       b"rock 'n roll", b'60s', b'am pop', b'r&b', b'american', b'male',
       b'psychedelic', b'classic', b'vocal', b'americana', b'game music',
       b'mod', b'trippy', b'french', b'germany', b'canada', b'70s',
       b'belgium', b'cover', b'nederland', b'confident'], 
      dtype='|S256')
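The terms are stored as fixed-width byte strings (`|S256`), which is why each value prints with a `b` prefix. A minimal sketch of decoding them to text, using a three-term stand-in array rather than the file itself:

```python
import numpy as np

# Stand-in for store.root.metadata.artist_terms.read().
terms = np.array([b"blue-eyed soul", b"pop rock", b"r&b"], dtype="S256")

# np.char.decode applies bytes.decode element-wise and returns a str array.
decoded = np.char.decode(terms, "utf-8")
print(decoded.tolist())
```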

The `artist_terms_freq` object:

In [49]:
store.root.metadata.artist_terms_freq

/metadata/artist_terms_freq (EArray(38,), shuffle, zlib(1)) 'array of term (Echo Nest tags) frequencies for an artist'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)

In [50]:
x = store.root.metadata.artist_terms_freq.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([ 1.        ,  0.89319999,  0.78606029,  0.74638538,  0.76959371,
        0.86287996,  0.84396311,  0.80926862,  0.76959371,  0.76959371,
        0.91182678,  0.65015172,  0.76959371,  0.76959371,  0.80319885,
        0.59682398,  0.62623369,  0.56622481,  0.63564988,  0.53145068,
        0.53925634,  0.6718294 ,  0.56760867,  0.64202042,  0.57531714,
        0.57830665,  0.58026237,  0.50317926,  0.49775864,  0.49154685,
        0.56355296,  0.50884487,  0.4721209 ,  0.53094369,  0.46943746,
        0.43936557,  0.41932041,  0.41299165])

The `artist_terms_weight` object:

In [51]:
store.root.metadata.artist_terms_weight

/metadata/artist_terms_weight (EArray(38,), shuffle, zlib(1)) 'array of term (Echo Nest tags) weights for an artist'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)

In [52]:
x = store.root.metadata.artist_terms_weight.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([ 1.        ,  0.8459884 ,  0.83068957,  0.79929112,  0.7882742 ,
        0.78409474,  0.78162137,  0.77157135,  0.76701044,  0.76259756,
        0.74642957,  0.72313246,  0.71631195,  0.69145399,  0.68487963,
        0.68092924,  0.66818023,  0.65671327,  0.64044927,  0.62919326,
        0.62628879,  0.62208271,  0.60905077,  0.60905061,  0.60905036,
        0.60905025,  0.6090495 ,  0.60681945,  0.60252961,  0.59761364,
        0.56556785,  0.55928456,  0.55928206,  0.5592817 ,  0.55928128,
        0.55631773,  0.54045413,  0.53544559])

The `similar_artists` object:

In [43]:
store.root.metadata.similar_artists

/metadata/similar_artists (EArray(100,), shuffle, zlib(1)) 'array of similar artists Echo Nest id'
  atom := StringAtom(itemsize=20, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := (409,)

In [44]:
x = store.root.metadata.similar_artists.read()
print(type(x))
x

<class 'numpy.ndarray'>


array([b'ARSZWK21187B9B26D7', b'ARLDW2Y1187B9B544F', b'ARG0TXR1187FB4E708',
       b'AR6Z8OF1187FB5216E', b'ARUAG4R1187FB53500', b'AR4M1NA1187FB54703',
       b'ARACWDD11F4C83EFB0', b'ARUSW6X1187FB3CF6E', b'ARJ4LIU1187B98FF60',
       b'ARN0DMU1187FB5B63A', b'ARGPHQO11F4C8463D3', b'ARUPZB41187FB52CEE',
       b'ARGS67J1187FB3FC1A', b'AR71RV81187B9A3A46', b'ARFN3551187FB4C930',
       b'ARQVGOX11F4C83D683', b'ARHSO041187FB3D06D', b'ART5NIQ1187FB40252',
       b'ARSPBQG1187B9A60B5', b'AR40GVR1187B9B6C87', b'ARYGOXC1187B98F69A',
       b'ARXRNDO1187FB42BE5', b'ARKVEIM1187FB4D597', b'ARA97CU1187FB364DA',
       b'ARR964E1187B9B95D5', b'AR74PG51187B9AAFA0', b'ARFSZGA11F4C846FF3',
       b'ARAPFN61187B990019', b'ARS73GR1187FB4AE00', b'AR39HYB1187FB56A08',
       b'ARGSEQR1187B9B48F6', b'ARYKJBJ12454A3DD19', b'ARX0RC51187B9A4056',
       b'AROPXVN1187FB4E8F0', b'ARUXSE51187B997E35', b'ARFF0FC1187B9AFEAF',
       b'ARCGVQJ11F50C4D0F3', b'AR5XW991187B99B270', b'ARN0IWD1187FB47BD1',
       b'ARP

The metadata portion of this song's data file contains, in the:

1. `songs` object, some variables
1. `similar_artists` object, a list of similar artists
1. `artist_terms` object, a list of tags describing the artist (e.g. "male", "canada", "blues", "70s")

The variables in `songs` can be added to a dataset of songs with little effort.

Lists (variable length `EArray` data) require some work as they contain a variable number of elements. (It's not so easy to put them in a table.)
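One possible way to fit these lists into a table is a long format: one row per (track, term) pair, so each song contributes as many rows as it has terms. The sketch below assumes the three `artist_terms*` arrays are aligned by position, which their equal lengths (38) suggest but the notebook does not confirm. `terms_to_rows` is a hypothetical helper, and the input arrays are short stand-ins.

```python
import numpy as np

def terms_to_rows(track_id, terms, freq, weight):
    """Build one dict per term, assuming the three arrays share an order."""
    if not (len(terms) == len(freq) == len(weight)):
        raise ValueError("term arrays must have the same length")
    return [
        {"track_id": track_id, "term": t.decode("utf-8"),
         "freq": float(f), "weight": float(w)}
        for t, f, w in zip(terms, freq, weight)
    ]

# Stand-ins for the arrays read from the metadata group.
terms = np.array([b"blue-eyed soul", b"pop rock"], dtype="S256")
freq = np.array([1.0, 0.89319999])
weight = np.array([1.0, 0.8459884])

rows = terms_to_rows("TRAAAAW128F429D538", terms, freq, weight)
print(rows)
```

A list of such dictionaries can be passed to `pd.DataFrame` to obtain a table with one row per term.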

## Objects of the `analysis` group

Data in this group is produced by "_Echo Nest analysis of the song_."

All of the data in this group are of type `EArray`, except the `songs` object which is of type `Table`. 

In [20]:
store.root.analysis

/analysis (Group) 'Echo Nest analysis of the song'
  children := ['segments_confidence' (EArray), 'beats_start' (EArray), 'sections_confidence' (EArray), 'songs' (Table), 'segments_loudness_max_time' (EArray), 'tatums_confidence' (EArray), 'segments_timbre' (EArray), 'beats_confidence' (EArray), 'sections_start' (EArray), 'segments_loudness_start' (EArray), 'segments_loudness_max' (EArray), 'bars_confidence' (EArray), 'bars_start' (EArray), 'tatums_start' (EArray), 'segments_start' (EArray), 'segments_pitches' (EArray)]

We will inspect some, but not all, of these objects below.

Happily, the `songs` object seems to contain variables, which we can place alongside the variables from the `songs` object of the `metadata` group.

The NumPy dtype codes indicate:

- `<i4`: 32-bit integers (little-endian)
- `<f8`: 64-bit floating-point numbers (little-endian)
- `S32`: byte strings of up to 32 bytes

But, we'll need to check this when looking at the data. 
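The dtype codes themselves can be checked directly with NumPy. A quick sketch that collects the kind and byte width for each code seen in the output:

```python
import numpy as np

# kind: 'i' = signed integer, 'f' = floating point, 'S' = byte string
summary = {
    code: (np.dtype(code).kind, np.dtype(code).itemsize)
    for code in ("<i4", "<f8", "S32")
}
print(summary)
```

This confirms what the codes mean, but not whether the values stored under them are sensible, which still has to be checked against the data.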

In [22]:
x = store.root.analysis.songs.read()
type(x), x

(numpy.ndarray,
 array([ (22050, b'a222795e07cd65b7a530f1346f520649', 0.0, 218.93179, 0.247, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0.736, -11.197, 0, 0.636, 218.932, 92.198, 4, 0.778, b'TRAAAAW128F429D538')], 
       dtype=[('analysis_sample_rate', '<i4'), ('audio_md5', 'S32'), ('danceability', '<f8'), ('duration', '<f8'), ('end_of_fade_in', '<f8'), ('energy', '<f8'), ('idx_bars_confidence', '<i4'), ('idx_bars_start', '<i4'), ('idx_beats_confidence', '<i4'), ('idx_beats_start', '<i4'), ('idx_sections_confidence', '<i4'), ('idx_sections_start', '<i4'), ('idx_segments_confidence', '<i4'), ('idx_segments_loudness_max', '<i4'), ('idx_segments_loudness_max_time', '<i4'), ('idx_segments_loudness_start', '<i4'), ('idx_segments_pitches', '<i4'), ('idx_segments_start', '<i4'), ('idx_segments_timbre', '<i4'), ('idx_tatums_confidence', '<i4'), ('idx_tatums_start', '<i4'), ('key', '<i4'), ('key_confidence', '<f8'), ('loudness', '<f8'), ('mode', '<i4'), ('mode_confidence', '<f8'), ('

Notice that the `segments_timbre` object is an `EArray` with two dimensions. We will need to make some choices when reading this data into a dataframe. 

The `segments_start` object seems to have units of seconds. 

In [60]:
store.root.analysis.segments_start.read()[0:3,]

array([ 0.     ,  0.14803,  0.68104])

In [64]:
store.root.analysis.segments_start.read()[-3:,]

array([ 145.32658,  145.82984,  146.07447])

Compare:

- Soul Deep has duration of 2 minutes and 27 seconds (see URL above)
- Last `segments_start` value is 146.07447
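- The `duration` is 218.93179 (?!). Note that the `track_id` in the `analysis/songs` output above is `TRAAAAW128F429D538`, the first file, while the store was opened on `filenames[1]`, so that output may come from an earlier run on a different file.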
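Putting the three numbers on a common scale makes the mismatch concrete. A small arithmetic sketch using only the values quoted above (`to_seconds` is a hypothetical helper):

```python
def to_seconds(minutes, seconds):
    """Convert a minutes:seconds duration to seconds."""
    return 60 * minutes + seconds

listed = to_seconds(2, 27)        # duration from the album listing
last_segment_start = 146.07447    # last value of segments_start
duration_field = 218.93179        # `duration` in analysis/songs

gap_listed = listed - last_segment_start   # listing vs. last segment start
gap_field = duration_field - listed        # duration field vs. listing

print(listed, round(gap_listed, 2), round(gap_field, 2))
```

The last segment starts about 0.9 seconds before the listed 2:27, whereas the `duration` field is about 72 seconds longer than the listing, roughly 3:39.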

In [38]:
store.root.analysis.segments_timbre.read()[0:5,]

array([[   0.   ,  171.13 ,    9.469,  -28.48 ,   57.491,  -50.067,
          14.833,    5.359,  -27.228,    0.973,  -10.64 ,   -7.228],
       [  35.141,  -30.807,   35.192,  -75.606,   -0.584,  195.091,
          52.842,  -34.58 ,    9.649,   93.124,   50.145,  -13.776],
       [  42.317,  -23.978,    5.4  ,   59.208,  -17.624,   28.703,
          14.13 ,   -0.71 ,   34.62 ,  -23.91 ,   23.453,   -5.048],
       [  37.802,  -51.322,    9.355,   41.799,  -39.216,  -11.274,
          37.813,  -14.581,   18.403,    6.843,   -6.07 ,    5.466],
       [  39.138,  -22.524,    6.124,  -21.399,  -71.754,  -23.274,
          32.708,    7.204,   38.913,   20.644,   20.334,  -11.435]])
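Since `segments_timbre` has one row per segment and 12 columns, and the number of segments varies from song to song, one possible choice is to summarise each column so that every song yields a vector of the same length. The sketch below uses column means and standard deviations. This is only one option among several, not something the notebook prescribes. `summarise_timbre` is a hypothetical helper, and the input is the first two rows printed above.

```python
import numpy as np

def summarise_timbre(timbre):
    """Return a 1-D vector: column means followed by column std devs."""
    timbre = np.asarray(timbre, dtype=float)
    return np.concatenate([timbre.mean(axis=0), timbre.std(axis=0)])

# First two rows of the segments_timbre output.
timbre = np.array([
    [0.0, 171.13, 9.469, -28.48, 57.491, -50.067,
     14.833, 5.359, -27.228, 0.973, -10.64, -7.228],
    [35.141, -30.807, 35.192, -75.606, -0.584, 195.091,
     52.842, -34.58, 9.649, 93.124, 50.145, -13.776],
])

features = summarise_timbre(timbre)
print(features.shape)
```

With 12 timbre columns, the result has 24 values per song regardless of how many segments the song has.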

In [40]:
store.root.analysis.segments_pitches.read()[0:10,]

array([[ 1.   ,  1.   ,  1.   ,  1.   ,  1.   ,  1.   ,  1.   ,  1.   ,
         1.   ,  1.   ,  1.   ,  1.   ],
       [ 0.018,  0.07 ,  0.04 ,  0.044,  0.217,  0.074,  0.069,  0.074,
         0.123,  1.   ,  0.087,  0.042],
       [ 0.077,  0.216,  0.08 ,  0.08 ,  0.503,  0.182,  0.417,  0.215,
         0.331,  1.   ,  0.241,  0.107],
       [ 0.006,  0.074,  0.021,  0.043,  0.529,  0.044,  0.033,  0.027,
         0.265,  1.   ,  0.223,  0.015],
       [ 0.006,  0.027,  0.016,  0.022,  0.319,  0.022,  0.012,  0.016,
         0.187,  1.   ,  0.219,  0.008],
       [ 0.021,  0.043,  0.032,  0.061,  0.366,  0.063,  0.067,  0.055,
         0.37 ,  1.   ,  0.292,  0.02 ],
       [ 0.217,  0.026,  0.015,  0.016,  0.067,  0.035,  0.309,  0.058,
         0.037,  0.079,  0.091,  1.   ],
       [ 0.119,  1.   ,  0.252,  0.009,  0.071,  0.015,  0.029,  0.022,
         0.203,  0.115,  0.03 ,  0.057],
       [ 0.058,  0.411,  1.   ,  0.138,  0.061,  0.024,  0.036,  0.016,
         0.064,  0.332, 

In [25]:
store.root.analysis.segments_loudness_max.read()

array([-60.   , -31.646, -34.565, -38.407, -34.696, -20.511, -18.919,
       -21.477, -25.9  , -23.011, -20.598, -19.805, -21.222, -20.432,
       -28.022, -34.927, -29.401, -28.001, -18.132, -32.869, -36.764,
       -29.322, -20.021, -24.552, -41.055, -18.448, -18.232,  -9.656,
        -9.718, -16.525, -17.739,  -9.871, -17.369,  -9.493, -18.296,
       -17.17 , -18.037, -26.947,  -9.616, -12.048, -10.813, -10.22 ,
       -12.488, -17.914, -25.179, -12.849,  -8.758, -11.848,  -9.867,
       -14.918, -16.477, -12.302, -11.31 , -14.388, -10.922,  -8.27 ,
       -10.003, -10.464, -17.697, -17.604,  -8.914, -10.226, -12.403,
       -12.699,  -8.748, -13.511, -15.141,  -8.725,  -9.374, -15.494,
       -10.698,  -7.889,  -8.524, -10.545, -11.413, -13.845, -16.668,
        -8.857, -12.993, -11.349, -14.074, -12.878, -12.789,  -8.737,
       -12.071, -11.296, -13.16 , -16.822, -16.245, -11.888, -14.716,
        -8.284, -10.717, -17.063, -12.333, -13.186, -11.168,  -9.067,
       -14.331, -13.

In [26]:
store.root.analysis.segments_confidence.read()

array([ 0.   ,  1.   ,  0.483,  0.137,  0.42 ,  1.   ,  0.257,  1.   ,
        0.592,  0.596,  0.852,  0.937,  0.737,  0.976,  0.254,  0.631,
        0.997,  0.655,  1.   ,  0.421,  0.138,  0.959,  0.83 ,  0.812,
        0.533,  1.   ,  0.993,  1.   ,  1.   ,  0.321,  0.999,  1.   ,
        0.506,  1.   ,  0.866,  0.779,  0.92 ,  0.515,  1.   ,  0.591,
        0.641,  0.346,  0.882,  0.03 ,  0.454,  1.   ,  0.906,  0.443,
        0.484,  0.003,  0.869,  0.641,  0.304,  0.643,  0.751,  0.864,
        0.702,  0.857,  0.718,  1.   ,  1.   ,  0.418,  0.825,  0.838,
        0.878,  0.89 ,  0.794,  0.773,  0.993,  0.92 ,  1.   ,  1.   ,
        1.   ,  1.   ,  0.479,  0.929,  0.456,  0.901,  0.413,  0.674,
        0.988,  0.682,  0.725,  0.817,  0.301,  0.693,  0.873,  0.761,
        0.493,  0.75 ,  0.762,  0.746,  0.907,  0.639,  0.737,  1.   ,
        0.749,  0.548,  0.869,  0.867,  0.549,  1.   ,  0.958,  0.168,
        0.873,  1.   ,  0.593,  0.199,  0.862,  0.684,  1.   ,  1.   ,
      

In [27]:
store.root.analysis.segments_loudness_max_time.read()

array([ 0.     ,  0.10929,  0.11044,  0.0844 ,  0.05898,  0.07381,
        0.04654,  0.08719,  0.06602,  0.03343,  0.11404,  0.05306,
        0.07451,  0.11407,  0.10054,  0.15215,  0.24543,  0.05017,
        0.11508,  0.16495,  0.1623 ,  0.04619,  0.14897,  0.07871,
        0.05129,  0.04371,  0.0214 ,  0.11547,  0.04197,  0.02048,
        0.02722,  0.0966 ,  0.03066,  0.03833,  0.01424,  0.02249,
        0.02616,  0.14193,  0.06646,  0.09761,  0.14756,  0.03001,
        0.02226,  0.02634,  0.04725,  0.05246,  0.08835,  0.06793,
        0.03319,  0.04381,  0.07776,  0.02414,  0.08314,  0.0561 ,
        0.07179,  0.02257,  0.02975,  0.18057,  0.03032,  0.13801,
        0.09724,  0.03762,  0.07956,  0.07945,  0.08375,  0.06315,
        0.04173,  0.14577,  0.06894,  0.14735,  0.04998,  0.08002,
        0.07557,  0.07449,  0.03874,  0.04153,  0.02846,  0.08423,
        0.02059,  0.03769,  0.06351,  0.03422,  0.03496,  0.06465,
        0.03929,  0.05229,  0.05503,  0.0404 ,  0.04247,  0.10

In [28]:
store.root.analysis.segments_loudness_start.read()

array([-60.   , -60.   , -40.84 , -40.401, -38.456, -39.684, -25.842,
       -43.843, -35.42 , -30.22 , -32.719, -30.927, -29.006, -34.219,
       -32.381, -42.203, -43.467, -35.264, -36.227, -40.697, -40.557,
       -41.201, -34.446, -32.735, -45.457, -43.529, -34.318, -35.52 ,
       -23.493, -21.726, -30.405, -32.598, -23.99 , -33.812, -28.183,
       -32.316, -30.788, -37.207, -26.968, -19.364, -18.643, -17.849,
       -23.164, -28.267, -32.508, -32.282, -23.056, -17.791, -14.631,
       -18.875, -31.509, -17.44 , -14.815, -23.603, -21.037, -19.842,
       -19.164, -28.699, -31.326, -37.333, -27.798, -15.963, -24.418,
       -26.653, -21.625, -26.669, -26.453, -18.779, -25.103, -30.899,
       -32.526, -34.207, -25.023, -27.908, -18.36 , -30.287, -22.713,
       -18.366, -19.623, -19.643, -28.554, -21.304, -23.002, -19.105,
       -14.83 , -21.513, -23.23 , -27.164, -18.838, -16.682, -23.27 ,
       -15.444, -22.098, -26.962, -22.288, -31.645, -20.364, -14.853,
       -26.066, -25.

### Close the HDF5 file.

In [29]:
store.close()

# The end