# Explore an HDF5 file that stores the Million Song Subset (MSS) information

The `pandas` module is used to open and inspect the file, but it relies on code from the _PyTables_ package. To install this package from a console (here into a conda environment named `python3`):

> `$ conda install --name python3 PyTables`

### Load libraries

In [1]:
import pandas as pd
import itertools as it
import os 
import re

### Define utility functions `get_filenames` and `unlist`

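The `get_filenames` function recursively obtains the names of all files in the `path` directory and all of its subdirectories. If `path` contains subdirectories, the function returns a nested list with one level of nesting per directory level. The `unlist` function flattens a list by removing one level of nesting. Note that `unlist` expects every element to be a list; applied to a list that mixes strings and lists, it would split the strings into characters.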

In [73]:
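def get_filenames(path):
    # Recurse into subdirectories; the result is nested one level per directory.
    return [get_filenames(entry.path) if entry.is_dir() else entry.path
            for entry in os.scandir(path)]

def unlist(alist):
    # Flatten one level of nesting: [[a, b], [c]] -> [a, b, c]
    return list(it.chain.from_iterable(alist))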
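
As an alternative sketch, the standard library's `os.walk` visits every subdirectory and produces a flat list directly, so no flattening step is needed however deep the tree is. The name `list_files` is hypothetical and is not used elsewhere in this notebook.

```python
import os

def list_files(path):
    """Return a flat, sorted list of the full paths of all files under `path`."""
    found = []
    # os.walk yields (directory, subdirectory names, file names) for each directory
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Because the result is already flat, this version does not depend on all files sitting at the same depth.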

## Use these two functions to get the list of 10,000 files

The `path` variable stores the root of the directory tree containing all of the song files. Because the files sit three directory levels below `path` (for example `A/A/A/`), `get_filenames` returns a list nested three levels deep. Three calls to `unlist` flatten it, and the result is stored in the variable `filenames` as a list of full-path filenames.

In [74]:
path = "/Users/David/Dropbox/Data/MillionSongSubset/data"
filenames = unlist(unlist(unlist(get_filenames(path))))
filenames[0:2]

['/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

In [76]:
list(os.scandir(path))

[<DirEntry 'A'>, <DirEntry 'B'>]

### Store in `filenames` only the files with extension `.h5` 

In [65]:
p = re.compile(r"\.h5$")  # raw string, so the backslash reaches the regex engine unchanged
filenames = [filename for filename 
             in filenames if p.search(filename)]
filenames[0:2]

['/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

In [9]:
len(filenames)

10000
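
The directory walk and the extension filter above can also be combined in one step. The following is a minimal sketch using `pathlib` from the standard library; `h5_files` is a hypothetical helper name, not part of the original notebook.

```python
from pathlib import Path

def h5_files(path):
    """Return the sorted full paths (as strings) of all `.h5` files under `path`."""
    # rglob searches `path` and all of its subdirectories for matching names
    return sorted(str(p) for p in Path(path).rglob("*.h5") if p.is_file())
```

Calling `h5_files(path)` with the `path` variable defined earlier should yield the same kind of flat list as `filenames`, without relying on the depth of the directory tree.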

## Investigate the data stored in the first file `filenames[0]`

Use the `HDFStore` class from `pandas` to open the HDF5 file.

In [68]:
store = pd.HDFStore(filenames[0])

### List the three `Groups` in the file.

In [69]:
store.root

/ (RootGroup) 'H5 Song File'
  children := ['metadata' (Group), 'analysis' (Group), 'musicbrainz' (Group)]

### Groups can contain other groups (this one doesn't) or objects of type `Table` or `EArray` (this one has one `Table` and four objects of type `EArray`).

In [53]:
store.root.metadata

/metadata (Group) 'metadata about the song'
  children := ['songs' (Table), 'artist_terms' (EArray), 'similar_artists' (EArray), 'artist_terms_weight' (EArray), 'artist_terms_freq' (EArray)]

### The `Table` object `songs` looks like a single row of a table as it has data of different types. The `EArray` objects look like vectors as they have objects of a single type. See below for examples. The next three cells don't include data, but do indicate the datatypes.

In [70]:
store.root.metadata.songs

/metadata/songs (Table(1,), shuffle, zlib(1)) 'table of metadata for one song'
  description := {
  "analyzer_version": StringCol(itemsize=32, shape=(), dflt=b'', pos=0),
  "artist_7digitalid": Int32Col(shape=(), dflt=0, pos=1),
  "artist_familiarity": Float64Col(shape=(), dflt=0.0, pos=2),
  "artist_hotttnesss": Float64Col(shape=(), dflt=0.0, pos=3),
  "artist_id": StringCol(itemsize=32, shape=(), dflt=b'', pos=4),
  "artist_latitude": Float64Col(shape=(), dflt=0.0, pos=5),
  "artist_location": StringCol(itemsize=1024, shape=(), dflt=b'', pos=6),
  "artist_longitude": Float64Col(shape=(), dflt=0.0, pos=7),
  "artist_mbid": StringCol(itemsize=40, shape=(), dflt=b'', pos=8),
  "artist_name": StringCol(itemsize=1024, shape=(), dflt=b'', pos=9),
  "artist_playmeid": Int32Col(shape=(), dflt=0, pos=10),
  "genre": StringCol(itemsize=1024, shape=(), dflt=b'', pos=11),
  "idx_artist_terms": Int32Col(shape=(), dflt=0, pos=12),
  "idx_similar_artists": Int32Col(shape=(), dflt=0, pos=13),
  "rel

In [55]:
store.root.metadata.artist_terms

/metadata/artist_terms (EArray(37,), shuffle, zlib(1)) 'array of terms (Echo Nest tags) for an artist'
  atom := StringAtom(itemsize=256, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := (32,)

In [56]:
store.root.metadata.artist_terms_freq

/metadata/artist_terms_freq (EArray(37,), shuffle, zlib(1)) 'array of term (Echo Nest tags) frequencies for an artist'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (1024,)

### The following cells call `read()` to retrieve the actual data; each call returns an `ndarray`.

In [71]:
x = store.root.metadata.songs.read()
type(x), x

(numpy.ndarray,
 array([ (b'', 165270, 0.5817937658450281, 0.4019975433642836, b'ARD7TVE1187B99BFB1', nan, b'California - LA', nan, b'e77e51a5-4761-45b3-9847-2051f811e366', b'Casual', 4479, b'', 0, 0, b'Fear Itself', 300848, 0.6021199899057548, b'SOMZWCG12A8C13C480', b"I Didn't Mean To", 3401791)], 
       dtype=[('analyzer_version', 'S32'), ('artist_7digitalid', '<i4'), ('artist_familiarity', '<f8'), ('artist_hotttnesss', '<f8'), ('artist_id', 'S32'), ('artist_latitude', '<f8'), ('artist_location', 'S1024'), ('artist_longitude', '<f8'), ('artist_mbid', 'S40'), ('artist_name', 'S1024'), ('artist_playmeid', '<i4'), ('genre', 'S1024'), ('idx_artist_terms', '<i4'), ('idx_similar_artists', '<i4'), ('release', 'S1024'), ('release_7digitalid', '<i4'), ('song_hotttnesss', '<f8'), ('song_id', 'S32'), ('title', 'S1024'), ('track_7digitalid', '<i4')]))

### The data from `songs` is easily used to create variables of a dataframe. We will do this in the notebook `Dataset-Million Song Subset-create dataframe`. 
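
One possible way to do that conversion is sketched below; it is not necessarily how the other notebook does it. The array `row` is a hand-built stand-in for the result of `store.root.metadata.songs.read()`, holding only three of the fields, with values copied from the output above. The names `row` and `df` are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for store.root.metadata.songs.read(), with only three fields
row = np.array(
    [(0.5817937658450281, b"Casual", b"I Didn't Mean To")],
    dtype=[("artist_familiarity", "<f8"),
           ("artist_name", "S1024"),
           ("title", "S1024")],
)

# Each field of the structured array becomes a DataFrame column
df = pd.DataFrame(row)

# Fixed-width string fields arrive as bytes; decode them to str
for name in row.dtype.names:
    if row.dtype[name].kind == "S":
        df[name] = df[name].str.decode("utf-8")
```

Repeating this for each file and concatenating the one-row dataframes would give one row per song.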

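### The data from `artist_terms` is a vector, which is not simple to add to a dataframe. Fortunately the terms appear to be sorted by decreasing weight: in the outputs of the two cells following this next cell, the `artist_terms_weight` values decrease steadily, while the `artist_terms_freq` values do not. We might choose to pull only the first one or two terms, and so create one or two variables.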

In [32]:
x = store.root.metadata.artist_terms.read()
type(x), x

(numpy.ndarray,
 array([b'blue-eyed soul', b'pop rock', b'blues-rock', b'beach music',
        b'soft rock', b'soul', b'classic rock', b'oldies', b'power pop',
        b'psychedelic rock', b'rock', b'sunshine pop', b'blues',
        b'singer-songwriter', b'pop', b'united states', b'male vocalist',
        b"rock 'n roll", b'60s', b'am pop', b'r&b', b'american', b'male',
        b'psychedelic', b'classic', b'vocal', b'americana', b'game music',
        b'mod', b'trippy', b'french', b'germany', b'canada', b'70s',
        b'belgium', b'cover', b'nederland', b'confident'], 
       dtype='|S256'))

In [31]:
x = store.root.metadata.artist_terms_freq.read()
type(x), x

(numpy.ndarray,
 array([ 1.        ,  0.89319999,  0.78606029,  0.74638538,  0.76959371,
         0.86287996,  0.84396311,  0.80926862,  0.76959371,  0.76959371,
         0.91182678,  0.65015172,  0.76959371,  0.76959371,  0.80319885,
         0.59682398,  0.62623369,  0.56622481,  0.63564988,  0.53145068,
         0.53925634,  0.6718294 ,  0.56760867,  0.64202042,  0.57531714,
         0.57830665,  0.58026237,  0.50317926,  0.49775864,  0.49154685,
         0.56355296,  0.50884487,  0.4721209 ,  0.53094369,  0.46943746,
         0.43936557,  0.41932041,  0.41299165]))

In [59]:
x = store.root.metadata.artist_terms_weight.read()
type(x), x

(numpy.ndarray,
 array([ 1.        ,  0.89793596,  0.88426185,  0.84262975,  0.84256301,
         0.83239282,  0.82577707,  0.79859195,  0.7431759 ,  0.73850237,
         0.72505245,  0.71389955,  0.67049417,  0.65697231,  0.65105613,
         0.65105612,  0.65105597,  0.65105592,  0.65105547,  0.65105532,
         0.65105508,  0.65105506,  0.65105461,  0.65105427,  0.65105376,
         0.65104997,  0.6364043 ,  0.63334971,  0.61973455,  0.61889383,
         0.61419433,  0.59579116,  0.56220197,  0.55067233,  0.52897541,
         0.49021215,  0.38341077]))
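
Pulling the first one or two terms could look like the sketch below. The arrays `terms` and `weights` are toy stand-ins (hypothetical values, not taken from the file) for the results of `artist_terms.read()` and `artist_terms_weight.read()`, and `top_terms` is a hypothetical helper. It sorts by weight explicitly, so it does not rely on the terms already being in order.

```python
import numpy as np

# Toy stand-ins for artist_terms.read() and artist_terms_weight.read()
terms = np.array([b"soul", b"rock", b"pop", b"oldies"], dtype="S256")
weights = np.array([0.8, 1.0, 0.9, 0.5])

def top_terms(terms, weights, k=2):
    """Return the k terms with the largest weights, decoded to str."""
    order = np.argsort(weights)[::-1][:k]  # indices of the k largest weights
    return [terms[i].decode("utf-8") for i in order]

top_terms(terms, weights)  # ['rock', 'pop']
```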

### Now look at the `Table` and `EArray` objects of the `analysis` group.

In [57]:
store.root.analysis

/analysis (Group) 'Echo Nest analysis of the song'
  children := ['segments_timbre' (EArray), 'segments_loudness_max' (EArray), 'sections_confidence' (EArray), 'beats_confidence' (EArray), 'segments_pitches' (EArray), 'tatums_confidence' (EArray), 'bars_confidence' (EArray), 'tatums_start' (EArray), 'segments_confidence' (EArray), 'segments_loudness_max_time' (EArray), 'segments_loudness_start' (EArray), 'beats_start' (EArray), 'segments_start' (EArray), 'bars_start' (EArray), 'songs' (Table), 'sections_start' (EArray)]

### The `songs` object is a `Table`. The next cell shows the variable names and their types; the cell after it shows the data. As with `metadata/songs`, this data is easily used to create variables of a dataframe. We will do this in the notebook `Dataset-Million Song Subset-create dataframe`.

In [35]:
store.root.analysis.songs

/analysis/songs (Table(1,), shuffle, zlib(1)) 'table of Echo Nest analysis for one song'
  description := {
  "analysis_sample_rate": Int32Col(shape=(), dflt=0, pos=0),
  "audio_md5": StringCol(itemsize=32, shape=(), dflt=b'', pos=1),
  "danceability": Float64Col(shape=(), dflt=0.0, pos=2),
  "duration": Float64Col(shape=(), dflt=0.0, pos=3),
  "end_of_fade_in": Float64Col(shape=(), dflt=0.0, pos=4),
  "energy": Float64Col(shape=(), dflt=0.0, pos=5),
  "idx_bars_confidence": Int32Col(shape=(), dflt=0, pos=6),
  "idx_bars_start": Int32Col(shape=(), dflt=0, pos=7),
  "idx_beats_confidence": Int32Col(shape=(), dflt=0, pos=8),
  "idx_beats_start": Int32Col(shape=(), dflt=0, pos=9),
  "idx_sections_confidence": Int32Col(shape=(), dflt=0, pos=10),
  "idx_sections_start": Int32Col(shape=(), dflt=0, pos=11),
  "idx_segments_confidence": Int32Col(shape=(), dflt=0, pos=12),
  "idx_segments_loudness_max": Int32Col(shape=(), dflt=0, pos=13),
  "idx_segments_loudness_max_time": Int32Col(shape=(), d

In [41]:
x = store.root.analysis.songs.read()
type(x), x

(numpy.ndarray,
 array([ (22050, b'bb9771eeef3d5b204a3c55e690f52a91', 0.0, 148.03546, 0.148, 0.0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0.169, -9.843, 0, 0.43, 137.915, 121.274, 4, 0.384, b'TRAAABD128F429CF47')], 
       dtype=[('analysis_sample_rate', '<i4'), ('audio_md5', 'S32'), ('danceability', '<f8'), ('duration', '<f8'), ('end_of_fade_in', '<f8'), ('energy', '<f8'), ('idx_bars_confidence', '<i4'), ('idx_bars_start', '<i4'), ('idx_beats_confidence', '<i4'), ('idx_beats_start', '<i4'), ('idx_sections_confidence', '<i4'), ('idx_sections_start', '<i4'), ('idx_segments_confidence', '<i4'), ('idx_segments_loudness_max', '<i4'), ('idx_segments_loudness_max_time', '<i4'), ('idx_segments_loudness_start', '<i4'), ('idx_segments_pitches', '<i4'), ('idx_segments_start', '<i4'), ('idx_segments_timbre', '<i4'), ('idx_tatums_confidence', '<i4'), ('idx_tatums_start', '<i4'), ('key', '<i4'), ('key_confidence', '<f8'), ('loudness', '<f8'), ('mode', '<i4'), ('mode_confidence', '<f8'), ('s

### Notice that the `segments_timbre` object is an `EArray` with two dimensions. We will need to make some choices when reading this data into a dataframe. 

In [37]:
store.root.analysis.segments_timbre

/analysis/segments_timbre (EArray(550, 12), shuffle, zlib(1)) 'array of timbre of segments (MFCC-like)'
  atom := Float64Atom(shape=(), dflt=0.0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := (85, 12)

In [58]:
x = store.root.analysis.segments_timbre.read()
type(x), x.shape, x

(numpy.ndarray,
 (971, 12),
 array([[  0.00000000e+00,   1.71130000e+02,   9.46900000e+00, ...,
           9.73000000e-01,  -1.06400000e+01,  -7.22800000e+00],
        [  1.99910000e+01,  -1.43504000e+02,  -1.18249000e+02, ...,
           3.81060000e+01,  -2.76000000e+00,  -1.90030000e+01],
        [  2.05970000e+01,  -2.03829000e+02,  -1.59915000e+02, ...,
           9.46000000e+00,  -1.53300000e+01,  -2.10790000e+01],
        ..., 
        [  2.44160000e+01,  -8.00690000e+01,  -1.20022000e+02, ...,
           9.11800000e+00,  -9.64400000e+00,   1.32000000e-01],
        [  4.16210000e+01,   3.42380000e+01,  -2.85390000e+01, ...,
           2.14910000e+01,   3.41890000e+01,  -9.64400000e+00],
        [  3.71950000e+01,   1.21030000e+02,  -7.98630000e+01, ...,
           1.93330000e+01,  -2.18400000e+01,   1.69290000e+01]]))
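
One of the possible choices for a two-dimensional array like this is to collapse it into a fixed-length summary, so that every song contributes the same number of columns regardless of how many segments it has. The sketch below uses the per-column mean and standard deviation; this is only an illustration, not a decision made in this notebook. The array `timbre` is a random stand-in for `segments_timbre.read()`, and `summarize_timbre` is a hypothetical helper.

```python
import numpy as np

# Random stand-in for segments_timbre.read(): one row per segment, 12 columns
rng = np.random.default_rng(0)
timbre = rng.normal(size=(550, 12))

def summarize_timbre(timbre):
    """Collapse an (n_segments, 12) array into a fixed-length vector of 24 values."""
    means = timbre.mean(axis=0)  # average of each of the 12 columns
    stds = timbre.std(axis=0)    # spread of each of the 12 columns
    return np.concatenate([means, stds])

features = summarize_timbre(timbre)
```

The 24 values in `features` could then become 24 dataframe columns for the song.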

### Close the HDF5 file.

In [47]:
store.close()
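
Closing can also be handled automatically. As a minimal sketch, `HDFStore` can be used in a `with` block, which closes the file even if an error occurs while reading. The helper name `child_names` is hypothetical; `_v_children` is the PyTables mapping from child names to nodes, and running the function requires PyTables to be installed.

```python
import pandas as pd

def child_names(filename):
    """Open an HDF5 file read-only and return the names of the root group's children."""
    # The `with` block closes the store on exit, even if an exception is raised
    with pd.HDFStore(filename, mode="r") as store:
        # `_v_children` maps each child name to its node (PyTables attribute)
        return list(store.root._v_children)
```

Calling `child_names(filenames[0])` would be expected to list the groups shown earlier for the first file.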

# The end