# Read parts of the MSD into tables

This notebook creates a pandas dataframe from the `/metadata/songs` and `/analysis/songs` tables in the HDF5 files of the Million Song Dataset (MSD) subset.

The `pandas` module relies on the _PyTables_ package to read HDF5 files. To install this package from a console (here, into the conda environment named `python3`):

> `$ conda install --name python3 pytables`

This only needs to be done once.

## Load libraries

In [14]:
import os
import re
import itertools as it
import pandas as pd

## Define utility functions

The `get_filenames` function recursively collects the names of all files in a given directory `path` and all of its subdirectories. If `path` contains subdirectories, the function returns a nested list with one level of nesting per directory level. The `unlist` function flattens a list by removing one level of nesting.

In [17]:
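def get_filenames(path):
    # return a nested list: one sub-list per subdirectory, one string per file
    return [get_filenames(path+"/"+entry.name)
            if entry.is_dir()
            else path+"/"+entry.name
            for entry
            in os.scandir(path)]

def unlist(alist):
    # flatten `alist` by one level; every element must itself be a list
    return list(it.chain.from_iterable(alist))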
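
As a point of comparison, the same file names can be gathered without building a nested list first. The following is a minimal sketch, not part of the original notebook, that uses `os.walk` from the standard library. The name `list_files` is hypothetical, the small directory tree is made up for the demonstration, and the ordering of the result may differ from that of `get_filenames`.

```python
import os
import tempfile

def list_files(path):
    """Return a flat, sorted list of the full paths of all files under `path`."""
    found = []
    # os.walk visits `path` and every subdirectory, however deeply nested
    for dirpath, _dirnames, names in os.walk(path):
        for name in names:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# try it on a small throwaway directory tree
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "A", "B"))
    for parts in (("A", "B", "x.h5"), ("A", "y.h5"), ("z.txt",)):
        open(os.path.join(root, *parts), "w").close()
    relative = [os.path.relpath(f, root) for f in list_files(root)]

print(relative)
```

Because the result is already flat, no call to `unlist` is needed, and the function does not depend on how many directory levels lie below `path`.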

The `make_1row_df` function returns a single-row dataframe and takes the following inputs:

- `filename`: full path of an MSD HDF5 file containing data for a single song
- `metadata_vars`: list of variable names from `/metadata/songs`
- `analysis_vars`: list of variable names from `/analysis/songs`
- `remove`: 
    - if `False`, the variables listed in `metadata_vars` and `analysis_vars` are retrieved from the input file
    - if `True`, all variables except those listed are retrieved from the input file

See comments in the code for further details. 

In [19]:
def make_1row_df(filename='', metadata_vars=(), analysis_vars=(), remove=False):
    # open `filename` read-only as an HDF5 file; the `with` block closes it
    # again even if reading fails
    with pd.HDFStore(filename, "r") as store:
        # read each table once into memory as a NumPy structured array
        metadata = store.root.metadata.songs.read()
        analysis = store.root.analysis.songs.read()
    if not remove:
        # `metadata_vars` and `analysis_vars` contain the variables to keep
        metadata_var_list = list(metadata_vars)
        analysis_var_list = list(analysis_vars)
    else:
        # these lists contain the variables to remove; keep all others
        metadata_var_list = [item for item in metadata.dtype.names
                             if item not in metadata_vars]
        analysis_var_list = [item for item in analysis.dtype.names
                             if item not in analysis_vars]
    # build a single row dataframe from `/metadata/songs`
    ret_metadata = pd.DataFrame(metadata, columns=metadata_var_list)
    # build a single row dataframe from `/analysis/songs`
    ret_analysis = pd.DataFrame(analysis, columns=analysis_var_list)
    # merge these two dataframes by placing their columns side by side
    return pd.concat([ret_analysis, ret_metadata], axis=1)
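
The effect of `remove` on the column lists can be illustrated without opening any HDF5 file. The sketch below is illustrative only: `select_vars` is a hypothetical helper that is not part of the notebook, and `available` stands in for the field names that `store.root.metadata.songs.read().dtype.names` would return.

```python
def select_vars(available, listed, remove=False):
    """Return the variable names to retrieve from a table."""
    if not remove:
        # `listed` names the variables to keep
        return list(listed)
    # `listed` names the variables to drop; keep everything else,
    # in the order in which the table stores them
    return [name for name in available if name not in listed]

# a few field names from `/metadata/songs`, used here as stand-in input
available = ['artist_name', 'genre', 'release', 'title']

kept = select_vars(available, ['title', 'genre'], remove=False)
others = select_vars(available, ['title', 'genre'], remove=True)
print(kept, others)
```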

## Get list of (10,000) files

The `path` variable stores the root of the directory tree containing all of the song files. The function `get_filenames` returns a nested list, which is flattened by three calls to `unlist`, one for each of the three directory levels below `path` (for example `A/A/A`). The result is stored in the variable `filenames` as a list of full-path filenames.

In [20]:
path = "/Users/David/Dropbox/Data/MillionSongSubset/data"
filenames = unlist(unlist(unlist(get_filenames(path))))
filenames[0:2]

['/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

Keep in `filenames` only the files with the extension `.h5`.

In [21]:
p = re.compile(r"\.h5$")
filenames = [filename for filename 
             in filenames if p.search(filename)]
filenames[0:2]

['/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 '/Users/David/Dropbox/Data/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']
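
For a fixed extension such as this one, a regular expression is not strictly required. As a hedged alternative, the sketch below filters with `str.endswith`. The list `example_names` is made up for illustration and is not taken from the dataset.

```python
# made-up file names, including two that should be rejected
example_names = ['data/A/TR1.h5', 'data/A/notes.txt',
                 'data/B/TR2.h5', 'data/B/TR3.h5.bak']

# keep only the names that end in ".h5"
h5_names = [name for name in example_names if name.endswith('.h5')]
print(h5_names)
```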

In [22]:
len(filenames)

10000

## Get lists of variables from `/metadata/songs` and `/analysis/songs`

The two tables `/metadata/songs` and `/analysis/songs` provide data that is easy to load into a dataframe. Their variables are displayed below so we know which to choose or omit when creating the corresponding dataframes.

### `/metadata/songs`

In [11]:
tmp=pd.HDFStore(filenames[1],"r") # open read-only
print(tmp.root.metadata.songs.read().dtype)
tmp.close()

[('analyzer_version', 'S32'), ('artist_7digitalid', '<i4'), ('artist_familiarity', '<f8'), ('artist_hotttnesss', '<f8'), ('artist_id', 'S32'), ('artist_latitude', '<f8'), ('artist_location', 'S1024'), ('artist_longitude', '<f8'), ('artist_mbid', 'S40'), ('artist_name', 'S1024'), ('artist_playmeid', '<i4'), ('genre', 'S1024'), ('idx_artist_terms', '<i4'), ('idx_similar_artists', '<i4'), ('release', 'S1024'), ('release_7digitalid', '<i4'), ('song_hotttnesss', '<f8'), ('song_id', 'S32'), ('title', 'S1024'), ('track_7digitalid', '<i4')]


### `/analysis/songs`

In [12]:
tmp=pd.HDFStore(filenames[1],"r") # open read-only
print(tmp.root.analysis.songs.read().dtype)
tmp.close()

[('analysis_sample_rate', '<i4'), ('audio_md5', 'S32'), ('danceability', '<f8'), ('duration', '<f8'), ('end_of_fade_in', '<f8'), ('energy', '<f8'), ('idx_bars_confidence', '<i4'), ('idx_bars_start', '<i4'), ('idx_beats_confidence', '<i4'), ('idx_beats_start', '<i4'), ('idx_sections_confidence', '<i4'), ('idx_sections_start', '<i4'), ('idx_segments_confidence', '<i4'), ('idx_segments_loudness_max', '<i4'), ('idx_segments_loudness_max_time', '<i4'), ('idx_segments_loudness_start', '<i4'), ('idx_segments_pitches', '<i4'), ('idx_segments_start', '<i4'), ('idx_segments_timbre', '<i4'), ('idx_tatums_confidence', '<i4'), ('idx_tatums_start', '<i4'), ('key', '<i4'), ('key_confidence', '<f8'), ('loudness', '<f8'), ('mode', '<i4'), ('mode_confidence', '<f8'), ('start_of_fade_out', '<f8'), ('tempo', '<f8'), ('time_signature', '<i4'), ('time_signature_confidence', '<f8'), ('track_id', 'S32')]
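
Fields with an `S` dtype, such as `title` (`S1024`), are fixed-width byte strings, so in Python 3 they arrive in a dataframe as `bytes` objects rather than `str`. The sketch below shows one possible way to decode them. It uses a small hand-made structured array in place of a real song file, and it assumes the text is UTF-8 encoded, which should be checked against the actual data.

```python
import numpy as np
import pandas as pd

# a one-row structured array that mimics two fields of `/metadata/songs`
row = np.array([(b'Example Title', 0.5)],
               dtype=[('title', 'S1024'), ('artist_familiarity', '<f8')])
df = pd.DataFrame(row)

# decode every column that holds bytes into ordinary Python strings
for col in df.columns:
    if isinstance(df[col].iloc[0], bytes):
        df[col] = df[col].str.decode('utf-8')  # assumption: UTF-8 text
```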


### Create a single table from the 10,000 files

Because `remove=False` is specified, the variables named in the two lists are retrieved from the two tables displayed above. The result of this command is a list of 10,000 single-row dataframes with the listed columns.

In [7]:
mss_df_list = [make_1row_df(filename=filename,
                            metadata_vars=['artist_familiarity','artist_hotttnesss',
                                           'song_hotttnesss','genre','title',
                                           'artist_location','release',
                                           'artist_longitude','artist_latitude'],
                            analysis_vars=['danceability','duration','energy','key',
                                           'loudness','mode','tempo','time_signature'],
                            remove=False
                           )
                for filename in filenames[0:10000] # get data from all 10,000 files
              ]
len(mss_df_list), mss_df_list[0].shape

(10000, (1, 17))

### Merge all dataframes of `mss_df_list` into a single dataframe stored in `mss_df`

In [23]:
mss_df = pd.concat(mss_df_list,axis=0).reset_index(drop=True)

### Check its dimensions (shape) and its variables

In [24]:
print('shape:',mss_df.shape)
print('columns:',mss_df.columns.values)

shape: (10000, 17)
columns: ['danceability' 'duration' 'energy' 'key' 'loudness' 'mode' 'tempo'
 'time_signature' 'artist_familiarity' 'artist_hotttnesss'
 'song_hotttnesss' 'genre' 'title' 'artist_location' 'release'
 'artist_longitude' 'artist_latitude']

### Save and load the table `mss_df`

First set the folder to save to and load from.

In [25]:
save_load_path = '/Users/David/Desktop'

Save `mss_df` to a _pickle_ file.

In [11]:
mss_df.to_pickle(save_load_path+'/mss_df.pkl')

Load `mss_df` from the _pickle_ file.

In [12]:
mss_df = pd.read_pickle(save_load_path+'/mss_df.pkl')

In [13]:
print('shape:',mss_df.shape)
print('columns:',mss_df.columns.values)
mss_df.dtypes

shape: (10000, 17)
columns: ['danceability' 'duration' 'energy' 'key' 'loudness' 'mode' 'tempo'
 'time_signature' 'artist_familiarity' 'artist_hotttnesss'
 'song_hotttnesss' 'genre' 'title' 'artist_location' 'release'
 'artist_longitude' 'artist_latitude']

danceability          float64
duration              float64
energy                float64
key                     int32
loudness              float64
mode                    int32
tempo                 float64
time_signature          int32
artist_familiarity    float64
artist_hotttnesss     float64
song_hotttnesss       float64
genre                  object
title                  object
artist_location        object
release                object
artist_longitude      float64
artist_latitude       float64
artist_latitude            float64
dtype: object