# Read parts of the MSD into tables

This notebook creates a pandas dataframe from the `/metadata/songs` and `/analysis/songs` tables in the HDF5 files of the Million Song Dataset (MSD) subset.

The `pandas` module relies on the _PyTables_ package to read HDF5 files. To install this package from a console (here into a conda environment named `python3`):

> `$ conda install --name python3 PyTables`

This only needs to happen once on your computer.
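
A quick way to confirm the installation is to check that Python can locate the module. This is a minimal sketch; it assumes only that PyTables is imported under the module name `tables`, and the variable name `has_pytables` is illustrative.

```python
import importlib.util

# PyTables is imported as `tables`; find_spec returns None when it is missing.
has_pytables = importlib.util.find_spec("tables") is not None
print("PyTables available:", has_pytables)
```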

### Load libraries

In [30]:
import os
import re
import itertools as it
import pandas as pd
import numpy as np
import operator 
import functools

### Define utility functions

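The `get_filenames` function recursively collects the names of all files in a given directory `path` and all of its subdirectories. When `path` contains subdirectories, it returns a nested list with one level per directory level; the `unlist` function flattens such a list by one level. The `var_list` function builds column names such as `sp_0`, `sp_1`, and so on, and `h1d_array` returns the first `n` values of an array as a single-row 2D array.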

In [31]:
def get_filenames(path):
    return([get_filenames(path+"/"+entry.name)
            if entry.is_dir() 
            else path+"/"+entry.name 
            for entry 
            in os.scandir(path)
           ])

def unlist(alist):
    return(list(it.chain.from_iterable(alist)
               )
          )

def var_list(base,numof):
    return([base+str(ndx) for ndx in range(numof)]
          )

def h1d_array(in_array,n): 
    # n1d is the number of elements in `in_array`
    n1d = functools.reduce(operator.mul,
                           list(in_array.shape))
    # return a 1 row 2D array with `n` columns
    b = np.ndarray(shape=(1,n1d),
                   buffer=in_array,
                   dtype=in_array.dtype
                  )[0:1,0:n]
    return(b)
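
To make the behaviour of `h1d_array` and `var_list` concrete, here is a minimal sketch on a small synthetic array. The array `segments` and the helper name `first_n_flat` are made up for illustration, and the sketch assumes that a plain reshape gives the same result as the buffer-based approach for C-contiguous input. It also shows that an input with fewer than `n` values yields fewer columns.

```python
import numpy as np

def var_list(base, numof):
    """Column names base0, base1, ... (same naming scheme as the notebook's helper)."""
    return [base + str(ndx) for ndx in range(numof)]

def first_n_flat(in_array, n):
    """Flatten in row-major order and keep at most the first n values as a 1-row 2D array."""
    return np.ascontiguousarray(in_array).reshape(1, -1)[:, :n]

# Synthetic stand-in for a (segments x features) array; not real MSD data.
segments = np.arange(24, dtype="float64").reshape(2, 12)

row = first_n_flat(segments, 5)                  # first five values, shape (1, 5)
names = var_list("sp_", row.shape[1])            # one column name per value
short = first_n_flat(np.array([0.1, 0.2]), 10)   # only two values are available

print(names, row, short.shape)
```

Because a short input produces fewer columns, songs with few values would contribute missing entries for the remaining columns once the rows are combined.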

The `make_1row_df` function returns a single row dataframe and takes the following input:

- `filename`: full path file name of an MSD HDF5 file containing data for a single song
- `metadata_vars`: list of variable names from `/metadata/songs`
- `analysis_vars`: list of variable names from `/analysis/songs`
- `remove`: 
    - if `False` the variables listed in the last two parameters are retrieved from the input file
    - if `True` all variables except those listed are retrieved from the input file

See comments in the code for further details. 

In [32]:
def make_1row_df(filename='', metadata_vars=[], analysis_vars=[], remove=False):
    # open `filename` as a HDF5 file
    store = pd.HDFStore(filename,"r")
    if remove==True:
        # `metadata_vars` and `analysis_vars` contain the variables to remove
        metadata_vars = list({item for item 
                                  in list(store.root.metadata.songs.read().dtype.names) 
                                  if item not in metadata_vars})
        analysis_vars = list({item for item 
                                  in list(store.root.analysis.songs.read().dtype.names) 
                                  if item not in analysis_vars})
    # else: `metadata_vars` and `analysis_vars` contain the variables to keep
    
    # retrieve the first `n` values as a horizontal array of 1 dimension
    segments_pitches = h1d_array(store.root.analysis.segments_pitches.read(),36)
    segments_timbre  = h1d_array(store.root.analysis.segments_timbre.read(),36)
    bars_confidence  = h1d_array(store.root.analysis.bars_confidence.read(),10)
    artist_terms     = h1d_array(store.root.metadata.artist_terms.read(),3)
    
    # store these values as variables in single dataframes
    at_df = pd.DataFrame(artist_terms    ,columns=var_list('at_',artist_terms    .shape[1]))
    bc_df = pd.DataFrame(bars_confidence ,columns=var_list('bc_',bars_confidence .shape[1]))
    sp_df = pd.DataFrame(segments_pitches,columns=var_list('sp_',segments_pitches.shape[1]))
    st_df = pd.DataFrame(segments_timbre ,columns=var_list('st_',segments_timbre .shape[1]))
    
    # merge these single dataframes into one single row dataframe
    ret = pd.concat([
            # retrieve a single row dataframe from `/metadata/songs`
            pd.DataFrame(store.root.metadata.songs.read(), 
                         columns=metadata_vars),
            # retrieve a single row dataframe from `/analysis/songs`
            pd.DataFrame(store.root.analysis.songs.read(), 
                         columns=analysis_vars),
            #at_df, 
            bc_df, 
            sp_df,
            st_df],
            axis=1) # `axis=1` means stack the dataframes horizontally 
    # close the HDF5 file
    store.close()
    # return the merged dataframe
    return(ret)
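
The effect of the `remove` flag can be illustrated without an HDF5 file. The sketch below uses a small NumPy structured array as a stand-in for a songs table (its three fields are an assumption chosen for brevity) and a hypothetical helper `select_vars`. It uses a list comprehension so that the table's field order is kept.

```python
import numpy as np

# Stand-in for `store.root.metadata.songs.read()`; the fields are illustrative.
songs = np.zeros(1, dtype=[("title", "S16"), ("genre", "S16"), ("tempo", "<f8")])

def select_vars(all_names, listed, remove=False):
    """Return the variables to load: `listed` itself, or everything except `listed`."""
    if remove:
        return [name for name in all_names if name not in listed]
    return list(listed)

keep = select_vars(songs.dtype.names, ["title", "tempo"])         # remove=False
drop = select_vars(songs.dtype.names, ["genre"], remove=True)     # remove=True

print(keep, drop)
```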

### Get the list of (10,000) HDF5 (.h5) files

The `path` variable stores the root of the directory tree containing all of the song files. The function `get_filenames` returns a nested list, which is flattened by applying `unlist` once per directory level (three times here) and stored in the variable `filenames` as a list of full-path filenames.

In [33]:
path = "MillionSongSubset/data"
filenames = unlist(unlist(unlist(get_filenames(path))))
filenames[0:2]

FileNotFoundError: [WinError 3] The system cannot find the path specified: 'MillionSongSubset/data'

### Store in the `filenames` variable only the files with extension `.h5`

In [10]:
p = re.compile(r"\.h5$")
filenames = [filename for filename 
             in filenames if p.search(filename)]
filenames[0:2]

['MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 'MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

In [11]:
len(filenames)

10000
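
As an alternative sketch, not the approach used above, `os.walk` visits every directory level, so the result is already flat and the extension filter can be applied in the same pass. The helper name `find_h5_files` and the throwaway directory tree are assumptions for illustration.

```python
import os
import tempfile

def find_h5_files(path):
    """Return the sorted paths of all `.h5` files under `path`, at any depth."""
    return sorted(os.path.join(dirpath, name)
                  for dirpath, _dirnames, names in os.walk(path)
                  for name in names
                  if name.endswith(".h5"))

# Build a tiny tree shaped like the MSD layout (letter/letter/letter/file).
with tempfile.TemporaryDirectory() as root:
    for rel in ("A/A/A/one.h5", "A/A/B/two.h5", "A/A/B/notes.txt"):
        full = os.path.join(root, *rel.split("/"))
        os.makedirs(os.path.dirname(full), exist_ok=True)
        open(full, "w").close()
    # Report paths relative to the root, with forward slashes on every platform.
    found = [os.path.relpath(f, root).replace(os.sep, "/")
             for f in find_h5_files(root)]

print(found)
```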

### Get lists of variables from `/metadata/songs` and `/analysis/songs`

The two tables `/metadata/songs` and `/analysis/songs` provide data that is easy to load into a dataframe. Their variables are displayed below so we know which to choose or omit when creating the corresponding dataframes.

### `/metadata/songs`

In [12]:
tmp=pd.HDFStore(filenames[1],"r")
print(tmp.root.metadata.songs.read().dtype)
tmp.close()

[('analyzer_version', 'S32'), ('artist_7digitalid', '<i4'), ('artist_familiarity', '<f8'), ('artist_hotttnesss', '<f8'), ('artist_id', 'S32'), ('artist_latitude', '<f8'), ('artist_location', 'S1024'), ('artist_longitude', '<f8'), ('artist_mbid', 'S40'), ('artist_name', 'S1024'), ('artist_playmeid', '<i4'), ('genre', 'S1024'), ('idx_artist_terms', '<i4'), ('idx_similar_artists', '<i4'), ('release', 'S1024'), ('release_7digitalid', '<i4'), ('song_hotttnesss', '<f8'), ('song_id', 'S32'), ('title', 'S1024'), ('track_7digitalid', '<i4')]


### `/analysis/songs`

### Run the `make_1row_df` function on the fourth file

We are currently only pulling data from `/metadata/songs` and `/analysis/songs`. 

Later we will pull additional data from the file. There are three types of data we can retrieve:

1. From `/metadata` there are three lists: `artist_terms`, `artist_terms_freq`, `artist_terms_weight`
1. From `/analysis` there is information about _tatums_, _beats_, _segments_, _bars_, _timbre_ and _pitch_
1. From `/musicbrainz` there should be tags, but I don't think there are any values here.

In [16]:
make_1row_df(filename=filenames[3],
             metadata_vars=['artist_familiarity','artist_hotttnesss',
                            'song_hotttnesss','title','artist_name',
                            'artist_location','release',
                            'artist_longitude','artist_latitude'],
             # Omit: genre
             analysis_vars=['duration','key','loudness','mode',
                            'tempo','time_signature'],
             # Omit: danceability, energy
             remove=False
            )

Unnamed: 0,artist_familiarity,artist_hotttnesss,song_hotttnesss,title,artist_name,artist_location,release,artist_longitude,artist_latitude,duration,...,st_26,st_27,st_28,st_29,st_30,st_31,st_32,st_33,st_34,st_35
0,0.630382,0.454231,,b'Something Girls',b'Adam Ant',"b'London, England'",b'Friend Or Foe',,,233.40363,...,103.23,-17.005,-37.423,47.573,-0.734,25.383,-10.965,-44.947,10.023,-40.109


### There is more data in the file than the data in `/metadata/songs` and `/analysis/songs`. 

See the `Dataset-MSS-pandas-explore` notebook. 

### Create a list of 10,000 single row dataframes

Because `remove=False` is specified, the variables named in the two lists are retrieved from the `/metadata/songs` and `/analysis/songs` tables. The result of this command is a list of 10,000 single-row dataframes with the selected columns.

It may take up to twenty (20) minutes to create `mss_df_list` with the current set of variables. 

In [17]:
mss_df_list = [make_1row_df(filename=filename,
                            metadata_vars=['artist_familiarity','artist_hotttnesss',
                                           'song_hotttnesss','title',
                                           'artist_location','release',
                                           'artist_longitude','artist_latitude'],
                            # Omit: genre
                            analysis_vars=['duration','key','loudness','mode',
                                           'tempo','time_signature'],
                            # Omit: danceability, energy
                            remove=False
                           )
                for filename in filenames[0:10000] # get data from all 10,000 files
              ]
len(mss_df_list), mss_df_list[0].shape

(10000, (1, 96))

In [18]:
len(mss_df_list)

10000

### Merge all dataframes of `mss_df_list` into a single dataframe stored in `mss_df`.

In [19]:
mss_df = pd.concat(mss_df_list,axis=0).reset_index(drop=True)
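
The reason for `reset_index(drop=True)` is that every single-row dataframe carries the index `0`, so stacking them alone would repeat that label in every row. A minimal sketch with three synthetic rows (the values are placeholders, not output from the files):

```python
import pandas as pd

# Three single-row dataframes, each with index [0], as produced one per song file.
rows = [pd.DataFrame({"title": [t], "tempo": [x]})
        for t, x in [("a", 92.2), ("b", 121.3), ("c", 100.1)]]

stacked = pd.concat(rows, axis=0)          # index labels repeat: 0, 0, 0
merged = stacked.reset_index(drop=True)    # fresh labels: 0, 1, 2

print(list(stacked.index), list(merged.index))
```

Passing `ignore_index=True` to `pd.concat` should give the same fresh index in one step.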

### Check the head of the table

In [20]:
mss_df.head()

Unnamed: 0,artist_familiarity,artist_hotttnesss,artist_latitude,artist_location,artist_longitude,bc_0,bc_1,bc_2,bc_3,bc_4,...,st_35,st_4,st_5,st_6,st_7,st_8,st_9,tempo,time_signature,title
0,0.581794,0.401998,,b'California - LA',,0.643,0.746,0.722,0.095,0.091,...,-21.079,57.491,-50.067,14.833,5.359,-27.228,0.973,92.198,4,"b""I Didn't Mean To"""
1,0.63063,0.4175,35.14968,"b'Memphis, TN'",-90.04892,0.007,0.259,0.172,0.404,0.011,...,-5.048,57.491,-50.067,14.833,5.359,-27.228,0.973,121.274,4,b'Soul Deep'
2,0.487357,0.343428,,b'',,0.98,0.399,0.185,0.27,0.422,...,4.562,57.482,-50.069,14.839,5.352,-27.227,0.975,100.07,1,b'Amor De Cabaret'
3,0.630382,0.454231,,"b'London, England'",,0.017,0.05,0.014,0.008,0.114,...,-40.109,56.3,202.348,68.838,-33.635,-24.275,92.399,119.293,4,b'Something Girls'
4,0.651046,0.401724,,b'',,0.175,0.409,0.639,0.067,0.016,...,8.708,54.144,-50.189,18.536,5.384,-26.271,2.826,129.738,4,b'Face the Ashes'


### Check its dimensions (shape) and its variables.

In [21]:
print('shape:',mss_df.shape)
print('columns:',mss_df.columns.values)

shape: (10000, 96)
columns: ['artist_familiarity' 'artist_hotttnesss' 'artist_latitude'
 'artist_location' 'artist_longitude' 'bc_0' 'bc_1' 'bc_2' 'bc_3' 'bc_4'
 'bc_5' 'bc_6' 'bc_7' 'bc_8' 'bc_9' 'duration' 'key' 'loudness' 'mode'
 'release' 'song_hotttnesss' 'sp_0' 'sp_1' 'sp_10' 'sp_11' 'sp_12' 'sp_13'
 'sp_14' 'sp_15' 'sp_16' 'sp_17' 'sp_18' 'sp_19' 'sp_2' 'sp_20' 'sp_21'
 'sp_22' 'sp_23' 'sp_24' 'sp_25' 'sp_26' 'sp_27' 'sp_28' 'sp_29' 'sp_3'
 'sp_30' 'sp_31' 'sp_32' 'sp_33' 'sp_34' 'sp_35' 'sp_4' 'sp_5' 'sp_6'
 'sp_7' 'sp_8' 'sp_9' 'st_0' 'st_1' 'st_10' 'st_11' 'st_12' 'st_13' 'st_14'
 'st_15' 'st_16' 'st_17' 'st_18' 'st_19' 'st_2' 'st_20' 'st_21' 'st_22'
 'st_23' 'st_24' 'st_25' 'st_26' 'st_27' 'st_28' 'st_29' 'st_3' 'st_30'
 'st_31' 'st_32' 'st_33' 'st_34' 'st_35' 'st_4' 'st_5' 'st_6' 'st_7' 'st_8'
 'st_9' 'tempo' 'time_signature' 'title']


### Some changes 

### Make `key` and `time_signature` variables categorical

Leave `mode` as numeric; it is cast to `float64` below.

In [22]:
mss_df['mode']            = mss_df['mode']           .astype('float64')
mss_df['key']             = mss_df['key']            .astype('category')
mss_df['time_signature']  = mss_df['time_signature'] .astype('category')
mss_df['key'].dtype, mss_df['mode'].dtype, mss_df['time_signature'].dtype

(category, dtype('float64'), category)

### Create dummy variables from categorical variables `key` and `time_signature`

The `mode` variable is already binary. 

The next command replaces the `key` and `time_signature` columns with their dummy columns, prefixed `k_` and `ts_`.

In [23]:

mss_df = pd.get_dummies(mss_df, 
                        columns=['key','time_signature'], 
                        prefix=['k','ts'])
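
A small sketch of what `pd.get_dummies` does here, using a three-row toy dataframe (the values are made up for illustration). It shows that the original columns are dropped and that a dummy column is created only for values that actually occur in the data.

```python
import pandas as pd

# Toy stand-in for `mss_df` with one numeric and two categorical variables.
toy = pd.DataFrame({"tempo": [92.2, 121.3, 100.1],
                    "key": [0, 7, 0],
                    "time_signature": [4, 4, 1]})
toy["key"] = toy["key"].astype("category")
toy["time_signature"] = toy["time_signature"].astype("category")

# `key` and `time_signature` are replaced by k_* and ts_* indicator columns.
dummies = pd.get_dummies(toy,
                         columns=["key", "time_signature"],
                         prefix=["k", "ts"])

print(list(dummies.columns))
```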

In [24]:
mss_df.dtypes

artist_familiarity    float64
artist_hotttnesss     float64
artist_latitude       float64
artist_location        object
artist_longitude      float64
bc_0                  float64
bc_1                  float64
bc_2                  float64
bc_3                  float64
bc_4                  float64
bc_5                  float64
bc_6                  float64
bc_7                  float64
bc_8                  float64
bc_9                  float64
duration              float64
loudness              float64
mode                  float64
release                object
song_hotttnesss       float64
sp_0                  float64
sp_1                  float64
sp_10                 float64
sp_11                 float64
sp_12                 float64
sp_13                 float64
sp_14                 float64
sp_15                 float64
sp_16                 float64
sp_17                 float64
                       ...   
st_32                 float64
st_33                 float64
st_34     

### Save the table `mss_df` in a _pickle_ file

First set the folder to save to and load from. 

In [26]:
save_load_path = 'MillionSongSubset/data'

Save `mss_df` to a _pickle_ file. 

In [27]:
mss_df.to_pickle(save_load_path+'/mss_df.pkl')

Load `mss_df` from the _pickle_ file.

In [28]:
mss_df = pd.read_pickle(save_load_path+'/mss_df.pkl')
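
The save and load steps can be tried on a small scale first. The sketch below round-trips a toy dataframe through a temporary folder and compares the result with the original; the file name `toy_df.pkl` and the values are assumptions for illustration.

```python
import os
import tempfile
import pandas as pd

# Toy stand-in for `mss_df`, including a byte-string column and a missing value.
toy = pd.DataFrame({"title": [b"Soul Deep", b"Face the Ashes"],
                    "tempo": [121.274, 129.738],
                    "song_hotttnesss": [float("nan"), 0.5]})

with tempfile.TemporaryDirectory() as folder:
    pkl_path = os.path.join(folder, "toy_df.pkl")
    toy.to_pickle(pkl_path)               # save
    restored = pd.read_pickle(pkl_path)   # load

# `equals` treats missing values in the same position as equal.
same = restored.equals(toy)
print(same, restored.shape)
```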

Now check that the loaded table has the number of rows and variables we expect.

In [29]:
print('shape:',mss_df.shape)
print('columns:',mss_df.columns.values)
mss_df.dtypes

shape: (10000, 112)
columns: ['artist_familiarity' 'artist_hotttnesss' 'artist_latitude'
 'artist_location' 'artist_longitude' 'bc_0' 'bc_1' 'bc_2' 'bc_3' 'bc_4'
 'bc_5' 'bc_6' 'bc_7' 'bc_8' 'bc_9' 'duration' 'loudness' 'mode' 'release'
 'song_hotttnesss' 'sp_0' 'sp_1' 'sp_10' 'sp_11' 'sp_12' 'sp_13' 'sp_14'
 'sp_15' 'sp_16' 'sp_17' 'sp_18' 'sp_19' 'sp_2' 'sp_20' 'sp_21' 'sp_22'
 'sp_23' 'sp_24' 'sp_25' 'sp_26' 'sp_27' 'sp_28' 'sp_29' 'sp_3' 'sp_30'
 'sp_31' 'sp_32' 'sp_33' 'sp_34' 'sp_35' 'sp_4' 'sp_5' 'sp_6' 'sp_7' 'sp_8'
 'sp_9' 'st_0' 'st_1' 'st_10' 'st_11' 'st_12' 'st_13' 'st_14' 'st_15'
 'st_16' 'st_17' 'st_18' 'st_19' 'st_2' 'st_20' 'st_21' 'st_22' 'st_23'
 'st_24' 'st_25' 'st_26' 'st_27' 'st_28' 'st_29' 'st_3' 'st_30' 'st_31'
 'st_32' 'st_33' 'st_34' 'st_35' 'st_4' 'st_5' 'st_6' 'st_7' 'st_8' 'st_9'
 'tempo' 'title' 'k_0' 'k_1' 'k_2' 'k_3' 'k_4' 'k_5' 'k_6' 'k_7' 'k_8'
 'k_9' 'k_10' 'k_11' 'ts_0' 'ts_1' 'ts_3' 'ts_4' 'ts_5' 'ts_7']


artist_familiarity    float64
artist_hotttnesss     float64
artist_latitude       float64
artist_location        object
artist_longitude      float64
bc_0                  float64
bc_1                  float64
bc_2                  float64
bc_3                  float64
bc_4                  float64
bc_5                  float64
bc_6                  float64
bc_7                  float64
bc_8                  float64
bc_9                  float64
duration              float64
loudness              float64
mode                  float64
release                object
song_hotttnesss       float64
sp_0                  float64
sp_1                  float64
sp_10                 float64
sp_11                 float64
sp_12                 float64
sp_13                 float64
sp_14                 float64
sp_15                 float64
sp_16                 float64
sp_17                 float64
                       ...   
st_32                 float64
st_33                 float64
st_34     

# End