# Read parts of the MSD into tables

This notebook builds a pandas dataframe of artist-to-similar-artist pairs from the `/metadata/songs` table and the `/metadata/similar_artists` node in the HDF5 files of the Million Song Dataset (MSD).

Reading HDF5 files with `pandas` requires the _PyTables_ package. To install it from a console (here into a conda environment named `python3`):

> `$ conda install --name python3 PyTables`

This only needs to happen once on your computer.

### Load libraries

In [13]:
import os
import re
import itertools as it
import pandas as pd
import numpy as np
import operator 
import functools
import time

### Define utility functions

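The `get_filenames` function recursively gets the names of all files in a given directory `path` and all of its subdirectories. If `path` contains subdirectories, the function returns a nested list with one level of nesting per directory level. The `unlist` function flattens such a list by removing one level. The `var_list` function builds a list of numbered variable names, and `h1d_array` returns the first `n` elements of an array as a one-column 2D array.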

In [14]:
def get_filenames(path):
    return([get_filenames(path+"/"+entry.name)
            if entry.is_dir() 
            else path+"/"+entry.name 
            for entry 
            in os.scandir(path)
           ])

def unlist(alist):
    return(list(it.chain.from_iterable(alist)
               )
          )

def var_list(base,numof):
    return([base+str(ndx) for ndx in range(numof)]
          )

def h1d_array(in_array,n): 
    # n1d is the number of elements in `in_array`
    n1d = functools.reduce(operator.mul,
                           list(in_array.shape))
    # return a one-column 2D array holding the first `n` elements
    b = np.ndarray(shape=(n1d,1),
                   buffer=in_array,
                   dtype=in_array.dtype
                  )[0:n,0:1]
    return(b)
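
To see what these helpers do, the following sketch applies them to small made-up inputs (the nested list and the byte strings are illustrative, not MSD data). The function bodies are repeated in condensed form so that the block runs on its own.

```python
import functools
import itertools as it
import operator
import numpy as np

def unlist(alist):
    # flatten exactly one level of nesting
    return list(it.chain.from_iterable(alist))

def var_list(base, numof):
    # build names such as base0, base1, ...
    return [base + str(ndx) for ndx in range(numof)]

def h1d_array(in_array, n):
    # view `in_array` as a single column and keep its first `n` rows
    n1d = functools.reduce(operator.mul, list(in_array.shape))
    return np.ndarray(shape=(n1d, 1),
                      buffer=in_array,
                      dtype=in_array.dtype)[0:n, 0:1]

# made-up nested list with two levels of nesting
nested = [[["a.h5", "b.h5"], ["c.h5"]], [["d.h5"]]]
flat = unlist(unlist(nested))

names = var_list("similar_artist_", 3)

# made-up artist ids stored as fixed-width byte strings
ids = np.array([b"AR1", b"AR2", b"AR3", b"AR4"])
column = h1d_array(ids, 2)

print(flat)
print(names)
print(column.shape)
```

Each call to `unlist` removes one level, so a list nested two levels deep needs two calls.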

The `make_2col_df` function, defined below, returns a dataframe with one row per similar artist and takes the following input:

- `length`: number of similar artists to retrieve for the song
- `filename`: full path file name of an MSD HDF5 file containing data for a single song

The returned dataframe has three columns: `artist_id`, `similar_artist_0`, and `relationship`.

See comments in the code for further details. 

### Get the list of (10,000) HDF5 (.h5) files

The `path` variable stores the root of the directory tree containing all of the song files. The function `get_filenames` returns a multi-level list, which is flattened using `unlist` and stored in variable `filenames` as a list of full-path filenames.

In [15]:
#path = "C:/Users/CH162975/Documents/B/MA755/MillionSongSubset/data"
# raw string so that the backslashes are not read as escape sequences
path = r"D:\Documents\B\Bentley\Coursework\MA755\MillionSongSubset\data"
filenames = unlist(unlist(unlist(get_filenames(path))))
filenames[119:122]

['D:\\Documents\\B\\Bentley\\Coursework\\MA755\\MillionSongSubset\\data/A/A/I/TRAAINT128F933BBE0.h5',
 'D:\\Documents\\B\\Bentley\\Coursework\\MA755\\MillionSongSubset\\data/A/A/I/TRAAIRG128F93265E8.h5',
 'D:\\Documents\\B\\Bentley\\Coursework\\MA755\\MillionSongSubset\\data/A/A/I/TRAAIXN128F428027A.h5']
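
The three nested `unlist` calls assume that every file sits exactly three directories below `path`. Because strings are themselves iterable, a file found at a shallower level would be split into single characters by `unlist`. As an alternative sketch (not the method used in this notebook), `os.walk` from the standard library yields a flat list at any depth. The helper name `list_files` and the temporary directory tree are assumptions made for illustration.

```python
import os
import tempfile

def list_files(path):
    # walk the tree under `path` and collect every file as a full path
    found = []
    for dirpath, dirnames, filenames in os.walk(path):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)

# build a small throwaway tree: <tmp>/A/A/I/x.h5 and <tmp>/A/readme.txt
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "A", "A", "I"))
for rel in [("A", "A", "I", "x.h5"), ("A", "readme.txt")]:
    with open(os.path.join(root, *rel), "w") as handle:
        handle.write("")

files = list_files(root)
print(len(files))
```

Note that `os.path.join` uses the separator of the operating system, whereas `get_filenames` always joins with `/`.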

### Store in the `filenames` variable only the files with extension `.h5`

In [16]:
# raw string so that `\.` reaches the regex engine unchanged
p = re.compile(r"\.h5$")
filenames = [filename for filename 
             in filenames if p.search(filename)]
filenames[119:122]

['D:\\Documents\\B\\Bentley\\Coursework\\MA755\\MillionSongSubset\\data/A/A/I/TRAAINT128F933BBE0.h5',
 'D:\\Documents\\B\\Bentley\\Coursework\\MA755\\MillionSongSubset\\data/A/A/I/TRAAIRG128F93265E8.h5',
 'D:\\Documents\\B\\Bentley\\Coursework\\MA755\\MillionSongSubset\\data/A/A/I/TRAAIXN128F428027A.h5']

In [27]:
len(filenames)

10000
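
The pattern `\.h5$` matches a literal dot followed by `h5` at the end of the string, so names that merely contain `.h5` somewhere else are rejected. A small sketch on made-up file names, alongside the equivalent `str.endswith` test:

```python
import re

# made-up file names for illustration
candidates = ["TRXXXX.h5", "notes.h5.bak", "summary.txt", "xh5"]

p = re.compile(r"\.h5$")
by_regex = [name for name in candidates if p.search(name)]
by_suffix = [name for name in candidates if name.endswith(".h5")]

print(by_regex)
print(by_regex == by_suffix)
```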

In [28]:
def make_2col_df(length, filename=''):
    # open `filename` as a read-only HDF5 file
    store = pd.HDFStore(filename,"r")
    
    # retrieve the first `length` similar artists as a one-column 2D array
    similar_artist = h1d_array(store.root.metadata.similar_artists.read(),length)
    # retrieve the `artist_id` of the song as a one-row dataframe
    at_id = pd.DataFrame(store.root.metadata.songs.read(), columns=['artist_id'])
    
    # close the HDF5 file
    store.close()
    
    # store the similar artists in a dataframe with column `similar_artist_0`
    similar_artist = pd.DataFrame(similar_artist,
                                  columns=var_list('similar_artist_',similar_artist.shape[1]))
    # repeat the `artist_id` row once per requested similar artist
    at_id = pd.concat([at_id]*length, ignore_index=True)
    
    # `axis=1` means stack the dataframes horizontally
    ret = pd.concat([at_id, similar_artist], axis=1)
    ret['relationship'] = 'Similar_artist'
    # return the merged dataframe
    return(ret)
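
The assembly steps can be tried without any HDF5 file. The sketch below follows the same steps on in-memory arrays; the helper name `build_pairs`, the field names other than `artist_id`, and all values are made up for illustration. It also shows one consequence of the horizontal `pd.concat`: when fewer than `length` similar artists are available, the remaining slots are filled with `NaN`.

```python
import numpy as np
import pandas as pd

def build_pairs(songs, similar, length):
    # hypothetical helper: same assembly steps as `make_2col_df`,
    # but on in-memory arrays instead of an HDF5 file
    sim_df = pd.DataFrame(
        {"similar_artist_0": similar[:length].astype(object)})
    at_id = pd.DataFrame({"artist_id": songs["artist_id"].astype(object)})
    # repeat the single artist row `length` times
    at_id = pd.concat([at_id] * length, ignore_index=True)
    # stack side by side, aligning on the row index
    ret = pd.concat([at_id, sim_df], axis=1)
    ret["relationship"] = "Similar_artist"
    return ret

# made-up stand-ins for `/metadata/songs` and `/metadata/similar_artists`
songs = np.array([(b"AR_SONG", b"Some title")],
                 dtype=[("artist_id", "S20"), ("title", "S20")])
similar = np.array([b"AR_A", b"AR_B", b"AR_C"], dtype="S20")

full = build_pairs(songs, similar, length=2)    # enough similar artists
short = build_pairs(songs, similar, length=5)   # only 3 of 5 available

print(full.shape, short.shape)
print(short["similar_artist_0"].isna().sum())
```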

### Create a list of 10,000 dataframes

Each call to `make_2col_df` returns a dataframe with `results_count` rows (one per similar artist) and three columns. The result of this command is a list of 10,000 such dataframes, one per song file. 

In [29]:
# choose the number of similar artists to extract for each song
results_count = 5

In [30]:
mss_df_list = [make_2col_df(length=results_count,filename=filename)
                for filename in filenames[0:10000] # get data from all 10,000 files
              ]
len(mss_df_list), mss_df_list[1].shape

(10000, (5, 3))

### Merge all dataframes of `mss_df_list` into a single dataframe stored in `mss_df`.

In [31]:
mss_df = pd.concat(mss_df_list,axis=0).reset_index(drop=True)
mss_df.shape

(50000, 3)
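
Each dataframe in `mss_df_list` carries its own index running from 0 to `results_count - 1`, so a plain `pd.concat` would repeat those labels. `reset_index(drop=True)` replaces them with a single running index. A sketch on two made-up frames:

```python
import pandas as pd

# two made-up frames, each indexed 0..1
parts = [
    pd.DataFrame({"artist_id": ["X", "X"], "similar_artist_0": ["A", "B"]}),
    pd.DataFrame({"artist_id": ["Y", "Y"], "similar_artist_0": ["C", "D"]}),
]

stacked = pd.concat(parts, axis=0)
renumbered = stacked.reset_index(drop=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```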

### Check its dimensions (shape) and its variables.

In [33]:
print('shape:',mss_df.shape)
print('columns:',mss_df.columns.values)

shape: (50000, 3)
columns: ['artist_id' 'similar_artist_0' 'relationship']


### Check the tail of the table

In [34]:
mss_df.tail(50)

,artist_id,similar_artist_0,relationship
49950,b'ARUUP4L1187B9B72EB',b'AR8W31W1187B9A6F5C',Similar_artist
49951,b'ARUUP4L1187B9B72EB',b'ARHJB981187FB4EE11',Similar_artist
49952,b'ARUUP4L1187B9B72EB',b'ARVQ0YD1187B9BA5B4',Similar_artist
49953,b'ARUUP4L1187B9B72EB',b'ARLB1N91187B99D823',Similar_artist
49954,b'ARUUP4L1187B9B72EB',b'ARI5APR1187B9903DB',Similar_artist
49955,b'ARI4S0E1187B9B06C0',b'ARTMSZO1187B98F20E',Similar_artist
49956,b'ARI4S0E1187B9B06C0',b'ARGMKMG1187B9A8468',Similar_artist
49957,b'ARI4S0E1187B9B06C0',b'AR913ZN1187FB5B56B',Similar_artist
49958,b'ARI4S0E1187B9B06C0',b'AR4VICI1187B995679',Similar_artist
49959,b'ARI4S0E1187B9B06C0',b'AR4GSXW1187B9A4C5E',Similar_artist
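
The `b'...'` prefix shows that the identifiers are stored as byte strings, as read from the HDF5 files. If plain text is preferred for later steps, one option (not applied in this notebook) is to decode the two id columns with the `.str.decode` accessor. A sketch on two rows copied from the output above:

```python
import pandas as pd

# two rows taken from the tail of the table
sample = pd.DataFrame({
    "artist_id": [b"ARUUP4L1187B9B72EB", b"ARI4S0E1187B9B06C0"],
    "similar_artist_0": [b"AR8W31W1187B9A6F5C", b"ARTMSZO1187B98F20E"],
    "relationship": ["Similar_artist", "Similar_artist"],
})

decoded = sample.copy()
for col in ["artist_id", "similar_artist_0"]:
    # turn bytes such as b'AR...' into str 'AR...'
    decoded[col] = decoded[col].str.decode("utf-8")

print(decoded.iloc[0].tolist())
```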


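### Remove NaNs

Drop any row with a missing value, which can occur when a song lists fewer than `results_count` similar artists.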

In [38]:
mss_df.dropna(inplace=True)

In [39]:
mss_df.shape

(50000, 3)
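
The shape is unchanged, which indicates that no row contained a missing value. For reference, a sketch of what `dropna` does when a similar-artist slot is empty, using a made-up frame; the `subset` argument restricts the check to one column:

```python
import numpy as np
import pandas as pd

# made-up frame in which the last similar-artist slot is missing
df = pd.DataFrame({
    "artist_id": [b"AR_X", b"AR_X", b"AR_X"],
    "similar_artist_0": [b"AR_A", b"AR_B", np.nan],
    "relationship": ["Similar_artist"] * 3,
})

before = len(df)
cleaned = df.dropna(subset=["similar_artist_0"]).reset_index(drop=True)
print(before - len(cleaned), "row(s) dropped")
```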

In [40]:
mss_df.tail(50)

,artist_id,similar_artist_0,relationship
49950,b'ARUUP4L1187B9B72EB',b'AR8W31W1187B9A6F5C',Similar_artist
49951,b'ARUUP4L1187B9B72EB',b'ARHJB981187FB4EE11',Similar_artist
49952,b'ARUUP4L1187B9B72EB',b'ARVQ0YD1187B9BA5B4',Similar_artist
49953,b'ARUUP4L1187B9B72EB',b'ARLB1N91187B99D823',Similar_artist
49954,b'ARUUP4L1187B9B72EB',b'ARI5APR1187B9903DB',Similar_artist
49955,b'ARI4S0E1187B9B06C0',b'ARTMSZO1187B98F20E',Similar_artist
49956,b'ARI4S0E1187B9B06C0',b'ARGMKMG1187B9A8468',Similar_artist
49957,b'ARI4S0E1187B9B06C0',b'AR913ZN1187FB5B56B',Similar_artist
49958,b'ARI4S0E1187B9B06C0',b'AR4VICI1187B995679',Similar_artist
49959,b'ARI4S0E1187B9B06C0',b'AR4GSXW1187B9A4C5E',Similar_artist


### Save the table `mss_df` in a _pickle_ file

First set the folder to save to and load from. 

In [41]:
save_load_path = "D:/Documents/B/Bentley/Coursework/MA755/MillionSongSubset/Graph"
mss_df.to_pickle(save_load_path+'/similar_artists.pkl')

Load `mss_df` from the _pickle_ file.

In [42]:
save_load_path = "D:/Documents/B/Bentley/Coursework/MA755/MillionSongSubset/Graph"


In [43]:
similar_artists = pd.read_pickle(save_load_path+'/similar_artists.pkl')

Now check that the loaded table has the number of rows and the variables we expect.

In [45]:
similar_artists.shape, similar_artists.columns

((50000, 3),
 Index(['artist_id', 'similar_artist_0', 'relationship'], dtype='object'))
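
Comparing shapes and column names is a quick check. A stricter sketch, using a made-up frame and a temporary folder in place of `save_load_path`, compares the full contents with `DataFrame.equals`:

```python
import os
import tempfile
import pandas as pd

# made-up frame with the same three columns as `mss_df`
original = pd.DataFrame({
    "artist_id": [b"AR_X", b"AR_X"],
    "similar_artist_0": [b"AR_A", b"AR_B"],
    "relationship": ["Similar_artist", "Similar_artist"],
})

# write to a throwaway folder and read the pickle back
with tempfile.TemporaryDirectory() as folder:
    target = os.path.join(folder, "similar_artists.pkl")
    original.to_pickle(target)
    restored = pd.read_pickle(target)

same = original.equals(restored)
print(same)
```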

In [46]:
similar_artists.tail(50)

,artist_id,similar_artist_0,relationship
49950,b'ARUUP4L1187B9B72EB',b'AR8W31W1187B9A6F5C',Similar_artist
49951,b'ARUUP4L1187B9B72EB',b'ARHJB981187FB4EE11',Similar_artist
49952,b'ARUUP4L1187B9B72EB',b'ARVQ0YD1187B9BA5B4',Similar_artist
49953,b'ARUUP4L1187B9B72EB',b'ARLB1N91187B99D823',Similar_artist
49954,b'ARUUP4L1187B9B72EB',b'ARI5APR1187B9903DB',Similar_artist
49955,b'ARI4S0E1187B9B06C0',b'ARTMSZO1187B98F20E',Similar_artist
49956,b'ARI4S0E1187B9B06C0',b'ARGMKMG1187B9A8468',Similar_artist
49957,b'ARI4S0E1187B9B06C0',b'AR913ZN1187FB5B56B',Similar_artist
49958,b'ARI4S0E1187B9B06C0',b'AR4VICI1187B995679',Similar_artist
49959,b'ARI4S0E1187B9B06C0',b'AR4GSXW1187B9A4C5E',Similar_artist


# End