# Read parts of the MSD into tables

Reading HDF5 files with the `pandas` module requires the _PyTables_ package. To install this package into the conda environment named `python3` from a console:

> `$ conda install --name python3 PyTables`

This makes the `tables` module available. Conda also installs any dependencies of _PyTables_ that are missing.

## Load libraries

In [17]:
import os
import re
import itertools as it
import pandas as pd

## Define utility functions

The `get_filenames` function recursively gets the names of all files in a given directory `path` and all of its subdirectories. If `path` contains subdirectories, the function returns a nested list with one sublist per subdirectory. The `unlist` function flattens a list of lists by removing one level of nesting.

In [18]:
def get_filenames(path):
    # Return the files in path; each subdirectory becomes a nested list.
    return [get_filenames(entry.path) if entry.is_dir() else entry.path
            for entry in os.scandir(path)]

def unlist(alist):
    # Flatten a list of lists by one level.
    return list(it.chain.from_iterable(alist))
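
One caveat with the pair above: `unlist` iterates over every element, so a file name that sits directly in `path` (a plain string rather than a sublist) would be split into single characters, and only one level of nesting is removed. A minimal sketch of an alternative that builds a flat list in one pass with `os.walk` follows. The helper name `get_filenames_flat` and the `.h5` suffix filter are assumptions for illustration, not part of the original notebook.

```python
import os

def get_filenames_flat(path, suffix=".h5"):
    """Return a sorted, flat list of files under path whose names end with suffix."""
    found = []
    # os.walk visits path and every subdirectory below it.
    for dirpath, _dirnames, names in os.walk(path):
        for name in names:
            if name.endswith(suffix):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

With a helper like this, `filenames = get_filenames_flat(path)` would give a flat list whose first element can be opened directly.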

Currently these next two functions are not used. 

## Open the first file (with `pandas`)

Open the first file in the list with the `HDFStore` class of the `pandas` module. Each leaf node has one of two types:

1. `Table` --- `/analysis/songs`, `/metadata/songs` and `/musicbrainz/songs`
1. `EArray` --- all other nodes (either 1 or 2 dimensions)

In [None]:
# filenames is assumed to be a flat list of HDF5 file paths.
x = pd.HDFStore(filenames[0], mode="r")
# x.close()
x

In [None]:
x.root.metadata.artist_terms.read()

## Read data from this file

Check out the root to look for _groups_.

In [None]:
x.root

Check out each group to look for data. Notice data comes in two types, `Table` and `EArray`. 

In [None]:
x.root.metadata

In [None]:
for leaf in x.root.metadata._f_walknodes('Leaf'):
    print(leaf)

In [None]:
x.root.musicbrainz

### Nodes of type `Table` (three of them)

Calling `read()` on each of these nodes returns all of the data as a structured `ndarray` with a single element: one record (row) whose fields are named in the `dtype`. Each cell prints the type, the `dtype`, and the data.

In [None]:
analysis_songs = x.root.analysis.songs.read()
print('type:',type(analysis_songs))
print('dtype:',analysis_songs.dtype)
print('data:',analysis_songs)

In [None]:
musicbrainz_songs = x.root.musicbrainz.songs.read()
print('type:',type(musicbrainz_songs))
print('dtype:',musicbrainz_songs.dtype)
print('data:',musicbrainz_songs)
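
To turn the single record into a plain list or dictionary, something like the following could work. This is a sketch on a synthetic one-row structured array; the field names `title`, `duration`, and `year` are made up for illustration and are not taken from the file.

```python
import numpy as np

# Synthetic one-row structured array standing in for a songs table.
# The field names are hypothetical.
songs = np.array([(b"example", 215.5, 2001)],
                 dtype=[("title", "S16"), ("duration", "f8"), ("year", "i4")])

record = songs[0]                  # the single row
as_list = list(record.tolist())    # plain Python values in field order
as_dict = dict(zip(songs.dtype.names, as_list))  # field name -> value
```

String fields come back as `bytes`, so they may still need decoding before display.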

### Odds and ends

In [None]:
sdf = analysis_songs
print('dtype:',sdf.dtype)
list(sdf.dtype.names)

In [None]:
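def get_row(file_name, var_list=None):
    # Read /metadata/songs from file_name into a DataFrame,
    # keeping only the fields in var_list (None keeps all fields).
    with pd.HDFStore(file_name, mode="r") as store:
        metadata_songs = store.root.metadata.songs.read()
    return pd.DataFrame(metadata_songs, columns=var_list)

lkj = get_row(filenames[0])
print(type(lkj))
lkj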
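
A `DataFrame` built straight from a structured array keeps fixed-width string fields as `bytes`. If decoded text is wanted, one option is to decode those fields before building the frame. The sketch below assumes a hypothetical helper name `records_to_frame`, UTF-8 encoded strings, and one scalar value per field; the sample field names are invented.

```python
import numpy as np
import pandas as pd

def records_to_frame(records, encoding="utf-8"):
    """Build a DataFrame from a structured array, decoding byte-string fields."""
    columns = {}
    for name in records.dtype.names:
        column = records[name]
        if column.dtype.kind == "S":          # fixed-width byte strings
            column = np.char.decode(column, encoding)
        columns[name] = column
    return pd.DataFrame(columns)

# Synthetic stand-in for a one-row songs table (hypothetical fields).
sample = np.array([(b"example", 215.5, 2001)],
                  dtype=[("title", "S16"), ("duration", "f8"), ("year", "i4")])
frame = records_to_frame(sample)
```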

In [None]:
metadata_songs = x.root.metadata.songs.read()
print('type:',type(metadata_songs))
print('data:',metadata_songs)
print('dtype:',metadata_songs.dtype)

In [None]:
x.root.metadata.songs.attrs

## WORKING

In [None]:
data = x.root.metadata.songs.read()
print('shape:' ,data.shape)
print('values:',data)
print('dtype:' ,data.dtype)

### Nodes of type `EArray` (the rest of them)

Calling `read()` on each of these nodes returns the data as an `ndarray` with one or two dimensions.

In [None]:
metadata_artist_terms = x.root.metadata.artist_terms.read()
print('type:' ,type(metadata_artist_terms))
print('shape:',     metadata_artist_terms.shape)
print('data:' ,     metadata_artist_terms)

In [None]:
analysis_tatums_confidence = x.root.analysis.tatums_confidence.read()
print('type:' ,type(analysis_tatums_confidence))
print('shape:',     analysis_tatums_confidence.shape)
print('data:' ,     analysis_tatums_confidence)

In [None]:
analysis_segments_pitches = x.root.analysis.segments_pitches.read()
print('type:' ,type(analysis_segments_pitches))
print('shape:',     analysis_segments_pitches.shape)
print('data:' ,     analysis_segments_pitches)
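
The three cells above repeat the same print pattern. A small helper could reduce the repetition; the name `summarize` is hypothetical, and the arrays below are synthetic stand-ins for what the `EArray` nodes return from `read()`.

```python
import numpy as np

def summarize(name, arr):
    """Return a one-line summary: name, type, shape, and number of dimensions."""
    return f"{name}: {type(arr).__name__}, shape={arr.shape}, ndim={arr.ndim}"

# Synthetic stand-ins with one and two dimensions.
one_dim = np.zeros(5)
two_dim = np.zeros((3, 12))
print(summarize("one_dim", one_dim))
print(summarize("two_dim", two_dim))
```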

## OTHER STUFF - may not be needed

This material should be collected into a notebook for _PyTables_ or for the `tables` module.

The cells below need `table` to refer to one of the `Table` nodes; `/metadata/songs` is used here as an example.

This table has attributes:

In [None]:
table = x.root.metadata.songs
table.attrs # also try _v_attrs in place of attrs

In [None]:
table._v_attrs.FIELD_0_FILL

In [14]:
print(table.attrs._f_list("sys")) # also try "all" and "user"