# About the Notebook

This notebook describes the steps to create row-wise dataframes as well as column-wise dataframes for the Million Songs Subset data.

## Import the libraries 

In [8]:
import os
import re
import itertools as it
import pandas as pd
import numpy as np
import operator 
import functools
import pickle as pk
import networkx as nx
import matplotlib as mplot
import matplotlib.colors as colors
import matplotlib.pyplot as pyplot
import random 

### Set random generator seed

For some analysis, a subset of the __Million Songs Subset__, referred henceforth as MSS, will be used. To be able to pick the same random set at all times and across all notebooks, the set is set to a number (4 in this case) consistently across all notebooks used for this analysis. 

In [10]:
# Set the seed for the random number generator - to be consistent in different notebooks
a = random.randint(0,9)
random.seed(4)
a

3

# Create a Dataframe with the Metadata

To assist in our goal of creating a network graph using the metadata, the data from the `HDF5` files are read row-wise for multiple values for the fields `artist_terms`, `artist_term_wt` and `artist_terms_freq`. This approach makes it easier for analysing the MSS using network graphs.

### Utility Functions

The `get_filenames` function recursively gets the names of all files in a given directory path and all of its subdirectories.The function returns a multi-level list if path contains subdirectories.

The `unlist` function flattens the list by removing one level.

While the `h1d_array` function creates a single row array, the function `h2d_array` creates a single column array.

In [11]:
# The functions to be used
def get_filenames(path):
    return([get_filenames(path+"/"+entry.name)
            if entry.is_dir() 
            else path+"/"+entry.name 
            for entry 
            in os.scandir(path)
           ])

def unlist(alist):
    return(list(it.chain.from_iterable(alist)
               )
          )

def var_list(base,numof):
    return([base+str(ndx) for ndx in range(numof)]
          )

def h1d_array(in_array,n): 
    # n1d is the number of elements in `in_array`
    n1d = functools.reduce(operator.mul,
                           list(in_array.shape))
    # return a 1 row 2D array with `n` columns
    b = np.ndarray(shape=(1,n1d),
                   buffer=in_array,
                   dtype=in_array.dtype
                  )[0:1,0:n]
    return(b)

def h2d_array(in_array,n): 
    # n2d is the number of elements in `in_array`
    n2d = functools.reduce(operator.mul,
                           list(in_array.shape))
    # return a 1 row 2D array with `n` columns
    b = np.ndarray(shape=(n2d,1),
                   buffer=in_array,
                   dtype=in_array.dtype
                  )[0:n2d,0:1]
    return(b)


The `fill_track_artist_mul_col` function basically uses the value in the first row and fills all the other rows in the column with the same value. The input to the function is a dataframe and the list of column numbers for which the filling has to be done. 

The `remove_b_mul_col` function takes a dataframe and column numbers list as the input and returns another dataframe by removing the __b'__ and the __'__ from the begining and end of the string values in the respective columns.

The `bytes_to_string_artist_terms` function does the same function as the `remove_b_mul_col`, explained before. The former is used when the datatype of the column is `S256`, wherein a decoding algorithm is used. The latter function does not work for the particular datatype.

The reason behind using this function will be explained a little later.

In [12]:
# Second set of functions

def fill_track_artist_mul_col(df,col = []):
    for j in range(0,len(col)):
        for i in range(0,len(df)):
            df.iloc[i,col[j]] = df.iloc[0,col[j]]
            #df.iloc[i,1] = df.iloc[0,1]
    return(df)


def bytes_to_string_artist_terms(df,n):
    #for i in range(0,df.shape[1]):
    b = pd.DataFrame(df[df.columns[n-1]])
    b.columns = ['artist_terms']
    for i in range(0,len(df)):
        #if df.dtypes[i] == 'S256':
        b.iloc[i,0] = df.iloc[i,n].decode("utf-8")
    df = pd.concat([df,b], axis = 1)
    return(df)
        
def remove_b_mul_col(df, col = []):
    for j in range(0,len(col)):
        for i in range(0,len(df)):  
                df.iloc[i,col[j]]=str(df.iloc[i,col[j]])[2:-1]
    return(df)


The `make_rows_df` function returns a multiple rows dataframe and takes the following input:
+ `filename`: full path file name of an MSD HDF5 file containing data for a single song
+ `metadata_vars`: list of variable names from `/metadata/songs`
+ `analysis_vars`: list of variable names from `/analysis/songs`
+ `musicbrainz_vars`: list of variable names from `/musicbrainz/songs`
+ `remove`:
    - if `False` the variables listed in the last two parameters are retrieved from the input file
    - if `True` all variables except those listed are retrieved from the input file

When a single track filename is read, and say for example the track has an artist with 10 artist terms, then only the artist terms column will have 10 values (in 10 rows). The other variables (from the variable list) will only be read once and populated in the first row. So, in order to fill them up for all rows, the function `fill_track_artist_mul_col` is used within the `make_rows_df` function. Similarly, to remove the __b'__ and __'__ from some columns, the `remove_b_mul_col` function is used within. Since the column numbers that need to be filled and fixed are known, these functions are embedded within `make_rows_df`.

In [13]:
def make_rows_df(filename='', metadata_vars=[], analysis_vars=[], musicbrainz_vars = [], remove=False):
    # open `filename` as a HDF5 file
    store = pd.HDFStore(filename,"r")
    if remove==True:
        # `metadata_vars` and `analysis_vars` contain the variables to remove
        metadata_vars = list({item for item 
                                  in list(store.root.metadata.songs.read().dtype.names) 
                                  if item not in metadata_vars})
        analysis_vars = list({item for item 
                                  in list(store.root.analysis.songs.read().dtype.names) 
                                  if item not in analysis_vars})
        musicbrainz_vars = list({item for item 
                                  in list(store.root.musicbrainz.songs.read().dtype.names) 
                                  if item not in musicbrainz_vars})
    # else: `metadata_vars` and `analysis_vars` contain the variables to keep
    
    # retrieve the first `n` values as a vertical array of 1 dimension
    #artist_name = h2d_array(store.root.metadata.artist_name.read(),1)
    artist_terms     = h2d_array(store.root.metadata.artist_terms.read(),37)
    artist_terms_freq  = h2d_array(store.root.metadata.artist_terms_freq.read(),37)
    artist_terms_weight = h2d_array(store.root.metadata.artist_terms_weight.read(),37)
    
    # store these values as variables in single dataframes
    #s_art_df = pd.DataFrame(similar_artists ,columns=var_list('sart_',bars_confidence .shape[1]))
    at_df = pd.DataFrame(artist_terms,columns=['at'])
    atf_df = pd.DataFrame(artist_terms_freq,columns=['at_freq'])
    atw_df = pd.DataFrame(artist_terms_weight ,columns=['at_wt'])
    
    # get track from filename
    match = re.split('\/',filename)
    match[-1]
    match = re.split('\.',match[-1])
    
    # merge these single dataframes into one single row dataframe
    ret = pd.concat([
            # make single row dataframe from track
            pd.DataFrame([match[0]], columns=['track']),
            # retrieve a single row dataframe from `/metadata/songs`
            pd.DataFrame(store.root.metadata.songs.read(), 
                         columns=metadata_vars),
            # retrieve a single row dataframe from `/analysis/songs`
            pd.DataFrame(store.root.analysis.songs.read(), 
                         columns=analysis_vars),
            # retrieve a single row dataframe from `/musicbrainz/songs`
            pd.DataFrame(store.root.musicbrainz.songs.read(), 
                         columns=musicbrainz_vars),
            at_df, 
            atf_df,
            atw_df],
            #bc_df, 
            #sp_df,
            #st_df],
            #s_art_df,
            axis=1) # `axes=1` means stack the dataframes horizontally 
    # close the HDF5 file
    store.close()
    # fill
    aa = fill_track_artist_mul_col(ret,col = [0,1,2,3,4,5,6,7,8])
    # remove b
    bb = remove_b_mul_col(aa, col = [1,2,3,4])
    ##cc = bytes_to_string_artist_terms(bb,6)
    # drop artist_terms with b (since it is a S256 dtype)
    #dd = cc.drop('at',axis = 1)
    # return the merged dataframe
    return(bb)

### Get the list of (10,000) HDF5 (.h5) files

The `path` variable stores the root of the directory tree containing all of the song files. The function `get_filenames` returns a multi-level list, which is flattened using `unlist` and stored in variable `filenames` as a list of full-path filenames.

In [14]:
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/millionsongsubset_full/MillionSongSubset/data"
filenames = unlist(unlist(unlist(get_filenames(path))))
filenames[0:2]

['C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/millionsongsubset_full/MillionSongSubset/data/A/A/A/TRAAAAW128F429D538.h5',
 'C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/millionsongsubset_full/MillionSongSubset/data/A/A/A/TRAAABD128F429CF47.h5']

### Store in the filenames variable only the files with extension .h5¶

In [15]:
p = re.compile("\.h5$")
filenames = [filename for filename 
             in filenames if p.search(filename)]
filenames[0:2]
len(filenames)

10000

### Pick a random set of the files from above (optional)

For some memory intensive analysis, a subset of 500 songs from the MSS, will be used. The code chunk in the following cell helps pick a random 500 songs subset.

In [16]:
random_filename_ind = random.sample(range(0,10000), 500)

filenames_500 = [ filenames[i] for i in random_filename_ind]
filenames_500

filenames = filenames_500 # comment this line out, if the entire 10,000 songs are to be read

### Read a single track file

A single track file is read using the functions described earlier and validated.

In [17]:
# Read one track file and get the data - which will be a bunch of rows
f_m = make_rows_df(filename=filenames[40],
                            metadata_vars=['title','song_id','artist_id','artist_name',
                                           'artist_familiarity','artist_hotttnesss','song_hotttnesss'],
                            # Omit: genre
                            analysis_vars=[],
                             musicbrainz_vars = ['year'],
                            # Omit: danceability, energy
                            remove=False
                           )

#a.iloc[:,3:40]
print(type(f_m))
f_m

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,track,title,song_id,artist_id,artist_name,artist_familiarity,artist_hotttnesss,song_hotttnesss,year,at,at_freq,at_wt
0,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'hardcore hip hop',0.939069,1.0
1,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'musette',0.878138,0.954661
2,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'chanson',0.878138,0.937098
3,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'progressive house',0.878138,0.917423
4,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'hip hop',1.0,0.883275
5,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'rap',0.974711,0.860299
6,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'trance',0.878138,0.795758
7,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'underground hip hop',0.670753,0.771888
8,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'underground rap',0.582403,0.69871
9,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,b'electronic',0.806153,0.653429


In [18]:
# Remove b' in artist terms as well. This was not easy since the data type was S256 and had to be decoded.
f_m = bytes_to_string_artist_terms(f_m,9)
f_m = f_m.drop(['at'],axis = 1) # axis = 1 to drop a colummn, otherwise it will look for a matching index (key)
f_m

Unnamed: 0,track,title,song_id,artist_id,artist_name,artist_familiarity,artist_hotttnesss,song_hotttnesss,year,at_freq,at_wt,artist_terms
0,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.939069,1.0,hardcore hip hop
1,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.878138,0.954661,musette
2,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.878138,0.937098,chanson
3,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.878138,0.917423,progressive house
4,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,1.0,0.883275,hip hop
5,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.974711,0.860299,rap
6,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.878138,0.795758,trance
7,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.670753,0.771888,underground hip hop
8,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.582403,0.69871,underground rap
9,TRAEXAN128EF354542,Skopska situacija,SOSYNKD12A6D4F6655,ARZI4B31187B991125,SAF,0.467583,0.27896,0.0,0.0,0.806153,0.653429,electronic


### Read all the tracks from `filenames`

The result will be having more rows than the number of tracks read; more specifically the number of rows will be equal to the number of tracks (`track`) times the number of `artist_terms` associated with that track (through the `artist_id`). The number of columns will be as provided in the input list plus four additional columns for the `track`, `artist_terms`, `artist_terms_wt`(`at_wt` in the dataframe) and `artist_terms_freq` (`at_freq` in the dataframe).

In [19]:
# Get the dataframe for all the filenames

for i in range(0,len(filenames_500)):
    if i == 0:
        df_mss_m = make_rows_df(filename=filenames_500[i],
                            metadata_vars=['title','song_id','artist_id','artist_name',
                                           'artist_familiarity','artist_hotttnesss','song_hotttnesss'],
                            # Omit: genre
                            analysis_vars=[],
                            musicbrainz_vars = ['year'],
                            
                            # Omit: danceability, energy
                            remove=False
                           )
    else:
        df_temp = make_rows_df(filename=filenames_500[i],
                            metadata_vars=['title','song_id','artist_id','artist_name',
                                           'artist_familiarity','artist_hotttnesss','song_hotttnesss'],
                            # Omit: genre
                            analysis_vars=[],
                            musicbrainz_vars = ['year'],
                            # Omit: danceability, energy
                            remove=False
                           )
        df_mss_m = pd.concat([df_mss_m,df_temp])
        
# remove b' from artist_terms
df_mss_m = bytes_to_string_artist_terms(df_mss_m,9)
# drop the field with the b'
df_mss_m = df_mss_m.drop(['at'],axis = 1) # axis = 1 to drop a colummn, otherwise it will look for a matching index (key)

print(df_mss_m.shape)

(13117, 12)


In [20]:
# check the head of the dataframe
df_mss_m.head(50)

Unnamed: 0,track,title,song_id,artist_id,artist_name,artist_familiarity,artist_hotttnesss,song_hotttnesss,year,at_freq,at_wt,artist_terms
0,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,1.0,1.0,post-grunge
1,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.795196,0.831104,southern rock
2,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.761407,0.811431,chill-out
3,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.795196,0.808603,soft rock
4,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.755268,0.806579,alternative dance
5,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.755268,0.806579,country rock
6,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.907628,0.804375,alternative rock
7,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.835125,0.799851,pop rock
8,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.795196,0.791244,grunge
9,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.755268,0.787546,trip hop


From the output above, it can be seen that each `track` has as many rows as the number of `artist_terms` for the track's artist.

### Reset the index, pickle and save dataframe

Resetting of the index is critical in this case since the index repeats when a new track file is read. Failure to do this will lead to numerous bugs during further analysis.

In [21]:
# reset_index
df_mss_m = df_mss_m.reset_index()
del df_mss_m['index']
#df = df.reset_index(drop=True)
df_mss_m.head(50)

Unnamed: 0,track,title,song_id,artist_id,artist_name,artist_familiarity,artist_hotttnesss,song_hotttnesss,year,at_freq,at_wt,artist_terms
0,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,1.0,1.0,post-grunge
1,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.795196,0.831104,southern rock
2,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.761407,0.811431,chill-out
3,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.795196,0.808603,soft rock
4,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.755268,0.806579,alternative dance
5,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.755268,0.806579,country rock
6,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.907628,0.804375,alternative rock
7,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.835125,0.799851,pop rock
8,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.795196,0.791244,grunge
9,TRANGCR128F148CF28,Think About Me,SOOJCHC12A6D4F81E1,AREPD8D1187FB3F5BC,Sister Hazel,0.731066,0.509879,0.707172,1997.0,0.755268,0.787546,trip hop


In [23]:
# Pickle the dataframe
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/Pickle Files"
df_mss_m.to_pickle(path+"/MSS_METADATA_SUBSET_500.pkl")

# Create Dataframe using Analysis and MusicBrainz

The steps to do this are very similar to the creation of a dataframe using the metadata. The reason this is being done separately is to reduce the filesize since in this case, there will be a single row data for each track as opposed to the dataframe using metadata. 

The `make_1row_df` function returns a single row dataframe and takes the following input:

- `filename`: full path file name of an MSD HDF5 file containing data for a single song
- `metadata_vars`: list of variable names from `/metadata/songs`
- `analysis_vars`: list of variable names from `/analysis/songs`
- `musicbrainz_vars`: list of variable names from `/musicbrainz/songs`
- `remove`: 
    - if `False` the variables listed in the last two parameters are retrieved from the input file
    - if `True` all variables except those listed are retrieved from the input file

In [None]:
def make_1row_df(filename='', metadata_vars=[], analysis_vars=[], musicbrainz_vars = [], remove=False):
    # open `filename` as a HDF5 file
    store = pd.HDFStore(filename,"r")
    if remove==True:
        # `metadata_vars` and `analysis_vars` contain the variables to remove
        metadata_vars = list({item for item 
                                  in list(store.root.metadata.songs.read().dtype.names) 
                                  if item not in metadata_vars})
        analysis_vars = list({item for item 
                                  in list(store.root.analysis.songs.read().dtype.names) 
                                  if item not in analysis_vars})
        musicbrainz_vars = list({item for item 
                                  in list(store.root.musicbrainz.songs.read().dtype.names) 
                                  if item not in musicbrainz_vars})
    # else: `metadata_vars` and `analysis_vars` contain the variables to keep
    
    # retrieve the first `n` values as a horizontal array of 1 dimension
    segments_pitches = h1d_array(store.root.analysis.segments_pitches.read(),36)
    segments_timbre  = h1d_array(store.root.analysis.segments_timbre.read(),36)
    bars_confidence  = h1d_array(store.root.analysis.bars_confidence.read(),10)
    #similar_artists = h1d_array(store.root.metadata.similar_artists.read(),10)
    
    
    # store these values as variables in single dataframes
    #s_art_df = pd.DataFrame(similar_artists ,columns=var_list('sart_',bars_confidence .shape[1]))
    bc_df = pd.DataFrame(bars_confidence ,columns=var_list('bc_',bars_confidence .shape[1]))
    sp_df = pd.DataFrame(segments_pitches,columns=var_list('sp_',segments_pitches.shape[1]))
    st_df = pd.DataFrame(segments_timbre ,columns=var_list('st_',segments_timbre .shape[1]))

    # get track from filename
    match = re.split('\/',filename)
    match[-1]
    match = re.split('\.',match[-1])
    
    # merge these single dataframes into one single row dataframe
    ret = pd.concat([
            # make single row dataframe from track
            pd.DataFrame([match[0]], columns=['track']),
            # retrieve a single row dataframe from `/metadata/songs`
            pd.DataFrame(store.root.metadata.songs.read(), 
                         columns=metadata_vars),
            # retrieve a single row dataframe from `/analysis/songs`
            pd.DataFrame(store.root.analysis.songs.read(), 
                         columns=analysis_vars),
            # retrieve a single row dataframe from `/musicbrainz/songs`
            pd.DataFrame(store.root.musicbrainz.songs.read(), 
                         columns=musicbrainz_vars),
            bc_df, 
            sp_df,
            st_df],
            #s_art_df,
            axis=1) # `axes=1` means stack the dataframes horizontally 
    # close the HDF5 file
    store.close()
    # return the merged dataframe
    return(ret)

### Read one track file and validate

In [None]:
# Read one track file and get the data - which will be a single row
f_a = make_1row_df(filename=filenames[40],
                            metadata_vars=['track_7digitalid','title','song_id','artist_id','artist_name',
                                                'release','idx_similar_artists',
                                               'artist_familiarity','artist_hotttnesss','artist_location',
                                               'artist_longitude','artist_latitude','song_hotttnesss'],
                            # Omit: genre
                            analysis_vars=['track_id','duration','key','loudness','mode',
                                           'tempo','time_signature','end_of_fade_in'],
                             musicbrainz_vars = ['year'],
                            # Omit: danceability, energy
                            remove=False
                           )

#a.iloc[:,3:40]
print(type(f_a))
f_a

### Read all the tracks from `filenames`

The result of this command is a list of single row dataframes with columns indicated in the variables list.

In [None]:
list_mss_a = [make_1row_df(filename=filename,
                           metadata_vars=['track_7digitalid','title','song_id','artist_id','artist_name',
                                                'release','idx_similar_artists',
                                               'artist_familiarity','artist_hotttnesss','artist_location',
                                               'artist_longitude','artist_latitude','song_hotttnesss'],
                            # Omit: genre
                            analysis_vars=['track_id','duration','key','loudness','mode',
                                           'tempo','time_signature','end_of_fade_in'],
                             musicbrainz_vars = ['year'],
                            # Omit: danceability, energy
                            remove=False
                           )
                for filename in filenames[0:len(filenames)] # get data from all the files
              ]
len(list_mss_a), list_mss_a[0].shape

### Merge all dataframes of list_mss_a into a single dataframe stored in df_mss_a

In [None]:
df_mss_a = pd.concat(list_mss_a,axis=0).reset_index(drop=True)

### Remove "b ' " from relevant columns

In [None]:
#df_mss_a.dtypes
# remove b' from the requisite columns
df_mss_a = remove_b_mul_col(df_mss_a, col = [2,4,6,100,103])
df_mss_a

### Pickle and save the dataframe object

In [None]:
# Pickle the dataframe
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/Pickle Files"
df_mss_a.to_pickle(path+"/MSS_ANALYSIS_DATA_SUBSET_500.pkl")