# About this Notebook

In this notebook, the dataframes representing the relationships between different combinations of fields are created from the master data obtained in the notebook titled __`G_Final_1_Data_Creation_From_HDF5.ipynb`__.

### Import the libraries and set random seed 

In [2]:
import os
import re
import itertools as it
import pandas as pd
import numpy as np
import operator 
import functools
import pickle as pk
import networkx as nx
import matplotlib as mplot
import matplotlib.colors as colors
import matplotlib.pyplot as pyplot
import random 

In [4]:
# Set the seed for the random number generator - to be consistent in different notebooks
# The seed must be set before the first draw; otherwise the draw is not reproducible
random.seed(4)
a = random.randint(0,9)
a

3
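As a minimal sketch of why the order matters: re-seeding with the same value replays the same sequence of draws, provided the seed is set before the first draw. The helper name `draw_digits` is hypothetical and is used only for this illustration.

```python
import random

def draw_digits(seed, n=5):
    """Seed the generator first, then draw n integers in [0, 9]."""
    random.seed(seed)
    return [random.randint(0, 9) for _ in range(n)]

first = draw_digits(4)
second = draw_digits(4)

# Both runs start from the same generator state, so the draws match.
print(first == second)
```

Any notebook that calls `random.seed(4)` before sampling should therefore pick the same random subset, as long as it makes the same sequence of draws.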

# Reiterating the Goal 

To preserve continuity, this section restates the goal of graph-based __song recommendation__ and the approach taken, before explaining why specific relationships are gathered from the master dataframe.

## Goal

To create a `track`-`track` network graph with appropriate weights that will help in song recommendation, based on shortest-distance algorithms for networks.

The steps to achieve the goal are explained below.

### Create a Graph with the nodes being `track`, `artist_id` and `artist_terms`

This is the most obvious graph to build from the master MSS data. It has the following relationships (or edges).

1. `track` - `artist_id`: with attributes being `artist_familiarity`, `artist_hotttnesss` and `song_hotttnesss`
2. `artist_id` - `artist_terms`: with attributes being `at_wt` and `at_freq`

### Remove the `artist_id` and `artist_terms` nodes and replace with appropriate relationships between `track` nodes

A `track` - `track` relationship network graph is the end goal. In the graph with all three types of nodes, two tracks are never connected directly. They are connected only through `artist_id` nodes and, when the artists differ, through `artist_terms` nodes. The cases below are exhaustive.

+ Two tracks by the same artist will be connected through the common `artist_id` node.
+ Two tracks by different artists will be connected through the `artist_terms` nodes common to the two `artist_id` nodes.

    - No common artist terms:

    There will be no path between the two tracks unless a third `artist_id`, which has at least one common term with each of the two artists, is involved.

    - `n` common artist terms:

    There will be `n` paths between the two tracks. Each (undirected) path consists of the following nodes, in order: track1--artist1--artist_term--artist2--track2.
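The cases above can be sketched in plain Python. This is only an illustration under stated assumptions: the IDs, the terms and the helper `paths_between` are hypothetical toy values, not data from the MSS subset.

```python
# Toy data (hypothetical IDs, not taken from the MSS subset).
track_artist = {"t1": "a1", "t2": "a2", "t3": "a1"}
artist_terms = {
    "a1": {"rock", "punk", "new wave"},
    "a2": {"rock", "punk", "jazz"},
}

def paths_between(track_a, track_b):
    """List the paths between two tracks in the three-node-type graph."""
    artist_a, artist_b = track_artist[track_a], track_artist[track_b]
    if artist_a == artist_b:
        # Same artist: the tracks meet at the shared artist node.
        return [(track_a, artist_a, track_b)]
    # Different artists: one path per common artist term.
    common = sorted(artist_terms[artist_a] & artist_terms[artist_b])
    return [(track_a, artist_a, term, artist_b, track_b) for term in common]

for path in paths_between("t1", "t2"):
    print("--".join(path))
print(paths_between("t1", "t3"))
```

Here `a1` and `a2` share two terms, so `t1` and `t2` are joined by two paths, while `t1` and `t3` meet at their common artist.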
        
The sequential approach will be described in the sections that follow, in this notebook.


# Graph Relationship Data from MSS Metadata

The master dataframe created in the notebook `G_Final_1_Data_Creation_From_HDF5.ipynb`, from the MSS metadata, is used to deduce the relationships that exist between the following combinations of fields.

+ __`track` - `artist_id`__

    This is a straightforward relationship from the master dataframe. For each `track`, there will be one and only one `artist_id`. 
    
    
+ __`artist_id` - `artist_terms`__

    This is another relatively straightforward relationship to deduce from the master dataframe. Each `artist_id` can have more than one `artist_terms` value, and each row of the master dataframe represents exactly one such relationship.
    

+ __`artist_id` - `artist_id`__

    There can be more than one relationship between two artists, via their common artist terms. To reduce these multiple relationships (or multiple paths) between a pair of artists to a single relationship with the appropriate strength, a self-join of the master dataframe on `artist_terms` is required.
    
    
+ __`track` - `track`__

    Since the final goal is to reduce the graph to a `track` to `track` relationship graph, deducing this relationship is critical. Again, it is done using a self-join of the master dataframe. Since a song is directly connected to only one artist, reducing the `artist_id` - `artist_id` relationships to a single relationship is equivalent to reducing the `track` - `track` paths to a single path. Hence, only the `track` - `track` relationship deduction is described in detail later in this notebook, after which the `artist_id` - `artist_id` relationship is deduced in the same way.


Since building and analyzing a graph is computationally intensive, data from a random subset of __500 tracks__ is used. The following sections describe the sequence of steps, starting with the master MSS metadata, to arrive at the `track`-`track` relationship data.

### 1. Read the master MSS metadata dataframe

In [5]:
# Read the 500 songs subset data
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/Pickle Files"
df_m = pd.read_pickle(path+"/MSS_METADATA_SUBSET_500.pkl")
df_m.head(50)

Unnamed: 0,track,title,song_id,artist_id,artist_name,artist_familiarity,artist_hotttnesss,song_hotttnesss,year,at_freq,at_wt,artist_terms
0,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.998522,1.0,new wave
1,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.80536,0.950451,neue deutsche welle
2,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.728236,0.885503,kraut rock
3,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,1.0,0.874499,rock
4,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.896572,0.868037,alternative rock
5,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.863182,0.810248,punk
6,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.728236,0.740235,electro
7,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.598536,0.708393,germany
8,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.768071,0.695211,electronic
9,TRACBMH128F93291A4,Blaue Matrosen (live),SOTNLBM12AB0185DD2,ARFY8NV1187B9B1961,Der Moderne Man,0.419034,0.293984,,0.0,0.718635,0.69521,experimental


Each row in the dataframe above represents the path from a track to an artist term via the artist. The output of the following cell describes the potential nodes in the 500 songs subset data.

In [6]:
print('The master dataframe of 500 songs metadata has dimensions: ', df_m.shape)
print('No. of unique tracks:',len(df_m.track.unique()))
print('No. of unique artists:', len(df_m.artist_id.unique()))
print('No. of unique artist_terms:', len(df_m.artist_terms.unique()))
print('No. of unique song id:', len(df_m.song_id.unique()))
#print('No. of unique years:', len(df_m.year.unique()))

The master dataframe of 500 songs metadata has dimensions:  (13179, 12)
No. of unique tracks: 500
No. of unique artists: 469
No. of unique artist_terms: 1561
No. of unique song id: 500


The dataframe has 13179 rows. There are 500 tracks, with 469 unique artists and 1561 unique artist terms. The `song_id`, which was defined to be a key that connects two similar tracks, is unique for each song. So, it does not have any significance from the recommendation perspective.

For the potential attributes of the nodes, we look at the null value counts of the different columns in the master dataframe.

In [7]:
print('NULL VALUE COUNTS FOR EACH COLUMN \n', '------------------------------')
print(df_m.isnull().sum())

NULL VALUE COUNTS FOR EACH COLUMN 
 ------------------------------
track                    0
title                    0
song_id                  0
artist_id                0
artist_name              0
artist_familiarity       0
artist_hotttnesss        0
song_hotttnesss       5982
year                     0
at_freq                  0
at_wt                    0
artist_terms             0
dtype: int64


The counts show that `song_hotttnesss` is null in 5,982 of the 13,179 rows (about 45%), so it will not be very useful as an attribute for recommendation purposes. The fields without nulls, namely `artist_familiarity`, `artist_hotttnesss` and `at_wt`, will therefore be the main attributes for the graph and subsequently for the distance calculations for song recommendation. `song_hotttnesss` is still carried through the aggregations below so that it remains available where it is present.
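The same screening can be expressed as a share of nulls per column, which makes it easier to compare columns. This is a minimal sketch on a toy frame; the values are invented and `null_share` and `usable` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the master dataframe (values are made up).
toy = pd.DataFrame({
    "track": ["t1", "t1", "t2", "t3"],
    "artist_familiarity": [0.42, 0.42, 0.87, 0.61],
    "song_hotttnesss": [np.nan, np.nan, 0.74, np.nan],
})

# Fraction of null values in each column.
null_share = toy.isnull().mean()

# Columns with no nulls at all are safe to use as attributes everywhere.
usable = null_share[null_share == 0].index.tolist()

print(null_share)
print(usable)
```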

### 2. `track` - `artist_id` Relationship Data

The `df_m` dataframe, which is the 500 songs subset of the MSS data, is used to get this relationship. The unique combinations of the `track`, `artist_id`, `title`, `artist_name`, `artist_familiarity`, `artist_hotttnesss` and `song_hotttnesss` fields from the master dataframe give the `track`-`artist_id` relationship dataframe with the relevant attributes. Each row of the resulting dataframe represents one `track`-`artist_id` relationship, with the other columns being its attributes.

In [8]:
df_ta = df_m.loc[:,['track','artist_id','title','artist_name','artist_familiarity','artist_hotttnesss','song_hotttnesss']]

#drop duplicates
df_ta = df_ta.drop_duplicates()

#reset index, discarding the old row labels
df_ta = df_ta.reset_index(drop=True)

print(df_ta.shape)
df_ta.head(10)

(500, 7)


Unnamed: 0,track,artist_id,title,artist_name,artist_familiarity,artist_hotttnesss,song_hotttnesss
0,TRACBMH128F93291A4,ARFY8NV1187B9B1961,Blaue Matrosen (live),Der Moderne Man,0.419034,0.293984,
1,TRBIBSS128F14A205E,ARBH0MS1187FB36311,Two Elegiac Melodies Op. 34: 2. The Last Spring,Sir Neville Marriner,0.430025,0.353034,0.588922
2,TRASAIH128F931F567,ARIMZQZ1187B9AD541,Betrayal Is A Symptom,Thrice,0.873239,0.519965,0.735346
3,TRBDQVN128F425CF6D,ARZ3U0K1187B999BF4,Machuca,Cabas,0.606268,0.397415,0.375984
4,TRBAPPF128F4243DA5,AR5MK521187B98E0B8,We Blame Love,Heaven 17,0.630929,0.461558,
5,TRATSNE128F427EDB6,AR41B9G1187B990D63,African Typic Collection,Sam Fan Thomas,0.360187,0.243065,
6,TRAENEO128F426789F,AR8QHU51187B9A3341,Shi Shang Zui Qiang Man Hua Wang Da Luan Dou,Leo Ku,0.483279,0.360629,
7,TRAJTCL128F1460CDA,ARD0S291187B9B7BF5,Shoulda Did,Rated R,0.556496,0.261941,
8,TRACIEF128F4270573,AROBQ0B1187FB404F6,What Would You Do,Walter Jackson,0.437194,0.316156,0.353261
9,TRBBZFO128F932CD27,ARGLOWN1187B99C06D,2StepN,North Mississippi Allstars,0.746214,0.476955,
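The reduction from one row per artist term to one row per track can be sketched on a toy frame. The column names follow the notebook; the rows are invented for illustration.

```python
import pandas as pd

# Long format: one row per (track, artist term), as in the master dataframe.
long_rows = pd.DataFrame({
    "track": ["t1", "t1", "t1", "t2", "t2"],
    "artist_id": ["a1", "a1", "a1", "a2", "a2"],
    "artist_terms": ["rock", "punk", "new wave", "rock", "jazz"],
})

# Keep only the relationship columns, then collapse the repeated rows.
track_artist = (
    long_rows[["track", "artist_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(track_artist)
```

Checking `track_artist["track"].is_unique` afterwards is a cheap way to confirm that each track maps to one and only one `artist_id`, which is what the `(500, 7)` shape above indicates for the real data.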


In [9]:
# Pickle the dataframe
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/Pickle Files"
df_ta.to_pickle(path+"/MSS_METADATA_SUBSET_500_TRACK_ARTIST.pkl")

### 3. `artist_id` - `artist_terms` Relationship Data

The `df_m` dataframe, which is the 500 songs subset of the MSS data, is used to get this relationship as well. The unique combinations of the `artist_id`, `artist_terms`, `artist_name`, `artist_familiarity`, `artist_hotttnesss`, `at_wt` and `at_freq` fields from the master dataframe give the `artist_id`-`artist_terms` relationship dataframe with the relevant attributes. Each row of the resulting dataframe represents one `artist_id`-`artist_terms` relationship, with the other columns being its attributes.

In [10]:
df_aat = df_m.loc[:,['artist_id','artist_terms','artist_name','artist_familiarity','artist_hotttnesss','at_wt','at_freq']]

# drop duplicates
df_aat = df_aat.drop_duplicates()

# reset index, discarding the old row labels
df_aat = df_aat.reset_index(drop=True)

print(df_aat.shape)
df_aat.head(10)

(12575, 7)


Unnamed: 0,artist_id,artist_terms,artist_name,artist_familiarity,artist_hotttnesss,at_wt,at_freq
0,ARFY8NV1187B9B1961,new wave,Der Moderne Man,0.419034,0.293984,1.0,0.998522
1,ARFY8NV1187B9B1961,neue deutsche welle,Der Moderne Man,0.419034,0.293984,0.950451,0.80536
2,ARFY8NV1187B9B1961,kraut rock,Der Moderne Man,0.419034,0.293984,0.885503,0.728236
3,ARFY8NV1187B9B1961,rock,Der Moderne Man,0.419034,0.293984,0.874499,1.0
4,ARFY8NV1187B9B1961,alternative rock,Der Moderne Man,0.419034,0.293984,0.868037,0.896572
5,ARFY8NV1187B9B1961,punk,Der Moderne Man,0.419034,0.293984,0.810248,0.863182
6,ARFY8NV1187B9B1961,electro,Der Moderne Man,0.419034,0.293984,0.740235,0.728236
7,ARFY8NV1187B9B1961,germany,Der Moderne Man,0.419034,0.293984,0.708393,0.598536
8,ARFY8NV1187B9B1961,electronic,Der Moderne Man,0.419034,0.293984,0.695211,0.768071
9,ARFY8NV1187B9B1961,experimental,Der Moderne Man,0.419034,0.293984,0.69521,0.718635


In [48]:
# Pickle the dataframe
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/Pickle Files"
df_aat.to_pickle(path+"/MSS_METADATA_SUBSET_500_ARTIST_ARTIST_TERMS.pkl")

### 4. `track` - `track` relationship data

As mentioned earlier, the master dataframe of the 500 songs subset is joined with itself on `artist_terms` to get the `track` - `track` relationship.

In [11]:
# Gives all song to song relationship based on same artist_terms
df_ss= pd.merge(df_m,df_m, on = 'artist_terms', how = 'inner', suffixes =('_l', '_r'))

# Remove those edges where it is a loop edge (track connected to the same track)
df_ss = df_ss[df_ss['track_l'] != df_ss['track_r']]

print('Total number of paths between two tracks through t-a-at-a-t : ', df_ss.shape[0])

Total number of paths between two tracks through t-a-at-a-t :  784096


Each row represents a `track` - `track` path via an artist term, so the total number of track-to-track paths (track1--artist1--artist_term--artist2--track2) is 784,096. However, the dataframe above contains the reverse paths as well (that is, both track1 - track2 and track2 - track1). They could be removed, but since they do not affect an undirected graph (at least with `networkx`), they are left as is. Note that there will be multiple rows for the same pair of tracks, since the pair may be connected through more than one artist term. The next logical step is to group these multiple rows for each pair and replace them with an aggregated weight computed using a mathematical function.
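The self-join, the removal of loop edges and the optional removal of reverse paths can be sketched on a toy frame. The filter on `track_l < track_r` is one possible way to keep a single direction per pair; it is an assumption for illustration and is not applied in this notebook.

```python
import pandas as pd

# Toy long-format data: one row per (track, artist term).
toy = pd.DataFrame({
    "track": ["t1", "t1", "t2", "t2", "t3"],
    "artist_terms": ["rock", "punk", "rock", "punk", "rock"],
})

# Self-join on the term: every pair of tracks sharing a term gets one row per term.
pairs = pd.merge(toy, toy, on="artist_terms", how="inner", suffixes=("_l", "_r"))

# Drop loop edges (a track joined to itself).
pairs = pairs[pairs["track_l"] != pairs["track_r"]]

# Optionally keep one direction only, using the string order of the track IDs.
one_way = pairs[pairs["track_l"] < pairs["track_r"]]

print(len(pairs), len(one_way))
```

Each undirected path appears twice in `pairs` and once in `one_way`, so the second count is exactly half of the first.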

#### Find all `track` - `track` groups
This is done using the `groupby` method of the `pandas` dataframe, as shown below. Refer to the comments and outputs in the individual cells for the data types of the various objects.

In [14]:
# find all song-song groups
group_ss = df_ss.groupby(['track_l', 'track_r']) # A pandas GroupBy object

In [35]:
print(len(group_ss))
print(type(group_ss))
print(type(group_ss.groups))
group_ss.groups

198840
<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'dict'>


{('TRANUPI12903CA6A4F', 'TRAQRAM128F4296694'): [70790,
  175516,
  202404,
  247868,
  323928,
  421564,
  456219,
  469286,
  583052,
  656725,
  778676],
 ('TRAAENP128F147BF32', 'TRACIEF128F4270573'): [60302, 202122, 315264, 617975],
 ('TRAIXEO128F424AD97', 'TRBENBG12903CF94B4'): [177317],
 ('TRAXTSU128F4241835', 'TRAWNBV128F4279867'): [58106,
  243968,
  314134,
  432370,
  528522],
 ('TRAIXZC128F424C148', 'TRAJDCS128F92F9236'): [472694, 480824, 615294],
 ('TRAJIDY128F931C89D', 'TRAQFGE128F9322ADA'): [240900, 565037, 793572],
 ('TRAHKEG12903CDB829', 'TRABPHE128F42A1E65'): [19201,
  153330,
  281094,
  440886,
  709492],
 ('TRBIACJ128F93087A9', 'TRBERUM128F148AC93'): [56954,
  142962,
  211290,
  243500,
  485940,
  528256,
  566193,
  580590,
  597429,
  617754,
  651058],
 ('TRBGCJQ128F1453CB3', 'TRALQXV12903C9FE7B'): [23054, 267354, 285169],
 ('TRBHXRQ128F42A713F', 'TRASJRH12903CF2689'): [7114, 627748],
 ('TRAUBDS128F931A703', 'TRALUNR128EF3425DE'): [32833, 292327, 409927],
 ('TRA

From the output of the cell above, it can be seen that there are 198,840 direct connections between pairs of tracks. Because the reverse connections also exist, the number of distinct undirected connections is half of that (99,420). The groupby object holds the indices of the rows in the source dataframe (`df_ss`) that belong to each pair of tracks. This is shown by `group_ss.groups`, which is a `dict` whose keys are the pairs of tracks and whose values are the indices of the corresponding rows in the source dataframe.
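A small sketch of the same inspection on a toy frame may make the structure of the groupby object easier to see. The rows are invented; `ngroups` and `size()` are standard pandas GroupBy members.

```python
import pandas as pd

# Toy version of df_ss: one row per shared artist term between two tracks.
pairs = pd.DataFrame({
    "track_l": ["t1", "t1", "t2", "t2", "t1"],
    "track_r": ["t2", "t2", "t1", "t1", "t3"],
    "artist_terms": ["rock", "punk", "rock", "punk", "rock"],
})

grouped = pairs.groupby(["track_l", "track_r"])

# Number of distinct (track_l, track_r) pairs.
print(grouped.ngroups)

# Each key maps to the row labels of the source frame that belong to that pair.
for key, labels in grouped.groups.items():
    print(key, list(labels))

# Number of shared terms (rows) per pair.
sizes = grouped.size()
print(sizes)
```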

In [19]:
# For example, the tracks 'TRAAENP128F147BF32' and 'TRACIEF128F4270573' have connections through 4 artist terms
print(type(group_ss.get_group(('TRAAENP128F147BF32', 'TRACIEF128F4270573'))))
group_ss.get_group(('TRAAENP128F147BF32', 'TRACIEF128F4270573'))
# Note the index [60302, 202122, 315264, 617975] in the dataframe source 'df_ss' on which the grouping was done.
# The value in the group_ss.groups dictionary object hold these indices

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,track_l,title_l,song_id_l,artist_id_l,artist_name_l,artist_familiarity_l,artist_hotttnesss_l,song_hotttnesss_l,year_l,at_freq_l,...,title_r,song_id_r,artist_id_r,artist_name_r,artist_familiarity_r,artist_hotttnesss_r,song_hotttnesss_r,year_r,at_freq_r,at_wt_r
60302,TRAAENP128F147BF32,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.564534,...,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.97614,0.756252
202122,TRAAENP128F147BF32,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.427527,...,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.634468,0.609892
315264,TRAAENP128F147BF32,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.481279,...,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.946641,0.764347
617975,TRAAENP128F147BF32,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.308013,...,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.621886,0.648512


In [20]:
# The reverse path or the redundant connection, which will be ignored by networkx
group_ss.get_group(( 'TRACIEF128F4270573', 'TRAAENP128F147BF32'))

Unnamed: 0,track_l,title_l,song_id_l,artist_id_l,artist_name_l,artist_familiarity_l,artist_hotttnesss_l,song_hotttnesss_l,year_l,at_freq_l,...,title_r,song_id_r,artist_id_r,artist_name_r,artist_familiarity_r,artist_hotttnesss_r,song_hotttnesss_r,year_r,at_freq_r,at_wt_r
5002,TRACIEF128F4270573,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.97614,...,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.564534,0.454787
200610,TRACIEF128F4270573,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.634468,...,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.427527,0.462952
269344,TRACIEF128F4270573,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.946641,...,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.481279,0.423283
613675,TRACIEF128F4270573,What Would You Do,SOOOBCE12A8C134019,AROBQ0B1187FB404F6,Walter Jackson,0.437194,0.316156,0.353261,2006.0,0.621886,...,Movement 4 [from Kiss] (Album Version),SOVOLGY12A6D4F822B,ARUI8651187B9ACF52,John Cale,0.749318,0.481656,0.226555,1997.0,0.308013,0.422747


Now that all the `track` - `track` groups are identified, the `sum` of the artist terms weights (`at_wt_l` and `at_wt_r` signifying the weights from the artists on either side of the `artist_terms` node) and the `mean` of the other attributes are computed, as below. The rationale behind these computations will be discussed when the final weights between two tracks are computed at the time of representing them in a network graph.

In [21]:
# Now, compute the group aggregate attributes
# Keyword (named) aggregation is used because dict-based renaming in .agg() was removed in pandas 1.0
df_ss_g1 = group_ss['at_wt_l'].agg(wt_l_sum='sum').reset_index()
df_ss_g2 = group_ss['at_wt_r'].agg(wt_r_sum='sum').reset_index()
df_ss_g3 = group_ss['artist_familiarity_l'].agg(fam_l='mean').reset_index()
df_ss_g4 = group_ss['artist_familiarity_r'].agg(fam_r='mean').reset_index()
df_ss_g5 = group_ss['artist_hotttnesss_l'].agg(hot_l='mean').reset_index()
df_ss_g6 = group_ss['artist_hotttnesss_r'].agg(hot_r='mean').reset_index()
df_ss_g7 = group_ss['song_hotttnesss_l'].agg(song_hot_l='mean').reset_index()
df_ss_g8 = group_ss['song_hotttnesss_r'].agg(song_hot_r='mean').reset_index()
df_ss_g9 = group_ss['artist_terms'].agg(cnt_at='size').reset_index()

In [22]:
#df_ss_g9[(df_ss_g9['dtitle_l'] == 'Anyone Who Had A Heart') & (df_ss_g9['title_r'] == 'Far Beyond The Endless')]
df_ss_g1
# even if some songs have no hotttnesss, those groups are still intact (same number of rows as others)
# df_ss_g7

Unnamed: 0,track_l,track_r,wt_l_sum
0,TRAAAVG12903CFA543,TRAACTB12903CAAF15,0.734100
1,TRAAAVG12903CFA543,TRAADLN128F14832E9,0.734100
2,TRAAAVG12903CFA543,TRAAENP128F147BF32,1.477259
3,TRAAAVG12903CFA543,TRAAGHM128EF35CF8E,5.924304
4,TRAAAVG12903CFA543,TRAAGOH128F42593CE,0.734100
5,TRAAAVG12903CFA543,TRAAJJG128F4284B27,4.795986
6,TRAAAVG12903CFA543,TRAAMSO128EF348DCC,5.698359
7,TRAAAVG12903CFA543,TRAAMXP128F4264F6A,0.734100
8,TRAAAVG12903CFA543,TRAAOTT128F14AE002,0.734100
9,TRAAAVG12903CFA543,TRAAXCM128F427F97B,5.190204


Now that the aggregated weights and other attributes are computed for each of the `track` - `track` groups, we merge them all together into a single dataframe. Merging is preferred over concatenation simply to ensure that any variation in the indexing of the aggregated groups (`df_ss_g1` through `df_ss_g9`) does not lead to incorrect weight assignments. Each row of the dataframe represents the connection between two tracks, with the appropriate aggregated attributes. How the final weights for each connection are computed is discussed at the time of representing them as a graph.

In [27]:
# merge all the aggregates into a single dataframe, one left merge at a time on the pair key
pair_key = ['track_l', 'track_r']
aggregates = [df_ss_g1, df_ss_g2, df_ss_g3, df_ss_g4, df_ss_g5,
              df_ss_g6, df_ss_g7, df_ss_g8, df_ss_g9]
df_ss_g = functools.reduce(lambda left, right: pd.merge(left, right, on=pair_key, how='left'),
                           aggregates)

df_ss_g.head(10)

Unnamed: 0,track_l,track_r,wt_l_sum,wt_r_sum,fam_l,fam_r,hot_l,hot_r,song_hot_l,song_hot_r,cnt_at
0,TRAAAVG12903CFA543,TRAACTB12903CAAF15,0.7341,0.72435,0.550514,0.523715,0.422706,0.384611,,0.0,1
1,TRAAAVG12903CFA543,TRAADLN128F14832E9,0.7341,0.894243,0.550514,0.81483,0.422706,0.735766,,0.834493,1
2,TRAAAVG12903CFA543,TRAAENP128F147BF32,1.477259,1.267916,0.550514,0.749318,0.422706,0.481656,,0.226555,2
3,TRAAAVG12903CFA543,TRAAGHM128EF35CF8E,5.924304,4.94713,0.550514,0.408389,0.422706,0.139247,,0.198785,7
4,TRAAAVG12903CFA543,TRAAGOH128F42593CE,0.7341,0.795655,0.550514,0.449324,0.422706,0.343282,,0.382723,1
5,TRAAAVG12903CFA543,TRAAJJG128F4284B27,4.795986,4.153012,0.550514,0.703349,0.422706,0.498395,,0.466305,6
6,TRAAAVG12903CFA543,TRAAMSO128EF348DCC,5.698359,4.855945,0.550514,0.665728,0.422706,0.456941,,,7
7,TRAAAVG12903CFA543,TRAAMXP128F4264F6A,0.7341,0.783216,0.550514,0.539381,0.422706,0.360584,,0.0,1
8,TRAAAVG12903CFA543,TRAAOTT128F14AE002,0.7341,0.781684,0.550514,0.497308,0.422706,0.342406,,,1
9,TRAAAVG12903CFA543,TRAAXCM128F427F97B,5.190204,4.780164,0.550514,0.483243,0.422706,0.324431,,,6
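The reason for preferring a key-based merge over positional concatenation can be seen in a small sketch. The two toy frames hold the same pairs in a different row order; the values are invented.

```python
import pandas as pd

sums = pd.DataFrame({
    "track_l": ["t1", "t2"],
    "track_r": ["t2", "t1"],
    "wt_l_sum": [1.4, 1.2],
})
# Same pairs, but stored in a different row order.
counts = pd.DataFrame({
    "track_l": ["t2", "t1"],
    "track_r": ["t1", "t2"],
    "cnt_at": [3, 2],
})

# Merging aligns rows on the pair key, whatever the row order is.
by_key = pd.merge(sums, counts, on=["track_l", "track_r"],
                  how="left", validate="one_to_one")

# Concatenating side by side aligns on the row index only.
by_position = pd.concat([sums, counts[["cnt_at"]]], axis=1)

print(by_key)
print(by_position)
```

In `by_key` the pair (`t1`, `t2`) receives its own count of 2, while in `by_position` it silently receives the count that belongs to (`t2`, `t1`). The `validate="one_to_one"` argument additionally makes pandas raise an error if a pair occurs more than once on either side.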


To the dataframe `df_ss_g` above, which represents the connected `track` - `track` groups, the IDs and names of the two artists and a flag (`same_artist_flag`) indicating whether the two tracks are by the same artist are added. The two track titles are included as well.

In [28]:
df_ss_g = pd.merge(pd.merge(df_ss_g,df_ta.loc[:,['title','track','artist_id','artist_name']], 
                            left_on='track_l', right_on='track', how = 'inner'),
                   df_ta.loc[:,['title','track','artist_id','artist_name']],
                   left_on ='track_r', right_on ='track', how = 'inner',suffixes = ('_l','_r'))         

# add same artist flag to indicate if the tracks are from the same artist (vectorized comparison)
df_ss_g['same_artist_flag'] = (df_ss_g['artist_id_l'] == df_ss_g['artist_id_r']).astype(int)

df_ss_g.head(10)

Unnamed: 0,track_l,track_r,wt_l_sum,wt_r_sum,fam_l,fam_r,hot_l,hot_r,song_hot_l,song_hot_r,cnt_at,title_l,track_l.1,artist_id_l,artist_name_l,title_r,track_r.1,artist_id_r,artist_name_r,same_artist_flag
0,TRAAAVG12903CFA543,TRAACTB12903CAAF15,0.7341,0.72435,0.550514,0.523715,0.422706,0.384611,,0.0,1,Insatiable (Instrumental Version),TRAAAVG12903CFA543,ARNTLGG11E2835DDB9,Clp,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
1,TRAABJV128F1460C49,TRAACTB12903CAAF15,1.804455,2.60804,0.776676,0.523715,0.553072,0.384611,,0.0,4,Tonight Will Be Alright,TRAABJV128F1460C49,ARIK43K1187B9AE54C,Lionel Richie,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
2,TRAADLN128F14832E9,TRAACTB12903CAAF15,3.067252,4.877071,0.81483,0.523715,0.735766,0.384611,0.834493,0.0,7,Angie (1993 Digital Remaster),TRAADLN128F14832E9,ARFCUN31187B9AD578,The Rolling Stones,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
3,TRAADLR12903CF8D7E,TRAACTB12903CAAF15,1.94994,2.312114,0.482356,0.523715,0.350079,0.384611,0.0,0.0,3,Sabor Guajiro,TRAADLR12903CF8D7E,AR6YDKV1187B989230,Roberto Torres,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
4,TRAAENP128F147BF32,TRAACTB12903CAAF15,1.250377,2.040589,0.749318,0.523715,0.481656,0.384611,0.226555,0.0,3,Movement 4 [from Kiss] (Album Version),TRAAENP128F147BF32,ARUI8651187B9ACF52,John Cale,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
5,TRAAGHM128EF35CF8E,TRAACTB12903CAAF15,3.283752,3.739735,0.408389,0.523715,0.139247,0.384611,0.198785,0.0,5,Next Time,TRAAGHM128EF35CF8E,ARMBYRO1187FB57419,Nadine Renee,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
6,TRAAGOH128F42593CE,TRAACTB12903CAAF15,4.317186,4.545478,0.449324,0.523715,0.343282,0.384611,0.382723,0.0,6,You Feel Good All Over,TRAAGOH128F42593CE,ARLWR721187B9A03C9,T.G. Sheppard,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
7,TRAAGZU128EF360F47,TRAACTB12903CAAF15,0.706473,0.691292,0.420546,0.523715,0.313569,0.384611,,0.0,1,Caricia Y Herida,TRAAGZU128EF360F47,ARHFEBO11F50C4825A,Flor Silvestre,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
8,TRAAHEH128F427FCEF,TRAACTB12903CAAF15,2.292019,2.194105,0.138188,0.523715,0.317811,0.384611,,0.0,3,Teräslintu,TRAAHEH128F427FCEF,ARL6XZC1187FB3936E,Solistiyhtye Suomi,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0
9,TRAAJJG128F4284B27,TRAACTB12903CAAF15,5.544593,5.923369,0.703349,0.523715,0.498395,0.384611,0.466305,0.0,8,Medicate Myself,TRAAJJG128F4284B27,ARQOC971187B9910FA,The Verve Pipe,It Makes No Difference Now,TRAACTB12903CAAF15,AR0RCMP1187FB3F427,Billie Jo Spears,0


In [29]:
# Remove the duplicated key columns (the second copies of track_l and track_r)
df_ss_g = df_ss_g.loc[:, ~df_ss_g.columns.duplicated()]
df_ss_g.head(10)

Unnamed: 0,track_l,track_r,wt_l_sum,wt_r_sum,fam_l,fam_r,hot_l,hot_r,song_hot_l,song_hot_r,cnt_at,title_l,artist_id_l,artist_name_l,title_r,artist_id_r,artist_name_r,same_artist_flag
0,TRAAAVG12903CFA543,TRAACTB12903CAAF15,0.7341,0.72435,0.550514,0.523715,0.422706,0.384611,,0.0,1,Insatiable (Instrumental Version),ARNTLGG11E2835DDB9,Clp,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
1,TRAABJV128F1460C49,TRAACTB12903CAAF15,1.804455,2.60804,0.776676,0.523715,0.553072,0.384611,,0.0,4,Tonight Will Be Alright,ARIK43K1187B9AE54C,Lionel Richie,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
2,TRAADLN128F14832E9,TRAACTB12903CAAF15,3.067252,4.877071,0.81483,0.523715,0.735766,0.384611,0.834493,0.0,7,Angie (1993 Digital Remaster),ARFCUN31187B9AD578,The Rolling Stones,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
3,TRAADLR12903CF8D7E,TRAACTB12903CAAF15,1.94994,2.312114,0.482356,0.523715,0.350079,0.384611,0.0,0.0,3,Sabor Guajiro,AR6YDKV1187B989230,Roberto Torres,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
4,TRAAENP128F147BF32,TRAACTB12903CAAF15,1.250377,2.040589,0.749318,0.523715,0.481656,0.384611,0.226555,0.0,3,Movement 4 [from Kiss] (Album Version),ARUI8651187B9ACF52,John Cale,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
5,TRAAGHM128EF35CF8E,TRAACTB12903CAAF15,3.283752,3.739735,0.408389,0.523715,0.139247,0.384611,0.198785,0.0,5,Next Time,ARMBYRO1187FB57419,Nadine Renee,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
6,TRAAGOH128F42593CE,TRAACTB12903CAAF15,4.317186,4.545478,0.449324,0.523715,0.343282,0.384611,0.382723,0.0,6,You Feel Good All Over,ARLWR721187B9A03C9,T.G. Sheppard,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
7,TRAAGZU128EF360F47,TRAACTB12903CAAF15,0.706473,0.691292,0.420546,0.523715,0.313569,0.384611,,0.0,1,Caricia Y Herida,ARHFEBO11F50C4825A,Flor Silvestre,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
8,TRAAHEH128F427FCEF,TRAACTB12903CAAF15,2.292019,2.194105,0.138188,0.523715,0.317811,0.384611,,0.0,3,Teräslintu,ARL6XZC1187FB3936E,Solistiyhtye Suomi,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
9,TRAAJJG128F4284B27,TRAACTB12903CAAF15,5.544593,5.923369,0.703349,0.523715,0.498395,0.384611,0.466305,0.0,8,Medicate Myself,ARQOC971187B9910FA,The Verve Pipe,It Makes No Difference Now,AR0RCMP1187FB3F427,Billie Jo Spears,0
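The duplicate key columns arise because the lookup frame carries its own `track` column into both merges. One way to avoid creating them in the first place is sketched below on toy data: the lookup columns are suffixed before each merge so that the key names match exactly. The helper `attach` and the toy values are hypothetical and are not part of this notebook.

```python
import pandas as pd

pairs = pd.DataFrame({
    "track_l": ["t1", "t2"],
    "track_r": ["t2", "t3"],
    "cnt_at": [2, 1],
})
lookup = pd.DataFrame({
    "track": ["t1", "t2", "t3"],
    "title": ["Song A", "Song B", "Song C"],
    "artist_id": ["a1", "a2", "a2"],
})

def attach(frame, side):
    """Join the lookup onto one side of the pair; side is 'l' or 'r'."""
    # track -> track_l, title -> title_l, artist_id -> artist_id_l (or the _r variants)
    renamed = lookup.add_suffix("_" + side)
    return frame.merge(renamed, on="track_" + side, how="inner")

labelled = attach(attach(pairs, "l"), "r")
labelled["same_artist_flag"] = (
    labelled["artist_id_l"] == labelled["artist_id_r"]
).astype(int)

print(labelled.columns.tolist())
print(labelled)
```

Since the merge key has the same name on both sides, pandas keeps a single copy of it and no column has to be dropped afterwards.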


In [31]:
# Check that the same artist flag is assigned correctly
df_ss_g[df_ss_g['same_artist_flag'] == 1].head(5) 

Unnamed: 0,track_l,track_r,wt_l_sum,wt_r_sum,fam_l,fam_r,hot_l,hot_r,song_hot_l,song_hot_r,cnt_at,title_l,artist_id_l,artist_name_l,title_r,artist_id_r,artist_name_r,same_artist_flag
4861,TRAHBWE128F9349247,TRAAYGH128F92ECD16,34.256392,34.256392,0.662299,0.662299,0.379138,0.379138,0.0,,44,Dedicated To The One I Love,ARALP6I1187B989E27,The Shirelles,Will You Love Me Tomorrow,ARALP6I1187B989E27,The Shirelles,1
9211,TRBDOAR128F42618B7,TRABPYJ128F92DA476,32.750631,32.750631,0.563049,0.563049,0.348227,0.348227,0.38679,0.319566,46,Mopao,ARDNW1B1187FB4ABBB,Africando All Stars,Nina Nina,ARDNW1B1187FB4ABBB,Chéco Feliciano And Joe King,1
12101,TRATAMJ12903CA7BA9,TRABXWD128F425AD1E,9.275885,9.275885,0.729212,0.729212,0.481339,0.481339,0.21508,,16,Stack and Pile,ARVUN5F1187FB4CCC7,Beenie Man,Love Me Now (Rockwilder Remix) (Feat. Wyclef A...,ARVUN5F1187FB4CCC7,Beenie Man Featuring Wyclef And Redman,1
13274,TRAMQKQ128F93291A3,TRACBMH128F93291A4,9.905238,9.905238,0.419034,0.419034,0.293984,0.293984,,,13,Sinnloz/ Anakonda (live),ARFY8NV1187B9B1961,Der Moderne Man,Blaue Matrosen (live),ARFY8NV1187B9B1961,Der Moderne Man,1
14476,TRALGQH12903CD4922,TRACJDX12903CD4917,13.223911,13.223911,0.66802,0.66802,0.516964,0.516964,,,26,Kids Now,ARYF20K1187B9B76BD,George Lopez,Church Hangover,ARYF20K1187B9B76BD,George Lopez,1


#### Pickle the `track` - `track` groups aggregated dataframe
The final `track` - `track` groups dataframe, along with the aggregated attributes, is pickled and saved.

In [49]:
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/Pickle Files"
df_ss_g.to_pickle(path+"/MSS_METADATA_SUBSET_500_TRACK_TRACK.pkl")
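
Since later notebooks depend on this file, it can be worth confirming that the pickle reads back unchanged. The sketch below does a round trip on a small made-up frame in a temporary directory; the file name `toy_track_track.pkl` and the rows are hypothetical, and the real path above is not touched.

```python
import os
import tempfile
import pandas as pd

# Tiny made-up frame standing in for df_ss_g
toy = pd.DataFrame({'track_l': ['T1'], 'track_r': ['T2'], 'cnt_at': [3]})

with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_path = os.path.join(tmp_dir, 'toy_track_track.pkl')  # hypothetical file name
    toy.to_pickle(pkl_path)
    restored = pd.read_pickle(pkl_path)

# True when values, dtypes and index all survive the round trip
round_trip_ok = toy.equals(restored)
print(round_trip_ok)
```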

### 5. `artist_id` - `artist_id` relationship data

As mentioned earlier, this closely mirrors the `track` - `track` relationship, so it is not explained in detail. The code is presented in the following cells.

In [32]:
# Build all artist-to-artist connections that share at least one artist_terms value.
# Without an explicit suffixes=('_left', '_right') argument, pandas appends the
# default suffixes _x and _y to the overlapping column names.
df_aa = pd.merge(df_aat,df_aat, on = 'artist_terms', how = 'inner')
print(df_aa.shape)

# Remove self-loops, i.e. rows where an artist is paired with itself
df_aa = df_aa[df_aa['artist_id_x'] != df_aa['artist_id_y'] ]
print(df_aa.shape)

(723373, 13)
(710504, 13)
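
To make the self-merge concrete, the sketch below applies the same two steps to a made-up table with three artists and two terms. The frame `toy_aat` and its values are hypothetical; only the column names follow `df_aat`.

```python
import pandas as pd

# Hypothetical artist / term table shaped like df_aat
toy_aat = pd.DataFrame({
    'artist_id': ['A1', 'A1', 'A2', 'A3'],
    'artist_terms': ['rock', 'pop', 'rock', 'pop'],
    'at_wt': [1.0, 0.5, 0.8, 0.4],
})

# Self-merge on the shared term: every pair of artists sharing a term gets a row
pairs = pd.merge(toy_aat, toy_aat, on='artist_terms', how='inner')
n_all = len(pairs)

# Drop rows where an artist is paired with itself
pairs = pairs[pairs['artist_id_x'] != pairs['artist_id_y']]
n_no_loops = len(pairs)

directed = set(zip(pairs['artist_id_x'], pairs['artist_id_y']))
print(n_all, n_no_loops, sorted(directed))
```

Each surviving pair appears in both orientations, for example `('A1', 'A2')` and `('A2', 'A1')`, because the merge is symmetric.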


In [34]:
group_aa = df_aa.groupby(['artist_id_x', 'artist_id_y']) # A pandas GroupBy object
print(type(group_aa))
print(type(group_aa.groups))
print(len(group_aa)) # The actual number of artist-artist groups

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'dict'>
175242
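
Because the self-merge is symmetric, each artist pair should appear in both orientations, which would put the number of unordered pairs at half of 175242, that is 87621. This is an inference from how the merge works rather than something the notebook states. The sketch below shows one way to count both on a made-up frame; `toy_pairs` and its rows are hypothetical.

```python
import pandas as pd

# Hypothetical pair table with both orientations and one repeated pair
toy_pairs = pd.DataFrame({
    'artist_id_x': ['A1', 'A2', 'A1', 'A3', 'A1'],
    'artist_id_y': ['A2', 'A1', 'A3', 'A1', 'A2'],
})

# Number of ordered (x, y) groups, as counted by len(group_aa) above
n_directed = toy_pairs.groupby(['artist_id_x', 'artist_id_y']).ngroups

# Keep a single orientation of each pair: (A1, A2) but not (A2, A1)
one_way = toy_pairs[toy_pairs['artist_id_x'] < toy_pairs['artist_id_y']]
n_undirected = one_way.groupby(['artist_id_x', 'artist_id_y']).ngroups

print(n_directed, n_undirected)
```

Keeping one orientation is only appropriate if the downstream graph is undirected; the left and right weight sums differ, so the choice of which orientation to keep matters.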


In [36]:
# Aggregate per artist pair: sum the term weights, average familiarity and hotttnesss.
# Named aggregation is used because the dict-renaming form of agg was removed in pandas 1.0.
df_aa_g1 = group_aa['at_wt_x'].agg(wt_l_sum='sum').reset_index()
df_aa_g2 = group_aa['at_wt_y'].agg(wt_r_sum='sum').reset_index()
df_aa_g3 = group_aa['artist_familiarity_x'].agg(fam_l='mean').reset_index()
df_aa_g4 = group_aa['artist_familiarity_y'].agg(fam_r='mean').reset_index()
df_aa_g5 = group_aa['artist_hotttnesss_x'].agg(hot_l='mean').reset_index()
df_aa_g6 = group_aa['artist_hotttnesss_y'].agg(hot_r='mean').reset_index()

In [38]:
df_aa_g6.head(10)

Unnamed: 0,artist_id_x,artist_id_y,hot_r
0,AR051KA1187B98B2FF,AR059HI1187B9A14D7,0.306519
1,AR051KA1187B98B2FF,AR0693R1187FB59D32,0.370829
2,AR051KA1187B98B2FF,AR0AVIB1187FB37D01,0.384602
3,AR051KA1187B98B2FF,AR0B3RS1187FB48F5D,0.47702
4,AR051KA1187B98B2FF,AR0FGL21187FB56301,0.340524
5,AR051KA1187B98B2FF,AR0IAWL1187B9A96D0,0.27712
6,AR051KA1187B98B2FF,AR0IVSA1187FB4F069,0.49008
7,AR051KA1187B98B2FF,AR0S7TA1187FB4D024,0.725746
8,AR051KA1187B98B2FF,AR0WV4Y1187B99B806,0.389414
9,AR051KA1187B98B2FF,AR14CJ91187FB3A994,0.397776


In [41]:
# Merge all the aggregates into a single dataframe, one left join at a time
pair_keys = ['artist_id_x', 'artist_id_y']
df_aa_g = functools.reduce(
    lambda left, right: pd.merge(left, right, on = pair_keys, how = 'left'),
    [df_aa_g1, df_aa_g2, df_aa_g3, df_aa_g4, df_aa_g5, df_aa_g6])

df_aa_g.head(10)

Unnamed: 0,artist_id_x,artist_id_y,wt_l_sum,wt_r_sum,fam_l,fam_r,hot_l,hot_r
0,AR051KA1187B98B2FF,AR059HI1187B9A14D7,0.985888,0.868416,0.351182,0.378348,0.080167,0.306519
1,AR051KA1187B98B2FF,AR0693R1187FB59D32,1.842803,1.687173,0.351182,0.546789,0.080167,0.370829
2,AR051KA1187B98B2FF,AR0AVIB1187FB37D01,1.86266,1.706814,0.351182,0.566024,0.080167,0.384602
3,AR051KA1187B98B2FF,AR0B3RS1187FB48F5D,1.720894,1.331649,0.351182,0.822035,0.080167,0.47702
4,AR051KA1187B98B2FF,AR0FGL21187FB56301,3.268508,2.996158,0.351182,0.422103,0.080167,0.340524
5,AR051KA1187B98B2FF,AR0IAWL1187B9A96D0,1.597667,1.506507,0.351182,0.331874,0.080167,0.27712
6,AR051KA1187B98B2FF,AR0IVSA1187FB4F069,1.0,0.863677,0.351182,0.776599,0.080167,0.49008
7,AR051KA1187B98B2FF,AR0S7TA1187FB4D024,3.583555,2.841571,0.351182,0.83124,0.080167,0.725746
8,AR051KA1187B98B2FF,AR0WV4Y1187B99B806,2.597667,2.452595,0.351182,0.491836,0.080167,0.389414
9,AR051KA1187B98B2FF,AR14CJ91187FB3A994,0.985888,0.689191,0.351182,0.564129,0.080167,0.397776
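
As an alternative to six separate aggregates followed by a chain of merges, the same table can be produced in a single `groupby(...).agg(...)` call using named aggregation (available from pandas 0.25). This is a sketch on made-up data, not the notebook's own method; `toy_aa` and its values are hypothetical, while the column names follow `df_aa`.

```python
import pandas as pd

# Hypothetical rows shaped like df_aa after the self-merge
toy_aa = pd.DataFrame({
    'artist_id_x': ['A1', 'A1', 'A2', 'A2'],
    'artist_id_y': ['A2', 'A2', 'A1', 'A1'],
    'at_wt_x': [1.0, 0.5, 0.8, 0.4],
    'at_wt_y': [0.8, 0.4, 1.0, 0.5],
    'artist_familiarity_x': [0.6, 0.6, 0.3, 0.3],
    'artist_familiarity_y': [0.3, 0.3, 0.6, 0.6],
    'artist_hotttnesss_x': [0.4, 0.4, 0.2, 0.2],
    'artist_hotttnesss_y': [0.2, 0.2, 0.4, 0.4],
})

# One pass over the groups: output name = (input column, aggregation)
toy_aa_g = (
    toy_aa.groupby(['artist_id_x', 'artist_id_y'])
    .agg(
        wt_l_sum=('at_wt_x', 'sum'),
        wt_r_sum=('at_wt_y', 'sum'),
        fam_l=('artist_familiarity_x', 'mean'),
        fam_r=('artist_familiarity_y', 'mean'),
        hot_l=('artist_hotttnesss_x', 'mean'),
        hot_r=('artist_hotttnesss_y', 'mean'),
    )
    .reset_index()
)
print(toy_aa_g)
```

The result has one row per ordered artist pair with the same column layout as `df_aa_g`, and no intermediate frames need to be joined.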


In [43]:
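# Attach the artist name for each side of the pair by looking up artist_id in df_ta:
# the inner merge resolves artist_id_x, the outer merge resolves artist_id_y
df_aa_g = pd.merge(pd.merge(df_aa_g,df_ta.loc[:,['artist_id','artist_name']],
                            left_on='artist_id_x', right_on='artist_id', how = 'inner'),
                   df_ta.loc[:,['artist_id','artist_name']],
                   left_on ='artist_id_y', right_on ='artist_id', how = 'inner',suffixes = ('_l','_r'))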


df_aa_g.head(10)

Unnamed: 0,artist_id_x,artist_id_y,wt_l_sum,wt_r_sum,fam_l,fam_r,hot_l,hot_r,artist_id_l,artist_name_l,artist_id_r,artist_name_r
0,AR051KA1187B98B2FF,AR059HI1187B9A14D7,0.985888,0.868416,0.351182,0.378348,0.080167,0.306519,AR051KA1187B98B2FF,Wilks,AR059HI1187B9A14D7,Roy Drusky
1,AR0693R1187FB59D32,AR059HI1187B9A14D7,2.028113,2.150506,0.546789,0.378348,0.370829,0.306519,AR0693R1187FB59D32,Dusminguet,AR059HI1187B9A14D7,Roy Drusky
2,AR0AVIB1187FB37D01,AR059HI1187B9A14D7,3.259692,3.15469,0.566024,0.378348,0.384602,0.306519,AR0AVIB1187FB37D01,Regina Belle,AR059HI1187B9A14D7,Roy Drusky
3,AR0B3RS1187FB48F5D,AR059HI1187B9A14D7,2.042358,2.150506,0.822035,0.378348,0.47702,0.306519,AR0B3RS1187FB48F5D,Why?,AR059HI1187B9A14D7,Roy Drusky
4,AR0FGL21187FB56301,AR059HI1187B9A14D7,1.795019,2.319531,0.422103,0.378348,0.340524,0.306519,AR0FGL21187FB56301,Ras Michael and the Sons of Negus,AR059HI1187B9A14D7,Roy Drusky
5,AR0IVSA1187FB4F069,AR059HI1187B9A14D7,0.811732,0.760766,0.776599,0.378348,0.49008,0.306519,AR0IVSA1187FB4F069,Helmet,AR059HI1187B9A14D7,Roy Drusky
6,AR0KBXO1187B996460,AR059HI1187B9A14D7,1.435857,1.39433,0.928937,0.378348,0.598555,0.306519,AR0KBXO1187B996460,Slipknot,AR059HI1187B9A14D7,Roy Drusky
7,AR0KBXO1187B996460,AR059HI1187B9A14D7,1.435857,1.39433,0.928937,0.378348,0.598555,0.306519,AR0KBXO1187B996460,Slipknot,AR059HI1187B9A14D7,Roy Drusky
8,AR0OEYB1187FB4A81E,AR059HI1187B9A14D7,1.458036,1.38974,0.421792,0.378348,0.31921,0.306519,AR0OEYB1187FB4A81E,Catherine Howe & Vo Fletcher,AR059HI1187B9A14D7,Roy Drusky
9,AR0QS8F1187B9ADC96,AR059HI1187B9A14D7,0.791171,0.633564,0.498677,0.378348,0.354169,0.306519,AR0QS8F1187B9ADC96,Sandy Lam,AR059HI1187B9A14D7,Roy Drusky
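
Rows 6 and 7 in the output above are identical. A likely cause, stated here as an assumption, is that `df_ta` holds one row per track, so an artist with several tracks appears several times in the name lookup and the merge multiplies the matching pair rows. The sketch below reproduces the effect on made-up data and shows one possible remedy: deduplicating the lookup on `artist_id` before merging. All frames and values here are hypothetical.

```python
import pandas as pd

# Hypothetical df_ta-like lookup with one row per track (A1 has two tracks)
toy_ta = pd.DataFrame({
    'track_id': ['T1', 'T2', 'T3'],
    'artist_id': ['A1', 'A1', 'A2'],
    'artist_name': ['Artist One', 'Artist One', 'Artist Two'],
})
toy_pairs = pd.DataFrame({'artist_id_x': ['A1'], 'artist_id_y': ['A2'], 'wt_l_sum': [1.5]})

names = toy_ta[['artist_id', 'artist_name']]

# Merging against the raw lookup repeats the pair once per track of A1
naive = toy_pairs.merge(names, left_on='artist_id_x', right_on='artist_id', how='inner')

# Keep one name per artist_id, then let pandas verify the lookup is unique
unique_names = names.drop_duplicates(subset='artist_id')
deduped = toy_pairs.merge(unique_names, left_on='artist_id_x', right_on='artist_id',
                          how='inner', validate='many_to_one')

print(len(naive), len(deduped))
```

Deduplicating on `artist_id` alone keeps the first name seen for each artist, which matters when one ID carries several name variants. The `validate='many_to_one'` argument makes the merge raise an error if the lookup still contains repeated keys.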


In [44]:
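# Drop the redundant artist_id_x and artist_id_y columns and put IDs and names first
df_aa_g = df_aa_g[['artist_id_l','artist_id_r','artist_name_l','artist_name_r',
                   'wt_l_sum','wt_r_sum','fam_l','fam_r','hot_l','hot_r']]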
df_aa_g.head(10)

Unnamed: 0,artist_id_l,artist_id_r,artist_name_l,artist_name_r,wt_l_sum,wt_r_sum,fam_l,fam_r,hot_l,hot_r
0,AR051KA1187B98B2FF,AR059HI1187B9A14D7,Wilks,Roy Drusky,0.985888,0.868416,0.351182,0.378348,0.080167,0.306519
1,AR0693R1187FB59D32,AR059HI1187B9A14D7,Dusminguet,Roy Drusky,2.028113,2.150506,0.546789,0.378348,0.370829,0.306519
2,AR0AVIB1187FB37D01,AR059HI1187B9A14D7,Regina Belle,Roy Drusky,3.259692,3.15469,0.566024,0.378348,0.384602,0.306519
3,AR0B3RS1187FB48F5D,AR059HI1187B9A14D7,Why?,Roy Drusky,2.042358,2.150506,0.822035,0.378348,0.47702,0.306519
4,AR0FGL21187FB56301,AR059HI1187B9A14D7,Ras Michael and the Sons of Negus,Roy Drusky,1.795019,2.319531,0.422103,0.378348,0.340524,0.306519
5,AR0IVSA1187FB4F069,AR059HI1187B9A14D7,Helmet,Roy Drusky,0.811732,0.760766,0.776599,0.378348,0.49008,0.306519
6,AR0KBXO1187B996460,AR059HI1187B9A14D7,Slipknot,Roy Drusky,1.435857,1.39433,0.928937,0.378348,0.598555,0.306519
7,AR0KBXO1187B996460,AR059HI1187B9A14D7,Slipknot,Roy Drusky,1.435857,1.39433,0.928937,0.378348,0.598555,0.306519
8,AR0OEYB1187FB4A81E,AR059HI1187B9A14D7,Catherine Howe & Vo Fletcher,Roy Drusky,1.458036,1.38974,0.421792,0.378348,0.31921,0.306519
9,AR0QS8F1187B9ADC96,AR059HI1187B9A14D7,Sandy Lam,Roy Drusky,0.791171,0.633564,0.498677,0.378348,0.354169,0.306519


In [46]:
path = "C:/Users/ganes_000/Documents/MSBA/Spring 2016/MA 755/Pickle Files"
df_aa_g.to_pickle(path+"/MSS_METADATA_SUBSET_500_ARTIST_ARTIST.pkl")