# Flattening the JSON data

To flatten the data I used pandas, run as a Slurm job array. I chose pandas because it has a function called **json_normalize()** which flattens nested JSON into columns. I don't believe that PySpark has an equivalent function. Dask may have one, but I haven't checked.
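As a quick illustration of what **json_normalize()** does, here is a minimal sketch on a made-up record. The field names mimic the real data, but the values are invented for the example.

```python
from pandas import json_normalize

# Made-up record with one scalar field and one nested field
record = {
    "average_loudness": 0.09,
    "barkbands": {"mean": [0.1, 0.2], "var": [0.01, 0.02]},
}

# Nested keys are joined with "." to form column names;
# list values are kept as lists inside a single cell
flat = json_normalize(record)
print(list(flat.columns))
```

The nested `barkbands` dict becomes the columns `barkbands.mean` and `barkbands.var`, which is the naming pattern seen in the flattened output further down.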

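I recommend pandas 1.0.3 or later: the code relies on the **max_level** parameter (added in pandas 0.25), which is needed to split on gids, and on the top-level `from pandas import json_normalize` import (added in pandas 1.0).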
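To show what **max_level** controls, here is a small sketch on a toy nested dict (the keys `a`, `b`, `c` are placeholders, not fields from the dataset). It limits how many levels of nesting are expanded into column names.

```python
from pandas import json_normalize

nested = {"a": {"b": {"c": 1}}}

# max_level=0: nothing is expanded, the cell under "a" still holds a dict
level0 = json_normalize(nested, max_level=0)

# max_level=1: one level is expanded, the cell under "a.b" holds {"c": 1}
level1 = json_normalize(nested, max_level=1)

# default: every level is expanded
full = json_normalize(nested)

print(list(level0.columns), list(level1.columns), list(full.columns))
```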

First I needed to normalize the JSON file to a max depth of 0, as that creates one column per gid, each containing its own hierarchical JSON. The result is then transposed so that each recording gid is its own row/index. The metadata is then removed, as it contains MusicBrainz data and not spectral data. The rest is flattened into columns and exported to CSV in this code. It can also be set to export to Parquet, which is something I should have done from the beginning; instead I exported to CSV first and compressed to Parquet later.
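The steps above can be sketched on a tiny in-memory stand-in for one chunk. The gids and values here are invented, and the `{gid: {"0": {...}}}` layout is an assumption inferred from the `0.lowlevel` style column names in the output below. The sketch passes lists rather than Series to `json_normalize`, which is a small departure from clean.py.

```python
import pandas as pd
from pandas import json_normalize

# Hypothetical chunk: two made-up gids, each wrapping its document under "0"
raw = {
    "gid-a": {"0": {
        "lowlevel": {"average_loudness": 0.1, "barkbands": {"mean": [0.1, 0.2]}},
        "metadata": {"tags": {"title": ["x"]}},
        "rhythm": {"beats_count": 779},
        "tonal": {"chords_changes_rate": 0.05},
    }},
    "gid-b": {"0": {
        "lowlevel": {"average_loudness": 0.9, "barkbands": {"mean": [0.3, 0.4]}},
        "metadata": {"tags": {"title": ["y"]}},
        "rhythm": {"beats_count": 326},
        "tonal": {"chords_changes_rate": 0.06},
    }},
}

# Step 1: one column per gid, then transpose so each gid is a row
df = json_normalize(raw, max_level=0).T.reset_index()

# Step 2: expand one level to get "0.lowlevel", "0.metadata", ... and drop metadata
top = json_normalize(df[0].tolist(), max_level=1)
df = pd.concat([df[["index"]], top.drop(columns=["0.metadata"])], axis=1)

# Step 3: fully flatten each remaining nested column
for col in ["0.lowlevel", "0.rhythm", "0.tonal"]:
    flat = json_normalize(df[col].tolist())
    df = pd.concat([df.drop(columns=[col]), flat], axis=1)

print(df)
```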

The Slurm job runs as an array with one task per JSON chunk. In this case we have 48 JSON chunks.
Each chunk is then split across processes, and the results are joined using a Queue.
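The split-and-rejoin part can be sketched without any processes. The example below only shows the partitioning and reassembly; the worker count and the toy frame are hypothetical, and the final `sort_index()` is my own addition to restore row order, since results taken from a queue arrive in whatever order the workers finish.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for one transposed chunk (made-up gids)
df = pd.DataFrame({"index": ["gid-{}".format(i) for i in range(10)],
                   "value": range(10)})
nprocs = 4  # hypothetical number of worker processes

# Split the row positions into nprocs nearly equal groups
parts = [df.iloc[pos] for pos in np.array_split(np.arange(len(df)), nprocs)]
print([len(p) for p in parts])

# Simulate out-of-order arrival, then reassemble on the original row index
arrived = list(reversed(parts))
rejoined = pd.concat(arrived).sort_index()
print(rejoined.equals(df))
```

Because each part keeps its original row index, the pieces can be concatenated in any order and sorted back afterwards.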

#### Slurm Job

In [None]:
#!/bin/bash
#SBATCH --time=1:00:00
#SBATCH -N 2
#SBATCH -n 48
#SBATCH -a 0-47

module load python/anaconda-3.6-5.1.0
source activate cheinu
python clean.py $SLURM_ARRAY_TASK_ID 

#### contents of clean.py

In [None]:
import json
import os
import sys
from multiprocessing import Process, Queue

import numpy as np
import pandas as pd
from pandas import json_normalize


def flatten(data, out_q):
    """Flatten one slice of a chunk and put the result on out_q."""
    # First flatten the data into the columns "0.lowlevel", "0.rhythm",
    # "0.tonal" and "0.metadata"
    tempjson = json_normalize(data[0], max_level=1).set_index(data.index)
    data = pd.concat([data, tempjson], axis=1)
    # Remove the original nested column and the metadata column
    data = data.drop(columns=[0, '0.metadata'])

    # We now have the columns we want
    cols = ['0.lowlevel', '0.rhythm', '0.tonal']

    # Now flatten the JSON held in each of those columns
    for col in cols:
        tempjson = json_normalize(data[col]).set_index(data.index)
        data = pd.concat([data, tempjson], axis=1)
        data = data.drop(columns=[col])
    out_q.put(data)


if __name__ == '__main__':
    # Use 16 fewer processes than the CPUs on the node, but never fewer than one
    nprocs = max(1, int(os.environ["SLURM_CPUS_ON_NODE"]) - 16)
    procs = []
    chunknum = int(sys.argv[1])
    out_q = Queue()

    chunkdir = "sbatch/data/chunk_{}.json".format(chunknum)
    print("opening: ", chunkdir)

    with open(chunkdir) as datafile:
        maindata = json.load(datafile)

    # Transpose so that the gids are in rows
    maindata = json_normalize(maindata, max_level=0).T
    maindata = maindata.reset_index()

    # Split the chunk into one slice per process
    newfiles = np.array_split(maindata, nprocs)

    for i in range(nprocs):
        p = Process(target=flatten, args=(newfiles[i], out_q))
        procs.append(p)
        p.start()
    # Drain the queue before joining; results arrive in completion order,
    # so the row order of the output is not deterministic
    results = [out_q.get() for _ in range(nprocs)]
    for p in procs:
        p.join()
    newdf = pd.concat(results)

    # Write to CSV
    print("writing clean_{}.csv".format(chunknum))
    newdf.to_csv("step1/clean_{}.csv".format(chunknum))
    print("finished chunk_{}".format(chunknum))

### Sample

Let's see what this looks like on a single chunk.

In [1]:
import pandas as pd
import json
from pandas import json_normalize

chunkdir = "reportfiles/chunk_{}.json".format(2)

with open(chunkdir) as datafile:
	data = json.load(datafile)


# Transpose so that the gids are in rows
data = json_normalize(data, max_level=0).T
data = data.reset_index()

tempjson = json_normalize(data[0], max_level=1).set_index(data.index)
data = pd.concat([data, tempjson], axis=1)
# Remove the original nested column and the metadata column
data = data.drop(columns=[0, '0.metadata'])

data.head()

Unnamed: 0,index,0.lowlevel,0.rhythm,0.tonal
0,0ad5b627-7687-4369-935f-e291cf22a3a2,"{'average_loudness': 0.0918393954635, 'barkban...","{'beats_count': 779, 'beats_loudness': {'dmean...","{'chords_changes_rate': 0.0479620248079, 'chor..."
1,0ad5f3ba-8c0c-4c99-a15f-39d85442ef07,"{'average_loudness': 0.924234747887, 'barkband...","{'beats_count': 326, 'beats_loudness': {'dmean...","{'chords_changes_rate': 0.0629560127854, 'chor..."
2,0ad61287-0756-469f-a9a1-1658eb0ad953,"{'average_loudness': 0.778621613979, 'barkband...","{'beats_count': 623, 'beats_loudness': {'dmean...","{'chords_changes_rate': 0.0670494884253, 'chor..."
3,0ad630f5-af66-478f-930c-9b4863977851,"{'average_loudness': 0.583457231522, 'barkband...","{'beats_count': 397, 'beats_loudness': {'dmean...","{'chords_changes_rate': 0.0535150393844, 'chor..."
4,0ad645be-adc0-4a9c-8ed6-ba40c5868109,"{'average_loudness': 0.397027939558, 'barkband...","{'beats_count': 528, 'beats_loudness': {'dmean...","{'chords_changes_rate': 0.111055441201, 'chord..."


Let's take a sample of ten rows and then flatten the remaining nested columns.

In [2]:
data = data[0:10]
# We now have the columns we want
cols = ['0.lowlevel', '0.rhythm', '0.tonal']

# Now flatten the JSON in the "lowlevel", "rhythm" and "tonal" columns
for i in cols:
    tempjson = json_normalize(data[i]).set_index(data.index)
    data = pd.concat([data, tempjson], axis=1)
    data = data.drop(columns=[i])

data.head()

Unnamed: 0,index,average_loudness,dynamic_complexity,barkbands.dmean,barkbands.dmean2,barkbands.dvar,barkbands.dvar2,barkbands.max,barkbands.mean,barkbands.median,...,hpcp.var,hpcp_entropy.dmean,hpcp_entropy.dmean2,hpcp_entropy.dvar,hpcp_entropy.dvar2,hpcp_entropy.max,hpcp_entropy.mean,hpcp_entropy.median,hpcp_entropy.min,hpcp_entropy.var
0,0ad5b627-7687-4369-935f-e291cf22a3a2,0.091839,8.198511,"[2.10325106309e-06, 0.000234614053625, 0.00037...","[3.6466126403e-06, 0.00036908625043, 0.0006030...","[2.20399514611e-11, 3.57924079708e-07, 7.97018...","[6.72192787543e-11, 8.79129459008e-07, 2.21954...","[0.000127967898152, 0.0193108320236, 0.0170268...","[2.99578687191e-06, 0.000585605157539, 0.00079...","[9.9382850749e-07, 6.12559888395e-05, 0.000157...",...,"[0.0126574188471, 0.0213037766516, 0.061309915...",0.53352,0.912364,0.253803,0.650287,4.067466,1.377538,1.384897,0.033799,0.576594
1,0ad5f3ba-8c0c-4c99-a15f-39d85442ef07,0.924235,3.281046,"[0.00115961779375, 0.00764878280461, 0.0015152...","[0.00153099466115, 0.0108632184565, 0.00293669...","[6.1408595684e-06, 0.000476207846077, 7.238477...","[1.05806775537e-05, 0.00107473530807, 0.000214...","[0.0444188974798, 0.272109985352, 0.0982875227...","[0.00423845043406, 0.0225107558072, 0.00088907...","[0.00111644377466, 0.0158996060491, 2.79625237...",...,"[0.0443876087666, 0.0548203215003, 0.038754455...",0.640768,1.074586,0.32131,0.834083,4.614322,1.886516,1.798779,0.0,0.53294
2,0ad61287-0756-469f-a9a1-1658eb0ad953,0.778622,4.383081,"[4.75470915262e-05, 0.00162874232046, 0.001055...","[7.79787224019e-05, 0.00251456839032, 0.001939...","[2.02780139347e-08, 2.40628341999e-05, 1.95965...","[4.978848267e-08, 5.21023903275e-05, 5.7101682...","[0.00339068891481, 0.128779143095, 0.063227206...","[6.14934979239e-05, 0.00231360690668, 0.000898...","[5.27273596163e-06, 0.000224814255489, 0.00013...",...,"[0.0772988498211, 0.04289817065, 0.03626205399...",0.622843,1.042012,0.28951,0.78507,4.553124,1.98666,2.00466,0.0,0.767897
3,0ad630f5-af66-478f-930c-9b4863977851,0.583457,4.836153,"[0.000293237942969, 0.00236392347142, 0.001721...","[0.000411182147218, 0.00382467010058, 0.002921...","[1.33503385769e-06, 6.65750776534e-05, 1.06028...","[2.5845934033e-06, 0.000163199860253, 2.944192...","[0.017749864608, 0.106422841549, 0.08857379108...","[0.000382329890272, 0.00358359701931, 0.004204...","[8.73548651725e-06, 0.0010280571878, 0.0027236...",...,"[0.127469345927, 0.129632502794, 0.03104320168...",0.547933,0.920735,0.237623,0.629671,4.071524,1.839566,1.85489,0.0,0.504924
4,0ad645be-adc0-4a9c-8ed6-ba40c5868109,0.397028,3.531669,"[6.2329127104e-05, 0.00209936965257, 0.0026804...","[8.90565279406e-05, 0.00349251972511, 0.004525...","[1.67730522804e-08, 1.75265286089e-05, 3.06000...","[3.14540322677e-08, 4.42475575255e-05, 7.79407...","[0.00240575056523, 0.0653672218323, 0.09743301...","[0.000134370682645, 0.00258843507618, 0.003045...","[8.84822802618e-05, 0.000910451461095, 0.00108...",...,"[0.0620835721493, 0.0589975118637, 0.061412181...",0.696603,1.172068,0.331266,0.888627,4.917964,2.385656,2.375285,0.0,0.676673


And then export the data.
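A minimal sketch of the export step, assuming a small helper that switches between CSV and Parquet. The `export` function, its `fmt` argument and the toy frame are hypothetical and not part of clean.py. Writing with `index=False` is also my own choice here; clean.py writes the index as well.

```python
import os
import tempfile

import pandas as pd


def export(df, path, fmt="csv"):
    """Hypothetical helper: write df to path as CSV or Parquet."""
    if fmt == "csv":
        df.to_csv(path, index=False)
    elif fmt == "parquet":
        # Requires a Parquet engine such as pyarrow or fastparquet
        df.to_parquet(path, compression="snappy")
    else:
        raise ValueError("unknown format: {}".format(fmt))


# Toy frame with made-up gids
df = pd.DataFrame({"index": ["gid-a", "gid-b"], "average_loudness": [0.1, 0.9]})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "clean_0.csv")
    export(df, csv_path)
    back = pd.read_csv(csv_path)

print(back)
```

Passing `fmt="parquet"` would skip the later compression step entirely, provided a Parquet engine is installed in the environment.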

# Compressor (optional)
Initially I made the mistake of exporting to CSV rather than directly to Parquet, so below is the optional code to compress everything into Parquet files using Slurm arrays. It's pretty straightforward, so I won't explain what's going on.

#### Slurm Job

In [None]:
#!/bin/bash
#SBATCH --time=00:05:00
#SBATCH -N 1
#SBATCH -n 2 
#SBATCH -a 0-47

#. /global/software/jupyterhub-spark/anaconda3/etc/profile.d/conda.sh

#which python
#which conda

. /global/software/anaconda/anaconda-3.6-5.1.0/etc/profile.d/conda.sh
module load python/anaconda-3.6-5.1.0
conda activate cheinu

python compressor.py $SLURM_ARRAY_TASK_ID 

#### contents of compressor.py

In [None]:
import sys

import pandas as pd

print(pd.__version__)
if __name__ == "__main__":
    chunk = int(sys.argv[1])
    print(chunk)
    fdir = '../step1/clean_{}.csv'.format(chunk)
    # Drop the index column that to_csv wrote, which reads back as "Unnamed: 0"
    df = pd.read_csv(fdir).drop(columns=["Unnamed: 0"])

    # to_parquet needs a Parquet engine (pyarrow or fastparquet) installed
    df.to_parquet('../step1/clean_{}.parquet'.format(chunk), compression='snappy')
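
The `"Unnamed: 0"` column dropped in compressor.py comes from the CSV round trip: `to_csv` writes the row index as a first column with an empty header, and `read_csv` then names that column `Unnamed: 0`. A small sketch with an in-memory buffer and made-up values shows the effect.

```python
import io

import pandas as pd

# Toy frame with made-up gids
df = pd.DataFrame({"index": ["gid-a", "gid-b"], "average_loudness": [0.1, 0.9]})

buf = io.StringIO()
df.to_csv(buf)  # default index=True writes the row index with an empty header
buf.seek(0)

back = pd.read_csv(buf)
print(list(back.columns))  # ['Unnamed: 0', 'index', 'average_loudness']

# Same clean-up as in compressor.py
cleaned = back.drop(columns=["Unnamed: 0"])
print(list(cleaned.columns))  # ['index', 'average_loudness']
```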