# Nested Attributes & Functions Operating on Nested Types in PySpark

In this notebook we work with the Spotify songs dataset from Kaggle. Specifically, we work with nested data types, where columns are of type `ARRAY` or `MAP`.

# [Kaggle: Spotify Dataset 1921-2020, 160k+ Tracks](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks)

![](assets/kaggle.png)

## Problem Statement:

Recently, I needed to work with Spark DataFrames that had map-typed columns for one of our projects. I realized that `Map` and `Array` are the two most commonly used nested datatypes, so I explored in detail how we can `create`, `query`, `explode` and `implode` columns of `array` and `map` datatypes. I created this notebook as a handy reference for myself. Feel free to use it if you also need something quick and handy while working with these nested datatypes.

`@author: Anindya Saha`  
`@email: mail.anindya@gmail.com`  

**Note:** Spark pays off on large datasets. This dataset is small and is used for illustrative purposes only. I hope you enjoy reviewing it as much as I enjoyed writing it. Please let me know if you have any suggestions to improve it.

The original dataset can be downloaded from the [Kaggle](https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks) dataset page. The original dataset has been modified a bit for this notebook.

In [1]:
import os
import pandas as pd
import numpy as np

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

from pyspark.sql.types import *
from pyspark.sql.window import Window

import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col

In [2]:
import warnings
warnings.filterwarnings("ignore")

In [3]:
pd.set_option('display.max_columns', 20)
pd.set_option('display.max_colwidth', 400)

In [4]:
# set the random seed for notebook reproducibility
rnd_seed = 23
np.random.seed(rnd_seed)

## 1. Create the Spark Session

In [5]:
spark = SparkSession.builder.master("local[*]").appName("working-with-nested-data-types").getOrCreate()

In [6]:
spark

In [7]:
sc = spark.sparkContext
sc

In [8]:
sqlContext = SQLContext(spark.sparkContext)
sqlContext

<pyspark.sql.context.SQLContext at 0x7f3f1b348ee0>

In [9]:
import re

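# Utility function to emulate Scala's stripMargin on a multi-line string.
def strip_margin(text):
    # Replace each newline, its leading whitespace and the '|' margin with a space.
    nomargin = re.sub(r'\n[ \t]*\|', ' ', text)
    # Collapse any remaining runs of whitespace into single spaces.
    trimmed = re.sub(r'\s+', ' ', nomargin)
    return trimmed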
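
A quick usage sketch: `strip_margin` lets a query be written as an indented multi-line string and then flattened to a single line. The helper is repeated so the snippet runs on its own; the table name `songs` and the query itself are made up for illustration.

```python
import re

def strip_margin(text):
    # Same helper as in the cell above, repeated so this snippet is self-contained.
    nomargin = re.sub(r'\n[ \t]*\|', ' ', text)
    return re.sub(r'\s+', ' ', nomargin)

# Hypothetical query; `songs` is an illustrative table name.
query = strip_margin("""SELECT song_title, artist
                       |FROM songs
                       |WHERE valence > 0.5""")
print(query)  # SELECT song_title, artist FROM songs WHERE valence > 0.5
```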

## 2. Load Spotify Songs Dataset

In [10]:
spotify_df = spark.read.csv(path='data/spotify-songs.csv', inferSchema=True, header=True).cache()

In [11]:
spotify_df.limit(10).toPandas()

Unnamed: 0,id,song_title,artist,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target
0,0,Mask Off,Future,0.0102,0.833,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4,0.286,1
1,1,Redbone,Childish Gambino,0.199,0.743,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4,0.588,1
2,2,Xanny Family,Future,0.0344,0.838,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4,0.173,1
3,3,Master Of None,Beach House,0.604,0.494,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4,0.23,1
4,4,Parallel Lines,Junior Boys,0.18,0.678,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4,0.904,1
5,5,Sneakin’,Drake,0.00479,0.804,251333,0.56,0.0,8,0.164,-6.682,1,0.185,85.023,4,0.264,1
6,6,Childs Play,Drake,0.0145,0.739,241400,0.472,7e-06,1,0.207,-11.204,1,0.156,80.03,4,0.308,1
7,7,Gyöngyhajú lány,Omega,0.0202,0.266,349667,0.348,0.664,10,0.16,-11.609,0,0.0371,144.154,4,0.393,1
8,8,I've Seen Footage,Death Grips,0.0481,0.603,202853,0.944,0.0,11,0.342,-3.626,0,0.347,130.035,4,0.398,1
9,9,Digital Animal,Honey Claws,0.00208,0.836,226840,0.603,0.0,7,0.571,-7.792,1,0.237,99.994,4,0.386,1


## 3. Data Wrangling

### 3.1 Create Nested Types

+ Combine the columns `key`, `mode` and `target` into an array using PySpark's `array` function.
+ Transform the acoustic qualities of a song (`acousticness`, `tempo`, `liveness`, `instrumentalness`, `energy`, `danceability`, `speechiness`, `loudness`) from individual columns into a map keyed by the quality name. [`create_map`](http://spark.apache.org/docs/2.2.0/api/python/pyspark.sql.html#pyspark.sql.functions.create_map) takes alternating key and value columns; here we use `F.lit(...)` to supply each key as a literal string.
+ Also, you cannot mix data types within the keys or within the values: all keys must be of the same type, and all values must be of the same type.
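
To make the shape of the result concrete, here is a plain-Python model (no Spark required) of what `array` and `create_map` produce for a single row when each key is paired with its own column. A list stands in for the `ARRAY` and a dict for the `MAP`; the function name `to_nested` is hypothetical, and the sample values are the first row of the table above.

```python
ARRAY_COLS = ["key", "mode", "target"]
MAP_COLS = ["acousticness", "danceability", "energy", "instrumentalness",
            "liveness", "loudness", "speechiness", "tempo"]

def to_nested(row):
    """Model one output row: a list stands in for ARRAY, a dict for MAP."""
    audience = [row[c] for c in ARRAY_COLS]
    qualities = {c: row[c] for c in MAP_COLS}
    # Mirror the constraint from the list above: all map values share one type.
    assert len({type(v) for v in qualities.values()}) == 1
    return {"song_title": row["song_title"],
            "audience": audience,
            "qualities": qualities}

row = {"song_title": "Mask Off", "key": 2, "mode": 1, "target": 1,
       "acousticness": 0.0102, "danceability": 0.833, "energy": 0.434,
       "instrumentalness": 0.0219, "liveness": 0.165, "loudness": -8.795,
       "speechiness": 0.431, "tempo": 150.062}

nested = to_nested(row)
print(nested["audience"])
print(nested["qualities"]["tempo"])
```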

In [12]:
spotify_map_df = (spotify_df
          .select('id', 'song_title', 'artist', 'duration_ms',
                  F.array('key', 'mode', 'target').alias('audience'), 
                  F.create_map(
                      F.lit('acousticness'), 'acousticness', 
                      F.lit('danceability'), 'acousticness',  # as run, this key reuses the 'acousticness' column; pass 'danceability' for the real values
                      F.lit('energy'), 'energy',
                      F.lit('instrumentalness'), 'instrumentalness',
                      F.lit('liveness'), 'liveness',
                      F.lit('loudness'), 'loudness',
                      F.lit('speechiness'), 'speechiness',
                      F.lit('tempo'), 'tempo'
                  ).alias('qualities'),
                 'time_signature',
                 'valence')
        .cache())

In [13]:
spotify_map_df.limit(10).toPandas()

Unnamed: 0,id,song_title,artist,duration_ms,audience,qualities,time_signature,valence
0,0,Mask Off,Future,204600,"[2, 1, 1]","{'acousticness': 0.0102, 'loudness': -8.795, 'liveness': 0.165, 'tempo': 150.062, 'instrumentalness': 0.0219, 'danceability': 0.0102, 'speechiness': 0.431, 'energy': 0.434}",4,0.286
1,1,Redbone,Childish Gambino,326933,"[1, 1, 1]","{'acousticness': 0.199, 'loudness': -10.401, 'liveness': 0.137, 'tempo': 160.083, 'instrumentalness': 0.00611, 'danceability': 0.199, 'speechiness': 0.0794, 'energy': 0.359}",4,0.588
2,2,Xanny Family,Future,185707,"[2, 1, 1]","{'acousticness': 0.0344, 'loudness': -7.148, 'liveness': 0.159, 'tempo': 75.044, 'instrumentalness': 0.000234, 'danceability': 0.0344, 'speechiness': 0.289, 'energy': 0.412}",4,0.173
3,3,Master Of None,Beach House,199413,"[5, 1, 1]","{'acousticness': 0.604, 'loudness': -15.236, 'liveness': 0.0922, 'tempo': 86.468, 'instrumentalness': 0.51, 'danceability': 0.604, 'speechiness': 0.0261, 'energy': 0.338}",4,0.23
4,4,Parallel Lines,Junior Boys,392893,"[5, 0, 1]","{'acousticness': 0.18, 'loudness': -11.648, 'liveness': 0.439, 'tempo': 174.004, 'instrumentalness': 0.512, 'danceability': 0.18, 'speechiness': 0.0694, 'energy': 0.561}",4,0.904
5,5,Sneakin’,Drake,251333,"[8, 1, 1]","{'acousticness': 0.00479, 'loudness': -6.682, 'liveness': 0.164, 'tempo': 85.023, 'instrumentalness': 0.0, 'danceability': 0.00479, 'speechiness': 0.185, 'energy': 0.56}",4,0.264
6,6,Childs Play,Drake,241400,"[1, 1, 1]","{'acousticness': 0.0145, 'loudness': -11.204, 'liveness': 0.207, 'tempo': 80.03, 'instrumentalness': 7.27e-06, 'danceability': 0.0145, 'speechiness': 0.156, 'energy': 0.472}",4,0.308
7,7,Gyöngyhajú lány,Omega,349667,"[10, 0, 1]","{'acousticness': 0.0202, 'loudness': -11.609, 'liveness': 0.16, 'tempo': 144.154, 'instrumentalness': 0.664, 'danceability': 0.0202, 'speechiness': 0.0371, 'energy': 0.348}",4,0.393
8,8,I've Seen Footage,Death Grips,202853,"[11, 0, 1]","{'acousticness': 0.0481, 'loudness': -3.626, 'liveness': 0.342, 'tempo': 130.035, 'instrumentalness': 0.0, 'danceability': 0.0481, 'speechiness': 0.347, 'energy': 0.944}",4,0.398
9,9,Digital Animal,Honey Claws,226840,"[7, 1, 1]","{'acousticness': 0.00208, 'loudness': -7.792, 'liveness': 0.571, 'tempo': 99.994, 'instrumentalness': 0.0, 'danceability': 0.00208, 'speechiness': 0.237, 'energy': 0.603}",4,0.386


In [14]:
# Let's check the schema of the new DataFrame
spotify_map_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- song_title: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- duration_ms: integer (nullable = true)
 |-- audience: array (nullable = false)
 |    |-- element: integer (containsNull = true)
 |-- qualities: map (nullable = false)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)
 |-- time_signature: integer (nullable = true)
 |-- valence: double (nullable = true)



**Write the DataFrame to JSON (a directory holding a single part file, because of `coalesce(1)`):**

In [15]:
spotify_map_df.coalesce(1).write.json(path='data/spotify-songs-json', mode="overwrite")
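
Spark's JSON writer emits one JSON object per line, with array columns as JSON arrays and map columns as JSON objects. The sketch below uses only Python's `json` module to show roughly what one such line looks like and that the nested structure survives a round trip. The record is abbreviated to two qualities, and the exact number formatting in Spark's output may differ.

```python
import json

# One abbreviated record in the nested shape built above.
record = {"id": 0, "song_title": "Mask Off", "artist": "Future",
          "duration_ms": 204600,
          "audience": [2, 1, 1],
          "qualities": {"acousticness": 0.0102, "tempo": 150.062},
          "time_signature": 4, "valence": 0.286}

line = json.dumps(record)      # one JSON object on a single line
print(line)

restored = json.loads(line)    # array -> list, map -> dict
```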

### 3.2 Reload the Restructured DataFrame Using an Explicit Schema with Nested Data Types

In [16]:
nested_schema = StructType([
    StructField('id', IntegerType(), nullable=False),
    StructField('song_title', StringType(), nullable=False),
    StructField('artist', StringType(), nullable=False),
    StructField('duration_ms', IntegerType(), nullable=False),
    StructField('audience', ArrayType(elementType=IntegerType()), nullable=False),
    StructField('qualities', MapType(keyType=StringType(), valueType=DoubleType(), valueContainsNull=False), nullable=True),
    StructField('time_signature', IntegerType(), nullable=False),
    StructField('valence', DoubleType(), nullable=False),
  ])

In [17]:
spotify_reloaded_df = spark.read.json(path='data/spotify-songs-json', schema=nested_schema).cache()

In [18]:
spotify_reloaded_df.limit(10).toPandas()

Unnamed: 0,id,song_title,artist,duration_ms,audience,qualities,time_signature,valence
0,0,Mask Off,Future,204600,"[2, 1, 1]","{'acousticness': 0.0102, 'loudness': -8.795, 'liveness': 0.165, 'tempo': 150.062, 'instrumentalness': 0.0219, 'danceability': 0.0102, 'speechiness': 0.431, 'energy': 0.434}",4,0.286
1,1,Redbone,Childish Gambino,326933,"[1, 1, 1]","{'acousticness': 0.199, 'loudness': -10.401, 'liveness': 0.137, 'tempo': 160.083, 'instrumentalness': 0.00611, 'danceability': 0.199, 'speechiness': 0.0794, 'energy': 0.359}",4,0.588
2,2,Xanny Family,Future,185707,"[2, 1, 1]","{'acousticness': 0.0344, 'loudness': -7.148, 'liveness': 0.159, 'tempo': 75.044, 'instrumentalness': 0.000234, 'danceability': 0.0344, 'speechiness': 0.289, 'energy': 0.412}",4,0.173
3,3,Master Of None,Beach House,199413,"[5, 1, 1]","{'acousticness': 0.604, 'loudness': -15.236, 'liveness': 0.0922, 'tempo': 86.468, 'instrumentalness': 0.51, 'danceability': 0.604, 'speechiness': 0.0261, 'energy': 0.338}",4,0.23
4,4,Parallel Lines,Junior Boys,392893,"[5, 0, 1]","{'acousticness': 0.18, 'loudness': -11.648, 'liveness': 0.439, 'tempo': 174.004, 'instrumentalness': 0.512, 'danceability': 0.18, 'speechiness': 0.0694, 'energy': 0.561}",4,0.904
5,5,Sneakin’,Drake,251333,"[8, 1, 1]","{'acousticness': 0.00479, 'loudness': -6.682, 'liveness': 0.164, 'tempo': 85.023, 'instrumentalness': 0.0, 'danceability': 0.00479, 'speechiness': 0.185, 'energy': 0.56}",4,0.264
6,6,Childs Play,Drake,241400,"[1, 1, 1]","{'acousticness': 0.0145, 'loudness': -11.204, 'liveness': 0.207, 'tempo': 80.03, 'instrumentalness': 7.27e-06, 'danceability': 0.0145, 'speechiness': 0.156, 'energy': 0.472}",4,0.308
7,7,Gyöngyhajú lány,Omega,349667,"[10, 0, 1]","{'acousticness': 0.0202, 'loudness': -11.609, 'liveness': 0.16, 'tempo': 144.154, 'instrumentalness': 0.664, 'danceability': 0.0202, 'speechiness': 0.0371, 'energy': 0.348}",4,0.393
8,8,I've Seen Footage,Death Grips,202853,"[11, 0, 1]","{'acousticness': 0.0481, 'loudness': -3.626, 'liveness': 0.342, 'tempo': 130.035, 'instrumentalness': 0.0, 'danceability': 0.0481, 'speechiness': 0.347, 'energy': 0.944}",4,0.398
9,9,Digital Animal,Honey Claws,226840,"[7, 1, 1]","{'acousticness': 0.00208, 'loudness': -7.792, 'liveness': 0.571, 'tempo': 99.994, 'instrumentalness': 0.0, 'danceability': 0.00208, 'speechiness': 0.237, 'energy': 0.603}",4,0.386


In [19]:
spotify_reloaded_df.printSchema()

root
 |-- id: integer (nullable = true)
 |-- song_title: string (nullable = true)
 |-- artist: string (nullable = true)
 |-- duration_ms: integer (nullable = true)
 |-- audience: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- qualities: map (nullable = true)
 |    |-- key: string
 |    |-- value: double (valueContainsNull = true)
 |-- time_signature: integer (nullable = true)
 |-- valence: double (nullable = true)

Note that every field is reported as `nullable = true` even though `nested_schema` declares most of them with `nullable=False`: Spark treats a schema supplied to a file-based reader as nullable.



### 3.3 Extract Individual Nested/Complex Attributes as a Column

We can extract out each nested attribute within an array or map into a column of its own.

**Extract out ARRAY elements:**  
The audience column is a combination of three attributes 'key', 'mode' and 'target'. Extract out each array element into a column of its own.

In [20]:
(spotify_map_df
 .select('song_title',
         col('audience').getItem(0).alias('key'), 
         col('audience').getItem(1).alias('mode'),
         col('audience').getItem(2).alias('target'))
 .limit(10)
 .toPandas())

Unnamed: 0,song_title,key,mode,target
0,Mask Off,2,1,1
1,Redbone,1,1,1
2,Xanny Family,2,1,1
3,Master Of None,5,1,1
4,Parallel Lines,5,0,1
5,Sneakin’,8,1,1
6,Childs Play,1,1,1
7,Gyöngyhajú lány,10,0,1
8,I've Seen Footage,11,0,1
9,Digital Animal,7,1,1


**Extract out MAP attributes:**  
The `qualities` column is a map created from the attributes `acousticness`, `tempo`, `liveness`, `instrumentalness`, etc. of a song. Extract those qualities into individual columns.

In [21]:
(spotify_map_df
 .select('song_title',
     col('qualities').getItem('acousticness').alias('acousticness'),
     col('qualities').getItem('speechiness').alias('speechiness')
 )
 .limit(10)
 .toPandas())

Unnamed: 0,song_title,acousticness,speechiness
0,Mask Off,0.0102,0.431
1,Redbone,0.199,0.0794
2,Xanny Family,0.0344,0.289
3,Master Of None,0.604,0.0261
4,Parallel Lines,0.18,0.0694
5,Sneakin’,0.00479,0.185
6,Childs Play,0.0145,0.156
7,Gyöngyhajú lány,0.0202,0.0371
8,I've Seen Footage,0.0481,0.347
9,Digital Animal,0.00208,0.237
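
As a mental model, `getItem` behaves like a safe lookup on each row's value: by position for an array, by key for a map. The plain-Python sketch below assumes Spark's ANSI mode is disabled, in which case a missing key or an out-of-range index yields null (modelled here as `None`); with ANSI mode enabled, some of these lookups may raise an error instead. The function name `get_item` is hypothetical.

```python
def get_item(container, key):
    """Model of getItem on one row's value; None stands in for SQL NULL."""
    if isinstance(container, dict):
        # MAP column: look up by key, NULL when the key is absent.
        return container.get(key)
    # ARRAY column: look up by 0-based position, NULL when out of range.
    return container[key] if 0 <= key < len(container) else None

audience = [2, 1, 1]
qualities = {"acousticness": 0.0102, "speechiness": 0.431}

print(get_item(audience, 0))
print(get_item(qualities, "speechiness"))
print(get_item(qualities, "valence"))
print(get_item(audience, 5))
```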


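**Refactor:** We can make the code above more concise by building the list of column expressions programmatically and passing it to a single `select`. A single `select` also keeps the parsed logical plan to one projection, unlike a chain of `withColumn` calls.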

In [22]:
quality_names = ["acousticness", "speechiness", "liveness", "tempo"]
cols = [F.col("song_title")] + [F.col("qualities").getItem(q).alias(q) for q in quality_names]

spotify_map_df.select(cols).limit(10).toPandas()

Unnamed: 0,song_title,acousticness,speechiness,liveness,tempo
0,Mask Off,0.0102,0.431,0.165,150.062
1,Redbone,0.199,0.0794,0.137,160.083
2,Xanny Family,0.0344,0.289,0.159,75.044
3,Master Of None,0.604,0.0261,0.0922,86.468
4,Parallel Lines,0.18,0.0694,0.439,174.004
5,Sneakin’,0.00479,0.185,0.164,85.023
6,Childs Play,0.0145,0.156,0.207,80.03
7,Gyöngyhajú lány,0.0202,0.0371,0.16,144.154
8,I've Seen Footage,0.0481,0.347,0.342,130.035
9,Digital Animal,0.00208,0.237,0.571,99.994
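
The same projection can also be expressed as SQL expression strings and handed to `DataFrame.selectExpr`. This is an alternative I am suggesting, not something the notebook runs. The sketch only builds the strings, so it needs no Spark session; the variable name `select_exprs` is illustrative.

```python
quality_names = ["acousticness", "speechiness", "liveness", "tempo"]

# One SQL expression per map key, e.g. "qualities['tempo'] AS tempo".
select_exprs = ["song_title"] + [f"qualities['{q}'] AS {q}" for q in quality_names]

# With the notebook's DataFrame in scope this would be used as:
#   spotify_map_df.selectExpr(*select_exprs)
for expr in select_exprs:
    print(expr)
```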


**Extract out MAP attributes programmatically:**  

Manually listing the columns is fine if we know all the distinct keys in the map. If we don't, we need a programmatic solution, but be warned: this approach is slow! I learnt this approach from [1](https://mungingdata.com/pyspark/dict-map-to-multiple-columns/) and have modified it a bit.

In [23]:
# Step 1: Create a DataFrame with all the unique keys
keys_df = spotify_map_df.select(F.collect_set(F.map_keys(F.col("qualities"))))

In [35]:
keys_df.show(truncate=False)

+------------------------------------------------------------------------------------------------+
|collect_set(map_keys(qualities))                                                                |
+------------------------------------------------------------------------------------------------+
|[[acousticness, danceability, energy, instrumentalness, liveness, loudness, speechiness, tempo]]|
+------------------------------------------------------------------------------------------------+



In [25]:
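# Step 2: Convert the DataFrame to a list with all the unique keys.
# collect() returns one Row whose single column is a list of distinct key lists;
# every song has the same keys here, so the first list is complete.
keys = keys_df.collect()[0][0][0]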

In [26]:
keys

['acousticness',
 'danceability',
 'energy',
 'instrumentalness',
 'liveness',
 'loudness',
 'speechiness',
 'tempo']

The `collect()` method gathers the data on the driver node, which can be slow. We call `collect_set()` first so that only the distinct key lists reach the driver. Collecting data on a single node while the worker nodes sit idle should be avoided whenever possible; we use `collect()` here only because it is the only way to get the keys into a Python list.
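
Step 2 takes the first distinct key list, which is enough here because every song carries the same eight keys. If the maps had differing keys, the collected value would hold several lists that need to be merged. A plain-Python sketch of that merge, using made-up key lists in place of `keys_df.collect()[0][0]`:

```python
# Stand-in for keys_df.collect()[0][0] when rows carry different key sets.
collected_key_lists = [
    ["acousticness", "tempo"],
    ["acousticness", "energy", "tempo"],
]

# Union of all keys, sorted for a stable column order.
all_keys = sorted({k for key_list in collected_key_lists for k in key_list})
print(all_keys)  # ['acousticness', 'energy', 'tempo']
```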

In [27]:
# Step 3: Create a list of Column objects for the map items
key_cols = [F.col("qualities").getItem(k).alias(k) for k in keys]

In [28]:
# Step 4: Add any additional columns before calculating the final result
final_cols = [F.col("song_title")] + key_cols

In [29]:
# Step 5: Run a select() to get the final result
spotify_map_df.select(final_cols).limit(10).toPandas()

Unnamed: 0,song_title,acousticness,danceability,energy,instrumentalness,liveness,loudness,speechiness,tempo
0,Mask Off,0.0102,0.0102,0.434,0.0219,0.165,-8.795,0.431,150.062
1,Redbone,0.199,0.199,0.359,0.00611,0.137,-10.401,0.0794,160.083
2,Xanny Family,0.0344,0.0344,0.412,0.000234,0.159,-7.148,0.289,75.044
3,Master Of None,0.604,0.604,0.338,0.51,0.0922,-15.236,0.0261,86.468
4,Parallel Lines,0.18,0.18,0.561,0.512,0.439,-11.648,0.0694,174.004
5,Sneakin’,0.00479,0.00479,0.56,0.0,0.164,-6.682,0.185,85.023
6,Childs Play,0.0145,0.0145,0.472,7e-06,0.207,-11.204,0.156,80.03
7,Gyöngyhajú lány,0.0202,0.0202,0.348,0.664,0.16,-11.609,0.0371,144.154
8,I've Seen Footage,0.0481,0.0481,0.944,0.0,0.342,-3.626,0.347,130.035
9,Digital Animal,0.00208,0.00208,0.603,0.0,0.571,-7.792,0.237,99.994


### Examining logical plans
Call `explain(True)` to print the parsed, analyzed and optimized logical plans along with the physical plan, and check how much work the optimizer has to do on the parsed plan:

In [30]:
spotify_map_df.select(final_cols).explain(True)

== Parsed Logical Plan ==
'Project [unresolvedalias('song_title, None), 'qualities[acousticness] AS acousticness#2015, 'qualities[danceability] AS danceability#2016, 'qualities[energy] AS energy#2017, 'qualities[instrumentalness] AS instrumentalness#2018, 'qualities[liveness] AS liveness#2019, 'qualities[loudness] AS loudness#2020, 'qualities[speechiness] AS speechiness#2021, 'qualities[tempo] AS tempo#2022]
+- Project [id#16, song_title#17, artist#18, duration_ms#21, array(key#24, mode#27, target#32) AS audience#526, map(acousticness, acousticness#19, danceability, acousticness#19, energy, energy#22, instrumentalness, instrumentalness#23, liveness, liveness#25, loudness, loudness#26, speechiness, speechiness#28, tempo, tempo#29) AS qualities#527, time_signature#30, valence#31]
   +- Relation[id#16,song_title#17,artist#18,acousticness#19,danceability#20,duration_ms#21,energy#22,instrumentalness#23,key#24,liveness#25,loudness#26,mode#27,speechiness#28,tempo#29,time_signature#30,valence#

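### Reconstruct the Original Table:
We can use everything learned above to reconstruct the original table. One caveat: cell 12 filled the `danceability` key from the `acousticness` column, so the `danceability` values below repeat `acousticness` rather than the source data.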

In [31]:
(spotify_reloaded_df
 .select('id', 'song_title', 'artist',
     col('qualities').getItem('acousticness').alias('acousticness'),
     col('qualities').getItem('danceability').alias('danceability'),
     'duration_ms',
     col('qualities').getItem('energy').alias('energy'),
     col('qualities').getItem('instrumentalness').alias('instrumentalness'),
     col('audience').getItem(0).alias('key'),
     col('qualities').getItem('liveness').alias('liveness'),
     col('qualities').getItem('loudness').alias('loudness'),
     col('audience').getItem(1).alias('mode'),
     col('qualities').getItem('speechiness').alias('speechiness'),
     col('qualities').getItem('tempo').alias('tempo'),
     'time_signature',
     'valence',
     col('audience').getItem(2).alias('target')
 )
 .limit(10)
 .toPandas())

Unnamed: 0,id,song_title,artist,acousticness,danceability,duration_ms,energy,instrumentalness,key,liveness,loudness,mode,speechiness,tempo,time_signature,valence,target
0,0,Mask Off,Future,0.0102,0.0102,204600,0.434,0.0219,2,0.165,-8.795,1,0.431,150.062,4,0.286,1
1,1,Redbone,Childish Gambino,0.199,0.199,326933,0.359,0.00611,1,0.137,-10.401,1,0.0794,160.083,4,0.588,1
2,2,Xanny Family,Future,0.0344,0.0344,185707,0.412,0.000234,2,0.159,-7.148,1,0.289,75.044,4,0.173,1
3,3,Master Of None,Beach House,0.604,0.604,199413,0.338,0.51,5,0.0922,-15.236,1,0.0261,86.468,4,0.23,1
4,4,Parallel Lines,Junior Boys,0.18,0.18,392893,0.561,0.512,5,0.439,-11.648,0,0.0694,174.004,4,0.904,1
5,5,Sneakin’,Drake,0.00479,0.00479,251333,0.56,0.0,8,0.164,-6.682,1,0.185,85.023,4,0.264,1
6,6,Childs Play,Drake,0.0145,0.0145,241400,0.472,7e-06,1,0.207,-11.204,1,0.156,80.03,4,0.308,1
7,7,Gyöngyhajú lány,Omega,0.0202,0.0202,349667,0.348,0.664,10,0.16,-11.609,0,0.0371,144.154,4,0.393,1
8,8,I've Seen Footage,Death Grips,0.0481,0.0481,202853,0.944,0.0,11,0.342,-3.626,0,0.347,130.035,4,0.398,1
9,9,Digital Animal,Honey Claws,0.00208,0.00208,226840,0.603,0.0,7,0.571,-7.792,1,0.237,99.994,4,0.386,1


### 3.4 Explode Individual Nested/Complex Attributes into Rows of Their Own

Using the `posexplode` function, we can turn each array element into a row of its own, together with its position in the array.

In [32]:
(spotify_reloaded_df
 .select('song_title', F.posexplode('audience'))
 .limit(10)
 .toPandas())

Unnamed: 0,song_title,pos,col
0,Mask Off,0,2
1,Mask Off,1,1
2,Mask Off,2,1
3,Redbone,0,1
4,Redbone,1,1
5,Redbone,2,1
6,Xanny Family,0,2
7,Xanny Family,1,1
8,Xanny Family,2,1
9,Master Of None,0,5
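
In plain-Python terms, `posexplode` on an array column is a nested loop with `enumerate`: one output row per element, carrying the element's 0-based position. The sketch below models that on two songs from the table above; a song with an empty array would contribute no rows.

```python
# (song_title, audience) pairs taken from the first two rows above.
songs = [("Mask Off", [2, 1, 1]), ("Redbone", [1, 1, 1])]

# One (song_title, pos, col) row per array element.
rows = [(title, pos, value)
        for title, audience in songs
        for pos, value in enumerate(audience)]

for r in rows:
    print(r)
```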


Using the `explode` function, we can create a new row for each element of an array or for each key-value pair of a map.

In [33]:
(spotify_reloaded_df
 .select('song_title', F.explode('qualities').alias("qualities", "value"))
 .limit(10)
 .toPandas())

Unnamed: 0,song_title,qualities,value
0,Mask Off,acousticness,0.0102
1,Mask Off,danceability,0.0102
2,Mask Off,energy,0.434
3,Mask Off,instrumentalness,0.0219
4,Mask Off,liveness,0.165
5,Mask Off,loudness,-8.795
6,Mask Off,speechiness,0.431
7,Mask Off,tempo,150.062
8,Redbone,acousticness,0.199
9,Redbone,danceability,0.199
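
The reverse operation, imploding the exploded rows back into one map per song, amounts to grouping by the song and rebuilding the key-value pairs. The plain-Python sketch below models both directions on abbreviated data. In PySpark, one way to express the implode step would be `groupBy('song_title').agg(F.map_from_entries(F.collect_list(F.struct('qualities', 'value'))))`; that pairing is my suggestion and is not run in this notebook.

```python
from collections import defaultdict

# Abbreviated maps for two songs from the table above.
songs = {"Mask Off": {"energy": 0.434, "tempo": 150.062},
         "Redbone": {"energy": 0.359, "tempo": 160.083}}

# explode: one (song_title, key, value) row per map entry
exploded = [(title, k, v)
            for title, qualities in songs.items()
            for k, v in qualities.items()]

# implode: group the rows by song_title and rebuild each map
imploded = defaultdict(dict)
for title, k, v in exploded:
    imploded[title][k] = v

print(dict(imploded))
```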


In [36]:
spark.stop()

**References**  
These resources helped me a lot in understanding the Map datatype and its usage. Please visit them; they are great resources in their own right.
+ \[1\] https://mungingdata.com/pyspark/dict-map-to-multiple-columns/
+ \[2\] https://sparkbyexamples.com/spark/spark-dataframe-map-maptype-column/