# Working with JSON Guide

This notebook showcases the various ways you can interrogate and read JSON data structures in Databricks using Python.

# Sources
- https://amiradata.com/pyspark-explode-nested-array-map-to-rows/
- https://medium.com/expedia-group-tech/working-with-json-in-apache-spark-1ecf553c2a8c
- https://databricks.com/blog/2021/11/11/10-powerful-features-to-simplify-semi-structured-data-management-in-the-databricks-lakehouse.html

# Source Data To Work On

When JSON data is read from a column in a tabular source such as a CSV file, some data type conversion is needed. *When reading a .json file as is, this is not required.*


Use the *get_json_df* function to parse the JSON string column into JSON-typed data. Without this step you can't query the data with dot notation or explode its arrays; Spark just sees it as string data.
The function wraps the JSON data in a top-level attribute so that it can be converted. If the JSON data in the dataframe is just an array, the schema retrieval fails.
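
To illustrate the wrapping step on its own, here is a minimal plain-Python sketch that uses only the standard `json` module rather than Spark. The helper name `wrap_json_string` and the sample string are hypothetical and exist only for this illustration; `get_json_df` performs the equivalent string concatenation with `concat` and `lit`.

```python
import json

def wrap_json_string(raw: str, key: str = "transformed_json") -> str:
    """Wrap a raw JSON string in an object with a single top-level attribute."""
    return '{"' + key + '": ' + raw + "}"

# Hypothetical sample: the column value is a bare JSON array, not an object.
raw = '[{"_id": "a1", "age": 20}, {"_id": "b2", "age": 25}]'

wrapped = wrap_json_string(raw)
parsed = json.loads(wrapped)

# The array now sits under one named attribute, so the document is an
# object with a single array-typed field instead of a bare array.
print(type(parsed).__name__)            # dict
print(len(parsed["transformed_json"]))  # 2
```

The wrapper key is only scaffolding. Once the schema has been inferred, the attribute can be renamed back to the original column name.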

In [0]:

from pyspark.sql.functions import col, concat, from_json, lit

def get_json_df(inputDF, primary_partition_column, json_column_name, spark_session):
    '''
    Description:
    This function provides the schema of the JSON records and the dataframe to be used for flattening. If this doesn't happen, the source JSON string remains a string and can't be queried like JSON.
        :param inputDF: [type: pyspark.sql.dataframe.DataFrame] input dataframe
        :param primary_partition_column: [type: string] name of primary partition column
        :param json_column_name: [type: string] name of the column with json string
        :param spark_session: SparkSession object
        :return df: dataframe to be used for flattening
    '''
    inputDF = inputDF if primary_partition_column is None else inputDF.drop(primary_partition_column)
    # Wrap the JSON string in an outer object with a single attribute, transformed_json
    df1 = inputDF.withColumn('transformed_json', concat(lit("""{"transformed_json" :"""), inputDF[json_column_name], lit("""}""")))
    json_df = spark_session.read.json(df1.rdd.map(lambda row: row.transformed_json))
    # get schema
    json_schema = json_df.schema
    
    #Return a dataframe with the original column name but with properly typed JSON data
    df = df1.drop(json_column_name)\
        .withColumn(json_column_name, from_json(col('transformed_json'), json_schema))\
        .drop('transformed_json')\
        .select(f'{json_column_name}.*', '*')\
        .drop(json_column_name)\
        .withColumnRenamed("transformed_json", json_column_name)
    return df

Read data from storage account into a dataframe.

Convert the column containing the JSON string to JSON-typed data that Databricks can interrogate.

This data sample has an array on each row. Each element inside that array contains both nested attributes and two nested arrays.

In [0]:
from pyspark.sql.functions import col, explode_outer, from_json

#Get raw data into a dataframe
#Ensure you specify the quote and escape characters.
#multiLine is false by default.
rawJsonDF = spark.read\
.format("csv")\
.options(header="true", escape='"', quote='"', multiline=False, inferSchema=True)\
.load("/mnt/datalake_rawdata/json-generator.com/MockJsonInCSV.csv")\
.select("JSONColumn")

#Get JSON Typed data
rawJSON_ProperTyped_DF = get_json_df(rawJsonDF, None, "JSONColumn", spark)

#View data
display(rawJSON_ProperTyped_DF)

#View the schema
rawJSON_ProperTyped_DF.printSchema()


JSONColumn
"List(List(List(5, 4, 1, 6, 3, 2), 622b43c868725fca32d05127, Sit non proident Lorem laboris non id excepteur voluptate nulla fugiat excepteur eiusmod est. Sunt deserunt officia cillum incididunt dolore cillum ipsum eu. In est laboris occaecat id sint laborum dolore mollit qui anim elit adipisicing. Eu labore esse qui do dolor Lorem elit fugiat eu duis. Eiusmod in incididunt qui do labore consectetur irure ipsum aliqua incididunt ex tempor nulla adipisicing. Sunt esse commodo proident est id Lorem ea. Commodo mollit est officia esse nulla nisi tempor ullamco aliqua. , 116 Stewart Street, Longoria, Alaska, 3997, 20, $2,387.61, GENMY, hughesslater@genmy.com, green, banana, List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay)), male, Hello, Hughes Slater! You have 1 unread messages., 8a81ca3a-0813-4902-820d-d58e9703ec04, 0, true, -47.586921, 22.966975, Hughes Slater, +1 (883) 494-2174, http://placehold.it/32x32, 2022-02-03T10:33:31 -02:00, List(non, ex, sint, culpa, aliqua, esse, pariatur)), List(List(5, 4, 1, 6, 3, 2), 622b43c8d2984c9956b839d9, Sunt culpa commodo ullamco aliqua fugiat sit fugiat eu fugiat eiusmod amet esse. Consequat nostrud exercitation deserunt magna labore elit ipsum Lorem laboris cupidatat reprehenderit mollit magna velit. Duis esse cillum proident ad id excepteur. Dolore excepteur cillum duis commodo. Sint dolore voluptate in amet nostrud culpa qui consequat tempor incididunt eiusmod irure sit officia. Consequat cillum occaecat Lorem nulla nisi qui excepteur. , 894 Cambridge Place, Dexter, Utah, 8622, 25, $1,910.43, GEEKNET, burchconrad@geeknet.com, green, apple, List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin)), male, Hello, Burch Conrad! 
You have 7 unread messages., f4b512ca-e362-48a5-a5a5-1b21013bbd4c, 1, false, 33.526889, -69.444991, Burch Conrad, +1 (863) 523-2711, http://placehold.it/32x32, 2018-10-11T07:17:58 -02:00, List(in, non, adipisicing, veniam, irure, excepteur, laboris)), List(List(5, 4, 1, 6, 3, 2), 622b43c806bb45d49234240b, Lorem aute eiusmod et Lorem quis ea. Duis velit anim minim aute magna eiusmod excepteur nulla ad velit nostrud. Amet laboris voluptate nostrud nulla do minim non mollit duis sunt officia sunt nisi reprehenderit. , 111 Beverly Road, Canby, Oklahoma, 4837, 24, $1,711.22, SKINSERVE, ellapacheco@skinserve.com, brown, apple, List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran)), female, Hello, Ella Pacheco! You have 6 unread messages., 6f752038-214b-4cd7-b6d0-14250292a1e3, 2, false, -18.438374, -137.932995, Ella Pacheco, +1 (915) 430-3434, http://placehold.it/32x32, 2015-08-13T05:28:29 -02:00, List(duis, occaecat, Lorem, aute, ut, nisi, enim)), List(List(5, 4, 1, 6, 3, 2), 622b43c8858495221ea6ecda, Sit non veniam amet minim. Sint cillum excepteur et incididunt. Do reprehenderit ex elit ut excepteur nostrud. Commodo elit minim velit sint esse duis eu sint minim laborum mollit. Fugiat nulla consequat veniam nisi culpa sit do veniam. Sunt anim est culpa nisi anim mollit consequat. Qui dolor sint id minim irure cillum. , 372 Ainslie Street, Cornfields, Alabama, 6363, 23, $1,889.73, EBIDCO, estellarich@ebidco.com, brown, strawberry, List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman)), female, Hello, Estella Rich! You have 7 unread messages., 121b43bf-abc7-46bf-8ac3-5355fe1d8290, 3, true, 38.887249, -41.450973, Estella Rich, +1 (957) 478-3558, http://placehold.it/32x32, 2018-03-04T12:10:29 -02:00, List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud)), List(List(5, 4, 1, 6, 3, 2), 622b43c876311a36c1c9da33, Laborum eu deserunt commodo deserunt pariatur sit quis fugiat consequat cillum ex mollit dolore consequat. 
Officia irure cillum et do incididunt eiusmod laborum consequat laboris ex fugiat ullamco. Anim nisi velit adipisicing consectetur Lorem ut cupidatat. , 521 Stratford Road, Stewartville, Vermont, 6312, 33, $1,498.87, LOTRON, carpentersullivan@lotron.com, brown, apple, List(List(0, Doreen Hanson), List(1, Donna Jackson), List(2, Chandler Wiley)), male, Hello, Carpenter Sullivan! You have 8 unread messages., c3fd5dee-5ef5-40a0-9d9d-8fe596038009, 4, true, -73.782288, -67.572945, Carpenter Sullivan, +1 (967) 543-2046, http://placehold.it/32x32, 2014-12-22T05:22:31 -02:00, List(sint, excepteur, excepteur, ut, deserunt, ut, dolore)), List(List(5, 4, 1, 6, 3, 2), 622b43c898fac31d1e0593ea, Mollit pariatur aliqua ullamco occaecat duis consectetur laboris Lorem est veniam enim quis laboris. Incididunt deserunt commodo sit reprehenderit. Ipsum aute occaecat mollit sit est do duis nulla id deserunt. Duis est dolore proident minim mollit veniam pariatur consequat ullamco magna. Sint fugiat quis do magna amet consequat fugiat Lorem est id id cupidatat veniam. Commodo commodo reprehenderit minim aute exercitation duis. , 130 Lake Street, Northridge, New Hampshire, 5051, 31, $3,781.32, INTERFIND, nadiablake@interfind.com, brown, strawberry, List(List(0, Melba Chen), List(1, Richmond Norman), List(2, Henson Graham)), female, Hello, Nadia Blake! You have 1 unread messages., b66dba8f-7156-4600-9742-7535777a0ca3, 5, true, 71.733634, -121.91096, Nadia Blake, +1 (965) 434-3202, http://placehold.it/32x32, 2014-06-15T06:31:58 -02:00, List(do, consequat, mollit, mollit, ex, velit, laborum)))"
"List(List(List(5, 4, 1, 6, 3, 2), 622b43c868725fca32d05127, Sit non proident Lorem laboris non id excepteur voluptate nulla fugiat excepteur eiusmod est. Sunt deserunt officia cillum incididunt dolore cillum ipsum eu. In est laboris occaecat id sint laborum dolore mollit qui anim elit adipisicing. Eu labore esse qui do dolor Lorem elit fugiat eu duis. Eiusmod in incididunt qui do labore consectetur irure ipsum aliqua incididunt ex tempor nulla adipisicing. Sunt esse commodo proident est id Lorem ea. Commodo mollit est officia esse nulla nisi tempor ullamco aliqua. , 116 Stewart Street, Longoria, Alaska, 3997, 20, $2,387.61, GENMY, hughesslater@genmy.com, green, banana, List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay)), male, Hello, Hughes Slater! You have 1 unread messages., 8a81ca3a-0813-4902-820d-d58e9703ec04, 0, true, -47.586921, 22.966975, Hughes Slater, +1 (883) 494-2174, http://placehold.it/32x32, 2022-02-03T10:33:31 -02:00, List(non, ex, sint, culpa, aliqua, esse, pariatur)), List(List(5, 4, 1, 6, 3, 2), 622b43c8d2984c9956b839d9, Sunt culpa commodo ullamco aliqua fugiat sit fugiat eu fugiat eiusmod amet esse. Consequat nostrud exercitation deserunt magna labore elit ipsum Lorem laboris cupidatat reprehenderit mollit magna velit. Duis esse cillum proident ad id excepteur. Dolore excepteur cillum duis commodo. Sint dolore voluptate in amet nostrud culpa qui consequat tempor incididunt eiusmod irure sit officia. Consequat cillum occaecat Lorem nulla nisi qui excepteur. , 894 Cambridge Place, Dexter, Utah, 8622, 25, $1,910.43, GEEKNET, burchconrad@geeknet.com, green, apple, List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin)), male, Hello, Burch Conrad! 
You have 7 unread messages., f4b512ca-e362-48a5-a5a5-1b21013bbd4c, 1, false, 33.526889, -69.444991, Burch Conrad, +1 (863) 523-2711, http://placehold.it/32x32, 2018-10-11T07:17:58 -02:00, List(in, non, adipisicing, veniam, irure, excepteur, laboris)), List(List(5, 4, 1, 6, 3, 2), 622b43c806bb45d49234240b, Lorem aute eiusmod et Lorem quis ea. Duis velit anim minim aute magna eiusmod excepteur nulla ad velit nostrud. Amet laboris voluptate nostrud nulla do minim non mollit duis sunt officia sunt nisi reprehenderit. , 111 Beverly Road, Canby, Oklahoma, 4837, 24, $1,711.22, SKINSERVE, ellapacheco@skinserve.com, brown, apple, List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran)), female, Hello, Ella Pacheco! You have 6 unread messages., 6f752038-214b-4cd7-b6d0-14250292a1e3, 2, false, -18.438374, -137.932995, Ella Pacheco, +1 (915) 430-3434, http://placehold.it/32x32, 2015-08-13T05:28:29 -02:00, List(duis, occaecat, Lorem, aute, ut, nisi, enim)), List(List(5, 4, 1, 6, 3, 2), 622b43c8858495221ea6ecda, Sit non veniam amet minim. Sint cillum excepteur et incididunt. Do reprehenderit ex elit ut excepteur nostrud. Commodo elit minim velit sint esse duis eu sint minim laborum mollit. Fugiat nulla consequat veniam nisi culpa sit do veniam. Sunt anim est culpa nisi anim mollit consequat. Qui dolor sint id minim irure cillum. , 372 Ainslie Street, Cornfields, Alabama, 6363, 23, $1,889.73, EBIDCO, estellarich@ebidco.com, brown, strawberry, List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman)), female, Hello, Estella Rich! You have 7 unread messages., 121b43bf-abc7-46bf-8ac3-5355fe1d8290, 3, true, 38.887249, -41.450973, Estella Rich, +1 (957) 478-3558, http://placehold.it/32x32, 2018-03-04T12:10:29 -02:00, List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud)), List(List(5, 4, 1, 6, 3, 2), 622b43c876311a36c1c9da33, Laborum eu deserunt commodo deserunt pariatur sit quis fugiat consequat cillum ex mollit dolore consequat. 
Officia irure cillum et do incididunt eiusmod laborum consequat laboris ex fugiat ullamco. Anim nisi velit adipisicing consectetur Lorem ut cupidatat. , 521 Stratford Road, Stewartville, Vermont, 6312, 33, $1,498.87, LOTRON, carpentersullivan@lotron.com, brown, apple, List(List(0, Doreen Hanson), List(1, Donna Jackson), List(2, Chandler Wiley)), male, Hello, Carpenter Sullivan! You have 8 unread messages., c3fd5dee-5ef5-40a0-9d9d-8fe596038009, 4, true, -73.782288, -67.572945, Carpenter Sullivan, +1 (967) 543-2046, http://placehold.it/32x32, 2014-12-22T05:22:31 -02:00, List(sint, excepteur, excepteur, ut, deserunt, ut, dolore)), List(List(5, 4, 1, 6, 3, 2), 622b43c898fac31d1e0593ea, Mollit pariatur aliqua ullamco occaecat duis consectetur laboris Lorem est veniam enim quis laboris. Incididunt deserunt commodo sit reprehenderit. Ipsum aute occaecat mollit sit est do duis nulla id deserunt. Duis est dolore proident minim mollit veniam pariatur consequat ullamco magna. Sint fugiat quis do magna amet consequat fugiat Lorem est id id cupidatat veniam. Commodo commodo reprehenderit minim aute exercitation duis. , 130 Lake Street, Northridge, New Hampshire, 5051, 31, $3,781.32, INTERFIND, nadiablake@interfind.com, brown, strawberry, List(List(0, Melba Chen), List(1, Richmond Norman), List(2, Henson Graham)), female, Hello, Nadia Blake! You have 1 unread messages., b66dba8f-7156-4600-9742-7535777a0ca3, 5, true, 71.733634, -121.91096, Nadia Blake, +1 (965) 434-3202, http://placehold.it/32x32, 2014-06-15T06:31:58 -02:00, List(do, consequat, mollit, mollit, ex, velit, laborum)))"


# Arrays

In this example, each record in the source dataframe has a JSON array as its contents.

The code below will:
- Explode the array so that each JSON record in the array sits on its own row in the dataframe, using `pyspark.sql.functions.explode_outer`.
  - `explode`: Returns a new row for each element in the given array or map.
  - `explode_outer`: Returns a new row for each element in the given array or map. Unlike `explode`, if the array/map is null or empty then null is produced.
  - So if the cell in the original dataframe was an array of Struct types, each original row becomes multiple new rows, with each Struct on its own row.
- After this explode operation, the dataframe can contain more than one row for each source dataframe row.
- All data is still in a single column; the JSON attributes have not been flattened out yet.
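
The difference between `explode` and `explode_outer` described above can be modelled in plain Python. This is only a sketch of the row-level semantics over a list of dicts, not Spark code; the two functions and the sample rows are hypothetical stand-ins written for this illustration.

```python
def explode(rows, column):
    """Model of explode: one output row per array element.
    Rows whose array is None or empty produce no output rows."""
    return [{**row, column: item} for row in rows for item in (row[column] or [])]

def explode_outer(rows, column):
    """Model of explode_outer: like explode, but a None or empty
    array still produces one row, with None in the exploded column."""
    out = []
    for row in rows:
        items = row[column] or [None]
        out.extend({**row, column: item} for item in items)
    return out

# Hypothetical sample rows: a populated array, an empty array, and a null.
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},
    {"id": 3, "tags": None},
]

exploded = explode(rows, "tags")
exploded_outer = explode_outer(rows, "tags")

print(len(exploded))        # 2
print(len(exploded_outer))  # 4
```

In this model, ids 2 and 3 disappear under `explode` but survive under `explode_outer`, which is why the outer variant is the safer choice when source rows must not be lost.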

In [0]:
from pyspark.sql.functions import explode_outer

ParsedJSONDF = rawJSON_ProperTyped_DF.withColumn("JSONColumn", explode_outer("JSONColumn"))

display(ParsedJSONDF)

JSONColumn
"List(List(5, 4, 1, 6, 3, 2), 622b43c868725fca32d05127, Sit non proident Lorem laboris non id excepteur voluptate nulla fugiat excepteur eiusmod est. Sunt deserunt officia cillum incididunt dolore cillum ipsum eu. In est laboris occaecat id sint laborum dolore mollit qui anim elit adipisicing. Eu labore esse qui do dolor Lorem elit fugiat eu duis. Eiusmod in incididunt qui do labore consectetur irure ipsum aliqua incididunt ex tempor nulla adipisicing. Sunt esse commodo proident est id Lorem ea. Commodo mollit est officia esse nulla nisi tempor ullamco aliqua. , 116 Stewart Street, Longoria, Alaska, 3997, 20, $2,387.61, GENMY, hughesslater@genmy.com, green, banana, List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay)), male, Hello, Hughes Slater! You have 1 unread messages., 8a81ca3a-0813-4902-820d-d58e9703ec04, 0, true, -47.586921, 22.966975, Hughes Slater, +1 (883) 494-2174, http://placehold.it/32x32, 2022-02-03T10:33:31 -02:00, List(non, ex, sint, culpa, aliqua, esse, pariatur))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c8d2984c9956b839d9, Sunt culpa commodo ullamco aliqua fugiat sit fugiat eu fugiat eiusmod amet esse. Consequat nostrud exercitation deserunt magna labore elit ipsum Lorem laboris cupidatat reprehenderit mollit magna velit. Duis esse cillum proident ad id excepteur. Dolore excepteur cillum duis commodo. Sint dolore voluptate in amet nostrud culpa qui consequat tempor incididunt eiusmod irure sit officia. Consequat cillum occaecat Lorem nulla nisi qui excepteur. , 894 Cambridge Place, Dexter, Utah, 8622, 25, $1,910.43, GEEKNET, burchconrad@geeknet.com, green, apple, List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin)), male, Hello, Burch Conrad! You have 7 unread messages., f4b512ca-e362-48a5-a5a5-1b21013bbd4c, 1, false, 33.526889, -69.444991, Burch Conrad, +1 (863) 523-2711, http://placehold.it/32x32, 2018-10-11T07:17:58 -02:00, List(in, non, adipisicing, veniam, irure, excepteur, laboris))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c806bb45d49234240b, Lorem aute eiusmod et Lorem quis ea. Duis velit anim minim aute magna eiusmod excepteur nulla ad velit nostrud. Amet laboris voluptate nostrud nulla do minim non mollit duis sunt officia sunt nisi reprehenderit. , 111 Beverly Road, Canby, Oklahoma, 4837, 24, $1,711.22, SKINSERVE, ellapacheco@skinserve.com, brown, apple, List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran)), female, Hello, Ella Pacheco! You have 6 unread messages., 6f752038-214b-4cd7-b6d0-14250292a1e3, 2, false, -18.438374, -137.932995, Ella Pacheco, +1 (915) 430-3434, http://placehold.it/32x32, 2015-08-13T05:28:29 -02:00, List(duis, occaecat, Lorem, aute, ut, nisi, enim))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c8858495221ea6ecda, Sit non veniam amet minim. Sint cillum excepteur et incididunt. Do reprehenderit ex elit ut excepteur nostrud. Commodo elit minim velit sint esse duis eu sint minim laborum mollit. Fugiat nulla consequat veniam nisi culpa sit do veniam. Sunt anim est culpa nisi anim mollit consequat. Qui dolor sint id minim irure cillum. , 372 Ainslie Street, Cornfields, Alabama, 6363, 23, $1,889.73, EBIDCO, estellarich@ebidco.com, brown, strawberry, List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman)), female, Hello, Estella Rich! You have 7 unread messages., 121b43bf-abc7-46bf-8ac3-5355fe1d8290, 3, true, 38.887249, -41.450973, Estella Rich, +1 (957) 478-3558, http://placehold.it/32x32, 2018-03-04T12:10:29 -02:00, List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c876311a36c1c9da33, Laborum eu deserunt commodo deserunt pariatur sit quis fugiat consequat cillum ex mollit dolore consequat. Officia irure cillum et do incididunt eiusmod laborum consequat laboris ex fugiat ullamco. Anim nisi velit adipisicing consectetur Lorem ut cupidatat. , 521 Stratford Road, Stewartville, Vermont, 6312, 33, $1,498.87, LOTRON, carpentersullivan@lotron.com, brown, apple, List(List(0, Doreen Hanson), List(1, Donna Jackson), List(2, Chandler Wiley)), male, Hello, Carpenter Sullivan! You have 8 unread messages., c3fd5dee-5ef5-40a0-9d9d-8fe596038009, 4, true, -73.782288, -67.572945, Carpenter Sullivan, +1 (967) 543-2046, http://placehold.it/32x32, 2014-12-22T05:22:31 -02:00, List(sint, excepteur, excepteur, ut, deserunt, ut, dolore))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c898fac31d1e0593ea, Mollit pariatur aliqua ullamco occaecat duis consectetur laboris Lorem est veniam enim quis laboris. Incididunt deserunt commodo sit reprehenderit. Ipsum aute occaecat mollit sit est do duis nulla id deserunt. Duis est dolore proident minim mollit veniam pariatur consequat ullamco magna. Sint fugiat quis do magna amet consequat fugiat Lorem est id id cupidatat veniam. Commodo commodo reprehenderit minim aute exercitation duis. , 130 Lake Street, Northridge, New Hampshire, 5051, 31, $3,781.32, INTERFIND, nadiablake@interfind.com, brown, strawberry, List(List(0, Melba Chen), List(1, Richmond Norman), List(2, Henson Graham)), female, Hello, Nadia Blake! You have 1 unread messages., b66dba8f-7156-4600-9742-7535777a0ca3, 5, true, 71.733634, -121.91096, Nadia Blake, +1 (965) 434-3202, http://placehold.it/32x32, 2014-06-15T06:31:58 -02:00, List(do, consequat, mollit, mollit, ex, velit, laborum))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c868725fca32d05127, Sit non proident Lorem laboris non id excepteur voluptate nulla fugiat excepteur eiusmod est. Sunt deserunt officia cillum incididunt dolore cillum ipsum eu. In est laboris occaecat id sint laborum dolore mollit qui anim elit adipisicing. Eu labore esse qui do dolor Lorem elit fugiat eu duis. Eiusmod in incididunt qui do labore consectetur irure ipsum aliqua incididunt ex tempor nulla adipisicing. Sunt esse commodo proident est id Lorem ea. Commodo mollit est officia esse nulla nisi tempor ullamco aliqua. , 116 Stewart Street, Longoria, Alaska, 3997, 20, $2,387.61, GENMY, hughesslater@genmy.com, green, banana, List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay)), male, Hello, Hughes Slater! You have 1 unread messages., 8a81ca3a-0813-4902-820d-d58e9703ec04, 0, true, -47.586921, 22.966975, Hughes Slater, +1 (883) 494-2174, http://placehold.it/32x32, 2022-02-03T10:33:31 -02:00, List(non, ex, sint, culpa, aliqua, esse, pariatur))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c8d2984c9956b839d9, Sunt culpa commodo ullamco aliqua fugiat sit fugiat eu fugiat eiusmod amet esse. Consequat nostrud exercitation deserunt magna labore elit ipsum Lorem laboris cupidatat reprehenderit mollit magna velit. Duis esse cillum proident ad id excepteur. Dolore excepteur cillum duis commodo. Sint dolore voluptate in amet nostrud culpa qui consequat tempor incididunt eiusmod irure sit officia. Consequat cillum occaecat Lorem nulla nisi qui excepteur. , 894 Cambridge Place, Dexter, Utah, 8622, 25, $1,910.43, GEEKNET, burchconrad@geeknet.com, green, apple, List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin)), male, Hello, Burch Conrad! You have 7 unread messages., f4b512ca-e362-48a5-a5a5-1b21013bbd4c, 1, false, 33.526889, -69.444991, Burch Conrad, +1 (863) 523-2711, http://placehold.it/32x32, 2018-10-11T07:17:58 -02:00, List(in, non, adipisicing, veniam, irure, excepteur, laboris))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c806bb45d49234240b, Lorem aute eiusmod et Lorem quis ea. Duis velit anim minim aute magna eiusmod excepteur nulla ad velit nostrud. Amet laboris voluptate nostrud nulla do minim non mollit duis sunt officia sunt nisi reprehenderit. , 111 Beverly Road, Canby, Oklahoma, 4837, 24, $1,711.22, SKINSERVE, ellapacheco@skinserve.com, brown, apple, List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran)), female, Hello, Ella Pacheco! You have 6 unread messages., 6f752038-214b-4cd7-b6d0-14250292a1e3, 2, false, -18.438374, -137.932995, Ella Pacheco, +1 (915) 430-3434, http://placehold.it/32x32, 2015-08-13T05:28:29 -02:00, List(duis, occaecat, Lorem, aute, ut, nisi, enim))"
"List(List(5, 4, 1, 6, 3, 2), 622b43c8858495221ea6ecda, Sit non veniam amet minim. Sint cillum excepteur et incididunt. Do reprehenderit ex elit ut excepteur nostrud. Commodo elit minim velit sint esse duis eu sint minim laborum mollit. Fugiat nulla consequat veniam nisi culpa sit do veniam. Sunt anim est culpa nisi anim mollit consequat. Qui dolor sint id minim irure cillum. , 372 Ainslie Street, Cornfields, Alabama, 6363, 23, $1,889.73, EBIDCO, estellarich@ebidco.com, brown, strawberry, List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman)), female, Hello, Estella Rich! You have 7 unread messages., 121b43bf-abc7-46bf-8ac3-5355fe1d8290, 3, true, 38.887249, -41.450973, Estella Rich, +1 (957) 478-3558, http://placehold.it/32x32, 2018-03-04T12:10:29 -02:00, List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud))"


# Extract single elements out of the JSON Struct type

## Dot notation
  - Syntax: `OriginalColumnWithJson.NestedAttributeName`
  - Arrays are still returned as arrays when the array field is selected.

In [0]:
ParsedJSON_Flattened_DotNotation = ParsedJSONDF.select(
    "JSONColumn._id"
    , "JSONColumn.NestedAttributesTest.NestedAttributeOne"
    , "JSONColumn.NestedAttributesTest.NestedAttributeTwo"
    , "JSONColumn.NestedAttributesTest.NestedAttributeThree"
    , "JSONColumn.NestedAttributesTest.NestedAttributeFour"
    , "JSONColumn.NestedAttributesTest.NestedAttributeFive"
    , "JSONColumn.age"
    , "JSONColumn.tags"
    , "JSONColumn.friends"
    , )

display(ParsedJSON_Flattened_DotNotation)

_id,NestedAttributeOne,NestedAttributeTwo,NestedAttributeThree,NestedAttributeFour,NestedAttributeFive,age,tags,friends
622b43c868725fca32d05127,1,2,3,4,5,20,"List(non, ex, sint, culpa, aliqua, esse, pariatur)","List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay))"
622b43c8d2984c9956b839d9,1,2,3,4,5,25,"List(in, non, adipisicing, veniam, irure, excepteur, laboris)","List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin))"
622b43c806bb45d49234240b,1,2,3,4,5,24,"List(duis, occaecat, Lorem, aute, ut, nisi, enim)","List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran))"
622b43c8858495221ea6ecda,1,2,3,4,5,23,"List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud)","List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman))"
622b43c876311a36c1c9da33,1,2,3,4,5,33,"List(sint, excepteur, excepteur, ut, deserunt, ut, dolore)","List(List(0, Doreen Hanson), List(1, Donna Jackson), List(2, Chandler Wiley))"
622b43c898fac31d1e0593ea,1,2,3,4,5,31,"List(do, consequat, mollit, mollit, ex, velit, laborum)","List(List(0, Melba Chen), List(1, Richmond Norman), List(2, Henson Graham))"
622b43c868725fca32d05127,1,2,3,4,5,20,"List(non, ex, sint, culpa, aliqua, esse, pariatur)","List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay))"
622b43c8d2984c9956b839d9,1,2,3,4,5,25,"List(in, non, adipisicing, veniam, irure, excepteur, laboris)","List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin))"
622b43c806bb45d49234240b,1,2,3,4,5,24,"List(duis, occaecat, Lorem, aute, ut, nisi, enim)","List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran))"
622b43c8858495221ea6ecda,1,2,3,4,5,23,"List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud)","List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman))"


You can also use the `selectExpr` function to assign an alias to each extracted field.

Escape any alias that contains a dot by wrapping it in backticks.

In [0]:
ParsedJSON_Flattened_SelectExprDotNotation = ParsedJSONDF.selectExpr(
    "JSONColumn._id AS ID"
    , "JSONColumn.NestedAttributesTest.NestedAttributeOne AS `NestedAttributesTest.NestedAttributeOne`"
    , "JSONColumn.NestedAttributesTest.NestedAttributeTwo AS `NestedAttributesTest.NestedAttributeTwo`"
    , "JSONColumn.NestedAttributesTest.NestedAttributeThree AS `NestedAttributesTest.NestedAttributeThree`"
    , "JSONColumn.NestedAttributesTest.NestedAttributeFour AS `NestedAttributesTest.NestedAttributeFour`"
    , "JSONColumn.NestedAttributesTest.NestedAttributeFive AS `NestedAttributesTest.NestedAttributeFive`"
    , "JSONColumn.age AS Age"
    , "JSONColumn.tags AS Tags_Array"
    , "JSONColumn.friends AS Friends_Array"
    , )

display(ParsedJSON_Flattened_SelectExprDotNotation)

ID,NestedAttributesTest.NestedAttributeOne,NestedAttributesTest.NestedAttributeTwo,NestedAttributesTest.NestedAttributeThree,NestedAttributesTest.NestedAttributeFour,NestedAttributesTest.NestedAttributeFive,Age,Tags_Array,Friends_Array
622b43c868725fca32d05127,1,2,3,4,5,20,"List(non, ex, sint, culpa, aliqua, esse, pariatur)","List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay))"
622b43c8d2984c9956b839d9,1,2,3,4,5,25,"List(in, non, adipisicing, veniam, irure, excepteur, laboris)","List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin))"
622b43c806bb45d49234240b,1,2,3,4,5,24,"List(duis, occaecat, Lorem, aute, ut, nisi, enim)","List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran))"
622b43c8858495221ea6ecda,1,2,3,4,5,23,"List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud)","List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman))"
622b43c876311a36c1c9da33,1,2,3,4,5,33,"List(sint, excepteur, excepteur, ut, deserunt, ut, dolore)","List(List(0, Doreen Hanson), List(1, Donna Jackson), List(2, Chandler Wiley))"
622b43c898fac31d1e0593ea,1,2,3,4,5,31,"List(do, consequat, mollit, mollit, ex, velit, laborum)","List(List(0, Melba Chen), List(1, Richmond Norman), List(2, Henson Graham))"
622b43c868725fca32d05127,1,2,3,4,5,20,"List(non, ex, sint, culpa, aliqua, esse, pariatur)","List(List(0, Katie Carey), List(1, Roth Gould), List(2, Trisha Lindsay))"
622b43c8d2984c9956b839d9,1,2,3,4,5,25,"List(in, non, adipisicing, veniam, irure, excepteur, laboris)","List(List(0, Charles Newman), List(1, Maude Robinson), List(2, Taylor Goodwin))"
622b43c806bb45d49234240b,1,2,3,4,5,24,"List(duis, occaecat, Lorem, aute, ut, nisi, enim)","List(List(0, Montoya Becker), List(1, Reilly Phelps), List(2, Reeves Tran))"
622b43c8858495221ea6ecda,1,2,3,4,5,23,"List(cillum, occaecat, et, pariatur, duis, adipisicing, nostrud)","List(List(0, Walker William), List(1, Kathrine Alexander), List(2, Faye Hoffman))"


## JSON Functions
- `pyspark.sql.functions.get_json_object`
  - Extracts a JSON object from a JSON string based on the JSON path specified, and returns a JSON string of the extracted object. It returns null if the input JSON string is invalid.
  - NOTE: this needs the column with JSON data to be of string type. It can't work on Struct types.
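
The null-on-failure behaviour described above can be modelled in plain Python. This is a simplified sketch using only the standard `json` module, not Spark. The hypothetical helper `get_json_field` handles only a single top-level key (the equivalent of a `$.key` path), and it assumes that non-string values are re-serialised to JSON text.

```python
import json

def get_json_field(json_string: str, field: str):
    """Simplified stand-in for a '$.field' lookup on a JSON string."""
    try:
        obj = json.loads(json_string)
    except ValueError:
        # Invalid JSON text: return None, mirroring the null described above.
        return None
    if not isinstance(obj, dict) or field not in obj:
        # No match for the requested key.
        return None
    value = obj[field]
    # Strings come back as-is; anything else is re-serialised to JSON text.
    return value if isinstance(value, str) else json.dumps(value)

print(get_json_field('{"f1": "value1", "f2": "value2"}', "f1"))  # value1
print(get_json_field('{"f1": "value12"}', "f2"))                 # None
print(get_json_field("not json", "f1"))                          # None
```

The input is always a string here, which matches the note above: the lookup parses JSON text, so already-typed data has to be queried another way, such as dot notation.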

In [0]:
from pyspark.sql.functions import get_json_object

data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]

display(data)

df = spark.createDataFrame(data, ("key", "jstring"))

display(df)

df = df.select(
    df.key,
    get_json_object(df.jstring, '$.f1').alias("c0"),
    get_json_object(df.jstring, '$.f2').alias("c1"),
).collect()

display(df)


_1,_2
1,"{""f1"": ""value1"", ""f2"": ""value2""}"
2,"{""f1"": ""value12""}"


key,jstring
1,"{""f1"": ""value1"", ""f2"": ""value2""}"
2,"{""f1"": ""value12""}"


key,c0,c1
1,value1,value2
2,value12,


# Nested Arrays

To expand a nested array, apply the explode_outer function again to the nested array once it has been extracted.
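
As a rough plain-Python picture of the same two steps (explode the nested array, then pick fields from each element), consider the sketch below. It is not Spark code: `flatten_friends` is a hypothetical helper, the `_id` values are made up, and the output keys simply mirror the aliases used in this notebook.

```python
def flatten_friends(records):
    """Emit one row per friend; a record with no friends keeps one row of Nones."""
    rows = []
    for record in records:
        # Step 1: explode the nested array (outer semantics keep empty arrays).
        friends = record.get("friends") or [None]
        for friend in friends:
            # Step 2: extract the fields of interest from each array element.
            rows.append({
                "ID": record["_id"],
                "Friends_Array.id": friend["id"] if friend else None,
                "Friends_Array.name": friend["name"] if friend else None,
            })
    return rows

# Hypothetical records shaped like the exploded JSONColumn structs.
records = [
    {"_id": "a1", "friends": [{"id": 0, "name": "Katie Carey"},
                              {"id": 1, "name": "Roth Gould"}]},
    {"_id": "b2", "friends": []},
]

flat_rows = flatten_friends(records)
for row in flat_rows:
    print(row)
```

The parent `ID` is repeated on every friend row, which keeps each exploded row traceable back to its source record.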

In [0]:
#Steps
#Select the _id field and the friends nested array
#Use the explode_outer function to explode the array out into separate rows. At this point, there is still a JSON object in each record. We need to extract the elements.
#Use the selectExpr function to choose the fields from the array to return

ParsedJSON_Flattened_NestedArrayExtract = ParsedJSONDF.selectExpr("JSONColumn._id AS ID", "JSONColumn.friends AS Friends_Array")\
.withColumn("Friends_Array", explode_outer("Friends_Array"))\
.selectExpr("ID", "Friends_Array.id AS `Friends_Array.id`", "Friends_Array.name AS `Friends_Array.name`")

display(ParsedJSON_Flattened_NestedArrayExtract)

ID,Friends_Array.id,Friends_Array.name
622b43c868725fca32d05127,0,Katie Carey
622b43c868725fca32d05127,1,Roth Gould
622b43c868725fca32d05127,2,Trisha Lindsay
622b43c8d2984c9956b839d9,0,Charles Newman
622b43c8d2984c9956b839d9,1,Maude Robinson
622b43c8d2984c9956b839d9,2,Taylor Goodwin
622b43c806bb45d49234240b,0,Montoya Becker
622b43c806bb45d49234240b,1,Reilly Phelps
622b43c806bb45d49234240b,2,Reeves Tran
622b43c8858495221ea6ecda,0,Walker William


# Auto expand nested structs using the column.* notation

This does not work on arrays.

In [0]:
ParsedJSON_Flattened_4 = ParsedJSONDF.selectExpr("JSONColumn._id AS ID", "JSONColumn.NestedAttributesTest.*")


display(ParsedJSON_Flattened_4)

ID,NestedAttributeFive,NestedAttributeFour,NestedAttributeOne,NestedAttributeSix,NestedAttributeThree,NestedAttributeTwo
622b43c868725fca32d05127,5,4,1,6,3,2
622b43c8d2984c9956b839d9,5,4,1,6,3,2
622b43c806bb45d49234240b,5,4,1,6,3,2
622b43c8858495221ea6ecda,5,4,1,6,3,2
622b43c876311a36c1c9da33,5,4,1,6,3,2
622b43c898fac31d1e0593ea,5,4,1,6,3,2
622b43c868725fca32d05127,5,4,1,6,3,2
622b43c8d2984c9956b839d9,5,4,1,6,3,2
622b43c806bb45d49234240b,5,4,1,6,3,2
622b43c8858495221ea6ecda,5,4,1,6,3,2
