#PySpark JSON Functions with Examples

---


**PySpark JSON functions are used to query or extract elements from a JSON string in a DataFrame column by path, or to convert it to a struct type, map type, etc. In this article, I will explain the most commonly used JSON SQL functions with Python examples.**


---

##1. PySpark JSON Functions

- from_json() – Converts a JSON string into a Struct type or Map type.

- to_json() – Converts a MapType or Struct type to a JSON string.

- json_tuple() – Extracts data from JSON and returns it as new columns.

- get_json_object() – Extracts a JSON element from a JSON string based on the specified JSON path.

- schema_of_json() – Creates a schema string from a JSON string.


---


###1.1. Create a DataFrame with a Column Containing a JSON String


**To explain these JSON functions, let’s first create a DataFrame with a column that contains a JSON string.**

In [0]:

jsonString = """{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""
df = spark.createDataFrame(data=[(1, jsonString)],schema=["id", "value"])
df.show(truncate=False)

+---+--------------------------------------------------------------------------+
|id |value                                                                     |
+---+--------------------------------------------------------------------------+
|1  |{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+--------------------------------------------------------------------------+



##2. PySpark JSON Functions Examples


###2.1. from_json()


**The PySpark from_json() function converts a JSON string into a Struct type or Map type. The first example below converts the JSON string to a map of key-value pairs; the conversion to a struct type, which needs an explicit schema, follows it.**

In [0]:
from pyspark.sql.functions import from_json
from pyspark.sql.types import StringType, MapType, StructType, StructField

In [0]:
df2 = df.withColumn("value", from_json(df.value, MapType(StringType(), StringType())))
df2.printSchema()
df2.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+---+---------------------------------------------------------------------------+
|id |value                                                                      |
+---+---------------------------------------------------------------------------+
|1  |{Zipcode -> 704, ZipCodeType -> STANDARD, City -> PARC PARQUE, State -> PR}|
+---+---------------------------------------------------------------------------+



In [0]:
df3 = df2.rdd.map(lambda x: (x.id, x.value['Zipcode'], x.value['City'], x.value['State']))\
.toDF(['id', 'zipcode', 'city', 'state'])
df3.printSchema()
df3.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- zipcode: string (nullable = true)
 |-- city: string (nullable = true)
 |-- state: string (nullable = true)

+---+-------+-----------+-----+
|id |zipcode|city       |state|
+---+-------+-----------+-----+
|1  |704    |PARC PARQUE|PR   |
+---+-------+-----------+-----+



In [0]:
from pyspark.sql.functions import col, from_json, get_json_object, json_tuple, lit, schema_of_json, to_json

In [0]:
schema = StructType([
    StructField("Zipcode", StringType(), True),
    StructField("ZipcodeType", StringType(), True),
    StructField("City", StringType(), True),
    StructField("State", StringType(), True)
])

df4 = df.withColumn("Value", from_json(col("value"), schema))
df4.printSchema()
df4.show(truncate=False)

#convert to multiple columns
df5 = df4.select("id", "value.*")
df5.show(truncate=False)

root
 |-- id: long (nullable = true)
 |-- Value: struct (nullable = true)
 |    |-- Zipcode: string (nullable = true)
 |    |-- ZipcodeType: string (nullable = true)
 |    |-- City: string (nullable = true)
 |    |-- State: string (nullable = true)

+---+----------------------------+
|id |Value                       |
+---+----------------------------+
|1  |{704, null, PARC PARQUE, PR}|
+---+----------------------------+

+---+-------+-----------+-----------+-----+
|id |Zipcode|ZipcodeType|City       |State|
+---+-------+-----------+-----------+-----+
|1  |704    |null       |PARC PARQUE|PR   |
+---+-------+-----------+-----------+-----+



###2.2. to_json()


**The to_json() function converts a DataFrame column of MapType or Struct type to a JSON string. Here, I am using df2 and df4, created in the from_json() examples above.**

In [0]:
#MapType
df2.withColumn("Value", to_json(col("value")))\
.show(truncate=False)

+---+----------------------------------------------------------------------------+
|id |Value                                                                       |
+---+----------------------------------------------------------------------------+
|1  |{"Zipcode":"704","ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}|
+---+----------------------------------------------------------------------------+



In [0]:
#StructType
df4.withColumn("value", to_json(col("value")))\
.show(truncate=False)

+---+---------------------------------------------------+
|id |value                                              |
+---+---------------------------------------------------+
|1  |{"Zipcode":"704","City":"PARC PARQUE","State":"PR"}|
+---+---------------------------------------------------+

**The ZipcodeType key is absent from this output because that field is null in df4, and to_json() omits null fields by default.**



###2.3. json_tuple()


**The json_tuple() function is used to query or extract elements from a JSON column and return the result as new columns.**

In [0]:
df.select(col("id"), json_tuple(col("value"), "Zipcode", "ZipCodeType", "City", "State"))\
.toDF("id", "Zipcode", "ZipCodeType", "City", "State")\
.show(truncate=False)

+---+-------+-----------+-----------+-----+
|id |Zipcode|ZipCodeType|City       |State|
+---+-------+-----------+-----------+-----+
|1  |704    |STANDARD   |PARC PARQUE|PR   |
+---+-------+-----------+-----------+-----+



###2.4. get_json_object()

**get_json_object() is used to extract a JSON element from a JSON string column based on the specified JSON path.**

In [0]:
df.select(col("id"), get_json_object(col("value"), "$.ZipCodeType").alias("ZipCodeType"))\
.show(truncate=False)

+---+-----------+
|id |ZipCodeType|
+---+-----------+
|1  |STANDARD   |
+---+-----------+



###2.5. schema_of_json()

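**Use schema_of_json() to infer a schema, returned as a DDL-formatted string, from a JSON string. The input must be a JSON string literal or a foldable string column, which is why the examples wrap the JSON in lit().**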

In [0]:
schemaStr = spark.range(1)\
.select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}""")))\
.collect()[0][0]
print(schemaStr)


STRUCT<City: STRING, State: STRING, ZipCodeType: STRING, Zipcode: BIGINT>


In [0]:
# spark.range(1) is just a one-row DataFrame used as a carrier for the literal expression
rangeDf = spark.range(1)
rangeDf.printSchema()
rangeDf.show()

root
 |-- id: long (nullable = false)

+---+
| id|
+---+
|  0|
+---+



In [0]:
# show() returns None, so the result is not assigned to a variable
spark.range(1)\
.select(schema_of_json(lit("""{"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"}"""))).show(truncate=False)

+------------------------------------------------------------------------------------------+
|schema_of_json({"Zipcode":704,"ZipCodeType":"STANDARD","City":"PARC PARQUE","State":"PR"})|
+------------------------------------------------------------------------------------------+
|STRUCT<City: STRING, State: STRING, ZipCodeType: STRING, Zipcode: BIGINT>                 |
+------------------------------------------------------------------------------------------+



In [0]:
help(schema_of_json)

Help on function schema_of_json in module pyspark.sql.functions:

schema_of_json(json: 'ColumnOrName', options: Optional[Dict[str, str]] = None) -> pyspark.sql.column.Column
    Parses a JSON string and infers its schema in DDL format.
    
    .. versionadded:: 2.4.0
    
    .. versionchanged:: 3.4.0
        Support Spark Connect.
    
    Parameters
    ----------
    json : :class:`~pyspark.sql.Column` or str
        a JSON string or a foldable string column containing a JSON string.
    options : dict, optional
        options to control parsing. accepts the same options as the JSON datasource.
        See `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_
        in the version you use.
    
        .. # noqa
    
        .. versionchanged:: 3.0.0
           It accepts `options` parameter to control schema inferring.
    
    Examples
    --------
    >>> df = spark.range(1)
    >>> df.select(schema_of_json(lit('{"a": 0}')).