#### parse_json
- For **newer versions** of **Spark (4.0 and above)**, the **pyspark.sql.functions.parse_json** function can be used to parse a **JSON string** into a **VariantType**.

- Parses a column containing a **JSON string** into a **VariantType** (including **objects, arrays, strings, numbers, and null values**).

- **parse_json** raises an **error** if the **JSON string is malformed / not valid**. To get **NULL instead of an error**, use the **try_parse_json** function.
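
The difference between the two functions can be sketched as follows. This is a minimal illustration, assuming PySpark 4.0+ and an existing `SparkSession`; the helper name `parse_with_fallback`, the sample rows, and the column names are hypothetical and not part of the original notebook.

```python
# Hypothetical sample rows: two valid JSON strings and one malformed
# string (missing closing brace).
SAMPLE_ROWS = [
    ('{"a": 1}',),
    ('[1, 2, 3]',),
    ('{"a": 1',),
]


def parse_with_fallback(spark, rows=SAMPLE_ROWS):
    """Parse JSON strings into a variant column, giving NULL for malformed input."""
    # Imported inside the function so the snippet can be loaded
    # even where PySpark is not installed.
    from pyspark.sql.functions import col, try_parse_json

    df = spark.createDataFrame(rows, ["json_string"])
    # try_parse_json yields NULL for the malformed row; parse_json on the
    # same column would raise an error when the query is executed.
    return df.withColumn("json_variant", try_parse_json(col("json_string")))
```

In a notebook, something like `parse_with_fallback(spark).filter("json_variant IS NULL")` could then be used to list the rows that failed to parse.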

#### Syntax

     pyspark.sql.functions.parse_json(col)

**Parameters:**

- **col** → Column (or column name) containing **JSON-formatted strings**.

**Returns:**
- A new column of **VariantType**.

**Note:** Unlike **from_json**, **parse_json**:
- **does not take a schema** or an **options** dictionary
- takes only a **single argument** (the JSON column). Passing a second positional argument fails with:

      TypeError: parse_json() takes 1 positional argument but 2 were given

**✅ Correct usage:**

- **parse_json(col)** → Parses a **JSON string** column into a **VariantType** column (no schema is needed; each value carries its own type).
- **from_json(col, schema)** → Parses a **JSON string** column into a **user-defined schema**.
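
A side-by-side sketch of the two calls, assuming PySpark 4.0+ and an existing `SparkSession`. The helper name `parse_both_ways`, the DDL schema string, and the sample JSON are hypothetical and chosen only for illustration.

```python
# Hypothetical DDL schema used by from_json; parse_json needs no schema.
PERSON_SCHEMA = "name STRING, age INT"


def parse_both_ways(spark, json_text='{"name": "Alice", "age": 28}'):
    """Return a one-row DataFrame with the same JSON parsed as variant and as struct."""
    # Imported inside the function so the snippet can be loaded
    # even where PySpark is not installed.
    from pyspark.sql.functions import col, from_json, parse_json

    df = spark.createDataFrame([(json_text,)], ["json_string"])
    return df.select(
        # Single argument: the result column is a VariantType.
        parse_json(col("json_string")).alias("as_variant"),
        # Schema supplied by the caller: the result column is a struct.
        from_json(col("json_string"), PERSON_SCHEMA).alias("as_struct"),
    )
```

Calling `parse_both_ways(spark).printSchema()` should show the two result types next to each other, which makes the trade-off visible: the variant keeps whatever the JSON contains, while the struct keeps only the fields named in the schema.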

In [0]:
# col and expr are used by the cells further down
from pyspark.sql.functions import parse_json, col, expr

In [0]:
help(parse_json)

Help on function parse_json in module pyspark.sql.functions.builtin:

parse_json(col: 'ColumnOrName') -> pyspark.sql.column.Column
    Parses a column containing a JSON string into a :class:`VariantType`. Throws exception if a
    string represents an invalid JSON value.

    .. versionadded:: 4.0.0

    Parameters
    ----------
    col : :class:`~pyspark.sql.Column` or str
        a column or column name JSON formatted strings

    Returns
    -------
    :class:`~pyspark.sql.Column`
        a new column of VariantType.

    Examples
    --------
    >>> df = spark.createDataFrame([ {'json': '''{ "a" : 1 }'''} ])
    >>> df.select(to_json(parse_json(df.json))).collect() # doctest: +SKIP
    [Row(to_json(parse_json(json))='{"a":1}')]



##### 1) Nested JSON
**a) Single nested JSON string**

In [0]:
nested_json_string = '''{"name": "Alice", "age": 28,
 "address": {"street": "123 Main St", "city": "San Francisco", "zip": "94105"}}'''

df_string = spark.createDataFrame([(nested_json_string,)], ["json_string"])
display(df_string)

json_string
"{""name"": ""Alice"", ""age"": 28,  ""address"": {""street"": ""123 Main St"", ""city"": ""San Francisco"", ""zip"": ""94105""}}"


In [0]:
# Parse the JSON string into a VARIANT column
df_variant = df_string.withColumn("json_variant", parse_json(col("json_string")))
display(df_variant)

json_string,json_variant
"{""name"": ""Alice"", ""age"": 28,  ""address"": {""street"": ""123 Main St"", ""city"": ""San Francisco"", ""zip"": ""94105""}}","{""address"":{""city"":""San Francisco"",""street"":""123 Main St"",""zip"":""94105""},""age"":28,""name"":""Alice""}"


In [0]:
# Extract multiple fields
df_variant.select(
  expr("json_variant:age::int").alias("Age"),
  expr("json_variant:name::string").alias("Name"),
  expr("json_variant:address:city::string").alias("City"),
  expr("json_variant:address:street::string").alias("Street"),
  expr("json_variant:address:zip::int").alias("Zip")
).display()

Age,Name,City,Street,Zip
28,Alice,San Francisco,123 Main St,94105
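
In the `expr()` strings above, `:` walks into the variant by field name and `::` casts the extracted value to a SQL type. Where that path syntax is not available, the same extraction can be sketched with `variant_get` (added in PySpark 4.0). The helper name `extract_fields` and the `FIELDS` mapping are hypothetical and only mirror the aliases used above.

```python
# Output alias -> (JSON path, target type); mirrors the expr() calls above.
FIELDS = {
    "Age": ("$.age", "int"),
    "Name": ("$.name", "string"),
    "City": ("$.address.city", "string"),
    "Street": ("$.address.street", "string"),
    "Zip": ("$.address.zip", "int"),
}


def extract_fields(df, variant_col="json_variant", fields=FIELDS):
    """Select typed columns out of a variant column using variant_get."""
    # Imported inside the function so the snippet can be loaded
    # even where PySpark is not installed.
    from pyspark.sql.functions import col, variant_get

    return df.select(
        *[
            variant_get(col(variant_col), path, dtype).alias(alias)
            for alias, (path, dtype) in fields.items()
        ]
    )
```

For example, `extract_fields(df_variant)` is intended to produce the same five columns as the cell above. `try_variant_get` takes the same arguments and returns NULL instead of failing when a value cannot be cast to the target type.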


**b) Multiple nested JSON strings**

In [0]:
# Sample nested JSON data
data = [
    ('''{"name": "Alice", "age": 28,
         "address": {"street": "123 Main St", "city": "San Francisco", "zip": "94105"}}''',),

    ('''{"name": "Bob", "age": 32,
         "address": {"street": "456 Park Ave", "city": "New York", "zip": "10001"}}''',),

    ('''{"name": "Charlie", "age": 40,
         "address": {"street": "789 MG Road", "city": "Bangalore", "zip": "560001"}}''',),

    ('''{"name": "David", "age": 29,
         "address": {"street": "12 Residency Rd", "city": "Chennai", "zip": "600001"}}''',),

    ('''{"name": "Eva", "age": 35,
         "address": {"street": "90 Banjara Hills", "city": "Hyderabad", "zip": "500034"}}''',)
]

# Create DataFrame with JSON strings
df_json = spark.createDataFrame(data, ["json_string"])
display(df_json)

json_string
"{""name"": ""Alice"", ""age"": 28,  ""address"": {""street"": ""123 Main St"", ""city"": ""San Francisco"", ""zip"": ""94105""}}"
"{""name"": ""Bob"", ""age"": 32,  ""address"": {""street"": ""456 Park Ave"", ""city"": ""New York"", ""zip"": ""10001""}}"
"{""name"": ""Charlie"", ""age"": 40,  ""address"": {""street"": ""789 MG Road"", ""city"": ""Bangalore"", ""zip"": ""560001""}}"
"{""name"": ""David"", ""age"": 29,  ""address"": {""street"": ""12 Residency Rd"", ""city"": ""Chennai"", ""zip"": ""600001""}}"
"{""name"": ""Eva"", ""age"": 35,  ""address"": {""street"": ""90 Banjara Hills"", ""city"": ""Hyderabad"", ""zip"": ""500034""}}"


In [0]:
# Parse JSON string into a VARIANT column (not a struct)
df_parsed_01 = df_json.withColumn("json_data", parse_json(col("json_string")))
display(df_parsed_01)

json_string,json_data
"{""name"": ""Alice"", ""age"": 28,  ""address"": {""street"": ""123 Main St"", ""city"": ""San Francisco"", ""zip"": ""94105""}}","{""address"":{""city"":""San Francisco"",""street"":""123 Main St"",""zip"":""94105""},""age"":28,""name"":""Alice""}"
"{""name"": ""Bob"", ""age"": 32,  ""address"": {""street"": ""456 Park Ave"", ""city"": ""New York"", ""zip"": ""10001""}}","{""address"":{""city"":""New York"",""street"":""456 Park Ave"",""zip"":""10001""},""age"":32,""name"":""Bob""}"
"{""name"": ""Charlie"", ""age"": 40,  ""address"": {""street"": ""789 MG Road"", ""city"": ""Bangalore"", ""zip"": ""560001""}}","{""address"":{""city"":""Bangalore"",""street"":""789 MG Road"",""zip"":""560001""},""age"":40,""name"":""Charlie""}"
"{""name"": ""David"", ""age"": 29,  ""address"": {""street"": ""12 Residency Rd"", ""city"": ""Chennai"", ""zip"": ""600001""}}","{""address"":{""city"":""Chennai"",""street"":""12 Residency Rd"",""zip"":""600001""},""age"":29,""name"":""David""}"
"{""name"": ""Eva"", ""age"": 35,  ""address"": {""street"": ""90 Banjara Hills"", ""city"": ""Hyderabad"", ""zip"": ""500034""}}","{""address"":{""city"":""Hyderabad"",""street"":""90 Banjara Hills"",""zip"":""500034""},""age"":35,""name"":""Eva""}"


In [0]:
# Extract multiple fields
df_parsed_01.select(
  expr("json_data:age::int").alias("Age"),
  expr("json_data:name::string").alias("Name"),
  expr("json_data:address:city::string").alias("City"),
  expr("json_data:address:street::string").alias("Street"),
  expr("json_data:address:zip::int").alias("Zip")
).display()

Age,Name,City,Street,Zip
28,Alice,San Francisco,123 Main St,94105
32,Bob,New York,456 Park Ave,10001
40,Charlie,Bangalore,789 MG Road,560001
29,David,Chennai,12 Residency Rd,600001
35,Eva,Hyderabad,90 Banjara Hills,500034


##### 2) Simple JSON with key-value pairs

In [0]:
data = [
    ('{"name": "Albert", "age": 30, "city": "Bangalore"}',),
    ('{"name": "Bobby", "age": 25, "city": "Chennai"}',),
    ('{"name": "Swapna", "age": 35, "city": "Hyderabad"}',),
    ('{"name": "David", "age": 28, "city": "Cochin"}',),
    ('{"name": "Anand", "age": 33, "city": "Baroda"}',),
    ('{"name": "Baskar", "age": 29, "city": "Nasik"}',),
]

df_kv = spark.createDataFrame(data, ["json_string"])

# Convert JSON string into a VARIANT column (not a struct)
df_parsed = df_kv.withColumn("json_data", parse_json(col("json_string")))
display(df_parsed)

json_string,json_data
"{""name"": ""Albert"", ""age"": 30, ""city"": ""Bangalore""}","{""age"":30,""city"":""Bangalore"",""name"":""Albert""}"
"{""name"": ""Bobby"", ""age"": 25, ""city"": ""Chennai""}","{""age"":25,""city"":""Chennai"",""name"":""Bobby""}"
"{""name"": ""Swapna"", ""age"": 35, ""city"": ""Hyderabad""}","{""age"":35,""city"":""Hyderabad"",""name"":""Swapna""}"
"{""name"": ""David"", ""age"": 28, ""city"": ""Cochin""}","{""age"":28,""city"":""Cochin"",""name"":""David""}"
"{""name"": ""Anand"", ""age"": 33, ""city"": ""Baroda""}","{""age"":33,""city"":""Baroda"",""name"":""Anand""}"
"{""name"": ""Baskar"", ""age"": 29, ""city"": ""Nasik""}","{""age"":29,""city"":""Nasik"",""name"":""Baskar""}"


In [0]:
# Extract multiple fields
df_result_expr = df_parsed.select(
    expr("json_data:age::int").alias("Age"),
    expr("json_data:name::string").alias("Name"),
    expr("json_data:city::string").alias("City")
)

display(df_result_expr)

Age,Name,City
30,Albert,Bangalore
25,Bobby,Chennai
35,Swapna,Hyderabad
28,David,Cochin
33,Anand,Baroda
29,Baskar,Nasik


#### SQL

In [0]:
%sql
-- Convert a simple JSON object to VARIANT
SELECT parse_json('{"key": 123, "data": [4, 5, "str"]}') AS variant_data;

variant_data
"{""data"":[4,5,""str""],""key"":123}"


In [0]:
%sql
-- Convert a JSON array to VARIANT
SELECT parse_json('[1, 2, 3, {"nested_key": "nested_value"}]') AS variant_array;

variant_array
"[1,2,3,{""nested_key"":""nested_value""}]"


In [0]:
%sql
-- Convert a simple scalar value to VARIANT
SELECT parse_json('"A simple string"') AS variant_string;

variant_string
"""A simple string"""


In [0]:
%sql
-- Convert a numeric value to VARIANT
SELECT parse_json('12345') AS variant_number;

variant_number
12345


In [0]:
%sql
-- Handle NULL input: parse_json(NULL) returns NULL
SELECT parse_json(null) AS variant_null;

variant_null
""
