##### get_json_object

- Extracts a specific **JSON object or value** from a **JSON string** column in a DataFrame, using a JSONPath-like expression.
- Works directly on **JSON strings** (no need to **parse** into a **struct** first).
- Useful when you need **only a few fields** from **raw JSON** and do not want to define a **schema**.

#### Syntax

     get_json_object(col("json_column"), "$.path")

     $ → root
     dot . → for nested objects
     [index] → for arrays

| JSON string | `$.name` (top-level field) | `$.details.city` (nested field) | `$.hobbies[0]` (array element) |
|-------------|----------------------------|---------------------------------|--------------------------------|
| `{"id": 1, "name": "Albert", "details": {"age": 30, "city": "Delhi"}, "hobbies": ["reading", "sports", "music"]}` | Albert | Delhi | reading |

- **$.field** → Top-level (root) field
- **$.field.subfield** → Nested field
- **$.array[0]** → First element in array

**Return Value:**
- The function returns a **Column** of type **string** (you must **cast** it if you need **int, double,** etc.).
- If the path points to a **nested object or array**, the result is the **JSON text** of that object or array.
- If the **JSON string** is **invalid** or the **JSONPath does not match** any element, it returns **null**.

In [0]:
from pyspark.sql.functions import col, get_json_object, json_tuple, expr

In [0]:
help(get_json_object)

Help on function get_json_object in module pyspark.sql.functions.builtin:

get_json_object(col: 'ColumnOrName', path: str) -> pyspark.sql.column.Column
    Extracts json object from a json string based on json `path` specified, and returns json string
    of the extracted json object. It will return null if the input json string is invalid.

    .. versionadded:: 1.6.0

    .. versionchanged:: 3.4.0
        Supports Spark Connect.

    Parameters
    ----------
    col : :class:`~pyspark.sql.Column` or str
        string column in json format
    path : str
        path to the json object to extract

    Returns
    -------
    :class:`~pyspark.sql.Column`
        string representation of given JSON object value.

    Examples
    --------
    Example 1: Extract a json object from json string

    >>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
    >>> df = spark.createDataFrame(data, ("key", "jstring"))
    >>> df.select(df.key,
    ...     g

##### 1) Simple JSON

In [0]:
data = [
    ('{"name": "Anand", "age": 28, "city": "Bangalore"}',),
    ('{"name": "Bibin", "age": 32, "city": "Chennai"}',),
    ('{"name": "Chandan", "age": 35, "city": "Hyderabad"}',),
    ('{"name": "Dora", "age": 37, "city": "Vadodara"}',),
    ('{"name": "Eswar", "age": 39, "city": "Pune"}',),
    ('{"name": "Sanjay", "age": 29, "city": "Cochin"}',)
]

df_sjson = spark.createDataFrame(data, ["json_string"])

df_simple_json = df_sjson.select(
    get_json_object(col("json_string"), "$.name").alias("Name"),
    get_json_object(col("json_string"), "$.age").alias("Age"),
    get_json_object(col("json_string"), "$.city").alias("City")
)

df_simple_json.display()

Name,Age,City
Anand,28,Bangalore
Bibin,32,Chennai
Chandan,35,Hyderabad
Dora,37,Vadodara
Eswar,39,Pune
Sanjay,29,Cochin


##### 2) Nested JSON

In [0]:
data = [
    (1, '{"id": 1, "name": "Albert", "details": {"age": 30, "city": "Delhi"}}'),
    (2, '{"id": 2, "name": "Bobby", "details": {"age": 25, "city": "Baroda"}}'),
    (3, '{"id": 3, "name": "Swapna", "details": {"age": 35}}'),
    (4, '{"id": 4, "name": "David", "details": {"age": 25, "city": "Chennai"}}'),
    (5, '{"id": 5, "name": "Anand", "details": {"age": 25, "city": "Bangalore"}}'),
    (6, '{"id": 6, "name": "Baskar", "details": {"age": 25, "city": "Hyderabad"}}'),
]

df_kv = spark.createDataFrame(data, ["id", "json_data"])
display(df_kv)

id,json_data
1,"{""id"": 1, ""name"": ""Albert"", ""details"": {""age"": 30, ""city"": ""Delhi""}}"
2,"{""id"": 2, ""name"": ""Bobby"", ""details"": {""age"": 25, ""city"": ""Baroda""}}"
3,"{""id"": 3, ""name"": ""Swapna"", ""details"": {""age"": 35}}"
4,"{""id"": 4, ""name"": ""David"", ""details"": {""age"": 25, ""city"": ""Chennai""}}"
5,"{""id"": 5, ""name"": ""Anand"", ""details"": {""age"": 25, ""city"": ""Bangalore""}}"
6,"{""id"": 6, ""name"": ""Baskar"", ""details"": {""age"": 25, ""city"": ""Hyderabad""}}"


- **$** represents the **root of the JSON**; each `.field` after it steps one level deeper (for example, `$.details.city`).

In [0]:
df_kv.select(
    get_json_object(col("json_data"), "$.id").alias("Id"),
    get_json_object(col("json_data"), "$.name").alias("Name"),
    get_json_object(col("json_data"), "$.details.age").alias("Age"),
    get_json_object(col("json_data"), "$.details.city").alias("City")
).display()

Id,Name,Age,City
1,Albert,30,Delhi
2,Bobby,25,Baroda
3,Swapna,35,
4,David,25,Chennai
5,Anand,25,Bangalore
6,Baskar,25,Hyderabad


##### 3) Array in JSON

In [0]:
array_data = [
    ('{"id": 1, "hobbies": ["reading", "sports", "music"]}',),
    ('{"id": 2, "hobbies": ["travel", "cooking"]}',),
    ('{"id": 3, "hobbies": ["watching", "sports", "music"]}',),
    ('{"id": 4, "hobbies": ["cricket", "cooking"]}',),
    ('{"id": 5, "hobbies": ["football", "sports", "music"]}',),
    ('{"id": 6, "hobbies": ["hocky", "cooking"]}',)
]

df_array = spark.createDataFrame(array_data, ["json_string"])

df_array_result = df_array.select(
    get_json_object(col("json_string"), "$.id").alias("ID"),
    get_json_object(col("json_string"), "$.hobbies[0]").alias("Hobby1"),
    get_json_object(col("json_string"), "$.hobbies[1]").alias("Hobby2"),
    get_json_object(col("json_string"), "$.hobbies[2]").alias("Hobby3")
)

display(df_array_result)


ID,Hobby1,Hobby2,Hobby3
1,reading,sports,music
2,travel,cooking,
3,watching,sports,music
4,cricket,cooking,
5,football,sports,music
6,hocky,cooking,


##### 4) JSON string containing an array of objects

In [0]:
json_str = """[{"Attr_INT":1, "ATTR_DOUBLE":10.101, "ATTR_DATE": "2021-01-01"},
{"Attr_INT":2, "ATTR_DOUBLE":20.201, "ATTR_DATE": "2022-02-11"},
{"Attr_INT":3, "ATTR_DOUBLE":30.301, "ATTR_DATE": "2023-04-21"},
{"Attr_INT":4, "ATTR_DOUBLE":40.401, "ATTR_DATE": "2024-05-15"},
{"Attr_INT":5, "ATTR_DOUBLE":50.501, "ATTR_DATE": "2025-03-25"}]"""

# Create a DataFrame
df_get = spark.createDataFrame([[1, json_str]], ['id', 'json_col'])
display(df_get)

id,json_col
1,"[{""Attr_INT"":1, ""ATTR_DOUBLE"":10.101, ""ATTR_DATE"": ""2021-01-01""}, {""Attr_INT"":2, ""ATTR_DOUBLE"":20.201, ""ATTR_DATE"": ""2022-02-11""}, {""Attr_INT"":3, ""ATTR_DOUBLE"":30.301, ""ATTR_DATE"": ""2023-04-21""}, {""Attr_INT"":4, ""ATTR_DOUBLE"":40.401, ""ATTR_DATE"": ""2024-05-15""}, {""Attr_INT"":5, ""ATTR_DOUBLE"":50.501, ""ATTR_DATE"": ""2025-03-25""}]"


In [0]:
# Extract values from the first array element ($[0])
df_get_obj = df_get \
    .withColumn('ATTR_INT_0', get_json_object('json_col', '$[0].Attr_INT')) \
    .withColumn('ATTR_DOUBLE_1', get_json_object('json_col', '$[0].ATTR_DOUBLE')) \
    .withColumn('ATTR_DATE_2', get_json_object('json_col', '$[0].ATTR_DATE'))

display(df_get_obj)

id,json_col,ATTR_INT_0,ATTR_DOUBLE_1,ATTR_DATE_2
1,"[{""Attr_INT"":1, ""ATTR_DOUBLE"":10.101, ""ATTR_DATE"": ""2021-01-01""}, {""Attr_INT"":2, ""ATTR_DOUBLE"":20.201, ""ATTR_DATE"": ""2022-02-11""}, {""Attr_INT"":3, ""ATTR_DOUBLE"":30.301, ""ATTR_DATE"": ""2023-04-21""}, {""Attr_INT"":4, ""ATTR_DOUBLE"":40.401, ""ATTR_DATE"": ""2024-05-15""}, {""Attr_INT"":5, ""ATTR_DOUBLE"":50.501, ""ATTR_DATE"": ""2025-03-25""}]",1,10.101,2021-01-01


#### Key Differences

| Feature           | `parse_json`                                                 | `get_json_object`                                   |
| ----------------- | ------------------------------------------------------------ | --------------------------------------------------- |
| **What it does**  | Parses a JSON string into a **variant** value                | Extracts a **string value** by JSONPath             |
| **Return type**   | **Variant** (fields are read with `:` paths and cast with `::`) | **String** (you must **cast** if you need **int, double**, etc.)  |
| **Schema**        | Not needed (types come from the JSON values themselves)      | Not needed (**no schema applied**)                  |
| **Performance**   | Better for **multiple/nested** fields                        | Lightweight for extracting **1–2 fields**           |
| **Best use case** | When you need structured access to **many fields**           | When you just need a **few values quickly**         |

| Method                | Input                | Output Type      | Pros                                            | Cons                         |
| --------------------- | -------------------- | ---------------- | ----------------------------------------------- | ---------------------------- |
| **get\_json\_object** | JSON string          | String           | Simple, lightweight                             | No schema, everything string |
| **parse\_json**       | JSON string          | variant          | Flexible, works without schema, preserves types | Schema not enforced          |
| **from\_json**        | JSON string + schema | Struct           | Strong typing, schema enforcement               | Must define schema           |

| Function              | Input Type           | Output Type                                                                 | Schema Required?               | Supports Nested?             | Use Case                                                                |
| --------------------- | -------------------- | --------------------------------------------------------------------------- | ------------------------------ | ---------------------------- | ----------------------------------------------------------------------- |
| **`get_json_object`** | `string` (JSON text) | `string` (single value)                                                     | ❌ No                           | ✅ Yes (via JSON path)        | Extract one field from JSON using a JSONPath-like syntax                |
| **`parse_json`**      | `string` (JSON text) | `variant` (Spark internal JSON representation, like semi-structured object) | ❌ No                           | ✅ Yes                        | Directly parse JSON text into a flexible object (no schema enforcement) |
| **`from_json`**       | `string` (JSON text) | `struct`, `array`, or `map`                                                 | ✅ Yes (you must define schema) | ✅ Yes                        | Parse JSON into strongly-typed Spark SQL structs/arrays/maps            |
| **`json_tuple`**      | `string` (JSON text) | Multiple `string` columns                                                   | ❌ No                           | ❌ No (only top-level fields) | Quickly extract multiple top-level fields without schema                |


In [0]:
# parse_json needs Spark 4.0+ (or a Databricks Runtime with VARIANT support)
from pyspark.sql.functions import col, expr, parse_json, get_json_object, from_json
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

data = [
    ('{"name": "Kiran", "age": 28, "city": "Bangalore"}',),
    ('{"name": "Darshan", "age": 32, "city": "Chennai"}',),
    ('{"name": "Chetan", "age": 40, "city": "Hyderabad"}',),
    ('{"name": "Ramu", "age": 35, "city": "Varanasi"}',),
    ('{"name": "Priya", "age": 45, "city": "Amaravati"}',)
]

df_diff = spark.createDataFrame(data, ["json_string"])
display(df_diff)

json_string
"{""name"": ""Kiran"", ""age"": 28, ""city"": ""Bangalore""}"
"{""name"": ""Darshan"", ""age"": 32, ""city"": ""Chennai""}"
"{""name"": ""Chetan"", ""age"": 40, ""city"": ""Hyderabad""}"
"{""name"": ""Ramu"", ""age"": 35, ""city"": ""Varanasi""}"
"{""name"": ""Priya"", ""age"": 45, ""city"": ""Amaravati""}"


##### Using get_json_object

In [0]:
df_getjson = df_diff.select(
    get_json_object(col("json_string"), "$.name").alias("Name"),
    get_json_object(col("json_string"), "$.age").alias("Age"),   # always string
    expr("CAST(get_json_object(json_string, '$.age') AS INT)").alias("Age_Int"),
    get_json_object(col("json_string"), "$.city").alias("City")
)
df_getjson.display()

Name,Age,Age_Int,City
Kiran,28,28,Bangalore
Darshan,32,32,Chennai
Chetan,40,40,Hyderabad
Ramu,35,35,Varanasi
Priya,45,45,Amaravati


##### Using parse_json

In [0]:
df_parse = df_diff.withColumn("json_data", parse_json(col("json_string")))

# Extract multiple fields
df_parse.select(
  expr("json_data:name::string").alias("Name"),
  expr("json_data:age::int").alias("Age"),       # keeps numeric type
  expr("json_data:city::string").alias("City")
).display()

Name,Age,City
Kiran,28,Bangalore
Darshan,32,Chennai
Chetan,40,Hyderabad
Ramu,35,Varanasi
Priya,45,Amaravati


##### Using from_json (explicit schema)

In [0]:
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True)
])

df_fromjson = df_diff.withColumn("json_data", from_json(col("json_string"), schema))

df_fromjson.select("json_data.*").display()

name,age,city
Kiran,28,Bangalore
Darshan,32,Chennai
Chetan,40,Hyderabad
Ramu,35,Varanasi
Priya,45,Amaravati


##### Using json_tuple (top-level fields only)

In [0]:
data = [
    ('{"name": "Subash", "age": 28, "price": 55000, "address": {"city": "Bangalore", "zip": "560001"}}',),
    ('{"name": "Dinesh", "age": 32, "price": 65000, "address": {"city": "Chennai", "zip": "600001"}}',),
    ('{"name": "Fathima", "age": 29, "price": 59500, "address": {"city": "Nasik", "zip": "560001"}}',),
    ('{"name": "Gopesh", "age": 35, "price": 77000, "address": {"city": "Vizak", "zip": "600001"}}',),
    ('{"name": "Sreeni", "age": 26, "price": 89000, "address": {"city": "Hyderabad", "zip": "560001"}}',),
    ('{"name": "Anitha", "age": 33, "price": 44000, "address": {"city": "Salem", "zip": "600001"}}',)
]

df_nest = spark.createDataFrame(data, ["json_string"])
display(df_nest)

json_string
"{""name"": ""Subash"", ""age"": 28, ""price"": 55000, ""address"": {""city"": ""Bangalore"", ""zip"": ""560001""}}"
"{""name"": ""Dinesh"", ""age"": 32, ""price"": 65000, ""address"": {""city"": ""Chennai"", ""zip"": ""600001""}}"
"{""name"": ""Fathima"", ""age"": 29, ""price"": 59500, ""address"": {""city"": ""Nasik"", ""zip"": ""560001""}}"
"{""name"": ""Gopesh"", ""age"": 35, ""price"": 77000, ""address"": {""city"": ""Vizak"", ""zip"": ""600001""}}"
"{""name"": ""Sreeni"", ""age"": 26, ""price"": 89000, ""address"": {""city"": ""Hyderabad"", ""zip"": ""560001""}}"
"{""name"": ""Anitha"", ""age"": 33, ""price"": 44000, ""address"": {""city"": ""Salem"", ""zip"": ""600001""}}"


In [0]:
# json_tuple only reads top-level keys, so "address.city" and "address.zip"
# are treated as literal key names and come back as null
df_bad = df_nest.select(
    json_tuple("json_string", "name", "age", "price", "address.city", "address.zip").alias("Name", "Age", "Price", "City", "Zip")
)

display(df_bad)

Name,Age,Price,City,Zip
Subash,28,55000,,
Dinesh,32,65000,,
Fathima,29,59500,,
Gopesh,35,77000,,
Sreeni,26,89000,,
Anitha,33,44000,,
