##### json_tuple
- Use `json_tuple` to **extract values** directly from a **JSON string** column **without converting** it into a **variant**.
- It is suited to pulling **multiple fields** out of a **flat JSON string** into **separate columns**.
- It only reads **top-level fields**, not fields inside **nested JSON**.
- It always returns **strings**, no matter the **original type** (you can **cast** later).
- It’s **schema-less** (you don’t need to define a schema).

**Limitation:**
- It can’t reach into **nested JSON**: a dotted key such as **address.city** is not treated as a path, so the result is **null**.
- For nested fields → use **get_json_object, parse_json, or from_json**.
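The behaviour described above can be approximated in plain Python for a single row. This is only an illustrative sketch, not Spark's implementation: `json_tuple_sketch` is a hypothetical helper name, and the assumption that non-string values (numbers, nested objects) come back as compact JSON text is based on commonly observed `json_tuple` output.

```python
import json


def json_tuple_sketch(json_string, *fields):
    """Hypothetical plain-Python approximation of json_tuple for one row."""
    try:
        obj = json.loads(json_string)
    except (TypeError, ValueError):
        # Null or malformed input: every requested field is None
        return tuple(None for _ in fields)
    if not isinstance(obj, dict):
        return tuple(None for _ in fields)

    def as_text(value):
        if value is None:
            return None
        if isinstance(value, str):
            return value
        # Assumption: numbers, booleans and nested objects are returned as JSON text
        return json.dumps(value, separators=(",", ":"))

    # Only top-level keys are looked up; "address.city" is a literal key, not a path
    return tuple(as_text(obj.get(f)) for f in fields)


row = '{"name": "Subash", "age": 28, "address": {"city": "Bangalore"}}'
result = json_tuple_sketch(row, "name", "age", "address.city", "address")
print(result)  # ('Subash', '28', None, '{"city":"Bangalore"}')
```

Note how `age` comes back as the string `'28'` and the dotted key `address.city` finds nothing, which mirrors the two points that matter most in practice: cast afterwards, and use a path-aware function for nested fields.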

##### Syntax

     json_tuple(column, field1, field2, ..., fieldN)

**column:**
- The **column** that contains the **JSON string**.

**field1 ... fieldN:**
- The JSON **keys** you want to **extract**.

**Returns:**
- **Multiple string columns**, one per requested field (a field missing from the JSON yields **null**).

In [0]:
from pyspark.sql.functions import col, json_tuple

In [0]:
help(json_tuple)

Help on function json_tuple in module pyspark.sql.functions.builtin:

json_tuple(col: 'ColumnOrName', *fields: str) -> pyspark.sql.column.Column
    Creates a new row for a json column according to the given field names.

    .. versionadded:: 1.6.0

    .. versionchanged:: 3.4.0
        Supports Spark Connect.

    Parameters
    ----------
    col : :class:`~pyspark.sql.Column` or str
        string column in json format
    fields : str
        a field or fields to extract

    Returns
    -------
    :class:`~pyspark.sql.Column`
        a new row for each given field value from json object

    Examples
    --------
    >>> data = [("1", '''{"f1": "value1", "f2": "value2"}'''), ("2", '''{"f1": "value12"}''')]
    >>> df = spark.createDataFrame(data, ("key", "jstring"))
    >>> df.select(df.key, json_tuple(df.jstring, 'f1', 'f2')).collect()
    [Row(key='1', c0='value1', c1='value2'), Row(key='2', c0='value12', c1=None)]



##### 1) Simple JSON

In [0]:
data = [
    ('{"id": 101, "name": "Subash", "age": 28, "city": "Bangalore", "price": 55000}',),
    ('{"id": 102, "name": "Dinesh", "age": 32, "city": "Chennai", "price": 25000}',),
    ('{"id": 103, "name": "Fathima", "age": 29, "city": "Nasik", "price": 35000}',),
    ('{"id": 104, "name": "Gopesh", "age": 35, "city": "Vizak", "price": 45000}',),
    ('{"id": 105, "name": "Sreeni", "age": 26, "city": "Hyderabad", "price": 26600}',),
    ('{"id": 106, "name": "Anitha", "age": 33, "city": "Salem", "price": 34400}',)
]

df = spark.createDataFrame(data, ["json_string"])

# Extract multiple top-level fields; alias() names the generated columns in order
df_extracted = df.select(
    "json_string",
    json_tuple("json_string", "id", "name", "age", "city", "price")
        .alias("Id", "Name", "Age", "City", "Price"),
)

df_extracted.display()

json_string,Id,Name,Age,City,Price
"{""id"": 101, ""name"": ""Subash"", ""age"": 28, ""city"": ""Bangalore"", ""price"": 55000}",101,Subash,28,Bangalore,55000
"{""id"": 102, ""name"": ""Dinesh"", ""age"": 32, ""city"": ""Chennai"", ""price"": 25000}",102,Dinesh,32,Chennai,25000
"{""id"": 103, ""name"": ""Fathima"", ""age"": 29, ""city"": ""Nasik"", ""price"": 35000}",103,Fathima,29,Nasik,35000
"{""id"": 104, ""name"": ""Gopesh"", ""age"": 35, ""city"": ""Vizak"", ""price"": 45000}",104,Gopesh,35,Vizak,45000
"{""id"": 105, ""name"": ""Sreeni"", ""age"": 26, ""city"": ""Hyderabad"", ""price"": 26600}",105,Sreeni,26,Hyderabad,26600
"{""id"": 106, ""name"": ""Anitha"", ""age"": 33, ""city"": ""Salem"", ""price"": 34400}",106,Anitha,33,Salem,34400


In [0]:
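# json_tuple returns strings, so cast the numeric fields explicitly.
# Column names resolve case-insensitively by default, so withColumn("ID", ...)
# replaces the existing "Id" column (and "AGE" replaces "Age") instead of adding one.
df_cast = (
    df_extracted
    .withColumn("ID", col("Id").cast("int"))
    .withColumn("AGE", col("Age").cast("int"))
    .withColumn("Price", col("Price").cast("int"))
)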

df_cast.display()

json_string,ID,Name,AGE,City,Price
"{""id"": 101, ""name"": ""Subash"", ""age"": 28, ""city"": ""Bangalore"", ""price"": 55000}",101,Subash,28,Bangalore,55000
"{""id"": 102, ""name"": ""Dinesh"", ""age"": 32, ""city"": ""Chennai"", ""price"": 25000}",102,Dinesh,32,Chennai,25000
"{""id"": 103, ""name"": ""Fathima"", ""age"": 29, ""city"": ""Nasik"", ""price"": 35000}",103,Fathima,29,Nasik,35000
"{""id"": 104, ""name"": ""Gopesh"", ""age"": 35, ""city"": ""Vizak"", ""price"": 45000}",104,Gopesh,35,Vizak,45000
"{""id"": 105, ""name"": ""Sreeni"", ""age"": 26, ""city"": ""Hyderabad"", ""price"": 26600}",105,Sreeni,26,Hyderabad,26600
"{""id"": 106, ""name"": ""Anitha"", ""age"": 33, ""city"": ""Salem"", ""price"": 34400}",106,Anitha,33,Salem,34400


In [0]:
cols = ["id", "name", "age", "city", "price"]

df_ext_iter = df.select(
    json_tuple("json_string", *cols).alias(*[c.capitalize() for c in cols])
)

df_ext_iter.display()

Id,Name,Age,City,Price
101,Subash,28,Bangalore,55000
102,Dinesh,32,Chennai,25000
103,Fathima,29,Nasik,35000
104,Gopesh,35,Vizak,45000
105,Sreeni,26,Hyderabad,26600
106,Anitha,33,Salem,34400


##### 2) Nested JSON

In [0]:
data = [
    ('{"name": "Subash", "age": 28, "price": 55000, "address": {"city": "Bangalore", "zip": "560001"}}',),
    ('{"name": "Dinesh", "age": 32, "price": 65000, "address": {"city": "Chennai", "zip": "600001"}}',),
    ('{"name": "Fathima", "age": 29, "price": 59500, "address": {"city": "Nasik", "zip": "560001"}}',),
    ('{"name": "Gopesh", "age": 35, "price": 77000, "address": {"city": "Vizak", "zip": "600001"}}',),
    ('{"name": "Sreeni", "age": 26, "price": 89000, "address": {"city": "Hyderabad", "zip": "560001"}}',),
    ('{"name": "Anitha", "age": 33, "price": 44000, "address": {"city": "Salem", "zip": "600001"}}',)
]

df_nest = spark.createDataFrame(data, ["json_string"])
display(df_nest)

json_string
"{""name"": ""Subash"", ""age"": 28, ""price"": 55000, ""address"": {""city"": ""Bangalore"", ""zip"": ""560001""}}"
"{""name"": ""Dinesh"", ""age"": 32, ""price"": 65000, ""address"": {""city"": ""Chennai"", ""zip"": ""600001""}}"
"{""name"": ""Fathima"", ""age"": 29, ""price"": 59500, ""address"": {""city"": ""Nasik"", ""zip"": ""560001""}}"
"{""name"": ""Gopesh"", ""age"": 35, ""price"": 77000, ""address"": {""city"": ""Vizak"", ""zip"": ""600001""}}"
"{""name"": ""Sreeni"", ""age"": 26, ""price"": 89000, ""address"": {""city"": ""Hyderabad"", ""zip"": ""560001""}}"
"{""name"": ""Anitha"", ""age"": 33, ""price"": 44000, ""address"": {""city"": ""Salem"", ""zip"": ""600001""}}"


In [0]:
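# Try to extract nested fields with dotted keys. json_tuple looks for a
# top-level key literally named "address.city", so City and Zip come back null.
df_bad = df_nest.select(
    json_tuple("json_string", "name", "age", "price", "address.city", "address.zip")
        .alias("Name", "Age", "Price", "City", "Zip")
)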

display(df_bad)

Name,Age,Price,City,Zip
Subash,28,55000,,
Dinesh,32,65000,,
Fathima,29,59500,,
Gopesh,35,77000,,
Sreeni,26,89000,,
Anitha,33,44000,,
