##### singleLineList JSON

- The **entire file** is on **one line**.
- The **root element** is a **JSON array ([ {…}, {…} ])**.

     [{"id":1,"name":"A","price":100},{"id":2,"name":"B","price":200}]
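To reproduce this layout locally, a small sketch using only Python's standard library can write the two sample records above as a single-line array and confirm both properties. The scratch path `sample_path` is a hypothetical stand-in; on Databricks the file would normally live under a Volume path instead.

```python
import json
import os
import tempfile

# The two sample records from the example line above.
records = [
    {"id": 1, "name": "A", "price": 100},
    {"id": 2, "name": "B", "price": 200},
]

# Hypothetical scratch location used only for this illustration.
sample_path = os.path.join(tempfile.mkdtemp(), "singlelineList_sample.json")

# separators=(",", ":") removes the default spaces so the output is compact.
with open(sample_path, "w", encoding="utf-8") as f:
    json.dump(records, f, separators=(",", ":"))

# Read the file back and check the two defining properties.
with open(sample_path, encoding="utf-8") as f:
    lines = f.read().splitlines()

root = json.loads(lines[0])

print(len(lines))            # number of physical lines in the file
print(type(root).__name__)   # Python type of the root JSON element
```

A file produced this way can then be passed to the Spark reader calls shown in the following cells.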

##### 1) Using multiLine = true (Recommended)
- With **multiLine=true**, Spark parses the **whole file as one JSON document**, so the result does not depend on how the **array** is laid out across lines.
- Works for:
  - **Single line JSON array**.
  - **Multi-line JSON array**.

In [0]:
path = "/Volumes/@azureadb/pyspark/training/json/read_json_singlelineList_same_schema/singlelineList_01.json"

df_singList_mlt = spark.read.format("json")\
                            .option('multiLine', True)\
                            .load(path)

display(df_singList_mlt)

country,description,id,input_timestamp,last_update_timestamp,source,user
IND,SQLServer,1,1124256609,1524256609,SQLDB,Jagan
US,ABAP,2,1224256609,1424256609,SAP,Brindavan
CANADA,GEN2,3,1324256609,1524256609,ADLS,Nandu
US,Storage,4,1424256609,1724256609,Blob,Syamala
SWEDEN,Lake House,5,1524256609,1664256609,Data Lake Storage,Chethan
UK,Lake Warehouse,6,1624256609,1874256609,Delta Lake,Roopesh
Norway,OracleDB,7,1779256609,188256609,oracle,Sundar
SWEDEN,ML,8,1524256609,1664256609,DS,Swaroop
UK,Machine,9,1924256609,1674256609,MLOPS,Rahul
Norway,DataScience,10,1379256609,198256609,AI,Santhosh


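##### 2) What happens if you don’t use multiLine=true?

- By default, Spark reads JSON **line by line** and expects **one complete JSON value per line**.
- A **single-line array** is still one complete value, so in the run below the rows are parsed, but an extra **_corrupt_record** column appears in the inferred schema.
- If the array **spans several lines**, the individual lines are no longer complete JSON values: **data columns come back NULL** and the raw text lands in **_corrupt_record**.

      spark.read.json("/path/to/singleline_list.json")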
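The difference between line-by-line and whole-document parsing can be imitated with Python's standard `json` module. This is only an analogy under stated assumptions, not Spark's actual parser: `parse_line_by_line` and `parse_whole_document` are hypothetical helpers that mimic the default mode and `multiLine=true` respectively.

```python
import json

records = [
    {"id": 1, "name": "A", "price": 100},
    {"id": 2, "name": "B", "price": 200},
]

# Same data, two layouts: compact on one line, and pretty-printed over many lines.
single_line = json.dumps(records, separators=(",", ":"))
multi_line = json.dumps(records, indent=2)


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON value (JSON Lines style)."""
    rows, bad_lines = [], []
    for line in text.splitlines():
        if not line.strip():
            continue
        try:
            value = json.loads(line)
        except json.JSONDecodeError:
            # The line is not a complete JSON value on its own.
            bad_lines.append(line)
            continue
        # A top-level array contributes one row per element.
        rows.extend(value if isinstance(value, list) else [value])
    return rows, bad_lines


def parse_whole_document(text):
    """Parse the full text as a single JSON document (multiLine style)."""
    value = json.loads(text)
    return value if isinstance(value, list) else [value]


for label, text in [("single-line", single_line), ("multi-line", multi_line)]:
    rows, bad_lines = parse_line_by_line(text)
    print(label, "line-by-line:", len(rows), "rows,", len(bad_lines), "bad lines")
    print(label, "whole document:", len(parse_whole_document(text)), "rows")
```

In this imitation, whole-document parsing returns both records for either layout, while line-by-line parsing only succeeds when the array sits on one line.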

In [0]:
df_singList = spark.read.format("json").load(path)

display(df_singList)

_corrupt_record,country,description,id,input_timestamp,last_update_timestamp,source,user
,IND,SQLServer,1,1124256609,1524256609,SQLDB,Jagan
,US,ABAP,2,1224256609,1424256609,SAP,Brindavan
,CANADA,GEN2,3,1324256609,1524256609,ADLS,Nandu
,US,Storage,4,1424256609,1724256609,Blob,Syamala
,SWEDEN,Lake House,5,1524256609,1664256609,Data Lake Storage,Chethan
,UK,Lake Warehouse,6,1624256609,1874256609,Delta Lake,Roopesh
,Norway,OracleDB,7,1779256609,188256609,oracle,Sundar
,SWEDEN,ML,8,1524256609,1664256609,DS,Swaroop
,UK,Machine,9,1924256609,1674256609,MLOPS,Rahul
,Norway,DataScience,10,1379256609,198256609,AI,Santhosh


##### 3) Explicit schema

In [0]:
from pyspark.sql.types import StructField, StructType, IntegerType, StringType, LongType

# Define a flat schema: one field per top-level key of each JSON object (there is no nested structure)
Schema = StructType([StructField('id', IntegerType(), False),
                     StructField('country', StringType(), False),
                     StructField('description', StringType(), False),
                     StructField('input_timestamp', LongType(), False),
                     StructField('last_update_timestamp', LongType(), False),
                     StructField('source', StringType(), False),
                     StructField('user', StringType(), False)]
                     )

df_singList_mlt_schema = spark.read.format("json")\
                                   .option('multiLine', True)\
                                   .schema(Schema)\
                                   .load("/Volumes/@azureadb/pyspark/training/json/read_json_singlelineList_same_schema/singlelineList_01.json")

display(df_singList_mlt_schema)

id,country,description,input_timestamp,last_update_timestamp,source,user
1,IND,SQLServer,1124256609,1524256609,SQLDB,Jagan
2,US,ABAP,1224256609,1424256609,SAP,Brindavan
3,CANADA,GEN2,1324256609,1524256609,ADLS,Nandu
4,US,Storage,1424256609,1724256609,Blob,Syamala
5,SWEDEN,Lake House,1524256609,1664256609,Data Lake Storage,Chethan
6,UK,Lake Warehouse,1624256609,1874256609,Delta Lake,Roopesh
7,Norway,OracleDB,1779256609,188256609,oracle,Sundar
8,SWEDEN,ML,1524256609,1664256609,DS,Swaroop
9,UK,Machine,1924256609,1674256609,MLOPS,Rahul
10,Norway,DataScience,1379256609,198256609,AI,Santhosh
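As an alternative to building a `StructType`, `DataFrameReader.schema` also accepts a DDL-formatted string. The sketch below builds that string for the same seven columns; `columns`, `schema_ddl`, and `read_single_line_list` are hypothetical names introduced for illustration, and column names are backtick-quoted as a precaution in case a name such as `user` is treated as a keyword.

```python
# Column name -> Spark SQL type, in the same order as the StructType above.
columns = {
    "id": "INT",
    "country": "STRING",
    "description": "STRING",
    "input_timestamp": "BIGINT",
    "last_update_timestamp": "BIGINT",
    "source": "STRING",
    "user": "STRING",
}

# Build a DDL string such as "`id` INT, `country` STRING, ...".
schema_ddl = ", ".join(f"`{name}` {dtype}" for name, dtype in columns.items())


def read_single_line_list(spark, path, schema=schema_ddl):
    """Hypothetical helper: read a JSON array file with an explicit schema.

    `spark` is assumed to be an active SparkSession, as in the notebook cells.
    """
    return (
        spark.read.format("json")
        .option("multiLine", True)
        .schema(schema)
        .load(path)
    )


print(schema_ddl)
```

In a notebook, `read_single_line_list(spark, path)` would be expected to return the same columns as the cell above, in the order given by `columns`.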


##### 4) How to read multiple singlelineList files?

In [0]:
path01 = "/Volumes/@azureadb/pyspark/training/read_json_singlelineList_same_schema/singlelineList_01.json"
path02 = "/Volumes/@azureadb/pyspark/training/read_json_singlelineList_same_schema/singlelineList_02.json"
path03 = "/Volumes/@azureadb/pyspark/training/read_json_singlelineList_same_schema/singlelineList_03.json"

df_singList_mlt_path = spark.read.format("json")\
                                 .option('multiLine', True)\
                                 .load([path01, path02, path03])

display(df_singList_mlt_path)

country,description,id,input_timestamp,last_update_timestamp,source,user
IND,SQLServer,21,1124256609,1524256609,SQLDB,Jaleel
US,ABAP,22,1224256609,1424256609,SAP,Bhavya
CANADA,GEN2,23,1324256609,1524256609,ADLS,Nirmal
US,Storage,24,1424256609,1724256609,Blob,Kavitha
SWEDEN,Lake House,25,1524256609,1664256609,Data Lake Storage,Guptha
UK,Lake Warehouse,26,1624256609,1874256609,Delta Lake,Deepak
Norway,OracleDB,27,1779256609,188256609,oracle,Sindu
SWEDEN,ML,28,1524256609,1664256609,DS,Yamuna
UK,Machine,29,1924256609,1674256609,MLOPS,Gajendra
Norway,DataScience,30,1379256609,198256609,AI,Yunus


In [0]:
# Read every single-line JSON array file in the folder by using a wildcard path
df_all_singList_mlt_path = spark.read.option("multiLine", True) \
                                     .json("/Volumes/@azureadb/pyspark/training/json/read_json_singlelineList_same_schema/*.json")
display(df_all_singList_mlt_path)

country,description,id,input_timestamp,last_update_timestamp,source,user
IND,SQLServer,21,1124256609,1524256609,SQLDB,Jaleel
US,ABAP,22,1224256609,1424256609,SAP,Bhavya
CANADA,GEN2,23,1324256609,1524256609,ADLS,Nirmal
US,Storage,24,1424256609,1724256609,Blob,Kavitha
SWEDEN,Lake House,25,1524256609,1664256609,Data Lake Storage,Guptha
UK,Lake Warehouse,26,1624256609,1874256609,Delta Lake,Deepak
Norway,OracleDB,27,1779256609,188256609,oracle,Sindu
SWEDEN,ML,28,1524256609,1664256609,DS,Yamuna
UK,Machine,29,1924256609,1674256609,MLOPS,Gajendra
Norway,DataScience,30,1379256609,198256609,AI,Yunus
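The displayed output shows only ten rows, so an independent record count per file can help confirm that a wildcard matched every file. The sketch below is a plain-Python check under stated assumptions: it uses a temporary folder with three tiny hypothetical files, and `count_records` is a hypothetical helper; it assumes the files are reachable through an ordinary file path.

```python
import glob
import json
import os
import tempfile

# Create three small single-line array files in a scratch folder (illustration only).
folder = tempfile.mkdtemp()
for n, ids in enumerate([(1, 2), (3, 4), (5,)], start=1):
    file_name = os.path.join(folder, f"singlelineList_{n:02d}.json")
    with open(file_name, "w", encoding="utf-8") as f:
        json.dump([{"id": i} for i in ids], f)


def count_records(pattern):
    """Return {file name: number of array elements} for every file matching pattern."""
    counts = {}
    for file_path in sorted(glob.glob(pattern)):
        with open(file_path, encoding="utf-8") as f:
            counts[os.path.basename(file_path)] = len(json.load(f))
    return counts


counts = count_records(os.path.join(folder, "*.json"))
total = sum(counts.values())

print(counts)
print(total)
```

The total can then be compared with `df_all_singList_mlt_path.count()`; a mismatch would suggest that a file was skipped or failed to parse.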


##### 5) PERMISSIVE mode & explicit schema with `_corrupt_record`

In [0]:
from pyspark.sql.types import StructField, StructType, IntegerType, StringType, LongType

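# Define a flat schema plus a nullable `_corrupt_record` string column to capture malformed input
# Note: nullable=False is not enforced on read here (the corrupt row in the output has NULLs in every data column)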
Schema = StructType([StructField('id', IntegerType(), False),
                     StructField('country', StringType(), False),
                     StructField('description', StringType(), False),
                     StructField('input_timestamp', LongType(), False),
                     StructField('last_update_timestamp', LongType(), False),
                     StructField('source', StringType(), False),
                     StructField('user', StringType(), False),
                     StructField("_corrupt_record", StringType(), True)]
                     )

df_singList_recrd = spark.read.format("json")\
                              .schema(Schema)\
                              .option('multiLine', True)\
                              .option("mode", "PERMISSIVE")\
                              .load("/Volumes/@azureadb/pyspark/training/json/read_json/singlelineList_corrupt_record.json")

display(df_singList_recrd)

id,country,description,input_timestamp,last_update_timestamp,source,user,_corrupt_record
1.0,IND,bravia,1124256609.0,1524256609.0,catalog,Hari,
2.0,US,sony,1224256609.0,1424256609.0,SAP,Rajesh,
3.0,CANADA,bse,1324256609.0,1524256609.0,ADLS,Lokesh,
4.0,US,exchange,1424256609.0,1724256609.0,Blob,Sharath,
5.0,SWEDEN,Stock,1524256609.0,1664256609.0,SQL,Sheetal,
,,,,,,,"[  {""id"":1, ""source"":""catalog"", ""description"":""bravia"", ""input_timestamp"":1124256609, ""last_update_timestamp"":1524256609, ""country"":""IND"", ""user"":""Hari""},  {""id"":2, ""source"":""SAP"", ""description"":""sony"", ""input_timestamp"":1224256609, ""last_update_timestamp"":1424256609, ""country"":""US"", ""user"":""Rajesh""},  {""id"":3, ""source"":""ADLS"", ""description"":""bse"", ""input_timestamp"":1324256609, ""last_update_timestamp"":1524256609, ""country"":""CANADA"", ""user"":""Lokesh""},  {""id"":4, ""source"":""Blob"", ""description"":""exchange"", ""input_timestamp"":1424256609, ""last_update_timestamp"":1724256609, ""country"":""US"", ""user"":""Sharath""},  {""id"":5, ""source"":""SQL"", ""description"":""Stock"", ""input_timestamp"":1524256609, ""last_update_timestamp"":1664256609, ""country"":""SWEDEN"", ""user"":""Sheetal""},  {""id"":6, ""source"":""datawarehouse"", ""description"":""azure"", ""input_timestamp"":1624256609, ""last_update_timestamp"":1874256609, ""country"":""UK"", ""user"":""Raj""  {""id"":7, ""source"":""oracle"", ""description"":""ADF"", ""input_timestamp"":1779256609, ""last_update_timestamp"":188256609, ""country"":""Norway"", ""user"":""Synapse""},  {""id"":8, ""source"":""AZURE"", ""description"":""ETL"", ""input_timestamp"":1229256609, ""last_update_timestamp"":173256609, ""country"":""Norway"", ""user"":""Synapse""},  {""id"":9, ""source"":""AWS"", ""description"":""ELT"", ""input_timestamp"":1339256609, ""last_update_timestamp"":116256609, ""country"":""Norway"", ""user"":""Synapse""},  {""id"":10, ""source"":""GCC"", ""description"":""Git"", ""input_timestamp"":1569256609, ""last_update_timestamp"":129256609, ""country"":""Norway"", ""user"":""Synapse""} ]"
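The `_corrupt_record` value above holds the raw text of the array, in which the object with `id` 6 has no closing `},` before the next object starts. To locate this kind of problem in a file, a minimal sketch with Python's standard `json` module can report where parsing stops. The string `malformed` is a shortened stand-in for the real file, and `find_json_error` is a hypothetical helper.

```python
import json

# Shortened stand-in for the corrupt file: the object with id 6 is never closed.
malformed = (
    '[ {"id":5, "user":"Sheetal"}, '
    '{"id":6, "user":"Raj"  {"id":7, "user":"Synapse"} ]'
)


def find_json_error(text, context=20):
    """Return details about the first JSON syntax error, or None if text is valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - context, 0)
        return {
            "message": err.msg,       # parser's description of the problem
            "position": err.pos,      # character offset where parsing stopped
            "near": text[start:err.pos + context],  # surrounding text for context
        }
    return None


error = find_json_error(malformed)
print(error)
```

The reported position points at the character where the parser expected a delimiter, which is usually just after the spot where the closing brace or comma is missing.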


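| JSON layout                           | Reader option                                                            |
| ------------------------------------- | ------------------------------------------------------------------------ |
| One JSON object per line (JSON Lines) | default (`multiLine = false`)                                            |
| Single-line JSON array                | `multiLine = true` (recommended)                                         |
| Multi-line JSON array                 | `multiLine = true`                                                       |
| Nested JSON                           | depends on the line layout, not the nesting; an explicit schema is optional |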