**Spark is a distributed system**

Spark reads data by:
- **Splitting files** into chunks **(partitions)**.
- Processing those chunks in **parallel across executors**.

##### 1) With `multiLine = false` (default)

- **Each line** is a complete **JSON** record.
- Spark can safely **split** the file at any **line boundary**.
- **Each partition** can **parse JSON independently**.
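The independence of partitions can be illustrated without Spark. This is a minimal plain-Python sketch, not Spark's actual reader: the helper `parse_chunk`, the two-way split, and the two extra sample names are assumptions made for illustration. It cuts a line-delimited JSON string at a line boundary and parses each chunk on its own.

```python
import json

# Line-delimited JSON: every line is a complete record (sample data is illustrative).
data = (
    '{"id": 1, "name": "Suresh"}\n'
    '{"id": 2, "name": "Ravi"}\n'
    '{"id": 3, "name": "Anita"}\n'
    '{"id": 4, "name": "Kiran"}\n'
)

def parse_chunk(chunk):
    """Parse one chunk of line-delimited JSON without seeing any other chunk."""
    return [json.loads(line) for line in chunk.splitlines() if line.strip()]

# Split at a line boundary, the way a partition boundary is aligned to a newline.
lines = data.splitlines(keepends=True)
chunk_a = "".join(lines[:2])
chunk_b = "".join(lines[2:])

# Each chunk parses independently; together they give the full data set.
rows = parse_chunk(chunk_a) + parse_chunk(chunk_b)
print([row["id"] for row in rows])
```

Because neither chunk needs context from the other, the two `parse_chunk` calls could run on different executors.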

##### 2) What happens if `multiLine = true`?

In [0]:
[
  {
    "id": 1,
    "name": "Suresh"
  },
  {
    "id": 2,
    "name": "Ravi"
  }
]

[{'id': 1, 'name': 'Suresh'}, {'id': 2, 'name': 'Ravi'}]

**Problems:**
- **One JSON record** spans **multiple lines**.
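- Spark **cannot split** the file **arbitrarily**: a chunk that starts mid-record is not valid JSON on its own.
- **Spark must:**
  - Read **each file as a whole** in a single task.
  - Keep track of parser state across lines until a record is complete.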

❌ Slower

❌ More memory usage

❌ Harder to parallelize
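To see why a multi-line document cannot be cut at an arbitrary line, here is a small plain-Python sketch (an illustration with the standard `json` module, not Spark code; `try_parse` is a hypothetical helper). It splits the array shown above in the middle and tries to parse each half separately.

```python
import json

document = """[
  {
    "id": 1,
    "name": "Suresh"
  },
  {
    "id": 2,
    "name": "Ravi"
  }
]"""

def try_parse(text):
    """Return the parsed value, or None if the text is not valid JSON on its own."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

# Cut the document after the first record's closing brace.
lines = document.splitlines(keepends=True)
first_half = "".join(lines[:5])
second_half = "".join(lines[5:])

print(try_parse(first_half))   # neither half is valid JSON by itself
print(try_parse(second_half))
print(try_parse(document))     # only the complete document parses
```

Only the complete document parses, which is why a reader has to process the file as one unit.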

##### 3) Why not make `multiLine = true` default?

Because:
- It would **hurt performance** for the **most common case**: line-delimited JSON (one record per line).
- Spark would need to **read entire files**, reducing parallelism.
- **Streaming and big data** workloads would **suffer**.
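The loss of parallelism can be estimated with simple arithmetic. The sketch below is a deliberately simplified model, not Spark's real split planning (which also takes other settings into account): it assumes a splittable file is cut into chunks of at most `spark.sql.files.maxPartitionBytes` (128 MB by default), while a non-splittable file goes to a single task. The function name `estimated_tasks` is hypothetical.

```python
import math

# Default value of spark.sql.files.maxPartitionBytes (128 MB).
MAX_PARTITION_BYTES = 128 * 1024 * 1024

def estimated_tasks(file_size_bytes, splittable):
    """Rough number of read tasks for a single file (simplified model)."""
    if not splittable:
        return 1  # the whole file is handled by one task
    return max(1, math.ceil(file_size_bytes / MAX_PARTITION_BYTES))

one_gib = 1024 ** 3
print(estimated_tasks(one_gib, splittable=True))   # 8
print(estimated_tasks(one_gib, splittable=False))  # 1
```

Under this model a 1 GiB line-delimited file is read by about eight tasks, while the same data as one multi-line file is read by one.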

##### 4) Safety & fault tolerance

With `multiLine = false`:
- If **one line is corrupt** → Spark **skips or flags** it.
- **Other records** are **still processed**.

With `multiLine = true`:
- **One corrupt block** can **break the entire file**.

##### Visual explanation

| multiLine=false |
|-----------------|
| Line 1 ✅      |
| Line 2 ✅      |
| Line 3 ❌  → skipped |
| Line 4 ✅      |

| multiLine=true |
|----------------|
| [ Entire JSON array ] |
|      ❌              |
| → whole file rejected |

- `_corrupt_record` is populated only when the JSON parser hits **incomplete or malformed JSON data** (for example, broken syntax).
- A record that is syntactically valid but has **missing columns** is **not** considered **"corrupt"**; in **PERMISSIVE mode** Spark **fills** the missing fields with **nulls**.
- How a **type mismatch** against an explicit schema is reported can differ between Spark versions, so verify that case on your own runtime rather than assuming it stays out of `_corrupt_record`.
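The difference between a malformed record and a record with missing fields can be sketched in plain Python. This imitates the PERMISSIVE behaviour described above using the standard `json` module; it is not Spark's implementation, and `permissive_row` and `FIELDS` are hypothetical names.

```python
import json

FIELDS = ["id", "name", "price"]

def permissive_row(line, fields=FIELDS, corrupt_col="_corrupt_record"):
    """Mimic PERMISSIVE mode for one line: never raise, always return a row."""
    try:
        record = json.loads(line)
        if not isinstance(record, dict):
            raise ValueError("not a JSON object")
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        # Malformed JSON: all data columns are null, raw text is kept.
        row = {name: None for name in fields}
        row[corrupt_col] = line
        return row
    # Valid JSON: absent fields simply become None, nothing is flagged.
    row = {name: record.get(name) for name in fields}
    row[corrupt_col] = None
    return row

missing_price = permissive_row('{"id": 1, "name": "Laptop"}')
broken_syntax = permissive_row('{"id":4,"name":"Chair","price":8000')
print(missing_price)
print(broken_syntax)
```

The first row has `price` set to `None` and an empty corrupt column; the second row has every data column set to `None` and the raw text preserved.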

#### Scenario 01
With `multiLine = false`:
- If **one line is corrupt** → Spark **skips or flags** it.
- **Other records** are **still processed**.

    {"id":1,"name":"Laptop","price":55000}
    {"id":2,"name":"Mobile","price":25000}
    {"id":3,"name":"Table","price":12000}
    {"id":4,"name":"Chair","price":8000       ❌ Line 4 is corrupt (missing closing })
    {"id":5,"name":"Monitor","price":15000}


In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema_wo_corrpt = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("price", IntegerType(), True)
])

schema_w_corrpt = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("price", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])

**a) PERMISSIVE mode & explicit schema without `_corrupt_record`**

- If you provide an **explicit schema**, Spark **will NOT add new columns** that are **not in the schema**.
  - schema **does NOT** contain **_corrupt_record**.
  - Reads **valid rows correctly**.
  - For **corrupt lines**, fills all defined fields as **NULL**.
  - **Drops** the **corrupt text**, because there is **no column to store it**.

        null  null  null

In [0]:
df_perm_schema = spark.read \
                      .schema(schema_wo_corrpt) \
                      .option("mode", "PERMISSIVE") \
                      .json("/Volumes/@azureadb/pyspark/training/read_json/singleline_corrupt_record.json")

display(df_perm_schema)

id,name,price
1.0,Laptop,55000.0
2.0,Mobile,25000.0
3.0,Table,12000.0
,,
5.0,Monitor,15000.0


**b) PERMISSIVE mode & schema inference (no explicit schema)**

In [0]:
# Automatic schema inference (no .schema(...) call).
# Spark infers the columns from the data and adds _corrupt_record
# when it encounters malformed lines.
# Not recommended for production (schema drift risk).
df_permissive = spark.read \
                     .option("mode", "PERMISSIVE") \
                     .json("/Volumes/@azureadb/pyspark/training/read_json/singleline_corrupt_record.json")

display(df_permissive)

_corrupt_record,id,name,price
,1.0,Laptop,55000.0
,2.0,Mobile,25000.0
,3.0,Table,12000.0
"{""id"":4,""name"":""Chair"",""price"":8000",,,
,5.0,Monitor,15000.0


- **Valid rows** are **processed**.
- **Corrupt line** is **flagged**, not dropped: it appears as a row of nulls with the raw text in `_corrupt_record`.

     .option("mode", "PERMISSIVE")

Spark does the following:
- Reads **all valid JSON records**.
- For **malformed JSON**:
  - Sets **all normal columns** to **null**.
  - Stores the **raw corrupt JSON text** in a **special column**.
- **Default corrupt-record column name**: `_corrupt_record`.

**c) PERMISSIVE mode & explicit schema with `_corrupt_record`**

In [0]:
df_permissive_wo = spark.read \
                        .schema(schema_w_corrpt) \
                        .option("mode", "PERMISSIVE") \
                        .json("/Volumes/@azureadb/pyspark/training/read_json/singleline_corrupt_record.json")

display(df_permissive_wo)

id,name,price,_corrupt_record
1.0,Laptop,55000.0,
2.0,Mobile,25000.0,
3.0,Table,12000.0,
,,,"{""id"":4,""name"":""Chair"",""price"":8000"
5.0,Monitor,15000.0,


- **Corrupt row** is **isolated**.
- **Clean data** is **not lost**.

**d) When DOES `columnNameOfCorruptRecord` matter?**
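- It **only matters** when you **change the column name** from the default `_corrupt_record`; the schema must then contain a column with **exactly that name**.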

In [0]:
schema_custom = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("price", IntegerType(), True),
    StructField("bad_record", StringType(), True)
])

df_bad_rec = spark.read \
                  .schema(schema_custom) \
                  .option("mode", "PERMISSIVE") \
                  .option("columnNameOfCorruptRecord", "bad_record") \
                  .json("/Volumes/@azureadb/pyspark/training/read_json/singleline_corrupt_record.json")

display(df_bad_rec)

id,name,price,bad_record
1.0,Laptop,55000.0,
2.0,Mobile,25000.0,
3.0,Table,12000.0,
,,,"{""id"":4,""name"":""Chair"",""price"":8000"
5.0,Monitor,15000.0,


- **Corrupt rows** go into **bad_record**.
- **_corrupt_record** is **NOT** used.

**e) What if the schema has `_corrupt_record` but the option uses a different name?**

In [0]:
df_sch_bad_rec = spark.read \
                      .schema(schema_w_corrpt) \
                      .option("mode", "PERMISSIVE") \
                      .option("columnNameOfCorruptRecord", "bad_record") \
                      .json("/Volumes/@azureadb/pyspark/training/read_json/singleline_corrupt_record.json")

display(df_sch_bad_rec)

id,name,price,_corrupt_record
1.0,Laptop,55000.0,
2.0,Mobile,25000.0,
3.0,Table,12000.0,
,,,
5.0,Monitor,15000.0,


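- Spark **does NOT** store the **corrupt text** anywhere, because **bad_record** (the configured name) is **not in the schema**.
- `_corrupt_record` is treated as an **ordinary column**; no such field exists in the data, so it is **null** in every row.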
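The name-matching rule behind cases d) and e) can be condensed into a few lines of plain Python. This is a sketch of the observed behaviour, not Spark's code; `read_line` is a hypothetical helper and it assumes every valid line is a JSON object.

```python
import json

def read_line(line, schema, corrupt_col="_corrupt_record"):
    """Return one row limited to the columns in `schema`.

    The raw text of a malformed line is kept only when `schema`
    contains a column whose name equals `corrupt_col`.
    """
    row = {name: None for name in schema}
    try:
        record = json.loads(line)  # assumed to be a JSON object when valid
    except json.JSONDecodeError:
        if corrupt_col in schema:
            row[corrupt_col] = line
        return row
    for name in schema:
        if name != corrupt_col:
            row[name] = record.get(name)
    return row

bad_line = '{"id":4,"name":"Chair","price":8000'

# Case d: schema column and option agree, so the raw text is kept.
case_d = read_line(bad_line, ["id", "name", "price", "bad_record"], corrupt_col="bad_record")
# Case e: schema has _corrupt_record but the option says bad_record, so the text is dropped.
case_e = read_line(bad_line, ["id", "name", "price", "_corrupt_record"], corrupt_col="bad_record")

print(case_d["bad_record"])
print(case_e["_corrupt_record"])
```

In case e the row is all nulls, matching the output shown above: the configured column does not exist in the schema, so there is nowhere to put the raw text.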

**f) Create separate tables (Clean vs Bad data)**

In [0]:
# df_permissive_wo (section c) was read with a schema that includes _corrupt_record.
valid_df = df_permissive_wo.filter("_corrupt_record IS NULL")
bad_df   = df_permissive_wo.filter("_corrupt_record IS NOT NULL")

display(valid_df)
display(bad_df)

id,name,price,_corrupt_record
1,Laptop,55000,
2,Mobile,25000,
3,Table,12000,
5,Monitor,15000,


id,name,price,_corrupt_record
,,,"{""id"":4,""name"":""Chair"",""price"":8000"


**Why this works?**
- `multiLine = false` → `1 line = 1 record`.
- **Corruption** affects **only that line**.
- Spark **continues processing** other lines.

#### Scenario 02

With `multiLine = true`:
- **One corrupt block** can **break the entire file**.

    [
      {
        "id": 1,
        "name": "Laptop",
        "price": 55000
      },
      {
        "id": 2,
        "name": "Mobile",
        "price": 25000
      },
      {
        "id": 3,
        "name": "Table",
        "price": 12000
      },
      {
        "id": 4,
        "name": "Chair",
        "price": 8000        // ❌ invalid comment → corrupt block
      },
      {
        "id": 5,
        "name": "Monitor",
        "price": 15000
      }
    ]


- JSON **comments (//)** are **not allowed**.
- This **corrupts** the **entire JSON array**.

In [0]:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

schema_true = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("price", IntegerType(), True)
])

In [0]:
# With multiLine=true the whole file is parsed as a single JSON document.
# In this run the records before the syntax error are returned, followed by
# an all-null row; without a _corrupt_record column the raw text is dropped.
df_mlt_true = spark.read \
                   .schema(schema_true) \
                   .option("multiLine", "true") \
                   .option("mode", "PERMISSIVE") \
                   .json("/Volumes/@azureadb/pyspark/training/read_json/multiline_corrupt_record.json")

display(df_mlt_true)

id,name,price
1.0,Laptop,55000.0
2.0,Mobile,25000.0
3.0,Table,12000.0
,,


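**Result:**
- Spark **cannot parse** the array past the invalid comment.
- In the run above, the records **before** the error (ids 1 to 3) were returned, followed by **one all-NULL row**; record 5 is **lost** even though it is valid.
- Depending on the Spark version, the output may instead be:
  - an **empty DataFrame**, or
  - a **single row with all NULLs**, or
  - a **job failure**.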
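The output displayed above (three parsed rows, then an all-null row) is consistent with a parser that walks the array element by element and gives up at the first syntax error. The sketch below reproduces that pattern in plain Python under that assumption; it says nothing definitive about Spark's internals, and `parse_array_prefix` is a hypothetical helper that is lenient about separators.

```python
import json

def parse_array_prefix(text):
    """Decode array elements one at a time; stop at the first syntax error.

    Returns (rows, failed), where `failed` is True when the array could not
    be read up to its closing bracket.
    """
    decoder = json.JSONDecoder()
    rows = []
    pos = text.index("[") + 1
    while True:
        # Skip whitespace and the comma between elements (lenient on purpose).
        while pos < len(text) and text[pos] in " \t\r\n,":
            pos += 1
        if pos >= len(text):
            return rows, True       # input ended before "]"
        if text[pos] == "]":
            return rows, False      # clean end of the array
        try:
            value, pos = decoder.raw_decode(text, pos)
        except json.JSONDecodeError:
            return rows, True       # everything after this point is lost
        rows.append(value)

text = """[
  {"id": 1, "price": 55000},
  {"id": 2, "price": 25000},
  {"id": 3, "price": 12000},
  {"id": 4, "price": 8000   // invalid comment
  },
  {"id": 5, "price": 15000}
]"""

rows, failed = parse_array_prefix(text)
print([row["id"] for row in rows], failed)
```

Records 1 to 3 are recovered, the parse is marked as failed, and record 5 is never reached even though it is valid on its own.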

In [0]:
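# _corrupt_record is populated because the schema explicitly includes it.
# In PERMISSIVE mode the raw text of the whole document is stored there,
# because a multiline file is treated as one record.
# Setting columnNameOfCorruptRecord to "_corrupt_record" is optional here:
# that is already the default name.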
schema_with_corrupt = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("price", IntegerType(), True),
    StructField("_corrupt_record", StringType(), True)
])

df_mlt_true_corrpt = spark.read \
                          .schema(schema_with_corrupt) \
                          .option("multiLine", "true") \
                          .option("mode", "PERMISSIVE") \
                          .option("columnNameOfCorruptRecord", "_corrupt_record") \
                          .json("/Volumes/@azureadb/pyspark/training/read_json/multiline_corrupt_record.json")

display(df_mlt_true_corrpt)

id,name,price,_corrupt_record
1.0,Laptop,55000.0,
2.0,Mobile,25000.0,
3.0,Table,12000.0,
,,,"[  {  ""id"": 1,  ""name"": ""Laptop"",  ""price"": 55000  },  {  ""id"": 2,  ""name"": ""Mobile"",  ""price"": 25000  },  {  ""id"": 3,  ""name"": ""Table"",  ""price"": 12000  },  {  ""id"": 4,  ""name"": ""Chair"",  ""price"": 8000 // ❌ invalid comment → corrupt block  },  {  ""id"": 5,  ""name"": ""Monitor"",  ""price"": 15000  } ]"


##### 5) Analogy

- **Single-line** JSON = **one row per line** (like CSV).
- **Multi-line** JSON = **one book** spread across **many pages**.
- Spark **prefers rows, not books**.

##### Final takeaway
- `multiLine=false` is the **default** because it **enables** Spark’s **parallel, scalable, and fault-tolerant design**.
- Use `multiLine=true` only when your JSON truly spans **multiple lines**.