#### What is .option("samplingRatio", value)?

- When you use **inferSchema=True**, Spark **scans** your dataset to **guess data types**.
- But scanning the **entire file** can be **slow for large datasets**.
- The **samplingRatio** option sets the **fraction of rows** Spark uses to **infer the schema**. The default is **1.0**, so without it Spark uses **all rows**.

#### Syntax

     .option("samplingRatio", <value greater than 0.0, up to 1.0>)

- **0.1** → use **10% of rows** for schema inference.
- **1.0** → use **100%** (default, full scan)
- **0.01** → use only **1% of rows**.

       .option("samplingRatio", 0.1)

- This tells Spark to **sample 10% of the rows**. A **higher ratio** means **more accuracy** but **longer loading time**.
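
The syntax above can be wrapped in a small reader function. This is a minimal sketch: `read_csv_sampled` is a hypothetical helper name (not part of PySpark), `spark` is assumed to be an existing SparkSession, and the range check is this helper's own choice rather than something Spark requires in this form.

```python
def read_csv_sampled(spark, path, sampling_ratio=1.0):
    """Read a CSV with a header row, inferring the schema from a fraction of rows.

    `read_csv_sampled` is a hypothetical helper; `spark` is an existing SparkSession.
    """
    # Reject ratios outside (0, 1] before handing them to Spark.
    if not 0.0 < sampling_ratio <= 1.0:
        raise ValueError(f"samplingRatio must be in (0, 1], got {sampling_ratio}")
    return (
        spark.read
            .option("header", True)
            .option("inferSchema", True)
            .option("samplingRatio", sampling_ratio)
            .csv(path)
    )
```

A call such as `read_csv_sampled(spark, "data.csv", 0.1)` would then apply the 10% setting described above.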

##### 1) For CSV file

##### a) Default behavior / Full inference (no samplingRatio)
- Spark reads **all rows** to infer schema.
- **Most accurate**, but **slow for large data**.

       df_def = spark.read.option("header", True).option("inferSchema", True).csv("/Volumes/@azureadb/pyspark/dataframe/inferschema.csv")
       display(df_def)

| id | name    | age | salry |
| -- | ------- | --- | ----- |
| 1  | Jyoti   | 25  | 50000 |
| 2  | Albert  | 30  | 60000 |
| 3  | Baby    | 35  | 55000 |
| 4  | Chethan | 40  | 70000 |
| 5  | David   | 45  | 45000 |
| 6  | Elango  | 50  | 25000 |
| 7  | Firoj   | 55  | 15000 |
| 8  | Giri    | 60  | 28000 |
| 9  | Hemanth | 65  | 38000 |
| 10 | Indra   | 70  | 43000 |


##### b) Use samplingRatio=0.1 (10%)
- Spark uses only about **10% of rows** and guesses the schema from that sample.
- **Much faster** schema inference (especially if file is **very large**)
- Slightly **higher chance** of **incorrect data types**.

       df_ratio_10 = (
           spark.read
               .option("header", True)
               .option("inferSchema", True)
               .option("samplingRatio", 0.1)  # only about 10% of rows are used to infer the schema
               .csv("/Volumes/@azureadb/pyspark/dataframe/inferschema.csv")
       )

       display(df_ratio_10)

| id | name    | age | salry |
| -- | ------- | --- | ----- |
| 1  | Jyoti   | 25  | 50000 |
| 2  | Albert  | 30  | 60000 |
| 3  | Baby    | 35  | 55000 |
| 4  | Chethan | 40  | 70000 |
| 5  | David   | 45  | 45000 |
| 6  | Elango  | 50  | 25000 |
| 7  | Firoj   | 55  | 15000 |
| 8  | Giri    | 60  | 28000 |
| 9  | Hemanth | 65  | 38000 |
| 10 | Indra   | 70  | 43000 |


##### c) Small ratio (1%) for huge dataset
- **Performance gain:** potentially **up to 10x faster** schema inference; the actual gain depends on the file size and the cluster.
- **Risk:** If your **1%** sample **doesn’t contain all data patterns**, types might be **wrong**.

       df_ratio_1 = (
           spark.read
               .option("header", True)
               .option("inferSchema", True)
               .option("samplingRatio", 0.01)  # only about 1% of rows are used to infer the schema
               .csv("/Volumes/@azureadb/pyspark/dataframe/inferschema.csv")
       )

       display(df_ratio_1)

| id | name    | age | salry |
| -- | ------- | --- | ----- |
| 1  | Jyoti   | 25  | 50000 |
| 2  | Albert  | 30  | 60000 |
| 3  | Baby    | 35  | 55000 |
| 4  | Chethan | 40  | 70000 |
| 5  | David   | 45  | 45000 |
| 6  | Elango  | 50  | 25000 |
| 7  | Firoj   | 55  | 15000 |
| 8  | Giri    | 60  | 28000 |
| 9  | Hemanth | 65  | 38000 |
| 10 | Indra   | 70  | 43000 |


**Compare Results**

| Sampling Ratio | Rows Sampled | Code                                                        | Accuracy                                            | Speed      |
| -------------- | ------------ | ----------------------------------------------------------- | --------------------------------------------------- | ---------- |
| 1.0            | 100%         | Default (option omitted), or `.option("samplingRatio", 1.0)` | ✅ Most accurate                                    | 🐢 Slowest |
| 0.1            | 10%          | `.option("samplingRatio", 0.1)`                             | ⚖️ Good balance; may miss rare types in skipped rows | ⚡ Fast    |
| 0.01           | 1%           | `.option("samplingRatio", 0.01)`                            | ⚠️ May misinfer some types                           | 🚀 Fastest |
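
To check whether a lower ratio actually changed the inferred types, the column types of two reads can be compared. The sketch below is a hypothetical helper (`diff_dtypes` is not a PySpark function) that works on lists of `(column_name, type_string)` pairs, the shape that `DataFrame.dtypes` returns. The example values are hand-written for illustration, not real output.

```python
def diff_dtypes(reference, candidate):
    """Return {column: (reference_type, candidate_type)} for columns whose types differ.

    Both arguments are lists of (column_name, type_string) pairs,
    the shape that `DataFrame.dtypes` returns in PySpark.
    """
    candidate_types = dict(candidate)
    return {
        name: (ref_type, candidate_types.get(name))
        for name, ref_type in reference
        if candidate_types.get(name) != ref_type
    }


# Hand-written dtypes (illustrative values, not real Spark output).
full_scan = [("id", "int"), ("age", "double")]
sampled = [("id", "int"), ("age", "int")]

print(diff_dtypes(full_scan, sampled))  # {'age': ('double', 'int')}
```

With the DataFrames from the cells above, the call would look like `diff_dtypes(df_def.dtypes, df_ratio_10.dtypes)`; an empty dict means the sampled read inferred the same types as the full scan.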

##### d) Observe schema difference

| age column |
|------------|
| 10 |
| 20 |
| 30 |
| N/A |
| 45.5 |

Sampling is random, so if the sampled rows happen to contain only the integer values (10, 20, 30), Spark might infer:

     |-- age: integer (nullable = true)

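But in the full dataset, 45.5 should widen the column to double. This assumes `N/A` is read as null, for example with `.option("nullValue", "N/A")`; otherwise the `N/A` text makes the whole column a string: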

     |-- age: double (nullable = true)
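
The effect can be reproduced without Spark using a simplified stand-in for type inference. This is only a sketch: `infer_type` is a hypothetical function that knows three types and is not Spark's actual inference algorithm, but it widens types in the same direction (integer, then double, then string).

```python
def infer_type(values, null_values=("",)):
    """Return the narrowest of 'integer', 'double', 'string' that fits every value."""
    rank = {"integer": 0, "double": 1, "string": 2}
    result = "integer"
    for value in values:
        if value in null_values:
            continue  # nulls do not influence the inferred type
        try:
            int(value)
            current = "integer"
        except ValueError:
            try:
                float(value)
                current = "double"
            except ValueError:
                current = "string"
        # Keep the widest type seen so far.
        if rank[current] > rank[result]:
            result = current
    return result


age = ["10", "20", "30", "N/A", "45.5"]

print(infer_type(age[:3]))                       # integer (sample missed 45.5)
print(infer_type(age, null_values=("", "N/A")))  # double  (N/A treated as null)
print(infer_type(age))                           # string  (N/A kept as text)
```

The first call mimics a sample that missed the `45.5` row, which is the situation where a low ratio gives a narrower type than the data needs.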

##### Best Practice Recommendations

| Use Case                     | Recommended samplingRatio                    |
| ---------------------------- | -------------------------------------------- |
| Small dataset (<100 MB)      | `1.0` (no sampling)                          |
| Medium dataset (100 MB–1 GB) | `0.2` or `0.3`                               |
| Large dataset (>1 GB)        | `0.05` to `0.1`                              |
| Very large dataset (>10 GB)  | `0.01` or less, but validate schema manually |
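
The table can be turned into a small lookup. This is a sketch only: `suggest_sampling_ratio` is a hypothetical helper, and where the table gives a range it picks one value from it (`0.2` and `0.1`), which is an arbitrary choice.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def suggest_sampling_ratio(size_bytes):
    """Map a file size in bytes to a samplingRatio following the table above."""
    if size_bytes < 100 * MB:
        return 1.0   # small: no sampling
    if size_bytes <= 1 * GB:
        return 0.2   # medium: table suggests 0.2 or 0.3
    if size_bytes <= 10 * GB:
        return 0.1   # large: table suggests 0.05 to 0.1
    return 0.01      # very large: validate the schema manually afterwards


print(suggest_sampling_ratio(500 * MB))  # 0.2
```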

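##### 2) For JSON file
- Spark will **sample about 5%** of the records in **big_data.json** to determine column **data types**.
- The JSON reader **always infers the schema** when none is supplied, so **inferSchema** is not needed.

       df_json = (
           spark.read
                .option("samplingRatio", 0.05)  # infer types from about 5% of records
                .json("big_data.json")
       )

       df_json.printSchema()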


##### 3) With Parquet (ignored)
- For Parquet, this **option has no effect**, because Parquet files **already store schema** in metadata.
- You **don’t** need **inferSchema or samplingRatio**.

     df_parquet = (
         spark.read
              .option("samplingRatio", 0.1)  # has no effect: the schema comes from Parquet metadata
              .parquet("data.parquet")
     )

##### 4) When samplingRatio helps performance

##### Let’s simulate a big dataset:
     df = spark.range(0, 10000000).withColumnRenamed("id", "age")
     df.write.csv("huge_file.csv", header=True)

##### Now, read with full and sampled schema inference:
     
     # Full scan (slow)
     df_full = spark.read.option("header", True).option("inferSchema", True).csv("huge_file.csv")

     # 5% sample scan (faster)
     df_sample = spark.read.option("header", True).option("inferSchema", True).option("samplingRatio", 0.05).csv("huge_file.csv")

**Result**

| Setting              | Inference Time (approx) | Accuracy                                      |
| -------------------- | ----------------------- | --------------------------------------------- |
| No samplingRatio     | 12–15 sec               | ✅ 100% accurate                               |
| samplingRatio = 0.05 | 3–4 sec                 | ✅ Same accuracy (since all numeric)           |
| samplingRatio = 0.01 | 1–2 sec                 | ⚠️ Slight risk of mis-inference if mixed data |

##### Best Practice
Use **samplingRatio** when:
- The **file is large**.
- You need **quicker schema inference**.

     spark.read.option("header", True).option("inferSchema", True).option("samplingRatio", 0.1).csv("data.csv")