### spark.read
Sources and formats:
- jdbc
- csv
- orc
- parquet
- table
- text
- xml
- json

Other reader methods:
- option
- options
- schema
- load
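
The generic reader methods listed above compose into a single chain, and a shortcut such as `csv()` or `parquet()` is shorthand for `format(...).load(...)`. A minimal sketch of that chain follows; the helper name `read_any`, the path, and the schema string are placeholders invented for illustration, and `spark` is assumed to be an existing SparkSession.

```python
def read_any(spark, fmt, path, schema=None, **options):
    """Build a DataFrameReader step by step and load `path`.

    `spark` is assumed to be an existing SparkSession; `read_any` itself is
    a hypothetical helper, not part of the Spark API.
    """
    reader = spark.read.format(fmt)          # e.g. "csv", "json", "parquet"
    if options:
        reader = reader.options(**options)   # several options in one call
    if schema is not None:
        reader = reader.schema(schema)       # DDL string or StructType
    return reader.load(path)


# Hypothetical usage (needs a running SparkSession named `spark`):
# df = read_any(spark, "csv", "/tmp/people.csv",
#               schema="id INT, name STRING", header="true", sep="|")
```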



#### spark.read.csv

- `header`, `inferSchema`, `samplingRatio`, `sep`, `lineSep`
- `quote`, `escape`
- `nullValue`, `emptyValue`
- `mode`, `columnNameOfCorruptRecord`
- `comment`, `ignoreLeadingWhiteSpace`, `ignoreTrailingWhiteSpace`
- `dateFormat`, `timestampFormat`
- `maxColumns`
- `multiLine`




Important options:
- **header** -> Default is false. Set to true when the first row of the data should be treated as the header: `option("header", "true")`
- **inferSchema** -> Without this, every column is read as a string. When enabled, Spark makes an extra pass over the file to infer column types (not recommended for large files): `option("inferSchema", "true")`
- **samplingRatio** -> Works together with inferSchema: instead of scanning the whole file, Spark infers the schema from the given fraction of rows (0.1 = 10%): `option("inferSchema", "true").option("samplingRatio", 0.1)`
- **sep** -> Column separator, default is `,`: `option("sep", "|")`
- **lineSep** -> Line separator. By default `\r`, `\r\n` and `\n` are all recognized: `option("lineSep", "|")`
- **quote and escape** ->

  Example CSV:

      id,name,description
      1,Alice,"Senior \"Data Engineer\", Spark expert"

  Reader:

      df = (
          spark.read
              .option("header", "true")
              .option("quote", '"')      # quote is needed because , appears inside a value
              .option("escape", "\\")    # escape is needed because a quote appears inside quotes
              .csv(path)
      )

  - ✔ Use quote when the text contains the delimiter or a newline
  - ✔ Use escape when the text contains the quote character itself

- **nullValue and emptyValue**
    - By default an empty (unquoted) field is read as null.
    - A field containing the text `null` is read as an ordinary string unless `null` is set as the `nullValue`.
    - The token given in `nullValue` is read as null, in addition to empty fields.
    - `nullValue` takes a single string. Calling `.option("nullValue", ...)` twice does not register two tokens; the last call wins.
    - ✔ These options apply globally to all columns while reading the CSV.
    - ✔ Empty CSV field → NULL (by default)
    - ✔ `nullValue` handles an explicit token such as `NA` or `NULL`
    - ✔ `emptyValue` sets the value used for empty strings (a quoted `""` field); an unquoted empty field is still read as null

    Example CSV:

        id,value
        1,
        2,null
        3,NA
        4,100

    Reading with `nullValue`:

        df = (
            spark.read
                .option("header", "true")
                .option("nullValue", "NA")    # row 3 becomes null; "null" in row 2 stays a string
                .csv("/path/file.csv")
        )

    With `emptyValue`:

        df = (
            spark.read
                .option("header", "true")
                .option("emptyValue", "0")    # a quoted empty string ("") is read as "0"
                .option("nullValue", "NA")
                .csv("/path/file.csv")
        )

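- **mode** -> Defines behavior when a row cannot be parsed.

  | Mode                   | Behavior                                          |
  | ---------------------- | ------------------------------------------------- |
  | `PERMISSIVE` (default) | Keeps the row; malformed fields are set to `null` |
  | `DROPMALFORMED`        | Drops bad rows                                    |
  | `FAILFAST`             | Throws an exception on the first bad row          |


- **columnNameOfCorruptRecord** -> In `PERMISSIVE` mode, stores the raw text of a bad row in a separate string column. The column has to be included in a user-supplied schema to be populated.

      .option("columnNameOfCorruptRecord", "_corrupt")

  ✔ Useful for debugging dirty data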

- **comment** -> Skips lines that start with the given character (a single character; disabled by default).

      .option("comment", "@")

  Example:

      1,Kavi,30
      @this is a comment
      2,Alice,25

- **ignoreLeadingWhiteSpace / ignoreTrailingWhiteSpace**

      .option("ignoreLeadingWhiteSpace", "true")
      .option("ignoreTrailingWhiteSpace", "true")

  ✔ Strips extra spaces before and after each value

- **dateFormat / timestampFormat**

      .option("dateFormat", "yyyy-MM-dd")
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")

  ✔ Needed when date or timestamp columns do not use the default patterns

- **maxColumns** -> Hard limit on the number of columns a record can have (default 20480).

      .option("maxColumns", "20480")

  Guards against malformed, extremely wide rows crashing Spark

- **multiLine** -> Allows a single record to span multiple lines.

  Without multiLine, the record below is split at the line break and parsed as 2 records:

      1,John,"Hello
      How are you?"

  With `.option("multiLine", "true")`, Spark treats the line break inside the quotes as part of the value and reads one row.
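
The quote, escape, multi-line, and corrupt-record options above can be tried together on a tiny sample. The sketch below is illustrative only: Python's `csv` module stands in for the parser to show what the parsed fields should look like, and `read_sample_csv`, `CSV_OPTIONS`, `CSV_SCHEMA`, and the `_corrupt` column name are hypothetical names. `spark` is assumed to be an existing SparkSession.

```python
import csv
import io

# Sample rows from the notes: one value with an escaped quote and a comma
# inside quotes, and one quoted value that spans two physical lines.
SAMPLE = (
    'id,name,description\n'
    '1,Alice,"Senior \\"Data Engineer\\", Spark expert"\n'
    '2,John,"Hello\nHow are you?"\n'
)

# Stand-in parser (not Spark): shows the fields that quote + escape +
# multi-line handling are expected to produce.
rows = list(csv.reader(io.StringIO(SAMPLE), quotechar='"',
                       escapechar="\\", doublequote=False))

# Spark reader settings using the option names described above.
CSV_OPTIONS = {
    "header": "true",
    "quote": '"',
    "escape": "\\",
    "multiLine": "true",
    "mode": "PERMISSIVE",
    "columnNameOfCorruptRecord": "_corrupt",
}

# The corrupt-record column must be part of the schema to be populated.
CSV_SCHEMA = "id INT, name STRING, description STRING, _corrupt STRING"


def read_sample_csv(spark, path):
    """Read `path` with the options above; `spark` is an existing SparkSession."""
    return spark.read.options(**CSV_OPTIONS).schema(CSV_SCHEMA).csv(path)
```

Passing an explicit schema also avoids the extra pass that `inferSchema` would make over the file.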
  







#### spark.read.orc and spark.read.parquet

⚠️ Important upfront
There are very few true reader options. ORC and Parquet are self-describing (the schema is stored in the files), so most behavior is controlled by Spark configs, not .option().

- **mergeSchema** (common for ORC and Parquet)
    - `.option("mergeSchema", "true")`
    - Merges the schemas of all files being read
    - Default: false
    - Costly (reads the metadata of all files)
- **pathGlobFilter** (common for ORC and Parquet)
    - `.option("pathGlobFilter", "*.orc")`
    - Reads only files whose names match the pattern
    - Useful when a directory has mixed files

- **recursiveFileLookup** (common for ORC and Parquet)
    - `.option("recursiveFileLookup", "true")`
    - Recursively reads files from subdirectories
    - ❌ Not for Hive-style partitioned data (partition discovery is disabled)

- **basePath** (common for ORC and Parquet)
    - `.option("basePath", "/data/sales")`
    - Needed when reading a partition subdirectory or a wildcard path while keeping the partition columns
    - Helps Spark correctly infer partition columns

- **schema** (common for ORC and Parquet)
    - `.schema(custom_schema)`
    - Explicitly provides the schema
    - Rarely needed (the files already store their schema)

- **datetimeRebaseMode** (documented as a Parquet reader option)
    - `.option("datetimeRebaseMode", "LEGACY")`
    - For legacy files written by old Spark/Hive versions
    - Values: `EXCEPTION`, `CORRECTED`, `LEGACY`
    - Default: the value of `spark.sql.parquet.datetimeRebaseModeInRead`
- **int96RebaseMode** (Parquet specific)
    - `.option("int96RebaseMode", "LEGACY")`
    - Handles legacy INT96 timestamp columns
    - This option is Parquet-specific (ORC does not have INT96).
    - Values: `EXCEPTION`, `CORRECTED`, `LEGACY`
    - Default: the value of `spark.sql.parquet.int96RebaseModeInRead`

⚙️ ORC behavior controlled via Spark configs (NOT .option())

These are important but separate:

- spark.sql.orc.filterPushdown
- spark.sql.orc.enableVectorizedReader
- spark.sql.orc.mergeSchema

⚙️ Parquet behavior controlled via Spark configs (NOT .option())

These are very important in real projects:

- spark.sql.parquet.filterPushdown
- spark.sql.parquet.enableVectorizedReader
- spark.sql.parquet.mergeSchema
- spark.sql.parquet.binaryAsString
- spark.sql.parquet.int96AsTimestamp

Example:

    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
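
To make the split between reader options and session configs concrete, here is a minimal sketch that uses both in one read. The directory layout `/data/sales/year=2024/...` and the names `PARQUET_OPTIONS` and `read_sales_2024` are assumptions made up for this example; `spark` is assumed to be an existing SparkSession.

```python
# Reader-level options: they apply to this one read only.
PARQUET_OPTIONS = {
    "mergeSchema": "true",          # reconcile differing file schemas
    "pathGlobFilter": "*.parquet",  # skip other files in the directory
    "basePath": "/data/sales",      # keep partition columns for a sub-path read
}


def read_sales_2024(spark):
    """Read one partition sub-tree while keeping `year` as a column.

    `spark` is assumed to be an existing SparkSession; the paths are
    hypothetical.
    """
    # Session-level config: affects every Parquet scan in this session.
    spark.conf.set("spark.sql.parquet.filterPushdown", "true")
    return spark.read.options(**PARQUET_OPTIONS).parquet("/data/sales/year=2024")
```

Without `basePath`, reading `/data/sales/year=2024` directly would treat that directory as the table root, so `year` would not appear as a column.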


#### spark.read.table
spark.read.table() takes ONLY a table name as input and returns a DataFrame built from the table's catalog metadata. No options, paths, or format are needed.

These are equivalent:

- spark.read.table("sales.orders")
- spark.sql("SELECT * FROM sales.orders")

Difference:
- table() → DataFrame API
- sql() → SQL string

    df = (
        spark.read.table("main.finance.transactions")
             .filter("txn_date >= '2024-01-01'")
             .select("txn_id", "amount")
    )

- ✔ Governance
- ✔ Unity Catalog permissions
- ✔ Lineage tracking

- What happens internally (important)
    - When you run `spark.read.table("sales.orders")`, Spark:
        - Looks up the table in the catalog
        - Reads the table metadata
        - Finds:
            - Storage location
            - File format (Delta / Parquet / ORC)
            - Schema
            - Partitions
        - Uses the correct reader automatically
    - 👉 You do NOT specify format or path.


#### spark.read.text

      df = (
          spark.read
              .option("wholetext", "false")
              .option("lineSep", "\n")
              .option("pathGlobFilter", "*.txt")
              .option("recursiveFileLookup", "false")
              .option("encoding", "UTF-8")
              .text("/path/to/text/files")
      )


🔹 Options for spark.read.text()
- **wholetext**
    - `.option("wholetext", "true")`
    - Entire file becomes ONE row
    - Default: false

- **lineSep**
    - `.option("lineSep", "\n")`
    - Custom line separator
    - Default: `\r`, `\r\n` and `\n` are all recognized

- **pathGlobFilter**
    - .option("pathGlobFilter", "*.log")

- **recursiveFileLookup**
    - .option("recursiveFileLookup", "true")

- **encoding**
    - .option("encoding", "UTF-8")
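
As a mental model of `wholetext` and `lineSep`, the pure-Python sketch below mimics the rows one would expect in the single `value` column. It is not Spark's implementation, and `text_rows` is a hypothetical helper written only to illustrate the row layout.

```python
def text_rows(content, wholetext=False, line_sep="\n"):
    """Return the rows a text read is expected to produce for one file.

    Hypothetical model for illustration; Spark itself is not involved.
    """
    if wholetext:
        return [content]             # one row holding the entire file
    rows = content.split(line_sep)
    if rows and rows[-1] == "":      # a trailing separator adds no extra row
        rows.pop()
    return rows


line_mode = text_rows("first\nsecond\n")                   # one row per line
whole_mode = text_rows("first\nsecond\n", wholetext=True)  # one row per file
```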




#### spark.read.jdbc

Reads data from a relational database (MySQL, Postgres, Oracle, SQL Server, etc.) using JDBC and returns a DataFrame.

- **IN (Inputs)**
      ✅ Required parameters (minimum)

      spark.read.jdbc(
          url="jdbc:mysql://host:3306/db",
          table="orders",
          properties={
              "user": "username",
              "password": "password",
              "driver": "com.mysql.cj.jdbc.Driver"
          }
      )
- **Function signatures (important)**
    - **Table-based read**

          jdbc(url, table, properties=properties)

    - **Query-based read**

          jdbc(url, "(select * from orders) t", properties=properties)

      ⚠️ The query MUST be wrapped in parentheses and aliased.
- **OUT (Output)**
    - A DataFrame
    - Schema inferred from DB metadata
    - Rows fetched via JDBC

- **ALL valid spark.read.jdbc() options**

  These can be passed via .option() or properties.

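  **Connection options**

  | Option     | Meaning                |
  | ---------- | ---------------------- |
  | `url`      | JDBC URL               |
  | `dbtable`  | Table name or subquery |
  | `user`     | DB username            |
  | `password` | DB password            |
  | `driver`   | JDBC driver class      |

  **Parallel read (VERY IMPORTANT)**

  | Option            | Purpose                                                     |
  | ----------------- | ----------------------------------------------------------- |
  | `partitionColumn` | Column used to split the data                               |
  | `lowerBound`      | Min value, used to compute the partition ranges             |
  | `upperBound`      | Max value, used to compute the partition ranges             |
  | `numPartitions`   | Number of partitions, and so the max parallel connections   |

  The bounds only decide how the ranges are cut; they do not filter rows. Values outside the bounds land in the first or last partition.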

  Example:

      df = (
          spark.read
              .format("jdbc")    # without this, load() uses the default source (Parquet)
              .option("url", url)
              .option("dbtable", "orders")
              .option("partitionColumn", "id")
              .option("lowerBound", 1)
              .option("upperBound", 100000)
              .option("numPartitions", 10)
              .load()
      )

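  **Alternative partitioning**

  `predicates` is a list of WHERE clauses; each clause becomes one partition. It is an argument of `jdbc()`, not an `.option()`, and in PySpark it has to be passed by keyword:

      predicates = [
          "country = 'US'",
          "country = 'IN'",
          "country = 'UK'"
      ]

      df = spark.read.jdbc(url, "orders", predicates=predicates, properties=properties)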

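  **Fetching & performance**

  | Option         | Purpose                                  |
  | -------------- | ---------------------------------------- |
  | `fetchsize`    | Rows fetched per round trip to the DB    |
  | `batchsize`    | Rows per insert batch (write side only)  |
  | `queryTimeout` | Statement timeout in seconds             |

  **Schema & types**

  | Option              | Purpose                       |
  | ------------------- | ----------------------------- |
  | `customSchema`      | Override column types on read |
  | `pushDownPredicate` | Push filters down to the DB   |

  **Security**

  | Option                 | Purpose                                                                                   |
  | ---------------------- | ----------------------------------------------------------------------------------------- |
  | `ssl`                  | Enable SSL. This is a driver-specific connection property, so the exact name depends on the JDBC driver |
  | `sessionInitStatement` | SQL run after each database session is opened, before reading starts                     |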

- **template (BEST PRACTICE)**

      jdbc_url = "jdbc:mysql://host:3306/sales"
      props = {
          "user": "user",
          "password": "password",
          "driver": "com.mysql.cj.jdbc.Driver"
      }

      df = (
          spark.read
              .format("jdbc")                  # required when using load()
              .option("url", jdbc_url)
              .option("dbtable", "orders")
              .options(**props)                # user, password, driver
              .option("partitionColumn", "id")
              .option("lowerBound", "1")
              .option("upperBound", "100000")
              .option("numPartitions", "8")
              .option("fetchsize", "1000")
              .load()
      )
- **Common mistakes ❌**
    - ❌ Not using partitioning → the whole table is read through a single connection
    - ❌ Using a partitionColumn that is not numeric, date, or timestamp
    - ❌ Forgetting the alias on a subquery
    - ❌ Pulling huge tables without filters
- **When to use JDBC**
    - ✅ Small–medium tables
    - ✅ Reference / dimension data
    - ❌ Massive fact tables (prefer dumps to Parquet)
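
Two of the JDBC points above are easier to see in code: how `partitionColumn`, `lowerBound`, `upperBound`, and `numPartitions` turn into one WHERE clause per partition, and how to wrap a query so it can be used as `dbtable`. The sketch below is a simplified approximation, assuming integer bounds and a plain integer stride; Spark's exact stride arithmetic differs slightly between versions. Both helper names are hypothetical.

```python
def partition_where_clauses(column, lower, upper, num_partitions):
    """Approximate the WHERE clause used for each JDBC partition query.

    Simplified model (integer bounds, stride = (upper - lower) // n);
    not Spark's exact implementation.
    """
    if lower > upper:
        raise ValueError("lowerBound must not be larger than upperBound")
    # No more partitions than there are distinct values in the range.
    num_partitions = min(num_partitions, upper - lower)
    if num_partitions <= 1:
        return [None]  # a single, unfiltered query
    stride = (upper - lower) // num_partitions
    clauses = []
    current = lower
    for i in range(num_partitions):
        lower_part = f"{column} >= {current}" if i != 0 else None
        current += stride
        upper_part = f"{column} < {current}" if i != num_partitions - 1 else None
        if upper_part is None:
            clauses.append(lower_part)  # last partition: open-ended above
        elif lower_part is None:
            # First partition: open-ended below, and it also picks up NULLs.
            clauses.append(f"{upper_part} or {column} is null")
        else:
            clauses.append(f"{lower_part} AND {upper_part}")
    return clauses


def as_dbtable(query, alias="t"):
    """Wrap a SELECT in parentheses with an alias so it can serve as dbtable."""
    return f"({query.strip().rstrip(';')}) {alias}"


clauses = partition_where_clauses("id", 0, 100, 4)
subquery = as_dbtable("select * from orders where amount > 0;")
```

The open-ended first and last clauses are why the bounds do not filter rows, and why skewed or badly chosen bounds produce partitions of very different sizes.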