
### Reading & Writing Files

Spark reads and writes many file formats through its `DataFrameReader` (`spark.read`) and `DataFrameWriter` (`df.write`) APIs, which are the basis for data ingestion and persistence.

#### File Formats: CSV, JSON, Parquet

Spark supports a multitude of file formats, with **CSV**, **JSON**, and **Parquet** being among the most common.

**Comparison of Common File Formats in Spark:**

| Feature                | CSV (Comma Separated Values)                                  | JSON (JavaScript Object Notation)                                   | Parquet                                                               |
| :--------------------- | :------------------------------------------------------------ | :------------------------------------------------------------------ | :-------------------------------------------------------------------- |
| **Data Structure**     | Row-based, plain text                                         | Semi-structured, row-based, plain text                              | **Columnar**, binary, self-describing                                 |
| **Human-Readable**     | Yes                                                           | Yes                                                                 | No (requires tools/Spark to read)                                     |
| **Schema Enforcement** | No (schema must be inferred or provided)                      | Flexible (can be error-prone with inconsistent data)                | **Yes** (built-in schema stored with data)                            |
| **Compression**        | No built-in compression (can be applied externally)           | No built-in compression (can be applied externally)                 | **Highly efficient built-in compression and encoding**                |
| **Nested Data**        | Difficult/impossible to represent                               | Good support for nested structures                                  | **Excellent support** for complex nested data                         |
| **Read Performance**   | Poor for large datasets (schema inference needs an extra pass over the data; every row is parsed in full) | Moderate (parsing overhead, still row-by-row processing)          | **Excellent** for analytical queries (reads only necessary columns, predicate pushdown) |
| **Schema Evolution**   | Poor                                                          | Fair (flexible schema, but requires careful handling)               | **Excellent** (built-in support)                                      |
| **Use Cases**          | Small data, ad-hoc analysis, data exchange with other systems | Log data, API responses, flexible data interchange                  | **Recommended for big data storage & analytics**, ETL intermediate stages, Data Lakes |
| **Spark Optimization** | Limited                                                       | Limited                                                             | **Highly optimized** for Spark and other columnar engines             |

---

#### Read Options:

When reading files, you can specify various options to control how Spark interprets the data:

*   **`header`**: (`True`/`False`) Indicates if the first row is a header (Default: `False`).
*   **`inferSchema`**: (`True`/`False`)
    *   `True`: Spark makes an extra pass over the input to infer column types (by default it scans all rows; the `samplingRatio` option can reduce this). **Caution**: Can be slow for large files.
    *   `False`: All columns will be read as `StringType`.
    *   **Best Practice**: For production, always define schemas manually (`StructType`) for performance and reliability.
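*   **`multiLine`**: (`True`/`False`) (Primarily for JSON; also available for CSV). Option keys passed through `.option()` are case-insensitive, but the PySpark keyword argument is spelled `multiLine`.
    *   `True`: Parses each file as a whole, so a JSON record (or a top-level array of records) may span multiple lines.
    *   `False`: Expects one JSON object per line (JSON Lines). (Default: `False` for JSON).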
*   **`sep`**: Specifies the column delimiter for CSV files (Default: `,`).
*   **`compression`**: Codec to use when writing (e.g., `gzip`, `snappy`, `lz4`, `bzip2`). On read, no option is needed: Spark detects the codec from the file extension (e.g., `.gz`).

---

**Example (Python): Reading Files**


from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ReadWriteFiles").getOrCreate()

# Create dummy files for demonstration
# CSV data
csv_data = """id|name|age
1|Alice|30
2|Bob|25
3|Charlie|35
"""
with open("data.csv", "w") as f:
    f.write(csv_data)

# JSON Lines data: one complete JSON object per line, which is what Spark expects by default (multiLine=False)
json_data = """{"id":1,"name":"Alice", "details":{"age":30, "city":"NY"}}
{"id":2,"name":"Bob", "details":{"age":25, "city":"LD"}}
"""
with open("data.json", "w") as f:
    f.write(json_data)

# --- CSV Read Options ---
print("Reading CSV with header and custom separator:")
df_csv = spark.read.csv("data.csv", header=True, inferSchema=True, sep="|")
df_csv.printSchema()
df_csv.show()

# --- JSON Read Options ---
print("\nReading JSON with multiline=False (default, for line-delimited JSON):")
# Note: With the default multiLine=False, Spark parses every line of the file as a separate JSON record.
# To read a JSON document that spans multiple lines, pass multiLine=True instead.
df_json = spark.read.json("data.json")
df_json.printSchema()
df_json.show(truncate=False)

# --- Demonstrate Compression (Write then Read) ---
# Write gzip-compressed CSV. Spark creates a directory at this path that holds part-*.csv.gz files.
df_csv.write.option("compression", "gzip").csv("data_compressed.csv.gz",
                                              mode="overwrite", header=True)
print("\nReading compressed CSV:")
# No compression option is needed on read: Spark detects gzip from the .gz extension of the part files.
df_compressed = spark.read.csv("data_compressed.csv.gz", header=True, inferSchema=True)
df_compressed.printSchema()
df_compressed.show()

spark.stop()

Reading CSV with header and custom separator:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  2|    Bob| 25|
|  3|Charlie| 35|
+---+-------+---+


Reading JSON with multiline=False (default, for line-delimited JSON):
root
 |-- details: struct (nullable = true)
 |    |-- age: long (nullable = true)
 |    |-- city: string (nullable = true)
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

+--------+---+-----+
|details |id |name |
+--------+---+-----+
|{30, NY}|1  |Alice|
|{25, LD}|2  |Bob  |
+--------+---+-----+


Reading compressed CSV:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 30|
|  2|    Bob| 25|
|  3|Charlie| 35|
+---+-------+---+



---

#### Write Options:

When writing DataFrames to files, you have control over the output behavior:

*   **`mode`**: Specifies the save mode for existing data.
    *   `overwrite`: Overwrites existing data/directory.
    *   `append`: Appends new data to existing data.
    *   `ignore`: If data already exists, the write operation does nothing.
    *   `errorIfExists` (Default): Throws an error if data already exists.
*   **`partitionBy(column_names)`**:
    *   Partitions the output data by the values of specified columns.
    *   Creates subdirectories in the output path (e.g., `output/city=NY/`, `output/city=LD/`).
    *   **Benefit**: Improves query performance because filters on the partition columns let Spark skip irrelevant partition directories entirely (partition pruning).
*   **`bucketBy(num_buckets, column_names)`**:
    *   Buckets the output data by hashing the specified columns into a fixed number of buckets.
    *   **Benefit**: Improves join performance (co-located data for joins) and sampling.
    *   **Requirement**: Only works with `saveAsTable()` (e.g., a Hive table); path-based writes such as `.parquet(path)` or `.save(path)` do not support bucketing.
*   **`.option(key, value)` vs `.options(**kwargs)`**:
    *   `.option(key, value)`: Sets a single write option.
    *   `.options(**kwargs)`: Sets multiple options in one call (the keyword-argument form is Python-specific).

---

**Example (Python): Writing Files**

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WriteOptions").getOrCreate()

# Create a sample DataFrame
data = [("Alice", 30, "NY"), ("Bob", 25, "LD"), ("Charlie", 35, "NY"), ("David", 22, "SF")]
columns = ["Name", "Age", "City"]
df = spark.createDataFrame(data, columns)
df.show()

output_base_path = "output_data"

# --- Write to CSV with overwrite mode ---
print("\nWriting CSV with overwrite mode...")
# This will create 'output_data/csv_output_overwrite' directory
df.write.mode("overwrite").csv(f"{output_base_path}/csv_output_overwrite", header=True)

# --- Write to Parquet, partitioned by 'City' ---
print("\nWriting Parquet partitioned by 'City'...")
# This will create directories like 'output_data/parquet_partitioned/City=NY/'
df.write.mode("overwrite").partitionBy("City").parquet(f"{output_base_path}/parquet_partitioned")

# --- Write to JSON with append mode (demonstrating mode) ---
# For append, run this block twice. The first run creates the output directory.
# The second run adds new part files to that directory; existing files are left untouched.
print("\nWriting JSON with append mode...")
df.write.mode("append").json(f"{output_base_path}/json_output_append")

# --- Using .options() to set several options in one call (keyword arguments in Python) ---
print("\nWriting CSV using .options():")
df.write.mode("overwrite").options(header=True, sep=",").csv(f"{output_base_path}/csv_output_options")

print(f"\nWrite operations completed. Check your '{output_base_path}' directory.")

spark.stop()

+-------+---+----+
|   Name|Age|City|
+-------+---+----+
|  Alice| 30|  NY|
|    Bob| 25|  LD|
|Charlie| 35|  NY|
|  David| 22|  SF|
+-------+---+----+


Writing CSV with overwrite mode...

Writing Parquet partitioned by 'City'...

Writing JSON with append mode...

Writing CSV using .options():

Write operations completed. Check your 'output_data' directory.
