# Read Modes in Spark

When Spark reads a file (CSV, JSON, Parquet…), the data sometimes contains:

* wrong data types

* missing columns

* malformed rows

* corrupted records

* extra columns

Read modes tell Spark what to do when it encounters bad records.

In [3]:
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("RDD_Operations") \
    .getOrCreate()

## FAILFAST

If Spark encounters even one bad record, it stops and throws an error.

In [7]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType, BooleanType

In [8]:
schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("city", StringType(), True),
    StructField("state", StringType(), True),
    StructField("country", StringType(), True),
    StructField("registration_date", StringType(), True),
    StructField("is_active", BooleanType(), True)
])

In [10]:
dataset_path = "/content/customers.csv"

df_failfast = spark.read.format("csv").option('header','true').schema(schema).option('mode','FAILFAST').load(dataset_path)
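Because Spark evaluates lazily, the `load()` call above usually returns without complaint even when the file contains bad rows; the failure only surfaces once an action forces the rows to be parsed. The sketch below is one way to observe that. The helper name `failfast_error` is hypothetical (not part of Spark), and it assumes the file is small enough to `collect()` on the driver.

```python
def failfast_error(spark, path, schema):
    """Return the error text raised by a FAILFAST read, or None if every row parses.

    Hypothetical helper for illustration; `spark` is an existing SparkSession.
    """
    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("mode", "FAILFAST")
        .schema(schema)
        .load(path)  # lazy: with an explicit schema, no rows are parsed yet
    )
    try:
        # collect() is an action that needs every column of every row,
        # so a malformed record anywhere in the file is reached.
        df.collect()
    except Exception as exc:  # the concrete exception class varies by Spark version
        return str(exc)
    return None
```

With the notebook's own variables this would be called as `failfast_error(spark, dataset_path, schema)`; a non-`None` result means at least one record did not match the schema.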

## PERMISSIVE (default)

* Spark tries to read all rows.
* If a row is bad, Spark sets the fields it cannot parse to null instead of failing.


In [12]:
df_permissive = spark.read.format("csv").option('header','true').schema(schema).option('mode','PERMISSIVE').load(dataset_path)
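In PERMISSIVE mode the nulls alone do not tell you which rows were malformed. Spark can also keep the raw text of a bad row if the schema declares a string column whose name matches the `columnNameOfCorruptRecord` option (default `_corrupt_record`). A minimal sketch, assuming the same customer columns as above written as a DDL string; the function name `read_permissive_with_raw` and the constant `CUSTOMER_DDL` are made up for this example.

```python
# DDL form of the customer schema defined earlier (assumed equivalent).
CUSTOMER_DDL = (
    "customer_id INT, name STRING, city STRING, state STRING, "
    "country STRING, registration_date STRING, is_active BOOLEAN"
)


def read_permissive_with_raw(spark, path, ddl_schema=CUSTOMER_DDL,
                             corrupt_col="_corrupt_record"):
    """Read a CSV in PERMISSIVE mode and keep the raw text of malformed rows.

    Hypothetical helper for illustration; `spark` is an existing SparkSession.
    """
    # The corrupt-record column has to be declared in the schema as a string field.
    full_schema = f"{ddl_schema}, {corrupt_col} STRING"
    df = (
        spark.read.format("csv")
        .option("header", "true")
        .option("mode", "PERMISSIVE")
        .option("columnNameOfCorruptRecord", corrupt_col)
        .schema(full_schema)
        .load(path)
    )
    # Caching first sidesteps Spark's restriction on queries over raw files
    # that reference only the corrupt-record column.
    return df.cache()
```

A possible follow-up is `read_permissive_with_raw(spark, dataset_path).filter("_corrupt_record IS NOT NULL").show(truncate=False)`, which lists only the rows Spark could not fully parse.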

## DROPMALFORMED
* 👉 Spark drops bad rows completely.

* Example:

     * If 5 rows are corrupted → Spark drops them → they disappear from the DataFrame.

* ⚠️ Not recommended because you lose data silently.

In [16]:
df_dropmalformed = spark.read.format("csv").option('header','true').schema(schema).option('mode','DROPMALFORMED').load(dataset_path)
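Since DROPMALFORMED discards rows without any message, it can help to measure how many rows went missing. The sketch below reads the same file in PERMISSIVE and DROPMALFORMED mode and compares the row counts. The helper `count_dropped_rows` is hypothetical and assumes a file small enough to `collect()`. It deliberately avoids a bare `count()`: since Spark 2.4 a CSV row is only treated as malformed if the bad value sits in a column the query actually requests, so a count that needs no columns may not drop anything.

```python
def count_dropped_rows(spark, path, schema):
    """Return how many rows DROPMALFORMED discards compared with PERMISSIVE.

    Hypothetical helper for illustration; `spark` is an existing SparkSession.
    """
    def load(mode):
        return (
            spark.read.format("csv")
            .option("header", "true")
            .option("mode", mode)
            .schema(schema)
            .load(path)
        )

    # collect() requests every column, so every field of every row is parsed.
    total_rows = len(load("PERMISSIVE").collect())
    clean_rows = len(load("DROPMALFORMED").collect())
    return total_rows - clean_rows
```

Called as `count_dropped_rows(spark, dataset_path, schema)`, a result of `0` means no row was dropped; anything larger is the number of records that would silently vanish.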