# DataFrame partition and distribution process
- In distributed storage systems (HDFS, S3, etc.), a dataset is stored as blocks or objects spread across multiple nodes, so it is already naturally partitioned.

- When we read data into Spark, it's loaded into memory as partitions — think of each partition as a smaller, manageable chunk of the overall DataFrame that can be processed independently and in parallel.

- The Driver program (your main Spark application entry point) issues the read. Using metadata from the storage system (file listing, sizes, and format), it constructs a logical plan for the DataFrame and determines how the data will be split into partitions.

- At this stage (before any action), the DataFrame is lazily evaluated — Spark only builds the execution plan (DAG) and knows what needs to be done, but nothing is executed yet.

- When you trigger an action (e.g., show(), collect(), or a write), the Driver turns the plan into tasks, one per partition. The executors that run them are requested from the cluster manager (such as YARN or Spark Standalone), normally when the application starts, or on demand if dynamic allocation is enabled.

- The cluster manager launches executors on worker nodes, and the Driver schedules tasks on them, so each executor processes one or more partitions of the DataFrame.

- The Driver coordinates the task execution across nodes and eventually collects the results (if required).

- You can control the number of partitions with .repartition() (performs a full shuffle; can increase or decrease the count) or .coalesce() (avoids a full shuffle; can only decrease the count).

- The number of partitions affects parallelism: more partitions allow more tasks to run at the same time, up to the number of available executor cores. Beyond that point, extra partitions mainly add scheduling overhead.
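
The difference between the two calls can be illustrated with a toy model in plain Python. This is not the Spark API: the helper names `toy_repartition` and `toy_coalesce` are hypothetical, partitions are ordinary lists, and real Spark decides row placement internally. The sketch only assumes that a repartition may move every row, while a coalesce merges whole existing partitions.

```python
# Toy model only: partitions are plain Python lists, not Spark objects.

def toy_repartition(partitions, n):
    """Full shuffle: every row may move; rows are dealt round-robin into n partitions."""
    rows = [row for part in partitions for row in part]
    new_parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        new_parts[i % n].append(row)
    return new_parts


def toy_coalesce(partitions, n):
    """No full shuffle: whole partitions are merged, so the count can only go down."""
    n = min(n, len(partitions))
    new_parts = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        new_parts[i % n].extend(part)
    return new_parts


parts = [[1, 2, 3], [4, 5], [6], [7, 8, 9, 10]]  # 4 uneven partitions
even = toy_repartition(parts, 2)
merged = toy_coalesce(parts, 2)

print([len(p) for p in even])    # sizes after the full shuffle
print([len(p) for p in merged])  # sizes after merging whole partitions
```

In this model the repartitioned output is evenly sized because every row was redistributed, while the coalesced output keeps whatever skew the merged partitions already had. That is the usual trade-off: coalescing is cheaper, repartitioning is more balanced.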

# Spark Operations:

## Transformations
- Transformations are operations that create a new DataFrame from an existing one without modifying the original. They are lazy, meaning they don't trigger execution until an action is called.
- There are two types of transformations:
    - Narrow Dependency Transformations
        - Each output partition depends on only one input partition.
        - Transformations can be performed independently on each partition.
        - These are faster and do not require data shuffling.
        - Examples: filter, select, where, union

    - Wide Dependency Transformations
        - Each output partition may depend on multiple input partitions.
        - These transformations require data to be shuffled across the network (expensive).
        - Spark needs to redistribute and group data before applying the transformation.
        - Examples: groupBy, distinct, join
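
The two dependency types can be sketched with a toy model in plain Python (not the Spark API). The function names, the `(city, amount)` rows, and the byte-sum partitioner are all assumptions made for illustration. The point is only that a filter touches one partition at a time, whereas a grouped aggregation has to route rows by key across partitions.

```python
from collections import defaultdict

# Toy model: two input partitions of (city, amount) rows.
partitions = [
    [("delhi", 10), ("pune", 5), ("delhi", 7)],
    [("pune", 3), ("delhi", 1), ("goa", 8)],
]


def narrow_filter(parts, predicate):
    """Narrow: each output partition is built from exactly one input partition."""
    return [[row for row in part if predicate(row)] for part in parts]


def wide_group_sum(parts, num_output_partitions):
    """Wide: rows are routed by key, so an output partition may read from many inputs."""
    shuffled = [defaultdict(int) for _ in range(num_output_partitions)]
    for part in parts:
        for key, amount in part:
            # Deterministic toy partitioner (an assumption, not Spark's hash function).
            target = sum(key.encode()) % num_output_partitions
            shuffled[target][key] += amount
    return [dict(p) for p in shuffled]


filtered = narrow_filter(partitions, lambda row: row[1] >= 5)
grouped = wide_group_sum(partitions, 2)

print(filtered)  # still two partitions, each derived from its own input
print(grouped)   # every key ends up in exactly one output partition
```

In the wide case, rows with the key `"delhi"` start in both input partitions but must end up in the same output partition before they can be summed. That movement is the shuffle.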

## Actions
- Actions are operations that trigger execution of the computation graph (DAG) built by transformations.
- They either return a result to the driver or write it to external storage.
- Actions launch a Spark job and use the defined transformations to compute the result.
- Examples: show(), count(), collect(), and writes such as write.csv()

In [1]:
# Transformation and action example:

# Read the data 

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__=="__main__":
    spark = SparkSession.builder\
        .appName("Read Data")\
        .master("local[*]")\
        .getOrCreate()
    
    # Read the data
    spark_df = spark.read.csv(
        r"C:\Users\shubh\OneDrive\Desktop\validating data.csv", 
        header=True,  # use the first row as column names
        inferSchema=True  # scan the data to infer each column's type
        )
    
    # we can chain multiple transformations together and assign the result to a new variable
    transformed_df = spark_df.where('age between 20 and 40')\
                            .select('ry_user_id','age')

    transformed_df.show(5, False)
    transformed_df.printSchema()

    spark.stop()

+----------+---+
|ry_user_id|age|
+----------+---+
|73006237  |35 |
|74522222  |33 |
|48241626  |34 |
|38520311  |28 |
|74805610  |25 |
+----------+---+
only showing top 5 rows

root
 |-- ry_user_id: integer (nullable = true)
 |-- age: integer (nullable = true)



# Spark transformation and execution model

## Spark Transformation Logic
- In Spark, transformations form a Directed Acyclic Graph (DAG) of operations:
- Example: read >> where >> select >> group >> count >> show
- Unlike traditional programming, Spark doesn't execute operations line by line.
- Instead, the driver records each transformation in a plan, optimizes that plan, and only then sends instructions (tasks) to the executors.
- Because transformations don't run immediately, this is known as lazy evaluation: they are only triggered by an action.
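
Lazy evaluation can be sketched with a small stand-in class in plain Python. `LazyFrame`, its methods, and the sample rows are hypothetical and only imitate the idea: transformations are recorded as steps, and nothing runs until an action is called.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: transformations are recorded, not run."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (name, function) steps recorded so far

    def where(self, predicate):
        step = ("where", lambda rows: [r for r in rows if predicate(r)])
        return LazyFrame(self._rows, self._plan + (step,))

    def select(self, *columns):
        step = ("select", lambda rows: [{c: r[c] for c in columns} for r in rows])
        return LazyFrame(self._rows, self._plan + (step,))

    def plan_names(self):
        return [name for name, _fn in self._plan]

    def collect(self):
        """Action: only now is the recorded plan executed, step by step."""
        rows = self._rows
        for _name, fn in self._plan:
            rows = fn(rows)
        return rows


calls = []  # records every time the predicate actually runs


def age_in_range(row):
    calls.append(row["age"])
    return 20 <= row["age"] <= 40


data = [{"id": 1, "age": 35}, {"id": 2, "age": 52}, {"id": 3, "age": 28}]
pipeline = LazyFrame(data).where(age_in_range).select("id")

calls_before_action = len(calls)  # the predicate has not run yet
result = pipeline.collect()       # the action triggers execution
print(pipeline.plan_names(), calls_before_action, result)
```

Building the pipeline only grows the recorded plan. The predicate is first invoked inside `collect()`, which mirrors how Spark defers work until an action.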

## Spark Execution Theory
- Spark behaves like a compiler: it takes the logical plan (from your code) and compiles it into a physical execution plan.
- When an action is called, Spark creates a job.
- For example, reading a CSV with schema inference behaves like an action: Spark must scan the file to infer the column types, so the read itself results in a job.
- Each job is divided into stages, and each stage is broken down into tasks.
- Every action triggers at least one job, which contains at least one stage, and every stage contains at least one task.
- Spark builds a DAG of stages for each action to determine how to process the data.
- Spark then executes that DAG as jobs, stages, and tasks.
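
How a chain of operations falls into stages can be approximated with a short plain-Python sketch. It assumes a simplified rule in which every wide transformation starts a new stage. The `WIDE` set and the `split_into_stages` helper are hypothetical, and Spark's real planner is more involved (for example, it can do part of an aggregation before the shuffle).

```python
# Simplified assumption: a wide transformation marks a shuffle boundary.
WIDE = {"groupBy", "distinct", "join"}


def split_into_stages(operations):
    """Group a linear chain of operations into stages, cutting before each wide step."""
    stages = [[]]
    for op in operations:
        if op in WIDE and stages[-1]:
            stages.append([])  # shuffle boundary: later work goes into a new stage
        stages[-1].append(op)
    return stages


chain = ["read", "where", "select", "groupBy", "count", "show"]
stages = split_into_stages(chain)
for number, ops in enumerate(stages):
    print(f"stage {number}: {' >> '.join(ops)}")
```

Under this rule the narrow steps (`where`, `select`) are pipelined together in one stage, and the `groupBy` forces a second stage after the shuffle.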

#### Summary:
Each action results in at least one job. Within a job, each wide transformation starts a new stage, and every stage runs its tasks in parallel, one task per partition, limited by the number of executor cores. If there are fewer cores than partitions, the remaining tasks are queued until a core becomes free.
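
The queuing behaviour is simple arithmetic, sketched below in plain Python. The helper `task_waves` and the example cluster sizes are hypothetical, and the sketch assumes each task uses one core (Spark's default for `spark.task.cpus`).

```python
import math


def task_waves(num_partitions, executors, cores_per_executor):
    """Return (parallel task slots, number of waves needed to run one stage)."""
    slots = executors * cores_per_executor  # tasks that can run at the same time
    waves = math.ceil(num_partitions / slots)
    return slots, waves


# 200 partitions on 4 executors with 5 cores each.
slots, waves = task_waves(num_partitions=200, executors=4, cores_per_executor=5)
print(f"{slots} tasks run at once; the stage needs {waves} waves")

# Fewer partitions than slots: one wave, but some cores sit idle.
small_slots, small_waves = task_waves(num_partitions=10, executors=4, cores_per_executor=5)
print(f"{small_slots - 10} of {small_slots} slots are idle in {small_waves} wave")
```

The two cases show both sides of partition tuning: too many partitions for the available cores means tasks wait in a queue, and too few means cores are left unused.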

## Execution Plan Example

    spark_df = spark.read.csv(
        r"C:\Users\shubh\OneDrive\Desktop\validating data.csv", 
        header=True,         # Use first row as header
        inferSchema=True     # Infer column types automatically
    )
    Action 1:
        - Triggers Job 0: Read the CSV file from disk.
        - Triggers Job 1: Infer schema from the data.
    transformed_df = spark_df.where('age between 20 and 40').select('ry_user_id', 'age')
    transformed_df.show(5, False)
    Action 2:
        - Triggers Job 2: applies the filter and select transformations and executes .show() to return 5 records to the driver. Because all the transformations are run by a single action (.show()), only one job is created.

In [2]:
# execution plan example

from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .appName("Read Data")\
    .master("local[*]")\
    .getOrCreate()    
spark_df = spark.read.csv(
        r"C:\Users\shubh\OneDrive\Desktop\validating data.csv", 
        header=True,  # use the first row as column names
        inferSchema=True
        )     
transformed_df = spark_df.where('age between 20 and 40').select('ry_user_id','age')
transformed_df.show(5, False)
input('press enter to continue')  # block until the user presses enter so the Spark UI stays available
# check the DAG at localhost:4040

+----------+---+
|ry_user_id|age|
+----------+---+
|73006237  |35 |
|74522222  |33 |
|48241626  |34 |
|38520311  |28 |
|74805610  |25 |
+----------+---+
only showing top 5 rows

press enter to continue


![alt text](image.png)