# Why PySpark?

**PySpark** is the Python API for Apache Spark, an open-source engine for distributed data processing. It lets Python developers and data scientists write Spark jobs in a familiar language instead of Scala or Java.

## Key Benefits of PySpark

| Feature                | Benefit                                                                 |
|------------------------|-------------------------------------------------------------------------|
| **Scalability**        | Process data at petabyte scale across distributed clusters              |
| **Speed**              | In-memory computation avoids repeated disk I/O, which often makes Spark much faster than disk-based engines such as Hadoop MapReduce |
| **Ease of Use**        | Write Spark jobs using familiar Python syntax                          |
| **Ecosystem**          | Integrates with Hadoop, HDFS, Hive, HBase, Kafka, Cassandra, etc.       |
| **Libraries**          | Includes libraries for SQL (Spark SQL), streaming (Structured Streaming), and ML (MLlib); graph processing is available through the separate GraphFrames package, since GraphX has no Python API |
| **Cloud Ready**        | Runs on AWS EMR, Google Dataproc, Azure HDInsight, and Kubernetes       |

## Basic Example of PySpark

The following example reads a CSV file, selects two columns, and keeps only the rows where `Age` is greater than 25.

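```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

# Reuse the active session if one exists, otherwise start a new one
spark = SparkSession.builder.appName("WhyPySparkDemo").getOrCreate()

# Read the CSV file, treating the first row as column names and
# letting Spark infer each column's type from the data
df = spark.read.csv("sample.csv", header=True, inferSchema=True)

# Keep the Name and Age columns, then only the rows where Age > 25
df_filtered = df.select("Name", "Age").filter(col("Age") > 25)
df_filtered.show()
```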