
# 🔥 Big Data Processing with PySpark

This notebook provides **code templates and checklists** for **handling large-scale datasets using PySpark**.

### 🔹 What’s Covered:
- Initializing a PySpark session
- Loading and processing large datasets
- Performing transformations & aggregations
- Optimizing performance for distributed computing


In [None]:

# Ensure required libraries are installed (Uncomment if necessary)
# !pip install pyspark



## 🚀 Initializing PySpark

✅ Set up a **Spark session** to enable distributed computing.  
✅ Configure **memory allocation & parallelism** for efficiency.  


In [None]:

from pyspark.sql import SparkSession

# Initialize Spark session
spark = (
    SparkSession.builder
    .appName("BigDataProcessing")
    .config("spark.executor.memory", "2g")
    .getOrCreate()
)

print("Spark Session Initialized")



## 📂 Loading Large Datasets

✅ Load **CSV, Parquet, JSON** files efficiently.  
✅ Define the **schema explicitly** to avoid the extra pass over the data that schema inference requires (`inferSchema=True` for CSV, always on for JSON).  


In [None]:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Define schema for dataset
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Load data (replace with actual file path)
df = spark.read.csv("large_dataset.csv", schema=schema, header=True)

# Show a sample
df.show(5)



## 🔄 Data Transformations

✅ Perform **filtering, selection, and column transformations**.  
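✅ Prefer **built-in Spark column functions** (`pyspark.sql.functions`) over row-by-row Python or pandas-style operations, so the work stays inside Spark's optimized engine.  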


In [None]:

from pyspark.sql.functions import col

# Filter dataset (age > 30)
df_filtered = df.filter(col("age") > 30)

# Select specific columns
df_selected = df_filtered.select("id", "name")

df_selected.show(5)



## 📊 Aggregations & Grouping

✅ Perform **group-by operations** on large datasets.  
✅ Use **Spark’s built-in aggregation functions** for efficiency.  


In [None]:

# Aggregate data: Count people by age
df_grouped = df.groupBy("age").count()

# Show results
df_grouped.show(10)



## ⚡ Optimizing Performance

✅ **Cache data** when reusing DataFrames.  
✅ Use **Parquet** instead of CSV for better speed.  
✅ **Repartition data** to balance workload across nodes.  


In [None]:

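# Cache DataFrame for repeated use (lazy: the cache is filled by the first action)
df.cache()

# Save as Parquet format for optimized storage
# mode("overwrite") lets this cell be re-run without a "path already exists" error
df.write.mode("overwrite").parquet("large_dataset.parquet")

# Repartition dataset to optimize parallelism
# (triggers a full shuffle; 10 is a placeholder to tune for your data and cluster)
df_repartitioned = df.repartition(10)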
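
The `repartition(10)` call above uses a fixed number. As a rough heuristic, the partition count can be derived from the dataset size and a target partition size. The helper below is a hypothetical sketch, not a Spark API. Its 128 MiB target is an assumption, chosen to match the default of `spark.sql.files.maxPartitionBytes`.

```python
import math

def suggest_partitions(dataset_bytes, target_partition_bytes=128 * 1024**2, min_partitions=1):
    """Return a partition count that keeps each partition near the target size."""
    if dataset_bytes <= 0:
        return min_partitions
    return max(min_partitions, math.ceil(dataset_bytes / target_partition_bytes))

# Example: a 10 GiB dataset at roughly 128 MiB per partition
n = suggest_partitions(10 * 1024**3)
print(n)  # 80

# df_repartitioned = df.repartition(n)  # `df` is the DataFrame loaded earlier
```

Treat the result as a starting point and adjust it after checking task times in the Spark UI.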



## ✅ Best Practices & Common Pitfalls

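- **Avoid tiny partitions**: Thousands of very small partitions add task-scheduling overhead, while too few large ones limit parallelism.  
- **Use built-in Spark functions**: Avoid Python UDFs unless necessary, since they move rows between the JVM and Python and are opaque to the optimizer.  
- **Monitor execution**: Inspect the query plan with `.explain()` and track memory, shuffle sizes, and task times in the Spark UI.  
- **Prefer Parquet over CSV**: Columnar storage with an embedded schema gives faster reads, especially when only some columns are needed.  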
