PySpark Null Value Analysis

Problem Statement:
This script demonstrates how to detect null/missing values in PySpark by:
1. Creating a DataFrame with intentional null values in every column
2. Displaying the complete dataset, including null entries
3. Counting the null values in each column

Data Characteristics:
- Contains nulls in the string column (name) and in both integer columns (value, id)
- Represents common data quality issues in real-world datasets

Use Cases:
- Data quality assessment
- Pre-processing for ML pipelines
- Data validation checks
- Demonstrating null-handling techniques

Columns:
- name (string): May contain customer names
- value (long): Numeric measurements with missing values; PySpark infers bigint from Python ints
- id (long): Identifier field with potential gaps; also inferred as bigint



In [None]:
from pyspark.sql import SparkSession

# Initialize Spark session
spark = SparkSession.builder.appName("Null Check Example").getOrCreate()

# Sample data
data = [
    ("Alice", 50, 1),
    (None, 60, 2),
    ("Bob", None, 3),
    ("Charlie", 70, None),
    (None, None, None)
]

# Creating DataFrame
columns = ["name", "value", "id"]
df = spark.createDataFrame(data, columns)
df.show()


In [None]:
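from pyspark.sql import functions as F

# Count nulls per column: isNull() yields a boolean, which is cast to 1/0 and summed.
# Importing the module as F avoids shadowing Python's built-in sum().
null_counts = df.select(
    [F.sum(F.col(c).isNull().cast("int")).alias(c) for c in df.columns]
)

null_counts.show()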