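- na: a DataFrame property (not a method call) that returns a DataFrameNaFunctions object for handling missing (null/None) values. Its methods are fill(), drop(), and replace().
- isEmpty(): 
    1. Returns True if the DataFrame has no rows and False otherwise. It does not count rows.
    2. Unlike count(), it does not need to scan every row. It still triggers a Spark job, but that job typically only has to look for a single row.
    3. An empty DataFrame has no rows. It may still have columns (a schema), but no data.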

In [1]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("NAandEmptyDemo").getOrCreate()



In [5]:
# create a sample DataFrame
data = [
    (1, "Manta", 75000, "IT", 24),
    (2, "Dipankar", 30000, None, 27),
    (3, "Souvik", 60000, "Army Officer", 27),
    (4, "Soukarjya", 45000, "BDO", "null"),  # the string "null" is a literal value, not a missing value
    (5, "Arvind", 35000, "HR", 28),
    (6, "Prodipta", 25000, "Data Analyst", 28),
    (7, "Padma", 20000, "Data Analyst", None),
    (8, "Panta", 125000, "Business Analyst", 27),
    (9, "Sougato", None, None, 29)
]

columns = ["id", "name", "salary", "department", "age"]

df = spark.createDataFrame(data, schema=columns)
df.show()




+---+---------+------+----------------+----+
| id|     name|salary|      department| age|
+---+---------+------+----------------+----+
|  1|    Manta| 75000|              IT|  24|
|  2| Dipankar| 30000|            NULL|  27|
|  3|   Souvik| 60000|    Army Officer|  27|
|  4|Soukarjya| 45000|             BDO|null|
|  5|   Arvind| 35000|              HR|  28|
|  6| Prodipta| 25000|    Data Analyst|  28|
|  7|    Padma| 20000|    Data Analyst|NULL|
|  8|    Panta|125000|Business Analyst|  27|
|  9|  Sougato|  NULL|            NULL|  29|
+---+---------+------+----------------+----+



                                                                                

In [6]:
# using na.drop(): drop rows with any null values
# (row 4 is kept: its age is the string "null", not a real null)
print("Drop rows with any null values using na.drop(): ")
df_na_drop = df.na.drop()
df_na_drop.show()


Drop rows with any null values using na.drop(): 



+---+---------+------+----------------+----+
| id|     name|salary|      department| age|
+---+---------+------+----------------+----+
|  1|    Manta| 75000|              IT|  24|
|  3|   Souvik| 60000|    Army Officer|  27|
|  4|Soukarjya| 45000|             BDO|null|
|  5|   Arvind| 35000|              HR|  28|
|  6| Prodipta| 25000|    Data Analyst|  28|
|  8|    Panta|125000|Business Analyst|  27|
+---+---------+------+----------------+----+



                                                                                

In [7]:
# using na.fill(): replace null values with specified values, per column
# (age is not listed in the dict, so its null in row 7 is left unchanged)
print("Fill null values using na.fill(): ")
df_na_fill = df.na.fill(
    {
        "name": "unknown",
        "department": "not assigned",
        "salary": 0
    }
)
df_na_fill.show()


Fill null values using na.fill(): 



+---+---------+------+----------------+----+
| id|     name|salary|      department| age|
+---+---------+------+----------------+----+
|  1|    Manta| 75000|              IT|  24|
|  2| Dipankar| 30000|    not assigned|  27|
|  3|   Souvik| 60000|    Army Officer|  27|
|  4|Soukarjya| 45000|             BDO|null|
|  5|   Arvind| 35000|              HR|  28|
|  6| Prodipta| 25000|    Data Analyst|  28|
|  7|    Padma| 20000|    Data Analyst|NULL|
|  8|    Panta|125000|Business Analyst|  27|
|  9|  Sougato|     0|    not assigned|  29|
+---+---------+------+----------------+----+



                                                                                

In [9]:
# using na.replace(): replace specific values with new ones (use na.fill() for nulls)
print("replace 'HR' with 'Human Resources' using na.replace(): ")
df_replace = df_na_fill.na.replace("HR", "Human Resources")
df_replace.show()


replace 'HR' with 'Human Resources' using na.replace(): 



+---+---------+------+----------------+----+
| id|     name|salary|      department| age|
+---+---------+------+----------------+----+
|  1|    Manta| 75000|              IT|  24|
|  2| Dipankar| 30000|    not assigned|  27|
|  3|   Souvik| 60000|    Army Officer|  27|
|  4|Soukarjya| 45000|             BDO|null|
|  5|   Arvind| 35000| Human Resources|  28|
|  6| Prodipta| 25000|    Data Analyst|  28|
|  7|    Padma| 20000|    Data Analyst|NULL|
|  8|    Panta|125000|Business Analyst|  27|
|  9|  Sougato|     0|    not assigned|  29|
+---+---------+------+----------------+----+



                                                                                

In [12]:
# how to check if a DataFrame is empty (no rows)
print("Check if DataFrame is empty: ")
high_salary_df = df.filter(df.salary > 50000)  # three rows match, so the result is not empty
high_salary_df.isEmpty()  # returns False here


Check if DataFrame is empty: 


False

In [13]:
high_salary_df.show()




+---+------+------+----------------+---+
| id|  name|salary|      department|age|
+---+------+------+----------------+---+
|  1| Manta| 75000|              IT| 24|
|  3|Souvik| 60000|    Army Officer| 27|
|  8| Panta|125000|Business Analyst| 27|
+---+------+------+----------------+---+



                                                                                