- What is the subtract() function in PySpark?
    - DataFrame.subtract() returns the rows that are present in one DataFrame but not in another.
    - It is equivalent to the SQL "EXCEPT DISTINCT" operation, so duplicate rows are removed from the result.

- Key Points:
    - It returns the distinct rows that exist in the first DataFrame and NOT in the second DataFrame.
    - Both DataFrames must have the same number of columns with compatible types; columns are matched by position, not by name.

- Syntax:
    DataFrame1.subtract(DataFrame2)

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("subtractFuncExample").getOrCreate()




In [2]:
data1 = [
    (29, "Suman"),
    (35, "Ankur"),
    (40, "Sayantan"),
    (25, "Sai")
]

data2 = [
    (29, "Suman"),
    (40, "Sayantan")
]

columns = ["age", "name"]

df1 = spark.createDataFrame(data1, columns)
df2 = spark.createDataFrame(data2, columns)

df1.show()
df2.show()



+---+--------+
|age|    name|
+---+--------+
| 29|   Suman|
| 35|   Ankur|
| 40|Sayantan|
| 25|     Sai|
+---+--------+




+---+--------+
|age|    name|
+---+--------+
| 29|   Suman|
| 40|Sayantan|
+---+--------+



                                                                                

In [3]:
# Using subtract() function in PySpark
# Subtract df2 from df1 -> Returns rows present in df1 but not in df2
result_df = df1.subtract(df2)

print("Result after subtracting df2 from df1: ")
result_df.show()


Result after subtracting df2 from df1: 



+---+-----+
|age| name|
+---+-----+
| 35|Ankur|
| 25|  Sai|
+---+-----+



                                                                                

- subtract() helps you find the rows of one DataFrame that are missing from another.
- It works like "EXCEPT DISTINCT" in SQL.
- It removes all rows from df1 that are also in df2 and returns the remaining rows without duplicates.

- Suman and Sayantan appear in both DataFrames, so they are removed from df1.
- The result shows only Ankur and Sai.