- What is transform()?
    - transform() is a PySpark DataFrame method that takes a function, calls it with the DataFrame, and returns the result.
    - The function must accept a DataFrame and return a DataFrame, so custom transformations can be written once, reused, and chained in a readable order.

- Syntax:
    df.transform(function)

- Example:
    df.transform(lambda d: d.withColumn(...))

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, upper, lower

spark = SparkSession.builder.appName("transformFunctionExample").getOrCreate()




In [2]:
data = [
    (1, "Manisha", 5000),
    (2, "Simpa", 25000),
    (3, "Shreya", 30000),
    (4, "Shankha", 20000)
]

columns = ["id", "name", "salary"]

df = spark.createDataFrame(data, columns)
df.show()

+---+-------+------+
| id|   name|salary|
+---+-------+------+
|  1|Manisha|  5000|
|  2|  Simpa| 25000|
|  3| Shreya| 30000|
|  4|Shankha| 20000|
+---+-------+------+



In [4]:
# Define a transformation function
def add_new_columns(input_df):
    # Example transformation:
    #   1. Uppercase the name column
    #   2. Add a new column "Bonus" which is 10% of salary
    transformed_df = input_df.withColumn("NameUpper", upper(col("name"))) \
                              .withColumn("Bonus", col("salary") * 0.10)

    return transformed_df


In [5]:
df_transformed = df.transform(add_new_columns)

print("Transformed DataFrame (Upper Name + Bonus):")
df_transformed.show()


Transformed DataFrame (Upper Name + Bonus):

+---+-------+------+---------+------+
| id|   name|salary|NameUpper| Bonus|
+---+-------+------+---------+------+
|  1|Manisha|  5000|  MANISHA| 500.0|
|  2|  Simpa| 25000|    SIMPA|2500.0|
|  3| Shreya| 30000|   SHREYA|3000.0|
|  4|Shankha| 20000|  SHANKHA|2000.0|
+---+-------+------+---------+------+


In [6]:
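# Use the column's actual name ("name"); "Name" only resolves because Spark is case-insensitive by default
df_another_transform = df.transform(lambda d: d.withColumn("Lower_Name", lower(col("name"))))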
print("Another transformation Example (lowercase Name): ")
df_another_transform.show()


Another transformation Example (lowercase Name): 




+---+-------+------+----------+
| id|   name|salary|Lower_Name|
+---+-------+------+----------+
|  1|Manisha|  5000|   manisha|
|  2|  Simpa| 25000|     simpa|
|  3| Shreya| 30000|    shreya|
|  4|Shankha| 20000|   shankha|
+---+-------+------+----------+
