- replace() is a PySpark DataFrame method that returns a new DataFrame in which specific values in one or more columns are replaced with new values.
- It is useful for data cleaning, for example standardizing inconsistent or incorrect values.

- Syntax:
    df.replace(to_replace, value, subset=None)

In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("replaceFunctionExample").getOrCreate()


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/16 11:34:41 WARN Utils: Your hostname, KLZPC0015, resolves to a loopback address: 127.0.1.1; using 172.25.17.96 instead (on interface eth0)
25/09/16 11:34:41 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/16 11:34:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/16 11:34:59 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
25/09/16 11:34:59 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
25/09/16 11:34:59 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.


In [6]:
data = [
    (1, "Akash", "United State of America", 500000),
    (2, "Pallab", "Malyesia", 450000),
    (3, "Abhigyan", "India", 70000),
    (4, "Soumi", "Germany", 90000),
    (5, "Arvind", "Ireland", 45000),
    (6, "Unknown", "Unknown", None),
]
columns = ["id", "name", "country", "salary"]

df = spark.createDataFrame(data, columns)
df.show()




+---+--------+--------------------+------+
| id|    name|             country|salary|
+---+--------+--------------------+------+
|  1|   Akash|United State of A...|500000|
|  2|  Pallab|            Malyesia|450000|
|  3|Abhigyan|               India| 70000|
|  4|   Soumi|             Germany| 90000|
|  5|  Arvind|             Ireland| 45000|
|  6| Unknown|             Unknown|  NULL|
+---+--------+--------------------+------+



                                                                                

In [7]:
# Example: Replace a single value in one column
# Replace "Unknown" in the name column with "Not Provided"
df_replaced_name = df.replace("Unknown", "Not Provided", subset=["name"])

print("After replacing 'Unknown' in name column: ")
df_replaced_name.show()


After replacing 'Unknown' in name column: 


                                                                                

+---+------------+--------------------+------+
| id|        name|             country|salary|
+---+------------+--------------------+------+
|  1|       Akash|United State of A...|500000|
|  2|      Pallab|            Malyesia|450000|
|  3|    Abhigyan|               India| 70000|
|  4|       Soumi|             Germany| 90000|
|  5|      Arvind|             Ireland| 45000|
|  6|Not Provided|             Unknown|  NULL|
+---+------------+--------------------+------+



In [11]:
# Example: Replace multiple values in a single column
# Map each country name to a short code, e.g. India -> IND, United State of America -> USA
df_replaced_country = df.replace(
    {
        "United State of America": "USA",
        "Malyesia": "MLS",
        "India": "IND",
        "Germany": "GRM",
        "Ireland": "IRE"
    }, subset=["country"]
)

print("After replacing multiple country names: ")
df_replaced_country.show()


After replacing multiple country names: 



+---+--------+-------+------+
| id|    name|country|salary|
+---+--------+-------+------+
|  1|   Akash|    USA|500000|
|  2|  Pallab|    MLS|450000|
|  3|Abhigyan|    IND| 70000|
|  4|   Soumi|    GRM| 90000|
|  5|  Arvind|    IRE| 45000|
|  6| Unknown|Unknown|  NULL|
+---+--------+-------+------+



                                                                                

In [12]:
# Example: Replace a single value in multiple columns
# Replace "Unknown" in the name and country columns with "Not Provided"
df_replaced_nameCountry = df.replace("Unknown", "Not Provided", subset=["name", "country"])

print("After replacing 'Unknown' in name and country columns: ")
df_replaced_nameCountry.show()


After replacing 'Unknown' in name and country columns: 



+---+------------+--------------------+------+
| id|        name|             country|salary|
+---+------------+--------------------+------+
|  1|       Akash|United State of A...|500000|
|  2|      Pallab|            Malyesia|450000|
|  3|    Abhigyan|               India| 70000|
|  4|       Soumi|             Germany| 90000|
|  5|      Arvind|             Ireland| 45000|
|  6|Not Provided|        Not Provided|  NULL|
+---+------------+--------------------+------+



                                                                                