https://sparkbyexamples.com/pyspark/pyspark-when-otherwise/

In [1]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

data = [("James", "M", 60000), ("Michael", "M", 70000),
        ("Robert", None, 400000), ("Maria", "F", 500000), ("Jen", "", None)]

columns = ["name", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()

+-------+------+------+
|   name|gender|salary|
+-------+------+------+
|  James|     M| 60000|
|Michael|     M| 70000|
| Robert|  null|400000|
|  Maria|     F|500000|
|    Jen|      |  null|
+-------+------+------+



# Using when() otherwise() on PySpark DataFrame.
PySpark when() is a SQL function that must be imported from pyspark.sql.functions before use, and it returns a Column. otherwise() is a method of Column. If otherwise() is omitted and none of the conditions is met, the result is None (null). Typical usage is when(condition, value).otherwise(default).

when() takes two parameters: the first is a condition and the second is a literal value or a Column. If the condition evaluates to true, when() returns the value from the second parameter.

The snippet below derives a new_gender column from gender: "M" becomes "Male", "F" becomes "Female", null becomes an empty string, and any other value is kept as is.

In [3]:
from pyspark.sql.functions import when, col
df2 = df.withColumn("new_gender", when(df.gender == "M","Male")
                                 .when(df.gender == "F","Female")
                                 .when(df.gender.isNull() ,"")
                                 .otherwise(df.gender))
df2.show()

+-------+------+------+----------+
|   name|gender|salary|new_gender|
+-------+------+------+----------+
|  James|     M| 60000|      Male|
|Michael|     M| 70000|      Male|
| Robert|  null|400000|          |
|  Maria|     F|500000|    Female|
|    Jen|      |  null|          |
+-------+------+------+----------+



The same when() otherwise() expression can also be used with select():

In [4]:
df2=df.select(col("*"),when(df.gender == "M","Male")
                  .when(df.gender == "F","Female")
                  .when(df.gender.isNull() ,"")
                  .otherwise(df.gender).alias("new_gender"))

# PySpark SQL Case When on DataFrame.
If you have a SQL background, you may be familiar with the CASE WHEN statement, which evaluates a sequence of conditions and returns the value for the first condition that is met, similar to SWITCH and IF THEN ELSE statements. PySpark SQL's CASE WHEN can be used on a DataFrame in the same way. Below are examples with withColumn() and select() using the expr() function; selectExpr() accepts the same expression.

## Using Case When Else on DataFrame using withColumn() & select()
The example below uses the PySpark SQL expr() function to write SQL-like expressions.

In [6]:
from pyspark.sql.functions import expr, col

#Using Case When on withColumn()
df3 = df.withColumn(
    "new_gender",
    expr("CASE WHEN gender = 'M' THEN 'Male' " +
         "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN '' " +
         "ELSE gender END"))
df3.show(truncate=False)

+-------+------+------+----------+
|name   |gender|salary|new_gender|
+-------+------+------+----------+
|James  |M     |60000 |Male      |
|Michael|M     |70000 |Male      |
|Robert |null  |400000|          |
|Maria  |F     |500000|Female    |
|Jen    |      |null  |          |
+-------+------+------+----------+



In [7]:
#Using Case When on select()
# show() returns None, so keep the DataFrame in df4 and display it separately
df4 = df.select(
    col("*"),
    expr("CASE WHEN gender = 'M' THEN 'Male' " +
         "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN '' " +
         "ELSE gender END").alias("new_gender"))
df4.show()

+-------+------+------+----------+
|   name|gender|salary|new_gender|
+-------+------+------+----------+
|  James|     M| 60000|      Male|
|Michael|     M| 70000|      Male|
| Robert|  null|400000|          |
|  Maria|     F|500000|    Female|
|    Jen|      |  null|          |
+-------+------+------+----------+



## Using Case When on SQL Expression
You can also use CASE WHEN in a SQL statement after creating a temporary view. The new_gender values match the output above.

In [8]:
df.createOrReplaceTempView("EMP")
spark.sql("select name, CASE WHEN gender = 'M' THEN 'Male' " +
          "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN '' " +
          "ELSE gender END as new_gender from EMP").show()

+-------+----------+
|   name|new_gender|
+-------+----------+
|  James|      Male|
|Michael|      Male|
| Robert|          |
|  Maria|    Female|
|    Jen|          |
+-------+----------+



## Multiple Conditions using & and | operator
We often need to check multiple conditions. Below is an example of using PySpark when() otherwise() with multiple conditions combined by the and (&) and or (|) operators. To keep it simple, I will use a new set of data.

In [20]:
from pyspark.sql.functions import when, col

data = [[66, "a", 4], [67, "a", 0], [70, "b", 4], [71, "d", 4]]

columns = ["id", "code", "amt"]
df5 = spark.createDataFrame(data=data, schema=columns)
df5.show()

+---+----+---+
| id|code|amt|
+---+----+---+
| 66|   a|  4|
| 67|   a|  0|
| 70|   b|  4|
| 71|   d|  4|
+---+----+---+



Python's & and | operators bind more tightly than ==, so each comparison must be wrapped in parentheses. Without them, "a" | col("code") is evaluated first and the call fails with py4j.Py4JException: Method or([class java.lang.String]) does not exist. Since amt is an integer column, it is also compared with 4 rather than "4".

In [26]:
df5.withColumn(
    "new_column",
    when((col("code") == "a") | (col("code") == "d"), "A")
    .when((col("code") == "b") & (col("amt") == 4), "B")
    .otherwise("A1")).show()



```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
data = [("James", "M", 60000), ("Michael", "M", 70000),
        ("Robert", None, 400000), ("Maria", "F", 500000), ("Jen", "", None)]

columns = ["name", "gender", "salary"]
df = spark.createDataFrame(data=data, schema=columns)
df.show()

#Using When otherwise
from pyspark.sql.functions import when, col

df2 = df.withColumn(
    "new_gender",
    when(df.gender == "M",
         "Male").when(df.gender == "F",
                      "Female").when(df.gender.isNull(),
                                     "").otherwise(df.gender))
df2.show()

df2 = df.select(
    col("*"),
    when(df.gender == "M", "Male").when(df.gender == "F", "Female").when(
        df.gender.isNull(), "").otherwise(df.gender).alias("new_gender"))
df2.show()
# Using SQL Case When
from pyspark.sql.functions import expr

df3 = df.withColumn(
    "new_gender",
    expr("CASE WHEN gender = 'M' THEN 'Male' " +
         "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN '' " +
         "ELSE gender END"))
df3.show()

df4 = df.select(
    col("*"),
    expr("CASE WHEN gender = 'M' THEN 'Male' " +
         "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN '' " +
         "ELSE gender END").alias("new_gender"))
df4.show()

df.createOrReplaceTempView("EMP")
spark.sql("select name, CASE WHEN gender = 'M' THEN 'Male' " +
          "WHEN gender = 'F' THEN 'Female' WHEN gender IS NULL THEN '' " +
          "ELSE gender END as new_gender from EMP").show()
```