- Unpivoting transforms columns into rows, turning a wide table into a long one.
- In PySpark, you can unpivot by calling 'selectExpr()' with the SQL 'stack()' generator function. Unpivoting is the reverse of pivoting and is helpful for normalizing data.

- Syntax ('n' is the number of rows produced per input row, i.e. the number of label/value pairs):
    df.selectExpr("column1", "stack(n, 'label1', col1, 'label2', col2, ...) as (new_col, new_val)")
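Writing the stack() expression by hand gets tedious as the number of columns grows. The sketch below builds the expression string from a list of column names. The helper name `build_stack_expr`, its parameters, and the `sales_` prefix convention are assumptions made for illustration and are not part of PySpark. It also assumes plain column names that need no backtick quoting.

```python
def build_stack_expr(value_columns, prefix, key_name, value_name):
    """Build a stack() SQL expression that unpivots value_columns.

    Hypothetical helper: each column is assumed to start with `prefix`,
    and the remainder of its name becomes the label in the key column.
    """
    for col in value_columns:
        if not col.startswith(prefix):
            raise ValueError(f"{col!r} does not start with {prefix!r}")
    # One "'label', column" pair per column to unpivot.
    pairs = ", ".join(f"'{col[len(prefix):]}', {col}" for col in value_columns)
    return f"stack({len(value_columns)}, {pairs}) as ({key_name}, {value_name})"


expr = build_stack_expr(
    ["sales_2022", "sales_2023", "sales_2024"], "sales_", "Year", "Sales"
)
print(expr)
```

The resulting string can be passed straight to selectExpr(), for example `df.selectExpr("name", expr)`.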

In [1]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("UnpivotFunctionExample").getOrCreate()


Using Spark's default log4j profile: org/apache/spark/log4j2-defaults.properties
25/09/10 14:36:17 WARN Utils: Your hostname, KLZPC0015, resolves to a loopback address: 127.0.1.1; using 172.25.17.96 instead (on interface eth0)
25/09/10 14:36:17 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/09/10 14:36:27 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/09/10 14:36:30 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.


In [3]:
# Create a sample DataFrame with one sales column per year
data = [
    (1, "Manta",    None, 240, None),
    (2, "Dipankar", None, None, 270),
    (3, "Souvik",   270, None, None),
    (4, "Soukarjya",None, None, None),
    (5, "Arvind",   280, None, None),
    (6, "Prodipta", 280, None, None),
    (7, "Padma",    None, None, None),
    (8, "Panta",    None, 270, None),
    (9, "Sougato",  290, None, None),
]

columns = ["id", "name", "sales_2022", "sales_2023", "sales_2024"]

df = spark.createDataFrame(data, schema=columns)
df.show()




+---+---------+----------+----------+----------+
| id|     name|sales_2022|sales_2023|sales_2024|
+---+---------+----------+----------+----------+
|  1|    Manta|      NULL|       240|      NULL|
|  2| Dipankar|      NULL|      NULL|       270|
|  3|   Souvik|       270|      NULL|      NULL|
|  4|Soukarjya|      NULL|      NULL|      NULL|
|  5|   Arvind|       280|      NULL|      NULL|
|  6| Prodipta|       280|      NULL|      NULL|
|  7|    Padma|      NULL|      NULL|      NULL|
|  8|    Panta|      NULL|       270|      NULL|
|  9|  Sougato|       290|      NULL|      NULL|
+---+---------+----------+----------+----------+




In [5]:
# Unpivot the sales columns into two columns: Year and Sales
unpivotDF = df.selectExpr(
    "name",
    "stack(3, '2022', sales_2022, '2023', sales_2023, '2024', sales_2024) as (Year, Sales)"
)

print("Unpivoted DataFrame")
unpivotDF.show()


Unpivoted DataFrame



+---------+----+-----+
|     name|Year|Sales|
+---------+----+-----+
|    Manta|2022| NULL|
|    Manta|2023|  240|
|    Manta|2024| NULL|
| Dipankar|2022| NULL|
| Dipankar|2023| NULL|
| Dipankar|2024|  270|
|   Souvik|2022|  270|
|   Souvik|2023| NULL|
|   Souvik|2024| NULL|
|Soukarjya|2022| NULL|
|Soukarjya|2023| NULL|
|Soukarjya|2024| NULL|
|   Arvind|2022|  280|
|   Arvind|2023| NULL|
|   Arvind|2024| NULL|
| Prodipta|2022|  280|
| Prodipta|2023| NULL|
| Prodipta|2024| NULL|
|    Padma|2022| NULL|
|    Padma|2023| NULL|
+---------+----+-----+
only showing top 20 rows


- Summary:
    - Unpivoting transforms columns into rows.
    - Useful for normalizing wide data before analysis or reporting.
    - stack(n, ...) inside selectExpr() unpivots columns into (label, value) rows in PySpark, keeping rows whose value is NULL.