##### What is the * (unpacking) operator?

- The `*` operator **unpacks the elements** of an iterable such as a **list or tuple.**
- In a function call, it takes **each element** of the **list / tuple** and passes it as a **separate positional argument**.
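
The effect on a function call can be sketched in plain Python. The function `describe` and the list `values` below are made up for illustration; they are not part of the notebook.

```python
def describe(first, second, third):
    """Join three positional arguments into one string."""
    return f"{first}-{second}-{third}"

values = ["a", "b", "c"]

# describe(values) would raise TypeError: the list counts as ONE argument.
# describe(*values) unpacks the list into three positional arguments.
unpacked = describe(*values)
explicit = describe("a", "b", "c")

print(unpacked)              # a-b-c
print(unpacked == explicit)  # True
```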

##### Basic Python Example

In [0]:
nums_list = [1, 2, 3, 4, 5]

print(nums_list)
print(*nums_list)

[1, 2, 3, 4, 5]
1 2 3 4 5


In [0]:
nums_tup = (1, 2, 3, 4, 5)

print(nums_tup)
print(*nums_tup)

(1, 2, 3, 4, 5)
1 2 3 4 5


- Instead of printing **[1, 2, 3, 4, 5]**, `print(*nums_list)` prints **1 2 3 4 5**, because the **list / tuple** was **unpacked** into **separate arguments** and `print` separates its arguments with a space.

In [0]:
List1 = [1, 2, 3, 4, 5]
List2 = [6, 7, 8, 9, 10, 11]

print(List1 + List2)
print(*List1, *List2)

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
1 2 3 4 5 6 7 8 9 10 11
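
Unpacking also works inside a list literal, which gives a second way to build the combined list. A small sketch using the same two lists:

```python
List1 = [1, 2, 3, 4, 5]
List2 = [6, 7, 8, 9, 10, 11]

# [*a, *b] unpacks both lists into a new list, like a + b.
merged = [*List1, *List2]

print(merged == List1 + List2)  # True
print(len(merged))              # 11
```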


In PySpark, this is particularly useful when you’re passing **multiple column** expressions to methods like:
- **.select()**
- **.agg()**
- **.groupBy().agg()**
- **.orderBy()**
- **.drop()**

##### 1) Using * with .select()

In [0]:
data = [("Joseph", 25, "New York", 200, 30, "East"),
        ("Janani", 30, "Pune", 250, 40, "East"),
        ("Mukesh", 22, "Noida", 300, 50, "North"),
        ("Naresh", 26, "Chennai", 220, 35, "North"),
        ("Priya", 28, "Mumbai", 400, 60, "West"),
        ("Ravi", 27, "Delhi", 500, 70, "West"),
        ("Rahul", 32, "Bangalore", 150, 10, "West"),
        ("Roshan", 19, "Cochin", 100, 15, "South")]

columns = ["name", "age", "city", "sales", "profit", "region"]

df_select = spark.createDataFrame(data, columns)
display(df_select)

name,age,city,sales,profit,region
Joseph,25,New York,200,30,East
Janani,30,Pune,250,40,East
Mukesh,22,Noida,300,50,North
Naresh,26,Chennai,220,35,North
Priya,28,Mumbai,400,60,West
Ravi,27,Delhi,500,70,West
Rahul,32,Bangalore,150,10,West
Roshan,19,Cochin,100,15,South


In [0]:
df_select.select(columns).display()
# df_select.select(["name", "age", "city", "sales", "profit", "region"]).display()

name,age,city,sales,profit,region
Joseph,25,New York,200,30,East
Janani,30,Pune,250,40,East
Mukesh,22,Noida,300,50,North
Naresh,26,Chennai,220,35,North
Priya,28,Mumbai,400,60,West
Ravi,27,Delhi,500,70,West
Rahul,32,Bangalore,150,10,West
Roshan,19,Cochin,100,15,South


In [0]:
df_select.select(*columns).display()

# df_select.select(*["name", "age", "city", "sales", "profit", "region"]).display()
# df_select.select(col("name"), col("age"), col("city"), col("sales"), col("profit"), col("region")).display()

name,age,city,sales,profit,region
Joseph,25,New York,200,30,East
Janani,30,Pune,250,40,East
Mukesh,22,Noida,300,50,North
Naresh,26,Chennai,220,35,North
Priya,28,Mumbai,400,60,West
Ravi,27,Delhi,500,70,West
Rahul,32,Bangalore,150,10,West
Roshan,19,Cochin,100,15,South


     columns = ["name", "age", "city", "sales", "profit", "region"]
     *columns → "name", "age", "city", "sales", "profit", "region"
     .select(*columns) = .select("name", "age", "city", "sales", "profit", "region")
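
Why do `select(columns)` and `select(*columns)` return the same result in the cells above? A method declared with `*cols` receives its arguments as a tuple, and it can choose to accept a single list as well. The function `toy_select` below is a hypothetical stand-in written for this explanation; it is not PySpark's implementation.

```python
def toy_select(*cols):
    """Return the column names a varargs-style method would end up with."""
    # A single list argument arrives as a one-element tuple: (["name", "age"],)
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        cols = tuple(cols[0])
    return list(cols)

columns = ["name", "age", "city", "sales", "profit", "region"]

as_list = toy_select(columns)    # one list argument
unpacked = toy_select(*columns)  # six separate arguments

print(as_list == unpacked)  # True
print(len(unpacked))        # 6
```

A method that does not include such a special case sees the list as one opaque argument, which is why explicit unpacking is the safer habit.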

| Expression                                                     | Works                      | Behavior                                      | Recommended                                      |
| -------------------------------------------------------------- | -------------------------- | --------------------------------------------- | ------------------------------------------------ |
| `df.select(cols)`                                              | ✅                          | `select` accepts a single list and expands it | ⚠️ Works for `select`, but not for every method |
| `df.select(*cols)`                                             | ✅ Always                   | Explicit unpacking                            | ✅ Yes (Best practice)                            |
| `df.groupBy(...).agg(exprs)` with a list of column expressions | ❌ Fails (`AssertionError`) | Needs unpacking: `.agg(*exprs)`               | ✅ Required                                       |

##### 2) Using * with .agg() and dynamic columns

In [0]:
# Note: this import shadows Python's built-in sum() for the rest of the notebook
from pyspark.sql.functions import sum, count
lst_agg_cols = ["sales", "profit"]

**a) without asterisk `*`**
- `.agg()` raises an **AssertionError**, because it receives **one list argument** instead of **multiple column expressions**.

      # agg_exprs is a single list object:
      # [Column<'sum(sales) AS sales'>, Column<'sum(profit) AS profit'>]
      df.groupBy("region").agg(agg_exprs)  # ❌ ERROR
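
The failure can be reproduced without Spark. The traceback in this section shows that `GroupedData.agg(self, *exprs)` asserts that every argument is a `Column`. The `Column` class and `toy_agg` function below are simplified, hypothetical stand-ins for that check, not the real PySpark code.

```python
class Column:
    """Minimal placeholder for a column expression."""
    def __init__(self, name):
        self.name = name

def toy_agg(*exprs):
    """Accept only Column arguments, similar in spirit to the assertion in the traceback."""
    assert all(isinstance(c, Column) for c in exprs), "all exprs should be Column"
    return [c.name for c in exprs]

agg_exprs = [Column("sum(sales)"), Column("sum(profit)")]

try:
    toy_agg(agg_exprs)      # exprs == ([Column, Column],): a list, not a Column
    outcome = "ok"
except AssertionError:
    outcome = "AssertionError"

print(outcome)              # AssertionError
print(toy_agg(*agg_exprs))  # ['sum(sales)', 'sum(profit)']
```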

In [0]:
agg_exprs_wastr = [sum(col_name).alias(col_name) for col_name in lst_agg_cols]

df_select.groupBy("region").agg(agg_exprs_wastr).display()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
File <command-5166503171256217>, line 3
      1 agg_exprs_wastr = [sum(col_name).alias(col_name) for col_name in lst_agg_cols]
----> 3 df_select.groupBy("region").agg(agg_exprs_wastr).display()

File /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/group.py:140, in GroupedData.agg(self, *exprs)
    137     aggregate_cols = [F._invoke_function(exprs[0][k], F.col(k)) for k in exprs[0]]
    138 else

**b) with asterisk `*`**

- `*agg_exprs` unpacks the **list**, so the first call below is equivalent to the second:

      df.groupBy("region").agg(*agg_exprs).display()
      df.groupBy("region").agg(sum("sales").alias("sales"), sum("profit").alias("profit")).display()

In [0]:
df_select.groupBy("region").agg(*agg_exprs_wastr).display()

region,sales,profit
East,450,70
North,520,85
West,1050,140
South,100,15


`*` **(unpacking operator)**

- The **asterisk `*`** unpacks the list elements so they can be passed as separate arguments.

      # Without *: the whole list is passed as one argument
      df.groupBy("region").agg([sum("sales").alias("sales"), sum("profit").alias("profit")])  # ❌ Error

      # With *: each list element becomes its own argument
      df.groupBy("region").agg(*[sum("sales").alias("sales"), sum("profit").alias("profit")])  # ✅ Works

      # The same call with a named list
      df.groupBy("region").agg(*agg_exprs).display()

      # *agg_exprs unpacks the list into
      df.groupBy("region").agg(sum("sales").alias("sales"), sum("profit").alias("profit")).display()

**c) without alias**

In [0]:
agg_exprs_wo = [sum(col_name) for col_name in lst_agg_cols]
agg_exprs_wo

[Column<'sum(sales)'>, Column<'sum(profit)'>]

In [0]:
df_select.groupBy("region").agg(*agg_exprs_wo).display()

region,sum(sales),sum(profit)
East,450,70
North,520,85
West,1050,140
South,100,15


**d) with alias**

In [0]:
agg_exprs = [sum(col_name).alias(col_name) for col_name in lst_agg_cols]
agg_exprs

[Column<'sum(sales) AS sales'>, Column<'sum(profit) AS profit'>]

In [0]:
df_select.groupBy("region").agg(agg_exprs).display()

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
File <command-6744122582445829>, line 1
----> 1 df_select.groupBy("region").agg(agg_exprs).display()

File /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/group.py:140, in GroupedData.agg(self, *exprs)
    137     aggregate_cols = [F._invoke_function(exprs[0][k], F.col(k)) for k in exprs[0]]
    138 else:
    139     # Columns
--> 140     assert all(isinstance(c, Column) for

In [0]:
df_select.groupBy("region").agg(*agg_exprs).display()

region,sales,profit
East,450,70
North,520,85
West,1050,140
South,100,15


In [0]:
agg_exprs = [sum(col_name).alias(col_name.upper()) for col_name in lst_agg_cols]

df_select.groupBy("region").agg(*agg_exprs).display()

region,SALES,PROFIT
East,450,70
North,520,85
West,1050,140
South,100,15


In [0]:
agg_exprs = [sum(col_name).alias(col_name.capitalize()) for col_name in lst_agg_cols]

df_select.groupBy("region").agg(*agg_exprs).display()

region,Sales,Profit
East,450,70
North,520,85
West,1050,140
South,100,15


**sum("sales")**   # => Column<'sum(sales)'>

**.alias(col_name)**
- `.alias()` **renames** the resulting **aggregated column**.
- **Without alias**, PySpark names it **sum(sales)**.
- **With alias**, it is **renamed** to just **sales**:

      sum("sales").alias("sales")

**[ ... for col_name in lst_agg_cols ]**
- The list comprehension builds one expression per column name:

      [sum("sales").alias("sales"), sum("profit").alias("profit")]

| Concept              | Description                                         |
| -------------------- | --------------------------------------------------- |
| `lst_agg_cols`       | List of numeric columns to aggregate                |
| `sum(col_name)`      | **Aggregation** function applied to **each column** |
| `.alias(col_name)`   | **Renames the resulting aggregated column**         |
| Used in              | `df.groupBy().agg(*agg_exprs)`                      |
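
The alias names themselves are ordinary Python strings, so they can be previewed before any Spark expression is built. A small sketch; the `total_` prefix is only one possible naming scheme, chosen here for illustration.

```python
lst_agg_cols = ["sales", "profit"]

# Each list comprehension produces one alias string per column name.
upper = [col_name.upper() for col_name in lst_agg_cols]
capitalized = [col_name.capitalize() for col_name in lst_agg_cols]
prefixed = [f"total_{col_name}" for col_name in lst_agg_cols]  # illustrative naming scheme

print(upper)        # ['SALES', 'PROFIT']
print(capitalized)  # ['Sales', 'Profit']
print(prefixed)     # ['total_sales', 'total_profit']

# In the notebook, an alias is applied as: sum(col_name).alias(alias_name)
```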

     # with list comprehension
     agg_exprs_wastr = [sum(col_name).alias(col_name) for col_name in lst_agg_cols]
     df_select.groupBy("region").agg(*agg_exprs_wastr).display()

     # Alternate 1: Using a simple for loop
     agg_exprs_wastr = []
     for col_name in lst_agg_cols:
         agg_exprs_wastr.append(sum(col_name).alias(col_name))

     df_select.groupBy("region").agg(*agg_exprs_wastr).display()

     # Alternate 2: If you want Uppercase or Capitalized column aliases
     agg_exprs_wastr = []
     for col_name in lst_agg_cols:
         alias_name = col_name.upper()  # or col_name.capitalize()
         agg_exprs_wastr.append(sum(col_name).alias(alias_name))

     df_select.groupBy("region").agg(*agg_exprs_wastr).display()

##### 3) Using * with .orderBy()

In [0]:
order_cols = ["region", "sales"]
df_select.orderBy(*order_cols).display()

name,age,city,sales,profit,region
Joseph,25,New York,200,30,East
Janani,30,Pune,250,40,East
Naresh,26,Chennai,220,35,North
Mukesh,22,Noida,300,50,North
Roshan,19,Cochin,100,15,South
Rahul,32,Bangalore,150,10,West
Priya,28,Mumbai,400,60,West
Ravi,27,Delhi,500,70,West


In [0]:
df_select.orderBy(order_cols).display()

name,age,city,sales,profit,region
Joseph,25,New York,200,30,East
Janani,30,Pune,250,40,East
Naresh,26,Chennai,220,35,North
Mukesh,22,Noida,300,50,North
Roshan,19,Cochin,100,15,South
Rahul,32,Bangalore,150,10,West
Priya,28,Mumbai,400,60,West
Ravi,27,Delhi,500,70,West
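
Both calls return the same ordering: rows sorted by `region`, then by `sales` within each region. As a cross-check, the same order can be reproduced in plain Python with a tuple sort key. This sketch only mirrors the result; it does not use Spark.

```python
data = [("Joseph", 25, "New York", 200, 30, "East"),
        ("Janani", 30, "Pune", 250, 40, "East"),
        ("Mukesh", 22, "Noida", 300, 50, "North"),
        ("Naresh", 26, "Chennai", 220, 35, "North"),
        ("Priya", 28, "Mumbai", 400, 60, "West"),
        ("Ravi", 27, "Delhi", 500, 70, "West"),
        ("Rahul", 32, "Bangalore", 150, 10, "West"),
        ("Roshan", 19, "Cochin", 100, 15, "South")]
columns = ["name", "age", "city", "sales", "profit", "region"]
order_cols = ["region", "sales"]

# Map each sort column to its position in the row tuples.
positions = [columns.index(c) for c in order_cols]

# Sort by (region, sales): region takes precedence, sales breaks ties.
ordered = sorted(data, key=lambda row: tuple(row[i] for i in positions))

print([row[0] for row in ordered])
# ['Joseph', 'Janani', 'Naresh', 'Mukesh', 'Roshan', 'Rahul', 'Priya', 'Ravi']
```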


##### 4) Combine static and dynamic arguments

In [0]:
metrics = [sum("sales").alias("total_sales"), sum("profit").alias("total_profit")]
df_select.groupBy("region").agg(*metrics, count("*").alias("count")).display()

region,total_sales,total_profit,count
East,450,70,2
North,520,85,2
West,1050,140,3
South,100,15,1
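
The call above mixes an unpacked list with one extra expression; Python allows `*` unpacking alongside ordinary arguments in the same call. The numbers can also be cross-checked without Spark. A minimal sketch, using only the (region, sales, profit) values from the sample data:

```python
from collections import defaultdict

rows = [("East", 200, 30), ("East", 250, 40), ("North", 300, 50),
        ("North", 220, 35), ("West", 400, 60), ("West", 500, 70),
        ("West", 150, 10), ("South", 100, 15)]

# region -> [total_sales, total_profit, count]
totals = defaultdict(lambda: [0, 0, 0])
for region, sales, profit in rows:
    totals[region][0] += sales
    totals[region][1] += profit
    totals[region][2] += 1

for region, values in totals.items():
    print(region, *values)  # * unpacks the three numbers into print()
# East 450 70 2
# North 520 85 2
# West 1050 140 3
# South 100 15 1
```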


##### 5) Using * with .drop()

In [0]:
cols_to_drop = ["sales", "profit"]
drp = df_select.drop(*cols_to_drop)
drp.display()

name,age,city,region
Joseph,25,New York,East
Janani,30,Pune,East
Mukesh,22,Noida,North
Naresh,26,Chennai,North
Priya,28,Mumbai,West
Ravi,27,Delhi,West
Rahul,32,Bangalore,West
Roshan,19,Cochin,South
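
Dropping columns is the mirror image of selecting the remaining ones. The list of remaining columns can be computed in plain Python; the `select` call in the final comment is an equivalent formulation, stated as an assumption rather than run here.

```python
columns = ["name", "age", "city", "sales", "profit", "region"]
cols_to_drop = ["sales", "profit"]

# Keep every column that is not in the drop list, preserving the original order.
remaining = [c for c in columns if c not in cols_to_drop]

print(remaining)  # ['name', 'age', 'city', 'region']

# In the notebook, df_select.select(*remaining) should give the same columns
# as df_select.drop(*cols_to_drop).
```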


| Use Case                   | Without `*`                                     | With `*`       |
| -------------------------- | ----------------------------------------------- | -------------- |
| `df.select(selected_cols)` | ✅ Works: `select` accepts a single list         | ✅ Unpacks list |
| `df.groupBy().agg([...])`  | ❌ `AssertionError`: list is passed as one arg   | ✅ Unpacks list |
| `df.orderBy([...])`        | ✅ Works: `orderBy` accepts a single list        | ✅ Unpacks list |
