**Why lit() is Used**

      array([lit(x) for x in fixed_values_cust])
      
- The elements of **fixed_values_cust** are simple **Python integers**, and to use them in Spark expressions like **array**, they must be **converted** to **PySpark column-compatible** literals.

- **Without lit()**, the bare integers are **not valid column expressions**, and **array()** raises a **PySparkTypeError**.

In [0]:
# Define the fixed integer values
fixed_values_cust = [10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45]

[x for x in fixed_values_cust]

[10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45]

In [0]:
from pyspark.sql.functions import lit

[lit(x) for x in fixed_values_cust]

[Column<'10'>,
 Column<'13'>,
 Column<'15'>,
 Column<'20'>,
 Column<'25'>,
 Column<'28'>,
 Column<'30'>,
 Column<'35'>,
 Column<'38'>,
 Column<'40'>,
 Column<'45'>]

     [
        Column<'10'>,  # A PySpark column object for the literal value 10
        Column<'13'>,  # A PySpark column object for the literal value 13
        Column<'15'>,  # A PySpark column object for the literal value 15
        Column<'20'>,  # A PySpark column object for the literal value 20
        Column<'25'>,  # A PySpark column object for the literal value 25
        Column<'28'>,  # A PySpark column object for the literal value 28
        Column<'30'>,  # A PySpark column object for the literal value 30
        Column<'35'>,  # A PySpark column object for the literal value 35
        Column<'38'>,  # A PySpark column object for the literal value 38
        Column<'40'>,  # A PySpark column object for the literal value 40
        Column<'45'>   # A PySpark column object for the literal value 45
     ]

#### **Explanation:**

      array([lit(x) for x in fixed_values_cust])

**fixed_values_cust:**
- This is a **Python list** containing predefined **numeric values**:

    [10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45]

**lit(x):**

- The **lit()** function in PySpark creates a **column object** representing a **literal value (a constant)**.

- For each element **x** in the **fixed_values_cust** list, **lit(x)** converts it into a **PySpark literal column**.

- **lit(x)** converts each element **x** of the **Python list** into a **Spark Column**, which is necessary for the **array()** function to create a **new array column** within the **DataFrame**.

**List Comprehension:**

- The comprehension **[lit(x) for x in fixed_values_cust]** iterates over every value **x** in the list **fixed_values_cust** and applies the **lit(x)** function to it.

- As a result, it produces a new list where each item is a PySpark column object representing the corresponding value from **fixed_values_cust**.

- Imagine you want to add a new column named **source** with the value **manual** to every row of your DataFrame.

- You would use **df.withColumn("source", lit("manual"))**.

- If you tried **df.withColumn("source", "manual")** directly, it would raise an **error** because **"manual" is a Python string, not a PySpark Column**.

- While **list comprehensions** themselves are a **Python construct** for creating **lists**, their **integration** with **PySpark** operations often necessitates **lit()** to **bridge the gap** between **Python literals and PySpark's Column-based operations**.

In [0]:
from pyspark.sql.functions import when, col, array, array_contains

In [0]:
# Converts the `fixed_values_cust` Python list into a `PySpark array column` where each element is wrapped as a literal (`lit`).
array([lit(x) for x in fixed_values_cust])

Column<'array(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)'>

In [0]:
array(*[lit(x) for x in fixed_values_cust])

Column<'array(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)'>

**Why Use array for List Comprehension?**

- **Consolidate Fixed Values into a Single Data Structure**:

  - The comprehension **[lit(x) for x in fixed_values_cust]** generates a **Python list of PySpark literal column objects**. However, DataFrame methods such as **withColumn()** expect a **single Column**, so operations like **indexing or random selection** cannot be applied to the Python list directly.
  
  - The **array()** function **combines** these individual **column literals into a single PySpark array column**, which is a valid column type for further DataFrame operations.

In [0]:
df = spark.range(1)  # dummy row
fixed_values_cust = [10, 13, 15]

# Case 1 - without unpacking
df1 = df.withColumn("array_col_wrong", array([lit(x) for x in fixed_values_cust]))

# Case 2 - with unpacking
df2 = df.withColumn("array_col_right", array(*[lit(x) for x in fixed_values_cust]))

display(df1)
display(df2)

id,array_col_wrong
0,"List(10, 13, 15)"


id,array_col_right
0,"List(10, 13, 15)"


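Both calls produce the **same flat array**, as the outputs above show: PySpark's **array()** accepts either a **single list** of columns or the columns as **separate arguments**, so the name **array_col_wrong** is misleading.

     +-------------------+
     |  array_col_wrong  |
     +-------------------+
     |   [10, 13, 15]    |    <-- flat array, not nested
     +-------------------+

     +-------------------+
     |  array_col_right  |
     +-------------------+
     |   [10, 13, 15]    |    <-- identical result
     +-------------------+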


In [0]:
df1.explain(True)
df2.explain(True)

== Parsed Logical Plan ==
'Project [id#24L, 'array(10, 13, 15) AS array_col_wrong#26]
+- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, array_col_wrong: array<int>
Project [id#24L, array(10, 13, 15) AS array_col_wrong#26]
+- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#24L, [10,13,15] AS array_col_wrong#26]
+- Range (0, 1, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#24L, [10,13,15] AS array_col_wrong#26]
+- *(1) Range (0, 1, step=1, splits=8)

== Parsed Logical Plan ==
'Project [id#24L, 'array(10, 13, 15) AS array_col_right#29]
+- Range (0, 1, step=1, splits=Some(8))

== Analyzed Logical Plan ==
id: bigint, array_col_right: array<int>
Project [id#24L, array(10, 13, 15) AS array_col_right#29]
+- Range (0, 1, step=1, splits=Some(8))

== Optimized Logical Plan ==
Project [id#24L, [10,13,15] AS array_col_right#29]
+- Range (0, 1, step=1, splits=Some(8))

== Physical Plan ==
*(1) Project [id#24L, [10,13,15] A

In [0]:
data = [(1, "Rakesh", 25, "Sales"),
        (2, "Kiran", 29, "Admin"),
        (3, "Preeti", 31, "Marketing"),
        (4, "Subash", 33, "HR"),
        (5, "Sekhar", 35, "Maintenance"),
        (6, "Nirmal", 55, "Security"),
        (7, "Sailesh", 35, "IT"),
        (8, "kumar", 29, "Sales"),
        (9, "Asif", 39, "HR"),
        (10, "Murugan", 40, "Admin"),
        (11, "Prakash", 45, "Marketing")]

columns = ["id", "Name", "Age", "Department"]

df = spark.createDataFrame(data, columns)
display(df)

id,Name,Age,Department
1,Rakesh,25,Sales
2,Kiran,29,Admin
3,Preeti,31,Marketing
4,Subash,33,HR
5,Sekhar,35,Maintenance
6,Nirmal,55,Security
7,Sailesh,35,IT
8,kumar,29,Sales
9,Asif,39,HR
10,Murugan,40,Admin


In [0]:
# Add a column with random fixed values
df_with_fixed_value33 = df.withColumn('fixed_value', array([x for x in fixed_values_cust]))
display(df_with_fixed_value33)

---------------------------------------------------------------------------
PySparkTypeError                          Traceback (most recent call last)
File <command-1884206424256289>, line 2
      1 # Add a column with random fixed values
----> 2 df_with_fixed_value33 = df.withColumn('fixed_value', array([x for x in fixed_values_cust]))
      3 display(df_with_fixed_value33)

File /databricks/spark/python/pyspark/sql/utils.py:264, in try_remote_functions.<locals>.wrapped(*args, **kwargs)
    262     return getattr(functions, f.__name__)(*args, **kwargs)
    263 else:
--> 264

In [0]:
# Add the fixed-values array twice: once passing the list directly, once unpacking it with *
df_with_fixed_value3 = df.withColumn('fixed_value', array([lit(x) for x in fixed_values_cust])) \
                         .withColumn('fixed_value_unpack', array(*[lit(x) for x in fixed_values_cust]))
display(df_with_fixed_value3)

id,Name,Age,Department,fixed_value,fixed_value_unpack
1,Rakesh,25,Sales,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
2,Kiran,29,Admin,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
3,Preeti,31,Marketing,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
4,Subash,33,HR,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
5,Sekhar,35,Maintenance,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
6,Nirmal,55,Security,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
7,Sailesh,35,IT,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
8,kumar,29,Sales,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
9,Asif,39,HR,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
10,Murugan,40,Admin,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)","List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"


In [0]:
# Add a column holding the first element (index 0) of the fixed-values array
df_with_fixed_value4 = df.withColumn(
    'fixed_value', 
    array([lit(x) for x in fixed_values_cust])[0]
)
display(df_with_fixed_value4)

id,Name,Age,Department,fixed_value
1,Rakesh,25,Sales,10
2,Kiran,29,Admin,10
3,Preeti,31,Marketing,10
4,Subash,33,HR,10
5,Sekhar,35,Maintenance,10
6,Nirmal,55,Security,10
7,Sailesh,35,IT,10
8,kumar,29,Sales,10
9,Asif,39,HR,10
10,Murugan,40,Admin,10


#### Filtering Rows (Checking Whether a Column Value Is in a Fixed List)

In [0]:
fixed_values_cust = [10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45]

# Create an array literal and check if Age is in that array
df_filtered = df.withColumn('fixed_value', array([lit(x) for x in fixed_values_cust])) \
                .filter(array_contains(array(*[lit(x) for x in fixed_values_cust]), col("Age")))

display(df_filtered)

id,Name,Age,Department,fixed_value
1,Rakesh,25,Sales,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
5,Sekhar,35,Maintenance,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
7,Sailesh,35,IT,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
10,Murugan,40,Admin,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"
11,Prakash,45,Marketing,"List(10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45)"


- **array()** takes **multiple column expressions**, either as **separate arguments** or as a **single list** of them.
- The `*` operator (called **unpacking** in Python) unpacks a list into individual arguments.

**without `*`:**

        array([lit(10), lit(13), lit(15), lit(20), lit(25), lit(28), lit(30), lit(35), lit(38), lit(40), lit(45)])

- This passes a **single list** as **one argument** to array(). PySpark's **array()** accepts this form and uses the list items as its columns, which is why **fixed_value** and **fixed_value_unpack** in the earlier output are identical.

**with `*`:**

      array(*[lit(10), lit(13), lit(15), lit(20), lit(25), lit(28), lit(30), lit(35), lit(38), lit(40), lit(45)])

      # This unpacks the list so that it becomes:
      array(lit(10), lit(13), lit(15), lit(20), lit(25), lit(28), lit(30), lit(35), lit(38), lit(40), lit(45))
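
The effect of `*` can be seen in plain Python, independent of Spark. The sketch below uses two hypothetical helpers that are not part of PySpark: `count_args` only reports how many positional arguments it received, and `normalize_cols` illustrates how a function can accept both call styles by treating a lone list argument as the list of items.

```python
def count_args(*args):
    """Hypothetical helper: report how many positional arguments arrived."""
    return len(args)


def normalize_cols(*cols):
    """Hypothetical helper that accepts both call styles.

    If the only argument is a list, use its items as the columns.
    """
    if len(cols) == 1 and isinstance(cols[0], list):
        return tuple(cols[0])
    return cols


values = [10, 13, 15]

without_star = count_args(values)    # the whole list arrives as one argument
with_star = count_args(*values)      # the list is unpacked into three arguments
print(without_star, with_star)       # 1 3

# Both call styles end up with the same three items
same_result = normalize_cols(values) == normalize_cols(*values)
print(same_result)                   # True
```

A function written like `count_args` would see one list instead of three values when `*` is omitted; a function written like `normalize_cols` hides that difference from the caller.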


- **array(*[lit(x) for x in fixed_values_cust])** creates a **literal** array column from your list.
- **array_contains(..., col("Age"))** checks if the **Age value exists** in that **array**.

#### Using in when/case Expressions

In [0]:
fixed_values_cust = [10, 13, 15, 20, 25, 28, 30, 35, 38, 40, 45]

df_caseWhen = df.withColumn(
    "is_fixed",
    when(col("Age").isin(*fixed_values_cust), lit(True)).otherwise(lit(False))
)
display(df_caseWhen)

id,Name,Age,Department,is_fixed
1,Rakesh,25,Sales,True
2,Kiran,29,Admin,False
3,Preeti,31,Marketing,False
4,Subash,33,HR,False
5,Sekhar,35,Maintenance,True
6,Nirmal,55,Security,False
7,Sailesh,35,IT,True
8,kumar,29,Sales,False
9,Asif,39,HR,False
10,Murugan,40,Admin,True
