In [0]:
from pyspark.sql.functions import rand, lit, array

In [0]:
# Create a sample DataFrame
data = [(1,), (2,), (3,), (4,), (5,), (6,), (7,), (8,), (9,), (10,)]
df = spark.createDataFrame(data, ["id"])
display(df)

id
1
2
3
4
5
6
7
8
9
10


#### **Using rand() to Randomly Select Elements from a List**

     # Define the fixed integer values
     fixed_values_cust = [25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160]

     # Add a column with random fixed values
     df_with_fixed_value = df.withColumn(
                      'fixed_value', 
                      array([lit(x) for x in fixed_values_cust])[(rand() * len(fixed_values_cust)).cast('int')]
     )
     display(df_with_fixed_value)

In [0]:
# Define the fixed integer values
# A list of predefined fixed values that you want to use as random selections for the fixed_value column.
fixed_values_cust = [25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160]

len(fixed_values_cust)

11

In [0]:
# Add a column holding a random index into fixed_values_cust (an integer from 0 to len - 1)
# This index is what later selects an element from the array
df_with_fixed_value2 = df.withColumn(
    'fixed_value', (rand() * len(fixed_values_cust)).cast('int')
)
display(df_with_fixed_value2)

id,fixed_value
1,5
2,0
3,3
4,9
5,4
6,4
7,9
8,6
9,7
10,6
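The index column above can be checked outside Spark. The following is a minimal plain-Python sketch, not Spark code: it assumes `random.Random` as a stand-in for `rand()`, and the seed and sample size are arbitrary choices for illustration. It draws many indices with the same arithmetic and counts how often each one appears.

```python
import random
from collections import Counter

n_values = 11  # len(fixed_values_cust) in the notebook

# Stand-in for rand(): a seeded generator producing floats in [0.0, 1.0)
rng = random.Random(0)

# Same arithmetic as (rand() * len(fixed_values_cust)).cast('int')
indices = [int(rng.random() * n_values) for _ in range(11_000)]

# Count how often each index was drawn
counts = Counter(indices)
for index in sorted(counts):
    print(index, counts[index])
```

Every index from 0 to 10 should appear with a roughly similar count, and no index outside that range should appear.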


**Why lit() is Used**
- PySpark column functions such as **array()** operate on **columns and expressions**, not on plain Python values.
- The elements of **fixed_values_cust** are simple **Python integers**, so to use them in Spark expressions like **array**, they must be **converted** to **PySpark column-compatible** literals.
- Without lit(), Spark does **not treat** the elements as **valid column expressions**: a plain string is interpreted as a **column name**, and a Python integer typically raises an **error**.

In [0]:
[x for x in fixed_values_cust]

[25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160]

**Explanation:**
- **fixed_values_cust:** This is a Python list containing predefined numeric values:

    [25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160].

**lit(x):**

- The `lit()` function in PySpark creates a column object representing a literal value (a constant).
- For each element `x` in the `fixed_values_cust` list, `lit(x)` converts it into a PySpark literal column.

**List Comprehension:**

- The comprehension `[lit(x) for x in fixed_values_cust]` iterates over every value `x` in the list `fixed_values_cust` and applies the `lit(x)` function to it.
- As a result, it produces a new list where each item is a PySpark column object representing the corresponding value from `fixed_values_cust`.

In [0]:
[lit(x) for x in fixed_values_cust]

[Column<'25'>,
 Column<'30'>,
 Column<'40'>,
 Column<'55'>,
 Column<'70'>,
 Column<'85'>,
 Column<'100'>,
 Column<'130'>,
 Column<'150'>,
 Column<'145'>,
 Column<'160'>]

     [
        Column<'25'>,  # A PySpark column object for the literal value 25
        Column<'30'>,  # A PySpark column object for the literal value 30
        Column<'40'>,  # A PySpark column object for the literal value 40
        Column<'55'>,  # A PySpark column object for the literal value 55
        Column<'70'>,  # A PySpark column object for the literal value 70
        Column<'85'>,  # A PySpark column object for the literal value 85
        Column<'100'>, # A PySpark column object for the literal value 100
        Column<'130'>, # A PySpark column object for the literal value 130
        Column<'150'>, # A PySpark column object for the literal value 150
        Column<'145'>, # A PySpark column object for the literal value 145
        Column<'160'>  # A PySpark column object for the literal value 160
     ]

In [0]:
# Converts the `fixed_values_cust` Python list into a `PySpark array column` where each element is wrapped as a literal (`lit`).
array([lit(x) for x in fixed_values_cust])

Column<'array(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)'>

**Why Wrap the List Comprehension in array()?**

- **Consolidate Fixed Values into a Single Data Structure**:

  - The **[lit(x) for x in fixed_values_cust]** generates a **list of PySpark literal column objects**. However, PySpark operations, such as **indexing or random selection**, cannot directly operate on a Python list.
  
  - The **array()** function **combines** these individual **column literals into a single PySpark array column**, which is a valid column type for further DataFrame operations.

In [0]:
# Add a column containing the full array of fixed values (the same array in every row)
df_with_fixed_value3 = df.withColumn(
    'fixed_value', 
    array([lit(x) for x in fixed_values_cust])
)
display(df_with_fixed_value3)

id,fixed_value
1,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
2,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
3,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
4,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
5,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
6,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
7,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
8,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
9,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"
10,"List(25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160)"


In [0]:
# Add a column with the first array element (index 0)
df_with_fixed_value4 = df.withColumn(
    'fixed_value', 
    array([lit(x) for x in fixed_values_cust])[0]
)
display(df_with_fixed_value4)

id,fixed_value
1,25
2,25
3,25
4,25
5,25
6,25
7,25
8,25
9,25
10,25


In [0]:
# Add a column with the last array element (index len(fixed_values_cust) - 1)
df_with_fixed_value = df.withColumn(
    'fixed_value', 
    array([lit(x) for x in fixed_values_cust])[len(fixed_values_cust)-1]
)
display(df_with_fixed_value)

id,fixed_value
1,160
2,160
3,160
4,160
5,160
6,160
7,160
8,160
9,160
10,160


In [0]:
# Define the fixed integer values
fixed_values_cust = [25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160]

# Add a column with random fixed values
df_with_fixed_value = df.withColumn(
    'fixed_value', 
    array([lit(x) for x in fixed_values_cust])[(rand() * len(fixed_values_cust)).cast('int')]
)
display(df_with_fixed_value)

id,fixed_value
1,130
2,100
3,130
4,55
5,150
6,70
7,85
8,100
9,85
10,25


The expression **[(rand() * len(fixed_values_cust)).cast('int')]** is used to generate a **random index** within the range of the **fixed_values_cust** list. 

**rand():**
- Generates a random float for each row, uniformly distributed in **[0.0, 1.0)** (0 is possible, 1 is not).

**rand() * len(fixed_values_cust):**
- **Multiplies** the **random float** by the **length of the fixed_values_cust** list, which scales the random value to the range **[0, len(fixed_values_cust))**.

**(rand() * len(fixed_values_cust)).cast('int'):**
- Casts the scaled random **float to an integer** by dropping the fractional part, producing a random **index** from **0** to **len(fixed_values_cust) - 1**.

**[(rand() * len(fixed_values_cust)).cast('int')]:**
- The square brackets here are **not a Python list**. Applied to the **array column**, they perform **element access**, so each row receives the array element at its random **index**.
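The boundary cases of this index arithmetic can be checked with a short plain-Python sketch. It is an analogue of the Spark expression, not Spark code: the helper `to_index` is a hypothetical name introduced for illustration, and Python's `int()` stands in for `cast('int')`, since both drop the fractional part.

```python
import sys

fixed_values_cust = [25, 30, 40, 55, 70, 85, 100, 130, 150, 145, 160]
n = len(fixed_values_cust)  # 11


def to_index(r, n):
    """Plain-Python analogue of (rand() * n).cast('int') for one random draw r."""
    return int(r * n)


# Largest float strictly below 1.0, i.e. the upper edge of what rand() can return
largest_below_one = 1.0 - sys.float_info.epsilon / 2

lowest_index = to_index(0.0, n)                 # smallest possible draw
middle_index = to_index(0.5, n)                 # 5.5 is truncated to 5
highest_index = to_index(largest_below_one, n)  # stays below n, so the index is n - 1

print(lowest_index, middle_index, highest_index)
print(fixed_values_cust[lowest_index], fixed_values_cust[highest_index])
```

Because the draw never reaches 1.0, the scaled value never reaches `n`, so the index always stays within the valid range of the list.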