#### How to convert the values of a column into a Python list?

##### 1) Using .collect()
- This is the simplest method, but it **pulls all data to the driver**, so it’s only good for **small datasets**.

In [0]:
df = spark.read.csv("/Volumes/@azureadb/pyspark/unionby/company_level.csv", header=True, inferSchema=True)
display(df.limit(15))

start_date,product_url,category,default_group,source_target,cloud_flatform,session_id,session_name,status_name,status_type,sessions,product_id,load datetime
2025-08-25,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876543,first_visit,first_visit,Not Available,5,409516064,2025-09-02T19:10:35
2025-08-26,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876544,purchase,organic,Not Available,12,409516064,2025-09-02T19:10:36
2025-08-27,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876545,search,network,Not Available,16,409516064,2025-09-02T19:10:37
2025-08-28,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876546,search,scroll,Not Available,22,409516064,2025-09-02T19:10:38
2025-08-29,shop.sony.bpl,mobile,wifi-network,google,azure / aws / gcc,9876547,search,organic,Not Available,25,409516064,2025-09-02T19:10:39
2025-08-30,shop.sony.bpl,mobile,wifi-network,(not set),azure / aws / gcc,9876548,add_to_cart,organic,Not Available,4,409516064,2025-09-02T19:10:40
2025-08-31,shop.sony.bpl,mobile,wifi-network,(not set),azure / aws / gcc,9876549,add_to_cart,organic,Not Available,9,409516064,2025-09-02T19:10:41
2025-09-01,shop.sony.bpl,mobile,wifi-network,(none) / (direct),azure / aws / gcc,9876550,add_to_cart,first_visit,Not Available,8,409516064,2025-09-02T19:10:42
2025-09-02,shop.sony.bpl,mobile,wifi-network,flipkart / referral,azure / aws / gcc,9876551,add_to_cart,first_visit,Not Available,7,409516064,2025-09-02T19:10:43
2025-09-03,shop.sony.bpl,mobile,wifi-network,(data not available),azure / aws / gcc,9876552,add_to_cart,first_visit,Not Available,6,409516064,2025-09-02T19:10:44


In [0]:
# Convert a column into a Python list
values_list = [row["status_type"] for row in df.select("status_type").collect()]
print(values_list)

['Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Audit', 'Audit', 'Audit', 'Audit', 'Audit', 'Audit',

- **df.select("status_type")** → selects only the **status_type** column from the DataFrame.

- **.collect()** → brings **all rows** of the **selected column** to the **driver** as a **list of Row objects**.

- The list comprehension **[row["status_type"] for row in ...]** extracts the value of **status_type** from **each Row** and creates a **Python list**.

**When to use which method:**
- For **small datasets** where you need the **entire column** as a **Python list on the driver**, **.collect()** with a **list comprehension** or **.toPandas()** is convenient. Both load **all data** into the **driver’s memory**.
- For **large datasets**, avoid **.collect()** and **.toPandas()**. Instead, keep the work in **Spark** without converting to a Python list, or use **take(n)** if you only need a sample.
- When the lists are needed **inside Spark** (for example, one list per group), use **collect_list()** or **collect_set()** as an **aggregation**; the result stays in a DataFrame column instead of being pulled to the driver.
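
As a minimal sketch of the "sample only" advice, the helper below caps the number of rows in Spark before collecting. The name `column_to_list` and the `max_rows` parameter are hypothetical, and `df` is assumed to be a PySpark DataFrame; only `select`, `limit`, and `collect` are used.

```python
def column_to_list(df, column, max_rows=None):
    """Return the values of `column` as a Python list.

    `df` is assumed to be a PySpark DataFrame. When `max_rows` is given,
    the row count is capped in Spark before anything reaches the driver.
    """
    selected = df.select(column)
    if max_rows is not None:
        # limit() is applied on the cluster side, so at most max_rows rows are collected
        selected = selected.limit(max_rows)
    return [row[column] for row in selected.collect()]
```

For example, `column_to_list(df, "status_type", max_rows=100)` would bring back at most 100 values, while omitting `max_rows` behaves like the plain `.collect()` version above.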

##### 2) Getting only distinct values
- **.distinct()** removes **duplicates** before collecting.

In [0]:
distinct_list = [row["status_type"] for row in df.select("status_type").distinct().collect()]
print(distinct_list)

['Not Available', 'Search', 'Checkout', 'Subscribe', 'Marketing', 'Autocomplete', 'Customer', 'Audit']
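
Spark does not guarantee the order of rows returned after **.distinct()**, so the list above can come back in a different order on another run. If a stable order matters, one option is to sort in Spark before collecting. The sketch below assumes `df` is a PySpark DataFrame; the helper name `distinct_values` is hypothetical.

```python
def distinct_values(df, column):
    """Return the distinct values of `column`, sorted ascending by Spark."""
    rows = (
        df.select(column)
        .distinct()        # drop duplicate values on the cluster
        .orderBy(column)   # make the result order deterministic
        .collect()
    )
    return [row[column] for row in rows]
```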


##### 3) Using toPandas() (not recommended for large datasets)

In [0]:
values_list = df.select("status_type").toPandas()["status_type"].tolist()
print(values_list)

['Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Not Available', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Checkout', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Subscribe', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Search', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Autocomplete', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Customer', 'Audit', 'Audit', 'Audit', 'Audit', 'Audit', 'Audit',

- Converts the selected column to a **Pandas DataFrame**, then uses Pandas **.tolist()**.
- Works, but the collected data must **fit in memory** on the **driver**.

##### 4) Using collect_list() or collect_set() (for aggregation)
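- **collect_list()** keeps duplicates.
- **collect_set()** returns unique values only.
- Neither function guarantees the **order** of the elements in the resulting array.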

In [0]:
from pyspark.sql import functions as F

data = [("A", 1), ("A", 1), ("A", 2), ("B", 3), ("B", 1), ("C", 1), ("C", 2), ("C", 2), ("C", 3), ("C", 3), ("D", 2), ("D", 3), ("D", 3), ("E", 1), ("E", 2), ("E", 2)]

df = spark.createDataFrame(data, ["Category", "Value"])
display(df)

Category,Value
A,1
A,1
A,2
B,3
B,1
C,1
C,2
C,2
C,3
C,3


In [0]:
# Collect values into a list for each category
df_collected = df.groupBy("Category").agg(F.collect_list("Value").alias("ValueList"))
df_collected.display()

Category,ValueList
A,"List(1, 1, 2)"
B,"List(3, 1)"
C,"List(1, 2, 2, 3, 3)"
D,"List(2, 3, 3)"
E,"List(1, 2, 2)"


In [0]:
# Collect unique values into a list for each category
df_collected_set = df.groupBy("Category").agg(F.collect_set("Value").alias("UniqueValueList"))
df_collected_set.display()

Category,UniqueValueList
A,"List(1, 2)"
B,"List(3, 1)"
C,"List(1, 2, 3)"
D,"List(2, 3)"
E,"List(1, 2)"
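
The aggregated result is still a Spark DataFrame. If the per-category lists are needed on the driver, and the number of groups is small, one way is to collect the grouped rows into a dictionary. This is a sketch with a hypothetical helper name, `grouped_lists_to_dict`, assuming its input is a DataFrame like `df_collected` above.

```python
def grouped_lists_to_dict(df_grouped, key_col, list_col):
    """Map each key to a plain Python list of its collected values.

    `df_grouped` is assumed to be the output of a
    groupBy(...).agg(collect_list(...)) or collect_set(...) call.
    """
    return {row[key_col]: list(row[list_col]) for row in df_grouped.collect()}
```

Calling `grouped_lists_to_dict(df_collected, "Category", "ValueList")` would give one entry per category; the order of values inside each list is whatever Spark produced.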


##### 5) Using .rdd.flatMap()

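- This avoids an explicit **list comprehension** by going through **Spark’s RDD API**. Here **df** is the DataFrame loaded from the CSV in section 1, not the **Category/Value** DataFrame from section 4.
- **.rdd** is not implemented on **Spark Connect** DataFrames, so on such compute the cell fails with **PySparkNotImplementedError**, as the traceback below shows.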

In [0]:
values_list = df.select("status_type").rdd.flatMap(lambda x: x).collect()
print(values_list)

---------------------------------------------------------------------------
PySparkNotImplementedError                Traceback (most recent call last)
File <command-5983335820974137>, line 1
----> 1 values_list = df.select("status_type").rdd.flatMap(lambda x: x).collect()
      2 print(values_list)

File /databricks/python/lib/python3.12/site-packages/pyspark/sql/connect/dataframe.py:2330, in DataFrame.rdd(self)
   2328 @property
   2329 def rdd(self) -> "RDD[Row]":
-> 2330     raise PySparkNotImplementedError(
   2331         errorC

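- **.rdd** → exposes the **DataFrame** as an **RDD of Row objects**.
- **.flatMap(lambda x: x)** → iterates over the fields of each **Row**; with a single selected column, this yields **one value per row**.
- **.collect()** → returns those values to the driver as a **Python list**.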
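
Because **.rdd** raises on Spark Connect, a portable variant can try the RDD route and fall back to a plain **.collect()**. The sketch below uses a hypothetical helper name, `column_values`, and relies on the assumption that `PySparkNotImplementedError` is a subclass of Python's built-in `NotImplementedError`; if that does not hold in your PySpark version, catch the PySpark exception explicitly.

```python
def column_values(df, column):
    """Return the values of `column` as a list, with or without RDD support."""
    selected = df.select(column)
    try:
        # Classic sessions: a Row is iterable, so flatMap yields its single field
        return selected.rdd.flatMap(lambda row: row).collect()
    except NotImplementedError:
        # Assumed to cover PySparkNotImplementedError raised on Spark Connect
        return [row[0] for row in selected.collect()]
```

In practice the fallback branch is the same approach as method 1, so on Spark Connect compute the list comprehension alone is sufficient.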