'''
You are given a dataset containing information about customers and their list of purchased items stored in an array column. Your task is to:

Find the size of the array for each row
Find the last element of the array
Find the count of unique elements in the array
Input Schema & Example
Column Name	Data Type
customer_id	INT
items	ARRAY<STRING>
Example Input Table
customer_id	items
1	["apple", "banana", "apple"]
2	["pen", "pencil", "ink"]
3	[]
Output Schema
Column Name	Data Type
customer_id	INT
array_size	INT
last_item	STRING
unique_count	INT
Example Output Table
customer_id	array_size	last_item	unique_count
1	3	apple	2
2	3	ink	3
3	0	null	0
Explanation
For customer_id = 1:

["apple", "banana", "apple"] has size 3
Last element is apple
Unique elements are apple, banana → count = 2
For customer_id = 2:

Size = 3
Last element = ink
Unique count = 3
For customer_id = 3:

Size = 0
No last element → null
Unique count = 0
Starter Code
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, ["apple", "banana", "apple"]),
    (2, ["pen", "pencil", "ink"]),
    (3, []),
    (4, ["milk", "bread", "milk", "butter"]),
    (5, ["apple", "orange", "banana", "orange" , "apple"]),
    (6, ["pen", "pen"]),
    (7, ["mobile"])
]

columns = ["customer_id", "items"]

df = spark.createDataFrame(data, columns)

# Your logic goes here to create df_result

df_result.show(truncate=False)

'''

'''
Error seen when using element_at(items, -1):
25/12/18 17:50:35 ERROR Executor: Exception in task 2.0 in stage 3.0 (TID 8)
org.apache.spark.SparkArrayIndexOutOfBoundsException: [INVALID_ARRAY_INDEX_IN_ELEMENT_AT] 
The index -1 is out of bounds. 
The array has 0 elements. Use `try_element_at` to tolerate accessing element at invalid index and return NULL instead. SQLSTATE: 22003

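Solution:

With ANSI mode enabled (spark.sql.ansi.enabled = true, the default since Spark 4.0), element_at(array, -1) throws this error for an empty array instead of silently returning NULL.

✅ Correct & Safe Fix (Recommended)

Use try_element_at instead of element_at. It returns NULL when the index is out of bounds, as the error message itself suggests.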

💡 Interview Tip

If asked:

How do you safely access array elements in Spark?

Say:

"Use try_element_at to avoid runtime failures when arrays are empty or indexes are invalid."

'''

# Initialize Spark session
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName('Spark Playground').getOrCreate()

data = [
    (1, ["apple", "banana", "apple"]),
    (2, ["pen", "pencil", "ink"]),
    (3, []),
    (4, ["milk", "bread", "milk", "butter"]),
    (5, ["apple", "orange", "banana", "orange" , "apple"]),
    (6, ["pen", "pen"]),
    (7, ["mobile"])
]

columns = ["customer_id", "items"]

df = spark.createDataFrame(data, columns)

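df_result = (
  # size() returns 0 for an empty array
  df.withColumn("array_size", F.size(F.col("items")))
  # try_element_at returns NULL instead of raising when the index is out of bounds
  .withColumn("last_item", F.expr("try_element_at(items, -1)"))
  # array_distinct drops duplicate elements before they are counted
  .withColumn("unique_count", F.size(F.array_distinct(F.col("items"))))
  .drop("items")
)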

df_result.show()

+-----------+----------+---------+------------+
|customer_id|array_size|last_item|unique_count|
+-----------+----------+---------+------------+
|          1|         3|    apple|           2|
|          2|         3|      ink|           3|
|          3|         0|     NULL|           0|
|          4|         4|   butter|           3|
|          5|         5|    apple|           3|
|          6|         2|      pen|           1|
|          7|         1|   mobile|           1|
+-----------+----------+---------+------------+
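
As a sanity check on the table above, the same three metrics can be recomputed in plain Python without a Spark session. This is only a sketch for cross-checking the expected values, not part of the Spark solution: the helper name summarize_items is hypothetical, and it assumes the arrays contain no null elements, so that len(set(...)) matches what array_distinct would count.

```python
from typing import List, Optional, Tuple


def summarize_items(items: List[str]) -> Tuple[int, Optional[str], int]:
    """Return (array_size, last_item, unique_count) for one customer's items."""
    array_size = len(items)
    # Mirrors try_element_at(items, -1): None (NULL) when the list is empty.
    last_item = items[-1] if items else None
    # Mirrors size(array_distinct(items)), assuming no null elements.
    unique_count = len(set(items))
    return array_size, last_item, unique_count


# Same sample rows as the Spark example.
data = [
    (1, ["apple", "banana", "apple"]),
    (2, ["pen", "pencil", "ink"]),
    (3, []),
    (4, ["milk", "bread", "milk", "butter"]),
    (5, ["apple", "orange", "banana", "orange", "apple"]),
    (6, ["pen", "pen"]),
    (7, ["mobile"]),
]

# Map customer_id -> (array_size, last_item, unique_count).
summary = {customer_id: summarize_items(items) for customer_id, items in data}

for customer_id, row in summary.items():
    print(customer_id, *row)
```

Comparing the printed rows with df_result.show() is a quick way to confirm that the empty array for customer 3 yields a size of 0, a missing last item, and a unique count of 0.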

