# Working with Complex Types

Besides numbers, booleans and strings, Spark can work with dates and timestamps, with collections (arrays and maps) and with nested types (structs).

In [78]:
df = spark.read.format("csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/work/data/retail-data/all/*.csv")\
  .coalesce(5)
df.cache()
df.printSchema()


root
 |-- InvoiceNo: string (nullable = true)
 |-- StockCode: string (nullable = true)
 |-- Description: string (nullable = true)
 |-- Quantity: integer (nullable = true)
 |-- InvoiceDate: string (nullable = true)
 |-- UnitPrice: double (nullable = true)
 |-- CustomerID: integer (nullable = true)
 |-- Country: string (nullable = true)





### Date and time

In [79]:
df.select('InvoiceDate').show(5)

+--------------+
|   InvoiceDate|
+--------------+
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
|12/1/2010 8:26|
+--------------+
only showing top 5 rows



In [80]:
from pyspark.sql.functions import col, to_date, to_timestamp
# InvoiceDate in this data set is month/day/year with 24-hour times,
# so "12/1/2010 8:26" is 1 December 2010, not 12 January.
date_time_format = "M/d/yyyy H:mm"
df_with_datetime = df.select('InvoiceDate',
                             to_date(col('InvoiceDate'), date_time_format).alias('date'),
                             to_timestamp(col('InvoiceDate'), date_time_format).alias('time'))
df_with_datetime.show(5)
df_with_datetime.printSchema()

+--------------+----------+-------------------+
|   InvoiceDate|      date|               time|
+--------------+----------+-------------------+
|12/1/2010 8:26|2010-12-01|2010-12-01 08:26:00|
|12/1/2010 8:26|2010-12-01|2010-12-01 08:26:00|
|12/1/2010 8:26|2010-12-01|2010-12-01 08:26:00|
|12/1/2010 8:26|2010-12-01|2010-12-01 08:26:00|
|12/1/2010 8:26|2010-12-01|2010-12-01 08:26:00|
+--------------+----------+-------------------+
only showing top 5 rows

root
 |-- InvoiceDate: string (nullable = true)
 |-- date: date (nullable = true)
 |-- time: timestamp (nullable = true)
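
The order of the day and month letters in the pattern decides how an ambiguous string such as `12/1/2010` is read. The sketch below illustrates this with Python's standard `datetime.strptime` rather than Spark, so the directive syntax differs (`%d`/`%m` here, `d`/`M` in Spark patterns). The second sample string is a hypothetical value, chosen only to show a parse failure.

```python
from datetime import datetime

sample = "12/1/2010 8:26"

# Month-first reading: 1 December 2010.
month_first = datetime.strptime(sample, "%m/%d/%Y %H:%M")

# Day-first reading of the same string: 12 January 2010.
day_first = datetime.strptime(sample, "%d/%m/%Y %H:%M")

print(month_first.date(), day_first.date())

# A hypothetical value with day 13 cannot be parsed day-first,
# because 13 is not a valid month.
later = "12/13/2010 14:05"
try:
    datetime.strptime(later, "%d/%m/%Y %H:%M")
    day_first_ok = True
except ValueError:
    day_first_ok = False
print(day_first_ok)
```

Checking a few rows whose day is greater than 12 is a quick way to confirm that a pattern matches the data.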



### Structs

In [81]:
df_with_struct = df.selectExpr('(CustomerID, Country) as struct_col')
df_with_struct.show(5)
df_with_struct.printSchema()

+--------------------+
|          struct_col|
+--------------------+
|[17850, United Ki...|
|[17850, United Ki...|
|[17850, United Ki...|
|[17850, United Ki...|
|[17850, United Ki...|
+--------------------+
only showing top 5 rows

root
 |-- struct_col: struct (nullable = false)
 |    |-- CustomerID: integer (nullable = true)
 |    |-- Country: string (nullable = true)



Selecting a struct's fields:

In [82]:
df_with_struct.select('struct_col.Country').show(5)
# or df_with_struct.select(col('struct_col').getField('Country'))

+--------------+
|       Country|
+--------------+
|United Kingdom|
|United Kingdom|
|United Kingdom|
|United Kingdom|
|United Kingdom|
+--------------+
only showing top 5 rows



### Arrays

In [83]:
from pyspark.sql.functions import split
df_with_array = df.select(col('Description'), split(col("Description"), " ").alias('array_col'))
df_with_array.show(5)
df_with_array.printSchema()

+--------------------+--------------------+
|         Description|           array_col|
+--------------------+--------------------+
|WHITE HANGING HEA...|[WHITE, HANGING, ...|
| WHITE METAL LANTERN|[WHITE, METAL, LA...|
|CREAM CUPID HEART...|[CREAM, CUPID, HE...|
|KNITTED UNION FLA...|[KNITTED, UNION, ...|
|RED WOOLLY HOTTIE...|[RED, WOOLLY, HOT...|
+--------------------+--------------------+
only showing top 5 rows

root
 |-- Description: string (nullable = true)
 |-- array_col: array (nullable = true)
 |    |-- element: string (containsNull = true)



Selecting elements from arrays by index; as the output shows, an index past the end of the array (1000 here) yields null:

In [84]:
df_with_array.selectExpr('array_col[0]', 'array_col[2]', 'array_col[1000]').show(5)

+------------+------------+---------------+
|array_col[0]|array_col[2]|array_col[1000]|
+------------+------------+---------------+
|       WHITE|       HEART|           null|
|       WHITE|     LANTERN|           null|
|       CREAM|      HEARTS|           null|
|     KNITTED|        FLAG|           null|
|         RED|      HOTTIE|           null|
+------------+------------+---------------+
only showing top 5 rows



Array length:

In [86]:
from pyspark.sql.functions import size
df_with_array.select(size('array_col')).show(5)

+---------------+
|size(array_col)|
+---------------+
|              5|
|              3|
|              5|
|              6|
|              5|
+---------------+
only showing top 5 rows



Checking whether an array contains a value:

In [89]:
from pyspark.sql.functions import array_contains
df_with_array.select(col('Description'), array_contains('array_col', 'WHITE')).show(5)

+--------------------+--------------------------------+
|         Description|array_contains(array_col, WHITE)|
+--------------------+--------------------------------+
|WHITE HANGING HEA...|                            true|
| WHITE METAL LANTERN|                            true|
|CREAM CUPID HEART...|                           false|
|KNITTED UNION FLA...|                           false|
|RED WOOLLY HOTTIE...|                            true|
+--------------------+--------------------------------+
only showing top 5 rows
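
`array_contains` compares whole elements, so punctuation left attached by the split matters. The plain-Python sketch below uses `str.split` and `in` as stand-ins for Spark's `split` and `array_contains`; it is an approximation of the behaviour, not Spark code.

```python
description = "RED WOOLLY HOTTIE WHITE HEART."

# Splitting on a single space keeps the trailing period on the last token.
tokens = description.split(" ")
print(tokens)

# Membership is an exact element match, not a substring search.
has_white = "WHITE" in tokens
has_heart = "HEART" in tokens  # the element is 'HEART.', so this does not match
print(has_white, has_heart)
```

If punctuation should not affect matching, it has to be stripped before or during the split.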



Explode: explode turns each element of an array into its own row. The values of the other selected columns are repeated on every row produced from the same array.

In [90]:
from pyspark.sql.functions import explode
df_with_array.select(explode('array_col'), 'Description').show()

+-------+--------------------+
|    col|         Description|
+-------+--------------------+
|  WHITE|WHITE HANGING HEA...|
|HANGING|WHITE HANGING HEA...|
|  HEART|WHITE HANGING HEA...|
|T-LIGHT|WHITE HANGING HEA...|
| HOLDER|WHITE HANGING HEA...|
|  WHITE| WHITE METAL LANTERN|
|  METAL| WHITE METAL LANTERN|
|LANTERN| WHITE METAL LANTERN|
|  CREAM|CREAM CUPID HEART...|
|  CUPID|CREAM CUPID HEART...|
| HEARTS|CREAM CUPID HEART...|
|   COAT|CREAM CUPID HEART...|
| HANGER|CREAM CUPID HEART...|
|KNITTED|KNITTED UNION FLA...|
|  UNION|KNITTED UNION FLA...|
|   FLAG|KNITTED UNION FLA...|
|    HOT|KNITTED UNION FLA...|
|  WATER|KNITTED UNION FLA...|
| BOTTLE|KNITTED UNION FLA...|
|    RED|RED WOOLLY HOTTIE...|
+-------+--------------------+
only showing top 20 rows
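
The row duplication can be sketched in plain Python. `explode_rows` is a hypothetical helper written for this illustration, not part of Spark, and the second input row is invented to show that an empty array contributes no output rows. Spark's `explode` behaves the same way for empty arrays, whereas `explode_outer` would keep such a row with a null.

```python
rows = [
    {"Description": "WHITE METAL LANTERN",
     "array_col": ["WHITE", "METAL", "LANTERN"]},
    # Hypothetical row with an empty array: it produces no output.
    {"Description": "EMPTY EXAMPLE", "array_col": []},
]

def explode_rows(rows, array_key):
    """Yield one output row per array element, copying the other columns."""
    for row in rows:
        others = {k: v for k, v in row.items() if k != array_key}
        for element in row[array_key] or []:
            yield {"col": element, **others}

exploded = list(explode_rows(rows, "array_col"))
for r in exploded:
    print(r)
```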



### Maps

In [91]:
from pyspark.sql.functions import create_map
df_with_map = df.select(
    'CustomerID',
    create_map(col("Description"), col("InvoiceNo"), col('CustomerID'), col('StockCode')).\
        alias("map_col"))
df_with_map.show(5, truncate=False)
df_with_map.printSchema()

+----------+----------------------------------------------------------------+
|CustomerID|map_col                                                         |
+----------+----------------------------------------------------------------+
|17850     |[WHITE HANGING HEART T-LIGHT HOLDER -> 536365, 17850 -> 85123A] |
|17850     |[WHITE METAL LANTERN -> 536365, 17850 -> 71053]                 |
|17850     |[CREAM CUPID HEARTS COAT HANGER -> 536365, 17850 -> 84406B]     |
|17850     |[KNITTED UNION FLAG HOT WATER BOTTLE -> 536365, 17850 -> 84029G]|
|17850     |[RED WOOLLY HOTTIE WHITE HEART. -> 536365, 17850 -> 84029E]     |
+----------+----------------------------------------------------------------+
only showing top 5 rows

root
 |-- CustomerID: integer (nullable = true)
 |-- map_col: map (nullable = false)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)



Querying a map with a literal key (rows whose map does not contain the key return null):

In [92]:
df_with_map.selectExpr('map_col["WHITE METAL LANTERN"]').show(5)

+----------------------------+
|map_col[WHITE METAL LANTERN]|
+----------------------------+
|                        null|
|                      536365|
|                        null|
|                        null|
|                        null|
+----------------------------+
only showing top 5 rows



Using a column value as the key:

In [93]:
df_with_map.selectExpr('map_col[CustomerID]').show(5)

+-----------------------------------+
|map_col[CAST(CustomerID AS STRING)]|
+-----------------------------------+
|                             85123A|
|                              71053|
|                             84406B|
|                             84029G|
|                             84029E|
+-----------------------------------+
only showing top 5 rows
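
The column header in the output shows the integer `CustomerID` being cast to a string before the lookup, because the map's keys are strings. The plain-Python sketch below mimics both lookups with a `dict` standing in for one row's map; it is not Spark code. A missing key yields `None`, much as the map lookup yields null.

```python
# One row's map, copied from the second row of the map_col output above.
map_col = {"WHITE METAL LANTERN": "536365", "17850": "71053"}
customer_id = 17850  # an integer, like the CustomerID column

# Literal key: present in this row, absent from most others.
print(map_col.get("WHITE METAL LANTERN"))
print(map_col.get("CREAM CUPID HEARTS COAT HANGER"))

# Column value as key: the integer only matches after conversion to a string.
by_int = map_col.get(customer_id)
by_str = map_col.get(str(customer_id))
print(by_int, by_str)
```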



Exploding a map: each key-value pair is turned into a row

In [94]:
df_with_map.selectExpr('explode(map_col)').show()

+--------------------+------+
|                 key| value|
+--------------------+------+
|WHITE HANGING HEA...|536365|
|               17850|85123A|
| WHITE METAL LANTERN|536365|
|               17850| 71053|
|CREAM CUPID HEART...|536365|
|               17850|84406B|
|KNITTED UNION FLA...|536365|
|               17850|84029G|
|RED WOOLLY HOTTIE...|536365|
|               17850|84029E|
|SET 7 BABUSHKA NE...|536365|
|               17850| 22752|
|GLASS STAR FROSTE...|536365|
|               17850| 21730|
|HAND WARMER UNION...|536366|
|               17850| 22633|
|HAND WARMER RED P...|536366|
|               17850| 22632|
|ASSORTED COLOUR B...|536367|
|               13047| 84879|
+--------------------+------+
only showing top 20 rows

