## Processing Column Data

In this module we explore the functions available under `pyspark.sql.functions` for deriving new values from existing column values within a Data Frame.

* Pre-defined Functions
* Create Dummy Data Frame
* Categories of Functions
* Special Functions - `col` and `lit`
* String Manipulation Functions - 1
* String Manipulation Functions - 2
* Date and Time Overview
* Date and Time Arithmetic
* Date and Time - `trunc` and `date_trunc`
* Date and Time - Extracting Information
* Dealing with Unix Timestamp
* Example - Word Count
* Conclusion

In [1]:
from pyspark.sql import SparkSession
spark=SparkSession.builder.appName('instance').getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/06/12 23:48:07 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [34]:
orders=spark.read.csv(
    './retail_db/orders', \
    schema='order_id INT, order_date STRING, order_customer_id INT, order_status STRING'
)
orders.printSchema()
orders.show(5)
orders.count()

root
 |-- order_id: integer (nullable = true)
 |-- order_date: string (nullable = true)
 |-- order_customer_id: integer (nullable = true)
 |-- order_status: string (nullable = true)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



68883

In [16]:
from pyspark.sql.functions import date_format
#date_format? 
#help(date_format)

In [19]:
orders.select('*', date_format(orders['order_date'], 'yyyyMM').alias('order_month')).show(5) 

+--------+--------------------+-----------------+---------------+-----------+
|order_id|          order_date|order_customer_id|   order_status|order_month|
+--------+--------------------+-----------------+---------------+-----------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|     201307|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|     201307|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|     201307|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|     201307|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|     201307|
+--------+--------------------+-----------------+---------------+-----------+
only showing top 5 rows



In [21]:
orders.withColumn('order_month', date_format('order_date', 'yyyyMM')).select('*').show(5)

+--------+--------------------+-----------------+---------------+-----------+
|order_id|          order_date|order_customer_id|   order_status|order_month|
+--------+--------------------+-----------------+---------------+-----------+
|       1|2013-07-25 00:00:...|            11599|         CLOSED|     201307|
|       2|2013-07-25 00:00:...|              256|PENDING_PAYMENT|     201307|
|       3|2013-07-25 00:00:...|            12111|       COMPLETE|     201307|
|       4|2013-07-25 00:00:...|             8827|         CLOSED|     201307|
|       5|2013-07-25 00:00:...|            11318|       COMPLETE|     201307|
+--------+--------------------+-----------------+---------------+-----------+
only showing top 5 rows



In [22]:
# date_format returns a string, so compare against the string '201401'
orders. \
    filter(date_format('order_date', 'yyyyMM') == '201401'). \
    show(5)

+--------+--------------------+-----------------+---------------+
|order_id|          order_date|order_customer_id|   order_status|
+--------+--------------------+-----------------+---------------+
|   25876|2014-01-01 00:00:...|             3414|PENDING_PAYMENT|
|   25877|2014-01-01 00:00:...|             5549|PENDING_PAYMENT|
|   25878|2014-01-01 00:00:...|             9084|        PENDING|
|   25879|2014-01-01 00:00:...|             5118|        PENDING|
|   25880|2014-01-01 00:00:...|            10146|       CANCELED|
+--------+--------------------+-----------------+---------------+
only showing top 5 rows



In [27]:
orders. \
    groupBy(date_format('order_date', 'yyyyMM').alias('order_month')). \
    count(). \
    show(5)

+-----------+-----+
|order_month|count|
+-----------+-----+
|     201401| 5908|
|     201405| 5467|
|     201312| 5892|
|     201310| 5335|
|     201311| 6381|
+-----------+-----+
only showing top 5 rows
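
To see what the `yyyyMM` pattern does to each value, and why the result is a string rather than a number, the following plain-Python sketch reproduces the month key and a per-month count without Spark. The sample strings and the helper name `order_month` are made up for illustration, and only the leading `yyyy-MM-dd` part of each value is parsed.

```python
from collections import Counter
from datetime import datetime


def order_month(order_date: str) -> str:
    """Return a yyyyMM key for a string that starts with yyyy-MM-dd (hypothetical helper)."""
    return datetime.strptime(order_date[:10], "%Y-%m-%d").strftime("%Y%m")


# Illustrative values only; these are not rows read from retail_db.
sample_dates = ["2013-07-25 00:00:00", "2014-01-01 00:00:00", "2014-01-15 00:00:00"]

# Roughly what groupBy(date_format(...)).count() computes, done in plain Python.
month_counts = Counter(order_month(d) for d in sample_dates)
print(month_counts)
```

Because the key is a string such as `'201401'`, filtering against a string literal states the intent more directly than relying on an implicit cast.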



In [30]:
from pyspark.sql import functions as F
#help(F.date_format)

## Create Dummy Spark Data Frame
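* Oracle provides a special one-row, one-column table named `dual`, typically used to evaluate expressions that do not need a real table.
* `dual` has a single column named `dummy`, of type `VARCHAR2(1)`.
* It holds exactly one record, with the value `'X'`.
* We create a one-row Data Frame to play the same role in Spark.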

In [45]:
l = [('X',)]
df=spark.createDataFrame(l, 'dummy STRING')
df.show()
df.printSchema()

+-----+
|dummy|
+-----+
|    X|
+-----+

root
 |-- dummy: string (nullable = true)



In [48]:
# similar to Oracle: select sysdate from dual
df.select(F.current_date().alias('current_date')).show()

+------------+
|current_date|
+------------+
|  2024-05-31|
+------------+



In [98]:
employees = [
    (1, "Scott", "Tiger", 1000.0, "united states", "+1 123 456 7890", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0, "India", "+91 234 567 8901", "456 78 9123"),
    (3, "Nick", "Junior", 750.0, "united KINGDOM", "+44 111 111 1111", "222 33 4444"),
    (4, "Bill", "Gomes", 1500.0, "AUSTRALIA", "+61 987 654 3210", "789 12 6118")
]

employeesDF=spark.createDataFrame( \
    employees, \
    schema='employee_id INT, first_name STRING, last_name STRING, salary FLOAT, nationality STRING, phone_number STRING, ssn STRING' \
)
len(employees), employeesDF.show(), employeesDF.printSchema()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+

root
 |-- employee_id: integer (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: float (nullable = true)
 |-- nationality: string (nullable = true)
 |-- phone_number: string (nullable = true)
 |-- ssn: string (nullable = true)



(4, None, None)

## Categories of Functions to Manipulate Columns in Spark Data Frames
* String Manipulation Functions
  * Case Conversion - `lower`, `upper`
  * Getting Length - `length`
  * Extracting substrings - `substring`, `split`
  * Trimming - `trim`, `ltrim`, `rtrim`
  * Padding - `lpad`, `rpad`
  * Concatenating strings - `concat`, `concat_ws`
* Date Manipulation Functions
  * Getting current date and time - `current_date`, `current_timestamp`
  * Date Arithmetic - `date_add`, `date_sub`, `datediff`, `months_between`, `add_months`, `next_day`
  * Beginning and Ending Date or Time - `last_day`, `trunc`, `date_trunc`
  * Formatting Date - `date_format`
  * Extracting Information - `dayofyear`, `dayofmonth`, `dayofweek`, `year`, `month`
* Aggregate Functions
  * `count`, `countDistinct`
  * `sum`, `avg`
  * `min`, `max`
* Other Functions - We will explore depending on the use cases.
  * `CASE` and `WHEN`
  * `CAST` for type casting
  * Functions to manage special types such as `ARRAY`, `MAP`, `STRUCT` type columns
  * Many others

In [5]:
from pyspark.sql import functions as F
employeesDF.withColumn('nationality', F.lower('nationality')).show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         india|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united kingdom|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     australia|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



## Special functions - col and lit
* `col` - takes a column name as its argument and returns a `Column` object
* `lit` (literal) - takes a constant value as its argument and returns a `Column` object

In [6]:
employeesDF.select('first_name', 'last_name').show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Scott|    Tiger|
|     Henry|     Ford|
|      Nick|   Junior|
|      Bill|    Gomes|
+----------+---------+



In [7]:
employeesDF. \
    groupBy('nationality'). \
    count(). \
    show()

+--------------+-----+
|   nationality|count|
+--------------+-----+
| united states|    1|
|         India|    1|
|united KINGDOM|    1|
|     AUSTRALIA|    1|
+--------------+-----+



In [8]:
employeesDF.orderBy('employee_id').show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [10]:
from pyspark.sql.functions import col

type(col('first_name'))

pyspark.sql.column.Column

In [11]:
employeesDF.select(col('first_name'), col('last_name')).show()

+----------+---------+
|first_name|last_name|
+----------+---------+
|     Scott|    Tiger|
|     Henry|     Ford|
|      Nick|   Junior|
|      Bill|    Gomes|
+----------+---------+



In [16]:
from pyspark.sql.functions import upper

employeesDF.select(upper('first_name'), upper('last_name')).show()
type(upper('first_name'))
employeesDF.select(upper(col('first_name')), upper(employeesDF['last_name'])).show()

+-----------------+----------------+
|upper(first_name)|upper(last_name)|
+-----------------+----------------+
|            SCOTT|           TIGER|
|            HENRY|            FORD|
|             NICK|          JUNIOR|
|             BILL|           GOMES|
+-----------------+----------------+

+-----------------+----------------+
|upper(first_name)|upper(last_name)|
+-----------------+----------------+
|            SCOTT|           TIGER|
|            HENRY|            FORD|
|             NICK|          JUNIOR|
|             BILL|           GOMES|
+-----------------+----------------+



In [17]:
# employeesDF.orderBy('employee_id'.desc()).show()  # FAILS: a Python str has no desc() method; wrap the name with col() first
employeesDF.orderBy(col('employee_id').desc()).show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [20]:
employeesDF.orderBy(upper(employeesDF['first_name']).alias('first_name')).show()
employeesDF.orderBy(upper(employeesDF.first_name).alias('first_name')).show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
+-----------+----------+---------+------+--------------+----------------+-----------+

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
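|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
+-----------+----------+---------+------+--------------+----------------+-----------+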

In [25]:
from pyspark.sql.functions import lit, concat
employeesDF.select('first_name', lit(', '), 'last_name').show()
employeesDF.select(concat('first_name', lit(', '), 'last_name').alias('full_name')).show()

+----------+---+---------+
|first_name| , |last_name|
+----------+---+---------+
|     Scott| , |    Tiger|
|     Henry| , |     Ford|
|      Nick| , |   Junior|
|      Bill| , |    Gomes|
+----------+---+---------+

+------------+
|   full_name|
+------------+
|Scott, Tiger|
| Henry, Ford|
|Nick, Junior|
| Bill, Gomes|
+------------+



In [26]:
employeesDF.withColumn('bonus', col('salary')*0.2).show()

+-----------+----------+---------+------+--------------+----------------+-----------+-----+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|bonus|
+-----------+----------+---------+------+--------------+----------------+-----------+-----+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|200.0|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|250.0|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|150.0|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|300.0|
+-----------+----------+---------+------+--------------+----------------+-----------+-----+



In [31]:
from pyspark.sql.functions import concat_ws
employeesDF.select(concat_ws(', ', 'first_name', 'last_name').alias('full_name')).show()
employeesDF.withColumn('full_name', concat_ws(', ', 'first_name', 'last_name')).show()

+------------+
|   full_name|
+------------+
|Scott, Tiger|
| Henry, Ford|
|Nick, Junior|
| Bill, Gomes|
+------------+

+-----------+----------+---------+------+--------------+----------------+-----------+------------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|   full_name|
+-----------+----------+---------+------+--------------+----------------+-----------+------------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|Scott, Tiger|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123| Henry, Ford|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|Nick, Junior|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118| Bill, Gomes|
+-----------+----------+---------+------+--------------+----------------+-----------+------------+



In [35]:
addresses = [{'address': '36155 Dayton Hill', 'city': 'Newton', 'country': 'US', 'id': 1, 'postal_code': '02162', 'state': 'Massachusetts'}
           , {'address': '1551 6th Plaza', 'city': 'Modesto', 'country': 'US', 'id': 2, 'postal_code': '95354', 'state': 'Californial'}
           , {'address': '21370 Waubesa Pass', 'city': 'New York City', 'country': 'US', 'id': 3, 'postal_code': '10120', 'state': 'New York'}
           , {'address': '7849 Ohio Drive', 'city': 'Springfield', 'country': 'US', 'id': 4, 'postal_code': '65805', 'state': 'Missouri'}
           , {'address': '6268 Marcy Center', 'city': 'Brooklyn', 'country': 'US', 'id': 5, 'postal_code': '11254', 'state': 'New York'}
           , {'address': '3315 Schlimgen Place', 'city': 'Lexington', 'country': 'US', 'id': 6, 'postal_code': '40576', 'state': 'Kentucky'}
           , {'address': '44 Surrey Plaza', 'city': 'Saint Paul', 'country': 'US', 'id': 7, 'postal_code': '55115', 'state': 'Minnesota'}
           , {'address': '512 Carpenter Lane', 'city': 'Charlotte', 'country': 'US', 'id': 8, 'postal_code': '28205', 'state': 'North Carolina'}
           , {'address': '566 Kipling Court', 'city': 'Austin', 'country': 'US', 'id': 9, 'postal_code': '78764', 'state': 'Texas'}
           , {'address': '9 Debs Parkway', 'city': 'New York City', 'country': 'US', 'id': 10, 'postal_code': '10090', 'state': 'New York'}]
addressesDF=spark.createDataFrame(addresses)
addressesDF.show()

+--------------------+-------------+-------+---+-----------+--------------+
|             address|         city|country| id|postal_code|         state|
+--------------------+-------------+-------+---+-----------+--------------+
|   36155 Dayton Hill|       Newton|     US|  1|      02162| Massachusetts|
|      1551 6th Plaza|      Modesto|     US|  2|      95354|   Californial|
|  21370 Waubesa Pass|New York City|     US|  3|      10120|      New York|
|     7849 Ohio Drive|  Springfield|     US|  4|      65805|      Missouri|
|   6268 Marcy Center|     Brooklyn|     US|  5|      11254|      New York|
|3315 Schlimgen Place|    Lexington|     US|  6|      40576|      Kentucky|
|     44 Surrey Plaza|   Saint Paul|     US|  7|      55115|     Minnesota|
|  512 Carpenter Lane|    Charlotte|     US|  8|      28205|North Carolina|
|   566 Kipling Court|       Austin|     US|  9|      78764|         Texas|
|      9 Debs Parkway|New York City|     US| 10|      10090|      New York|
+--------------------+-------------+-------+---+-----------+--------------+

In [37]:
addressesDF. \
    select('id', concat_ws(', ', 'address', 'city', 'state', 'country', 'postal_code').alias('full_address')). \
    show(truncate=False)

+---+--------------------------------------------------------+
|id |full_address                                            |
+---+--------------------------------------------------------+
|1  |36155 Dayton Hill, Newton, Massachusetts, US, 02162     |
|2  |1551 6th Plaza, Modesto, Californial, US, 95354         |
|3  |21370 Waubesa Pass, New York City, New York, US, 10120  |
|4  |7849 Ohio Drive, Springfield, Missouri, US, 65805       |
|5  |6268 Marcy Center, Brooklyn, New York, US, 11254        |
|6  |3315 Schlimgen Place, Lexington, Kentucky, US, 40576    |
|7  |44 Surrey Plaza, Saint Paul, Minnesota, US, 55115       |
|8  |512 Carpenter Lane, Charlotte, North Carolina, US, 28205|
|9  |566 Kipling Court, Austin, Texas, US, 78764             |
|10 |9 Debs Parkway, New York City, New York, US, 10090      |
+---+--------------------------------------------------------+



In [38]:
from pyspark.sql.functions import col, lower, upper, initcap, length

In [39]:
employeesDF.show()

+-----------+----------+---------+------+--------------+----------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+----------------+-----------+



In [42]:
employeesDF.select('employee_id', 'nationality').\
    withColumn('nationality_upper', upper('nationality')). \
    withColumn('nationality_lower', lower('nationality')). \
    withColumn('nationality_initcap', initcap('nationality')). \
    withColumn('nationality_length', length('nationality')). \
    show()

+-----------+--------------+-----------------+-----------------+-------------------+------------------+
|employee_id|   nationality|nationality_upper|nationality_lower|nationality_initcap|nationality_length|
+-----------+--------------+-----------------+-----------------+-------------------+------------------+
|          1| united states|    UNITED STATES|    united states|      United States|                13|
|          2|         India|            INDIA|            india|              India|                 5|
|          3|united KINGDOM|   UNITED KINGDOM|   united kingdom|     United Kingdom|                14|
|          4|     AUSTRALIA|        AUSTRALIA|        australia|          Australia|                 9|
+-----------+--------------+-----------------+-----------------+-------------------+------------------+
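
For intuition, the four derived columns above have close plain-Python counterparts. This is only an analogy run on the same sample values, not Spark code. In particular, `str.title()` merely approximates `initcap` and can differ, for example on words containing apostrophes.

```python
nationalities = ["united states", "India", "united KINGDOM", "AUSTRALIA"]

for value in nationalities:
    # upper / lower / initcap (approximated by title) / length
    print(value.upper(), value.lower(), value.title(), len(value), sep=" | ")
```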



## Extracting Strings using Substring from Spark Data Frame Columns

In [44]:
s =  'Hello World'
s[:5], s[1:4]

('Hello', 'ell')

In [59]:
from pyspark.sql.functions import substring, lit
df. \
    select(substring(lit('Hello World'), 7, 5), substring(lit('Hello World'), -5, 5)). \
    show()

+----------------------------+-----------------------------+
|substring(Hello World, 7, 5)|substring(Hello World, -5, 5)|
+----------------------------+-----------------------------+
|                       World|                        World|
+----------------------------+-----------------------------+
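
Unlike Python slicing, `substring` uses 1-based positions, and a negative position counts from the end of the string. The sketch below mirrors those two cases in plain Python. `substring_like` is a hypothetical helper, and it assumes `pos` is non-zero and falls inside the string.

```python
def substring_like(s: str, pos: int, length: int) -> str:
    """Mimic substring(str, pos, len) for non-zero, in-range positions."""
    # Positive pos is 1-based; negative pos is an offset from the end.
    start = pos - 1 if pos > 0 else len(s) + pos
    return s[start:start + length]


print(substring_like("Hello World", 7, 5))
print(substring_like("Hello World", -5, 5))
print(substring_like("123 45 6789", 8, 4))
```

Counting from the end (`-4, 4`) is convenient for "last four characters" extractions, because it does not depend on the total length of the value.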



In [63]:
employeesDF. \
    select('employee_id', 'ssn', 'phone_number'). \
    withColumn('phone_last4', substring('phone_number', -4, 4).cast('int')). \
    withColumn('ssn_last4', substring('ssn', -4, 4).cast('int')). \
    show()

employeesDF. \
    select('employee_id', 'ssn', 'phone_number'). \
    withColumn('phone_last4', substring('phone_number', -4, 4).cast('int')). \
    withColumn('ssn_last4_', substring('ssn', 8, 4).cast('int')). \
    show()

+-----------+-----------+----------------+-----------+---------+
|employee_id|        ssn|    phone_number|phone_last4|ssn_last4|
+-----------+-----------+----------------+-----------+---------+
|          1|123 45 6789| +1 123 456 7890|       7890|     6789|
|          2|456 78 9123|+91 234 567 8901|       8901|     9123|
|          3|222 33 4444|+44 111 111 1111|       1111|     4444|
|          4|789 12 6118|+61 987 654 3210|       3210|     6118|
+-----------+-----------+----------------+-----------+---------+

+-----------+-----------+----------------+-----------+----------+
|employee_id|        ssn|    phone_number|phone_last4|ssn_last4_|
+-----------+-----------+----------------+-----------+----------+
|          1|123 45 6789| +1 123 456 7890|       7890|      6789|
|          2|456 78 9123|+91 234 567 8901|       8901|      9123|
|          3|222 33 4444|+44 111 111 1111|       1111|      4444|
|          4|789 12 6118|+61 987 654 3210|       3210|      6118|
+-----------+-----------+----------------+-----------+----------+

## Extracting Strings using split from Spark Data Frame Columns

In [70]:
from pyspark.sql.functions import lit, split, explode
df.select(split(lit('Hello World, how are you'), ' ')).show(truncate=False)
df.select(split(lit('Hello World, how are you'), ' ')[2]).show(truncate=False) # 3rd element from the array
df.select(explode(split(lit('Hello World, how are you'), ' ')).alias('word')).show()

+--------------------------------------+
|split(Hello World, how are you,  , -1)|
+--------------------------------------+
|[Hello, World,, how, are, you]        |
+--------------------------------------+

+-----------------------------------------+
|split(Hello World, how are you,  , -1)[2]|
+-----------------------------------------+
|how                                      |
+-----------------------------------------+

+------+
|  word|
+------+
| Hello|
|World,|
|   how|
|   are|
|   you|
+------+
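
Note that the second argument of `split` is a regular expression, not a plain string, so delimiters such as `.` or `|` need escaping. Python's `re.split` behaves the same way for the simple cases below, so it is used here as a stand-in; treat this as an analogy rather than Spark code.

```python
import re

sentence = "Hello World, how are you"

# Splitting on a single space keeps the comma attached to "World,".
words = re.split(" ", sentence)

# Arrays are 0-indexed, so [2] is the 3rd element.
third_word = words[2]

# A literal dot must be escaped, because an unescaped "." matches any character.
dotted_parts = re.split(r"\.", "a.b.c")

print(words, third_word, dotted_parts)
```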



In [78]:
employees = [
    (1, "Scott", "Tiger", 1000.0, "united states", "+1 123 456 7890,+1 234 567 8901", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0, "India", "+91 234 567 8901", "456 78 9123"),
    (3, "Nick", "Junior", 750.0, "united KINGDOM", "+44 111 111 1111,+44 222 222 2222", "222 33 4444"),
    (4, "Bill", "Gomes", 1500.0, "AUSTRALIA", "+61 987 654 3210,+61 876 543 2109", "789 12 6118")
]

employeesDF=spark.createDataFrame( \
    employees, \
    schema='employee_id INT, first_name STRING, last_name STRING, salary FLOAT, nationality STRING, phone_numbers STRING, ssn STRING' \
)
employeesDF.show()

+-----------+----------+---------+------+--------------+--------------------+-----------+
|employee_id|first_name|last_name|salary|   nationality|       phone_numbers|        ssn|
+-----------+----------+---------+------+--------------+--------------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states|+1 123 456 7890,+...|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India|    +91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|+44 111 111 1111,...|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|+61 987 654 3210,...|789 12 6118|
+-----------+----------+---------+------+--------------+--------------------+-----------+



In [87]:
employeesDF.show(truncate=False)
employeesDF.select('employee_id', 'phone_numbers'). \
    withColumn('phone_number', explode(split('phone_numbers', ','))).\
    show(truncate=False)
employeesDF. \
    withColumn('phone_number', explode(split('phone_numbers', ','))).\
    select('employee_id', 'phone_number', 'ssn'). \
    withColumn('area_code', split('phone_number', ' ')[1].cast('int')). \
    withColumn('phone_last4', split('phone_number', ' ')[3].cast('int')). \
    withColumn('ssn_last4', split('ssn', ' ')[2].cast('int')). \
    show(truncate=False)

+-----------+----------+---------+------+--------------+---------------------------------+-----------+
|employee_id|first_name|last_name|salary|nationality   |phone_numbers                    |ssn        |
+-----------+----------+---------+------+--------------+---------------------------------+-----------+
|1          |Scott     |Tiger    |1000.0|united states |+1 123 456 7890,+1 234 567 8901  |123 45 6789|
|2          |Henry     |Ford     |1250.0|India         |+91 234 567 8901                 |456 78 9123|
|3          |Nick      |Junior   |750.0 |united KINGDOM|+44 111 111 1111,+44 222 222 2222|222 33 4444|
|4          |Bill      |Gomes    |1500.0|AUSTRALIA     |+61 987 654 3210,+61 876 543 2109|789 12 6118|
+-----------+----------+---------+------+--------------+---------------------------------+-----------+

+-----------+---------------------------------+----------------+
|employee_id|phone_numbers                    |phone_number    |
+-----------+--------------------------------

In [95]:
# number of phones each employee has
employeesDF. \
    withColumn('phone_number', explode(split('phone_numbers', ','))). \
    select('employee_id', 'phone_number'). \
    groupBy('employee_id'). \
    count(). \
    show()

from pyspark.sql.functions import size
employeesDF. \
    withColumn('#_of_phones', size(split('phone_numbers', ','))). \
    select('employee_id', '#_of_phones'). \
    show()

+-----------+-----+
|employee_id|count|
+-----------+-----+
|          1|    2|
|          2|    1|
|          3|    2|
|          4|    2|
+-----------+-----+

+-----------+-----------+
|employee_id|#_of_phones|
+-----------+-----------+
|          1|          2|
|          2|          1|
|          3|          2|
|          4|          2|
+-----------+-----------+
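
The two approaches above differ in shape: `explode` produces one row per array element, which is then grouped and counted, while `size` counts the elements without changing the number of rows. A plain-Python sketch of the same idea, using two of the sample records (the variable names are illustrative only):

```python
phone_numbers = {
    1: "+1 123 456 7890,+1 234 567 8901",
    2: "+91 234 567 8901",
}

# Like explode(split(...)): one (employee_id, phone_number) pair per array element.
exploded = [
    (emp_id, phone)
    for emp_id, phones in phone_numbers.items()
    for phone in phones.split(",")
]

# Like size(split(...)): one count per employee, row count unchanged.
phone_counts = {emp_id: len(phones.split(",")) for emp_id, phones in phone_numbers.items()}

# Like split(phone_number, ' ')[1].cast('int'): the second space-separated token.
area_codes = [int(phone.split(" ")[1]) for _, phone in exploded]

print(exploded, phone_counts, area_codes)
```

When only the count is needed, the `size` form avoids the extra rows and the grouping step.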



## Padding characters around String using Spark Data Frame Columns

In [97]:
from pyspark.sql.functions import lit, rpad, lpad
df.select(lpad(lit('Hello'), 10, '-')).show()

+------------------+
|lpad(Hello, 10, -)|
+------------------+
|        -----Hello|
+------------------+



In [102]:
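# Build one fixed-width record per employee.
# This expects the employeesDF version with the single `phone_number` column,
# so re-run the cell that creates it if `phone_numbers` is currently defined.
empFixDF = employeesDF.select(
    concat(
        lpad('employee_id', 5, '0'),
        rpad('first_name', 10, '-'),
        lpad('salary', 10, '0'),
        rpad('nationality', 15, '-'),
        rpad('phone_number', 17, '-'),
        'ssn'
    ).alias('employee')
)
empFixDF.show(truncate=False)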

+--------------------------------------------------------------------+
|employee                                                            |
+--------------------------------------------------------------------+
|00001Scott-----00001000.0united states--+1 123 456 7890--123 45 6789|
|00002Henry-----00001250.0India----------+91 234 567 8901-456 78 9123|
|00003Nick------00000750.0united KINGDOM-+44 111 111 1111-222 33 4444|
|00004Bill------00001500.0AUSTRALIA------+61 987 654 3210-789 12 6118|
+--------------------------------------------------------------------+
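
Fixed-width records only line up if every field has exactly the target length, so it helps to know that `lpad` and `rpad` shorten a value that is longer than the requested length. The plain-Python sketch below mirrors that behaviour. `lpad_like` and `rpad_like` are hypothetical helpers and assume a single-character pad string.

```python
def lpad_like(s: str, length: int, pad: str) -> str:
    """Left-pad s to `length` with a single-character pad, cutting longer values to `length`."""
    return s.rjust(length, pad)[:length]


def rpad_like(s: str, length: int, pad: str) -> str:
    """Right-pad s to `length` with a single-character pad, cutting longer values to `length`."""
    return s.ljust(length, pad)[:length]


print(lpad_like("Hello", 10, "-"))
print(lpad_like("1000.0", 10, "0"))
print(rpad_like("Scott", 10, "-"))
print(rpad_like("united KINGDOM", 5, "-"))
```

Choosing each width at least as large as the longest expected value avoids silently losing characters.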



## Trimming characters from String in Spark Data Frame Column

In [108]:
#l = [('    Hello    ', '    World    ',)]
#df=spark.createDataFrame(l).toDF('A', 'B') #column list

l = [('    Hello.    ',)]
df=spark.createDataFrame(l).toDF('dummy') #column list
df.show()

+--------------+
|         dummy|
+--------------+
|    Hello.    |
+--------------+



In [111]:
from pyspark.sql.functions import col, rtrim, ltrim, trim
df. \
    withColumn('rtrim', rtrim('dummy')). \
    withColumn('ltrim', ltrim('dummy')). \
    withColumn('trim', trim('dummy')). \
    show()

+--------------+----------+----------+------+
|         dummy|     rtrim|     ltrim|  trim|
+--------------+----------+----------+------+
|    Hello.    |    Hello.|Hello.    |Hello.|
+--------------+----------+----------+------+



In [113]:
from pyspark.sql.functions import expr
spark.sql('DESCRIBE FUNCTION rtrim').show(truncate=False)

+-------------------------------------------------------------------------------+
|function_desc                                                                  |
+-------------------------------------------------------------------------------+
|Function: rtrim                                                                |
|Class: org.apache.spark.sql.catalyst.expressions.StringTrimRight               |
|Usage: \n    rtrim(str) - Removes the trailing space characters from `str`.\n  |
+-------------------------------------------------------------------------------+



In [122]:
# 'rtrim' here removes the trailing spaces first and then the trailing '.'
df. \
    withColumn('ltrim', expr("trim(leading ' ' from dummy)")). \
    withColumn('rtrim', expr("trim(trailing '.' from rtrim(dummy))")). \
    withColumn('trim',  expr("trim(both ' ' from dummy)")). \
    show()

+--------------+----------+---------+------+
|         dummy|     ltrim|    rtrim|  trim|
+--------------+----------+---------+------+
|    Hello.    |Hello.    |    Hello|Hello.|
+--------------+----------+---------+------+
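
The SQL-style `trim(leading | trailing | both ... from ...)` forms map onto Python's `lstrip`, `rstrip` and `strip` when the character to remove is passed explicitly. The sketch below is a plain-Python analogy on the same sample value, not Spark code. The space is passed explicitly because Python's no-argument `strip()` removes all whitespace, which is broader than the space-only default described by `DESCRIBE FUNCTION rtrim`.

```python
value = "    Hello.    "

leading_removed = value.lstrip(" ")    # like trim(leading ' ' from dummy)
trailing_removed = value.rstrip(" ")   # like rtrim(dummy)
both_removed = value.strip(" ")        # like trim(both ' ' from dummy)

# Like trim(trailing '.' from rtrim(dummy)): trailing spaces first, then the dot.
trailing_spaces_then_dot = value.rstrip(" ").rstrip(".")

print(repr(leading_removed), repr(trailing_removed), repr(both_removed), repr(trailing_spaces_then_dot))
```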



## Date and Time Manipulation Functions using Spark Data Frames

In [17]:
df = spark.createDataFrame([('x',)]).toDF('dummy')
df.show()

+-----+
|dummy|
+-----+
|    x|
+-----+



In [18]:
from pyspark.sql.functions import current_date, current_timestamp, expr

df.select(current_date()).show() #yyyy-MM-dd
df.select(current_timestamp()).show(truncate=False) # yyyy-MM-dd HH:mm:ss.SSSSSS

df.selectExpr('current_date', 'current_timestamp', 'current_date()', 'current_timestamp()').show(truncate=False)
df.select(expr('current_date').alias('current_date')).show()
spark.sql('select current_date, current_timestamp').show()

+--------------+
|current_date()|
+--------------+
|    2024-06-09|
+--------------+

+--------------------------+
|current_timestamp()       |
+--------------------------+
|2024-06-09 18:59:48.692772|
+--------------------------+

+--------------+--------------------------+--------------+--------------------------+
|current_date()|current_timestamp()       |current_date()|current_timestamp()       |
+--------------+--------------------------+--------------+--------------------------+
|2024-06-09    |2024-06-09 18:59:48.884812|2024-06-09    |2024-06-09 18:59:48.884812|
+--------------+--------------------------+--------------+--------------------------+

+------------+
|current_date|
+------------+
|  2024-06-09|
+------------+

+--------------+--------------------+
|current_date()| current_timestamp()|
+--------------+--------------------+
|    2024-06-09|2024-06-09 18:59:...|
+--------------+--------------------+



In [8]:
from pyspark.sql.functions import to_date, to_timestamp, lit
df.select(to_date(lit('20240609'), 'yyyyMMdd')).show() 
df.select(to_timestamp(lit('20240609230501'), 'yyyyMMddHHmmss')).show() 

+---------------------------+
|to_date(20240609, yyyyMMdd)|
+---------------------------+
|                 2024-06-09|
+---------------------------+

+--------------------------------------------+
|to_timestamp(20240609230501, yyyyMMddHHmmss)|
+--------------------------------------------+
|                         2024-06-09 23:05:01|
+--------------------------------------------+
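
For readers more familiar with Python's `datetime`, the two patterns above roughly correspond to `strptime` directives. This is a small sketch under that assumed mapping; it only illustrates the parsing idea and does not call Spark.

```python
from datetime import datetime

# Assumed mapping between the Spark patterns above and strptime directives:
#   yyyyMMdd       -> %Y%m%d
#   yyyyMMddHHmmss -> %Y%m%d%H%M%S
parsed_date = datetime.strptime("20240609", "%Y%m%d").date()
parsed_ts = datetime.strptime("20240609230501", "%Y%m%d%H%M%S")

print(parsed_date)  # 2024-06-09
print(parsed_ts)    # 2024-06-09 23:05:01
```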



## Date and Time Arithmetic using Spark Data Frames

In [2]:
datetimes = [
    ("2014-02-28", "2014-02-28 10:00:00.123"),
    ("2016-02-29", "2016-02-29 08:08:08.999"),
    ("2017-10-31", "2017-12-31 11:59:59.123"),
    ("2019-11-30", "2019-08-31 00:00:00.000")
]

datetimesDF = spark.createDataFrame(datetimes, schema='date string, timestamp string')

In [26]:
from pyspark.sql.functions import date_add, date_sub, to_date, to_timestamp

datetimesDF.select(to_date('date', 'yyyy-MM-dd'), to_timestamp('timestamp', 'yyyy-MM-dd HH:mm:ss.SSS')).show(truncate=False)

+-------------------------+------------------------------------------------+
|to_date(date, yyyy-MM-dd)|to_timestamp(timestamp, yyyy-MM-dd HH:mm:ss.SSS)|
+-------------------------+------------------------------------------------+
|2014-02-28               |2014-02-28 10:00:00.123                         |
|2016-02-29               |2016-02-29 08:08:08.999                         |
|2017-10-31               |2017-12-31 11:59:59.123                         |
|2019-11-30               |2019-08-31 00:00:00                             |
+-------------------------+------------------------------------------------+



In [31]:
datetimesDF. \
    withColumn('date_10_days_later', date_add('date', 10)). \
    withColumn('timestamp_10_days_later', date_add('timestamp', 10)). \
    withColumn('date_10_days_earlier', date_sub('date', 10)). \
    withColumn('timestamp_10_days_earlier', date_sub('timestamp', 10)). \
    show(truncate=False)

+----------+-----------------------+------------------+-----------------------+--------------------+-------------------------+
|date      |timestamp              |date_10_days_later|timestamp_10_days_later|date_10_days_earlier|timestamp_10_days_earlier|
+----------+-----------------------+------------------+-----------------------+--------------------+-------------------------+
|2014-02-28|2014-02-28 10:00:00.123|2014-03-10        |2014-03-10             |2014-02-18          |2014-02-18               |
|2016-02-29|2016-02-29 08:08:08.999|2016-03-10        |2016-03-10             |2016-02-19          |2016-02-19               |
|2017-10-31|2017-12-31 11:59:59.123|2017-11-10        |2018-01-10             |2017-10-21          |2017-12-21               |
|2019-11-30|2019-08-31 00:00:00.000|2019-12-10        |2019-09-10             |2019-11-20          |2019-08-21               |
+----------+-----------------------+------------------+-----------------------+--------------------+-------------------------+
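
The results above can be cross-checked with ordinary calendar arithmetic. The sketch below uses Python's `datetime` module rather than Spark; `add_days` and `sub_days` are hypothetical helper names, and the time portion of a timestamp is dropped on purpose to mirror the date-only results in the table.

```python
from datetime import date, datetime, timedelta

def add_days(d: date, days: int) -> date:
    """Return the date `days` days after `d`."""
    return d + timedelta(days=days)

def sub_days(d: date, days: int) -> date:
    """Return the date `days` days before `d`."""
    return d - timedelta(days=days)

# A date value and a timestamp value taken from the sample rows.
d = date(2014, 2, 28)
ts = datetime(2017, 12, 31, 11, 59, 59)

print(add_days(d, 10))          # 2014-03-10
print(sub_days(d, 10))          # 2014-02-18
print(add_days(ts.date(), 10))  # 2018-01-10 (time portion discarded)
```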

In [35]:
from pyspark.sql.functions import current_date, current_timestamp, datediff
datetimesDF. \
    withColumn('datediff_date', datediff(current_date(), 'date')). \
    withColumn('datediff_timestamp', datediff(current_timestamp(), 'timestamp')). \
    show(truncate=False)

+----------+-----------------------+-------------+------------------+
|date      |timestamp              |datediff_date|datediff_timestamp|
+----------+-----------------------+-------------+------------------+
|2014-02-28|2014-02-28 10:00:00.123|3754         |3754              |
|2016-02-29|2016-02-29 08:08:08.999|3023         |3023              |
|2017-10-31|2017-12-31 11:59:59.123|2413         |2352              |
|2019-11-30|2019-08-31 00:00:00.000|1653         |1744              |
+----------+-----------------------+-------------+------------------+
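
The day counts above depend on when the cell was run. As a sanity check, the sketch below reproduces them in plain Python, assuming the reference date was 2024-06-09 (the `current_date()` value shown earlier in this notebook); `days_between` is a hypothetical helper name.

```python
from datetime import date

# Assumption: the cell above ran on 2024-06-09.
REFERENCE = date(2024, 6, 9)

def days_between(end: date, start: date) -> int:
    """Number of whole days from `start` to `end`, ignoring time of day."""
    return (end - start).days

for start in (date(2014, 2, 28), date(2016, 2, 29), date(2017, 10, 31), date(2019, 11, 30)):
    print(start, days_between(REFERENCE, start))
```

The timestamp column gives the same kind of result because only the date portion takes part in the subtraction.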



In [13]:
from pyspark.sql.functions import months_between, add_months, round, current_date, current_timestamp
datetimesDF. \
    withColumn('months_between_date', months_between(current_date(), 'date')). \
    withColumn('months_between_timestamp', months_between(current_timestamp(), 'timestamp')). \
    withColumn('add_months_date', add_months('date', 3)). \
    withColumn('add_months_timestamp', add_months('timestamp', 3)). \
    show(truncate=False)
# round to 2 decimal places
datetimesDF. \
    withColumn('months_between_date', round(months_between(current_date(), 'date'), 2)). \
    withColumn('months_between_timestamp', round(months_between(current_timestamp(), 'timestamp'), 2)). \
    show()

+----------+-----------------------+-------------------+------------------------+---------------+--------------------+
|date      |timestamp              |months_between_date|months_between_timestamp|add_months_date|add_months_timestamp|
+----------+-----------------------+-------------------+------------------------+---------------+--------------------+
|2014-02-28|2014-02-28 10:00:00.123|123.38709677       |123.39807982            |2014-05-28     |2014-05-28          |
|2016-02-29|2016-02-29 08:08:08.999|99.35483871        |99.36832773             |2016-05-29     |2016-05-29          |
|2017-10-31|2017-12-31 11:59:59.123|79.29032258        |77.29861783             |2018-01-31     |2018-03-31          |
|2019-11-30|2019-08-31 00:00:00.000|54.32258065        |57.31474649             |2020-02-29     |2019-11-30          |
+----------+-----------------------+-------------------+------------------------+---------------+--------------------+
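
The fractional values in `months_between_date` follow from a 31-days-per-month convention. Below is a plain-Python sketch of that rule for date-only inputs, assuming a reference date of 2024-06-09. The helper names are hypothetical, and the sketch ignores time of day, so it does not reproduce the `months_between_timestamp` column.

```python
import calendar
from datetime import date

def is_last_day_of_month(d: date) -> bool:
    return d.day == calendar.monthrange(d.year, d.month)[1]

def months_between_dates(end: date, start: date) -> float:
    """Date-only sketch: whole months plus a day fraction based on 31 days per month."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    same_day = end.day == start.day
    both_month_end = is_last_day_of_month(end) and is_last_day_of_month(start)
    if same_day or both_month_end:
        return float(months)
    return round(months + (end.day - start.day) / 31, 8)

reference = date(2024, 6, 9)  # assumed run date
print(months_between_dates(reference, date(2014, 2, 28)))  # 123.38709677
```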

+----------+--------------------+--------------

## Using Date and Time trunc function on Spark Data Frames

In [67]:
from pyspark.sql.functions import current_date, current_timestamp, trunc, date_trunc, next_day

df. \
    withColumn('trunc_month', trunc(current_timestamp(), format='month')). \
    withColumn('trunc_year', trunc(current_date(), format='year')). \
    withColumn('trunc_week', trunc(current_date(), format='week')). \
    withColumn('trunc_quarter', trunc(current_date(), format='quarter')). \
    show()

+-----+-----------+----------+----------+-------------+
|dummy|trunc_month|trunc_year|trunc_week|trunc_quarter|
+-----+-----------+----------+----------+-------------+
|    x| 2024-06-01|2024-01-01|2024-06-03|   2024-04-01|
+-----+-----------+----------+----------+-------------+
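
To make the truncation rules concrete, here is a plain-Python sketch that reproduces the four results above for 2024-06-09. `trunc_date` is a hypothetical helper, not the Spark function, and it covers only the four units used in this cell.

```python
from datetime import date, timedelta

def trunc_date(d: date, unit: str) -> date:
    """Return the first day of the year, quarter, month or week containing `d`."""
    if unit == "year":
        return d.replace(month=1, day=1)
    if unit == "quarter":
        first_month = 3 * ((d.month - 1) // 3) + 1
        return d.replace(month=first_month, day=1)
    if unit == "month":
        return d.replace(day=1)
    if unit == "week":
        # Weeks start on Monday; weekday() is 0 for Monday.
        return d - timedelta(days=d.weekday())
    raise ValueError(f"unsupported unit: {unit}")

today = date(2024, 6, 9)  # assumed run date, a Sunday
for unit in ("month", "year", "week", "quarter"):
    print(unit, trunc_date(today, unit))
```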



In [72]:
df. \
    withColumn('date_trunc_year', date_trunc('year', current_timestamp())). \
    withColumn('date_trunc_minute', date_trunc('minute', current_timestamp())). \
    withColumn('next_day',  next_day(current_timestamp(), 'SUN')). \
    show()

+-----+-------------------+-------------------+----------+
|dummy|    date_trunc_year|  date_trunc_minute|  next_day|
+-----+-------------------+-------------------+----------+
|    x|2024-01-01 00:00:00|2024-06-09 19:25:00|2024-06-16|
+-----+-------------------+-------------------+----------+
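
Unlike `trunc`, `date_trunc` keeps a timestamp result, and `next_day` returns the first matching weekday strictly after the given date. The plain-Python sketch below illustrates both ideas; the helper names and the seconds value in the sample timestamp are assumptions made for illustration.

```python
from datetime import date, datetime, timedelta

def trunc_to_minute(ts: datetime) -> datetime:
    """Zero out seconds and microseconds, keeping a full timestamp."""
    return ts.replace(second=0, microsecond=0)

def trunc_to_year(ts: datetime) -> datetime:
    """Reset to midnight on 1 January of the same year."""
    return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)

DAY_NAMES = ["MON", "TUE", "WED", "THU", "FRI", "SAT", "SUN"]

def next_weekday(d: date, day_name: str) -> date:
    """First date strictly after `d` that falls on `day_name`."""
    target = DAY_NAMES.index(day_name.upper()[:3])
    days_ahead = (target - d.weekday() - 1) % 7 + 1
    return d + timedelta(days=days_ahead)

ts = datetime(2024, 6, 9, 19, 25, 42)  # hypothetical sample within the minute shown above
print(trunc_to_year(ts))               # 2024-01-01 00:00:00
print(trunc_to_minute(ts))             # 2024-06-09 19:25:00
print(next_weekday(ts.date(), "SUN"))  # 2024-06-16
```

Because 2024-06-09 is itself a Sunday, the result is a full week later rather than the same day.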



## Date and Time Extract Functions on Spark Data Frames

In [5]:
from pyspark.sql.functions import current_timestamp, year, month, weekofyear, dayofyear, dayofmonth, dayofweek, hour, minute, second

# Recreate the one-row dummy Data Frame in case it is not defined in this session
df = spark.createDataFrame([('x',)]).toDF('dummy')
df. \
    withColumn('year', year(current_timestamp())). \
    withColumn('month', month(current_timestamp())). \
    withColumn('weekofyear', weekofyear(current_timestamp())). \
    withColumn('dayofyear', dayofyear(current_timestamp())). \
    withColumn('dayofmonth', dayofmonth(current_timestamp())). \
    withColumn('dayofweek', dayofweek(current_timestamp())). \
    withColumn('hour', hour(current_timestamp())). \
    withColumn('minute', minute(current_timestamp())). \
    withColumn('second', second(current_timestamp())). \
    show()
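
To show what each extract function is expected to return, here is a plain-Python sketch for one fixed timestamp. The sample value is an assumption borrowed from the `current_timestamp()` output earlier in this notebook. The week number follows ISO 8601, and the day of week is numbered from 1 for Sunday to 7 for Saturday.

```python
from datetime import datetime

# Assumed sample timestamp (a Sunday).
ts = datetime(2024, 6, 9, 18, 59, 48)

parts = {
    "year": ts.year,
    "month": ts.month,
    "weekofyear": ts.isocalendar()[1],     # ISO 8601 week number
    "dayofyear": ts.timetuple().tm_yday,
    "dayofmonth": ts.day,
    "dayofweek": ts.isoweekday() % 7 + 1,  # 1 = Sunday ... 7 = Saturday
    "hour": ts.hour,
    "minute": ts.minute,
    "second": ts.second,
}

for name, value in parts.items():
    print(f"{name}: {value}")
```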

## Using to_date and to_timestamp on Spark Data Frames

* Standard date format - `yyyy-MM-dd`
* Standard timestamp format - `yyyy-MM-dd HH:mm:ss.SSS`

In [11]:
datetimes = [
    (20140228, "28-Feb-2014 10:00:00.123"),
    (20160229, "20-Feb-2016 08:08:08.999"),
    (20171031, "31-Dec-2017 11:59:59.123"),
    (20191130, "31-Aug-2019 00:00:00.000")
]

datetimesDF = spark.createDataFrame(datetimes, 'date bigint, timestamp string')
datetimesDF.show(truncate=False)
datetimesDF.printSchema()

+--------+------------------------+
|date    |timestamp               |
+--------+------------------------+
|20140228|28-Feb-2014 10:00:00.123|
|20160229|20-Feb-2016 08:08:08.999|
|20171031|31-Dec-2017 11:59:59.123|
|20191130|31-Aug-2019 00:00:00.000|
+--------+------------------------+

root
 |-- date: long (nullable = true)
 |-- timestamp: string (nullable = true)



In [28]:
from pyspark.sql.functions import to_date, to_timestamp, lit, col
datetimesDF. \
    withColumn('to_date', to_date(col('date').cast('string'), 'yyyyMMdd')). \
    withColumn('to_timestamp', to_timestamp(col('timestamp'), 'dd-MMM-yyyy HH:mm:ss.SSS')). \
    show(truncate=False)

+--------+------------------------+----------+-----------------------+
|date    |timestamp               |to_date   |to_timestamp           |
+--------+------------------------+----------+-----------------------+
|20140228|28-Feb-2014 10:00:00.123|2014-02-28|2014-02-28 10:00:00.123|
|20160229|20-Feb-2016 08:08:08.999|2016-02-29|2016-02-20 08:08:08.999|
|20171031|31-Dec-2017 11:59:59.123|2017-10-31|2017-12-31 11:59:59.123|
|20191130|31-Aug-2019 00:00:00.000|2019-11-30|2019-08-31 00:00:00    |
+--------+------------------------+----------+-----------------------+



In [19]:
df = spark.createDataFrame([('x',)]).toDF('dummy')
df.show()

+-----+
|dummy|
+-----+
|    x|
+-----+



In [24]:
df.select(to_date(lit('March 2, 2024'), 'MMMM d, yyyy')).show()
df.select(to_date(lit('Mar 2, 2024'), 'MMM d, yyyy')).show()

+------------------------------------+
|to_date(March 2, 2024, MMMM d, yyyy)|
+------------------------------------+
|                          2024-03-02|
+------------------------------------+

+---------------------------------+
|to_date(Mar 2, 2024, MMM d, yyyy)|
+---------------------------------+
|                       2024-03-02|
+---------------------------------+



In [29]:
df.select(to_timestamp(lit('Mar 2, 2024'), 'MMM d, yyyy')).show()

+--------------------------------------+
|to_timestamp(Mar 2, 2024, MMM d, yyyy)|
+--------------------------------------+
|                   2024-03-02 00:00:00|
+--------------------------------------+



## Using `date_format` on Spark Data Frames

In [37]:
datetimes = [
    ("2014-02-28", "2014-02-28 10:00:00.123"),
    ("2016-02-29", "2016-02-29 08:08:08.999"),
    ("2017-10-31", "2017-12-31 11:59:59.123"),
    ("2019-11-30", "2019-08-31 00:00:00.000")
]

datetimesDF = spark.createDataFrame(datetimes, 'date STRING, timestamp STRING')
datetimesDF.show(truncate=False)

+----------+-----------------------+
|date      |timestamp              |
+----------+-----------------------+
|2014-02-28|2014-02-28 10:00:00.123|
|2016-02-29|2016-02-29 08:08:08.999|
|2017-10-31|2017-12-31 11:59:59.123|
|2019-11-30|2019-08-31 00:00:00.000|
+----------+-----------------------+



In [43]:
from pyspark.sql.functions import date_format
datetimesDF. \
    withColumn('date_format_date', date_format('date', 'dd-MM-yyyy HH')). \
    withColumn('date_format_timestamp', date_format('timestamp', 'dd-MM-yyyy HH')). \
    show(truncate=False)

+----------+-----------------------+----------------+---------------------+
|date      |timestamp              |date_format_date|date_format_timestamp|
+----------+-----------------------+----------------+---------------------+
|2014-02-28|2014-02-28 10:00:00.123|28-02-2014 00   |28-02-2014 10        |
|2016-02-29|2016-02-29 08:08:08.999|29-02-2016 00   |29-02-2016 08        |
|2017-10-31|2017-12-31 11:59:59.123|31-10-2017 00   |31-12-2017 11        |
|2019-11-30|2019-08-31 00:00:00.000|30-11-2019 00   |31-08-2019 00        |
+----------+-----------------------+----------------+---------------------+



In [46]:
from pyspark.sql.functions import date_format
datetimesDF. \
    withColumn('date_ym', date_format('date', 'yyyyMM')). \
    withColumn('timestamp_ym', date_format('timestamp', 'yyyyMM')). \
    show(truncate=False)

+----------+-----------------------+-------+------------+
|date      |timestamp              |date_ym|timestamp_ym|
+----------+-----------------------+-------+------------+
|2014-02-28|2014-02-28 10:00:00.123|201402 |201402      |
|2016-02-29|2016-02-29 08:08:08.999|201602 |201602      |
|2017-10-31|2017-12-31 11:59:59.123|201710 |201712      |
|2019-11-30|2019-08-31 00:00:00.000|201911 |201908      |
+----------+-----------------------+-------+------------+



In [48]:
datetimesDF. \
    withColumn('date_ym', date_format('date', 'yyyyMM').cast('int')). \
    withColumn('timestamp_ym', date_format('timestamp', 'yyyyMM').cast('int')). \
    printSchema()

root
 |-- date: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- date_ym: integer (nullable = true)
 |-- timestamp_ym: integer (nullable = true)



In [52]:
datetimesDF. \
    withColumn('date_long', date_format('date', 'yyyyMMddHHmmss').cast('long')). \
    withColumn('timestamp_long', date_format('timestamp', 'yyyyMMddHHmmss').cast('long')). \
    printSchema()
datetimesDF. \
    withColumn('date_long', date_format('date', 'yyyyMMddHHmmss').cast('long')). \
    withColumn('timestamp_long', date_format('timestamp', 'yyyyMMddHHmmss').cast('long')). \
    show()

root
 |-- date: string (nullable = true)
 |-- timestamp: string (nullable = true)
 |-- date_long: long (nullable = true)
 |-- timestamp_long: long (nullable = true)

+----------+--------------------+--------------+--------------+
|      date|           timestamp|     date_long|timestamp_long|
+----------+--------------------+--------------+--------------+
|2014-02-28|2014-02-28 10:00:...|20140228000000|20140228100000|
|2016-02-29|2016-02-29 08:08:...|20160229000000|20160229080808|
|2017-10-31|2017-12-31 11:59:...|20171031000000|20171231115959|
|2019-11-30|2019-08-31 00:00:...|20191130000000|20190831000000|
+----------+--------------------+--------------+--------------+



In [61]:
# DDD - day of the year
# EE - abbreviated name of the week day, EEEE - full name of the week day

datetimesDF. \
    withColumn('date_ym', date_format('date', 'yyyyDDD')). \
    withColumn('timestamp_ym', date_format('timestamp', 'yyyyDDD')). \
    show(truncate=False)
datetimesDF. \
    withColumn('date_desc', date_format('date', 'MMMM d, yyyy')). \
    show(truncate=False)
datetimesDF. \
    withColumn('week_day', date_format('date', 'EE')). \
    withColumn('week_full_day', date_format('date', 'EEEE')). \
    show(truncate=False)

+----------+-----------------------+-------+------------+
|date      |timestamp              |date_ym|timestamp_ym|
+----------+-----------------------+-------+------------+
|2014-02-28|2014-02-28 10:00:00.123|2014059|2014059     |
|2016-02-29|2016-02-29 08:08:08.999|2016060|2016060     |
|2017-10-31|2017-12-31 11:59:59.123|2017304|2017365     |
|2019-11-30|2019-08-31 00:00:00.000|2019334|2019243     |
+----------+-----------------------+-------+------------+
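
The `yyyyDDD` values in the first table can be verified with Python's `strftime`, where `%j` is the zero-padded day of the year. This is a plain-Python cross-check under that assumed pattern mapping, not Spark code.

```python
from datetime import date

# Assumed mapping: Spark pattern yyyyDDD -> strftime "%Y%j"
samples = [
    date(2014, 2, 28),
    date(2016, 2, 29),
    date(2017, 10, 31),
    date(2017, 12, 31),
    date(2019, 11, 30),
    date(2019, 8, 31),
]

year_and_day = [d.strftime("%Y%j") for d in samples]
print(year_and_day)
```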

+----------+-----------------------+-----------------+
|date      |timestamp              |date_desc        |
+----------+-----------------------+-----------------+
|2014-02-28|2014-02-28 10:00:00.123|February 28, 2014|
|2016-02-29|2016-02-29 08:08:08.999|February 29, 2016|
|2017-10-31|2017-12-31 11:59:59.123|October 31, 2017 |
|2019-11-30|2019-08-31 00:00:00.000|November 30, 2019|
+----------+-----------------------+-----------------+

+----------+-----------------------+--------+-------------+
|date      |timestamp             

## Dealing with Unix Timestamp in Spark Data Frames

In [79]:
# unix_timestamp, from_unixtime

datetimes = [
    (20140228, '2014-02-28', '2014-02-28 10:00:00'),
    (20160229, '2016-02-29', '2016-02-29 08:08:08'),
    (20171031, '2017-10-31', '2017-12-31 11:59:59'),
    (20191130, '2019-11-30', '2019-08-31 00:00:00')
]

datetimesDF = spark.createDataFrame(datetimes, 'dateid bigint, date string, timestamp string')
datetimesDF.show(truncate=False)
datetimesDF.printSchema()

+--------+----------+-------------------+
|dateid  |date      |timestamp          |
+--------+----------+-------------------+
|20140228|2014-02-28|2014-02-28 10:00:00|
|20160229|2016-02-29|2016-02-29 08:08:08|
|20171031|2017-10-31|2017-12-31 11:59:59|
|20191130|2019-11-30|2019-08-31 00:00:00|
+--------+----------+-------------------+

root
 |-- dateid: long (nullable = true)
 |-- date: string (nullable = true)
 |-- timestamp: string (nullable = true)



In [80]:
from pyspark.sql.functions import unix_timestamp, col
datetimesDF. \
    withColumn('unix_dateid', unix_timestamp(col('dateid').cast('string'), 'yyyyMMdd')). \
    withColumn('unix_date', unix_timestamp(col('date'), 'yyyy-MM-dd')). \
    withColumn('unix_timestamp', unix_timestamp(col('timestamp'))). \
    show(truncate=False)

+--------+----------+-------------------+-----------+----------+--------------+
|dateid  |date      |timestamp          |unix_dateid|unix_date |unix_timestamp|
+--------+----------+-------------------+-----------+----------+--------------+
|20140228|2014-02-28|2014-02-28 10:00:00|1393542000 |1393542000|1393578000    |
|20160229|2016-02-29|2016-02-29 08:08:08|1456700400 |1456700400|1456729688    |
|20171031|2017-10-31|2017-12-31 11:59:59|1509404400 |1509404400|1514717999    |
|20191130|2019-11-30|2019-08-31 00:00:00|1575068400 |1575068400|1567202400    |
+--------+----------+-------------------+-----------+----------+--------------+
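
The epoch values above depend on the session time zone, because the input strings carry no zone information. The numbers shown are consistent with a zone one hour ahead of UTC in winter and two hours ahead in summer. The plain-Python sketch below reproduces them under that assumption, using fixed offsets; `to_unix` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets standing in for the session time zone
# (UTC+1 for the winter rows, UTC+2 for the summer row).
WINTER = timezone(timedelta(hours=1))
SUMMER = timezone(timedelta(hours=2))

def to_unix(text: str, fmt: str, tz: timezone) -> int:
    """Interpret `text` as local wall-clock time in `tz` and return epoch seconds."""
    local = datetime.strptime(text, fmt).replace(tzinfo=tz)
    return int(local.timestamp())

unix_date = to_unix("2014-02-28", "%Y-%m-%d", WINTER)
unix_ts = to_unix("2014-02-28 10:00:00", "%Y-%m-%d %H:%M:%S", WINTER)
unix_summer = to_unix("2019-08-31 00:00:00", "%Y-%m-%d %H:%M:%S", SUMMER)
unix_utc = to_unix("2014-02-28", "%Y-%m-%d", timezone.utc)

print(unix_date, unix_ts, unix_summer, unix_utc)
```

Running the same notebook with a different session time zone would shift every value by the corresponding offset, as the UTC figure illustrates.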



In [86]:
unixtimes = [
    (1393561700, ),
    (1456713488, ),
    (1514701799, ),
    (1567189800, )
]

unixtimesDF =  spark.createDataFrame(unixtimes).toDF('unixtime')
unixtimesDF.printSchema()
unixtimesDF.show()

root
 |-- unixtime: long (nullable = true)

+----------+
|  unixtime|
+----------+
|1393561700|
|1456713488|
|1514701799|
|1567189800|
+----------+



In [97]:
from pyspark.sql.functions import from_unixtime, col
unixtimesDF. \
    withColumn('date', from_unixtime(col('unixtime'), 'yyyyMMdd')). \
    withColumn('time', from_unixtime(col('unixtime'))). \
    show()
unixtimesDF. \
    withColumn('date', from_unixtime(col('unixtime'), 'yyyyMMdd')). \
    withColumn('time', from_unixtime(col('unixtime'))). \
    printSchema()

+----------+--------+-------------------+
|  unixtime|    date|               time|
+----------+--------+-------------------+
|1393561700|20140228|2014-02-28 05:28:20|
|1456713488|20160229|2016-02-29 03:38:08|
|1514701799|20171231|2017-12-31 07:29:59|
|1567189800|20190830|2019-08-30 20:30:00|
+----------+--------+-------------------+

root
 |-- unixtime: long (nullable = true)
 |-- date: string (nullable = true)
 |-- time: string (nullable = true)
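
Going the other way, epoch seconds are rendered in the session time zone and returned as strings. The plain-Python sketch below reproduces the rows above, again assuming fixed offsets of UTC+1 (winter) and UTC+2 (summer); `from_unix` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets standing in for the session time zone.
WINTER = timezone(timedelta(hours=1))
SUMMER = timezone(timedelta(hours=2))

def from_unix(seconds: int, tz: timezone, fmt: str = "%Y-%m-%d %H:%M:%S") -> str:
    """Format epoch seconds as local wall-clock text in `tz`."""
    return datetime.fromtimestamp(seconds, tz=tz).strftime(fmt)

print(from_unix(1393561700, WINTER, "%Y%m%d"))  # 20140228
print(from_unix(1393561700, WINTER))            # 2014-02-28 05:28:20
print(from_unix(1567189800, SUMMER))            # 2019-08-30 20:30:00
```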



In [98]:
unixtimesDF.select(col('unixtime').cast('timestamp')).show()
unixtimesDF.select(col('unixtime').cast('timestamp')).printSchema()

+-------------------+
|           unixtime|
+-------------------+
|2014-02-28 05:28:20|
|2016-02-29 03:38:08|
|2017-12-31 07:29:59|
|2019-08-30 20:30:00|
+-------------------+

root
 |-- unixtime: timestamp (nullable = true)



## Dealing with NULL values in Spark Data Frames

In [2]:
employees = [
    (1, "Scott", "Tiger", 1000.0, "united states", 10, "+1 123 456 7890", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0, "India", None, "+91 234 567 8901", "456 78 9123"),
    (3, "Nick", "Junior", 750.0,  "united KINGDOM", "", "+44 111 111 1111", "222 33 4444"),
    (4, "Bill", "Gomes", 1500.0, "AUSTRALIA", 10, "+61 987 654 3210", "789 12 6118")
]
employeesDF =  spark.createDataFrame(employees).toDF('employee_id', 'first_name', 'last_name', 'salary', 'country', 'bonus', 'phone_number', 'ssn')
employeesDF.show()
employeesDF.printSchema()

                                                                                

+-----------+----------+---------+------+--------------+-----+----------------+-----------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+
|          1|     Scott|    Tiger|1000.0| united states|   10| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|
|          3|      Nick|   Junior| 750.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+

root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- country: string (nullable = true)
 |-- bonus: string (nullable = true)
 |-- phone_number: s

In [135]:
from pyspark.sql.functions import coalesce, nvl, col, lit
employeesDF. \
    withColumn('bonus1', coalesce('bonus', lit(0))). \
    withColumn('bonus2', coalesce(col('bonus').cast('int'), lit(0))). \
    withColumn('bonus3', nvl(col('bonus').cast('int'), lit(0))). \
    show()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+------+------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|bonus1|bonus2|bonus3|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+------+------+
|          1|     Scott|    Tiger|1000.0| united states|   10| +1 123 456 7890|123 45 6789|    10|    10|    10|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|     0|     0|     0|
|          3|      Nick|   Junior| 750.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|      |     0|     0|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|    10|    10|    10|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+------+------+
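
The difference between `bonus1` and `bonus2` comes from the empty string: it is not NULL, so `coalesce` keeps it, whereas casting it to an integer first yields NULL and the default applies. Here is a plain-Python sketch of that logic. The helpers are hypothetical, `None` stands in for NULL, and the lenient cast assumes invalid text becomes NULL instead of raising an error.

```python
def to_int(value):
    """Lenient cast: None or unparsable text becomes None."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None

def coalesce(*values):
    """Return the first argument that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None

bonuses = ["10", None, "", "10"]  # the bonus column as strings

bonus1 = [coalesce(b, 0) for b in bonuses]          # empty string survives
bonus2 = [coalesce(to_int(b), 0) for b in bonuses]  # empty string becomes 0

print(bonus1)
print(bonus2)
```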



In [143]:
from pyspark.sql.functions import expr

employeesDF. \
    withColumn('bonus1', expr("nvl(bonus, 0)")). \
    withColumn('bonus2', expr("nvl(cast(bonus as int), 0)")). \
    withColumn('bonus3', expr("nvl(nullif(bonus, ''), 0)")). \
    withColumn('bonus4', expr("coalesce(cast(bonus as int), 0)")). \
    show()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+------+------+------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|bonus1|bonus2|bonus3|bonus4|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+------+------+------+
|          1|     Scott|    Tiger|1000.0| united states|   10| +1 123 456 7890|123 45 6789|    10|    10|    10|    10|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|     0|     0|     0|     0|
|          3|      Nick|   Junior| 750.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|      |     0|     0|     0|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|    10|    10|    10|    10|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+------+------+------+



In [146]:
employeesDF. \
    withColumn('payment', col('salary')+(col('salary')*coalesce(col('bonus').cast('int'), lit(0))/100)). \
    show()
employeesDF. \
    withColumn('payment', col('salary')+(col('salary')*coalesce(col('bonus').cast('int'), lit(0))/100)). \
    printSchema()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+-------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|payment|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+-------+
|          1|     Scott|    Tiger|1000.0| united states|   10| +1 123 456 7890|123 45 6789| 1100.0|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123| 1250.0|
|          3|      Nick|   Junior| 750.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|  750.0|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118| 1650.0|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+-------+

root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- country: string (nullable =

In [147]:
### bulk NULLs
### na, fillna, replace, dropna
employees = [
    (1, "Scott", None, 1000.0, "united states", 10, "+1 123 456 7890", "123 45 6789"),
    (2, "Henry", "Ford", 1250.0, "India", None, "+91 234 567 8901", "456 78 9123"),
    (3, "Nick", None, None, "united KINGDOM", "", "+44 111 111 1111", "222 33 4444"),
    (4, "Bill", "Gomes", 1500.0, "AUSTRALIA", 10, "+61 987 654 3210", "789 12 6118")
]
employeesDF = spark.createDataFrame(employees).toDF('employee_id', 'first_name', 'last_name', 'salary', 'country', 'bonus', 'phone_number', 'ssn')
employeesDF.show()
employeesDF.printSchema()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+
|          1|     Scott|     NULL|1000.0| united states|   10| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|
|          3|      Nick|     NULL|  NULL|united KINGDOM|     |+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+

root
 |-- employee_id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- salary: double (nullable = true)
 |-- country: string (nullable = true)
 |-- bonus: string (nullable = true)
 |-- phone_number: s

In [152]:
employeesDF.fillna(0.0).show()
employeesDF.fillna('na').show()
employeesDF.fillna(0.0).fillna('na').show()
employeesDF.fillna(0.0, 'salary').fillna('na', 'last_name').show()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+
|          1|     Scott|     NULL|1000.0| united states|   10| +1 123 456 7890|123 45 6789|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|
|          3|      Nick|     NULL|   0.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+

+-----------+----------+---------+------+--------------+-----+----------------+-----------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|
+-----------+----------+---------+------+--------------+-----+----------------+
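
As the first output above suggests, a numeric fill value only touches numeric columns, and a string fill value only touches string columns. The plain-Python sketch below mimics that type-aware behaviour on a reduced copy of the data. The `schema` dictionary and the `fillna` helper are hypothetical stand-ins, not the Data Frame API.

```python
# Column types for a reduced version of the employees data (assumed for illustration).
schema = {"last_name": str, "salary": float, "bonus": str}

rows = [
    {"last_name": None, "salary": 1000.0, "bonus": "10"},
    {"last_name": "Ford", "salary": 1250.0, "bonus": None},
    {"last_name": None, "salary": None, "bonus": ""},
]

def fillna(rows, value, subset=None):
    """Replace None with `value`, but only in columns whose type matches `value`."""
    value_is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
    filled = []
    for row in rows:
        new_row = dict(row)
        for name, column_type in schema.items():
            if subset is not None and name not in subset:
                continue
            column_is_numeric = column_type in (int, float)
            if column_is_numeric == value_is_numeric and new_row[name] is None:
                new_row[name] = value
        filled.append(new_row)
    return filled

numeric_filled = fillna(rows, 0.0)
string_filled = fillna(rows, "na")
both_filled = fillna(fillna(rows, 0.0, ["salary"]), "na", ["last_name"])
print(both_filled)
```

Chaining the two calls, as in the last statement of the cell above, is one way to handle numeric and string columns with different defaults.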

## Using CASE and WHEN

In [4]:
from pyspark.sql.functions import col, lit, coalesce
employeesDF. \
    withColumn('bonus1', coalesce(col('bonus').cast('int'), lit(0))). \
    show()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|bonus1|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+
|          1|     Scott|    Tiger|1000.0| united states|   10| +1 123 456 7890|123 45 6789|    10|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|     0|
|          3|      Nick|   Junior| 750.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|     0|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|    10|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+



In [11]:
from pyspark.sql.functions import expr
employeesDF. \
    withColumn('bonus1', expr("case when bonus is null or bonus = '' then 0 else bonus end")).\
    show()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|bonus1|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+
|          1|     Scott|    Tiger|1000.0| united states|   10| +1 123 456 7890|123 45 6789|    10|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|     0|
|          3|      Nick|   Junior| 750.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|     0|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|    10|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+



In [23]:
from pyspark.sql.functions import when, col, lit
# otherwise('bonus') would insert the literal string 'bonus'; pass col('bonus') to reference the column
employeesDF. \
    withColumn('bonus1', when((col('bonus').isNull()) | (col('bonus') == lit('')), 0).otherwise(col('bonus'))). \
    show()

+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+
|employee_id|first_name|last_name|salary|       country|bonus|    phone_number|        ssn|bonus1|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+
|          1|     Scott|    Tiger|1000.0| united states|   10| +1 123 456 7890|123 45 6789|    10|
|          2|     Henry|     Ford|1250.0|         India| NULL|+91 234 567 8901|456 78 9123|     0|
|          3|      Nick|   Junior| 750.0|united KINGDOM|     |+44 111 111 1111|222 33 4444|     0|
|          4|      Bill|    Gomes|1500.0|     AUSTRALIA|   10|+61 987 654 3210|789 12 6118|    10|
+-----------+----------+---------+------+--------------+-----+----------------+-----------+------+



In [25]:
persons = [
(1, 1),
(2, 13),
(3, 18),
(4, 60),
(5, 120),
(6, 0),
(7, 12),
(8, 160),
]
personsDF = spark.createDataFrame(persons, schema='id INT, age INT')
personsDF.show()

+---+---+
| id|age|
+---+---+
|  1|  1|
|  2| 13|
|  3| 18|
|  4| 60|
|  5|120|
|  6|  0|
|  7| 12|
|  8|160|
+---+---+



In [30]:
personsDF. \
    withColumn('category', expr("""
                                case 
                                when age between 0 and 2 then 'New Born'
                                when age > 2 and age <= 12 then 'Infant'
                                when age > 12 and age <= 48 then 'Toddler'
                                when age > 48 and age <= 144 then 'Kid'
                                else 'Teenager or Adult'
                                end
                                """)). \
    show()

+---+---+-----------------+
| id|age|         category|
+---+---+-----------------+
|  1|  1|         New Born|
|  2| 13|          Toddler|
|  3| 18|          Toddler|
|  4| 60|              Kid|
|  5|120|              Kid|
|  6|  0|         New Born|
|  7| 12|           Infant|
|  8|160|Teenager or Adult|
+---+---+-----------------+



In [37]:
personsDF. \
    withColumn('category', 
               when(col('age').between(0, 2), 'New Born').
               when((col('age') > 2) & (col('age') <= 12), 'Infant').
               when((col('age') > 12) & (col('age') <= 48), 'Toddler').
               when((col('age') > 48) & (col('age') <= 144), 'Kid').
               otherwise('Teenager or Adult')
    ). \
    show()

+---+---+-----------------+
| id|age|         category|
+---+---+-----------------+
|  1|  1|         New Born|
|  2| 13|          Toddler|
|  3| 18|          Toddler|
|  4| 60|              Kid|
|  5|120|              Kid|
|  6|  0|         New Born|
|  7| 12|           Infant|
|  8|160|Teenager or Adult|
+---+---+-----------------+
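
Both versions above evaluate the conditions in order and return the label of the first one that matches, falling back to the `otherwise` / `else` branch. Here is a plain-Python sketch of the same rule set, useful for checking boundary values such as 2, 12, 48 and 144; `categorize` is a hypothetical helper.

```python
def categorize(age: int) -> str:
    """Return the first matching category, mirroring the CASE WHEN order above."""
    if 0 <= age <= 2:
        return "New Born"
    if 2 < age <= 12:
        return "Infant"
    if 12 < age <= 48:
        return "Toddler"
    if 48 < age <= 144:
        return "Kid"
    return "Teenager or Adult"

ages = [1, 13, 18, 60, 120, 0, 12, 160]
categories = [categorize(age) for age in ages]
print(categories)
```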

