# Datetime Functions

##### Objectives
1. Cast to timestamp
2. Format datetimes
3. Extract from timestamp
4. Convert to date
5. Manipulate datetimes

##### Methods
- <a href="https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.Column.html#pyspark.sql.Column" target="_blank">Column</a>: `cast`
- <a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html?#functions" target="_blank">Built-In Functions</a>: `date_format`, `to_date`, `date_add`, `year`, `month`, `dayofweek`, `minute`, `second`

In [0]:
%run ./Includes/Classroom-Setup

Let's use a subset of the BedBricks events dataset to practice working with datetimes.

In [0]:
from pyspark.sql.functions import col

df = spark.read.parquet(eventsPath).select("user_id", col("event_timestamp").alias("timestamp"))
display(df)

### Built-In Functions: Date Time Functions
Here are a few built-in functions to manipulate dates and times in Spark.

| Method | Description |
| --- | --- |
| add_months | Returns the date that is numMonths after startDate |
| current_timestamp | Returns the current timestamp at the start of query evaluation as a timestamp column |
| date_format | Converts a date/timestamp/string to a string in the format given by the second argument |
| dayofweek | Extracts the day of the week as an integer (1 = Sunday through 7 = Saturday) from a given date/timestamp/string |
| from_unixtime | Converts the number of seconds from unix epoch (1970-01-01 00:00:00 UTC) to a string representing the timestamp of that moment in the current system time zone in the yyyy-MM-dd HH:mm:ss format |
| minute | Extracts the minutes as an integer from a given date/timestamp/string. |
| unix_timestamp | Converts time string with given pattern to Unix timestamp (in seconds) |

### Cast to Timestamp

#### `cast()`
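Casts the column to a different data type, specified either as a string name or as a `DataType` instance. The `timestamp` column here holds microseconds, while a numeric-to-timestamp cast expects seconds since the Unix epoch, so the cells below divide by 1 million before casting.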

In [0]:
timestampDF = df.withColumn("timestamp", (col("timestamp") / 1e6).cast("timestamp"))
display(timestampDF)

In [0]:
from pyspark.sql.types import TimestampType

timestampDF = df.withColumn("timestamp", (col("timestamp") / 1e6).cast(TimestampType()))
display(timestampDF)
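
To see why the division by `1e6` matters, here is a plain-Python sketch (no Spark required) of the same microseconds-to-timestamp conversion. The sample value and the helper name `micros_to_datetime` are made up for illustration, and the result is shown in UTC, whereas Spark displays timestamps in the session time zone.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_datetime(micros: int) -> datetime:
    """Interpret an integer as microseconds since the Unix epoch (UTC)."""
    # Integer arithmetic via timedelta keeps the microsecond part exact.
    return EPOCH + timedelta(microseconds=micros)


# Hypothetical event timestamp in microseconds
sample_micros = 1593878946592107
sample_dt = micros_to_datetime(sample_micros)
print(sample_dt.isoformat())  # 2020-07-04T16:09:06.592107+00:00
```

Without the division, the same number would be read as seconds and land millions of years in the future.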

### Datetime Patterns for Formatting and Parsing
There are several common scenarios for datetime usage in Spark:

- CSV/JSON datasources use the pattern string for parsing and formatting datetime content.
- Datetime functions that convert StringType to or from DateType or TimestampType, e.g. `unix_timestamp`, `date_format`, `from_unixtime`, `to_date`, `to_timestamp`, etc.

Spark uses <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html" target="_blank">pattern letters for date and timestamp parsing and formatting</a>. A subset of these patterns is shown below.

| Symbol | Meaning         | Presentation | Examples               |
| ------ | --------------- | ------------ | ---------------------- |
| G      | era             | text         | AD; Anno Domini        |
| y      | year            | year         | 2020; 20               |
| D      | day-of-year     | number(3)    | 189                    |
| M/L    | month-of-year   | month        | 7; 07; Jul; July       |
| d      | day-of-month    | number(3)    | 28                     |
| Q/q    | quarter-of-year | number/text  | 3; 03; Q3; 3rd quarter |
| E      | day-of-week     | text         | Tue; Tuesday           |

<img src="https://files.training.databricks.com/images/icon_warn_32.png" alt="Warning"> Spark's handling of dates and timestamps changed in version 3.0, and the patterns used for parsing and formatting these values changed as well. For a discussion of these changes, please reference <a href="https://databricks.com/blog/2020/07/22/a-comprehensive-look-at-dates-and-timestamps-in-apache-spark-3-0.html" target="_blank">this Databricks blog post</a>.

### Format date

#### `date_format()`
Converts a date/timestamp/string to a string formatted with the given date time pattern.

In [0]:
from pyspark.sql.functions import date_format

formattedDF = (timestampDF
               .withColumn("date string", date_format("timestamp", "MMMM dd, yyyy"))
               .withColumn("time string", date_format("timestamp", "HH:mm:ss.SSSSSS"))
              )
display(formattedDF)
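
As a rough idea of what the two pattern strings above produce, the sketch below formats a hypothetical timestamp with Python's `strftime`. This is only an analogue: `strftime` directives are not Spark pattern letters, and the month name assumes the default English locale.

```python
from datetime import datetime

# Hypothetical timestamp value
ts = datetime(2020, 7, 4, 16, 9, 6, 592107)

# Spark pattern "MMMM dd, yyyy" corresponds roughly to "%B %d, %Y"
date_string = ts.strftime("%B %d, %Y")

# Spark pattern "HH:mm:ss.SSSSSS" corresponds roughly to "%H:%M:%S.%f"
time_string = ts.strftime("%H:%M:%S.%f")

print(date_string)  # July 04, 2020
print(time_string)  # 16:09:06.592107
```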

### Extract datetime attribute from timestamp

#### `year`
Extracts the year as an integer from a given date/timestamp/string.

##### Similar methods: `month`, `dayofweek`, `minute`, `second`, etc.

In [0]:
from pyspark.sql.functions import year, month, dayofweek, minute, second

datetimeDF = (timestampDF
              .withColumn("year", year(col("timestamp")))
              .withColumn("month", month(col("timestamp")))
              .withColumn("dayofweek", dayofweek(col("timestamp")))
              .withColumn("minute", minute(col("timestamp")))
              .withColumn("second", second(col("timestamp")))
             )
display(datetimeDF)
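
One detail worth checking is the numbering used by `dayofweek`, which counts from 1 = Sunday to 7 = Saturday. The plain-Python sketch below extracts the same fields from a hypothetical timestamp; `spark_style_dayofweek` is an illustrative helper, not a Spark or standard-library function.

```python
from datetime import datetime


def spark_style_dayofweek(dt: datetime) -> int:
    """Map Python's ISO weekday (Mon=1..Sun=7) to 1=Sunday..7=Saturday."""
    return dt.isoweekday() % 7 + 1


# Hypothetical timestamp; 2020-07-04 was a Saturday
ts = datetime(2020, 7, 4, 16, 9, 6)

parts = {
    "year": ts.year,
    "month": ts.month,
    "dayofweek": spark_style_dayofweek(ts),
    "minute": ts.minute,
    "second": ts.second,
}
print(parts)
# {'year': 2020, 'month': 7, 'dayofweek': 7, 'minute': 9, 'second': 6}
```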

### Convert to Date

#### `to_date`
Converts the column to DateType using the standard casting rules, which drops the time-of-day portion of a timestamp.

In [0]:
from pyspark.sql.functions import to_date

dateDF = timestampDF.withColumn("date", to_date(col("timestamp")))
display(dateDF)

### Manipulate Datetimes

#### `date_add`
Returns the date that is the given number of days after the start date.

In [0]:
from pyspark.sql.functions import date_add

plus2DF = timestampDF.withColumn("plus_two_days", date_add(col("timestamp"), 2))
display(plus2DF)
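
The sketch below mirrors `to_date` and `date_add` in plain Python, using a hypothetical timestamp near a month boundary. Note that `date_add` returns a date, so the time of day does not carry over into `plus_two_days`.

```python
from datetime import datetime, timedelta

# Hypothetical timestamp close to the end of a month
ts = datetime(2020, 6, 30, 23, 15, 0)

# Like to_date: keep the calendar date, drop the time of day
as_date = ts.date()

# Like date_add(..., 2): move forward two calendar days
plus_two_days = as_date + timedelta(days=2)

print(as_date)        # 2020-06-30
print(plus_two_days)  # 2020-07-02
```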

# Active Users Lab
Plot daily active users and average active users by day of week.
1. Extract timestamp and date of events
2. Get daily active users
3. Get average number of active users by day of week
4. Sort day of week in correct order

### Setup
Run the cell below to create the starting DataFrame of user IDs and timestamps of events logged on the BedBricks website.

In [0]:
from pyspark.sql.functions import col, to_date

df = (spark
      .read
      .parquet(eventsPath)
      .select("user_id", col("event_timestamp").alias("ts"))
     )

display(df)

### 1. Extract timestamp and date of events
- Convert **`ts`** from microseconds to seconds by dividing by 1 million, then cast it to a timestamp
- Add **`date`** column by converting **`ts`** to date

In [0]:
# ANSWER
datetimeDF = (df
              .withColumn("ts", (col("ts") / 1e6).cast("timestamp"))
              .withColumn("date", to_date("ts"))
             )
display(datetimeDF)

**CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import DateType, StringType, StructField, StructType, TimestampType

expected1a = StructType([StructField("user_id", StringType(), True),
                         StructField("ts", TimestampType(), True),
                         StructField("date", DateType(), True)])

result1a = datetimeDF.schema

assert expected1a == result1a, "datetimeDF does not have the expected schema"

In [0]:
import datetime

expected1b = datetime.date(2020, 6, 19)
result1b = datetimeDF.sort("date").first().date

assert expected1b == result1b, "datetimeDF does not have the expected date values"

### 2. Get daily active users
- Group by date
- Aggregate approximate count of distinct **`user_id`** and alias to "active_users"
  - Recall built-in function to get approximate count distinct
- Sort by date
- Plot as line graph

In [0]:
# ANSWER
from pyspark.sql.functions import approx_count_distinct

activeUsersDF = (datetimeDF
                 .groupBy("date")
                 .agg(approx_count_distinct("user_id").alias("active_users"))
                 .sort("date")
                )
display(activeUsersDF)
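
To make the grouping logic concrete, here is a small plain-Python sketch of "distinct users per date" on made-up events. It counts exactly by using sets, whereas `approx_count_distinct` returns an estimate, so the two can differ slightly on large data.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (user_id, date) events
events = [
    ("u1", date(2020, 6, 19)),
    ("u2", date(2020, 6, 19)),
    ("u1", date(2020, 6, 19)),  # repeat visit, counted once
    ("u1", date(2020, 6, 20)),
]

# Group by date, collecting the distinct user IDs seen on each day
users_by_date = defaultdict(set)
for user_id, day in events:
    users_by_date[day].add(user_id)

# Count the distinct users per day and sort by date
active_users = sorted((day, len(users)) for day, users in users_by_date.items())
print(active_users)
# [(datetime.date(2020, 6, 19), 2), (datetime.date(2020, 6, 20), 1)]
```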

**CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import LongType

expected2a = StructType([StructField("date", DateType(), True),
                         StructField("active_users", LongType(), False)])

result2a = activeUsersDF.schema

assert expected2a == result2a, "activeUsersDF does not have the expected schema"

In [0]:
expected2b = [(datetime.date(2020, 6, 19), 251573), (datetime.date(2020, 6, 20), 357215), (datetime.date(2020, 6, 21), 305055), (datetime.date(2020, 6, 22), 239094), (datetime.date(2020, 6, 23), 243117)]

result2b = [(row.date, row.active_users) for row in activeUsersDF.take(5)]

assert expected2b == result2b, "activeUsersDF does not have the expected values"

### 3. Get average number of active users by day of week
- Add **`day`** column by extracting day of week from **`date`** using a datetime pattern string
- Group by **`day`**
- Aggregate average of **`active_users`** and alias to "avg_users"

In [0]:
# ANSWER
from pyspark.sql.functions import date_format, avg

activeDowDF = (activeUsersDF
               .withColumn("day", date_format(col("date"), "E"))
               .groupBy("day")
               .agg(avg(col("active_users")).alias("avg_users"))
              )
display(activeDowDF)
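
Sorting on the **`day`** string orders the rows alphabetically (Fri, Mon, Sat, ...), which is why the lab outline lists "Sort day of week in correct order" as a separate step. The plain-Python sketch below shows one possible approach on made-up counts: average per weekday abbreviation, then order by an explicit week list. Starting the week on Monday is an assumption made here for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily active user counts
daily = [
    (date(2020, 6, 19), 100),  # Fri
    (date(2020, 6, 20), 300),  # Sat
    (date(2020, 6, 21), 200),  # Sun
    (date(2020, 6, 26), 200),  # Fri
]

# Group counts by weekday abbreviation, similar to the "E" pattern
counts_by_day = defaultdict(list)
for day, active_users in daily:
    counts_by_day[day.strftime("%a")].append(active_users)

# Average the counts for each weekday
avg_users = {name: sum(vals) / len(vals) for name, vals in counts_by_day.items()}

# Order by position in the week instead of alphabetically
week_order = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
ordered = [(name, avg_users[name]) for name in week_order if name in avg_users]
print(ordered)  # [('Fri', 150.0), ('Sat', 300.0), ('Sun', 200.0)]
```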

**CHECK YOUR WORK**

In [0]:
from pyspark.sql.types import DoubleType

expected3a = StructType([StructField("day", StringType(), True),
                         StructField("avg_users", DoubleType(), True)])

result3a = activeDowDF.schema

assert expected3a == result3a, "activeDowDF does not have the expected schema"

In [0]:
expected3b = [("Fri", 247180.66666666666), ("Mon", 238195.5), ("Sat", 278482.0), ("Sun", 282905.5), ("Thu", 264620.0), ("Tue", 260942.5), ("Wed", 227214.0)]

result3b = [(row.day, row.avg_users) for row in activeDowDF.sort("day").collect()]

assert expected3b == result3b, "activeDowDF does not have the expected values"

### Clean up classroom

In [0]:
%run ./Includes/Classroom-Cleanup