#### 4.1 Create the time dimension table
The time dimension table needs to cover every day in 2020. Only the date matters here; the hours, minutes and seconds are not needed. We will extract the necessary date components and populate the table accordingly.  
We will partition the table by month. Normally we would partition by year and then by month, but in this case the data is limited to a single year.

##### Setup
I'm going to need Spark for this because I'll want to make use of some of its functionality, such as the ability to create temporary SQL views of my dataframes.

In [1]:
from setup import create_spark_session

spark = create_spark_session()

Imports and output paths:

In [2]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *

from clean import *
from etl import *

# Write locally for now; later on this may be written to S3 instead.
output_path = "output/"

In [3]:
time_df_pd = pd.DataFrame({'date':pd.date_range('2020-01-01', '2020-12-31')})
time_df_pd.head()

        date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05


In [4]:
time_df_pd.count()

date    366
dtype: int64

In [5]:
time_df = spark.createDataFrame(time_df_pd)
time_df.show()

+-------------------+
|               date|
+-------------------+
|2020-01-01 00:00:00|
|2020-01-02 00:00:00|
|2020-01-03 00:00:00|
|2020-01-04 00:00:00|
|2020-01-05 00:00:00|
|2020-01-06 00:00:00|
|2020-01-07 00:00:00|
|2020-01-08 00:00:00|
|2020-01-09 00:00:00|
|2020-01-10 00:00:00|
|2020-01-11 00:00:00|
|2020-01-12 00:00:00|
|2020-01-13 00:00:00|
|2020-01-14 00:00:00|
|2020-01-15 00:00:00|
|2020-01-16 00:00:00|
|2020-01-17 00:00:00|
|2020-01-18 00:00:00|
|2020-01-19 00:00:00|
|2020-01-20 00:00:00|
+-------------------+
only showing top 20 rows



In [6]:
# Spark 3.0+ no longer accepts the "u" pattern for formatting the weekday as a number;
# its newer datetime patterns only offer the day of week as text.
# Falling back to the legacy time parser restores the old behaviour.
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
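
If changing the parser policy for the whole session feels too broad, one possible alternative is to derive the number from `dayofweek`, which returns 1 for Sunday through 7 for Saturday, and remap it so that Monday is 1 and Sunday is 7, matching what the "u" pattern produces. The sketch below only demonstrates the arithmetic in plain Python; the function names are hypothetical, and the Spark expression in the comment is an untested assumption about how it would carry over.

```python
from datetime import date, timedelta


def sunday_first_dayofweek(d: date) -> int:
    """Emulate a Sunday-first numbering: 1 = Sunday, ..., 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def to_monday_first(dow: int) -> int:
    """Remap 1 = Sunday..7 = Saturday to 1 = Monday..7 = Sunday."""
    return (dow + 5) % 7 + 1


# Assumed Spark equivalent (not run here):
#   ((F.dayofweek('date') + 5) % 7) + 1
# Note that this would yield an integer column, whereas date_format returns a string.

# Check the remapping against Python's own ISO weekday for every day of 2020.
start = date(2020, 1, 1)
days = [start + timedelta(days=i) for i in range(366)]
all_match = all(
    to_monday_first(sunday_first_dayofweek(d)) == d.isoweekday() for d in days
)
print(all_match)
```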

In [7]:
time_df = time_df.withColumn('day', F.dayofmonth('date')) \
    .withColumn('week', F.weekofyear('date')) \
    .withColumn('month', F.month('date')) \
    .withColumn('year', F.year('date')) \
    .withColumn('weekday', F.date_format(F.col("date"), "u"))
time_df.show()

+-------------------+---+----+-----+----+-------+
|               date|day|week|month|year|weekday|
+-------------------+---+----+-----+----+-------+
|2020-01-01 00:00:00|  1|   1|    1|2020|      3|
|2020-01-02 00:00:00|  2|   1|    1|2020|      4|
|2020-01-03 00:00:00|  3|   1|    1|2020|      5|
|2020-01-04 00:00:00|  4|   1|    1|2020|      6|
|2020-01-05 00:00:00|  5|   1|    1|2020|      7|
|2020-01-06 00:00:00|  6|   2|    1|2020|      1|
|2020-01-07 00:00:00|  7|   2|    1|2020|      2|
|2020-01-08 00:00:00|  8|   2|    1|2020|      3|
|2020-01-09 00:00:00|  9|   2|    1|2020|      4|
|2020-01-10 00:00:00| 10|   2|    1|2020|      5|
|2020-01-11 00:00:00| 11|   2|    1|2020|      6|
|2020-01-12 00:00:00| 12|   2|    1|2020|      7|
|2020-01-13 00:00:00| 13|   3|    1|2020|      1|
|2020-01-14 00:00:00| 14|   3|    1|2020|      2|
|2020-01-15 00:00:00| 15|   3|    1|2020|      3|
|2020-01-16 00:00:00| 16|   3|    1|2020|      4|
|2020-01-17 00:00:00| 17|   3|    1|2020|      5|
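
The values above can be cross-checked without Spark. This minimal sketch recomputes the same five columns with the standard library, assuming that `weekofyear` follows ISO 8601 week numbering and that the "u" weekday runs from 1 (Monday) to 7 (Sunday). The helper `time_row` is hypothetical and only for illustration.

```python
from datetime import date


def time_row(d: date) -> dict:
    """Build one time-dimension row for a calendar date."""
    iso_week = d.isocalendar()[1]  # ISO 8601 week number
    return {
        "date": d.isoformat(),
        "day": d.day,
        "week": iso_week,
        "month": d.month,
        "year": d.year,
        # Integer here; the Spark column built with date_format is a string.
        "weekday": d.isoweekday(),
    }


for d in (date(2020, 1, 1), date(2020, 1, 5), date(2020, 1, 6)):
    print(time_row(d))
```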


In [8]:
# The pandas column was datetime64, which Spark maps to a timestamp.
# The time-of-day part is not needed, so cast the column to a date.
time_df = time_df.withColumn('date', time_df['date'].cast(DateType()))
time_df.show()

+----------+---+----+-----+----+-------+
|      date|day|week|month|year|weekday|
+----------+---+----+-----+----+-------+
|2020-01-01|  1|   1|    1|2020|      3|
|2020-01-02|  2|   1|    1|2020|      4|
|2020-01-03|  3|   1|    1|2020|      5|
|2020-01-04|  4|   1|    1|2020|      6|
|2020-01-05|  5|   1|    1|2020|      7|
|2020-01-06|  6|   2|    1|2020|      1|
|2020-01-07|  7|   2|    1|2020|      2|
|2020-01-08|  8|   2|    1|2020|      3|
|2020-01-09|  9|   2|    1|2020|      4|
|2020-01-10| 10|   2|    1|2020|      5|
|2020-01-11| 11|   2|    1|2020|      6|
|2020-01-12| 12|   2|    1|2020|      7|
|2020-01-13| 13|   3|    1|2020|      1|
|2020-01-14| 14|   3|    1|2020|      2|
|2020-01-15| 15|   3|    1|2020|      3|
|2020-01-16| 16|   3|    1|2020|      4|
|2020-01-17| 17|   3|    1|2020|      5|
|2020-01-18| 18|   3|    1|2020|      6|
|2020-01-19| 19|   3|    1|2020|      7|
|2020-01-20| 20|   4|    1|2020|      1|
+----------+---+----+-----+----+-------+
only showing top 20 rows

In [9]:
time_df.write.partitionBy('month').mode('overwrite').parquet(output_path + "time.parquet")
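
Partitioning by `month` means Spark writes one subdirectory per distinct value, named in the `month=<value>` style, under the target path. The sketch below does not call Spark; it only computes which partition directories and row counts to expect for 2020, which can be handy when checking the output folder afterwards. The function name `expected_partitions` is hypothetical, and the data file names Spark generates inside each directory are not modelled.

```python
from collections import Counter
from datetime import date, timedelta


def expected_partitions(year: int, base: str = "output/time.parquet") -> dict:
    """Map each expected partition directory to its number of daily rows."""
    d = date(year, 1, 1)
    counts = Counter()
    while d.year == year:
        counts[f"{base}/month={d.month}"] += 1
        d += timedelta(days=1)
    return dict(counts)


for path, rows in expected_partitions(2020).items():
    print(path, rows)
```

With year-then-month partitioning, the same idea would simply add a `year=<value>` level above each month directory.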