#### 4.5 Create the time dimension table
The time dimension table needs to cover every day in 2020. We only need the date part; the hours, minutes and seconds are of no interest. We will extract the necessary date components and populate the table accordingly.  
We will partition the table by month. Normally we would partition by year and then by month, but in this case the data is limited to one year.

TODO: Redo this so that the required dates are extracted from the county facts table instead of being pre-created.

##### Setup
I need Spark for this because I want to use some of its functionality, such as the ability to create temporary SQL views of my dataframes.

In [14]:
from setup import create_spark_session

spark = create_spark_session()

Imports and output paths:

In [15]:
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import *

from clean import *
from etl import *

# Write locally for now; later on this may be written to S3 instead
output_path = "output/"

In [16]:
covid_cases_df = load_covid_case_data(spark)
covid_cases_df.limit(1).show()

(output truncated: a single, very wide row; the columns from index 5 onward are the date columns used below)

In [17]:
unix_time = pd.Timestamp("1970-01-01")
second = pd.Timedelta('1s')

date_list = [((pd.to_datetime(c) - unix_time) // second, ) for c in covid_cases_df.columns[5:]]
date_list[:5]

[(1579651200,), (1579737600,), (1579824000,), (1579910400,), (1579996800,)]
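To make the conversion above easier to follow, here is a minimal sketch of the same arithmetic on a few sample labels. The labels and their `%m/%d/%y` format are an assumption on my part (the real column headers are not shown in this notebook); the names `sample_labels`, `epoch_seconds` and `round_trip` are only for illustration.

```python
import pandas as pd

# Hypothetical column labels; the real header format of the case data is an assumption.
sample_labels = ["1/22/20", "1/23/20", "1/24/20"]

unix_epoch = pd.Timestamp("1970-01-01")
one_second = pd.Timedelta("1s")

# Whole seconds between the epoch and midnight of each date (timezone-naive).
epoch_seconds = [
    int((pd.to_datetime(label, format="%m/%d/%y") - unix_epoch) // one_second)
    for label in sample_labels
]

# Converting back recovers the original calendar dates.
round_trip = [pd.Timestamp(s, unit="s").date().isoformat() for s in epoch_seconds]

print(epoch_seconds)
print(round_trip)
```

Dividing a `Timedelta` by a one-second `Timedelta` with `//` yields a plain integer, which is why each tuple in `date_list` holds an integer number of seconds.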

In [5]:
len(date_list)

403

In [18]:
time_columns = ['timestamp']

time_df = spark.createDataFrame(date_list, time_columns)
time_df.limit(5).show()

+----------+
| timestamp|
+----------+
|1579651200|
|1579737600|
|1579824000|
|1579910400|
|1579996800|
+----------+



In [19]:
time_df = time_df.withColumn("date", F.from_unixtime("timestamp").cast(DateType()))
time_df.limit(5).show()

+----------+----------+
| timestamp|      date|
+----------+----------+
|1579651200|2020-01-22|
|1579737600|2020-01-23|
|1579824000|2020-01-24|
|1579910400|2020-01-25|
|1579996800|2020-01-26|
+----------+----------+
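One thing to keep in mind here: as far as I understand it, `from_unixtime` renders the timestamp in the Spark session time zone, so the date you get after the cast can depend on that setting. The sketch below illustrates the effect in plain Python with a hypothetical UTC-5 offset; it does not call Spark, and the offset is just an example.

```python
from datetime import datetime, timedelta, timezone

ts = 1579651200  # first timestamp from the table above (midnight UTC)

utc_minus_5 = timezone(timedelta(hours=-5))  # hypothetical session time zone

# The same instant falls on different calendar dates in different zones.
date_in_utc = datetime.fromtimestamp(ts, tz=timezone.utc).date()
date_in_utc_minus_5 = datetime.fromtimestamp(ts, tz=utc_minus_5).date()

print(date_in_utc)
print(date_in_utc_minus_5)
```

The dates in the table above match the UTC reading, which suggests the session here ran in UTC or a zone east of it.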



In [20]:
# Spark 3.0+ no longer accepts the "u" pattern (day of week as a number) in date_format;
# its new formatter only offers the day of week as text.
# Falling back to the legacy time parser restores the old behaviour.
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
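If you would rather not switch the whole session to the legacy parser, one option might be to derive the number from `F.dayofweek`, which counts 1 = Sunday through 7 = Saturday, using something like `(F.dayofweek("date") + 5) % 7 + 1`. I have not run that Spark expression here; the sketch below only checks the arithmetic in plain Python, and the helper names are hypothetical.

```python
from datetime import date, timedelta


def sunday_first(d: date) -> int:
    """Day of week numbered 1 = Sunday .. 7 = Saturday."""
    return d.isoweekday() % 7 + 1


def iso_weekday_from_sunday_first(day_of_week: int) -> int:
    """Map 1 = Sunday .. 7 = Saturday onto 1 = Monday .. 7 = Sunday."""
    return (day_of_week + 5) % 7 + 1


# Check the mapping over one full week starting on 2020-01-20 (a Monday).
start = date(2020, 1, 20)
week = [start + timedelta(days=i) for i in range(7)]
mapped = [iso_weekday_from_sunday_first(sunday_first(d)) for d in week]

print(mapped)
```

The target numbering (1 = Monday through 7 = Sunday) is the one the `"u"` pattern produces under the legacy parser.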

In [22]:
# Use the F alias imported above rather than relying on names pulled in by star imports
time_df = time_df.withColumn('day', F.dayofmonth('date')) \
    .withColumn('week', F.weekofyear('date')) \
    .withColumn('month', F.month('date')) \
    .withColumn('year', F.year('date')) \
    .withColumn('weekday', F.date_format(F.col("date"), "u").cast(IntegerType()))
time_df.limit(5).show()

+----------+----------+---+----+-----+----+-------+
| timestamp|      date|day|week|month|year|weekday|
+----------+----------+---+----+-----+----+-------+
|1579651200|2020-01-22| 22|   4|    1|2020|      3|
|1579737600|2020-01-23| 23|   4|    1|2020|      4|
|1579824000|2020-01-24| 24|   4|    1|2020|      5|
|1579910400|2020-01-25| 25|   4|    1|2020|      6|
|1579996800|2020-01-26| 26|   4|    1|2020|      7|
+----------+----------+---+----+-----+----+-------+
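As a quick cross-check of the derived columns, the same components can be recomputed with pandas for the five timestamps shown. This is only a sanity-check sketch, not part of the pipeline; it assumes that `weekofyear` follows ISO week numbering and that the weekday column uses 1 = Monday through 7 = Sunday, which is what the table above suggests.

```python
import pandas as pd

# The five timestamps from the table above.
timestamps = [1579651200, 1579737600, 1579824000, 1579910400, 1579996800]
dates = pd.to_datetime(pd.Series(timestamps), unit="s")

check = pd.DataFrame({
    "timestamp": timestamps,
    "day": dates.dt.day,
    "week": dates.dt.isocalendar().week.astype(int),  # ISO week number
    "month": dates.dt.month,
    "year": dates.dt.year,
    "weekday": dates.dt.dayofweek + 1,  # pandas uses 0 = Monday, so shift by one
})

print(check.to_string(index=False))
```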



In [23]:
time_df.printSchema()

root
 |-- timestamp: long (nullable = true)
 |-- date: date (nullable = true)
 |-- day: integer (nullable = true)
 |-- week: integer (nullable = true)
 |-- month: integer (nullable = true)
 |-- year: integer (nullable = true)
 |-- weekday: integer (nullable = true)



In [24]:
time_df = time_df.drop('date')
time_df.limit(5).show()

+----------+---+----+-----+----+-------+
| timestamp|day|week|month|year|weekday|
+----------+---+----+-----+----+-------+
|1579651200| 22|   4|    1|2020|      3|
|1579737600| 23|   4|    1|2020|      4|
|1579824000| 24|   4|    1|2020|      5|
|1579910400| 25|   4|    1|2020|      6|
|1579996800| 26|   4|    1|2020|      7|
+----------+---+----+-----+----+-------+



In [25]:
time_df.write.partitionBy('month').mode('overwrite').parquet(output_path + "time.parquet")
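A small sanity check on the partitioning choice: there are 403 date columns starting at 2020-01-22. If those are consecutive days (an assumption, I have not verified it against the data), the range runs past the end of 2020, and partitioning by month alone would put the same month of two different years into one partition. The sketch below does the counting in pandas under that assumption.

```python
import pandas as pd

# Assumption: 403 consecutive daily dates starting on 2020-01-22.
dates = pd.date_range("2020-01-22", periods=403, freq="D")

last_date = dates[-1].date().isoformat()
distinct_months = len(set(dates.month))
distinct_year_months = len(set(zip(dates.year, dates.month)))

print(last_date)
print(distinct_months, distinct_year_months)
```

If the counts differ on the real data, `partitionBy('year', 'month')` would keep the years apart.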