# Movie Recommendation with Spark and AWS
## Introduction

This project uses the `ml-latest-small` dataset from [MovieLens](https://grouplens.org/datasets/movielens/latest/), a movie recommendation service. It contains 100836 ratings and 3683 tags across 9742 movies, created by 610 users between 1996 and 2018. The larger full dataset contains 27753444 ratings and 1108997 tags across 58098 movies, created by 283228 users between 1995 and 2018.

I also prepared two txt files listing movies with awards. Their contents were copied from [Wikipedia/Award-winning films](https://en.wikipedia.org/wiki/List_of_Academy_Award-winning_films).

The goal of the project is to build an ETL pipeline that extracts the data from S3, processes it with Spark, stages it in Redshift, and transforms it into a set of dimensional tables.

In [None]:
import boto3
import os
import configparser
from datetime import datetime
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col, isnan, when, count, trim, desc, sum, asc
from pyspark.sql.functions import year, month, dayofmonth, hour, weekofyear, date_format
from pyspark.sql.functions import countDistinct, explode, split, concat_ws, collect_list
from pyspark.sql.types import StructType as R, StructField as Fld, DoubleType as Dbl, StringType as Str, IntegerType as Int, DateType as Date

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# STEP 1: Get the params of the created redshift cluster 
- This is for reading data from S3 to redshift
- We need:
    - The redshift cluster <font color='red'>endpoint</font>
    - The <font color='red'>IAM role ARN</font> that gives Redshift access to read from S3
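
As a rough illustration of the configuration file this step relies on, the sketch below parses a `dwh.cfg`-style text with `configparser` and checks that the expected keys are present. The section and key names follow the notebook; every value shown is a placeholder, and `load_config` is a hypothetical helper, not part of the project.

```python
import configparser

# Hypothetical dwh.cfg contents: section and key names follow the notebook,
# every value is a placeholder and must be replaced with real settings.
SAMPLE_CFG = """
[AWS]
KEY = YOUR_AWS_ACCESS_KEY_ID
SECRET = YOUR_AWS_SECRET_ACCESS_KEY

[DWH]
DWH_CLUSTER_TYPE = multi-node
DWH_NUM_NODES = 4
DWH_NODE_TYPE = dc2.large
DWH_CLUSTER_IDENTIFIER = dwhCluster
DWH_DB = dwh
DWH_DB_USER = dwhuser
DWH_DB_PASSWORD = CHANGE_ME
DWH_PORT = 5439
DWH_IAM_ROLE_NAME = dwhRole
"""

REQUIRED_KEYS = {
    "AWS": ["KEY", "SECRET"],
    "DWH": [
        "DWH_CLUSTER_TYPE", "DWH_NUM_NODES", "DWH_NODE_TYPE",
        "DWH_CLUSTER_IDENTIFIER", "DWH_DB", "DWH_DB_USER",
        "DWH_DB_PASSWORD", "DWH_PORT", "DWH_IAM_ROLE_NAME",
    ],
}


def load_config(text):
    """Parse config text and fail early if an expected key is missing."""
    config = configparser.ConfigParser()
    config.read_string(text)
    missing = [
        f"{section}.{key}"
        for section, keys in REQUIRED_KEYS.items()
        for key in keys
        if not config.has_option(section, key)
    ]
    if missing:
        raise KeyError("missing config keys: " + ", ".join(missing))
    return config
```

Note that `config.get` always returns strings, so a value such as `DWH_PORT` needs an explicit `int(...)` wherever a number is required.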

In [None]:
config = configparser.ConfigParser()

#Normally this file should be in ~/.aws/credentials
config.read_file(open('dwh.cfg'))

KEY                    = config.get('AWS','KEY')
SECRET                 = config.get('AWS','SECRET')

DWH_CLUSTER_TYPE       = config.get("DWH","DWH_CLUSTER_TYPE")
DWH_NUM_NODES          = config.get("DWH","DWH_NUM_NODES")
DWH_NODE_TYPE          = config.get("DWH","DWH_NODE_TYPE")

DWH_CLUSTER_IDENTIFIER = config.get("DWH","DWH_CLUSTER_IDENTIFIER")
DWH_DB                 = config.get("DWH","DWH_DB")
DWH_DB_USER            = config.get("DWH","DWH_DB_USER")
DWH_DB_PASSWORD        = config.get("DWH","DWH_DB_PASSWORD")
DWH_PORT               = config.get("DWH","DWH_PORT")

DWH_IAM_ROLE_NAME      = config.get("DWH", "DWH_IAM_ROLE_NAME")

(DWH_DB_USER, DWH_DB_PASSWORD, DWH_DB)

pd.DataFrame({"Param":
                  ["DWH_CLUSTER_TYPE", "DWH_NUM_NODES", "DWH_NODE_TYPE", "DWH_CLUSTER_IDENTIFIER", "DWH_DB", "DWH_DB_USER", "DWH_DB_PASSWORD", "DWH_PORT", "DWH_IAM_ROLE_NAME"],
              "Value":
                  [DWH_CLUSTER_TYPE, DWH_NUM_NODES, DWH_NODE_TYPE, DWH_CLUSTER_IDENTIFIER, DWH_DB, DWH_DB_USER, DWH_DB_PASSWORD, DWH_PORT, DWH_IAM_ROLE_NAME]
             })

os.environ["AWS_ACCESS_KEY_ID"]= config['AWS']['KEY']
os.environ["AWS_SECRET_ACCESS_KEY"]= config['AWS']['SECRET']

In [None]:
# e.g. DWH_ENDPOINT="redshift-cluster-1.csmamz5zxmle.us-west-2.redshift.amazonaws.com" 
DWH_ENDPOINT="" 
    
#e.g DWH_ROLE_ARN="arn:aws:iam::988332130976:role/dwhRole"
DWH_ROLE_ARN=""
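
Instead of pasting the two values by hand, they could probably be looked up from AWS. The sketch below is one possible way, assuming the cluster and role already exist; `get_cluster_endpoint` and `get_role_arn` are hypothetical helper names, and the clients are passed in so the functions stay easy to test.

```python
def get_cluster_endpoint(redshift_client, cluster_identifier):
    """Return the endpoint address of an existing Redshift cluster."""
    response = redshift_client.describe_clusters(
        ClusterIdentifier=cluster_identifier
    )
    return response["Clusters"][0]["Endpoint"]["Address"]


def get_role_arn(iam_client, role_name):
    """Return the ARN of an existing IAM role."""
    return iam_client.get_role(RoleName=role_name)["Role"]["Arn"]


# Possible usage in the notebook (assumes valid credentials and region):
# redshift = boto3.client("redshift", region_name="us-west-2",
#                         aws_access_key_id=KEY, aws_secret_access_key=SECRET)
# iam = boto3.client("iam", region_name="us-west-2",
#                    aws_access_key_id=KEY, aws_secret_access_key=SECRET)
# DWH_ENDPOINT = get_cluster_endpoint(redshift, DWH_CLUSTER_IDENTIFIER)
# DWH_ROLE_ARN = get_role_arn(iam, DWH_IAM_ROLE_NAME)
```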

# Step 2: Explore and Assess the Data using Spark

In [None]:
spark = SparkSession.builder\
                     .config("spark.jars.packages","org.apache.hadoop:hadoop-aws:2.7.0")\
                     .getOrCreate()

### Part 1: Load Data from S3 and clean dataframe
- movie.csv: including movieId, title(year), genres
  - split title and year from the second column
  - split the pipe-delimited genres into individual genres
- ratings.csv: including userId, movieId, rating, ts
  - transform ts string into timestamp
- tags.csv: including userId, movieId, tag, ts
  - transform ts string into timestamp
- awards.txt: including Film, year, awards, nominations
  - split txt data using delimiter "|"
  - identify parsing issues after splitting, such as implausible years
  - transform data into appropriate data type
- award_corrected.txt: including Film, year, awards, nominations (corrections for awards.txt)
  - join with awards to correct the year
  - transform data into appropriate data type
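
The parsing rules listed above can be illustrated in plain Python before they are applied in Spark. This is only a sketch: the helper names are hypothetical, and the sample lines in the usage note are made up for illustration rather than taken from the data files.

```python
import re

# A title such as "Toy Story (1995)" ends with a four-digit year in parentheses.
TITLE_YEAR = re.compile(r"^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$")


def split_title_year(raw_title):
    """Split 'Title (YYYY)' into (title, year); year is None if absent."""
    match = TITLE_YEAR.match(raw_title)
    if match is None:
        return raw_title.strip(), None
    return match.group("title"), int(match.group("year"))


def to_int_or_none(value):
    """Cast a string to int, returning None when the value is not numeric."""
    try:
        return int(value)
    except ValueError:
        return None


def parse_award_line(line):
    """Split one '|'-delimited awards line and cast each field."""
    film, year, awards, nominations = [part.strip() for part in line.split("|")]
    return {
        "film": film,
        "year": to_int_or_none(year),
        "awards": to_int_or_none(awards),
        "nominations": to_int_or_none(nominations),
    }
```

For example, `split_title_year("Toy Story (1995)")` gives `("Toy Story", 1995)`, and a line whose year field is not a plain number, such as the made-up `"Example Film |1927/28 |3 |5"`, ends up with `year` set to `None`, which is the kind of record the quality checks need to catch.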

In [None]:
movieSchema = R([
            Fld("movieId",Int()),
            Fld("title",Str()),
            Fld("genres",Str())
            ])

In [None]:
ratingSchema = R([
            Fld("userId",Int()),
            Fld("movieId",Int()),
            Fld("rating",Dbl()),
            Fld("ts",Str())
            ])

In [None]:
tagSchema = R([
            Fld("userId",Int()),
            Fld("movieId",Int()),
            Fld("tag",Str()),
            Fld("ts",Str())
            ])

In [None]:
# read movies, ratings, and tags csv
dfmovies = spark.read.csv("s3a://udacity-input/ml-latest-small/movies.csv", header=True, schema=movieSchema)
dfratings = spark.read.csv("s3a://udacity-input/ml-latest-small/ratings.csv", header = True, schema=ratingSchema)
dftags = spark.read.csv("s3a://udacity-input/ml-latest-small/tags.csv", header = True, schema=tagSchema)

In [None]:
# read awards txt
dfawards = spark.read.option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "true") \
    .csv("s3a://udacity-input/ml-latest-small/Awards.txt")

dfawards.show(10, truncate=False)

In [None]:
# read award_corrected txt
dfawards2 = spark.read.option("header", "true") \
    .option("delimiter", "|") \
    .option("inferSchema", "true") \
    .csv("s3a://udacity-input/ml-latest-small/Award_corrected.txt")

dfawards2.show(10, truncate=False)

# Step 3: Define Relational Data Model
**For the following use cases, I created 5 tables**
- number of movies in the dataset  
- number of movies in each genre  
- number of users in the dataset  
- Minimum number of ratings per user  
- Minimum number of ratings per movie   
- number of movies not rated  
- the top 5 movies with high ratings  
- number of movies receiving awards  
- total awards that movie received  
- number of movies rated and receiving awards  
- the average rating scores of movies with awards  
- range of years covered by the movies, ratings, and awards datasets  

**snowflake schema**
* **awards** - (film, year, nominations, awards)  
This table holds the awards that each movie received. The composite key of film and year identifies each row, since different films can share the same name. 
* **movies** - (movieId, title, year)  
The primary key is movieId. The genres column is removed from the original table because it holds a list of genres for each movie.
* **genres** - (genreId, movieId, genre)  
A separate genres table records the genres of each movie. Since a movie can have several genres, a unique genreId is created as the primary key.  
* **ratings** - (userId, movieId, rating, rate_time, year)  
The composite key is userId and movieId, since a user can rate many movies and a movie can be rated by many users.
* **time** - timestamps in ratings broken down into specific units (date_key, day, week, month, year)  
The time table breaks each rating timestamp into day, week, month, and year. The primary key is date_key.
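
The notebook does not show how the Redshift tables are created, so the following is only a sketch of what the DDL might look like. Table and column names follow the notebook; every column type and length is an assumption chosen to match the Spark schemas, and the column order of the COPY targets mirrors the order in which the columns are written to parquet, since a parquet COPY is expected to map columns by position.

```python
# Hypothetical DDL: names follow the notebook, all types are assumptions.
DDL = {
    "dimRatings": """
        CREATE TABLE IF NOT EXISTS dimRatings (
            userId    INTEGER,
            movieId   INTEGER,
            rating    DOUBLE PRECISION,
            rate_time TIMESTAMP,
            year      INTEGER
        );""",
    "dimAwards3": """
        CREATE TABLE IF NOT EXISTS dimAwards3 (
            nominations INTEGER,
            film        VARCHAR(256),
            year        INTEGER,
            awards      DOUBLE PRECISION
        );""",
    "dimGenres": """
        CREATE TABLE IF NOT EXISTS dimGenres (
            movieId INTEGER,
            genre   VARCHAR(64),
            genreId BIGINT
        );""",
    "dimMovies0": """
        CREATE TABLE IF NOT EXISTS dimMovies0 (
            movieId INTEGER,
            title   VARCHAR(256),
            genres  VARCHAR(256)
        );""",
    "dimMovies": """
        CREATE TABLE IF NOT EXISTS dimMovies (
            movieId INTEGER,
            title   VARCHAR(256),
            year    VARCHAR(4)
        );""",
    "dimDate": """
        CREATE TABLE IF NOT EXISTS dimDate (
            date_key TIMESTAMP,
            year     INTEGER,
            month    INTEGER,
            day      INTEGER,
            week     INTEGER
        );""",
}
```

Each statement could be run once against the cluster before the COPY and INSERT steps. `genreId` is typed as `BIGINT` here because `monotonically_increasing_id` produces 64-bit values.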

#### Method 1: Mapping Out Data Pipelines using Spark
- Movies and genres can be created using the movies csv from S3.
- Ratings can be created using the ratings csv from S3.
- Awards can be created by combining the data in awards.txt and award_corrected.txt.

#### Method 2: Mapping Out Data Pipelines in Redshift
- The awards, ratings, and genres tables can be loaded directly from the parquet files in S3.  
- The movies table can be derived from the staged movies data, and the time table from the ratings table.

# Step 4: Run Pipelines to Model the Data 
### 4.1 Create the data model using Spark
Build the data pipelines to create the data model.

In [None]:
dfmovies.printSchema()
dfmovies.show(5, truncate = False)
dfmovies.count()

In [None]:
# convert timestamp
dfratings = dfratings.withColumn(
    "rate_time",
    F.to_timestamp(F.from_unixtime((col("ts")) , 'yyyy-MM-dd HH:mm:ss.SSS')).cast("Timestamp")
).drop("ts")

In [None]:
dfratings = dfratings.withColumn("year", F.year("rate_time"))
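
To make the conversion concrete, here is the same idea in plain Python: the `ts` column holds Unix epoch seconds stored as a string. The sketch uses UTC explicitly, which is an assumption; Spark renders timestamps in the session time zone, so displayed values may differ by the local offset.

```python
from datetime import datetime, timezone


def epoch_to_utc(ts):
    """Convert Unix epoch seconds (int or numeric string) to a UTC datetime."""
    return datetime.fromtimestamp(int(ts), tz=timezone.utc)


# Illustrative epoch value (an assumption, not read from the files here).
rate_time = epoch_to_utc("964982703")
print(rate_time.isoformat())  # 2000-07-30T18:45:03+00:00
print(rate_time.year)         # 2000
```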

In [None]:
dfratings.printSchema()
dfratings.show(5)
dfratings.count()

In [None]:
# convert timestamp
dftags = dftags.withColumn("tag_time", F.to_timestamp(col("ts") / 1)).drop("ts")
dftags = dftags.withColumn("year", F.year("tag_time"))

In [None]:
dftags.printSchema()
dftags.show(5)
dftags.count()

In [None]:
dfawards.columns

In [None]:
# clean awards txt file: the header names carry trailing spaces, so rename and cast each column
dfawards = dfawards.withColumn("film", dfawards['Film   '].cast(Str())).drop('Film   ')
dfawards = dfawards.withColumn("year", dfawards['Year   '].cast(Int())).drop("Year   ")
dfawards = dfawards.withColumn("awards", dfawards['Awards    '].cast(Dbl())).drop("Awards    ")
dfawards = dfawards.withColumn("nominations", dfawards['Nominations'].cast(Int()))

In [None]:
dfawards.columns

In [None]:
dfawards2.columns

In [None]:
dfawards2 = dfawards2.withColumn("film", dfawards2['Film   '].cast(Str())).drop('Film   ')
dfawards2 = dfawards2.withColumn("year", dfawards2['Year   '].cast(Int())).drop("Year   ")
#dfawards2 = dfawards2.withColumn("date", F.to_timestamp(col('Year   '))).drop('Year   ')
#dfawards2 = dfawards2.withColumn("year", F.year("date")).drop("date")
dfawards2 = dfawards2.withColumn("awards", dfawards2['Awards    '].cast(Dbl())).drop("Awards    ")
dfawards2 = dfawards2.withColumn("nominations", dfawards2['Nominations'].cast(Int()))
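
The renaming above is needed because the header names in the txt files are padded with spaces. A minimal sketch of that normalization in plain Python, with `clean_names` as a hypothetical helper:

```python
def clean_names(names):
    """Strip padding and lowercase each header name."""
    return [name.strip().lower() for name in names]


# Header names as they appear in the notebook's column references.
raw_header = ["Film   ", "Year   ", "Awards    ", "Nominations"]
print(clean_names(raw_header))  # ['film', 'year', 'awards', 'nominations']
```

With a Spark DataFrame, the cleaned names could likely be applied in one step with `df.toDF(*clean_names(df.columns))`; the casts to the target types would still be needed afterwards.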

In [None]:
dfawards.printSchema()
dfawards.show(5, truncate = False)
dfawards.count()

In [None]:
dfawards2.printSchema()
dfawards2.show(5, truncate = False)
dfawards2.count()

In [None]:
# split the mixed genres by '|'
# the pattern is a regex, so the pipe is escaped; a raw string avoids an invalid-escape warning
dfmovies2 = dfmovies.withColumn('genre', explode(split(dfmovies.genres, r'\|')))

In [None]:
dfmovies2.show(11)

In [None]:
# create genre information for each movie
dfgenre = dfmovies2.select("movieId", "genre").dropDuplicates().dropna(subset=["movieId", "genre"]).withColumn("genreId", F.monotonically_increasing_id())
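
The split-and-explode step can be pictured with a small plain-Python equivalent. It is only a sketch: `explode_genres` is a hypothetical helper, the sample row is for illustration, and the ids here are consecutive, whereas `monotonically_increasing_id` only guarantees unique, increasing values.

```python
from itertools import count


def explode_genres(movies):
    """Turn (movieId, 'A|B|C') pairs into one row per (movieId, genre)."""
    ids = count()
    seen = set()
    rows = []
    for movie_id, genres in movies:
        for genre in genres.split("|"):
            key = (movie_id, genre)
            if key in seen:  # mirrors dropDuplicates()
                continue
            seen.add(key)
            rows.append({"genreId": next(ids), "movieId": movie_id, "genre": genre})
    return rows


sample = [(1, "Adventure|Animation|Children|Comedy|Fantasy")]
for row in explode_genres(sample):
    print(row)
```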

In [None]:
#dfgenre.filter(dfgenre.title.contains('Toy Story (1995)')).show()
dfgenre.filter(dfgenre.movieId == 1).show()

In [None]:
dfgenre.columns
dfgenre.printSchema()

#### Load Data to S3 in parquet format

In [None]:
dfawards.write.parquet("s3a://sparkifydend/movies/awards/", mode="overwrite")

In [None]:
dfawards2.write.parquet("s3a://sparkifydend/movies/awards2/", mode="overwrite")

In [None]:
dfmovies.write.parquet("s3a://sparkifydend/movies/movies/", mode="overwrite")

In [None]:
dfratings.write.parquet("s3a://sparkifydend/movies/ratings/", mode="overwrite")

In [None]:
dftags.write.parquet("s3a://sparkifydend/movies/tags/", mode="overwrite")

In [None]:
dfgenre.write.parquet("s3a://sparkifydend/movies/genres/", mode="overwrite")

In [None]:
dfawards = spark.read.parquet("s3a://sparkifydend/movies/awards/*")
dfawards2 = spark.read.parquet("s3a://sparkifydend/movies/awards2/*")
dfmovies = spark.read.parquet("s3a://sparkifydend/movies/movies/*")
dfratings = spark.read.parquet("s3a://sparkifydend/movies/ratings/*")
dftags = spark.read.parquet("s3a://sparkifydend/movies/tags/*")
dfgenre = spark.read.parquet("s3a://sparkifydend/movies/genres/*")

### 4.2 Data Quality Checks Part 1: Identify missing values, duplicate data, etc

In [None]:
# check for null values
dfmovies.select([count(when(col(c).isNull(), c)).alias(c) for c in dfmovies.columns]).show()
dfratings.select([count(when(col(c).isNull(), c)).alias(c) for c in dfratings.columns]).show()
dfawards.select([count(when(col(c).isNull(), c)).alias(c) for c in dfawards.columns]).show()
dfawards2.select([count(when(col(c).isNull(), c)).alias(c) for c in dfawards2.columns]).show()

In [None]:
# show records with year < 1920
dfawards.filter(dfawards.year < 1920).show(5, truncate = False)

In [None]:
# check records in dfawards2
dfawards2.filter(trim(dfawards2.film) == "Joker").show()
dfawards2.filter(trim(dfawards2.film) == "Once Upon a Time in Hollywood").show()
dfawards2.filter(trim(dfawards2.film) == "1917").show()
dfawards2.filter(trim(dfawards2.film) == "Roma").show()
dfawards2.filter(trim(dfawards2.film) == "The Favourite").show()

In [None]:
# keep only records with a plausible year (this also drops rows whose year is null)
dfawards = dfawards.filter(dfawards.year >= 1920)

In [None]:
dfawards.select([count(when(col(c).isNull(), c)).alias(c) for c in dfawards.columns]).show()
dfawards.show(5, truncate = False)
dfawards.count()

In [None]:
# union dfawards and dfawards2, and remove duplicates
# dfawards2 has corrections for year
dfawards3 = dfawards.union(dfawards2).distinct().filter(~col("year").isin([0]) & col("year").isNotNull()).sort(desc('year'))
dfawards3.show(5, truncate = False)
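
The union step can be sketched in plain Python as follows. Rows are tuples of (film, year, awards, nominations), `merge_awards` is a hypothetical helper, and the sample rows are made up. One point worth keeping in mind is that Spark's `union` matches columns by position, so both DataFrames need the same column order.

```python
def merge_awards(original, corrected):
    """Union two row lists, drop duplicates and rows with a missing or zero year."""
    merged = set(original) | set(corrected)
    valid = [row for row in merged if row[1] is not None and row[1] != 0]
    # Sort by year descending, then by film name for a stable order.
    return sorted(valid, key=lambda row: (-row[1], row[0]))


original = [("Film A", 2001, 2.0, 5), ("Film B", 0, 1.0, 1), ("Film C", None, 1.0, 2)]
corrected = [("Film A", 2001, 2.0, 5), ("Film B", 1999, 1.0, 1)]
print(merge_awards(original, corrected))
```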

In [None]:
# show records with year not in the right range
dfawards3.where(dfawards3.year < 1920).show(5, truncate = False)

In [None]:
# load to S3
dfawards3.write.parquet("s3a://sparkifydend/movies/awards3/", mode="overwrite")

### 4.2 Data Quality Checks Part 2: source/count checks to ensure completeness

In [None]:
def quality_check(df, tablename):
    '''
    Input: Spark dataframe, table name
    Output: Print outcome of data quality check;
            return True if the table has at least one record, else False
    '''
    
    result = df.count()
    if result == 0:
        print("Data quality check failed for {} with zero records".format(tablename))
    else:
        print("Data quality check passed for {} with {} records".format(tablename, result))
    return result > 0

In [None]:
# Run the row-count quality check on each table
quality_check(dfmovies, "movies table")
quality_check(dfratings, "ratings table")
quality_check(dfawards3, "awards table")
quality_check(dfgenre, "genre table")

In [None]:
dfmovies.count()

In [None]:
dfmovies[['movieId']].drop_duplicates().count()

In [None]:
dfratings.count()

In [None]:
# dfratings is on movieid and userid level
dfratings[['movieId', 'userId']].drop_duplicates().count()

In [None]:
dfawards3.count()

In [None]:
# dfawards3 is on title and year level
dfawards3[['film', 'year']].drop_duplicates().count()

In [None]:
# check out movies with same name
df1 = dfawards3.groupBy("film").count().filter("count > 1")
df1.show(truncate = False)

In [None]:
dfawards3.filter(trim(dfawards3.film) == "A Star Is Born").show()
dfawards3.filter(trim(dfawards3.film) == "Titanic").show()
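
The key checks above compare the total row count with the count of distinct keys. A small plain-Python sketch of the same idea, with `duplicate_keys` as a hypothetical helper and made-up sample rows, reports which keys repeat instead of only the counts:

```python
from collections import Counter


def duplicate_keys(rows, key_fields):
    """Return the key values that occur more than once, with their counts."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


awards = [
    {"film": "Film A", "year": 1950, "awards": 1.0},
    {"film": "Film A", "year": 2010, "awards": 3.0},
    {"film": "Film B", "year": 1999, "awards": 2.0},
]
print(duplicate_keys(awards, ["film"]))          # {('Film A',): 2}
print(duplicate_keys(awards, ["film", "year"]))  # {}
```

An empty result for the composite key (film, year) is what the data model expects, even when the same film name appears in more than one year.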

### 4.3 Data Wrangling with Spark and OLAP

In [None]:
# use the exploded dataframe dfmovies2 to collect, for each genre, the ids of the movies in that genre
genre_movies = dfmovies2 \
                    .groupBy(dfmovies2.genre) \
                    .agg(concat_ws(',', collect_list(dfmovies2.movieId)) \
                    .alias('MovieIds')) \
                    .orderBy('genre')

In [None]:
genre_movies.show()

In [None]:
# use case
# number of movies in the dataset
distinct_movie = dfmovies.select("movieId").distinct().count()
print('{} movies in the movies dataset'.format(distinct_movie))

In [None]:
# number of users in the dataset
distinct_user = dfratings.select("userId").distinct().count()
print('{} users rated the movies'.format(distinct_user))

In [None]:
# number of movies receiving awards
distinct_award = dfawards3.select("film", "year").distinct().count()
print('{} movies received awards'.format(distinct_award))

In [None]:
# show movies receiving more than 10 awards
dfawards3.where(dfawards3.awards > 10).show(truncate = False)

In [None]:
# total awards that movie received
awards_cnt = dfawards3.groupBy("film", "year").agg(F.sum("awards").alias('cnt')).orderBy(desc('cnt'))

In [None]:
awards_cnt.show(truncate = False)

In [None]:
# Minimum number of ratings per user
# Minimum number of ratings per movie 
tmp1 = dfratings.groupBy("userId").count().toPandas()['count'].min()
tmp2 = dfratings.groupBy("movieId").count().toPandas()['count'].min()
print('For the users that rated movies and the movies that were rated:')
print('Minimum number of ratings per user is {}'.format(tmp1))
print('Minimum number of ratings per movie is {}'.format(tmp2))
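
The two minimums are simply the smallest group sizes after grouping by user and by movie. A plain-Python sketch with made-up (userId, movieId) pairs, where `min_ratings` is a hypothetical helper:

```python
from collections import Counter


def min_ratings(ratings):
    """Return (min ratings per user, min ratings per movie)."""
    per_user = Counter(user for user, _ in ratings)
    per_movie = Counter(movie for _, movie in ratings)
    return min(per_user.values()), min(per_movie.values())


ratings = [(1, 10), (1, 20), (1, 30), (2, 10), (2, 20)]
print(min_ratings(ratings))  # (2, 1)
```

In Spark, the same numbers could probably be obtained without collecting to pandas, for example with `dfratings.groupBy("userId").count().agg(F.min("count"))`.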

In [None]:
# count number of movies in each genre
# The top three genres are drama, comedy, and thriller
df2=dfmovies2.groupBy("genre").count().filter(trim(dfmovies2.genre) != '(no genres listed)').sort(desc('count'))
df2.show(truncate = False)

In [None]:
dfratings.createOrReplaceTempView("ratings")     #userId, movieId, rating, rate_time, year
dfmovies.createOrReplaceTempView("movies")       #movieId, title, genres
dftags.createOrReplaceTempView("tags")           #userId, movieId, tag, tag_time, year
dfawards3.createOrReplaceTempView("awards")      #nominations, film, year, awards
dfgenre.createOrReplaceTempView("genres")        #movieId, genre, genreId

In [None]:
# Split title and release year in separate columns     
movies = spark.sql("select movieId, substr(title, 0, length(title)-7) as title, substr(title, -5, 4) as year from movies")
movies.show()
movies.createOrReplaceTempView("movies") 

In [None]:
# year of movies in the dataset
spark.sql("""select 
             min(year) as min_year,
             max(year) as max_year
             from movies 
             where year > 0
""").show()

In [None]:
# year of rating in the dataset
spark.sql("""select 
             min(year) as min_year,
             max(year) as max_year
             from ratings
""").show()

In [None]:
# year of awards in the dataset
spark.sql("""select 
             min(year) as min_year,
             max(year) as max_year
             from awards
""").show()

In [None]:
# number of movies not rated
spark.sql("""select 
          count(distinct movies.movieId)
          from movies 
          where movies.movieId not in
          (select distinct ratings.movieId from ratings)
          """).show()

In [None]:
# number of movies rated and receiving awards
# 474 movies receiving awards and shown in ratings dataset
spark.sql("""select count(distinct movieId) as in_ratings from 
          (select distinct a.film, a.year, m.movieId as movieId
          from awards as a inner join movies as m on trim(a.film) == trim(m.title) and a.year = m.year
          where a.year > 0 and m.year > 0) t
          where movieId in 
          (select distinct ratings.movieId from ratings)
          """).show()

In [None]:
# the top 5 movies with high ratings
avg_rating = spark.sql("""select distinct
    m.title as title,
    m.year as year,
    sum(case when r.rating >= 0 then 1 else 0 end) as num_rating,
    avg(r.rating) as avg_rating
    from movies as m inner join ratings as r on m.movieId = r.movieId
    group by m.title, m.year
    order by avg_rating desc
""")
avg_rating.show(5)
avg_rating.createOrReplaceTempView("avg_rating") 

In [None]:
# total awards for each movie
tot_awards = spark.sql("""select distinct
                    film,
                    year,
                    sum(awards) as tot_awards
                    from awards
                    group by film, year
                    order by tot_awards desc
""")
tot_awards.show(5)
tot_awards.createOrReplaceTempView("tot_awards") 

In [None]:
# the average rating scores of movies with awards
movie_awards_rating = spark.sql("""select distinct
             a.film,
             a.year,
             a.tot_awards,
             r.avg_rating
             from tot_awards as a inner join avg_rating as r on trim(a.film) == trim(r.title) and a.year == r.year
             where a.year > 0 and r.year > 0
             order by tot_awards desc, avg_rating desc
""")
movie_awards_rating.show(truncate = False)
movie_awards_rating.createOrReplaceTempView("movie_awards_rating") 
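
The join above matches awards to ratings on the trimmed title and the year. A plain-Python sketch of that logic follows; `join_awards_ratings` is a hypothetical helper, the sample rows are made up, and years are integers on both sides here, whereas in the notebook the movie year comes from a substring and is cast implicitly by Spark.

```python
def join_awards_ratings(tot_awards, avg_rating):
    """Inner-join (film, year, awards) with (title, year, rating) on title and year."""
    rating_by_key = {
        (title.strip(), year): score for title, year, score in avg_rating
    }
    joined = []
    for film, year, awards in tot_awards:
        key = (film.strip(), year)
        if key in rating_by_key:
            joined.append((film.strip(), year, awards, rating_by_key[key]))
    # Most awards first, then highest average rating.
    return sorted(joined, key=lambda row: (-row[2], -row[3]))


tot_awards = [("Film A ", 2001, 4.0), ("Film B", 1999, 2.0), ("Film C", 2010, 1.0)]
avg_rating = [("Film A", 2001, 3.5), ("Film B", 1998, 4.0), (" Film C", 2010, 4.5)]
print(join_awards_ratings(tot_awards, avg_rating))
```

"Film B" drops out because its years differ between the two inputs, which shows why the corrected award years matter for this join.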

In [None]:
spark.sql("select count(*) from tot_awards").show()
spark.sql("select count(*) from avg_rating").show()
spark.sql("select count(*) from movie_awards_rating").show()

### 4.4 Create the data model using Redshift
Build the data pipelines to create the data model.

#### Extract parquet data from S3 and transform into fact and dimension tables

In [None]:
s3 = boto3.resource('s3',
                       region_name="us-west-2",
                       aws_access_key_id=KEY,
                       aws_secret_access_key=SECRET
                     )

s3bucket =  s3.Bucket("udacity-input") # private

s3_data = iter(s3bucket.objects.filter(Prefix="ml-latest-small/"))
for _ in range(5): print(next(s3_data))


In [None]:
%load_ext sql

In [None]:
conn_string="postgresql://{}:{}@{}:{}/{}".format(DWH_DB_USER, DWH_DB_PASSWORD, DWH_ENDPOINT, DWH_PORT,DWH_DB)
# conn_string contains the database password, so it is not printed here
%sql $conn_string

#### copy data from s3 to redshift

In [None]:
%%time

qry = """
    copy dimRatings from 's3://sparkifydend/movies/ratings/' 
    credentials 'aws_iam_role={}'
    FORMAT AS PARQUET;
""".format(DWH_ROLE_ARN)

%sql $qry
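
Since every COPY in this section has the same shape, the statements could be generated from one template. This is a sketch only: the table names and S3 paths are the ones used in this notebook, `build_copy_statements` is a hypothetical helper, and the role ARN in the usage line is a placeholder.

```python
COPY_TEMPLATE = """
    copy {table} from '{s3_path}'
    credentials 'aws_iam_role={role_arn}'
    FORMAT AS PARQUET;
"""

# Target table -> parquet location, as used in this notebook.
PARQUET_SOURCES = {
    "dimRatings": "s3://sparkifydend/movies/ratings/",
    "dimAwards3": "s3://sparkifydend/movies/awards3/",
    "dimGenres": "s3://sparkifydend/movies/genres/",
    "dimMovies0": "s3://sparkifydend/movies/movies/",
}


def build_copy_statements(role_arn):
    """Return one COPY statement per target table."""
    return {
        table: COPY_TEMPLATE.format(table=table, s3_path=path, role_arn=role_arn)
        for table, path in PARQUET_SOURCES.items()
    }


# Placeholder ARN for illustration only.
statements = build_copy_statements("arn:aws:iam::000000000000:role/exampleRole")
print(statements["dimRatings"])
```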

In [None]:
%sql select * from dimRatings limit 5;

In [None]:
%sql select count(*) from dimRatings;

#### check the stl_load_errors table

In [None]:
%%sql
select query, substring(filename,22,25) as filename,line_number as line, 
substring(colname,0,12) as column, type, position as pos, substring(raw_line,0,30) as line_text,
substring(raw_field_value,0,15) as field_text, 
substring(err_reason,0,45) as reason
from stl_load_errors 
order by query desc
limit 10;

In [None]:
%%time

qry = """
    copy dimAwards3 from 's3://sparkifydend/movies/awards3/' 
    credentials 'aws_iam_role={}' 
    FORMAT AS PARQUET;
""".format(DWH_ROLE_ARN)

%sql $qry

In [None]:
%sql select * from dimAwards3 limit 5;

In [None]:
%sql select count(*) from dimAwards3;

In [None]:
%%time

qry = """
    copy dimGenres from 's3://sparkifydend/movies/genres/' 
    credentials 'aws_iam_role={}' 
    FORMAT AS PARQUET;
""".format(DWH_ROLE_ARN)
 
%sql $qry

In [None]:
%sql select * from dimGenres limit 5;

In [None]:
%sql select count(*) from dimGenres;

In [None]:
%%sql
INSERT INTO dimDate (date_key, year, month, day, week)
SELECT DISTINCT(rate_time)                                       AS date_key,
       EXTRACT(year FROM rate_time)                              AS year,
       EXTRACT(month FROM rate_time)                             AS month,
       EXTRACT(day FROM rate_time)                               AS day,
       EXTRACT(week FROM rate_time)                              AS week
FROM dimRatings;
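
For reference, the same breakdown of a timestamp into date parts looks like this in plain Python. The sketch uses the ISO week number, which is an assumption about how weeks should be counted, and `date_dimension_row` is a hypothetical helper.

```python
from datetime import datetime


def date_dimension_row(ts):
    """Break a timestamp into the columns of the time table."""
    return {
        "date_key": ts,
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "week": ts.isocalendar()[1],  # ISO week number
    }


print(date_dimension_row(datetime(2000, 7, 30, 18, 45, 3)))
```

Because `date_key` is a full timestamp rather than a calendar date, the time table gets one row per distinct rating time, not one row per day.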

In [None]:
%sql select * from dimDate limit 5;

In [None]:
%sql select count(*) from dimDate;

In [None]:
%%time

qry = """
    copy dimMovies0 from 's3://sparkifydend/movies/movies/' 
    credentials 'aws_iam_role={}' 
    FORMAT AS PARQUET;
""".format(DWH_ROLE_ARN)
 
%sql $qry

In [None]:
%%sql
INSERT INTO dimMovies (movieId, title, year)
SELECT movieId                                                      AS movieId,
       substring(title, 0, length(title)-6)                         AS title, 
       substring(title, length(title)-4, 4)                         AS year
FROM dimMovies0

In [None]:
%sql select * from dimMovies limit 5;

In [None]:
%sql select count(*) from dimMovies;

# Summary
* In this project, I implemented two ways of reading data from S3: with Spark and with Redshift. After loading the data from S3 with Spark, I ran data quality checks and cleaned the data using Spark DataFrames and Spark SQL, then wrote the resulting tables back to S3 in parquet format.
* Amazon S3 serves as the data lake: it stores the raw csv files and the parquet staging data before the data is loaded into the Amazon Redshift data warehouse. 
* Parquet was chosen as the staging format in S3 because it is a columnar format: a query reads only the columns it needs, which makes data retrieval and processing more efficient.
* Apache Spark, as a distributed data processing framework, lets us efficiently load and transform large datasets from the raw data source, write them to the S3 data lake, and prepare them for loading into the Redshift data warehouse.
* I also created fact and dimension tables after loading the parquet data from S3 into Redshift. When loading parquet data, the column data types in the parquet files must match those of the target tables.
* The data should be refreshed whenever the MovieLens datasets are updated.
* How I would approach the problem differently under the following scenarios:
 * The data was increased by 100x. 
   - Write the data to S3 in partitions, and distribute it across the Redshift nodes using distkey and sortkey. Writing to S3 in partitions can speed up the pipeline considerably. Redshift is a cloud data warehouse that is optimized for aggregation and read-heavy workloads.
 * The data populates a dashboard that must be updated on a daily basis by 7am every day.
   - Use Airflow to manage the pipeline. Airflow lets us schedule workflows programmatically and monitor them through its built-in user interface.
 * The database needed to be accessed by 100+ people.
   - Amazon Redshift, which hosts this data model, allows up to 500 concurrent users to access the database.
   - Users can connect to the data model with Amazon QuickSight to create dashboards and analyze the dataset.
   - We can also manage user access and permissions with AWS IAM, so that we can control which users can access which dashboards and the underlying dataset.
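
To illustrate the partitioning idea from the 100x scenario, the sketch below groups rows under Hive-style prefixes such as `year=2000/`, which is the directory layout a partitioned parquet write typically produces. It is plain Python for illustration only; `partition_paths` is a hypothetical helper and the sample rows are made up.

```python
from collections import defaultdict


def partition_paths(rows, base_path, column):
    """Group rows by the partition column under Hive-style path prefixes."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[f"{base_path}{column}={row[column]}/"].append(row)
    return dict(partitions)


ratings = [
    {"userId": 1, "movieId": 1, "rating": 4.0, "year": 2000},
    {"userId": 1, "movieId": 3, "rating": 4.0, "year": 2000},
    {"userId": 2, "movieId": 1, "rating": 3.5, "year": 2018},
]
for path, rows in partition_paths(ratings, "s3a://sparkifydend/movies/ratings/", "year").items():
    print(path, len(rows))
```

In Spark this would likely correspond to `dfratings.write.partitionBy("year").parquet(...)`, so that a job needing one year only reads that year's files.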