- Complete assignment prompts: Write your SQL in the query file corresponding to the prompt number in the `submission` folder. Do not change the file names! ⚠️ Please note that the `_app` folder, `Dockerfile`, and `.github` folder are essential components that facilitate automatic tests and feedback generation. To ensure smooth operation of these processes, please refrain from making any changes to these files/folders. Your focus should remain solely on editing files within the `submission` folder.
- Lint your SQL code for readability. This ensures your code is easy to follow and understand.
- Add comments to your queries. Use the `--` syntax to explain each step and help the reviewer understand your thought process.
- Submit your work:
  - Before proceeding, ensure that you have:
    - Completed all the questions/prompts in the assignment instructions (see below).
    - Reviewed your code for errors.
    - Thoroughly commented your code to enhance understanding for reviewers.
  - To submit your assignment, open a Pull Request (PR) to merge changes from your `homework` branch in your forked repository into the `main` branch in the upstream repository, i.e., `DataExpert-ZachWilson-V4:main`. It's crucial to open the PR in the upstream repository for our automated tests to run correctly. Be sure to double-check that you are submitting your work to the correct branch and repository, as opening a PR in the wrong location could lead to complications in the review process.
  - Once you open the PR, a link to it will be automatically shared with our TA team for review.
- Due to our use of GitHub Classroom automation, we cannot accommodate individual requests for changes, so please read through all instructions carefully.
- You can work on multiple branches if it happens that way, but your final submission must be on one branch.
- You can revise and push changes to the PR any time before the deadline.
- Some assignments may include tests or feedback generated by the GitHub Action, which you may use to revise your solutions before the deadline.
- Only open one PR in the `main` branch of the upstream repo for your submission. The first PR you open will be considered your final submission.
- Enhance your review by helping your reviewer. Add comments under 'Files changed' to summarize or highlight key parts of your code. If you've already added comments throughout your code as recommended, you can simply reiterate or summarize those comments in this section.
- Avoid further changes after the deadline. They won't be reviewed, and committing changes after the deadline can cause confusion and delay the review process.
- Grades are pass or fail, used solely for certification.
- Changes made after the deadline won't be considered.
- An approved PR means a Pass grade. If changes are requested, the grade will be marked as Fail.
- The reviewer may provide comments or suggestions when they request changes; addressing them is optional, and any follow-up changes won't be re-reviewed.
- Only one (human) review will be provided for grading after the deadline.
1. Write a query to de-duplicate the `nba_game_details` table from the day 1 lab of the week 2 fact data modeling material so there are no duplicate values. You should de-dupe based on the combination of `game_id`, `team_id`, and `player_id`, since a player cannot have more than 1 entry per game. Feel free to take the first value here.
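
   One common way to sketch this is with `ROW_NUMBER()`; this is a minimal illustration rather than the only valid answer, and it simply keeps whichever row the window function ranks first:

   ```sql
   -- Rank rows within each (game_id, team_id, player_id) group and keep the first
   WITH ranked AS (
     SELECT
       *,
       ROW_NUMBER() OVER (
         PARTITION BY game_id, team_id, player_id
       ) AS row_num
     FROM nba_game_details
   )
   SELECT *
   FROM ranked
   WHERE row_num = 1
   -- In practice, list the columns explicitly in the final SELECT
   -- so the helper row_num column is not carried into the result.
   ```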
2. Similar to what was done in day 2 of the fact data modeling week, write a DDL statement to create a cumulating user activity table by device. This table will be the result of joining the `devices` table onto the `web_events` table, so that you can get both the `user_id` and the `browser_type`. The name of this table should be `user_devices_cumulated`. The schema of this table should look like:

   ```
   user_id bigint
   browser_type varchar
   dates_active array(date)
   date date
   ```

   The `dates_active` array should be a datelist implementation that tracks how many times a user has been active with a given `browser_type`. The first index of the date list array should correspond to the most recent date (today's date).

   Note that you can also do this using a `MAP(VARCHAR, ARRAY(DATE))` type, but then you have to know how to manipulate the contents of those maps correctly (and then you don't include a `browser_type` column). If you use the `MAP` type, you'd have one row per `user_id`; the keys of the `MAP` would be the values for `browser_type`, and the values would be the arrays of dates on which we saw activity for that user on that browser type. Not only that, but you'll need to take care of doing the `CROSS JOIN UNNEST` correctly: when we did it in lab, we didn't do it against a `MAP` type but an `ARRAY` type, so it exploded into rows in the way you'd expect. Doing this by just including a `browser_type` column means it works almost exactly the same as what we did in lab; you just add an additional group-by key.
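
   A possible DDL sketch in Trino syntax; the `WITH (...)` table properties are assumptions (Iceberg-style partitioning) and may differ in your environment:

   ```sql
   -- One row per user, browser type, and snapshot date
   CREATE TABLE user_devices_cumulated (
     user_id BIGINT,
     browser_type VARCHAR,
     dates_active ARRAY(DATE),
     date DATE
   )
   WITH (
     format = 'PARQUET',
     partitioning = ARRAY['date']
   )
   ```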
3. Write the incremental query to populate the table you wrote the DDL for in the above question from the `web_events` and `devices` tables. This should look like the query used to generate the cumulation table in the fact data modeling day 2 lab.
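
   A sketch of that cumulation pattern, assuming hypothetical column names (`device_id` and `event_time` on `web_events`, `device_id` and `browser_type` on `devices`) and illustrative dates:

   ```sql
   INSERT INTO user_devices_cumulated
   WITH yesterday AS (
     SELECT * FROM user_devices_cumulated
     WHERE date = DATE '2023-01-01'                    -- previous snapshot
   ),
   today AS (
     SELECT
       we.user_id,
       d.browser_type,
       DATE(we.event_time) AS event_date
     FROM web_events we
     JOIN devices d ON we.device_id = d.device_id
     WHERE DATE(we.event_time) = DATE '2023-01-02'     -- the new day
     GROUP BY we.user_id, d.browser_type, DATE(we.event_time)
   )
   SELECT
     COALESCE(y.user_id, t.user_id) AS user_id,
     COALESCE(y.browser_type, t.browser_type) AS browser_type,
     CASE
       WHEN t.event_date IS NULL THEN y.dates_active        -- no activity today
       WHEN y.dates_active IS NULL THEN ARRAY[t.event_date] -- first appearance
       ELSE ARRAY[t.event_date] || y.dates_active           -- newest date first
     END AS dates_active,
     DATE '2023-01-02' AS date
   FROM yesterday y
   FULL OUTER JOIN today t
     ON y.user_id = t.user_id
    AND y.browser_type = t.browser_type
   ```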
4. Building on top of the previous question, convert the date list implementation into the base-2 integer datelist representation shown in the fact data modeling day 2 lab. Assume that you have access to a table called `user_devices_cumulated` with the output of the above query. To check your work, you can either load the data from your previous query (or the lab) into a `user_devices_cumulated` table, or generate the `user_devices_cumulated` table as a CTE in this query. You can write this query in a single step, but note the three main transformations needed for this to work (sketched below):
   - unnest the dates, and convert them into powers of 2
   - sum those powers of 2 in a group by on `user_id` and `browser_type`
   - convert the sum to base 2
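
   A sketch of those three steps in Trino; the 31-day window ending at the snapshot date and the dates themselves are assumptions:

   ```sql
   WITH unnested AS (
     SELECT
       user_id,
       browser_type,
       DATE_DIFF('day', d, DATE '2023-01-31') AS days_ago
     FROM user_devices_cumulated
     CROSS JOIN UNNEST(dates_active) AS t(d)
     WHERE date = DATE '2023-01-31'
   )
   SELECT
     user_id,
     browser_type,
     -- step 1 happened above (unnest); step 2: sum powers of 2;
     -- step 3: render the resulting integer in base 2
     TO_BASE(
       CAST(SUM(POW(2, 30 - days_ago)) AS BIGINT),
       2
     ) AS datelist_base2
   FROM unnested
   WHERE days_ago BETWEEN 0 AND 30
   GROUP BY user_id, browser_type
   ```

   Here the most recent date (`days_ago = 0`) lands on the highest bit, matching the convention that the first index of the datelist is the most recent date.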
5. Write a DDL statement to create a `hosts_cumulated` table, as shown in the fact data modeling day 2 lab, except in the homework you'll be doing it by `host`, not `user_id`. The schema for this table should include:

   ```
   host varchar
   host_activity_datelist array(date)
   date date
   ```
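
   A possible sketch, with the same caveat that the table properties are assumptions:

   ```sql
   -- One row per host per snapshot date
   CREATE TABLE hosts_cumulated (
     host VARCHAR,
     host_activity_datelist ARRAY(DATE),
     date DATE
   )
   WITH (
     format = 'PARQUET',
     partitioning = ARRAY['date']
   )
   ```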
6. As shown in the fact data modeling day 2 lab, write a query to incrementally populate the `hosts_cumulated` table from the `web_events` table.
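
   A sketch of the same cumulation pattern keyed on `host`, assuming `web_events` carries a `host` column and an `event_time` timestamp (dates are illustrative):

   ```sql
   INSERT INTO hosts_cumulated
   WITH yesterday AS (
     SELECT * FROM hosts_cumulated
     WHERE date = DATE '2023-01-01'
   ),
   today AS (
     SELECT host, DATE(event_time) AS event_date
     FROM web_events
     WHERE DATE(event_time) = DATE '2023-01-02'
     GROUP BY host, DATE(event_time)
   )
   SELECT
     COALESCE(y.host, t.host) AS host,
     CASE
       WHEN t.event_date IS NULL THEN y.host_activity_datelist
       WHEN y.host_activity_datelist IS NULL THEN ARRAY[t.event_date]
       ELSE ARRAY[t.event_date] || y.host_activity_datelist
     END AS host_activity_datelist,
     DATE '2023-01-02' AS date
   FROM yesterday y
   FULL OUTER JOIN today t ON y.host = t.host
   ```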
7. As shown in the fact data modeling day 3 lab, write a DDL statement to create a monthly `host_activity_reduced` table, containing the following fields:

   ```
   host varchar
   metric_name varchar
   metric_array array(integer)
   month_start varchar
   ```
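
   A possible sketch (table properties again assumed):

   ```sql
   -- One row per host and metric per month
   CREATE TABLE host_activity_reduced (
     host VARCHAR,
     metric_name VARCHAR,
     metric_array ARRAY(INTEGER),
     month_start VARCHAR
   )
   WITH (
     format = 'PARQUET',
     partitioning = ARRAY['month_start']
   )
   ```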
8. As shown in the fact data modeling day 3 lab, write a query to incrementally populate the `host_activity_reduced` table from a `daily_web_metrics` table. Assume `daily_web_metrics` exists in your query. Don't worry about handling the overwrites or deletes for overlapping data. Remember to leverage a full outer join, and to properly handle imputing empty values in the array for cases where a host gets its first visit in the middle of the array's time window (the earlier positions must be filled before the new value is appended).
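
   A sketch of the pattern, assuming `daily_web_metrics` has hypothetical `host`, `metric_name`, `metric_value`, and `date` columns, with illustrative dates:

   ```sql
   INSERT INTO host_activity_reduced
   WITH daily AS (
     SELECT host, metric_name, metric_value, date
     FROM daily_web_metrics
     WHERE date = DATE '2023-01-03'                -- the new day being folded in
   ),
   monthly AS (
     SELECT host, metric_name, metric_array, month_start
     FROM host_activity_reduced
     WHERE month_start = '2023-01-01'
   )
   SELECT
     COALESCE(m.host, d.host) AS host,
     COALESCE(m.metric_name, d.metric_name) AS metric_name,
     CASE
       -- existing array for this month: append today's value (0 if absent)
       WHEN m.metric_array IS NOT NULL THEN
         m.metric_array || ARRAY[COALESCE(d.metric_value, 0)]
       -- first sighting of this host mid-month: zero-fill the missed days first
       ELSE
         REPEAT(0, DATE_DIFF('day', DATE '2023-01-01', d.date))
           || ARRAY[COALESCE(d.metric_value, 0)]
     END AS metric_array,
     '2023-01-01' AS month_start
   FROM monthly m
   FULL OUTER JOIN daily d
     ON m.host = d.host
    AND m.metric_name = d.metric_name
   ```

   The `REPEAT(0, ...)` branch is what handles the mid-window imputation the prompt calls out: a host first seen on day 3 of the month gets two zeros before its first real value.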