# End to End DBT

### Introduction

In this lesson, we'll start by exploring our data in postgres, use DBT to transform it, and end by producing our mart tables.

### Getting a fact table

Get started by creating a new branch:

`git checkout -b end_to_end`

In this lesson, we'll add our fact table.  Remember that the fact table is the table at the center of our dimension tables.

Return to our postgres database, and take a look at the tables listed in the database.  Which one of these do you think would be the fact table? 

```bash
List of relations
 Schema |          Name          | Type  |    Owner
--------+------------------------+-------+-------------
 public | categories             | table | jeffreykatz
 public | customer_customer_demo | table | jeffreykatz
 public | customer_demographics  | table | jeffreykatz
 public | customers              | table | jeffreykatz
 public | employee_territories   | table | jeffreykatz
 public | employees              | table | jeffreykatz
 public | order_details          | table | jeffreykatz
 public | orders                 | table | jeffreykatz
 public | products               | table | jeffreykatz
 public | region                 | table | jeffreykatz
 public | shippers               | table | jeffreykatz
 public | suppliers              | table | jeffreykatz
 public | territories            | table | jeffreykatz
 public | us_states              | table | jeffreykatz
```

It will be the orders table.  So next, while in postgres, display all of the columns in the `orders` table.

```bash
Table "public.orders"
      Column      |         Type          | Collation | Nullable | Default
------------------+-----------------------+-----------+----------+---------
 order_id         | smallint              |           | not null |
 customer_id      | character varying(5)  |           |          |
 employee_id      | smallint              |           |          |
 order_date       | date                  |           |          |
 required_date    | date                  |           |          |
 shipped_date     | date                  |           |          |
 ship_via         | smallint              |           |          |
 freight          | real                  |           |          |
 ship_name        | character varying(40) |           |          |
 ship_address     | character varying(60) |           |          |
 ship_city        | character varying(15) |           |          |
 ship_region      | character varying(15) |           |          |
 ship_postal_code | character varying(10) |           |          |
 ship_country     | character varying(15) |           |          |
```
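
The two listings above look like the output of psql's `\dt` and `\d orders` meta-commands. As a rough equivalent in plain SQL, assuming the tables live in the `public` schema as shown, the same information can be read from `information_schema`:

```sql
-- List the tables in the public schema (similar to \dt).
select table_schema, table_name
from information_schema.tables
where table_schema = 'public'
  and table_type = 'BASE TABLE'
order by table_name;

-- List the columns of the orders table (similar to \d orders).
select column_name, data_type, character_maximum_length, is_nullable
from information_schema.columns
where table_schema = 'public'
  and table_name = 'orders'
order by ordinal_position;
```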

### Moving to DBT

Now let's move to DBT, and create a new model in staging.  

* But before building the staging model, first update the `sources.yaml` file.
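
As a minimal sketch of what the source entries might look like: the source name `rds` is an assumption inferred from the `stg_rds_orders` model name, and the file location is hypothetical. If `sources.yaml` already defines this source, only the two table entries need to be added.

```yaml
# models/staging/sources.yaml (location is an assumption)
version: 2

sources:
  - name: rds              # hypothetical source name, inferred from stg_rds_orders
    database: northwinds   # database used in the psql commands in this lesson
    schema: public         # schema shown in the table listing above
    tables:
      - name: orders
      - name: order_details
```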

The staging model should be for orders and should have the following columns: `order_id`, `employee_id`, `customer_id`, `product_id`, `order_date`, `quantity`, `discount`, and `unit_price`.  

* The `product_id`, `quantity`, `discount`, and `unit_price` columns are not on the `orders` table.  You'll have to find them on a related source table. 


> Notice that we did not include the shipping information -- including the `required_date`.  This is because, theoretically, that would go in its own staging file.  All of the `ship_` prefixes were hints that this belongs in a separate model.  
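
One possible shape for the staging model is sketched below. It assumes a source named `rds` (hypothetical) and that the line-item columns come from the `order_details` table, joined on `order_id`. The `rds-` prefix on the id columns matches what the expected results show.

```sql
-- models/staging/stg_rds_orders.sql (file path is an assumption)
with orders as (
    select * from {{ source('rds', 'orders') }}
),

order_details as (
    select * from {{ source('rds', 'order_details') }}
)

select
    -- concat() accepts non-text arguments in postgres, so no cast is needed
    concat('rds-', orders.order_id)          as order_id,
    concat('rds-', orders.employee_id)       as employee_id,
    concat('rds-', orders.customer_id)       as customer_id,
    concat('rds-', order_details.product_id) as product_id,
    orders.order_date,
    order_details.quantity,
    order_details.discount,
    order_details.unit_price
from orders
inner join order_details
    on orders.order_id = order_details.order_id
order by orders.order_date
```

Because of the join, this model has one row per order line rather than one row per order.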

We should see the following results, ordered by order date from earliest to latest.

> `psql -d northwinds -c "select * from dev.stg_rds_orders limit 3;"`

<img src="./updated-orders.png" width="100%">

* Using Jinja

Notice that each of the key columns (`order_id`, `product_id`, `employee_id`, and `customer_id`) has `rds-` preceding the id.

Use a for loop in Jinja to produce this repetitive work for the `product_id`, `employee_id`, and `customer_id` columns -- you can exclude the `order_id` column from the Jinja.

Confirm that the refactoring works by calling `dbt run`.  Then run the following command.
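
A sketch of the loop, under the same assumptions as before (a hypothetical `rds` source and a join to `order_details`). The `order_id` column stays outside the loop because it exists in both tables and needs a table qualifier.

```sql
with orders as (
    select * from {{ source('rds', 'orders') }}
),

order_details as (
    select * from {{ source('rds', 'order_details') }}
)

select
    concat('rds-', orders.order_id) as order_id,
    {# Each of these columns exists in only one of the two tables, so no qualifier is needed. #}
    {% for key in ['employee_id', 'customer_id', 'product_id'] %}
    concat('rds-', {{ key }}) as {{ key }},
    {% endfor %}
    orders.order_date,
    order_details.quantity,
    order_details.discount,
    order_details.unit_price
from orders
inner join order_details
    on orders.order_id = order_details.order_id
order by orders.order_date
```

Each loop iteration ends with a comma, which is safe here only because other columns follow the loop.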

`psql -d northwinds -c "select * from dev.stg_rds_orders limit 3;"`

* Adding tests

Finally, before wrapping up, add a couple of tests to the new sources.  We can see how by looking at the [DBT documentation](https://docs.getdbt.com/docs/building-a-dbt-project/using-sources).

Confirm that they work by running `dbt test`.
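
As an illustration, source tests are declared on columns in the same yaml file that defines the source. The source name `rds` is hypothetical, and the choice of columns to test is only a suggestion. Recent dbt versions also accept `data_tests:` in place of `tests:`.

```yaml
version: 2

sources:
  - name: rds          # hypothetical source name
    schema: public
    tables:
      - name: orders
        columns:
          - name: order_id
            tests:
              - unique
              - not_null
      - name: order_details
        columns:
          - name: order_id
            tests:
              - not_null
          - name: product_id
            tests:
              - not_null
```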

### Moving to Integration

Next create an integration file.  Remember that in our integration section, we generally combine our related staging files together.  Here, of course, we only have one staging file related to orders.  

So let's just load that in.

* Surrogate key

Now we use the `dbt_utils.generate_surrogate_key` function to generate a new id for us.  This is because, when we merge duplicate records, we lose the original id.  In this case, because our data only comes from one source, there are no duplicate records, so we are likely in good shape.  

Still, because it's a very small amount of work, and we may add additional sources in the future, we'll use `generate_surrogate_key` anyway.  

Have the surrogate key generated from the `product_id`, `order_date`, and `customer_id` columns.

If you run `dbt run`, then you can view the results by running the following:

`psql -d northwinds -c "select * from dev.int_orders limit 3;"`

<img src="./int_orders.png" width="100%">

* `dbt_utils.star`

> Notice that we do not have the original `order_id` column.  Use the `dbt_utils.star` function to select each column minus the `order_id` column.
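
A minimal sketch of the integration model that combines both macros. It assumes the `dbt_utils` package is installed, and the key column name `order_pk` is hypothetical, chosen to match the `contact_pk` naming used elsewhere in this lesson.

```sql
-- models/integration/int_orders.sql (file path is an assumption)
with orders as (
    select * from {{ ref('stg_rds_orders') }}
)

select
    -- hash of the listed columns; identical inputs always produce the same key
    {{ dbt_utils.generate_surrogate_key(['product_id', 'order_date', 'customer_id']) }} as order_pk,
    -- every column of the staging model except order_id
    {{ dbt_utils.star(from=ref('stg_rds_orders'), except=['order_id']) }}
from orders
```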

### Onto marts

Now let's create a `fact_orders` table.  Begin by creating a new file in the `mart` folder called `fact_orders.sql`, so that the model name matches the `dev.fact_orders` table queried below.

> The previewed data should look something like the following.  We ordered the data by `order_date` from earliest to latest.  Notice that it has the related `contact_pk` added.

`psql -d northwinds -c "select * from dev.fact_orders limit 3;"` 

<img src="./updated-mart.png" width="100%">

* Materialize the table

Make sure to materialize the model as a table.  Then move to postgres to confirm that it is now a table rather than a view.
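
One way to do this is a `config` block at the top of the model. The sketch below only shows the materialization. The join that adds `contact_pk` is left out because the join key is not given in this lesson.

```sql
{{ config(materialized='table') }}

select * from {{ ref('int_orders') }}
```

To check the result from SQL, `information_schema.tables` reports `BASE TABLE` for a table and `VIEW` for a view:

```sql
select table_name, table_type
from information_schema.tables
where table_schema = 'dev'
  and table_name = 'fact_orders';
```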

<img src="./mart-tables.png" width="60%">

### Adding some tests

 
#### Integration tests

At the integration level, we should ensure that all of our primary key columns are unique across each integration table, and that no primary key is null.  Do so by adding a new file called `schema.yml` in the integration folder, and then adding the related tests for each integration file.

> Remember that our primary keys are generated from the `generate_surrogate_key` function, and the same key will be generated if the dependent values are the same.

* So at this point we should ensure that the primary keys for the `int_companies`, `int_contacts`, and `int_orders` models are not null, and that they are each unique.
* We should also test that `hubspot_contacts` does not have null values in the `first_name` and `last_name` columns.
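
A sketch of what the integration `schema.yml` might contain. The `contact_pk` and `company_pk` names appear in this lesson, while `order_pk` is a hypothetical name for the orders key. The `hubspot_contacts` tests are not shown because the lesson does not say where that model is defined.

```yaml
# models/integration/schema.yml
version: 2

models:
  - name: int_orders
    columns:
      - name: order_pk        # hypothetical primary key column name
        tests:
          - unique
          - not_null
  - name: int_contacts
    columns:
      - name: contact_pk
        tests:
          - unique
          - not_null
  - name: int_companies
    columns:
      - name: company_pk
        tests:
          - unique
          - not_null
```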

> **Warning**: You will likely get a failing test.

If we add the related tests, and then run `dbt test`, we see something like the following:

> <img src="./failing-test.png" width="100%">

We can see that we have one failing test.

If you run the compiled test against our northwinds database, you should see something like the following.

<img src="./unique-fail.png" width="60%">
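
For reference, dbt writes the compiled SQL for each test under the `target/` directory. For a `unique` test, the compiled query looks roughly like the following. The column name `order_pk` is an assumption, and the exact text varies by dbt version.

```sql
-- Returns one row per key value that appears more than once.
-- The test passes only when this query returns no rows.
select
    order_pk as unique_field,
    count(*) as n_records
from dev.int_orders
where order_pk is not null
group by order_pk
having count(*) > 1;
```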

It looks like we did not specify a unique combination of columns when generating the surrogate key.  So add the `order_id` column to the existing list of columns for the orders surrogate key; no other column needs to be added.  That should do the trick.  
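
The corrected model might look like the sketch below, where `order_pk` is again a hypothetical column name. Presumably the duplicates arise when the same customer orders the same product twice on one day in different orders, and including `order_id` separates those rows.

```sql
with orders as (
    select * from {{ ref('stg_rds_orders') }}
)

select
    -- order_id is used to build the key even though it is excluded from the output columns
    {{ dbt_utils.generate_surrogate_key(['order_id', 'product_id', 'order_date', 'customer_id']) }} as order_pk,
    {{ dbt_utils.star(from=ref('stg_rds_orders'), except=['order_id']) }}
from orders
```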

Rerun `dbt run`, and then run `dbt test`.

<img src="./passing-int-tests.png" width="100%">

After making the necessary changes, confirm that all tests are now passing.

#### Mart Tests

Next, we should add the following set of tests for our mart tables.

* Primary key tests

Begin by adding the same tests for the primary keys -- that pks are not null, and are unique -- for each of our three mart tables.

* Referential tests

One of the main things to ensure in the marts is referential integrity.  By that, we mean that each foreign key should have a corresponding primary key in the relevant table.

For example, when a contact has a `company_pk`, we want to make sure that `company_pk` exists in the companies dimension table.  Look for `relationships` in the [testing documentation](https://docs.getdbt.com/docs/building-a-dbt-project/tests).  

Then add two new tests -- one for each of the relevant foreign keys across our mart tables.
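
As a sketch, a `relationships` test names the parent model and the column it must match. The dimension model names `dim_contacts` and `dim_companies` and the file location are hypothetical, so substitute the names of your own mart models.

```yaml
# models/mart/schema.yml (location is an assumption)
version: 2

models:
  - name: dim_contacts                   # hypothetical mart model name
    columns:
      - name: company_pk
        tests:
          - relationships:
              to: ref('dim_companies')   # hypothetical mart model name
              field: company_pk
  - name: fact_orders
    columns:
      - name: contact_pk
        tests:
          - relationships:
              to: ref('dim_contacts')
              field: contact_pk
```

The test fails for every row whose foreign key value has no match in the parent column.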

We should see that all of our tests are now passing.

<img src="./passing-tests.png" width="100%">

### Drawing Insights

From there, write a SQL query that displays total revenue per day.

```bash
 order_day  |   total_revenue
------------+--------------------
 1996-07-04 | 439.99999809265137
 1996-07-05 | 1863.4000644683838
 1996-07-08 | 2483.8000259399414
 1996-07-09 | 3730.0001525878906
 1996-07-10 | 1444.7999839782715
 1996-07-11 |   625.200014591217
 1996-07-12 | 2490.4999780654907
 1996-07-15 |  517.8000068664551
 1996-07-16 |  1119.899953842163
 1996-07-17 | 2018.5999927520752
```

### Summary

Think back through the steps involved in this lab.  And think through the components involved in each stage -- from exploring data in postgres, to sources, staging, integration and marts.  

As a review, record some example code that belongs in each component of the codebase (from sources, to staging, to integration, to marts).

### Query
```sql
-- Total revenue per day. Note that the discount column is not applied here.
SELECT
    DATE(order_date) AS order_day,
    SUM(quantity * unit_price) AS total_revenue
FROM
    dev.fact_orders
GROUP BY
    DATE(order_date)
ORDER BY
    DATE(order_date);
```