# Cleaning the Staging Layer

### Introduction

So far, we have seen how to access data from our data warehouse and perform some initial cleanup on that data through staging.  In this lesson, we'll take another look at the integration layer, and clean up our staging models to prepare for it.

### Combining Information

Let's take another look at our dbt pipeline diagram.

<img src="./dbt-pipeline.jpg" width="70%">

So far, we have defined our sources with the `sources.yaml` file, and then performed some initial transformations of data in staging.  In the staging layer, our data is still segmented out by source.  In the integration layer, we'll combine information from various sources.

### Our current project

We can see the integration work to be done if we look at the file structure of our current project.

> <img src="./staging-models.png" width="40%">

In the staging layer, we have our models segmented out by the source that they came from.  For example, we have our information separated into data from `hubspot`, `mixpanel`, and `rds`.  However, within each of those models is overlapping data about contacts.  

In the integration layer we'll combine and merge data from different sources.  

### Some initial cleanup

Before we move right into integrating our data, we'll make things easier on ourselves if we provide some initial cleanup at the staging layer.  This involves a couple of components.

* Let's create a new branch called `integration_prep`, and get started.

#### 1. Properly differentiate

When we eventually combine our data, it will be useful to know which data came from which source.  To achieve this, we can concatenate a *prefix* to the primary id of each source.

We can see how this works with our `stg_hubspot_contacts` model.  We'll select the `hubspot_id` if it's not selected already, and then concatenate the prefix `hubspot-` to each id.

Here it is:

```sql
SELECT 
    concat('hubspot-', hubspot_id) as contact_id,
    first_name, 
    last_name
    ...
```

And now when we look at the results, we'll see the following:

<img src="./hubspot-pref.png" width="100%">

> Notice that we call our column `contact_id` instead of `hubspot_id`.

When we eventually combine our data, we can use this information to trace back to the original source.
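To see why the prefix helps, here is a minimal sketch in which two sources happen to reuse the same raw id.  The rows and values below are made up for illustration and are not taken from the project's data.

```sql
-- Hypothetical rows standing in for two staging models that share the raw id 101
WITH combined AS (
    SELECT 'hubspot-101' AS contact_id, 'Ada' AS first_name
    UNION ALL
    SELECT 'rds-101' AS contact_id, 'Ada' AS first_name
)

-- The prefix keeps the two ids distinct and lets us filter by origin
SELECT
    contact_id,
    first_name
FROM combined
WHERE contact_id LIKE 'hubspot-%'
```

Without the prefix, both rows would carry the id `101`, and we could no longer tell which system each one came from.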

#### 2. Properly align

Before combining the data, we should clean up our staging layer to make our data more consistent.  This involves a couple of steps.  

* We should align the columns in our various sources so that they are in the same order, with consistent names and consistent formatting.

If we look at the `stg_rds_customers.sql` file, we can see that it does not currently align with the `stg_hubspot_contacts` model.  

Place the columns in the same order, and with the same formatting.  Also, add a prefix to the id called `rds`, and name it `customer_id`.

When complete, our rds customers model should look like the following:

<img src="./cleaned-rds.png" width="100%">

Notice that we achieved the following:

* We added the `rds` prefix to our id, and labeled it as `customer_id`
* Our phone number is now formatted
* We changed `company_name` to `business_name`
* We ordered the columns to be consistent with our hubspot contacts table, where possible.
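
As a rough guide, the changes above might look something like the following sketch.  The source reference `source('rds', 'customers')`, the raw column names `id` and `phone`, and the phone cleanup expression are all assumptions made for illustration, so adjust them to match the actual table.

```sql
-- stg_rds_customers.sql (sketch; raw column names below are assumed)
SELECT
    concat('rds-', id) as customer_id,                      -- prefix marks the source system
    first_name,
    last_name,
    company_name as business_name,                          -- renamed for consistency
    replace(replace(phone, '-', ''), ' ', '') as phone      -- hypothetical formatting: strip dashes and spaces
FROM {{ source('rds', 'customers') }}                       -- assumed source and table names
```

Keeping the id, `first_name`, and `last_name` in the same positions as in `stg_hubspot_contacts` should make the two models easier to compare and combine later.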

### Summary

In this lesson, we cleaned up our staging layer a bit.  We accomplished this by first adding a prefix to our id.  This will allow us to better identify the source of our information later on.

The other step was to make our staging models more consistent.  We accomplished this by aligning the column names, and also aligning the order of our columns.