# Combining Contact Data

### Introduction

In this lesson, we'll work through the initial steps of combining our contacts data.  Let's get started.

### Getting set up

To start off, create a new branch called `integrate_contacts`. Then, under `models`, create a new folder called `staging` and a file called `int_contacts.sql` inside that folder.

Our file structure should look like the following:

<img src="./file-structure.png" width="60%">

### Our first merge

Now take a look at the rds customers model and the hubspot contacts model. The two staging models share many of the same columns, so it would be nice to combine them.

By combining our tables, we can see which of our contacts end up becoming customers, and make sure that all of our information for a customer is in a single table.

We'll let you move through these steps:

1. Use the `union all` operator to combine the two datasets. In doing so, rename the rds `customer_id` column to `contact_id` so that the id columns of the two datasets align.

2. Wrap the combined data in a CTE called `merged_contacts`. If we select all from this CTE, we should see something like the following:

> <img src="./combined-data.png" width="90%">

* If we scroll down far enough, we should see that both our hubspot and our rds data are in our `merged_contacts` CTE.

> <img src="./merged-data.png" width="90%">
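
As a rough sketch of steps 1 and 2, the query below uses a few made-up rows in place of the two staging models, so it can run on its own outside of dbt. The names and values are hypothetical, and in the real model the `contacts` and `customers` CTEs would select from the staging models instead.

```sql
-- Hypothetical sample rows stand in for the two staging models.
with contacts as (
    select 101 as contact_id, 'Ada' as first_name, 'Lovelace' as last_name, '555-0100' as phone
), customers as (
    select 7 as customer_id, 'Ada' as first_name, 'Lovelace' as last_name, '555-0100' as phone
    union all
    select 8, 'Alan', 'Turing', '555-0101'
), merged_contacts as (
    select contact_id, first_name, last_name, phone
    from contacts
    union all
    -- rename customer_id so it lines up with contact_id
    select customer_id as contact_id, first_name, last_name, phone
    from customers
)
select * from merged_contacts
```

With this sample data, the result has three rows: one from `contacts` and two from `customers`. Because `union all` keeps duplicates, Ada Lovelace appears twice.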

### Deduplicating our data

Now that we have combined our data, the next step is to deduplicate it.

Start by confirming that some of our data is in fact duplicated.

3. Group by both `first_name` and `last_name`, count the records, and order by that count.

> <img src="./combined-count.png" width="90%">
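
One way step 3 might look is sketched below, again with hypothetical rows standing in for `merged_contacts`. The `record_count` alias and the descending sort are assumptions, chosen so that duplicated names appear at the top.

```sql
-- Hypothetical sample rows stand in for the merged_contacts CTE.
with merged_contacts as (
    select 'Ada' as first_name, 'Lovelace' as last_name
    union all
    select 'Ada', 'Lovelace'
    union all
    select 'Alan', 'Turing'
)
select
    first_name,
    last_name,
    count(*) as record_count  -- more than 1 means the name is duplicated
from merged_contacts
group by first_name, last_name
order by record_count desc
```

Here Ada Lovelace comes back with a count of 2 and Alan Turing with a count of 1.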

4. Now, slightly update our group by statement so that we group by `first_name` and `last_name`, and take a `max` of the phone number.  Then order the results by `last_name`, ascending.

<img src="./combined-results.png" width="90%">
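
A minimal sketch of step 4 follows, using made-up rows in which one of the duplicates is missing its phone number. Since `max` skips null values, the phone number that is present is the one kept for each name.

```sql
-- Hypothetical sample rows stand in for the merged_contacts CTE.
with merged_contacts as (
    select 'Ada' as first_name, 'Lovelace' as last_name, '555-0100' as phone
    union all
    select 'Ada', 'Lovelace', null
    union all
    select 'Alan', 'Turing', '555-0101'
)
select
    first_name,
    last_name,
    max(phone) as phone  -- collapses the duplicates to one phone per name
from merged_contacts
group by first_name, last_name
order by last_name asc
```

This returns one row per name: Ada Lovelace with `555-0100`, then Alan Turing with `555-0101`.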

### Keeping our IDs

Now our data is starting to look pretty good, but unfortunately, we lost the primary ids in the process.  This is because each instance of our duplicated data would have a different id -- a hubspot id and an rds id.  

It turns out we would like to keep both, and separate out the ids into different columns.

<img src="./separated-ids.png" width="90%">

How do we accomplish this?  Take a look at the relevant query below.

```sql
merged_contacts as (
    SELECT 
    contact_id as hubspot_contact_id,
    null as rds_contact_id,
    ...
    
    union all 
    SELECT 
    null as hubspot_contact_id,
    customer_id as rds_contact_id,
```

We first select from the hubspot staging table, and reassign `contact_id` to `hubspot_contact_id`.  We also add a column for `rds_contact_id`, but we set every value to null -- as none of our hubspot staging data will have an rds id.

Then below, after the `union all`, when we select from our rds model, we do the reverse: we set `hubspot_contact_id` to null and alias `customer_id` as `rds_contact_id`.

> **Note** Make sure to keep the `hubspot_contact_id` and `rds_contact_id` columns in the same order in each select statement.  This is important because `union all` matches columns by position, not by name.
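
To see why the order matters, here is a small sketch with made-up ids where the second select deliberately lists its columns in the wrong order.

```sql
-- Misaligned on purpose: the second select swaps the two columns.
select 101 as hubspot_contact_id, null as rds_contact_id
union all
select 7 as rds_contact_id, null as hubspot_contact_id
```

The result takes its column names from the first select and fills them by position, so the rds id `7` ends up under `hubspot_contact_id`. The aliases in the second select are ignored.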

We can store all of that in our `merged_contacts` CTE.

Then, when we group our data together in the next CTE, we take the `max` of `hubspot_contact_id` and of `rds_contact_id`. Because `max` ignores null values, any id that is present overrides the nulls, and each contact ends up on a single row with both ids filled in where they exist.
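
The grouping step can be sketched as follows. The rows are hypothetical and only the id and name columns are shown, so treat this as an illustration of the pattern rather than the full model.

```sql
-- Hypothetical sample rows stand in for the merged_contacts CTE.
with merged_contacts as (
    -- hubspot row: no rds id available
    select 101 as hubspot_contact_id, null as rds_contact_id,
           'Ada' as first_name, 'Lovelace' as last_name
    union all
    -- rds row for the same person: no hubspot id available
    select null, 7, 'Ada', 'Lovelace'
    union all
    -- rds row with no hubspot counterpart
    select null, 8, 'Alan', 'Turing'
)
select
    max(hubspot_contact_id) as hubspot_contact_id,  -- max skips nulls
    max(rds_contact_id) as rds_contact_id,
    first_name,
    last_name
from merged_contacts
group by first_name, last_name
order by last_name
```

Ada Lovelace comes back on one row with both ids (101 and 7), while Alan Turing keeps a null `hubspot_contact_id` because no hubspot row exists for him.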

Implement this in your model, and you should see the following:

<img src="./separated-ids.png" width="90%">

Next, implement a similar pattern for `company_id`.  That is, add separate columns for our hubspot company id and our rds company id.

We should see the following:

<img src="./company_ids.png" width="90%">

### Summary

In this lesson, we moved through many of the steps for merging our contact data.  The main new technique we saw was splitting the ids from each source into their own columns, so that we keep both ids after deduplicating.

### Solution

```sql
with contacts as (
    select * from {{ ref('stg_hubspot_contacts') }}
), customers as (
    select * from {{ ref('stg_rds_customers') }}
), merged_contacts as (
    -- hubspot rows: fill the hubspot ids, leave the rds ids null
    select
        contact_id as hubspot_contact_id,
        null as rds_contact_id,
        first_name,
        last_name,
        phone,
        company_id as hubspot_company_id,
        null as rds_company_id
    from contacts
    union all
    -- rds rows: same column order, with the rds ids filled instead
    select
        null as hubspot_contact_id,
        customer_id as rds_contact_id,
        first_name,
        last_name,
        phone,
        null as hubspot_company_id,
        company_id as rds_company_id
    from customers
), final as (
    -- one row per name; max() ignores nulls, so any present id wins
    select
        max(hubspot_contact_id) as hubspot_contact_id,
        max(rds_contact_id) as rds_contact_id,
        first_name,
        last_name,
        max(phone) as phone,
        max(hubspot_company_id) as hubspot_company_id,
        max(rds_company_id) as rds_company_id
    from merged_contacts
    group by first_name, last_name
)
-- dbt_utils 1.0 and later rename this macro to generate_surrogate_key
select
    {{ dbt_utils.surrogate_key(['first_name', 'last_name', 'phone']) }} as contact_pk,
    hubspot_contact_id,
    rds_contact_id,
    first_name,
    last_name,
    phone,
    hubspot_company_id,
    rds_company_id
from final
```