# Combining Contact Data

### Introduction

In this lesson, we'll work through the initial steps of combining our contacts data.  Let's get started.

### Getting set up

To start off, create a new branch called `integrate_contacts`. Then, under `models`, create a new folder called `staging` and a file called `int_contacts.sql` inside that folder.

Our file structure should look like the following:

<img src="./file-structure.png" width="60%">

### Our first merge

Now take a look at the rds customers model and the hubspot contacts model. Both share many of the same columns, so it would be nice to combine the two and see which of our contacts end up becoming customers.

Let's use `union all` to combine the two datasets. In doing so, we'll want to rename the `customer_id` column to `contact_id`, so that the id columns from both models line up under a single name.

Wrap the combined data in a CTE called `merged_contacts`. If we select everything from this CTE, we should see something like the following:

<img src="./combined-data.png" width="90%">

And if we scroll down far enough, we should see that both our hubspot and our rds data are in our `merged_contacts` CTE.

<img src="./merged-data.png" width="90%">
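The merge described above could be sketched roughly as follows. This is only an illustration: the two leading CTEs hold made-up stand-in rows so the query runs on its own, and the `phone` and `company_id` column names are assumptions about what the two models expose. In the dbt project, each stand-in CTE would instead select from the corresponding model.

```sql
-- Hypothetical stand-in rows; replace each CTE body with a select
-- from your rds customers model and hubspot contacts model.
with rds_customers as (
    select 1 as customer_id, 'Ada' as first_name, 'Lovelace' as last_name,
           '555-0100' as phone, 10 as company_id
),

hubspot_contacts as (
    select 901 as contact_id, 'Ada' as first_name, 'Lovelace' as last_name,
           '555-0100' as phone, 77 as company_id
    union all
    select 902, 'Grace', 'Hopper', '555-0199', 78
),

merged_contacts as (
    -- Rename customer_id so the combined output uses contact_id.
    select customer_id as contact_id, first_name, last_name, phone, company_id
    from rds_customers

    union all

    -- union all keeps every row from both sides, duplicates included.
    select contact_id, first_name, last_name, phone, company_id
    from hubspot_contacts
)

select * from merged_contacts
```

Note that `union all` matches columns by position rather than by name, so both selects need to list the same columns in the same order. The output column names come from the first select.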

### Deduplicating our data

Next up, we want to deduplicate our data. To see that some of our records are currently duplicated, group by both `first_name` and `last_name`, count the records in each group, and order by that count.

<img src="./combined-count.png" width="90%">
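A minimal sketch of that duplicate check is shown here. The rows inside `merged_contacts` are made up so the query is self-contained, and ordering the count in descending order is an assumption (the lesson only says to order by the count), chosen so that duplicated names appear first.

```sql
-- Hypothetical stand-in for the merged_contacts CTE built earlier.
with merged_contacts as (
    select 1 as contact_id, 'Ada' as first_name, 'Lovelace' as last_name
    union all
    select 901, 'Ada', 'Lovelace'
    union all
    select 902, 'Grace', 'Hopper'
)

select
    first_name,
    last_name,
    count(*) as record_count  -- number of rows sharing this name
from merged_contacts
group by first_name, last_name
order by record_count desc    -- duplicated names come first
```

With these sample rows, Ada Lovelace appears once per source, so her group has a count of 2, while Grace Hopper has a count of 1.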

Now update the query slightly: keep grouping by `first_name` and `last_name`, but take the `max` of the phone number so that each person collapses to a single row. Order the results by `last_name`, ascending.

<img src="./combined-results.png" width="90%">
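One way this deduplication step might look is sketched below, again with made-up rows and an assumed `phone` column name. One of the duplicate rows deliberately has no phone number, to show what `max` does in that case.

```sql
-- Hypothetical stand-in for the merged_contacts CTE built earlier.
with merged_contacts as (
    select 1 as contact_id, 'Ada' as first_name, 'Lovelace' as last_name,
           '555-0100' as phone
    union all
    select 901, 'Ada', 'Lovelace', null  -- same person, phone missing
    union all
    select 902, 'Grace', 'Hopper', '555-0199'
)

select
    first_name,
    last_name,
    max(phone) as phone  -- one phone value per person; nulls are ignored
from merged_contacts
group by first_name, last_name
order by last_name asc
```

Because aggregate functions such as `max` skip nulls, a person whose phone number is present in only one source still ends up with that number. If the two sources hold different numbers for the same person, `max` simply keeps the one that sorts last.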

### Keeping our IDs

Now our data is starting to look pretty good, but unfortunately, we lost the primary ids in the process. This is because each instance of our duplicated data has a different id: a hubspot id and an rds id.

So which should we store as the primary key?

For now, we'll keep both ids rather than choosing one, and we can do so with the following:

<img src="./array_agg.png" width="100%">

So when we grouped, we combined the differing `contact_id` values into an array, which is stored under the `contact_ids` column.

`array_agg` takes the value from each row in a group and places it into an array (an array is essentially a list).
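For reference, here is a small self-contained sketch of `array_agg` in a grouped query. The sample rows and the `phone` column name are assumptions for illustration, and the query assumes a warehouse that provides `array_agg` (Postgres, Snowflake, BigQuery, and DuckDB all do).

```sql
-- Hypothetical stand-in for the merged_contacts CTE built earlier.
with merged_contacts as (
    select 1 as contact_id, 'Ada' as first_name, 'Lovelace' as last_name,
           '555-0100' as phone
    union all
    select 901, 'Ada', 'Lovelace', '555-0100'
    union all
    select 902, 'Grace', 'Hopper', '555-0199'
)

select
    first_name,
    last_name,
    max(phone) as phone,
    array_agg(contact_id) as contact_ids  -- every id seen for this person
from merged_contacts
group by first_name, last_name
order by last_name asc
```

Here the Ada Lovelace row carries both of her ids (1 and 901) in `contact_ids`. The order of elements inside the array is not guaranteed unless the aggregate is given an explicit ordering, and the syntax for that differs between warehouses.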

Now that you've seen how `array_agg` works, use it to aggregate the different company ids that associate a person with either a hubspot company or an rds company. Store the results in a new column called `company_ids`.

<img src="./agg_companies.png" width="100%">
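A possible version of the finished query is sketched below. As before, the rows are made up, and `company_id` is an assumed name for the source column that gets aggregated into `company_ids`.

```sql
-- Hypothetical stand-in for the merged_contacts CTE built earlier.
with merged_contacts as (
    select 1 as contact_id, 'Ada' as first_name, 'Lovelace' as last_name,
           '555-0100' as phone, 10 as company_id
    union all
    select 901, 'Ada', 'Lovelace', '555-0100', 77
    union all
    select 902, 'Grace', 'Hopper', '555-0199', 78
)

select
    first_name,
    last_name,
    max(phone) as phone,
    array_agg(contact_id) as contact_ids,  -- hubspot and rds contact ids
    array_agg(company_id) as company_ids   -- hubspot and rds company ids
from merged_contacts
group by first_name, last_name
order by last_name asc
```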

### Summary

In this lesson, we worked through the first steps of merging our contact data: combining the two models with `union all`, then deduplicating with a group by. The main new component we saw was `array_agg`, which allowed us to store multiple values in a single column.