# Introduction to Integration

### Introduction

So far we have seen how to use dbt to access data from our data warehouse and then perform some initial cleanup on that data through staging.  In this lesson, we'll move on to the step after staging, which is integration.

### Combining Information

Before moving to the integration layer, let's just review our general pattern.

<img src="./dbt-pipeline.jpg" width="70%">

So far, we have defined our sources with the `sources.yaml` file, and then performed some initial transformations of data in staging.  In the staging layer, our data is still segmented out by source, but in the integration layer, we'll combine information from various sources.

This point is re-emphasized if we look at the file structure of our current project.

> As we can see below, we still have our hubspot folder separated from our rds folder.

> <img src="./staging-models.png" width="40%">

Going forward, we will create a folder under models called `integration` and begin to combine our data there.

### Our plan going forward

Here's how we'll begin to combine our data.  

1. We'll combine our data by selecting from each source and stacking the results on top of one another with the `union all` operator.  

> Remember that [the union operator](https://docs.snowflake.com/en/sql-reference/operators-query.html#union-all) combines the results of multiple select statements and removes duplicate rows, while `union all` keeps any duplicates.

2. Now that our data is stacked, we'll deduplicate it by grouping by a column whose values should not repeat -- for example, the company name for companies.  

3. Our combined data will have different primary keys than our source data, so we'll then need to re-associate our models by use of primary keys and foreign keys.
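
The three steps above can be sketched end to end. This is only a minimal illustration, not the project's actual models: it uses Python's built-in `sqlite3` module instead of our warehouse so that it runs anywhere, and the table names (`stg_hubspot_companies`, `stg_rds_companies`, `stg_rds_contacts`, `int_companies`), the columns, and the sample rows are all hypothetical. In a dbt project, each query would more likely live in its own model file as a select statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical staging tables, still separated by source.
conn.executescript("""
    create table stg_hubspot_companies (company_id integer, company_name text);
    create table stg_rds_companies (company_id integer, company_name text);
    create table stg_rds_contacts (contact_id integer, contact_name text, company_id integer);

    insert into stg_hubspot_companies values (1, 'Acme'), (2, 'Globex');
    insert into stg_rds_companies values (10, 'Acme'), (11, 'Initech');
    insert into stg_rds_contacts values (100, 'Ann', 10), (101, 'Bob', 11);
""")

# Step 1: stack the sources with `union all`, keeping duplicates
# and remembering which source and source id each row came from.
stacked_sql = """
    select 'hubspot' as source, company_id as source_company_id, company_name
    from stg_hubspot_companies
    union all
    select 'rds' as source, company_id as source_company_id, company_name
    from stg_rds_companies
"""
stacked = conn.execute(stacked_sql).fetchall()

# Step 2: deduplicate by grouping on company_name, and give each
# surviving row a new primary key (here, a simple row number).
conn.execute(f"""
    create table int_companies as
    select
        row_number() over (order by company_name) as company_pk,
        company_name
    from ({stacked_sql})
    group by company_name
""")
companies = conn.execute(
    "select company_pk, company_name from int_companies order by company_pk"
).fetchall()

# Step 3: re-associate a related model. Each contact still points at its
# source company id, so we map that id back to the new company_pk.
contacts = conn.execute(f"""
    select c.contact_name, i.company_pk
    from stg_rds_contacts c
    join ({stacked_sql}) s
        on s.source = 'rds' and s.source_company_id = c.company_id
    join int_companies i
        on i.company_name = s.company_name
    order by c.contact_name
""").fetchall()

print(len(stacked))  # 4 rows: 'Acme' appears once per source
print(companies)     # [(1, 'Acme'), (2, 'Globex'), (3, 'Initech')]
print(contacts)      # [('Ann', 1), ('Bob', 3)]
```

Notice that `Acme` appears twice after the `union all`, once after the group by, and that Ann's company id changes from the source value `10` to the new key `1`. A row number is used for the new key only to keep the sketch short; a real project may prefer a different key strategy.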

We'll move through these steps in the following lessons.

### Summary

In this lesson, we re-introduced our integration layer.  So far we have kept our data segmented by its various sources.  In the integration layer, we'll combine the records from our sources.  We'll do so by:

1. Combining our data -- including duplicates -- with the `union all` operator. 
2. Deduplicating our data by grouping by an attribute that should be unique across the rows of data.
3. Properly associating data to maintain our model relations now that our data has different ids.

### Sources

[Rittman Analytics Data Centralization](https://rittmananalytics.com/blog/2020/5/28/introducing-the-ra-warehouse-dbt-framework-how-rittman-analytics-does-data-centralization)