# Loading Additional Data

### Introduction

So far, our source data has been the northwinds dataset, and the idea is that this data comes from a transactional database.  That is, a database that backs a website.

However, our data often comes from more than one source.  For example, before someone signs up for our website, we may get some of their information from a tool like [Mixpanel](https://mixpanel.com/).  Mixpanel is a marketing tool that allows us to track individuals who visit and click around on our website.  Or perhaps we also have information from a site like HubSpot.  HubSpot is a sales tool that allows us to track a sales pipeline and email campaigns.

We'll want to have all of this data in our database. 

In this lesson, we'll do the first step, which is to take some csv data and load it into our database.

### Exploring our csv data

Ok, so currently our data is located in the [pipeline-data](https://github.com/analytics-engineering-jigsaw/dbt/tree/main/pipeline-data) folder.  Let's take a look at each file with pandas.

* HubSpot data

```python
import pandas as pd

url = "https://raw.githubusercontent.com/analytics-engineering-jigsaw/dbt/main/pipeline-data/northwinds_hubspot.csv"
hubspot_df = pd.read_csv(url)
```

```python
hubspot_df.head(2)  # show the first two rows
```

| `Unnamed: 0` | `hubspot_id` | `first_name` | `last_name` | `phone` | `business_name` |
| --- | --- | --- | --- | --- | --- |
| 0 | 0 | Henri | Hoffmann | 954-564-7735 | Fadel, Lueilwitz and Nitzsche |
| 1 | 1 | Niles | Ballinger | 111-360-4329 | Ruecker, Lehner and Jakubowski |
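
The `Unnamed: 0` column shows up because the first column in the csv has no header, so pandas gives it a placeholder name.  If that column is just a saved row index (an assumption, not something the file tells us), one option is to pass `index_col=0` when reading.  The sketch below uses an inline copy of the two rows shown above so that it runs without a network call.

```python
import io

import pandas as pd

# Two rows copied from the output above. The first header cell is blank,
# which is why pandas labels that column "Unnamed: 0" by default.
sample_csv = io.StringIO(
    ',hubspot_id,first_name,last_name,phone,business_name\n'
    '0,0,Henri,Hoffmann,954-564-7735,"Fadel, Lueilwitz and Nitzsche"\n'
    '1,1,Niles,Ballinger,111-360-4329,"Ruecker, Lehner and Jakubowski"\n'
)

# index_col=0 treats the unnamed first column as the row index
# (assumption: that column is only a row counter).
hubspot_sample_df = pd.read_csv(sample_csv, index_col=0)
print(list(hubspot_sample_df.columns))
```

This only changes how pandas displays the data.  The csv file itself still contains the unnamed column.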


* Mixpanel data

```python
import pandas as pd

mixpanel_url = "https://raw.githubusercontent.com/analytics-engineering-jigsaw/dbt/main/pipeline-data/northwinds_mixpanel.csv"
mixpanel_df = pd.read_csv(mixpanel_url)
mixpanel_df.head(2)  # show the first two rows
```

| `Unnamed: 0` | `$created` | `$email` | `$first_name` | `$last_name` | `Abandon Cart Count` | `Account Created Count` | `Gender` | `Registration Date` | `$city` | `$region` | `Last Event` | `Last Purchase` | `Last Search` | `Last Share` |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2020-03-08T13:16:23 | Anthony.Bryant@comcastx.com | Anthony | Bryant | 4.0 | 6.0 | Male | 2020-03-07T22:23:22 | | | Abandon Cart | 2020-03-31T22:19:47 | 2020-03-31T22:00:36 | 2020-03-31T21:57:10 |
| 1 | 2020-03-10T10:58:25 | Jessica.Perkins@hotmailx.com | Jessica | Perkins | 2.0 | 6.0 | Female | 2020-03-13T23:38:28 | Eindhoven | North Brabant | Landing Page Loaded | 2020-03-19T18:41:03 | 2020-03-29T20:15:21 | 2020-03-31T22:31:55 |
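
Notice that several Mixpanel headers start with `$` or contain spaces.  In Postgres, identifiers like these generally have to be wrapped in double quotes when queried.  This lesson does not require renaming anything, but if simpler names are preferred, the sketch below shows one possible approach.  Here `to_snake_case` is a hypothetical helper written for this example, not part of pandas or dbt.

```python
import re

# Column names copied from the Mixpanel output above.
mixpanel_columns = [
    "Unnamed: 0", "$created", "$email", "$first_name", "$last_name",
    "Abandon Cart Count", "Account Created Count", "Gender",
    "Registration Date", "$city", "$region", "Last Event",
    "Last Purchase", "Last Search", "Last Share",
]


def to_snake_case(name: str) -> str:
    """Hypothetical helper: lowercase a header and replace every run of
    non-alphanumeric characters with a single underscore."""
    cleaned = re.sub(r"[^0-9a-zA-Z]+", "_", name).strip("_")
    return cleaned.lower()


renamed_columns = [to_snake_case(column) for column in mixpanel_columns]
print(renamed_columns)
```

With a dataframe in hand, the same helper could be applied with `mixpanel_df.rename(columns=to_snake_case)`.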


### Loading our data to Postgres

dbt makes it fairly easy to load csv data into our database.  Here's how we do it.

First download our pipeline data.  Remember, that data is located [here](https://github.com/analytics-engineering-jigsaw/dbt/tree/main/pipeline-data).

Then, go to your dbt repository, and create a new branch called `seed_csv_data`.

From there, you'll notice that your dbt repository has a folder called `seeds`.  This is where the csv data should go.  Move the `northwinds_hubspot.csv` file and the `northwinds_mixpanel.csv` file into the `seeds` folder.
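
If you would rather script the download than move the files by hand, something like the following could work.  It is only a sketch: it assumes you run it from the root of your dbt repository, that the raw file URLs used earlier in this lesson are still valid, and `seed_urls` and `download_seeds` are names made up for this example.

```python
from pathlib import Path
from urllib.request import urlretrieve

# Raw-file location of the pipeline data, taken from the URLs used above.
BASE_URL = "https://raw.githubusercontent.com/analytics-engineering-jigsaw/dbt/main/pipeline-data"
SEED_FILES = ["northwinds_hubspot.csv", "northwinds_mixpanel.csv"]


def seed_urls():
    """Map each seed filename to its raw download URL."""
    return {name: f"{BASE_URL}/{name}" for name in SEED_FILES}


def download_seeds(seeds_dir="seeds"):
    """Download each csv into seeds_dir (assumed to be the dbt seeds folder)."""
    target_dir = Path(seeds_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    for name, url in seed_urls().items():
        destination = target_dir / name
        urlretrieve(url, str(destination))  # performs a network request
        saved.append(destination)
    return saved


# Uncomment to actually download the files:
# download_seeds()
```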

Then simply run the following command from the terminal.

```
dbt seed
```

You should see something like the following:

```bash
21:51:30  1 of 2 START seed file dev.northwinds_hubspot .................................. [RUN]
21:51:30  1 of 2 OK loaded seed file dev.northwinds_hubspot .............................. [INSERT 491 in 0.38s]
21:51:30  2 of 2 START seed file dev.northwinds_mixpanel ................................. [RUN]
21:51:40  2 of 2 OK loaded seed file dev.northwinds_mixpanel ............................. [INSERT 5998 in 9.86s]
21:51:40
21:51:40  Finished running 2 seeds in 0 hours 0 minutes and 10.48 seconds (10.48s).
21:51:40
21:51:40  Completed successfully
21:51:40
21:51:40  Done. PASS=2 WARN=0 ERROR=0 SKIP=0 TOTAL=2
```

Ok, that's it!  You should now have two additional tables, `dev.northwinds_mixpanel` and `dev.northwinds_hubspot`.  Confirm that you have these tables.

`psql -d northwinds -c "select * from dev.northwinds_hubspot limit 1"`

`psql -d northwinds -c "select * from dev.northwinds_mixpanel limit 1"`

Ok, if you get back a row from each table, you are in good shape.
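
For a slightly stronger check, we can count the data rows in a csv file and compare that number to the `INSERT` count that `dbt seed` reports for it.  The sketch below is a minimal illustration using an inline two-row sample.  `count_data_rows` is a hypothetical helper, and it relies on Python's `csv` module so that commas inside quoted business names are not mistaken for column separators.

```python
import csv
import io


def count_data_rows(csv_file):
    """Count the non-empty rows after the header line."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row
    return sum(1 for row in reader if row)


# Inline sample copied from the HubSpot rows shown earlier in the lesson.
sample = io.StringIO(
    ',hubspot_id,first_name,last_name,phone,business_name\n'
    '0,0,Henri,Hoffmann,954-564-7735,"Fadel, Lueilwitz and Nitzsche"\n'
    '1,1,Niles,Ballinger,111-360-4329,"Ruecker, Lehner and Jakubowski"\n'
)
sample_row_count = count_data_rows(sample)
print(sample_row_count)  # 2
```

To run this against a real seed file, open it with `open("seeds/northwinds_hubspot.csv", newline="")` and pass the file object to `count_data_rows`.  The result should match the `INSERT` count from your own `dbt seed` run.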

### Wrapping up

Finally, make a new commit. 
```
git add -A
git commit -m 'add seed data'
```

Then check out the main branch.

`git checkout main`

And merge our `seed_csv_data` branch into main.

```
git merge seed_csv_data
```

And then push the main branch up to GitHub.

```
git push origin main
```

### Summary

In this lesson, we saw how we can quickly load csv data into our database with dbt.  We simply have to move our data into the `seeds` folder in our dbt repository, and then run `dbt seed`.  This will create a corresponding table for each csv file and populate the tables with that data.