Issue #375 #411

Merged
merged 43 commits into from Nov 17, 2020
Commits
70a83a4
commit message
jawond Jun 4, 2020
b1a8d47
Merge remote-tracking branch 'upstream/master'
jawond Jun 23, 2020
aad1b6c
first partial draft of new persistence tutorials
jawond Jul 1, 2020
ead664a
Merge branch 'master' into Issue-375
krlmlr Jul 1, 2020
620beda
Tweak
krlmlr Jul 1, 2020
1e9ae50
summmarize()
krlmlr Jul 1, 2020
8800a9e
Merge remote-tracking branch 'upstream/master'
jawond Jul 6, 2020
e3b9fff
Merge branch 'master' into Issue-375
jawond Jul 6, 2020
2d053c0
Update vignettes/howto-dm-rows.Rmd
jawond Jul 7, 2020
3ce7a76
Update vignettes/howto-dm-rows.Rmd
jawond Jul 7, 2020
e829da7
Merge branch 'Issue-375' of https://github.com/jawond/dm into Issue-375
jawond Jul 7, 2020
2ac3b99
Merge remote-tracking branch 'upstream/master' into Issue-375
jawond Jul 21, 2020
e602f99
Merge remote-tracking branch 'upstream/master' into Issue-375
jawond Jul 22, 2020
6dde639
added dm_rows_x examples and explanation
jawond Jul 22, 2020
fb87bac
Merge remote-tracking branch 'upstream/master'
jawond Jul 22, 2020
ba8a470
committing before fetching latest
jawond Jul 29, 2020
e01cef9
Merge remote-tracking branch 'upstream/master'
jawond Jul 29, 2020
b97e3c1
First draft of howto-dm-rows finally finished
jawond Aug 10, 2020
f2b2636
Merge remote-tracking branch 'upstream/master'
jawond Aug 10, 2020
f4e5655
Merge branch 'master' into Issue-375
jawond Aug 10, 2020
58d66f1
Re-arranged text to improve presentation order of concepts and did a …
jawond Aug 12, 2020
a8639b2
Code style
krlmlr Aug 14, 2020
00c2064
Tweaks
krlmlr Aug 14, 2020
3dced2d
Reenable zooming section
krlmlr Aug 14, 2020
ff80cf7
Tweak
krlmlr Aug 14, 2020
dec44ad
Need tidyverse for tibble() and dm for the code
krlmlr Aug 14, 2020
5c6b434
Remove stub
krlmlr Aug 18, 2020
3a6d944
Move section
krlmlr Aug 18, 2020
c39999d
Tweaks
krlmlr Aug 18, 2020
1bfd7be
Add copy_to() code and text from howto-dm-db
krlmlr Aug 18, 2020
7674887
Reorganize, howto-dm-load -> howto-dm-db, remove howto-dm-load
krlmlr Aug 18, 2020
ef854fe
Merge remote-tracking branch 'origin/master' into Issue-375
krlmlr Aug 18, 2020
00e37dd
changes related to video call 20200818 with @krlmlr
jawond Sep 18, 2020
d10215a
Merge branch 'master' into Issue-375
krlmlr Oct 30, 2020
8d55e45
Fix duplicate vignette title, tweak
krlmlr Oct 30, 2020
16b0c29
Robust if imports are missing
krlmlr Oct 30, 2020
0b3f510
Merge branch 'master' into Issue-375
krlmlr Nov 16, 2020
4596d38
Tweaks
krlmlr Nov 17, 2020
53de1a6
Fix link
krlmlr Nov 17, 2020
d777ab4
Titles and index
krlmlr Nov 17, 2020
62c382b
Reshuffle
krlmlr Nov 17, 2020
7dddce5
Extract variable so that it can be built without dbplyr installed
krlmlr Nov 17, 2020
86880c6
Micro-tweaks
krlmlr Nov 17, 2020
2 changes: 1 addition & 1 deletion R/zoom.R
@@ -14,7 +14,7 @@
#'
#' `dm_discard_zoomed()` discards the zoomed table and returns the `dm` as it was before zooming.
#'
-#' Please refer to `vignette("dm-zoom-to-table", package = "dm")`
+#' Please refer to `vignette("tech-db-zoom", package = "dm")`
#' for a more detailed introduction.
#'
#' @inheritParams dm_add_pk
2 changes: 1 addition & 1 deletion man/dm_zoom_to.Rd
12 changes: 8 additions & 4 deletions pkgdown/_pkgdown.yml
@@ -156,12 +156,16 @@ navbar:
href: reference/index.html
- text: Tutorials
menu:
- text: Create a dm object from a database
href: articles/howto-dm-db.html
- text: Create a dm object from data frames
href: articles/howto-dm-df.html
- text: Introduction to relational data models
href: articles/howto-dm-theory.html
- text: Create a dm object from data frames
href: articles/howto-dm-df.html
- text: Create a dm object from a database
href: articles/howto-dm-db.html
- text: Copy data to and from a database
href: articles/howto-dm-copy.html
- text: Insert, update or remove rows in a database
href: articles/howto-dm-rows.html
- text: Technical articles
menu:
- text: Joining in relational data models
249 changes: 249 additions & 0 deletions vignettes/howto-dm-copy.Rmd
@@ -0,0 +1,249 @@
---
title: "Copy tables to and from a database"
date: "`r Sys.Date()`"
author: James Wondrasek, Kirill Müller
output: rmarkdown::html_vignette
vignette: >
%\VignetteEncoding{UTF-8}
%\VignetteIndexEntry{How to: Copy data to and from a database}
%\VignetteEngine{knitr::rmarkdown}
editor_options:
chunk_output_type: console
---


``````{r setup, include = FALSE}
source("setup/setup.R")
``````


In this tutorial we introduce {dm} methods and techniques for copying individual tables and entire relational data models into an RDBMS.
This is an integral part of the {dm} workflow.
Copying tables to an RDBMS is often a step in the process of building a relational data model from locally hosted data.
If your data model is complete, copying it to an RDBMS in a single operation allows you to leverage the power of the database and make it accessible to others.
To modify your data at the row level and persist those changes, see `vignette("howto-dm-rows")`.

## Copy models or copy tables?

Using {dm} you can persist an entire relational data model with a single function call.
`copy_dm_to()` will move your entire model into a destination RDBMS.
This may be all you need to deploy a new model.
Alternatively, you may want to add new tables to a model that already exists on an RDBMS.
This can be done with the `compute()` and `copy_to()` methods.

Calling `compute()` or `copy_to()` requires write permission on the RDBMS, otherwise an error is returned.
Therefore, for the following examples, we instantiate a test dm and move it into a local SQLite database with full permissions.
{dm} and {dbplyr} are designed so there is no difference between the code used to manipulate a local SQLite database and a remote RDBMS.
The steps for this were already introduced in `vignette("howto-dm-db")` and are discussed in more detail in the [Persisting a relational model with `copy_dm_to()`](#copy-model) section.

``````{r }
library(dm)
library(tidyverse)
library(dbplyr)

fin_dm <-
dm_financial() %>%
dm_select_tbl(-trans) %>%
collect()

local_db <- DBI::dbConnect(RSQLite::SQLite())
deployed_dm <- copy_dm_to(local_db, fin_dm, temporary = FALSE)
``````


## Copying and persisting individual tables {#copying-tables}

As part of your data analysis you may combine tables from multiple sources and create links to existing tables via foreign keys, or create new tables holding data summaries.
The example below, already discussed in `vignette("howto-dm-db")`, computes the total amount of all loans for each account.

``````{r}
my_dm_total <-
deployed_dm %>%
dm_zoom_to(loans) %>%
group_by(account_id) %>%
summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
ungroup() %>%
dm_insert_zoomed("total_loans")
``````

The derived table `total_loans` is a *lazy table* powered by the {[dbplyr](https://dbplyr.tidyverse.org/)} package: the results are not materialized, instead an SQL query is built and executed each time the data is requested.

``````{r}
my_dm_total$total_loans %>%
sql_render()
``````

To avoid recomputing the query every time you use `total_loans`, call `compute()` right before inserting the derived table with `dm_insert_zoomed()`.
`compute()` forces the computation of a query and stores the full results in a table on the RDBMS.

``````{r}
my_dm_total_computed <-
deployed_dm %>%
dm_zoom_to(loans) %>%
group_by(account_id) %>%
summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
ungroup() %>%
compute() %>%
dm_insert_zoomed("total_loans")

my_dm_total_computed$total_loans %>%
sql_render()
``````

```{r echo = FALSE}
remote_name_total_loans <- ""
remote_name_total_loans <- remote_name(my_dm_total_computed$total_loans)
```

Note the differences in the queries returned by `sql_render()`.
`my_dm_total$total_loans` is still lazily evaluated: the full query, built from the chain of operations that generated it, is still in place and must be run each time the table is accessed.
Contrast that with `my_dm_total_computed$total_loans`, where the query has been realized and accessing its rows requires only a simple `SELECT *` statement.
The table name, `` `r remote_name_total_loans` ``, was generated automatically because the `name` argument was not supplied.
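The same contrast can be reproduced on a single table with plain {dbplyr}. The following is a minimal sketch, not part of the vignette's evaluated code: the three-row `loans` table is made-up data standing in for the real one, and the chunk is shown without being run here.

```r
library(dplyr)
library(dbplyr)

# Hypothetical toy data standing in for the `loans` table
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  con, "loans",
  data.frame(account_id = c(1L, 1L, 2L), amount = c(100, 50, 300))
)

lazy_totals <-
  tbl(con, "loans") %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup()

# Still a query: the rendered SQL contains the aggregation
lazy_sql <- as.character(sql_render(lazy_totals))

# compute() runs the query once and stores the result in a temporary table
computed_totals <- compute(lazy_totals)
computed_sql <- as.character(sql_render(computed_totals))

lazy_sql
computed_sql

DBI::dbDisconnect(con)
```

The first string should contain a `GROUP BY` clause, while the second should only select from the automatically named table.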

The default is to create a temporary table.
If you want results to persist across sessions in permanent tables, `compute()` must be called with the argument `temporary = FALSE` and a table name for the `name` argument.
See `?compute` for more details.
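As a hedged illustration of the persistent case, the sketch below uses plain {dbplyr} on a file-backed SQLite database. The table name `total_loans_persisted` and the toy data are assumptions made for this example only.

```r
library(dplyr)
library(dbplyr)

# A file-backed SQLite database, so a permanent table can outlive the connection
db_file <- tempfile(fileext = ".sqlite")
con <- DBI::dbConnect(RSQLite::SQLite(), db_file)
DBI::dbWriteTable(
  con, "loans",
  data.frame(account_id = c(1L, 1L, 2L), amount = c(100, 50, 300))
)

# `name` and `temporary = FALSE` together create a permanent, named table
totals <-
  tbl(con, "loans") %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  compute(name = "total_loans_persisted", temporary = FALSE)

DBI::dbDisconnect(con)

# A new connection (a new "session") still sees the computed table
con2 <- DBI::dbConnect(RSQLite::SQLite(), db_file)
persisted <- DBI::dbReadTable(con2, "total_loans_persisted")
DBI::dbDisconnect(con2)

persisted
```

With the default `temporary = TRUE`, the table would not be available after reconnecting.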

When called on a whole dm object (without zoom), `compute()` materializes all tables into new (temporary or persistent) tables by executing the associated SQL query and storing the full results.
Depending on the size of your data, this may take considerable time or be infeasible.
It can occasionally be useful for creating snapshots of data that is subject to change.

``````{r }
my_dm_total_snapshot <-
my_dm_total %>%
compute()
``````
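To make the snapshot idea concrete, here is a small sketch using a single {dbplyr} table as a stand-in for a whole dm. The toy `loans` table and the `UPDATE` statement are illustrative assumptions; the point is only that a computed copy no longer follows changes to the table it was derived from.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "loans", data.frame(id = 1:2, amount = c(100, 300)))

live <- tbl(con, "loans")
# compute() copies the current rows into a new temporary table
snapshot <- compute(live)

# Simulate a later change to the underlying data
DBI::dbExecute(con, "UPDATE loans SET amount = 0")

live_amounts <- live %>% arrange(id) %>% pull(amount)
snapshot_amounts <- snapshot %>% arrange(id) %>% pull(amount)

DBI::dbDisconnect(con)

live_amounts
snapshot_amounts
```

The live table reflects the update, while the snapshot keeps the values it had when `compute()` was called.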


## Adding local data frames to an RDBMS {#data-frames}

If you need to add local data frames to an existing dm object, use the `copy_to()` method.
It takes the same arguments as `copy_dm_to()`, except the second argument takes a data frame rather than a dm.
The result is a derived dm object that contains the new table.

To demonstrate the use of `copy_to()`, the example below uses {dm} to pull consolidated data from several tables out of an RDBMS, estimates a linear model from the data, then inserts the residuals back into the RDBMS and links them to the existing tables.
This is all done with a local SQLite database, but the process would work unchanged on any supported RDBMS.

``````{r}
loans_df <-
deployed_dm %>%
dm_squash_to_tbl(loans) %>%
select(id, amount, duration, A3) %>%
collect()
``````

Note the use of `dm_squash_to_tbl()`.
This method gathers all linked information into a single wide table.
It follows foreign key relations starting from the table supplied as its argument and gathers all the columns from related tables, disambiguating column names as it goes.

In the above code, the `select()` statement isolates the columns we need for our model.
`collect()` works similarly to `compute()` by forcing the execution of the underlying SQL query, but it returns the results as a local tibble.
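The difference between the two verbs shows up in the class of what they return. The sketch below assumes a made-up `loans` table in an in-memory SQLite database and uses plain {dplyr}/{dbplyr} rather than a dm.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(
  con, "loans",
  data.frame(id = 1:3, amount = c(100, 250, 400), duration = c(12L, 24L, 36L))
)

query <- tbl(con, "loans") %>% select(id, amount)

remote_result <- compute(query) # result stays in the database
local_result <- collect(query)  # result is pulled into R as a tibble

is_remote <- inherits(remote_result, "tbl_lazy")
is_local <- inherits(local_result, "tbl_df") && !inherits(local_result, "tbl_lazy")

class(remote_result)
class(local_result)

DBI::dbDisconnect(con)
```

Only the collected result can be passed to functions such as `lm()` that expect an in-memory data frame.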

Below, the local tibble, `loans_df`, is used to estimate the linear model and the residuals are stored along with the original associated `id` in a new tibble, `loans_residuals`.
The `id` column is necessary to link the new tibble to the tables in the dm it was collected from.

``````{r}
model <- lm(amount ~ duration + A3, data = loans_df)

loans_residuals <- tibble::tibble(
id = loans_df$id,
resid = unname(residuals(model))
)

loans_residuals
``````
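One caveat when pairing residuals with `id`s, sketched here with made-up data rather than the financial dataset: if a model variable contains missing values, `lm()` drops those rows under `na.omit`, and the residual vector becomes shorter than the data. Fitting with `na.action = na.exclude` is one way to keep residuals aligned with the original rows. Whether the financial data actually contains missing values in these columns is not stated in this tutorial.

```r
# Hypothetical data: one loan has a missing `duration`
loans_toy <- data.frame(
  id = 1:5,
  amount = c(100, 220, 310, 390, 520),
  duration = c(12, 24, NA, 48, 60)
)

# na.omit drops the incomplete row, so residuals no longer match the ids
fit_omit <- lm(amount ~ duration, data = loans_toy, na.action = na.omit)

# na.exclude also drops the row for fitting, but pads residuals with NA
fit_exclude <- lm(amount ~ duration, data = loans_toy, na.action = na.exclude)

length(residuals(fit_omit))
length(residuals(fit_exclude))

resid_tbl <- data.frame(
  id = loans_toy$id,
  resid = unname(residuals(fit_exclude))
)
resid_tbl
```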

Adding `loans_residuals` to the dm is done using `copy_to()`.
The call to the method includes the argument `temporary = FALSE` because we want this table to persist beyond our current session.
In the same pipeline we create the necessary primary and foreign keys to integrate the table with the rest of our relational model.
For more information on key creation see `vignette("howto-dm-db")` and `vignette("howto-dm-theory")`.

``````{r}
my_dm_sqlite_resid <-
copy_to(deployed_dm, loans_residuals, temporary = FALSE) %>%
dm_add_pk(loans_residuals, id) %>%
dm_add_fk(loans_residuals, id, loans)

my_dm_sqlite_resid %>%
dm_set_colors(violet = loans_residuals) %>%
dm_draw()
my_dm_sqlite_resid %>%
dm_examine_constraints()
my_dm_sqlite_resid$loans_residuals
``````
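To show what `dm_examine_constraints()` guards against, the following sketch builds two small local dm objects from made-up tibbles: one where every residual `id` exists in `loans`, and one with a stray `id`. The helper `build_dm()` and all data are hypothetical and exist only for this illustration.

```r
library(dm)
library(dplyr)

loans <- tibble(id = 1:3, amount = c(100, 250, 400))
resid_ok <- tibble(id = 1:3, resid = c(-1.5, 0.5, 1.0))
# The id 99 has no matching row in `loans`
resid_bad <- tibble(id = c(1L, 2L, 99L), resid = c(-1.5, 0.5, 1.0))

# Hypothetical helper: same keys as in the tutorial, on local tables
build_dm <- function(residuals_tbl) {
  dm(loans = loans, loans_residuals = residuals_tbl) %>%
    dm_add_pk(loans, id) %>%
    dm_add_pk(loans_residuals, id) %>%
    dm_add_fk(loans_residuals, id, loans)
}

check_ok <- dm_examine_constraints(build_dm(resid_ok))
check_bad <- dm_examine_constraints(build_dm(resid_bad))

check_ok
check_bad
```

In the second result, the foreign key from `loans_residuals` to `loans` should be reported as violated.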


## Persisting a relational model with `copy_dm_to()` {#copy-model}

Persistence, because it is intended to make permanent changes, requires write access to the target RDBMS.
The code below repeats the code from the "Copy models or copy tables?" section at the beginning of the tutorial.
It uses the {dm} convenience function `dm_financial()` to create a dm object corresponding to a data model from a public dataset repository.
The dm object is downloaded locally first, before deploying it to a local SQLite database.

`dm_select_tbl()` is used to exclude the transaction table `trans` due to its size, then the `collect()` method retrieves the remaining tables and returns them as a local dm object.

``````{r }
dm_financial() %>%
dm_nrow()
fin_dm <-
dm_financial() %>%
dm_select_tbl(-trans) %>%
collect()

fin_dm
``````

It is just as simple to move a local relational model into an RDBMS.

``````{r }
destination_db <- DBI::dbConnect(RSQLite::SQLite())

deployed_dm <-
copy_dm_to(destination_db, fin_dm, temporary = FALSE)

deployed_dm
``````

Note that in the call to `copy_dm_to()` the argument `temporary = FALSE` is supplied.
Without this argument the model would still be copied into the database, but the argument would default to `temporary = TRUE` and the data would be deleted once your session ends.
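The effect of `temporary` can be checked by reconnecting to a file-backed database. This is a sketch under stated assumptions: `toy_dm` is a made-up two-table model, `tables_after_reconnect()` is a hypothetical helper, and closing and reopening the connection stands in for ending a session.

```r
library(dm)
library(dplyr)

# Hypothetical two-table model with one foreign key
toy_dm <-
  dm(
    accounts = tibble(id = 1:2),
    loans = tibble(id = 1:3, account_id = c(1L, 1L, 2L))
  ) %>%
  dm_add_pk(accounts, id) %>%
  dm_add_pk(loans, id) %>%
  dm_add_fk(loans, account_id, accounts)

# Copy the model, disconnect, reconnect, and list the tables that survived
tables_after_reconnect <- function(temporary) {
  db_file <- tempfile(fileext = ".sqlite")
  con <- DBI::dbConnect(RSQLite::SQLite(), db_file)
  copy_dm_to(con, toy_dm, temporary = temporary)
  DBI::dbDisconnect(con)

  con <- DBI::dbConnect(RSQLite::SQLite(), db_file)
  on.exit(DBI::dbDisconnect(con))
  DBI::dbListTables(con)
}

kept <- tables_after_reconnect(temporary = FALSE)
lost <- tables_after_reconnect(temporary = TRUE)

kept
lost
```

Only the run with `temporary = FALSE` should still list `accounts` and `loans` after reconnecting.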

In the output you can observe that the `src` for `deployed_dm` is SQLite, while for `fin_dm` the source is not indicated because it is a local data model.

Copying a relational model into an empty database is the simplest use case for `copy_dm_to()`.
If you want to copy a model into an RDBMS that is already populated, be aware that `copy_dm_to()` will not overwrite pre-existing tables.
In this case you will need to use the `table_names` argument to give the tables unique names.

`table_names` can be a named character vector, with the names matching the table names in the dm object and the values containing the desired names in the RDBMS, or a function or one-sided formula.
In the example below, `paste0()` is used to add a prefix to the table names to provide uniqueness.

``````{r }
dup_dm <-
copy_dm_to(destination_db, fin_dm, temporary = FALSE, table_names = ~ paste0("dup_", .x))

dup_dm
remote_name(dup_dm$accounts)
remote_name(deployed_dm$accounts)
``````

Note the different table names for `dup_dm$accounts` and `deployed_dm$accounts`.
For both, the table name is `accounts` in the dm, but they link to different tables on the database.
In `dup_dm` the table is backed by the table `dup_accounts` in the RDBMS.
`deployed_dm$accounts` is still backed by the `accounts` table from the `copy_dm_to()` operation we performed in the preceding example.
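The three accepted forms of `table_names` describe the same mapping from dm table names to database table names. The sketch below only demonstrates that equivalence with {rlang} helpers; it does not call `copy_dm_to()`, and whether {dm} converts formulas with `rlang::as_function()` internally is an assumption.

```r
library(rlang)

dm_table_names <- c("accounts", "loans")

# One-sided formula, converted to a function
add_prefix <- as_function(~ paste0("dup_", .x))
from_formula <- add_prefix(dm_table_names)

# Ordinary function
from_function <- (function(x) paste0("dup_", x))(dm_table_names)

# Named character vector: names are dm table names, values are database names
as_named_vector <- set_names(from_formula, dm_table_names)

from_formula
as_named_vector
```

The named vector form is the most explicit and is convenient when only some names follow a pattern.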

Managing tables in the RDBMS is outside the scope of {dm}.
If you find you need to remove tables or perform operations directly on the RDBMS, see the {[DBI](https://dbi.r-dbi.org/)} package.
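As a brief sketch of such housekeeping with {DBI}, the example below creates two made-up tables in an in-memory SQLite database and removes one of them. The table names are illustrative only.

```r
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "accounts", data.frame(id = 1:2))
DBI::dbWriteTable(con, "dup_accounts", data.frame(id = 1:2))

before <- DBI::dbListTables(con)

# Remove the duplicate table, but only if it exists
if (DBI::dbExistsTable(con, "dup_accounts")) {
  DBI::dbRemoveTable(con, "dup_accounts")
}

after <- DBI::dbListTables(con)
DBI::dbDisconnect(con)

before
after
```

Any dm object that still points at a removed table will fail when that table is next accessed, so remove tables with care.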

## Conclusion

{dm} makes it straightforward to deploy your complete relational model to an RDBMS using the `copy_dm_to()` function.
For tables that are created from a relational model during analysis or development, `compute()` and `copy_to()` can be used to persist them between sessions or to copy local tables to a database dm.
The `collect()` method downloads an entire dm object from the database, provided it fits into memory.


## Next steps

If you need finer-grained control over modifications to your relational model, see `vignette("howto-dm-rows")` for an introduction to row-level operations, including updates, insertions, deletions and patching.

If you feel you need to know more about relational data models in order to get the most out of dm, check out `vignette("howto-dm-theory")`.

If you're familiar with relational data models but want to know how to work with them in dm, then any of `vignette("tech-dm-join")`, `vignette("tech-dm-filter")`, or `vignette("tech-dm-zoom")` is a good next step.