Docs: edits to copy and modify vignettes #1098

Merged
merged 30 commits into from Jul 6, 2022
Changes from 28 commits
ec23e1d
Docs: edits to copy and modify vignettes
IndrajeetPatil Jun 16, 2022
eabc637
Update vignettes/howto-dm-copy.Rmd
IndrajeetPatil Jun 17, 2022
7b5d39c
Update vignettes/howto-dm-rows.Rmd
IndrajeetPatil Jun 17, 2022
fb6ff5d
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 17, 2022
3745dc8
revert added piping
IndrajeetPatil Jun 17, 2022
8b28a8b
created issue so the comment not needed anymore
IndrajeetPatil Jun 17, 2022
0836644
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 19, 2022
3795173
mention `dm_rows_append()`
IndrajeetPatil Jun 19, 2022
d05c881
title
IndrajeetPatil Jun 19, 2022
63c119d
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 19, 2022
fa926d7
scrap references to dm_rows_truncate
IndrajeetPatil Jun 19, 2022
cb8ddd0
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 20, 2022
2115a4a
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 21, 2022
61a7a3b
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 21, 2022
21737fd
also anticipate quoted args
IndrajeetPatil Jun 22, 2022
1356f9d
Revert "also anticipate quoted args"
IndrajeetPatil Jun 22, 2022
e078883
Update vignettes/setup/setup.R
IndrajeetPatil Jun 22, 2022
e6cbe2c
support both styles for calling library
IndrajeetPatil Jun 22, 2022
8002f0e
mention argument for persisting
IndrajeetPatil Jun 22, 2022
07ab1d2
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 22, 2022
ce6c781
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 27, 2022
49ea232
Update vignettes/howto-dm-rows.Rmd
IndrajeetPatil Jun 27, 2022
3f53afa
Merge branch 'main' into edits_dm_copy_and_update
krlmlr Jun 28, 2022
2826379
add clarification
IndrajeetPatil Jun 28, 2022
afb14e1
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jun 29, 2022
b3692c5
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jul 4, 2022
a5c6b83
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jul 4, 2022
3ee6a83
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jul 4, 2022
889cdcb
Merge branch 'cynkra:main' into edits_dm_copy_and_update
IndrajeetPatil Jul 5, 2022
991ce20
Add method
krlmlr Jul 6, 2022
40 changes: 20 additions & 20 deletions vignettes/howto-dm-copy.Rmd
@@ -17,7 +17,7 @@ source("setup/setup.R")
``````


In this tutorial we introduce {dm} methods and techniques for copying individual tables and entire relational data models into an RDBMS.
In this tutorial, we introduce {dm} methods and techniques for copying individual tables and entire relational data models into a relational database management system (RDBMS).
This is an integral part of the {dm} workflow.
Copying tables to an RDBMS is often a step in the process of building a relational data model from locally hosted data.
If your data model is complete, copying it to an RDBMS in a single operation allows you to leverage the power of the database and make it accessible to others.
@@ -31,9 +31,9 @@ This may be all you need to deploy a new model.
You may want to add new tables to an existing model on an RDBMS.
These requirements can be handled using the `compute()` and `copy_to()` methods.

Calling `compute()` or `copy_to()` requires write permission on the RDBMS, otherwise an error is returned.
Therefore for the following examples we will instantiate a test dm and move it into a local SQLite database with full permissions.
{dm} and {dbplyr} are designed so there is no difference between the code used to manipulate a local SQLite database and a remote RDBMS.
Calling `compute()` or `copy_to()` requires write permission on the RDBMS; otherwise, an error is returned.
Therefore, for the following examples, we will instantiate a test `dm` object and move it into a local SQLite database with full permissions.
{dm} and {dbplyr} are designed to treat the code used to manipulate a **local** SQLite database and a **remote** RDBMS similarly.
The steps for this were already introduced in `vignette("howto-dm-db")` and will be discussed in more detail in the [Copying a relational model](#copy-model) section.

``````{r }
@@ -53,7 +53,7 @@ deployed_dm <- copy_dm_to(local_db, fin_dm, temporary = FALSE)

## Copying and persisting individual tables {#copying-tables}

As part of your data analysis you may combine tables from multiple sources and create links to existing tables via foreign keys, or create new tables holding data summaries.
As part of your data analysis, you may combine tables from multiple sources and create links to existing tables via foreign keys, or create new tables holding data summaries.
The example below, already discussed in `vignette("howto-dm-db")`, computes the total amount of all loans for each account.

``````{r}
@@ -97,16 +97,16 @@ stopifnot(grepl(remote_name_total_loans, sql_render(my_dm_total_computed$total_l
```

Note the differences in queries returned by `sql_render()`.
`my_dm_total$total_loans` is still being lazily evaluated and the full query constructed from the chain of operations that generated it, and is required to run to access it, is still in place.
Contrast that with `my_dm_total_computed$total_loans` where the query has been realized and accessing its rows requires a simple `SELECT *` statement.
`my_dm_total$total_loans` is still being lazily evaluated and the full query constructed from the chain of operations that generated it is still in place and needs to be run to access it.
Contrast that with `my_dm_total_computed$total_loans`, where the query has been realized and accessing its rows requires a simple `SELECT *` statement.
The table name, `` `r remote_name_total_loans` ``, was automatically generated as the `name` argument was not supplied to `compute()`.

The default is to create a temporary table.
If you want results to persist across sessions in permanent tables, `compute()` must be called with the argument `temporary = FALSE` and a table name for the `name` argument.
The default is to create a **temporary** table.
If you want results to persist across sessions in **permanent** tables, `compute()` must be called with the argument `temporary = FALSE` and a table name for the `name` argument.
See `?compute` for more details.
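
A minimal sketch of the temporary versus permanent distinction described above, using plain {dplyr}/{dbplyr} on an in-memory SQLite database rather than the vignette's `fin_dm`. The `loans` toy data and the table name `total_loans` are assumptions made up for illustration.

```r
library(dplyr)

# Assumption: a throwaway in-memory SQLite database stands in for the RDBMS.
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical toy data, not the financial dataset used in the vignette.
loans <- copy_to(
  con,
  data.frame(account_id = c(1, 1, 2), amount = c(100, 50, 30)),
  name = "loans"
)

# Still lazy: only the query is stored, nothing has been materialized yet.
total_loans_lazy <- loans %>%
  group_by(account_id) %>%
  summarise(total_amount = sum(amount, na.rm = TRUE))

# Materialize the query into a permanent, explicitly named table.
total_loans <- compute(
  total_loans_lazy,
  name = "total_loans",
  temporary = FALSE
)

# Record what happened before closing the connection.
persisted <- DBI::dbExistsTable(con, "total_loans")
result <- total_loans %>% arrange(account_id) %>% collect()

DBI::dbDisconnect(con)
```

With an in-memory database the "permanent" table still vanishes when the connection closes; against a file-backed or remote database it would survive the session.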

When called on a whole dm object (without zoom), `compute()` materializes all tables into new (temporary or persistent) tables by executing the associated SQL query and storing the full results.
Depending on the size of your data this may take considerable time or be infeasible.
When called on a whole `dm` object (without zoom), `compute()` materializes all tables into new (temporary or persistent) tables by executing the associated SQL query and storing the full results.
Depending on the size of your data, this may take considerable time or may even be unfeasible.
It may be useful occasionally to create snapshots of data that is subject to change.

``````{r }
@@ -118,11 +118,11 @@ my_dm_total_snapshot <-

## Adding local data frames to an RDBMS {#data-frames}

If you need to add local data frames to an existing dm object, use the `copy_to()` method.
If you need to add local data frames to an existing `dm` object, use the `copy_to()` method.
It takes the same arguments as `copy_dm_to()`, except the second argument takes a data frame rather than a dm.
The result is a derived dm object that contains the new table.
The result is a derived `dm` object that contains the new table.

To demonstrate the use of `copy_to()` the example below will use {dm} to pull consolidated data from several tables out of an RDBMS, estimate a linear model from the data, then insert the residuals back into the RDBMS and link it to the existing tables.
To demonstrate the use of `copy_to()`, the example below will use {dm} to pull consolidated data from several tables out of an RDBMS, estimate a linear model from the data, then insert the residuals back into the RDBMS and link it to the existing tables.
This is all done with a local SQLite database, but the process would work unchanged on any supported RDBMS.

``````{r}
@@ -157,7 +157,7 @@ loans_residuals
Adding `loans_residuals` to the dm is done using `copy_to()`.
The call to the method includes the argument `temporary = FALSE` because we want this table to persist beyond our current session.
In the same pipeline we create the necessary primary and foreign keys to integrate the table with the rest of our relational model.
For more information on key creation see `vignette("howto-dm-db")` and `vignette("howto-dm-theory")`.
For more information on key creation, see `vignette("howto-dm-db")` and `vignette("howto-dm-theory")`.

``````{r}
my_dm_sqlite_resid <-
@@ -206,7 +206,7 @@ deployed_dm
``````

Note that in the call to `copy_dm_to()` the argument `temporary = FALSE` is supplied.
Without this argument the model would still be copied into the database, but the argument would default to `temporary = TRUE` and the data would be deleted once your session ends.
Without this argument, the model would still be copied into the database, but the argument would default to `temporary = TRUE` and the data would be deleted once your session ends.

In the output you can observe that the `src` for `deployed_dm` is SQLite, while for `fin_dm` the source is not indicated because it is a local data model.

@@ -227,8 +227,8 @@ remote_name(deployed_dm$accounts)
``````

Note the different table names for `dup_dm$accounts` and `deployed_dm$accounts`.
For both, the table name is `accounts` in the dm, but they link to different tables on the database.
In `dup_dm` the table is backed by the table `dup_accounts` in the RDBMS.
For both, the table name is `accounts` in the `dm` object, but they link to different tables on the database.
In `dup_dm`, the table is backed by the table `dup_accounts` in the RDBMS.
`dm_deployed$accounts` shows us that this table is still backed by the `accounts` table from the `copy_dm_to()` operation we performed in the preceding example.

Managing tables in the RDBMS is outside the scope of `dm`.
@@ -244,8 +244,8 @@ DBI::dbDisconnect(local_db)
## Conclusion {#conclusion}

`dm` makes it straightforward to deploy your complete relational model to an RDBMS using the `copy_dm_to()` function.
For tables that are created from a relational model during analysis or development, `compute()` and `copy_to()` can be used to persist them between sessions or to copy local tables to a database dm.
The `collect()` method downloads an entire dm object that fits into memory from the database.
For tables that are created from a relational model during analysis or development, `compute()` and `copy_to()` can be used to persist them (using argument `temporary = FALSE`) between sessions or to copy local tables to a database `dm`.
The `collect()` method downloads an entire `dm` object that fits into memory from the database.


## Further Reading
70 changes: 25 additions & 45 deletions vignettes/howto-dm-rows.Rmd
@@ -1,5 +1,5 @@
---
title: "Insert, update or remove rows in a database dm"
title: "Insert, update, or remove rows in a database"
date: "`r Sys.Date()`"
author: James Wondrasek
output: rmarkdown::html_vignette
@@ -18,25 +18,25 @@ source("setup/setup.R")


This tutorial introduces the methods {dm} provides for modifying the data in the tables of a relational model.
There are 6 methods:
There are 5 methods:

* [`dm_rows_insert()`](#insert) - adds new rows
Collaborator

`dm_rows_append()`

Contributor Author

Should all instances of `dm_rows_insert()` be replaced by `dm_rows_append()`?
Or should the vignette mention both of these options?

Collaborator

We have both, but `dm_rows_insert()` seems to be much less useful than `dm_rows_append()`.

Contributor Author

Okay, in that case, I think we should only mention `dm_rows_append()`?

The fewer functions the user needs to learn about the better, I think, especially if two functions are doing the same thing.

Collaborator

They are different, though -- `dm_rows_insert()` will ditch duplicates or give an error; in some cases this is what we want. We can take this on in a separate PR though.

* [`dm_rows_update()`](#update) - changes values in rows
* [`dm_rows_patch()`](#patch) - fills in missing values
* [`dm_rows_upsert()`](#upsert) - adds new rows or changes values if pre-existing
* [`dm_rows_delete()`](#delete) - deletes rows
* [`dm_rows_truncate()`](#truncate) - removes all rows, leaving table structure intact

## The dm_rows_* process

All six methods take the same arguments and using them follows the same process:

1. Create a temporary *changeset dm* that defines the intended changes on the RDBMS
1. Create a temporary *changeset dm* object that defines the intended changes on the RDBMS
1. If desired, simulate changes with `in_place = FALSE` to double-check
1. Apply changes with `in_place = TRUE`.

To start, a dm object is created containing the tables, and rows, that you want to change.
This changeset dm is then copied into the same source as the dm you want to modify.

To start, a `dm` object is created containing the tables and rows that you want to change.
This changeset `dm` is then copied into the same source as the dm you want to modify.
With the dm in the same RDBMS as the destination dm, you call the appropriate method, such as `dm_rows_insert()`, to make your planned changes, along with an argument of `in_place = FALSE` so you can confirm you achieve the changes that you want.

This verification can be done visually, looking at row counts and the like, or using {dm}'s constraint checking method, `dm_examine_constraints()`.
@@ -47,28 +47,29 @@ With the changes confirmed, you execute the method again, this time with the arg
Note that `in_place = FALSE` is the default: you must opt in to actually change data on the database.
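
The simulate-then-apply process can be sketched as follows. This is a hedged illustration adapted from the reprex posted in this PR's review thread, not a definitive recipe: the `changeset` and `n_*` names are made up here, and the row counts are only what one would expect if `dm_rows_insert()` behaves as the vignette describes.

```r
library(dm)
library(dplyr)

# Same toy model as in the vignette: two tables linked by a foreign key.
parent <- tibble(value = c("A", "B", "C"), pk = 1:3)
child <- tibble(value = c("a", "b", "c"), pk = 1:3, fk = c(1L, 1L, NA))
demo_dm <-
  dm(parent = parent, child = child) %>%
  dm_add_pk(parent, pk) %>%
  dm_add_pk(child, pk) %>%
  dm_add_fk(child, fk, parent)

# Assumption: an in-memory SQLite database stands in for the RDBMS.
sqlite_db <- DBI::dbConnect(RSQLite::SQLite())
demo_sql <- copy_dm_to(sqlite_db, demo_dm, temporary = FALSE)

# Step 1: build the changeset dm and copy it to the same source.
changeset_local <- dm(
  parent = tibble(value = "D", pk = 4L),
  child = tibble(value = "d", pk = 4L, fk = 4L)
)
changeset <- copy_dm_to(sqlite_db, changeset_local, temporary = TRUE)

# Step 2: simulate. The returned dm shows the change, the database does not.
simulated <- dm_rows_insert(demo_sql, changeset, in_place = FALSE)
n_simulated <- simulated %>% pull_tbl(child) %>% collect() %>% nrow()
n_before <- demo_sql %>% pull_tbl(child) %>% collect() %>% nrow()

# Step 3: apply. Only now are the rows written to the database.
dm_rows_insert(demo_sql, changeset, in_place = TRUE)
n_after <- demo_sql %>% pull_tbl(child) %>% collect() %>% nrow()

DBI::dbDisconnect(sqlite_db)
```

Passing `in_place = FALSE` explicitly, rather than relying on the default, also silences the "Not persisting" message shown in the reprex.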

Each method has its own requirements in order to maintain database consistency.
These involve constraints on primary key values as they are how rows are identified.
These involve constraints on primary key values that uniquely identify rows.

maelle marked this conversation as resolved.

| Method | Requirements |
|--------|--------------|
| `dm_rows_insert()` | The primary keys must differ from existing records.|
| `dm_rows_insert()` | Records with existing primary keys are silently ignored (via `dplyr::rows_insert(conflict = "ignore")`). |
| `dm_rows_append()` | All records are inserted; the underlying database might check for uniqueness of primary keys (and fail the operation) if a constraint is set. |
| `dm_rows_update()` | Primary keys must match for all records to be updated.|
| `dm_rows_patch()` | Updates missing values in existing records. Primary keys must match for all records to be patched.|
| `dm_rows_upsert()` | Updates existing records and adds new records, based on the primary key.|
| `dm_rows_delete()` | Removes matching records based on the primary key.|
| `dm_rows_truncate()` | Removes all records, only for tables in the changeset dm.|
| `dm_rows_delete()` | Removes matching records based on the primary key. Primary keys must match for all records to be deleted.|
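
The key-handling differences in the table can be tried out on a single local table with the {dplyr} verbs that share these names. This is a sketch only: it assumes dplyr 1.1.0 or later (for `conflict` and `rows_append()`), and the `incoming` tibble is made up for illustration.

```r
library(dplyr)

# Existing table; `pk` plays the role of the primary key.
parent <- tibble(pk = 1:3, value = c("A", NA, "C"))

# Hypothetical incoming rows: pk 3 already exists, pk 4 is new.
incoming <- tibble(pk = 3:4, value = c("C2", "D"))

# Insert: rows whose key already exists are ignored.
inserted <- rows_insert(parent, incoming, by = "pk", conflict = "ignore")

# Append: every row is added, no key check on a local data frame.
appended <- rows_append(parent, incoming)

# Upsert: existing keys are updated, new keys are added.
upserted <- rows_upsert(parent, incoming, by = "pk")

# Patch: only missing values are filled in for matching keys.
patched <- rows_patch(parent, tibble(pk = 2L, value = "B"), by = "pk")
```

Here `appended` ends up with two rows for `pk = 3`, which is the situation a primary key constraint on the database would reject.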

To ensure the integrity of all relations during the process, all methods automatically determine the correct processing order for the tables involved.
For operations that create records, parent tables are processed before child tables.
For `dm_rows_delete()` and `dm_rows_truncate()`, child tables are processed before their parent tables.
For operations that create records, parent tables (which hold primary keys) are processed before child tables (which hold foreign keys).
For `dm_rows_delete()`, child tables are processed before their parent tables.
Note that the user is still responsible for setting transactions to ensure integrity of operations across multiple tables.
For more details on this see `vignette("howto-dm-theory")` and `vignette("howto-dm-db")`.

IndrajeetPatil marked this conversation as resolved.
## Usage {#usage}

To demonstrate the use of these table modifying methods we will create a simple dm object with two tables linked by a foreign key.
Note the foreign key of `NA` in the `child` table.
To demonstrate the use of these table modifying methods, we will create a simple `dm` object with two tables linked by a foreign key.
Note that the `child` table has a foreign key missing (`NA`).

``````{r }
``````{r}
library(tidyverse)
library(dm)
parent <- tibble(value = c("A", "B", "C"), pk = 1:3)
@@ -87,9 +88,9 @@ demo_dm %>%

{dm} doesn't check your key values when you create a dm, we add this check:[^null-fk]

[^null-fk]: Be aware that when using `dm_examine_constraints()` NULL (`NA`) foreign keys are allowed and will be counted as a match.
[^null-fk]: Be aware that when using `dm_examine_constraints()`, missing (denoted by `NULL` in SQL and `NA` in R) foreign keys are allowed and will be counted as a match.
In some cases this doesn't make sense and non-NULL columns should be enforced by the RDBMS.
Currently {dm} does not specify or check non-NULL constraints for columns.
Currently, {dm} does not specify or check non-NULL constraints for columns.

``````{r }
dm_examine_constraints(demo_dm)
@@ -108,12 +109,12 @@ demo_sql
``````

{dm}'s table modification methods can be piped together to create a repeatable sequence of operations that returns a dm incorporating all the changes required.
This is a common use case for {dm} -- building by hand a sequence of operations using temporary results until it is complete and correct, then committing the result.
This is a common use case for {dm} -- manually building a sequence of operations using temporary results until it is complete and correct, and then committing the result.

## `dm_rows_insert()` {#insert}

To demonstrate `dm_rows_insert()` we create a dm with tables containing the rows to insert and copy it to `sqlite_db`, the same source as `demo_sql`.
For all of the `dm_rows_*` methods the source and destination dm objects must be in the same RDBMS.
To demonstrate `dm_rows_insert()`, we create a dm with tables containing the rows to insert and copy it to `sqlite_db`, the same source as `demo_sql`.
For all of the `dm_rows_*` methods, the source and destination `dm` objects must be in the same RDBMS.
You will get an error message if this is not the case.

The code below adds `parent` and `child` table entries for the letter "D".
@@ -137,7 +138,7 @@ dm_insert_out <-
dm_rows_insert(dm_insert_in)
``````

This gives us a warning that changes will not be persisted.
This gives us a warning that changes will not persist (i.e., they are temporary).
Collaborator

Is this still true?

Contributor Author

Looks like it:

library(DBI)
library(dm)
library(tidyverse)

parent <- tibble(value = c("A", "B", "C"), pk = 1:3)
child <- tibble(value = c("a", "b", "c"), pk = 1:3, fk = c(1, 1, NA))
demo_dm <-
  dm(parent = parent, child = child) %>%
  dm_add_pk(parent, pk) %>%
  dm_add_pk(child, pk) %>%
  dm_add_fk(child, fk, parent)

sqlite_db <- dbConnect(RSQLite::SQLite())
demo_sql <- copy_dm_to(sqlite_db, demo_dm, temporary = FALSE)

new_parent <- tibble(value = "D", pk = 4)
new_child <- tibble(value = "d", pk = 4, fk = 4)

dm_insert_in <-
  dm(parent = new_parent, child = new_child) %>%
  copy_dm_to(sqlite_db, ., temporary = TRUE)

dm_insert_out <-
  demo_sql %>%
  dm_rows_insert(dm_insert_in)
#> Not persisting, use `in_place = FALSE` to turn off this message.

dbDisconnect(sqlite_db)

Created on 2022-06-19 by the reprex package (v2.0.1.9000)

Inspecting the `child` table of the resulting `dm_insert_out` and `demo_sql`, we can see that's exactly what happened.
{dm} returned to us a dm object with our inserted rows in place, but the underlying database has not changed.

@@ -180,8 +181,8 @@ demo_sql$child
## `dm_rows_delete()` {#delete}

`dm_rows_delete()` is not currently implemented to work with an RDBMS, so we will shift our demonstrations back to the local R environment.
We've made changes to `demo_sql` so we use `collect()` to copy the current tables out of SQLite.
Note that persistence is not a concern with local dm objects.
We've made changes to `demo_sql`, so we use `collect()` to copy the current tables out of SQLite.
Note that persistence is not a concern for *local* `dm` objects.
Every operation returns a new dm object containing the changes made.

``````{r }
@@ -229,27 +230,6 @@ dm_upserted$parent
dm_upserted$child
``````

## `dm_rows_truncate()` {#truncate}

`dm_rows_truncate()` deletes all the rows in a table while leaving all other related information intact, including column names, column types, and key relations.
The function derives its name from the SQL `TRUNCATE TABLE` statement, so we will return to our SQLite database to demonstrate its use.
The example below truncates only the `child` table.
Note how a modified version of the destination dm is used as "changeset dm": the rows in the changeset dm do not matter here.

``````{r }
dm_trunc_in <-
demo_sql %>%
dm_select_tbl(child)
dm_trunc_in
dm_trunc_out <-
demo_sql %>%
dm_rows_truncate(dm_trunc_in, in_place = TRUE)

demo_sql$child
``````



When done, do not forget to disconnect:

``````{r disconnect}
@@ -259,7 +239,7 @@ DBI::dbDisconnect(sqlite_db)
## Conclusion {#conclusion}

The `dm_rows_*` methods give you row-level granularity over the modifications you need to make to your relational model.
By using the `in_place` argument they all share you can construct and verify your modifications before committing them.
Using the `in_place` argument that they all share, you can construct and verify your modifications before committing them.
There are a few limitations, as mentioned in the tutorial, but these will be addressed in future updates to {dm}.

## Further Reading
11 changes: 8 additions & 3 deletions vignettes/setup/setup.R
@@ -19,12 +19,17 @@ knit_print.grViz <- function(x, ...) {
knitr::asis_output()
}

# If input loads dm...
# If input loads dm or tidyverse, we load it here to omit warnings
IndrajeetPatil marked this conversation as resolved.
input <- readLines(knitr::current_input())
if (rlang::has_length(grep("^library[(]dm[)]", input))) {
# we load it here to omit warnings
if (rlang::has_length(grep('^library[(]"?dm"?[)]', input))) {
library(dm)
}
if (rlang::has_length(grep('^library[(]"?tidyverse"?[)]', input))) {
library(tidyverse)
}
if (rlang::has_length(grep('^library[(]"?dplyr"?[)]', input))) {
library(tidyverse)
}

## Link helper to enable links only on pkgdown
href <- function(title, url) {
4 changes: 2 additions & 2 deletions vignettes/tech-dm-join.Rmd
@@ -91,7 +91,7 @@ dm_joined <-
dm_joined
```

As you can see below, the `dm_joined` dataframe has one more column than the `flights` table.
As you can see below, the `dm_joined` data frame has one more column than the `flights` table.
The difference is the `name` column from the `airlines` table.

```{r}
@@ -105,7 +105,7 @@ dm_joined %>%
names()
```

The result is not a `dm` object anymore, but a conventional dataframe:
The result is not a `dm` object anymore, but a conventional data frame:

```{r}
dm_joined %>%