-
Notifications
You must be signed in to change notification settings - Fork 4
/
index.qmd
239 lines (189 loc) · 8.75 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
---
title: "🐿️ Sqrrl"
slug: "sqrrl"
description: "Easily build bespoke SQL queries programmatically in R"
date: 2017-07-01T00:00:00+00:00
image: feature.png
categories:
- Project
tags: ["R", "R Package", "SQL", "MySQL", "Shiny"]
page: project
show_post_thumbnail: true
---
```{r setup, include=FALSE}
library(knitr)
library(sqrrl)
opts_chunk$set(echo = TRUE, cache=TRUE)
normal_output <- knit_hooks$get("output")
knit_hooks$set(output = function(x, options) {
as_sql <- !is.null(options$output.sql) && options$output.sql
if (as_sql) {
x <- gsub('^\\[[0-9]+\\] ', '', x)
x <- gsub('"(.+)"', '\\1', x)
x <- gsub('\\\\', '', x)
x <- sqlformat(x, md = FALSE, sqlformat_options = '-k upper')
}
normal_output(x, options)
})
```
**UPDATE 10/17/2017**: **Don't use this!** I made it for myself so it works for what I needed it for.
But you probably shouldn't use this package.
There are better ways of building SQL queries that are safer and better (and probably even easier).
For now, let me just point you in the direction of [db.rstudio.com](http://db.rstudio.com/), [dplyr/dbplyr](http://db.rstudio.com/dplyr#generating-queries), and the recently added [`glue_sql()` function](http://glue.tidyverse.org/reference/glue_sql.html) in the [glue package](http://glue.tidyverse.org).
**Project Links:** [source][sqrrl-src], [documentation][sqrrl-docs]
`sqrrl` is a small collection of utility functions that help build text-based SQL queries in an R-style native-feeling and functional manner.
Unlike other packages that build SQL queries using an object-oriented style, `sqrrl` provides small functions that produce SQL snippets and can be linked together to compose SQL queries.
The result is that the code to produce the SQL statement reads much like the SQL statement iteself.
On the other hand, `sqrrl` doesn't know anything about your database and can't help you out with completions, etc.
Where this package is most useful is with [Shiny web apps][shiny] that interact with a MySQL backend.
The utilities are all built so that queries can be built using column names and values stored inside ordinary R data structures.
The following is a quick demonstration of how the package works using the `nyclights13` dataset.
For more information on `sqrrl`, check out the [package documentation][sqrrl-docs].
## Setup `flights` database
To demonstrate the features in `sqrrl`, let's set up an in-memory SQLite database using the `nycflights13` dataset featured in `dplyr` and `dbplyr`.
First, load (or install) the pacakges and functions that we need.
```{r setup-flights, results='hide', error=FALSE, message=FALSE, warning=FALSE}
# ---- Workspace Setup ----
library('nycflights13') # install.packages('nycflights13')
library('DBI') # install.packages('DBI')
library('dplyr') # install.packages('dplyr')
library('dbplyr') # install.packages('dbplyr')
# Load the sqrrl package
# devtools::isntall_github('gadenbuie/sqrrl')
library('sqrrl')
# Alias to create nice tables
as_table <- function(...) knitr::kable(..., format = 'html')
```
Then load the `flights` data frame from `nycflights13` into the in-memory SQLite database (this code comes direclty from the [dbplyr documentation](http://dbplyr.tidyverse.org/articles/dbplyr.html#connecting-to-the-database)).
```{r load-flights, results='asis'}
# ---- Example Setup ----
# Create an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), path = ":memory:")
# Use dplyr/dbplyr to copy flights table to the temp db
copy_to(con, nycflights13::flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
# Show first 5 rows
dbGetQuery(con, 'SELECT * FROM flights LIMIT 5') %>%
as_table
```
## Querying `flights`
Often, when I'm working with a database, I'll create an alias for `dbGetQuery` with the database or table name.
Inside the alias function I usually add any data type modifications that might need to be applied, and I suppress the warning messages that `DBI` outputs about data type conversions.
```{r flights-alias}
flights <- function(query, ...) {
suppressWarnings(dbGetQuery(con, query, ...))
}
```
Now we can repeat the above `SELECT` statement using `sqrrl`, this time limiting the columns selected.
```{r flights-select, results='asis'}
flight_cols <- c('year', 'month', 'day',
'carrier', 'flight', 'tailnum')
SELECT(flight_cols) %+%
FROM('flights') %+%
LIMIT(5) %>%
flights %>%
as_table
```
Note that `sqrrl` provides the `%+%` infix operator, which is essentially just an alias for `paste(x, y)`.
```{r plus-infix}
'a' %+% 'b'
# or PHP style without a padded space: paste0
'a' %.% 'b'
```
We can also do more complicated queries, like finding the average arrival delay, grouped by tail number:
```{r flights-avg-arr-delay, results='asis'}
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay')) %+%
LIMIT(10) %>%
flights %>%
as_table
```
`sqrrl` also provides a wrapper around the python utility `sqlformat` that can be used to pretty-print SQL formats.
```{r flights-avg-arr-delay-query, results='asis'}
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay')) %+%
LIMIT(10) %>%
sqlformat %>% cat
```
Let's use the above as an inner query and filter on `n > 100`:
```{r flights-avg-arr-delay-gt-100, results='asis'}
query_all_arr_delay <- SELECT(
'tailnum', delay = 'avg(arr_delay)', n = 'count(*)'
) %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay'))
SELECT() %+%
FROM(delay = parens(query_all_arr_delay)) %+%
WHERE(gt(n = 100)) %+%
LIMIT(10) %>%
flights %>%
as_table
```
## Queries are just strings
Notice that unlike other packages, `sqrrl` can't build the nested queries for you.
You still need to understand the structure of the database and the structure of the query.
But when compared with the final output of the query, the `sqrrl` version looks a lot like SQL transliterated into R functions.
```{r flights-avg-arr-delay-gt-100-query, results='asis'}
SELECT() %+%
FROM(delay = parens(
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay'))
)) %+%
WHERE(gt(n = 100)) %+%
LIMIT(10) %>%
sqlformat() %>%
cat()
```
For me, at least, where the goal is to write SQL queries as bare strings, `sqrrl` lets me write in R and think in SQL without having to add a huge number of `paste` and `paste0` functions.
Everything in `sqrrl` takes input data from regular R data types and outputs an SQL snippet.
For an example of nearly everything each of the functions can do, see the [Getting Started][sqrrl-start] section in the documentation.
## A more complicated SELECT query
As a final example, here is a fully-loaded select query.
```{r flights-big-example, results='asis'}
SELECT('`year`', 'carrier', 'flight', 'dest',
n = 'count(*)',
avg_dist = 'avg(distance)',
avg_air_time = 'avg(air_time)') %+%
FROM(f = 'flights') %+%
WHERE(
BETWEEN('month', 6, 12),
'carrier' %IN% c("UA", "AA", "US", "WN"),
geq('dep_time' = 800),
leq('air_time' = 120),
'origin' %LIKE% 'JFK'
) %+%
GROUP_BY('`year`', 'carrier', 'flight', 'dest') %+%
ORDER_BY(DESC('n')) %+%
LIMIT(10) %>%
{ sqlformat(.) %>% cat; . } %>%
flights %>%
as_table
```
This query and table select the most popular flights from JFK between June and December of 2013 from the carriers `UA`, `AA`, `US`, and `WN` that depart JFK after 8:00 AM and have an air time of less than 2 hours.
## Learn more
There's more that the package can do, like `JOIN`s, `INSERT`s, and `UPDATE`s that I haven't gone into here.
There are also a number of wrappers, [comparison operators](https://gadenbuie.github.io/sqrrl/reference/comparison.html) and [concatenators](https://gadenbuie.github.io/sqrrl/reference/general.html) that can be used for wrapping strings in quotes --- e.g. `quotes()` --- comparing columns to values --- e.g. `geq()`, `eq()`, `lt()`, `neq()` --- and stringing together statements --- e.g. `AND()`, `OR()`, `%LIKE%`, `%IN%`, `BETWEEN()`.
There's an example of nearly every single function and each of it's possible configurations in the [package documentation][sqrrl-docs].
Hopefully this package is useful to someone other than myself (like you!).
If you run into any problems, [let me know](http://twitter.com/grrrck) or [submit an issue on GitHub](https://github.com/gadenbuie/sqrrl/issues).
```{r breakdown, include=FALSE}
dbDisconnect(con)
```
[sqrrl-src]: https://github.com/gadenbuie/sqrrl/
[sqrrl-docs]: https://gadenbuie.github.io/sqrrl/
[sqrrl-start]: https://gadenbuie.github.io/sqrrl/articles/sqrrl.html
[shiny]: https://shiny.rstudio.com