-
Notifications
You must be signed in to change notification settings - Fork 10
Expand file tree
/
Copy pathfreeform_report.R
More file actions
353 lines (313 loc) · 16.2 KB
/
freeform_report.R
File metadata and controls
353 lines (313 loc) · 16.2 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
#' Get a freeform table
#'
#' Get a report analogous to a **Freeform Table** visualization in Analysis Workspace. The function uses
#' the arguments to construct and execute a JSON-based query to the Adobe Analytics API and then returns
#' the results as a data frame.
#'
#' @details
#'
#' This function is based on the **Freeform Table** visualization in Analysis Workspace. It is accessing
#' the same API call type that is used to generate those visualizations.
#'
#' ## Dimension Ordering
#'
#' Adobe Analytics only queries one dimension at a time, even though the results get returned in a single data
#' frame (or table in the case of Analysis Workspace). The more dimensions are included in the report--the more
#' breakdowns of the data--the more queries are required. As a result, the _order_ of the dimensions _can_
#' have a dramatic impact on the total query time, even if the resulting data is essentially identical.
#'
#' One way to understand this is to consider how much dragging and dropping would be required to return the
#' data in Analysis Workspace.
#'
#' Consider a scenario where you are pulling metrics for the last 30 days (`daterangeday`) for **Mobile Device Type**
#' (`mobiledevicetype`), which has 7 unique values. Setting `dimensions = c("daterangeday", "mobiledevicetype")`
#' would make one query to get the values of the 30 days included. The query would then run a separate query
#' for _each of those 30 days_ to get the `mobiledevicetype` results for each day. So, this would be **31 API calls**.
#'
#' If, instead, the function was called with the `dimension` values reversed (`dimensions = c("mobiledevicetype", "daterangeday")`), then
#' the first query would return the 7 `mobiledevicetype` values, and then would run an additional query for each of
#' those _7 mobile device type values_ to return the results for the 30 days within each device type. This would be only **7 API calls**.
#'
#' Strategically ordering dimensions--and then wrangling the resulting data set as needed--is one of the best
#' ways to improve query performance.
#'
#' ## Date Handling
#'
#' Date handling has several special characteristics that are worth getting familiar with:
#' * The API names for day, week, month, etc. are prepended with `daterange`, so daily data uses
#' `daterangeday`, weekly data uses `daterangeweek`, monthly data uses `daterangemonth`, etc.
#' * When setting the argument for `top`, if the first (or only) `dimension` value is a `daterange...` object,
#' then, if this argument is not explicitly specified _or_ if it uses only a single value (e.g., `top = 10`),
#' the function will still return all of the values that fall in that date range. For instance, if the
#' `date_range` was set for a 30-day period and the first `dimension` value was `daterangeday`, _and_ no value
#' is specified for `top`, rather than simply returning the first 5 dates in the range, all 30 days will be
#' returned. In the same scenario, if `top = 10` was set, then all 30 days would still be returned, and the
#' `10` would simply be applied to the additional dimensions.
#' * If you want to return all of the date/time values but then have specific control over the number of
#' values returned for each of the drilldown dimensions, then set `0` as the first value in the `top`
#' argument and then specify different numbers for each breakdown (e.g., `top = c(0, 3, 10)` would return
#' all of the date/time values for the specified `date_range`, the top 3 values for the second specified
#' `dimension`, and then the top 10 values for each of the next dimension's results).
#' * If you are using a `daterange...` value _not_ as the first dimension, then simply using `0` at the
#' same level in the `top` argument specification will return all of the values for that date/time value.
#'
#' ## Search/Filtering
#'
#' There are powerful filtering abilities within the function. However, to support that power requires a
#' syntax that can feel a bit cumbersome for simple queries. **_Note:_** search filters are case-insensitive.
#' This is Adobe Analytics API functionality and can not be specified otherwise in queries.
#'
#' The `search` argument takes a vector of search strings, with each value in the vector corresponding to
#' the `dimension` value that is at the same position. These search strings support a range of operators,
#' including `AND`, `OR`, `NOT`, `MATCH`, `CONTAINS`, `BEGINS-WITH`, and `ENDS-WITH`.
#'
#' The default for any search string is to use `CONTAINS`. Consider a query where
#' `dimensions = c("mobiledevicetype", "lasttouchchannel")`:
#'
#' - `search = "CONTAINS 'mobile'"` will return results where `mobiledevicetype` contains "mobile", so would return all rows for **Mobile Phone**.
#' - This could be shortened to `search = "'mobile'"` and would behave exactly the same, since `CONTAINS` is the default operator
#' - `search = c("CONTAINS 'mobile'", "CONTAINS 'search'")` will return results where `mobiledevicetype` contains "mobile" and, within those results, results where `lasttouchchannel` contains "search".
#' - `search = c("(CONTAINS 'mobile') OR (CONTAINS 'tablet')", "(MATCH 'paid search')")` will return results where `mobiledevicetype` contains "mobile" _or_ "tablet" and, within those results, will only include results where `lasttouchchannel` exactly matches "paid search" (but is case-insensitive, so would return "Paid Search" values).
#'
#' @seealso [get_me()], [aw_get_reportsuites()], [aw_get_segments()],
#' [aw_get_dimensions()], [aw_get_metrics()], [aw_get_calculatedmetrics()],
#' [aw_segment_table()]
#'
#' @param company_id Company ID. If an environment variable called `AW_COMPANY_ID` exists in `.Renviron` or
#' elsewhere and no `company_id` argument is provided, then the `AW_COMPANY_ID` value will be used.
#' Use [get_me()] to get a list of available `company_id` values.
#' @param rsid Adobe report suite ID (RSID). If an environment variable called `AW_REPORTSUITE_ID` exists
#' in `.Renviron` or elsewhere and no `rsid` argument is provided, then the `AW_REPORTSUITE_ID` value will
#' be used. Use [aw_get_reportsuites()] to get a list of available `rsid` values.
#' @param date_range A length-2 vector with a start date and an end date.
#' `POSIXt` objects are sent as is, for fine control over the date range.
#' Numeric values are automatically converted to dates.
#' @param metrics A character vector of metrics. Use [aw_get_metrics()] and [aw_get_calculatedmetrics()]
#' to get a list of available `metrics` IDs.
#' @param dimensions A character vector of dimensions. There is currently a limit of 20 dimension
#' breakdowns. Each dimension value that gets broken down by another dimension requires an additional API
#' call, so the more dimensions that are included, the longer the function will take to return results.
#' This is how the Adobe Analytics API works. Use [aw_get_dimensions()] to get a list of available
#' `dimensions` IDs.
#' @param top The number of values to be pulled for each dimension. The default is 5 and the "top" is based on
#' the first `metric` value (along with `metricSort`). If there are multiple dimensions, then this argument can
#' either be a vector that includes the number of values to include at each level (each breakdown) or, if a single
#' value is used, then that will be the maximum number of values to return at each level. See the **Details** for
#' information on the unique handling of `daterange...` values.
#' @param page Used in combination with `top` to return the next page of results. Uses 0-based numbering (e.g.,
#' `top = 50000` and `page = 1` will return the top 50,000 items _starting at 50,001_).
#' @param metricSort Pre-sorts the table by metrics. Values are either `asc` (ascending) or `desc` (descending).
#' @param filterType This is a placeholder argument for use as additional functionality is added to the package.
#' Currently, it defaults to `breakdown`, and that is the only supported value.
#' @param include_unspecified Whether or not to include **Unspecified** values in the results. This is the equivalent
#' of the **Include Unspecified (None)** checkbox in freeform tables in Analysis Workspace. This defaults to `TRUE`,
#' which includes **Unspecified** values in the results.
#' @param segmentId A single segment ID or a vector of multiple segment IDs to apply to the overall report.
#' If multiple `segmentId` values are included, the segments will be effectived ANDed together, just as if
#' multiple segments were added to the header of an Analysis Workspace panel. Use [aw_get_segments()]
#' to get a list of available `segmentId` values.
#' @param search Criteria to filter the results by one or more dimensions. Searches are case-insenstive. Refer to
#' the **Details** for more information on constructing values for this argument.
#' @param prettynames A logical that determines whether the column names in the results use the API field name
#' (e.g., "mobiledevicetype", "pageviews") or the "pretty name" for the field (e.g., "Mobile Device Type",
#' "Page Views"). This applies to both dimensions and metrics. The default value is `FALSE`, which returns the
#' API field names. For custom eVars, props, and events, the non-pretty values are simply the variable number
#' (e.g., "evar2", "prop3", "event15"). If `TRUE`, undoes any efficiency gains
#' from setting `check_components` to `FALSE`.
#' @param debug Set to `TRUE` to publish the full JSON request(s) being sent to the API to the console when the
#' function is called. The default is `FALSE`.
#' @param check_components Specifies whether to check the validity of metrics and
#' dimensions before running the query. This defaults to `TRUE`, which triggers
#' several additional API calls behind the scenes to retrieve all dimensions and
#' metrics from the API. This has a nominal performance impact and may not be
#' ideal if you are running many queries. If you have many queries, consider
#' implementing validity checking through other means (manually or within the
#' code) and then set this value to `FALSE`.
#'
#' @return A data frame with the specified dimensions and metrics.
#'
#' @export
aw_freeform_table <- function(company_id = Sys.getenv("AW_COMPANY_ID"),
rsid = Sys.getenv("AW_REPORTSUITE_ID"),
date_range = c(Sys.Date()-30, Sys.Date()-1),
dimensions = c('page', 'lasttouchchannel', 'mobiledevicetype'),
metrics = c("visits", "visitors"),
top = c(5),
page = 0,
filterType = 'breakdown',
segmentId = NA,
metricSort = 'desc',
include_unspecified = TRUE,
search = NA,
prettynames = FALSE,
debug = FALSE,
check_components = TRUE) {
if (company_id == "") stop("'company_id' is blank")
if (rsid == "") stop("'rsid' is blank")
if (all(is.na(segmentId))) segmentId <- NULL
# Repeated dimensions will cause an infinite loop
if (length(dimensions) > length(unique(dimensions))) {
stop("List of dimensions is not unique")
}
# No harm in repeated metrics, simply take the unique ones
metrics <- unique(metrics)
# Component lookup checks
# The component checking is optional, in case speed is a priority
if (check_components | prettynames) {
comp_lookup <- make_component_lookup(rsid,
company_id,
metrics[is_calculated_metric(metrics)])
invalid_components <- invalid_component_names(component = c(dimensions, metrics),
lookup = comp_lookup)
if (length(invalid_components > 0)) {
invalid_components <- paste(invalid_components, collapse = ", ")
stop(paste("Component(s) not found: ", invalid_components), call. = FALSE)
}
if (prettynames == TRUE) {
pretty_comp_names <- c(dimensions, metrics)
names(pretty_comp_names) <- comp_lookup$name[match(pretty_comp_names, comp_lookup$id)]
}
}
# Define the query with a query spec
# This also standardizes arguments (recycling, etc.)
query_spec <- make_query_spec(
rsid = rsid,
company_id = company_id,
dimensions = dimensions,
metrics = metrics,
date_range = date_range,
segment_id = segmentId,
limit = top,
page = page,
include_unspecified = include_unspecified,
dimensionSort = "asc",
search = search,
sort = metricSort
)
# Estimate requests and reset global counter
n_requests <- estimate_requests(qs_top(query_spec))
if (n_requests > 20) {
initialize_global_counter(n_requests)
} else {
kill_global_counter()
}
# Make requests
message("Requesting data...", appendLF = TRUE)
output_data <- get_req_data(
qs = query_spec,
index = 1,
item_ids = NULL,
debug = debug
)
message("Done!")
message(glue::glue("Returning {nrow(output_data)} x {ncol(output_data)} data frame"))
output_data <- convert_date_columns(output_data)
if (prettynames) {
output_data <- dplyr::select(output_data,
all_of(pretty_comp_names))
}
output_data
}
#' Calculate queries to be completed
#'
#' The number of queries required to complete a request. Also useful for vetting
#' a query to find an efficient dimension order.
#'
#' @param top Top argument, essentially the number of rows returned from
#' each query (can be inaccurate when fewer rows returned, but mostly correct)
#'
#' @return Number of queries needed to get top
#' @noRd
n_queries <- function(top) {
top_ind <- c(1, top[-length(top)])
sum(cumprod(top_ind))
}
#' Estimate number of requests for query
#'
#' Also calculates estimated runtime and sends it as a message to the console
#'
#' @param top Top argument
#'
#' @return Number of requests necessary to complete query
#' @noRd
estimate_requests <- function(top) {
queries <- n_queries(top)
if (length(top) > 1) {
# I reckon about 1 second per query
# sec
est_secs <- round(queries * 1.1, digits = 0)
# min
est_mins <- round(est_secs/60, digits = 0)
# hour
est_hours <- round(est_mins / 60, digits = 1)
if (est_secs < 60) {
message_text <- glue::glue("{est_secs}sec.")
} else if (est_mins < 60) {
message_text <- glue::glue("{est_mins}min.")
} else {
message_text <- glue::glue("{est_hours}hr.")
}
message('Estimated runtime: ', message_text)
}
queries
}
#' Initializes the global decile query list
#'
#' This is used for generating the progress bar on long queries.
#'
#' @param n_queries Number of queries to be completed
#'
#' @return Query quantiles, invisibly
#' @noRd
initialize_global_counter <- function(total_queries) {
prog_format <- "Progress [:bar] :percent in :elapsed"
.adobeanalytics$prog_bar <- progress::progress_bar$new(total = total_queries,
format = prog_format,
clear = FALSE)
invisible(total_queries)
}
#' Kill global counter
#'
#' Tears down the global counter
#'
#' @return NULL
#' @noRd
kill_global_counter <- function() {
.adobeanalytics$prog_bar <- NULL
NULL
}
#' Increment global counter
#'
#' Send a `tick` to the progress bar (see `progress::progress_bar`).
#' If no progress bar is initialized, do nothing.
#'
#' @return NULL
#' @noRd
increment_global_counter <- function() {
if (!is.null(.adobeanalytics$prog_bar)) {
.adobeanalytics$prog_bar$tick()
}
NULL
}
#' Convert date columns to date objects
#'
#' @param dat Data frame
#'
#' @return Data frame with date columns as dates
#' @noRd
convert_date_columns <- function(dat) {
# change time variables from character strings
if("daterangeminute" %in% colnames(dat)) {
dat[names(dat) == 'daterangeminute'] <- lubridate::parse_date_time(dat$daterangeminute, orders = "HM ymd")
}
if("daterangehour" %in% colnames(dat)) {
dat[names(dat) == 'daterangehour'] <- lubridate::parse_date_time(dat$daterangehour, orders = "HM ymd")
}
if("daterangeday" %in% colnames(dat)) {
dat[names(dat) == 'daterangeday'] <- as.Date(dat$daterangeday, format = '%b %d, %Y')
}
if("daterangeweek" %in% colnames(dat)) {
dat[names(dat) == 'daterangeweek'] <- as.Date(dat$daterangeweek, format = '%b %d, %Y')
}
dat
}