
Codebook is slow #17

Closed
6 tasks done
mbcann01 opened this issue Jul 3, 2022 · 1 comment

Comments


mbcann01 commented Jul 3, 2022

While running the codebook function on the L2C data, I realized how slow it is. In some ways, this may not be a huge issue because we probably won't need to recreate codebooks often. Having said that, it would be nice to find ways to speed up the code.

https://www.r-bloggers.com/2021/04/code-performance-in-r-which-part-of-the-code-is-slow/

http://adv-r.had.co.nz/Performance.html

Using HTML instead of Word (#5) might be a good way to speed it up.

Solution

The solution for this problem came from: https://ardata-fr.github.io/officeverse/officer-for-word.html#external-documents

Inserting a document allows you to integrate a previously created Word document into another document. This can be useful when certain parts of a document need to be written manually but automatically integrated into a final document. The document to be inserted must be in docx format and can be added with the function body_add_docx(). This is advantageous when you are generating huge documents and the generation is getting slower and slower: generate smaller documents instead, and design a main script that inserts the different documents into a main Word document.
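
For reference, a minimal sketch of the sub-document approach with officer. The chunk file names below are hypothetical and would need to be written to disk by earlier, smaller generation steps; the actual codebook internals may look different.

library(officer)

# Hypothetical sub-documents created earlier by separate, smaller generation steps
chunk_files <- c("chunk_01.docx", "chunk_02.docx", "chunk_03.docx")

main_doc <- read_docx()
main_doc <- body_add_par(main_doc, "Study Codebook", style = "heading 1")

for (f in chunk_files) {
  # body_add_docx() inserts a previously created docx into the main document
  main_doc <- body_add_docx(main_doc, src = f)
}

print(main_doc, target = "codebook.docx")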

  • Clean up codebook2 code
  • Move codebook2 code over to codebook and delete codebook2
  • Change version number
  • Document
  • Check
  • Commit

mbcann01 commented Jul 4, 2022

Working on issue #17. Codebook is slow.

library(dplyr)
library(codebookr)
library(microbenchmark)
library(profvis)
data(study)
data_stata <- haven::read_dta("inst/extdata/study.dta")

How long does it take to run on regular data?

microbenchmark(
  codebook(study),
  times = 10L
) # 2-3 seconds each run.

How long does it take to run on Stata data?

microbenchmark(
  codebook(data_stata),
  times = 10L
) # 2-3 seconds each run

So, that doesn't seem to make a huge difference.

What are the slow parts?

profvis(codebook(study))

The Flextable stuff is the slowest part. I'm not sure if I can speed that up or not.

profvis(codebook(data_stata))

Flextable stuff for this one too.

Can I do the flextable stuff all at once, outside of a loop? Will that make any difference?

Do more rows slow it down?

df_short <- tibble(x = rnorm(100)) # 100 rows
df_medium <- tibble(x = rnorm(10000)) # 10,000 rows
df_long <- tibble(x = rnorm(10000000)) # 10,000,000 rows
microbenchmark(
  codebook(df_short),  # Mean = 347 milliseconds
  codebook(df_medium), # Mean = 1589 milliseconds
  codebook(df_long),   # Mean = 4212 milliseconds
  times = 10L
)

So, adding more observations slows it down.
100 to 10,000 rows ≈ 4.6 times as long
100 to 10,000,000 rows ≈ 12 times as long

Do more columns slow it down?

# Keep the first 100 rows of df_medium only
df_medium <- df_medium[1:100,]
# Make 100 column names from combinations of letters
set.seed(123)
cols <- unique(paste0(sample(letters, 100, TRUE), sample(letters, 100, TRUE), sample(letters, 100, TRUE)))
for (col in cols) {
  df_medium[[col]] <- rnorm(100)
}
microbenchmark(
  codebook(df_short),  # Mean = 300 milliseconds
  codebook(df_medium), # Mean = 52776 milliseconds (52 seconds)
  times = 1L
)

So, adding more columns slows it down A LOT!
1 column to ~100 columns ≈ 176 times as long!

What parts of the code take the longest to run?

profvis(codebook(df_short))

The flextable parts take the longest (i.e., body_add_flextable and regular_table).

profvis(codebook(df_medium))

The flextable parts take the longest (i.e., body_add_flextable, body_add_par, and regular_table).

profvis(codebook(df_long))

unique.default and cb_add_summary_stats take the longest.

There isn't a way for me to change the internals of the flextable functions, but I do wonder if applying them in a different way would speed things up.
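
For example, one possibility would be to build all of the flextables first and then add them to the Word document in a second pass. This is only a rough sketch under my own assumptions: the per-column summary below is a simple stand-in, not codebookr's actual summary logic.

library(codebookr)
library(flextable)
library(officer)
data(study)

# Stand-in per-column summary; codebookr's real summaries are much richer
summaries <- lapply(names(study), function(v) {
  x <- study[[v]]
  data.frame(
    attribute = c("Column", "Class", "Unique values", "Missing"),
    value     = c(v, paste(class(x), collapse = ", "),
                  length(unique(x)), sum(is.na(x)))
  )
})

# Build every flextable up front...
tables <- lapply(summaries, flextable)

# ...then add them all to the document in one pass
doc <- read_docx()
for (ft in tables) {
  doc <- body_add_flextable(doc, value = ft)
  doc <- body_add_par(doc, "")
}
print(doc, target = "flextable_test.docx")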

mbcann01 added a commit that referenced this issue Jul 13, 2022
Part of issue #17

- Now that all of the solution has been added to the codebook function, these files shouldn't be needed anymore. However, they are part of the last commit if they are ever needed.
mbcann01 added a commit that referenced this issue Jul 13, 2022