
Codebook is slow #17

Closed
6 tasks done
mbcann01 opened this issue Jul 3, 2022 · 1 comment

Comments


mbcann01 commented Jul 3, 2022

While running the codebook function on the L2C data, I realized how slow it is. In some ways, this may not be a huge issue because we probably won't need to recreate codebooks often. Having said that, it would be nice to find ways to speed up the code.

https://www.r-bloggers.com/2021/04/code-performance-in-r-which-part-of-the-code-is-slow/

http://adv-r.had.co.nz/Performance.html

Using HTML instead of Word (#5) might be a good way to speed it up.

Solution

The solution for this problem came from: https://ardata-fr.github.io/officeverse/officer-for-word.html#external-documents

Inserting a document allows you to integrate a previously created Word document into another document. This can be useful when certain parts of a document need to be written manually but automatically integrated into a final document. The document to be inserted must be in docx format and can be added with the function body_add_docx(). This is advantageous when you are generating huge documents and the generation is getting slower and slower: generate smaller documents instead, and design a main script that inserts the different documents into a main Word document.
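
For reference, a minimal sketch of the sub-document approach with officer. The chunk file names below are hypothetical and would need to be written to disk by earlier, smaller generation steps; the actual codebook internals may look different.

library(officer)

# Hypothetical sub-documents created earlier by separate, smaller generation steps
chunk_files <- c("chunk_01.docx", "chunk_02.docx", "chunk_03.docx")

main_doc <- read_docx()
main_doc <- body_add_par(main_doc, "Study Codebook", style = "heading 1")

for (f in chunk_files) {
  # body_add_docx() inserts a previously created docx into the main document
  main_doc <- body_add_docx(main_doc, src = f)
}

print(main_doc, target = "codebook.docx")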

  • Clean up codebook2 code
  • Move codebook2 code over to codebook and delete codebook2
  • Change version number
  • Document
  • Check
  • Commit

mbcann01 commented Jul 4, 2022

Working on issue #17. Codebook is slow.

library(dplyr)
library(codebookr)
library(microbenchmark)
library(profvis)
data(study)
data_stata <- haven::read_dta("inst/extdata/study.dta")

How long does it take to run on regular data?

microbenchmark(
  codebook(study),
  times = 10L
) # 2-3 seconds each run.

How long does it take to run on Stata data?

microbenchmark(
  codebook(data_stata),
  times = 10L
) # 2-3 seconds each run

So, that doesn't seem to make a huge difference.

What are the slow parts?

profvis(codebook(study))

The Flextable stuff is the slowest part. I'm not sure if I can speed that up or not.

profvis(codebook(data_stata))

Flextable stuff for this one too.

Can I do the flextable stuff all at once, outside of a loop? Will that make any difference?

Do more rows slow it down?

df_short <- tibble(x = rnorm(100)) # 100 rows
df_medium <- tibble(x = rnorm(10000)) # 10,000 rows
df_long <- tibble(x = rnorm(10000000)) # 10,000,000 rows
microbenchmark(
  codebook(df_short),  # Mean = 347 milliseconds
  codebook(df_medium), # Mean = 1589 milliseconds
  codebook(df_long),   # Mean = 4212 milliseconds
  times = 10L
)

So, adding more observations slows it down.
100 to 10,000 rows ≈ 4.6 times as long
100 to 10,000,000 rows ≈ 12 times as long

Do more columns slow it down?

# Keep the first 100 rows of df_medium only
df_medium <- df_medium[1:100,]
# Make 100 column names from combinations of letters
set.seed(123)
cols <- unique(paste0(sample(letters, 100, TRUE), sample(letters, 100, TRUE), sample(letters, 100, TRUE)))
for (col in cols) {
  df_medium[[col]] <- rnorm(100)
}
microbenchmark(
  codebook(df_short),  # Mean = 300 milliseconds
  codebook(df_medium), # Mean = 52776 milliseconds (52 seconds)
  times = 1L
)

So, adding more columns slows it down A LOT!
1 column to ~100 columns ≈ 176 times as long!

What parts of the code take the longest to run?

profvis(codebook(df_short))

The flextable parts take the longest (i.e., body_add_flextable and regular_table).

profvis(codebook(df_medium))

The flextable parts take the longest (i.e., body_add_flextable, body_add_par, and regular_table).

profvis(codebook(df_long))

unique.default and cb_add_summary_stats take the longest.

There isn't a way for me to change the internals of the flextable functions, but I do wonder if applying them in a different way would speed things up.
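
For example, one possibility would be to build all of the flextables first and then add them to the Word document in a second pass. This is only a rough sketch under my own assumptions: the per-column summary below is a simple stand-in, not codebookr's actual summary logic.

library(codebookr)
library(flextable)
library(officer)
data(study)

# Stand-in per-column summary; codebookr's real summaries are much richer
summaries <- lapply(names(study), function(v) {
  x <- study[[v]]
  data.frame(
    attribute = c("Column", "Class", "Unique values", "Missing"),
    value     = c(v, paste(class(x), collapse = ", "),
                  length(unique(x)), sum(is.na(x)))
  )
})

# Build every flextable up front...
tables <- lapply(summaries, flextable)

# ...then add them all to the document in one pass
doc <- read_docx()
for (ft in tables) {
  doc <- body_add_flextable(doc, value = ft)
  doc <- body_add_par(doc, "")
}
print(doc, target = "flextable_test.docx")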

mbcann01 added a commit that referenced this issue Jul 13, 2022
Part of issue #17

- Now that all of the solution has been added to the codebook function, these files shouldn't be needed anymore. However, they are part of the last commit if they are ever needed.
mbcann01 added a commit that referenced this issue Jul 13, 2022