
Submit to CRAN #12

Closed
StevenMMortimer opened this issue Mar 30, 2019 · 20 comments
Comments

@StevenMMortimer (Collaborator)

Adding a note for work relating to the submission of this package to CRAN.

@jeffboichuk (Collaborator) commented Apr 2, 2019

Might be worth cleaning up campaign_id. There are 27 campaigns in the data, and campaign_id includes c(1:23, 27:30). This visualization shines a light on the issue:

library(tidyverse)
library(completejourney)

campaign_descriptions %>% 
  gather(start_date, end_date, key = campaign_event, value = date) %>% 
  ggplot(aes(x = date, y = campaign_id %>% reorder(date, FUN = 'min'))) + 
  geom_line(aes(group = campaign_id), color = "grey50") + 
  geom_point(aes(color = campaign_type)) +
  scale_x_date(date_breaks = "3 months", date_labels = "%b %Y") +
  theme(legend.position = "bottom") +
  labs(
    title = "Redemption Periods by Campaign in the Complete Journey Study",
    subtitle = "Campaigns lasted a median length of five weeks",
    x = "Date",
    y = "Campaign ID",
    color = "Campaign Type",
    caption = "Data Source: The completejourney package (https://github.com/bradleyboehmke/completejourney)"
  )

@bradleyboehmke self-assigned this Apr 2, 2019
@bradleyboehmke (Owner)

@jeffboichuk and @StevenMMortimer, I let this slide for a while but got back to it today. I was able to update all the data sets that contain campaign_id and ensured the ID values range from 1-27 rather than 1-23, 27-30. I also went through all necessary checks and was able to submit it to CRAN. We still have a note on the package size so we'll see if they allow it or not. As soon as I hear back I will let you know.

@bradleyboehmke (Owner)

@jeffboichuk and @StevenMMortimer , message from Uwe on our submission.

Thanks, we see:

Size of tarball: 28204328 bytes

Not more than 5 MB for a CRAN package, please host such huge data elsewhere.

Best,
Uwe Ligges

Looks like we need to go with plan B: host the data outside of CRAN but have package functions install the data (like the read_mnist() function in the dslabs package). What we could do is provide just a small subset of the transactions data in the package, small enough that the total size is less than 5 MB, but then have another function (e.g. read_transactions()) that imports the entire transactions data set. Thoughts?
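That plan could be sketched roughly like this; load_rda() and read_remote_rda() are hypothetical names (not the package's actual API), and the URL pattern shown is only a guess:

```r
# Sketch: load an .rda file and return the first object it contains,
# without assigning anything into the user's global environment.
load_rda <- function(path) {
  env <- new.env()
  object_names <- load(path, envir = env)  # load() returns the object name(s)
  get(object_names[1], envir = env)
}

# Hypothetical downloader built on load_rda(); the URL is an assumption.
read_remote_rda <- function(url) {
  temp_file <- tempfile(fileext = ".rda")
  on.exit(unlink(temp_file))
  utils::download.file(url, temp_file, mode = "wb", quiet = TRUE)
  load_rda(temp_file)
}

# transactions <- read_remote_rda(
#   "https://github.com/bradleyboehmke/completejourney/raw/master/data/transactions.rda"
# )
```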

@StevenMMortimer (Collaborator, Author)

@bradleyboehmke Thanks for submitting! Bummer to hear about the size. How do you feel about automatically loading the data from GitHub using the .onLoad() hook? I'm not sure if that's an accepted practice, but just thinking of alternatives to having a read_data() function. If I was loading a data package, I would expect the data to be there without having to call another function.

Here is a version of the .onLoad call that I mocked up:

load_url <- function (url, ..., sha1 = NULL) {
  # based very closely on code for devtools::source_url
  stopifnot(is.character(url), length(url) == 1)
  temp_file <- tempfile()
  on.exit(unlink(temp_file))
  request <- httr::GET(url)
  httr::stop_for_status(request)
  writeBin(httr::content(request, type = "raw"), temp_file)
  file_sha1 <- digest::digest(file = temp_file, algo = "sha1")
  if (is.null(sha1)) {
    #message("SHA-1 hash of file is ", file_sha1)
  } else {
    if (nchar(sha1) < 6) {
      stop("Supplied SHA-1 hash is too short (must be at least 6 characters)")
    }
    file_sha1 <- substr(file_sha1, 1, nchar(sha1))
    if (!identical(file_sha1, sha1)) {
      stop("SHA-1 hash of downloaded file (", file_sha1, 
           ")\n  does not match expected value (", sha1, 
           ")", call. = FALSE)
    }
  }
  load(temp_file, envir = .GlobalEnv)
}

#load_url from: https://stackoverflow.com/questions/24846120/importing-data-into-r-rdata-from-github

.onLoad <- function(libname, pkgname) {
  message("Loading Package Data from GitHub")
  dataset_names <- c("campaigns", "campaign_descriptions", "coupons", "coupon_redemptions", 
                     "demographics", "products", "promotions", "transactions")
  pb <- txtProgressBar(min=0, max=1, style=3)
  for (i in 1:length(dataset_names)) {
    load_url(sprintf("https://github.com/bradleyboehmke/completejourney/blob/master/data/%s.rda?raw=true", dataset_names[i]))
    setTxtProgressBar(pb, value = i*(1/length(dataset_names)))
  }
  invisible()
}

@bradleyboehmke (Owner)

@StevenMMortimer , I like this idea. I've never seen this approach before, so let me read up on it a bit. Is there a restructuring of the package required so that the .rda files are not uploaded to CRAN, or is it as simple as adding ^data to the .Rbuildignore?

@StevenMMortimer (Collaborator, Author)

@bradleyboehmke Yes, I was thinking of keeping the files in their place and ignoring them on build.
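For the record, that amounts to a one-line change; usethis can write the .Rbuildignore entry for you (shown here as a sketch):

```r
# Adds the regex "^data$" to .Rbuildignore, so the .rda files stay in
# the repo but are excluded from the tarball that gets sent to CRAN.
usethis::use_build_ignore("data")
```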

@bradleyboehmke (Owner)

@StevenMMortimer have you found a couple of packages that implement this method? If so, please share; I'd like to dig into them and see if there are best practices. Or if you've identified best practices regarding this, please do educate me :). Either way, this seems like a great approach!

@StevenMMortimer (Collaborator, Author)

@bradleyboehmke That picture is amazing. I really have no idea what I'm doing. My suggestion to include stuff in the .onLoad() hook was just a gut reaction. I can't think of or find any packages that do that. I could see CRAN still rejecting the package using that technique since it's weird to connect to an external source on load and download something unknown to the user, especially if it's large data that might not fit into the local machine's RAM.

In searching for packages using that method, I ran across the drat package, which can host packages in a way that lets you use install.packages(), update.packages(), etc., similar to CRAN. That said, I haven't used it or thought through what exactly that would look like on GitHub or GitHub Pages.

Package repo: https://github.com/eddelbuettel/drat
Dirk's blog about it: http://dirk.eddelbuettel.com/code/drat.html
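If we went the drat route, my understanding is the user-facing side would look something like the following (the repo and data-package names here are assumptions, not anything we've set up):

```r
# Hypothetical: data hosted in a drat repo served from GitHub Pages.
install.packages("drat")
drat::addRepo("bradleyboehmke")          # registers https://bradleyboehmke.github.io/drat
install.packages("completejourneydata")  # assumed name for the data-only package
```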

Lastly, it might be worth revisiting how important it is to be a CRAN package, or what benefits we gain from getting it to a state that's installable like a CRAN package. The motivation is escaping me at the moment, aside from it being a more consistent way to install and maintain for beginners.

@bradleyboehmke (Owner)

@StevenMMortimer, I made some updates to the package. I decided against the .onLoad() and .onAttach() approach because we would be importing from an external website before the user was informed of us doing so. I monkeyed around with adding an interactive menu() approach where the user is asked if they want to import all the data sets when calling library(completejourney); however, that throws a CRAN warning, as you are not supposed to include interactivity at package load time.

Instead, I added a package install message and incorporated a get_data() function that will install one or more completejourney data sets (defaults to all). Right now it passes all CRAN checks but I want you to test it out and give me feedback before submitting to CRAN.

@jeffboichuk (Collaborator)

@bradleyboehmke nicely done! Looks good to me.

Perhaps "Download them with the get_data() function. Consequently, you will need an
active wifi connection to download them from GitHub."
could be changed to "Run get_data() in your console to down the download the datasets from GitHub. An internet connection is required."? I got it, but new users might need more direction. Just a thought...

@StevenMMortimer (Collaborator, Author)

@bradleyboehmke @jeffboichuk That all sounds great to me. Fingers crossed on the second CRAN submission.

@bradleyboehmke (Owner)

@StevenMMortimer @jeffboichuk Got feedback on the CRAN submission. No major blockers but they wanted the following changed:

  • Add link to https://www.8451.com/area51 in the Description field of DESCRIPTION file
  • Replace \dontrun{} with \donttest{} in @examples
  • Change if(require("dplyr")){} statements to just require("dplyr")

I've made these changes and resubmitted this morning. Should be good to go but will let you know once final acceptance comes in. Might want to start thinking about a hex sticker 🤔!
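For anyone following along, the second bullet just means wrapping the download-dependent examples like this (a sketch of the roxygen block, not the exact documentation):

```r
#' @examples
#' \donttest{
#' # wrapped in \donttest{} so CRAN checks skip the download by default,
#' # while users and --run-donttest checks can still execute it
#' get_data()
#' }
```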

@jeffboichuk (Collaborator) commented May 4, 2019

@bradleyboehmke @StevenMMortimer I have an idea for the hex sticker. It might be crazy. A Pacman-like maze that has a shopping cart at the start and a few food-related emojis along the way. The image would symbolize that the data represent a sample of households' purchases at a grocer from the start of the study period to the end of it. That's the journey, which shouldn't be confused with customer journey mapping. Customer journey mapping stems from problem recognition, to information search, to choosing within a consideration set, to purchase. Just thought I'd throw it out there as a potential way to capture the package's contents in the hex sticker. Thanks for the finishing touches, Brad!

@bradleyboehmke (Owner)

@jeffboichuk @StevenMMortimer Ok, it's been a few weeks since the last submission and I've got an update. On the last submission, which fixed the minor issues the first reviewer raised, we had a different reviewer assess the submission. This time they responded with:

Thanks, we see code lines such as load(temp_file, envir = .GlobalEnv). Please do not modify the .GlobalEnv and just return a list of loaded objects.

This peeved me off since it was not discussed by the other reviewer so I argued against it. Well, I lost so we have a couple options:

  1. Have get_data() return a list of the data frames. Easiest fix, but then the user has a list of tibbles and needs a second step to extract each one as an individual data set.
  2. Try to implement an alternative route with the drat package, which requires:
    • Packaging the data as a normal R package, e.g. completejourneydata, and making it available in a GitHub repo.
    • Adding the data package as a Suggests to completejourney.
    • Adding the data repository as Additional_repositories to completejourney.

This approach is described in https://journal.r-project.org/archive/2017/RJ-2017-026/RJ-2017-026.pdf but I have not tried to implement it yet. Let me know your thoughts and I can make this change (hopefully quickly) and get the package resubmitted.
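For option 1, the extra extraction step can be kept to one line with list2env(); here's a sketch with mocked-up data frames standing in for the real tibbles:

```r
# Option 1 sketch: get_data() returns a named list instead of writing
# into .GlobalEnv. The data frames below are mocked stand-ins.
get_data <- function() {
  list(
    transactions = data.frame(household_id = 1L, quantity = 2L),
    promotions   = data.frame(product_id = 10L, display_location = 3L)
  )
}

cj <- get_data()
transactions <- cj$transactions      # extract one data set by hand, or...
list2env(cj, envir = environment())  # ...unpack the whole list at once
```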

@StevenMMortimer (Collaborator, Author)

@bradleyboehmke Thanks for dealing with all that. Option 1 (list of tibbles) sounds easy and reliable. When I looked at drat before it sounded like we needed to host the files somewhere and GitHub was not an option, but I don't really know what I'm talking about with that. If you want to learn more and go down that rabbit hole, that's fine too.

@bradleyboehmke (Owner)

If you guys are cool with going the list route then I'm fine with that.

@bradleyboehmke (Owner)

@jeffboichuk @StevenMMortimer , I let this get a little stale, but I spent yesterday making revisions per CRAN's feedback on the last package submission. Check out the updates and let me know your thoughts. I went to submit it to CRAN yesterday but they are closed for submissions until Aug 18 (summer vacation).

@jeffboichuk (Collaborator)

@bradleyboehmke I really like the shift you made from get_data() to get_transactions() and get_promotions(). Everything looks great to me!
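With that change, the end-user workflow (after installing the package) is just:

```r
# Requires an internet connection; each get_*() function downloads
# its full data set from GitHub.
library(completejourney)
transactions <- get_transactions()
promotions   <- get_promotions()
```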

@bradleyboehmke (Owner)

completejourney is now on CRAN. Enjoy with install.packages("completejourney"). Thanks!

@StevenMMortimer (Collaborator, Author)

Thanks so much @bradleyboehmke! Awesome stuff.
