Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor and simplify pipeline #66

Merged
merged 25 commits into from
Dec 5, 2023
Merged

Refactor and simplify pipeline #66

merged 25 commits into from
Dec 5, 2023

Conversation

dfsnow
Copy link
Member

@dfsnow dfsnow commented Dec 5, 2023

This PR lays the groundwork for further 2024 modeling work by:

  • Consolidating the setup for each pipeline stage into a single script
  • Adding explicit core dependency tracking via the DESCRIPTION file
  • Reducing the overall number of dependencies in the tree from 137 to 124
  • Moving the run deletion script to .github/scripts/ for clarity

Dependencies

Unfortunately, I've think we've hit the limit on cleaning up the dependency tree. The tidymodels sub-packages are extremely dependency heavy, so unless we want to switch away from tidymodels we're stuck with ~130 deps. Here's the dependency tree for workflows alone:

Click here for tune dependency graph
├─cli 3.6.1 [new]
├─dials 1.2.0 [new]
│ ├─scales 1.3.0 [new]
│ │ ├─cli
│ │ ├─farver 2.1.1 [new]
│ │ ├─glue 1.6.2 [new]
│ │ ├─labeling 0.4.3 [new]
│ │ ├─lifecycle 1.0.4 [new]
│ │ │ ├─cli
│ │ │ ├─glue
│ │ │ └─rlang 1.1.2 [new]
│ │ ├─munsell 0.5.0 [new]
│ │ │ └─colorspace 2.1-0 [new]
│ │ ├─R6 2.5.1 [new]
│ │ ├─RColorBrewer 1.1-3 [new]
│ │ ├─rlang
│ │ └─viridisLite 0.4.2 [new]
│ ├─cli
│ ├─DiceDesign 1.9 [new]
│ ├─dplyr 1.1.4 [new]
│ │ ├─cli
│ │ ├─generics 0.1.3 [new]
│ │ ├─glue
│ │ ├─lifecycle
│ │ ├─magrittr 2.0.3 [new]
│ │ ├─pillar 1.9.0 [new]
│ │ │ ├─cli
│ │ │ ├─fansi 1.0.5 [new]
│ │ │ ├─glue
│ │ │ ├─lifecycle
│ │ │ ├─rlang
│ │ │ ├─utf8 1.2.4 [new]
│ │ │ └─vctrs 0.6.5 [new]
│ │ │   ├─cli
│ │ │   ├─glue
│ │ │   ├─lifecycle
│ │ │   └─rlang
│ │ ├─R6
│ │ ├─rlang
│ │ ├─tibble 3.2.1 [new]
│ │ │ ├─fansi
│ │ │ ├─lifecycle
│ │ │ ├─magrittr
│ │ │ ├─pillar
│ │ │ ├─pkgconfig 2.0.3 [new]
│ │ │ ├─rlang
│ │ │ └─vctrs
│ │ ├─tidyselect 1.2.0 [new]
│ │ │ ├─cli
│ │ │ ├─glue
│ │ │ ├─lifecycle
│ │ │ ├─rlang
│ │ │ ├─vctrs
│ │ │ └─withr 2.5.2 [new]
│ │ └─vctrs
│ ├─glue
│ ├─hardhat 1.3.0 [new]
│ │ ├─cli
│ │ ├─glue
│ │ ├─rlang
│ │ ├─tibble
│ │ └─vctrs
│ ├─lifecycle
│ ├─pillar
│ ├─purrr 1.0.2 [new]
│ │ ├─cli
│ │ ├─lifecycle
│ │ ├─magrittr
│ │ ├─rlang
│ │ └─vctrs
│ ├─rlang
│ ├─tibble
│ ├─vctrs
│ └─withr
├─dplyr
├─foreach 1.5.2 [new]
│ ├─codetools 0.2-18 -> 0.2-19 [upd]
│ └─iterators 1.0.14 [new]
├─generics
├─ggplot2 3.4.4 [new]
│ ├─cli
│ ├─glue
│ ├─gtable 0.3.4 [new]
│ │ ├─cli
│ │ ├─glue
│ │ ├─lifecycle
│ │ └─rlang
│ ├─isoband 0.2.7 [new]
│ ├─lifecycle
│ ├─MASS 7.3-58 -> 7.3-60 [upd]
│ ├─mgcv 1.8-41 -> 1.9-0 [upd]
│ │ ├─nlme 3.1-160 -> 3.1-164 [upd]
│ │ │ └─lattice 0.20-45 -> 0.22-5 [upd]
│ │ └─Matrix 1.5-1 -> 1.6-4 [upd]
│ │   └─lattice
│ ├─rlang
│ ├─scales
│ ├─tibble
│ ├─vctrs
│ └─withr
├─glue
├─GPfit 1.0-8 [new]
│ ├─lhs 1.1.6 [new]
│ │ └─Rcpp 1.0.11 [new]
│ └─lattice
├─hardhat
├─lifecycle
├─parsnip 1.1.1 [new]
│ ├─cli
│ ├─dplyr
│ ├─generics
│ ├─ggplot2
│ ├─globals 0.16.2 [new]
│ │ └─codetools
│ ├─glue
│ ├─hardhat
│ ├─lifecycle
│ ├─magrittr
│ ├─pillar
│ ├─prettyunits 1.2.0 [new]
│ ├─purrr
│ ├─rlang
│ ├─tibble
│ ├─tidyr 1.3.0 [new]
│ │ ├─cli
│ │ ├─dplyr
│ │ ├─glue
│ │ ├─lifecycle
│ │ ├─magrittr
│ │ ├─purrr
│ │ ├─rlang
│ │ ├─stringr 1.5.1 [new]
│ │ │ ├─cli
│ │ │ ├─glue
│ │ │ ├─lifecycle
│ │ │ ├─magrittr
│ │ │ ├─rlang
│ │ │ ├─stringi 1.8.2 [new]
│ │ │ └─vctrs
│ │ ├─tibble
│ │ ├─tidyselect
│ │ └─vctrs
│ ├─vctrs
│ └─withr
├─purrr
├─recipes 1.0.8 [new]
│ ├─dplyr
│ ├─cli
│ ├─clock 0.7.0 [new]
│ │ ├─cli
│ │ ├─lifecycle
│ │ ├─rlang
│ │ ├─tzdb 0.4.0 [new]
│ │ └─vctrs
│ ├─ellipsis 0.3.2 [new]
│ │ └─rlang
│ ├─generics
│ ├─glue
│ ├─gower 1.0.1 [new]
│ ├─hardhat
│ ├─ipred 0.9-14 [new]
│ │ ├─rpart 4.1.19 -> 4.1.21 [upd]
│ │ ├─MASS
│ │ ├─survival 3.4-0 -> 3.5-7 [upd]
│ │ │ └─Matrix
│ │ ├─nnet 7.3-18 -> 7.3-19 [upd]
│ │ ├─class 7.3-20 -> 7.3-22 [upd]
│ │ │ └─MASS
│ │ └─prodlim 2023.08.28 [new]
│ │   ├─Rcpp
│ │   ├─data.table 1.14.8 [new]
│ │   ├─diagram 1.6.5 [new]
│ │   │ └─shape 1.4.6 [new]
│ │   ├─survival
│ │   ├─KernSmooth 2.23-20 -> 2.23-22 [upd]
│ │   └─lava 1.7.3 [new]
│ │     ├─future.apply 1.11.0 [new]
│ │     │ ├─future 1.33.0 [new]
│ │     │ │ ├─digest 0.6.33 [new]
│ │     │ │ ├─globals
│ │     │ │ ├─listenv 0.9.0 [new]
│ │     │ │ └─parallelly 1.36.0 [new]
│ │     │ └─globals
│ │     ├─numDeriv 2016.8-1.1 [new]
│ │     ├─progressr 0.14.0 [new]
│ │     │ └─digest
│ │     ├─survival
│ │     └─SQUAREM 2021.1 [new]
│ ├─lifecycle
│ ├─lubridate 1.9.3 [new]
│ │ ├─generics
│ │ └─timechange 0.2.0 [new]
│ ├─magrittr
│ ├─Matrix
│ ├─purrr
│ ├─rlang
│ ├─tibble
│ ├─tidyr
│ ├─tidyselect
│ ├─timeDate 4022.108 [new]
│ ├─vctrs
│ └─withr
├─rlang
├─rsample 1.2.0 [new]
│ ├─cli
│ ├─dplyr
│ ├─furrr 0.3.1 [new]
│ │ ├─future
│ │ ├─globals
│ │ ├─lifecycle
│ │ ├─purrr
│ │ ├─rlang
│ │ └─vctrs
│ ├─generics
│ ├─glue
│ ├─lifecycle
│ ├─pillar
│ ├─purrr
│ ├─rlang
│ ├─slider 0.3.1 [new]
│ │ ├─cli
│ │ ├─rlang
│ │ ├─vctrs
│ │ └─warp 0.2.1 [new]
│ ├─tibble
│ ├─tidyr
│ ├─tidyselect
│ └─vctrs
├─tibble
├─tidyr
├─tidyselect
├─vctrs
├─withr
├─workflows 1.1.3 [new]
│ ├─cli
│ ├─generics
│ ├─glue
│ ├─hardhat
│ ├─lifecycle
│ ├─modelenv 0.1.1 [new]
│ │ ├─glue
│ │ ├─rlang
│ │ ├─tibble
│ │ └─vctrs
│ ├─parsnip
│ ├─rlang
│ ├─tidyselect
│ └─vctrs
└─yardstick 1.2.0 [new]
  ├─cli
  ├─dplyr
  ├─generics
  ├─hardhat
  ├─lifecycle
  ├─rlang
  ├─tibble
  ├─tidyselect
  └─vctrs

Related to #33. Closes #53.

Comment on lines +1 to +38
Type: Project
Description: Residential valuation model of the Cook County Assessor's Office
Depends:
arrow,
assessr,
aws.s3,
aws.ec2metadata,
butcher,
ccao,
conflicted,
dplyr,
furrr,
git2r,
glue,
hardhat,
here,
knitr,
parsnip,
purrr,
lightgbm,
lightsnip,
lubridate,
paws.analytics,
paws.application.integration,
recipes,
rlang,
rsample,
stringr,
tictoc,
tidyr,
tune,
workflows,
yaml,
yardstick
Remotes:
ccao-data/assessr,
ccao-data/ccao,
ccao-data/lightsnip
Copy link
Member Author

@dfsnow dfsnow Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I consolidated all the library loading to R/setup.R and we explicitly track the dev dependencies via profiles, I figured it can't hurt to also make our core dependency tracking explicit.

DBI,
igraph,
lubridate,
markdown,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Required for building README.Rmd.

@@ -2,36 +2,22 @@
# 1. Setup ---------------------------------------------------------------------
#- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# NOTE: See DESCRIPTION for library dependencies and R/setup.R for
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this note at the beginning of each pipeline stage because people are used to seeing calls to library(), so we should direct them to the correct place.


# Load R libraries
# Load additional dev R libraries (see README#managing-r-dependencies)
Copy link
Member Author

@dfsnow dfsnow Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we load the dev dependencies from the separate profile in addition to the main set of dependencies.

Comment on lines -9 to -46
# Load libraries and scripts
options(tidymodels.dark = TRUE)
suppressPackageStartupMessages({
library(arrow)
library(butcher)
library(ccao)
library(dplyr)
library(here)
library(lightgbm)
library(lightsnip)
library(tictoc)
library(tidymodels)
library(vctrs)
library(yaml)
})

# Load helpers and recipes from files
walk(list.files("R/", "\\.R$", full.names = TRUE), source)

# Initialize a dictionary of file paths. See misc/file_dict.csv for details
paths <- model_file_dict()

# Load the parameters file containing the run settings
params <- read_yaml("params.yaml")

# Override the default CV toggle from params.yaml. This is useful for manually
# running "limited" runs without CV or assessment data (also used for GitHub CI)
cv_enable <- as.logical(
Sys.getenv("CV_ENABLE_OVERRIDE", unset = params$toggle$cv_enable)
)

# Get the number of available physical cores to use for lightgbm multi-threading
# Lightgbm docs recommend using only real cores, not logical
# https://lightgbm.readthedocs.io/en/latest/Parameters.html#num_threads
num_threads <- parallel::detectCores(logical = FALSE)

# Set the overall stage seed
set.seed(params$model$seed)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The big setup block from all the pipeline stages is for the most part entirely contained in R/setup.R.

Comment on lines -5 to -9
"package.dependency.fields": [
"Imports",
"Depends",
"LinkingTo"
],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required or renv will try to snapshot both the profile-specific key in the DESCRIPTION file and everything in the Depends: key.

@@ -12,7 +12,7 @@
"ppm.enabled": true,
"ppm.ignored.urls": [],
"r.version": [],
"snapshot.type": "implicit",
"snapshot.type": "explicit",
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to explicit tracking in the DESCRIPTION file.

Copy link
Member Author

@dfsnow dfsnow Dec 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I consolidated all the redundant setup sections of each stage into this single script. The only thing we really lose is information about which dependencies are used per stage, but this is much cleaner and just uses the Depends: key in the DESCRIPTION.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved this to .github/scripts/ since it's unlikely to be used by any pipeline users and is specific to GitHub. Also, it gets called at the start of every script if it's in R/ by:

purrrr::walk(list.files("R/", "\\.R$", full.names = TRUE), source)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Managed to cut all the additional tidymodels dependencies that we weren't using by replacing the tidymodels meta package with explicit dependencies only for the libraries we're using.

@dfsnow dfsnow added this to the Final 2024 residential model milestone Dec 5, 2023
@dfsnow dfsnow marked this pull request as ready for review December 5, 2023 02:19
@dfsnow dfsnow self-assigned this Dec 5, 2023
Copy link
Contributor

@jeancochrane jeancochrane left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice cleanup!

Comment on lines +39 to +40
# Load helpers for model dictionary, data loading, and other misc functions
source(here::here("R", "helpers.R"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Question, non-blocking] Given that we seem to walk the R directory to load this script in each pipeline stage, do we actually need to source R/helpers.R here manually, or can we count on the walk() call doing so?

# Load libraries, helpers, and recipes from files
purrr::walk(list.files("R/", "\\.R$", full.names = TRUE), source)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to in order to load model_file_dict() within the setup script, else loading it is dependent on the walk order.

"Train" = "01", "Assess" = "02",
"Evaluate" = "03", "Interpret" = "04",
"Finalize" = "05"
"Train" = "01", "Assess" = "02", "Evaluate" = "03",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

finally

Copy link
Member

@wrridgeway wrridgeway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pipeline is getting pretty damn slick

@dfsnow dfsnow merged commit 88805fb into master Dec 5, 2023
4 checks passed
@dfsnow dfsnow deleted the dfsnow/pipeline-cleanup branch December 5, 2023 16:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Trim the full pipeline package dependencies
3 participants