Remove MLFlow Integration from Base Environment for Streamlined Configuration by lvijnck · Pull Request #421 · everycure-org/matrix

lvijnck · 2024-09-18T08:55:35Z

Description

to add

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Re-ran against prod

Checklist:

added label to PR (e.g. enhancement or bug)
I have looked at the diff on github to make sure no unwanted files have been committed.
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
Any dependent changes have been merged and published in downstream modules
ran /run-tests check at the end of PR collaboration work to execute integration tests

piotrkan · 2024-09-18T09:04:47Z

Why are we removing MLFlow from base @lvijnck ? It's useful even locally, I used it myself quite a lot when doing embedding benchmark work

lvijnck · 2024-09-18T09:10:35Z

> Why are we removing MLFlow from base @lvijnck ? It's useful even locally, I used it myself quite a lot when doing embedding benchmark work

What exactly do you use locally in that case? For lots of use cases it's really in the way, as it adds a stateful component to local testing.

That being said, I think we should aim for the ability to run locally either with or without MLFlow, though Kedro does not support that currently.
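As a minimal sketch of what such a toggle could look like: the snippet below decides whether tracking is on from the environment name plus an optional override, and falls back to a no-op tracker so local runs stay stateless. None of this is Kedro's or kedro-mlflow's API. The `MATRIX_ENABLE_MLFLOW` variable, the `Tracker` classes, and the rule "off in `base`, on elsewhere" are all assumptions made for illustration.

```python
import os
from typing import Mapping, Optional


class NoOpTracker:
    """Discards everything, so a local run leaves no state behind."""

    def log_metric(self, name: str, value: float) -> None:
        return None


class InMemoryTracker:
    """Hypothetical stand-in for a tracker backed by an MLflow server."""

    def __init__(self) -> None:
        self.metrics: dict = {}

    def log_metric(self, name: str, value: float) -> None:
        self.metrics.setdefault(name, []).append(value)


def tracking_enabled(kedro_env: str, environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when experiment tracking should be active.

    Assumed policy: an explicit override wins; otherwise tracking is
    disabled in the `base` environment and enabled everywhere else.
    """
    environ = os.environ if environ is None else environ
    override = environ.get("MATRIX_ENABLE_MLFLOW")  # hypothetical variable name
    if override is not None:
        return override.strip().lower() in {"1", "true", "yes"}
    return kedro_env != "base"


def make_tracker(kedro_env: str, environ: Optional[Mapping[str, str]] = None):
    """Pick the tracker implementation for the given environment."""
    if tracking_enabled(kedro_env, environ):
        return InMemoryTracker()
    return NoOpTracker()
```

With this shape, pipeline code would call `make_tracker(env).log_metric(...)` unconditionally, and whether anything gets recorded becomes a configuration concern rather than a code change.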

piotrkan · 2024-09-18T09:26:10Z

I used it mainly for comparing PCA plots and convergence plots between different node2vec/GraphSAGE runs on the subsamples. Once we have a sample of the data, as mentioned in #348, we can also run modelling tests and compare the modelling runs against each other. Having it locally can also be good for development work: for the inference pipeline, for example, I used it to extract the model from the local MLFlow artifact storage.

If it were up to me, I would keep it in base, but let's discuss it in a broader group. I agree that the ideal scenario would be to have a choice to run the pipeline with or without MLFlow tracking.

lvijnck added 2 commits September 18, 2024 10:54

extract

a8ae8c6

add metadata

801d304

rm from base

8ba9cf9

pascalwhoop previously approved these changes Sep 19, 2024

View reviewed changes

lvijnck marked this pull request as ready for review September 19, 2024 09:31

lvijnck requested review from alexeistepa, leelancashire and piotrkan as code owners September 19, 2024 09:31

add mlflow catalog entries

3c526b9

lvijnck dismissed pascalwhoop’s stale review via 3c526b9 September 19, 2024 10:10

lvijnck added 4 commits September 19, 2024 12:11

add

190e0ba

add section

6d09780

add section

90d75a1

rm tpl

445f60e

lvijnck merged commit c8b3668 into main Sep 19, 2024

pascalwhoop pushed a commit that referenced this pull request Sep 19, 2024

Remove MLFlow for base env (#421)

2f1d17a

* extract * add metadata * rm from base * add mlflow catalog entries * add * add section * add section * rm tpl

lvijnck mentioned this pull request Sep 20, 2024

Simplify MLFlow setup in Kedro pipeline #399

Closed

pascalwhoop changed the title ~~Remove MLFlow for base env~~ Remove MLFlow Integration from Base Environment for Streamlined Configuration Nov 1, 2024

pascalwhoop added the Simplification label (a label to use for PRs that make things simpler) Nov 1, 2024

oliverw1 deleted the feature/rm-mlflow-for-base branch January 21, 2025 21:37

lvijnck merged 8 commits into main from feature/rm-mlflow-for-base
