Path resolution fixes for DatabricksArtifactRepository #4

dbczumar · 2020-06-08T01:44:28Z

What changes are proposed in this pull request?

This PR modifies DatabricksArtifactRepository to compute the path of the repository's artifact_uri relative to the artifact root of the associated MLflow Run. All operations on the repository are then performed relative to this computed location.

For example, if the repository is instantiated with the uri dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts/my/subpath, all LIST/UPLOAD/DOWNLOAD operations will be performed relative to this location. Calling list_artifacts("foo") will list the artifacts under dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts/my/subpath/foo. Previously, artifacts were listed under dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts (the run root). That only appeared to work because list_artifacts() returned artifact paths relative to the run root rather than relative to the artifact repo root.
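A minimal sketch of the intended resolution, assuming the repo root is already known as a path relative to the run's artifact root. The helper names `resolve_in_repo` and `to_repo_relative` are hypothetical and are not part of MLflow:

```python
import posixpath


def resolve_in_repo(repo_root_path, path=None):
    """Resolve a caller-supplied path against the artifact repo root.

    Hypothetical helper: both paths are POSIX-style and relative to the
    run's artifact root.
    """
    return posixpath.join(repo_root_path, path) if path else repo_root_path


def to_repo_relative(repo_root_path, run_relative_path):
    """Convert a run-root-relative path into a repo-root-relative path."""
    return posixpath.relpath(run_relative_path, repo_root_path)


# Repo instantiated at .../artifacts/my/subpath, so its run-relative root is:
repo_root = "my/subpath"

# list_artifacts("foo") should look under my/subpath/foo ...
listed_dir = resolve_in_repo(repo_root, "foo")

# ... and report results relative to the repo root, not the run root.
returned = to_repo_relative(repo_root, "my/subpath/foo/model.pkl")
```

Here `listed_dir` is the directory queried under the run root, and `returned` is the path a caller of list_artifacts() would see.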

@arjundc-db Let me know if this makes sense and please leave comments / questions! If you can add relevant tests for listing behavior (e.g., tests ensuring that the repo returns paths relative to the artifact repo root rather than the run root), that would be awesome!

Currently missing:

Tests
Possibly comments (@arjundc-db let me know if you can identify any additional places where comments would be helpful)

How is this patch tested?

(Details)

Release Notes

Is this a user-facing change?

No. You can skip the rest of this section.
Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/projects: MLproject format, project running backends
area/scoring: Local serving, model deployment tools, spark UDFs
area/tracking: Tracking Service, tracking client APIs, autologging

Interface

area/uiux: Front-end, user experience, JavaScript, plotting
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

Language

language/r: R APIs and clients
language/java: Java APIs and clients

Integrations

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations

How should the PR be classified in the release notes? Choose one:

rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes

dbczumar · 2020-06-08T01:46:47Z

mlflow/store/artifact/databricks_artifact_repo.py

+        # (the `run_relative_artifact_repo_root_path`). All operations performed on this artifact
+        # repository will be performed relative to this computed location
+        artifact_repo_root_path = extract_and_normalize_path(artifact_uri)
+        run_artifact_root_uri = self._get_run_artifact_root(self.run_id)


We fetch the actual run artifact root from the MLflow Tracking Service because it seems somewhat brittle to assume that the root is exactly dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts. We must assume that it's at least dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>, but we don't need to assume that the root contains the artifacts subdirectory or that the root is not in some other subdirectory (e.g., dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/my/other/awesome/root).
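As a rough illustration of that computation (not the code in this PR), the sketch below derives the run-relative repo root from two URIs. The function name `run_relative_repo_root` is hypothetical, the run artifact root is passed in as a string rather than fetched from the Tracking Service, and `EXP_ID` / `RUN_ID` are literal placeholders:

```python
import posixpath
from urllib.parse import urlparse


def run_relative_repo_root(artifact_uri, run_artifact_root_uri):
    """Return the repo root's path relative to the run's artifact root.

    Hypothetical helper. Returns "" when the two locations coincide and
    raises ValueError when the repo root lies outside the run root.
    """
    repo_path = posixpath.normpath(urlparse(artifact_uri).path)
    root_path = posixpath.normpath(urlparse(run_artifact_root_uri).path)
    # Compare whole path components so that ".../artifacts2" is not
    # mistaken for a child of ".../artifacts".
    if repo_path != root_path and not repo_path.startswith(root_path + "/"):
        raise ValueError(
            "artifact_uri %r is not under run artifact root %r"
            % (artifact_uri, run_artifact_root_uri)
        )
    relative = posixpath.relpath(repo_path, root_path)
    return "" if relative == "." else relative


base = "dbfs:/databricks/mlflow-tracking/EXP_ID/RUN_ID"

# Root with the usual "artifacts" subdirectory.
rel_default = run_relative_repo_root(
    base + "/artifacts/my/subpath", base + "/artifacts"
)

# Root in some other subdirectory; the repo is rooted exactly there.
rel_custom = run_relative_repo_root(
    base + "/my/other/awesome/root", base + "/my/other/awesome/root"
)
```

Because the root is supplied rather than assumed, the same function handles both the default layout and a non-standard root.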

dbczumar added 2 commits June 7, 2020 18:34

Fix - needs docs and tests

3d923cb

Comment and simplification

d3fab4a

arjundc-db merged commit b67e4dc into arjundc-db:databricks-artifact-repo Jun 8, 2020
