Path resolution fixes for DatabricksArtifactRepository #4

Merged

@dbczumar dbczumar commented Jun 8, 2020

What changes are proposed in this pull request?

This PR modifies DatabricksArtifactRepository to compute the path of the repository's artifact_uri relative to the artifact root of the associated MLflow Run. All operations are then performed relative to this computed location.

For example, if the repository is instantiated with the URI dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts/my/subpath, all LIST/UPLOAD/DOWNLOAD operations will be performed relative to this location. Calling list_artifacts("foo") will list the artifacts under dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts/my/subpath/foo. Previously, artifacts were listed under dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts (the run root). This worked because list_artifacts() returned artifact paths relative to the run root rather than relative to the artifact repo root.
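
To make the intended resolution concrete, here is a minimal sketch of how a repo-relative path maps onto a full location. It is not the code in this PR: the helper name `resolve_artifact_location` is hypothetical, and the `EXP_ID` / `RUN_ID` segments are placeholders standing in for real IDs.

```python
import posixpath


def resolve_artifact_location(artifact_repo_root_uri, relative_path=None):
    """Hypothetical helper: join a repo-relative path onto the repo root URI.

    Illustrates the behavior described above; it is not the actual
    DatabricksArtifactRepository implementation.
    """
    if not relative_path:
        # No subpath given: operate on the repository root itself.
        return artifact_repo_root_uri
    # Strip leading slashes so posixpath.join does not discard the root.
    return posixpath.join(artifact_repo_root_uri, relative_path.lstrip("/"))


# Placeholder IDs; a real URI would contain actual experiment and run IDs.
repo_root = "dbfs:/databricks/mlflow-tracking/EXP_ID/RUN_ID/artifacts/my/subpath"

# list_artifacts("foo") should look under <repo root>/foo, not <run root>/foo.
print(resolve_artifact_location(repo_root, "foo"))
```

With this resolution, paths returned from listing would also be expressed relative to `repo_root` rather than relative to the run root.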

@arjundc-db Let me know if this makes sense and please leave comments / questions! If you can add relevant tests for listing behavior (e.g., tests ensuring that the repo returns paths relative to the artifact repo root rather than the run root), that would be awesome!

Currently missing:

  • Tests
  • Possibly comments (@arjundc-db let me know if you can identify any additional places where comments would be helpful)

How is this patch tested?

(Details)

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: Local serving, model deployment tools, spark UDFs
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, JavaScript, plotting
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

# (the `run_relative_artifact_repo_root_path`). All operations performed on this artifact
# repository will be performed relative to this computed location
artifact_repo_root_path = extract_and_normalize_path(artifact_uri)
run_artifact_root_uri = self._get_run_artifact_root(self.run_id)

We fetch the actual run artifact root from the MLflow Tracking Service because it seems somewhat brittle to assume that the root is exactly dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/artifacts. We must assume that it's at least dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>, but we don't need to assume that the root contains the artifacts subdirectory or that the root is not in some other subdirectory (e.g., dbfs:/databricks/mlflow-tracking/<EXP_ID>/<RUN_ID>/my/other/awesome/root).
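
As a rough sketch of why fetching the real root helps, the relative path can be derived from whatever root the Tracking Service reports, without assuming an `artifacts` subdirectory. The function name `run_relative_repo_root` and the sample IDs below are hypothetical, and the real code uses `extract_and_normalize_path` rather than the `urlparse`-based normalization assumed here.

```python
import posixpath
from urllib.parse import urlparse


def run_relative_repo_root(artifact_uri, run_artifact_root_uri):
    """Hypothetical helper: path of the repo root relative to the run's artifact root.

    Assumes both URIs share the same scheme and differ only in their path.
    """
    repo_path = posixpath.normpath(urlparse(artifact_uri).path)
    run_root_path = posixpath.normpath(urlparse(run_artifact_root_uri).path)
    rel = posixpath.relpath(repo_path, run_root_path)
    if rel == ".." or rel.startswith("../"):
        # The repo URI does not live under the run's artifact root.
        raise ValueError(
            "%s is not under run artifact root %s" % (artifact_uri, run_artifact_root_uri)
        )
    return rel


# Conventional layout (placeholder IDs "1" and "abc").
print(run_relative_repo_root(
    "dbfs:/databricks/mlflow-tracking/1/abc/artifacts/my/subpath",
    "dbfs:/databricks/mlflow-tracking/1/abc/artifacts",
))

# Non-standard root: still works because nothing is hard-coded.
print(run_relative_repo_root(
    "dbfs:/databricks/mlflow-tracking/1/abc/my/other/awesome/root/models",
    "dbfs:/databricks/mlflow-tracking/1/abc/my/other/awesome/root",
))
```

When the repository URI equals the run artifact root, `posixpath.relpath` returns `"."`, which a caller would likely treat as "no subpath".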

@arjundc-db arjundc-db merged commit b67e4dc into arjundc-db:databricks-artifact-repo Jun 8, 2020