improve destination fingerprinting and info #751

Open
rudolfix opened this issue Nov 10, 2023 · 0 comments
Labels
tech-debt Leftovers from previous sprint that should be fixed over time

Collaborator

rudolfix commented Nov 10, 2023

Background
Building on #746, we want robust destination fingerprinting and destination info that can be included in LoadInfo and pipeline traces. The fingerprint is used to anonymously identify cloud destinations and makes it possible to build proper tracing and data lineage.

Tasks

    • improve the filesystem fingerprint by including only the scheme and netloc of the bucket URL. Currently we also include the path within the bucket, which is too much. For a local filesystem configuration, return a hash of an empty string.
    • add a method to the destination factory that returns destination info as a named tuple/dict
    • improve fingerprinting of destinations that run locally (duckdb/filesystem/postgres)
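
To make the first task concrete, here is a minimal sketch of a filesystem fingerprint that keeps only the scheme and netloc. The function names `filesystem_fingerprint` and `_digest` are hypothetical, and SHA-256 is an assumption; the actual implementation may use a different digest helper.

```python
import hashlib
from urllib.parse import urlparse


def _digest(value: str) -> str:
    # Hypothetical helper: the hash function is an assumption, not the project's choice.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def filesystem_fingerprint(bucket_url: str) -> str:
    """Fingerprint a bucket URL using only its scheme and netloc."""
    parsed = urlparse(bucket_url)
    # Plain paths have no scheme, file:// is local by definition, and a
    # one-letter "scheme" is most likely a Windows drive letter (C:\data).
    if parsed.scheme in ("", "file") or len(parsed.scheme) == 1:
        return _digest("")
    # The path inside the bucket is deliberately left out.
    return _digest(f"{parsed.scheme}://{parsed.netloc}")
```

With this approach, two different paths in the same bucket produce the same fingerprint, while every local configuration maps to the hash of an empty string.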

Implementation
Please open 3 PRs, one for each of the 3 tasks.

destination info contains:

  • destination type, name, environment and fingerprint
  • destination str() representation (which is display safe)
  • local/remote flag
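
One possible shape for the destination info, sketched as a named tuple. The class name `DestinationInfo` and all field names are assumptions made for illustration; only the list of contents comes from this issue.

```python
from typing import NamedTuple, Optional


class DestinationInfo(NamedTuple):
    # All field names below are hypothetical.
    destination_type: str
    destination_name: str
    environment: Optional[str]
    fingerprint: str
    display_str: str  # display-safe str() representation of the destination
    is_local: bool    # local/remote flag


info = DestinationInfo(
    destination_type="filesystem",
    destination_name="my_bucket",
    environment=None,
    fingerprint="<fingerprint>",
    display_str="s3://bucket",
    is_local=False,
)

# A named tuple converts to a dict, which is handy for LoadInfo and traces.
info_dict = info._asdict()
```

A named tuple covers both the "named tuple" and "dict" options, since `_asdict()` gives the dict form without a second type.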

local destination:
for destinations that may run locally (duckdb, postgres, weaviate, qdrant, filesystem etc.) we should start generating fingerprints

  • if the destination runs locally (always for duckdb; file:// for filesystem; localhost, 127.0.0.1 or the IPv6 localhost in a connection string), a local flag must be set to true in the info
  • the fingerprint must include .anonymous_id as used by telemetry, and the path to the file/database when applicable
  • .anonymous_id should be detached from the telemetry code and become independent (maybe part of the paths.py module)
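
The local-destination rules above could be sketched roughly as follows. All function names are hypothetical, the destination type strings are assumed to match the names used in this issue, and SHA-256 stands in for whatever digest the project settles on. The `anonymous_id` is passed in as a plain string, on the assumption that it has already been decoupled from telemetry.

```python
import hashlib
import ipaddress
from typing import Optional
from urllib.parse import urlparse


def is_local_host(host: Optional[str]) -> bool:
    """True for localhost and loopback addresses (IPv4 and IPv6)."""
    if not host:
        return False
    if host.lower() == "localhost":
        return True
    try:
        # Note: is_loopback covers all of 127.0.0.0/8 and ::1,
        # which is slightly broader than just 127.0.0.1.
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # a regular hostname, not an IP literal


def is_local_destination(destination_type: str, location: str) -> bool:
    """Decide the local flag from the destination type and its URL/connection string."""
    if destination_type == "duckdb":
        return True  # duckdb always runs locally
    parsed = urlparse(location)
    if destination_type == "filesystem":
        return parsed.scheme in ("", "file")
    return is_local_host(parsed.hostname)


def local_fingerprint(anonymous_id: str, path: Optional[str] = None) -> str:
    """Combine the anonymous id with the file/database path when there is one."""
    payload = anonymous_id if path is None else f"{anonymous_id}|{path}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the anonymous id keeps fingerprints of local destinations distinct across machines, while the path distinguishes several local databases on the same machine.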
Projects
Status: Planned