
Upload versioned connector docs on publish #30410

Merged
merged 19 commits into master from lmossman/upload-versioned-connector-docs on Sep 27, 2023

Conversation

lmossman
Contributor

@lmossman lmossman commented Sep 13, 2023

What

Resolves #29400

This PR adds logic to the metadata_service which uploads connector documentation to GCS along with the other metadata uploads.

This will be used by an upcoming Airbyte server API endpoint that retrieves docs for specific connector versions, which the UI will in turn use to show the correct documentation for the connector version actively being used.

How

Adds a _doc_upload() function to the gcs_upload.py file which does the following (a sketch is shown after this list):

  • Gets the path to the connector's doc file by walking up from the metadata file path and then down into the docs folder, using the connector name from the metadata's documentationUrl path to determine the name of the doc file
  • Checks for both .inapp.md and .md files, and uploads both if they exist
  • This function is called once with latest=False to upload to the versioned path, and once with latest=True to upload to the latest path, ensuring that doc files are present in both.
    • Having the doc file in the latest path is important, as the upcoming server API will use it as a fallback if it cannot find a doc in the versioned path
    • As a follow-up to this PR, I will write and execute a script that uploads the current docs of every connector to the latest path so that the fallback is always present
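
A minimal sketch of that flow, pieced together from the code excerpts quoted in the review below (the repo-relative path walking and the documentationUrl parsing are assumptions, not the merged code):

from pathlib import Path

def _doc_upload(metadata, bucket, metadata_file_path: Path, latest: bool, inapp: bool):
    # Walk up from <connector>/metadata.yaml to the repo root, then down
    # into the docs folder (the exact number of levels is illustrative).
    docs_folder_path = metadata_file_path.parents[3] / "docs" / "integrations"

    # Derive the doc file name from the metadata's documentationUrl, e.g.
    # https://docs.airbyte.com/integrations/sources/faker -> sources/faker
    doc_file_name = "/".join(metadata.data.documentationUrl.split("/")[-2:])
    suffix = "inapp.md" if inapp else "md"
    local_doc_path = docs_folder_path / f"{doc_file_name}.{suffix}"

    remote_doc_path = get_doc_remote_file_path(
        metadata.data.dockerRepository, "latest" if latest else metadata.data.dockerImageTag, inapp
    )

    if local_doc_path.exists():
        doc_uploaded, doc_blob_id = upload_file_if_changed(local_doc_path, bucket, remote_doc_path)
    else:
        doc_uploaded, doc_blob_id = False, f"No doc found at {local_doc_path}"
    return doc_uploaded, doc_blob_id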

Note: I decided to keep the connector docs where they are today, as opposed to moving them next to the metadata.yaml files, because Sherif and I decided that we wanted to keep this change as small in scope as possible, so that we can complete the other UI doc tasks we have planned for this quarter.

This small change will unlock the ability to have versioned docs in the UI, and once Connector Ops feels the need to prioritize it, they can do the full migration of these docs into the connector folders, updating docusaurus, etc.

🚨 User Impact 🚨

No breaking changes

@lmossman lmossman marked this pull request as ready for review September 13, 2023 19:22
@lmossman
Contributor Author

lmossman commented Sep 13, 2023

@bnchrch what is the best way to test this change on CI? Is there a slash command I can use to try running the publishing step? I assume we have something like this for pre-releases.

I've tested it locally by running poetry run metadata_service upload and confirmed that it is working as expected, but I'd like to test on CI too.
[screenshot of the local upload command output]

@lmossman
Contributor Author

lmossman commented Sep 13, 2023

One thing I just realized I forgot to address here is images in connector docs: if the markdown docs point at a relative path, e.g. the ./assets/docs folder, then those images will not load correctly if the UI just pulls the markdown directly.

I think the best way to address this would be to modify any connector docs that include images to point at the full GitHub URL of the images instead of using the relative path, as this will allow the UI to fetch those images over the network. I would make this change in a separate PR.

Otherwise, we'd need to find a way to upload all referenced images to GCS as well, which feels cumbersome.

EDIT: I've thought about the images issue a bit more, and have some more thoughts:

  • The above approach has a couple of issues:
    • It requires connector developers to upload images to GitHub before they can be rendered in docusaurus, which is awkward since today they usually add the docs and images in the same PR
    • It requires connector developers to change their current approach, and to remember to always use full image URLs in the future as well, which is very prone to mistakes
  • Instead, I'd like to propose a different approach:
    • When I update the UI to pull versioned connector docs from GCS, I can make another change as well: when our UI encounters an image in the markdown with a relative path like ../../, we can programmatically replace the relative part of that path with the string https://raw.githubusercontent.com/airbytehq/airbyte/master/docs/
    • This way, our UI will always point to the full github URL of the image, which is really the source of truth for the image anyway, and it doesn't require us to copy those images anywhere else and manage that ourselves
    • We are already doing relative path replacement in the webapp to point to the webapp's bundled copy of the docs/ repo anyway, so this is not a large departure from our current approach (a rough sketch of this replacement follows this list)
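
For illustration only, a sketch of that replacement (the helper name and regex are assumptions; the real webapp code is TypeScript and may differ):

import re

RAW_DOCS_BASE = "https://raw.githubusercontent.com/airbytehq/airbyte/master/docs/"

def absolutize_relative_images(markdown: str) -> str:
    # Replace the leading ../ segments of markdown image links with the
    # raw GitHub docs URL, e.g. ![x](../../.gitbook/assets/img.png)
    # -> ![x](https://raw.githubusercontent.com/airbytehq/airbyte/master/docs/.gitbook/assets/img.png)
    return re.sub(r"(!\[[^\]]*\]\()(?:\.\./)+", r"\1" + RAW_DOCS_BASE, markdown)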

The main problem I see with this approach is that if a user is using an old connector version whose docs reference an image which no longer exists on the master branch (e.g. the image was removed or renamed in a later release), then that link won't work and we won't show the image. However, I think this is edge-casey enough to be acceptable, and preferable to having to manage versioned images as well.

At least, this should be an okay placeholder until we do the full refactor to move docs and their associated images into connector folders, at which point it would be easier to just upload the images for that connector to its metadata folder.

What do you think @bnchrch?

Comment on lines 65 to 121
# TEST UPLOAD COMMAND
@pytest.mark.parametrize(
-    "latest_uploaded, version_uploaded, icon_uploaded",
+    "latest_uploaded, version_uploaded, icon_uploaded, doc_version_uploaded, doc_inapp_version_uploaded, doc_latest_uploaded, doc_inapp_latest_uploaded",
    [
-        (False, False, False),
-        (True, False, False),
-        (False, True, False),
-        (False, False, True),
-        (True, True, False),
-        (True, False, True),
-        (False, True, True),
-        (True, True, True),
+        (False, False, False, False, False, False, False),
+        (True, False, False, False, False, False, False),
+        (False, True, False, False, False, False, False),
+        (False, False, True, False, False, False, False),
+        (True, True, False, False, False, False, False),
+        (True, False, True, False, False, False, False),
+        (False, True, True, False, False, False, False),
+        (True, True, True, False, False, False, False),
+        (True, True, True, True, True, True, True),
    ],
)
Contributor Author

Including every permutation of 7 different boolean parameters would be 2^7 = 128 different test cases, which felt unnecessary, so I just included one test case where all are set to true.

Contributor

That seems fair!

Contributor

This might be helpful in this case (I've often found myself wanting the feature in Java):
[screenshot from https://docs.pytest.org/en/latest/how-to/parametrize.html#pytest-mark-parametrize-parametrizing-test-functions]
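
(Presumably the screenshot showed pytest's stacked parametrize support; a sketch of how stacking could generate all the combinations, using the seven flags from this test:)

# Each stacked decorator multiplies the parameter matrix, so pytest
# generates the full 2**7 = 128 cross product automatically.
@pytest.mark.parametrize("latest_uploaded", [True, False])
@pytest.mark.parametrize("version_uploaded", [True, False])
@pytest.mark.parametrize("icon_uploaded", [True, False])
@pytest.mark.parametrize("doc_version_uploaded", [True, False])
@pytest.mark.parametrize("doc_inapp_version_uploaded", [True, False])
@pytest.mark.parametrize("doc_latest_uploaded", [True, False])
@pytest.mark.parametrize("doc_inapp_latest_uploaded", [True, False])
def test_upload(latest_uploaded, version_uploaded, icon_uploaded,
                doc_version_uploaded, doc_inapp_version_uploaded,
                doc_latest_uploaded, doc_inapp_latest_uploaded):
    ...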

Contributor

If indeed we are just enumerating all the options

@bnchrch
Contributor

bnchrch commented Sep 13, 2023

Hey @lmossman, let me answer some of these larger questions!

Then I'll be back for an in-depth code review later.

what is the best way to test this change on CI?

So I believe you can test using airbyte-ci safely so long as you have

  1. METADATA_SERVICE_BUCKET_NAME pointed at your dev bucket
  2. METADATA_SERVICE_GCS_CREDENTIALS pointed at a service account with access to that bucket
  3. A connector whose Docker image is already built and published to DockerHub (this should be all of them)

Here's an example of the env vars I set:

export METADATA_SERVICE_GCS_CREDENTIALS=`cat ~/Development/secrets/dev-metadata-service-account.json`
export METADATA_SERVICE_BUCKET_NAME="dev-airbyte-cloud-connector-metadata-service"

export SPEC_CACHE_BUCKET_NAME="io-airbyte-cloud-spec-cache"
export SPEC_CACHE_GCS_CREDENTIALS=`cat ~/Development/secrets/prod-metadata-service-account.json`

export GCP_GSM_CREDENTIALS=`cat ~/Development/secrets/ben-dev-account-dataline-service-key.json`

export DOCKER_HUB_USERNAME="benairbyte"
export DOCKER_HUB_PASSWORD="REDACTED"

Afterwards you should be free to run

airbyte-ci connectors --name=source-faker test # which triggers metadata validate
airbyte-ci connectors --name=source-faker publish --main-release # which triggers metadata upload

On markdown images
This is tricky!

So the main question lies with "who is responsible for turning relative paths into absolute URLs?"

The FE already does this so it would not be a stretch to do it there again.

However, I think a good set of principles to follow is:

  1. The platform BE should be unaware of implementation details of the connectors and related systems
  2. FE should be unaware of the implementation details of any external system (platform, connectors or otherwise)
  3. Hacks that result from us taking a shortcut should be located as close as possible to the shortcut we need to change

With that in mind, I (naively) think that:

  1. the metadata service should be responsible for finding and uploading relative images to the bucket
  2. the registry orchestrator should be responsible for transforming them into absolute urls

Where this breaks down / requires extra thought is

  • How often are images not colocated? i.e. are we seeing image paths that look like this ../../../../someotherpage/someothersection/tutorial_123.png?

@lmossman
Contributor Author

How often are images not colocated? i.e. are we seeing image paths that look like this ../../../../someotherpage/someothersection/tutorial_123.png?

@bnchrch I just checked, and there are actually only 8 connector docs containing images; all of them are either full GitHub URLs or point to the ../../.gitbook/assets/ folder, and the images all seem to be unique to each connector:
[screenshot of the image references found in connector docs]

So that should make the image uploading logic fairly straightforward (a rough sketch follows the list), i.e.:

  • Inspect the markdown file
  • Find all images with relative paths
  • Upload those images to the GCS bucket
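
A rough sketch of that idea, assuming google-cloud-storage and a hypothetical assets/ layout in the bucket (this logic was never implemented in this PR):

import re
from pathlib import Path

RELATIVE_IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\(((?:\.\./)+[^)]+)\)")

def upload_relative_images(local_doc_path: Path, bucket, remote_folder: str) -> None:
    # Find every markdown image whose path starts with ../ segments and
    # upload it next to the doc in GCS.
    markdown = local_doc_path.read_text()
    for relative_path in RELATIVE_IMAGE_PATTERN.findall(markdown):
        local_image_path = (local_doc_path.parent / relative_path).resolve()
        if local_image_path.exists():
            blob = bucket.blob(f"{remote_folder}/assets/{local_image_path.name}")
            blob.upload_from_filename(str(local_image_path))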

bnchrch
bnchrch previously approved these changes Sep 14, 2023
Contributor

@bnchrch bnchrch left a comment


@lmossman This is great!

The only comment I have that is a "blocker" is the question of whether we should fail if there are no docs.

color="green"
)
if metadata_upload_info.doc_inapp_latest_uploaded:
click.secho(
Contributor

super nit: We're getting a little bit long in the ifs here

I know I started it 😅

But if we created a helper I would not be mad

def log_if_uploaded(metadata_upload_info: MetadataUploadInfo, upload_success_key: str, upload_id_key: str, name: str):
    # Look the flag and blob id up by attribute name, so each file type is one call.
    success = getattr(metadata_upload_info, upload_success_key)
    if success:
        path = getattr(metadata_upload_info, upload_id_key)
        click.secho(
            f"The {name} file for {metadata_upload_info.metadata_file_path} was uploaded to {path}.",
            color="green"
        )

log_if_uploaded(metadata_upload_info, "icon_uploaded", "icon_blob_id", "icon")

local_doc_path = docs_folder_path / f"{doc_file_name}.md"
local_inapp_doc_path = docs_folder_path / f"{doc_file_name}.inapp.md"

remote_doc_path = get_doc_remote_file_path(metadata.data.dockerRepository, "latest" if latest else metadata.data.dockerImageTag, False)
Contributor

👏 very nice

if local_doc_path.exists():
    doc_uploaded, doc_blob_id = upload_file_if_changed(local_doc_path, bucket, remote_doc_path)
else:
    doc_uploaded, doc_blob_id = False, f"No doc found at {local_doc_path}"
Contributor

❓ Do we want to enforce that they have docs?

Contributor Author

Yeah, I think this should be enforced; I will add an exception here.

Contributor

Great. Let's run tests on a few GA connectors so that we're sure we didn't cause a regression :)

@@ -189,4 +234,12 @@ def upload_metadata_to_gcs(
         icon_blob_id=icon_blob_id,
         icon_uploaded=icon_uploaded,
         metadata_file_path=str(metadata_file_path),
+        doc_version_uploaded=doc_version_uploaded,
+        doc_version_blob_id=doc_version_blob_id,
Contributor

I made a comment about "hey, we have a lot of ifs in the log function".

I think that's a symptom of MetadataUploadInfo being flat.

@lmossman Do you have the capacity to move this into a more generic uploaded_files field?

It should go a long way in removing boilerplate.

# Updated models

@dataclass(frozen=True)
class MetadataUploadInfo:
    metadata_uploaded: bool
    metadata_file_path: str
    uploaded_files: List[UploadedFile]

@dataclass(frozen=True)
class UploadedFile:
    id: str
    uploaded: bool
    description: str
    blob_id: Optional[str]

# Updated upload helpers

def _version_upload(metadata: ConnectorMetadataDefinitionV0, bucket: storage.bucket.Bucket, metadata_file_path: Path) -> UploadedFile:
    version_path = get_metadata_remote_file_path(metadata.data.dockerRepository, metadata.data.dockerImageTag)
    uploaded, blob_id = upload_file_if_changed(metadata_file_path, bucket, version_path, disable_cache=True)
    return UploadedFile(
        id="version_metadata"
        uploaded=uploaded,
        description="versioned metadata file",
        blob_id=blob_id,
    )

# Updated main method

version_metadata_upload = _version_upload(metadata, bucket, metadata_file_path)

return MetadataUploadInfo(
    metadata_uploaded=version_metadata_upload.uploaded or latest_metadata_upload.uploaded,
    metadata_file_path=str(metadata_file_path),
    uploaded_files=[
        version_metadata_upload,
        latest_metadata_upload,
    ]
)

# updated logger

for file in metadata_upload_info.uploaded_files:
    if file.uploaded:
        click.secho(
            f"The {file.description} file for {metadata_upload_info.metadata_file_path} was uploaded to {file.plob_id}.",
            color="green"
        )

@bnchrch bnchrch self-requested a review September 14, 2023 19:55
@bnchrch bnchrch dismissed their stale review September 14, 2023 19:56

accidentally hit approve! sorry!

@bnchrch
Contributor

bnchrch commented Sep 14, 2023

Also @lmossman, I like your approach with the images.

Question on that.

What path will you put them at in the docs folder?

I would prefer a relative path like ../../../../folder1/folder2/folder3/image.jpg

to become assets/image.jpg or assets/folder3/image.jpg

Thoughts?

@lmossman
Contributor Author

@bnchrch I've addressed your comments and finally got the tests working after quite a bit of trial and error 😅

The only thing I haven't addressed is the images. I talked with Tim about this, and since the frontend will still need to assume the location of the connector doc file within the docs folder in order to properly resolve relative links to other docs pages, we decided to just go with the frontend GitHub-URL replacement approach that I laid out in this comment. That keeps the scope of this change small and gets us to supporting versioned docs more quickly, while still leaving time to complete the other FE projects we have planned for this quarter.

Once connector ops can prioritize a more thorough migration of all doc files along with their images into the connector folders, that uploading logic will hopefully be simpler, so we don't want to invest the effort to make that work for this interim state.

Hopefully you are okay with this more minimal approach for now!

@lmossman lmossman requested review from bnchrch and removed request for bnchrch September 15, 2023 22:47
Contributor

bnchrch commented Sep 15, 2023

Totally OK! "Done, not perfect" perfectly applies here! Re-reviewing now :)

f"The icon file {metadata_upload_info.metadata_file_path} was uploaded to {metadata_upload_info.icon_blob_id}.", color="green"
)
for file in metadata_upload_info.uploaded_files:
    if file.uploaded:
Contributor

<3

description="versioned inapp doc",
blob_id=doc_inapp_version_blob_id,
),
UploadedFile(
Contributor

💅 Ideally we return these from their respective upload functions.

@lmossman Do you think that's a reasonable change we can make?

Contributor Author

@bnchrch the reason I did it this way is that in some cases we set uploaded and blob_id to False, None without calling the upload function. So if UploadedFile were returned from the function, we would need to duplicate the id and the description in upload_metadata_to_gcs as well for those cases where the upload function isn't called, so doing it this way felt DRYer to me.

What do you think?

Contributor

@bnchrch bnchrch left a comment


Overall this looks just fine to me!

There's one comment that would be great if we can get it.

But it is not a blocker, so approving now 👍

Thanks for all the context sharing and being so open to feedback @lmossman 💎

@lmossman
Contributor Author

lmossman commented Sep 15, 2023

@bnchrch thanks for the quick review!

I tried running airbyte-ci connectors --name=source-faker test with the env vars set as you recommended, but I got a bunch of failures that look like this in the acceptance tests step:

==================================== ERRORS ====================================
__________ ERROR at setup of TestSpec.test_config_match_spec[inputs0] __________

anyio_backend = 'asyncio', args = (), kwargs = {'anyio_backend': 'asyncio'}
backend_name = 'asyncio', backend_options = {}
runner = 

    def wrapper(*args, anyio_backend, **kwargs):  # type: ignore[no-untyped-def]
        backend_name, backend_options = extract_backend_and_options(anyio_backend)
        if has_backend_arg:
            kwargs["anyio_backend"] = anyio_backend
    
        with get_runner(backend_name, backend_options) as runner:
            if isasyncgenfunction(func):
>               yield from runner.run_asyncgen_fixture(func, kwargs)

/usr/local/lib/python3.10/site-packages/anyio/pytest_plugin.py:68: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py:2094: in run_asyncgen_fixture
    self._loop.run_until_complete(f)
/usr/local/lib/python3.10/asyncio/base_events.py:649: in run_until_complete
    return future.result()
/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py:2074: in fixture_runner
    retval = await agen.asend(None)
/app/connector_acceptance_test/conftest.py:160: in dagger_client
    async with dagger.Connection(config=dagger.Config(log_output=sys.stderr)) as client:
/usr/local/lib/python3.10/site-packages/dagger/connection.py:45: in __aenter__
    conn = await stack.enter_async_context(Engine(self.cfg))
/usr/local/lib/python3.10/contextlib.py:619: in enter_async_context
    result = await _cm_type.__aenter__(cm)
/usr/local/lib/python3.10/site-packages/dagger/engine/conn.py:55: in __aenter__
    return await anyio.to_thread.run_sync(self.start)
/usr/local/lib/python3.10/site-packages/anyio/to_thread.py:33: in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py:877: in run_sync_in_worker_thread
    return await future
/usr/local/lib/python3.10/site-packages/anyio/_backends/_asyncio.py:807: in run
    result = context.run(func, *args)
/usr/local/lib/python3.10/site-packages/dagger/engine/conn.py:45: in start
    return self.from_env() or self.from_cli()
/usr/local/lib/python3.10/site-packages/dagger/engine/conn.py:42: in from_cli
    return stack.enter_context(cli_session)
/usr/local/lib/python3.10/contextlib.py:492: in enter_context
    result = _cm_type.__enter__(cm)
/usr/local/lib/python3.10/site-packages/dagger/engine/cli.py:43: in __enter__
    conn = self._get_conn(proc)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = 
proc = 

    def _get_conn(self, proc: subprocess.Popen) -> ConnectParams:
        # TODO: implement engine session timeout (self.cfg.engine_timeout?)
        assert proc.stdout
        conn = proc.stdout.readline()
    
        # Check if subprocess exited with an error
        if ret := proc.poll():
            out = conn + proc.stdout.read()
            err = proc.stderr.read() if proc.stderr and proc.stderr.readable() else None
    
            # Reuse error message from CalledProcessError
            exc = subprocess.CalledProcessError(ret, " ".join(proc.args))
    
            msg = str(exc)
            detail = err or out
            if detail and detail.strip():
                # `msg` ends in a period, just append
                msg = f"{msg} {detail.strip()}"
    
>           raise SessionError(msg)
E           dagger.exceptions.SessionError: Failed to start Dagger engine session: Command '/root/.cache/dagger/dagger-0.6.4 session --label dagger.io/sdk.name:python --label dagger.io/sdk.version:0.6.4 --label dagger.io/sdk.async:true' returned non-zero exit status 1.

Have you seen that before?


I also tried running the publish command as you recommended, and it just hung on this step: it seems to be asking for confirmation to continue even though it's running locally, which I'm not able to give it since typing y does nothing in this state:
[screenshot of the publish command hanging on a confirmation prompt]

@bnchrch
Contributor

bnchrch commented Sep 16, 2023

@lmossman Hmm, you're not targeting Colima in your CLI by any chance, are you?

@lmossman
Contributor Author

@lmossman Hmm, you're not targeting Colima in your CLI by any chance, are you?

@bnchrch I don't think so (I'm not even really sure how to do that?)

@lmossman
Contributor Author

lmossman commented Sep 18, 2023

  • Should add something to metadata_validator.py to ensure that the doc file exists
  • This is where we create the container to handle the metadata upload:
  • @lmossman
    • Extend ValidatorOpts to have a docs_path
    • In validate_and_load, add a validator which checks that the path exists (a sketch of such a validator follows this list)
    • In upload_metadata_to_gcs, retrieve the doc path from there
  • @bnchrch
    • To update the MetadataUpload class in airbyte-ci to include that doc path in the container
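
A minimal sketch of the validator idea above, assuming the ValidatorOptions plumbing described in the next comment (names, signatures, and the return convention are assumptions, not the merged code):

from dataclasses import dataclass
import pathlib

@dataclass(frozen=True)
class ValidatorOptions:
    docs_path: str

def validate_docs_path_exists(metadata_definition, validator_opts: ValidatorOptions):
    # Fail validation when no doc file exists at the provided path.
    if not pathlib.Path(validator_opts.docs_path).exists():
        return False, f"Docs file {validator_opts.docs_path} does not exist."
    return True, None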

@lmossman
Contributor Author

@bnchrch I've made the changes that we discussed yesterday, in these commits (a rough sketch of the resulting CLI wiring is at the end of this comment):

  • Added a doc_path argument to the upload and validate commands in the metadata service
  • Passed the doc_path into validate_and_load and upload_metadata_to_gcs through the ValidatorOptions object
  • Updated _doc_upload to use that doc_path instead of inferring it from the metadata
  • Added a test doc.md file and a pointer to it in __init__.py so that I could use that path in the tests
  • Updated the tests to use that fixture path

As discussed, I will leave it to you to update the MetadataUpload class in airbyte-ci accordingly!
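
A rough sketch of the CLI wiring implied by the list above and the diff excerpts below (the exact signatures, import path, and ValidatorOptions field name are assumptions):

import pathlib

import click

from metadata_service.validators.metadata_validator import ValidatorOptions  # import path is an assumption

@click.command()
@click.argument("metadata_file_path", type=click.Path(exists=True, path_type=pathlib.Path), required=True)
@click.argument("doc_path", type=click.Path(exists=True, path_type=pathlib.Path), required=True)
@click.argument("bucket_name", type=click.STRING, required=True)
def upload(metadata_file_path: pathlib.Path, doc_path: pathlib.Path, bucket_name: str):
    # doc_path travels into upload_metadata_to_gcs via ValidatorOptions,
    # so _doc_upload no longer has to infer it from the metadata.
    upload_metadata_to_gcs(bucket_name, metadata_file_path, ValidatorOptions(docs_path=str(doc_path)))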


click.echo(f"Validating {file_path}...")
click.echo(f"Validating {metadata_file_path}...")
Contributor

😍 Like this name change

Contributor

@bnchrch bnchrch left a comment


A couple of small changes! Working on the CI side of it now.

-def validate(file_path: pathlib.Path):
-    file_path = file_path if not file_path.is_dir() else file_path / METADATA_FILE_NAME
+@click.argument("metadata_file_path", type=click.Path(exists=True, path_type=pathlib.Path))
+@click.argument("doc_path", type=click.Path(exists=True, path_type=pathlib.Path))
Contributor

❗️ We should mark this required


Contributor Author

I've marked the metadata-file-path, doc-path, and bucket-name arguments as required for the validate and upload commands, and I've updated the validate example in the README.


@lmossman
Contributor Author

lmossman commented Sep 27, 2023

For anyone tagged here: this was caused by CI applying auto-formatting changes to this PR, which for some reason changed a ton of files. Not sure why this happened, but please feel free to ignore this PR.

lmossman and others added 4 commits September 27, 2023 18:07
Co-authored-by: Ella Rohm-Ensing <erohmensing@gmail.com>
Co-authored-by: bnchrch <bnchrch@users.noreply.github.com>
@lmossman lmossman merged commit 5fd0710 into master Sep 27, 2023
25 checks passed
@lmossman lmossman deleted the lmossman/upload-versioned-connector-docs branch September 27, 2023 23:40