Source Apify Dataset: fix broken stream, manifest refactor #30428

Merged
merged 45 commits into from Oct 6, 2023


vdusek
Contributor

@vdusek vdusek commented Sep 14, 2023

What

  • Remove the old broken Item Collection stream.
    • The schema of Apify Dataset is at least Actor-specific, so we cannot have a general Stream for getting data from Dataset.
  • Add a new Item Collection (WCC) stream - specific for the Website Content Crawler Actor.
  • Refactor and simplify the manifest overall, including merging spec.yaml and manifest.yaml.

How

  • Updated the manifests, ran checks and tests locally, imported the manifest into a local Airbyte instance, and tried to execute the Streams. Everything works fine.

Recommended reading order

  1. manifest
  2. schemas
  3. others

🚨 User Impact 🚨

  • Streams were renamed to match Apify API endpoints.
  • Item Collection stream was removed since it does not work (the Dataset endpoint does not have a static schema).
  • Add a new Item Collection WCC stream for getting data produced by Website Content Crawler (WCC) Actor.

Pre-merge Actions

Updating a connector

Community member or Airbyter

  • Grant edit access to maintainers (instructions)
  • Unit & integration tests added

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.

@octavia-squidington-iii octavia-squidington-iii added the area/connectors Connector related issues label Sep 14, 2023
@github-actions
Contributor

github-actions bot commented Sep 14, 2023

Before Merging a Connector Pull Request

Wow! What a great pull request you have here! 🎉

To merge this PR, ensure the following has been done/considered for each connector added or updated:

  • PR name follows PR naming conventions
  • Breaking changes are considered. If a Breaking Change is being introduced, ensure an Airbyte engineer has created a Breaking Change Plan.
  • Connector version has been incremented in the Dockerfile and metadata.yaml according to our Semantic Versioning for Connectors guidelines
  • You've updated the connector's metadata.yaml file with any other relevant changes, including a breakingChanges entry for major version bumps. See metadata.yaml docs
  • Secrets in the connector's spec are annotated with airbyte_secret
  • All documentation files are up to date. (README.md, bootstrap.md, docs.md, etc...)
  • Changelog updated in docs/integrations/<source or destination>/<name>.md with an entry for the new version. See changelog example
  • Migration guide updated in docs/integrations/<source or destination>/<name>-migrations.md with an entry for the new version, if the version is a breaking change. See migration guide example
  • If set, you've ensured the icon is present in the platform-internal repo. (Docs)

If the checklist is complete, but the CI check is failing,

  1. Check for hidden checklists in your PR description

  2. Toggle the github label checklist-action-run on/off to re-run the checklist CI.

@vdusek vdusek changed the title Improvement of Apify Dataset Connector Source Apify Dataset: fix broken stream, manifest refactor and simplification Sep 14, 2023
@vdusek vdusek changed the title Source Apify Dataset: fix broken stream, manifest refactor and simplification Source Apify Dataset: fix broken stream, manifest refactor Sep 14, 2023
@octavia-squidington-iii octavia-squidington-iii added the area/documentation Improvements or additions to documentation label Sep 14, 2023
Member

@marcosmarxm marcosmarxm left a comment


Left some comments.

@vdusek
Contributor Author

vdusek commented Sep 15, 2023

Thanks for the review @marcosmarxm , the code is updated.

@marcosmarxm marcosmarxm added the team/tse Technical Support Engineers label Sep 15, 2023
@vercel

vercel bot commented Sep 25, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
airbyte-docs ⬜️ Ignored (Inspect) Visit Preview Oct 6, 2023 10:43pm

@flash1293
Contributor

Thanks for the changes @vdusek

Add an exception for JSON schema backwards compatible check
I don't know what you mean by that, could you please explain that in more detail? Thanks.

I was referring to the acceptance tests but no worry, I will take care of that

@flash1293
Contributor

Thanks @vdusek

I fixed a few problems, there is one thing remaining:

The item_collection_website_content_crawler stream has the following schema errors:
Additional properties are not allowed ('#debug', '#error', 'pageTitle' were unexpected)

Do you think we should sync these properties (if yes we should add them to the schema)? If not we can also remove them with a remove transformation in the manifest (like here:

)
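The "Additional properties are not allowed" error above can be reproduced in miniature. The following is only a sketch using the Python standard library: `WCC_SCHEMA` is a hypothetical, heavily trimmed stand-in for the connector's real schema, and `unexpected_properties` is an illustrative helper, not part of Airbyte or its acceptance tests.

```python
from __future__ import annotations

# Hypothetical, trimmed-down schema for illustration only.
# The real schema ships with the connector and has typed, nested properties.
WCC_SCHEMA = {
    "type": "object",
    "additionalProperties": False,
    "properties": {
        "url": {},
        "crawl": {},
        "metadata": {},
        "screenshotUrl": {},
        "text": {},
        "markdown": {},
    },
}


def unexpected_properties(record: dict, schema: dict) -> list[str]:
    """Return the top-level keys that a strict schema would reject."""
    # Extra keys are only an error when additionalProperties is explicitly false.
    if schema.get("additionalProperties", True) is not False:
        return []
    allowed = set(schema.get("properties", {}))
    return sorted(key for key in record if key not in allowed)


# A made-up record that is shaped like the output of some other Actor.
record = {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "text": "...",
    "#debug": {},
    "#error": False,
    "pageTitle": "...",
}

print(unexpected_properties(record, WCC_SCHEMA))
# prints ['#debug', '#error', 'pageTitle']
```

The two ways out discussed here map onto this sketch directly: either add the keys to `properties`, or drop them from each record before validation.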

@vdusek
Contributor Author

vdusek commented Oct 3, 2023

Do you think we should sync these properties (if yes we should add them to the schema)? If not we can also remove them with a remove transformation in the manifest (like here:

Hi @flash1293, in the WCC datasets, there should be no properties such as "#debug", "#error" or "pageTitle".

However, these fields are in the old schema (source_apify_dataset/schemas/item_collection.json). So it might have something to do with it.

Here is an example of the WCC dataset:

[
  {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "crawl": {
      "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
      "loadedTime": "2023-09-08T15:11:20.522Z",
      "referrerUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
      "depth": 0,
      "httpStatusCode": 200
    },
    "metadata": {
      "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
      "title": "Web scraping for beginners | Academy | Apify Documentation",
      "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
      "author": null,
      "keywords": null,
      "languageCode": "en"
    },
    "screenshotUrl": null,
    "text": "Web scraping for beginners...",
    "markdown": "## Web scraping for beginners..."
  },
  {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners/introduction",
    "crawl": {
      "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners/introduction",
      "loadedTime": "2023-09-08T15:11:32.622Z",
      "referrerUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
      "depth": 1,
      "httpStatusCode": 200
    },
    "metadata": {
      "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners/introduction",
      "title": "Introduction | Academy | Apify Documentation",
      "description": "Start learning about web scraping, web crawling, data extraction, and popular tools to start developing your own scraper.",
      "author": null,
      "keywords": null,
      "languageCode": "en"
    },
    "screenshotUrl": null,
    "text": "Introduction\nStart learning about web scraping, ..."
  },
  "...": "..."
]

If I understand it correctly, you tested it with some dataset, and that dataset produced these fields. So it is probably some invalid dataset for this Stream (produced by another Actor). So I'd say we can go with the RemoveFields option.
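For reference, the effect of the RemoveFields option can be sketched in plain Python. This is an illustration of the idea only, assuming each field is addressed by a path of keys; `remove_fields` is a hypothetical helper and not the Airbyte CDK implementation, and the sample record is made up.

```python
import copy


def remove_fields(record, field_pointers):
    """Return a copy of `record` without the fields named by `field_pointers`.

    Each pointer is a list of keys forming a path, e.g. ["crawl", "depth"].
    Paths that do not exist in the record are silently ignored.
    """
    cleaned = copy.deepcopy(record)  # never mutate the caller's record
    for pointer in field_pointers:
        if not pointer:
            continue
        parent = cleaned
        # Walk down to the dict that holds the final key of the path.
        for key in pointer[:-1]:
            parent = parent.get(key) if isinstance(parent, dict) else None
            if parent is None:
                break
        if isinstance(parent, dict):
            parent.pop(pointer[-1], None)
    return cleaned


# Made-up record carrying the three fields reported by the schema check.
raw = {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "crawl": {"depth": 0, "httpStatusCode": 200},
    "#debug": {"requestId": "hypothetical"},
    "#error": False,
    "pageTitle": "hypothetical title",
}

cleaned = remove_fields(raw, [["#debug"], ["#error"], ["pageTitle"]])
print(sorted(cleaned))
# prints ['crawl', 'url']
```

Dropping the fields keeps the stream's schema strict, at the cost of silently discarding data if a dataset from another Actor is configured by mistake.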

Contributor

@flash1293 flash1293 left a comment


@vdusek I changed the acceptance tests to actually run on a WCC dataset. I had to adjust the schemas a bit to make it work (some fields were missing in the provided schemas), but it's passing now.

@flash1293 flash1293 merged commit 7bff33f into airbytehq:master Oct 6, 2023
31 of 39 checks passed
edgao pushed a commit that referenced this pull request Oct 11, 2023
Co-authored-by: Joe Reuter <joe@airbyte.io>
Co-authored-by: flash1293 <flash1293@users.noreply.github.com>
Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation community connectors/source/apify-dataset team/tse Technical Support Engineers

4 participants