
🎉 New source: Google PageSpeed Insights [low-code CDK] #19813

Merged
merged 16 commits into from
Dec 14, 2022

Conversation

andnig
Copy link
Contributor

@andnig andnig commented Nov 26, 2022

What

This PR adds the new low-code CDK based Google PageSpeed Insights connector. PageSpeed Insights allows collecting performance and SEO-related data from any web page.

How

This PR uses the low-code CDK to create a source connector with the following streams:

  • pagespeed: Full PageSpeed report of the URL entered. More detailed, but only for the specified URL.
  • origin_pagespeed: Aggregated values for all pages and subpages of a specific URL. Less detailed, but accounts for all subpages as well.
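As a rough illustration of what a low-code CDK stream for this API looks like, a minimal sketch might be the following (names and structure are assumptions for illustration, not the exact YAML in this PR):

```yaml
# Minimal low-code CDK stream sketch for the PageSpeed Insights v5 API.
# All names here are illustrative assumptions, not the PR's actual YAML.
pagespeed_stream:
  $options:
    name: "pagespeed"
  retriever:
    requester:
      url_base: "https://www.googleapis.com/pagespeedonline/v5"
      path: "/runPagespeed"
      http_method: "GET"
      request_options_provider:
        request_parameters:
          url: "{{ config['url'] }}"
          key: "{{ config['api_key'] }}"
    record_selector:
      extractor:
        field_pointer: []
```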

Recommended reading order

  1. spec.yaml
  2. google-pagespeed-insights.yaml

🚨 User Impact 🚨

No changes to existing code, except docs/integrations/Readme.md

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Tests

Acceptance

Put your acceptance tests output here.
(Screenshot: acceptance test output, 2022-11-26)

@sajarin sajarin added the bounty-XL Maintainer program: claimable extra large bounty PR label Dec 5, 2022
@cptjacky
Contributor

cptjacky commented Dec 8, 2022

Hello @andnig! Thanks a lot for your contribution!

I will be reviewing your PR soon!

(cc @sajarin)

Contributor

@cptjacky cptjacky left a comment

Hello @andnig and thank you again for your contribution.

I left a few comments, please feel free to ask/challenge if something is unclear!

Also, could you follow Step 11 here, https://docs.airbyte.com/connector-development/tutorials/building-a-python-source/#step-11-add-the-connector-to-the-apiui, and add the necessary file?

Have a great day!

Comment on lines 14 to 17
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
Contributor

I think there is a typo here:

Suggested change
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
category: accessibility
category: best-practices
category: performance
category: pwa

Comment on lines 50 to 53
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
Contributor

Suggested change
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
category: accessibility
category: best-practices
category: performance
category: pwa

Comment on lines 54 to 55
strategy: desktop
strategy: mobile
Contributor

After testing this a bit, it looks like the API only accepts one of these at the same time, not both. When both are added, the numbers seem to default to the DESKTOP option.

I think we should make this into a config option. I think we should also include in the record which strategy we were using, as the API does not seem to return any indication of the one used.

$options:
name: "pagespeed"
path: "/runPagespeed"
origin_pagespeed_stream:
Contributor

Open for discussion comment: Should we keep those in a single, since all the information is returned in a single API call? This would avoid making 2 requests to get the same data.

Contributor Author

The "problem" I had when using one stream was that the check command takes so long. The "normal" request without the "origin" prefix takes between 20 and 40 seconds, but the integration tests time out after 30 seconds. I also thought it might be inconvenient for the user to wait that long. The "origin" request, however, returns after a few seconds.
Any suggestion on how to solve this? It would be awesome to use a different URL (the one with "origin") just for the check, without needing to create a separate stream. Is this possible somehow?

Contributor

This is odd! As far as I can see, both streams make the exact same request to the API, so they should take the same time?

Either way, if the 30-second limit itself is the issue, it can be raised!

You can update the acceptance-test-config.yml file (connection test, and add a bigger timeout_seconds property)
https://docs.airbyte.com/connector-development/testing-connectors/source-acceptance-tests-reference#test-connection
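For reference, raising that limit in acceptance-test-config.yml can be sketched as follows (paths and values are illustrative):

```yaml
# Sketch: connection test with a raised timeout in acceptance-test-config.yml.
connection:
  tests:
    - config_path: "secrets/config.json"
      status: "succeed"
      timeout_seconds: 60  # raised above the 30-second limit mentioned above
```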

Contributor Author

Ok, I'll increase the timeout then.

If you prefix the URL with origin, only a very reduced (aggregated) report is generated and returned, so it takes far less time. See this FAQ here: https://developers.google.com/speed/docs/insights/faq?hl=en

I'll make this clearer in the docs

Contributor

Interesting! However, the page (FAQ) you're mentioning seems deprecated, and the origin: prefix isn't mentioned in the new documentation, so I wouldn't rely on it!

Contributor Author

@andnig andnig Dec 13, 2022

I've seen that as well. I tried it with Postman, however, and with origin: you only get the main values (FCP, LCP, etc.), whereas without origin we get the whole report.
But I still see that this is kind of confusing. I have a suggestion: since we remove the second stream anyhow, I could validate the URL list in a way that allows not only http(s)://xyz.com but also origin:http...
Then users can choose themselves. What do you think?

Contributor

Yes, this could work!

I would put a huge warning in the documentation that this syntax may be deprecated at any time by google so that users are aware of the risks involved!

$options:
name: "origin_pagespeed"
path: "/runPagespeed"
retriever:
Contributor

As far as I can see, we do not override any parameter here, so we should remove this block.

title: Google PageSpeed Insights Spec
type: object
required:
- api_key
Contributor

From reading the docs, it looks like having an API key is optional, but using the API without one is heavily throttled.

Maybe we should make this property optional?

Contributor Author

Good idea, thanks for that suggestion. I'll add it as soon as I can.

href="https://developers.google.com/speed/docs/insights/v5/get-started#APIKey">here</a>.
The toke is case sensitive.
airbyte_secret: true
url:
Contributor

Most users are likely to want to monitor a list of URLs instead of a single one, so I think we should turn this into a urls array property!

Contributor Author

How can we reflect multiple URLs in the YAML?
The API unfortunately only allows fetching the report for one URL at a time.
With the Python SDK I guess we could implement a simple loop over all the URLs. How would one implement such a "loop" in the low-code SDK? Would this somehow work with stream slices, using a ListStreamSlicer?

Contributor

Yes, the ListStreamSlicer should be the way to go!
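For illustration, a ListStreamSlicer over a urls config property might be sketched like this (field names are assumptions, not the PR's final YAML):

```yaml
# Sketch: one stream slice (and thus one API request) per configured URL.
url_stream_slicer:
  type: ListStreamSlicer
  slice_values: "{{ config['urls'] }}"
  cursor_field: "url"
  request_option:
    field_name: "url"
    inject_into: "request_parameter"
```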

description: >-
Google PageSpeed API Key. See <a
href="https://developers.google.com/speed/docs/insights/v5/get-started#APIKey">here</a>.
The toke is case sensitive.
Contributor

Suggested change
The toke is case sensitive.
The token is case sensitive.


### Performance considerations

Google utilizes an undocumented rate limit which - under normal use - should not be triggered, even with huge datasets.
Contributor

The API Key's rate limit is documented in the Google Cloud console, can we update this section to reflect it?

(Screenshot: API key rate limit shown in the Google Cloud console)

@sajarin
Contributor

sajarin commented Dec 13, 2022

@andnig can you respond to the above comment ASAP? Thanks!

@andnig
Contributor Author

andnig commented Dec 13, 2022

Hey @sajarin: Is there any urgency, or did I miss something? The PR was opened 17 days ago and the review was done just 19 hours ago. The review is spot-on and I'm happy to implement all the requested changes and engage in the discussion, but I need some time to do so.

@sajarin
Contributor

sajarin commented Dec 13, 2022

Hey @andnig we'd like to get this pr merged before the end of the week. There might be significant delays in getting this merged due to the holidays coming up so I just want to make sure any outstanding PRs are able to get merged in before then. Does that make sense?

@andnig
Contributor Author

andnig commented Dec 13, 2022

Hey @andnig we'd like to get this pr merged before the end of the week. There might be significant delays in getting this merged due to the holidays coming up so I just want to make sure any outstanding PRs are able to get merged in before then. Does that make sense?

Ah I see, thanks for clarification. I'll try to squeeze in some hours during my travels.

So far, the connector used one single URL and
categories and strategies hardcoded in the
connector definition.
This commit changes the connector by removing
the second stream (origin) and uses ListStreamSlicers
for url, strategy and category to allow for better
configuration.
@andnig
Contributor Author

andnig commented Dec 13, 2022

@sajarin @cptjacky After your input I actually remodeled the whole connector, as it was indeed quite low-quality. I think it's much better now - so thanks for the push ;-)

  • I used CartesianProductStreamSlicers to allow for multiple combinations of URLs, Strategies and Categories.
  • I removed the origin_stream and allowed the URLs to have an optional "origin:" prefix
  • Added a warning to the docs that the origin-prefix however might be removed anytime soon
  • Added more details to the docs about rate limits
  • Added the source to the source_definitions.yaml
  • Added an icon to icons folder
  • Changed timeout of "check" integration test to 60 seconds

See the newly run integration tests here:

Integration Tests

(Screenshot: integration test output)

@andnig andnig requested a review from a team December 13, 2022 22:06
Contributor

@cptjacky cptjacky left a comment

Thank you @andnig for the changes, they're great!

I have two small remaining comments, but we're almost there!

request_option:
field_name: "strategy"
inject_into: "request_parameter"
category_stream_slicer:
Contributor

Categories do not need a different API call, since a single API call accepts multiple categories.

So I don't think they should be included in the slicers :)

(Also, to be confirmed, but I don't think it is currently useful here, as I think it only affects the Lighthouse results, which we don't save here!)

Contributor Author

@andnig andnig Dec 13, 2022

Do you have any suggestions for how one can add multiple request parameters with the same key? This would be required if we go without this stream-slicer workaround.

Contributor Author

(Also, to be confirmed, but I don't think it is currently useful here, as I think it only affect the Lighthouse results, which we don't save here!)

Ok, good point - actually the schema is missing the Lighthouse report sections. I removed them in the beginning because it's quite hard to find a schema that fits all web pages (the Lighthouse report section contains a lot of objects based on the structure of the website).
Nevertheless, we are reading the Lighthouse results and therefore we are also storing them, right? But as the schema is missing some fields, we can't normalize the results correctly? Or what's actually the impact if there are fields in the streams which are not defined in the schema?

What would you suggest here? Could we simply add the "lighthouseResult" property as an "object" type to the schema for now? Alternatively, I can look through all the properties and manually select the ones which are "universal" - but this will most probably take some days.

Contributor

@cptjacky cptjacky Dec 13, 2022

Do you have any suggestions for how one can add multiple request parameters with the same key? This would be required if we go without this streamslicer workaround.

I'm not sure if this is a hack or if it's intended, but this seems to be working:

      request_parameters:
        key: "{{ config['api_key'] }}"
        category: "{{ config['categories'] }}"

(if we use this, we should make sure a test is covering it, so that we're sure it keeps working in the future)

Nevertheless we are reading the lighthouse results and therefore we are also storing them, right?

For most destinations that go through normalisation, yes; however, I'm not sure there is an easy way to "renormalize" the entire dataset in the future.

I'm using the BigQuery denormalized destination a lot, and in my case the data would be lost :)

I think it's acceptable to not add the lighthouse results at all, and leave it for a further improvement.

I would not be against adding a simplified version of it either, with a few object properties here and there where it's undocumented. I'll let you decide!

request_option:
field_name: "url"
inject_into: "request_parameter"
strategy_stream_slicer:
Contributor

Now that we do one call per strategy, I think we should include the strategy in the output schema!

Otherwise the user has no way of knowing which row of data is MOBILE or DESKTOP
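One way to do that is an AddFields transformation that copies the slice value into each record, roughly like this (a sketch; the exact field names are assumptions):

```yaml
# Sketch: stamp each record with the strategy of the slice it came from.
transformations:
  - type: AddFields
    fields:
      - path: ["strategy"]
        value: "{{ stream_slice['strategy'] }}"
```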

@CLAassistant

CLAassistant commented Dec 14, 2022

CLA assistant check
All committers have signed the CLA.

@andnig
Contributor Author

andnig commented Dec 14, 2022

@cptjacky Next try :-)

  • added the categories as params as you requested
  • added expected_records to test whether this "hack" of yours (which I find an elegant solution, btw) works
  • Added schema for lighthouseResults

Contributor

@cptjacky cptjacky left a comment

Thank you for the changes @andnig!

Creating that JSON schema must not have been easy!
However, as you mentioned earlier, it seems like this is a complex schema, and what Google returns may vary.
I got the following error when trying with my own website's url:

ERROR    root:test_core.py:514 
The pagespeed stream has the following schema errors:
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['button-name']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['button-name']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['lcp-lazy-loaded']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['lcp-lazy-loaded']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-allowed-attr']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['aria-allowed-attr']['score']:
    None
--------------------------------------------------------------------------------
0 is not of type 'null'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['frame-title']['properties']['score']:
    {'type': 'null'}

On instance['lighthouseResult']['audits']['frame-title']['score']:
    0
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-valid-attr-value']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['aria-valid-attr-value']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-valid-attr']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['aria-valid-attr']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'string'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['themed-omnibox']['properties']['details']['properties']['items']['items']['properties']['themeColor']:
    {'type': 'string'}

On instance['lighthouseResult']['audits']['themed-omnibox']['details']['items'][0]['themeColor']:
    None

 airbyte-integrations/bases/source-acceptance-test/source_acceptance_test/tests/test_core.py::TestBasicRead.test_read[inputs0] ⨯  

I'm afraid that this is indeed a schema too complicated to get perfectly right, as it depends on the website being tested.

Maybe we should leave the overly complicated objects (like lighthouseResult.audits) as plain object definitions, without further precision?
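For example, the volatile parts could be typed as plain objects in the JSON schema, along these lines (a sketch, not the PR's actual schema file):

```json
{
  "lighthouseResult": {
    "type": ["null", "object"],
    "properties": {
      "audits": { "type": ["null", "object"] }
    }
  }
}
```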

transformations:
- type: AddFields
fields:
- path: ["strategy"]
Contributor

I think this should also be added to the schemas/pagespeed.json file for it to be kept in the record
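The schema addition could be as small as the following (a sketch of the property entry, assuming the file's usual nullable-type convention):

```json
{
  "strategy": {
    "type": ["null", "string"]
  }
}
```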

Contributor

Can you look at that comment @andnig? I believe this is the last outstanding one and then we'll be able to merge!

@cptjacky
Contributor

Also @andnig, it seems like your last commits were made on a different machine and that the commit identity is not properly linked to your GitHub account. Could you fix this (so that the automations correctly detect your signature of the CLA)?

@andnig
Contributor Author

andnig commented Dec 14, 2022

Hey @cptjacky, I removed the details from almost every sub-field of the Lighthouse results. It's a pity, but for the sake of getting this thing finally merged, your suggestion of removing the details makes a lot of sense.

Also resolved the CLA issue.
Added the strategy field to the schema.

Find the integration test results here.

Integration Tests

(Screenshot: integration test output)

Thanks for your patience.

Contributor

@cptjacky cptjacky left a comment

Great! Everything looks good to me! Thanks a lot @andnig!

@sajarin could you setup the acceptance tests run? Thank you!

@sajarin sajarin removed the request for review from a team December 14, 2022 22:49
@sajarin
Contributor

sajarin commented Dec 14, 2022

/publish connector=connectors/source-google-pagespeed-insights run-tests=false

🕑 Publishing the following connectors:
connectors/source-google-pagespeed-insights
https://github.com/airbytehq/airbyte/actions/runs/3699418564


Connector Did it publish? Were definitions generated?
connectors/source-google-pagespeed-insights

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@sajarin sajarin merged commit b6c6b7a into airbytehq:master Dec 14, 2022
@sajarin
Contributor

sajarin commented Dec 14, 2022

Thanks for the PR @andnig and thanks for the review @cptjacky!

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation bounty bounty-XL Maintainer program: claimable extra large bounty PR community connectors/source/google-pagespeed-insights

6 participants