-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
🎉 New source: Google PageSpeed Insights [low-code CDK] #19813
🎉 New source: Google PageSpeed Insights [low-code CDK] #19813
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @andnig and thank you again for your contribution.
I left a few comments, please feel free to ask/challenge if something is unclear!
Also, could you follow the Step 11 here, https://docs.airbyte.com/connector-development/tutorials/building-a-python-source/#step-11-add-the-connector-to-the-apiui and add the necessary file?
Have a great day!
cateogry: accessibility | ||
cateogry: best-practices | ||
cateogry: performance | ||
cateogry: pwa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a typo here:
cateogry: accessibility | |
cateogry: best-practices | |
cateogry: performance | |
cateogry: pwa | |
category: accessibility | |
category: best-practices | |
category: performance | |
category: pwa |
cateogry: accessibility | ||
cateogry: best-practices | ||
cateogry: performance | ||
cateogry: pwa |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cateogry: accessibility | |
cateogry: best-practices | |
cateogry: performance | |
cateogry: pwa | |
category: accessibility | |
category: best-practices | |
category: performance | |
category: pwa |
strategy: desktop | ||
strategy: mobile |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After testing this a bit, it looks like the API only accepts one of these at the same time, not both. When both are added, the numbers seem to default to the DESKTOP
option.
I think we should make this into a config option. I think we should also include in the record which strategy we were using, as the API does not seem to return any indication of the one used.
$options: | ||
name: "pagespeed" | ||
path: "/runPagespeed" | ||
origin_pagespeed_stream: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open for discussion comment: Should we keep those in a single, since all the information is returned in a single API call? This would avoid making 2 requests to get the same data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "problem" I had when using one stream was, that the check - command takes so long. So, the "normal" request without "origin" - prefix takes between 20 and 40 seconds. But the integration tests time out after 30 seconds. I also thought, that it might be inconvenient for the user to wait for so long? The "origin" request however returns after some seconds.
Any suggestion on how to solve this? It would be awesome to use a different url (the one with "origin" just for the check, but without needing to create a separate stream. Is this possible somehow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is odd! As far as I can see, both streams make the exact same request to the API I think, so they should take the same time?
Either way, if the 30 secondes itself is the issue, it can be raised!
You can update the acceptance-test-config.yml
file (connection
test, and add a bigger timeout_seconds
property)
https://docs.airbyte.com/connector-development/testing-connectors/source-acceptance-tests-reference#test-connection
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I'll increase the timeout than.
If you prefix the url with origin, only a very reduced (aggregates) report is generated and then returned. Therefore it takes way less time. See this faq here: https://developers.google.com/speed/docs/insights/faq?hl=en
I'll make this clearer in the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting! However the page (faq) you're mentioning seems deprecated, and the origin:
prefix isn't mentioned in the new documentation, so I wouldn't rely on it!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've seen that as well. I tried it with postman however and with origin: you only get the main values (fcp, lcs, etc), whereas without origin we get the whole report.
But still I see that this is kinda confusing. I have a suggestion: as we remove the second stream anyhow, I could make the url list in a way that it not only allows http(s)://xyz.com but also origin:http...
Then users can choose themselves. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this could work!
I would put a huge warning in the documentation that this syntax may be deprecated at any time by google so that users are aware of the risks involved!
$options: | ||
name: "origin_pagespeed" | ||
path: "/runPagespeed" | ||
retriever: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can see, it does not look like we override any parameter here, so we should remove this block
title: Google PageSpeed Insights Spec | ||
type: object | ||
required: | ||
- api_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From reading the docs, it looks like having an api key is optional, but that using the API without it is heavily throttled.
Maybe we should make this property optional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, thanks for that suggestion. I'll add it as soon as I can.
href="https://developers.google.com/speed/docs/insights/v5/get-started#APIKey">here</a>. | ||
The toke is case sensitive. | ||
airbyte_secret: true | ||
url: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most websites are likely to want to monitor a list of urls instead of a single one, so I think we should turn this in an urls
array property!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can we reflect multiple URLs in the yaml?
The API unfortunately only allows to fetch the report for one URL at a time.
With the python SDK I guess we could implement a simple loop over all the urls. How would one implement such a "loop" in the lowcode SDK? Would this somehow work with Stream Slices with a List stream slicer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the ListStreamSlicer should be the way to go!
description: >- | ||
Google PageSpeed API Key. See <a | ||
href="https://developers.google.com/speed/docs/insights/v5/get-started#APIKey">here</a>. | ||
The toke is case sensitive. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The toke is case sensitive. | |
The token is case sensitive. |
|
||
### Performance considerations | ||
|
||
Google utilizes an undocumented rate limit which - under normal use - should not be triggered, even with huge datasets. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@andnig can you respond to the above comment ASAP? Thanks! |
Hey @sajarin : Is there any urgency or did I miss something? The PR was opened 17 days ago and the review was done just 19 hours ago? The review is spot on and I'm happy to implement all the requested changes/engage in the discussion but need some time to do so. |
Hey @andnig we'd like to get this pr merged before the end of the week. There might be significant delays in getting this merged due to the holidays coming up so I just want to make sure any outstanding PRs are able to get merged in before then. Does that make sense? |
Ah I see, thanks for clarification. I'll try to squeeze in some hours during my travels. |
So far, the connector used one single URL and categories and strategies hardcoded in the connector definition. This commit changes the connector by removing the second stream (origin) and uses ListStreamSlicers for url, strategy and category to allow for better configuration.
@sajarin @cptjacky After your input I actually remodeled the whole connector, as it was indeed quite low-quality. I think it's much better now - so thanks for the push ;-)
See the newly run integration tests here: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @andnig for the changes, they're great!
I have two small remaining comments, but we're almost there!
request_option: | ||
field_name: "strategy" | ||
inject_into: "request_parameter" | ||
category_stream_slicer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Categories do not need different API call, as a single API call accepts multiple categories.
So I don't think they should be included in the slicers :)
(Also, to be confirmed, but I don't think it is currently useful here, as I think it only affect the Lighthouse results, which we don't save here!)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any suggestions for how one can add multiple request parameters with the same key? This would be required if we go without this streamslicer workaround.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also, to be confirmed, but I don't think it is currently useful here, as I think it only affect the Lighthouse results, which we don't save here!)
Ok good point - actually the schema is missing the lighthouse report sections. I removed them in the beginning because it's quite hard to find a schema for alle web-pages (as the lighthouse report section contains a lot of objects based on the structure of the website).
Nevertheless we are reading the lighthouse results and therefore we are also storing them, right? But as the schema is missing some fields, we can't normalize the results correctly? Or what's actually the impact, if there are fields in the streams which are not defined in the schema?
What would you suggest here? Could we imply add the "lighthouseResult" property as an "object" - type to the schema for now? Alternatively I can look through all the properties and manually select the ones which are "universal" - but this will most probably take some days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any suggestions for how one can add multiple request parameters with the same key? This would be required if we go without this streamslicer workaround.
I'm not sure if this is a hack or if its intended, but this seems to be working:
request_parameters:
key: "{{ config['api_key'] }}"
category: "{{ config['categories'] }}"
(if we use this, we should make sure a test is covering it, so that we're sure it keeps working in the future)
Nevertheless we are reading the lighthouse results and therefore we are also storing them, right?
For most destinations that go through normalisation, yes, however I'm not sure there is an easy way to "renormalize" the entire dataset in the future.
I'm using the bigquery denormalized destination a lot, and in my case the data would be lost :)
I think it's acceptable to not add the lighthouse results at all, and leave it for a further improvement.
I would not be against adding a simplified version of it either, with a few objects properties here and there when its undocumented. I'll let you decide!
request_option: | ||
field_name: "url" | ||
inject_into: "request_parameter" | ||
strategy_stream_slicer: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that we do one call per strategy, I think we should include the strategy in the output schema!
Otherwise the user has no way of knowing which row of data is MOBILE or DESKTOP
@cptjacky Next try :-)
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the changes @andnig!
Creating that json schema must have not been easy!
However, as you mentioned earlier, it seems like this is a complex schema and that google return may vary.
I got the following error when trying with my own website's url:
ERROR root:test_core.py:514
The pagespeed stream has the following schema errors:
None is not of type 'number'
Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['button-name']['properties']['score']:
{'type': 'number'}
On instance['lighthouseResult']['audits']['button-name']['score']:
None
--------------------------------------------------------------------------------
None is not of type 'number'
Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['lcp-lazy-loaded']['properties']['score']:
{'type': 'number'}
On instance['lighthouseResult']['audits']['lcp-lazy-loaded']['score']:
None
--------------------------------------------------------------------------------
None is not of type 'number'
Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-allowed-attr']['properties']['score']:
{'type': 'number'}
On instance['lighthouseResult']['audits']['aria-allowed-attr']['score']:
None
--------------------------------------------------------------------------------
0 is not of type 'null'
Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['frame-title']['properties']['score']:
{'type': 'null'}
On instance['lighthouseResult']['audits']['frame-title']['score']:
0
--------------------------------------------------------------------------------
None is not of type 'number'
Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-valid-attr-value']['properties']['score']:
{'type': 'number'}
On instance['lighthouseResult']['audits']['aria-valid-attr-value']['score']:
None
--------------------------------------------------------------------------------
None is not of type 'number'
Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-valid-attr']['properties']['score']:
{'type': 'number'}
On instance['lighthouseResult']['audits']['aria-valid-attr']['score']:
None
--------------------------------------------------------------------------------
None is not of type 'string'
Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['themed-omnibox']['properties']['details']['properties']['items']['items']['properties']['themeColor']:
{'type': 'string'}
On instance['lighthouseResult']['audits']['themed-omnibox']['details']['items'][0]['themeColor']:
None
airbyte-integrations/bases/source-acceptance-test/source_acceptance_test/tests/test_core.py::TestBasicRead.test_read[inputs0] ⨯
I'm afraid that this is indeed a schema to complicated to get it perfectly right, as it depends on the website being tested.
Maybe we should leave the too complicated objects (like lighthouseResult.audits
) as objects definition, without precision?
transformations: | ||
- type: AddFields | ||
fields: | ||
- path: ["strategy"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should also be added to the schemas/pagespeed.json
file for it to be kept in the record
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you look at that comment @andnig? I believe this is the last outstanding one and then we'll be able to merge!
Also @andnig, it seems like your last commits were made on a different machine, and that the commits identity is not properly linked to your GitHub account, could you fix this (so that the automations correctly detect your signature of the CLA) |
…airbyte into source-google-pagespeed-insights
Hey @cptjacky , I removed the details from almost any sub-field of the lighthouse results. It's a pity, but I guess for the sake of getting this thing finally merged, your suggestion of removing the details makes a lot of sense. Also resolved the CLA issue. Find the integration test results here. Thanks for your patience. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/publish connector=connectors/source-google-pagespeed-insights run-tests=false
if you have connectors that successfully published but failed definition generation, follow step 4 here |
What
This PR adds the new low-code CDK based Google PageSpeed Insights Connector. PageSpeed Insights allows to collect performance and seo related data from any web-page.
How
This PR uses the lowcode CDK to create a source connector with the following streams:
Recommended reading order
spec.yaml
google-pagespeed-insights.yaml
🚨 User Impact 🚨
No changes to existing code, except docs/integrations/Readme.md
Pre-merge Checklist
Expand the relevant checklist and delete the others.
New Connector
Community member or Airbyter
airbyte_secret
./gradlew :airbyte-integrations:connectors:<name>:integrationTest
.README.md
bootstrap.md
. See description and examplesdocs/integrations/<source or destination>/<name>.md
including changelog. See changelog exampledocs/integrations/README.md
airbyte-integrations/builds.md
Airbyter
If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.
/test connector=connectors/<name>
command is passing/publish
command described hereTests
Acceptance
Put your acceptance tests output here.
![Screenshot 2022-11-26 at 17 27 46](https://user-images.githubusercontent.com/30227237/204098874-cbfbb49d-1917-4357-aea6-2c0fcaaf3837.png)