
🎉 New source: Google PageSpeed Insights [low-code CDK] #19813

Merged
merged 16 commits into from
Dec 14, 2022

Conversation

andnig
Copy link
Contributor

@andnig andnig commented Nov 26, 2022

What

This PR adds the new low-code CDK based Google PageSpeed Insights connector. PageSpeed Insights allows collecting performance and SEO-related data from any web page.

How

This PR uses the low-code CDK to create a source connector with the following streams:

  • pagespeed: Full PageSpeed report of the URL entered. More detailed, but only for the specified URL.
  • origin_pagespeed: Aggregated values for all pages and subpages of a specific URL. Less detailed, but accounts for all subpages as well.
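As a rough illustration of what a low-code CDK stream for this API looks like, a minimal sketch might be the following (names and structure are assumptions for illustration, not the exact YAML in this PR):

```yaml
# Minimal low-code CDK stream sketch for the PageSpeed Insights v5 API.
# All names here are illustrative assumptions, not the PR's actual YAML.
pagespeed_stream:
  $options:
    name: "pagespeed"
  retriever:
    requester:
      url_base: "https://www.googleapis.com/pagespeedonline/v5"
      path: "/runPagespeed"
      http_method: "GET"
      request_options_provider:
        request_parameters:
          url: "{{ config['url'] }}"
          key: "{{ config['api_key'] }}"
    record_selector:
      extractor:
        field_pointer: []
```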

Recommended reading order

  1. spec.yaml
  2. google-pagespeed-insights.yaml

🚨 User Impact 🚨

No changes to existing code, except docs/integrations/Readme.md

Pre-merge Checklist

Expand the relevant checklist and delete the others.

New Connector

Community member or Airbyter

  • Community member? Grant edit access to maintainers (instructions)
  • Secrets in the connector's spec are annotated with airbyte_secret
  • Unit & integration tests added and passing. Community members, please provide proof of success locally e.g: screenshot or copy-paste unit, integration, and acceptance test output. To run acceptance tests for a Python connector, follow instructions in the README. For java connectors run ./gradlew :airbyte-integrations:connectors:<name>:integrationTest.
  • Code reviews completed
  • Documentation updated
    • Connector's README.md
    • Connector's bootstrap.md. See description and examples
    • docs/integrations/<source or destination>/<name>.md including changelog. See changelog example
    • docs/integrations/README.md
    • airbyte-integrations/builds.md
  • PR name follows PR naming conventions

Airbyter

If this is a community PR, the Airbyte engineer reviewing this PR is responsible for the below items.

  • Create a non-forked branch based on this PR and test the below items on it
  • Build is successful
  • If new credentials are required for use in CI, add them to GSM. Instructions.
  • /test connector=connectors/<name> command is passing
  • New Connector version released on Dockerhub by running the /publish command described here
  • After the connector is published, connector added to connector index as described here
  • Seed specs have been re-generated by building the platform and committing the changes to the seed spec files, as described here

Tests

Acceptance

Put your acceptance tests output here.
(Screenshot: acceptance test output, 2022-11-26)

@sajarin sajarin added the bounty-XL Maintainer program: claimable extra large bounty PR label Dec 5, 2022
@cptjacky
Contributor

cptjacky commented Dec 8, 2022

Hello @andnig! Thanks a lot for your contribution!

I will be reviewing your PR soon!

(cc @sajarin)

Contributor

@cptjacky cptjacky left a comment

Hello @andnig and thank you again for your contribution.

I left a few comments, please feel free to ask/challenge if something is unclear!

Also, could you follow Step 11 here, https://docs.airbyte.com/connector-development/tutorials/building-a-python-source/#step-11-add-the-connector-to-the-apiui, and add the necessary file?

Have a great day!

Comment on lines 14 to 17
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
Contributor

I think there is a typo here:

Suggested change
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
category: accessibility
category: best-practices
category: performance
category: pwa

Comment on lines 50 to 53
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
Contributor

Suggested change
cateogry: accessibility
cateogry: best-practices
cateogry: performance
cateogry: pwa
category: accessibility
category: best-practices
category: performance
category: pwa

Comment on lines 54 to 55
strategy: desktop
strategy: mobile
Contributor

After testing this a bit, it looks like the API only accepts one of these at the same time, not both. When both are added, the numbers seem to default to the DESKTOP option.

I think we should make this into a config option. I think we should also include in the record which strategy we were using, as the API does not seem to return any indication of the one used.

$options:
name: "pagespeed"
path: "/runPagespeed"
origin_pagespeed_stream:
Contributor

Open for discussion comment: Should we keep those in a single, since all the information is returned in a single API call? This would avoid making 2 requests to get the same data.

Contributor Author

The "problem" I had when using one stream was that the check command takes so long. The "normal" request without the "origin" prefix takes between 20 and 40 seconds, but the integration tests time out after 30 seconds. I also thought it might be inconvenient for the user to wait that long. The "origin" request, however, returns after a few seconds.
Any suggestion on how to solve this? It would be awesome to use a different URL (the one with "origin") just for the check, without needing to create a separate stream. Is this possible somehow?

Contributor

This is odd! As far as I can see, both streams make the exact same request to the API, so they should take the same time?

Either way, if the 30-second limit itself is the issue, it can be raised!

You can update the acceptance-test-config.yml file (connection test, and add a bigger timeout_seconds property)
https://docs.airbyte.com/connector-development/testing-connectors/source-acceptance-tests-reference#test-connection
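For reference, raising that limit in acceptance-test-config.yml can be sketched as follows (paths and values are illustrative):

```yaml
# Sketch: connection test with a raised timeout in acceptance-test-config.yml.
connection:
  tests:
    - config_path: "secrets/config.json"
      status: "succeed"
      timeout_seconds: 60  # raised above the 30-second limit mentioned above
```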

Contributor Author

Ok, I'll increase the timeout then.

If you prefix the URL with origin, only a very reduced (aggregated) report is generated and returned, so it takes far less time. See this FAQ here: https://developers.google.com/speed/docs/insights/faq?hl=en

I'll make this clearer in the docs

Contributor

Interesting! However, the page (FAQ) you're mentioning seems deprecated, and the origin: prefix isn't mentioned in the new documentation, so I wouldn't rely on it!

Contributor Author

@andnig andnig Dec 13, 2022

I've seen that as well. I tried it with Postman, however, and with origin: you only get the main values (FCP, LCP, etc.), whereas without origin we get the whole report.
But I still see that this is kind of confusing. I have a suggestion: since we remove the second stream anyhow, I could validate the URL list in a way that allows not only http(s)://xyz.com but also origin:http...
Then users can choose themselves. What do you think?

Contributor

Yes, this could work!

I would put a huge warning in the documentation that this syntax may be deprecated at any time by google so that users are aware of the risks involved!

$options:
name: "origin_pagespeed"
path: "/runPagespeed"
retriever:
Contributor

As far as I can see, we do not override any parameter here, so we should remove this block.

title: Google PageSpeed Insights Spec
type: object
required:
- api_key
Contributor

From reading the docs, it looks like having an API key is optional, but using the API without one is heavily throttled.

Maybe we should make this property optional?

Contributor Author

Good idea, thanks for that suggestion. I'll add it as soon as I can.

href="https://developers.google.com/speed/docs/insights/v5/get-started#APIKey">here</a>.
The toke is case sensitive.
airbyte_secret: true
url:
Contributor

Most users are likely to want to monitor a list of URLs instead of a single one, so I think we should turn this into a urls array property!

Contributor Author

How can we reflect multiple URLs in the YAML?
The API unfortunately only allows fetching the report for one URL at a time.
With the Python SDK I guess we could implement a simple loop over all the URLs. How would one implement such a "loop" in the low-code SDK? Would this somehow work with stream slices, using a ListStreamSlicer?

Contributor

Yes, the ListStreamSlicer should be the way to go!
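For illustration, a ListStreamSlicer over a urls config property might be sketched like this (field names are assumptions, not the PR's final YAML):

```yaml
# Sketch: one stream slice (and thus one API request) per configured URL.
url_stream_slicer:
  type: ListStreamSlicer
  slice_values: "{{ config['urls'] }}"
  cursor_field: "url"
  request_option:
    field_name: "url"
    inject_into: "request_parameter"
```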

description: >-
Google PageSpeed API Key. See <a
href="https://developers.google.com/speed/docs/insights/v5/get-started#APIKey">here</a>.
The toke is case sensitive.
Contributor

Suggested change
The toke is case sensitive.
The token is case sensitive.


### Performance considerations

Google utilizes an undocumented rate limit which - under normal use - should not be triggered, even with huge datasets.
Contributor

The API Key's rate limit is documented in the Google Cloud console, can we update this section to reflect it?

(Screenshot: API key rate limit shown in the Google Cloud console)

@sajarin
Contributor

sajarin commented Dec 13, 2022

@andnig can you respond to the above comment ASAP? Thanks!

@andnig
Contributor Author

andnig commented Dec 13, 2022

Hey @sajarin: Is there any urgency, or did I miss something? The PR was opened 17 days ago and the review was done just 19 hours ago. The review is spot-on and I'm happy to implement all the requested changes and engage in the discussion, but I need some time to do so.

@sajarin
Contributor

sajarin commented Dec 13, 2022

Hey @andnig we'd like to get this pr merged before the end of the week. There might be significant delays in getting this merged due to the holidays coming up so I just want to make sure any outstanding PRs are able to get merged in before then. Does that make sense?

@andnig
Contributor Author

andnig commented Dec 13, 2022

Hey @andnig we'd like to get this pr merged before the end of the week. There might be significant delays in getting this merged due to the holidays coming up so I just want to make sure any outstanding PRs are able to get merged in before then. Does that make sense?

Ah I see, thanks for clarification. I'll try to squeeze in some hours during my travels.

So far, the connector used one single URL and
categories and strategies hardcoded in the
connector definition.
This commit changes the connector by removing
the second stream (origin) and uses ListStreamSlicers
for url, strategy and category to allow for better
configuration.
@andnig
Contributor Author

andnig commented Dec 13, 2022

@sajarin @cptjacky After your input I actually remodeled the whole connector, as it was indeed quite low-quality. I think it's much better now - so thanks for the push ;-)

  • I used CartesianProductStreamSlicers to allow for multiple combinations of URLs, Strategies and Categories.
  • I removed the origin_stream and allowed the URLs to have an optional "origin:" prefix
  • Added a warning to the docs that the origin-prefix however might be removed anytime soon
  • Added more details to the docs about rate limits
  • Added the source to the source_definitions.yaml
  • Added an icon to icons folder
  • Changed timeout of "check" integration test to 60 seconds

See the newly run integration tests here:

Integration Tests

(Screenshot: integration test output)

@andnig andnig requested a review from a team December 13, 2022 22:06
Contributor

@cptjacky cptjacky left a comment

Thank you @andnig for the changes, they're great!

I have two small remaining comments, but we're almost there!

request_option:
field_name: "strategy"
inject_into: "request_parameter"
category_stream_slicer:
Contributor

Categories do not need a different API call, since a single API call accepts multiple categories.

So I don't think they should be included in the slicers :)

(Also, to be confirmed, but I don't think it is currently useful here, as I think it only affects the Lighthouse results, which we don't save here!)

Contributor Author

@andnig andnig Dec 13, 2022

Do you have any suggestions for how one can add multiple request parameters with the same key? This would be required if we go without this stream-slicer workaround.

Contributor Author

(Also, to be confirmed, but I don't think it is currently useful here, as I think it only affect the Lighthouse results, which we don't save here!)

Ok, good point - actually the schema is missing the Lighthouse report sections. I removed them in the beginning because it's quite hard to find a schema that fits all web pages (the Lighthouse report section contains a lot of objects based on the structure of the website).
Nevertheless, we are reading the Lighthouse results and therefore we are also storing them, right? But as the schema is missing some fields, we can't normalize the results correctly? Or what's actually the impact if there are fields in the streams which are not defined in the schema?

What would you suggest here? Could we simply add the "lighthouseResult" property as an "object" type to the schema for now? Alternatively, I can look through all the properties and manually select the ones which are "universal" - but this will most probably take some days.

Contributor

@cptjacky cptjacky Dec 13, 2022

Do you have any suggestions for how one can add multiple request parameters with the same key? This would be required if we go without this streamslicer workaround.

I'm not sure if this is a hack or if it's intended, but this seems to be working:

      request_parameters:
        key: "{{ config['api_key'] }}"
        category: "{{ config['categories'] }}"

(if we use this, we should make sure a test is covering it, so that we're sure it keeps working in the future)

Nevertheless we are reading the lighthouse results and therefore we are also storing them, right?

For most destinations that go through normalisation, yes; however, I'm not sure there is an easy way to "renormalize" the entire dataset in the future.

I'm using the BigQuery denormalized destination a lot, and in my case the data would be lost :)

I think it's acceptable to not add the lighthouse results at all, and leave it for a further improvement.

I would not be against adding a simplified version of it either, with a few object properties here and there where it's undocumented. I'll let you decide!

request_option:
field_name: "url"
inject_into: "request_parameter"
strategy_stream_slicer:
Contributor

Now that we do one call per strategy, I think we should include the strategy in the output schema!

Otherwise the user has no way of knowing which row of data is MOBILE or DESKTOP
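One way to do that is an AddFields transformation that copies the slice value into each record, roughly like this (a sketch; the exact field names are assumptions):

```yaml
# Sketch: stamp each record with the strategy of the slice it came from.
transformations:
  - type: AddFields
    fields:
      - path: ["strategy"]
        value: "{{ stream_slice['strategy'] }}"
```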

@CLAassistant

CLAassistant commented Dec 14, 2022

CLA assistant check
All committers have signed the CLA.

@andnig
Contributor Author

andnig commented Dec 14, 2022

@cptjacky Next try :-)

  • added the categories as params as you requested
  • added expected_records to test whether this "hack" of yours (which I find an elegant solution, btw) works
  • Added schema for lighthouseResults

Contributor

@cptjacky cptjacky left a comment

Thank you for the changes @andnig!

Creating that JSON schema must not have been easy!
However, as you mentioned earlier, it seems like this is a complex schema, and what Google returns may vary.
I got the following error when trying with my own website's url:

ERROR    root:test_core.py:514 
The pagespeed stream has the following schema errors:
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['button-name']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['button-name']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['lcp-lazy-loaded']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['lcp-lazy-loaded']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-allowed-attr']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['aria-allowed-attr']['score']:
    None
--------------------------------------------------------------------------------
0 is not of type 'null'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['frame-title']['properties']['score']:
    {'type': 'null'}

On instance['lighthouseResult']['audits']['frame-title']['score']:
    0
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-valid-attr-value']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['aria-valid-attr-value']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'number'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['aria-valid-attr']['properties']['score']:
    {'type': 'number'}

On instance['lighthouseResult']['audits']['aria-valid-attr']['score']:
    None
--------------------------------------------------------------------------------
None is not of type 'string'

Failed validating 'type' in schema['properties']['lighthouseResult']['properties']['audits']['properties']['themed-omnibox']['properties']['details']['properties']['items']['items']['properties']['themeColor']:
    {'type': 'string'}

On instance['lighthouseResult']['audits']['themed-omnibox']['details']['items'][0]['themeColor']:
    None

 airbyte-integrations/bases/source-acceptance-test/source_acceptance_test/tests/test_core.py::TestBasicRead.test_read[inputs0] ⨯  

I'm afraid that this is indeed a schema too complicated to get perfectly right, as it depends on the website being tested.

Maybe we should leave the overly complicated objects (like lighthouseResult.audits) as plain object definitions, without further precision?
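For example, the volatile parts could be typed as plain objects in the JSON schema, along these lines (a sketch, not the PR's actual schema file):

```json
{
  "lighthouseResult": {
    "type": ["null", "object"],
    "properties": {
      "audits": { "type": ["null", "object"] }
    }
  }
}
```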

transformations:
- type: AddFields
fields:
- path: ["strategy"]
Contributor

I think this should also be added to the schemas/pagespeed.json file for it to be kept in the record
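The schema addition could be as small as the following (a sketch of the property entry, assuming the file's usual nullable-type convention):

```json
{
  "strategy": {
    "type": ["null", "string"]
  }
}
```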

Contributor

Can you look at that comment @andnig? I believe this is the last outstanding one and then we'll be able to merge!

@cptjacky
Contributor

Also @andnig, it seems like your last commits were made on a different machine and that the commit identity is not properly linked to your GitHub account. Could you fix this (so that the automations correctly detect your signature of the CLA)?

@andnig
Contributor Author

andnig commented Dec 14, 2022

Hey @cptjacky, I removed the details from almost every sub-field of the Lighthouse results. It's a pity, but for the sake of getting this thing finally merged, your suggestion of removing the details makes a lot of sense.

Also resolved the CLA issue.
Added the strategy field to the schema.

Find the integration test results here.

Integration Tests

(Screenshot: integration test output)

Thanks for your patience.

Contributor

@cptjacky cptjacky left a comment

Great! Everything looks good to me! Thanks a lot @andnig!

@sajarin could you setup the acceptance tests run? Thank you!

@sajarin sajarin removed the request for review from a team December 14, 2022 22:49
@sajarin
Contributor

sajarin commented Dec 14, 2022

/publish connector=connectors/source-google-pagespeed-insights run-tests=false

🕑 Publishing the following connectors:
connectors/source-google-pagespeed-insights
https://github.com/airbytehq/airbyte/actions/runs/3699418564


Connector Did it publish? Were definitions generated?
connectors/source-google-pagespeed-insights

if you have connectors that successfully published but failed definition generation, follow step 4 here ▶️

@sajarin sajarin merged commit b6c6b7a into airbytehq:master Dec 14, 2022
@sajarin
Contributor

sajarin commented Dec 14, 2022

Thanks for the PR @andnig and thanks for the review @cptjacky!

Labels
area/connectors Connector related issues area/documentation Improvements or additions to documentation bounty bounty-XL Maintainer program: claimable extra large bounty PR community connectors/source/google-pagespeed-insights

6 participants