New Source: Google Search Console #2257

Closed
JordanChoo opened this issue Mar 2, 2021 · 17 comments · Fixed by #3137

Comments

@JordanChoo

Tell us about the new integration you’d like to have

Which source and which destination? Which frequency?

Google Search Console to BigQuery on a daily basis

Describe the context around this new integration

Which team in your company wants this integration, what for? This helps us understand the use case.

The Google Search Console UI samples the data and provides very limited export functionality. Saving GSC data programmatically gives you all of the data without any sampling.

This information will give us deeper insight into how a website and/or URL is performing across countries, keywords and devices. Saving the data into BigQuery ensures that we can easily visualize it in Google Data Studio and in other data viz platforms that integrate with BigQuery.

Describe the alternative you are considering or using

What are you considering doing if you don’t have this integration through Airbyte?

Planning on building out the custom pipeline myself.

@JordanChoo JordanChoo added area/connectors Connector related issues new-connector labels Mar 2, 2021
@michel-tricot
Contributor

Thank you @JordanChoo!

Are you sure the non-sampled data is available through their API?

@JordanChoo
Author

> Are you sure the non-sampled data is available through their API?

Yes, @michel-tricot. Within the UI you can only export 1K rows of data across a single dimension, while with the API you can pull 1M rows across multiple dimensions. To date I've been using a janky Supermetrics --> GSheet --> BigQuery workaround.
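For readers unfamiliar with how multi-dimension rows are pulled through the API, here is a minimal Python sketch of paging through a Search Analytics query. It assumes the request body fields `startDate`, `endDate`, `dimensions`, `rowLimit` and `startRow`, and a per-request cap of 25,000 rows. The helper names (`build_query_body`, `fetch_all_rows`) and the injected `query_page` callable are hypothetical and only there for illustration; this is not code from the thread.

```python
from typing import Callable, Dict, Iterator, List

# Assumed per-request cap on rowLimit for Search Analytics queries.
PAGE_SIZE = 25000


def build_query_body(start_date: str, end_date: str, dimensions: List[str],
                     start_row: int = 0, row_limit: int = PAGE_SIZE) -> Dict:
    """Build the JSON body for one Search Analytics query page."""
    return {
        "startDate": start_date,
        "endDate": end_date,
        "dimensions": dimensions,
        "rowLimit": row_limit,
        "startRow": start_row,
    }


def fetch_all_rows(query_page: Callable[[Dict], Dict], start_date: str,
                   end_date: str, dimensions: List[str],
                   row_limit: int = PAGE_SIZE) -> Iterator[Dict]:
    """Yield every row by advancing startRow until a short page comes back."""
    start_row = 0
    while True:
        body = build_query_body(start_date, end_date, dimensions,
                                start_row, row_limit)
        rows = query_page(body).get("rows", [])
        yield from rows
        if len(rows) < row_limit:
            break  # a short (or empty) page means there is nothing left
        start_row += row_limit
```

With `google-api-python-client`, `query_page` could plausibly be something like `lambda body: service.searchanalytics().query(siteUrl=site_url, body=body).execute()`, where `service` and `site_url` are set up by the caller.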

@sherifnada
Contributor

@JordanChoo would you be interested in contributing this as a connector to Airbyte? We can offer any support needed (code pairing, etc.), and the bonus is that maintenance can be shared between Airbyte, the community, and you.

@JordanChoo
Author

JordanChoo commented Mar 5, 2021

@sherifnada, at this time I wouldn't be able to help out on this since I only know JS right now 😢

@michael4tasman

Q1: Is there anyone working on this?

Q2: Is it any more complicated than wrapping the Singer tap? ( https://github.com/singer-io/tap-google-search-console )

@sherifnada
Contributor

sherifnada commented Apr 13, 2021

@michael4tasman we're not currently working on this but plan to offer it some time in the next couple of months. The only friction is setting up the sandbox environment/CI to test that the connector is working on a recurring basis. We have already verified a domain for Google Search Console, so at this point we are ready to generate an API key and start querying it during CI.

> Q2: Is it any more complicated than wrapping the Singer tap? ( https://github.com/singer-io/tap-google-search-console )

Thanks for sharing the Singer tap -- it looks high quality and well maintained. So I think we can go with it! Will prioritize this shortly. Expect it sometime later this month or the first half of May. Does that work with your timeline? Alternatively you're more than welcome to open a PR which wraps the Singer tap if you'd like it sooner than that.

@michael4tasman

I need it sooner than that, and I'm happy to work on wrapping the Singer tap, but would appreciate the offer of guidance/pairing made upthread.

@sherifnada
Contributor

@michael4tasman glad to hear it. You can get started by using the module autogenerator as described here: https://docs.airbyte.io/contributing-to-airbyte/building-new-connector

Please feel free to book a pairing session with me at any time here: https://calendly.com/sherif-nada/code-pairing-session If none of the times work for you, please reach out at sherif@airbyte.io to find a different time.

@yevhenii-ldv yevhenii-ldv self-assigned this Apr 14, 2021
@yevhenii-ldv
Contributor

Integration Vetting

Webhook-based? (no/partially/yes)

no

Available authentication modes (API key/Oauth/other)

OAuth 2.0, but we can use client_id, client_secret and refresh_token
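As a rough illustration of the client_id / client_secret / refresh_token mode, the sketch below builds the standard OAuth 2.0 refresh-token request with only the Python standard library. It assumes Google's token endpoint at `https://oauth2.googleapis.com/token`; the function name `build_refresh_request` is hypothetical, and the tap's own client may do this differently.

```python
import urllib.parse
import urllib.request

# Assumed Google OAuth 2.0 token endpoint.
TOKEN_URL = "https://oauth2.googleapis.com/token"


def build_refresh_request(client_id: str, client_secret: str,
                          refresh_token: str) -> urllib.request.Request:
    """Build (but do not send) a POST that trades a refresh token for an access token."""
    form = urllib.parse.urlencode({
        "grant_type": "refresh_token",
        "client_id": client_id,
        "client_secret": client_secret,
        "refresh_token": refresh_token,
    }).encode("ascii")
    return urllib.request.Request(
        TOKEN_URL,
        data=form,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
```

Sending the request with `urllib.request.urlopen(...)` should return a JSON document whose `access_token` field is then passed as a `Bearer` token on API calls.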

Creating an account

Already created, but need to enable the Google Search Console API

How to populate the account with data?

To work with Google Search Console, we have to create a siteUrl (add a site to Search Console) and confirm our ownership of the resource.
@sherifnada, I need help from Airbyte to do this.

Available streams for sync

Integration supports incremental sync?

Only for performance_reports
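To make the incremental case concrete, here is a small Python sketch of date-cursor bookkeeping for a performance report stream. It assumes the cursor is an ISO date string stored per stream in a state dictionary; the function names and the state layout are hypothetical and are not taken from the Singer tap.

```python
from datetime import date, timedelta
from typing import Dict, Iterator


def dates_to_sync(state: Dict[str, str], stream: str,
                  default_start: date, today: date) -> Iterator[date]:
    """Yield each day from the day after the saved cursor (or default_start) up to today."""
    cursor = state.get(stream)
    if cursor:
        current = date.fromisoformat(cursor) + timedelta(days=1)
    else:
        current = default_start
    while current <= today:
        yield current
        current += timedelta(days=1)


def advance_state(state: Dict[str, str], stream: str, synced: date) -> Dict[str, str]:
    """Return a new state whose cursor never moves backwards."""
    latest = synced.isoformat()
    previous = state.get(stream)
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    if previous is None or latest > previous:
        return {**state, stream: latest}
    return state
```

A first run with an empty state syncs everything from `default_start`; later runs only request days after the stored cursor.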

Other information/blockers

Can use Singer Tap for Google Search Console.

@sherifnada
Contributor

@michael4tasman good news! We'll start work on this sooner than scheduled. Expecting delivery this week or the next one.

@sherifnada sherifnada changed the title Google Search Console new source: Google Search Console Apr 18, 2021
@sherifnada sherifnada changed the title new source: Google Search Console New Source: Google Search Console Apr 18, 2021
@vitaliizazmic
Contributor

vitaliizazmic commented Apr 20, 2021

We can use the Singer tap to implement the new source.

Credentials need to be generated according to the instructions and stored in LastPass.

The Singer tap supports Full and Incremental syncing.

The account needs to be populated with data. We can do this via the API for the Sites and Sitemaps streams, but not for the Performance Reports streams.

The Singer tap handles errors and rate limits using backoff.
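For context on what "backoff" means here, the following is a generic exponential-backoff retry sketch in plain Python. It is not the tap's implementation: `RateLimitError` and `call_with_backoff` are hypothetical names, and the delays are illustrative defaults.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error raised when the API asks the caller to slow down."""


def call_with_backoff(func: Callable[[], T], max_tries: int = 5,
                      base_delay: float = 1.0, factor: float = 2.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, waiting base_delay, base_delay*factor, ... between failed tries."""
    delay = base_delay
    for attempt in range(1, max_tries + 1):
        try:
            return func()
        except RateLimitError:
            if attempt == max_tries:
                raise  # out of retries, surface the error to the caller
            sleep(delay)
            delay *= factor
    raise ValueError("max_tries must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be checked in tests without actually waiting.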

Blockers

  1. Test creds
  2. Populating Performance Reports Streams

Task breakdown

  • Create new Source based on singer tap - 4-5h
  • Run integration tests and check new source - 2-3h

@sherifnada
Contributor

@vitaliizazmic can you verify if we need to populate any data manually if we already have a live website serving traffic linked to Google Search Console?

@JordanChoo
Author

@sherifnada - if you want to backfill data (GSC provides up to 16 months) that would have to be done manually

@sherifnada
Contributor

thanks for the heads up @JordanChoo !

@sherifnada
Contributor

sherifnada commented Apr 21, 2021

@vitaliizazmic heads up -- the Singer tap doesn't currently support service-account-based OAuth. See the client class here. We should fork and change this class to implement JWT OAuth as described here.

I've added the service account credentials to Lastpass. They were generated using oauth with domain-wide-delegation.
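As a rough picture of what the JWT-based service-account flow involves, the Python sketch below builds the claim set and the unsigned signing input. It assumes Google's token endpoint as the audience and the read-only Search Console scope; the function names are hypothetical, and the RS256 signature step is deliberately left out because it needs a crypto library and the service account's private key.

```python
import base64
import json
import time
from typing import Dict, Optional

# Assumed values: Google's token endpoint and the read-only Search Console scope.
TOKEN_URL = "https://oauth2.googleapis.com/token"
SCOPE = "https://www.googleapis.com/auth/webmasters.readonly"


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def build_claims(service_account_email: str, subject: Optional[str] = None,
                 now: Optional[int] = None, lifetime: int = 3600) -> Dict:
    """Build the JWT claim set; `subject` is the impersonated user under domain-wide delegation."""
    issued = int(time.time()) if now is None else now
    claims = {
        "iss": service_account_email,
        "scope": SCOPE,
        "aud": TOKEN_URL,
        "iat": issued,
        "exp": issued + lifetime,
    }
    if subject:
        claims["sub"] = subject
    return claims


def signing_input(claims: Dict) -> str:
    """Return '<header>.<claims>', the string that still has to be signed with RS256."""
    header = {"alg": "RS256", "typ": "JWT"}
    parts = [b64url(json.dumps(obj, separators=(",", ":")).encode("utf-8"))
             for obj in (header, claims)]
    return ".".join(parts)
```

In practice a library such as `google-auth` (for example `google.oauth2.service_account.Credentials.from_service_account_info(info, scopes=[...], subject=...)`) would likely handle the signing and token exchange, so a fork of the tap may not need to hand-roll any of this.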

@vitaliizazmic
Contributor

@sherifnada I've checked the service account credentials from LastPass. Everything works fine. I fetched sites, sitemaps and a performance report. I think this will be enough and we don't need to populate data manually.

vitaliizazmic added a commit that referenced this issue Apr 30, 2021
…ptance test configs, change tap repo to airbyte
vitaliizazmic added a commit that referenced this issue May 6, 2021
vitaliizazmic added a commit that referenced this issue May 6, 2021
vitaliizazmic added a commit that referenced this issue May 10, 2021
* Google search console source #2257 - new source

* Google search console source #2257 - reformat

* Google search console source #2257 - adding gcc to docker container

* Google search console source #2257 - remove unused files, update acceptance test configs, change tap repo to airbyte

* Google search console source #2257 - updating acceptance tests configs

* Google search console source #2257 - updating acceptance cursor_paths

* Google search console source #2257 - temporary disable tests

* Google search console source #2257 - disable performance_report_date stream

* Google search console source #2257 - disable performance_report_date stream (update docs)

* Google search console source #2257 - disable performance_report_date stream for tests

* Google search console source #2257 - updating singer tap fork
vitaliizazmic added a commit that referenced this issue May 11, 2021
…ing sync_mode and destination_sync_mode to streams)
@sherifnada
Contributor

Hey everyone - we just released Google Search Console. Add it by going to the Admin page in the UI and adding it as a new connector. Parameters to add it are:

  "name": "Google Search Console",
  "dockerRepository": "airbyte/source-google-search-console-singer",
  "dockerImageTag": "0.1.0",
  "documentationUrl": "https://hub.docker.com/r/airbyte/source-google-search-console-singer"

big thanks to @vitaliizazmic for making it happen!

vitaliizazmic added a commit that referenced this issue Jun 2, 2021
* Jira source #1389  - adding schemas for streams

* Jira source #1389  - supporting streams

* Jira source #1389  - creating_project script

* Jira source #1389  - updating docs

* Jira source #1389  - fixing check method

* Jira source #1389  - uploading missing schemes

* Jira source #1389  - disabling JQL and Server info streams

* Jira source #1389 - fixing according to PR comments

* Jira source #1389 - fixing filter_sharing and screen_tab_fields streams

* Update airbyte-integrations/connectors/source-jira/source_jira/client.py

* Google search console source #2257 - improving configured catalog(adding sync_mode and destination_sync_mode to streams)

* Jira Source - incremental sync

* Jira source #1390 - issues incremental sync

* Jira source #1390 - issue worklogs incremental sync

* Source Jira #1390 - incremental sync improving

* Source Jira #1390 - migrating to airbyte-cdk, creating CHANGELOG.md

* Source Jira #1389 - reformat

* Jira Source HTTP CDK

* Source Jira #3453 - cleaning branch

* Source Jira #3453 - cleaning branch (fix)

* Source Jira #3453 - abstractmethod get_updated_state

* Jira dummy data #2100 #2101

* Jira source #2100 - data generator

* Jira source #2100 - issue related streams populating

* Jira source #2100 - project related streams populating

* Jira source #2101 - populating data for non issue or project related streams

* Source Jira #2100 - improving according to comments

* Source Jira #2100 - format

* Source Jira #1389 - bump version

* Source Jira #1389 - enabling base_read acceptance test divided by stream groups

* Source Jira #1389 - bump version

Co-authored-by: Sherif A. Nada <snadalive@gmail.com>