Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(git): consolidate repository webhook #2667

Merged
merged 24 commits into from
Sep 15, 2022
Merged

feat(git): consolidate repository webhook #2667

merged 24 commits into from
Sep 15, 2022

Conversation

h3n4l
Copy link
Member

@h3n4l h3n4l commented Sep 15, 2022

We used to create a webhook for every VCS project, which brought some problems:

  1. There is a limit to the number of webhooks in a code repository, and Bytebase is likely to exhaust the webhook resources.
  2. When a push event occurs, the code host will send messages to all webhooks indiscriminately, which may consume Bytebase's network resources.
  3. Since we cannot distinguish whether a push event can create an issue in another project, we will create many warning activities.

We try the following solutions:

  1. All Bytebase projects connected to a code repository will reuse a webhook.
  2. When a push event is received, we iterate through each VCS project connected to this webhook. We only create a warning activity for each project when no issue is created.

Some implementation details:

  1. When the same repository web url already exists, we will not create a webhook but reuse the previous one.
  2. Only delete the code host webhook if this project is the last one to use this webhook.
  3. The refresh token should be a batch update.

CleanShot 2022-09-15 at 12 46 14@2x

CleanShot 2022-09-15 at 12 46 31@2x

server/project.go Outdated Show resolved Hide resolved
server/project.go Outdated Show resolved Hide resolved
server/project.go Outdated Show resolved Hide resolved
server/project.go Outdated Show resolved Hide resolved
server/project.go Outdated Show resolved Hide resolved
server/project.go Outdated Show resolved Hide resolved
@tianzhou
Copy link
Collaborator

Also, with this logic, can we also declare a unique key on web_url?

h3n4l and others added 3 commits September 15, 2022 15:19
Co-authored-by: tianzhou <t@bytebase.com>
Co-authored-by: tianzhou <t@bytebase.com>
@h3n4l
Copy link
Member Author

h3n4l commented Sep 15, 2022

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

server/project.go Outdated Show resolved Hide resolved
api/repository.go Show resolved Hide resolved
api/repository.go Outdated Show resolved Hide resolved
api/repository.go Outdated Show resolved Hide resolved
api/repository.go Outdated Show resolved Hide resolved
server/project.go Outdated Show resolved Hide resolved
Copy link
Collaborator

@tianzhou tianzhou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@h3n4l h3n4l merged commit facc29f into bytebase:main Sep 15, 2022
@h3n4l h3n4l deleted the h3n4l/byt-1336-gitops-consolidate-repository-webhook branch September 15, 2022 11:11
@tianzhou
Copy link
Collaborator

tianzhou commented Sep 15, 2022

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

You are right. On the hand, we can try using the EXCLUDE constraint: https://www.postgresql.org/docs/current/ddl-constraints.html

EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>)

This essentially means if any of the two rows have the same web_url, then the corresponding external_webhook_id cannot be different

Or

If any of the two rows have different external_webhook_id, then the corresponding web_url can't be equal.

@h3n4l
Copy link
Member Author

h3n4l commented Sep 15, 2022

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

You are right. On the hand, we can try using the EXCLUDE constraint: postgresql.org/docs/current/ddl-constraints.html

EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>)

This essentially means if any of the two rows have the same web_url, then the corresponding external_webhook_id cannot be different

Or

If any of the two rows have different external_webhook_id, then the corresponding web_url can't be equal.

I will try to add this constraint in another pull request. I'm more worried about the token than the external_webhook_id.

@h3n4l
Copy link
Member Author

h3n4l commented Sep 15, 2022

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

You are right. On the hand, we can try using the EXCLUDE constraint: postgresql.org/docs/current/ddl-constraints.html

EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>)

This essentially means if any of the two rows have the same web_url, then the corresponding external_webhook_id cannot be different

Or

If any of the two rows have different external_webhook_id, then the corresponding web_url can't be equal.

I try to add it. But it seems that we need CREATE EXTENSION btree_gist first.
CleanShot 2022-09-15 at 21 27 11@2x

Although btree_gist appears to be installed by default, since we can use external PostgreSQL, so execute CREATE EXTENSION might not be a good choice. @tianzhou

@tianzhou
Copy link
Collaborator

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

You are right. On the hand, we can try using the EXCLUDE constraint: postgresql.org/docs/current/ddl-constraints.html
EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>)
This essentially means if any of the two rows have the same web_url, then the corresponding external_webhook_id cannot be different
Or
If any of the two rows have different external_webhook_id, then the corresponding web_url can't be equal.

I try to add it. But it seems that we need CREATE EXTENSION btree_gist first. CleanShot 2022-09-15 at 21 27 11@2x

Although btree_gist appears to be installed by default, since we can use external PostgreSQL, so execute CREATE EXTENSION might not be a good choice. @tianzhou

We are creating a separate database to store our metadata, the installed extension scope is the schema under the database. As long as we have the right permission, it should be fine.

@h3n4l
Copy link
Member Author

h3n4l commented Sep 15, 2022

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

You are right. On the hand, we can try using the EXCLUDE constraint: postgresql.org/docs/current/ddl-constraints.html
EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>)
This essentially means if any of the two rows have the same web_url, then the corresponding external_webhook_id cannot be different
Or
If any of the two rows have different external_webhook_id, then the corresponding web_url can't be equal.

I try to add it. But it seems that we need CREATE EXTENSION btree_gist first. CleanShot 2022-09-15 at 21 27 11@2x
Although btree_gist appears to be installed by default, since we can use external PostgreSQL, so execute CREATE EXTENSION might not be a good choice. @tianzhou

We are creating a separate database to store our metadata, the installed extension scope is the schema under the database. As long as we have the right permission, it should be fine.

Yes, on the other hand we are not sure about external PostgreSQL, let's say the user is using RDS, but maybe this extension is not available on RDS.

@tianzhou
Copy link
Collaborator

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

You are right. On the hand, we can try using the EXCLUDE constraint: postgresql.org/docs/current/ddl-constraints.html
EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>)
This essentially means if any of the two rows have the same web_url, then the corresponding external_webhook_id cannot be different
Or
If any of the two rows have different external_webhook_id, then the corresponding web_url can't be equal.

I try to add it. But it seems that we need CREATE EXTENSION btree_gist first. CleanShot 2022-09-15 at 21 27 11@2x
Although btree_gist appears to be installed by default, since we can use external PostgreSQL, so execute CREATE EXTENSION might not be a good choice. @tianzhou

We are creating a separate database to store our metadata, the installed extension scope is the schema under the database. As long as we have the right permission, it should be fine.

Yes, on the other hand we are not sure about external PostgreSQL, let's say the user is using RDS, but maybe this extension is not available on RDS.

btree_gist is a basic extension that is included in pretty much every pg distribution. It's more viewed as a built-in feature implemented as an extension.

AWS RDS https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html
GCP https://cloud.google.com/sql/docs/postgres/extensions
Azure https://docs.microsoft.com/en-us/azure/postgresql/single-server/concepts-extensions
huawei https://support.huaweicloud.com/intl/en-us/bestpractice-rds/rds_pg_0033.html

It's a fair ask to require btree_gist

GitLab https://docs.gitlab.com/ee/install/postgresql_extensions.html

The reason why I would push this is shit happens, and the database constraint is our last defense. As our schema becomes more complex, it's unavoidable to rely on btree_gist or we just run more risks of having data issues.

@tianzhou
Copy link
Collaborator

Especially for this one, without db constraints, we can create multiple webhooks if the timing is right. So our 1 webhook per VCS project is not an invariant.

@d-bytebase
Copy link
Contributor

d-bytebase commented Sep 15, 2022

I’m more worried about the privileges that our users give to us as CREATE EXTENSION typically requires superuser privilege.

But I think that works on GCP and AWS. We may have to try out other platform as update documentations if needed.

https://www.postgresql.org/docs/current/sql-createextension.html

@tianzhou
Copy link
Collaborator

https://www.postgresql.org/docs/current/btree-gist.html

"This module is considered “trusted”, that is, it can be installed by non-superusers who have CREATE privilege on the current database."

@h3n4l
Copy link
Member Author

h3n4l commented Sep 16, 2022

Also, with this logic, can we also declare a unique key on web_url?

The web_url will be the same if some Bytebase VCS projects link to the same repository.

You are right. On the hand, we can try using the EXCLUDE constraint: postgresql.org/docs/current/ddl-constraints.html
EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>)
This essentially means if any of the two rows have the same web_url, then the corresponding external_webhook_id cannot be different
Or
If any of the two rows have different external_webhook_id, then the corresponding web_url can't be equal.

I try to add it. But it seems that we need CREATE EXTENSION btree_gist first. CleanShot 2022-09-15 at 21 27 11@2x
Although btree_gist appears to be installed by default, since we can use external PostgreSQL, so execute CREATE EXTENSION might not be a good choice. @tianzhou

We are creating a separate database to store our metadata, the installed extension scope is the schema under the database. As long as we have the right permission, it should be fine.

Yes, on the other hand we are not sure about external PostgreSQL, let's say the user is using RDS, but maybe this extension is not available on RDS.

btree_gist is a basic extension that is included in pretty much every pg distribution. It's more viewed as a built-in feature implemented as an extension.

AWS RDS docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_PostgreSQL.html GCP cloud.google.com/sql/docs/postgres/extensions Azure docs.microsoft.com/en-us/azure/postgresql/single-server/concepts-extensions huawei support.huaweicloud.com/intl/en-us/bestpractice-rds/rds_pg_0033.html

It's a fair ask to require btree_gist

GitLab docs.gitlab.com/ee/install/postgresql_extensions.html

The reason why I would push this is shit happens, and the database constraint is our last defense. As our schema becomes more complex, it's unavoidable to rely on btree_gist or we just run more risks of having data issues.

I thought of another way to solve the inconsistency.
The data inconsistency problem we will now have:

  1. Two projects link to the same repository at the same time:

The sequence of operation:

Thread1 Thread2
Find Find
Create Webhook
Create Webhook
Insert repository record Insert repository record

In this case, we will create two identical webhooks like this:
img_v2_4502680d-2af4-45ef-b4d4-508441518a4g

This will degenerate into our previous logic.

  1. Create a new VCS project if another project is refreshing the token
Thread1 Thread2
Find
Read file content
Refresh token
Insert repository record

In this case, the token of the newly inserted project is expired, the user will find this project unavailable until the next token refresh.

I think the fundamental reason is that after we reuse webhooks, the records in the repository will have redundancy in many fields.

We can move some columns to a new table(e.g.: external_repository), and then this part of the schema will look like this
CleanShot 2022-09-16 at 12 28 57@2x
NOTE: the web_url column inexternal_repository has UNIQUE constraint.

For the problem1, insert the second record that will encounter the error because the web_url has UNIQUE constraint, and the user can retry insert, it will reuse the webhook created by thread1 in retry.
But thread 2 will still create an identical webhook on the first attempt, and the message sent by this webhook will also be processed by us (because the endpoint is valid and we use the workspace id). To handle this, we can use the previous method (generate new uuid), so that although it is created, the message sent by the webhook will not be processed by us.

For the problem2, since we only need to update one field in external_repo, this problem should disappear.

@h3n4l
Copy link
Member Author

h3n4l commented Sep 16, 2022

Of course, it looks like using a GiST index can solve both problems as well. We just need to weigh whether to introduce an extension.

@d-bytebase
Copy link
Contributor

The external repository is my preference as that’s a perfect design. We are less vulnerable because the # of repositories and projects isn’t quite large and updates are not frequent (except for tokens).

@h3n4l
Copy link
Member Author

h3n4l commented Sep 16, 2022

Yes, it's easy to have trouble representing N:1 mappings with redundancy within a table.

@h3n4l
Copy link
Member Author

h3n4l commented Sep 17, 2022

What do you think? @tianzhou

@tianzhou
Copy link
Collaborator

tianzhou commented Sep 17, 2022

@h3n4l well, my intention is to fix the data consistency issue at a low cost. And I think introducing gist extension is a low cost. But looks like we don't feel the same way on this.

And TBH, I think the new proposed solution to introduce a new table is overkill to just solve that data inconsistency. Introducing a new table bears a high cost, and for this one, it requires data migration. I don't feel it justifies the effort.

In summary, my preference:

  1. Use gist extension

  2. Do nothing

  3. Introduce a new table.

To me, introducing a new table is probably a No to me at this point. And if you do think "Use gist extension" is problematic, then I am OK with "Do nothing".

@h3n4l
Copy link
Member Author

h3n4l commented Sep 17, 2022

@h3n4l well, my intention is to fix the data consistency issue at a low cost. And I think introducing gist extension is a low cost. But looks like we don't feel the same way on this.

And TBH, I think the new proposed solution to introduce a new table is overkill to just solve that data inconsistency. Introducing a new table bears a high cost, and for this one, it requires data migration. I don't feel it justifies the effort.

In summary, my preference:

  1. Use gist extension

  2. Do nothing

  3. Introduce a new table.

To me, introducing a new table is probably a No to me at this point. And if you do think "Use gist extension" is problematic, then I am OK with "Do nothing".

Do nothing is not suitable. Let's use GiST index to solve this. We can consider introduce a new table again if users encounter error about using GiST index. We should do our best effort. @tianzhou @d-bytebase

@d-bytebase
Copy link
Contributor

What will happen if we do nothing and how could it happen?

@h3n4l
Copy link
Member Author

h3n4l commented Sep 17, 2022

What will happen if we do nothing and how could it happen?

As #2667 (comment) said, the main problems we will encounter are:

  1. Inserting the repository at the same time, resulting in the creation of the same webhook, and the token refresh may cause the token in another repository record to be refreshed to an incorrect value
  2. After reading the record, before inserting the record, the token is refreshed and the token error in the new record is caused.

@d-bytebase
Copy link
Contributor

The external repository is my preference as that’s a perfect design. We are less vulnerable because the # of repositories and projects isn’t quite large and updates are not frequent (except for tokens).

The updates are rare so it’s okay to do nothing. Even with EXCLUDE rule, we’re preventing bugs to damage data in the struct but we are still losing the updating data.

@h3n4l
Copy link
Member Author

h3n4l commented Sep 17, 2022

The external repository is my preference as that’s a perfect design. We are less vulnerable because the # of repositories and projects isn’t quite large and updates are not frequent (except for tokens).

The updates are rare so it’s okay to do nothing. Even with EXCLUDE rule, we’re preventing bugs to damage data in the struct but we are still losing the updating data.

EXCLUDE rule can prevent those problems. We need something like EXCLUDE USING GIST (web_url WITH =, external_webhook_id WITH <>, token WITH <>, refresh_token WITH <>)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants