Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(gatsby-source-drupal): Disable caching + add http/2 agent #32012

Merged
merged 9 commits into from
Jun 22, 2021

Conversation

KyleAMathews
Copy link
Contributor

@KyleAMathews KyleAMathews commented Jun 21, 2021

Talking to Gatsby/Drupal users — most set long max-age for the Drupal cache to
keep their edge http cache as fresh as possible as Drupal can directly purge
its edge http cache.

But this has the unfortunate side-effect with the recent http client change in
#31514 that API calls aren't
revalidated. Meaning that a user could change some content in Drupal and not
see those updates in their Gatsby site until the Drupal cache expires in the
Gatsby cache.

We tested using etag for validation but the overhead of reading/writing to the Gatsby
cache + http revalidation made it somewhat slower than just refetching every time.

While working on this PR, I also noticed that we were missing an http2 agent which
helped speed up requests on large Drupal sites.

I also noticed that we're using Array.concate which gets quite slow with large arrays so I replaced
that with Array.push(...arr) saving a couple of seconds on a large (70k entities) test site.

This site now fetches all data on a warm varnish cache in ~12 seconds!

Talking to Gatsby/Drupal users — most set long max-age for the Drupal cache to
keep their edge http cache as fresh as possible as Drupal can directly purge
its edge http cache.

But this has the unfortunate side-effect with the recent http client change in
#31514 that API calls aren't
revalidated. Meaning that a user could change some content in Drupal and not
see those updates in their Gatsby site until the Drupal cache expires in the
Gatsby cache.

This PR removes the `cache-control` header from Drupal API responses so that
we only can use `etag` for caching which forces revalidation on every request.
@gatsbot gatsbot bot added the status: triage needed Issue or pull request that need to be triaged and assigned to a reviewer label Jun 21, 2021
@Auspicus
Copy link
Contributor

Some relevant Drupal content:

Currently the UNIX timestamp is used for the ETag header when serving pages cacheable by proxies and/or the browser. This does not strictly conform to the specification (RFC 2616, Section 13.3) because the ETag is expected to be a strong validator. The UNIX timestamp does not qualify as a strong validator. In order to comply with the RFC, it is either necessary to explicitly mark the ETag as a being weak or alternatively compute a hash of the content and use that.

Because Drupal does not compute hashes for content (there are reasons for this) it's not the best idea to use ETag for cache validation. ETag values will change every time the public proxy clears so the value more or less just represents the last time the public proxy was cleared. On top of this most Drupal users set a long max-age for their public cache because fetching pages from Varnish or another cache proxy is orders of magnitude faster than a Drupal bootstrap. In addition, Drupal now has solid cache invalidation techniques via cache tags so it's much safer to trust the cache.

Considering this, and following discussions in the Drupal slack, I think we are leaning towards not having a local cache for requests in Gatsby and instead always forwarding the requests through Varnish.

There is one more thing to bear in mind related to caching and Varnish and that would be authentication. When making an authenticated request the public cache is bypassed and Drupal is always bootstrapped. @KyleAMathews and I talked about allowing for a config option which would determine which entities require authentication to be passed since it's usually only a few resource types that you need to use an admin role to fetch. eg. redirects, crops, etc.

@KyleAMathews KyleAMathews changed the title feat(gatsby-source-drupal): Force revalidation on Drupal API requests feat(gatsby-source-drupal): Disable caching + add http/2 agent Jun 22, 2021
@gatsbyjs gatsbyjs deleted a comment from gatsbot bot Jun 22, 2021
}

async function worker([url, options]) {
return got(url, { agent, ...options })
return got(url, {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You also need to pass request: http2wrapper.auto here, as Got v11 doesn't use the newest version of http2-wrapper. Got v12 is going to be released in the next few weeks.

Copy link
Contributor

@wardpeet wardpeet Jun 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for weighing in @szmarczak !

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I almost forgot. You need to set the http2 option to true as well. It will pass the entire agent object to http2wrapper, otherwise it would pass agent: httpsAgent which results in an error if the endpoint is HTTP/2.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh haha thanks!

1_sXntGz_sVG3164fdY03-qQ

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should fix it 5fbf618

Thanks btw for all your great work on Got! It's a fantastic piece of software.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random aside while we have you, I profiled this with http/2 & it looks like normalizeArguments is using more than its fair share of CPU?

Screen Shot 2021-06-22 at 9 42 43 AM

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks btw for all your great work on Got! It's a fantastic piece of software.

No problem :) I really enjoy improving it.

oh haha thanks!

https://github.com/sindresorhus/got/blob/f896aa52abc41fe40d4942da94a0408477358f14/source/core/index.ts#L2364-L2367

This hasn't been documented, sorry. One of the ways to fix this in Got would be to always pass the entire agent object to the request function, but then we would need to default request to something like this:

request = (url, options) => {
	if (url.protocol === 'https:') {
		options.agent = options.agent.https;
		return https.request(url, options);
	}

	options.agent = options.agent.http;
	return http.request(url, options);
};

But then request: https.request wouldn't work. We could detect native http functions but then it would fail on custom functions that at the end return the same what https.request does.

I'm not sure what's the best solution here. Alternatively we could make request an object with protocols http2 https and http as keys...

For now http2: true does the trick :P

It looks like normalizeArguments is using more than its fair share of CPU?

Indeed. Can you point it down to a few lines? I suppose it's caused by too intensive object creation. The upcoming Got version uses getters & setters in order to avoid re-doing normalization if not necessary. If you could create a reproducible repo, that would be awesome.

Also feel free to create an issue in the Got repo about this :)

@LekoArts LekoArts added topic: source-drupal Related to Gatsby's integration with Drupal and removed status: triage needed Issue or pull request that need to be triaged and assigned to a reviewer labels Jun 22, 2021
@LekoArts LekoArts added this to To cherry-pick in Release candidate via automation Jun 22, 2021
Copy link
Contributor

@vladar vladar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚢

@vladar vladar added the bot: merge on green Gatsbot will merge these PRs automatically when all tests passes label Jun 22, 2021
@gatsbybot gatsbybot merged commit 113e43e into master Jun 22, 2021
@gatsbybot gatsbybot deleted the force-revalidate branch June 22, 2021 17:39
@vladar vladar moved this from To cherry-pick to Backport PR opened in Release candidate Jun 22, 2021
vladar pushed a commit that referenced this pull request Jun 22, 2021
* feat(gatsby-source-drupal): Force revalidation on Drupal API requests

Talking to Gatsby/Drupal users — most set long max-age for the Drupal cache to
keep their edge http cache as fresh as possible as Drupal can directly purge
its edge http cache.

But this has the unfortunate side-effect with the recent http client change in
#31514 that API calls aren't
revalidated. Meaning that a user could change some content in Drupal and not
see those updates in their Gatsby site until the Drupal cache expires in the
Gatsby cache.

This PR removes the `cache-control` header from Drupal API responses so that
we only can use `etag` for caching which forces revalidation on every request.

* Remove debugging line

* Add http/2 agent

* Disable cache — it's slower than refetch with revalidation

* push don't concat

* consistent return

* Fix statusCode check for got

* Actually this was correct...

* Actually use http/2

(cherry picked from commit 113e43e)

# Conflicts:
#	packages/gatsby-source-drupal/package.json
vladar added a commit that referenced this pull request Jun 22, 2021
… (#32038)

* feat(gatsby-source-drupal): Force revalidation on Drupal API requests

Talking to Gatsby/Drupal users — most set long max-age for the Drupal cache to
keep their edge http cache as fresh as possible as Drupal can directly purge
its edge http cache.

But this has the unfortunate side-effect with the recent http client change in
#31514 that API calls aren't
revalidated. Meaning that a user could change some content in Drupal and not
see those updates in their Gatsby site until the Drupal cache expires in the
Gatsby cache.

This PR removes the `cache-control` header from Drupal API responses so that
we only can use `etag` for caching which forces revalidation on every request.

* Remove debugging line

* Add http/2 agent

* Disable cache — it's slower than refetch with revalidation

* push don't concat

* consistent return

* Fix statusCode check for got

* Actually this was correct...

* Actually use http/2

(cherry picked from commit 113e43e)

# Conflicts:
#	packages/gatsby-source-drupal/package.json

Co-authored-by: Kyle Mathews <mathews.kyle@gmail.com>
@vladar vladar moved this from Backport PR opened to Backported in Release candidate Jun 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bot: merge on green Gatsbot will merge these PRs automatically when all tests passes topic: source-drupal Related to Gatsby's integration with Drupal
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

7 participants