This repository has been archived by the owner on Jan 13, 2022. It is now read-only.

/healthcheck endpoint should check for Elasticsearch availability #487

Closed
aldenstpage opened this issue May 6, 2020 · 10 comments
Labels
✨ goal: improvement Improvement to an existing feature 🙅 status: discontinued Not suitable for work as repo is in maintenance 🏷 status: label work required Needs proper labelling before it can be worked on

Comments

@aldenstpage
Contributor

aldenstpage commented May 6, 2020

During deployments, our load balancer repeatedly polls the /healthcheck endpoint to check that the server is reachable. If this check succeeds, the newly deployed instance starts receiving production traffic. Right now, if Elasticsearch is not responsive, /healthcheck will still return 200 OK.

The healthcheck endpoint should check the health of the image index in Elasticsearch using the cluster health API. If it is unavailable, return error 500. Log an informative message explaining why the healthcheck failed.

Because the healthcheck endpoint may be called many times, and Elasticsearch calls are not free, we should cache the Elasticsearch response for up to 10 seconds.
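
A minimal sketch of the behaviour requested above: return 200 or 500, log the reason for a failure, and reuse the result for up to 10 seconds. The class name `CachedHealthcheck` and the `probe` callable are hypothetical and not taken from the repository. The cache here lives in process memory, so a deployment with several workers would need a shared cache instead.

```python
import logging
import time

log = logging.getLogger(__name__)

CACHE_TTL_SECONDS = 10  # from the issue: cache for up to 10 seconds


class CachedHealthcheck:
    """Wrap a probe callable and reuse its result for a short TTL."""

    def __init__(self, probe, ttl=CACHE_TTL_SECONDS, clock=time.monotonic):
        # Hypothetical contract: probe() returns (ok: bool, reason: str).
        self._probe = probe
        self._ttl = ttl
        self._clock = clock
        self._cached = None  # (expires_at, status_code) once populated

    def status_code(self):
        now = self._clock()
        if self._cached is not None and now < self._cached[0]:
            return self._cached[1]  # still fresh, skip Elasticsearch
        try:
            ok, reason = self._probe()
        except Exception as exc:
            # An unreachable cluster counts as unhealthy.
            ok, reason = False, f"Elasticsearch probe raised {exc!r}"
        if not ok:
            log.error("Healthcheck failed: %s", reason)
        code = 200 if ok else 500
        self._cached = (now + self._ttl, code)
        return code
```

A view could call `status_code()` and return an empty response with that status. Failures are cached as well, so a broken cluster is not probed more than once per TTL window.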

@madewithkode

Hi Alden, this looks interesting, I'd love to work on it.

@madewithkode

Hi Alden, in order to check the health of the image index in the /healthcheck view, I'm trying to use urllib's urlopen() function to make a request to Elasticsearch's cluster health API this way:

cluster_response = urlopen('http://0.0.0.0:8000/_cluster/health/image')

However, I keep getting a 404. Is there something I'm doing wrong?

@madewithkode

Figured this out: I didn't know Elasticsearch was running on a separate host/port :)

@aldenstpage
Contributor Author

aldenstpage commented May 8, 2020

That's great!

It would be best to use the equivalent elasticsearch-py or elasticsearch-dsl query instead of making direct calls to the REST API (you can get an instance of the connection to Elasticsearch from search_controller.py). Here's an example for getting the cluster health; there ought to also be a way to narrow the query to the image index.
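
As a rough illustration of the suggestion above, the sketch below asks for the health of a single index through the client library rather than the REST API. It assumes `es` is an elasticsearch-py `Elasticsearch` client, such as the connection obtained from search_controller.py. The function name `image_index_status` is hypothetical.

```python
def image_index_status(es, index="image"):
    """Return the health status ('green', 'yellow' or 'red') for one index.

    `es` is assumed to be an elasticsearch-py client; `cluster.health`
    accepts an `index` argument that narrows the report to that index.
    """
    health = es.cluster.health(index=index)
    return health["status"]
```

The response is a dictionary, so other fields of the cluster health report (for example `timed_out`) can be read from the same call if they are needed.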

@madewithkode

madewithkode commented May 9, 2020 via email

@madewithkode

Update:

I've successfully managed to query the health of the entire cluster using the Elasticsearch connection instance from search_controller.py. However, when I try to limit the health check to just the image index, the request never resolves and runs indefinitely with no response. When I try to specify a timeout for the request, I get an "Illegal argument exception", even though timeout is a valid kwarg listed in the API docs.

Note that at the time of writing I have not yet managed to run ./load_sample_data.sh successfully, so I don't know whether that is related to the problem above.
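
One possible explanation for the "Illegal argument exception", not confirmed anywhere in this thread: in elasticsearch-py 7.x, the `timeout` argument of `cluster.health` is forwarded to Elasticsearch as a query parameter, which expects a time value with a unit such as `"5s"`, while the client-side HTTP timeout is the separate `request_timeout` argument. The sketch below shows both under that assumption. The function name and the specific durations are hypothetical.

```python
def image_health_with_timeouts(es, index="image"):
    """Query one index's health with explicit server- and client-side limits.

    Assumes an elasticsearch-py 7.x style client. The values are
    illustrative only.
    """
    return es.cluster.health(
        index=index,
        timeout="5s",        # server-side wait; a time value with a unit
        request_timeout=10,  # client-side HTTP timeout, in seconds
    )
```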

@madewithkode

Hi Alden, Progress Report :)

I got load_sample_data.sh to run successfully, and so far everything else is working fine.
I've also set up the 10s response caching on the /healthcheck view using Redis, as well as the error logging.

However, I figured out that querying the Elasticsearch image index was unresponsive because the index did not exist, and the whole cluster was empty too.

Do I need to populate it manually or something?

@aldenstpage
Contributor Author

aldenstpage commented May 11, 2020

Hi again Onyenanu – if the index doesn't exist, the healthcheck should fail. This could happen in situations where we are switching Elasticsearch clusters in production and forgot to index data into the new one (or something went wrong while we were loading data into the new cluster).

In my experience, the ES Python libs can behave in unexpected ways that you sometimes have to work around. Since it seems like querying specifically for the image index health hangs when the index doesn't exist, perhaps you could query for healthchecks of every index in the cluster, and fail the healthcheck if image is not among them and green?

It sounds like it's coming along nicely!
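
The workaround suggested above could be sketched as follows: request per-index health for the whole cluster and pass only if the image index is present and green. This assumes `es` is an elasticsearch-py client and relies on the `level="indices"` option of the cluster health API, which adds an `indices` mapping to the response. The function name `image_index_is_green` is hypothetical.

```python
def image_index_is_green(es, index="image"):
    """Return True only if `index` exists in the cluster and is green.

    Asking for every index avoids addressing a missing index directly,
    which is the request reported to hang in this thread.
    """
    health = es.cluster.health(level="indices")
    entry = health.get("indices", {}).get(index)
    return entry is not None and entry.get("status") == "green"
```

On a single-node development cluster, an index configured with replicas typically reports yellow rather than green, so a strict green check may fail locally even when the index exists.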

@madewithkode

Hey Alden, many thanks again for the helpful insights. The suggestion sounds good; I'll proceed with it.

And yes, the whole thing is getting more interesting. I've learnt a lot in the past few days :)

madewithkode added a commit to madewithkode/cccatalog-api that referenced this issue May 12, 2020
madewithkode added a commit to madewithkode/cccatalog-api that referenced this issue May 12, 2020
madewithkode added a commit to madewithkode/cccatalog-api that referenced this issue May 14, 2020
@kgodey kgodey added this to Pending Review in Backlog May 15, 2020
madewithkode added a commit to madewithkode/cccatalog-api that referenced this issue May 15, 2020
@kgodey kgodey moved this from Pending Review to Q2 2020 in Backlog May 21, 2020
@annatuma annatuma removed this from Q2 2020 in Backlog Jun 12, 2020
@annatuma annatuma added this to Ready for Development in Active Sprint via automation Jun 12, 2020
@annatuma annatuma moved this from Ready for Development to In Progress (Community) in Active Sprint Jun 12, 2020
@kgodey kgodey added ✨ goal: improvement Improvement to an existing feature and removed enhancement labels Sep 24, 2020
@cc-open-source-bot cc-open-source-bot added the 🏷 status: label work required Needs proper labelling before it can be worked on label Dec 2, 2020
@kgodey kgodey added the 🙅 status: discontinued Not suitable for work as repo is in maintenance label Dec 16, 2020
@kgodey kgodey closed this as completed Dec 16, 2020
Active Sprint automation moved this from In Progress (Community) to Done Dec 16, 2020
@TimidRobot TimidRobot removed this from Done in Active Sprint Jan 12, 2022

4 participants