A monitor bot #124

Closed
thuhole opened this issue Jul 29, 2020 · 9 comments

Comments

@thuhole

thuhole commented Jul 29, 2020

Is your feature request related to a problem? Please describe.
cState is only a frontend; a backend monitor bot should be provided.

Describe the solution you'd like
I've written a very simple bot myself at https://github.com/thuhole/status-probe, with these features:

  • Monitor websites and automatically publish incidents using GitHub API.
  • In case of GitHub API downtime, the bot would retry GitHub API in a separate thread.
  • When the connection error is on the probe's side rather than the server's, the bot uses a trick to handle it; see the readme.
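
To make the "publish incidents using GitHub API" step concrete, here is a minimal sketch, not the actual code of status-probe. It assumes incidents are Markdown files with cState-style front matter and that they are committed through GitHub's "create or update file contents" REST endpoint. The function names, the `master` default branch, and the commit message are hypothetical.

```python
import base64
import json
from datetime import datetime, timezone


def build_incident(title, affected, severity="down", now=None):
    """Return the Markdown text of an unresolved, cState-style incident."""
    now = now or datetime.now(timezone.utc)
    lines = [
        "---",
        # JSON string quoting is also valid YAML double-quoted syntax.
        f"title: {json.dumps(title)}",
        f"date: {now.strftime('%Y-%m-%d %H:%M:%S')}",
        "resolved: false",
        f"severity: {severity}",
        "affected:",
    ]
    lines += [f"  - {json.dumps(name)}" for name in affected]
    lines += ["section: issue", "---", "", "Investigating..."]
    return "\n".join(lines) + "\n"


def build_commit_request(owner, repo, path, text, branch="master"):
    """Return (url, json_body) for a PUT to GitHub's contents endpoint.

    The branch name and commit message are assumptions for illustration.
    """
    url = f"https://api.github.com/repos/{owner}/{repo}/contents/{path}"
    body = {
        "message": f"Open incident: {path}",
        "content": base64.b64encode(text.encode("utf-8")).decode("ascii"),
        "branch": branch,
    }
    return url, json.dumps(body)
```

The sketch only builds the request. Sending it needs an HTTP `PUT` with an `Authorization` header carrying a token, and a failed request could be put on a queue and retried from a separate thread, as the second bullet describes.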

It would also be a good idea to add some other features, such as webhook, email notification, custom messages, more types of services etc... I just think that such a bot should be an official cState feature, so users don't need to write scripts themselves.

Describe alternatives you've considered
Publishing incidents manually. However, manually publishing disruptions usually involves a delay. You cannot ask a human to stay up 24/7.

Additional context

@thuhole
Author

thuhole commented Jul 29, 2020

BTW, my script is written in Python. Maybe Rust/Go/C++ would be a better choice, since the monitoring bot may run on a Raspberry Pi with limited resources.

mistermantas added the discussion, enhancement, and stuck / backlog labels Jul 29, 2020
@mistermantas
Member

mistermantas commented Jul 29, 2020

Hi! This is all worth talking about, as what you're describing (backend automations) touches on inherent limitations of cState, some of which are, logically, very disappointing.

First of all, I must ask you to clarify what you mean by this:

webhook, email notification, custom messages, more types of services etc...

  • What kinds of webhooks and for what purposes?
  • Email notifications if something on the page changes, right?
  • What kinds of custom messages?
  • What other types of services?
  • "etc"?

Now let's think about in what contexts cState may be used.

Recently, with the help of a friend, I added a Dockerfile to the repository, as that was an option requested long ago by somebody on this same issue tracker. From what I know, this means that cState can be hosted 'dynamically' like WordPress or whatever.

However, when I initially made the project in 2017-2018, it was supposed to work (best) with one platform and one platform only: Netlify. Or something like it that supported the Golang Hugo SSG.

Netlify builds the site with the aforementioned Hugo engine, which I had fallen in love with at the time (because doing it locally every time would have been super annoying), and Netlify hosting is a glorified CDN.

The advantages and disadvantages of static hosting are:

  1. Content changes require rebuilding the entire site. The build itself takes only a few seconds; on Netlify the whole process should generally finish in about a minute or less. That delay is not something you would want at a big company with SLAs if you need to update users with information ASAP.
  2. Unless you do some JavaScript magic -- which I avoided to make a status page that works on as many browsers and as many contexts as possible -- there is little 'interactivity'. So doing notifications without a separate backend or some other service of Netlify like Lambda Functions (or whatever) is impossible. Not a good option for enterprises yet again.
  3. The upshot is that this setup with Netlify -- this glorified CDN -- is very cheap for obvious reasons. Most of the time your services should be working.

For a small startup or a hobby project there is no such urgency as for big services. And if there is such urgency, why would you host your own status page? On your own infrastructure? That's what happened with AWS S3 a while back. A big headache indeed.

Because remember, Netlify — with hosting — is basically free. So is cState. Even with Cachet — which I believe has more features — you need to host it yourself which costs some money.

This does not mean that cState cannot be automated and in fact I wish that it was, to some extent.

But I'm a front-end web developer, in fact I am more of a designer these days. I am not knowledgeable in Python, nor C, C++, Rust, Go, etc. You get the point.

Of course I would love if somebody made integrations -- or even became a maintainer and worked officially on this same org (under a repo like cstate/automations).

I want any features that are added officially to be done so with care, so that the project stays simple for those who simply want a glorified informational feed.

One last thing to mention is that 3rd party integrations are always possible -- just look at your own, it looks great! -- and I usually tell people who want live updates -- what I call monitoring -- to use Custom Tabs, Custom HTML, because those are the easiest options available right now.

@thuhole
Author

thuhole commented Jul 29, 2020

OK, sorry for my ambiguity. Firstly let me explain all these possible features:

  1. "webhook": When the bot detects something wrong, it triggers a custom HTTP request to notify another service. This other service can do anything, such as notifying the admin, applying automatic security measures, or sending email through the Mailgun API.
  2. "email notification": Just for notifying the admin.
  3. "custom messages": For example, show different types of error messages or error code on the status page. The current version of my bot would only show "Investigating..."
  4. "other type of services": My bot can only monitor HTTP GET services. I guess HTTP POST, TCP ping or even MySQL can be implemented too.
  5. "etc.": There are endless features that could be added to the bot.
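
As an illustration of item 4, a "TCP ping" style check can be very small. This is only a sketch under my own assumptions: the function name `tcp_check` and the 3-second default timeout are hypothetical, and it only tests whether a TCP connection can be opened, not whether the service behind the port is healthy.

```python
import socket


def tcp_check(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures.
        return False
```

A MySQL or HTTP POST check would follow the same shape: a function that returns True or False, so the rest of the bot does not need to know what kind of service it is probing.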

Next, about your so-called "inherent limitations": I don't think they should be called "limitations". Actually, this is an important reason why I love cState. cState's only job is to show people a beautiful website, and it does that job perfectly. The website is truly beautiful, free, and stable. Everything else should be handled by other services; it doesn't belong in the frontend.

That's why a monitor bot is needed. It may not be convenient for cState to add a "subscription" feature like the GitHub status page has, unless you do it on a separate website or use Custom Tabs.

My service sits somewhere between a "hobby project" and a "big service". It gets 2,000 unique visitors (UV) and 12,000 page views (PV) per day, so there is some urgency. With cState, Netlify and GitHub host my status page, and I can host my "own infrastructure" backend at home. Therefore nothing like the AWS S3 incident can happen, because downtime of my Raspberry Pi has no relation to the GitHub API, Netlify, or my website. These three services (status page, monitor bot, and my main website) are mutually independent.

I agree that any official feature should be added with extra care; that's why I don't think Python is right for this backend job. I'm not a professional either. Maybe we need a more experienced developer to build a more carefully designed bot.

@thuhole
Author

thuhole commented Jul 29, 2020

BTW, I've just noticed that cState can be subscribed via RSS, this also increases the importance of quick website updates.

@mistermantas
Member

I mean, in your case, this assumes that:

  • your electricity is solid or you have a UPS
  • your internet is just as rock-solid
  • and, as unlikely as this is, your equipment won't be stolen, burn down, etc.

I'm just saying I can't do anything here to help, somebody else has to take initiative.

If you know who would, my contacts are on my profile.

@thuhole
Author

thuhole commented Jul 30, 2020

The downtime of the monitor bot usually has no relation to the downtime of the main service.
For example, if the monitor bot has 99% uptime and your web service has 99.9% uptime, then the probability of both being down at the same time is 1% × 0.1% = 0.001%. That's the probability of the status page being wrong because an outage went undetected.
Usually this probability is lower than the probability that the GitHub API or Netlify is down.
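
The arithmetic above can be checked in a couple of lines. This is just a worked example; it assumes the two failures are statistically independent, which is the premise of the argument, and the function name is my own.

```python
def simultaneous_downtime(uptime_a, uptime_b):
    """Probability that two independent services are down at the same time."""
    return (1 - uptime_a) * (1 - uptime_b)


# Monitor bot at 99% uptime, web service at 99.9% uptime.
p = simultaneous_downtime(0.99, 0.999)
print(f"{p:.3%}")  # prints 0.001%
```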

@mistermantas
Member

I'll keep this open for a bit, but ultimately it's out of the scope of what I'm able to do

@Nevexo
Contributor

Nevexo commented Jul 30, 2020

Here's my take:

I believe having a status page go into investigating mode automatically is a good idea, as it allows the devops guys to get straight into resolving the issue.

With cState being primarily server-less, it can run on another platform which (unless you're Cloudflare) shouldn't be affected by your outage.

My personal opinion on an implementation of this, would be to extend an existing systems monitoring platform - Prometheus would be my choice. There are already a tonne of scrapers out there that can monitor not only the service, but the networks & servers powering them. For a simple website this isn't a big deal, but when you're maintaining network infrastructure, you want any outage to any part of it to be reported.

So using a platform like Prometheus solves the issue of data collection (polling one part of the site usually doesn't tell the whole story), but we still need to get the data to cState. Prometheus has exporters that are fed information about outages; a lightweight application written in, let's say, Go could publish this data via the cState content git repo. https://github.com/go-git/go-git, for example, would make this quite simple to do.

This also means the cState exporter doesn't need to have anything to do with other alert methods (webhooks, for example) - they can be handled by other exporters, AlertManager being the most common.

Another advantage to this approach is the rules you can set within Prometheus to set the severity of an outage in cState, for example, if one server in your stack is reporting a higher network latency, you may want to set the site to degraded for that specific region, and then, if you have a really advanced setup, you could set a critical alert if some piece of your network stack goes for a jolly, let's say a switch fails.

It will of course be important that your Prometheus server, scrapers and cState Exporter are away from the infrastructure you're monitoring, and this is quite an advanced way of doing it, which may make it a little harder for an admin of a small website to configure, so there's likely still place for a simple "hey if this server doesn't give me a 200, flag it." - but you'll need to have some form of delay (let's say, it fails to respond 3 times) before updating the status page, or a quick reload of NGINX turns into a "critical systems outage"
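
The "fails to respond 3 times" delay could look roughly like this. It is a sketch of one possible design, not something cState or any existing bot ships: the class name, the `"open"`/`"resolve"` return values, and the default threshold of 3 are all hypothetical.

```python
class FailureThreshold:
    """Flag an outage only after `threshold` consecutive failed probes."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0      # consecutive failures seen so far
        self.flagged = False   # True while an incident is open

    def record(self, ok):
        """Record one probe result; return 'open', 'resolve', or None."""
        if ok:
            self.failures = 0
            if self.flagged:
                self.flagged = False
                return "resolve"
            return None
        self.failures += 1
        if self.failures >= self.threshold and not self.flagged:
            self.flagged = True
            return "open"
        return None
```

With the default threshold, a single failed probe during an NGINX reload returns `None` and the counter resets on the next success, so nothing is published; only the third failure in a row returns `"open"`.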

So finally, I completely agree that outage reporting should be automated in cState, but it'll take a lot of planning to work out the best way to do it.

edit: we'd also have to think of a way to tell whatever's hosting cState to rebuild the static content, it's all good updating the git repo but it's not good if the site never updates.

@mistermantas
Member

edit: we'd also have to think of a way to tell whatever's hosting cState to rebuild the static content, it's all good updating the git repo but it's not good if the site never updates.

If you host with Netlify or GitLab Pages, that's done for you automatically (it's under CI/CD). You could roll your own infra. Either option is not 100% bulletproof. Well, tbh nothing is.

mistermantas removed the enhancement, good first issue, help wanted, and stuck / backlog labels Feb 24, 2021

3 participants