Badges are unavailable (for GitHub) #1377

Closed
AlexWayfer opened this issue Dec 18, 2017 · 40 comments

10 participants
@salaros

Contributor

commented Dec 18, 2017

Same for all my shields.io-driven badges - error 503 (timeout)

@janosrusiczki


commented Dec 18, 2017

Just my luck, I switched to shields.io badges a few days ago...

@platan

Member

commented Dec 18, 2017

Problem occurs on all servers and applies to dynamic and static badges. Response times increased. https://status.shields-server.com/
screenshot-2017-12-18 static s0 - shields io status

@RedSparr0w

Contributor

commented Dec 18, 2017

I'm getting 521 Web Server Is Down status codes for badges when I visit shields.io

Looks like the server is going down every few minutes
image

@GBH


commented Dec 18, 2017

Is no-one there to give that server a kick?

@platan

Member

commented Dec 18, 2017

As far as I know only @espadrine has access to servers.

@RedSparr0w

Contributor

commented Dec 18, 2017

Unfortunately not, I think @espadrine is the only person with access currently.

@espadrine

Member

commented Dec 18, 2017

Here's a tweet I sent: https://twitter.com/Shields_io/status/942763063270412288

I tried mitigating the issue in a few ways, including passing CloudFlare in danger mode and rate-limiting per IP. I am gathering information to see how to best mitigate the issue.

@espadrine

Member

commented Dec 18, 2017

I think things are better now. The volume of requests we receive is still very high, but the rate limiting protects legitimate users.

Also, I believe the DoS has stopped, about an hour after I started activating rate limiting.
We are back to a 200 req/s average, from about 1000 that we had during the day.

(It is difficult to determine exactly who was the bad actor, because unsurprisingly most serious offenders are AWS servers, and some of those servers are legit. There are other smaller offenders, like the Moscow Youth Autonomous Non-Commercial Organisation Home Computer Network, but they aren't as big.)

I had to whitelist GitHub because otherwise it got all of its IPs banned one by one.
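
For illustration only, per-IP rate limiting with an allowlist could look roughly like the sketch below. This is not the code that ran on the Shields servers: the class name `IpRateLimiter`, the fixed-window strategy, and exact-match IP comparison are all assumptions (a real allowlist for a provider such as GitHub would more likely match address ranges).

```javascript
// Hypothetical sketch: fixed-window rate limiting per client IP,
// with an allowlist of addresses that are never limited.
class IpRateLimiter {
  constructor({ maxHits, windowMs, allowlist = [] }) {
    this.maxHits = maxHits; // requests allowed per window
    this.windowMs = windowMs; // window length in milliseconds
    this.allowlist = new Set(allowlist);
    this.windows = new Map(); // ip -> { start, hits }
  }

  // Returns true when the request should be served, false when it is over the limit.
  allow(ip, now = Date.now()) {
    if (this.allowlist.has(ip)) return true;

    let win = this.windows.get(ip);
    if (!win || now - win.start >= this.windowMs) {
      // First request from this IP, or the previous window has expired.
      win = { start: now, hits: 0 };
      this.windows.set(ip, win);
    }
    win.hits += 1;
    return win.hits <= this.maxHits;
  }
}
```

Without the allowlist, a single address that legitimately relays many requests would hit the limit almost immediately, which is the failure mode described above.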

image

@espadrine

Member

commented Dec 19, 2017

I relaxed the rate limit from 50 req/s to 100 req/s this morning, and the load shot from 200 req/s to 400 req/s. The servers can handle up to about 500 req/s, as the graph shows, so my guess is that whoever changed something on Monday did not manually turn it off; they simply respect the Retry-After header, which I indeed pushed to production at about the same time as the issues stopped.

A side-effect of the current rate limit is that the front page won't fully load (your IP will be banned after the first 100 badges show up on the page, for the remainder of the minute). I will experiment with whether we can afford to have an hourly limit instead, or whether we need a combination of both, at the end of my workday.
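
As a rough sketch of how a Retry-After header can tell a well-behaved client when to come back, the server side can report the seconds left in the current window and the client side can read that value. The 429 status code and both function names are assumptions; the thread only states that a Retry-After header was deployed.

```javascript
// Server side (assumption: a limited client gets 429 plus Retry-After,
// set to the whole seconds remaining in the current rate-limit window).
function rateLimitResponse(windowStartMs, windowMs, nowMs) {
  const remainingMs = Math.max(0, windowStartMs + windowMs - nowMs);
  return {
    statusCode: 429,
    headers: { "Retry-After": String(Math.ceil(remainingMs / 1000)) },
  };
}

// Client side: read the delay-seconds form of Retry-After.
// Returns null for anything else (for example the HTTP-date form).
function parseRetryAfterSeconds(value) {
  return /^\d+$/.test(value) ? Number(value) : null;
}
```

A client that waits for the parsed number of seconds before retrying stops adding load for the rest of the window, which would produce the kind of abrupt drop in traffic described above.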

@AlexWayfer

Author

commented Dec 19, 2017

Again 😱

image

@espadrine

Member

commented Dec 19, 2017

Yep, I saw that. It is strange, because it is such a sudden increase.

After looking at the URLs associated with the AWS IPs I flag, I feel like it does not seem related to one given IP address, which probably means it is a large website that started automatically adding badges to its pages. The additional load definitely comes from the US, however.

So, instead of flagging IPs, I decided to flag badge types. I limited badge types similarly (with progressive tweaking); right now it is at a max of 300 hits every 600 seconds. Here is an example list of flagged badge types:

[
  "npmv",
  "npmdm",
  "githubstars",
  "wordpressplugin",
  "githubrelease",
  "githubforks",
  "githubissues",
  "wordpressv",
  "pypiv",
  "codecovc",
  "nugetv",
  "appveyorci",
  "gitterroom",
  "githublicense",
  "npml",
  "npmdt",
  "badgehttp2",
  "badgeipv6",
  "twitterfollow",
  "githubdownloads"
]

I don't know which one is suddenly more popular than it should be, but the badgehttp2 and the badgeipv6 are certainly surprising to me.
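
A minimal sketch of limiting by badge type rather than by IP follows. The `badgeTypeOf` derivation is a guess that happens to reproduce names like those in the list (for example `npmv` and `badgehttp2`); it is not confirmed by the thread, and the example paths, class name, and single shared window are likewise assumptions.

```javascript
// Hypothetical: derive a badge type from the request path by joining the
// first path segments, splitting on "/" and "-".
function badgeTypeOf(urlPath) {
  return urlPath.split(/[/-]/).slice(0, 3).join("");
}

// Counts hits per key inside one shared fixed window.
class KeyedWindowCounter {
  constructor(maxHits, windowMs) {
    this.maxHits = maxHits;
    this.windowMs = windowMs;
    this.windowStart = null;
    this.counts = new Map(); // key -> hits in the current window
  }

  // Records a hit; returns true while the key is still within its limit.
  hit(key, now = Date.now()) {
    if (this.windowStart === null || now - this.windowStart >= this.windowMs) {
      this.windowStart = now;
      this.counts.clear();
    }
    const hits = (this.counts.get(key) || 0) + 1;
    this.counts.set(key, hits);
    return hits <= this.maxHits;
  }

  // Keys that have exceeded the limit in the current window.
  flagged() {
    return [...this.counts]
      .filter(([, hits]) => hits > this.maxHits)
      .map(([key]) => key);
  }
}
```

With the figures quoted above, the counter would be created as `new KeyedWindowCounter(300, 600 * 1000)` and fed `badgeTypeOf(path)` for each request.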

Whatever is hitting our servers seems to have stopped again:

image

The start and stop are as sharp as they were yesterday, but at different hours. It is very puzzling to me.

We will probably need more investigation to survive tomorrow, when they start hitting us again.

@AlexWayfer

Author

commented Dec 19, 2017

@espadrine, thank you! Good luck!

Shields.io is a good service, and it's very sad that someone started to harm it.

@paulmelnikow

Member

commented Dec 19, 2017

Thanks for your work on this!

> After looking at the URLs associated with the AWS IPs I flag, I feel like it does not seem related to one given IP address, which probably means it is a large website that started automatically adding badges to its pages. The additional load definitely comes from the US, however.

Is there any useful info in the Referer header?

@paulmelnikow

Member

commented Dec 19, 2017

To everyone: Shields gets by with a tiny server and hosting budget. Most of the time this works okay but sometimes things like this happen! Your $10 goes a long way in helping us strengthen and toughen the service.

If you ❤️ Shields, please consider becoming a backer with a one-time $10 donation.

@kbrandwijk


commented Dec 19, 2017

@paulmelnikow I'd love to contribute, and I will. But can anyone share a rough estimate of the cost involved to keep this running? Also, did I see correctly you are using a VPS? Would something more scalable in the cloud not be a more affordable solution?

@espadrine

Member

commented Dec 19, 2017

@kbrandwijk The server costs are about $17/month with 3 servers; they'd be about $23/month if I added another server.

Cloud providers are a double-edged sword; they definitely can be cheap (although it's hard to beat a VPS given our requirements), but costs are hard to assess and can explode without us noticing.

There is another avenue that I try to explore: optimizing the code. Last time I optimized it, I switched the bottleneck to become text width computation, which typically hits 15-20ms. There is quite a bit of caching above it, but it is still the bottleneck, and intuitively there is no reason it cannot be cut down with a smarter algorithm.

@kbrandwijk


commented Dec 19, 2017

What's your current bandwidth? I've had very good experiences with the global CDN + load balancing deployment that Zeit Now offers. Also, I'll have a look at the source code later on. Do you have any detailed metrics in the tests already?

@AlexWayfer

Author

commented Dec 19, 2017

> If you ❤️ Shields, please consider becoming a backer with a one-time $10 donation.

I would like to make $1–3 donations every month. Please, consider this possibility (via Patreon, for example).

@kbrandwijk


commented Dec 19, 2017

@AlexWayfer It's possible via OpenCollective as well, shields just needs to define other options (Most projects have a $2/month option).

@paulmelnikow

Member

commented Dec 19, 2017

@kbrandwijk Thanks so much for your donation!

@AlexWayfer That would be great! You can choose monthly, and enter the amount you'd like on this page: https://opencollective.com/shields/donate

@kbrandwijk


commented Dec 19, 2017

@paulmelnikow The donate page has a minimum of $10. That's why I mentioned adding some options yourselves.

@platan

Member

commented Dec 19, 2017

I'm not able to set less than $10 using the stepper arrows, but I can type something less than $10. I don't know whether it's possible to make a donation with such an amount, though.

@RedSparr0w

Contributor

commented Dec 19, 2017

Don't think it allows less than $10 even if you have typed less as it still says $10 down the bottom:
image
Alternatively you could do $12 yearly for essentially $1 monthly

@paulmelnikow

Member

commented Dec 19, 2017

Ah, I gotcha. Sure! I added a $3/month option.

@espadrine

Member

commented Dec 20, 2017

I started rate-limiting by referrer, as suggested by @paulmelnikow. Obviously, we are not currently being hit by whatever was pressuring us, and we are very clearly in the safe zone (until tomorrow morning?), but there is one notable (albeit small) referrer that keeps getting temporarily banned.

This Chrome extension seems to open a tab on a given URL, and that URL contains a handful of badges. I reached out to the author in an issue.
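
To illustrate limiting by referrer, the sketch below turns the Referer request header into a key that a counter could group requests by. The function name and the placeholder keys `(none)` and `(invalid)` are hypothetical; the thread does not say how referrers were actually grouped.

```javascript
// Hypothetical: map a request's Referer header to a rate-limit key.
// Node.js lower-cases incoming header names, hence "referer".
function referrerKey(headers) {
  const raw = headers["referer"];
  if (!raw) return "(none)";
  try {
    // Group by host so every page of one site shares a single budget.
    return new URL(raw).hostname;
  } catch (err) {
    // new URL() throws a TypeError on unparseable input.
    return "(invalid)";
  }
}
```

Grouping by host rather than by full URL is a design choice: it catches a site that embeds badges on many different pages, at the cost of treating all of that site's pages alike.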

@AlexWayfer

Author

commented Dec 20, 2017

> Ah, I gotcha. Sure! I added a $3/month option.

@paulmelnikow, are you sure? Input has min="10" even for monthly option.

image

@kbrandwijk


commented Dec 20, 2017

@AlexWayfer

Author

commented Dec 20, 2017

> @AlexWayfer it's here: https://opencollective.com/shields/

Oh, sorry. Thank you!

@espadrine

Member

commented Dec 20, 2017

Small update: today we are seemingly not hit by the DDoS.

image

Here is the monthly look:

image

@PyvesB

Member

commented Dec 20, 2017

In terms of performance for text width computation, #1298 adds a cache layer that divides by two the number of width calculations on average for each generated badge. We should deploy it as soon as possible if we want to benefit from a small performance boost. I'll also look into #1379 in the coming days/weeks to see if things can be further improved. 😉

@kbrandwijk


commented Dec 20, 2017

Also, I noticed that only the text-width calculation for the left side is cached. Is that deliberate? Also, since the cache item size is so small, I would consider setting the max to a lot more than 1000.

@PyvesB

Member

commented Dec 20, 2017

@kbrandwijk : I'm not convinced a bigger cache size would change much. The left hand-side doesn't feature that many different values when you look at shields' homepage (probably around 100 different possibilities). That leaves a big margin for custom badges with unique new left hand-side keys, which probably represent a smaller number of users anyway. ^^
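
The idea under discussion, caching only the left-hand text width in a bounded cache, can be sketched as follows. This is not the code from #1298: `LruCache`, `badgeWidths`, and the stand-in `measureTextWidth` (which fakes the expensive measurement and counts its calls) are hypothetical names for illustration; only the limit of 1000 entries comes from the thread.

```javascript
// Minimal LRU cache built on Map, which iterates in insertion order.
class LruCache {
  constructor(max) {
    this.max = max;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    // Re-insert so the key becomes the most recently used.
    this.map.delete(key);
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // The first key in iteration order is the least recently used.
      this.map.delete(this.map.keys().next().value);
    }
  }
}

// Stand-in for the expensive width measurement; counts how often it runs.
let measureCalls = 0;
function measureTextWidth(text) {
  measureCalls += 1;
  return text.length * 7; // fake width, not a real font metric
}

const leftWidths = new LruCache(1000);

// Left text is cached (few distinct values); right text is always measured.
function badgeWidths(leftText, rightText) {
  let left = leftWidths.get(leftText);
  if (left === undefined) {
    left = measureTextWidth(leftText);
    leftWidths.set(leftText, left);
  }
  return { left, right: measureTextWidth(rightText) };
}
```

When the same left text keeps recurring, each further badge costs one measurement instead of two, which is where the "divides by two" figure comes from.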

@kbrandwijk


commented Dec 20, 2017

@PyvesB I have no idea why I proposed that. Your explanation makes complete sense...

@espadrine

Member

commented Dec 20, 2017

> #1298 adds a cache layer

pdfkit added a word cache at the start of the year, which is in v0.8.2. We are currently on v0.8.3 according to the package-lock.json. I would expect those caches to overlap; did you notice a significant speedup on average across 10k requests?

@PyvesB

Member

commented Dec 20, 2017

@espadrine : I did not realise they had their own caching system, I was not expecting that from such a library. I did some quick testing at the time of the pull request and I did notice close to a 10% speedup when repeatedly calling makeBadge (averaged over well above 10k iterations). Nevertheless, this was done in an environment very different to what shields is running on (different operating system, different Node version and old laptop), so it would probably benefit from a closer look if you think there may be overlapping. 😉

@PyvesB

Member

commented Dec 20, 2017

I haven't actually looked in the pdfkit caching, but one issue with letting the library do the caching is that it won't discriminate between left hand-sides which are very static and only have a small number of possibilities, and right hand-sides which are very variable. In my small test, I did use random strings for the right hand-side texts, which may explain why adding this extra layer of cache helped as it only caches what we know is likely to be requested again soon.

@paulmelnikow

Member

commented Dec 21, 2017

Let's continue the optimization discussion here!
#1379 (comment)

@AlexWayfer

Author

commented Jan 16, 2018

Badges periodically and randomly don't load :(

I saw this on January 11:

image

And today (January 16):

image


And everything is normal now. Strange things.


And again:

image

@RedSparr0w

Contributor

commented Jan 16, 2018

I have been noticing the same thing with a lot of badges, but not with any badges in particular.
My best guess would be something to do with the rate limiting.
I have also seen quite a few lately.

Looks like the server has been failing a lot more often the past couple days also
image
(although s1 has had 100% uptime the past 5 days)
