Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Youtube badge fails intermittently #8605

Closed
Jetpack-Cat opened this issue Nov 9, 2022 · 33 comments
Closed

Youtube badge fails intermittently #8605

Jetpack-Cat opened this issue Nov 9, 2022 · 33 comments
Labels
question Support questions, usage questions, unconfirmed bugs, discussions, ideas

Comments

@Jetpack-Cat
Copy link

Jetpack-Cat commented Nov 9, 2022

Are you experiencing an issue with...

shields.io

🐞 Description

My youtube badge suddenly says invalid, it appears youtube changed something. Making a new one still shows invalid. It may have something to do with the new youtube handles for channels. I believe it was working before I accepted the @handle youtube suggested

🔗 Link to the badge

You'll see it
https://www.curseforge.com/minecraft/modpacks/jetpack-cat

💡 Possible Solution

I believe youtube changed something and made this stop working, because nothing has changed on that page for months and suddenly it's invalid

@Jetpack-Cat Jetpack-Cat added the question Support questions, usage questions, unconfirmed bugs, discussions, ideas label Nov 9, 2022
@PaulaBarszcz
Copy link
Collaborator

I cannot see the yt badges on the link that you provided (https://www.curseforge.com/minecraft/modpacks/jetpack-cat), you probably removed them if they stopped working :P

Which badge(s) were you using for youtube, that are not working now?
image
https://shields.io/

@DenverCoder1
Copy link
Contributor

I cannot see the yt badges on the link that you provided

They have this badge still at that link

It was showing all grey and as "invalid" yesterday but it appears to be working now.

Same happened with the ones on my profile.

@chris48s
Copy link
Member

I wasn't able to catch this and see what the API response from youtibe was when it wasn't working. I'm going to close this for now on the assumption it was a transient issue but if it happens again, re-open. Cheers.

@chris48s chris48s closed this as not planned Won't fix, can't repro, duplicate, stale Nov 12, 2022
@DenverCoder1
Copy link
Contributor

It seems like the issue is back now.

@Jetpack-Cat
Copy link
Author

It happens intermittently.

@DenverCoder1
Copy link
Contributor

Yeah, seems to be working again now

@chris48s chris48s changed the title Youtube no longer works Youtube badge fails intermittently Dec 30, 2022
@chris48s chris48s reopened this Dec 30, 2022
@chris48s
Copy link
Member

#8758

@programmingwithalex
Copy link

I am encountering the same issue showing invalid. I get the invalid response on https://shields.io/category/social and on my GitHub profile page.

@kassbohm
Copy link

Same issue here. Using https://shields.io/category/social leads to invalid for any youtube video I tested...

@benderillo
Copy link

benderillo commented Feb 24, 2023

Youtube subscribers badge does not work tried with a couple of channels, always "Invalid"

Screen Shot 2023-02-24 at 2 09 31 PM

Here is one for example
https://img.shields.io/youtube/channel/subscribers/UCXXkKBqXrBN4AD2HBuGc9Rg?style=social

@DenverCoder1
Copy link
Contributor

It seems to work when running locally, but failing sometimes in production.

Could it possibly be a rate-limiting issue? Perhaps the quotas need to be increased?

@calebcartwright
Copy link
Member

This is almost certainly a recurrence of the rate limiting issues we've had in the past with YT. I believe our current quota is still 50k daily requests based on the last time we went through this (#6276 (comment)).

I suppose we could request another increase, though it was quite a pain last time and I'm not sure if we've the necessary bandwidth at the moment to handle a similar level of engagement with YT/Google. In the short term we might want to try increasing how long we cache these badges to hopefully alleviate some of the upstream api hits

@adamlui
Copy link

adamlui commented Mar 14, 2023

I think if logic were added to hide the balloon outright vs. showing 'invalid' when requests fail would make it look better in a cluster of shields (like the Twitter one)

image

@adamlui
Copy link

adamlui commented Mar 14, 2023

look better even alone*

@adamlui
Copy link

adamlui commented Jun 8, 2023

In @chris48s's attempt to alleviate this issue in bf1bea8, I proposed a great solution that @calebcartwright asked me to move any additional commentary elsewhere to a relevant issue, so I am replying here instead

The cache value seen/discussed here [in https://github.com/badges/shields/commit/bf1bea8b4a106227cd81a9c5ac88e0266cf9c2c2] is what we set on the response when we return the badge to the requestor, and that cache value is in turn used by the various downstream actors (e.g. your web browser, GitHub's Camo, CDN, etc.). When we increase that value, it results in fewer badge requests getting to our server, and accordingly fewer API requests our server makes to those upstream data sources.

This cache value simply being higher does not solve the problem of badges showing 'invalid' when 50k API requests from shields.io are sent upstream for the day.

My solution to this specific problem is, even if upstream response is invalid, just show '?' to make it less ugly, and mitigate hitting this daily upstream limit further by not just increasing downstream cache duration, but actually limiting (via hard code) a requestor's request frequency to once per 24h (for example, by adding a timestamp of last request, then comparing w/ Date.now(), then if delta is <24h, show the old cached response because numbers are still less ugly than '?' no matter how old`).

The downstream cache duration value simply being higher (while reducing upstream requests) does not control individual requestors' request frequency, which doesn't even need to be high to serve the purpose of this badge (once per 24h will still result in the same final count as 1000 times in the final 10 mins) so it's a waste to not control it if upstream requests can be limited further if done.

And combined with tracking total upstream requests (keep a count that resets at PST midnight so a condition can be added if (cnt > 50000) to show '?' or old number) and the 'invalid' issue will never occur again!

@calebcartwright

This comment was marked as off-topic.

@adamlui

This comment was marked as off-topic.

@adamlui

This comment was marked as off-topic.

@calebcartwright

This comment was marked as off-topic.

@adamlui

This comment was marked as off-topic.

@calebcartwright

This comment was marked as off-topic.

@adamlui

This comment was marked as off-topic.

@calebcartwright
Copy link
Member

Going to lock this for some period of time since it appears to be getting a tad charged and likely spamming those users who are subscribed to the issue in order to receive updates.

Will unlock it at some point down the road and as always we'll post any updates as and when they become available

@badges badges locked as too heated and limited conversation to collaborators Jun 8, 2023
@chris48s
Copy link
Member

chris48s commented Jun 9, 2023

I've been AFK for a couple of days. Coming back and seeing my notifications, I think it would be fair to summarise and say: There has been some activity in this issue thread and in bf1bea8

I acknowledge I've seen all of the notifications. I will read all this backlog and write some sort of reply, but not immediately.

In the meantime, I will say this: Lets calm down. It is a badge in a README. Nobody died.

@chris48s
Copy link
Member

chris48s commented Jun 9, 2023

First of all, everyone who works on shields.io is a volunteer doing this in their spare time. Nobody is getting paid for this, nobody is "on call". We've all got lives. Please respect that. Tbh, writing all this wasn't really what I wanted to spend my evening on. I had a bit of time free to work on the project tonight. My preference would have been to spend it on following up the latest review comments on #9014 (a long-awaited and high impact change for the project), but here we all are..

Reading over this, it seems like there are really 4 distinct topics to address.

  1. Can we do more to fix the youtube badge?
  2. Suggestions about keeping track of the number of requests we're making to upstream services or caching responses from upstream services
  3. Suggestions about changing the way we present error messages
  4. Has anyone violated our code of conduct?

Lets start off with

1. Can we do more to fix the youtube badge?

Yes. In the short term. I've just merged #9250 increasing the max age on the badges a bit more. I will deploy it shortly. Lets see if that gets us under 50k requests a day. Its not a permanent solution, but it is something I can do now that will hopefully improve things.

Next time Caleb and I have a voice call, next steps on this issue will be one of the things on the agenda.

2. Suggestions about keeping track of the number of requests we're making to upstream services or caching responses from upstream services

This is something that comes up from time to time.

Someone will suggest we do something like

keep a count that resets at PST midnight so a condition can be added if (cnt > 50000)

This seems simple on the surface, and it works just fine if your application runs on a single server/container.

Shields currently runs on a minimum of 14 containers (at the weekend) and maximum of 22 containers (during working hours for Europe/America on weekdays).
So to implement something like that, you need some kind of shared state (a database or cache that all the the containers and read from/write to) to store that count.

Shields.io has to serve a lot of traffic on a very tight budget (donations) with only volunteer labour to maintain it. Currently, one of the ways we achieve this is we try to avoid dealing with any shared state between instances and rely entirely on downstream cache (CDN) of the SVG responses to reduce traffic to upstream APIs. There are several reasons for this choice, including:

  • Scalability: As soon as you have any sort of shared DB or cache, you've got a bottleneck that limts your ability to scale by just adding compute resources. At the moment we can basically just keep scaling out by adding containers to the cluster.
  • Cost: Compute resources are relatively cheap. Managed storage resources are much more expensive. For downstream cache, we can benefit from free CDN services from CloudFlare.
  • Simplicity: Our current setup is about as low maintenance as we can possibly make it.

So people sometimes suggest we cache API responses upstream, or track our rate limit usage, and it can seem simple to suggest if (cnt > 50000) but actually in order to implement something like that, we (the core team) would need to take on the financial cost, maintenance overhead, and scalability tradeoffs of shared state in order to do it. So those are our constraints (in some ways self-imposed). Simultaneously, it is a tradeoff which has allowed a small group of volunteers to keep a really widely used service running for ~10 years.

All of that said:
I don't completely rule out ever adopting any more persistent state, but we do need to be careful, deliberate, and make sure it is something that is going to solve problems across many of our (~600) service integrations. It is not something we would lightly take on to solve one issue with one service (in this case, youtube).

3. Suggestions about changing the way we present error messages

The way we present errors is definitely something we think about. A good example where we recently considered this is #9159

In general:

  • We don't keep any logs, at all. This is partly because we don't track our users, and that extends to logging. It is also partly because at the scale we generate logs, the cost of having some upstream service ingest them is very high.
  • Unhandled errors are rare. We log the stacktraces to Sentry, but we strip the console and http context.
  • If we generate a handled error (by far the most common case), we don't really have a better place to expose information about what happened than on the badge itself. Even if the error is not necessarily actionable to the person who views it, at least someone can report it and (hopefully) tell us something that helps us work out what happened and fixed it.

So to work through an example, if a working badge says

build:passing or build:failing

our error badge might say build:unparseable json response

Would it be better to render build ?
Not really
Would it be better to render build:? ?
Maybe. Arguably we're at least not exposing an error that might not be actionable, but at least build:unparseable json response is more likely to result in us getting an actionable bug report if an issue is intermittent or difficult to reproduce locally.

Maybe that is not to everyone's taste, but that's the constraints we're working with (again, some are self-imposed) and the choices we've made when it comes to error messages.

In the case of the youtube badge, we don't render a particularly useful error message. One of the issues here is that if the issue is related to rate limit, I'd expect us to be getting a 429 response status code, which would cause us to render Youtube:rate limited by upstream service, but we're getting back.. something else that we don't know what to do with and rendering the most generic error message. This is one of the reasons I didn't originally identify this as a rate limit issue.

4. Has anyone violated our code of conduct?

I'm not one of the CoC enforcement people, but in this situation, I think I'm probably the best mediator immediately available.

Having read over the conversation, there is nothing here that I would describe as "harassment" or "abuse". It is a conversation that descended into what I would describe as "unprofessional squabbling".

@adamlui You care about this issue. We get it, but alo repeat posting and @mentioning on the same topic isn't helping or increasing our appetite/ability to spend time on this.

It is entirely legitimate for Caleb (one of the project's core team) to jump in and reply to a thread.

I appreciate you aren't getting the answers you want to hear, but I think you would have been wise to follow Caleb's advice when he suggested "It's probably best to take a day or two to reflect and cool down" rather than escalating. One of the things I've consciously done is slow down the pace of this whole conversation. Cooler heads usually prevail.

@calebcartwright I think your first post bf1bea8#commitcomment-117071843 explains things pretty well, but I do also see why @adamlui feels like you were being dismissive and talking about "constraints" and "context" without really explaining it in later posts.

I've taken the time myself to really lay everything out above. Hopefully this helps. Also I can see why you didn't. It took a long time. The conversation was already heated by that point.

Next Steps

I've locked the conversation on commit bf1bea8 - As Caleb (correctly) says, its not an ideal place to carry on the conversation.

I've left one post visible containing the points I've followed up on in this post. I have also hidden most of the posts on this issue thread from both authors. There is a lot here that is not adding much value in the grand scheme of things IMO.

I'm going to unlock this issue thread.

Lets keep any further conversation on this topic constructive. We'll lock it again if we can't.

I think (hope) there was value in explaining things a bit more clearly above, but I don't ideally want to continue exchanging giant walls of text on this. As I've said, it took a long time to read all of this, think about it, and type all that out. It wasn't really what I would have ideally chosen to spend an evening on.

Maybe one thing this can lead into is better documentation on some of this stuff.

@badges badges unlocked this conversation Jun 9, 2023
@adamlui

This comment was marked as off-topic.

@adamlui

This comment was marked as off-topic.

@adamlui

This comment was marked as off-topic.

@adamlui
Copy link

adamlui commented Jun 9, 2023

@chris48s also I am expert at reading people (I played poker professionally for decades) I am nearly certain you derived enjoyment (based on a specific comment you made, before your pc-mode essay) ;)

@calebcartwright
Copy link
Member

Locking this again

@badges badges locked as too heated and limited conversation to collaborators Jun 9, 2023
@PyvesB
Copy link
Member

PyvesB commented Jun 10, 2023

We're indeed reaching our daily API quota:
Errors

Said quota is currently set to 50000 requests per day (#6276), but we have been trying to render close to 100000 daily badges in recent weeks prior to the cache changes.

My past self helpfully left a link to the form (#5101 (comment)), so I simply went ahead and asked for a new quota increase to 200000. This should give us some comfortable headroom.

I'll keep you posted when I hear back from the Youtube team.

@chris48s
Copy link
Member

Wow. Thanks for appearing and taking that on! I had got as far as finding the form and realised I didn't know all of the information to complete it. Also my memory of this is that last time we requested an increased rate limit there were some issues raised in the application compliance audit. Hopefully it is smooth sailing this time 🚢

In the meantime, based on the grafana dashboards, I think the second PR to increase the maxAge has probably got us under 50k requests per day 🤞

@chris48s
Copy link
Member

I'm closing this issue off. Thanks to @PyvesB we have now negotiated a higher rate limit with Youtube and we're now running under it with some headroom, even having reduced the time we cache the badges for.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
question Support questions, usage questions, unconfirmed bugs, discussions, ideas
Projects
None yet
Development

No branches or pull requests

10 participants