Does CDNJS keep track of how many times individual files are pulled from its CDN? It would be absolutely awesome if CDNJS did this and then had up-to-date charts and data tables showing how many downloads each file is getting.
This would be epic because it would allow developers to choose which library versions they want their users downloading. A developer would obviously want a library version that is compatible with their code, but among compatible versions they could also pick the one that has been downloaded the most over the past X amount of time.
For example, let's say the latest jQuery just got released and put on CDNJS. A couple of days pass and the stats for jQuery look like this for the past week:
- jQuery 1.9.1 = 20,000 downloads
- jQuery 1.9.0 = 50,000 downloads
- jQuery 1.8.3 = 560,000 downloads
- jQuery 1.8.2 = 120,000 downloads
The developer can look at this and know that their visitors are more likely to already have jQuery 1.8.3 cached, as opposed to 1.9.1, since 1.9.1 is new. So as long as their code is 1.8.3-compatible, they would choose that one.
And since these numbers change over time, maybe a month later the developer comes back to CDNJS and sees that the 1.9.1 stats are now higher than 1.8.3's. So again, as long as their code is 1.9.1-compliant, they could safely switch their site to 1.9.1, since their visitors are now more likely to already have it cached.
Does this make sense? To me it would be EXTREMELY useful. The whole point of CDNJS is for developers to share libraries and resources. So over time, as more and more libraries (and more and more versions of those libraries) get added to CDNJS, a tool like this would be invaluable for helping developers make informed decisions based on which libraries and resources are being shared the most.
@Jakobud Great suggestion Jake. You're absolutely right that this would be really useful, and it is a popular request: #405
We're brainstorming solutions right now, so we're glad to have you as part of the conversation.
Closed old issue #405; continue the conversation here.
Tagged as high priority. Anyone have any brilliant ideas yet on how to parse a few billion lines?
How many lines is the typical log file? Do you split the log files up into one per day, or smaller? Do the log files simply say which http://path/file was downloaded? Or do they reference database row IDs (IDs of each filename, which I assume are stored in a database)?
If you could post excerpts of the log files, that would be a place to start.
Any progress on this? Do you guys need any help with it? I know there are probably a lot of huge log files, but I think it would only be a matter of a simple Python script that streamed in the log files and saved the data out to a database or something like that. It would be a long-running process, but it probably wouldn't be that complicated, really.
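To make that concrete, here's a rough sketch of what I mean. Since I haven't seen the real cdnjs logs, I'm assuming each hit line contains a request path like `/ajax/libs/jquery/1.8.3/jquery.min.js`; the regex and the SQLite schema here are just illustrations:

```python
import re
import sqlite3
from collections import Counter

# Assumed log-line shape (I haven't seen the real cdnjs logs): each hit
# contains a request path like /ajax/libs/jquery/1.8.3/jquery.min.js
PATH_RE = re.compile(r"/ajax/libs/(?P<lib>[^/]+)/(?P<version>[^/]+)/")

def count_downloads(lines):
    """Stream log lines and tally hits per (library, version) pair."""
    counts = Counter()
    for line in lines:
        m = PATH_RE.search(line)
        if m:
            counts[(m.group("lib"), m.group("version"))] += 1
    return counts

def save(counts, db_path="stats.db"):
    """Dump the tallies into SQLite so a website could query them later."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS downloads"
                " (lib TEXT, version TEXT, hits INTEGER)")
    con.executemany("INSERT INTO downloads VALUES (?, ?, ?)",
                    [(lib, ver, n) for (lib, ver), n in counts.items()])
    con.commit()
    con.close()
```

Something like `save(count_downloads(open("access.log")))` streams the file line by line, so memory stays bounded by the number of distinct library/version pairs, not by log size.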
FYI, I don't know if cdnjs utilizes AWS services on the backend or not, but this is an interesting article that is potentially very relevant to this issue:
It discusses using software called Fluentd to stream logfile changes into data storage. So for CDNJS, it could stream library access logs into some sort of usage database that could be used to display usage statistics.
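For illustration, a Fluentd pipeline along those lines might look roughly like this. This is a guess, not a tested config: the paths, tag, bucket name, and plugin options are all placeholders, and the S3 output requires the fluent-plugin-s3 plugin:

```
<source>
  @type tail
  path /var/log/nginx/access.log
  pos_file /var/run/fluentd/access.pos
  tag cdnjs.access
  format apache2
</source>

<match cdnjs.access>
  @type s3
  s3_bucket cdnjs-logs
  path raw/
</match>
```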
Also, FYI, you guys could get someone to help you with a solution for this if you could divulge details about your logging: how it works, where the files are stored, give us access to a day's or week's worth of logs, etc. Someone could figure out a solution for you.
Another suggestion for you guys, just make your logs public. Put them up on AWS S3 or something and allow anyone to grab them. I GUARANTEE someone (or multiple people probably) will come up with an analytics solution for you.
Just wanted to reach out regarding this issue again. I'll say it again, provide some example log files and someone somewhere will put together a parser for you that will pull library download stats.
Oh, we still don't have stats.
Creating an api service for cdnjs would be nice. Something like:
Then, we can use this service to fetch the stats in the cdnjs website. 🍀
Stats from the website are easy, but people want the stats from the CDN. I remember that Cloudflare didn't give us that info or access logs.
cc @thomasdavis @ryankirkman @terinjokes
Approximate stats would be nearly as good. If log volume is a problem, logs could be sampled.
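Sampling could be as simple as keeping one line in a hundred and multiplying the resulting counts by 100. A hypothetical sketch (the rate is arbitrary):

```python
import random

def sample_lines(lines, rate=0.01, seed=None):
    """Yield roughly `rate` of the input lines; scale counts by 1/rate later."""
    rng = random.Random(seed)
    for line in lines:
        if rng.random() < rate:
            yield line
```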
That is true! Even one day of traffic * 30 would be interesting enough.
Where are the logs now? Are they accessible in any form? I would think dumping daily logs on some S3 storage would be feasible and then someone could write something that parses them.
I would be excited to write a tool to parse the logs! I'm involved in some statistics & visualization projects anyway, so that would be awesome. 🎇
Like I said before, all CDNJS needs to do is make the logs accessible in some form, and someone will step up to write a cool parser to generate usage stats.
We are working on that now. The IP addresses in the logs are sensitive, so we should be careful.
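One idea for the IP problem: hash the client IP with a secret (ideally rotating) salt before the logs ever leave your servers, so published files keep per-visitor uniqueness without exposing addresses. A sketch, assuming the IP is the first whitespace-separated field as in Common Log Format:

```python
import hashlib

def anonymize(line, salt="rotate-this-secret-daily"):
    """Replace the leading IP with a salted hash; same IP -> same token."""
    ip, _, rest = line.partition(" ")
    token = hashlib.sha256((salt + ip).encode()).hexdigest()[:12]
    return token + " " + rest
```

Rotating the salt (say, daily) limits how long any one token can be correlated across published logs.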
Any update on this? Throwing my hat in the ring as another person who'd be willing to write a parser.
Hi all, I'm afraid not; there are some more important issues, but we will try our best to have this feature ASAP.
BTW, thanks to you guys who want to write a parser for us. If you don't mind, you can still contribute to other parts of cdnjs, like the Bower auto-updater or something. Thanks!
Any more updates on this one? It's been over 2 1/2 years. Have you guys just considered making your logs publicly accessible in some form?
Help us Help you!
Though, ping @thomasdavis @ryankirkman @terinjokes @drewfreyling ...
Hey, so I know that back on #405 the issue was money. The logs are in Common Log Format; however, to pull down the logs for 5 million hits it's $300 per day or something like that. (2 1/2 years later, you guys probably get WAY more than 5 million hits a day.)
So the solution thrown out there was to set up a parser on an EC2 instance. This would be the best solution. As long as your EC2 instance is in the same region as your S3 bucket, there is no cost to transfer your log files from S3 to your EC2 instance.
So essentially, the solution would be to have some sort of daily task that runs:
So this would be an absolutely minimal cost; you would only pay for the time the instance is active. Scheduling an EC2 instance to start every 24 hours shouldn't be too hard, and I'm pretty sure you can self-terminate an EC2 instance programmatically.
Just a thought. It honestly wouldn't be too terribly difficult to figure out...
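Since the logs are reportedly in Common Log Format, the parsing step of that daily task is pretty mechanical. A sketch (the field names are mine, and counting only 2xx/304 responses as real downloads is a judgment call):

```python
import re

# Standard Common Log Format fields: ip, identd, user, time, request, status, size.
CLF_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\S+)$'
)

def parse_clf(line):
    """Return a dict of CLF fields, or None for malformed lines."""
    m = CLF_RE.match(line)
    return m.groupdict() if m else None

def is_download(rec):
    """Count 2xx and 304 (cache revalidation) responses as downloads."""
    return rec is not None and (rec["status"].startswith("2") or rec["status"] == "304")
```

Counting 304s is debatable: they mean the visitor already had the file cached, which is arguably still a "use" of that version.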
Actually, an even better solution would be using AWS Data Pipeline
and AWS Elastic MapReduce.
Those tools are made to do exactly what you guys need to do: analyze data/logs in a cost-efficient manner.
@ryankirkman can we estimate the disk size we need per day? Maybe I can find the storage.
Are Cloudflare logs accessible to you in some form, downloadable or via an API or anything? Also, EC2 transfer pricing:
Data Transfer IN To Amazon EC2 From Internet $0.00 per GB
So I assume that means you could programmatically pull in Cloudflare logs and parse them, or do whatever, and it would still only cost you for the time the EC2 instance is active.