Suggestion - Post library download statistics/analytics numbers & graphs #1078

Open
Jakobud opened this Issue Mar 22, 2013 · 32 comments

9 participants

@Jakobud

Does CDNJS keep track of how many times individual files are pulled from the CDN? It would be absolutely awesome if CDNJS did this and then had up-to-date charts and data tables showing how many downloads each file is getting.

This would be epic because it would allow developers to choose which library versions they want their users downloading. The developer would want to pick a version that is compatible with their code, but ideally also the one that has been downloaded the most over the past X amount of time.

For example, let's say the latest jQuery has just been released and put on CDNJS. A couple of days pass and the stats for jQuery look like this for the past week:

jQuery 1.9.1 = 20,000 downloads
jQuery 1.9.0 = 50,000 downloads
jQuery 1.8.3 = 560,000 downloads
jQuery 1.8.2 = 120,000 downloads
etc...

The developer can look at this and know that their visitors are more likely to already have jQuery 1.8.3 cached as opposed to 1.9.1, since 1.9.1 is new. So as long as their code is 1.8.3-compatible, they would choose that one.

And since these numbers change over time, maybe a month later the developer comes back to CDNJS and sees that the 1.9.1 stats are now higher than 1.8.3's. So again, as long as their code is 1.9.1-compatible, they could safely switch their site to 1.9.1, since their visitors are now more likely to already have it cached.

Does this make sense? To me it would be EXTREMELY useful. The whole point of CDNJS is for developers to share libraries and resources. So over time, as more and more libraries and versions are added to CDNJS, a tool like this would be invaluable for letting developers make informed decisions based on which libraries and resources are being shared the most.

@ryankirkman
cdnjs member

@Jakobud Great suggestion, Jake. You're absolutely right that this would be really useful, and it is a popular request: #405

We're brainstorming solutions right now, so we're glad to have you as part of the conversation.

@Lockyc

Closed old issue #405; continue the conversation here.

@thomasdavis
cdnjs member

Tagged as high priority. Anyone have any brilliant ideas yet on how to parse a few billion lines?

@Jakobud

How many lines is the typical log file? Do you split the log files up, one per day or smaller? Do the log files simply say which http://path/file was downloaded? Or do they have references to database row IDs (IDs for each filename, which I assume are stored in a database)?

@ryankirkman
cdnjs member
@Jakobud

If you could post excerpts of the log files, that would be a place to start.

@Jakobud

Any progress on this? You guys need any help with it? I know there are probably a lot of huge log files, but I think it would only be a matter of a simple Python script that streamed in the log files and saved the data out to a database or something like that. It would be a long-running process, but it probably wouldn't be that complicated, really.
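
For example, a minimal sketch of that kind of streaming parser (the file name and line format here are made up, since we haven't seen the real logs):

```python
import gzip
import re
from collections import Counter

# Hypothetical log file and request pattern -- the real format depends on
# whatever CDNJS/Cloudflare actually writes out.
LOG_FILE = "access-2013-03-22.log.gz"
PATH_RE = re.compile(r'"GET (/ajax/libs/\S+) HTTP')

counts = Counter()

# Stream the file line by line so we never hold the whole log in memory.
with gzip.open(LOG_FILE, "rt") as f:
    for line in f:
        m = PATH_RE.search(line)
        if m:
            counts[m.group(1)] += 1

# Print per-file download counts; in practice this would go to a database.
for path, n in counts.most_common(20):
    print(n, path)
```

Run that once per log file and accumulate the counters into whatever storage you like.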

@Jakobud

FYI, I don't know if cdnjs utilizes AWS services on the backend or not, but this is an interesting article that is potentially very relevant to this issue:

http://aws.amazon.com/blogs/aws/all-your-data-fluentd/

It discusses using software called Fluentd to stream log-file changes into data storage. So for CDNJS, it could stream library access logs into some sort of usage database that could then be used to display usage statistics.
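
For what it's worth, Fluentd's Python client is tiny. A sketch of forwarding one parsed hit per request (the tag, fields, and values here are placeholders, and it assumes a local Fluentd agent listening on the default port 24224):

```python
from fluent import sender  # pip install fluent-logger

# Hypothetical setup: a local Fluentd agent forwards events into
# whatever usage database CDNJS ends up choosing.
logger = sender.FluentSender("cdnjs", host="localhost", port=24224)

def record_hit(path, status, bytes_sent):
    # Each emitted event becomes one row/document in the storage backend.
    ok = logger.emit("access", {
        "path": path,
        "status": status,
        "bytes": bytes_sent,
    })
    if not ok:
        print(logger.last_error)  # emit() returns False on failure

record_hit("/ajax/libs/jquery/1.9.1/jquery.min.js", 200, 93064)
```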

@Jakobud

Also, FYI, you guys could get someone to help you with a solution for this if you could divulge details about your logging: how it works, where the files are stored, give us access to a day's or week's worth of logs, etc. Someone could figure out a solution for you.

@Jakobud

Another suggestion for you guys: just make your logs public. Put them up on AWS S3 or something and allow anyone to grab them. I GUARANTEE someone (probably multiple people) will come up with an analytics solution for you.

@Jakobud

Just wanted to reach out regarding this issue again. I'll say it again: provide some example log files, and someone somewhere will put together a parser for you that pulls library download stats.

@PeterDaveHello
cdnjs member
@andytruong

Oh, we still don't have stats.

@IonicaBizau
cdnjs member

Creating an API service for cdnjs would be nice. Something like:

api.cdnjs.com/lib/jquery/stats

Then we could use this service to fetch the stats on the cdnjs website. 🍀
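
To illustrate, a minimal sketch of such an endpoint (Flask is used here purely as an example; the route shape, field names, and the stats store are all hypothetical):

```python
from flask import Flask, jsonify  # pip install flask

app = Flask(__name__)

# Placeholder data -- in reality this would come from the parsed-log database.
STATS = {
    "jquery": {"1.9.1": 20000, "1.9.0": 50000, "1.8.3": 560000},
}

@app.route("/lib/<name>/stats")
def lib_stats(name):
    stats = STATS.get(name)
    if stats is None:
        return jsonify(error="unknown library"), 404
    return jsonify(library=name, downloads=stats)

if __name__ == "__main__":
    app.run()  # e.g. GET /lib/jquery/stats
```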

@PeterDaveHello
cdnjs member

Stats from the website are easy, but people want the stats from the CDN. I remember that Cloudflare didn't give us that info or access logs.

cc @thomasdavis @ryankirkman @terinjokes

@ryankirkman
cdnjs member
@davidbau

Approximate stats would be nearly as good. If log volume is a problem, logs could be sampled.
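
For example, keeping roughly 1 line in every 1,000 and scaling the counts back up would cut the parsing work by three orders of magnitude. A sketch, assuming Common Log Format lines (the sample rate is arbitrary):

```python
import random
from collections import Counter

SAMPLE_RATE = 1000  # keep roughly 1 line in 1000; tune to taste

counts = Counter()
with open("access.log") as f:
    for line in f:
        if random.randrange(SAMPLE_RATE) == 0:
            parts = line.split()
            if len(parts) > 6:
                counts[parts[6]] += 1  # request path in Common Log Format

# Scale the sampled counts back up to approximate the true totals.
estimates = {path: n * SAMPLE_RATE for path, n in counts.items()}
```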

@thomasdavis
cdnjs member

That is true! Even one day of traffic * 30 would be interesting enough.

@Jakobud

Where are the logs now? Are they accessible in any form? I would think dumping daily logs onto some S3 storage would be feasible, and then someone could write something that parses them.

@IonicaBizau
cdnjs member

I would be excited to write a tool to parse the logs! I'm already involved in some statistics & visualization projects, so that would be awesome. 🎇

@Jakobud

Like I said before, all CDNJS needs to do is make the logs accessible in some form, and someone will step up to write a cool parser to generate usage stats.

@PeterDaveHello
cdnjs member

We are working on it now. The IP addresses in the logs will be sensitive, so we should be careful.

@fj
fj commented Jul 13, 2015

Any update on this? Throwing my hat in the ring as another person who'd be willing to write a parser.

@PeterDaveHello
cdnjs member

Hey all, I'm afraid not. There are some more important issues, but we'll try our best to ship this feature ASAP.

@PeterDaveHello
cdnjs member

BTW, thanks to those of you who want to write a parser for us. If you don't mind, you can still contribute to other parts of cdnjs in the meantime, like the Bower auto-updater or something. Thanks!

@Jakobud

Any more updates on this one? It's been over 2½ years. Have you guys considered just making your logs publicly accessible in some form?

Help us help you!

@Jakobud

Hey, so I know that back on #405 the issue was money. The logs are in Common Log Format; however, pulling down the logs for 5 million hits costs $300 per day or something like that. (2½ years later, you guys probably get WAY more than 5 million hits a day.)

So the solution thrown out there was to set up a parser on an EC2 instance. This would be the best solution: as long as your EC2 instance is in the same region as your S3 bucket, there is no cost to transfer your log files from S3 to your EC2 instance.

So essentially, the solution would be to have some sort of daily task that happens (see the sketch after this list):

  1. EC2 instance starts up
  2. Script pulls logs for last 24 hours from S3 container
  3. Script parses logs
  4. Script deletes local log
  5. Script dumps the data in whatever form you want into some database somewhere
  6. Script terminates EC2 instance
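
Roughly, in Python with boto3 (just a sketch: the bucket name, prefix, and the parse/store step are placeholders, and it assumes the instance has IAM permission to read the bucket and terminate itself):

```python
import datetime
import os
import urllib.request

import boto3  # pip install boto3

# Hypothetical bucket/prefix layout -- adjust to wherever the logs land.
BUCKET = "cdnjs-logs"
PREFIX = "daily/" + datetime.date.today().isoformat()

def parse_and_store(path):
    # Placeholder for steps 3 and 5: parse the log file and dump the
    # counts into whatever database you want.
    pass

s3 = boto3.client("s3")

# Step 2: pull the last 24 hours of logs from S3 (free within one region).
resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in resp.get("Contents", []):
    local = os.path.basename(obj["Key"])
    s3.download_file(BUCKET, obj["Key"], local)
    parse_and_store(local)   # steps 3 and 5
    os.remove(local)         # step 4: delete the local copy

# Step 6: self-terminate. The instance can discover its own ID via the
# EC2 instance metadata endpoint and shut itself down.
instance_id = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id").read().decode()
boto3.client("ec2").terminate_instances(InstanceIds=[instance_id])
```

Schedule the instance start however you like, and it only bills for the minutes it actually runs.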

So the cost would be absolutely minimal: you would only pay for the time the instance is active. Scheduling an EC2 instance to start every 24 hours shouldn't be too hard, and I'm pretty sure you can self-terminate an EC2 instance programmatically.

Just a thought. It honestly wouldn't be too terribly difficult to figure out...

@Jakobud

Actually, an even better solution would be to use AWS Data Pipeline:

http://aws.amazon.com/documentation/data-pipeline/

And AWS Elastic MapReduce:

https://aws.amazon.com/elasticmapreduce/

Those tools are made to do exactly what you guys need: analyze data/logs in a cost-efficient manner.

@ryankirkman
cdnjs member
@PeterDaveHello
cdnjs member

@ryankirkman Can we estimate the disk space we need per day? Maybe I can find the storage.

@Jakobud

Are Cloudflare logs accessible to you in some form, downloadable or via an API or anything? Also, EC2 transfer pricing:

Data Transfer IN To Amazon EC2 From Internet $0.00 per GB

https://aws.amazon.com/ec2/pricing/

So I assume that means you could programmatically pull in Cloudflare logs and parse them (or do whatever), and it would still only cost you for the time the EC2 instance is active.
