Providing public stats for bitcoin.org #605
Comments
saivann referenced this issue on Oct 9, 2014: Analytics Need to Be Gathered / Data-Driven Decision Making #385 (closed)
|
Thanks for working on this, @saivann!
I disagree with this statement. The script will be processing arbitrary data sent by users---the same as a CGI script---with the only difference being that this data is read from a log file rather than from the regular server interface. Just at a glance, the RE in [...]
I wonder if it might be more prudent to get a $10/month Linode or other VPS, rsync the logs to it, run the stats script there, and then rsync the HTML results back? (Alternatively, a much bigger change would be to move the Bitcoin Core binaries to a secure server dedicated to hosting just them. That way Bitcoin.org website security wouldn't be quite so critical.)
October 13th, maybe? I'll try to spend some more time reviewing the script in the next couple days. Based on my quick review, it looks good. Thanks again! |
|
Thanks for your comments and review!
...I think that's a great idea! Actually, the Jekyll build process could go there too, so the server wouldn't need to run any script anymore, effectively only serving binaries and static files. |
|
The website is now built on a separate cloud server, and building stats with this setup will work just fine. The server hosting bitcoin.org now serves only static files. |
|
Great work on this @saivann - I really appreciate you taking the initiative on this and seeing it through. Apologies that I have not been available to help. /bows. |
|
Are pages that yielded an error (e.g., 404) included in the per-page stats? |
|
@luke-jr error.log isn't processed, and only page requests with a 200 HTTP status code are processed (stats.rb line 569). |
|
I created a preview of the stats pages using NASA's public logs from July and August 1995. That should allow y'all to see what the pages will look like without risk of leaking any private Bitcoin.org data. |
|
@saivann This is really great work that'll give us all a lot of insight, thanks! Do you think it would make sense to aggregate 304/Not Modified status codes in with the 200s? |
|
@gurnec Thanks! That's a good question. bitcoin.org uses no caching for HTML pages, so my first bet is that we probably don't have many of these, but now that you mention it, I guess it would make sense to include them. 206 status codes, however, would probably inflate download stats and not reflect reality. @harding Very cool :) |
|
@Coderwill No problem, thanks for the time you have spent on this and other projects! |
|
Note: I finished reviewing both the stats code and the output from running it on logs from several sites, and I see no problems or privacy leaks. I look forward to seeing the live preview tomorrow. |
|
@harding Thanks!! In the absence of critical feedback, stats will be published tomorrow, on October 13th. |
|
Sounds good to me. |
jgarzik commented on Oct 13, 2014
|
ACK |
|
Stats are now live at https://bitcoin.org/stats/ and should be updated on a daily basis. I hope this data will prove useful. Suspicious traffic (e.g. bots, or what looks like a DDoS attack in 2014-03) could perhaps be filtered out at some point, should there be reasonably accurate and efficient ways to do so. |
|
After additional testing, it appears that anonymizing IPs reduces the unique visitor count by ~25%, which significantly affects accuracy. Mentioning it here in case anyone has a good idea to tackle this problem. |
|
@saivann If I read stats.rb correctly, it considers a unique visitor to be an IP address seen within a particular period of time (month|year|all time). This isn't wrong, but I think most stats programs consider a unique visitor to be an IP address seen within a particular day aggregated to the period length. I.e. "unique visitors in September" means "unique visitors on Sept 1st plus unique visitors on Sept 2nd plus unique visitors on Sept 3rd plus...". If you're willing to use this definition of unique visitors, you can process the logs one full day at a time. The first time you see an IP address for a particular day, you can assign it a random ID (storing the mapping from IP-to-ID in a simple associative array). Each entry for that IP address gets written to the database/modified logs with its randomly-assigned ID rather than the partly-obfuscated IP addresses. For example, this log:
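(hypothetical lines in combined log format; the IPs, paths and timestamps are invented purely for illustration)

```
203.0.113.7 - - [13/Oct/2014:09:12:01 +0000] "GET /en/download HTTP/1.1" 200 5123 "-" "Mozilla/5.0 ..."
203.0.113.7 - - [13/Oct/2014:09:12:05 +0000] "GET /en/choose-your-wallet HTTP/1.1" 200 8456 "-" "Mozilla/5.0 ..."
198.51.100.23 - - [13/Oct/2014:09:13:44 +0000] "GET /en/download HTTP/1.1" 200 5123 "-" "Mozilla/5.0 ..."
```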
Gets written to the database as:
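(the same hypothetical lines, with each IP replaced by its randomly-assigned ID)

```
visitor-1 - - [13/Oct/2014:09:12:01 +0000] "GET /en/download HTTP/1.1" 200 5123 "-" "Mozilla/5.0 ..."
visitor-1 - - [13/Oct/2014:09:12:05 +0000] "GET /en/choose-your-wallet HTTP/1.1" 200 8456 "-" "Mozilla/5.0 ..."
visitor-2 - - [13/Oct/2014:09:13:44 +0000] "GET /en/download HTTP/1.1" 200 5123 "-" "Mozilla/5.0 ..."
```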
When the program exits, the associative array is destroyed, fully obfuscating the IP addresses---but leaving the randomly-assigned ID still present in the database to allow you to generate statistics. The downsides of this approach I can see are:
Sorry for the long post. I hope it all makes sense. |
I'm not a webmaster, but the one stats package I am familiar with, awstats, maintains its unique database on a monthly basis. E.g. if an IP visits on the 1st and again on the 25th of the same month, it's counted as a single unique visitor. If the IP visits next month, it's a second visitor (for the purposes of displaying a whole year's stats). I think longer-lived tracking such as this produces more useful/accurate statistics. There is, of course, a trade-off between privacy and accuracy. Towards that end, I'd suggest a slight tweak to the suggestion by @harding. Exactly once in the script, do:
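(a rough Python stand-in for the idea; the real script is Ruby, and the variable name here is only illustrative)

```python
import os

secret_key = os.urandom(32)  # generated once per run; could instead be loaded from a persisted file
```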
And then in anonymizeLine, do:
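(again a Python stand-in; anonymizeLine itself is a Ruby function in stats.rb, so this only sketches the keyed-digest idea)

```python
import hashlib
import hmac
import os

secret_key = os.urandom(32)  # or the key created once in the setup step above

def anonymize_ip(ip_str):
    # The same IP always maps to the same opaque token while secret_key is
    # unchanged, so unique visitors can still be counted without storing the IP.
    return hmac.new(secret_key, ip_str.encode(), hashlib.sha256).hexdigest()
```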
As already noted, the geoip lookup would have to be done prior to anonymizeLine. The potential advantage, aside from not needing the associative array, is we could more precisely choose where to draw the privacy/accuracy line. Instead of creating a new secret_key for every script run, secret_key could be persisted somewhere, e.g. in a file. Every X days, the old secret_key will be securely deleted and regenerated. Without this tweak, X is effectively 1 day. I think having an X closer to 7 days would yield more useful statistics with a reasonably small additional loss of privacy, but that's certainly debatable. |
|
These are interesting options, although I had hoped we could recreate the database from saved logs. The more I read about unique visitors, the more I feel this value is both fairly abstract and resource-consuming without complex analysis. I'm wondering if it wouldn't be better to just provide page views for now, and only track abnormal daily requests by IP for the purpose of filtering DoS from the stats. Another option would be to use nginx userid cookies (it seems that using IPs for unique visitors is generally discouraged; likewise, we seem to be getting a lot of visitors from China from a few IPs in the same subnet, with different normal-looking user-agent fields). However, given that the first request wouldn't have the cookie set, and given that some browsers will ignore cookies, this would likely create other issues. |
I'm surely missing something here... why can't they be recreated?
I agree that web statistics tend to lie, even more so than other statistics do. However, given that the code is already written, I also think more (even if inaccurate) is better. If unique user stats are kept, maybe a disclaimer on the stats page would be in order?
Although I personally like the idea, I'd be concerned about the potential for some backlash over adding cookies. There's also European law to consider, which requires consent for cookie usage (although I think it can be implied consent, e.g. an overlay which dismisses itself). |
Well, saved logs use the anonymized IP, so recreating the database from saved logs is possible but would bring back the -25% accuracy issue. The database itself keeps very little data. If the unique IDs are somehow saved in the logs, then there is the issue of keeping them compatible with the combined log format and of being sure this data can be trusted when re-importing these logs later.
Not sure; in my view, I tend to think it should either be good and simple enough, or disabled until it gets there. I even hesitated about providing that feature before finding these new issues, as it requires 300-600 MB of database space per year and slows down the script. At this point I feel this is a half-baked feature with requirements that are too high for generating one line of inaccurate (and depreciating) data. |
|
FWIW I compared counting unique visitors using unique IPs per day vs. per month (the current behavior). This results in ~2,700,000 additional unique visitors. So this value is also easily affected by where we draw the line with regard to how long an IP is considered to represent a single user. |
|
The unique visitor count doesn't seem to provide us with any information about where to focus our effort on improving the website, which was a primary goal of the stats project. I guess it could help encourage sponsorships (another goal), but the high page view count should also do that---so I agree with @saivann that dropping uniques for now seems reasonable. As for the other topics mentioned here, if we do keep/re-enable uniques, I like @gurnec's suggestion for a one-week or one-month persistent obfuscation key, and I also agree with both of your comments that cookies would likely create additional problems. |
|
Unique visitors is an important “bottom-line” statistic that decision makers at organizations would look at when considering a sponsorship and how much to pay for it.
|
@Coderwill Which is, in my view, yet another reason why it's not a good idea to display a value that is much lower than the reality :) |
|
Yes, an important point to consider if there is no other way to improve the accuracy.
|
|
@saivann Thank you for taking the time to explain. Perhaps this idea would better address some of the issues. The intent of the function below is to transform an input IP into a somewhat anonymized IP such that:
Here's an implementation that I think achieves this (sorry for it being in Python, but I'm sure you'll get the idea):

```python
import hashlib
import hmac
import os
import random

secret_key_bytes = os.urandom(32)  # or load it from somewhere

def anonymize_ip(input_ip_str):
    # Split the address into the first three octets and the last octet.
    last_octet_pos = input_ip_str.rindex('.') + 1
    first_octets_str = input_ip_str[:last_octet_pos]
    last_octet = int(input_ip_str[last_octet_pos:])
    # Derive a per-subnet seed from the secret key and the subnet prefix.
    digest_bytes = hmac.new(secret_key_bytes, first_octets_str.encode(), hashlib.sha256).digest()
    random.seed(digest_bytes)
    # Build a keyed permutation of 0-255 and remap the last octet through it.
    octet_mapping = list(range(256))
    random.shuffle(octet_mapping)
    return first_octets_str + str(octet_mapping[last_octet])
```

One thing I'm unsure of is whether or not a CSPRNG would be required for the shuffle (Python, like Ruby, just uses MT19937). This doesn't address the issue of whether or not displaying unique statistics from older data is worthwhile (I think everyone here agrees it's not), but it would allow more useful unique statistics to be generated later if desired. It maybe doesn't go far enough towards anonymization. Any thoughts? |
christophebiocca commented on Oct 18, 2014
|
Is there a reason we want to keep the first three octets? The simplest thing to track uniques (and nothing else) is to just keep a digest of the IP address using a durable secret. It's secure as long as the secret isn't obtained by an attacker. The problem, of course, is that the secret is a single point of failure, and because IPv4 addresses are low entropy, rerunning the function over the entire address space is easy. Contrast this with IPv6, where an attacker can only use the secret to test for specific addresses, because the space is too large to enumerate (although the space of IPv6 addresses in actual use is very small and not particularly secret). One technique that can increase security beyond this is to use key stretching, but in this case the attacker would only need to do as much work to decode the IPs as we originally put into encoding them. It won't work. |
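A minimal sketch of that keep-a-keyed-digest approach (Python with hypothetical names, not code from stats.rb):

```python
import hashlib
import hmac
import os

durable_secret = os.urandom(32)  # hypothetical long-lived secret, persisted somewhere safe

def ip_token(ip_str):
    # The same IP always yields the same opaque token, which is enough for
    # counting uniques and nothing else.
    return hmac.new(durable_secret, ip_str.encode(), hashlib.sha256).hexdigest()

# The weakness noted above: anyone holding durable_secret can recompute this
# digest for all ~4.3 billion IPv4 addresses and invert the mapping by brute force.
```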
|
@gurnec Ok wow, thanks! Although I do not have the required skills to make sure this implementation is secure, the idea sounds good to me at first glance. Below is the count of how many unique IPs per subnet can be found in the current logs since January. Privacy is weaker, but unless I'm mistaken, in the worst-case scenarios all we can conclude is that a specific IP has a ~16% chance of being included in the logs (40 / 255 IPs). My impression is that we can hardly derive useful information for identifying an individual person from that, and since the logs are meant to remain secret anyway, I feel this is a good compromise for keeping the unique visitor count. The private key could be stored in the [...]
@harding Any opinion? (IPs in the same subnet : subnets count) |
@christophebiocca Yes, so we can have geolocation stats, and so the database can easily be rebuilt from logs should there be any bug, failure or new feature that might require it. |
|
@gurnec's code and revised analysis (with the strikethrough) look reasonable to me. @saivann my only suggestion would be keeping the temporary persistent key in a separate file so that the code can securely delete it. |
|
@harding Thanks! Using a regular file and shred for the private key makes a lot of sense. Regarding |
|
@saivann for rsync, you'd have to use the |
|
@gurnec Good point! However, AFAIK regarding SSDs, shredding files would still prevent someone from restoring the file at the filesystem level without physical access to the SSD, so shredding would at least increase the cost and decrease the chances of restoring files. I doubt this data is worth the effort of a more thorough solution like encrypting the whole drive. |
|
I have just pushed a branch to make it harder for DoS attacks or aggressive bots to affect the stats. The script will now blacklist IPs with an abnormal number of daily requests, or with a high number of daily requests repeating the same pages, referers or user agents. This change would not work against distributed attacks, but would at least increase the cost of vandalizing the stats a little. Diff: (Merged) Live previews (filtered and not filtered): Feedback is welcome. Afterwards, I will work on the suggested solution for keeping unique visitors (or @gurnec is welcome to provide a pull request), and then move back to working on bitcoin.org's content. |
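The general shape of that filtering is roughly as follows (an illustrative Python sketch with made-up thresholds; the actual stats.rb logic differs in its details):

```python
from collections import Counter, defaultdict

# Example thresholds only, not the values used by stats.rb.
MAX_DAILY_REQUESTS = 5000
MAX_REPEATED_FIELD = 2000

def blacklist_for_day(requests):
    """requests: iterable of (ip, page, referer, useragent) tuples for one day."""
    per_ip = defaultdict(list)
    for ip, page, referer, useragent in requests:
        per_ip[ip].append((page, referer, useragent))

    blacklisted = set()
    for ip, entries in per_ip.items():
        # Abnormal total volume for a single day.
        if len(entries) > MAX_DAILY_REQUESTS:
            blacklisted.add(ip)
            continue
        # Many daily requests repeating the same page, referer or user agent.
        for field in range(3):
            top_count = Counter(e[field] for e in entries).most_common(1)[0][1]
            if top_count > MAX_REPEATED_FIELD:
                blacklisted.add(ip)
                break
    return blacklisted
```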
|
I have just pushed a commit that implements the suggested anonymizing technique. I have carefully tested the generation, use, overwriting and deletion of the random keys, as well as the resuming and replaying of the database, and I compared the resulting stats and verified the saved logs. More review or testing is welcome. Unless any bug or issue is found, I will move this code to production, update the stats, and keep the original logs for a few weeks before securely deleting them, in case any other issue is found. |
|
@saivann Although my knowledge of Ruby is pretty meager, I took a look anyways. I made one small suggestion (as a line note in the commit) which you may want to consider, but aside from that the rest looks great AFAICT. |
|
@gurnec Very appreciated, thanks! |
saivann commented on Oct 9, 2014
Download stats have been requested quite a few times (@jgarzik @Michagogo). And I believe bitcoin.org can generally provide useful insight into global interest by country, as well as help us identify what needs to be prioritized on the website.
I have completed an optimized Ruby script to do just that from server logs:
https://github.com/saivann/bitcoinstats
Due to its public and optimized nature, this script will be limited to providing:
If there is no opposition, I would provide a live preview of the final result on ~~August~~ October 13th, effectively releasing the stats. While I have been very careful with testing the script, reviews are always very welcome (please open issues on the repository linked at the beginning of this issue).