Providing public stats for bitcoin.org #605

Closed
saivann opened this Issue Oct 9, 2014 · 39 comments

Contributor

saivann commented Oct 9, 2014

Download stats have been requested quite a few times (@jgarzik @Michagogo). I also believe bitcoin.org can provide useful insight into global interest by country, as well as help us identify what needs to be prioritized on the website.

I have completed an optimized Ruby script that does just that from server logs:
https://github.com/saivann/bitcoinstats

  • This script scales, is fast and resumable, and compresses, rotates and filters saved log files while keeping memory consumption very low.
  • This script should protect privacy by keeping only anonymized IPs and not leaking them.
  • This script should avoid weakening security, as it does not require enabling CGI on the HTTP server.

Due to its public and optimized nature, this script will be limited to providing:

  • Total page views with graph.
  • Total unique visitors.
  • Table of visits by page (sortable).
  • Table of visits by country (sortable).

If there is no opposition, I will provide a live preview of the final result on August 13th, effectively releasing the stats. While I have been very careful with testing the script, reviews are always very welcome (please open issues on the repository linked at the beginning of this issue).

Contributor

harding commented Oct 9, 2014

Thanks for working on this, @saivann!

This script should avoid weakening security, as it does not require enabling CGI on the HTTP server.

I disagree with this statement. The script will be processing arbitrary data sent by users---the same as a CGI script---with the only difference being that this data is read from a log file rather than the regular server interface.

Just at a glance, the RE in logregex (stats.rb line 569) seems to provide sufficient whitelisting, although secure input filtering is definitely not one of my specialties.

I wonder if it might be more prudent to get a $10/month Linode or other VPS, rsync the logs to it, run the stats script there, and then rsync the HTML results back? (Alternatively, a much bigger change would be to move the Bitcoin Core binaries to a secure server dedicated to hosting just them. That way Bitcoin.org website security wouldn't be quite so critical.)

August 13th

October 13th, maybe?

I'll try to spend some more time reviewing the script in the next couple days. Based on my quick review, it looks good. Thanks again!

Contributor

saivann commented Oct 9, 2014

Thanks for your comments and review!

I wonder if it might be more prudent to get a $10/month Linode or other VPS, rsync the logs to it, run the stats script there, and then rsync the HTML results back?

...I think that's a great idea! Actually, the Jekyll build process could go there too, so the server wouldn't need to run any scripts anymore and would effectively only serve binaries and static files.

Contributor

saivann commented Oct 10, 2014

The website is now built on a separate cloud server, and building stats with this setup will work just fine. The server hosting bitcoin.org now serves only static files.

Contributor

wbnns commented Oct 10, 2014

Great work on this, @saivann - I really appreciate you taking the initiative on this and seeing it through. Apologies that I have not been available to help.

/bows.

Contributor

luke-jr commented Oct 10, 2014

Are pages that yielded an error (e.g. 404) included in the per-page stats?

Contributor

saivann commented Oct 10, 2014

@luke-jr error.log isn't processed, and only page requests with a 200 HTTP status code are counted (stats.rb line 569).
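For illustration, a filter like that could look roughly like the sketch below; the regex here is a simplified stand-in for the combined log format, not the actual logregex from stats.rb:

# Hypothetical example filter, not the regex used in stats.rb.
LOG_LINE = /\A(\S+) \S+ \S+ \[([^\]]+)\] "GET ([^" ]+) HTTP\/[\d.]+" 200 \d+/

File.foreach('access.log') do |line|
  next unless (m = LOG_LINE.match(line))
  ip, timestamp, path = m[1], m[2], m[3]
  # Only 200-status GET requests reach this point.
  puts "#{ip} #{timestamp} #{path}"
end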

Contributor

harding commented Oct 10, 2014

I created a preview of the stats pages using public logs from NASA from July & August 1995. That should allow y'all to see what the pages will look like without risk of leaking any private Bitcoin.org data.

Contributor

gurnec commented Oct 11, 2014

@saivann This is really great work that'll give us all a lot of insight, thanks!

Do you think it would make sense to aggregate 304/Not Modified status codes in with the 200s?

Contributor

saivann commented Oct 11, 2014

@gurnec Thanks! That's a good question. bitcoin.org uses no caching for HTML pages, so my first bet is that we probably don't have many of these, but now that you mention it, I guess it would make sense to include them. 206 status codes, however, would probably inflate download stats and not reflect reality.

@harding Very cool :)

Contributor

saivann commented Oct 11, 2014

@Coderwill No problem, thanks for the time you have spent on this and other projects!

Contributor

harding commented Oct 13, 2014

Note: I finished reviewing both the stats code and the output from running it on logs from several sites, and I see no problems or privacy leaks. I look forward to seeing the live preview tomorrow.

Contributor

saivann commented Oct 13, 2014

@harding Thanks!!

In the absence of critical feedback, stats will be published tomorrow, on October 13th.

Contributor

gmaxwell commented Oct 13, 2014

sounds good to me.

jgarzik commented Oct 13, 2014

ACK

Contributor

saivann commented Oct 14, 2014

Stats are now live at https://bitcoin.org/stats/ and should be updated on a daily basis. I hope this data will prove useful.

Suspect traffic (e.g. bots, or what looks like a DDoS attack in 2014-03) could perhaps be filtered at some point, should there be reasonably accurate and efficient ways to do so.

Contributor

saivann commented Oct 15, 2014

After additional testing, it appears that anonymizing IPs reduces the unique visitor count by ~25%, which significantly affects accuracy. Let me know in case anyone has a good idea to tackle this problem.

Contributor

harding commented Oct 15, 2014

@saivann If I read stats.rb correctly, it considers a unique visitor to be an IP address seen within a particular period of time (month|year|all time). This isn't wrong, but I think most stats programs consider a unique visitor to be an IP address seen within a particular day aggregated to the period length. I.e. "unique visitors in September" means "unique visitors on Sept 1st plus unique visitors on Sept 2nd plus unique visitors on Sept 3rd plus...".

If you're willing to use this definition of unique visitors, you can process the logs one full day at a time. The first time you see an IP address for a particular day, you can assign it a random ID (storing the mapping from IP-to-ID in a simple associative array). Each entry for that IP address gets written to the database/modified logs with its randomly-assigned ID rather than the partly-obfuscated IP address. For example, this log:

123.123.123.123 2014-09-01:00:00:01 GET /
456.456.456.456 2014-09-01:00:00:07 GET /en/choose-your-wallet
123.123.123.123 2014-09-01:00:00:13 GET /en/developer-documentation

Gets written to the database as:

74e46f35f5a6b937da53e8ebf154fc1cdae8d660 2014-09-01 /
a6083dc10f24618fb09695b790abf3e1ac989004 2014-09-01 /en/choose-your-wallet
74e46f35f5a6b937da53e8ebf154fc1cdae8d660 2014-09-01 /en/developer-documentation

When the program exits, the associative array is destroyed, fully obfuscating the IP addresses---but leaving the randomly-assigned ID still present in the database to allow you to generate statistics.
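A minimal sketch of that one-day pass in Ruby (the file name and ID length are placeholders, not anything taken from stats.rb):

require 'securerandom'

# One day of logs per run; the IP-to-ID map only lives for this run.
ip_to_id = Hash.new { |h, ip| h[ip] = SecureRandom.hex(20) }

File.foreach('access.log-2014-09-01') do |line|
  ip, rest = line.split(' ', 2)
  # Write the random ID in place of the (partly obfuscated) IP address.
  puts "#{ip_to_id[ip]} #{rest}"
end
# Once the script exits, ip_to_id is discarded, so the IDs can no longer be
# linked back to the original IP addresses.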

The downsides of this approach I can see are:

  • You can't count any statistics from today (UTC) or you'll over-count unique visitors.

  • The GeoIP lookups become non-repeatable---you have to do them before the IP address associative array is destroyed. For that reason, you may want to add the results to the obfuscated logs. E.g.:

    74e46f35f5a6b937da53e8ebf154fc1cdae8d660 2014-09-01:00:00:01 GET / "United States Of America"
    a6083dc10f24618fb09695b790abf3e1ac989004 2014-09-01:00:00:07 GET /en/choose-your-wallet "Japan"
    74e46f35f5a6b937da53e8ebf154fc1cdae8d660 2014-09-01:00:00:13 GET /en/developer-documentation "United States Of America"
    

Sorry for the long post. I hope it all makes sense.

Contributor

gurnec commented Oct 15, 2014

I think most stats programs consider a unique visitor to be an IP address seen within a particular day aggregated to the period length. I.e. "unique visitors in September" means "unique visitors on Sept 1st plus unique visitors on Sept 2nd plus unique visitors on Sept 3rd plus...".

I'm not a webmaster, but the one stats package I am familiar with, awstats, maintains its unique database on a monthly basis. E.g. if an IP visits on the 1st and again on the 25th of the same month, it's counted as a single unique visitor. If the IP visits next month, it's a second visitor (for the purposes of displaying a whole year's stats). I think longer-lived tracking such as this produces more useful/accurate statistics.

There is, of course, a trade-off between privacy and accuracy. Towards that end, I'd suggest a slight tweak to the suggestion by @harding.

Exactly once in the script, do:

secret_key = random_bytes(32)

And then in anonymizeLine, do:

anonymized_ip = hmac-sha256(secret_key, plaintext_ip)

As already noted, the geoip lookup would have to be done prior to anonymizeLine.

The potential advantage, aside from not needing the associative array, is we could more precisely choose where to draw the privacy/accuracy line. Instead of creating a new secret_key for every script run, secret_key could be persisted somewhere, e.g. in a file. Every X days, the old secret_key will be securely deleted and regenerated.

Without this tweak, X is effectively 1 day. I think having an X closer to 7 days would yield more useful statistics with a reasonably small additional loss of privacy, but that's certainly debatable.
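A minimal sketch of that idea in Ruby, assuming the key lives in a plain file and that X = 7 days (both are assumptions for illustration, not settings from the script):

require 'openssl'
require 'securerandom'

KEY_FILE     = 'stats-secret.key'   # assumed location
ROTATE_AFTER = 7 * 24 * 3600        # X = 7 days, purely as an example

# Regenerate the key if it is missing or older than X days.
if !File.exist?(KEY_FILE) || Time.now - File.mtime(KEY_FILE) > ROTATE_AFTER
  File.binwrite(KEY_FILE, SecureRandom.random_bytes(32))
end
secret_key = File.binread(KEY_FILE)

# As noted above, any GeoIP lookup has to happen before this call.
def anonymize_ip(secret_key, plaintext_ip)
  OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret_key, plaintext_ip)
end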

Contributor

saivann commented Oct 15, 2014

These are interesting options, although I hoped we could recreate the database from saved logs.

The more I read about unique visitors, the more I feel this value is both fairly abstract and resource-consuming without complex analysis. I'm wondering if it wouldn't be better to just provide page views for now, and only track abnormal daily requests by IP for the purpose of filtering DoS traffic from the stats.

Another option would be to use nginx userid cookies (it seems that using IPs for unique visitors is generally discouraged; likewise, we seem to be getting a lot of visitors from China from a few IPs in the same subnet, with different normal-looking user-agent fields). However, given that the first request wouldn't have the cookie set, and given that some browsers will ignore cookies, this would likely create other issues.

Contributor

gurnec commented Oct 15, 2014

These are interesting options, although I hoped we could recreate the database from saved logs.

I'm surely missing something here... why can't they be recreated?

The more I read about unique visitors, the more I feel this value is both fairly abstract and resource-consuming without complex analysis.

I agree that web statistics tend to lie, even more so than other statistics do. However, given that the code is already written, I also think more (even if inaccurate) is better. If unique user stats are kept, maybe a disclaimer on the stats page would be in order?

Another option would be to use nginx userid cookies

Although I personally like the idea, I'd be concerned about the potential for some backlash over adding cookies. There's also European law to consider, which requires consent for cookie usage (although I think it can be implied consent, e.g. an overlay which dismisses itself).

Contributor

saivann commented Oct 15, 2014

These are interesting options, although I hoped we could recreate the database from saved logs.

I'm surely missing something here... why can't they be recreated?

Well, saved logs use the anonymized IPs, so recreating the database from saved logs is possible but would bring back the -25% accuracy issue. The database itself keeps very little data. If the unique IDs are somehow saved in the logs, then there is the issue of keeping them compatible with the combined log format and of being sure this data can be trusted when re-importing these logs later.

I agree that web statistics tend to lie, even more so than other statistics do. However, given that the code is already written, I also think more (even if inaccurate) is better. If unique user stats are kept, maybe a disclaimer on the stats page would be in order?

Not sure; in my view, it should either be good and simple enough, or disabled until it gets there. I even hesitated to provide that feature before finding these new issues, as it requires 300-600 MB of database space per year and slows down the script. I feel at this point that this is a half-baked feature with too high requirements for generating one line of inaccurate (and depreciating) data.

Contributor

saivann commented Oct 15, 2014

FWIW, I compared counting unique visitors using unique IPs per day vs. per month (the current approach). This results in ~2,700,000 additional unique visitors. So this value is also easily affected by where we draw the line with regard to how long an IP is considered to represent a single user.

Contributor

harding commented Oct 16, 2014

The unique visitor count doesn't seem to provide us with any information about where to focus our effort on improving the website, which was a primary goal of the stats project. I guess it could help encourage sponsorships (another goal), but the high page view count should also do that---so I agree with @saivann that dropping uniques for now seems reasonable.

As for the other topics mentioned here, if we do keep/re-enable uniques, I like @gurnec's suggestion for a one-week or one-month persistent obfuscation key, and I also agree with both of your comments that cookies would likely create additional problems.

Contributor

wbnns commented Oct 16, 2014

Unique visitors is an important “bottom-line” statistic that decision makers at organizations would look at when considering a sponsorship and how much to pay for it.


Contributor

saivann commented Oct 16, 2014

Unique visitors is an important “bottom-line” statistic that decision makers at organizations would look at when considering a sponsorship and how much to pay for it.

@Coderwill Which is, in my view, yet another reason why it's not a good idea to display a value that is much lower than reality :)

Contributor

wbnns commented Oct 17, 2014

Yes, an important point to consider if there is no other way to improve the accuracy.


Contributor

gurnec commented Oct 18, 2014

@saivann Thank you for taking the time to explain.

Perhaps this idea would better address some of the issues.

The intent of the function below is to transform an input IP into a somewhat anonymized IP such that:

  • The anonymized IP's geolocation is unchanged (same first three octets).
  • Without the secret_key, no input IP can be determined from a set of output IPs.
  • The output IP remains useful for visitor tracking (i.e. the function must be repeatable so long as the secret_key remains unchanged, and it must be injective).

Here's an implementation that I think achieves this; sorry for it being in Python, but I'm sure you'll get the idea:

import hashlib
import hmac
import os
import random

secret_key_bytes = os.urandom(32)  # or load it from somewhere

def anonymize_ip(input_ip_str):
    last_octet_pos   = input_ip_str.rindex('.') + 1
    first_octets_str = input_ip_str[:last_octet_pos]
    last_octet       = int(input_ip_str[last_octet_pos:])
    digest_bytes     = hmac.new(secret_key_bytes, first_octets_str.encode(), hashlib.sha256).digest()
    random.seed(digest_bytes)
    octet_mapping    = list(range(256))
    random.shuffle(octet_mapping)
    return first_octets_str + str(octet_mapping[last_octet])

One thing I'm unsure of is whether or not a CSPRNG would be required for the shuffle (Python, like Ruby, just uses MT19937).

This doesn't address the issue of whether or not displaying unique statistics from older data is worthwhile (I think everyone here agrees it's not), but it would allow more useful unique statistics to be generated later if desired. ~~It maybe doesn't go far enough towards anonymization, but I don't think it's any worse than what's used today.~~ Edit: actually it is worse in at least some ways, e.g. if there are 256 unique anonymized IPs (with the same secret_key) all from the same /24 that visit bitcoin.org, it's obvious that every IP from that /24 has visited.

Any thoughts?

christophebiocca commented Oct 18, 2014

Is there a reason we want to keep the first three octets?

The simplest thing for tracking uniques (and nothing else) is to just keep a digest of the IP address using a durable secret. It's secure as long as the secret isn't obtained by an attacker.

The problem, of course, is that the secret is a single point of failure, and because IPv4 addresses are low entropy, rerunning the function over the entire address space is easy.

Contrast with IPv6, where an attacker can only use the secret to test for specific addresses, because the space is too large to enumerate (although the space of IPv6 addresses in actual use is very small and not particularly secret).

One technique that can increase security beyond this is to use key stretching, but in this case the attacker would only need to do as much work to decode the IPs as we originally put into encoding them. It won't work.
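To make the IPv4 point concrete, here is a rough sketch (my own illustration, not anything from the script) of how an attacker holding the secret could invert the digests over whatever candidate addresses they care about:

require 'openssl'

# With the secret in hand, hashing candidate IPv4 addresses and inverting the
# mapping is cheap; the whole space is only about 2^32 addresses.
def reverse_table(secret_key, candidate_ips)
  candidate_ips.each_with_object({}) do |ip, table|
    digest = OpenSSL::HMAC.hexdigest(OpenSSL::Digest.new('SHA256'), secret_key, ip)
    table[digest] = ip
  end
end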

Contributor

saivann commented Oct 18, 2014

@gurnec OK wow, thanks! Although I don't have the required skills to make sure this implementation is secure, the idea sounds good to me at first glance. Below is a count of how many unique IPs per subnet can be found in the current logs since January. Privacy is weaker, but unless I'm mistaken, in the worst-case scenarios all we can conclude is that a specific IP has a ~16% chance of being included in the logs (40 / 255 IPs).

My impression is that we can hardly derive useful information for identifying an individual person from this, and since the logs are meant to remain secret anyway, I feel this is a good compromise for keeping the unique visitor count.

The private key could be stored in the config table of the database, and replaced every time the script starts processing a new month.

@harding Any opinion?

(unique IPs in the same subnet : number of subnets)
40 : 4
38 : 2
37 : 2
36 : 2
34 : 4
33 : 4
32 : 2
31 : 2
30 : 2
29 : 6
28 : 4
27 : 8
26 : 6
24 : 16
23 : 8
22 : 14
21 : 12
20 : 24
19 : 20
18 : 28
17 : 18
16 : 26
15 : 32
14 : 26
13 : 28
12 : 54
11 : 54
10 : 66
9 : 74
8 : 106
7 : 144
6 : 202
5 : 292
4 : 538
3 : 1296
2 : 6958
1 : 158076

Contributor

saivann commented Oct 18, 2014

Is there a reason we want to keep the first three octets?

@christophebiocca Yes, so we can have geolocation stats, and so the database can easily be rebuilt from logs should there be any bug, failure or new feature that might require it.

Contributor

harding commented Oct 18, 2014

@gurnec's code and revised analysis (with the strikethrough) look reasonable to me.

@saivann my only suggestion would be keeping the temporary persistent key in a separate file so that the code can securely shred -u that file instead of less-securely updating the database. Assuming the operating environment is secure, shred should ensure the secret key bytes are immediately destroyed rather than waiting for the database/filesystem to overwrite that part of the disk drive. (We may want to use shred in some of the other places where Ruby's delete is called, too.)
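A tiny sketch of what that could look like from Ruby, assuming GNU coreutils' shred is available on the box (the helper name and file name are placeholders):

# Hypothetical helper: overwrite and remove a file with shred(1) instead of
# relying on File.delete.
def shred_file(path)
  system('shred', '-u', '--', path) or raise "shred failed for #{path}"
end

shred_file('stats-secret.key')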

Contributor

saivann commented Oct 18, 2014

@harding Thanks! Using a regular file and shred for the private key makes a lot of sense. Regarding shredding log files, you made me realize it's actually possible to use shred with logrotate, although I'm not sure how to use shred securely with rsync.

Contributor

harding commented Oct 18, 2014

@saivann for rsync, you'd have to use --backup mode, which will rename the changed or deleted files (and optionally put them in a special directory). Then you shred those changed/deleted files. If I understand correctly, this will take up extra disk space and disk I/O, but it will still be just as bandwidth-efficient as a typical rsync.
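As a rough sketch of that flow (the hosts, paths and rsync options here are assumptions for illustration, not our actual setup):

# Pull logs, keeping replaced/deleted files in a side directory, then shred
# the contents of that side directory. Paths and hosts are placeholders.
backup_dir = '/srv/stats/rsync-backup'
ok = system('rsync', '-az', '--backup', "--backup-dir=#{backup_dir}",
            'bitcoin.org:/var/log/nginx/', '/srv/stats/logs/')
abort 'rsync failed' unless ok

Dir.glob(File.join(backup_dir, '**', '*')).each do |path|
  system('shred', '-u', '--', path) if File.file?(path)
end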

Contributor

gurnec commented Oct 18, 2014

@saivann @harding I suspect you already know this, I'm just double-checking... using shred is only effective on non-transactional filesystems (e.g. ext4) and on HDDs (not SSDs), so going through the additional work wouldn't make sense if the server doesn't meet those requirements, correct?

Contributor

saivann commented Oct 18, 2014

@gurnec Good point! However, AFAIK regarding SSDs, shredding files would still prevent someone from restoring the file at the filesystem level without physical access to the SSD, so shredding would just increase the cost and decrease the chances of restoring files. I doubt this data is (edit: worth the effort of a more thorough solution like encrypting the whole drive).

Contributor

saivann commented Oct 22, 2014

I have just pushed a branch to make it harder for DoS attacks or aggressive bots to affect the stats. The script now blacklists IPs with an abnormal number of daily requests, or with a high number of daily requests repeating the same pages, referers or user agents. This change would not work against distributed attacks, but it would at least increase the cost of vandalizing the stats a little.
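For a sense of the approach, here is a minimal sketch of that kind of threshold check; the thresholds and field names below are made up for illustration and are not the values the script actually uses:

# Hypothetical per-IP daily summary and thresholds, for illustration only.
MAX_DAILY_REQUESTS = 5_000
MAX_REPEATED_VALUE = 2_000

def blacklisted?(daily)
  return true if daily[:requests] > MAX_DAILY_REQUESTS
  # Flag IPs that hammer the same page, referer or user agent all day long.
  [:pages, :referers, :useragents].any? do |field|
    daily[field].values.max.to_i > MAX_REPEATED_VALUE
  end
end

# Example summary: { requests: 6200, pages: { "/" => 6100 }, referers: {}, useragents: {} }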

Diff: (Merged)

Live previews (filtered and not filtered):
(Merged)

Feedback is welcome. Afterwards, I will work on the suggested solution for keeping unique visitors (or @gurnec is welcome to provide a pull request), and then move back to working on bitcoin.org's content.

Contributor

saivann commented Nov 12, 2014

I have just pushed a commit that implements the suggested anonymizing technique.
saivann/bitcoinstats@68837da

I have carefully tested the generation, use, overwriting and deletion of the random keys, as well as the resuming and replaying of the database, compared the resulting stats and verified the saved logs. More review or testing is welcome. Unless any bug or issue is found, I will move this code to production, update the stats, and keep the original logs for a few weeks before securely deleting them, in case any other issue is found.

Contributor

gurnec commented Nov 13, 2014

@saivann Although my knowledge of Ruby is pretty meager, I took a look anyways. I made one small suggestion (as a line note in the commit) which you may want to consider, but aside from that the rest looks great AFAICT.

Contributor

saivann commented Nov 13, 2014

@gurnec Very appreciated, thanks!

saivann closed this in fda3f01 Nov 22, 2014

saivann added a commit that referenced this issue Nov 22, 2014

Merge pull request #650 from bitcoin/statsprivacy
Add a "Privacy" page and link to public stats (fixes #605)