contrib: Use asmap for ASN lookup in makeseeds #24864
Concept ACK. (Python linter says to remove unused headers.)
b48d2b0 to 08ecc04
The following sections might be updated with supplementary metadata relevant to reviewers and maintainers.
Conflicts: No conflicts as of last run.
# Copyright (c) 2013-2020 The Bitcoin Core developers
# Distributed under the MIT software license, see the accompanying
# file COPYING or http://www.opensource.org/licenses/mit-license.php.
import ipaddress
BTW I didn't realize before that Python had built-in functionality to parse/manipulate IP addresses (both v4 and v6). Might make sense (but not in this PR) to use this in more places and replace all the ad-hoc hacks.
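As a small illustration of the built-in functionality mentioned in this comment, the standard-library `ipaddress` module parses and compares both IPv4 and IPv6 addresses. This is only a sketch; the `describe` helper and the sample addresses (taken from the documentation ranges) are assumptions, not code from the PR.

```python
import ipaddress


def describe(addr: str) -> str:
    """Parse an IPv4 or IPv6 address string and summarize it."""
    ip = ipaddress.ip_address(addr)  # raises ValueError on malformed input
    return f"ipv{ip.version} {ip.compressed}"


# Network membership checks work the same way for both address families.
net4 = ipaddress.ip_network("192.0.2.0/24")
net6 = ipaddress.ip_network("2001:db8::/32")

print(describe("192.0.2.17"))
print(describe("2001:0db8:0000:0000:0000:0000:0000:0001"))
print(ipaddress.ip_address("192.0.2.17") in net4)
print(ipaddress.ip_address("2001:db8::1") in net6)
```

Because parsing raises `ValueError` on bad input, the same call can also replace hand-written validity checks.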
contrib/seeds/asmap.py
Outdated
def DecodeBytes(byts):
    return [(byt >> i) & 1 for byt in byts for i in range(8)]


def DecodeBits(stream, bitpos, minval, bit_sizes):
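To make the bit order of the quoted `DecodeBytes` helper concrete, the sketch below applies the same expression to two sample bytes. The function is renamed `decode_bytes` here and the input is an arbitrary example, so treat it as an illustration rather than part of the PR.

```python
def decode_bytes(byts: bytes) -> list:
    """Expand bytes into a flat list of bits, least significant bit first."""
    return [(byt >> i) & 1 for byt in byts for i in range(8)]


# 0x05 is 00000101 in binary and 0x80 is 10000000; each byte is emitted
# starting from its lowest bit.
bits = decode_bytes(b"\x05\x80")
print(bits)
```

Each byte contributes eight entries, so the result for an `n`-byte input always has `8 * n` bits.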
ACK MIT license from me.
Also, I'm currently rewriting the Python asmap code from scratch actually (to have a single class that can do creation of asmap files, decoding, lookup, diffing, have tests, ...), and it's close to being done.
ACK MIT license
Thank you!
Also, I'm currently rewriting the Python asmap code from scratch actually (to have a single class that can do creation of asmap files, decoding, lookup, diffing, have tests, ...), and it's close to being done.
OK, we can wait for that, there's no hurry here. It would be nice to have this before 0.24 branch-off.
Also, I'm currently rewriting the Python asmap code from scratch actually (to have a single class that can do creation of asmap files, decoding, lookup, diffing, have tests, ...), and it's close to being done.
/start slightly off-topic:
Nice! Is the diffing for encoded mappings or just output from asmap-rs? Or would this even make asmap-rs obsolete? I'm in the process of making some improvements to UX there to show progress on RIPE data dump downloads and finding the ASN bottlenecks. I was also thinking of including a command there to compare non-encoded maps. I was going to use that to run some sort of monthly historical analysis over the past year to get an idea of how much the maps tend to change. Just don't want to duplicate effort.
Just some of the things to help get asmaps distributed with releases and enabled by default. @sipa, just let me know if you want to continue this chat off GitHub or on some other issue. :)
/end slightly off-topic
@dunxen Yeah, possibly. At this point the new asmap python code can do everything asmap-rs can and more (encode, decode, diff, bottleneck, ...), but there is some more work around making it nicer to use, and of course questions around review and integration into processes.
Tested that both
|
contrib/seeds/README.md
Outdated
@@ -11,18 +11,7 @@ to addrman with).
The seeds compiled into the release are created from sipa's DNS seed data, like this:
Maybe update here
- The seeds compiled into the release are created from sipa's DNS seed data, like this:
+ The seeds compiled into the release are created from sipa's DNS seed and AS map data, like this:
contrib/seeds/makeseeds.py
Outdated
print(f'Loading asmap database {args.asmap}... ', end='', file=sys.stderr, flush=True)
asmap = ASMap(args.asmap)
print(f'Done.', file=sys.stderr)
a few nitty suggestions here
- print(f'Loading asmap database {args.asmap}... ', end='', file=sys.stderr, flush=True)
+ print(f'Loading asmap database "{args.asmap}"... ', end='', file=sys.stderr, flush=True)
asmap = ASMap(args.asmap)
- print(f'Done.', file=sys.stderr)
+ print('done.\n', file=sys.stderr)
I considered this, but I have a slight preference for explicitly named rather than positional arguments, which is why I kept the
contrib/seeds/makeseeds.py
Outdated
@@ -201,7 +178,18 @@ def ip_stats(ips: List[Dict]) -> str:

    return f"{hist['ipv4']:6d} {hist['ipv6']:6d} {hist['onion']:6d}"

def parse_args():
    argparser = argparse.ArgumentParser(description=f'Generates a list of bitcoin node seed ip addresses.')
    argparser.add_argument("-a","--asmap", help=f'The location of the asmap asn database file (required)', required=True)
nits
- argparser = argparse.ArgumentParser(description=f'Generates a list of bitcoin node seed ip addresses.')
- argparser.add_argument("-a","--asmap", help=f'The location of the asmap asn database file (required)', required=True)
+ argparser = argparse.ArgumentParser(description='Generates a list of bitcoin node seed ip addresses.')
+ argparser.add_argument("-a", "--asmap", help='the location of the asmap asn database file (required)', required=True)
$ python3 makeseeds.py -h
usage: makeseeds.py [-h] -a ASMAP

Generates a list of bitcoin node seed ip addresses.

optional arguments:
  -h, --help            show this help message and exit
  -a ASMAP, --asmap ASMAP
                        the location of the asmap asn database file (required)
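The suggested `parse_args` can be tried on its own with the sketch below. The `argv` parameter is an assumption added here so that the function can be called without touching `sys.argv`; the argument definitions are the ones from the suggestion above.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line arguments; argv=None falls back to sys.argv[1:]."""
    argparser = argparse.ArgumentParser(
        description='Generates a list of bitcoin node seed ip addresses.')
    argparser.add_argument(
        "-a", "--asmap",
        help='the location of the asmap asn database file (required)',
        required=True)
    return argparser.parse_args(argv)


args = parse_args(["-a", "asmap-filled.dat"])
print(args.asmap)
```

With `required=True`, omitting `-a` makes argparse print a usage error and exit, which is why the usage line shows `-a ASMAP` without brackets.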
I've put more up-to-date and generic asmap files on https://bitcoin.sipa.be/asmap-unfilled.dat and https://bitcoin.sipa.be/asmap-filled.dat (the latter one has subnets with no actual ASN assigned to nearby ASNs to minimize the file size). The input data was sourced through https://github.com/rrybarczyk/asmap-rs, and compiled to asmap format by new code I'm working on at https://github.com/sipa/asmap/tree/nextgen. I hope that's in a presentable state sometime in the next few days, but I've put it online already to experiment with.
08ecc04 to 4f33193
Lightly tested ACK 4f33193 per git range-diff 10a626a1 08ecc04 4f33193
Loading asmap database "asmap-filled.dat"…Done.
IPv4 IPv6 Onion Pass
470769 73264 0 Initial
470769 73264 0 Skip entries with invalid address
470769 73264 0 After removing duplicates
6321 1728 0 Enforce minimal number of blocks
5492 1496 0 Require service bit 1
3861 872 0 Require minimum uptime
3782 849 0 Require a known and recent user agent
3757 841 0 Filter out hosts with multiple bitcoin ports
512 281 0 Look up ASNs and limit results per ASN and per net
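The last pass in the table above caps how many results come from any single ASN. A minimal sketch of that idea follows; it is not the `makeseeds.py` implementation, it leaves out the per-net limit, and the `limit_per_asn` name, the entry fields, and the limit of 2 are all hypothetical.

```python
import collections


def limit_per_asn(entries, max_per_asn):
    """Keep at most max_per_asn entries for each ASN, preserving input order."""
    counts = collections.Counter()
    kept = []
    for entry in entries:
        asn = entry["asn"]
        if counts[asn] >= max_per_asn:
            continue  # this ASN already contributed its share
        counts[asn] += 1
        kept.append(entry)
    return kept


# Documentation-range addresses and ASNs, used purely as sample data.
candidates = [
    {"ip": "192.0.2.1", "asn": 64500},
    {"ip": "192.0.2.2", "asn": 64500},
    {"ip": "192.0.2.3", "asn": 64500},
    {"ip": "198.51.100.1", "asn": 64501},
    {"ip": "198.51.100.2", "asn": 64501},
]
selected = limit_per_asn(candidates, max_per_asn=2)
print([entry["ip"] for entry in selected])
```

Because the filter walks the candidates in order, whatever sorting is applied beforehand decides which entries of a crowded ASN survive.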
Cherry-picked a commit from @jonatack with improvements.
3d7f68c to 7edacdc
re-ACK 7edacdc
tACK 7edacdc
With this version of the map:
₿ sha256sum asmap-filled.dat
f479906ce1731281616a235a85a79c9c5085c36d88689939ef7fcc5196d30874 asmap-filled.dat
I get:
₿ python3 makeseeds.py -a asmap-filled.dat < seeds_main.txt > nodes_main.txt
Loading asmap database "asmap-filled.dat"…Done.
Loading and parsing DNS seeds…Done.
IPv4 IPv6 Onion Pass
471267 73410 0 Initial
471267 73410 0 Skip entries with invalid address
471267 73410 0 After removing duplicates
6927 1908 0 Enforce minimal number of blocks
5967 1617 0 Require service bit 1
3636 856 0 Require minimum uptime
3244 771 0 Require a known and recent user agent
3221 767 0 Filter out hosts with multiple bitcoin ports
512 238 0 Look up ASNs and limit results per ASN and per net
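Since the results in this comment are tied to a specific `asmap-filled.dat`, anyone reproducing them may want to check the file hash first. The sketch below does the equivalent of `sha256sum` with `hashlib`; the helper names are hypothetical, and the expected digest is the one quoted in the comment above.

```python
import hashlib

# Digest quoted in the comment above for asmap-filled.dat.
EXPECTED_SHA256 = "f479906ce1731281616a235a85a79c9c5085c36d88689939ef7fcc5196d30874"


def sha256_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file at path has the expected SHA-256 digest."""
    return sha256_file(path) == expected.lower()
```

Calling `matches_expected("asmap-filled.dat")` before running `makeseeds.py` would confirm that the same map version is in use.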
I'm kind of in doubt here, with the testing and review this has got, should we go ahead and merge this and leave updating to |
I think this is fair :)
The current module there in asmap.py has an ASMap class with tests, and a from_binary and lookup method, which should be sufficient for the purposes here. I don't think much is going to change there, and it's already much better at handling error conditions than the current code, so feel free to switch to that. I'm going to still clean things up more and have some cli tools using it, but the asmap module is pretty much done.
tACK 7edacdc
I've replaced Result is the same, but it's a bit slower (I think the bottleneck is in the loading phase): Before:
After:
|
@laanwj Ah yes, there are only functions for converting between net ranges and prefixes, not addresses. I'll add functions for those too. I don't think that's holding up anything here though; I can PR an update when it's done. The slowdown is indeed somewhat expected, as it's converting the whole map to a lookup tree format rather than interpreting the binary data directly, which is faster per lookup but slower to load. Is that a concern? I could add interpretation-based logic too.
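The trade-off described here (pay once at load time to build a tree, then answer each lookup with a short walk) can be illustrated with a plain binary prefix trie. This is a generic sketch, not the asmap file format or the code in sipa/asmap; the `PrefixTree` class is hypothetical and handles IPv4 only to stay short.

```python
import ipaddress


class PrefixTree:
    """Binary trie mapping IPv4 prefixes to ASNs (longest-prefix match)."""

    def __init__(self):
        # Each node is a list: [child_for_bit_0, child_for_bit_1, asn_or_None].
        self.root = [None, None, None]

    def insert(self, cidr: str, asn: int) -> None:
        """Add one prefix; this is the up-front cost paid while loading."""
        net = ipaddress.ip_network(cidr)
        value = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (value >> (31 - i)) & 1
            if node[bit] is None:
                node[bit] = [None, None, None]
            node = node[bit]
        node[2] = asn

    def lookup(self, addr: str):
        """Walk at most 32 nodes and return the most specific ASN, or None."""
        value = int(ipaddress.ip_address(addr))
        node = self.root
        best = node[2]
        for i in range(32):
            node = node[(value >> (31 - i)) & 1]
            if node is None:
                break
            if node[2] is not None:
                best = node[2]
        return best


tree = PrefixTree()
tree.insert("192.0.2.0/24", 64500)
tree.insert("192.0.2.128/25", 64501)
print(tree.lookup("192.0.2.1"), tree.lookup("192.0.2.200"), tree.lookup("198.51.100.1"))
```

Building the tree touches every prefix once, which is where a slower load would come from; each lookup afterwards is bounded by the address length.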
No, not for me at least. It's not a script to be run often, and ~30s is still a lot better than when it had to do a DNS query for every address. I also think the error checking implied by parsing instead of interpretation is useful here. And maybe there's scope for optimization of the loading (if anyone does care). Edit: our linter did find a few problems; not sure you care about these, as our coding style isn't necessarily the asmap one, but here goes:
|
Between 17 and 20 seconds total
b499e2d to fdfa4c3
Re-pushed to fix linter errors (upstream in sipa/asmap#5). However there's still one left which I don't really know how to get rid of:
I don't get this locally. It must have to do with Python version differences. Edit: trying to simply remove the Edit.2: that gives me an error while running
|
b00c08d to ed418f6
FWIW the asmap code runs significantly faster inside of |
Was just wondering about this. I'm not seeing the issue locally either with Python 3.10.4.
Add an argument `-a` to provide an asmap file to do the IP to ASN lookups.

This speeds up the script greatly, and makes the output deterministic. Also removes the dependency on `dns.lookup`.

I've annotated the output with ASxxxx comments to provide a way to verify the functionality.

For now I've added instructions in README.md to download and use the `demo.map` from the asmap repository. When we have some other mechanism for distributing asmap files we could switch to that.

This continues bitcoin#24824. I've removed all the fallbacks and extra complexity, as everyone will be using the same instructions anyway.

Co-authored-by: Pieter Wuille <pieter.wuille@gmail.com>
Co-authored-by: James O'Beirne <james.obeirne@pm.me>
Co-authored-by: russeree <reese.russell@ymail.com>
ed418f6 to bc23f34
I added a commit that disables the Python linter for
I decided to do this instead of removing the annotations because I think the postponed annotations are useful and we should use them as well (in the future, when the Python version allows for this). This should hopefully make this pass the CI again.
bc23f34 to 1038342
1038342 to 1df513a
Re-pushed, updated for @sipa's suggestions.
Concept ACK, will review soon.
@laanwj I've pushed another change, dropping the future annotations. It turns out you can use the string name of forward-declared types as well as typing annotations, which suffices here. I've also merged your sipa/asmap#5. Hopefully it now passes all linter requirements?
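The technique mentioned here, quoting the name of a not-yet-defined type instead of importing postponed annotations, looks roughly like the sketch below. The `Node` class is a hypothetical stand-in and is not taken from the asmap module.

```python
from typing import Optional


class Node:
    """Tree node whose methods refer to their own class in annotations."""

    # Inside the class body the name Node is not bound yet, so the annotation
    # is written as the string "Node" rather than relying on
    # `from __future__ import annotations`.
    def __init__(self, value: int, parent: Optional["Node"] = None) -> None:
        self.value = value
        self.parent = parent

    def root(self) -> "Node":
        """Follow parent links up to the top of the tree."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node


top = Node(1)
leaf = Node(3, parent=Node(2, parent=top))
print(leaf.root().value)
```

Type checkers resolve the quoted name once the class exists, so this form also works on Python versions that lack the `annotations` future import.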
1df513a to 667e316
Ok, thanks for adding a license too, updated to the new |
ACK 667e316
re-ACK 667e316
https://github.com/brunoerg/asmapy can be a good alternative!