Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FacebookBot mislabeled as fake #10

Closed
kirichkov opened this issue Apr 10, 2019 · 7 comments
Closed

FacebookBot mislabeled as fake #10

kirichkov opened this issue Apr 10, 2019 · 7 comments

Comments

@kirichkov
Copy link
Contributor

I'm starting to see Facebook Bot being labeled as a fake search engine, when in reality the IP address is genuine and I think the issue here is in the SegmentTree being built.

The IP in question is 69.171.251.1

I get the following in the console:

irb> ranges = Legitbot::Facebook.reload!
=> {:ipv4=>SegmentTree(31.13.24.0..204.15.23.255), :ipv6=>SegmentTree(2401:db00::..2a03:2887:ff34:ffff:ffff:ffff:ffff:ffff)}
irb> ranges[:ipv4].find(IPAddr.new('69.171.251.1'))
=> nil

On the other hand:

irb> ip = IPAddr.new('31.13.24.0')
=> #<IPAddr: IPv4:31.13.24.0/255.255.255.255>
irb> ranges[:ipv4].find(ip)
=> #<SegmentTree::Segment:0x00000000085acdc0 @range=#<IPAddr: IPv4:31.13.24.0/255.255.248.0>..#<IPAddr: IPv4:31.13.31.255/255.255.248.0>, @value=true>

On one hand the IPv4 SegmentTree range is too broad, but despite that the valid IP address is not returned and a legitimate bot is labeled as a fake one and thus being blocked.

@kirichkov
Copy link
Contributor Author

Two solutions come to my mind:

  1. Evaluate ranges and discard the smaller overlapping ranges, pass the optimized array to SegmentTree
  2. Drop SegmentTree completely and iterate for each IP range, until match is found, or all entries are not a match.

I don't like either of these approaches, as they add overhead.

@alaz
Copy link
Owner

alaz commented Apr 11, 2019

Replying here to your comment and that above.

With the current implementation of SegmentTree using a sorted array and binary search, I don't see how this can be solved in here. Perhaps the list of IPs can be broken up beforehand into the least-common-denominator netmask and passed to SegmentTree as a list of same-sized netmask ranges.

There is a risk to inflate the list in this approach. It could be critical assuming some ranges in FB may be quite wide, especially in IPv6 space. This is also a reason why (2) "drop SegmentTree completely" is hardly an option.

Otherwise we could try to remove smaller ranges that have wider parents. What's on my mind is to sort the list and reduce it skipping those ranges that fit completely into a preceding wider one. This algo will not be able to normalize ranges that overlap partially (I hope FB will not output such a mess). This matches your proposal (1) as far as I understand and IMHO it's a viable approach. What do you think?

@kirichkov
Copy link
Contributor Author

There's also third way that might solve this:

Sort the array of ranges based on their end, as SegmentTree sorts it based on the start. Pass the sorted array, and bypass SegmentTree's sort:

SegmentTree.new(@sorted_ranges, true)

@kirichkov
Copy link
Contributor Author

There is a risk to inflate the list in this approach. It could be critical assuming some ranges in FB may be quite wide, especially in IPv6 space. This is also a reason why (2) "drop SegmentTree completely" is hardly an option.

Agreed! As I said I don't like either.

Otherwise we could try to remove smaller ranges that have wider parents. What's on my mind is to sort the list and reduce it skipping those ranges that fit completely into a preceding wider one. This algo will not be able to normalize ranges that overlap partially (I hope FB will not output such a mess). This matches your proposal (1) as far as I understand and IMHO it's a viable approach. What do you think?

That was also something I had in mind, but I was considering those partial matches as well, and accounting for those was why I discarded the idea. I also think there was some other bot that will need implementation for this workaround.

Let me know what you think of option 3, as well. So far this looks in my head as the most viable approach, with no added overhead. I'm just not sure whether it'll work in practice :-)

@alaz
Copy link
Owner

alaz commented Apr 11, 2019

I also think there was some other bot that will need implementation for this workaround.

IMHO it makes no sense to bother about them beforehand. That would be premature optimization. FB is rather unique in this regard so far.

Let me know what you think of option 3, as well. So far this looks in my head as the most viable approach, with no added overhead. I'm just not sure whether it'll work in practice :-)

This looks like the easiest approach, but it needs to be thought of thoroughly algorithm-wise... it takes time. This would also need careful testing on Legitbot's side. It's also worth mentioning that we shall depend on the knowledge of "segment_tree" internal structures then.

@kirichkov
Copy link
Contributor Author

kirichkov commented Apr 11, 2019

It's also worth mentioning that we shall depend on the knowledge of "segment_tree" internal structures then.

I don't quite get what you mean by this? It's the same approach as used now, only the input array is pre-sorted, and SegmentTree is instructed not to sort it beforehand.

Regarding the other points - I agree. What I'm thinking is to cobble some tests and rewrite the facebook part and run it against the tests, and then against real-world ip data and see whether it works.

As a matter of fact this bug surfaced by pure coincidence. I think at one point it will resolve itself, when Facebook change their IP ranges, but who knows when that will be.

P.S.: This suggestion might be a possible solution too.

@alaz
Copy link
Owner

alaz commented Apr 12, 2019

Published as 0.2.7 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants