FacebookBot mislabeled as fake #10

kirichkov · 2019-04-10T16:08:00Z

I'm starting to see Facebook Bot being labeled as a fake search engine, when in reality the IP address is genuine and I think the issue here is in the SegmentTree being built.

The IP in question is 69.171.251.1

I get the following in the console:

irb> ranges = Legitbot::Facebook.reload!
=> {:ipv4=>SegmentTree(31.13.24.0..204.15.23.255), :ipv6=>SegmentTree(2401:db00::..2a03:2887:ff34:ffff:ffff:ffff:ffff:ffff)}
irb> ranges[:ipv4].find(IPAddr.new('69.171.251.1'))
=> nil

On the other hand:

irb> ip = IPAddr.new('31.13.24.0')
=> #<IPAddr: IPv4:31.13.24.0/255.255.255.255>
irb> ranges[:ipv4].find(ip)
=> #<SegmentTree::Segment:0x00000000085acdc0 @range=#<IPAddr: IPv4:31.13.24.0/255.255.248.0>..#<IPAddr: IPv4:31.13.31.255/255.255.248.0>, @value=true>

On one hand the IPv4 SegmentTree range is too broad, but despite that the valid IP address is not returned and a legitimate bot is labeled as a fake one and thus being blocked.

kirichkov · 2019-04-11T08:45:31Z

Two solutions come to my mind:

Evaluate ranges and discard the smaller overlapping ranges, pass the optimized array to SegmentTree
Drop SegmentTree completely and iterate for each IP range, until match is found, or all entries are not a match.

I don't like either of these approaches, as they add overhead.

alaz · 2019-04-11T08:53:37Z

Replying here to your comment and that above.

With the current implementation of SegmentTree using a sorted array and binary search, I don't see how this can be solved in here. Perhaps the list of IPs can be broken up beforehand into the least-common-denominator netmask and passed to SegmentTree as a list of same-sized netmask ranges.

There is a risk to inflate the list in this approach. It could be critical assuming some ranges in FB may be quite wide, especially in IPv6 space. This is also a reason why (2) "drop SegmentTree completely" is hardly an option.

Otherwise we could try to remove smaller ranges that have wider parents. What's on my mind is to sort the list and reduce it skipping those ranges that fit completely into a preceding wider one. This algo will not be able to normalize ranges that overlap partially (I hope FB will not output such a mess). This matches your proposal (1) as far as I understand and IMHO it's a viable approach. What do you think?

kirichkov · 2019-04-11T08:53:48Z

There's also third way that might solve this:

Sort the array of ranges based on their end, as SegmentTree sorts it based on the start. Pass the sorted array, and bypass SegmentTree's sort:

SegmentTree.new(@sorted_ranges, true)

kirichkov · 2019-04-11T08:57:49Z

There is a risk to inflate the list in this approach. It could be critical assuming some ranges in FB may be quite wide, especially in IPv6 space. This is also a reason why (2) "drop SegmentTree completely" is hardly an option.

Agreed! As I said I don't like either.

Otherwise we could try to remove smaller ranges that have wider parents. What's on my mind is to sort the list and reduce it skipping those ranges that fit completely into a preceding wider one. This algo will not be able to normalize ranges that overlap partially (I hope FB will not output such a mess). This matches your proposal (1) as far as I understand and IMHO it's a viable approach. What do you think?

That was also something I had in mind, but I was considering those partial matches as well, and accounting for those was why I discarded the idea. I also think there was some other bot that will need implementation for this workaround.

Let me know what you think of option 3, as well. So far this looks in my head as the most viable approach, with no added overhead. I'm just not sure whether it'll work in practice :-)

alaz · 2019-04-11T09:10:45Z

I also think there was some other bot that will need implementation for this workaround.

IMHO it makes no sense to bother about them beforehand. That would be premature optimization. FB is rather unique in this regard so far.

Let me know what you think of option 3, as well. So far this looks in my head as the most viable approach, with no added overhead. I'm just not sure whether it'll work in practice :-)

This looks like the easiest approach, but it needs to be thought of thoroughly algorithm-wise... it takes time. This would also need careful testing on Legitbot's side. It's also worth mentioning that we shall depend on the knowledge of "segment_tree" internal structures then.

kirichkov · 2019-04-11T10:21:23Z

It's also worth mentioning that we shall depend on the knowledge of "segment_tree" internal structures then.

I don't quite get what you mean by this? It's the same approach as used now, only the input array is pre-sorted, and SegmentTree is instructed not to sort it beforehand.

Regarding the other points - I agree. What I'm thinking is to cobble some tests and rewrite the facebook part and run it against the tests, and then against real-world ip data and see whether it works.

As a matter of fact this bug surfaced by pure coincidence. I think at one point it will resolve itself, when Facebook change their IP ranges, but who knows when that will be.

P.S.: This suggestion might be a possible solution too.

alaz · 2019-04-12T11:49:56Z

Published as 0.2.7 🚀

alaz mentioned this issue Apr 11, 2019

Lookup fails take-five/segment_tree#3

Closed

alaz added a commit that referenced this issue Apr 11, 2019

❌ add a test case that breaks Segment Tree [story #10]

c5c545c

alaz added a commit that referenced this issue Apr 11, 2019

✨ switch to "interval_tree" library to search IP in ranges [story #10]

29d5b7b

alaz closed this as completed in eb2d53c Apr 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FacebookBot mislabeled as fake #10

FacebookBot mislabeled as fake #10

kirichkov commented Apr 10, 2019

kirichkov commented Apr 11, 2019

alaz commented Apr 11, 2019

kirichkov commented Apr 11, 2019

kirichkov commented Apr 11, 2019

alaz commented Apr 11, 2019

kirichkov commented Apr 11, 2019 •

edited

Loading

alaz commented Apr 12, 2019

FacebookBot mislabeled as fake #10

FacebookBot mislabeled as fake #10

Comments

kirichkov commented Apr 10, 2019

kirichkov commented Apr 11, 2019

alaz commented Apr 11, 2019

kirichkov commented Apr 11, 2019

kirichkov commented Apr 11, 2019

alaz commented Apr 11, 2019

kirichkov commented Apr 11, 2019 • edited Loading

alaz commented Apr 12, 2019

kirichkov commented Apr 11, 2019 •

edited

Loading