Drill-7344: Add Geo-IP Functions #1841

Closed
cgivre wants to merge 3 commits

cgivre
Contributor

@cgivre cgivre commented Aug 12, 2019

Drill User Defined Functions

This README documents functions that users have submitted to Apache Drill.

Protocol Lookup Functions

These functions provide convenient lookups for host names and port numbers. Port numbers may be passed as either an int or a string.

  • get_host_name(<ip address>): This function accepts an IP address and returns the host name.

  • get_service_name(<port number>, <protocol>): This function returns the service name for a port and protocol combination.

apache drill> select get_service_name(666, 'tcp') as service from (values(1));
+------------------+
|     service      |
+------------------+
| doom Id Software |
+------------------+
1 row selected (0.178 seconds)

  • get_short_service_name(<port number>, <protocol>): Same as above, but returns the short service name.

apache drill> select get_short_service_name(21, 'tcp') as service from (values(1));
+---------+
| service |
+---------+
| ftp     |
+---------+
1 row selected (0.112 seconds)
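
The lookup behind these two functions can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name ServiceNameLookup, the "name,port,protocol,description" row layout, and the "Unknown" default for unmatched ports are all assumptions made for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration; this is not the code from the PR.
final class ServiceNameLookup {
  // Maps "port/protocol" to {short name, description}.
  private final Map<String, String[]> services = new HashMap<>();

  // Assumes rows shaped like "name,port,protocol,description[,...]"
  // with no quoted commas in the first four columns.
  ServiceNameLookup(List<String> csvRows) {
    for (String row : csvRows) {
      String[] cols = row.split(",", 5);
      if (cols.length < 4 || cols[0].isEmpty() || cols[1].isEmpty()) {
        continue; // skip malformed rows and ports with no assigned service
      }
      services.putIfAbsent(key(cols[1], cols[2]), new String[] {cols[0], cols[3]});
    }
  }

  private static String key(String port, String protocol) {
    return port.trim() + "/" + protocol.trim().toLowerCase(Locale.ROOT);
  }

  // Port given as a string, e.g. "21".
  String getShortServiceName(String port, String protocol) {
    String[] entry = services.get(key(port, protocol));
    return entry == null ? "Unknown" : entry[0]; // "Unknown" is an assumed default
  }

  // Port given as an int, e.g. 21.
  String getShortServiceName(int port, String protocol) {
    return getShortServiceName(Integer.toString(port), protocol);
  }

  // Returns the longer description column for the port and protocol.
  String getServiceName(int port, String protocol) {
    String[] entry = services.get(key(Integer.toString(port), protocol));
    return entry == null ? "Unknown" : entry[1];
  }

  public static void main(String[] args) {
    ServiceNameLookup lookup = new ServiceNameLookup(Arrays.asList(
        "ftp,21,tcp,File Transfer Protocol [Control]",
        "doom,666,tcp,doom Id Software"));
    System.out.println(lookup.getShortServiceName("21", "TCP"));
    System.out.println(lookup.getServiceName(666, "tcp"));
  }
}
```

Keying the map on the port and a lower-cased protocol lets the int and string forms of the port share one code path.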

GeoIP Functions for Apache Drill

This is a collection of GeoIP functions for Apache Drill. These functions are a wrapper for the MaxMind GeoIP Database.

IP Geo-Location is inherently imprecise and should never be relied on to get anything more than a general sense of where the traffic is coming from.

  • getCountryName( <ip> ): This function returns the country name of the IP address, "Unknown" if the IP is unknown or invalid.
  • getCountryConfidence( <ip> ): This function returns the confidence score of the country ISO code of the IP address.
  • getCountryISOCode( <ip> ): This function returns the country ISO code of the IP address, "Unknown" if the IP is unknown or invalid.
  • getCityName( <ip> ): This function returns the city name of the IP address, "Unknown" if the IP is unknown or invalid.
  • getCityConfidence( <ip> ): This function returns the confidence score of the city name of the IP address.
  • getLatitude( <ip> ): This function returns the latitude associated with the IP address.
  • getLongitude( <ip> ): This function returns the longitude associated with the IP address.
  • getTimezone( <ip> ): This function returns the timezone associated with the IP address.
  • getAccuracyRadius( <ip> ): This function returns the accuracy radius associated with the IP address, 0 if unknown.
  • getAverageIncome( <ip> ): This function returns the average income of the region associated with the IP address, 0 if unknown.
  • getMetroCode( <ip> ): This function returns the metro code of the region associated with the IP address, 0 if unknown.
  • getPopulationDensity( <ip> ): This function returns the population density associated with the IP address.
  • getPostalCode( <ip> ): This function returns the postal code associated with the IP address.
  • getCoordPoint( <ip> ): This function returns a point, for use in GIS functions, built from the latitude and longitude associated with the IP address.
  • getASN( <ip> ): This function returns the autonomous system of the IP address, "Unknown" if the IP is unknown or invalid.
  • getASNOrganization( <ip> ): This function returns the autonomous system organization of the IP address, "Unknown" if the IP is unknown or invalid.
  • isEU( <ip> ), isEuropeanUnion( <ip> ): These functions return true if the IP address is located in the European Union, false if not.
  • isAnonymous( <ip> ): This function returns true if the IP address is anonymous, false if not.
  • isAnonymousVPN( <ip> ): This function returns true if the IP address belongs to an anonymous virtual private network (VPN), false if not.
  • isHostingProvider( <ip> ): This function returns true if the IP address belongs to a hosting provider, false if not.
  • isPublciProxy( <ip> ): This function returns true if the IP address is a public proxy, false if not.
  • isTORExitNode( <ip> ): This function returns true if the IP address is a known TOR exit node, false if not.
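
The "Unknown if the IP is unknown or invalid" behavior described in the list above can be sketched as follows. This is a rough illustration under stated assumptions, not the PR's implementation: GeoIpFallback, stringOrUnknown, and fromMap are hypothetical names, the lookup function stands in for a real GeoIP database reader, and only dotted-quad IPv4 input is validated.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Pattern;

// Hypothetical wrapper; the lookup function stands in for a real GeoIP database reader.
final class GeoIpFallback {
  // Dotted-quad IPv4 only; IPv6 is left out to keep the sketch short.
  private static final Pattern IPV4 = Pattern.compile(
      "^((25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)\\.){3}(25[0-5]|2[0-4]\\d|1\\d\\d|[1-9]?\\d)$");

  static boolean isValidIpv4(String ip) {
    return ip != null && IPV4.matcher(ip.trim()).matches();
  }

  // Returns the looked-up value, or "Unknown" for invalid or unmatched addresses.
  static String stringOrUnknown(String ip, Function<String, Optional<String>> lookup) {
    if (!isValidIpv4(ip)) {
      return "Unknown";
    }
    return lookup.apply(ip.trim()).orElse("Unknown");
  }

  // Builds a lookup from an in-memory map, for demonstration only.
  static Function<String, Optional<String>> fromMap(Map<String, String> data) {
    return ip -> Optional.ofNullable(data.get(ip));
  }

  public static void main(String[] args) {
    // 192.0.2.1 is a documentation address; "Testland" is made-up sample data.
    Function<String, Optional<String>> countries =
        fromMap(Collections.singletonMap("192.0.2.1", "Testland"));
    System.out.println(stringOrUnknown("192.0.2.1", countries));
    System.out.println(stringOrUnknown("not-an-ip", countries));
  }
}
```

Validating the input before consulting the database keeps malformed strings from ever reaching the reader, so every string-returning function can share one fallback path.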

This product includes GeoLite2 data created by MaxMind, available from https://www.maxmind.com.

@cgivre cgivre added the enhancement PRs that add a new functionality to Drill label Aug 12, 2019
@arina-ielchiieva
Member

arina-ielchiieva commented Aug 13, 2019

@cgivre if you want to include PR #1840 and #1841 in the upcoming release, please rework these PRs to have proper error handling, to comply with the Drill coding style, etc. Similar PRs have gone through review and been merged; you can take a look at that code and make the appropriate changes. I think that is more reasonable than having code reviewers point out the same issues over and over again :(

<repositories>
<repository>
<id>Jabylon Repository</id>
<url>http://www.jabylon.org/maven/</url>

Member

I don't think that we should add one more repository. Could you please use dependencies from maven central?


@Override
public void setup() {
java.io.InputStream serviceFile = getClass().getClassLoader().getResourceAsStream("service-names-port-numbers.csv");

Member

Not sure this is a good idea. First, the source of the file is unknown. Second, if there is no lib that provides this functionality, I am not sure we should provide such a function in Drill.

@vvysotskyi
Member

@cgivre, please avoid adding large files to the project resources. One of the files you have added has a size of 57.5 MB (!) (GeoLite2-City.mmdb), and the other files have a size of several MB.
Also, since these files are binary, could you please explain how they were generated and what information they contain, since ASF has some restrictions regarding binary files: https://www.apache.org/legal/release-policy.html

@cgivre
Contributor Author

cgivre commented Aug 13, 2019

@vvysotskyi
Regarding the large files, they are the open version of the MaxMind geolocation database for IP addresses (https://dev.maxmind.com/geoip/geoip2/geolite2/). I put a note about this in the README, but these files are in widespread use in open source security tools, including, I believe, some Apache projects such as Metron.

@arina-ielchiieva
Member

@cgivre taking into account that these files are needed for only a couple of functions, and their large size, I think we should not allow them into Apache Drill. If this leads to not adding the Geo-IP functions, I think that is much better than enlarging the project size. If a user needs such functions, they can add them to the classpath.
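
For illustration, the classpath approach suggested here, combined with the earlier request for proper error handling, could look roughly like the sketch below. The class name ClasspathData and its methods are hypothetical and do not appear in the PR; the sketch relies only on the standard ClassLoader.getResource and getResourceAsStream calls, which return null when a resource is missing.

```java
import java.io.InputStream;

// Hypothetical helper: locate an optional data file that a user has put on the classpath.
final class ClasspathData {
  private static ClassLoader loader() {
    ClassLoader cl = Thread.currentThread().getContextClassLoader();
    return cl != null ? cl : ClasspathData.class.getClassLoader();
  }

  // True if the named resource is visible on the classpath.
  static boolean isAvailable(String resourceName) {
    return loader().getResource(resourceName) != null;
  }

  // Opens the resource, failing with a clear message instead of a later NullPointerException.
  static InputStream open(String resourceName) {
    InputStream in = loader().getResourceAsStream(resourceName);
    if (in == null) {
      throw new IllegalStateException(
          "Resource '" + resourceName + "' was not found on the classpath");
    }
    return in;
  }

  public static void main(String[] args) {
    // The file name below is only an example of a database a user might supply.
    System.out.println(isAvailable("GeoLite2-City.mmdb"));
  }
}
```

Checking for the data file up front lets a function report the missing dependency by name, rather than failing somewhere inside its setup code.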

@cgivre
Contributor Author

cgivre commented Aug 13, 2019 via email

@arina-ielchiieva
Member

arina-ielchiieva commented Aug 13, 2019

I think this brings too much overhead. If a user needs these functions, they can include a jar with the functions and files in the classpath. Having functions that require special data in the classpath in order to work is odd.

@cgivre
Contributor Author

cgivre commented Aug 13, 2019 via email

@arina-ielchiieva
Member

arina-ielchiieva commented Aug 13, 2019

My point is that it is strange and uncommon to have functions that do not work out of the box. As I said before, you can share these functions in your repo, and users can build them from source and add them to the Drill classpath if needed.

@cgivre cgivre closed this Aug 13, 2019
Labels
enhancement PRs that add a new functionality to Drill
3 participants