## Merge IBM and MaxMind network and location data

-----

The code in this notebook merges Internet and IBM network and location data for use in IBM Streaming Analytics, such as the NetflowViewer demonstration and cyber-security applications:

* Data for the Internet comes from CSV files provided by [MaxMind, Inc.](https://www.maxmind.com/en/home) as [GeoLite2 data](https://dev.maxmind.com/geoip/geoip2/geolite2/). This notebook downloads MaxMind's 'GeoLite2' data, which lists subnets on the Internet and their locations, including country, state/province/territory, city, latitude, and longitude.

* Data for the IBM internal network comes from a ['whois' service](http://whois.ibm.com) provided by AT&T. This notebook downloads 'whois' data that lists subnets in the IBM internal network and their locations, including country, state, city, and street address. These addresses are mapped to latitude and longitude with the Google Maps geocoding service.

The Google Maps geocoding service requires an API key for a Google account. To create a key, do this:

* In a browser, go to [Google](https://www.google.com/) and sign into an existing account or create a new account.

* Go to the [Google Geocoding Service](https://developers.google.com/maps/documentation/javascript/geocoding) page and follow the instructions to create a project and enable the geocoding API.

* Go to [Google Geocoding Service 'Get API Key'](https://developers.google.com/maps/documentation/geocoding/get-api-key), click on 'Get a Key', and then click the 'copy' button.

* Paste the copied key into the cell below as the value of the 'googlemapsKey' constant.

Google limits usage of their geocoding service to 2,500 requests per day, which is sufficient for two complete runs per day. The responses are cached and reused to avoid unnecessary use of Google's API.
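
The cache-first pattern described above can be sketched in isolation. This is a minimal illustration, not the notebook's own function: `cached_lookup` and `fake_geocode` are hypothetical names, the stub stands in for the Google Maps client so that no API key is needed, and the coordinates it returns are placeholders rather than real data.

```python
import json
import os
import tempfile

def cached_lookup(address, lookup, cache, cache_filename):
    """Return the cached response for an address, calling lookup() only on a cache miss."""
    if address not in cache:
        cache[address] = lookup(address)
        # persist the whole cache after every new response
        with open(cache_filename, 'w') as file:
            json.dump(cache, file)
    return cache[address]

# hypothetical stand-in for the geocoding service; it records each call it receives
calls = []
def fake_geocode(address):
    calls.append(address)
    return [{'geometry': {'location': {'lat': 0.0, 'lng': 0.0}}}]  # placeholder coordinates

cache_filename = os.path.join(tempfile.mkdtemp(), 'demo.cache.json')
cache = {}
first = cached_lookup('PLAZA CRONOS 1, MADRID, M, ES', fake_geocode, cache, cache_filename)
second = cached_lookup('PLAZA CRONOS 1, MADRID, M, ES', fake_geocode, cache, cache_filename)
print(len(calls))
```

The second lookup is served from the dictionary, so only the first one counts against the daily request limit.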

This notebook merges the IBM data into the MaxMind CSV files. It also generates a separate CSV file containing a [geohash code](https://en.wikipedia.org/wiki/Geohash) for each location's latitude/longitude. All of the resulting CSV files are packed into a ZIP file for transfer to Streaming Analytics projects:

* [GeoLite2-City-Blocks-IPv4.csv](merged/GeoLite2-City-Blocks-IPv4.csv)
* [GeoLite2-City-Blocks-IPv6.csv](merged/GeoLite2-City-Blocks-IPv6.csv)
* [GeoLite2-City-Locations-en.csv](merged/GeoLite2-City-Locations-en.csv)
* [GeoLite2-City-Geohashes-en.csv](merged/GeoLite2-City-Geohashes-en.csv)

There are instructions for each step of this notebook in the cells below.

-----
Run this cell once to install additional Python packages used in this notebook:

In [58]:
!pip install --user googlemaps
!pip install --user geohash2

You are using pip version 9.0.1, however version 9.0.2 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


-----
Run this cell each time the notebook is loaded to import Python packages and define some constants and functions used in this notebook:

In [59]:
import os
import re
import requests
from requests_file import FileAdapter
import json
import zipfile
import pandas as pd
import numpy as np
from io import BytesIO
from urllib.request import urlopen
import googlemaps
import geohash2
#####import ibm_boto3
#######from ibm_botocore.client import Config

# This constant points to the MaxMind 'GeoLite2' data.

maxmindGeoLite2URL = 'http://geolite.maxmind.com/download/geoip/database/GeoLite2-City-CSV.zip'

# This constant points to the IBM internal 'whois' service.

ibmWhoisLocationURL = 'http://whois.ibm.com:8080/whois/search.json?inverse-attribute=org&type-filter=domain&query-string=ORG-IBM1-IGA'
#ibmWhoisLocationURL = 'file:///Users/pring/git/MyGeohashDSXProject/whois.ibm.com_json_2018-03-23/ibm.domain.json'

# This constant contains a key for the Google Maps geocode API. 
# To create a key, follow the instructions at the top of this notebook.

googlemapsKey = 'AIzaSyAXrKyHrMa98L_e_CLtdi4UnQRPjHAEcYg'

# This constant contains the name of the JSON file where responses from the Google Maps
# geocode service are cached for reuse.

googlemapsCacheFilename = 'googlemaps.cache.json'

# This constant contains the name of a ZIP package that will contain the merged
# IBM and Maxmind data when this notebook has run to completion.

mergedPackageFilename = 'mergedIBMandMaxmindData.zip'

pd.set_option('display.max_rows', 15)

# This generator function iterates through the 'whois.ibm.com' data retrieved from the specified URL,
# yielding one 'dictionary' per entry until the end of the data is reached.

def getWhoisEntry(url):
    
    # create an HTTP session that can handle a 'file' URL and send the request
    session = requests.Session()
    session.mount('file://', FileAdapter())
    response = session.get(url, stream=True)

    # if the HTTP response does not specify an encoding, force 'UTF-8'
    if response.encoding is None: response.encoding = 'utf-8'
    
    # read the HTTP response one line at a time, looking for the entry delimiters,
    # and returning one entry, converted to a JSON object, each time the function is called,
    # until the end of the response is reached
    body = ''
    state = 'prefix'
    for line in response.iter_lines(decode_unicode=True):
        if not line: continue
        if line == '  "object" : [ {': # starting delimiter for first entry 
            state = 'body'
        elif line == '  }, {': # delimiter between entries
            if len(body)>0: yield json.loads( '{' + body + '}' )
            body = ''
        elif line == '  } ]': # ending delimiter for last entry 
            if len(body)>0:  yield json.loads( '{' + body + '}' )
            state = 'suffix'
            body = ''
        else:
            if state=='body': body += line + '\n'
            
    # the generator ends here, when the end of the response is reached

# This function uses the Google Maps geocode API to map an address
# to location data, including latitude and longitude. Google limits 
# use of the API, so its responses are cached in a local file and 
# reused to avoid unnecessary use.

def getGoogleMapsGeocode(address, client, cache, cacheFilename):
    
    ########return None

    # if this address has been found before, return cached data, otherwise ask Google Maps
    # and cache the data
    if address in cache: 
        geocode = cache[address]
    else:
        print('--------------> geocode(' + address + ')')
        geocode = client.geocode(address)
        cache[address] = geocode
        storeObjectToFile(cacheFilename, cache)
    
    # extract address components from geocode data into simple dictionary
    if geocode and len(geocode)>0 and 'address_components' in geocode[0]: 
        result = { component['types'][0]: component['long_name'] for component in geocode[0]['address_components'] }
        result['latitude'] = geocode[0]['geometry']['location']['lat']
        result['longitude'] = geocode[0]['geometry']['location']['lng']
    else:
        result = None
        
    # return simple dictionary of address and location components
    return result

# This function loads a Python object from a JSON file. It is used above
# to load cached responses from the Google Maps geocode API.

def loadObjectFromFile(filename):
    if os.path.exists(filename):
        with open(filename, 'r') as file: return json.load(file)
    else:
        return {}

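# This function stores a Python object in a JSON file. It is used above
# to cache responses from the Google Maps geocode API. The previous file is kept
# as '.old', and the new file is written as '.new' before being renamed into place.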

def storeObjectToFile(filename, cache):
    if os.path.exists(filename+'.new'): os.remove(filename+'.new')
    with open(filename+'.new', 'w') as file: json.dump(cache, file)
    if os.path.exists(filename+'.old'): os.remove(filename+'.old')
    if os.path.exists(filename): os.rename(filename, filename+'.old')
    os.rename(filename+'.new', filename)
    
# This function replaces any mis-encoded Unicode characters of the form "\xXX"
# in the specified "UTF-8" string with the corresponding Unicode character, 
# properly encoded for "UTF-8"
    
def fixUnicodeCharacters(xxx):
    yyy = re.sub(r'\\x(..)', lambda match: chr(int(match.group(1),16)), xxx)
    return yyy
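
A short worked example may make the escape replacement above concrete. This is a sketch under one stated variation: the pattern here accepts only hexadecimal digits (`[0-9A-Fa-f]{2}`) rather than any two characters, so a stray `\x` followed by non-hex text is left untouched instead of raising `ValueError`. The name `fix_escapes` is hypothetical, and the sample string is adapted from the addresses printed later in this notebook.

```python
import re

def fix_escapes(text):
    # replace each literal backslash-x-hex-hex sequence with the character it names
    return re.sub(r'\\x([0-9A-Fa-f]{2})', lambda match: chr(int(match.group(1), 16)), text)

raw = r'CERDANYOLA DEL VALL\xC9S'   # raw string: a literal backslash, 'x', 'C', '9'
fixed = fix_escapes(raw)
print(fixed)
```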
    

-----
Run the next cell to download all of the 'domain' entries from the IBM 'whois' service into a list. This avoids the 'Connection broken: IncompleteRead' errors from the service that occur after a few hundred iterations when processing while downloading.

In [60]:
print('running ...')

whoisDomains = []
for whoisEntry in getWhoisEntry(ibmWhoisLocationURL):
    if whoisEntry['type']=='domain': whoisDomains.append(whoisEntry)

print('... done')

running ...
... done
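
To show how the delimiter-based splitting in `getWhoisEntry` works without contacting the service, here is a minimal sketch that applies the same line matching to an in-memory list of lines. The function name `iter_entries` and the sample response are hypothetical; the real response contains many more fields per entry.

```python
import json

def iter_entries(lines):
    """Yield one dictionary per entry, splitting on the same delimiter lines as getWhoisEntry."""
    body = ''
    in_body = False
    for line in lines:
        if not line:
            continue
        if line == '  "object" : [ {':      # starting delimiter for the first entry
            in_body = True
        elif line == '  }, {':              # delimiter between entries
            if body:
                yield json.loads('{' + body + '}')
            body = ''
        elif line == '  } ]':               # ending delimiter for the last entry
            if body:
                yield json.loads('{' + body + '}')
            body = ''
            in_body = False
        elif in_body:
            body += line + '\n'

# hypothetical, heavily shortened response
sample = [
    '{',
    '  "object" : [ {',
    '    "type" : "domain",',
    '    "name" : "first"',
    '  }, {',
    '    "type" : "domain",',
    '    "name" : "second"',
    '  } ]',
    '}',
]
entries = list(iter_entries(sample))
print(entries)
```

Because each entry is parsed as soon as its closing delimiter arrives, the whole response never has to be held in memory as one JSON document. The approach does depend on the service's exact indentation of the delimiter lines.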


-----
Run the next cell to iterate through the list of 'domain' entries downloaded from the IBM 'whois' service above and collect IBM internal network and location data:

In [61]:
print('running ...')

# data for IBM networks and locations are collected in these lists
ibmNetworksIPv4 = []
ibmNetworksIPv6 = []
ibmLocations = []

# create a client for the Google Maps geocode service, and load any cached 
# responses from previous runs of this cell
googlemapsClient = googlemaps.Client(key=googlemapsKey)
googlemapsCache = loadObjectFromFile(googlemapsCacheFilename)
        
# iterate through the list of domain entries from IBM's 'whois' service
# downloaded above, collecting data on its locations and the subnets 
# assigned to each location
for whoisEntry in whoisDomains:

    # collect MaxMind location fields in this dictionary
    location = {}

    # skip 'whois' entries that are not IBM locations
    primaryKey = whoisEntry['primary-key']['attribute'][0]['value']
    if not re.fullmatch(r'[A-Z0-9]{3}', primaryKey): continue
    
    # extract fields of interest from 'whois' entries into a simple dictionary
    whois = {}
    for whoisAttribute in whoisEntry['attributes']['attribute']:  
        if whoisAttribute['name']=='remarks':
            match = re.fullmatch(r'RESO_(\w+)\.... = (.*)', whoisAttribute['value'])
            if match: whois[match.group(1)] = fixUnicodeCharacters(match.group(2))
            match = re.fullmatch(r'Prefix_List\.(\w+)\.... = (.*)', whoisAttribute['value'])
            if match and not match.group(2).endswith('NO_PREFIXES_FOUND'): whois[match.group(1)] = match.group(2).split(', ')
       
    # skip 'whois' entries that are not actual IBM locations
    if 'Country' not in whois: continue
    
    # construct string representation of location address from 'whois' fields
    address = whois['Country']
    if 'State' in whois: address = whois['State'] + ', ' + address        
    if 'City' in whois and whois['City']!='NO CITY': address = whois['City'] + ', ' + address
    if  'Address1' in whois and not whois['Address1'].startswith('NO IBM LOCATION'):
        if 'Address2' in whois: address = whois['Address2'] + ', ' + address
        if 'Address1' in whois: address = whois['Address1'] + ', ' + address
        ######if 'PostalCode' in whois: address = address  + ', ' + whois['PostalCode'] 
    print(primaryKey + ': ' + address)
        
    # get geocode data for the location from Google Maps, and skip locations for which none is found
    geocode = getGoogleMapsGeocode(address, googlemapsClient, googlemapsCache, googlemapsCacheFilename)
    if not geocode: continue
        
    # copy 'whois' fields that correspond to MaxMind location fields
    location['geoname_id'] = primaryKey
    location['country_iso_code'] = whois['Country']
    if 'country' in geocode: location['country_name'] = geocode['country']
    if 'State' in whois: location['subdivision_1_iso_code'] = whois['State']
    if 'administrative_area_level_1' in geocode: location['subdivision_1_name'] = geocode['administrative_area_level_1']
    if 'City' in whois: location['city_name'] = 'IBM ' + whois['City']
    ####if 'locality' in geocode: location['city_name'] = 'IBM ' + geocode['locality']
    
    # add each IPv4 and IPv6 subnet at this location to their respective lists
    if 'IPv4' in whois:
        for cidr in whois['IPv4']:
            if not cidr.startswith('9.'): continue
            network = { 'network': cidr, 'geoname_id': primaryKey, 'latitude': geocode['latitude'], 'longitude': geocode['longitude'] }
            if 'PostalCode' in whois: network['postal_code'] = whois['PostalCode']
            ibmNetworksIPv4.append(network)
    if 'IPv6' in whois:
        for cidr in whois['IPv6']:
            if not cidr.startswith('2620:1F7:'): continue
            network = { 'network': cidr, 'geoname_id': primaryKey, 'latitude': geocode['latitude'], 'longitude': geocode['longitude'] }
            if 'PostalCode' in whois: network['postal_code'] = whois['PostalCode']
            ibmNetworksIPv6.append(network)
            
    # add this location to the list
    ibmLocations.append(location)
        
print('... done')

running ...
009: RUA DO PROLETARIADO 14/1, ALFRAGIDE, 11, PT
--------------> geocode(RUA DO PROLETARIADO 14/1, ALFRAGIDE, 11, PT)
00A: AVDA DE LA PALMERA 19, SEVILLA, SE, ES
--------------> geocode(AVDA DE LA PALMERA 19, SEVILLA, SE, ES)
00H: PLAZA CRONOS 1, MADRID, M, ES
--------------> geocode(PLAZA CRONOS 1, MADRID, M, ES)
00J: C/ SAKURA, 8, SAN FRUITÒS DEL BALGÈS, B, ES
--------------> geocode(C/ SAKURA, 8, SAN FRUITÒS DEL BALGÈS, B, ES)
00O: CALLE YECORA 4, MADRID, M, ES
--------------> geocode(CALLE YECORA 4, MADRID, M, ES)
00T: PARCELA PC10601 - PARC DE L'ALBA, CERDANYOLA DEL VALLÉS, BARCELONA, B, ES
--------------> geocode(PARCELA PC10601 - PARC DE L'ALBA, CERDANYOLA DEL VALLÉS, BARCELONA, B, ES)
00U: GRAN VÍA ASIMA 20, OFICINA 35. POLÍGONO SON CASTEL, PALMA DE MALLORCA, PM, ES
--------------> geocode(GRAN VÍA ASIMA 20, OFICINA 35. POLÍGONO SON CASTEL, PALMA DE MALLORCA, PM, ES)
00W: PLAZA EUSKADI 5, BILBAO, PV, ES
--------------> geocode(PLAZA EUSKADI 5, BILBAO, PV, ES)
01D: A

In [62]:
print('running ...')
   
# store IBM location data in a CSV file
columns = ['geoname_id','country_iso_code','country_name','subdivision_1_iso_code','subdivision_1_name','city_name']
pd.DataFrame(ibmLocations).to_csv('test.ibmLocations.csv', index=False, float_format='%.9g', columns=columns)

# store IBM network data in a CSV file
columns = ['network','geoname_id','latitude','longitude','postal_code']
pd.DataFrame(ibmNetworksIPv4).to_csv('test.ibmNetworksIPv4.csv', index=False, float_format='%.9g', columns=columns)
pd.DataFrame(ibmNetworksIPv6).to_csv('test.ibmNetworksIPv6.csv', index=False, float_format='%.9g', columns=columns)

print('... done')

running ...
... done


-----
Run the next cell to download Internet network and location data from MaxMind:

In [63]:
print('running ...')

with urlopen(maxmindGeoLite2URL) as response:
    with zipfile.ZipFile(BytesIO(response.read())) as file:
        file.extractall()

# find the newest directory, in case there are old directories left over from previous runs
maxmindDirectory = sorted( [ f for f in os.listdir() if os.path.isdir(f) and f.startswith('GeoLite2-City-CSV') ] )[-1]

# load the MaxMind network and location data 
maxmindNetworksIPv4 = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Blocks-IPv4.csv', header=0, dtype=str)
maxmindNetworksIPv6 = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Blocks-IPv6.csv', header=0, dtype=str)
maxmindLocations = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Locations-en.csv', header=0, dtype=str)

print('... done')

running ...
... done


-----
Run the next cell to merge the MaxMind and IBM data. The merged data will be written into CSV files in the 'merged' directory, using the filenames of the original MaxMind CSV files.

In [64]:
print('running ...')

# create a directory for the merged MaxMind+IBM CSV files
os.makedirs('merged', exist_ok=True)

# merge the MaxMind and IBM network data and store the result in CSV files

# drop MaxMind's rows for the '9.' and '2620:1f7:' blocks, then add the IBM rows in their place
# (pd.concat is used because DataFrame.append was removed in pandas 2.0)
maxmindNetworksIPv4 = maxmindNetworksIPv4[ ~ maxmindNetworksIPv4['network'].str.startswith('9.') ]
mergedNetworksIPv4 = pd.concat([maxmindNetworksIPv4, pd.DataFrame(ibmNetworksIPv4)], ignore_index=True)
mergedNetworksIPv4.to_csv('merged/GeoLite2-City-Blocks-IPv4.csv', index=False, float_format='%.9g', columns=maxmindNetworksIPv4.columns)

# the IPv6 prefix is matched without regard to the case of its hexadecimal digits
maxmindNetworksIPv6 = maxmindNetworksIPv6[ ~ maxmindNetworksIPv6['network'].str.lower().str.startswith('2620:1f7:') ]
mergedNetworksIPv6 = pd.concat([maxmindNetworksIPv6, pd.DataFrame(ibmNetworksIPv6)], ignore_index=True)
mergedNetworksIPv6.to_csv('merged/GeoLite2-City-Blocks-IPv6.csv', index=False, float_format='%.9g', columns=maxmindNetworksIPv6.columns)

# merge the MaxMind and IBM location data and store the result in a CSV file
mergedLocations = pd.concat([maxmindLocations, pd.DataFrame(ibmLocations)], ignore_index=True)
mergedLocations.to_csv('merged/GeoLite2-City-Locations-en.csv', index=False, float_format='%.9g', columns=maxmindLocations.columns)

print('... done')

running ...
... done
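
The cell above selects IBM's address blocks by matching the text of each CIDR string. As a hedged alternative, the standard `ipaddress` module can test block membership numerically, which does not depend on letter case or on how the address happens to be written. In this sketch the function name `is_ibm_network` is hypothetical, and the block `2620:1f7::/32` is an assumption inferred from the string prefix `2620:1F7:` used in this notebook, not a confirmed allocation.

```python
import ipaddress

# address blocks treated as IBM internal; the IPv6 block is inferred from the
# string prefix used in this notebook and is an assumption
IBM_BLOCKS = [
    ipaddress.ip_network('9.0.0.0/8'),
    ipaddress.ip_network('2620:1f7::/32'),
]

def is_ibm_network(cidr):
    """Return True if the CIDR string lies inside one of the IBM blocks."""
    network = ipaddress.ip_network(cidr, strict=False)
    # compare only against blocks of the same IP version, since subnet_of()
    # raises TypeError when IPv4 and IPv6 networks are mixed
    return any(network.version == block.version and network.subnet_of(block)
               for block in IBM_BLOCKS)

for cidr in ['9.12.0.0/16', '2620:1F7:1::/48', '8.8.8.0/24']:
    print(cidr, is_ibm_network(cidr))
```

Parsing every row this way is slower than a vectorized string match on a frame with millions of rows, so the string prefix remains a reasonable choice for the full MaxMind data.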


-----
Run the next cell to calculate [geohash codes](https://en.wikipedia.org/wiki/Geohash) for the latitude/longitude coordinates of merged MaxMind and IBM locations. The geohashes, coordinates, and location data will be written into a CSV file in the 'merged' directory named 'GeoLite2-City-Geohashes-en.csv'.

In [65]:
%%script false

print('running ...')

# create a frame of locations indexed by ID number
mergedLocationsIndexed = mergedLocations.set_index('geoname_id')

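# create a frame of geographical coordinates, that is, ID number, latitude, and longitude,
# from the merged IPv4 and IPv6 networks, skipping rows that lack an ID or coordinates
mergedNetworks = pd.concat([mergedNetworksIPv4, mergedNetworksIPv6], ignore_index=True)
mergedCoordinates = mergedNetworks[['geoname_id','latitude','longitude']].dropna()
mergedCoordinates = mergedCoordinates.astype({'latitude': float, 'longitude': float}).drop_duplicates()

# merge location and coordinate data and calculate geohash for each location's coordinates
mergedGeohashes = mergedCoordinates.join(mergedLocationsIndexed, on='geoname_id')
mergedGeohashes['geohash'] = mergedGeohashes.apply(lambda row: geohash2.encode(row['latitude'],row['longitude'],precision=6),axis=1)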

# store the result in a CSV file
columns = ['geohash','latitude','longitude','geoname_id','country_iso_code','country_name','subdivision_1_iso_code','subdivision_1_name','subdivision_2_iso_code','subdivision_2_name','city_name']
mergedGeohashes.to_csv('merged/GeoLite2-City-Geohashes-en.csv', index=False, float_format='%.9g', columns=columns)
                           
print('... done')
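
For readers unfamiliar with geohashes, the encoding can be sketched in a few lines of plain Python. This is an independent illustration of the standard algorithm, not the code of the `geohash2` package: the coordinate ranges are halved alternately for longitude and latitude, each halving contributes one bit, and every five bits select one base-32 character. The function name `geohash_encode` is hypothetical, and the sample coordinates are the example used in the Wikipedia article linked above.

```python
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'

def geohash_encode(latitude, longitude, precision=6):
    """Encode a latitude/longitude pair as a geohash of the given length."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_longitude = True  # bits alternate, starting with longitude
    while len(bits) < precision * 5:
        value, bounds = (longitude, lon_range) if use_longitude else (latitude, lat_range)
        middle = (bounds[0] + bounds[1]) / 2
        if value >= middle:
            bits.append(1)
            bounds[0] = middle   # keep the upper half
        else:
            bits.append(0)
            bounds[1] = middle   # keep the lower half
        use_longitude = not use_longitude
    # pack each group of five bits into one base-32 character
    characters = []
    for start in range(0, len(bits), 5):
        index = 0
        for bit in bits[start:start + 5]:
            index = index * 2 + bit
        characters.append(BASE32[index])
    return ''.join(characters)

code = geohash_encode(57.64911, 10.40744, precision=6)
print(code)
```

A shorter geohash is always a prefix of a longer one for the same point, which is why nearby locations tend to share leading characters.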

-----
Run the next cell to pack the merged CSV files in the 'merged' directory into a ZIP file for export to systems running IBM Streams applications with the IPAddressLocation operator:

In [66]:
print('running ...')

# pack all result files into a ZIP package
with zipfile.ZipFile(mergedPackageFilename, 'w', compression=zipfile.ZIP_DEFLATED) as zipFile:
    for file in os.listdir('merged'):
        zipFile.write('merged/'+file, file)

print('... done')

running ...
... done
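
Before transferring the package, it can be worth confirming what the archive contains. The sketch below repeats the packing step on two throwaway files in a temporary directory and then reads the archive back; the file names and contents are hypothetical stand-ins for the merged CSV files.

```python
import os
import tempfile
import zipfile

# build a throwaway 'merged' directory with two small stand-in files
workdir = tempfile.mkdtemp()
merged_dir = os.path.join(workdir, 'merged')
os.makedirs(merged_dir)
for name in ('a.csv', 'b.csv'):
    with open(os.path.join(merged_dir, name), 'w') as file:
        file.write('network,geoname_id\n')

# pack the files the same way as the cell above: stored under their bare names
package = os.path.join(workdir, 'demo.zip')
with zipfile.ZipFile(package, 'w', compression=zipfile.ZIP_DEFLATED) as zip_file:
    for name in sorted(os.listdir(merged_dir)):
        zip_file.write(os.path.join(merged_dir, name), name)

# read the archive back: list its members and check their checksums
with zipfile.ZipFile(package) as zip_file:
    packed = sorted(zip_file.namelist())
    corrupt = zip_file.testzip()   # None means every member passed its CRC check
print(packed, corrupt)
```

Passing the bare file name as the second argument to `write` keeps the 'merged/' directory prefix out of the archive, so the CSV files unpack directly into the target directory.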


-----
Run the next cell to upload the ZIP package containing the merged CSV files from DSX to IBM Cloud Object Storage:

In [67]:
%%script false

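# these packages are needed only for this optional upload cell
import ibm_boto3
from ibm_botocore.client import Config

# credentials for bucket in IBM Cloud Object Storage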
credentials = {
    'IBM_API_KEY_ID': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'IAM_SERVICE_ID': 'iam-ServiceId-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.ng.bluemix.net/oidc/token',
    'BUCKET': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'FILE': 'report_IGA_Global_Q1_2016.xlsx'
}

# create a Cloud Object Store HTTP client with the bucket's credentials
cosClient = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

# write the ZIP file to the notebook's bucket in Cloud Object Storage
cosClient.upload_file(Filename=mergedPackageFilename, Bucket=credentials['BUCKET'], Key=mergedPackageFilename)

-----
To download the ZIP package containing the merged CSV files from DSX to your laptop, do this:

* In a browser, go to this notebook's project page.

* Open the 'Files' panel by clicking the 'Find and Add Data' icon in the upper-right corner of the project page.

* Check the box next to 'mergedIBMandMaxmindData.zip'.

* Select 'Download' from the pop-up menu in the 'Files' panel.
