## Merge Internet and IBM internal network geographical data

-----

The code in this notebook merges Internet and IBM network geographical data for use in IBM Streaming Analytics applications, such as the NetflowViewer demonstration and cyber-security applications:

* Data for the Internet comes from CSV files provided by [MaxMind, Inc.](https://www.maxmind.com/en/home) as [GeoLite2 data](https://dev.maxmind.com/geoip/geoip2/geolite2/). This notebook downloads MaxMind's 'GeoLite2' data, which lists subnets and locations in the Internet, including country, state/province/territory, city, latitude, and longitude. 

* Data for the IBM internal network comes from a ['whois' service](http://whois.ibm.com) provided by AT&T. This notebook downloads 'whois' data that lists the subnets and locations in the IBM internal network, including country, state, city, and street address. This data is mapped to longitude and latitude with the Google Maps geocoding service. 

The Google Maps geocoding service requires an API key for a Google account. To create a key, do this:

* In a browser, go to [Google](https://www.google.com/) and sign into an existing account or create a new account.

* Go to the [Google Geocoding Service](https://developers.google.com/maps/documentation/javascript/geocoding) page and follow the instructions to create a project and enable the geocoding API.

* Go to [Google Geocoding Service 'Get API Key'](https://developers.google.com/maps/documentation/geocoding/get-api-key), click on 'Get a Key', and then click the 'copy' button.

* Paste the copied key into the cell below as the value of the 'googlemapsKey' constant.

Google limits usage of their geocoding service to 2,500 requests per day, which is sufficient for two complete runs per day. The responses are cached and reused to avoid unnecessary use of Google's API.
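The caching described above can be sketched independently of Google Maps. The following is a minimal sketch, not the notebook's own code: `fake_lookup` stands in for the real geocoding call, and `load_cache`, `store_cache`, `cached_lookup`, and the file name are hypothetical names chosen for illustration. It writes the cache to a temporary file and swaps it into place with `os.replace`, so an interrupted run should not leave a half-written cache file behind.

```python
import json
import os
import tempfile

def load_cache(filename):
    """Return the cached dictionary, or an empty one if no cache file exists yet."""
    if os.path.exists(filename):
        with open(filename, 'r') as file:
            return json.load(file)
    return {}

def store_cache(filename, cache):
    """Write the cache to a temporary file, then replace the old cache in one step."""
    with open(filename + '.new', 'w') as file:
        json.dump(cache, file)
    os.replace(filename + '.new', filename)

def cached_lookup(address, cache, filename, lookup):
    """Return the cached result for 'address', calling 'lookup' only on a cache miss."""
    if address not in cache:
        cache[address] = lookup(address)
        store_cache(filename, cache)
    return cache[address]

# usage with a stand-in for the real geocoding call (hypothetical data)
calls = []

def fake_lookup(address):
    calls.append(address)
    return {'latitude': 40.0, 'longitude': -3.0}

cache_file = os.path.join(tempfile.mkdtemp(), 'example.cache.json')
cache = load_cache(cache_file)
first = cached_lookup('IBM, MADRID, ES', cache, cache_file, fake_lookup)
second = cached_lookup('IBM, MADRID, ES', cache, cache_file, fake_lookup)
print(len(calls))  # the stand-in is called only for the first request
```

Passing the lookup function as an argument keeps the cache logic testable without a network connection or an API key.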

This notebook merges the IBM data into the MaxMind CSV files. It also generates a separate CSV file containing a [geohash code](https://en.wikipedia.org/wiki/Geohash) for each location's latitude/longitude. All of the resulting CSV files are packed into a ZIP file for transfer to Streaming Analytics projects:

* [GeoLite2-City-Blocks-IPv4.csv](merged/GeoLite2-City-Blocks-IPv4.csv)
* [GeoLite2-City-Blocks-IPv6.csv](merged/GeoLite2-City-Blocks-IPv6.csv)
* [GeoLite2-City-Locations-en.csv](merged/GeoLite2-City-Locations-en.csv)
* [GeoLite2-City-Geohashes-en.csv](merged/GeoLite2-City-Geohashes-en.csv)
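To illustrate what a geohash code is, here is a plain-Python sketch of the standard geohash encoding: it repeatedly halves the longitude and latitude ranges, records one bit per halving, and emits one base-32 character per five bits. This is not the `geohash2` implementation that the notebook uses, and `geohash_encode` is a hypothetical name; the sample coordinates are the Tokyo entry that appears in the notebook's own output.

```python
BASE32 = '0123456789bcdefghjkmnpqrstuvwxyz'

def geohash_encode(latitude, longitude, precision=6):
    """Encode a latitude/longitude pair as a geohash of 'precision' characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    value = 0
    use_longitude = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_longitude:
            rng, coordinate = lon_range, longitude
        else:
            rng, coordinate = lat_range, latitude
        mid = (rng[0] + rng[1]) / 2
        if coordinate >= mid:
            value = (value << 1) | 1  # upper half: emit a 1 bit
            rng[0] = mid
        else:
            value = value << 1        # lower half: emit a 0 bit
            rng[1] = mid
        use_longitude = not use_longitude
        bits += 1
        if bits == 5:                 # five bits make one base-32 character
            chars.append(BASE32[value])
            bits = 0
            value = 0
    return ''.join(chars)

print(geohash_encode(35.685, 139.7514, precision=6))  # xn77h0
```

Nearby locations share a common prefix, which is what makes geohashes convenient keys for grouping locations by area.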

There are instructions for each step of this notebook in the cells below.

-----
Run this cell once to install additional Python packages used in this notebook:

In [183]:
!pip install --user googlemaps
!pip install --user geohash2
!pip install --user requests-file



-----
Run this cell each time the notebook is loaded to import Python packages and define some constants and functions used in this notebook:

In [155]:
import os
import re
import json
import shutil
import zipfile
from io import BytesIO
from urllib.request import urlopen

import requests
from requests_file import FileAdapter
import pandas as pd
import numpy as np
import googlemaps
import geohash2

# import ibm_boto3
# from ibm_botocore.client import Config

# This constant points to the MaxMind 'GeoLite2' data.

maxmindGeoLite2URL = 'http://geolite.maxmind.com/download/geoip/database/GeoLite2-City-CSV.zip'

# This constant points to the IBM internal 'whois' service.

#whoisLocationURL = 'http://whois.ibm.com:8080/whois/search.json?inverse-attribute=org&type-filter=domain&query-string=ORG-IBM1-IGA'
ibmWhoisLocationURL = 'file:///Users/pring/git/MyGeohashDSXProject/whois.ibm.com_json_2018-03-23/ibm.domain.json'

# This constant contains a key for the Google Maps geocode API. 
# To create a key, follow the instructions at the top of this notebook.

googlemapsKey = 'YOUR_GOOGLE_MAPS_API_KEY'

# This constant contains the name of the file where responses from the Google Maps
# geocode service are cached for reuse.

googlemapsCacheFilename = 'googlemaps.cache.json'

pd.set_option('display.max_rows', 15)

# This generator function iterates through the 'whois.ibm.com' data retrieved from the
# specified URL, yielding one 'dictionary' per entry until the end of the data is reached.

def getWhoisEntry(url):
    
    # create an HTTP session that can handle a 'file' URL and send the request
    session = requests.Session()
    session.mount('file://', FileAdapter())
    response = session.get(url, stream=True)

    # if the HTTP response does not specify an encoding, force 'UTF-8'
    if response.encoding is None: response.encoding = 'utf-8'
    
    # read the HTTP response one line at a time, looking for the entry delimiters,
    # and returning one entry, converted to a JSON object, each time the function is called,
    # until the end of the response is reached
    body = ''
    state = 'prefix'
    for line in response.iter_lines(decode_unicode=True):
        if not line: continue
        if line == '  "object" : [ {': # starting delimiter for first entry 
            state = 'body'
        elif line == '  }, {': # delimiter between entries
            if len(body)>0: yield json.loads( '{' + body + '}' )
            body = ''
        elif line == '  } ]': # ending delimiter for last entry 
            if len(body)>0:  yield json.loads( '{' + body + '}' )
            state = 'suffix'
            body = ''
        else:
            if state=='body': body += line + '\n'
            
    # the generator ends when the end of the response is reached

# This function uses the Google Maps geocode API to map an address
# to location data, including latitude and longitude. Google limits 
# use of the API, so its responses are cached in a local file and 
# reused to avoid unnecessary use.

def getGoogleMapsGeocode(client, address):
    
    # if this address has been found before, return cached data, otherwise ask Google Maps
    # and cache the data
    if address in googlemapsCache: 
        geocode = googlemapsCache[address]
    else:
        print('--------------> geocode(' + address + ')')
        geocode = client.geocode(address)
        googlemapsCache[address] = geocode
        storeObjectToFile(googlemapsCacheFilename, googlemapsCache)
    
    # extract address components from geocode data into simple dictionary
    if geocode and len(geocode)>0 and 'address_components' in geocode[0]: 
        result = { component['types'][0]: component['long_name'] for component in geocode[0]['address_components'] }
        result['latitude'] = geocode[0]['geometry']['location']['lat']
        result['longitude'] = geocode[0]['geometry']['location']['lng']
    else:
        result = None
        
    # return simple dictionary of address and location components
    return result

# This function loads a Python object from a JSON file. It is used above
# to cache responses from the Google Maps geocode API.

def loadObjectFromFile(filename):
    if os.path.exists(filename):
        with open(filename, 'r') as file: return json.load(file)
    else:
        return {}

# This function stores a Python object in a JSON file. It is used above
# to cache responses from the Google Maps geocode API.

def storeObjectToFile(filename, cache):
    if os.path.exists(filename+'.new'): os.remove(filename+'.new')
    with open(filename+'.new', 'w') as file: json.dump(cache, file)
    if os.path.exists(filename+'.old'): os.remove(filename+'.old')
    if os.path.exists(filename): os.rename(filename, filename+'.old')
    os.rename(filename+'.new', filename)
    
# This function replaces any mis-encoded Unicode characters of the form "\xXX"
# (where XX is a two-digit hexadecimal code) in the specified string with the
# corresponding Unicode character.
    
def fixUnicodeCharacters(text):
    return re.sub(r'\\x([0-9A-Fa-f]{2})', lambda match: chr(int(match.group(1),16)), text)
    

In [143]:


ibmNetworksIPv4 = []
ibmNetworksIPv6 = []
ibmLocations = []

print('running ...')

googlemapsClient = googlemaps.Client(key=googlemapsKey)

googlemapsCache = loadObjectFromFile(googlemapsCacheFilename)

for whoisEntry in getWhoisEntry(ibmWhoisLocationURL):

    # collect MaxMind location fields in this dictionary
    location = {}

    # skip 'whois' entries that are not IBM locations
    if whoisEntry['type']!='domain': continue
    primaryKey = whoisEntry['primary-key']['attribute'][0]['value']
    if not re.fullmatch(r'[A-Z0-9]{3}', primaryKey): continue
    print(primaryKey)
    
    # extract fields of interest from 'whois' entries into a simple dictionary
    whois = {}
    for whoisAttribute in whoisEntry['attributes']['attribute']:  
        if whoisAttribute['name']=='remarks':
            match = re.fullmatch(r'RESO_(\w+)\.... = (.*)', whoisAttribute['value'])
            if match: whois[match.group(1)] = fixUnicodeCharacters(match.group(2))
            match = re.fullmatch(r'Prefix_List\.(\w+)\.... = (.*)', whoisAttribute['value'])
            if match and not match.group(2).endswith('NO_PREFIXES_FOUND'): whois[match.group(1)] = match.group(2).split(', ')
       
    # skip 'whois' entries that are not actual IBM locations
    if 'Country' not in whois: continue
    ########print(whois)  
    
    # MaxMind location column headers:
    # geoname_id,locale_code,continent_code,continent_name,country_iso_code,country_name,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,city_name,metro_code,time_zone,is_in_european_union
    
    # copy 'whois' fields that correspond to MaxMind location fields
    location['geoname_id'] = primaryKey
    location['country_iso_code'] = whois['Country']
    if 'State' in whois: location['subdivision_1_iso_code'] = whois['State']
    if 'City' in whois: location['city_name'] = whois['City']

    # construct string representation of location address from 'whois' fields
    address = whois['Country']
    if 'State' in whois: address = whois['State'] + ', ' + address        
    if 'City' in whois and whois['City']!='NO CITY': address = whois['City'] + ', ' + address
    if 'Address1' in whois and not whois['Address1'].startswith('NO IBM LOCATION'):
        if 'Address2' in whois: address = whois['Address2'] + ', ' + address
        address = whois['Address1'] + ', ' + address
        if 'PostalCode' in whois: address = address + ', ' + whois['PostalCode']
        address = 'IBM, ' + address
    print(address)
        
    # get geocode data for location from Google Maps    
    geocode = getGoogleMapsGeocode(googlemapsClient, address)
    if geocode:
        #########print(geocode)
        if 'country' in geocode: location['country_name'] = geocode['country']
        if 'administrative_area_level_1' in geocode: location['subdivision_1_name'] = geocode['administrative_area_level_1']
        #####if 'latitude' in geocode: location['latitude'] = geocode['latitude']
        ##########if 'longitude' in geocode: location['longitude'] = geocode['longitude']
    
    # MaxMind network column headers:
    # network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider,postal_code,latitude,longitude,accuracy_radius

    # add each IPv4 and IPv6 subnet at this location to their respective lists
    if 'IPv4' in whois:
        for cidr in whois['IPv4']:
            if not cidr.startswith('9.'): continue
            network = { 'network': cidr, 'geoname_id': primaryKey }
            if 'PostalCode' in whois: network['postal_code'] = whois['PostalCode']
            if geocode and 'latitude' in geocode: network['latitude'] = geocode['latitude']
            if geocode and 'longitude' in geocode: network['longitude'] = geocode['longitude']
            ibmNetworksIPv4.append(network)
    if 'IPv6' in whois:
        for cidr in whois['IPv6']:
            if not cidr.startswith('2620:1F7:'): continue
            network = { 'network': cidr, 'geoname_id': primaryKey }
            if 'PostalCode' in whois: network['postal_code'] = whois['PostalCode']
            if geocode and 'latitude' in geocode: network['latitude'] = geocode['latitude']
            if geocode and 'longitude' in geocode: network['longitude'] = geocode['longitude']
            ibmNetworksIPv6.append(network)
            
    # add this location to the list
    ibmLocations.append(location)
        
print('... done')

running ...
009
IBM, RUA DO PROLETARIADO 14/1, ALFRAGIDE, 11, PT, 2795
00A
IBM, AVDA DE LA PALMERA 19, SEVILLA, SE, ES, 41013
00H
IBM, PLAZA CRONOS 1, MADRID, M, ES, 28037
00J
IBM, C/ SAKURA, 8, SAN FRUITÒS DEL BALGÈS, B, ES, 08202
00O
IBM, CALLE YECORA 4, MADRID, M, ES, 28022
00T
IBM, PARCELA PC10601 - PARC DE L'ALBA, CERDANYOLA DEL VALLÉS, BARCELONA, B, ES, 08290
00U
IBM, GRAN VÍA ASIMA 20, OFICINA 35. POLÍGONO SON CASTEL, PALMA DE MALLORCA, PM, ES, 07009
00W
IBM, PLAZA EUSKADI 5, BILBAO, PV, ES, 48009
01D
IBM, AVDA DIAGONAL 571, BARCELONA, B, ES, 08029
01E
IBM, AVDA DE ALGORTA NO 16, GETXO-VIZCAYA, BI, ES, 48990
01G
IBM, AV BRUSELAS, 20, ALCOBENDAS, M, ES, 28108
01H
IBM, TALES DE MILETO 1, ALCALA DE HENARES, M, ES, 28806
01I
IBM, CALLE MARÍA TUBAU 3, MADRID, M, ES, 28050
01J
IBM, CALLE MASQUEFA 58, BAJO DERECHA, VALENCIA, V, ES, 46020
01K
IBM, POLÍGONO DE POCOMACO, SECTOR I-1, N, LA CORUÑA, C, ES, 15190
01L
IBM, POLÍGONO INDUSTRIAL SANTA CLARA, CALLE B, LOCAL 4, AV/ SANTA CLARA DE C
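
The 'remarks' parsing in the cell above can be tried in isolation. The sketch below applies the same two regular expressions to a few made-up remark strings. The sample values, including the `.001` suffixes and the subnets, are hypothetical and only imitate the shape the patterns expect; they are not real 'whois' output, and `parse_remarks` is a hypothetical helper.

```python
import re

def parse_remarks(remarks):
    """Collect RESO_* fields and Prefix_List.* subnet lists from 'remarks' strings."""
    fields = {}
    for remark in remarks:
        match = re.fullmatch(r'RESO_(\w+)\.... = (.*)', remark)
        if match:
            # replace literal "\xXX" sequences with the character they stand for
            fields[match.group(1)] = re.sub(
                r'\\x([0-9A-Fa-f]{2})',
                lambda m: chr(int(m.group(1), 16)),
                match.group(2))
        match = re.fullmatch(r'Prefix_List\.(\w+)\.... = (.*)', remark)
        if match and not match.group(2).endswith('NO_PREFIXES_FOUND'):
            fields[match.group(1)] = match.group(2).split(', ')
    return fields

# hypothetical remark strings in the shape the patterns expect
sample = [
    'RESO_Country.001 = ES',
    'RESO_City.001 = LA CORU\\xD1A',
    'Prefix_List.IPv4.001 = 9.0.0.0/24, 9.0.1.0/24',
    'Prefix_List.IPv6.001 = NO_PREFIXES_FOUND',
]
parsed = parse_remarks(sample)
print(parsed)
```

Entries without a 'Country' field after this step are the ones the notebook skips as not being actual IBM locations.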

In [167]:

# store IBM location data in a CSV file
c = ['geoname_id','country_iso_code','country_name','subdivision_1_iso_code','subdivision_1_name','city_name']
pd.DataFrame(ibmLocations).to_csv('test.ibmLocations.csv', index=False, float_format='%.9g', columns=c)

# store IBM network data in a CSV file
c = ['network','geoname_id','latitude','longitude','postal_code']
pd.DataFrame(ibmNetworksIPv4).to_csv('test.ibmNetworksIPv4.csv', index=False, float_format='%.9g', columns=c)
pd.DataFrame(ibmNetworksIPv6).to_csv('test.ibmNetworksIPv6.csv', index=False, float_format='%.9g', columns=c)


-----
Run the next cell to load Internet network and location data from MaxMind:

In [159]:

print('running ...')

with urlopen(maxmindGeoLite2URL) as response:
    with zipfile.ZipFile(BytesIO(response.read())) as file:
        file.extractall()

# find the newest directory, in case there are old directories left over from previous runs
maxmindDirectory = sorted( [ f for f in os.listdir() if os.path.isdir(f) and f.startswith('GeoLite2-City-CSV') ] )[-1]

# load the MaxMind network and location data 
maxmindNetworksIPv4 = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Blocks-IPv4.csv', header=0, dtype=str)
maxmindNetworksIPv6 = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Blocks-IPv6.csv', header=0, dtype=str)
maxmindLocations = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Locations-en.csv', header=0, dtype=str)

print('... done')

running ...
... done


In [160]:
maxmindNetworksIPv4

Unnamed: 0,network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider,postal_code,latitude,longitude,accuracy_radius
0,1.0.0.0/24,2151718,2077456,,0,0,3095,-37.7000,145.1833,1000
1,1.0.1.0/24,1810821,1814991,,0,0,,26.0614,119.3061,50
2,1.0.2.0/23,1810821,1814991,,0,0,,26.0614,119.3061,50
3,1.0.4.0/22,2077456,2077456,,0,0,,-33.4940,143.2104,1000
4,1.0.8.0/21,1809858,1814991,,0,0,,23.1167,113.2500,50
5,1.0.16.0/20,1850147,1861060,,0,0,102-0082,35.6850,139.7514,500
6,1.0.32.0/19,1809858,1814991,,0,0,,23.1167,113.2500,50
...,...,...,...,...,...,...,...,...,...,...
2662725,223.255.236.0/22,1796236,1814991,,0,0,,31.0456,121.3997,50
2662726,223.255.240.0/22,1819730,1819730,,0,0,,22.2500,114.1667,50


In [161]:
maxmindNetworksIPv6

Unnamed: 0,network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider,postal_code,latitude,longitude,accuracy_radius
0,600:8801:9400:5a1:948b:ab15:dde3:61a3/128,5363990,,,0,0,91941,32.7596,-116.9940,100
1,2000:db8::/32,5332921,,,0,0,93614,37.2502,-119.7513,100
2,2001:200::/49,1850147,1861060,,0,0,102-0082,35.6850,139.7514,50
3,2001:200:0:8000::/49,11612577,1861060,,0,0,182-0025,35.6556,139.5522,20
4,2001:200:1::/48,1861060,1861060,,0,0,,36.0000,138.0000,100
5,2001:200:2::/47,1861060,1861060,,0,0,,36.0000,138.0000,100
6,2001:200:4::/46,1861060,1861060,,0,0,,36.0000,138.0000,100
...,...,...,...,...,...,...,...,...,...,...
2039857,2c0f:ffd8:800::/37,953987,953987,,0,0,,-29.0000,24.0000,100
2039858,2c0f:ffd8:1000::/36,953987,953987,,0,0,,-29.0000,24.0000,100


In [162]:
maxmindLocations

Unnamed: 0,geoname_id,locale_code,continent_code,continent_name,country_iso_code,country_name,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,city_name,metro_code,time_zone,is_in_european_union
0,18918,en,EU,Europe,CY,Cyprus,04,Ammochostos,,,Protaras,,Asia/Famagusta,1
1,49518,en,AF,Africa,RW,Rwanda,,,,,,,Africa/Kigali,0
2,49747,en,AF,Africa,SO,Somalia,BK,Bakool,,,Oddur,,Africa/Mogadishu,0
3,51537,en,AF,Africa,SO,Somalia,,,,,,,Africa/Mogadishu,0
4,53654,en,AF,Africa,SO,Somalia,BN,Banaadir,,,Mogadishu,,Africa/Mogadishu,0
5,54225,en,AF,Africa,SO,Somalia,SH,Lower Shabeelle,,,Merca,,Africa/Mogadishu,0
6,55671,en,AF,Africa,SO,Somalia,JH,Lower Juba,,,Kismayo,,Africa/Mogadishu,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103001,11789760,en,,North America,CA,Canada,NB,New Brunswick,,,Shippagan,,America/Moncton,0
103002,11790338,en,EU,Europe,CH,Switzerland,VD,Vaud,,,Servion,,Europe/Zurich,0
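
In the output above, the row for Canada shows an empty `continent_code` even though its `continent_name` is 'North America'. A likely cause, assuming pandas' default settings, is that `read_csv` treats the string 'NA' as a missing value, even when `dtype=str` is given. The sketch below reproduces this on a two-row sample and shows that `keep_default_na=False` preserves the code. Whether that option suits the rest of this notebook has not been tested here.

```python
from io import StringIO

import pandas as pd

# two rows in the shape of the MaxMind locations file (columns abbreviated)
sample = (
    'geoname_id,continent_code,continent_name\n'
    '11789760,NA,North America\n'
    '11790338,EU,Europe\n'
)

# default behaviour: the string 'NA' is read as a missing value
default = pd.read_csv(StringIO(sample), dtype=str)

# with keep_default_na=False, 'NA' is kept as an ordinary string
preserved = pd.read_csv(StringIO(sample), dtype=str, keep_default_na=False)

print(default['continent_code'].isna().tolist())
print(preserved['continent_code'].tolist())
```

With `keep_default_na=False`, empty fields are read as empty strings rather than missing values, so any later code that relies on `dropna` or `isna` would need to be checked.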


-----
Run the next cell to merge the MaxMind and IBM network and location data. The merged data will be written into three CSV files in the 'merged' directory: 'GeoLite2-City-Blocks-IPv4.csv', 'GeoLite2-City-Blocks-IPv6.csv', and 'GeoLite2-City-Locations-en.csv'.

In [165]:
print('running ...')

# create a directory for the merged MaxMind+IBM CSV files
os.makedirs('merged', exist_ok=True)

# merge the MaxMind and IBM network data and store the result in CSV files

# drop MaxMind's entries for networks starting with '9.', then add the IBM IPv4 subnets
maxmindNetworksIPv4 = maxmindNetworksIPv4[ ~ maxmindNetworksIPv4['network'].str.startswith('9.') ]
mergedNetworksIPv4 = pd.concat([maxmindNetworksIPv4, pd.DataFrame(ibmNetworksIPv4)])
mergedNetworksIPv4.to_csv('merged/GeoLite2-City-Blocks-IPv4.csv', index=False, float_format='%.9g', columns=maxmindNetworksIPv4.columns)

# drop MaxMind's entries for networks starting with '2620:1f7:', then add the IBM IPv6 subnets
# (compare in lower case, because MaxMind writes IPv6 addresses in lower case)
maxmindNetworksIPv6 = maxmindNetworksIPv6[ ~ maxmindNetworksIPv6['network'].str.lower().str.startswith('2620:1f7:') ]
mergedNetworksIPv6 = pd.concat([maxmindNetworksIPv6, pd.DataFrame(ibmNetworksIPv6)])
mergedNetworksIPv6.to_csv('merged/GeoLite2-City-Blocks-IPv6.csv', index=False, float_format='%.9g', columns=maxmindNetworksIPv6.columns)

# merge the MaxMind and IBM location data and store the result in a CSV file
mergedLocations = pd.concat([maxmindLocations, pd.DataFrame(ibmLocations)])
mergedLocations.to_csv('merged/GeoLite2-City-Locations-en.csv', index=False, float_format='%.9g', columns=maxmindLocations.columns)

print('... done')

running ...
... done
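
The cell above removes MaxMind's entries for IBM's address ranges by comparing string prefixes. As an alternative, the standard `ipaddress` module can test subnet containment directly, which does not depend on letter case or on how an address is abbreviated. This is a sketch only: the `/8` and `/32` prefix lengths are assumptions inferred from the `'9.'` and `'2620:1F7:'` string prefixes used in this notebook, and `is_ibm_network` is a hypothetical helper.

```python
import ipaddress

# assumed IBM ranges, inferred from the string prefixes used in this notebook
IBM_RANGES = [
    ipaddress.ip_network('9.0.0.0/8'),
    ipaddress.ip_network('2620:1f7::/32'),
]

def is_ibm_network(cidr):
    """Return True if the CIDR block lies entirely inside one of the assumed IBM ranges."""
    network = ipaddress.ip_network(cidr, strict=False)
    return any(
        # subnet_of() requires both networks to have the same IP version
        network.version == ibm_range.version and network.subnet_of(ibm_range)
        for ibm_range in IBM_RANGES
    )

for cidr in ['9.112.12.0/22', '90.1.0.0/16', '2620:1F7:0:1::/64', '2001:200::/49']:
    print(cidr, is_ibm_network(cidr))
```

Applied to a frame, this could look like `maxmindNetworksIPv4['network'].map(is_ibm_network)`, although parsing millions of rows this way is likely to be slower than the string comparison.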


In [166]:
resultPackage = 'mergedIBMandInternetGeographyData.zip'

print('running ...')

# pack all result files into a ZIP package
with zipfile.ZipFile(resultPackage, 'w', compression=zipfile.ZIP_DEFLATED) as zipFile:
    for file in os.listdir('merged'):
        zipFile.write('merged/'+file, file)

# write the ZIP file to the notebook's bucket in Cloud Object Storage
##################cosClient.upload_file(Filename=resultPackage, Bucket=credentials['BUCKET'], Key=resultPackage)

print('... done')

running ...
... done


-----
Run the next cell to merge the MaxMind and IBM networks. The merged data will be written into a CSV file in the 'merged' directory named 'GeoLite2-City-Blocks-IPv4.csv'.

In [198]:
print('running ...')

# create a frame of IBM locations indexed by country, city, and street address
ibmLocationsIndexed = ibmLocations.set_index(['country_name','city_name','street_address'])

# add country, city, street address, and latitude/longitude for each network in the IBM networks frame
ibmNetworksWithLocations = ibmNetworks.join(ibmLocationsIndexed, on=['country_name','city_name','street_address']).dropna(subset=['latitude'])

# merge the MaxMind and IBM network frames and store the result in a CSV file
mergedNetworks = pd.concat([maxmindNetworks,ibmNetworksWithLocations[ list( set(maxmindNetworks.columns) & set(ibmNetworksWithLocations.columns) ) ]])
mergedNetworks.to_csv('merged/GeoLite2-City-Blocks-IPv4.csv', index=False, float_format='%.9g', columns=maxmindNetworks.columns)
                                                        
print('... done')

running ...
... done


In [199]:
mergedNetworks

Unnamed: 0,accuracy_radius,geoname_id,is_anonymous_proxy,is_satellite_provider,latitude,longitude,network,postal_code,registered_country_geoname_id,represented_country_geoname_id
0,1000.0,2151718.0,0.0,0.0,-37.700000,145.183300,1.0.0.0/24,3095,2077456.0,
1,50.0,1810821.0,0.0,0.0,26.061400,119.306100,1.0.1.0/24,,1814991.0,
2,50.0,1810821.0,0.0,0.0,26.061400,119.306100,1.0.2.0/23,,1814991.0,
3,1000.0,2077456.0,0.0,0.0,-33.494000,143.210400,1.0.4.0/22,,2077456.0,
4,50.0,1809858.0,0.0,0.0,23.116700,113.250000,1.0.8.0/21,,1814991.0,
5,500.0,1850147.0,0.0,0.0,35.685000,139.751400,1.0.16.0/20,190-0031,1861060.0,
6,50.0,1809858.0,0.0,0.0,23.116700,113.250000,1.0.32.0/19,,1814991.0,
...,...,...,...,...,...,...,...,...,...,...
8080,,437.0,,,32.047356,118.803251,9.112.12.0/22,210002,,
8574,,399.0,,,12.983971,77.729418,9.113.140.0/23,560066,,


-----
Run the next cell to calculate [geohash codes](https://en.wikipedia.org/wiki/Geohash) for the latitude/longitude coordinates of merged MaxMind and IBM locations. The geohashes, coordinates, and location data will be written into a CSV file in the 'merged' directory named 'GeoLite2-City-Geohashes-en.csv'.

In [200]:
print('running ...')

# create a frame of locations indexed by ID number
mergedLocationsIndexed = mergedLocations.set_index('geoname_id')

# create a frame of geographical coordinates, that is, ID number, latitude, and longitude
mergedCoordinates = mergedNetworks[['geoname_id','latitude','longitude']].drop_duplicates()

# merge location and coordinate data and calculate geohash for each location's coordinates
# (coordinates are converted to float because the MaxMind CSV files are loaded as strings)
mergedGeohashes = mergedCoordinates.join(mergedLocationsIndexed, on='geoname_id')
mergedGeohashes['geohash'] = mergedGeohashes.apply(lambda row: geohash2.encode(float(row['latitude']),float(row['longitude']),precision=6),axis=1)

# store the result in a CSV file
columns = ['geohash','latitude','longitude','geoname_id','country_iso_code','country_name','subdivision_1_iso_code','subdivision_1_name','subdivision_2_iso_code','subdivision_2_name','city_name']
mergedGeohashes.to_csv('merged/GeoLite2-City-Geohashes-en.csv', index=False, float_format='%.9g', columns=columns)
                           
print('... done')

running ...
... done


In [201]:
mergedGeohashes

Unnamed: 0,geoname_id,latitude,longitude,city_name,continent_code,continent_name,country_iso_code,country_name,locale_code,metro_code,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,time_zone,geohash
0,2151718.0,-37.700000,145.183300,Research,OC,Oceania,AU,Australia,en,,VIC,Victoria,,,Australia/Melbourne,r1r1x8
1,1810821.0,26.061400,119.306100,Fuzhou,AS,Asia,CN,China,en,,FJ,Fujian,,,Asia/Shanghai,wssu6b
3,2077456.0,-33.494000,143.210400,,OC,Oceania,AU,Australia,en,,,,,,,r4jc6y
4,1809858.0,23.116700,113.250000,Guangzhou,AS,Asia,CN,China,en,,GD,Guangdong,,,Asia/Shanghai,ws0e90
5,1850147.0,35.685000,139.751400,Tokyo,AS,Asia,JP,Japan,en,,13,Tokyo,,,Asia/Tokyo,xn77h0
7,1854383.0,34.661700,133.935000,Okayama,AS,Asia,JP,Japan,en,,33,Okayama,,,Asia/Tokyo,wypjpv
11,1858311.0,34.583300,133.766700,Kurashiki,AS,Asia,JP,Japan,en,,33,Okayama,,,Asia/Tokyo,wyphez
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6054,431.0,31.299186,120.627245,IBM Suzhou,,,CN,China,,,Jiangsu,Jiangsu,,,,wttf0c
6665,432.0,2.924300,101.654478,IBM Cyberjaya,,,MY,Malaysia,,,Selangor,Selangor,,,,w2829h
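
The `precision=6` argument used above determines how large an area each geohash covers. Each geohash character carries 5 bits, so 6 characters carry 30 bits, split evenly between longitude and latitude. The sketch below works out the resulting cell size. The figure of roughly 111.32 km per degree at the equator is an approximation, and `geohash_cell_size` is a hypothetical helper.

```python
def geohash_cell_size(precision):
    """Return (latitude span, longitude span) in degrees of one geohash cell."""
    bits = 5 * precision
    lon_bits = (bits + 1) // 2  # longitude takes the first bit, so it gets the extra bit when odd
    lat_bits = bits // 2
    return 180.0 / 2 ** lat_bits, 360.0 / 2 ** lon_bits

KM_PER_DEGREE = 111.32  # approximate length of one degree at the equator

lat_degrees, lon_degrees = geohash_cell_size(6)
print('latitude span: %.6f degrees, about %.2f km' % (lat_degrees, lat_degrees * KM_PER_DEGREE))
print('longitude span: %.6f degrees, about %.2f km at the equator' % (lon_degrees, lon_degrees * KM_PER_DEGREE))
```

Under these assumptions a 6-character geohash cell measures about 0.61 km north to south and about 1.22 km east to west at the equator, narrowing toward the poles.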


-----
Finally, pack all of the merged CSV files into a ZIP package and copy it to the notebook's bucket in Cloud Object Storage:

In [202]:
resultPackage = 'mergedIBMandInternetGeographyData.zip'

print('running ...')

# add the MaxMind IPv6 network file to the ZIP package
shutil.copy(maxmindDirectory + '/GeoLite2-City-Blocks-IPv6.csv', 'merged')

# pack all result files into a ZIP package
with zipfile.ZipFile(resultPackage, 'w', compression=zipfile.ZIP_DEFLATED) as zipFile:
    for file in os.listdir('merged'):
        zipFile.write('merged/'+file, file)

# write the ZIP file to the notebook's bucket in Cloud Object Storage
cosClient.upload_file(Filename=resultPackage, Bucket=credentials['BUCKET'], Key=resultPackage)

print('... done')

running ...
... done


-----
To download the ZIP package containing the results of merging IBM and Internet geography data, do this:

* In a browser, go to this notebook's project page.

* Open the 'Files' panel by clicking the 'Find and Add Data' icon in the upper-right corner of the project page.

* Check the box next to 'mergedIBMandInternetGeographyData.zip'.

* Select 'Download' from the pop-up menu in the 'Files' panel.

-----
Optionally, run this last cell to clean up the notebook's runtime environment. This is really not necessary.

In [203]:
#!rm -rf *
#!pip uninstall -y googlemaps geohash2 

In [204]:
ls -al

total 115200
drwx------  4 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 18 18:57 ./
drwx------ 11 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 17 20:06 ../
drwx------  2 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 18 18:51 GeoLite2-City-CSV_20180206/
drwx------  2 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 18 18:57 merged/
-rw-------  1 sa73-1acf9232f65bd2-cf1c60ef4a00 users 39541117 Feb 18 18:57 mergedIBMandInternetGeographyData.zip
