## Merge IBM and MaxMind geographical data

-----

The code in this notebook merges IBM internal network data and Internet geography files for use in IBM Streaming Analytics, such as the NetflowViewer demonstration and cyber-security applications:

* Data for the IBM internal network comes from an Excel file provided by Mark Harvey (Mark_Harvey@uk.ibm.com). The file contains separate spreadsheets for the IBM marketing regions US, EMEA, and AP. These spreadsheets list the IP subnet addresses IBM has assigned to its offices in each region, along with country, city, and street address (but not state/province/territory or latitude/longitude). All of these subnets fall within the class A network 9.0.0.0/8 (9.xxx.xxx.xxx) assigned to IBM.

* The Internet geography data comes from CSV files provided by [MaxMind, Inc.](https://www.maxmind.com/en/home) as [GeoLite2 data](https://dev.maxmind.com/geoip/geoip2/geolite2/). This notebook downloads the 'GeoLite2' data, which MaxMind offers free of charge. The files list IP subnets in the Internet, along with country, state/province/territory, city, latitude, and longitude. MaxMind updates this data once a month, and this notebook downloads the data into a directory whose name includes the date of the files.

This notebook geocodes the IBM data with state/province/territory and latitude/longitude data from Google using its Geocoding API service. Google limits this service to 2,500 requests per day, which is sufficient for several runs of the cell below that geocodes a list of IBM locations. 

Finally, this notebook merges the geocoded IBM data into the MaxMind CSV files. It also generates a separate CSV file containing a [geohash code](https://en.wikipedia.org/wiki/Geohash) for each location's latitude/longitude. All of the resulting CSV files are packed into a ZIP file for transfer to Streaming Analytics projects:

* [GeoLite2-City-Blocks-IPv4.csv](merged/GeoLite2-City-Blocks-IPv4.csv)
* [GeoLite2-City-Blocks-IPv6.csv](merged/GeoLite2-City-Blocks-IPv6.csv)
* [GeoLite2-City-Locations-en.csv](merged/GeoLite2-City-Locations-en.csv)
* [GeoLite2-City-Geohashes-en.csv](merged/GeoLite2-City-Geohashes-en.csv)

To use this notebook, you will need to provide these things:

* an Excel file named 'report_IGA_Global_Q1_2016.xlsx' containing IBM internal network data

* a Google Maps geocoding API key for a valid Google account

There are detailed instructions for these steps in the cells below.

-----
Run this cell once to install the additional Python packages this notebook needs:

In [183]:
!pip install --user googlemaps
!pip install --user geohash2



-----
Run this cell to set up the notebook's runtime environment:

In [184]:
import os
import math
import pprint
import shutil
import zipfile
import types

# load functions for manipulating tabular data
import pandas as pd
pd.set_option('display.max_rows', 15)

# load functions for reading and writing byte streams
from io import BytesIO

# load functions for reading URLs
from urllib.request import urlopen

# load functions for reading and writing Cloud Object Storage
import ibm_boto3
from ibm_botocore.client import Config

# load functions for the Google Maps geocoding API
import googlemaps

# load functions for converting latitude/longitude coordinates into geohash codes
import geohash2

# delete any files left from previous runs, and create a local directory for staging merged CSV files
!rm -rf *
os.makedirs('merged', exist_ok=True)

-----
Then, make the Excel file containing IBM internal network data available to this notebook. To do this, copy the file to the notebook's Cloud Object Storage bucket and insert the credentials needed to create an HTTP client for reading and writing files:

* open the 'Files' panel by clicking the 'Data' icon in the upper-right corner of this DSX project,

* drag the Excel file 'report_IGA_Global_Q1_2016.xlsx' from your laptop to the 'drop' area in the 'Files' panel,

* position the cursor at the top of the next cell and click 'Insert to code -> Insert Credentials' in the 'Files' panel,

* make sure the name of the inserted variable is 'credentials', and

* run the cell to set the credentials in the variable.

The next cell should look something like this when you run it:

```
credentials = {
    'IBM_API_KEY_ID': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'IAM_SERVICE_ID': 'iam-ServiceId-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx',
    'ENDPOINT': 'https://s3-api.us-geo.objectstorage.service.networklayer.com',
    'IBM_AUTH_ENDPOINT': 'https://iam.ng.bluemix.net/oidc/token',
    'BUCKET': 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx',
    'FILE': 'report_IGA_Global_Q1_2016.xlsx'
}
```

In [185]:
# The code was removed by DSX for sharing.

-----
Run the following cell to read IBM network addresses and locations from the Excel file's spreadsheets into 'pandas' frames:

In [186]:
print('running ...')

# create a Cloud Object Store HTTP client with the bucket's credentials
cosClient = ibm_boto3.client(service_name='s3',
    ibm_api_key_id=credentials['IBM_API_KEY_ID'],
    ibm_service_instance_id=credentials['IAM_SERVICE_ID'],
    ibm_auth_endpoint=credentials['IBM_AUTH_ENDPOINT'],
    config=Config(signature_version='oauth'),
    endpoint_url=credentials['ENDPOINT'])

# get a byte stream for reading the Excel file from the bucket
excelStream = cosClient.get_object(Bucket=credentials['BUCKET'], Key=credentials['FILE'])['Body']

# add an iterator method to the stream object so pandas will accept it as a file-like object
def __iter__(self): return 0
if not hasattr(excelStream, "__iter__"): excelStream.__iter__ = types.MethodType(__iter__, excelStream) 

# prepare to read the byte stream from Cloud Object Store as an Excel file
excelFile = pd.ExcelFile(excelStream)

# concatenate the data in each spreadsheet of the Excel file into a single pandas DataFrame
ibmData = pd.concat( map(lambda sheet: pd.read_excel(excelFile, sheet, header=0), excelFile.sheet_names) )

# correct some misencoded city names in the 'report_IGA_Global_Q1_2016.xlsx' spreadsheets
ibmCorrections = { 'S?O Paulo': 'Sao Paulo', 'Quer?Taro': 'Queretaro' }
for name in ibmCorrections:
    ibmData.loc[ibmData['# City']==name, '# City'] = ibmCorrections[name]

# create a frame with the country, city, and street address of IBM network data
ibmNetworksSkeleton = ibmData[['# Network Container', '# Country', '# City', '# Street']].dropna()
ibmNetworksSkeleton = ibmNetworksSkeleton[ibmNetworksSkeleton['# Network Container'].str.match("[0-9./]+")]
ibmNetworksSkeleton = ibmNetworksSkeleton.drop_duplicates('# Network Container',keep='first')
ibmNetworksSkeleton.columns = ['network', 'country_name', 'city_name', 'street_address']    
                       
# create a skeleton for IBM location data
ibmLocationsSkeleton = ibmNetworksSkeleton.drop('network',axis=1).drop_duplicates(['country_name', 'city_name', 'street_address'])

# add empty columns that will match the MaxMind location data
ibmLocationsSkeleton['geoname_id'] = range(len(ibmLocationsSkeleton))
ibmLocationsSkeleton['country_iso_code'] = None
ibmLocationsSkeleton['subdivision_1_iso_code'] = None
ibmLocationsSkeleton['subdivision_1_name'] = None
ibmLocationsSkeleton['subdivision_2_iso_code'] = None
ibmLocationsSkeleton['subdivision_2_name'] = None
ibmLocationsSkeleton['postal_code'] = None
ibmLocationsSkeleton['latitude'] = None
ibmLocationsSkeleton['longitude'] = None

print('... done')

running ...
... done
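
The cell above keeps only rows whose '# Network Container' value starts with digits, dots, and slashes. A stricter check can be sketched with Python's standard `ipaddress` module. This is only an illustration, not part of the original notebook: the helper name `is_ibm_subnet` and the sample values are assumptions.

```python
import ipaddress

# The class A block assigned to IBM, as described at the top of this notebook.
IBM_BLOCK = ipaddress.ip_network("9.0.0.0/8")

def is_ibm_subnet(text):
    """Return True if text is a well-formed IPv4 CIDR block inside 9.0.0.0/8."""
    try:
        # strict=True rejects blocks with host bits set, such as 9.0.128.1/23
        network = ipaddress.ip_network(text.strip(), strict=True)
    except ValueError:
        return False
    return network.version == 4 and network.subnet_of(IBM_BLOCK)

# Hypothetical cell values: two valid IBM blocks, one outside 9.0.0.0/8,
# one with host bits set, and one that is not an address at all.
candidates = ["9.0.128.0/23", "9.1.0.0/16", "90.1.0.0/16", "9.0.128.1/23", "n/a"]
valid = [c for c in candidates if is_ibm_subnet(c)]
print(valid)
```

A function like this could be applied to the '# Network Container' column with `.apply()` in place of the regular expression, at the cost of parsing every value.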


In [187]:
ibmNetworksSkeleton

Unnamed: 0,network,country_name,city_name,street_address
37,9.0.128.0/23,United States,Boulder,6300 Diagonal Hwy
43,9.0.130.0/23,United States,Poughkeepsie,2455 South Rd
57,9.0.134.0/23,Argentina,Buenos Aires,Pte. Hipolito Yrigoyen 2149
88,9.0.0.0/19,Great Britain,Portsmouth,Western Road
91,9.0.32.0/19,United States,Boulder,6300 Diagonal Hwy
130,9.1.0.0/16,United States,San Jose,650 Harry Road
188,9.1.248.0/21,United States,San Jose,650 Harry Road
...,...,...,...,...
8080,9.112.12.0/22,China,Nanjing,"1, Dong Ji Avenue"
8574,9.113.140.0/23,India,Bangalore,Plot No 3 Epip(Whitefield) Ind Area


In [188]:
ibmLocationsSkeleton

Unnamed: 0,country_name,city_name,street_address,geoname_id,country_iso_code,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,postal_code,latitude,longitude
37,United States,Boulder,6300 Diagonal Hwy,0,,,,,,,,
43,United States,Poughkeepsie,2455 South Rd,1,,,,,,,,
57,Argentina,Buenos Aires,Pte. Hipolito Yrigoyen 2149,2,,,,,,,,
88,Great Britain,Portsmouth,Western Road,3,,,,,,,,
130,United States,San Jose,650 Harry Road,4,,,,,,,,
224,United States,Rochester,3605 Hwy 52 N,5,,,,,,,,
472,United States,Tucson,9000 S Rita Rd,6,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...
6665,Malaysia,Cyberjaya,Jalan Teknokrat 3,432,,,,,,,,
6727,India,Hyderabad,"8-2-4-624/A/1, Road No.10",433,,,,,,,,


-----
Run the next cell to download and unpack Internet network and location data from MaxMind into 'pandas' frames:

In [189]:
maxmindURL = 'http://geolite.maxmind.com/download/geoip/database/GeoLite2-City-CSV.zip'

print('running ...')

with urlopen(maxmindURL) as response:
    with zipfile.ZipFile(BytesIO(response.read())) as file:
        file.extractall()

# find the newest directory, in case there are old directories left over from previous runs
maxmindDirectory = sorted( [ f for f in os.listdir() if os.path.isdir(f) and f.startswith('GeoLite2-City-CSV') ] )[-1]

# load the MaxMind network and location data 
maxmindNetworks = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Blocks-IPv4.csv', header=0)
maxmindLocations = pd.read_csv(maxmindDirectory + '/GeoLite2-City-Locations-en.csv', header=0)

# discard rows for IBM internal networks
maxmindNetworks = maxmindNetworks[ ~ maxmindNetworks['network'].str.startswith('9.') ]

print('... done')

running ...
... done
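
The cell above picks the newest MaxMind directory by sorting directory names. That works because the names end in a fixed-width `YYYYMMDD` date, so alphabetical order matches chronological order. A minimal sketch of the same idea on plain strings follows; the function name `newest_release` and the sample names are assumptions made for illustration.

```python
import re

def newest_release(names, prefix="GeoLite2-City-CSV_"):
    """Return the name with the latest YYYYMMDD suffix, or None if none match."""
    pattern = re.escape(prefix) + r"\d{8}"
    dated = [name for name in names if re.fullmatch(pattern, name)]
    # fixed-width dates compare correctly as strings
    return max(dated) if dated else None

# Hypothetical directory listing with an older release left over from a previous run.
listing = ["merged", "GeoLite2-City-CSV_20180102",
           "GeoLite2-City-CSV_20180206", "GeoLite2-City-CSV.zip"]
newest = newest_release(listing)
print(newest)
```

Matching the full name against a pattern also avoids picking up unrelated entries that merely share the prefix, such as a downloaded ZIP file.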


In [190]:
maxmindNetworks

Unnamed: 0,network,geoname_id,registered_country_geoname_id,represented_country_geoname_id,is_anonymous_proxy,is_satellite_provider,postal_code,latitude,longitude,accuracy_radius
0,1.0.0.0/24,2151718.0,2077456.0,,0,0,3095,-37.7000,145.1833,1000.0
1,1.0.1.0/24,1810821.0,1814991.0,,0,0,,26.0614,119.3061,50.0
2,1.0.2.0/23,1810821.0,1814991.0,,0,0,,26.0614,119.3061,50.0
3,1.0.4.0/22,2077456.0,2077456.0,,0,0,,-33.4940,143.2104,1000.0
4,1.0.8.0/21,1809858.0,1814991.0,,0,0,,23.1167,113.2500,50.0
5,1.0.16.0/20,1850147.0,1861060.0,,0,0,190-0031,35.6850,139.7514,500.0
6,1.0.32.0/19,1809858.0,1814991.0,,0,0,,23.1167,113.2500,50.0
...,...,...,...,...,...,...,...,...,...,...
2726910,223.255.236.0/22,1796236.0,1814991.0,,0,0,,31.0456,121.3997,50.0
2726911,223.255.240.0/22,1819730.0,1819730.0,,0,0,,22.2500,114.1667,50.0


In [191]:
maxmindLocations

Unnamed: 0,geoname_id,locale_code,continent_code,continent_name,country_iso_code,country_name,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,city_name,metro_code,time_zone
0,18918,en,EU,Europe,CY,Cyprus,04,Ammochostos,,,Protaras,,Asia/Famagusta
1,32909,en,AS,Asia,IR,Iran,07,Ostan-e Tehran,,,Shahre Jadide Andisheh,,Asia/Tehran
2,49518,en,AF,Africa,RW,Rwanda,,,,,,,Africa/Kigali
3,49747,en,AF,Africa,SO,Somalia,BK,Bakool,,,Oddur,,Africa/Mogadishu
4,51537,en,AF,Africa,SO,Somalia,,,,,,,Africa/Mogadishu
5,53654,en,AF,Africa,SO,Somalia,BN,Banaadir,,,Mogadishu,,Africa/Mogadishu
6,54225,en,AF,Africa,SO,Somalia,SH,Lower Shabeelle,,,Merca,,Africa/Mogadishu
...,...,...,...,...,...,...,...,...,...,...,...,...,...
103717,11789329,en,EU,Europe,IT,Italy,52,Tuscany,PI,Province of Pisa,Ospedaletto,,Europe/Rome
103718,11789352,en,EU,Europe,CH,Switzerland,TI,Ticino,,,Savosa,,Europe/Zurich


-----
Next, get a Google geocoding API key for a valid Google account:

* In a browser, go to [Google](https://www.google.com/) and sign into an existing account or create a new account.

* Go to the [Google Geocoding Service](https://developers.google.com/maps/documentation/javascript/geocoding) page and follow the instructions to create a project and enable the geocoding API.

* Go to [Google Geocoding Service 'Get API Key'](https://developers.google.com/maps/documentation/geocoding/get-api-key), click on 'Get a Key', and then click the 'copy' button.

* Paste the copied key into the next cell as the value of the 'googlemapsKey' variable.

* Run the next cell to set the key in the variable.

The next cell should look something like this when you run it:

```
googlemapsKey = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
```

Note that this service is limited to 2,500 requests per day. That is sufficient for several runs of the cell below that geocodes the list of IBM locations.

In [192]:
# The code was removed by DSX for sharing.

Run the next cell to fill the empty geography columns for IBM locations with data from the Google geocoding service. Note that Google limits geocoding API requests to 2,500 per day. There are about 450 IBM locations, so you can run this cell several times in the same day before reaching the limit. After the limit is reached, the Google 'geocode()' function will return an error code for the remainder of the day.

In [193]:
def convertAddressToGeocode(client, address):
    result = client.geocode(address)
    # debugging aid: uncomment to print the raw API response
    # print('googlemaps.Client.geocode(' + address + ') returned:')
    # pprint.pprint(result, width=150)
    if result is None: return None
    if len(result)<1: return None
    if 'address_components' not in result[0]: return None
    geocode = dict( [ (i['types'][0],{'long_name':i['long_name'],'short_name': i['short_name']}) for i in result[0]['address_components'] ] )
    geocode['latitude'] = result[0]['geometry']['location']['lat']
    geocode['longitude'] = result[0]['geometry']['location']['lng']
    # debugging aid: uncomment to print the flattened geocode
    # print('convertAddressToGeocode(' + address + ') returned:')
    # pprint.pprint(geocode, width=150)
    return geocode

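# geocode only locations whose 'geoname_id' does not exceed K; lower K (for example, to 10) for a quick test run
K = 500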

def geocodeIBMLocationRow(googlemapsClient, row):
    if row['geoname_id']>K: return None
    address = "IBM, " + row['street_address'] + ', ' + row['city_name'] + ', ' + row['country_name']
    geocode = convertAddressToGeocode(googlemapsClient, address)
    if geocode is None:
        address = "IBM, " + row['city_name'] + ', ' + row['country_name']
        geocode = convertAddressToGeocode(googlemapsClient, address)
    if geocode is None: 
        print('address not found: ' + address)
        return None
    try: 
        row['city_name'] = 'IBM ' + row['city_name']
        if 'country' in geocode: 
            row['country_iso_code'] = geocode['country']['short_name']
        if 'administrative_area_level_1' in geocode: 
            row['subdivision_1_iso_code'] = geocode['administrative_area_level_1']['short_name']
            row['subdivision_1_name'] = geocode['administrative_area_level_1']['long_name']
        if 'administrative_area_level_2' in geocode: 
            row['subdivision_2_iso_code'] = geocode['administrative_area_level_2']['short_name']
            row['subdivision_2_name'] = geocode['administrative_area_level_2']['long_name']
        if 'postal_code' in geocode: 
            row['postal_code'] = geocode['postal_code']['long_name']
        if 'latitude' in geocode: 
            row['latitude'] = geocode['latitude']
            row['longitude'] = geocode['longitude']
    except KeyError as e: 
        print(str(e) + ' not found for address ' + address) 
    return row

print('running ...')

# create an HTTP client for using the Google Maps geocoding API
googlemapsClient = googlemaps.Client(key=googlemapsKey)

# geocode IBM locations, prepending 'IBM' to city names
ibmLocations = ibmLocationsSkeleton.apply(lambda row: geocodeIBMLocationRow(googlemapsClient, row), axis=1).dropna(axis=0,how='all')

# copy IBM networks, prepending 'IBM' to city names
# (a real copy, so that re-running this cell does not prepend 'IBM' a second time)
ibmNetworks = ibmNetworksSkeleton.copy()
ibmNetworks['city_name'] = 'IBM ' + ibmNetworksSkeleton['city_name']

print('... done')

running ...
address not found: IBM, Acheson, Canada
address not found: IBM, Florenceville, Canada
address not found: IBM, West Des Moines, United States
address not found: IBM, Puebla, Mexico
address not found: IBM, Coerdoba, Argentina
address not found: IBM, Horsham, Great Britain
address not found: IBM, Porto Salvo, Portugal
address not found: IBM, Aachen, Germany
address not found: IBM, Chemnitz, Germany
address not found: IBM, Flensburg, Germany
address not found: IBM, Walldorf, Germany
address not found: IBM, Olsztyn, Poland
address not found: IBM, Cluj-Napoca, Romania
address not found: IBM, Gdansk, Poland
address not found: IBM, Cagliari, Italy
address not found: IBM, Pero, Italy
address not found: IBM, Catania, Italy
address not found: IBM, Napoli, Italy
address not found: IBM, Vaesteraes, Sweden
address not found: IBM, Riga, Latvia
address not found: IBM, Ostrava, Czech Republic
address not found: IBM, Lugano, Switzerland
address not found: IBM, Aubagne, France
address not fou
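
The function `convertAddressToGeocode()` above flattens the first geocoding result into a dictionary keyed by component type. The sketch below shows that flattening on a hand-written sample, so it runs without an API key. The sample is an assumption shaped like a Geocoding API result, with values taken from the Boulder row shown in this notebook. It is not a captured response, and `flatten_geocode` is a name chosen here.

```python
# Hand-written sample in the shape of a geocoding result list (not a real response).
sample_result = [{
    "address_components": [
        {"long_name": "Boulder", "short_name": "Boulder",
         "types": ["locality", "political"]},
        {"long_name": "Colorado", "short_name": "CO",
         "types": ["administrative_area_level_1", "political"]},
        {"long_name": "United States", "short_name": "US",
         "types": ["country", "political"]},
        {"long_name": "80301", "short_name": "80301", "types": ["postal_code"]},
    ],
    "geometry": {"location": {"lat": 40.089317, "lng": -105.198126}},
}]

def flatten_geocode(result):
    """Key each address component by its first type and add latitude/longitude."""
    if not result or "address_components" not in result[0]:
        return None
    first = result[0]
    flat = {}
    for component in first["address_components"]:
        if component.get("types"):  # skip components that carry no type
            flat[component["types"][0]] = {
                "long_name": component["long_name"],
                "short_name": component["short_name"],
            }
    location = first.get("geometry", {}).get("location")
    if location is None:
        return None
    flat["latitude"] = location["lat"]
    flat["longitude"] = location["lng"]
    return flat

flat = flatten_geocode(sample_result)
print(flat["country"]["short_name"], flat["latitude"], flat["longitude"])
```

Unlike the notebook's function, this variant skips components with an empty `types` list and returns `None` when the coordinates are missing, which is a design choice of the sketch rather than behavior of the original cell.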

In [194]:
ibmNetworks

Unnamed: 0,network,country_name,city_name,street_address
37,9.0.128.0/23,United States,IBM Boulder,6300 Diagonal Hwy
43,9.0.130.0/23,United States,IBM Poughkeepsie,2455 South Rd
57,9.0.134.0/23,Argentina,IBM Buenos Aires,Pte. Hipolito Yrigoyen 2149
88,9.0.0.0/19,Great Britain,IBM Portsmouth,Western Road
91,9.0.32.0/19,United States,IBM Boulder,6300 Diagonal Hwy
130,9.1.0.0/16,United States,IBM San Jose,650 Harry Road
188,9.1.248.0/21,United States,IBM San Jose,650 Harry Road
...,...,...,...,...
8080,9.112.12.0/22,China,IBM Nanjing,"1, Dong Ji Avenue"
8574,9.113.140.0/23,India,IBM Bangalore,Plot No 3 Epip(Whitefield) Ind Area


In [195]:
ibmLocations

Unnamed: 0,country_name,city_name,street_address,geoname_id,country_iso_code,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,postal_code,latitude,longitude
37,United States,IBM Boulder,6300 Diagonal Hwy,0.0,US,CO,Colorado,Boulder County,Boulder County,80301,40.089317,-105.198126
43,United States,IBM Poughkeepsie,2455 South Rd,1.0,US,NY,New York,Dutchess County,Dutchess County,12601,41.653684,-73.935950
57,Argentina,IBM Buenos Aires,Pte. Hipolito Yrigoyen 2149,2.0,AR,CABA,Buenos Aires,Comuna 1,Comuna 1,C1001AFA,-34.596097,-58.371448
88,Great Britain,IBM Portsmouth,Western Road,3.0,GB,England,England,Portsmouth,Portsmouth,PO6 3AU,50.842570,-1.085737
130,United States,IBM San Jose,650 Harry Road,4.0,US,CA,California,Santa Clara County,Santa Clara County,95120,37.211053,-121.806949
224,United States,IBM Rochester,3605 Hwy 52 N,5.0,US,MN,Minnesota,Olmsted County,Olmsted County,55901,44.066202,-92.505961
472,United States,IBM Tucson,9000 S Rita Rd,6.0,US,AZ,Arizona,Pima County,Pima County,85744,32.090811,-110.804402
...,...,...,...,...,...,...,...,...,...,...,...,...
6054,China,IBM Suzhou,"88 Dongchang Road, Suzhou Industria",431.0,CN,Jiangsu,Jiangsu,,,215000,31.299186,120.627245
6665,Malaysia,IBM Cyberjaya,Jalan Teknokrat 3,432.0,MY,Selangor,Selangor,,,63000,2.924300,101.654478


-----
Run the next cell to merge the MaxMind and IBM locations. The merged data will be written into a CSV file in the 'merged' directory named 'GeoLite2-City-Locations-en.csv'.

In [196]:
def removeCommas(name):
    return name.replace(',', '')

print('running ...')

# merge the MaxMind and IBM location frames 
mergedLocations = pd.concat([maxmindLocations,ibmLocations[ list( set(maxmindLocations.columns) & set(ibmLocations.columns) ) ]])

# remove commas from location names
for column in ['country_name','subdivision_1_name','subdivision_2_name','city_name']:
    mergedLocations[column] = mergedLocations[column].apply(lambda name: removeCommas(str(name)))

# store the result as a CSV file
mergedLocations.to_csv('merged/GeoLite2-City-Locations-en.csv', index=False, float_format='%.9g', columns=maxmindLocations.columns)

print('... done')

running ...
... done
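
The merge above appends only those IBM columns that also exist in the MaxMind frame, and writes the CSV in MaxMind's column order. A small sketch with toy frames illustrates the pattern; the frame names and values are assumptions, not the notebook's data. Depending on the pandas version, `pd.concat` may reorder columns when the frames do not share the same set (the output shown below is alphabetical), so the sketch restores the reference order explicitly.

```python
import pandas as pd

# Toy stand-ins for the MaxMind and IBM location frames.
reference = pd.DataFrame({
    "geoname_id": [18918],
    "locale_code": ["en"],
    "country_name": ["Cyprus"],
    "city_name": ["Protaras"],
})
extra = pd.DataFrame({
    "geoname_id": [0],
    "country_name": ["United States"],
    "city_name": ["IBM Boulder"],
    "street_address": ["6300 Diagonal Hwy"],  # no counterpart in the reference frame
})

# keep only columns the reference frame knows about, in reference order
shared = [column for column in reference.columns if column in extra.columns]
merged = pd.concat([reference, extra[shared]], ignore_index=True)
merged = merged[list(reference.columns)]

# columns missing from the extra frame are left empty in the CSV
csv_text = merged.to_csv(index=False)
print(csv_text)
```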


In [197]:
mergedLocations

Unnamed: 0,city_name,continent_code,continent_name,country_iso_code,country_name,geoname_id,locale_code,metro_code,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,time_zone
0,Protaras,EU,Europe,CY,Cyprus,18918.0,en,,04,Ammochostos,,,Asia/Famagusta
1,Shahre Jadide Andisheh,AS,Asia,IR,Iran,32909.0,en,,07,Ostan-e Tehran,,,Asia/Tehran
2,,AF,Africa,RW,Rwanda,49518.0,en,,,,,,Africa/Kigali
3,Oddur,AF,Africa,SO,Somalia,49747.0,en,,BK,Bakool,,,Africa/Mogadishu
4,,AF,Africa,SO,Somalia,51537.0,en,,,,,,Africa/Mogadishu
5,Mogadishu,AF,Africa,SO,Somalia,53654.0,en,,BN,Banaadir,,,Africa/Mogadishu
6,Merca,AF,Africa,SO,Somalia,54225.0,en,,SH,Lower Shabeelle,,,Africa/Mogadishu
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6054,IBM Suzhou,,,CN,China,431.0,,,Jiangsu,Jiangsu,,,
6665,IBM Cyberjaya,,,MY,Malaysia,432.0,,,Selangor,Selangor,,,


-----
Run the next cell to merge the MaxMind and IBM networks. The merged data will be written into a CSV file in the 'merged' directory named 'GeoLite2-City-Blocks-IPv4.csv'.

In [198]:
print('running ...')

# create a frame of IBM locations indexed by country, city, and street address
ibmLocationsIndexed = ibmLocations.set_index(['country_name','city_name','street_address'])

# add country, city, street address, and latitude/longitude for each network in the IBM networks frame
ibmNetworksWithLocations = ibmNetworks.join(ibmLocationsIndexed, on=['country_name','city_name','street_address']).dropna(subset=['latitude'])

# merge the MaxMind and IBM network frames and store the result in a CSV file
mergedNetworks = pd.concat([maxmindNetworks,ibmNetworksWithLocations[ list( set(maxmindNetworks.columns) & set(ibmNetworksWithLocations.columns) ) ]])
mergedNetworks.to_csv('merged/GeoLite2-City-Blocks-IPv4.csv', index=False, float_format='%.9g', columns=maxmindNetworks.columns)
                                                        
print('... done')

running ...
... done
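
The cell above attaches a location to each IBM network by joining on country, city, and street address, and silently drops networks whose location could not be geocoded. The sketch below shows a variant of that step on toy data. It uses `DataFrame.merge` with `indicator=True` instead of `join` plus `dropna`, so the unmatched networks can be listed. The frames and the two-column key are assumptions made for illustration.

```python
import pandas as pd

# Toy networks: the Portsmouth block has no geocoded location.
networks = pd.DataFrame({
    "network": ["9.0.128.0/23", "9.0.32.0/19", "9.0.0.0/19"],
    "country_name": ["United States", "United States", "Great Britain"],
    "city_name": ["IBM Boulder", "IBM Boulder", "IBM Portsmouth"],
})
locations = pd.DataFrame({
    "country_name": ["United States"],
    "city_name": ["IBM Boulder"],
    "geoname_id": [0],
    "latitude": [40.089317],
    "longitude": [-105.198126],
})

key = ["country_name", "city_name"]
# the '_merge' column records whether each network found a location
joined = networks.merge(locations, on=key, how="left", indicator=True)

unmatched = joined.loc[joined["_merge"] == "left_only", "network"].tolist()
located = joined[joined["_merge"] == "both"].drop(columns="_merge")
print("networks without a location:", unmatched)
```

Listing the unmatched networks makes it easier to see how much of the IBM address space is lost when geocoding fails.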


In [199]:
mergedNetworks

Unnamed: 0,accuracy_radius,geoname_id,is_anonymous_proxy,is_satellite_provider,latitude,longitude,network,postal_code,registered_country_geoname_id,represented_country_geoname_id
0,1000.0,2151718.0,0.0,0.0,-37.700000,145.183300,1.0.0.0/24,3095,2077456.0,
1,50.0,1810821.0,0.0,0.0,26.061400,119.306100,1.0.1.0/24,,1814991.0,
2,50.0,1810821.0,0.0,0.0,26.061400,119.306100,1.0.2.0/23,,1814991.0,
3,1000.0,2077456.0,0.0,0.0,-33.494000,143.210400,1.0.4.0/22,,2077456.0,
4,50.0,1809858.0,0.0,0.0,23.116700,113.250000,1.0.8.0/21,,1814991.0,
5,500.0,1850147.0,0.0,0.0,35.685000,139.751400,1.0.16.0/20,190-0031,1861060.0,
6,50.0,1809858.0,0.0,0.0,23.116700,113.250000,1.0.32.0/19,,1814991.0,
...,...,...,...,...,...,...,...,...,...,...
8080,,437.0,,,32.047356,118.803251,9.112.12.0/22,210002,,
8574,,399.0,,,12.983971,77.729418,9.113.140.0/23,560066,,


-----
Run the next cell to calculate [geohash codes](https://en.wikipedia.org/wiki/Geohash) for the latitude/longitude coordinates of merged MaxMind and IBM locations. The geohashes, coordinates, and location data will be written into a CSV file in the 'merged' directory named 'GeoLite2-City-Geohashes-en.csv'.

In [200]:
print('running ...')

# create a frame of locations indexed by ID number
mergedLocationsIndexed = mergedLocations.set_index('geoname_id')

# create a frame of geographical coordinates, that is, ID number, latitude, and longitude
mergedCoordinates = mergedNetworks[['geoname_id','latitude','longitude']].drop_duplicates()

# merge location and coordinate data and calculate geohash for each location's coordinates
mergedGeohashes = mergedCoordinates.join(mergedLocationsIndexed, on='geoname_id')
mergedGeohashes['geohash'] = mergedGeohashes.apply(lambda row: geohash2.encode(row['latitude'],row['longitude'],precision=6),axis=1)

# store the result in a CSV file
columns = ['geohash','latitude','longitude','geoname_id','country_iso_code','country_name','subdivision_1_iso_code','subdivision_1_name','subdivision_2_iso_code','subdivision_2_name','city_name']
mergedGeohashes.to_csv('merged/GeoLite2-City-Geohashes-en.csv', index=False, float_format='%.9g', columns=columns)
                           
print('... done')

running ...
... done
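
For readers unfamiliar with geohashes, the encoding can be sketched in plain Python. A geohash halves the longitude and latitude ranges alternately, records one bit per halving, and writes each group of five bits as one base-32 character. The code below illustrates that standard algorithm. It is not the implementation inside `geohash2`, and `encode_geohash` is a name chosen here.

```python
# Geohash alphabet: digits and lowercase letters, without a, i, l, o.
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def encode_geohash(latitude, longitude, precision=6):
    """Encode a coordinate pair as a geohash string of the given length."""
    lat_low, lat_high = -90.0, 90.0
    lon_low, lon_high = -180.0, 180.0
    characters = []
    bits = 0
    bit_count = 0
    use_longitude = True  # bits alternate, starting with longitude
    while len(characters) < precision:
        if use_longitude:
            middle = (lon_low + lon_high) / 2
            if longitude >= middle:
                bits = (bits << 1) | 1
                lon_low = middle
            else:
                bits = bits << 1
                lon_high = middle
        else:
            middle = (lat_low + lat_high) / 2
            if latitude >= middle:
                bits = (bits << 1) | 1
                lat_low = middle
            else:
                bits = bits << 1
                lat_high = middle
        use_longitude = not use_longitude
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            characters.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(characters)

# Coordinates of the Tokyo row in the table below.
print(encode_geohash(35.685, 139.7514, precision=6))
```

Each additional character narrows the cell, so nearby locations share a prefix. The two-character prefix for the Tokyo coordinates is `xn`, consistent with the `xn77h0` value in the table below.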


In [201]:
mergedGeohashes

Unnamed: 0,geoname_id,latitude,longitude,city_name,continent_code,continent_name,country_iso_code,country_name,locale_code,metro_code,subdivision_1_iso_code,subdivision_1_name,subdivision_2_iso_code,subdivision_2_name,time_zone,geohash
0,2151718.0,-37.700000,145.183300,Research,OC,Oceania,AU,Australia,en,,VIC,Victoria,,,Australia/Melbourne,r1r1x8
1,1810821.0,26.061400,119.306100,Fuzhou,AS,Asia,CN,China,en,,FJ,Fujian,,,Asia/Shanghai,wssu6b
3,2077456.0,-33.494000,143.210400,,OC,Oceania,AU,Australia,en,,,,,,,r4jc6y
4,1809858.0,23.116700,113.250000,Guangzhou,AS,Asia,CN,China,en,,GD,Guangdong,,,Asia/Shanghai,ws0e90
5,1850147.0,35.685000,139.751400,Tokyo,AS,Asia,JP,Japan,en,,13,Tokyo,,,Asia/Tokyo,xn77h0
7,1854383.0,34.661700,133.935000,Okayama,AS,Asia,JP,Japan,en,,33,Okayama,,,Asia/Tokyo,wypjpv
11,1858311.0,34.583300,133.766700,Kurashiki,AS,Asia,JP,Japan,en,,33,Okayama,,,Asia/Tokyo,wyphez
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6054,431.0,31.299186,120.627245,IBM Suzhou,,,CN,China,,,Jiangsu,Jiangsu,,,,wttf0c
6665,432.0,2.924300,101.654478,IBM Cyberjaya,,,MY,Malaysia,,,Selangor,Selangor,,,,w2829h


-----
Finally, pack all of the merged CSV files into a ZIP package and copy it to the notebook's bucket in Cloud Object Storage:

In [202]:
resultPackage = 'mergedIBMandInternetGeographyData.zip'

print('running ...')

# add the MaxMind IPv6 network file to the ZIP package
shutil.copy(maxmindDirectory + '/GeoLite2-City-Blocks-IPv6.csv', 'merged')

# pack all result files into a ZIP package
with zipfile.ZipFile(resultPackage, 'w', compression=zipfile.ZIP_DEFLATED) as zipFile:
    for file in os.listdir('merged'):
        zipFile.write('merged/'+file, file)

# write the ZIP file to the notebook's bucket in Cloud Object Storage
cosClient.upload_file(Filename=resultPackage, Bucket=credentials['BUCKET'], Key=resultPackage)

print('... done')

running ...
... done
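
Before uploading, it can be useful to confirm what actually went into the package. The sketch below packs a staging directory in the same way as the cell above and then reads back the archive's member names. It runs in a temporary directory with placeholder files; `pack_directory` and the file names are assumptions made for illustration.

```python
import os
import tempfile
import zipfile

def pack_directory(directory, package_path):
    """Zip every regular file in directory (flat, no subfolders); return member names."""
    with zipfile.ZipFile(package_path, "w", compression=zipfile.ZIP_DEFLATED) as package:
        for name in sorted(os.listdir(directory)):
            path = os.path.join(directory, name)
            if os.path.isfile(path):
                # store the file under its bare name, without the directory prefix
                package.write(path, arcname=name)
    # reopen the archive to verify its contents
    with zipfile.ZipFile(package_path) as package:
        return package.namelist()

with tempfile.TemporaryDirectory() as work:
    staging = os.path.join(work, "merged")
    os.makedirs(staging)
    for name in ["b.csv", "a.csv"]:  # placeholder files
        with open(os.path.join(staging, name), "w") as handle:
            handle.write("network,geoname_id\n")
    packed = pack_directory(staging, os.path.join(work, "package.zip"))

print(packed)
```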


-----
To download the ZIP package containing the results of merging IBM and Internet geography data, do this:

* In a browser, go to this notebook's project page.

* Open the 'Files' panel by clicking the 'Find and Add Data' icon in the upper-right corner of the project page.

* Check the box next to 'mergedIBMandInternetGeographyData.zip'.

* Select 'Download' from the pop-up menu in the 'Files' panel.


-----
Optionally, run this last cell to clean up the notebook's runtime environment. This step is not required.

In [203]:
#!rm -rf *
#!pip uninstall -y googlemaps geohash2 

In [204]:
ls -al

total 115200
drwx------  4 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 18 18:57 ./
drwx------ 11 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 17 20:06 ../
drwx------  2 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 18 18:51 GeoLite2-City-CSV_20180206/
drwx------  2 sa73-1acf9232f65bd2-cf1c60ef4a00 users     4096 Feb 18 18:57 merged/
-rw-------  1 sa73-1acf9232f65bd2-cf1c60ef4a00 users 39541117 Feb 18 18:57 mergedIBMandInternetGeographyData.zip
