# .gov.ua Websites

This notebook analyzes data collected by the `monitor.py` program, which tests whether .gov.ua websites (obtained from Wikidata) can be connected to.

In [1]:
! pip install requests pandas plotly leafmap ipwhois geopy folium


Read the URLs from GitHub. These were extracted from Wikidata using a [SPARQL Query](https://query.wikidata.org/#SELECT%20DISTINCT%20%3Furl%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP856%20%3Furl%20.%0A%20%20FILTER%28CONTAINS%28LCASE%28STR%28%3Furl%29%29%2C%20%27.gov.ua%27%29%29%0A%7D).

## Hostnames

In [2]:
import requests

urls = requests.get('https://raw.githubusercontent.com/edsu/gov-ua/main/urls.txt').text.splitlines()
urls[0:5]


To make it easier to process we can put them into a Pandas DataFrame:

In [2]:
import pandas

df = pandas.DataFrame({'homepage': urls})
df

| | homepage |
|---|---|
| 0 | http://2001.ukrcensus.gov.ua |
| 1 | http://7aac.gov.ua |
| 2 | http://academia.gov.ua |
| 3 | http://academy.gov.ua |
| 4 | http://academy.kvs.gov.ua |
| ... | ... |
| 1382 | http://zta.court.gov.ua |
| 1383 | http://zt.gov.ua |
| 1384 | http://zt-rada.gov.ua |
| 1385 | http://ztrada.gov.ua |


## IP Addresses

To understand the physical infrastructure behind these websites we can look up the [IP addresses](https://en.wikipedia.org/wiki/IP_address) for each of the website hostnames. We can use Python's [socket](https://docs.python.org/3/library/socket.html) module to do that.

In [3]:
from socket import gethostbyname

gethostbyname('ezupilska-gromada.gov.ua')

'195.248.234.252'

Since we have URLs in the DataFrame and gethostbyname wants a host name we can write a little function to parse the URL and do the lookup, while guarding against DNS lookup failures.

In [4]:
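from urllib.parse import urlparse

def ip(url):
    # Extract just the host name from the URL (dropping scheme, port and path).
    hostname = urlparse(url).hostname
    try:
        return gethostbyname(hostname)
    except Exception as e:
        # DNS failures are expected for defunct sites: report them and move on.
        print(f"Failed to lookup {url}: {e}")
        return None

ip('https://ezupilska-gromada.gov.ua/')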

'195.248.234.252'
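
One detail worth knowing when pulling the host out of a URL: the parsed `netloc` keeps any port number, while `hostname` does not. The snippet below is a small illustration using a made-up URL (the `example.gov.ua` address and port are assumptions for demonstration, not one of the sites in the dataset).

```python
from urllib.parse import urlparse

# Hypothetical URL with an explicit port, for illustration only.
parts = urlparse('https://example.gov.ua:8443/path?q=1')

print(parts.netloc)    # host and port together: example.gov.ua:8443
print(parts.hostname)  # host only, lower-cased: example.gov.ua
print(parts.port)      # port as an integer: 8443
```

For the URLs in this notebook, which carry no port, the two attributes give the same host name.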

OK, it works. Let's use it to look up the IP addresses for our websites.

In [5]:
df['ip'] = df['homepage'].map(ip)

Failed to lookup http://academy.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://aku.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://alex.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://ananiev-rda.odessa.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://andrrada.zt.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://an.loga.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://archive.nbuv.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://archive.odessa.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://archives.kh.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http://archive.zt.gov.ua: [Errno 8] nodename nor servname provided, or not known
Failed to lookup http:/

That took some time, so let's save our work!

In [6]:
df.to_csv('websites.csv', index=False)
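
Since the lookups spend most of their time waiting on the network, one way to speed them up might be to run them concurrently. The following is only a sketch of that idea, not what this notebook does: `resolve`, `resolve_all`, the `max_workers` value and the `fake_resolver` stand-in are all hypothetical names introduced here so the example can run offline.

```python
from concurrent.futures import ThreadPoolExecutor
from socket import gethostbyname
from urllib.parse import urlparse


def resolve(url, resolver=gethostbyname):
    """Return the IP address for a URL's host, or None if the lookup fails."""
    try:
        return resolver(urlparse(url).hostname)
    except Exception:
        return None


def resolve_all(urls, resolver=gethostbyname, max_workers=20):
    """Resolve many URLs concurrently, preserving the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: resolve(u, resolver), urls))


# Stand-in resolver so the example runs without network access.
def fake_resolver(hostname):
    table = {'a.example': '192.0.2.1', 'b.example': '192.0.2.2'}
    return table[hostname]  # raises KeyError for unknown hosts


ips = resolve_all(
    ['http://a.example', 'http://missing.example', 'http://b.example'],
    resolver=fake_resolver,
)
print(ips)  # ['192.0.2.1', None, '192.0.2.2']
```

With the real resolver, something like `df['ip'] = resolve_all(df['homepage'])` would fill the same column as the sequential version.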

## Web Hosting Providers

Now we can use [ipwhois](https://ipwhois.readthedocs.io/) to learn more about the web hosting provider that the IP address is registered to. Note that the postal address returned is the one on file for the provider, which may not be where the machine with that IP address is physically located.

To make things easier, let's drop any rows that we don't have an IP address for.

In [7]:
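# Take an explicit copy so adding columns later does not trigger a SettingWithCopyWarning.
df = df.dropna(subset=['ip']).copy()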
df

| | homepage | ip |
|---|---|---|
| 0 | http://2001.ukrcensus.gov.ua | 194.44.147.62 |
| 1 | http://7aac.gov.ua | 104.21.69.142 |
| 2 | http://academia.gov.ua | 176.103.56.62 |
| 4 | http://academy.kvs.gov.ua | 193.19.229.52 |
| 5 | http://adm.od.court.gov.ua | 212.90.190.139 |
| ... | ... | ... |
| 1382 | http://zta.court.gov.ua | 212.90.190.139 |
| 1383 | http://zt.gov.ua | 213.108.45.142 |
| 1384 | http://zt-rada.gov.ua | 104.21.92.252 |
| 1385 | http://ztrada.gov.ua | 213.108.45.142 |


Let's try looking up one IP address first to see what the response looks like (it's pretty complicated):

In [10]:
from ipwhois import IPWhois

rec = IPWhois('195.248.234.252')
resp = rec.lookup_rdap()

from pprint import pprint
pprint(resp)

{'asn': '42655',
 'asn_cidr': '195.248.234.0/24',
 'asn_country_code': 'UA',
 'asn_date': '2007-03-27',
 'asn_description': 'BESTHOSTING-AS, UA',
 'asn_registry': 'ripencc',
 'entities': ['BESTHOSTING-MNT',
              'BN906-RIPE',
              'ORG-BL42-RIPE',
              'RIPE-NCC-END-MNT',
              'BN906-RIPE'],
 'network': {'cidr': '195.248.234.0/23',
             'country': 'UA',
             'end_address': '195.248.235.255',
             'events': [{'action': 'last changed',
                         'actor': None,
                         'timestamp': '2016-06-02T10:16:28Z'}],
             'handle': '195.248.234.0 - 195.248.235.255',
             'ip_version': 'v4',
             'links': ['https://rdap.db.ripe.net/ip/195.248.234.252',
                       'http://www.ripe.net/data-tools/support/documentation/terms'],
             'name': 'BESTHOSTING-NET',
             'notices': [{'description': 'This output has been filtered.',
                          'links': N

The response includes a postal address under `objects -> contact -> address`. Here's a function to extract it:

In [40]:
def get_address(resp):
    # Return the first postal address found among the RDAP contact objects.
    for obj in resp['objects'].values():
        # 'contact' or 'address' may be missing or None for some objects.
        contact = obj.get('contact') or {}
        for entry in contact.get('address') or []:
            return entry['value'].replace('\n', ' ')
    return None

get_address(resp)

'21029, Ukraine, Vinnitsa Khmelnytske shose str 112-A'

It might also be useful to get the IP address range:

In [23]:
def get_ip_range(resp):
    return [resp['network']['start_address'], resp['network']['end_address']]

get_ip_range(resp)

['195.248.234.0', '195.248.235.255']
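
A start/end pair like this can also be expressed in CIDR notation, or used to test whether some other address falls inside the same block. Here is a minimal sketch using Python's standard `ipaddress` module; the helper names `range_to_cidrs` and `in_range` are made up for this example.

```python
import ipaddress


def range_to_cidrs(start, end):
    """Summarize an inclusive start/end address pair as a list of CIDR blocks."""
    first = ipaddress.ip_address(start)
    last = ipaddress.ip_address(end)
    return [str(net) for net in ipaddress.summarize_address_range(first, last)]


def in_range(ip, start, end):
    """Check whether ip lies within the inclusive start/end range."""
    return ipaddress.ip_address(start) <= ipaddress.ip_address(ip) <= ipaddress.ip_address(end)


print(range_to_cidrs('195.248.234.0', '195.248.235.255'))  # ['195.248.234.0/23']
print(in_range('195.248.234.252', '195.248.234.0', '195.248.235.255'))  # True
```

The `/23` result agrees with the `cidr` value in the `network` section of the RDAP response shown earlier.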

And the name of the ISP:

In [24]:
def get_isp(resp):
    return resp['network']['name']

get_isp(resp)

'BESTHOSTING-NET'

Now we can bundle up these different functions with the whois lookup so that we can apply it to the entire DataFrame:

In [59]:
def lookup(ip):
    rec = IPWhois(ip)
    resp = rec.lookup_rdap()
    return [
        get_address(resp),
        get_ip_range(resp),
        get_isp(resp)
    ]

lookup('104.21.69.142')

['101 Townsend Street San Francisco CA 94107 United States',
 ['104.16.0.0', '104.31.255.255'],
 'CLOUDFLARENET']

In [45]:
df[['address', 'ip_range', 'provider']] = df.apply(lambda r: lookup(r.ip), axis=1,  result_type="expand")
df


| | homepage | ip | address | ip_range | provider |
|---|---|---|---|---|---|
| 0 | http://2001.ukrcensus.gov.ua | 194.44.147.62 | UARNet Ukrainian Academic and Research Network... | [194.44.0.0, 194.44.255.255] | UA-ZZ-940217 |
| 1 | http://7aac.gov.ua | 104.21.69.142 | 101 Townsend Street San Francisco CA 94107 Uni... | [104.16.0.0, 104.31.255.255] | CLOUDFLARENET |
| 2 | http://academia.gov.ua | 176.103.56.62 | 42-A Tobolskaya street, office 230, Kharkov, U... | [176.103.48.0, 176.103.63.255] | XServer |
| 4 | http://academy.kvs.gov.ua | 193.19.229.52 | Internet Ukraine Ltd. Naukova str. 5 Lviv, Ukr... | [193.19.228.0, 193.19.231.255] | GENERAL-NETWORKS |
| 5 | http://adm.od.court.gov.ua | 212.90.190.139 | vul. S. Khokhlovyh, 15 Kiev, Ukraine, 04050 | [212.90.190.128, 212.90.190.191] | UKRCOM-CUSTOMER-NET |
| ... | ... | ... | ... | ... | ... |
| 1382 | http://zta.court.gov.ua | 212.90.190.139 | vul. S. Khokhlovyh, 15 Kiev, Ukraine, 04050 | [212.90.190.128, 212.90.190.191] | UKRCOM-CUSTOMER-NET |
| 1383 | http://zt.gov.ua | 213.108.45.142 | Ukraine, Zhytomyr, L.Ukrainki str., 38 | [213.108.40.0, 213.108.47.255] | Electra-ua |
| 1384 | http://zt-rada.gov.ua | 104.21.92.252 | 101 Townsend Street San Francisco CA 94107 Uni... | [104.16.0.0, 104.31.255.255] | CLOUDFLARENET |
| 1385 | http://ztrada.gov.ua | 213.108.45.142 | Ukraine, Zhytomyr, L.Ukrainki str., 38 | [213.108.40.0, 213.108.47.255] | Electra-ua |


That took a while, so let's save it!

In [46]:
df.to_csv('websites.csv', index=False)
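
Many rows share an IP address (several court.gov.ua sites resolve to 212.90.190.139, for instance), so each shared address is queried more than once. One possible refinement, sketched here under assumptions, is to cache results per IP and to turn a failed lookup into empty values rather than an exception. `make_cached_lookup` and `fake_lookup` are hypothetical names; `fake_lookup` merely stands in for the notebook's `lookup` so the example runs offline.

```python
from functools import lru_cache


def make_cached_lookup(lookup_fn, maxsize=None):
    """Wrap lookup_fn so each IP is queried once and failures yield Nones."""

    @lru_cache(maxsize=maxsize)
    def cached(ip):
        try:
            # Store the three fields (address, range, provider) as a tuple.
            return tuple(lookup_fn(ip))
        except Exception as e:
            print(f"Failed to look up {ip}: {e}")
            return (None, None, None)

    return cached


# Stand-in for the notebook's lookup(); records which IPs were actually queried.
calls = []

def fake_lookup(ip):
    calls.append(ip)
    if ip == '198.51.100.9':
        raise ValueError('no RDAP record')
    return ['1 Example Street', ['192.0.2.0', '192.0.2.255'], 'EXAMPLE-NET']


cached_lookup = make_cached_lookup(fake_lookup)
results = [cached_lookup(ip) for ip in ['192.0.2.7', '192.0.2.7', '198.51.100.9']]
print(len(calls))  # 2: the repeated IP was only looked up once
```

In the notebook this might be used as `cached_lookup = make_cached_lookup(lookup)`, with `cached_lookup(r.ip)` taking the place of `lookup(r.ip)` inside the `apply` call.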

## Geolocation

Now we can convert the addresses into geographic coordinates that we can put on a map. We can use [geopy](https://geopy.readthedocs.io/en/stable/) to look them up using the [Mapbox](https://www.mapbox.com/) service. You will need to get an API key to use it, and set it as the `MAPBOX_API_KEY` environment variable in your notebook environment.

In [8]:
import os
import geopy

api_key = os.environ.get('MAPBOX_API_KEY')

mapbox = geopy.MapBox(api_key=api_key)
loc = mapbox.geocode("vul. S. Khokhlovyh, 15 Kiev, Ukraine, 04050")
print(loc.latitude, loc.longitude)

50.46486 30.477626


Let's create a little function to geocode the addresses, sleeping for half a second between requests to avoid rate limiting.

In [9]:
import time

def geocode(address):
    time.sleep(.5)
    loc = mapbox.geocode(address)
    if loc:
        return (loc.latitude, loc.longitude)
    else:
        return (None, None)
    
geocode("vul. S. Khokhlovyh, 15 Kiev, Ukraine, 04050")

(50.46486, 30.477626)
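
A fixed sleep before every call is simple and works. An alternative, shown below as a rough sketch rather than the notebook's approach, is a decorator that only waits for whatever part of the interval has not already elapsed. The `throttle` decorator and the `fake_geocode` stand-in are hypothetical names used for illustration, and the short 0.05 second interval is only there to keep the demo quick.

```python
import time
from functools import wraps


def throttle(min_interval):
    """Decorator: ensure at least min_interval seconds between calls."""
    def decorator(fn):
        last_call = [None]  # mutable cell holding when the previous call finished

        @wraps(fn)
        def wrapper(*args, **kwargs):
            if last_call[0] is not None:
                wait = min_interval - (time.monotonic() - last_call[0])
                if wait > 0:
                    time.sleep(wait)
            try:
                return fn(*args, **kwargs)
            finally:
                last_call[0] = time.monotonic()
        return wrapper
    return decorator


# Stand-in geocoder so the example runs without an API key.
@throttle(0.05)
def fake_geocode(address):
    return (50.0, 30.0)


start = time.monotonic()
coords = [fake_geocode('somewhere') for _ in range(3)]
elapsed = time.monotonic() - start
print(coords, round(elapsed, 2))
```

Decorating the notebook's `geocode` with `@throttle(0.5)` and dropping its `time.sleep` call would give a similar pacing while skipping the wait before the very first request.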

In [11]:
df[['lat', 'lon']] = df.apply(lambda r: geocode(r.address), axis=1, result_type='expand')
df

| | homepage | ip | address | ip_range | provider | lat | lon |
|---|---|---|---|---|---|---|---|
| 0 | http://2001.ukrcensus.gov.ua | 194.44.147.62 | UARNet Ukrainian Academic and Research Network... | ['194.44.0.0', '194.44.255.255'] | UA-ZZ-940217 | 49.823850 | 24.033702 |
| 1 | http://7aac.gov.ua | 104.21.69.142 | 101 Townsend Street San Francisco CA 94107 Uni... | ['104.16.0.0', '104.31.255.255'] | CLOUDFLARENET | 37.780230 | -122.390470 |
| 2 | http://academia.gov.ua | 176.103.56.62 | 42-A Tobolskaya street, office 230, Kharkov, U... | ['176.103.48.0', '176.103.63.255'] | XServer | 49.990300 | 36.230400 |
| 3 | http://academy.kvs.gov.ua | 193.19.229.52 | Internet Ukraine Ltd. Naukova str. 5 Lviv, Ukr... | ['193.19.228.0', '193.19.231.255'] | GENERAL-NETWORKS | 49.819321 | 24.048161 |
| 4 | http://adm.od.court.gov.ua | 212.90.190.139 | vul. S. Khokhlovyh, 15 Kiev, Ukraine, 04050 | ['212.90.190.128', '212.90.190.191'] | UKRCOM-CUSTOMER-NET | 50.464860 | 30.477626 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1180 | http://zta.court.gov.ua | 212.90.190.139 | vul. S. Khokhlovyh, 15 Kiev, Ukraine, 04050 | ['212.90.190.128', '212.90.190.191'] | UKRCOM-CUSTOMER-NET | 50.464860 | 30.477626 |
| 1181 | http://zt.gov.ua | 213.108.45.142 | Ukraine, Zhytomyr, L.Ukrainki str., 38 | ['213.108.40.0', '213.108.47.255'] | Electra-ua | 50.650000 | 28.520000 |
| 1182 | http://zt-rada.gov.ua | 104.21.92.252 | 101 Townsend Street San Francisco CA 94107 Uni... | ['104.16.0.0', '104.31.255.255'] | CLOUDFLARENET | 37.780230 | -122.390470 |
| 1183 | http://ztrada.gov.ua | 213.108.45.142 | Ukraine, Zhytomyr, L.Ukrainki str., 38 | ['213.108.40.0', '213.108.47.255'] | Electra-ua | 50.650000 | 28.520000 |


Save it again so we don't need to do the geocoding again.

In [12]:
df.to_csv('websites.csv', index=False)

## Map

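Now that we have lat/lon coordinates for our websites we can put them on a map with [folium](https://python-visualization.github.io/folium/), a mapping library that [leafmap](https://leafmap.org) can also use as a backend.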

In [17]:
df = pandas.read_csv('websites.csv')
df = df.dropna()

import folium
from folium.plugins import MarkerCluster

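# folium.Map takes `location` and `zoom_start` for the initial view (centered on Kyiv here).
m = folium.Map(location=(50.44676, 30.51313), zoom_start=4)
cluster = MarkerCluster(name=".gov.ua Websites").add_to(m)

for _, row in df.iterrows():
    folium.Marker(
        location=[row['lat'], row['lon']],
        # Popup text is rendered as HTML, so separate the lines with <br>.
        popup=f"{row['homepage']}<br>{row['provider']}<br>{row['address']}"
    ).add_to(cluster)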

folium.LayerControl().add_to(m)

m
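
Because marker popups are rendered as HTML, it may be worth building the popup text with a small helper that escapes the values and turns the homepage into a clickable link. The helper below is a sketch under that assumption; `popup_html` is a made-up name, and the sample values are taken from the table above.

```python
from html import escape


def popup_html(homepage, provider, address):
    """Build popup markup: a clickable homepage link plus provider and address."""
    url = escape(homepage, quote=True)
    lines = [
        f'<a href="{url}" target="_blank">{url}</a>',
        escape(str(provider)),
        escape(str(address)),
    ]
    return '<br>'.join(lines)


print(popup_html('http://zt.gov.ua', 'Electra-ua', 'Ukraine, Zhytomyr, L.Ukrainki str., 38'))
```

The returned string could then be passed as the `popup` argument when creating each marker.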

In [18]:
m.save('Websites.html')