# .gov.ua Websites

This notebook analyzes data collected by the `monitor.py` program, which checks which .gov.ua websites (obtained from Wikidata) can be connected to.

In [None]:
! pip install requests pandas plotly leafmap folium python-geoip-geolite2 python-geoip-python3

Read the URLs from GitHub. These were extracted from Wikidata using a [SPARQL Query](https://query.wikidata.org/#SELECT%20DISTINCT%20%3Furl%20WHERE%20%7B%0A%20%20%3Fitem%20wdt%3AP856%20%3Furl%20.%0A%20%20FILTER%28CONTAINS%28LCASE%28STR%28%3Furl%29%29%2C%20%27.gov.ua%27%29%29%0A%7D).

## Hostnames

In [14]:
import requests

urls = requests.get('https://raw.githubusercontent.com/edsu/gov-ua/main/urls.txt').text.splitlines()
urls[0:5]

['http://2001.ukrcensus.gov.ua',
 'http://7aac.gov.ua',
 'http://academia.gov.ua',
 'http://academy.gov.ua',
 'http://academy.kvs.gov.ua']

To make them easier to process, we can put them into a pandas DataFrame:

In [17]:
import pandas

df = pandas.DataFrame({'homepage': urls})
df

Unnamed: 0,homepage
0,http://2001.ukrcensus.gov.ua
1,http://7aac.gov.ua
2,http://academia.gov.ua
3,http://academy.gov.ua
4,http://academy.kvs.gov.ua
...,...
1382,http://zta.court.gov.ua
1383,http://zt.gov.ua
1384,http://zt-rada.gov.ua
1385,http://ztrada.gov.ua
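
Since the rest of the notebook works with hostnames rather than full URLs, here is a minimal sketch of pulling the hostname out of each URL and counting how many sites sit under each parent domain. The helper names `hostname` and `parent_domain`, and the choice of keeping the last three labels, are illustrative assumptions and not part of the original notebook. The sample URLs are taken from the listing above.

```python
from collections import Counter
from urllib.parse import urlparse

def hostname(url):
    """Return the lowercased hostname of a URL, or None if it has none."""
    return urlparse(url).hostname

def parent_domain(host, labels=3):
    """Keep only the last `labels` dot-separated labels, e.g. 'court.gov.ua'."""
    return ".".join(host.split(".")[-labels:])

# A handful of URLs copied from the listing above
sample = [
    'http://2001.ukrcensus.gov.ua',
    'http://academy.kvs.gov.ua',
    'http://zta.court.gov.ua',
    'http://zt.gov.ua',
    'http://zt-rada.gov.ua',
]

hosts = [hostname(u) for u in sample]
counts = Counter(parent_domain(h) for h in hosts)
```

With the full list, `counts.most_common(10)` would show which parent domains host the most sites.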


## IP Addresses

To understand the physical infrastructure behind these websites we can look up the [IP addresses](https://en.wikipedia.org/wiki/IP_address) for each of the website hostnames. We can use Python's [socket](https://docs.python.org/3/library/socket.html) module to do that.

In [21]:
from socket import gethostbyname

gethostbyname('ezupilska-gromada.gov.ua')

'195.248.234.252'

Since we have URLs in the DataFrame and `gethostbyname` wants a hostname, we can write a little function to parse the URL and do the lookup, while guarding against DNS lookup failures.

In [22]:
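from urllib.parse import urlparse

def ip(url):
    """Return the IPv4 address for the URL's hostname, or None if the lookup fails."""
    # .hostname drops any port or userinfo that .netloc would include
    hostname = urlparse(url).hostname
    try:
        return gethostbyname(hostname)
    except Exception as e:
        print(f"Failed to lookup {url}: {e}")
        return None

ip('https://ezupilska-gromada.gov.ua/')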

'195.248.234.252'

OK, it works. Let's use it to look up the IP addresses for our websites.

In [None]:
df['ip'] = df['homepage'].map(ip)

That took some time, so let's save our work!

In [None]:
df.to_csv('websites.csv', index=False)
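
The lookups above run one at a time, which is probably why the step is slow. Assuming most of that time is spent waiting on DNS responses, a thread pool may speed it up. The following is only a sketch: `resolve`, `resolve_all`, the `resolver` parameter and the worker count are illustrative names and values, not part of the original notebook.

```python
from concurrent.futures import ThreadPoolExecutor
from socket import gethostbyname
from urllib.parse import urlparse

def resolve(url, resolver=gethostbyname):
    """Return the IPv4 address for a URL's hostname, or None on failure."""
    try:
        return resolver(urlparse(url).hostname)
    except (OSError, TypeError, UnicodeError):
        # OSError covers DNS failures (socket.gaierror is a subclass);
        # TypeError covers URLs with no hostname at all
        return None

def resolve_all(urls, resolver=gethostbyname, max_workers=20):
    """Resolve many URLs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda u: resolve(u, resolver), urls))

# Possible usage with the DataFrame from this notebook:
# df['ip'] = resolve_all(df['homepage'])
```

Passing the resolver as a parameter is a design choice that makes the function easy to try out with a stand-in lookup before running it against real DNS.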

## Geolocation

Now we can look up the latitude and longitude for the IP addresses using the [GeoLite2](https://dev.maxmind.com/geoip/geolite2-free-geolocation-data) database. To make this easier, we first remove the rows that don't have an IP address.

In [60]:
df = df.dropna()
df

Unnamed: 0,homepage,ip
0,http://2001.ukrcensus.gov.ua,194.44.147.62
1,http://academia.gov.ua,176.103.56.62
3,http://academy.kvs.gov.ua,193.19.229.52
4,http://adm.od.court.gov.ua,212.90.190.139
5,http://akim.gov.ua,178.20.153.53
...,...,...
1179,https://zhovtanetska-gromada.gov.ua,195.248.234.252
1180,https://zhuravnenska-gromada.gov.ua,195.248.234.252
1181,https://zp.gov.ua,80.254.6.205
1182,https://zpa.court.gov.ua,212.90.190.139
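
The table above already hints that several sites resolve to the same address (for example `195.248.234.252` and `212.90.190.139` each appear twice in the rows shown). A small sketch of counting sites per IP with `value_counts`, using only the rows visible above as sample data. The variable names `sample`, `counts` and `shared` are illustrative.

```python
import pandas

# IP addresses copied from the rows shown in the table above
sample = pandas.DataFrame({'ip': [
    '194.44.147.62', '176.103.56.62', '193.19.229.52',
    '212.90.190.139', '178.20.153.53', '195.248.234.252',
    '195.248.234.252', '80.254.6.205', '212.90.190.139',
]})

# Number of sites per IP address, most common first
counts = sample['ip'].value_counts()

# Keep only addresses that serve more than one site
shared = counts[counts > 1]
```

On the full DataFrame the same two lines would be `df['ip'].value_counts()` followed by the same filter.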


In [23]:
from geoip import geolite2

geolite2.lookup('195.248.234.252')

<IPInfo ip='195.248.234.252' country='UA' continent='EU' subdivisions=frozenset({'05'}) timezone='Europe/Kiev' location=(49.2328, 28.481)>

Let's write a function and apply it to our dataset, again being careful to handle the case where no location is found for an IP address.

In [61]:
def geo(ip):
    loc = geolite2.lookup(ip)
    if loc:
        return loc.location
    else:
        return None

geo('195.248.234.252')

(49.2328, 28.481)

Now we can update our DataFrame with the geolocation information.

In [66]:
# assign() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df = df.assign(location=df['ip'].map(geo))
df

Unnamed: 0,homepage,ip,location
0,http://2001.ukrcensus.gov.ua,194.44.147.62,"(50.45, 30.5233)"
1,http://academia.gov.ua,176.103.56.62,"(49.4859, 28.3482)"
3,http://academy.kvs.gov.ua,193.19.229.52,"(50.4333, 30.5167)"
4,http://adm.od.court.gov.ua,212.90.190.139,"(50.45, 30.5233)"
5,http://akim.gov.ua,178.20.153.53,"(50.45, 30.5233)"
...,...,...,...
1179,https://zhovtanetska-gromada.gov.ua,195.248.234.252,"(49.2328, 28.481)"
1180,https://zhuravnenska-gromada.gov.ua,195.248.234.252,"(49.2328, 28.481)"
1181,https://zp.gov.ua,80.254.6.205,"(50.45, 30.5233)"
1182,https://zpa.court.gov.ua,212.90.190.139,"(50.45, 30.5233)"


To make it easier to process we can further remove the rows that don't have locations. It is important to note how the original list of websites is getting whittled down.

In [69]:
df = df.dropna()
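
To put numbers on that whittling down, one option is to count the non-missing values in each column before dropping anything. This is a sketch on a tiny made-up DataFrame. The `.example` hostnames, the documentation-range IP addresses and the variable names are all hypothetical, chosen only to show the pattern.

```python
import pandas

# Hypothetical data: one site has no IP, and one has an IP but no location
sample = pandas.DataFrame({
    'homepage': ['http://a.example', 'http://b.example', 'http://c.example'],
    'ip': ['192.0.2.1', None, '192.0.2.3'],
    'location': [(50.45, 30.5233), None, None],
})

total = len(sample)
with_ip = int(sample['ip'].notna().sum())
with_location = int(sample['location'].notna().sum())

print(f"{total} sites, {with_ip} with an IP, {with_location} with a location")

# dropna(subset=...) limits the check to the named columns
located = sample.dropna(subset=['location'])
```

Running the same counts on the real DataFrame before each `dropna()` would show how many sites are lost at the DNS step and how many at the geolocation step.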

Unpack the lat/lon into separate columns, and drop the location.

In [70]:
df['lat'] = df['location'].map(lambda a: a[0])
df['lon'] = df['location'].map(lambda a: a[1])
df = df.drop(columns=['location'])
df

Unnamed: 0,homepage,ip,lat,lon
0,http://2001.ukrcensus.gov.ua,194.44.147.62,50.4500,30.5233
1,http://academia.gov.ua,176.103.56.62,49.4859,28.3482
3,http://academy.kvs.gov.ua,193.19.229.52,50.4333,30.5167
4,http://adm.od.court.gov.ua,212.90.190.139,50.4500,30.5233
5,http://akim.gov.ua,178.20.153.53,50.4500,30.5233
...,...,...,...,...
1179,https://zhovtanetska-gromada.gov.ua,195.248.234.252,49.2328,28.4810
1180,https://zhuravnenska-gromada.gov.ua,195.248.234.252,49.2328,28.4810
1181,https://zp.gov.ua,80.254.6.205,50.4500,30.5233
1182,https://zpa.court.gov.ua,212.90.190.139,50.4500,30.5233


Save it again so we don't need to recalculate:

In [71]:
df.to_csv('websites.csv', index=False)

## Map

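Now that we have lat/lon coordinates for our websites we can put them on a map with [folium](https://python-visualization.github.io/folium/), a mapping library that [leafmap](https://leafmap.org) can use as a backend.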

In [98]:
df = pandas.read_csv('websites.csv')

import folium
from folium.plugins import MarkerCluster

# folium.Map takes `location` and `zoom_start` (not `center`/`zoom_level`)
m = folium.Map(location=(50.44676, 30.51313), zoom_start=4)
cluster = MarkerCluster(name=".gov.ua Websites").add_to(m)

for i, row in df.iterrows():
    folium.Marker(
        location=[row['lat'], row['lon']],
        popup=row['homepage']
    ).add_to(cluster)

folium.LayerControl().add_to(m)

m

In [99]:
m.save('Websites.html')
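
Many sites in the tables above share identical coordinates, such as (50.45, 30.5233), so their markers land on exactly the same point. One possible refinement, sketched here with the standard library only, is to group homepages by coordinate first so that each point could get a single marker whose popup lists every site there. The names `rows`, `sites_at` and `popups` are illustrative, and the sample rows are copied from the table above.

```python
from collections import defaultdict

# (homepage, lat, lon) rows copied from the table above
rows = [
    ('http://2001.ukrcensus.gov.ua', 50.45, 30.5233),
    ('http://academia.gov.ua', 49.4859, 28.3482),
    ('http://akim.gov.ua', 50.45, 30.5233),
    ('https://zp.gov.ua', 50.45, 30.5233),
]

# Collect every homepage found at each (lat, lon) point
sites_at = defaultdict(list)
for homepage, lat, lon in rows:
    sites_at[(lat, lon)].append(homepage)

# One popup string per point, with sites separated by HTML line breaks
popups = {point: "<br>".join(sites) for point, sites in sites_at.items()}
```

With the DataFrame, the rows could come from `df[['homepage', 'lat', 'lon']].itertuples(index=False)`, and each entry in `popups` could then be passed to a single `folium.Marker`.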