# Download Permafrost Data

The [Global Terrestrial Network for Permafrost - Database](http://www.gtnpdatabase.org/) hosts a repository of data from permafrost sites. Two kinds of data are available:

 * Thaw depths (also known as active layer thickness)
 * Temperatures

Unfortunately, the site serves the data one site at a time, which involves a lot of user interaction. This is inefficient for analyses at scale. This notebook downloads the data locally so you can analyse it yourself.

## Packages

There are a few packages which may not be part of a default installation. Also, some may require an installation at the OS level outside the realm of pipenv, conda, or pip.
* [geopandas](https://geopandas.org/) is required to save data with geospatial information
* [PostgreSQL](https://www.postgresql.org/) and its geospatial extension [PostGIS](https://postgis.net/install/) are required locally. We will try to make a Docker package available. Note that PostGIS needs to be enabled per database. Mac users may consider [Postgres.app](https://postgresapp.com/); Windows users may consider [PostgreSQL Portable](https://github.com/garethflowers/postgresql-portable).
* [pycountry](https://pypi.org/project/pycountry/) is used to standardise country names and look up their ISO codes
* [Selenium](https://pypi.org/project/selenium/) is required to load pages that use JavaScript to build their content. This generally requires some installation on the OS as Selenium needs a working web browser to render these pages.
* We also use sqlalchemy, pandas, numpy, and scipy

In [2]:
import io
import pandas as pd
import geopandas as gpd
import shapely
from bs4 import BeautifulSoup
import requests
import zipfile
from pathlib import Path
import time
from tqdm import tqdm
import numpy as np
import pycountry
import glob
import os
import re
from selenium import webdriver
from sqlalchemy import create_engine
import sqlalchemy
import json
import datetime
import socket
import urllib
import sys
from scipy.interpolate import interp1d

## Web Crawling using Selenium

The web page is dynamically generated using JavaScript, which makes accessing the content slightly more challenging. Selenium is a toolset created for automated UI testing, but it is also used to crawl websites.

Depending on your Jupyter setup, Selenium may be very tricky or even impossible to install, because it requires a functioning web browser on the machine that hosts the Jupyter environment. Servers and cloud-based installations often lack one. If you cannot get Selenium to run:
* Open [http://www.gtnpdatabase.org/activelayers](http://www.gtnpdatabase.org/activelayers) and [http://www.gtnpdatabase.org/boreholes](http://www.gtnpdatabase.org/boreholes) in a browser on your workstation
* Save each web page locally (File -> Save)
* Load the saved files in the fallback cell further down

In [3]:
options = webdriver.ChromeOptions() # https://stackoverflow.com/questions/50642308
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

wd = webdriver.Chrome(options=options)

In [5]:
wd.get("http://www.gtnpdatabase.org/activelayers")
print("waiting 10 seconds for the JavaScript to complete",end=".")
time.sleep(10) # wait for javascript to finish
print("...done, next page")
activelayers_page = wd.page_source

wd.get("http://www.gtnpdatabase.org/boreholes")
print("waiting 10 seconds for the JavaScript to complete",end=".")
time.sleep(10) # wait for javascript to finish
print("...finished")
boreholes_temps_page = wd.page_source

waiting 10 seconds for the JavaScript to complete....done, next page
waiting 10 seconds for the JavaScript to complete.
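
The fixed `time.sleep(10)` can wait longer than necessary on a fast connection and not long enough on a slow one. The following is a minimal sketch of a polling alternative, not part of the original notebook: `wait_for_page`, `get_source`, and `is_ready` are hypothetical names, and the demonstration uses fake page snapshots instead of a real browser. With Selenium, `get_source` could be `lambda: wd.page_source`. Selenium also provides its own `WebDriverWait` helper for this purpose.

```python
import time


def wait_for_page(get_source, is_ready, timeout=30.0, poll=0.5):
    """Poll get_source() until is_ready(source) is true or the timeout elapses.

    get_source and is_ready are hypothetical callables supplied by the caller,
    e.g. get_source=lambda: wd.page_source.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = get_source()
        if is_ready(source):
            return source
        if time.monotonic() >= deadline:
            raise TimeoutError("page not ready after {} seconds".format(timeout))
        time.sleep(poll)


# Demonstration with a fake page whose table rows only appear on the third poll.
snapshots = iter([
    "<table></table>",
    "<table></table>",
    "<table><tr><td>a</td></tr><tr><td>b</td></tr></table>",
])
page = wait_for_page(lambda: next(snapshots),
                     lambda html: html.count("<tr") >= 2,
                     timeout=5.0, poll=0.01)
print(page.count("<tr"))
```

The readiness test here (at least two table rows) is an assumption; any check that distinguishes the rendered table from the empty page template would do.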

In [7]:
# if you could not get Selenium up and running, open the above pages in a browser and save all the content
# then uncomment the next lines and adjust the file names

# with open("activelayers.html","r") as infile:
#    activelayers_page = infile.read()

# with open("boreholes.html","r") as infile:
#    boreholes_temps_page = infile.read()

Unfortunately, the usual way of parsing the table, pandas' `read_html`, can drop an entry: the station _Da Xi Gou (Glacier Station)_ is not always included.

In [8]:
df = pd.read_html(io.StringIO(wd.page_source))[0] # little known fact: pandas parses HTML tables quite well; note that read_html returns a list of DataFrames
df.columns = [c.lower() for c in df.columns] 
df[df["gtn-p"]=="CN9"]

Unnamed: 0,name,site,country,gtn-p,vegetation,permafrost,elevation,depth,data,select


### Helper Function to Log Data Issues

As we walk through the data, we will encounter deficiencies. We set up a log (a JSON file) so that these deficiencies can be rectified where possible.

In [9]:
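def static_vars(**kwargs):
    # decorator that attaches the given keyword arguments to the function as attributes
    def decorate(func):
        for k in kwargs:
            setattr(func, k, kwargs[k])
        return func
    return decorate

@static_vars(log=[])
def log_finding(finding):
    # add a timestamp and the host name, then keep the finding in memory
    finding["datetime_date"] = "{:%Y-%m-%d %H:%M:%S}".format(datetime.datetime.now())
    finding["machine"] = socket.gethostname()
    log_finding.log.append(finding)
    # rewrite the whole file on every call so it always mirrors log_finding.log
    with open("findings.json","w+t") as logfile:
        json.dump(log_finding.log,logfile)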
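
Once findings have been written, the log can be read back and filtered. The sketch below is an illustration under assumptions: it writes a small sample file with made-up entries to a temporary directory rather than touching the real `findings.json`, and it relies only on the `needs_attention` key that this notebook sets when logging.

```python
import json
import os
import tempfile

# Made-up sample findings in the same shape that log_finding writes.
sample = [
    {"topic": "borehole temperatures", "diagnosis": "example typo",
     "needs_attention": False},
    {"topic": "active layers", "diagnosis": "example missing value",
     "needs_attention": True},
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "findings.json")
    with open(path, "w") as logfile:
        json.dump(sample, logfile)

    # Read the log back, as one would with the real findings.json.
    with open(path) as logfile:
        findings = json.load(logfile)

# Keep only the findings that still need a manual fix.
open_issues = [f for f in findings if f.get("needs_attention")]
print(len(findings), len(open_issues))
```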

## Database connection

If you plan to publish your code on public platforms such as GitHub or GitLab, never add credentials to your source code. In fact, never do so even if you have no such plans. We read the credentials and the database server from a file named `.env`. In this repo, copy `.env.example` to `.env` and edit the content to suit your needs. We need a string of the form

```
postgresql://<user>:<password>@<hostname>/<database>
```

in that file. Often, this is `postgresql://postgres@localhost/postgres` by default.

In [14]:
with open(".env","rt") as idfile:
    connect_string = idfile.read().strip()
conn = create_engine(connect_string)
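
A malformed connection string otherwise only surfaces when the database is first contacted. As a minimal sketch, the string can be checked with the standard library before it is passed to `create_engine`. The function name `check_connect_string` and its return format are assumptions made for this illustration and are not part of the notebook.

```python
from urllib.parse import urlsplit


def check_connect_string(connect_string):
    """Return the parts of a postgresql:// URL; raise ValueError if it is malformed."""
    parts = urlsplit(connect_string.strip())
    # Allow an optional driver suffix such as "postgresql+psycopg2".
    if parts.scheme.split("+")[0] != "postgresql":
        raise ValueError("expected a postgresql:// URL, got scheme {!r}".format(parts.scheme))
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("hostname and database are both required")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "database": database,
        "has_password": parts.password is not None,  # never echo the password itself
    }


info = check_connect_string("postgresql://postgres@localhost/postgres\n")
print(info)
```

The check deliberately reports only whether a password is present, so that the result can be printed or logged without leaking credentials.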

## Web Page parsing using BeautifulSoup

The download links use the id of the site in their URL. The id field is not visible on the web page, but it appears in the `href` of the hyperlinks in the table. These links are contained in a table cell (`<td>`) with the attribute `data-th="Name"`.

The html code looks like

```
<td data-th="Name"><a class="ng-binding" href="/activelayers/view/195/">Allakaket</a></td>
```

and we are after the 195, which is the last non-empty forward-slash-separated field of the URL (the URL ends with a slash). As we already use BeautifulSoup for this, we also use it to populate the data frame.
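
The id extraction can be tried in isolation on the snippet above. This sketch uses the standard library's `html.parser` instead of BeautifulSoup purely to stay self-contained; `NameLinkParser` and `id_from_href` are hypothetical helper names.

```python
from html.parser import HTMLParser


class NameLinkParser(HTMLParser):
    """Collect the href of every <a> found inside a <td data-th="Name"> cell."""

    def __init__(self):
        super().__init__()
        self.in_name_cell = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "td":
            self.in_name_cell = attrs.get("data-th") == "Name"
        elif tag == "a" and self.in_name_cell and "href" in attrs:
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_name_cell = False


def id_from_href(href):
    # Drop the leading and trailing slash, then take the last path field.
    return href.strip("/").split("/")[-1]


snippet = ('<td data-th="Name"><a class="ng-binding" '
           'href="/activelayers/view/195/">Allakaket</a></td>')
parser = NameLinkParser()
parser.feed(snippet)
ids = [id_from_href(h) for h in parser.hrefs]
print(ids)
```

Stripping the slashes first and taking index `-1` gives the same field as `split("/")[-2]` on the unstripped URL, which is the form used in the cells below.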

### Activelayers

In [15]:
soup = BeautifulSoup(io.StringIO(activelayers_page), 'lxml') # name the parser explicitly, as in the Temperatures section

entries = []
for tr in soup.findAll("tr"):
    entry = {}
    for td in tr.findAll("td"):
        if td["data-th"] == "Gtn-P":
            entry["code"] = td.text
        else:
            entry[td["data-th"].lower()] = td.text
        if td["data-th"] == "Name":
            entry["subsite_id"] = td.contents[0]["href"].split("/")[-2]
            entry["view_url"] = td.contents[0]["href"]
        elif td["data-th"] == "Site":
            entry["site_id"] = td.contents[0]["href"].split("/")[-2]
            entry["site_url"] = td.contents[0]["href"]
    if len(entry)>0:
        entries.append(entry)
dfActiveLayersCatalogue = pd.DataFrame(entries)
dfActiveLayersCatalogue.elevation = pd.to_numeric(dfActiveLayersCatalogue.elevation)
print("{} Entries loaded".format(len(dfActiveLayersCatalogue)))

253 Entries loaded


Add standard country codes and names using `pycountry`.

In [16]:
dfActiveLayersCatalogue["iso2"] = ""
dfActiveLayersCatalogue["iso3"] = ""
dfActiveLayersCatalogue["country_official_name"] = ""
for i,r in dfActiveLayersCatalogue.iterrows():
    info = pycountry.countries.search_fuzzy(r.country)
    dfActiveLayersCatalogue.at[i,"iso2"] = info[0].alpha_2
    dfActiveLayersCatalogue.at[i,"iso3"] = info[0].alpha_3
    if "country_official_name" in info[0].__dict__.keys():
        dfActiveLayersCatalogue.at[i,"country_official_name"] = info[0].country_official_name
    else:
        dfActiveLayersCatalogue.at[i,"country_official_name"] = info[0].name
dfActiveLayersCatalogue["subsite_type"] = "activelayer"
dfActiveLayersCatalogue.head()

Unnamed: 0,name,subsite_id,view_url,site,site_id,site_url,country,type,code,vegetation,permafrost,elevation,data,select,iso2,iso3,country_official_name,subsite_type
0,56 Mile,227,/activelayers/view/227/,Franklin Bluff,16,/sites/view/16/,United States,Grid,U31 A,Tundra,Continuous,114.0,No,,US,USA,United States,activelayer
1,Abisko,42,/activelayers/view/42/,Abisko,6,/sites/view/6/,Sweden,Grid,S2,Forest Tundra,Discontinuous,507.0,No,,SE,SWE,Sweden,activelayer
2,Akhmelo Channel,88,/activelayers/view/88/,Cherskii,8,/sites/view/8/,Russia,Grid,R17,Forest Tundra,Continuous,5.0,Yes,,RU,RUS,Russian Federation,activelayer
3,Alazeya River,82,/activelayers/view/82/,Cherskii,8,/sites/view/8/,Russia,Grid,R22,Shrub Tundra,Continuous,60.0,Yes,,RU,RUS,Russian Federation,activelayer
4,Alexandria Fiord,27,/activelayers/view/27/,Alexandria Fiord,329,/sites/view/329/,Canada,Grid,C1,Tundra,Continuous,0.0,Yes,,CA,CAN,Canada,activelayer
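
The catalogue contains many rows but only a few distinct countries, so the lookup can be done once per country rather than once per row. The following sketch shows the idea with a hypothetical `lookup` callable standing in for the fuzzy search; `add_country_codes` and `fake_lookup` are illustrative names, and the sample values are taken from the table above.

```python
def add_country_codes(countries, lookup):
    """Resolve each distinct country name once; return one result per input row.

    lookup is a hypothetical callable: name -> (iso2, iso3, official_name).
    """
    cache = {}
    rows = []
    for name in countries:
        if name not in cache:
            cache[name] = lookup(name)  # only reached for names not seen before
        rows.append(cache[name])
    return rows


calls = []


def fake_lookup(name):
    # Stand-in for the real country search; records how often it is called.
    calls.append(name)
    table = {
        "Russia": ("RU", "RUS", "Russian Federation"),
        "Canada": ("CA", "CAN", "Canada"),
    }
    return table[name]


rows = add_country_codes(["Russia", "Russia", "Canada", "Russia"], fake_lookup)
print(len(rows), len(calls))
```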


### Temperatures

In [17]:
soup = BeautifulSoup(io.StringIO(boreholes_temps_page), 'lxml')

entries = []
for tr in soup.findAll("tr"):
    entry = {}
    for td in tr.findAll("td"):
        if td["data-th"] == "Gtn-P":
            entry["code"] = td.text
        else:
            entry[td["data-th"].lower()] = td.text
        if td["data-th"] == "Name":
            entry["subsite_id"] = td.contents[0]["href"].split("/")[-2]
            entry["view_url"] = td.contents[0]["href"]
        elif td["data-th"] == "Site":
            entry["site_id"] = td.contents[0]["href"].split("/")[-2]
            entry["site_url"] = td.contents[0]["href"]
    if len(entry)>0:
        entries.append(entry)
dfBoreholeTempsCatalogue = pd.DataFrame(entries)
dfBoreholeTempsCatalogue.elevation = pd.to_numeric(dfBoreholeTempsCatalogue.elevation)
print("{} Entries loaded".format(len(dfBoreholeTempsCatalogue)))

1380 Entries loaded


One country name is misspelled on the site ("Kyrgystan" instead of "Kyrgyzstan"). We correct it before the lookup and log the finding.

In [18]:
dfBoreholeTempsCatalogue["iso2"] = ""
dfBoreholeTempsCatalogue["iso3"] = ""
dfBoreholeTempsCatalogue["country_official_name"] = ""
for i,r in dfBoreholeTempsCatalogue.iterrows():
    if r.country.lower() == "kyrgystan":
        log_finding({"topic":"borehole temperatures",
                    "site_id":r.site_id,
                    "subsite_id":r.subsite_id,
                    "site_url":r.site_url,
                    "site":r.site,
                    "diagnosis":"Country spelling incorrect, '{}' is spelled 'Kyrgyzstan'".format(r.country),
                    "fix":"changed to 'Kyrgyzstan'",
                    "needs_attention":False})
        info = pycountry.countries.search_fuzzy("Kyrgyzstan")
    else:
        info = pycountry.countries.search_fuzzy(r.country)
    dfBoreholeTempsCatalogue.at[i,"iso2"] = info[0].alpha_2
    dfBoreholeTempsCatalogue.at[i,"iso3"] = info[0].alpha_3
    if "country_official_name" in info[0].__dict__.keys():
        dfBoreholeTempsCatalogue.at[i,"country_official_name"] = info[0].country_official_name
    else:
        dfBoreholeTempsCatalogue.at[i,"country_official_name"] = info[0].name
        
dfBoreholeTempsCatalogue["subsite_type"] = "temperatures"
dfBoreholeTempsCatalogue.head()

Unnamed: 0,name,subsite_id,view_url,site,site_id,site_url,country,code,vegetation,permafrost,elevation,depth,data,select,iso2,iso3,country_official_name,subsite_type
0,0 (Deputatsky),1007,/boreholes/view/1007/,Deputatskiy,331,/sites/view/331/,Russia,RU 118,Shrub Tundra,Continuous,462.0,88.0,No,,RU,RUS,Russian Federation,temperatures
1,01TC1,1744,/boreholes/view/1744/,Yukon,73,/sites/view/73/,Canada,CA 196,Grassland,Continuous,18.0,8.0,No,,CA,CAN,Canada,temperatures
2,01TC2,1745,/boreholes/view/1745/,Yukon,73,/sites/view/73/,Canada,CA 197,Grassland,Continuous,95.0,10.0,No,,CA,CAN,Canada,temperatures
3,03TC1,1743,/boreholes/view/1743/,Yukon,73,/sites/view/73/,Canada,CA 195,Grassland,Discontinuous,3.0,6.0,No,,CA,CAN,Canada,temperatures
4,08 (Deputatsky),1008,/boreholes/view/1008/,Deputatskiy,331,/sites/view/331/,Russia,RU 119,Forest Tundra,Continuous,473.0,96.0,No,,RU,RUS,Russian Federation,temperatures
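
If more misspellings turn up, the single `if` can be generalised into a small correction table. This is a sketch under assumptions: `KNOWN_CORRECTIONS` and `correct_country` are hypothetical names, and findings are collected in a plain list here rather than passed to `log_finding`.

```python
# Hypothetical correction table: lower-cased spelling as found -> corrected spelling.
KNOWN_CORRECTIONS = {"kyrgystan": "Kyrgyzstan"}


def correct_country(name, findings):
    """Return the corrected country name; record a finding if a fix was applied."""
    corrected = KNOWN_CORRECTIONS.get(name.strip().lower())
    if corrected is None:
        return name  # nothing known to be wrong with this spelling
    findings.append({
        "diagnosis": "Country spelling incorrect, '{}' is spelled '{}'".format(name, corrected),
        "fix": "changed to '{}'".format(corrected),
        "needs_attention": False,
    })
    return corrected


findings = []
names = [correct_country(n, findings) for n in ["Russia", "Kyrgystan", "Canada"]]
print(names, len(findings))
```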
