# Download Permafrost Data

The [Global Terrestrial Network for Permafrost - Database](http://www.gtnpdatabase.org/) hosts a repository of data from permafrost sites. Two kinds of data are available:
 * Thaw Depths (aka Active Layer Thickness)
 * Temperatures

Unfortunately, the site is geared towards retrieving datasets one at a time, which involves a lot of user interaction and is inefficient for analyses at scale. This notebook downloads the data to local storage so you can analyse it yourself.

We are trying to build a simplified version of the GTN-P data with as few tables as possible. Because the source uses normalized tables, the consistency of the GTN-P data is generally good, so for ease of use in data science we store the data in "fat" (denormalized) tables.

![Data Model](Data_Model.png)


## Packages

There are a few packages which may not be part of a default installation. Also, some may require an installation at the OS level outside the realm of pipenv, conda, or pip.
* [geopandas](https://geopandas.org/) is required to save data with geospatial information
* [PostgreSQL](https://www.postgresql.org/) and its geospatial extension [PostGIS](https://postgis.net/install/) are required locally. We will try to make a Docker package available. Note that PostGIS needs to be enabled per database. Mac users may consider [Postgres.app](https://postgresapp.com/); Windows users may consider [PostgreSQL Portable](https://github.com/garethflowers/postgresql-portable).
* [pycountry](https://pypi.org/project/pycountry/) is used to standardise country names and look up their ISO codes
* [Selenium](https://pypi.org/project/selenium/) is required to load pages that use JavaScript to build their content. This generally requires some installation on the OS as Selenium needs a working web browser to render these pages.
* [tqdm](https://github.com/tqdm/tqdm) for progress bars (although we could also use ipywidgets progress bar, but tqdm also measures and displays time per iteration and often ETA)
* We also use sqlalchemy, pandas, numpy, and scipy. To avoid data conversion difficulties, we store local data files as binary files, preference being parquet, so pyarrow or fastparquet may be required.
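
Before running the import cell, it can help to see which of these packages are actually installed. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this illustration, and the list uses the import names from this notebook (Beautiful Soup installs as `bs4`).

```python
import importlib.util

def missing_packages(names):
    """Return the subset of `names` that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Import names as used in this notebook.
required = ["geopandas", "pycountry", "selenium", "tqdm", "sqlalchemy",
            "pandas", "numpy", "scipy", "bs4", "requests"]
print("missing:", missing_packages(required))
```

This only tells you whether a Python package can be found. It does not check OS-level requirements such as a web browser for Selenium or a running PostgreSQL server.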

In [204]:
import io
import pandas as pd
import geopandas as gpd
import shapely
from bs4 import BeautifulSoup
import bs4
import requests
import zipfile
from pathlib import Path
import time
from tqdm import tqdm
import numpy as np
import pycountry
import glob
import os
import re
from selenium import webdriver
from sqlalchemy import create_engine
import sqlalchemy
import json
import datetime
import socket
import urllib.parse  # urllib.parse.quote is used further down; a bare "import urllib" does not guarantee the submodule is loaded
import sys
from scipy.interpolate import interp1d

## Web Crawling using Selenium

The web page is dynamically generated using JavaScript, which makes accessing the content slightly more challenging. Selenium is a toolset created for automated UI testing, but it is also used to crawl websites.

Depending on your Jupyter setup, Selenium may be very tricky or even impossible to install, as it requires a functioning web browser on the machine that hosts the Jupyter environment. Servers and cloud-based installations often lack one. Should this be the case:
* Open [http://www.gtnpdatabase.org/activelayers](http://www.gtnpdatabase.org/activelayers) in a browser on your workstation
* Try to File->Save the webpage locally
* There will be a hook for you to copy and paste the file further down

In [3]:
options = webdriver.ChromeOptions() # https://stackoverflow.com/questions/50642308
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')

wd = webdriver.Chrome(options=options)

If you cannot get Selenium installed or working, open [http://www.gtnpdatabase.org/activelayers](http://www.gtnpdatabase.org/activelayers) and save it as `./sun/activelayers/index.html`. Same for [http://www.gtnpdatabase.org/boreholes](http://www.gtnpdatabase.org/boreholes) and `./sun/boreholes/index.html`.

In [6]:
Path("./sun/activelayers").mkdir(exist_ok=True,parents=True)

if not os.path.exists("./sun/activelayers/index.html"):
    wd.get("http://www.gtnpdatabase.org/activelayers")
    print("waiting 10 seconds for the JavaScript to complete",end=".")
    time.sleep(10) # wait for javascript to finish
    print("...done, next page")
    activelayers_page = wd.page_source
    with open("./sun/activelayers/index.html","w+") as outfile:
        outfile.write(activelayers_page)
else:
    with open("./sun/activelayers/index.html","r") as infile:
        activelayers_page = infile.read()

Path("./sun/boreholes").mkdir(exist_ok=True,parents=True)

if not os.path.exists("./sun/boreholes/index.html"):
    wd.get("http://www.gtnpdatabase.org/boreholes")
    print("waiting 10 seconds for the JavaScript to complete",end=".")
    time.sleep(10) # wait for javascript to finish
    print("...finished")

    boreholes_temps_page = wd.page_source
    with open("./sun/boreholes/index.html","w+") as outfile:
        outfile.write(boreholes_temps_page)
else:
    with open("./sun/boreholes/index.html","r") as infile:
        boreholes_temps_page = infile.read()
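
Both blocks above follow the same pattern: use the local copy if it exists, otherwise download and cache it. A minimal sketch of that pattern as a reusable helper follows; `load_or_fetch` and its `fetch` callback are hypothetical names introduced here, not part of the notebook.

```python
from pathlib import Path

def load_or_fetch(path, fetch):
    """Return the text cached at `path`; otherwise call `fetch()`, cache the result, and return it."""
    path = Path(path)
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = fetch()
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return text

# Hypothetical usage with the Selenium driver `wd` from this notebook:
# def fetch_activelayers():
#     wd.get("http://www.gtnpdatabase.org/activelayers")
#     time.sleep(10)  # wait for JavaScript to finish
#     return wd.page_source
# activelayers_page = load_or_fetch("./sun/activelayers/index.html", fetch_activelayers)
```

Passing the download step as a callback keeps the helper independent of Selenium, so a manually saved page is picked up in exactly the same way.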

Unfortunately, the usual way of parsing the table with pandas' `read_html` can drop an entry: station _Da Xi Gou (Glacier Station)_ is not always included. We will use `read_html` for other pages further down in this notebook.

In [7]:
# pandas parses HTML tables quite well; note that read_html returns a list of DataFrames
df = pd.read_html(io.StringIO(boreholes_temps_page))[0] # use the cached page so this also works without Selenium
df.columns = [c.lower() for c in df.columns] 
df[df["gtn-p"]=="CN9"]

Unnamed: 0,name,site,country,gtn-p,vegetation,permafrost,elevation,depth,data,select


### Helper Function to Log Data Issues

As we walk across the data, we will encounter deficiencies. We will set up a log database so that, if possible, deficiencies can be rectified.

In [8]:
def static_vars(**kwargs):
    def decorate(func):
        for k in kwargs:
            setattr(func, k, kwargs[k])
        return func
    return decorate

@static_vars(log=[])
def log_finding(finding):
    finding["datetime_date"] = "{:%Y-%m-%d %H:%M:%S}".format(datetime.datetime.now())
    finding["machine"] = socket.gethostname()
    log_finding.log.append(finding)
    with open("findings.json","w+t") as logfile:
        json.dump(log_finding.log,logfile)
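
The `static_vars` decorator works by attaching attributes to the function object, which then behave like static variables shared by all calls. A small self-contained sketch of the mechanism follows; the function `remember` is a made-up example.

```python
def static_vars(**kwargs):
    """Attach the given keyword arguments as attributes of the decorated function."""
    def decorate(func):
        for name, value in kwargs.items():
            setattr(func, name, value)
        return func
    return decorate

@static_vars(seen=[])
def remember(item):
    # `remember.seen` is created once at decoration time and shared by all calls
    remember.seen.append(item)
    return len(remember.seen)

remember("first")
remember("second")
print(remember.seen)  # ['first', 'second']
```

Note that `log_finding` starts with an empty list whenever its cell is re-run and opens `findings.json` in write mode, so findings from an earlier kernel session are overwritten by the first new finding.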

## Database connection

If you plan to publish your code on a public platform such as GitHub or GitLab, never add credentials to your source code. In fact, never do so even if you have no such plans. We read the credentials and database server from a file named `.env`. In this repo, copy `.env.example` to `.env` and edit its content to suit your needs. We need a string of the form

```
postgresql://<user>:<password>@<hostname>/<database>
```

in that file. Often, this is `postgresql://postgres@localhost/postgres` by default.

In [9]:
with open(".env","rt") as idfile:
    connect_string = idfile.read().strip()
conn = create_engine(connect_string)
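
A malformed connection string usually only surfaces when the first query fails. As a sketch under the assumption that the string follows the URL form shown above, the standard library can split it into its parts for a quick sanity check without printing the password. The helper name `describe_connection` is hypothetical.

```python
from urllib.parse import urlsplit

def describe_connection(connect_string):
    """Split a database URL into its parts, hiding the password."""
    parts = urlsplit(connect_string)
    if not parts.scheme.startswith("postgresql"):
        raise ValueError("expected a postgresql:// URL, got scheme {!r}".format(parts.scheme))
    return {"user": parts.username,
            "has_password": parts.password is not None,
            "host": parts.hostname,
            "database": parts.path.lstrip("/")}

print(describe_connection("postgresql://postgres@localhost/postgres"))
# {'user': 'postgres', 'has_password': False, 'host': 'localhost', 'database': 'postgres'}
```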

## Web Page parsing using BeautifulSoup

The download links use an id of the site in their URL. The id field, however, is not visible on the web page, but is in the hyperlink `href` field of the table. These links are contained in a table cell (`<td>`) with an attribute `data-th="Name"`.

The html code looks like

```
<td data-th="Name"><a class="ng-binding" href="/activelayers/view/195/">Allakaket</a></td>
```

and we are after the 195, which is the last non-empty field of the URL when it is split on forward slashes (the URL ends with a slash, so it is the second-to-last field overall). Also, as we already use BeautifulSoup, we take this approach to populate the data frame.
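
To make the indexing concrete, here is a short sketch of the id extraction on the example link. `id_from_href` mirrors the `split("/")[-2]` expression used in the cells of this section; `id_from_href_robust` is a hypothetical variant that also copes with a missing trailing slash.

```python
def id_from_href(href):
    """Return the field before the trailing slash, as the notebook cells do."""
    return href.split("/")[-2]

def id_from_href_robust(href):
    """Return the last non-empty field, with or without a trailing slash."""
    return href.strip("/").split("/")[-1]

print("/activelayers/view/195/".split("/"))           # ['', 'activelayers', 'view', '195', '']
print(id_from_href("/activelayers/view/195/"))        # 195
print(id_from_href_robust("/activelayers/view/195"))  # 195
```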

### Activelayers

In [10]:
soup = BeautifulSoup(io.StringIO(activelayers_page), 'lxml') # name the parser explicitly, as for the boreholes page

entries = []
for tr in soup.findAll("tr"):
    entry = {}
    for td in tr.findAll("td"):
        if td["data-th"] == "Gtn-P":
            entry["code"] = td.text
        else:
            entry[td["data-th"].lower()] = td.text
        if td["data-th"] == "Name":
            entry["subsite_id"] = td.contents[0]["href"].split("/")[-2]
            entry["view_url"] = td.contents[0]["href"]
        elif td["data-th"] == "Site":
            entry["site_id"] = td.contents[0]["href"].split("/")[-2]
            entry["site_url"] = td.contents[0]["href"]
    if len(entry)>0:
        entries.append(entry)
dfActiveLayersCatalogue = pd.DataFrame(entries)
dfActiveLayersCatalogue.elevation = pd.to_numeric(dfActiveLayersCatalogue.elevation)
print("{} Entries loaded".format(len(dfActiveLayersCatalogue)))

253 Entries loaded


Add standard country codes and names using `pycountry`.

In [11]:
dfActiveLayersCatalogue["iso2"] = ""
dfActiveLayersCatalogue["iso3"] = ""
dfActiveLayersCatalogue["country_official_name"] = ""
for i,r in dfActiveLayersCatalogue.iterrows():
    info = pycountry.countries.search_fuzzy(r.country)
    dfActiveLayersCatalogue.at[i,"iso2"] = info[0].alpha_2
    dfActiveLayersCatalogue.at[i,"iso3"] = info[0].alpha_3
    if "country_official_name" in info[0].__dict__.keys():
        dfActiveLayersCatalogue.at[i,"country_official_name"] = info[0].country_official_name
    else:
        dfActiveLayersCatalogue.at[i,"country_official_name"] = info[0].name
dfActiveLayersCatalogue["subsite_type"] = "activelayer"
dfActiveLayersCatalogue.head()

Unnamed: 0,name,subsite_id,view_url,site,site_id,site_url,country,type,code,vegetation,permafrost,elevation,data,select,iso2,iso3,country_official_name,subsite_type
0,56 Mile,227,/activelayers/view/227/,Franklin Bluff,16,/sites/view/16/,United States,Grid,U31 A,Tundra,Continuous,114.0,No,,US,USA,United States,activelayer
1,Abisko,42,/activelayers/view/42/,Abisko,6,/sites/view/6/,Sweden,Grid,S2,Forest Tundra,Discontinuous,507.0,No,,SE,SWE,Sweden,activelayer
2,Akhmelo Channel,88,/activelayers/view/88/,Cherskii,8,/sites/view/8/,Russia,Grid,R17,Forest Tundra,Continuous,5.0,Yes,,RU,RUS,Russian Federation,activelayer
3,Alazeya River,82,/activelayers/view/82/,Cherskii,8,/sites/view/8/,Russia,Grid,R22,Shrub Tundra,Continuous,60.0,Yes,,RU,RUS,Russian Federation,activelayer
4,Alexandria Fiord,27,/activelayers/view/27/,Alexandria Fiord,329,/sites/view/329/,Canada,Grid,C1,Tundra,Continuous,0.0,Yes,,CA,CAN,Canada,activelayer
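
The loop above runs a fuzzy search for every row, although the catalogue contains only a handful of distinct country names. A possible optimisation, sketched here with a stand-in lookup so it runs without pycountry, is to resolve each distinct name once. `resolve_countries` and `fake_search` are hypothetical names.

```python
def resolve_countries(names, search):
    """Map each distinct name to `search(name)`, calling `search` only once per name."""
    resolved = {}
    for name in names:
        if name not in resolved:
            resolved[name] = search(name)
    return resolved

# Stand-in for a slow lookup, used only so the sketch is self-contained.
calls = []
def fake_search(name):
    calls.append(name)
    return {"Russia": "RU", "Canada": "CA"}[name]

codes = resolve_countries(["Russia", "Russia", "Canada", "Russia"], fake_search)
print(codes, len(calls))  # {'Russia': 'RU', 'Canada': 'CA'} 2
```

With pycountry, `search` could be `lambda n: pycountry.countries.search_fuzzy(n)[0]`, which is the call used in this notebook.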


### Temperatures

In [12]:
soup = BeautifulSoup(io.StringIO(boreholes_temps_page), 'lxml')

entries = []
for tr in soup.findAll("tr"):
    entry = {}
    for td in tr.findAll("td"):
        if td["data-th"] == "Gtn-P":
            entry["code"] = td.text
        else:
            entry[td["data-th"].lower()] = td.text
        if td["data-th"] == "Name":
            entry["subsite_id"] = td.contents[0]["href"].split("/")[-2]
            entry["view_url"] = td.contents[0]["href"]
        elif td["data-th"] == "Site":
            entry["site_id"] = td.contents[0]["href"].split("/")[-2]
            entry["site_url"] = td.contents[0]["href"]
    if len(entry)>0:
        entries.append(entry)
dfBoreholeTempsCatalogue = pd.DataFrame(entries)
dfBoreholeTempsCatalogue.elevation = pd.to_numeric(dfBoreholeTempsCatalogue.elevation)
print("{} Entries loaded".format(len(dfBoreholeTempsCatalogue)))

1380 Entries loaded


There is a typo in the country names, which we want to log.

In [13]:
dfBoreholeTempsCatalogue["iso2"] = ""
dfBoreholeTempsCatalogue["iso3"] = ""
dfBoreholeTempsCatalogue["country_official_name"] = ""
for i,r in dfBoreholeTempsCatalogue.iterrows():
    if r.country.lower() == "kyrgystan":
        log_finding({"topic":"borehole temperatures",
                    "site_id":r.site_id,
                    "subsite_id":r.subsite_id,
                    "site_url":r.site_url,
                    "site":r.site,
                    "diagnosis":"Country spelling incorrect, '{}' is spelled 'Kyrgyzstan'".format(r.country),
                    "fix":"changed to 'Kyrgyzstan'",
                    "needs_attention":False})
        info = pycountry.countries.search_fuzzy("Kyrgyzstan")
    else:
        info = pycountry.countries.search_fuzzy(r.country)
    dfBoreholeTempsCatalogue.at[i,"iso2"] = info[0].alpha_2
    dfBoreholeTempsCatalogue.at[i,"iso3"] = info[0].alpha_3
    if "country_official_name" in info[0].__dict__.keys():
        dfBoreholeTempsCatalogue.at[i,"country_official_name"] = info[0].country_official_name
    else:
        dfBoreholeTempsCatalogue.at[i,"country_official_name"] = info[0].name
        
dfBoreholeTempsCatalogue["subsite_type"] = "temperatures"
dfBoreholeTempsCatalogue.head()

Unnamed: 0,name,subsite_id,view_url,site,site_id,site_url,country,code,vegetation,permafrost,elevation,depth,data,select,iso2,iso3,country_official_name,subsite_type
0,0 (Deputatsky),1007,/boreholes/view/1007/,Deputatskiy,331,/sites/view/331/,Russia,RU 118,Shrub Tundra,Continuous,462.0,88.0,No,,RU,RUS,Russian Federation,temperatures
1,01TC1,1744,/boreholes/view/1744/,Yukon,73,/sites/view/73/,Canada,CA 196,Grassland,Continuous,18.0,8.0,No,,CA,CAN,Canada,temperatures
2,01TC2,1745,/boreholes/view/1745/,Yukon,73,/sites/view/73/,Canada,CA 197,Grassland,Continuous,95.0,10.0,No,,CA,CAN,Canada,temperatures
3,03TC1,1743,/boreholes/view/1743/,Yukon,73,/sites/view/73/,Canada,CA 195,Grassland,Discontinuous,3.0,6.0,No,,CA,CAN,Canada,temperatures
4,08 (Deputatsky),1008,/boreholes/view/1008/,Deputatskiy,331,/sites/view/331/,Russia,RU 119,Forest Tundra,Continuous,473.0,96.0,No,,RU,RUS,Russian Federation,temperatures


### Combine Borehole Depth and Temperatures Catalogues into one

There are historical reasons why these were created separately. Our aim is to assess all permafrost data, so we want one catalogue.

In [17]:
dfBoreholeTempsCatalogue["type"] = "N/A" # pad the field that is only present in the active layer data
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
dfSites = pd.concat([dfActiveLayersCatalogue, dfBoreholeTempsCatalogue]).reset_index()
# convert Yes/No text into boolean
dfSites["have_data"] = dfSites["data"] == "Yes"
del dfSites["data"]
dfSites.rename(columns={"name":"subsite_name","site":"site_name"},inplace=True)
dfSites

Unnamed: 0,index,name,subsite_id,view_url,site,site_id,site_url,country,type,code,vegetation,permafrost,elevation,select,iso2,iso3,country_official_name,subsite_type,depth,have_data
0,0,56 Mile,227,/activelayers/view/227/,Franklin Bluff,16,/sites/view/16/,United States,Grid,U31 A,Tundra,Continuous,114.0,,US,USA,United States,activelayer,,False
1,1,Abisko,42,/activelayers/view/42/,Abisko,6,/sites/view/6/,Sweden,Grid,S2,Forest Tundra,Discontinuous,507.0,,SE,SWE,Sweden,activelayer,,False
2,2,Akhmelo Channel,88,/activelayers/view/88/,Cherskii,8,/sites/view/8/,Russia,Grid,R17,Forest Tundra,Continuous,5.0,,RU,RUS,Russian Federation,activelayer,,True
3,3,Alazeya River,82,/activelayers/view/82/,Cherskii,8,/sites/view/8/,Russia,Grid,R22,Shrub Tundra,Continuous,60.0,,RU,RUS,Russian Federation,activelayer,,True
4,4,Alexandria Fiord,27,/activelayers/view/27/,Alexandria Fiord,329,/sites/view/329/,Canada,Grid,C1,Tundra,Continuous,0.0,,CA,CAN,Canada,activelayer,,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1628,1375,Zima railst,1677,/boreholes/view/1677/,Irkutsk,369,/sites/view/369/,Russia,,,Other,Isolated patches,456.0,,RU,RUS,Russian Federation,temperatures,3.00,True
1629,1376,Zugspitze 01/07 (W),859,/boreholes/view/859/,Zugspitze,263,/sites/view/263/,Germany,,DE 01,Other,Isolated Patches,2922.0,,DE,DEU,Germany,temperatures,44.00,False
1630,1377,Zugspitze 02/07 (E),858,/boreholes/view/858/,Zugspitze,263,/sites/view/263/,Germany,,DE 02,Other,Isolated Patches,2922.0,,DE,DEU,Germany,temperatures,58.00,False
1631,1378,Zugspitze tunnel,1798,/boreholes/view/1798/,Zugspitze,263,/sites/view/263/,Germany,,,No Vegetation,Mountain,2785.0,,DE,DEU,Germany,temperatures,50.00,False


In [18]:
Path("./sun/catalogue").mkdir(exist_ok=True,parents=True)
dfSites.to_parquet("./sun/catalogue/sites.catalogue.parquet")

In [19]:
#dfSites = pd.read_parquet("./sun/catalogue/sites.catalogue.parquet")

## Download Site Metadata

Each site has one landing page which describes more about its location and some subsites. In particular, we are after Observation Type, Area, and Latitude/Longitude, which are not part of the previous dataset.

Note that we have a hybrid of site and subsite data. In particular, Latitude and Longitude are site data, while name is the subsite name.

As all data we want are inside html tables, we use pandas' `read_html`, which returns a list of dataframes.

In [25]:
nok = []
alldata = []

site_urls = dfSites.site_url.unique()

for site_url in tqdm(site_urls):
    Path("./sun"+site_url).mkdir(exist_ok=True,parents=True)
    if not os.path.exists("./sun"+site_url+"index.html"):
        r = requests.get("http://www.gtnpdatabase.org{}".format(site_url))
        if r.ok:
            with open("./sun"+site_url+"index.html","w+") as outfile:
                outfile.write(r.text)
        else:
            nok.append("http://www.gtnpdatabase.org{}".format(site_url))
            continue
    dfGeneral,dfSiteCentre,dfBBox = pd.read_html("./sun"+site_url+"index.html")
    alldata.append({"site_id":site_url.split("/")[-2],
                   "observation_type":dfGeneral.iloc[1].values[1],
                  "area":pd.to_numeric(dfGeneral.iloc[2].values[1].replace(",","").split(" ")[0]),
                  "site_lon":pd.to_numeric(dfSiteCentre.iloc[0].values[1].replace("°","")),
                  "site_lat":pd.to_numeric(dfSiteCentre.iloc[1].values[1].replace("°","")),
                  "bbox_lat_max":pd.to_numeric(dfBBox.iloc[0].values[1].replace("°","").split(":")[1]),
                  "bbox_lat_min":pd.to_numeric(dfBBox.iloc[2].values[1].replace("°","").split(":")[1]),
                  "bbox_lon_min":pd.to_numeric(dfBBox.iloc[1].values[0].replace("°","").split(":")[1]),
                  "bbox_lon_max":pd.to_numeric(dfBBox.iloc[1].values[2].replace("°","").split(":")[1])})
    time.sleep(.25)

if len(nok)>0:
    print("Warning, had problems downloading {}".format(nok))
else:
    print("Download complete")
dfSiteInfo = pd.DataFrame(alldata)
dfSiteInfo.observation_type = dfSiteInfo.observation_type.replace(np.nan, '', regex=True)

100%|██████████| 352/352 [01:31<00:00,  3.84it/s]

Download complete
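
The coordinate handling in the loop above boils down to stripping the degree sign and, for the bounding box, dropping a label in front of a colon. A minimal sketch of that conversion follows. The sample strings are assumptions inferred from the parsing code, not verified against the live site, and `parse_degrees` is a hypothetical helper.

```python
def parse_degrees(text):
    """Convert '69.719608°' or 'N: 69.833333°' to a float, ignoring a leading label."""
    value = text.replace("°", "")
    if ":" in value:
        value = value.split(":", 1)[1]
    return float(value.strip())

print(parse_degrees("69.719608°"))   # 69.719608
print(parse_degrees("W: -148.75°"))  # -148.75
```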





In [27]:
dfSites = dfSites.merge(dfSiteInfo,on="site_id")
Path("./mercury/catalogue").mkdir(exist_ok=True,parents=True)
dfSites.to_parquet("./mercury/catalogue/sites.catalogue.parquet")

In [28]:
gdfSites = gpd.GeoDataFrame(dfSites, geometry=gpd.points_from_xy(dfSites.site_lon,dfSites.site_lat)).set_crs(epsg=4326)
for c in ["subsite_id","site_id","elevation","depth","area","site_lat","site_lon","bbox_lat_max","bbox_lat_min","bbox_lon_min","bbox_lon_max"]:
    gdfSites[c] = pd.to_numeric(gdfSites[c])
del gdfSites["index"]
del gdfSites["select"]

In [29]:
gdfSites.to_postgis("t_permafrost_sites",con=conn,dtype={"subsite_name":sqlalchemy.types.VARCHAR(100),
                                                         "subsite_id":sqlalchemy.types.INT,
                                                         "view_url":sqlalchemy.types.VARCHAR(100),
                                                         "site_name":sqlalchemy.types.VARCHAR(100),
                                                         "site_id":sqlalchemy.types.INT,
                                                         "site_url":sqlalchemy.types.VARCHAR(100),
                                                         "country":sqlalchemy.types.VARCHAR(40),
                                                         "type":sqlalchemy.types.VARCHAR(20),
                                                         "code":sqlalchemy.types.VARCHAR(20),
                                                         "vegetation":sqlalchemy.types.VARCHAR(50),
                                                         "permafrost":sqlalchemy.types.VARCHAR(50),
                                                         "iso2":sqlalchemy.types.VARCHAR(2),
                                                         "iso3":sqlalchemy.types.VARCHAR(3),
                                                         "country_official_name":sqlalchemy.types.VARCHAR(80),
                                                         "subsite_type":sqlalchemy.types.VARCHAR(20),
                                                         "observation_type":sqlalchemy.types.VARCHAR(20),
                                                        },if_exists="replace")

In [30]:
#gdfSites = gpd.read_postgis("SELECT * FROM t_permafrost_sites",con=conn,geom_col="geometry")

## Subsite Metadata

Sites are collections of one or more subsites, and the subsites correspond to the actual boreholes or borehole arrays. Like sites, subsites have their own metadata page. The metadata there varies between activelayer and temperatures subsites, and some tables are not present in the HTML code if no data are associated with them.
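
One detail to watch when parsing labelled timestamps: `get_dates_from_soup` in this section splits the text on every colon and keeps the second field, which would cut a value such as `2014-02-03 11:30:00` down to `2014-02-03 11`. Whether this matters depends on the timestamps the pages actually carry, which is not verified here. A sketch of splitting on the first colon only, assuming text of the form `Created: 2014-02-03 11:30:00`:

```python
import datetime

def parse_labelled_date(text):
    """Split 'Label: value' on the first colon only, so colons inside the time are kept."""
    label, value = text.split(":", 1)
    return label.strip().lower(), datetime.datetime.fromisoformat(value.strip())

print(parse_labelled_date("Created: 2014-02-03 11:30:00"))
# ('created', datetime.datetime(2014, 2, 3, 11, 30))
```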

In [408]:
def get_dates_from_soup(soup):
    dates = {}
    for e in soup.find("fieldset").findAll("small"):
        fields = e.text.split(":")
        key = fields[0].lower()
        value = pd.to_datetime(fields[1])
        dates[key] = value
    return dates

def get_explanation_from_soup(soup):
    text = []
    for p in soup.find("div",{"class":"leftCol"}).findAll("p",{"class":""}):
        if p.find("strong"):
            header = p.find("strong")
            text.append("# "+header.text)
        elif p.find("a"):
            link = p.find("a",href=True)
            text.append("[{}]({})".format(link.text,link["href"]))
        else:
            text.append(p.text.strip())
    return ("\n\n".join(text))

def get_references_from_soup(soup,site_id,conn,DEFER=True):
    references = []
    entries = []
    el = soup.find('hr')
    while(el):
        el = el.next_sibling
        if isinstance(el, bs4.element.NavigableString):
            if len(el.strip())>0:
                entries.append(el.strip()) 

    # `headers` was not defined in the original cell; this User-Agent is a placeholder, adjust as needed
    headers = {"User-Agent": "gtnp-download-notebook"}

    for entry in entries:
        try:
            # a bound parameter avoids manual quote escaping and SQL injection
            dfExisting = pd.read_sql(sqlalchemy.text("SELECT * FROM t_references WHERE reference=:reference"),
                                     con=conn,params={"reference":entry})
            if len(dfExisting)>0:
                references.extend(dfExisting.doi.values)
                continue
        except Exception:
            # typically the table does not exist yet on the first run
            print(entry,end="\n")
        
        url = "https://api.crossref.org/works?query={}".format(urllib.parse.quote(entry))
        if DEFER:
            with open("deferred.queries.txt","a+t") as todolist:
                todolist.write(url+"\n")
            continue
        r = requests.get(url,headers=headers)
        if not r.ok:
            continue
        items = r.json()["message"]["items"]
        if len(items)==0 or "URL" not in items[0]:
            continue # nothing usable found for this reference
        item = items[0]
        new_record = {"doi":item["URL"],"reference":entry,"result":json.dumps(item)}
            
        sleep = 0.1 # fallback if no usable rate limit headers are sent
        if "x-rate-limit-limit" in r.headers.keys():
            try:
                rate_limit = float(r.headers["x-rate-limit-limit"]) # header values are strings
                rate_interval = float(r.headers["x-rate-limit-interval"].replace("s",""))
                sleep = rate_interval/rate_limit
            except (KeyError,ValueError,ZeroDivisionError):
                sleep = 0.1
        time.sleep(sleep)
        references.append(new_record["doi"])
        xdf = pd.DataFrame(new_record,index=[0])
        xdf.to_sql("t_references",con=conn,dtype={"doi":sqlalchemy.types.VARCHAR(100),
                                   "reference":sqlalchemy.types.VARCHAR(1000),
                                   "result":sqlalchemy.types.VARCHAR(50000)},
                   if_exists="append",index=False)
    return entries,references
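
The pause between Crossref requests is derived from two response headers, `x-rate-limit-limit` and `x-rate-limit-interval` (names taken from the code above). Header values arrive as strings, so they need converting before dividing. A standalone sketch of that calculation, with a hypothetical helper name and made-up sample values:

```python
def delay_from_headers(headers, default=0.1):
    """Seconds to wait between requests, or `default` if the headers are missing or unusable."""
    try:
        limit = float(headers["x-rate-limit-limit"])
        interval = float(headers["x-rate-limit-interval"].rstrip("s"))
        return interval / limit
    except (KeyError, ValueError, ZeroDivisionError):
        return default

print(delay_from_headers({"x-rate-limit-limit": "50", "x-rate-limit-interval": "1s"}))  # 0.02
print(delay_from_headers({}))  # 0.1
```

The sketch uses a plain dict, which is case-sensitive. The `headers` attribute of a `requests` response is case-insensitive, so the lookup there is more forgiving.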

In [409]:
keep_columns = {'CALM-Code:':"calm_code", 'Drilling Angle:':"drilling_angle", 
                'Drilling Method:':"drilling_method", 'Elevation:':"subsite_elevation", 
                'GTN-P:':"gtn_p", 'Latitude:':"subsite_lat", 'Longitude:':"subsite_lon",
                'Lithology:':"lithology", 'Morphology:':"morphology", 
                'Responsible Countries:':"responsible_countries",
                'Responsible Person:':"responsible_person"}

In [534]:
nok = []
manual_review = []

adfSubsiteInfo = []

for i,row in tqdm(gdfSites.iterrows()):
    targetfolder = "./sun"+row.view_url
    targetfile = os.path.join(targetfolder,"index.html")
    if os.path.exists(targetfile):
        with open(targetfile) as infile:
            content = infile.read()
    else:
        Path(targetfolder).mkdir(exist_ok=True,parents=True)
        r = requests.get("http://www.gtnpdatabase.org{}".format(row.view_url)) #"/boreholes/view/1097/")) #
        if r.ok:
            content = r.text
        else:
            nok.append(row.view_url)
            continue
        with open(targetfile,"w+") as outfile:
            outfile.write(content)

    soup = BeautifulSoup(content, 'lxml') # name the parser explicitly to avoid the "no parser specified" warning
    
    df = pd.DataFrame()
    tabs = {0:"general"}
    for infoTable in soup.findAll("table",{"class":"infoTable"}):
        try:
            df = pd.read_html(io.StringIO(str(infoTable)))[0]
        except:
            continue
    dates = get_dates_from_soup(soup)
    dfArray = pd.read_html(io.StringIO(content))

    # we dont try to parse all information...
    #for i in range(len(dfArray)):
    #    dfArray[i]["array_no"] = i

    # DataFrame.append was removed in pandas 2.0, so pd.concat is used throughout
    ddf = pd.concat(dfArray)
    ddf = ddf[ddf[0].isin(keep_columns.keys())][[0,1]]
    ddf = pd.concat([ddf, pd.DataFrame({0:["subsite_type","created","modified","subsite_id","view_url"],
                                        1:[row.subsite_type,dates["created"],dates["modified"],row.subsite_id,row.view_url]})])

    if ddf.duplicated(subset=[0]).any():
        print("warning: duplicate field names in {}".format(row.view_url))
        break

    if soup.find("div",{"class":"references"}):
        entries,references = get_references_from_soup(soup.find("div",{"class":"references"}),
                                                  row.site_id,conn,DEFER=False)
    else:
        entries = []
        references = []
    
    ddf = pd.concat([ddf, pd.DataFrame({0:["entries","references"],
                                        1:[json.dumps(entries),json.dumps(references)]})])
    ddf.index = ddf[0]
    del ddf[0]
    ddf = ddf.transpose()
    adfSubsiteInfo.append(ddf.rename(columns=keep_columns))
    
dfSubsites = pd.concat(adfSubsiteInfo)

1633it [00:41, 39.06it/s]
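
At its core, the loop above turns a two-column table of labels and values into a single wide record: it keeps only the labels listed in `keep_columns`, renames them, and stops if a label occurs twice. The same idea can be sketched without pandas as follows. `rows_to_record` and the sample rows are illustrative only.

```python
def rows_to_record(rows, keep):
    """Turn (label, value) pairs into one dict, keeping and renaming only the labels in `keep`."""
    record = {}
    for label, value in rows:
        if label not in keep:
            continue
        key = keep[label]
        if key in record:
            raise ValueError("duplicate label {!r}".format(label))
        record[key] = value
    return record

keep = {"Latitude:": "subsite_lat", "Longitude:": "subsite_lon"}
rows = [("Latitude:", "69.6969°"), ("Longitude:", "-148.6821°"), ("Comment:", "ignored")]
print(rows_to_record(rows, keep))
# {'subsite_lat': '69.6969°', 'subsite_lon': '-148.6821°'}
```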


In [537]:
dfSubsites.subsite_lon = pd.to_numeric(dfSubsites.subsite_lon.str.replace("°",""))
dfSubsites.subsite_lat = pd.to_numeric(dfSubsites.subsite_lat.str.replace("°",""))
dfSubsites.drilling_angle = pd.to_numeric(dfSubsites.drilling_angle.str.replace("°",""))
dfSubsites.subsite_elevation = pd.to_numeric(dfSubsites.subsite_elevation.str.replace("m",""))
dfSubsites.subsite_id = pd.to_numeric(dfSubsites.subsite_id).astype(int)
for c in ["calm_code","responsible_countries","responsible_person","lithology",
          "subsite_type","gtn_p","drilling_method","morphology"]:
    dfSubsites[c] = dfSubsites[c].replace(np.nan, '', regex=True)
dfSubsites.head()

Unnamed: 0,calm_code,responsible_countries,responsible_person,lithology,subsite_lon,subsite_lat,subsite_elevation,subsite_type,created,modified,subsite_id,view_url,entries,references,gtn_p,drilling_angle,drilling_method,morphology
1,U31 A,"United States,",Nikolay Shiklomanov,Organic Layer thikness:NA; mineral texture -- ...,-148.6821,69.6969,114.259277,activelayer,2014-02-03 11:00:00,2016-10-24 12:00:00,227,/activelayers/view/227/,"[""Walker, D.A., Auerbach, N.A., Bockheim, J.G....","[""http://dx.doi.org/10.1038/28839"", ""http://dx...",,,,
1,U8,"United States,",Vladimir E. Romanovsky,The organic layer thickness is 0.23 m,-148.716667,69.683333,120.003014,activelayer,2013-12-27 15:00:00,2014-02-03 14:00:00,15,/activelayers/view/15/,"[""Nicolsky, D. J., Romanovsky, V. E., Alexeev,...","[""http://dx.doi.org/10.1029/2007gl029525"", ""ht...",,,,
1,,"United States,",Gary D. Clow,,-146.338233,69.605883,269.993164,temperatures,2014-08-20 11:00:00,2016-06-02 16:00:00,1128,/boreholes/view/1128/,[],[],US 94,90.0,,
1,,"United States,",Doug L. Kane,,-148.75,69.833333,76.467224,temperatures,2015-10-06 15:00:00,2015-10-06 15:00:00,1173,/boreholes/view/1173/,[],[],US 43,90.0,,
1,,"United States,",Vladimir E. Romanovsky,,-148.720766,69.67414,122.868607,temperatures,2013-12-09 14:00:00,2014-02-27 10:00:00,103,/boreholes/view/103/,[],[],,90.0,,


In [538]:
gdfSubites = gpd.GeoDataFrame(dfSubsites, 
                              geometry=gpd.points_from_xy(dfSubsites.subsite_lon,
                                                          dfSubsites.subsite_lat)).set_crs(epsg=4326)
gdfSubites

Unnamed: 0,calm_code,responsible_countries,responsible_person,lithology,subsite_lon,subsite_lat,subsite_elevation,subsite_type,created,modified,subsite_id,view_url,entries,references,gtn_p,drilling_angle,drilling_method,morphology,geometry
1,U31 A,"United States,",Nikolay Shiklomanov,Organic Layer thikness:NA; mineral texture -- ...,-148.682100,69.696900,114.259277,activelayer,2014-02-03 11:00:00,2016-10-24 12:00:00,227,/activelayers/view/227/,"[""Walker, D.A., Auerbach, N.A., Bockheim, J.G....","[""http://dx.doi.org/10.1038/28839"", ""http://dx...",,,,,POINT (-148.68210 69.69690)
1,U8,"United States,",Vladimir E. Romanovsky,The organic layer thickness is 0.23 m,-148.716667,69.683333,120.003014,activelayer,2013-12-27 15:00:00,2014-02-03 14:00:00,15,/activelayers/view/15/,"[""Nicolsky, D. J., Romanovsky, V. E., Alexeev,...","[""http://dx.doi.org/10.1029/2007gl029525"", ""ht...",,,,,POINT (-148.71667 69.68333)
1,,"United States,",Gary D. Clow,,-146.338233,69.605883,269.993164,temperatures,2014-08-20 11:00:00,2016-06-02 16:00:00,1128,/boreholes/view/1128/,[],[],US 94,90.0,,,POINT (-146.33823 69.60588)
1,,"United States,",Doug L. Kane,,-148.750000,69.833333,76.467224,temperatures,2015-10-06 15:00:00,2015-10-06 15:00:00,1173,/boreholes/view/1173/,[],[],US 43,90.0,,,POINT (-148.75000 69.83333)
1,,"United States,",Vladimir E. Romanovsky,,-148.720766,69.674140,122.868607,temperatures,2013-12-09 14:00:00,2014-02-27 10:00:00,103,/boreholes/view/103/,[],[],,90.0,,,POINT (-148.72077 69.67414)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1,,"United States,",Kenji Yoshikawa,,-163.413000,64.682000,14.000000,temperatures,2014-01-20 13:00:00,2014-06-06 09:00:00,771,/boreholes/view/771/,[],[],US O-94,90.0,,,POINT (-163.41300 64.68200)
1,,"Italy,",Mauro Guglielmin,,10.308250,46.392250,2570.000000,temperatures,2014-01-22 10:00:00,2014-01-22 10:00:00,895,/boreholes/view/895/,[],[],IT 02,90.0,,,POINT (10.30825 46.39225)
1,,"Germany,",Andreas Poschinger,,10.987108,47.421789,2922.000000,temperatures,2014-01-21 15:00:00,2016-10-22 09:00:00,859,/boreholes/view/859/,[],[],DE 01,90.0,,,POINT (10.98711 47.42179)
1,,"Germany,",Andreas Poschinger,,10.987142,47.421778,2922.000000,temperatures,2014-01-21 15:00:00,2016-10-22 09:00:00,858,/boreholes/view/858/,[],[],DE 02,90.0,,,POINT (10.98714 47.42178)


In [539]:
# subsite_id is not unique, we need view_url as the unique key
gdfSubsiteInfo = gdfSubites.merge(gdfSites,on="view_url")

In [540]:
# consistency check (1)
gdfSubsiteInfo[gdfSubsiteInfo.subsite_type_x != gdfSubsiteInfo.subsite_type_y]

Unnamed: 0,calm_code,responsible_countries,responsible_person,lithology,subsite_lon,subsite_lat,subsite_elevation,subsite_type_x,created,modified,...,have_data,observation_type,area,site_lon,site_lat,bbox_lat_max,bbox_lat_min,bbox_lon_min,bbox_lon_max,geometry_y


In [541]:
# consistency check (2)
gdfSubsiteInfo[gdfSubsiteInfo.subsite_id_x != gdfSubsiteInfo.subsite_id_y]

Unnamed: 0,calm_code,responsible_countries,responsible_person,lithology,subsite_lon,subsite_lat,subsite_elevation,subsite_type_x,created,modified,...,have_data,observation_type,area,site_lon,site_lat,bbox_lat_max,bbox_lat_min,bbox_lon_min,bbox_lon_max,geometry_y


In [542]:
# now that we have checked consistency, let's get rid of the duplicate columns
gdfSubsiteInfo.rename(columns={"subsite_type_x":"subsite_type",
                              "subsite_id_x":"subsite_id",
                              "geometry_x":"geometry",
                              "geometry_y":"geometry_site",
                              "name":"subsite_name",
                              "site":"site_name"},
                     inplace=True)
del gdfSubsiteInfo["subsite_type_y"]
del gdfSubsiteInfo["subsite_id_y"]
gdfSubsiteInfo = gpd.GeoDataFrame(gdfSubsiteInfo).set_geometry('geometry').set_crs(epsg=4326)
gdfSubsiteInfo

Unnamed: 0,calm_code,responsible_countries,responsible_person,lithology,subsite_lon,subsite_lat,subsite_elevation,subsite_type,created,modified,...,have_data,observation_type,area,site_lon,site_lat,bbox_lat_max,bbox_lat_min,bbox_lon_min,bbox_lon_max,geometry_site
0,U31 A,"United States,",Nikolay Shiklomanov,Organic Layer thikness:NA; mineral texture -- ...,-148.682100,69.696900,114.259277,activelayer,2014-02-03 11:00:00,2016-10-24 12:00:00,...,False,"CALM, TSP",1934.328,-147.544117,69.719608,69.833333,69.605883,-148.750000,-146.338233,POINT (-147.54412 69.71961)
1,U8,"United States,",Vladimir E. Romanovsky,The organic layer thickness is 0.23 m,-148.716667,69.683333,120.003014,activelayer,2013-12-27 15:00:00,2014-02-03 14:00:00,...,True,"CALM, TSP",1934.328,-147.544117,69.719608,69.833333,69.605883,-148.750000,-146.338233,POINT (-147.54412 69.71961)
2,,"United States,",Gary D. Clow,,-146.338233,69.605883,269.993164,temperatures,2014-08-20 11:00:00,2016-06-02 16:00:00,...,False,"CALM, TSP",1934.328,-147.544117,69.719608,69.833333,69.605883,-148.750000,-146.338233,POINT (-147.54412 69.71961)
3,,"United States,",Doug L. Kane,,-148.750000,69.833333,76.467224,temperatures,2015-10-06 15:00:00,2015-10-06 15:00:00,...,False,"CALM, TSP",1934.328,-147.544117,69.719608,69.833333,69.605883,-148.750000,-146.338233,POINT (-147.54412 69.71961)
4,,"United States,",Vladimir E. Romanovsky,,-148.720766,69.674140,122.868607,temperatures,2013-12-09 14:00:00,2014-02-27 10:00:00,...,True,"CALM, TSP",1934.328,-147.544117,69.719608,69.833333,69.605883,-148.750000,-146.338233,POINT (-147.54412 69.71961)
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1628,,"United States,",Kenji Yoshikawa,,-163.413000,64.682000,14.000000,temperatures,2014-01-20 13:00:00,2014-06-06 09:00:00,...,False,TSP,0.000,-163.413000,64.682000,64.682000,64.682000,-163.413000,-163.413000,POINT (-163.41300 64.68200)
1629,,"Italy,",Mauro Guglielmin,,10.308250,46.392250,2570.000000,temperatures,2014-01-22 10:00:00,2014-01-22 10:00:00,...,False,TSP,0.000,10.308250,46.392250,46.392250,46.392250,10.308250,10.308250,POINT (10.30825 46.39225)
1630,,"Germany,",Andreas Poschinger,,10.987108,47.421789,2922.000000,temperatures,2014-01-21 15:00:00,2016-10-22 09:00:00,...,False,TSP,0.000,10.983662,47.420216,47.421789,47.418644,10.980181,10.987142,POINT (10.98366 47.42022)
1631,,"Germany,",Andreas Poschinger,,10.987142,47.421778,2922.000000,temperatures,2014-01-21 15:00:00,2016-10-22 09:00:00,...,False,TSP,0.000,10.983662,47.420216,47.421789,47.418644,10.980181,10.987142,POINT (10.98366 47.42022)


In [544]:
# drop the site-level geometry so that only the subsite geometry column remains
del gdfSubsiteInfo["geometry_site"]

In [545]:
gdfSubsiteInfo.to_postgis("t_permafrost_subsites",con=conn,
                          dtype={"calm_code":sqlalchemy.types.VARCHAR(20),
                                 "responsible_countries":sqlalchemy.types.VARCHAR(60),
                                 "responsible_person":sqlalchemy.types.VARCHAR(100),
                                 "lithology":sqlalchemy.types.VARCHAR(),
                                 "subsite_type":sqlalchemy.types.VARCHAR(20),
                                 "created":sqlalchemy.types.TIMESTAMP,
                                 "modified":sqlalchemy.types.TIMESTAMP,
                                 "subsite_id":sqlalchemy.types.INT,
                                 "view_url":sqlalchemy.types.VARCHAR(100),
                                 "entries":sqlalchemy.types.VARCHAR(5000),
                                 "references":sqlalchemy.types.VARCHAR(1500),
                                 "gtn_p":sqlalchemy.types.VARCHAR(20),
                                 "drilling_method":sqlalchemy.types.VARCHAR(200),
                                 "morphology":sqlalchemy.types.VARCHAR(1000),
                                 "":sqlalchemy.types.VARCHAR(),
                                 "subsite_name":sqlalchemy.types.VARCHAR(100),
                                 "site_name":sqlalchemy.types.VARCHAR(100),
                                 "site_id":sqlalchemy.types.INT,
                                 "site_url":sqlalchemy.types.VARCHAR(100),
                                 "country":sqlalchemy.types.VARCHAR(40),
                                 "type":sqlalchemy.types.VARCHAR(20),
                                 "code":sqlalchemy.types.VARCHAR(20),
                                 "vegetation":sqlalchemy.types.VARCHAR(50),
                                 "permafrost":sqlalchemy.types.VARCHAR(50),
                                 "iso2":sqlalchemy.types.VARCHAR(2),
                                 "iso3":sqlalchemy.types.VARCHAR(3),
                                 "country_official_name":sqlalchemy.types.VARCHAR(80),
                                 "observation_type":sqlalchemy.types.VARCHAR(20),
                                 },if_exists="replace")
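
Because the `dtype` mapping above fixes a maximum length for most text columns, and PostgreSQL rejects values that exceed a `VARCHAR(n)` limit, it can help to compare the data against those limits before writing. The sketch below is one possible pre-flight check on toy data. The helper `columns_exceeding`, the `limits` dictionary, and the sample values are assumptions for illustration and not part of the notebook.

```python
import pandas as pd

def columns_exceeding(df, limits):
    """Map column name -> longest string length, for columns over their limit."""
    too_long = {}
    for column, limit in limits.items():
        if column not in df.columns:
            continue
        values = df[column].dropna().astype(str)
        if values.empty:
            continue
        longest = int(values.str.len().max())
        if longest > limit:
            too_long[column] = longest
    return too_long

# Toy frame (hypothetical values); "ITA" is too long for a 2-character column
toy = pd.DataFrame({
    "iso2": ["US", "ITA"],
    "country": ["United States", "Italy"],
})
limits = {"iso2": 2, "country": 40}

offenders = columns_exceeding(toy, limits)
```

An empty result means every checked column fits its declared length. Otherwise the returned dictionary shows which limit to raise or which values to clean.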

In [546]:
# optional: reload the table from PostGIS instead of rebuilding it
#gdfSubsiteInfo = gpd.read_postgis("SELECT * FROM t_permafrost_subsites",con=conn,geom_col="geometry")