# UHI analysis

We'll try to figure out whether there is a systematic relationship between elevated urban air temperatures and population density based on an archive of weather station data from [NOAA's GHCN](https://www.ncdc.noaa.gov/data-access/land-based-station-data/land-based-datasets/global-historical-climatology-network-ghcn) and population data from the [GHSL](https://ghsl.jrc.ec.europa.eu/). 

This notebook prepares the station data by joining information about population numbers at different points in time, region, climate zone, and nearest neighbor station to it.


# ⚠ 

Some parts of this analysis take quite a while. Look for the cells marked with a

# 🏁

These markers indicate where intermediate results have been stored and can simply be reused instead of running the analysis again.
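
The checkpoint pattern behind the 🏁 cells can be sketched in isolation. The helper below is hypothetical (`load_or_compute` is not part of this notebook) and uses JSON from the standard library instead of the pandas CSV files used throughout, purely to keep the sketch self-contained:

```python
import json
import os
import tempfile


def load_or_compute(path, compute):
    """Return (result, from_cache): load `path` if it exists, else compute and store it."""
    if os.path.isfile(path):
        with open(path) as f:
            return json.load(f), True
    result = compute()
    with open(path, 'w') as f:
        json.dump(result, f)
    return result, False


def slow_step():
    # stand-in for an expensive computation
    return {'stations': 3}


# demo in a throwaway directory
workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, 'checkpoint.json')

first, first_cached = load_or_compute(checkpoint, slow_step)     # computes and saves
second, second_cached = load_or_compute(checkpoint, slow_step)   # loads the saved file
```

Deleting the checkpoint file forces the step to run again, which is also how the CSV checkpoints in this notebook can be refreshed.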

Import some libraries we'll use:


In [1]:
import gzip
import os
import urllib.request
import zipfile

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import rasterio as rio
from pyproj import Proj, transform
from scipy import stats
from scipy.spatial import cKDTree

from sklearn import linear_model

%matplotlib inline
plt.rcParams['figure.figsize'] = (14.0, 10.0) # larger plots

# Get NOAA weather stations

First, let's get the [list of NOAA weather stations](https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt) so that we have a lat/lon location for each:

🏁

In [2]:
file = 'ghcnd-stations.txt'

if os.path.isfile(file):
    print('Stations already downloaded, using local file.')
else:
    print('Using online stations file directly.')
    file = 'https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-stations.txt'

# we are using 100000 rows here to let pandas figure out the column widths - this is a bit slower,
# but makes sure that we get all the stations way out west or south correctly without chopping off the minus sign
stations = pd.read_fwf(file, 
            infer_nrows=100000, # how many rows to use to infer the column widths
            usecols = [0,1,2,3,5],
            names = ["station", "lat", "lon", "elevation", "name"])

stations.head()

Stations already downloaded, using local file.


Unnamed: 0,station,lat,lon,elevation,name
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL
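
Letting pandas infer the column widths works, but the layout can also be spelled out. The sketch below parses one line with plain string slicing. The column positions are an assumption based on the GHCN-Daily readme as I understand it (ID in columns 1-11, latitude 13-20, longitude 22-30, elevation 32-37, name 42-71) and should be checked against the current readme. `FIELDS` and `parse_station_line` are illustrative names, and the sample line is built to match that assumed layout:

```python
# Assumed column slices (0-based, end-exclusive) for ghcnd-stations.txt
FIELDS = {
    'station':   (0, 11),
    'lat':       (12, 20),
    'lon':       (21, 30),
    'elevation': (31, 37),
    'name':      (41, 71),
}


def parse_station_line(line):
    """Cut one fixed-width line into a dict and convert the numeric fields."""
    record = {key: line[start:end].strip() for key, (start, end) in FIELDS.items()}
    for key in ('lat', 'lon', 'elevation'):
        record[key] = float(record[key])
    return record


# Sample line constructed to follow the assumed layout (the 2-character state field is blank)
sample = (f"{'ACW00011604':<11} {17.1167:8.4f} {-61.7833:9.4f} "
          f"{10.1:6.1f} {'':2} {'ST JOHNS COOLIDGE FLD':<30}")

record = parse_station_line(sample)
```

If the positions hold, the same slices could be passed to `pd.read_fwf` via its `colspecs` parameter instead of relying on `infer_nrows`.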


Check the range of the lat and lon columns to make sure the coordinates have been parsed correctly:

In [3]:
print(f'Lats go from {stations.lat.min()} to {stations.lat.max()}')
print(f'Lons go from {stations.lon.min()} to {stations.lon.max()}')

Lats go from -90.0 to 83.65
Lons go from -179.983 to 179.32


Pull out the country ID from the station column (first two letters):

In [4]:
stations["country"] = stations["station"].astype(str).str[0:2]
stations.head()

Unnamed: 0,station,lat,lon,elevation,name,country
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE


The GHSL data we'll be using later is in Mollweide projection, so we need to [project](https://github.com/pyproj4/pyproj) the stations' lat/lon coordinates to the World Mollweide projection that the raster uses before we can pick up the raster values at those locations:

🏁

In [5]:
file = 'stations_moll.csv'

if os.path.isfile(file):
    print('Stations already projected to Mollweide, using local file.')
    stations = pd.read_csv(file, index_col=0)
else:
    # Proj(init=...) and transform() are deprecated in recent pyproj releases; Transformer replaces both
    from pyproj import Transformer

    transformer = Transformer.from_crs(
        'EPSG:4326',  # lat/lon
        '+proj=moll +lon_0=0 +x_0=0 +y_0=0 +datum=WGS84 +units=m +no_defs',  # Mollweide
        always_xy=True)  # coordinates are passed as (lon, lat)

    # project all stations to Mollweide in one vectorised call
    # and add the projected coordinates back to the stations dataframe
    stations['mollX'], stations['mollY'] = transformer.transform(
        stations['lon'].values, stations['lat'].values)
    stations.to_csv(file)


stations.head()

Stations already projected to Mollweide, using local file.


Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0
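
To see what the projection does, here is a minimal pure-Python sketch of the forward Mollweide formulas on a sphere. It assumes a sphere radius equal to the WGS84 semi-major axis (6378137 m), which appears to be consistent with the projected values in the table above. `mollweide_forward` is an illustrative helper, not a pyproj function, and pyproj remains the tool to use for real work:

```python
import math

R = 6378137.0  # assumed sphere radius: WGS84 semi-major axis in metres


def mollweide_forward(lon_deg, lat_deg, radius=R, tol=1e-12, max_iter=50):
    """Project lon/lat in degrees to Mollweide x/y in metres (central meridian 0)."""
    lam = math.radians(lon_deg)
    phi = math.radians(lat_deg)

    if abs(abs(phi) - math.pi / 2) < 1e-12:
        # at the poles the auxiliary angle is +/- 90 degrees (Newton's method would divide by 0)
        theta = math.copysign(math.pi / 2, phi)
    else:
        # solve 2*theta + sin(2*theta) = pi*sin(phi) with Newton's method
        theta = phi
        target = math.pi * math.sin(phi)
        for _ in range(max_iter):
            delta = (2 * theta + math.sin(2 * theta) - target) / (2 + 2 * math.cos(2 * theta))
            theta -= delta
            if abs(delta) < tol:
                break

    x = radius * 2 * math.sqrt(2) / math.pi * lam * math.cos(theta)
    y = radius * math.sqrt(2) * math.sin(theta)
    return x, y


# first station in the table above; expect roughly (-6021233, 2104299)
x, y = mollweide_forward(-61.7833, 17.1167)
```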


# Spatial Index

Next, we'll build a spatial index of the stations, so we can quickly look up the nearest neighbors of any station. We'll be using [scipy.spatial.cKDTree](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.query.html) (based on [this hint](https://gis.stackexchange.com/a/301935/33224)). For that, we pull out just the Mollweide coordinates and build the index on those; otherwise the tree would treat every column as a dimension:

In [6]:
stationsIndex = cKDTree(stations[['mollX','mollY']].values)

Let's try to find the closest stations to a specific one, identified by its station ID:

In [7]:
def queryStation(stationID, stations, stationsIndex, k=1):
    # Mollweide coordinates of the station we are querying for
    queryCoords = stations[stations.station==stationID][['mollX','mollY']].values
    dd, ii = stationsIndex.query(queryCoords, 
                             k=range(2,2+k), # start at 2, otherwise we get the station itself as first result
                             workers=-1,     # use all CPUs (this argument was called n_jobs before SciPy 1.6)
                             p=2   )         # p-norm 2 = Euclidean distance
    
    knn = stations.iloc[ii[0]]
    knn = knn.copy()  # otherwise we might be modifying the stations dataframe...
    knn['distance'] = dd[0]
    return knn

Test:

In [8]:
stationID = 'RQC00663871'   # GARZAS station
queryStation(stationID, stations, stationsIndex, k=5)

Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,distance
47206,RQC00660053,18.1614,-66.7222,506.0,ADJUNTAS 1 S,RQ,-6479289.0,2231113.0,1924.820868
47331,RQC00668684,18.1333,-66.7333,868.7,SALTILLO 2 ADJUNTAS,RQ,-6481011.0,2227706.0,2060.888936
47266,RQC00664614,18.1506,-66.7719,2.7,HUMACAO NATURAL RESERVE,RQ,-6484363.0,2229804.0,3735.475164
47207,RQC00660061,18.1747,-66.7978,557.8,ADJUNTAS SUBSTN,RQ,-6486324.0,2232726.0,6435.610695
47309,RQC00666982,18.0833,-66.7333,349.9,PENUELAS SALTO GARZAS,RQ,-6482155.0,2221642.0,8231.2522
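
Why the query starts at `k=2`: the query point is itself part of the index, so its nearest hit is the station itself at distance 0. A brute-force sketch with the first five stations from the projected stations table shows the same logic without a KD-tree (`nearest_neighbors` is a hypothetical helper, and brute force is only sensible for tiny inputs):

```python
import math

# (station, mollX, mollY) for the first five stations
STATIONS = [
    ('ACW00011604', -6021233.0, 2104299.0),
    ('ACW00011647', -6020901.0, 2106316.0),
    ('AE000041196', 5226731.0, 3092960.0),
    ('AEM00041194', 5214407.0, 3083680.0),
    ('AEM00041217', 5168502.0, 2985740.0),
]


def nearest_neighbors(station_id, stations, k=1):
    """Return the k nearest stations as (station, distance) pairs, excluding the station itself."""
    coords = {sid: (x, y) for sid, x, y in stations}
    origin = coords[station_id]
    ranked = sorted((math.dist(origin, xy), sid) for sid, xy in coords.items())
    # ranked[0] is the station itself at distance 0, so skip it
    return [(sid, dist) for dist, sid in ranked[1:1 + k]]


nn_id, nn_dist = nearest_neighbors('ACW00011604', STATIONS)[0]
```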


# Get [GHSL population data](https://ghsl.jrc.ec.europa.eu/ghs_pop.php) for 2015

Download dir at http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GPW4_GLOBE_R2015A/

At this point, we only use this dataset to make sure we only keep stations for which we actually have population data, and get rid of those in the far North and South (they are quite unlikely to experience UHI anyway... 🥶)

🏁

In [9]:
file = 'GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0.tif'
file_zip = 'GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0.zip'

if os.path.isfile(file):
    print('GHSL population data for 2015 already downloaded.')
else:
    print('Downloading data...')
    url = 'http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GPW4_GLOBE_R2015A/GHS_POP_GPW42015_GLOBE_R2015A_54009_250/V1-0/GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0.zip'
    urllib.request.urlretrieve(url, file_zip)

    print('Unzipping...')
    zip_ref = zipfile.ZipFile(file_zip, 'r')
    zip_ref.extractall('.')
    zip_ref.close()

    print('Cleaning up...')    
    # remove the ZIP file and the extracted overview file - we don't need it and the .ovr file is huge (3GB!)
    os.remove(file_zip)
    os.remove('GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0.tif.ovr')
    print('Done.')

GHSL population data for 2015 already downloaded.


We'll use the data to assign each station an estimate of the population density in the GHSL cell that it is in. Since GHSL is in an equal area projection (i.e. all cells have the same area), we can safely do that.

We'll use [rasterio's sample method](https://gis.stackexchange.com/questions/190423/getting-pixel-values-at-single-point-using-rasterio) for that. Let's read in the GeoTIFF first:

In [10]:
pop2015 = rio.open(file)
pop2015

<open DatasetReader name='GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW42015_GLOBE_R2015A_54009_250_v1_0.tif' mode='r'>

Check that all coordinates are in the raster's bounding box:

In [11]:
print(min(stations['mollX']) > pop2015.bounds.left)
print(max(stations['mollX']) < pop2015.bounds.right)

print(min(stations['mollY']) > pop2015.bounds.bottom)
print(max(stations['mollY']) < pop2015.bounds.top)

True
True
False
False


Okay, so there are some stations north and south of our raster. Remove the stations that are outside of our raster's bounding box (they are not really useful for our UHI analysis anyway, and [rasterio seems to trip over them](https://gis.stackexchange.com/questions/323481/error-using-rasterios-sample-method)):

In [12]:
print(f'Before removal: {len(stations.index)} stations.')
stations = stations.drop(stations[stations['mollY'] < pop2015.bounds.bottom].index)
stations = stations.drop(stations[stations['mollY'] > pop2015.bounds.top].index)
print(f'After removal: {len(stations.index)} stations.')

Before removal: 113951 stations.
After removal: 113848 stations.


Save to a CSV file again and re-build the spatial index, now that we have removed some stations:

In [13]:
stations.to_csv('stations_moll_inraster.csv')
stationsIndex = cKDTree(stations[['mollX','mollY']].values)

Now we can use the remaining station locations to sample the raster and add a column with the 2015 population of the cell each station is in:

🏁

In [14]:
file = 'stations_moll_inraster_pop2015.csv'

if os.path.isfile(file):
    print('Stations data with 2015 population already generated; reusing existing file.')
    stations = pd.read_csv(file, index_col=0)
else:

    locations = list(zip(stations['mollX'], stations['mollY']))
    pop2015col = []

    for val in pop2015.sample(locations):
        pop2015col.append(val[0])

    # make this list a new column in our stations dataframe
    stations['pop2015'] = pop2015col
    stations.to_csv(file)

stations.head()

Stations data with 2015 population already generated; reusing existing file.


Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755


# Population 2000

Download and attach the population data for 2000 the same way as above:

In [15]:
file = 'GHS_POP_GPW42000_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW42000_GLOBE_R2015A_54009_250_v1_0.tif'
file_zip = 'GHS_POP_GPW42000_GLOBE_R2015A_54009_250_v1_0.zip'

if os.path.isfile(file):
    print('GHSL population data for 2000 already downloaded.')
else:
    print('Downloading data...')
    url = 'http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GPW4_GLOBE_R2015A/GHS_POP_GPW42000_GLOBE_R2015A_54009_250/V1-0/GHS_POP_GPW42000_GLOBE_R2015A_54009_250_v1_0.zip'
    urllib.request.urlretrieve(url, file_zip)

    print('Unzipping...')
    zip_ref = zipfile.ZipFile(file_zip, 'r')
    zip_ref.extractall('.')
    zip_ref.close()

    print('Cleaning up...')    
    # remove the ZIP file and the extracted overview file - we don't need it and the .ovr file is huge (3GB!)
    os.remove(file_zip)
    os.remove('GHS_POP_GPW42000_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW42000_GLOBE_R2015A_54009_250_v1_0.tif.ovr')
    print('Done.')

pop2000 = rio.open(file)

GHSL population data for 2000 already downloaded.


In [16]:
file = 'stations_moll_inraster_pop2015_2000.csv'

if os.path.isfile(file):
    print('Stations data with 2015 and 2000 population already generated; reusing existing file.')
    stations = pd.read_csv(file, index_col=0)
else:

    locations = list(zip(stations['mollX'], stations['mollY']))
    pop2000col = []

    for val in pop2000.sample(locations):
        pop2000col.append(val[0])

    # make this list a new column in our stations dataframe
    stations['pop2000'] = pop2000col
    stations.to_csv(file)

stations.head()

Stations data with 2015 and 2000 population already generated; reusing existing file.


Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015,popdens2015,NN,NN_dist,NN_elev,nn_elev_diff,NN_lat,NN_popdens2015,pop2000
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0,0.0,,,,-9.1,,,0.0
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0,0.0,,,,9.1,,,0.0
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0,0.0,,,,23.6,,,0.0
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079,1656.332642,,,,-23.6,,,16.931656
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755,2886.028076,,,,-238.1,,,84.788246


# Population 1990

Rinse and repeat...

In [17]:
file = 'GHS_POP_GPW41990_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW41990_GLOBE_R2015A_54009_250_v1_0.tif'
file_zip = 'GHS_POP_GPW41990_GLOBE_R2015A_54009_250_v1_0.zip'

if os.path.isfile(file):
    print('GHSL population data for 1990 already downloaded.')
else:
    print('Downloading data...')
    url = 'http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GPW4_GLOBE_R2015A/GHS_POP_GPW41990_GLOBE_R2015A_54009_250/V1-0/GHS_POP_GPW41990_GLOBE_R2015A_54009_250_v1_0.zip'
    urllib.request.urlretrieve(url, file_zip)

    print('Unzipping...')
    zip_ref = zipfile.ZipFile(file_zip, 'r')
    zip_ref.extractall('.')
    zip_ref.close()

    print('Cleaning up...')    
    # remove the ZIP file and the extracted overview file - we don't need it and the .ovr file is huge (3GB!)
    os.remove(file_zip)
    os.remove('GHS_POP_GPW41990_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW41990_GLOBE_R2015A_54009_250_v1_0.tif.ovr')
    print('Done.')

pop1990 = rio.open(file)

GHSL population data for 1990 already downloaded.


In [18]:
file = 'stations_moll_inraster_pop2015_2000_1990.csv'

if os.path.isfile(file):
    print('Stations data with 2015, 2000 and 1990 population already generated; reusing existing file.')
    stations = pd.read_csv(file, index_col=0)
else:

    locations = list(zip(stations['mollX'], stations['mollY']))
    pop1990col = []

    for val in pop1990.sample(locations):
        pop1990col.append(val[0])

    # make this list a new column in our stations dataframe
    stations['pop1990'] = pop1990col
    stations.to_csv(file)

stations.head(15)

Stations data with 2015, 2000 and 1990 population already generated; reusing existing file.


Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015,popdens2015,NN,NN_dist,NN_elev,nn_elev_diff,NN_lat,NN_popdens2015,pop2000,pop1990
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0,0.0,,,,-9.1,,,0.0,0.0
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0,0.0,,,,9.1,,,0.0,0.0
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0,0.0,,,,23.6,,,0.0,0.0
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079,1656.332642,,,,-23.6,,,16.931656,9.649481
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755,2886.028076,,,,-238.1,,,84.788246,54.742096
5,AEM00041218,24.262,55.609,264.9,AL AIN INTL,AE,5263508.0,2965335.0,0.0,0.0,,,,-34.1,,,0.0,0.0
6,AF000040930,35.317,69.017,3366.0,NORTH-SALANG,AF,6097232.0,4259532.0,0.0,0.0,,,,1574.7,,,0.0,0.0
7,AFM00040938,34.21,62.228,977.2,HERAT,AF,5543622.0,4132507.0,21.793083,348.689331,,,,352.2,,,0.0,0.0
8,AFM00040948,34.566,69.212,1791.3,KABUL INTL,AF,6149475.0,4173426.0,0.0,0.0,,,,-1574.7,,,0.0,0.0
9,AFM00040990,31.5,65.85,1010.0,KANDAHAR AIRPORT,AF,5978977.0,3818927.0,0.0,0.0,,,,-590.2,,,0.0,0.0


# Population 1975

One more time... [🤖](https://www.youtube.com/watch?v=FGBhQbmPwH8)

In [19]:
file = 'GHS_POP_GPW41975_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW41975_GLOBE_R2015A_54009_250_v1_0.tif'
file_zip = 'GHS_POP_GPW41975_GLOBE_R2015A_54009_250_v1_0.zip'

if os.path.isfile(file):
    print('GHSL population data for 1975 already downloaded.')
else:
    print('Downloading data...')
    url = 'http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/GHS_POP_GPW4_GLOBE_R2015A/GHS_POP_GPW41975_GLOBE_R2015A_54009_250/V1-0/GHS_POP_GPW41975_GLOBE_R2015A_54009_250_v1_0.zip'
    urllib.request.urlretrieve(url, file_zip)

    print('Unzipping...')
    zip_ref = zipfile.ZipFile(file_zip, 'r')
    zip_ref.extractall('.')
    zip_ref.close()

    print('Cleaning up...')    
    # remove the ZIP file and the extracted overview file - we don't need it and the .ovr file is huge (3GB!)
    os.remove(file_zip)
    os.remove('GHS_POP_GPW41975_GLOBE_R2015A_54009_250_v1_0/GHS_POP_GPW41975_GLOBE_R2015A_54009_250_v1_0.tif.ovr')
    print('Done.')

pop1975 = rio.open(file)

GHSL population data for 1975 already downloaded.


In [20]:
file = 'stations_moll_inraster_pop2015_2000_1990_1975.csv'

if os.path.isfile(file):
    print('Stations data with 2015, 2000, 1990 and 1975 population already generated; reusing existing file.')
    stations = pd.read_csv(file, index_col=0)
else:

    locations = list(zip(stations['mollX'], stations['mollY']))
    pop1975col = []

    for val in pop1975.sample(locations):
        pop1975col.append(val[0])

    # make this list a new column in our stations dataframe
    stations['pop1975'] = pop1975col
    stations.to_csv(file)

stations.head(15)

Stations data with 2015, 2000, 1990 and 1975 population already generated; reusing existing file.


Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015,popdens2015,NN,NN_dist,NN_elev,nn_elev_diff,NN_lat,NN_popdens2015,pop2000,pop1990,pop1975
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0,0.0,,,,-9.1,,,0.0,0.0,0.0
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0,0.0,,,,9.1,,,0.0,0.0,0.0
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0,0.0,,,,23.6,,,0.0,0.0,0.0
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079,1656.332642,,,,-23.6,,,16.931656,9.649481,2.57145
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755,2886.028076,,,,-238.1,,,84.788246,54.742096,21.834427
5,AEM00041218,24.262,55.609,264.9,AL AIN INTL,AE,5263508.0,2965335.0,0.0,0.0,,,,-34.1,,,0.0,0.0,0.0
6,AF000040930,35.317,69.017,3366.0,NORTH-SALANG,AF,6097232.0,4259532.0,0.0,0.0,,,,1574.7,,,0.0,0.0,0.0
7,AFM00040938,34.21,62.228,977.2,HERAT,AF,5543622.0,4132507.0,21.793083,348.689331,,,,352.2,,,0.0,0.0,0.0
8,AFM00040948,34.566,69.212,1791.3,KABUL INTL,AF,6149475.0,4173426.0,0.0,0.0,,,,-1574.7,,,0.0,0.0,0.0
9,AFM00040990,31.5,65.85,1010.0,KANDAHAR AIRPORT,AF,5978977.0,3818927.0,0.0,0.0,,,,-590.2,,,0.0,0.0,0.0
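
The four download cells differ only in the year, so the file names and the URL could be generated by a small helper. `ghsl_paths` is a hypothetical function; the patterns are copied from the cells above and only cover this R2015A release at 250 m resolution:

```python
GHSL_BASE = ('http://cidportal.jrc.ec.europa.eu/ftp/jrc-opendata/GHSL/'
             'GHS_POP_GPW4_GLOBE_R2015A')


def ghsl_paths(year):
    """Build the local paths and the download URL for one GHSL population epoch."""
    stem = f'GHS_POP_GPW4{year}_GLOBE_R2015A_54009_250'
    name = f'{stem}_v1_0'
    return {
        'tif': f'{name}/{name}.tif',       # extracted GeoTIFF
        'ovr': f'{name}/{name}.tif.ovr',   # large overview file that gets deleted
        'zip': f'{name}.zip',              # downloaded archive
        'url': f'{GHSL_BASE}/{stem}/V1-0/{name}.zip',
    }


paths_2015 = ghsl_paths(2015)
```

A loop over `[2015, 2000, 1990, 1975]` could then download, unzip and sample each raster in turn.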


## Population densities

The cells in our population dataset are only [250x250m](https://ghsl.jrc.ec.europa.eu/ghs_pop2019.php), i.e. 1/16 km<sup>2</sup>, so to get the population per km<sup>2</sup> we have to multiply the cell values by 16. This makes the numbers easier to compare to common measures of population density in people per km<sup>2</sup>:

In [21]:
stations["popdens2015"] = stations["pop2015"] * 16
stations["popdens2000"] = stations["pop2000"] * 16
stations["popdens1990"] = stations["pop1990"] * 16
stations["popdens1975"] = stations["pop1975"] * 16

stations.head()

Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015,popdens2015,...,NN_elev,nn_elev_diff,NN_lat,NN_popdens2015,pop2000,pop1990,pop1975,popdens2000,popdens1990,popdens1975
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0,0.0,...,,-9.1,,,0.0,0.0,0.0,0.0,0.0,0.0
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0,0.0,...,,9.1,,,0.0,0.0,0.0,0.0,0.0,0.0
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0,0.0,...,,23.6,,,0.0,0.0,0.0,0.0,0.0,0.0
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079,1656.332642,...,,-23.6,,,16.931656,9.649481,2.57145,270.906494,154.391693,41.143192
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755,2886.028076,...,,-238.1,,,84.788246,54.742096,21.834427,1356.611938,875.873535,349.35083
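
The factor of 16 follows directly from the cell size. A quick check, using the 2015 cell value for Dubai from the table above (the names here are illustrative):

```python
CELL_SIZE_M = 250  # edge length of one GHSL cell in metres

cell_area_km2 = (CELL_SIZE_M / 1000) ** 2   # area of one cell in square kilometres
cells_per_km2 = 1 / cell_area_km2           # how many cells fit into one square kilometre


def density_per_km2(people_in_cell):
    """Scale a per-cell population count to people per square kilometre."""
    return people_in_cell * cells_per_km2


dubai_density = density_per_km2(103.52079)
```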


### Comparisons with nearest neighbor

Let's find the closest neighbor for each station, so that we can later calculate differences in population density (for each year), elevation, and latitude between nearest-neighbor stations:

In [22]:
# add new columns to hold the ID of the nearest station and the distance to it

stations["NN"] = ""
stations["NN_dist"] = ""
stations["NN_elev"] = ""
stations["NN_lat"] = ""
stations["NN_popdens2015"] = ""
stations["NN_popdens2000"] = ""
stations["NN_popdens1990"] = ""
stations["NN_popdens1975"] = ""
stations["NN_lat_diff"] = ""
stations["NN_popdens2015_diff"] = ""
stations["NN_popdens2000_diff"] = ""
stations["NN_popdens1990_diff"] = ""
stations["NN_popdens1975_diff"] = ""

stations.head()

Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015,popdens2015,...,popdens1990,popdens1975,NN_popdens2000,NN_popdens1990,NN_popdens1975,NN_lat_diff,NN_popdens2015_diff,NN_popdens2000_diff,NN_popdens1990_diff,NN_popdens1975_diff
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0,0.0,...,0.0,0.0,,,,,,,,
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0,0.0,...,0.0,0.0,,,,,,,,
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0,0.0,...,0.0,0.0,,,,,,,,
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079,1656.332642,...,154.391693,41.143192,,,,,,,,
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755,2886.028076,...,875.873535,349.35083,,,,,,,,


Iterate through the DF and fill these columns row by row. 

🔥 **TODO** This is reeeeally slow, I'm sure there must be a faster way to do this... for now, I'm at least making sure we only need to do this once by saving the results in a CSV that can just be loaded again

🏁

In [23]:
file = 'stations_with_nearest_neighbor_data.csv'

if os.path.isfile(file):
    print('Nearest neighbor data already computed, using local file.')
    stations = pd.read_csv(file, index_col=0)
else:
    # nearest neighbor station:    
    for index, row in stations.iterrows():
        s = row['station']
        
        # look up nearest neighbor station
        nn = queryStation(s, stations, stationsIndex)
                    
        # pick up values from nearest neighbor station
        stations.loc[stations.station == s, 'NN']      = nn['station'].iloc[0]
        stations.loc[stations.station == s, 'NN_dist'] = nn['distance'].iloc[0]        
        stations.loc[stations.station == s, 'NN_elev'] = nn['elevation'].iloc[0]
        stations.loc[stations.station == s, 'NN_lat']  = nn['lat'].iloc[0]
        
        stations.loc[stations.station == s, 'NN_popdens2015'] = nn['popdens2015'].iloc[0]
        stations.loc[stations.station == s, 'NN_popdens2000'] = nn['popdens2000'].iloc[0]
        stations.loc[stations.station == s, 'NN_popdens1990'] = nn['popdens1990'].iloc[0]
        stations.loc[stations.station == s, 'NN_popdens1975'] = nn['popdens1975'].iloc[0]

# calculate differences between station and nearest neighbor
stations["NN_lat_diff"]         = stations["lat"] - stations["NN_lat"]
stations["NN_popdens2015_diff"] = stations["popdens2015"] - stations["NN_popdens2015"]
stations["NN_popdens2000_diff"] = stations["popdens2000"] - stations["NN_popdens2000"]
stations["NN_popdens1990_diff"] = stations["popdens1990"] - stations["NN_popdens1990"]
stations["NN_popdens1975_diff"] = stations["popdens1975"] - stations["NN_popdens1975"]
                
stations.to_csv(file)
        
stations.head()

Nearest neighbor data already computed, using local file.


Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015,popdens2015,...,popdens1990,popdens1975,NN_popdens2000,NN_popdens1990,NN_popdens1975,NN_lat_diff,NN_popdens2015_diff,NN_popdens2000_diff,NN_popdens1990_diff,NN_popdens1975_diff
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,-0.0166,0.0,0.0,0.0,0.0
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0166,0.0,0.0,0.0,0.0
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0,0.0,...,0.0,0.0,270.906494,154.391693,41.143192,0.078,-1656.332642,-270.906494,-154.391693,-41.143192
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079,1656.332642,...,154.391693,41.143192,0.0,0.0,0.0,-0.078,1656.332642,270.906494,154.391693,41.143192
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755,2886.028076,...,875.873535,349.35083,0.0,0.0,0.0,0.171,2886.028076,1356.611938,875.873535,349.35083
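
One possible way to speed this up, sketched under the assumption that no two stations share the exact same coordinates: query the index once for all stations with `k=2` and keep the second hit, instead of calling `queryStation` row by row. The example uses the first five stations only, so the neighbor of the last station differs from the full run:

```python
import numpy as np
from scipy.spatial import cKDTree

station_ids = np.array(['ACW00011604', 'ACW00011647', 'AE000041196',
                        'AEM00041194', 'AEM00041217'])
coords = np.array([
    [-6021233.0, 2104299.0],
    [-6020901.0, 2106316.0],
    [5226731.0, 3092960.0],
    [5214407.0, 3083680.0],
    [5168502.0, 2985740.0],
])

tree = cKDTree(coords)

# one batched query: column 0 is each station itself, column 1 its nearest neighbor
dist, idx = tree.query(coords, k=2)
nn_index = idx[:, 1]
nn_dist = dist[:, 1]
nn_ids = station_ids[nn_index]
```

With the full dataframe, `stations['NN'] = stations['station'].values[nn_index]` and `stations['NN_elev'] = stations['elevation'].values[nn_index]` would fill the neighbor columns without a Python loop. Positional indexing is used because the dataframe index has gaps after the earlier `drop`.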


In [24]:
stations.head(15)[['station','NN','NN_dist']]

Unnamed: 0,station,NN,NN_dist
0,ACW00011604,ACW00011647,2044.423916
1,ACW00011647,ACW00011604,2044.423916
2,AE000041196,AEM00041194,15427.585565
3,AEM00041194,AE000041196,15427.585565
4,AEM00041217,AEM00041218,97173.04912
5,AEM00041218,MUM00041244,17560.373332
6,AF000040930,AFM00040948,100714.428119
7,AFM00040938,TX000038987,127736.107849
8,AFM00040948,AF000040930,100714.428119
9,AFM00040990,PKM00041660,207876.626338


While working with the data, I saw that some stations are missing elevation information (marked with `-999.9`). We will need the elevation later, so let's check how many stations are affected:

In [25]:
print(str(len(stations[stations.elevation == -999.9])),'out of',str(len(stations)),'lack elevation data.')

0 out of 113848 lack elevation data.


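That is a small percentage of all stations, but let's still fix it. (The count above is 0, presumably because the cached CSV loaded earlier already contains the looked-up elevations; on a fresh run, roughly 4000 stations are affected.) We'll use [GPS Visualizer's lookup service](https://www.gpsvisualizer.com/elevation) for that. The API is not documented, but it was quite easy to figure out and saves us downloading gigabytes of the underlying [SRTM data](https://www2.jpl.nasa.gov/srtm/):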

In [26]:
def elevation_lookup(lat, lon):
    url = 'https://www.gpsvisualizer.com/elevation_data/elev2018.js?coords='+str(lat)+'%2C'+str(lon)+'&gv_nocache=1563196454722'

    with urllib.request.urlopen(url) as f:
        response = f.read(100).decode('utf-8')
        ele = response.split("(")[1].split(",")[0]
        if ele == 'null':
            return 0.0
        
        number = float(ele)
        if number == 0.0:
            return number
        
        # The service returns the reciprocal of the elevation. I don't know if this is some
        # kind of super-cheap encryption to make their API harder to use, but they do the
        # same operation and the output makes sense.
        return (1./number)

    

Loop through the stations with missing elevation and look up the elevation at each lat/lon position using our function. This is going to take a while for the ~4000 stations without elevation 😴

🏁

In [27]:
file = 'stations_with_nearest_neighbor_data_fixed_elevation.csv'

if os.path.isfile(file):
    print('Elevation already looked up, using local file.')
    stations = pd.read_csv(file, index_col=0)
else:
    for index, row in stations[stations.elevation == -999.9].iterrows():
        new_elev = elevation_lookup(row.lat, row.lon)
        #print(new_elev)
        stations.loc[stations.station == row.station, 'elevation']  = new_elev
    
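    # refresh the neighbors' elevations as well: NN_elev was copied before the fix
    # and may still hold -999.9 (this assumes station IDs are unique)
    stations['NN_elev'] = stations['NN'].map(stations.set_index('station')['elevation'])

    # recalculate elevation difference
    stations['nn_elev_diff'] = stations['elevation'] - stations['NN_elev']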
        
    stations.to_csv(file)
    print('Done looking up elevation values.')


Elevation already looked up, using local file.


Next, we will calculate the differences in latitude and population density (for the four periods) between a given station and its nearest neighbor. These lower-case `nn_*_diff` columns repeat the `NN_*_diff` values computed above:

In [28]:
file = 'stations_with_nearest_neighbor_data_differences.csv'

if os.path.isfile(file):
    print('Differences already calculated, using local file.')
    stations = pd.read_csv(file, index_col=0)
else:
    stations['nn_popdens2015_diff'] = stations['popdens2015'] - stations['NN_popdens2015']
    stations['nn_popdens2000_diff'] = stations['popdens2000'] - stations['NN_popdens2000']
    stations['nn_popdens1990_diff'] = stations['popdens1990'] - stations['NN_popdens1990']
    stations['nn_popdens1975_diff'] = stations['popdens1975'] - stations['NN_popdens1975']
    stations['nn_lat_diff'] = stations['lat'] - stations['NN_lat']
    
    stations.to_csv(file)

    
stations.head(15)[['station',
                   'NN',
                   'NN_dist',
                   'nn_popdens2015_diff',
                   'nn_popdens2000_diff',
                   'nn_popdens1990_diff',
                   'nn_popdens1975_diff',
                   'nn_lat_diff']]

Differences already calculated, using local file.


Unnamed: 0,station,NN,NN_dist,nn_popdens2015_diff,nn_popdens2000_diff,nn_popdens1990_diff,nn_popdens1975_diff,nn_lat_diff
0,ACW00011604,ACW00011647,2044.423916,0.0,0.0,0.0,0.0,-0.0166
1,ACW00011647,ACW00011604,2044.423916,0.0,0.0,0.0,0.0,0.0166
2,AE000041196,AEM00041194,15427.585565,-1656.332642,-270.906494,-154.391693,-41.143192,0.078
3,AEM00041194,AE000041196,15427.585565,1656.332642,270.906494,154.391693,41.143192,-0.078
4,AEM00041217,AEM00041218,97173.04912,2886.028076,1356.611938,875.873535,349.35083,0.171
5,AEM00041218,MUM00041244,17560.373332,-211.781464,-1238.914917,-1645.235107,-1309.412109,0.029
6,AF000040930,AFM00040948,100714.428119,0.0,0.0,0.0,0.0,0.751
7,AFM00040938,TX000038987,127736.107849,348.689331,-4691.432129,-1062.285278,0.0,-1.0731
8,AFM00040948,AF000040930,100714.428119,0.0,0.0,0.0,0.0,-0.751
9,AFM00040990,PKM00041660,207876.626338,0.0,0.0,0.0,0.0,1.249


# Climate Zones

Now we'll add information about the climate zone each station is in, using these [world maps of the Köppen-Geiger Climate Classification](http://koeppen-geiger.vu-wien.ac.at/present.htm).

In [29]:
file = 'Map_KG-Global/KG_1986-2010.grd'
file_zip = 'Map_KG-Global.zip'

if os.path.isfile(file):
    print('Climate zones already downloaded.')
else:
    print('Downloading data...')
    url = 'http://koeppen-geiger.vu-wien.ac.at/Rcode/Map_KG-Global.zip'
    urllib.request.urlretrieve(url, file_zip)

    print('Unzipping...')
    zip_ref = zipfile.ZipFile(file_zip, 'r')
    zip_ref.extractall('.')
    zip_ref.close()

    print('Cleaning up...')    
    # remove the ZIP file - we don't need it any more
    os.remove(file_zip)
    print('Done.')

Climate zones already downloaded.


In [30]:
climateZones = rio.open(file)
climateZones

<open DatasetReader name='Map_KG-Global/KG_1986-2010.grd' mode='r'>

In [33]:
file = 'stations_with_climate_zones.csv'


if os.path.isfile(file):
    print('Climate zones already attached, using local file.')
    stations = pd.read_csv(file, index_col=0)
else:
    # sample with lon/lat here, not with the Mollweide coordinates used for the GHSL rasters
    locations = list(zip(stations['lon'], stations['lat']))
    zones = []

    for val in climateZones.sample(locations):
        zones.append(val[0])

    # make this list a new column in our stations dataframe
    stations['climatezone'] = zones
    stations.to_csv(file)

stations.head()

Unnamed: 0,station,lat,lon,elevation,name,country,mollX,mollY,pop2015,popdens2015,...,NN_popdens2015_diff,NN_popdens2000_diff,NN_popdens1990_diff,NN_popdens1975_diff,nn_popdens2015_diff,nn_popdens2000_diff,nn_popdens1990_diff,nn_popdens1975_diff,nn_lat_diff,climatezone
0,ACW00011604,17.1167,-61.7833,10.1,ST JOHNS COOLIDGE FLD,AC,-6021233.0,2104299.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.0166,2
1,ACW00011647,17.1333,-61.7833,19.2,ST JOHNS,AC,-6020901.0,2106316.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0166,2
2,AE000041196,25.333,55.517,34.0,SHARJAH INTER. AIRP,AE,5226731.0,3092960.0,0.0,0.0,...,-1656.332642,-270.906494,-154.391693,-41.143192,-1656.332642,-270.906494,-154.391693,-41.143192,0.078,7
3,AEM00041194,25.255,55.364,10.4,DUBAI INTL,AE,5214407.0,3083680.0,103.52079,1656.332642,...,1656.332642,270.906494,154.391693,41.143192,1656.332642,270.906494,154.391693,41.143192,-0.078,7
4,AEM00041217,24.433,54.651,26.8,ABU DHABI INTL,AE,5168502.0,2985740.0,180.376755,2886.028076,...,2886.028076,1356.611938,875.873535,349.35083,2886.028076,1356.611938,875.873535,349.35083,0.171,7
