
# Urban Atlas - process shapefiles to compute stats and extract sampling locations

This notebook processes the collection of GIS vector data (shapefiles) on land use classes from the Urban Atlas dataset. The tasks we are interested in are:

* compute basic statistics on each shapefile, such as the spatial extent of the city and the amount of land classified within each city
* construct ground truth test grids (defined as a $25km \times 25km$ square window around the city center)
* define locations to sample images, both for testing (each cell of the ground truth grid) and for training (sampled appropriately from all available data)

In this analysis we work directly with vector data (polygons) for all the above tasks. This is a faster way to process the data for our specific use case of whole-image classification, as opposed to the more accurate approach of rasterizing the data first, then computing classification masks (which is more suited for image segmentation).
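
To make the vector-based labeling concrete, here is a minimal sketch of assigning a single label to an image tile from polygon overlap alone, with no rasterization step. It is not the `urbanatlas` implementation: axis-aligned rectangles stand in for real polygons, and the names `overlap_area` and `label_tile` are hypothetical.

```python
# Rectangles are (xmin, ymin, xmax, ymax) in km; they stand in for real polygons.

def overlap_area(a, b):
    """Area of the intersection of two axis-aligned rectangles (0 if disjoint)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0.0) * max(h, 0.0)

def label_tile(tile, polygons):
    """Return the class of the polygon with the largest overlap with the tile.

    polygons is a list of (class_name, rectangle) pairs.
    Returns None when no polygon overlaps the tile.
    """
    best_class, best_area = None, 0.0
    for cls, rect in polygons:
        area = overlap_area(tile, rect)
        if area > best_area:
            best_class, best_area = cls, area
    return best_class

# Illustrative data: a 0.25 km x 0.25 km tile split between two classes.
polygons = [
    ("Forests", (0.0, 0.0, 0.1, 0.25)),
    ("Water bodies", (0.1, 0.0, 0.25, 0.25)),
]
tile = (0.0, 0.0, 0.25, 0.25)
label = label_tile(tile, polygons)
print(label)
```

With real data the rectangle overlap would be replaced by a polygon intersection from a geometry library; the decision rule stays the same.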

# Setup: packages, paths etc.

In [1]:
import ipyparallel

rc = ipyparallel.Client()
all_engines = rc[:]
lbv = rc.load_balanced_view()

print len(all_engines)

48


In [4]:
%%px --local

# numeric packages
import numpy as np
import pandas as pd

# filesystem and OS
import sys, os, time
import glob

# plotting
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline

import seaborn as sns
sns.set_style("whitegrid", {'axes.grid' : False})

# compression
import gzip
import cPickle as pickle
import copy

# widgets and interaction
from ipywidgets import FloatProgress
from IPython.display import display, clear_output

import warnings
warnings.filterwarnings('ignore')

# these magics ensure that external modules that are modified are also automatically reloaded
%reload_ext autoreload
%autoreload 2

In [5]:
%%px --local

# custom module for analyzing Urban Atlas data

import sys
sys.path.append("../urbanatlas/")
import urbanatlas as ua

In [45]:
%%px --local

# path to shapefiles

shapefiles_path = "/home/data/urban-atlas/shapefiles/"

import re

def fn_process_path(s):
    b = os.path.basename(s).split(".")[0]
    country = b.split("_")[0]
    city = " ".join(b.split("_")[1:])
    country = re.findall("[a-zA-Z]+", country)[0]
    return (city, country)

shapefiles = glob.glob("%s/*/*/*.shp"%shapefiles_path)
shapefiles = {"%s, %s" % fn_process_path(f):f for f in shapefiles}

# path to save data

outPath = "/home/data/urban-atlas/extracted-data"

if not os.path.exists(outPath):
    os.makedirs(outPath)
    
# classes used in the Urban Atlas dataset

classes = '''Agricultural + Semi-natural areas + Wetlands
Airports
Construction sites
Continuous Urban Fabric (S.L. > 80%)
Discontinuous Dense Urban Fabric (S.L. : 50% -  80%)
Discontinuous Low Density Urban Fabric (S.L. : 10% - 30%)
Discontinuous Medium Density Urban Fabric (S.L. : 30% - 50%)
Discontinuous Very Low Density Urban Fabric (S.L. < 10%)
Fast transit roads and associated land
Forests
Green urban areas
Industrial, commercial, public, military and private units
Isolated Structures
Land without current use
Mineral extraction and dump sites
Other roads and associated land
Port areas
Railways and associated land
Sports and leisure facilities
Water bodies'''.split("\n")

class2label = {c:i for i,c in enumerate(classes)}
label2class = {i:c for i,c in enumerate(classes)}


[stderr:0] 
[autoreload of urbanatlas failed: Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/IPython/extensions/autoreload.py", line 247, in check
    superreload(m, reload, self.old_objects)
ImportError: No module named pysatml
]
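
As a quick check of the naming convention behind the `shapefiles` dictionary, the snippet below applies `fn_process_path` (copied from the cell above) to the Bucharest shapefile path that appears later in this notebook. The function takes the country code from the leading letters of the file name and the city from the remaining underscore-separated parts.

```python
import os
import re

def fn_process_path(s):
    # same logic as in the notebook cell above
    b = os.path.basename(s).split(".")[0]          # e.g. "ro001l_bucuresti"
    country = b.split("_")[0]                      # e.g. "ro001l"
    city = " ".join(b.split("_")[1:])              # e.g. "bucuresti"
    country = re.findall("[a-zA-Z]+", country)[0]  # leading letters only: "ro"
    return (city, country)

path = "/home/data/urban-atlas/shapefiles/6/shape/ro001l_bucuresti.shp"
key = "%s, %s" % fn_process_path(path)
print(key)  # bucuresti, ro
```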

# Construct ground truth rasters for validation

Also compute useful stats within windows of L=25,30,50km around the city center:
* percentage of polygons per class 
* percentage of classified area per class
* percentage of classified area vs total area
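
The three statistics above can be sketched as follows. This is an illustration only, assuming each polygon is reduced to a `(class_name, area_km2)` pair; the function name `class_stats` is hypothetical and the values are returned as fractions rather than percentages, matching the output of `compute_classified_area` shown below.

```python
from collections import defaultdict

def class_stats(polygons, total_area):
    """polygons: list of (class_name, area_km2); total_area: window area in km^2.

    Returns (fraction of polygons per class,
             fraction of classified area per class,
             fraction of the window that is classified).
    """
    if not polygons:
        return {}, {}, 0.0
    count = defaultdict(int)
    area = defaultdict(float)
    for cls, a in polygons:
        count[cls] += 1
        area[cls] += a
    n_polygons = float(len(polygons))
    classified = sum(area.values())
    frac_polygons = {c: count[c] / n_polygons for c in count}
    frac_area = {c: area[c] / classified for c in area}
    return frac_polygons, frac_area, classified / float(total_area)

# Illustrative data for a 25 km x 25 km window (625 km^2).
polygons = [("Forests", 100.0), ("Forests", 50.0),
            ("Airports", 50.0), ("Water bodies", 50.0)]
frac_polygons, frac_area, frac_classified = class_stats(polygons, 25 * 25)
print(frac_classified)
```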

In [57]:
myname = "bucuresti, ro"

# read in shapefile
shapefile = shapefiles[myname]

mycity = ua.UAShapeFile(shapefile, name=myname)

12292 polygons | 18 land use classes


In [8]:
L = mycity.compute_spatial_extent()
print "Spatial extent: %2.2f km." % L

Spatial extent: 46.47 km.


In [9]:
classified_pct = mycity.compute_classified_area()
classified_pct

ITEM
Agricultural + Semi-natural areas + Wetlands                    0.303130
Airports                                                        0.004591
Construction sites                                              0.003278
Continuous Urban Fabric (S.L. > 80%)                            0.064996
Discontinuous Dense Urban Fabric (S.L. : 50% -  80%)            0.012679
Discontinuous Low Density Urban Fabric (S.L. : 10% - 30%)       0.000010
Discontinuous Medium Density Urban Fabric (S.L. : 30% - 50%)    0.000146
Fast transit roads and associated land                          0.000441
Forests                                                         0.055171
Green urban areas                                               0.006503
Industrial, commercial, public, military and private units      0.042433
Isolated Structures                                             0.000751
Land without current use                                        0.002504
Mineral extraction and dump sites             

In [10]:
lonmin, latmin, lonmax, latmax = mycity._bounds
city_center = ((latmin+latmax)/2.0, (lonmin+lonmax)/2.0)
mycity_crop = mycity.crop_centered_window(city_center, (25,25))
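
The cell above approximates the city center by the midpoint of the bounding box and crops a 25 km window around it. The internals of `crop_centered_window` are not shown here, so the following is only a sketch of how such a window and its grid cell centers could be computed in lon/lat. It assumes a spherical approximation of about 111.32 km per degree of latitude; the helper names and the example center are illustrative.

```python
import math

KM_PER_DEG_LAT = 111.32  # approximate value, an assumption of this sketch

def window_bounds(center, window_km):
    """center: (lat, lon) in degrees; window_km: (width_km, height_km).

    Returns (lonmin, latmin, lonmax, latmax) of the centered window.
    """
    lat, lon = center
    w_km, h_km = window_km
    dlat = (h_km / 2.0) / KM_PER_DEG_LAT
    # a degree of longitude shrinks with cos(latitude)
    dlon = (w_km / 2.0) / (KM_PER_DEG_LAT * math.cos(math.radians(lat)))
    return (lon - dlon, lat - dlat, lon + dlon, lat + dlat)

def grid_cell_centers(bounds, grid_size):
    """Return (lat, lon) centers of a regular grid laid over the bounds."""
    lonmin, latmin, lonmax, latmax = bounds
    n_rows, n_cols = grid_size
    dlat = (latmax - latmin) / float(n_rows)
    dlon = (lonmax - lonmin) / float(n_cols)
    return [(latmin + (i + 0.5) * dlat, lonmin + (j + 0.5) * dlon)
            for i in range(n_rows) for j in range(n_cols)]

center = (44.43, 26.10)  # illustrative (lat, lon), roughly Bucharest
bounds = window_bounds(center, (25, 25))
centers = grid_cell_centers(bounds, (100, 100))
print(len(centers))  # 10000
```

With a 25 km window and a 100 x 100 grid, each cell is 250 m on a side, which is close to the footprint of a 224-pixel image at 1.19 m per pixel (about 267 m).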

In [41]:
raster, locations, cur_classes = mycity_crop.extract_class_raster(grid_size=grid_size)
raster.shape

(100, 100, 18)

In [66]:
locations_train = mycity.generate_sampling_locations()
locations_train.head()

                                                      lon        lat
ITEM
Agricultural + Semi-natural areas + Wetlands 0  26.063752  44.329799
Agricultural + Semi-natural areas + Wetlands 1  26.095589  44.354303
Agricultural + Semi-natural areas + Wetlands 2  26.161409  44.272378
Agricultural + Semi-natural areas + Wetlands 3  26.136545  44.474361
Agricultural + Semi-natural areas + Wetlands 4  26.125912  44.407208


In [69]:
%%px --local

# for constructing ground truth raster grids

grid_cell = 100
grid_size = (grid_cell, grid_cell)
window_km_vec = [25, 30, 50]

# for generating sampling locations

img_area = (224 * 1.19/ 1000)**2 # in km^2, at zoom level 17
thresh_frac = 0.25 # at least a fraction <thresh_frac> of the image should be covered by a polygon of a given class
thresh_area = img_area * thresh_frac  
# print "Threshold area: %2.2f km^2"%thresh_area

n_classes = len(classes)

N_SAMPLES_PER_CITY  = 25000
N_SAMPLES_PER_CLASS = N_SAMPLES_PER_CITY / n_classes
MAX_SAMPLES_PER_POLY= 50
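
The constants above can be checked with a little arithmetic. The sketch below assumes the 1.19 m/pixel figure is the standard Web Mercator ground resolution for 256-pixel tiles at zoom 17 at the equator; the notebook does not state this, and the actual resolution shrinks with the cosine of the latitude. The function name `meters_per_pixel` is hypothetical.

```python
import math

def meters_per_pixel(zoom, lat_deg=0.0):
    """Web Mercator ground resolution for 256-pixel tiles, in meters per pixel."""
    return 156543.03392 * math.cos(math.radians(lat_deg)) / (2 ** zoom)

res_equator = meters_per_pixel(17)            # about 1.19 m/pixel
res_at_44deg = meters_per_pixel(17, 44.43)    # smaller at the latitude of Bucharest

side_km = 224 * 1.19 / 1000                   # image side: about 0.267 km
img_area = side_km ** 2                       # about 0.071 km^2
thresh_area = img_area * 0.25                 # about 0.018 km^2

# 25000 samples per city spread over the 20 classes listed earlier
n_samples_per_class = 25000 // 20             # 1250

print("%.3f m/px, %.4f km^2, %.4f km^2" % (res_equator, img_area, thresh_area))
```

So a polygon needs an area of roughly 0.018 km^2 to be eligible for sampling under these settings.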

In [83]:
def fn_generate_stats(shapefile):
    cityname = "%s, %s" % fn_process_path(shapefile)
    
    # weird issues with several cities, skip
#     if cityname in ["limoges", "linz"]:
#         return "Error for city %s"%cityname
    
    print "Processing %s"%cityname
    
    savedir = "%s/%s/"%(outPath, cityname)
    if not os.path.exists(savedir):
        os.makedirs(savedir)

    if len([x for x in os.listdir(savedir) if 'raster' in x])==3:
        return "Already processed!"
   
    mycity = ua.UAShapeFile(shapefile, name=cityname)
    
    if mycity._gdf is None:
        return "Error reading shapefile %s"%shapefile
     
    # approximate city center by the center of the bounding box of the shapefile
    lonmin, latmin, lonmax, latmax = mycity._bounds
    city_center = ((latmin+latmax)/2.0, (lonmin+lonmax)/2.0)

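    # there's some weird issue with the shapefile for Graz:
    # lat and lon appear to be inverted, so swap the coordinates of each polygon
    # (cityname has the form "city, country", so match on the city prefix)
    if cityname.startswith("graz"): #not bounds_gdf.contains(Point(city_center[::-1])):
        from shapely.geometry import Polygon
        gdf = mycity._gdf
        gdf['geometry'] = gdf['geometry'].apply(\
                lambda p: Polygon(list(zip(p.exterior.coords.xy[1], p.exterior.coords.xy[0]))))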
    
    # compute spatial extent of city and fraction of land classified
    L = mycity.compute_spatial_extent()
    frac_classified = mycity.compute_classified_area()
    frac_classified['pct land classified'] = frac_classified.sum()
    frac_classified['spatial extent'] = L
    frac_classified.to_csv("%s/basic_stats.csv"%savedir)
        
    for W in window_km_vec:
        # crop a window of width WxW centered at the city center
        window = (W, W)
        mycity_crop = mycity.crop_centered_window(city_center, window)
                    
        # compute ground truth raster for given window size
        raster, locations_grid, cur_classes = mycity_crop.extract_class_raster(grid_size=grid_size)
        myraster = np.zeros(grid_size + (len(classes),))
        idx = [class2label[c] for k,c in enumerate(cur_classes)]
        myraster[:,:,idx] = raster
        
        # extract sampling locations
        locations_train = mycity.generate_sampling_locations(thresh_area=thresh_area, \
                                                             n_samples_per_class=N_SAMPLES_PER_CLASS,\
                                                             max_samples=MAX_SAMPLES_PER_POLY)
        
        # save data
        # save data (W is the current window size in km)
        np.savez_compressed("%s/ground_truth_class_raster_%d.npz"%(savedir,W), myraster, classes)
        locations_grid.to_csv("%s/sample_locations_raster_%d.csv"%(savedir,W), index=False)
        locations_train.to_csv("%s/additional_sample_locations%d.csv"%(savedir,W), index=False)
        

In [None]:
print shapefile
fn_generate_stats(shapefile)

/home/data/urban-atlas/shapefiles/6/shape/ro001l_bucuresti.shp
Processing bucuresti, ro
12292 polygons | 18 land use classes


In [None]:
res = lbv.map_async(fn_generate_stats, shapefiles.values())

In [None]:
res.progress

In [None]:
# res.result()

# Statistics on all ~300 cities in Urban Atlas

Use computation results from the separate notebook "Urban Atlas - generate sampling locations"

### Statistics on classified area and spatial extent of the city

* What is the total area covered by the land use classification? 
* How does it break down into the different classes?
* What is the spatial extent of the city?

In [None]:
datapath = "/home/data/urban-atlas/extracted-data/"

stats_files = glob.glob("%s/*/basic_stats.csv"%datapath)

basic_stats_df = pd.concat([pd.read_csv(f) for f in stats_files]).drop("Unnamed: 0", 1)
basic_stats_df.index = [f.split("/")[-2] for f in stats_files]
basic_stats_df.head()

In [None]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 1})

plt.figure(figsize=(8,4))
g = sns.jointplot(x="spatial extent", y="pct land classified", data=basic_stats_df, \
                  kind="kde", color="k", ylim=(0.2,0.9), xlim=(1,120), stat_func=None)
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("city spatial extent $L$ [km]", "land classified $\%$");

# Generate locations to extract imagery at

Our sampling strategy has the following goals:
* ensure that a uniform $100 \times 100$ "main grid" covering the $25km \times 25km$ window is completely sampled (except where there are no ground truth polygons). We generate samples in this grid first, and assign to the image sampled in each grid cell the class of the polygon that has the maximum intersection area with that cell.
* ensure that the resulting dataset is balanced with respect to the land use classes. The difficulty is that the classes are highly imbalanced among the polygons in the dataset (e.g., many more polygons are agricultural land and isolated structures than airports).
* sample additional polygons apart from the ones in the initial grid, considering only polygons above a certain threshold area (so that the sampled images contain a large enough area of the class they represent).
* to ensure a closer match between labels and sampled images, sample more images from polygons with larger areas.
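
The last three goals can be sketched as a per-polygon sample allocation. This is not the code behind `generate_sampling_locations`, whose internals are not shown in this notebook; it is one plausible reading of the strategy, with a hypothetical function name and made-up polygons. Each class gets the same target number of samples, polygons below the area threshold are dropped, and the remaining polygons receive samples in proportion to their area, capped per polygon.

```python
def allocate_samples(polygons, thresh_area, n_samples_per_class, max_samples_per_poly):
    """polygons: list of (poly_id, class_name, area_km2).

    Returns {poly_id: number of images to sample from that polygon}.
    """
    # keep only polygons large enough to dominate an image, grouped by class
    by_class = {}
    for pid, cls, area in polygons:
        if area >= thresh_area:
            by_class.setdefault(cls, []).append((pid, area))

    alloc = {}
    for cls, polys in by_class.items():
        total = sum(area for _, area in polys)
        for pid, area in polys:
            # proportional to area, at least 1, at most the per-polygon cap
            n = int(round(n_samples_per_class * area / total))
            alloc[pid] = max(1, min(n, max_samples_per_poly))
    return alloc

# Illustrative polygons (ids and areas are made up).
polygons = [
    ("a", "Forests", 1.0),
    ("b", "Forests", 3.0),
    ("c", "Airports", 0.5),
    ("d", "Airports", 0.001),   # below the threshold, never sampled
]
alloc = allocate_samples(polygons, thresh_area=0.0178,
                         n_samples_per_class=100, max_samples_per_poly=50)
print(sorted(alloc.items()))  # [('a', 25), ('b', 50), ('c', 50)]
```

Note that the per-polygon cap means a class with few large polygons can end up with fewer samples than its target, so perfect balance is not guaranteed under this scheme.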