
# Urban Atlas - process shapefiles to compute stats and extract sampling locations

This notebook processes the collection of GIS vector data (shapefiles) on land use classes from the Urban Atlas dataset. The tasks we are interested in are:

* compute basic statistics on each shapefile, such as the spatial extent of the city and the amount of land classified within each city
* construct ground truth test grids (defined as a $25km \times 25km$ square window around the city center)
* define locations to sample images, both for testing (each cell of the ground truth grid) and for training (sampled appropriately from all available data)
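As a rough illustration of the test grid geometry, the sketch below computes the latitude/longitude bounds of a square window around a city center and the centers of its grid cells. It is not the notebook's own implementation: `window_bounds` and `cell_centers` are hypothetical helpers, the Earth is treated as a sphere, and the center coordinates are only an approximate location for Bucharest. With a $100 \times 100$ grid over a 25 km window, each cell is about 250 m on a side.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (spherical approximation, an assumption)

def window_bounds(center, window_km):
    """Return (latmin, lonmin, latmax, lonmax) of a square window.

    center is (lat, lon) in degrees, window_km is the side length in km.
    """
    lat, lon = center
    half = window_km / 2.0
    dlat = math.degrees(half / EARTH_RADIUS_KM)
    # a degree of longitude shrinks with the cosine of the latitude
    dlon = math.degrees(half / (EARTH_RADIUS_KM * math.cos(math.radians(lat))))
    return (lat - dlat, lon - dlon, lat + dlat, lon + dlon)

def cell_centers(bounds, grid_size):
    """Return (row, col, lat, lon) for the center of every grid cell."""
    latmin, lonmin, latmax, lonmax = bounds
    n_rows, n_cols = grid_size
    dlat = (latmax - latmin) / n_rows
    dlon = (lonmax - lonmin) / n_cols
    return [(i, j, latmin + (i + 0.5) * dlat, lonmin + (j + 0.5) * dlon)
            for i in range(n_rows) for j in range(n_cols)]

# approximate city center, for illustration only
bounds = window_bounds((44.43, 26.10), 25)
centers = cell_centers(bounds, (100, 100))
```

Each entry of `centers` is a candidate location at which one test image could be sampled.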

In this analysis we work directly with vector data (polygons) for all the above tasks. This is a faster way to process the data for our specific use case of whole-image classification, as opposed to the more accurate approach of rasterizing the data first, then computing classification masks (which is more suited for image segmentation).
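To make the vector-based approach concrete, here is a minimal sketch of how a whole-image label can be derived directly from polygons: a grid cell takes the class of the single polygon that overlaps it most. For simplicity the polygons are assumed to be axis-aligned rectangles `(xmin, ymin, xmax, ymax)`, and `overlap_area` and `label_cell` are hypothetical helpers; real Urban Atlas polygons would need a geometry library to compute intersections.

```python
def overlap_area(a, b):
    """Intersection area of two axis-aligned rectangles (xmin, ymin, xmax, ymax)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0

def label_cell(cell, polygons):
    """Return the class of the polygon with the largest overlap, or None.

    polygons is a list of (class_name, rectangle) pairs.
    """
    best_label, best_area = None, 0.0
    for label, rect in polygons:
        area = overlap_area(cell, rect)
        if area > best_area:
            best_label, best_area = label, area
    return best_label

# toy data: two polygons share the cell, a third lies far away
polygons = [
    ("Forests", (0.0, 0.0, 0.3, 1.0)),
    ("Water bodies", (0.3, 0.0, 1.0, 1.0)),
    ("Airports", (5.0, 5.0, 6.0, 6.0)),
]
label = label_cell((0.0, 0.0, 1.0, 1.0), polygons)
empty = label_cell((2.0, 2.0, 3.0, 3.0), polygons)
```

Here `label` is `"Water bodies"` because that polygon covers most of the cell, and `empty` is `None` because no polygon touches the second cell. No per-pixel mask is ever built, which is what makes this cheaper than rasterizing first.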

# Setup: packages, paths etc.

In [2]:
import ipyparallel

rc = ipyparallel.Client()
all_engines = rc[:]
lbv = rc.load_balanced_view()

print len(all_engines)

48


In [9]:
%%px --local

# numeric packages
import numpy as np
import pandas as pd

# filesystem and OS
import sys, os, time
import glob

# plotting
from matplotlib import pyplot as plt
import matplotlib
%matplotlib inline

import seaborn as sns
sns.set_style("whitegrid", {'axes.grid' : False})

# compression
import gzip
import cPickle as pickle
import copy

# widgets and interaction
from ipywidgets import FloatProgress
from IPython.display import display, clear_output

import warnings
warnings.filterwarnings('ignore')

# these magics ensure that external modules that are modified are also automatically reloaded
%reload_ext autoreload
%autoreload 2

In [5]:
%%px --local

# path to shapefiles

shapefiles_path = "/home/data/urban-atlas/shapefiles/"

shapefiles = glob.glob("%s/*/*/*.shp"%shapefiles_path)
shapefiles = {" ".join(f.split("/")[-1].split("_")[1:]).replace(".shp",""):f for f in shapefiles}

# path to save data

outPath = "/home/data/urban-atlas/extracted-data"

if not os.path.exists(outPath):
    os.makedirs(outPath)
    
# classes used in the Urban Atlas dataset

classes = '''Agricultural + Semi-natural areas + Wetlands
Airports
Construction sites
Continuous Urban Fabric (S.L. > 80%)
Discontinuous Dense Urban Fabric (S.L. : 50% -  80%)
Discontinuous Low Density Urban Fabric (S.L. : 10% - 30%)
Discontinuous Medium Density Urban Fabric (S.L. : 30% - 50%)
Discontinuous Very Low Density Urban Fabric (S.L. < 10%)
Fast transit roads and associated land
Forests
Green urban areas
Industrial, commercial, public, military and private units
Isolated Structures
Land without current use
Mineral extraction and dump sites
Other roads and associated land
Port areas
Railways and associated land
Sports and leisure facilities
Water bodies'''.split("\n")

class2label = {c:i for i,c in enumerate(classes)}
label2class = {i:c for i,c in enumerate(classes)}


In [12]:
%%px --local

# satellite imagery modules

import sys
sys.path.append("/home/nbserver/urban-environments/urbanatlas/")
import urbanatlas as ua

# Construct ground truth rasters for validation

Also compute useful stats within windows of L=25,30,50km around the city center:
* percentage of polygons per class 
* percentage of classified area per class
* percentage of classified area vs total area
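The three statistics listed above can be sketched on toy data as follows. This is only one reading of the list, with each quantity expressed as a fraction: `class_stats` is a hypothetical helper, the input is assumed to be a list of `(class_name, area)` pairs for the polygons inside a window, and areas are assumed to be in the same units as the window area.

```python
from collections import defaultdict

def class_stats(polygons, window_area):
    """Per-class polygon and area shares, plus the classified share of the window.

    polygons is a list of (class_name, area) pairs inside the window.
    """
    count = defaultdict(int)
    area = defaultdict(float)
    for label, a in polygons:
        count[label] += 1
        area[label] += a
    n_polygons = float(len(polygons))
    classified_area = sum(area.values())
    per_class = {}
    for label in count:
        per_class[label] = {
            "frac_polygons": count[label] / n_polygons,
            "frac_classified_area": area[label] / classified_area,
        }
    return per_class, classified_area / float(window_area)

# toy window of area 8 containing three polygons
toy = [("Forests", 2.0), ("Forests", 1.0), ("Water bodies", 1.0)]
per_class, frac_classified = class_stats(toy, window_area=8.0)
```

In this toy case forests account for two thirds of the polygons and three quarters of the classified area, while half of the window is classified at all.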

In [19]:
myname = "bucuresti"

# read in shapefile
shapefile = shapefiles[myname]

mycity = ua.UAShapeFile(shapefile, name=myname)

12292 polygons | 18 land use classes


In [None]:
    print "Spatial extent: %2.2f km." % L
    print "Land use classified area: %2.3f km^2 (%2.2f of total area covered within bounds %2.3f km^2)"%(classified_area, frac_classified, box_area)
    
    return L, frac_classified


In [None]:
label2class

In [None]:
%%px --local

grid_cell = 100
grid_size = (grid_cell, grid_cell)
window_km_vec = [25, 30, 50]


In [None]:
def fn_generate_stats(shapefile):
    city = " ".join(shapefile.split("/")[-1].split("_")[1:]).replace(".shp","")
    
    # weird issues with several cities, skip
    if city in ["limoges", "linz"]:
        return "Error for city %s"%city
    
    print "Processing %s"%city
    
    savedir = "%s/%s/"%(outPath, city)
    if not os.path.exists(savedir):
        os.makedirs(savedir)

    if len([x for x in os.listdir(savedir) if 'raster' in x])==3:
        return "Already processed!"
   
    gdf, prj = load_shapefile(shapefile)
    if gdf is None:
        return "Error reading shapefile %s"%shapefile
        
    city_center, country_code = get_city_center(shapefile)
    lonmin, latmin, lonmax, latmax = get_bounds(gdf)
    bounds_gdf = Polygon([(lonmin,latmin), (lonmax,latmin), (lonmax,latmax), (lonmin,latmax)])

    if city_center is None:
        city_center = ((latmin+latmax)/2.0, (lonmin+lonmax)/2.0)

    # there's some weird issue with the shapefile for Graz
    # lat and lon are inverted?
    if city in ["graz"]: #not bounds_gdf.contains(Point(city_center[::-1])):
        city_center = ((latmin+latmax)/2.0, (lonmin+lonmax)/2.0)
        gdf['geometry'] = gdf['geometry'].apply(\
                lambda p: Polygon((lon,lat) \
                    for (lon,lat) in zip(p.exterior.coords.xy[1], p.exterior.coords.xy[0])))
    
    # compute spatial extent of city and fraction of land classified
    L, frac_classified = compute_stats(gdf, prj=prj)
    df = pd.DataFrame([L, frac_classified], \
                      index=["spatial extent", "pct land classified"]).T
    df.to_csv("%s/basic_stats.csv"%savedir)
        
    for window_km in window_km_vec:
        window = (window_km, window_km)
        gdf_window = filter_gdf_by_centered_window(gdf, center=city_center, window=window)
        
        # compute stats
        class_coverage_by_area = gdf_window.groupby("ITEM").apply(\
                                lambda x: x["SHAPE_AREA"].sum())/float(window[0]*window[1])
        class_coverage_by_poly = gdf_window.groupby("ITEM").apply(len) / gdf.groupby("ITEM").apply(len)
        class_coverage_by_area_classified = gdf_window.groupby("ITEM").apply(\
                                                lambda x: x['SHAPE_AREA'].sum()) / gdf_window['SHAPE_AREA'].sum()
    
        # format and save stats
        stats_df = pd.concat([class_coverage_by_area, class_coverage_by_poly, class_coverage_by_area_classified], axis=1)
        stats_df.columns = ["pct area", "pct polygons", "pct classified area"]
        stats_df['window km'] = window_km
        # reindex keeps all classes in a fixed order; classes absent from the window become NaN rows
        stats_df = stats_df.reindex(classes)
        stats_df.to_csv("%s/stats_class_window_%d.csv"%(savedir,window_km))
        
        # compute raster for given window size
        bbox = satimg.bounding_box_at_location(city_center, window)
        raster, locations_df = construct_class_raster(gdf_window, bbox, grid_size=grid_size)
        np.savez_compressed("%s/ground_truth_class_raster_%d.npz"%(savedir,window_km), raster)
        locations_df.to_csv("%s/sample_locations_raster_%d.csv"%(savedir,window_km))

In [None]:
# city_center, country_code = get_city_center(shapefile)
# lonmin, latmin, lonmax, latmax = get_bounds(gdf)
# bounds_gdf = Polygon([(lonmin,latmin), (lonmax,latmin), (lonmax,latmax), (lonmin,latmax)])
# window = (window_km_vec[0], window_km_vec[0])
# gdf_window = filter_gdf_by_centered_window(gdf, center=city_center, window=window)
# bbox = satimg.bounding_box_at_location(city_center, window)
# raster, locations_df = construct_class_raster(gdf_window, bbox, grid_size=grid_size)


In [None]:
res = lbv.map_async(fn_generate_stats, shapefiles.values())

In [None]:
res.progress

In [None]:
# res.result()

# Statistics on all ~300 cities in Urban Atlas

Use the results computed in the separate notebook "Urban Atlas - generate sampling locations".

### Statistics on classified area and spatial extent of the city

* What is the total area covered by the land use classification? 
* How does it break down into the different classes?
* What is the spatial extent of each city?

In [None]:
datapath = "/home/data/urban-atlas/extracted-data/"

stats_files = glob.glob("%s/*/basic_stats.csv"%datapath)

basic_stats_df = pd.concat([pd.read_csv(f) for f in stats_files]).drop("Unnamed: 0", axis=1)
basic_stats_df.index = [f.split("/")[-2] for f in stats_files]
basic_stats_df.head()

In [None]:
sns.set_context("notebook", font_scale=1.5, rc={"lines.linewidth": 1})

plt.figure(figsize=(8,4))
g = sns.jointplot(x="spatial extent", y="pct land classified", data=basic_stats_df, \
                  kind="kde", color="k", ylim=(0.2,0.9), xlim=(1,120), stat_func=None)
g.ax_joint.collections[0].set_alpha(0)
g.set_axis_labels("city spatial extent $L$ [km]", "land classified $\%$");

# Generate locations to extract imagery at

Our sampling strategy has the following goals:
* ensure that a uniform $100 \times 100$ "main grid" covering the $25km \times 25km$ window is completely sampled (except where there are no ground truth polygons). We generate samples in this grid first, and assign each grid cell the class of the polygon that has the maximum intersection area with that cell.
* ensure that the resulting dataset is balanced with respect to the land use classes. The difficulty is that the classes are highly imbalanced among the polygons in the dataset (e.g., many more polygons are agricultural land and isolated structures than airports).
* sample additional polygons beyond those in the initial grid, considering only polygons above a certain size threshold, so that each sampled image contains a large enough area of the class it represents.
* to improve the match between labels and sampled images, sample more images from polygons with larger areas.
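The last three goals can be illustrated with a small sketch: drop polygons below a size threshold, draw the same number of samples for every class, and pick polygons within a class with probability proportional to their area. This is not the notebook's actual sampling code. `sample_balanced` is a hypothetical helper, the threshold and sample counts are arbitrary, and drawing with replacement is an assumption.

```python
import random

def sample_balanced(polygons, n_per_class, min_area, seed=0):
    """Draw n_per_class polygon ids per class, weighted by polygon area.

    polygons is a list of (polygon_id, class_name, area) tuples.
    Polygons smaller than min_area are never sampled.
    """
    rng = random.Random(seed)
    by_class = {}
    for pid, label, area in polygons:
        if area >= min_area:
            by_class.setdefault(label, []).append((pid, area))
    samples = {}
    for label, items in by_class.items():
        total = sum(area for _, area in items)
        picks = []
        for _ in range(n_per_class):
            # walk the cumulative area until the random threshold is reached
            r = rng.uniform(0, total)
            acc = 0.0
            for pid, area in items:
                acc += area
                if r <= acc:
                    picks.append(pid)
                    break
            else:
                picks.append(items[-1][0])  # guard against float round-off
        samples[label] = picks
    return samples

# toy data: one airport, two large forests, one forest below the threshold
toy = [("a1", "Airports", 5.0), ("f1", "Forests", 1.0),
       ("f2", "Forests", 9.0), ("f3", "Forests", 0.01)]
samples = sample_balanced(toy, n_per_class=200, min_area=0.1)
```

Every class ends up with the same number of samples regardless of how many polygons it has, the tiny polygon `f3` is never selected, and the large polygon `f2` is drawn far more often than `f1`.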