
Remote sensing and applying machine learning techniques to generate a housing register of the resolution for the Small Area Population administrative boundaries. The method applied relied upon supervised classification of satellite imagery from 2016 generating a relationship model to the latest official Irish Census. The relationship model was a…

codefromjames/GGE_Urban

GGE_Urban

The structure of the implementation followed a linear process (figure III.1). The initial stage required acquisition of freely available raw data. The second stage required conversion into a spatial context for amalgamation and analysis. Finally, a linear model trained on Census 2016, based on classification areas from satellite imagery, was applied to the 2017 test spatial classification to predict housing and population.

Fig. III.1. Flowchart of methodology with initial ingestion of four data formats: raster (green), vector (purple), string (pink) and numerical (yellow), generating internal models (blue) for the prediction of housing and population in 2017.

A. Data

The primary components were raster satellite imagery from the Sentinel-2 Copernicus programme [14], the vector layers of CSO Small Area Populations (SAPs) administration boundaries from 2011 [15] and the numerical CSO Census of 2016 [16]. Secondary data, relating to training sets, were created using data from the Residential Property Price Register (RPPR) [17] from 2012 to 2015 and the Co-ordinated Information on the Environment (Corine) Land Cover 2012 [18]. All spatial data were converted to the World Geodetic System 1984 (EPSG:4326) coordinate reference system for integration with the satellite imagery.

B. Computational Environment

All spatial data were manipulated through Quantum Geographical Information Systems (QGIS) [19] and analysed using Google Earth Engine (GEE) [20]. GEE facilitated co-location of online databases with cloud-based computing executed through web browser scripts. All numerical relationship models were generated within the Python environment [21].

C. Data Preparation

  1. CSO Census 2016 & SAPs 2011: The SAP polygons were not officially available for the Census 2016 numerical dataset. The coded SAPs of Census 2016 (18,642), which included additional new areas, were reaggregated for the variables of total population and households back into the 2011 SAP polygons (18,488). The SAPs contained Nomenclature of Territorial Units for Statistics (NUTS) codes and informal localised geographical regions. This provided a hierarchical spatial dimension for selecting SAPs at a granular level for Census 2016. The SAP polygon centroids were calculated as latitude and longitude. This engineered feature was incorporated to provide a geographical dimension to the linear model.
  2. Training Set: The RPPR string addresses (147,530) were geocoded to latitude and longitude using the Google Maps API. Observations without rooftop accuracy were discarded, resulting in a training set of 91,274 points. The RPPR vector points were spatially joined to SAP attributes for unified processing at selected county levels (NUTS3). The Corine Land Cover (CLC) of Ireland contained 18,903 polygons. These were converted from polygons to one-point random samples. The resulting CLC vector points were spatially joined to the SAP attributes, generating 16,874 usable points for training. The CLC has five macro classes: artificial surfaces, agricultural areas, natural areas, wetlands and water bodies. These classes have several subdivisions, which were regrouped into the five macro classes. The continuous artificial surfaces, which contained urban fabric or green urban areas, were replaced with the RPPR rooftop data to form a unified training set. This reduced spectral mixing of dwellings surrounded by vegetation.
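The reaggregation and centroid steps described in item 1 can be sketched in plain Python. This is a minimal illustration rather than the repository's code: the lookup table `sap2016_to_2011`, the SAP codes and the counts are hypothetical, and the centroid uses the planar shoelace formula directly on longitude/latitude pairs, an assumption that is only reasonable for small polygons.

```python
from collections import defaultdict

# Hypothetical Census 2016 rows keyed by 2016 SAP code.
census_2016 = {
    "017001001": {"population": 300, "households": 110},
    "017001002/01": {"population": 350, "households": 130},
    "017001002/02": {"population": 300, "households": 110},
}
# Hypothetical lookup: new 2016 SAPs -> the 2011 SAP polygon they were split from.
sap2016_to_2011 = {"017001002/01": "017001002", "017001002/02": "017001002"}


def reaggregate(census, lookup):
    """Sum population and households of 2016 SAPs into their 2011 parent SAP."""
    totals = defaultdict(lambda: {"population": 0, "households": 0})
    for code_2016, row in census.items():
        parent = lookup.get(code_2016, code_2016)  # unchanged SAPs map to themselves
        totals[parent]["population"] += row["population"]
        totals[parent]["households"] += row["households"]
    return dict(totals)


def polygon_centroid(ring):
    """Planar centroid of a simple polygon given as [(lon, lat), ...]."""
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(ring, ring[1:] + ring[:1]):
        cross = x0 * y1 - x1 * y0
        area2 += cross                # accumulates twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


sap_2011_totals = reaggregate(census_2016, sap2016_to_2011)
square = [(-6.3, 53.3), (-6.2, 53.3), (-6.2, 53.4), (-6.3, 53.4)]
lon, lat = polygon_centroid(square)
```

The centroid longitude and latitude would then be attached to each SAP row as two extra numerical features.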

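The assembly of the unified training set in item 2 might look like the sketch below. It rests on several assumptions that are not stated in the text: that rooftop accuracy corresponds to the `location_type` value `ROOFTOP` in the Google Geocoding API response, that the macro class can be read from the first digit of the CLC code, and that every CLC point in the artificial class is dropped in favour of the RPPR points. The record layout and sample values are hypothetical.

```python
# First digit of a CLC code -> macro class used for training.
MACRO_CLASS = {
    "1": "artificial",
    "2": "agricultural",
    "3": "natural",
    "4": "wetland",
    "5": "water",
}

# Hypothetical geocoder output for RPPR addresses.
geocoded = [
    {"lat": 53.35, "lon": -6.26, "location_type": "ROOFTOP"},
    {"lat": 53.30, "lon": -6.20, "location_type": "APPROXIMATE"},
    {"lat": 51.90, "lon": -8.47, "location_type": "ROOFTOP"},
]
# Hypothetical CLC random sample points.
clc_samples = [
    {"lat": 53.34, "lon": -6.27, "clc_code": 111},  # continuous urban fabric
    {"lat": 53.10, "lon": -7.50, "clc_code": 231},  # pastures
    {"lat": 53.20, "lon": -7.90, "clc_code": 412},  # peat bogs
    {"lat": 53.00, "lon": -8.30, "clc_code": 512},  # water bodies
]


def rooftop_points(records):
    """Keep only rooftop-accurate geocodes and label them as artificial."""
    return [
        {"lat": r["lat"], "lon": r["lon"], "label": "artificial"}
        for r in records
        if r["location_type"] == "ROOFTOP"
    ]


def clc_points(samples):
    """Recode CLC points to macro classes, dropping the artificial class."""
    points = []
    for s in samples:
        label = MACRO_CLASS[str(s["clc_code"])[0]]
        if label != "artificial":  # replaced by the RPPR rooftop points
            points.append({"lat": s["lat"], "lon": s["lon"], "label": label})
    return points


training_set = rooftop_points(geocoded) + clc_points(clc_samples)
```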
D. Satellite Imagery

The satellite imagery of Sentinel-2 was selected due to the accessibility of high temporal resolution (5 days) provided by two satellites. This increased the likelihood of low meteorological interference. Combined with a spatial resolution of 10 to 20 meters across 10 bands of spectral resolution (98~242nm), this provided the capability to identify features within the Irish landscape [14].

Figure III.2: Composite 20m resolution cloud-free image of Ireland mosaicked from 500 tiles over a three-month period, overlaid on a generic map. The image contains 10 bands of information. The image was rendered using B8, B4 and B3 as red, blue and green respectively, highlighting vegetation. Areas above 300m were excluded. Croke Park at 20m resolution was included to show the extent of raster data present.
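The cloud-free compositing and the exclusion of areas above 300m mentioned in the caption of Figure III.2 can be illustrated on toy grids. This is a plain-Python sketch and not Earth Engine code: `None` marks a cloud or masked pixel, and the grids and helper names are hypothetical.

```python
def composite(images):
    """Fill cloud pixels (None) of the first image from later images, in order."""
    rows, cols = len(images[0]), len(images[0][0])
    out = [row[:] for row in images[0]]
    for img in images[1:]:
        for r in range(rows):
            for c in range(cols):
                if out[r][c] is None:  # still cloudy, try the next image
                    out[r][c] = img[r][c]
    return out


def mask_high_ground(image, dem, max_elevation_m=300):
    """Set pixels whose elevation exceeds the threshold to None."""
    return [
        [None if dem[r][c] > max_elevation_m else image[r][c]
         for c in range(len(image[0]))]
        for r in range(len(image))
    ]


# Toy single-band images ordered from lowest cloud cover.
first = [[1, None], [None, 4]]
second = [[9, 2], [None, 9]]
third = [[9, 9], [3, 9]]
dem = [[10, 350], [299, 300]]  # elevation in metres

cloud_free = composite([first, second, third])
masked = mask_high_ground(cloud_free, dem)
```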

  1. Composite Images: Cloud interference was still present, requiring a composite image of Ireland to be constructed over a specific temporal duration (figure III.2). The composite replaces cloud pixels in the first low-cloud image (<5% cloud) with cloud-free pixels from the next available image [22]. The period specified was within the months of low vegetation growth for both 2016 and 2017. This reduced spectral obstruction by tree canopy, and provided temporal alignment with the Census collection date and continuity between the yearly classification models.
  2. Digital Elevation Model: Preliminary analysis identified mountainous regions as sources of classification error. This was due to exposed rock being classified as households and the increased presence of cloud interference. The Shuttle Radar Topography Mission generated a near-global digital elevation model (DEM) of 30m resolution [23]. The DEM was used to remove mountainous regions above 300m from the composite images (section III-3). The 300m threshold was based on the highest recognized settlement in Ireland, Glencullen of Wicklow, situated at 276m [23].

E. Spatial Classification Models

  1. Supervised Classification: The composite images were partitioned based on NUTS3 regions of county and city districts. The five-class vector point training set was applied based on the matching geographical regions (section III-2). The composite image was classified based on ten bands (B2, B3, B4, B5, B6, B7, B8, B8a, B11 & B12), which have previously been applied as ratios in various combinations to segregate water, soil, rock and vegetation [24]. The composite image was classified at 20m spatial resolution.
  2. Classifier: The optimum classifying method selected was random forest (RF) [8] for city and county scale. The RF classifier marginally outperformed Classification and Regression Tree (CART) [25] and was a significant improvement upon the support vector machine (SVM). The RF was configured with 50 decision trees. The performance was assessed using confusion matrices [26]. The trained classifier was applied to the composite imagery for 2016 and 2017. The output was two raster images, one for 2016 and one for 2017, each divided into the five classes. The classified images were validated for accuracy with an error matrix, using 70% of the training set to train the classifier and predict the class of the remaining points within the images.
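The accuracy assessment described for the classifier (a 70/30 split of the training points followed by an error matrix) can be sketched independently of the classifier itself. The snippet below is a plain-Python illustration with hypothetical labels; the predicted labels would come from the trained RF, which is not reproduced here, and the fixed `seed` is an assumption added for repeatability.

```python
import random

CLASSES = ["artificial", "agricultural", "natural", "wetland", "water"]


def split_70_30(points, seed=0):
    """Shuffle the points and return (70% for training, 30% for validation)."""
    shuffled = points[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def error_matrix(actual, predicted, classes=CLASSES):
    """Rows are reference classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def overall_accuracy(matrix):
    """Share of validation points on the diagonal of the error matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


train, validation = split_70_30(list(range(10)))
# Hypothetical reference and predicted labels for three validation points.
actual = ["water", "water", "natural"]
predicted = ["water", "natural", "natural"]
matrix = error_matrix(actual, predicted)
accuracy = overall_accuracy(matrix)
```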
F. SAP Pixel Composition

The pixel area of each of the five classes was expressed in m². The areas were summed within each individual SAP and converted to km². This reduced the spatial complexity to numerical values, allowing comparison with the numerical Census data. All exported classified SAPs for counties and cities were unified into a training set for 2016 and a testing set for 2017.
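As a rough illustration of this step, the sketch below counts classified pixels per SAP and converts the counts to km². It assumes each classified pixel covers 20m x 20m (from the 20m classification resolution stated earlier); the SAP identifiers and the flat `(sap_id, class_label)` input format are hypothetical.

```python
from collections import Counter, defaultdict

PIXEL_AREA_M2 = 20 * 20  # assumption: one classified pixel covers 20 m x 20 m


def class_area_km2(pixels):
    """pixels: iterable of (sap_id, class_label), one entry per classified pixel."""
    counts = defaultdict(Counter)
    for sap_id, label in pixels:
        counts[sap_id][label] += 1
    return {
        sap_id: {label: n * PIXEL_AREA_M2 / 1e6 for label, n in per_class.items()}
        for sap_id, per_class in counts.items()
    }


# Hypothetical classified pixels for two SAPs.
pixels = (
    [("sap_a", "artificial")] * 2500
    + [("sap_a", "agricultural")] * 5000
    + [("sap_b", "water")] * 250
)
areas = class_area_km2(pixels)
```

Each SAP then becomes a single row of class areas that can be joined to the Census counts.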

G. Relationship Models for Houses & Population

  1. Data Preparation: The dataset was divided along city district and county boundaries into rural and urban datasets. This reduced computational complexity and grouped correlations of similar scale together. The division arose from the small, regular shapes of city SAPs contrasting with the large, irregular county SAPs that follow natural features. The urban districts consisted of Cork, Dublin, Galway, Limerick and Waterford cities with Fingal, South Dublin and Dún Laoghaire–Rathdown county councils. The remaining 26 counties were grouped as rural areas. This segregated the SAPs into 12,400 and 6,088 units for rural and urban areas respectively.
    The SAPs were designed to be comparable between regions: areas of high population density are smaller and remote regions are larger, so that each contains a similarly sized population. This engineered uniformity, regardless of area size, reduced the correlation between Census households and the satellite classification within each SAP. This was reversed by using a ratio of Census households to SAP spatial characteristics with the formula:

             (1).
    

Formula (1) was repeated for population prediction by substituting total population for households from the Census. The distributions of the ratios for housing and population were positively skewed, requiring a logarithmic transformation to normalize them. Categorical variables of county and electoral district were retained to provide geographical clustering, due to the spatial pattern of settlements relating to Central Place Theory [27]. The categorical variables were one-hot encoded to allow for unassigned clustering. The independent variables were standardized to a common scale to prevent larger numerical values, such as arable land and area, from distorting the model.
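The three preprocessing steps named above (logarithmic transformation, one-hot encoding and standardization) can be sketched with the standard library. Formula (1) is not reproduced in this text, so the ratio used here, households divided by SAP area in km², is an assumption made purely for illustration, as are the helper names and sample values.

```python
import math
from statistics import mean, pstdev


def log_ratio(households, area_km2):
    """Natural log of the (assumed) households-per-km2 ratio; needs households > 0."""
    return math.log(households / area_km2)


def one_hot(values):
    """Return one indicator column per distinct category, plus the category order."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return rows, categories


def standardize(column):
    """Rescale a numerical column to zero mean and unit standard deviation."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma if sigma else 0.0 for x in column]


target = [log_ratio(h, a) for h, a in [(100, 1.0), (80, 20.0)]]
encoded, categories = one_hot(["Cork", "Dublin", "Cork"])
scaled = standardize([1.0, 2.0, 3.0])
```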

  2. Regression Models: The relationship models were generated using modules from sklearn [28]. The models were assessed by r squared (r²) and root mean squared error (RMSE), for accuracy on the 2016 training set and validation on the 2017 testing set. The two methods implemented for comparison were elastic net regularization (ENR) [29] and gradient boosting regressor (GBR) [30]. The two models were combined as an ensemble in most instances for rural and urban predictions. The GBR for urban housing was used as a standalone model. The overall accuracy of the training models was assessed by the mean of r² and the standard deviation (s.d.) across five folds of cross validation.
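The assessment measures named above can be written out in a few lines. This is a minimal sketch, not the repository's code: how the two models were combined is not stated, so the unweighted mean in `ensemble` is an assumption, as is the use of the sample standard deviation across folds.

```python
from math import sqrt
from statistics import mean, stdev


def r2_score(actual, predicted):
    """Coefficient of determination: 1 - residual SS / total SS."""
    mu = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mu) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def rmse(actual, predicted):
    """Root mean squared error."""
    return sqrt(mean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


def ensemble(pred_a, pred_b):
    """Assumption: the ensemble is an unweighted mean of the two models."""
    return [(a + b) / 2 for a, b in zip(pred_a, pred_b)]


def summarise_folds(fold_scores):
    """Mean and standard deviation of a score across cross-validation folds."""
    return mean(fold_scores), stdev(fold_scores)


combined = ensemble([1.0, 3.0], [3.0, 5.0])
fold_mean, fold_sd = summarise_folds([0.80, 0.85, 0.90, 0.85, 0.85])
```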

a) Elastic Net Regularization

The ENR combines the Lasso (L1) and Ridge (L2) penalties, weighted by a mixing ratio, to handle collinearity and regularization [29]. The ENR models were fitted along a regularization path over supplied candidate values for alpha and the mixing ratio, with a maximum of 5000 iterations. The most appropriate model was selected by cross validation.
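One way to express this in sklearn is `ElasticNetCV`, which fits along a path of alpha values for each candidate mixing ratio and keeps the combination with the best cross-validated error. The sketch below is an illustration under assumptions: the candidate values for `alphas` and `l1_ratio` are hypothetical (the text does not list them), the data are synthetic, and five folds are used to match the cross validation described earlier.

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic, already standardized features standing in for the SAP variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 1.5 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

candidate_alphas = [0.001, 0.01, 0.1, 1.0]  # hypothetical penalty strengths
candidate_ratios = [0.1, 0.5, 0.9, 1.0]     # hypothetical L1/L2 mixing ratios

enr = ElasticNetCV(
    alphas=candidate_alphas,
    l1_ratio=candidate_ratios,
    max_iter=5000,  # iteration cap stated in the text
    cv=5,           # five-fold cross validation
)
enr.fit(X, y)

# Hyperparameters chosen by cross validation.
print(enr.alpha_, enr.l1_ratio_)
```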

b) Gradient Boosting Regressor

The GBR builds an additive model in a forward stage-wise manner: at each stage a weak regression tree is added to reduce the loss of the current ensemble. The model can be optimized for any differentiable loss function [31]. The parameters selected were 3000 regression trees, a learning rate of 0.05, a maximum depth of 3, a minimum of 15 samples per leaf, a minimum of 10 samples per split, a maximum number of features derived by the square-root rule, and the robust Huber loss function, which combines least squares regression and least absolute deviation. The GBR model terminated once the gradient descent boosting converged.

  3. Variables: Variables with low Pearson's correlation (<0.2) in an urban context for housing prediction, consisting of the bog, water and vegetation classifications, were removed from the model. These variables were also removed from the rural housing model due to low correlation.

H. Housing & Population

The predicted ratios of housing and population were converted back and rounded to natural numbers using the inverse of ratio formula (1).

             (2).

The data were verified against current annual domestic housing estimates and population projections for 2017, by subtraction from Census 2016 or by total aggregation respectively. The current annual domestic housing data were obtained from the Department of Housing, based on ESB monthly connections from March 2016 to October 2017 [32], and from the Sustainable Energy Authority of Ireland record of domestic Building Energy Rating (BER) certificates issued for the same period. The population projections were accessed and interpolated from the United Nations [33] for December 2017 and the CSO for April 2017.
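The back-conversion and the housing check can be sketched as follows. Formulas (1) and (2) are not reproduced in this text, so the sketch assumes the ratio is households per km² of SAP area and that the model predicts its natural logarithm; the function names and figures are hypothetical.

```python
import math


def households_from_log_ratio(log_ratio, area_km2):
    """Invert the assumed ratio: exp(log ratio) x area, rounded to a natural number."""
    return max(0, round(math.exp(log_ratio) * area_km2))


def new_dwellings(predicted_2017, census_2016):
    """Predicted 2017 households minus Census 2016 households, summed over SAPs."""
    return sum(predicted_2017.values()) - sum(census_2016.values())


# Hypothetical SAPs with a predicted log ratio and an area in km2.
predictions = {"sap_a": (math.log(120.0), 0.5), "sap_b": (math.log(15.0), 3.0)}
predicted_2017 = {
    sap: households_from_log_ratio(log_r, area)
    for sap, (log_r, area) in predictions.items()
}
census_2016 = {"sap_a": 58, "sap_b": 44}
added = new_dwellings(predicted_2017, census_2016)
```

The resulting difference is the figure that would be compared with the connection and certificate counts for the same period.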

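For reference, the GBR settings listed under b) Gradient Boosting Regressor map onto sklearn's `GradientBoostingRegressor` roughly as below. This is a sketch on synthetic data rather than the repository's configuration: reading the maximum features setting as `max_features="sqrt"` is an interpretation, and `random_state` is added only to make the example repeatable.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic features standing in for the standardized SAP variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.1, size=150)

gbr = GradientBoostingRegressor(
    n_estimators=3000,     # regression trees
    learning_rate=0.05,
    max_depth=3,
    min_samples_leaf=15,
    min_samples_split=10,
    max_features="sqrt",   # assumed reading of the feature setting
    loss="huber",          # blend of least squares and least absolute deviation
    random_state=0,        # assumption, for repeatability only
)
gbr.fit(X, y)
predicted = gbr.predict(X)
```

In sklearn, boosting stops before `n_estimators` only when `n_iter_no_change` is set; whether the original run used that option is not stated.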