# UAC Data Deduplication (28 July 2020)

Using the full, merged dataset, with the extent corrected (Delhi only) and the geometries converted from polylines to polygons, I do the following to prepare the unauthorized colonies (UAC) dataset for the spatial index:
* Remove all rows with duplicate geometries
* Keep only one polygon (row) for each combination of map number and registration number, removing every other row that shares the same combination.
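
The two steps above can be sketched on a toy table. This is only an illustration under stated assumptions: geometries are stood in for by WKT strings in a hypothetical `geometry_wkt` column, the sample values are made up, and "keep the first row" is an assumed rule, since the actual choice of which polygon to keep is made later in the notebook.

```python
import pandas as pd

# Toy stand-in for the merged table; geometries are plain WKT strings here
rows = pd.DataFrame({
    "map_no": [520, 520, 509, 509],
    "registration_no": ["570", "570", "888", "888"],
    "geometry_wkt": [
        "POLYGON ((0 0, 1 0, 1 1, 0 0))",
        "POLYGON ((0 0, 1 0, 1 1, 0 0))",  # exact duplicate geometry
        "POLYGON ((2 2, 3 2, 3 3, 2 2))",
        "POLYGON ((5 5, 6 5, 6 6, 5 5))",  # same map/registration, different shape
    ],
})

# Step 1: drop rows whose geometry repeats an earlier row
step1 = rows.drop_duplicates(subset="geometry_wkt", keep="first")

# Step 2: keep one row per (map_no, registration_no) pair
# (keeping the first row is an assumption made for this sketch)
step2 = step1.drop_duplicates(subset=["map_no", "registration_no"], keep="first")

print(len(rows), len(step1), len(step2))
```

With real geometries, the same idea would need a hashable representation of each shape (for example WKB bytes) before calling `drop_duplicates`.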

## Details of UAC Data Deduplication
* Import 4 shapefiles 
* Data exploration and pre-processing 
    * Reproject CRS to EPSG 3857
    * Look at rows/columns
    * Set variables for key column names like map column, registration column, etc. Make sure they are consistent and include the data needed for deduplicating polygons from PDF.
    * Set Index as Column 
* Check that there are no duplicate rows 
* Check that shapefile only contains polygon geometries.
* Check for duplicate geometries.
    * First look at sample duplicate geometries. How do they relate in terms of non-geometry attributes?
    * Write code/function to identify which duplicate rows to delete
    * Delete rows and check resulting GeoDataFrame length
* Run code to remove duplicate maps
    * May have to modify the code to zero-pad single-digit map numbers (e.g., 4 -> 04)
    * Check that no index to be deleted is in keep list (and vice versa)
* Generate USO_ID
    * First identify the highest USO_ID in combined NDMC+JJC dataset
    * Add USO_ID starting with some buffer (e.g., plus 100)
* Save/export file
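
Two of the steps in this outline, zero-padding map numbers and generating `USO_ID`, can be sketched as follows. The helper names `zero_pad_map_no` and `assign_uso_ids`, the example maximum ID of 5000, and the choice to start numbering exactly at "highest existing ID + buffer" are all assumptions made for illustration; the notebook's actual code may differ.

```python
import pandas as pd

def zero_pad_map_no(map_no, width=2):
    """Render a map number as text, left-padded with zeros (4 -> '04')."""
    return str(map_no).zfill(width)

def assign_uso_ids(df, highest_existing_id, buffer=100):
    """Return a copy of df with sequential USO_ID values.

    Numbering starts at highest_existing_id + buffer (an assumed convention).
    """
    out = df.copy()
    start = highest_existing_id + buffer
    out["USO_ID"] = list(range(start, start + len(out)))
    return out

# Hypothetical example: three colonies, highest existing ID assumed to be 5000
toy = pd.DataFrame({"map_no": [4, 12, 520]})
toy["map_no_padded"] = toy["map_no"].map(zero_pad_map_no)
toy_with_ids = assign_uso_ids(toy, highest_existing_id=5000)
print(toy_with_ids)
```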

In [1]:
# Import necessary modules
import pickle
import re
import importlib
import matplotlib.pyplot as plt
import pandas as pd
import geopandas as gpd
from shapely.geometry import box, Polygon, MultiPolygon, LineString, MultiLineString 
from shapely.ops import polygonize, unary_union
from pyproj import CRS
import rasterio
import uac_utils

%matplotlib inline

In [2]:
# Reload uac_utils when it gets updated
importlib.reload(uac_utils)

<module 'uac_utils' from "C:\\Users\\bwbel\\Google Drive\\slum_project\\UAC's Data Deduplication\\uac_utils.py">

## Import shapefiles

In [3]:
uac1 = gpd.read_file('ExtentCorrected_WithDuplicate_9July_Bijoy_QGIS.shp')
uac2 = gpd.read_file('UC_Missing1_Polygons.shp')
uac3 = gpd.read_file('UC_Missing2_Polygons.shp')
uac4 = gpd.read_file('UAC_part.shp')

## Data Exploration and Pre-Processing
* Check (and re-project) CRS
* Look at rows/columns
* Set variables for key column names like map column, registration column, etc.
* Set Index as Column 

### Reproject CRS to EPSG 3857

In [5]:
uac1 = uac_utils.reproject_gdf(uac1, 3857)

GeoDataFrame now has the following CRS:

PROJCRS["WGS 84 / Pseudo-Mercator",BASEGEOGCRS["WGS 84",DATUM["World Geodetic System 1984",ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],ID["EPSG",4326]],CONVERSION["Popular Visualisation Pseudo-Mercator",METHOD["Popular Visualisation Pseudo Mercator",ID["EPSG",1024]],PARAMETER["Latitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8801]],PARAMETER["Longitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8802]],PARAMETER["False easting",0,LENGTHUNIT["metre",1],ID["EPSG",8806]],PARAMETER["False northing",0,LENGTHUNIT["metre",1],ID["EPSG",8807]]],CS[Cartesian,2],AXIS["easting (X)",east,ORDER[1],LENGTHUNIT["metre",1]],AXIS["northing (Y)",north,ORDER[2],LENGTHUNIT["metre",1]],USAGE[SCOPE["unknown"],AREA["World - 85°S to 85°N"],BBOX[-85.06,-180,85.06,180]],ID["EPSG",3857]]


In [6]:
uac2 = uac_utils.reproject_gdf(uac2, 3857)

GeoDataFrame now has the following CRS:

PROJCRS["WGS 84 / Pseudo-Mercator",BASEGEOGCRS["WGS 84",DATUM["World Geodetic System 1984",ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],ID["EPSG",4326]],CONVERSION["Popular Visualisation Pseudo-Mercator",METHOD["Popular Visualisation Pseudo Mercator",ID["EPSG",1024]],PARAMETER["Latitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8801]],PARAMETER["Longitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8802]],PARAMETER["False easting",0,LENGTHUNIT["metre",1],ID["EPSG",8806]],PARAMETER["False northing",0,LENGTHUNIT["metre",1],ID["EPSG",8807]]],CS[Cartesian,2],AXIS["easting (X)",east,ORDER[1],LENGTHUNIT["metre",1]],AXIS["northing (Y)",north,ORDER[2],LENGTHUNIT["metre",1]],USAGE[SCOPE["unknown"],AREA["World - 85°S to 85°N"],BBOX[-85.06,-180,85.06,180]],ID["EPSG",3857]]


In [7]:
uac3 = uac_utils.reproject_gdf(uac3, 3857)

GeoDataFrame now has the following CRS:

PROJCRS["WGS 84 / Pseudo-Mercator",BASEGEOGCRS["WGS 84",DATUM["World Geodetic System 1984",ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],ID["EPSG",4326]],CONVERSION["Popular Visualisation Pseudo-Mercator",METHOD["Popular Visualisation Pseudo Mercator",ID["EPSG",1024]],PARAMETER["Latitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8801]],PARAMETER["Longitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8802]],PARAMETER["False easting",0,LENGTHUNIT["metre",1],ID["EPSG",8806]],PARAMETER["False northing",0,LENGTHUNIT["metre",1],ID["EPSG",8807]]],CS[Cartesian,2],AXIS["easting (X)",east,ORDER[1],LENGTHUNIT["metre",1]],AXIS["northing (Y)",north,ORDER[2],LENGTHUNIT["metre",1]],USAGE[SCOPE["unknown"],AREA["World - 85°S to 85°N"],BBOX[-85.06,-180,85.06,180]],ID["EPSG",3857]]


In [8]:
uac4 = uac_utils.reproject_gdf(uac4, 3857)

GeoDataFrame now has the following CRS:

PROJCRS["WGS 84 / Pseudo-Mercator",BASEGEOGCRS["WGS 84",DATUM["World Geodetic System 1984",ELLIPSOID["WGS 84",6378137,298.257223563,LENGTHUNIT["metre",1]]],PRIMEM["Greenwich",0,ANGLEUNIT["degree",0.0174532925199433]],ID["EPSG",4326]],CONVERSION["Popular Visualisation Pseudo-Mercator",METHOD["Popular Visualisation Pseudo Mercator",ID["EPSG",1024]],PARAMETER["Latitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8801]],PARAMETER["Longitude of natural origin",0,ANGLEUNIT["degree",0.0174532925199433],ID["EPSG",8802]],PARAMETER["False easting",0,LENGTHUNIT["metre",1],ID["EPSG",8806]],PARAMETER["False northing",0,LENGTHUNIT["metre",1],ID["EPSG",8807]]],CS[Cartesian,2],AXIS["easting (X)",east,ORDER[1],LENGTHUNIT["metre",1]],AXIS["northing (Y)",north,ORDER[2],LENGTHUNIT["metre",1]],USAGE[SCOPE["unknown"],AREA["World - 85°S to 85°N"],BBOX[-85.06,-180,85.06,180]],ID["EPSG",3857]]
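
`uac_utils.reproject_gdf` is a project helper whose source is not shown in this notebook. A minimal sketch of what such a helper might do, assuming it wraps GeoPandas' `to_crs` and prints the resulting CRS; the name `reproject_gdf_sketch` and the sample point are illustrative and not taken from `uac_utils`:

```python
import geopandas as gpd
from shapely.geometry import Point

def reproject_gdf_sketch(gdf, epsg):
    """Reproject a GeoDataFrame to the given EPSG code and report the new CRS."""
    out = gdf.to_crs(epsg=epsg)
    print("GeoDataFrame now has the following CRS:\n")
    print(out.crs.to_wkt())
    return out

# Hypothetical input: one point near Delhi in WGS 84 (longitude, latitude)
demo = gpd.GeoDataFrame(
    {"name": ["sample"]},
    geometry=[Point(77.2, 28.6)],
    crs="EPSG:4326",
)
projected = reproject_gdf_sketch(demo, 3857)
```

Note that `to_crs` requires the input to already have a CRS set; a shapefile without a `.prj` file would need `set_crs` first.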


### Look at rows/columns
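Columns shared across the shapefiles (exact names and capitalization vary by file):
* `MAP_NO` / `Map_No`
* `REGISTRATI` / `Registrati` (missing from uac4)
* `FME_DATASE` / `fme_datase`
* `geometry`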

In [14]:
uac1.head(2)

Unnamed: 0,OBJECTID,MAP_NO,REGISTRATI,IMG_NM_IND,FME_DATASE,layer,path,geometry
0,1,520,570,SHIV COLNY KATEWADA,D:\UC Downloads\UC_501-600-Done\520_570.pdf,1to600,/home/hb/Documents/USO/Spatial_Index_Project/U...,"POLYGON ((8568698.722 3350778.289, 8568688.910..."
1,2,509,888,KRISHNA COLONY,D:\UC Downloads\UC_501-600-Done\509_888.pdf,1to600,/home/hb/Documents/USO/Spatial_Index_Project/U...,"POLYGON ((8580894.912 3343225.741, 8580905.114..."


In [10]:
uac1.columns

Index(['OBJECTID', 'MAP_NO', 'REGISTRATI', 'IMG_NM_IND', 'FME_DATASE', 'layer',
       'path', 'geometry'],
      dtype='object')

In [15]:
uac2.head(2)

Unnamed: 0,area,fme_datase,OBJECTID,Map_No,Registrati,FID_,Shape_Leng,geometry
0,1.0,D:\UC Downloads\aa-Missing files\1428_577.pdf,1,1428,577,0,7930.402876,"POLYGON ((8580125.161 3355724.976, 8580175.448..."
1,1.0,D:\UC Downloads\aa-Missing files\1302_728.pdf,2,1302,728,1,7733.501951,"POLYGON ((8583471.298 3344218.996, 8583550.471..."


In [16]:
uac2.columns

Index(['area', 'fme_datase', 'OBJECTID', 'Map_No', 'Registrati', 'FID_',
       'Shape_Leng', 'geometry'],
      dtype='object')

In [24]:
uac3.head(2)

Unnamed: 0,FID_1,area,fme_datase,OBJECTID,Map_No,Registrati,Shape_Leng,Shape_Le_1,geometry
0,0.0,0.0,D:\UC Downloads\aa-Missing files\1327_1487.pdf,7,1286,1460,517.690474,3295.566442,"POLYGON ((8592388.017 3346483.805, 8592387.299..."
1,0.0,0.0,D:\UC Downloads\aa-Missing files\1327_1487.pdf,8,1281,1108,658.088567,511.870469,"POLYGON ((8591940.960 3346236.224, 8591942.331..."


In [25]:
uac3.columns

Index(['FID_1', 'area', 'fme_datase', 'OBJECTID', 'Map_No', 'Registrati',
       'Shape_Leng', 'Shape_Le_1', 'geometry'],
      dtype='object')

In [28]:
uac4.head(16)

Unnamed: 0,OBJECTID,SHAPE_Leng,SHAPE_Area,Map_No,FID_,IMG_NM_IND,fme_datase,Shape_Le_1,Map_No_1,geometry
0,1,317.837099,7224.640143,471,14,,D:\UC Downloads\Incomplete polygons\471_96.pdf,317.837099,471,"POLYGON ((8592962.805 3308510.763, 8592940.683..."
1,2,504.639615,6005.492786,467,17,,D:\UC Downloads\Incomplete polygons\467_1036.pdf,504.639615,467,"POLYGON ((8592960.108 3309594.120, 8592938.021..."
2,3,659.026989,19477.553208,468,13,,D:\UC Downloads\Incomplete polygons\468_1022.pdf,659.026989,468,"POLYGON ((8592905.902 3311196.386, 8592906.651..."
3,4,1628.198637,64890.555879,1350,15,,D:\UC Downloads\Incomplete polygons\1350_1354-...,1628.198637,1350,"POLYGON ((8600327.881 3312934.656, 8600314.495..."
4,5,4607.075855,509357.278642,1492,12,,D:\UC Downloads\Incomplete polygons\1492_22_EL...,4607.075855,1492,"POLYGON ((8590656.678 3314980.221, 8590472.149..."
5,6,7894.343974,296351.871083,1469,6,,D:\UC Downloads\Incomplete polygons\1469_839.pdf,7894.343974,1469,"MULTIPOLYGON (((8582474.717 3316162.490, 85824..."
6,8,1699.209543,98900.345655,1478,4,,D:\UC Downloads\Incomplete polygons\1478_1276_...,1699.209543,1478,"POLYGON ((8568664.910 3326288.959, 8568658.486..."
7,9,935.236699,24119.459544,648,18,,D:\UC Downloads\Incomplete polygons\648_748.pdf,859.998058,648,"POLYGON ((8577386.486 3326835.875, 8577387.056..."
8,10,3348.720346,246965.771888,1394,3,,D:\UC Downloads\Incomplete polygons\1394_922.pdf,3348.720346,1394,"POLYGON ((8605303.306 3328175.328, 8605419.527..."
9,11,1389.170249,32692.729918,813,2,,D:\UC Downloads\Incomplete polygons\813_1361.pdf,1364.965548,813,"POLYGON ((8575714.856 3329773.854, 8575715.263..."


In [27]:
uac4.columns

Index(['OBJECTID', 'SHAPE_Leng', 'SHAPE_Area', 'Map_No', 'FID_', 'IMG_NM_IND',
       'fme_datase', 'Shape_Le_1', 'Map_No_1', 'geometry'],
      dtype='object')

In [22]:
len(uac4)

16

### Create registration id for uac4 shapefile (using regex)

In [30]:
import re

In [75]:
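# Matches "<map>_<registration>.pdf" in a file path.
# Takes into account registration numbers that
# are only numbers and those that include
# dashes, underscores, and letters.
# We want to extract group 2.
# The dot before "pdf" is escaped so it matches a literal dot.
pattern = r"\\(\d+)_(\d+.*)\.pdf"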

In [77]:
# Initialize registration number column
uac4['registration_no'] = -1

In [78]:
# Inspect registration number column
uac4.head(2)

Unnamed: 0,OBJECTID,SHAPE_Leng,SHAPE_Area,Map_No,FID_,IMG_NM_IND,fme_datase,Shape_Le_1,Map_No_1,geometry,registration_no
0,1,317.837099,7224.640143,471,14,,D:\UC Downloads\Incomplete polygons\471_96.pdf,317.837099,471,"POLYGON ((8592962.805 3308510.763, 8592940.683...",-1
1,2,504.639615,6005.492786,467,17,,D:\UC Downloads\Incomplete polygons\467_1036.pdf,504.639615,467,"POLYGON ((8592960.108 3309594.120, 8592938.021...",-1


In [79]:
# Extract registration number from fme_datase entry
# and place into its own column.
# Rows whose path does not match keep the -1 placeholder.
for idx, row in uac4.iterrows():
    matches = re.search(pattern, str(row['fme_datase']))
    if matches:
        uac4.loc[idx, 'registration_no'] = matches.group(2)
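
To make the behavior of the regex concrete, here is a small standalone check using only the standard library. It assumes a variant of the pattern with the dot before `pdf` escaped; the first two paths are copied from the `fme_datase` values shown above, while the last two are made up to illustrate a suffixed registration number and a non-matching path.

```python
import re

# Variant of the notebook's pattern with the dot before "pdf" escaped
pattern = re.compile(r"\\(\d+)_(\d+.*)\.pdf")

def extract_registration(path):
    """Return the registration part of a '<map>_<registration>.pdf' path, or None."""
    match = pattern.search(path)
    return match.group(2) if match else None

samples = [
    r"D:\UC Downloads\Incomplete polygons\471_96.pdf",
    r"D:\UC Downloads\Incomplete polygons\467_1036.pdf",
    r"C:\hypothetical\12_34-B.pdf",   # made-up path with a lettered suffix
    r"C:\hypothetical\no_match.txt",  # made-up path that should not match
]
results = [extract_registration(p) for p in samples]
print(results)
```

Returning `None` for non-matching paths keeps failures visible instead of silently skipping them.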

In [80]:
uac4

Unnamed: 0,OBJECTID,SHAPE_Leng,SHAPE_Area,Map_No,FID_,IMG_NM_IND,fme_datase,Shape_Le_1,Map_No_1,geometry,registration_no
0,1,317.837099,7224.640143,471,14,,D:\UC Downloads\Incomplete polygons\471_96.pdf,317.837099,471,"POLYGON ((8592962.805 3308510.763, 8592940.683...",96
1,2,504.639615,6005.492786,467,17,,D:\UC Downloads\Incomplete polygons\467_1036.pdf,504.639615,467,"POLYGON ((8592960.108 3309594.120, 8592938.021...",1036
2,3,659.026989,19477.553208,468,13,,D:\UC Downloads\Incomplete polygons\468_1022.pdf,659.026989,468,"POLYGON ((8592905.902 3311196.386, 8592906.651...",1022
3,4,1628.198637,64890.555879,1350,15,,D:\UC Downloads\Incomplete polygons\1350_1354-...,1628.198637,1350,"POLYGON ((8600327.881 3312934.656, 8600314.495...",1354-B
4,5,4607.075855,509357.278642,1492,12,,D:\UC Downloads\Incomplete polygons\1492_22_EL...,4607.075855,1492,"POLYGON ((8590656.678 3314980.221, 8590472.149...",22_ELD
5,6,7894.343974,296351.871083,1469,6,,D:\UC Downloads\Incomplete polygons\1469_839.pdf,7894.343974,1469,"MULTIPOLYGON (((8582474.717 3316162.490, 85824...",839
6,8,1699.209543,98900.345655,1478,4,,D:\UC Downloads\Incomplete polygons\1478_1276_...,1699.209543,1478,"POLYGON ((8568664.910 3326288.959, 8568658.486...",1276_C
7,9,935.236699,24119.459544,648,18,,D:\UC Downloads\Incomplete polygons\648_748.pdf,859.998058,648,"POLYGON ((8577386.486 3326835.875, 8577387.056...",748
8,10,3348.720346,246965.771888,1394,3,,D:\UC Downloads\Incomplete polygons\1394_922.pdf,3348.720346,1394,"POLYGON ((8605303.306 3328175.328, 8605419.527...",922
9,11,1389.170249,32692.729918,813,2,,D:\UC Downloads\Incomplete polygons\813_1361.pdf,1364.965548,813,"POLYGON ((8575714.856 3329773.854, 8575715.263...",1361


### Harmonize columns

#### uac1

In [81]:
uac1.columns

Index(['OBJECTID', 'MAP_NO', 'REGISTRATI', 'IMG_NM_IND', 'FME_DATASE', 'layer',
       'path', 'geometry'],
      dtype='object')

In [82]:
uac1_rename = {'MAP_NO': 'map_no', 'REGISTRATI': 'registration_no', 'FME_DATASE': 'fme_database'}

In [84]:
uac1 = uac1.rename(columns=uac1_rename)

In [85]:
uac1 = uac1[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [86]:
uac1.head()

Unnamed: 0,map_no,registration_no,fme_database,geometry
0,520,570,D:\UC Downloads\UC_501-600-Done\520_570.pdf,"POLYGON ((8568698.722 3350778.289, 8568688.910..."
1,509,888,D:\UC Downloads\UC_501-600-Done\509_888.pdf,"POLYGON ((8580894.912 3343225.741, 8580905.114..."
2,516,658,D:\UC Downloads\UC_501-600-Done\516_658.pdf,"POLYGON ((8574843.534 3349736.689, 8574924.630..."
3,503,200,D:\UC Downloads\UC_501-600-Done\503_200.pdf,"POLYGON ((8578433.979 3352949.941, 8578436.107..."
4,504,1194,D:\UC Downloads\UC_501-600-Done\504_1194.pdf,"POLYGON ((8579255.777 3353628.687, 8579180.169..."


#### uac2

In [87]:
uac2.columns

Index(['area', 'fme_datase', 'OBJECTID', 'Map_No', 'Registrati', 'FID_',
       'Shape_Leng', 'geometry'],
      dtype='object')

In [88]:
uac2_rename = {'Map_No': 'map_no', 'Registrati': 'registration_no', 'fme_datase': 'fme_database'}

In [89]:
uac2 = uac2.rename(columns=uac2_rename)

In [90]:
uac2 = uac2[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [92]:
uac2.head(10)

Unnamed: 0,map_no,registration_no,fme_database,geometry
0,1428,577,D:\UC Downloads\aa-Missing files\1428_577.pdf,"POLYGON ((8580125.161 3355724.976, 8580175.448..."
1,1302,728,D:\UC Downloads\aa-Missing files\1302_728.pdf,"POLYGON ((8583471.298 3344218.996, 8583550.471..."
2,1300,1404,D:\UC Downloads\aa-Missing files\1300_1404.pdf,"POLYGON ((8589460.134 3342855.629, 8589451.497..."
3,MAP 196,904,D:\UC Downloads\aa-Missing files\MAP 196_904.pdf,"POLYGON ((8590506.463 3344879.281, 8590663.658..."
4,MAP 197,905,D:\UC Downloads\aa-Missing files\MAP 197_905.pdf,"POLYGON ((8589225.195 3344690.195, 8589226.925..."
5,1321,27-(LOP),D:\UC Downloads\aa-Missing files\1321_27-(LOP)...,"POLYGON ((8590682.723 3345966.645, 8590667.854..."
6,1304,1588,D:\UC Downloads\aa-Missing files\1304_1588.pdf,"POLYGON ((8590903.305 3347923.009, 8590902.582..."
7,1308,3,D:\UC Downloads\aa-Missing files\1308_3_LOP.pdf,"POLYGON ((8591062.050 3347550.620, 8591062.297..."
8,MAP 194,470,D:\UC Downloads\aa-Missing files\MAP 194_470.pdf,"POLYGON ((8590556.502 3345562.504, 8590557.880..."
9,1428,577,D:\UC Downloads\aa-Missing files\1428_577.pdf,"POLYGON ((8580544.696 3354910.504, 8580568.877..."


#### uac3

In [93]:
uac3.columns

Index(['FID_1', 'area', 'fme_datase', 'OBJECTID', 'Map_No', 'Registrati',
       'Shape_Leng', 'Shape_Le_1', 'geometry'],
      dtype='object')

In [94]:
uac3_rename = {'Map_No': 'map_no', 'Registrati': 'registration_no', 'fme_datase': 'fme_database'}

In [95]:
uac3 = uac3.rename(columns=uac3_rename)

In [96]:
uac3 = uac3[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [97]:
uac3.head()

Unnamed: 0,map_no,registration_no,fme_database,geometry
0,1286,1460,D:\UC Downloads\aa-Missing files\1327_1487.pdf,"POLYGON ((8592388.017 3346483.805, 8592387.299..."
1,1281,1108,D:\UC Downloads\aa-Missing files\1327_1487.pdf,"POLYGON ((8591940.960 3346236.224, 8591942.331..."
2,1344,389A,D:\UC Downloads\aa-Missing files\1274_509.pdf,"POLYGON ((8601158.086 3338679.042, 8601159.936..."
3,1443,513A,D:\UC Downloads\aa-Missing files\1288_1552.pdf,"POLYGON ((8599031.183 3312021.877, 8599021.988..."
4,1474,1276,D:\UC Downloads\aa-Missing files\1286_1460.pdf,"POLYGON ((8577231.243 3326492.044, 8577236.413..."


#### uac4

In [98]:
uac4.columns

Index(['OBJECTID', 'SHAPE_Leng', 'SHAPE_Area', 'Map_No', 'FID_', 'IMG_NM_IND',
       'fme_datase', 'Shape_Le_1', 'Map_No_1', 'geometry', 'registration_no'],
      dtype='object')

In [100]:
uac4_rename = {'Map_No': 'map_no', 'fme_datase': 'fme_database'}

In [101]:
uac4 = uac4.rename(columns=uac4_rename)

In [102]:
uac4 = uac4[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [103]:
uac4.head()

Unnamed: 0,map_no,registration_no,fme_database,geometry
0,471,96,D:\UC Downloads\Incomplete polygons\471_96.pdf,"POLYGON ((8592962.805 3308510.763, 8592940.683..."
1,467,1036,D:\UC Downloads\Incomplete polygons\467_1036.pdf,"POLYGON ((8592960.108 3309594.120, 8592938.021..."
2,468,1022,D:\UC Downloads\Incomplete polygons\468_1022.pdf,"POLYGON ((8592905.902 3311196.386, 8592906.651..."
3,1350,1354-B,D:\UC Downloads\Incomplete polygons\1350_1354-...,"POLYGON ((8600327.881 3312934.656, 8600314.495..."
4,1492,22_ELD,D:\UC Downloads\Incomplete polygons\1492_22_EL...,"POLYGON ((8590656.678 3314980.221, 8590472.149..."


### Check data types for map number and registration number 

In [110]:
uac1['map_no'].dtype

dtype('int64')

In [109]:
uac1['registration_no']

0        570
1        888
2        658
3        200
4       1194
        ... 
3499    1460
3500      16
3501    1108
3502    1022
3503     127
Name: registration_no, Length: 3504, dtype: object

In [111]:
uac2['map_no']

0        1428
1        1302
2        1300
3     MAP 196
4     MAP 197
       ...   
82       1493
83       1295
84       1437
85       1295
86       1437
Name: map_no, Length: 87, dtype: object

In [112]:
uac2['registration_no']

0      577
1      728
2     1404
3      904
4      905
      ... 
82      40
83     702
84     168
85     702
86     168
Name: registration_no, Length: 87, dtype: object

In [114]:
uac3['map_no'].dtype

dtype('int64')

In [116]:
uac3['registration_no'].dtype

dtype('O')

In [117]:
uac4['map_no'].dtype

dtype('int64')

In [118]:
uac4['registration_no']

0         96
1       1036
2       1022
3     1354-B
4     22_ELD
5        839
6     1276_C
7        748
8        922
9       1361
10       724
11       167
12       625
13       650
14       218
15       561
Name: registration_no, dtype: object

### Fix uac2 map number column
* Extract the number
* Convert it to an integer
* Store it in a new column: `map_no_int`
* Remove `map_no` and rename `map_no_int` as `map_no`

In [127]:
pattern = r"(\d+)"

In [128]:
uac2['map_no_int'] = -1

In [129]:
# iterate across all rows
for idx, row in uac2.iterrows():
    # Extract the first run of digits from map_no
    matches = re.search(pattern, str(row['map_no']))

    # Place map number as integer in `map_no_int`;
    # rows without digits keep the -1 placeholder
    if matches:
        uac2.loc[idx, 'map_no_int'] = int(matches.group(1))
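
The same extraction can also be done without an explicit loop. The following is a sketch of a vectorized alternative using pandas string methods, not the notebook's own approach; the sample values mirror the `map_no` entries seen in uac2, plus a missing value added to show the `-1` fallback.

```python
import pandas as pd

# Sample values like those in uac2['map_no'], plus a missing entry
map_no = pd.Series(["1428", "MAP 196", "MAP 197", None])

# Pull out the first run of digits; entries without digits become NaN
digits = map_no.astype(str).str.extract(r"(\d+)", expand=False)

# Convert to integers, using -1 where no digits were found
map_no_int = pd.to_numeric(digits).fillna(-1).astype("int64")
print(map_no_int.tolist())
```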

In [130]:
uac2.head()

Unnamed: 0,map_no,registration_no,fme_database,geometry,map_no_int
0,1428,577,D:\UC Downloads\aa-Missing files\1428_577.pdf,"POLYGON ((8580125.161 3355724.976, 8580175.448...",1428
1,1302,728,D:\UC Downloads\aa-Missing files\1302_728.pdf,"POLYGON ((8583471.298 3344218.996, 8583550.471...",1302
2,1300,1404,D:\UC Downloads\aa-Missing files\1300_1404.pdf,"POLYGON ((8589460.134 3342855.629, 8589451.497...",1300
3,MAP 196,904,D:\UC Downloads\aa-Missing files\MAP 196_904.pdf,"POLYGON ((8590506.463 3344879.281, 8590663.658...",196
4,MAP 197,905,D:\UC Downloads\aa-Missing files\MAP 197_905.pdf,"POLYGON ((8589225.195 3344690.195, 8589226.925...",197


In [134]:
# Check that map number is an integer
uac2.map_no_int.dtype

dtype('int64')

In [135]:
# Drop the original `map_no` column
uac2 = uac2.drop(columns=['map_no'])

In [136]:
# Rename `map_no_int` as `map_no`
uac2 = uac2.rename(columns={'map_no_int':'map_no'})

In [137]:
uac2.map_no.dtype

dtype('int64')

In [138]:
uac2.head()

Unnamed: 0,registration_no,fme_database,geometry,map_no
0,577,D:\UC Downloads\aa-Missing files\1428_577.pdf,"POLYGON ((8580125.161 3355724.976, 8580175.448...",1428
1,728,D:\UC Downloads\aa-Missing files\1302_728.pdf,"POLYGON ((8583471.298 3344218.996, 8583550.471...",1302
2,1404,D:\UC Downloads\aa-Missing files\1300_1404.pdf,"POLYGON ((8589460.134 3342855.629, 8589451.497...",1300
3,904,D:\UC Downloads\aa-Missing files\MAP 196_904.pdf,"POLYGON ((8590506.463 3344879.281, 8590663.658...",196
4,905,D:\UC Downloads\aa-Missing files\MAP 197_905.pdf,"POLYGON ((8589225.195 3344690.195, 8589226.925...",197


### Merge uac1, uac2, uac3, uac4 into `uac`

In [141]:
concat_df = pd.concat([uac1, uac2, uac3, uac4], ignore_index=True)

In [143]:
uac = gpd.GeoDataFrame(concat_df, geometry='geometry')

In [144]:
len(uac)

3642
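
Concatenation only lines up cleanly because every frame was first reduced to the same four columns (`map_no`, `registration_no`, `fme_database`, `geometry`). A small guard can make that assumption explicit. The helper `concat_harmonized` below is hypothetical and uses plain pandas frames with made-up rows for illustration:

```python
import pandas as pd

def concat_harmonized(frames):
    """Concatenate frames after checking that they share the same columns."""
    expected = set(frames[0].columns)
    for i, frame in enumerate(frames[1:], start=1):
        if set(frame.columns) != expected:
            raise ValueError(
                f"frame {i} has columns {sorted(frame.columns)}, "
                f"expected {sorted(expected)}"
            )
    merged = pd.concat(frames, ignore_index=True)
    # Row count should equal the sum of the inputs
    assert len(merged) == sum(len(f) for f in frames)
    return merged

# Toy frames with harmonized column names
a = pd.DataFrame({"map_no": [520, 509], "registration_no": ["570", "888"]})
b = pd.DataFrame({"map_no": [1428], "registration_no": ["577"]})
merged_toy = concat_harmonized([a, b])
print(len(merged_toy))
```

Without such a check, a frame that still used `Map_No` would concatenate silently and leave `NaN` values in `map_no`.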

In [145]:
uac.head()

Unnamed: 0,map_no,registration_no,fme_database,geometry
0,520,570,D:\UC Downloads\UC_501-600-Done\520_570.pdf,"POLYGON ((8568698.722 3350778.289, 8568688.910..."
1,509,888,D:\UC Downloads\UC_501-600-Done\509_888.pdf,"POLYGON ((8580894.912 3343225.741, 8580905.114..."
2,516,658,D:\UC Downloads\UC_501-600-Done\516_658.pdf,"POLYGON ((8574843.534 3349736.689, 8574924.630..."
3,503,200,D:\UC Downloads\UC_501-600-Done\503_200.pdf,"POLYGON ((8578433.979 3352949.941, 8578436.107..."
4,504,1194,D:\UC Downloads\UC_501-600-Done\504_1194.pdf,"POLYGON ((8579255.777 3353628.687, 8579180.169..."


### Set index as column

In [146]:
uac['index'] = uac.index

In [147]:
uac.head()

Unnamed: 0,map_no,registration_no,fme_database,geometry,index
0,520,570,D:\UC Downloads\UC_501-600-Done\520_570.pdf,"POLYGON ((8568698.722 3350778.289, 8568688.910...",0
1,509,888,D:\UC Downloads\UC_501-600-Done\509_888.pdf,"POLYGON ((8580894.912 3343225.741, 8580905.114...",1
2,516,658,D:\UC Downloads\UC_501-600-Done\516_658.pdf,"POLYGON ((8574843.534 3349736.689, 8574924.630...",2
3,503,200,D:\UC Downloads\UC_501-600-Done\503_200.pdf,"POLYGON ((8578433.979 3352949.941, 8578436.107...",3
4,504,1194,D:\UC Downloads\UC_501-600-Done\504_1194.pdf,"POLYGON ((8579255.777 3353628.687, 8579180.169...",4


In [148]:
uac.tail()

Unnamed: 0,map_no,registration_no,fme_database,geometry,index
3637,27,167,D:\UC Downloads\Incomplete polygons\27_167.pdf,"POLYGON ((8576522.039 3336941.191, 8576576.418...",3637
3638,1161,625,D:\UC Downloads\Incomplete polygons\1161_625.pdf,"POLYGON ((8576663.032 3336801.506, 8576647.696...",3638
3639,1338,650,D:\UC Downloads\Incomplete polygons\1338_650.pdf,"POLYGON ((8600711.390 3340360.255, 8600724.770...",3639
3640,882,218,D:\UC Downloads\Incomplete polygons\882_218.pdf,"POLYGON ((8575193.159 3341022.052, 8575195.388...",3640
3641,1353,561,D:\UC Downloads\Incomplete polygons\1353_561.pdf,"POLYGON ((8574830.088 3341549.589, 8574827.767...",3641


### Set variables for column names

In [20]:
index_colname = 'index'
# Use the harmonized column names created above
map_colname = 'map_no'
registration_colname = 'registration_no'

In [21]:
uac[index_colname].head()

0    0
1    1
2    2
3    3
4    4
Name: index, dtype: int64

In [22]:
uac[map_colname].head()

0    520
1    509
2    516
3    503
4    504
Name: map_no, dtype: int64

In [23]:
uac[registration_colname].head()

0     570
1     888
2     658
3     200
4    1194
Name: registration_no, dtype: object

## Check for duplicate rows

In [150]:
uac_utils.gdf_has_duplicate_rows(uac)

False
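
The source of `uac_utils.gdf_has_duplicate_rows` is not shown here. One way such a check could be written, assuming a "duplicate row" means identical attribute values and an identical geometry, is to compare geometries through their WKB bytes. The function name `has_duplicate_rows_sketch` and the toy data are illustrative only:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Polygon

def has_duplicate_rows_sketch(gdf):
    """Return True if any row repeats another row's attributes and geometry."""
    geom_col = gdf.geometry.name
    comparable = pd.DataFrame(gdf.drop(columns=[geom_col]))
    # WKB bytes are hashable, so pandas can compare them directly
    comparable["geometry_wkb"] = gdf.geometry.to_wkb()
    return bool(comparable.duplicated().any())

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
with_dupes = gpd.GeoDataFrame({"map_no": [1, 1]}, geometry=[square, square])
without_dupes = gpd.GeoDataFrame({"map_no": [1, 2]}, geometry=[square, square])
print(has_duplicate_rows_sketch(with_dupes), has_duplicate_rows_sketch(without_dupes))
```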

## Check for only polygon geometries

In [152]:
uac_utils.all_polygon_geometries(uac)

True
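
Likewise, `uac_utils.all_polygon_geometries` is not shown. Because uac4 contains at least one `MULTIPOLYGON` and the check still returns `True`, the helper presumably accepts both polygon types; that is an inference, not something confirmed by the source. A sketch under that assumption, with a hypothetical function name:

```python
import geopandas as gpd
from shapely.geometry import LineString, MultiPolygon, Polygon

def all_polygon_geometries_sketch(gdf):
    """Return True if every geometry is a Polygon or MultiPolygon."""
    return bool(gdf.geom_type.isin(["Polygon", "MultiPolygon"]).all())

square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
shifted = Polygon([(2, 2), (3, 2), (3, 3), (2, 3)])
polygons_only = gpd.GeoDataFrame(geometry=[square, MultiPolygon([square, shifted])])
mixed = gpd.GeoDataFrame(geometry=[square, LineString([(0, 0), (1, 1)])])
print(all_polygon_geometries_sketch(polygons_only), all_polygon_geometries_sketch(mixed))
```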