# UAC Data Deduplication (30 July 2020)
Starting from the full, merged dataset, with the extent corrected (Delhi only) and the geometries corrected (polygons instead of polylines), I do the following to prepare the unauthorized colonies (UAC) dataset for the spatial index:
* Remove all rows with duplicate geometries
* Keep only one polygon (row) for each combination of map number and registration number, removing all other rows that share that combination.
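
As a minimal sketch of the second step, assuming each row is a plain dictionary and that the first row seen for a pair is the one to keep (the notebook's actual selection rule lives in `uac_utils` and may differ), the rows can be reduced to one per `(map_no, registration_no)` pair as follows. The function name `keep_one_per_pair` and the sample rows are hypothetical.

```python
def keep_one_per_pair(rows, map_key="map_no", reg_key="registration_no"):
    """Keep the first row seen for each (map number, registration number) pair."""
    seen = set()
    kept = []
    for row in rows:
        pair = (row[map_key], row[reg_key])
        if pair not in seen:
            seen.add(pair)
            kept.append(row)
    return kept


# Hypothetical sample rows; geometry is omitted for brevity
rows = [
    {"map_no": 520, "registration_no": "570", "index": 0},
    {"map_no": 520, "registration_no": "570", "index": 1},
    {"map_no": 509, "registration_no": "888", "index": 2},
]
unique_rows = keep_one_per_pair(rows)
print([row["index"] for row in unique_rows])
```

On a pandas DataFrame or GeoDataFrame, `drop_duplicates(subset=['map_no', 'registration_no'], keep='first')` expresses the same idea.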

## Details of UAC Data Deduplication
* Import 4 unauthorized shapefiles **[DONE]**
* Data exploration and pre-processing **[DONE]**
    * Reproject CRS to EPSG 3857
    * Look at rows/columns
    * Set variables for key column names like map column, registration column, etc. Make sure they are consistent and include the data needed for deduplicating polygons from PDF.
    * Set Index as Column 
* Check that there are no duplicate rows **[DONE]**
* Check that shapefile only contains polygon geometries. **[DONE]**
* Check for duplicate geometries. **[DONE]**
    * First look at sample duplicate geometries. How do they relate in terms of non-geometry attributes?
    * Delete rows and check resulting GeoDataFrame length
* Run code to remove duplicate maps
    * May have to modify code to revise map numbers for single-digit maps. This should turn 4 -> 04
    * Check that no index to be deleted is in keep list (and vice versa)
* Generate USO_ID
    * First identify the highest USO_ID in combined NDMC+JJC dataset
    * Add USO_ID starting with some buffer (e.g., plus 100)
* Save/export file
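
Two of the steps above, the keep/delete consistency check and the USO_ID assignment, can be sketched in plain Python. This is only an illustration under assumptions: USO_IDs are taken to be integers, the first new ID is taken to be the highest existing ID plus the buffer, and the function names and sample values are hypothetical.

```python
def overlapping_indices(keep_indices, delete_indices):
    """Return indices that appear in both lists (an empty set means consistent)."""
    return set(keep_indices) & set(delete_indices)


def next_uso_ids(n_rows, highest_existing_id, buffer=100):
    """Generate n_rows consecutive IDs starting at highest_existing_id + buffer."""
    start = highest_existing_id + buffer
    return list(range(start, start + n_rows))


# Hypothetical values for illustration only
overlap = overlapping_indices(keep_indices=[0, 2, 5], delete_indices=[1, 3, 4])
assert not overlap, f"indices in both lists: {sorted(overlap)}"

new_ids = next_uso_ids(n_rows=3, highest_existing_id=5000, buffer=100)
print(new_ids)
```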

In [3]:
# Import necessary modules
import pickle
import re
import importlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import box, Polygon, MultiPolygon, LineString, MultiLineString 
from shapely.ops import polygonize, unary_union
from shapely.geometry.base import geom_factory
from shapely.geos import lgeos
from pyproj import CRS
import rasterio
import uac_utils

%matplotlib inline

In [4]:
# Reload uac_utils when it gets updated
importlib.reload(uac_utils)

<module 'uac_utils' from "C:\\Users\\bwbel\\Google Drive\\slum_project\\UAC's Data Deduplication\\uac_utils.py">

## Import shapefiles

In [None]:
uac1 = gpd.read_file('ExtentCorrected_WithDuplicate_9July_Bijoy_QGIS.shp')
uac2 = gpd.read_file('UC_Missing1_Polygons.shp')
uac3 = gpd.read_file('UC_Missing2_Polygons.shp')
uac4 = gpd.read_file('UAC_part.shp')

## Data Exploration and Pre-Processing
* Check (and re-project) CRS
* Look at rows/columns
* Set variables for key column names like map column, registration column, etc.
* Set Index as Column 

### Reproject CRS to EPSG 3857

In [None]:
uac1 = uac_utils.reproject_gdf(uac1, 3857)

In [None]:
uac2 = uac_utils.reproject_gdf(uac2, 3857)

In [None]:
uac3 = uac_utils.reproject_gdf(uac3, 3857)

In [None]:
uac4 = uac_utils.reproject_gdf(uac4, 3857)

### Look at rows/columns
Common columns include the following:
* MAP_NO
* Registrati (except uac4)
* fme_datase
* geometry

In [None]:
uac1.head(2)

In [None]:
uac1.columns

In [None]:
uac2.head(2)

In [None]:
uac2.columns

In [None]:
uac3.head(2)

In [None]:
uac3.columns

In [None]:
uac4.head(16)

In [None]:
uac4.columns

In [None]:
len(uac4)

### Create registration id for uac4 shapefile (using regex)

In [None]:
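# Handles registration numbers that are digits only as well as
# those that include dashes, underscores, and letters.
# Group 1 is the map number; group 2 is the registration number we want.
# The dot before "pdf" is escaped so it matches a literal "."
pattern = r"\\(\d+)_(\d+.*)\.pdf"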

In [None]:
# Initialize registration number column
uac4['registration_no'] = -1

In [None]:
# Inspect registration number column
uac4.head(2)

In [None]:
# Extract registration number from fme_datase entry
# and place into its own column
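for idx, row in uac4.iterrows():
    try:
        matches = re.search(pattern, row['fme_datase'])
        uac4.loc[idx, 'registration_no'] = matches.group(2)
    except (AttributeError, TypeError):
        # No match (AttributeError) or non-string entry (TypeError):
        # leave the -1 placeholder for this row
        continue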
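
To illustrate what the pattern captures, here is a small standalone sketch using only the standard library. The helper name `extract_registration_no` is hypothetical, and the sample paths are assumptions modeled on the `D:\UC Downloads\...\520_570.pdf` style seen elsewhere in this notebook; the real `fme_datase` entries in uac4 may differ.

```python
import re

# Backslash, map number, underscore, registration number, literal ".pdf"
PATTERN = re.compile(r"\\(\d+)_(\d+.*)\.pdf")


def extract_registration_no(path):
    """Return the registration number from a PDF path, or None if absent."""
    if not isinstance(path, str):
        return None
    match = PATTERN.search(path)
    return match.group(2) if match else None


# Hypothetical sample paths in the style of the fme_datase column
print(extract_registration_no(r"D:\UC Downloads\UC_501-600-Done\520_570.pdf"))
print(extract_registration_no(r"D:\UC Downloads\example\682_660-B.pdf"))
print(extract_registration_no("no registration number here"))
```

Returning `None` instead of raising keeps the no-match case explicit, which avoids the need for a `try`/`except` around each row.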

In [None]:
uac4.head()

### Harmonize columns

#### uac1

In [None]:
uac1.columns

In [None]:
uac1_rename = {'MAP_NO': 'map_no', 'REGISTRATI': 'registration_no', 'FME_DATASE': 'fme_database'}

In [None]:
uac1 = uac1.rename(columns=uac1_rename)

In [None]:
uac1 = uac1[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [None]:
uac1.head()

#### uac2

In [None]:
uac2.columns

In [None]:
uac2_rename = {'Map_No': 'map_no', 'Registrati': 'registration_no', 'fme_datase': 'fme_database'}

In [None]:
uac2 = uac2.rename(columns=uac2_rename)

In [None]:
uac2 = uac2[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [None]:
uac2.head(10)

#### uac3

In [None]:
uac3.columns

In [None]:
uac3_rename = {'Map_No': 'map_no', 'Registrati': 'registration_no', 'fme_datase': 'fme_database'}

In [None]:
uac3 = uac3.rename(columns=uac3_rename)

In [None]:
uac3 = uac3[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [None]:
uac3.head()

#### uac4

In [None]:
uac4.columns

In [None]:
uac4_rename = {'Map_No': 'map_no', 'fme_datase': 'fme_database'}

In [None]:
uac4 = uac4.rename(columns=uac4_rename)

In [None]:
uac4 = uac4[['map_no', 'registration_no', 'fme_database', 'geometry']]

In [None]:
uac4.head()

### Check data types for map number and registration number 

In [None]:
uac1['map_no'].dtype

In [None]:
uac1['registration_no']

In [None]:
uac2['map_no']

In [None]:
uac2['registration_no']

In [None]:
uac3['map_no'].dtype

In [None]:
uac3['registration_no'].dtype

In [None]:
uac4['map_no'].dtype

In [None]:
uac4['registration_no']

### Fix uac2 map number column
* Extract number
* Convert to integer
* Store in new column: `map_no_int`
* Remove `map_no` and rename `map_no_int` as `map_no`

In [None]:
pattern = r"(\d+)"

In [None]:
uac2['map_no_int'] = -1

In [None]:
# iterate across all rows
for idx, row in uac2.iterrows():
    try:
        # Extract numbers from map_no
        matches = re.search(pattern, row['map_no'])
        
        # Place map number as integer in `map_no_int`
        uac2.loc[idx, 'map_no_int'] = int(matches.group(1))
    except (AttributeError, TypeError):
        # If the regex finds no digits or the entry is not a string,
        # skip this row and leave the -1 placeholder
        continue

In [None]:
uac2.head()

In [None]:
# Check that map number is an integer
uac2.map_no_int.dtype

In [None]:
# Drop the `map_no` column
uac2 = uac2.drop(columns=['map_no'])

In [None]:
# Rename `map_no_int` as `map_no`
uac2 = uac2.rename(columns={'map_no_int':'map_no'})

In [None]:
uac2.map_no.dtype

In [None]:
uac2.head()

### Merge uac1, uac2, uac3, uac4 into `uac`

In [None]:
# Concatenate GeoDataFrames
concat_df = pd.concat([uac1, uac2, uac3, uac4], ignore_index=True)

In [None]:
# Create new GeoDataFrame from concatenation
uac = gpd.GeoDataFrame(concat_df, crs=CRS.from_epsg(3857).to_wkt(), geometry='geometry')

In [None]:
len(uac)

In [None]:
uac.head()

In [None]:
uac['index'] = uac.index

In [None]:
uac.head()

In [None]:
uac.tail()

In [None]:
uac.crs

### Save uac shapefile to disk

In [None]:
uac.to_file('merged_uac_28july2020.shp')

In [None]:
with open('merged_uac_28july2020.data', 'wb') as f:
    pickle.dump(uac, f)

### Load uac shapefile from disk

In [None]:
uac = gpd.read_file('merged_fixed_uac_30July2020.shp')

In [None]:
len(uac)

In [None]:
uac.crs

In [None]:
uac.head()

## Check for duplicate rows

In [None]:
uac_utils.gdf_has_duplicate_rows(uac)

## Check for and Remove None-Type in Geometry

In [None]:
uac[uac['geometry'].isna()]

In [None]:
# Drop the rows with missing geometry identified in the previous cell
uac = uac.drop(index=[1436, 1889, 2202])

## Check for only polygon geometries

In [None]:
uac_utils.all_polygon_geometries(uac)

In [None]:
len(uac)

### Check validity of geometries

In [None]:
uac['valid_geom'] = uac['geometry'].is_valid

In [None]:
uac['valid_geom'].sum()

## Figure out how to identify duplicate polygons

In [None]:
# Let's create a small, sample GeoDataFrame
poly1 = Polygon(((0,1), (10, 20), (20, 30), (0, 1)))

# Same ring as poly1, but starting from a different vertex
poly2 = Polygon(((10,20), (20, 30), (0, 1), (10,20)))

# poly3, poly4, and poly5 are one ring starting from three different vertices
poly3 = Polygon(((11,22), (22, 33), (0, 0), (11,22)))

poly4 = Polygon(((22, 33), (0, 0), (11,22), (22,33)))

poly5 = Polygon(((0, 0), (11,22), (22,33), (0,0)))

df = pd.DataFrame({'geometry': [poly1, poly2, poly3, poly4, poly5]})

gdf = gpd.GeoDataFrame(df, geometry='geometry')

#gdf
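
The sample polygons above differ only in which vertex the ring starts from, so comparing raw coordinate lists would miss the duplicates. The sketch below shows one way to make that comparison order-independent in plain Python. It is not the implementation of `uac_utils.remove_duplicate_geom`, which is not shown here; the helper name `canonical_ring` is hypothetical, and the sketch assumes simple rings without holes or repeated vertices.

```python
def canonical_ring(coords):
    """Return a form of a closed ring that ignores start vertex and direction."""
    ring = list(coords)
    if len(ring) > 1 and ring[0] == ring[-1]:
        ring = ring[:-1]  # drop the repeated closing vertex
    candidates = []
    for seq in (ring, ring[::-1]):  # consider both winding directions
        start = seq.index(min(seq))  # rotate so the smallest vertex comes first
        candidates.append(tuple(seq[start:] + seq[:start]))
    return min(candidates)


# The same coordinates as poly1..poly5 above
rings = {
    "poly1": [(0, 1), (10, 20), (20, 30), (0, 1)],
    "poly2": [(10, 20), (20, 30), (0, 1), (10, 20)],
    "poly3": [(11, 22), (22, 33), (0, 0), (11, 22)],
    "poly4": [(22, 33), (0, 0), (11, 22), (22, 33)],
    "poly5": [(0, 0), (11, 22), (22, 33), (0, 0)],
}

# Group polygon names by their canonical ring
groups = {}
for name, coords in rings.items():
    groups.setdefault(canonical_ring(coords), []).append(name)

print(list(groups.values()))
```

With Shapely geometries, the `equals` method performs a topological comparison and would treat these rotated rings as equal as well.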

In [None]:
uac = uac_utils.remove_duplicate_geom(uac)

In [None]:
uac.head()

In [None]:
len(uac)

### Save File

In [None]:
uac.to_file('merged_fixed_unique_uac_30July2020.shp')

In [None]:
with open('merged_fixed_unique_uac_30July2020.data', 'wb') as f:
    pickle.dump(uac, f)

### Load File

In [5]:
with open('merged_fixed_unique_uac_30July2020.data', 'rb') as f:
    uac = pickle.load(f)

In [7]:
uac = uac.drop(columns=['valid_geom', 'index', 'level_0'])

In [8]:
uac.head()

| | map_no | registrati | fme_databa | geometry |
|---|---|---|---|---|
| 0 | 520 | 570 | D:\UC Downloads\UC_501-600-Done\520_570.pdf | POLYGON ((8568698.722 3350778.289, 8568688.910... |
| 1 | 509 | 888 | D:\UC Downloads\UC_501-600-Done\509_888.pdf | POLYGON ((8580894.912 3343225.741, 8580905.114... |
| 2 | 516 | 658 | D:\UC Downloads\UC_501-600-Done\516_658.pdf | POLYGON ((8574843.534 3349736.689, 8574924.630... |
| 3 | 503 | 200 | D:\UC Downloads\UC_501-600-Done\503_200.pdf | POLYGON ((8578433.979 3352949.941, 8578436.107... |
| 4 | 504 | 1194 | D:\UC Downloads\UC_501-600-Done\504_1194.pdf | POLYGON ((8579255.777 3353628.687, 8579180.169... |


### Set index as column

In [11]:
uac = uac_utils.create_index_column(uac)

In [12]:
uac.head()

| | map_no | registrati | fme_databa | geometry | index |
|---|---|---|---|---|---|
| 0 | 520 | 570 | D:\UC Downloads\UC_501-600-Done\520_570.pdf | POLYGON ((8568698.722 3350778.289, 8568688.910... | 0 |
| 1 | 509 | 888 | D:\UC Downloads\UC_501-600-Done\509_888.pdf | POLYGON ((8580894.912 3343225.741, 8580905.114... | 1 |
| 2 | 516 | 658 | D:\UC Downloads\UC_501-600-Done\516_658.pdf | POLYGON ((8574843.534 3349736.689, 8574924.630... | 2 |
| 3 | 503 | 200 | D:\UC Downloads\UC_501-600-Done\503_200.pdf | POLYGON ((8578433.979 3352949.941, 8578436.107... | 3 |
| 4 | 504 | 1194 | D:\UC Downloads\UC_501-600-Done\504_1194.pdf | POLYGON ((8579255.777 3353628.687, 8579180.169... | 4 |


### Rename registration column

In [18]:
uac = uac.rename(columns={'registrati': 'registration_no'})

In [19]:
uac.head()

| | map_no | registration_no | fme_databa | geometry | index |
|---|---|---|---|---|---|
| 0 | 520 | 570 | D:\UC Downloads\UC_501-600-Done\520_570.pdf | POLYGON ((8568698.722 3350778.289, 8568688.910... | 0 |
| 1 | 509 | 888 | D:\UC Downloads\UC_501-600-Done\509_888.pdf | POLYGON ((8580894.912 3343225.741, 8580905.114... | 1 |
| 2 | 516 | 658 | D:\UC Downloads\UC_501-600-Done\516_658.pdf | POLYGON ((8574843.534 3349736.689, 8574924.630... | 2 |
| 3 | 503 | 200 | D:\UC Downloads\UC_501-600-Done\503_200.pdf | POLYGON ((8578433.979 3352949.941, 8578436.107... | 3 |
| 4 | 504 | 1194 | D:\UC Downloads\UC_501-600-Done\504_1194.pdf | POLYGON ((8579255.777 3353628.687, 8579180.169... | 4 |


### Set variables for column names

In [20]:
index_colname = 'index'
map_colname = 'map_no'
registration_colname = 'registration_no'
#uac[index_colname].head()
#uac[map_colname].head()
uac[registration_colname].head()

0     570
1     888
2     658
3     200
4    1194
Name: registration_no, dtype: object

## Next tasks:
* Read what I did before for UAC deduplication, including the utils code
* Make sure that the code works for single-digit map numbers. For example, Map 1 is 01 in the URL
* Commit changes to the utils code by removing copies of the gdf within functions. Confirm that every function makes a copy of the object instead of referencing it in Python. Spend the weekend thinking through this.
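
For the single-digit map numbers, a minimal sketch of the zero-padding (4 -> 04) could look like the following. The helper name `format_map_no` and the two-digit default width are assumptions based on the "Map 1 is 01" example, not the existing utils code.

```python
def format_map_no(map_no, width=2):
    """Zero-pad a map number to at least `width` digits, e.g. 4 -> '04'."""
    return str(int(map_no)).zfill(width)


# Single-digit numbers gain a leading zero; longer numbers are unchanged
for number in (1, 4, 12, 520):
    print(number, "->", format_map_no(number))
```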

### Identify which map numbers have multiple registration numbers

In [22]:
map_registration_dict = uac_utils.create_map_registration_dict(uac, 
                                                               map_colname=map_colname, 
                                                               registration_colname=registration_colname)

In [23]:
list(map_registration_dict.items())[:5]

[(520, {'570'}),
 (509, {'888'}),
 (516, {'658'}),
 (503, {'200'}),
 (504, {'1194'})]

In [24]:
# Check if number of registration numbers for each Map No is always 1
for key, val in map_registration_dict.items():
    if len(val) != 1:
        print(key, val)

682 {'660B', '660-B'}
1533 {'491A', '491'}
1507 {'1072A', '1072a'}
0 {'1460', '1022', '509', '1552', '16', '1108'}
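
Some of the conflicts above differ only in case or punctuation ('660B' vs '660-B', '1072A' vs '1072a'). The sketch below shows one possible normalization, assuming that case, hyphens, underscores, and whitespace carry no meaning in a registration number. That assumption has not been verified against the source PDFs, and the helper name `normalize_registration_no` is hypothetical.

```python
import re


def normalize_registration_no(value):
    """Uppercase a registration number and strip hyphens, underscores, spaces."""
    return re.sub(r"[\s_-]+", "", str(value)).upper()


# Conflicting labels copied from the output above (map 0 is handled separately)
labels_by_map = {
    682: {"660B", "660-B"},
    1533: {"491A", "491"},
    1507: {"1072A", "1072a"},
}

# Normalize every label, then keep only maps that still have several labels
normalized = {
    map_no: {normalize_registration_no(label) for label in labels}
    for map_no, labels in labels_by_map.items()
}
still_ambiguous = {m: labels for m, labels in normalized.items() if len(labels) != 1}
print(still_ambiguous)
```

Under this assumption, maps 682 and 1507 collapse to a single label each, while map 1533 ('491A' vs '491') still needs a manual decision.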


## Remove rows where map_no = 0

In [29]:
map0_indices = uac[uac['map_no'] == 0].index

In [30]:
uac = uac.drop(index=map0_indices)

In [32]:
uac[uac['map_no'] == 0] 

| | map_no | registration_no | fme_databa | geometry | index |
|---|---|---|---|---|---|
