# Pre-processing Colonies Dataset

* Data Pre-processing
    * Import colonies
    * Import barrier files and reproject all to EPSG:7760
    * Check validity of all shapefiles (turn this into a function…) and check that all points are in Delhi (might be part of the spatial index notebook and UAC deduplication)
* Compute barrier clip for all colonies
* Run Neighbors Algorithm
    * Touching Neighbors algorithm: modify so that it ignores NDMC and related areas (the NDMC / DCB polygons are coded as NDMC and DCB)
    * bbox Neighbors algorithm
    * Should check for barriers
    * Should check for NDMC and related areas
    * Save as two separate columns: touching neighbors and bbox neighbors
* Additional preprocessing for colonies (turn into super function)
    * Create index column
    * Distance from NDMC (turn into function)
    * Area of each polygon
    * Create fake population data (divide total population by number of colonies)
* Export GeoDataFrame as pickle file and ESRI Shapefiles
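
The bbox Neighbors step in the outline can be illustrated without any GIS library. The sketch below is only a rough illustration, not the algorithm used in `spatial_index_utils`: the function names, the `bounds` and `code` keys, and the toy data are all hypothetical, and the barrier check is left out. It treats two colonies as bbox neighbors when their bounding boxes intersect or touch, and skips any colony coded as NDMC or DCB.

```python
from itertools import combinations

# Codes to ignore, per the outline above (NDMC / DCB polygons).
EXCLUDED_CODES = {"NDMC", "DCB"}


def bboxes_overlap(a, b):
    """Return True if two (minx, miny, maxx, maxy) boxes intersect or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def bbox_neighbors(colonies):
    """Map each colony position to the positions of its bbox neighbors.

    `colonies` is a hypothetical list of dicts with keys "bounds" and "code".
    Colonies with an excluded code get no neighbors and are nobody's neighbor.
    """
    neighbors = {i: [] for i in range(len(colonies))}
    for i, j in combinations(range(len(colonies)), 2):
        a, b = colonies[i], colonies[j]
        if a["code"] in EXCLUDED_CODES or b["code"] in EXCLUDED_CODES:
            continue
        if bboxes_overlap(a["bounds"], b["bounds"]):
            neighbors[i].append(j)
            neighbors[j].append(i)
    return neighbors


# Toy data (made up for illustration only).
example = [
    {"bounds": (0, 0, 2, 2), "code": "RV"},
    {"bounds": (1, 1, 3, 3), "code": "RUAC"},  # overlaps colony 0
    {"bounds": (5, 5, 6, 6), "code": "RV"},    # isolated
    {"bounds": (0, 0, 3, 3), "code": "NDMC"},  # excluded by code
]
print(bbox_neighbors(example))
```

The pairwise loop is quadratic in the number of colonies, so for a few thousand polygons a spatial index would likely be a better fit; the loop is kept here only because it makes the rule easy to read.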

## Import modules and set constants

In [170]:
import os
import pickle
from importlib import reload
import pandas as pd
import geopandas as gpd
from shapely.geometry import Polygon, box
import spatial_index_utils

In [174]:
reload(spatial_index_utils)

<module 'spatial_index_utils' from 'C:\\Users\\bwbel\\Google Drive\\slum_project\\spatial_index_python\\spatial_index_utils.py'>

In [43]:
# WGS 84 / Delhi
epsg_code = 7760

## Import shapefiles

In [182]:
#colony_filepath = os.path.join('shapefiles', 'Spatial_Index_GIS', 'Colony_Shapefile', 
#                        'Final_USO_fixed.shp')

colony_filepath = 'final_uso_deduplicated.shp'

barrier_directory = os.path.join('shapefiles', 'Barrier_Clip')

canal_filepath = os.path.join(barrier_directory, 'Canal', 'Canal.shp')
drain_filepath = os.path.join(barrier_directory, 'Drain', 'Major_Drain.shp')
railway_filepath = os.path.join(barrier_directory, 'Railway', 'Railway_Line.shp')

# boundary of Delhi
delhi_bounds_filepath = os.path.join('shapefiles', 'delhi_bounds_buffer.shp')

# Check that all filepaths exist
filepath_list = [colony_filepath, canal_filepath, drain_filepath, railway_filepath, delhi_bounds_filepath]

for filepath in filepath_list:
    if not os.path.exists(filepath):
        print('{} does not exist'.format(filepath))
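
The loop above only prints a message, so later cells would still run with a missing input. One possible alternative, sketched here under the assumption that a hard failure is preferable, collects every missing path and raises a single error. The helper names `find_missing_paths` and `require_paths` are hypothetical and are not part of `spatial_index_utils`.

```python
import os


def find_missing_paths(paths):
    """Return the subset of `paths` that does not exist on disk."""
    return [p for p in paths if not os.path.exists(p)]


def require_paths(paths):
    """Raise FileNotFoundError listing every missing path, if any."""
    missing = find_missing_paths(paths)
    if missing:
        raise FileNotFoundError("Missing input files: " + ", ".join(missing))
```

Calling `require_paths(filepath_list)` right after building the list would stop the notebook before any `gpd.read_file` call is attempted.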

In [51]:
colonies = gpd.read_file(colony_filepath)

In [52]:
canal = gpd.read_file(canal_filepath)

In [53]:
drain = gpd.read_file(drain_filepath)

In [54]:
railway = gpd.read_file(railway_filepath)

## Inspect shapefiles for validity (`check_shapefile`)

In [175]:
spatial_index_utils.check_shapefile(gdf=colonies, gdf_name='colonies', 
                                    geom_type='Polygon', 
                                    delhi_bounds_filepath=delhi_bounds_filepath)

colonies has duplicate rows: False
----------------------------------------------------
rows with invalid geometries 

----------------------------------------------------
all geometries in colonies are of type Polygon: True
----------------------------------------------------
Rows with None value in geometry column are below
Empty GeoDataFrame
Columns: [index, AREA, USO_AREA_U, HOUSETAX_C, USO_FINAL, geometry, geom_type]
Index: []
----------------------------------------------------
colonies shapefile is contained within Delhi: True
----------------------------------------------------
Done with shapefile evaluation


In [176]:
spatial_index_utils.check_shapefile(gdf=canal, gdf_name='canal', geom_type='Line', 
                                    delhi_bounds_filepath=delhi_bounds_filepath)

canal has duplicate rows: False
----------------------------------------------------
rows with invalid geometries 

----------------------------------------------------
all geometries in canal are of type Line: True
----------------------------------------------------
Rows with None value in geometry column are below
Empty GeoDataFrame
Columns: [index, FID_1, CAN_NM, CAN_CLSF, EL_GND, DIST_NM, geometry, geom_type]
Index: []
----------------------------------------------------
canal shapefile is contained within Delhi: True
----------------------------------------------------
Done with shapefile evaluation


In [177]:
spatial_index_utils.check_shapefile(gdf=drain, gdf_name='drain', geom_type='Line', 
                                    delhi_bounds_filepath=delhi_bounds_filepath)

drain has duplicate rows: False
----------------------------------------------------
rows with invalid geometries 

----------------------------------------------------
all geometries in drain are of type Line: True
----------------------------------------------------
Rows with None value in geometry column are below
Empty GeoDataFrame
Columns: [index, FID, Drain_type, Drain_Name, MAINTAINED, AC_NAME, DISTRICT, geometry, geom_type]
Index: []
----------------------------------------------------
drain shapefile is contained within Delhi: True
----------------------------------------------------
Done with shapefile evaluation


In [178]:
spatial_index_utils.check_shapefile(gdf=railway, gdf_name='railway', geom_type='Line', 
                                    delhi_bounds_filepath=delhi_bounds_filepath)

railway has duplicate rows: False
----------------------------------------------------
rows with invalid geometries 

----------------------------------------------------
all geometries in railway are of type Line: True
----------------------------------------------------
Rows with None value in geometry column are below
Empty GeoDataFrame
Columns: [index, FID_1, RL_ZONE, geometry, geom_type]
Index: []
----------------------------------------------------
railway shapefile is contained within Delhi: True
----------------------------------------------------
Done with shapefile evaluation
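
The source of `check_shapefile` is not shown in this notebook. As a rough idea of how checks like the ones printed above could be computed with geopandas, the sketch below counts missing and invalid geometries, tests the geometry type, and tests containment in a bounding geometry. The function name `summarize_checks`, its return format, and the toy data are assumptions made for illustration. Note that geopandas reports line features as `LineString`, whereas the calls above pass `'Line'`.

```python
import geopandas as gpd
from shapely.geometry import Polygon, box


def summarize_checks(gdf, expected_type, bounds_geom):
    """Return a dict of basic validity checks for a GeoDataFrame."""
    geoms = gdf.geometry
    present = geoms[geoms.notna()]       # rows that have a geometry
    valid = present[present.is_valid]    # only test containment on valid shapes
    return {
        "n_missing_geometry": int(geoms.isna().sum()),
        "n_invalid_geometry": int((~present.is_valid).sum()),
        "all_expected_type": bool((present.geom_type == expected_type).all()),
        "all_within_bounds": bool(valid.within(bounds_geom).all()),
    }


# Toy data: a valid square and a self-intersecting "bow-tie" polygon.
square = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])
bowtie = Polygon([(0, 0), (1, 1), (1, 0), (0, 1)])
toy = gpd.GeoDataFrame({"name": ["square", "bowtie"]}, geometry=[square, bowtie])

print(summarize_checks(toy, "Polygon", box(-1, -1, 2, 2)))
```

Returning a dict rather than printing makes it straightforward to assert on the results or to collect them across several layers.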


## Remove duplicate geometries

In [67]:
canal = spatial_index_utils.remove_duplicate_geom(canal)

Original number of rows is 43:
New number of rows after deduplication is 43:


In [68]:
drain = spatial_index_utils.remove_duplicate_geom(drain)

Original number of rows is 616:
New number of rows after deduplication is 616:


In [69]:
railway = spatial_index_utils.remove_duplicate_geom(railway)

Original number of rows is 5356:
New number of rows after deduplication is 5356:


In [70]:
colonies = spatial_index_utils.remove_duplicate_geom(colonies)

Original number of rows is 4319:
New number of rows after deduplication is 4290:
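
How `remove_duplicate_geom` decides that two geometries are duplicates is not shown here. A minimal sketch of one common approach, assuming exact equality of the WKB encoding is an acceptable definition of "duplicate", is given below. The helper name `drop_duplicate_geometries` and the toy data are hypothetical, and the sketch assumes no row has a missing geometry (the checks above report none).

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point


def drop_duplicate_geometries(gdf):
    """Keep the first row for each distinct geometry, compared by WKB bytes."""
    wkb = pd.Series([geom.wkb for geom in gdf.geometry], index=gdf.index)
    return gdf.loc[~wkb.duplicated()].reset_index(drop=True)


# Toy data: the first two rows share the same point.
toy = gpd.GeoDataFrame(
    {"name": ["a", "a_copy", "b"]},
    geometry=[Point(0, 0), Point(0, 0), Point(1, 1)],
)
deduped = drop_duplicate_geometries(toy)
print(len(toy), "->", len(deduped))
```

Byte-level comparison only catches geometries with identical coordinates in identical order; two polygons that cover the same area but start at a different vertex would be treated as distinct.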


In [86]:
colonies.head()

| | index | AREA | USO_AREA_U | HOUSETAX_C | USO_FINAL | geometry | geom_type |
|---|---|---|---|---|---|---|---|
| 0 | 0 | Singhola | 3058 | H | RV | POLYGON Z ((1013763.588 1023721.838 0.000, 101... | `<class 'geopandas.geoseries.GeoSeries'>` |
| 1 | 1 | Indra Colony (Narela) | 1760 | G | RUAC | POLYGON Z ((1007997.730 1025421.961 0.000, 100... | `<class 'geopandas.geoseries.GeoSeries'>` |
| 2 | 2 | Bhor Garh | 1276 | H | Industrial | POLYGON Z ((1008543.236 1022671.585 0.000, 100... | `<class 'geopandas.geoseries.GeoSeries'>` |
| 3 | 3 | Gautam Colony | 1528 | G | RUAC | POLYGON Z ((1008080.674 1025132.190 0.000, 100... | `<class 'geopandas.geoseries.GeoSeries'>` |
| 4 | 4 | Kureni | 2082 | H | RV | POLYGON Z ((1009508.695 1025281.671 0.000, 100... | `<class 'geopandas.geoseries.GeoSeries'>` |


In [99]:
# NOTE: `colonies_copy` is not defined in any cell shown in this notebook.
colonies_copy.to_file('final_uso_deduplicated.shp')

In [101]:
with open('final_uso_deduplicated.data', 'wb') as f:
    pickle.dump(colonies_copy, f)

## Check CRS and reproject to EPSG:7760

In [191]:
colonies.crs

<Projected CRS: EPSG:7760>
Name: WGS 84 / Delhi
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: India - Delhi
- bounds: (76.83, 28.4, 77.34, 28.89)
Coordinate Operation:
- name: Delhi NSF LCC
- method: Lambert Conic Conformal (2SP)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [192]:
canal.crs

<Projected CRS: EPSG:7760>
Name: WGS 84 / Delhi
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: India - Delhi
- bounds: (76.83, 28.4, 77.34, 28.89)
Coordinate Operation:
- name: Delhi NSF LCC
- method: Lambert Conic Conformal (2SP)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [193]:
#drain = spatial_index_utils.reproject_gdf(drain, epsg_code)
drain.crs

<Projected CRS: EPSG:7760>
Name: WGS 84 / Delhi
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: India - Delhi
- bounds: (76.83, 28.4, 77.34, 28.89)
Coordinate Operation:
- name: Delhi NSF LCC
- method: Lambert Conic Conformal (2SP)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [194]:
railway.crs

<Projected CRS: EPSG:7760>
Name: WGS 84 / Delhi
Axis Info [cartesian]:
- X[east]: Easting (metre)
- Y[north]: Northing (metre)
Area of Use:
- name: India - Delhi
- bounds: (76.83, 28.4, 77.34, 28.89)
Coordinate Operation:
- name: Delhi NSF LCC
- method: Lambert Conic Conformal (2SP)
Datum: World Geodetic System 1984
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [195]:
colonies.crs == drain.crs == canal.crs == railway.crs

True
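
All four layers already report EPSG:7760, so no reprojection was needed in this run. For a layer that arrives in a different CRS, a guard like the one below could be applied before the comparison. This is a sketch only: `ensure_crs` is a hypothetical helper, not the `reproject_gdf` function from `spatial_index_utils`, and the toy point is made up.

```python
import geopandas as gpd
from shapely.geometry import Point

TARGET_EPSG = 7760  # WGS 84 / Delhi, the same value as `epsg_code`


def ensure_crs(gdf, epsg=TARGET_EPSG):
    """Return `gdf` in the target CRS, reprojecting only when needed."""
    if gdf.crs is None:
        raise ValueError("GeoDataFrame has no CRS; set one before reprojecting")
    if gdf.crs.to_epsg() == epsg:
        return gdf
    return gdf.to_crs(epsg=epsg)


# Toy layer in geographic coordinates (EPSG:4326), roughly central Delhi.
toy = gpd.GeoDataFrame(geometry=[Point(77.21, 28.61)], crs="EPSG:4326")
projected = ensure_crs(toy)
print(projected.crs.to_epsg())
```

Raising on a missing CRS is a deliberate choice in this sketch: `to_crs` cannot transform coordinates whose source CRS is unknown, and silently assigning one could shift every geometry.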