# HEX timing analysis

Performed with timing_script.ipynb <br>
<b>no parallelization</b> <br>
<br>
Processor: Intel(R) Xeon(R) w3-2435 3.10 GHz 16 core <br>
RAM: 64.0 GB

### Reading in
![alt text](timing_1.png "Hex data read in time") <br>
Performance is highly dependent on the data location: reading from a local drive is significantly faster than reading from the K: drive. <br>
Generally, reading data from the K: drive accounts for about 90% of total execution time. <br>
<br>
Reading file by file seems to save little time, but could improve memory management if each layer is released from memory after use. <br>
<br>
Tab. Data read-in time (in seconds) <br>
|   | H5 | H6 | H7 | H8 | H9 |
|-:|:-:|:-:|:-:|:-:|:-:|
|K: drive (partial) | 0.542 | 13.7 | 88 | 615 | dropped |
|local drive (partial) | 0.148 | 0.717 | 4.68 | 33 | 230 |
|local drive (full) | 0.679 | 4.24 | 28.5 | 198 | 1391 |
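
The file-by-file reading mentioned above could be timed and kept memory-lean roughly as follows. This is a minimal sketch, not the code from timing_script.ipynb: `read_one_by_one`, `read_fn` and `handle_fn` are hypothetical names, and `read_fn` stands in for whatever actually loads a layer (for example a call to `gpd.read_file` with a `layer` argument).

```python
import gc
import time


def read_one_by_one(layer_names, read_fn, handle_fn):
    """Read, process and release one layer at a time, timing each read.

    read_fn(name) loads a layer and handle_fn(name, data) consumes it.
    Both are placeholders supplied by the caller.
    """
    timings = {}
    for name in layer_names:
        start = time.perf_counter()
        data = read_fn(name)
        timings[name] = time.perf_counter() - start

        handle_fn(name, data)

        # Drop the only reference held here and ask the collector to run,
        # so this layer does not stay in memory while the next one loads.
        del data
        gc.collect()
    return timings
```

The memory benefit only materializes if `handle_fn` does not keep its own reference to the layer it was given.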

### Spatial join 
![alt text](timing_3.png "Spatial join time") <br>
Timing for the spatial join depends on the size of the data used and on the hex resolution. <br>
At H5 all operations complete in under a minute; the time increases with the number of hexes, up to 4.5 h at H9. Possible solutions are batching the hexes or parallelizing the code. <br>
When a hex layer of about 1/5 the size is joined, the procedure takes a couple of seconds at H5 or H6 and increases with hex resolution, up to roughly 28 minutes at H9. <br>
If the join is performed in batches, the time needed to merge the batches has to be included.
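
One way the batching idea could look is sketched below. It is an assumption-laden illustration rather than tested project code: `batch_bounds` and `batched_sjoin` are hypothetical helpers, and the default `batch_size` is an arbitrary placeholder that would need tuning against available memory.

```python
def batch_bounds(n_rows, batch_size):
    """Yield (start, stop) row ranges that cover n_rows in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)


def batched_sjoin(hexes, layer, batch_size=100_000, how="left"):
    """Spatially join `hexes` to `layer` one slice of hexes at a time."""
    # Imported here so the batching helper above stays usable on its own.
    import geopandas as gpd
    import pandas as pd

    parts = []
    for start, stop in batch_bounds(len(hexes), batch_size):
        parts.append(gpd.sjoin(hexes.iloc[start:stop], layer, how=how))

    # This concatenation is the merge step whose cost has to be counted.
    return pd.concat(parts)
```

Slicing the hex layer works for `how="left"` and `how="inner"`, where each output row belongs to exactly one hex; a right join would need a different split. The same batches would also be the natural unit of work if the join were parallelized.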

### Writing data to geodatabase
![alt text](timing_4.png "") <br>
Depending on the data size, writing to the geodatabase file takes up to 25 minutes at H9.

### Final full execution time
<br>
Tab. Execution time, including reading the data in (local drive), performing the spatial joins of all five layers (no parallelization), and saving the data into a geodatabase

|| H5 | H6 | H7 | H8 | H9 |
|-:|:-:|:-:|:-:|:-:|:-:|
|full data, in minutes|0.810|1.584|8.078|48.26|325.2 (5h 25min)|
|partial data, in minutes|0.194|0.257|1.376|5.698|36.66|
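
To see how execution time scales between resolutions, the full-data row of the table can be turned into step-to-step ratios. The sketch below uses only the numbers from the table; `step_ratios` is a hypothetical helper name. For context, each finer H3 resolution has roughly seven times as many cells as the one before it, so a ratio near 7 would indicate time growing in proportion to the hex count.

```python
# Full-data execution times in minutes, copied from the table above.
full_minutes = {"H5": 0.810, "H6": 1.584, "H7": 8.078, "H8": 48.26, "H9": 325.2}


def step_ratios(timings):
    """Return the time ratio between each resolution and the previous one."""
    levels = list(timings)
    return {
        f"{coarse}->{fine}": timings[fine] / timings[coarse]
        for coarse, fine in zip(levels, levels[1:])
    }


ratios = step_ratios(full_minutes)
for step, ratio in ratios.items():
    print(f"{step}: x{ratio:.2f}")
```

The ratios rise with resolution, which suggests that fixed overhead dominates at the coarse levels while the finer levels scale almost in step with the number of hexes.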

### Data size
<br>
Tab. Size of finalized data (local drive) in geodatabase format, all five layers joined

|| H5 | H6 | H7 | H8 | H9 |
|-:|:-:|:-:|:-:|:-:|:-:|
|data size|57.8 MB|47.5 MB|117 MB|519 MB|2.97 GB|

<br>
No type manipulation or cleaning has been implemented yet. <br>

![alt text](timing_5.png " ")


### Identified issues

- no parallelization / batching
- data cleaning (null, missing data)
- format transformation (string, int)
- H10 is too big for 64 GB of memory and requires either slicing or HPC to run
- possibly different join predicates (center, within...); [full list here](https://shapely.readthedocs.io/en/latest/manual.html#binary-predicates)

![alt text](timing_6.png " ")
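
For the cleaning and format-transformation items, a per-value converter is one possible starting point. This is only a sketch: `to_int_or_none` is a hypothetical helper, and the set of strings treated as missing is an assumption that would need to be checked against the actual attribute tables.

```python
# Assumed markers for missing values; adjust to what the layers really contain.
NULL_STRINGS = {"", "na", "n/a", "nan", "null", "none"}


def to_int_or_none(value):
    """Convert a raw attribute value to int, or None if missing or non-integer."""
    if value is None:
        return None
    text = str(value).strip()
    if text.lower() in NULL_STRINGS:
        return None
    try:
        return int(text)
    except ValueError:
        pass
    try:
        number = float(text)  # handles strings such as "12.0"
    except ValueError:
        return None
    return int(number) if number.is_integer() else None
```

Such a function could be applied to a column with `Series.map`. Identifier-like fields with leading zeros (for example hydrologic unit codes) are probably better kept as strings, since converting them to integers drops the zeros.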

### Code example, h5, all layers

```python
import geopandas as gpd
import fiona

# Paths to the geodatabases used
hex_gdb = r"C://Research/Grid_effort/H3_5_10_Grd.gdb"
inputs_gdb = r'C://Research/Grid_effort/H3Grid_Inputs.gdb'

# Hex level
hexLevel = '5'

# Read in the hex polygon grid layer as the base layer
base_layer = gpd.read_file(hex_gdb, layer='H3_' + hexLevel)
base_layer = base_layer.drop(columns=['Shape_Length', 'Shape_Area'])

# Read all layers to be joined from the geodatabase
layers_to_join = ['tj_2021_us_st_cnt', 'Estuarine_Drainage_Areas', 'WBDHU8', 'dtl_cnty_Census_ESRI', 'WBDHU12']
layer_gdfs = {layer: gpd.read_file(inputs_gdb, layer=layer) for layer in layers_to_join}

# Spatial join
for name, gdf in layer_gdfs.items():
    # Ensure the CRS is the same between the base layer and the current layer
    if gdf.crs != base_layer.crs:
        gdf = gdf.to_crs(base_layer.crs)

    # Drop the geodatabase-generated geometry attributes if present
    gdf = gdf.drop(columns=['Shape_Length', 'Shape_Area'], errors='ignore')

    # Left spatial join: every hex is kept, matched or not
    # (the default predicate is "intersects")
    base_layer = gpd.sjoin(base_layer, gdf, how="left")

    # Drop the `index_right` column so it does not collide in the next join
    base_layer = base_layer.drop(columns='index_right', errors='ignore')

# Save the output
base_layer.to_file('h' + hexLevel + '_allAtOnce.gdb', driver='OpenFileGDB')
```