### Cleaning up Survey123 Geometry


* Should Monica's GeoJSONs go into our GCS?
* It would be great to save a parquet with one row per project element and a common project ID.
* https://pypi.org/project/fs-gcsfs/
* Install `fs-gcsfs` (`pip install fs-gcsfs`) and `calitp_data_infra`.
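
The "one row per project element and a common project ID" idea might look roughly like this. It is a minimal sketch on toy data that assumes pandas is available. The `project_id` column name and the choice to key it on `projname` are assumptions for illustration, not decisions made in this notebook.

```python
import pandas as pd

# Toy stand-in for the Survey123 rows: one row per project element.
# All values are placeholders, not real survey data.
elements = pd.DataFrame(
    {
        "projname": ["Project A", "Project A", "Project B"],
        "parentglobalid": ["{ID-1}", "{ID-2}", "{ID-3}"],
        "geopoint_type": ["ITS", "Highway (Freight)", "ITS"],
    }
)

# Assumption: rows that share a project name belong to one project, even
# when they were submitted under different parentglobalid values.
# pd.factorize returns integer codes in order of first appearance.
elements["project_id"] = pd.factorize(elements["projname"])[0]

print(elements[["project_id", "projname", "parentglobalid"]])
```

With a GeoDataFrame, the same column could be added before writing the result out with `to_parquet`.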

In [1]:
import os

# USE_PYGEOS has to be set before geopandas is first imported to take effect
os.environ['USE_PYGEOS'] = '0'

import geopandas as gpd
import pandas as pd
from shared_utils import utils, geography_utils

In the next release, GeoPandas will switch to using Shapely by default, even if PyGEOS is installed. If you only have PyGEOS installed to get speed-ups, this switch should be smooth. However, if you are using PyGEOS directly (calling PyGEOS functions on geometries from GeoPandas), this will then stop working and you are encouraged to migrate from PyGEOS to Shapely 2.0 (https://shapely.readthedocs.io/en/latest/migration_pygeos.html).
  import geopandas as gpd


In [2]:
import fiona
from calitp_data_analysis import get_fs

import _utils

fs = get_fs()

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [4]:
def to_snakecase(df):
    """Lowercase column names and replace spaces with underscores (modifies df in place)."""
    df.columns = df.columns.str.lower().str.replace(' ', '_')
    return df
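
`to_snakecase` only lowercases names and swaps spaces for underscores, so punctuation in Survey123 question text (such as the trailing `?` in the infrastructure-type column) survives. A stricter variant is sketched below. `clean_column_name` is a hypothetical helper, not part of `shared_utils`, and switching to it would change any column name that contains punctuation.

```python
import re


def clean_column_name(name: str) -> str:
    """Lowercase a label and collapse each run of non-alphanumerics into one underscore."""
    cleaned = re.sub(r"[^0-9a-z]+", "_", name.strip().lower())
    # Drop underscores left over at either end, e.g. from a trailing "?"
    return cleaned.strip("_")


print(clean_column_name("Project Name"))
print(clean_column_name("Which type of infrastructure does the geometry above correspond to?"))
```

It could be applied with `df.columns = [clean_column_name(c) for c in df.columns]`.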

In [5]:
geo_path = "TCEP_SCCP_GeometryIntake_geojson_20230801.zip"

In [6]:
#with get_fs().open(f"{TCEP_SCCP_GCS}{geo_path}") as f:
#     tcep_sccp_geo = to_snakecase(gpd.read_file(f))


In [7]:
# tcep_sccp_geo.sample()

In [8]:
#tcep_sccp_geo.shape

In [9]:
#tcep_sccp_geo.project_name.nunique()

In [10]:
#tcep_sccp_geo.drop(columns = ['geometry'])

In [11]:
# tcep_sccp_geo.sort_values(by = ['project_name'])[['project_name','which_type_of_infrastructure_does_the_geometry_above_correspond_to?','geometry']]

In [12]:
# tcep_sccp_geo.explore(style_kwds = {'weight':5}, height = 400, width = 1000)

#### Geometry Scores
* https://stackoverflow.com/questions/64277987/python-geopandas-failing-to-read-misread-750mb-zip-esri-gdb-file-but-not-200mb
* https://fiona.readthedocs.io/en/latest/README.html
* https://fiona.readthedocs.io/en/stable/manual.html

In [13]:
def open_survey123(save_to_gcs: bool = False):
    """
    Download the Survey123 geodatabase from GCS, stack all of its layers,
    report invalid and repeated geometries, and return the cleaned rows.
    """
    # https://gis.stackexchange.com/questions/255138/reading-the-names-of-geodatabase-file-layers-in-python
    file = "TCEP_SCCP_Score_Geometry_20230801.gdb.zip"
    GCS_PATH = f"{_utils.GCS_FILE_PATH}Survey123_Geo/"
    fs.get(f'{GCS_PATH}{file}', 'tmp.gdb.zip')
    geo_layers = fiona.listlayers('tmp.gdb.zip')

    print(f"layers = {geo_layers}")

    # Read every layer and stack them into one GeoDataFrame
    layer_gdfs = [
        to_snakecase(gpd.read_file('tmp.gdb.zip', layer = layer))
        for layer in geo_layers
    ]
    gdf = pd.concat(layer_gdfs, axis=0)

    print("invalid geo rows:")
    invalid_geo_cols = ['lyr','projname','geometry','geopoint_comments','creator']
    display(gdf[~gdf.geometry.is_valid][invalid_geo_cols])

    print("repeated geos rows:")

    # Count rows that share the same geometry, parent record, project and creator
    repeated_cols = ['geometry','parentglobalid','projname','creator']
    repeated_geo = (gdf
                .groupby(repeated_cols)
                .agg({'editor':'count'})
                .reset_index()
                .rename(columns = {'editor':'total_repeats'})
               )

    repeated_geo = repeated_geo.loc[repeated_geo.total_repeats > 1]
    display(repeated_geo)

    # Keep only valid geometries
    gdf = gdf[gdf.geometry.is_valid].reset_index(drop = True)
    gdf = gdf.drop(columns = ['creationdate', 'editdate'])

    # Drop duplicates
    gdf = gdf.drop_duplicates(subset = repeated_cols)

    # Fill missing floats with 0.0 and missing strings with 'None'
    gdf = gdf.fillna(gdf.dtypes.replace({'float64': 0.0, 'object': 'None'}))

    # Save to GCS
    if save_to_gcs:
        utils.geoparquet_gcs_export(gdf, GCS_PATH, "cleaned_survey123_sample13")

    return gdf
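
The `fillna` call in `open_survey123` picks a fill value per column by mapping each dtype through `gdf.dtypes.replace(...)`. The same idea can be spelled out column by column, which may be easier to read and extend. This is a sketch on toy data that assumes pandas is available. The column names are borrowed from the survey table, but the values are placeholders.

```python
import pandas as pd
from pandas.api import types as ptypes

df = pd.DataFrame(
    {
        "lns": [2.0, None],
        "geopoint_comments": ["Bike Trail", None],
    }
)

# Build an explicit {column: fill value} mapping based on each column's dtype.
fill_values = {}
for col in df.columns:
    if ptypes.is_float_dtype(df[col]):
        fill_values[col] = 0.0
    elif ptypes.is_object_dtype(df[col]) or ptypes.is_string_dtype(df[col]):
        fill_values[col] = "None"

# Columns that are not in the mapping (e.g. a geometry column) are left untouched.
filled = df.fillna(fill_values)
print(filled)
```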

In [14]:
all_results = open_survey123(True)

layers = ['TCEP_SCCP_GeometryIntake_All_Lns', 'TCEP_SCCP_GeometryIntake_All_Pts']
invalid geo rows:


|  | lyr | projname | geometry | geopoint_comments | creator |
|---|---|---|---|---|---|
| 8 | Lns1 | Stockton Channel Viaduct |  |  | larissa.lee_caltrans |
| 9 | Lns1 | Stockton Channel Viaduct |  |  | larissa.lee_caltrans |
| 16 | Lns1 | SR-46 East Antelope Grade Corridor Improvements |  |  | larissa.lee_caltrans |


repeated geos rows:


|  | geometry | parentglobalid | projname | creator | total_repeats |
|---|---|---|---|---|---|
| 28 | POINT Z (-13624669.664 4951704.679 0.000) | {4D60FABF-CDFB-4C4A-870E-DC8F29664447} | Fix 5 Cascade Gateway | larissa.lee_caltrans | 2 |
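
The groupby count reports each repeated key once. To look at every row involved in a repeat before dropping any, `DataFrame.duplicated` with `keep=False` is one option. This is a minimal sketch on toy data, with geometries written as WKT strings so that only pandas is needed. The values are placeholders.

```python
import pandas as pd

rows = pd.DataFrame(
    {
        "geometry": ["POINT (0 0)", "POINT (0 0)", "POINT (1 1)"],
        "parentglobalid": ["{ID-1}", "{ID-1}", "{ID-2}"],
        "projname": ["Project A", "Project A", "Project B"],
        "lyr": ["Pts1_Parent", "Pts3", "Pts2"],
    }
)

key_cols = ["geometry", "parentglobalid", "projname"]

# keep=False marks every member of a duplicated group, not only the later copies.
repeat_mask = rows.duplicated(subset=key_cols, keep=False)
print(rows[repeat_mask])

# drop_duplicates keeps the first row of each group by default.
deduped = rows.drop_duplicates(subset=key_cols)
print(len(deduped))
```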




In [37]:
all_results.groupby(['projname']).agg({'parentglobalid':'nunique'}).sort_values(['parentglobalid'], ascending = False)

| projname | parentglobalid |
|---|---|
| Fix 5 Cascade Gateway | 2 |
| National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project | 2 |
| Otay Mesa East Port of Entry | 2 |
| U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements | 2 |
| Watsonville-Santa Cruz Multimodal Corridor Program | 2 |
| Westbound I-80 Cordelia Commercial Vehicle Enforcement Facility (WB I-80 CCVEF) Project | 2 |
| I-80/San Pablo Dam Road Interchange Improvements - Phase 2 | 1 |
| Konocti Corridor - Segment 2B | 1 |
| SR 58 Mobility Improvements – Location 2 | 1 |
| Stockton Channel Viaduct | 1 |


In [36]:
# all_results.groupby(['projname']).agg({'lyr':'count'}).sort_values('lyr')

In [18]:
# all_results[cols].explore('projname', cmap='tab10', style_kwds = {'weight':5}, height = 400, width = 1000, legend = True)

In [19]:
def preview_one_project(project_name:str):
    """
    Take a look at one project
    """
    one_project = all_results.loc[all_results.projname == project_name]
    map_cols = ['geometry','parentglobalid','geopoint_type','geopoint_type_existing','geopoint_comments']
    display(one_project[map_cols].explore('geopoint_type', cmap='tab10', style_kwds = {'weight':6}, height = 400, width = 1000, legend = True))
    drop_cols = ['parentglobalid','lyr_globalid','editor','shape_length','geometry']
    one_project = one_project.sort_values(by = ['projname']).drop(columns = drop_cols)
    print(f"{len(one_project)} geometries")
    display(one_project)

In [20]:
preview_one_project("National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project")

6 geometries


|  | lyr | projname | lns | pts | ct_district | efis | ea | ppno | geopoint_type | geopoint_type_existing | geopoint_comments | creator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 27 | Pts1_Parent | National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project | 0 | 4 | 7 |  |  |  | ITS | ITS | Singal removal for free flow traffic from east to west. | larissa.lee_caltrans |
| 31 | Pts4 | National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project | 0 | 4 | 7 |  |  |  | Interchange Improvement | Highway (Freight) | Widening Off-ramp termini and 5th leg to existing signalized intersection | larissa.lee_caltrans |
| 32 | Pts4 | National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project | 0 | 4 | 7 |  |  |  | Interchange Improvement | Highway (Freight) | Provide dual left-turn lane on North-Bound approach through re-striping. | larissa.lee_caltrans |
| 35 | Pts3 | National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project | 0 | 4 | 7 |  |  |  | Interchange Improvement | Highway (Freight) | Improving East-bound Merge from Frontage Road | larissa.lee_caltrans |
| 40 | Pts2 | National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project | 0 | 4 | 7 |  |  |  | Interchange Improvement | Highway (Freight) | Improving Collector-Distributor Road & 2-phase signalization | larissa.lee_caltrans |
| 41 | Pts2 | National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project | 0 | 4 | 7 |  |  |  | Interchange Improvement | Highway (Freight) | Improving Collector-Distributor Road & 2-phase signalization | larissa.lee_caltrans |


In [21]:
# preview_one_project('Stockton Channel Viaduct')

In [22]:
cols = ['parentglobalid','projname','geopoint_type','geopoint_type_existing','geopoint_comments','geometry']

In [23]:
def preview_one_geotype_route(project_name:str, geopoint_type:str):
    """
    Preview one geopoint type for one route
    """
    map_cols = ['lyr','lyr_globalid','geopoint_type','geopoint_type_existing','geopoint_comments','geometry']
    one_project = all_results.loc[(all_results.projname == project_name) & (all_results.geopoint_type == geopoint_type)]
    display(one_project[map_cols].explore('lyr_globalid', cmap='tab10', style_kwds = {'weight':6}, height = 400, width = 1000, legend = True))
    

In [24]:
# all_results[cols].loc[all_results.parentglobalid== "{4D60FABF-CDFB-4C4A-870E-DC8F29664447}"]

In [25]:
preview_one_geotype_route('Fix 5 Cascade Gateway','ITS')

In [26]:
preview_one_project('Fix 5 Cascade Gateway')

5 geometries


|  | lyr | projname | lns | pts | ct_district | efis | ea | ppno | geopoint_type | geopoint_type_existing | geopoint_comments | creator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 15 | Lns1 | Fix 5 Cascade Gateway | 2 | 3 | 2 | 215000083 | 0H920 | 3597 |  | Highway (Freight) | South Bound Truck Only Lane | larissa.lee_caltrans |
| 16 | Lns1 | Fix 5 Cascade Gateway | 2 | 3 | 2 | 215000083 | 0H920 | 3597 |  | Highway (Freight) | North Bound Truck Only Lane | larissa.lee_caltrans |
| 28 | Pts1_Parent | Fix 5 Cascade Gateway | 2 | 3 | 2 | 215000083 | 0H920 | 3597 | Highway (Freight) | Highway (Freight) | South Bound locations where Merging or weaving zone begins. | larissa.lee_caltrans |
| 36 | Pts3 | Fix 5 Cascade Gateway | 2 | 3 | 2 | 215000083 | 0H920 | 3597 | Highway (Freight) | Highway (Freight) | North Bound locations where Merging or weaving zone begins. | larissa.lee_caltrans |
| 42 | Pts2 | Fix 5 Cascade Gateway | 2 | 3 | 2 | 215000083 | 0H920 | 3597 | ITS | ITS | Installing advance signing packages in both south and north bound. | larissa.lee_caltrans |


In [27]:
preview_one_project('U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements')

5 geometries


|  | lyr | projname | lns | pts | ct_district | efis | ea | ppno | geopoint_type | geopoint_type_existing | geopoint_comments | creator |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5 | Lns2 | U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements | 2 | 3 | 7 |  | 31780 | 4961 |  | Rail (Passenger) | Double Tracking Project | darleen.mendez |
| 8 | Lns1 | U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements | 2 | 3 | 7 |  | 31780 | 4961 |  |  | Bike Trail | darleen.mendez |
| 23 | Pts1_Parent | U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements | 2 | 3 | 7 |  | 31780 | 4961 | Rail (Passenger) | Rail (Passenger) | Improvements to Metrolink/Surfliner Rail Corridor including a second platform and pedestrian underpass | darleen.mendez |
| 33 | Pts3 | U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements | 2 | 3 | 7 |  | 31780 | 4961 | Rail (Passenger) | Rail (Passenger) | Improvements to Metrolink/Surfliner Rail Corridor including a second platform and pedestrian underpass | darleen.mendez |
| 37 | Pts2 | U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements | 2 | 3 | 7 |  | 31780 | 4961 | Rail (Passenger) | Rail (Passenger) | Improvements to Metrolink/Surfliner Rail Corridor including a second platform and pedestrian underpass | darleen.mendez |


### GCS

In [29]:
test_geoparquet = gpd.read_parquet("gs://calitp-analytics-data/data-analyses/project_prioritization/Survey123_Geo/cleaned_survey123_sample13.parquet")

In [30]:
# test_geoparquet.shape

In [31]:
# test_geoparquet.projname.nunique()

In [33]:
test_geoparquet.explore('projname')

NameError: name 'test_geoparquet' is not defined