### Cleaning up Survey123 Geometry


* Move Monica's GeoJSONs into our GCS?
* It would be great to save a parquet with one row per project element and a common project ID.
* https://pypi.org/project/fs-gcsfs/
* Install `fs-gcsfs` (`pip install fs-gcsfs`) and `calitp_data_infra`.
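
A minimal sketch of the "one row per project element with a common project ID" idea, using made-up rows. The helper name `add_project_id` and the `project_id` column are assumptions for illustration, not part of the Survey123 data or of `shared_utils`.

```python
import pandas as pd

# Toy rows standing in for Survey123 project elements (values are made up).
elements = pd.DataFrame(
    {
        "projname": ["Project A", "Project B", "Project A", "Project B", "Project C"],
        "geopoint_type": ["Transit", "Highway", "Highway", "Rail (Freight)", "Transit"],
    }
)


def add_project_id(df: pd.DataFrame, name_col: str = "projname") -> pd.DataFrame:
    """Return a copy with a `project_id` column shared by all rows of a project.

    Hypothetical helper: the id is just an integer code per distinct name.
    """
    out = df.copy()
    # factorize numbers each distinct name in order of first appearance
    codes, _ = pd.factorize(out[name_col])
    out["project_id"] = codes
    return out


with_ids = add_project_id(elements)
print(with_ids)
```

Each element keeps its own row, and rows from the same project share an ID. A table like this could then be written out as parquet, for example with the `utils.geoparquet_gcs_export` call used later in this notebook when the rows carry geometry.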

In [1]:
import os
# Set before importing geopandas so it uses Shapely instead of PyGEOS
os.environ['USE_PYGEOS'] = '0'

import geopandas as gpd
import pandas as pd
from shared_utils import utils, geography_utils

In the next release, GeoPandas will switch to using Shapely by default, even if PyGEOS is installed. If you only have PyGEOS installed to get speed-ups, this switch should be smooth. However, if you are using PyGEOS directly (calling PyGEOS functions on geometries from GeoPandas), this will then stop working and you are encouraged to migrate from PyGEOS to Shapely 2.0 (https://shapely.readthedocs.io/en/latest/migration_pygeos.html).
  import geopandas as gpd


In [2]:
import fiona
from calitp_data_analysis import get_fs
from calitp_data_analysis.sql import to_snakecase

import _utils

fs = get_fs()

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

#### Geometry Scores
* https://stackoverflow.com/questions/64277987/python-geopandas-failing-to-read-misread-750mb-zip-esri-gdb-file-but-not-200mb
* https://fiona.readthedocs.io/en/latest/README.html
* https://fiona.readthedocs.io/en/stable/manual.html

In [4]:
def open_survey123(file: str,
                   save_to_gcs: bool = False,
                   drop_duplicates: bool = True):
    """
    Read every layer of a zipped Survey123 geodatabase in GCS,
    display invalid and repeated geometries, and return the cleaned rows.
    """
    # https://gis.stackexchange.com/questions/255138/reading-the-names-of-geodatabase-file-layers-in-python
    GCS_PATH = f"{_utils.GCS_FILE_PATH}Survey123_Geo/"
    # Download a local copy so the layers can be listed and read
    fs.get(f'{GCS_PATH}{file}', 'tmp.gdb.zip')
    geo_layers = fiona.listlayers('tmp.gdb.zip')

    print(f"layers = {geo_layers}")

    # Read each layer and stack them into one GeoDataFrame
    layers = [to_snakecase(gpd.read_file('tmp.gdb.zip', layer=layer)) for layer in geo_layers]
    gdf = pd.concat(layers, axis=0)

    print("invalid geo rows:")
    invalid_geo_cols = ['lyr', 'projname', 'geometry', 'geopoint_comments', 'creator']
    display(gdf[~gdf.geometry.is_valid][invalid_geo_cols])

    print("repeated geos rows:")

    # Rows that share all of these columns are treated as repeats
    repeated_cols = ['geometry', 'parentglobalid', 'projname', 'creator']
    repeated_geo = (gdf
                .groupby(repeated_cols)
                .agg({'editor': 'count'})
                .reset_index()
                .rename(columns = {'editor': 'total_repeats'})
               )

    repeated_geo = repeated_geo.loc[repeated_geo.total_repeats > 1]
    display(repeated_geo)

    # Keep only valid geometries
    gdf = gdf[gdf.geometry.is_valid].reset_index(drop = True)
    gdf = gdf.drop(columns = ['creationdate', 'editdate'])

    # Drop duplicates, keeping the first row of each repeated group
    if drop_duplicates:
        gdf = gdf.drop_duplicates(subset = repeated_cols)

    # Fill NA by column dtype: 0.0 for floats, 'None' for strings
    gdf = gdf.fillna(gdf.dtypes.replace({'float64': 0.0, 'object': 'None'}))

    # Save to GCS
    if save_to_gcs:
        utils.geoparquet_gcs_export(gdf, GCS_PATH, "cleaned_survey123_sample13")

    return gdf
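
The repeat check and de-duplication step in `open_survey123` can be tried on toy data. This is a sketch under stated assumptions: the rows are made up, and geometries are plain WKT strings so that only pandas is needed. It uses `.size()` rather than counting `editor`, because `count` skips rows where `editor` is missing.

```python
import pandas as pd

# Toy rows (made up); geometries are WKT strings so the sketch needs only pandas.
rows = pd.DataFrame(
    {
        "geometry": ["POINT (0 0)", "POINT (0 0)", "POINT (1 1)", "POINT (0 0)"],
        "parentglobalid": ["{A}", "{A}", "{B}", "{A}"],
        "projname": ["P1", "P1", "P2", "P1"],
        "creator": ["x", "x", "y", "x"],
    }
)
key_cols = ["geometry", "parentglobalid", "projname", "creator"]

# Count how many rows share each key combination and keep only the repeats.
repeats = (
    rows.groupby(key_cols)
    .size()
    .reset_index(name="total_repeats")
    .query("total_repeats > 1")
)

# Keep the first row of each repeated group.
deduped = rows.drop_duplicates(subset=key_cols)

print(repeats)
print(f"{len(rows)} rows before, {len(deduped)} rows after")
```

Note that `groupby` drops groups whose keys contain missing values by default, so rows with a null `parentglobalid` or `creator` would not show up in the repeat report.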

In [5]:
all_results = open_survey123("TCEP_SCCP_Score_Geometry_20230808.gdb.zip", True, True)

layers = ['TCEP_SCCP_GeometryIntake_All_Pts_Finals', 'TCEP_SCCP_GeometryIntake_All_Lns_Finals']
invalid geo rows:


index,lyr,projname,geometry,geopoint_comments,creator
1,Lns1,Stockton Channel Viaduct,,,larissa.lee_caltrans
2,Lns1,Stockton Channel Viaduct,,,larissa.lee_caltrans
9,Lns1,SR-46 East Antelope Grade Corridor Improvements,,,larissa.lee_caltrans


repeated geos rows:


index,geometry,parentglobalid,projname,creator,total_repeats
107,POINT Z (-13624669.664 4951704.679 0.000),{4D60FABF-CDFB-4C4A-870E-DC8F29664447},Fix 5 Cascade Gateway,larissa.lee_caltrans,2
121,POINT Z (-13047163.582 4094870.102 0.000),{65C6D65C-3E95-4B31-830B-637093A61A9C},US 101/SR 92 Area Improvements & Multimodal Project,darleen.mendez,3




In [6]:
# all_results[cols].explore('projname', cmap='tab10', style_kwds = {'weight':5}, height = 400, width = 1000, legend = True)

In [7]:
def preview_one_project(project_name:str):
    """
    Take a look at one project.
    """
    one_project = all_results.loc[all_results.projname == project_name]
    map_cols = ['geometry','parentglobalid','geopoint_type','geopoint_type_existing','geopoint_comments']
    display(one_project[map_cols].explore('geopoint_type', cmap='tab10', style_kwds = {'weight':6}, height = 400, width = 1000, legend = True))
    drop_cols = ['parentglobalid','creator','lyr','lyr_globalid','editor','shape_length']
    one_project = one_project.sort_values(by = ['projname']).drop(columns = drop_cols)
    print(f"{len(one_project)} geometries")
    display(one_project)

In [8]:
preview_one_project("US 101/SR 92 Area Improvements & Multimodal Project")

18 geometries


index,projname,lns,pts,ct_district,efis,ea,ppno,geopoint_type,geopoint_type_existing,geopoint_comments,geometry
16,US 101/SR 92 Area Improvements & Multimodal Project,5,1,4,,2Q800,0668D,Transit,Highway,Covert an existing Caltrans-owned Park and Ride Lot into a San Mateo County Transit District mobility hub,POINT Z (-13614102.878 4516375.886 0.000)
184,US 101/SR 92 Area Improvements & Multimodal Project,5,1,4,,2Q800,0668D,Interchange Improvement,Highway,realign the Fashion Island Boulevard exit ramp,"MULTILINESTRING Z ((-13614527.823 4517273.949 0.000, -13614506.326 4517228.564 0.000, -13614477.065 4517181.388 0.000, -13614426.305 4517108.535 0.000, -13614390.476 4517051.804 0.000, -13614294.332 4516921.622 0.000, -13614250.739 4516849.962 0.000, -13614216.104 4516785.767 0.000, -13614177.885 4516721.572 0.000, -13614159.373 4516665.439 0.000, -13614158.776 4516616.172 0.000, -13614164.747 4516593.331 0.000, -13614176.691 4516563.099 0.000, -13614189.530 4516538.429 0.000, -13614195.203 4516500.621 0.000))"
169,US 101/SR 92 Area Improvements & Multimodal Project,5,1,4,,2Q800,0668D,Interchange Improvement,Highway,elimination of the inside lane merge between SB US 101 ramp and eastbound (EB) SR 92,"MULTILINESTRING Z ((-13612222.234 4517330.001 0.000, -13612451.545 4517143.686 0.000, -13612625.917 4517007.533 0.000, -13612945.997 4516830.772 0.000, -13613149.033 4516728.060 0.000, -13613411.785 4516622.959 0.000, -13613567.048 4516582.352 0.000))"
148,US 101/SR 92 Area Improvements & Multimodal Project,1,10,75,,,,Rail (Freight),Rail (Freight),double track & track upgrades,"MULTILINESTRING Z ((-13463905.122 4520748.554 0.000, -13463710.446 4520528.798 0.000, -13463421.418 4520207.524 0.000, -13463149.112 4519912.524 0.000, -13462731.694 4519440.541 0.000, -13461456.151 4518040.788 0.000, -13460457.692 4516908.564 0.000, -13459382.796 4515738.122 0.000, -13458322.233 4514577.235 0.000, -13458092.922 4514324.037 0.000, -13457605.635 4513760.314 0.000, -13456908.148 4513015.053 0.000, -13456072.118 4512078.699 0.000, -13455240.865 4511166.233 0.000, -13453798.116 4509575.387 0.000, -13452651.561 4508285.512 0.000, -13451359.297 4506818.578 0.000, -13450119.584 4505490.783 0.000, -13449821.002 4505136.963 0.000))"
141,US 101/SR 92 Area Improvements & Multimodal Project,5,1,4,,2Q800,0668D,Interchange Improvement,Highway,widening of the existing loop connector from westbound (WB) SR 92 to southbound (SB) US 101,"MULTILINESTRING Z ((-13614150.939 4516576.188 0.000, -13614148.401 4516592.311 0.000, -13614137.801 4516606.046 0.000, -13614123.021 4516616.496 0.000, -13614105.554 4516620.378 0.000, -13614066.142 4516617.989 0.000, -13614029.117 4516585.742 0.000, -13613935.960 4516480.641 0.000, -13613808.167 4516315.824 0.000, -13613750.839 4516228.638 0.000))"
89,US 101/SR 92 Area Improvements & Multimodal Project,1,10,75,,,,Rail (Freight),Rail (Freight),at grade crossing improvements,POINT Z (-13452393.150 4507996.121 0.000)
87,US 101/SR 92 Area Improvements & Multimodal Project,1,10,75,,,,Rail (Freight),Rail (Freight),at grade crossing improvements,POINT Z (-13453356.538 4509078.730 0.000)
84,US 101/SR 92 Area Improvements & Multimodal Project,1,10,75,,,,,,,POINT Z (-13453752.922 4509509.542 0.000)
83,US 101/SR 92 Area Improvements & Multimodal Project,1,10,75,,,,Rail (Freight),Rail (Freight),at grade crossing improvements,POINT Z (-13047163.582 4094870.102 0.000)
81,US 101/SR 92 Area Improvements & Multimodal Project,1,10,75,,,,Rail (Freight),Rail (Freight),at grade crossing improvements,POINT Z (-13453750.779 4509508.507 0.000)


In [9]:
# preview_one_project('Stockton Channel Viaduct')

In [10]:
cols = ['parentglobalid','projname','geopoint_type','geopoint_type_existing','geopoint_comments','geometry']

In [11]:
# Exclude the row(s) with this lyr_globalid before previewing
test = all_results.loc[all_results.lyr_globalid != "{29560BCE-1830-4FDA-80B8-796363385BF7}"]

In [13]:
def preview_one_geotype_route(df, project_name:str, geopoint_type:str):
    """
    Preview one geopoint type for one project.
    """
    map_cols = ['lyr','lyr_globalid','geopoint_type','geopoint_type_existing','geopoint_comments','geometry']
    one_project = df.loc[(df.projname == project_name) & (df.geopoint_type == geopoint_type)]
    display(one_project[map_cols].explore('lyr_globalid', cmap='tab10', style_kwds = {'weight':6}, height = 400, width = 1000, legend = True))
    print(f"{len(one_project)} total rows")

In [14]:
preview_one_geotype_route(test, "US 101/SR 92 Area Improvements & Multimodal Project","Rail (Freight)")

10 total rows


In [15]:
# preview_one_project('Fix 5 Cascade Gateway')

In [16]:
# preview_one_project('U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements')

In [17]:
# preview_one_project('Fix 5 Cascade Gateway')

### GCS

In [18]:
test_geoparquet = gpd.read_parquet("gs://calitp-analytics-data/data-analyses/project_prioritization/Survey123_Geo/cleaned_survey123_sample13.parquet")

In [19]:
test_geoparquet.columns

Index(['parentglobalid', 'lyr_globalid', 'lyr', 'projname', 'lns', 'pts',
       'ct_district', 'efis', 'ea', 'ppno', 'geopoint_type',
       'geopoint_type_existing', 'geopoint_comments', 'creator', 'editor',
       'geometry', 'shape_length'],
      dtype='object')

In [20]:
test_geoparquet.projname.nunique()

47

In [21]:
test_geoparquet.projname.unique()

array(['U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements',
       'Otay Mesa East Port of Entry',
       'Watsonville-Santa Cruz Multimodal Corridor Program',
       'Westbound I-80 Cordelia Commercial Vehicle Enforcement Facility (WB I-80 CCVEF) Project',
       'National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project',
       'Fix 5 Cascade Gateway',
       'Metrolink Lilac to Sycamore Avenue Double Track Project on the San Bernardino Line',
       'Inglewood Transit Connector (ITC) Project',
       'Santa Barbara 101 Multimodal Corridor', 'I-5 Managed Lanes',
       'Autonomous, Zero-Emission, On-Demand Transit Tunnel from the Cucamonga Metrolink Station to Ontario International Airport',
       'I-710/I-5 Flyover Utilities Relocation and Construction',
       'Los Angeles Metro Light Rail Capital, Operational and Rehabilitation Enhancements (CORE) Capacity & System Integration

In [22]:
# test_geoparquet.shape

In [23]:
# test_geoparquet.projname.nunique()

In [24]:
test_geoparquet.explore('projname', legend = False)