### Cleaning Up Survey123 Geometry and Saving It to GCS


* Load Monica's GeoJSONs into our GCS bucket?
* It would be great to save a parquet with one row per project element and a common project id.
* https://pypi.org/project/fs-gcsfs/
* Install `fs-gcsfs` (`pip install fs-gcsfs`) and `calitp_data_infra`.
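
One way to get a common project id is to derive it from the project name. The sketch below runs on made-up rows and assumes `projname` identifies a project uniquely; `add_project_id` and the `project_id` column are hypothetical names, not part of the Survey123 schema.

```python
import pandas as pd


def add_project_id(df: pd.DataFrame, name_col: str = "projname") -> pd.DataFrame:
    """Attach an integer id shared by every row with the same project name."""
    out = df.copy()
    # factorize numbers the distinct names in order of first appearance
    codes, _ = pd.factorize(out[name_col])
    out["project_id"] = codes
    return out


# Made-up rows: one row per project element
elements = pd.DataFrame(
    {
        "projname": ["Project A", "Project B", "Project A"],
        "geopoint_type": ["Transit", "Highway", "Bike/Pedestrian"],
    }
)
with_ids = add_project_id(elements)
```

Ids assigned this way depend on row order, so they are only stable if the input is sorted the same way each time.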

### Future Thoughts
* People who enter project data should have ArcGIS accounts so they can edit their submissions.
    * Without an account, people can't delete their old inputs, so a project can end up with duplicate rows: the original with a mistake and a corrected resubmission.
    * Depending on the error, it isn't always easy to tell which row is wrong and throw it out.
* Find a way to let people preview what they entered after submitting, so they can confirm they drew the geographies correctly.
* I had to manually correct one project, which doesn't scale. With the two changes above in place, people could go in and correct their own work.
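
Until accounts are in place, resubmitted projects could at least be flagged for review. This is a rough sketch on made-up rows. It assumes each survey submission gets its own `parentglobalid`, which this notebook does not confirm, and `flag_repeat_submissions` is a hypothetical helper.

```python
import pandas as pd


def flag_repeat_submissions(df: pd.DataFrame) -> pd.DataFrame:
    """Return projects that appear under more than one submission id."""
    submissions = (
        df.groupby("projname")["parentglobalid"]
        .nunique()
        .reset_index(name="n_submissions")
    )
    return submissions.loc[submissions.n_submissions > 1].reset_index(drop=True)


# Made-up rows: Project A was submitted twice, Project B once
rows = pd.DataFrame(
    {
        "projname": ["Project A", "Project A", "Project A", "Project B"],
        "parentglobalid": ["{1}", "{1}", "{2}", "{3}"],
    }
)
flagged = flag_repeat_submissions(rows)
```

This only surfaces candidates. Deciding which submission is the correct one still needs a person to look at it.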


In [1]:
import os

# USE_PYGEOS must be set before geopandas is imported to take effect
os.environ['USE_PYGEOS'] = '0'

import geopandas as gpd
import pandas as pd
from shared_utils import utils, geography_utils



In [2]:
import fiona
from calitp_data_analysis import get_fs
from calitp_data_analysis.sql import to_snakecase

import _utils

fs = get_fs()

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

#### Geometry Scores
* https://stackoverflow.com/questions/64277987/python-geopandas-failing-to-read-misread-750mb-zip-esri-gdb-file-but-not-200mb
* https://fiona.readthedocs.io/en/latest/README.html
* https://fiona.readthedocs.io/en/stable/manual.html

In [4]:
def manual_corrections(gdf: gpd.GeoDataFrame) -> gpd.GeoDataFrame:
    """
    Apply one-off fixes for rows that were entered incorrectly in Survey123.
    """
    # Some Union Pacific rows were entered under the US 101/SR 92 project.
    # Drop the erroneous points in Apple Valley & Turlock.
    union_pac_del = ["{E623C079-0B4F-4420-864F-FFB6BCD37503}", "{073E491B-45D7-45FC-8006-D745AE1C1E17}"]
    gdf = gdf.loc[~gdf.lyr_globalid.isin(union_pac_del)].reset_index(drop = True)

    old_name = "US 101/SR 92 Area Improvements & Multimodal Project"
    new_name = "Union Pacific (Fresno Subdivision) Ceres to Turlock Double Tracking Project"

    # Take a copy of the subset so the rename below does not
    # trigger pandas' SettingWithCopyWarning.
    union_pac = gdf.loc[
        (gdf.projname == old_name) & (gdf.geopoint_type == "Rail (Freight)")
    ].copy()

    # Rename
    union_pac["projname"] = new_name

    # Delete the old, misnamed rows
    globalids = list(union_pac.lyr_globalid.unique())
    gdf = gdf.loc[~gdf.lyr_globalid.isin(globalids)].reset_index(drop = True)

    # Add back corrected rows
    gdf2 = pd.concat([gdf, union_pac])

    return gdf2
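
When renaming values on a filtered subset, as `manual_corrections` does, taking an explicit copy before assigning keeps pandas from raising `SettingWithCopyWarning` and leaves the source frame untouched. A small sketch with made-up rows and placeholder project names:

```python
import pandas as pd

# Made-up rows standing in for the survey results
df = pd.DataFrame(
    {
        "projname": ["Old name", "Old name", "Other project"],
        "geopoint_type": ["Rail (Freight)", "Transit", "Rail (Freight)"],
    }
)

mask = (df.projname == "Old name") & (df.geopoint_type == "Rail (Freight)")

# .copy() makes the subset an independent frame, so assigning to it is unambiguous
subset = df.loc[mask].copy()
subset["projname"] = subset["projname"].str.replace("Old name", "New name", regex=False)
```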

In [5]:
def open_survey123(file: str,
                   save_to_gcs: bool = False,
                   drop_duplicates: bool = True) -> gpd.GeoDataFrame:
    """
    Download a zipped Survey123 geodatabase from GCS, stack its layers,
    report invalid and repeated geometries, and return the cleaned rows.
    """
    # https://gis.stackexchange.com/questions/255138/reading-the-names-of-geodatabase-file-layers-in-python
    # https://cal-itp.slack.com/archives/C02KH3DGZL7/p1691009711659079
    GCS_PATH = f"{_utils.GCS_FILE_PATH}Survey123_Geo/"
    fs.get(f'{GCS_PATH}{file}', 'tmp.gdb.zip')
    geo_layers = fiona.listlayers('tmp.gdb.zip')

    print(f"layers = {geo_layers}")

    # Read every layer and stack them into one GeoDataFrame
    layer_gdfs = [
        to_snakecase(gpd.read_file('tmp.gdb.zip', layer = layer))
        for layer in geo_layers
    ]
    gdf = pd.concat(layer_gdfs, axis=0)

    # Check invalid rows
    print("invalid geo rows:")
    invalid_geo_cols = ['lyr','projname','geometry','geopoint_comments','creator']
    display(gdf[~gdf.geometry.is_valid][invalid_geo_cols])

    # Check duplicates: count rows sharing geometry, submission, project, and creator
    print("repeated geos rows:")
    repeated_cols = ['geometry','parentglobalid','projname','creator']
    repeated_geo = (gdf
                .groupby(repeated_cols)
                .agg({'editor':'count'})
                .reset_index()
                .rename(columns = {'editor':'total_repeats'})
               )

    repeated_geo = repeated_geo.loc[repeated_geo.total_repeats > 1]
    display(repeated_geo)

    # Apply the one-off manual fixes
    gdf = manual_corrections(gdf)

    # Keep only valid geometries
    gdf = gdf[gdf.geometry.is_valid].reset_index(drop = True)
    gdf = gdf.drop(columns = ['creationdate', 'editdate'])

    # Drop duplicates
    if drop_duplicates:
        gdf = gdf.drop_duplicates(subset = repeated_cols)

    # Fill NA: 0.0 for float columns, the string 'None' for object columns
    gdf = gdf.fillna(gdf.dtypes.replace({'float64': 0.0, 'object': 'None'}))

    # Save to GCS
    if save_to_gcs:
        utils.geoparquet_gcs_export(gdf, GCS_PATH, "cleaned_survey123_sample13")

    return gdf
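
The fill step in `open_survey123` chooses a fill value from each column's dtype. The same idea can be written out column by column, which may be easier to read. This is a sketch on made-up data with a hypothetical helper name, `fill_missing_by_dtype`; it is not a drop-in replacement for the line above.

```python
import pandas as pd


def fill_missing_by_dtype(df: pd.DataFrame) -> pd.DataFrame:
    """Fill float64 columns with 0.0 and object columns with the string 'None'."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype == "float64":
            out[col] = out[col].fillna(0.0)
        elif out[col].dtype == object:
            out[col] = out[col].fillna("None")
        # Any other dtype is left untouched
    return out


# Made-up rows with one missing value per column
sample = pd.DataFrame(
    {
        "shape_length": [1.5, float("nan")],
        "geopoint_comments": pd.Series(["widen ramp", None], dtype=object),
    }
)
filled = fill_missing_by_dtype(sample)
```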

In [6]:
all_results = open_survey123("TCEP_SCCP_Score_Geometry_20230808.gdb.zip", True, True)

layers = ['TCEP_SCCP_GeometryIntake_All_Pts_Finals', 'TCEP_SCCP_GeometryIntake_All_Lns_Finals']
invalid geo rows:


| | lyr | projname | geometry | geopoint_comments | creator |
|---|---|---|---|---|---|
| 1 | Lns1 | Stockton Channel Viaduct | | | larissa.lee_caltrans |
| 2 | Lns1 | Stockton Channel Viaduct | | | larissa.lee_caltrans |
| 9 | Lns1 | SR-46 East Antelope Grade Corridor Improvements | | | larissa.lee_caltrans |


repeated geos rows:


| | geometry | parentglobalid | projname | creator | total_repeats |
|---|---|---|---|---|---|
| 107 | POINT Z (-13624669.664 4951704.679 0.000) | {4D60FABF-CDFB-4C4A-870E-DC8F29664447} | Fix 5 Cascade Gateway | larissa.lee_caltrans | 2 |
| 121 | POINT Z (-13047163.582 4094870.102 0.000) | {65C6D65C-3E95-4B31-830B-637093A61A9C} | US 101/SR 92 Area Improvements & Multimodal Project | darleen.mendez | 3 |
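
The repeated-rows check groups on the identifying columns and counts rows per group; dropping duplicates on the same columns then keeps one row per group. A minimal sketch of both steps on made-up data, using WKT strings in a hypothetical `geometry_wkt` column as a stand-in for real geometries:

```python
import pandas as pd

# Made-up rows: the first two are exact repeats of each other
rows = pd.DataFrame(
    {
        "geometry_wkt": ["POINT (0 0)", "POINT (0 0)", "POINT (1 1)"],
        "parentglobalid": ["{1}", "{1}", "{2}"],
        "projname": ["Project A", "Project A", "Project B"],
    }
)
key_cols = ["geometry_wkt", "parentglobalid", "projname"]

# Count rows per combination of key columns, then keep only the repeated ones
repeats = rows.groupby(key_cols).size().reset_index(name="total_repeats")
repeats = repeats.loc[repeats.total_repeats > 1]

# Keep the first row of each repeated group
deduped = rows.drop_duplicates(subset=key_cols)
```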




In [7]:
# all_results[cols].explore('projname', cmap='tab10', style_kwds = {'weight':5}, height = 400, width = 1000, legend = True)

In [8]:
def preview_one_project(project_name:str):
    """
    Take a look at one project.
    """
    one_project = all_results.loc[all_results.projname == project_name]
    map_cols = ['geometry','lyr_globalid','geopoint_type','geopoint_type_existing','geopoint_comments']
    display(one_project[map_cols].explore('geopoint_type', cmap='tab10', style_kwds = {'weight':6}, height = 400, width = 1000, legend = True))
    drop_cols = ['parentglobalid','creator','lyr','geometry','editor','shape_length']
    one_project = one_project.sort_values(by = ['projname']).drop(columns = drop_cols)
    print(f"{len(one_project)} geometries")
    display(one_project)

In [9]:
preview_one_project("US 101/SR 92 Area Improvements & Multimodal Project")

6 geometries


| | lyr_globalid | projname | lns | pts | ct_district | efis | ea | ppno | geopoint_type | geopoint_type_existing | geopoint_comments |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 16 | {8293F1BA-5CCB-4E53-895C-82EEF251BA0F} | US 101/SR 92 Area Improvements & Multimodal Project | 5 | 1 | 4 | | 2Q800 | 0668D | Transit | Highway | Covert an existing Caltrans-owned Park and Ride Lot into a San Mateo County Transit District mobility hub |
| 128 | {12CBAA93-5EBC-437C-88A3-6D29CF226A92} | US 101/SR 92 Area Improvements & Multimodal Project | 5 | 1 | 4 | | 2Q800 | 0668D | Interchange Improvement | Highway | widening of the existing loop connector from westbound (WB) SR 92 to southbound (SB) US 101 |
| 155 | {03463629-02FF-439D-8C00-1E5BD0F727FC} | US 101/SR 92 Area Improvements & Multimodal Project | 5 | 1 | 4 | | 2Q800 | 0668D | Interchange Improvement | Highway | elimination of the inside lane merge between SB US 101 ramp and eastbound (EB) SR 92 |
| 170 | {5651DF55-8F15-4E13-8836-4AA2205ADC4E} | US 101/SR 92 Area Improvements & Multimodal Project | 5 | 1 | 4 | | 2Q800 | 0668D | Interchange Improvement | Highway | realign the Fashion Island Boulevard exit ramp |
| 181 | {5D5601EA-774A-4614-872E-CB7A9DE666FB} | US 101/SR 92 Area Improvements & Multimodal Project | 5 | 1 | 4 | | 2Q800 | 0668D | Interchange Improvement | Highway | widening and realignment of the NB US 101 exit ramp to Hillsdale Boulevard |
| 187 | {477ED69B-7B62-4D43-87D5-C0577DD8C422} | US 101/SR 92 Area Improvements & Multimodal Project | 5 | 1 | 4 | | 2Q800 | 0668D | Class IV Bike Lane | Bike/Pedestrian | construction of a two-way Class IV separated bike facility (or cycle track) along Fashion Island Boulevard and 19th Avenue, as well as pedestrian access improvements at four intersections along the bikeway corridor |


In [10]:
def preview_one_geotype_route(df, project_name:str, geopoint_type:str):
    """
    Preview one geopoint type for one route
    """
    map_cols = ['lyr','lyr_globalid','geopoint_type','geopoint_type_existing','geopoint_comments','geometry']
    one_project = df.loc[(df.projname == project_name) & (df.geopoint_type == geopoint_type)]
    display(one_project[map_cols].explore('lyr_globalid', cmap='tab10', style_kwds = {'weight':6}, height = 400, width = 1000, legend = True))
    print(f"{len(one_project)} total rows")


In [11]:
# preview_one_geotype_route(all_results, "US 101/SR 92 Area Improvements & Multimodal Project","Rail (Freight)")

In [12]:
# preview_one_project('Fix 5 Cascade Gateway')

In [13]:
# preview_one_project('U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements')

### GCS
* Read the exported results back in from GCS and check that they look right.

In [15]:
test_geoparquet = gpd.read_parquet("gs://calitp-analytics-data/data-analyses/project_prioritization/Survey123_Geo/cleaned_survey123_sample13.parquet")

In [16]:
test_geoparquet.projname.nunique()

48
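
To make the read-back check repeatable, the reloaded frame can be compared against the one that was exported. The helper below, `compare_frames`, is a hypothetical sketch that looks only at row counts, column names, and project names; it runs here on made-up data.

```python
import pandas as pd


def compare_frames(original: pd.DataFrame, reloaded: pd.DataFrame) -> dict:
    """Return a few basic checks that a frame survived a save and reload."""
    return {
        "same_row_count": len(original) == len(reloaded),
        "same_columns": sorted(original.columns) == sorted(reloaded.columns),
        "same_projects": set(original["projname"]) == set(reloaded["projname"]),
    }


# Made-up frames standing in for the exported and reloaded results
before = pd.DataFrame({"projname": ["Project A", "Project B"], "ct_district": [4, 7]})
after = before.copy()
checks = compare_frames(before, after)
```

In this notebook the call would be `compare_frames(all_results, test_geoparquet)`.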

In [17]:
# test_geoparquet.shape

In [18]:
test_geoparquet.projname.unique()

array(['U.S. 101 Connected Communities Corridor Rail and Active Transportation Improvements',
       'Otay Mesa East Port of Entry',
       'Watsonville-Santa Cruz Multimodal Corridor Program',
       'Westbound I-80 Cordelia Commercial Vehicle Enforcement Facility (WB I-80 CCVEF) Project',
       'National Highway Freight Network Improvement Program - State Route 47-Seaside Avenue & Navy Way Interchange Improvement Project',
       'Fix 5 Cascade Gateway',
       'Metrolink Lilac to Sycamore Avenue Double Track Project on the San Bernardino Line',
       'Inglewood Transit Connector (ITC) Project',
       'Santa Barbara 101 Multimodal Corridor', 'I-5 Managed Lanes',
       'Autonomous, Zero-Emission, On-Demand Transit Tunnel from the Cucamonga Metrolink Station to Ontario International Airport',
       'I-710/I-5 Flyover Utilities Relocation and Construction',
       'Los Angeles Metro Light Rail Capital, Operational and Rehabilitation Enhancements (CORE) Capacity & System Integration

In [19]:
test_geoparquet.explore('projname', tooltip = ['projname','geopoint_type','geopoint_comments'], cmap='tab10', style_kwds = {'weight':6}, legend = False)