## Filter multiple feeds
Evan: <i>I'm also seeing multiple feeds in the District Digest Map. I don't mind them, but it may be helpful to try to filter for just Public Currently Operating Fixed Route or Regional Subfeed.</i>
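
A minimal sketch of the filter Evan describes, assuming the feed classification is available as a column. The column name `feed_category` and the toy rows are hypothetical: the operator profile columns printed later in this notebook do not include such a field, so it would have to be merged in from wherever that classification lives.

```python
import pandas as pd

# Toy stand-in for the operator table. `feed_category` is a HYPOTHETICAL
# column used only for illustration; it is not in the real operator profiles.
feeds = pd.DataFrame(
    {
        "name": ["Feed A Schedule", "Feed A GMV Schedule", "Feed B Schedule"],
        "feed_category": [
            "Public Currently Operating Fixed Route",
            "Other",
            "Regional Subfeed",
        ],
    }
)

# Categories to keep, taken from the wording of the request.
KEEP_CATEGORIES = [
    "Public Currently Operating Fixed Route",
    "Regional Subfeed",
]

# Keep only rows whose category is in the allow-list.
filtered = feeds.loc[feeds.feed_category.isin(KEEP_CATEGORIES)].reset_index(drop=True)
print(filtered)
```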

### Relevant Links
* https://github.com/cal-itp/data-analyses/issues/1240

In [1]:
import geopandas as gpd
import merge_data
import numpy as np
import pandas as pd
from segment_speed_utils import gtfs_schedule_wrangling, helpers, time_series_utils
from shared_utils import catalog_utils, rt_dates, rt_utils
from update_vars import GTFS_DATA_DICT, RT_SCHED_GCS, SCHED_GCS

In [2]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [3]:
OPERATOR_FILE = GTFS_DATA_DICT.digest_tables.operator_profiles
OPERATOR_ROUTE = GTFS_DATA_DICT.digest_tables.operator_routes_map

In [4]:
operator_df = pd.read_parquet(
    f"{RT_SCHED_GCS}{OPERATOR_FILE}.parquet",
)

# Deduplicating on name instead of schedule_gtfs_dataset_key keeps only the
# most recent row per feed, so LA Metro does not retain extraneous rows from
# dates when its keys changed.
operator_df = (
    operator_df.sort_values(["service_date", "name"], ascending=[False, True])
    .drop_duplicates(subset=["name"])
    .reset_index(drop=True)
)
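
To illustrate why deduplicating on `name` rather than on the key leaves one most-recent row per feed, here is a small sketch on made-up rows. The feed names, keys, and dates are invented, and `latest_per` is a hypothetical helper that repeats the sort and drop pattern used above.

```python
import pandas as pd

# Invented data: "Feed A Schedule" changed keys between two service dates.
toy = pd.DataFrame(
    {
        "schedule_gtfs_dataset_key": ["key_old", "key_new", "key_b"],
        "name": ["Feed A Schedule", "Feed A Schedule", "Feed B Schedule"],
        "service_date": pd.to_datetime(["2024-10-16", "2024-11-13", "2024-11-13"]),
    }
)


def latest_per(df: pd.DataFrame, subset: list) -> pd.DataFrame:
    """Sort newest first, then keep the first row seen for each `subset` value."""
    return (
        df.sort_values(["service_date", "name"], ascending=[False, True])
        .drop_duplicates(subset=subset)
        .reset_index(drop=True)
    )


by_name = latest_per(toy, ["name"])
by_key = latest_per(toy, ["schedule_gtfs_dataset_key"])

print(len(by_name))  # one row per feed name
print(len(by_key))   # the stale key survives when deduplicating on the key
```

Deduplicating on the key keeps the outdated `key_old` row, because every key is unique; deduplicating on the name drops it.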

In [5]:
operator_df.service_date.unique()

array(['2024-11-13T00:00:00.000000000', '2024-10-16T00:00:00.000000000',
       '2024-09-18T00:00:00.000000000', '2024-07-17T00:00:00.000000000',
       '2024-06-12T00:00:00.000000000', '2024-05-22T00:00:00.000000000',
       '2024-04-17T00:00:00.000000000', '2024-03-13T00:00:00.000000000',
       '2024-02-14T00:00:00.000000000', '2023-12-13T00:00:00.000000000',
       '2023-11-15T00:00:00.000000000', '2023-08-15T00:00:00.000000000',
       '2023-07-12T00:00:00.000000000', '2023-03-15T00:00:00.000000000'],
      dtype='datetime64[ns]')

In [6]:
operator_df.columns

Index(['schedule_gtfs_dataset_key', 'vp_per_min_agency',
       'spatial_accuracy_agency', 'service_date', 'operator_n_routes',
       'operator_n_trips', 'operator_n_shapes', 'operator_n_stops',
       'operator_n_arrivals', 'operator_route_length_miles',
       'operator_arrivals_per_stop', 'n_downtown_local_routes',
       'n_local_routes', 'n_coverage_routes', 'n_rapid_routes',
       'n_express_routes', 'n_rail_routes', 'name',
       'organization_source_record_id', 'organization_name',
       'caltrans_district', 'counties_served', 'service_area_sq_miles',
       'hq_city', 'uza_name', 'service_area_pop', 'organization_type',
       'primary_uza', 'reporter_type'],
      dtype='object')

In [7]:
len(operator_df)

173

In [20]:
operator_df.name.nunique()

172
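
The row count (173) is one more than the number of unique names (172). `Series.nunique()` skips missing values by default, so one way to track down the gap is to look for rows with a missing or a repeated `name`. The sketch below runs both checks on invented data; it does not establish which case applies to `operator_df`.

```python
import pandas as pd

# Invented data with one missing name.
toy = pd.DataFrame({"name": ["Feed A Schedule", "Feed B Schedule", None]})

n_rows = len(toy)
n_names = toy.name.nunique()                      # missing values are not counted
n_names_with_na = toy.name.nunique(dropna=False)  # missing value counts as one

# Rows that could explain a gap between n_rows and n_names.
missing_names = toy.loc[toy.name.isna()]
repeated_names = toy.loc[toy.name.duplicated(keep=False)]

print(n_rows, n_names, n_names_with_na)
print(missing_names)
print(repeated_names)
```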

In [21]:
operator_df.organization_name.nunique()

157

In [22]:
operator_df[["name", "organization_name", "caltrans_district", "service_date"]]

| | name | organization_name | caltrans_district | service_date |
|---|---|---|---|---|
| 0 | Alhambra Schedule | City of Alhambra | 07 - Los Angeles | 2024-11-13 |
| 1 | Amador Schedule | Amador Regional Transit System | 10 - Stockton | 2024-11-13 |
| 2 | Antelope Valley Transit Authority Schedule | Antelope Valley Transit Authority | 07 - Los Angeles | 2024-11-13 |
| 3 | Arvin Schedule | City of Arvin | 06 - Fresno | 2024-11-13 |
| 4 | Auburn Schedule | City of Auburn | 03 - Marysville | 2024-11-13 |
| 5 | B-Line Schedule | Butte County Association of Governments | 03 - Marysville | 2024-11-13 |
| 6 | Banning Pass Schedule | City of Banning | 08 - San Bernardino | 2024-11-13 |
| 7 | Basin Transit GMV Schedule | Basin Transit | 08 - San Bernardino | 2024-11-13 |
| 8 | Bay Area 511 AC Transit Schedule | Alameda-Contra Costa Transit District | 04 - Oakland | 2024-11-13 |
| 9 | Bay Area 511 BART Schedule | San Francisco Bay Area Rapid Transit District | 04 - Oakland | 2024-11-13 |


### Mountain Transit GMV and Mountain Transit Schedule actually differ!

In [13]:
mountain_subset = ["Mountain Transit GMV Schedule", "Mountain Transit Schedule"]

In [15]:
operator_df.loc[operator_df.name.isin(mountain_subset)].T

| | 96 | 97 |
|---|---|---|
| schedule_gtfs_dataset_key | 0c092a514e4b9ad1427bdacdc67a0091 | 5ca5d244836397b178993c9bdc4dfb00 |
| vp_per_min_agency | 0.00 | 0.00 |
| spatial_accuracy_agency | 0.00 | 0.00 |
| service_date | 2024-11-13 00:00:00 | 2024-11-13 00:00:00 |
| operator_n_routes | 7.00 | 7.00 |
| operator_n_trips | 154.00 | 172.00 |
| operator_n_shapes | 21.00 | 27.00 |
| operator_n_stops | 134.00 | 127.00 |
| operator_n_arrivals | 3580.00 | 2978.00 |
| operator_route_length_miles | 153.64 | 145.58 |


### Desert Roadrunner feeds differ too!

In [18]:
desert_subset = ["Desert Roadrunner GMV Schedule", "Desert Roadrunner Schedule"]

In [19]:
operator_df.loc[operator_df.name.isin(desert_subset)].T

| | 51 | 52 |
|---|---|---|
| schedule_gtfs_dataset_key | 4383eb1cca04093020f1583f57f32d9b | ac9384d5e25378d1898ca522070cef66 |
| vp_per_min_agency | 2.93 | 0.00 |
| spatial_accuracy_agency | 84.83 | 0.00 |
| service_date | 2024-11-13 00:00:00 | 2024-11-13 00:00:00 |
| operator_n_routes | 5.00 | 5.00 |
| operator_n_trips | 54.00 | 45.00 |
| operator_n_shapes | 9.00 | 14.00 |
| operator_n_stops | 39.00 | 161.00 |
| operator_n_arrivals | 472.00 | 1753.00 |
| operator_route_length_miles | 220.10 | 228.79 |


### Tahoe

In [30]:
tahoe_subset = ["Tahoe Transportation District GMV Schedule", "Tahoe Transportation District Schedule"]

In [31]:
operator_df.loc[operator_df.name.isin(tahoe_subset)].T

| | 130 | 131 |
|---|---|---|
| schedule_gtfs_dataset_key | c3499b856c717e5706299664fb1c5261 | 07d3b79f14cec8099119e1eb649f065b |
| vp_per_min_agency | 2.89 | 0.00 |
| spatial_accuracy_agency | 93.59 | 0.00 |
| service_date | 2024-11-13 00:00:00 | 2024-11-13 00:00:00 |
| operator_n_routes | 4.00 | 5.00 |
| operator_n_trips | 121.00 | 132.00 |
| operator_n_shapes | 9.00 | 13.00 |
| operator_n_stops | 117.00 | 123.00 |
| operator_n_arrivals | 2544.00 | 2409.00 |
| operator_route_length_miles | 64.68 | 90.82 |
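
The three comparisons above follow the same pattern, so a small helper could put each pair of feeds side by side and add a difference column. This is a sketch on invented numbers; `compare_feeds` is a hypothetical helper, not something provided by the modules imported in this notebook, and it assumes each name appears exactly once in the table.

```python
import pandas as pd


def compare_feeds(df: pd.DataFrame, names: list, cols: list) -> pd.DataFrame:
    """Transpose two feeds into columns and add first-minus-second differences."""
    side_by_side = df.set_index("name").loc[names, cols].T
    return side_by_side.assign(
        difference=side_by_side[names[0]] - side_by_side[names[1]]
    )


# Invented rows for illustration only.
toy = pd.DataFrame(
    {
        "name": ["Feed A GMV Schedule", "Feed A Schedule", "Feed B Schedule"],
        "operator_n_trips": [10, 12, 99],
        "operator_n_stops": [40, 35, 50],
    }
)

pair = ["Feed A GMV Schedule", "Feed A Schedule"]
comparison = compare_feeds(toy, pair, ["operator_n_trips", "operator_n_stops"])
print(comparison)
```

Only numeric columns should be passed in `cols`, since the difference is computed by subtraction.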


### Two names mapping to one organization: 15 cases
* D1: Redwood Coast Transit Authority - one name is misspelled
* D3: City of Roseville
* D3: Tahoe Transportation District
* D4: Mission Bay Transportation Management Agency
* D4: Presidio Trust
* D5: City of San Luis Obispo
* D7: City of Downey
* D7: City of Lawndale
* D7: Los Angeles County Metropolitan Transportation Authority (this one is ok)
* D8: Basin Transit
* D8: City of Beaumont
* D8: Mountain Area Regional Transit Authority
* D8: Palo Verde Valley Transit Agency
* D8: Victor Valley Transit Authority
* D10: Transit Joint Powers Authority for Merced County
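
One way to pull a list like this programmatically is to count the distinct feed names per organization and keep the organizations with more than one. The sketch below uses invented rows that share three column names with `operator_df`.

```python
import pandas as pd

# Invented rows: Org A and Org C each have two feed names, Org B has one.
toy = pd.DataFrame(
    {
        "caltrans_district": [
            "01 - Eureka",
            "01 - Eureka",
            "02 - Redding",
            "03 - Marysville",
            "03 - Marysville",
        ],
        "organization_name": ["Org A", "Org A", "Org B", "Org C", "Org C"],
        "name": [
            "Feed A Schedule",
            "Feed A Schedulel",
            "Feed B Schedule",
            "Feed C GMV Schedule",
            "Feed C Schedule",
        ],
    }
)

# Count distinct names per organization and collect them for review.
names_per_org = (
    toy.groupby(["caltrans_district", "organization_name"])
    .agg(
        n_names=("name", "nunique"),
        names=("name", lambda s: sorted(s.unique())),
    )
    .reset_index()
)

multi_name_orgs = names_per_org.loc[names_per_org.n_names > 1].reset_index(drop=True)
print(multi_name_orgs)
```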

In [26]:
operator_df.groupby(["caltrans_district", "organization_name", "name"]).agg({"service_date": "nunique"})

| caltrans_district | organization_name | name | service_date |
|---|---|---|---|
| 01 - Eureka | City of Eureka | Humboldt Schedule | 1 |
| 01 - Eureka | Lake Transit Authority | Lake Schedule | 1 |
| 01 - Eureka | Mendocino Transit Authority | Mendocino Schedule | 1 |
| 01 - Eureka | Redwood Coast Transit Authority | Redwood Coast Schedule | 1 |
| 01 - Eureka | Redwood Coast Transit Authority | Redwood Coast Schedulel | 1 |
| 02 - Redding | Lassen Transit Service Agency | Lassen Schedule | 1 |
| 02 - Redding | Modoc Transportation Agency | Sage Stage Schedule | 1 |
| 02 - Redding | Plumas Transit Systems | Plumas Schedule | 1 |
| 02 - Redding | Shasta County | Redding Schedule | 1 |
| 02 - Redding | Siskiyou County | Siskiyou Schedule | 1 |
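
The `Redwood Coast Schedulel` row looks like a misspelling of `Redwood Coast Schedule`. A rough way to flag such near-duplicate names is to compare them pairwise with the standard library's `difflib.SequenceMatcher`. The 0.95 threshold below is an arbitrary assumption that would need tuning, and `near_duplicates` is a hypothetical helper; pairs such as a GMV and a non-GMV feed are legitimately different names and should not be merged on similarity alone.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Names taken from the District 1 rows above.
names = [
    "Humboldt Schedule",
    "Lake Schedule",
    "Mendocino Schedule",
    "Redwood Coast Schedule",
    "Redwood Coast Schedulel",
]

THRESHOLD = 0.95  # assumed cutoff; tune against real data


def near_duplicates(names: list, threshold: float) -> list:
    """Return (name_a, name_b, similarity) for pairs at or above the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        score = SequenceMatcher(None, a, b).ratio()
        if score >= threshold:
            pairs.append((a, b, round(score, 3)))
    return pairs


suspects = near_duplicates(names, THRESHOLD)
print(suspects)
```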
