# Research Request - Non-Revenue Vehicle Fleet Sizes #1236

How many non-revenue vehicles do transit agencies have?

Non-revenue vehicles (NRVs), also known as "service vehicles", are excluded from UPT, VRM, and VRH reports because they are not meant to carry passengers or generate revenue.

NRVs are support vehicles used to maintain transit operations (service/maintenance vehicles and other support vehicles).

Service Vehicle Inventory (Form A-35): "Transit agencies are required to report data on service vehicles, or vehicles which do not carry passengers."


In [1]:
import requests
import json
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns # for distplot

from calitp_data_analysis.tables import tbls
from siuba import _, filter, count, collect, show_query, select
from scipy.stats import zscore # find any outliers in the data 
from shared_utils import schedule_rt_utils

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None) 
pd.set_option('display.max_colwidth', None)

GCS_PATH = "gs://calitp-analytics-data/data-analyses/ntd/"

In [2]:
keep_cols=[
    "ntd_id",
    "agency",
    "city",
    "organization_type",
    "reporter_type",
    #"agency_voms",
    "total_revenue_vehicles",
    "total_service_vehicles"
]

#ntd_service_veh = (
#    tbls.mart_ntd_annual_reporting.fct_vehicles_type_count_by_agency()
#    >> filter(
#        _.state == 'CA',
#       _.report_year == 2023
#    )
#    >> collect()
#)[keep_cols].reset_index(drop=True)


In [3]:
#ntd_service_veh.describe() # 0 revenue and service vehicles?

In [4]:
keep_cols_2=[
    "ntd_id",
    "agency",
    "city",
    "organization_type",
    "reporter_type",
    "mode",
    "type_of_service",
    "unlinked_passenger_trips",
    "vehicle_revenue_hours",

]

#ntd_ridership_metrics = (
#    tbls.mart_ntd_annual_reporting.fct_metrics()
#    >> filter(
#        _.state == 'CA',
#        _.report_year == 2023
#    )
#    >> collect()
#)[keep_cols_2].reset_index(drop=True)

## use `dim_annual_service_agencies` ntd_id, UPT, VRM, VRH for agencies?

How is this different from `dim_annual_agency_information`? Are all reporter types included?

description:
>Provides transit agency-wide totals for service data for applicable agencies reporting to the National Transit Database. This view displays the data at a higher level (by agency), based on the "NTD Annual Data - Service (by Mode and Time Period)" dataset. In the years 2015-2021, you can find this data in the "Service" data table on NTD Program website, at https://transit.dot.gov/ntd/ntd-data. In versions of the data tables from before 2014, you can find data on service in the file called "Transit Operating Statistics: Service Supplied and Consumed."

In [5]:
keep_cols_annual_service=[
    "ntd_id",
    "agency",
    "reporter_type",
    "city",
    "primary_uza_name",
    "actual_vehicles_passenger_car_revenue_hours",
    "actual_vehicles_passenger_car_revenue_miles",
    "unlinked_passenger_trips_upt"
]

upt_vrm_vrh=(
    tbls.mart_ntd.dim_annual_service_agencies()
    >> filter(
        _.state == "CA",
        _.report_year == 2023,

    )
    >> collect()
)[keep_cols_annual_service]

upt_vrm_vrh.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209 entries, 0 to 208
Data columns (total 8 columns):
 #   Column                                       Non-Null Count  Dtype  
---  ------                                       --------------  -----  
 0   ntd_id                                       209 non-null    object 
 1   agency                                       209 non-null    object 
 2   reporter_type                                209 non-null    object 
 3   city                                         208 non-null    object 
 4   primary_uza_name                             165 non-null    object 
 5   actual_vehicles_passenger_car_revenue_hours  209 non-null    float64
 6   actual_vehicles_passenger_car_revenue_miles  209 non-null    float64
 7   unlinked_passenger_trips_upt                 209 non-null    float64
dtypes: float64(3), object(5)
memory usage: 13.2+ KB


In [6]:
col_check = ["reporter_type","agency"]

for i in col_check:
    display(upt_vrm_vrh[i].value_counts().head())
    
# Los Angeles County still shows multiple rows, which is consistent with `dim_annual_agency_information`

Reduced Reporter    84
Full Reporter       81
Rural Reporter      44
Name: reporter_type, dtype: int64

Los Angeles County                                10
Amador Transit                                     1
City of Glendale, dba: Beeline Bus/Dial-A-Ride     1
City of Santa Fe Springs                           1
City of South Gate                                 1
Name: agency, dtype: int64

In [7]:
# get unique NTD IDs from table
annual_service_ntd_id = upt_vrm_vrh["ntd_id"].unique().tolist()
type(annual_service_ntd_id)

list

In [8]:
# aggregate UPT, VRH, and VRM by agency

upt_by_agency = upt_vrm_vrh.groupby(["ntd_id","agency"]).agg({
    "unlinked_passenger_trips_upt":"sum",
    "actual_vehicles_passenger_car_revenue_hours":"sum",
    "actual_vehicles_passenger_car_revenue_miles":"sum"
}).reset_index()

upt_by_agency.head()

Unnamed: 0,ntd_id,agency,unlinked_passenger_trips_upt,actual_vehicles_passenger_car_revenue_hours,actual_vehicles_passenger_car_revenue_miles
0,90003,"San Francisco Bay Area Rapid Transit District, dba: SF BART",50764402.0,2724074.0,85233749.0
1,90004,Golden Empire Transit District,3293593.0,289338.0,3924016.0
2,90006,Santa Cruz Metropolitan Transit District,3350026.0,214748.0,2975126.0
3,90008,"City of Santa Monica, dba: Big Blue Bus",7767725.0,416944.0,3920395.0
4,90009,"San Mateo County Transit District, dba: SamTrans",8773845.0,651839.0,7793698.0


## pulling data from NTD `2023 Annual Database Service Vehicle Inventory` 

Revenue Vehicle Inventory: https://www.transit.dot.gov/ntd/data-product/2023-annual-database-revenue-vehicle-inventory
>Contains operating statistics reported by mode and type of service. Categorized by vehicles operated and vehicles available in maximum service by day and time period.

Service Vehicle Inventory: https://www.transit.dot.gov/ntd/data-product/2023-annual-database-service-vehicle-inventory

>Contains operating statistics reported by mode and type of service. Categorized by vehicles operated and vehicles available in maximum service by day and time period.

In [9]:
#rev_url="https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-10/2023%20Revenue%20Vehicle%20Inventory.xlsx"
serv_url= "https://www.transit.dot.gov/sites/fta.dot.gov/files/2024-10/2023%20Service%20Vehicle%20Inventory.xlsx"

#ntd_rev_veh_inv = pd.read_excel(rev_url, engine="openpyxl")
ntd_serv_veh_inv = pd.read_excel(serv_url, engine='openpyxl')

#ntd_rev_veh_inv["NTD ID"] = ntd_rev_veh_inv["NTD ID"].astype(str)
ntd_serv_veh_inv["NTD ID"] = ntd_serv_veh_inv["NTD ID"].astype(str)

display(
    #ntd_rev_veh_inv.info(),
    ntd_serv_veh_inv.info()
)

# need to filter by CA agencies

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17699 entries, 0 to 17698
Data columns (total 21 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   State NTD ID                           1005 non-null   object 
 1   NTD ID                                 17699 non-null  object 
 2   Agency Name                            17699 non-null  object 
 3   Reporter Type                          17699 non-null  object 
 4   Group Plan Sponsor NTD ID              1939 non-null   object 
 5   Group Plan Sponsor Name                1939 non-null   object 
 6   Primary Mode Served                    17699 non-null  object 
 7   Secondary Modes Served                 6105 non-null   object 
 8   Reporting Module                       17699 non-null  object 
 9   Fleet ID                               17699 non-null  int64  
 10  Agency Service Fleet ID                13937 non-null  object 
 11  Se

None

### what are the service vehicles in `service vehicle inventory`?

In [10]:
ntd_serv_veh_inv[
    [
        #"Primary Mode Served",
        "Vehicle Type",
    ]
].value_counts()

Vehicle Type                         
Trucks and other Rubber Tire Vehicles    13984
Automobiles                               2818
Steel Wheel Vehicles                       897
dtype: int64
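
Note that `value_counts()` on `Vehicle Type` counts fleet records (rows), while each row also carries a `Number of Vehicles`. A minimal sketch of the difference, using made-up fleet rows (only the column names come from the inventory sheet):

```python
import pandas as pd

# Hypothetical fleet rows; column names match the Service Vehicle Inventory sheet.
fleets = pd.DataFrame(
    {
        "Vehicle Type": [
            "Automobiles",
            "Automobiles",
            "Trucks and other Rubber Tire Vehicles",
            "Steel Wheel Vehicles",
        ],
        "Number of Vehicles": [3, 1, 12, 2],
    }
)

# Number of fleet records per vehicle type (what value_counts reports).
rows_per_type = fleets["Vehicle Type"].value_counts()

# Number of vehicles per vehicle type (rows weighted by fleet size).
vehicles_per_type = fleets.groupby("Vehicle Type")["Number of Vehicles"].sum()

print(pd.DataFrame({"fleet_rows": rows_per_type, "vehicles": vehicles_per_type}))
```

If the question is about fleet sizes rather than fleet records, the summed version is probably the one to report.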

In [11]:
# Long list: a mix of manufacturer names, specific model names,
# and some arbitrary numbers.

ntd_serv_veh_inv["Service Fleet Name"].value_counts().sample(20)


Procurement                                1
NRV 177                                    1
DODGE Durango GT                           1
N-0712 2022 FORD T350 CARGO VAN            1
2018 Ford F-750                            1
Impala Sedan                               1
2016 FREIGHT VAN                           1
Service Truck 1                            2
913                                        1
Ford Police Interceptor Utility            2
SV                                         7
Retired Cutaway                            1
AD-04 - Chevrolet Express Passenger Van    1
SUV-18-TAHOE                               1
2003FORD-F550                              1
ALL TERRAIN CRANE (2002 Grove GMK3050)     1
2008 Vehicles                              1
NRV 369                                    1
TANDEM AXLE TRACTOR - Asset ID: #7278      1
CHEVROLET_2500HD_2010                      1
Name: Service Fleet Name, dtype: int64

In [12]:
keep_col_0 = [
    "ntd_id",
    "agency_name",
    "reporter_type",
    "city",
    "primary_uza_name"
]

agency_info=(
    tbls.mart_ntd.dim_annual_agency_information()
    >> filter(
        _.state == "CA",
        _.year == 2023,
        _._is_current == True
    )
    >> collect()
)#[keep_col_0]

#agency_info = agency_info[keep_col_0]

display(agency_info.info(), agency_info.head())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 237 entries, 0 to 236
Data columns (total 48 columns):
 #   Column                           Non-Null Count  Dtype              
---  ------                           --------------  -----              
 0   key                              237 non-null    object             
 1   year                             237 non-null    int64              
 2   ntd_id                           237 non-null    object             
 3   state_parent_ntd_id              59 non-null     object             
 4   agency_name                      237 non-null    object             
 5   reporter_acronym                 142 non-null    object             
 6   doing_business_as                120 non-null    object             
 7   division_department              130 non-null    object             
 8   legacy_ntd_id                    152 non-null    object             
 9   reported_by_ntd_id               59 non-null     object             
 10  re

None

Unnamed: 0,key,year,ntd_id,state_parent_ntd_id,agency_name,reporter_acronym,doing_business_as,division_department,legacy_ntd_id,reported_by_ntd_id,reported_by_name,reporter_type,reporting_module,organization_type,subrecipient_type,fy_end_date,original_due_date,address_line_1,address_line_2,p_o__box,city,state,zip_code,zip_code_ext,region,url,fta_recipient_id,ueid,service_area_sq_miles,service_area_pop,primary_uza_code,primary_uza_name,tribal_area_name,population,density,sq_miles,voms_do,voms_pt,total_voms,volunteer_drivers,personal_vehicles,tam_tier,number_of_state_counties,number_of_counties_with_service,state_admin_funds_expended,_valid_from,_valid_to,_is_current
0,c2ca2b69c59b5249648128f17dc7566a,2023,90043,,City of Commerce,CCT,City of Commerce Transit,Transportation,9043.0,,,Full Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,2535 Commerce Way,,,Commerce,CA,90040.0,1410.0,9.0,https://www.ci.commerce.ca.us/city-hall/transportation,1650.0,C1H2LGLYFD63,11.0,12997.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,14.0,,14.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
1,56674e402b54f109186d7c4a4b59f0c4,2023,90008,,City of Santa Monica,BBB,Big Blue Bus,Department of Transportation,9008.0,,,Full Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,1685 Main St,,,Santa Monica,CA,90401.0,3248.0,9.0,http://www.bigbluebus.com/,1664.0,DEDCLBH5GBC3,69.0,1022273.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,124.0,19.0,143.0,,,Tier I (Fixed Route VOMS),,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
2,55bfda29a4a916390c88d363559a71d8,2023,90201,,City of Turlock,,Turlock Transit,Transit,9201.0,,,Full Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,156 S Broadway,,,Turlock,CA,95380.0,5454.0,9.0,www.turlocktransit.com,2715.0,JDYXAB12QLG1,22.0,79203.0,,"Turlock, CA",,79203.0,4688.330579,16.89,,12.0,12.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
3,d4342f7c3020e383f3219174dbd60595,2023,90256,,City of Burbank,,,Community Development-Transportation,,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,275 E Olive Ave,,,Burbank,CA,91502.0,1232.0,9.0,www.burbankbus.org,5566.0,N9Q5C7NS3JS5,17.0,107337.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,5.0,12.0,17.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
4,ce9c6c0fbb6292bf2600098958c1833f,2023,90296,,City of Claremont,,,Community Services/ Transit Services,9296.0,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,207 Harvard Ave,,,Claremont,CA,91711.0,4719.0,9.0,pvtrans.org,2271.0,DGVWH4PSMR98,13.0,36700.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,22.0,22.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True


In [13]:
len(agency_info["agency_name"].unique())

223

In [14]:
ca_ntd_id = agency_info["ntd_id"].unique().tolist()

len(ca_ntd_id)

237

In [15]:
agency_info.value_counts(subset=["agency_name","ntd_id"]).head() # why does Los Angeles County have multiple NTD IDs?

agency_name                                                ntd_id
Access Services                                            90157     1
Los Angeles County Metropolitan Transportation Authority   90154     1
Long Beach Transit                                         90023     1
Los Angeles County                                         90269     1
                                                           90270     1
dtype: int64

In [16]:
agency_info[agency_info["agency_name"]=="Los Angeles County"] # why does Los Angeles County have multiple NTD IDs?

Unnamed: 0,key,year,ntd_id,state_parent_ntd_id,agency_name,reporter_acronym,doing_business_as,division_department,legacy_ntd_id,reported_by_ntd_id,reported_by_name,reporter_type,reporting_module,organization_type,subrecipient_type,fy_end_date,original_due_date,address_line_1,address_line_2,p_o__box,city,state,zip_code,zip_code_ext,region,url,fta_recipient_id,ueid,service_area_sq_miles,service_area_pop,primary_uza_code,primary_uza_name,tribal_area_name,population,density,sq_miles,voms_do,voms_pt,total_voms,volunteer_drivers,personal_vehicles,tam_tier,number_of_state_counties,number_of_counties_with_service,state_admin_funds_expended,_valid_from,_valid_to,_is_current
24,2c0bb79b7b0e76fb6cd2876bd1df031d,2023,90273,,Los Angeles County,LACDPW,,"Department of Public Works, Transit Operations - Florence Firestone",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,https://www.dpw.lacounty.gov/landing/transportation.cfm,5566.0,,7.0,63387.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,2.0,2.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
29,2d73d888530862c1a7aa70cab045c575,2023,90276,,Los Angeles County,LACDPW,,"Department of Public Works, Transit Operations - South Whittier",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,https://www.dpw.lacounty.gov/landing/transportation.cfm,5566.0,,5.0,86883.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,5.0,5.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
76,0ad21d98d72decda036697dee2a39b9b,2023,90275,,Los Angeles County,,,"Department of Public Works, Transit Operations, Lennox MB",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,www.lagobus.info,5566.0,ZLJUW5L9K685,16.0,27897.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,1.0,1.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
87,63aa9ab7eed8d23b41d53975670def51,2023,90269,,Los Angeles County,,LA County Public Works,"Department of Public Works, Transit Operations, Athens MB",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,www.lagobus.info,5566.0,ZLJUW5L9K685,25.0,23159.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,2.0,2.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
106,dfd1d3a27488c76e37af6e8560dc8cc6,2023,90270,,Los Angeles County,LACDPW,,"Department of Public Works, Transit Operations – Avocado Heights",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,,9.0,https://www.dpw.lacounty.gov/landing/transportation.cfm,5566.0,,9.0,15500.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,1.0,1.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
112,85ea5ddb2ea2dadf51acb0eb1d746bfa,2023,90279,,Los Angeles County,,,"Department of Public Works, Transit Operations, Willowbrook et. al. DR",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,http://www.dpw.lacounty.gov/transit/,5566.0,ZLJUW5L9K685,18.0,198052.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,3.0,3.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
116,a960095999f1429e9fe34de5fca45fae,2023,90271,,Los Angeles County,,,"Department of Public Works, Transit Operations, East Los Angeles MB and DR",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,www.lagobus.org,5566.0,ZLJUW5L9K685,8.0,126496.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,16.0,16.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
121,f079fdd0fd15831c3907686316f3c5e2,2023,90274,,Los Angeles County,,,"Department of Public Works, Transit Operations, King Medical Center MB",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,www.lagobus.info,5566.0,ZLJUW5L9K685,49.0,12358.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,2.0,2.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
173,895847015b7bbf8feec32d723cbe9991,2023,90278,,Los Angeles County,,,"Department of Public Works, Transit Operations, Willowbrook MB",,,,Reduced Reporter,Urban,"City, County or Local Government Unit or Department of Transportation",,1688083200000,1698710400000,900 S Fremont Ave,,,Alhambra,CA,91803.0,1331.0,9.0,http://www.dpw.lacounty.gov/transit,5566.0,ZLJUW5L9K685,36.0,24798.0,,"Los Angeles--Long Beach--Anaheim, CA",,12237376.0,7476.284117,1636.83,,3.0,3.0,,,Tier II,,,,2024-11-27 20:30:23.391972+00:00,2098-12-31 23:59:59.999999+00:00,True
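
The rows above suggest an answer: each Los Angeles County NTD ID has a different `division_department`, so the county appears to report each Public Works transit operation separately. A small sketch of how to surface every agency name that maps to more than one NTD ID, using a few rows copied from the output above (the real frame is `agency_info`):

```python
import pandas as pd

# Toy rows copied from the output above; the real frame is `agency_info`.
agency_info_sample = pd.DataFrame(
    {
        "agency_name": ["Los Angeles County", "Los Angeles County", "City of Commerce"],
        "ntd_id": ["90273", "90276", "90043"],
        "division_department": [
            "Department of Public Works, Transit Operations - Florence Firestone",
            "Department of Public Works, Transit Operations - South Whittier",
            "Transportation",
        ],
    }
)

# Count distinct NTD IDs per agency name and keep names with more than one ID.
ids_per_name = agency_info_sample.groupby("agency_name")["ntd_id"].nunique()
multi_id_names = ids_per_name[ids_per_name > 1].index.tolist()

# List the division behind each NTD ID for those names.
multi_id_rows = agency_info_sample.loc[
    agency_info_sample["agency_name"].isin(multi_id_names),
    ["agency_name", "ntd_id", "division_department"],
].sort_values("ntd_id")
print(multi_id_rows)
```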


## filter rev and service list by ca ntd id

In [17]:
## need to filter the rev and serv veh inv sheet by annual service ntd id, then join the upt_vrm_vrh data

# rev and serv veh inv
#ntd_rev_veh_inv
#ntd_serv_veh_inv

#metrics
#upt_vrm_vrh

#ntd ID
#annual_service_ntd_id


#ca_rev_veh = ntd_rev_veh_inv[ntd_rev_veh_inv["NTD ID"].isin(annual_service_ntd_id)]

ca_serv_veh = ntd_serv_veh_inv[ntd_serv_veh_inv["NTD ID"].isin(annual_service_ntd_id)]


display(
    #len(ca_rev_veh),
    len(ca_serv_veh)
)

2741
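
An `isin` filter silently drops any NTD ID that has no inventory rows, so it can be worth listing the IDs that did not match. A minimal sketch with made-up rows (the real objects are `ntd_serv_veh_inv` and `annual_service_ntd_id`; both sides need to be strings for the comparison to work):

```python
import pandas as pd

# Hypothetical inventory rows and ID list, for illustration only.
inventory = pd.DataFrame(
    {"NTD ID": ["90003", "90003", "90004", "10001"], "Number of Vehicles": [5, 2, 1, 4]}
)
wanted_ids = ["90003", "90004", "90999"]

# Keep only rows whose NTD ID is in the list.
ca_rows = inventory[inventory["NTD ID"].isin(wanted_ids)]

# IDs that were requested but have no inventory rows at all.
missing_ids = sorted(set(wanted_ids) - set(ca_rows["NTD ID"]))
print(len(ca_rows), missing_ids)
```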

In [18]:
ca_serv_veh.describe()

Unnamed: 0,Fleet ID,Year of Manufacture,Number of Vehicles,Useful Life Benchmark,Percent Agency Capital Responsibility,Estimated Cost,Year Dollar of the Estimated Cost
count,2741.0,2741.0,2741.0,2741.0,2741.0,2741.0,2741.0
mean,17057.534841,2012.197008,1.942357,11.449106,99.854068,155249.4,2016.551623
std,10193.42177,8.749674,3.290602,5.34162,3.818015,593420.7,6.517969
min,1.0,1958.0,0.0,4.0,0.0,1000.0,1973.0
25%,8508.0,2009.0,1.0,8.0,100.0,28605.0,2014.0
50%,16592.0,2014.0,1.0,10.0,100.0,44640.17,2018.0
75%,25885.0,2018.0,1.0,14.0,100.0,97391.0,2021.0
max,31904.0,2023.0,70.0,43.0,100.0,14175000.0,2023.0


In [19]:
ca_serv_veh.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2741 entries, 14737 to 17698
Data columns (total 21 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   State NTD ID                           77 non-null     object 
 1   NTD ID                                 2741 non-null   object 
 2   Agency Name                            2741 non-null   object 
 3   Reporter Type                          2741 non-null   object 
 4   Group Plan Sponsor NTD ID              111 non-null    object 
 5   Group Plan Sponsor Name                111 non-null    object 
 6   Primary Mode Served                    2741 non-null   object 
 7   Secondary Modes Served                 1042 non-null   object 
 8   Reporting Module                       2741 non-null   object 
 9   Fleet ID                               2741 non-null   int64  
 10  Agency Service Fleet ID                2247 non-null   object 
 11 

In [20]:
# group by agency
ca_serv_veh_agg = ca_serv_veh.groupby([
    "NTD ID",
    "Agency Name",
    "Reporter Type"
])["Number of Vehicles"].sum().reset_index()

ca_serv_veh_agg.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110 entries, 0 to 109
Data columns (total 4 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   NTD ID              110 non-null    object
 1   Agency Name         110 non-null    object
 2   Reporter Type       110 non-null    object
 3   Number of Vehicles  110 non-null    int64 
dtypes: int64(1), object(3)
memory usage: 3.6+ KB


## calculate service vehicle ratios


In [21]:
display(
    upt_by_agency.columns,
    ca_serv_veh_agg.columns
)

Index(['ntd_id', 'agency', 'unlinked_passenger_trips_upt',
       'actual_vehicles_passenger_car_revenue_hours',
       'actual_vehicles_passenger_car_revenue_miles'],
      dtype='object')

Index(['NTD ID', 'Agency Name', 'Reporter Type', 'Number of Vehicles'], dtype='object')

In [22]:
# join ca_serv_veh_agg and upt_by_agency on NTD ID

service_veh_upt = ca_serv_veh_agg.merge(upt_by_agency, left_on="NTD ID", right_on="ntd_id", how="inner", indicator=True)
service_veh_upt.head()

Unnamed: 0,NTD ID,Agency Name,Reporter Type,Number of Vehicles,ntd_id,agency,unlinked_passenger_trips_upt,actual_vehicles_passenger_car_revenue_hours,actual_vehicles_passenger_car_revenue_miles,_merge
0,90003,San Francisco Bay Area Rapid Transit District,Full Reporter,660,90003,"San Francisco Bay Area Rapid Transit District, dba: SF BART",50764402.0,2724074.0,85233749.0,both
1,90004,Golden Empire Transit District,Full Reporter,26,90004,Golden Empire Transit District,3293593.0,289338.0,3924016.0,both
2,90006,Santa Cruz Metropolitan Transit District,Full Reporter,41,90006,Santa Cruz Metropolitan Transit District,3350026.0,214748.0,2975126.0,both
3,90008,City of Santa Monica,Full Reporter,20,90008,"City of Santa Monica, dba: Big Blue Bus",7767725.0,416944.0,3920395.0,both
4,90009,San Mateo County Transit District,Full Reporter,78,90009,"San Mateo County Transit District, dba: SamTrans",8773845.0,651839.0,7793698.0,both
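
With `how="inner"`, the `_merge` indicator can only ever be `both`, so it does not show which agencies failed to match. A hedged sketch of a diagnostic outer merge on made-up stand-ins for `ca_serv_veh_agg` and `upt_by_agency` (the unmatched IDs here are invented):

```python
import pandas as pd

# Hypothetical stand-ins for `ca_serv_veh_agg` and `upt_by_agency`.
vehicles = pd.DataFrame(
    {"NTD ID": ["90003", "90004", "90500"], "Number of Vehicles": [660, 26, 3]}
)
service = pd.DataFrame(
    {
        "ntd_id": ["90003", "90004", "90600"],
        "unlinked_passenger_trips_upt": [50764402.0, 3293593.0, 1000.0],
    }
)

check = vehicles.merge(
    service,
    left_on="NTD ID",
    right_on="ntd_id",
    how="outer",
    indicator=True,
    validate="one_to_one",  # raises MergeError if either key column has duplicates
)

# Counts of rows found on both sides, only on the left, and only on the right.
merge_counts = check["_merge"].value_counts()
print(merge_counts)
```

Rows marked `left_only` or `right_only` are the agencies an inner join would drop.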


In [23]:
# identify zscores
z_score = service_veh_upt[[
    "actual_vehicles_passenger_car_revenue_hours",
    "Number of Vehicles",
    "unlinked_passenger_trips_upt"
]].apply(zscore)

In [24]:
# remove outliers
threshold = 3
service_v_upt_no_outliers = service_veh_upt[(z_score.abs() < threshold).all(axis=1)]


In [25]:
display(
    "initial data",
    service_veh_upt.describe(),
    "outliers removed",
    service_v_upt_no_outliers.describe()
)
# 3 agencies were removed (110 -> 107 rows); the max of each column dropped, while the min values are unchanged

'initial data'

Unnamed: 0,Number of Vehicles,unlinked_passenger_trips_upt,actual_vehicles_passenger_car_revenue_hours,actual_vehicles_passenger_car_revenue_miles
count,110.0,110.0,110.0,110.0
mean,48.4,7502405.0,356456.3,5258683.0
std,167.182436,30451900.0,956712.1,14230650.0
min,0.0,6897.0,2937.0,27187.0
25%,2.0,169721.0,27846.5,382027.0
50%,8.0,658896.5,77465.0,1103403.0
75%,22.75,2693204.0,224784.2,3154542.0
max,1498.0,276302400.0,8220160.0,109467500.0


'outliers removed'

Unnamed: 0,Number of Vehicles,unlinked_passenger_trips_upt,actual_vehicles_passenger_car_revenue_hours,actual_vehicles_passenger_car_revenue_miles
count,107.0,107.0,107.0,107.0
mean,24.504673,3355795.0,233305.9,3360262.0
std,49.946835,8571336.0,455818.7,6199244.0
min,0.0,6897.0,2937.0,27187.0
25%,2.0,162306.5,27758.5,377407.5
50%,7.0,640099.0,67790.0,996506.0
75%,21.0,2674811.0,214450.5,2995482.0
max,356.0,68511360.0,2573209.0,34095950.0
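
The z-score filter used above can be sketched on toy data as follows. The values are made up, and the z-score is computed directly in pandas with `ddof=0`, which is also the default for `scipy.stats.zscore`:

```python
import pandas as pd

# Hypothetical, heavily skewed data: 19 small agencies and one very large one.
df = pd.DataFrame({"Number of Vehicles": list(range(1, 20)) + [1000]})

# Population z-score (ddof=0): distance from the column mean in standard deviations.
z = (df - df.mean()) / df.std(ddof=0)

threshold = 3
# Keep rows where every column is within the threshold.
# .copy() makes the result an independent frame, so adding columns to it later
# does not trigger pandas' SettingWithCopyWarning.
no_outliers = df[(z.abs() < threshold).all(axis=1)].copy()
print(len(df), len(no_outliers))
```

Because the mean and standard deviation are themselves pulled upward by the largest agencies, this rule tends to flag only the most extreme rows in skewed data like this.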


In [26]:
# calculate ratios

# service vehicles per 10,000 UPT and per 10,000 VRH (the same multiplier is applied to both)
multiplier = 10000

# .assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
# pandas raises when columns are added to a filtered slice
service_v_upt_no_outliers = service_v_upt_no_outliers.assign(
    srv_upt_ratio=lambda df: df["Number of Vehicles"] / df["unlinked_passenger_trips_upt"] * multiplier,
    srv_vrh_ratio=lambda df: df["Number of Vehicles"] / df["actual_vehicles_passenger_car_revenue_hours"] * multiplier,
)

service_v_upt_no_outliers.describe()


Unnamed: 0,Number of Vehicles,unlinked_passenger_trips_upt,actual_vehicles_passenger_car_revenue_hours,actual_vehicles_passenger_car_revenue_miles,srv_upt_ratio,srv_vrh_ratio
count,107.0,107.0,107.0,107.0,107.0,107.0
mean,24.504673,3355795.0,233305.9,3360262.0,0.253422,1.633812
std,49.946835,8571336.0,455818.7,6199244.0,0.352802,1.700732
min,0.0,6897.0,2937.0,27187.0,0.0,0.0
25%,2.0,162306.5,27758.5,377407.5,0.050739,0.554195
50%,7.0,640099.0,67790.0,996506.0,0.130647,1.196615
75%,21.0,2674811.0,214450.5,2995482.0,0.264105,2.189939
max,356.0,68511360.0,2573209.0,34095950.0,1.592892,11.103644
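
As a worked example of the two ratios, here is the arithmetic for a single agency, using the Golden Empire Transit District values from the merged output earlier in the notebook. Both ratios use the same multiplier of 10,000:

```python
# Values for Golden Empire Transit District, copied from the merged output above.
service_vehicles = 26
upt = 3_293_593.0  # unlinked passenger trips
vrh = 289_338.0    # vehicle revenue hours
multiplier = 10_000

srv_upt_ratio = service_vehicles / upt * multiplier  # service vehicles per 10,000 UPT
srv_vrh_ratio = service_vehicles / vrh * multiplier  # service vehicles per 10,000 VRH
print(round(srv_upt_ratio, 4), round(srv_vrh_ratio, 4))
```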


## A: Total Number of Non-Revenue Vehicles
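Also known as service vehicles. This total is computed from `service_v_upt_no_outliers`, so it excludes the three agencies removed as outliers.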

In [27]:
total_non_rev_veh = service_v_upt_no_outliers["Number of Vehicles"].sum()
total_non_rev_veh

2622

## A: Number of Non-Revenue vehicles vs GTFS service hours

Is there anything in the warehouse that reports service hours?

If I have to derive this myself, which tables do I need?
- earliest `start time` and latest `stop time` for each `agency`?

Spoke with Amanda about service hours and the GTFS Digest.
- Advised to look at the `service hours` used in the Digest to see if it can be useful.
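
If this had to be derived by hand, the "earliest start, latest stop" idea could look roughly like the sketch below. The table and its column names (`agency`, `trip_start`, `trip_end`) are assumptions for illustration, not fields confirmed to exist in the warehouse:

```python
import pandas as pd

# Hypothetical trip-level rows; `agency`, `trip_start`, and `trip_end` are
# assumed column names, not confirmed warehouse fields.
trips = pd.DataFrame(
    {
        "agency": ["A", "A", "B"],
        "trip_start": pd.to_datetime(
            ["2024-10-16 05:30", "2024-10-16 21:00", "2024-10-16 07:00"]
        ),
        "trip_end": pd.to_datetime(
            ["2024-10-16 06:15", "2024-10-16 22:30", "2024-10-16 19:00"]
        ),
    }
)

# Earliest departure and latest arrival per agency.
span = trips.groupby("agency").agg(
    first_start=("trip_start", "min"),
    last_end=("trip_end", "max"),
)

# Span of service in hours, from the first departure to the last arrival.
span["span_hours"] = (span["last_end"] - span["first_start"]).dt.total_seconds() / 3600
print(span)
```

This measures the span of the service day, which is a different quantity from summed vehicle service hours, so the two should not be mixed in one ratio.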



In [28]:
# url pulled from gtfs digest service hour section
digest_service_hours = "gs://calitp-analytics-data/data-analyses/rt_vs_schedule/digest/total_scheduled_service_hours.parquet"

service_hours = pd.read_parquet(digest_service_hours)

service_hours.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24936 entries, 0 to 24935
Data columns (total 6 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   name                 24936 non-null  object 
 1   month_year           24936 non-null  object 
 2   weekday_weekend      24936 non-null  object 
 3   departure_hour       24936 non-null  Int64  
 4   service_hours        24936 non-null  float64
 5   daily_service_hours  24936 non-null  float64
dtypes: Int64(1), float64(2), object(3)
memory usage: 1.2+ MB


In [29]:
service_hours["name"].value_counts()

Bay Area 511 SamTrans Schedule                          288
Bay Area 511 SFO AirTrain Schedule                      288
LA Metro Bus Schedule                                   288
LAX FlyAway Schedule                                    282
San Diego Schedule                                      276
Bay Area 511 Santa Clara Transit Schedule               273
Bay Area 511 Muni Schedule                              264
Bay Area 511 Marin Schedule                             264
Bay Area 511 AC Transit Schedule                        264
OCTA Schedule                                           264
Anaheim Resort Schedule                                 262
Bay Area 511 Golden Gate Transit Schedule               252
Bay Area 511 BART Schedule                              248
Riverside Schedule                                      244
North County Schedule                                   240
Yolobus Schedule                                        240
Sacramento Schedule                     

In [30]:
#group by name?

schedule_service_hours = service_hours.groupby("name").agg({
    "daily_service_hours":"sum"
}).reset_index()

schedule_service_hours.head()

Unnamed: 0,name,daily_service_hours
0,Alhambra Schedule,285.15
1,Amador Schedule,148.27
2,Amtrak Schedule,35345.12
3,Anaheim Resort Schedule,120397.39
4,Antelope Valley Transit Authority Schedule,3801.1


The service hours table is keyed by schedule name,
so we need to crosswalk agency name/NTD ID to schedule name...again.
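
The crosswalk step amounts to a left merge from schedule name to NTD ID, followed by a check for schedules that did not match. A minimal sketch, assuming a crosswalk with `name_schedule` and `ntd_id` columns; the two schedule rows come from the output above, while the NTD ID is made up:

```python
import pandas as pd

# Service hours by schedule name (two rows copied from the output above).
hours = pd.DataFrame(
    {
        "name": ["Alhambra Schedule", "Amador Schedule"],
        "daily_service_hours": [285.15, 148.27],
    }
)

# Hypothetical crosswalk; the NTD ID below is made up for illustration.
xwalk = pd.DataFrame({"name_schedule": ["Alhambra Schedule"], "ntd_id": ["99999"]})

# A left merge keeps every schedule, leaving ntd_id empty where no match exists.
hours_with_id = hours.merge(
    xwalk, left_on="name", right_on="name_schedule", how="left"
)

# Schedules that still need a crosswalk entry.
unmatched = hours_with_id.loc[hours_with_id["ntd_id"].isna(), "name"].tolist()
print(unmatched)
```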


In [31]:
# via proposed_changes NB

#crosswalk from gtfs dataset key to orgs
gcs_crosswalk = pd.read_parquet(
    "gs://calitp-analytics-data/data-analyses/gtfs_schedule/crosswalk/gtfs_key_organization_2024-10-16.parquet"
)

gcs_crosswalk.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 215 entries, 0 to 214
Data columns (total 30 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   schedule_gtfs_dataset_key          215 non-null    object 
 1   name                               215 non-null    object 
 2   schedule_source_record_id          215 non-null    object 
 3   base64_url                         215 non-null    object 
 4   organization_source_record_id      215 non-null    object 
 5   organization_name                  215 non-null    object 
 6   caltrans_district                  215 non-null    object 
 7   counties_served                    120 non-null    object 
 8   hq_city                            153 non-null    object 
 9   hq_county                          121 non-null    object 
 10  is_public_entity                   121 non-null    object 
 11  is_publicly_operating              121 non-null    object 

In [32]:
keep_cols_xwalk=[
    "schedule_source_record_id",
    "schedule_gtfs_dataset_key",
    "name",
    "base64_url",
    "organization_source_record_id",
    "organization_name",
    "caltrans_district",
    "reporter_type",
    "primary_uza_name",
    "voms_pt",
    "voms_do"
]

# subset and rename in one step (avoids an in-place rename on a slice)
gcs_crosswalk = gcs_crosswalk[keep_cols_xwalk].rename(
    columns={"name": "name_schedule"}
)

# dim_organizations: current orgs that publicly operate fixed-route service
dim_orgs = (tbls.mart_transit_database.dim_organizations()
            >> filter(_._is_current == True,
                      #_.ntd_id.isin(sec_g_ntd_id), # filters to ntd_id from Sec G operators
                      _.public_currently_operating_fixed_route == True
                     )
            >> collect()
           )
                   

# note: this reuses (and overwrites) the keep_cols_2 name from In [4]
keep_cols_2 = [
    "key",
    "source_record_id",
    "name",
    "organization_type",
    "caltrans_district",
    "is_public_entity",
    "ntd_id",
    "reporting_category",
    "public_currently_operating_fixed_route",
]

# subset and rename in one step (avoids an in-place rename on a slice)
dim_orgs = dim_orgs[keep_cols_2].rename(columns={
    "key": "key_orgs",
    "name": "name_orgs",
})

dim_orgs_to_crosswalk = dim_orgs.merge(
    gcs_crosswalk, 
    left_on="source_record_id", 
    right_on="organization_source_record_id", 
    how="left"
)

schedule_feed_xwalk = schedule_rt_utils.get_schedule_gtfs_dataset_key(date="2024-10-16")

orgs_to_feed_xwalk = dim_orgs_to_crosswalk.merge(
    schedule_feed_xwalk,
    left_on="schedule_gtfs_dataset_key",
    right_on="gtfs_dataset_key",
    how="left"
)
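
A left merge like the ones above can leave some organizations without a feed and can also repeat an organization that has more than one feed. The sketch below shows one way to count both cases with the `indicator` option of `DataFrame.merge`. The frames and record IDs are toy values, not the real crosswalk.

```python
import pandas as pd

# Toy organizations table (hypothetical record IDs).
orgs = pd.DataFrame({
    "source_record_id": ["rec1", "rec2", "rec3"],
    "name_orgs": ["Org One", "Org Two", "Org Three"],
})

# Toy feed crosswalk: rec1 has two feeds, rec2 has none.
feeds = pd.DataFrame({
    "organization_source_record_id": ["rec1", "rec1", "rec3"],
    "schedule_gtfs_dataset_key": ["k1", "k2", "k3"],
})

# indicator=True adds a `_merge` column: "both" or "left_only" for a left join.
merged = orgs.merge(
    feeds,
    left_on="source_record_id",
    right_on="organization_source_record_id",
    how="left",
    indicator=True,
)

# Organizations with no matching feed.
n_unmatched = int((merged["_merge"] == "left_only").sum())

# Extra rows created by organizations that match more than one feed.
n_rows_added = len(merged) - len(orgs)

print(merged["_merge"].value_counts())
print(n_unmatched, n_rows_added)
```

Dropping the `_merge` column afterwards keeps the downstream schema unchanged.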

In [33]:
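# note: .info() prints its summary directly and returns None,
# so display() also renders a `None` between the two outputs
display(
    orgs_to_feed_xwalk.info(),
    orgs_to_feed_xwalk.head()
)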

<class 'pandas.core.frame.DataFrame'>
Int64Index: 219 entries, 0 to 218
Data columns (total 22 columns):
 #   Column                                  Non-Null Count  Dtype 
---  ------                                  --------------  ----- 
 0   key_orgs                                219 non-null    object
 1   source_record_id                        219 non-null    object
 2   name_orgs                               219 non-null    object
 3   organization_type                       219 non-null    object
 4   caltrans_district_x                     217 non-null    object
 5   is_public_entity                        219 non-null    bool  
 6   ntd_id                                  179 non-null    object
 7   reporting_category                      209 non-null    object
 8   public_currently_operating_fixed_route  219 non-null    bool  
 9   schedule_source_record_id               201 non-null    object
 10  schedule_gtfs_dataset_key               201 non-null    object
 11  name_s

None

Unnamed: 0,key_orgs,source_record_id,name_orgs,organization_type,caltrans_district_x,is_public_entity,ntd_id,reporting_category,public_currently_operating_fixed_route,schedule_source_record_id,schedule_gtfs_dataset_key,name_schedule,base64_url,organization_source_record_id,organization_name,caltrans_district_y,reporter_type,primary_uza_name,voms_pt,voms_do,gtfs_dataset_key,feed_key
0,1d03fdbc2e4dfd044465b4437e70c9e0,recQRoE5mcCn6kdti,POINT,Independent Agency,01 - Eureka,True,,Other Public Transit,True,recWPANXTwn1u1i2l,0d04ec340550e5a62b031a8e125e6658,Oregon POINT,aHR0cHM6Ly9vcmVnb24tZ3Rmcy50cmlsbGl1bXRyYW5zaXQuY29tL2d0ZnNfZGF0YS9wb2ludC1vci11cy9wb2ludC1vci11cy56aXA=,recQRoE5mcCn6kdti,POINT,01 - Eureka,,,,,0d04ec340550e5a62b031a8e125e6658,54a3956fcd7b59a3db76007d2f53a5dc
1,e8f13022ac8ecff976188979ac903d4a,recKsb5FnJy70up78,Amtrak,Federal Government,03 - Marysville,True,,Other Public Transit,True,,,,,,,,,,,,,
2,f269f25681d3ed293401b3b8027f89a8,recG5aXxDPI645S86,OmniTrans,Independent Agency,08 - San Bernardino,True,90029,Core,True,recofAxeH23aWcuAP,95cb514215c61ca578b01d885f35ec0a,OmniTrans Schedule,aHR0cHM6Ly93d3cub21uaXRyYW5zLm9yZy9nb29nbGUvZ29vZ2xlX3RyYW5zaXQuemlw,recG5aXxDPI645S86,OmniTrans,08 - San Bernardino,Full Reporter,"Riverside--San Bernardino, CA",44.0,99.0,95cb514215c61ca578b01d885f35ec0a,b4192f0e3b1746c45066f5016ced9b5b
3,d4d53f5a85cec17582000a9e67a7d642,reczvlrgxLUDiBgAy,Commute.org,Joint Powers Agency,04 - Oakland,True,,Other Public Transit,True,rec9wl7HVaxR79ayw,c2a40ce92e76ec5beb88c40df3cd3a67,Bay Area 511 Commute.org Schedule,aHR0cHM6Ly9hcGkuNTExLm9yZy90cmFuc2l0L2RhdGFmZWVkcz9vcGVyYXRvcl9pZD1DTQ==,reczvlrgxLUDiBgAy,Commute.org,04 - Oakland,,,,,c2a40ce92e76ec5beb88c40df3cd3a67,7de3f9edc864d6923e3664e3efd2f0ff
4,6cc7a7303acb07b47d534007a7d38130,reczIiFqdL5AXTpm1,Kern County,County,06 - Fresno,True,9R02-91059,Core,True,recQiWHhvE6MxozGV,4e2936d8f27a9bca79289ec062a1691a,Kern Schedule,aHR0cHM6Ly9kYXRhLnRyaWxsaXVtdHJhbnNpdC5jb20vZ3Rmcy9rZXJuY291bnR5LWNhLXVzL2tlcm5jb3VudHktY2EtdXMuemlw,reczIiFqdL5AXTpm1,Kern County,06 - Fresno,Rural Reporter,,40.0,,4e2936d8f27a9bca79289ec062a1691a,11b97ce606a9ff66306db7a2f62d5dee


## A: Number of `Non-Revenue vehicles` vs NTD `ridership (UPT)` (and `VRH`)


In [34]:
# how many agencies report ZERO service vehicles?
service_v_upt_no_outliers[service_v_upt_no_outliers["Number of Vehicles"] == 0.0]["agency"].count()

1

In [35]:
# What are the reporter types of the agencies that report ZERO service vehicles?
#service_v_upt_no_outliers[service_v_upt_no_outliers["Number of Vehicles"] == 0.0]["reporter_type"].value_counts()

In [36]:
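# Total revenue vehicles vs total service vehicles
# NOTE: these two column names come from the commented-out ntd_service_veh query (In [2]).
# They are not columns of service_v_upt_no_outliers, so this cell raises the KeyError in its output.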
service_v_upt_no_outliers[:20].plot(
    x = "agency",
    y = [
        "total_revenue_vehicles", 
        #"unlinked_passenger_trips",
        "total_service_vehicles"
    ],
    kind = "bar",
    #secondary_y = "total_service_vehicles"  
)

KeyError: "None of [Index(['total_revenue_vehicles', 'total_service_vehicles'], dtype='object')] are in the [columns]"
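
One way to catch this kind of KeyError before plotting is to compare the requested column names against the frame's columns first. The helper below is a hypothetical sketch (`split_columns` is not part of the notebook or of pandas) and runs on a toy frame.

```python
import pandas as pd


def split_columns(df, wanted):
    """Return (present, missing) lists for the requested column names."""
    present = [c for c in wanted if c in df.columns]
    missing = [c for c in wanted if c not in df.columns]
    return present, missing


# Toy frame standing in for the real table (illustrative values only).
toy = pd.DataFrame({
    "agency": ["A", "B"],
    "Number of Vehicles": [3, 0],
})

present, missing = split_columns(
    toy, ["total_revenue_vehicles", "Number of Vehicles"]
)

if missing:
    print(f"not in the frame, skipping: {missing}")

# Plot only the columns that actually exist, e.g.:
# toy.plot(x="agency", y=present, kind="bar")
```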

In [None]:
# compare the service-vehicle-to-UPT and service-vehicle-to-VRH ratios

service_v_upt_no_outliers[:20].plot(
    x = "agency",
    y = [
        "srv_upt_ratio", 
        #"unlinked_passenger_trips",
        "srv_vrh_ratio"
    ],
    kind = "bar",
    #secondary_y = "total_service_vehicles"  
)

In [None]:
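# note: sns.distplot is deprecated since seaborn 0.11 (histplot / displot are the replacements)
nonzero_vrh = service_v_upt_no_outliers[service_v_upt_no_outliers["srv_vrh_ratio"] > 0]
sns.distplot(nonzero_vrh["srv_vrh_ratio"]).set_title("Service Vehicle to VRH Ratio")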

In [None]:
nonzero_upt = service_v_upt_no_outliers[service_v_upt_no_outliers["srv_upt_ratio"] > 0]
sns.distplot(nonzero_upt["srv_upt_ratio"]).set_title("Service Vehicle to UPT Ratio")

In [None]:
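def make_scatter(
    data,
    x_ax,
    y_ax
):
    """Scatter plot of x_ax vs. y_ax with a regression line overlaid."""
    chart = alt.Chart(data).mark_point().encode(
        x = x_ax,
        y = y_ax,
        tooltip=[x_ax, y_ax],
    ).properties(title=f"{x_ax} vs. {y_ax}").interactive()

    # layer the fitted line (linear by default) on top of the points
    return chart + chart.transform_regression(x_ax, y_ax).mark_line()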
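
The chart draws the fitted line but does not report its coefficients. A minimal sketch of getting the slope, intercept and Pearson correlation with NumPy is below. The numbers are a toy example chosen to lie exactly on a line, and the column names only mirror the ones used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data lying exactly on y = 2x + 1 (illustrative values only).
toy = pd.DataFrame({
    "Number of Vehicles": [1.0, 2.0, 3.0, 4.0],
    "unlinked_passenger_trips_upt": [3.0, 5.0, 7.0, 9.0],
})

x = toy["Number of Vehicles"].to_numpy()
y = toy["unlinked_passenger_trips_upt"].to_numpy()

# Degree-1 least-squares fit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)

# Pearson correlation coefficient between the two columns.
r = np.corrcoef(x, y)[0, 1]

print(round(slope, 3), round(intercept, 3), round(r, 3))
```

On real data the same calls would run on the filtered frame, after dropping rows where either column is null.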

In [None]:
make_scatter(service_v_upt_no_outliers[service_v_upt_no_outliers["Number of Vehicles"]>0],"Number of Vehicles","unlinked_passenger_trips_upt")

In [None]:
make_scatter(service_v_upt_no_outliers[service_v_upt_no_outliers["Number of Vehicles"]>0],"Number of Vehicles","actual_vehicles_passenger_car_revenue_hours")

In [None]:
service_v_upt_no_outliers.columns.to_list()

In [None]:
make_scatter(service_v_upt_no_outliers[service_v_upt_no_outliers["Number of Vehicles"]>0],"Number of Vehicles","actual_vehicles_passenger_car_revenue_miles")