## FY23 FTA Bus and Low- and No-Emission Grant Awards Analysis

<b>GH issue:</b> 
* Research Request - Bus Procurement Costs & Awards #897

<b>Data source(s):</b> 
1. https://www.transit.dot.gov/funding/grants/fy23-fta-bus-and-low-and-no-emission-grant-awards
2. https://storymaps.arcgis.com/stories/022abf31cedd438b808ec2b827b6faff

<b>Definitions:</b>  
* <u>Grants for Buses and Bus Facilities Program:</u>
    * 49 U.S.C. 5339(b) makes federal resources available to states and direct recipients to replace, rehabilitate and purchase buses and related equipment and to construct bus-related facilities, including technological changes or innovations to modify low or no emission vehicles or facilities. Funding is provided through formula allocations and competitive grants. 
<br><br>
* <u>Low or No Emission Vehicle Program:</u>
    * 5339(c) provides funding to state and local governmental authorities for the purchase or lease of zero-emission and low-emission transit buses as well as acquisition, construction, and leasing of required supporting facilities.


In [1]:
# import shared_utils
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import norm

# set_option to increase max rows displayed to 300, to see the entire df in one go
pd.set_option("display.max_rows", 300)

## Reading in raw data from gcs

In [2]:
df = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/data-analyses_bus_procurement_cost_fta_press_release_data_csv.csv"
)

## Data Cleaning
1. snake-case column names
2. strip currency formatting ($ and ,) from the funding column
3. separate text from the # of buses column (split at '(')
    a. trim spaces in new col
    b. get rid of () characters in new col
4. trim spaces in other columns?

In [3]:
# snake case columns names via list
new_col = [
    "state",
    "project_sponsor",
    "project_title",
    "description",
    "funding",
    "#_of_buses",
    "project_type",
    "propulsion_type",
    "area_served",
    "congressional_districts",
    "fta_region",
    "bus/low-no_program",
]

df.columns = new_col
df.columns

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       '#_of_buses', 'project_type', 'propulsion_type', 'area_served',
       'congressional_districts', 'fta_region', 'bus/low-no_program'],
      dtype='object')
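The hand-written list above could also be derived from the raw headers. A minimal sketch, assuming headers such as "# of Buses" (the real press-release headers are not shown in this notebook, so `raw_columns` and `to_snake_case` are hypothetical):

```python
# Hypothetical raw headers; the real ones are not shown in this notebook.
raw_columns = ["State", "Project Sponsor", "# of Buses", "Bus/Low-No Program"]


def to_snake_case(name):
    """Trim a header, lower-case it, and join its words with underscores."""
    return "_".join(name.strip().lower().split())


snake_columns = [to_snake_case(col) for col in raw_columns]
print(snake_columns)
```

This keeps characters like `#` and `/` in the names, matching the manual list; it only changes case and whitespace.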

In [4]:
# checking data type of funding col
# checking to see if any values are not numbers
# will need to clean up this col
display(df["funding"].dtype, df.funding.value_counts())

dtype('O')

$5,000,000       3
$6,000,000       2
$3,400,000       2
$104,000,000     1
$4,313,552       1
$3,133,129       1
$3,187,200       1
$3,199,038       1
$3,248,500       1
$3,303,600       1
$3,326,067       1
$3,609,800       1
$3,645,000       1
$3,937,500       1
$4,094,652       1
$4,278,772       1
$4,500,000       1
$4,492,904       1
$2,860,250       1
$4,690,010       1
$4,738,886       1
$5,001,700       1
$5,750,351       1
$5,883,200       1
$5,945,553       1
$6,197,180       1
$6,341,306       1
$6,407,460       1
$6,424,808       1
$6,455,325       1
$2,932,500       1
$2,819,460       1
$103,000,000     1
$1,080,000       1
$233,760         1
$280,800         1
$300,000         1
$320,000         1
$514,002         1
$653,184         1
$723,171         1
$753,118         1
$776,714         1
$945,178         1
$1,006,750       1
$1,010,372       1
$1,055,365       1
$1,145,951       1
$2,359,072       1
$1,162,000       1
$1,200,000       1
$1,276,628       1
$1,280,000  

In [5]:
# clean up funding column: remove "$" and "," as literal characters, then cast to int64
# regex=False keeps "$" from being read as the regex end-of-string anchor
df["funding"] = df["funding"].str.replace("$", "", regex=False)
df["funding"] = df["funding"].str.replace(",", "", regex=False)
df["funding"] = df["funding"].astype("int64")


In [6]:
# checking to see if str.replace worked.
display(df["funding"].dtype, df.head())

dtype('int64')

Unnamed: 0,state,project_sponsor,project_title,description,funding,#_of_buses,project_type,propulsion_type,area_served,congressional_districts,fta_region,bus/low-no_program
0,DC,Washington Metropolitan Area Transit Authority...,Battery-Electric Metrobus Procurement and Elec...,WMATA will receive funding to convert its Cind...,104000000,100(beb),bus/chargers,zero,Large Urban,DC-001 ; MD-004 ; MD-008 ; VA-008 ; VA-011,3,Low-No
1,TX,Dallas Area Rapid Transit (DART),DART CNG Bus Fleet Modernization Project,Dallas Area Rapid Transit will receive funding...,103000000,90 (estimated-CNG buses),bus,low,Large Urban,TX-003 ; TX-004 ; TX-005 ; TX-006 ; TX-024 ; T...,6,Low-No
2,PA,Southeastern Pennsylvania Transportation Autho...,SEPTA Zero-Emission Bus Transition Facility Sa...,The Southeastern Pennsylvania Transportation A...,80000000,0,facility,zero,Large Urban,PA-002 ; PA-003 ; PA-004 ; PA-005,3,Low-No
3,LA,New Orleans Regional Transit Authority,Accelerating Zero-Emissions Mobility for a Res...,The New Orleans Regional Transit Authority wil...,71439261,20 (zero-emission),Bus / Chargers / Equipment,zero,Large Urban,LA-002 ; LA-001,6,Low-No
4,NJ,New Jersey Transit Corporation,Hilton Bus Garage Modernization,New Jersey Transit will receive funding to mod...,47000000,0,facility/chargers,zero,Large Urban,nj-011,2,Bus
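The funding cleanup can also be done in one pass. A minimal sketch on a few values taken from the `value_counts()` output above; the variable names are illustrative, not part of the notebook. Inside a character class `[$,]` the `$` is a literal, so the pattern is safe to use with `regex=True`.

```python
import pandas as pd

# Sample values copied from the funding column output above.
funding_raw = pd.Series(["$104,000,000", "$233,760", "$5,000,000"])

# Remove "$" and "," in one regex pass, then convert to integers.
funding = pd.to_numeric(
    funding_raw.str.replace(r"[$,]", "", regex=True), errors="raise"
).astype("int64")

print(funding.tolist())
```

`errors="raise"` makes any value that is still non-numeric fail loudly instead of silently becoming NaN.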


In [7]:
# remove the spaces in the # of buses column first, THEN split at (
df["#_of_buses"] = df["#_of_buses"].str.replace(" ", "")

In [8]:
# spaces removed, and zeros are kept
df["#_of_buses"].value_counts()

0                                                      34
7(Electric)                                             3
2                                                       3
4(BEBs)                                                 3
20(BEBs)                                                3
2(electric)                                             2
9                                                       2
16(hybridelectric)                                      2
5(CNG)                                                  2
4(cng)                                                  2
6                                                       2
7                                                       2
5                                                       2
10(CNG)                                                 1
4(zeroemissionelectric)                                 1
11(CNGfueled)                                           1
4(hybridelectric)                                       1
25(hybridelect

In [9]:
# splitting the # of buses column into 2, using the ( char as the delimiter
df[["bus_count", "bus_desc"]] = df["#_of_buses"].str.split(pat="(", n=1, expand=True)
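An alternative to splitting at `(` is to capture the leading digits and the remaining text with `str.extract`. This is only a sketch on sample values; it also handles entries with no parentheses (such as `12 battery electric`), but cells that list two counts, like the Charlotte row, would still need manual handling.

```python
import pandas as pd

# Sample values in the style of the "#_of_buses" column.
raw = pd.Series(["100(beb)", "90 (estimated-CNG buses)", "0", "12 battery electric"])

# Leading digits become bus_count; whatever follows becomes bus_desc.
parts = raw.str.extract(r"^\s*(?P<bus_count>\d+)\s*(?P<bus_desc>.*)$")
parts["bus_count"] = parts["bus_count"].astype("int64")

# Strip surrounding parentheses/spaces and turn empty descriptions into NaN.
parts["bus_desc"] = parts["bus_desc"].str.strip("() ")
parts["bus_desc"] = parts["bus_desc"].where(parts["bus_desc"] != "")

print(parts)
```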

In [10]:
# checking col. retained the initial col. and added new columns to the end.
df.columns

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       '#_of_buses', 'project_type', 'propulsion_type', 'area_served',
       'congressional_districts', 'fta_region', 'bus/low-no_program',
       'bus_count', 'bus_desc'],
      dtype='object')

In [11]:
# examining the new bus count col.
# zero values remained the same
# see there are 2 values that are inconsistent.
df.bus_count.value_counts()

0                          34
4                          10
7                           8
6                           8
20                          6
2                           6
5                           6
9                           5
11                          4
3                           4
16                          3
15                          3
10                          3
25                          3
8                           2
39                          2
1                           2
13                          2
56estimated-cutawayvans     1
134                         1
42                          1
50                          1
14                          1
100                         1
37                          1
160                         1
31                          1
12batteryelectric           1
90                          1
18                          1
17                          1
23                          1
69                          1
30        

In [12]:
# function to find the row index of a specific value and column in a dataframe
def find_loc(data, col, val):
    x = data.loc[data[col] == val].index[0]
    return x

In [13]:
loc1 = find_loc(df, "bus_count", "56estimated-cutawayvans")
loc2 = find_loc(df, "bus_count", "12batteryelectric")

In [14]:
display(loc1, loc2)

58

32

In [15]:
# editing the values of the bus count col at specific location
# syntax: df.loc[row_index, column_name] = new_value
df.loc[58, "bus_count"] = 56
df.loc[32, "bus_count"] = 12

In [16]:
# updating values again for bus_desc. same location
df.loc[58, "bus_desc"] = "estimated-cutaway vans (PM- award will not fund 68 buses)"
df.loc[32, "bus_desc"] = "battery electric"

In [17]:
# values updated as intended for bus count and bus desc
display(df.loc[32], df.loc[58])

state                                                                     MN
project_sponsor                                                Metro Transit
project_title              Investments Toward an Electric Future: Metro T...
description                Metro Transit will receive funding to buy batt...
funding                                                             17532900
#_of_buses                                                 12batteryelectric
project_type                                      Bus / Chargers / Equipment
propulsion_type                                                         zero
area_served                                                      Large Urban
congressional_districts           MN-002 ; MN-003 ; MN-004 ; MN-005 ; MN-006
fta_region                                                                 5
bus/low-no_program                                                    Low-No
bus_count                                                                 12

state                                                                     TX
project_sponsor            Texas Department of Transportation on behalf o...
project_title              FY23 Rural Transit Asset Replacement & Moderni...
description                The Texas Department of Transportation will re...
funding                                                              7443765
#_of_buses                 56estimated-cutawayvans(PM-awardwillnotfund68b...
project_type                                                 bus / facilitiy
propulsion_type                                                          low
area_served                                                            Rural
congressional_districts    TX-001 ; TX-002 ; TX-004 ; TX-005 ; TX-006 ; T...
fta_region                                                                 6
bus/low-no_program                                                    Low-No
bus_count                                                                 56

In [18]:
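# confirming via value counts that all values are numeric now.
# 12 is listed twice because the column is still object dtype:
# one row holds the int 12 (set above) and another holds the string "12"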
df.bus_count.value_counts()

0      34
4      10
7       8
6       8
20      6
2       6
5       6
9       5
11      4
3       4
16      3
15      3
10      3
25      3
8       2
39      2
1       2
13      2
56      1
134     1
42      1
50      1
14      1
100     1
37      1
160     1
31      1
12      1
90      1
18      1
17      1
23      1
69      1
30      1
35      1
40      1
12      1
Name: bus_count, dtype: int64
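The duplicate `12` in the output above points to mixed types in `bus_count` (strings from the split, ints from the manual edits). A minimal sketch of converting the whole column to integers, using a small stand-in Series rather than the real data:

```python
import pandas as pd

# Stand-in for bus_count: strings from str.split plus ints assigned by hand.
bus_count = pd.Series(["0", "4", 56, 12, "12"], dtype="object")

# int 12 and str "12" are counted as different labels.
labels_before = len(bus_count.value_counts())

# Convert everything to integers so equal counts collapse into one label.
bus_count_clean = pd.to_numeric(bus_count, errors="raise").astype("int64")
labels_after = len(bus_count_clean.value_counts())

print(labels_before, labels_after)
```

In this notebook the round trip through CSV later has a similar effect, since `pd.read_csv` infers a numeric dtype when every value parses as a number.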

In [19]:
# cleaning the bus desc col.
# removing the ")" as a literal character
df["bus_desc"] = df["bus_desc"].str.replace(")", "", regex=False)


In [20]:
df["bus_desc"].unique()

array(['beb', 'estimated-CNGbuses', None, 'zero-emission', 'cngbuses',
       'BEBs', 'Electric\n16(Hybrid', 'FCEB', 'Electric',
       'FuelCellElectric', 'CNG', 'FuelCell', 'hybrid', 'BEB',
       'battery electric', 'lowemissionCNG', 'cng',
       'BEBsparatransitbuses', 'hybridelectric', 'zeroemissionbuses',
       'dieselelectrichybrids', 'hydrogenfuelcell',
       '2BEBsand4HydrogenFuelCellBuses', '4fuelcell/3CNG',
       'estimated-cutaway vans (PM- award will not fund 68 buses',
       'hybridelectricbuses', 'CNGfueled', 'zeroemissionelectric',
       'hybridelectrics', 'dieselandgas', 'diesel-electrichybrids',
       'propane', 'electric', 'diesel-electric', 'propanebuses',
       '1:CNGbus;2cutawayCNGbuses', 'zeroemission',
       'propanedpoweredvehicles'], dtype=object)

In [21]:
# stripping the values in the bus desc col
df["bus_desc"] = df["bus_desc"].str.strip()

In [22]:
df.bus_desc.unique()

array(['beb', 'estimated-CNGbuses', None, 'zero-emission', 'cngbuses',
       'BEBs', 'Electric\n16(Hybrid', 'FCEB', 'Electric',
       'FuelCellElectric', 'CNG', 'FuelCell', 'hybrid', 'BEB',
       'battery electric', 'lowemissionCNG', 'cng',
       'BEBsparatransitbuses', 'hybridelectric', 'zeroemissionbuses',
       'dieselelectrichybrids', 'hydrogenfuelcell',
       '2BEBsand4HydrogenFuelCellBuses', '4fuelcell/3CNG',
       'estimated-cutaway vans (PM- award will not fund 68 buses',
       'hybridelectricbuses', 'CNGfueled', 'zeroemissionelectric',
       'hybridelectrics', 'dieselandgas', 'diesel-electrichybrids',
       'propane', 'electric', 'diesel-electric', 'propanebuses',
       '1:CNGbus;2cutawayCNGbuses', 'zeroemission',
       'propanedpoweredvehicles'], dtype=object)

In [23]:
# creating a dictionary to add spaces back to the values
new_dict = {
    "beb": "BEB",
    "estimated-CNGbuses": "estimated-CNG buses",
    "cngbuses": "CNG buses",
    "BEBs": "BEB",
    "Electric\n16(Hybrid": "15 electic, 16 hybrid",
    "FuelCellElectric": "fuel cell electric",
    "FuelCell": "fuel cell",
    "lowemissionCNG": "low emission CNG",
    "cng": "CNG",
    "BEBsparatransitbuses": "BEBs paratransit buses",
    "hybridelectric": "hybrid electric",
    "zeroemissionbuses": "zero emission buses",
    "dieselelectrichybrids": "diesel electric hybrids",
    "hydrogenfuelcell": "hydrogen fuel cell",
    "2BEBsand4HydrogenFuelCellBuses": "2 BEBs and 4 hydrogen fuel cell buses",
    "4fuelcell/3CNG": "4 fuel cell / 3 CNG",
    "hybridelectricbuses": "hybrid electric buses",
    "CNGfueled": "CNG fueled",
    "zeroemissionelectric": "zero emission electric",
    "hybridelectrics": "hybrid electrics",
    "dieselandgas": "diesel and gas",
    "diesel-electrichybrids": "diesel-electric hybrids",
    "propanebuses": "propane buses",
    "1:CNGbus;2cutawayCNGbuses": "1:CNGbus ;2 cutaway CNG buses",
    "zeroemission": "zero emission",
    "propanedpoweredvehicles": "propaned powered vehicles",
}

In [24]:
# using new dictionary to replace values in the bus desc col
df.replace({"bus_desc": new_dict}, inplace=True)

In [25]:
# confirming the bus desc values were replaced as intended.
list(df.bus_desc.unique())

['BEB',
 'estimated-CNG buses',
 None,
 'zero-emission',
 'CNG buses',
 '15 electic, 16 hybrid',
 'FCEB',
 'Electric',
 'fuel cell electric',
 'CNG',
 'fuel cell',
 'hybrid',
 'battery electric',
 'low emission CNG',
 'BEBs paratransit buses',
 'hybrid electric',
 'zero emission buses',
 'diesel electric hybrids',
 'hydrogen fuel cell',
 '2 BEBs and 4 hydrogen fuel cell buses',
 '4 fuel cell / 3 CNG',
 'estimated-cutaway vans (PM- award will not fund 68 buses',
 'hybrid electric buses',
 'CNG fueled',
 'zero emission electric',
 'hybrid electrics',
 'diesel and gas',
 'diesel-electric hybrids',
 'propane',
 'electric',
 'diesel-electric',
 'propane buses',
 '1:CNGbus ;2 cutaway CNG buses',
 'zero emission',
 'propaned powered vehicles']

In [26]:
# bus count for row 12 needs to be adjusted to 31 (15 electric + 16 hybrid) instead of 15
df.loc[12, "bus_count"] = 31

In [27]:
# confirming the change
df.loc[12]

state                                                                     NC
project_sponsor            City of Charlotte - Charlotte Area Transit System
project_title              Charlotte Area Transit System's Sustainable Fl...
description                The city of Charlotte will receive funding to ...
funding                                                             30890413
#_of_buses                                          15(Electric)\n16(Hybrid)
project_type                                      Bus / Chargers / Equipment
propulsion_type                                                   Zero / Low
area_served                                                      Large Urban
congressional_districts           NC-008 ; NC-012 ; NC-013 ; NC-014 ; SC-005
fta_region                                                                 4
bus/low-no_program                                                       Bus
bus_count                                                                 31

In [28]:
# using str.lower() on project type
df["project_type"] = df["project_type"].str.lower()

In [29]:
# using str.replace() to remove spaces from project type
df["project_type"] = df["project_type"].str.replace(" ", "")

In [30]:
# confirming lower and replace worked as intended
list(df["project_type"].sort_values(ascending=True).unique())

['\tbus/facility',
 'bus',
 'bus/chargers',
 'bus/chargers/equipment',
 'bus/chargers/other',
 'bus/equipment',
 'bus/facilitiy',
 'bus/facility',
 'bus/facility/chargers',
 'bus/facility/chargers/equipment',
 'bus/facility/equipment',
 'bus/facility/equipment/other',
 'bus/facility/other',
 'bus/other',
 'chargers',
 'chargers/equipment',
 'facilities',
 'facility',
 'facility/chargers',
 'facility/chargers/equipment',
 'facility/equipment']

In [31]:
# some values still need to get adjusted. will use a short dictionary to fix
new_type = {
    "\tbus/facility": "bus/facility",
    "bus/facilitiy": "bus/facility",
    "facilities": "facility",
}

In [32]:
# using replace() with the dictionary to replace keys in project type col
# syntax df.replace({'bus_desc': new_dict}, inplace=True)
df.replace({"project_type": new_type}, inplace=True)

In [33]:
# double checking to ensure the dictionary replace worked.
list(df["project_type"].sort_values(ascending=True).unique())

['bus',
 'bus/chargers',
 'bus/chargers/equipment',
 'bus/chargers/other',
 'bus/equipment',
 'bus/facility',
 'bus/facility/chargers',
 'bus/facility/chargers/equipment',
 'bus/facility/equipment',
 'bus/facility/equipment/other',
 'bus/facility/other',
 'bus/other',
 'chargers',
 'chargers/equipment',
 'facility',
 'facility/chargers',
 'facility/chargers/equipment',
 'facility/equipment']
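The stray `\tbus/facility` value survived the first pass because only spaces were removed. A sketch of one way to fold the lower-casing, whitespace removal and typo fixes into a single chain, shown on sample values (the variable names are illustrative):

```python
import pandas as pd

# Sample values in the style of the raw project_type column.
project_type = pd.Series(
    ["\tBus/Facility", "Bus / Chargers / Equipment", "bus / facilitiy", "facilities"]
)

# Whole-value fixes for misspellings and plurals.
typo_fixes = {"bus/facilitiy": "bus/facility", "facilities": "facility"}

cleaned = (
    project_type.str.lower()
    .str.replace(r"\s+", "", regex=True)  # drops spaces, tabs and newlines
    .replace(typo_fixes)
)

print(cleaned.tolist())
```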

In [34]:
## Cleaning up the bus_desc col
list(df.bus_desc.sort_values().unique())

['15 electic, 16 hybrid',
 '1:CNGbus ;2 cutaway CNG buses',
 '2 BEBs and 4 hydrogen fuel cell buses',
 '4 fuel cell / 3 CNG',
 'BEB',
 'BEBs paratransit buses',
 'CNG',
 'CNG buses',
 'CNG fueled',
 'Electric',
 'FCEB',
 'battery electric',
 'diesel and gas',
 'diesel electric hybrids',
 'diesel-electric',
 'diesel-electric hybrids',
 'electric',
 'estimated-CNG buses',
 'estimated-cutaway vans (PM- award will not fund 68 buses',
 'fuel cell',
 'fuel cell electric',
 'hybrid',
 'hybrid electric',
 'hybrid electric buses',
 'hybrid electrics',
 'hydrogen fuel cell',
 'low emission CNG',
 'propane',
 'propane buses',
 'propaned powered vehicles',
 'zero emission',
 'zero emission buses',
 'zero emission electric',
 'zero-emission',
 None]

In [35]:
bus_dict = {
    "BEBs paratransit buses": "BEB",
    "CNG buses": "CNG",
    "CNG fueled": "CNG",
    "Electric": "electrc (not specified)",
    "battery electric": "BEB",
    "diesel electric hybrids": "diesel-electric hybrids",
    "diesel-electric": "diesel-electric hybrids",
    "electric": "electrc (not specified)",
    "estimated-CNG buses": "CNG",
    "fuel cell": "FCEB",
    "fuel cell electric": "FCEB",
    "hybrid": "hybrid electric",
    "hybrid electric buses": "hybrid electric",
    "hybrid electrics": "hybrid electric",
    "low emission CNG": "CNG",
    "propane buses": "propane",
    "propaned powered vehicles": "propane",
    "zero emission": "zero-emission bus (not specified)",
    "zero emission buses": "zero-emission bus (not specified)",
    "zero emission electric": "zero-emission bus (not specified)",
    "zero-emission": "zero-emission bus (not specified)",
}

In [36]:
# replacing values in bus_desc with bus_dict dictionary
df.replace({"bus_desc": bus_dict}, inplace=True)

In [37]:
# list of unique bus desc values reduced.
list(df.bus_desc.unique())

['BEB',
 'CNG',
 None,
 'zero-emission bus (not specified)',
 '15 electic, 16 hybrid',
 'FCEB',
 'electrc (not specified)',
 'hybrid electric',
 'diesel-electric hybrids',
 'hydrogen fuel cell',
 '2 BEBs and 4 hydrogen fuel cell buses',
 '4 fuel cell / 3 CNG',
 'estimated-cutaway vans (PM- award will not fund 68 buses',
 'diesel and gas',
 'propane',
 '1:CNGbus ;2 cutaway CNG buses']
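The reduced list above still contains both `FCEB` and `hydrogen fuel cell`, so a quick check for values outside an expected set may help spot labels the dictionary missed. A minimal sketch; the `expected` set is an assumption made for illustration, not a list defined in this notebook:

```python
import pandas as pd

# Small stand-in for the bus_desc column.
bus_desc = pd.Series(["battery electric", "fuel cell", "hydrogen fuel cell", None, "CNG fueled"])

# Subset of the mapping used above.
bus_dict = {"battery electric": "BEB", "fuel cell": "FCEB", "CNG fueled": "CNG"}

# Assumed set of final labels (hypothetical).
expected = {"BEB", "FCEB", "CNG"}

bus_type = bus_desc.replace(bus_dict)

# Anything left over was not covered by the mapping.
unexpected = sorted(set(bus_type.dropna()) - expected)
print(unexpected)
```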

In [38]:
# rename bus_desc col to bus_type (propulsion_type already exists as a column)
df = df.rename(columns={"bus_desc": "bus_type"})

In [39]:
# confirm column was renamed
df.columns

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       '#_of_buses', 'project_type', 'propulsion_type', 'area_served',
       'congressional_districts', 'fta_region', 'bus/low-no_program',
       'bus_count', 'bus_type'],
      dtype='object')

In [40]:
## checking existing propulsion_type column
list(df.propulsion_type.unique())

['zero',
 'low',
 'Low',
 'Zero',
 'Zero / Low',
 'combined',
 'Traditional',
 'zero/traditional',
 'Zero / Low / Traditional',
 'zero / low',
 'Zero / Traditional',
 'zero/low/traditional',
 'low/traditional',
 'other',
 'Other']

In [41]:
# make values in prop_type col lower case and remove spaces
df["propulsion_type"] = df["propulsion_type"].str.lower()
df["propulsion_type"] = df["propulsion_type"].str.replace(" ", "")

In [42]:
list(df.propulsion_type.unique())

['zero',
 'low',
 'zero/low',
 'combined',
 'traditional',
 'zero/traditional',
 'zero/low/traditional',
 'low/traditional',
 'other']

In [46]:
df.bus_type.unique()

array(['BEB', 'CNG', None, 'zero-emission bus (not specified)',
       '15 electic, 16 hybrid', 'FCEB', 'electrc (not specified)',
       'hybrid electric', 'diesel-electric hybrids', 'hydrogen fuel cell',
       '2 BEBs and 4 hydrogen fuel cell buses', '4 fuel cell / 3 CNG',
       'estimated-cutaway vans (PM- award will not fund 68 buses',
       'diesel and gas', 'propane', '1:CNGbus ;2 cutaway CNG buses'],
      dtype=object)

In [45]:
df.head()

Unnamed: 0,state,project_sponsor,project_title,description,funding,#_of_buses,project_type,propulsion_type,area_served,congressional_districts,fta_region,bus/low-no_program,bus_count,bus_type
0,DC,Washington Metropolitan Area Transit Authority...,Battery-Electric Metrobus Procurement and Elec...,WMATA will receive funding to convert its Cind...,104000000,100(beb),bus/chargers,zero,Large Urban,DC-001 ; MD-004 ; MD-008 ; VA-008 ; VA-011,3,Low-No,100,BEB
1,TX,Dallas Area Rapid Transit (DART),DART CNG Bus Fleet Modernization Project,Dallas Area Rapid Transit will receive funding...,103000000,90(estimated-CNGbuses),bus,low,Large Urban,TX-003 ; TX-004 ; TX-005 ; TX-006 ; TX-024 ; T...,6,Low-No,90,CNG
2,PA,Southeastern Pennsylvania Transportation Autho...,SEPTA Zero-Emission Bus Transition Facility Sa...,The Southeastern Pennsylvania Transportation A...,80000000,0,facility,zero,Large Urban,PA-002 ; PA-003 ; PA-004 ; PA-005,3,Low-No,0,
3,LA,New Orleans Regional Transit Authority,Accelerating Zero-Emissions Mobility for a Res...,The New Orleans Regional Transit Authority wil...,71439261,20(zero-emission),bus/chargers/equipment,zero,Large Urban,LA-002 ; LA-001,6,Low-No,20,zero-emission bus (not specified)
4,NJ,New Jersey Transit Corporation,Hilton Bus Garage Modernization,New Jersey Transit will receive funding to mod...,47000000,0,facility/chargers,zero,Large Urban,nj-011,2,Bus,0,


### New column for bus size type, via keyword list and function
Sizes of interest: cutaway, 40ft, etc.

In [57]:
bus_size = [
    "standard",
    "40 foot",
    "40-foot",
    "40ft",
    "articulated",
    "cutaway",
]

In [58]:
# return the first bus_size keyword found in the description, else "not specified"
def find_bus_size_type(description):
    for keyword in bus_size:
        if keyword in description.lower():
            return keyword
    return "not specified"

In [59]:
df["bus_size_type"] = df["description"].apply(find_bus_size_type)
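A variant of the keyword matcher, sketched under two assumptions: that the spelling variants ("40 foot", "40-foot", "40ft") should share one label, and that a missing description should not raise an error. The name `find_bus_size_label`, the `size_patterns` mapping and the sample descriptions are hypothetical.

```python
import re

# Hypothetical label -> regex mapping; order follows the bus_size list above.
size_patterns = {
    "standard": r"standard",
    "40-foot": r"40[\s-]?(?:foot|ft)",
    "articulated": r"articulated",
    "cutaway": r"cutaway",
}


def find_bus_size_label(description):
    """Return the first matching size label, or 'not specified'."""
    if not isinstance(description, str):  # covers None and NaN
        return "not specified"
    for label, pattern in size_patterns.items():
        if re.search(pattern, description, flags=re.IGNORECASE):
            return label
    return "not specified"


samples = ["will buy 40ft buses", "replace Cutaway vans", None]
print([find_bus_size_label(s) for s in samples])
```

It can be applied the same way as the original, for example `df["description"].apply(find_bus_size_label)`.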

In [60]:
display(df.columns, df.bus_size_type.unique(), df.head())

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       '#_of_buses', 'project_type', 'propulsion_type', 'area_served',
       'congressional_districts', 'fta_region', 'bus/low-no_program',
       'bus_count', 'bus_type', 'bus_size_type'],
      dtype='object')

array(['not specified', 'cutaway'], dtype=object)

Unnamed: 0,state,project_sponsor,project_title,description,funding,#_of_buses,project_type,propulsion_type,area_served,congressional_districts,fta_region,bus/low-no_program,bus_count,bus_type,bus_size_type
0,DC,Washington Metropolitan Area Transit Authority...,Battery-Electric Metrobus Procurement and Elec...,WMATA will receive funding to convert its Cind...,104000000,100(beb),bus/chargers,zero,Large Urban,DC-001 ; MD-004 ; MD-008 ; VA-008 ; VA-011,3,Low-No,100,BEB,not specified
1,TX,Dallas Area Rapid Transit (DART),DART CNG Bus Fleet Modernization Project,Dallas Area Rapid Transit will receive funding...,103000000,90(estimated-CNGbuses),bus,low,Large Urban,TX-003 ; TX-004 ; TX-005 ; TX-006 ; TX-024 ; T...,6,Low-No,90,CNG,not specified
2,PA,Southeastern Pennsylvania Transportation Autho...,SEPTA Zero-Emission Bus Transition Facility Sa...,The Southeastern Pennsylvania Transportation A...,80000000,0,facility,zero,Large Urban,PA-002 ; PA-003 ; PA-004 ; PA-005,3,Low-No,0,,not specified
3,LA,New Orleans Regional Transit Authority,Accelerating Zero-Emissions Mobility for a Res...,The New Orleans Regional Transit Authority wil...,71439261,20(zero-emission),bus/chargers/equipment,zero,Large Urban,LA-002 ; LA-001,6,Low-No,20,zero-emission bus (not specified),not specified
4,NJ,New Jersey Transit Corporation,Hilton Bus Garage Modernization,New Jersey Transit will receive funding to mod...,47000000,0,facility/chargers,zero,Large Urban,nj-011,2,Bus,0,,not specified


In [107]:
## new column for extracted_propulsion_type
propulsion_list = [
    "battery-electric",
    "Battery electric",
    "Battery-Electric",
    "Fuel cell electric",
    "Wired electric",
    "hydrogen fuel cell",
    "cng",
    "CNG",
    "Propane",
    "conventional",
    "electric hybrid",
    "Compressed natural gas",
    "Hybrid",
    "Hybrid electric",
    "Hybrid-electric",
    "Zero emission",
    "Zero-emission"
]

In [108]:
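# return the first propulsion_list keyword found in the description
# (case-insensitive match; the label keeps the keyword's own casing),
# else "not specified"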
def find_propulsion_type(description):
    for keyword in propulsion_list:
        if keyword.lower() in description.lower():
            return keyword
    return "not specified"

In [115]:
df["extracted_propulsion_type"] = df["description"].apply(find_propulsion_type)

In [116]:
display(
    df.columns,
    df.extracted_propulsion_type.value_counts(),
)

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       '#_of_buses', 'project_type', 'propulsion_type', 'area_served',
       'congressional_districts', 'fta_region', 'bus/low-no_program',
       'bus_count', 'bus_type', 'bus_size_type', 'extracted_propulsion_type'],
      dtype='object')

not specified             40
battery-electric          33
Compressed natural gas    12
Hybrid                    12
Zero-emission              9
Propane                    6
cng                        5
electric hybrid            5
Zero emission              3
Battery electric           3
Fuel cell electric         2
Name: extracted_propulsion_type, dtype: int64
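The counts above split what look like the same propulsion type across several labels (for example `battery-electric` and `Battery electric`, or `cng` and `Compressed natural gas`). One possible way to consolidate them is to map each keyword to a shared label. This is a sketch only: the label names, `propulsion_keywords` and `find_propulsion_label` are hypothetical choices, not values from the notebook.

```python
# Hypothetical (keyword, label) pairs; more specific keywords come first.
propulsion_keywords = [
    ("battery-electric", "battery electric"),
    ("battery electric", "battery electric"),
    ("fuel cell", "fuel cell electric"),
    ("compressed natural gas", "CNG"),
    ("cng", "CNG"),
    ("propane", "propane"),
    ("hybrid", "hybrid"),
    ("zero-emission", "zero-emission (not specified)"),
    ("zero emission", "zero-emission (not specified)"),
]


def find_propulsion_label(description):
    """Return the label of the first keyword found, or 'not specified'."""
    if not isinstance(description, str):  # covers None and NaN
        return "not specified"
    text = description.lower()
    for keyword, label in propulsion_keywords:
        if keyword in text:
            return label
    return "not specified"


samples = [
    "funding to buy Battery-Electric buses",
    "buy compressed natural gas buses",
    "replace older vehicles with CNG buses",
    "modernize a maintenance facility",
]
labels = [find_propulsion_label(s) for s in samples]
print(labels)
```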

In [117]:
df[df['extracted_propulsion_type']=='not specified']

Unnamed: 0,state,project_sponsor,project_title,description,funding,#_of_buses,project_type,propulsion_type,area_served,congressional_districts,fta_region,bus/low-no_program,bus_count,bus_type,bus_size_type,extracted_propulsion_type
9,OH,METRO Regional Transit Authority,Akron METRO RTA Maintenance and Operations Fac...,The METRO Regional Transit Authority will rece...,37808113,0,facility,low,Large Urban,OH-013,5,Bus,0,,not specified,not specified
15,CA,North County Transit District (NCTD),Accelerate Clean Transit (ACT),The North County Transit District will receive...,29330243,23(FCEB),bus,zero,Large Urban,CA-049 ; CA-050,9,Low-No,23,FCEB,not specified,not specified
22,IA,City of Iowa City,Iowa City Zero-Emission Transit Operations Mai...,Iowa City will receive funding to buy electric...,23280546,4(BEBs),bus/facility/chargers,zero,Small Urban,ia-001,7,Low-No,4,BEB,not specified,not specified
24,GA,Georgia State University,"College Town, Downtown: Transitioning to an Al...",Georgia State University's Panther Express wil...,22286745,18,bus/facility/chargers,zero,Large Urban,GA-005,4,Low-No,18,,not specified,not specified
35,FL,City of Ocala,Electric Bus Vehicle Purchase and Expansion of...,The city of Ocala's SunTran transit system wil...,16166822,31,bus/facility/chargers/equipment,zero,Small Urban,FL-003 ; FL-006,4,Low-No,31,,cutaway,not specified
36,SC,South Carolina Department of Transportation on...,SCDOT Vehicle Replacement Project,The South Carolina Department of Transportatio...,15423904,160,bus,traditional,Rural,SC-All ; SC-001 ; SC-002 ; SC-003 ; SC-004 ; S...,4,Bus,160,,not specified,not specified
39,IL,Illinois Department of Transportation on behal...,Illinois DOT Statewide Paratransit Vehicle Rep...,The Illinois Department of Transportation will...,12600000,134,bus,traditional,statewide,IL-002 ; IL-011 ; IL-012 ; IL-013 ; IL-014 ; I...,5,Bus,134,,cutaway,not specified
42,KY,Kentucky Transportation Cabinet on behalf of 1...,Consolidated Proposal for 10 Transit Agencies ...,The Kentucky Transportation Cabinet will recei...,11570906,42,bus/facility/other,traditional,Rural,KY-001 ; KY-002 ; KY-003 ; KY-004 ; KY-005,4,Bus,42,,not specified,not specified
44,MI,Michigan Department of Transportation on behal...,Transit Facility Repair and Expansion Project ...,The Michigan Department of Transportation will...,10700000,0,facility,zero/traditional,Rural,MI-004 ; MI-006 ; MI-007,5,Bus,0,,not specified,not specified
50,TX,Brazos Transit District,Getting to Zero - Brazos Transit District's Ze...,The Brazos Transit District will receive fundi...,9650646,11,bus/chargers,zero,Small Urban,tx-010,6,Bus,11,,not specified,not specified


## Exporting cleaned data to GCS

In [118]:
# saving to GCS as csv
df.to_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_bus_cost_clean.csv"
)

## Reading in cleaned data from GCS

In [None]:
bus_cost = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_bus_cost_clean.csv"
)

In [None]:
# confirming cleaned data shows as expected.
display(bus_cost.shape, type(bus_cost), bus_cost.columns)

In [None]:
# drop unnecessary columns
bus_cost = bus_cost.drop(["Unnamed: 0", "congressional_districts"], axis=1)

In [None]:
# confirming columns dropped as intended.
# fewer columns (17 to 15)
display(bus_cost.shape, bus_cost.columns)

## Cost per Bus, per Transit Agency dataframe

In [None]:
only_bus = bus_cost[bus_cost["bus_count"] > 0]
only_bus.head()

In [None]:
cost_per_bus = (
    only_bus.groupby("project_sponsor")
    .agg({"funding": "sum", "bus_count": "sum"})
    .reset_index()
)

In [None]:
cost_per_bus["cost_per_bus"] = (
    cost_per_bus["funding"] / cost_per_bus["bus_count"]
).astype("int64")
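The filter, groupby and division steps above can be sketched end to end on toy data. The agency names and amounts below are made up for illustration; this version uses named aggregation and rounds before casting, whereas `astype("int64")` alone truncates.

```python
import pandas as pd

# Toy awards table (made-up values).
awards = pd.DataFrame(
    {
        "project_sponsor": ["Agency A", "Agency A", "Agency B", "Agency C"],
        "funding": [1_000_000, 500_000, 2_000_000, 750_000],
        "bus_count": [2, 1, 4, 0],
    }
)

# Keep only awards that include buses, so the division never hits zero.
only_bus_demo = awards[awards["bus_count"] > 0]

cost_per_bus_demo = (
    only_bus_demo.groupby("project_sponsor", as_index=False)
    .agg(funding=("funding", "sum"), bus_count=("bus_count", "sum"))
    .assign(
        cost_per_bus=lambda d: (d["funding"] / d["bus_count"]).round().astype("int64")
    )
)

print(cost_per_bus_demo)
```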

In [None]:
cost_per_bus.dtypes

In [None]:
cost_per_bus

In [None]:
## export cost_per_bus df to gcs
cost_per_bus.to_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_cost_per_bus.csv"
)

## Cost per bus, stats analysis

In [None]:
# read in fta cost per bus csv
cost_per_bus = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_cost_per_bus.csv"
)

In [None]:
display(cost_per_bus.shape, cost_per_bus.head())

## Initial Summary Stats

### Summary Stats

In [None]:
# top level analysis

bus_cost.agg({"project_title": "count", "funding": "sum", "bus_count": "sum"})

In [None]:
# start of agg. by project_type

bus_cost.groupby("project_type").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)
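
Grouping by `project_type` while also counting `project_type` gives a result whose index and one of its columns share the same name, so a later `reset_index()` would raise an error. A minimal sketch of an alternative, assuming pandas named aggregation and a made-up `toy` frame: each output column gets an explicit name, and the project count is taken from `project_title` instead of the grouping column.

```python
import pandas as pd

# made-up rows that mimic the shape of bus_cost (not FTA data)
toy = pd.DataFrame(
    {
        "project_title": ["Project 1", "Project 2", "Project 3"],
        "project_type": ["bus", "bus", "facility"],
        "funding": [1_000_000, 3_000_000, 5_000_000],
        "bus_count": [2, 4, 0],
    }
)

# named aggregation: output_name=(input_column, aggregation)
by_type = (
    toy.groupby("project_type")
    .agg(
        project_count=("project_title", "count"),
        funding=("funding", "sum"),
        bus_count=("bus_count", "sum"),
    )
    .reset_index()
)

print(by_type)
```

The same pattern would apply to the groupings by program and by state.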

In [None]:
# agg by program

bus_cost.groupby("bus/low-no_program").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# agg by state, by funding
bus_cost.groupby("state").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
).sort_values(by="funding", ascending=False)

### Projects with bus purchases

In [None]:
# df of only projects with a bus count
only_bus = bus_cost[bus_cost["bus_count"] > 0]

In [None]:
display(only_bus.shape, only_bus.columns)

In [None]:
# agg by propulsion type
only_bus["propulsion_type"].value_counts()

In [None]:
only_bus.project_type.value_counts()

In [None]:
# of the rows with bus_count > 0, what are the project types?
bus_agg = only_bus.groupby("project_type").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# new column that calculates `cost per bus`
bus_agg["cost_per_bus"] = (bus_agg["funding"] / bus_agg["bus_count"]).astype("int64")

In [None]:
bus_agg

### Projects with no buses

In [None]:
no_bus = bus_cost[bus_cost["bus_count"] < 1]

In [None]:
no_bus["project_type"].value_counts()

## Overall Summary

In [None]:
project_count = bus_cost.project_title.count()
fund_sum = bus_cost.funding.sum()
bus_count_sum = bus_cost.bus_count.sum()
overall_cost_per_bus = (fund_sum) / (bus_count_sum)
bus_program_count = bus_cost["bus/low-no_program"].value_counts()

projects_with_bus = only_bus.project_title.count()
projects_with_bus_funds = only_bus.funding.sum()
# stored under its own name so the cost_per_bus DataFrame is not overwritten
avg_cost_per_bus = projects_with_bus_funds / only_bus["bus_count"].sum()

In [None]:
summary = f"""
Top-level observations:
- {project_count} projects awarded
- ${fund_sum:,.2f} awarded
- {bus_count_sum} buses to be purchased
- ${overall_cost_per_bus:,.2f} overall cost per bus

Projects include some mix of buses, facilities and equipment, making it difficult to disaggregate the actual bus cost.

Of the {project_count} projects awarded, {projects_with_bus} projects included buses. The remainder were facilities, chargers and equipment.

Projects with bus purchases:
- {projects_with_bus} projects
- ${projects_with_bus_funds:,.2f} awarded to projects that purchase buses
- ${avg_cost_per_bus:,.2f} cost per bus
"""

In [None]:
print(summary)
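
The cost-per-bus figures in the summary are pooled ratios (total funding divided by total buses), whereas the per-agency `cost_per_bus` column gives every agency equal weight when it is averaged. The two numbers generally differ. A small sketch with two hypothetical agencies (made-up values, not FTA data):

```python
import pandas as pd

# made-up awards for two hypothetical agencies
toy = pd.DataFrame(
    {
        "project_sponsor": ["Agency A", "Agency B"],
        "funding": [10_000_000, 1_000_000],
        "bus_count": [10, 2],
    }
)

# pooled ratio: every bus counts equally
pooled = toy["funding"].sum() / toy["bus_count"].sum()

# mean of per-agency ratios: every agency counts equally
per_agency = toy["funding"] / toy["bus_count"]
mean_of_ratios = per_agency.mean()

print(f"pooled cost per bus: ${pooled:,.2f}")
print(f"mean of per-agency cost per bus: ${mean_of_ratios:,.2f}")
```

Here the larger award dominates the pooled figure, so it may help to state which of the two measures is meant when quoting a single "cost per bus".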

In [None]:
# per-agency cost-per-bus values to plot
cost_per_bus_values = cost_per_bus["cost_per_bus"]

# Calculate mean and standard deviation
mean_value = cost_per_bus_values.mean()
std_deviation = cost_per_bus_values.std()

# Plot histogram
plt.hist(cost_per_bus_values, bins=30, color="skyblue", edgecolor="black", alpha=0.7)

# Add vertical lines for mean and standard deviation
plt.axvline(mean_value, color="red", linestyle="dashed", linewidth=2, label="Mean")
plt.axvline(
    mean_value + std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean + 1 Std Dev",
)
plt.axvline(
    mean_value - std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean - 1 Std Dev",
)

# Set labels and title
plt.xlabel("cost_per_bus")
plt.ylabel("Frequency")
plt.title("Histogram of cost_per_bus with Mean and Std Dev Lines")
plt.legend()

# Show the plot
plt.show()

In [None]:
import matplotlib.ticker as mticker

# per-agency cost-per-bus values to plot
cost_per_bus_values = cost_per_bus["cost_per_bus"]

# Calculate mean and standard deviation
mean_value = cost_per_bus_values.mean()
std_deviation = cost_per_bus_values.std()

# Plot histogram
plt.hist(cost_per_bus_values, bins=20, color="skyblue", edgecolor="black", alpha=0.7)

# Add vertical lines for mean and standard deviation
plt.axvline(mean_value, color="red", linestyle="dashed", linewidth=2, label="Mean")
plt.axvline(
    mean_value + std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean + 1 Std Dev",
)
plt.axvline(
    mean_value - std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean - 1 Std Dev",
)

# Set labels and title
plt.xlabel("Cost per Bus (USD)")
plt.ylabel("Frequency")
plt.title("Histogram of Cost per Bus with Mean and Std Dev Lines")
plt.legend()

# Format x-axis ticks as USD
plt.gca().xaxis.set_major_formatter(mticker.StrMethodFormatter("${x:,.0f}"))

# Show the plot
plt.show()
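
The dashed lines in the histogram can also be used to list which values fall outside one standard deviation of the mean. The sketch below uses made-up cost values (the `values` Series and the `lower`/`upper` names are illustrative) and the same `mean()` and `std()` calls as the plot. It is only a rough screen: with a right-skewed distribution the lower bound can drop below zero, so it would never flag a low value.

```python
import pandas as pd

# made-up per-agency cost-per-bus values, with one large value
values = pd.Series([400_000, 500_000, 600_000, 2_500_000])

mean_value = values.mean()
std_deviation = values.std()  # pandas uses the sample standard deviation (ddof=1)

lower = mean_value - std_deviation
upper = mean_value + std_deviation

# values outside the mean +/- 1 std dev band
outside = values[(values < lower) | (values > upper)]

print(f"band: {lower:,.0f} to {upper:,.0f}")
print(outside.tolist())
```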