## FY23 FTA Bus and Low- and No-Emission Grant Awards Analysis

<b>GH issue:</b> 
* Research Request - Bus Procurement Costs & Awards #897

<b>Data source(s):</b> 
1. https://www.transit.dot.gov/funding/grants/fy23-fta-bus-and-low-and-no-emission-grant-awards
2. https://storymaps.arcgis.com/stories/022abf31cedd438b808ec2b827b6faff

<b>Definitions:</b>  
* <u>Grants for Buses and Bus Facilities Program:</u>
    * 49 U.S.C. 5339(b) makes federal resources available to states and direct recipients to replace, rehabilitate and purchase buses and related equipment and to construct bus-related facilities, including technological changes or innovations to modify low or no emission vehicles or facilities. Funding is provided through formula allocations and competitive grants.
<br><br>
* <u>Low or No Emission Vehicle Program:</u>
    * 5339(c) provides funding to state and local governmental authorities for the purchase or lease of zero-emission and low-emission transit buses as well as acquisition, construction, and leasing of required supporting facilities.


In [1]:
import numpy as np
import pandas as pd
import shared_utils

# set_option to increase max rows displayed to 300, to see the entire df in one go
pd.set_option("display.max_rows", 300)

## Reading in raw data from gcs

In [2]:
gcs_path = "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/"
file = "data-analyses_bus_procurement_cost_fta_press_release_data_csv.csv"

fta = pd.read_csv(gcs_path+file)

## Data Cleaning
1. snake-case column names
2. remove currency formatting from funding column (with $ and , )
3. separate text from # of bus col (split at '(')
    a. trim spaces in new col
    b. get rid of () characters in new col
4. trim spaces in other columns
5. examine column values and replace/update as needed
6. create new columns for bus size type and prop type
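
Steps 1 and 2 can be sketched on a toy dataframe. The column names and values below are made up for illustration and are not taken from the FTA file. Passing `regex=False` makes pandas treat `$` as a literal character rather than a regular-expression anchor.

```python
import pandas as pd

# Toy frame with made-up values; the column names are illustrative assumptions.
raw = pd.DataFrame(
    {
        " Project Sponsor ": ["Agency A", "Agency B"],
        "Funding": ["$1,250,000", " $300,500 "],
    }
)

# Step 1: strip outer spaces first, then lower-case and replace inner spaces.
raw.columns = raw.columns.str.strip().str.lower().str.replace(" ", "_")

# Step 2: drop "$" and "," as literal characters, trim, then cast to int64.
raw["funding"] = (
    raw["funding"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype("int64")
)
```

Stripping before replacing spaces avoids leading or trailing underscores in the resulting column names.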


### Dataframe cleaning

In [14]:
def snake_case(df):
    '''
    snake-case dataframe columns and strip extra spaces
    '''
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")


def fund_cleaner(df, column):
    '''
    function to clean the funding column and make column int64
    '''
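    # regex=False treats "$" as a literal character instead of a regex anchor
    df[column] = (
        df[column]
        .str.replace("$", "", regex=False)
        .str.replace(",", "", regex=False)
        .str.strip()
        .astype('int64')
    )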

    

def value_replacer(df, col1, col1_val, col2, col2_new_val):
    '''
    function that replaces the value at a specific row in a specific column.
    in this case, filters the df by a specific col/val, then replaces the value at new col/val
    '''
    df.loc[df[col1] == col1_val , col2] = col2_new_val
    

In [4]:
# apply the snake_case function to the df
snake_case(fta)

### Column Cleaning

#### rename `propulsion_type` to `propulsion_category`

In [5]:
# rename col to propulsion category
fta = fta.rename(columns={"propulsion_type": "propulsion_category"})

# make values in prop_cat col lower case and remove spaces
fta["propulsion_category"] = fta["propulsion_category"].str.lower()
fta["propulsion_category"] = fta["propulsion_category"].str.replace(" ", "")

#### funding

In [7]:
fund_cleaner(fta, "funding")


#### split `approx_#_of_buses` to `bus_count` and `prop_type`

In [9]:
# remove the spaces from the # of buses column values first, THEN split at (
fta["approx_#_of_buses"] = fta["approx_#_of_buses"].str.replace(" ", "")

# splitting the # of buses column into 2, using the ( char as the delimiter
# also fills missing values with `needs manual check`
fta[["bus_count", "prop_type"]] = fta["approx_#_of_buses"].str.split(
    pat="(", n=1, expand=True
)
fta[["bus_count", "prop_type"]] = fta[["bus_count", "prop_type"]].fillna(
    "needs manual check"
)
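
A minimal sketch of how this split behaves, using made-up values rather than rows from the FTA data. Rows without a `(` get no second part, so `fillna` marks them for a manual check, and the closing `)` stays attached until it is removed in a later step.

```python
import pandas as pd

# Made-up examples of the space-stripped "# of buses" values.
counts = pd.Series(["20(BEB)", "7(hybridelectric)", "15"])

# Split once at the first "(" and spread the pieces across two columns.
parts = counts.str.split(pat="(", n=1, expand=True)
parts.columns = ["bus_count", "prop_type"]

# Rows with no "(" have no second piece; flag them for review.
parts = parts.fillna("needs manual check")
```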

#### bus_count

In [17]:
# running function on rows that need specific value changes
value_replacer(fta,'bus_count','56estimated-cutawayvans', 'bus_count', 56)
value_replacer(fta,'bus_count','12batteryelectric','bus_count', 12)
value_replacer(fta,'prop_type','PM-awardwillnotfund68buses)', 'prop_type', 'estimated-cutaway vans (PM- award will not fund 68 buses)')
value_replacer(fta,'project_sponsor','City of Charlotte - Charlotte Area Transit System','bus_count',31)

#### project_type

In [21]:
# lower-case project type values and remove spaces
fta["project_type"] = fta["project_type"].str.lower().str.replace(" ", "")

In [22]:
# some values still need to get adjusted. will use a short dictionary to fix
new_type = {
    "\tbus/facility": "bus/facility",
    "bus/facilitiy": "bus/facility",
    "facilities": "facility",
}
# using replace() with the dictionary to replace keys in project type col
fta.replace({"project_type": new_type}, inplace=True)
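
As a small illustration with made-up values, a nested dictionary passed to `replace()` only swaps cells whose whole value equals a key. Partial matches inside longer strings are left alone, which is why each misspelled variant needs its own entry.

```python
import pandas as pd

# Made-up project types; the last value only contains a key as a substring.
types = pd.DataFrame(
    {"project_type": ["facilities", "bus/facilitiy", "bus", "new facilities"]}
)
fix = {"facilities": "facility", "bus/facilitiy": "bus/facility"}

# {column: {old: new}} limits the replacement to that one column.
types = types.replace({"project_type": fix})
```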

#### `prop_type`

In [29]:
# cleaning the bus desc/prop_type col.
# removing the ) as a literal character (regex=False)
fta["prop_type"] = fta["prop_type"].str.replace(")", "", regex=False).str.strip()


In [31]:
# creating a dictionary to add spaces back to the values
spaces = {
    "beb": "BEB",
    "estimated-CNGbuses": "estimated-CNG buses",
    "cngbuses": "CNG buses",
    "BEBs": "BEB",
    "Electric\n16(Hybrid": "15 electic, 16 hybrid",
    "FuelCellElectric": "fuel cell electric",
    "FuelCell": "fuel cell",
    "lowemissionCNG": "low emission CNG",
    "cng": "CNG",
    "BEBsparatransitbuses": "BEBs paratransit buses",
    "hybridelectric": "hybrid electric",
    "zeroemissionbuses": "zero emission buses",
    "dieselelectrichybrids": "diesel electric hybrids",
    "hydrogenfuelcell": "hydrogen fuel cell",
    "2BEBsand4HydrogenFuelCellBuses": "2 BEBs and 4 hydrogen fuel cell buses",
    "4fuelcell/3CNG": "4 fuel cell / 3 CNG",
    "hybridelectricbuses": "hybrid electric buses",
    "CNGfueled": "CNG fueled",
    "zeroemissionelectric": "zero emission electric",
    "hybridelectrics": "hybrid electrics",
    "dieselandgas": "diesel and gas",
    "diesel-electrichybrids": "diesel-electric hybrids",
    "propanebuses": "propane buses",
    "1:CNGbus;2cutawayCNGbuses": "1:CNGbus ;2 cutaway CNG buses",
    "zeroemission": "zero emission",
    "propanedpoweredvehicles": "propaned powered vehicles",
}

# using new dictionary to replace values in the bus desc col
fta.replace({"prop_type": spaces}, inplace=True)

In [37]:
# dict to validate prop_type values
prop_type_dict = {
    "15 electic, 16 hybrid": "mix (zero and low emission buses)",
    "1:CNGbus ;2 cutaway CNG buses": "mix (zero and low emission buses)",
    "2 BEBs and 4 hydrogen fuel cell buses": "mix (BEB and FCEB)",
    "4 fuel cell / 3 CNG": "mix (zero and low emission buses)",
    "BEBs paratransit buses": "BEB",
    "CNG buses": "CNG",
    "CNG fueled": "CNG",
    "Electric": "electric (not specified)",
    "battery electric": "BEB",
    "diesel and gas": "mix (low emission)",
    "diesel electric hybrids": "low emission (hybrid)",
    "diesel-electric": "low emission (hybrid)",
    "diesel-electric hybrids": "low emission (hybrid)",
    "electric": "electric (not specified)",
    "estimated-CNG buses": "CNG",
    "estimated-cutaway vans (PM- award will not fund 68 buses": "mix (zero and low emission buses)",
    "fuel cell": "FCEB",
    "fuel cell electric": "FCEB",
    "hybrid": "low emission (hybrid)",
    "hybrid electric": "low emission (hybrid)",
    "hybrid electric buses": "low emission (hybrid)",
    "hybrid electrics": "low emission (hybrid)",
    "hydrogen fuel cell": "FCEB",
    "low emission CNG": "CNG",
    "propane": "low emission (propane)",
    "propane buses": "low emission (propane)",
    "propaned powered vehicles": "low emission (propane)",
    "zero emission": "zero-emission bus (not specified)",
    "zero emission buses": "zero-emission bus (not specified)",
    "zero emission electric": "zero-emission bus (not specified)",
    "zero-emission": "zero-emission bus (not specified)",
}

# replacing values in prop type with prop type dictionary
fta.replace({"prop_type": prop_type_dict}, inplace=True)

### fix `prop_type == needs manual check`

- subset a df of only prop type == needs manual check
- create list of keywords to check prop type
- create function to replace `needs manual check` values with list values
- then... do something with both dataframes? 
    * remove rows with `needs manual check`
    * then append subset df to initial df?
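
One possible alternative to subsetting and re-appending is to update the flagged rows in place with a boolean mask. The sketch below uses a toy dataframe, a shortened keyword list, and a hypothetical helper name (`first_keyword`); it is not the approach the notebook takes.

```python
import pandas as pd

# Toy data; descriptions and keywords are made up for illustration.
df = pd.DataFrame(
    {
        "description": ["buy battery electric buses", "build a facility", "buy buses"],
        "prop_type": ["needs manual check", "needs manual check", "CNG"],
    }
)
keywords = ["battery electric", "CNG"]


def first_keyword(text):
    """Return the first keyword found in text, else a default label."""
    for kw in keywords:
        if kw in text:
            return kw
    return "no bus procurement"


# Only rows flagged for a manual check are rewritten; row order is preserved.
mask = df["prop_type"] == "needs manual check"
df.loc[mask, "prop_type"] = df.loc[mask, "description"].apply(first_keyword)
```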


In [40]:
# subdf of just `needs manual check` prop_types
# .copy() so later column assignments do not trigger SettingWithCopyWarning
manual_check = fta[fta["prop_type"] == "needs manual check"].copy()

In [43]:
# function to match keywords to list
def prop_type_finder(description):
    for keyword in manual_checker_list:
        if keyword in description:
            return keyword
    return "no bus procurement"

In [42]:
manual_checker_list = [
    "propane-powered",
    "hybrid diesel-electric buses",
    "propane fueled buses",
    "cutaway vehicles",
    "diesel-electric hybrid",
    "low or no emission buses",
    "electric buses",
    "hybrid-electric vehicles",
    "electric commuter",
    "Electric Buses",
    "battery electric",
    "Batery Electric",
    "battery-electric",
    "fuel-cell",
    "fuel cell",
    "Fuel Cell",
    "zero emission",
    "Zero Emission",
    "zero-emission electric buses",
    "zero-emission buses",
    "zero‐emission",
    "zero-emission",
    "zeroemission",
    "CNG",
    "cng",
    "County Mass Transit District will receive funding to buy buses",
    "Colorado will receive funding to buy vans to replace older ones",
    "ethanol-fueled buses",
    "will receive funding to buy vans to replace",
    "funding to replace the oldest buses",
    "to buy buses and charging equipment",
    "counties by buying buses",
    "receive funding to buy cutaway paratransit buses",
    "new replacement vehicles",
]

# overwrites the 'prop_type' column by applying the function to the description column.
# the function checks each description against the list, then returns the first keyword the row matched
manual_check["prop_type"] = manual_check["description"].apply(prop_type_finder)
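
Because the function returns the first keyword it finds, list order matters: a general keyword placed before a more specific one that contains it will always win. The sketch below uses a hypothetical helper (`first_match`) and a made-up description to show the effect, and it also matches case-insensitively, which the notebook's function does not.

```python
def first_match(description, keywords):
    """Return the first keyword found in the description (case-insensitive)."""
    text = description.lower()
    for keyword in keywords:
        if keyword.lower() in text:
            return keyword
    return "no bus procurement"


# The general keyword is listed before the specific one that contains it.
general_first = ["electric buses", "zero-emission electric buses"]
# Sorting longest-first is one way to let specific keywords win.
specific_first = sorted(general_first, key=len, reverse=True)

desc = "Funding to buy Zero-Emission Electric Buses"  # made-up description
result_general = first_match(desc, general_first)
result_specific = first_match(desc, specific_first)
```

Whether the ordering changes the final category depends on how each keyword is mapped afterwards, so this is worth checking against the mapping dictionary.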

### use dictionary to change manual_check prop_type values to match validated values

In [46]:
manual_check_dict= {'zero emission': 'zero-emission bus (not specified)',
 'electric buses':'electric (not specified)',
 'zero-emission': 'zero-emission bus (not specified)',
 'low or no emission buses' : 'mix (zero and low emission buses)',
 'zero-emission buses': 'zero-emission bus (not specified)',
 'new replacement vehicles':'not specified',
 'receive funding to buy cutaway paratransit buses': 'not specified',
 'counties by buying buses': 'not specified',
 'battery-electric' : 'BEB',
 'to buy buses and charging equipment':'not specified',
 'propane-powered': 'low emission (propane)',
 'funding to replace the oldest buses':'not specified',
 'diesel-electric hybrid': 'low emission (hybrid)',
 'hybrid diesel-electric buses': 'low emission (hybrid)',
 'cutaway vehicles':'not specified',
 'propane fueled buses': 'low emission (propane)',
 'County Mass Transit District will receive funding to buy buses':'not specified',
 'ethanol-fueled buses': 'low emission (ethanol)',
 'will receive funding to buy vans to replace': 'not specified',
 'Colorado will receive funding to buy vans to replace older ones': 'not specified',
 'hybrid-electric vehicles': 'low emission (hybrid)'
}

# replace prop_type values using manual_check_dict
manual_check.replace({"prop_type": manual_check_dict}, inplace=True)

### deleting rows from initial df that have prop_type == 'needs manual check'

In [49]:
# filters df for rows that do not equal `needs manual check`
# expect rows to drop from 130 to 72?
fta = fta[fta['prop_type'] != 'needs manual check']

In [52]:
### appending rows from manual_check to initial df
# DataFrame.append is deprecated (removed in pandas 2.0); pd.concat does the same job
fta = pd.concat([fta, manual_check], ignore_index=True)


### Create a new `bus_size_type` column via a keyword list and function
e.g. cutaway, 40ft, etc.

In [None]:
list(fta.columns)

In [55]:
bus_size = [
    "standard",
    "40 foot",
    "40-foot",
    "40ft",
    "articulated",
    "cutaway",
]

In [56]:
# Function to match keywords
def find_bus_size_type(description):
    for keyword in bus_size:
        if keyword in description.lower():
            return keyword
    return "not specified"

In [58]:
# new column called bus size type based on description column
fta["bus_size_type"] = fta["description"].apply(find_bus_size_type)
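
The keyword list returns whichever spelling it finds, so "40 foot", "40-foot" and "40ft" end up as three separate labels. A hypothetical variant, sketched below, maps each spelling to one label and tolerates missing descriptions. The names `SIZE_LABELS` and `bus_size_label` are assumptions for illustration, not part of the notebook.

```python
# Hypothetical mapping from keyword spelling to a single normalized label.
SIZE_LABELS = {
    "40 foot": "40-foot",
    "40-foot": "40-foot",
    "40ft": "40-foot",
    "standard": "standard",
    "articulated": "articulated",
    "cutaway": "cutaway",
}


def bus_size_label(description):
    """Return a normalized size label, or 'not specified' if none is found."""
    # Missing descriptions (NaN) are not strings, so handle them up front.
    if not isinstance(description, str):
        return "not specified"
    text = description.lower()
    for keyword, label in SIZE_LABELS.items():
        if keyword in text:
            return label
    return "not specified"
```

It could be applied the same way as the function above, for example `fta["description"].apply(bus_size_label)`.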

## Exporting cleaned data to GCS

In [None]:
# check work
display(fta.head(3), fta.bus_size_type.unique(), fta.shape)

In [60]:
# saving to GCS as csv

clean_file = 'fta_bus_cost_clean.csv'

fta.to_csv(gcs_path+clean_file)
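
One thing to be aware of with the default `to_csv()` call: the dataframe index is written as an unnamed first column, so reading the file back adds an `Unnamed: 0` column. The sketch below shows this with an in-memory buffer and made-up data. Passing `index=False` is one way to avoid it, if the extra column is not wanted.

```python
import io

import pandas as pd

df = pd.DataFrame({"bus_count": [5, 12], "prop_type": ["BEB", "CNG"]})

# Default behaviour: the index is written as an extra, unnamed column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = list(pd.read_csv(with_index).columns)

# index=False leaves the index out, so the columns round-trip unchanged.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = list(pd.read_csv(without_index).columns)
```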

## Reading in cleaned data from GCS

In [None]:
bus_cost = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_bus_cost_clean.csv"
)

In [None]:
# confirming cleaned data shows as expected.
display(bus_cost.shape, type(bus_cost), bus_cost.columns)

In [None]:
bus_cost["prop_type"].sort_values(ascending=True).unique()

## DEPRECATED - Data Analysis
The actual data analysis and summary stats live in the `cost_per_bus_analysis.ipynb` notebook.

### Cost per Bus, per Transit Agency dataframe

In [None]:
only_bus = bus_cost[bus_cost["bus_count"] > 0]
only_bus.head()

In [None]:
cost_per_bus = (
    only_bus.groupby("project_sponsor")
    .agg({"funding": "sum", "bus_count": "sum"})
    .reset_index()
)

In [None]:
cost_per_bus["cost_per_bus"] = (
    cost_per_bus["funding"] / cost_per_bus["bus_count"]
).astype("int64")
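
If `bus_count` still holds text such as `needs manual check` after the CSV round trip, the `> 0` filter and the division would fail. A minimal sketch of a defensive version, using made-up agencies and amounts, coerces the column to numeric first so that non-numeric rows drop out of the filter.

```python
import pandas as pd

# Made-up awards; one row still carries the manual-check placeholder.
awards = pd.DataFrame(
    {
        "project_sponsor": ["Agency A", "Agency A", "Agency B", "Agency C"],
        "funding": [1_000_000, 500_000, 900_000, 400_000],
        "bus_count": ["2", "1", "needs manual check", "4"],
    }
)

# Non-numeric counts become NaN, and NaN > 0 is False, so those rows are excluded.
awards["bus_count"] = pd.to_numeric(awards["bus_count"], errors="coerce")
with_buses = awards[awards["bus_count"] > 0]

per_agency = (
    with_buses.groupby("project_sponsor")
    .agg({"funding": "sum", "bus_count": "sum"})
    .reset_index()
)
per_agency["cost_per_bus"] = (
    per_agency["funding"] / per_agency["bus_count"]
).astype("int64")
```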

In [None]:
cost_per_bus.dtypes

In [None]:
cost_per_bus

In [None]:
## export cost_per_bus df to gcs
cost_per_bus.to_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_cost_per_bus.csv"
)

### Cost per bus, stats analysis

In [None]:
# read in fta cost per bus csv
cost_per_bus = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_cost_per_bus.csv"
)

In [None]:
display(cost_per_bus.shape, cost_per_bus.head())

### Summary Stats

In [None]:
# top level analysis

bus_cost.agg({"project_title": "count", "funding": "sum", "bus_count": "sum"})

In [None]:
# start of agg. by project_type

bus_cost.groupby("project_type").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# agg by program

bus_cost.groupby("bus/low-no_program").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# agg by state, by funding
bus_cost.groupby("state").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
).sort_values(by="funding", ascending=False)

### Projects with bus purchases

In [None]:
# df of only projects with a bus count
only_bus = bus_cost[bus_cost["bus_count"] > 0]

In [None]:
display(only_bus.shape, only_bus.columns)

In [None]:
# count by propulsion category (column was renamed from propulsion_type during cleaning)
only_bus["propulsion_category"].value_counts()

In [None]:
only_bus.project_type.value_counts()

In [None]:
# of the rows with bus_count > 0, what are the project types?
bus_agg = only_bus.groupby("project_type").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# new column that calculates `cost per bus`
bus_agg["cost_per_bus"] = (bus_agg["funding"] / bus_agg["bus_count"]).astype("int64")

In [None]:
bus_agg

### Projects with no buses

In [None]:
no_bus = bus_cost[bus_cost["bus_count"] < 1]

In [None]:
no_bus["project_type"].value_counts()

### Overall Summary

In [None]:
project_count = bus_cost.project_title.count()
fund_sum = bus_cost.funding.sum()
bus_count_sum = bus_cost.bus_count.sum()
overall_cost_per_bus = (fund_sum) / (bus_count_sum)
bus_program_count = bus_cost["bus/low-no_program"].value_counts()

projects_with_bus = only_bus.project_title.count()
projects_with_bus_funds = only_bus.funding.sum()
# named separately so the per-agency `cost_per_bus` dataframe is not overwritten
bus_project_cost_per_bus = (only_bus.funding.sum()) / (bus_count_sum)

In [None]:
summary = f"""
Top Level observation:
- {project_count} projects awarded
- ${fund_sum:,.2f} awarded
- {bus_count_sum} buses to be purchased
- ${overall_cost_per_bus:,.2f} overall cost per bus

Projects have some mix of buses, facilities and equipment, making it difficult to disaggregate actual bus cost.

Of the {project_count} projects awarded, {projects_with_bus} projects included buses. The remainder were facilities, chargers and equipment.

Projects with bus purchases:
- {projects_with_bus} projects
- ${projects_with_bus_funds:,.2f} awarded to purchase buses
- ${bus_project_cost_per_bus:,.2f} cost per bus
"""

In [None]:
print(summary)

In [None]:
import matplotlib.pyplot as plt

# per-agency cost per bus values from the cost_per_bus dataframe
cost_per_bus_values = cost_per_bus["cost_per_bus"]

# Calculate mean and standard deviation
mean_value = cost_per_bus_values.mean()
std_deviation = cost_per_bus_values.std()

# Plot histogram
plt.hist(cost_per_bus_values, bins=30, color="skyblue", edgecolor="black", alpha=0.7)

# Add vertical lines for mean and standard deviation
plt.axvline(mean_value, color="red", linestyle="dashed", linewidth=2, label="Mean")
plt.axvline(
    mean_value + std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean + 1 Std Dev",
)
plt.axvline(
    mean_value - std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean - 1 Std Dev",
)

# Set labels and title
plt.xlabel("cost_per_bus")
plt.ylabel("Frequency")
plt.title("Histogram of cost_per_bus with Mean and Std Dev Lines")
plt.legend()

# Show the plot
plt.show()

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

# per-agency cost per bus values from the cost_per_bus dataframe
cost_per_bus_values = cost_per_bus["cost_per_bus"]

# Calculate mean and standard deviation
mean_value = cost_per_bus_values.mean()
std_deviation = cost_per_bus_values.std()

# Plot histogram
plt.hist(cost_per_bus_values, bins=20, color="skyblue", edgecolor="black", alpha=0.7)

# Add vertical lines for mean and standard deviation
plt.axvline(mean_value, color="red", linestyle="dashed", linewidth=2, label="Mean")
plt.axvline(
    mean_value + std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean + 1 Std Dev",
)
plt.axvline(
    mean_value - std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean - 1 Std Dev",
)

# Set labels and title
plt.xlabel("Cost per Bus (USD)")
plt.ylabel("Frequency")
plt.title("Histogram of Cost per Bus with Mean and Std Dev Lines")
plt.legend()

# Format x-axis ticks as USD
plt.gca().xaxis.set_major_formatter(mticker.StrMethodFormatter("${x:,.0f}"))

# Show the plot
plt.show()