## FY23 FTA Bus and Low- and No-Emission Grant Awards Analysis

<b>GH issue:</b> 
* Research Request - Bus Procurement Costs & Awards #897

<b>Data source(s):</b> 
1. https://www.transit.dot.gov/funding/grants/fy23-fta-bus-and-low-and-no-emission-grant-awards
2. https://storymaps.arcgis.com/stories/022abf31cedd438b808ec2b827b6faff

<b>Definitions:</b>  
* <u>Grants for Buses and Bus Facilities Program:</u>
    * 49 U.S.C. 5339(b) makes federal resources available to states and direct recipients to replace, rehabilitate and purchase buses and related equipment and to construct bus-related facilities, including technological changes or innovations to modify low or no emission vehicles or facilities. Funding is provided through formula allocations and competitive grants.
<br><br>
* <u>Low or No Emission Vehicle Program:</u>
    * 5339(c) provides funding to state and local governmental authorities for the purchase or lease of zero-emission and low-emission transit buses as well as acquisition, construction, and leasing of required supporting facilities.


In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import shared_utils
from scipy import stats
from scipy.stats import norm

# set_option to increase max rows displayed to 300, to see the entire df in one go
pd.set_option("display.max_rows", 300)

## Reading in raw data from GCS

In [2]:
df = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/data-analyses_bus_procurement_cost_fta_press_release_data_csv.csv"
)

## Data Cleaning
1. snake-case column names
2. remove currency formatting from funding column (`$` and `,`)
3. separate text from # of bus col (split at `(`)
    a. trim spaces in new col
    b. get rid of `()` characters in new col
4. trim spaces in other columns
5. examine column values and replace/update as needed
6. create new columns for bus size type and prop type


### Dataframe cleaning

In [3]:
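# functions to clean up dataframe and df columns
def snake_case(df):
    # strip first so leading/trailing spaces do not turn into underscores
    df.columns = df.columns.str.strip()
    df.columns = df.columns.str.lower()
    df.columns = df.columns.str.replace(" ", "_")


def fund_cleaner(df, column):
    # regex=False treats "$" and "," as literal characters, not regex patterns
    df[column] = df[column].str.replace("$", "", regex=False)
    df[column] = df[column].str.replace(",", "", regex=False)
    df[column] = df[column].str.strip()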

In [4]:
# apply snake_case function to df
snake_case(df)

### Column Cleaning

#### rename `propulsion_type` to `propulsion_category`

In [5]:
# rename col to propulsion category
df = df.rename(columns={"propulsion_type": "propulsion_category"})

In [6]:
# make values in prop_cat col lower case and remove spaces
df["propulsion_category"] = df["propulsion_category"].str.lower()
df["propulsion_category"] = df["propulsion_category"].str.replace(" ", "")

In [7]:
df.head(3)

Unnamed: 0,state,project_sponsor,project_title,description,funding,approx_#_of_buses,project_type,propulsion_category,area_served,congressional_districts,fta_region,bus/low-no_program
0,DC,Washington Metropolitan Area Transit Authority...,Battery-Electric Metrobus Procurement and Elec...,WMATA will receive funding to convert its Cind...,"$104,000,000",100(beb),bus/chargers,zero,Large Urban,DC-001 ; MD-004 ; MD-008 ; VA-008 ; VA-011,3,Low-No
1,TX,Dallas Area Rapid Transit (DART),DART CNG Bus Fleet Modernization Project,Dallas Area Rapid Transit will receive funding...,"$103,000,000",90 (estimated-CNG buses),bus,low,Large Urban,TX-003 ; TX-004 ; TX-005 ; TX-006 ; TX-024 ; T...,6,Low-No
2,PA,Southeastern Pennsylvania Transportation Autho...,SEPTA Zero-Emission Bus Transition Facility Sa...,The Southeastern Pennsylvania Transportation A...,"$80,000,000",0,facility,zero,Large Urban,PA-002 ; PA-003 ; PA-004 ; PA-005,3,Low-No


#### funding

In [8]:
fund_cleaner(df, "funding")



In [9]:
df["funding"] = df["funding"].astype("int64")
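
The currency cleanup and integer conversion above can be sketched on a few made-up values. This is a minimal illustration, not the notebook's actual data: the `toy` frame and its numbers are assumptions, and `regex=False` is used so that `$` is always treated as a literal character.

```python
import pandas as pd

# Toy values in the same "$1,234" style as the raw funding column (made-up numbers).
toy = pd.DataFrame({"funding": ["$104,000,000", " $80,000 ", "$1,500"]})

# regex=False treats "$" and "," as literal characters rather than regex patterns.
cleaned = (
    toy["funding"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
)

# astype raises a ValueError if any value is still non-numeric,
# which surfaces rows that need more cleaning.
toy["funding"] = cleaned.astype("int64")
print(toy["funding"].tolist())
```

Converting with `astype("int64")` right after cleaning doubles as a check: a leftover symbol in any row fails loudly instead of silently staying a string.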

In [10]:
df.columns

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       'approx_#_of_buses', 'project_type', 'propulsion_category',
       'area_served', 'congressional_districts', 'fta_region',
       'bus/low-no_program'],
      dtype='object')

#### split `approx_#_of_buses` to `bus_count` and `prop_type`

In [11]:
# remove the spaces in the # of buses column first, THEN split by (
df["approx_#_of_buses"] = df["approx_#_of_buses"].str.replace(" ", "")

In [12]:
# splitting the # of buses column into 2, using the ( char as the delimiter
# also fills `None` values with `needs manual check`
df[["bus_count", "prop_type"]] = df["approx_#_of_buses"].str.split(
    pat="(", n=1, expand=True
)
df[["bus_count", "prop_type"]] = df[["bus_count", "prop_type"]].fillna(
    "needs manual check"
)
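
To make the split behavior concrete, here is a small sketch on hypothetical values shaped like the `approx_#_of_buses` column. It assumes nothing about the real data beyond the formats visible in the `head()` output above.

```python
import pandas as pd

# Hypothetical values: two with a "(" description, one plain count, one missing.
toy = pd.DataFrame(
    {"approx_#_of_buses": ["100(beb)", "90(estimated-CNGbuses)", "0", None]}
)

# n=1 splits only at the first "(", so any later "(" stays in the second column.
parts = toy["approx_#_of_buses"].str.split(pat="(", n=1, expand=True)
parts.columns = ["bus_count", "prop_type"]

# Rows without a "(" have no second part; a missing input has neither part.
parts = parts.fillna("needs manual check")
print(parts)
```

Note that a plain count such as `"0"` keeps its `bus_count` and only gets the placeholder in `prop_type`, while a missing input gets the placeholder in both columns. The trailing `)` is still present after the split, which is why a later step strips it.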

#### bus_count

In [13]:
# function to find the row index of a specific value and column in a dataframe
def find_loc(data, col, val):
    x = data.loc[data[col] == val].index[0]
    return x
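
`find_loc` returns only the first match and raises an `IndexError` when the value is absent. A hedged variant, using the hypothetical name `find_locs`, returns every matching index label instead, so an empty result can be handled explicitly:

```python
import pandas as pd


def find_locs(data, col, val):
    """Return every index label where data[col] equals val (possibly an empty list)."""
    return data.index[data[col] == val].tolist()


# Toy frame with a repeated value (made-up data).
toy = pd.DataFrame({"bus_count": ["12batteryelectric", "20", "12batteryelectric"]})

matches = find_locs(toy, "bus_count", "12batteryelectric")
missing = find_locs(toy, "bus_count", "not in the data")
print(matches, missing)
```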

In [14]:
loc1 = find_loc(df, "bus_count", "56estimated-cutawayvans")
loc2 = find_loc(df, "bus_count", "12batteryelectric")

In [15]:
display(loc1, loc2)

58

32

In [16]:
# editing the values of the bus count col at specific location
# syntax, look at ## index, look at XX column
df.loc[58, "bus_count"] = 56
df.loc[32, "bus_count"] = 12

In [17]:
# updating the prop_type values at the same locations
df.loc[58, "prop_type"] = "estimated-cutaway vans (PM- award will not fund 68 buses)"
df.loc[32, "prop_type"] = "battery electric"

In [18]:
# bus count for row 12 needs to be adjusted to 31 instead of 15
df.loc[12, "bus_count"] = 31

In [19]:
# confirming the change
df.loc[12]

state                                                                     NC
project_sponsor            City of Charlotte - Charlotte Area Transit System
project_title              Charlotte Area Transit System's Sustainable Fl...
description                The city of Charlotte will receive funding to ...
funding                                                             30890413
approx_#_of_buses                                   15(Electric)\n16(Hybrid)
project_type                                      Bus / Chargers / Equipment
propulsion_category                                                 zero/low
area_served                                                      Large Urban
congressional_districts           NC-008 ; NC-012 ; NC-013 ; NC-014 ; SC-005
fta_region                                                                 4
bus/low-no_program                                                       Bus
bus_count                                                                 31

#### project_type

In [20]:
# using str.lower() on project type
df["project_type"] = df["project_type"].str.lower()
# removing spaces from project type
df["project_type"] = df["project_type"].str.replace(" ", "")

In [21]:
# some values still need to get adjusted. will use a short dictionary to fix
new_type = {
    "\tbus/facility": "bus/facility",
    "bus/facilitiy": "bus/facility",
    "facilities": "facility",
}

In [22]:
# using replace() with the dictionary to replace keys in project type col
# syntax df.replace({'bus_desc': new_dict}, inplace=True)
df.replace({"project_type": new_type}, inplace=True)

In [23]:
df.project_type.unique()

array(['bus/chargers', 'bus', 'facility', 'bus/chargers/equipment',
       'facility/chargers', 'bus/facility', 'bus/facility/chargers',
       'chargers', 'bus/chargers/other', 'bus/facility/equipment',
       'bus/equipment', 'bus/facility/chargers/equipment',
       'bus/facility/other', 'facility/chargers/equipment',
       'facility/equipment', 'chargers/equipment', 'bus/other',
       'bus/facility/equipment/other'], dtype=object)
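
One property of the dictionary-based `replace()` used above is worth illustrating: with a nested `{column: {old: new}}` mapping, only cells that equal a key exactly are replaced. The sketch below uses a made-up frame; the last value is a hypothetical case that the dictionary does not cover.

```python
import pandas as pd

new_type = {
    "\tbus/facility": "bus/facility",
    "bus/facilitiy": "bus/facility",
    "facilities": "facility",
}

# Made-up values; "facilities/chargers" is a hypothetical unmapped variant.
toy = pd.DataFrame(
    {"project_type": ["\tbus/facility", "bus/facilitiy", "facilities", "facilities/chargers"]}
)

# Whole-cell matches are replaced; partial matches are left untouched.
toy = toy.replace({"project_type": new_type})
print(toy["project_type"].tolist())
```

Because partial matches are not touched, checking `unique()` afterwards, as done above, is what confirms that no variant slipped through.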

#### `prop_type`

In [24]:
# cleaning the bus desc/prop_type col.
# removing the ) ; regex=False treats ")" as a literal character
df["prop_type"] = df["prop_type"].str.replace(")", "", regex=False)



In [25]:
list(df["prop_type"].unique())

['beb',
 'estimated-CNGbuses',
 'needs manual check',
 'zero-emission',
 'cngbuses',
 'BEBs',
 'Electric\n16(Hybrid',
 'FCEB',
 'Electric',
 'FuelCellElectric',
 'CNG',
 'FuelCell',
 'hybrid',
 'BEB',
 'battery electric',
 'lowemissionCNG',
 'cng',
 'BEBsparatransitbuses',
 'hybridelectric',
 'zeroemissionbuses',
 'dieselelectrichybrids',
 'hydrogenfuelcell',
 '2BEBsand4HydrogenFuelCellBuses',
 '4fuelcell/3CNG',
 'estimated-cutaway vans (PM- award will not fund 68 buses',
 'hybridelectricbuses',
 'CNGfueled',
 'zeroemissionelectric',
 'hybridelectrics',
 'dieselandgas',
 'diesel-electrichybrids',
 'propane',
 'electric',
 'diesel-electric',
 'propanebuses',
 '1:CNGbus;2cutawayCNGbuses',
 'zeroemission',
 'propanedpoweredvehicles']

In [26]:
# stripping the values in the bus desc col
df["prop_type"] = df["prop_type"].str.strip()

In [27]:
# creating a dictionary to add spaces back to the values
spaces = {
    "beb": "BEB",
    "estimated-CNGbuses": "estimated-CNG buses",
    "cngbuses": "CNG buses",
    "BEBs": "BEB",
    "Electric\n16(Hybrid": "15 electic, 16 hybrid",
    "FuelCellElectric": "fuel cell electric",
    "FuelCell": "fuel cell",
    "lowemissionCNG": "low emission CNG",
    "cng": "CNG",
    "BEBsparatransitbuses": "BEBs paratransit buses",
    "hybridelectric": "hybrid electric",
    "zeroemissionbuses": "zero emission buses",
    "dieselelectrichybrids": "diesel electric hybrids",
    "hydrogenfuelcell": "hydrogen fuel cell",
    "2BEBsand4HydrogenFuelCellBuses": "2 BEBs and 4 hydrogen fuel cell buses",
    "4fuelcell/3CNG": "4 fuel cell / 3 CNG",
    "hybridelectricbuses": "hybrid electric buses",
    "CNGfueled": "CNG fueled",
    "zeroemissionelectric": "zero emission electric",
    "hybridelectrics": "hybrid electrics",
    "dieselandgas": "diesel and gas",
    "diesel-electrichybrids": "diesel-electric hybrids",
    "propanebuses": "propane buses",
    "1:CNGbus;2cutawayCNGbuses": "1:CNGbus ;2 cutaway CNG buses",
    "zeroemission": "zero emission",
    "propanedpoweredvehicles": "propaned powered vehicles",
}

In [28]:
# using new dictionary to replace values in the bus desc col
df.replace({"prop_type": spaces}, inplace=True)

In [29]:
list(df["prop_type"].sort_values().unique())

['15 electic, 16 hybrid',
 '1:CNGbus ;2 cutaway CNG buses',
 '2 BEBs and 4 hydrogen fuel cell buses',
 '4 fuel cell / 3 CNG',
 'BEB',
 'BEBs paratransit buses',
 'CNG',
 'CNG buses',
 'CNG fueled',
 'Electric',
 'FCEB',
 'battery electric',
 'diesel and gas',
 'diesel electric hybrids',
 'diesel-electric',
 'diesel-electric hybrids',
 'electric',
 'estimated-CNG buses',
 'estimated-cutaway vans (PM- award will not fund 68 buses',
 'fuel cell',
 'fuel cell electric',
 'hybrid',
 'hybrid electric',
 'hybrid electric buses',
 'hybrid electrics',
 'hydrogen fuel cell',
 'low emission CNG',
 'needs manual check',
 'propane',
 'propane buses',
 'propaned powered vehicles',
 'zero emission',
 'zero emission buses',
 'zero emission electric',
 'zero-emission']

In [30]:
prop_type_dict = {
    "15 electic, 16 hybrid": "mix (zero and low emission buses)",
    "1:CNGbus ;2 cutaway CNG buses": "mix (zero and low emission buses)",
    "2 BEBs and 4 hydrogen fuel cell buses": "mix (BEB and FCEB)",
    "4 fuel cell / 3 CNG": "mix (zero and low emission buses)",
    "BEBs paratransit buses": "BEB",
    "CNG buses": "CNG",
    "CNG fueled": "CNG",
    "Electric": "electric (not specified)",
    "battery electric": "BEB",
    "diesel and gas": "mix (low emission)",
    "diesel electric hybrids": "low emission (hybrid)",
    "diesel-electric": "low emission (hybrid)",
    "diesel-electric hybrids": "low emission (hybrid)",
    "electric": "electric (not specified)",
    "estimated-CNG buses": "CNG",
    "estimated-cutaway vans (PM- award will not fund 68 buses": "mix (zero and low emission buses)",
    "fuel cell": "FCEB",
    "fuel cell electric": "FCEB",
    "hybrid": "low emission (hybrid)",
    "hybrid electric": "low emission (hybrid)",
    "hybrid electric buses": "low emission (hybrid)",
    "hybrid electrics": "low emission (hybrid)",
    "hydrogen fuel cell": "FCEB",
    "low emission CNG": "CNG",
    "propane": "low emission (propane)",
    "propane buses": "low emission (propane)",
    "propaned powered vehicles": "low emission (propane)",
    "zero emission": "zero-emission bus (not specified)",
    "zero emission buses": "zero-emission bus (not specified)",
    "zero emission electric": "zero-emission bus (not specified)",
    "zero-emission": "zero-emission bus (not specified)",
}

In [31]:
# replacing values in prop type with prop type dictionary
df.replace({"prop_type": prop_type_dict}, inplace=True)

In [32]:
# check work
df.prop_type.value_counts()

needs manual check                   58
BEB                                  18
low emission (hybrid)                15
CNG                                  14
electric (not specified)              6
low emission (propane)                5
zero-emission bus (not specified)     4
mix (zero and low emission buses)     4
FCEB                                  4
mix (BEB and FCEB)                    1
mix (low emission)                    1
Name: prop_type, dtype: int64

### fix `prop_type == needs manual check`

- subset a df of only prop type == needs manual check
- create list of keywords to check prop type
- create function to replace `needs manual check` values with list values
- then... do something with both dataframes? 
    * remove rows with `needs manual check`
    * then append subset df to initial df?
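
An alternative to splitting the dataframe and appending the subset back is to update the flagged rows in place with a boolean mask. The sketch below is only one possible approach, shown on made-up descriptions with a shortened, hypothetical keyword list; it is not the method used in the cells that follow.

```python
import pandas as pd

# Made-up rows: two flagged for manual review, one already classified.
toy = pd.DataFrame(
    {
        "description": [
            "funding to buy battery-electric buses",
            "funding to build a facility",
            "funding to buy CNG buses",
        ],
        "prop_type": ["needs manual check", "needs manual check", "CNG"],
    }
)

keywords = ["battery-electric", "CNG"]  # hypothetical, shortened keyword list


def first_keyword(description):
    """Return the first keyword found in the description, else a fallback label."""
    for keyword in keywords:
        if keyword in description:
            return keyword
    return "no bus procurement"


# Only the flagged rows are rewritten; all other rows and the row order are untouched.
mask = toy["prop_type"] == "needs manual check"
toy.loc[mask, "prop_type"] = toy.loc[mask, "description"].apply(first_keyword)
print(toy["prop_type"].tolist())
```

Writing through `.loc` keeps the original row order and index, whereas removing and re-appending rows moves the manually checked rows to the end of the dataframe.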


In [33]:
# .copy() so later assignments modify an independent DataFrame, not a slice of df
manual_check = df[df["prop_type"] == "needs manual check"].copy()

In [34]:
display(
    manual_check.shape, manual_check["prop_type"].value_counts(), manual_check.columns
)

(58, 14)

needs manual check    58
Name: prop_type, dtype: int64

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       'approx_#_of_buses', 'project_type', 'propulsion_category',
       'area_served', 'congressional_districts', 'fta_region',
       'bus/low-no_program', 'bus_count', 'prop_type'],
      dtype='object')

In [35]:
manual_checker_list = [
    "propane-powered",
    "hybrid diesel-electric buses",
    "propane fueled buses",
    "cutaway vehicles",
    "diesel-electric hybrid",
    "low or no emission buses",
    "electric buses",
    "hybrid-electric vehicles",
    "electric commuter",
    "Electric Buses",
    "battery electric",
    "Batery Electric",
    "battery-electric",
    "fuel-cell",
    "fuel cell",
    "Fuel Cell",
    "zero emission",
    "Zero Emission",
    "zero-emission electric buses",
    "zero-emission buses",
    "zero‐emission",
    "zero-emission",
    "zeroemission",
    "CNG",
    "cng",
    "County Mass Transit District will receive funding to buy buses",
    "Colorado will receive funding to buy vans to replace older ones",
    "ethanol-fueled buses",
    "will receive funding to buy vans to replace",
    "funding to replace the oldest buses",
    "to buy buses and charging equipment",
    "counties by buying buses",
    "receive funding to buy cutaway paratransit buses",
    "new replacement vehicles",
]

In [36]:
# function to match keywords to list
def prop_type_finder(description):
    for keyword in manual_checker_list:
        if keyword in description:
            return keyword
    return "no bus procurement"
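
Because the function returns on the first keyword that appears in the description, the order of the keyword list decides the result whenever one keyword is contained in another. A small sketch with a made-up description and a two-item list shows the effect; `first_match` is a hypothetical helper that takes the list as an argument.

```python
def first_match(description, keywords):
    """Return the first keyword (in list order) found in the description."""
    for keyword in keywords:
        if keyword in description:
            return keyword
    return "no bus procurement"


# "electric buses" is a substring of "zero-emission electric buses".
keywords = ["electric buses", "zero-emission electric buses"]
text = "The agency will buy zero-emission electric buses."  # made-up description

in_list_order = first_match(text, keywords)
# Checking longer keywords first lets the more specific phrase win.
longest_first = first_match(text, sorted(keywords, key=len, reverse=True))
print(in_list_order, "|", longest_first)
```

Sorting by length is just one way to prefer specific phrases; reordering the list by hand works equally well.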

In [37]:
manual_check["prop_type"] = manual_check["description"].apply(prop_type_finder)



In [54]:
# still a lot of `not specified`.
# looked at 'not specified' values, all rows were for facilities & zero buses. wrote the function to return 'no bus procurement'.
# GOOD TO GO
display(
    manual_check.shape, manual_check["prop_type"].value_counts(), manual_check.columns
)

(58, 14)

no bus procurement                                                 21
electric buses                                                     11
zero-emission                                                       5
zero emission                                                       2
battery-electric                                                    2
diesel-electric hybrid                                              1
Colorado will receive funding to buy vans to replace older ones     1
will receive funding to buy vans to replace                         1
ethanol-fueled buses                                                1
County Mass Transit District will receive funding to buy buses      1
propane fueled buses                                                1
cutaway vehicles                                                    1
hybrid diesel-electric buses                                        1
propane-powered                                                     1
funding to replace t

Index(['state', 'project_sponsor', 'project_title', 'description', 'funding',
       'approx_#_of_buses', 'project_type', 'propulsion_category',
       'area_served', 'congressional_districts', 'fta_region',
       'bus/low-no_program', 'bus_count', 'prop_type'],
      dtype='object')

### use dictionary to change manual_check prop_type values to match validated values

In [57]:
manual_check_dict = {
    "zero emission": "zero-emission bus (not specified)",
    "electric buses": "electric (not specified)",
    "zero-emission": "zero-emission bus (not specified)",
    "low or no emission buses": "mix (zero and low emission buses)",
    "zero-emission buses": "zero-emission bus (not specified)",
    "new replacement vehicles": "not specified",
    "receive funding to buy cutaway paratransit buses": "not specified",
    "counties by buying buses": "not specified",
    "battery-electric": "BEB",
    "to buy buses and charging equipment": "not specified",
    "propane-powered": "low emission (propane)",
    "funding to replace the oldest buses": "not specified",
    "diesel-electric hybrid": "low emission (hybrid)",
    "hybrid diesel-electric buses": "low emission (hybrid)",
    "cutaway vehicles": "not specified",
    "propane fueled buses": "low emission (propane)",
    "County Mass Transit District will receive funding to buy buses": "not specified",
    "ethanol-fueled buses": "low emission (ethanol)",
    "will receive funding to buy vans to replace": "not specified",
    "Colorado will receive funding to buy vans to replace older ones": "not specified",
    "hybrid-electric vehicles": "low emission (hybrid)",
}

In [59]:
type(manual_check_dict)

dict

In [60]:
# replace prop_type values using manual_check_dict
manual_check.replace({"prop_type": manual_check_dict}, inplace=True)

In [62]:
# check work
# looks good 
manual_check.prop_type.value_counts()

no bus procurement                   21
electric (not specified)             11
not specified                         9
zero-emission bus (not specified)     8
low emission (hybrid)                 3
BEB                                   2
low emission (propane)                2
mix (zero and low emission buses)     1
low emission (ethanol)                1
Name: prop_type, dtype: int64

### deleting rows from initial df that have prop_type == 'needs manual check'

In [63]:
# filters df for rows that do not equal `needs manual check`
# expect rows to drop from 130 to 72?
df = df[df['prop_type'] != 'needs manual check']

In [67]:
#check work. math is correct
#value counts shows all values are valid as expected.
display(df.shape, df.prop_type.value_counts())

(72, 14)

BEB                                  18
low emission (hybrid)                15
CNG                                  14
electric (not specified)              6
low emission (propane)                5
zero-emission bus (not specified)     4
mix (zero and low emission buses)     4
FCEB                                  4
mix (BEB and FCEB)                    1
mix (low emission)                    1
Name: prop_type, dtype: int64

In [68]:
# appending rows from manual_check to initial df
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
df = pd.concat([df, manual_check], ignore_index=True)



In [70]:
display(df.shape, df.prop_type.value_counts())

(130, 14)

no bus procurement                   21
BEB                                  20
low emission (hybrid)                18
electric (not specified)             17
CNG                                  14
zero-emission bus (not specified)    12
not specified                         9
low emission (propane)                7
mix (zero and low emission buses)     5
FCEB                                  4
mix (BEB and FCEB)                    1
mix (low emission)                    1
low emission (ethanol)                1
Name: prop_type, dtype: int64

### Need new column for `bus size type` via list and function
cutaway, 40ft etc

In [71]:
list(df.columns)

['state',
 'project_sponsor',
 'project_title',
 'description',
 'funding',
 'approx_#_of_buses',
 'project_type',
 'propulsion_category',
 'area_served',
 'congressional_districts',
 'fta_region',
 'bus/low-no_program',
 'bus_count',
 'prop_type']

In [72]:
bus_size = [
    "standard",
    "40 foot",
    "40-foot",
    "40ft",
    "articulated",
    "cutaway",
]

In [73]:
# Function to match keywords
def find_bus_size_type(description):
    for keyword in bus_size:
        if keyword in description.lower():
            return keyword
    return "not specified"

In [74]:
# new column called bus size type based on description column
df["bus_size_type"] = df["description"].apply(find_bus_size_type)

In [75]:
# check work
df.bus_size_type.value_counts()

not specified    126
cutaway            4
Name: bus_size_type, dtype: int64
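
The keyword list needs a separate entry for each spelling (`40 foot`, `40-foot`, `40ft`). As a hedged alternative, a single regular expression can cover those variants. The pattern and the sample descriptions below are assumptions for illustration and have not been run against the FTA descriptions.

```python
import pandas as pd

# Made-up descriptions.
toy = pd.Series(
    [
        "funding to buy 40-foot battery-electric buses",
        "funding to buy Cutaway vans",
        "funding to build a maintenance facility",
    ]
)

# One capture group; "40" may be followed by a space or hyphen, then "foot" or "ft".
pattern = r"(standard|40[\s-]?(?:foot|ft)|articulated|cutaway)"

# Lowercase first so matching is case-insensitive; unmatched rows become NaN.
size = toy.str.lower().str.extract(pattern, expand=False).fillna("not specified")
print(size.tolist())
```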

## Exporting cleaned data to GCS

In [76]:
# check work
display(df.head(3), df.bus_size_type.unique(), df.shape)

Unnamed: 0,state,project_sponsor,project_title,description,funding,approx_#_of_buses,project_type,propulsion_category,area_served,congressional_districts,fta_region,bus/low-no_program,bus_count,prop_type,bus_size_type
0,DC,Washington Metropolitan Area Transit Authority...,Battery-Electric Metrobus Procurement and Elec...,WMATA will receive funding to convert its Cind...,104000000,100(beb),bus/chargers,zero,Large Urban,DC-001 ; MD-004 ; MD-008 ; VA-008 ; VA-011,3,Low-No,100,BEB,not specified
1,TX,Dallas Area Rapid Transit (DART),DART CNG Bus Fleet Modernization Project,Dallas Area Rapid Transit will receive funding...,103000000,90(estimated-CNGbuses),bus,low,Large Urban,TX-003 ; TX-004 ; TX-005 ; TX-006 ; TX-024 ; T...,6,Low-No,90,CNG,not specified
2,LA,New Orleans Regional Transit Authority,Accelerating Zero-Emissions Mobility for a Res...,The New Orleans Regional Transit Authority wil...,71439261,20(zero-emission),bus/chargers/equipment,zero,Large Urban,LA-002 ; LA-001,6,Low-No,20,zero-emission bus (not specified),not specified


array(['not specified', 'cutaway'], dtype=object)

(130, 15)

In [77]:
# saving to GCS as csv
df.to_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_bus_cost_clean.csv"
)

## Reading in cleaned data from GCS

In [78]:
bus_cost = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_bus_cost_clean.csv"
)

In [79]:
# confirming cleaned data shows as expected.
display(bus_cost.shape, type(bus_cost), bus_cost.columns)

(130, 16)

pandas.core.frame.DataFrame

Index(['Unnamed: 0', 'state', 'project_sponsor', 'project_title',
       'description', 'funding', 'approx_#_of_buses', 'project_type',
       'propulsion_category', 'area_served', 'congressional_districts',
       'fta_region', 'bus/low-no_program', 'bus_count', 'prop_type',
       'bus_size_type'],
      dtype='object')
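
The `Unnamed: 0` column in the output above comes from the dataframe index being written to the CSV and then read back as a regular column. A minimal sketch of the effect, using an in-memory buffer and made-up rows instead of GCS:

```python
import io

import pandas as pd

toy = pd.DataFrame({"state": ["DC", "TX"], "funding": [104000000, 103000000]})

# Default to_csv writes the index as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# index=False leaves the index out, so the round trip preserves the columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

If the extra column is not wanted downstream, passing `index=False` on export (or `index_col=0` on import) would avoid it. Either change would alter the shapes shown in this notebook, so it is noted here rather than applied.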

In [80]:
bus_cost["prop_type"].sort_values(ascending=True).unique()

array(['BEB', 'CNG', 'FCEB', 'electric (not specified)',
       'low emission (ethanol)', 'low emission (hybrid)',
       'low emission (propane)', 'mix (BEB and FCEB)',
       'mix (low emission)', 'mix (zero and low emission buses)',
       'no bus procurement', 'not specified',
       'zero-emission bus (not specified)'], dtype=object)

## DEPRECATED - Data Analysis
The actual data analysis and summary stats exist in the `cost_per_bus_analysis.ipynb` notebook.

### Cost per Bus, per Transit Agency dataframe

In [None]:
only_bus = bus_cost[bus_cost["bus_count"] > 0]
only_bus.head()

In [None]:
cost_per_bus = (
    only_bus.groupby("project_sponsor")
    .agg({"funding": "sum", "bus_count": "sum"})
    .reset_index()
)

In [None]:
cost_per_bus["cost_per_bus"] = (
    cost_per_bus["funding"] / cost_per_bus["bus_count"]
).astype("int64")
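
The per-agency calculation can be traced on a tiny made-up example. The agencies and amounts below are hypothetical; the steps mirror the filter, group, and divide sequence above.

```python
import pandas as pd

# Made-up awards: Agency A has two projects, Agency C has no buses.
toy = pd.DataFrame(
    {
        "project_sponsor": ["Agency A", "Agency A", "Agency B", "Agency C"],
        "funding": [1_000_000, 500_000, 900_000, 2_000_000],
        "bus_count": [2, 1, 3, 0],
    }
)

# Dropping zero-bus projects first avoids dividing by zero later.
only_bus = toy[toy["bus_count"] > 0]

cost_per_bus = (
    only_bus.groupby("project_sponsor")
    .agg({"funding": "sum", "bus_count": "sum"})
    .reset_index()
)

# Total funding divided by total buses, per agency, truncated to whole dollars.
cost_per_bus["cost_per_bus"] = (
    cost_per_bus["funding"] / cost_per_bus["bus_count"]
).astype("int64")
print(cost_per_bus)
```

As the summary later notes, awards often bundle buses with facilities and equipment, so this figure is funding per bus rather than a pure vehicle price.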

In [None]:
cost_per_bus.dtypes

In [None]:
cost_per_bus

In [None]:
## export cost_per_bus df to gcs
cost_per_bus.to_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_cost_per_bus.csv"
)

### Cost per bus, stats analysis

In [None]:
# read in fta cost per bus csv
cost_per_bus = pd.read_csv(
    "gs://calitp-analytics-data/data-analyses/bus_procurement_cost/fta_cost_per_bus.csv"
)

In [None]:
display(cost_per_bus.shape, cost_per_bus.head())

### Summary Stats

In [None]:
# top level analysis

bus_cost.agg({"project_title": "count", "funding": "sum", "bus_count": "sum"})

In [None]:
# start of agg. by project_type

bus_cost.groupby("project_type").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# agg by program

bus_cost.groupby("bus/low-no_program").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# agg by state, by funding
bus_cost.groupby("state").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
).sort_values(by="funding", ascending=False)

### Projects with bus purchases

In [None]:
# df of only projects with a bus count
only_bus = bus_cost[bus_cost["bus_count"] > 0]

In [None]:
display(only_bus.shape, only_bus.columns)

In [None]:
# agg by propulsion category (the column was renamed from propulsion_type during cleaning)
only_bus["propulsion_category"].value_counts()

In [None]:
only_bus.project_type.value_counts()

In [None]:
# of the rows with bus_count > 0, what are the project types?
bus_agg = only_bus.groupby("project_type").agg(
    {"project_type": "count", "funding": "sum", "bus_count": "sum"}
)

In [None]:
# new column that calculates `cost per bus`
bus_agg["cost_per_bus"] = (bus_agg["funding"] / bus_agg["bus_count"]).astype("int64")

In [None]:
bus_agg

### Projects with no buses

In [None]:
no_bus = bus_cost[bus_cost["bus_count"] < 1]

In [None]:
no_bus["project_type"].value_counts()

### Overall Summary

In [None]:
project_count = bus_cost.project_title.count()
fund_sum = bus_cost.funding.sum()
bus_count_sum = bus_cost.bus_count.sum()
overall_cost_per_bus = (fund_sum) / (bus_count_sum)
bus_program_count = bus_cost["bus/low-no_program"].value_counts()

projects_with_bus = only_bus.project_title.count()
projects_with_bus_funds = only_bus.funding.sum()
# named bus_project_cost_per_bus so the cost_per_bus DataFrame is not overwritten
bus_project_cost_per_bus = (only_bus.funding.sum()) / (bus_count_sum)

In [None]:
summary = f"""
Top Level observation:
- {project_count} projects awarded
- ${fund_sum:,.2f} awarded
- {bus_count_sum} buses to be purchased
- ${overall_cost_per_bus:,.2f} overall cost per bus

Projects have some mix of buses, facilities and equipment, making it difficult to disaggregate actual bus cost.

Of the {project_count} projects awarded, {projects_with_bus} projects included buses. The remainder were facilities, chargers and equipment.

Projects with bus purchases:
- {projects_with_bus} projects
- ${projects_with_bus_funds:,.2f} awarded to purchase buses
- ${bus_project_cost_per_bus:,.2f} cost per bus
"""

In [None]:
print(summary)

In [None]:
# cost_per_bus must be the per-agency DataFrame (with a cost_per_bus column)
cost_per_bus_values = cost_per_bus["cost_per_bus"]

# Calculate mean and standard deviation
mean_value = cost_per_bus_values.mean()
std_deviation = cost_per_bus_values.std()

# Plot histogram
plt.hist(cost_per_bus_values, bins=30, color="skyblue", edgecolor="black", alpha=0.7)

# Add vertical lines for mean and standard deviation
plt.axvline(mean_value, color="red", linestyle="dashed", linewidth=2, label="Mean")
plt.axvline(
    mean_value + std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean + 1 Std Dev",
)
plt.axvline(
    mean_value - std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean - 1 Std Dev",
)

# Set labels and title
plt.xlabel("cost_per_bus")
plt.ylabel("Frequency")
plt.title("Histogram of cost_per_bus with Mean and Std Dev Lines")
plt.legend()

# Show the plot
plt.show()

In [None]:
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import pandas as pd

# cost_per_bus must be the per-agency DataFrame (with a cost_per_bus column)
cost_per_bus_values = cost_per_bus["cost_per_bus"]

# Calculate mean and standard deviation
mean_value = cost_per_bus_values.mean()
std_deviation = cost_per_bus_values.std()

# Plot histogram
plt.hist(cost_per_bus_values, bins=20, color="skyblue", edgecolor="black", alpha=0.7)

# Add vertical lines for mean and standard deviation
plt.axvline(mean_value, color="red", linestyle="dashed", linewidth=2, label="Mean")
plt.axvline(
    mean_value + std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean + 1 Std Dev",
)
plt.axvline(
    mean_value - std_deviation,
    color="green",
    linestyle="dashed",
    linewidth=2,
    label="Mean - 1 Std Dev",
)

# Set labels and title
plt.xlabel("Cost per Bus (USD)")
plt.ylabel("Frequency")
plt.title("Histogram of Cost per Bus with Mean and Std Dev Lines")
plt.legend()

# Format x-axis ticks as USD
plt.gca().xaxis.set_major_formatter(mticker.StrMethodFormatter("${x:,.0f}"))

# Show the plot
plt.show()