## TIRCP CalSTA
* TIRCP outcomes for cycles 3-5, prepared for the California State Transportation Agency (CalSTA).
* [Cycles 1-6](https://calsta.ca.gov/subject-areas/transit-intercity-rail-capital-prog)
* Cycle 1: 2015
* Cycle 2: 2016
* Cycle 3: 2018
* Cycle 4: 2020
* Cycle 5: 2022
* Cycle 6: 2023

In [3]:
import A1_data_prep
import A2_tableau
import numpy as np
import pandas as pd
from babel.numbers import format_currency
from calitp import *

In [4]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [5]:
# GCS File Path:
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/tircp/"

### Manipulate TIRCP
#### Filter out for cycles of interest

In [6]:
df_tircp = to_snakecase(A2_tableau.tableau_dashboard())



In [7]:
df_tircp2 = df_tircp.loc[df_tircp["award_year"] >= 2018].reset_index(drop=True)

In [8]:
# Sort df by award year and number
df_tircp2 = df_tircp2.sort_values(["award_year", "#"])

In [9]:
df_tircp2.award_year.value_counts(),

(2018    28
 2022    23
 2020    17
 Name: award_year, dtype: int64,)

In [10]:
df_tircp2.ppno.nunique(), df_tircp2.title.nunique(), len(df_tircp2)

(59, 67, 68)

In [11]:
# Find duplicate project titles
df_tircp2.title.value_counts().head()

North State Intercity Bus System                                                                    2
Purchase Zero Emission High Capacity Buses to Support Transbay Tomorrow and Clean Corridors Plan    1
Expansion of WETA Ferry Services                                                                    1
South Bay Microtransit Expansion                                                                    1
Sacramento Valley Station (SVS) Transit Center: Priority Project                                    1
Name: title, dtype: int64

In [12]:
# Create a detailed title column to avoid duplicates
df_tircp2["award_year"] = df_tircp2["award_year"].astype("object")

In [13]:
detailed_title_cols = [
    "award_year",
    "title",
    "grant_recipient",
]

In [14]:
# https://stackoverflow.com/questions/39291499/how-to-concatenate-multiple-column-values-into-a-single-column-in-pandas-datafra
df_tircp2["detailed_title_col"] = df_tircp2[detailed_title_cols].apply(
    lambda row: "-".join(row.values.astype(str)), axis=1
)
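
The reason for the composite title can be checked on toy data. This is a minimal sketch, assuming a small hypothetical frame with the same three columns; the rows are invented for illustration and do not come from the TIRCP tracking sheet.

```python
import pandas as pd

# Hypothetical rows: the same title appears twice with different recipients.
toy = pd.DataFrame(
    {
        "award_year": [2018, 2018, 2020],
        "title": ["Bus System", "Bus System", "Ferry Expansion"],
        "grant_recipient": ["Agency A", "Agency B", "Agency C"],
    }
)

key_cols = ["award_year", "title", "grant_recipient"]

# Cast every key column to string, then join across each row with "-".
toy["detailed_title_col"] = toy[key_cols].astype(str).agg("-".join, axis=1)

# The bare title repeats, but the composite key does not.
title_is_unique = toy["title"].is_unique
key_is_unique = toy["detailed_title_col"].is_unique
```

Checking `is_unique` on the new column is a quick way to confirm that the composite key actually separates the duplicated titles.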

In [15]:
# Subset
df_tircp2 = df_tircp2[
    [
        "award_year",
        "#",
        "ppno",
        "tircp",
        "title",
        "detailed_title_col",
        "grant_recipient",
        "district",
        "county",
        "description",
        "total__cost",
        "estimated_tircp_ghg_reductions",
        "cost_per_ghg_ton_reduced",
        "increased_ridership",
        "service_integration",
        "improve_safety",
    ]
]

#### Add Project Number

In [16]:
df_tircp2["project_number_use"] = (
    df_tircp2["award_year"].astype(str) + "-" + df_tircp2["#"].astype(str)
)

### Add info based on SCCP's output example
SCCP output columns: Project ID, Project Name, Implementing Agency, Program, Project Description, Total Cost, SB 1 Funds, Fiscal Year, Is SB 1?, Project Status, Assembly Districts, Senate Districts, Counties, Cities, Caltrans Districts, Is on SHS?, Date Updated, Cycle


#### GIS Template has Assembly District/Senate District/City/Counties info
* Although Linda provided more up-to-date and complete GIS information, this template is still used to glean project statuses.

In [17]:
# Read in sheet with Assembly info.
gis = to_snakecase(
    pd.read_excel(
        f"{GCS_FILE_PATH}TIRCP_GIS_Template_Requirements 6-1-2022.xlsx",
        sheet_name="Projects Table",
    )
)

In [18]:
# Clean some column names
gis = gis.rename(
    columns={
        "ppno_": "ppno",
    }
)

In [19]:
# Clean PPNO
gis = A1_data_prep.ppno_slice(gis)

In [20]:
# Subset to the columns of interest; .copy() avoids SettingWithCopyWarning on later assignments
gis2 = gis[
    [
        "project_number",
        "ppno",
        "projecttitle",
        "projectstatus",
    ]
].copy()

In [21]:
gis2.ppno.nunique()

45

In [22]:
# There are multiple entries for each ppno.
gis2.ppno.value_counts().head()

CP033    60
CP035    21
CP042    18
CP032    14
CP031    11
Name: ppno, dtype: int64

In [23]:
# Inglewood Transit Center coded as CP063, should be CP062
gis2.loc[
    (gis2["projecttitle"] == "Inglewood Transit Center (2020:04)"), "ppno"
] = "CP062"


In [24]:
# North State Intercity Bus System coded as CP063 in TIRCP Tracking sheet.
gis2.loc[
    (
        gis2["projecttitle"]
        == "North State Intercity Bus System-Lake County Interregional Transit Center (2020:05)"
    ),
    "ppno",
] = "CP063"


In [25]:
# gis2.loc[gis2['ppno'] == 'CP063']

In [26]:
gis2.loc[
    gis2["projecttitle"]
    == "North State Intercity Bus System-Lake County Interregional Transit Center (2020:05)"
]

| | project_number | ppno | projecttitle | projectstatus |
|---|---|---|---|---|
| 202 | 2020:05 | CP063 | North State Intercity Bus System-Lake County Interregional Transit Center (2020:05) | PA&ED |
| 203 | 2020:05 | CP063 | North State Intercity Bus System-Lake County Interregional Transit Center (2020:05) | R/W |
| 204 | 2020:05 | CP063 | North State Intercity Bus System-Lake County Interregional Transit Center (2020:05) | Construction |
| 205 | 2020:05 | CP063 | North State Intercity Bus System-Lake County Interregional Transit Center (2020:05) | Ops./Procure |


In [27]:
# Clean project_number, only keep year
gis2["project_number"] = gis2["project_number"].str.split(":").str[0]


In [28]:
gis2["project_number"] = gis2["project_number"].fillna(0).astype("int64")


In [29]:
# Place project status all on one row & Remove duplicate statuses
def summarize_rows(df, col_to_group: str, col_to_summarize: str):
    df = df.groupby(col_to_group)[col_to_summarize].apply(",".join).reset_index()

    df[col_to_summarize] = (
        df[col_to_summarize]
        .apply(lambda x: ", ".join(set([y.strip() for y in x.split(",")])))
        .str.strip()
    )
    return df
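
To illustrate what `summarize_rows` does, here is a minimal sketch on invented data. The variant below is named `summarize_rows_sorted` (a hypothetical name, not part of the original notebook). It sorts the de-duplicated statuses, since joining a plain `set` yields an arbitrary order, and it treats each row's status as a single value rather than splitting it on commas.

```python
import pandas as pd


def summarize_rows_sorted(df, col_to_group: str, col_to_summarize: str):
    # Collapse each group to one row holding its sorted, unique values.
    return (
        df.groupby(col_to_group)[col_to_summarize]
        .agg(lambda s: ", ".join(sorted({v.strip() for v in s})))
        .reset_index()
    )


# Hypothetical input: one PPNO appears on several rows with repeated statuses.
toy = pd.DataFrame(
    {
        "ppno": ["CP001", "CP001", "CP001", "CP002"],
        "projectstatus": ["R/W", "Construction", "R/W", "PA&ED"],
    }
)

summary = summarize_rows_sorted(toy, "ppno", "projectstatus")
```

Sorting makes the combined status string reproducible between runs, which helps when comparing exported sheets.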

In [30]:
project_status_gis = summarize_rows(gis2, "ppno", "projectstatus")

In [31]:
# Check that the number of rows matches the number of unique ppno
len(project_status_gis) == gis2.ppno.nunique()

True

In [32]:
# Drop old project status
gis2 = gis2.drop(columns=["projectstatus"])

In [33]:
# Merge with original gis, so there is only one row for each PPNO
final_gis = (
    pd.merge(project_status_gis, gis2, how="left", on=["ppno"])
    .drop_duplicates("ppno")
    .reset_index(drop=True)
)

In [34]:
len(final_gis), final_gis.ppno.nunique()

(45, 45)

#### Merge with TIRCP Tracking

In [35]:
# Merge with df_tircp2
merge1 = pd.merge(
    df_tircp2,
    final_gis,
    how="left",
    left_on=["ppno", "award_year"],
    right_on=["ppno", "project_number"],
    indicator=True,
)

In [36]:
merge1._merge.value_counts()

both          43
left_only     25
right_only     0
Name: _merge, dtype: int64
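
The `left_only` count shows how many tracking-sheet rows found no GIS match. A minimal sketch of how such rows could be listed, using two small hypothetical frames in place of `df_tircp2` and `final_gis`:

```python
import pandas as pd

# Hypothetical stand-ins for the tracking sheet and the GIS table.
left = pd.DataFrame(
    {"ppno": ["CP001", "CP002", "CP003"], "award_year": [2018, 2020, 2022]}
)
right = pd.DataFrame(
    {
        "ppno": ["CP001", "CP002"],
        "project_number": [2018, 2020],
        "projectstatus": ["R/W", "PA&ED"],
    }
)

merged = pd.merge(
    left,
    right,
    how="left",
    left_on=["ppno", "award_year"],
    right_on=["ppno", "project_number"],
    indicator=True,
)

# Rows present only in the left frame have no project status to report.
unmatched = merged.loc[merged["_merge"] == "left_only", ["ppno", "award_year"]]
```

Listing the unmatched keys makes it easier to tell whether a project is genuinely missing from the GIS table or was coded under a different PPNO.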

In [37]:
# Double Check that titles & years correspond with one another
merge1[["title", "projectstatus"]].sample(2)

Unnamed: 0,title,projectstatus
51,Oakland Waterfront Mobility Hub,
56,Expanding Transit Services and Introducing Zero-Emission Fleets on California’s North Coast,


In [38]:
merge1 = merge1.drop(columns=["_merge"])

#### GIS Info from Tracking Sheet 2.0 

In [39]:
gis_tracking_sheet = to_snakecase(
    pd.read_excel(f"{GCS_FILE_PATH}{A1_data_prep.FILE_NAME}", sheet_name="GIS Info")
)

In [40]:
# Drop certain cols
gis_tracking_sheet = gis_tracking_sheet[
    [
        "award_year",
        "project_title",
        "caltransdistrict",
        "assembly\ndistricts",
        "senate\ndistricts",
        "city_code",
        "county_code",
        "_implementing_agency__id_",
    ]
]

In [41]:
# Only keep the years wanted
gis_tracking_sheet2 = gis_tracking_sheet.loc[
    gis_tracking_sheet["award_year"] >= 2018
].reset_index(drop=True)

In [42]:
# Merge merge1 with the GIS info from the tracking sheet
merge2 = pd.merge(
    merge1,
    gis_tracking_sheet2,
    how="left",
    left_on=["award_year", "title"],
    right_on=["award_year", "project_title"],
    indicator=True,
)

In [43]:
# Check the merge results
merge2._merge.value_counts()

both          68
left_only      0
right_only     0
Name: _merge, dtype: int64

In [44]:
merge2.shape, df_tircp2.shape

((68, 28), (68, 17))

In [45]:
merge2 = merge2.drop(
    columns=["project_number", "projecttitle", "project_title", "_merge"]
)

In [46]:
merge2.columns

Index(['award_year', '#', 'ppno', 'tircp', 'title', 'detailed_title_col',
       'grant_recipient', 'district', 'county', 'description', 'total__cost',
       'estimated_tircp_ghg_reductions', 'cost_per_ghg_ton_reduced',
       'increased_ridership', 'service_integration', 'improve_safety',
       'project_number_use', 'projectstatus', 'caltransdistrict',
       'assembly\ndistricts', 'senate\ndistricts', 'city_code', 'county_code',
       '_implementing_agency__id_'],
      dtype='object')

### Project Sheet 

In [47]:
# Copy merge 2
projects = merge2.copy()

In [48]:
# Fill in empty values: 0 for numeric columns, "None" for object columns
projects = projects.fillna(
    projects.dtypes.replace({"float64": 0.0, "object": "None", "int64": 0})
)
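
The dtype-based fill can also be written more explicitly. A minimal sketch on a hypothetical two-column frame, assuming the goal is 0 for numeric columns and the string "None" for everything else:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
toy = pd.DataFrame(
    {
        "title": ["Project A", None],
        "tircp": [1000.0, np.nan],
    }
)

# Split the columns by kind so each group gets its own fill value.
num_cols = toy.select_dtypes(include="number").columns
text_cols = toy.select_dtypes(exclude="number").columns

filled = toy.copy()
filled[num_cols] = filled[num_cols].fillna(0)
filled[text_cols] = filled[text_cols].fillna("None")
```

Selecting columns with `select_dtypes` avoids relying on dtype names as dictionary keys.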

In [49]:
# Format monetary cols
monetary_cols = ["total__cost", "tircp"]
for i in monetary_cols:
    projects[i] = projects[i].apply(
        lambda x: format_currency(x, currency="USD", locale="en_US")
    )
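
For reference, comparable US-dollar strings can be produced without `babel`. A minimal sketch using a plain f-string; `format_usd` is a hypothetical helper and only covers US-style formatting, not other locales or currencies.

```python
def format_usd(amount: float) -> str:
    # Thousands separators and two decimals, with the sign ahead of the "$".
    sign = "-" if amount < 0 else ""
    return f"{sign}${abs(amount):,.2f}"


examples = [format_usd(v) for v in (1234567.5, 0, -42)]
```

Note that formatted values are strings, so any sums or sorting on the monetary columns should happen before this step.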

In [50]:
# Clean up columns
projects = A1_data_prep.clean_up_columns(projects)

In [51]:
projects = projects.rename(
    columns={
        "Number Use": "Project Number",
        "Assembly\nDistricts": "Assembly Districts",
        "Senate\nDistricts": "Senate Districts",
        "Caltransdistrict": "CT Districts",
    }
)

In [52]:
# Rearrange columns
right_order = [
    "Award Year",
    "#",
    "Project Number",
    "Ppno",
    "Title",
    "Grant Recipient",
    "Tircp",
    "Total  Cost",
    "Description",
    "District",
    "County",
    "Estimated Tircp Ghg Reductions",
    "Cost Per Ghg Ton Reduced",
    "Increased Ridership",
    "Service Integration",
    "Improve Safety",
    "Status",
    "CT Districts",
    "Assembly Districts",
    "Senate Districts",
    "City Code",
    "County Code",
    "Implementing Agency  Id",
]

In [53]:
projects = projects[right_order]

### Outcomes Sheet

In [54]:
# Measure columns
measure_cols = [
    "estimated_tircp_ghg_reductions",
    "cost_per_ghg_ton_reduced",
    "increased_ridership",
    "service_integration",
    "improve_safety",
]

In [58]:
# Turn estimated GHG reductions into a number
merge2["estimated_tircp_ghg_reductions"] = (
    merge2["estimated_tircp_ghg_reductions"]
    .str.replace("MTCO2e", "")
    .str.replace("None", "")
    .str.replace(",", "")
)

In [59]:
merge2["estimated_tircp_ghg_reductions"] = pd.to_numeric(
    merge2["estimated_tircp_ghg_reductions"], errors="coerce"
).fillna(0)
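
The two cleaning cells above can be tried on a few sample strings. A minimal sketch, assuming the raw values look like "61,000 MTCO2e"; the sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical raw values, including a literal "None" and a missing entry.
raw = pd.Series(["61,000 MTCO2e", "None", "1,250MTCO2e", None])

# Remove the unit and thousands separators as literal text, then trim spaces.
cleaned = (
    raw.str.replace("MTCO2e", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
)

# Anything that still is not a number becomes NaN, then 0.
ghg = pd.to_numeric(cleaned, errors="coerce").fillna(0)
```

Because unparseable values are turned into 0, they silently lower yearly totals; counting the NaN values before `fillna` would show how many rows are affected.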

In [60]:
merge2.columns

Index(['award_year', '#', 'ppno', 'tircp', 'title', 'detailed_title_col',
       'grant_recipient', 'district', 'county', 'description', 'total__cost',
       'estimated_tircp_ghg_reductions', 'cost_per_ghg_ton_reduced',
       'increased_ridership', 'service_integration', 'improve_safety',
       'project_number_use', 'projectstatus', 'caltransdistrict',
       'assembly\ndistricts', 'senate\ndistricts', 'city_code', 'county_code',
       '_implementing_agency__id_'],
      dtype='object')

In [63]:
# Subset to cols similar to SCCP; use merge2, where GHG reductions were converted to numbers
outcomes = merge2[
    [
        "award_year",
        'detailed_title_col',
        "estimated_tircp_ghg_reductions",
        "cost_per_ghg_ton_reduced",
        "increased_ridership",
        "service_integration",
        "improve_safety",
    ]
].sort_values(["award_year", 'detailed_title_col',])

In [64]:
outcomes = A1_data_prep.clean_up_columns(outcomes)

In [65]:
outcomes.head(1)

| | Award Year | Detailed Title Col | Estimated Tircp Ghg Reductions | Cost Per Ghg Ton Reduced | Increased Ridership | Service Integration | Improve Safety |
|---|---|---|---|---|---|---|---|
| 1 | 2018 | 2018-#Electrify Anaheim: Changing the Transit Paradigm in Southern California-Anaheim Transportation Network | 61000.0 | Medium-High | Medium-High | Medium-High | Medium |


##### Version 1

In [66]:
# Drop award year
outcomes_transformed = outcomes.drop(columns=["Award Year"]).T

In [67]:
# Use the first row as the column names
outcomes_transformed.columns = outcomes_transformed.iloc[0]

In [68]:
# Drop the first row, which now duplicates the column names
outcomes_transformed = outcomes_transformed.iloc[1:]
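
The transpose, header promotion, and row drop above can be collapsed into one step by setting the title column as the index before transposing. A minimal sketch on a hypothetical two-project frame:

```python
import pandas as pd

# Hypothetical outcomes frame with two projects and two measures.
toy = pd.DataFrame(
    {
        "Detailed Title Col": ["2018-A", "2018-B"],
        "Increased Ridership": ["High", "Medium"],
        "Improve Safety": ["Medium", "High"],
    }
)

# Titles become the column headers, measures become the rows.
wide = toy.set_index("Detailed Title Col").T
```

This gives one column per project and one row per measure, without a leftover header row to remove.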

##### Outputs: Measures except GHG Reductions.

In [69]:
outcomes_melt = pd.melt(
    outcomes,
    id_vars=[
        "Award Year",
        "Detailed Title Col",
    ],
    value_vars=[
        "Cost Per Ghg Ton Reduced",
        "Increased Ridership",
        "Service Integration",
        "Improve Safety",
    ],
)

In [70]:
outcomes_melt = A1_data_prep.clean_up_columns(outcomes_melt)

In [72]:
year_summary = (
    outcomes_melt.groupby(["Award Year", "Variable", "Value"])
    .agg({"Detailed Title Col": "nunique"})
    .rename(
        columns={"Detailed Title Col": "Number of Projects in this Value Category"}
    )
)
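
The melt-then-count pattern can be seen on a small example. A minimal sketch with invented rows, where `var_name` and `value_name` are set directly instead of renaming the columns afterwards:

```python
import pandas as pd

# Hypothetical outcomes for three projects across two award years.
toy = pd.DataFrame(
    {
        "Award Year": [2018, 2018, 2020],
        "Detailed Title Col": ["2018-A", "2018-B", "2020-C"],
        "Increased Ridership": ["High", "High", "Medium"],
        "Improve Safety": ["Medium", "High", "Medium"],
    }
)

# Long format: one row per project and measure.
long = toy.melt(
    id_vars=["Award Year", "Detailed Title Col"],
    var_name="Variable",
    value_name="Value",
)

# Count distinct projects in each (year, measure, rating) combination.
counts = long.groupby(["Award Year", "Variable", "Value"])[
    "Detailed Title Col"
].nunique()
```

Each entry of `counts` answers a question such as "how many 2018 projects were rated High for increased ridership".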

In [73]:
year_summary

| Award Year | Variable | Value | Number of Projects in this Value Category |
|---|---|---|---|
| 2018 | Cost Per Ghg Ton Reduced | High | 16 |
| 2018 | Cost Per Ghg Ton Reduced | Medium | 3 |
| 2018 | Cost Per Ghg Ton Reduced | Medium-High | 8 |
| 2018 | Cost Per Ghg Ton Reduced | | 1 |
| 2018 | Improve Safety | High | 9 |
| 2018 | Improve Safety | Medium | 12 |
| 2018 | Improve Safety | Medium-High | 7 |
| 2018 | Increased Ridership | High | 13 |
| 2018 | Increased Ridership | Medium | 10 |
| 2018 | Increased Ridership | Medium-High | 5 |


##### GHG Reductions.

In [74]:
GHG_by_year = outcomes.groupby(["Award Year"]).agg(
    {"Estimated Tircp Ghg Reductions": "sum"}
)

In [75]:
GHG_by_year

| Award Year | Estimated Tircp Ghg Reductions |
|---|---|
| 2018 | 31944000.0 |
| 2020 | 5016000.0 |
| 2022 | 4332000.0 |


#### Save

In [None]:
"""
with pd.ExcelWriter(f"{GCS_FILE_PATH}calsta_draft.xlsx") as writer:
    outcomes.to_excel(writer, sheet_name="outcomes_unpivoted", index=True)
    outcomes_transformed.to_excel(writer, sheet_name="outcomes_transformed", index=True)
    projects.to_excel(writer, sheet_name="projects", index=True)
    year_summary.to_excel(writer, sheet_name="year_summary", index=True)
    GHG_by_year.to_excel(writer, sheet_name="GHG_reduction_year", index=True)
    """