## Autotagging projects
* Who is the lead agency? 
    * Agency in this project = the entity who is receiving funding for this project.
* Is this project on or off the SHS or both?
* How to tell if a project criss-crosses the SHS?

In [1]:
import pandas as pd

# Settings
pd.options.display.max_columns = 100
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)
pd.options.display.float_format = "{:,.2f}".format

GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/project_prioritization/"
FILE = "fake_data.xlsx"

# My utilities
import _utils
from calitp import *



### Preliminary 

In [2]:
# Read in file
df = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}{FILE}", sheet_name="fake"))

In [3]:
# Subset to columns I want.
df2 = df[
    [
        "project_name",
        "lead_agency",
        "project_description",
        "primary_mode",
        "secondary_mode_s",
        "shs_capacity_increase_detail",
    ]
]

In [4]:
# Count combos
combos = (
    df2.groupby(["primary_mode", "secondary_mode_s", "shs_capacity_increase_detail"])
    .size()
    .reset_index()
    .rename(columns={0: "count"})
)

In [5]:
# Find most commmon combos
combos.sort_values(["count"], ascending=False).head()

Unnamed: 0,primary_mode,secondary_mode_s,shs_capacity_increase_detail,count
171,Rail (Passenger),,,111
78,Highway,,General Purpose Lane,62
10,Bike/Pedestrian,,,57
37,Grade Crossing,,,29
164,Rail (Freight),,,28


### Function #1
* Tag whether values in a column are "highway related" before figuring out if they are on the SHS or not. 

In [27]:
def tagging_columns(df, tagging_col: str, new_col: str, keyword_list: list, true_keyword:str, false_keyword:str):
    """
    Search through a column for keywords.

    Args
    df: the dataframe.
    tagging_col (str): the column to search for the appearance of keywords.
    new_col (str): input whether or not the keyword was found.
    keyword_list (list): list of keywords to search through.
    true_keyword (str): replace "true" with a phrase that is more detailed.
    false_keyword (str): same as above, but with false.

    Returns: a dataframe with a new column stating whether
    the keyword(s) were found or not.
    """
    # Delinate items in keywords list using |
    keywords = f"({'|'.join(keyword_list)})"

    # Lower the strings + strip excess white spaces 
    df[tagging_col] = df[tagging_col].str.lower().str.strip()

    # Create a new column that captures whether or not the keyword appears
    # Using str contains so interchange (new) and interchange (modifying) will appear.
    df["keyword_appears_bool"] = df[tagging_col].str.contains(keywords)

    # Function to categorize whether keyword was found
    def keyword_found(row):
        if row["keyword_appears_bool"] == True:
            return true_keyword
        else:
            return false_keyword

    # Apply function and save results in a new column
    df[new_col] = df.apply(lambda x: keyword_found(x), axis=1)

    # Drop keyword col
    df = df.drop(columns=["keyword_appears_bool"])

    return df

In [28]:
# Search through primary mode.
df3 = tagging_columns(
    df2,
    "primary_mode",
    "primary_mode_SHS",
    [
        "highway",
        "its",
        "interchange",
        "grade",
    ],
    "highway related",
    "not highway related"
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [29]:
# Preview that this is correct
df3[["primary_mode", "primary_mode_SHS"]].drop_duplicates()

Unnamed: 0,primary_mode,primary_mode_SHS
0,complete streets,not highway related
1,bike/pedestrian,not highway related
2,transit,not highway related
3,highway,highway related
5,its,highway related
10,interchange (new),highway related
12,bridge,not highway related
14,zev,not highway related
26,local roadway,not highway related
30,interchange (modification),highway related


In [30]:
# Search through secondary mode.
df3 = tagging_columns(
    df3, "secondary_mode_s", "secondary_mode_SHS", ["highway", "lane", "interchange", "its", "grade"], "highway related",
    "not highway related"
)



In [31]:
df3[["secondary_mode_s", "secondary_mode_SHS"]].drop_duplicates()

Unnamed: 0,secondary_mode_s,secondary_mode_SHS
0,bike/pedestrian,not highway related
1,none,not highway related
2,highway,highway related
4,bike/pedestrian\ncomplete streets,not highway related
5,its,highway related
7,zev,not highway related
12,bridge,not highway related
13,complete streets,not highway related
18,bike/pedestrian\nbridge,not highway related
19,bike/pedestrian\ncomplete streets\nits\ntransit,highway related


In [32]:
# Search through SHS Capacity Detail.
df3 = tagging_columns(
    df3,
    "shs_capacity_increase_detail",
    "shs_capacity_increase_detail_SHS",
    ["highway", "lane", "interchange"], "highway related",
    "not highway related"
)



In [33]:
df3[["shs_capacity_increase_detail",
    "shs_capacity_increase_detail_SHS",]].drop_duplicates()

Unnamed: 0,shs_capacity_increase_detail,shs_capacity_increase_detail_SHS
0,none,not highway related
3,auxiliary lane,highway related
6,general purpose lane,highway related
8,transit/bus-only lane (addition),highway related
10,interchange (new),highway related
27,managed lane,highway related
44,managed lane (conversion),highway related
45,express lane (conversion),highway related
51,auxiliary lane\nmanaged lane (addition),highway related
52,managed lanes addition,highway related


### Function 2
* Apply a function to summarize the results in a single sentence.
    

In [34]:
def SHS_lead_agency_info(df):

    # Tag if the lead agency is Caltrans or a partner.
    def CT_or_partner(row):
        # If partner is none, return Unknown.
        if row.lead_agency == "None":
            return "unknown"
        # If only Caltrans, return Caltrans
        if row.lead_agency == "Caltrans":
            return "Caltrans"
        # Everything else is a partner agency
        else:
            return "a partner"

    # Apply the function
    df["caltrans_or_partner"] = df.apply(CT_or_partner, axis=1)

    # Tag if a project is on the SHS or not thorugh various combos.
    def on_SHS(row):
        # If both secondary, primary, and SHS are highway related and lead agency is Caltrans: on SHS
        if (
            (row.secondary_mode_SHS == "highway related")
            and (row.primary_mode_SHS == "highway related")
            and (row.shs_capacity_increase_detail_SHS == "highway related")
            and (row.caltrans_or_partner == "Caltrans")
        ):
            return "on the SHS"
        # If secondary, primary, and SHS are all highway related: on the SHS. Or perhaps possibly?
        elif (
            (row.secondary_mode_SHS == "highway related")
            and (row.primary_mode_SHS == "highway related")
            and (row.shs_capacity_increase_detail_SHS == "highway related")
        ):
            return "on the SHS"
        # If nothing is highway related: not on SHS.
        elif (
            (row.shs_capacity_increase_detail_SHS == "not highway related")
            and (row.secondary_mode_SHS == "not highway related")
            and (row.primary_mode_SHS == "not highway related")
        ):
            return "not on the SHS"
        # Everything else is not on SHS.
        else:
            return "possibly on the SHS"

    # Apply the function
    df["On_SHS"] = df.apply(on_SHS, axis=1)

    # Create a summary sentence 
    df["sentence"] = (
        "The lead agency is "
        + df["caltrans_or_partner"] 
        + " and the project is "
        + df["On_SHS"]
        + "."
    )

    return df

In [35]:
df4 = SHS_lead_agency_info(df3)

In [36]:
# Check value counts.
df4.caltrans_or_partner.value_counts()

a partner    350
unknown      231
Caltrans     162
Name: caltrans_or_partner, dtype: int64

In [37]:
# Check value counts.
(df4.On_SHS.value_counts()/len(df4))*100

possibly on the SHS   48.32
not on the SHS        43.61
on the SHS             8.08
Name: On_SHS, dtype: float64

In [38]:
# Total sentences
df4.sentence.nunique()

9

In [39]:
# Check value counts.
df4.sentence.value_counts()

The lead agency is a partner and the project is not on the SHS.         217
The lead agency is unknown and the project is possibly on the SHS.      142
The lead agency is a partner and the project is possibly on the SHS.    120
The lead agency is Caltrans and the project is possibly on the SHS.      97
The lead agency is unknown and the project is not on the SHS.            63
The lead agency is Caltrans and the project is not on the SHS.           44
The lead agency is unknown and the project is on the SHS.                26
The lead agency is Caltrans and the project is on the SHS.               21
The lead agency is a partner and the project is on the SHS.              13
Name: sentence, dtype: int64

In [40]:
# Count combos with new dataframe to check results -> less rows because primary mode and secondary mode are
# only coded as highway related or not highway related
combos2 = (
    df4.groupby(
        [    "caltrans_or_partner",
            "sentence",
            "shs_capacity_increase_detail_SHS",
            "primary_mode_SHS",
            "secondary_mode_SHS",
           
        ]
    )
    .size()
    .reset_index()
    .rename(columns={0: "count"})
)

In [41]:
# Group again
combos3 = combos2.groupby(
    [
        "caltrans_or_partner","sentence",
        "shs_capacity_increase_detail_SHS",
        "primary_mode_SHS",
        "secondary_mode_SHS",
        
    ]
).agg({"count": "sum"}) 


### Function 3
* Around 48 percent of values are "possibly on the SHS." Look through them closer.

In [42]:
# Filter for unknown lead agencies & possibly on SHS.
df5 = df4[
    ((df4.caltrans_or_partner == "unknown") | (df4.On_SHS.isin(["possibly on the SHS"])))
]

In [43]:
f"{len(df5)} total projects."

'448 total projects.'

In [44]:
df5.caltrans_or_partner.value_counts()

unknown      231
a partner    120
Caltrans      97
Name: caltrans_or_partner, dtype: int64

In [45]:
df5 = tagging_columns(
    df5,
    "project_description",
    "contains_SR_reference",
    [
        "sr",
        "sr-",
        "state route",
        "sr ",
        "i-",
        "interstate",
        "us"
    ],
    "contains SHS keyword(s)",
    "does not contain SHS keyword(s)"
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [46]:
df5.contains_SR_reference.value_counts()

contains SHS keyword(s)            274
does not contain SHS keyword(s)    174
Name: contains_SR_reference, dtype: int64

In [49]:
# df5[['project_name','lead_agency','project_description',"On_SHS", 'contains_SR_reference']]