## Autotagging projects
* Who is the lead agency?
    * Lead agency = the entity receiving funding for the project.
* Is the project on the State Highway System (SHS), off it, or both?
* How can we tell whether a project crosses the SHS?

In [1]:
import pandas as pd

# Settings
pd.options.display.max_columns = 100
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)
pd.options.display.float_format = "{:,.2f}".format

GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/project_prioritization/"
FILE = "fake_data.xlsx"

# My utilities
import _utils
from calitp import *



### Preliminary

In [2]:
# Read in file
df = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}{FILE}", sheet_name="fake"))

In [3]:
# Subset to columns I want.
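# .copy() makes df2 an independent dataframe, so adding or editing columns
# later does not trigger a SettingWithCopyWarning.
df2 = df[
    [
        "project_name",
        "lead_agency",
        "primary_mode",
        "secondary_mode_s",
        "shs_capacity_increase_detail",
    ]
].copy()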

In [4]:
# Count combos
combos = (
    df2.groupby(["primary_mode", "secondary_mode_s", "shs_capacity_increase_detail"])
    .size()
    .reset_index()
    .rename(columns={0: "count"})
)

In [5]:
# Find the most common combos
combos.sort_values(["count"], ascending=False).head()

index,primary_mode,secondary_mode_s,shs_capacity_increase_detail,count
171,Rail (Passenger),,,111
78,Highway,,General Purpose Lane,62
10,Bike/Pedestrian,,,57
37,Grade Crossing,,,29
164,Rail (Freight),,,28
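
One caveat when counting combinations this way: `groupby` drops rows whose grouping keys are missing (NaN) by default, so it is worth checking how blank modes are stored. A minimal sketch on made-up data (the `toy` dataframe is hypothetical, not the project file), assuming pandas 1.1 or later for the `dropna` argument:

```python
import numpy as np
import pandas as pd

# Hypothetical data: two rows have a missing secondary mode.
toy = pd.DataFrame(
    {
        "primary_mode": ["Highway", "Highway", "Rail (Passenger)", "Rail (Passenger)"],
        "secondary_mode_s": ["Transit", "Transit", np.nan, np.nan],
    }
)
group_cols = ["primary_mode", "secondary_mode_s"]

# Default behavior: rows with a NaN key are left out of the counts.
default_counts = toy.groupby(group_cols).size().reset_index(name="count")

# dropna=False keeps NaN as its own group, so every row is counted.
all_counts = toy.groupby(group_cols, dropna=False).size().reset_index(name="count")
```

Comparing `default_counts["count"].sum()` against `len(toy)` is a quick way to see whether any rows were dropped.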


### Function #1
* Tag whether values in a column are "highway related" before figuring out if they are on the SHS or not. 
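
The keyword search relies on joining the keywords into one regular expression and testing each value against it. A minimal sketch of that idea on a made-up series (`modes` is hypothetical); escaping the keywords with `re.escape` and passing `na=False` are assumptions added here so that special characters and missing values are handled explicitly:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical values, including a missing one.
modes = pd.Series(["Highway", "Interchange (New)", "Rail (Passenger)", np.nan])
keyword_list = ["highway", "its", "interchange"]

# Join the keywords with | so that any one of them counts as a match.
pattern = "|".join(re.escape(keyword) for keyword in keyword_list)

# Lowercase first so the match is case-insensitive; NaN counts as "not found".
found = modes.str.lower().str.contains(pattern, na=False)

labels = found.map({True: "highway related", False: "not highway related"})
```

Because `str.contains` matches substrings, a short keyword such as "its" also matches inside longer words, which may or may not be desired.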

In [6]:
def tagging_columns(df, tagging_col: str, new_col: str, keyword_list: list):
    """
    Search through a column for keywords.

    Args:
        df: the dataframe. Note that it is modified in place.
        tagging_col (str): the column to search for the appearance of keywords.
        new_col (str): name of the new column recording whether a keyword was found.
        keyword_list (list): list of keywords to search for.

    Returns: the dataframe with a new column stating whether
    the keyword(s) were found or not.
    """
    # Delineate items in the keyword list using | to build one regex pattern.
    keywords = "|".join(keyword_list)

    # Lower the strings in the column of interest
    df[tagging_col] = df[tagging_col].str.lower()

    # Flag rows where any keyword appears. Using str.contains so that
    # interchange (new) and interchange (modifying) both match.
    # na=False treats missing values as "keyword not found".
    keyword_appears = df[tagging_col].str.contains(keywords, na=False)

    # Categorize each row as highway related or not.
    df[new_col] = keyword_appears.map(
        {True: "highway related", False: "not highway related"}
    )

    return df

In [7]:
# Search through primary mode.
df3 = tagging_columns(
    df2,
    "primary_mode",
    "primary_mode_SHS",
    [
        "highway",
        "its",
        "interchange",
    ],
)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

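The warning above appears when a dataframe derived from another one is modified. A minimal sketch of the usual remedy, taking an explicit `.copy()` of the subset before editing it (the `projects` dataframe and its values are made up for illustration):

```python
import pandas as pd

# Hypothetical stand-in for the full project table.
projects = pd.DataFrame(
    {
        "project_name": ["A", "B"],
        "primary_mode": ["Highway", "Rail (Passenger)"],
        "extra": [1, 2],
    }
)

# An explicit copy makes the subset independent of the original dataframe.
subset = projects[["project_name", "primary_mode"]].copy()

# Editing the copy leaves the original untouched.
subset["primary_mode"] = subset["primary_mode"].str.lower()
```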

In [8]:
# Preview that this is correct
df3[["primary_mode", "primary_mode_SHS"]].sample(10)

index,primary_mode,primary_mode_SHS
100,complete streets,not highway related
596,bridge,not highway related
536,rail (freight),not highway related
534,highway,highway related
445,bike/pedestrian,not highway related
460,highway,highway related
689,highway,highway related
583,rail (passenger),not highway related
173,highway,highway related
502,highway,highway related


In [9]:
# Search through secondary mode.
# Pass df3 so the tags accumulate on the dataframe returned by the previous step.
df3 = tagging_columns(
    df3, "secondary_mode_s", "secondary_mode_SHS", ["highway", "lane", "interchange"]
)



In [10]:
# Search through SHS Capacity Detail.
# Pass df3 so the tags accumulate on the dataframe returned by the previous step.
df3 = tagging_columns(
    df3,
    "shs_capacity_increase_detail",
    "shs_capacity_increase_detail_SHS",
    ["highway", "lane", "interchange"],
)



### Function 2
* Apply a function to summarize the results in a single sentence.
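
The SHS rule used in the function below (all three tags highway related: on the SHS; none: not on the SHS; any mix: possibly on the SHS) can also be expressed without a row-wise `apply`. A minimal sketch using `numpy.select` on made-up rows (`toy` is hypothetical), assuming the tag columns only ever hold the two labels produced by `tagging_columns`:

```python
import numpy as np
import pandas as pd

# Hypothetical tagged rows: all highway, none highway, and a mix.
toy = pd.DataFrame(
    {
        "primary_mode_SHS": ["highway related", "not highway related", "highway related"],
        "secondary_mode_SHS": ["highway related", "not highway related", "not highway related"],
        "shs_capacity_increase_detail_SHS": ["highway related", "not highway related", "highway related"],
    }
)
tag_cols = [
    "primary_mode_SHS",
    "secondary_mode_SHS",
    "shs_capacity_increase_detail_SHS",
]

# Boolean table: True where a tag says "highway related".
is_highway = toy[tag_cols].eq("highway related")

# Conditions are checked in order; rows matching neither get the default.
toy["On_SHS"] = np.select(
    [is_highway.all(axis=1), ~is_highway.any(axis=1)],
    ["on the SHS", "not on the SHS"],
    default="possibly on the SHS",
)
```

This is only an alternative formulation of the same rule; it does not distinguish Caltrans from partner agencies, which the function below also does not do in its final labels.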
    

In [11]:
def SHS_lead_agency_info(df):

    # Lower strings.
    for i in [
        "primary_mode",
        "secondary_mode_s",
        "shs_capacity_increase_detail",
    ]:
        df[i] = df[i].str.lower()

    # Tag if the lead agency is Caltrans or a partner.
    def CT_or_partner(row):
        # If the lead agency is recorded as the string "None", return unknown.
        if row.lead_agency == "None":
            return "unknown"
        # If only Caltrans, return Caltrans
        elif row.lead_agency == "Caltrans":
            return "Caltrans"
        # Everything else is a partner agency
        else:
            return "a partner"

    # Apply the function
    df["caltrans_or_partner"] = df.apply(CT_or_partner, axis=1)

    # Tag if a project is on the SHS or not through various combos.
    def on_SHS(row):
        # If primary, secondary, and SHS detail are all highway related
        # and the lead agency is Caltrans: on the SHS.
        if (
            (row.secondary_mode_SHS == "highway related")
            and (row.primary_mode_SHS == "highway related")
            and (row.shs_capacity_increase_detail_SHS == "highway related")
            and (row.caltrans_or_partner == "Caltrans")
        ):
            return "on the SHS"
        # If primary, secondary, and SHS detail are all highway related but the
        # lead agency is not Caltrans: also tagged as on the SHS for now,
        # although "possibly on the SHS" may be more appropriate.
        elif (
            (row.secondary_mode_SHS == "highway related")
            and (row.primary_mode_SHS == "highway related")
            and (row.shs_capacity_increase_detail_SHS == "highway related")
        ):
            return "on the SHS"
        # If nothing is highway related: not on the SHS.
        elif (
            (row.shs_capacity_increase_detail_SHS == "not highway related")
            and (row.secondary_mode_SHS == "not highway related")
            and (row.primary_mode_SHS == "not highway related")
        ):
            return "not on the SHS"
        # Everything else (a mix of tags) is possibly on the SHS.
        else:
            return "possibly on the SHS"

    # Apply the function
    df["On_SHS"] = df.apply(on_SHS, axis=1)

    # Create a summary sentence 
    df["sentence"] = (
        "The lead agency is "
        + df["caltrans_or_partner"] 
        + " and the project is "
        + df["On_SHS"]
        + "."
    )

    return df

In [12]:
df4 = SHS_lead_agency_info(df3)

In [14]:
# Check value counts.
df4.caltrans_or_partner.value_counts()

a partner    350
unknown      231
Caltrans     162
Name: caltrans_or_partner, dtype: int64

In [15]:
# Check value counts.
df4.On_SHS.value_counts()

not on the SHS         378
possibly on the SHS    318
on the SHS              47
Name: On_SHS, dtype: int64

In [16]:
# Total sentences
df4.sentence.nunique()

9

In [17]:
# Check value counts.
df4.sentence.value_counts()

The lead agency is a partner and the project is not on the SHS.         252
The lead agency is unknown and the project is possibly on the SHS.      137
The lead agency is Caltrans and the project is possibly on the SHS.      92
The lead agency is a partner and the project is possibly on the SHS.     89
The lead agency is unknown and the project is not on the SHS.            69
The lead agency is Caltrans and the project is not on the SHS.           57
The lead agency is unknown and the project is on the SHS.                25
The lead agency is Caltrans and the project is on the SHS.               13
The lead agency is a partner and the project is on the SHS.               9
Name: sentence, dtype: int64
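
The nine sentences are the 3 lead agency labels crossed with the 3 SHS labels, so a cross-tabulation gives the same counts in a more compact layout. A minimal sketch on made-up rows (`toy` is hypothetical and its counts are not the real results):

```python
import pandas as pd

# Hypothetical labels for four projects.
toy = pd.DataFrame(
    {
        "caltrans_or_partner": ["Caltrans", "a partner", "a partner", "unknown"],
        "On_SHS": [
            "on the SHS",
            "not on the SHS",
            "not on the SHS",
            "possibly on the SHS",
        ],
    }
)

# Rows are lead agency labels, columns are SHS labels; margins=True adds totals.
table = pd.crosstab(toy["caltrans_or_partner"], toy["On_SHS"], margins=True)
```

On the real data, the same call with `df4["caltrans_or_partner"]` and `df4["On_SHS"]` should reproduce the sentence counts above.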

In [18]:
# Count combos with the new dataframe to check results -> fewer rows because the mode and
# SHS detail columns are only coded as highway related or not highway related
combos2 = (
    df4.groupby(
        [
            "caltrans_or_partner",
            "sentence",
            "shs_capacity_increase_detail_SHS",
            "primary_mode_SHS",
            "secondary_mode_SHS",
        ]
    )
    .size()
    .reset_index()
    .rename(columns={0: "count"})
)

In [19]:
# Regroup on the same columns to display the counts as a hierarchical index
combos2.groupby(
    [
        "caltrans_or_partner",
        "sentence",
        "shs_capacity_increase_detail_SHS",
        "primary_mode_SHS",
        "secondary_mode_SHS",
    ]
).agg({"count": "sum"})


caltrans_or_partner,sentence,shs_capacity_increase_detail_SHS,primary_mode_SHS,secondary_mode_SHS,count
Caltrans,The lead agency is Caltrans and the project is not on the SHS.,not highway related,not highway related,not highway related,57
Caltrans,The lead agency is Caltrans and the project is on the SHS.,highway related,highway related,highway related,13
Caltrans,The lead agency is Caltrans and the project is possibly on the SHS.,highway related,highway related,not highway related,60
Caltrans,The lead agency is Caltrans and the project is possibly on the SHS.,highway related,not highway related,not highway related,6
Caltrans,The lead agency is Caltrans and the project is possibly on the SHS.,not highway related,highway related,highway related,6
Caltrans,The lead agency is Caltrans and the project is possibly on the SHS.,not highway related,highway related,not highway related,15
Caltrans,The lead agency is Caltrans and the project is possibly on the SHS.,not highway related,not highway related,highway related,5
a partner,The lead agency is a partner and the project is not on the SHS.,not highway related,not highway related,not highway related,252
a partner,The lead agency is a partner and the project is on the SHS.,highway related,highway related,highway related,9
a partner,The lead agency is a partner and the project is possibly on the SHS.,highway related,highway related,not highway related,50
