## Autotagging projects
* Who is the lead agency? 
    * Agency in this project = the entity who is receiving funding for this project.
* Is this project on or off the SHS or both?
* How to tell if a project criss-crosses the SHS?

In [1]:
import pandas as pd

# Settings
pd.options.display.max_columns = 100
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)
pd.options.display.float_format = "{:,.2f}".format

GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/project_prioritization/"
FILE = "fake_data.xlsx"

# My utilities
import _utils
from calitp import *



### Preliminary

In [2]:
# Read in file
df = to_snakecase(pd.read_excel(f"{GCS_FILE_PATH}{FILE}", sheet_name="fake"))

In [3]:
# Subset to columns I want. 
df2 = df[['project_name', 'lead_agency','primary_mode',
       'secondary_mode_s','shs_capacity_increase_detail',]]

In [4]:
# Count combos
combos = df2.groupby(['primary_mode',
        'secondary_mode_s','shs_capacity_increase_detail']).size().reset_index().rename(columns={0:'count'})

In [5]:
combos.sort_values(['count'], ascending = False)

Unnamed: 0,primary_mode,secondary_mode_s,shs_capacity_increase_detail,count
171,Rail (Passenger),,,111
78,Highway,,General Purpose Lane,62
10,Bike/Pedestrian,,,57
37,Grade Crossing,,,29
164,Rail (Freight),,,28
87,Highway,,,27
83,Highway,,Managed Lane,17
114,Interchange (Modification),,Interchange (Modification),16
71,Highway,,Auxiliary Lane,14
124,Interchange (New),,Interchange (New),12


### Function #1
* Tag whether values in a column are "highway related" before figuring out if they are on the SHS or not. 
* SHS Capacity Increase Detail is only populated with something besides "None" if it isn't SHS related, so there's no need to apply the function to that col.
* Only have to tag primary and secondary mode.

In [6]:
def tagging_columns(df, tagging_col:str, new_col:str, keyword_list: list):
    '''
    Search through a column for keywords. 
    
    Args
    df: the dataframe.
    tagging_col (str): the column to search for the appearance of keywords. 
    new_col (str): input whether or not the keyword was found.
    keyword_list (list): list of keywords to search through.
    
    Returns: a dataframe with a new column stating whether 
    the keyword(s) were found or not.
    '''
    # Delinate items in keywords list using |
    keywords = f"({'|'.join(keyword_list)})"
    
    # Lower the strings in the column of interest 
    df[tagging_col] =  df[tagging_col].str.lower()
    
    # Create a new column that captures whether or not the keyword appears
    # Using str contains so interchange (new) and interchange (modifying) will appear.
    df["keyword_appears_bool"] = df[tagging_col].str.contains(keywords)
    
    # Function to categorize whether something is highway related or not. 
    def highway_or_not(row):
        if row["keyword_appears_bool"] == True:
            return "highway related"
        else:
            return "not highway related"
             
    # Apply function 
    df[new_col] = df.apply(lambda x: highway_or_not(x), axis=1)
             
    # Drop keyword col
    df = df.drop(columns = ["keyword_appears_bool"]) 
   
    return df 

In [7]:
df3 = tagging_columns(df2, "primary_mode", "primary_mode_SHS", ['highway', 'its',
       'interchange',])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [8]:
# Preview that this is correct
df3[['primary_mode','primary_mode_SHS']].sample(10)

Unnamed: 0,primary_mode,primary_mode_SHS
178,highway,highway related
13,bike/pedestrian,not highway related
492,highway,highway related
624,rail (passenger),not highway related
133,highway,highway related
724,bike/pedestrian,not highway related
241,highway,highway related
337,port,not highway related
417,highway,highway related
43,rail (passenger),not highway related


In [9]:
df3 = tagging_columns(df2,  'secondary_mode_s', "secondary_mode_SHS", ["highway", "lane", "interchange"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


### Function 2
* Apply a function to summarize the results in a comprehensive sentence.
    

In [10]:
def SHS_lead_agency_info(df): 
    
    # Lower strings. 
    for i in ['primary_mode','secondary_mode_s','shs_capacity_increase_detail',]:
        df[i] = df[i].str.lower()
    
    # Tag if the lead agency is Caltrans or a partner. 
    def CT_or_partner(row):
        # If SHS is filled with somethign BESIDES none: on SHS. 
        if (row.lead_agency == "None"):
            return "unknown" 
        # Everything else is not on SHS.
        if (row.lead_agency == "Caltrans"):
            return "Caltrans" 
        else:
            return "a partner"     
        
    # Apply the function
    df["caltrans_or_partner"] = df.apply(CT_or_partner, axis=1)  
    
    # Tag if a project is on the SHS or not thorugh various combos.
    def on_SHS(row):
        # If secondary mode is highway related and shs_capacity_increase_detail isn't none: on SHS. 
        if (row.secondary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none"):
            return "possibly on the SHS"
        # Same thing as above but with primary mode. 
        elif (row.primary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none"):
            return "possibly on the SHS"
        # If both secondary & primary are highway, SHS isn't none, and lead agency is Caltrans: on SHS.
        elif (row.secondary_mode_SHS == "highway related") and (row.primary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none")  and (row.caltrans_or_partner == "Caltrans"):
            return "on the SHS"
         # If both secondary or primary are highway, SHS isn't none, and lead agency is Caltrans: on SHS.
        elif (row.secondary_mode_SHS == "highway related") or (row.primary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none")  and (row.caltrans_or_partner == "Caltrans"):
            return "on the SHS"
        # If both secondary & primary are highway related and lead agency is Caltrans: on SHS.
        elif (row.secondary_mode_SHS == "highway related") and (row.primary_mode_SHS == "highway related") and (row.caltrans_or_partner == "Caltrans"):
            return "on the SHS"
        # If SHS is filled with somethign BESIDES none: on SHS. 
        elif (row.shs_capacity_increase_detail != "none"):
            return "possibly on the SHS" 
        # If nothing is filled out
        elif (row.shs_capacity_increase_detail == "none") and (row.secondary_mode_SHS != "highway related") and (row.primary_mode_SHS != "highway related"):
            return "not on the SHS" 
        # Everything else is not on SHS.
        else:
            return "possibly on the SHS"
    
    # Apply the function
    df["On_SHS"] = df.apply(on_SHS, axis=1)
    
    # Create a sentence that summarizes the lead agency and whether the project is on the SHS or not. 
    df['sentence'] = 'The lead agency is ' + df['caltrans_or_partner'] + ' and the project is ' + df['On_SHS'] + '.'
    
    return df

In [11]:
df4 = SHS_lead_agency_info(df3) 

In [20]:
# Check value counts.
df4.sentence.value_counts()

The lead agency is a partner and the project is not on the SHS.         251
The lead agency is unknown and the project is possibly on the SHS.      156
The lead agency is Caltrans and the project is possibly on the SHS.      96
The lead agency is a partner and the project is possibly on the SHS.     93
The lead agency is unknown and the project is not on the SHS.            68
The lead agency is Caltrans and the project is not on the SHS.           57
The lead agency is Caltrans and the project is on the SHS.                9
The lead agency is unknown and the project is on the SHS.                 7
The lead agency is a partner and the project is on the SHS.               6
Name: sentence, dtype: int64

In [13]:
(df4.loc[df4['sentence'] == "The lead agency is Caltrans and the project is on the SHS."])

Unnamed: 0,project_name,lead_agency,primary_mode,secondary_mode_s,shs_capacity_increase_detail,primary_mode_SHS,secondary_mode_SHS,caltrans_or_partner,On_SHS,sentence
2,Arcata Cap & Humboldt Area Rapid Transit,Caltrans,transit,highway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
30,South Mount Shasta Boulevard Intersection,Caltrans,interchange (modification),highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
31,Tehama Mineral Multi Use Path / Mineral Multi Use Path (Mineral Bike/Ped Pathway Atp),Caltrans,bike/pedestrian,highway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
140,Sr 218 Comprehensive Complete Streets And Zev Improvements,Caltrans,complete streets,highway\nits\ntransit\nzev,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
186,Coalinga-Avenal Srra Truck Parking Expansion,Caltrans,truck parking,highway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
269,I-405 Active Traffic Management (Atm) And Integrated Corridor Management (Icm) Project,Caltrans,its,highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
282,I-710 Integrated Corridor Management Project,Caltrans,its,highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
486,Sand Canyon Avenue Class Ii Bike Gap Closure At I-405,Caltrans,bike/pedestrian,complete streets\nhighway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
515,Orange County Integrated Corridor Management (Icm) System Phase Ii,Caltrans,its,highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.


In [14]:
# Check value counts.
df4.caltrans_or_partner.value_counts()

a partner    350
unknown      231
Caltrans     162
Name: caltrans_or_partner, dtype: int64

In [15]:
# Make sure every row is tagged. 
df4.sentence.count(), len(df4)

(743, 743)

In [16]:
def SHS_lead_agency_info_v2(df): 
    
    # Lower strings. 
    for i in ['primary_mode','secondary_mode_s','shs_capacity_increase_detail',]:
        df[i] = df[i].str.lower()
    
    # Tag if the lead agency is Caltrans or a partner. 
    def CT_or_partner(row):
        # If SHS is filled with somethign BESIDES none: on SHS. 
        if (row.lead_agency == "None"):
            return "unknown" 
        # Everything else is not on SHS.
        if (row.lead_agency == "Caltrans"):
            return "Caltrans" 
        else:
            return "a partner"     
        
    # Apply the function
    df["caltrans_or_partner"] = df.apply(CT_or_partner, axis=1)  
    
    # Tag if a project is on the SHS or not thorugh various combos.
    def on_SHS(row):
        # If secondary mode is highway related and shs_capacity_increase_detail isn't none: on SHS. 
        if (row.secondary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none"):
            return "possibly on the SHS"
        # Same thing as above but with primary mode. 
        if (row.primary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none"):
            return "possibly on the SHS"
        # If both secondary & primary are highway, SHS isn't none, and lead agency is Caltrans: on SHS.
        if (row.secondary_mode_SHS == "highway related") and (row.primary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none")  and (row.caltrans_or_partner == "Caltrans"):
            return "on the SHS"
         # If both secondary or primary are highway, SHS isn't none, and lead agency is Caltrans: on SHS.
        if (row.secondary_mode_SHS == "highway related") or (row.primary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none")  and (row.caltrans_or_partner == "Caltrans"):
            return "on the SHS"
        # If both secondary & primary are highway related and lead agency is Caltrans: on SHS.
        if (row.secondary_mode_SHS == "highway related") and (row.primary_mode_SHS == "highway related") and (row.caltrans_or_partner == "Caltrans"):
            return "on the SHS"
        # If SHS is filled with somethign BESIDES none: on SHS. 
        if (row.shs_capacity_increase_detail != "none"):
            return "possibly on the SHS" 
        # If nothing is filled out
        if (row.shs_capacity_increase_detail == "none") and (row.secondary_mode_SHS != "highway related") and (row.primary_mode_SHS != "highway related"):
            return "not on the SHS" 
        # Everything else is not on SHS.
        else:
            return "possibly on the SHS"
    
    # Apply the function
    df["On_SHS"] = df.apply(on_SHS, axis=1)
    
    # Create a sentence that summarizes the lead agency and whether the project is on the SHS or not. 
    df['sentence'] = 'The lead agency is ' + df['caltrans_or_partner'] + ' and the project is ' + df['On_SHS'] + '.'
    
    return df

In [17]:
df5 = SHS_lead_agency_info_v2(df3)

In [18]:
# Check value counts.
df5.sentence.value_counts()

The lead agency is a partner and the project is not on the SHS.         251
The lead agency is unknown and the project is possibly on the SHS.      156
The lead agency is Caltrans and the project is possibly on the SHS.      96
The lead agency is a partner and the project is possibly on the SHS.     93
The lead agency is unknown and the project is not on the SHS.            68
The lead agency is Caltrans and the project is not on the SHS.           57
The lead agency is Caltrans and the project is on the SHS.                9
The lead agency is unknown and the project is on the SHS.                 7
The lead agency is a partner and the project is on the SHS.               6
Name: sentence, dtype: int64

In [19]:
df5.loc[df5['sentence'] == "The lead agency is Caltrans and the project is on the SHS."]

Unnamed: 0,project_name,lead_agency,primary_mode,secondary_mode_s,shs_capacity_increase_detail,primary_mode_SHS,secondary_mode_SHS,caltrans_or_partner,On_SHS,sentence
2,Arcata Cap & Humboldt Area Rapid Transit,Caltrans,transit,highway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
30,South Mount Shasta Boulevard Intersection,Caltrans,interchange (modification),highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
31,Tehama Mineral Multi Use Path / Mineral Multi Use Path (Mineral Bike/Ped Pathway Atp),Caltrans,bike/pedestrian,highway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
140,Sr 218 Comprehensive Complete Streets And Zev Improvements,Caltrans,complete streets,highway\nits\ntransit\nzev,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
186,Coalinga-Avenal Srra Truck Parking Expansion,Caltrans,truck parking,highway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
269,I-405 Active Traffic Management (Atm) And Integrated Corridor Management (Icm) Project,Caltrans,its,highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
282,I-710 Integrated Corridor Management Project,Caltrans,its,highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
486,Sand Canyon Avenue Class Ii Bike Gap Closure At I-405,Caltrans,bike/pedestrian,complete streets\nhighway,none,not highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
515,Orange County Integrated Corridor Management (Icm) System Phase Ii,Caltrans,its,highway,none,highway related,highway related,Caltrans,on the SHS,The lead agency is Caltrans and the project is on the SHS.
