## Autotagging projects
* Who is the lead agency? 
    * Is the lead agency the entity that receives funding for the project?
* Is this project on or off the SHS? Or even both?
* How to tell if a project criss-crosses the SHS?

In [None]:
import pandas as pd

# Settings
pd.options.display.max_columns = 100
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)
pd.options.display.float_format = "{:,.2f}".format

GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/project_prioritization/"
FILE = "fake_data.xlsx"

# My utilities
import _utils

### Preliminary
* Subsetting and cleaning up strings.
* Previewing the different values in cols.
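
A minimal sketch of the subset-then-clean step on made-up data. Taking an explicit `.copy()` of the subset is one way to avoid pandas' `SettingWithCopyWarning` when the columns are reassigned afterwards. The frame `raw`, its values, and `extra_col` are illustrative only and are not part of the project data.

```python
import pandas as pd

# Made-up rows standing in for the real project list (hypothetical values).
raw = pd.DataFrame(
    {
        "project_name": ["Project A", "Project B"],
        "primary_mode": ["Highway", "Rail (Passenger)"],
        "extra_col": [1, 2],
    }
)

# Subset and copy, so the cleanup below writes to an independent frame.
subset = raw[["project_name", "primary_mode"]].copy()

# Lowercase the strings so later keyword matching is case-insensitive.
subset["primary_mode"] = subset["primary_mode"].str.lower()

print(subset["primary_mode"].tolist())
print(raw["primary_mode"].tolist())  # the original frame keeps its casing
```
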

In [152]:
# Read in file
df = pd.read_excel(f"{GCS_FILE_PATH}{FILE}", sheet_name="fake")

In [153]:
# Subset to the columns I want. Copy so later assignments don't write to a slice of df.
df2 = df[['project_name', 'lead_agency', 'primary_mode',
          'secondary_mode_s_', 'shs_capacity_increase_detail']].copy()

In [154]:
# Lowercase all strings
for i in ['primary_mode', 'secondary_mode_s_', 'shs_capacity_increase_detail']:
    df2[i] = df2[i].str.lower()

In [155]:
# Check out all values
for i in ['primary_mode',
       'secondary_mode_s_','shs_capacity_increase_detail',]:
    print(df2[[i]].value_counts())

primary_mode              
highway                       203
rail (passenger)              122
bike/pedestrian                90
interchange (modification)     62
rail (freight)                 37
interchange (new)              33
grade crossing                 32
interchange (widening)         32
complete streets               27
port                           26
transit                        22
bridge                         17
grade separation               11
local roadway                  10
its                             6
zev                             6
roundabout                      4
truck parking                   2
none                            1
dtype: int64
secondary_mode_s_                                                                                
none                                                                                                 492
highway                                                                                               57
bike/p

### Function 1
* Tag each mode column as "highway related" or "not highway related" before figuring out whether the project is on the SHS.

In [156]:
def tagging_columns(df, tagging_col: str, new_col: str, keyword_list: list):
    """Tag each row of tagging_col as highway related or not.

    Adds new_col to df in place and returns df.
    """
    # Delineate items in the keyword list using | so the regex matches any of them
    keywords = f"({'|'.join(keyword_list)})"

    # str.contains matches substrings, so "interchange" catches both
    # interchange (new) and interchange (modification).
    # Missing values count as no match.
    keyword_appears = df[tagging_col].str.contains(keywords, na=False)

    # Map the boolean match to a readable label
    df[new_col] = keyword_appears.map(
        {True: "highway related", False: "not highway related"}
    )

    return df

In [157]:
keywords = f"({'|'.join(['highway', 'its','interchange',])})"

In [158]:
keywords

'(highway|its|interchange)'
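
Because `str.contains` does substring matching, a short keyword such as `its` can also match inside longer words. The sketch below compares the plain alternation with a word-boundary version on made-up values. The value `permits` is hypothetical and does not appear in the data above, and the `strict` pattern is an assumption about how one might tighten the match, not what the notebook uses.

```python
import pandas as pd

# Made-up mode values; "permits" is a hypothetical value that contains "its".
modes = pd.Series(
    ["highway", "interchange (new)", "its", "rail (freight)", "permits", None]
)

# Plain alternation: matches the keywords anywhere in the string.
loose = "highway|its|interchange"

# Word boundaries: matches the keywords only as whole words.
strict = r"\b(?:highway|its|interchange)\b"

# na=False treats missing values as "no match".
loose_match = modes.str.contains(loose, na=False)
strict_match = modes.str.contains(strict, na=False)

print(pd.DataFrame({"mode": modes, "loose": loose_match, "strict": strict_match}))
```
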

In [159]:
df3 = tagging_columns(df2, "primary_mode", "primary_mode_SHS", ["highway", "its", "interchange"])

In [160]:
# Pass df3 so the secondary mode tag is added alongside the primary mode tag.
df3 = tagging_columns(df3, "secondary_mode_s_", "secondary_mode_SHS", ["highway", "lane", "interchange"])

In [161]:
# df3['length_of_lead_agency'] = df3['lead_agency'].map(lambda x: len(x)) 

In [162]:
# df3.sample(5)

### Function 2
* After tagging whether the primary and secondary modes are highway related, complete the function.
    * SHS Capacity Increase Detail is only populated with something besides "None" if the project is SHS related, so there's no need to apply the tagging function to that column.
* Tag whether a lead agency is Caltrans or a partner.
* Use various combinations of secondary mode, primary mode, and lead agency to determine whether a project is on the SHS or not.
* Create a sentence to summarize everything.

In [163]:
def SHS_lead_agency_info(df): 
    
    # Tag if it's a Caltrans or a partner project.
    # Any other value, including a missing lead agency, is tagged as a partner.
    df['caltrans_or_partner'] = df['lead_agency'].map(lambda x: 'Caltrans' if x == 'Caltrans' else 'a partner')

    # Tag whether a project is on the SHS or not through various combos.
    def on_SHS(row):
        # If secondary mode is highway related and shs_capacity_increase_detail isn't none: on SHS.
        if (row.secondary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none"):
            return "on SHS"
        # Same as above, but for primary mode.
        if (row.primary_mode_SHS == "highway related") and (row.shs_capacity_increase_detail != "none"):
            return "on SHS"
        # If both secondary & primary modes are highway related and lead agency is Caltrans: on SHS.
        # (The same combo with a non-none shs_capacity_increase_detail is already caught above.)
        if (row.secondary_mode_SHS == "highway related") and (row.primary_mode_SHS == "highway related") and (row.caltrans_or_partner == "Caltrans"):
            return "on SHS"
        # If shs_capacity_increase_detail is anything but none, tag it as on the SHS.
        # Note: this branch returns "on the SHS" while the branches above return "on SHS",
        # so value_counts() lists the two labels separately.
        if (row.shs_capacity_increase_detail != "none"):
            return "on the SHS"
        # Everything else is not on the SHS.
        return "not on the SHS"
    
    # Apply the function
    df["On_SHS"] = df.apply(on_SHS, axis=1)
    
    # Create a sentence that summarizes the lead agency and whether the project is on the SHS or not. 
    df['sentence'] = 'The lead agency is ' + df['caltrans_or_partner'] + ' and the project is ' + df['On_SHS'] + '.'
    
    return df
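
The branches of `on_SHS` reduce to two conditions: a project is tagged as on the SHS if `shs_capacity_increase_detail` is anything but "none", or if both modes are highway related and the lead agency is Caltrans. A minimal vectorized sketch of that rule follows, using made-up rows. The frame `demo` and the use of a single "on the SHS" label are assumptions for illustration, not the notebook's own implementation.

```python
import numpy as np
import pandas as pd

# Made-up rows covering each combination of interest (hypothetical values).
demo = pd.DataFrame(
    {
        "primary_mode_SHS": ["highway related", "highway related",
                             "not highway related", "highway related"],
        "secondary_mode_SHS": ["highway related", "not highway related",
                               "not highway related", "highway related"],
        "shs_capacity_increase_detail": ["none", "general purpose lane", "none", "none"],
        "caltrans_or_partner": ["Caltrans", "a partner", "a partner", "a partner"],
    }
)

# Condition 1: the SHS capacity detail is populated with something besides "none".
has_detail = demo["shs_capacity_increase_detail"].ne("none")

# Condition 2: both modes are highway related and Caltrans is the lead agency.
all_highway_caltrans = (
    demo["primary_mode_SHS"].eq("highway related")
    & demo["secondary_mode_SHS"].eq("highway related")
    & demo["caltrans_or_partner"].eq("Caltrans")
)

# Either condition puts the project on the SHS.
demo["On_SHS"] = np.where(
    has_detail | all_highway_caltrans, "on the SHS", "not on the SHS"
)

print(demo[["caltrans_or_partner", "On_SHS"]])
```
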

In [164]:
df4 = SHS_lead_agency_info(df3) 

In [165]:
df4.sentence.value_counts()

The lead agency is a partner and the project is not on the SHS.    380
The lead agency is a partner and the project is on SHS.            195
The lead agency is Caltrans and the project is on SHS.              79
The lead agency is Caltrans and the project is not on the SHS.      77
The lead agency is Caltrans and the project is on the SHS.           6
The lead agency is a partner and the project is on the SHS.          6
Name: sentence, dtype: int64
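
The counts above split on-SHS projects across two labels, "on SHS" and "on the SHS", because the branches of `on_SHS` return different strings. One possible way to collapse them before summarizing is sketched below on a made-up Series; `labels` is hypothetical and the counts it prints are not results from the project data.

```python
import pandas as pd

# Made-up labels mimicking the two spellings that on_SHS can return.
labels = pd.Series(["on SHS", "on the SHS", "not on the SHS", "on SHS"])

# Replace exact values only, so "not on the SHS" is left untouched.
normalized = labels.replace({"on SHS": "on the SHS"})

print(normalized.value_counts())
```
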

In [166]:
df4.sample(5)

| | project_name | lead_agency | primary_mode | secondary_mode_s_ | shs_capacity_increase_detail | primary_mode_SHS | secondary_mode_SHS | caltrans_or_partner | On_SHS | sentence |
|---|---|---|---|---|---|---|---|---|---|---|
| 552 | La-Sb Dedicated Passenger Corridor: Construct 3Rd Main Track On The Bnsf Sb Route | Metrolink | rail (passenger) | none | none | not highway related | not highway related | a partner | not on the SHS | The lead agency is a partner and the project is not on the SHS. |
| 205 | Freeman Gulch Widening - Phase 3 | | highway | none | general purpose lane | highway related | not highway related | a partner | on SHS | The lead agency is a partner and the project is on SHS. |
| 394 | Sr-71 Corridor Enhancement Project\nRiverside County Route 71 Widening | | highway | bike/pedestrian\nrail (passenger) | general purpose lane | highway related | not highway related | a partner | on SHS | The lead agency is a partner and the project is on SHS. |
| 304 | Sr-60/7Th Street Interchange Improvement | Caltrans | interchange (modification) | none | interchange (modification) | highway related | not highway related | Caltrans | on SHS | The lead agency is Caltrans and the project is on SHS. |
| 741 | Camp Pendleton Cct | | bike/pedestrian | none | none | not highway related | not highway related | a partner | not on the SHS | The lead agency is a partner and the project is not on the SHS. |
