## Autotagging projects
* Who is the lead agency?
    * Lead agency = the entity receiving funding for the project.
* Is the project on the State Highway System (SHS), off it, or both?
* How can we tell if a project crosses the SHS?

In [1]:
import pandas as pd

# Settings
pd.options.display.max_columns = 100
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)
pd.options.display.float_format = "{:,.2f}".format

GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/project_prioritization/"
FILE = "fake_data.xlsx"

# My utilities
import _utils



### Preliminary

In [2]:
# Read in file
df = pd.read_excel(f"{GCS_FILE_PATH}{FILE}", sheet_name="fake")

In [3]:
# Subset to columns I want. 
df2 = df[['project_name', 'lead_agency','primary_mode',
       'secondary_mode_s_','shs_capacity_increase_detail',]]

In [4]:
"""
for i in ['primary_mode',
       'secondary_mode_s_','shs_capacity_increase_detail',]:
    df2[i] = df2[i].str.lower()
"""


In [5]:
# Check out all values
for i in ['primary_mode',
       'secondary_mode_s_','shs_capacity_increase_detail',]:
    print(df2[[i]].value_counts())

primary_mode              
Highway                       203
Rail (Passenger)              122
Bike/Pedestrian                90
Interchange (Modification)     62
Rail (Freight)                 37
Interchange (New)              33
Grade Crossing                 32
Interchange (Widening)         32
Complete Streets               27
Port                           26
Transit                        22
Bridge                         17
Grade Separation               11
Local Roadway                  10
Its                             6
Zev                             6
Roundabout                      4
Truck Parking                   2
None                            1
dtype: int64
secondary_mode_s_                                                                                
None                                                                                                 492
Highway                                                                                               57
Bike/P

### Function #1
* Tag whether the values in a column are "highway related" before figuring out whether the project is on the SHS.
* SHS Capacity Increase Detail is only populated with something besides "None" when the project is SHS related, so there's no need to apply the function to that column.
* Only the primary and secondary mode columns have to be tagged.
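The keyword tagging described above can be sketched on a small, made-up Series before applying it to the real data. The values in `modes` and the names `pattern`, `is_highway`, and `tags` are illustrative assumptions, not part of the project file.

```python
import pandas as pd

# Hypothetical mode values, including a missing entry.
modes = pd.Series(["Highway", "Interchange (New)", "Rail (Passenger)", "Its", None])

# Join the keywords with | so the pattern matches any one of them.
keywords = ["highway", "its", "interchange"]
pattern = "|".join(keywords)

# Lowercase first, then flag rows containing any keyword.
# na=False treats missing values as "no keyword found".
is_highway = modes.str.lower().str.contains(pattern, na=False)

# Turn the boolean flag into the two labels used in this notebook.
tags = is_highway.map({True: "highway related", False: "not highway related"})
print(tags.tolist())
```

Because `str.contains` matches substrings, a short keyword such as `its` would also match inside a longer word like "units"; wrapping keywords in `\b` word boundaries is one way to tighten the match if that becomes a problem.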

In [6]:
def tagging_columns(df, tagging_col: str, new_col: str, keyword_list: list):
    '''
    Search through a column for keywords.

    Args
    df: the dataframe.
    tagging_col (str): the column to search for the appearance of keywords.
    new_col (str): name of the new column recording whether a keyword was found.
    keyword_list (list): list of keywords to search for.

    Returns: the dataframe with a new column stating whether
    the keyword(s) were found or not.
    '''
    # Delineate items in the keyword list using |
    keywords = "|".join(keyword_list)

    # Lowercase the strings in the column of interest
    df[tagging_col] = df[tagging_col].str.lower()

    # Flag rows where any keyword appears.
    # str.contains matches substrings, so "interchange (new)" and
    # "interchange (modification)" are both caught by "interchange".
    # na=False treats missing values as "keyword not found".
    keyword_appears = df[tagging_col].str.contains(keywords, na=False)

    # Categorize whether each row is highway related or not
    df[new_col] = keyword_appears.map(
        {True: "highway related", False: "not highway related"}
    )

    return df

In [7]:
df3 = tagging_columns(df2, "primary_mode", "primary_mode_SHS", ['highway', 'its',
       'interchange',])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
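This warning typically appears when columns are assigned on a DataFrame that was itself created by selecting from another DataFrame, as `df2` was from `df`. A minimal sketch of one common way to avoid it, taking an explicit `.copy()` of the subset, is shown below on toy data; the frame contents and the name `subset` are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the project data (hypothetical values).
df = pd.DataFrame(
    {
        "project_name": ["A", "B"],
        "primary_mode": ["Highway", "Transit"],
        "extra": [1, 2],
    }
)

# .copy() makes the subset an independent DataFrame rather than a slice of df,
# so later column assignments are unambiguous.
subset = df[["project_name", "primary_mode"]].copy()
subset["primary_mode"] = subset["primary_mode"].str.lower()

print(subset)
```

The original `df` keeps its mixed-case values, and only `subset` is modified.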


In [8]:
# Preview that this is correct
df3[['primary_mode','primary_mode_SHS']].sample(10)

| | primary_mode | primary_mode_SHS |
|---|---|---|
| 231 | highway | highway related |
| 564 | rail (passenger) | not highway related |
| 689 | highway | highway related |
| 141 | roundabout | not highway related |
| 196 | interchange (modification) | highway related |
| 176 | highway | highway related |
| 213 | interchange (new) | highway related |
| 392 | highway | highway related |
| 242 | highway | highway related |
| 108 | highway | highway related |


In [9]:
df3 = tagging_columns(df2, "secondary_mode_s_", "secondary_mode_SHS", ["highway", "lane", "interchange"])

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [11]:
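# Test the lead agency tagging on its own before wrapping it in a function.
# CT_or_partner must be defined at module level before this cell runs; its body
# matches the helper nested inside SHS_lead_agency_info below.
df3["testing"] = df3.apply(CT_or_partner, axis=1)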

### Function #2
* Apply a function that summarizes the results in a single sentence.


In [12]:
def SHS_lead_agency_info(df):
    '''
    Tag the lead agency and the SHS status of each project,
    then summarize both in a sentence.
    '''
    # Lowercase strings.
    for i in ['primary_mode', 'secondary_mode_s_', 'shs_capacity_increase_detail']:
        df[i] = df[i].str.lower()

    # Tag if the lead agency is Caltrans, a partner, or unknown.
    def CT_or_partner(row):
        # No lead agency listed.
        if row.lead_agency == "None":
            return "unknown"
        if row.lead_agency == "Caltrans":
            return "Caltrans"
        # Any other lead agency is a partner.
        return "a partner"

    # Apply the function
    df["caltrans_or_partner"] = df.apply(CT_or_partner, axis=1)

    # Tag if a project is on the SHS or not.
    def on_SHS(row):
        # If shs_capacity_increase_detail is filled with something besides none: on SHS.
        # This covers every case where a mode is highway related and the detail is filled in.
        if row.shs_capacity_increase_detail != "none":
            return "on the SHS"
        # If both secondary & primary are highway related and lead agency is Caltrans: on SHS.
        if (
            row.secondary_mode_SHS == "highway related"
            and row.primary_mode_SHS == "highway related"
            and row.caltrans_or_partner == "Caltrans"
        ):
            return "on the SHS"
        # Everything else is not on the SHS.
        return "not on the SHS"

    # Apply the function
    df["On_SHS"] = df.apply(on_SHS, axis=1)

    # Create a sentence that summarizes the lead agency and whether the project is on the SHS or not.
    df['sentence'] = 'The lead agency is ' + df['caltrans_or_partner'] + ' and the project is ' + df['On_SHS'] + '.'

    return df

In [13]:
df4 = SHS_lead_agency_info(df3) 
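The row-wise rules in `on_SHS` reduce to two conditions: the SHS capacity detail is filled in, or both modes are highway related and Caltrans is the lead agency. The sketch below expresses that with boolean masks and `np.where` on made-up rows. The toy values (including the detail text "new general purpose lane") and the mask names are assumptions for illustration, not the notebook's actual data or method.

```python
import numpy as np
import pandas as pd

# Hypothetical rows covering each branch of the rule.
toy = pd.DataFrame(
    {
        "shs_capacity_increase_detail": ["none", "new general purpose lane", "none", "none"],
        "primary_mode_SHS": ["highway related", "not highway related", "highway related", "highway related"],
        "secondary_mode_SHS": ["highway related", "not highway related", "not highway related", "highway related"],
        "caltrans_or_partner": ["Caltrans", "a partner", "Caltrans", "a partner"],
    }
)

# Condition 1: the SHS capacity detail is filled with something besides "none".
has_detail = toy["shs_capacity_increase_detail"] != "none"

# Condition 2: both modes are highway related and Caltrans is the lead agency.
both_highway_caltrans = (
    (toy["primary_mode_SHS"] == "highway related")
    & (toy["secondary_mode_SHS"] == "highway related")
    & (toy["caltrans_or_partner"] == "Caltrans")
)

toy["On_SHS"] = np.where(has_detail | both_highway_caltrans, "on the SHS", "not on the SHS")
print(toy["On_SHS"].tolist())
```

Masks avoid a Python-level call per row, which may matter on larger tables, though `apply` is perfectly adequate for a few hundred projects.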

In [14]:
# Check value counts.
df4.sentence.value_counts()

The lead agency is a partner and the project is not on the SHS.    287
The lead agency is unknown and the project is on the SHS.          138
The lead agency is unknown and the project is not on the SHS.       93
The lead agency is Caltrans and the project is on the SHS.          85
The lead agency is Caltrans and the project is not on the SHS.      77
The lead agency is a partner and the project is on the SHS.         63
Name: sentence, dtype: int64
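The same breakdown can be viewed as a two-way table instead of counting sentences. A minimal sketch with `pd.crosstab` on made-up rows follows; the toy values and the name `summary` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical tagged rows.
toy = pd.DataFrame(
    {
        "caltrans_or_partner": ["Caltrans", "Caltrans", "a partner", "unknown"],
        "On_SHS": ["on the SHS", "not on the SHS", "not on the SHS", "on the SHS"],
    }
)

# Rows are lead agency tags, columns are SHS tags, cells are project counts.
summary = pd.crosstab(toy["caltrans_or_partner"], toy["On_SHS"])
print(summary)
```

On the real data this would be `pd.crosstab(df4["caltrans_or_partner"], df4["On_SHS"])`.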

In [15]:
# Check value counts.
df4.caltrans_or_partner.value_counts()

a partner    350
unknown      231
Caltrans     162
Name: caltrans_or_partner, dtype: int64

In [16]:
# Make sure every row is tagged. 
df4.sentence.count(), len(df4)

(743, 743)

In [19]:
df4.loc[df4['sentence'] == 'The lead agency is Caltrans and the project is not on the SHS.'].sample(3)

| | project_name | lead_agency | primary_mode | secondary_mode_s_ | shs_capacity_increase_detail | primary_mode_SHS | secondary_mode_SHS | testing | caltrans_or_partner | On_SHS | sentence |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 700 | Jackson Street In Riverside | Caltrans | grade separation | none | none | not highway related | not highway related | Caltrans | Caltrans | not on the SHS | The lead agency is Caltrans and the project is not on the SHS. |
| 25 | Sagehen Adin Its | Caltrans | its | none | none | highway related | not highway related | Caltrans | Caltrans | not on the SHS | The lead agency is Caltrans and the project is not on the SHS. |
| 2 | Arcata Cap & Humboldt Area Rapid Transit | Caltrans | transit | highway | none | not highway related | highway related | Caltrans | Caltrans | not on the SHS | The lead agency is Caltrans and the project is not on the SHS. |
