## General function to clean up data from various grants
To-Do
* De duplicate projects
* Rearrange counties in County column in alphabetical order.
* Millions to thousands -> seems easier to read.
* Differentiate btwn project START year and END year.
* Add Post Mile column

Done
* Switch City of Berkeley to Berkeley City. https://github.com/cal-itp/data-analyses/blob/main/Agreement_Overlap/add_dla.ipynb

Strategy/Questions:
* Make sure one row=one project. How? 
* What should be the unit of project cost?
* Break up Caltrans by district or leave as is? 

Columns/Data Dictionary
* project_title (str): N/A.
* lead_agency (str): the entity leading the project or receiving the grant.
* project_year (TBD): when the project will begin.
* project_category (str): the category/categories a project belongs to.
* grant_program (str): the fund a project is receiving funds for. This does not preclude the fact that a project can receive funds from mulitple programs. 
* phase (str): the latest phase the project is in.
* project_description (str): N/A.
* total_project_cost_(millions): N/A.
* total_available_funds_(millions): all the funds available to the project.
* unfunded_needs_(millions): subtract total_project_cost_(millions) by total_available_funds_(millionis).
* city (str): the city a project is located in.
* county (str): the county a project is lcoated in.
* location (str): an address or more detailed information regarding where the project will take place.
* geometry: geospatial information.
* data_source (str): N/A.
* notes (str): additional information regarding the project.
* funding_notes (str): additional funding information regarding the project.
* ct_district (int): the Caltrans district a project is located in.
* fully_funded (str): comparing total_available_funds_(millions) and total_project_cost_(millions) to figure out whether a project is fully, partially, or not funded.
* enough_info (str): counting the # of null values and # of strings in the project description to determine whether or not a project has enough information.

In [1]:
import _cleaning_utils
import _harmonization_utils as harmonization_utils
import _state_rail_plan_utils as srp_utils
import geopandas as gpd
import pandas as pd
import shapely
from calitp_data_analysis.sql import to_snakecase


import os
os.environ['USE_PYGEOS'] = '0'
import geopandas

In a future release, GeoPandas will switch to using Shapely by default. If you are using PyGEOS directly (calling PyGEOS functions on geometries from GeoPandas), this will then stop working and you are encouraged to migrate from PyGEOS to Shapely 2.0 (https://shapely.readthedocs.io/en/latest/migration_pygeos.html).
  import geopandas as gpd


In [2]:
"""
import re
import nltk
from nltk import ngrams
from nltk.corpus import stopwords
from nltk.tokenize import sent_tokenize, word_tokenize
import re
from collections import Counter
from autocorrect import Speller
"""

'\nimport re\nimport nltk\nfrom nltk import ngrams\nfrom nltk.corpus import stopwords\nfrom nltk.tokenize import sent_tokenize, word_tokenize\nimport re\nfrom collections import Counter\nfrom autocorrect import Speller\n'

In [3]:
pd.options.display.max_columns = 100
pd.options.display.float_format = "{:.2f}".format
pd.set_option("display.max_rows", None)
pd.set_option("display.max_colwidth", None)

In [4]:
# lost = harmonization_utils.load_lost()

In [5]:
def create_notes(df, note_cols: list, new_col_name: str):
    """
    Concat multiple columns into one.
    """
    prefix = "_"
    for column in note_cols:
        df[f"{prefix}{column}"] = df[column].astype(str)
    note_cols = [prefix + sub for sub in note_cols]

    # https://stackoverflow.com/questions/65532480/how-to-combine-column-names-and-values
    def combine_notes(x):
        return ", ".join([col + ": " + x[col] for col in note_cols])

    df[new_col_name] = df.apply(combine_notes, axis=1)
    df[new_col_name] = df[new_col_name].str.replace("_", " ")

    return df

In [6]:
# srp = harmonization_utils.load_state_rail_plan()

In [7]:
columns_to_keep = [
    "project_title",
    "lead_agency",
    "project_year",
    "project_category",
    "grant_program",
    "phase",
    "project_description",
    "total_project_cost_(millions)",
    "total_available_funds_(millions)",
    "unfunded_needs_(millions)",
    "city",
    "county",
    "location",
    "geometry",
    "data_source",
    "notes",
    "funding_notes",
    "ct_district",
    "project_description2",
]

In [8]:
def harmonizing(
    df,
    agency_name_col: str,
    project_name_col: str,
    project_description_col: str,
    project_category_col: str,
    phase_col: str,
    project_cost_col: str,
    location_col: str,
    geography_col: str,
    post_mile_col:str,
    county_col: str,
    city_col: str,
    district_col:str, 
    project_start_year_col: str,
    project_completion_year_col:str,
    program_col: str,
    data_source: str,
    fund_cols: list,
    notes_cols: list,
    cost_in_millions: bool = True,
):
    """
    Take a dataset and change the column names/types to
    default names and formats.
    """
    rename_columns = {
        agency_name_col: "lead_agency",
        project_name_col: "project_title",
        project_description_col: "project_description",
        project_category_col: "project_category",
        project_cost_col: "total_project_cost_(millions)",
        location_col: "location",
        geography_col: "geometry",
        phase_col: "phase",
        post_mile_col: "post_mile",
        county_col: "county",
        city_col: "city",
        district_col: "ct_district",
        project_start_year_col: "project_start_year",
        project_end_year_col: "project_completion_year",
        program_col: "grant_program",
    }
    # Rename columns
    df = df.rename(columns=rename_columns)
    
    # Clean up monetary columns to be interger
    cost_columns = df.columns[df.columns.str.contains("(cost|funds)")].tolist()
    for i in cost_columns:
        df[i] = df[i].apply(pd.to_numeric, errors="coerce").fillna(0)
    
    # Divide cost columns by millions
    # If bool is set to True
    if cost_in_millions:
        for i in fund_cols + ["total_project_cost_(millions)"]:
            df[i] = df[i].divide(1_000_000)

    # Add new column with funding breakout
    # Since it's summarized above and the details are suppressed.
    df["total_available_funds_(millions)"] = df[fund_cols].sum(axis=1)
    df = create_notes(df, fund_cols, "funding_notes")
    
    # Add column for unfunded needs
    df["unfunded_needs_(millions)"] = df["total_project_cost_(millions)"] - df["total_available_funds_(millions)"]
    
    # Add program
    df["data_source"] = data_source
    
    # Create columns even if they don't exist, just to harmonize
    # before concatting.
    create_columns = [
        "county",
        "city",
        "notes",
        "project_start_year",
        "project_completion_year",
        "post_mile",
        "project_category",
        "location",
        "phase",
        "ct_district"
    ]
    for column in create_columns:
        if column not in df:
            df[column] = "None"
    if "geometry" not in df:
        df["geometry"] = None
    if "grant_program" not in df:
        df["grant_program"] = data_source
    
    # Create notes - aka other columns that were supressed
    df = create_notes(df, notes_cols, "notes")
    
    # Clean up string columns
    string_cols = df.select_dtypes(include=["object"]).columns.to_list()
    for i in string_cols:
        df[i] = df[i].str.replace("_", " ").str.strip().str.title()

    # Fill in any nulls
    df['project_description2'] = df.project_description.fillna(df.project_title)
    df = df.fillna(df.dtypes.replace({"float64": 0.0, "object": "None"}))

    # Only keep certain columns
    df = df[columns_to_keep]
    return df

In [9]:
def harmonize_srp():
    df = harmonization_utils.load_state_rail_plan()
    df = harmonizing(
        df,
        agency_name_col="lead_agency",
        project_name_col="project_name",
        project_description_col="project_description",
        project_category_col="project_category",
        phase_col="",
        project_cost_col="total_project_cost",
        location_col="corridor",
        geography_col="",
        county_col="",
        city_col="",
        district_col="",
        project_year_col="",
        program_col="",
        data_source="State Rail Plan",
        fund_cols=[],
        notes_cols = ['project_time_horizon','srp_region', 
       'sub_corridor_node_1', 'sub_corridor_node_2', 'itsp_corridor'],
        cost_in_millions=True,
    )

    return df

In [10]:
# srp_harmonized = harmonize_srp()

In [11]:
# srp_harmonized.tail()

In [12]:
# srp_og = harmonization_utils.load_state_rail_plan()

In [13]:
# srp_og.sample()

In [14]:
# srp_og.columns

In [15]:
def harmonize_lost():
    df = harmonization_utils.load_lost()
    df = harmonizing(
        df,
        agency_name_col="agency",
        project_name_col="project_title",
        project_description_col="project_description",
        project_category_col="project_category",
        project_cost_col="cost__in_millions_",
        phase_col="",
        location_col="location",
        geography_col="",
        county_col="county",
        city_col="city",
        district_col = "",
        project_year_col="",
        program_col="measure",
        data_source="Local Options Sales Tax",
        fund_cols=[
            "estimated_lost_funds",
            "estimated_federal_funds",
            "estimated_state_funds",
            "estimated_local_funds",
            "estimated_other_funds",
        ],
        notes_cols = ["notes"],
        cost_in_millions=False,
    )

    return df

In [16]:
# lost_og = harmonization_utils.load_lost()

In [17]:
# lost_og.columns

In [18]:
def harmonize_sb1():
    df = harmonization_utils.load_sb1()
    df = harmonizing(
        df,
        agency_name_col="implementingagency",
        project_name_col="projecttitle_x",
        project_description_col="projectdescription",
        project_category_col="",
        phase_col="projectstatuses",
        project_cost_col="totalcost",
        location_col="",
        geography_col="geometry",
        county_col="countynames",
        city_col="citynames",
        district_col = "ct_districts",
        project_year_col="fiscalyears",
        program_col="programcodes",
        data_source="SB1",
        fund_cols=["sb1funds", "iijafunds"],
        notes_cols = ['iijaprogram','dateupdated','isonshs', 'isonshscodes','agencies', 'popup'],
        cost_in_millions=True,
    )

    return df

In [19]:
# sb1_og = harmonization_utils.load_sb1()

In [20]:
# sb1_og.columns

In [21]:
# sb1_og.drop(columns = ['geometry']).sample(3)

In [22]:
# harmonized_sb1 = harmonize_sb1()

### Stacking

#### Does this project have enough information to be useful?

In [23]:
def categorize_info(df): 
 
    #Get percentiles in objects for total vehicle.
    p50_project_desc= df.project_description_count.quantile(0.50).astype(float)
    p50_null_values = df.total_percent_null_values.quantile(0.50).astype(float)
    
    #Function for fleet size
    def percentile_info (row):
        if ((row.project_description_count >= p50_project_desc) and (row.total_percent_null_values <= p50_null_values)):
            return "Yes"
        else: 
            return "No"
    df["enough_info"] = df.apply(lambda x: percentile_info(x), axis=1)
  
    return df    

In [24]:
def enough_info(df):
    # Select string columns
    string_cols = df.select_dtypes(include=["object"]).columns.to_list()
    
    # https://stackoverflow.com/questions/73839250/count-number-of-occurrences-of-text-over-row-python-pandas
    # Count "nones" in string columns
    df['none_counts'] = df[string_cols].astype(str).sum(axis=1).str.lower().str.count("none")
    
    # Count zeroes
    df['zero_counts'] = (df == 0).astype(int).sum(axis=1)
    
    # Total up all none/zeroes 
    df["total_percent_null_values"] = df[['none_counts','zero_counts']].sum(axis=1)/len(df.columns) * 100
    
    # Count project descriptions
    df["project_description_count"] = df["project_description"].str.count('\w+')
    
    # Categorize whether it has enough info or not
    df = categorize_info(df)
    
    # Compress columns to retain some info
    df['counts'] = 'number of strings in project desc: ' + df.project_description_count.astype(str) + ' % of null values:' + df.total_percent_null_values.astype(int).astype(str)
    
    df = df.drop(columns = ['none_counts','zero_counts','project_description_count','total_percent_null_values'])
    return df 

#### Correct lead agencies again

In [123]:
def flip_county_city(df, agency_col:str):
    # https://github.com/cal-itp/data-analyses/blob/main/Agreement_Overlap/add_dla.ipynb
    to_correct = df[(df[agency_col].str.contains('County')) | (df[agency_col].str.contains('City'))]
    to_correct = to_correct[[agency_col]].drop_duplicates().reset_index(drop = True)
    to_correct['str_len'] = to_correct[agency_col].str.split().str.len()
    to_correct = to_correct[to_correct.str_len <= 5 ].reset_index(drop = True)
    to_correct[['name_pt1', 'name_pt2']] = to_correct[agency_col].str.split(' Of ', 1, expand=True)
    to_correct['new_name'] = to_correct['name_pt2'] + ' ' + to_correct['name_pt1']
    
    new_names_dictionary = (dict(to_correct[[agency_col, 'new_name']].values))
    df['agency_corrected'] = df[agency_col].map(new_names_dictionary)
    df['agency_corrected'] = df['agency_corrected'].fillna(df[agency_col])
    
    df = df.drop(columns = [agency_col])
    df = df.rename(columns = {"agency_corrected":agency_col})
    
    return df 

In [126]:
# all_projects_metric.lead_agency.value_counts()

In [25]:
def add_all_projects():

    # Load  dataframes
    state_rail_plan = harmonize_srp()
    lost = harmonize_lost()
    sb1 = harmonize_sb1()

    # Concat for df
    df = pd.concat([lost, state_rail_plan, sb1])
    
    # Clean agency names
    df = harmonization_utils.organization_cleaning(df, "lead_agency")
    df = flip_county_city(df, 'lead_agency')
    
    # Determine if the project completely funded or not?
    # Add up all available funds
    df["fully_funded"] = df.apply(harmonization_utils.funding_vs_expenses, axis=1)
    
    # Does this project have enough info?
    df = enough_info(df)
    
    
    return df

In [26]:
all_projects = add_all_projects()



In [27]:
all_projects.columns

Index(['project_title', 'lead_agency', 'project_year', 'project_category',
       'grant_program', 'phase', 'project_description',
       'total_project_cost_(millions)', 'total_available_funds_(millions)',
       'unfunded_needs_(millions)', 'city', 'county', 'location', 'geometry',
       'data_source', 'notes', 'funding_notes', 'ct_district',
       'project_description2', 'fully_funded', 'enough_info', 'counts'],
      dtype='object')

In [53]:
all_projects.drop(columns = ['geometry']).sample(3)

Unnamed: 0,project_title,lead_agency,project_year,project_category,grant_program,phase,project_description,total_project_cost_(millions),total_available_funds_(millions),unfunded_needs_(millions),city,county,location,data_source,notes,funding_notes,ct_district,project_description2,fully_funded,enough_info,counts
358,,,"19/20, 20/21",,Sgr,"In Progress, Planned",,0.12,0.12,0.0,Corcoran,Kings,,Sb1,"Iijaprogram: , Dateupdated: 2021-09-09, Isonshs: N, Isonshscodes: N, Agencies: City Of Corcoran, Popup: None","Sb1Funds: 0.121909, Iijafunds: 0.0",,,Fully funded,No,number of strings in project desc: 1 % of null values:40
1085,Spring Street Overlay,City Of Signal Hill,19/20,,Lsr,Planned,,3.0,0.13,2.87,Signal Hill,Los Angeles,,Sb1,"Iijaprogram: , Dateupdated: 6/30/2021, Isonshs: N, Isonshscodes: N, Agencies: City Of Signal Hill, Popup: None","Sb1Funds: 0.126705, Iijafunds: 0.0",,,Partially funded,No,number of strings in project desc: 1 % of null values:27
2106,Major Damage Restoration,Caltrans,20/21,,Shopp,In Progress,"A $16.52 Million Dollar Project In Del Norte County On Route 101 Will Realign Roadway, Construct Retaining Walls, And Place A Video Monitoring System.",16.52,9.08,7.44,,Del Norte,,Sb1,"Iijaprogram: State Hwy Operations & Protection Program Major-Federal, Dateupdated: 2022-06-28, Isonshs: None, Isonshscodes: Y, Agencies: Caltrans, Popup: Major Damage Restorationbr","Sb1Funds: 0.0, Iijafunds: 9.083566",1.0,"A $16.52 Million Dollar Project In Del Norte County On Route 101 Will Realign Roadway, Construct Retaining Walls, And Place A Video Monitoring System.",Partially funded,Yes,number of strings in project desc: 25 % of null values:18


In [29]:
all_projects.grant_program.value_counts()

Shopp                              1631
Imperial D 2008                     726
Hm                                  520
Lsr                                 285
State Rail Plan                     276
Atp                                 216
Sgr                                 156
Stip                                126
San Mateo W 2018                     91
Los Angeles Angeles M 2016           89
San Benito G 2004                    86
Santa Clara B 2016                   85
Tircp                                82
Shopa                                79
San Mateo A2 2006                    78
Alameda B 2000                       62
San Diego A 2004                     59
San Joaquin K 2003                   56
Tcep                                 55
San Bernardino I2 2018               51
Sacramento A2 2004                   51
Tulare R 2006                        49
Sta                                  49
Sonoma M 2004                        44
Alameda Bb 2014                      40


In [30]:
all_projects.data_source.value_counts()

Sb1                        3305
Local Options Sales Tax    1849
State Rail Plan             276
Name: data_source, dtype: int64

In [31]:
all_projects["total_project_cost_(millions)"].value_counts().head() / len(all_projects) * 100

0.00    20.06
0.33     2.65
0.25     1.25
7.61     0.85
17.86    0.77
Name: total_project_cost_(millions), dtype: float64

In [52]:
all_projects.fully_funded.value_counts()

No available funding info    1963
Partially funded             1796
No project cost info         1089
Fully funded                  582
Name: fully_funded, dtype: int64

### Metrics
* Rewrite to be shorter?
* Correct spelling of descriptions?
* https://github.com/cal-itp/data-analyses/blob/29ed3ad1d107c6be09fecbc1a5f3d8ef5f2b2da6/dla/dla_utils/clean_data.py#L305

In [65]:
def add_categories(df):
    """
    Create general categories for each projects.
    https://github.com/cal-itp/data-analyses/blob/29ed3ad1d107c6be09fecbc1a5f3d8ef5f2b2da6/dla/dla_utils/clean_data.py#L305
    """
    # There are many projects that are 
    ACTIVE_TRANSPORTATION = ['bike', 'bicycle', 'cyclist', 
                             'pedestrian', 
                             ## including the spelling errors of `pedestrian`
                             'pedestrain',
                             'crosswalk', 
                             'bulb out', 'bulb-out', 
                             'active transp', 'traffic reduction', 
                             'speed reduction', 'ped', 'srts', 
                             'safe routes to school',
                             'sidewalk', 'side walk', 'Cl ', 'trail',
                             'atp'
                            ]
    TRANSIT = ['bus', 'metro', 'station', #Station comes up a few times as a charging station and also as a train station
               'transit','fare', 'brt', 'yarts', 'railroad', 'highway-rail'
               # , 'station' in description and 'charging station' not in description
              ] 
    BRIDGE = ["bridge", 'viaduct']
    STREET = ['traffic signal', 'resurface', 'resurfacing', 'slurry', 'seal' 
              'sign', 'stripe', 'striping', 'median', 
              'guard rail', 'guardrail', 
              'road', 'street', 
              'sinkhole', 'intersection', 'signal', 'curb',
              'light', 'tree', 'pavement', 'roundabout'
             ]

    FREEWAY = ['hov ', 'hot ', 'freeway', 'highway', 'express lanes', 'hwy']

    INFRA_RESILIENCY_ER = ['repair', 'emergency', 'replace','retrofit', 'er',
                           'rehab', 'improvements', 'seismic', 'reconstruct', 'restoration']

    CONGESTION_RELIEF = ['congestion', 'rideshare','ridesharing', 'vanpool', 'car share']

    NOT_INC = ['charging', 'fueling', 'cng', 'bridge', 'trail',
           'k-rail', 'guardrails', 'bridge rail', 'guard', 'guarrail']
    
    PASSENGER_MODE = ['non sov', 'high quality transit areas', 
                      'hqta', 'hov']
    
    
    SAFETY = ['fatalities','safe', 'speed management','signal coordination',
              'slow speeds', 'roundabouts', 'victims','collisoins','protect',
              'crash', 'modification factors', 'safety system'] 
    
    def categorize_project_descriptions(row):
        """
        This function takes a individual type of work description (row of a dataframe)
        and returns a dummy flag of 1 if it finds keyword present in
        project categories (active transportation, transit, bridge, etc).
        A description can contain multiple keywords across categories.
        """
        # Clean up project description 2
        project_description = (row.project_description2.lower()
                               .replace("-","")
                               .replace(".","")
                               .replace(":","")
                              )
    
        # Store a bunch of columns that will be flagged
        # A project can involve multiple things...also, not sure what's in the descriptions
        active_transp = ""
        transit = ""
        bridge =""
        street = ""
        freeway = ""
        infra_resiliency_er = ""
        congestion_relief = ""
        passenger_mode_shift = ""
        safety = ""
        
        if any(word in project_description for word in ACTIVE_TRANSPORTATION):
            active_transp = "active transportation"
        
        #if any(word in description if instanceof(word, str) else word(description) for word in TRANSIT)

        if (any(word in project_description for word in TRANSIT) and 
            not any(exclude_word in project_description for exclude_word in NOT_INC)
           ):
            transit = "transit"
        if any(word in project_description for word in BRIDGE):
            bridge = "bridge"
        if any(word in project_description for word in STREET):
            street = "street"
        if any(word in project_description for word in FREEWAY):
            freeway = "freeway" 
        if any(word in project_description for word in INFRA_RESILIENCY_ER):
            infra_resiliency_er = "infrastructure"
        if any(word in project_description for word in CONGESTION_RELIEF):
            congestion_relief = "congestion relief"    
        if any(word in project_description for word in PASSENGER_MODE):
            passenger_mode_shift = "passenger mode shift"    
        if any(word in project_description for word in SAFETY):
            safety = "safety"    
        return pd.Series(
            [active_transp, transit, bridge, street, freeway, infra_resiliency_er, congestion_relief,
            passenger_mode_shift, safety], 
            index=['active_transp', 'transit', 'bridge', 'street', 
                   'freeway', 'infra_resiliency_er', 'congestion_relief',
                  'passenger_mode_shift', 'safety']
        )
    
    
    work_categories = df.apply(categorize_project_descriptions, axis=1)
    work_cols = list(work_categories.columns)
    df2 = pd.concat([df, work_categories], axis=1)
    
    df2['categories'] = df2[work_cols].agg(' '.join, axis=1)
    df2['categories'] = df2['categories'].str.strip()
    df2 = df2.drop(columns = work_cols)
    
    return df2



In [66]:
all_projects_metric = add_categories(all_projects)

In [67]:
all_projects_metric.drop(columns = ['geometry']).sample(3)

Unnamed: 0,project_title,lead_agency,project_year,project_category,grant_program,phase,project_description,total_project_cost_(millions),total_available_funds_(millions),unfunded_needs_(millions),city,county,location,data_source,notes,funding_notes,ct_district,project_description2,fully_funded,enough_info,counts,categories
1886,Safety - Hm4,Caltrans,21/22,,Hm,In Progress,Maintain/Repair Transportaiton Management Systems,0.2,0.0,0.2,Visalia,Tulare,,Sb1,"Iijaprogram: None, Dateupdated: 2022-09-19, Isonshs: None, Isonshscodes: N, Agencies: Caltrans, Popup:","Sb1Funds: 0.0, Iijafunds: 0.0",6.0,Maintain/Repair Transportaiton Management Systems,No available funding info,Yes,number of strings in project desc: 5 % of null values:22,infrastructure
1119,Bus/Carpool Ramp Connection From Sr 50 E To Sr 99 S,,,Freeway Safety And Congestion Relief Program,Sacramento A2 2004,,,47.0,0.0,47.0,,Sacramento,,Local Options Sales Tax,Notes: No Specific Amounts For Each Project. Divided Total Fund Slated For A Project Category By Number Of Projects In That Category.,"Estimated Lost Funds: 0.0, Estimated Federal Funds: 0.0, Estimated State Funds: 0.0, Estimated Local Funds: 0, Estimated Other Funds: 0.0",,Bus/Carpool Ramp Connection From Sr 50 E To Sr 99 S,No available funding info,No,number of strings in project desc: 1 % of null values:40,transit
1589,Highway 101: Betteravia Road Interchange,,,,Santa Barbara A 2008,,Improve The Operations Of Intersections At Betteravia Road And Highway 101 By Constructionructioning A\nNorthbound Loop On Ramp In The South East Interchange Quadrant.,2.0,5.0,-3.0,,Santa Barbara,,Local Options Sales Tax,Notes: Nan,"Estimated Lost Funds: 2.0, Estimated Federal Funds: 0.0, Estimated State Funds: 0.0, Estimated Local Funds: 0, Estimated Other Funds: 3.0",,Improve The Operations Of Intersections At Betteravia Road And Highway 101 By Constructionructioning A\nNorthbound Loop On Ramp In The South East Interchange Quadrant.,Fully funded,No,number of strings in project desc: 24 % of null values:36,street freeway infrastructure


In [69]:
all_projects_metric.categories.value_counts().head(30)

infrastructure                                                    1436
                                                                  1381
street  infrastructure                                             739
street                                                             372
bridge   infrastructure                                            226
transit    infrastructure                                          201
active transportation   street  infrastructure                     106
transit                                                             75
street  infrastructure   safety                                     58
transit  street  infrastructure                                     52
freeway infrastructure                                              52
bridge street  infrastructure                                       45
bridge                                                              44
active transportation     infrastructure                            44
active

In [88]:
def apply_metrics(df):
    def categorize_metrics(row):
        categories = row.categories.lower()
        safety = ""
        passenger_mode_shift = ""
        infill_development = ""
        
        if any(word in categories for word in ['infrastructure','safety',]):
            safety = "safety"
        if any(word in categories for word in ['active transportation', 'passenger_mode_shift', "congestion relief"]):
            passenger_mode_shift = "passenger_mode_shift"
        if any(word in categories for word in ['transit', 'active transportation',]):
            infill_development = "infill_development" 
       
        return pd.Series(
            [safety,passenger_mode_shift,infill_development], 
            index=['safety', 'passenger_mode_shift', 'infill_development']
        )
    
    work_categories = df.apply(categorize_metrics, axis=1)
    work_cols = list(work_categories.columns)
    df2 = pd.concat([df, work_categories], axis=1)
    
    df2['applicable_metrics'] = df2[work_cols].agg(' '.join, axis=1)
    df2['applicable_metrics'] = df2['applicable_metrics'].str.strip()
    df2 = df2.drop(columns = work_cols)
    
    return df2

In [89]:
all_projects_metric = apply_metrics(all_projects_metric)

In [90]:
all_projects_metric[['grant_program','project_description2','categories','applicable_metrics']].sample(50)

Unnamed: 0,grant_program,project_description2,categories,applicable_metrics
2587,Shopp,A $4.91 Million Dollar Project In Santa Barbara County On Route 154 Will Place High Friction Surface Treatment (Hfst) And Construct Centerline Rumble Strip.,infrastructure,safety
1058,Lsr,,,
17,State Rail Plan,Expansion Of The Smart Fleet To Accommodate Service Expansion.,infrastructure,safety
845,Imperial D 2008,Overlay,infrastructure,safety
1933,Hm,Maintain/Repair Pavement - Seal Coat,street infrastructure,safety
2032,Shopp,A $11.57 Million Dollar Project In Humboldt County On Route 299 Will Widen Shoulders.,infrastructure,safety
331,Sgr,,,
3222,Shopp,"A $5.8 Million Dollar Project In San Diego County On Route 5 Will Apply Polyester Concrete Overlay To Bridge Decks, Apply Methacrylate To Approach Slabs, And Repair Spalls. (Bridge Deck Preservation)",bridge infrastructure,safety
106,State Rail Plan,Double Track From Mp 436.65 To Cp Santa Susana To Allow At-Speed Meets At 437.4. Add 2Nd Platform At Simi Valley Station To Allow Boarding From Both Tracks.,transit,infill_development
1092,Lsr,,,


In [86]:
all_projects_metric.applicable_metrics.nunique()

7

### Categorization

In [43]:
def get_list_of_words(df, col: str) -> list:
    """
    Natalie's function to clean and place words in a project description column
    into a list
    """
    # get just the one col
    column = df[[col]]

    # remove single-dimensional entries from the shape of an array
    col_text = column.squeeze()
    # get list of words
    text_list = col_text.tolist()

    # Join all the column into one large text blob, lower text
    text_list = " ".join(text_list).lower()

    # remove punctuation
    text_list = re.sub(r"[^\w\s]", "", text_list)

    # List of stopwords
    swords = [re.sub(r"[^A-z\s]", "", sword) for sword in stopwords.words("english")]

    # Remove stopwords
    clean_text_list = [
        word for word in word_tokenize(text_list.lower()) if word not in swords
    ]

    return clean_text_list

In [44]:
def find_common_phrases(df, description_column: str, values_to_add: list):

    # Break apart every word in the description column into a list
    descriptions_list = get_list_of_words(df, description_column)

    # Get phrases of whatever length you want (2,3,4,etc)
    c = Counter([" ".join(y) for x in [2] for y in ngrams(descriptions_list, x)])

    # Make a dataframe out of the counter values
    df_phrases = pd.DataFrame({"phrases": list(c.keys()), "total": list(c.values())})

    # Take phrases that are repeated more than 40 times and turn it into a list
    df_phrases = ((df_phrases.loc[df_phrases["total"] > 40])).reset_index(drop=True)
    common_phrases_list = df_phrases.phrases.tolist()

    phrases_to_del = [
        "san bernardino",
        "los angeles",
        "contra costa",
        "el dorado",
        "san luis obispo",
        "luis obispo",
        "del norte",
        "san francisco",
        "improve approximately",
    ]

    common_phrases_list = list(set(common_phrases_list) - set(phrases_to_del))

    # CLean up the list to delete county information/etc
    words_to_delete = [
        "county",
        "route",
        "dollar",
        "mile",
        "santa",
        "project",
        "san",
        "lanes",
        "lane",
        "2",
        "4",
        "financial",
        "prop",
        "best",
        "approximately",
    ]

    for word in words_to_delete:
        common_phrases_list = [x for x in common_phrases_list if word not in x]

    # ADD certain keywords here
    # Operating Additional Service
    common_phrases_list.extend(values_to_add)

    return common_phrases_list

In [45]:
def categorize_projects(
    df,
    description_column: str,
    project_id_column: str,
    title_column: str,
    values_to_add: list,
):

    # Find most common 2 word phrases for some automatic project categories
    common_phrases_list = find_common_phrases(df, description_column, values_to_add)

    # Place all the words in common_phrases_list into a blob named query
    # https://stackoverflow.com/questions/64727090/extract-all-matching-keywords-from-a-list-of-words-and-create-a-new-dataframe-pa
    query = "|".join(common_phrases_list)

    # Remove punctation and lower strings in original description column befores searching
    df["clean_description"] = (
        df[description_column]
        .str.lower()
        .str.replace("-", " ", regex=True)
        .str.replace("(", " ", regex=True)
        .str.replace(")", " ", regex=True)
        .str.replace(".", " ", regex=True)
        .str.strip()
    )

    # Search through description column for the most common phrases
    # Input the results in the new column
    df["auto_project_category"] = df["clean_description"].str.findall(
        r"\b({})\b".format(query)
    )

    # Explode to take categories out of a list
    # Drop duplicate project keywords by title
    df = (
        df.explode("auto_project_category")
        .sort_values([project_id_column, title_column])
        .drop_duplicates(
            subset=[
                description_column,
                project_id_column,
                title_column,
                "auto_project_category",
            ]
        )
    )

    # Fill any uncategorized projects as "Other"
    df["auto_project_category"] = (
        df["auto_project_category"].fillna("Other").str.title()
    )

    # Correct spelling
    spell = Speller(lang="en")
    df["auto_project_category"] = df["auto_project_category"].apply(
        lambda x: " ".join([spell(i) for i in x.split()])
    )

    # Summarize - put all the categories onto one line
    df = (
        df.groupby(
            [
                description_column,
                project_id_column,
                title_column,
            ]
        )["auto_project_category"]
        .apply(",".join)
        .reset_index()
    )

    return df

In [46]:
def add_all_projects2():

    # Load  dataframes
    state_rail_plan = harmonize_srp()
    lost = harominze_lost()
    sb1 = harmonize_sb1()

    # Concat for df
    all_projects_df = pd.concat([lost, state_rail_plan, sb1])

    # Categorize
    categories = categorize_projects(
        all_projects_df,
        "project_description",
        "project_title",
        "project_id",
        [
            "operating",
            "service",
            "zero emission vehicle",
            "zev",
            "maintain/repair",
            "repair/replace",
        ],
    )

    # Merge categorized
    all_projects_df = pd.merge(
        all_projects_df.drop(columns=["clean_description"]),
        categories,
        how="left",
        on=["project_description", "project_title", "project_id"],
    )

    # Rename
    all_projects_df = all_projects_df.drop(columns=["auto_project_category_x"]).rename(
        columns={"auto_project_category_y": "auto_tagged_project_categories"}
    )
    # Concat for gdf
    all_projects_gdf = pd.concat([sb1])
    all_projects_gdf = all_projects_gdf.set_geometry("location")

    return all_projects_df, all_projects_gdf

In [47]:
# all_projects, all_projects_geo = add_all_projects()

In [48]:
# all_projects.drop(columns = ['location'])[['project_title','project_category', 'auto_tagged_project_categories','project_description','total_available_funds','funding_notes']].sample(100)

### Look at the data

In [51]:
all_projects.groupby(["lead_agency"]).agg({"project_id": "nunique"}).sort_values(
    "project_id", ascending=False
).head(10)

KeyError: "Column(s) ['project_id'] do not exist"

In [None]:
all_projects[
    (all_projects.county == "Kern")
    & (all_projects.project_description.str.contains("Seal Coat"))
].drop(columns=["location"])

In [None]:
# all_projects.groupby(['project_category','auto_tagged_project_categories']).agg({'project_id':'nunique'})

In [None]:
all_projects.groupby(["auto_tagged_project_categories"]).agg(
    {"project_id": "nunique"}
).sort_values("project_id", ascending=False).head(10)

In [None]:
all_projects.groupby(["project_category"]).agg({"project_id": "nunique"}).sort_values(
    "project_id", ascending=False
).head(10)

In [None]:
all_projects.groupby(["project_description"]).agg(
    {"project_id": "nunique"}
).sort_values("project_id", ascending=False).head(10)

In [None]:
all_projects.groupby(["county"]).agg({"project_id": "nunique"}).sort_values(
    "project_id", ascending=False
).head(10)

In [None]:
all_projects.lead_agency.nunique()

In [None]:
all_projects.total_project_cost.describe()

In [None]:
all_projects.loc[all_projects.fully_funded == "Fully funded"].groupby(
    ["data_source"]
).agg({"project_id": "nunique"})

In [None]:
all_projects.loc[all_projects.fully_funded == "Partially funded"].groupby(
    ["data_source"]
).agg({"project_id": "nunique"})

In [None]:
all_projects.groupby(["data_source"]).agg({"project_id": "nunique"})

In [None]:
all_projects.groupby(["fully_funded"]).agg(
    {"project_id": "nunique"}
).reset_index().sort_values("project_id", ascending=False)

In [None]:
all_projects.groupby(["data_source", "fully_funded"]).agg({"project_id": "nunique"})