# Identifying Project Groupings

Looking into the different agencies, Locodes, and Project IDs to identify instances of multiple obligations for the same project or type of funding in the same timeframe.

We will look at various geographic locations, as well as the county with the most obligations, Humboldt County.


In [1]:
import pandas as pd
from siuba import _, mutate, count, filter, group_by, ungroup, summarize, show_query, arrange, collect

import numpy as np
from sqlalchemy import create_engine
from siuba.sql import LazyTbl

In [2]:
#! pip install plotnine

In [3]:
from plotnine import *

In [4]:
pd.set_option('display.max_columns', None)
pd.options.display.float_format = "{:.2f}".format

In [5]:
df = pd.read_csv('gs://calitp-analytics-data/data-analyses/dla/e-76Obligated/clean_obligated_waiting.csv', low_memory=False).drop('Unnamed: 0', axis=1)



In [6]:
df.head()

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode
0,Obligated,BPMPL,5904(121),Humboldt County,2018-12-18,2018-12-18,2018-12-18,2018-12-18,2018-12-27,0.0,0.0,0.0,Authorized,5904,1,E-76 approved on,,0.0,9.0,HBPLOCAL,14 Bridges In Humboldt County,Bridge Preventive Maintenance - Deck Joints,3,,,NON-MPO,,5904,121,True
1,Obligated,ER,32D0(008),Mendocino County,2018-12-17,2018-12-19,2018-12-20,2018-12-20,2018-12-27,11508.0,0.0,13000.0,Authorized,5910,1,E-76 approved on,1.0,1.0,7.0,,"Comptche Ukiah Road, Cr 223 Pm 17.25",Permanent Restoration,3,2018-12-17,2018-12-18,NON-MPO,,32D0,8,False
2,Obligated,ER,4820(004),Humboldt County,2018-12-07,2018-12-21,2018-12-21,2018-12-21,2018-12-27,45499.64,0.0,51394.58,Authorized,5904,1,E-76 approved on,14.0,0.0,6.0,,Mattole Rd Pm 43.17,Permanent Restoration,5,2018-12-06,2018-12-07,NON-MPO,,4820,4,False
3,Obligated,CML,5924(244),Sacramento County,2018-12-11,2018-12-11,2018-12-21,2018-12-27,2018-12-27,207002.0,0.0,247002.0,Authorized,5924,3,E-76 approved on,4.0,16.0,0.0,SAC25086,Fair Oaks Blvd. Between Howe Ave And Munroe St,Create A Smart Growth Corridor With Barrier Se...,1,2018-12-07,2018-12-07,SACOG,,5924,244,True
4,Obligated,CML,5924(214),Sacramento County,2018-12-05,2018-12-11,2018-12-21,2018-12-27,2018-12-27,0.0,5680921.0,5702041.0,Authorized,5924,3,E-76 approved on,7.0,16.0,0.0,SAC24753,Florin Rd Between Power Inn Rd. And Florin Per...,Streetscape (tc),3,2018-11-28,2018-12-04,SACOG,,5924,214,True


In [7]:
cols = ['prepared_date','to_fmis_date','submit_to_fhwa_date','submit__to_hq_date','hq_review_date','date_request_initiated','date_completed_request']
# parse each date column with a vectorized call instead of element-wise applymap
df[cols] = df[cols].apply(pd.to_datetime, format='%Y-%m-%d')

In [None]:
df.prefix.value_counts().sort_values(ascending=False)

In [None]:
df.agency.value_counts().sort_values(ascending=False).nlargest(10)

In [None]:
df.agency.value_counts().sort_values(ascending=False).nsmallest(10)

## What dates have the most obligations? 

In [None]:
df.sample()

In [None]:
# NaT is a missing value, so test with isna() rather than comparing to the string "NaT"
df.loc[df["prepared_date"].isna()]

In [None]:
df['prepared_date'].isnull().sum()

In [None]:
df['date_request_initiated'].isnull().sum()

In [None]:
df['date_completed_request'].isnull().sum()

In [None]:
df['to_fmis_date'].isnull().sum()

`prepared_date` is the best column for checking when the obligations began. Ideally we would use `date_request_initiated` and `date_completed_request`, but they have more NaT values.
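The missing-value counts above can also be compared side by side in one call. This is a minimal sketch on a made-up frame (the values are illustrative, not taken from the dataset); only the column names come from the data.

```python
import pandas as pd

# Toy data standing in for the obligations table; all values are made up.
toy = pd.DataFrame({
    "prepared_date": pd.to_datetime(["2018-12-18", "2018-12-17", "2018-12-07"]),
    "date_request_initiated": pd.to_datetime([None, "2018-12-17", "2018-12-06"]),
    "date_completed_request": pd.to_datetime([None, "2018-12-18", None]),
})

date_cols = ["prepared_date", "date_request_initiated", "date_completed_request"]

# Count NaT per column and sort so the most complete column comes first.
missing = toy[date_cols].isna().sum().sort_values()
print(missing)
```

On the real data, `date_cols` could be swapped for the full list of date columns parsed earlier.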

Since `prepared_date` is the best option, we will create columns containing the month and year of each obligation.

In [None]:
df['prepared_y_m'] = pd.to_datetime(df["prepared_date"].dt.strftime('%Y-%m'))

In [None]:
df['prepared_y'] = pd.to_datetime(df["prepared_date"].dt.strftime('%Y'))

In [None]:
df.sample()

In [None]:
(df
    >> group_by(_.agency)
    >> count(_.prepared_date) 
    >> arrange(-_.n)
    >> filter(_.n >= 5)
)

102 rows of agency and date combinations with 5 or more obligations on the same date. Humboldt County will be covered in another notebook.
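For readers who prefer plain pandas, the same count can be sketched without siuba. This is an assumed equivalent on made-up data, not the query the notebook actually ran.

```python
import pandas as pd

# Toy data: agency "A" has five obligations on one date, "B" has two.
toy = pd.DataFrame({
    "agency": ["A"] * 6 + ["B"] * 2,
    "prepared_date": pd.to_datetime(
        ["2018-06-01"] * 5 + ["2018-06-02"] + ["2018-06-01"] * 2
    ),
})

# Count obligations per (agency, prepared_date) pair, largest first.
same_day = (
    toy.groupby(["agency", "prepared_date"])
    .size()
    .reset_index(name="n")
    .sort_values("n", ascending=False)
)

# Keep only the pairs with 5 or more obligations on the same date.
busy = same_day[same_day["n"] >= 5]
print(busy)
```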

In [None]:
(df
    >> group_by(_.prefix)
    >> count(_.prepared_y_m) 
    >> arrange(-_.n)
)

132 obligations in the same month, June 2018, for ER funds.

### Agencies with 5 or more Obligations on a given date

In [None]:
## grouping by project number too, to see whether that is a factor
(df
    >> group_by(_.agency, _.project_no)
    >> count(_.prepared_date) 
    >> arrange(-_.n)
    >> filter(_.n >= 5)
)

In [None]:
(df
    >> group_by(_.agency, _.project_no)
    >> count(_.prepared_y_m) 
    >> arrange(-_.n)
    >> filter(_.n >= 5)
)

Looks like the same number of entries. Trying by year.

In [None]:
df.sample()

In [None]:
(df
    >> group_by(_.agency, _.project_no)
    >> count(_.prepared_y) 
    >> arrange(-_.n)
    >> filter(_.n >= 5)
)

got some new entries! 
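The day, month, and year comparisons can also be run in one pass. The sketch below assumes a hypothetical helper, `groups_at_or_above`, and made-up data; it uses pandas periods rather than the `prepared_y_m` and `prepared_y` columns created above.

```python
import pandas as pd

# Toy data: one project with six obligations spread across 2018.
toy = pd.DataFrame({
    "agency": ["A"] * 6,
    "project_no": ["1(001)"] * 6,
    "prepared_date": pd.to_datetime([
        "2018-01-05", "2018-01-20", "2018-03-02",
        "2018-06-11", "2018-09-30", "2018-12-01",
    ]),
})

def groups_at_or_above(frame, freq, threshold=5):
    """Count (agency, project_no, period) groups with at least `threshold` rows."""
    period = frame["prepared_date"].dt.to_period(freq)
    sizes = frame.groupby(["agency", "project_no", period]).size()
    return int((sizes >= threshold).sum())

# Compare how many groups clear the threshold at each granularity.
counts = {freq: groups_at_or_above(toy, freq) for freq in ["D", "M", "Y"]}
print(counts)
```

In this toy case the project only clears the threshold at the yearly level, which is the same pattern that produced the new entries here.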

Starting to look at the agencies individually now 

### Diving into the agencies with 5 or more obligations


After a few queries into the agencies, we found that many of the obligations are FTA transfers of an unspecified sort. After some digging, we found [this document](https://www.fhwa.dot.gov/federalaid/projects.pdf) containing the program codes for FTA transfers, which in this dataset are located in the `status_comment` column.
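Each agency below gets the same treatment: filter to one project and year, then count and total the funds by `status_comment`. A sketch of that pattern as a single pandas function follows. The helper name `funds_by_status_comment` is hypothetical, the data and the codes `CODE_A` and `CODE_B` are made up, and it filters on the year of `prepared_date` rather than on `prepared_y`.

```python
import pandas as pd

def funds_by_status_comment(frame, agency, project_no, year):
    """Hypothetical helper: rows and total funds per status_comment for one project-year."""
    mask = (
        frame["agency"].str.contains(agency, regex=False)
        & (frame["project_no"] == project_no)
        & (frame["prepared_date"].dt.year == year)
    )
    return (
        frame[mask]
        .groupby("status_comment")["total_requested"]
        .agg(n="count", sum_funds="sum")  # "count" skips missing amounts
        .reset_index()
    )

# Toy data; the program codes and amounts are placeholders.
toy = pd.DataFrame({
    "agency": ["Napa County"] * 3 + ["Other"],
    "project_no": ["6429(023)"] * 3 + ["0000(000)"],
    "prepared_date": pd.to_datetime(
        ["2018-02-01", "2018-03-01", "2018-04-01", "2018-05-01"]
    ),
    "status_comment": ["CODE_A", "CODE_A", "CODE_B", "CODE_A"],
    "total_requested": [100.0, 50.0, 25.0, 999.0],
})

result = funds_by_status_comment(toy, "Napa County", "6429(023)", 2018)
print(result)
```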

##### 1. City and County of San Francisco
* 6328(082) / 2016-01-01


In [None]:
df >> filter(_.agency.str.contains("City & County Of San Francisco, Mta/Parking"),
             _.project_no == "6328(082)",
             _.prepared_y =='2016-01-01')


In [None]:
# not much information here... no descriptions either
## prefix of FTA does not tell much

In [None]:
(df >> filter(_.agency.str.contains("City & County Of San Francisco, Mta/Parking"),
             _.project_no == "6328(082)",
             _.prepared_y =='2016-01-01')
    >> count(_.status_comment)
)

In [None]:
# some variation in the program codes for the type of FTA Transfer

In [None]:
(df >> filter(_.agency.str.contains("City & County Of San Francisco, Mta/Parking"),
             _.project_no == "6328(082)",
             _.prepared_y =='2016-01-01')
    >> summarize(sum_funds = _.total_requested.sum())
)

In [None]:

(df >> filter(_.agency.str.contains("City & County Of San Francisco, Mta/Parking"),
             _.project_no == "6328(082)",
             _.prepared_y =='2016-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds = _.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("City & County Of San Francisco, Mta/Parking"),
             _.project_no == "6328(082)",
             _.prepared_y =='2016-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds = _.total_requested.sum())
    >> ggplot(aes("status_comment", "sum_funds", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)

##### 2. Napa County

* 6429(023) / 2018-01-01

In [None]:
df >> filter(_.agency.str.contains("Napa County"),
             _.project_no == "6429(023)",
             _.prepared_y =='2018-01-01')

In [None]:
# again, not much information...
# we have a program code of FTASTPL but no other descriptions

In [None]:
(df >> filter(_.agency.str.contains("Napa County"),
             _.project_no == "6429(023)",
             _.prepared_y =='2018-01-01')
    >> count(_.status_comment)
)

In [None]:
(df >> filter(_.agency.str.contains("Napa County"),
             _.project_no == "6429(023)",
             _.prepared_y =='2018-01-01')
    >> summarize(sum_funds = _.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("Napa County"),
             _.project_no == "6429(023)",
             _.prepared_y =='2018-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds2 = _.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("Napa County"),
             _.project_no == "6429(023)",
             _.prepared_y =='2018-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds2 = _.total_requested.sum())
    >> ggplot(aes("status_comment", "sum_funds2", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)


#####  3. Access Services
* 6312(022) / 2016-01-01
* 6312(027) / 2019-01-01

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(022)",
             _.prepared_y =='2016-01-01')
)

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(022)",
             _.prepared_y =='2016-01-01')
    >> count(_.status_comment)
)

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(022)",
             _.prepared_y =='2016-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds=_.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(022)",
             _.prepared_y =='2016-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds=_.total_requested.sum())
    >> ggplot(aes("status_comment", "sumfunds", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(027)",
             _.prepared_y =='2019-01-01')
)

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(027)",
             _.prepared_y =='2019-01-01')
    >> count(_.status_comment)
)

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(027)",
             _.prepared_y =='2019-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds=_.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("Access Services"),
             _.project_no == "6312(027)",
             _.prepared_y =='2019-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds=_.total_requested.sum())
    >> ggplot(aes("status_comment", "sumfunds", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)

#####  4. Los Angeles County MTA 

* 6065(199) / 2015-01-01
* 6065(225) / 2018-01-01
* 6065(235) / 2019-01-01

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(199)",
             _.prepared_y =='2015-01-01')
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(199)",
             _.prepared_y =='2015-01-01')
    >> count(_.status_comment)
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(199)",
             _.prepared_y =='2015-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds3= _.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(199)",
             _.prepared_y =='2015-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds3= _.total_requested.sum())
    >> ggplot(aes("status_comment", "sum_funds3", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(225)",
             _.prepared_y =='2018-01-01')
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(225)",
             _.prepared_y =='2018-01-01')
    >> count(_.status_comment)
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(225)",
             _.prepared_y =='2018-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds = _.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(225)",
             _.prepared_y =='2018-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds = _.total_requested.sum())
    >> ggplot(aes("status_comment", "sumfunds", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(235)",
             _.prepared_y =='2019-01-01')
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(235)",
             _.prepared_y =='2019-01-01')
    >> count(_.status_comment)
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(235)",
             _.prepared_y =='2019-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds=_.total_requested.sum())
)

In [None]:
(df >> filter(_.agency.str.contains("Los Angeles County"),
             _.project_no == "6065(235)",
             _.prepared_y =='2019-01-01')
    >> group_by(_.status_comment)
    >> summarize(sumfunds=_.total_requested.sum())
    >> ggplot(aes("status_comment", "sumfunds", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)



#####  5. San Diego Metropolitan Transit System

* 7503(001)  / 2020-01-01

In [None]:
(df >> filter(_.agency.str.contains("San Diego Metropolitan Tranit System"),
             _.project_no == "7503(001)",
             _.prepared_y =='2020-01-01'))


In [None]:
(df >> filter(_.agency.str.contains("San Diego Metropolitan Tranit System"),
             _.project_no == "7503(001)",
             _.prepared_y =='2020-01-01')
    >> count(_.status_comment)
)


In [None]:
(df >> filter(_.agency.str.contains("San Diego Metropolitan Tranit System"),
             _.project_no == "7503(001)",
             _.prepared_y =='2020-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds4=_.total_requested.sum())
)

In [None]:
# three program codes under a dollar...?

In [None]:
(df >> filter(_.agency.str.contains("San Diego Metropolitan Tranit System"),
             _.project_no == "7503(001)",
             _.prepared_y =='2020-01-01')
    >> group_by(_.status_comment)
    >> summarize(sum_funds4=_.total_requested.sum())
    >> ggplot(aes("status_comment", "sum_funds4", fill="status_comment")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)

#### Trying another approach for non-FTA obligations

In [None]:
(df
    >> group_by(_.prefix, _.agency, _.project_no)
    >> count(_.prepared_y) 
    >> arrange(-_.n)
    >> filter(_.prefix.str.contains('FTA')== False)
    >> filter(_.n > 3)
)

In [None]:
(df
    >> group_by(_.prefix, _.agency, _.project_no)
    >> count(_.prepared_y) 
    >> arrange(-_.n)
    >> filter(_.prefix.str.contains('FTA')== False)
    >> filter(_.n > 3)
    >> ggplot(aes("agency", "n", fill="prefix")) + geom_col() + theme(axis_text_x = element_text(angle = 45 , hjust=1))
)

In [None]:
(df
    >> group_by(_.prefix, _.agency, _.project_no)
    >> count(_.prepared_y) 
    >> arrange(-_.n)
    >> filter(_.prefix.str.contains('FTA')== False)
    >> filter(_.n > 4)
)

In [None]:
# exploring these obligations in hopes that these have more information 

In [None]:
(df
    >> filter(_.agency == 'San Joaquin Regional Rail Commission',
             _.project_no == '6262(020)',
             _.prepared_y == '2019-01-01')
)

In [None]:
# These are also FTA Transfers, program codes are all the same

In [None]:
(df
    >> filter(_.agency == 'Compton',
             _.project_no == '5078(012)',
             _.prepared_y == '2018-01-01')
)

In [None]:
#these have different sequences, projects city-wide

In [None]:
(df
    >> filter(_.agency == 'Brawley',
             _.project_no == '5167(037)',
             _.prepared_y == '2018-01-01')
)

In [None]:
# multi-segment project, no first sequence, and the obligated amounts are near or below zero.

In [None]:
(df
    >> filter(_.agency == 'Huron',
             _.project_no == '5305(014)',
             _.prepared_y == '2014-01-01')
)

In [None]:
#another multi-segment project for road construction 

In [None]:
(df
    >> filter(_.agency == 'San Jose',
             _.project_no == '5005(163)',
             _.prepared_y == '2020-01-01')
)

In [None]:
# another double entry, possibly a refund (?) since the funds are negative.


In [None]:
(df
    >> filter(_.agency == 'Calipatria',
             _.project_no == '5243(002)',
             _.prepared_y == '2016-01-01')
)

In [None]:
# interesting to see another group of obligations with no funds attached to them.


### Filtering by agency and year and type of work

In [None]:
(df
    >> group_by(_.agency)
    >> count(_.prepared_y) 
    >> arrange(-_.n)
    >> filter(_.n >= 50)
)

In [None]:
(df
    >> group_by(_.agency, _.type_of_work, _.prepared_y)
    >> count(_.prefix) 
    >> arrange(-_.n)
    >> filter(_.n > 10)
)

#### Los Angeles Project Locations

In [None]:
df >> filter(_.agency=='Los Angeles') >> count(_.project_location) >> arrange(-_.n) >> filter(_.n>=4)

In [None]:
(df >> filter(_.agency=='Los Angeles') 
    >> filter(_.project_location.str.contains('Sixth Street Viaduct')) 
    >> count(_.project_no))

In [None]:
(df >> filter(_.agency=='Los Angeles') 
    >> filter(_.project_location.str.contains('Sixth Street Viaduct')) 
    >> count(_.prefix))

In [None]:
(df >> filter(_.agency=='Los Angeles') 
    >> filter(_.project_location.str.contains('Sixth Street Viaduct')) 
    >> group_by(_.prepared_y)
    >> count(_.project_no))

In [None]:
(df >> filter(_.agency=='Los Angeles') 
    >> filter(_.project_location.str.contains('Sixth Street Viaduct')) 
    >> filter(_.prepared_y == '2016-01-01')
    
)

## How many times do the fund request columns equal $0?

In [None]:
(df >> mutate(sum_funds = _.fed_requested + _.ac_requested) >> filter(_.sum_funds==0.00))

* 5364 rows have a net $0.00 fund obligation.
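A net-zero sum is not the same as both columns being zero, since offsetting positive and negative amounts also add up to zero. The sketch below separates the two cases on made-up data; the use of `np.isclose` to tolerate floating-point noise is an assumption, not what the query above does.

```python
import numpy as np
import pandas as pd

# Toy data: A is zero in both columns, C nets to zero, D has a missing amount.
toy = pd.DataFrame({
    "agency": ["A", "B", "C", "D"],
    "fed_requested": [0.0, 100.0, -50.0, np.nan],
    "ac_requested": [0.0, 0.0, 50.0, 0.0],
})

sum_funds = toy["fed_requested"] + toy["ac_requested"]

# Rows whose requests net to zero; a NaN sum never counts as zero here.
net_zero = toy[np.isclose(sum_funds, 0.0)]

# Rows where both request columns are exactly zero.
both_zero = toy[(toy["fed_requested"] == 0) & (toy["ac_requested"] == 0)]

print(len(net_zero), len(both_zero))
```

Comparing the two counts on the real data would show how many of the net-zero rows come from offsetting entries rather than empty requests.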

Creating a subset df

In [None]:
df_nofunds = (df >> mutate(sum_funds = _.fed_requested + _.ac_requested) >> filter(_.sum_funds==0.00))

In [None]:
print(len(df_nofunds))

In [None]:
df_nofunds>>count(_.agency)>>arrange(-_.n)

In [None]:
df_nofunds >> filter(_.agency=='Humboldt County') >> count(_.prefix)

In [None]:
df_nofunds >> count(_.mpo) >> arrange(-_.n)