# Finding another way to change locode 

Stack Overflow questions and answers that may help:
* [Replace Column Values in one Dataframe by Values of Another Dataframe](https://stackoverflow.com/questions/36413993/replace-column-values-in-one-dataframe-by-values-of-another-dataframe)
    * similar to [Replace Column Values Based on Partial String Match From Another Dataframe](https://stackoverflow.com/questions/54808130/replace-column-values-based-on-partial-string-match-from-another-dataframe-pytho)
* [Based on Partial String Match Fill one Dataframe Column from Another Dataframe](https://stackoverflow.com/questions/61811137/based-on-partial-string-match-fill-one-data-frame-column-from-another-dataframe)


In [1]:
import pandas as pd
from siuba import *

import numpy as np

from datetime import date
from IPython.display import Markdown, HTML, display_html

from calitp import *

In [3]:
from dla_utils import clean_data

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_nullagency['primary_agency_name2'] = df_nullagency['agency'].map(locode_map3)


In [4]:
pd.set_option('display.max_columns', None)

In [5]:
df = pd.read_csv('gs://calitp-analytics-data/data-analyses/dla/e-76Obligated/clean_obligated_waiting.csv', low_memory=False).drop('Unnamed: 0', axis=1)

In [6]:
df.head()

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode
0,Obligated,BPMPL,5904(121),Humboldt County,2018-12-18,2018-12-18,2018-12-18,2018-12-18,2018-12-27,0.0,0.0,0.0,Authorized,5904,1,E-76 approved on,,0.0,9.0,HBPLOCAL,14 Bridges In Humboldt County,Bridge Preventive Maintenance - Deck Joints,3,,,NON-MPO,,5904,121,True
1,Obligated,ER,32D0(008),Mendocino County,2018-12-17,2018-12-19,2018-12-20,2018-12-20,2018-12-27,11508.0,0.0,13000.0,Authorized,5910,1,E-76 approved on,1.0,1.0,7.0,,"Comptche Ukiah Road, Cr 223 Pm 17.25",Permanent Restoration,3,2018-12-17,2018-12-18,NON-MPO,,32D0,8,False
2,Obligated,ER,4820(004),Humboldt County,2018-12-07,2018-12-21,2018-12-21,2018-12-21,2018-12-27,45499.64,0.0,51394.58,Authorized,5904,1,E-76 approved on,14.0,0.0,6.0,,Mattole Rd Pm 43.17,Permanent Restoration,5,2018-12-06,2018-12-07,NON-MPO,,4820,4,False
3,Obligated,CML,5924(244),Sacramento County,2018-12-11,2018-12-11,2018-12-21,2018-12-27,2018-12-27,207002.0,0.0,247002.0,Authorized,5924,3,E-76 approved on,4.0,16.0,0.0,SAC25086,Fair Oaks Blvd. Between Howe Ave And Munroe St,Create A Smart Growth Corridor With Barrier Se...,1,2018-12-07,2018-12-07,SACOG,,5924,244,True
4,Obligated,CML,5924(214),Sacramento County,2018-12-05,2018-12-11,2018-12-21,2018-12-27,2018-12-27,0.0,5680921.0,5702041.0,Authorized,5924,3,E-76 approved on,7.0,16.0,0.0,SAC24753,Florin Rd Between Power Inn Rd. And Florin Per...,Streetscape (tc),3,2018-11-28,2018-12-04,SACOG,,5924,214,True


In [7]:
len(df>>count(_.agency))

671

In [8]:
def get_num(x):
    try:
        return int(x)
    except Exception:
        try:
            return float(x)
        except Exception:
            return x

In [9]:
df['locode'] = df['locode'].apply(get_num)

In [10]:
df['locode'] = clean_data.get_num(df['locode'])

## Read in Agency Locode Crosswalk 

In [11]:
ldf = pd.read_csv('gs://calitp-analytics-data/data-analyses/dla/e-76Obligated/agencylocode_primary_crosswalk1.csv')


In [12]:
ldf.head()

Unnamed: 0,agency_name,agency_locode,primary_agency_name,primary_agency_locode
0,Sacramento,5002,Sacramento,5002
1,Benicia,5003,Benicia,5003
2,San Diego,5004,San Diego,5004
3,San Jose,5005,San Jose,5005
4,Los Angeles,5006,Los Angeles,5006


In [13]:
#should return no values
ldf>>filter(_.agency_locode==7500)

Unnamed: 0,agency_name,agency_locode,primary_agency_name,primary_agency_locode


## Match Dataframe with the Crosswalk 

* To fix incorrect agency names

### Run 1
* Code help from top answer in [Replace Column Values in one Dataframe by Values of Another Dataframe](https://stackoverflow.com/questions/36413993/replace-column-values-in-one-dataframe-by-values-of-another-dataframe)
* fill the NaN results: [Remap values in pandas column with a dict, preserve NaNs](https://stackoverflow.com/questions/20250771/remap-values-in-pandas-column-with-a-dict-preserve-nans)
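
The second link above is about keeping a value when the lookup has no entry for it. A minimal sketch of that idea follows, using toy data rather than the real files. The names `toy_df`, `toy_crosswalk`, and `toy_map` are made up for illustration, and falling back to the original `agency` value is an assumption about the desired behavior, not something this notebook does.

```python
import pandas as pd

# Toy stand-ins for the obligated data and the crosswalk (illustrative values only).
toy_df = pd.DataFrame({
    "agency": ["Humboldt", "Mariposa", "Solano County Transit"],
    "locode": [5940, 5940, 6503],
})
toy_crosswalk = pd.DataFrame({
    "primary_agency_locode": [5940],
    "primary_agency_name": ["Mariposa County"],
})

# Build the locode -> primary name lookup from the crosswalk columns.
toy_map = dict(zip(toy_crosswalk["primary_agency_locode"],
                   toy_crosswalk["primary_agency_name"]))

# map() gives NaN for locodes missing from the crosswalk;
# fillna() then falls back to the original agency name for those rows.
toy_df["primary_agency_name"] = (
    toy_df["locode"].map(toy_map).fillna(toy_df["agency"])
)
print(toy_df)
```

Whether to fall back or leave the NaN in place depends on the goal: leaving NaN makes missing crosswalk entries easy to spot, while the fallback avoids empty names downstream.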

In [14]:
df1 = df.copy()

In [15]:
df1.agency.isna().sum()

0

In [16]:
df1.agency.nunique()

671

In [17]:
# code help from: https://towardsdatascience.com/state-name-to-state-abbreviation-crosswalks-6936250976c
locode_map = dict(zip(ldf['primary_agency_locode'], 
                          ldf['primary_agency_name']))


In [18]:
df1['primary_agency_name'] = df1['locode'].map(locode_map)


In [19]:
df1

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode,primary_agency_name
0,Obligated,BPMPL,5904(121),Humboldt County,2018-12-18,2018-12-18,2018-12-18,2018-12-18,2018-12-27,0.00,0.0,0.00,Authorized,5904,1,E-76 approved on,,0.0,9.0,HBPLOCAL,14 Bridges In Humboldt County,Bridge Preventive Maintenance - Deck Joints,3,,,NON-MPO,,5904,121,True,Humboldt County
1,Obligated,ER,32D0(008),Mendocino County,2018-12-17,2018-12-19,2018-12-20,2018-12-20,2018-12-27,11508.00,0.0,13000.00,Authorized,5910,1,E-76 approved on,1.0,1.0,7.0,,"Comptche Ukiah Road, Cr 223 Pm 17.25",Permanent Restoration,3,2018-12-17,2018-12-18,NON-MPO,,32D0,8,False,Mendocino County
2,Obligated,ER,4820(004),Humboldt County,2018-12-07,2018-12-21,2018-12-21,2018-12-21,2018-12-27,45499.64,0.0,51394.58,Authorized,5904,1,E-76 approved on,14.0,0.0,6.0,,Mattole Rd Pm 43.17,Permanent Restoration,5,2018-12-06,2018-12-07,NON-MPO,,4820,4,False,Humboldt County
3,Obligated,CML,5924(244),Sacramento County,2018-12-11,2018-12-11,2018-12-21,2018-12-27,2018-12-27,207002.00,0.0,247002.00,Authorized,5924,3,E-76 approved on,4.0,16.0,0.0,SAC25086,Fair Oaks Blvd. Between Howe Ave And Munroe St,Create A Smart Growth Corridor With Barrier Se...,1,2018-12-07,2018-12-07,SACOG,,5924,244,True,Sacramento County
4,Obligated,CML,5924(214),Sacramento County,2018-12-05,2018-12-11,2018-12-21,2018-12-27,2018-12-27,0.00,5680921.0,5702041.00,Authorized,5924,3,E-76 approved on,7.0,16.0,0.0,SAC24753,Florin Rd Between Power Inn Rd. And Florin Per...,Streetscape (tc),3,2018-11-28,2018-12-04,SACOG,,5924,214,True,Sacramento County
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20112,DISTRICT,FTACRRS,6000(069),Bay Area Rt,,,,,,0.00,0.0,0.00,prog code,6000,4,FTA transferred waiting at DISTRICT,,0.0,0.0,,FTA transfer,Bart Fare Collection Equipment,1,,,MTC,,6000,69,True,San Francisco Bay Area Rapid Transit District
20113,DISTRICT,FTASTPL,6343(006),Mctd,,,,,,0.00,0.0,0.00,prog code,6343,4,FTA transferred waiting at DISTRICT,,0.0,0.0,,FTA transfer,Bus Stops Improvement,1,,,MTC,,6343,6,True,Marin County Transit District
20114,DISTRICT,FTASTPL,6264(091),Vta,,,,,,0.00,0.0,0.00,prog code,6264,4,FTA transferred waiting at DISTRICT,,0.0,0.0,,FTA transfer,Electronic Locker Upgrade And Replacement,1,,,MTC,,6264,91,True,Santa Clara Valley Transportation Authority
20115,DISTRICT,FTASTPL,6002(030),Ala-Con Costa T,,,,,,0.00,0.0,0.00,prog code,6002,4,FTA transferred waiting at DISTRICT,,0.0,0.0,,FTA transfer,Quick Builds And Tempo Lane Delineation,1,,,MTC,,6002,30,True,Alameda - Contra Costa Transit District


In [20]:
df1.primary_agency_name.nunique()

607

* lowest unique agency count to date (12/13/21)

### Checking to see if it worked

In [21]:
#should return no values
df1>>filter(_.primary_agency_name==('Riv Co Trans Co'))

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode,primary_agency_name


In [22]:
# worked to correct the spelling typos

In [23]:
#should return three agency names for this locode
df1>>filter(_.locode==5940)>>count(_.agency)

Unnamed: 0,agency,n
0,Humboldt,1
1,Mariposa,4
2,Mariposa County,37


In [24]:
#should return Mariposa as Mariposa County
df1>>filter(_.locode==5940)>>count(_.agency, _.primary_agency_name)

Unnamed: 0,agency,primary_agency_name,n
0,Humboldt,Mariposa County,1
1,Mariposa,Mariposa County,4
2,Mariposa County,Mariposa County,37


In [25]:
df1>>filter(_.locode==5940)>>count(_.primary_agency_name)

Unnamed: 0,primary_agency_name,n
0,Mariposa County,42


In [26]:
# should be showing one Humboldt County... based on previous attempt to correct the data

In [27]:
df1>>filter(_.locode==5940)>>filter(_.agency.str.contains('Humboldt'))

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode,primary_agency_name
8186,Obligated,ER,20H0(001),Humboldt,2015-10-06,2015-10-06,2015-10-06,2015-10-06,2015-10-09,0.0,0.0,-11002.15,Authorized,5940,10,E-76 approved on E-76 approved on,0.0,0.0,3.0,,Ben Hur Road,Temporary Restoration Of Erosion,2,,,NON-MPO,,20H0,1,False,Mariposa County


In [28]:
# agency is wrong, locode is right

In [29]:
df1>>filter(_.locode==5953)>>count(_.primary_agency_name)

Unnamed: 0,primary_agency_name,n
0,Los Angeles County,424


In [30]:
df1>>filter(_.locode==5953)>>filter(_.primary_agency_name==('Pico Rivera'))

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode,primary_agency_name


In [31]:
df1>>filter(_.agency.str.contains('Los Angeles'))>>group_by(_.agency, _.primary_agency_name)>>count(_.locode)


Unnamed: 0,agency,primary_agency_name,locode,n
0,Los Angeles,Los Angeles,5006,423
1,Los Angeles,Los Angeles County,5953,12
2,Los Angeles County,Calaveras County,5930,1
3,Los Angeles County,Los Angeles County,5953,404
4,Los Angeles County,Pico Rivera,5351,3
5,Los Angeles County Metropolitan Transportation...,Los Angeles County Metropolitan Transportation...,6065,84
6,Los Angeles Unified School District,Los Angeles Unified School District,6508,2


### Comparing names

In [32]:
compare_names = np.where(df1["agency"] == df1["primary_agency_name"], True, False)
df1["compare_names"] = compare_names


In [33]:
df1.compare_names.value_counts()

True     17885
False     2232
Name: compare_names, dtype: int64

In [34]:
df1>>filter(_.compare_names==False)>>select(_.agency, _.primary_agency_name, _.locode)

Unnamed: 0,agency,primary_agency_name,locode
50,Southern California Association Of Governments,Southern California Association of Governments,6049
78,City/County Association Of Governments Of San ...,City/County Association of Governments of San ...,6419
128,"City & County Of San Francisco, Mta/Parking & ...","City & County of San Francisco, MTA/Parking & ...",6328
140,San Buena Ventura,San Buenaventura,5026
219,San Diego Association Of Governments,San Diego Association of Governments,6066
...,...,...,...
20112,Bay Area Rt,San Francisco Bay Area Rapid Transit District,6000
20113,Mctd,Marin County Transit District,6343
20114,Vta,Santa Clara Valley Transportation Authority,6264
20115,Ala-Con Costa T,Alameda - Contra Costa Transit District,6002
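
Several of the mismatches above differ only in capitalization ("Of" vs "of"). A possible refinement, sketched here on toy data, is to compare normalized names so that only substantive differences are flagged. The helper `normalize` and the column `compare_names_ci` are hypothetical names, not part of the notebook.

```python
import pandas as pd

# Toy rows modeled on the output above (illustrative only).
toy = pd.DataFrame({
    "agency": ["San Diego Association Of Governments", "Vta", "Sacramento"],
    "primary_agency_name": ["San Diego Association of Governments",
                            "Santa Clara Valley Transportation Authority",
                            "Sacramento"],
})

def normalize(names: pd.Series) -> pd.Series:
    """Strip surrounding whitespace and fold case before comparing."""
    return names.str.strip().str.casefold()

# True when the two names match after normalization.
toy["compare_names_ci"] = (
    normalize(toy["agency"]) == normalize(toy["primary_agency_name"])
)
print(toy["compare_names_ci"].tolist())  # [True, False, True]
```

This would leave abbreviations such as "Vta" flagged, which is probably what is wanted when reviewing names by hand.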


In [35]:
unmatched = df1>>filter(_.compare_names==False)>>select(_.agency, _.primary_agency_name, _.locode)

In [36]:
unmatched.primary_agency_name.unique()

array(['Southern California Association of Governments',
       'City/County Association of Governments of San Mateo County',
       'City & County of San Francisco, MTA/Parking & Traffic',
       'San Buenaventura', 'San Diego Association of Governments',
       'Kern County Council of Governments', 'Modoc County',
       'Yuba County', 'Butte County',
       'Sacramento Area Council of Governments',
       'Stanislaus Council of Governments',
       'Transportation Agency for Monterey County',
       'Coachella Valley Association of Governments',
       'San Gabriel Valley Council of Governments',
       'San Joaquin Council of Governments',
       'Department of Water Resources',
       'Yosemite Area Regional Transportation System JPA',
       'Merced County Association of Governments', 'OmniTrans',
       'University of California - Davis', 'Santa Cruz County',
       'San Luis Obispo Council of Governments',
       'Department of Parks and Recreation',
       'SouthWest Transport

In [37]:
unmatched >> group_by(_.primary_agency_name) >> summarize(n=_.agency.nunique()) >> arrange(-_.n) >>filter(_.n>1)

Unnamed: 0,primary_agency_name,n
18,Los Angeles County,4
25,Modoc County,3
34,San Bernardino County,3
11,Department of Parks and Recreation,2
21,Mariposa County,2
37,San Diego County,2
53,"U.S. Forest Service, Pacific Southwest Region",2
59,Yuba County,2
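
For readers less familiar with siuba, the same count of distinct raw agency names per crosswalk name can be sketched in plain pandas. The toy frame `toy_unmatched` is made up for illustration, and this is one possible translation of the pipeline, not the notebook's own code.

```python
import pandas as pd

# Toy version of the unmatched frame (illustrative only).
toy_unmatched = pd.DataFrame({
    "agency": ["Shasta County", "Tuolumne County", "Tuolumne County", "Humboldt"],
    "primary_agency_name": ["Yuba County", "Yuba County", "Yuba County",
                            "Mariposa County"],
})

# Count distinct raw agency names per crosswalk name, largest first.
name_counts = (
    toy_unmatched.groupby("primary_agency_name")["agency"]
    .nunique()
    .rename("n")
    .sort_values(ascending=False)
    .reset_index()
)

# Keep only crosswalk names that absorb more than one raw agency name.
multi = name_counts[name_counts["n"] > 1]
print(multi)
```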


In [38]:
#running through these matches and checking to make sure they are already documented in `issues_dla_data_locode.xlsx`
unmatched >> filter(_.primary_agency_name=='Yuba County') >> arrange(_.agency)

Unnamed: 0,agency,primary_agency_name,locode
14515,Shasta County,Yuba County,5916
15282,Shasta County,Yuba County,5916
19517,Shasta County,Yuba County,5916
273,Tuolumne County,Yuba County,5916
14100,Tuolumne County,Yuba County,5916
16358,Tuolumne County,Yuba County,5916


#### Adding to Unmatched:
* Tuolumne County/Yuba County

In [39]:
## looking at the agencies with 1 match

In [40]:
unmatched >> group_by(_.primary_agency_name) >> summarize(n=_.agency.nunique()) >> arrange(-_.n) >>filter(_.n==1)

Unnamed: 0,primary_agency_name,n
0,Alameda - Contra Costa Transit District,1
1,Butte County,1
2,Butte County Association of Governments,1
3,Calabasas,1
4,Calaveras Council of Governments,1
5,Calaveras County,1
6,"City & County of San Francisco, MTA/Parking & ...",1
7,City/County Association of Governments of San ...,1
8,Coachella Valley Association of Governments,1
9,Council of Fresno County Governments,1


In [41]:
unmatched >> filter(_.primary_agency_name=='Yreka City') >> arrange(_.agency)

Unnamed: 0,agency,primary_agency_name,locode
18987,Sonoma County,Yreka City,5020
18988,Sonoma County,Yreka City,5020
18989,Sonoma County,Yreka City,5020


#### Adding to Unmatched:
* Sonoma County/Yreka City
* Shasta County/Napa County
* Ora Co Trans Au/Morro Bay 

In [42]:
## Checking the locode recorded for Solano County Transit

In [43]:
len(df1>>filter(_.agency=='Solano County Transit')>>filter(_.locode==6503))

1

In [44]:
df1>>filter(_.agency=='Solano County Transit')>>filter(_.locode==6503)

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode,primary_agency_name,compare_names
8617,FTA Transferred,FTACML,6503(001),Solano County Transit,2015-06-02,2015-06-02,2015-06-02,2015-07-16,2015-07-23,6000000.0,0.0,6000000.0,Prog Code M0E3,6503,4,FTA transferred on 7/23/2015,0.0,44.0,7.0,,,FTA Transfer,1,,,MTC,,6503,1,True,,False
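
The empty `primary_agency_name` in the row above suggests that locode 6503 has no entry in the crosswalk. A minimal sketch for listing every locode in that situation is shown below on toy data. The names `toy_df`, `toy_crosswalk`, `missing`, and `missing_summary` are hypothetical.

```python
import pandas as pd

# Toy stand-ins for the obligated data and the crosswalk (illustrative only).
toy_df = pd.DataFrame({
    "agency": ["Sacramento", "Solano County Transit"],
    "locode": [5002, 6503],
})
toy_crosswalk = pd.DataFrame({
    "primary_agency_locode": [5002, 5003],
    "primary_agency_name": ["Sacramento", "Benicia"],
})

# Rows whose locode does not appear anywhere in the crosswalk.
missing = toy_df[~toy_df["locode"].isin(toy_crosswalk["primary_agency_locode"])]

# One line per (locode, agency) pair with the number of affected rows.
missing_summary = missing.groupby(["locode", "agency"]).size().reset_index(name="n")
print(missing_summary)
```

Running something like this against the real data would give a list of locodes to add to the crosswalk, assuming the locode itself is correct.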


## Finding agencies with string locodes

In [45]:
errors = (df[df['locode'].apply(lambda x: isinstance(x, str))])

In [46]:
print(len(errors))

9


In [47]:
errors>>count(_.locode)>>arrange(-_.n)

Unnamed: 0,locode,n
3,40A0,3
4,NBIL,3
0,32L0,1
1,38R0,1
2,38Y0,1


In [48]:
errors>>count(_.projectID)>>arrange(-_.n)

Unnamed: 0,projectID,n
3,40A0,3
4,NBIL,3
0,32L0,1
1,38R0,1
2,38Y0,1


In [49]:
errors2 = (df[df['projectID'].apply(lambda x: isinstance(x, str))])

In [50]:
print(len(errors2))

20117
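
A count equal to the full row count suggests every `projectID` was read as a string, so an `isinstance` check cannot separate numeric-looking IDs from alphanumeric ones. One alternative, sketched here on a toy series, is to coerce with `pd.to_numeric` and keep the values that fail to convert. The names `toy_locodes`, `numeric`, and `non_numeric` are made up for illustration.

```python
import pandas as pd

# Toy column mixing ints, floats, numeric strings, and alphanumeric codes.
toy_locodes = pd.Series([5904, "5910", "NBIL", "40A0", 5924.0], name="locode")

# errors="coerce" turns anything that is not a number into NaN.
numeric = pd.to_numeric(toy_locodes, errors="coerce")

# Values that were present but could not be read as numbers.
non_numeric = toy_locodes[numeric.isna() & toy_locodes.notna()]
print(non_numeric.tolist())  # ['NBIL', '40A0']
```

This works the same way whether the column is stored as mixed types or entirely as strings.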


In [51]:
errors2>>count(_.projectID)>>arrange(-_.n)

Unnamed: 0,projectID,n
39,32L0,1136
77,5006,479
125,5060,459
551,5950,276
171,5109,270
...,...,...
704,769,1
705,804,1
709,NBIS,1
710,TC03,1


In [52]:
errors2>>count(_.locode)>>arrange(-_.n)

Unnamed: 0,locode,n
428,5904,627
53,5060,459
460,5936,446
477,5953,424
4,5006,423
...,...,...
614,6503,1
621,7504,1
623,32L0,1
624,38R0,1


In [53]:
errors_total = pd.concat([errors, errors2], ignore_index=True)

In [54]:
errors_total.head()

Unnamed: 0,location,prefix,project_no,agency,prepared_date,submit__to_hq_date,hq_review_date,submit_to_fhwa_date,to_fmis_date,fed_requested,ac_requested,total_requested,status_comment,locode,dist,status,dist_processing_days,hq_processing_days,fhwa_processing_days,ftip_no,project_location,type_of_work,seq,date_request_initiated,date_completed_request,mpo,warning,projectID,projectNO,compare_id_locode
0,Obligated,BR,NBIL(537),La Quinta,2019-04-10,2019-04-10,2019-04-11,2019-04-11,2019-04-24,482489.0,0.0,545000.0,Authorized,NBIL,8,E-76 approved on,0.0,1.0,13.0,RIV121202,Dune Palms Road Over Coachella Valley Stormwat...,Replace A 3-lane Low Water Crossing With A 4-l...,1,2019-04-10,2019-04-10,SCAG,,NBIL,537,True
1,Obligated,ACSTP,40A0(038),Mendocino,2019-12-20,2019-12-20,2020-01-10,2020-01-17,2020-01-22,0.0,31039.0,35060.0,Authorized,40A0,1,E-76 approved on,2.0,28.0,5.0,,"Mountain View Road Pm 1.65, Cr 510","Excavate Slide Material, Stabilize With Rsp An...",1,2019-12-18,2019-12-20,NON-MPO,,40A0,38,True
2,Obligated,BR,NBIL(537),La Quinta,2020-11-16,2020-12-22,2021-01-12,2021-01-14,2021-01-19,684337.0,0.0,1003409.0,Authorized,NBIL,8,E-76 approved on,36.0,23.0,5.0,RIV121202,Dune Palms Road Over Coachella Valley Stormwat...,Replace Low Water Crossing With A 4-lane Bridge,2,2020-11-16,2020-11-16,SCAG,,NBIL,537,True
3,Obligated,ACSTP,38Y0(002),Los Angeles,2021-02-03,2021-02-22,2021-02-24,2021-03-09,2021-03-18,0.0,2967985.57,3330361.43,Authorized,38Y0,7,E-76 approved on,19.0,15.0,9.0,,Mulholland Highway Over Triunfo Creek,Demolition And Removal Of Burnt Bridge; Instal...,1,2021-02-03,2021-02-03,SCAG,,38Y0,2,True
4,Obligated,BR,NBIL(546),Yucaipa,2020-08-11,2021-01-19,2021-02-24,2021-02-25,2021-03-09,0.0,663975.0,750000.0,Authorized,NBIL,8,E-76 approved on,183.0,37.0,12.0,SBDLS08,Fremont Street Over Wilson Creek From Oak Glen...,Environmental Mitigation As A Component Of Nbi...,1,2020-07-20,2020-12-10,SCAG,,NBIL,546,True


In [55]:
errors_total.duplicated().sum()

9

In [56]:
errors_total.drop_duplicates(inplace=True)

In [57]:
compare_locode = np.where(errors_total["locode"] == errors_total["projectID"], True, False)
errors_total["compare_error_locodes"] = compare_locode


In [58]:
errors_total.compare_error_locodes.value_counts()

False    20108
True         9
Name: compare_error_locodes, dtype: int64
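
Only 9 rows compare equal here, even though many rows show the same value in `locode` and `projectID`. One possible explanation, not verified against the real data, is a type mismatch: an integer `locode` never equals a string `projectID`. The toy sketch below shows the effect and a string-based comparison; `toy`, `raw_equal`, and `str_equal` are hypothetical names.

```python
import pandas as pd

# Toy frame: integer locodes next to string project IDs (illustrative only).
toy = pd.DataFrame({
    "locode": [5904, 5910, "NBIL"],
    "projectID": ["5904", "32D0", "NBIL"],
})

# Direct comparison: 5904 (int) is not equal to "5904" (str).
raw_equal = toy["locode"] == toy["projectID"]

# Comparing the string forms treats 5904 and "5904" as the same ID.
str_equal = toy["locode"].astype(str) == toy["projectID"].astype(str)

print(raw_equal.tolist())  # [False, False, True]
print(str_equal.tolist())  # [True, False, True]
```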

In [59]:
errors_total >> filter(_.compare_error_locodes==True)>>count(_.agency, _.prefix)

Unnamed: 0,agency,prefix,n
0,Grass Valley,ACSTP,1
1,La Quinta,BR,2
2,Los Angeles,ACSTP,1
3,Mendocino,ACSTP,1
4,San Bernardino,ACSTP,2
5,Santa Cruz,ACSTP,1
6,Yucaipa,BR,1
