# NTD 2021 vs 2022

* Explore where and how much `ntd_id` changed between the 2021 and 2022 exports.
* In BigQuery, pull the 2021 and 2022 records from `mart_ntd` and export them as a CSV.
* Passing the data through a series of merges helps winnow down which records we need to reconcile manually.
* Go from more stringent merges (ids and names) to looser merges.
   * Parsing the `ntd_id` and grabbing the suffix portion can help, since for a good batch of records the only change is a prefix. In the merge outputs below, the 2021 id carries a prefix that the 2022 id drops, so `ntd_id_2021 = [xxxx-ntd_id_2022]`.
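
The stringent-to-loose progression in the list above can be sketched as a small loop. This is a minimal illustration on made-up rows, not the code this notebook runs: the helper name `cascade_merge`, its parameters, and the toy frames are all assumptions introduced here.

```python
import pandas as pd


def cascade_merge(left, right, key_sets, id_col="key"):
    """Hypothetical helper: try each set of merge columns in order.
    Rows matched at one step are dropped before the next, looser step."""
    matched = []
    for step, keys in enumerate(key_sets, start=1):
        m = pd.merge(
            left[keys + [id_col]],
            right[keys + [id_col]],
            on=keys,
            how="inner",
            suffixes=("_2021", "_2022"),
        )
        # Record which pair of row keys matched, and at which step
        matched.append(
            m[[f"{id_col}_2021", f"{id_col}_2022"]].assign(step=step)
        )
        left = left[~left[id_col].isin(m[f"{id_col}_2021"])]
        right = right[~right[id_col].isin(m[f"{id_col}_2022"])]
    return pd.concat(matched, ignore_index=True), left, right


# Toy rows (invented): some ids differ only by a prefix
toy_2021 = pd.DataFrame({
    "key": ["a1", "a2", "a3"],
    "ntd_id": ["10001", "7R01-70242", "5R02-50468"],
    "agency_name": ["Agency A", "Agency B", "Agency C Commissioners"],
})
toy_2022 = pd.DataFrame({
    "key": ["b1", "b2", "b3"],
    "ntd_id": ["10001", "70242", "50468"],
    "agency_name": ["Agency A", "Agency B", "Agency C Council on Aging"],
})

pairs, left_rest, right_rest = cascade_merge(
    toy_2021,
    toy_2022,
    key_sets=[["ntd_id", "agency_name"], ["agency_name"]],
)
```

`pairs` records which step matched each pair of rows, and `left_rest` / `right_rest` hold whatever is still unmatched and would need a looser merge or manual review.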

In [1]:
import numpy as np
import pandas as pd

#GCS_BUCKET = "gs://calitp-ntd-data-products"
#GCS_PATH = (f"{GCS_BUCKET}/annual-database-agency-information/"
#            "dt=2023-11-15/ts=2023-11-15T22:29:51.925030+00:00/year=2022/"
#            "annual-database-agency-information.jsonl.gz"
#           )

LOCAL_PATH = "ntd_2021_2022.csv"

In [2]:
df_full = pd.read_csv(LOCAL_PATH)

df_2021 = df_full[df_full.year==2021].reset_index(drop=True)
df_2022 = df_full[df_full.year==2022].reset_index(drop=True)

In [3]:
def basic_stats(df: pd.DataFrame): 
    cols = ["ntd_id", "legacy_ntd_id", 
            "reported_by_name", 
            "agency_name",
            "city"
           ]
    for c in cols:
        print(f"nunique {c}: {df[c].nunique()}")

In [4]:
basic_stats(df_2021)

nunique ntd_id: 3021
nunique legacy_ntd_id: 2110
nunique reported_by_name: 64
nunique agency_name: 2929
nunique city: 1974


In [5]:
basic_stats(df_2022)

nunique ntd_id: 2969
nunique legacy_ntd_id: 2092
nunique reported_by_name: 60
nunique agency_name: 2924
nunique city: 1964


## Full set of merge columns

* `ntd_id, legacy_ntd_id, agency_name, reported_by_name, city`

This is probably the most complete set of identifiers.

In [6]:
cols = ["ntd_id", "legacy_ntd_id", 
        "reported_by_name", 
        "agency_name",
        "city", 
       ]

m1 = pd.merge(
    df_2021[cols + ["key", "year"]],
    df_2022[cols + ["key", "year"]],
    on = cols,
    how = "outer",
    indicator = True
)

m1._merge.value_counts()

left_only     1891
right_only    1887
both          1130
Name: _merge, dtype: int64

In [7]:
m1._merge.value_counts(normalize=True)

left_only     0.385289
right_only    0.384474
both          0.230236
Name: _merge, dtype: float64

### Majority will merge if we don't merge on `ntd_id`, but use `legacy_ntd_id` and variations of name instead

Most of these mismatches would resolve if we merged on `agency_name` and `reported_by_name` alone. Even though `ntd_id` is not necessarily the same across years, `legacy_ntd_id` appears to be (where "same" includes being NaN in both years).
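
The NaN case matters because pandas matches missing merge keys to each other (unlike a typical SQL join), so rows with no `legacy_ntd_id` in either year can still pair up on the remaining columns. A small sketch, using two agencies from the mismatch listing in this notebook; the frame names are made up for the example:

```python
import numpy as np
import pandas as pd

# Two agencies from this notebook's output; Zapata County has no legacy id
toy_2021 = pd.DataFrame({
    "legacy_ntd_id": [np.nan, "9R02-019"],
    "agency_name": [
        "Zapata County",
        "Yosemite Area Regional Transportation System",
    ],
    "ntd_id": ["6R05-66320", "9R02-91070"],
})
toy_2022 = pd.DataFrame({
    "legacy_ntd_id": [np.nan, "9R02-019"],
    "agency_name": [
        "Zapata County",
        "Yosemite Area Regional Transportation System",
    ],
    "ntd_id": ["66320", "91070"],
})

# ntd_id is left out of the merge keys, so it comes back as ntd_id_x / ntd_id_y
nan_demo = pd.merge(
    toy_2021,
    toy_2022,
    on=["legacy_ntd_id", "agency_name"],
    how="outer",
    indicator=True,
)
```

Both rows come back with `_merge == "both"`, including the one whose `legacy_ntd_id` is NaN on each side.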

Exclude `city` from the merge, since some agencies change cities but are still the same agency. We do still want to know when the city changes from year to year.
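
One way to keep track of city changes while leaving `city` out of the merge keys is to carry both years' values through and compare them afterwards. A minimal sketch on invented rows; the `city_changed` column name is an assumption, not something in the source data:

```python
import pandas as pd

# Invented rows: the first agency moves, the second does not
toy_2021 = pd.DataFrame({
    "legacy_ntd_id": ["3R03-010", "4R03-115"],
    "agency_name": ["Agency A", "Agency B"],
    "city": ["Baltimore", "Newnan"],
})
toy_2022 = pd.DataFrame({
    "legacy_ntd_id": ["3R03-010", "4R03-115"],
    "agency_name": ["Agency A", "Agency B"],
    "city": ["Towson", "Newnan"],
})

merged = pd.merge(
    toy_2021,
    toy_2022,
    on=["legacy_ntd_id", "agency_name"],  # city deliberately excluded
    how="inner",
    suffixes=("_2021", "_2022"),
)

# NaN != NaN evaluates to True, so treat "missing in both years" as unchanged
both_missing = merged["city_2021"].isna() & merged["city_2022"].isna()
merged = merged.assign(
    city_changed=(merged["city_2021"] != merged["city_2022"]) & ~both_missing
)
```

Rows with `city_changed == True` can then be reviewed separately from the merge itself.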

In [8]:
m1[m1._merge != "both"].sort_values("agency_name")

Unnamed: 0,ntd_id,legacy_ntd_id,reported_by_name,agency_name,city,key_x,year_x,key_y,year_y,_merge
3614,70242,7R01-015,Iowa Department of Transportation,10-15 Regional Transit Agency,Ottumwa,,,49ba215d3793e5a58366b7dde1cb587e,2022.0,right_only
2002,7R01-70242,7R01-015,Iowa Department of Transportation,10-15 Regional Transit Agency,Ottumwa,4400da6a95a6a328ab9b7f651a1bdeb4,2021.0,,,left_only
963,A0002-55329,,Stark Area Regional Transit Authority,"ABCD, Inc.",Canton,463265c2d666621a7757630e5548fc3b,2021.0,,,left_only
4536,55329,,Stark Area Regional Transit Authority,"ABCD, Inc.",Canton,,,16b0e8ca1108b744db0f7952d8f6b381,2022.0,right_only
3211,88285,,Colorado Department of Transportation,AEX - Alpine Express,Gunnison,,,00fc807db5fbd510c447e71e21069b19,2022.0,right_only
...,...,...,...,...,...,...,...,...,...,...
2182,9R02-91070,9R02-019,California Department of Transportation,Yosemite Area Regional Transportation System,Merced,f68cf9996e537ff2968339f4508dd60d,2021.0,,,left_only
4269,91070,9R02-019,California Department of Transportation,Yosemite Area Regional Transportation System,Merced,,,1e8e914f03d459189a27bcdc2fc32106,2022.0,right_only
4422,66320,,Texas Department of Transportation,Zapata County,Zapata,,,e19ba955b67fff1dc3fd906aad5ab873,2022.0,right_only
1390,6R05-66320,,Texas Department of Transportation,Zapata County,Zapata,dc5f81697aabb1d127a4c03225a6cdde,2021.0,,,left_only


In [9]:
# Collect the keys of rows that already merged in both years,
# so they can be excluded from the next, looser merge
ok_keys = np.concatenate((
    m1[m1._merge=="both"].key_x.unique(),
    m1[m1._merge=="both"].key_y.unique()
))

### Merge on `legacy_ntd_id`, variations of name
These probably need to be manually addressed using a crosswalk, since we want to store variations of the `agency_name` over time.
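
As a rough idea of what such a crosswalk could look like, the sketch below reshapes matched pairs into one row per id, year, and name. The layout is an assumption, as is using `legacy_ntd_id` as the stable identifier (many rows have none); the two agencies are taken from this notebook's outputs.

```python
import pandas as pd

# Matched pairs in wide form: one row per agency, one name column per year
matched = pd.DataFrame({
    "legacy_ntd_id": ["5R02-039", "5R02-011"],
    "agency_name_2021": [
        "Whitley County Commissioners",
        "Boone County Commissioners",
    ],
    "agency_name_2022": [
        "Whitley County Council on Aging",
        "Boone County Senior Services",
    ],
})

# Long form: one row per (id, year), so each name variation is kept
crosswalk = (
    matched
    .melt(id_vars="legacy_ntd_id", var_name="year", value_name="agency_name")
    .assign(
        year=lambda d: d["year"]
        .str.replace("agency_name_", "", regex=False)
        .astype(int)
    )
    .drop_duplicates()
    .sort_values(["legacy_ntd_id", "year"])
    .reset_index(drop=True)
)
```

A long table like this can absorb further years without adding columns.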

In [10]:
# Remove city from merge, since there are a couple
# that would merge but are set to diff cities
m2 = pd.merge(
    df_2021[~df_2021.key.isin(ok_keys)][cols + ["key", "year"]],
    df_2022[~df_2022.key.isin(ok_keys)][cols + ["key", "year"]],
    on = ["legacy_ntd_id", "reported_by_name", "agency_name"],
    how = "outer",
    indicator = True
)

m2._merge.value_counts()

both          1789
left_only      102
right_only      98
Name: _merge, dtype: int64

Some of these are clearly the same agency when you spot check them (abbreviations, minor name changes, etc.), but some are less obvious. We might have to start compiling a larger crosswalk of agency name variations.

These two, for example, would be grouped together (the `ntd_id` differs only by the prefix on the 2021 id, and `legacy_ntd_id` is unchanged):
* Whitley County Commissioners
* Whitley County Council on Aging
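
To speed up the spot check, candidate pairs could be suggested with a string similarity score and then confirmed by hand. The sketch below uses `difflib` from the Python standard library, which is not something this notebook uses; the helper names and the `cutoff` value are assumptions, and the output is a list of candidates to review, not confirmed matches.

```python
import difflib


def normalize(name: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def suggest_matches(left_names, right_names, cutoff=0.6):
    """Hypothetical helper: map each left name to its closest right name,
    or None if nothing scores at or above `cutoff`."""
    lookup = {normalize(n): n for n in right_names}
    suggestions = {}
    for name in left_names:
        close = difflib.get_close_matches(
            normalize(name), list(lookup), n=1, cutoff=cutoff
        )
        suggestions[name] = lookup[close[0]] if close else None
    return suggestions


# Names taken from the unmatched rows in this notebook
names_2021 = [
    "CENTRAL MISSISSIPPI  INC",
    "Whitley County Commissioners",
    "Bay State Cruise Company",
]
names_2022 = [
    "Central Mississippi, Incorporated",
    "Whitley County Council on Aging",
    "Bay State LLC",
]

candidates = suggest_matches(names_2021, names_2022)
```

Names that change substantially, such as a county commission handing service over to a council on aging, may score low, so a `None` here does not rule out a match.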

In [11]:
m2[m2._merge != "both"].sort_values(["agency_name"])

Unnamed: 0,ntd_id_x,legacy_ntd_id,reported_by_name,agency_name,city_x,key_x,year_x,ntd_id_y,city_y,key_y,year_y,_merge
1342,6R02-66284,,Louisiana Department of Transportation,Acadia COA,Crowley,5127e6c6b2c3c3018cc2629cfa94f74e,2021.0,,,,,left_only
1692,A0004-00415,,Valley Regional Transit,Ada County Highway District,Boise,8b1ef6431abe5e9a5eb5c0a1124477d9,2021.0,,,,,left_only
1928,,5R02-020,Indiana Department of Transportation,Area 10 Council on Aging of Monroe County,,,,50308,Ellettsville,00d4f9e499a70f66a8a72c0f4c42d408,2022.0,right_only
1971,,5R02-017,Indiana Department of Transportation,Area IV Agency on Aging and Community Action P...,,,,50365,Lafayette,a561a8b8df6a37f2dd4f141e4902d346,2022.0,right_only
1937,,,Arizona Department of Transportation,Assist to Independence,,,,99466,Tuba City,42b9d82b9be91d4b150f55f9e88a809f,2022.0,right_only
...,...,...,...,...,...,...,...,...,...,...,...,...
1577,5R02-50468,5R02-039,Indiana Department of Transportation,Whitley County Commissioners,Columbia City,518ace07cb87072ec541dc7fef24e6f7,2021.0,,,,,left_only
1973,,5R02-039,Indiana Department of Transportation,Whitley County Council on Aging,,,,50468,Columbia City,0b058ae9f5d5a1e9780257c981a6b91c,2022.0,right_only
1686,6R02-66299,,Louisiana Department of Transportation,Winn COA,Winnfield,57e123ed53fdf3ce74769f216cc2c39d,2021.0,,,,,left_only
1933,,5R02-024,Indiana Department of Transportation,YMCA of Vincennes,,,,50392,Vincennes,13d170d217ad0cf7860e62ce13651004,2022.0,right_only


In [12]:
m2[m2._merge != "both"].sort_values(["agency_name"]).agency_name.unique()

array(['Acadia COA', 'Ada County Highway District',
       'Area 10 Council on Aging of Monroe County',
       'Area IV Agency on Aging and Community Action Programs ',
       'Assist to Independence', 'Atlanta-Region Transit Link Authority',
       'Autonomous Municipality of Vega Alta', 'Bacon County',
       'Baltimore County Department of Aging',
       'Baltimore County Department of Public Works Transportation',
       'Bay State Cruise Company', 'Bay State LLC', 'Berrien County',
       'Blue River Services ', 'Boone County Commissioners',
       'Boone County Senior Services', 'Boulder, City of',
       'Brantley County', 'Brooks County Transit',
       'Brown County Senior Citizens Council', 'Brown County YMCA',
       'Buckeye Community Services', 'CENTRAL MISSISSIPPI  INC',
       'CHANDLER, CITY OF', 'Cache Employment & Training Center (CETC)',
       'Calcasieu Voluntary Council in Aging', 'Cardinal Services ',
       'Cass County Commissioners', 'Cass County Council on Ag

In [13]:
# Add the keys matched in m2 to the running list of keys
# to exclude from the next merge
ok_keys2 = np.concatenate((
    ok_keys,
    m2[m2._merge == "both"].key_x.unique(),
    m2[m2._merge == "both"].key_y.unique()
))

In [14]:
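def ntd_id_parsed(df: pd.DataFrame) -> pd.DataFrame:
    """
    Add `ntd_id_no_prefix`: the segment of `ntd_id` after the first hyphen
    (e.g. "7R01-70242" becomes "70242"). Ids without a hyphen are kept as-is.
    """
    df = df.assign(
        ntd_id_no_prefix = df.apply(
            lambda x:
            x.ntd_id.split("-")[1] if "-" in x.ntd_id
            else x.ntd_id,
            axis=1)
    )

    return df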

In [15]:
m3 = pd.merge(
    df_2021[~df_2021.key.isin(ok_keys2)][cols + ["key", "year"]].pipe(ntd_id_parsed),
    df_2022[~df_2022.key.isin(ok_keys2)][cols + ["key", "year"]].pipe(ntd_id_parsed),
    on = ["ntd_id_no_prefix", "legacy_ntd_id",],
    how = "outer",
    indicator = True
)

m3._merge.value_counts()

left_only     61
right_only    54
both          41
Name: _merge, dtype: int64

### Parsing `ntd_id` into a no-prefix version can help 

If we are going to remove the prefix anyway, we could do it earlier and hopefully get more rows to merge. That still leaves variations in `agency_name` and `reported_by_name`, which need to make it into our crosswalk even if we do not merge on them.

That leaves about 60 rows per year (61 from 2021, 54 from 2022) to reconcile manually.
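
A sketch of what stripping the prefix up front might look like, using a vectorized string operation and three agencies from this notebook's outputs. The helper name `strip_ntd_prefix` and the `name_changed` column are assumptions. Note that the regex drops everything up to the first hyphen, which differs from `split("-")[1]` only if an id contains more than one hyphen.

```python
import pandas as pd


def strip_ntd_prefix(ids: pd.Series) -> pd.Series:
    """Hypothetical helper: drop everything up to and including the first
    hyphen. Ids without a hyphen, and missing ids, pass through unchanged."""
    return ids.str.replace(r"^[^-]*-", "", regex=True)


toy_2021 = pd.DataFrame({
    "ntd_id": ["7R01-70242", "A0002-55329", "11238"],
    "agency_name": [
        "10-15 Regional Transit Agency",
        "ABCD, Inc.",
        "Bay State Cruise Company",
    ],
})
toy_2022 = pd.DataFrame({
    "ntd_id": ["70242", "55329", "11238"],
    "agency_name": [
        "10-15 Regional Transit Agency",
        "ABCD, Inc.",
        "Bay State LLC",
    ],
})

early = pd.merge(
    toy_2021.assign(ntd_id_no_prefix=strip_ntd_prefix(toy_2021["ntd_id"])),
    toy_2022.assign(ntd_id_no_prefix=strip_ntd_prefix(toy_2022["ntd_id"])),
    on="ntd_id_no_prefix",
    how="outer",
    suffixes=("_2021", "_2022"),
    indicator=True,
)

# Name variations still need to be captured for the crosswalk
early = early.assign(
    name_changed=early["agency_name_2021"] != early["agency_name_2022"]
)
```

All three toy rows merge on the stripped id, and `name_changed` flags the one whose name differs between years.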

In [16]:
m3[m3._merge=="both"].sort_values("ntd_id_no_prefix")

Unnamed: 0,ntd_id_x,legacy_ntd_id,reported_by_name_x,agency_name_x,city_x,key_x,year_x,ntd_id_no_prefix,ntd_id_y,reported_by_name_y,agency_name_y,city_y,key_y,year_y,_merge
61,11238,,,Bay State Cruise Company,,3e52a7c9c6b2b668f3f8d8aaf7407bb1,2021.0,11238,11238,,Bay State LLC,Boston,0d5ba39814c8d079274b8a9f0cc57094,2022.0,both
56,3R03-30130,3R03-010,Maryland Department of Transportation,Baltimore County Department of Aging,Baltimore,a78720414cc2f8110bc995522b13af5f,2021.0,30130,30130,Maryland Department of Transportation,Baltimore County Department of Public Works Tr...,Towson,82625c40cb5f233003bd99bf662c0f8c,2022.0,both
24,40105,4105,,Puerto Rico Highway and Transportation Authori...,San Juan,15dfe7c9516dd80cfcbd7e44894c3cd0,2021.0,40105,40105,,Puerto Rico Highway and Transportation Authori...,San Juan,671fb05927de6f93c0b7fbc231ba0c3f,2022.0,both
30,40269,,,Municipality of Anasco,Anasco,727e463669495f24983f88e0c1ebc84e,2021.0,40269,40269,,Municipality of Añasco,Anasco,bf314c9d218b6b33879aca0e15a0aec6,2022.0,both
63,4R03-41133,4R03-115,Georgia Department of Transportation,Coweta County,Newnan,a2b14dc5773bf3c0ad04dfe1757d3be6,2021.0,41133,41133,,Coweta County,Newnan,dd0a411799e522f52608bb1977d5a149,2022.0,both
2,4R05-44979,,Mississippi Department of Transportation,CENTRAL MISSISSIPPI INC,Winona,a526f2abfd1780150ba2b794124d7343,2021.0,44979,44979,Mississippi Department of Transportation,"Central Mississippi, Incorporated",Winona,b1c19267d796b92f9da7b0491ac539e7,2022.0,both
40,50193,5193,,Enterprise Rideshare - Michigan,Farmington Hills,fb118835888478e280646363880222b4,2021.0,50193,50193,,Michigan Department of Transportation,,5db618a2d5318c4dff590a53f339688a,2022.0,both
101,5R02-50230,5R02-011,Indiana Department of Transportation,Boone County Commissioners,Lebanon,75778eabc386b8eafba4c9ecdba3400e,2021.0,50230,50230,Indiana Department of Transportation,Boone County Senior Services,Lebanon,51057c0cb82672b5f5c59afd1e9217ef,2022.0,both
25,5R02-50246,5R02-012,Indiana Department of Transportation,Marshall County Commissioners,Plymouth,77570092c7163f61fcf99dd2fbaa8d16,2021.0,50246,50246,Indiana Department of Transportation,Marshall County Council on Aging,Plymouth,67f53fd56b632e5a1edbcbc9beb819cb,2022.0,both
49,5R02-50248,5R02-043,Indiana Department of Transportation,Steuben County Commissioners,Angola,09864a8b4fdbf162b3d9a0434a63efa9,2021.0,50248,50248,Indiana Department of Transportation,Steuben County Council on Aging,Angola,68c0951ae10e643259fbad16a68b0501,2022.0,both
