# Merge v1 and v2 trip tables 

* Start with October, then repeat for November (schedule data does not appear to change much between these 2 dates).
* Merge trip tables, merging on `route_id-shape_id-trip_id-service_hours`.
* Allow an `m:m` merge at the trip level, since we won't have a way to link `itp_id-url_number` to `feed_key-name` otherwise.
* Aggregate up to the operator level to examine how those merges performed.
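
The steps above can be sketched on toy data. This is a minimal illustration, not the notebook's actual inputs: the two small frames and their values are made up, and only the column names follow the ones used later in this notebook. It shows how an outer `m:m` merge with `indicator=True` yields a crosswalk between the v1 and v2 identifiers.

```python
import pandas as pd

# Toy v1 trips, identified by (calitp_itp_id, calitp_url_number)
v1 = pd.DataFrame({
    "calitp_itp_id": [1, 1, 2],
    "calitp_url_number": [0, 0, 0],
    "trip_id": ["a", "b", "c"],
    "route_id": ["r1", "r1", "r2"],
    "shape_id": ["s1", "s1", "s2"],
    "service_hours": [1.0, 1.5, 2.0],
})

# Toy v2 trips, identified by feed_key
v2 = pd.DataFrame({
    "feed_key": ["k1", "k1", "k9"],
    "trip_id": ["a", "b", "z"],
    "route_id": ["r1", "r1", "r9"],
    "shape_id": ["s1", "s1", "s9"],
    "service_hours": [1.0, 1.5, 3.0],
})

trip_cols = ["trip_id", "route_id", "shape_id", "service_hours"]

# Outer merge on the trip-level columns, then keep one row per
# identifier pair along with how it merged
crosswalk = (
    pd.merge(v1, v2, on=trip_cols, how="outer",
             validate="m:m", indicator=True)
    [["calitp_itp_id", "calitp_url_number", "feed_key", "_merge"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

print(crosswalk)
```

Operator 1 links to `k1` with `both`, operator 2 only appears in v1 (`left_only`), and `k9` only appears in v2 (`right_only`).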

### 4 major cases to explore
#### (1): v1 and v2 merged, and there are 1 or 2 feeds present.
* These are the most clear-cut cases, where there is generally a primary feed with `url_number==0`. In general, the MTC feed is excluded and agency feeds are used.
#### (2): v1 and v2 merged, and there are more than 2 feeds present.
* Take a closer look at these to make sure that dropping precursor feeds or just doing regional subfeeds will handle it correctly.
#### (3): v1 and v2 resulted in left_only and both merge values
* Investigate why this happens. Keep the feeds that give a `both` merge rather than `left_only` (present in v1 under `itp_id-url_number` but absent from v2 under `feed_key-name`).
* This means discarding some v1 feeds (`url_numbers`).
#### (4): v1 and v2 resulted in left_only and right_only merge values
* None fit this case
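
As a rough sketch of how the four cases could be assigned from per-operator merge counts: the `classify_case` helper and the toy counts below are hypothetical and are not part of this notebook, which filters each case by hand further down.

```python
import pandas as pd

# Hypothetical per-operator counts of merge outcomes
toy_counts = pd.DataFrame({
    "calitp_itp_id": [4, 6, 127, 300, 999],
    "left_only":  [0, 0, 2, 0, 1],
    "both":       [2, 3, 1, 1, 0],
    "right_only": [0, 0, 0, 0, 1],
})

def classify_case(row: pd.Series) -> str:
    """Assign one of the 4 cases described above to an operator."""
    if row.left_only == 0 and row.right_only == 0 and row.both >= 1:
        # Everything merged: split by number of feeds
        return "case 1" if row.both <= 2 else "case 2"
    if row.left_only > 0 and row.both > 0 and row.right_only == 0:
        return "case 3"
    if row.left_only > 0 and row.right_only > 0:
        return "case 4"
    return "other"

toy_counts["case"] = toy_counts.apply(classify_case, axis=1)
print(toy_counts)
```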

### Double check aggregate feeds
* v1 still comes up with far more service hours than v2.
* Diving into each operator shows that a single `feed_key` in v2 was stored under several different `itp_ids` in v1.
* Dropping duplicates in v1 only handles *within*-operator differences, i.e., multiple feeds as indicated by multiple `url_numbers`.
* Because these feeds were stored as different `itp_ids` but are linked to the same organization in Airtable, v2 stores them under the same `feed_key`, so dropping duplicates in v2 drops a lot more.
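
One way to surface the situation described above is to count distinct `itp_ids` per `feed_key` in a crosswalk. The sketch below uses a made-up crosswalk with placeholder feed keys. It only illustrates the check and does not reproduce the notebook's results.

```python
import pandas as pd

# Toy crosswalk: feed keys here are placeholders, not real hashes
toy_crosswalk = pd.DataFrame({
    "calitp_itp_id": [142, 235, 99, 247, 247],
    "calitp_url_number": [0, 0, 0, 1, 2],
    "feed_key": ["f_shared", "f_shared", "f_single",
                 "f_two_urls", "f_two_urls"],
})

# Number of distinct v1 operators behind each v2 feed
ids_per_feed = toy_crosswalk.groupby("feed_key")["calitp_itp_id"].nunique()

# Feeds shared across operators: v1 de-duplication within itp_id
# would never collapse these, but de-duplication by feed would
shared_feeds = ids_per_feed[ids_per_feed > 1].index.tolist()
print(shared_feeds)
```

Here `f_two_urls` is not flagged, because its two rows belong to one operator with two `url_numbers`, which is the within-operator case that v1 de-duplication already handles.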


### Set up function to query feeds on a given day
As a result of the above findings, how should the `feed_keys` be queried in a way to find the organizations we want?

Allow a subset of feeds to be selected by specifying operators to **include** or **exclude**.
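
A minimal sketch of the include/exclude idea, using boolean masks with `isin`. The `filter_feeds` helper and the toy feed table are hypothetical. The function actually used in this notebook is `prep_daily_feeds`, defined further down.

```python
import pandas as pd

def filter_feeds(df: pd.DataFrame,
                 include: dict = None,
                 exclude: dict = None) -> pd.DataFrame:
    """
    Hypothetical helper. Both dicts take the form
    {"column_name": list_of_values}.
    A row is dropped if it matches ANY exclude entry, and
    (when include is given) kept only if it matches ANY include entry.
    """
    keep = pd.Series(True, index=df.index)

    if exclude:
        for col, values in exclude.items():
            keep &= ~df[col].isin(values)

    if include:
        include_mask = pd.Series(False, index=df.index)
        for col, values in include.items():
            include_mask |= df[col].isin(values)
        keep &= include_mask

    return df[keep].reset_index(drop=True)

# Toy feed table with made-up names
feeds = pd.DataFrame({
    "name": ["Feed A", "Feed B", "Regional Feed", "Feed C"],
    "regional_feed_type": [None, "Regional Subfeed",
                           "Combined Regional Feed",
                           "Regional Precursor Feed"],
})

default = filter_feeds(
    feeds,
    exclude={"name": ["Regional Feed"],
             "regional_feed_type": ["Regional Precursor Feed"]},
)
only_a = filter_feeds(feeds, include={"name": ["Feed A"]})

print(default.name.tolist())
print(only_a.name.tolist())
```

Combining the include conditions with an OR mask, rather than concatenating one subset per column, avoids returning the same feed twice when it matches more than one include entry.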

In [1]:
import pandas as pd

from shared_utils import rt_dates, geography_utils

oct_date = rt_dates.DATES["oct2022"]
GCS_FILE_PATH = "gs://calitp-analytics-data/data-analyses/gtfs_v1_v2_parity/"



In [2]:
daily_feeds = pd.read_parquet(
    f"{GCS_FILE_PATH}daily_feeds_orgs_{oct_date}.parquet")

In [3]:
def prep_v2_trip(date: str) -> pd.DataFrame:
    """
    Quick renaming to deal with _x and _y column names.
    This won't happen in the dbt tables.
    """
    df = pd.read_parquet(f"{GCS_FILE_PATH}trips_{date}_v2.parquet")
    
    # Accidentally left out columns in merge_cols,
    # so now there's _x and _y versions
    df2 = (df.rename(columns = {
                "feed_key_x": "feed_key",
                "trip_id_x": "trip_id", 
                "route_id_x": "route_id"})
          )[["feed_key", "trip_id", "route_id", 
             "shape_id", "service_hours"]]

    return df2

In [4]:
v1_id = ["calitp_itp_id", "calitp_url_number"]
v2_id = ["feed_key"]

def import_v1_v2_trips(date: str) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Import v1 and v2 data for the same day.
    Returns (v1 trips, v2 trips).
    """
    keep_cols = [
        "trip_id", "route_id", "shape_id",
        "service_hours"
    ]
    
    df1 = pd.read_parquet(f"{GCS_FILE_PATH}trips_{date}_v1.parquet")
    df1 = df1[v1_id + keep_cols]
    
    df2 = prep_v2_trip(date)[v2_id + keep_cols]

    return df1, df2

In [5]:
oct1, oct2 = import_v1_v2_trips(oct_date)

In [6]:
print(f"# rows in v1 (feed-operator) {oct1[v1_id].drop_duplicates().shape}")
print(f"# rows in v2 (feed-operator) {oct2[['feed_key']].drop_duplicates().shape}")

# rows in v1 (feed-operator) (212, 2)
# rows in v2 (feed-operator) (187, 1)


In [7]:
# Use this as a crosswalk between v1 and v2 identifiers;
# allow an m:m merge
oct_merge = pd.merge(
    oct1, 
    oct2,
    on = ["trip_id", "route_id", "shape_id", "service_hours"],
    how = "outer",
    validate = "m:m",
    indicator = True
)[v1_id + v2_id + ["_merge"]
 ].drop_duplicates().reset_index(drop=True)


In [8]:
oct_merge2 = pd.merge(
    oct_merge,
    daily_feeds[["feed_key", "name", "regional_feed_type", "is_future"]],
    on = "feed_key",
    how = "outer",
    validate = "m:1"
)

In [9]:
# Instead of a _merge variable holding values,
# change it to dummy variables and aggregate by itp_id
oct_merge2 = pd.merge(
    oct_merge2.drop(columns = "_merge"),
    pd.get_dummies(oct_merge2._merge),
    left_index=True,
    right_index=True
)

In [10]:
merge_counts = (oct_merge2.groupby("calitp_itp_id")
                .agg({"left_only": "sum",
                      "both": "sum",
                      "right_only": "sum"})
                .reset_index()
               )

merge_counts.head(2)

Unnamed: 0,calitp_itp_id,left_only,both,right_only
0,4.0,0,4,0
1,6.0,0,1,0


## Case 1: Both and 1 or 2 feeds

If 2 feeds, pick regional subfeed

In [11]:
def get_id_list(df: pd.DataFrame):
    """
    From a subset df, return itp_id as list
    """
    return df.calitp_itp_id.tolist()

def print_regional_feed(df: pd.DataFrame, id_list: list):
    """
    Explore how many of the duplicates are due to 
    various regional_feed_types. Should these get dropped ahead of time?
    """
    subset = df[df.calitp_itp_id.isin(id_list)]
    print(subset.regional_feed_type.value_counts())

In [12]:
def feeds_to_keep(df: pd.DataFrame, id_list: list):
    """
    Subset the df to select certain itp_ids and 
    also drop regional precursor feeds. 
    Keep these for now, and see if the service_hours are the same
    in v1 and v2.
    """
    subset = df[(df.calitp_itp_id.isin(id_list)) & 
                (df.regional_feed_type != "Regional Precursor Feed")
               ]
    return subset

In [13]:
def v1_v2_service_hours(v1_df: pd.DataFrame, 
                        v2_df: pd.DataFrame,
                        subset_feeds: pd.DataFrame):
    """
    For the subset of feeds to keep, check whether v1 and v2
    service hours are the same.
    """
    v1_subset = pd.merge(
        v1_df,
        subset_feeds,
        on = v1_id,
        how = "inner"
    )
    
    v2_subset = pd.merge(
        v2_df,
        subset_feeds,
        on = v2_id,
        how = "inner"
    )
    
    v1_hours = v1_subset.service_hours.sum().round(3)
    v2_hours = v2_subset.service_hours.sum().round(3)
    
    print(v1_hours, v2_hours)
    
    if v1_hours == v2_hours:
        status = "EQUAL"
    else:
        status = "CHECK ME"
    
    return status

In [14]:
# No mixed merges
# It's ok that there are more than 2 observations...looks like
# Regional Precursor Feeds are included
both_ids = merge_counts[(merge_counts.left_only==0) & 
                        (merge_counts.right_only==0)]

In [15]:
one_feed = get_id_list(both_ids[both_ids.both==1])
print_regional_feed(oct_merge2, one_feed)

Regional Subfeed           7
Regional Precursor Feed    5
Combined Regional Feed     1
Name: regional_feed_type, dtype: int64


In [16]:
part1 = feeds_to_keep(oct_merge2, one_feed)
v1_v2_service_hours(oct1, oct2, part1)

48054.132 48054.132


'EQUAL'
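
The check in `v1_v2_service_hours` compares sums rounded to 3 decimals. Two sums that are very close can still round to different values if they straddle a rounding boundary (for example, 1.00049 and 1.00051). As an alternative sketch, not what this notebook runs, the comparison could use an absolute tolerance instead. The `compare_hours` helper and the first pair of values below are hypothetical. The second pair is the ITP ID 14 result reported later in this notebook.

```python
import math

def compare_hours(v1_hours: float, v2_hours: float,
                  abs_tol: float = 0.001) -> str:
    """Return 'EQUAL' when the two sums differ by at most abs_tol."""
    if math.isclose(v1_hours, v2_hours, rel_tol=0.0, abs_tol=abs_tol):
        return "EQUAL"
    return "CHECK ME"

close_status = compare_hours(1.00049, 1.00051)
far_status = compare_hours(3161.147, 2909.452)

print(close_status, far_status)
```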

In [17]:
keep_one_feed = oct_merge2[oct_merge2.calitp_itp_id.isin(one_feed)]

In [18]:
two_feeds = get_id_list(both_ids[both_ids.both==2])
print_regional_feed(oct_merge2, two_feeds)

Regional Precursor Feed    12
Regional Subfeed           11
Combined Regional Feed      1
Name: regional_feed_type, dtype: int64


In [19]:
part2 = feeds_to_keep(oct_merge2, two_feeds)
v1_v2_service_hours(oct1, oct2, part2)

22422.045 22422.045


'EQUAL'

In [20]:
keep_two_feeds = oct_merge2[oct_merge2.calitp_itp_id.isin(two_feeds)]

## Case 2: Both and more than 2 feeds

* Drop feeds where `regional_feed_type == "Regional Precursor Feed"`, since these operators have more than 2 feeds.
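
One detail of this filter, shown on a made-up feed table: comparing with `!=` keeps rows whose `regional_feed_type` is missing, so feeds with no regional type at all are retained along with subfeeds.

```python
import numpy as np
import pandas as pd

# Toy feeds: one precursor, one subfeed, one with no regional type
toy_feeds = pd.DataFrame({
    "feed_key": ["a", "b", "c"],
    "regional_feed_type": ["Regional Precursor Feed",
                           "Regional Subfeed",
                           np.nan],
})

# NaN != "Regional Precursor Feed" evaluates to True, so "c" is kept
kept = toy_feeds[toy_feeds.regional_feed_type != "Regional Precursor Feed"]
print(kept.feed_key.tolist())
```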

In [21]:
more_than_2_feeds = get_id_list(both_ids[both_ids.both > 2])
print_regional_feed(oct_merge2, more_than_2_feeds)

Regional Precursor Feed    24
Regional Subfeed           21
Name: regional_feed_type, dtype: int64


In [22]:
part3 = feeds_to_keep(oct_merge2, more_than_2_feeds)
v1_v2_service_hours(oct1, oct2, part3)

31393.084 31393.084


'EQUAL'

## Case 3: Mixed merges: operator has left and both merges

In [23]:
left_and_both = merge_counts[(merge_counts.left_only > 0) &
                             (merge_counts.both > 0) & 
                             (merge_counts.right_only == 0)]

left_and_both_ids = get_id_list(left_and_both)
print_regional_feed(oct_merge2, left_and_both_ids)

Regional Subfeed           4
Regional Precursor Feed    2
Name: regional_feed_type, dtype: int64


In [24]:
left_and_both_keep = pd.DataFrame()

for i in left_and_both_ids:
    print(i)
    
    subset = oct_merge2[oct_merge2.calitp_itp_id == i]
    display(subset)
    
    feed_to_keep = subset[(subset.both > 0) & 
                          (subset.regional_feed_type != "Regional Precursor Feed")
                  ][v1_id + v2_id]
    display(feed_to_keep)
    
    left_and_both_keep = pd.concat([left_and_both_keep, feed_to_keep], axis=0)

14.0


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key,name,regional_feed_type,is_future,left_only,right_only,both
22,14.0,0.0,,,,,1,0,0
213,14.0,0.0,eb438c167b7f3c3b884f11ba875e345b,Anaheim Resort Schedule,,False,0,0,1


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
213,14.0,0.0,eb438c167b7f3c3b884f11ba875e345b


34.0


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key,name,regional_feed_type,is_future,left_only,right_only,both
15,34.0,0.0,,,,,1,0,0
139,34.0,0.0,72d31b44f06e4aba7790ea5aa0082c43,Beaumont Pass Schedule,,False,0,0,1


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
139,34.0,0.0,72d31b44f06e4aba7790ea5aa0082c43


127.0


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key,name,regional_feed_type,is_future,left_only,right_only,both
19,127.0,1.0,,,,,1,0,0
20,127.0,2.0,,,,,1,0,0
126,127.0,0.0,92bd23b390b7c7460a05baffd91684d6,Golden Gate Bridge Schedule,Regional Precursor Feed,False,0,0,1


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key


247.0


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key,name,regional_feed_type,is_future,left_only,right_only,both
5,247.0,2.0,e8ccd13b7c2c7ea2a0c3ae92ba2b0c82,Bay Area 511 Petaluma Schedule,Regional Subfeed,False,0,0,1
6,247.0,1.0,e8ccd13b7c2c7ea2a0c3ae92ba2b0c82,Bay Area 511 Petaluma Schedule,Regional Subfeed,False,0,0,1
14,247.0,1.0,,,,,1,0,0
89,247.0,0.0,ee533bf21b57e229c459ea70bcff7f85,Petaluma Schedule,Regional Precursor Feed,False,0,0,1


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
5,247.0,2.0,e8ccd13b7c2c7ea2a0c3ae92ba2b0c82
6,247.0,1.0,e8ccd13b7c2c7ea2a0c3ae92ba2b0c82


280.0


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key,name,regional_feed_type,is_future,left_only,right_only,both
18,280.0,0.0,,,,,1,0,0
109,280.0,1.0,131c004dfec39b02438760650ed753fe,Bay Area 511 San Francisco Bay Ferry Schedule,Regional Subfeed,False,0,0,1


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
109,280.0,1.0,131c004dfec39b02438760650ed753fe


290.0


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key,name,regional_feed_type,is_future,left_only,right_only,both
23,290.0,0.0,,,,,1,0,0
214,290.0,1.0,050dcfdc6ff44718ff6e326f3d3b9a5a,Bay Area 511 SamTrans Schedule,Regional Subfeed,False,0,0,1


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
214,290.0,1.0,050dcfdc6ff44718ff6e326f3d3b9a5a


In [25]:
keep_left_and_both = pd.merge(
    oct_merge2[v1_id + v2_id].drop_duplicates(),
    left_and_both_keep,
    on = v1_id + ["feed_key"],
    how = "inner"
)

In [26]:
check_me = []
for i in left_and_both_ids:
    print(f"ITP ID: {i}")
    status = v1_v2_service_hours(
        oct1, oct2, 
        left_and_both_keep[left_and_both_keep.calitp_itp_id==i]
     )
    print(status)
    
    if status == "CHECK ME":
        check_me.append(i)

ITP ID: 14.0
3161.147 2909.452
CHECK ME
ITP ID: 34.0
59.75 59.083
CHECK ME
ITP ID: 127.0
0.0 0.0
EQUAL
ITP ID: 247.0
106.028 106.028
EQUAL
ITP ID: 280.0
83.167 83.167
EQUAL
ITP ID: 290.0
1269.783 1269.783
EQUAL


In [27]:
check_me

[14.0, 34.0]

In [28]:
def aggregate_v1_v2(itp_id: int, feed_key: str):
    """
    Aggregate service hours and count unique shape_id, trip_id,
    and route_id for v1 and v2 for 1 operator.
    """
    v1_agg = geography_utils.aggregate_by_geography(
        oct1[oct1.calitp_itp_id == itp_id],
        group_cols = ["calitp_itp_id", "calitp_url_number"],
        sum_cols = ["service_hours"],
        nunique_cols = ["shape_id", "trip_id", "route_id"],
        rename_cols = True
    )
    
    v2_agg = geography_utils.aggregate_by_geography(
        oct2[oct2.feed_key == feed_key],
        group_cols = ["feed_key"],
        sum_cols = ["service_hours"],
        nunique_cols = ["shape_id", "trip_id", "route_id"],
        rename_cols = True
    )
    
    return v1_agg, v2_agg

In [29]:
check_id = 14
check_feed = left_and_both_keep[
    left_and_both_keep.calitp_itp_id==check_id].feed_key.iloc[0]

In [30]:
v1, v2 = aggregate_v1_v2(check_id, check_feed)
pd.concat([v1, v2], axis=0)

Unnamed: 0,calitp_itp_id,calitp_url_number,service_hours_sum,route_id_nunique,shape_id_nunique,trip_id_nunique,feed_key
0,14.0,0.0,3161.146667,13,229,14231,
0,,,2909.451944,12,227,13390,eb438c167b7f3c3b884f11ba875e345b


In [31]:
check_id = 34
check_feed = left_and_both_keep[
    left_and_both_keep.calitp_itp_id==check_id].feed_key.iloc[0]

In [32]:
v1, v2 = aggregate_v1_v2(check_id, check_feed)
pd.concat([v1, v2], axis=0)

Unnamed: 0,calitp_itp_id,calitp_url_number,service_hours_sum,route_id_nunique,shape_id_nunique,trip_id_nunique,feed_key
0,34.0,0.0,59.75,7,22,96,
0,,,59.083333,7,21,96,72d31b44f06e4aba7790ea5aa0082c43


After checking these 2 ITP IDs, 14 and 34, there appear to be slight differences in `service_hours` that come from slightly different numbers of unique `shape_id`, `route_id`, and `trip_id`. This could be because some of the erroneous observations are dealt with in v2.

These are ok.
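
If a closer look were ever needed, one option is a trip-level comparison for a single operator. The sketch below uses made-up trips and is not something this notebook runs. It merges v1 and v2 on `trip_id` and flags trips that are missing on one side or whose `shape_id` or `service_hours` differ.

```python
import pandas as pd

# Toy trips for one operator in v1 and v2 (values are made up)
v1_trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "shape_id": ["s1", "s1", "s2"],
    "service_hours": [0.5, 0.5, 0.667],
})
v2_trips = pd.DataFrame({
    "trip_id": ["t1", "t2", "t3"],
    "shape_id": ["s1", "s1", "s1"],
    "service_hours": [0.5, 0.5, 0.0],
})

compare = pd.merge(
    v1_trips, v2_trips,
    on="trip_id", how="outer",
    suffixes=("_v1", "_v2"), indicator=True,
)
compare["hours_diff"] = compare.service_hours_v1 - compare.service_hours_v2

# Flag trips missing on one side, or with different shape or hours
changed = compare[
    (compare._merge != "both")
    | (compare.hours_diff.abs() > 1e-9)
    | (compare.shape_id_v1 != compare.shape_id_v2)
]
print(changed)
```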

## Case 4: Unmerged...either left only in v1 or right only in v2

None..this might be because we 

In [33]:
unmerged = merge_counts[(merge_counts.left_only > 0) & 
                        (merge_counts.right_only > 0) 
                       ]

unmerged

Unnamed: 0,calitp_itp_id,left_only,both,right_only


## Put 4 cases together and get aggregate view of service hours in v1 and v2

In [34]:
all_cases_keep_feeds = pd.concat([
    keep_one_feed,
    keep_two_feeds,
    keep_left_and_both,
], axis=0)

all_cases_keep_feeds = (all_cases_keep_feeds[v1_id + ["feed_key"]]
                        .sort_values("calitp_itp_id")
                        .reset_index(drop=True)
                       )

In [35]:
final_v1 = pd.merge(
    oct1, 
    all_cases_keep_feeds,
    on = v1_id,
    how = "inner"
)


final_v2 = pd.merge(
    oct2, 
    all_cases_keep_feeds,
    on = v2_id,
    how = "inner"
)

print(final_v1.service_hours.sum(), final_v2.service_hours.sum())

78941.41055555556 78689.04916666668


## Daily Feeds to Organization Name

* Allow parameters to include or exclude certain feeds.
* By default, drop the precursor feeds. We would have kept unique combos of `itp_id-trip_id`, so multiple feeds would have been combined anyway.

In [36]:
def prep_daily_feeds(date: str, 
                     include: dict = "All", 
                     exclude: dict = {
                         "name": ["Bay Area 511 Regional Schedule"],
                         "regional_feed_type": ["Regional Precursor Feed"]}
                    ) -> pd.DataFrame:
    """
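    Allow dict of feeds to include OR feeds to exclude.
    include: dict or "All"
        Takes form 
        {"this_col": list_of_values_to_include}
        The default, "All", keeps every feed that is not excluded.
    exclude: dict
        Takes form
        {"this_col": list_of_values_to_exclude}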
    """
    
    daily_feeds = pd.read_parquet(
        f"{GCS_FILE_PATH}daily_feeds_orgs_{date}.parquet")
    
    daily_feeds["exclude_me"] = 0

    for col, values in exclude.items():
        daily_feeds = daily_feeds.assign(
            exclude_me = daily_feeds.apply(
                lambda x: 1 if x[col] in values 
                else x.exclude_me, axis=1)
        )
    
    daily_feeds = (daily_feeds[daily_feeds.exclude_me == 0]
                   .drop(columns = "exclude_me")
                  )
    
    if include != "All":
        keep = pd.DataFrame()
        
        for col, values in include.items():
            subset = daily_feeds[daily_feeds[col].isin(values)]
            keep = pd.concat([keep, subset], axis=0)

    
        daily_feeds = keep
    
    daily_feeds = daily_feeds.sort_values("name").reset_index(drop=True)
    
    return daily_feeds

In [37]:
t1 = prep_daily_feeds(oct_date, include = "All")

In [38]:
daily_feeds2 = pd.merge(
    all_cases_keep_feeds,
    t1,
    on = v2_id,
    how = "outer",
    validate = "m:1",
    indicator=True
)

daily_feeds2._merge.value_counts()

both          158
right_only     35
left_only      17
Name: _merge, dtype: int64

In [39]:
left_feeds = daily_feeds2[
    daily_feeds2._merge=="left_only"].feed_key.tolist()

# Make sure these are all precursor feeds
daily_feeds[daily_feeds.feed_key.isin(left_feeds)
           ].regional_feed_type.value_counts()

Regional Precursor Feed    16
Name: regional_feed_type, dtype: int64

In [40]:
daily_feeds3 = pd.merge(
    all_cases_keep_feeds,
    t1,
    on = v2_id,
    how = "inner",
    validate = "m:1",
)

In [41]:
v1_kept_feeds = pd.merge(
    oct1,
    daily_feeds3[v1_id],
    on = v1_id,
    how = "inner"
)

no_drops = v1_kept_feeds.service_hours.sum()
drop_dups = v1_kept_feeds.drop_duplicates(subset=["calitp_itp_id", "trip_id"]).service_hours.sum()

print(f"service hours across feeds: {no_drops}")
print(f"service hours drop dups by operator-trip: {drop_dups}")

service hours across feeds: 75156.05222222222
service hours drop dups by operator-trip: 75103.03833333333


In [42]:
v2_kept_feeds = pd.merge(
    oct2,
    daily_feeds3[v2_id + ["name"]],
    on = v2_id,
    how = "inner"
)

no_drops = v2_kept_feeds.service_hours.sum()
drop_dups = v2_kept_feeds.drop_duplicates(subset=["name", "trip_id", "service_hours"]).service_hours.sum()

print(f"service hours across feeds: {no_drops}")
print(f"service hours drop dups by operator-trip: {drop_dups}")

service hours across feeds: 74903.69083333334
service hours drop dups by operator-trip: 67446.37638888888


## Why is v2 dropping so many more hours compared to v1?

There were multiple `itp_ids` in the past that are now linked to the same feed, and these are ok. v1 would not have dropped them because duplicates were only checked within `itp_id-trip_id`, but now that they are tagged as the same feed in v2, dropping duplicates removes a lot more.

For ITP ID 247, which starts with 3 feeds and ends up keeping 2, dropping duplicates does handle it correctly because it is a within-operator (ITP ID) duplicate.
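
The difference between the two de-duplication keys can be seen on a small made-up table. The names and hours below are illustrative only. The two `subset` lists match the ones used in the v1 and v2 checks above.

```python
import pandas as pd

# Rows 0-1: same trip under two different itp_ids that share one
# organization name. Rows 2-3: same trip under one itp_id with two
# url_numbers.
toy_trips = pd.DataFrame({
    "calitp_itp_id": [171, 172, 247, 247],
    "calitp_url_number": [0, 0, 1, 2],
    "name": ["Shared Schedule", "Shared Schedule",
             "Two URL Schedule", "Two URL Schedule"],
    "trip_id": ["t1", "t1", "p1", "p1"],
    "service_hours": [1.0, 1.0, 2.0, 2.0],
})

# v1-style key: only collapses duplicates within an itp_id
v1_style = toy_trips.drop_duplicates(subset=["calitp_itp_id", "trip_id"])

# v2-style key: also collapses duplicates across itp_ids that
# share a name
v2_style = toy_trips.drop_duplicates(
    subset=["name", "trip_id", "service_hours"])

print(v1_style.service_hours.sum(), v2_style.service_hours.sum())
```

Both keys collapse the within-operator duplicate (the 247-like rows), but only the v2-style key also collapses the rows stored under two different `itp_ids`.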

In [43]:
v2_dups = v2_kept_feeds[v2_kept_feeds.duplicated(
    subset=["name", "trip_id", "service_hours"])
             ].sort_values(["feed_key", "trip_id"])

problem_feeds = v2_dups.feed_key.unique().tolist()

In [44]:
for test_feed in problem_feeds:
    org_name = v2_dups[v2_dups.feed_key==test_feed].name.iloc[0]
    
    subset = all_cases_keep_feeds[all_cases_keep_feeds.feed_key==test_feed]
    
    print(org_name)
    display(subset)

OCTA Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
53,142.0,0.0,05c263e0f0563d3326a15c1b640ad945
96,235.0,0.0,05c263e0f0563d3326a15c1b640ad945


Eastern Sierra Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
34,99.0,0.0,0a165d0fe19fefcb424f577091cf52d0
78,190.0,0.0,0a165d0fe19fefcb424f577091cf52d0


LADPW Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
64,171.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
65,172.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
66,173.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
67,174.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
68,176.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
69,177.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
70,178.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
71,179.0,0.0,1b548fc017e5c82205e45b89d9fdb42b
72,181.0,0.0,1b548fc017e5c82205e45b89d9fdb42b


VCTC GMV Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
94,231.0,0.0,813f7923bf17858360fecc81a53d1640
163,380.0,0.0,813f7923bf17858360fecc81a53d1640


Redding Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
28,82.0,0.0,8a19328a3d85a91d3539943af3482669
106,259.0,0.0,8a19328a3d85a91d3539943af3482669


Tehama Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
141,329.0,0.0,8aa4cefe774ac9adb6f1a6e2a5ff3909
143,334.0,0.0,8aa4cefe774ac9adb6f1a6e2a5ff3909


Humboldt Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
7,18.0,0.0,de927a514fa2ef945fa4419e5bdc61c0
15,42.0,0.0,de927a514fa2ef945fa4419e5bdc61c0
40,108.0,0.0,de927a514fa2ef945fa4419e5bdc61c0
51,135.0,0.0,de927a514fa2ef945fa4419e5bdc61c0


Bay Area 511 Petaluma Schedule


Unnamed: 0,calitp_itp_id,calitp_url_number,feed_key
102,247.0,1.0,e8ccd13b7c2c7ea2a0c3ae92ba2b0c82
103,247.0,2.0,e8ccd13b7c2c7ea2a0c3ae92ba2b0c82


In [45]:
v1_kept_feeds[v1_kept_feeds.calitp_itp_id==247].drop_duplicates(
    subset=["calitp_itp_id", "trip_id"]).calitp_url_number.value_counts()

2    230
Name: calitp_url_number, dtype: int64