# Sanity Check

In [1]:
import datetime as dt
import geopandas as gpd
import numpy as np
import pandas as pd

import utils
import shared_utils

from calitp.tables import tbl
from siuba import *



## shapes_initial

There are still big differences in the number of `shapes_initial` observations. Dig into the sources of the differences, since the df is now expanded to include 0's even where there is no service.

In [2]:
DATA_PATH = "./data/test/"

def check_shapes_initial(data_path):
    """Count unique shape_ids per operator in shapes_initial.parquet."""
    df = gpd.read_parquet(f"{data_path}shapes_initial.parquet")
    # Count unique shape_ids for each calitp_itp_id
    df2 = df.groupby(["calitp_itp_id"]).agg({"shape_id": "nunique"}).reset_index()
    return df2

In [3]:
m1 = pd.merge(check_shapes_initial(utils.GCS_FILE_PATH), 
              check_shapes_initial(DATA_PATH), 
              on = "calitp_itp_id",
              how = "outer",
              validate = "1:1",
              indicator=True
             )

In [4]:
m1._merge.value_counts()

both          142
right_only     59
left_only       0
Name: _merge, dtype: int64

In [5]:
# That means 59 operators are in my shapes_initial, but not in Eric's?
m1[m1._merge=="right_only"].calitp_itp_id.unique()

array([  0,   1,   2,   3,   6,   7,   8,  13,  16,  48,  56,  57,  61,
        62,  76,  87,  97, 103, 106, 110, 111, 122, 127, 152, 164, 170,
       183, 194, 200, 206, 207, 208, 214, 238, 254, 256, 271, 273, 278,
       280, 289, 290, 295, 305, 312, 320, 323, 325, 338, 341, 344, 346,
       349, 350, 372, 390, 394, 474, 482])

In [6]:
in_both = m1[m1._merge=="both"]

in_both = in_both.assign(
    category = in_both.apply(lambda x: "equal" if 
                             x.shape_id_x == x.shape_id_y 
                            else "less" if x.shape_id_x < x.shape_id_y 
                            else "more" , axis=1)
)
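
The row-wise `apply` above can also be written as a vectorized comparison. The following is a minimal sketch using `np.select`, run on a toy stand-in for `in_both` (the counts are made up for illustration, not taken from the real data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `in_both`; the counts here are invented for illustration.
toy = pd.DataFrame({
    "calitp_itp_id": [10, 11, 12],
    "shape_id_x": [5.0, 3.0, 8.0],
    "shape_id_y": [5, 4, 6],
})

# Conditions are checked in order; rows matching neither fall to the default.
conditions = [
    toy.shape_id_x == toy.shape_id_y,
    toy.shape_id_x < toy.shape_id_y,
]
toy = toy.assign(
    category=np.select(conditions, ["equal", "less"], default="more")
)
print(toy)
```

As with the lambda, a row with a missing count would fail both comparisons and end up in the `"more"` category, so the two versions should agree on such rows.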

In [7]:
print(f"# shape_ids in Eric's for `both`: {in_both.shape_id_x.sum()}")
print(f"# shape_ids in Tiff's for `both`: {in_both.shape_id_y.sum()}")

# These numbers are much closer, and this is reasonable

# shape_ids in Eric's for `both`: 7216.0
# shape_ids in Tiff's for `both`: 7099


In [8]:
print(f"# shape_ids in Tiff's for `right_only`: {m1[m1._merge=='right_only'].shape_id_y.sum()}")

# shape_ids in Tiff's for `right_only`: 7162


For the operators common to both, the counts of unique `shape_id` values are in the same ballpark. They will differ because `shapes.txt` only shows the most recent shapes, so the exact same shapes cannot be extracted now as when Eric initially ran this.

Ideally, there would be a `dim_shapes` table to grab the `shape_id` for the actual date of service.

So, the 59 operators that show up only in my df are contributing the other 7k observations, and that explains the difference. But, looking at that list of ITP_IDs, some of these are not in the current `agencies.yml`, so they will have to be removed. Hopefully, only a handful of agencies are left.
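
Removing operators that are not in the current `agencies.yml` could look something like the sketch below. The list `current_itp_ids` is hypothetical (how it would be read from `agencies.yml` is not shown in this notebook), and the toy rows are made up for illustration:

```python
import pandas as pd

# Toy stand-in for the `right_only` rows of m1; values are invented.
toy_m1 = pd.DataFrame({
    "calitp_itp_id": [0, 1, 200, 482],
    "shape_id_y": [3, 12, 40, 7],
})

# Hypothetical: ITP IDs assumed to be present in the current agencies.yml.
current_itp_ids = [1, 482]

is_current = toy_m1.calitp_itp_id.isin(current_itp_ids)
keep = toy_m1[is_current]
dropped = toy_m1[~is_current]

print(f"kept operators: {keep.calitp_itp_id.tolist()}")
print(f"# shape_ids removed: {dropped.shape_id_y.sum()}")
```

Summing `shape_id_y` over the dropped rows gives a quick read on how much of the gap those operators account for.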

In [9]:
m1[m1._merge=="right_only"].calitp_itp_id.unique()

array([  0,   1,   2,   3,   6,   7,   8,  13,  16,  48,  56,  57,  61,
        62,  76,  87,  97, 103, 106, 110, 111, 122, 127, 152, 164, 170,
       183, 194, 200, 206, 207, 208, 214, 238, 254, 256, 271, 273, 278,
       280, 289, 290, 295, 305, 312, 320, 323, 325, 338, 341, 344, 346,
       349, 350, 372, 390, 394, 474, 482])

Referring to `traffic_ops/prep_data` to see how to get the latest_itp_ids from `views.gtfs_schedule_dim_feeds`. This could be put together with `views.gtfs_schedule_fact_daily_feed_files`, which allows you to grab the `feed_key` for a certain date.

But, should `calitp_id_is_in_latest == True` be used from `views.gtfs_schedule_dim_feeds`? "Latest" refers to the latest version of `agencies.yml`, which is not necessarily the version in effect on the Oct date of analysis.

The list of operators needs to be pared down somehow, but it's not clear that all 59 operators should be dropped. If those operators weren't in Eric's, they wouldn't have made it through the inner join to get the line geometry. It's possible that some of them are in mine right now but should not be, and that expanding the dataset to hold 0's for no service makes it much larger than it should be.
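
To see why extra operators matter more after the expansion, here is a minimal sketch of a zero-fill on toy data. The actual expansion logic lives in the script and is not shown here, so the `hour` and `trips` columns and the 24-hour grid are assumptions made purely for illustration (`how="cross"` needs pandas 1.2 or later):

```python
import pandas as pd

# Toy service table: operator 1 runs two shapes, operator 2 runs one.
service = pd.DataFrame({
    "calitp_itp_id": [1, 1, 2],
    "shape_id": ["a", "b", "c"],
    "hour": [8, 9, 8],
    "trips": [4, 2, 1],
})

# Every shape is paired with every hour, whether or not it has service.
shapes = service[["calitp_itp_id", "shape_id"]].drop_duplicates()
hours = pd.DataFrame({"hour": range(24)})
grid = shapes.merge(hours, how="cross")

# Left-join the observed service onto the grid and fill the gaps with 0.
expanded = grid.merge(
    service, on=["calitp_itp_id", "shape_id", "hour"], how="left"
).fillna({"trips": 0})

print(f"rows before: {len(service)}, rows after: {len(expanded)}")
```

Under these assumptions, each shape contributes one row per hour regardless of how little service it has, so every `shape_id` from an operator that should not be included adds a full set of rows to the expanded df.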


In [10]:
# Why is ITP_ID==200 back again? It got dropped in the script.
# Making routes came from agencies table...which we want to pull from service_funding
shape = pd.read_parquet(f"{DATA_PATH}shape_frequency.parquet")

In [11]:
shape2 = (shape[["calitp_itp_id"]].drop_duplicates()
          .assign(in_shape_frequency=1)
          .reset_index(drop=True)
         )

In [12]:
m2 = pd.merge(m1.rename(columns = {"_merge": "_merge1"}), 
              shape2, 
              on = "calitp_itp_id",
              how = "outer",
              validate = "1:1",
              indicator=True
             )

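This shows that 54 of the 201 operators in `m1` don't appear in `shape_frequency`. If most of these are among the 59 `right_only` operators, then once we get rid of them the operators lists should be much more similar, with only about 5 obs different. Note that the first `left_only` rows displayed have `_merge1 == "both"`, so not all 54 come from the 59.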

In [13]:
m2._merge.value_counts()

both          147
left_only      54
right_only      0
Name: _merge, dtype: int64
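
The counts above do not say which of the earlier merge groups the 54 `left_only` operators belong to. Cross-tabulating the two indicator columns would show this. A minimal sketch on toy data (the rows are made up, only the column names match `m2`):

```python
import pandas as pd

# Toy stand-in for m2 with both merge indicators; rows are invented.
toy_m2 = pd.DataFrame({
    "calitp_itp_id": [35, 36, 0, 1, 2],
    "_merge1": ["both", "both", "right_only", "right_only", "both"],
    "_merge": ["left_only", "both", "left_only", "left_only", "both"],
})

# Rows: result of the first merge. Columns: result of the second merge.
counts = pd.crosstab(toy_m2["_merge1"], toy_m2["_merge"])
print(counts)
```

On the real `m2`, the `right_only` row of this table would give the number of the 59 operators that are missing from `shape_frequency`.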

In [14]:
m2[m2._merge=="left_only"]

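| | calitp_itp_id | shape_id_x | shape_id_y | _merge1 | in_shape_frequency | _merge |
|---|---|---|---|---|---|---|
| 14 | 35 | 1.0 | 1 | both | NaN | left_only |
| 15 | 36 | 2.0 | 2 | both | NaN | left_only |
| 16 | 37 | 4.0 | 4 | both | NaN | left_only |
| 49 | 137 | 1.0 | 1 | both | NaN | left_only |
| 50 | 142 | 168.0 | 168 | both | NaN | left_only |
| 53 | 154 | 13.0 | 19 | both | NaN | left_only |
| 56 | 167 | 196.0 | 124 | both | NaN | left_only |
| 72 | 192 | 5.0 | 4 | both | NaN | left_only |
| 89 | 235 | 168.0 | 168 | both | NaN | left_only |
| 101 | 265 | 1.0 | 1 | both | NaN | left_only |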
