# Understand aggregations for operator/highway stats

Is the `pct_parallel` metric measuring what we want?

In [1]:
import geopandas as gpd
import pandas as pd

from shared_utils import geography_utils

DATA_PATH = "./data/"



In [2]:
gdf = gpd.read_parquet(f"{DATA_PATH}parallel_or_intersecting.parquet")

## Aggregate to highway or operator

Display these stats alongside an interactive map.

### Operator stats
1. Unique observation is id-route-hwy.

* But the % parallel coming from this seems artificially low.

2. Unique observation is id-route.
* Flag a route as parallel if it is parallel to at least 1 hwy (fine if it is parallel to many hwys).
* This gives a better understanding of which of LA Metro's 120 routes are considered parallel, which of those are viable, and along which hwys.
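The two operator-level options can be compared on a small made-up table. This is a minimal sketch: the rows, the operator id, and the names `pct_option1` and `pct_option2` are illustrative assumptions, not values from the notebook's data.

```python
import pandas as pd

# Toy rows at the id-route-hwy level (illustrative values only).
toy = pd.DataFrame({
    "itp_id":   [1, 1, 1, 1, 1],
    "route_id": ["A", "A", "A", "B", "B"],
    "Route":    [10, 110, 405, 10, 101],  # highway number
    "parallel": [0, 1, 0, 0, 0],
})

# Option 1: every route-hwy pair is an observation.
pct_option1 = toy["parallel"].sum() / len(toy)

# Option 2: collapse to id-route and flag a route as parallel
# if it is parallel to at least 1 hwy.
by_route = (
    toy.groupby(["itp_id", "route_id"])["parallel"]
    .max()
    .rename("atleast1_parallel")
    .reset_index()
)
pct_option2 = by_route["atleast1_parallel"].sum() / len(by_route)

print(pct_option1, pct_option2)
```

In this toy case, option 1 divides 1 parallel pair by 5 route-hwy pairs, while option 2 divides 1 parallel route by 2 routes, which is why option 1 can look artificially low.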

### Highway Stats

1. Unique observation is hwy-county-route. 

* From this, find % parallel of all the possible route-hwy intersections. 
* This will potentially suffer from artificially low % parallel too?

2. Unique observation is hwy-county.

* Before aggregation, flag each unique id-route and whether that route is parallel to this hwy at some point. If a route has multiple rows for the hwy, this also uses the "at least 1 parallel" dummy method.
* Now, aggregate to hwy-county. In LA, LA Metro should contribute only 120 routes to this denominator, and likewise for every other LA operator. LA Metro's routes should not add up into the 600s!
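The second highway-level option could look roughly like the sketch below. The toy rows and the column names `n_parallel` and `n_routes` are assumptions made for illustration; only `Route`, `County`, `itp_id`, `route_id`, and `parallel` mirror columns in the notebook.

```python
import pandas as pd

# Toy rows: route "B" has two rows against hwy 110 (illustrative values only).
toy = pd.DataFrame({
    "itp_id":   [1, 1, 1, 1],
    "route_id": ["A", "B", "B", "C"],
    "Route":    [110, 110, 110, 110],
    "County":   ["LA"] * 4,
    "parallel": [0, 1, 0, 0],
})
hwy_cols = ["Route", "County"]

# Step 1: one row per hwy-county-operator-route; the route is parallel
# to this hwy if any of its rows is parallel.
by_hwy_route = (
    toy.groupby(hwy_cols + ["itp_id", "route_id"])["parallel"]
    .max()
    .reset_index()
)

# Step 2: aggregate to hwy-county, so each route enters the denominator once.
hwy_stats = (
    by_hwy_route.groupby(hwy_cols)
    .agg(n_parallel=("parallel", "sum"), n_routes=("route_id", "count"))
    .reset_index()
)
hwy_stats["pct_parallel"] = hwy_stats["n_parallel"] / hwy_stats["n_routes"]

print(hwy_stats)
```

Collapsing before aggregating is what keeps a route with several rows from inflating the denominator.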


Play with a couple of subsets to find the right unit of analysis.

In [3]:
operator_group_cols = ["itp_id", "County"]

# obs is itp_id-county-route_id
# for the itp_id-county (operators can operate across county boundaries)
# % parallel = (# routes-hwy that are parallel / # routes-hwy combination)

part1 = (
    geography_utils.aggregate_by_geography(
        gdf, 
        group_cols = operator_group_cols,
        sum_cols = ["parallel"],
        count_cols = ["Route"],
        nunique_cols = ["route_id"],
    ).rename(columns = {
        "parallel": "sum_parallel",
        "Route": "count_Route",
        "route_id": "unique_route_id",
    })
)


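# Flag each route as parallel if it is parallel to at least 1 hwy,
# then keep one row per itp_id-county-route_id.
part2 = (
    gdf.assign(
        atleast1_parallel = (
            gdf.groupby(operator_group_cols + ["route_id"])["parallel"]
            .transform("max")
        ),
    )[operator_group_cols + ["route_id", "atleast1_parallel"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Count the flagged routes for each itp_id-county.
part2 = (
    part2.groupby(operator_group_cols)
    .agg({"atleast1_parallel": "sum"})
    .reset_index()
)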
         
operator_stats = pd.merge(part1, part2,
                          on = operator_group_cols,
                          how = "left",
                          validate = "1:1"
                         )

Investigate a subset of observations to see how they are summed up.

In [4]:
keep_operator = [4, 182]
keep_county = ["ALA", "CC", "LA", "ORA"]

In [5]:
(operator_stats[(operator_stats.County.isin(keep_county)) & 
               (operator_stats.itp_id.isin(keep_operator)) ]
 .sort_values("itp_id")
)

Unnamed: 0,itp_id,County,sum_parallel,count_Route,unique_route_id,atleast1_parallel
0,4,ALA,109,582,119,79
26,4,CC,16,73,30,12
133,182,LA,49,667,120,49
177,182,ORA,0,8,4,0


In [6]:
display_cols = ["itp_id", "route_id", "Route", "County", "parallel"]

t1 = (gdf[(gdf.County.isin(keep_county)) & 
     (gdf.itp_id.isin(keep_operator))]
)[display_cols].reset_index(drop=True)

t1[t1.itp_id==182].sort_values(["route_id", "Route"])

Unnamed: 0,itp_id,route_id,Route,County,parallel
1031,182,10-13153,2,LA,0
655,182,10-13153,10,LA,0
860,182,10-13153,101,LA,0
1250,182,10-13153,105,LA,0
934,182,10-13153,110,LA,1
...,...,...,...,...,...
991,182,96-13153,110,LA,0
1030,182,96-13153,110,LA,0
1117,182,96-13153,134,LA,0
1286,182,SOFI,105,LA,0


In [7]:
t1 = t1.assign(
    atleast1_parallel = t1.groupby(operator_group_cols + ["route_id"])["parallel"].transform("max")
)

* For `itp_id==4`, route_ids 681 and 72M each intersect 3 highways in `County==CC` (80, 123, and 580).
* Each route_id is parallel to the highway only along `Route==580`.
* **Operator perspective**: the 681 and 72M lines should each be counted once (not triple counted), and each should be flagged as parallel.
<br>*Calculation*: `pct_parallel = 1.0` for the operator in CC, covering both 681 and 72M (2 lines parallel to at least 1 hwy / 2 lines) = 1.0

* **Highway perspective**: each of the 3 highways should be counted once (not double counted).
<br>*Calculation*: 
<br>`pct_parallel = 1 for 580` (2 parallel / 2 lines) = 1.0 
<br>`pct_parallel = 0 for 80` (0 parallel / 2 lines) = 0
<br>`pct_parallel = 0 for 123` (0 parallel / 2 lines) = 0
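The hand calculations above can be checked with a short sketch on the CC rows for routes 681 and 72M. One row is an assumption: the 72M/580/CC row is inferred from the highway-level totals elsewhere in this notebook, not copied from a printed row. The names `operator_pct`, `n_parallel`, and `n_routes` are made up for this example.

```python
import pandas as pd

# CC rows for itp_id 4; the 72M/580 row is inferred (assumption).
cc = pd.DataFrame({
    "itp_id":   [4] * 6,
    "route_id": ["681"] * 3 + ["72M"] * 3,
    "Route":    [80, 123, 580, 80, 123, 580],
    "County":   ["CC"] * 6,
    "parallel": [0, 0, 1, 0, 0, 1],
})

# Operator perspective: one flag per route, then the share of flagged routes.
route_flag = cc.groupby(["itp_id", "County", "route_id"])["parallel"].max()
operator_pct = route_flag.groupby(level=["itp_id", "County"]).mean()

# Highway perspective: parallel routes over unique routes, per hwy-county.
hwy = cc.groupby(["Route", "County"]).agg(
    n_parallel=("parallel", "sum"),
    n_routes=("route_id", "nunique"),
)
hwy["pct_parallel"] = hwy["n_parallel"] / hwy["n_routes"]

print(operator_pct)
print(hwy)
```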

In [8]:
keep_route = ["681", "72M"]
subset = t1[t1.route_id.isin(keep_route)].sort_values(["itp_id", "route_id", "Route"])
subset

Unnamed: 0,itp_id,route_id,Route,County,parallel,atleast1_parallel
540,4,681,80,CC,0,1
560,4,681,123,CC,0,1
579,4,681,580,CC,1,1
295,4,72M,13,ALA,0,0
342,4,72M,24,ALA,0,0
373,4,72M,80,ALA,0,0
546,4,72M,80,CC,0,1
411,4,72M,123,ALA,0,0
565,4,72M,123,CC,0,1
450,4,72M,260,ALA,0,0


In [9]:
# Operator perspective

# Must sum up `parallel`
# Summing `atleast1_parallel` is incorrect!

(subset.groupby(["itp_id", "County"])
 .agg({"route_id": "nunique", 
       "parallel": "sum",
       "atleast1_parallel": "sum"
      })
 .reset_index()
)

Unnamed: 0,itp_id,County,route_id,parallel,atleast1_parallel
0,4,ALA,1,0,0
1,4,CC,2,2,6
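The 6 for CC comes from how `transform("max")` works: it returns one value per original row, so each route's flag is repeated once for every highway row. A minimal sketch of the overcount, and of de-duplicating to one row per route before summing, is below. The 72M/580/CC row is an assumption inferred from the aggregated outputs, and `overcounted` and `routes_parallel` are names invented for this example.

```python
import pandas as pd

# CC rows for itp_id 4; the 72M/580 row is inferred (assumption).
cc = pd.DataFrame({
    "itp_id":   [4] * 6,
    "County":   ["CC"] * 6,
    "route_id": ["681"] * 3 + ["72M"] * 3,
    "Route":    [80, 123, 580, 80, 123, 580],
    "parallel": [0, 0, 1, 0, 0, 1],
})
group_cols = ["itp_id", "County"]

# The route-level flag is broadcast back onto every route-hwy row.
cc["atleast1_parallel"] = (
    cc.groupby(group_cols + ["route_id"])["parallel"].transform("max")
)
overcounted = cc["atleast1_parallel"].sum()

# Keeping one row per route before summing counts each route once.
deduped = cc[group_cols + ["route_id", "atleast1_parallel"]].drop_duplicates()
routes_parallel = deduped["atleast1_parallel"].sum()

print(overcounted, routes_parallel)
```

Summing the flag is only misleading when it is summed over route-hwy rows; summed over one row per route, it counts routes.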


In [10]:
# Double check with LA Metro
# This confirms sum(parallel) is correct
keep_route = ["105-13153", "910-13153"]
subset2 = t1[t1.route_id.isin(keep_route)].sort_values(["itp_id", "route_id", "Route"])

(subset2.groupby(["itp_id", "County"])
 .agg({"route_id": "nunique", 
       "parallel": "sum",
       "atleast1_parallel": "sum"
      })
 .reset_index()
)

# So, the error came from pct_parallel  = sum(parallel) / count(route_id)
# It needs to be pct_parallel = sum(parallel) / nunique(route_id)

Unnamed: 0,itp_id,County,route_id,parallel,atleast1_parallel
0,182,LA,2,1,12
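To see how much the denominator matters, the sketch below plugs in the LA Metro (`itp_id==182`, LA) row from the `operator_stats` output earlier in this notebook. The variable names `pct_by_count` and `pct_by_nunique` are made up for this example.

```python
# Values from the operator_stats row for itp_id 182 in LA.
sum_parallel = 49
count_route = 667       # route-hwy rows
unique_route_id = 120   # distinct routes

# Dividing by the row count spreads 49 over every route-hwy pair.
pct_by_count = sum_parallel / count_route

# Dividing by unique routes uses the operator's actual number of routes.
pct_by_nunique = sum_parallel / unique_route_id

print(f"{pct_by_count:.3f} vs {pct_by_nunique:.3f}")
```

With the row count as the denominator the share is about 7%, versus about 41% with unique routes.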


In [11]:
# Highway perspective

# Must sum up `parallel`
# Summing `atleast1_parallel` is incorrect!

(subset.groupby(["Route", "County"])
 .agg({"route_id": "nunique", 
       "parallel": "sum",
       "atleast1_parallel": "sum"
      })
 .reset_index()
)

Unnamed: 0,Route,County,route_id,parallel,atleast1_parallel
0,13,ALA,1,0,0
1,24,ALA,1,0,0
2,80,ALA,1,0,0
3,80,CC,2,0,2
4,123,ALA,1,0,0
5,123,CC,2,0,2
6,260,ALA,1,0,0
7,580,ALA,1,0,0
8,580,CC,2,2,2
9,880,ALA,1,0,0


In [12]:
(subset2.groupby(["Route", "County"])
 .agg({"route_id": "nunique", 
       "parallel": "sum",
       "atleast1_parallel": "sum"
      })
 .reset_index()
).sort_values(["parallel", "Route"], ascending=[False, True])

Unnamed: 0,Route,County,route_id,parallel,atleast1_parallel
7,110,LA,2,1,2
0,1,LA,1,0,1
1,5,LA,1,0,1
2,10,LA,2,0,1
3,47,LA,1,0,1
4,91,LA,1,0,1
5,101,LA,1,0,1
6,105,LA,1,0,1
8,164,LA,1,0,1
9,187,LA,1,0,0


The 910 is parallel to the 110 fwy, which makes sense because the 910 is the Silver Line, which runs on the fwy.

As a transit line, the 910 counts as parallel because it is parallel to at least 1 freeway (the 110).

For the 110, the 910 also counts as a parallel line, but it should not count as parallel for any of the other highways it intersects.

The need for `atleast1_parallel` came up because the 110 appears as 2 observations that differ by `RouteType`: once as Interstate, once as State Highway.

So an aggregation is needed that removes `RouteType` differences (these should be edge cases) by taking `max(parallel)` for each route-highway pair. Then calculate `sum(parallel)`.
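One way to sketch that two-step aggregation is shown below, using the three 110 rows for routes 105-13153 and 910-13153 (only the columns needed here). The grouping keys and the names `collapsed` and `hwy_stats` are assumptions about how this could be done, not the final implementation.

```python
import pandas as pd

# The 110 rows for two LA Metro routes; 910-13153 appears once per RouteType.
rows = pd.DataFrame({
    "itp_id":    [182, 182, 182],
    "route_id":  ["105-13153", "910-13153", "910-13153"],
    "Route":     [110, 110, 110],
    "County":    ["LA"] * 3,
    "RouteType": ["Interstate", "Interstate", "State"],
    "parallel":  [0, 1, 0],
})

# Step 1: drop the RouteType distinction by taking max(parallel)
# for each route-highway-county combination.
keys = ["itp_id", "route_id", "Route", "County"]
collapsed = rows.groupby(keys)["parallel"].max().reset_index()

# Step 2: sum(parallel) now counts each route at most once per highway.
hwy_stats = (
    collapsed.groupby(["Route", "County"])
    .agg(sum_parallel=("parallel", "sum"),
         unique_route_id=("route_id", "nunique"))
    .reset_index()
)

print(hwy_stats)
```

After step 1 the 910 has a single row against the 110 with `parallel == 1`, so the Interstate/State split no longer affects the count.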

In [13]:
gdf[(gdf.Route.isin(subset2.Route)) & 
    (gdf.County.isin(subset2.County)) &
    (gdf.route_id.isin(subset2.route_id)) & 
    (gdf.Route==110)
].sort_values(["route_id", "Route"], ascending=[True, True])

Unnamed: 0,itp_id,shape_id,route_id,route_length,total_routes,Route,County,District,RouteType,NB,SB,EB,WB,highway_length,geometry,pct_route,pct_highway,parallel
5798,182,1050260_DEC21,105-13153,84957.561255,120,110,LA,7,Interstate,1,1,0,0,125191.103899,"LINESTRING (6481771.040 1823906.478, 6481666.0...",0.125,0.085,0
5850,182,9100212_DEC21,910-13153,206333.820355,120,110,LA,7,Interstate,1,1,0,0,125191.103899,"LINESTRING (6474114.722 1725182.990, 6474116.0...",0.682,1.0,1
6054,182,9100212_DEC21,910-13153,206333.820355,120,110,LA,7,State,1,1,0,0,38787.994447,"MULTILINESTRING ((6488116.776 1842375.031, 648...",0.016,0.085,0
