# Number of feeds with data about physical accessibility

[GH issue](https://github.com/cal-itp/data-infra/issues/553)

MVP:

* Presence of `stops#wheelchair_boarding` field
* Presence of `trips#wheelchair_accessible` field
* Presence of non-empty pathways table when at least one child_stop within parent_stop exists in stops

More rigorous analysis:

Require minimum percent of:
* Rows in stops table with `wheelchair_boarding` field set to "not unknown" value
* Rows in trips table with `wheelchair_accessible` field set to "not unknown" value
* Child_stops within a parent_stop that can reach every other child_stop within parent_stop when simulating travel as an able-bodied person
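
The reachability requirement in the last bullet could be checked with a breadth-first search over the pathways graph. The following is only a minimal sketch: the function name and the input shapes (a child-to-parent dict and `(from_stop_id, to_stop_id, is_bidirectional)` tuples) are assumptions made for illustration, and it ignores `pathway_mode`, so it treats every pathway as usable.

```python
from collections import defaultdict, deque


def share_of_fully_connected_children(child_to_parent, pathways):
    """Share of child stops that can reach every sibling under the same parent.

    child_to_parent: dict mapping child stop_id -> parent_station id (assumed shape).
    pathways: iterable of (from_stop_id, to_stop_id, is_bidirectional) tuples.
    """
    if not child_to_parent:
        return None  # no child stops, so the metric is undefined

    # Build a directed graph; bidirectional pathways add the reverse edge too.
    graph = defaultdict(set)
    for from_stop, to_stop, is_bidirectional in pathways:
        graph[from_stop].add(to_stop)
        if is_bidirectional:
            graph[to_stop].add(from_stop)

    # Group child stops by their parent station.
    siblings = defaultdict(set)
    for child, parent in child_to_parent.items():
        siblings[parent].add(child)

    def reachable_from(start):
        # Standard breadth-first search; intermediate nodes are traversed as well.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in graph.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        return seen

    connected = sum(
        1
        for child, parent in child_to_parent.items()
        if siblings[parent] <= reachable_from(child)
    )
    return connected / len(child_to_parent)
```

A feed would then pass this check if the returned share is at or above whatever minimum percent is chosen.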

**Data Sources:**
* Static GTFS
* `stops#wheelchair_boarding`
* `trips#wheelchair_accessible`
* Pathways data as it relates to child and parent stops

### Method 1:
The first way is to calculate the total number of feeds that meet all of the 3 conditions:

* 100% of stops have a value of 1 or 2 in the wheelchair_boarding field.
* 100% of trips have a value of 1 or 2 in the wheelchair_accessible field.
* If any stop has a parent stop, the pathways file must contain at least 1 record. If no stops have a parent stop, this criterion is satisfied automatically.
* **Current notebook**: rather than counting only the feeds that meet 100%, plot a histogram to show the distribution.
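
The three conditions above can be sketched as a single per-feed check. This is an illustration under assumptions, not the notebook's implementation: `meets_method_1` is a hypothetical name, rows are plain dicts keyed by GTFS field names with string values, and an empty stops or trips table is treated as failing.

```python
def meets_method_1(stops, trips, num_pathways):
    """Return True if a feed satisfies all three Method 1 conditions.

    stops, trips: lists of dicts keyed by GTFS field names (assumed shape).
    num_pathways: number of records in the feed's pathways file.
    """
    known = {"1", "2"}  # explicitly defined wheelchair codes

    # Empty tables fail on purpose, since all() over nothing would be True.
    all_stops_known = bool(stops) and all(
        s.get("wheelchair_boarding") in known for s in stops
    )
    all_trips_known = bool(trips) and all(
        t.get("wheelchair_accessible") in known for t in trips
    )

    # Pathways are only required when at least one stop names a parent stop.
    pathways_required = any(s.get("parent_station") for s in stops)
    pathways_ok = (not pathways_required) or num_pathways > 0

    return all_stops_known and all_trips_known and pathways_ok
```

Counting the feeds for which this returns `True` gives the Method 1 total.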

### Method 2:
The second way is to give a weighted answer by adding up the per-feed score across all feeds where a feed's score is calculated as follows:
```
# set a pathways_required variable to true if any of the stops have a parent stop defined
num_parent_stops = count_num_parent_stops()
pathways_required = num_parent_stops > 0

# weight each component equally: thirds when pathways data is required, halves otherwise
component_weight = 1 / 3 if pathways_required else 1 / 2

# calculate overall accessible score
# the weights already sum to 1, so the weighted sum is not divided again
feed_score = (
    component_weight * calculate_percent_of_stops_with_explicitly_defined_wheelchair_code() +
    component_weight * calculate_percent_of_trips_with_explicitly_defined_wheelchair_code() +
    # TODO: a more rigorous analysis that checks whether each child-stop of a parent stop can reach
    # every other child-stop of that parent stop would be needed to calculate a more realistic value here.
    # Until then, a full score on pathways-completeness is given if more than 0 pathways entries exist.
    (component_weight if pathways_required and count_num_pathways() > 0 else 0)
)
```
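
As a worked example, the score can be written as a small function that takes the per-feed inputs directly. This sketch assumes the percentages are expressed on a 0 to 1 scale and that the intent is a weighted average whose weights sum to 1 and are applied once; `feed_score` and its parameter names are chosen here for illustration.

```python
def feed_score(pct_stops, pct_trips, num_parent_stops, num_pathways):
    """Weighted accessibility score in [0, 1] for a single feed.

    pct_stops, pct_trips: shares (0 to 1) of rows with an explicit wheelchair code.
    """
    pathways_required = num_parent_stops > 0

    if not pathways_required:
        # Two components, each weighted one half.
        return 0.5 * pct_stops + 0.5 * pct_trips

    # Three components, each weighted one third. Pathways get full credit
    # as soon as any pathways record exists (placeholder for a stricter check).
    pathways_component = 1.0 if num_pathways > 0 else 0.0
    return (pct_stops + pct_trips + pathways_component) / 3
```

For instance, a feed with no parent stops, complete stop codes, and half of its trips coded scores `feed_score(1.0, 0.5, 0, 0) == 0.75`.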

In [1]:
import altair as alt
import pandas as pd
import os

from datetime import date
from IPython.display import Markdown
from siuba import *

import warehouse_queries
from shared_utils import styleguide
from shared_utils import calitp_color_palette as cp

display(Markdown(
        f"<b>Report updated:</b> {date.today().strftime('%-m/%-d/%y')}"
    )
)



<b>Report updated:</b> 2/13/22

In [2]:
%%html
<style>
@import url('https://fonts.googleapis.com/css?family=Raleway');
@import url('https://fonts.googleapis.com/css?family=Nunito+Sans');
@import url('https://fonts.googleapis.com/css?family=Bitter');
</style>

In [3]:
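def categorize_values(df, col, values_dict=None, new_colname=None):
    # Avoid a mutable default argument; an empty mapping sends every value to NaN
    values_dict = values_dict or {}
    if new_colname is None:
        new_colname = col
    # Missing values are treated as "unknown" before mapping
    mapped = df[col].fillna("unknown").map(values_dict)
    df = df.drop(columns=col).assign(**{new_colname: mapped})

    return df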


def summarize_metric_for_operator(df, group_cols, numerator="", denominator=""):
    # Per group: sum the 0/1 numerator flag and count the denominator rows
    df2 = (df.groupby(group_cols)
           .agg({
               numerator: "sum", 
               denominator: "count"
           }).reset_index()
          )
    
    # Share of rows per group where the flag is set, stored as pct_<numerator>
    df2 = df2.assign(
        pct = df2[numerator].divide(df2[denominator])
    ).rename(columns = {"pct": f"pct_{numerator}"})
           
    return df2


def make_histogram(df, x_col):
    x_title = f"{x_col.replace('pct_has_', '').replace('_', ' ')} information"
             
    chart = (alt.Chart(df)
             .mark_bar()
             .encode( 
                 x=alt.X(f"{x_col}:Q", bin=True, title=f"% {x_title}",
                        axis=alt.Axis(format="%")),
                 y=alt.Y("count()", title="# Feeds"),
                 color=alt.value(cp.CALITP_CATEGORY_BRIGHT_COLORS[0]),
                 #Tooltip for aggregates: https://github.com/altair-viz/altair/issues/1065
                 #Tooltip for histogram: https://github.com/altair-viz/altair/issues/2006
                 tooltip=[alt.Tooltip(x_col, bin=True, title="bin"), 
                          alt.Tooltip("count()", title="count")]
             )
    )
    
    chart = (styleguide.preset_chart_config(chart)
             .properties(title=f"Feeds by % {x_title.title()}")
             .interactive()
            )
    
    display(chart)
    #chart.save(f"{IMG_PATH}{x_col}.png")
    #return chart

### Stops - % with Accessibility Info (not unknown)

The histogram shows the distribution of feeds by the % of stops with accessibility info.

In [4]:
# https://gtfs.org/reference/static/#stopstxt
# 0 is unknown; 1 is accessible; 2 is not accessible
# Map to 1 when the value is known (1 or 2) and to 0 otherwise
STOPS_VALUES_DICT = {
    "unknown": 0,
    "0": 0, 
    "1": 1,
    "2": 1,
}

GROUP_COLS = ["calitp_itp_id", "calitp_url_number"]

stops = warehouse_queries.stops >> collect()
stops = categorize_values(stops, "wheelchair_boarding", 
                           values_dict = STOPS_VALUES_DICT, 
                           new_colname = "has_stop_accessibility")

stops = summarize_metric_for_operator(stops, 
                              group_cols = GROUP_COLS, 
                              denominator = "stop_id", 
                              numerator = "has_stop_accessibility")

make_histogram(stops, "pct_has_stop_accessibility")



### Trips - % with accessibility info (not unknown)

The histogram shows the distribution of feeds by the % of trips with accessibility info.

In [5]:
# https://gtfs.org/reference/static/#tripstxt
# 0 is unknown; 1 is accessible; 2 is not accessible
# Map to 1 when the value is known (1 or 2) and to 0 otherwise
TRIPS_VALUES_DICT = {
    "unknown": 0,
    "0": 0, 
    "1": 1,
    "2": 1,
}

trips = warehouse_queries.trips >> collect()

trips = categorize_values(trips, "wheelchair_accessible", 
                           values_dict = TRIPS_VALUES_DICT, 
                           new_colname = "has_trip_accessibility")

trips = summarize_metric_for_operator(trips, 
                              group_cols = GROUP_COLS, 
                              denominator = "trip_id", 
                              numerator = "has_trip_accessibility")

make_histogram(trips, "pct_has_trip_accessibility")

### Plot feeds with full info for stops and trips

Count the feeds with complete (100%) accessibility information on the stops metric and on the trips metric, and plot each count separately.

Also count the feeds that meet both.

In [6]:
unique_feeds_full_info = pd.merge(
    stops[stops.pct_has_stop_accessibility==1],
    trips[trips.pct_has_trip_accessibility==1],
    on = ["calitp_itp_id", "calitp_url_number"],
    how = "inner",
    validate = "1:1"
)

In [10]:
full_info = {
    "stops": len(stops[stops.pct_has_stop_accessibility==1]),
    "trips": len(trips[trips.pct_has_trip_accessibility==1]),
    "both": len(unique_feeds_full_info),
}

combined = (pd.DataFrame.from_dict(full_info, orient="index", columns=["value"])
        .reset_index()
        .rename(columns = {"index": "category"})
       )


print(f"# feeds in stops: {len(stops)}")
print(f"# feeds in trips: {len(trips)}")

combined = (combined.assign(
        total_feeds = len(stops),
        pct = round(combined.value / len(stops), 3)
    )
)

combined

# feeds in stops: 251
# feeds in trips: 251


| | category | value | total_feeds | pct |
|---|---|---|---|---|
| 0 | stops | 10 | 251 | 0.04 |
| 1 | trips | 22 | 251 | 0.088 |
| 2 | both | 8 | 251 | 0.032 |


In [11]:
chart = (alt.Chart(combined)
         .mark_bar(size=35)
         .encode(
             x=alt.X("category"),
             y=alt.Y("value", title="# feeds"),
             color=alt.value(cp.CALITP_CATEGORY_BRIGHT_COLORS[0]),
             tooltip=["category", "value"]
         ).properties(title="# Feeds with Full Accessibility Information")
)

chart = (styleguide.preset_chart_config(chart)
         .properties(width = styleguide.chart_width*0.4)
         .interactive()
        )

display(chart)
#chart.save(f"{IMG_PATH}full_info.png")

In [12]:
chart = (alt.Chart(combined)
         .mark_bar(size=35)
         .encode(
             x=alt.X("category",),
             y=alt.Y("pct", title="% feeds", 
                     axis=alt.Axis(format="%")
                    ),
             color=alt.value(cp.CALITP_CATEGORY_BRIGHT_COLORS[0]),
             tooltip = ["category", "pct"]
         ).properties(title="% Feeds with Full Accessibility Information")
)

chart = (styleguide.preset_chart_config(chart)
         .properties(width = styleguide.chart_width*0.4)
         .interactive()
        )

display(chart)
#chart.save(f"{IMG_PATH}full_info_pct.png")

### Pathways table

Only `ITP_ID==200` appears in the pathways table, and that ID isn't in the `stops` or `trips` tables.

For now, the feed score splits the weight 50-50 between `stops` and `trips`. How to include pathways in the scoring is still undecided.
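
A mismatch like this can be surfaced with a simple set difference on feed keys. The sketch below is illustrative only: `keys_missing_from` is a hypothetical helper, and the key pairs are made-up examples (only ITP ID 200 comes from the notebook; the URL numbers and the other IDs are placeholders).

```python
def keys_missing_from(reference_keys, candidate_keys):
    """Return the candidate keys that do not appear in reference_keys, sorted."""
    return sorted(set(candidate_keys) - set(reference_keys))


# Hypothetical (calitp_itp_id, calitp_url_number) pairs, for illustration only
stops_feeds = [(1, 0), (1, 1), (4, 0)]
pathways_feeds = [(200, 0)]

# Feeds present in pathways that cannot be matched to any feed in stops
unmatched = keys_missing_from(stops_feeds, pathways_feeds)
print(unmatched)
```

An empty result would mean every feed with pathways data can be joined to the stops-based metrics.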

### Feed Score

For a given feed:

`feed_score = 0.5 * pct_stop_accessibility + 0.5 * pct_trip_accessibility`

Histogram shows the distribution of the feed scores.

In [14]:
def calculate_feed_score(stop_df, trip_df):
    df = pd.merge(stop_df, trip_df, 
                  on = GROUP_COLS,
                  how = "inner", 
                  validate = "1:1")
    
    STOP_WEIGHT = 0.5
    TRIP_WEIGHT = 0.5
    
    df = df.assign(
        feed_score = ((STOP_WEIGHT * df.pct_has_stop_accessibility) + 
                      (TRIP_WEIGHT * df.pct_has_trip_accessibility)
                     )
    )
    
    return df

In [15]:
df = calculate_feed_score(stops, trips)
df.head()

| | calitp_itp_id | calitp_url_number | has_stop_accessibility | stop_id | pct_has_stop_accessibility | has_trip_accessibility | trip_id | pct_has_trip_accessibility | feed_score |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 866 | 0.0 | 0 | 2940 | 0.0 | 0.0 |
| 1 | 1 | 0 | 7 | 130 | 0.053846 | 0 | 126 | 0.0 | 0.026923 |
| 2 | 1 | 1 | 9 | 254 | 0.035433 | 0 | 558 | 0.0 | 0.017717 |
| 3 | 1 | 2 | 9 | 254 | 0.035433 | 0 | 558 | 0.0 | 0.017717 |
| 4 | 1 | 3 | 0 | 350 | 0.0 | 0 | 1167 | 0.0 | 0.0 |


In [16]:
# Distribution of feed scores
# Most scores are 0, and very few feeds would meet a 100% threshold
make_histogram(df, "feed_score")