## Exercise 5: Geospatial wrangling and making maps

Skills: 
* More geospatial practice building on earlier skills
* Make a map with `geopandas`

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html
* https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html

In [1]:
import geopandas as gpd
import intake
import os
import pandas as pd
import shapely

os.environ["CALITP_BQ_MAX_BYTES"] = str(100_000_000_000)

from calitp_data_analysis.tables import tbls
from siuba import *

# Hint: if this doesn't import, refer to the docs for the correct setup steps:
# cd into _shared_utils folder, run the make setup_env command
import shared_utils




## Research Question

What's the average number of trips per stop by operators in southern California? Show visualizations at the operator and county-level.
<br>**Geographic scope:** southern California counties
<br>**Deliverables:** chart(s) and map(s) showing metrics comparing across counties and also across operators. Make these visualizations using function(s).

### Prep data

* Use the same query, but grab a different set of operators. These are in southern California, so the map should zoom in on counties ranging from LA to SD.
* *Hint*: some counties have multiple operators. Make sure the county-level average of stop events per stop is a weighted average (total stop events divided by total stops), not an average of the operator averages.
* Use the same [shapefile for CA counties](https://gis.data.ca.gov/datasets/CALFIRE-Forestry::california-county-boundaries/explore?location=37.246136%2C-119.002032%2C6.12) as in Exercise 4.
* Join the data and only keep counties that have bus stops.
* If you cannot connect to the warehouse, use this dict to map feed_keys to names.
    ```
    feed_keys_to_names_dict = {
        "71d91d70ad6c07b1f9b0a618ffceef93": "Alhambra Schedule",
        "a7ba6f075198e9bf9152fab6c7faf0f6": "San Diego Schedule",
        "4f77ef02b983eccc0869c7540f98a7d0": "Big Blue Bus Schedule",
        "ae93a53469371fb3f9059d2097f66842": "OmniTrans Schedule",
        "180d48eb03829594478082dca5782ccd": "Culver City Schedule"
    }
    ```
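If the warehouse is unavailable, the dict can stand in for the lookup table. The following is a minimal sketch of one way to use it, assuming `pandas` is installed; `feeds_to_names_offline` and `example_stops` are hypothetical names, and the two stop rows are made up for illustration.

```python
import pandas as pd

# Same feed_key -> name mapping as listed above.
feed_keys_to_names_dict = {
    "71d91d70ad6c07b1f9b0a618ffceef93": "Alhambra Schedule",
    "a7ba6f075198e9bf9152fab6c7faf0f6": "San Diego Schedule",
    "4f77ef02b983eccc0869c7540f98a7d0": "Big Blue Bus Schedule",
    "ae93a53469371fb3f9059d2097f66842": "OmniTrans Schedule",
    "180d48eb03829594478082dca5782ccd": "Culver City Schedule",
}

# Two-column table with the same shape as a feed_key/name lookup.
feeds_to_names_offline = pd.DataFrame(
    list(feed_keys_to_names_dict.items()), columns=["feed_key", "name"]
)

# Hypothetical stop rows, used only to show the lookup.
example_stops = pd.DataFrame({
    "feed_key": [
        "71d91d70ad6c07b1f9b0a618ffceef93",
        "a7ba6f075198e9bf9152fab6c7faf0f6",
    ],
    "stop_id": ["1", "2"],
})

# Attach the operator name to each row by mapping its feed_key.
example_stops = example_stops.assign(
    name=example_stops.feed_key.map(feed_keys_to_names_dict)
)
```

Any `feed_key` missing from the dict maps to `NaN`, which is a quick way to spot feeds that the offline mapping does not cover.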

In [2]:
# Check what is included in the shared_utils.gtfs_utils_v2 module

print(dir(shared_utils.gtfs_utils_v2)) 

['Fx', 'GCS_PROJECT', 'Literal', 'METROLINK_ROUTE_TO_SHAPE', 'METROLINK_SHAPE_TO_ROUTE', 'Pipeable', 'Union', '_', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__spec__', 'across', 'add_count', 'anti_join', 'arrange', 'case_when', 'check_operator_feeds', 'collect', 'complete', 'count', 'datetime', 'distinct', 'expand', 'extract', 'fill_in_metrolink_trips_df_with_shape_id', 'filter', 'filter_custom_col', 'filter_date', 'filter_feed_options', 'filter_operator', 'filter_start_end_ts', 'full_join', 'gather', 'geography_utils', 'get_metrolink_feed_key', 'get_shapes', 'get_stop_times', 'get_stops', 'get_trips', 'gpd', 'group_by', 'head', 'hour_tuple_to_seconds', 'if_else', 'inner_join', 'join', 'left_join', 'mutate', 'nest', 'pd', 'pipe', 'rename', 'right_join', 'schedule_daily_feed_to_gtfs_dataset_name', 'schedule_rt_utils', 'select', 'semi_join', 'separate', 'shapely', 'show_query', 'siuba', 'spread', 'subset_cols', 'summarize', 'tbl', 'tbl

In [3]:
# Note: schedule_daily_feed_to_organization has been renamed to schedule_daily_feed_to_gtfs_dataset_name
feeds_to_names = shared_utils.gtfs_utils_v2.schedule_daily_feed_to_gtfs_dataset_name(
    selected_date = "2022-06-01",
    get_df = True
)[["feed_key", "name"]].drop_duplicates()

feeds_to_names

Unnamed: 0,feed_key,name
0,5efaa2460085a481db5dfbf57ae78187,Kern Schedule
1,c50220b8622624dfa0c5c22859b14694,Humboldt Schedule
2,1b77ef49f5bc70038cbf15e4f5f98477,Compton Schedule
3,4b6b673ab50c016344c1adf09de2cc84,Banning Pass Schedule
4,7a7e9069dedca7a58e5a89aaa0a97256,Bay Area 511 Santa Rosa CityBus Schedule
...,...,...
195,be1ab75c2b37f1ee2964d0c00b56707f,Huntington Schedule
196,3551cafd288e0f647ff54627e26d0479,SBMTD Schedule
197,b20a7d27be377835a8d542b5b7a34e9a,El Segundo Schedule
198,a7fdbe01be9a5f96e9e45a3aceb17167,Turlock Schedule


In [4]:
OPERATORS = [
    "Alhambra Schedule", 
    "San Diego Schedule",
    "Big Blue Bus Schedule",
    "Culver City Schedule",
    "OmniTrans Schedule",
    "OCTA Schedule"
]

SUBSET_FEEDS = feeds_to_names[
    feeds_to_names.name.isin(OPERATORS)
].feed_key.tolist()


In [5]:
stops = (
    tbls.mart_gtfs.fct_daily_scheduled_stops()
    >> filter(_.feed_key.isin(SUBSET_FEEDS))
    >> filter(_.service_date == "2022-06-01")
    >> select(_.feed_key, 
              _.stop_id, _.pt_geom)
    >> collect()
)



Check the type of `stops`. Is it a pandas df or geopandas gdf?

In [6]:
type(stops)

pandas.core.frame.DataFrame

> `stops` is a pandas DataFrame, not a GeoDataFrame.

In [7]:
stops

Unnamed: 0,feed_key,stop_id,pt_geom
0,41ee0151e3cac17098d055ce25b3f104,151,POINT(-118.414249 33.992827)
1,41ee0151e3cac17098d055ce25b3f104,152,POINT(-118.412862 33.994731)
2,41ee0151e3cac17098d055ce25b3f104,153,POINT(-118.411612 33.996444)
3,41ee0151e3cac17098d055ce25b3f104,154,POINT(-118.410039 33.998632)
4,41ee0151e3cac17098d055ce25b3f104,155,POINT(-118.408081 34.001101)
...,...,...,...
13242,e7985c6c0c873f17871d79a527a50afa,7249,POINT(-117.815121 33.745222)
13243,e7985c6c0c873f17871d79a527a50afa,7250,POINT(-117.816895 33.74346)
13244,e7985c6c0c873f17871d79a527a50afa,6309,POINT(-117.876418 33.750429)
13245,e7985c6c0c873f17871d79a527a50afa,1030,POINT(-117.883635 33.915522)


In [8]:
# Turn stops into a gdf
geom = [shapely.wkt.loads(x) for x in stops.pt_geom]

stops = gpd.GeoDataFrame(
    stops, 
    geometry=geom, 
    crs="EPSG:4326"
).drop(columns="pt_geom")

Check the type of `stops`. Is it a pandas df or geopandas gdf?

What is the CRS and geometry column name?

In [9]:
type(stops)

geopandas.geodataframe.GeoDataFrame

In [10]:
stops.geometry.name

'geometry'

In [11]:
stops.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

### Bring in a new table from BigQuery

* In `mart_gtfs`, bring in the table called `fct_daily_scheduled_stops` for the subset of feeds defined above.
* Modify the snippet below to:
   * filter for the subset of operators
   * only keep columns: `feed_key`, `stop_id`, `stop_event_count`

In [84]:
stop_counts = (
    tbls.mart_gtfs.fct_daily_scheduled_stops()
    >> filter(_.service_date == "2022-06-01")
    >> filter(_.feed_key.isin(SUBSET_FEEDS))
    >> select(_.feed_key, _.stop_id, _.stop_event_count, _.pt_geom)
    >> collect()
)

stop_counts

Unnamed: 0,feed_key,stop_id,stop_event_count,pt_geom
0,a3af905228efc93bb48f360b92965afb,7498,1,POINT(-117.566836 34.018881)
1,a3af905228efc93bb48f360b92965afb,7510,1,POINT(-117.56684 34.01906)
2,a3af905228efc93bb48f360b92965afb,8924,1,POINT(-117.575712 34.001556)
3,a3af905228efc93bb48f360b92965afb,8925,1,POINT(-117.575702 34.007739)
4,a3af905228efc93bb48f360b92965afb,8926,1,POINT(-117.575993 34.006527)
...,...,...,...,...
13242,239e56d11510f71d7182a24c5621be8c,1184,205,POINT(-118.444813 34.069451)
13243,239e56d11510f71d7182a24c5621be8c,72,216,POINT(-118.484154 34.011762)
13244,239e56d11510f71d7182a24c5621be8c,304,237,POINT(-118.444838 34.059653)
13245,239e56d11510f71d7182a24c5621be8c,402,246,POINT(-118.445258 34.062525)


In [13]:
# Turn stop_counts into a gdf
geom1 = [shapely.wkt.loads(x) for x in stop_counts.pt_geom]

stop_counts = gpd.GeoDataFrame(
    stop_counts, 
    geometry=geom1, 
    crs="EPSG:4326"
).drop(columns="pt_geom")
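The conversions above parse each WKT string in a Python list comprehension. As an alternative sketch, `gpd.GeoSeries.from_wkt` parses the whole column at once and sets the CRS in the same call. The `example` frame below is hypothetical; its two rows reuse values from the `stops` output shown earlier.

```python
import geopandas as gpd
import pandas as pd

# Hypothetical frame shaped like the warehouse result (values copied from the output above).
example = pd.DataFrame({
    "stop_id": ["151", "152"],
    "pt_geom": ["POINT(-118.414249 33.992827)", "POINT(-118.412862 33.994731)"],
})

# Parse the WKT column into a GeoSeries and use it as the geometry.
example_gdf = gpd.GeoDataFrame(
    example.drop(columns="pt_geom"),
    geometry=gpd.GeoSeries.from_wkt(example.pt_geom, crs="EPSG:4326"),
)
```

The result should match the list-comprehension approach: a GeoDataFrame with a `geometry` column in EPSG:4326 and no `pt_geom` column.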



### Aggregate
* Write a function to aggregate to the operator level or county level, add new columns for desired metrics.
* Merge in CA shapefile to get a gdf.
* Add another `geometry` column, called `centroid`, and grab the county's centroid.
* Refer to [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.set_geometry.html) to see how to pick which column to use as the `geometry` for the gdf, since technically, a gdf can handle multiple geometry columns.
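The first bullet asks for a reusable aggregation function. Here is a minimal sketch, assuming one row per stop per feed; `aggregate_trips_per_stop` and the `county` / `operator` column names are hypothetical, and the data is made up to show why the weighted average matters.

```python
import pandas as pd

def aggregate_trips_per_stop(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """Sum stop events and count stop rows per group, then divide the two."""
    agg = (
        df.groupby(group_cols)
        .agg(
            stop_event_count=("stop_event_count", "sum"),
            n_stops=("stop_id", "count"),
        )
        .reset_index()
    )
    # Total events / total stops is the stop-weighted average for the group.
    return agg.assign(trips_per_stop=agg.stop_event_count / agg.n_stops)

# Hypothetical data: two operators serving the same county.
example = pd.DataFrame({
    "county": ["A", "A", "A", "A"],
    "operator": ["X", "X", "X", "Y"],
    "stop_id": ["1", "2", "3", "4"],
    "stop_event_count": [10, 10, 10, 50],
})

by_operator = aggregate_trips_per_stop(example, ["operator"])
by_county = aggregate_trips_per_stop(example, ["county"])
```

In this toy data, operator X averages 10 and operator Y averages 50, so averaging the two operator values would give 30. Aggregating directly at the county level gives 80 / 4 = 20, which weights each operator by its number of stops.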

In [14]:
#Bringing in the CA shapefile from the URL
LONG_URL_PATH = "https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson"
CA_county = gpd.read_file(LONG_URL_PATH)
CA_county.head(2)

Unnamed: 0,OBJECTID,COUNTY_NAME,COUNTY_ABBREV,COUNTY_NUM,COUNTY_CODE,COUNTY_FIPS,ISLAND,Shape__Area,Shape__Length,GlobalID,geometry
0,1,Alameda,ALA,1,1,1,,3402787000.0,308998.650766,e6f92268-d2dd-4cfb-8b79-5b4b2f07c559,"POLYGON ((-122.27125 37.90503, -122.27024 37.9..."
1,2,Alpine,ALP,2,2,3,,3146939000.0,274888.492411,870479b2-480a-494b-8352-ad60578839c1,"POLYGON ((-119.58667 38.71420, -119.58653 38.7..."


In [17]:
county_stops = gpd.sjoin(
    CA_county, 
    stop_counts, 
    how = 'inner',
    predicate = 'intersects'
)

county_stops.geometry.unique()
county_stops.head(2)

Unnamed: 0,OBJECTID,COUNTY_NAME,COUNTY_ABBREV,COUNTY_NUM,COUNTY_CODE,COUNTY_FIPS,ISLAND,Shape__Area,Shape__Length,GlobalID,geometry,index_right,feed_key,stop_id,stop_event_count
18,19,Los Angeles,LOS,19,19,37,,15054690000.0,629726.475248,3b1e1d69-2b1a-464d-ba43-611c4201b219,"POLYGON ((-117.66733 34.79317, -117.66728 34.7...",3046,e7985c6c0c873f17871d79a527a50afa,4086,16
18,19,Los Angeles,LOS,19,19,37,,15054690000.0,629726.475248,3b1e1d69-2b1a-464d-ba43-611c4201b219,"POLYGON ((-117.66733 34.79317, -117.66728 34.7...",3324,e7985c6c0c873f17871d79a527a50afa,4069,17


In [45]:
county_stops = county_stops.assign(
    centroid = county_stops.geometry.centroid
)

county_stops = county_stops.set_geometry('centroid')
county_stops.head(2)




Unnamed: 0,OBJECTID,COUNTY_NAME,COUNTY_ABBREV,COUNTY_NUM,COUNTY_CODE,COUNTY_FIPS,ISLAND,Shape__Area,Shape__Length,GlobalID,geometry,index_right,feed_key,stop_id,stop_event_count,centroid
18,19,Los Angeles,LOS,19,19,37,,15054690000.0,629726.475248,3b1e1d69-2b1a-464d-ba43-611c4201b219,"POLYGON ((-117.66733 34.79317, -117.66728 34.7...",3046,e7985c6c0c873f17871d79a527a50afa,4086,16,POINT (-118.21689 34.36117)
18,19,Los Angeles,LOS,19,19,37,,15054690000.0,629726.475248,3b1e1d69-2b1a-464d-ba43-611c4201b219,"POLYGON ((-117.66733 34.79317, -117.66728 34.7...",3324,e7985c6c0c873f17871d79a527a50afa,4069,17,POINT (-118.21689 34.36117)


> A GeoDataFrame can hold several columns of geometry (shapely) objects, but only one of them can be the active geometry at a time.

In [35]:
county_stops.geometry.name

'centroid'
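geopandas warns when `.centroid` is computed on geometries in a geographic CRS such as EPSG:4326, because degrees are not a planar unit. One possible approach, sketched below, is to project first, take the centroid, and convert back. EPSG:3310 (NAD83 / California Albers) is an assumed choice of projected CRS, and the rectangle is a hypothetical stand-in for a county polygon.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Hypothetical rectangle in lon/lat, standing in for a county polygon.
example = gpd.GeoDataFrame(
    {"COUNTY_NAME": ["Example"]},
    geometry=[Polygon([(-118.5, 33.5), (-117.5, 33.5), (-117.5, 34.5), (-118.5, 34.5)])],
    crs="EPSG:4326",
)

# Compute the centroid in a projected CRS (assumed: California Albers).
centroid_projected = example.geometry.to_crs("EPSG:3310").centroid

# Convert back so the centroid shares the CRS of the polygon column.
example = example.assign(centroid=centroid_projected.to_crs(example.crs))

# Make the centroid column the active geometry; polygons stay in `geometry`.
example_points = example.set_geometry("centroid")
```

`example` keeps the polygons as its active geometry, while `example_points` uses the centroids, so either can be chosen depending on whether a choropleth or a point map is needed.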

### Visualizations
* Make one chart for comparing trips per stop by operators, and another chart for comparing it by counties. Use a function to do this.
* Make 1 map for comparing trips per stop by counties. Use `gdf.explore()` to do this.
* Visualizations should follow the Cal-ITP style guide: [styleguide example notebook](https://github.com/cal-itp/data-analyses/blob/main/starter_kit/style-guide-examples.ipynb)
* More on `folium` and `ipyleaflet`: https://github.com/jorisvandenbossche/geopandas-tutorial/blob/master/05-more-on-visualization.ipynb

In [42]:
trips_county = (county_stops.groupby(['COUNTY_NAME'])
            .agg({'stop_event_count' : 'sum',
                 'stop_id': 'count'}
                ).reset_index()
           )

trips_county

Unnamed: 0,COUNTY_NAME,stop_event_count,stop_id
0,Los Angeles,73507,1539
1,Orange,178772,5221
2,Riverside,147,4
3,San Bernardino,57712,2257
4,San Diego,228968,4226


In [36]:
# Import the Cal-ITP styleguide and color palettes
from calitp_data_analysis import styleguide
from calitp_data_analysis import calitp_color_palette as cp

In [44]:
trips_county = trips_county.assign(
    trips_per_stop = trips_county['stop_event_count']/trips_county['stop_id']
)
trips_county

Unnamed: 0,COUNTY_NAME,stop_event_count,stop_id,trips_per_stop
0,Los Angeles,73507,1539,47.762833
1,Orange,178772,5221,34.24095
2,Riverside,147,4,36.75
3,San Bernardino,57712,2257,25.570226
4,San Diego,228968,4226,54.180786


Make a chart and a map of total stop events by county.

In [77]:
dir(cp)

['CALITP_CATEGORY_BOLD_COLORS',
 'CALITP_CATEGORY_BRIGHT_COLORS',
 'CALITP_DIVERGING_COLORS',
 'CALITP_SEQUENTIAL_COLORS',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__']

In [80]:
import altair as alt

def make_chart(trips_county, colorscale): 
    chart = (alt.Chart(trips_county)
             .mark_bar()
             .encode(
                 x=alt.X("COUNTY_NAME", title="county"),
                 y=alt.Y("stop_event_count", title="total stop events"),
                 color = alt.Color("stop_event_count",
                                   scale = alt.Scale(range=colorscale),
                                  ),
             ).properties(title="total stop events by county")
            )
    chart = styleguide.preset_chart_config(chart)
    display(chart)

make_chart(trips_county, cp.CALITP_CATEGORY_BOLD_COLORS)

                 



Make a chart and a map of stop events per stop by county.

In [82]:
import altair as alt

def make_chart(trips_county, colorscale): 
    chart = (alt.Chart(trips_county)
             .mark_bar()
             .encode(
                 x=alt.X("COUNTY_NAME", title="county"),
                 y=alt.Y("trips_per_stop", title="trips per stop"),
                 color = alt.Color("trips_per_stop",
                                   scale = alt.Scale(range=colorscale),
                                  ),
             ).properties(title="trips per stop by county")
            )
    chart = styleguide.preset_chart_config(chart)
    display(chart)

make_chart(trips_county, cp.CALITP_CATEGORY_BRIGHT_COLORS)
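For the map, one possible approach is to merge the county-level metrics back onto the county polygons and call `gdf.explore()`. The sketch below uses hypothetical square "counties" and hypothetical helper names (`attach_metrics`, `make_county_map`); the two `trips_per_stop` values are taken from the table above. `explore()` needs `folium`, `matplotlib`, and `mapclassify` installed, so the call is left commented out.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

def attach_metrics(counties, metrics, key="COUNTY_NAME"):
    """Inner-merge county polygons with per-county metrics (drops counties without data)."""
    return counties[[key, "geometry"]].merge(metrics, on=key, how="inner")

def make_county_map(gdf, column):
    """Interactive choropleth of `column`; returns a folium map."""
    return gdf.explore(column=column, tooltip=["COUNTY_NAME", column], legend=True)

# Hypothetical square polygons standing in for real county boundaries.
counties_example = gpd.GeoDataFrame(
    {"COUNTY_NAME": ["Los Angeles", "San Diego", "Alpine"]},
    geometry=[box(-119, 34, -118, 35), box(-118, 32, -117, 33), box(-120, 38, -119, 39)],
    crs="EPSG:4326",
)

# Metric values copied from the trips_county table above.
metrics_example = pd.DataFrame({
    "COUNTY_NAME": ["Los Angeles", "San Diego"],
    "trips_per_stop": [47.762833, 54.180786],
})

county_map_gdf = attach_metrics(counties_example, metrics_example)

# In a notebook, the map would be rendered with:
# make_county_map(county_map_gdf, "trips_per_stop")
```

The inner merge keeps only counties that have metrics, which matches the instruction to keep counties that have bus stops.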

Use a Markdown cell and write how you would summarize and interpret the visualizations.

- San Diego County has the most trips per stop (about 54), followed by Los Angeles County (about 48).
- San Diego County also has the highest total stop event count, followed by Orange and Los Angeles counties.