## Exercise 5: Geospatial wrangling and making maps

Skills: 
* More geospatial practice building on earlier skills
* Make a map with `geopandas`

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html
* https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html

In [1]:
import geopandas as gpd
import intake
import os
import pandas as pd
import shapely

os.environ["CALITP_BQ_MAX_BYTES"] = str(100_000_000_000)

#from calitp_data_analysis.tables import tbls
from siuba import *

# Hint: if this doesn't import, refer to the docs for the correct import steps:
# cd into the _shared_utils folder and run the make setup_env command
#import shared_utils
FOLDER = "./data/"
FILE_NAME = "exercise_5_stops_sample.parquet"
stops = gpd.read_parquet(f"{FOLDER}{FILE_NAME}")




In [2]:
stops.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry
0,a7ba6f075198e9bf9152fab6c7faf0f6,10094,f6d0217add9d4426389028a3adf22d11,Broadway & 3rd Av,,,,306.0,,,,,,,,POINT (266399.422 -584535.132)
1,a7ba6f075198e9bf9152fab6c7faf0f6,11531,7685e406465ed8df429f758eaf47b48a,Cabrillo Memorial Hwy & Ft Rosecrans Cemetery ...,,,,9.0,,,,,,,,POINT (258751.992 -588197.418)


## Research Question

What's the average number of trips per stop, by operator, in southern California? Show visualizations at the operator and county levels.
<br>**Geographic scope:** southern California counties
<br>**Deliverables:** chart(s) and map(s) comparing metrics across counties and across operators. Make these visualizations using function(s).

### Prep data

* Use the same query, but grab a different set of operators. These are in southern California, so the map should zoom in on counties ranging from LA to SD.
* *Hint*: some counties have multiple operators. Make sure the county-level average trips per stop is a weighted average, not a simple mean of the operator averages.
* Use the same [shapefile for CA counties](https://gis.data.ca.gov/datasets/CALFIRE-Forestry::california-county-boundaries/explore?location=37.246136%2C-119.002032%2C6.12) as in Exercise 4.
* Join the data and only keep counties that have bus stops.
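The weighted-average hint above can be illustrated with a small sketch. The operator names, counts, and the column names `n_trips` and `n_stops` are all made up for illustration; they are not the exercise data. The idea is that a county's trips per stop is total trips divided by total stops, which generally differs from the plain mean of each operator's ratio.

```python
import pandas as pd

# Hypothetical operator-level totals; every value here is invented.
operators = pd.DataFrame({
    "county": ["Los Angeles", "Los Angeles", "San Diego"],
    "operator": ["Operator A", "Operator B", "Operator C"],
    "n_trips": [1000, 300, 800],
    "n_stops": [100, 10, 40],
})
operators["trips_per_stop"] = operators.n_trips / operators.n_stops

# Weighted average: sum the numerator and the denominator first, then divide.
by_county = (
    operators.groupby("county")
    .agg(n_trips=("n_trips", "sum"), n_stops=("n_stops", "sum"))
    .reset_index()
)
by_county["weighted_trips_per_stop"] = by_county.n_trips / by_county.n_stops

# Unweighted mean of the operator ratios, computed only for comparison.
unweighted = (
    operators.groupby("county")
    .trips_per_stop.mean()
    .rename("unweighted_trips_per_stop")
    .reset_index()
)
by_county = by_county.merge(unweighted, on="county", validate="1:1")
```

In this made-up data the two Los Angeles operators have very different stop counts, so the weighted and unweighted columns disagree for that county, while they match for San Diego, which has a single operator.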

In [None]:
feeds_to_names = shared_utils.gtfs_utils_v2.schedule_daily_feed_to_organization(
    selected_date = "2022-06-01",
    get_df = True
)[["feed_key", "name"]].drop_duplicates()

In [None]:
OPERATORS = [
    "Alhambra Schedule", 
    "San Diego Schedule",
    "Big Blue Bus Schedule",
    "Culver City Schedule",
    "OmniTrans Schedule",
    "OCTA Schedule"
]

SUBSET_FEEDS = feeds_to_names[
    feeds_to_names.name.isin(OPERATORS)
].feed_key.tolist()

In [None]:
# Requires `from calitp_data_analysis.tables import tbls` (commented out in the first cell)
stops = (
    tbls.mart_gtfs.fct_daily_scheduled_stops()
    >> filter(_.feed_key.isin(SUBSET_FEEDS))
    >> filter(_.service_date == "2022-06-01")
    >> select(_.feed_key, 
              _.stop_id, _.pt_geom)
    >> collect()
)

Check the type of `stops`. Is it a pandas df or geopandas gdf?

In [None]:
type(stops)

In [None]:
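# Turn stops into a gdf
import shapely.wkt  # make sure the wkt submodule is loaded

# Parse each WKT string in pt_geom into a shapely geometry
geom = [shapely.wkt.loads(x) for x in stops.pt_geom]

stops = gpd.GeoDataFrame(
    stops, 
    geometry=geom, 
    crs="EPSG:4326"
).drop(columns="pt_geom")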

Check the type of `stops`. Is it a pandas df or geopandas gdf?

What is the CRS and geometry column name?

In [None]:
type(stops)

In [None]:
stops.geometry.name

In [None]:
stops.crs

In [3]:
counties = gpd.read_file('https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson')

In [4]:
counties.head(2)

Unnamed: 0,OBJECTID,COUNTY_NAME,COUNTY_ABBREV,COUNTY_NUM,COUNTY_CODE,COUNTY_FIPS,ISLAND,Shape__Area,Shape__Length,GlobalID,geometry
0,1,Alameda,ALA,1,1,1,,3402787000.0,308998.650766,e6f92268-d2dd-4cfb-8b79-5b4b2f07c559,"POLYGON ((-122.27125 37.90503, -122.27024 37.9..."
1,2,Alpine,ALP,2,2,3,,3146939000.0,274888.492411,870479b2-480a-494b-8352-ad60578839c1,"POLYGON ((-119.58667 38.71420, -119.58653 38.7..."


In [5]:
counties = counties.to_crs('EPSG:4326')

In [6]:
stops = stops.to_crs('EPSG:4326')

In [7]:
# Inner spatial join: keep only counties that contain at least one bus stop
join = gpd.sjoin(counties, stops, how = 'inner', predicate = 'intersects')

In [8]:
list(join.columns)

['OBJECTID',
 'COUNTY_NAME',
 'COUNTY_ABBREV',
 'COUNTY_NUM',
 'COUNTY_CODE',
 'COUNTY_FIPS',
 'ISLAND',
 'Shape__Area',
 'Shape__Length',
 'GlobalID',
 'geometry',
 'index_right',
 'feed_key',
 'stop_id',
 'stop_key',
 'stop_name',
 'route_type_0',
 'route_type_1',
 'route_type_2',
 'route_type_3',
 'route_type_4',
 'route_type_5',
 'route_type_6',
 'route_type_7',
 'route_type_11',
 'route_type_12',
 'missing_route_type']

In [None]:
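# Average trips per stop by county should be a weighted average.
# '???' is a placeholder for the trips column, which still needs to be filled in.
trip = join.groupby(['feed_key']).agg(trip=('???', 'count')).reset_index()
stop = join.groupby(['feed_key']).agg(stop=('stop_id', 'count')).reset_index()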

In [None]:
merge1 = pd.merge(join, trip, on = 'feed_key',
    how = 'inner', validate = 'm:1')

In [None]:
merge2 = pd.merge(merge1, stop, on = 'feed_key',
    how = 'inner', validate = 'm:1')

In [None]:
merge2['trip_per_stop'] = merge2.trip/merge2.stop

### Bring in a new table from BigQuery

* In `mart_gtfs`, bring in the table called `fct_daily_scheduled_stops` for the subset of feeds defined above.
* Modify the snippet below to:
   * filter for the subset of operators
   * only keep columns: `feed_key`, `stop_id`, `stop_event_count`

In [None]:
# Requires the `tbls` import, which is commented out in the first cell
stop_counts = (
    tbls.mart_gtfs.fct_daily_scheduled_stops()
    >> filter(_.activity_date == "2022-06-01")
)

In [None]:
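# Local stand-in for tbls.mart_gtfs.fct_daily_scheduled_stops().
# Note: this pipeline expects `activity_date` and `stop_event_count` columns,
# which do not appear in the `stops.head(2)` output above.
stops = pd.read_parquet('./data/exercise_5_stops_sample.parquet')
stops = (
    stops
    >> filter(_.activity_date == "2022-06-01")
    >> select(_.feed_key, _.stop_id, 
             _.stop_event_count)
    >> arrange(_.feed_key, _.stop_id)
)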

### Aggregate
* Write a function that aggregates to either the operator level or the county level and adds new columns for the desired metrics.
* Merge in CA shapefile to get a gdf.
* Add another `geometry` column, called `centroid`, and grab the county's centroid.
* Refer to [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.set_geometry.html) to see how to pick which column to use as the `geometry` for the gdf, since technically, a gdf can handle multiple geometry columns.
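The aggregation steps above could be sketched roughly as follows. This is a minimal illustration, not the exercise solution: the toy data, the square "county" shapes, the helper name `aggregate_trips_per_stop`, and the choice to treat `stop_event_count` as the number of trips serving a stop are all assumptions. A projected CRS (EPSG:3310) is used so that centroids are computed in meters rather than degrees.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box


def aggregate_trips_per_stop(df, group_cols):
    """Sum trips and count stops within each group, then divide."""
    agg = (
        df.groupby(group_cols)
        .agg(
            n_trips=("stop_event_count", "sum"),  # assumed proxy for trips
            n_stops=("stop_id", "count"),
        )
        .reset_index()
    )
    agg["trips_per_stop"] = agg.n_trips / agg.n_stops
    return agg


# Hypothetical stop-level rows, already tagged with a county name.
stop_df = pd.DataFrame({
    "COUNTY_NAME": ["A", "A", "B"],
    "feed_key": ["f1", "f2", "f3"],
    "stop_id": ["1", "2", "3"],
    "stop_event_count": [10, 30, 8],
})

# Hypothetical square "counties" standing in for the CA county shapefile.
county_shapes = gpd.GeoDataFrame(
    {"COUNTY_NAME": ["A", "B"]},
    geometry=[box(0, 0, 10, 10), box(10, 0, 30, 20)],
    crs="EPSG:3310",
)

by_operator = aggregate_trips_per_stop(stop_df, ["feed_key"])
by_county = aggregate_trips_per_stop(stop_df, ["COUNTY_NAME"])

# Merge with the GeoDataFrame on the left so the result stays a gdf.
county_gdf = county_shapes.merge(
    by_county, on="COUNTY_NAME", how="inner", validate="1:1"
)

# Add a second geometry column, then choose it as the active geometry.
county_gdf["centroid"] = county_gdf.geometry.centroid
centroid_gdf = county_gdf.set_geometry("centroid")
```

Here `set_geometry` returns a new GeoDataFrame, so `county_gdf` keeps the polygons as its active geometry while `centroid_gdf` uses the points. Both frames still carry both columns.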

In [None]:
# Build point geometry from longitude/latitude (assumes `stops` has `stop_lon` and `stop_lat` columns)
gdf = gpd.GeoDataFrame(
    stops, 
    geometry=gpd.points_from_xy(stops['stop_lon'], stops['stop_lat']),
    crs='EPSG:4326'
)

In [None]:
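# Add a second geometry column called `centroid`, then make it the active geometry
stops["centroid"] = stops.geometry.centroid
stops2 = stops.set_geometry("centroid")
# Reference: GeoDataFrame.set_geometry(col, crs='EPSG:4326')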

In [None]:
stops2.head(2)

### Visualizations
* Make one chart for comparing trips per stop by operators, and another chart for comparing it by counties. Use a function to do this.
* Make 1 map for comparing trips per stop by counties. Use `gdf.explore()` to do this.
* Visualizations should follow the Cal-ITP style guide: [styleguide example notebook](https://github.com/cal-itp/data-analyses/blob/main/example_report/style-guide-examples.ipynb)
* More on `folium` and `ipyleaflet`: https://github.com/jorisvandenbossche/geopandas-tutorial/blob/master/05-more-on-visualization.ipynb
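One possible shape for the chart and map functions is sketched below. The function names `make_bar_chart` and `make_county_map`, the toy values, and the `name` column are hypothetical, and the Cal-ITP styleguide is not applied here. The chart uses the pandas plotting interface (matplotlib underneath); the map wraps `gdf.explore()`, which relies on optional dependencies such as `folium`.

```python
import pandas as pd


def make_bar_chart(df, category_col, value_col="trips_per_stop"):
    """Bar chart of value_col per category, sorted from high to low."""
    plot_df = df.sort_values(value_col, ascending=False)
    ax = plot_df.plot.bar(x=category_col, y=value_col, legend=False)
    ax.set_xlabel(category_col)
    ax.set_ylabel(value_col)
    ax.set_title(f"{value_col} by {category_col}")
    return ax


def make_county_map(gdf, value_col="trips_per_stop"):
    """Interactive choropleth of value_col; expects a county-level gdf."""
    return gdf.explore(
        column=value_col,
        tooltip=["COUNTY_NAME", value_col],
        legend=True,
    )


# Hypothetical aggregated tables; values are invented for illustration.
by_operator = pd.DataFrame({
    "name": ["Operator A", "Operator B"],
    "trips_per_stop": [10.0, 30.0],
})
by_county = pd.DataFrame({
    "COUNTY_NAME": ["County A", "County B"],
    "trips_per_stop": [20.0, 8.0],
})

# The same function serves both the operator and the county comparison.
operator_chart = make_bar_chart(by_operator, "name")
county_chart = make_bar_chart(by_county, "COUNTY_NAME")
```

Calling `make_county_map(county_gdf)` on a county-level GeoDataFrame that has `COUNTY_NAME` and `trips_per_stop` columns would return the interactive map. Reusing one charting function for both levels keeps the titles and axis labels consistent.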

In [None]:
# To add styleguide
from shared_utils import styleguide
from shared_utils import calitp_color_palette as cp

In [None]:
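# One bar per operator: drop the repeated stop-level rows before plotting
merge2[['feed_key', 'trip_per_stop']].drop_duplicates().plot(x='feed_key', y='trip_per_stop', kind='bar')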

In [None]:
merge2.plot(x='COUNTY_NAME', y='trip_per_stop', kind='bar')