## Exercise 5: Geospatial wrangling and making maps

Skills: 
* More geospatial practice building on earlier skills
* Make a map with `geopandas`

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html
* https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html

In [1]:
import geopandas as gpd
import intake
import os
import pandas as pd
import shapely

os.environ["CALITP_BQ_MAX_BYTES"] = str(100_000_000_000)

from calitp_data_analysis.tables import tbls
from siuba import *

# Hint: if this doesn't import, refer to the docs for the correct setup:
# cd into the _shared_utils folder and run the `make setup_env` command
import shared_utils




## Research Question

What's the average number of trips per stop by operators in southern California? Show visualizations at the operator and county-level.
<br>**Geographic scope:** southern California counties
<br>**Deliverables:** chart(s) and map(s) showing metrics comparing across counties and also across operators. Make these visualizations using function(s).

### Prep data

* Use the same query, but grab a different set of operators. These are in southern California, so the map should zoom in on the counties ranging from Los Angeles to San Diego.
* *Hint*: some counties have multiple operators. Make sure the average trips per stop by county is a weighted average, so that larger operators count proportionally more than smaller ones.
* Use the same [shapefile for CA counties](https://gis.data.ca.gov/datasets/CALFIRE-Forestry::california-county-boundaries/explore?location=37.246136%2C-119.002032%2C6.12) as in Exercise 4.
* Join the data and only keep counties that have bus stops.
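
The weighted-average hint can be illustrated with a small made-up example. The operator names, the numbers, and the `n_trips` / `n_stops` column names below are hypothetical and are not taken from the query results; this is only a sketch of why averaging the operator averages differs from dividing county totals.

```python
import pandas as pd

# Hypothetical operator-level totals (made-up numbers, for illustration only)
operator_df = pd.DataFrame({
    "county": ["Los Angeles", "Los Angeles", "Orange"],
    "name": ["Operator A", "Operator B", "Operator C"],
    "n_trips": [1000, 100, 600],
    "n_stops": [100, 50, 60],
})
operator_df["trips_per_stop"] = operator_df.n_trips / operator_df.n_stops

# Unweighted: the mean of the operator-level averages.
# A small operator counts as much as a large one.
unweighted = operator_df.groupby("county").trips_per_stop.mean()

# Weighted: sum trips and stops within the county first, then divide.
county_totals = operator_df.groupby("county")[["n_trips", "n_stops"]].sum()
weighted = county_totals.n_trips / county_totals.n_stops

print(pd.DataFrame({"unweighted": unweighted, "weighted": weighted}))
```

In this toy data the two operators in Los Angeles average 10 and 2 trips per stop, so the unweighted figure is 6, while the weighted figure is 1100 / 150, or about 7.33, because the larger operator contributes more stops.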

In [24]:
feeds_to_names = shared_utils.gtfs_utils_v2.schedule_daily_feed_to_gtfs_dataset_name(
    selected_date = "2022-06-01",
    get_df = True
)[["feed_key", "name"]].drop_duplicates()

feeds_to_names.head()

| | feed_key | name |
|---|---|---|
| 0 | 5efaa2460085a481db5dfbf57ae78187 | Kern Schedule |
| 1 | c50220b8622624dfa0c5c22859b14694 | Humboldt Schedule |
| 2 | 1b77ef49f5bc70038cbf15e4f5f98477 | Compton Schedule |
| 3 | 4b6b673ab50c016344c1adf09de2cc84 | Banning Pass Schedule |
| 4 | 7a7e9069dedca7a58e5a89aaa0a97256 | Bay Area 511 Santa Rosa CityBus Schedule |


In [37]:
# Also bringing in the CA counties boundaries from Exercise 4

ca_county = gpd.read_file('https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson')
type(ca_county)

geopandas.geodataframe.GeoDataFrame

In [29]:
type(feeds_to_names)

pandas.core.frame.DataFrame

In [26]:
OPERATORS = [
    "Alhambra Schedule", 
    "San Diego Schedule",
    "Big Blue Bus Schedule",
    "Culver City Schedule",
    "OmniTrans Schedule",
    "OCTA Schedule"
]

SUBSET_FEEDS = feeds_to_names[
    feeds_to_names.name.isin(OPERATORS)
].feed_key.tolist()

SUBSET_FEEDS
# Gives us the feed_key for each operator in OPERATORS, looked up in `feeds_to_names`

['41ee0151e3cac17098d055ce25b3f104',
 '239e56d11510f71d7182a24c5621be8c',
 '455fadac7ed63a72e7d3f36273d78313',
 'e7985c6c0c873f17871d79a527a50afa',
 'a3af905228efc93bb48f360b92965afb',
 'd76560b3dfecce2d588023bf1d1c4c2d']

In [5]:
stops = (
    tbls.mart_gtfs.fct_daily_scheduled_stops()
    >> filter(_.feed_key.isin(SUBSET_FEEDS))
    >> filter(_.service_date == "2022-06-01")
    >> select(_.feed_key, 
              _.stop_id, _.pt_geom)
    >> collect()
)



Check the type of `stops`. Is it a pandas df or geopandas gdf?

In [6]:
# Initial check of `stops` to see whether it's a pandas df or a gdf

type(stops)

# It's a pandas df

pandas.core.frame.DataFrame

In [7]:
# Turn stops into a gdf
geom = [shapely.wkt.loads(x) for x in stops.pt_geom]

stops = gpd.GeoDataFrame(
    stops, 
    geometry=geom, 
    crs="EPSG:4326"
).drop(columns="pt_geom")

Check the type of `stops`. Is it a pandas df or geopandas gdf?

What is the CRS and geometry column name?

In [8]:
print(type(stops))
print(stops.crs)
print(stops.shape)
print(stops.head())
print(stops.feed_key.value_counts())
# Now `stops` is a gdf
# The CRS is set to EPSG:4326
# The geometry column is named `geometry`

<class 'geopandas.geodataframe.GeoDataFrame'>
EPSG:4326
(13247, 3)
                           feed_key stop_id                     geometry
0  239e56d11510f71d7182a24c5621be8c    1017  POINT (-118.38411 34.05184)
1  239e56d11510f71d7182a24c5621be8c    1061  POINT (-118.39544 34.05224)
2  239e56d11510f71d7182a24c5621be8c    1089  POINT (-118.39499 34.04938)
3  239e56d11510f71d7182a24c5621be8c    1090  POINT (-118.39077 34.04875)
4  239e56d11510f71d7182a24c5621be8c    1092  POINT (-118.38664 34.04898)
e7985c6c0c873f17871d79a527a50afa    5317
455fadac7ed63a72e7d3f36273d78313    4226
a3af905228efc93bb48f360b92965afb    2277
239e56d11510f71d7182a24c5621be8c     916
41ee0151e3cac17098d055ce25b3f104     431
d76560b3dfecce2d588023bf1d1c4c2d      80
Name: feed_key, dtype: int64


### Bring in a new table from BigQuery

* In `mart_gtfs`, bring in the table called `fct_daily_scheduled_stops` for the subset of feeds defined above.
* Modify the snippet below to:
   * filter for the subset of operators
   * only keep columns: `feed_key`, `stop_id`, `stop_event_count`

In [27]:
#initial code snippet
#stop_counts = (
#    tbls.mart_gtfs.fct_daily_scheduled_stops()
#    >> filter(_.activity_date == "2022-06-01")
#)

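# Note: unlike the `stops` query above, no date filter is applied here,
# so the result appears to include rows for every service date available
# for these feeds (the same stop_id repeats in the output below).
stop_counts = (
    tbls.mart_gtfs.fct_daily_scheduled_stops()
    >> filter(_.feed_key.isin(SUBSET_FEEDS))
    >> select(_.feed_key, 
              _.stop_id, _.stop_event_count)
    >> collect()
)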





In [23]:
# `stop_counts` was initially a lazy siuba query; adding `>> collect()` runs it and returns a pandas df
print(type(stop_counts))
print(stop_counts.shape)
print(stop_counts.head())
print(stop_counts.feed_key.value_counts())

<class 'pandas.core.frame.DataFrame'>
(743023, 3)
                           feed_key stop_id  stop_event_count
0  41ee0151e3cac17098d055ce25b3f104       1               103
1  41ee0151e3cac17098d055ce25b3f104       1                88
2  41ee0151e3cac17098d055ce25b3f104       1                88
3  41ee0151e3cac17098d055ce25b3f104       1               103
4  41ee0151e3cac17098d055ce25b3f104       1                88
e7985c6c0c873f17871d79a527a50afa    320609
a3af905228efc93bb48f360b92965afb    207083
455fadac7ed63a72e7d3f36273d78313     93099
239e56d11510f71d7182a24c5621be8c     92786
d76560b3dfecce2d588023bf1d1c4c2d     22888
41ee0151e3cac17098d055ce25b3f104      6558
Name: feed_key, dtype: int64


### Aggregate
* Write a function to aggregate to the operator level or county level, add new columns for desired metrics.
* Merge in CA shapefile to get a gdf.
* Add another `geometry` column, called `centroid`, and grab the county's centroid.
* Refer to [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoDataFrame.set_geometry.html) to see how to pick which column to use as the `geometry` for the gdf, since technically, a gdf can handle multiple geometry columns.
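
A minimal sketch of the join, centroid, and `set_geometry` steps on toy data. The square "counties" and the stop points below are hypothetical, not real boundaries, and the toy data is declared in a projected CRS (EPSG:3310, California Albers) so that `.centroid` is computed in planar units. The real `ca_county` gdf may be in a geographic CRS, in which case projecting it with `to_crs` first avoids the GeoPandas centroid warning. The `predicate` keyword assumes a GeoPandas version that supports it (older releases used `op`).

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Hypothetical square "counties" and stop points (not real data)
counties = gpd.GeoDataFrame(
    {"COUNTY_NAME": ["County A", "County B"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:3310",
)
toy_stops = gpd.GeoDataFrame(
    {"stop_id": ["1", "2", "3"]},
    geometry=[Point(0.5, 0.5), Point(1.5, 1.0), Point(3.0, 1.0)],
    crs="EPSG:3310",
)

# Spatial join: attach the county each stop falls within
stops_with_county = gpd.sjoin(
    toy_stops, counties, how="inner", predicate="within"
)

# Count stops per county, then merge onto the county polygons.
# An inner merge keeps only counties that have stops.
stops_per_county = (
    stops_with_county.groupby("COUNTY_NAME")
    .agg(n_stops=("stop_id", "nunique"))
    .reset_index()
)
county_gdf = counties.merge(stops_per_county, on="COUNTY_NAME", how="inner")

# Add a second geometry column holding each county's centroid
county_gdf["centroid"] = county_gdf.geometry.centroid

# The polygons remain the active geometry until set_geometry is called
centroid_gdf = county_gdf.set_geometry("centroid")

print(county_gdf.geometry.name)
print(centroid_gdf.geometry.name)
```

`set_geometry` returns a new gdf, so `county_gdf` still plots polygons while `centroid_gdf` plots points; both carry the same attribute columns.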

In [35]:
#listing out all df/gdf columns

print(list(feeds_to_names.columns))
print(list(stops.columns))
print(list(stop_counts.columns))
print(list(ca_county.columns))

['feed_key', 'name']
['feed_key', 'stop_id', 'geometry']
['feed_key', 'stop_id', 'stop_event_count']
['OBJECTID', 'COUNTY_NAME', 'COUNTY_ABBREV', 'COUNTY_NUM', 'COUNTY_CODE', 'COUNTY_FIPS', 'ISLAND', 'Shape__Area', 'Shape__Length', 'GlobalID', 'geometry']


### Methodology
What's the average number of trips per stop by operators in southern California? Show visualizations at the operator and county-level.

Table structure:
* county
* operator name
* count of `stop_id`


Potential new columns:
* number of trips
* number of stops
* average trips per stop
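
One possible shape for the aggregation function, sketched on made-up stop-level rows. The function name `aggregate_trips_per_stop`, the `county` column, and the sample values are hypothetical. Treating `stop_event_count` as the number of trips serving a stop is an assumption, as is having one row per stop (a single service date). Stops are counted by `feed_key` plus `stop_id`, since different operators may reuse the same `stop_id`.

```python
import pandas as pd


def aggregate_trips_per_stop(df: pd.DataFrame, group_cols: list) -> pd.DataFrame:
    """Aggregate stop-level rows to the level given by group_cols.

    Assumes columns feed_key, stop_id, stop_event_count are present
    and that each row is one stop on one service date.
    """
    # Build a key that is unique across operators
    df = df.assign(stop_key=df.feed_key + "_" + df.stop_id.astype(str))

    agg = (
        df.groupby(group_cols)
        .agg(
            n_trips=("stop_event_count", "sum"),
            n_stops=("stop_key", "nunique"),
        )
        .reset_index()
    )
    # Dividing totals gives a stop-weighted average at any grouping level
    agg["trips_per_stop"] = agg.n_trips / agg.n_stops
    return agg


# Hypothetical stop-level data (made-up values)
stop_level = pd.DataFrame({
    "county": ["Los Angeles"] * 3 + ["Orange"] * 2,
    "name": ["Operator A", "Operator A", "Operator B", "Operator C", "Operator C"],
    "feed_key": ["a", "a", "b", "c", "c"],
    "stop_id": ["1", "2", "1", "1", "2"],
    "stop_event_count": [10, 20, 6, 8, 12],
})

by_operator = aggregate_trips_per_stop(stop_level, ["county", "name"])
by_county = aggregate_trips_per_stop(stop_level, ["county"])
print(by_operator)
print(by_county)
```

The same function serves both levels by changing `group_cols`, which keeps the operator and county tables consistent with each other.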

### Visualizations
* Make one chart for comparing trips per stop by operators, and another chart for comparing it by counties. Use a function to do this.
* Make 1 map for comparing trips per stop by counties. Use `gdf.explore()` to do this.
* Visualizations should follow the Cal-ITP style guide: [styleguide example notebook](https://github.com/cal-itp/data-analyses/blob/main/example_report/style-guide-examples.ipynb)
* More on `folium` and `ipyleaflet`: https://github.com/jorisvandenbossche/geopandas-tutorial/blob/master/05-more-on-visualization.ipynb

In [None]:
# To add styleguide
from shared_utils import styleguide
from shared_utils import calitp_color_palette as cp