## Exercise 6: Shared utility functions, data catalogs

Skills: 
* Import shared utils
* Data catalog
* Use functions to repeat certain data cleaning steps

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html
* https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html
* https://docs.calitp.org/data-infra/analytics_tools/data_catalogs.html

In [1]:
import geopandas as gpd
import intake
import pandas as pd

# Hint: if this doesn't import, refer to the docs for the correct setup:
# cd into the _shared_utils folder and run the make setup_env command
import shared_utils




## Create a data catalog

* Include one geospatial data source and one tabular (they should be related...your analysis depends on combining them)
* Import your datasets using the catalog method

#geospatial data source
#CA county data again
#code snippet
    #test_geojson:
    #    driver: geojson
    #    description: Description
    #    args:
    #      urlpath: gs://calitp-analytics-data/test_geojson_file.geojson
    #      use_fsspec: true

ca_county:
    driver: geojson
    description: california counties
    args:
      urlpath: https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson
      use_fsspec: true


In [12]:
county_ext = gpd.read_file('https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson')
type(county_ext)

geopandas.geodataframe.GeoDataFrame

In [None]:
#tabular data source (csv, parquet)
#stops again from Ex4?


import os

#os.environ["CALITP_BQ_MAX_BYTES"] = str(100_000_000_000)
pd.set_option("display.max_rows", 20)

from calitp_data_analysis.tables import tbls
from calitp_data_analysis.sql import query_sql
from siuba import *

FEEDS = [
    "25c6505166c01099b2f6f2de173e20b9", # Caltrain
    "52639f09eb535f75b33d2c6a654cb89e", # Merced
]

stops = (
    tbls.mart_gtfs.dim_stops()
    >> filter(_.feed_key.isin(FEEDS))
    >> select(_.feed_key, _.stop_id, 
             _.stop_lat, _.stop_lon, _.stop_name)
    >> arrange(_.feed_key, _.stop_id, 
               _.stop_lat, _.stop_lon)
    >> collect()
)

In [None]:
#save stops to my GCS folder
stops.to_csv('gs://calitp-analytics-data/data-analyses/csuyat_folder/stops_caltrain_merced.csv')

#code snippet
    #test_csv:
    #    driver: csv
    #    description: Description
    #    args:
    #      urlpath: https://raw.githubusercontent.com/CityOfLosAngeles/covid19-indicators/master/data/ca_county_pop_crosswalk.csv
    
stops:
    driver: csv
    description: stops in merced and caltrain
    args:
            urlpath: gs://calitp-analytics-data/data-analyses/csuyat_folder/stops_caltrain_merced.csv
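
The two entries above are fragments. A minimal sketch of how they might sit together in one catalog file follows, assuming intake's YAML layout in which entries are nested under a top-level `sources:` key. The file name `example_catalog.yml` and the temporary directory are placeholders, not the file used in this notebook.

```python
import tempfile
from pathlib import Path

# Both catalog entries from this notebook, nested under `sources:`.
# Assumption: intake's YAML catalog layout with a top-level `sources:` key.
CATALOG_TEXT = """\
metadata:
  version: 1
sources:
  ca_county:
    driver: geojson
    description: california counties
    args:
      urlpath: https://services1.arcgis.com/jUJYIo9tSA7EHvfZ/arcgis/rest/services/California_County_Boundaries/FeatureServer/0/query?outFields=*&where=1%3D1&f=geojson
      use_fsspec: true
  stops:
    driver: csv
    description: stops in merced and caltrain
    args:
      urlpath: gs://calitp-analytics-data/data-analyses/csuyat_folder/stops_caltrain_merced.csv
"""

# Write the catalog to a placeholder path (hypothetical file name).
catalog_path = Path(tempfile.mkdtemp()) / "example_catalog.yml"
catalog_path.write_text(CATALOG_TEXT)

# Once the file is opened, each source name becomes an attribute, e.g.:
#   catalog = intake.open_catalog(str(catalog_path))
#   stops = catalog.stops.read()
print(catalog_path.read_text())
```

Keeping the argument names spelled exactly as the driver expects (for example `use_fsspec`) matters, since a misspelled key may be passed along or ignored without a clear message.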

In [6]:
#catalog
#code sample: catalog = intake.open_catalog("./sample-catalog.yml")

catalog = intake.open_catalog("./christian_ex6_catalog.yml")

catalog

christian_ex6_catalog:
  args:
    path: ./christian_ex6_catalog.yml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata:
    version: 1


## Combine datasets
* Do a merge or spatial join to combine the geospatial and tabular data
* Create a new column of a summary statistic to visualize
* Rely on `shared_utils` to do at least one operation (aggregation, re-projecting to a different CRS, exporting geoparquet, etc)
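
One way to create the summary-statistic column is a grouped count attached back onto each row. The sketch below uses a small made-up table standing in for the joined data; the county assignments and the `n_stops` column name are illustrative assumptions, not results from this notebook.

```python
import pandas as pd

# Made-up stand-in for a joined stops/county table (illustrative rows only).
joined = pd.DataFrame({
    "COUNTY_NAME": ["San Francisco", "San Francisco", "Santa Clara"],
    "stop_id": ["22nd_street", "70011", "2537740"],
})

# New column: number of distinct stops in each row's county.
joined["n_stops"] = joined.groupby("COUNTY_NAME")["stop_id"].transform("nunique")

# One row per county, convenient for charting.
county_summary = (
    joined[["COUNTY_NAME", "n_stops"]].drop_duplicates().reset_index(drop=True)
)
print(county_summary)
```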

In [8]:
# getting error with ca_county catalog
stops = catalog.stops.read()
ca_county = catalog.ca_county.read()


IndexError: list index out of range

Method

For stops
* drop the unnecessary column
* use shared utils to create point geometry and set the CRS to EPSG:2229

For ca_county or county_ext
* reproject to EPSG:2229

For sjoin
* ensure the geometry columns for stops and county match (same name and CRS)
* put county on the left and join using inner


In [13]:
stops.head()

Unnamed: 0.1,Unnamed: 0,feed_key,stop_id,stop_lat,stop_lon,stop_name
0,0,25c6505166c01099b2f6f2de173e20b9,22nd_street,37.756972,-122.392492,22nd Street
1,1,25c6505166c01099b2f6f2de173e20b9,2537740,37.438491,-122.156405,Stanford Caltrain Station
2,2,25c6505166c01099b2f6f2de173e20b9,2537744,37.438425,-122.156482,Stanford Caltrain Station
3,3,25c6505166c01099b2f6f2de173e20b9,70011,37.77639,-122.394992,San Francisco Caltrain Station
4,4,25c6505166c01099b2f6f2de173e20b9,70012,37.776348,-122.394935,San Francisco Caltrain Station
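
The `Unnamed: 0` column in this table most likely comes from `to_csv`, which writes the DataFrame index as an unnamed first column by default. A small sketch of the round trip, using an in-memory buffer in place of the GCS path and two stops from this notebook, shows the effect and how `index=False` avoids it.

```python
import io

import pandas as pd

df = pd.DataFrame(
    {"stop_id": ["70011", "70012"], "stop_lat": [37.77639, 37.776348]}
)

# Default: the index is written as an unnamed first column.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_no_index = pd.read_csv(without_index).columns.tolist()

print(cols_default)
print(cols_no_index)
```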


In [15]:
type(stops)

pandas.core.frame.DataFrame

In [None]:
#use shared_utils to create point geometry, check for gdf status
#stops_ptg = geography_utils.create_point_geometry(
#    stops_gdf,
#    "stop_lon",
#    "stop_lat",
#    crs = "EPSG:2229"
#)


In [21]:
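#stops is still a pandas df
#convert table to a gdf


stops_gdf = gpd.GeoDataFrame(stops)

#then turn lat/lon into point geometry
#note: crs= only labels the coordinates, it does not reproject them;
#stop_lon/stop_lat are in degrees, so the points keep degree values under the EPSG:2229 label
geo = gpd.points_from_xy(stops_gdf.stop_lon, stops_gdf.stop_lat, crs='EPSG:2229')
stops_gdf['geometry'] = geo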





Unnamed: 0.1,Unnamed: 0,feed_key,stop_id,stop_lat,stop_lon,stop_name,geometry
0,0,25c6505166c01099b2f6f2de173e20b9,22nd_street,37.756972,-122.392492,22nd Street,POINT (-122.392 37.757)
1,1,25c6505166c01099b2f6f2de173e20b9,2537740,37.438491,-122.156405,Stanford Caltrain Station,POINT (-122.156 37.438)
2,2,25c6505166c01099b2f6f2de173e20b9,2537744,37.438425,-122.156482,Stanford Caltrain Station,POINT (-122.156 37.438)
3,3,25c6505166c01099b2f6f2de173e20b9,70011,37.77639,-122.394992,San Francisco Caltrain Station,POINT (-122.395 37.776)
4,4,25c6505166c01099b2f6f2de173e20b9,70012,37.776348,-122.394935,San Francisco Caltrain Station,POINT (-122.395 37.776)


In [22]:
#drop `unnamed: 0` column

stops_gdf = stops_gdf.drop(['Unnamed: 0'], axis=1)
stops_gdf.head()

Unnamed: 0,feed_key,stop_id,stop_lat,stop_lon,stop_name,geometry
0,25c6505166c01099b2f6f2de173e20b9,22nd_street,37.756972,-122.392492,22nd Street,POINT (-122.392 37.757)
1,25c6505166c01099b2f6f2de173e20b9,2537740,37.438491,-122.156405,Stanford Caltrain Station,POINT (-122.156 37.438)
2,25c6505166c01099b2f6f2de173e20b9,2537744,37.438425,-122.156482,Stanford Caltrain Station,POINT (-122.156 37.438)
3,25c6505166c01099b2f6f2de173e20b9,70011,37.77639,-122.394992,San Francisco Caltrain Station,POINT (-122.395 37.776)
4,25c6505166c01099b2f6f2de173e20b9,70012,37.776348,-122.394935,San Francisco Caltrain Station,POINT (-122.395 37.776)


In [23]:
type(stops_gdf)

geopandas.geodataframe.GeoDataFrame

In [26]:
county_ext.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [27]:
#change crs of `county_ext` to 2229
# `to_crs` reprojects the coordinates and returns a new GeoDataFrame;
# `set_crs` only assigns a CRS label without transforming the coordinates

county2229 = county_ext.to_crs('EPSG:2229')

In [28]:
county2229.crs

<Projected CRS: EPSG:2229>
Name: NAD83 / California zone 5 (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - California - counties Kern; Los Angeles; San Bernardino; San Luis Obispo; Santa Barbara; Ventura.
- bounds: (-121.42, 32.76, -114.12, 35.81)
Coordinate Operation:
- name: SPCS83 California zone 5 (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
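
The difference between labeling a CRS and reprojecting into one is easy to miss. A minimal sketch with a single stop from this notebook follows; it assumes a geopandas version that provides `set_crs` and `to_crs` on a `GeoSeries`.

```python
import geopandas as gpd
from shapely.geometry import Point

# One Caltrain stop from this notebook, as longitude/latitude in degrees.
pt = gpd.GeoSeries([Point(-122.392492, 37.756972)])

# set_crs only attaches a label: the numbers stay in degrees.
labeled = pt.set_crs("EPSG:4326")

# to_crs transforms the coordinates into the target CRS (US survey feet here).
projected = labeled.to_crs("EPSG:2229")

print(labeled.iloc[0].x, projected.iloc[0].x)
```

After `set_crs` the x value is unchanged, while after `to_crs` it is a large easting in feet, comparable to the polygon coordinates shown in `county2229.head()`.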

In [33]:
county2229.head()

Unnamed: 0,OBJECTID,COUNTY_NAME,COUNTY_ABBREV,COUNTY_NUM,COUNTY_CODE,COUNTY_FIPS,ISLAND,Shape__Area,Shape__Length,GlobalID,geometry
0,1,Alameda,ALA,1,1,1,,3402787000.0,308998.650766,e6f92268-d2dd-4cfb-8b79-5b4b2f07c559,"POLYGON ((5327843.636 3270649.517, 5328125.534..."
1,2,Alpine,ALP,2,2,3,,3146939000.0,274888.492411,870479b2-480a-494b-8352-ad60578839c1,"POLYGON ((6107872.113 3543254.346, 6107910.593..."
2,3,Amador,AMA,3,3,5,,2562635000.0,361708.438013,4f45b3a6-be10-461c-8945-6b2aaa7119f6,"POLYGON ((5968863.196 3541606.326, 5968719.695..."
3,4,Butte,BUT,4,4,7,,7339348000.0,526547.115238,44fba680-aecc-4e04-a499-29d69affbd4a,"POLYGON ((5691714.322 3875581.366, 5690777.192..."
4,5,Calaveras,CAL,5,5,9,,4351069000.0,370637.578323,d11ef739-4a1e-414e-bfd1-e7dcd56cd61e,"POLYGON ((5982508.552 3443893.703, 5982516.557..."


In [None]:
stops_gdf = s

In [36]:
stops_gdf.crs

<Projected CRS: EPSG:2229>
Name: NAD83 / California zone 5 (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - California - counties Kern; Los Angeles; San Bernardino; San Luis Obispo; Santa Barbara; Ventura.
- bounds: (-121.42, 32.76, -114.12, 35.81)
Coordinate Operation:
- name: SPCS83 California zone 5 (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

In [34]:
stops_gdf.head()

Unnamed: 0,feed_key,stop_id,stop_lat,stop_lon,stop_name,geometry
0,25c6505166c01099b2f6f2de173e20b9,22nd_street,37.756972,-122.392492,22nd Street,POINT (-122.392 37.757)
1,25c6505166c01099b2f6f2de173e20b9,2537740,37.438491,-122.156405,Stanford Caltrain Station,POINT (-122.156 37.438)
2,25c6505166c01099b2f6f2de173e20b9,2537744,37.438425,-122.156482,Stanford Caltrain Station,POINT (-122.156 37.438)
3,25c6505166c01099b2f6f2de173e20b9,70011,37.77639,-122.394992,San Francisco Caltrain Station,POINT (-122.395 37.776)
4,25c6505166c01099b2f6f2de173e20b9,70012,37.776348,-122.394935,San Francisco Caltrain Station,POINT (-122.395 37.776)


In [35]:
stops_gdf.geometry.name

'geometry'

In [32]:
#spatial join of stops_gdf and county2229

sjoin = gpd.sjoin(county2229, stops_gdf, how='inner')

sjoin

Unnamed: 0,OBJECTID,COUNTY_NAME,COUNTY_ABBREV,COUNTY_NUM,COUNTY_CODE,COUNTY_FIPS,ISLAND,Shape__Area,Shape__Length,GlobalID,geometry,index_right,feed_key,stop_id,stop_lat,stop_lon,stop_name
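
The join above returns only column headers, that is, no rows. A likely cause is that the stop points still hold degree values (for example `POINT (-122.392 37.757)`) under an EPSG:2229 label, while the county polygons hold coordinates in feet, so nothing intersects. A minimal sketch of an alternative follows: label the points as EPSG:4326 first, then reproject. The rectangle `sf_box` is a hypothetical stand-in for a county polygon, not real boundary data.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Two stops from this notebook, with coordinates in degrees.
stops = pd.DataFrame({
    "stop_id": ["22nd_street", "70011"],
    "stop_lat": [37.756972, 37.77639],
    "stop_lon": [-122.392492, -122.394992],
})

# Label the points with the CRS they are actually in, then reproject.
stops_gdf = gpd.GeoDataFrame(
    stops,
    geometry=gpd.points_from_xy(stops.stop_lon, stops.stop_lat),
    crs="EPSG:4326",
).to_crs("EPSG:2229")

# Hypothetical rectangle around the stops, reprojected the same way.
area = gpd.GeoDataFrame(
    {"name": ["sf_box"]},
    geometry=[box(-122.52, 37.70, -122.35, 37.83)],
    crs="EPSG:4326",
).to_crs("EPSG:2229")

# Both layers now share units and CRS, so the points fall inside the polygon.
joined = gpd.sjoin(area, stops_gdf, how="inner")
print(joined[["name", "stop_id"]])
```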


## Use functions to do parameterized visualizations
* Use a function to create your chart
* Within the function, the colors should use the Cal-ITP theme that is available in `styleguide`
* Within the function, there should be at least 1 parameter that changes (ex: chart title reflects the correct county, legend title reflects the correct county, etc)
* Produce 3 charts, using your function each time, and have the function correctly insert the parameters 