## Exercise 6: Shared utility functions, data catalogs

Skills: 
* Import shared utils
* Data catalog
* Use functions to repeat certain data cleaning steps

References: 
* https://docs.calitp.org/data-infra/analytics_new_analysts/02-data-analysis-intermediate.html
* https://docs.calitp.org/data-infra/analytics_tools/python_libraries.html
* https://docs.calitp.org/data-infra/analytics_tools/data_catalogs.html

In [1]:
import geopandas as gpd
import intake
import pandas as pd

# Hint: if this doesn't import, refer to the docs for the correct setup:
# cd into the _shared_utils folder and run the `make setup_env` command
import shared_utils



## Create a data catalog

* Include one geospatial data source and one tabular (they should be related...your analysis depends on combining them)
* Import your datasets using the catalog method
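
For reference, a catalog file for this exercise might look roughly like the sketch below. The source names `stops` and `ca_counties` match the ones read later in this notebook. The drivers, file paths, and descriptions are assumptions for illustration (the `geojson` driver assumes the `intake_geopandas` plugin is installed); they are not the contents of the actual catalog file.

```python
from pathlib import Path
import tempfile

# Hypothetical catalog contents: paths and drivers are placeholders,
# not the real catalog used in this notebook.
SAMPLE_CATALOG = """\
metadata:
  version: 1
sources:
  stops:
    driver: csv
    description: Tabular stop data with stop_lat / stop_lon columns (placeholder path).
    args:
      urlpath: ./stops.csv
  ca_counties:
    driver: geojson
    description: California county polygons (placeholder path).
    args:
      urlpath: ./ca_counties.geojson
"""

# Write the sketch to a temporary folder so nothing in the repo is overwritten.
catalog_path = Path(tempfile.mkdtemp()) / "sample-catalog.yml"
catalog_path.write_text(SAMPLE_CATALOG)
print(catalog_path.read_text())
```

With real paths filled in, opening a file like this with `intake.open_catalog(...)` should expose the two sources as `catalog.stops` and `catalog.ca_counties`.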

In [2]:
#catalog: see the `christian_ex6_catalog.yml` file
#code sample: catalog = intake.open_catalog("./sample-catalog.yml")

catalog = intake.open_catalog("./christian_ex6_catalog.yml")

catalog

christian_ex6_catalog:
  args:
    path: ./christian_ex6_catalog.yml
  description: ''
  driver: intake.catalog.local.YAMLFileCatalog
  metadata:
    version: 1


In [3]:
#importing datasets using intake catalog
stops = catalog.stops.read()
ca_county = catalog.ca_counties.read()


## Combine datasets
* Do a merge or spatial join to combine the geospatial and tabular data
* Create a new column of a summary statistic to visualize
* Rely on `shared_utils` to do at least one operation (aggregation, re-projecting to a different CRS, exporting geoparquet, etc)

In [None]:
#checking what's in `stops`
stops.head()

In [None]:
#checking to see what's in `ca_county`
ca_county.head()

### Method

For `stops`
* Drop the unnecessary column
* Use shared utils to create point geometry and set the CRS to EPSG:2229

For `ca_county`
* Reproject to EPSG:2229

For the spatial join
* Ensure the geometry columns for stops and county use the same CRS
* Put county on the left and join using `inner`
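
The steps above can be sketched on toy data as follows. This is a minimal illustration that uses plain geopandas calls in place of the shared utils helper. The rectangle standing in for a county is made up, and the `toy_*` names are hypothetical. It also shows that a spatial join works across different geometry types (polygons and points) as long as both layers share a CRS.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Toy tabular data: one stop, using coordinates from the stops table in this notebook
toy_stops = pd.DataFrame(
    {"stop_id": ["70011"], "stop_lat": [37.77639], "stop_lon": [-122.394992]}
)

# Build point geometry in WGS84 (lon/lat), then reproject to EPSG:2229
toy_stops_gdf = gpd.GeoDataFrame(
    toy_stops,
    geometry=gpd.points_from_xy(toy_stops.stop_lon, toy_stops.stop_lat),
    crs="EPSG:4326",
).to_crs("EPSG:2229")

# Toy polygon layer: a made-up rectangle around the stop, not a real county boundary
toy_county = gpd.GeoDataFrame(
    {"COUNTY_NAME": ["Toy County"]},
    geometry=[box(-122.6, 37.6, -122.2, 37.9)],
    crs="EPSG:4326",
).to_crs("EPSG:2229")

# Both layers now share a CRS, so polygons on the left can be joined to points
toy_join = gpd.sjoin(toy_county, toy_stops_gdf, how="inner", predicate="intersects")
print(toy_join[["COUNTY_NAME", "stop_id"]])
```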


In [4]:
#drop the unnamed column from `stops`

stops = stops.drop(columns=['Unnamed: 0'])
stops.head()

,feed_key,stop_id,stop_lat,stop_lon,stop_name
0,25c6505166c01099b2f6f2de173e20b9,22nd_street,37.756972,-122.392492,22nd Street
1,25c6505166c01099b2f6f2de173e20b9,2537740,37.438491,-122.156405,Stanford Caltrain Station
2,25c6505166c01099b2f6f2de173e20b9,2537744,37.438425,-122.156482,Stanford Caltrain Station
3,25c6505166c01099b2f6f2de173e20b9,70011,37.77639,-122.394992,San Francisco Caltrain Station
4,25c6505166c01099b2f6f2de173e20b9,70012,37.776348,-122.394935,San Francisco Caltrain Station


In [5]:
#use shared utils to create point geometry, then check that the result is a gdf
#stops_ptg = geography_utils.create_point_geometry(
#    stops_gdf,
#    "stop_lon",
#    "stop_lat",
#    crs = "EPSG:2229"
#)

from calitp_data_analysis import geography_utils

stops_ptg = geography_utils.create_point_geometry(
    stops,
    'stop_lon',
    'stop_lat',
    crs = 'EPSG:2229'
)
stops_ptg.head()

,feed_key,stop_id,stop_lat,stop_lon,stop_name,geometry
0,25c6505166c01099b2f6f2de173e20b9,22nd_street,37.756972,-122.392492,22nd Street,POINT (5290484.166 3218221.779)
1,25c6505166c01099b2f6f2de173e20b9,2537740,37.438491,-122.156405,Stanford Caltrain Station,POINT (5353967.719 3099309.691)
2,25c6505166c01099b2f6f2de173e20b9,2537744,37.438425,-122.156482,Stanford Caltrain Station,POINT (5353944.533 3099286.476)
3,25c6505166c01099b2f6f2de173e20b9,70011,37.77639,-122.394992,San Francisco Caltrain Station,POINT (5290070.611 3225326.968)
4,25c6505166c01099b2f6f2de173e20b9,70012,37.776348,-122.394935,San Francisco Caltrain Station,POINT (5290086.422 3225310.947)


In [None]:
#checking type of df for `stops_ptg` after creating it. looking for gdf
type(stops_ptg)

In [None]:
#checking for CRS for `stops_ptg`. looking for 2229
stops_ptg.crs

In [None]:
#ensuring ca_county is gdf
type(ca_county)

In [None]:
#current CRS is set to EPSG:4326
ca_county.crs

In [None]:
#test reprojecting to another CRS, EPSG:2229
county2229 = ca_county.to_crs('EPSG:2229')

In [None]:
#crs now set to 2229
county2229.crs

In [None]:
#checking columns to see what we got
county2229.head()

#noticed that the county2229 geometry column contains multipolygons. expected sjoin not to work,
#but geometry types may differ in a spatial join as long as both layers share a CRS

In [None]:
stops_ptg.head()
#the geometry column holds points, and the coordinates look correct

In [None]:
#spatial join of county2229 and stops_ptg via inner join
sjoin = gpd.sjoin(county2229, stops_ptg, how='inner')

sjoin.head()
#join worked and produced results

In [None]:
#checking to see what sjoin looks like
print(sjoin.shape)
print(sjoin.OBJECTID.notna().value_counts())
print(sjoin.feed_key.notna().value_counts())

#checking for any blank values. true = not blank

In [None]:
#checking to see if sjoin plots anything

sjoin.plot('stop_id')

#sjoin plots something

In [None]:
#testing what explore() looks like

#TAKES A LONG TIME TO RUN, AND MAKES LOCAL SAVING FAIL.

# sjoin.explore('COUNTY_NAME')

In [None]:
#add a new column that can aggregate some detail like `count of stops per county`

#sjoin['stopID_count_per_feedkey'] = sjoin.assign(sjoin.feed_key.count())

sjoin_agg = sjoin.groupby('COUNTY_NAME').stop_id.count().reset_index()

sjoin_agg

In [None]:
#test of using groupby with multiple column criteria
sjoin_agg2 = sjoin.groupby(['COUNTY_NAME', 'feed_key']).agg({
    'stop_id':'count'
}).reset_index().rename(columns = {'stop_id':'stop_id_count'})
sjoin_agg2

In [None]:
#alternate test: count stop_id per county, then insert the result into sjoin as a new col
#`transform('count')` returns one value per original row (the count for that row's group), so it lines up with sjoin's index
#see the example on the pandas page 'https://pandas.pydata.org/docs/reference/api/pandas.Series.transform.html'
sjoin['stop_id_count'] = sjoin.groupby('COUNTY_NAME')['stop_id'].transform('count')
sjoin.head()
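
To make the difference between aggregating and transforming concrete, here is a small sketch on made-up data. The `toy` and `per_county` names are hypothetical; only the column names mirror the joined table above.

```python
import pandas as pd

# Toy data standing in for the joined table; values are made up
toy = pd.DataFrame(
    {
        "COUNTY_NAME": ["A", "A", "B"],
        "stop_id": ["s1", "s2", "s3"],
    }
)

# Aggregation: one row per county
per_county = (
    toy.groupby("COUNTY_NAME")["stop_id"].count().reset_index(name="stop_id_count")
)

# Transform: one value per original row, aligned to the original index
toy["stop_id_count"] = toy.groupby("COUNTY_NAME")["stop_id"].transform("count")

print(per_county)
print(toy)
```

The aggregated table shrinks to one row per group, while the transformed column keeps the original row count and repeats each group's count on every row of that group.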

## Use functions to do parameterized visualizations
* Use a function to create your chart
* Within the function, the colors should use the Cal-ITP theme that is available in `styleguide`
* Within the function, there should be at least 1 parameter that changes (ex: chart title reflects the correct county, legend title reflects the correct county, etc)
* Produce 3 charts, using your function each time, and have the function correctly insert the parameters 

In [None]:
#let's try altair from ex 3, but using sjoin

import altair as alt

def bar_chart(df, x_col, y_col):
    x_title = x_col.title()
    
    chart = (alt.Chart(df)
             .mark_bar()
             .encode(
                 x=alt.X(x_col, title=x_title),
                 y=alt.Y(y_col, title=""),
             )
            )
    return chart
    

In [None]:
#bar_chart function has been made. Now I can reuse the same function with other dataframes, INSTEAD OF REWRITING INDIVIDUAL CODE FOR EACH DATAFRAME!!

#bar_chart(sjoin, 'COUNTY_NAME', 'SHAPE_Area')

#COMMENTING OUT CHARTS TO SAVE SPACE!

In [None]:
#testing styleguide

from shared_utils import styleguide
from shared_utils import calitp_color_palette as cp

In [None]:
#Test creating a function to make a chart
def plot_bar(df, x_col, y_col):
    #return the Axes so the caller can customize the chart further
    return df.plot.bar(x=x_col, y=y_col)

In [None]:
#testing `plot_bar` function
#plot_bar(sjoin,'COUNTY_NAME', 'stop_id_count')

#it's ugly but `plot_bar` function makes a chart!!!!

#COMMENTING OUT CHARTS TO SAVE SPACE!

In [None]:
#old way
#sjoin.plot.bar(x='COUNTY_NAME', y='stop_id_count')

#COMMENTING OUT CHARTS TO SAVE SPACE!