# Exercise 9: series

* pandas series vs numpy arrays [explanation](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)

### Common series operations
These are the most common series operations we use. Refer to the `pandas` docs for even more!

* Getting dates, hours, minutes from datetime types (`df.datetime_col.dt.date`)
* Parsing strings (`df.string_col.str.split('_')`)
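
A minimal sketch of these two accessors on a toy dataframe. The column names mirror the placeholders above (`datetime_col`, `string_col`) and the values are made up for illustration.

```python
import pandas as pd

# Toy data: the values are hypothetical, the column names match the bullets above
df = pd.DataFrame({
    "datetime_col": pd.to_datetime(["2023-01-18 08:15:00", "2023-01-18 17:40:00"]),
    "string_col": ["route_10", "route_22"],
})

# The .dt accessor exposes datetime parts for the whole series at once
dates = df.datetime_col.dt.date
hours = df.datetime_col.dt.hour
minutes = df.datetime_col.dt.minute

# The .str accessor applies string methods element-wise;
# split returns a list per row, and .str[1] picks the second piece
parts = df.string_col.str.split("_")
route_number = parts.str[1]
```

Each result is itself a series with the same index as `df`, so it can be attached back as a new column with `assign`.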

### Common geoseries operations
These are the most common. Refer to the `geopandas` docs for even more!

* `distance` between 2 points or a point to a polygon or line [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.distance.html)
* `intersects`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.intersects.html)
* `within`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.within.html)
* `contains`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.contains.html)
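
As a rough illustration of these methods, here is a sketch with made-up geometries (a square and three points, no CRS set). Each method runs once per row of the geoseries on the left.

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical geometries in arbitrary planar units
square = Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])
points = gpd.GeoSeries([Point(5, 5), Point(15, 5), Point(10, 5)])

# Each call compares every point against the single polygon
dist = points.distance(square)          # 0 for points in or on the square
is_within = points.within(square)       # a point on the boundary is not "within"
does_intersect = points.intersects(square)

# contains is the mirror image of within: polygon on the left, point on the right
squares = gpd.GeoSeries([square])
square_contains = squares.contains(Point(5, 5))
```

The boolean results can be used directly as a mask, for example `points[is_within]`.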

In fact, we've often used geoseries methods without realizing it. Whenever we create a new column that stores a line's length or a polygon's area, we are pulling out `gdf.geometry` (a geoseries), calling a method on it, and adding the result back as a new column.

For calculations like `length`, `area`, and `distance`, we need to use a projected CRS that has units like meters or feet. We cannot use decimal degrees (do not use WGS 84 / EPSG:4326)! Distance calculations must be done only once the spherical 3D Earth has been converted into a 2D plane.

* `length`: get the length of a line (`gdf.geometry.length`)
* `area`: get the area of a polygon (`gdf.geometry.area`)
* `centroid`: get the centroid of a polygon (`gdf.geometry.centroid`)
* `x`: get the x coordinate of a point (`gdf.geometry.x`)
* `y`: get the y coordinate of a point (`gdf.geometry.y`)
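
A small sketch of these attributes, using made-up geometries that are labeled as EPSG:2229 (units of feet) so the results are in feet and square feet. The last lines show `to_crs` converting a lon/lat point into that projected CRS. The coordinates are hypothetical.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point, Polygon

# Toy geometries already in a projected CRS (EPSG:2229, US survey feet)
lines = gpd.GeoSeries([LineString([(0, 0), (3, 4)])], crs="EPSG:2229")
polygons = gpd.GeoSeries(
    [Polygon([(0, 0), (4, 0), (4, 2), (0, 2)])], crs="EPSG:2229"
)

line_length = lines.length        # 5.0 feet
polygon_area = polygons.area      # 8.0 square feet
centroids = polygons.centroid     # a geoseries of points
centroid_x = centroids.x          # 2.0
centroid_y = centroids.y          # 1.0

# A lon/lat point must be projected before measuring anything
lonlat = gpd.GeoSeries([Point(-118.24, 34.05)], crs="EPSG:4326")
projected = lonlat.to_crs("EPSG:2229")
```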

### Arrays
* Occasionally, we may even use arrays, especially when the datasets get even larger but we have simple mathematical calculations
* If we need to apply an exponential decay function to a distance column, we essentially want to multiply `distance` by some number
* Since this exponential decay function is somewhat custom and requires us to write our own formula, we would extract the column as a series (`df.distance`) and multiply each value by some other number.
* Even quicker is to use `numpy` with `distance_array = np.array(df.distance)` and get `exponential_array = distance_array*some_number`
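
A minimal sketch of the array approach. The decay formula `exp(-decay_rate * distance)` and the value of `decay_rate` are assumptions chosen for illustration; substitute whatever formula the analysis calls for.

```python
import numpy as np
import pandas as pd

# Toy distances in feet (hypothetical values)
df = pd.DataFrame({"distance": [0.0, 1000.0, 5000.0]})

# Hypothetical decay parameter, not taken from any real analysis
decay_rate = 0.001

# Pull the column out as a numpy array and do the math on the whole array at once
distance_array = df.distance.to_numpy()
weight_array = np.exp(-decay_rate * distance_array)

# Arrays have no index, so they attach back to the df by position
df = df.assign(decay_weight=weight_array)
```

Because an array is matched by position rather than by index, this only works while the row order of `df` is unchanged between extracting the array and assigning the result.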

In [None]:
import geopandas as gpd
import intake
import numpy as np
import pandas as pd

catalog = intake.open_catalog(
    "../_shared_utils/shared_utils/shared_data_catalog.yml")

If you're asking how far is a transit stop from the interstate, you'd want the distance of every point (every row) compared to an interstate highway geometry.

Let's prep the datasets to use series / geoseries to do this.

In [None]:
stops = catalog.ca_transit_stops.read()[["agency", "stop_id", 
                                         "stop_name", "geometry"]]
highways = catalog.state_highway_network.read()

In [None]:
#grabbed ca map from previous exercise.
ca = catalog.caltrans_districts.read().dissolve()
ca.plot()

Since we want to know the distance from a stop's point to the interstate generally, we need a dissolve. We don't want to compare the distance against the I-5, the I-10 individually, but to the interstate system as a whole.

In [None]:
interstates = (highways[highways.RouteType=="Interstate"]
               .dissolve()
               .reset_index()
               [["geometry"]]
              )

In [None]:
#function that runs the same checks for dfs, gdfs, and series
def df_check(x):
    display(f'shape of df: {x.shape}')
    display(f'type of df: {type(x)}')
    display(x.head())

In [None]:
df_check(stops)

In [None]:
df_check(highways)

In [None]:
highways.plot()

In [None]:
# This is still a gdf, just with 1 column
type(interstates)

In [None]:
df_check(interstates)

In [None]:
# Pulling out the individual column, it becomes a series/geoseries.
# It's a geoseries here because we had a gdf. 
# If it was a df, it would be a series.
print(type(stops.geometry))
print(type(interstates.geometry))

Distance is something you can calculate using `geopandas`.

Specifically, it takes a geoseries on the left, and either a geoseries or a single geometry on the right.

An example of having 2 geoseries would be comparing the distance between 2 points. On the left, it would be a geoseries of the origin points and on the right, destination points.
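
The two-geoseries case can be sketched as follows, with made-up origin and destination points in arbitrary planar units. Rows are paired up by index, so each origin is compared only to the destination that shares its index.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical origin / destination pairs
origins = gpd.GeoSeries([Point(0, 0), Point(0, 0)])
destinations = gpd.GeoSeries([Point(3, 4), Point(6, 8)])

# Row 0 is compared to row 0, row 1 to row 1
pair_distance = origins.distance(destinations)
```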

In [None]:
# We get a warning if we leave it in EPSG:4326!
stops.geometry.distance(interstates.geometry.iloc[0])

In [None]:
stops_geom = stops.to_crs("EPSG:2229").geometry
interstates_geom = interstates.to_crs("EPSG:2229").geometry.iloc[0]

In [None]:
df_check(stops_geom)

In [None]:
interstates_geom

In [None]:
distance_series = stops_geom.distance(interstates_geom)

In [None]:
#returns a series (like a 1 col table)
df_check(distance_series)


In [None]:
# Let's make sure that for every stop, a distance is calculated
print(f"# rows in stops: {len(stops_geom)}")
print(f"# rows in distance_series: {len(distance_series)}")

In [None]:
# distance is numeric, not a geometry, so we're back to being a series
type(distance_series)

What can we do with this? 

We usually add it as a new column. Since we did nothing to shift the index, we can just attach the series back to our gdf.

Getting a distance calculation using geoseries is much quicker than a row-wise lambda function where you calculate the distance.

```
# Alternative method that's slower:
      
interstate_geom = interstates.geometry.iloc[0]

stops = stops.assign(
   distance = stops.geometry.apply(
         lambda x: x.distance(interstate_geom))
)   
```
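
Attaching a series back with `assign` relies on the index. A small sketch with toy data (the column and variable names are hypothetical) of what happens when the indices match and when they don't:

```python
import pandas as pd

df = pd.DataFrame({"stop_id": ["a", "b", "c"]})  # default index 0, 1, 2

# Same index as df: values land on the rows we expect
same_index = pd.Series([10.0, 20.0, 30.0], index=df.index)
aligned = df.assign(distance=same_index)

# Shifted index: pandas matches on index labels, not on position
shifted = pd.Series([10.0, 20.0, 30.0], index=[1, 2, 3])
misaligned = df.assign(distance=shifted)

# Converting to an array drops the index and assigns by position instead
by_position = df.assign(distance=shifted.to_numpy())
```

In `misaligned`, row 0 has no matching label and ends up as NaN, while the value labeled 3 is silently dropped. This is why it matters that nothing has reset, sorted, or filtered the gdf between computing the series and assigning it.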

In [None]:
#adds a new column to stops called `distance_to_interstate` and fills it with values from `distance_series`. 
#the indices are the same for both, meaning they match up

stops = stops.assign(
    distance_to_interstate = distance_series
)

In [None]:
df_check(stops)

In [None]:
#this cell took a loooooooong time to run
#%%timeit
#distance_series = stops_geom.distance(interstates_geom)

In [None]:
#also took a loooooong time to run
#%%timeit
#stops.assign(
   #distance = stops.geometry.apply(
       #  lambda x: x.distance(interstates_geom))
#)   

In [None]:
#import dask_geopandas as dg

#stops_gddf = dg.from_geopandas(stops, npartitions=2)
#stops_geom_dg = stops_gddf.to_crs("EPSG:2229").geometry

In [None]:
#was a lot faster to run
#%%timeit

#distance_series = stops_geom_dg.distance(interstates_geom)

## To Do

* Use the `stop_times` table and `stops` table.
* Calculate the straight line distance between the first and last stop for each trip. Call this column `trip_distance`
* Calculate the distance between each stop to the nearest interstate. For each trip, keep the value for the stop that's the closest to the interstate. Call this column `shortest_distance_hwy`.
* For each trip, add these 2 new columns, but use series, geoseries, and/or arrays to assign it.
* Provide a preview of the resulting df (do not export)
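
One possible shape for these steps, sketched on a tiny made-up dataset rather than the real tables. The toy trips, coordinates, and the `highway` line are all hypothetical, and the real data would also need to account for `feed_key`. The idea is to keep the first and last stop per trip as two geoseries indexed by `trip_id`, then let `distance` pair them up by index.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import LineString, Point

# Toy stop_times-like table with point geometry in a projected CRS (feet)
toy = gpd.GeoDataFrame(
    {
        "trip_id": ["t1", "t1", "t1", "t2", "t2"],
        "stop_sequence": [1, 2, 3, 1, 2],
    },
    geometry=[Point(0, 0), Point(1, 1), Point(3, 4), Point(10, 0), Point(10, 6)],
    crs="EPSG:2229",
).sort_values(["trip_id", "stop_sequence"])

# First / last stop per trip, indexed by trip_id so the two geoseries align
first_stops = toy.drop_duplicates("trip_id", keep="first").set_index("trip_id")
last_stops = toy.drop_duplicates("trip_id", keep="last").set_index("trip_id")
trip_distance = first_stops.geometry.distance(last_stops.geometry)

# Distance from every stop to one (hypothetical) highway line,
# then keep the smallest value per trip
highway = LineString([(-5, 10), (20, 10)])
stop_to_hwy = toy.geometry.distance(highway)
shortest_distance_hwy = (
    toy.assign(distance_hwy=stop_to_hwy).groupby("trip_id")["distance_hwy"].min()
)

# Both series share the trip_id index, so they line up in one trip-level table
trips = pd.DataFrame(
    {"trip_distance": trip_distance, "shortest_distance_hwy": shortest_distance_hwy}
)
```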

In [None]:
GCS_FILE_PATH = ("gs://calitp-analytics-data/data-analyses/"
                 "rt_delay/compiled_cached_views/"
                )

analysis_date = "2023-01-18"
STOP_TIMES_FILE = f"{GCS_FILE_PATH}st_{analysis_date}.parquet"
STOPS_FILE = f"{GCS_FILE_PATH}stops_{analysis_date}.parquet"
highways = catalog.state_highway_network.read()

In [None]:
#test to import parquet files
stops = pd.read_parquet(STOPS_FILE)
stop_times = pd.read_parquet(STOP_TIMES_FILE)

In [None]:
#what does each row mean?
#each row is a stop_key, a stop_key can have multiple feeds and stops
#what is the difference between stop_key and stop_id?

#noticed the geometry col is in WKB. need to convert this to something else.

df_check(stops)

In [None]:
#found method to create geoseries from wkb.
test = gpd.GeoSeries.from_wkb(stops.geometry)


In [None]:
#have a geoseries called `test`. now i am able to add this series back to initial stops table (using assign)
#stops = stops.assign(
#    distance_to_interstate = distance_series
#)
stops2 = stops.assign(wkb_to_pt = test)

In [None]:
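#now I can create a gdf and set an active geom col.
#set_crs only labels the CRS; to_crs converts to ft. Assumes the WKB points are lon/lat (EPSG:4326).

stops2 = gpd.GeoDataFrame(stops2).set_geometry('wkb_to_pt').set_crs('EPSG:4326').to_crs('EPSG:2229')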
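
A short sketch of the difference between labeling and converting a CRS, using one made-up lon/lat point encoded as WKB. It assumes the WKB coordinates are in EPSG:4326, which is worth confirming for the real file.

```python
import geopandas as gpd
from shapely.geometry import Point

# Hypothetical WKB input: one lon/lat point near Los Angeles
wkb_values = [Point(-118.24, 34.05).wkb]

# from_wkb can label the CRS while decoding
points = gpd.GeoSeries.from_wkb(wkb_values, crs="EPSG:4326")

# set_crs only changes the label: the coordinates are still degrees
labeled_only = points.set_crs("EPSG:2229", allow_override=True)

# to_crs actually reprojects the coordinates into feet
reprojected = points.to_crs("EPSG:2229")
```

Comparing `labeled_only.x` with `reprojected.x` shows the problem: the first still holds a longitude value while claiming to be in feet, so any distance computed from it would be meaningless.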

In [None]:
#function confirms that stops2 is a gdf, also used `stops2.geometry.name` and `stops2.crs` to confirm active geom col and crs was set as intended.
df_check(stops2)

#plotting reveals the stops2 is nationwide. will need to clip this to CA only or something.
stops2.plot()

In [None]:
df_check(stop_times)

In [None]:
#cleaned up a couple of columns, dissolved by routes, reset index and set crs to feet
highways_d = highways[['Route', 'geometry', 'RouteType']].dissolve(by='Route').reset_index().to_crs('EPSG:2229')

In [None]:
#can you sjoin highways and stops to get stops in ca?

sjoin = gpd.sjoin(highways.to_crs('EPSG:2229'), stops2, how='right')
sjoin.plot()

In [None]:
df_check(highways_d)
highways_d.plot()

In [None]:
stop2times = stops2.merge(stop_times, how='inner', on=['feed_key', 'stop_id'])

In [None]:
df_check(stop2times)

In [None]:
#Calculate the straight line distance between the first and last stop for each trip. Call this column trip_distance

#how to find the first and last stop of a trip? try a groupby/agg 

In [None]:
#testing pivot table for max stop sequence

#need coordinates for the stop!
pivot_max = stop2times.pivot_table(
    index=['trip_id'],
    values=['stop_sequence'],
    aggfunc={
        'stop_sequence':'max'}
).reset_index()

In [None]:
pivot_max = pivot_max.rename(columns={'stop_sequence':'last_stop'})

In [None]:
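#test join to get geom col for this trip ID and stop. 
#match on the stop sequence too, so only the last stop's geometry is kept (not every stop on the trip)
pivot_max = pivot_max.merge(stop2times[['trip_id','stop_sequence','wkb_to_pt']], left_on=['trip_id','last_stop'], right_on=['trip_id','stop_sequence'], how='inner').drop(columns='stop_sequence')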

In [None]:
#lots of dupe rows, need to consolidate down
pivot_max = pivot_max.drop_duplicates()

In [None]:
pivot_max.iloc[100:110]

In [None]:
#using same pivot table method to get min stop value
pivot_min = stop2times.pivot_table(
    index=['trip_id'],
    values=['stop_sequence'],
    aggfunc={
        'stop_sequence':'min'}
).reset_index()

In [None]:
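#test to combine all the previous cleaning steps 
#match on the stop sequence too, so only the first stop's geometry is kept
pivot_min = (pivot_min.rename(columns={'stop_sequence':'first_stop'})
             .merge(stop2times[['trip_id','stop_sequence','wkb_to_pt']],
                    left_on=['trip_id','first_stop'], right_on=['trip_id','stop_sequence'], how='inner')
             .drop(columns='stop_sequence')
             .drop_duplicates()
            )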

In [None]:
pivot_min

In [None]:
stop2times = stop2times.merge(pivot_max, on='trip_id',how='left')


In [None]:
stop2times = stop2times.merge(pivot_min, on='trip_id', how='left')

In [None]:
stop2times.head()

# hall-o-shame

In [None]:
#test = stop2times({'trip_id': group.groups.keys(), 
#                   'first_stop': first_stop, 
#                  'last_stop': last_stop}
#                 )
#test.head

In [None]:
#test to find the first stop of a trip using .iloc[0]
#group = stop_times.groupby('trip_id')

In [None]:
#first_stop = []
#last_stop = []

In [None]:
# FOR LOOPS!!!
#for `every trip_id` group in group df, do this operation
#for trip_id, group in group:
#    f_stop = group.iloc[0]['stop_id']
#    l_stop = group.iloc[-1]['stop_id']

In [None]:
#first_stop.append(f_stop)
#first_stop

In [None]:
#last_stop.append(l_stop)
#last_stop

In [None]:
#group.groups.keys()

In [None]:

#add new col - first_stop. use assign with trip ID, and stop_sequence.iloc[0]
#add new col - last_stop. use assign with trip ID, and stop_sequence.iloc[-1]
#add new col - distance between first and last stop, using distance(first stop, last stop)



In [None]:
#overlay highways with stops2?
#RETURNS NOTHING!
#test = gpd.overlay(stops2, highways.to_crs('EPSG:2229'), how='intersection', keep_geom_type=True)

In [None]:
#try to overlay stop2times on highways_d (points on line?)
#RETURNS NOTHING!
#test = gpd.overlay(highways.to_crs('EPSG:2229'), stop2times, how ='intersection', keep_geom_type=True)

In [None]:
#df_check(test)

In [None]:
#Calculate the distance between each stop to the nearest interstate. 
#For each trip, keep the value for the stop that's the closest to the interstate. Call this column shortest_distance_hwy.

In [None]:
#can i dissolve by trip_ID, then get length?

#NOPE DIDNT WORK AS EXPECTED

#trip_d = stop2times.dissolve(by='trip_id').reset_index()

In [None]:
#df_check(trip_d)
#trip_d.plot()