# Exercise 9: series

* pandas series vs numpy arrays [explanation](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)

### Common series operations
These are the most common series operations we use. Refer to the `pandas` docs for even more!

* Getting dates, hours, minutes from datetime types (`df.datetime_col.dt.date`)
* Parsing strings (`df.string_col.str.split('_')`)
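
A minimal sketch of the two accessors above. The dataframe and its column names are made up for illustration; only `.dt` and `.str` come from `pandas`.

```python
import pandas as pd

# Hypothetical data: column names mirror the bullets above
df = pd.DataFrame({
    "datetime_col": pd.to_datetime(
        ["2023-01-18 06:57:27", "2023-01-18 18:29:43"]),
    "string_col": ["route_10", "route_720"],
})

# .dt exposes datetime parts for every row at once
dates = df.datetime_col.dt.date
hours = df.datetime_col.dt.hour
minutes = df.datetime_col.dt.minute

# .str.split returns a series of lists; .str[1] picks the 2nd piece of each list
route_num = df.string_col.str.split("_").str[1]

print(hours.tolist(), minutes.tolist(), route_num.tolist())
```

Each result is itself a series with the same index as `df`, so it can go straight into `df.assign(...)`.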

### Common geoseries operations
These are the most common. Refer to the `geopandas` docs for even more!

* `distance` between 2 points or a point to a polygon or line [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.distance.html)
* `intersects`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.intersects.html)
* `within`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.within.html)
* `contains`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.contains.html)

In fact, we've often used geoseries methods without even realizing it. Often, we'd create a new column that stores either the line's length or a polygon's area. `gdf.geometry` is a geoseries, and we call methods on that geoseries, and add that as a new column.

For calculations like `length`, `area`, and `distance`, we need to use a projected CRS that has units like meters or feet. We cannot use decimal degrees (do not use WGS 84 / EPSG:4326)! Distance calculations must be done only once the spherical 3D Earth has been converted into a 2D plane.

* `length`: get the length of a line (`gdf.geometry.length`)
* `area`: get the area of a polygon (`gdf.geometry.area`)
* `centroid`: get the centroid of a polygon (`gdf.geometry.centroid`)
* `x`: get the x coordinate of a point (`gdf.geometry.x`)
* `y`: get the y coordinate of a point (`gdf.geometry.y`)
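
A small sketch of these attributes, assuming toy geometries whose coordinates are already in feet (EPSG:2229). The shapes and column names are invented for the example.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point, Polygon

# Made-up geometries; coordinates are treated as feet in EPSG:2229
gdf = gpd.GeoDataFrame(
    {"name": ["square", "line", "point"]},
    geometry=[
        Polygon([(0, 0), (100, 0), (100, 100), (0, 100)]),
        LineString([(0, 0), (300, 400)]),
        Point(50, 75),
    ],
    crs="EPSG:2229",
)

# gdf.geometry is a geoseries; each attribute returns one value per row
gdf = gdf.assign(
    length_ft=gdf.geometry.length,  # perimeter for polygons, 0 for points
    area_sqft=gdf.geometry.area,    # 0 for lines and points
)

# centroid returns a geoseries of points, so .x and .y are available on it
centroid = gdf.geometry.centroid
print(gdf[["name", "length_ft", "area_sqft"]])
print(centroid.x.tolist(), centroid.y.tolist())
```

Note that `.x` and `.y` only work when every geometry in the geoseries is a point, which is why they are called on the centroids here.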

### Arrays
* Occasionally, we may even use arrays, especially when the datasets get even larger but we have simple mathematical calculations
* If we need to apply an exponential decay function to a distance column, we essentially want to multiply `distance` by some number
* Since this exponential decay function is somewhat custom and requires us to write our own formula, we would extract the column as a series (`df.distance`) and multiply each value by some other number.
* Even quicker is to use `numpy` with `distance_array = np.array(df.distance)` and get `exponential_array = distance_array*some_number`
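
A minimal sketch of the array approach, assuming a decay of the form `exp(-rate * distance)`. The decay rate and the sample distances are invented for the example, not values from our analyses.

```python
import numpy as np
import pandas as pd

# Hypothetical distances in feet
df = pd.DataFrame({"distance": [0.0, 500.0, 1000.0, 5000.0]})

DECAY_RATE = 0.001  # assumed value, purely illustrative

# Pull the column out as a plain numpy array
distance_array = np.array(df.distance)

# The whole array is transformed in one vectorized call, no loop needed
decay_array = np.exp(-DECAY_RATE * distance_array)

# An array has no index, so it is attached back by position
df = df.assign(decay_weight=decay_array)
print(df)
```

Because the array carries no index, this only lines up correctly if the rows of `df` have not been reordered or filtered in between.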

In [1]:
import geopandas as gpd
import intake
import numpy as np
import pandas as pd

catalog = intake.open_catalog(
    "../_shared_utils/shared_utils/shared_data_catalog.yml")

If you're asking how far is a transit stop from the interstate, you'd want the distance of every point (every row) compared to an interstate highway geometry.

Let's prep the datasets to use series / geoseries to do this.

In [None]:
stops = catalog.ca_transit_stops.read()[["agency", "stop_id", 
                                         "stop_name", "geometry"]]
highways = catalog.state_highway_network.read()

Since we want to know the distance from a stop's point to the interstate generally, we need a dissolve. We don't want to compare the distance against the I-5, the I-10 individually, but to the interstate system as a whole.

In [None]:
interstates = (highways[highways.RouteType=="Interstate"]
               .dissolve()
               .reset_index()
               [["geometry"]]
              )

In [None]:
df_check(stops)

In [None]:
df_check(highways)

In [None]:
highways.plot()

In [None]:
# This is still a gdf, just with 1 column
type(interstates)

In [None]:
df_check(interstates)

In [None]:
# Pulling out the individual column, it becomes a series/geoseries.
# It's a geoseries here because we had a gdf. 
# If it was a df, it would be a series.
print(type(stops.geometry))
print(type(interstates.geometry))

Distance is something you can calculate using `geopandas`.

Specifically, it takes a geoseries on the left, and either a geoseries or a single geometry on the right.

An example of having 2 geoseries would be comparing the distance between 2 points. On the left, it would be a geoseries of the origin points and on the right, destination points.
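
Both cases can be sketched with toy points, assuming coordinates that are already in a projected CRS. All geometries here are made up.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point

origins = gpd.GeoSeries([Point(0, 0), Point(0, 0)], crs="EPSG:2229")
destinations = gpd.GeoSeries([Point(3, 4), Point(6, 8)], crs="EPSG:2229")

# geoseries vs geoseries: row 0 is compared to row 0, row 1 to row 1
pairwise = origins.distance(destinations)

# geoseries vs a single shapely geometry: every row is compared to the same line
road = LineString([(0, 10), (100, 10)])
to_road = destinations.distance(road)

print(pairwise.tolist())
print(to_road.tolist())
```

When both sides are geoseries, rows are matched on their index, so the two geoseries should share the same index values.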

In [None]:
# We get a warning if we leave it in EPSG:4326!
stops.geometry.distance(interstates.geometry.iloc[0])

In [None]:
stops_geom = stops.to_crs("EPSG:2229").geometry
interstates_geom = interstates.to_crs("EPSG:2229").geometry.iloc[0]

In [None]:
df_check(stops_geom)

In [None]:
interstates_geom

In [None]:
distance_series = stops_geom.distance(interstates_geom)

In [None]:
#returns a series (like a 1 col table)
df_check(distance_series)


In [None]:
# Let's make sure that for every stop, a distance is calculated
print(f"# rows in stops: {len(stops_geom)}")
print(f"# rows in distance_series: {len(distance_series)}")

In [None]:
# distance is numeric, not a geometry, so we're back to being a series
type(distance_series)

What can we do with this? 

We usually add it as a new column. Since we did nothing to shift the index, we can just attach the series back to our gdf.
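
Why the index matters can be seen in a small sketch. The dataframe and values are invented; the point is that `assign` matches a series on index labels, not on row position.

```python
import pandas as pd

stops_demo = pd.DataFrame({"stop_id": ["a", "b", "c"]}, index=[10, 11, 12])
dist = pd.Series([5.0, 7.5, 1.2], index=[10, 11, 12])

# Reordering the series is harmless: values still land on matching labels
shuffled = dist.sort_values()
aligned = stops_demo.assign(distance=shuffled)

# Resetting the index breaks the match: labels 0, 1, 2 do not exist in stops_demo
reset = dist.reset_index(drop=True)
mismatched = stops_demo.assign(distance=reset)

# A numpy array has no index, so it is attached purely by position
by_position = stops_demo.assign(distance=reset.to_numpy())

print(aligned.distance.tolist())
print(mismatched.distance.tolist())
print(by_position.distance.tolist())
```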

Getting a distance calculation using geoseries is much quicker than a row-wise lambda function where you calculate the distance.

```
# Alternative method that's slower:
      
interstate_geom = interstates.geometry.iloc[0]

stops = stops.assign(
   distance = stops.geometry.apply(
         lambda x: x.distance(interstate_geom))
)   
```
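
A sketch that checks the two approaches give the same numbers, using random toy points rather than the real stops. The point count and the line are arbitrary.

```python
import geopandas as gpd
import numpy as np
from shapely.geometry import LineString

# 1,000 made-up points in a 1,000 x 1,000 ft box
rng = np.random.default_rng(0)
xy = rng.uniform(0, 1_000, size=(1_000, 2))
pts = gpd.GeoSeries(gpd.points_from_xy(xy[:, 0], xy[:, 1]), crs="EPSG:2229")

# A horizontal line across the middle of the box
line = LineString([(0, 500), (1_000, 500)])

# Geoseries method: one call handles every row
vectorized = pts.distance(line)

# Row-wise lambda: calls shapely's distance once per row
row_wise = pts.apply(lambda geom: geom.distance(line))

print(np.allclose(vectorized, row_wise))
```

In a notebook, putting `%timeit` in front of each of the two distance lines is one way to compare their speed on your own data.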

In [None]:
#adds a new column to stops called `distance_to_interstate` and fills it with values from `distance_series`. 
#the indices are the same for both, meaning they match up

stops = stops.assign(
    distance_to_interstate = distance_series
)

In [None]:
df_check(stops)

In [None]:
#this cell took a loooooooong time to run
#%%timeit
#distance_series = stops_geom.distance(interstates_geom)

In [None]:
#also took a loooooong time to run
#%%timeit
#stops.assign(
   #distance = stops.geometry.apply(
       #  lambda x: x.distance(interstates_geom))
#)   

In [None]:
#import dask_geopandas as dg

#stops_gddf = dg.from_geopandas(stops, npartitions=2)
#stops_geom_dg = stops_gddf.to_crs("EPSG:2229").geometry

In [None]:
#was a lot faster to run
#%%timeit

#distance_series = stops_geom_dg.distance(interstates_geom)

## To Do

* Use the `stop_times` table and `stops` table.
* Calculate the straight line distance between the first and last stop for each trip. Call this column `trip_distance`
* Calculate the distance between each stop to the nearest interstate. For each trip, keep the value for the stop that's the closest to the interstate. Call this column `shortest_distance_hwy`.
* For each trip, add these 2 new columns, but use series, geoseries, and/or arrays to assign it.
* Provide a preview of the resulting df (do not export)
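
For the "keep the closest stop per trip" step, one possible pattern is a groupby minimum. This is only a sketch on invented rows; `distance_to_interstate` is assumed to have been calculated already, and in the real tables `trip_id` may need to be paired with `feed_key` to identify a trip.

```python
import pandas as pd

# Made-up stop-level rows with a precomputed distance column
stop_times_demo = pd.DataFrame({
    "trip_id": ["t1", "t1", "t1", "t2", "t2"],
    "stop_id": ["a", "b", "c", "a", "d"],
    "distance_to_interstate": [900.0, 250.0, 4000.0, 900.0, 120.0],
})

# One value per trip: a series indexed by trip_id
shortest = (stop_times_demo
            .groupby("trip_id")
            .distance_to_interstate
            .min()
            .rename("shortest_distance_hwy"))

# transform keeps the original row index, so the result can be assigned directly
stop_times_demo = stop_times_demo.assign(
    shortest_distance_hwy=(stop_times_demo
                           .groupby("trip_id")
                           .distance_to_interstate
                           .transform("min"))
)

print(shortest)
print(stop_times_demo)
```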

In [2]:
GCS_FILE_PATH = ("gs://calitp-analytics-data/data-analyses/"
                 "rt_delay/compiled_cached_views/"
                )

analysis_date = "2023-01-18"
STOP_TIMES_FILE = f"{GCS_FILE_PATH}st_{analysis_date}.parquet"
STOPS_FILE = f"{GCS_FILE_PATH}stops_{analysis_date}.parquet"
highways = catalog.state_highway_network.read()

In [3]:
#import parquet files
stops = pd.read_parquet(STOPS_FILE)
stop_times = pd.read_parquet(STOP_TIMES_FILE)



In [4]:
#grabbed ca map from previous exercise, so I can clip everything to CA
districts = catalog.caltrans_districts.read().to_crs('EPSG:2229')


In [5]:
ca = districts.dissolve()

In [6]:
#function that runs the same checks for dfs
def df_check(x):
    display(f'shape of df:{x.shape}')
    display(f'type of :{type(x)}')
    display(x.head())

In [7]:
#what does each row mean?
#each row is a stop_key, a stop_key can have multiple feeds and stops
#what is the difference between stop_key and stop_id?

#noticed the geometry col is in WKB. need to convert this to something else.

df_check(stops)

'shape of df:(84688, 16)'

"type of :<class 'pandas.core.frame.DataFrame'>"

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,,,,,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\..."
1,6adf6cd9b6d24ab4ee8ee220e3697a73,14025,038cca58ef5f071ff5c94b8213989f87,Vermont / 110th,,,,107.0,,,,,,,,b'\x01\x01\x00\x00\x00\x10\x1em\x1c\xb1\x92]\x...
2,6adf6cd9b6d24ab4ee8ee220e3697a73,15638,06b1447efcc028791c8409d65fa3b3ee,3rd / Hobart,,,,143.0,,,,,,,,b'\x01\x01\x00\x00\x00\xd7d\x8dz\x88\x93]\xc03...
3,6adf6cd9b6d24ab4ee8ee220e3697a73,10244,87f19e30889f90d25e6dee49f04c4985,Vernon / Hooper,,,,97.0,,,,,,,,b'\x01\x01\x00\x00\x00z\xc2\x12\x0f(\x90]\xc0\...
4,6adf6cd9b6d24ab4ee8ee220e3697a73,20206,eda9e3eb339b7f510babcd4ee0999f85,Broadway / Pacific,,,,108.0,,,,,,,,b'\x01\x01\x00\x00\x001du\xab\xe7\x90]\xc0\xf4...


In [8]:
#found method to create geoseries from wkb.
test = gpd.GeoSeries.from_wkb(stops.geometry)


In [9]:
#have a geoseries called `test`. now I am able to add this series back to initial stops table (using assign)
stops2 = stops.assign(pt_geom = test)

In [10]:
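#now I can create a gdf and set an active geom col and change crs to ft.
#NOTE: set_crs only labels the data, it does not reproject. the points print as lon/lat degrees,
#so set_crs('EPSG:4326').to_crs('EPSG:2229') is probably what is needed to actually get feet.
stops2 = gpd.GeoDataFrame(stops2).set_geometry('pt_geom').set_crs('EPSG:2229')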
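
The difference between `set_crs` and `to_crs` matters here. A minimal sketch with one made-up stop stored as WKB, assuming the stored coordinates are lon/lat in EPSG:4326:

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Hypothetical table with a WKB geometry column, like the parquet file
raw = pd.DataFrame({
    "stop_name": ["demo stop"],
    "geometry": [Point(-118.515, 34.194).wkb],
})

# Decode the bytes and declare what the numbers already are (degrees)
geom = gpd.GeoSeries.from_wkb(raw.geometry, crs="EPSG:4326")

# set_crs only relabels: the coordinates stay in degrees
labeled_only = geom.set_crs("EPSG:2229", allow_override=True)

# to_crs reprojects: the coordinates become feet
reprojected = geom.to_crs("EPSG:2229")

print(labeled_only.iloc[0])
print(reprojected.iloc[0])
```

If the geometries are only relabeled, any later `distance` result is in degrees even though the CRS claims feet.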

In [11]:
#function confirms that stops2 is a gdf, also used `stops2.geometry.name` and `stops2.crs` to confirm active geom col and crs was set as intended.
df_check(stops2)

#plotting reveals the stops2 is nationwide. may need to clip this to CA only or something.
stops2.geometry.name


'shape of df:(84688, 17)'

"type of :<class 'geopandas.geodataframe.GeoDataFrame'>"

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry,pt_geom
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,,,,,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\...",POINT (-118.515 34.194)
1,6adf6cd9b6d24ab4ee8ee220e3697a73,14025,038cca58ef5f071ff5c94b8213989f87,Vermont / 110th,,,,107.0,,,,,,,,b'\x01\x01\x00\x00\x00\x10\x1em\x1c\xb1\x92]\x...,POINT (-118.292 33.936)
2,6adf6cd9b6d24ab4ee8ee220e3697a73,15638,06b1447efcc028791c8409d65fa3b3ee,3rd / Hobart,,,,143.0,,,,,,,,b'\x01\x01\x00\x00\x00\xd7d\x8dz\x88\x93]\xc03...,POINT (-118.305 34.069)
3,6adf6cd9b6d24ab4ee8ee220e3697a73,10244,87f19e30889f90d25e6dee49f04c4985,Vernon / Hooper,,,,97.0,,,,,,,,b'\x01\x01\x00\x00\x00z\xc2\x12\x0f(\x90]\xc0\...,POINT (-118.252 34.004)
4,6adf6cd9b6d24ab4ee8ee220e3697a73,20206,eda9e3eb339b7f510babcd4ee0999f85,Broadway / Pacific,,,,108.0,,,,,,,,b'\x01\x01\x00\x00\x001du\xab\xe7\x90]\xc0\xf4...,POINT (-118.264 34.147)


'pt_geom'

In [12]:
df_check(stop_times)

'shape of df:(3589931, 9)'

"type of :<class 'pandas.core.frame.DataFrame'>"

Unnamed: 0,feed_key,trip_id,stop_id,stop_sequence,timepoint,arrival_sec,departure_sec,arrival_hour,departure_hour
0,48138ae7269d615d5509958097039bf7,t287-b194-sl4_merged_3564,1140,11,,25047,25047,6,6
1,48138ae7269d615d5509958097039bf7,t708-b12D-sl4_merged_4213,1161,25,,66583,66583,18,18
2,48138ae7269d615d5509958097039bf7,t476-b194-sl4_merged_4047,1153,22,,43440,43440,12,12
3,48138ae7269d615d5509958097039bf7,t6DF-b68-sl4_merged_3187,1437,4,,64959,64959,18,18
4,d4642902c43d526677dff02b09342b78,t607-b1F4B-sl2_merged_1620,601,1,,56580,56580,15,15


In [13]:
#join stops2 and stop times, resulting df is for every stop, we get the stop time and its location 
stop2times = stops2.merge(stop_times, how='inner', on=['feed_key', 'stop_id'])

In [14]:
#check to make sure merge works and that active geom col is pt_geom
df_check(stop2times)
stop2times.geometry.name

'shape of df:(3589718, 24)'

"type of :<class 'geopandas.geodataframe.GeoDataFrame'>"

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,...,missing_route_type,geometry,pt_geom,trip_id,stop_sequence,timepoint,arrival_sec,departure_sec,arrival_hour,departure_hour
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\...",POINT (-118.515 34.194),10165002071128-DEC22,54,0.0,44760,44760,12,12
1,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\...",POINT (-118.515 34.194),10165002070728-DEC22,54,0.0,30480,30480,8,8
2,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\...",POINT (-118.515 34.194),10165002071426-DEC22,54,0.0,55740,55740,15,15
3,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\...",POINT (-118.515 34.194),10165002071113-DEC22,54,0.0,43860,43860,12,12
4,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\...",POINT (-118.515 34.194),10165002071541-DEC22,54,0.0,60180,60180,16,16


'pt_geom'
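
The merged table has slightly fewer rows than `stop_times` (3,589,718 vs 3,589,931), so some stop times found no matching stop. One way to see which rows drop out is a left merge with `indicator=True`. The sketch below uses invented rows, not the real tables.

```python
import pandas as pd

stops_demo = pd.DataFrame({
    "feed_key": ["f1", "f1", "f2"],
    "stop_id": ["1", "2", "1"],
    "stop_name": ["A St", "B St", "C Ave"],
})
stop_times_demo = pd.DataFrame({
    "feed_key": ["f1", "f1", "f2", "f2"],
    "trip_id": ["t1", "t1", "t9", "t9"],
    "stop_id": ["1", "2", "1", "99"],  # f2 / 99 has no match in stops_demo
    "stop_sequence": [1, 2, 1, 2],
})

# indicator=True adds a _merge column saying where each row was found
check = stop_times_demo.merge(
    stops_demo, on=["feed_key", "stop_id"], how="left", indicator=True
)
unmatched = check[check["_merge"] == "left_only"]

# An inner merge silently drops those unmatched rows
inner = stop_times_demo.merge(stops_demo, on=["feed_key", "stop_id"], how="inner")

print(unmatched[["feed_key", "trip_id", "stop_id"]])
print(len(stop_times_demo), len(inner))
```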

In [None]:
#Calculate the straight line distance between the first and last stop for each trip. Call this column trip_distance

#how do I find the first and last stop of a trip? try a groupby/agg 

In [25]:
#testing pivot table for max stop sequence (aka last stop)

#for every trip, there are multiple stops. for every stop, there is a stop sequence 
#need coordinates for the stop!
pivot_max = stop2times.pivot_table(
    index=['trip_id','stop_id'],
    values=['stop_sequence'],
    aggfunc={
        'stop_sequence':'max'}
).reset_index()

pivot_max

Unnamed: 0,trip_id,stop_id,stop_sequence
0,002bqqucv,1,6
1,002bqqucv,11,10
2,002bqqucv,12,2
3,002bqqucv,13,3
4,002bqqucv,14,4
...,...,...,...
3564958,zwxp4b4ea,14,4
3564959,zwxp4b4ea,15,5
3564960,zwxp4b4ea,2,7
3564961,zwxp4b4ea,3,8


In [26]:
#test to merge in pt_geom column from stop2times, based on stop and trip id
pivot_max2 = pivot_max.merge(stop2times[['trip_id','stop_id','pt_geom']], on=['stop_id','trip_id'], how='inner').drop_duplicates().rename(columns={'stop_sequence':'last_stop', 'pt_geom':'last_stop_geom'})

In [27]:
#checking pivot merge to ensure column was renamed and have point geometry col
#note the resulting df is NOT A GDF
df_check(pivot_max2)

'shape of df:(3568059, 4)'

"type of :<class 'pandas.core.frame.DataFrame'>"

Unnamed: 0,trip_id,stop_id,last_stop,last_stop_geom
0,002bqqucv,1,6,POINT (-122.387 37.615)
1,002bqqucv,11,10,POINT (-122.399 37.632)
3,002bqqucv,12,2,POINT (-122.398 37.631)
4,002bqqucv,13,3,POINT (-122.397 37.629)
5,002bqqucv,14,4,POINT (-122.398 37.628)


In [28]:
#using same pivot table method to get min stop value (aka first stop)
pivot_min = stop2times.pivot_table(
    index=['trip_id', 'stop_id'],
    values=['stop_sequence'],
    aggfunc={
        'stop_sequence':'min'}
).reset_index()

pivot_min

Unnamed: 0,trip_id,stop_id,stop_sequence
0,002bqqucv,1,6
1,002bqqucv,11,1
2,002bqqucv,12,2
3,002bqqucv,13,3
4,002bqqucv,14,4
...,...,...,...
3564958,zwxp4b4ea,14,4
3564959,zwxp4b4ea,15,5
3564960,zwxp4b4ea,2,7
3564961,zwxp4b4ea,3,8


In [29]:
pivot_min2 = pivot_min.merge(stop2times[['trip_id','stop_id', 'pt_geom']], on=['trip_id','stop_id'], how='inner').drop_duplicates().rename(columns={'stop_sequence':'first_stop', 'pt_geom': 'first_stop_geom'})

In [30]:
pivot_min2

Unnamed: 0,trip_id,stop_id,first_stop,first_stop_geom
0,002bqqucv,1,6,POINT (-122.387 37.615)
1,002bqqucv,11,1,POINT (-122.399 37.632)
3,002bqqucv,12,2,POINT (-122.398 37.631)
4,002bqqucv,13,3,POINT (-122.397 37.629)
5,002bqqucv,14,4,POINT (-122.398 37.628)
...,...,...,...,...
3589713,zwxp4b4ea,14,4,POINT (-122.398 37.628)
3589714,zwxp4b4ea,15,5,POINT (-122.398 37.627)
3589715,zwxp4b4ea,2,7,POINT (-122.384 37.617)
3589716,zwxp4b4ea,3,8,POINT (-122.387 37.618)


In [31]:
#steps to create gdf
#stops2 = gpd.GeoDataFrame(stops2).set_geometry('pt_geom').set_crs('EPSG:2229')

def makegdf(df, geom):
    gdf = gpd.GeoDataFrame(df).set_geometry(geom).set_crs('EPSG:2229')
    
    return gdf

In [38]:
gdfmax = makegdf(pivot_max2, 'last_stop_geom')

gdfmin = makegdf(pivot_min2, 'first_stop_geom')

In [47]:
#check to ensure active geom col is set as intended, and crs is equivalent 
display(gdfmax.geometry.name)
display(gdfmin.geometry.name)
display(gdfmax.crs == gdfmin.crs)

'last_stop_geom'

'first_stop_geom'

True

In [43]:
df_check(gdfmax)
df_check(gdfmin)

'shape of df:(3568059, 4)'

"type of :<class 'geopandas.geodataframe.GeoDataFrame'>"

Unnamed: 0,trip_id,stop_id,last_stop,last_stop_geom
0,002bqqucv,1,6,POINT (-122.387 37.615)
1,002bqqucv,11,10,POINT (-122.399 37.632)
3,002bqqucv,12,2,POINT (-122.398 37.631)
4,002bqqucv,13,3,POINT (-122.397 37.629)
5,002bqqucv,14,4,POINT (-122.398 37.628)


'shape of df:(3568059, 4)'

"type of :<class 'geopandas.geodataframe.GeoDataFrame'>"

Unnamed: 0,trip_id,stop_id,first_stop,first_stop_geom
0,002bqqucv,1,6,POINT (-122.387 37.615)
1,002bqqucv,11,1,POINT (-122.399 37.632)
3,002bqqucv,12,2,POINT (-122.398 37.631)
4,002bqqucv,13,3,POINT (-122.397 37.629)
5,002bqqucv,14,4,POINT (-122.398 37.628)


In [49]:
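#test to find the distance between pivot_max2 and pivot_min2

distance = gdfmax.distance(gdfmin)

#all that work and got 0 for everything :(
#why: both pivots were indexed on trip_id AND stop_id, so each row is one stop within a trip
#and gdfmax / gdfmin hold the same point row-for-row. the distance from a point to itself is 0.
distance.value_counts()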


0.0    3568059
dtype: int64
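
One possible way to get a single first and last stop per trip is to sort by `stop_sequence` and keep the first and last row of each trip, leaving `stop_id` out of the grouping. This is a sketch on invented data, assuming a trip is identified by `feed_key` plus `trip_id` and that the geometries are already in feet.

```python
import geopandas as gpd
from shapely.geometry import Point

# Made-up stop-level rows for two trips
demo = gpd.GeoDataFrame(
    {
        "feed_key": ["f1", "f1", "f1", "f2", "f2"],
        "trip_id": ["t1", "t1", "t1", "t1", "t1"],
        "stop_sequence": [2, 1, 3, 1, 2],
    },
    geometry=[Point(0, 50), Point(0, 0), Point(30, 40),
              Point(10, 10), Point(10, 110)],
    crs="EPSG:2229",
)

trip_cols = ["feed_key", "trip_id"]
ordered = demo.sort_values(trip_cols + ["stop_sequence"])

# One row per trip: lowest and highest stop_sequence
first_stop = (ordered.drop_duplicates(subset=trip_cols, keep="first")
              .set_index(trip_cols))
last_stop = (ordered.drop_duplicates(subset=trip_cols, keep="last")
             .set_index(trip_cols))

# Both geoseries share the same trip index, so distance is computed trip by trip
trip_distance = (first_stop.geometry
                 .distance(last_stop.geometry)
                 .rename("trip_distance"))

print(trip_distance)
```

The result is a series indexed by trip, so `trip_distance.reset_index()` gives a small table that can be merged back onto a trip-level dataframe.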

In [None]:
#then add distance back to stop2times

# hall-o-shame

In [21]:
#merge pivot_max back into stop2times table to get the location of last stop
#stop2times_test = stop2times.merge(pivot_max, on=['trip_id','stop_id'],how='left')

In [None]:
#df_check(stop2times_test)

In [None]:
#merge pivot_min back into stop2times to get location of first stop.
#stop2times_test_2 = stop2times_test.merge(pivot_min, on=['trip_id','stop_id'], how='left')

In [None]:
#df_check(stop2times_test_2)

In [None]:
#test to calculate distance between first and last stop



In [None]:
#cleaned up a couple of columns, dissolved by routes, reset index and set crs to feet
#highways_d = highways[['Route', 'geometry', 'RouteType']].dissolve(by='Route').reset_index().to_crs('EPSG:2229')

In [None]:
#can you sjoin highways and stops to get stops in ca?

#sjoin = gpd.sjoin(highways.to_crs('EPSG:2229'), stops2, how='right')


#sjoin.plot()

In [None]:
#df_check(highways_d)
#highways_d.plot()

In [None]:
#test = stop2times({'trip_id': group.groups.keys(), 
#                   'first_stop': first_stop, 
#                  'last_stop': last_stop}
#                 )
#test.head

In [None]:
#test to find the first stop of a trip using .iloc[0]
#group = stop_times.groupby('trip_id')

In [None]:
#first_stop = []
#last_stop = []

In [None]:
# FOR LOOPS!!!
#for `every trip_id` group in group df, do this operation
#for trip_id, group in group:
#    f_stop = group.iloc[0]['stop_id']
#    l_stop = group.iloc[-1]['stop_id']

In [None]:
#first_stop.append(f_stop)
#first_stop

In [None]:
#last_stop.append(l_stop)
#last_stop

In [None]:
#group.groups.keys()

In [None]:

#add new col - first_stop. use assign with trip ID, and stop_sequence.iloc[0]
#add new col - last_stop. use assign with trip ID, and stop_sequence.iloc[-1]
#add new col - distance between first and last stop with distance first stop, last stop



In [None]:
#overlay highways with stops2?
#RETURNS NOTHING!
#test = gpd.overlay(stops2, highways.to_crs('EPSG:2229'), how='intersection', keep_geom_type=True)

In [None]:
#try to overlay stop2times on highways_d (points on line?)
#RETURNS NOTHING!
#test = gpd.overlay(highways.to_crs('EPSG:2229'), stop2times, how ='intersection', keep_geom_type=True)

In [None]:
#df_check(test)

In [None]:
#Calculate the distance between each stop to the nearest interstate. 
#For each trip, keep the value for the stop that's the closest to the interstate. Call this column shortest_distance_hwy.

In [None]:
#can i dissolve by trip_ID, then get length?

#NOPE DIDNT WORK AS EXPECTED

#trip_d = stop2times.dissolve(by='trip_id').reset_index()

In [None]:
#df_check(trip_d)
#trip_d.plot()