# Exercise 9: series

* pandas series vs numpy arrays [explanation](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)

### Common series operations
These are the most common series operations we use. Refer to the `pandas` docs for even more!

* Getting dates, hours, minutes from datetime types (`df.datetime_col.dt.date`)
* Parsing strings (`df.string_col.str.split('_')`)

### Common geoseries operations
These are the most common. Refer to the `geopandas` docs for even more!

* `distance` between 2 points or a point to a polygon or line [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.distance.html)
* `intersects`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.intersects.html)
* `within`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.within.html)
* `contains`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.contains.html)

In fact, we've often used geoseries methods without even realizing it. Often, we'd create a new column that stores either the line's length or a polygon's area. `gdf.geometry` is a geoseries, and we call methods on that geoseries, and add that as a new column.

For calculations like `length`, `area`, and `distance`, we need to use a projected CRS that has units like meters or feet. We cannot use decimal degrees (do not use WGS 84 / `EPSG:4326`)! Distance calculations must be done only once the spherical 3D Earth has been converted into a 2D plane.

* `length`: get the length of a line (`gdf.geometry.length`)
* `area`: get the area of a polygon (`gdf.geometry.area`)
* `centroid`: get the centroid of a polygon (`gdf.geometry.centroid`)
* `x`: get the x coordinate of a point (`gdf.geometry.x`)
* `y`: get the y coordinate of a point (`gdf.geometry.y`)

### Arrays
* Occasionally, we may even use arrays, especially when the datasets get even larger but we have simple mathematical calculations
* If we need to apply an exponential decay function to a distance column, we essentially want to multiple `distance` by some number
* Since this exponential decay function is somewhat custom and requires us to write our own formula, we would extract the column as a series (`df.distance`) and multiply each value by some other number.
* Even quicker is to use `numpy` with `distance_array = np.array(df.distance)` and get `exponential_array = distance_array*some_number`

In [1]:
import geopandas as gpd
import intake
import numpy as np
import pandas as pd


catalog = intake.open_catalog(
    "../_shared_utils/shared_utils/shared_data_catalog.yml")

If you're asking how far is a transit stop from the interstate, you'd want the distance of every point (every row) compared to an interstate highway geometry.

Let's prep the datasets to use series / geoseries to do this.

In [2]:
stops = catalog.ca_transit_stops.read()[["agency", "stop_id", 
                                         "stop_name", "geometry"]]
highways = catalog.state_highway_network.read()

Since we want to know the distance from a stop's point to the interstate generally, we need a dissolve. We don't want to compare the distance against the I-5, the I-10 individually, but to the interstate system as a whole.

In [3]:
highways.head(2)

Unnamed: 0,Route,County,District,RouteType,Direction,geometry
0,1,LA,7,State,NB,"MULTILINESTRING ((-118.14322 33.79010, -118.14..."
1,1,LA,7,State,SB,"MULTILINESTRING ((-118.39630 33.94454, -118.39..."


In [4]:
interstates = (highways[highways.RouteType=="Interstate"]
               .dissolve()
               .reset_index()
               [["geometry"]]
              )

In [5]:
# This is still a gdf, just with 1 column
type(interstates)

geopandas.geodataframe.GeoDataFrame

In [6]:
# Pulling out the individual column, it becomes a series/geoseries.
# It's a geoseries here because we had a gdf. 
# If it was a df, it would be a series.
print(type(stops.geometry))
print(type(interstates.geometry))

<class 'geopandas.geoseries.GeoSeries'>
<class 'geopandas.geoseries.GeoSeries'>


Distance is something you can calculate using `geopandas`.

Specifically, it takes a geoseries on the left, and either a geoseries or a single geometry on the right.

An example of having 2 geoseries would be comparing the distance between 2 points. On the left, it would be a geoseries of the origin points and on the right, destination points.

In [7]:
# We get a warning if we leave it in EPSG:4326!
stops.geometry.distance(interstates.geometry.iloc[0])


  stops.geometry.distance(interstates.geometry.iloc[0])


0         0.023029
1         0.024552
2         0.027300
3         0.026145
4         0.023530
            ...   
124173    0.294958
124174    0.293110
124175    0.292390
124176    0.293875
124177    0.291830
Length: 124178, dtype: float64

In [8]:
stops_geom = stops.to_crs("EPSG:2229").geometry
interstates_geom = interstates.to_crs("EPSG:2229").geometry.iloc[0]

In [9]:
distance_series = stops_geom.distance(interstates_geom)

In [10]:
# Let's make sure that for every stop, a distance is calculated
print(f"# rows in stops: {len(stops_geom)}")
print(f"# rows in stops: {len(distance_series)}")

# rows in stops: 124178
# rows in stops: 124178


In [11]:
# distance is numeric, not a geometry, so we're back to being a series
type(distance_series)

pandas.core.series.Series

What can we do with this? 

We usually add it as a new column. Since we did nothing to shift the index, we can just attach the series back to our gdf.

Getting a distance calculation using geoseries is much quicker than a row-wise lambda function where you calculate the distance.

```
Alternative method that's slower:
      
interstate_geom = interstates.geometry.iloc[0]

stops = stops.assign(
   distance = stops.geometry.apply(
         lambda x: x.distance(interstate_geom))
)   
```

In [12]:
stops = stops.assign(
    distance_to_interstate = distance_series
)

In [13]:
%%timeit
distance_series = stops_geom.distance(interstates_geom)

25.8 s ± 209 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
%%timeit
stops.assign(
   distance = stops.geometry.apply(
         lambda x: x.distance(interstates_geom))
)   

39.8 s ± 1.68 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [15]:
import dask_geopandas as dg

stops_gddf = dg.from_geopandas(stops, npartitions=2)
stops_geom_dg = stops_gddf.to_crs("EPSG:2229").geometry

In [16]:
%%timeit

distance_series = stops_geom_dg.distance(interstates_geom)

2.12 ms ± 189 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## To Do

* Use the `stop_times` table and `stops` table.
* Calculate the straight line distance between the first and last stop for each trip. Call this column `trip_distance`
* Calculate the distance between each stop to the nearest interstate. For each trip, keep the value for the stop that's the closest to the interstate. Call this column `shortest_distance_hwy`.
* For each trip, add these 2 new columns, but use series, geoseries, and/or arrays to assign it.
* Provide a preview of the resulting df (do not export)

In [17]:
GCS_FILE_PATH = ("gs://calitp-analytics-data/data-analyses/"
                 "rt_delay/compiled_cached_views/"
                )

analysis_date = "2023-01-18"
STOP_TIMES_FILE = f"{GCS_FILE_PATH}st_{analysis_date}.parquet"
STOPS_FILE = f"{GCS_FILE_PATH}stops_{analysis_date}.parquet"
highways = catalog.state_highway_network.read()

In [18]:
stops = pd.read_parquet(STOPS_FILE)
stop_times = pd.read_parquet(STOP_TIMES_FILE)



#### STOPS

In [19]:
stops.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,,,,,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\..."
1,6adf6cd9b6d24ab4ee8ee220e3697a73,14025,038cca58ef5f071ff5c94b8213989f87,Vermont / 110th,,,,107.0,,,,,,,,b'\x01\x01\x00\x00\x00\x10\x1em\x1c\xb1\x92]\x...


In [20]:
type(stops)

pandas.core.frame.DataFrame

- Geometry column contains well known binary element wkb, hence using loads to change the dataset to gdf

In [21]:
pip install shapely 

Note: you may need to restart the kernel to use updated packages.


In [22]:
from shapely.wkb import loads 

In [23]:
stops['geometry'] = stops['geometry'].apply(lambda x:loads(x)) 

In [24]:
stops_gdf = gpd.GeoDataFrame(stops, geometry = 'geometry')

In [25]:
type(stops_gdf)

geopandas.geodataframe.GeoDataFrame

In [26]:
stops_gdf.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,,,,,,POINT (-118.51454 34.19401)
1,6adf6cd9b6d24ab4ee8ee220e3697a73,14025,038cca58ef5f071ff5c94b8213989f87,Vermont / 110th,,,,107.0,,,,,,,,POINT (-118.29206 33.93564)


- Changing the CRS to EPSG:2229

In [27]:
stops_geom = stops_gdf.set_crs('EPSG:2229')

In [28]:
stops_geom.crs

<Projected CRS: EPSG:2229>
Name: NAD83 / California zone 5 (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - California - counties Kern; Los Angeles; San Bernardino; San Luis Obispo; Santa Barbara; Ventura.
- bounds: (-121.42, 32.76, -114.12, 35.81)
Coordinate Operation:
- name: SPCS83 California zone 5 (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

#### STOP TIMES

In [29]:
stop_times.head(2)

Unnamed: 0,feed_key,trip_id,stop_id,stop_sequence,timepoint,arrival_sec,departure_sec,arrival_hour,departure_hour
0,48138ae7269d615d5509958097039bf7,t287-b194-sl4_merged_3564,1140,11,,25047,25047,6,6
1,48138ae7269d615d5509958097039bf7,t708-b12D-sl4_merged_4213,1161,25,,66583,66583,18,18


In [30]:
type(stop_times)

pandas.core.frame.DataFrame

#### Creating a join between STOP and STOP TIMES

In [31]:
merge = pd.merge(stops_geom, stop_times, how= 'inner', on=["feed_key", "stop_id"])

In [32]:
merge.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,...,route_type_12,missing_route_type,geometry,trip_id,stop_sequence,timepoint,arrival_sec,departure_sec,arrival_hour,departure_hour
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,,POINT (-118.515 34.194),10165002071128-DEC22,54,0.0,44760,44760,12,12
1,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,,POINT (-118.515 34.194),10165002070728-DEC22,54,0.0,30480,30480,8,8


## Calculate the straight line distance between the first and last stop for each trip.


- Stop_Sequence = Order of stops for a particular trip. The values must increase along the trip but do not need to be consecutive.

#### Steps:
- Creating maximum and minimum sequence data to find the first and last stop of a trip
- Adding geometry to the Max and Min table by creating a merge with merged table to find the distance between two sequences
- Changing Pandas Dataset to gdf
- Calculating the distance between the geometries
- Adding the distance column to each trip_id


#### Creating maximum and minimum sequence data to find the first and last stop of a trip

In [33]:
pivot_max = merge.pivot_table(index='trip_id', values='stop_sequence', aggfunc='max').reset_index()

In [34]:
pivot_min = merge.pivot_table(index='trip_id', values='stop_sequence', aggfunc='min').reset_index()

In [35]:
pivot_max.head(2)

Unnamed: 0,trip_id,stop_sequence
0,002bqqucv,10
1,00339a54-1d52-408a-a0a5-db494ca7bef2,92


In [36]:
pivot_min.head(2)

Unnamed: 0,trip_id,stop_sequence
0,002bqqucv,1
1,00339a54-1d52-408a-a0a5-db494ca7bef2,0


#### Adding geometry to the Max and Min table by creating a merge with merged table to find the distance between two sequences


In [37]:
max_geom = pivot_max.merge(
    merge[['trip_id', 'geometry', 'stop_sequence']],
    on=['trip_id', 'stop_sequence'],
    how='left'
)

In [38]:
min_geom = pivot_min.merge(
    merge[['trip_id', 'geometry', 'stop_sequence']],
    on=['trip_id', 'stop_sequence'],
    how='left'
)

In [39]:
max_geom.head(2)

Unnamed: 0,trip_id,stop_sequence,geometry
0,002bqqucv,10,POINT (-122.399 37.632)
1,00339a54-1d52-408a-a0a5-db494ca7bef2,92,POINT (-121.764 38.660)


In [40]:
min_geom.head(2)

Unnamed: 0,trip_id,stop_sequence,geometry
0,002bqqucv,1,POINT (-122.399 37.632)
1,00339a54-1d52-408a-a0a5-db494ca7bef2,0,POINT (-121.764 38.660)


In [41]:
type(max_geom)
type(min_geom)

pandas.core.frame.DataFrame

#### Changing Pandas Dataset to gdf

In [42]:
gdf_max = gpd.GeoDataFrame(max_geom, geometry= 'geometry').set_crs('EPSG:2229')
gdf_min = gpd.GeoDataFrame(min_geom, geometry= 'geometry').set_crs('EPSG:2229')                      

In [43]:
gdf_max.crs

<Projected CRS: EPSG:2229>
Name: NAD83 / California zone 5 (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - California - counties Kern; Los Angeles; San Bernardino; San Luis Obispo; Santa Barbara; Ventura.
- bounds: (-121.42, 32.76, -114.12, 35.81)
Coordinate Operation:
- name: SPCS83 California zone 5 (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

In [44]:
gdf_min.crs

<Projected CRS: EPSG:2229>
Name: NAD83 / California zone 5 (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - California - counties Kern; Los Angeles; San Bernardino; San Luis Obispo; Santa Barbara; Ventura.
- bounds: (-121.42, 32.76, -114.12, 35.81)
Coordinate Operation:
- name: SPCS83 California zone 5 (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

#### Calculating the distance between the geometries 

In [45]:
distance = gdf_min.distance(gdf_max)

  distance = gdf_min.distance(gdf_max)


In [46]:
distance

0         0.000000
1         0.000000
2         0.000000
3         0.000492
4         0.000492
            ...   
107337         NaN
107338         NaN
107339         NaN
107340         NaN
107341         NaN
Length: 107342, dtype: float64

#### Adding the distance column to each trip_id

In [47]:
merged_gdf = gpd.GeoDataFrame.merge(gdf_max, gdf_min,left_on='trip_id', right_on = 'trip_id', suffixes = ('_max', '_min'))

In [48]:
merged_gdf = merged_gdf.assign(trip_distance = gdf_min.distance(gdf_max))

  merged_gdf = merged_gdf.assign(trip_distance = gdf_min.distance(gdf_max))


## Calculate the distance between each stop to the nearest interstate. For each trip, keep the value for the stop that's the closest to the interstate

Steps
- Checking tables and CRS
- Changing the CRS of interstates data
- Calculating the distance
- Adding the shortest distance to highway column to each trip_id

#### Checking tables and CRS

In [49]:
interstates.head(2)

Unnamed: 0,geometry
0,"MULTILINESTRING ((-122.06017 39.01932, -122.06..."


In [50]:
stops_geom.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,,,,,,POINT (-118.515 34.194)
1,6adf6cd9b6d24ab4ee8ee220e3697a73,14025,038cca58ef5f071ff5c94b8213989f87,Vermont / 110th,,,,107.0,,,,,,,,POINT (-118.292 33.936)


In [51]:
interstates.crs

<Geographic 2D CRS: EPSG:4326>
Name: WGS 84
Axis Info [ellipsoidal]:
- Lat[north]: Geodetic latitude (degree)
- Lon[east]: Geodetic longitude (degree)
Area of Use:
- name: World.
- bounds: (-180.0, -90.0, 180.0, 90.0)
Datum: World Geodetic System 1984 ensemble
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich

In [52]:
stops_geom.crs

<Projected CRS: EPSG:2229>
Name: NAD83 / California zone 5 (ftUS)
Axis Info [cartesian]:
- X[east]: Easting (US survey foot)
- Y[north]: Northing (US survey foot)
Area of Use:
- name: United States (USA) - California - counties Kern; Los Angeles; San Bernardino; San Luis Obispo; Santa Barbara; Ventura.
- bounds: (-121.42, 32.76, -114.12, 35.81)
Coordinate Operation:
- name: SPCS83 California zone 5 (US Survey feet)
- method: Lambert Conic Conformal (2SP)
Datum: North American Datum 1983
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich

#### Changing the CRS of interstates data

In [53]:
interstates_geom = interstates.to_crs("EPSG:2229").geometry.iloc[0]

#### Calculating the distance

In [54]:
shortest_distance_hwy = stops_geom.distance(interstates_geom)

In [55]:
shortest_distance_hwy

0        6.153002e+06
1        6.153002e+06
2        6.153001e+06
3        6.153001e+06
4        6.153001e+06
             ...     
84683    6.153003e+06
84684    6.153003e+06
84685    6.153003e+06
84686    6.153003e+06
84687    6.153003e+06
Length: 84688, dtype: float64

In [56]:
merged_stops_gdf = stops_geom.assign(shortest_distance_hwy = stops_geom.distance(interstates_geom))

#### Adding the shortest distance to highway column to each trip_id

In [57]:
merged_stops_gdf.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry,shortest_distance_hwy
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,,,,,,POINT (-118.515 34.194),6153002.0
1,6adf6cd9b6d24ab4ee8ee220e3697a73,14025,038cca58ef5f071ff5c94b8213989f87,Vermont / 110th,,,,107.0,,,,,,,,POINT (-118.292 33.936),6153002.0


In [58]:
merge.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,...,route_type_12,missing_route_type,geometry,trip_id,stop_sequence,timepoint,arrival_sec,departure_sec,arrival_hour,departure_hour
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,,POINT (-118.515 34.194),10165002071128-DEC22,54,0.0,44760,44760,12,12
1,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,...,,,POINT (-118.515 34.194),10165002070728-DEC22,54,0.0,30480,30480,8,8


In [59]:
merged_stops_trip_id = merge.merge(
    merged_stops_gdf,
    on= ['feed_key', 'stop_id','stop_key'], 
    how = 'inner',
    )[['feed_key', 'stop_id', 'stop_key', 'trip_id', 'shortest_distance_hwy']].drop_duplicates()

In [60]:
merged_stops_trip_id.head(2)

Unnamed: 0,feed_key,stop_id,stop_key,trip_id,shortest_distance_hwy
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,10165002071128-DEC22,6153002.0
1,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,10165002070728-DEC22,6153002.0


## For each trip, add these 2 new columns, but use series, geoseries, and/or arrays to assign it

Steps
- Adding distance column 
- Adding shortest distance to the highway column

In [61]:
trip_ids = stop_times.trip_id.reset_index()[['trip_id']]


In [62]:
unique_trip_ids = trip_ids.merge(merged_gdf[['trip_id', 'trip_distance']], on = 'trip_id', how = 'inner').drop_duplicates()

In [63]:
unique_trip_ids.head(2)

Unnamed: 0,trip_id,trip_distance
0,t287-b194-sl4_merged_3564,1.239299
29,t708-b12D-sl4_merged_4213,7.959108


In [64]:
trips_final = unique_trip_ids.merge(merged_stops_trip_id[['trip_id', 'shortest_distance_hwy']], on='trip_id', how = 'inner').drop_duplicates()

In [65]:
trips_final.head(10)

Unnamed: 0,trip_id,trip_distance,shortest_distance_hwy
0,t287-b194-sl4_merged_3564,1.239299,6153001.0
1,t287-b194-sl4_merged_3564,1.239299,6153001.0
2,t287-b194-sl4_merged_3564,1.239299,6153001.0
3,t287-b194-sl4_merged_3564,1.239299,6153001.0
4,t287-b194-sl4_merged_3564,1.239299,6153001.0
5,t287-b194-sl4_merged_3564,1.239299,6153001.0
6,t287-b194-sl4_merged_3564,1.239299,6153001.0
7,t287-b194-sl4_merged_3564,1.239299,6153001.0
8,t287-b194-sl4_merged_3564,1.239299,6153001.0
9,t287-b194-sl4_merged_3564,1.239299,6153001.0
