# Exercise 9: series

* pandas series vs numpy arrays [explanation](https://jakevdp.github.io/PythonDataScienceHandbook/03.01-introducing-pandas-objects.html)

### Common series operations
These are the most common series operations we use. Refer to the `pandas` docs for even more!

* Getting dates, hours, minutes from datetime types (`df.datetime_col.dt.date`)
* Parsing strings (`df.string_col.str.split('_')`)
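
As a minimal sketch of the two accessors above, using a toy dataframe (the column names `datetime_col` and `string_col` are placeholders, not columns from our datasets):

```python
import pandas as pd

# Toy data; the column names are made up for illustration
df = pd.DataFrame({
    "datetime_col": pd.to_datetime(
        ["2023-01-18 06:57:27", "2023-01-18 18:03:05"]),
    "string_col": ["route_10", "route_5"],
})

# .dt exposes datetime parts for every row at once
dates = df.datetime_col.dt.date
hours = df.datetime_col.dt.hour
minutes = df.datetime_col.dt.minute

# .str applies string methods to every row; split returns a list per row
parts = df.string_col.str.split("_")
# .str[1] picks the second piece of each list
route_number = parts.str[1]

print(hours.tolist())         # [6, 18]
print(route_number.tolist())  # ['10', '5']
```

Both `.dt` and `.str` return a new series with the same index, so the result can be assigned straight back as a column.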

### Common geoseries operations
These are the most common. Refer to the `geopandas` docs for even more!

* `distance`: distance between 2 points, or from a point to a polygon or line [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.distance.html)
* `intersects`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.intersects.html)
* `within`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.within.html)
* `contains`: [docs](https://geopandas.org/en/stable/docs/reference/api/geopandas.GeoSeries.contains.html)

In fact, we've often used geoseries methods and attributes without even realizing it. Often, we'd create a new column that stores either a line's length or a polygon's area. `gdf.geometry` is a geoseries, so we call a method or attribute on that geoseries and add the result as a new column.

For calculations like `length`, `area`, and `distance`, we need to use a projected CRS that has units like meters or feet. We cannot use decimal degrees (do not use WGS 84 / EPSG:4326)! Distance calculations should only be done once the spherical 3D Earth has been projected onto a 2D plane.

* `length`: get the length of a line (`gdf.geometry.length`)
* `area`: get the area of a polygon (`gdf.geometry.area`)
* `centroid`: get the centroid of a polygon (`gdf.geometry.centroid`)
* `x`: get the x coordinate of a point (`gdf.geometry.x`)
* `y`: get the y coordinate of a point (`gdf.geometry.y`)
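
The attributes listed above can be tried on toy geometries. This is only a sketch: the shapes are made up, and they are declared directly in EPSG:2229 (a projected CRS in feet) so that no reprojection is needed for the illustration.

```python
import geopandas as gpd
from shapely.geometry import LineString, Point, Polygon

# Toy geometries declared in a projected CRS (EPSG:2229, units are feet)
lines = gpd.GeoSeries([LineString([(0, 0), (3, 4)])], crs="EPSG:2229")
squares = gpd.GeoSeries(
    [Polygon([(0, 0), (10, 0), (10, 10), (0, 10)])], crs="EPSG:2229")
points = gpd.GeoSeries([Point(1, 2)], crs="EPSG:2229")

line_length = lines.length        # pandas Series of floats: 5.0
square_area = squares.area        # pandas Series of floats: 100.0
square_center = squares.centroid  # GeoSeries of points: POINT (5 5)
x_coord = points.x                # pandas Series of floats: 1.0
y_coord = points.y                # pandas Series of floats: 2.0

# Typical use: store the result as a new column on the gdf
gdf = gpd.GeoDataFrame(geometry=lines)
gdf = gdf.assign(length_ft=gdf.geometry.length)
```

Note that `length`, `area`, `x`, and `y` return a plain pandas series of numbers, while `centroid` returns another geoseries.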

### Arrays
* Occasionally, we may even use arrays, especially when the datasets get even larger but we have simple mathematical calculations
* If we need to apply an exponential decay function to a distance column, we essentially want to multiply `distance` by some number
* Since this exponential decay function is somewhat custom and requires us to write our own formula, we would extract the column as a series (`df.distance`) and multiply each value by some other number.
* Even quicker is to use `numpy` with `distance_array = np.array(df.distance)` and get `exponential_array = distance_array*some_number`
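
A minimal sketch of the series and array approaches, assuming a decay of the form `exp(-decay_rate * distance)`. Both the formula and the `decay_rate` value are hypothetical stand-ins for whatever custom function the analysis calls for.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"distance": [0.0, 100.0, 200.0]})

# Hypothetical decay rate; pick one that suits the analysis
decay_rate = 0.01

# Series version: vectorized, and the result keeps the index
decay_series = np.exp(-decay_rate * df.distance)

# Array version: drop the index and work on the raw values
distance_array = df.distance.to_numpy()
exponential_array = np.exp(-decay_rate * distance_array)

# An array is assigned by position, so its length must match the df
df = df.assign(decay_weight=exponential_array)
```

The two versions give the same numbers. The array version skips index handling, which is where any speed-up on large datasets would come from.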

In [1]:
import geopandas as gpd
import intake
import numpy as np
import pandas as pd

catalog = intake.open_catalog(
    "../_shared_utils/shared_utils/shared_data_catalog.yml")

If you're asking how far a transit stop is from the interstate, you'd want the distance from every point (every row) to an interstate highway geometry.

Let's prep the datasets to use series / geoseries to do this.

In [2]:
stops = catalog.ca_transit_stops.read()[["agency", "stop_id", 
                                         "stop_name", "geometry"]]
highways = catalog.state_highway_network.read()

Since we want to know the distance from a stop's point to the interstate generally, we need a dissolve. We don't want to compare the distance against the I-5 and the I-10 individually, but against the interstate system as a whole.

In [3]:
interstates = (highways[highways.RouteType=="Interstate"]
               .dissolve()
               .reset_index()
               [["geometry"]]
              ) 

In [4]:
# This is still a gdf, just with 1 column
type(interstates)

geopandas.geodataframe.GeoDataFrame

In [5]:
# Pulling out the individual column, it becomes a series/geoseries.
# It's a geoseries here because we had a gdf. 
# If it was a df, it would be a series.
print(type(stops.geometry))
print(type(interstates.geometry))

<class 'geopandas.geoseries.GeoSeries'>
<class 'geopandas.geoseries.GeoSeries'>


Distance is something you can calculate using `geopandas`.

Specifically, it takes a geoseries on the left, and either a geoseries or a single geometry on the right.

An example of having 2 geoseries would be comparing the distance between 2 points. On the left, it would be a geoseries of the origin points and on the right, destination points.
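
Both forms can be sketched with toy points (the coordinates are made up and declared in EPSG:2229 purely for illustration):

```python
import geopandas as gpd
from shapely.geometry import Point

origins = gpd.GeoSeries([Point(0, 0), Point(0, 0)], crs="EPSG:2229")
destinations = gpd.GeoSeries([Point(3, 4), Point(6, 8)], crs="EPSG:2229")

# Geoseries vs geoseries: row 0 is compared with row 0, row 1 with row 1
pairwise = origins.distance(destinations)

# Geoseries vs single geometry: every row is compared with the same shape
to_one = destinations.distance(Point(0, 0))

print(pairwise.tolist())  # [5.0, 10.0]
print(to_one.tolist())    # [5.0, 10.0]
```

When both sides are geoseries, rows are matched on their index by default, so it is worth checking that the two indexes line up before comparing.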

In [6]:
# We get a warning if we leave it in EPSG:4326!
stops.geometry.distance(interstates.geometry.iloc[0])




0         0.023029
1         0.024552
2         0.027300
3         0.026145
4         0.023530
            ...   
119963    0.294958
119964    0.293110
119965    0.292390
119966    0.293875
119967    0.291830
Length: 119968, dtype: float64

In [7]:
stops_geom = stops.to_crs("EPSG:2229").geometry
interstates_geom = interstates.to_crs("EPSG:2229").geometry.iloc[0]

In [8]:
distance_series = stops_geom.distance(interstates_geom)

In [9]:
# Let's make sure that for every stop, a distance is calculated
print(f"# rows in stops: {len(stops_geom)}")
print(f"# rows in distance_series: {len(distance_series)}")

# rows in stops: 119968
# rows in distance_series: 119968


In [10]:
# distance is numeric, not a geometry, so we're back to being a series
type(distance_series)

pandas.core.series.Series

What can we do with this? 

We usually add it as a new column. Since we did nothing to shift the index, we can just attach the series back to our gdf.
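
Why the index matters can be sketched with a toy dataframe (the values are made up). Assigning a series matches on index labels, not on row position:

```python
import pandas as pd

df = pd.DataFrame({"stop_id": ["a", "b", "c"]})
distance = pd.Series([10.0, 20.0, 30.0])

# Same index on both sides: values land on the rows they came from
same_index = df.assign(distance=distance)

# If the index has shifted (here, a filter without a reset), assignment
# still matches on index labels, so unmatched rows get NaN
filtered = distance[distance > 10]       # index is [1, 2]
by_label = df.assign(distance=filtered)  # row 0 gets NaN

# An array carries no index, so it is assigned purely by position
by_position = df.assign(distance=distance.to_numpy())
```

If rows were filtered, sorted with a reset, or merged between extracting the geoseries and assigning the result, check the index before attaching the series.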

Getting a distance calculation using geoseries is much quicker than a row-wise lambda function where you calculate the distance.

```
# Alternative method that's slower:

interstate_geom = interstates.geometry.iloc[0]

stops = stops.assign(
    distance = stops.geometry.apply(
        lambda x: x.distance(interstate_geom))
)
```

In [11]:
stops = stops.assign(
    distance_to_interstate = distance_series
)

In [12]:
%%timeit
distance_series = stops_geom.distance(interstates_geom)

22.7 s ± 185 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [13]:
%%timeit
stops.assign(
   distance = stops.geometry.apply(
         lambda x: x.distance(interstates_geom))
)   

54.3 s ± 11.1 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [14]:
import dask_geopandas as dg

stops_gddf = dg.from_geopandas(stops, npartitions=2)
stops_geom_dg = stops_gddf.to_crs("EPSG:2229").geometry

In [15]:
%%timeit

distance_series = stops_geom_dg.distance(interstates_geom)

2.22 ms ± 75.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


## To Do

* Use the `stop_times` table and `stops` table.
* Calculate the straight line distance between the first and last stop for each trip. Call this column `trip_distance`
* Calculate the distance from each stop to the nearest interstate. For each trip, keep the value for the stop that's closest to the interstate. Call this column `shortest_distance_hwy`.
* For each trip, add these 2 new columns, but use series, geoseries, and/or arrays to assign them.
* Provide a preview of the resulting df (do not export)
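
As a starting hint rather than a solution, here is one possible way to get per-trip values with a groupby, shown on toy data. The `distance_to_interstate` column and its values are made up. In the real tables, `trip_id` may only be unique within a feed, so grouping on both `feed_key` and `trip_id` is probably safer.

```python
import pandas as pd

# Toy stop_times with a made-up per-stop distance column
stop_times = pd.DataFrame({
    "trip_id": ["t1", "t1", "t1", "t2", "t2"],
    "stop_id": ["a", "b", "c", "d", "e"],
    "stop_sequence": [2, 1, 3, 1, 2],
    "distance_to_interstate": [50.0, 80.0, 20.0, 10.0, 40.0],
})

# Sort first so that "first" and "last" follow the stop sequence
ordered = stop_times.sort_values(["trip_id", "stop_sequence"])
by_trip = ordered.groupby("trip_id")

# One value per trip: first and last stop_id in sequence order
first_stop = by_trip.stop_id.first()
last_stop = by_trip.stop_id.last()

# One value per trip: the smallest stop-level distance
shortest_distance_hwy = by_trip.distance_to_interstate.min()

# All three series share the trip_id index, so they line up as columns
trips = pd.DataFrame({
    "first_stop": first_stop,
    "last_stop": last_stop,
    "shortest_distance_hwy": shortest_distance_hwy,
}).reset_index()
```

From here, the first and last stops' geometries can be looked up in `stops` and compared as two aligned geoseries to get `trip_distance`.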

In [16]:
GCS_FILE_PATH = ("gs://calitp-analytics-data/data-analyses/"
                 "rt_delay/compiled_cached_views/"
                )

analysis_date = "2023-01-18"
STOP_TIMES_FILE = f"{GCS_FILE_PATH}st_{analysis_date}.parquet"
STOPS_FILE = f"{GCS_FILE_PATH}stops_{analysis_date}.parquet"
highways = catalog.state_highway_network.read()

In [22]:
#test to import parquet files
stops = pd.read_parquet(STOPS_FILE)
stop_times = pd.read_parquet(STOP_TIMES_FILE)

In [25]:
# Helper that runs the same quick checks on any df or gdf
def df_check(df):
    display(f'shape of df:{df.shape}')
    display(f'type of :{type(df)}')
    display(df.head())

In [26]:
#what does each row mean?
#each row is a stop_key, a stop_key can have multiple feeds and stops
#what is the difference between stop_key and stop_id?
df_check(stops)

'shape of df:(84688, 16)'

"type of :<class 'pandas.core.frame.DataFrame'>"

Unnamed: 0,feed_key,stop_id,stop_key,stop_name,route_type_0,route_type_1,route_type_2,route_type_3,route_type_4,route_type_5,route_type_6,route_type_7,route_type_11,route_type_12,missing_route_type,geometry
0,6adf6cd9b6d24ab4ee8ee220e3697a73,15193,d4eb0920e7e256606df449c31b3c3e6a,Vanowen / Encino,,,,69.0,,,,,,,,"b'\x01\x01\x00\x00\x00?\xe5\x98,\xee\xa0]\xc0\..."
1,6adf6cd9b6d24ab4ee8ee220e3697a73,14025,038cca58ef5f071ff5c94b8213989f87,Vermont / 110th,,,,107.0,,,,,,,,b'\x01\x01\x00\x00\x00\x10\x1em\x1c\xb1\x92]\x...
2,6adf6cd9b6d24ab4ee8ee220e3697a73,15638,06b1447efcc028791c8409d65fa3b3ee,3rd / Hobart,,,,143.0,,,,,,,,b'\x01\x01\x00\x00\x00\xd7d\x8dz\x88\x93]\xc03...
3,6adf6cd9b6d24ab4ee8ee220e3697a73,10244,87f19e30889f90d25e6dee49f04c4985,Vernon / Hooper,,,,97.0,,,,,,,,b'\x01\x01\x00\x00\x00z\xc2\x12\x0f(\x90]\xc0\...
4,6adf6cd9b6d24ab4ee8ee220e3697a73,20206,eda9e3eb339b7f510babcd4ee0999f85,Broadway / Pacific,,,,108.0,,,,,,,,b'\x01\x01\x00\x00\x001du\xab\xe7\x90]\xc0\xf4...
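
In the preview above, `geometry` shows up as raw bytes (WKB) and the table is a plain `DataFrame`, because the file was read with `pd.read_parquet`. Geoseries methods won't work until the bytes are decoded. A minimal sketch of one way to do that on toy data follows. The CRS is an assumption here and should be checked against the actual file.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Toy frame that mimics a parquet read with pandas: geometry is WKB bytes
df = pd.DataFrame({
    "stop_id": ["15193", "14025"],
    "geometry": [Point(-118.5, 34.2).wkb, Point(-118.3, 33.9).wkb],
})

# Decode the bytes into a geoseries; EPSG:4326 is assumed, not confirmed
geom = gpd.GeoSeries.from_wkb(df.geometry, crs="EPSG:4326")
gdf = gpd.GeoDataFrame(df.drop(columns="geometry"), geometry=geom)

print(type(gdf.geometry))
```

If the parquet file was written by `geopandas`, reading it with `gpd.read_parquet(STOPS_FILE)` should return a gdf directly.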


In [27]:
df_check(stop_times)

'shape of df:(3589931, 9)'

"type of :<class 'pandas.core.frame.DataFrame'>"

Unnamed: 0,feed_key,trip_id,stop_id,stop_sequence,timepoint,arrival_sec,departure_sec,arrival_hour,departure_hour
0,48138ae7269d615d5509958097039bf7,t287-b194-sl4_merged_3564,1140,11,,25047,25047,6,6
1,48138ae7269d615d5509958097039bf7,t708-b12D-sl4_merged_4213,1161,25,,66583,66583,18,18
2,48138ae7269d615d5509958097039bf7,t476-b194-sl4_merged_4047,1153,22,,43440,43440,12,12
3,48138ae7269d615d5509958097039bf7,t6DF-b68-sl4_merged_3187,1437,4,,64959,64959,18,18
4,d4642902c43d526677dff02b09342b78,t607-b1F4B-sl2_merged_1620,601,1,,56580,56580,15,15


In [58]:
stop_times.sort_values(['trip_id', 'stop_sequence'])

Unnamed: 0,feed_key,trip_id,stop_id,stop_sequence,timepoint,arrival_sec,departure_sec,arrival_hour,departure_hour
1914079,d1b694a25d2e172e9ea98abe1829a0fd,002bqqucv,11,1,1.0,33960,33960,9,9
79498,d1b694a25d2e172e9ea98abe1829a0fd,002bqqucv,12,2,1.0,34020,34020,9,9
1861424,d1b694a25d2e172e9ea98abe1829a0fd,002bqqucv,13,3,1.0,34080,34080,9,9
142166,d1b694a25d2e172e9ea98abe1829a0fd,002bqqucv,14,4,1.0,34140,34140,9,9
104676,d1b694a25d2e172e9ea98abe1829a0fd,002bqqucv,15,5,1.0,34200,34200,9,9
...,...,...,...,...,...,...,...,...,...
70056,d1b694a25d2e172e9ea98abe1829a0fd,zwxp4b4ea,1,6,1.0,19560,19560,5,5
130401,d1b694a25d2e172e9ea98abe1829a0fd,zwxp4b4ea,2,7,1.0,19620,19620,5,5
1856152,d1b694a25d2e172e9ea98abe1829a0fd,zwxp4b4ea,3,8,1.0,19740,19740,5,5
177447,d1b694a25d2e172e9ea98abe1829a0fd,zwxp4b4ea,4,9,1.0,19920,19920,5,5


In [28]:
df_check(highways)

'shape of df:(1052, 6)'

"type of :<class 'geopandas.geodataframe.GeoDataFrame'>"

Unnamed: 0,Route,County,District,RouteType,Direction,geometry
0,1,LA,7,State,NB,"MULTILINESTRING ((-118.14322 33.79010, -118.14..."
1,1,LA,7,State,SB,"MULTILINESTRING ((-118.39630 33.94454, -118.39..."
2,1,MEN,1,State,NB,"MULTILINESTRING ((-123.81956 39.79816, -123.81..."
3,1,MEN,1,State,SB,"MULTILINESTRING ((-123.79591 39.69252, -123.79..."
4,1,MON,5,State,NB,"MULTILINESTRING ((-121.76641 36.77189, -121.76..."


In [44]:
highways_d = highways.dissolve(by='Route').reset_index()

In [46]:
df_check(highways_d)

'shape of df:(242, 6)'

"type of :<class 'geopandas.geodataframe.GeoDataFrame'>"

Unnamed: 0,Route,geometry,County,District,RouteType,Direction
0,1,"MULTILINESTRING ((-118.14322 33.79010, -118.14...",LA,7,State,NB
1,2,"MULTILINESTRING ((-118.23350 34.11859, -118.23...",LA,7,State,EB
2,3,"MULTILINESTRING ((-122.67443 41.67834, -122.67...",SIS,2,State,NB
3,4,"MULTILINESTRING ((-120.01535 38.48068, -120.01...",ALP,10,State,EB
4,5,"MULTILINESTRING ((-122.06017 39.01932, -122.06...",COL,3,Interstate,NB
