## Match datasets 

### Name your spark application as `GASPAR_final` or `GROUP_NAME_final`.

<div class='alert alert-info'><b>Any application without a proper name would be promptly killed.</b></div>

In [1]:
%%configure
{"conf": {
    "spark.app.name": "lgptguys_final"
}}

ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
8983,application_1589299642358_3520,pyspark,idle,Link,Link,
9081,application_1589299642358_3641,pyspark,idle,Link,Link,
9084,application_1589299642358_3644,pyspark,busy,Link,Link,
9098,application_1589299642358_3660,pyspark,idle,Link,Link,
9112,application_1589299642358_3675,pyspark,busy,Link,Link,
9121,application_1589299642358_3684,pyspark,idle,Link,Link,
9130,application_1589299642358_3694,pyspark,idle,Link,Link,
9145,application_1589299642358_3710,pyspark,idle,Link,Link,
9152,application_1589299642358_3716,pyspark,idle,Link,Link,
9153,application_1589299642358_3717,pyspark,idle,Link,Link,


### Start Spark

In [2]:
# Initialization
%%spark

Starting Spark application


ID,YARN Application ID,Kind,State,Spark UI,Driver log,Current session?
9180,application_1589299642358_3747,pyspark,idle,Link,Link,✔


SparkSession available as 'spark'.


An error was encountered:
unknown magic command '%spark'
UnknownMagic: unknown magic command '%spark'



In [3]:
%%send_to_spark -i username -t str -n username

An error was encountered:
Variable named username not found.


### Import useful libraries 

In [4]:
from geopy.distance import great_circle
from pyspark.sql.functions import *
from pyspark.sql.types import StructType, StructField, StringType, IntegerType,LongType, TimestampType

### Read TimeTable curated data

The curated timetable contains only the stops and trips within a 15 km radius of Zurich HB.

In [5]:
# Load data with stop_id of interest
stop_times = spark.read.csv('data/lgpt_guys/stop_times_final_cyril.csv', header = True)
stops_15km = stop_times.select(col('stop_id_general')).dropDuplicates()

### Read the [SBB actual data](https://opentransportdata.swiss/en/dataset/istdaten) in ORC format

In [7]:
sbb = spark.read.orc('/data/sbb/orc/istdaten')
sbb.columns

['betriebstag', 'fahrt_bezeichner', 'betreiber_id', 'betreiber_abk', 'betreiber_name', 'produkt_id', 'linien_id', 'linien_text', 'umlauf_id', 'verkehrsmittel_text', 'zusatzfahrt_tf', 'faellt_aus_tf', 'bpuic', 'haltestellen_name', 'ankunftszeit', 'an_prognose', 'an_prognose_status', 'abfahrtszeit', 'ab_prognose', 'ab_prognose_status', 'durchfahrt_tf']

### Subset SBB data

We keep only the `stop_id` values within 15 km of Zurich HB. We then write an intermediate table to avoid repeating the computation on the whole SBB dataset.

In [14]:
# Used to subset sbb table based on stop_id from stops_15km
stop_id  = stops_15km.select('stop_id_general').collect()
stop_idx = [item.stop_id_general for item in stop_id]

# Make the subset dataframe
sbb_filt = sbb.filter( sbb['bpuic'].isin(stop_idx) )\
              .select('fahrt_bezeichner','haltestellen_name', 'produkt_id',\
                      'ankunftszeit', 'abfahrtszeit', 'betriebstag',\
                      col('bpuic').alias('stop_id'))

We write the resulting subset to disk. Later steps can then read this subset instead of the whole dataset, which makes every run much faster.

In [15]:
# save
username = 'acoudray'
sbb_filt.write.format("orc").save("/user/{}/sbb_filt2.orc".format(username))

These are the files previously added to `/user/{}/` :
- `sbb_filt2.orc` : every day with stations < 15km (Cyril final version)
- `sbb_filt.orc` : every day with stations < 15km
- `sbb_subTime.orc` : schedule of May 13-17, 2019, stations < 15km
- `sbb_subTime2.orc` : schedule of May 13-17, 2019, stations < 15km (Cyril final version)
- `sbb_subTime3.orc` : schedule of May 13-17, 2019, stations < 15km (Cyril final version)
- `sbb_oneday.orc` : May 13th 2019 only, stations < 15km, `linien_id` field added

### Get corresponding stop_id between two datasets 

We first look at the stop identifiers in the timetable dataset. A `stop_id` can be given in multiple formats :
- `8502186` : the format defining the stop itself, which matches the sbb `bpuic` field

We will call the next 3 formats __Special cases__ throughout the notebook :
- `8502186:0:1` or `8502186:0:2` : the individual platforms are separated by “:”. A “platform” can also be a platform+sectors (e.g. “8500010:0:7CD”).
- `8502186P` : all the stops have a common “parent” (e.g. “8500010P”).
- `8502186:0:Bfpl` : appears when the RBS uses it for rail replacement buses.

source : [timetable cookbook](https://opentransportdata.swiss/en/cookbook/gtfs/), section stops.txt 

In the sbb actual data, the `bpuic` field is the equivalent of `stop_id` in its first format, which identifies the station without platform information.
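As an illustration of how the special cases relate to the station-level identifier, here is a minimal sketch that reduces any of the formats above to its first form. The helper name `to_general_stop_id` is hypothetical and not part of the notebook; the curated timetable already ships this value in its `stop_id_general` column, which may have been derived differently.

```python
def to_general_stop_id(stop_id):
    """Return the station-level part of a timetable stop_id (hypothetical helper)."""
    # Platform and replacement-bus variants: keep what precedes the first ':'.
    base = stop_id.split(":", 1)[0]
    # Parent variant: drop the trailing 'P'.
    if base.endswith("P"):
        base = base[:-1]
    return base


examples = ["8502186", "8502186:0:1", "8502186P", "8502186:0:Bfpl", "8500010:0:7CD"]
for raw in examples:
    print("{} -> {}".format(raw, to_general_stop_id(raw)))
```

Under this assumption, all four `8502186` variants map to `8502186`, the value that can be compared with `bpuic`.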

### Get corresponding trip_id between two datasets 

In the sbb dataset, trips are identified by the `fahrt_bezeichner` field, and in the timetable by `trip_id`. We use matching `stop_id` and arrival/departure times to find corresponding trip ids. Our goal is to find a match in the sbb dataset for each _timetable_ trip (and not the other way around), so we focus on building this asymmetrical correspondence table.

The following labels differentiate 3 ways to compute probabilities :
- __One-to-one__ : we find a single clear match. We use the distribution of delays on weekdays for a given trip/station_id, based on all past sbb data.
- __One-to-many__ : we find multiple matches. The matches are aggregated together in the final distribution table.
- __One-to-none__ : we find no match. As described later, we use the delay distribution of similar trips (sharing stop_id, transport type and hour) to infer the delay.
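The three labels can be derived from the number of distinct sbb matches per timetable trip. The sketch below is plain Python rather than Spark, and the function name `label_trips` and the trip ids in the example are hypothetical, chosen only to show one trip of each kind.

```python
from collections import defaultdict


def label_trips(pairs):
    """Label each trip_id from (trip_id, fahrt_bezeichner or None) pairs."""
    matches = defaultdict(set)
    for trip_id, fahrt_bezeichner in pairs:
        matched = matches[trip_id]  # registers the trip even when it has no match
        if fahrt_bezeichner is not None:
            matched.add(fahrt_bezeichner)

    labels = {}
    for trip_id, matched in matches.items():
        if not matched:
            labels[trip_id] = "one-to-none"
        elif len(matched) == 1:
            labels[trip_id] = "one-to-one"
        else:
            labels[trip_id] = "one-to-many"
    return labels


# Hypothetical ids, for illustration only.
example_pairs = [
    ("trip_a", "sbb_1"),
    ("trip_b", "sbb_2"),
    ("trip_b", "sbb_3"),
    ("trip_c", None),
]
labels = label_trips(example_pairs)
print(labels)
```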

__Timetable dataset__ 

We first load _timetable_ with curated trip_id, in a 15km radius from Zurich HB. 

In [6]:
# Load data 
stop_times = spark.read.csv('data/lgpt_guys/stop_times_final_cyril.csv', header = True)

# Print number of lines and show
print(stop_times.count())
stop_times.show(3, False)

260459
+-----------+---------------+----------------------+-------+------------+--------------+-------------+------------------------+----------------+----------------+---------------+---------------+------------+--------------------+---------+----------+--------+----------+---------------------------+
|route_id   |stop_id_general|trip_id               |stop_id|arrival_time|departure_time|stop_sequence|stop_name               |stop_lat        |stop_lon        |trip_headsign  |trip_short_name|direction_id|departure_first_stop|route_int|stop_count|stop_int|route_desc|monotonically_increasing_id|
+-----------+---------------+----------------------+-------+------------+--------------+-------------+------------------------+----------------+----------------+---------------+---------------+------------+--------------------+---------+----------+--------+----------+---------------------------+
|26-46-j19-1|8591371        |742.TA.26-46-j19-1.8.R|8591371|18:06:00    |18:06:00      |13           |

In [7]:
# Make the subset dataframe
stop_times_format = stop_times\
                   .select('trip_id', col('stop_id_general').alias('stop_id'), 
                          unix_timestamp(stop_times.arrival_time, 'HH:mm:ss')\
                          .alias('arrival_time_ut'),\
                          unix_timestamp(stop_times.departure_time, 'HH:mm:ss')\
                          .alias('departure_time_ut') )\
                   .select('trip_id', 'stop_id', 
                            from_unixtime('arrival_time_ut')\
                            .alias('arrival_time_dty'),
                            from_unixtime('departure_time_ut')\
                            .alias('departure_time_dty'))\
                   .select('trip_id', 'stop_id', 
                             date_format('arrival_time_dty', 'hh:mm')\
                             .alias('arrival_time'),
                             date_format('departure_time_dty', 'hh:mm')\
                            .alias('departure_time'))\
                   .na.fill("unknown")


stop_times_format.show(5, False)

+----------------------+-------+------------+--------------+
|trip_id               |stop_id|arrival_time|departure_time|
+----------------------+-------+------------+--------------+
|742.TA.26-46-j19-1.8.R|8591371|06:06       |06:06         |
|742.TA.26-46-j19-1.8.R|8591358|06:07       |06:07         |
|742.TA.26-46-j19-1.8.R|8591158|06:08       |06:08         |
|742.TA.26-46-j19-1.8.R|8576241|06:09       |06:09         |
|742.TA.26-46-j19-1.8.R|8591155|06:10       |06:10         |
+----------------------+-------+------------+--------------+
only showing top 5 rows

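We have reformatted `arrival_time` and `departure_time` into `hh:mm` format. Note that `hh` is the 12-hour clock pattern: the first row, whose raw `arrival_time` is `18:06:00`, is displayed as `06:06`. Times that are 12 hours apart therefore become indistinguishable (the 24-hour pattern would be `HH:mm`).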
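The effect of the 12-hour pattern can be reproduced outside Spark. This sketch uses Python's `strftime` codes `%I` (12-hour) and `%H` (24-hour) as stand-ins for the `hh` and `HH` patterns; the function name `reformat_time` is an assumption made for the example.

```python
from datetime import datetime


def reformat_time(time_str, twelve_hour=True):
    """Reformat 'HH:MM:SS' to hours and minutes, on a 12- or 24-hour clock."""
    parsed = datetime.strptime(time_str, "%H:%M:%S")
    # %I is the 12-hour clock (like 'hh'), %H the 24-hour clock (like 'HH').
    return parsed.strftime("%I:%M" if twelve_hour else "%H:%M")


evening = reformat_time("18:06:00")
morning = reformat_time("06:06:00")
print(evening, morning, evening == morning)
print(reformat_time("18:06:00", twelve_hour=False))
```

With the 12-hour clock, the 18:06 and 06:06 stops collapse onto the same string, whereas the 24-hour variant keeps them apart.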

Here is an example of a single trip_id for a Eurocity and another one for an Intercity train : both have very few stops in the 15 km perimeter around Zurich HB. 

In [8]:
EC_trip_id = '35.TA.40-5-Y-j19-1.33.H'

stop_times_format.filter(stop_times_format['trip_id'] == EC_trip_id)\
                 .show(10,False)

+-----------------------+-------+------------+--------------+
|trip_id                |stop_id|arrival_time|departure_time|
+-----------------------+-------+------------+--------------+
|35.TA.40-5-Y-j19-1.33.H|8503000|06:52       |07:02         |
|35.TA.40-5-Y-j19-1.33.H|8503016|07:12       |07:14         |
+-----------------------+-------+------------+--------------+

In [9]:
IC_trip_id = '6.TA.16-5-j19-1.6.R'

stop_times_format.filter(stop_times_format['trip_id'] == IC_trip_id)\
                 .show(10,False)

+-------------------+-------+------------+--------------+
|trip_id            |stop_id|arrival_time|departure_time|
+-------------------+-------+------------+--------------+
|6.TA.16-5-j19-1.6.R|8503000|07:30       |07:39         |
|6.TA.16-5-j19-1.6.R|8503006|07:45       |07:46         |
|6.TA.16-5-j19-1.6.R|8503016|07:51       |07:53         |
+-------------------+-------+------------+--------------+

We now have the _timetable_ trip_id, the stop_id as defined above, and the arrival/departure times. The idea is to match this information with what we have in the sbb dataset : stop_id and times should coincide between both datasets.

 __SBB dataset__
 
We load the previously saved sbb subset (`sbb_filt2.orc` : every day, stations < 15km) :

In [10]:
username='acoudray'
sbb_subTime = spark.read.orc("/user/{}/sbb_filt2.orc".format(username))

sbb_subTime.show(5)

+----------------+-----------------+----------+----------------+----------------+-----------+-------+
|fahrt_bezeichner|haltestellen_name|produkt_id|    ankunftszeit|    abfahrtszeit|betriebstag|stop_id|
+----------------+-----------------+----------+----------------+----------------+-----------+-------+
|    85:11:10:002|        Zürich HB|       Zug|12.10.2018 21:51|                | 12.10.2018|8503000|
| 85:11:10293:004|        Zürich HB|       Zug|                |13.10.2018 00:25| 12.10.2018|8503000|
| 85:11:10293:004| Zürich Flughafen|       Zug|13.10.2018 00:34|13.10.2018 00:35| 12.10.2018|8503016|
| 85:11:10536:004|        Zürich HB|       Zug|                |12.10.2018 20:03| 12.10.2018|8503000|
| 85:11:10537:006|        Zürich HB|       Zug|12.10.2018 21:59|                | 12.10.2018|8503000|
+----------------+-----------------+----------+----------------+----------------+-----------+-------+
only showing top 5 rows

We first convert the times to the 'hh:mm' string format, the same as in the timetable. We fill nulls with the string 'unknown' (to be able to catch them) and we format `fahrt_bezeichner` to remove extended

In [11]:
# Make the subset dataframe
sbb_filt = sbb_subTime.select('fahrt_bezeichner', 'stop_id',\
                          unix_timestamp(sbb_subTime.ankunftszeit, 'dd.MM.yyyy HH:mm')\
                          .alias('arrival_time_ut'),\
                          unix_timestamp(sbb_subTime.abfahrtszeit, 'dd.MM.yyyy HH:mm')\
                          .alias('departure_time_ut') )\
                   .select('fahrt_bezeichner', 'stop_id', 
                            from_unixtime('arrival_time_ut')\
                            .alias('arrival_time_dty'),
                            from_unixtime('departure_time_ut')\
                            .alias('departure_time_dty'))\
                   .select('fahrt_bezeichner', 'stop_id', 
                             date_format('arrival_time_dty', 'hh:mm')\
                             .alias('arrival_time'),
                             date_format('departure_time_dty', 'hh:mm')\
                            .alias('departure_time'))\
                   .na.fill("unknown")

sbb_filt.show(5, False)

+----------------+-------+------------+--------------+
|fahrt_bezeichner|stop_id|arrival_time|departure_time|
+----------------+-------+------------+--------------+
|85:11:10:002    |8503000|09:51       |unknown       |
|85:11:10293:004 |8503000|unknown     |12:25         |
|85:11:10293:004 |8503016|12:34       |12:35         |
|85:11:10536:004 |8503000|unknown     |08:03         |
|85:11:10537:006 |8503000|09:59       |unknown       |
+----------------+-------+------------+--------------+
only showing top 5 rows
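The conversion above can be mimicked in plain Python to check individual values by hand. This is a sketch under assumptions: the helper name `sbb_time_to_hhmm` is hypothetical, `%I` stands in for the 12-hour `hh` pattern, and blank strings are mapped to 'unknown' directly, whereas the notebook gets there through nulls and `na.fill`.

```python
from datetime import datetime


def sbb_time_to_hhmm(raw):
    """Convert 'dd.MM.yyyy HH:mm' to a 12-hour 'hh:mm' string, or 'unknown'."""
    if raw is None or not raw.strip():
        return "unknown"
    parsed = datetime.strptime(raw, "%d.%m.%Y %H:%M")
    return parsed.strftime("%I:%M")


for raw in ["12.10.2018 21:51", "13.10.2018 00:25", ""]:
    print("{!r} -> {}".format(raw, sbb_time_to_hhmm(raw)))
```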

Let's check a control Eurocity train :

In [12]:
EC_trip_sbb = '85:11:2281:004'

sbb_filt.filter(sbb_filt['fahrt_bezeichner'] == EC_trip_sbb)\
                 .show(20,False)

+----------------+-------+------------+--------------+
|fahrt_bezeichner|stop_id|arrival_time|departure_time|
+----------------+-------+------------+--------------+
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:2281:004  |8503016|07:12       |07:14         |
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:2281:004  |8503016|07:12       |07:14         |
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:2281:004  |8503016|07:12       |07:14         |
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:2281:004  |8503016|07:12       |07:14         |
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:2281:004  |8503016|07:12       |07:14         |
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:2281:004  |8503016|07:12       |07:14         |
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:2281:004  |8503016|07:12       |07:14         |
|85:11:2281:004  |8503000|06:52       |07:02         |
|85:11:228

There is one match per day, but they all share the same `fahrt_bezeichner`. 

__Join two datasets on stop_id and time__

We can now create a joined table from the timetable-derived `stop_times_format` and the sbb-derived `sbb_filt`. We join the tables on `stop_id`, `arrival_time` and `departure_time`. The idea is to compare the trip ids from both datasets : matching ones end up on the same line after the join. We use a _left_outer_ join so that _null_ values can only appear on the sbb side (asymmetrical join).

In [13]:
joined_trip_table = stop_times_format.join(sbb_filt,\
                                           on=['stop_id', 'arrival_time', 'departure_time'],\
                                           how='left_outer')\
                                     .select('stop_id', 'arrival_time', 'departure_time',
                                             'trip_id', 'fahrt_bezeichner')\
                                     .distinct()\
                                     .select('trip_id', 'fahrt_bezeichner')
                                     #.select('trip_id', col('fahrt_bezeichner_format').alias('fahrt_bezeichner'))
joined_trip_table.show(10, False)

+-----------------------+----------------+
|trip_id                |fahrt_bezeichner|
+-----------------------+----------------+
|62.TA.1-17-A-j19-1.5.H |85:31:511:000   |
|62.TA.1-17-A-j19-1.5.H |85:31:571:000   |
|62.TA.1-17-A-j19-1.5.H |85:31:511:002   |
|62.TA.1-17-A-j19-1.5.H |85:31:571:002   |
|5.TA.30-57-Y-j19-1.1.H |null            |
|29.TA.30-57-Y-j19-1.1.H|null            |
|152.TA.26-14-j19-1.32.R|null            |
|224.TA.26-14-j19-1.43.H|85:11:19427:001 |
|224.TA.26-14-j19-1.43.H|85:11:19475:001 |
|224.TA.26-14-j19-1.43.H|85:11:31475:005 |
+-----------------------+----------------+
only showing top 10 rows

These are the raw results of the join. Note that we used `distinct()` to avoid having multiple lines corresponding to multiple days. Each matched stop of a `trip_id` is thus kept once per `fahrt_bezeichner`, no matter on how many days it occurs. 
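To make the join semantics concrete, here is a small plain-Python sketch of a left outer join on (`stop_id`, `arrival_time`, `departure_time`) followed by a distinct and a projection to (`trip_id`, `fahrt_bezeichner`). The function name `join_trip_pairs` and the `unmatched_trip` row are hypothetical; the Eurocity rows are the ones shown earlier, with the sbb side repeated to imitate two days.

```python
def join_trip_pairs(timetable_rows, sbb_rows):
    """Left outer join on (stop_id, arrival, departure), distinct, then project."""
    # Index sbb trips by join key; the set removes repeats over several days.
    sbb_by_key = {}
    for fahrt_bezeichner, stop_id, arrival, departure in sbb_rows:
        key = (stop_id, arrival, departure)
        sbb_by_key.setdefault(key, set()).add(fahrt_bezeichner)

    distinct_rows = set()
    for trip_id, stop_id, arrival, departure in timetable_rows:
        key = (stop_id, arrival, departure)
        # Left outer: a timetable row without a match is kept with None.
        for fahrt_bezeichner in sbb_by_key.get(key, {None}):
            distinct_rows.add(key + (trip_id, fahrt_bezeichner))

    # One (trip_id, fahrt_bezeichner) pair remains per matched stop.
    pairs = [(row[3], row[4]) for row in distinct_rows]
    return sorted(pairs, key=lambda pair: (pair[0], pair[1] or ""))


timetable_rows = [
    ("35.TA.40-5-Y-j19-1.33.H", "8503000", "06:52", "07:02"),
    ("35.TA.40-5-Y-j19-1.33.H", "8503016", "07:12", "07:14"),
    ("unmatched_trip", "8503000", "01:00", "01:01"),  # hypothetical row
]
one_day = [
    ("85:11:2281:004", "8503000", "06:52", "07:02"),
    ("85:11:2281:004", "8503016", "07:12", "07:14"),
]
pairs = join_trip_pairs(timetable_rows, one_day * 2)  # same trip on two days
print(pairs)
```

The Eurocity pair appears twice (once per shared stop) even though the sbb rows were duplicated across days, and the unmatched trip keeps a `None` partner.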

Now we can count how many stops (with same time) are shared between trip_id from _timetable_ and _sbb_ data.

In [14]:
joined_trip_count = joined_trip_table.groupBy("trip_id", "fahrt_bezeichner")\
                                    .count()

joined_trip_count.show(10, False)

+-------------------------+----------------------+-----+
|trip_id                  |fahrt_bezeichner      |count|
+-------------------------+----------------------+-----+
|224.TA.26-14-j19-1.43.H  |85:11:19427:001       |13   |
|1250.TA.26-67-j19-1.2.H  |85:849:248315-13067-1 |8    |
|233.TA.26-24-j19-1.123.H |85:11:20419:001       |12   |
|56.TA.79-10-B-j19-1.3.H  |85:78:12823:002       |4    |
|12.TA.26-816-j19-1.1.H   |85:838:401979-17850-1 |4    |
|12.TA.26-816-j19-1.1.H   |85:838:298342-17850-1 |4    |
|12.TA.26-816-j19-1.1.H   |85:838:232343-10850-1 |4    |
|603.TA.26-33E-j19-1.4.H  |85:849:414098-12412-1 |10   |
|603.TA.26-33E-j19-1.4.H  |85:849:521451-28031-1 |1    |
|3683.TA.26-8-C-j19-1.27.H|85:3849:590678-05011-1|4    |
+-------------------------+----------------------+-----+
only showing top 10 rows

Write intermediate table to save results and get better performance for next steps.

In [15]:
joined_trip_count.write.csv('data/lgpt_guys/joined_trip_count_6_full.csv', \
                            header = True, mode="overwrite")

Re-load cached data. 

In [5]:
joined_trip_count = spark.read.csv('data/lgpt_guys/joined_trip_count_6_full.csv', \
                            header = True)

We can now use a threshold to keep only the correspondences between `trip_id` and `fahrt_bezeichner` that share a minimum number of `stop_id` at the same `departure_time` / `arrival_time`. We chose 2 as the minimum number of matches : this was required to be able to capture InterCity / InterRegio trains, which have few stops in the 15 km perimeter.

In [6]:
cutoff_min_overlap = 2

joined_trip_atL2 = joined_trip_count.filter(col('count') >= cutoff_min_overlap )

joined_trip_atL2.show(10, False)

+--------------------------+----------------------+-----+
|trip_id                   |fahrt_bezeichner      |count|
+--------------------------+----------------------+-----+
|89.TA.26-721-j19-1.3.H    |85:773:778860-04720-1 |7    |
|217.TA.1-17-A-j19-1.17.H  |85:31:987:000         |12   |
|1890.TA.26-11-A-j19-1.27.R|85:3849:137108-21011-1|25   |
|1612.TA.26-10-j19-1.11.R  |85:3849:617087-24010-1|19   |
|113.TA.26-131-j19-1.6.R   |85:807:473534-31131-1 |3    |
|113.TA.26-131-j19-1.6.R   |85:807:620139-12131-1 |3    |
|124.TA.26-131-j19-1.7.R   |85:807:277454-25131-1 |6    |
|59.TA.26-842-j19-1.1.H    |85:838:283297-14850-1 |5    |
|59.TA.26-842-j19-1.1.H    |85:838:83400-11850-2  |5    |
|38.TA.26-660-j19-1.5.H    |85:882:847761-15101-1 |6    |
+--------------------------+----------------------+-----+
only showing top 10 rows

This table indicates how many stop_id / departure_time / arrival_time combinations are identical between a `trip_id` (_timetable_ data) and a `fahrt_bezeichner` (sbb data). We keep every pair with at least `cutoff_min_overlap` (here 2) matches between the two datasets.
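The counting and thresholding steps amount to a group-by count followed by a filter. Here is a minimal plain-Python sketch; the function names `count_shared_stops` and `keep_min_overlap` are hypothetical, and the example reuses the two `603.TA.26-33E-j19-1.4.H` rows from the count table above (10 and 1 shared stops).

```python
from collections import Counter


def count_shared_stops(pairs):
    """Count how often each (trip_id, fahrt_bezeichner) pair occurs."""
    return Counter(pairs)


def keep_min_overlap(counts, cutoff_min_overlap=2):
    """Keep only the pairs sharing at least cutoff_min_overlap stops."""
    return {pair: n for pair, n in counts.items() if n >= cutoff_min_overlap}


trip = "603.TA.26-33E-j19-1.4.H"
pairs = [(trip, "85:849:414098-12412-1")] * 10 + [(trip, "85:849:521451-28031-1")]
counts = count_shared_stops(pairs)
kept = keep_min_overlap(counts)
print(kept)
```

With the cutoff at 2, the pair sharing a single stop is dropped and only the 10-stop match remains.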

We write a little summary of how many trip_id are found in the translation table :

In [7]:
print("size of the join is :                           {}".format(joined_trip_atL2.count()))
print("number of unique timetable trip_id :            {}".format(joined_trip_atL2.select('trip_id').distinct().count()))
print("number of unique sbb trip_id (fahrtbezeichner): {}".format(joined_trip_atL2.select('fahrt_bezeichner').distinct().count()))

size of the join is :                           921476
number of unique timetable trip_id :            19135
number of unique sbb trip_id (fahrtbezeichner): 601214

Write the results in csv format to the general folder `data/lgpt_guys`.

In [8]:
joined_trip_atL2.write.csv('data/lgpt_guys/match_datasets_translation.csv', header = True, mode="overwrite")

__Validation - Check a few example trains__

Can we find these trains in the translation table? We will check 
- a Eurocity train : trip_id = `35.TA.40-5-Y-j19-1.33.H` 
- and an Intercity  : trip_id = `6.TA.16-5-j19-1.6.R`

They should have 2 and 3 stop_id respectively :
- EC : `8503000` and `8503016` 
- IC : `8503000`, `8503006` and `8503016`

In [19]:
IC_trip_id = '6.TA.16-5-j19-1.6.R'
joined_trip_atL2.filter(joined_trip_atL2.trip_id == IC_trip_id).show()

+-------------------+----------------+-----+
|            trip_id|fahrt_bezeichner|count|
+-------------------+----------------+-----+
|6.TA.16-5-j19-1.6.R|  85:11:1533:001|    3|
|6.TA.16-5-j19-1.6.R| 85:11:31509:004|    3|
|6.TA.16-5-j19-1.6.R| 85:11:30509:005|    3|
|6.TA.16-5-j19-1.6.R| 85:11:30533:006|    3|
|6.TA.16-5-j19-1.6.R|  85:11:1509:001|    3|
|6.TA.16-5-j19-1.6.R|  85:11:1509:002|    3|
|6.TA.16-5-j19-1.6.R| 85:11:30509:002|    3|
|6.TA.16-5-j19-1.6.R| 85:11:71533:001|    2|
|6.TA.16-5-j19-1.6.R| 85:11:31533:001|    3|
|6.TA.16-5-j19-1.6.R| 85:11:30233:008|    3|
|6.TA.16-5-j19-1.6.R| 85:11:71533:007|    2|
|6.TA.16-5-j19-1.6.R| 85:11:30533:005|    3|
|6.TA.16-5-j19-1.6.R| 85:11:30533:001|    3|
|6.TA.16-5-j19-1.6.R| 85:11:10409:002|    2|
|6.TA.16-5-j19-1.6.R| 85:11:30509:006|    3|
|6.TA.16-5-j19-1.6.R| 85:11:10433:002|    2|
|6.TA.16-5-j19-1.6.R| 85:11:30033:003|    3|
|6.TA.16-5-j19-1.6.R| 85:11:71509:002|    2|
|6.TA.16-5-j19-1.6.R|  85:11:1533:002|    3|
|6.TA.16-5

In [20]:
EC_trip_id = '35.TA.40-5-Y-j19-1.33.H'
joined_trip_atL2.filter(joined_trip_atL2.trip_id == EC_trip_id).show()

+--------------------+----------------+-----+
|             trip_id|fahrt_bezeichner|count|
+--------------------+----------------+-----+
|35.TA.40-5-Y-j19-...| 85:11:30191:004|    2|
|35.TA.40-5-Y-j19-...|  85:11:2281:004|    2|
|35.TA.40-5-Y-j19-...| 85:11:30591:015|    2|
|35.TA.40-5-Y-j19-...|   85:11:191:001|    2|
+--------------------+----------------+-----+