# Report: State of train mobility in Germany

This report analyzes the current state of train mobility in Germany. It addresses the following questions:

- Where is the expansion or improvement of existing rail connections worth prioritising?
- Where is the train infrastructure already good, and where should improvements be made quickly?

For this purpose, the report first analyses connection times between different cities for different means of transport. Secondly, it attempts to identify bottlenecks at train stations via a Deutsche Bahn timetable API.

The structure of this report is as follows:

1. [Introduction](#introduction)
2. [Analysis of Dataset 1: Connection Times between German towns](#analysis-of-dataset-1-connection-times-between-german-towns)
3. [Analysis of Dataset 2: Delay causes for specific train stations](#analysis-of-dataset-2-delay-causes-for-specific-train-stations)
4. [Summary](#summary)

This report is based on open data from two different datasources:

### Datasource 1
Datasource 1 holds a graph with the connection times between the 100 biggest towns in Germany by different means of transport. 

The Datasource 1 data is provided under a [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.

Datasource information:
- Metadata URL: https://mobilithek.info/offers/573356838940979200
- Data URL: https://mobilithek.info/mdp-api/files/aux/573356838940979200/moin-2022-05-02.1-20220502.131229-1.ttl.bz2
- Data Type: RDF (Star) Graph, .ttl.bz2 - Archive

### Datasource 2
The second datasource is the DB Timetable API Version 1.0.x. The Timetables API can be used to query the current (train) traffic situation in Germany and the causes of disruptions. For this report, an API endpoint is called that returns all known delay causes for a train station identified by its *eva number* (train station identifier).

For further information see the official website [DB API Marketplace - Timetables API](https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables). The OpenAPI document of the DB Timetable API is also available there.

The Timetables API data is provided under a [Creative Commons Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/) license.

Datasource information:
- Metadata URL: https://developers.deutschebahn.com/db-api-marketplace/apis/product/timetables/api/26494#/Timetables_10213/overview
- Data URL: https://apis.deutschebahn.com/db-api-marketplace/apis/timetables/v1/
- Data Type: API - application/xml

## Introduction
This section covers all prerequisites for the analysis of both datasets, such as the installation of dependencies and the loading of the datasets.

It also shows which towns are covered in both datasets and which towns are analyzed.

### Install dependencies
First, install all required dependencies.

In [34]:
%pip install pandas
%pip install plotly
%pip install SQLAlchemy
%pip install nbformat
%pip install ipywidgets

Note: you may need to restart the kernel to use updated packages.



[notice] A new release of pip available: 22.2.2 -> 23.1.2
[notice] To update, run: python.exe -m pip install --upgrade pip



### Load data
Create one pandas DataFrame per dataset from the local SQLite file.

In [35]:
import pandas as pd

ds1_df = pd.read_sql_table('connection_time_graph', 'sqlite:///project/data/train_connection_analysis.sqlite')
ds2_df = pd.read_sql_table('timetable_for_stations', 'sqlite:///project/data/train_connection_analysis.sqlite')

## Analysis of Dataset 1: Connection Times between German towns
This chapter covers the analysis of Dataset 1.

The question we want to answer here is which towns already have better or worse train connection times to other towns in comparison to the car connection times.

We therefore query the graph from Dataset 1, which includes the connection times between all analyzed towns for different means of transport such as cars or trains.

### Structure of Dataset 1
Before diving into the analysis, we show the structure of the dataset to get a feeling for the data.

In [37]:
print(ds2_df.info)
ds2_df.head(2)

<bound method DataFrame.info of        index        id message_type     from_time       to_time  \
0          0  r1923001            h  2.303310e+09  2.306302e+09   
1          1  r1961074            h  2.306232e+09  2.306260e+09   
2          2  r1923001            h  2.303310e+09  2.306302e+09   
3          3  r1982070            h  2.306232e+09  2.307122e+09   
4          4  r1983704            h  2.306231e+09  2.306241e+09   
...      ...       ...          ...           ...           ...   
42998  17330  r1978875            h  2.306180e+09  2.307302e+09   
42999  17331  r1978875            h  2.306180e+09  2.307302e+09   
43000  17332  r1806644            h  2.212110e+09  2.312092e+09   
43001  17333      None         None           NaN           NaN   
43002  17334      None         None           NaN           NaN   

                                 category     timestamp  priority  \
0      Bauarbeiten. (Quelle: zuginfo.nrw)  2.304052e+09       2.0   
1      Bauarbeiten. (Quel

| | index | id | message_type | from_time | to_time | category | timestamp | priority | train_station | problems_found | del |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | r1923001 | h | 2303310000.0 | 2306302000.0 | Bauarbeiten. (Quelle: zuginfo.nrw) | 2304052000.0 | 2.0 | Aachen Hbf | True | |
| 1 | 1 | r1961074 | h | 2306232000.0 | 2306260000.0 | Bauarbeiten. (Quelle: zuginfo.nrw) | 2306061000.0 | 2.0 | Aachen Hbf | True | |


### Connection Times for a specific connection
To look at the connection times of a specific connection in detail, we make the data for a single connection accessible. First we show the information as it is stored in the dataset. Later we aggregate the data by town and calculate metrics to compare train and car connections per town.

Just select a \<Source\> and a \<Destination\>. The output shows the duration and the transportType of every possible connection.

Note: There can be several train connections with different connection times between the same two towns.

In [38]:
import ipywidgets as widgets
from ipywidgets import interact

# Options for the two dropdown menus
sources = list(ds1_df["source"].unique())
destinations = list(ds1_df["destination"].unique())

@interact
def show_basic_connection_information(source=sources, destination=destinations):
    # Select all rows of the chosen connection
    connection = ds1_df[(ds1_df["source"] == source) & (ds1_df["destination"] == destination)]
    # Copy the selection so that formatting the duration does not touch ds1_df
    connection = connection[["source", "destination", "duration", "transportType"]].copy()
    # Format the duration (minutes) as hours and minutes; this wraps around for durations of 24 h or more
    connection["duration"] = pd.to_datetime(connection.duration, unit='m').dt.strftime('%Hh %Mmin')
    print(connection)


### Aggregate connection times per connection into a single row and calculate median/min train duration

To answer the question of where good train connections already exist, we need to compare the connection times by car with those by train. We therefore determine the fastest and the median train connection time and compare both with the connection time by car. The information for a specific connection is stored in a single row.

In [39]:
# Select the train connections, group the trains connections by 'source' and 'destination' and calculate the median and minimum duration
train_df = ds1_df[ds1_df['transportType'] == 'train']
train_grouped = train_df.groupby(['source', 'destination'])['duration'].agg(['median', 'min']).reset_index()

# Filter the DataFrame for 'car' durations
car_df = ds1_df[ds1_df['transportType'] == 'car']
car_df = car_df[["source", "destination", "duration"]]
#print(car_df)

# Merge the train_df and car_df on 'source' and 'destination'
connection_times_df = pd.merge(train_grouped, car_df, on=['source', 'destination'], how='left')

# Rename the columns
connection_times_df.rename(columns={'median': 'median_train_duration', 'min': 'min_train_duration', 'duration': 'car_duration'}, inplace=True)

print(connection_times_df.head(2))

   source destination  median_train_duration  min_train_duration  car_duration
0  Aachen    Augsburg                  314.0                 302           318
1  Aachen      Berlin                  356.0                 351           352


### Compare train connection times with car connection times
To answer the question of where a train connection is already better than a car connection, we calculate the difference between the connection times (car duration minus train duration).

* A positive value x means that the train connection is x minutes faster than the car connection between a source and a destination.
* A negative value -x means that the train connection is x minutes slower than the car connection between a source and a destination.
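The sign convention can be checked by hand with the two sample rows printed above (Aachen to Augsburg and Aachen to Berlin). The following is a minimal sketch in plain Python; the helper name `train_advantage` is made up for illustration and is not part of the notebook.

```python
def train_advantage(car_minutes, train_minutes):
    """Minutes saved by taking the train (negative: the train is slower)."""
    return car_minutes - train_minutes

# Values taken from the first two rows of connection_times_df
print(train_advantage(318, 302))  # Aachen -> Augsburg, fastest train
print(train_advantage(318, 314))  # Aachen -> Augsburg, median train
print(train_advantage(352, 356))  # Aachen -> Berlin, median train (negative result)
```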

In [40]:
connection_times_df["diff_car_median_train_duration"] = connection_times_df["car_duration"] - connection_times_df["median_train_duration"]
connection_times_df["diff_car_min_train_duration"] = connection_times_df["car_duration"] - connection_times_df["min_train_duration"]

print(connection_times_df.head(2))

   source destination  median_train_duration  min_train_duration  \
0  Aachen    Augsburg                  314.0                 302   
1  Aachen      Berlin                  356.0                 351   

   car_duration  diff_car_median_train_duration  diff_car_min_train_duration  
0           318                             4.0                           16  
1           352                            -4.0                            1  


### Show metrics for a specific connection

Now we can show the calculated metrics for a specific connection and show which transportation type is faster for a specific connection between two towns.

In [41]:
# The sources and destinations lists for the dropdowns were created earlier

def describe_difference(label, source, destination, diff):
    # Positive diff: the train is faster; negative diff: the train is slower
    if pd.isna(diff):
        return f"\nNo car duration available for the connection from {source} to {destination}."
    if diff > 0:
        return f"\nThe {label} train connection from {source} to {destination} is {diff} minutes faster than the car connection."
    if diff < 0:
        return f"\nThe {label} train connection from {source} to {destination} is {-diff} minutes slower than the car connection."
    return f"\nThe {label} train connection from {source} to {destination} takes as long as the car connection."

@interact
def show_metrics_for_a_connection(source=sources, destination=destinations):
    connection = connection_times_df[(connection_times_df["source"] == source) & (connection_times_df["destination"] == destination)]
    if connection.empty:
        # E.g. source equals destination, or the dataset holds no train connection
        print(f"No train connection from {source} to {destination} found.")
        return
    print(connection)
    print(describe_difference("fastest", source, destination, connection["diff_car_min_train_duration"].values[0]))
    print(describe_difference("median", source, destination, connection["diff_car_median_train_duration"].values[0]))


### Ranking of towns with good train connections

To show which towns already have good train connections, we determine for every outgoing connection of a town whether the car or the train is faster and count the connections on which the train wins. We then create a ranking that highlights towns that are more easily reached by train and towns that are more easily reached by car.

In [42]:
# Count per source town how many destinations are reached faster by the fastest train than by car
min_train_faster = (
    connection_times_df.groupby("source")["diff_car_min_train_duration"]
    .apply(lambda diffs: (diffs > 0).sum())
    .reset_index(name="min_train_faster")
)
# Same count, based on the median train connection
median_train_faster = (
    connection_times_df.groupby("source")["diff_car_median_train_duration"]
    .apply(lambda diffs: (diffs > 0).sum())
    .reset_index(name="median_train_faster")
)

town_ranking = pd.merge(min_train_faster, median_train_faster, on="source")

Towns sorted by the number of outgoing connections where the **fastest** train connection is faster than the car connection:

In [43]:
town_ranking.sort_values(by=["min_train_faster"], inplace=True, ascending=False)
print(town_ranking)

                    source  min_train_faster  median_train_faster
2                   Berlin                57                   26
44                Mannheim                47                   38
64               Stuttgart                43                   24
1                 Augsburg                42                   18
33               Karlsruhe                40                   11
..                     ...               ...                  ...
35                    Kiel                 1                    0
68  Villingen-Schwenningen                 0                    0
65                   Trier                 0                    0
19               Flensburg                 0                    0
37                Konstanz                 0                    0

[75 rows x 3 columns]


Towns sorted by the number of outgoing connections where the **median** train connection is faster than the car connection:

In [44]:
town_ranking.sort_values(by=["median_train_faster"], inplace=True, ascending=False)
print(town_ranking)

                    source  min_train_faster  median_train_faster
44                Mannheim                47                   38
2                   Berlin                57                   26
64               Stuttgart                43                   24
15                Duisburg                32                   21
30              Ingolstadt                35                   20
..                     ...               ...                  ...
35                    Kiel                 1                    0
68  Villingen-Schwenningen                 0                    0
65                   Trier                 0                    0
19               Flensburg                 0                    0
37                Konstanz                 0                    0

[75 rows x 3 columns]


## Analysis of Dataset 2: Delay causes for specific train stations

In this chapter we examine which train stations have more problems than others in order to identify possible bottlenecks.

### Structure of Dataset 2
For Dataset 2 we also first show the structure of the dataset to gain insight into the data. As the data pipeline for Dataset 2 runs multiple times, duplicates are possible, so we drop duplicates to be sure that none are present.

In [45]:
ds2_df = ds2_df.drop_duplicates(subset=["id", "message_type", "from_time", "to_time", "category", "timestamp", "priority", "train_station", "problems_found", "del"])
print(ds2_df.info)
ds2_df.head(2)

<bound method DataFrame.info of        index        id message_type     from_time       to_time  \
0          0  r1923001            h  2.303310e+09  2.306302e+09   
1          1  r1961074            h  2.306232e+09  2.306260e+09   
3          3  r1982070            h  2.306232e+09  2.307122e+09   
4          4  r1983704            h  2.306231e+09  2.306241e+09   
7          7  r1978334            h  2.306160e+09  2.307282e+09   
...      ...       ...          ...           ...           ...   
42995  17327  r1985306            h  2.306241e+09  2.307012e+09   
42996  17328  r1971077            h  2.306102e+09  2.307012e+09   
43000  17332  r1806644            h  2.212110e+09  2.312092e+09   
43001  17333      None         None           NaN           NaN   
43002  17334      None         None           NaN           NaN   

                                 category     timestamp  priority  \
0      Bauarbeiten. (Quelle: zuginfo.nrw)  2.304052e+09       2.0   
1      Bauarbeiten. (Quel

| | index | id | message_type | from_time | to_time | category | timestamp | priority | train_station | problems_found | del |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | r1923001 | h | 2303310000.0 | 2306302000.0 | Bauarbeiten. (Quelle: zuginfo.nrw) | 2304052000.0 | 2.0 | Aachen Hbf | True | |
| 1 | 1 | r1961074 | h | 2306232000.0 | 2306260000.0 | Bauarbeiten. (Quelle: zuginfo.nrw) | 2306061000.0 | 2.0 | Aachen Hbf | True | |


### Stations with no found problems

The column problems_found indicates whether problems were found for a station when the DB API was queried.

We show the stations without problems separately, as their analysis differs from that of the other stations. The query returns two stations, Saarbrücken Hbf and Solingen Hbf, for which no problems were found.

In [46]:
none_delay_stations = ds2_df[ds2_df["problems_found"] == False]
print(none_delay_stations)

       index    id message_type  from_time  to_time category  timestamp  \
43001  17333  None         None        NaN      NaN     None        NaN   
43002  17334  None         None        NaN      NaN     None        NaN   

       priority    train_station  problems_found  del  
43001       NaN  Saarbrücken Hbf           False  NaN  
43002       NaN     Solingen Hbf           False  NaN  


### Delay causes in the dataset

The dataset has a column "category" that holds the delay causes. Not all entries seem to be real delays: there is also a category *Information*.

The following delay causes are present in the dataset:

In [47]:
delay_causes = list(ds2_df["category"].unique())
print(delay_causes)

['Bauarbeiten. (Quelle: zuginfo.nrw)', 'Störung. (Quelle: zuginfo.nrw)', 'Information', 'Störung', None, 'Bauarbeiten', 'Information. (Quelle: zuginfo.nrw)', 'Großstörung']


Entries that originate from zuginfo.nrw (North Rhine-Westphalia) carry a source suffix, which makes them appear as separate cause types. As this suffix is not relevant for the following analysis, we remove it.

In [48]:
ds2_df["category"] = ds2_df["category"].str.split('.').str[0]
delay_causes = list(ds2_df["category"].unique())
print(delay_causes)

['Bauarbeiten', 'Störung', 'Information', None, 'Großstörung']


### Compute duration of an interference
To analyze interferences better, we need the duration of each one. We therefore parse from_time and to_time (stored in the format YYMMDDHHMM) into datetimes and subtract from_time from to_time.

In [49]:
ds2_df["from_time"] = pd.to_datetime(ds2_df["from_time"], format='%y%m%d%H%M')
ds2_df["to_time"] = pd.to_datetime(ds2_df["to_time"], format='%y%m%d%H%M')
ds2_df["duration"] = ds2_df["to_time"] - ds2_df["from_time"]

### Delays in the operation of a specific train station
As for the connections, the information of Datasource 2 can also be filtered. Just select a \<Station\> and a \<Delay cause\> to show the entries with this delay cause for a specific station.

In [50]:
stations = list(ds2_df["train_station"].unique())

@interact
def show_train_station_information(train_station=stations, delay_cause=delay_causes):
    # Keep only rows of the selected station for which the API reported problems
    station = ds2_df[(ds2_df["train_station"] == train_station) & (ds2_df["problems_found"] == True)]
    if station.empty:
        print(f"No problems found for station {train_station}!")
    else:
        station = station[["train_station", "from_time", "to_time", "duration", "category", "priority"]]
        # Show only entries with the selected delay cause
        station = station[station["category"] == delay_cause]
        print(station)


### Aggregation of the delay duration by category and priority
To rank stations according to their vulnerability to interferences, metrics have to be calculated for each station. As the severity of interferences differs by category and priority, we group by both.

In the following, the total delay duration, the average delay duration and the number of delays are calculated for each station by category and priority.

In [60]:
interference_metrics_of_stations = ds2_df.groupby(['train_station', 'category', 'priority']).agg({'duration': ['sum', 'mean'], 'priority': 'size'}).reset_index()
interference_metrics_of_stations.columns = ['train_station', 'category', 'priority', 'total_duration', 'average_duration', 'total_interferences']
interference_metrics_of_stations.head(10)

| | train_station | category | priority | total_duration | average_duration | total_interferences |
|---|---|---|---|---|---|---|
| 0 | Aachen Hbf | Bauarbeiten | 2.0 | 157 days 16:59:00 | 31 days 12:59:48 | 5 |
| 1 | Aachen Hbf | Information | 2.0 | 389 days 07:29:00 | 32 days 10:37:25 | 12 |
| 2 | Aachen Hbf | Information | 3.0 | 10 days 15:45:00 | 5 days 07:52:30 | 2 |
| 3 | Aachen Hbf | Störung | 1.0 | 2 days 17:12:00 | 0 days 06:31:12 | 10 |
| 4 | Augsburg Hbf | Bauarbeiten | 1.0 | 13 days 02:18:00 | 13 days 02:18:00 | 1 |
| 5 | Augsburg Hbf | Bauarbeiten | 2.0 | 4 days 14:45:00 | 1 days 12:55:00 | 3 |
| 6 | Augsburg Hbf | Bauarbeiten | 3.0 | 2 days 22:24:00 | 1 days 11:12:00 | 2 |
| 7 | Augsburg Hbf | Information | 1.0 | 0 days 14:52:00 | 0 days 04:57:20 | 3 |
| 8 | Augsburg Hbf | Information | 2.0 | 597 days 05:31:00 | 14 days 05:16:27.142857142 | 42 |
| 9 | Augsburg Hbf | Information | 3.0 | 145 days 02:22:00 | 12 days 02:11:50 | 12 |


### Severity of Categories and Priorities
To finally rank the stations, we need to sort the categories and priorities by their severity. It is likely that a *Großstörung (engl. major disturbance)* is more severe than a *Störung (engl. disturbance)*. *Bauarbeiten (engl. construction work)* is also a delay cause, but one that might lead to fewer delays in the future. Entries of the category *Information* seem to be harmless. We assume the following severity order of interferences:

Severity of Categories:
1. *Großstörung*
2. *Störung*
3. *Bauarbeiten*
4. *Information*

According to the DB Timetable OpenAPI document, the priority indicates the severity of an interference as follows:

* *1 - High*
* *2 - Medium*
* *3 - Low*
* *4 - Done*

For both categories and priorities we therefore sort in ascending order, so that the most severe entries come first.
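Sorting in ascending order here means that the most severe category and the highest priority (value 1) come first. The following plain-Python sketch illustrates that ordering with made-up records; the rank mapping mirrors the severity order assumed above, and the station names are placeholders.

```python
# Assumed severity order from the list above: lower rank = more severe
CATEGORY_RANK = {"Großstörung": 0, "Störung": 1, "Bauarbeiten": 2, "Information": 3}

# Hypothetical (station, category, priority) records for illustration
records = [
    ("Station A", "Information", 1.0),
    ("Station B", "Störung", 2.0),
    ("Station C", "Großstörung", 2.0),
    ("Station D", "Störung", 1.0),
]

# Sort by category rank first, then by priority (1 = high comes before 2 = medium)
ranked = sorted(records, key=lambda record: (CATEGORY_RANK[record[1]], record[2]))
for station, category, priority in ranked:
    print(station, category, priority)
```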

In [57]:
interference_metrics_of_stations["category"] = pd.Categorical(interference_metrics_of_stations["category"], categories=['Großstörung', 'Störung', 'Bauarbeiten', 'Information'], ordered=True)
interference_metrics_of_stations = interference_metrics_of_stations.sort_values(by=["category", "priority"], ascending=[True, True])
print(interference_metrics_of_stations.head(10))

         train_station     category  priority   total_duration  \
199       Hannover Hbf  Großstörung       2.0  1 days 02:00:00   
334        München Hbf  Großstörung       2.0  0 days 09:31:00   
404     Regensburg Hbf  Großstörung       2.0  0 days 09:31:00   
3           Aachen Hbf      Störung       1.0  2 days 17:12:00   
10        Augsburg Hbf      Störung       1.0  9 days 12:38:00   
14   Bergisch Gladbach      Störung       1.0  0 days 04:00:00   
18          Berlin Hbf      Störung       1.0  0 days 02:30:00   
23       Bielefeld Hbf      Störung       1.0 56 days 16:29:00   
29          Bochum Hbf      Störung       1.0 11 days 09:03:00   
35            Bonn Hbf      Störung       1.0 57 days 01:39:00   

             average_duration  total_interferences  
199           0 days 06:30:00                    4  
334           0 days 09:31:00                    1  
404           0 days 09:31:00                    1  
3             0 days 06:31:12                   10  
10      

As this evaluation shows, three stations currently have entries in the category *Großstörung*. Remarkably, München Hbf and Regensburg Hbf have the same total_duration for this category, so it is very likely (also geographically) that both entries have the same cause.

However, for München and Regensburg only one *Großstörung* interference each was counted, which is not statistically significant. Hannover has four, with an average duration of 6:30 h, which indicates that there might be a bigger problem at the moment. We therefore take a closer look at the station of Hannover:

In [59]:
print(interference_metrics_of_stations[interference_metrics_of_stations["train_station"] == "Hannover Hbf"])

    train_station     category  priority    total_duration  \
199  Hannover Hbf  Großstörung       2.0   1 days 02:00:00   
203  Hannover Hbf      Störung       1.0  11 days 12:12:00   
204  Hannover Hbf      Störung       2.0   4 days 01:18:00   
197  Hannover Hbf  Bauarbeiten       1.0  83 days 16:33:00   
198  Hannover Hbf  Bauarbeiten       2.0  56 days 03:55:00   
200  Hannover Hbf  Information       1.0  18 days 15:00:00   
201  Hannover Hbf  Information       2.0 814 days 10:39:00   
202  Hannover Hbf  Information       3.0 178 days 08:44:00   

              average_duration  total_interferences  
199            0 days 06:30:00                    4  
203  0 days 10:13:46.666666666                   27  
204  0 days 05:07:15.789473684                   19  
197           13 days 22:45:30                    6  
198            9 days 08:39:10                    6  
200            2 days 01:40:00                    9  
201 10 days 07:25:33.417721519                   79  
202      

### The problem of arguing from data of a short time span
As the closer look shows, the station of Hannover has long-lasting construction work. This may be the cause of the major interference and of the multiple normal interferences. It is likely that there will be fewer interferences once the construction work is finished.

As we can see, it is hard to find a reasonable ranking that withstands a closer look at the data. If the data were collected over a long period (e.g. a year), long-lasting construction work would still be visible, but the effects of short-lasting construction work would fade and we could argue on firmer ground.

So the first result of the analysis of Dataset 2 is that the data needs to be collected over a much longer time span to get significant results.

### Ranking of Stations with filtered long-lasting interferences
Nevertheless, we still attempt to rank the stations in Dataset 2. The previous section showed that, under the given circumstances, taking long-lasting interferences into account would distort the results, so we need another approach.

In the following we focus on short-lasting interferences, as these are the ones that often have an immediate, unplanned impact. We therefore restrict the dataset to events that last less than 12 hours and assume that this is the maximum duration of a *normal* interference in typical train operation.

We calculate the metrics again, but filter out long-lasting interferences and the category *Großstörung*, as this category was not relevant enough for the ranking.

In [79]:
short_lasting_interferences = ds2_df[(ds2_df["duration"] < pd.Timedelta(hours=12)) & (ds2_df["category"]!="Großstörung")].groupby(['train_station', 'category', 'priority']).agg({'duration': ['sum', 'mean'], 'priority': 'size'}).reset_index()
short_lasting_interferences.columns = ['train_station', 'category', 'priority', 'total_duration', 'average_duration', 'total_interferences']
short_lasting_interferences["category"] = pd.Categorical(short_lasting_interferences["category"], categories=['Großstörung', 'Störung', 'Bauarbeiten', 'Information'], ordered=True)

#### Ranking with total_interferences as the most meaningful criterion
Next we rank Dataset 2 again. This time we also take the average_duration and the total_interferences into account.

In a first attempt we rank by total_interferences first.

In [80]:
short_lasting_interferences = short_lasting_interferences.sort_values(by=["category", "total_interferences", "average_duration", "priority"], ascending=[True, False, False, True])
short_lasting_interferences.head(10)

| | train_station | category | priority | total_duration | average_duration | total_interferences |
|---|---|---|---|---|---|---|
| 37 | Dortmund Hbf | Störung | 1.0 | 7 days 17:21:00 | 0 days 05:17:44.571428571 | 35 |
| 50 | Düsseldorf Hbf | Störung | 1.0 | 5 days 22:29:00 | 0 days 04:04:15.428571428 | 35 |
| 63 | Essen Hbf | Störung | 1.0 | 4 days 13:59:00 | 0 days 04:04:24.444444444 | 27 |
| 203 | Münster | Störung | 1.0 | 5 days 00:39:00 | 0 days 04:49:33.600000 | 25 |
| 44 | Duisburg Hbf | Störung | 1.0 | 4 days 02:09:00 | 0 days 03:55:33.600000 | 25 |
| 161 | Köln Hbf | Störung | 1.0 | 3 days 07:22:00 | 0 days 03:27:02.608695652 | 23 |
| 107 | Hamm | Störung | 1.0 | 4 days 04:30:00 | 0 days 04:34:05.454545454 | 22 |
| 118 | Hannover Hbf | Störung | 1.0 | 2 days 16:15:00 | 0 days 02:55:13.636363636 | 22 |
| 132 | Hildesheim Hbf | Störung | 1.0 | 0 days 15:50:00 | 0 days 00:45:14.285714285 | 21 |
| 14 | Bochum Hbf | Störung | 1.0 | 3 days 11:23:00 | 0 days 04:10:09 | 20 |


This attempt was more successful than the previous ones. It is noticeable that many stations among the top 10 are in North Rhine-Westphalia.

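#### Ranking with average_duration as the most meaningful criterion
Secondly, we want to rank Dataset 2 by average_duration. Note that the sort call below orders the rows by category, priority (descending) and total_interferences; average_duration is not among the sort keys, so the table lists the priority-2 *Störung* entries with the most interferences.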

In [88]:
short_lasting_interferences = short_lasting_interferences.sort_values(by=["category", "priority", "total_interferences"], ascending=[True, False, False])
short_lasting_interferences.head(10)

| | train_station | category | priority | total_duration | average_duration | total_interferences |
|---|---|---|---|---|---|---|
| 119 | Hannover Hbf | Störung | 2.0 | 3 days 11:19:00 | 0 days 04:37:43.333333333 | 18 |
| 150 | Kiel Hbf | Störung | 2.0 | 0 days 16:06:00 | 0 days 01:20:30 | 12 |
| 104 | Hamburg Hbf | Störung | 2.0 | 0 days 17:01:00 | 0 days 01:32:49.090909090 | 11 |
| 264 | Ulm Hbf | Störung | 2.0 | 0 days 16:19:00 | 0 days 02:19:51.428571428 | 7 |
| 209 | Nürnberg Hbf | Störung | 2.0 | 1 days 03:46:00 | 0 days 04:37:40 | 6 |
| 71 | Flensburg | Störung | 2.0 | 0 days 09:53:00 | 0 days 01:58:36 | 5 |
| 189 | Mannheim Hbf | Störung | 2.0 | 0 days 20:35:00 | 0 days 04:07:00 | 5 |
| 199 | München Hbf | Störung | 2.0 | 1 days 00:54:00 | 0 days 04:58:48 | 5 |
| 224 | Paderborn Hbf | Störung | 2.0 | 1 days 20:00:00 | 0 days 08:48:00 | 5 |
| 228 | Potsdam Hbf | Störung | 2.0 | 0 days 01:49:00 | 0 days 00:27:15 | 4 |
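For comparison, a ranking by average duration can be sketched in plain Python. The sketch only re-sorts a few of the priority-2 *Störung* rows displayed above (averages rounded down to full seconds), so it illustrates the sort key rather than a result for the whole dataset.

```python
from datetime import timedelta

# (station, average_duration) copied from the table above, rounded down to full seconds
rows = [
    ("Hannover Hbf", timedelta(hours=4, minutes=37, seconds=43)),
    ("Kiel Hbf", timedelta(hours=1, minutes=20, seconds=30)),
    ("München Hbf", timedelta(hours=4, minutes=58, seconds=48)),
    ("Paderborn Hbf", timedelta(hours=8, minutes=48)),
    ("Potsdam Hbf", timedelta(minutes=27, seconds=15)),
]

# Longest average duration first
by_average = sorted(rows, key=lambda row: row[1], reverse=True)
for station, average in by_average:
    print(station, average)
```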


#### Ranking with total_duration as the most meaningful criterion
Instead of the average_duration of an interference, we now rank Dataset 2 by total_duration first.

In [82]:
short_lasting_interferences = short_lasting_interferences.sort_values(by=["category", "total_duration", "total_interferences", "priority"], ascending=[True, False, False, True])
short_lasting_interferences.head(10)

| | train_station | category | priority | total_duration | average_duration | total_interferences |
|---|---|---|---|---|---|---|
| 37 | Dortmund Hbf | Störung | 1.0 | 7 days 17:21:00 | 0 days 05:17:44.571428571 | 35 |
| 50 | Düsseldorf Hbf | Störung | 1.0 | 5 days 22:29:00 | 0 days 04:04:15.428571428 | 35 |
| 203 | Münster | Störung | 1.0 | 5 days 00:39:00 | 0 days 04:49:33.600000 | 25 |
| 63 | Essen Hbf | Störung | 1.0 | 4 days 13:59:00 | 0 days 04:04:24.444444444 | 27 |
| 107 | Hamm | Störung | 1.0 | 4 days 04:30:00 | 0 days 04:34:05.454545454 | 22 |
| 44 | Duisburg Hbf | Störung | 1.0 | 4 days 02:09:00 | 0 days 03:55:33.600000 | 25 |
| 96 | Hagen Hbf | Störung | 1.0 | 4 days 00:46:00 | 0 days 05:41:31.764705882 | 17 |
| 76 | Frankfurt am Main | Störung | 1.0 | 3 days 23:23:00 | 0 days 07:20:13.846153846 | 13 |
| 182 | Mainz Hbf | Störung | 1.0 | 3 days 18:02:00 | 0 days 07:30:10 | 12 |
| 208 | Nürnberg Hbf | Störung | 1.0 | 3 days 16:52:00 | 0 days 06:20:51.428571428 | 14 |


## Summary

In this last chapter we summarize our findings and state the major assumptions on which they are based. We then briefly describe the most difficult problems tackled in the project. Finally, we give an outlook on how the project can be extended and how the certainty of the results can be improved.

### Findings

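#### Dataset 1
Measured by the number of destinations that the fastest train reaches sooner than a car, Berlin (57), Mannheim (47), Stuttgart (43), Augsburg (42) and Karlsruhe (40) lead the ranking of the 75 towns. Measured by the median train connection, Mannheim (38) leads ahead of Berlin (26) and Stuttgart (24). From Konstanz, Flensburg, Trier and Villingen-Schwenningen no destination is reached faster by train than by car, and from Kiel only one.

#### Dataset 2
The data covers too short a time span for a robust ranking. Restricted to interferences shorter than 12 hours, Dortmund Hbf and Düsseldorf Hbf have the most priority-1 *Störung* entries (35 each), followed by Essen Hbf (27). Many stations among the top 10 are in North Rhine-Westphalia.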

### Assumptions
This section should briefly cover the most important assumptions that were made.

#### The more delays the worse the station
The findings for Datasource 2 rest on the assumption that the more delays a station has, the worse the station is. This might not always be true: a station such as Berlin Hbf is likely to have more delays than the station of Rostock. For towns whose stations have a comparable train frequency, however, the assumption is likely to hold.

#### One station per town
For a town only the central train station was considered in Dataset 2, not all stations of a town.

### Problems during the whole Project

#### Corrupted ttl file of Datasource 1
The most severe problem tackled during the project was that Datasource 1 provides corrupted data. The ttl file containing the RDF graph with the connection times between the towns is syntactically invalid. Preprocessing was therefore required before the graph could be parsed. It consists of replacing expressions using regexes and removing invalid characters.

A relation in RDF basically consists of a triple of a *\<subject\>*, a *\<predicate\>* and an *\<object\>*, e.g. *Erlangen* *IsConnectedTo* *Berlin*.

In the ttl file, the relation between source and destination was also not correctly associated with its attributes, such as the driving time, in RDF terms. For this reason the attributes had to be related to the corresponding connection by complex logical changes in the RDF file.

After these changes the attributes were related to their connection and the graph could be parsed correctly with rdflib.
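The idea of attaching attributes to a connection can be illustrated without any RDF library. In the sketch below every connection gets its own node, and the attributes are triples whose subject is that node. All identifiers (`conn1`, `hasSource`, `hasDuration` and so on) and the duration value are hypothetical and do not reflect the actual vocabulary of the dataset.

```python
# Each triple is (subject, predicate, object); all names are hypothetical
triples = [
    ("conn1", "hasSource", "Erlangen"),
    ("conn1", "hasDestination", "Berlin"),
    ("conn1", "hasTransportType", "train"),
    ("conn1", "hasDuration", 250),  # made-up value in minutes
]

def attributes_of(subject, graph):
    """Collect all predicate/object pairs that belong to one subject."""
    return {predicate: obj for s, predicate, obj in graph if s == subject}

print(attributes_of("conn1", triples))
```

Because every attribute points at the connection node rather than at a town, several connections between the same two towns can carry different durations.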

#### Continuous updates of Datasource 2
Another problem was that the DB Timetable API potentially provides new information every 30 minutes. Nevertheless, some information does not change that often (e.g. the delay cause *Bauarbeiten* can last from days to months). One problem is therefore the deduplication of events that were fetched multiple times. It would also be desirable to fetch the data continuously, e.g. every hour, but this would require an installation on a server that calls the API automatically as a scheduled task. Doing this manually was not possible.
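Deduplication of events that were fetched in several runs can be sketched in plain Python as keeping the first record per identifying key. The field names follow the columns of Dataset 2; the choice of key fields and the helper name `deduplicate` are assumptions for illustration.

```python
def deduplicate(events, key_fields=("id", "train_station", "from_time", "to_time")):
    """Keep the first occurrence of every event, identified by key_fields."""
    seen = set()
    unique = []
    for event in events:
        key = tuple(event.get(field) for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(event)
    return unique

# Two runs that both returned the same construction-work message
run_1 = [{"id": "r1923001", "train_station": "Aachen Hbf", "from_time": 2303310000, "to_time": 2306302000}]
run_2 = [
    dict(run_1[0]),
    {"id": "r1961074", "train_station": "Aachen Hbf", "from_time": 2306232000, "to_time": 2306260000},
]
unique_events = deduplicate(run_1 + run_2)
print(len(unique_events))
```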

#### XML representation of the data of Datasource 2
The endpoint of the DB Timetable API used for Datasource 2 returns more information than just the delay causes, in an XML representation. To retrieve only the needed data, an XPath query was used.
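A minimal sketch of such an extraction with the standard library's `xml.etree.ElementTree`, which supports a subset of XPath. The XML snippet is a simplified, hypothetical response; the element name `m` and the attribute names are assumptions and should be checked against the OpenAPI document.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical response; the real API returns more elements and attributes
SAMPLE_XML = """
<timetable station="Aachen Hbf">
  <s id="stop-1">
    <m id="r1923001" t="h" cat="Bauarbeiten" from="2303310000" to="2306302000" pr="2"/>
    <ar pt="2306231200"/>
  </s>
  <m id="r1961074" t="h" cat="Störung" from="2306232000" to="2306260000" pr="1"/>
</timetable>
"""

root = ET.fromstring(SAMPLE_XML)
# ".//m" selects every message element at any depth and ignores everything else
messages = [dict(m.attrib) for m in root.findall(".//m")]
for message in messages:
    print(message["id"], message["cat"], message.get("pr"))
```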

### Outlook
#### Continuous collection of data from Datasource 2
The current dataset from Datasource 2 covers four different days. The data was requested at approximately the same time. An interesting enhancement would be to deploy the pipeline to a server with sufficient storage and to call the DB API automatically, daily or hourly, over a long period such as a year. The current data is just a snapshot of the current state. In the long term, other stations may turn out to have the greatest potential for improvement.

#### Including multiple stations per town
Some towns have multiple stations. In this report only the main train station of a town is covered. Covering other stations of towns in Dataset 2 would also provide further information.

#### Comparing train connection times of Datasource 1 with real-time information of the DB Timetable API
Furthermore the DB Timetable API provides other endpoints that provide further information of current connections. It would be interesting to compare the real average train connection times with the train connection times provided in Dataset 1.

