# Swiss Railway Network Analysis

This notebook explores the Swiss railway network data (SBB) by constructing a graph representation where stations are nodes and railway line connections are edges. We analyze the network structure, identify connected components, and examine the main network versus isolated subnetworks.

In [26]:
import pandas as pd
import networkx as nx
import os
import pickle

## Data Loading

We load two datasets: the line data containing railway lines with their stations and kilometer positions, and the station metadata containing additional information about each station (Didok numbers, stop names, etc.).

In [27]:
sbb_line_data_path = "datasets/switzerland/sbb-linie-mit-betriebspunkten.csv"
sbb_stations_data_path = "datasets/switzerland/sbb-dienststellen-gemass-opentransportdataswiss.csv"

graph_path = "datasets/switzerland/swiss_rail_network.gpickle"


In [28]:
sbb_line_data = pd.read_csv(sbb_line_data_path, sep=';')

print(sbb_line_data.columns)
sbb_line_data.head()

Index(['Station abbreviation', 'Stop name', 'Line', 'KM', 'Line.1', 'Geopos',
       'Didok number', 'OPUIC', 'Stop name.1', 'lod', 'sloid'],
      dtype='object')


Unnamed: 0,Station abbreviation,Stop name,Line,KM,Line.1,Geopos,Didok number,OPUIC,Stop name.1,lod,sloid
0,ABO,Aarburg-Oftringen,500,43.00505,Basel SBB - Olten - Luzern,"47.320268469495055, 7.908222606719322",2000,8502000,Aarburg-Oftringen,http://lod.opentransportdata.swiss/didok/didok85,ch:1:sloid:2000
1,ABOW,Aarburg-Oftringen,451,43.97183,Aarburg-Oftringen - Rothrist Gleis 1,"47.3136469221806, 7.906871810514809",8141,8508141,Aarburg-Oftringen West (Abzw),http://lod.opentransportdata.swiss/didok/didok85,ch:1:sloid:8141
2,AD,Aadorf,850,122.89829,St.Gallen - Winterthur Nord,"47.48811815646303, 8.903301351691383",6013,8506013,Aadorf,http://lod.opentransportdata.swiss/didok/didok85,ch:1:sloid:6013
3,AESP,Aespli,400,12.27812,Lochligut - Wanzwil - Rothrist West,"47.0283996594778, 7.5235281481533445",15299,8515299,Aespli,http://lod.opentransportdata.swiss/didok/didok85,ch:1:sloid:15299
4,AF,Affoltern am Albis,711,24.84132,ZH Hardbrucke - Kollermuhle,"47.2760703624343, 8.446594995016907",2224,8502224,Affoltern am Albis,http://lod.opentransportdata.swiss/didok/didok85,ch:1:sloid:2224


In [29]:
sbb_stations_data = pd.read_csv(sbb_stations_data_path, sep=";")

print(sbb_stations_data.columns)
sbb_stations_data.head()

Index(['lod', 'Geopos', '﻿numberShort', 'uicCountryCode', 'sloid', 'number',
       'validFrom', 'validTo', 'designationOfficial', 'designationLong',
       'abbreviation', 'operatingPoint', 'operatingPointWithTimetable',
       'stopPoint', 'stopPointType', 'freightServicePoint', 'trafficPoint',
       'borderPoint', 'hasGeolocation', 'isoCountryCode', 'cantonName',
       'cantonFsoNumber', 'cantonAbbreviation', 'districtName',
       'districtFsoNumber', 'municipalityName', 'fsoNumber', 'localityName',
       'operatingPointType', 'operatingPointTechnicalTimetableType',
       'meansOfTransport', 'categories', 'operatingPointTrafficPointType',
       'operatingPointRouteNetwork', 'operatingPointKilometer',
       'operatingPointKilometerMasterNumber', 'sortCodeOfDestinationStation',
       'businessOrganisation', 'businessOrganisationNumber',
       'businessOrganisationAbbreviationDe',
       'businessOrganisationAbbreviationFr',
       'businessOrganisationDescriptionDe',
       '

Unnamed: 0,lod,Geopos,﻿numberShort,uicCountryCode,sloid,number,validFrom,validTo,designationOfficial,designationLong,...,businessOrganisation,businessOrganisationNumber,businessOrganisationAbbreviationDe,businessOrganisationAbbreviationFr,businessOrganisationDescriptionDe,businessOrganisationDescriptionFr,fotComment,height,creationDate,editionDate
0,http://lod.opentransportdata.swiss/didok/1322013,"45.971036269799534, 8.069922381501902",22013,13,ch:1:sloid:1322013,1322013,2020-09-01,9999-12-31,Ceppo Morelli,,...,ch:1:sboid:101223,7090,ACO,ACO,Autoservizi Comazzi S.R.L.,Autoservizi Comazzi S.R.L.,,0.0,2017-11-09T12:53:05+01:00,2024-04-08T11:26:05+02:00
1,http://lod.opentransportdata.swiss/didok/1322033,"46.12230782918108, 8.290578872184831",22033,13,ch:1:sloid:1322033,1322033,2020-09-01,9999-12-31,"Domodossola, via Sant'Antonio",,...,ch:1:sboid:101223,7090,ACO,ACO,Autoservizi Comazzi S.R.L.,Autoservizi Comazzi S.R.L.,,0.0,2017-11-09T12:53:05+01:00,2024-04-08T11:26:05+02:00
2,http://lod.opentransportdata.swiss/didok/1322022,"45.89945668927423, 8.415831401908225",22022,13,ch:1:sloid:1322022,1322022,2020-09-01,9999-12-31,Crusinallo,,...,ch:1:sboid:101223,7090,ACO,ACO,Autoservizi Comazzi S.R.L.,Autoservizi Comazzi S.R.L.,,0.0,2017-11-09T12:53:05+01:00,2024-04-08T11:26:05+02:00
3,http://lod.opentransportdata.swiss/didok/1322005,"46.122297290464246, 8.210769596126164",22005,13,ch:1:sloid:1322005,1322005,2020-09-01,9999-12-31,"Bognanco, T. Villa Elda",,...,ch:1:sboid:101223,7090,ACO,ACO,Autoservizi Comazzi S.R.L.,Autoservizi Comazzi S.R.L.,,0.0,2017-11-09T12:53:05+01:00,2024-04-08T11:26:05+02:00
4,http://lod.opentransportdata.swiss/didok/1322008,"46.1340177517307, 8.286193910517145",22008,13,ch:1:sloid:1322008,1322008,2020-09-01,9999-12-31,Caddo,,...,ch:1:sboid:101223,7090,ACO,ACO,Autoservizi Comazzi S.R.L.,Autoservizi Comazzi S.R.L.,,0.0,2017-11-09T12:53:05+01:00,2024-04-08T11:26:05+02:00
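
The column listing above shows `numberShort` with what appears to be a stray byte-order mark (U+FEFF) in front of its name, so `sbb_stations_data["numberShort"]` would raise a `KeyError`. Because the mark is not on the first column, reading the file with `encoding="utf-8-sig"` may not remove it. Below is a minimal sketch of stripping it from the column names after loading; the frame `toy` is a made-up stand-in for the station table, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for the station table: one header carries a BOM (U+FEFF)
toy = pd.DataFrame(columns=["lod", "Geopos", "\ufeffnumberShort", "number"])

# Remove the BOM character from every column name (plain string replace, no regex)
toy.columns = toy.columns.str.replace("\ufeff", "", regex=False)

cleaned = list(toy.columns)
print(cleaned)
```

The same one-line clean-up could be applied to `sbb_stations_data.columns` right after `pd.read_csv`.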


## Data Merging

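We merge the line data with the station metadata to enrich each line row with station information that is later attached to the graph nodes. The merge below matches `Didok number` from the line data against `number` from the station metadata. Note that in the previews above `Didok number` holds a short number (e.g. 2000), while `number` appears to include the UIC country code as a prefix (e.g. 1322013) and therefore resembles `OPUIC` (e.g. 8502000) rather than `Didok number`. This probably explains why the station columns in the merged preview are all empty (NaN). Joining on `OPUIC` = `number`, or on `Didok number` = `numberShort`, is likely the intended key and should be verified against the data.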

In [30]:
sbb_line_and_station_data = sbb_line_data.merge(
    sbb_stations_data,
    left_on="Didok number",
    right_on="number",
    how="left",
    suffixes=("", "_didok"),
)

In [31]:
print(sbb_line_and_station_data.columns)
sbb_line_and_station_data.head()

Index(['Station abbreviation', 'Stop name', 'Line', 'KM', 'Line.1', 'Geopos',
       'Didok number', 'OPUIC', 'Stop name.1', 'lod', 'sloid', 'lod_didok',
       'Geopos_didok', '﻿numberShort', 'uicCountryCode', 'sloid_didok',
       'number', 'validFrom', 'validTo', 'designationOfficial',
       'designationLong', 'abbreviation', 'operatingPoint',
       'operatingPointWithTimetable', 'stopPoint', 'stopPointType',
       'freightServicePoint', 'trafficPoint', 'borderPoint', 'hasGeolocation',
       'isoCountryCode', 'cantonName', 'cantonFsoNumber', 'cantonAbbreviation',
       'districtName', 'districtFsoNumber', 'municipalityName', 'fsoNumber',
       'localityName', 'operatingPointType',
       'operatingPointTechnicalTimetableType', 'meansOfTransport',
       'categories', 'operatingPointTrafficPointType',
       'operatingPointRouteNetwork', 'operatingPointKilometer',
       'operatingPointKilometerMasterNumber', 'sortCodeOfDestinationStation',
       'businessOrganisation', 'busin

Unnamed: 0,Station abbreviation,Stop name,Line,KM,Line.1,Geopos,Didok number,OPUIC,Stop name.1,lod,...,businessOrganisation,businessOrganisationNumber,businessOrganisationAbbreviationDe,businessOrganisationAbbreviationFr,businessOrganisationDescriptionDe,businessOrganisationDescriptionFr,fotComment,height,creationDate,editionDate
0,ABO,Aarburg-Oftringen,500,43.00505,Basel SBB - Olten - Luzern,"47.320268469495055, 7.908222606719322",2000,8502000,Aarburg-Oftringen,http://lod.opentransportdata.swiss/didok/didok85,...,,,,,,,,,,
1,ABOW,Aarburg-Oftringen,451,43.97183,Aarburg-Oftringen - Rothrist Gleis 1,"47.3136469221806, 7.906871810514809",8141,8508141,Aarburg-Oftringen West (Abzw),http://lod.opentransportdata.swiss/didok/didok85,...,,,,,,,,,,
2,AD,Aadorf,850,122.89829,St.Gallen - Winterthur Nord,"47.48811815646303, 8.903301351691383",6013,8506013,Aadorf,http://lod.opentransportdata.swiss/didok/didok85,...,,,,,,,,,,
3,AESP,Aespli,400,12.27812,Lochligut - Wanzwil - Rothrist West,"47.0283996594778, 7.5235281481533445",15299,8515299,Aespli,http://lod.opentransportdata.swiss/didok/didok85,...,,,,,,,,,,
4,AF,Affoltern am Albis,711,24.84132,ZH Hardbrucke - Kollermuhle,"47.2760703624343, 8.446594995016907",2224,8502224,Affoltern am Albis,http://lod.opentransportdata.swiss/didok/didok85,...,,,,,,,,,,
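
Since the station columns in the preview above are all NaN, it is worth measuring how many rows actually found a partner in the merge. One way is pandas' `indicator=True` option, which adds a `_merge` column. This is a sketch on toy data: the two line rows reuse values from the preview above, while the rows in `toy_stations` are assumptions built on the guess that `number` equals `OPUIC`.

```python
import pandas as pd

# Two line rows taken from the preview above
toy_lines = pd.DataFrame({
    "Station abbreviation": ["ABO", "AD"],
    "Didok number": [2000, 6013],
    "OPUIC": [8502000, 8506013],
})

# Assumed station rows: 'number' is guessed to carry the country-code prefix
toy_stations = pd.DataFrame({
    "number": [8502000, 8506013],
    "designationOfficial": ["Aarburg-Oftringen", "Aadorf"],
})

def match_rate(left_key: str) -> float:
    """Share of line rows that find a station row when joining on left_key."""
    merged = toy_lines.merge(
        toy_stations,
        left_on=left_key,
        right_on="number",
        how="left",
        indicator=True,  # adds a '_merge' column: 'both' or 'left_only'
    )
    return float((merged["_merge"] == "both").mean())

rate_didok = match_rate("Didok number")
rate_opuic = match_rate("OPUIC")
print(rate_didok, rate_opuic)
```

Running the same check on the real frames would show whether the current key matches any rows at all.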


## Graph Construction

We build an undirected graph where:
- **Nodes** represent stations (identified by their abbreviations)
- **Node attributes** include stop names, Didok numbers, and the lines passing through each station
- **Edges** connect consecutive stations along each railway line
- **Edge attributes** store the line IDs and detailed segment information (km positions, metadata for each segment)

Stations that appear on multiple lines will have edges to multiple neighbors, creating a realistic representation of the railway network topology.
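
The construction rule described above can be sketched on a toy table before applying it to the full dataset. The station names, line numbers, and kilometer values below are invented for illustration; only the column names match the notebook.

```python
import networkx as nx
import pandas as pd

# Invented toy data: line 1 runs A - B - C, line 2 runs B - D
toy = pd.DataFrame({
    "Line": [1, 1, 1, 2, 2],
    "Station abbreviation": ["C", "A", "B", "B", "D"],
    "KM": [9.0, 0.0, 5.0, 0.0, 3.0],
})

G_toy = nx.Graph()
# Sort by line and kilometer so that neighbouring rows are neighbouring stations
for line_id, group in toy.sort_values(["Line", "KM"]).groupby("Line"):
    stops = group["Station abbreviation"].tolist()
    # Pair each stop with the next one along the line
    for u, v in zip(stops[:-1], stops[1:]):
        if G_toy.has_edge(u, v):
            G_toy[u][v]["lines"].add(line_id)
        else:
            G_toy.add_edge(u, v, lines={line_id})

print(G_toy.number_of_nodes(), G_toy.number_of_edges())
print(G_toy.degree("B"))
```

Station `B` lies on both toy lines, so it ends up with three neighbours, which is how junctions emerge in the full graph.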

In [32]:
# Build the graph only if no cached copy exists.
# Delete the file at graph_path to force a rebuild after the input data changes.
if os.path.exists(graph_path):
    print(f"Loading existing graph from {graph_path}")
    with open(graph_path, 'rb') as f:
        G = pickle.load(f)
    print(f"Loaded graph with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")
else:
    print("Creating new graph...")
    
    LINE_COL = "Line"
    STATION_COL = "Station abbreviation"
    ORDER_COL = "KM"

    # keep every row from the merged dataframe attached to its station node
    node_df = sbb_line_and_station_data.dropna(subset=[STATION_COL])
    station_groups = node_df.groupby(STATION_COL)

    # rows that can be ordered along a line (needed to draw edges)
    ordered_df = (
        sbb_line_and_station_data
        .dropna(subset=[LINE_COL, STATION_COL, ORDER_COL])
        .sort_values([LINE_COL, ORDER_COL])
    )

    G = nx.Graph()  # undirected infrastructure graph decorated with metadata

    for station, group in station_groups:
        row_dicts = group.to_dict("records")
        stop_names = sorted({row.get("Stop name") for row in row_dicts if pd.notna(row.get("Stop name"))})
        didoks = sorted({row.get("Didok number") for row in row_dicts if pd.notna(row.get("Didok number"))})
        lines = sorted({row.get(LINE_COL) for row in row_dicts if pd.notna(row.get(LINE_COL))})
        G.add_node(
            station,
            station_abbreviation=station,
            stop_names=stop_names,
            didok_numbers=didoks,
            lines=lines,
            rows=row_dicts,
        )

    # connect consecutive stops per line and keep the merged metadata for every segment
    for line_id, group in ordered_df.groupby(LINE_COL):
        stops = group[STATION_COL].tolist()
        kms = group[ORDER_COL].tolist()
        row_dicts = group.to_dict("records")

        for idx, (u, v) in enumerate(zip(stops[:-1], stops[1:])):
            segment_meta = {
                "line_id": line_id,
                "order_index": idx,
                "from_station": u,
                "to_station": v,
                "from_km": kms[idx],
                "to_km": kms[idx + 1],
                "from_row": row_dicts[idx],
                "to_row": row_dicts[idx + 1],
            }

            if G.has_edge(u, v):
                G[u][v]["lines"].add(line_id)
                G[u][v]["segments"].append(segment_meta)
            else:
                G.add_edge(
                    u,
                    v,
                    lines={line_id},
                    segments=[segment_meta],
                )

    # make the list of lines per edge easier to inspect than a bare set
    for u, v, data in G.edges(data=True):
        data["lines"] = sorted(data["lines"])
    
    # Save the graph
    print(f"Saving graph to {graph_path}")
    with open(graph_path, 'wb') as f:
        pickle.dump(G, f, protocol=pickle.HIGHEST_PROTOCOL)
    print(f"Graph saved with {G.number_of_nodes()} nodes and {G.number_of_edges()} edges")

Creating new graph...
Saving graph to datasets/switzerland/swiss_rail_network.gpickle
Graph saved with 1355 nodes and 1512 edges


## Basic Network Statistics

Here we compute fundamental graph metrics: the number of nodes (stations), edges (connections), and connected components (isolated subnetworks). A high number of connected components might indicate data quality issues or genuinely isolated railway segments.

In [33]:
print("Nodes:", G.number_of_nodes())
print("Edges:", G.number_of_edges())
print("Connected components:", nx.number_connected_components(G))


Nodes: 1355
Edges: 1512
Connected components: 10


## Connected Components Analysis

We identify all connected components and create a summary table showing the size of each component (number of nodes and edges). Components are ranked by size, with rank 1 being the largest (main network) and higher ranks representing progressively smaller isolated subnetworks.

In [34]:
# Sort components by size so that rank 1 is the largest (main) component.
# Later cells sort the same list the same way, so the ranks stay aligned.
components = sorted(nx.connected_components(G), key=len, reverse=True)

rows = []
for rank, nodes in enumerate(components, start=1):
    sub = G.subgraph(nodes)
    rows.append({
        "rank": rank,
        "num_nodes": sub.number_of_nodes(),
        "num_edges": sub.number_of_edges(),
    })

cc_df = pd.DataFrame(rows)

In [35]:
print(cc_df)

   rank  num_nodes  num_edges
0     1       1338       1504
1     2          2          1
2     3          2          1
3     4          2          1
4     5          2          1
5     6          2          1
6     7          2          1
7     8          2          1
8     9          2          1
9    10          1          0
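
The numbers in this table can be cross-checked against the totals printed earlier, and they also give the number of independent cycles (the cyclomatic number, E - N + C) in the network. The short sketch below only does arithmetic on the figures reported in this notebook.

```python
# Totals printed by the notebook above
total_nodes, total_edges, num_components = 1355, 1512, 10

# Component table as (num_nodes, num_edges):
# one main component, eight two-station pairs, one singleton
component_sizes = [(1338, 1504)] + [(2, 1)] * 8 + [(1, 0)]

# The per-component figures should add up to the graph totals
nodes_sum = sum(n for n, _ in component_sizes)
edges_sum = sum(e for _, e in component_sizes)

# Cyclomatic number: edges - nodes + components
cycles_total = total_edges - total_nodes + num_components
cycles_main = 1504 - 1338 + 1

print(nodes_sum, edges_sum)
print(cycles_total, cycles_main)
```

Both cycle counts agree, which is expected: the small components are single edges or isolated nodes and therefore contain no cycles.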


## Small Components Detailed Analysis

For all components except the largest one (the main network), we create detailed tables showing:
1. A per-station view with all metadata for each station in the small components
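2. An aggregated view with one row per small component that lists its two stations, which makes isolated line segments easy to spot. This view assumes that every non-singleton small component has exactly two stations, which holds for this dataset; in a larger component only the first two stations (by kilometer position) would be shown.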

We handle singleton components (single isolated stations) separately from components with multiple stations.
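
If small components with more than two stations ever appear, a size-agnostic summary may be more convenient than the pair view. The sketch below uses a pandas named aggregation on an invented per-station table; the column names follow the per-station table described above, and all values are made up.

```python
import pandas as pd

# Invented per-station table: components of size 2, 3 and 1
toy_small = pd.DataFrame({
    "component_rank": [2, 2, 3, 3, 3, 4],
    "station_abbreviation": ["AB", "AA", "BC", "BA", "BB", "CA"],
})

# One row per component: station count plus a sorted, comma-separated station list
summary = (
    toy_small
    .groupby("component_rank")
    .agg(
        num_stations=("station_abbreviation", "count"),
        stations=("station_abbreviation", lambda s: ", ".join(sorted(s))),
    )
    .reset_index()
)
print(summary)
```

This treats singletons, pairs, and larger components uniformly, at the cost of a less tabular layout than separate station columns.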

In [36]:
components_by_size = sorted(components, key=len, reverse=True)

# All small components (everything except the giant one)
small_components = components_by_size[1:]   # since [0] is the giant one

# --- 1) Per-station table that keeps every node in the small components ---

rows = []  # IMPORTANT: reset rows so nothing old leaks in
component_meta = {}

for rank, nodes in enumerate(small_components, start=2):
    sub = G.subgraph(nodes)
    component_meta[rank] = {
        "num_nodes": sub.number_of_nodes(),
        "num_edges": sub.number_of_edges(),
    }

    # sort the node ids to have a stable order
    for abbr in sorted(nodes):
        station_rows = sbb_line_data.loc[sbb_line_data["Station abbreviation"] == abbr]
        if station_rows.empty:
            continue  # skip nodes we cannot map back to the CSV
        row = station_rows.iloc[0]

        rows.append({
            "component_rank": rank,
            "station_abbreviation": abbr,
            "stop_name": row["Stop name"],
            "didok": row["Didok number"],
            "line": row["Line"],
            "km": row["KM"],
        })

small_cc_df = (
    pd.DataFrame(rows)
    .sort_values(["component_rank", "km"])
    .reset_index(drop=True)
)

# --- 2) Aggregate to one row per small component, handling singles gracefully ---

pair_rows = []
singleton_rows = []
for comp_rank, group in small_cc_df.groupby("component_rank"):
    group = group.sort_values("km")
    meta = component_meta.get(comp_rank, {"num_nodes": len(group), "num_edges": 0})

    if len(group) < 2:
        singleton_rows.append({
            "component_rank": comp_rank,
            "num_nodes": meta["num_nodes"],
            "num_edges": meta["num_edges"],
            "station_abbr": group.iloc[0]["station_abbreviation"],
            "station_name": group.iloc[0]["stop_name"],
        })
        continue

    a, b = group.iloc[0], group.iloc[1]

    pair_rows.append({
        "component_rank": comp_rank,
        "num_nodes": meta["num_nodes"],
        "num_edges": meta["num_edges"],
        "station1_abbr": a["station_abbreviation"],
        "station1_name": a["stop_name"],
        "station2_abbr": b["station_abbreviation"],
        "station2_name": b["stop_name"],
    })

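pair_columns = [
    "component_rank", "num_nodes", "num_edges",
    "station1_abbr", "station1_name", "station2_abbr", "station2_name",
]
singleton_columns = ["component_rank", "num_nodes", "num_edges", "station_abbr", "station_name"]

# Passing explicit columns keeps both tables well-formed even when a list is empty
small_pairs_df = (
    pd.DataFrame(pair_rows, columns=pair_columns)
    .sort_values("component_rank")
    .reset_index(drop=True)
)

singleton_components_df = (
    pd.DataFrame(singleton_rows, columns=singleton_columns)
    .sort_values("component_rank")
    .reset_index(drop=True)
)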

In [37]:
print(small_pairs_df)

   component_rank  num_nodes  num_edges station1_abbr  \
0               2          2          1          ASKO   
1               3          2          1          ASZW   
2               4          2          1          BOCS   
3               5          2          1          BOZS   
4               6          2          1          FACO   
5               7          2          1          LNTO   
6               8          2          1          SEZU   
7               9          2          1          SIFO   

                   station1_name station2_abbr                   station2_name  
0        Amsteg Kabelstollen Ost          ASSW        Amsteg Kabelstollen West  
1     Amsteg Zugangsstollen West          ASZO       Amsteg Zugangsstollen Ost  
2  Bodio cunicolo di aggira. sud          BOAN  Bodio cunicolo di aggira. nord  
3     Bozberg Zugangsstollen Sud          BOZN     Bozberg Zugangsstollen Nord  
4   Faido cunicolo di acc. ovest          FACE      Faido cunicolo di acc. est  


In [38]:
if not singleton_components_df.empty:
    print("Singleton components:")
    print(singleton_components_df)
else:
    print("No singleton components detected.")


Singleton components:
   component_rank  num_nodes  num_edges station_abbr station_name
0              10          1          0         LZZB       Luzern


## Main Network Extraction

Finally, we extract the rows from the original dataset that belong to the largest connected component (the main Swiss railway network). This filtered dataset can be used for further analysis focusing only on the interconnected network while excluding isolated segments.

In [39]:
components = list(nx.connected_components(G))

# largest component (set of node IDs)
largest_nodes = max(components, key=len)

# rows of df that belong to the big component
df_big = sbb_line_data[sbb_line_data["Station abbreviation"].isin(largest_nodes)].copy()

print("Rows in big component:", len(df_big))
print("Unique nodes in big component:", df_big["Station abbreviation"].nunique())

Rows in big component: 1865
Unique nodes in big component: 1338
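
The filtered table has more rows than unique stations, presumably because a station that lies on several lines appears once per line. A quick way to list such stations is `value_counts` on the abbreviation column. The sketch below runs on invented data; `toy_lines` and `largest_nodes_toy` are hypothetical stand-ins for `sbb_line_data` and `largest_nodes`.

```python
import pandas as pd

# Invented stand-in for the line table: one row per (station, line)
toy_lines = pd.DataFrame({
    "Station abbreviation": ["OL", "OL", "ABO", "LZ", "LZ", "LZ", "XX"],
    "Line": [500, 450, 500, 500, 660, 711, 999],
})
# Invented stand-in for the node set of the largest component
largest_nodes_toy = {"OL", "ABO", "LZ"}

# Keep only rows whose station belongs to the largest component
df_big_toy = toy_lines[toy_lines["Station abbreviation"].isin(largest_nodes_toy)]

# Stations that occur in more than one row, i.e. on more than one line
rows_per_station = df_big_toy["Station abbreviation"].value_counts()
junctions = rows_per_station[rows_per_station > 1]

print(len(df_big_toy), df_big_toy["Station abbreviation"].nunique())
print(junctions.to_dict())
```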
