# 🔍 Exploration and Data Preprocessing

In this section, we take a first look at the data and preprocess it for our analysis. We perform the following steps:
1. Download the data from the OpenTransportData platform for the 10th of March 2023.
2. Remove unnecessary columns and translate the column names from German to English.

In [4]:
import pandas as pd

## Download the Data and Keep Interesting Columns

The data is available as a CSV file. We will use the `pandas` library to read the data into a `DataFrame` and keep only the columns that we are interested in. The data contains the following columns:
| Column Name | Type | Description |
| --- | --- | --- |
| `BETRIEBSTAG` | Date (DD.MM.YYYY) | Date of the journey |
| `FAHRT_BEZEICHNER` | String | Identifier of the journey (see https://opentransportdata.swiss/en/cookbook/fahrt-id-journeyref/) |
| `BETREIBER_ID` | String | Identifier of the operator |
| `BETREIBER_ABK` | String | Abbreviation of the operator |
| `BETREIBER_NAME` | String | Name of the operator |
| `PRODUKT_ID` | Category | Identifier of the product |
| `LINIEN_ID` | String | Identifier of the line |
| `LINIEN_TEXT` | String | Name of the line |
| `UMLAUF_ID` | String | Identifier of the circuit (see https://opentransportdata.swiss/en/cookbook/umlauf/) |
| `VERKEHRSMITTEL_TEXT` | String | Type of transport |
| `ZUSATZFAHRT_TF` | Boolean | Whether the journey is an additional one |
| `FAELLT_AUS_TF` | Boolean | Whether the journey is cancelled |
| `BPUIC` | String | Identifier of the stop (see https://didok.ch/en/glossary-stop-points-location-codes/) |
| `HALTESTELLEN_NAME` | String | Name of the stop |
| `ANKUNFTSZEIT` | DateTime (DD.MM.YYYY HH:MM:SS) | Arrival time at the stop |
| `AN_PROGNOSE` | DateTime (DD.MM.YYYY HH:MM:SS) | Arrival time at the stop (predicted) |
| `AN_PROGNOSE_STATUS` | Category | Status of the predicted arrival time, one of UNKNOWN, FORECAST, ESTIMATED, REAL, or empty |
| `ABFAHRTSZEIT` | DateTime (DD.MM.YYYY HH:MM:SS) | Departure time from the stop |
| `AB_PROGNOSE` | DateTime (DD.MM.YYYY HH:MM:SS) | Departure time from the stop (predicted) |
| `AB_PROGNOSE_STATUS` | Category | Status of the predicted departure time, one of UNKNOWN, FORECAST, ESTIMATED, REAL, or empty |
| `DURCHFAHRT_TF` | Boolean | Whether the vehicle passes through the stop without stopping |
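
The notebook does not state which columns count as interesting, so the following is only a minimal sketch of how a subset could be selected at read time with the `usecols` parameter of `pd.read_csv`. The sample rows and the `COLUMNS_TO_KEEP` list are assumptions made up for illustration, not the selection used in this analysis.

```python
import io

import pandas as pd

# Made-up one-row sample in the same ';'-separated layout as the real file.
sample_csv = io.StringIO(
    "BETRIEBSTAG;BETREIBER_NAME;LINIEN_TEXT;HALTESTELLEN_NAME;FAELLT_AUS_TF\n"
    "10.03.2023;Example Operator;IC1;Example Stop;false\n"
)

# Hypothetical subset of columns to keep (an assumption for illustration).
COLUMNS_TO_KEEP = ["BETRIEBSTAG", "LINIEN_TEXT", "HALTESTELLEN_NAME", "FAELLT_AUS_TF"]

# usecols drops every column that is not listed while the file is being read.
subset = pd.read_csv(
    sample_csv,
    sep=";",
    usecols=COLUMNS_TO_KEEP,
    dtype={
        "LINIEN_TEXT": "string",
        "HALTESTELLEN_NAME": "string",
        "FAELLT_AUS_TF": "boolean",
    },
)

print(list(subset.columns))
```

Selecting columns while reading avoids loading data that is discarded immediately afterwards, which can matter for a file that covers a full day of journeys.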

In [5]:
# Set to False to read the file directly from the internet instead of the local copy
LOCAL_DOWNLOAD = True
DATE_FILE = '2023-03-10' # Format YYYY-MM-DD

In [8]:
# Download the data
DATA_LINK = f"https://opentransportdata.swiss/dataset/0edc74a3-ad4d-486e-8657-f8f3b34a0979/resource/9de83058-0365-481d-85fa-a8477709b9fa/download/{DATE_FILE}_istdaten.csv"
DATA_FILE = f"data/{DATE_FILE}_istdaten.csv"
DATA_PATH = DATA_FILE if LOCAL_DOWNLOAD else DATA_LINK
transport_data = pd.read_csv(
    DATA_PATH,
    sep=';',
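    parse_dates=['BETRIEBSTAG', 'ANKUNFTSZEIT', 'AN_PROGNOSE', 'ABFAHRTSZEIT', 'AB_PROGNOSE'],
    dayfirst=True,  # Source dates are day-first (DD.MM.YYYY), so 10.03.2023 must not be read as October 3rd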
    dtype={
        'FAHRT_BEZEICHNER': 'string',
        'BETREIBER_ID': 'string',
        'BETREIBER_ABK': 'string',
        'BETREIBER_NAME': 'string',
        'PRODUKT_ID': 'string',
        'LINIEN_ID': 'string',
        'LINIEN_TEXT': 'string',
        'UMLAUF_ID': 'string',
        'VERKEHRSMITTEL_TEXT': 'string',
        'ZUSATZFAHRT_TF': 'boolean',
        'FAELLT_AUS_TF': 'boolean',
        'BPUIC': 'string',
        'HALTESTELLEN_NAME': 'string',
        'AN_PROGNOSE_STATUS': 'string',
        'AB_PROGNOSE_STATUS': 'string',
        'DURCHFAHRT_TF': 'boolean'
    }
)

In [None]:
transport_data.head()
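
The date and time columns use the day-first format listed in the table above, which is easy to misread as month-first for a date such as 10.03.2023. A minimal sketch of how the interpretation can be checked on a small sample follows. The two rows are made up, and parsing with an explicit `format` string is one possible approach rather than the one used in the cell above.

```python
import io

import pandas as pd

# Made-up two-row sample in the same ';'-separated layout as the real file.
sample_csv = io.StringIO(
    "BETRIEBSTAG;ANKUNFTSZEIT;AN_PROGNOSE\n"
    "10.03.2023;10.03.2023 08:15:00;10.03.2023 08:16:30\n"
    "10.03.2023;10.03.2023 09:00:00;\n"
)

# Read the raw text first, without any automatic date parsing.
sample = pd.read_csv(sample_csv, sep=";")

# An explicit format leaves no room for swapping day and month.
sample["BETRIEBSTAG"] = pd.to_datetime(sample["BETRIEBSTAG"], format="%d.%m.%Y")
for column in ["ANKUNFTSZEIT", "AN_PROGNOSE"]:
    sample[column] = pd.to_datetime(sample[column], format="%d.%m.%Y %H:%M:%S")

# The operating day should fall in March, and the empty forecast becomes NaT.
print(sample["BETRIEBSTAG"].dt.month.unique())
print(sample["AN_PROGNOSE"].isna().tolist())
```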

## Translate Column Names

In [None]:
translations = {
    'BETRIEBSTAG': 'date',
    'FAHRT_BEZEICHNER': 'trip_id',
    'BETREIBER_ID': 'operator_id',
    'BETREIBER_ABK': 'operator_abbreviation',
    'BETREIBER_NAME': 'operator_name',
    'PRODUKT_ID': 'product_id',
    'LINIEN_ID': 'line_id',
    'LINIEN_TEXT': 'line_text',
    'UMLAUF_ID': 'circuit_id',
    'VERKEHRSMITTEL_TEXT': 'transport_type',
    'ZUSATZFAHRT_TF': 'is_additional_trip',
    'FAELLT_AUS_TF': 'is_cancelled',
    'BPUIC': 'stop_id',
    'HALTESTELLEN_NAME': 'stop_name',
    'ANKUNFTSZEIT': 'arrival_time',
    'AN_PROGNOSE': 'arrival_forecast',
    'AN_PROGNOSE_STATUS': 'arrival_forecast_status',
    'ABFAHRTSZEIT': 'departure_time',
    'AB_PROGNOSE': 'departure_forecast',
    'AB_PROGNOSE_STATUS': 'departure_forecast_status',
    'DURCHFAHRT_TF': 'is_through_trip'
}

transport_data = transport_data.rename(columns=translations)
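
`DataFrame.rename` silently skips mapping keys that match no column, so a misspelled German name would leave that column untranslated without any error. A minimal sketch of two ways to catch this, using a toy frame and a toy mapping with a deliberate typo (both made up for illustration):

```python
import pandas as pd

# Toy frame with two of the original German column names (values are made up).
toy = pd.DataFrame({"BETREIBER_ID": ["example-id"], "BPUIC": ["0000000"]})

# Hypothetical mapping with a deliberate typo in the first key.
toy_translations = {"BETRIEBER_ID": "operator_id", "BPUIC": "stop_id"}

# Option 1: list the keys that match no column before renaming.
unmatched = [key for key in toy_translations if key not in toy.columns]
print(unmatched)

# The default rename leaves the misspelled column untouched.
renamed = toy.rename(columns=toy_translations)
print(list(renamed.columns))

# Option 2: errors="raise" turns the silent skip into a KeyError.
try:
    toy.rename(columns=toy_translations, errors="raise")
    raised = False
except KeyError:
    raised = True
print(raised)
```

Passing `errors="raise"` is the stricter choice, since a mismatch then stops the notebook instead of surfacing later as a missing column.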