# [DataViz]: Milestone 1 (10% of the final grade)

**Group ID:** The Vizards

**Author 1 (sciper):** Salma Ed-dahabi (282284)

**Author 2 (sciper):** Antonin Faure (302686)   

**Author 3 (sciper):** Lena Vogel (297026) 

**Due date:** 07.04.2023 (11:59 pm)

[Github link] : https://github.com/com-480-data-visualization/project-2023-the-vizards

---
## Part 1 - Dataset

Find a dataset (or multiple) that you will explore. Assess the quality of the data it contains and how much preprocessing / data-cleaning it will require before tackling visualization. We recommend using a standard dataset as this course is not about scraping nor data processing.

- We found our dataset on https://opentransportdata.swiss/de/showcase-5/, a website that publishes data on all public transport in Switzerland. Because we have to import everything before filtering down to the data that concerns Lausanne, the raw data is very heavy (~15 GB per month).
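
Since a month of raw data is too large to load comfortably at once, one option is to stream each file in chunks and keep only the rows for Lausanne stops. The following is a minimal sketch, assuming a semicolon-separated file with a `BPUIC` stop identifier column; the function name `filter_csv_in_chunks`, the sample rows, the stop ids, and the chunk size are all illustrative and not part of our pipeline.

```python
import io
import pandas as pd

def filter_csv_in_chunks(source, stop_ids, chunksize=100_000):
    """Stream a ';'-separated CSV and keep only rows whose BPUIC is in stop_ids."""
    kept = []
    for chunk in pd.read_csv(source, sep=";", chunksize=chunksize):
        kept.append(chunk[chunk["BPUIC"].isin(stop_ids)])
    return pd.concat(kept, ignore_index=True)

# Tiny made-up sample standing in for one daily file (ids are hypothetical)
sample = io.StringIO(
    "BPUIC;LINIEN_TEXT\n"
    "111;IR\n"
    "222;IC\n"
    "111;S1\n"
)
lausanne_rows = filter_csv_in_chunks(sample, [111], chunksize=2)
```

Only one chunk is held in memory at a time, so the peak memory use depends on `chunksize` rather than on the size of the file.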



In [1]:
from plotly.subplots import make_subplots
import plotly.express as px
import plotly.graph_objects as go
import pandas as pd
from pyproj import Proj, transform
import numpy as np
from zipfile import ZipFile
import os
from tqdm import tqdm

### 1.1) Preprocessing of stops

In [2]:
stops = pd.read_excel('data/bav_list_current_timetable.xlsx', header=1).drop(index=[0])

In [3]:
# Bounding box of the Lausanne area in LV95 coordinates (metres), as [min, max]
lausanne_box = {
    "east": [2529114.807, 2544879.291],
    "north": [1149116.298, 1159080.496]
}

# Filter stops in Lausanne area
stops = stops[(stops['Coord. E'] > lausanne_box["east"][0])
            & (stops['Coord. E'] < lausanne_box["east"][1])
            & (stops['Coord. N'] > lausanne_box["north"][0])
            & (stops['Coord. N'] < lausanne_box["north"][1])
            ]

# Keep only relevant columns
stops = stops[["N° sv.85", "Nom (ordre alphab.)", "Statut", "Moyen de transport", "N° ET", "Sigle ET", "Coord. E", "Coord. N", "Altitude"]]

In [4]:
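from pyproj import Transformer

# Convert LV95 (EPSG:2056) coordinates to WGS84 (EPSG:4326).
# always_xy=True keeps the (easting, northing) -> (lon, lat) axis order.
lv95_to_wgs84 = Transformer.from_crs("EPSG:2056", "EPSG:4326", always_xy=True)
lon, lat = lv95_to_wgs84.transform(
    stops["Coord. E"].astype(float).to_numpy(),
    stops["Coord. N"].astype(float).to_numpy()
)
stops["lon"] = lon
stops["lat"] = lat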
stops = stops.drop(labels=["Coord. E", "Coord. N"], axis=1)
stops.rename({
    "N° sv.85": "stop_id",
    "Nom (ordre alphab.)": "stop_name",
    "Statut": "stop_status",
    "Moyen de transport": "transport_mode",
    "N° ET": "transport_id",
    "Sigle ET": "company_name",
    "Altitude": "altitude"
}, axis=1, inplace=True)

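
As a rough cross-check of the projection step, the LV95 to WGS84 conversion can also be approximated without any library. The sketch below uses the approximate series published by swisstopo; the coefficients are quoted from memory and should be verified against the official document before relying on them, and the function name `lv95_to_wgs84_approx` is our own.

```python
def lv95_to_wgs84_approx(east, north):
    """Approximate LV95 (EPSG:2056) -> WGS84 (longitude, latitude) in degrees."""
    # Offsets from the projection origin in Bern, in units of 1000 km
    y = (east - 2_600_000) / 1_000_000
    x = (north - 1_200_000) / 1_000_000
    lon = (2.6779094 + 4.728982 * y + 0.791484 * y * x
           + 0.1306 * y * x**2 - 0.0436 * y**3)
    lat = (16.9023892 + 3.238272 * x - 0.270978 * y**2
           - 0.002528 * x**2 - 0.0447 * y**2 * x - 0.0140 * x**3)
    # The series are expressed in units of 10000 arc-seconds; convert to degrees
    return lon * 100 / 36, lat * 100 / 36

# Centre of the Lausanne bounding box used above
center_lon, center_lat = lv95_to_wgs84_approx(2536997.049, 1154098.397)
```

If the values returned for a few stops differ noticeably from the `lon` and `lat` columns, the axis order of the projection call is the first thing to check.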


### 1.2) Preprocessing of timetables

In [5]:
keep_columns = [
    'PRODUKT_ID',
    'BETREIBER_ABK',
    'LINIEN_ID',
    'LINIEN_TEXT',
    'BPUIC',
    'FAELLT_AUS_TF',
    'ANKUNFTSZEIT',
    'AN_PROGNOSE',
    'AN_PROGNOSE_STATUS',
    'ABFAHRTSZEIT',
    'AB_PROGNOSE',
    'AB_PROGNOSE_STATUS',
    'DURCHFAHRT_TF'
]

In [6]:
# Read monthly zip files, filter Lausanne area and write csv file per day
for month_file in os.listdir('data/raw'):
    if month_file.endswith('.zip'):
        print('reading {} ...'.format(month_file))
        zip_file = ZipFile('data/raw/{}'.format(month_file))
        for i, text_file in tqdm(enumerate(zip_file.infolist()), total=len(zip_file.infolist())):
            if text_file.filename.endswith('.csv') and not str(text_file.filename).startswith('__MACOSX/'):
                timetable = pd.read_csv(zip_file.open(text_file.filename), sep=";", low_memory=False)

                # Filter Lausanne area
                # (the stop id column was renamed from 'N° sv.85' to 'stop_id' above)
                timetable = timetable[timetable["BPUIC"].isin(stops['stop_id'].to_list())]

                # Keep only relevant columns
                timetable = timetable[keep_columns]
                
                timetable.to_csv('data/timetables/{}'.format(os.path.basename(text_file.filename)), sep=';', index=False)
                del timetable

FileNotFoundError: [Errno 2] No such file or directory: 'data/raw'

In [7]:
# Merge all daily files into one BIG file
daily_frames = []
for day_file in tqdm(os.listdir('data/timetables')):
    if day_file.endswith('.csv'):
        timetable = pd.read_csv('data/timetables/{}'.format(day_file), sep=";", low_memory=False)
        timetable['FAELLT_AUS_TF'] = timetable['FAELLT_AUS_TF'].astype(bool)
        timetable['DURCHFAHRT_TF'] = timetable['DURCHFAHRT_TF'].astype(bool)
        daily_frames.append(timetable)

# Concatenate once at the end instead of growing a DataFrame inside the loop
all_timetables = pd.concat(daily_frames, ignore_index=True)[keep_columns]
all_timetables.to_csv('data/timetables.csv', sep=';', index=False)

100%|██████████| 450/450 [56:50<00:00,  7.58s/it]


---
## Part 2 - Problematic

**Objectives overview**:

Frame the general topic of your visualization and the main axis that you want to develop.

**1)** What am I trying to show with my visualization? With our visualisations we would like to show: 
- Lausanne traffic depending on the day and time: the punctuality of the buses is a good proxy for the traffic.
- Lausanne's connectivity across the city and its suburbs: seeing the concentration of public transport (and not only the lines/trajectories) lets anyone judge whether an area is well served.
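
To make the punctuality measure concrete, a delay can be computed as the difference between the actual time (`AN_PROGNOSE`) and the scheduled time (`ANKUNFTSZEIT`) of an arrival. The sketch below uses two made-up records in the timestamp formats of the dataset; the column name `arrival_delay_min` is illustrative.

```python
import pandas as pd

# Two made-up arrival records in the timestamp formats used by the dataset
records = pd.DataFrame({
    "ANKUNFTSZEIT": ["21.03.2023 08:00", "21.03.2023 08:10"],
    "AN_PROGNOSE": ["21.03.2023 08:02:30", "21.03.2023 08:09:00"],
})
scheduled = pd.to_datetime(records["ANKUNFTSZEIT"], format="%d.%m.%Y %H:%M")
actual = pd.to_datetime(records["AN_PROGNOSE"], format="%d.%m.%Y %H:%M:%S")

# Positive values are late arrivals, negative values are early arrivals
records["arrival_delay_min"] = (actual - scheduled).dt.total_seconds() / 60
```

Here the first vehicle arrives 2.5 minutes late and the second one minute early.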

**2)** Think of an overview for the project, your motivation, and the target audience.
- The main target audience is people who would like to move to Lausanne in the near future.
- Another audience could be the TL themselves, as it is useful for them to know which problematic areas make the buses late. More generally, the visualization could be used analytically by others, such as geographers or sociologists.

---
## Part 3 - Exploratory Data Analysis

Pre-processing of the data set you chose.

Show some basic statistics and get insights about the data



In [8]:
fig = px.scatter_mapbox(
    stops, lat="lat",
    lon="lon",
    color="company_name", # which column to use to set the color of markers
    hover_name="stop_name", # column added to hover information,
    zoom=3, mapbox_style='open-street-map', height=800
    )

fig.update_layout(
    autosize=True,
    hovermode='closest',
    mapbox=dict(
        bearing=0,
        center=dict(lat=46.52097182546228, lon=6.633647865079138),
        pitch=0,
        zoom=11
    ),
    title="Public transport stops in Lausanne",
)
fig.show()

In [9]:
fig = px.scatter(stops, x="company_name", y="altitude", color="company_name", title="Altitude of public transport stops for each company")
fig.add_hline(y=372, line_width=1, line_dash="dash", line_color="blue", annotation_text="Leman", annotation_font_color="blue")
fig.show()

In [10]:
fig = px.histogram(stops, x="company_name", color="company_name", title="Number of stops for each public transport company")
fig.show()

In [44]:
def count_values(date, column):
    # Count the rows of the given day's file for each value of `column`,
    # most frequent first
    transactions = pd.read_csv('data/timetables/{}_istdaten.csv'.format(date), sep=';')
    return transactions[column].value_counts().to_dict()

def map_modes(date):
    # Number of records per transport mode
    return count_values(date, 'PRODUKT_ID')

def map_companies(date):
    # Number of records per operating company
    return count_values(date, 'BETREIBER_ABK')

def map_vehicles(date):
    # Number of records per line
    return count_values(date, 'LINIEN_ID')

def map_stops(date):
    # Number of records per stop
    return count_values(date, 'BPUIC')

def day_transactions(date):
    transactions = pd.read_csv('data/timetables/{}_istdaten.csv'.format(date), sep=';')

    # Plot transactions per hour of the day grouped by company name
    transactions['ANKUNFTSZEIT'] = pd.to_datetime(transactions['ANKUNFTSZEIT'], format='%d.%m.%Y %H:%M')
    transactions['ABFAHRTSZEIT'] = pd.to_datetime(transactions['ABFAHRTSZEIT'], format='%d.%m.%Y %H:%M')
    transactions['AN_HOUR'] = transactions['ANKUNFTSZEIT'].dt.hour
    transactions['AB_HOUR'] = transactions['ABFAHRTSZEIT'].dt.hour

    # Define the line colors for each company
    colors = {'SBB': '#EB0000', 'TL': '#005198', 'LEB': '#5AB034', 'MBC Auto': '#03A84B', 'PAG': 'orange'}

    # Create a single subplot
    fig = make_subplots(rows=1, cols=1)

    # Loop over companies and add each trace to the subplot
    for company in transactions['BETREIBER_ABK'].unique():
        hourly_counts = transactions[transactions['BETREIBER_ABK'] == company].groupby('AN_HOUR').size().reset_index(name='TRANSACTION_COUNT')
        fig.add_trace(
            go.Scatter(x=hourly_counts['AN_HOUR'], 
                    y=hourly_counts['TRANSACTION_COUNT'],
                    mode='lines',
                    name=company,
                    # Companies without a predefined color fall back to Plotly's default
                    line=dict(color=colors.get(company)))
        )

    # Update the layout and show the plot
    fig.update_layout(
        title_text='Number of transactions per hour on {date} ({day})'.format(date=date, day=pd.Timestamp(date).day_name()),
        height=600,
        xaxis_title='Hour of the Day',
        yaxis_title='Number of Transactions'
    )
    fig.show()




def day_delays(date):
    transactions = pd.read_csv('data/timetables/{}_istdaten.csv'.format(date), sep=';')

    # Plot median delays per hour of the day grouped by company name
    transactions['ANKUNFTSZEIT'] = pd.to_datetime(transactions['ANKUNFTSZEIT'], format='%d.%m.%Y %H:%M')
    transactions['ABFAHRTSZEIT'] = pd.to_datetime(transactions['ABFAHRTSZEIT'], format='%d.%m.%Y %H:%M')
    transactions['AN_PROGNOSE'] = pd.to_datetime(transactions['AN_PROGNOSE'], format='%d.%m.%Y %H:%M:%S')
    transactions['AB_PROGNOSE'] = pd.to_datetime(transactions['AB_PROGNOSE'], format='%d.%m.%Y %H:%M:%S')
    transactions['AN_DELAY'] = transactions['AN_PROGNOSE'] - transactions['ANKUNFTSZEIT']
    transactions['AB_DELAY'] = transactions['AB_PROGNOSE'] - transactions['ABFAHRTSZEIT']

    # Compute the median delay (in minutes) per company and hour of the day.
    # Only the delay columns are aggregated: taking the median of every column
    # fails on text columns such as PRODUKT_ID.
    transactions['AN_HOUR'] = transactions['ANKUNFTSZEIT'].dt.hour
    transactions['AB_HOUR'] = transactions['ABFAHRTSZEIT'].dt.hour
    transactions['AN_DELAY'] = transactions['AN_DELAY'].dt.total_seconds() / 60
    transactions['AB_DELAY'] = transactions['AB_DELAY'].dt.total_seconds() / 60
    arrivals = transactions.groupby(['BETREIBER_ABK', 'AN_HOUR'])['AN_DELAY'].median().reset_index()
    departures = transactions.groupby(['BETREIBER_ABK', 'AB_HOUR'])['AB_DELAY'].median().reset_index()

    # Plot median delays per hour of the day with one subplot per company
    companies = list(transactions['BETREIBER_ABK'].unique())
    fig = make_subplots(rows=max(1, (len(companies) + 1) // 2), cols=2, subplot_titles=companies)
    for i, company in enumerate(companies):
        company_arrivals = arrivals[arrivals['BETREIBER_ABK'] == company]
        company_departures = departures[departures['BETREIBER_ABK'] == company]
        fig.add_trace(go.Scatter(x=company_arrivals['AN_HOUR'], y=company_arrivals['AN_DELAY'], name="{}_arrival".format(company)), row=i//2 + 1, col=i%2 + 1)
        fig.add_trace(go.Scatter(x=company_departures['AB_HOUR'], y=company_departures['AB_DELAY'], name="{}_departure".format(company)), row=i//2 + 1, col=i%2 + 1)
    fig.update_layout(title_text='Median delays per hour on {date} ({day}) (in minutes)'.format(date=date, day=pd.Timestamp(date).day_name()), height=800)
    fig.show()

In [35]:
day_delays('2023-03-21')


In [45]:
day_transactions('2023-03-21')

---
## Part 4 - Related work

- What others have already done with the data?

We took our data from the website https://opentransportdata.swiss/de/showcase-5/, which already labels and organizes its data.

- Why is your approach original?

We would like to focus our visualizations on Lausanne only, so that any current or future Lausanne resident can get a full view of the city's public transport.

- What source of inspiration do you take? Visualizations that you found on other websites or magazines (might be unrelated to your data).

Similar visualisations already exist, for instance for Switzerland: https://observablehq.com/@alexmasselot/mapping-swiss-trains-delays-over-one-day/2, which shows the delays of the trains in Switzerland, and https://mobility.portal.geops.io/world.geops.transit?baselayer=world.geops, which displays most of the public transport in the world and its position in real time.

- In case you are using a dataset that you have already explored in another context (ML or ADA course, semester project...), you are required to share the report of that work to outline the differences with the submission for this class.

N/A

