[< Back to the main notebook](./index.md)


# Detour no.2: Public transport stops data collection / cleaning

> This is a rendered version of a Jupyter notebook. The source notebook can be found [in my GitHub repository](https://github.com/barjin/ndbi023-project), along with the data used in this analysis.

For the data about public transport stops in Prague, I used the open data feed at https://data.pid.cz/stops/json/stops.json, published by Prague Integrated Transport (PID). The file is in JSON format and describes the stops of the PID network, grouped by stop name.

I first preprocessed the data using the bash script from [`./scripts/pid_stops/process_stops.sh`](https://github.com/barjin/ndbi023-project/blob/master/scripts/pid_stops/process_stops.sh). The script reads the JSON file, keeps only the relevant fields (the name, the average coordinates, and the types of lines serving each stop group), and prints a much smaller JSON document to stdout, which can then be redirected to a file.
The script uses `jq`, a command-line JSON processor, to do the work. It is shown below:

```bash
#!/bin/bash
# This script uses jq to process the stops data from http://data.pid.cz/stops/json/stops.json 
# to a smaller, more readable file.
# Jindřich Bär (barjin), 2024
#
# Expected usage: ./process_stops.sh ./stops.json
#  - pass the path to the json file from the link above as the first (and only) parameter.
#  - the script outputs the processed JSON into stdout.

jq "
    .stopGroups[] | 
    { 
        name: .uniqueName, 
        lat: .avgLat, 
        lng: .avgLon,
        types: .stops | [.[].lines[].type] | unique
    }
" "$1" | jq -s
```
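
For readers less familiar with `jq`, the filter above can be approximated in Python. This is only a sketch, not part of the original pipeline: the function name `summarize_stop_groups` and the `sample` input are made up for illustration, and the sample contains just the fields the filter reads.

```python
import json


def summarize_stop_groups(data):
    """Mirror the jq filter: one small record per stop group."""
    records = []
    for group in data["stopGroups"]:
        # Collect the type of every line at every stop in the group.
        # sorted(set(...)) plays the role of jq's `unique` (sorted, deduplicated).
        types = sorted(
            {line["type"] for stop in group["stops"] for line in stop["lines"]}
        )
        records.append(
            {
                "name": group["uniqueName"],
                "lat": group["avgLat"],
                "lng": group["avgLon"],
                "types": types,
            }
        )
    return records


# Hypothetical, heavily trimmed input using the field names the jq filter reads.
sample = {
    "stopGroups": [
        {
            "uniqueName": "Anděl",
            "avgLat": 50.07126,
            "avgLon": 14.403365,
            "stops": [
                {"lines": [{"type": "tram"}, {"type": "bus"}]},
                {"lines": [{"type": "metro"}, {"type": "tram"}]},
            ],
        }
    ]
}

processed = summarize_stop_groups(sample)
print(json.dumps(processed, ensure_ascii=False, indent=2))
```

The returned list corresponds to what the trailing `jq -s` produces: the individual records collected into a single JSON array.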

## Normalizing the JSON data

Once we've acquired and cleaned the data, there is very little left to do. One thing that might be useful is to "normalize" the data. 

For every stop, the `types` field contains an array of strings, each representing a type of public transport that stops at the given stop.
JSON is a hierarchical format, whereas a Pandas DataFrame is tabular, so we flatten the array in the `types` field into one boolean column per type of public transport. Each column then says whether that type serves the given stop.
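
To make the target shape concrete, here is a minimal sketch on toy data. It uses `explode` and `pd.get_dummies`, which is one possible way to build such an incidence table and not necessarily the approach used in this notebook. The frame `toy` and its stop names are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the real dataset).
toy = pd.DataFrame(
    {
        "name": ["A", "B", "C"],
        "types": [["bus"], ["bus", "tram"], ["metro", "tram", "bus"]],
    }
)

# One row per (stop, type) pair; the original index value is repeated.
exploded = toy["types"].explode()

# One indicator column per type, then collapse back to one row per stop.
incidence = pd.get_dummies(exploded, dtype=int).groupby(level=0).max().astype(bool)

# Replace the list column with the boolean columns.
flat = toy.drop(columns=["types"]).join(incidence)
print(flat)
```

Each list of types becomes a row of `True`/`False` flags, which is easier to filter and aggregate than a column of lists.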

In [4]:
import pandas as pd

df = pd.read_json('./data/pid_stops/processed_stops.json')

def list_to_incidence_dict(l):
    d = {}

    for item in l:
        d[item] = True

    return d

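# Each dict holds True for the types present, so missing types come out as NaN.
# notna() turns that directly into clean boolean columns.
df = df.join(pd.DataFrame(list(df['types'].map(list_to_incidence_dict)), index=df.index).notna())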
df.drop(columns=['types'], inplace=True)

df



| | name | lat | lng | bus | tram | metro | train | ferry | trolleybus | funicular |
|--:|:--|--:|--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
| 0 | Adamov | 49.858105 | 15.408134 | True | False | False | False | False | False | False |
| 1 | Albertov | 50.067917 | 14.420799 | True | True | False | False | False | False | False |
| 2 | Ametystová | 49.988200 | 14.362217 | True | False | False | False | False | False | False |
| 3 | Amforová | 50.041780 | 14.327298 | True | False | False | False | False | False | False |
| 4 | Anděl | 50.071260 | 14.403365 | True | True | True | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 7989 | Žleby,ZŠ | 49.889290 | 15.488659 | True | False | False | False | False | False | False |
| 7990 | Žloukovice | 50.016018 | 13.955886 | False | False | False | True | False | False | False |
| 7991 | Žlutice | 50.084600 | 13.159661 | False | False | False | True | False | False | False |
| 7992 | Županovice | 49.706900 | 14.298495 | True | False | False | False | False | False | False |


Now that we have our normalized Pandas DataFrame, we can try to plot it on a map using the `staticmap` library. 

Using the new columns for each type of public transport, we can color the stops based on the type of public transport that stops there.
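
The boolean columns make this selection straightforward, because a boolean column can be used directly as a row mask. The sketch below shows the idea on a hypothetical miniature of the table. The frame `stops` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical miniature of the normalized table (values are illustrative).
stops = pd.DataFrame(
    {
        "name": ["Stop A", "Stop B", "Stop C"],
        "bus": [True, True, False],
        "tram": [False, True, True],
    }
)

# A boolean column selects the rows where it is True.
tram_stops = stops[stops["tram"]]
print(tram_stops["name"].tolist())

# Summing boolean columns counts the stops served by each type.
counts = stops[["bus", "tram"]].sum()
print(counts)
```

Note that a stop served by several types matches several masks, so a plotting loop that goes category by category adds one marker per type for that stop.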

In [5]:
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
from staticmap import StaticMap, CircleMarker

def get_map(df, categories, title='', scale_label=''):
    m = StaticMap(1600, 1000)
    colors = ['yellow', 'blue', 'brown', 'green', 'red', 'black']

    # Every category needs its own color, so refuse to plot more categories than colors.
    if len(categories) > len(colors):
        raise ValueError(f'Too many categories ({len(categories)}), I only have {len(colors)} colors!')

    legend_handles = []

    for c, category in enumerate(categories):
        legend_handles.append(mpatches.Patch(color=colors[c], label=category.capitalize()))

        for index, row in df[df[category]].iterrows():
            point = CircleMarker(
                (row['lng'], 
                row['lat']), 
                color=colors[c], 
                width=10)
            m.add_marker(point)

    image = m.render(zoom=12, center=(14.4399466,50.0859818))

    fig, ax = plt.subplots(figsize=(15, 10))
    fig.suptitle(title, verticalalignment='bottom', fontsize=16, y=0.9)
    fig.legend(handles=legend_handles, loc='center right')

    ax.imshow(image)
    ax.set_xticklabels([])
    ax.set_yticklabels([])
    ax.set_xticks([])
    ax.set_yticks([])

    return fig


In [11]:
plt.ioff()

get_map(
    df,
    ['bus', 'tram', 'metro', 'trolleybus', 'funicular', 'ferry'],
    title='Public transport stops in Prague, Czech Republic'
).savefig('./img/pid/01_pid_stops.png', bbox_inches='tight', pad_inches=0.1)

| ![Public transport stops in Prague](./img/pid/01_pid_stops.png) |
|:--:|
| *Public transport stops in Prague, colored by the types of public transport.* |

This concludes our preprocessing step for the public transport stops data. We can now store the cleaned data in a CSV file and use it in our analysis.

In [10]:
df.to_csv("./data/pid_stops/index.csv", index=False)
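
As a quick sanity check, one can confirm that the boolean columns survive the CSV round trip. The sketch below writes to an in-memory buffer instead of the real output path, and the frame `cleaned` is a hypothetical two-row stand-in for the real table.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the cleaned table.
cleaned = pd.DataFrame(
    {
        "name": ["Stop A", "Stop B"],
        "lat": [50.07, 50.08],
        "lng": [14.40, 14.42],
        "bus": [True, False],
        "tram": [False, True],
    }
)

# Write to an in-memory buffer so the sketch needs no files on disk.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# read_csv parses columns containing only True/False back into booleans.
restored = pd.read_csv(buffer)
print(restored.dtypes)
```

With `index=False`, the row index is not written, so the file read back contains no extra unnamed index column.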

---

[< Back to the main notebook](./index.md)