# hide
title: one year of free parking space records
enable: plotly

In [80]:
# hide
import sys
sys.path.insert(0, "..")

import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots

from nb_helpers import *
from html_tools import html_display

In [85]:
# hide
df_meta = pd.read_csv("../../../parking-data/meta-data.csv")

In [81]:
# hide
def search():
    s = Search("parking-data")
    s = s.range("timestamp", gte="2020-03-23", lte="2021-03-28")
    s = s.param(rest_total_hits_as_int=True)
    return s

One year of corona also means one year of [recorded parking lot occupancy across Germany](https://github.com/defgsus/parking-data/), because that's what i did when we were all supposed to stay at home and be calm. I do not drive and i'm generally more annoyed by cars than interested in them, but for some reason it felt right to record the data and make it public. In my opinion it is *social data* in the first place and time series of it are not freely available.

There were some ideas about a website where one could analyze and download the data but it seemed a bit too time consuming. A friend just suggested to push it to github and be done. And rightly so. In the meantime i discovered [elasticsearch](https://github.com/elastic/elasticsearch), learned stuff about kubernetes, grafana and kibana, fought once again with understanding [pandas](https://github.com/pandas-dev/pandas) and a couple of other statistical tools. Actually, all these areas are so overwhelming that i'm happy with just publishing CSV files and letting everybody handle them their own way.

The data comes from scattered websites spread across Germany. After a few days i thought i'd found most of them. With the usual [beautiful-soup](https://beautiful-soup-4.readthedocs.io/en/latest/) chopping, [each found website](https://github.com/defgsus/parking-scraper/tree/master/sources) is scraped and saved to a json file with the number of free parking places per lot. This is done every couple of minutes by a faithful little server and, after a long day, exported to CSV and pushed to github.

The rows in those CSVs only contain *changed* numbers and are blank otherwise to save space. The CSVs currently take up about **180 MB** and contain about **15,200,000** numbers.
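
Since only changes are stored, a blank cell should be read as "same value as before". Here is a minimal sketch of filling the blanks back in, using only the standard library. The column layout (a `timestamp` column followed by one column per parking lot) and the lot names are assumptions made for illustration, so check the files in the repository for the real layout.

```python
import csv
import io

# Hypothetical excerpt: column names and values are made up for illustration.
SAMPLE = """timestamp,lot-a,lot-b
2020-03-23T10:00:00,120,40
2020-03-23T10:05:00,,38
2020-03-23T10:10:00,118,
"""

def forward_fill(rows):
    """Yield each row with blank cells replaced by the last seen value."""
    last = {}
    for row in rows:
        filled = {}
        for key, value in row.items():
            if value != "":
                last[key] = value
            # stays None as long as the column has not had any value yet
            filled[key] = last.get(key)
        yield filled

filled_rows = list(forward_fill(csv.DictReader(io.StringIO(SAMPLE))))
for row in filled_rows:
    print(row)
```

With the files loaded into pandas, `DataFrame.ffill()` does the same job in one call.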

In [4]:
%%sh
# hide
du -hs ../../../parking-data/

179M	../../../parking-data/


In [5]:
# hide
search().execute().total_hits

15194237

During the year at least 35 cities with at least 500 car parks were sampled each week. The website scraping probably fails if the html changes a bit but that does not seem to happen often. Actually most of the websites look like they are hundreds of years old already. Still there is a bit of fluctuation.

In [6]:
# hide-code
df = (search()
    .agg_date_histogram("week", calendar_interval="week")
    .metric_cardinality("cities", field="city_name")
    .metric_cardinality("places", field="place_id")
    .execute().df()
)

In [10]:
df.plot.bar(
    "week", ["cities", "places"],
).update_layout(
    margin={"l": 0, "r": 0, "t": 0, "b": 0},
    height=250,
    title="Number of cities and places sampled each week",
)

So before showing any real occupancy i'll try to clean the data first. Here are the cities which happened to have no data during at least one week.

In [78]:
# hide-code
df = (search()
    .agg_date_histogram("week", calendar_interval="week")
    .agg_terms("city",field="city_name", size=100)
    .execute().df()
    .set_index(["week", "city"])
)
df_missing = df.drop("week.doc_count", axis=1).unstack()
df_missing.columns = df_missing.columns.get_level_values(1)
(df_missing.isnull().astype(int).sum().to_frame()
     .set_axis(["missing weeks"], axis=1)   
     .query("`missing weeks` > 0")
     .sort_values("missing weeks", ascending=False)
)

city,missing weeks
Paderborn,46
Berlin,45
Lübeck,20
Jena,19
Potsdam,15
Hanau,10
Bielefeld,2
Köln,1
Reutlingen,1
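
The query above leans on Elasticsearch and the notebook's helper wrappers. The counting itself is simple and can be sketched in plain Python: collect every week that has any data at all, then count per city how many of those weeks are absent. The input pairs below are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (week, city) pairs: one entry per week in which a city delivered data.
observations = [
    ("2020-03-23", "A"), ("2020-03-30", "A"), ("2020-04-06", "A"),
    ("2020-03-23", "B"), ("2020-04-06", "B"),
    ("2020-04-06", "C"),
]

def missing_weeks(observations):
    """Return {city: weeks without data}, only for cities missing at least one week."""
    all_weeks = {week for week, _ in observations}
    weeks_per_city = defaultdict(set)
    for week, city in observations:
        weeks_per_city[city].add(week)
    counts = {
        city: len(all_weeks - weeks)
        for city, weeks in weeks_per_city.items()
    }
    # keep only incomplete cities, most missing weeks first
    return dict(sorted(
        ((city, n) for city, n in counts.items() if n > 0),
        key=lambda item: item[1],
        reverse=True,
    ))

result = missing_weeks(observations)
print(result)
```

A week in which nobody delivered anything does not show up in this sketch at all. Filtering for a count of zero instead gives the entries that were always present, which is the idea behind `valid_place_ids` further down.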


Let's see... **Berlin** only recently appeared on some website. The only source i found a year ago was `https://www.parkopedia.de/parken/berlin/`, and this is exactly one of those non-free services where you immediately recognize from the look of the website that they know exactly what scraping is and that they don't like it being done to themselves. So i did not do it, to keep the `parking-data` repo out of trouble.

**Paderborn** uses [this nice old-school website](https://www4.paderborn.de/aspparkinfo/) to disseminate their parking allocation and it's probably not functional most of the time. In fact it just seems to have started working again after a year ;)

In [118]:
# hide-code
(search().term("city_name", "Paderborn")
    .agg_date_histogram("date", calendar_interval="week", min_doc_count=2)
    #.metric_avg("free", field="num_free")
    .execute().df().dropna()
    .set_index("date")
    .set_axis(["records"], axis=1)
    .rename_axis(columns="Paderborn")
)
# df_meta.query("city_name == 'Paderborn'")["source_web_url"].values

date,records
2020-03-23,560
2020-03-30,536
2020-04-06,743
2020-04-27,256
2021-03-08,1649
2021-03-15,4467
2021-03-22,1495


**Lübeck** seems to have changed its URL last November. They were actually scraped with a text search inside their inline javascript but that does not seem to work anymore and this post seems to turn into a todo-list.

**Jena** is actually where i live and they just started a parking system last summer. I had to interrupt my vacation for a few days and come back to Jena, and added the new website to the parking scraper late at night. The next day i was on my way back to the coast while the website developers were changing a few css classes and deploying them before lunch break.

In [117]:
(search().term("city_name", "Jena")
    .range("timestamp", lt="2020-08-01")
    .agg_date_histogram("date", calendar_interval="hour", min_doc_count=2)
    #.metric_avg("free", field="num_free")
    .execute().df().dropna()
    .set_index("date")
    .set_axis(["records"], axis=1)
    .rename_axis(columns="Jena")
)

date,records
2020-07-27 22:00:00,7
2020-07-27 23:00:00,8
2020-07-28 00:00:00,9
2020-07-28 01:00:00,2
2020-07-28 02:00:00,5
2020-07-28 03:00:00,9
2020-07-28 04:00:00,31
2020-07-28 05:00:00,45
2020-07-28 06:00:00,52
2020-07-28 07:00:00,65


**Potsdam** also stopped working in December so i should revisit the scraper soon. Apart from whole cities, there are far more individual places that are missing large proportions of data.

In [122]:
# hide
df = (search()
    .agg_date_histogram("week", calendar_interval="week")
    .agg_terms("place",field="place_id", size=1000)
    .execute().df()
    .set_index(["week", "place"])
)

In [152]:
# hide
valid_place_ids = sorted(
    df.drop("week.doc_count", axis=1).unstack()
    #df_missing.columns = df_missing.columns.get_level_values(1)
    .isnull().astype(int).sum().to_frame()
    .set_axis(["missing weeks"], axis=1)   
    .query("`missing weeks` < 1")
    .index.get_level_values(1)
    #.sort_values("missing weeks", ascending=False)
)

df_valid_cities = (search().terms("place_id", valid_place_ids)
     .agg_terms("city", field="city_name", size=100)
     .execute().df()
     .set_index("city")
     .set_axis(["records"], axis=1)
)
print("valid places", len(valid_place_ids))
print("cities", df_valid_cities.shape[0])

valid places 347
cities 30


All in all, there are **347 parking lots** which provided data every week and they are in exactly **30** different cities:

In [153]:
df_valid_cities

city,records
Dresden,894872
Wiesbaden,781662
Düsseldorf,698917
Osnabrück,655463
Münster,582818
Mannheim,572990
Aachen,474196
Bremen,419271
Oldenburg,394727
Karlsruhe,378949


In [154]:
def plot_avg_free(search, interval: str = "week"):
    df = (search
        .agg_date_histogram(interval, calendar_interval=interval)
        .metric_avg("free", field="num_free")
        .execute().df()
    )
    # return df
    df.plot.line(interval, "free").show()

plot_avg_free(
    search().terms("place_id", valid_place_ids)
    #search().term("place_id", "parken-mannheim-N1-N2-Stadthaus-Parkhaus")
)
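
What `plot_avg_free` asks Elasticsearch for is a per-week mean of `num_free`. A minimal sketch of that bucketing in plain Python follows. It assumes that weeks start on Monday (which is what the bucket dates in the tables above suggest) and uses invented records.

```python
from collections import defaultdict
from datetime import date, datetime, timedelta

# Hypothetical (timestamp, num_free) records, invented for illustration.
records = [
    ("2020-03-23T08:00:00", 100),
    ("2020-03-25T18:30:00", 80),
    ("2020-03-30T09:15:00", 60),
]

def week_start(ts: str) -> date:
    """Return the Monday of the week the ISO timestamp falls into."""
    day = datetime.fromisoformat(ts).date()
    return day - timedelta(days=day.weekday())

def weekly_average(records):
    """Return {week start: mean of num_free} over all records of that week."""
    buckets = defaultdict(list)
    for ts, num_free in records:
        buckets[week_start(ts)].append(num_free)
    return {week: sum(values) / len(values) for week, values in sorted(buckets.items())}

averages = weekly_average(records)
for week, avg in averages.items():
    print(week, avg)
```

The sketch averages over records rather than over time, so a lot whose number changes often contributes more to the weekly mean than a quiet one.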

In [120]:
# hide-code
_ = (search().term("city_name", "Potsdam")
    .agg_date_histogram("date", calendar_interval="week", min_doc_count=2)
    #.metric_avg("free", field="num_free")
    .execute().df().dropna()
    .set_index("date")
    .set_axis(["records"], axis=1)
    .rename_axis(columns="Potsdam")
)
# df_meta.query("city_name == 'Potsdam'")["source_web_url"].values

In [8]:
cities = search().agg_terms(field="city_name", size=100).execute().to_dict()
# cities

#### average number of free parking spaces

In [86]:
average_free = search() \
    .agg_date_histogram("week", calendar_interval="week") \
    .metric_avg("avg free", field="num_free") \
    .execute().df(to_index=True)
#average_free.plot.line(average_free.index, "avg free")

In [87]:
_ = heatmap(
    search().range("percent_free", gte=0, lte=100),
    lambda a: a.agg_date_histogram("date", calendar_interval="week"),
    lambda a: a.agg_histogram("free", field="percent_free", interval=2.5),
    lambda a: a.metric_cardinality("places", field="place_id"),
    aspect=False, 
)

### city deviations

In [11]:
def plot_average_free_deviation(search, **plot_kwargs):
    num_free = search.copy() \
        .agg_date_histogram("week", calendar_interval="week") \
        .metric_avg("avg free", field="num_free") \
        .execute().df(to_index=True)
    #print(num_free)
    df = (average_free - average_free.mean()) - (num_free - num_free.mean())
    #print(df)
    fig = df.plot.line(df.index, "avg free", **plot_kwargs)
    fig.update_layout(margin={"t": 0, "b": 0, "l": 0, "r": 0})
    fig.show()

# plot_average_free_deviation(search().term("city_name", "Jena"))
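
Written out, the value that `plot_average_free_deviation` plots for a selection $s$ (for example one city) in week $w$ is the difference between two mean-centred series. The symbols are notation introduced here for illustration, not names from the code:

$$
d_s(w) = \bigl(a(w) - \langle a \rangle\bigr) - \bigl(f_s(w) - \langle f_s \rangle\bigr)
$$

Here $a(w)$ is the weekly average of `num_free` over all data (`average_free`), $f_s(w)$ is the same average restricted to the selection (`num_free` inside the function), and $\langle \cdot \rangle$ is the mean over all weeks. A positive $d_s(w)$ means the selection had fewer free spaces than usual, relative to the overall trend in that week.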

In [88]:
num_all_docs = sum(cities.values())
for city in cities.keys():
    continue  # skips everything below, so no per-city plots are rendered
    html_display(f"<h3>{city} ({round(cities[city] / num_all_docs * 100 * len(cities))}% average activity)</h3>")
    plot_average_free_deviation(search().term("city_name", city), height=300)
    #break

NameError: name 'cities' is not defined
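
The "% average activity" in the heading compares a city's record count with the count an average city would have: 100% means the city contributes exactly its equal share of all records. Here is a small sketch of that calculation with invented counts:

```python
# Hypothetical record counts per city, invented for illustration.
cities = {"A": 50, "B": 30, "C": 20}

def activity_percent(cities):
    """Return each city's record count as a percentage of the average city's count."""
    num_all_docs = sum(cities.values())
    return {
        city: round(count / num_all_docs * 100 * len(cities))
        for city, count in cities.items()
    }

activity = activity_percent(cities)
print(activity)
```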

In [89]:
num_all_docs = sum(cities.values())
for city in cities.keys():
    continue  # skips everything below, so no per-city plots are rendered
    html_display(f"<h3>{city} ({round(cities[city] / num_all_docs * 100 * len(cities))}% average activity)</h3>")
    df = search().term("city_name", city) \
        .agg_date_histogram("week", calendar_interval="week") \
        .metric_percentiles("free", field="num_free") \
        .execute().df(to_index=True)
        #.metric_avg("avg free", field="num_free") \
        #.metric_max("max free", field="num_free") \
    #df = num_free
    #print(df)
    #print(num_free)
    #df = (average_free - average_free.mean()) - (num_free - num_free.mean())
    #print(df)
    fig = df.plot.line(df.index, df.columns[1:], height=300)#, color_continuous_scale="Viridis")
    fig.update_layout(margin={"t": 0, "b": 0, "l": 0, "r": 0})
    fig.show()
    

NameError: name 'cities' is not defined