### API Analysis

![alt text](./images/image.png "Title")
This configuration shows the hourly day-ahead (price of energy until the same time tomorrow) for the last two weeks.
When checking the network traffic for the above dates and for the hourly resolution, you will find three .json files being fetched from the API.

A request to the api has the following structure:
https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_[timestamp_in_milliseconds].json

The following request fetch data for the corresponding time frames.

https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1729461600000.json:
Sunday, 6 October 2024 22:00:00 -> Sunday, 13 October 2024 21:00:00

https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1728856800000.json:
Sunday, 13 October 2024 22:00:00 -> Sunday, 20 October 2024 21:00:00

https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1729461600000.json
Sunday, 20 October 2024 22:00:00 -> Sunday, 27 October 2024 22:00:00


You will find that for example the timestamp 1729461600000 maps to the initial date Sunday, 6 October 2024 22:00:00 and every file contains the date for one week. Interestingly enough the site only shows the data for two weeks even though it had to fetch the data for three entire weeks. If the above links are broken, it may be due to a shift in daylight savings time (DST) which we will have to take into account.

Additionally you will see that each .json file contains around 172 (more or less) time series entries for an entire week.



### Implementing the scraper
We now want to implement a scraper that fetches the hourly energy prices for n amount of days. With the above information we now know that we'll have to find the corresponding timestamps for each week and to fetch the data.

In [65]:
import requests
import numpy as np
import logging
from datetime import datetime, timedelta, timezone
import pytz
import time
from pprint import pprint

In [66]:
logging.basicConfig(level=logging.INFO) 

logger = logging.getLogger("scraper_logger")

# console_handler = logging.StreamHandler()
file_handler = logging.FileHandler("app.log")

# console_handler.setLevel(logging.WARNING)
file_handler.setLevel(logging.WARNING) 

# logger.addHandler(console_handler)
logger.addHandler(file_handler)

In [67]:
from datetime import datetime, timedelta
import pytz

# Define Berlin timezone
tz_berlin = pytz.timezone("Europe/Berlin")

# Calculate last Monday in Berlin time, taking into account local DST
now = datetime.now(tz_berlin)
days_since_monday = now.weekday()
last_monday_berlin = now - timedelta(days=days_since_monday)
last_monday_berlin = last_monday_berlin.replace(hour=0, minute=0, second=0, microsecond=0)

# Convert Berlin time to UTC and get the timestamp in milliseconds
last_monday_utc = last_monday_berlin.astimezone(pytz.UTC)
last_monday_utc_ms = int(last_monday_utc.timestamp() * 1000)

print("Berlin time (local):", last_monday_berlin)
print("UTC time:", last_monday_utc)
print("UTC timestamp (ms):", last_monday_utc_ms)


Berlin time (local): 2024-10-28 00:00:00+01:00
UTC time: 2024-10-27 23:00:00+00:00
UTC timestamp (ms): 1730070000000


In [68]:
# Define constants
week_in_ms = 24 * 60 * 60 * 1000 * 7
delay = 0.5  # seconds
n = 100  # number of weeks
base_url = "https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_{}.json"
energy_ts_data = []

def fetch(base_url, adjusted_timestamp, delay):
    # Prepare the URL with the adjusted timestamp
    url = base_url.format(adjusted_timestamp)

    response = requests.get(url)
    time.sleep(delay)
    response.raise_for_status()

    return response


for k in range(n):
    # Adjust timestamp for DST if necessary
    
    last_monday_berlin = last_monday_utc.astimezone(tz_berlin)
    last_monday_utc = last_monday_berlin.astimezone(pytz.UTC)
    last_monday_utc_ms = int(last_monday_utc.timestamp() * 1000)
           
    if last_monday_berlin.dst() != timedelta(0):  # DST is in effect
        last_monday_utc_ms -= 60 * 60 * 1000
    
    try:
        response = fetch(base_url, last_monday_utc_ms, delay)
        logging.info(f"Successfully scraped data for ts: {last_monday_berlin} (Europe/Berlin)")
    except requests.exceptions.HTTPError as http_err:
        logging.warning(f"Failed to scrape data for timestamp: {last_monday_utc} (UTC)\n\tError: {http_err}")

    try:
        json_data = response.json()
    except requests.exceptions.JSONDecodeError as decoder_error:
        logging.warning(f"Failed to deserialize JSON: \n\tError: {decoder_error}")
        continue

    parsed_json = dict(json_data)

    energy_ts_data_week = []
    for ts, price in parsed_json["series"]:
        try:
            price_float = float(price)
            ts_datetime = datetime.fromtimestamp(ts / 1000, tz=timezone.utc).__str__()
        except TypeError as e:
            # logging.warning(f"Failed to parse non-float value for timestamp {ts_datetime} (UTC)\n\tError: {e}")
            continue

        energy_ts_data_week.append((ts_datetime, price_float))
    energy_ts_data_week.extend(energy_ts_data)
    energy_ts_data = energy_ts_data_week
    
    # Move to the previous week
    last_monday_utc = last_monday_utc - timedelta(weeks=1)

# Convert the list of tuples to a numpy array
data = np.array(energy_ts_data)[::-1]

print(data.shape)


INFO:root:Successfully scraped data for ts: 2024-10-28 00:00:00+01:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-10-21 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-10-14 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-10-07 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-09-30 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-09-23 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-09-16 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-09-09 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-09-02 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-08-26 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-08-19 01:00:00+02:00 (Europe/Berlin)
INFO:root:Successfully scraped data for ts: 2024-08-12

(16680, 2)


In [69]:
data = np.vstack((["date", "hourly day-ahead energy price"], data))
print(data)
np.savetxt("./day_ahead_energy_prices.csv", data, delimiter=",", fmt="%s")

[['date' 'hourly day-ahead energy price']
 ['2024-10-29 22:00:00+00:00' '103.09']
 ['2024-10-29 21:00:00+00:00' '118.08']
 ['2024-10-29 20:00:00+00:00' '124.05']
 ['2024-10-29 19:00:00+00:00' '140.6']
 ['2024-10-29 18:00:00+00:00' '203.82']
 ['2024-10-29 17:00:00+00:00' '257.44']
 ['2024-10-29 16:00:00+00:00' '285.8']
 ['2024-10-29 15:00:00+00:00' '212.72']
 ['2024-10-29 14:00:00+00:00' '172.26']
 ['2024-10-29 13:00:00+00:00' '135.94']
 ['2024-10-29 12:00:00+00:00' '117.73']
 ['2024-10-29 11:00:00+00:00' '105.45']
 ['2024-10-29 10:00:00+00:00' '112.97']
 ['2024-10-29 09:00:00+00:00' '120.7']
 ['2024-10-29 08:00:00+00:00' '129.69']
 ['2024-10-29 07:00:00+00:00' '151.32']
 ['2024-10-29 06:00:00+00:00' '157.73']
 ['2024-10-29 05:00:00+00:00' '127.38']
 ['2024-10-29 04:00:00+00:00' '113.1']
 ['2024-10-29 03:00:00+00:00' '100.02']
 ['2024-10-29 02:00:00+00:00' '98.17']
 ['2024-10-29 01:00:00+00:00' '98.83']
 ['2024-10-29 00:00:00+00:00' '100.04']
 ['2024-10-28 23:00:00+00:00' '101.34']
 ['2