### API Analysis

![alt text](image.png "Title")
This configuration shows the hourly day-ahead (price of energy until the same time tomorrow) for two weeks back.
When checking the network traffic for the above dates and for the hourly resolution, you will find three .json files being fetched from the API.

A request to the api has the following structure:
https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_[timestamp_in_milliseconds].json

The following request fetch data for the corresponding time frames.

https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1729461600000.json:
Sunday, 6 October 2024 22:00:00 -> Sunday, 13 October 2024 21:00:00

https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1728856800000.json:
Sunday, 13 October 2024 22:00:00 -> Sunday, 20 October 2024 21:00:00

https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1729461600000.json
Sunday, 20 October 2024 22:00:00 -> Sunday, 27 October 2024 22:00:00


You will find that for example the timestamp 1729461600000 maps to the initial date Sunday, 6 October 2024 22:00:00 and every file contains the date for one week. Interestingly enough the site only shows the data for two weeks even though it had to fetch the data for three entire weeks.

You will find that each .json file contains around 172 (more or less) time series entries.



### Implementing the scraper
We now want to implement a scraper that fetches the hourly energy prices for n amount of days. With the above information we now know that we'll have to find the corresponding timestamps for each week and to fetch the data.

In [103]:
import requests
import numpy as np
import logging
from datetime import datetime, timedelta, timezone
import pytz
import time
from pprint import pprint

In [104]:
logging.basicConfig(level=logging.INFO) 

logger = logging.getLogger("scraper_logger")

# console_handler = logging.StreamHandler()
file_handler = logging.FileHandler("app.log")

# console_handler.setLevel(logging.WARNING)
file_handler.setLevel(logging.WARNING) 

# logger.addHandler(console_handler)
logger.addHandler(file_handler)

In [105]:
from datetime import datetime, timedelta
import pytz

account_for_dst = True

# Define Berlin timezone
tz_berlin = pytz.timezone("Europe/Berlin")
tz_utc = pytz.timezone("UTC")
now = datetime.now(tz_utc)
days_since_monday = now.weekday()
last_monday_utc = now - timedelta(days=days_since_monday)
last_monday_utc = last_monday_utc.replace(hour=0, minute=0, second=0, microsecond=0)

# Ensure Berlin timezone is consistently set
last_monday_berlin = last_monday_utc.astimezone(tz_berlin)

# Convert last_monday to milliseconds since epoch
last_monday_utc_ms = int(last_monday_utc.timestamp() * 1000)

if account_for_dst:
    last_monday_utc_ms += (1000 * 60 * 60)

print(f"Last Monday UTC: {last_monday_utc}, In ms: {last_monday_utc_ms}\nLast Monday Berlin: {last_monday_berlin}")


Last Monday UTC: 2024-10-28 00:00:00+00:00, In ms: 1730073600000
Last Monday Berlin: 2024-10-28 01:00:00+01:00


In [106]:
# Define constants
week_in_ms = 24 * 60 * 60 * 1000 * 7
delay = 0.5  # seconds
n = 100  # number of weeks
base_url = "https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_{}.json"
energy_ts_data = []

def fetch(base_url, adjusted_timestamp, delay):
    # Prepare the URL with the adjusted timestamp
    url = base_url.format(adjusted_timestamp)

    response = requests.get(url)
    time.sleep(delay)
    response.raise_for_status()

    return response


for _ in range(n):

    last_monday_utc = datetime.fromtimestamp(last_monday_utc_ms / 1000, tz=timezone.utc)
    # last_monday_berlin = last_monday_utc.astimezone(tz_berlin)

    # Adjust timestamp for DST if necessary
    if last_monday_utc.dst() != timedelta(0):  # DST is in effect
        adjusted_timestamp = last_monday_utc_ms - 60 * 60 * 1000  # Subtract 1 hour in ms
    else:
        adjusted_timestamp = last_monday_utc_ms  # No DST adjustment needed

    try:
        response = fetch(base_url, adjusted_timestamp, delay)
    except requests.exceptions.HTTPError as http_err:
        logging.warning(f"Failed to scrape data for timestamp: {last_monday_utc} (UTC)\n\tError: {http_err}")

    logging.info(f"Successfully scraped data for ts: {last_monday_utc} (UTC)")

    try:
        json_data = response.json()
    except requests.exceptions.JSONDecodeError as decoder_error:
        logging.warning(f"Failed to deserialize JSON: \n\tError: {decoder_error}")
        continue

    parsed_json = dict(json_data)

    energy_ts_data_week = []
    for ts, price in parsed_json["series"]:
        try:
            price_float = float(price)
            ts_datetime = datetime.fromtimestamp(ts / 1000, tz=timezone.utc).__str__()
        except TypeError as e:
            # logging.warning(f"Failed to parse non-float value for timestamp {ts_datetime} (UTC)\n\tError: {e}")
            continue

        energy_ts_data_week.append((ts_datetime, price_float))
    energy_ts_data_week.extend(energy_ts_data)
    energy_ts_data = energy_ts_data_week
    
    # Move to the previous week
    last_monday_utc_ms -= week_in_ms

# Convert the list of tuples to a numpy array
data = np.array(energy_ts_data)

print(data.shape)


	Error: 404 Client Error: Not Found for url: https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1730070000000.json
INFO:root:Successfully scraped data for ts: 2024-10-28 00:00:00+00:00 (UTC)
	Error: 404 Client Error: Not Found for url: https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1729465200000.json
INFO:root:Successfully scraped data for ts: 2024-10-21 00:00:00+00:00 (UTC)
	Error: 404 Client Error: Not Found for url: https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1728860400000.json
INFO:root:Successfully scraped data for ts: 2024-10-14 00:00:00+00:00 (UTC)
	Error: 404 Client Error: Not Found for url: https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1728255600000.json
INFO:root:Successfully scraped data for ts: 2024-10-07 00:00:00+00:00 (UTC)
	Error: 404 Client Error: Not Found for url: https://www.smard.de/app/chart_data/4169/DE/4169_DE_hour_1727650800000.json
INFO:root:Successfully scraped data for ts: 2024-09-30 00:00:00+00:00 (UTC)
	Error: 404 Cli

(16798, 2)


In [108]:
np.set_printoptions(threshold=np.inf)
print(data)

[['2022-12-04 23:00:00+00:00' '205.16']
 ['2022-12-05 00:00:00+00:00' '195.89']
 ['2022-12-05 01:00:00+00:00' '183.22']
 ['2022-12-05 02:00:00+00:00' '181.68']
 ['2022-12-05 03:00:00+00:00' '163.66']
 ['2022-12-05 04:00:00+00:00' '250.61']
 ['2022-12-05 05:00:00+00:00' '292.0']
 ['2022-12-05 06:00:00+00:00' '368.99']
 ['2022-12-05 07:00:00+00:00' '406.33']
 ['2022-12-05 08:00:00+00:00' '400.49']
 ['2022-12-05 09:00:00+00:00' '399.65']
 ['2022-12-05 10:00:00+00:00' '421.93']
 ['2022-12-05 11:00:00+00:00' '404.91']
 ['2022-12-05 12:00:00+00:00' '392.96']
 ['2022-12-05 13:00:00+00:00' '413.46']
 ['2022-12-05 14:00:00+00:00' '429.91']
 ['2022-12-05 15:00:00+00:00' '430.73']
 ['2022-12-05 16:00:00+00:00' '444.98']
 ['2022-12-05 17:00:00+00:00' '433.15']
 ['2022-12-05 18:00:00+00:00' '427.92']
 ['2022-12-05 19:00:00+00:00' '392.3']
 ['2022-12-05 20:00:00+00:00' '333.11']
 ['2022-12-05 21:00:00+00:00' '324.56']
 ['2022-12-05 22:00:00+00:00' '292.62']
 ['2022-12-05 23:00:00+00:00' '298.1']
 ['