# Power scraper
The power scraper's goal is to scrape hourly power production averages from [ENTSO-E](https://en.wikipedia.org/wiki/European_Network_of_Transmission_System_Operators_for_Electricity)'s [API](https://transparency.entsoe.eu/content/static_content/Static%20content/web%20api/Guide.html) and put them into a pandas dataframe for easier analysis later on.

From [this PDF (page 13/16)](https://transparency.entsoe.eu/content/static_content/download?path=/Static%20content/web%20api/RestfulAPI_IG.pdf) I've learned that we are looking to scrape a _Document Type_ named _A73_ and described as _Actual generation output per generation unit_. This document type is coupled with a _Process Type_ named _A16_ and described as _Realised_, which I believe indicates that the data is actual generation as opposed to, for example, forecasted generation.
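As a rough illustration of how these two codes end up in a request, the sketch below builds a query URL by hand. The endpoint URL, the helper name `build_generation_per_unit_url`, and the example domain code and time stamps are assumptions for illustration; the parameter names follow my reading of the ENTSO-E web guide and should be checked against it. In this notebook the `entsoe` library builds the requests for us.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the ENTSO-E web guide for the current URL.
API_ENDPOINT = "https://transparency.entsoe.eu/api"


def build_generation_per_unit_url(token, in_domain, period_start, period_end):
    """Build a query URL for document type A73 with process type A16 (hypothetical helper)."""
    params = {
        "securityToken": token,
        "documentType": "A73",   # Actual generation output per generation unit
        "processType": "A16",    # Realised
        "in_Domain": in_domain,  # EIC code of the control area
        "periodStart": period_start,  # yyyyMMddHHmm
        "periodEnd": period_end,      # yyyyMMddHHmm
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Example values only: a placeholder token and an example EIC domain code.
example_url = build_generation_per_unit_url(
    "your-api-token-here", "10YBE----------2", "201501010000", "201501020000"
)
print(example_url)
```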

### References
- [Transparency platform - Actual Generation Per Generation Unit](https://transparency.entsoe.eu/generation/r2/actualGenerationPerGenerationUnit/show)
- [Electrical domains documented](https://www.entsoe.eu/fileadmin/user_upload/edi/library/downloads/Market_Areas_v2.0.pdf)
- [ENTSO-E API implementation guide](https://transparency.entsoe.eu/content/static_content/download?path=/Static%20content/web%20api/RestfulAPI_IG.pdf)
- [ENTSO-E API web guide](https://transparency.entsoe.eu/content/static_content/Static%20content/web%20api/Guide.html)
- [EnergieID/entsoe-py](https://github.com/EnergieID/entsoe-py): A python client for working against the ENTSO-E API
- [ElectricityMap's use of ENTSO-E's API](https://github.com/tmrowco/electricitymap-contrib/blob/master/parsers/ENTSOE.py)

### Setup to run scraper
In order for you to be able to use this code, you need an API Token for accessing the ENTSO-E API. For instructions on how to get one, see [their documentation about it](https://transparency.entsoe.eu/content/static_content/Static%20content/web%20api/Guide.html#_authentication_and_authorisation).

This is what you need to do summarized:

1. Create an account on the [Transparency Platform (TP)](https://transparency.entsoe.eu/).
2. Send an email to transparency@entsoe.eu with `Restful API access` in the subject line. Indicate the email address you entered during registration in the email body.
3. Await a response and inspect [your account settings](https://transparency.entsoe.eu/usrm/user/myAccountSettings) where a _Web Api Security Token_ should now be available.
4. Write your API token to the `.env` file, or manually set the environment variable `API_TOKEN`, for use by the scripts below. The file should contain one line of text formatted like this:

   ```
   API_TOKEN=your-api-token-here
   ```

In [None]:
# Loads API_TOKEN from the .env file, which is .gitignore'd
from dotenv import load_dotenv
load_dotenv()
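
Since a missing token only shows up later as a `KeyError` on `os.environ["API_TOKEN"]`, it can help to fail early with a clearer message. A minimal sketch, where `require_env` is a hypothetical helper and not part of `python-dotenv`:

```python
import os


def require_env(name, environ=os.environ):
    """Return the value of an environment variable, or raise with a helpful message."""
    value = environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set. Add it to the .env file or export it in your shell."
        )
    return value


# Demonstrated with a plain dict standing in for os.environ.
print(require_env("API_TOKEN", {"API_TOKEN": "your-api-token-here"}))
```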

## Code strategy
We rely heavily on the `EntsoePandasClient` of the `entsoe` Python library, and its [query_generation_per_plant function](https://github.com/EnergieID/entsoe-py/blob/e620357ad6ea0ddd217d4cff61eed18e8461f584/entsoe/entsoe.py#L905). We call it for each country, using a time interval that runs from the start of the API's data until now.

Caching web requests, saving completed country downloads, and progress bars are the glue that takes us all the way.

### About web request caching
We should avoid making unnecessary web requests to the API, both to save time and to avoid getting rate limited or banned. To do this we can cache as many of the requests we make as possible, as the data isn't supposed to change anyhow.

Consider us using the `query_generation_per_plant` function and passing a timespan of a week. I've learned it will make seven requests to the ENTSO-E API, one per day, because of a restriction in their API. If we now specify overlapping days in a later request, those days will be served from the cache and we will only hit the actual API for the days that had not already been requested. All of this is done by importing [requests-cache](https://github.com/reclosedev/requests-cache) and saying we want to use it.

In [2]:
# use a cache so we don't ask the same thing twice
# this is what is stored in the entso-e.sqlite database file
import requests_cache
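
To make the overlapping-days argument concrete, here is a small standard-library sketch of the idea. It is not how `requests-cache` works internally (that library caches whole HTTP responses); `DayCache` and its `fetch_day` callback are hypothetical names used only to count how many "requests" two overlapping weeks would trigger.

```python
from datetime import date, timedelta


class DayCache:
    """Cache one result per day and count how often the real fetch is called."""

    def __init__(self, fetch_day):
        self._fetch_day = fetch_day
        self._store = {}
        self.misses = 0

    def get_range(self, start, end):
        """Return one result per day in the half-open interval [start, end)."""
        results = []
        day = start
        while day < end:
            if day not in self._store:
                # Only days we have not seen before reach the real fetch
                self._store[day] = self._fetch_day(day)
                self.misses += 1
            results.append(self._store[day])
            day += timedelta(days=1)
        return results


cache = DayCache(fetch_day=lambda day: f"data for {day.isoformat()}")
cache.get_range(date(2015, 1, 1), date(2015, 1, 8))   # seven new days
cache.get_range(date(2015, 1, 5), date(2015, 1, 12))  # three cached days, four new
print(cache.misses)
```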

### Progress bars for sanity
When working with long running jobs, having a progress bar is very useful. The tqdm library provides such functionality.

In [3]:
# progress bars beauty
# https://github.com/tqdm/tqdm#ipythonjupyter-integration
from tqdm.auto import tqdm

### Modifications to entsoe library
I made some modifications so that scraping is not aborted if an exception about no available data is thrown for an individual day. These changes are part of a fork at https://github.com/consideRatio/entsoe-py.
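
The general idea can be sketched as follows. This is not the code in the fork: `NoDataForDay`, `collect_days` and `fake_fetch` are hypothetical names standing in for the library's error and its per-day request.

```python
class NoDataForDay(Exception):
    """Hypothetical stand-in for a 'no data available' error for a single day."""


def collect_days(days, fetch_day):
    """Fetch every day, skipping the days that report no data instead of aborting."""
    results, skipped = [], []
    for day in days:
        try:
            results.append(fetch_day(day))
        except NoDataForDay:
            skipped.append(day)
    return results, skipped


def fake_fetch(day):
    # Pretend the API has nothing for 2015-01-02
    if day == "2015-01-02":
        raise NoDataForDay(day)
    return f"data for {day}"


results, skipped = collect_days(["2015-01-01", "2015-01-02", "2015-01-03"], fake_fetch)
print(results, skipped)
```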

### Code improvements
- Upload to google cloud storage and stop working with local dataset files
- Try fixing names with åäö etc
- Convert dataframes to excel files
- Find out about the DOMAIN_MAPPING issues


## Scraping time estimates
It took 130 seconds to get data about Sweden for a month, or about five seconds per day. The dates to investigate range from 2015-01-05 onwards, which is about five years, which is about 60 months. 60 months * 130 seconds / month = 130 minutes for one country. With 42 different country entries we can choose from in the `entsoe.mappings.DOMAIN_MAPPINGS`, we have something that needs to run for about 130 minutes * 42 ~= 3.8 days.
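
The same back-of-the-envelope estimate, written out so the numbers are easy to adjust. The inputs are the ones stated above; the variable names are just for illustration.

```python
seconds_per_month = 130  # measured for Sweden
months = 60              # about five years of data
country_entries = 42     # entries to choose from in DOMAIN_MAPPINGS

minutes_per_country = seconds_per_month * months / 60
total_days = minutes_per_country * country_entries / (60 * 24)

print(f"{minutes_per_country:.0f} minutes per country")  # 130 minutes per country
print(f"{total_days:.1f} days in total")                 # 3.8 days in total
```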

# Scraper

In [4]:
from entsoe import EntsoePandasClient
import entsoe.misc

# To help us not clutter the output of the cell
from IPython.utils import io

import os
import pandas as pd

In [5]:
def scrape_control_area_data(country_code, cta, start, end):
    """
    Scrapes a pandas dataframe containing the production units of
    one control area (CTA) of the specified country and how much
    energy each unit produced, at an hourly resolution. Returns the
    previously saved dataframe if the CTA was already downloaded.
    """
    # Return previously downloaded dataframes
    dataframe_file = f"dataframes/{cta['abbrev'].lower()}.pickle"
    if os.path.isfile(dataframe_file):
        tqdm(desc=country_code, total=0, bar_format="Already downloaded");
        return pd.read_pickle(dataframe_file)
    
    # Use a cache for this country
    cache_file = f"caches/{country_code.lower()}"
    requests_cache.install_cache(cache_file)
    
    # This will do the heavy lifting
    client = EntsoePandasClient(
        api_key=os.environ["API_TOKEN"],
        session=None,
        retry_count=3,
        retry_delay=0,
        proxies=None,
    )
    
    # Create monthly blocks so we can have a nice progress bar that
    # increments with each month. The requests will still be made
    # on a daily basis by the entsoe library, as required by the
    # ENTSO-E API.
    month_intervals = list(entsoe.misc.month_blocks(start, end))
    
    # Start scraping using the entsoe library acting as a helper
    # python API to communicate with the ENTSO-E API.
    dfs = []
    for month_start_datetime, month_end_datetime in tqdm(month_intervals, desc=cta["abbrev"]):
        with io.capture_output() as captured_output:
            print(f"Scraping month: {month_start_datetime.year}-{month_start_datetime.month:02d}")
            dfs.append(
                client.query_generation_per_plant(
                    country_code,
                    start=month_start_datetime,
                    end=month_end_datetime,
                    eic=cta["eic"],
                )
            )
    
    # Combine the scraped monthly dataframes into one large
    df = pd.concat(dfs, sort=True)
    
    # Save the combined dataframe to disk as a pickle file
    # NOTE: the pickle file for Sweden was 19 MB, while the
    # request cache was 335 MB.
    df.to_pickle(dataframe_file)
    
    # Make sure we don't leave this file open
    requests_cache.uninstall_cache()
    
    # return the dataframe
    return df
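
To illustrate what splitting the interval into monthly blocks means, here is a standard-library sketch. It is an illustrative stand-in, not the implementation of `entsoe.misc.month_blocks`, and `month_blocks_sketch` is a hypothetical name; it works on plain dates rather than timezone-aware timestamps.

```python
from datetime import date


def month_blocks_sketch(start, end):
    """Split [start, end) into consecutive blocks that each end at a month boundary."""
    blocks = []
    block_start = start
    while block_start < end:
        # First day of the month following block_start
        if block_start.month == 12:
            next_month = date(block_start.year + 1, 1, 1)
        else:
            next_month = date(block_start.year, block_start.month + 1, 1)
        block_end = min(next_month, end)
        blocks.append((block_start, block_end))
        block_start = block_end
    return blocks


for block in month_blocks_sketch(date(2015, 1, 1), date(2015, 3, 15)):
    print(block)
```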

In [6]:
import sys

def valid_cta(cta):
    api_data_start = pd.Timestamp(year=2015, month=1, day=1, tz='Europe/Brussels')
    api_data_end   = pd.Timestamp(year=2015, month=1, day=2, tz='Europe/Brussels')

    try:
        with io.capture_output() as captured_output:
            EntsoePandasClient(
                api_key=os.environ["API_TOKEN"],
                session=None,
                retry_count=3,
                retry_delay=0,
                proxies=None,
            ).query_generation_per_plant(
                country_code=cta["country"],
                start=api_data_start,
                end=api_data_end,
                eic=cta["eic"],
            )
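    except Exception as e:
        # Catch Exception rather than using a bare except, so that
        # KeyboardInterrupt still stops the run. Show the error class.
        display(type(e))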
        return False
    else:
        return True

In [7]:
country_names = {
    "AL": "Albania",
    "AT": "Austria",
    "BA": "Bosnia and Herz.",
    "BE": "Belgium",
    "BG": "Bulgaria",
    "BY": "Belarus",
    "CH": "Switzerland",
    "CY": "Cyprus",
    "CZ": "Czech Republic",
    "DE": "Germany",
    "DK": "Denmark",
    "EE": "Estonia",
    "ES": "Spain",
    "FI": "Finland",
    "FR": "France",
    "GE": "Georgia",
    "GR": "Greece",
    "HR": "Croatia",
    "HU": "Hungary",
    "IE": "Ireland",
    "IT": "Italy",
    "LT": "Lithuania",
    "LU": "Luxembourg",
    "LV": "Latvia",
    "MD": "Moldova",
    "ME": "Montenegro",
    "MK": "North Macedonia",
    "MT": "Malta",
    "NL": "Netherlands",
    "NO": "Norway",
    "PL": "Poland",
    "PT": "Portugal",
    "RO": "Romania",
    "RS": "Serbia",
    "RU": "Russia",
    "SE": "Sweden",
    "SI": "Slovenia",
    "SK": "Slovakia",
    "TR": "Turkey",
    "UA": "Ukraine",
    "UK": "United Kingdom",
}

In [8]:
# NOTE: the ENTSO-E API doesn't have data before 2015
# NOTE: I think the API is pretty much hardcoded for use with the
# Europe/Brussels timezone. There could be daylight savings time
# matters that can mess with an hour of data or two during the year.
api_data_start = pd.Timestamp(year=2015, month=1,  day=1,  tz='Europe/Brussels')
api_data_end   = pd.Timestamp(year=2019, month=11, day=15, tz='Europe/Brussels')

# Known potential failures:
# - ChunkedEncodingError
# The solution strategy to avoid aborts while this is running
# unmonitored is to just let errors happen and keep trying again
# where the next attempt will be quicker due to caches.

import pickle
with open("country_control_areas.pickle", 'rb') as f:
    country_control_areas = pickle.load(f)

# Let's wait with Norway; it behaves weirdly and takes forever to finish.
country_control_areas.pop("NO")

while True:
    try:
        # Iterate over the 34 countries and their CTAs respectively, and get their data
        for country_code in tqdm(country_control_areas, desc="Countries"):
            control_areas = country_control_areas[country_code]
            for cta in control_areas:
                if valid_cta(cta):
                    scrape_control_area_data(country_code, cta, api_data_start, api_data_end)
                else:
                    tqdm(desc=country_code, total=1, bar_format="Invalid domain");
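    except Exception:
        # Start over; finished CTAs are read from disk and repeated
        # requests are served from the cache, so retries are quick.
        continue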
    else:
        break

Output (progress-bar widgets summarised):

- `Countries` progress bar (max=35).
- Bars with value=1 (`bar_style='info'`): AL, AT, BA, BE, BG, CH, CZ, DE (3), EE, FI, FR, GR, HU, IE, IT, LT, LV, MK, NL, PL, RO, RS, SE, SI, SK.
- Bars with value=0, each preceded by `requests.exceptions.HTTPError`: DE (1), DK (3), ES, HR, LU, ME, MT, PT, TR, UA (4).
- Bars with value=0, each preceded by `KeyError`: UK (2).
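
The `while True` loop above retries forever, which suits an unmonitored run but never gives up on a persistent error. A bounded variant could look like the sketch below; `run_with_retries` and `flaky_task` are hypothetical names, and this is one possible design rather than what the notebook does.

```python
import time


def run_with_retries(task, max_attempts=5, delay_seconds=0.0):
    """Call task() until it succeeds, giving up after max_attempts failures."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:
            last_error = error
            time.sleep(delay_seconds)
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error


calls = {"count": 0}


def flaky_task():
    # Fails on the first two calls, then succeeds
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("simulated network hiccup")
    return "done"


outcome = run_with_retries(flaky_task)
print(outcome, calls["count"])
```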


