
refactor(parser-validation): refactor BO parser #6271

Merged
6 commits merged into master on Dec 21, 2023
Conversation

unitrium
Contributor

Description

Refactor the BO parser to use the new data structure. Add test coverage and minor improvements.

Double check

  • I have tested my parser changes locally with poetry run test_parser "zone_key"
  • I have run pnpx prettier --write . and poetry run format to format my changes.

@github-actions bot added labels: parser, python (Pull requests that update Python code), tests, zone config (Pull request or issue for zone configurations) — Dec 21, 2023
Contributor

PR Analysis

  • 🎯 Main theme: Refactoring of the BO parser
  • 📝 PR summary: This PR refactors the BO parser to use a new data structure. It also includes adding test coverage and minor improvements to the parser. The parser is now located in a new file, CNDC.py, and the old BO.py file has been deleted.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • ⏱️ Estimated effort to review [1-5]: 3, because the PR involves a significant refactoring of the parser and addition of new tests. The changes are well-structured and clear, but require a good understanding of the parser's functionality to review effectively.
  • 🔒 Security concerns: No security concerns found

PR Feedback

  • 💡 General suggestions: The PR is well-structured and the changes are clear. It's good to see that tests have been added to verify the functionality of the parser. The code is clean and follows good practices. However, it would be beneficial to add more comments in the code to explain what each function does, especially for complex ones.

  • 🤖 Code feedback:
    relevant file: parsers/CNDC.py
    suggestion: Consider adding more comments to explain the functionality of each function, especially for complex ones. This will make the code easier to understand and maintain in the future. [medium]
    relevant line: def fetch_data(

    relevant file: parsers/CNDC.py
    suggestion: It would be beneficial to handle the case where the 'total' or any of the 'modes_extracted' is None in a more explicit way. Currently, the code just continues to the next iteration, but it might be helpful to log a warning or error message. [medium]
    relevant line: if total is None or None in modes_extracted:

    relevant file: parsers/CNDC.py
    suggestion: Consider using more descriptive variable names. For example, 'hour_row' could be renamed to something like 'hourly_data_row' to make it clearer what it represents. [medium]
    relevant line: for hour_row in raw_data:

    relevant file: parsers/test/test_CNDC.py
    suggestion: It would be beneficial to add more tests to cover edge cases and error scenarios. This will help ensure the robustness of the parser. [medium]
    relevant line: class TestCNDC(TestCase):
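The second suggestion above (logging instead of silently skipping incomplete rows) could be sketched as follows; `check_row` is a hypothetical helper name, while `logger`, `total`, and `modes_extracted` follow the names used in the PR:

```python
from logging import Logger, getLogger

logger: Logger = getLogger(__name__)


def check_row(total, modes_extracted, timestamp) -> bool:
    """Return True if the hourly row is usable; log a warning otherwise."""
    if total is None or None in modes_extracted:
        # Instead of a bare `continue`, record why the row was dropped.
        logger.warning(
            "Skipping %s: total=%s, modes=%s (incomplete data)",
            timestamp, total, modes_extracted,
        )
        return False
    return True
```

The calling loop would then read `if not check_row(...): continue`, preserving the existing skip behaviour while leaving a trace in the logs.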

How to use

Instructions

To invoke the PR-Agent, add a comment using one of the following commands:
/review: Request a review of your Pull Request.
/describe: Update the PR title and description based on the contents of the PR.
/improve [--extended]: Suggest code improvements. Extended mode provides a higher quality feedback.
/ask <QUESTION>: Ask a question about the PR.
/update_changelog: Update the changelog based on the PR's contents.
/add_docs: Generate docstring for new components introduced in the PR.
/generate_labels: Generate labels for the PR based on the PR's contents.
See the tools guide for more details.

To edit any configuration parameter from the configuration.toml, add --config_path=new_value.
For example: /review --pr_reviewer.extra_instructions="focus on the file: ..."
To list the possible configuration parameters, add a /config comment.

parsers/CNDC.py Outdated
Comment on lines 22 to 58
DATA_URL = "https://www.cndc.bo/gene/dat/gene.php?fechag={0}"
SOURCE = "cndc.bo"


def extract_xsrf_token(html):
    """Extracts XSRF token from the source code of the generation graph page."""
    return re.search(r'var ttoken = "([a-f0-9]+)";', html).group(1)


def get_timestamp(query_date: datetime, hour: int) -> datetime:
    return datetime(
        year=query_date.year,
        month=query_date.month,
        day=query_date.day,
        hour=hour,
        tzinfo=tz_bo,
    )


def fetch_data(
    session: Session | None = None, target_datetime: datetime | None = None
) -> tuple[list[dict], datetime]:
    if session is None:
        session = Session()

    if target_datetime is None:
        target_datetime = datetime.now()
    target_datetime = target_datetime.astimezone(tz_bo)
    # Define actual and previous day (for midnight data).
    formatted_dt = target_datetime.strftime("%Y-%m-%d")

    # XSRF token for the initial request
    xsrf_token = extract_xsrf_token(session.get(INDEX_URL).text)

    resp = session.get(
        DATA_URL.format(formatted_dt), headers={"x-csrf-token": xsrf_token}
    )
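For reference, the regex in `extract_xsrf_token` can be exercised against a minimal HTML snippet (the snippet and token value below are made up for illustration; note that the function raises `AttributeError` if the token is absent, because `re.search` returns `None`):

```python
import re


def extract_xsrf_token(html):
    """Extracts XSRF token from the source code of the generation graph page."""
    return re.search(r'var ttoken = "([a-f0-9]+)";', html).group(1)


# Hypothetical page fragment containing the token assignment.
sample = '<script>var ttoken = "0a1b2c3d";</script>'
```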
Contributor


Suggestion: Use f-string for formatting the DATA_URL instead of using the format method.

Suggested change (only the new lines shown; the replaced lines are identical to the excerpt above):
DATA_URL = "https://www.cndc.bo/gene/dat/gene.php?fechag={formatted_dt}"
...
resp = session.get(
    DATA_URL, headers={"x-csrf-token": xsrf_token}
)
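Note that the bot's suggestion as quoted would not actually interpolate: an f-string assigned to a module-level constant is evaluated once at definition time, before `formatted_dt` exists. A working variant keeps the constant as a plain base URL and applies the f-string at the call site; a sketch, where `build_data_url` is a hypothetical helper, not code from the PR:

```python
DATA_URL_BASE = "https://www.cndc.bo/gene/dat/gene.php"


def build_data_url(formatted_dt: str) -> str:
    # The f-string is evaluated here, where formatted_dt is in scope.
    return f"{DATA_URL_BASE}?fechag={formatted_dt}"
```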

parsers/CNDC.py Outdated (resolved)
parsers/CNDC.py Outdated
Comment on lines 45 to 127
        session = Session()

    if target_datetime is None:
        target_datetime = datetime.now()
    target_datetime = target_datetime.astimezone(tz_bo)
    # Define actual and previous day (for midnight data).
    formatted_dt = target_datetime.strftime("%Y-%m-%d")

    # XSRF token for the initial request
    xsrf_token = extract_xsrf_token(session.get(INDEX_URL).text)

    resp = session.get(
        DATA_URL.format(formatted_dt), headers={"x-csrf-token": xsrf_token}
    )

    hour_rows = json.loads(resp.text.replace("", ""))["data"]
    return hour_rows, target_datetime


def parse_generation_forecast(
    zone_key: ZoneKey, date: datetime, raw_data: list[dict], logger: Logger
) -> TotalProductionList:
    result = TotalProductionList(logger)
    assert date.tzinfo == tz_bo
    for hour_row in raw_data:
        [hour, forecast, total, thermo, hydro, wind, solar, bagasse] = hour_row

        # "hour" is one-indexed
        timestamp = get_timestamp(query_date=date, hour=hour - 1)

        result.append(
            zoneKey=zone_key,
            datetime=timestamp,
            value=forecast,
            source=SOURCE,
            sourceType=EventSourceType.forecasted,
        )

    return result


def parser_production_breakdown(
    zone_key: ZoneKey, date: datetime, raw_data: list[dict], logger: Logger
) -> ProductionBreakdownList:
    result = ProductionBreakdownList(logger)
    assert date.tzinfo == tz_bo
    for hour_row in raw_data:
        [hour, forecast, total, thermo, hydro, wind, solar, bagasse] = hour_row

        # "hour" is one-indexed
        timestamp = get_timestamp(query_date=date, hour=hour - 1)
        modes_extracted = [hydro, solar, wind, bagasse]

        if total is None or None in modes_extracted:
            continue

        result.append(
            zoneKey=zone_key,
            datetime=timestamp,
            production=ProductionMix(
                hydro=hydro,
                solar=solar,
                wind=wind,
                biomass=bagasse,
                # NOTE: thermo includes gas + oil mixed, so we set these as unknown for now
                # The modes here should match the ones we extract in the production payload
                unknown=total - (hydro + solar + wind + bagasse),
            ),
            source=SOURCE,
        )

    return result


def fetch_production(
    zone_key: ZoneKey = ZoneKey("BO"),
    session: Session | None = None,
    target_datetime: datetime | None = None,
    logger: Logger = getLogger(__name__),
) -> list:
    """Requests the last known production mix (in MW) of a given country."""
    production = ProductionBreakdownList(logger)
    raw_data, query_date = fetch_data(session=session, target_datetime=target_datetime)
Contributor


Suggestion: Use a context manager to handle the session object. This ensures that the session is properly closed after use.

Suggested change (only the new lines shown; the replaced lines are identical to the excerpt above):
    with Session() as session:
        raw_data, query_date = fetch_data(session=session, target_datetime=target_datetime)
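There is a subtlety in this suggestion worth flagging: `fetch_data` also accepts a caller-provided session, and an unconditional `with Session() as session:` would close a session the caller may want to reuse. A dependency-free sketch of the safer pattern, owning only sessions we create ourselves (`DummySession` and `fetch_with_session` are illustrative stand-ins, not code from the PR):

```python
from contextlib import nullcontext


class DummySession:
    """Stand-in for requests.Session, to keep the sketch dependency-free."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.closed = True


def fetch_with_session(session=None):
    # Own (and therefore close) the session only if we created it here;
    # nullcontext leaves a caller-provided session untouched on exit.
    cm = nullcontext(session) if session is not None else DummySession()
    with cm as s:
        return s  # placeholder for the actual HTTP calls
```

With `requests`, the same shape applies: wrap a freshly created `Session` in the `with` block, but pass a caller-provided one through unmanaged.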

Comment on lines +41 to +60
def fetch_data(
    session: Session | None = None, target_datetime: datetime | None = None
) -> tuple[list[dict], datetime]:
    if session is None:
        session = Session()

    if target_datetime is None:
        target_datetime = datetime.now()
    target_datetime = target_datetime.astimezone(tz_bo)
    # Define actual and previous day (for midnight data).
    formatted_dt = target_datetime.strftime("%Y-%m-%d")

    # XSRF token for the initial request
    xsrf_token = extract_xsrf_token(session.get(INDEX_URL).text)

    resp = session.get(
        DATA_URL.format(formatted_dt), headers={"x-csrf-token": xsrf_token}
    )

    hour_rows = json.loads(resp.text.replace("", ""))["data"]
Contributor


Suggestion: Use a more descriptive function name than 'fetch_data'. The function name should reflect what kind of data it fetches.

Suggested change (only the new lines shown; the replaced lines are identical to the excerpt above):
def fetch_hourly_data(
    session: Session | None = None, target_datetime: datetime | None = None
) -> tuple[list[dict], datetime]:
    ...

Member

@VIKTORVAV99 left a comment


Everything works as expected but I have a few comments before approving.

parsers/CNDC.py Outdated
        [hour, forecast, total, thermo, hydro, wind, solar, bagasse] = hour_row

        # "hour" is one-indexed
        timestamp = get_timestamp(query_date=date, hour=hour - 1)
Member


Not sure if you changed this part, but I would prefer it to be called something like get_datetime instead, as it actually returns a datetime.

Comment on lines +109 to +112
                # NOTE: thermo includes gas + oil mixed, so we set these as unknown for now
                # The modes here should match the ones we extract in the production payload
                unknown=total - (hydro + solar + wind + bagasse),
            ),
Member


Why are we not assigning thermo directly to unknown?
Or is there additional missing production?

Contributor Author


I think there might be additional unknowns we want to account for. This was already in the original parser.
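The distinction the two comments above are drawing can be made concrete: `unknown = thermo` captures only thermal generation, while `unknown = total - (hydro + solar + wind + bagasse)` also absorbs any production that the feed reports in the total column but not under a named mode. A toy illustration with made-up integer values (`unknown_from_total` is a hypothetical helper, not code from the PR):

```python
def unknown_from_total(total, hydro, solar, wind, bagasse):
    # Everything not attributed to a named renewable mode,
    # i.e. thermo plus any otherwise-unreported production.
    return total - (hydro + solar + wind + bagasse)


# Made-up hour where total exceeds thermo + named renewables by 5 MW,
# e.g. a small plant reported only in the total column.
total, thermo = 1000, 700
hydro, solar, wind, bagasse = 200, 50, 30, 15
```

Here `unknown_from_total(...)` yields `thermo + 5`, so the subtraction keeps those 5 MW rather than dropping them, which matches the author's reasoning above.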

        result.append(
            zoneKey=zone_key,
            datetime=timestamp,
            production=ProductionMix(
Member

@VIKTORVAV99 Dec 21, 2023


By not using the add value function we introduce floating point errors like this: 'unknown': 777.4199999999998.
We could possibly change how the ProductionMix class works to avoid this, or use the add_value function everywhere. Either way, I think we should avoid introducing these.
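The floating-point artifact mentioned here is the usual binary-representation error: most decimal fractions have no exact binary form, so sums and differences pick up trailing digits. A minimal illustration, with rounding shown as one possible mitigation (a sketch only; the PR discussion proposes handling this inside the Mix class instead):

```python
# Classic binary floating-point artifact: the sum is not exactly 0.3.
raw = 0.1 + 0.2  # 0.30000000000000004


def clean(value: float, ndigits: int = 6) -> float:
    """Round to a fixed precision before storing, hiding representation noise."""
    return round(value, ndigits)
```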

Contributor Author


I think I would rather update the __setattr__ method of the Mix class so that it's coherent with the add method. I will do a follow-up PR to address this.

@unitrium enabled auto-merge (squash) December 21, 2023 13:49
@unitrium merged commit 77863b2 into master Dec 21, 2023
11 checks passed
@unitrium deleted the robin/refactor-BO branch December 21, 2023 13:50