
refactor(parser-validation): refactor BO parser #6271

Merged
6 commits merged into master on Dec 21, 2023
Conversation

unitrium
Contributor

Description

Refactor the BO parser to use the new data structure. Add test coverage and minor improvements.

Double check

  • I have tested my parser changes locally with poetry run test_parser "zone_key"
  • I have run pnpx prettier --write . and poetry run format to format my changes.

@github-actions bot added labels: parser, python (Pull requests that update Python code), tests, zone config (Pull request or issue for zone configurations) — Dec 21, 2023
Contributor

PR Analysis

  • 🎯 Main theme: Refactoring of the BO parser
  • 📝 PR summary: This PR refactors the BO parser to use a new data structure. It also includes adding test coverage and minor improvements to the parser. The parser is now located in a new file, CNDC.py, and the old BO.py file has been deleted.
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: Yes
  • ⏱️ Estimated effort to review [1-5]: 3, because the PR involves a significant refactoring of the parser and addition of new tests. The changes are well-structured and clear, but require a good understanding of the parser's functionality to review effectively.
  • 🔒 Security concerns: No security concerns found

PR Feedback

  • 💡 General suggestions: The PR is well-structured and the changes are clear. It's good to see that tests have been added to verify the functionality of the parser. The code is clean and follows good practices. However, it would be beneficial to add more comments in the code to explain what each function does, especially for complex ones.

  • 🤖 Code feedback:
    relevant file: parsers/CNDC.py
    suggestion: Consider adding more comments to explain the functionality of each function, especially for complex ones. This will make the code easier to understand and maintain in the future. [medium]
    relevant line: def fetch_data(

    relevant file: parsers/CNDC.py
    suggestion: It would be beneficial to handle the case where the 'total' or any of the 'modes_extracted' is None in a more explicit way. Currently, the code just continues to the next iteration, but it might be helpful to log a warning or error message. [medium]
    relevant line: if total is None or None in modes_extracted:

    relevant file: parsers/CNDC.py
    suggestion: Consider using more descriptive variable names. For example, 'hour_row' could be renamed to something like 'hourly_data_row' to make it clearer what it represents. [medium]
    relevant line: for hour_row in raw_data:

    relevant file: parsers/test/test_CNDC.py
    suggestion: It would be beneficial to add more tests to cover edge cases and error scenarios. This will help ensure the robustness of the parser. [medium]
    relevant line: class TestCNDC(TestCase):
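The second suggestion above (logging instead of silently skipping incomplete rows) could be sketched as follows; `check_row` is a hypothetical helper name, while `logger`, `total`, and `modes_extracted` follow the names used in the PR:

```python
from logging import Logger, getLogger

logger: Logger = getLogger(__name__)


def check_row(total, modes_extracted, timestamp) -> bool:
    """Return True if the hourly row is usable; log a warning otherwise."""
    if total is None or None in modes_extracted:
        # Instead of a bare `continue`, record why the row was dropped.
        logger.warning(
            "Skipping %s: total=%s, modes=%s (incomplete data)",
            timestamp, total, modes_extracted,
        )
        return False
    return True
```

The calling loop would then read `if not check_row(...): continue`, preserving the existing skip behaviour while leaving a trace in the logs.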

How to use

Instructions

To invoke the PR-Agent, add a comment using one of the following commands:
/review: Request a review of your Pull Request.
/describe: Update the PR title and description based on the contents of the PR.
/improve [--extended]: Suggest code improvements. Extended mode provides a higher quality feedback.
/ask <QUESTION>: Ask a question about the PR.
/update_changelog: Update the changelog based on the PR's contents.
/add_docs: Generate docstring for new components introduced in the PR.
/generate_labels: Generate labels for the PR based on the PR's contents.
See the tools guide for more details.

To edit any configuration parameter from the configuration.toml, add --config_path=new_value.
For example: /review --pr_reviewer.extra_instructions="focus on the file: ..."
To list the possible configuration parameters, add a /config comment.

parsers/CNDC.py Outdated
Comment on lines 22 to 58
DATA_URL = "https://www.cndc.bo/gene/dat/gene.php?fechag={0}"
SOURCE = "cndc.bo"


def extract_xsrf_token(html):
    """Extracts XSRF token from the source code of the generation graph page."""
    return re.search(r'var ttoken = "([a-f0-9]+)";', html).group(1)


def get_timestamp(query_date: datetime, hour: int) -> datetime:
    return datetime(
        year=query_date.year,
        month=query_date.month,
        day=query_date.day,
        hour=hour,
        tzinfo=tz_bo,
    )


def fetch_data(
    session: Session | None = None, target_datetime: datetime | None = None
) -> tuple[list[dict], datetime]:
    if session is None:
        session = Session()

    if target_datetime is None:
        target_datetime = datetime.now()
    target_datetime = target_datetime.astimezone(tz_bo)
    # Define actual and previous day (for midnight data).
    formatted_dt = target_datetime.strftime("%Y-%m-%d")

    # XSRF token for the initial request
    xsrf_token = extract_xsrf_token(session.get(INDEX_URL).text)

    resp = session.get(
        DATA_URL.format(formatted_dt), headers={"x-csrf-token": xsrf_token}
    )
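For reference, the regex in `extract_xsrf_token` can be exercised against a minimal HTML snippet (the snippet and token value below are made up for illustration; note that the function raises `AttributeError` if the token is absent, because `re.search` returns `None`):

```python
import re


def extract_xsrf_token(html):
    """Extracts XSRF token from the source code of the generation graph page."""
    return re.search(r'var ttoken = "([a-f0-9]+)";', html).group(1)


# Hypothetical page fragment containing the token assignment.
sample = '<script>var ttoken = "0a1b2c3d";</script>'
```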
Contributor


Suggestion: Use f-string for formatting the DATA_URL instead of using the format method.

Suggested change (only the new lines shown; the replaced lines are identical to the excerpt above):
DATA_URL = "https://www.cndc.bo/gene/dat/gene.php?fechag={formatted_dt}"
...
resp = session.get(
    DATA_URL, headers={"x-csrf-token": xsrf_token}
)
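Note that the bot's suggestion as quoted would not actually interpolate: an f-string assigned to a module-level constant is evaluated once at definition time, before `formatted_dt` exists. A working variant keeps the constant as a plain base URL and applies the f-string at the call site; a sketch, where `build_data_url` is a hypothetical helper, not code from the PR:

```python
DATA_URL_BASE = "https://www.cndc.bo/gene/dat/gene.php"


def build_data_url(formatted_dt: str) -> str:
    # The f-string is evaluated here, where formatted_dt is in scope.
    return f"{DATA_URL_BASE}?fechag={formatted_dt}"
```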

parsers/CNDC.py Outdated (resolved)
parsers/CNDC.py Outdated
Comment on lines 45 to 127
        session = Session()

    if target_datetime is None:
        target_datetime = datetime.now()
    target_datetime = target_datetime.astimezone(tz_bo)
    # Define actual and previous day (for midnight data).
    formatted_dt = target_datetime.strftime("%Y-%m-%d")

    # XSRF token for the initial request
    xsrf_token = extract_xsrf_token(session.get(INDEX_URL).text)

    resp = session.get(
        DATA_URL.format(formatted_dt), headers={"x-csrf-token": xsrf_token}
    )

    hour_rows = json.loads(resp.text.replace("", ""))["data"]
    return hour_rows, target_datetime


def parse_generation_forecast(
    zone_key: ZoneKey, date: datetime, raw_data: list[dict], logger: Logger
) -> TotalProductionList:
    result = TotalProductionList(logger)
    assert date.tzinfo == tz_bo
    for hour_row in raw_data:
        [hour, forecast, total, thermo, hydro, wind, solar, bagasse] = hour_row

        # "hour" is one-indexed
        timestamp = get_timestamp(query_date=date, hour=hour - 1)

        result.append(
            zoneKey=zone_key,
            datetime=timestamp,
            value=forecast,
            source=SOURCE,
            sourceType=EventSourceType.forecasted,
        )

    return result


def parser_production_breakdown(
    zone_key: ZoneKey, date: datetime, raw_data: list[dict], logger: Logger
) -> ProductionBreakdownList:
    result = ProductionBreakdownList(logger)
    assert date.tzinfo == tz_bo
    for hour_row in raw_data:
        [hour, forecast, total, thermo, hydro, wind, solar, bagasse] = hour_row

        # "hour" is one-indexed
        timestamp = get_timestamp(query_date=date, hour=hour - 1)
        modes_extracted = [hydro, solar, wind, bagasse]

        if total is None or None in modes_extracted:
            continue

        result.append(
            zoneKey=zone_key,
            datetime=timestamp,
            production=ProductionMix(
                hydro=hydro,
                solar=solar,
                wind=wind,
                biomass=bagasse,
                # NOTE: thermo includes gas + oil mixed, so we set these as unknown for now
                # The modes here should match the ones we extract in the production payload
                unknown=total - (hydro + solar + wind + bagasse),
            ),
            source=SOURCE,
        )

    return result


def fetch_production(
    zone_key: ZoneKey = ZoneKey("BO"),
    session: Session | None = None,
    target_datetime: datetime | None = None,
    logger: Logger = getLogger(__name__),
) -> list:
    """Requests the last known production mix (in MW) of a given country."""
    production = ProductionBreakdownList(logger)
    raw_data, query_date = fetch_data(session=session, target_datetime=target_datetime)
Contributor


Suggestion: Use a context manager to handle the session object. This ensures that the session is properly closed after use.

Suggested change (only the new lines shown; the replaced lines are identical to the excerpt above):
    with Session() as session:
        raw_data, query_date = fetch_data(session=session, target_datetime=target_datetime)
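There is a subtlety in this suggestion worth flagging: `fetch_data` also accepts a caller-provided session, and an unconditional `with Session() as session:` would close a session the caller may want to reuse. A dependency-free sketch of the safer pattern, owning only sessions we create ourselves (`DummySession` and `fetch_with_session` are illustrative stand-ins, not code from the PR):

```python
from contextlib import nullcontext


class DummySession:
    """Stand-in for requests.Session, to keep the sketch dependency-free."""

    def __init__(self):
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.closed = True


def fetch_with_session(session=None):
    # Own (and therefore close) the session only if we created it here;
    # nullcontext leaves a caller-provided session untouched on exit.
    cm = nullcontext(session) if session is not None else DummySession()
    with cm as s:
        return s  # placeholder for the actual HTTP calls
```

With `requests`, the same shape applies: wrap a freshly created `Session` in the `with` block, but pass a caller-provided one through unmanaged.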

Comment on lines +41 to +60
def fetch_data(
    session: Session | None = None, target_datetime: datetime | None = None
) -> tuple[list[dict], datetime]:
    if session is None:
        session = Session()

    if target_datetime is None:
        target_datetime = datetime.now()
    target_datetime = target_datetime.astimezone(tz_bo)
    # Define actual and previous day (for midnight data).
    formatted_dt = target_datetime.strftime("%Y-%m-%d")

    # XSRF token for the initial request
    xsrf_token = extract_xsrf_token(session.get(INDEX_URL).text)

    resp = session.get(
        DATA_URL.format(formatted_dt), headers={"x-csrf-token": xsrf_token}
    )

    hour_rows = json.loads(resp.text.replace("", ""))["data"]
Contributor


Suggestion: Use a more descriptive function name than 'fetch_data'. The function name should reflect what kind of data it fetches.

Suggested change (only the new lines shown; the replaced lines are identical to the excerpt above):
def fetch_hourly_data(
    session: Session | None = None, target_datetime: datetime | None = None
) -> tuple[list[dict], datetime]:
    ...

Member

@VIKTORVAV99 left a comment


Everything works as expected but I have a few comments before approving.

parsers/CNDC.py Outdated
        [hour, forecast, total, thermo, hydro, wind, solar, bagasse] = hour_row

        # "hour" is one-indexed
        timestamp = get_timestamp(query_date=date, hour=hour - 1)
Member


Not sure if you changed this part, but I would prefer it to be called something like get_datetime instead, as it actually returns a datetime.

Comment on lines +109 to +112
                # NOTE: thermo includes gas + oil mixed, so we set these as unknown for now
                # The modes here should match the ones we extract in the production payload
                unknown=total - (hydro + solar + wind + bagasse),
            ),
Member


Why are we not assigning thermo directly to unknown?
Or is there additional missing production?

Contributor Author


I think there might be additional unknowns we want to account for. This was already in the original parser.
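The distinction the two comments above are drawing can be made concrete: `unknown = thermo` captures only thermal generation, while `unknown = total - (hydro + solar + wind + bagasse)` also absorbs any production that the feed reports in the total column but not under a named mode. A toy illustration with made-up integer values (`unknown_from_total` is a hypothetical helper, not code from the PR):

```python
def unknown_from_total(total, hydro, solar, wind, bagasse):
    # Everything not attributed to a named renewable mode,
    # i.e. thermo plus any otherwise-unreported production.
    return total - (hydro + solar + wind + bagasse)


# Made-up hour where total exceeds thermo + named renewables by 5 MW,
# e.g. a small plant reported only in the total column.
total, thermo = 1000, 700
hydro, solar, wind, bagasse = 200, 50, 30, 15
```

Here `unknown_from_total(...)` yields `thermo + 5`, so the subtraction keeps those 5 MW rather than dropping them, which matches the author's reasoning above.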

        result.append(
            zoneKey=zone_key,
            datetime=timestamp,
            production=ProductionMix(
Member

@VIKTORVAV99 Dec 21, 2023


By not using the add value function we introduce floating point errors like this: 'unknown': 777.4199999999998.
We could possibly change how the ProductionMix class works to avoid this, or use the add_value function everywhere. Either way, I think we should avoid introducing these.
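The floating-point artifact mentioned here is the usual binary-representation error: most decimal fractions have no exact binary form, so sums and differences pick up trailing digits. A minimal illustration, with rounding shown as one possible mitigation (a sketch only; the PR discussion proposes handling this inside the Mix class instead):

```python
# Classic binary floating-point artifact: the sum is not exactly 0.3.
raw = 0.1 + 0.2  # 0.30000000000000004


def clean(value: float, ndigits: int = 6) -> float:
    """Round to a fixed precision before storing, hiding representation noise."""
    return round(value, ndigits)
```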

Contributor Author


I think I would rather update the __setattr__ method of the Mix class so that it's coherent with the add method. I will do a follow-up PR to address this.

@unitrium enabled auto-merge (squash) December 21, 2023 13:49
@unitrium merged commit 77863b2 into master Dec 21, 2023
11 checks passed
@unitrium deleted the robin/refactor-BO branch December 21, 2023 13:50