METdataio, in METreformat address PerformanceWarning #219

Closed
8 of 23 tasks
bikegeek opened this issue Aug 2, 2023 · 3 comments
Assignees
Labels
alert: NEED ACCOUNT KEY Need to assign an account key to this issue component: METdataio priority: high High Priority reporting: DTC NOAA R2O NOAA Research to Operations DTC Project required: FOR OFFICIAL RELEASE Required to be completed in the official release for the assigned milestone type: bug Fix something that is not working

Comments

@bikegeek
Collaborator

bikegeek commented Aug 2, 2023

PerformanceWarning: DataFrame is highly fragmented. This is usually the result of calling frame.insert many times, which has poor performance. Consider joining all columns at once using pd.concat(axis=1) instead. To get a de-fragmented frame, use newframe = frame.copy()
cnt_df.insert(loc=0, column='Idx', value=idx)
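The warning text above suggests building all columns first and joining them once with pd.concat(axis=1). The following is a minimal sketch of that contrast, not the METreformat code: the function names, column names, and sizes are hypothetical, and whether the insert-based version actually emits the warning depends on the pandas version and on how many internal blocks the frame ends up with.

```python
import warnings

import numpy as np
import pandas as pd


def build_by_insert(n_cols: int, n_rows: int = 5) -> pd.DataFrame:
    """Add columns one at a time; repeated inserts can fragment the frame."""
    frame = pd.DataFrame(index=range(n_rows))
    for i in range(n_cols):
        frame.insert(loc=len(frame.columns), column=f"col_{i}", value=np.arange(n_rows))
    return frame


def build_by_concat(n_cols: int, n_rows: int = 5) -> pd.DataFrame:
    """Create every column first, then join them in a single pd.concat call."""
    columns = [pd.Series(np.arange(n_rows), name=f"col_{i}") for i in range(n_cols)]
    return pd.concat(columns, axis=1)


def count_performance_warnings(builder, n_cols: int) -> int:
    """Run a builder and count the pandas PerformanceWarnings it emits."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        builder(n_cols)
    return sum(issubclass(w.category, pd.errors.PerformanceWarning) for w in caught)


if __name__ == "__main__":
    # 150 columns is an arbitrary, hypothetical size chosen for illustration.
    print("insert:", count_performance_warnings(build_by_insert, 150))
    print("concat:", count_performance_warnings(build_by_concat, 150))
```

Both builders produce the same data; only the way the columns are assembled differs.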

Remove the requirement for the XML specification file used by METdbLoad. Encapsulate that information in the YAML configuration file and pass that into the METdbLoad module read_data_files.

Describe the Problem

When using the METreformatter on large data sets, the PerformanceWarning is observed.

Expected Behavior

Provide a clear and concise description of what you expected to happen here.

Environment

Describe your runtime environment:
1. Machine: (e.g. HPC name, Linux Workstation, Mac Laptop)
2. OS: (e.g. RedHat Linux, MacOS)
3. Software version number(s)
anything that uses pandas 2.0.3 or above
(previously could NOT reproduce when using pandas 1.5.2)

To Reproduce

Describe the steps to reproduce the behavior:
1. Go to '...'
2. Click on '....'
3. Scroll down to '....'
4. See error
Post relevant sample data following these instructions:
https://dtcenter.org/community-code/model-evaluation-tools-met/met-help-desk#ftp

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

Define the source of funding and account keys here or state NONE.

Define the Metadata

Assignee

  • Select engineer(s) or no engineer required
  • Select scientist(s) or no scientist required

Labels

  • Select component(s)
  • Select priority
  • Select requestor(s)

Projects and Milestone

  • Select Organization level Project for support of the current coordinated release
  • Select Repository level Project for development toward the next official release or add alert: NEED CYCLE ASSIGNMENT label
  • Select Milestone as the next bugfix version

Define Related Issue(s)

Consider the impact to the other METplus components.

Bugfix Checklist

See the METplus Workflow for details.

  • Complete the issue definition above, including the Time Estimate and Funding Source.
  • Fork this repository or create a branch of main_<Version>.
    Branch name: bugfix_<Issue Number>_main_<Version>_<Description>
  • Fix the bug and test your changes.
  • Add/update log messages for easier debugging.
  • Add/update unit tests.
  • Add/update documentation.
  • Add any new Python packages to the METplus Components Python Requirements table.
  • Push local changes to GitHub.
  • Submit a pull request to merge into main_<Version>.
    Pull request: bugfix <Issue Number> main_<Version> <Description>
  • Define the pull request metadata, as permissions allow.
    Select: Reviewer(s) and Development issues
    Select: Organization level software support Project for the current coordinated release
    Select: Milestone as the next bugfix version
  • Iterate until the reviewer(s) accept and merge your changes.
  • Delete your fork or branch.
  • Complete the steps above to fix the bug on the develop branch.
    Branch name: bugfix_<Issue Number>_develop_<Description>
    Pull request: bugfix <Issue Number> develop <Description>
    Select: Reviewer(s) and Development issues
    Select: Repository level development cycle Project for the next official release
    Select: Milestone as the next official version
  • Close this issue.
@bikegeek bikegeek added alert: NEED ACCOUNT KEY Need to assign an account key to this issue priority: high High Priority type: bug Fix something that is not working reporting: DTC NOAA R2O NOAA Research to Operations DTC Project required: FOR OFFICIAL RELEASE Required to be completed in the official release for the assigned milestone component: METdataio labels Aug 2, 2023
@bikegeek bikegeek added this to the METdataio-3.0.0 milestone Aug 2, 2023
@bikegeek bikegeek self-assigned this Aug 2, 2023
@bikegeek bikegeek changed the title METreformat: PerformanceWarning METdataio, in METreformat address PerformanceWarning Aug 2, 2023
bikegeek added a commit that referenced this issue Aug 3, 2023
@bikegeek
Collaborator Author

Using data provided by Will Mayfield, the PerformanceWarning message is not generated. Cannot reproduce this issue. Checking with Will Mayfield to see if this warning message is still observed.

@bikegeek
Collaborator Author

The PerformanceWarning only appears with later versions of pandas. Using the same data and reproducing the Python environment used by the RRFS team, the warning is observed with pandas 2.0.3. METplus analysis currently uses pandas 1.5.2 to be consistent with the NCO HPC hosts, so the warning is not seen there, but it will appear once the pandas version is upgraded.
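Because the warning depends on the installed pandas version, one way to catch it early is to promote it to an error while exercising the code under test. This is a rough sketch under that assumption; run_strict and add_index_column are hypothetical helpers for illustration and are not part of METdataio.

```python
import warnings

import pandas as pd


def run_strict(func, *args, **kwargs):
    """Call func, turning any pandas PerformanceWarning into an exception."""
    with warnings.catch_warnings():
        warnings.simplefilter("error", category=pd.errors.PerformanceWarning)
        return func(*args, **kwargs)


def add_index_column(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for the reformatting step: add an 'Idx' column."""
    out = frame.copy()
    out.insert(loc=0, column="Idx", value=list(range(len(out))))
    return out


if __name__ == "__main__":
    print("pandas version:", pd.__version__)
    result = run_strict(add_index_column, pd.DataFrame({"a": [1, 2]}))
    print(list(result.columns))
```

Running the same check under both pandas 1.5.2 and a newer pandas would show whether a given code path is affected by the upgrade.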

@bikegeek
Collaborator Author

bikegeek commented Sep 12, 2023

The new index column is now generated on a copy of the dataframe, which avoids inserting into a fragmented dataframe. With this change the PerformanceWarning is no longer generated.

Before:
fho_df.insert(loc=0, column='Idx', value=idx)

After:
fho_df_copy = fho_df.copy()
fho_df_copy.insert(loc=0, column='Idx', value=idx)

then use fho_df_copy in subsequent lines instead of fho_df:

linetype_data: pd.DataFrame = pd.melt(fho_df_copy, id_vars=columns_to_use, var_name='stat_name',
                                      value_name='stat_value')
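The Before/After change above can be put together as a small self-contained sketch. The sample frame, its column names, and the contents of idx and columns_to_use are hypothetical stand-ins; only the copy, insert, and melt calls follow the snippet in this comment.

```python
import pandas as pd

# Hypothetical stand-in for the FHO line type data.
fho_df = pd.DataFrame(
    {
        "model": ["A", "A"],
        "fcst_lead": [6, 12],
        "f_rate": [0.4, 0.5],
        "h_rate": [0.3, 0.35],
    }
)

# One index value per row (assumed to match how idx is built upstream).
idx = list(range(fho_df.shape[0]))

# Insert into a de-fragmented copy rather than into the original frame.
fho_df_copy = fho_df.copy()
fho_df_copy.insert(loc=0, column="Idx", value=idx)

# Identifier columns to keep as-is; the remaining columns become stat rows.
columns_to_use = ["Idx", "model", "fcst_lead"]

linetype_data: pd.DataFrame = pd.melt(
    fho_df_copy,
    id_vars=columns_to_use,
    var_name="stat_name",
    value_name="stat_value",
)

if __name__ == "__main__":
    print(linetype_data)
```

The original fho_df is left unchanged, so any later code that still refers to it will not see the Idx column.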

bikegeek added a commit that referenced this issue Sep 14, 2023
Projects
Status: ✅ Done