Conversation

@mpolson64 (Contributor)
Summary:
Misc improvements and tricks to make DataRows more performant. We're within spitting distance of the original dataframe-backed implementation -- close enough that I'm willing to attribute the remaining difference to scheduler noise, and IMO good enough to land.

  1. Removed the isinstance check from `Data.__init__` -- this was helpful during the refactor, since some calls to `Data(df)` didn't use kwargs and caused errors, but it added unnecessary overhead.
  2. **[BIG IMPROVEMENT]** Used `df.itertuples` instead of `df.iterrows` in `Data.__init__` when initializing from a dataframe. This alone took us from 1h 44m to ~40m.
  3. Added new `empty`, `metric_names`, and `trial_indices` properties which don't require constructing `full_df`.
  4. Changed `Experiment.attach_data` to operate directly on `list[DataRow]` instead of on DataFrames (i.e. migrating from the `combine_df_favoring_recent` helper fn to a new `combine_data_rows_favoring_recent` fn).
  5. Changed `[*foo]` to `list(foo)` in a couple of places. Metamate tells me this is faster in extremely high data regimes -- I'm not sure I notice a difference or necessarily trust it.
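The `itertuples` win in item 2 comes from the fact that `iterrows` materializes a pandas `Series` for every row, while `itertuples` yields lightweight namedtuples. A minimal sketch (the column names here mirror typical Ax data but are assumptions, not the actual code):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "trial_index": [0, 0, 1],
        "metric_name": ["loss", "loss", "loss"],
        "mean": [0.1, 0.2, 0.15],
    }
)

# Slow: iterrows() builds a pandas Series object for every row.
rows_slow = [
    (row["trial_index"], row["metric_name"], row["mean"]) for _, row in df.iterrows()
]

# Fast: itertuples() yields plain namedtuples, skipping Series construction.
rows_fast = [(row.trial_index, row.metric_name, row.mean) for row in df.itertuples()]

assert rows_slow == rows_fast
```

The asymptotics are the same; the constant factor per row is what shrinks, which dominates at millions of rows.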

Remaining TODOs:
I'd be interested in removing `property` from the methods which are not O(1); there's a lot of fairly expensive work we do in Data, or at least work which requires a full scan, that looks like it should be fast because it has the same syntax as an attribute lookup. If nobody has any objections, I'll ask Metamate to do this for us.
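The distinction in the TODO above could look like this (a hypothetical sketch, not the actual Ax code -- the row representation is an assumption):

```python
class Data:
    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    # O(1): fine to keep as a property -- it reads like an attribute and is cheap.
    @property
    def empty(self) -> bool:
        return len(self._rows) == 0

    # O(n): requires a full scan of the rows, so an explicit method call
    # signals the cost to callers instead of masquerading as a field access.
    def metric_names(self) -> set[str]:
        return {row["metric_name"] for row in self._rows}


data = Data([{"metric_name": "loss"}, {"metric_name": "accuracy"}])
assert data.empty is False
assert data.metric_names() == {"loss", "accuracy"}
```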

Differential Revision: D90713603

meta-cla bot added the CLA Signed label on Jan 15, 2026
meta-codesync bot commented on Jan 15, 2026

@mpolson64 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D90713603.

mpolson64 added a commit to mpolson64/Ax that referenced this pull request Jan 15, 2026
Summary:
Pull Request resolved: facebook#4774

Summary:

TData was necessary when we had multiple different Data classes, but recent developments have made it no longer needed.
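For context, a bound TypeVar like this is the usual pattern when classmethods must return the concrete subclass type; the sketch below is a hypothetical reconstruction of why it existed and why a plain annotation suffices with a single class:

```python
from __future__ import annotations

from typing import TypeVar

# With multiple Data subclasses, a bound TypeVar let classmethods return
# the concrete subclass type rather than the base class.
TData = TypeVar("TData", bound="Data")


class Data:
    @classmethod
    def from_rows_generic(cls: type[TData], rows: list[dict]) -> TData:
        return cls()

    # With a single Data class, a plain annotation (or typing.Self on 3.11+)
    # is enough, and the TypeVar is dead weight.
    @classmethod
    def from_rows(cls, rows: list[dict]) -> Data:
        return cls()


assert isinstance(Data.from_rows([]), Data)
```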

Differential Revision: D90596942
Summary:

Moved these tests into TestData, since Data is the only data-related class in Ax.

Differential Revision: D90605845
Summary:

NOTE: This is much slower than the implementation backed by a dataframe. For clarity, I've put this naive implementation up as its own diff; the next diff hunts for speedups.

Creates a new source of truth for Data: the DataRow. The df is now a cached property which is generated dynamically from these rows.
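A minimal sketch of that shape -- the field names on the row are assumptions for illustration, not the actual Ax schema:

```python
from dataclasses import asdict, dataclass
from functools import cached_property

import pandas as pd


# Hypothetical row schema; the real DataRow fields may differ.
@dataclass(frozen=True)
class DataRow:
    trial_index: int
    arm_name: str
    metric_name: str
    mean: float
    se: float


class Data:
    def __init__(self, rows: list[DataRow]) -> None:
        self._rows = rows

    @cached_property
    def df(self) -> pd.DataFrame:
        # Built from the rows on first access, then cached on the instance,
        # so callers who never touch df never pay for its construction.
        return pd.DataFrame([asdict(row) for row in self._rows])


data = Data([DataRow(0, "0_0", "loss", 0.1, 0.01)])
assert data.df is data.df  # repeated access returns the same cached object
```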

In the future, these will become a Base object in SQLAlchemy, such that Data will have a SQLAlchemy relationship to a list of DataRows which live in their own table.
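That future storage setup might look roughly like the following -- a hypothetical sketch where the class names, table names, and columns are all assumptions:

```python
from sqlalchemy import Column, Float, ForeignKey, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()


class SQData(Base):
    __tablename__ = "data"  # hypothetical table and class names throughout

    id = Column(Integer, primary_key=True)
    # Data owns a one-to-many relationship to its rows.
    rows = relationship("SQDataRow", back_populates="data")


class SQDataRow(Base):
    __tablename__ = "data_row"

    id = Column(Integer, primary_key=True)
    data_id = Column(Integer, ForeignKey("data.id"))
    metric_name = Column(String)
    mean = Column(Float)
    se = Column(Float)
    data = relationship("SQData", back_populates="rows")


# Round-trip through an in-memory SQLite database to show the shape works.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
with Session(engine) as session:
    session.add(SQData(rows=[SQDataRow(metric_name="loss", mean=0.1, se=0.01)]))
    session.commit()
    loaded = session.query(SQData).one()
    metric_names = [r.metric_name for r in loaded.rows]

assert metric_names == ["loss"]
```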

RFC:

1. I'm renaming sem -> se here (but keeping sem in the df for now, since that could be an incredibly involved cleanup). Do we have alignment that this is a positive change? If so, I can either start or backlog the cleanup across the codebase. cc Balandat, who I've talked about this with a while back.
2. This removes the ability for Data to contain arbitrary columns, which was added in D83682740 and is, as far as I know, unused. Arbitrary new columns would not be compatible with the new storage setup (it was easy in the old setup, which is why we added it), and I think we should take a careful look at how to store contextual data in a structured way in the future.

Differential Revision: D90605846