
Meta: 🔡 Large synthetic dataset for performance evaluation #2516

Open
PeterNerlich opened this issue Oct 31, 2023 · 2 comments

@PeterNerlich
Contributor

This issue is about research into improving the development workflow when investigating performance bottlenecks. While we could simply create a copy of the live system for local experimentation (as we do with the test system every so often), it might contain personal information which we, as developers, would rather not even be able to obtain.

  • Devise a method of generating data that is reasonably authentic in comparison with the live system, namely a similar number of regions with similar content (see the sketch after this list).
  • Verify that a dev environment with that data behaves similarly to the live environment in terms of performance (mind the available resources, which might vary greatly between the machines used for development; e.g. the Redis cache probably only needs to be mentioned to the developer as a potentially decisive factor if it is not active).
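
A minimal sketch of what such a generator could look like, assuming the data is written as a Django fixture. The model labels ("cms.region", "cms.page"), the field names and the region count below are assumptions that would need to be aligned with the actual integreat-cms models; only the "~1k pages for large regions" estimate further down in this thread is taken from the discussion.

```python
# Hypothetical sketch: emit a large Django fixture with synthetic regions and
# pages. Model labels and field names are assumptions, not the real schema.
import json


def synthetic_fixture(num_regions: int = 100, pages_per_region: int = 1000) -> list:
    objects = []
    page_pk = 1
    for region_pk in range(1, num_regions + 1):
        objects.append({
            "model": "cms.region",  # assumed model label
            "pk": region_pk,
            "fields": {"name": f"Synthetic Region {region_pk}", "slug": f"region-{region_pk}"},
        })
        for page in range(pages_per_region):
            objects.append({
                "model": "cms.page",  # assumed model label
                "pk": page_pk,
                "fields": {"region": region_pk, "title": f"Page {page} of region {region_pk}"},
            })
            page_pk += 1
    return objects


if __name__ == "__main__":
    # pages_per_region=1000 follows the "~1k pages for large regions" estimate
    # mentioned below; the number of regions is a guess.
    with open("large_test_data.json", "w") as f:
        json.dump(synthetic_fixture(), f, indent=2)
```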

This should be kept separate from the existing test_data.json fixture, as the small dataset is highly preferable during quick iterations on a feature, except when performance with large data is the focus. The developer should be able to switch between the two with relative ease (see the sketch below).
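
As a sketch of how switching could work, assuming integreat-cms-cli wraps Django's manage.py so the standard flush and loaddata commands are available; "large_test_data" is a hypothetical fixture name:

```python
# Hypothetical sketch: reset the database and load either the small default
# fixture or a large synthetic one. flush and loaddata are standard Django
# management commands; "large_test_data" is a made-up fixture name.
from django.core.management import call_command


def load_test_data(large: bool = False) -> None:
    call_command("flush", interactive=False)  # wipe the current data without prompting
    call_command("loaddata", "large_test_data" if large else "test_data")
```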

@timobrembeck
Member

Just for completeness, I want to mention

```
./tools/integreat-cms-cli duplicate_pages augsburg
```

which can be used to generate a lot of pages; however, it does not cover specific edge cases that are not reflected in the original test data. So one solution could be to create a more diverse baseline of test data, which would hopefully result in a more realistic dataset once the duplication algorithm has been executed a few times (~1k pages for large regions is realistic).
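
As an illustration, repeated execution could look like this; the sketch assumes duplicate_pages takes the region slug as a positional argument (as in the CLI invocation above) and can be called via Django's call_command:

```python
# Sketch: grow the dataset by running the duplication command several times,
# assuming each run roughly doubles the page count of the given region.
from django.core.management import call_command

for _ in range(4):
    call_command("duplicate_pages", "augsburg")
```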

@timobrembeck timobrembeck changed the title META: Large synthetic dataset for performance evaluation Meta: Large synthetic dataset for performance evaluation Oct 31, 2023
@timobrembeck timobrembeck added this to the Meta Issues milestone Oct 31, 2023
@timobrembeck timobrembeck changed the title Meta: Large synthetic dataset for performance evaluation Meta: 🔡 Large synthetic dataset for performance evaluation Nov 4, 2023
@david-venhoff
Member

> however, it does not cover specific edge cases that are not reflected in the original test data

One example of such an edge case would be #2530, where performance testing requires lots of different links, which cannot be created using the duplicate_pages tool.
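
A hedged sketch of how such link-heavy content could be produced; the helper below is hypothetical and not part of the integreat-cms API, it merely builds HTML page bodies with many distinct URLs for the link checker to process:

```python
# Hypothetical sketch: build page content containing many distinct links, so
# that link-checking performance (cf. #2530) can be exercised.
def synthetic_page_content(page_id: int, links_per_page: int = 20) -> str:
    links = "\n".join(
        f'<a href="https://example.com/{page_id}/{i}">Link {i}</a>'
        for i in range(links_per_page)
    )
    return f"<h1>Synthetic page {page_id}</h1>\n{links}"


# Example: three distinct links for page 1
print(synthetic_page_content(1, links_per_page=3))
```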
