Define general benchmark database schema #20885

Closed
asfimport opened this issue Jan 21, 2019 · 25 comments

asfimport commented Jan 21, 2019

Some possible attributes that the benchmark database should track, to permit heterogeneity of hardware and programming languages (sketched as a single table after this list):

  • Timestamp of benchmark run

  • Git commit hash of codebase

  • Machine unique name (sort of the "user id")

  • CPU identification for machine, and clock frequency (in case of overclocking)

  • CPU cache sizes (L1/L2/L3)

  • Whether or not CPU throttling is enabled (if it can be easily determined)

  • RAM size

  • GPU identification (if any)

  • Benchmark unique name

  • Programming language(s) associated with benchmark (e.g. a benchmark
    may involve both C++ and Python)

  • Benchmark time, plus mean and standard deviation if available, else NULL

    See the discussion on the mailing list: https://lists.apache.org/thread.html/278e573445c83bbd8ee66474b9356c5291a16f6b6eca11dbbe4b473a@%3Cdev.arrow.apache.org%3E
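A rough, purely illustrative sketch of the attributes above as columns of one flat table (names and types here are assumptions for discussion, not an adopted schema):

```sql
-- Illustrative only: the attribute list above as a single denormalized table.
CREATE TABLE benchmark_run (
    run_timestamp     TIMESTAMPTZ,        -- timestamp of benchmark run
    git_commit_hash   TEXT,               -- codebase revision
    machine_name      TEXT,               -- machine unique name (the "user id")
    cpu_model_name    TEXT,
    cpu_frequency_Hz  BIGINT,             -- actual clock, in case of overclocking
    l1_cache_bytes    BIGINT,
    l2_cache_bytes    BIGINT,
    l3_cache_bytes    BIGINT,
    cpu_throttling    BOOLEAN,            -- if it can be determined
    ram_bytes         BIGINT,
    gpu_model_name    TEXT,               -- NULL if no GPU
    benchmark_name    TEXT,
    languages         TEXT[],             -- e.g. '{C++, Python}'
    elapsed_time_s    DOUBLE PRECISION,   -- NULL for non-time benchmarks
    time_mean_s       DOUBLE PRECISION,   -- NULL if not available
    time_stddev_s     DOUBLE PRECISION    -- NULL if not available
);
```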

Reporter: Wes McKinney / @wesm
Assignee: Tanya Schlusser / @tanyaschlusser

Note: This issue was originally created as ARROW-4313. Please see the migration documentation for further details.


Antoine Pitrou / @pitrou:
Note that some benchmarks may not give you a time (for example memory benchmarks with ASV). For some other benchmarks, the preferred unit of reporting may be something else (for example MB/s for a memory copy benchmark).


Wes McKinney / @wesm:
Very good point. So we should generalize to:

  • Benchmark result

  • Benchmark unit

Or something else? We should make an effort to standardize the permitted values of "Unit".


Antoine Pitrou / @pitrou:
Yes, result + unit sounds good. Plus a boolean flag that indicates "higher is better" or something (to be able to distinguish regressions from improvements). In this manner, the unit could be pretty much arbitrary (as long as a given benchmark always reports the same unit, instead of e.g. switching between "MB" and "GB").
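A minimal sketch of that generalization (illustrative column names, not the adopted schema): each result carries its own value, unit, and direction, so time-based and throughput-based benchmarks can share one table.

```sql
-- Illustrative only: a result row that records its unit and direction explicitly.
CREATE TABLE benchmark_result_example (
    result_value     DOUBLE PRECISION NOT NULL,  -- e.g. 12.3
    result_unit      TEXT NOT NULL,              -- e.g. 'ns' or 'MB/s', from a standardized set
    higher_is_better BOOLEAN NOT NULL            -- TRUE for 'MB/s', FALSE for 'ns'
);
```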


Tanya Schlusser / @tanyaschlusser:
Pinging @aregm, who started the email discussion, and volunteering to help in whatever ways I can 👋. I said I'd mock up a backend and will edit this comment with a hyperlink when a mock is up.


Areg Melik-Adamyan / @aregm:
I think it will be easier if we keep things a little bit simple in the beginning, so we don't have to redo a lot in the future.

So replies to original comments:

  • Timestamp of benchmark run - We should be careful: this is helpful, but you cannot rely on this timestamp, as there is no guarantee that systems are synced in time. For purely informational purposes it is fine.
  • Git commit hash of codebase
  • Machine unique name (sort of the "user id") - Machine ID and machine information should go into a different database, as machines can change, come and go; you do not want to keep that info tied to benchmarks.
  • CPU identification for machine, and clock frequency (in case of overclocking)
  • CPU cache sizes (L1/L2/L3)
  • Whether or not CPU throttling is enabled (if it can be easily determined) - For benchmarking you should always set it to max; not fixing the governor will add unpredictable flakiness to the benchmarks. You also need to lock the machine while the benchmarks are running to prevent noise.
  • RAM size
  • GPU identification (if any)
  • Benchmark unique name - To start with, I would say yes, but it can quickly get out of control: you have e.g. TestFeatureA, then it gets flavors like input size, and you start naming it TestFeatureA5GB, then TestFeatureA5GB-CPU, TestFeatureA5GB-GPU-Nvidia, TestFeatureA5GB-GPU-Radeon, and it gets out of control. The best-known way to keep this under control is a hierarchical name or a unique ID with a benchmark table, which is kind of overkill for now (see the sketch after this list).
  • Programming language(s) associated with benchmark (e.g. a benchmark may involve both C++ and Python) - Why would you need this? Maybe put it into the hierarchical name?
  • Benchmark time, plus mean and standard deviation if available, else NULL
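A minimal sketch of the "hierarchical name plus benchmark table" idea (purely illustrative; the table and column names here are hypothetical): keep a stable name and move the flavors into parameters rather than encoding them into ever-longer names.

```sql
-- Illustrative only: stable hierarchical name, per-run flavors as parameters.
CREATE TABLE benchmark_name_example (
    benchmark_id   SERIAL PRIMARY KEY,
    benchmark_name TEXT NOT NULL,  -- e.g. 'arrow.compute.TestFeatureA'
    parameters     JSONB           -- e.g. '{"input_size": "5GB", "device": "GPU", "vendor": "Nvidia"}'
);
```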


Areg Melik-Adamyan / @aregm:
@tanyaschlusser I have created task 4354; can you please take a look?

Meanwhile, I have created the first version of the spec to work on: https://cwiki.apache.org/confluence/display/ARROW/Performance+Dashboard


Tanya Schlusser / @tanyaschlusser:
I've attached a diagram, benchmark-data-model.png, and a corresponding .erdplus file, benchmark-data-model.erdplus (JSON, viewable and editable with a free account on erdplus.com), with a draft data model for everyone's consideration. I tried to incorporate elements of both the codespeed and ASV projects.

Happy to modify per feedback—or leave this to a more experienced person if I'm becoming the slow link.
Of course there will be a view with all of the relevant information joined.


Antoine Pitrou / @pitrou:
Some thoughts:

  • in cpu_dim, perhaps add a cpu_thread_count (the CPU's number of hardware threads, which can be a multiple of the number of distinct cores)
  • either in machine_dim or os_dim, store the bitness? (usually 64-bit I suppose, though perhaps some people will want to benchmark on 32-bit). Or, more generally perhaps, the architecture name (such as "x86-64" or "ARMv8" or "AArch64").
  • not sure why tables are suffixed with _dim?


Areg Melik-Adamyan / @aregm:

  • in cpu_dim, perhaps add a cpu_thread_count (the CPU's number of hardware threads, which can be a multiple of the number of distinct cores)
    • there is a 'core_count'; for IA it is better to have an HT flag, for others threads = cores
  • either in machine_dim or os_dim, store the bitness? (usually 64-bit I suppose, though perhaps some people will want to benchmark on 32-bit). Or, more generally perhaps, the architecture name (such as "x86-64" or "ARMv8" or "AArch64").
    • a short uname -i should be enough.
  • not sure why tables are suffixed with _dim?
    • I guess those are working names, not necessarily the final ones.


Antoine Pitrou / @pitrou:

there is a 'core_count', for IA it is better to have HT flag, for others threads=cores

Not really, for example IBM POWER CPUs can have 2, 4 or 8 threads per core.


Areg Melik-Adamyan / @aregm:
Ok, if we want to add them, then they should be named 'smt_thread_count' or 'threads_per_core'. And there is the case of multiple CPUs also. Do you anticipate using Arrow on mainframes? I would say that FPGA usage will most likely precede POWER usage.


Antoine Pitrou / @pitrou:
Multiple CPUs would go under the core_count IMO.

As for mainframes, no, but AFAIK there are regular Linux-based (or AIX-based) POWER servers.


Wes McKinney / @wesm:
I recall an e-mail thread some time back about IBM POWER support – some of us (myself, @kou) were given access to Power Z-based CI infrastructure for testing, but we have yet to try it. I doubt that the project works on big endian right now (Arrow is currently little-endian, even when running on big-endian hardware).


Tanya Schlusser / @tanyaschlusser:
Thank you very much for everyone's detailed feedback. I absolutely need guidance with the Machine / CPU / GPU specs. I have updated the benchmark-data-model.png and the benchmark-data-model.erdplus, and added all of the recommended columns.

 

Summary of changes (see also the rough SQL sketch after the questions below):

  • All the dimension tables have been renamed to exclude the _dim. (It was to distinguish dimension vs. fact tables.)

  • cpu

    • Added a cpu_thread_count
    • Changed cpu.speed_Hz to two columns: frequency_max_Hz and frequency_min_Hz and also added a column machine.overclock_frequency_Hz to the machine table to allow for overclocking like Wes mentioned in the beginning.
  • os

    • Added both os.architecture_name and os.architecture_bits, the latter forced to be in {32, 64}, and pulled from the architecture name (maybe it will become just a computed column in the joined view...). I think it's a good idea.
  • project

    • Added a project.project_name (oversight before)
  • benchmark_language

    • Split out language to language_name and language_version because maybe people will want to compare between them (e.g. Python 2.7, 3.5+)
  • environment

    • Removed foreign key for machine_id — that should be in the benchmark report separately. Many machines will have the same environment.
  • benchmark

    • Added foreign key for benchmark_language_id—a benchmark with the same name may exist for different languages.
    • Added foreign key for project_id—moved it from table benchmark_result
  • benchmark_result

    • Added foreign key for machine_id (was removed from environment)
    • Deleted foreign key for project_id, placing it in benchmark (as stated above)

Questions

  • cpu and gpu dimension
    • Is it a mistake to make cpu.cpu_model_name unique? I mean, are the LX cache levels, core counts, or any other attribute ever different for the same CPU model string?
    • The same for GPU.
    • I have commented the columns to say that  cpu_thread_count corresponds to sysctl -n hw.logicalcpu and cpu_core_count corresponds to sysctl -n hw.physicalcpu; corrections gratefully accepted.
    • Would it be less confusing to make the column names the exact same strings as correspond to their value from sysctl, e.g. change cpu.cpu_model_name to cpu.cpu_brand_string to correspond to the output of sysctl -n machdep.cpu.brand_string?
    • On that note is CPU RAM the same thing as sysctl -n machdep.cpu.cache.size?
  • environment
    • I'm worried I'm doing something inelegant with the dependency list. It will hold everything – Conda/virtualenv, versions of NumPy, all permutations of the various dependencies in what ASV calls the dependency matrix.
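For readers without access to the .erdplus file, here is a rough, unofficial SQL sketch of a few of the tables described above (column names and types are guesses based on this discussion, not the attached model; machine, gpu, os, project, and environment are omitted for brevity):

```sql
-- Unofficial sketch, assuming PostgreSQL; inferred from the discussion above,
-- not taken from the attached benchmark-data-model.erdplus.
CREATE TABLE cpu (
    cpu_id           SERIAL PRIMARY KEY,
    cpu_model_name   TEXT NOT NULL,        -- uniqueness still under discussion
    cpu_core_count   INTEGER,              -- sysctl -n hw.physicalcpu
    cpu_thread_count INTEGER,              -- sysctl -n hw.logicalcpu
    frequency_max_Hz BIGINT,
    frequency_min_Hz BIGINT,
    l1_cache_bytes   BIGINT,
    l2_cache_bytes   BIGINT,
    l3_cache_bytes   BIGINT
);

CREATE TABLE benchmark_language (
    benchmark_language_id SERIAL PRIMARY KEY,
    language_name         TEXT NOT NULL,   -- e.g. 'Python'
    language_version      TEXT             -- e.g. '2.7', '3.7'
);

CREATE TABLE benchmark (
    benchmark_id          SERIAL PRIMARY KEY,
    benchmark_name        TEXT NOT NULL,
    benchmark_language_id INTEGER REFERENCES benchmark_language (benchmark_language_id),
    project_id            INTEGER          -- would reference the project table
);

CREATE TABLE benchmark_result (
    benchmark_result_id SERIAL PRIMARY KEY,
    benchmark_id        INTEGER NOT NULL REFERENCES benchmark (benchmark_id),
    machine_id          INTEGER,           -- would reference the machine table
    environment_id      INTEGER,           -- would reference the environment table
    run_timestamp       TIMESTAMPTZ,
    git_commit_hash     TEXT,
    result_value        DOUBLE PRECISION,
    result_unit         TEXT,
    higher_is_better    BOOLEAN
);
```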


Areg Melik-Adamyan / @aregm:
@tanyaschlusser why do we need 'overclock_freq_HZ'? What is the practical usage model for this field?


Tanya Schlusser / @tanyaschlusser:
@aregm I do not know. I am depending on the other people commenting here to make sure the hardware tables make sense, because honestly I never pay attention to hardware; my use cases never stress my system. At one point Wes suggested it. I am glad there is a debate.


Areg Melik-Adamyan / @aregm:
Got it. I think those numbers are mostly never used, because you always run benchmarks at a fixed frequency to get results that are consistent over time. They can easily be determined from the model name or cpuid, just for informational purposes, but they will never be used in serial benchmarking. In serial benchmarking everything should be fixed, nailed down, and unchanged except the variable you are measuring, which here is the Arrow code as measured through the benchmark code.


Tanya Schlusser / @tanyaschlusser:
I think part of this was to allow anybody to contribute benchmarks from their own machine. And while dedicated benchmarking machines like the ones you will set up will have all parameters set for optimal benchmarking, benchmarks run on other machines may give different results. Collecting details about the machine that might explain those differences (in case someone cares to explore the dataset) is part of the goal of the data model.

One concern, of course, is that people may get wildly different results than a published benchmark and say "Oh boo, the representative person from the company made fake results that I can't replicate on my machine" ... With details about a system recorded, performance differences can perhaps be traced back to differences in setup.

Not all fields need to be filled out all the time. My priorities are:

  1. Identifying which fields are flat-out wrong

  2. Differentiating between necessary columns and extraneous ones that can be left null

    To me, it is not a big deal to have an extra column dangling around that almost nobody uses. No harm. (Unless it's mislabeled or otherwise wrong; that's what I'm interested in getting out of the discussion here.)


Antoine Pitrou / @pitrou:
For the record, IBM POWER CPUs support little-endian mode on Linux:
https://www.ibm.com/developerworks/library/l-power-little-endian-faq-trs/index.html

So big-endian support in Arrow would probably not be a roadblock.


Antoine Pitrou / @pitrou:
"Is it a mistake to make cpu.cpu_model_name unique? I mean, are the LX cache levels, core counts, or any other attribute ever different for the same CPU model string?"

The overclocked frequency may vary (which we could also call "actual frequency"), the rest should be the same.


Tanya Schlusser / @tanyaschlusser:
Thank you Antoine! I missed this last comment. "actual frequency" is a good name, and I used it.

  • I did not understand the conversations about little-and-big-endian, and did not add fields related to that to the database.

  • I was surprised during testing by the behavior of nulls in the database, so some things don't yet work the way I'd like (the example script fails in one place).

Thank you everyone for so much feedback. I have uploaded new files for the current data model and am happy to change things according to feedback. If you don't like something, it can be fixed :)


Areg Melik-Adamyan / @aregm:
@wesm and @pitrou - need your input.

My understanding of [this conversation](https://lists.apache.org/thread.html/dcc08ab10507a5139178d7f816c0f5177ff0657546a4ade3ed71ffd5@%3Cdev.arrow.apache.org%3E) was that a data model not tied to any ORM tool was the desired path to take.

I think we need to take a step back, and sync up and agree with @wesm and @pitrou on the goals for this little project:

  • for me the goal is to continuously track the performance for the core C++ library and help everybody who is doing performance work to catch regressions and contribute improvements.

  • do that in a validated form, so we can rely on the numbers.

  • there is no goal to provide infrastructure for contributing 3rd party numbers, as they cannot be validated in a quick manner.

  • there is no goal to bench other languages, as they rely on C++ library calls and you will benchmark the wrapper conversion speed

  • there is no goal, for now, to anticipate and satisfy all possible future needs.

The ability of the Arrow test library (practically GTest) to provide performance numbers on a given platform is more than enough. I would not like to limit users who need different kinds of databases, performance monitors, or dashboards. I am duplicating this in issue 4313 to move the discussion out of the code review.


Wes McKinney / @wesm:
I'm involved in many projects, so I haven't been able to follow the discussion closely enough to see where there is disagreement or conflict.

From my perspective I want the following in the short term

  • A general purpose database schema, preferably for PostgreSQL, which can be used to easily provision a new benchmark database

  • A script for running the C++ benchmarks and inserting the results into any instance of that database. This script should capture hardware information as well as any additional information that is known about the environment (OS, third-party library versions – e.g. so we can see whether upgrading a dependency, like gRPC for example, causes a performance problem). The script should not be coupled to a particular instance of the database. It should work in an air-gapped environment.

I think we should work as quickly as possible to have a working version of both of these, to validate that we are on the right track. If we try to come up with the "perfect database schema" and punt the benchmark collector script until later, we could be waiting a long time.

Ideally the database schema can accommodate results from multiple benchmark execution frameworks other than Google Benchmark for C++, so we could write an adapter script to export data from ASV (for Python) into this database.

@aregm this does not seem to be out of line with the requirements you listed, unless I am misunderstanding. I would rather not be too involved with the details right now, unless the project stalls out for some reason and needs me to help push it through to completion.
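As a purely hypothetical illustration of the kind of row the collector script described above might emit (using the unofficial table and column names sketched earlier in this thread, with placeholder values):

```sql
-- Hypothetical example of one collected result; IDs, hash, and values are placeholders.
INSERT INTO benchmark_result
    (benchmark_id, machine_id, environment_id,
     run_timestamp, git_commit_hash,
     result_value, result_unit, higher_is_better)
VALUES
    (42, 7, 3,
     '2019-02-01T12:34:56Z', '<git commit hash>',
     1843.5, 'MB/s', TRUE);
```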


Antoine Pitrou / @pitrou:
Indeed, agreed with Wes.

Just to answer one comment:

there is no goal to bench other languages, as they rely on C++ library calls and you will benchmark the wrapper conversion speed

It's a bit more involved than that. For example, the speed of creating an Arrow array (or an Arrow dataframe) from Python objects is important, and this requires specific optimizations inside Arrow. Technically we could benchmark it using the C++ infrastructure; it's just massively easier to write the benchmarks in Python using ASV, so that's what we're doing now.

That said, yes, recording C++ benchmark results is a good first-priority goal. The thing to keep in mind is that we don't want the adopted DB schema to limit us in this regard.

(also, some implementations are not based on the C++ library, they are independent reimplementations of the Arrow data model, e.g. Java, C# or Rust)


Antoine Pitrou / @pitrou:
Issue resolved by pull request #3586.
