Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NaN in StatisticsSenderService JSON output #29412

Closed
makortel opened this issue Apr 8, 2020 · 14 comments · Fixed by #35362
Closed

NaN in StatisticsSenderService JSON output #29412

makortel opened this issue Apr 8, 2020 · 14 comments · Fixed by #35362

Comments

@makortel
Copy link
Contributor

makortel commented Apr 8, 2020

Reported by @vkuznet in https://hypernews.cern.ch/HyperNews/CMS/get/swDevelopment/3555.html

Hi,
with recent migration of UDP collector service which captures JSON packets
produced by CMSSW for popularity purposes I found that this piece of code [1]
can produce NaN when value under sqrt is negative. This lead to malformed
JSON packet [2].

Can someone responsible for this package [1] checks the logic of calculating
read_vector_sigma value.

Thanks,
Valentin.

[1] https://cmssdt.cern.ch/lxr/source/Utilities/StorageFactory/src/StatisticsSenderService.cc#0092
[2]

{
  "site_name":"T2_US_Nebraska",
  "user_dn":"...",
  "client_host":"...",
  "client_domain":"...",
  "server_host":"unknown",
  "server_domain":"unknown",
  "unique_id":"2DDAA44B-6E7F-E543-BBED-B50D0C8EE113-0",
  "file_lfn":"/store/unmerged/RunIISummer19UL18SIM/SingleNeutrino/GEN-SIM/106X_upgrade2018_realistic_v11_L1v1_ext1-v1/10001/63DB7219-8852-D244-AFBD-8E36D06D1486.root",
  "file_size":3821015373,
  "read_single_sigma":437859,
  "read_single_average":128233,
  "read_vector_average":1.5576e+07,
  "read_vector_sigma":-nan,
  "read_vector_count_average":7.50465,
  "read_vector_count_sigma":9.46,
  "read_bytes":482886164000,
  "read_bytes_at_close":482886164000,
  "read_single_operations":7283,
  "read_single_bytes":933923139,
  "read_vector_operations":30942,
  "read_vector_bytes":481952240861,
  "start_time":1586260611,
  "end_time":1586283538
}
@makortel
Copy link
Contributor Author

makortel commented Apr 8, 2020

assign core

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 8, 2020

New categories assigned: core

@Dr15Jones,@smuzaffar,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Copy link
Contributor

cmsbuild commented Apr 8, 2020

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @silviodonato, @dpiparo, @smuzaffar, @makortel can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor Author

makortel commented Apr 8, 2020

@bbockelm Would you be able to take a look?

@makortel
Copy link
Contributor Author

makortel commented Apr 8, 2020

@vkuznet Is it more important to get a conforming JSON, or a sensible number instead of e.g. "NaN"? I'm just wondering if a quick mitigation of putting a string instead of invalid value while the real cause is being investigated would have any benefit.

@vkuznet
Copy link
Contributor

vkuznet commented Apr 8, 2020

Matti, what is important is valid JSON, the nan is not a valid JSON value, therefore service fails to parse the JSON. From my perspective, I don't really care if you'll provide string or number, but for data to be consistent with possible analysis it should have consistent data-type.

@makortel
Copy link
Contributor Author

makortel commented Apr 8, 2020

Thanks, so a 0 (or -1) might be a possible quick mitigation. Do you know how frequently the problem occurs? Are backports to earlier CMSSW release cycles needed?

@vkuznet
Copy link
Contributor

vkuznet commented Apr 8, 2020

yes, 0 or -1 will work. And, the rate of errors I see is ~1500 in 24h where the total number of docs received by the collector was about 2.5M.

@makortel
Copy link
Contributor Author

makortel commented Sep 3, 2021

#35142 (and https://hypernews.cern.ch/HyperNews/CMS/get/swDevelopment/3593.html) reports a similar issue with read_single_sigma. In the following I'll focus on the math of that

if (single_op_count > 0) {
double single_sum = read_single_bytes - m_read_single_bytes;
double single_average = single_sum / static_cast<double>(single_op_count);
os << "\"read_single_sigma\":"
<< sqrt((static_cast<double>(read_single_square - m_read_single_square) -
single_average * single_average * single_op_count) /
static_cast<double>(single_op_count))
<< ", ";
os << "\"read_single_average\":" << single_average << ", ";
}
m_read_single_square = read_single_square;

Since the entry is there, and single_average is a valid number, we can tell that single_op_count != 0 and single_sum != 0. The only way to get NaN would then be the argument of sqrt, (static_cast<double>(read_single_square - m_read_single_square) - single_average * single_average * single_op_count to be negative.

By quick look this snippet looks like it intends to calculate the mean and stddev since the last update, especially because m_read_single_square is updated to read_single_square. But I don't see any such updates for m_read_single_bytes, actually I don't see any writes to it except initialization to 0 in the constructor. Same holds for m_read_vector_bytes, but m_read_vector_count_sum is updated in

m_read_vector_count_sum = read_vector_count_sum;
.

I think this leads to the second term in the sqrt argument to grow throughout the job, whereas the first term would stay roughly the same (assuming constant update intervals), that would eventually lead to the expression to become negative.

Should m_read_single_bytes and m_read_vector_bytes be updated similarly? @bbockelm

@vkuznet
Copy link
Contributor

vkuznet commented Sep 3, 2021

@makortel , I did not inspect all errors (as I reported today there are roughly 7000 of invalid JSON failures reported yesterday). How many do have those per day I don't know, but I would assume this rate. So, the code should check that all math operations are valid and produce valid numerical values. I don't see why you can't just check that you pass positive value to sqrt, and you should not divide by zero. Those seems obvious checks to add to the code to all values which this code stream to os. If you find that value is invalid do not write it to os or put zero or -1 if you want to distinguish them from other positive values.

@Dr15Jones
Copy link
Contributor

So looking at the code, the numbers can be changing on one thread by being read out on another. They are all std::atomic so the values themselves are fine, but doing arithmetic on multiples of them while is is happening is not guaranteed to give a meaningful result every time. I'm protecting the sqrt with an internal abs so at least we will get a number.

@makortel
Copy link
Contributor Author

#35362 has a minimal fix against the NaNs. The next question is then to what release cycles this should be backported to? @qliphy @perrotta Could you comment which release cycles are still considered active? (or are all those in the ORP spreadsheet such)

@makortel
Copy link
Contributor Author

@qliphy @perrotta Could you comment which release cycles are still considered active? (or are all those in the ORP spreadsheet such)

@qliphy
Copy link
Contributor

qliphy commented Oct 1, 2021

@qliphy @perrotta Could you comment which release cycles are still considered active? (or are all those in the ORP spreadsheet such)

@makortel For open cycles, you can refer to
https://indico.cern.ch/event/1019627/contributions/4292570/attachments/2225107/3768685/210413_ORP.pdf
and of course 12_0_X and 12_1_X in addition to above

And as discussed at ORP in Aug,
7_1, 8_0, 9_3, 11_2 IBs have been removed.

We will update the ORP speadsheet later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants