Server health tests are a little unfair due to absolute timing values #43
Here's my plan to change the policy:
My first impulse would be to do it once, take the resulting break points, and hard-code them in the health scoring process, but ideally the break points should be updated regularly. The issue is that this requires pinging every known server, which takes a significant amount of time, especially with 5 tries. Maybe it could be a rolling thing where the ping is checked at the same time as the server health, and the break points are moved less regularly. Another idea would be a fully proportional score change based on min/max values, which wouldn't require any break points. |
|
Thanks for asking and for the work, I've a question: the bell curve studies only one variable (the response time in our case), how do you factor in the ping? Through the potential log function? |
Are we waiting on your friend then, or do we have to do anything by ourselves? |
I'm going to test the log function for the time being. Next Friday, I'll see him again. He will try to get someone who will do the empirical part for us on a spreadsheet. Once we know the factor formula and the benchmark values, it can be coded. Today's discussion was very brief (after some other meeting) and was merely conceptual. I put up the concept for people to comment on. But I understand this is quite advanced and also partially beyond my own expertise. |
Beyond my own as well! |
So the log-normal function works as predicted. Here's a before and after on the "request_time" variable: This would be ready for zoning if it had the ping factored in. I'll try to give it a go even before the next meeting. -- @MrPetovan do you think you could obtain data for all known nodes? All we need to ensure is that there is no duplication (e.g. http vs https). |
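A minimal sketch of the transformation being discussed, assuming `request_time` values in milliseconds (the function and variable names are illustrative, not from the directory code):

```python
import math

def log_transform(request_times):
    """Apply a natural-log transform to right-skewed request times.

    Very slow outliers are pulled in toward the bulk of the
    distribution, which is what makes zoning on the transformed
    values practical.
    """
    return [math.log(t) for t in request_times if t > 0]

# A heavily right-skewed sample: most nodes fast, a few very slow.
raw = [80, 95, 110, 120, 150, 900, 4000]
transformed = log_transform(raw)
```

The transform is monotonic, so the ordering of nodes is preserved; only the spacing between them changes.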
There’s a difference between “known node” and “active node”. I voluntarily limited the dataset to active nodes only because ping and request times don’t make sense for dead nodes. |
That's good, sorry for my confusion! Yes, if that's all the active nodes out there, that's great. Do you have a script for generating the data? |
No, it was a simple SQL query against the |
Sorry, this is the part I don't understand. In my So my question is: how do you make that leap from a negative health score to being inactive? |
Because the score penalty for connection timeout is very high, so anything with the lowest health score (-100) is unlikely to be active. |
But couldn't this be part of the latency problem we're trying to solve? Could you have a look at my table, taken from this side of the world? There are some active nodes with -100. -- We can do this any time once we have the correct probability formula and have agreed on the different zones in a spreadsheet for checking. In the future, each directory might once in a while automatically calculate the cut-off values for health levels based on our predefined zones. This would presumably be done on the basis of all known nodes in the DB. |
The important field for available nodes is Also remember that a high response time will also penalize servers, which is why you have This is why you probably should take into account nodes where |
Yes, that makes sense. Failed probes, like outdated code, will reduce the overall score of course. When I said dynamically calculated levels, I purely meant the value of the response_time moderated by ping_avg. My own node, running on a piece of junkware, is a good example; it has a very high response time but a relatively low ping (from my directory) and a high one (from yours). The probability distribution formula will actually reduce my node's response_time value because of the low ping, but probably only a tiny bit. My node will still very likely get a low score because of the high RT/ping value. That's the part we are interested in: seeing how each percentage zone pans out along the curve, at the crest and the base. Can you see why I'm interested in the whole dataset of all known nodes? Do you think I could easily generate this myself for my directory by running a simple query? |
Of course I see the interest, but I'm wary of inactive nodes skewing the data set in a specific direction. The other table to consider is The query to extract the data would go along these lines (I can't test until 8 PM EST): SELECT `base_url`, AVG(`request_time`), AVG(`avg_ping`)
FROM `site-probe`
JOIN `site-health` ON `site-health`.`id` = `site-probe`.`site_health_id`
WHERE `request_time` IS NOT NULL
AND `avg_ping` IS NOT NULL
GROUP BY `site-health`.`id` |
Thanks! No rush at all. The conditional probability equation will not produce a value for any node that has a zero ping. The very long request_times that just end up flat-lining at the base will turn into a steep drop through the log-normal function. See the before and after graph (see here: #43 (comment)). We will make a purely qualitative judgment with the zoning as to how narrow each zone will be. This will ensure that very slow machines can never have a medium or high rating. |
How can there be a ping of zero? Or do you mean a missing value? |
Less than a millisecond ping. Could happen. |
Here's the latest data from dir.friendica.social with the above query: |
Thanks MrPetovan for the data! @tobiasd highlights an important point. I think for the equation to give a probability, the ping value must be greater than zero. A superfast 0.001 ms ping would still work. In practice, this means admins who block pings will not have a health value. There are currently nodes that return a fast request_time but have a "0" ping, presumably because the ping failed. Rather than giving these nodes a bad health score, they should be given a special status such as "unknown health". All the nodes that I saw fitting this category were not open for registration. So this would be the trade-off: if you run an open-registration node, you need to be pingable. Otherwise you don't get a score. |
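The special-casing described above could be sketched like this (the function name, units, and placeholder scoring formula are illustrative assumptions, not the directory's actual API):

```python
def health_or_unknown(request_time, avg_ping):
    """Return a health indicator, or None for nodes whose ping is
    zero or missing, instead of penalizing them with a bad score.

    request_time and avg_ping are in milliseconds; any positive
    ping, however small (e.g. 0.001 ms), still yields a score.
    """
    if avg_ping is None or avg_ping <= 0:
        return None  # "unknown health": ping blocked or failed
    # Placeholder scoring: ping-moderated request time (illustrative only).
    return request_time / avg_ping
```

The caller can then render `None` as a missing heart icon rather than a red one, as discussed below.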
One could use the average ping time for those nodes that block ping. I mean a fixed value we find during the evaluation of the health determination round. |
Yes, that's a possibility, but it's open to manipulation. If you have a slow request_time and a faster-than-average ping, you could just block your ping and instantly get a better score. It's a kind of shared-risk issue. There are many legitimate reasons why people block ping, but if everyone were to block ping then the directories would not work any more. Of course we would still provide individual outputs for nodes without ping, similar to the screenshot below. But there would not be a final overall score; instead of the heart being green or whatever colour, the heart would just not be shown, while clearly indicating the node is active. Mostly it wouldn't matter, because only nodes open for registration are listed in ../servers. |
If someone is gaming the health test that way, we can just introduce a penalty by a horrible factor. The fixed time could also be at the slower end of the bell curve: the average plus half of the full-width-at-half-maximum (FWHM) value or so. Then the fixed time is unlikely to be a desirable value for gamers. |
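That substitution could be sketched as follows, assuming a roughly normal ping distribution (for which FWHM ≈ 2.355σ); the sample values are made up for illustration:

```python
import statistics

def substitute_ping(known_pings):
    """Fixed ping assigned to nodes that block ping: the mean plus
    half the FWHM of the observed ping distribution, so blocking
    ping is never an advantage over a measured value near average."""
    mean = statistics.mean(known_pings)
    sigma = statistics.pstdev(known_pings)
    fwhm = 2.355 * sigma  # FWHM of a normal distribution
    return mean + fwhm / 2

# Illustrative pings in milliseconds from one directory's probes.
pings = [10, 12, 15, 20, 25, 30, 40]
fixed = substitute_ping(pings)
```

By construction the substitute sits above the mean, so a node with a genuinely fast ping gains nothing by blocking it.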
Yes, that would work. But then it would make the node's health look worse than it really is. Do you think it's important to have an overall score in such cases? |
I don't think there will be gaming of the value, hence my suggestion with the average value to be neutral in that metric. I don't really have an opinion about "no health" value in the listing and if that would be better or worse for the node in terms of "advertising" the node to new users. |
That's a good point! "Unknown health" can sound worse than "below average health" to some people. We probably need better explanations to help people interpret the finer points, either way. If we give those nodes an overall score, I think a penalty would be good, mainly to encourage people to allow pings where this is possible: a penalty for not sharing the collective risk of contributing to the average and standard deviation of the ping values. What exactly that penalty would look like (so that it's not an invitation to fiddle the system), we need to see with the data and the calculated probabilities. Something like you said: the average or an above-average ping value, or even a simple 0 = 1. |
Scoring (Q1-3)
The data gives us the following three quartiles: First Quartile: 75
In a boxplot, the whole data looks like this:
Here it is in more detail with the outliers removed, generated by BoxPlotR:
So the quick hard-coded fix will look like this:
|
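The upper boxplot fence used for trimming outliers (Q3 + 1.5 × IQR, mentioned again further down the thread) can be computed from the quartiles like this; the sample data here is made up for illustration:

```python
import statistics

def upper_fence(values):
    """Upper boxplot fence: Q3 + 1.5 * IQR.
    Values above it are treated as extreme outliers."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    iqr = q3 - q1
    return q3 + 1.5 * iqr

# Illustrative request times in milliseconds; 400 is an outlier.
sample = [60, 70, 75, 80, 90, 100, 120, 150, 400]
```

Note that different quantile methods (exclusive vs. inclusive) give slightly different fences; whichever the directory adopts should match the spreadsheet used for the empirical work.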
I love everything about this. |
AVG(
|
I don't know, I'm just taking the raw result from the ping command. Have you tried running a manual ping command against the same domains to see if there's a difference? |
Looking at the The problem occurs when running a query against the database to generate an output. I can actually see some error messages in addition to the distorted output, so the ping itself seems to work without any issues. I'll check what's going on with my DB or phpMyAdmin setup. Just for clarification: when we pull and push as defined in |
I managed to trace the error that occurred when querying the database. It was a result of having modified the table index when I added the two new columns. It's all fixed now and works as expected. 🙂 |
Yes, health is computed locally. |
I'm analysing the data that dir.hubup.pro has generated over the last few days (since we are collecting Preliminarily, it looks promising. I think there are similarities in the quartiles, despite that the actual values for I'm currently looking into ways to calculate the quantiles and extreme values. I found something that looks like a potential direction for how to do quantiles in PHP (see below). Once we know Q1-Q3, you can also calculate the upper "extreme value" cut-off (i.e. Q3 + 1.5 × IQR).
https://blog.poettner.de/2011/06/09/simple-statistics-with-php/ |
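The linked post computes quantiles in PHP; the same closest-ranks linear-interpolation approach, sketched here in Python for illustration (the directory itself would implement it in PHP):

```python
def quantile(sorted_values, p):
    """Quantile by linear interpolation between closest ranks
    (the common 'inclusive' method; 0 <= p <= 1, input pre-sorted)."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    pos = p * (len(sorted_values) - 1)
    lower = int(pos)
    frac = pos - lower
    if lower + 1 < len(sorted_values):
        return sorted_values[lower] + frac * (sorted_values[lower + 1] - sorted_values[lower])
    return sorted_values[lower]

# Illustrative request times; quartiles follow from p = 0.25, 0.5, 0.75.
data = sorted([75, 60, 90, 80, 100])
```

With `data` as above, `quantile(data, 0.5)` is the median, 80.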
This looks good, but I'm not sure what it would be for. Determining the cutoff points? |
Yes. See above.
|
Here's the code with fixed values.
and here with dynamic points based on all
|
I'll admit I didn't expect to have so much fun when I agreed to maintain the Friendica Directory. |
Some preliminary results. Here's a comparison of the two directories, one in Western Europe and the other in Southeast Asia. The datasets were taken at different times and have different total numbers of nodes (181 vs. 270). Results based on this equation:
The coefficient: |
@MrPetovan the explanation I gave for the coefficient is incorrect. #43 (comment) I'll try to give the correct version shortly. Hope you have not already coded this. |
Duplication of nodes
I have noticed there are some duplications in the Something like: But we are quite sure there is only one node running there, despite the difference in protocols. Even more concerning are duplications of entries with identical protocols. For instance, in Hypolite's dataset there are ten (10!) entries for https://libranet.de and about 15 for https://friendica.ladies.community, each with different What's going on there and how do we fix it? |
The behavior is even stranger than you expect. These are the only 15 redundant base_urls in the dir.friendica.social database: The first issue is that there isn't a UNIQUE key on the base URL. The second issue is that there's no reduction to a normalized URL (without |
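A sketch of the URL normalization being described, assuming the goal is a canonical key suitable for a UNIQUE index (the exact rules, such as which scheme wins, are design choices, not the directory's current behavior):

```python
from urllib.parse import urlparse

def normalize_base_url(url):
    """Reduce a base URL to a canonical key so that http/https,
    trailing-slash, and case variants collapse to one entry."""
    parts = urlparse(url.strip())
    host = parts.netloc.lower()
    path = parts.path.rstrip('/')
    # Drop the scheme entirely so http:// and https:// duplicates merge.
    return host + path

urls = [
    "https://libranet.de",
    "http://libranet.de/",
    "https://Libranet.de",
]
deduped = {normalize_base_url(u) for u in urls}
```

All three variants above reduce to the single key `libranet.de`, which a UNIQUE constraint on the normalized column could then enforce.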
Ohh... what did we let ourselves in for here... 😲 I think this issue has affected the stats somehow. The coefficients are too different. Could you run this query again with:
We would like to run some further tests. Thanks. |
Here you are: I did deduplicate base_url but I didn't add the |
Of course, go ahead! |
[OK, I deleted some of my redundant posts above] Ko tested the two new datasets for us and we found some interesting developments. I'm summarising a three-page report here and will give the practical implications. For "dir.friendica.social" the removal of duplicated nodes seemed to make the relationship between However, for "dir.hubup.pro" the data showed there was no relationship between After removing all servers with zero
Here are the coefficients (plus p-values for the likelihood of no relationship).
Practical implication
The calculation of the coefficient and the Q1, Q2, Q3, and IQR values (see here #43 (comment)) must exclude all nodes with zero |
But then each directory server has to automatically recalculate the coefficients from time to time--right? |
Correct, and also its speed score zones. Here's an example for zones: #43 (comment) |
OK, here's the coefficient. Please excuse the non-standard notation. I hope this makes sense.
x = |
Moved to friendica/friendica-directory#4 |
As per this thread, servers that are geographically further from a directory server get lower health scores due to the latency involved.
There's been some discussion on how to adjust the absolute curl timing value, such as:
There may be more that could be done, but perhaps some fairly simple 'corrective' factors applied to that absolute curl process time would help with health-scoring the servers that are further away network-wise.
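One simple corrective factor of the kind suggested is to remove an estimate of the network latency from the total curl time before scoring. This is an illustrative sketch under stated assumptions, not the directory's implementation; the parameter names mirror the `request_time` / `avg_ping` columns discussed in the thread, and `round_trips` is a rough guess at how many ping-equivalent round trips an HTTP request costs:

```python
def latency_adjusted_time(request_time, avg_ping, round_trips=2):
    """Estimate server processing time by removing the network
    component: subtract a small multiple of the measured ping
    from the total request time. All values in milliseconds."""
    if avg_ping is None or avg_ping <= 0:
        return request_time  # no ping data: leave the value unadjusted
    return max(request_time - round_trips * avg_ping, 0.0)

# A far-away node: 800 ms total with a 250 ms ping adjusts to 300 ms,
# comparable to a nearby node at 320 ms total with a 10 ms ping.
```

With this adjustment, a healthy but distant server is no longer penalized for latency the admin cannot control, which is exactly the unfairness the issue describes.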