New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[GEOS-8283, GEOS-8284] Adds status-monitoring community module #2518

Merged
merged 2 commits into from Sep 26, 2017

Conversation

Projects
None yet
5 participants
@nmco
Contributor

nmco commented Sep 13, 2017

Associated issues:

This PR creates an extension point in AbstractStatusPage to allow extensions to contribute new tabs, this work is isolated in is own commit and has is own issue.

This PR adds a new module that will contribute a new tab to GeoServer status page that will show system monitoring information, this information can also be retrieved through GeoServer

The work uses the OSHI library to retrieve the necessary system information, no DLL or native libraries are required. This module should work out of the box for the major operating systems, i.e. Linux, Windows and MacOX.

I have manually test this for Windows (the following images only show a small part of the available information):
image
and for Linux:
image

OSHI library is published under the Eclipse Public License 1.0, another PR will propose the necessary changes to GeoServer license page to allow the use of this library:
https://osgeo-org.atlassian.net/browse/GEOS-8285

@cezio

This comment has been minimized.

Show comment
Hide comment
@cezio

cezio Sep 14, 2017

I have few questions/suggestions here:

  • could you add rest of load average values? at the moment, it's only 1 m average load, there are usually values for 5 and 15 minutes provided by uptime.
  • could you also provide tick count for CPUs (especially aggregated from all units)?
  • could you also add some differentiator (label) for partition stats (if it's not a problem)? the thing is, to get mountpoint, i'll have to parse description string. It would be easier if mountpoint (or partition) would be in separate key.
  • the same is with network interfaces - their names are buried within description.

cezio commented Sep 14, 2017

I have few questions/suggestions here:

  • could you add rest of load average values? at the moment, it's only 1 m average load, there are usually values for 5 and 15 minutes provided by uptime.
  • could you also provide tick count for CPUs (especially aggregated from all units)?
  • could you also add some differentiator (label) for partition stats (if it's not a problem)? the thing is, to get mountpoint, i'll have to parse description string. It would be easier if mountpoint (or partition) would be in separate key.
  • the same is with network interfaces - their names are buried within description.
@cezio

This comment has been minimized.

Show comment
Hide comment
@cezio

cezio Sep 14, 2017

One more thing: at the moment it's not possible to fetch from status endpoint if client is not authenticated. Could we make it opened to unauthenticated clients as well?

cezio commented Sep 14, 2017

One more thing: at the moment it's not possible to fetch from status endpoint if client is not authenticated. Could we make it opened to unauthenticated clients as well?

@cezio

This comment has been minimized.

Show comment
Hide comment
@cezio

cezio Sep 14, 2017

One more thing, i've noticed: value is mostly a string, but sometimes it's int. Even for similar metrics:

{
"value": "57716922154,00",
"available": true,
"description": "Network interfaces received",
"name": "NETWORK_INTERFACES_RECEIVED",
"unit": "bytes",
"category": "NETWORK",
"priority": 801
},
-{
"value": 4099570589,
"available": true,
"description": "Network interface [wlp3s0] send",
"name": "NETWORK_INTERFACE_SEND",
"unit": "bytes",
"category": "NETWORK",
"priority": 903
},

cezio commented Sep 14, 2017

One more thing, i've noticed: value is mostly a string, but sometimes it's int. Even for similar metrics:

{
"value": "57716922154,00",
"available": true,
"description": "Network interfaces received",
"name": "NETWORK_INTERFACES_RECEIVED",
"unit": "bytes",
"category": "NETWORK",
"priority": 801
},
-{
"value": 4099570589,
"available": true,
"description": "Network interface [wlp3s0] send",
"name": "NETWORK_INTERFACE_SEND",
"unit": "bytes",
"category": "NETWORK",
"priority": 903
},
@bencaradocdavies

This comment has been minimized.

Show comment
Hide comment
@bencaradocdavies

bencaradocdavies Sep 16, 2017

Member

@nmco in my view the GeoServer license change is a major change that affects all of GeoServer and requires a GSIP. I think this is a good opportunity discuss whether we use more general language to permit any EPL-licensed code to be included without being covered by the GPL. This would give us a long term solution to the GPL-EPL conflict.

GNU provide some example text for exceptions (but the language is more for native libraries):
http://www.gnu.org/licenses/gpl-faq.html#GPLIncompatibleLibs

@jodygarnett is an expert on licensing issues and may be able to provide suitable text or a contact as OSGeo legal.

Member

bencaradocdavies commented Sep 16, 2017

@nmco in my view the GeoServer license change is a major change that affects all of GeoServer and requires a GSIP. I think this is a good opportunity discuss whether we use more general language to permit any EPL-licensed code to be included without being covered by the GPL. This would give us a long term solution to the GPL-EPL conflict.

GNU provide some example text for exceptions (but the language is more for native libraries):
http://www.gnu.org/licenses/gpl-faq.html#GPLIncompatibleLibs

@jodygarnett is an expert on licensing issues and may be able to provide suitable text or a contact as OSGeo legal.

@simboss

This comment has been minimized.

Show comment
Hide comment
@simboss

simboss Sep 17, 2017

Member

Hi @bencaradocdavies I agree with you but I would split concerns here: this contribution was already discussed on the mailing list and as a result we ended up going towards the same direction of other EPL modules like EMF. Notice that GeoServer already contains exceptions for EPL, hence we are not adding an exception, we just need to mention this library as well:

As an exception to the terms of the GPL, you may copy, modify,

The notice for EPL license is included below:

I am happy to help with coming up with a more general solution for this kind of problems but given the current situation and the previous exchange, I don't think we should stop this work at this point.

My suggestion then is to task @nmco to take responsibility for bringing the general EPL topic to the next PSC meeting and work with Jody towards a something that would it clear in the future how to address EPL libraries.

Member

simboss commented Sep 17, 2017

Hi @bencaradocdavies I agree with you but I would split concerns here: this contribution was already discussed on the mailing list and as a result we ended up going towards the same direction of other EPL modules like EMF. Notice that GeoServer already contains exceptions for EPL, hence we are not adding an exception, we just need to mention this library as well:

As an exception to the terms of the GPL, you may copy, modify,

The notice for EPL license is included below:

I am happy to help with coming up with a more general solution for this kind of problems but given the current situation and the previous exchange, I don't think we should stop this work at this point.

My suggestion then is to task @nmco to take responsibility for bringing the general EPL topic to the next PSC meeting and work with Jody towards a something that would it clear in the future how to address EPL libraries.

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 17, 2017

Contributor

could you add rest of load average values? at the moment, it's only 1 m average load, there are usually values for 5 and 15 minutes provided by uptime.

Right now there is only the system load average for 1 minutes indeed, the documentation of OSHI says its possible but I could not find the necessary methods in the API. I will double check this again.

could you also provide tick count for CPUs (especially aggregated from all units)?

The goal here was to provide only very simple system information (in my opinion load per CPU is already too much) that can be used to quickly identify that something is not right with the server.

The tick count for CPUs is quite a detailed and low level metric, do you see any specific use case were having this value will help identify a general issue with the server ? The meaning of this metrics is also quite dependent on the time and that will really complicate things, right now we don't have any specific time dimension.

the same is with network interfaces - their names are buried within description.

could you also add some differentiator (label) for partition stats (if it's not a problem)? the thing is, to get mountpoint, i'll have to parse description string. It would be easier if mountpoint (or partition) would be in separate key.

Identifiers can be easily retrieved from the description using the regex ^.*?\[(.*?)\][^\]]*$, but I guess an identifier attribute can be added and it will only be filled when necessary.

Contributor

nmco commented Sep 17, 2017

could you add rest of load average values? at the moment, it's only 1 m average load, there are usually values for 5 and 15 minutes provided by uptime.

Right now there is only the system load average for 1 minutes indeed, the documentation of OSHI says its possible but I could not find the necessary methods in the API. I will double check this again.

could you also provide tick count for CPUs (especially aggregated from all units)?

The goal here was to provide only very simple system information (in my opinion load per CPU is already too much) that can be used to quickly identify that something is not right with the server.

The tick count for CPUs is quite a detailed and low level metric, do you see any specific use case were having this value will help identify a general issue with the server ? The meaning of this metrics is also quite dependent on the time and that will really complicate things, right now we don't have any specific time dimension.

the same is with network interfaces - their names are buried within description.

could you also add some differentiator (label) for partition stats (if it's not a problem)? the thing is, to get mountpoint, i'll have to parse description string. It would be easier if mountpoint (or partition) would be in separate key.

Identifiers can be easily retrieved from the description using the regex ^.*?\[(.*?)\][^\]]*$, but I guess an identifier attribute can be added and it will only be filled when necessary.

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 17, 2017

Contributor

One more thing: at the moment it's not possible to fetch from status endpoint if client is not authenticated. Could we make it opened to unauthenticated clients as well?

For sure this should be secured by default, publishing server details on the open internet by default doesn't seems like a good idea, to be honest I cannot even think about a use case where this is a good idea, what is your use case for this ?

Anyway I need to check if it is possible to add a new element to the security chain to allow no authenticated users to access the REST end point.

Contributor

nmco commented Sep 17, 2017

One more thing: at the moment it's not possible to fetch from status endpoint if client is not authenticated. Could we make it opened to unauthenticated clients as well?

For sure this should be secured by default, publishing server details on the open internet by default doesn't seems like a good idea, to be honest I cannot even think about a use case where this is a good idea, what is your use case for this ?

Anyway I need to check if it is possible to add a new element to the security chain to allow no authenticated users to access the REST end point.

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 17, 2017

Contributor

One more thing, i've noticed: value is mostly a string, but sometimes it's int. Even for similar metrics:

Hummm this is quite strange indeed, I will need to investigate. Thanks for the heads up.

Contributor

nmco commented Sep 17, 2017

One more thing, i've noticed: value is mostly a string, but sometimes it's int. Even for similar metrics:

Hummm this is quite strange indeed, I will need to investigate. Thanks for the heads up.

@bencaradocdavies

This comment has been minimized.

Show comment
Hide comment
@bencaradocdavies

bencaradocdavies Sep 17, 2017

Member

@simboss if there is majority support from the PSC then I am comfortable with patching the license to add an exception for OSHI as an interim measure. I do not want to delay the status-monitoring work. I think that the PSC should review the GeoServer license and take advice from @jodygarnett on where EPL is heading, but this need not delay the status-monitoring module. I will start a discussion on the mailing list.

Member

bencaradocdavies commented Sep 17, 2017

@simboss if there is majority support from the PSC then I am comfortable with patching the license to add an exception for OSHI as an interim measure. I do not want to delay the status-monitoring work. I think that the PSC should review the GeoServer license and take advice from @jodygarnett on where EPL is heading, but this need not delay the status-monitoring module. I will start a discussion on the mailing list.

@cezio

This comment has been minimized.

Show comment
Hide comment
@cezio

cezio Sep 18, 2017

Right now there is only the system load average for 1 minutes indeed, the documentation of OSHI says its possible but I could not find the necessary methods in the API. I will double check this again.

could you also provide tick count for CPUs (especially aggregated from all units)?
The goal here was to provide only very simple system information (in my opinion load per CPU is already too much) that can be used to quickly identify that something is not right with the server.
The tick count for CPUs is quite a detailed and low level metric, do you see any specific use case were having this value will help identify a general issue with the server ? The meaning of this metrics is also quite dependent on the time and that will really complicate things, right now we don't have any specific time dimension.

I agree that per-CPU metrics are not needed. We calculate cpu usage in monitoring based on ticks, but since you provide usage percentage, then maybe it would be enough.

Speaking of low-level metrics, I don't think we'll use voltage/temperature/fan speed.

the same is with network interfaces - their names are buried within description.
could you also add some differentiator (label) for partition stats (if it's not a problem)? the thing is, to get mountpoint, i'll have to parse description string. It would be easier if mountpoint (or partition) would be in separate key.

Identifiers can be easily retrieved from the description using the regex ^.?[(.?)][^]]*$, but I guess an identifier attribute can be added and it will only be filled when necessary.

We can use description field for plain iface name. No need to wrap it with arbitrary string, since there's metric type that describes it already.

One more thing: at the moment it's not possible to fetch from status endpoint if client is not authenticated. Could we make it opened to unauthenticated clients as well?

For sure this should be secured by default, publishing server details on the open internet by default doesn't seems like a good idea, to be honest I cannot even think about a use case where this is a good idea, what is your use case for this ?
Anyway I need to check if it is possible to add a new element to the security chain to allow no authenticated users to access the REST end point.

The thing is, code that does requests is headless, so no session information will be provided. However, maybe we could add some token or signature-based authentication.

cezio commented Sep 18, 2017

Right now there is only the system load average for 1 minutes indeed, the documentation of OSHI says its possible but I could not find the necessary methods in the API. I will double check this again.

could you also provide tick count for CPUs (especially aggregated from all units)?
The goal here was to provide only very simple system information (in my opinion load per CPU is already too much) that can be used to quickly identify that something is not right with the server.
The tick count for CPUs is quite a detailed and low level metric, do you see any specific use case were having this value will help identify a general issue with the server ? The meaning of this metrics is also quite dependent on the time and that will really complicate things, right now we don't have any specific time dimension.

I agree that per-CPU metrics are not needed. We calculate cpu usage in monitoring based on ticks, but since you provide usage percentage, then maybe it would be enough.

Speaking of low-level metrics, I don't think we'll use voltage/temperature/fan speed.

the same is with network interfaces - their names are buried within description.
could you also add some differentiator (label) for partition stats (if it's not a problem)? the thing is, to get mountpoint, i'll have to parse description string. It would be easier if mountpoint (or partition) would be in separate key.

Identifiers can be easily retrieved from the description using the regex ^.?[(.?)][^]]*$, but I guess an identifier attribute can be added and it will only be filled when necessary.

We can use description field for plain iface name. No need to wrap it with arbitrary string, since there's metric type that describes it already.

One more thing: at the moment it's not possible to fetch from status endpoint if client is not authenticated. Could we make it opened to unauthenticated clients as well?

For sure this should be secured by default, publishing server details on the open internet by default doesn't seems like a good idea, to be honest I cannot even think about a use case where this is a good idea, what is your use case for this ?
Anyway I need to check if it is possible to add a new element to the security chain to allow no authenticated users to access the REST end point.

The thing is, code that does requests is headless, so no session information will be provided. However, maybe we could add some token or signature-based authentication.

@cezio

This comment has been minimized.

Show comment
Hide comment
@cezio

cezio Sep 18, 2017

However, maybe we could add some token or signature-based authentication.

Actually, there's one more solution: leave this endpoint open in GeoServer, and add web-server level access restrictions (by ip for example). Consider scenario, when GeoServer is in separate network from GeoNode, and only way to access it is through public IP.

This however adds some work for sysadmin when deploying monitoring.

cezio commented Sep 18, 2017

However, maybe we could add some token or signature-based authentication.

Actually, there's one more solution: leave this endpoint open in GeoServer, and add web-server level access restrictions (by ip for example). Consider scenario, when GeoServer is in separate network from GeoNode, and only way to access it is through public IP.

This however adds some work for sysadmin when deploying monitoring.

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 19, 2017

Contributor

I agree that per-CPU metrics are not needed. We calculate cpu usage in monitoring based on ticks, but since you provide usage percentage, then maybe it would be enough.

Speaking of low-level metrics, I don't think we'll use voltage/temperature/fan speed.

When I used the "low level" term I was referring to how easy is actually to interpret the metric. A quick look at the sensors values say a lot about our system, if the temperature is above 100ºC we may have a problem. But the CPU ticks will not tell much to most of the users.

Contributor

nmco commented Sep 19, 2017

I agree that per-CPU metrics are not needed. We calculate cpu usage in monitoring based on ticks, but since you provide usage percentage, then maybe it would be enough.

Speaking of low-level metrics, I don't think we'll use voltage/temperature/fan speed.

When I used the "low level" term I was referring to how easy is actually to interpret the metric. A quick look at the sensors values say a lot about our system, if the temperature is above 100ºC we may have a problem. But the CPU ticks will not tell much to most of the users.

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 19, 2017

Contributor

We can use description field for plain iface name. No need to wrap it with arbitrary string, since there's metric type that describes it already.

No idea what you mean where 😄 will you use the description field ?

Contributor

nmco commented Sep 19, 2017

We can use description field for plain iface name. No need to wrap it with arbitrary string, since there's metric type that describes it already.

No idea what you mean where 😄 will you use the description field ?

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 19, 2017

Contributor

The thing is, code that does requests is headless, so no session information will be provided. However, maybe we could add some token or signature-based authentication.

This seems like a very specific use case, in any case security is a separated concern that needs to be handled at an higher level (i.e. we will not implement a specific security mechanism for the status-monitoring module).

The authkey module maybe the solution for your specific need.

Contributor

nmco commented Sep 19, 2017

The thing is, code that does requests is headless, so no session information will be provided. However, maybe we could add some token or signature-based authentication.

This seems like a very specific use case, in any case security is a separated concern that needs to be handled at an higher level (i.e. we will not implement a specific security mechanism for the status-monitoring module).

The authkey module maybe the solution for your specific need.

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 20, 2017

Contributor

Add load averages for 5 minutes and 15 minutes, add an identifier attribute that will allow the identification of the measured resource (in the case of a network interface, for example, it will be the network interface name) and fixed the JSON encoding of primitive values.

Contributor

nmco commented Sep 20, 2017

Add load averages for 5 minutes and 15 minutes, add an identifier attribute that will allow the identification of the measured resource (in the case of a network interface, for example, it will be the network interface name) and fixed the JSON encoding of primitive values.

@nmco

This comment has been minimized.

Show comment
Hide comment
@nmco

nmco Sep 20, 2017

Contributor

Pull request for GeoServer license update available here: #2529.

Contributor

nmco commented Sep 20, 2017

Pull request for GeoServer license update available here: #2529.

@nmco nmco merged commit 21f0254 into geoserver:master Sep 26, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@nmco nmco changed the title from Adds status-monitoring community module to [GEOS-8283, GEOS-8284] Adds status-monitoring community module Sep 26, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment