GSoC 2014 project ideas

Daniel Pocock edited this page Mar 15, 2014 · 32 revisions

Prospective mentors:

  • Please add ideas below by copying the template, and include your name as a mentor.
  • Link to your blog, GitHub profile, etc.

Students:

Everybody: don't edit this wiki page interactively for a long time. Write your idea in a text editor first, then paste it into the page and save quickly. Don't keep the page open for editing, or you might find you cannot submit because somebody else submitted in the meantime.

RRDtool data access from data analysis frameworks (Data Science, Statistics)

NVIDIA GPU monitoring enhancements

  • Mentors
  • NVIDIA GPU metrics can be collected via the NVIDIA gmond plugin, but Gweb currently does not visualize them well
  • Tasks
    • Update the plugin to support the new metrics exposed by newer versions of NVML
    • Update web interface to support summarizing GPU graphs under Host Overview
    • Update web interface to better visualize data with many GPU cores
    • Streamline plugin installation by removing the need to patch core Ganglia code, and ensure compatibility with current versions of Ganglia
  • Required skills: PHP, jQuery, Python, C, NVML
  • Students will be provided access to a compute cluster with NVIDIA GPUs for development/testing

Create cluster-wide metric aggregation for arbitrary metrics

Right now, the cluster view shows overall metrics for the cluster above all the individual hosts: it sums load, memory, and network, and averages CPU. It would be useful to have aggregated values for many or all of the other metrics in the cluster as well. One way to do this at the moment is an Aggregated Graph. These graphs can be added to views and provide much of the benefit of aggregating metrics at the cluster level. However, aggregate graphs lose their value in an environment with high machine turnover, such as an Amazon Auto Scaling Group. As machines leave the cluster, their historic metrics are no longer counted in the aggregate graphs, so the historic values shown are drastically too low. Aggregating metrics and storing the aggregated value at the cluster level makes this problem go away.

The task:

  • create a method for aggregating some or all metrics present in a cluster at the cluster level.
    • include a method for specifying the aggregating function to use (sum, average, min, max)
  • add the aggregated metrics to the cluster view below the individual hosts in the cluster or create a pseudo-host named 'all-$clustername' and store all the aggregated metrics there.
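
The aggregation step in the task list above could look roughly like the following. This is a minimal Python sketch for illustration only, not a proposal for how Ganglia should implement it: the names `AGGREGATORS`, `aggregate_cluster` and `pseudo_host_name`, and the shape of the input dictionary, are assumptions made for this example.

```python
from statistics import mean

# The four aggregating functions named in the task list.
# The registry itself is a hypothetical structure for this sketch.
AGGREGATORS = {
    "sum": sum,
    "average": mean,
    "min": min,
    "max": max,
}


def aggregate_cluster(host_metrics, functions, default="sum"):
    """Collapse per-host values into one value per metric.

    host_metrics: {host name: {metric name: numeric value}}
    functions:    {metric name: key into AGGREGATORS}
    Metrics without an entry in `functions` use `default`.
    """
    # Group the samples of every metric across all hosts.
    by_metric = {}
    for values in host_metrics.values():
        for name, value in values.items():
            by_metric.setdefault(name, []).append(value)
    # Apply the chosen aggregating function to each group.
    return {
        name: AGGREGATORS[functions.get(name, default)](samples)
        for name, samples in by_metric.items()
    }


def pseudo_host_name(cluster_name):
    """Name of the pseudo-host suggested in the task list."""
    return "all-" + cluster_name
```

Storing the result under the pseudo-host at each polling interval, rather than recomputing it from the hosts still present, is what would keep historic aggregates intact when machines leave the cluster.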

Mentors:

Internal Ganglia server metrics

It's slightly ironic that a metric collection system does not produce performance metrics of its own. The ganglia server daemon suffers performance degradation under extreme load and having internal metrics would greatly improve the ability to diagnose these issues and increase the scalability of the server daemon.

The task:

  • add an endpoint to the gmetad daemon; a simple TCP listener would do.
  • the listener would respond to a "STAT" request with a JSON-encoded list of internal metrics
  • these internal metrics would include things like:
    • general info about the server, e.g. version number, uptime, port numbers it is listening on
    • number of clusters, hosts and metrics collected
    • memory and cpu consumption information
    • number of rrd writes
    • number of metrics forwarded to other tools, e.g. graphite, riemann
  • develop a web page that can query for the metrics (using JavaScript JSONP callbacks) and display them in a visually useful way (either via a simple table or graphically)

  • Mentors:

  • Required skills: C, TCP/IP for gmetad, JavaScript/jQuery for web development
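
To make the "STAT" exchange concrete, here is a small Python sketch of the request/response behaviour described above. The real work would be done in C inside gmetad; this only illustrates the protocol. The field names, the placeholder values, and the helper names (`collect_stats`, `handle_command`, `make_server`) are assumptions for the example, not an agreed format.

```python
import json
import socketserver
import time

START_TIME = time.time()


def collect_stats():
    """Return internal metrics. All values here are placeholders."""
    return {
        "version": "0.0.0-example",  # hypothetical version string
        "uptime_seconds": int(time.time() - START_TIME),
        "clusters": 0,
        "hosts": 0,
        "metrics": 0,
        "rrd_writes": 0,
    }


def handle_command(line):
    """Build the reply bytes for one request line."""
    if line.strip() == b"STAT":
        reply = collect_stats()
    else:
        reply = {"error": "unknown command"}
    return json.dumps(reply).encode("utf-8") + b"\n"


class StatHandler(socketserver.StreamRequestHandler):
    """Reads one line per connection and answers with one JSON line."""

    def handle(self):
        self.wfile.write(handle_command(self.rfile.readline()))


def make_server(host="127.0.0.1", port=0):
    """Create the listener; port 0 lets the OS pick a free port."""
    return socketserver.TCPServer((host, port), StatHandler)
```

Calling `make_server(port=...).serve_forever()` would start the listener. A client then sends the line `STAT` and reads back a single line of JSON.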

API

Allow internals of gmetad/gmond to be exposed via a REST-ful interface, both for monitoring and as another method of exchanging data.

The tasks:

  • Create a simple embedded REST-ful HTTP server with optional authentication
  • Support, at a minimum, the following tasks:

    • Basic metric querying for service internals (number of queries submitted, etc)
    • Reporting uptime and version
    • Retrieving an object representation of some part of the metric trees
  • Mentors:

  • Required skills: C
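
As a rough illustration of the endpoints listed above, the following Python sketch serves uptime, version and part of a metric tree over HTTP. The project itself calls for C, and authentication is left out here. The URL layout (`/status`, `/metrics/...`), the `METRIC_TREE` contents and the version string are all hypothetical.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
VERSION = "0.0.0-example"  # placeholder, not a real Ganglia version

# Hypothetical in-memory metric tree: cluster -> host -> metric -> value
METRIC_TREE = {"example-cluster": {"node1": {"load_one": 0.25}}}


def route(path):
    """Map a request path to (HTTP status, JSON-serialisable body)."""
    parts = [p for p in path.split("/") if p]
    if parts == ["status"]:
        uptime = int(time.time() - START_TIME)
        return 200, {"version": VERSION, "uptime_seconds": uptime}
    if parts[:1] == ["metrics"]:
        # Walk down the tree one path segment at a time.
        node = METRIC_TREE
        for key in parts[1:]:
            if not isinstance(node, dict) or key not in node:
                return 404, {"error": "not found"}
            node = node[key]
        return 200, node
    return 404, {"error": "not found"}


class ApiHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        status, body = route(self.path)
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the example server until interrupted."""
    HTTPServer(("127.0.0.1", port), ApiHandler).serve_forever()
```

Keeping the routing in a plain function, separate from the HTTP handler, makes the path-to-subtree mapping easy to test without opening a socket.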

Gweb enhancement: Tabular data views

  • Mentors
  • Today Gweb provides the capability to create custom views that are collections of graphs
  • Users have requested the ability to define "realtime" tabular views that show current metric values organized in a specified row-column layout.
  • Required skills: PHP, jQuery
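
One way to picture the requested row-column layout is sketched below in Python (Gweb itself is PHP, so this is purely illustrative). It assumes rows are hosts and columns are metrics, which is only one possible layout. `build_table` and its arguments are hypothetical names.

```python
def build_table(current_values, rows, columns, missing="-"):
    """Arrange current metric values into a header row plus one row per host.

    current_values: {host name: {metric name: current value}}
    rows:           host names, in display order
    columns:        metric names, in display order
    """
    table = [["host"] + list(columns)]
    for host in rows:
        host_values = current_values.get(host, {})
        # Fill gaps with a marker so every row has the same width.
        table.append([host] + [host_values.get(m, missing) for m in columns])
    return table
```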

Gweb enhancement: Monitoring HPC parallel jobs

Warehousing data from monitoring tools (Data Science, NoSQL)

  • Mentors
  • Skills
    • C programming and possibly Python
  • Systematically storing data from various monitoring tools into a MongoDB database
  • rsyslog-ommongodb already does this for syslog events. Any student applying for this project should try it out; it is very easy to set up
  • The student will then do the same thing (possibly referring to that rsyslog source code for ideas) for other data sources:
    • the XML network state created by Ganglia gmetad - can this be generated in JSON format and stored to MongoDB?
    • service state events from Nagios
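
For the gmetad question above, a conversion from the XML network state to JSON-style documents might look like this Python sketch. It assumes the XML nests CLUSTER, HOST and METRIC elements with NAME and VAL attributes. The sample document is abbreviated and made up, and the output document layout is just one possible choice.

```python
import json
import xml.etree.ElementTree as ET

# Abbreviated, made-up sample in the assumed shape of gmetad's XML output.
SAMPLE_XML = """<GANGLIA_XML VERSION="3.6.0" SOURCE="gmetad">
  <GRID NAME="example-grid">
    <CLUSTER NAME="example-cluster">
      <HOST NAME="node1" IP="192.0.2.10" REPORTED="1394870400">
        <METRIC NAME="load_one" VAL="0.25" TYPE="float" UNITS=""/>
        <METRIC NAME="mem_free" VAL="1024" TYPE="float" UNITS="KB"/>
      </HOST>
    </CLUSTER>
  </GRID>
</GANGLIA_XML>"""


def xml_to_documents(xml_text):
    """Return one dictionary per host, ready to be stored as a document."""
    root = ET.fromstring(xml_text)
    documents = []
    for cluster in root.iter("CLUSTER"):
        for host in cluster.iter("HOST"):
            documents.append({
                "cluster": cluster.get("NAME"),
                "host": host.get("NAME"),
                "reported": int(host.get("REPORTED", "0")),
                # Values are kept as strings, exactly as found in the XML.
                "metrics": {
                    m.get("NAME"): m.get("VAL") for m in host.iter("METRIC")
                },
            })
    return documents


def to_json(documents):
    """Serialise the documents for inspection or for a JSON export."""
    return json.dumps(documents, indent=2)
```

Each dictionary returned by `xml_to_documents` could then be handed to a MongoDB driver for insertion. That step is omitted here to keep the sketch free of external dependencies.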

Web portal integrating popular monitoring tools

  • Mentors
  • Skills
    • PHP or Python Flask
    • JavaScript, jQuery, bootstrap
  • There are many popular monitoring tools with web interfaces:
    • Ganglia: for gathering performance metrics
    • Nagios: for checking services are working
    • LogAnalyzer: for inspecting log messages from machines and processes in a network
  • Each tool has its own web interface
  • This project would involve trying to unify the web interfaces
    • When the user clicks a hostname in any of these web tools, show them a popup with links to the pages about that host in the other tools
    • Adapt each tool to expose some data using REST
    • Create a portal that gathers the REST data from individual tools and displays a unified page for a given host or service

Improving integration between Ganglia and Nagios (Python)

  • Mentors
  • Skills
    • Python
  • Improve the Ganglia Nagios Bridge
    • Refer to the issue tracker on GitHub for some ideas
    • Separate the code that generates Nagios checkresults files into a standalone Python class
    • Automatically detect the hosts that Nagios knows about and avoid writing checkresults for those that don't already exist in Nagios (possibly using MongoDB to share configuration data between Nagios and Ganglia-Nagios-Bridge)
    • Provide a more advanced way to configure the alerting thresholds
    • Support for thresholds that depend on multiple metrics (for example, take the total disk space and used bytes metrics and calculate the percentage used)
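
The multi-metric threshold idea in the last item can be sketched as follows. This Python example is not based on the current Ganglia Nagios Bridge code. The function names and the metric keys passed in are hypothetical. Only the numeric state codes (0 = OK, 1 = WARNING, 2 = CRITICAL) follow the usual Nagios convention.

```python
# Nagios state codes
OK, WARNING, CRITICAL = 0, 1, 2


def percent_used(total, used):
    """Derive a percentage from two raw metrics."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * used / total


def check_derived(metrics, total_key, used_key, warn, crit):
    """Compare a derived percentage against warning/critical thresholds.

    metrics: {metric name: value} for a single host
    Returns (state code, derived percentage).
    """
    pct = percent_used(float(metrics[total_key]), float(metrics[used_key]))
    if pct >= crit:
        return CRITICAL, pct
    if pct >= warn:
        return WARNING, pct
    return OK, pct
```

For example, with a total of 200 and 190 used, and thresholds of 80 (warning) and 90 (critical), the derived value is 95 percent and the state is CRITICAL.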

Improving JMXetric

  • Mentors
  • Skills
    • Java
  • Improve JMXetric
    • Refer to the issue tracker on GitHub for some ideas
    • auto-discovery of the gmond.conf settings
    • auto-discovery of useful JMX metrics, generating a sample config file on the fly
    • find a way to unify the configuration of metrics in jmxetric.xml and thresholds in the ganglia-nagios-bridge
    • practical testing with popular Java servers such as JBoss, HornetQ, ServiceMix and Spring, and working with those projects to create additional metrics in their JMX MBeans