Research: Service Stats on Crashes #364

nh758 · 2024-02-20T04:11:05Z

“When the site crashes, I want to see a report of the current/average CPU and memory usage of the various services.”

Requirements

zachhh3 · 2024-02-27T07:31:00Z

Suggestion: Prometheus, an open-source monitoring and alerting toolkit.

Real-Time Monitoring: Prometheus scrapes metrics at a configurable interval, giving near-real-time visibility into the performance of our site, including the CPU and memory usage of the various services.
Alerting and Notification: With Prometheus, we can define alerting rules that trigger notifications when a metric exceeds a predefined threshold or when a service stops responding (e.g. the site crashing).
Historical Analysis: Prometheus stores historical metrics data, allowing us to analyze trends, identify patterns, and perform root cause analysis of performance issues.

Setup:

Instrument the Applications:

Instrument web applications with Prometheus client libraries to collect custom metrics. For a Node.js application, you would use the prom-client library.
Add instrumentation code to the application to expose the relevant metrics.
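
To make the instrumentation step more concrete, here is a minimal, dependency-free sketch of what collecting CPU and memory metrics in a Node.js service could look like. It uses only Node built-ins rather than prom-client, so it illustrates the idea and is not the library's API. The function names `collectSnapshot` and `renderMetrics` and the `host_*` metric names are assumptions made for this example; in practice prom-client would handle collection and formatting for us.

```javascript
const os = require('os');

// Hypothetical helper: take a snapshot of process and host resource usage
// using Node built-ins only (no prom-client).
function collectSnapshot() {
  const cpu = process.cpuUsage();    // user/system CPU time in microseconds
  const mem = process.memoryUsage(); // memory figures in bytes
  return {
    process_cpu_seconds_total: (cpu.user + cpu.system) / 1e6,
    process_resident_memory_bytes: mem.rss,
    host_load_average_1m: os.loadavg()[0], // assumed name; reads 0 on Windows
    host_free_memory_bytes: os.freemem(),  // assumed name
  };
}

// Hypothetical helper: render a snapshot as "name value" lines, the simplest
// form of the Prometheus text exposition format (HELP/TYPE lines omitted).
function renderMetrics(snapshot) {
  return Object.entries(snapshot)
    .map(([name, value]) => `${name} ${value}`)
    .join('\n') + '\n';
}
```

Calling `renderMetrics(collectSnapshot())` yields one line per metric, which is the kind of payload a metrics endpoint would return.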

Expose Metrics Endpoints:

Expose an HTTP endpoint (conventionally /metrics) that serves the metrics data. This endpoint should return the data in a format Prometheus can scrape, typically its plain-text exposition format.
Configure the web server or application framework to handle requests to this metrics endpoint.
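
As a rough sketch of the endpoint step, the handler below serves metrics over plain Node `http`. The names `currentMetricsText`, `handleRequest`, and `startMetricsServer` are hypothetical and exist only for this example. The single metric shown is a placeholder for whatever the instrumentation produces, and an Express or prom-client based setup would look different.

```javascript
const http = require('http');

// Hypothetical stand-in for the real instrumentation output:
// a single metric line in the Prometheus text exposition format.
function currentMetricsText() {
  const rssBytes = process.memoryUsage().rss;
  return `process_resident_memory_bytes ${rssBytes}\n`;
}

// Answer GET /metrics with the metrics text; everything else gets a 404.
function handleRequest(req, res) {
  if (req.method === 'GET' && req.url === '/metrics') {
    res.writeHead(200, {
      'Content-Type': 'text/plain; version=0.0.4; charset=utf-8',
    });
    res.end(currentMetricsText());
    return;
  }
  res.writeHead(404, { 'Content-Type': 'text/plain' });
  res.end('Not found\n');
}

// Start the server on the given port (the port is chosen by the caller).
function startMetricsServer(port) {
  return http.createServer(handleRequest).listen(port);
}
```

With something like `startMetricsServer(9100)` running (the port is an arbitrary choice here), Prometheus would be pointed at that host and port as a scrape target.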

Visualization (Optional):

Grafana, a visualization tool that integrates well with Prometheus, could be used to visualize the data. It lets us build dashboards from the metrics that Prometheus collects.

Alerting:

Configure alerting rules in Prometheus to trigger alerts when collected metrics cross performance thresholds or show anomalies. For example: alerts for high CPU usage, high memory usage, increased error rates, or site crashes.

zachhh3 self-assigned this Feb 26, 2024

zachhh3 closed this as completed Mar 25, 2024

nh758 mentioned this issue Mar 26, 2024

Set up Monitoring #376

Closed

4 tasks