# Installing Prometheus

This notebook works through the exercises from the course. It performs the following steps in order:

* Clones this repository
* Downloads Prometheus
* Prepares a simple prometheus.yml and starts Prometheus
* Compiles the [demo-metrics-producer](./demo-metrics-producer/) and starts 3 instances
* Updates the scrape_configs in prometheus.yml to scrape the demo-metrics-producer
* Downloads Node Exporter, starts it and updates the Prometheus configuration
* Runs cAdvisor and updates the Prometheus configuration
* Builds the custom Node Exporter service [cpu-metrics-exporter](./cpu-metrics-exporter/) and starts it


## Configuration parameters

The following block lists some of the parameters that are used across the cells:

In [None]:
WORKDIR="/tmp/lfs-prometheus"

PROM_VERSION="2.49.1"
DEMO_VERSION="0.11.1"
NODEEXP_VERSION="1.7.0"
CADVISOR_VERSION="0.36.0"
CONSUL_VERSION="1.17.3"
BLACKBOX_EXPORTER_VERSION="0.24.0"
PUSHGATEWAY_VERSION="1.7.0"
ALERTMANAGER_VERSION="0.26.0"

### Clone the repository

In [None]:
%%bash -s {WORKDIR}

{
    echo "Cloning the repo"
    WORKDIR=$1
    if [ ! -e ${WORKDIR}/.git ]; then 
        git clone https://github.com/ennc0d3-learn/lfs-prometheus ${WORKDIR} 
    else 
        cd ${WORKDIR} && git pull --rebase
    fi
    echo "Ready to go, ${WORKDIR}!"
}

### Cleanup

Run this cell to stop all processes and containers that the notebook has started.

In [None]:

%%bash -s {WORKDIR}
{
    echo "Stopping all processes and containers that are started"
    killall prometheus
    killall node_exporter
    killall consul
    killall prometheus_demo_service
    # The Python services run under the interpreter, so match on the full command line
    pkill -f cpu-metric-exporter.py
    killall blackbox_exporter
    killall pushgateway alertmanager
    pkill -f alertreceiver_webhook.py
        
    docker rm -f cadvisor
}

### Start Prometheus

- Downloads Prometheus
- Writes the scrape config
- Starts Prometheus

In [None]:
%%bash -s  {WORKDIR} {PROM_VERSION}
(
    WORKDIR=$1
    PROM_VERSION=$2
    cd ${WORKDIR}
    wget -q https://github.com/prometheus/prometheus/releases/download/v${PROM_VERSION}/prometheus-${PROM_VERSION}.linux-amd64.tar.gz
    tar -zxf prometheus-${PROM_VERSION}.*.gz
    rm -f prometheus-${PROM_VERSION}.*.gz
    echo "Downloaded prometheus ${PROM_VERSION}"
)


##### Create the scrape config

In [None]:
%%bash -s {WORKDIR} {PROM_VERSION}
(
    WORKDIR=$1
    PROM_VERSION=$2
    prometheus_dir=${WORKDIR}/prometheus-${PROM_VERSION}*
    cat > ${prometheus_dir}/prometheus.yml <<-EOD
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
EOD
echo "Created the prometheus.yml"
cat ${prometheus_dir}/prometheus.yml
)

##### Start prometheus

In [None]:
%%bash -s {WORKDIR}
(
    WORKDIR=$1
    cd $WORKDIR/prometheus-*/
    killall prometheus
    # Discard stdout and stderr so the background process does not block the cell
    ./prometheus > /dev/null 2>&1 &
    
)
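
Before moving on, it can help to confirm that Prometheus has finished starting. The following is a minimal sketch, assuming Prometheus listens on its default address `localhost:9090` and answers on its `/-/ready` management endpoint. The helper names `http_status` and `wait_until_ready`, and the injectable `fetch` and `sleep` parameters, are illustrative and not part of the course material.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=2.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 503.
        return err.code
    except (urllib.error.URLError, OSError):
        # Nothing is listening yet, or the connection timed out.
        return None


def wait_until_ready(base_url="http://localhost:9090", attempts=10, delay=1.0,
                     fetch=http_status, sleep=time.sleep):
    """Poll <base_url>/-/ready until it answers 200 or attempts run out."""
    for attempt in range(attempts):
        if fetch(f"{base_url}/-/ready") == 200:
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Calling `wait_until_ready()` in a notebook cell returns `True` once the server reports ready and `False` if it does not come up within the given number of attempts.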

#### Download and start the demo service

In [None]:
%%bash -s {WORKDIR} {DEMO_VERSION}
(
    WORKDIR=$1
    DEMO_VERSION=${2:-0.11.1}
    cd $WORKDIR
    echo "Downloading the demo service version ${DEMO_VERSION}"
    wget -q https://github.com/juliusv/prometheus_demo_service/releases/download/${DEMO_VERSION}/prometheus_demo_service-${DEMO_VERSION}.linux-amd64
    
    chmod +x ./prometheus_demo_service-${DEMO_VERSION}.linux-amd64
    mv ./prometheus_demo_service-${DEMO_VERSION}.linux-amd64 ./prometheus_demo_service
    
    echo "Starting 3 demo service instances"
    
    killall prometheus_demo_service
    
    ./prometheus_demo_service -listen-address=":10001" > /dev/null 2>&1 &
    ./prometheus_demo_service -listen-address=":10002" > /dev/null 2>&1 &
    ./prometheus_demo_service -listen-address=":10003" > /dev/null 2>&1 &
    
    echo "Demo instances are running"
    
)

#### Update the config and refresh

In [None]:
%%bash -s {WORKDIR}
(
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
EOD
    
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)
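
Each configuration cell in this notebook rewrites the whole prometheus.yml by hand. As an alternative, the sketch below generates the same static layout from a mapping of job names to targets. It is an illustration only, not how the notebook works: the function name `render_scrape_config` is hypothetical, and it covers only plain `static_configs` jobs.

```python
def render_scrape_config(jobs, interval="5s"):
    """Render a prometheus.yml with one static_configs job per entry in jobs.

    jobs maps a job name to a list of "host:port" targets.
    """
    lines = [
        "global:",
        f"    scrape_interval: {interval}",
        f"    evaluation_interval: {interval}",
        "scrape_configs:",
    ]
    for job_name, targets in jobs.items():
        lines.append(f"    - job_name: '{job_name}'")
        lines.append("      static_configs:")
        lines.append("        - targets:")
        for target in targets:
            lines.append(f"            - {target}")
    return "\n".join(lines) + "\n"


config_text = render_scrape_config({
    "prometheus": ["localhost:9090"],
    "demo-service": ["localhost:10001", "localhost:10002", "localhost:10003"],
})
print(config_text)
```

Adding a new exporter then means adding one dictionary entry instead of copying the full file into another heredoc.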



### Installing node exporter

In [None]:
%%bash -s {WORKDIR} {NODEEXP_VERSION}
{
    cleanup() {
        trap - INT TERM
        echo "Cleaning up the node_exporter"
        killall node_exporter
        exit 1
    }
    
    trap cleanup INT TERM
    
    WORKDIR=$1
    NODEEXP_VERSION=${2:-1.7.0}
    startNodeExporter() {
        cd $WORKDIR
        wget -q https://github.com/prometheus/node_exporter/releases/download/v${NODEEXP_VERSION}/node_exporter-${NODEEXP_VERSION}.linux-amd64.tar.gz -O node_exporter.tgz
        tar -zxf node_exporter.tgz
        chmod +x ./node_exporter*/node_exporter
        killall node_exporter
        # Redirect before backgrounding; a redirect after `&` applies to nothing
        ./node_exporter*/node_exporter > /dev/null 2>&1 &
    }
    echo "Download and start the exporter"
    startNodeExporter
}

#### Update scrape_config and reload

In [None]:
%%bash -s {WORKDIR}
(
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
EOD
    
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)


#### Install/Run cAdvisor

In [None]:
%%bash -s {WORKDIR} {CADVISOR_VERSION}
{
    WORKDIR=$1
    CADVISOR_VERSION=${2:-"0.36.0"}
    
    cleanup() {
        trap - INT TERM
        echo "Stopping cAdvisor"
        docker rm -f cadvisor
        exit 1
    }
    
    startCadvisor() {
        docker rm -f cadvisor
        docker run \
        --volume=/:/rootfs:ro \
        --volume=/var/run:/var/run:ro \
        --volume=/sys:/sys:ro \
        --volume=/var/lib/docker/:/var/lib/docker:ro \
        --volume=/dev/disk/:/dev/disk:ro \
        --publish=8080:8080 \
        --detach=true \
        --name=cadvisor \
        --privileged \
        --device=/dev/kmsg \
        gcr.io/cadvisor/cadvisor:v${CADVISOR_VERSION}
    }
    
    trap cleanup INT TERM
    startCadvisor
}

#### Update scrape_config and refresh

In [None]:
%%bash -s {WORKDIR}
(
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
    - job_name: 'cadvisor'
      static_configs:
        - targets:
            - localhost:8080
EOD
    
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)


### Write Custom NodeExporter

We write a custom exporter in Python; see [cpu-metrics-exporter](./cpu-metrics-exporter/). It uses psutil to read the
CPU usage for all modes and uses the ConstantMetricFamily to expose the values in Prometheus's exposition format. The
metrics are produced on each scrape, that is, once per scrape interval.


#### Run the CPU Metrics Exporter

In [None]:
%%bash -s {WORKDIR}
{
    WORKDIR=$1
    cd $WORKDIR/cpu-metrics-exporter
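    # The script runs under the Python interpreter, so match on the full command line
    pkill -f cpu-metric-exporter.py
    # `poetry shell` opens an interactive subshell and would block here; `poetry run` does not
    poetry run python3 ./cpu-metric-exporter.py > /dev/null 2>&1 &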
}


#### Update and refresh prometheus

In [None]:
%%bash -s {WORKDIR}
(
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
    - job_name: 'cadvisor'
      static_configs:
        - targets:
            - localhost:8080
    - job_name: 'cpu-metrics'
      static_configs:
        - targets:
            - localhost:8100
EOD
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)

### Relabelling

In [None]:
%%bash -s {WORKDIR}
(
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
      metric_relabel_configs:
        - action: keep
          source_labels: [__name__]
          regex: '(demo_|http_).*'
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
    - job_name: 'cadvisor'
      static_configs:
        - targets:
            - localhost:8080
    - job_name: 'cpu-metrics'
      static_configs:
        - targets:
            - localhost:8100
EOD
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)
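
The `keep` rule in `metric_relabel_configs` drops every scraped series whose metric name does not match the regex. The sketch below imitates that decision in Python. Prometheus anchors relabel regexes at both ends, which `re.fullmatch` mimics; Prometheus uses RE2 syntax rather than Python's `re`, but the two agree for this simple pattern. The sample metric names and the helper `kept_metrics` are illustrative assumptions.

```python
import re

# Same pattern as in the metric_relabel_configs entry above.
KEEP_PATTERN = re.compile(r"(demo_|http_).*")


def kept_metrics(metric_names, pattern=KEEP_PATTERN):
    """Return the names that a `keep` rule on __name__ would retain."""
    return [name for name in metric_names if pattern.fullmatch(name)]


sample = [
    "demo_api_request_duration_seconds_count",
    "http_requests_total",
    "go_goroutines",
    "process_cpu_seconds_total",
    "my_demo_metric",  # contains "demo_" but does not start with it
]
print(kept_metrics(sample))
```

Because the match is anchored, `my_demo_metric` is dropped even though it contains `demo_`.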

### Service Discovery

Though the above method is good for a small number of services, it is not practical for a large number of services. In large-scale systems, we need a way to discover new services automatically, without having to update the Prometheus configuration file every time a new service is added or removed. There are several ways to achieve this, including:

- File-based service discovery
- Kubernetes
- Consul
- Cloud-provider-based discovery

In this example, we use Consul for service discovery.

In [None]:
%%bash -s {WORKDIR} {CONSUL_VERSION} 
{
    WORKDIR=$1
    CONSUL_VERSION=${2:-1.17.3}
    cd ${WORKDIR}
    echo "Downloading consul ${CONSUL_VERSION} to ${WORKDIR}"
    wget -q https://releases.hashicorp.com/consul/${CONSUL_VERSION}/consul_${CONSUL_VERSION}_linux_amd64.zip
    unzip -u -qq consul_${CONSUL_VERSION}_linux_amd64.zip
    rm -f consul_${CONSUL_VERSION}_linux_amd64.zip
    
    chmod +x ./consul
    
    cat > ./demo-service.json <<-EOD
{
    "services":
    [
        {"id":"demo1","name":"demo","address":"127.0.0.1","port":10001},
        {"id":"demo2","name":"demo","address":"127.0.0.1","port":10002},
        {"id":"demo3","name":"demo","address":"127.0.0.1","port":10003}
    ]
}
EOD
    killall consul
    
    # demo-service.json is a single file, so load it with -config-file rather than -config-dir
    ./consul agent -dev -config-file=./demo-service.json > /dev/null 2>&1 &
    echo "Started consul"
}

#### Add consul to the scrape configs

In [None]:
%%bash -s {WORKDIR}
(
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
      metric_relabel_configs:
        - action: keep
          source_labels: [__name__]
          regex: '(demo_|http_).*'
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
    - job_name: 'cadvisor'
      static_configs:
        - targets:
            - localhost:8080
    - job_name: 'cpu-metrics'
      static_configs:
        - targets:
            - localhost:8100
    - job_name: 'consul-sd-demo'
      consul_sd_configs:
        - server: 'localhost:8500'
      relabel_configs:
        - action: keep
          source_labels: [__meta_consul_service]
          regex: demo

EOD
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)

#### Using file-based discovery

In [None]:
%%bash -s {WORKDIR}
(
    echo "Updating the prometheus.yml to include file_sd_configs"
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
      metric_relabel_configs:
        - action: keep
          source_labels: [__name__]
          regex: '(demo_|http_).*'
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
    - job_name: 'cadvisor'
      static_configs:
        - targets:
            - localhost:8080
    - job_name: 'cpu-metrics'
      static_configs:
        - targets:
            - localhost:8100
    - job_name: 'consul-sd-demo'
      consul_sd_configs:
        - server: 'localhost:8500'
      relabel_configs:
        - action: keep
          source_labels: [__meta_consul_service]
          regex: demo
    - job_name: 'file-sd-demo'
      file_sd_configs:
        - files:
            - 'targets.yml'

EOD

    # Create the targets.yml
    echo "Creating the targets.yml file"
    cd $WORKDIR/prometheus-*/
    cat > targets.yml <<-EOD
- targets:
    - localhost:10001
    - localhost:10002
  labels:
    env: production
- targets:
    - localhost:10003
  labels:
    env: staging
EOD
    
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)
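
Prometheus watches the files listed under `file_sd_configs` and picks up changes without a reload, so another process can update the target list at any time. The sketch below writes the same two target groups as JSON, which file-based discovery accepts alongside YAML. It writes to a temporary file and renames it so that Prometheus never reads a half-written file. The helper name `write_targets` and the file name `targets.json` are assumptions; the job would have to list that file under `files` for it to be used.

```python
import json
import os
import tempfile


def write_targets(path, groups):
    """Atomically write file_sd target groups as JSON to path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        # Rename in one step so readers see either the old or the new file.
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise


groups = [
    {"targets": ["localhost:10001", "localhost:10002"],
     "labels": {"env": "production"}},
    {"targets": ["localhost:10003"],
     "labels": {"env": "staging"}},
]
```

For example, `write_targets("targets.json", groups)` run from the Prometheus directory would create the file next to prometheus.yml.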

### BlackBox Exporter
Instead of the target exposing metrics itself, we can probe it from the outside using protocols such as HTTP, TCP, or DNS. Prometheus service discovery supplies the targets; relabelling then passes each target to the Blackbox Exporter as a URL parameter and redirects the scrape to the exporter, which probes the target and returns the result as metrics.

In [None]:
%%bash -s {WORKDIR} {BLACKBOX_EXPORTER_VERSION} 
{
    WORKDIR=$1
    VERSION=${2:-0.24.0}
    cd ${WORKDIR}
    echo "Downloading blackbox_exporter ${VERSION} to ${WORKDIR}"
    wget -q https://github.com/prometheus/blackbox_exporter/releases/download/v${VERSION}/blackbox_exporter-${VERSION}.linux-amd64.tar.gz
    tar zxf blackbox_exporter-${VERSION}.linux-amd64.tar.gz
    rm -f blackbox_exporter-${VERSION}.linux-amd64.tar.gz
    
    cd ./blackbox_exporter-${VERSION}.linux-amd64
    
    cat > ./blackbox.yml <<-EOD
modules:
    http_2xx:
        prober: http
        timeout: 2s
        http:
            valid_http_versions: [ "HTTP/1.1", "HTTP/2.0" ]
            valid_status_codes: []  # Defaults to 2xx
            method: GET
            preferred_ip_protocol: "ip4"  # defaults to "ip6"
EOD
    killall blackbox_exporter
    
    ./blackbox_exporter > /dev/null 2>&1 &
    echo "Started blackbox_exporter, http://localhost:9115/"
}

#### Update the scrape config to probe some websites using blackbox exporter


In [None]:
%%bash -s {WORKDIR}
(
    echo "Updating the prometheus.yml to include blackbox_exporter"
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
      metric_relabel_configs:
        - action: keep
          source_labels: [__name__]
          regex: '(demo_|http_).*'
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
    - job_name: 'cadvisor'
      static_configs:
        - targets:
            - localhost:8080
    - job_name: 'cpu-metrics'
      static_configs:
        - targets:
            - localhost:8100
    - job_name: 'consul-sd-demo'
      consul_sd_configs:
        - server: 'localhost:8500'
      relabel_configs:
        - action: keep
          source_labels: [__meta_consul_service]
          regex: demo
    - job_name: 'file-sd-demo'
      file_sd_configs:
        - files:
            - 'targets.yml'
            
    - job_name: 'blackbox'
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
        - targets:
            - http://prometheus.io
            - https://prometheus.io
            - http://example.com:8080
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: localhost:9115
            
EOD

    # Create the targets.yml
    echo "Creating the targets.yml file"
    cd $WORKDIR/prometheus-*/
    cat > targets.yml <<-EOD
- targets:
    - localhost:10001
    - localhost:10002
  labels:
    env: production
- targets:
    - localhost:10003
  labels:
    env: staging
EOD
    
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)
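
The three `relabel_configs` steps of the `blackbox` job can be hard to follow from the YAML alone. The sketch below replays them on a plain dictionary of labels and prints the URL that Prometheus ends up scraping. It is a simplified model rather than Prometheus code, and the function name `blackbox_scrape_url` is hypothetical.

```python
from urllib.parse import urlencode


def blackbox_scrape_url(target, module="http_2xx", exporter="localhost:9115"):
    """Mimic the three relabel_configs steps of the 'blackbox' job."""
    # Entries under `params:` appear as __param_<name> labels before relabelling.
    labels = {"__address__": target, "__param_module": module}
    # Step 1: copy the original target into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2: keep the probed URL as the instance label.
    labels["instance"] = labels["__param_target"]
    # Step 3: point the actual scrape at the Blackbox Exporter.
    labels["__address__"] = exporter
    query = urlencode({"module": labels["__param_module"],
                       "target": labels["__param_target"]})
    return labels["instance"], f"http://{labels['__address__']}/probe?{query}"


instance, url = blackbox_scrape_url("https://prometheus.io")
print(instance)  # https://prometheus.io
print(url)       # http://localhost:9115/probe?module=http_2xx&target=https%3A%2F%2Fprometheus.io
```

The `instance` label keeps the probed site, while the request itself goes to the exporter on port 9115.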

### Using PushGateway

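The Pushgateway lets ephemeral jobs, such as batch jobs, push their metrics to an intermediary. The pushed values persist there until the next push overwrites them or they are deleted, and the Prometheus server scrapes them from the Pushgateway.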

#### Installing PushGateway
Release archive: https://github.com/prometheus/pushgateway/releases/download/v1.7.0/pushgateway-1.7.0.linux-amd64.tar.gz

In [None]:
%%bash -s {WORKDIR} {PUSHGATEWAY_VERSION} 
{
    set -x
    WORKDIR=$1
    VERSION=${2:-1.7.0}
    cd ${WORKDIR}
    echo "Downloading Pushgateway ${VERSION} to ${WORKDIR}"
    wget -q https://github.com/prometheus/pushgateway/releases/download/v${VERSION}/pushgateway-${VERSION}.linux-amd64.tar.gz
    tar zxf pushgateway-${VERSION}.linux-amd64.tar.gz
    
    cd ./pushgateway-${VERSION}.linux-amd64
    
    killall pushgateway
    
    ./pushgateway > /dev/null 2>&1 &
    echo "Started pushgateway, http://localhost:9091/"
}

#### Update the scrape config and reload Prometheus

In [None]:
%%bash -s {WORKDIR}
(
    echo "Updating the prometheus.yml to include pushgateway"
    WORKDIR=$1
    cat > ${WORKDIR}/prometheus-*/prometheus.yml <<-EOD 
global:
    scrape_interval: 5s
    evaluation_interval: 5s
scrape_configs:
    - job_name: 'prometheus'
      static_configs:
        - targets: ['localhost:9090']
    - job_name: 'demo-service'
      static_configs:
        - targets:
            - localhost:10001
            - localhost:10002
            - localhost:10003
      metric_relabel_configs:
        - action: keep
          source_labels: [__name__]
          regex: '(demo_|http_).*'
    - job_name: 'node_exporter'
      static_configs:
        - targets:
            - localhost:9100
    - job_name: 'cadvisor'
      static_configs:
        - targets:
            - localhost:8080
    - job_name: 'cpu-metrics'
      static_configs:
        - targets:
            - localhost:8100
    - job_name: 'consul-sd-demo'
      consul_sd_configs:
        - server: 'localhost:8500'
      relabel_configs:
        - action: keep
          source_labels: [__meta_consul_service]
          regex: demo
    - job_name: 'file-sd-demo'
      file_sd_configs:
        - files:
            - 'targets.yml'
            
    - job_name: 'blackbox'
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
        - targets:
            - http://prometheus.io
            - https://prometheus.io
            - http://example.com:8080
      relabel_configs:
        - source_labels: [__address__]
          target_label: __param_target
        - source_labels: [__param_target]
          target_label: instance
        - target_label: __address__
          replacement: localhost:9115
          
    - job_name: 'pushgateway'
      honor_labels: true
      static_configs:
        - targets: ['localhost:9091']
            
EOD

    # Create the targets.yml
    echo "Creating the targets.yml file"
    cd $WORKDIR/prometheus-*/
    cat > targets.yml <<-EOD
- targets:
    - localhost:10001
    - localhost:10002
  labels:
    env: production
- targets:
    - localhost:10003
  labels:
    env: staging
EOD
    
    # Send SIGHUP so Prometheus reloads its configuration
    killall -HUP prometheus
)

In [None]:
%%bash
# Simulate a batch job that pushes metrics to the Pushgateway (the cell magic must be the first line)
{
curl --data-binary @- http://localhost:9091/metrics/job/demo_batch_job <<EOF
# TYPE demo_batch_job_last_successful_run_timestamp_seconds gauge
# HELP demo_batch_job_last_successful_run_timestamp_seconds The Unix timestamp in seconds of the last successful batch job run.
demo_batch_job_last_successful_run_timestamp_seconds $(date +%s)
# TYPE demo_batch_job_last_run_timestamp_seconds gauge
# HELP demo_batch_job_last_run_timestamp_seconds The Unix timestamp in seconds of the last batch job run.
demo_batch_job_last_run_timestamp_seconds $(date +%s)
# TYPE demo_batch_job_users_deleted gauge
# HELP demo_batch_job_users_deleted How many users were deleted in the last batch job run.
demo_batch_job_users_deleted $RANDOM
EOF
}
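
A batch job written in Python can push the same kind of data without shelling out to curl. The sketch below uses only the standard library and posts to the same `/metrics/job/demo_batch_job` path as the cell above. The function names `build_payload` and `push` are illustrative, and the gateway address assumes the default `localhost:9091`.

```python
import time
import urllib.request


def build_payload(users_deleted, now=None):
    """Build the text exposition body for the demo batch job metrics."""
    now = int(time.time()) if now is None else int(now)
    lines = [
        "# TYPE demo_batch_job_last_run_timestamp_seconds gauge",
        f"demo_batch_job_last_run_timestamp_seconds {now}",
        "# TYPE demo_batch_job_users_deleted gauge",
        f"demo_batch_job_users_deleted {users_deleted}",
    ]
    # The body has to end with a newline.
    return ("\n".join(lines) + "\n").encode("utf-8")


def push(payload, job="demo_batch_job", gateway="http://localhost:9091"):
    """POST the payload to the Pushgateway group for job and return the status."""
    request = urllib.request.Request(
        f"{gateway}/metrics/job/{job}", data=payload, method="POST")
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

With the Pushgateway running, `push(build_payload(42))` sends one sample per metric; a real job would call it once at the end of each run.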


In [None]:
# Delete the group from the Pushgateway to remove the metrics of a decommissioned batch job
!curl -X DELETE http://localhost:9091/metrics/job/demo_batch_job

### Using AlertManager

For redundancy, we run both Prometheus and Alertmanager in high-availability (HA) mode: the Alertmanager instances receive alerts from the Prometheus servers. The Alertmanager instances use a gossip protocol to tell each other which notifications have already been sent. In case of a network partition,
each alert is still notified at least once, possibly more than once.


In [None]:
%%bash -s {WORKDIR} {ALERTMANAGER_VERSION} 
{
    WORKDIR=$1
    VERSION=${2:-0.26.0}
    cd ${WORKDIR}
    echo "Downloading alertmanager ${VERSION} to ${WORKDIR}"
    wget -q https://github.com/prometheus/alertmanager/releases/download/v${VERSION}/alertmanager-${VERSION}.linux-amd64.tar.gz
    tar zxf alertmanager-${VERSION}.linux-amd64.tar.gz
    
    cd ./alertmanager-${VERSION}.linux-amd64
    
    cat > ./alertmanager.yml <<-EOD
route:
    group_by: ['alertname', 'job']
    group_wait: "30s"
    group_interval: "5m"
    repeat_interval: "3h"
    receiver: 'test-receiver-slack'
    routes:
        - match:
            severity: critical
          receiver: 'test-receiver-webhook'
    
receivers:
    - name: test-receiver-slack
      slack_configs:
        - api_url: "https://hooks.slack.com/services/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"
          username: 'Slack AlertBot'
          channel: '#lfs-alertmanager'
          send_resolved: true
    # Make sure the alertreceiver_webhook is running
    - name: test-receiver-webhook
      webhook_configs:
        - url: http://localhost:9595/
                    
    
EOD
    killall alertmanager alertreceiver_webhook.py
    
    ./alertmanager > /dev/null 2>&1 &
    echo "Started alertmanager, http://localhost:9093/"
    
    echo "Starting the webhook receiver"
    ./alertreceiver_webhook.py > /dev/null 2>&1 &
}

#### Add alerting rules and reload Prometheus

In [None]:
%%bash -s {WORKDIR}
{
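WORKDIR=$1
# Set the path only after WORKDIR is known; the unquoted glob expands when the variable is used
prometheus_cfg=${WORKDIR}/prometheus-*/prometheus.yml
if grep -q "alerting_rules.yml" $prometheus_cfg; then
    echo "The alerting rules are already included"
else
    # Point Prometheus at the Alertmanager started above and load the rules file
    cat >> $prometheus_cfg <<-EOD
alerting:
    alertmanagers:
        - static_configs:
            - targets: ['localhost:9093']
rule_files:
    - alerting_rules.yml
EOD
fi

cd ${WORKDIR}/prometheus-*/
# Quote the delimiter so the shell leaves $labels and $value for Prometheus to expand.
# The job matcher uses 'demo-service', the name of the static scrape job defined earlier.
cat > alerting_rules.yml <<-'EOD'
groups:
- name: demo-service-alerts
  rules:
  - alert: Many5xxErrors
    expr: |
      sum by(path, instance, job) (
        rate(demo_api_request_duration_seconds_count{status=~"5..",job="demo-service"}[1m])
      )
      /
      sum by(path, instance, job) (
        rate(demo_api_request_duration_seconds_count{job="demo-service"}[1m])
      ) * 100 > 0.5
    for: 30s
    labels:
      severity: critical
    annotations:
      description: "The 5xx error rate for path {{$labels.path}} on {{$labels.instance}} is {{$value}}%."
EOD

killall -HUP prometheus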
}
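
The `for` clause delays firing: the expression has to stay above the threshold for 30 seconds before the alert moves from pending to firing. The sketch below is a simplified model of that state machine, assuming one rule evaluation every 5 seconds as set by `evaluation_interval`. The function name `alert_states` and the sample values are illustrative.

```python
def alert_states(percentages, threshold=0.5, for_seconds=30, interval=5):
    """Return the alert state after each evaluation of the error percentage."""
    states = []
    active_since = None  # time at which the condition first became true
    for index, percentage in enumerate(percentages):
        now = index * interval
        if percentage > threshold:
            if active_since is None:
                active_since = now
            held_for = now - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            # Any evaluation below the threshold resets the timer.
            active_since = None
            states.append("inactive")
    return states


# One healthy evaluation followed by eight evaluations at a 1% error rate.
print(alert_states([0.1] + [1.0] * 8))
```

In this model the alert stays pending for six evaluations and fires on the seventh, once the condition has held for the full 30 seconds.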


### Recording Rules
Recording rules precompute frequently needed or expensive queries and store the result as a new time series, typically an aggregation at a higher level.

In [None]:
%%bash -s {WORKDIR}
{
    set -x
    WORKDIR=$1
    echo "Creating the recording rules file"
    prometheus_cfg=${WORKDIR}/prometheus-*/prometheus.yml
    cd ${WORKDIR}/prometheus-*
    cat > recording_rules.yml <<-EOD
groups:
    - name: demo-service
      rules:
        - record: job:demo_api_request_duration_seconds_count:rate5m
          expr: |
            sum by(job) (rate(demo_api_request_duration_seconds_count[5m]))
EOD
    
    echo "Add the recording rule to $prometheus_cfg"
    if ! grep -q "recording_rules.yml" $prometheus_cfg; then
        sed -i '/alerting_rules.yml/a\    - recording_rules.yml' $prometheus_cfg
    fi
    
    killall -HUP prometheus
    
}
    
    

#### Check the recorded query

Assuming the setup is working, we should be able to query the recorded series:
```
job:demo_api_request_duration_seconds_count:rate5m
```
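
The same check can be run from a notebook cell through the Prometheus HTTP API. The sketch below builds an instant-query URL for `/api/v1/query` and returns the result list. It assumes Prometheus is reachable at `localhost:9090`; the helper names `query_url` and `instant_query` are illustrative.

```python
import json
import urllib.parse
import urllib.request


def query_url(expr, base_url="http://localhost:9090"):
    """Build the instant-query URL of the Prometheus HTTP API for expr."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})


def instant_query(expr, base_url="http://localhost:9090"):
    """Run an instant query and return the list of result samples."""
    with urllib.request.urlopen(query_url(expr, base_url), timeout=5) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"query failed: {body}")
    return body["data"]["result"]


print(query_url("job:demo_api_request_duration_seconds_count:rate5m"))
```

Calling `instant_query("job:demo_api_request_duration_seconds_count:rate5m")` should return one entry per job once the rule has been evaluated at least once.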

### Prometheus on Kubernetes

#### Install k8s using kubeadm

In [None]:
%%bash 
{
    echo "Run the following as sudo"
    curl -s https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key add -
    cat <<EOF >/etc/apt/sources.list.d/kubernetes.list
deb http://apt.kubernetes.io/ kubernetes-jammy main
EOF
    apt-get update
    apt-get install -y kubelet kubeadm kubectl
}


#### Start the cluster
- Turn off swap
- Run kubeadm init
- Copy admin.conf to ~/.kube/config and chown the file to your user
- kubectl taint nodes --all node-role.kubernetes.io/master- (this is a single-node cluster, so remove the taint)
- Install a cluster network plugin (Weave Net or Calico)

### Remote Storage - LTS

- InfluxDB (slow)
- Thanos: the Sidecar reuses the TSDB blocks and uploads them to object storage (MinIO, S3, etc.); the Thanos Store reads them back and makes the series available; Thanos Query answers queries
- Cortex

### Monitoring and Debugging Prometheus

- Meta Prometheus (a Prometheus that monitors Prometheus)
- Use Prometheus's own metrics and alert on them

#### Debugging/Profiling
API endpoints: /debug/pprof/profile, /debug/pprof/heap, /debug/pprof/goroutine?debug=2

