Add OpenVSX proxy component #6007

Merged 1 commit into main from clu/openvsx-proxy-comp on Oct 8, 2021

Conversation

@corneliusludmann (Contributor) commented Oct 4, 2021

Description

This PR removes the OpenVSX proxy implementation from the Caddy server and adds a dedicated OpenVSX proxy component owned by the IDE team.

With this change, we don't use Caddy for this anymore. That has the following advantages:

  • The Caddy cache plugin we used before was not actually made for our use case. We misused it, which brought some limitations (see below).
  • We now have full control over when we answer with a stored response. Previously, we answered with a stored response only when the upstream server returned a 5xx status code. Now we also do so when the upstream is not reachable at all (see the sketch below this list).
  • We can now store responses in a Redis DB that persists the cached responses on shutdown. Thus we no longer lose cached responses on proxy restart and can even copy the cache to a new cluster.
  • With Redis, we now have an LFU (least frequently used) cache with a fixed size. That prevents the disk from running full and lets us make the best use of the available space instead of relying on an inflexible TTL.
  • We can now measure exactly how long certain parts take and count how many requests we are able to answer during an upstream outage. That allows us to fine-tune the cache size and implement better alerts.
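
As an illustration of the fallback behavior in the second point: with Go's httputil.ReverseProxy, a stored response can be served from the ErrorHandler when the upstream is not reachable at all. This is only a minimal sketch with assumed cacheGet and cacheKey helpers and an assumed cachedResponse type, not the actual component code:

package sketch // illustrative only, not the component's actual package

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// cachedResponse is an assumed shape for what the Redis-backed store keeps per key.
type cachedResponse struct {
	StatusCode int
	Header     http.Header
	Body       []byte
}

// cacheGet is an assumed lookup helper into the store.
func cacheGet(key string) (*cachedResponse, bool) { return nil, false }

// cacheKey is an assumed helper that derives the key from the request.
func cacheKey(r *http.Request) string { return r.Method + " " + r.URL.Path }

func newProxy(upstream *url.URL) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(upstream)

	// The ErrorHandler is called when the upstream cannot be reached at all
	// (connection refused, timeout, ...). With the Caddy plugin we could only
	// fall back on 5xx responses; here we also fall back when there is no response.
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		log.Printf("upstream error for %s: %v", r.URL.Path, err)
		if cached, ok := cacheGet(cacheKey(r)); ok {
			for k, values := range cached.Header {
				for _, v := range values {
					w.Header().Add(k, v)
				}
			}
			w.WriteHeader(cached.StatusCode)
			_, _ = w.Write(cached.Body)
			return
		}
		w.WriteHeader(http.StatusBadGateway)
	}
	return proxy
}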

This change still routes the traffic for open-vsx.gitpod.io through the ingress proxy, which is useful for self-hosted setups. For our SaaS solution, we would probably configure our load balancer to route the traffic directly to the OpenVSX proxy rather than over the ingress proxy.

Stress Testing

I tested the OpenVSX proxy with a script from an external server that runs 100'000 unique calls (200 in parallel) against this proxy. There were 35 failed calls out of 100'000 (closed client connection). Most calls took no longer than 250 ms (78'311 of 100'000), and 94'722 of 100'000 took no longer than 500 ms. None took longer than 5 seconds. The upstream call is by far the longest part.

bash script
#!/bin/bash

# sudo apt-get update -y
# sudo apt-get install -y parallel

echo "" > rnr.txt
echo "" > log.txt

call() {
    #RNR=$RANDOM
    RNR=$(shuf -i 1-1000000000000 -n 1)
    echo $RNR >> rnr.txt
    data="{\"filters\":[{\"criteria\":[{\"filterType\":8,\"value\":\"Microsoft.VisualStudio.Code\"},{\"filterType\":10,\"value\":\"docker $RNR\"},{\"filterType\":12,\"value\":\"4096\"}],\"pageNumber\":1,\"pageSize\":1,\"sortBy\":0,\"sortOrder\":0}],\"assetTypes\":[],\"flags\":950}"
    #echo "$data"

    statuscode=$(curl -s -o /dev/null --show-error -w "%{http_code}" --header "Content-Type: application/json" --data "$data" https://open-vsx.clu-openvsx-proxy-comp.staging.gitpod-dev.com/vscode/gallery/extensionquery)
    #echo "$statuscode"
    if [ "$statuscode" != "200" ]
    then
        printf "x"
        echo "$1 | $statuscode" >> log.txt
    else
        printf "."
    fi
}
export -f call

maxjobs=100000
paralleljobs=200

seq 1 $maxjobs | parallel -j $paralleljobs call
echo ""
detailed results
# HELP gitpod_openvsx_proxy_backup_cache_hit_total The total amount of requests where we had a cached response that we could use as backup when the upstream server is down.
# TYPE gitpod_openvsx_proxy_backup_cache_hit_total counter
gitpod_openvsx_proxy_backup_cache_hit_total 0
# HELP gitpod_openvsx_proxy_backup_cache_miss_total The total amount of requests where we haven't had a cached response that we could use as backup when the upstream server is down.
# TYPE gitpod_openvsx_proxy_backup_cache_miss_total counter
gitpod_openvsx_proxy_backup_cache_miss_total 100000
# HELP gitpod_openvsx_proxy_backup_cache_serve_total The total amount of requests where we actually answered with a cached response because the upstream server is down.
# TYPE gitpod_openvsx_proxy_backup_cache_serve_total counter
gitpod_openvsx_proxy_backup_cache_serve_total 0
# HELP gitpod_openvsx_proxy_duration_overall The duration in seconds of the HTTP requests.
# TYPE gitpod_openvsx_proxy_duration_overall histogram
gitpod_openvsx_proxy_duration_overall_bucket{le="0.005"} 1
gitpod_openvsx_proxy_duration_overall_bucket{le="0.01"} 2
gitpod_openvsx_proxy_duration_overall_bucket{le="0.025"} 12
gitpod_openvsx_proxy_duration_overall_bucket{le="0.05"} 20
gitpod_openvsx_proxy_duration_overall_bucket{le="0.1"} 34
gitpod_openvsx_proxy_duration_overall_bucket{le="0.25"} 78311
gitpod_openvsx_proxy_duration_overall_bucket{le="0.5"} 94722
gitpod_openvsx_proxy_duration_overall_bucket{le="1"} 97591
gitpod_openvsx_proxy_duration_overall_bucket{le="2.5"} 99989
gitpod_openvsx_proxy_duration_overall_bucket{le="5"} 100000
gitpod_openvsx_proxy_duration_overall_bucket{le="10"} 100000
gitpod_openvsx_proxy_duration_overall_bucket{le="+Inf"} 100000
gitpod_openvsx_proxy_duration_overall_sum 22566.655625430558
gitpod_openvsx_proxy_duration_overall_count 100000
# HELP gitpod_openvsx_proxy_duration_request_processing The duration in seconds of the processing of the HTTP requests before we call the upstream.
# TYPE gitpod_openvsx_proxy_duration_request_processing histogram
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.005"} 99688
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.01"} 99916
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.025"} 99967
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.05"} 99976
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.1"} 99978
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.25"} 99984
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.5"} 99997
gitpod_openvsx_proxy_duration_request_processing_bucket{le="1"} 100000
gitpod_openvsx_proxy_duration_request_processing_bucket{le="2.5"} 100000
gitpod_openvsx_proxy_duration_request_processing_bucket{le="5"} 100000
gitpod_openvsx_proxy_duration_request_processing_bucket{le="10"} 100000
gitpod_openvsx_proxy_duration_request_processing_bucket{le="+Inf"} 100000
gitpod_openvsx_proxy_duration_request_processing_sum 61.9069501900005
gitpod_openvsx_proxy_duration_request_processing_count 100000
# HELP gitpod_openvsx_proxy_duration_response_processing The duration in seconds of the processing of the HTTP responses after we have called the upstream.
# TYPE gitpod_openvsx_proxy_duration_response_processing histogram
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.005"} 99855
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.01"} 99983
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.025"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.05"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.1"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.25"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.5"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="1"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="2.5"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="5"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="10"} 100000
gitpod_openvsx_proxy_duration_response_processing_bucket{le="+Inf"} 100000
gitpod_openvsx_proxy_duration_response_processing_sum 53.07051631899996
gitpod_openvsx_proxy_duration_response_processing_count 100000
# HELP gitpod_openvsx_proxy_duration_upstream_call The duration in seconds of the call of the upstream server.
# TYPE gitpod_openvsx_proxy_duration_upstream_call histogram
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.005"} 1
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.01"} 2
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.025"} 12
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.05"} 21
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.1"} 34
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.25"} 78423
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.5"} 94754
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="1"} 97599
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="2.5"} 99989
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="5"} 100000
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="10"} 100000
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="+Inf"} 100000
gitpod_openvsx_proxy_duration_upstream_call_sum 22420.242568651047
gitpod_openvsx_proxy_duration_upstream_call_count 100000
# HELP gitpod_openvsx_proxy_regular_cache_hit_and_serve_total The total amount or requests where we answered with a cached response for performance reasons.
# TYPE gitpod_openvsx_proxy_regular_cache_hit_and_serve_total counter
gitpod_openvsx_proxy_regular_cache_hit_and_serve_total 0
# HELP gitpod_openvsx_proxy_regular_cache_miss_total The total amount or requests we haven't had a young enough cached requests to use it for performance reasons.
# TYPE gitpod_openvsx_proxy_regular_cache_miss_total counter
gitpod_openvsx_proxy_regular_cache_miss_total 100000
# HELP gitpod_openvsx_proxy_requests_total The total amount of requests by response status.
# TYPE gitpod_openvsx_proxy_requests_total counter
gitpod_openvsx_proxy_requests_total{path="POST /vscode/gallery/extensionquery",status="200"} 99965
gitpod_openvsx_proxy_requests_total{path="POST /vscode/gallery/extensionquery",status="error"} 35

Metrics

@ArthurSens Could you have a look at the Prometheus metrics? Is that feasible? Do you have any suggestions for improvement? The number of possible path values in gitpod_openvsx_proxy_requests_total should be stable and not more than what you see in this example:

example Prometheus metrics
# HELP gitpod_openvsx_proxy_backup_cache_hit_total The total amount of requests where we had a cached response that we could use as backup when the upstream server is down.
# TYPE gitpod_openvsx_proxy_backup_cache_hit_total counter
gitpod_openvsx_proxy_backup_cache_hit_total 5
# HELP gitpod_openvsx_proxy_backup_cache_miss_total The total amount of requests where we haven't had a cached response that we could use as backup when the upstream server is down.
# TYPE gitpod_openvsx_proxy_backup_cache_miss_total counter
gitpod_openvsx_proxy_backup_cache_miss_total 100034
# HELP gitpod_openvsx_proxy_backup_cache_serve_total The total amount of requests where we actually answered with a cached response because the upstream server is down.
# TYPE gitpod_openvsx_proxy_backup_cache_serve_total counter
gitpod_openvsx_proxy_backup_cache_serve_total 0
# HELP gitpod_openvsx_proxy_duration_overall The duration in seconds of the HTTP requests.
# TYPE gitpod_openvsx_proxy_duration_overall histogram
gitpod_openvsx_proxy_duration_overall_bucket{le="0.005"} 5
gitpod_openvsx_proxy_duration_overall_bucket{le="0.01"} 6
gitpod_openvsx_proxy_duration_overall_bucket{le="0.025"} 16
gitpod_openvsx_proxy_duration_overall_bucket{le="0.05"} 24
gitpod_openvsx_proxy_duration_overall_bucket{le="0.1"} 45
gitpod_openvsx_proxy_duration_overall_bucket{le="0.25"} 78336
gitpod_openvsx_proxy_duration_overall_bucket{le="0.5"} 94756
gitpod_openvsx_proxy_duration_overall_bucket{le="1"} 97630
gitpod_openvsx_proxy_duration_overall_bucket{le="2.5"} 100028
gitpod_openvsx_proxy_duration_overall_bucket{le="5"} 100039
gitpod_openvsx_proxy_duration_overall_bucket{le="10"} 100039
gitpod_openvsx_proxy_duration_overall_bucket{le="+Inf"} 100039
gitpod_openvsx_proxy_duration_overall_sum 22576.559664320564
gitpod_openvsx_proxy_duration_overall_count 100039
# HELP gitpod_openvsx_proxy_duration_request_processing The duration in seconds of the processing of the HTTP requests before we call the upstream.
# TYPE gitpod_openvsx_proxy_duration_request_processing histogram
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.005"} 99727
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.01"} 99955
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.025"} 100006
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.05"} 100015
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.1"} 100017
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.25"} 100023
gitpod_openvsx_proxy_duration_request_processing_bucket{le="0.5"} 100036
gitpod_openvsx_proxy_duration_request_processing_bucket{le="1"} 100039
gitpod_openvsx_proxy_duration_request_processing_bucket{le="2.5"} 100039
gitpod_openvsx_proxy_duration_request_processing_bucket{le="5"} 100039
gitpod_openvsx_proxy_duration_request_processing_bucket{le="10"} 100039
gitpod_openvsx_proxy_duration_request_processing_bucket{le="+Inf"} 100039
gitpod_openvsx_proxy_duration_request_processing_sum 61.93492549200048
gitpod_openvsx_proxy_duration_request_processing_count 100039
# HELP gitpod_openvsx_proxy_duration_response_processing The duration in seconds of the processing of the HTTP responses after we have called the upstream.
# TYPE gitpod_openvsx_proxy_duration_response_processing histogram
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.005"} 99888
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.01"} 100016
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.025"} 100033
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.05"} 100033
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.1"} 100033
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.25"} 100033
gitpod_openvsx_proxy_duration_response_processing_bucket{le="0.5"} 100035
gitpod_openvsx_proxy_duration_response_processing_bucket{le="1"} 100035
gitpod_openvsx_proxy_duration_response_processing_bucket{le="2.5"} 100035
gitpod_openvsx_proxy_duration_response_processing_bucket{le="5"} 100035
gitpod_openvsx_proxy_duration_response_processing_bucket{le="10"} 100035
gitpod_openvsx_proxy_duration_response_processing_bucket{le="+Inf"} 100035
gitpod_openvsx_proxy_duration_response_processing_sum 53.73361513499995
gitpod_openvsx_proxy_duration_response_processing_count 100035
# HELP gitpod_openvsx_proxy_duration_upstream_call The duration in seconds of the call of the upstream server.
# TYPE gitpod_openvsx_proxy_duration_upstream_call histogram
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.005"} 1
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.01"} 2
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.025"} 12
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.05"} 21
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.1"} 43
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.25"} 78444
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="0.5"} 94786
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="1"} 97634
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="2.5"} 100024
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="5"} 100035
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="10"} 100035
gitpod_openvsx_proxy_duration_upstream_call_bucket{le="+Inf"} 100035
gitpod_openvsx_proxy_duration_upstream_call_sum 22429.440173204042
gitpod_openvsx_proxy_duration_upstream_call_count 100035
# HELP gitpod_openvsx_proxy_regular_cache_hit_and_serve_total The total amount or requests where we answered with a cached response for performance reasons.
# TYPE gitpod_openvsx_proxy_regular_cache_hit_and_serve_total counter
gitpod_openvsx_proxy_regular_cache_hit_and_serve_total 4
# HELP gitpod_openvsx_proxy_regular_cache_miss_total The total amount or requests we haven't had a young enough cached requests to use it for performance reasons.
# TYPE gitpod_openvsx_proxy_regular_cache_miss_total counter
gitpod_openvsx_proxy_regular_cache_miss_total 100035
# HELP gitpod_openvsx_proxy_requests_total The total amount of requests by response status.
# TYPE gitpod_openvsx_proxy_requests_total counter
gitpod_openvsx_proxy_requests_total{path="GET /vscode/asset/",status="302"} 16
gitpod_openvsx_proxy_requests_total{path="OPTIONS /vscode/asset/",status="204"} 6
gitpod_openvsx_proxy_requests_total{path="OPTIONS /vscode/gallery/extensionquery",status="204"} 3
gitpod_openvsx_proxy_requests_total{path="POST /api/-/query",status="200"} 1
gitpod_openvsx_proxy_requests_total{path="POST /vscode/gallery/extensionquery",status="200"} 99973
gitpod_openvsx_proxy_requests_total{path="POST /vscode/gallery/extensionquery",status="500"} 1
gitpod_openvsx_proxy_requests_total{path="POST /vscode/gallery/extensionquery",status="error"} 35
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 5.8239e-05
go_gc_duration_seconds{quantile="0.25"} 0.000103436
go_gc_duration_seconds{quantile="0.5"} 0.000129677
go_gc_duration_seconds{quantile="0.75"} 0.000183272
go_gc_duration_seconds{quantile="1"} 0.003420552
go_gc_duration_seconds_sum 1.006078281
go_gc_duration_seconds_count 4050
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 15
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.17.1"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.194608e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 9.396451168e+09
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 4616
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 6.6925156e+07
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 0.0006993961118404291
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 5.5812e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 2.194608e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 2.7344896e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 5.062656e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 8900
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 2.7131904e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 3.2407552e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6333383646239781e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 6.6934056e+07
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 9600
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 154496
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 393216
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 4.194304e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.56044e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 1.14688e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 1.14688e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 4.1110288e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 11
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 336.08
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 18
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 1.78176e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.63333381539e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.32356608e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes 1.8446744073709552e+19

Could you also help me to add this to Grafana and create alerts?

SaaS Deployment

  • We need to enable the component in the production values.yaml files (follow-up PR in the ops repo).
  • Currently, we have a load balancer that routes traffic for open-vsx.gitpod.io directly to open-vsx.org. We would like to gradually increase the percentage of the traffic that is routed to the OpenVSX proxy component by configuring the load balancer.

Related Issue(s)

#5881

How to test

  • Start a workspace with VS Code latest.
  • Search for and install extensions.

Release Notes

OpenVSX caching proxy has been moved to its own component

Meta

/uncc
/werft with-observability
/werft withObservabilityBranch=clu/openvsx-proxy

@akosyakov (Member)

I tried it and it seems to work well; the code changes make sense.

It would be good to see how it should be hooked up with metrics and then start a gradual deployment in production.

@codecov (bot) commented Oct 4, 2021

Codecov Report

Merging #6007 (818cfea) into main (332775c) will increase coverage by 26.77%.
The diff coverage is 45.81%.

❗ Current head 818cfea differs from pull request most recent head 79aaf76. Consider uploading reports for the commit 79aaf76 to get more accurate results

@@             Coverage Diff             @@
##             main    #6007       +/-   ##
===========================================
+ Coverage   19.04%   45.81%   +26.77%     
===========================================
  Files           2        8        +6     
  Lines         168      550      +382     
===========================================
+ Hits           32      252      +220     
- Misses        134      261      +127     
- Partials        2       37       +35     
Flag Coverage Δ
components-local-app-app-linux-amd64 ?
components-local-app-app-linux-arm64 ?
components-local-app-app-windows-386 ?
components-local-app-app-windows-amd64 ?
components-local-app-app-windows-arm64 ?
components-openvsx-proxy-app 45.81% <45.81%> (?)

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
components/openvsx-proxy/pkg/errorhandler.go 0.00% <0.00%> (ø)
components/openvsx-proxy/pkg/config.go 6.06% <6.06%> (ø)
components/openvsx-proxy/pkg/run.go 11.68% <11.68%> (ø)
components/openvsx-proxy/pkg/cache.go 46.42% <46.42%> (ø)
components/openvsx-proxy/pkg/utils.go 47.36% <47.36%> (ø)
components/openvsx-proxy/pkg/handler.go 47.47% <47.47%> (ø)
components/openvsx-proxy/pkg/modifyresponse.go 72.52% <72.52%> (ø)
components/openvsx-proxy/pkg/prometheus.go 79.24% <79.24%> (ø)
components/local-app/pkg/auth/pkce.go
components/local-app/pkg/auth/auth.go

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@csweichel (Contributor)

I know I'm super late to the party, but have we considered using something OOTB rather than rolling our own caching proxy?

Possible candidates are e.g. Varnish from IBM, for which there is even an operator.

@ArthurSens (Contributor) commented Oct 4, 2021

  1. Metrics for cache hits/misses

I see we have a counter for cache hits, one for cache misses and another for total. Do you think it makes sense to remove hits OR misses?

From my understanding, we can calculate one from the other two metrics, e.g. hits = total - misses, or misses = total - hits. By removing one, there is less code to maintain and less storage required for extra metrics.


  2. Metric naming

I see a couple of histogram metrics being added; that's very handy! One thing I'd ask here is just to get more aligned with metric naming standards.

OpenMetrics, which specifies good standards for Prometheus metrics, asks for the unit to be added to the metric name.

So, for example, I'd change the metric gitpod_openvsx_proxy_duration_overall_bucket to gitpod_openvsx_proxy_duration_overall_seconds_bucket.
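
For illustration, this is roughly what the rename looks like with the Go Prometheus client (only a sketch; the help text and buckets here are placeholders, not the component's actual values):

import "github.com/prometheus/client_golang/prometheus"

// The unit goes into the metric name; the client then exposes
// gitpod_openvsx_proxy_duration_overall_seconds_bucket/_sum/_count.
var durationOverall = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "gitpod_openvsx_proxy_duration_overall_seconds",
	Help:    "The duration in seconds of the HTTP requests.",
	Buckets: prometheus.DefBuckets, // placeholder buckets; the component's actual buckets may differ
})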


  3. HTTP method as labels

Looking at this metric: gitpod_openvsx_proxy_requests_total{path="GET /vscode/asset/",status="302"} 16. Would it make sense to make the HTTP method a separate label? Something like gitpod_openvsx_proxy_requests_total{method="GET",path="/vscode/asset/",status="302"} 16
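
A rough sketch of the suggested label split with the Go Prometheus client (hypothetical; the PR currently keeps method and path in a single label):

var requestsTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
	Name: "gitpod_openvsx_proxy_requests_total",
	Help: "The total amount of requests by response status.",
}, []string{"method", "path", "status"})

// e.g. requestsTotal.WithLabelValues("GET", "/vscode/asset/", "302").Inc()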


  4. Testing

If you add this snippet to your PR description, you'll get a Grafana instance with your preview environment and you should be able to see metrics there:

/werft with-observability

The one thing that might not work here is that this PR introduces a brand-new component. Since it is a new target for Prometheus, we probably need to create a new branch in gitpod-io/observability with the extra target configuration. Once we create that branch in the observability repository, we can add this snippet to create our preview with a custom branch:

/werft withObservabilityBranch="my-new-branch"

  5. Help with dashboard and alert development

I'd be happy to help, but first I need a review in this PR: #5843 😬

@corneliusludmann (Contributor, Author) commented Oct 5, 2021

I know I'm super late to the party, but have we considered using something OOTB rather than rolling our own caching proxy?

Possible candidates are e.g. varnish from IBM, for which there even is an operator

Yes, indeed. I'm always a big advocate of using existing solutions before building something ourselves.

I had a look at different existing caching proxies, but with the experience we have made with the Caddy cache plugin, we came to the conclusion that we want more control over how the caching for OpenVSX works. E.g. Varnish's "grace mode" comes somewhat close to what we need, but it is still more about what the upstream server thinks should be cached for how long, and how long a cached response may be used as a fallback. However, we have a different focus:

  • We want a backup of the most frequently used requests to OpenVSX to use during outages of the OpenVSX registry and don't care so much about the actual caching. That's why an LFU cache is the best fit.
  • We have a pretty good understanding of the upstream server and know exactly what needs to be and will be cached, e.g. we need OPTIONS calls and redirects cached (not supported by the RFC and therefore often not supported by OOTB proxies), we have only text (JSON) data as responses, etc.
  • We want to cache POST requests based on the body (see the sketch after this list).
  • The content of the upstream server does not change very often. That's why even pretty old responses make sense (better than nothing).
  • We need to replace the OpenVSX URLs in the responses.
  • It would be good to have the cache persisted to a file so we can copy it to a different node.
  • We would like to have proper metrics and be able to alert, e.g. when the OpenVSX upstream is down.
  • We don't need to respect cache headers from OpenVSX etc. but would like to control the caching logic ourselves.
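
To illustrate the POST-body point above, a minimal sketch of a hash-based key derivation (cacheKey is an illustrative helper, not the component's actual code):

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"net/http"
)

// cacheKey derives a key from method, path and, for POST requests, the body,
// so that extensionquery calls with different payloads are cached separately.
func cacheKey(r *http.Request) (string, error) {
	h := sha256.New()
	io.WriteString(h, r.Method+" "+r.URL.Path)
	if r.Method == http.MethodPost && r.Body != nil {
		body, err := io.ReadAll(r.Body)
		if err != nil {
			return "", err
		}
		// Restore the body so the proxy can still forward the request upstream.
		r.Body = io.NopCloser(bytes.NewReader(body))
		h.Write(body)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}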

Since we use an existing OOTB caching solution and the Go reverse proxy, the only logic we had to implement is when to store responses in the cache, when to serve responses from the cache, and what to replace in the response body. That's what we would have to teach OOTB proxies as well somehow, which usually requires extensive configuration and/or custom plugins. A good portion of the Go code in this PR actually provides the proper metrics and logs that we are able to define because we have more control. And for the actual LFU cache, we use an OOTB solution and don't implement it ourselves.
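
For the response-rewriting part, a minimal sketch of how this can hook into the Go reverse proxy's ModifyResponse (the URLs are illustrative values; gzip handling, caching, and metrics are omitted; assumes the imports bytes, io, net/http, net/http/httputil, and strconv):

// addResponseRewrite is an illustrative helper that hooks URL rewriting
// into the reverse proxy from the sketch further above.
func addResponseRewrite(proxy *httputil.ReverseProxy) {
	proxy.ModifyResponse = func(resp *http.Response) error {
		body, err := io.ReadAll(resp.Body)
		if err != nil {
			return err
		}
		resp.Body.Close()
		// Rewrite upstream URLs so that clients keep talking to the proxy,
		// e.g. open-vsx.org -> open-vsx.gitpod.io (illustrative values).
		body = bytes.ReplaceAll(body, []byte("https://open-vsx.org"), []byte("https://open-vsx.gitpod.io"))
		resp.Body = io.NopCloser(bytes.NewReader(body))
		resp.ContentLength = int64(len(body))
		resp.Header.Set("Content-Length", strconv.Itoa(len(body)))
		return nil
	}
}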

That's why I still think that trying to teach existing proxies what we want to achieve is harder and/or brings limitations.

@corneliusludmann (Contributor, Author)

Thanks, Arthur, for your input!

I see we have a counter for cache hits, one for cache misses and another for total.

Actually, we don't have an explicit counter for total, only one for hit and one for miss. The histogram metrics bring total counters out of the box, and I cannot prevent that.

OpenMetrics, which specify good standards for prometheus metrics, asks to add the unit to the metric name.

Thanks for the pointer! Changed this! ✔️

Would it make sense to make the HTTP method as a separated label?

Interesting idea. However, I don't think that this makes much sense or brings more insights. For me, the method and the path belong together somehow.

@corneliusludmann force-pushed the clu/openvsx-proxy-comp branch 4 times, most recently from 9e3114d to ccf3822 on October 6, 2021 18:49
@ArthurSens (Contributor)

Stress Testing

I tested the OpenVSX proxy with a script from an external server that runs 100'000 unique calls (200 in parallel) to this proxy. There were 35 out of 100'000 failed calls (closed client connection). Most of the calls took not longer than 250ms (78'311 of 100'000), 94'722 of 100'000 took not longer than 500ms. None of them took longer than 5 secs. The upstream call is by far the longest part.

Is it possible to run this against the preview and see how metrics are shown in Grafana? :)

@corneliusludmann (Contributor, Author)

@akosyakov Yesterday, I implemented the changes we discussed with Chris. The OpenVSX proxy is now a StatefulSet with a persistent volume claim.

I also added Grafana to the preview environment with the help of Arthur: https://grafana-clu-openvsx-proxy-comp.preview.gitpod-dev.com/

Today, I would like to create dashboards and alerts in Grafana.

@corneliusludmann marked this pull request as ready for review on October 7, 2021 08:25
@akosyakov (Member)

Besides #6007 (comment), it works well in the preview environment.

@akosyakov (Member) left a review comment:

/lgtm

@roboquat (Contributor) commented Oct 7, 2021

LGTM label has been added.

Git tree hash: e92cc92dbfcb3717be716a0c03990e71cf67cd61

@akosyakov (Member)

/assign @svenefftinge

@JanKoehnlein (Contributor)

/approve

@roboquat (Contributor) commented Oct 8, 2021

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: akosyakov, JanKoehnlein

Associated issue: #5881

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Approvers can indicate their approval by writing /approve in a comment.
Approvers can cancel approval by writing /approve cancel in a comment.

@roboquat merged commit 619c8ea into main on Oct 8, 2021
@roboquat deleted the clu/openvsx-proxy-comp branch on October 8, 2021 11:19