This repository has been archived by the owner on Sep 20, 2023. It is now read-only.

USN database downloads are interrupted when the site is redeployed #36

Closed
tyhicks opened this issue Mar 8, 2018 · 18 comments

Comments

@tyhicks

tyhicks commented Mar 8, 2018

https://askubuntu.com/questions/1012806/landscape-error-downloading-usn-pickle-from-https-usn-ubuntu-com-usn-db-data

If a client is downloading the USN database and a new version of the site is deployed, the clients will see an error and the download will fail.

You can reproduce this issue by downloading the database-all.pickle file:

$ curl -o /dev/null https://usn.ubuntu.com/usn-db/database-all.pickle

After the download begins, immediately rebuild the site (be sure to click the clean box).

The curl command will fail once the Deploy to Kubernetes stage of the deployment job starts:

curl: (56) GnuTLS recv error (-9): A TLS packet with unexpected length was received.

@WillMoggridge

WillMoggridge commented Mar 9, 2018

I think a quick solution for this could be to increase terminationGracePeriodSeconds in Kubernetes, which should allow more time for existing connections to complete. Then we can investigate a better long-term solution.

I will do some testing with this setting.
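
For reference, a minimal sketch of what that change could look like, assuming the site runs as a Kubernetes Deployment named usn-ubuntu-com (that name is an assumption, not confirmed here). Note that nginx would also need to be stopped gracefully (e.g. via a preStop hook that sends SIGQUIT) for the extra grace period to be of any use:

# Raise the termination grace period so in-flight downloads of the ~15 MB
# pickle have time to finish before the old pod is killed.
# Hypothetical deployment name: usn-ubuntu-com
kubectl patch deployment usn-ubuntu-com --type merge \
  -p '{"spec":{"template":{"spec":{"terminationGracePeriodSeconds":900}}}}'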

@tyhicks
Author

tyhicks commented Mar 9, 2018

@WillMoggridge if a client is downloading a copy of the USN database, will new deployments of the USN website need to wait for that client to finish its download?

I want to make sure that someone can't prevent us from publishing new USNs by simply repeatedly downloading the USN database in a loop.

@WillMoggridge

@tyhicks They will not be able to block a new release. Once a container is set for termination, no new connections can be made to it, but existing downloads will be allowed to finish. There will also be a time limit for existing connections, which we set and can tweak.

While they are waiting for existing connections to close, the new containers will start up and serve the new site.
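
Once that is live, a rough way to verify the draining behaviour (just a sketch; the deployment name is assumed, and any redeploy that replaces the pods would do):

# Start a deliberately slow download in the background...
curl --limit-rate 20k -o /dev/null https://usn.ubuntu.com/usn-db/database-all.pickle &
curl_pid=$!

# ...then replace the pods while it is running (assumes kubectl access and
# a Deployment named usn-ubuntu-com).
kubectl rollout restart deployment/usn-ubuntu-com

# If connections drain properly, the transfer finishes despite the rollout.
wait "$curl_pid" && echo "download survived the rollout" || echo "download was cut off"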

@tyhicks
Author

tyhicks commented Mar 12, 2018

@WillMoggridge that sounds like the perfect solution

@tyhicks
Author

tyhicks commented Mar 20, 2018

@WillMoggridge Hi! Any update here? This is fairly urgent to get corrected since it affects Landscape users.

@lathiat

lathiat commented Mar 22, 2018

This appears to be happening roughly every minute. Is something causing the containers to be continually recycled at the moment?

You can test simply with this command:
wget https://usn.ubuntu.com/usn-db/database.pickle.bz2 --limit-rate 10k

It continually disconnects at a semi-random interval between 60 and 120 seconds.

2018-03-22 15:32:53-- https://usn.ubuntu.com/usn-db/database.pickle.bz2
2018-03-22 15:33:32 (10.1 KB/s) - Connection closed at byte 402807. Retrying. [39s]
2018-03-22 15:34:30 (10.0 KB/s) - Connection closed at byte 982808. Retrying. [57s]
2018-03-22 15:35:35 (10.0 KB/s) - Connection closed at byte 1631239. Retrying. [63s]
2018-03-22 15:36:44 (10.0 KB/s) - Connection closed at byte 2305643. Retrying. [66s]
2018-03-22 15:37:28 (9.90 KB/s) - Connection closed at byte 2706817. Retrying. [40s]
2018-03-22 15:38:29 (10.0 KB/s) - Connection closed at byte 3270629. Retrying. [55s]
2018-03-22 15:40:30 (10.0 KB/s) - Connection closed at byte 4448547. Retrying. [1m55s]

To be clear, this is causing major problems for Landscape users: with slow enough connections (150 KB/s-300 KB/s) they can never successfully download the 15 MB pickle file. The currently deployed Landscape versions do not attempt to resume the download, and even resuming is not always an ideal solution, since the database may change out from under them if it is regenerated mid-download.
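
For clients that can be changed, a resume loop is one possible stopgap on the client side (a sketch only, not what Landscape does today; the output path is illustrative):

# The output path is illustrative.
out=/tmp/database.pickle.bz2

# Keep resuming with byte ranges until the transfer completes.
# Caveat: if the database is regenerated mid-download, the stitched-together
# file may be inconsistent, so a robust client would also compare ETag or
# Last-Modified between attempts.
touch "$out"
until curl -f -C - -o "$out" https://usn.ubuntu.com/usn-db/database.pickle.bz2; do
    echo "connection dropped, resuming..." >&2
    sleep 5
done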

@tyhicks
Author

tyhicks commented Mar 22, 2018

You can test simply with this command:
wget https://usn.ubuntu.com/usn-db/database.pickle.bz2 --limit-rate 10k

It continually disconnects at a semi-random interval between 60 and 120 seconds.

I was able to verify this and I also made sure that a new deployment of the USN website wasn't happening at the same time. We'll need @nottrobin or @WillMoggridge to investigate this.

@WillMoggridge

The solution should now be pushed live. I wanted to check in and see whether the situation has improved, but it sounds like it hasn't, so we will investigate why those drops are happening.

Separately from that, we have been talking with IS, who are working on a high-priority ticket (RT#109653) to build a new full caching layer. The hope is that this will be a complete solution to these problems, and it is progressing well.

@lathiat

lathiat commented Mar 23, 2018

Just confirming that, as of right now, the drops are still happening.

@WillMoggridge

I want to update you that I am still looking into a fix for the timeouts. I am talking with IS and continuing to investigate.

@lathiat

lathiat commented Apr 7, 2018

Still seeing this issue:

2018-04-07 14:03:46-- https://usn.ubuntu.com/usn-db/database.pickle.bz2
2018-04-07 14:05:30 (20.0 KB/s) - Connection closed at byte 2113171. Retrying.
2018-04-07 14:08:55 (20.0 KB/s) - Connection closed at byte 6257805. Retrying.
2018-04-07 14:10:28 (20.0 KB/s) - Connection closed at byte 8091280. Retrying.
2018-04-07 14:11:34 (20.0 KB/s) - Connection closed at byte 9369489. Retrying.
2018-04-07 14:13:14 (20.0 KB/s) - Connection closed at byte 11300262. Retrying.
2018-04-07 14:17:06 (20.0 KB/s) - ‘database.pickle.bz2.1’ saved [15916435/15916435]

@lathiat

lathiat commented Apr 9, 2018

I am no longer seeing a disconnect every 2 minutes without fail; sometimes I do, and other times it takes longer, but I still always see it eventually. I wanted to try different IPs, but I can't find a way to make wget/curl use a specific IP to see whether there is a difference between them.

Today from 162.213.33.205
2018-04-09 15:08:57-- https://usn.ubuntu.com/usn-db/database.pickle.bz2
2018-04-09 15:16:36 (10.0 KB/s) - Connection closed at byte 4684437. Retrying.

@setharnold

setharnold commented Apr 9, 2018 via email

@lathiat

lathiat commented Apr 9, 2018

Thanks for the tip, that works great using --resolve.
Also hacked around it for wget by just adding some blackhole routes temporarily (ip r a blackhole 1.2.3.4).

I'm seeing drops roughly every 5 minutes from a couple of different IPs (.20, .207), sometimes up to 10 minutes but more commonly 5. I won't keep updating with which IPs take how long, since I don't see a specific pattern; I just wanted to make the point that it now seems more variable than before. Previously it was reliably every ~2 minutes; now it is usually every 5-10 minutes and occasionally a bit longer.
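
For anyone else testing, the --resolve approach looks roughly like this, using one of the IPs mentioned above; the flag pins host:port to a fixed address, so TLS/SNI and the Host header still match the certificate:

curl --resolve usn.ubuntu.com:443:162.213.33.205 --limit-rate 10k \
     -o /dev/null https://usn.ubuntu.com/usn-db/database.pickle.bz2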

@setharnold

setharnold commented Apr 9, 2018 via email

@davecore82

I'm also seeing the same behaviour that Trent reports. I ran this curl command with a 20k speed limit multiple times this morning and every run failed:

curl https://usn.ubuntu.com/usn-db/database.pickle.bz2 --output /dev/null --limit-rate 20k

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

 33 15.1M   33 5284k    0     0  16300      0  0:16:17  0:05:32  0:10:45 12444
curl: (18) transfer closed with 10518006 bytes remaining to read

 40 15.1M   40 6302k    0     0  16297      0  0:16:17  0:06:36  0:09:41 15308
curl: (18) transfer closed with 9476064 bytes remaining to read

 67 15.1M   67 10.2M    0     0  16319      0  0:16:16  0:10:57  0:05:19 18969
curl: (18) transfer closed with 5208042 bytes remaining to read

 12 15.1M   12 1888k    0     0  16385      0  0:16:12  0:01:58  0:14:14 14254
curl: (18) transfer closed with 13996316 bytes remaining to read

 10 15.1M   10 1637k    0     0  16277      0  0:16:18  0:01:43  0:14:35 14361
curl: (18) transfer closed with 14253228 bytes remaining to read

 16 15.1M   16 2563k    0     0  16307      0  0:16:16  0:02:41  0:13:35 20317
curl: (18) transfer closed with 13304344 bytes remaining to read

 63 15.1M   63 9834k    0     0  16295      0  0:16:17  0:10:18  0:05:59 20778
curl: (18) transfer closed with 5859296 bytes remaining to read

The last 4 tests failed at:

Wed Apr 11 09:09:49 EDT 2018
Wed Apr 11 09:11:32 EDT 2018
Wed Apr 11 09:14:14 EDT 2018
Wed Apr 11 09:24:32 EDT 2018

@desrod

desrod commented Apr 11, 2018

I can confirm that this is indeed the upstream USN server going away and being replaced by another server instance. I ran 4 parallel instances of a {1..10} loop to download the database.pickle file, and all 4, when started, reached the same physical server upstream.

At around the 3-4 minute mark in this case, all 4 instances went down and the next iteration of the loop continued, reaching another server entirely; again, all 4 reached the same hostname, but a different host than in the previous loop.

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0HTTP/1.1 200 OK
Server: nginx/1.13.9
Date: Wed, 11 Apr 2018 17:41:01 GMT
Content-Type: application/octet-stream
Content-Length: 15929793
Connection: keep-alive
Last-Modified: Tue, 10 Apr 2018 17:54:50 GMT
ETag: "5accfa6a-f311c1"
X-Commit-ID: c2e08eb4adb91011794e476cea8b0783aa93e70f
X-Hostname: usn-ubuntu-com-67b88cbf4f-qx2pk
Accept-Ranges: bytes
Strict-Transport-Security: max-age=15724800; includeSubDomains;

 27 15.1M   27 4293k    0     0  20473      0  0:12:58  0:03:34  0:09:24 19883
curl: (56) GnuTLS recv error (-9): A TLS packet with unexpected length was received.
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0HTTP/1.1 200 OK
Server: nginx/1.13.9
Date: Wed, 11 Apr 2018 17:44:36 GMT
Content-Type: application/octet-stream
Content-Length: 15929793
Connection: keep-alive
Last-Modified: Tue, 10 Apr 2018 17:54:50 GMT
ETag: "5accfa6a-f311c1"
X-Commit-ID: c2e08eb4adb91011794e476cea8b0783aa93e70f
X-Hostname: usn-ubuntu-com-67b88cbf4f-8znpp
Accept-Ranges: bytes
Strict-Transport-Security: max-age=15724800; includeSubDomains;

  3 15.1M    3  565k    0     0  20508      0  0:12:56  0:00:28  0:12:28 20558

Look carefully at X-Hostname in each loop, and you'll see that it's changing when it gets dropped.

You can see this by executing the following:

for count in {1..10}; do 
   curl --dump-header - --connect-timeout 30000 --limit-rate 20k \
   https://usn.ubuntu.com/usn-db/database.pickle.bz2             \
   -o /tabase.pickle-$(uuid -F STR).bz2; 
done
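
If you only want to see which backend answered, a HEAD request is enough to pull that header without downloading the file:

curl -sI https://usn.ubuntu.com/usn-db/database.pickle.bz2 | grep -i '^x-hostname'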

We could add resume logic here by comparing the Content-Length of the remote resource against the local file size and resuming with curl's '-C -' option. Here's some sample code I've written that does exactly this:

#!/bin/bash

input_file="database.pickle.bz2"
remote_url="https://usn.ubuntu.com/usn-db/$input_file"
db_loc="/tmp/foobar"
mkdir -p "$db_loc"   # make sure the download directory exists

curl_cmd() { curl -\# -L -4 -f --connect-timeout 30000 --limit-rate 20k -H 'Accept-encoding: gzip,deflate' "$@"; }

get_headers() {
	declare -A remote_headers; 
	while IFS=$': \r' read -r name val; do [[ $name ]] || break; 
		remote_headers[$name]=$val; 
	done < <(curl -sI "$remote_url")

	remote_size="${remote_headers['Content-Length']}"
	usn_hostname="${remote_headers['X-Hostname']}"
	commit_id="${remote_headers['X-Commit-ID']}"
}

print_status() {
	printf "Remote host...: %s\nCommit ID.....: %s\nRemote size...: %s\n\n" \
		"$usn_hostname" "$commit_id" "$remote_size"
}

download_file() {
	if [[ -e $db_loc/$input_file ]]; then
		local_size=$(wc -c < "$db_loc/$input_file")

		# Keep resuming until the local copy matches the remote Content-Length
		while ! (( remote_size == local_size )); do
			local_size=$(wc -c < "$db_loc/$input_file")
			print_status
			if ! (( remote_size )); then
				echo "Unable to retrieve remote size: server does not provide Content-Length" >&2
				break
			elif (( remote_size > local_size )); then
				echo "[/] Resuming download of $input_file"
				curl_cmd -C - -o "$db_loc/$input_file" "$remote_url"
			elif (( remote_size < local_size )); then
				# Remote file shrank (e.g. database regenerated): start over
				echo "Remote file shrunk, deleting local copy and starting over" >&2
				rm -f "$db_loc/$input_file"
				curl_cmd -o "$db_loc/$input_file" "$remote_url"
			fi
		done
		echo "[+] Download complete, skipping $db_loc/$input_file" >&2
	else
		print_status
		echo "[-] Downloading ${input_file}..."
		curl_cmd -o "$db_loc/$input_file" "$remote_url"
	fi
}

get_headers
download_file

Update: cleaner curl resume/download code. This correctly restarts and resumes when the upstream/remote server goes away and is replaced by another nginx instance, so the download does not "fail" or get truncated but continues until 100%. This is purely proof-of-concept code to demonstrate how this could be addressed in Landscape's own use of curl to download the pickle file.

@setharnold

I'm confused; are you proposing that we modify all clients that consume this data?

Thanks
