Discrepancy between expected and actual zip file size #1487
On Sunday (8/14) and Monday (8/15) mornings the SoS zip files posted at the regular time (11:20 UTC), and on both days the update process ran successfully without interruption: The zips and raw files were added to the S3 bucket, the website pages baked and were published (hooray!).
However, two warnings were logged because the expected size of the zip and the actual zip file size were not equal.
So, weirdly, the actual zip size on Monday is the same as the expected size on Sunday. It's as if each day we've somehow downloaded the zip from the previous day.
I re-downloaded each day's archive of the dbwebexport.zip from our S3 bucket and confirmed that the values recorded in the RawDataVersion.download_zip_size are the actual sizes of the zip file archived for each day.
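That spot check amounts to comparing the recorded `RawDataVersion.download_zip_size` value against the on-disk size of the re-downloaded archive. A minimal sketch (the function name is hypothetical, not part of the actual app):

```python
import os


def verify_archive_size(path, recorded_size):
    """Confirm that a re-downloaded archive matches the size we recorded
    in RawDataVersion.download_zip_size when it was first archived."""
    actual = os.path.getsize(path)
    return actual == recorded_size
```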
I then re-downloaded Monday's zip from the SoS and...lo and behold...its actual size is the same as the expected size: 764,917,377 bytes. I downloaded this file twice with the same result: first, manually via Chrome and then again using our
So now I'm thinking something is happening on their end where these HTTP response values are being updated before the new zip file is actually available to download.
More evidence: The previous most recent snapshot we've tracked in our production environment was released on Fri, 8/12, (no new zip file was posted on Sat, 8/13): The expected and actual size of that zip was 764,878,673, which is the same number of bytes as the zip we downloaded on Sun. The only difference is that we downloaded the zip much later on Fri, at 21:50, after I pushed some fixes to the raw-data app and repeated the end-to-end update process for the download's site.
So, three proposals:
If this actually is the problem, then at some point we might loop back to replace the warning in the
Ran the update 20 mins later this morning, at 11:45 UTC, and the expected and actual zip size are the same.
I am going to go ahead and raise an exception in the download command if the expected and actual file size are different, and catch and retry in the update command. Also will keep experimenting with getting the earliest possible time to start the update.
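A minimal sketch of that raise-and-retry shape (all names here are hypothetical stand-ins, not the actual calaccess command code): the download step compares the size reported by the server's HEAD response against the byte count actually fetched, raises on mismatch, and the update step catches and retries:

```python
import time


class SizeMismatchError(Exception):
    """Raised when the downloaded byte count differs from the expected size."""


def download(get_expected_size, fetch):
    """Download the zip and verify its size.

    get_expected_size: callable returning the Content-Length from a HEAD request.
    fetch: callable performing the GET and returning the bytes written to disk.
    """
    expected = get_expected_size()
    actual = fetch()
    if actual != expected:
        raise SizeMismatchError(
            "expected %d bytes but downloaded %d" % (expected, actual)
        )
    return actual


def update(get_expected_size, fetch, retries=3, delay=0):
    """Call download, retrying if the sizes disagree (as they would when the
    CDN hands back yesterday's zip)."""
    for attempt in range(retries):
        try:
            return download(get_expected_size, fetch)
        except SizeMismatchError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
```

With a stale server that serves yesterday's 764,878,673-byte zip on the first attempt and today's on the second, the update succeeds on the retry.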
Added a commit on Aug 16, 2016.
Adding more weirdness to the stew...
Download metadata on 16/Aug/2016 at 11:45:05
At 17/Aug/2016 at 11:25:03:
Note how all the response headers have changed, but the downloaded zip is the same size as yesterday's.
All the response headers have reverted to yesterday, and the downloaded zip size is still the same as yesterday.
Now all the response headers are updated again, and the download zip size matches the expected size.
And now let's get EVEN WEIRDER.
The logs above were collected by the dev server. Meanwhile on the prod server, the normal daily routine ran. Both the update and the download commands make HEAD requests and write the response headers to the log. Here is what that looked like this morning:
So a second later (or maybe like every other time we make the request) the response headers all reverted to yesterday's values, and apparently the zip did as well.
Also, the update of the downloads website this morning (8/17) completed, but had several problems. I think these arose from the scenario outlined above, when a new zip is posted, but the HTTP response yields yesterday's headers and/or content.
The symptoms: downloads/latest/ has today's date -- Wednesday, Aug. 17, 2016 at 11:21 a.m. -- and there's a clean.zip available to download, but no links to the individual files appear. Looking back at the most recent RawDataVersion db record, the update_start_datetime and update_finish_datetime are populated with today's values, but the following fields are empty:
But if you look at the previous RawDataVersion db record for the Tuesday, 8/16 release, the download and extract datetime fields all have 8/17 values. Also the clean_zip_size values are the same for the Tues and Wed records.
So looking back at the prod server log lines pasted above:
So the point of failure is between when the update command calls the download command, and the SoS server decides to pull the ol' switch-a-roo. I propose a patch that checks if
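A rough sketch of the kind of check that patch could make (helper and header names are illustrative assumptions): snapshot the HEAD response headers in the update command, then compare them against a fresh HEAD taken just before the download starts, and bail out if they've reverted, since that would mean the CDN switched us to the stale origin between the two commands:

```python
# Headers worth watching: these are the ones that flipped back and forth
# between yesterday's and today's values in the logs above.
WATCHED_HEADERS = ("content-length", "last-modified", "etag")


def headers_changed(before, after):
    """Return True if any watched HEAD response header differs between the
    snapshot taken by the update command and a fresh HEAD request.
    Comparison is case-insensitive on header names."""
    before = {k.lower(): v for k, v in before.items()}
    after = {k.lower(): v for k, v in after.items()}
    return any(before.get(h) != after.get(h) for h in WATCHED_HEADERS)
```

If `headers_changed` returns True between the update command's HEAD and the download command's HEAD, the safest move is to abort and retry rather than download whatever the stale origin is serving.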
A little more info:
```
$ dig +vc campaignfinance.cdn.sos.ca.gov a

; <<>> DiG 9.8.3-P1 <<>> +vc campaignfinance.cdn.sos.ca.gov a
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 41090
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;campaignfinance.cdn.sos.ca.gov.  IN  A

;; ANSWER SECTION:
campaignfinance.cdn.sos.ca.gov. 300 IN CNAME dbexport.sos.ca.gov.cdn184.raxcdn.com.
dbexport.sos.ca.gov.cdn184.raxcdn.com. 300 IN CNAME raxcdn.com.mdc.edgesuite.net.
raxcdn.com.mdc.edgesuite.net. 27 IN CNAME a1907.dscw14.akamai.net.
a1907.dscw14.akamai.net. 20 IN A 126.96.36.199
a1907.dscw14.akamai.net. 20 IN A 188.8.131.52

;; Query time: 89 msec
;; SERVER: 184.108.40.206#53(220.127.116.11)
;; WHEN: Wed Aug 17 13:47:22 2016
;; MSG SIZE  rcvd: 207
```
So I think this is saying they have a CNAME chain that resolves to two IP addresses at akamai.net. If you run this dig command multiple times in succession, then you'll note that the IP addresses sometimes switch order. The sys admin I talked to here in the office suggested this might signify a load-balancing strategy called round-robin DNS.
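For illustration only (the IPs below are documentation addresses, not the real Akamai ones), round-robin DNS can be modeled as rotating the answer list on each lookup, so a client that opens a fresh connection between requests can land on a different origin each time:

```python
from collections import deque


class RoundRobinResolver:
    """Toy model of round-robin DNS: each lookup returns the A records
    rotated by one, so successive fresh connections alternate origins."""

    def __init__(self, addresses):
        self.addresses = deque(addresses)

    def resolve(self):
        answer = list(self.addresses)
        self.addresses.rotate(-1)  # the next query sees a different first IP
        return answer


resolver = RoundRobinResolver(["192.0.2.10", "192.0.2.11"])
first = resolver.resolve()[0]   # the origin a fresh connection hits now
second = resolver.resolve()[0]  # a later connection may hit the other origin
```

If the two Akamai edges were momentarily serving different versions of the zip, this ordering flip alone would explain why consecutive HEAD requests disagreed.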
My guess is that one of these two servers is getting behind the other around the time we start our update. Since we probably lose our requests connection between the HEAD request in the update command and the HEAD request in the download command, that might be the point at which they're switching us from one to the other. That proposal I made in the previous comment should catch that scenario.
I don't yet know if it's possible that, within
Maybe this is a CDN issue, where the files are being uploaded, but they're large enough that they don't fully propagate before the new headers do or vice versa. I didn't think that was even possible, but if there's something weird going on with the folks at Akamai, that could totally explain the discrepancy.
Today's update completed without any problems on the prod server. Ran it at 11:45 GMT.
Meanwhile on the dev server I confirmed that, even between the HEAD and GET requests sent by the
Because of the check I added yesterday, the download was stopped during the first two attempts (though, for some reason the CommandError isn't showing up in the log...), but then finished downloading 765136081 bytes at 11:56:21.
The other thing I just noticed is that the datetime format of