Refactor SQL hairball #121
Now that we have external contributors (hi @chadmiller!), it's time to start thinking more about comprehensibility.
There is some very hard-to-read (and untested) SQL. This is a legacy of our workflow: a domain expert non-programmer designed a flat data structure, and an engineer worked with that data structure.
The advantage of this approach is that someone with no knowledge of Python or distributed programming can write queries to transform the data, without worrying about performance or having to wait hours for something to complete.
The disadvantage is that this has led to an untested hairball.
Some initial thoughts:
Remove 200 lines of SQL and add 100 lines of Python. Clouds are sexier than local work, but data of this size is tractable locally. Related to ebmdatalab#121. Don't extract the clinicaltrials.gov zipfile: we can read from it just as well as from the filesystem, and save tons of space.
What do you think? Do you mind not sending anything to Google at all? This isn't ready for merging; I haven't written tests or checked quality.
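(For flavour, a minimal sketch of the no-extraction approach, assuming clinicaltrials.gov-style study records; the function name and field path here are illustrative, not the branch's actual code:)

```python
from xml.etree import ElementTree
from zipfile import ZipFile

def documents(zip_filename):
    """Yield parsed XML documents straight from the archive: no extraction."""
    with ZipFile(zip_filename) as archive:
        for name in archive.namelist():
            if name.endswith('.xml'):
                with archive.open(name) as member:
                    yield ElementTree.parse(member)

# e.g. pull one field from every study record
for doc in documents('clinicaltrials_gov.zip'):
    print(doc.findtext('id_info/nct_id'))
```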
It certainly addresses the readability and refactorability, and hence testability, of the code!
Regarding not sending things to Google, there are two factors to consider:
1. Prototyping performance. How long does it take to try out extracting a new field, for example? As stated above, nothing takes more than 2 mins in BigQuery.
2. (More FYI than a directly relevant point): it's important for our academics to be able to ask ad-hoc queries of all our data. We find Google a useful tool for this: the convenience of being able to run arbitrary SQL without having to worry about DBA issues (optimisation, availability, authorisation) outweighs the inconvenience of managing another cloudy service. So, even where we don't actually have a Big Data use case (and we probably only really have one of those across all our projects), we still use it. That said, we can quite happily (and often do) end an ETL process with a final upload to BigQuery for these purposes.
I could answer (1) by trying out your code myself, but I've run out of time for today. I'll have more time tomorrow...
On my tiny laptop, not including download of the zipfile, it takes 6 minutes to make the CSV from the zipfile.
Since most of that is walking the XML, I bet it's comparable to the XML-to-JSON preprocessing stage plus the BigQuery execution stage.
Maybe I was doing something else expensive at the time. It takes 4 minutes.

```
real    4m0.192s
user    3m57.368s
sys     0m2.687s
```

That isn't great. Let's see... I wrapped it in some multiprocessing code. There: 2 minutes. Maybe better on a real machine.

```
real    1m52.367s
user    5m50.441s
sys     0m14.433s
```
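(The shape of that change, as a hedged sketch; `extract_fields` and the field it pulls are stand-ins for whatever the branch's extraction module actually does:)

```python
import multiprocessing
from xml.etree import ElementTree
from zipfile import ZipFile

def extract_fields(xml_bytes):
    # CPU-bound: parse one study record and flatten it to a row.
    root = ElementTree.fromstring(xml_bytes)
    return {'nct_id': root.findtext('id_info/nct_id')}

def fabricate_csv(zip_filename):
    with ZipFile(zip_filename) as archive, multiprocessing.Pool() as pool:
        blobs = (archive.read(name) for name in archive.namelist()
                 if name.endswith('.xml'))
        for row in pool.imap_unordered(extract_fields, blobs):
            pass  # write row out with csv.DictWriter

if __name__ == '__main__':
    fabricate_csv('clinicaltrials_gov.zip')
```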
I think what you've sketched is a great improvement, and if we can get it completing in around 2 mins that should be absolutely fine for our workflow. However, on my machine it leaks memory fast until it's all used up (8GB + 2GB swap). I can't see a glaringly obvious reason why. Maybe we're adding to the pool much faster than consuming? I tried adding a semaphore and it didn't help; no time to investigate further today. Interesting that you don't have the same issue.
Oh, wow. I'll think about that.
Will you please limit the size of the processing pool, and see whether it's leaking over the course of a whole run rather than getting jammed up in swap or scheduling? At the top of `load_data.fabricate_csv`, make `Pool()` into `Pool(processes=2)`, or change that 2 to `multiprocessing.cpu_count()` minus some small number. Stretching here, but maybe also add the parameter `maxtasksperchild=10`.
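(i.e. something along these lines, assuming the pool is built at the top of `fabricate_csv` as described; the three calls are alternatives, not a sequence:)

```python
import multiprocessing

# Cap the worker count outright...
pool = multiprocessing.Pool(processes=2)

# ...or scale with the machine, leaving some headroom:
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count() - 1)

# Long shot: recycle workers so any per-child leak stays bounded.
pool = multiprocessing.Pool(processes=2, maxtasksperchild=10)
```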
I already tried those! No joy. In all cases, memory usage increased approximately linearly until explosion. Having a quick noodle just now, it looks a bit like removing the `error_callback` makes a difference.
Weird. I wish I could reproduce it. What environment are you using? I'm on up-to-date amd64 Ubuntu 16.04.4.
This was an experiment of mine, and I don't want debugging its problems to become the same as advocating for it. The code could be a little prettier after a half dozen changes; there is a speedup available in using a faster ZIP module, and the next best speedup is probably in removing …
Maybe there is a better way to accomplish fixing the hairball. I'll take the cue from you, @sebbacon.
Python 3.6.1, Ubuntu 17.04, so it should be pretty similar. Yes, I don't want to spend ages debugging it, but on the other hand it's a frustrating and intriguing issue in equal measure. I'll timebox 1 hour for this on Monday and make a decision after that.
Your discovery about the presence of `error_callback` makes me wonder if there were exceptions. The communication between processes "pickles" up the outbound params, the inbound return value, and inbound exception objects, and (IIRC) sends them over a socket. It could be a bug on either end of that socket; I'd like to know which side.
While it's running, will you please, in another window, run `watch ps wwuf --tty=$otherwindowstty`, capture details from that once in a while, and send them to me?
Good luck.
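(To illustrate the mechanics being described, a generic sketch rather than the branch's code: with `apply_async`, the arguments, the return value, and any raised exception all get pickled and shipped between parent and workers, and both callbacks fire back in the parent:)

```python
import multiprocessing

def work(doc):
    if not doc:
        raise ValueError('empty document')
    return doc.upper()       # return value is pickled back to the parent

def on_success(result):      # runs in the parent
    print('got', result)

def on_error(exc):           # runs in the parent; the exception was pickled too
    print('worker raised:', exc)

if __name__ == '__main__':
    pool = multiprocessing.Pool(processes=2)
    for doc in ['abc', '', 'def']:
        pool.apply_async(work, (doc,), callback=on_success,
                         error_callback=on_error)
    pool.close()
    pool.join()
```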
Also, pull those two changes from my 'ungoogle' branch. Now no complex objects are pickled across the socket like a few …
Well, this has been an interesting excursion. I was wrong (or miswrote my notes) about the error callback. It was the success callback.
My hour's up! I'm intrigued as to why this wasn't an issue for you. My 5 minutes is possibly a bit slow. But I'll also think a little more (prob. tomorrow now) on whether 5 mins is really an issue or not; perhaps we can find workarounds now it's all Python (e.g. thinking about how to make it easy for @NickCEBM to write unit tests).
Here's how I switched to threads:

```diff
diff --git a/load_data.py b/load_data.py
index 71b2203..86652ac 100644
--- a/load_data.py
+++ b/load_data.py
@@ -10,7 +10,7 @@ import contextlib
 import re
 from zipfile import ZipFile
 from csv import DictWriter
-import multiprocessing
+from multiprocessing.dummy import Pool as ThreadedPool
 from time import time
 import extraction
@@ -32,7 +32,7 @@ def document_stream(zip_filename):
 def fabricate_csv(input_filename, output_filename):
-    pool = multiprocessing.Pool()
+    pool = ThreadedPool()
```
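(For context: `multiprocessing.dummy` exposes the same `Pool` API backed by threads, so nothing is pickled or sent over a socket; presumably that is why it sidesteps the leak, though the root cause isn't confirmed here. A tiny illustration:)

```python
from multiprocessing.dummy import Pool as ThreadedPool

def shout(word):
    return word.upper()

with ThreadedPool(4) as pool:                    # threads, not processes
    print(pool.map(shout, ['sql', 'hairball']))  # ['SQL', 'HAIRBALL']; no pickling
```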
@chadmiller, quick update for you: I'm currently juggling several projects at once, so I've not had time to return to this yet. We are recruiting, so hopefully we'll get unstuck on this soonish (e.g. 4 weeks).
Movement on this per #182: move to Python rather than refactor the SQL.
This SQL hairball has long since been replaced with Python per the above, so closing.