Refactor SQL hairball #121

Closed · sebbacon opened this issue Mar 12, 2018 · 15 comments

@sebbacon
Contributor

sebbacon commented Mar 12, 2018

There is some very hard-to-read (and untested) SQL.

This is a legacy of our workflow: a domain expert (a non-programmer) designed a flat data structure, and an engineer worked with it.

The advantages of this approach are:

  1. Efficient division of labour
  2. BigQuery is great for rapid prototyping of operations that work across large datasets. You don't really need to worry about optimisation - pretty much anything you can think of will complete within at most 2 mins

Together, these points mean someone with no knowledge of Python or distributed programming can write queries to transform the data without worrying about performance or having to wait for hours for something to complete.

The disadvantage is that this has led to an untested hairball.

Some initial thoughts:

  1. As there is still some merit to the existing workflow, a sensible first refactor could simply be to make the SQL more comprehensible. One way could be with BigQuery UDFs, given that much of the apparent complexity is just fiddling around with JSON using the existing commands (see the sketch after this list).
  2. Needs tests. I've made a very rough start here. Might take me a few days to get round to finishing it, though.
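
For illustration, here's roughly what I mean by the UDF idea, run via the Python client. This is a sketch only: the table name, JSON field names, and UDF body are all made up, and the real schema will differ.

```python
# Sketch: hide the JSON fiddling behind a temporary JavaScript UDF so
# the main query stays declarative. All names below are hypothetical.
from google.cloud import bigquery

QUERY = """
CREATE TEMP FUNCTION leadSponsor(payload STRING)
RETURNS STRING
LANGUAGE js AS '''
  var doc = JSON.parse(payload);
  return doc.sponsors ? doc.sponsors.lead_sponsor.agency : null;
''';

SELECT nct_id, leadSponsor(json) AS lead_sponsor
FROM `our-project.clinicaltrials.raw_json`
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.nct_id, row.lead_sponsor)
```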
sebbacon referenced this issue Mar 12, 2018
Now we have external contributors (hi @chadmiller!), it's time to start thinking more about comprehensibility
chadmiller added a commit to chadmiller/clinicaltrials-act-tracker that referenced this issue Mar 14, 2018
Remove 200 lines of SQL, and add 100 lines of Python. Clouds are sexier than local work, but the size of this data is tractable. Related to ebmdatalab#121.

Don't extract the clinicaltrials.gov zipfile. We can read from it just as well as from the filesystem. Save tons of space.
@chadmiller
Contributor

What do you think? Do you mind not sending anything to Google at all?

This isn't ready for merging. I haven't written tests or checked quality.

master...chadmiller:ungoogle

@sebbacon
Contributor Author

It certainly addresses the readability and refactorability, and hence testability of the code!

Regarding not sending things to Google, there are two factors to consider:

  1. Prototyping performance. How long does it take to try out extracting a new field, for example? As stated above, nothing takes more than 2 mins in BigQuery.
  2. (More FYI than a directly relevant point) it's important for our academics to be able to ask ad-hoc queries of all our data. We find Google a useful tool for this: the convenience of being able to run arbitrary SQL without having to worry about DBA issues (optimisation, availability, authorisation) outweighs the inconvenience of managing another cloudy service. So even where we don't actually have a Big Data use case (and we probably only have one of those across all our projects), we still use it.
    1. That said, we can quite happily (and often do) end an ETL process with a final upload to BigQuery for these purposes (sketch below).
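
For completeness, that "final upload" step is small with the Python client. A sketch, assuming the ETL produces a CSV; the dataset/table names are placeholders:

```python
# Sketch: finish a local ETL run by loading the generated CSV into
# BigQuery so ad-hoc SQL still works. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # infer the schema from the file
)
with open("trials.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "our-project.clinicaltrials.trials", job_config=job_config
    )
job.result()  # block until the load job finishes
```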

I could answer (1) by trying out your code myself, but I've run out of time for today. I'll have more time tomorrow...

@chadmiller
Contributor

chadmiller commented Mar 14, 2018 via email

@chadmiller
Contributor

chadmiller commented Mar 15, 2018 via email

@sebbacon
Contributor Author

I think what you've sketched is a great improvement and if we can get it completing in around 2 mins that should be absolutely fine for our workflow.

However, on my machine this leaks memory fast until it's all used up (8GB + 2GB swap). I can't see a glaringly obvious reason why. Maybe we're adding to the pool much faster than consuming? I tried adding a semaphore to bound submissions (sketch below) and it didn't help - no time to investigate further today. Interesting that you don't have the same issue.
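
For reference, the bounded-submission pattern I tried looks roughly like this (a sketch with the worker and result handling abstracted out; error handling elided):

```python
import multiprocessing
import threading

MAX_IN_FLIGHT = multiprocessing.cpu_count() * 2

def process_all(documents, worker, handle_result):
    """Submit documents to a process pool without letting more than
    MAX_IN_FLIGHT tasks queue up in the parent."""
    pool = multiprocessing.Pool()
    slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def on_done(result):
        slots.release()        # free a slot, then consume the result
        handle_result(result)

    for doc in documents:
        slots.acquire()        # blocks while MAX_IN_FLIGHT tasks are pending
        pool.apply_async(worker, (doc,), callback=on_done)

    pool.close()
    pool.join()
```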

@chadmiller
Contributor

chadmiller commented Mar 15, 2018 via email

@sebbacon
Contributor Author

I already tried those! No joy. In all cases, memory usage increased approximately linearly until explosion.

Having a quick noodle just now, it looks a bit like removing the error_callback argument to apply_async makes the problem go away. Not obvious to me why, yet...
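
For concreteness, the only difference between the leaking and non-leaking runs was this keyword argument (a minimal sketch, not the real extraction code):

```python
import multiprocessing

def work(n):
    return n * n

def on_done(result):
    pass

def on_error(exc):
    print("task failed:", exc)

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    for n in range(1000):
        # Variant that leaked for me:
        pool.apply_async(work, (n,), callback=on_done, error_callback=on_error)
        # Variant where memory stayed flat (error_callback dropped):
        # pool.apply_async(work, (n,), callback=on_done)
    pool.close()
    pool.join()
```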

@chadmiller
Contributor

Weird. I wish I could reproduce it. What environment are you using? I'm on up-to-date amd64 Ubuntu 16.04.4.

This was an experiment of mine, and I don't want debugging its problems to be taken as advocating for it. The code could be a little prettier after a half-dozen changes; there's a speedup available from using a faster ZIP module, and the next-best speedup is probably removing datetime and relativedelta and treating dates lexically. I haven't refined it in case this experiment is a dead end from a usability standpoint.
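
To spell out "treating dates lexically": ISO-8601 date strings sort the same as their datetime equivalents, so plain string comparison suffices for before/after checks (sketch):

```python
# ISO-8601 dates (YYYY-MM-DD) are zero-padded and big-endian, so
# lexicographic order matches chronological order.
assert "2018-03-14" < "2018-04-01" < "2019-01-01"

def is_before(date_string, cutoff):
    # No datetime or relativedelta parsing needed for a simple check.
    return date_string < cutoff

print(is_before("2018-03-14", "2018-03-15"))  # True
```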

Maybe there is a better way to accomplish fixing the hairball. I'll take the cue from you, sebbacon.

@sebbacon
Contributor Author

Python 3.6.1, Ubuntu 17.04, so should be pretty similar.

Yes, I don't want to spend ages debugging it, but on the other hand, it's a frustrating and intriguing issue in equal measure.

I'll timebox 1 hour for this on Monday and make a decision after that.

@chadmiller
Contributor

chadmiller commented Mar 18, 2018 via email

@chadmiller
Contributor

Also, pull those two changes from my 'ungoogle' branch. Now no complex objects are pickled across the socket, like the few datetimes I missed before.
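
The idea, as a sketch (function and field names are illustrative, not the actual branch code):

```python
import datetime

# Before: each result carried datetime objects, pickled as full objects
# when sent back over the pipe to the parent process.
def extract_before(doc):
    return {"completion_date": datetime.date(2018, 3, 14)}

# After: format inside the worker and send back a plain string.
def extract_after(doc):
    return {"completion_date": datetime.date(2018, 3, 14).isoformat()}
```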

@sebbacon
Contributor Author

Well, this has been an interesting excursion. I was wrong (or miswrote my notes) about the error callback. It was the success callback:

  1. I found a single place on the internet which mentions that lxml uses a global dict as a cache for ids parsed from documents. Given there are no unique id attributes in our source XML, I'm not entirely convinced this is the problem, but it's suggestive, at least...
  2. There's no option to turn this off, but the response from the devs (no longer on the internet) includes this: "I guess it wouldn't be difficult to provide a global option in lxml to switch off the dict sharing. This is already done for threads..."
  3. 'lxml' releases the GIL during parsing
  4. Looking for an easy way to try this with threads, I discovered something new to me: multiprocessing.dummy -- very cool!
  5. Memory no longer "leaks". But it's not particularly fast on my laptop: 5m28s. Not CPU-bound. I'm not convinced we're getting concurrency. Perhaps something else is blocking on the GIL.

My hour's up! I'm intrigued as to why this wasn't an issue for you. My lxml version is 4.1.1.

5 minutes is possibly a bit slow. But I'll also think a little more (prob. tomorrow now) on whether 5 mins is really an issue or not -- perhaps we can find workarounds now it's all Python (e.g. thinking about how to make it easy for @NickCEBM to write unit tests).

Here's how I switched to threads:

```diff
diff --git a/load_data.py b/load_data.py
index 71b2203..86652ac 100644
--- a/load_data.py
+++ b/load_data.py
@@ -10,7 +10,7 @@ import contextlib
 import re
 from zipfile import ZipFile
 from csv import DictWriter
-import multiprocessing
+from multiprocessing.dummy import Pool as ThreadedPool
 from time import time
 
 import extraction
@@ -32,7 +32,7 @@ def document_stream(zip_filename):
 
 
 def fabricate_csv(input_filename, output_filename):
-    pool = multiprocessing.Pool()
+    pool = ThreadedPool()
```

@sebbacon
Contributor Author

@chadmiller, quick update for you - currently juggling several projects at once, so I've not had time to return to this yet. We are recruiting, so hopefully we'll get unstuck on this soonish (e.g. in 4 weeks).

@NickCEBM
Contributor

Movement on this per #182: move to Python rather than refactoring the SQL.

@NickCEBM
Contributor

This SQL hairball has long since been replaced with Python per the above, so closing.
