Refactor SQL hairball #121

Closed · sebbacon opened this issue Mar 12, 2018 · 15 comments

@sebbacon
Contributor

sebbacon commented Mar 12, 2018

There is some very hard-to-read (and untested) SQL.

This is a legacy of our workflow: a domain expert (a non-programmer) designed a flat data structure, and an engineer worked with it.

The advantages of this approach are:

  1. Efficient division of labour
  2. BigQuery is great for rapid prototyping of operations that work across large datasets. You don't really need to worry about optimisation - pretty much anything you can think of will complete within at most 2 mins

Together, these points mean someone with no knowledge of Python or distributed programming can write queries to transform the data without worrying about performance or having to wait for hours for something to complete.

The disadvantage is that this has led to an untested hairball.

Some initial thoughts:

  1. As there is still some merit to the existing workflow, a sensible first refactor could simply be to make the SQL more comprehensible. One way could be with BigQuery UDFs, given that much of the apparent complexity is just fiddling around with JSON using the existing commands (see the sketch after this list).
  2. Needs tests. I've made a very rough start here. Might take me a few days to get round to finishing it, though.
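
For illustration, here's roughly what I mean by the UDF idea, run via the Python client. This is a sketch only: the table name, JSON field names, and UDF body are all made up, and the real schema will differ.

```python
# Sketch: hide the JSON fiddling behind a temporary JavaScript UDF so
# the main query stays declarative. All names below are hypothetical.
from google.cloud import bigquery

QUERY = """
CREATE TEMP FUNCTION leadSponsor(payload STRING)
RETURNS STRING
LANGUAGE js AS '''
  var doc = JSON.parse(payload);
  return doc.sponsors ? doc.sponsors.lead_sponsor.agency : null;
''';

SELECT nct_id, leadSponsor(json) AS lead_sponsor
FROM `our-project.clinicaltrials.raw_json`
"""

client = bigquery.Client()
for row in client.query(QUERY).result():
    print(row.nct_id, row.lead_sponsor)
```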
sebbacon referenced this issue Mar 12, 2018
Now we have external contributors (hi @chadmiller!), it's time to start thinking more about comprehensibility
chadmiller added a commit to chadmiller/clinicaltrials-act-tracker that referenced this issue Mar 14, 2018
Remove 200 lines of SQL, and add 100 lines of Python. Clouds are sexier than local work, but the size of this data is tractable. Related to ebmdatalab#121.

Don't extract the clinicaltrials.gov zipfile. We can read from it just as well as from the filesystem. Save tons of space.
@chadmiller
Contributor

What do you think? Do you mind not sending anything to Google at all?

This isn't ready for merging. I haven't written tests or checked quality.

master...chadmiller:ungoogle

@sebbacon
Contributor Author

It certainly addresses the readability and refactorability, and hence testability of the code!

Regarding not sending things to Google, there are two factors to consider:

  1. Prototyping performance. How long does it take to try out extracting a new field, for example? As stated above, nothing takes more than 2 mins in BigQuery.
  2. (More FYI than a directly relevant point) it's important for our academics to be able to ask ad-hoc queries of all our data. We find Google a useful tool for this: the convenience of being able to run arbitrary SQL without having to worry about DBA issues (optimisation, availability, authorisation) outweighs the inconvenience of managing another cloudy service. So even where we don't actually have a Big Data use case (and we probably only have one of those across all our projects), we still use it.
    1. That said, we can quite happily (and often do) end an ETL process with a final upload to BigQuery for these purposes (sketch below).
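
For completeness, that "final upload" step is small with the Python client. A sketch, assuming the ETL produces a CSV; the dataset/table names are placeholders:

```python
# Sketch: finish a local ETL run by loading the generated CSV into
# BigQuery so ad-hoc SQL still works. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,  # skip the header row
    autodetect=True,      # infer the schema from the file
)
with open("trials.csv", "rb") as f:
    job = client.load_table_from_file(
        f, "our-project.clinicaltrials.trials", job_config=job_config
    )
job.result()  # block until the load job finishes
```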

I could answer (1) by trying out your code myself, but I've run out of time for today. I'll have more time tomorrow...

@chadmiller
Contributor

chadmiller commented Mar 14, 2018 via email

@chadmiller
Contributor

chadmiller commented Mar 15, 2018 via email

@sebbacon
Contributor Author

I think what you've sketched is a great improvement and if we can get it completing in around 2 mins that should be absolutely fine for our workflow.

However, on my machine this leaks memory fast until it's all used up (8GB + 2GB swap). I can't see a glaringly obvious reason why. Maybe we're adding to the pool much faster than consuming? I tried adding a semaphore to bound submissions (sketch below) and it didn't help - no time to investigate further today. Interesting that you don't have the same issue.
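
For reference, the bounded-submission pattern I tried looks roughly like this (a sketch with the worker and result handling abstracted out; error handling elided):

```python
import multiprocessing
import threading

MAX_IN_FLIGHT = multiprocessing.cpu_count() * 2

def process_all(documents, worker, handle_result):
    """Submit documents to a process pool without letting more than
    MAX_IN_FLIGHT tasks queue up in the parent."""
    pool = multiprocessing.Pool()
    slots = threading.BoundedSemaphore(MAX_IN_FLIGHT)

    def on_done(result):
        slots.release()        # free a slot, then consume the result
        handle_result(result)

    for doc in documents:
        slots.acquire()        # blocks while MAX_IN_FLIGHT tasks are pending
        pool.apply_async(worker, (doc,), callback=on_done)

    pool.close()
    pool.join()
```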

@chadmiller
Contributor

chadmiller commented Mar 15, 2018 via email

@sebbacon
Contributor Author

I already tried those! No joy. In all cases, memory usage increased approximately linearly until explosion.

Having a quick noodle just now, it looks a bit like removing the error_callback argument to apply_async makes the problem go away. Not obvious to me why, yet...
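
For concreteness, the only difference between the leaking and non-leaking runs was this keyword argument (a minimal sketch, not the real extraction code):

```python
import multiprocessing

def work(n):
    return n * n

def on_done(result):
    pass

def on_error(exc):
    print("task failed:", exc)

if __name__ == "__main__":
    pool = multiprocessing.Pool()
    for n in range(1000):
        # Variant that leaked for me:
        pool.apply_async(work, (n,), callback=on_done, error_callback=on_error)
        # Variant where memory stayed flat (error_callback dropped):
        # pool.apply_async(work, (n,), callback=on_done)
    pool.close()
    pool.join()
```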

@chadmiller
Contributor

Weird. I wish I could reproduce it. What environment are you using? I'm on up-to-date amd64 Ubuntu 16.04.4.

This was an experiment of mine, and I don't want debugging its problems to be taken as advocating for it. The code could be a little prettier after a half-dozen changes; there's a speedup available from using a faster ZIP module, and the next-best speedup is probably removing datetime and relativedelta and treating dates lexically. I haven't refined it in case this experiment is a dead end from a usability standpoint.
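
To spell out "treating dates lexically": ISO-8601 date strings sort the same as their datetime equivalents, so plain string comparison suffices for before/after checks (sketch):

```python
# ISO-8601 dates (YYYY-MM-DD) are zero-padded and big-endian, so
# lexicographic order matches chronological order.
assert "2018-03-14" < "2018-04-01" < "2019-01-01"

def is_before(date_string, cutoff):
    # No datetime or relativedelta parsing needed for a simple check.
    return date_string < cutoff

print(is_before("2018-03-14", "2018-03-15"))  # True
```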

Maybe there is a better way to accomplish fixing the hairball. I'll take the cue from you, sebbacon.

@sebbacon
Contributor Author

Python 3.6.1, Ubuntu 17.04, so should be pretty similar.

Yes, I don't want to spend ages debugging it, but on the other hand, it's a frustrating and intriguing issue in equal measure.

I'll timebox 1 hour for this on Monday and make a decision after that.

@chadmiller
Contributor

chadmiller commented Mar 18, 2018 via email

@chadmiller
Contributor

Also, pull those two changes from my 'ungoogle' branch. Now no complex objects are pickled across the socket, like the few datetimes I missed before.
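
The idea, as a sketch (function and field names are illustrative, not the actual branch code):

```python
import datetime

# Before: each result carried datetime objects, pickled as full objects
# when sent back over the pipe to the parent process.
def extract_before(doc):
    return {"completion_date": datetime.date(2018, 3, 14)}

# After: format inside the worker and send back a plain string.
def extract_after(doc):
    return {"completion_date": datetime.date(2018, 3, 14).isoformat()}
```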

@sebbacon
Contributor Author

Well, this has been an interesting excursion. I was wrong (or miswrote my notes) about the error callback. It was the success callback:

  1. I found a single place on the internet which mentions that lxml uses a global dict as a cache for ids parsed from documents. Given there are no unique id attributes in our source XML, I'm not entirely convinced this is the problem, but it's suggestive, at least...
  2. There's no option to turn this off, but the response from the devs (no longer on the internet) includes this: "I guess it wouldn't be difficult to provide a global option in lxml to switch off the dict sharing. This is already done for threads..."
  3. 'lxml' releases the GIL during parsing
  4. Looking for an easy way to try this with threads, I discovered something new to me: multiprocessing.dummy -- very cool!
  5. Memory no longer "leaks". But it's not particularly fast on my laptop: 5m28s. Not CPU-bound. I'm not convinced we're getting concurrency. Perhaps something else is blocking on the GIL.

My hour's up! I'm intrigued as to why this wasn't an issue for you. My lxml version is 4.1.1.

5 minutes is possibly a bit slow. But I'll also think a little more (prob. tomorrow now) on whether 5 mins is really an issue or not -- perhaps we can find workarounds now it's all Python (e.g. thinking about how to make it easy for @NickCEBM to write unit tests).

Here's how I switched to threads:

```diff
diff --git a/load_data.py b/load_data.py
index 71b2203..86652ac 100644
--- a/load_data.py
+++ b/load_data.py
@@ -10,7 +10,7 @@ import contextlib
 import re
 from zipfile import ZipFile
 from csv import DictWriter
-import multiprocessing
+from multiprocessing.dummy import Pool as ThreadedPool
 from time import time
 
 import extraction
@@ -32,7 +32,7 @@ def document_stream(zip_filename):
 
 
 def fabricate_csv(input_filename, output_filename):
-    pool = multiprocessing.Pool()
+    pool = ThreadedPool()
```

@sebbacon
Contributor Author

@chadmiller, quick update for you - currently juggling several projects at once, so I've not had time to return to this yet. We are recruiting, so hopefully we'll get unstuck on this soonish (e.g. in 4 weeks).

@NickCEBM
Contributor

Movement on this per #182: move to Python rather than refactoring the SQL.

@NickCEBM
Contributor

This SQL hairball has long since been replaced with Python per the above, so closing.
