
CSV export results in empty file with no data #2157

Closed
noisydeadlines opened this issue Jun 28, 2022 · 22 comments · Fixed by #2713 or #2741
Labels
bug, feature: import/export

Comments

@noisydeadlines

Describe the bug
When using the CSV data export, the resulting file is empty.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'Settings'
  2. Click on 'CSV Export'
  3. Click on 'Download the File'
  4. Open the downloaded file

Expected behavior
A CSV file containing data from the user's books.

Screenshots
[Screenshot: empty CSV file]

Instance
BookWyrm Social


Desktop (please complete the following information):
- OS: Windows 10
- Browser: Edge
- Version 103.0.1264.37 (Official build) (64-bit)

@noisydeadlines added the bug label Jun 28, 2022
@mouse-reeve
Member

I suspect this is happening because of a timeout -- it was reported earlier and self-resolved without me doing anything.

@noisydeadlines
Author

Hi! I tried exporting it again twice, and I still get the same empty file every time.

@Strubbl
Contributor

Strubbl commented Jul 15, 2022

I can confirm this issue was happening with v0.4.3, and the export is still not working with v0.4.4. I was trying to export on the BookWyrm Social instance.

@cubicgarden

cubicgarden commented Nov 9, 2022

I can confirm I'm getting the exact same thing too.
Tried to export twice and spoke with @mouse-reeve, who pointed to this issue.

Desktop

  • OS: Ubuntu 22.04.1
  • Browser: Firefox 106.0.2 (64-bit)

@r3k2

r3k2 commented Nov 21, 2022

Same here, and it's annoying: the only reason I wanted this is to migrate to another server. The instance I'm on (https://bookwyrm.social/) isn't being taken care of and has been down more than up for many people. I've been hitting issues for more than 9 months: I can't edit a group list some of us made; it works when I'm not logged in, but as soon as I log in it breaks with an nginx bad gateway, which usually happens when the service is unresponsive or the container is low on memory. I got tired of waiting for the admin to reply to us; I even offered to SSH in so I could fix it myself (I've been a systems engineer for 24 years) and to help troubleshoot, and sent screenshots 2-3 times. I waited 9 months because I know this isn't a job and it's volunteer work, but here comes the point: I tried to get the CSV to migrate to another instance, and it's empty :( I'm on Arch GNU/Linux; I tried with qutebrowser, Firefox, and Chromium.
[Screenshot: nginx bad gateway error]

@mouse-reeve
Member

@r3k2 -- I just wanted to acknowledge how frustrated and unheard it sounds like you're feeling with this. It sounds like you've been impacted by bugs that interfere with basic usability, and you aren't getting a response that makes you feel like the problems you're encountering are being acknowledged and taken seriously. I imagine it's just as frustrating for everyone who's been reporting and +1'ing these as well!

When I get bug reports that I have trouble replicating and am not sure how to fix, I often find it overwhelming and I'm not always sure how to get outside help in fixing them. As a result, there are some bugs, like the ones you've encountered, that I've been aware of but pretty stuck about how to address, and as a result they languish in the issue tracker. The scale and demands of the project have gone way up and my capacity to fix bugs and bring in support from others hasn't kept pace. I find that really daunting and a little scary. So given that, it's super generous of you to have offered your expertise to help fix it, and I apologize for missing that you made that offer; although I try my best, I still fail to notice or forget things.

I believe the source of this specific bug is that the server is overloaded and the query is timing out, returning an empty CSV. For the last couple of weeks I have been focusing (with some greatly appreciated help) on improving the overall performance situation, which is related but doesn't directly address this specific thing not working. I'd be really open to suggestions on how a data export can be compiled in a way that is robust to server load and timeouts.

It will also need some expanding to be an effective migration utility, but that's very doable.

@eldang

eldang commented Dec 9, 2022

I'm also frustrated by this bug, and I'm wondering if the solution is to make the export asynchronous. I have two examples in mind which might be useful as patterns:

LibraryThing: when I exported from there to import to Bookwyrm, clicking the "Export all books" button didn't immediately give me a link, instead it showed some kind of "in progress" feedback. Trying it now, I got a blue bar that says "nnn books processed", updating every 100 books. It took a few minutes to do my whole library, and then was replaced with a link in the format https://www.librarything.com/download_export_file.php?uniqueId=HexadecimalIDHere . Presumably that hex ID is a lookup in a table that connects to an actual file; clicking the link gets me a file with my username in the file name.

Allen Coral Atlas: I actually maintain the downloads system for this one. Downloads can take a couple of hours to assemble in the worst case, so we pre-package common requests and then have a fully asynchronous system for everything else. It goes like this:

  1. User requests download, which starts a background job with a database row to track its progress internally.
  2. User immediately gets an acknowledgement email so they know their request is in the works.
  3. A whole set of tasks get kicked off behind the scenes, some in parallel, some polling for the output of others.
  4. When the last one is done, the file is uploaded to a server, the database row gets the URL added to it, and the user gets an email with the link.

The ACA version is probably overkill here! But something like what LibraryThing does, perhaps also generating a notification that contains the link, seems like it could work.
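
For illustration, here is a minimal sketch of the asynchronous pattern described above, using a Celery task (the approach the thread converges on later). The ExportJob model, the export_file field, and the generate_export_rows helper are hypothetical names used only for this sketch, not existing BookWyrm code.

    # Minimal sketch of an asynchronous CSV export; ExportJob and
    # generate_export_rows are hypothetical, not BookWyrm code.
    import csv
    import io

    from celery import shared_task
    from django.core.files.base import ContentFile

    from bookwyrm import models  # assumed import path


    @shared_task
    def export_user_csv(export_job_id):
        """Build a user's CSV export in the background and store the result."""
        job = models.ExportJob.objects.get(id=export_job_id)  # hypothetical model
        job.status = "in progress"
        job.save(update_fields=["status"])

        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(["title", "author", "isbn", "shelf", "rating"])
        for row in generate_export_rows(job.user):  # hypothetical helper
            writer.writerow(row)

        # Save the finished file and mark the job done, so a view or a
        # notification can hand the user a download link afterwards.
        job.export_file.save(f"export-{job.user.id}.csv", ContentFile(buffer.getvalue()))
        job.status = "complete"
        job.save(update_fields=["status"])

The view would then only create the job row and enqueue the task (export_user_csv.delay(job.id)), returning an "export in progress" page immediately; the finished file gets linked from a notification or email once the task completes.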

@nein09

nein09 commented Dec 12, 2022

Hi, I'm @eldang's spouse and I told him that I would take a crack at fixing this bug as a Christmas gift. Luckily, I do a lot of work on Django sites, so it's like... actually realistic of me to offer this.

I haven't done a lot of digging here yet, so I might be way off-base, but something like https://pypi.org/project/django-import-export-celery/ might go a long way toward helping make this happen.

I think what I'm likely to do is start with something tagged as "good first bug" to get my feet wet in this repo, too.

So... hi! Hope I can help!

@mouse-reeve
Member

Thank you! Help on this ticket would be super appreciated, and I agree that using Celery seems like the right approach. I'd be happy to work with you and provide whatever help and explanations of weird codebase things I can.

@todrobbins

todrobbins commented Jan 19, 2023

@nein09, any luck with django-import-export-celery? I'd love to learn about how you're approaching this challenge.

@nein09

nein09 commented Jan 23, 2023

@todrobbins I decided to start with what looks like a smaller piece of work first (#1678), and life has been getting in the way of even that, so I'm not quite there yet. Sadly, there isn't much to report, but I haven't forgotten about it.

@phildini
Contributor

Hi! As a note on possible causes here, I was able to get an export from my own bookwyrm instance just fine, but the export I got from bookwyrm.social was blank.

@WesleyAC
Member

WesleyAC commented Mar 7, 2023

I made some interesting progress on this today. I've been playing with our gunicorn setup (since I'm pretty sure it's responsible for many of the loading delays on bookwyrm.social recently), and I bumped the timeout up to 600 seconds to test this (also increasing the timeout on the nginx side).

This prevents the timeout, but there's still a problem — psycopg2 reports that the disk is full:

Long traceback
2023-03-07T21:12:33.957615255Z [2023-03-07 21:12:33 +0000] [34] [ERROR] Error handling request
2023-03-07T21:12:33.957816312Z Traceback (most recent call last):
2023-03-07T21:12:33.957824404Z   File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
2023-03-07T21:12:33.957830447Z     return self.cursor.execute(sql, params)
2023-03-07T21:12:33.957838102Z psycopg2.errors.DiskFull: could not write to file "base/pgsql_tmp/pgsql_tmp258365.0.sharedfileset/o7of16.p2.0": No space left on device
2023-03-07T21:12:33.957866374Z CONTEXT:  parallel worker
2023-03-07T21:12:33.957871138Z 
2023-03-07T21:12:33.957875345Z 
2023-03-07T21:12:33.957879675Z The above exception was the direct cause of the following exception:
2023-03-07T21:12:33.957890816Z 
2023-03-07T21:12:33.957895746Z Traceback (most recent call last):
2023-03-07T21:12:33.957900217Z   File "/usr/local/lib/python3.9/site-packages/gunicorn/workers/base_async.py", line 113, in handle_request
2023-03-07T21:12:33.957905097Z     for item in respiter:
2023-03-07T21:12:33.957909616Z   File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/wsgi.py", line 269, in __iter__
2023-03-07T21:12:33.957914304Z     reraise(*_capture_exception(self._hub))
2023-03-07T21:12:33.957919300Z   File "/usr/local/lib/python3.9/site-packages/sentry_sdk/_compat.py", line 56, in reraise
2023-03-07T21:12:33.957924185Z     raise value
2023-03-07T21:12:33.957929101Z   File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/wsgi.py", line 265, in __iter__
2023-03-07T21:12:33.957933608Z     chunk = next(iterator)
2023-03-07T21:12:33.957938084Z   File "/app/bookwyrm/views/preferences/export.py", line 44, in <genexpr>
2023-03-07T21:12:33.974117970Z     (writer.writerow(row) for row in generator),
2023-03-07T21:12:33.974127943Z   File "/app/bookwyrm/views/preferences/export.py", line 65, in csv_row_generator
2023-03-07T21:12:33.974134306Z     for book in books:
2023-03-07T21:12:33.974139525Z   File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 280, in __iter__
2023-03-07T21:12:33.974144935Z     self._fetch_all()
2023-03-07T21:12:33.974149851Z   File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 1324, in _fetch_all
2023-03-07T21:12:33.974154775Z     self._result_cache = list(self._iterable_class(self))
2023-03-07T21:12:33.974159318Z   File "/usr/local/lib/python3.9/site-packages/model_utils/managers.py", line 38, in __iter__
2023-03-07T21:12:33.974163645Z     yield from iter
2023-03-07T21:12:33.974168584Z   File "/usr/local/lib/python3.9/site-packages/django/db/models/query.py", line 51, in __iter__
2023-03-07T21:12:33.974173790Z     results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
2023-03-07T21:12:33.974178793Z   File "/usr/local/lib/python3.9/site-packages/django/db/models/sql/compiler.py", line 1175, in execute_sql
2023-03-07T21:12:33.974183543Z     cursor.execute(sql, params)
2023-03-07T21:12:33.974198530Z   File "/usr/local/lib/python3.9/site-packages/sentry_sdk/integrations/django/__init__.py", line 562, in execute
2023-03-07T21:12:33.974203863Z     return real_execute(self, sql, params)
2023-03-07T21:12:33.974208496Z   File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 66, in execute
2023-03-07T21:12:33.974233721Z     return self._execute_with_wrappers(sql, params, many=False, executor=self._execute)
2023-03-07T21:12:33.974239021Z   File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 75, in _execute_with_wrappers
2023-03-07T21:12:33.974244133Z     return executor(sql, params, many, context)
2023-03-07T21:12:33.974249552Z   File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
2023-03-07T21:12:33.974254546Z     return self.cursor.execute(sql, params)
2023-03-07T21:12:33.974259352Z   File "/usr/local/lib/python3.9/site-packages/django/db/utils.py", line 90, in __exit__
2023-03-07T21:12:33.974264361Z     raise dj_exc_value.with_traceback(traceback) from exc_value
2023-03-07T21:12:33.974269235Z   File "/usr/local/lib/python3.9/site-packages/django/db/backends/utils.py", line 84, in _execute
2023-03-07T21:12:33.974273780Z     return self.cursor.execute(sql, params)
2023-03-07T21:12:33.974278058Z django.db.utils.OperationalError: could not write to file "base/pgsql_tmp/pgsql_tmp258365.0.sharedfileset/o7of16.p2.0": No space left on device
2023-03-07T21:12:33.974282548Z CONTEXT:  parallel worker

And indeed, watch -n1 df -h while a CSV export is running shows that it quickly eats up all the remaining space on the server (which is promptly returned when the query fails).

Using the log_temp_files postgres setting, it seems that this is from creating many temporary files of around 200MB each, rather than a single enormous temporary file.

It looks to me like what's happening is that one of the queries in the for book in books: loop in the csv_row_generator is making a 200MB temp file, and the results of the query are being kept around until the generator is fully read (maybe? I don't have a strong picture of how generators work, nor the Django ORM).

While building this export in Celery is clearly the best long-term solution, I do think there's potential to rewrite the export code to fix this in the short term. I'm not sure why a generator was initially used (probably to save memory by avoiding storing the entire CSV in memory?), but unrolling it into a loop that generates the CSV in memory, rather than keeping the Query objects around and converting them to CSV one line at a time, would probably get exports working on bookwyrm.social for the moment.
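
For illustration, a minimal sketch of the short-term change described above: build the whole CSV in memory inside the view and return it in a single response, instead of streaming rows from a generator that keeps the querysets alive. The get_books_for_user helper and the column names are placeholders for this sketch; this is not the exact content of the linked pull request.

    # Sketch only: build the CSV fully in memory, then return it.
    import csv
    import io

    from django.http import HttpResponse


    def export_user_book_list(request):
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow(["title", "author_text", "isbn_13"])  # illustrative columns

        # get_books_for_user is a placeholder for the existing ORM query.
        for book in get_books_for_user(request.user):
            writer.writerow([book.title, book.author_text, book.isbn_13])

        # The queryset is fully consumed before the response is built, so no
        # Query objects (or their Postgres temp files) outlive the loop.
        response = HttpResponse(buffer.getvalue(), content_type="text/csv")
        response["Content-Disposition"] = "attachment; filename=bookwyrm-export.csv"
        return response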

WesleyAC added a commit to WesleyAC/bookwyrm that referenced this issue Mar 7, 2023
The idea behind a streaming CSV export was to reduce the amount of
memory used, by avoiding building the entire CSV file in memory before
sending it to the client. However, it didn't work out this way in
practice: the query objects that were created to represent each line
caused Postgres to generate a very large (~200MB on bookwyrm.social)
temp file, not to mention the memory being used by the Query object
likely being similar to, if not larger than that used by the finalized
CSV row.

While we should in the long term run our CSV exports as a Celery task,
this change should allow CSV exports to work on large servers without
causing disk-space problems.

Fixes: bookwyrm-social#2157
WesleyAC added a commit to WesleyAC/bookwyrm that referenced this issue Mar 7, 2023
WesleyAC added a commit to WesleyAC/bookwyrm that referenced this issue Mar 8, 2023
WesleyAC added a commit to WesleyAC/bookwyrm that referenced this issue Mar 9, 2023
(each with the same commit message as above)
@WesleyAC
Member

Looks like #2713 didn't have exactly the effect I was hoping for — somehow the same disk space problem is triggered. Not sure if it's that I don't understand when Django QuerySets are cleaned up, when Postgres temp files are cleaned up, or something else. Will continue investigating.

@mouse-reeve
Member

The fact that we're seeing a 500 error rather than an empty file is an improvement, I think!

@mouse-reeve mouse-reeve reopened this Mar 13, 2023
@WesleyAC
Member

Yeah, that's true :)

My guess is that the books query is trying to materialize tables that are larger than work_mem (there is a Materialize node in the EXPLAIN output), and those therefore get written to disk. The books query in question is:

        books = (
            models.Edition.viewer_aware_objects(request.user)
            .filter(
                Q(shelves__user=request.user)
                | Q(readthrough__user=request.user)
                | Q(review__user=request.user)
                | Q(comment__user=request.user)
                | Q(quotation__user=request.user)
            )
            .distinct()
        )

It seems like the culprit and solution will likely be similar to #2725 / #2726, although presumably more complicated since we'll need to figure out how to handle duplicates. If we were writing raw SQL it would be pretty easy to use CTEs to select all five of those as individual queries and then DISTINCT them at the end, but it looks like Django can't generate CTEs without using libraries like django-cte, which I'm not totally sure is worth the effort versus making five queries and then sticking the results in a set or something.
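
For illustration, a minimal sketch of the "five queries, then deduplicate" idea: filter on each relation separately so each query can use its own index, collect the edition IDs in a set, and fetch the distinct editions at the end. This is only a sketch of the approach, not the change that landed in #2741.

    # Sketch only: replace the OR'd filter with one query per relation.
    from bookwyrm import models  # assumed import path


    def get_export_editions(user):
        editions = models.Edition.viewer_aware_objects(user)
        edition_ids = set()
        for relation in (
            "shelves__user",
            "readthrough__user",
            "review__user",
            "comment__user",
            "quotation__user",
        ):
            edition_ids.update(
                editions.filter(**{relation: user}).values_list("id", flat=True)
            )
        # A final lookup by primary key returns each edition exactly once.
        return models.Edition.objects.filter(id__in=edition_ids)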

WesleyAC added a commit to WesleyAC/bookwyrm that referenced this issue Mar 13, 2023
Splitting this into five separate queries avoids the large join that
prevents us from using indexes, and requires materializing to disk.

Fixes: bookwyrm-social#2157 (hopefully)
@WesleyAC
Member

Just FYI, #2741 hasn't been deployed to bookwyrm.social yet, so CSV exports will still fail there until that's deployed — I'll comment in this issue when it is, so that anyone following this will get a notification.

@WesleyAC
Member

WesleyAC commented Apr 4, 2023

This should now be fixed on bookwyrm.social! I was able to export my own CSV history, albeit a small one. Please open a new issue if you have any problems with the CSV export.

@Strubbl @todrobbins FYI

@todrobbins

Thank you for your work on this, @WesleyAC! I'll test a bookwyrm.social export right now.

@eldang

eldang commented Apr 4, 2023

@WesleyAC Hooray! Thank you for fixing this. Export from bookwyrm.social just worked smoothly (and fairly quickly) for me, and it looks like importing that to books.theunseen.city (which is on v0.6.0) is also working.

@todrobbins

todrobbins commented Apr 4, 2023

Export worked perfectly and pretty quickly (~3s / 43.1 KB / 363 rows):

[screenshot]

@cubicgarden

cubicgarden commented Apr 4, 2023 via email
