
tumblr_backup: Add --save-notes and --cookies options #189

Closed
wants to merge 9 commits

Conversation

cebtenzzre (Collaborator)

I've tested this with only one post so far, but it seems to work alright. It brings in a lot of new code and dependencies, but it's all optional.
This also allows youtube-dl to use the cookies that are provided.
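
For reference, youtube-dl accepts the same cookies file through its 'cookiefile' option. A minimal sketch, assuming the file passed to --cookies is a Netscape-format cookies.txt (the URL is just an example):

```python
# Sketch: hand the --cookies file to youtube-dl as well.
# Assumes Netscape cookies.txt format, which is what 'cookiefile' expects.
import youtube_dl

cookie_path = 'cookies.txt'   # the value given to --cookies
ydl_opts = {
    'quiet': True,
    'cookiefile': cookie_path,
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://staff.tumblr.com/post/123456789'])  # example URL
```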

tumblr_backup.py Outdated

notes_str = u'%d note%s' % (self.note_count, 's'[self.note_count == 1:])

if options.save_notes and web_crawler:
bbolli (Owner)

The and web_crawler seems redundant. We never get here with just save_notes set.

tumblr_backup.py Outdated
@@ -1223,4 +1395,7 @@ def request_callback(option, opt, value, parser):
except KeyboardInterrupt:
sys.exit(EXIT_INTERRUPT)

if options.save_notes and web_crawler:
bbolli (Owner)

See line 1074

bbolli (Owner) commented Dec 15, 2018

I did a very quick review inline. More importantly: is it possible to move the part starting at line 2075 into class WebCrawler and maybe the whole class into a separate file?

cebtenzzre (Collaborator, Author)

The and web_crawler is there so that web_crawler can fail to load (line 1385) and the script can still continue. Now that I think about it, it's probably better if we just throw an exception.

bbolli (Owner) commented Dec 15, 2018

Yes, better not let the user think the notes were saved when the crawler can't initialize.
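
Something like this would make the failure loud (a sketch, not the PR's actual code; options is tumblr_backup's parsed command-line options):

```python
# Sketch: refuse to start if --save-notes was given but the crawler module
# (or one of its dependencies) failed to import.
try:
    import web_crawler
except ImportError:
    web_crawler = None

if options.save_notes and web_crawler is None:
    raise RuntimeError('--save-notes requires the web_crawler module '
                       'and its optional dependencies')
```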

cebtenzzre (Collaborator, Author)

So far I've removed the exception-silencing and moved more code into WebCrawler.

cebtenzzre (Collaborator, Author)

D'oh, I somehow just realized that cookielib is an official Python module.
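
For anyone following along, loading a Netscape-format cookies.txt needs nothing outside the standard library (Python 2 names shown; Python 3 renamed these to http.cookiejar and urllib.request). A minimal sketch:

```python
# Sketch: load a Mozilla/Netscape cookies.txt and attach it to an opener.
import cookielib
import urllib2

jar = cookielib.MozillaCookieJar('cookies.txt')
jar.load(ignore_discard=True, ignore_expires=True)  # keep session cookies too

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
html = opener.open('https://www.tumblr.com/').read()
```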

cebtenzzre (Collaborator, Author)

The class is in its own file now. I hope I did that right xD

bbolli (Owner) commented Dec 15, 2018

You can also move the import checks into the module, then check whether an import succeeded with, e.g., if web_crawler.bs4. You could do this in an additional commit, so the evolution can be seen later.
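
A sketch of that pattern (names are illustrative, not the PR's actual code):

```python
# web_crawler.py -- sketch: do the optional imports once, here, and leave the
# result visible as a module attribute.
try:
    import bs4          # BeautifulSoup 4, used to parse the notes HTML
except ImportError:
    bs4 = None

# tumblr_backup.py -- the caller only has to check the attribute.
import web_crawler

if options.save_notes and not web_crawler.bs4:
    raise RuntimeError('--save-notes requires BeautifulSoup 4 (bs4)')
```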

cebtenzzre (Collaborator, Author)

I'm now trying to figure out how to get it to cleanly exit on SIGINT. Currently it throws a lot of exceptions.
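
One general approach (not the PR's code) is to turn SIGINT into a shutdown flag that the crawler's loops poll, so the worker threads wind down instead of each raising its own exception:

```python
# Sketch: cooperative shutdown on Ctrl+C instead of exceptions from every thread.
import signal
import threading

stop_requested = threading.Event()

def _on_sigint(signum, frame):
    stop_requested.set()          # crawler loops check this flag and return

signal.signal(signal.SIGINT, _on_sigint)

# inside the crawler's fetch loop:
#   while not stop_requested.is_set():
#       fetch_next_notes_page()
```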

cebtenzzre (Collaborator, Author)

All of the obvious issues have now been fixed.

cebtenzzre (Collaborator, Author) commented Dec 16, 2018

Well, it still seems to (sometimes) either ignore SIGINT, or hang when it's sent. Luckily SIGQUIT (Ctrl+\) still works.

cebtenzzre (Collaborator, Author)

I've somehow only just realized that JavaScript (and thus Selenium) isn't necessary for this. In my testing, I must have been modifying too many variables at once.
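
In other words, the notes pages can be fetched and parsed with plain HTTP plus BeautifulSoup, roughly like this (the selectors and URL handling are placeholders, not what the code actually uses):

```python
# Sketch: fetch one page of notes without a browser. The selectors below are
# placeholders -- the real markup and pagination differ.
import bs4
import urllib2

def fetch_notes_page(url, opener):
    html = opener.open(url).read()
    soup = bs4.BeautifulSoup(html, 'lxml')
    notes = soup.find_all('li', class_='note')         # placeholder selector
    more = soup.find('a', class_='more_notes_link')     # placeholder selector
    return notes, (more['href'] if more else None)
```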

cebtenzzre (Collaborator, Author)

Alright, now it actually gets the right notes (I think all of the Selenium requests were simultaneous and on the same virtual tab before c70f09a) and it doesn't eat up an insane amount of CPU anymore. Those are good things.

It's still having issues with SIGINT, and it still doesn't get all of the notes (at least according to the note count -- is that accurate?), but it's good enough for me to use now.

cebtenzzre (Collaborator, Author)

So, locally I have a version that runs the web crawler as a subprocess. The reasoning behind this is that, because of the GIL, Python threads can't actually run Python code in parallel. And the web crawler does a lot of work that would benefit from real parallelism (HTML parsing, waiting on HTTP requests, and looping back to get the next set of notes).
I first tried to use IronPython, which doesn't have a GIL, but after building my own AUR package just to get it to build and import modules, I found that it won't support lxml without Ironclad, and Ironclad will only build properly on Windows.

With -k --save-notes -s 200 -n 200 --cookies <cookiefile> on one blog, I compared the execution time of each (note the value of real).

As a module:

real	107.98s
user	98.57s
sys	34.58s

As a subprocess:

real	41.27s
user	71.70s
sys	8.17s

So this is more than a 2x speedup of the overall execution (including everything outside of the crawler). The only downside is that it uses more RAM and CPU, due to running multiple interpreter instances.
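
For illustration, multiprocessing gives the same effect as a hand-rolled subprocess (a second interpreter with its own GIL), so the crawler's parsing and HTTP waits stop competing with the main download loop. Names below are placeholders:

```python
# Sketch: run the crawler in its own interpreter so it doesn't share a GIL
# with the main backup loop.
import multiprocessing

def crawl_notes(post_urls, results):
    for url in post_urls:           # placeholder for web_crawler's real work
        results.put((url, '<parsed notes for %s>' % url))

if __name__ == '__main__':
    results = multiprocessing.Queue()
    crawler = multiprocessing.Process(
        target=crawl_notes,
        args=(['https://example.tumblr.com/post/1'], results),
    )
    crawler.start()
    # ... the main process keeps saving posts and media here ...
    crawler.join()
    while not results.empty():
        url, notes = results.get()
```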

Should we make this the default, or perhaps an option?

cebtenzzre (Collaborator, Author)

This is extremely out of date now. I've been doing a lot of work on a few local forks.
If there's ever further interest in this, I will organize and publish my current code.

cebtenzzre closed this May 20, 2019