tumblr_backup: Add --save-notes and --cookies options #189
Conversation
tumblr_backup.py
Outdated

    notes_str = u'%d note%s' % (self.note_count, 's'[self.note_count == 1:])
    ...
    if options.save_notes and web_crawler:
The `and web_crawler` seems redundant. We never get here with just save_notes set.
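The quoted line uses Python's slice trick for pluralization. A minimal sketch of how `'s'[n == 1:]` behaves (the helper name is illustrative, not from the patch):

```python
def plural_suffix(n):
    # 's'[n == 1:] slices the one-character string 's':
    #   n == 1 -> True  (== 1) -> 's'[1:] == ''   (singular: "1 note")
    #   n != 1 -> False (== 0) -> 's'[0:] == 's'  (plural:   "2 notes")
    return 's'[n == 1:]

notes_str = '%d note%s' % (2, plural_suffix(2))  # -> '2 notes'
```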
tumblr_backup.py
Outdated

    @@ -1223,4 +1395,7 @@ def request_callback(option, opt, value, parser):
         except KeyboardInterrupt:
             sys.exit(EXIT_INTERRUPT)

         if options.save_notes and web_crawler:
See line 1074
I did a very quick review inline. More importantly: is it possible to move the part starting at line 2075 into class WebCrawler, and maybe the whole class into a separate file?
Yes, better not let the user think the notes were saved when the crawler can't initialize.
So far I've removed the exception-silencing and moved more code into WebCrawler.
D'oh, I somehow just realized that cookielib is an official Python module.
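Since cookielib (`http.cookiejar` on Python 3) ships with the standard library, a `--cookies` option needs no extra dependencies. A sketch of loading a Netscape-format cookies.txt into a urllib opener; `make_opener` is illustrative, not the PR's actual code:

```python
try:
    from http.cookiejar import MozillaCookieJar   # Python 3
    from urllib.request import build_opener, HTTPCookieProcessor
except ImportError:
    from cookielib import MozillaCookieJar        # Python 2
    from urllib2 import build_opener, HTTPCookieProcessor

def make_opener(cookie_file=None):
    # Hypothetical helper: attach a (possibly empty) cookie jar to an
    # opener; MozillaCookieJar reads the Netscape cookies.txt format.
    jar = MozillaCookieJar()
    if cookie_file:
        jar.load(cookie_file, ignore_discard=True, ignore_expires=True)
    return build_opener(HTTPCookieProcessor(jar))
```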
The class is in its own file now. I hope I did that right xD
You can also move the import checks into the module, then check if an import succeeded with
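The suggestion is to attempt each optional import once inside the module and let callers test the result. A sketch of that pattern (the module and helper names are illustrative):

```python
# In the module (e.g. web_crawler.py): do the optional import once.
try:
    import selenium
except ImportError:
    selenium = None  # callers can test: web_crawler.selenium is None

def crawler_available():
    # Hypothetical convenience so callers need not know which
    # optional imports the crawler actually depends on.
    return selenium is not None
```

tumblr_backup.py could then bail out early with a clear message when `--save-notes` is requested but the optional dependency is missing.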
I'm now trying to figure out how to get it to cleanly exit on SIGINT. Currently it throws a lot of exceptions.
All of the obvious issues have now been fixed.
Well, it still seems to (sometimes) either ignore SIGINT, or hang when it's sent. Luckily SIGQUIT (
I've somehow only just realized that JavaScript (and thus Selenium) isn't necessary for this. In my testing, I must have been modifying too many experimental variables at once.
Alright, now it actually gets the right notes (I think all of the Selenium requests were simultaneous and on the same virtual tab before c70f09a), and it doesn't eat up an insane amount of CPU anymore. Those are good things. It's still having issues with SIGINT, and it still doesn't get all of the notes (at least according to the note count -- is that accurate?), but it's good enough for me to use now.
So, locally I have a version that runs the web crawler as a subprocess. The reasoning behind this is that Python threads aren't real threads, because of the GIL, and we do a lot of work in the web crawler that would probably be better off actually running in parallel (HTML parsing, waiting on HTTP requests, and looping back to get the next set of notes).

As a module:

As a subprocess:
So this is more than a 2x speedup of the overall execution (including everything outside of the crawler). The only downside is that it uses more RAM and CPU, due to running multiple interpreter instances. Should we make this the default, or perhaps an option?
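The subprocess approach can be sketched with the standard multiprocessing module, which sidesteps the GIL by running the crawler in its own interpreter; `crawl_notes` below is a stand-in for the real WebCrawler work, not code from this PR:

```python
from multiprocessing import Process, Queue

def crawl_notes(post_url, out_queue):
    # Stand-in for the real work: parse HTML, follow the "more notes"
    # links, then push the collected notes back to the parent process.
    out_queue.put((post_url, ['note 1', 'note 2']))

def run_crawler(post_url):
    queue = Queue()
    worker = Process(target=crawl_notes, args=(post_url, queue))
    worker.start()
    result = queue.get()   # blocks until the crawler reports back
    worker.join()
    return result
```

As noted, the parent pays for a second interpreter's RAM and CPU, but the HTML parsing and HTTP waits no longer contend for the backup process's GIL.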
This is extremely out of date now. I've been doing a lot of work on a few local forks.
I've tested this with only one post so far, but it seems to work alright. It brings in a lot of new code and dependencies, but it's all optional.
This also allows youtube-dl to use the cookies that are provided.
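Handing the cookies on to youtube-dl can be sketched with its `cookiefile` option, reusing the same optional-import pattern discussed earlier; `download_video` is an illustrative wrapper, not code from this PR:

```python
try:
    import youtube_dl
except ImportError:
    youtube_dl = None

def download_video(url, cookie_file):
    # 'cookiefile' is youtube-dl's option for a Netscape-format
    # cookies.txt, i.e. the same file --cookies would point at.
    if youtube_dl is None:
        return False
    opts = {'cookiefile': cookie_file, 'quiet': True}
    with youtube_dl.YoutubeDL(opts) as ydl:
        ydl.download([url])
    return True
```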