
tumblr_backup: Add --save-notes and --cookies options #189

Closed
wants to merge 9 commits

Conversation

cebtenzzre (Collaborator)

I've tested this with only one post so far, but it seems to work alright. It brings in a lot of new code and dependencies, but it's all optional.
This also allows youtube-dl to use the cookies that are provided.
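
For reference, youtube-dl accepts the same cookies file through its 'cookiefile' option. A minimal sketch, assuming the file passed to --cookies is a Netscape-format cookies.txt (the URL is just an example):

```python
# Sketch: hand the --cookies file to youtube-dl as well.
# Assumes Netscape cookies.txt format, which is what 'cookiefile' expects.
import youtube_dl

cookie_path = 'cookies.txt'   # the value given to --cookies
ydl_opts = {
    'quiet': True,
    'cookiefile': cookie_path,
}
with youtube_dl.YoutubeDL(ydl_opts) as ydl:
    ydl.download(['https://staff.tumblr.com/post/123456789'])  # example URL
```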

tumblr_backup.py Outdated

notes_str = u'%d note%s' % (self.note_count, 's'[self.note_count == 1:])

if options.save_notes and web_crawler:
bbolli (Owner)

The and web_crawler seems redundant. We never get here with just save_notes set.

tumblr_backup.py Outdated
@@ -1223,4 +1395,7 @@ def request_callback(option, opt, value, parser):
except KeyboardInterrupt:
sys.exit(EXIT_INTERRUPT)

if options.save_notes and web_crawler:
bbolli (Owner)

See line 1074

bbolli (Owner) commented Dec 15, 2018

I did a very quick review inline. More importantly: is it possible to move the part starting at line 2075 into class WebCrawler and maybe the whole class into a separate file?

cebtenzzre (Collaborator, Author)

The and web_crawler is there so that web_crawler can fail to load (line 1385) and the script can still continue. Now that I think about it, it's probably better if we just throw an exception.

bbolli (Owner) commented Dec 15, 2018

Yes, better not let the user think the notes were saved when the crawler can't initialize.
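
Something like this would make the failure loud (a sketch, not the PR's actual code; options is tumblr_backup's parsed command-line options):

```python
# Sketch: refuse to start if --save-notes was given but the crawler module
# (or one of its dependencies) failed to import.
try:
    import web_crawler
except ImportError:
    web_crawler = None

if options.save_notes and web_crawler is None:
    raise RuntimeError('--save-notes requires the web_crawler module '
                       'and its optional dependencies')
```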

cebtenzzre (Collaborator, Author)

So far I've removed the exception-silencing and moved more code into WebCrawler.

cebtenzzre (Collaborator, Author)

D'oh, I somehow just realized that cookielib is an official Python module.
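
For anyone following along, loading a Netscape-format cookies.txt needs nothing outside the standard library (Python 2 names shown; Python 3 renamed these to http.cookiejar and urllib.request). A minimal sketch:

```python
# Sketch: load a Mozilla/Netscape cookies.txt and attach it to an opener.
import cookielib
import urllib2

jar = cookielib.MozillaCookieJar('cookies.txt')
jar.load(ignore_discard=True, ignore_expires=True)  # keep session cookies too

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
html = opener.open('https://www.tumblr.com/').read()
```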

cebtenzzre (Collaborator, Author)

The class is in its own file now. I hope I did that right xD

bbolli (Owner) commented Dec 15, 2018

You can also move the import checks into the module, then check whether an import succeeded with, e.g., if web_crawler.bs4. You could do this in an additional commit, so the evolution can be seen later.
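
A sketch of that pattern (names are illustrative, not the PR's actual code):

```python
# web_crawler.py -- sketch: do the optional imports once, here, and leave the
# result visible as a module attribute.
try:
    import bs4          # BeautifulSoup 4, used to parse the notes HTML
except ImportError:
    bs4 = None

# tumblr_backup.py -- the caller only has to check the attribute.
import web_crawler

if options.save_notes and not web_crawler.bs4:
    raise RuntimeError('--save-notes requires BeautifulSoup 4 (bs4)')
```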

cebtenzzre (Collaborator, Author)

I'm now trying to figure out how to get it to cleanly exit on SIGINT. Currently it throws a lot of exceptions.
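
One general approach (not the PR's code) is to turn SIGINT into a shutdown flag that the crawler's loops poll, so the worker threads wind down instead of each raising its own exception:

```python
# Sketch: cooperative shutdown on Ctrl+C instead of exceptions from every thread.
import signal
import threading

stop_requested = threading.Event()

def _on_sigint(signum, frame):
    stop_requested.set()          # crawler loops check this flag and return

signal.signal(signal.SIGINT, _on_sigint)

# inside the crawler's fetch loop:
#   while not stop_requested.is_set():
#       fetch_next_notes_page()
```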

cebtenzzre (Collaborator, Author)

All of the obvious issues have now been fixed.

cebtenzzre (Collaborator, Author) commented Dec 16, 2018

Well, it still seems to (sometimes) either ignore SIGINT, or hang when it's sent. Luckily SIGQUIT (Ctrl+\) still works.

cebtenzzre (Collaborator, Author)

I've somehow only just realized that JavaScript (and thus Selenium) isn't necessary for this. In my testing, I must have been modifying too many variables at once.
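
In other words, the notes pages can be fetched and parsed with plain HTTP plus BeautifulSoup, roughly like this (the selectors and URL handling are placeholders, not what the code actually uses):

```python
# Sketch: fetch one page of notes without a browser. The selectors below are
# placeholders -- the real markup and pagination differ.
import bs4
import urllib2

def fetch_notes_page(url, opener):
    html = opener.open(url).read()
    soup = bs4.BeautifulSoup(html, 'lxml')
    notes = soup.find_all('li', class_='note')         # placeholder selector
    more = soup.find('a', class_='more_notes_link')     # placeholder selector
    return notes, (more['href'] if more else None)
```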

cebtenzzre (Collaborator, Author)

Alright, now it actually gets the right notes (I think all of the Selenium requests were simultaneous and on the same virtual tab before c70f09a) and it doesn't eat up an insane amount of CPU anymore. Those are good things.

It's still having issues with SIGINT, and it still doesn't get all of the notes (at least according to the note count -- is that accurate?), but it's good enough for me to use now.

cebtenzzre (Collaborator, Author)

So, locally I have a version that runs the web crawler as a subprocess. The reasoning behind this is that, because of the GIL, Python threads can't actually run Python code in parallel. And the web crawler does a lot of work that would benefit from real parallelism (HTML parsing, waiting on HTTP requests, and looping back to get the next set of notes).
I first tried to use IronPython, which doesn't have a GIL, but after building my own AUR package just to get it to build and import modules, I found that it won't support lxml without Ironclad, and Ironclad will only build properly on Windows.

With -k --save-notes -s 200 -n 200 --cookies <cookiefile> on one blog, I compared the execution time of each (note the value of real).

As a module:

real	107.98s
user	98.57s
sys	34.58s

As a subprocess:

real	41.27s
user	71.70s
sys	8.17s

So this is more than a 2x speedup of the overall execution (including everything outside of the crawler). The only downside is that it uses more RAM and CPU, due to running multiple interpreter instances.
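
For illustration, multiprocessing gives the same effect as a hand-rolled subprocess (a second interpreter with its own GIL), so the crawler's parsing and HTTP waits stop competing with the main download loop. Names below are placeholders:

```python
# Sketch: run the crawler in its own interpreter so it doesn't share a GIL
# with the main backup loop.
import multiprocessing

def crawl_notes(post_urls, results):
    for url in post_urls:           # placeholder for web_crawler's real work
        results.put((url, '<parsed notes for %s>' % url))

if __name__ == '__main__':
    results = multiprocessing.Queue()
    crawler = multiprocessing.Process(
        target=crawl_notes,
        args=(['https://example.tumblr.com/post/1'], results),
    )
    crawler.start()
    # ... the main process keeps saving posts and media here ...
    crawler.join()
    while not results.empty():
        url, notes = results.get()
```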

Should we make this the default, or perhaps an option?

cebtenzzre (Collaborator, Author)

This is extremely out of date now. I've been doing a lot of work on a few local forks.
If there's ever further interest in this, I will organize and publish my current code.

cebtenzzre closed this May 20, 2019