Odd issue with FA scraper cookie storage #98

Open
mmsterful opened this issue Jul 15, 2020 · 9 comments

@mmsterful

I'm trying to run an FA scrape, after doing a git pull (and subsequently re-making my settings.py to get it working again), and getting this:

(venv)  ✘ username@Monolith  /mnt/c/Users/username/xA-Scraper   master ●  python3 -m manage fetch fa
Setting up loggers....
done
Setup
initialized manager
fetch args ['fa'] <class 'list'>
ScraperBase Init
Starting up
Main.WebRequest - INFO - Using global chromium tab pool
Starting up?
Creating pool
INFO: Creating engine for process! Engine name: 'MainProcess-MainThread'
Main.WebRequest - INFO - Fetching content at URL: http://www.furaffinity.net/controls/user-settings/
Main.WebRequest - INFO - Request for URL: http://www.furaffinity.net/controls/user-settings/ succeeded at Wed Jul 15 00:05:46 2020 On Attempt 1. Recieving...
Main.WebRequest - INFO - URL fully retrieved.
Main.WebRequest - INFO - Compression type = gzip. Content Size compressed = 7.966K. Decompressed = 27.251K. File type: text/html; charset=UTF-8.
Main.FaGet.StatusMgr - WARNING - Not logged in!
Main.FaGet.StatusMgr - INFO - Do not have login cookie. Retreiving one now.
Main.WebRequest - INFO - Fetching content at URL: http://www.furaffinity.net/controls/user-settings/
Main.WebRequest - INFO - Request for URL: http://www.furaffinity.net/controls/user-settings/ succeeded at Wed Jul 15 00:05:46 2020 On Attempt 1. Recieving...
Main.WebRequest - INFO - URL fully retrieved.
Main.WebRequest - INFO - Compression type = gzip. Content Size compressed = 7.966K. Decompressed = 27.251K. File type: text/html; charset=UTF-8.
Main.FaGet.StatusMgr - WARNING - Not logged in!
Main.FaGet.StatusMgr - ERROR - No captcha solver configured (or no solver with a non-zero balance)! Cannot continue!
Main - CRITICAL - Uncaught exception!
Main - CRITICAL - Uncaught exception
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/username/xA-Scraper/manage/__main__.py", line 112, in <module>
    go()
  File "/mnt/c/Users/username/xA-Scraper/manage/__main__.py", line 105, in go
    two_arg_go(sys.argv[1], sys.argv[2])
  File "/mnt/c/Users/username/xA-Scraper/manage/__main__.py", line 52, in two_arg_go
    scrape_manage.do_fetch([param])
  File "/mnt/c/Users/username/xA-Scraper/manage/scrape_manage.py", line 73, in do_fetch
    do_plugin(plgname)
  File "/mnt/c/Users/username/xA-Scraper/manage/scrape_manage.py", line 47, in do_plugin
    plg.runScraper(namespace)
  File "/mnt/c/Users/username/xA-Scraper/xascraper/modules/scraper_base.py", line 831, in runScraper
    instance.go(ctrlNamespace=managedNamespace)
  File "/mnt/c/Users/username/xA-Scraper/xascraper/modules/scraper_base.py", line 791, in go
    cookieStatus, msg = self.getCookie()
ValueError: too many values to unpack (expected 2)

Unfortunately, my skills aren't good enough for me to figure out what exactly is going on here with WebRequest and the cookies file. It's a valid LWP file; I tried updating it with the "a" and "b" cookies, to no avail. The "manual FA login" option on the web interface seems to no longer be functional; it looks like they removed the old secondary captcha.

Manually bypassing the cookie check by making it return True lets it scrape, but it reported possible missing art with 946 expected and 624 retrieved from the first artist, so I don't think it's logged in.

@fake-name
Owner

fake-name commented Jul 15, 2020

The critical part is:

Main.FaGet.StatusMgr - ERROR - No captcha solver configured (or no solver with a non-zero balance)! Cannot continue!

The exception is a bug in the cookie failure return value.
Basically, the failure handling had a bug, but since the failure isn't recoverable it wound up not causing any additional problems.
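
For readers unfamiliar with this class of error, here is a minimal sketch of how a mismatched return value produces the `ValueError` in the traceback. The function names below are hypothetical and are not taken from xA-Scraper; only the two-name unpacking mirrors the `cookieStatus, msg = self.getCookie()` line in the traceback.

```python
def get_cookie_buggy():
    # Hypothetical failure path that returns three values instead of two.
    return False, "Login Failed", "no captcha solver"


def get_cookie_fixed():
    # Every path returns exactly (status, message).
    return False, "Login Failed: no captcha solver"


def try_unpack(func):
    """Unpack func() into two names, reporting any ValueError as text."""
    try:
        status, msg = func()
    except ValueError as err:
        return "ValueError: %s" % err
    return "status=%s, msg=%s" % (status, msg)


print(try_unpack(get_cookie_buggy))
print(try_unpack(get_cookie_fixed))
```

The first call reports a "too many values to unpack" error; the second unpacks cleanly because the failure path returns the same shape as the success path.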

FA cannot log in automatically, since they use a captcha. You have to use the web interface to solve the captcha yourself (or use a captcha solving service).

The manual login stuff is kind of creaky; I strongly suggest using a service (I like anti-captcha.com). They're a bit of an affair to get money into, but $5-10 of credit should last nearly forever.


"and subsequently re-making my settings.py to get it working again"

Wow, how long since you last pulled? I don't think I've changed the settings file recently.

@mmsterful
Author

It's been quite some time, before the repo went down for a while. It may have been my fault that it stopped working.

That last commit fixes this problem, but now it's throwing an exception:

Main.FaGet.StatusMgr - INFO - Login attempt status = False (Login Failed).
Main - CRITICAL - Uncaught exception!
Main - CRITICAL - Uncaught exception
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/__main__.py", line 112, in <module>
    go()
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/__main__.py", line 105, in go
    two_arg_go(sys.argv[1], sys.argv[2])
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/__main__.py", line 52, in two_arg_go
    scrape_manage.do_fetch([param])
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/scrape_manage.py", line 73, in do_fetch
    do_plugin(plgname)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/manage/scrape_manage.py", line 47, in do_plugin
    plg.runScraper(namespace)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/xascraper/modules/scraper_base.py", line 831, in runScraper
    instance.go(ctrlNamespace=managedNamespace)
  File "/mnt/c/Users/atomicthumbs/xA-Scraper/xascraper/modules/scraper_base.py", line 793, in go
    assert cookieStatus, "Login failed! Cannot continue!"
AssertionError: Login failed! Cannot continue!

It's odd it doesn't think it has a valid cookie; the cookies.lwp file in the project base directory is valid and should allow it to log in. (Unless it's storing the actual values somewhere else I didn't know about.)

@fake-name
Owner

fake-name commented Jul 15, 2020

The cookies file being valid just means that the scraper exited correctly the last time it executed. Whether the relevant cookie in particular is in the cookies file is the issue, and in this case apparently it's not.
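
One way to check which cookies are actually in the file, independent of the scraper, is Python's standard `http.cookiejar.LWPCookieJar`. This is only a sketch: the helper name `missing_cookies` is made up for illustration, and the default domain and the required names `a` and `b` are assumptions based on this thread, not on the scraper's code.

```python
from http.cookiejar import LWPCookieJar


def missing_cookies(path, domain="furaffinity.net", required=("a", "b")):
    """Return the sorted required cookie names not found for `domain`."""
    jar = LWPCookieJar()
    # Keep session cookies and expired entries so everything in the file
    # is inspected, not just what a browser would still send.
    jar.load(path, ignore_discard=True, ignore_expires=True)
    present = set()
    for cookie in jar:
        host = cookie.domain.lstrip(".")
        if host == domain or host.endswith("." + domain):
            present.add(cookie.name)
    return sorted(set(required) - present)
```

Calling `missing_cookies("cookies.lwp")` returns an empty list when both names are present for the domain. Note that this only checks that the cookies exist in the file, not that the site still accepts their values.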

I just tested, and it appears the captcha handling is currently broken. It probably stopped working when FA did their site redesign, and I missed this fact because I had a valid auth cookie when doing the tests (derp).

Additionally, the auth procedure now appears to require a Google reCAPTCHA, so I don't think I'll be able to support the manual circumvention when I fix the problem.

@fake-name
Owner

Sidenote: DA is also broken ATM. I haven't had time to poke things recently.

@mmsterful
Author

Oh, what I meant was - I logged in on a browser and transplanted the cookie info there into xA-Scraper's cookies file. It worked the last time I tried it, whenever that was.

@fake-name
Owner

Ah. Well, you need two cookies, a and b. Did you get both?

@mmsterful
Author

Yes, plus the __cfduid one.

@fake-name fake-name reopened this Jul 16, 2020
@fake-name
Owner

That's strange. It should at least pass the login check if you do that.

This login check was written way, way long ago. Before that, I was just looking at cookies, rather than querying the website and checking if I can find your username on a home page path.

@mmsterful
Author

mmsterful commented Aug 30, 2020

I took another look at this, since my FA scraper still doesn't work. It looks like line 41 of faScrape.py is loading http://www.furaffinity.net/controls/user-settings/ and looking for:
<a id="my-username" class="top-heading hideonmobile" href="
but the string as present in the page when logged in is:
<a id="my-username" href="
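
An exact substring match like this breaks whenever the site adds or drops an attribute on the tag. As a rough sketch of a more tolerant check (not the scraper's actual code; `looks_logged_in` and the pattern are made up here), matching only on the `id` attribute would accept both forms of the tag:

```python
import re

# Matches an <a> tag that carries id="my-username", regardless of which
# other attributes appear on the tag or in what order.
MY_USERNAME_LINK = re.compile(r'<a\b[^>]*\sid="my-username"[^>]*>', re.IGNORECASE)


def looks_logged_in(page_html):
    """Return True if the page contains the logged-in username link."""
    return MY_USERNAME_LINK.search(page_html) is not None
```

This still assumes the `my-username` id itself stays stable across themes, which I have not verified.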

Changing the string makes it run the scrape without complaining, but it's indicating "artist seems to have disabled their account" for a lot of accounts that exist. I'm not sure whether it's actually logged in using the valid cookies in cookies.lwp or not.

My FA account is set to use the classic theme, so if all the screen-scraping is made with the old theme in mind, it might be that it's not actually logged in and is trying to scrape pages with the modern theme. I'm not sure.

The captcha-handling stuff for FA can probably be removed, as FA no longer appears to use a captcha.

Edit: I figured out how to turn on debug logging; it seems to be using the cookies correctly, but gives no indication why it's raising an AccountDisabledException.

Edit again: I added a log statement; it looks like maybe the submission count extraction code at line 285 in faScrape.py is failing, or at least the exception raise statement immediately below it is what's getting set off.
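
If that is what's happening, separating "could not parse the page" from "account is disabled" would make the failure easier to diagnose. The following is only a sketch under loud assumptions: I have not reproduced the code at line 285, the `Submissions:` pattern is illustrative and not FA's real markup, `PageFormatError` is a made-up name, the `AccountDisabledException` class here is a local stand-in for the scraper's exception, and the disabled-account marker text is left as a parameter because I don't know what the page actually says.

```python
import re


class AccountDisabledException(Exception):
    """Local stand-in for the scraper's exception of the same name."""


class PageFormatError(Exception):
    """Hypothetical: the page loaded but the expected markup was missing."""


# Assumed, illustrative pattern; the real page markup may differ.
SUBMISSION_COUNT = re.compile(r"Submissions:\s*([\d,]+)")


def extract_submission_count(page_html, disabled_marker):
    """Return the submission count, or raise a specific error explaining why not."""
    match = SUBMISSION_COUNT.search(page_html)
    if match:
        return int(match.group(1).replace(",", ""))
    if disabled_marker in page_html:
        # Only report a disabled account when the page actually says so.
        raise AccountDisabledException("Account appears to be disabled")
    # Otherwise the layout probably changed (e.g. a different theme).
    raise PageFormatError("Could not find a submission count on the page")
```

With this split, a theme or layout change would surface as a parse error rather than as a disabled account.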
