AoPS crawler is not working. #30
It seems the AoPS server has started to reject certain clients (with a 403 error code). My pycurl sends "User-Agent: PycURL/7.43.0.2 libcurl/7.60.0 OpenSSL/1.1.0h zlib/1.2.11 c-ares/1.14.0 WinIDN libssh2/1.8.0 nghttp2/1.32.0", which gets rejected... But if I add my own User-Agent instead, then it seems to work. You can try this change on your side and see if it helps. To find this issue I had uncommented the …
EDIT: It seems it specifically rejects the request if the User-Agent contains PycURL :)
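Below is a minimal sketch of that kind of change, assuming the request is made through pycurl directly; the URL and the replacement User-Agent string are arbitrary examples, not the crawler's actual configuration:

```python
import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://artofproblemsolving.com/community")
# Override the default "PycURL/..." User-Agent, which the server now rejects.
c.setopt(pycurl.USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
c.setopt(pycurl.WRITEDATA, buf)
c.perform()
print(c.getinfo(pycurl.RESPONSE_CODE))  # expect 200 instead of 403
c.close()
```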
@TheSil Thank you, it is fixed in the private repo.
Unfortunately, I need to reopen this issue: although I got the script working yesterday, I just found that it is not working again. Still …
And here is what I get from …
Any idea?
In the Python script, this if statement's condition is not met: if 'AoPS.bootstrap_data' in script.text: (see this link)
That condition does not always have to be met; there are multiple script blocks, and only one of them will have AoPS.bootstrap_data in it, I suppose. Or does that condition fail for all blocks, ending with the function returning None? That would be strange, especially since in that file you posted I can see script blocks which contain AoPS.bootstrap_data... Is that repeatable on your side? Also, did that stop the crawling, or is it just an error in the error.log? I've had a few of those in my log in the past too which looked sort of similar ([error] post https://artofproblemsolving.com/community/c6h1189932 ('NoneType' object is not subscriptable)), but I remember I could not reproduce them.
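For reference, here is a sketch of the flow being described, assuming a helper that scans all script blocks for the bootstrap data; the function name find_bootstrap_script is made up for illustration and is not the project's actual code:

```python
from bs4 import BeautifulSoup

def find_bootstrap_script(html):
    # Scan every <script> block; only one of them is expected to
    # contain the AoPS.bootstrap_data assignment.
    soup = BeautifulSoup(html, "html.parser")
    for script in soup.find_all("script"):
        if script.text and "AoPS.bootstrap_data" in script.text:
            return script.text
    # If no block matches, a caller that indexes into the result would
    # raise "'NoneType' object is not subscriptable".
    return None
```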
I fed your example directly into the BeautifulSoup object (after writing it as binary data; I do not recommend treating it as textual data :D), and I don't see any issue there, it parsed correctly on my side... (it only fails afterwards because it uses your session, which is now old, but that is expected)
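A short sketch of that check, assuming the failing response was saved to a local file (the filename community_page.html is made up for illustration):

```python
from bs4 import BeautifulSoup

# Read the saved page as raw bytes rather than decoded text, as suggested above,
# and let BeautifulSoup deal with the encoding itself.
with open("community_page.html", "rb") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

matching = [s for s in soup.find_all("script") if s.text and "AoPS.bootstrap_data" in s.text]
print(len(matching))  # at least one script block is expected to match
```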
@TheSil Thank you for your help. I actually did not double-check whether my file contains AoPS.bootstrap_data; when I found that the if-condition was not satisfied, I immediately assumed that what I got was not expected. Yes, indeed, I can see that string (AoPS.bootstrap_data) after running a simple grep, so it looks pretty strange at this point. I am going to try it once more now...
I can reproduce it. Also, the good news is that I just found it only happens in the Docker image; I can verify everything works well on my host machine. Still strange, though. Anyway, here is the output from the Docker run: …
Another finding: if I run a shell on that Docker image so that I can run the AoPS crawler script multiple times, I find that the second time all the FakeUserAgent errors are gone, and what is left is the same error I produced 11 hours ago: …
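The FakeUserAgent errors suggest the fake-useragent package failing to download its user-agent list inside the container on the first run and succeeding from its local cache afterwards. A hedged sketch of one way to make that failure non-fatal; the fallback string is an arbitrary example:

```python
from fake_useragent import UserAgent

# With a fallback, a failed download of the user-agent list degrades to a fixed
# string instead of raising. After one successful fetch the list is cached
# locally, which would also explain why the errors vanish on the second run.
ua = UserAgent(fallback="Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
print(ua.random)
```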
Update: I installed vim in that Docker session and added a few lines for debugging:

```python
print(community_page)
parsed = get_aops_data(community_page)
print(parsed)
session = parsed['AoPS.session']
quit()
```

Since that Docker image was based on …
The actual parsing of the JavaScript is not done through BeautifulSoup but through slimit; maybe that package is a different version? As for Python 3.7, I actually use that on my host (although I'm on Windows).
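For context, a minimal sketch of pulling assignments out of a script block with slimit, plus a way to compare the installed version between the host and the Docker image; this is illustrative only and not the project's actual parsing code:

```python
import pkg_resources
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor

# Compare this value on the host and inside the Docker image.
print(pkg_resources.get_distribution("slimit").version)

def assignments(js_text):
    # Walk the parsed JavaScript AST and collect plain assignments,
    # e.g. AoPS.bootstrap_data = {...};
    tree = Parser().parse(js_text)
    return {node.left.to_ecma(): node.right.to_ecma()
            for node in nodevisitor.visit(tree)
            if isinstance(node, ast.Assign)}
```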
For whatever reason, in Python 3.7 (in Docker), … BTW, my …
That is indeed strange... But it's good you got it working :)
@TheSil Thank you for your kind help anyway. I am going to close this issue now. I am also thankful for your GitHub sponsorship, by the way... I will buy myself a couple of drinks, thanks!
I have not used the AoPS crawler for several months. However, today (since I am playing with Docker swarm) I ran into an AoPS crawler issue: …
It looks like AoPS has changed its API. It could also be that my network is blocking AoPS; I have not tested that yet.
@TheSil If you get time, please help me and see if you can reproduce this issue, thanks!
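A quick way to separate the two hypotheses (an API change versus the network blocking AoPS) is sketched below with pycurl; the URL is just the example page from the earlier log and the User-Agent string is arbitrary:

```python
import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://artofproblemsolving.com/community/c6h1189932")
c.setopt(pycurl.USERAGENT, "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
c.setopt(pycurl.WRITEDATA, buf)
try:
    c.perform()
    # A 200 with unexpected page content would point to an API/markup change;
    # a 403 would point to the client being rejected again.
    print(c.getinfo(pycurl.RESPONSE_CODE))
except pycurl.error as exc:
    # A connect or timeout error here would point to the network blocking AoPS.
    print("connection failed:", exc)
finally:
    c.close()
```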