Problems with Torpy? #4

thohug · 2022-10-06T12:46:54Z

Any ideas on this behaviour? There seem to be problems with torpy. Have you experienced this before, and do you have any idea how to solve it?

[...]

Initiating tor session 233
                  Circuit built.
Start iteration 0: 2022-10-06 14:40:11.079573
Tor end node blocked. Last response: <Response [404]>
0it [01:16, ?it/s]
Initiating tor session 234
                  Circuit built.
Start iteration 0: 2022-10-06 14:41:28.718957
Tor end node blocked. Last response: <Response [404]>
0it [00:07, ?it/s]
Initiating tor session 235
                  Circuit built.
Start iteration 0: 2022-10-06 14:41:37.347591
ERROR:torpy.cell_socket:_ssl.c:1112: The handshake operation timed out
ERROR:root:[ignored]
Traceback (most recent call last):
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 63, in connect
    self._socket.connect((self._router.ip, self._router.or_port))
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1343, in connect
    self._real_connect(addr, False)
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1334, in _real_connect
    self.do_handshake()
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1310, in do_handshake
    self._sslobj.do_handshake()
socket.timeout: _ssl.c:1112: The handshake operation timed out

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\utils.py", line 79, in newfn
    return func(*args, **kwargs)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 183, in newfn
    return func(*args, **kwargs)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 426, in get_descriptor
    with self._get_dir_client() as dir_client:
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 375, in _get_dir_client
    self._dir_guard, self._dir_circuit = self._create_dir_circuit(purpose='Internal dir client')
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\consesus.py", line 365, in _create_dir_circuit
    guard = TorGuard(router, purpose=purpose)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\guard.py", line 66, in __init__
    self.__tor_socket.connect()
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 69, in connect
    raise TorSocketConnectError(e)
torpy.cell_socket.TorSocketConnectError: _ssl.c:1112: The handshake operation timed out
WARNING:torpy.utils:Retry with another router...
0it [00:31, ?it/s]
'graphql'
Initiating tor session 236
                  Circuit built.
Start iteration 0: 2022-10-06 14:42:09.078514
Tor end node blocked. Last response: <Response [404]>
0it [00:06, ?it/s]
Initiating tor session 237
                  Circuit built.
Start iteration 0: 2022-10-06 14:42:16.572684
WARNING:torpy.circuit:#80000242 circuit: has been destroyed already
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
Tor end node blocked. Last response: <Response [404]>
0it [00:52, ?it/s]
Initiating tor session 238


do-me · 2022-10-07T06:23:34Z

That's the expected behavior when mining too fast. The line "Tor end node blocked. Last response: <Response [404]>" indicates that the respective node got blocked, which is likely to happen after a while. Make sure to work with a higher --wait_between_requests.
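To illustrate what a wait between requests does, here is a minimal sketch in Python. It is not the code fast-instagram-scraper uses; the function name `paced` and its `sleep` parameter are hypothetical and exist only for this illustration.

```python
import time


def paced(items, wait_between_requests, sleep=time.sleep):
    """Yield items one by one, pausing between consecutive items.

    Hypothetical helper: it only shows the idea behind a
    wait-between-requests setting, not the scraper's actual logic.
    """
    for index, item in enumerate(items):
        if index > 0:
            # No pause before the first item, one pause before every later one.
            sleep(wait_between_requests)
        yield item
```

A caller would loop over `paced(urls, 10)` and send one request per yielded item, so a larger value spreads the same number of requests over a longer time.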

thohug · 2022-10-07T09:22:04Z

Thanks for your quick reply. I understand that and tried different numbers. But even when a circuit is built, there still seems to be a problem with torpy. Or would you suggest also increasing the Tor timeouts?

Initiating tor session 4
0it [00:00, ?it/s]Circuit built.
Start iteration 0: 2022-10-07 11:06:04.996308
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
WARNING:torpy.circuit:#8000000b circuit: has been destroyed already
ERROR:torpy.utils:[ignored] torpy.circuit.CellTimeoutError: Timeout wait for CellRelayExtended2 or CellRelayTruncated
WARNING:torpy.utils:Retry circuit creation
Exception in thread RecvLoop_103.251:
Traceback (most recent call last):
  File "C:\Users\...\anaconda3\envs\scrape\lib\threading.py", line 980, in _bootstrap_inner
    self.run()
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 233, in run
    callback(key.fileobj, mask)
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\circuit.py", line 220, in _do_recv
    for cell in self._tor_socket.recv_cell_async():
  File "C:\Users\...\anaconda3\envs\scrape\lib\site-packages\torpy\cell_socket.py", line 104, in recv_cell_async
    more_data = self._socket.recv(TorCellSocket.RECV_BUFF_SIZE)
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1227, in recv
    return self.read(buflen)
  File "C:\Users\...\anaconda3\envs\scrape\lib\ssl.py", line 1102, in read
    return self._sslobj.read(len)
ConnectionAbortedError: [WinError 10053] An established connection was aborted by the software in your host machine
Torsession terminated after 600 seconds tor_timeout.

do-me · 2022-10-07T09:44:32Z

I have seen this error before, and so far only on Windows.
This is indeed a problem with torpy/SSL rather than with fast-instagram-scraper.

If you have already made sure that the latest torpy version is installed and that you are using a virtual env, I would recommend switching to Ubuntu or, if you're on Windows, using WSL, as the SSL error might be cumbersome to fix. There might be conflicting SSL libraries or other hard-to-identify problems.
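Before switching systems, it can help to record which Python, OpenSSL build, and torpy version the environment actually uses. The following is a small sketch using only the standard library; the function name `environment_report` is hypothetical, and the report does not by itself diagnose the handshake timeout.

```python
import platform
import ssl
import sys
from importlib import metadata


def environment_report(package="torpy"):
    """Collect basic facts about the interpreter, SSL library and a package."""
    try:
        package_version = metadata.version(package)
    except metadata.PackageNotFoundError:
        # The package is not installed in this environment.
        package_version = None
    return {
        "python": sys.version.split()[0],
        "platform": platform.system(),
        "prefix": sys.prefix,           # shows which environment is active
        "openssl": ssl.OPENSSL_VERSION,  # OpenSSL build Python is linked against
        package: package_version,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

Comparing this output between the Windows environment and a WSL or Ubuntu environment shows whether the two actually differ in OpenSSL or torpy version.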

Let us know if it worked for you!

do-me · 2022-10-09T18:57:10Z

Just checked again. Instagram changed its API recently, so the logic needs slight refactoring first! Hence, it cannot work at the moment.

thohug · 2022-10-12T10:01:45Z

Thanks for checking it out - I didn't get to run it on WSL either, so I guess the API is the problem...

fmac2000 · 2023-01-07T15:24:06Z

Any updates on this @do-me? Great work btw!

do-me · 2023-01-14T14:22:28Z

Thanks for asking @fmac2000 (also @thohug), there are indeed.

tl;dr: Mining is getting harder, TOR end points and even residential IPs get blocked fast (without login), and pagination now needs POST requests instead of GET requests.

Let me try to sum up the current status of the active Instagram APIs. Basically, there are two APIs running at the moment: one is the legacy API that I originally designed fast-instagram-scraper for, and then there is the new one.

Legacy API

Example: https://instagram.com/graphql/query/?query_hash=ac38b90f0f3981c42092016a37c59bf7&variables={"id":"1020237355","first":50,"after":"2301822988561378864"}

On every page you receive a cursor for pagination. In this example it's 2301822988561378864, which I retrieved from the previous page and inserted into the following GET request. That's what the very first version of fast-instagram-scraper did.

The legacy API is completely unchanged. You can still query stuff if you're lucky, but TOR end nodes are 99% blocked. Even residential IPs get blocked after only a few requests. So the only option here is to use commercial rotating residential IPs. If you google it, you will find tons of more or less shady/working/not working services offering such. If anyone needs a good recommendation, write me an email, as I eventually managed to find a good one.
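The cursor-based GET URL described above can be assembled as in the following sketch. The function name `legacy_page_url` and its parameters are hypothetical; only the endpoint, the `query_hash` and the `id`/`first`/`after` variables come from the example URL. Where the next cursor sits in the response JSON is not covered here.

```python
import json
from urllib.parse import urlencode

LEGACY_ENDPOINT = "https://instagram.com/graphql/query/"


def legacy_page_url(query_hash, location_id, first=50, after=None):
    """Build a legacy-API GET URL; `after` is the cursor from the previous page."""
    variables = {"id": str(location_id), "first": first}
    if after is not None:
        # Omit the cursor on the first page, pass it on every later page.
        variables["after"] = str(after)
    query = urlencode({
        "query_hash": query_hash,
        # Compact JSON, percent-encoded by urlencode.
        "variables": json.dumps(variables, separators=(",", ":")),
    })
    return f"{LEGACY_ENDPOINT}?{query}"
```

Calling `legacy_page_url("ac38b90f0f3981c42092016a37c59bf7", 1020237355, after="2301822988561378864")` yields a percent-encoded form of the example URL above.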

New API

Example: https://instagram.com/explore/locations/1020237355/?__a=1&__d=dis&max_id=<cursor>

The good thing is that the new API offers plenty of new interesting nodes in the response JSON, which is great for research. Also (strangely), it does not block TOR end nodes. But here comes the catch: you can fire a GET request to get the first page, but if you want to paginate, you cannot do it with a GET request, as you must include the respective headers with a bunch of tokens (e.g. XCSRF etc.). As far as I understood, you get these tokens only by accessing the page in a browser that can execute JS to generate them.

So theoretically, if you do so, copy the tokens, and wrap them in a POST request in Python, you're good to go. However, I am not sure at what point they eventually get blocked, but probably fast.

You could also go with a commercial service, as some offer to execute those requests in a real browser (and hence obtain the tokens needed for the POST headers) and afterwards send normal requests (which cost far less).
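As a rough sketch of that idea, the snippet below only builds the request object and never sends it. Several things are assumptions, not confirmed by this thread: that the cursor stays in the `max_id` query parameter as in the example URL, that the POST body can stay empty, and which header names are needed. The function name `new_api_request` is hypothetical; the caller passes whatever headers were copied from the browser.

```python
from urllib.parse import urlencode
from urllib.request import Request


def new_api_request(location_id, browser_headers, max_id=None):
    """Build (but do not send) a request for one location page of the new API.

    browser_headers: dict of header name -> value copied from a browser
    session. Which headers are required is an assumption left to the caller.
    """
    params = {"__a": "1", "__d": "dis"}
    if max_id is not None:
        params["max_id"] = str(max_id)
    url = f"https://instagram.com/explore/locations/{location_id}/?{urlencode(params)}"
    # First page: plain GET. Later pages: POST carrying the copied tokens.
    method = "GET" if max_id is None else "POST"
    return Request(url, headers=dict(browser_headers), method=method)
```

The returned object could then be passed to `urllib.request.urlopen`. Whether the server accepts it depends on the tokens still being valid.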

Advice

Depending on your needs there are different ways to go:

Quick and simple but costly: commercial rotating residential IPs + legacy API's GET request pagination
Free but only first page per location: fast-instagram-scraper + new API (good for "broad" mining)
Cumbersome and free: copy tokens from your browser + POST requests to new API in Python (a modified version of fast-instagram-scraper would do)
Optimized commercial version: 1st request with JS execution, following without until the tokens expire.

Future of fast-instagram-scraper

Doesn't look too bright. Still, in the coming days I will update the script to work at least for the first page of every location in the new API. If someone has already done so, PRs are welcome.

Hope that clarifies the current situation. Let me know if you find out anything else!

I'm reopening the issue for everyone to see.

do-me · 2023-11-26T17:34:43Z

Update 11/2023: as torpy is currently unmaintained and needs refactoring due to TOR changes from V2 to V3, fast-instagram-scraper won't work.

do-me added the Expected behavior label Oct 7, 2022

do-me closed this as completed Oct 7, 2022

do-me reopened this Jan 14, 2023
