
Solved outstanding issues mentioned in snscrape#413 #8

Open
wants to merge 8 commits into more-tg-info
Conversation


@john-osullivan john-osullivan commented Jan 18, 2024

Setting up this pull request to track my work in progress here! Here are the six issues called out by @JustAnotherArchivist, along with my progress on each:

  • Validating multiple matches on the image regex
  • Maintaining the order of media by merging the media loop into the main link loop. Validated the behavior by retrieving https://t.me/s/nexta_live/43102, which now has its videos and images in the correct order.
  • Confirming that some GIFs present as duration-less videos -- thanks for providing the link, Logan! I added a log statement, but will remove it in my next commit now that you've validated.
  • Leaving the link preview href within the outlinks array for consistency with other scrapers
  • Updating the scraper's pagination break to use an if statement instead of a try/except, so it no longer throws an Exception.
  • Strengthening the logic for getting the next page => using the prev link in the header, which accounts for media getting separate IDs.

@john-osullivan john-osullivan changed the title WIP: Solved 3 out of 6 issues mentioned in snscrape#413 Solved outstanding issues mentioned in snscrape#413 Feb 22, 2024
    pageLink = soup.find('a', attrs = {'class': 'tme_messages_more', 'data-before': True})
    if not pageLink:
        # some pages are missing a "tme_messages_more" tag, causing early termination
        if '=' not in nextPageUrl:
            nextPageUrl = soup.find('link', attrs = {'rel': 'canonical'}, href = True)['href']
        nextPostIndex = int(nextPageUrl.split('=')[-1]) - 20
    nextPageUrl = soup.find('link', attrs = {'rel': 'prev'}, href = True)['href']
@john-osullivan (Author)
Shifting to using the prev tag addresses the duplicates issue caused by media getting a post ID, and also lets us remove the index math. Still calculating the next post index to determine whether it makes sense to fetch another page.

@john-osullivan john-osullivan marked this pull request as ready for review February 22, 2024 06:13
@john-osullivan (Author) commented Feb 22, 2024

Okay, sorry for the long delay, but I found the time to sit down and refactor the link loop! Pretty sure that this PR now addresses all of the comments from JustAnotherArchivist#413, although I suppose we'll see if/when this gets sent their way.

@loganwilliams , I think this is working correctly and ready for review! I ran this against the listed example posts discussed in the PR to validate my fixes, but haven't done a large run on anything. Open to suggestions if there's anything you'd like me to do to make sure there aren't any regressions 👍

@loganwilliams (Collaborator)

Thanks @john-osullivan! I'll ping @trislee for a review as he's more up-to-date on Telegram stuff than I am.

@john-osullivan (Author)

Thank you! 🙌 I was gonna say, I took a look at the git history and realized I probably ought to have been tagging them in.

@trislee commented Feb 23, 2024

@john-osullivan thanks for this. The scraping of some channels isn't terminating correctly for me. For example, @WLM_USA_TEXAS gets into an endless loop after post 3746 because there's no "prev" pagination link on that page, nextPostIndex uses the nextPageUrl from the previous page, and the most recent post index from that page is larger than 20.

I previously dealt with Telegram's inconsistencies by decrementing nextPostIndex by 20, which is sloppy but I'm not sure what a more robust method would be. This seems to be an issue with the browser interface: scrolling up on the WLM_USA_TEXAS channel indeed stops at post 3746, but you can get past this using a URL like https://t.me/s/WLM_USA_TEXAS?before=3746

The "> 20" was a sloppy way of checking whether or not the scraper had reached the last page, but it's probably not robust enough here, since a channel could delete early posts.

I'm encountering a similar issue with @AKCPB getting stuck after post 3, since the largest post index from that page is 28.

One way of dealing with these issues could be keeping track of which posts are scraped, using a similar method as in the vkontakte scraper, and using that to deduplicate.
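For illustration, a minimal sketch of that seen-set idea (hypothetical names, not the vkontakte scraper's actual code): remember every post ID already yielded, skip repeats, and stop if a whole page contains nothing new, which is exactly the endless-loop symptom above.

```python
def paginate(pages):
    """Yield post IDs from successive pages, deduplicating as we go.

    `pages` is any iterable of lists of post IDs, standing in for
    the per-page scrape results.
    """
    seen = set()
    for page in pages:
        new = [post for post in page if post not in seen]
        if not new:
            # a page with no unseen posts means pagination is looping
            break
        seen.update(new)
        yield from new

# pages overlap (3 repeats) and eventually repeat entirely
posts = list(paginate([[5, 4, 3], [3, 2, 1], [2, 1], [2, 1]]))
# posts == [5, 4, 3, 2, 1]
```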

@john-osullivan (Author)

Thanks for the feedback and including links for me to validate, @trislee! I'll dig in this evening and try to turn around a clean solution 👍

@trislee commented Feb 25, 2024

Replacing lines 222 - 232 with something like this seems to be working well, without introducing duplicate posts:

if pageLink := soup.find('link', attrs = {'rel': 'prev'}, href = True):
    nextPageUrl = urllib.parse.urljoin(r.url, pageLink['href'])
else:
    nextPostIndex = int(soup.find('div', attrs = {'class': 'tgme_widget_message', 'data-post': True})["data-post"].split("/")[-1])
    nextPageUrl = urllib.parse.urljoin(r.url, r.url.split('?')[0] + f'?before={nextPostIndex}')

I haven't seen any examples of a page with a "tme_messages_more" anchor that doesn't have a "rel=prev" pagination, so I think it should be safe to use one or the other.
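As a side note on the snippet above, `urllib.parse.urljoin` is what resolves the relative `href` of the prev link against the current page URL (the index values here are made up for illustration):

```python
from urllib.parse import urljoin

# prev-link hrefs on t.me are relative paths; urljoin resolves them
# against the URL of the page that was just fetched
base = 'https://t.me/s/nexta_live?before=43103'
prev = urljoin(base, '/s/nexta_live?before=43083')
# prev == 'https://t.me/s/nexta_live?before=43083'
```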

@john-osullivan (Author)

Thanks so much for writing that up; my apologies for getting sidetracked. I've just applied this change -- please let me know if you're seeing any other issues :)

@trislee commented Mar 2, 2024

The only other edge case I've seen is that some channels don't seem to have a post with index 1. For example, https://t.me/s/BREAKDCODE?before=3. Those pages have a div.tme_no_messages_found element, so adding something like this on line 216 deals with that:

if soup.find("div", class_ = "tme_no_messages_found"):
    break

@john-osullivan (Author) commented Mar 14, 2024

Hey @trislee -- sorry for the delay turning this small change around, but I wanted to make things a little easier for the next person who might not have good example channels on hand for testing. I set up a test suite which covers the issues you've called out here, as well as the media ordering one that @JustAnotherArchivist mentioned.

I know this is growing the scope of the PR, so if you think it's best that I remove it, happy to do so! It just seemed like a nice improvement to prevent regressions, given that testing this behavior is pretty tricky if you aren't in the weeds and aware of these example channels. The helper for detecting infinite loops is a little involved, but seems to work well.
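The loop-detection helper in the test suite boils down to something like this sketch (hypothetical names, simplified from what the actual tests do): fail fast if the scraper ever requests the same page URL twice.

```python
class LoopDetector:
    """Track fetched page URLs and raise if one repeats."""

    def __init__(self):
        self.visited = set()

    def check(self, url):
        # a repeated URL means pagination is going in circles
        if url in self.visited:
            raise RuntimeError(f'pagination loop detected at {url!r}')
        self.visited.add(url)

detector = LoopDetector()
detector.check('https://t.me/s/example?before=40')
detector.check('https://t.me/s/example?before=20')
try:
    detector.check('https://t.me/s/example?before=40')  # repeat
    looped = False
except RuntimeError:
    looped = True
```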
