Solved outstanding issues mentioned in snscrape#413 #8
base: more-tg-info
snscrape/modules/telegram.py
Outdated
pageLink = soup.find('a', attrs = {'class': 'tme_messages_more', 'data-before': True})
if not pageLink:
    # some pages are missing a "tme_messages_more" tag, causing early termination
    if '=' not in nextPageUrl:
        nextPageUrl = soup.find('link', attrs = {'rel': 'canonical'}, href = True)['href']
    nextPostIndex = int(nextPageUrl.split('=')[-1]) - 20
    nextPageUrl = soup.find('link', attrs = {'rel': 'prev'}, href = True)['href']
Shifting to using the "prev" tag addresses the duplicates issue caused by media getting a post ID, and also lets us remove the index math. I'm still calculating the next post index to determine whether it makes sense to fetch another page.
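As a small illustration of the "next post index" calculation mentioned above, here is a minimal sketch that reads the `before` value from a pagination URL with the standard library instead of splitting on '='. The helper name `before_index` is hypothetical and not part of snscrape; the only assumption about Telegram's URLs is the `?before=N` form that appears elsewhere in this thread.

```python
from urllib.parse import urlparse, parse_qs


def before_index(url):
    """Return the integer 'before' query parameter of url, or None.

    Hypothetical helper for illustration only; it is not snscrape code.
    Works for absolute URLs and for relative hrefs such as '/s/name?before=3'.
    """
    values = parse_qs(urlparse(url).query).get('before')
    if not values:
        # No 'before' parameter at all (e.g. the channel's first page).
        return None
    try:
        return int(values[0])
    except ValueError:
        # Parameter present but not numeric.
        return None
```

Returning None for a missing or malformed parameter lets the caller decide how to stop, rather than raising in the middle of a scrape.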
Okay, sorry for the long delay, but I found the time to sit down and refactor the link loop! Pretty sure that this PR now addresses all of the comments from JustAnotherArchivist#413, although I suppose we'll see if/when this gets sent their way. @loganwilliams, I think this is working correctly and ready for review! I ran this against the listed example posts discussed in the PR to validate my fixes, but haven't done a large run on anything. Open to suggestions if there's anything you'd like me to do to make sure there aren't any regressions 👍
Thanks @john-osullivan! I'll ping @trislee for a review as he's more up-to-date on Telegram stuff than I am.
Thank you! 🙌 I was gonna say, took a look at the git history and realized that I probably ought to have been tagging them in.
@john-osullivan thanks for this. The scraping of some channels isn't terminating correctly for me. For example, I previously dealt with Telegram's inconsistencies by decrementing nextPostIndex by 20, which is sloppy but I'm not sure what a more robust method would be. This seems to be an issue with the browser interface: scrolling up on the WLM_USA_TEXAS channel indeed stops at post 3746, but you can get past this using a URL like https://t.me/s/WLM_USA_TEXAS?before=3746. The "> 20" was a sloppy way of checking whether or not the scraper had reached the last page, but it's probably not robust enough here, since a channel could delete early posts. I'm encountering a similar issue with One way of dealing with these issues could be keeping track of which posts are scraped, using a method similar to the one in the vkontakte scraper, and using that to deduplicate.
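The "keep track of which posts are scraped" idea can be sketched roughly as follows. This is a generic illustration under stated assumptions, not the vkontakte scraper's code: the function name `unique_posts` and the `(post_url, payload)` tuple shape are hypothetical stand-ins for whatever the scraper actually yields.

```python
def unique_posts(pages):
    """Yield each post once, in first-seen order.

    `pages` is any iterable of pages, where each page is an iterable of
    (post_url, payload) tuples. Both the name and the tuple shape are
    assumptions made for this sketch.
    """
    seen = set()
    for page in pages:
        for post_url, payload in page:
            if post_url in seen:
                # Already emitted from an earlier (overlapping) page.
                continue
            seen.add(post_url)
            yield post_url, payload
```

With this kind of filter, stepping back by a fixed amount and re-fetching an overlapping page would cost an extra request but would not emit the same post twice.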
Thanks for the feedback and including links for me to validate, @trislee! I'll dig in this evening and try to turn around a clean solution 👍
Replacing lines 222 - 232 with something like this seems to be working well, without introducing duplicate posts:
I haven't seen any examples of a page with a "tme_messages_more" anchor that doesn't have a "rel=prev" pagination, so I think it should be safe to use one or the other.
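To illustrate the "one or the other" idea only (this is not the snippet proposed in the comment above), here is a rough sketch that takes the next-page href from a "tme_messages_more" anchor when present and otherwise falls back to the "rel=prev" link. It uses the standard library's HTMLParser rather than BeautifulSoup so that it runs without dependencies; the names `_PaginationParser` and `next_page_href` are hypothetical, and it assumes both elements carry an `href` attribute.

```python
from html.parser import HTMLParser


class _PaginationParser(HTMLParser):
    """Collects the first pagination hrefs seen in a page (illustrative)."""

    def __init__(self):
        super().__init__()
        self.more_href = None  # from <a class="tme_messages_more" data-before=...>
        self.prev_href = None  # from <link rel="prev">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get('class') or '').split()
        if tag == 'a' and 'tme_messages_more' in classes and 'data-before' in attrs:
            if self.more_href is None:
                self.more_href = attrs.get('href')
        elif tag == 'link' and attrs.get('rel') == 'prev':
            if self.prev_href is None:
                self.prev_href = attrs.get('href')


def next_page_href(html):
    """Return the href of the next (older) page, or None if neither hint exists."""
    parser = _PaginationParser()
    parser.feed(html)
    parser.close()
    # Prefer the anchor; fall back to the rel=prev link.
    return parser.more_href or parser.prev_href
```

The returned href may be relative, so a caller would still need to join it against the page URL before requesting it.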
Thanks so much for writing that up, my apologies for getting sidetracked. Just applied this change, please let me know if you're seeing any other issues :)
The only other edge case I've seen is that some channels don't seem to have a post with index 1. For example, https://t.me/s/BREAKDCODE?before=3. They have a
Hey @trislee -- sorry for the delay turning this small change around, but I wanted to make things a little easier for the next person who might not have good example channels on hand for testing. I set up a test suite which covers the issues you've called out here, as well as the media ordering one that @JustAnotherArchivist mentioned. I know this is growing the scope of the PR, so if you think it's best that I remove it, happy to do so! It just seemed like a nice improvement to prevent regressions, given that testing this behavior is pretty tricky if you aren't in the weeds and aware of these example channels. The helper for detecting infinite loops is a little involved, but seems to work well.
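For readers unfamiliar with the idea, a loop-detection helper of this general kind could look something like the sketch below. It is not the helper from this PR: the names `LoopDetected` and `guard_against_loops` and the `max_items` cap are hypothetical, and it simply treats a repeated item (or an unexpectedly long run) as evidence that pagination is stuck.

```python
class LoopDetected(Exception):
    """Raised when an iterator appears to be cycling (illustrative only)."""


def guard_against_loops(items, key=lambda item: item, max_items=None):
    """Pass items through, raising LoopDetected on a repeat.

    `key` maps each item to a hashable identity (e.g. a post URL).
    `max_items`, if given, caps how many items may be yielded.
    """
    seen = set()
    for count, item in enumerate(items, start=1):
        identity = key(item)
        if identity in seen:
            raise LoopDetected(f'item repeated: {identity!r}')
        if max_items is not None and count > max_items:
            raise LoopDetected(f'more than {max_items} items yielded')
        seen.add(identity)
        yield item
```

In a test, wrapping the scraper's item generator with a guard like this turns a hang into a fast, explicit failure.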
Setting up this pull request to track my work in progress here! Here are the 5 issues called out by @JustAnotherArchivist, along with my progress on each:
outlinks array for consistency with other scrapers
prev link in the header, which accounts for media getting separate IDs.