
Solved outstanding issues mentioned in snscrape#413 #8

Open
wants to merge 8 commits into more-tg-info
Conversation


@john-osullivan john-osullivan commented Jan 18, 2024

Setting up this pull request to track my work in progress here! Here are the six issues called out by @JustAnotherArchivist, along with my progress on each:

  • Validating multiple matches on the image regex
  • Maintaining the order of media by merging the media loop into the main link loop. Validated the behavior by retrieving https://t.me/s/nexta_live/43102, which now has its videos and images in the correct order.
  • Confirming that some GIFs present as duration-less videos -- thanks for providing the link, Logan! I added a log statement, but will remove it in my next commit now that you've validated.
  • Leaving the link preview href within the outlinks array for consistency with other scrapers
  • Updating the scraper's pagination break to use an if statement instead of a try/except, so it no longer throws an Exception.
  • Strengthening the logic for getting the next page => using the prev link in the header, which accounts for media getting separate IDs.

@john-osullivan john-osullivan changed the title WIP: Solved 3 out of 6 issues mentioned in snscrape#413 Solved outstanding issues mentioned in snscrape#413 Feb 22, 2024
    pageLink = soup.find('a', attrs = {'class': 'tme_messages_more', 'data-before': True})
    if not pageLink:
        # some pages are missing a "tme_messages_more" tag, causing early termination
        if '=' not in nextPageUrl:
            nextPageUrl = soup.find('link', attrs = {'rel': 'canonical'}, href = True)['href']
        nextPostIndex = int(nextPageUrl.split('=')[-1]) - 20
    nextPageUrl = soup.find('link', attrs = {'rel': 'prev'}, href = True)['href']
@john-osullivan (Author)
Shifting to using the prev tag addresses the duplicates issue caused by media getting a post ID, and also lets us remove the index math. Still calculating the next post index to determine whether it makes sense to fetch another page.

@john-osullivan john-osullivan marked this pull request as ready for review February 22, 2024 06:13
@john-osullivan (Author) commented Feb 22, 2024

Okay, sorry for the long delay, but I found the time to sit down and refactor the link loop! Pretty sure that this PR now addresses all of the comments from JustAnotherArchivist#413, although I suppose we'll see if/when this gets sent their way.

@loganwilliams , I think this is working correctly and ready for review! I ran this against the listed example posts discussed in the PR to validate my fixes, but haven't done a large run on anything. Open to suggestions if there's anything you'd like me to do to make sure there aren't any regressions 👍

@loganwilliams (Collaborator)

Thanks @john-osullivan! I'll ping @trislee for a review as he's more up-to-date on Telegram stuff than I am.

@john-osullivan (Author)

Thank you! 🙌 I was gonna say, I took a look at the git history and realized I probably ought to have been tagging them in.

@trislee commented Feb 23, 2024

@john-osullivan thanks for this. The scraping of some channels isn't terminating correctly for me. For example, @WLM_USA_TEXAS gets into an endless loop after post 3746 because there's no "prev" pagination link on that page, nextPostIndex uses the nextPageUrl from the previous page, and the most recent post index from that page is larger than 20.

I previously dealt with Telegram's inconsistencies by decrementing nextPostIndex by 20, which is sloppy but I'm not sure what a more robust method would be. This seems to be an issue with the browser interface: scrolling up on the WLM_USA_TEXAS channel indeed stops at post 3746, but you can get past this using a URL like https://t.me/s/WLM_USA_TEXAS?before=3746

The "> 20" was a sloppy way of checking whether or not the scraper had reached the last page, but it's probably not robust enough here, since a channel could delete early posts.

I'm encountering a similar issue with @AKCPB getting stuck after post 3, since the largest post index from that page is 28.

One way of dealing with these issues could be keeping track of which posts are scraped, using a similar method as in the vkontakte scraper, and using that to deduplicate.
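For illustration, a minimal sketch of that seen-set idea (hypothetical names, not the vkontakte scraper's actual code): remember every post ID already yielded, skip repeats, and stop if a whole page contains nothing new, which is exactly the endless-loop symptom above.

```python
def paginate(pages):
    """Yield post IDs from successive pages, deduplicating as we go.

    `pages` is any iterable of lists of post IDs, standing in for
    the per-page scrape results.
    """
    seen = set()
    for page in pages:
        new = [post for post in page if post not in seen]
        if not new:
            # a page with no unseen posts means pagination is looping
            break
        seen.update(new)
        yield from new

# pages overlap (3 repeats) and eventually repeat entirely
posts = list(paginate([[5, 4, 3], [3, 2, 1], [2, 1], [2, 1]]))
# posts == [5, 4, 3, 2, 1]
```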

@john-osullivan (Author)

Thanks for the feedback and including links for me to validate, @trislee! I'll dig in this evening and try to turn around a clean solution 👍

@trislee commented Feb 25, 2024

Replacing lines 222 - 232 with something like this seems to be working well, without introducing duplicate posts:

if pageLink := soup.find('link', attrs = {'rel': 'prev'}, href = True):
    nextPageUrl = urllib.parse.urljoin(r.url, pageLink['href'])
else:
    nextPostIndex = int(soup.find('div', attrs = {'class': 'tgme_widget_message', 'data-post': True})["data-post"].split("/")[-1])
    nextPageUrl = urllib.parse.urljoin(r.url, r.url.split('?')[0] + f'?before={nextPostIndex}')

I haven't seen any examples of a page with a "tme_messages_more" anchor that doesn't have a "rel=prev" pagination, so I think it should be safe to use one or the other.
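As a side note on the snippet above, `urllib.parse.urljoin` is what resolves the relative `href` of the prev link against the current page URL (the index values here are made up for illustration):

```python
from urllib.parse import urljoin

# prev-link hrefs on t.me are relative paths; urljoin resolves them
# against the URL of the page that was just fetched
base = 'https://t.me/s/nexta_live?before=43103'
prev = urljoin(base, '/s/nexta_live?before=43083')
# prev == 'https://t.me/s/nexta_live?before=43083'
```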

@john-osullivan (Author)

Thanks so much for writing that up; my apologies for getting sidetracked. I've just applied this change -- please let me know if you're seeing any other issues :)

@trislee commented Mar 2, 2024

The only other edge case I've seen is that some channels don't seem to have a post with index 1. For example, https://t.me/s/BREAKDCODE?before=3. Those pages have a div.tme_no_messages_found element, so adding something like this on line 216 deals with that:

if soup.find("div", class_ = "tme_no_messages_found"):
    break

@john-osullivan (Author) commented Mar 14, 2024

Hey @trislee -- sorry for the delay turning this small change around, but I wanted to make things a little easier for the next person who might not have good example channels on hand for testing. I set up a test suite which covers the issues you've called out here, as well as the media ordering one that @JustAnotherArchivist mentioned.

I know this is growing the scope of the PR, so if you think it's best that I remove it, happy to do so! It just seemed like a nice improvement to prevent regressions, given that testing this behavior is pretty tricky if you aren't in the weeds and aware of these example channels. The helper for detecting infinite loops is a little involved, but seems to work well.
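The loop-detection helper in the test suite boils down to something like this sketch (hypothetical names, simplified from what the actual tests do): fail fast if the scraper ever requests the same page URL twice.

```python
class LoopDetector:
    """Track fetched page URLs and raise if one repeats."""

    def __init__(self):
        self.visited = set()

    def check(self, url):
        # a repeated URL means pagination is going in circles
        if url in self.visited:
            raise RuntimeError(f'pagination loop detected at {url!r}')
        self.visited.add(url)

detector = LoopDetector()
detector.check('https://t.me/s/example?before=40')
detector.check('https://t.me/s/example?before=20')
try:
    detector.check('https://t.me/s/example?before=40')  # repeat
    looped = False
except RuntimeError:
    looped = True
```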
