Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implemented JustAnotherArchivist's requested changes to Telegram scraper from PR #2

Merged
merged 10 commits into from
May 9, 2022

Conversation

trislee
Copy link

@trislee trislee commented Mar 31, 2022

Implemented requested changes from JustAnotherArchivist#413

  • Made attachment handling similar to Twitter's: dataclasses for Image, Video, and Gif.
  • Added capability to scrape multiple Videos from a single message
  • Added attribute for the full forwarded URL and made the forwarded attribute have type Channel
  • Added capability to scrape number of views for messages

Additional steps that should be done:

  • Add Document dataclass for arbitrary attached documents
    • Looks like the /s/ browser interface doesn't give you a way of downloading arbitrary documents (i.e. no file URL). Is this worth implementing?
  • Test on more channels to make sure nothing broke
  • Decide on whether to keep current method for scraping forwarded channel information
    • Current method (scraping full channel details for each forward) is comprehensive, but comes at the expense of speed: in most extreme case, if a channel's posts are only forwards, scraper is ~12x slower, since we need to make an additional request for each forward rather than just for each message page.
    • Other option is to modify Channel definition so that it only requires the username. Would be faster, but less comprehensive.
  • Ensure code style is consistent (convert if-else blocks to use walrus operator?)

@trislee trislee marked this pull request as ready for review April 21, 2022 14:42
@trislee trislee requested a review from msramalho April 21, 2022 14:43
@trislee
Copy link
Author

trislee commented Apr 21, 2022

I got frustrated with the slowness of the scraping so I changed the forwarding Channel method by modifying the Channel definition so that it only requires the username, rather than retrieving the full forwarded channel information for every forwarded message.

Additional changes:

  • Telegram seems to have changed their interface somehow such that the tme_messages_more, data-before tag often doesn't appear on some pages. To deal with this, I added a default that decrements the before query parameter by 20. This requires a few additional changes to handle edge cases:
    • If the querystring doesn't contain the before parameter, get the canonical url tag in the page
    • Added a termination condition: if the first tgme_widget_message_date has an href to the first post (t.me/CHANNEL/1), terminate the scraping loop
  • Moved attachment extraction out of if (message := post.find('div', class_ = 'tgme_widget_message_text')): clause, since some attachments are in messages without text, so they weren't being added to the media list
  • I also added a responseOkCallback function to retry the request if we get a 5xx response.

…t wasn't correctly getting the forwarding information in forwarded posts that contained attachments but no text
@trislee
Copy link
Author

trislee commented Apr 21, 2022

One thing we need to decide is if we want to include pinned messages, e.g. https://t.me/s/SouthwestOhioPB/17, where the content is just "[CHANNEL NAME] pinned a [ATTACHMENT TYPE] ". Unfortunately, unlike the desktop app, the browser interface doesn't include the link to the message that was pinned, so there's very little information in the scraped post.

Comment on lines 143 to 149
if link['href'] == rawUrl or link['href'] == url:
style = link.attrs.get('style', '')
# Generic filter of links to the post itself, catches videos, photos, and the date link
if style != '':
imageUrls = re.findall('url\(\'(.*?)\'\)', style)
if len(imageUrls) == 1:
media.append(Photo(url = imageUrls[0]))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this code is partially duplicated below (152-155) maybe it could be isolated to a method, or at least the REGEX into a variable so it stays consistent.

forwarded = forward_tag['href'].split('t.me/')[1].split('/')[0]
for voice_player in post.find_all('a', {'class': 'tgme_widget_message_voice_player'}):
audioUrl = voice_player.find('audio')['src']
durationStr = voice_player.find('time').text.split(':')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

durationStr comes from split so it will be a list rather than string. Both calls pass lists so maybe renaming the variables + durationStrToSeconds method to reflect that.

videoThumbnailUrl = None
else:
style = iTag['style']
videoThumbnailUrl = re.findall('url\(\'(.*?)\'\)', style)[0]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex can be extracted to variable since it's also used above

Comment on lines 187 to 190
if videoTag is None:
videoUrl = None
else:
videoUrl = videoTag['src']

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if videoTag is None:
videoUrl = None
else:
videoUrl = videoTag['src']
videoUrl = None if videoTag is None else videoTag['src']

else:
cls = Video
durationStr = video_player.find('time').text.split(':')
mKwargs['duration'] = durationStrToSeconds(durationStr)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment on list vs str as above for durationStrToSeconds

Comment on lines 224 to 227
if viewsSpan is None:
views = None
else:
views = parse_num(viewsSpan.text)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if viewsSpan is None:
views = None
else:
views = parse_num(viewsSpan.text)
views = None if viewsSpan is None else parse_num(viewsSpan.text)

Comment on lines 316 to 322
s = s.replace(' ', '')
if s.endswith('M'):
return int(float(s[:-1]) * 1e6), 10 ** (6 if '.' not in s else 6 - len(s[:-1].split('.')[1]))
elif s.endswith('K'):
return int(float(s[:-1]) * 1000), 10 ** (3 if '.' not in s else 3 - len(s[:-1].split('.')[1]))
else:
return int(s), 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not check this logic, maybe adding some docstr with example expected input and expected output

if r.status_code == 200:
return (True, None)
elif r.status_code // 100 == 5:
return (False, f'status code: {r.status_code}')
Copy link

@msramalho msramalho Apr 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return (False, f'status code: {r.status_code}')
return (False, f'{r.status_code=}')

discovered this recently for python 3.8+, see here, just a suggestion

Comment on lines 332 to 333
else:
return (False, None)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
else:
return (False, None)
return (False, None)

no need for else and having a base-level return with the default values is also a good pattern

@trislee trislee merged commit 0a4bd39 into master May 9, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants