Implemented JustAnotherArchivist's requested changes to Telegram scraper from PR #2

trislee · 2022-03-31T02:46:35Z

Implemented requested changes from JustAnotherArchivist#413

Made attachment handling similar to Twitter's: dataclasses for Image, Video, and Gif.
Added capability to scrape multiple Videos from a single message
Added attribute for the full forwarded URL and made the forwarded attribute have type Channel
Added capability to scrape number of views for messages
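
For readers who want a picture of the attachment model described above, here is a minimal sketch. It assumes the class names visible elsewhere in this PR (Photo, Video, Gif); every field name and default is an illustrative guess, not the definitions in snscrape/modules/telegram.py.

```python
import dataclasses
import typing


# Hypothetical media dataclasses; field names are assumptions for illustration.
@dataclasses.dataclass
class Photo:
	url: str


@dataclasses.dataclass
class Video:
	thumbnailUrl: typing.Optional[str]
	duration: float
	url: typing.Optional[str] = None


@dataclasses.dataclass
class Gif:
	thumbnailUrl: typing.Optional[str]
	url: typing.Optional[str] = None


# A single message can carry several attachments, including multiple videos.
media = [
	Photo(url = 'https://example.invalid/photo.jpg'),
	Video(thumbnailUrl = None, duration = 12.0),
	Video(thumbnailUrl = 'https://example.invalid/thumb.jpg', duration = 63.0, url = 'https://example.invalid/clip.mp4'),
]
videos = [m for m in media if isinstance(m, Video)]
```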

Additional steps that should be done:

Add Document dataclass for arbitrary attached documents
- Looks like the /s/ browser interface doesn't give you a way of downloading arbitrary documents (i.e. no file URL). Is this worth implementing?
Test on more channels to make sure nothing broke
Decide on whether to keep current method for scraping forwarded channel information
- Current method (scraping full channel details for each forward) is comprehensive but slow: in the most extreme case, where a channel's posts are all forwards, the scraper is ~12x slower, since it makes an additional request for each forward rather than just one per message page.
- Other option is to modify Channel definition so that it only requires the username. Would be faster, but less comprehensive.
Ensure code style is consistent (convert if-else blocks to use walrus operator?)
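
To make the style question in the last item concrete, this is roughly what such a conversion looks like. It is a sketch only: `findTag` and `tags` are hypothetical stand-ins for the scraper's BeautifulSoup lookups.

```python
# Hypothetical stand-in for a tag lookup that may return None.
tags = {'views': '1.2K'}

def findTag(name):
	return tags.get(name)

# if-else form: the lookup result needs its own assignment statement.
viewsSpan = findTag('views')
if viewsSpan is None:
	viewsPlain = None
else:
	viewsPlain = viewsSpan.lower()

# Walrus form (Python 3.8+): assign and test in a single expression.
if (span := findTag('views')) is not None:
	viewsWalrus = span.lower()
else:
	viewsWalrus = None
```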


trislee · 2022-04-21T14:59:42Z

I got frustrated with how slow the scraping was, so I changed how forwards are handled: the Channel definition now requires only the username, and the scraper no longer retrieves the full forwarded channel information for every forwarded message.
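
A minimal sketch of a username-only Channel, assuming a dataclass in which everything except the username is optional. The optional field names and the helper `forwardedChannelFromHref` are hypothetical; only the href-splitting expression is taken from a line quoted in the review of this PR.

```python
import dataclasses
import typing


@dataclasses.dataclass
class Channel:
	username: str
	# Assumed optional details that would otherwise need an extra request per forward.
	title: typing.Optional[str] = None
	verified: typing.Optional[bool] = None
	members: typing.Optional[int] = None


def forwardedChannelFromHref(href):
	# 'https://t.me/<username>/<id>' -> Channel(username = '<username>')
	return Channel(username = href.split('t.me/')[1].split('/')[0])


forwarded = forwardedChannelFromHref('https://t.me/somechannel/123')
```

The forward can then be attached to a post without any additional HTTP request, at the cost of leaving the other fields empty.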

Additional changes:

Telegram seems to have changed their interface such that the tme_messages_more, data-before tag is missing from some pages. To deal with this, I added a default that decrements the before query parameter by 20. This requires a few additional changes to handle edge cases:
- If the query string doesn't contain the before parameter, fall back to the canonical URL tag in the page
- Added a termination condition: if the first tgme_widget_message_date has an href to the first post (t.me/CHANNEL/1), terminate the scraping loop
Moved attachment extraction out of the if (message := post.find('div', class_ = 'tgme_widget_message_text')): clause. Some attachments are in messages without text, so they weren't being added to the media list.
I also added a responseOkCallback function to retry the request if we get a 5xx response.
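
The pagination fallback and the termination condition described above can be sketched roughly as follows. The function names are hypothetical, and the sketch assumes the canonical URL carries a usable before value; the PR's actual code may differ.

```python
import urllib.parse

_PAGE_STEP = 20  # the decrement mentioned in the comment above


def nextPageUrl(currentUrl, canonicalUrl = None):
	'''Return the URL of the next (older) page, or None if no before value is available.'''
	parts = urllib.parse.urlparse(currentUrl)
	query = urllib.parse.parse_qs(parts.query)
	if 'before' not in query and canonicalUrl is not None:
		# Assumption: the canonical URL has a before parameter we can reuse.
		query = urllib.parse.parse_qs(urllib.parse.urlparse(canonicalUrl).query)
	if 'before' not in query:
		return None
	before = int(query['before'][0]) - _PAGE_STEP
	return parts._replace(query = urllib.parse.urlencode({'before': before})).geturl()


def reachedFirstPost(firstDateHref, channel):
	'''True if the first date link on the page points at t.me/CHANNEL/1.'''
	return firstDateHref.rstrip('/').endswith(f't.me/{channel}/1')
```

A scraping loop would call `reachedFirstPost` on each page before requesting `nextPageUrl`, which keeps the default decrement from paging past the start of the channel.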


trislee · 2022-04-21T23:51:43Z

One thing we need to decide is whether to include pinned messages, e.g. https://t.me/s/SouthwestOhioPB/17, where the content is just "[CHANNEL NAME] pinned a [ATTACHMENT TYPE]". Unfortunately, unlike the desktop app, the browser interface doesn't include a link to the message that was pinned, so there's very little information in the scraped post.

msramalho · 2022-04-26T10:58:30Z

snscrape/modules/telegram.py

+				if link['href'] == rawUrl or link['href'] == url:
+					style = link.attrs.get('style', '')
+					# Generic filter of links to the post itself, catches videos, photos, and the date link
+					if style != '':
+						imageUrls = re.findall('url\(\'(.*?)\'\)', style)
+						if len(imageUrls) == 1:
+							media.append(Photo(url = imageUrls[0]))


This code is partially duplicated below (152-155). Maybe it could be isolated into a method, or at least the regex could be moved into a variable so it stays consistent.
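
One way to act on this, as a sketch: compile the pattern once and wrap the lookup in a helper. The constant name matches the one a later commit in this PR introduces; the helper `styleMediaUrls` and the raw-string form of the pattern are assumptions.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
_STYLE_MEDIA_URL_PATTERN = re.compile(r"url\('(.*?)'\)")


def styleMediaUrls(style):
	'''Return every URL found in url('...') within an inline style string.'''
	return _STYLE_MEDIA_URL_PATTERN.findall(style)


style = "width:100%;background-image:url('https://example.invalid/photo.jpg')"
imageUrls = styleMediaUrls(style)
```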

msramalho · 2022-04-26T10:58:32Z

snscrape/modules/telegram.py

-					forwarded = forward_tag['href'].split('t.me/')[1].split('/')[0]			
+			for voice_player in post.find_all('a', {'class': 'tgme_widget_message_voice_player'}):
+				audioUrl = voice_player.find('audio')['src']
+				durationStr = voice_player.find('time').text.split(':')


durationStr comes from split, so it will be a list rather than a string. Both calls pass lists, so maybe rename the variables and the durationStrToSeconds method to reflect that.
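
A sketch of what the rename could look like. The names `durationParts` and `durationPartsToSeconds` are suggestions, and the body is one plausible implementation rather than the PR's own.

```python
def durationPartsToSeconds(parts):
	'''Convert ['MM', 'SS'] or ['HH', 'MM', 'SS'] (as produced by str.split(':')) to seconds.'''
	seconds = 0
	for part in parts:
		# Each step shifts the running total up by one base-60 position.
		seconds = seconds * 60 + int(part)
	return seconds


durationParts = '1:02:03'.split(':')  # a list of strings, not a string
duration = durationPartsToSeconds(durationParts)
```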

msramalho · 2022-04-26T10:58:34Z

snscrape/modules/telegram.py

+					videoThumbnailUrl = None
+				else:
+					style = iTag['style']
+					videoThumbnailUrl = re.findall('url\(\'(.*?)\'\)', style)[0]


The regex can be extracted into a variable, since it's also used above.

msramalho · 2022-04-26T10:58:37Z

snscrape/modules/telegram.py

+					if videoTag is None:
+						videoUrl = None
+					else:
+						videoUrl = videoTag['src']


Suggested change

-					if videoTag is None:
-						videoUrl = None
-					else:
-						videoUrl = videoTag['src']
+					videoUrl = None if videoTag is None else videoTag['src']

msramalho · 2022-04-26T10:58:40Z

snscrape/modules/telegram.py

+				else:
+					cls = Video
+					durationStr = video_player.find('time').text.split(':')
+					mKwargs['duration'] = durationStrToSeconds(durationStr)


Same comment as above on list vs. str for durationStrToSeconds.

msramalho · 2022-04-26T10:58:44Z

snscrape/modules/telegram.py

+			if viewsSpan is None:
+				views = None
+			else:
+				views = parse_num(viewsSpan.text)


Suggested change

-			if viewsSpan is None:
-				views = None
-			else:
-				views = parse_num(viewsSpan.text)
+			views = None if viewsSpan is None else parse_num(viewsSpan.text)

msramalho · 2022-04-26T10:58:51Z

snscrape/modules/telegram.py

+	s = s.replace(' ', '')
+	if s.endswith('M'):
+		return int(float(s[:-1]) * 1e6), 10 ** (6 if '.' not in s else 6 - len(s[:-1].split('.')[1]))
+	elif s.endswith('K'):
+		return int(float(s[:-1]) * 1000), 10 ** (3 if '.' not in s else 3 - len(s[:-1].split('.')[1]))
+	else:
+		return int(s), 1


I did not check this logic. Maybe add a docstring with example inputs and the expected outputs.
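
As a sketch of the kind of docstring meant here: the body below repeats the quoted logic unchanged, the function name `parse_num` is taken from the call shown earlier in this review, and the examples are what that logic yields on a straightforward reading, where the second element is the granularity of the abbreviated figure. They have not been checked against real Telegram pages.

```python
def parse_num(s):
	'''Parse an abbreviated count into (value, granularity).

	>>> parse_num('987')
	(987, 1)
	>>> parse_num('12K')
	(12000, 1000)
	>>> parse_num('1.2K')
	(1200, 100)
	>>> parse_num('2.5M')
	(2500000, 100000)
	'''
	s = s.replace(' ', '')
	if s.endswith('M'):
		return int(float(s[:-1]) * 1e6), 10 ** (6 if '.' not in s else 6 - len(s[:-1].split('.')[1]))
	elif s.endswith('K'):
		return int(float(s[:-1]) * 1000), 10 ** (3 if '.' not in s else 3 - len(s[:-1].split('.')[1]))
	else:
		return int(s), 1
```

Note that int() truncates, so a float product that lands just below a whole number would come out one short. Whether Telegram's displayed values can trigger that is not something this sketch verifies.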

msramalho · 2022-04-26T10:59:04Z

snscrape/modules/telegram.py

+	if r.status_code == 200:
+		return (True, None)
+	elif r.status_code // 100 == 5:
+		return (False, f'status code: {r.status_code}')


Suggested change

-		return (False, f'status code: {r.status_code}')
+		return (False, f'{r.status_code=}')

Discovered this recently: the f-string = specifier, available in Python 3.8+. Just a suggestion.

msramalho · 2022-04-26T11:00:31Z

snscrape/modules/telegram.py

+	else:
+		return (False, None)


Suggested change

-	else:
-		return (False, None)
+	return (False, None)

No need for the else; a base-level return with the default values is also a good pattern.
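
With both suggestions applied, the callback would read roughly as below. This is a sketch: the function name and single-argument signature are assumed from the description earlier in the thread, and `types.SimpleNamespace` stands in for a real response object.

```python
import types


def responseOkCallback(r):
	if r.status_code == 200:
		return (True, None)
	if r.status_code // 100 == 5:
		# The = specifier prints the expression text followed by its value.
		return (False, f'{r.status_code=}')
	return (False, None)


# Stand-in response objects, for illustration only.
ok = responseOkCallback(types.SimpleNamespace(status_code = 200))
serverError = responseOkCallback(types.SimpleNamespace(status_code = 503))
notFound = responseOkCallback(types.SimpleNamespace(status_code = 404))
```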


trislee added 2 commits March 30, 2022 21:07

implemented Media dataclasses for Telegram, and added variable for extracting a post's view count (a7eb54d)

added a forwardedUrl attribute to TelegramPost and made forwarded attribute type Channel. (4e59638)

trislee requested a review from loganwilliams March 31, 2022 02:46

trislee added 4 commits April 17, 2022 03:55

made Telegram scraper not return full channel info for forwarded_from attribute; fixed video edge cases. (babcddd)

fixed issue where Telegram scraper terminated early because some pages didn't have a next page link (added reasonable default) (1e4e0c2)

fixed issue where some videos and photos weren't being scraped (because they weren't in a post containing a 'tgme_widget_message_text' div) (b276c3c)

added additional termination criteria to Telegram scraper (97d38e5)

trislee marked this pull request as ready for review April 21, 2022 14:42

trislee requested a review from msramalho April 21, 2022 14:43

trislee added 2 commits April 21, 2022 18:06

added additional attributes for hashtags and user mentions, removed redundant outlinks (9b3faec)

moved forward finding out of tgme_widget_message_text clause, since it wasn't correctly getting the forwarding information in forwarded posts that contained attachments but no text (21f7b62)

msramalho reviewed Apr 26, 2022

improved consistency of code formatting and added _STYLE_MEDIA_URL_PATTERN as variable (5648e95)

msramalho approved these changes Apr 28, 2022

Merge branch 'master' into telegram-media (c18ca0f)

trislee merged commit 0a4bd39 into master May 9, 2022
