In [2]:
# @title **1. [ Required ] Set up your credentials once** { display-mode: "form" }

# @markdown Here, you need to input your credentials: `username`, `phone`, `api_id`, and `api_hash`. Your `api_id` and `api_hash` can only be generated from [Telegram's app creation page](https://my.telegram.org/apps). Once your credentials are set up, you won’t need to update them again. Just click “Run” to proceed.

# Install the Telethon library for Telegram API interactions
#!pip install -q telethon

# Initial imports
from datetime import datetime, timezone
import pandas as pd
import time
import json
import re

# Telegram imports
from telethon.sync import TelegramClient

# Google Colab imports
#from google.colab import files

# Setup / change only the first time you use it
# @markdown **1.1.** Your Telegram account username (just 'abc123', not '@'):
username = 'username' # @param {type:"string"}
# @markdown **1.2.** Your Telegram account phone number (ex: '+5511999999999'):
phone = '+phone' # @param {type:"string"}
# @markdown **1.3.** Your API ID, which can only be generated at https://my.telegram.org/apps:
api_id = 'number' # @param {type:"string"}
# @markdown **1.4.** Your API hash, also from https://my.telegram.org/apps:
api_hash = 'hash' # @param {type:"string"}
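
The credentials above are typed directly into the cell. As a minimal sketch, assuming you would rather keep them out of the notebook, they could be read from environment variables instead. The variable names (`TG_USERNAME`, `TG_PHONE`, `TG_API_ID`, `TG_API_HASH`) and the `load_credentials` helper are hypothetical and not part of the original notebook.

```python
import os

def load_credentials(env=os.environ):
    """Read Telegram credentials from environment variables (hypothetical names)."""
    required = ["TG_USERNAME", "TG_PHONE", "TG_API_ID", "TG_API_HASH"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "username": env["TG_USERNAME"].lstrip("@"),  # the notebook expects no '@'
        "phone": env["TG_PHONE"],
        "api_id": int(env["TG_API_ID"]),  # the API ID is numeric
        "api_hash": env["TG_API_HASH"],
    }
```

The returned dictionary could then be unpacked into the `username`, `phone`, `api_id`, and `api_hash` variables used by the later cells.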

In [3]:
# @title **2. [ Required ] Adjust every time you want to use it** { display-mode: "form" }

# @markdown In this section, you will define the parameters for scraping data from Telegram channels or groups. Specify the channels you want to scrape using the format `@ChannelName` or the full URL `https://t.me/ChannelName`. Do not use URLs starting with `https://web.telegram.org/`. Set the date range by defining the start and end day, month, and year. Choose an output file name for the scraped data. Optionally, set a search keyword if you need to filter messages by specific terms. Define the maximum number of messages to scrape and set a timeout in seconds.

# Setup / change every time to define scraping parameters

# @markdown **2.1.** Enter the name of each channel or group you want to scrape, for example '@LulanoTelegram' or 'https://t.me/LulanoTelegram'. Do not use 'https://web.telegram.org/a/#-1001249230829' or '-1001249230829'. **Write the `channel names` separated by commas (,):**
#channels = []
respath = 'C:\\Users\\barto\\telescrap\\' #homedir
%mkdir telescrap
channels = ['@tiesiogiaiisukrainos',  '@varlinas', '@volna_lt', '@novayaklaipeda', '@euromore', '@visaginasnews', 
           '@karas_ukrainoje', '@n_aujenoschat', '@sputniknews_lt', '@lithuanian', '@lithuanianlegio', '@lithuanianews24', 
           '@vilnius_lithuania', "@aktyvusklubas2", '@rudelfi', '@slava_ukraini_ltu', '@infalt', '@NorthernFront_NATO',
           '@matricalietuvoje', '@vardantosLietuvos', '@novayaklaipeda', '@litovecrubitpravdu', '@politikai', '@sapereaudelt',
           '@atsibudimas', '@Kritinismastytojas', '@n_aujienos', '@neadeqatus', '@hmelisozreli', '@klaipedaonline', 
           '@Jaunieji_Partizanai', '@naujoji_pasaulio_tvarka', '@karas_Z', '@seimusajudistesiasi','@seimusajudis2021', 
           '@komentarastv', '@ArunasGl', '@mkzmedia', '@ValomDangu', '@fak_tai', '@mariusjonaitis', '@Karamzin_branch',
           '@Karas_ukraina_chronologijos', '@karas_ukrainoje_chronologija', '@RFULietuviskai', '@slava_ukraini_ltu', 
           '@Izraeliokaras', '@zingeris', '@neadeqatus', '@ZoroKanalas', '@ekspertaiTelegram', '@lrtlt', '@radiorlt', 
           '@baltnews', '@delfi_lietuva', '@zhiznvlitve', '@nexta_live',  '@SolovievLive',  '@rubaltic', '@lietuvasu']


#channels = [channel.strip() for channel in channels.split(",")]

# @markdown **2.2.** Select the `time window` from which to extract data for the listed communities:
date_min = '2023-01-01' # @param {type:"date"}
date_max = '2025-01-01' # @param {type:"date"}

date_min = datetime.fromisoformat(date_min).replace(tzinfo=timezone.utc)
date_max = datetime.fromisoformat(date_max).replace(tzinfo=timezone.utc)

# @markdown **2.3.** Choose a `name` for the final file you want to download as output:
file_name = 'Test' # @param {type:"string"}

# @markdown **2.4.** `Keyword` to search, **leave empty if you want to extract all messages from the channel(s):**
key_search = '' # @param {type:"string"}

# @markdown **2.5.** **Maximum** `number of messages` to scrape (only use if you want a specific limit, otherwise leave a high number to scrape everything):
max_t_index = 1000000   # @param {type:"integer"}

# @markdown **2.6.** `Timeout in seconds` (never leave it longer than 6 hours, that is 21600 seconds, as Google Colab deactivates itself after that time):
# Note: 216000 seconds is 60 hours, ten times the 6-hour (21600 s) Colab limit mentioned above
time_limit = 216000 # @param {type:"integer"}

# @markdown **2.7.** Choose the format of the final file you want to download. If you are a first-time user, choose `Excel`. If you have advanced skills, you can use `Parquet`:
File = 'excel' # @param ["excel", "parquet"]


A subdirectory or file telescrap already exists.
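
Section 2 accepts channels as `@ChannelName` or `https://t.me/ChannelName`, and the list above repeats a few entries. The following is a small sketch of how such a list could be normalized and de-duplicated before scraping. The helpers `normalize_channel` and `unique_channels` are hypothetical, and treating names as case-insensitive when de-duplicating is an assumption.

```python
import re

def normalize_channel(raw):
    """Turn '@Name' or 'https://t.me/Name' into '@Name'; reject other forms."""
    value = raw.strip().strip('"\'')  # drop stray quotes left over from editing
    match = re.fullmatch(r"(?:https?://t\.me/|@)([A-Za-z0-9_]+)/?", value)
    if match is None:
        raise ValueError(f"Unsupported channel reference: {raw!r}")
    return "@" + match.group(1)

def unique_channels(raw_channels):
    """Normalize every entry and drop duplicates, keeping first-seen order."""
    seen = {}
    for raw in raw_channels:
        name = normalize_channel(raw)
        # Assumption: names differing only in letter case refer to the same channel
        seen.setdefault(name.lower(), name)
    return list(seen.values())

print(unique_channels(['@varlinas', 'https://t.me/varlinas', '@novayaklaipeda', '@novayaklaipeda']))
```

References such as `https://web.telegram.org/a/#-1001249230829` or a bare numeric ID raise `ValueError` in this sketch, matching the guidance in 2.1.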


In [6]:
# Function to remove invalid XML characters from text
def remove_unsupported_characters(text):
    # Character class matching everything OUTSIDE the valid XML character ranges
    invalid_xml_chars = (
        "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD"
        "\U00010000-\U0010FFFF]"
    )
    cleaned_text = re.sub(invalid_xml_chars, '', text)
    return cleaned_text

# Function to format time in days, hours, minutes, and seconds
def format_time(seconds):
    days = seconds // 86400
    hours = (seconds % 86400) // 3600
    minutes = (seconds % 3600) // 60
    seconds = seconds % 60
    return f'{int(days):02}:{int(hours):02}:{int(minutes):02}:{int(seconds):02}'

# Function to print progress of the scraping process
def print_progress(t_index, message_id, start_time, max_t_index):
    elapsed_time = time.time() - start_time
    current_progress = t_index / (t_index + message_id) if (t_index + message_id) <= max_t_index else t_index / max_t_index
    percentage = current_progress * 100
    estimated_total_time = elapsed_time / current_progress
    remaining_time = estimated_total_time - elapsed_time

    elapsed_time_str = format_time(elapsed_time)
    remaining_time_str = format_time(remaining_time)

    print(f'Progress: {percentage:.2f}% | Elapsed Time: {elapsed_time_str} | Remaining Time: {remaining_time_str}')

# Normalize File variable to avoid issues
File = re.sub(r'[^a-z]', '', File.lower())  # Converts to lowercase and removes non-alphabetic characters
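
The estimate in `print_progress` treats `t_index + message_id` as the expected total, capped at `max_t_index`, and extrapolates the remaining time linearly. The sketch below isolates that arithmetic so it can be checked by hand. The function names `progress_fraction` and `remaining_seconds` are hypothetical, and the inputs are the first logged values of the run (1000 messages processed, message ID 27509) with an assumed elapsed time of 1240 seconds.

```python
def progress_fraction(t_index, message_id, max_t_index):
    """Same estimate as print_progress: messages done over the estimated total."""
    estimated_total = t_index + message_id
    if estimated_total > max_t_index:
        estimated_total = max_t_index  # cap the total at the configured maximum
    return t_index / estimated_total

def remaining_seconds(elapsed, fraction):
    """Linear extrapolation: total = elapsed / fraction, remaining = total - elapsed."""
    return elapsed / fraction - elapsed

fraction = progress_fraction(1000, 27509, 1_000_000)
print(f"{fraction * 100:.2f}%")  # prints 3.51%
print(round(remaining_seconds(1240, fraction)))
```

Because message IDs only approximate how many messages are left (deleted posts and the date window are not accounted for), the remaining time is a rough guide rather than a guarantee.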


In [8]:
def extract_reaction_info(peer_dict):
    # Map each peer type to the key that holds its numeric ID
    id_keys = {'PeerChannel': 'channel_id', 'PeerUser': 'user_id', 'PeerChat': 'chat_id'}
    # Avoid shadowing the built-ins `type` and `id`
    peer_type = peer_dict['_']
    # An unknown peer type yields an empty ID instead of an UnboundLocalError
    peer_id = peer_dict.get(id_keys.get(peer_type), '')
    return peer_type, peer_id


def extract_reactions(message):
    emoji_string = ''
    reaction_ids = ''
    reaction_peer_type = ''
    try:
        if message.reactions:
            # Check whether recent reactions are available:
            if message.reactions.recent_reactions:
                for reaction_count in message.reactions.recent_reactions:
                    emoji = reaction_count.reaction.emoticon
                    emoji_string += emoji + " " + "1" + " "
                    type, id = extract_reaction_info(reaction_count.peer_id.to_dict())
                    reaction_ids += str(id) + " " 
                    reaction_peer_type += type + " " 
            else:
                for reaction_count in message.reactions.results:
                    emoji = reaction_count.reaction.emoticon
                    count = str(reaction_count.count)
                    emoji_string += emoji + " " + count + " "
    except Exception as e:
        pass
        #print(f'Error processing reactions: {e}')
    return emoji_string, reaction_ids, reaction_peer_type
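
`extract_reactions` flattens reactions into a string of alternating emoji and counts separated by spaces, where recent reactions contribute one `emoji 1` pair each. As a sketch of how that column could be turned back into totals during analysis, the hypothetical helper `parse_reactions` below pairs up the tokens and sums repeated emoji.

```python
from collections import Counter

def parse_reactions(emoji_string):
    """Parse 'emoji count emoji count ...' into a Counter of emoji totals."""
    tokens = emoji_string.split()
    totals = Counter()
    # Tokens come in (emoji, count) pairs; a trailing unpaired token is ignored
    for emoji, count in zip(tokens[0::2], tokens[1::2]):
        totals[emoji] += int(count)
    return totals

print(parse_reactions("👍 12 ❤ 3 "))
print(parse_reactions("👍 1 👍 1 🔥 1 "))
```

An empty string, which is what the function returns for messages without reactions, yields an empty `Counter`.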

In [10]:
async def extract_data_from_message(message, channel, respond_to_id):
    #change data format
    date_time = message.date.strftime('%Y-%m-%d %H:%M:%S')
    #clean text
    if message.text != None:
        cleaned_content = remove_unsupported_characters(message.text)
    else:
        cleaned_content= ""
    #check if media
    media = 'True' if message.media else 'False'
    #working on reactions to the post
    emoji_string, reaction_ids, reaction_peer_type = extract_reactions(message)
    #extract info about post author
    if message.from_id != None:
        author_type, author_id = extract_reaction_info(message.from_id.to_dict())
    else:
        author_type, author_id = extract_reaction_info(message.peer_id.to_dict())
    #extract info about forwarded messages
    try:
        fwd_type, fwd_author_id = extract_reaction_info(message.fwd_from.from_id.to_dict())
        fwd_id = message.fwd_from.channel_post if message.fwd_from.channel_post != None else ''
        fwd_date  = message.fwd_from.date.strftime('%Y-%m-%d %H:%M:%S') if message.fwd_from.date != None else ''    
    except Exception as e:
        #print(f'Error processing message: {e}')
        fwd_type, fwd_id, fwd_author_id, fwd_date = "", "", "", ""            
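    #extract info about the message responding to
    if message.reply_to is not None:
        # Use the thread's top message ID when present; 2147483646 is a sentinel
        # near the 32-bit integer maximum so that min() keeps the real ID
        reply_id = int(message.reply_to.reply_to_top_id or 2147483646)
        reply_id = min(message.reply_to.reply_to_msg_id, reply_id)
        # When called for a comment, the parent post ID passed by the caller takes precedence
        reply_id = int(respond_to_id or reply_id)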
        prev_message = await client.get_messages(channel, ids = reply_id)
        if prev_message != None:
            prev_message = prev_message[0] if isinstance(prev_message, list) else prev_message 
            prev_message_date  = prev_message.date.strftime('%Y-%m-%d %H:%M:%S') if prev_message.date != None else ''
            if prev_message.from_id != None:
                prev_message_type, prev_message_author_id = extract_reaction_info(prev_message.from_id.to_dict())
            elif prev_message.peer_id != None:
                prev_message_type, prev_message_author_id = extract_reaction_info(prev_message.peer_id.to_dict()) 
            else:
                prev_message_type, prev_message_author_id = "", ""
        else:
            prev_message_type, prev_message_author_id, prev_message_date = "", "", ""
    else:
        reply_id, prev_message_type, prev_message_author_id, prev_message_date = "", "", "", ""
    return {
        'Message ID': message.id,
        'Author ID': author_id,
        'Author type': author_type,
        'Author': message.post_author,
        'Date': date_time,
        'Channel': channel,
        'Reply to ID': reply_id,
        'Reply to author type': prev_message_type,
        'Reply to author ID': prev_message_author_id,
        'Reply to date': prev_message_date,
        'Forwarded from post ID': fwd_id,
        'Forwarded from post date': fwd_date,
        'Forwarded from author type': fwd_type,
        'Forwarded from author ID': fwd_author_id,
        'Type': "Message",
        'Content': cleaned_content,
        'Views': message.views,
        'Reactions': emoji_string,
        'Reactions Peer': reaction_peer_type,
        'Reactions IDs': reaction_ids,
        'Shares': message.forwards,
        'Media': media,
        'Url': f'https://t.me/{channel}/{message.id}'.replace('@', '')
    }
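
The reply handling above picks one message ID from up to three candidates: the parent post ID passed in by the caller, the direct reply target, and the top message of the thread. A minimal sketch of that precedence as a pure function follows. The name `resolve_reply_id` is hypothetical, and unlike the cell above it returns `None` instead of raising when no candidate is available.

```python
def resolve_reply_id(reply_to_msg_id, reply_to_top_id=None, respond_to_id=None):
    """Pick the ID of the message being replied to.

    An explicit parent ID wins; otherwise the smaller of the direct-reply ID
    and the thread's top ID is used.
    """
    if respond_to_id:
        return int(respond_to_id)
    candidates = [i for i in (reply_to_msg_id, reply_to_top_id) if i]
    return min(candidates) if candidates else None

print(resolve_reply_id(50, reply_to_top_id=10))
print(resolve_reply_id(50, reply_to_top_id=10, respond_to_id=7))
```

Filtering out missing values before calling `min` removes the need for a large sentinel number.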

In [12]:
# @title **3. [ Required ] Start Telegram scraping** { display-mode: "form" }

# @markdown **Attention:** During this step, Telegram may request a verification code. Please monitor your Telegram app and input the required information promptly. Rest assured, all data entered remains secure.

t_index = 0  # Tracker for the number of messages processed
start_time = time.time()  # Record the start time for the scraping session

In [14]:
# Scraping process
for channel in channels:
    if t_index >= max_t_index:
        break

    if time.time() - start_time > time_limit:
        break

    loop_start_time = time.time()
    
    try:
        c_index = 0
        used_ids = set()
        data = set() # Set of scraped rows; each row is stored as a tuple of items so it is hashable
        print(f'\n\n##### {channel} scrapping begin,  {len(used_ids):05} posts already scraped #####\n\n')
        async with TelegramClient(username, api_id, api_hash) as client:
            async for message in client.iter_messages(channel, search=key_search):
                if message.id in used_ids:
                    continue 
                try:
                    if date_min <= message.date <= date_max:
                        msg = await extract_data_from_message(message, channel, None)
                        used_ids.add(message.id)
                        data.add(tuple(msg.items()))
                        #comment/replies processing
                        if message.replies != None and message.replies.replies > 0:
                            async for comment_message in client.iter_messages(channel, reply_to=message.id):
                                try:
                                    if comment_message.id in used_ids:
                                        continue 
                                    msg = await extract_data_from_message(comment_message, channel, message.id)
                                    used_ids.add(comment_message.id)
                                    data.add(tuple(msg.items()))
                                except Exception as e:
                                    print(f'Error processing comment: {e}')
                        c_index += 1
                        t_index += 1

                        # Print progress
                        if t_index % 1000 == 0:
                            print(f'{"-" * 80}')
                            print_progress(t_index, message.id, start_time, max_t_index)
                            current_max_id = min(c_index + message.id, max_t_index)
                            print(f'From {channel}: {c_index:05} contents of {current_max_id:05}')
                            print(f'Id: {message.id:05} / Date: {msg["Date"]}')
                            print(f'Total: {t_index:05} contents until now')
                            print(f'{"-" * 80}\n\n')

                        if t_index % 10000 == 0:
                            if File == 'parquet':
                                backup_filename = respath + f'backup_{file_name}_until_{t_index:05}_{channel}_ID{message.id:07}.parquet'
                                pd.DataFrame([dict(m) for m in data]).to_parquet(backup_filename, index=False)
                            elif File == 'excel':
                                backup_filename = respath  + f'backup_{file_name}_until_{t_index:05}_{channel}_ID{message.id:07}.xlsx'
                                pd.DataFrame([dict(m) for m in data]).to_excel(backup_filename, index=False, engine='openpyxl')

                        if t_index >= max_t_index:
                            break
                        if time.time() - start_time > time_limit:
                            break
                    elif message.date < date_min:
                        break
                except Exception as e:
                    print(f'Error processing message: {e}')
            print(f'\n\n##### {channel} was ok with {c_index:05} posts #####\n\n')
            dt = [dict(m) for m in data]
            df = pd.DataFrame(dt)
            if File == 'parquet':
                partial_filename = respath + f'complete_{channel}_in_{file_name}_until_{t_index:05}.parquet'
                df.to_parquet(partial_filename, index=False)
            elif File == 'excel':
                partial_filename = respath + f'complete_{channel}_in_{file_name}_until_{t_index:05}.xlsx'
                df.to_excel(partial_filename, index=False, engine='openpyxl')
            

    except Exception as e:
        print(f'{channel} error: {e}')

    loop_end_time = time.time()
    loop_duration = loop_end_time - loop_start_time

    if loop_duration < 60:
        time.sleep(60 - loop_duration)

#print(f'\n{"-" * 50}\n#Concluded! #{t_index:05} posts were scraped!\n{"-" * 50}\n\n\n\n')
#dt = [dict(m) for m in data]
#df = pd.DataFrame(dt)
#if File == 'parquet':
#    final_filename = respath + f'FINAL_{file_name}_with_{t_index:05}.parquet'
#    df.to_parquet(final_filename, index=False)
#elif File == 'excel':
#    final_filename = respath + f'FINAL_{file_name}_with_{t_index:05}.xlsx'
#    df.to_excel(final_filename, index=False, engine='openpyxl')
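
The loop above relies on messages arriving newest first: posts later than `date_max` are skipped, posts inside the window are kept, and the first post older than `date_min` ends the scan for that channel. The sketch below shows that windowing on plain `(id, date)` tuples. The generator `in_window` and the sample data are hypothetical.

```python
from datetime import datetime, timezone

def in_window(messages, date_min, date_max):
    """Yield (id, date) pairs inside [date_min, date_max] from a newest-first stream."""
    for msg_id, date in messages:
        if date > date_max:
            continue  # too recent: keep scanning toward older messages
        if date < date_min:
            break  # everything after this is older still, so stop early
        yield msg_id, date

utc = timezone.utc
sample = [
    (4, datetime(2025, 2, 1, tzinfo=utc)),
    (3, datetime(2024, 6, 1, tzinfo=utc)),
    (2, datetime(2023, 3, 1, tzinfo=utc)),
    (1, datetime(2022, 12, 31, tzinfo=utc)),
]
window = list(in_window(sample, datetime(2023, 1, 1, tzinfo=utc), datetime(2025, 1, 1, tzinfo=utc)))
print([msg_id for msg_id, _ in window])  # prints [3, 2]
```

Stopping at the first too-old message avoids walking the channel's full history, which matters for channels with tens of thousands of posts.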



##### @rubaltic scrapping begin,  00000 posts already scraped #####


--------------------------------------------------------------------------------
Progress: 3.51% | Elapsed Time: 00:00:20:40 | Remaining Time: 00:09:28:38
From @rubaltic: 01000 contents of 28509
Id: 27509 / Date: 2024-10-17 12:03:22
Total: 01000 contents until now
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Progress: 7.02% | Elapsed Time: 00:00:42:20 | Remaining Time: 00:09:21:05
From @rubaltic: 02000 contents of 28506
Id: 26506 / Date: 2024-07-24 08:45:22
Total: 02000 contents until now
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Progress: 10.52% | Elapsed Time: 00:01:01:34 | Remaining Time: 00:08:43:30
From @rubaltic: 03000 contents of 28504
Id: 25504 / Date: 2024-05-15 16:28:0

Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 10054] An existing connection was forcibly closed by the remote host


--------------------------------------------------------------------------------
Progress: 46.91% | Elapsed Time: 00:04:30:29 | Remaining Time: 00:05:06:05
From @lietuvasu: 00655 contents of 15366
Id: 14711 / Date: 2024-11-19 09:19:23
Total: 13000 contents until now
--------------------------------------------------------------------------------




Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 1236] The network connection was aborted by the local system


--------------------------------------------------------------------------------
Progress: 50.60% | Elapsed Time: 00:04:58:42 | Remaining Time: 00:04:51:39
From @lietuvasu: 01655 contents of 15325
Id: 13670 / Date: 2024-08-23 14:48:04
Total: 14000 contents until now
--------------------------------------------------------------------------------




Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 10054] An existing connection was forcibly closed by the remote host
Server closed the connection: [WinError 1236] The network connection was aborted by the local system


--------------------------------------------------------------------------------
Progress: 54.30% | Elapsed Time: 00:05:19:53 | Remaining Time: 00:04:29:15
From @lietuvasu: 02655 contents of 15281
Id: 12626 / Date: 2024-05-27 12:19:57
Total: 15000 contents until now
--------------------------------------------------------------------------------




Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 1236] The network connection was aborted by the local system
Server closed the connection: [WinError 1236] The network connection was aborted by the local system


--------------------------------------------------------------------------------
Progress: 58.02% | Elapsed Time: 00:05:51:45 | Remaining Time: 00:04:14:33
From @lietuvasu: 03655 contents of 15234
Id: 11579 / Date: 2024-03-17 12:46:49
Total: 16000 contents until now
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Progress: 61.71% | Elapsed Time: 00:06:10:26 | Remaining Time: 00:03:49:53
From @lietuvasu: 04655 contents of 15205
Id: 10550 / Date: 2024-02-02 19:50:12
Total: 17000 contents until now
--------------------------------------------------------------------------------


--------------------------------------------------------------------------------
Progress: 65.48% | Elapsed Time: 00:06:30:06 | Remaining Time: 00:03:25:41
From @lietuvasu: 05655 contents of 15146
Id: 09491 / Date: 2023-12-22 18:29:51
Total: 18000 contents until now
---------------------------------