# **#TelegramScrap: A comprehensive tool for scraping Telegram data**

✅ This code, developed by [Ergon Cugler de Moraes Silva](https://github.com/ergoncugler) (Brazil), aims to scrape data from selected `Telegram Channels, Groups, or Chats` using the `Telethon Library`. It is designed to facilitate the extraction of various data fields including `message content, author information, reactions, views, and comments`. The primary functions of this code include setting up scraping parameters, processing messages and their associated comments, and handling unsupported characters to ensure data integrity. Data is stored in `Apache Parquet files (.parquet)`, which are highly efficient for both storage and processing, making them superior to traditional spreadsheets in terms of speed and scalability. This tool is particularly **useful for researchers and analysts** looking to collect and analyze Telegram data efficiently.

✅ **The code is open-source and available for free at [https://github.com/ergoncugler/web-scraping-telegram/](https://github.com/ergoncugler/web-scraping-telegram/)**. While it is free to use and modify, the responsibility for its use and any modifications lies with the user. Feel free to explore, utilize, and adapt the code to suit your needs, but please ensure you comply with Telegram's terms of service and data privacy regulations.

## **#Papers: Some scientific production using this code**

✅ In the realm of scientific articles, this code was instrumental in the study **Informational Co-option against Democracy: Comparing Bolsonaro's Discourses about Voting Machines with the Public Debate** ([**link**](https://dl.acm.org/doi/abs/10.1145/3614321.3614373)). It was also used in **Institutional Denialism From the President's Speeches to the Formation of the Early Treatment Agenda (Off Label) in the COVID-19 Pandemic in Brazil** ([**link**](https://anepecp.org/ojs/index.php/br/article/view/561)). Moreover, the code facilitated research in **Catalytic Conspiracism: Exploring Persistent Homologies Time Series in the Dissemination of Disinformation in Conspiracy Theory Communities on Telegram** ([**link**](https://www.abcp2024.sinteseeventos.com.br/trabalho/view?ID_TRABALHO=687)) and **Conspiratorial Convergence: Comparing Thematic Agendas Among Conspiracy Theory Communities on Telegram Using Topic Modeling** ([**link**](https://www.abcp2024.sinteseeventos.com.br/trabalho/view?ID_TRABALHO=903)). Lastly, it was pivotal in the study **Informational Disorder and Institutions Under Attack: How Did Former President Bolsonaro's Narratives Against the Brazilian Judiciary Between 2019 and 2022 Manifest?** ([**link**](https://www.encontro2023.anpocs.org.br/trabalho/view?ID_TRABALHO=8990)).

✅ Furthermore, the code was utilized in several technical notes, as we can see in **Technical Note #16 – Disinformation about Electronic Voting Machines Persists Outside Election Periods** ([**link**](https://www.monitordigital.org/2023/05/18/nota-tecnica-16-desinformacao-sobre-urnas-eletronicas-persiste-fora-dos-periodos-eleitorais/)). It was also employed in **Technical Note #18 – Electoral Fraud Discourse in Argentina on Telegram and Twitter** ([**link**](https://www.monitordigital.org/2023/10/24/nota-tecnica-18-discurso-de-fraude-eleitoral-na-argentina-no-telegram-e-no-twitter/)). The code contributed to the analysis in the technical note **Bashing and Praising Public Servants and Bureaucrats During the Bolsonaro Government (2019 - 2022)** ([**link**](https://neburocracia.wordpress.com/wp-content/uploads/2024/04/nota-tecnica-neb-fgv-eaesp-como-bolsonaro-equilibrou-ataques-e-acenos-aos-servidores-publicos-e-burocratas-entre-2019-e-2022.pdf)). Additionally, it was used in **Technical Note 2: The Digital Territory of Milei's Followers: From Commerce to Politics** ([**link**](https://pacunla.com/nota-tecnica-2-el-territorio-digital-de-los-seguidores-de-milei-del-comercio-a-la-politica/)).

✅ To credit this academic work and the scraping code, it is recommended to cite: **SILVA, Ergon Cugler de Moraes. *Web Scraping Telegram Posts and Content*. (Feb) 2023. Available at: [https://github.com/ergoncugler/web-scraping-telegram/](https://github.com/ergoncugler/web-scraping-telegram/).**

## **Setup**

Enter your credentials here: `username`, `phone`, `api_id`, and `api_hash`. The `api_id` and `api_hash` **can only be generated at https://my.telegram.org/apps**. Once you have set these values for the first time, you do not need to change them again; just run the cell.
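
A malformed credential usually only shows up as a login error later, so a quick format check before connecting can help. The sketch below is an illustration and not part of the original notebook: `check_credentials` is a hypothetical helper, and the expected formats (numeric `api_id`, 32-character hexadecimal `api_hash`, phone number in international format with a leading `+`) are assumptions inferred from the placeholder values used in this notebook, not from an official specification.

```python
import re

def check_credentials(username, phone, api_id, api_hash):
    """Return a list of problems found; an empty list means the values look plausible."""
    problems = []
    if not username or username.startswith('@'):
        problems.append("username should be non-empty and written without '@'")
    # Assumed format: '+' followed by 8 to 15 digits
    if not re.fullmatch(r'\+\d{8,15}', phone):
        problems.append("phone should be in international format, e.g. '+5511999999999'")
    if not str(api_id).isdigit():
        problems.append('api_id should contain only digits')
    # Assumed format: 32 hexadecimal characters, as in the placeholder value
    if not re.fullmatch(r'[0-9a-fA-F]{32}', api_hash):
        problems.append('api_hash should be 32 hexadecimal characters')
    return problems

# Placeholder values from this notebook pass the check
print(check_credentials('your_username', '+5511999999999', '11111111', '1a' * 16))
```

An empty list means nothing suspicious was found; it does not guarantee that Telegram will accept the credentials.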

In [None]:
# Install the Telethon library for Telegram API interactions
!pip install telethon


In [None]:
# Initial imports
from datetime import datetime, timezone
import pandas as pd
import time
import json
import re

# Telegram imports
from telethon.sync import TelegramClient

# Google Colab imports
from google.colab import files

# Setup / change only the first time you use it
username = 'your_username'  # Your Telegram account username (just 'abc123', not '@')
phone = '+5511999999999'  # Your Telegram account phone number (ex: '+5511999999999')
api_id = '11111111'  # Your API ID; it can only be generated at https://my.telegram.org/apps
api_hash = '1a1a1a1a1a1a1a1a1a1a1a1a1a1a1a1a'  # Your API hash, also from https://my.telegram.org/apps


## **Scraping**

In this section, you will define the parameters for scraping data from Telegram channels or groups. Specify the channels you want to scrape using the format `@ChannelName` or the full URL `https://t.me/ChannelName`. Do not use URLs starting with `https://web.telegram.org/`. Set the date range by defining the start and end day, month, and year. Choose an output file name for the scraped data. Optionally, set a search keyword if you need to filter messages by specific terms. Define the maximum number of messages to scrape and set a timeout in seconds.
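
Because the notebook accepts two notations for a channel, it can help to normalize every entry to the `@ChannelName` form before scraping. The following is a minimal sketch under stated assumptions: `normalize_channel` is a hypothetical helper that is not part of the original code, and the pattern assumes channel names contain only letters, digits, and underscores.

```python
import re

def normalize_channel(reference):
    """Return '@ChannelName', or None when the reference is not in a supported form."""
    reference = reference.strip()
    # Public links: https://t.me/ChannelName (scheme and trailing slash optional)
    link = re.fullmatch(r'(?:https?://)?t\.me/([A-Za-z0-9_]+)/?', reference)
    if link:
        return '@' + link.group(1)
    # Handles: @ChannelName
    if re.fullmatch(r'@[A-Za-z0-9_]+', reference):
        return reference
    # Anything else (web.telegram.org URLs, bare numeric IDs) is rejected
    return None

candidates = [
    '@LulanoTelegram',
    'https://t.me/jairbolsonarobrasil',
    'https://web.telegram.org/a/#-1001249230829',
]
# Keep only the references that could be normalized
channels = [c for c in map(normalize_channel, candidates) if c]
print(channels)
```

Working with `@ChannelName` handles throughout also keeps the message URLs and backup file names that the scraping cell derives from the channel name free of `https://` prefixes and slashes.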

In [None]:
# Setup / change every time to define scraping parameters
channels = [
    '@LulanoTelegram',
    '@jairbolsonarobrasil',
]
# List the channels or groups you want to scrape in the list above
# For example: '@LulanoTelegram' or 'https://t.me/LulanoTelegram'
# Do not use: 'https://web.telegram.org/a/#-1001249230829' or '-1001249230829'

d_min = 1  # Start day (inclusive)
m_min = 1  # Start month
y_min = 2000  # Start year
d_max = 1  # End day (exclusive)
m_max = 8  # End month
y_max = 2024  # End year
file_name = 'Test'  # Output file name
key_search = ''  # Keyword to search, leave empty if not needed
max_t_index = 1000000  # Maximum number of messages to scrape
time_limit = 6 * 60 * 60  # Timeout in seconds (here: 6 hours)
File = 'parquet'  # Set to 'parquet' or 'excel'
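
The six day, month, and year values above define a time window that the scraping loop compares against each message's timestamp in UTC. As a rough sketch of that comparison, the code below builds the two boundaries and applies the same `start < date <= end` test used later in the notebook. The helper names `build_window` and `in_window` are hypothetical, and the check that the start precedes the end is an addition for illustration.

```python
from datetime import datetime, timezone

def build_window(d_min, m_min, y_min, d_max, m_max, y_max):
    """Build timezone-aware UTC boundaries from the day/month/year parameters."""
    start = datetime(y_min, m_min, d_min, tzinfo=timezone.utc)
    end = datetime(y_max, m_max, d_max, tzinfo=timezone.utc)
    if start >= end:
        raise ValueError('The start date must be earlier than the end date')
    return start, end

def in_window(moment, start, end):
    """Same comparison as the scraping loop: after start, up to and including end."""
    return start < moment <= end

start, end = build_window(1, 1, 2000, 1, 8, 2024)
last_second_of_july = datetime(2024, 7, 31, 23, 59, 59, tzinfo=timezone.utc)
noon_on_end_day = datetime(2024, 8, 1, 12, 0, tzinfo=timezone.utc)
print(in_window(last_second_of_july, start, end))
print(in_window(noon_on_end_day, start, end))
```

Since both boundaries fall at midnight, a message posted during the end day itself is outside the window, which is why the end day is described as exclusive.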


In [None]:
data = []  # List to store scraped data
t_index = 0  # Tracker for the number of messages processed
start_time = time.time()  # Record the start time for the scraping session

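# Function to remove characters that are not valid in XML
def remove_unsupported_characters(text):
    if not text:  # message.text can be None or empty; return an empty string instead of failing
        return ''
    # Character class matching everything OUTSIDE the valid XML ranges
    invalid_xml_chars = (
        "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD"
        "\U00010000-\U0010FFFF]"
    )
    cleaned_text = re.sub(invalid_xml_chars, '', text)
    return cleaned_text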

# Function to format time in days, hours, minutes, and seconds
def format_time(seconds):
    days = seconds // 86400
    hours = (seconds % 86400) // 3600
    minutes = (seconds % 3600) // 60
    seconds = seconds % 60
    return f'{int(days):02}:{int(hours):02}:{int(minutes):02}:{int(seconds):02}'

# Function to print progress of the scraping process
def print_progress(t_index, message_id, start_time, max_t_index):
    elapsed_time = time.time() - start_time
    current_progress = t_index / (t_index + message_id) if (t_index + message_id) <= max_t_index else t_index / max_t_index
    percentage = current_progress * 100
    estimated_total_time = elapsed_time / current_progress
    remaining_time = estimated_total_time - elapsed_time

    elapsed_time_str = format_time(elapsed_time)
    remaining_time_str = format_time(remaining_time)

    print(f'Progress: {percentage:.2f}% | Elapsed Time: {elapsed_time_str} | Remaining Time: {remaining_time_str}')

# Scraping process
for channel in channels:
    if t_index >= max_t_index:
        break

    if time.time() - start_time > time_limit:
        break

    loop_start_time = time.time()

    try:
        c_index = 0
        async with TelegramClient(username, api_id, api_hash) as client:
            async for message in client.iter_messages(channel, search=key_search):
                try:
                    if datetime(y_min, m_min, d_min, tzinfo=timezone.utc) < message.date <= datetime(y_max, m_max, d_max, tzinfo=timezone.utc):

                        # Process comments of the message
                        comments_list = []
                        try:
                            async for comment_message in client.iter_messages(channel, reply_to=message.id):
                                comment_text = (comment_message.text or '').replace("'", '"')  # text can be None

                                comment_media = 'True' if comment_message.media else 'False'

                                comment_emoji_string = ''
                                if comment_message.reactions:
                                    for reaction_count in comment_message.reactions.results:
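                                        # Custom-emoji reactions have no .emoticon; 'custom' is a placeholder label
                                        emoji = getattr(reaction_count.reaction, 'emoticon', None) or 'custom'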
                                        count = str(reaction_count.count)
                                        comment_emoji_string += emoji + " " + count + " "

                                comment_date_time = comment_message.date.strftime('%Y-%m-%d %H:%M:%S')

                                comments_list.append({
                                    'Type': 'comment',
                                    'Comment Group': channel,
                                    'Comment Author ID': comment_message.sender_id,
                                    'Comment Content': comment_text,
                                    'Comment Date': comment_date_time,
                                    'Comment Message ID': comment_message.id,
                                    'Comment Author': comment_message.post_author,
                                    'Comment Views': comment_message.views,
                                    'Comment Reactions': comment_emoji_string,
                                    'Comment Shares': comment_message.forwards,
                                    'Comment Media': comment_media,
                                    'Comment Url': f'https://t.me/{channel}/{message.id}?comment={comment_message.id}'.replace('@', ''),
                                })
                        except Exception as e:
                            comments_list = []
                            print(f'Error processing comments: {e}')

                        # Process the main message
                        media = 'True' if message.media else 'False'

                        emoji_string = ''
                        if message.reactions:
                            for reaction_count in message.reactions.results:
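                                # Custom-emoji reactions have no .emoticon; 'custom' is a placeholder label
                                emoji = getattr(reaction_count.reaction, 'emoticon', None) or 'custom'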
                                count = str(reaction_count.count)
                                emoji_string += emoji + " " + count + " "

                        date_time = message.date.strftime('%Y-%m-%d %H:%M:%S')
                        cleaned_content = remove_unsupported_characters(message.text)
                        cleaned_comments_list = remove_unsupported_characters(json.dumps(comments_list))

                        data.append({
                            'Type': 'text',
                            'Group': channel,
                            'Author ID': message.sender_id,
                            'Content': cleaned_content,
                            'Date': date_time,
                            'Message ID': message.id,
                            'Author': message.post_author,
                            'Views': message.views,
                            'Reactions': emoji_string,
                            'Shares': message.forwards,
                            'Media': media,
                            'Url': f'https://t.me/{channel}/{message.id}'.replace('@', ''),
                            'Comments List': cleaned_comments_list,
                        })

                        c_index += 1
                        t_index += 1

                        # Print progress
                        print(f'{"-" * 80}')
                        print_progress(t_index, message.id, start_time, max_t_index)
                        current_max_id = min(c_index + message.id, max_t_index)
                        print(f'From {channel}: {c_index:05} contents of {current_max_id:05}')
                        print(f'Id: {message.id:05} / Date: {date_time}')
                        print(f'Total: {t_index:05} contents until now')
                        print(f'{"-" * 80}\n\n')

                        if t_index % 1000 == 0:
                            if File == 'parquet':
                                backup_filename = f'backup_{file_name}_until_{t_index:05}_{channel}_ID{message.id:07}.parquet'
                                pd.DataFrame(data).to_parquet(backup_filename, index=False)
                            elif File == 'excel':
                                backup_filename = f'backup_{file_name}_until_{t_index:05}_{channel}_ID{message.id:07}.xlsx'
                                pd.DataFrame(data).to_excel(backup_filename, index=False)

                        if t_index >= max_t_index:
                            break

                        if time.time() - start_time > time_limit:
                            break

                    elif message.date < datetime(y_min, m_min, d_min, tzinfo=timezone.utc):
                        break

                except Exception as e:
                    print(f'Error processing message: {e}')

        print(f'\n\n##### {channel} was ok with {c_index:05} posts #####\n\n')

        df = pd.DataFrame(data)
        if File == 'parquet':
            partial_filename = f'complete_{channel}_in_{file_name}_until_{t_index:05}.parquet'
            df.to_parquet(partial_filename, index=False)
        elif File == 'excel':
            partial_filename = f'complete_{channel}_in_{file_name}_until_{t_index:05}.xlsx'
            df.to_excel(partial_filename, index=False)
        # files.download(partial_filename)

    except Exception as e:
        print(f'{channel} error: {e}')

    loop_end_time = time.time()
    loop_duration = loop_end_time - loop_start_time

    if loop_duration < 60:
        time.sleep(60 - loop_duration)

print(f'\n{"-" * 50}\n#Concluded! #{t_index:05} posts were scraped!\n{"-" * 50}\n\n\n\n')
df = pd.DataFrame(data)
if File == 'parquet':
    final_filename = f'FINAL_{file_name}_with_{t_index:05}.parquet'
    df.to_parquet(final_filename, index=False)
elif File == 'excel':
    final_filename = f'FINAL_{file_name}_with_{t_index:05}.xlsx'
    df.to_excel(final_filename, index=False)
files.download(final_filename)
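
The progress line printed during scraping relies on a heuristic: messages are visited from newest to oldest, so the current message ID is treated as a rough count of the messages that may still remain. The sketch below isolates that arithmetic from `print_progress` so it can be checked by hand. The function name `estimate_progress` is hypothetical, and the result is only an approximation: gaps in message IDs (for example, deleted messages) or an early stop at the start date make the real remaining work smaller than estimated.

```python
def estimate_progress(t_index, message_id, elapsed_seconds, max_t_index):
    """Return (fraction done, estimated seconds remaining) as print_progress computes them."""
    # Expected total = messages already scraped + IDs still ahead, capped at the limit
    expected_total = min(t_index + message_id, max_t_index)
    progress = t_index / expected_total
    # Extrapolate linearly: total time = elapsed / fraction done
    remaining = elapsed_seconds / progress - elapsed_seconds
    return progress, remaining

# 250 messages scraped, currently at message ID 750, 100 seconds elapsed
print(estimate_progress(250, 750, 100.0, 1000000))
# Same idea, but the message cap (1000) is smaller than the ID-based estimate
print(estimate_progress(500, 5000, 60.0, 1000))
```

In the first call the expected total is 1000 messages, so a quarter of the work is done and about three times the elapsed time remains.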


## **Reading**

To read the generated file or convert it from `.parquet` to another format, use the cell below.

In [None]:
# Imports for file download (Google Colab) and data manipulation (pandas)
from google.colab import files
import pandas as pd

filename = 'TYPE_HERE_YOUR_FILENAME.parquet'  # Set the Parquet file name here, including the .parquet extension

# Read the final Parquet file
df_parquet = pd.read_parquet(filename)

# Display the dataframe to the user (optional)
display(df_parquet)

# Convert the dataframe to Excel and set the new filename
excel_filename = filename.replace('.parquet', '.xlsx')

# Save the dataframe as an Excel file
df_parquet.to_excel(excel_filename, index=False)

# Download the Excel file
files.download(excel_filename)
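
Two columns of the scraped table are stored in packed form: `Reactions` is a string of alternating emoji and counts, and `Comments List` is a JSON string holding one object per comment. As a minimal sketch of how they could be unpacked for analysis, the code below uses only the standard library. The helper names `parse_reactions` and `expand_comments`, the `Parent Message ID` field, and the sample row are hypothetical; only the column names come from the scraping cell.

```python
import json

def parse_reactions(reaction_string):
    """Turn a string such as '👍 12 🔥 3 ' into {'👍': 12, '🔥': 3}."""
    tokens = (reaction_string or '').split()
    # Tokens alternate between emoji and count, so pair even and odd positions
    return {emoji: int(count) for emoji, count in zip(tokens[0::2], tokens[1::2])}

def expand_comments(rows):
    """Flatten the JSON in 'Comments List' into one dict per comment."""
    comments = []
    for row in rows:
        for comment in json.loads(row.get('Comments List') or '[]'):
            # Keep a link back to the post the comment belongs to
            comment['Parent Message ID'] = row['Message ID']
            comments.append(comment)
    return comments

# Hypothetical sample shaped like one row of the scraped table
sample_rows = [{
    'Message ID': 10,
    'Reactions': '👍 12 🔥 3 ',
    'Comments List': json.dumps([
        {'Comment Message ID': 11, 'Comment Content': 'first'},
        {'Comment Message ID': 12, 'Comment Content': 'second'},
    ]),
}]

print(parse_reactions(sample_rows[0]['Reactions']))
print(expand_comments(sample_rows))
```

With a real file, the rows could come from `df_parquet.to_dict('records')`, and the flattened result could be wrapped in `pd.DataFrame(...)` to obtain one row per comment.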
