### ⚙️ **Retrieve and Preprocess Messages of the Opportunities For You channel** ⚙️
The code constructs two JSON files:
* **o4u_messages_Jun_07_2025.json:** keeps only the message texts from **o4u_logs_Jun_07_2025.json**, dropping all other fields
* **o4u_preprocessed_messages_Jun_07_2025.json:** the same messages with emojis, HTML tags, URLs, and other special characters removed, since they are not needed for further processing

In [3]:
!pip install emoji






In [4]:
import json
import re
import emoji
import unicodedata

In [5]:
def extract_plain_text(message):
    '''
    Extracts the plain-text content of a single message from the JSON log.

    Args:
        message: Message object whose 'text' field holds strings and/or
                 dicts with a 'text' key
    
    Returns:
        str:   Concatenated text content if non-empty
        None:  When the result is an empty string
    '''
    text_parts = []
    for part in message['text']:
        if isinstance(part, str):
            text_parts.append(part)
        elif isinstance(part, dict) and 'text' in part:
            text_parts.append(part['text'])
    result = ''.join(text_parts)
    return None if result == '' else result

with open('data/o4u_logs_Jun_07_2025.json', 'r', encoding='utf-8') as f:
    original_data = json.load(f)

messages_texts = []
for message in original_data['messages']:
    plain_text = extract_plain_text(message)
    if plain_text:
        messages_texts.append(plain_text)

with open('data/o4u_messages_Jun_07_2025.json', 'w', encoding='utf-8') as f:
    json.dump(messages_texts, f, ensure_ascii=False, indent=2)
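
To make the extraction step concrete, here is a minimal sketch that runs `extract_plain_text` on three made-up messages. The message dictionaries (`plain`, `mixed`, `empty`) and the `'type': 'link'` key are assumptions for illustration and are not taken from the real log; the function is repeated only so the block runs on its own.

```python
def extract_plain_text(message):
    # Same logic as the cell above, repeated so this sketch is self-contained
    text_parts = []
    for part in message['text']:
        if isinstance(part, str):
            text_parts.append(part)
        elif isinstance(part, dict) and 'text' in part:
            text_parts.append(part['text'])
    result = ''.join(text_parts)
    return None if result == '' else result

# Hypothetical messages, not taken from the real log
plain = {'text': 'Hello'}
mixed = {'text': ['See ', {'type': 'link', 'text': 'example.org'}, ' for details']}
empty = {'text': ''}

plain_result = extract_plain_text(plain)   # a plain string is iterated character by character
mixed_result = extract_plain_text(mixed)   # strings and dict parts are concatenated in order
empty_result = extract_plain_text(empty)   # empty text yields None and is skipped by the caller

print(plain_result)
print(mixed_result)
print(empty_result)
```

Because empty results come back as `None`, the `if plain_text:` check in the loop above is what keeps text-less messages out of the output file.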

In [6]:
with open("data/o4u_messages_Jun_07_2025.json", "r", encoding="utf-8") as f:
    event_texts = json.load(f)

event_texts[:5]

['Dear students,\n\nThis channel advertises minor extracurricular activities, internal and external events, hackathons, competitions, campaigns and other potentially interesting happenings. All mentioned is supposed to help you to keep informed about additional opportunities for own personal and professional development.\n\nKeep in touch!',
 '📣Hi there!\n\nStudent Affairs is urgently looking for 3 volunteers to help with administrative work today from 15:30 until 18:00. \n\nYour efforts will be compensated with:\n- innopoints\n- tea & cookies, if you like\n- friendly 319 team\n- amazing reputation in the future!\n\n👉If you may help please message @andrejsblakunovs',
 '📣Hi there! Want any of these?\n\nStudent Affairs are looking for volunteers to help with administrative work \n\n- today 15:00-17:00 or\n- tomorrow in 319 from 14:00 to 16:00.\n\n✅Your efforts will be compensated with:\n\n- IBC 2019 T-shirt\n- tea & cookies, if you like\n- friendliness of 319 team!\n\n👉If you may help ple

In [8]:
def preprocess_event_text(text):
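    '''
    Normalizes and cleans a raw message string.
    Removes emojis, HTML tags, URLs, and any character other than word
    characters, whitespace, and basic punctuation. Also applies Unicode NFKC
    normalization and collapses runs of whitespace into single spaces.

    Args:
        text: Raw input string to be processed
    
    Returns:
        str: Normalized and cleaned text
    '''
    # Drop emoji characters
    text = emoji.replace_emoji(text, replace='')
    # Fold compatibility characters into their canonical equivalents
    text = unicodedata.normalize('NFKC', text)
    # Strip HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Strip URLs
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    # Keep only word characters, whitespace, and basic punctuation
    text = re.sub(r'[^\w\s.,!?;:()"\'-]', '', text)
    # Collapse whitespace runs (including newlines) into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text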

preprocessed_texts = [preprocess_event_text(text) for text in event_texts]
preprocessed_texts[:5]

['Dear students, This channel advertises minor extracurricular activities, internal and external events, hackathons, competitions, campaigns and other potentially interesting happenings. All mentioned is supposed to help you to keep informed about additional opportunities for own personal and professional development. Keep in touch!',
 'Hi there! Student Affairs is urgently looking for 3 volunteers to help with administrative work today from 15:30 until 18:00. Your efforts will be compensated with: - innopoints - tea cookies, if you like - friendly 319 team - amazing reputation in the future! If you may help please message andrejsblakunovs',
 'Hi there! Want any of these? Student Affairs are looking for volunteers to help with administrative work - today 15:00-17:00 or - tomorrow in 319 from 14:00 to 16:00. Your efforts will be compensated with: - IBC 2019 T-shirt - tea cookies, if you like - friendliness of 319 team! If you may help please message andrejsblakunovs',
 "Bonjour! Ça va? 

In [9]:
with open('data/o4u_preprocessed_messages_Jun_07_2025.json', 'w', encoding='utf-8') as f:
    json.dump(preprocessed_texts, f, ensure_ascii=False, indent=2)
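
The cleaning steps in `preprocess_event_text` have side effects that are easy to miss in the output above: the character whitelist also drops symbols such as `&` and `@`, which is why "tea & cookies" becomes "tea cookies" and the handle loses its `@`. The sketch below reproduces the regex and normalization steps on a made-up sample. The helper name `clean_without_emoji_step` and the sample string are hypothetical, and the `emoji.replace_emoji` call is left out so the block needs only the standard library.

```python
import re
import unicodedata

def clean_without_emoji_step(text):
    # Hypothetical helper: the same steps as preprocess_event_text,
    # minus the emoji.replace_emoji call (omitted to avoid the dependency)
    text = unicodedata.normalize('NFKC', text)
    text = re.sub(r'<[^>]+>', '', text)                # HTML tags
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # URLs
    text = re.sub(r'[^\w\s.,!?;:()"\'-]', '', text)    # anything outside the whitelist
    text = re.sub(r'\s+', ' ', text).strip()           # whitespace runs
    return text

# Made-up sample, not taken from the channel
sample = 'Tea & cookies!\n\nMessage @someone or visit https://example.com/form <b>today</b>'
cleaned = clean_without_emoji_step(sample)
print(cleaned)  # Tea cookies! Message someone or visit today
```

If handles or ampersands matter for later steps, one option is to add `@` and `&` to the whitelist character class; whether that is desirable depends on how the preprocessed messages are used downstream.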