# PropWatch Cyprus: End-to-End Pipeline

**Scrape → Merge → Filter → Clean → Lemmatize → Analyse**

Corpus sources:

- **Tier 1**: RT / Sputnik via direct sitemap (`src/scraping/news.py`)
- **Tier 2/3**: Telegram channels via Telethon (`src/scraping/telegram.py`)
- **Twitter/X**: Jan 2026 kompromat event (`src/scraping/twitter.py`)

Run cells top-to-bottom. Scraping cells are commented out by default; uncomment them to collect fresh data.


In [None]:
from dotenv import load_dotenv
load_dotenv(override=True)  # must run before any src.* imports

import pandas as pd

from src.config import (
    RAW_CSV, PRECLEANED_CSV, CLEAN_CSV,
    CYRILLIC_CSV, LATIN_CSV, GREEK_CSV,
    CYRILLIC_LEMMATIZED_CSV, GREEK_LEMMATIZED_CSV,
    WORD_FREQ_CSV, BIGRAMS_CSV, TRIGRAMS_CSV,
    LATIN_WORD_FREQ_CSV, LATIN_BIGRAMS_CSV, LATIN_TRIGRAMS_CSV,
    GREEK_WORD_FREQ_CSV, GREEK_BIGRAMS_CSV, GREEK_TRIGRAMS_CSV,
    ARCHIVED_RAW_DIR, TWITTER_RAW_CSV,
)
from src.scraping.telegram import scrape_channels
from src.scraping.news import scrape_all_tier1
from src.scraping.twitter import scrape_twitter
from src.preprocessing.filtering import filter_messages, tag_categories
from src.preprocessing.text_cleaning import clean_and_split
from src.analysis.lemmatization import lemmatize_column, lemmatize_greek_column
from src.analysis.frequency import ensure_list_column, word_frequency, compute_ngrams

print('Imports OK')

## 1. Scrape: Tier 1 (RT / Sputnik via sitemap)


In [None]:
# Requires sitemap_index entries in configs/channels.yaml.
# RT English and RT Russian are pre-configured.
# Output saved per-domain to data/raw/archived/ + merged tier1_archived_raw.csv

# df_tier1 = scrape_all_tier1()

# Load existing:
tier1_path = ARCHIVED_RAW_DIR / 'tier1_archived_raw.csv'
df_tier1 = pd.read_csv(tier1_path) if tier1_path.exists() else pd.DataFrame()
print(f'Tier 1 articles loaded: {len(df_tier1)}')
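A sitemap index is an XML file that lists child sitemaps, each of which lists article URLs. As a minimal sketch of that first step (not the actual logic in `src/scraping/news.py`, which is not shown here), the standard library can pull the `<loc>` entries out of an index. The function name `extract_locs` and the `example.com` URLs are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_locs(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap or sitemap-index document."""
    root = ET.fromstring(xml_text)
    return [el.text.strip() for el in root.iterfind(".//sm:loc", SITEMAP_NS) if el.text]


# Hypothetical sitemap index, inlined so the sketch runs offline.
sample_index = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap_2026_01.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_2026_02.xml</loc></sitemap>
</sitemapindex>"""

child_sitemaps = extract_locs(sample_index)
print(child_sitemaps)
```

The same function works on a child sitemap, because article entries also use `<loc>` under the same namespace.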

## 2. Scrape: Tier 2 / 3 (Telegram channels)


In [None]:
# Requires TELEGRAM_APP_ID and TELEGRAM_API_HASH in .env
# Channels loaded from configs/channels.yaml (archived: false entries only)
# First run will prompt for phone number + Telegram auth code in terminal.

# df_telegram = await scrape_channels()

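# Load existing (RAW_CSV is already imported in the first cell).
# Caution: the merge cell in section 4 also writes to RAW_CSV, so after a merge
# this load returns the merged corpus rather than Telegram-only data.
df_telegram = pd.read_csv(RAW_CSV) if RAW_CSV.exists() else pd.DataFrame()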
print(f'Telegram messages loaded: {len(df_telegram)}')

--- Processing: warfakes ---
   Success: Scraped 1053 messages.
   [Anti-Ban] Pausing for 16s...
--- Processing: rybar ---
   Success: Scraped 2321 messages.
   [Anti-Ban] Pausing for 13s...
--- Processing: rusembcy ---
   Success: Scraped 797 messages.
   [Anti-Ban] Pausing for 12s...

Done! Saved 4171 rows to /Users/andreas/PropWatch-Cyprus/data/processed/corpus_raw.csv
Loaded 4171 raw messages


Unnamed: 0,message_id,date,channel,region,text,views,forwards,reactions,reply_to_id,edit_date
0,33114,2026-02-28 16:14:59+00:00,warfakes,tier2_postban,**Что изменилось к вечеру 28 февраля:** 🟢Расх...,9115,25,117,33109.0,2026-02-28 16:15:07+00:00
1,33113,2026-02-28 15:02:57+00:00,warfakes,tier2_postban,"Оцените, как всё по-взрослому! В таком раннем ...",18579,31,238,,2026-02-28 15:26:28+00:00
2,33110,2026-02-28 13:43:49+00:00,warfakes,tier2_postban,**Фейк**: Учащиеся в колледже «Алабуга» подрос...,26498,36,292,,2026-02-28 13:43:52+00:00
3,33109,2026-02-28 09:46:11+00:00,warfakes,tier2_postban,**В ответ на атаку со стороны Израиля Иран пре...,47082,115,469,,2026-02-28 11:09:40+00:00
4,33108,2026-02-28 07:40:31+00:00,warfakes,tier2_postban,Нынешняя серия ударов по Ирану в очередной раз...,53790,155,539,,2026-02-28 15:18:11+00:00
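
The `[Anti-Ban] Pausing for …s` lines in the log above suggest a randomized delay between channels. A minimal sketch of such a pause follows; the 10 to 20 second bounds and the name `anti_ban_pause` are assumptions, not values taken from `src/scraping/telegram.py`.

```python
import random
import time


def anti_ban_pause(min_s: int = 10, max_s: int = 20, sleep=time.sleep) -> int:
    """Sleep for a random whole number of seconds in [min_s, max_s] and return it."""
    delay = random.randint(min_s, max_s)  # inclusive on both ends
    print(f"   [Anti-Ban] Pausing for {delay}s...")
    sleep(delay)
    return delay


# Demo with a no-op sleep so the example returns immediately.
delay = anti_ban_pause(sleep=lambda seconds: None)
```

Passing the sleep function as a parameter keeps the helper testable without waiting; in an `async` scraper the pause itself would more likely be `await asyncio.sleep(delay)`.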


## 3. Scrape: Twitter/X (Jan 2026 kompromat event)


In [None]:
# Requires TWITTER_BEARER_TOKEN in .env
# Targets the coordinated operation against President Christodoulides (Jan 2026)
# Note: Twitter API v2 free tier covers last 7 days only.
# For full Jan 2026 window, a Pro/Academic tier token is required.

# df_twitter = scrape_twitter()

# Load existing:
df_twitter = pd.read_csv(TWITTER_RAW_CSV) if TWITTER_RAW_CSV.exists() else pd.DataFrame()
print(f'Tweets loaded: {len(df_twitter)}')
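
Because the cell above notes that the free tier only reaches back seven days, it can help to check the target window before spending a request. The sketch below is an illustrative guard only; the helper name and the example dates are assumptions, and it does not call the Twitter API.

```python
from datetime import datetime, timedelta, timezone


def within_recent_window(start: datetime, now: datetime, days: int = 7) -> bool:
    """Return True if `start` is no more than `days` days before `now`."""
    return now - start <= timedelta(days=days)


# Hypothetical dates: a run on 1 March 2026 targeting the start of January 2026.
now = datetime(2026, 3, 1, tzinfo=timezone.utc)
event_start = datetime(2026, 1, 1, tzinfo=timezone.utc)

if not within_recent_window(event_start, now):
    print("Window starts more than 7 days ago: a higher-tier token is needed.")
```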

## 4. Merge corpus


In [None]:
# Combine all sources into one DataFrame.
# source_url is present in Tier 1 and Twitter but not in Telegram; missing values become ''.
frames = [df for df in [df_tier1, df_telegram, df_twitter] if not df.empty]

if not frames:
    raise RuntimeError('No data loaded: run at least one scraper first.')

df_raw = pd.concat(frames, ignore_index=True)
# Create the column if no source supplied it, then blank out missing values.
if 'source_url' not in df_raw.columns:
    df_raw['source_url'] = ''
df_raw['source_url'] = df_raw['source_url'].fillna('')

df_raw.to_csv(RAW_CSV, index=False)
print(f'Merged corpus: {len(df_raw)} rows ({len(df_tier1)} Tier1 | {len(df_telegram)} Telegram | {len(df_twitter)} Twitter)')
df_raw.head()
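
To illustrate how the merge treats columns that only some sources provide, here is a small self-contained sketch on toy frames. The column values are invented for the example; only the `source_url` column name comes from the cell above.

```python
import pandas as pd

# Toy stand-ins: one source has source_url, the other does not.
tier1 = pd.DataFrame({"text": ["article"], "source_url": ["https://example.com/a"]})
telegram = pd.DataFrame({"text": ["post"], "views": [100]})

# concat takes the union of columns and fills the gaps with NaN.
merged = pd.concat([tier1, telegram], ignore_index=True)

# Guard against the case where no frame carried the column at all.
if "source_url" not in merged.columns:
    merged["source_url"] = ""
merged["source_url"] = merged["source_url"].fillna("")

print(merged)
```

Columns that only Telegram supplies, such as `views` here, stay `NaN` for the other rows unless they are filled explicitly.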

## 5. Filter & tag


In [None]:
df_filtered = filter_messages(df_raw)
df_tagged = tag_categories(df_filtered)
df_tagged.to_csv(PRECLEANED_CSV, index=False)
print(f'Saved {len(df_tagged)} pre-cleaned rows to {PRECLEANED_CSV}')
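
The internals of `filter_messages` and `tag_categories` are not shown in this notebook. As a purely hypothetical sketch of what keyword-based tagging can look like, the snippet below matches each text against a small pattern map. The category names, keywords, and `tag_text` helper are all assumptions made for illustration.

```python
import re

# Hypothetical category -> pattern map; the project's real keyword lists are not shown here.
CATEGORY_PATTERNS = {
    "cyprus": re.compile(r"\b(?:cyprus|nicosia|кипр\w*)\b", re.IGNORECASE),
    "sanctions": re.compile(r"\b(?:sanction\w*|санкци\w*)\b", re.IGNORECASE),
}


def tag_text(text: str) -> list[str]:
    """Return the sorted names of all categories whose pattern matches `text`."""
    return sorted(name for name, pattern in CATEGORY_PATTERNS.items() if pattern.search(text))


print(tag_text("New sanctions on Cyprus banks"))
print(tag_text("Санкции против Кипра"))
```

A filter step could then keep only rows whose tag list is non-empty.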

## 6. Clean text & split by language


In [None]:
df_clean = pd.read_csv(PRECLEANED_CSV)
df_all, df_ru, df_en, df_gr = clean_and_split(df_clean)

df_all.to_csv(CLEAN_CSV, index=False)
df_ru.to_csv(CYRILLIC_CSV, index=False)
df_en.to_csv(LATIN_CSV, index=False)
df_gr.to_csv(GREEK_CSV, index=False)
print(f'Saved clean data ({len(df_all)} total | {len(df_ru)} Russian | {len(df_en)} English | {len(df_gr)} Greek)')
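
The output names (`CYRILLIC_CSV`, `LATIN_CSV`, `GREEK_CSV`) suggest the split is made by script. How `clean_and_split` decides is not shown, so the following is only a sketch of one common approach: count letters per Unicode script and pick the majority. The function name `dominant_script` is an assumption.

```python
import unicodedata
from collections import Counter


def dominant_script(text: str) -> str:
    """Classify text as 'cyrillic', 'greek', 'latin', or 'other' by letter counts."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, emoji, whitespace
        name = unicodedata.name(ch, "")
        for script in ("CYRILLIC", "GREEK", "LATIN"):
            if name.startswith(script):
                counts[script.lower()] += 1
                break
    return counts.most_common(1)[0][0] if counts else "other"


for sample in ["Санкции против Кипра", "Κύπρος", "Cyprus", "12345 !!!"]:
    print(sample, "->", dominant_script(sample))
```

Majority voting tolerates mixed-script messages, for example a Russian post that quotes a Latin-script hashtag.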

## 7. Lemmatize


In [None]:
# Russian: stanza ru pipeline
df_cyr = pd.read_csv(CYRILLIC_CSV)
df_cyr = lemmatize_column(df_cyr)
df_cyr.to_csv(CYRILLIC_LEMMATIZED_CSV, index=False)
print(f'Russian lemmatized: {len(df_cyr)} rows → {CYRILLIC_LEMMATIZED_CSV}')

# Greek: stanza el pipeline
df_gr = pd.read_csv(GREEK_CSV)
df_gr = lemmatize_greek_column(df_gr)
df_gr.to_csv(GREEK_LEMMATIZED_CSV, index=False)
print(f'Greek lemmatized:  {len(df_gr)} rows → {GREEK_LEMMATIZED_CSV}')
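
If the `lemmas` column holds Python lists, writing it to CSV stores each list as its string representation, and it has to be parsed back into a list on reload. That is presumably the job of `ensure_list_column` in the next section, though its implementation is not shown. A minimal sketch of such a parser, with the assumed name `parse_list_cell`:

```python
import ast


def parse_list_cell(cell) -> list:
    """Turn a CSV cell such as "['a', 'b']" back into a Python list."""
    if isinstance(cell, list):
        return cell  # already a list, nothing to do
    if not isinstance(cell, str) or not cell.strip():
        return []  # NaN or empty cell
    try:
        value = ast.literal_eval(cell)
    except (ValueError, SyntaxError):
        return cell.split()  # not a literal: fall back to whitespace tokens
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


print(parse_list_cell("['кипр', 'санкция']"))
print(parse_list_cell("кипр санкция"))
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for cells read from disk.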

## 8. Word frequency & n-gram analysis


In [None]:
# ── Russian ──────────────────────────────────────────────────────────
df_ru = pd.read_csv(CYRILLIC_LEMMATIZED_CSV)
df_ru['lemmas'] = ensure_list_column(df_ru['lemmas'])
wf = word_frequency(df_ru['lemmas'])
wf.to_csv(WORD_FREQ_CSV, index=False)
bg = compute_ngrams(df_ru['lemmas'], n=2, min_freq=3)
tg = compute_ngrams(df_ru['lemmas'], n=3, min_freq=3)
bg.to_csv(BIGRAMS_CSV, index=False)
tg.to_csv(TRIGRAMS_CSV, index=False)
print('Russian top 20 words:')
print(wf.head(20).to_string(index=False))

# ── English ──────────────────────────────────────────────────────────
if LATIN_CSV.exists():
    df_en = pd.read_csv(LATIN_CSV)
    if 'lemmas' not in df_en.columns:
        df_en['lemmas'] = df_en['text_cleaned'].str.split()
    df_en['lemmas'] = ensure_list_column(df_en['lemmas'])
    wf_en = word_frequency(df_en['lemmas'])
    wf_en.to_csv(LATIN_WORD_FREQ_CSV, index=False)
    compute_ngrams(df_en['lemmas'], n=2, min_freq=3).to_csv(LATIN_BIGRAMS_CSV, index=False)
    compute_ngrams(df_en['lemmas'], n=3, min_freq=3).to_csv(LATIN_TRIGRAMS_CSV, index=False)
    print(f'English word freq saved ({len(wf_en)} unique lemmas)')

# ── Greek ────────────────────────────────────────────────────────────
if GREEK_LEMMATIZED_CSV.exists():
    df_gr = pd.read_csv(GREEK_LEMMATIZED_CSV)
    df_gr['lemmas'] = ensure_list_column(df_gr['lemmas'])
    wf_gr = word_frequency(df_gr['lemmas'])
    wf_gr.to_csv(GREEK_WORD_FREQ_CSV, index=False)
    compute_ngrams(df_gr['lemmas'], n=2, min_freq=3).to_csv(GREEK_BIGRAMS_CSV, index=False)
    compute_ngrams(df_gr['lemmas'], n=3, min_freq=3).to_csv(GREEK_TRIGRAMS_CSV, index=False)
    print(f'Greek word freq saved ({len(wf_gr)} unique lemmas)')
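
For readers unfamiliar with the `n` and `min_freq` arguments used above, the sketch below shows one plausible way to count n-grams over token lists and drop rare ones. It is an illustration built on `collections.Counter`, not the code in `src/analysis/frequency.py`; the name `count_ngrams` and the toy documents are assumptions.

```python
from collections import Counter


def count_ngrams(docs, n=2, min_freq=3):
    """Count n-grams within each token list; keep those seen at least min_freq times."""
    counts = Counter()
    for tokens in docs:
        # zip over n shifted views of the list yields consecutive n-token tuples.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return [(gram, c) for gram, c in counts.most_common() if c >= min_freq]


docs = [
    ["sanction", "against", "cyprus"],
    ["sanction", "against", "russia"],
    ["sanction", "against"],
]
bigrams = count_ngrams(docs, n=2, min_freq=3)
print(bigrams)
```

Counting inside each document, rather than over one concatenated list, avoids creating n-grams that span two unrelated messages.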

## 9. Propaganda classification (TODO)

Once `src/classification/model.py` is implemented and weights are placed in `models/`,
uncomment the cell below.


In [None]:
# from src.classification.model import predict
# df_ru_clean = pd.read_csv(CYRILLIC_CSV)
# predictions = predict(df_ru_clean['text_cleaned'].tolist())

## 10. BERTopic clustering (TODO)

H1 (technique distribution) and H4 (narrative drift over time).
Implement in `src/clustering/bertopic_pipeline.py`.


In [None]:
# from src.clustering.bertopic_pipeline import run_bertopic
# topics, topic_model = run_bertopic(df_ru['text_cleaned'].tolist())

## 11. Interrupted time series: Jan 2026 kompromat event (TODO)

H3: detectable structural break in technique intensity around Jan 2026.
Implement in `src/stats/its.py`.


In [None]:
# from src.stats.its import run_its
# results = run_its(df_raw)
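
Until `run_its` exists, the idea behind an interrupted time series can be sketched as a segmented regression: fit a pre-event trend, then estimate the jump in level and the change in slope at the break. The example below uses ordinary least squares on a synthetic, noise-free series. The function name `fit_its`, the break position, and the data are all assumptions, and a real analysis would also need to handle autocorrelation.

```python
import numpy as np


def fit_its(y, break_idx):
    """Segmented OLS: y = b0 + b1*t + b2*post + b3*(t - break_idx)*post."""
    y = np.asarray(y, dtype=float)
    t = np.arange(len(y), dtype=float)
    post = (t >= break_idx).astype(float)  # 1 from the break onward, else 0
    X = np.column_stack([np.ones_like(t), t, post, (t - break_idx) * post])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    names = ["intercept", "pre_slope", "level_change", "slope_change"]
    return dict(zip(names, coef))


# Synthetic series: level jumps by 3.0 and slope rises by 0.25 at t = 10.
t = np.arange(20)
y = 2.0 + 0.5 * t + np.where(t >= 10, 3.0 + 0.25 * (t - 10), 0.0)

result = fit_its(y, break_idx=10)
print({k: round(float(v), 3) for k, v in result.items()})
```

For H3, `level_change` and `slope_change` are the quantities of interest: values far from zero would point to a structural break at the chosen date.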