# PropWatch Cyprus: End-to-End Pipeline

**Scrape → Merge → Filter → Clean → Lemmatize → Analyse**

Corpus sources:

- **Tier 1**: RT / Sputnik via direct sitemap (`src/scraping/news.py`)
- **Tier 2/3**: Telegram channels via Telethon (`src/scraping/telegram.py`)
- **Twitter/X**: Jan 2026 kompromat event (`src/scraping/twitter.py`)

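Run the cells from top to bottom. Every scraping cell ends with a "Load existing" step that reads the previously saved CSV; comment out the scraper call above it to skip fresh collection (the Twitter/X scraper is commented out by default).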


In [2]:
from dotenv import load_dotenv
load_dotenv(override=True)  # must run before any src.* imports

import pandas as pd

from src.config import (
    RAW_CSV, PRECLEANED_CSV, CLEAN_CSV,
    CYRILLIC_CSV, LATIN_CSV, GREEK_CSV,
    CYRILLIC_LEMMATIZED_CSV, GREEK_LEMMATIZED_CSV,
    WORD_FREQ_CSV, BIGRAMS_CSV, TRIGRAMS_CSV,
    LATIN_WORD_FREQ_CSV, LATIN_BIGRAMS_CSV, LATIN_TRIGRAMS_CSV,
    GREEK_WORD_FREQ_CSV, GREEK_BIGRAMS_CSV, GREEK_TRIGRAMS_CSV,
    ARCHIVED_RAW_DIR, TWITTER_RAW_CSV,
)
from src.scraping.telegram import scrape_channels
from src.scraping.news import scrape_all_tier1
# from src.scraping.twitter import scrape_twitter
from src.preprocessing.filtering import filter_messages, tag_categories
from src.preprocessing.text_cleaning import clean_and_split
from src.analysis.lemmatization import lemmatize_column, lemmatize_greek_column
from src.analysis.frequency import ensure_list_column, word_frequency, compute_ngrams

print('Imports OK')

Imports OK
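
The ordering note on `load_dotenv` matters when a configuration module reads environment variables at import time. The sketch below illustrates that effect with the standard library only. It assumes, without confirmation from the source, that `src.config` behaves this way; `DEMO_TOKEN` and `read_config_at_import` are hypothetical names used only for illustration.

```python
import os


def read_config_at_import() -> dict:
    """Stand-in for module-level code in a config module (hypothetical)."""
    # A real module would evaluate this once, when it is first imported.
    return {"token": os.getenv("DEMO_TOKEN", "")}


# Start from a clean state so the demo is repeatable.
os.environ.pop("DEMO_TOKEN", None)

# Case 1: the "import" happens before the environment is populated.
early = read_config_at_import()

# Case 2: the environment is populated first (what load_dotenv() does from .env).
os.environ["DEMO_TOKEN"] = "secret"
late = read_config_at_import()

print(early)
print(late)
```

In the first case the value is frozen as an empty string, which is why the `.env` file has to be loaded before anything that reads it is imported.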


## 1. Scrape — Tier 1 (RT / Sputnik via sitemap)


In [4]:
# Requires sitemap_index entries in configs/channels.yaml.
# RT English and RT Russian are pre-configured.
# Output saved per-domain to data/raw/archived/ + merged tier1_archived_raw.csv

df_tier1 = scrape_all_tier1()

# Load existing:
tier1_path = ARCHIVED_RAW_DIR / 'tier1_archived_raw.csv'
df_tier1 = pd.read_csv(tier1_path) if tier1_path.exists() else pd.DataFrame()
print(f'Tier 1 articles loaded: {len(df_tier1)}')


=== Scraping rt.com ===
   Fetching sitemap index: https://www.rt.com/sitemap.xml
   Found 3 child sitemaps in study window
   Total URLs discovered: 33116
   After Cyprus URL pre-filter: 4
   ✓ 1 kept | https://www.rt.com/news/599597-hezbollah-nasrallah-israel-cyprus/
   ✓ 2 kept | https://www.rt.com/news/608166-cyprus-nato-accession-plan/
   ✓ 3 kept | https://www.rt.com/russia/608379-cyprus-strips-russian-ukrainian-passports/
   ✓ 4 kept | https://www.rt.com/business/631975-ceo-uralkali-baumgartner-dead-cyprus/
   Saved 4 articles to /Users/andreas/PropWatch-Cyprus/data/raw/archived/rt_com_raw.csv

=== Scraping russian.rt.com ===
   Fetching sitemap index: https://russian.rt.com/sitemap.xml
   Found 8 child sitemaps in study window
   Total URLs discovered: 328231
   After Cyprus URL pre-filter: 0
   Saved 0 articles to /Users/andreas/PropWatch-Cyprus/data/raw/archived/russian_rt_com_raw.csv
   [skip] No sitemap_index for Sputnik Russian — add it to channels.yaml
   [skip] No sitem
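
The log above shows a large URL set being reduced by a "Cyprus URL pre-filter" before any article is fetched. A minimal sketch of that idea follows, assuming the filter is a case-insensitive keyword match on the URL string; the actual logic in `src/scraping/news.py` is not shown in this notebook, and `extract_urls`, `prefilter`, and the sample sitemap are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Inline sample so the sketch runs without network access.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/news/1-cyprus-nato-plan/</loc></url>
  <url><loc>https://example.org/news/2-weather-report/</loc></url>
  <url><loc>https://example.org/business/3-Cyprus-banking/</loc></url>
</urlset>"""


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found in a urlset document."""
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", SITEMAP_NS) if loc.text]


def prefilter(urls: list[str], keywords: tuple[str, ...] = ("cyprus",)) -> list[str]:
    """Keep URLs whose text contains any keyword (assumed matching rule)."""
    return [u for u in urls if any(k in u.lower() for k in keywords)]


all_urls = extract_urls(SAMPLE_SITEMAP)
kept = prefilter(all_urls)
print(f"Total URLs discovered: {len(all_urls)}")
print(f"After pre-filter: {len(kept)}")
```

Filtering on the URL alone is cheap but only catches articles whose slug mentions the keyword, which may explain why a Cyrillic-slug or ID-only site yields no matches.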

## 2. Scrape — Tier 2 / 3 (Telegram channels)


In [8]:
# Requires TELEGRAM_APP_ID and TELEGRAM_API_HASH in .env
# Channels loaded from configs/channels.yaml (archived: false entries only)
# First run will prompt for phone number + Telegram auth code in terminal.

df_telegram = await scrape_channels()

# Load existing (RAW_CSV is already imported in the first cell):
df_telegram = pd.read_csv(RAW_CSV) if RAW_CSV.exists() else pd.DataFrame()
print(f'Telegram messages loaded: {len(df_telegram)}')

--- Processing: warfakes ---
   Success: Scraped 1051 messages.
   [Anti-Ban] Pausing for 12s...
--- Processing: rybar ---
   Success: Scraped 2323 messages.
   [Anti-Ban] Pausing for 23s...
--- Processing: rusembcy ---
   Success: Scraped 797 messages.
   [Anti-Ban] Pausing for 25s...

Done! Saved 4171 rows to /Users/andreas/PropWatch-Cyprus/data/processed/corpus_raw.csv
Telegram messages loaded: 4171
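
The run above processes only non-archived channels and pauses for a varying number of seconds between them. The sketch below shows one way such a loop could be structured. It is an assumption-laden illustration, not the code in `src/scraping/telegram.py`: the `name` and `archived` keys, the 10 to 30 second range, and both function names are hypothetical.

```python
import random
import time


def active_channels(entries):
    """Keep only entries not flagged as archived (hypothetical config layout)."""
    return [e["name"] for e in entries if not e.get("archived", False)]


def polite_pause(min_s=10.0, max_s=30.0, sleep=time.sleep):
    """Sleep for a random duration so requests are not evenly spaced."""
    delay = random.uniform(min_s, max_s)
    print(f"   [Anti-Ban] Pausing for {delay:.0f}s...")
    sleep(delay)
    return delay


entries = [
    {"name": "channel_a", "archived": False},
    {"name": "channel_b", "archived": True},
    {"name": "channel_c"},  # no flag: treated as active
]
channels = active_channels(entries)

# Pass a no-op sleep so the demo finishes instantly.
delays = [polite_pause(sleep=lambda s: None) for _ in channels]
```

Injecting the `sleep` callable keeps the pause logic testable without actually waiting.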


## 3. Scrape — Twitter/X (Jan 2026 kompromat event)


In [None]:
# Requires TWITTER_BEARER_TOKEN in .env
# Targets the coordinated operation against President Christodoulides (Jan 2026)
# Note: the Twitter API v2 recent search endpoint covers the last 7 days only.
# Collecting the full Jan 2026 window requires a token with full-archive search access.

# df_twitter = scrape_twitter()

# Load existing:
df_twitter = pd.read_csv(TWITTER_RAW_CSV) if TWITTER_RAW_CSV.exists() else pd.DataFrame()
print(f'Tweets loaded: {len(df_twitter)}')

## 4. Merge corpus


In [12]:
# Combine all sources into one DataFrame.
# source_url is present in Tier 1 and Twitter but not in Telegram, so missing values are set to '' below.
frames = [df for df in [df_tier1, df_telegram] if not df.empty]
# With Twitter:
# frames = [df for df in [df_tier1, df_telegram, df_twitter] if not df.empty]

if not frames:
    raise RuntimeError('No data loaded: run at least one scraper first.')

df_raw = pd.concat(frames, ignore_index=True)
# Create the column if no source provided it, then blank out missing values.
if 'source_url' not in df_raw.columns:
    df_raw['source_url'] = ''
df_raw['source_url'] = df_raw['source_url'].fillna('')

# Note: RAW_CSV is also the file the Telegram cell loads from, so after this write
# that cell's "Load existing" step returns the merged corpus (Tier 1 rows included).
df_raw.to_csv(RAW_CSV, index=False)
print(f'Merged corpus: {len(df_raw)} rows ({len(df_tier1)} Tier1 | {len(df_telegram)} Telegram)')
# With Twitter:
# print(f'Merged corpus: {len(df_raw)} rows ({len(df_tier1)} Tier1 | {len(df_telegram)} Telegram | {len(df_twitter)} Twitter)')
df_raw.head()

Merged corpus: 4175 rows (4 Tier1 | 4171 Telegram)


| | message_id | date | channel | region | text | views | forwards | reactions | reply_to_id | edit_date | source_url |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 52020e2a5cb25cdf27cd081fada12f18 | 2024-06-19T20:41:15+00:00 | rt.com | tier1_archived | ‘No place’ is safe if Israel starts war – Hezb... | 0 | 0 | 0 | | | https://www.rt.com/news/599597-hezbollah-nasra... |
| 1 | c061410a4344f4d1520888f61b2403e0 | 2024-11-25T12:25:59+00:00 | rt.com | tier1_archived | Cyprus shares secret NATO plan with US – media... | 0 | 0 | 0 | | | https://www.rt.com/news/608166-cyprus-nato-acc... |
| 2 | c00984557e952b9fd506b9178885f1f2 | 2024-11-29T10:44:21+00:00 | rt.com | tier1_archived | Russian and Ukrainian tycoons among ‘golden’ E... | 0 | 0 | 0 | | | https://www.rt.com/russia/608379-cyprus-strips... |
| 3 | 06c6a057e25cb46ca29b720da5f909f4 | 2026-02-04T16:54:12+00:00 | rt.com | tier1_archived | Russian businessman’s remains found on British... | 0 | 0 | 0 | | | https://www.rt.com/business/631975-ceo-uralkal... |
| 4 | 33115 | 2026-02-28 17:28:33+00:00 | warfakes | tier2_postban | \*\*Фейк\*\*: Викарий Патриарха Савва дал интервью... | 7669 | 8 | 89 | | 2026-02-28 17:28:36+00:00 | |
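
The preview above mixes rows that have a `source_url` with rows that do not. The sketch below shows how `pd.concat` treats a column that only some inputs provide: the missing cells become `NaN`, which can then be replaced with an empty string. The two small frames are made-up examples, not project data.

```python
import pandas as pd

# Tier 1 style row: has a source_url.
tier1 = pd.DataFrame(
    {"message_id": ["a1"], "text": ["article"], "source_url": ["https://example.org/a"]}
)
# Telegram style row: no source_url column at all.
telegram = pd.DataFrame({"message_id": [33115], "text": ["post"]})

merged = pd.concat([tier1, telegram], ignore_index=True)

# The Telegram row has NaN in source_url until it is filled.
had_missing = bool(merged["source_url"].isna().any())
merged["source_url"] = merged["source_url"].fillna("")

print(merged)
```

Filling with an empty string keeps the column a plain text column when the frame is later written to CSV and read back.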


## 5. Filter & tag


In [13]:
df_filtered = filter_messages(df_raw)
df_tagged = tag_categories(df_filtered)
df_tagged.to_csv(PRECLEANED_CSV, index=False)
print(f'Saved {len(df_tagged)} pre-cleaned rows to {PRECLEANED_CSV}')

Step 1 (Length Filter): 4172 rows remain
Step 2 (Remove Ads):    4172 rows remain
Step 3 (Topic Filter):  4056 rows remain
Step 4 (Deduplicate):   4053 unique rows remain
Saved 4053 pre-cleaned rows to /Users/andreas/PropWatch-Cyprus/data/processed/corpus_precleaned.csv
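
The four steps reported above (length filter, ad removal, topic filter, deduplication) can be sketched on plain strings as follows. This is an illustration of the general shape only: the threshold, the ad markers, and the topic pattern are hypothetical, and the real rules live in `src/preprocessing/filtering.py`, which this notebook does not show.

```python
import re

MIN_CHARS = 20                                   # hypothetical threshold
AD_MARKERS = ("#реклама", "promo code")          # hypothetical ad markers
TOPIC_PATTERN = re.compile(r"cyprus|кипр|κύπρ", re.IGNORECASE)  # hypothetical


def filter_texts(texts):
    """Apply the four steps in order and report how many rows survive each."""
    step1 = [t for t in texts if len(t.strip()) >= MIN_CHARS]
    step2 = [t for t in step1 if not any(m in t.lower() for m in AD_MARKERS)]
    step3 = [t for t in step2 if TOPIC_PATTERN.search(t)]
    step4 = list(dict.fromkeys(step3))  # drops exact duplicates, keeps order
    counts = {"length": len(step1), "ads": len(step2),
              "topic": len(step3), "dedup": len(step4)}
    return counts, step4


texts = [
    "short",
    "Cyprus signs a new energy agreement today",
    "Cyprus signs a new energy agreement today",
    "Use promo code CYPRUS10 for a discount now",
    "Weather update for the weekend in Moscow",
    "На Кипре обсуждают новые санкции против России",
]
counts, kept = filter_texts(texts)
for step, n in counts.items():
    print(f"{step}: {n} rows remain")
```

Running the cheap checks first means the later, more selective steps operate on fewer rows.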


## 6. Clean text & split by language


In [14]:
df_clean = pd.read_csv(PRECLEANED_CSV)
df_all, df_ru, df_en, df_gr = clean_and_split(df_clean)

df_all.to_csv(CLEAN_CSV, index=False)
df_ru.to_csv(CYRILLIC_CSV, index=False)
df_en.to_csv(LATIN_CSV, index=False)
df_gr.to_csv(GREEK_CSV, index=False)
print(f'Saved clean data ({len(df_all)} total | {len(df_ru)} Russian | {len(df_en)} English | {len(df_gr)} Greek)')

Russian posts: 3869
English posts: 100
Greek posts:   83
Saved clean data (4053 total | 3869 Russian | 100 English | 83 Greek)
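
The split above routes posts into Cyrillic, Latin, and Greek files. One simple way to make that decision is to count letters per Unicode script and pick the most frequent one. The sketch below assumes that approach; it is not taken from `src/preprocessing/text_cleaning.py`, and `dominant_script` is a hypothetical helper.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return 'cyrillic', 'greek', 'latin', or None if the text has no such letters."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unicode character names start with the script name for these alphabets.
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("GREEK"):
            counts["greek"] += 1
        elif name.startswith("LATIN"):
            counts["latin"] += 1
    return counts.most_common(1)[0][0] if counts else None


samples = [
    "Кипр вступает в альянс",
    "Κύπρος και Ευρώπη",
    "Cyprus and the alliance",
    "2026 !!!",
]
labels = [dominant_script(s) for s in samples]
print(labels)
```

In this sketch a post without any letters from the three scripts gets no label, so the per-language counts need not add up to the total.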


## 7. Lemmatize


In [None]:
# Russian — stanza ru pipeline
df_cyr = pd.read_csv(CYRILLIC_CSV)
df_cyr = lemmatize_column(df_cyr)
df_cyr.to_csv(CYRILLIC_LEMMATIZED_CSV, index=False)
print(f'Russian lemmatized: {len(df_cyr)} rows → {CYRILLIC_LEMMATIZED_CSV}')

# Greek — stanza el pipeline
df_gr = pd.read_csv(GREEK_CSV)
df_gr = lemmatize_greek_column(df_gr)
df_gr.to_csv(GREEK_LEMMATIZED_CSV, index=False)
print(f'Greek lemmatized:  {len(df_gr)} rows → {GREEK_LEMMATIZED_CSV}')
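
A list stored in a DataFrame cell is written to CSV as its string representation, so a `lemmas` column read back from disk holds text such as `"['кипр', 'санкция']"` rather than lists. The sketch below shows the round trip with the standard library and one way to parse the cell back. `parse_list_cell` is a hypothetical helper, and the whitespace-split fallback is an assumption, not the behaviour of `ensure_list_column`.

```python
import ast
import csv
import io


def parse_list_cell(cell):
    """Turn a CSV cell back into a list of tokens."""
    if isinstance(cell, list):
        return cell
    if not isinstance(cell, str) or not cell.strip():
        return []
    try:
        value = ast.literal_eval(cell)      # safe parsing of "['a', 'b']"
    except (ValueError, SyntaxError):
        return cell.split()                 # assumed fallback: plain space-separated text
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


# Write a list the way a CSV export would: as its string form.
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(["lemmas"])
writer.writerow([str(["кипр", "санкция"])])

# Read it back: the cell is now a string, not a list.
buffer.seek(0)
raw = next(csv.DictReader(buffer))["lemmas"]
lemmas = parse_list_cell(raw)
print(type(raw).__name__, "->", lemmas)
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from files.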

## 8. Word frequency & n-gram analysis


In [None]:
# ── Russian ───────────────────────────────────────────────────────────────────
df_ru = pd.read_csv(CYRILLIC_LEMMATIZED_CSV)
df_ru['lemmas'] = ensure_list_column(df_ru['lemmas'])
wf = word_frequency(df_ru['lemmas'])
wf.to_csv(WORD_FREQ_CSV, index=False)
bg = compute_ngrams(df_ru['lemmas'], n=2, min_freq=3)
tg = compute_ngrams(df_ru['lemmas'], n=3, min_freq=3)
bg.to_csv(BIGRAMS_CSV, index=False)
tg.to_csv(TRIGRAMS_CSV, index=False)
print('Russian — top 20 words:')
print(wf.head(20).to_string(index=False))

# ── English ───────────────────────────────────────────────────────────────────
if LATIN_CSV.exists():
    df_en = pd.read_csv(LATIN_CSV)
    if 'lemmas' not in df_en.columns:
        df_en['lemmas'] = df_en['text_cleaned'].str.split()
    df_en['lemmas'] = ensure_list_column(df_en['lemmas'])
    wf_en = word_frequency(df_en['lemmas'])
    wf_en.to_csv(LATIN_WORD_FREQ_CSV, index=False)
    compute_ngrams(df_en['lemmas'], n=2, min_freq=3).to_csv(LATIN_BIGRAMS_CSV, index=False)
    compute_ngrams(df_en['lemmas'], n=3, min_freq=3).to_csv(LATIN_TRIGRAMS_CSV, index=False)
    print(f'English word freq saved ({len(wf_en)} unique lemmas)')

# ── Greek ─────────────────────────────────────────────────────────────────────
if GREEK_LEMMATIZED_CSV.exists():
    df_gr = pd.read_csv(GREEK_LEMMATIZED_CSV)
    df_gr['lemmas'] = ensure_list_column(df_gr['lemmas'])
    wf_gr = word_frequency(df_gr['lemmas'])
    wf_gr.to_csv(GREEK_WORD_FREQ_CSV, index=False)
    compute_ngrams(df_gr['lemmas'], n=2, min_freq=3).to_csv(GREEK_BIGRAMS_CSV, index=False)
    compute_ngrams(df_gr['lemmas'], n=3, min_freq=3).to_csv(GREEK_TRIGRAMS_CSV, index=False)
    print(f'Greek word freq saved ({len(wf_gr)} unique lemmas)')
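
The frequency and n-gram tables built above can be approximated with `collections.Counter`. The sketch below is a minimal stand-in, assuming n-grams are counted within each document and filtered by a minimum frequency; `word_freq` and `ngram_freq` are hypothetical names and need not match how `word_frequency` and `compute_ngrams` are implemented.

```python
from collections import Counter


def word_freq(docs):
    """Count tokens across all documents, most frequent first."""
    return Counter(tok for doc in docs for tok in doc).most_common()


def ngram_freq(docs, n=2, min_freq=3):
    """Count n-grams per document (never across document boundaries)."""
    counts = Counter()
    for doc in docs:
        # zip over n shifted views of the token list yields consecutive n-tuples.
        counts.update(zip(*(doc[i:] for i in range(n))))
    return [(" ".join(gram), c) for gram, c in counts.most_common() if c >= min_freq]


docs = [
    ["cyprus", "nato", "plan"],
    ["cyprus", "nato", "talks"],
    ["cyprus", "nato", "plan"],
    ["eu", "sanctions"],
]
top_words = word_freq(docs)
bigrams = ngram_freq(docs, n=2, min_freq=3)
trigrams = ngram_freq(docs, n=3, min_freq=2)
print(top_words[:3])
print(bigrams)
print(trigrams)
```

Counting per document avoids spurious n-grams formed by the last token of one post and the first token of the next.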

## 9. Propaganda classification (TODO)

Once `src/classification/model.py` is implemented and weights are placed in `models/`,
uncomment the cell below.


In [None]:
# from src.classification.model import predict
# df_ru_clean = pd.read_csv(CYRILLIC_CSV)
# predictions = predict(df_ru_clean['text_cleaned'].tolist())

## 10. BERTopic clustering (TODO)

Addresses H1 (technique distribution) and H4 (narrative drift over time).
To be implemented in `src/clustering/bertopic_pipeline.py`.


In [None]:
# from src.clustering.bertopic_pipeline import run_bertopic
# topics, topic_model = run_bertopic(df_ru['text_cleaned'].tolist())

## 11. Interrupted time series — Jan 2026 kompromat event (TODO)

Tests H3: a detectable structural break in technique intensity around Jan 2026.
To be implemented in `src/stats/its.py`.


In [None]:
# from src.stats.its import run_its
# results = run_its(df_raw)
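
A structural break of the kind described in H3 is commonly tested with a segmented regression. The formulation below is offered as an assumption about what `run_its` might estimate, not as the project's specification. Here $Y_t$ is technique intensity in period $t$, $T_0$ is the period of the Jan 2026 event, and $D_t$ is an indicator for post-event periods.

$$
Y_t = \beta_0 + \beta_1 t + \beta_2 D_t + \beta_3 (t - T_0) D_t + \varepsilon_t,
\qquad
D_t = \begin{cases} 1 & t \ge T_0 \\ 0 & t < T_0 \end{cases}
$$

In this model $\beta_1$ is the pre-event trend, $\beta_2$ is the immediate level change at $T_0$, and $\beta_3$ is the change in slope after the event. Evidence for H3 would correspond to $\beta_2$ or $\beta_3$ being distinguishable from zero.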