Rule-based risk triage system for short social-media videos.
This system is a risk triage tool for human reviewers — it does NOT make legal verdicts and does NOT claim any content is definitively illegal.
Detect risk signals related to:
- Illegal gambling / online casino / betting advertisements
- Financial pyramid schemes
- Investment scams
- Suspicious conversion funnels (Telegram, WhatsApp, QR codes, link-in-bio)
Output is an explainable risk score and evidence timeline intended for human review.
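Install the dependencies:

    pip install -r requirements.txt

The core pipeline in v0.1 is pure Python with no external ML dependencies. Real OCR and real ASR are optional extras, described below.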
Basic usage with manually supplied OCR, ASR, and caption text:

    python src/predict_video.py \
      --video data/demo/casino_high_risk.mp4 \
      --ocr-text "Бонус 100% за регистрацию. Промокод MEDIA. Депозит от 1000 тенге." \
      --asr-text "Переходи по ссылке в био. Вывод моментальный." \
      --caption "#казино #промокод"

Optional: save output to file:

    python src/predict_video.py --video video.mp4 --ocr-text "..." --output outputs/result.json

Example output:

    {
      "video_id": "casino_high_risk",
      "risk_score": 91,
      "severity": "critical",
      "labels": ["illegal_gambling_ad"],
      "is_hybrid": false,
      "confidence": 0.86,
      "top_signals": [
        {
          "signal_id": "S_GAM_008",
          "category": "illegal_gambling_ad",
          "modality": "ocr",
          "timestamp_start": null,
          "timestamp_end": null,
          "evidence_text": "Бонус 100% за регистрацию",
          "risk_weight": 0.92,
          "reason": "Registration bonus / gambling promotion",
          "language": "ru"
        }
      ],
      "timeline_events": [
        {
          "event_type": "signal_detected",
          "category": "illegal_gambling_ad",
          "modality": "ocr",
          "description": "Registration bonus / gambling promotion",
          "timestamp": null
        }
      ],
      "explanation": "Видео содержит признаки риска (незаконная реклама азартных игр). Уровень риска: критический. Рекомендуется ручная проверка перед принятием решения. Данный анализ не является юридическим заключением.",
      "recommended_action": "send_to_human_review",
      "processing_time_ms": 12,
      "model_version": "mediawatch-core-v0.1"
    }

Run the tests:

    pytest tests/ -v

Risk score bands and recommended actions:

| Score | Severity | Action |
|---|---|---|
| 0–24 | low | no_action |
| 25–49 | medium | needs_manual_check |
| 50–74 | high | send_to_human_review |
| 75–100 | critical | send_to_human_review |
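
The score bands in the table above can be expressed as a small lookup. This is a minimal sketch, not the project's own code: the function name `severity_for` and the `_BANDS` list are hypothetical, and only the thresholds, severity names, and action names are taken from the table.

```python
from typing import Tuple

# (lower bound, severity, action), highest band first.
# Values are copied from the README table; the structure itself is an assumption.
_BANDS = [
    (75, "critical", "send_to_human_review"),
    (50, "high", "send_to_human_review"),
    (25, "medium", "needs_manual_check"),
    (0, "low", "no_action"),
]


def severity_for(score: int) -> Tuple[str, str]:
    """Return (severity, action) for a risk score in the range 0..100."""
    if not 0 <= score <= 100:
        raise ValueError(f"risk score out of range: {score}")
    for lower, severity, action in _BANDS:
        if score >= lower:
            return severity, action
    raise AssertionError("unreachable: the last band starts at 0")


print(severity_for(91))  # ('critical', 'send_to_human_review')
```

A score of 91, as in the example output, falls in the 75–100 band, which matches the `severity` and `recommended_action` fields shown there.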
Run with real OCR from video frames (requires opencv-python + easyocr):

    python src/predict_video.py \
      --video data/demo/casino_high_risk.mp4 \
      --use-ocr \
      --asr-text "Переходи по ссылке в био" \
      --caption "#казино #бонус"

Manual OCR still works and always overrides `--use-ocr`:

    python src/predict_video.py \
      --video data/demo/casino_high_risk.mp4 \
      --ocr-text "Бонус 100% за регистрацию" \
      --caption "#казино"

OCR flags:
| Flag | Default | Description |
|---|---|---|
| `--use-ocr` | off | Enable real OCR from frames |
| `--ocr-fps` | 1.0 | Frames per second to sample |
| `--ocr-max-frames` | 30 | Max frames to extract |
| `--frames-dir` | `data/interim/frames` | Where sampled frames are saved |

Note: EasyOCR is optional. If it is not installed, OCR is skipped without failing the run and a warning is printed to stderr. Use `--ocr-text` as the manual fallback. EasyOCR downloads ~500 MB of language models on first run.

Install EasyOCR:

    pip install easyocr
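
To illustrate how `--ocr-fps` and `--ocr-max-frames` could interact, here is a minimal sketch of frame selection. It assumes a fixed-stride sampler that takes one frame every `video_fps / ocr_fps` frames and then truncates to the cap; the README does not say how the project actually picks frames, and `sample_frame_indices` is a hypothetical name.

```python
from typing import List


def sample_frame_indices(total_frames: int, video_fps: float,
                         ocr_fps: float = 1.0, max_frames: int = 30) -> List[int]:
    """Pick frame indices at roughly `ocr_fps` frames per second, capped at `max_frames`.

    Defaults mirror the documented flag defaults (1.0 and 30).
    """
    if total_frames <= 0 or video_fps <= 0 or ocr_fps <= 0 or max_frames <= 0:
        return []
    # Stride between sampled frames; never less than 1 so every index is valid.
    step = max(1, round(video_fps / ocr_fps))
    return list(range(0, total_frames, step))[:max_frames]


# A 10-second clip at 30 fps yields one index per second.
print(len(sample_frame_indices(total_frames=300, video_fps=30.0)))   # 10
# A 60-second clip hits the default cap of 30 frames.
print(len(sample_frame_indices(total_frames=1800, video_fps=30.0)))  # 30
```

Under this assumption, text that only appears late in a long clip would be missed once the cap is reached, so raising `--ocr-max-frames` or lowering `--ocr-fps` trades runtime for coverage.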
Run with real ASR from video audio (requires ffmpeg on PATH + faster-whisper):

    python src/predict_video.py \
      --video data/demo/casino_high_risk.mp4 \
      --use-asr \
      --ocr-text "Бонус 100% за регистрацию" \
      --caption "#казино"

Run with both real OCR and real ASR:

    python src/predict_video.py \
      --video data/demo/casino_high_risk.mp4 \
      --use-ocr \
      --use-asr \
      --caption "#казино"

Manual ASR still works and always overrides `--use-asr`:

    python src/predict_video.py \
      --video data/demo/casino_high_risk.mp4 \
      --asr-text "Переходи по ссылке в био"

ASR flags:
| Flag | Default | Description |
|---|---|---|
| `--use-asr` | off | Enable real ASR transcription |
| `--asr-model` | `base` | Whisper model size (tiny/base/small/medium/large) |
| `--asr-device` | `auto` | Inference device (auto/cpu/cuda) |
| `--audio-dir` | `data/interim/audio` | Where extracted WAV files are saved |

Note: `faster-whisper` and `ffmpeg` are both optional. If either is missing, a warning is printed to stderr and the pipeline continues with empty ASR text. Manual `--asr-text` is the safest option for demos without FFmpeg.

Install: `pip install faster-whisper`, and install `ffmpeg` via your OS package manager.
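
The precedence and fallback rules described above (manual text wins, missing dependencies produce a warning and empty text) can be sketched as follows. This is an illustration under assumptions, not the project's code: `resolve_asr_text`, `asr_backend_available`, and the `transcribe` callback are hypothetical names. The only real APIs used are `shutil.which` and `importlib.util.find_spec`.

```python
import importlib.util
import shutil
import sys
from typing import Callable, Optional


def asr_backend_available() -> bool:
    """True only if ffmpeg is on PATH and the faster_whisper package is importable."""
    has_ffmpeg = shutil.which("ffmpeg") is not None
    has_whisper = importlib.util.find_spec("faster_whisper") is not None
    return has_ffmpeg and has_whisper


def resolve_asr_text(manual_text: Optional[str], use_asr: bool,
                     transcribe: Callable[[], str],
                     available: Callable[[], bool] = asr_backend_available) -> str:
    """Decide which ASR text the pipeline should use."""
    if manual_text:
        # Manual --asr-text always overrides --use-asr.
        return manual_text
    if not use_asr:
        return ""
    if not available():
        # Optional dependency missing: warn and continue with empty ASR text.
        print("warning: ASR backend unavailable, continuing with empty ASR text",
              file=sys.stderr)
        return ""
    return transcribe()


# Manual text is returned even though real ASR was also requested.
print(resolve_asr_text("Переходи по ссылке в био", True, lambda: "ignored"))
```

Passing the availability check in as a parameter keeps the fallback path easy to exercise in tests without uninstalling anything.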
The system includes a lightweight context guard (src/context_guard.py) to reduce
false positives on educational, warning, news, or debunking content.
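For example, this warning should not be treated as a gambling ad:
"Не верьте онлайн казино. Это мошенничество." (English: "Do not trust online casinos. It is a scam.")
But this promotional ad should still be flagged:
"Бонус за регистрацию. Промокод. Депозит. Ссылка в био." (English: "Registration bonus. Promo code. Deposit. Link in bio.")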
How it works:
- `analyze_context(ocr_text, asr_text, caption)` scores warning and promotional phrase density independently.
- `fusion_scorer.score(signals, context=context)` applies a suppression factor (0.15×) when content is warning-only.
- Mixed cases (warning + strong promotional): strong promotional intent overrides suppression.
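
The steps above can be sketched in a few lines. This is a simplified illustration, not the contents of `src/context_guard.py`: the phrase lists are hypothetical, plain hit counts stand in for phrase density, and any promotional hit is treated as strong enough to cancel suppression. Only the 0.15 factor and the warning-only rule come from the description above.

```python
from dataclasses import dataclass

# Hypothetical phrase lists for illustration; the real vocabulary is not shown here.
WARNING_PHRASES = ("не верьте", "мошенничество", "осторожно")
PROMO_PHRASES = ("бонус", "промокод", "депозит", "ссылка в био")

SUPPRESSION_FACTOR = 0.15  # documented factor for warning-only content


@dataclass
class ContextSketch:
    warning_hits: int
    promo_hits: int

    @property
    def warning_only(self) -> bool:
        # Suppress only when warnings are present and no promotional phrase is.
        return self.warning_hits > 0 and self.promo_hits == 0


def analyze_context_sketch(ocr_text: str, asr_text: str, caption: str) -> ContextSketch:
    """Count warning and promotional phrases independently (substring matching)."""
    text = " ".join((ocr_text, asr_text, caption)).lower()
    return ContextSketch(
        warning_hits=sum(phrase in text for phrase in WARNING_PHRASES),
        promo_hits=sum(phrase in text for phrase in PROMO_PHRASES),
    )


def apply_suppression(raw_score: float, context: ContextSketch) -> float:
    """Scale the score down for warning-only content; leave it unchanged otherwise."""
    return raw_score * SUPPRESSION_FACTOR if context.warning_only else raw_score


warning = analyze_context_sketch("Не верьте онлайн казино. Это мошенничество.", "", "")
promo = analyze_context_sketch("Бонус за регистрацию. Промокод. Депозит. Ссылка в био.", "", "")
print(warning.warning_only, promo.warning_only)  # True False
```

Because the matching is by substring, a phrase such as "это не мошенничество" would still count as a warning hit in this sketch.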
Limitations:
- This is a heuristic, not perfect intent detection.
- Ambiguous cases are routed to human review (`needs_review`).
- Strong promotional signals (bonus, deposit, promo code, registration link) always override warning context.
- No negation understanding: phrase matching only.
Run evaluation on the demo annotation set:

    python src/evaluate.py \
      --annotations data/annotations/demo_labels.jsonl \
      --output-dir outputs/demo_results \
      --report reports/evaluation_report.md

This evaluates 8 synthetic demo cases using the manual `ocr_text`, `asr_text`, and `caption` fields from the annotation file, so no real video files are required.
Important: This is a demo-only evaluation, not a production benchmark. All cases are synthetic and purpose-built to exercise the signal vocabulary.
Optional flags:
| Flag | Description |
|---|---|
| `--limit N` | Evaluate only the first N cases |
| `--use-ocr` | (Future) run real OCR per case |
| `--use-asr` | (Future) run real ASR per case |
Per-case prediction JSON files are saved to outputs/demo_results/{video_id}.json.
The full Markdown report is saved to reports/evaluation_report.md.
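
For a quick look at the per-case files without opening the report, the prediction JSONs can be tallied by severity. This is a minimal sketch: `summarize_predictions` is a hypothetical helper, and it assumes each file carries the `severity` field shown in the example output.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_predictions(results_dir: str) -> Counter:
    """Count prediction files in `results_dir` by their `severity` field."""
    counts: Counter = Counter()
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            prediction = json.load(fh)
        # Files without a severity field are grouped under "unknown".
        counts[prediction.get("severity", "unknown")] += 1
    return counts
```

Calling `summarize_predictions("outputs/demo_results")` after an evaluation run returns a `Counter` keyed by severity; a missing or empty directory simply yields an empty result.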

Further documentation:
- MODEL_CARD.md: model overview, intended use, evaluation, limitations, and ethical boundaries.
- DEMO_SCRIPT.md: step-by-step demo guide with PowerShell commands and Q&A defense answers.

Known limitations:
- No real QR code detection; QR keywords are matched in text only
- Rule-based matching only (no ML model yet)
- Russian, Kazakh, and English only
- No database or persistence layer

Planned work:
- QR detector: OpenCV + pyzbar for QR code detection in frames
- Visual classifier: fine-tune an image classifier on gambling/scam screenshots
- Evaluation: build an annotated test dataset and measure precision/recall
- API layer: FastAPI wrapper for batch processing
- Human review UI: case queue with evidence highlighting