cheatplayer-code/AFM_AI_Media

AI Media Watch ML-Core

Rule-based risk triage system for short social-media videos.

This system is a risk triage tool for human reviewers — it does NOT make legal verdicts and does NOT claim any content is definitively illegal.


Purpose

Detect risk signals related to:

  • Illegal gambling / online casino / betting advertisements
  • Financial pyramid schemes
  • Investment scams
  • Suspicious conversion funnels (Telegram, WhatsApp, QR codes, link-in-bio)

Output is an explainable risk score and evidence timeline intended for human review.


Install

pip install -r requirements.txt

v0.1 is pure Python with no external ML dependencies. The optional real OCR (v0.2) and real ASR (v0.3) features described below need extra packages.


Run

python src/predict_video.py \
  --video data/demo/casino_high_risk.mp4 \
  --ocr-text "Бонус 100% за регистрацию. Промокод MEDIA. Депозит от 1000 тенге." \
  --asr-text "Переходи по ссылке в био. Вывод моментальный." \
  --caption "#казино #промокод"

Optionally, save the output to a file:

python src/predict_video.py --video video.mp4 --ocr-text "..." --output outputs/result.json

Example Output

{
  "video_id": "casino_high_risk",
  "risk_score": 91,
  "severity": "critical",
  "labels": ["illegal_gambling_ad"],
  "is_hybrid": false,
  "confidence": 0.86,
  "top_signals": [
    {
      "signal_id": "S_GAM_008",
      "category": "illegal_gambling_ad",
      "modality": "ocr",
      "timestamp_start": null,
      "timestamp_end": null,
      "evidence_text": "Бонус 100% за регистрацию",
      "risk_weight": 0.92,
      "reason": "Registration bonus / gambling promotion",
      "language": "ru"
    }
  ],
  "timeline_events": [
    {
      "event_type": "signal_detected",
      "category": "illegal_gambling_ad",
      "modality": "ocr",
      "description": "Registration bonus / gambling promotion",
      "timestamp": null
    }
  ],
  "explanation": "Видео содержит признаки риска (незаконная реклама азартных игр). Уровень риска: критический. Рекомендуется ручная проверка перед принятием решения. Данный анализ не является юридическим заключением.",
  "recommended_action": "send_to_human_review",
  "processing_time_ms": 12,
  "model_version": "mediawatch-core-v0.1"
}
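
A minimal sketch of how a downstream script might consume this output. The summarize helper is hypothetical and not part of the repo; the field names are taken from the example above, and the JSON is a trimmed copy of it (in practice it would be read from the file passed to --output).

```python
import json

# Trimmed copy of the example output; normally loaded from the --output file.
result = json.loads("""
{
  "video_id": "casino_high_risk",
  "risk_score": 91,
  "severity": "critical",
  "recommended_action": "send_to_human_review",
  "top_signals": [
    {"signal_id": "S_GAM_008", "modality": "ocr", "risk_weight": 0.92,
     "reason": "Registration bonus / gambling promotion"}
  ]
}
""")


def summarize(result: dict) -> str:
    """Build a one-line summary for a reviewer queue (hypothetical helper)."""
    # Sort signals so the highest risk_weight comes first.
    signals = sorted(result["top_signals"],
                     key=lambda s: s["risk_weight"], reverse=True)
    strongest = signals[0]["signal_id"] if signals else "none"
    return (f'{result["video_id"]}: {result["severity"]} '
            f'({result["risk_score"]}/100), strongest signal {strongest}, '
            f'action={result["recommended_action"]}')


print(summarize(result))
```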

Run Tests

pytest tests/ -v

Severity Levels

Score    Severity   Action
0–24     low        no_action
25–49    medium     needs_manual_check
50–74    high       send_to_human_review
75–100   critical   send_to_human_review
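
The bands above can be expressed as a small lookup. This is a sketch, not the repo's code: severity_for and SEVERITY_BANDS are hypothetical names, and only the thresholds and action strings come from the table.

```python
from typing import Tuple

# Bands copied from the severity table: (inclusive upper bound, severity, action).
SEVERITY_BANDS = [
    (24, "low", "no_action"),
    (49, "medium", "needs_manual_check"),
    (74, "high", "send_to_human_review"),
    (100, "critical", "send_to_human_review"),
]


def severity_for(score: int) -> Tuple[str, str]:
    """Return (severity, recommended_action) for a 0-100 risk score."""
    if not 0 <= score <= 100:
        raise ValueError(f"risk score out of range: {score}")
    for upper, severity, action in SEVERITY_BANDS:
        if score <= upper:
            return severity, action
    raise AssertionError("unreachable: bands cover 0-100")


print(severity_for(91))  # ('critical', 'send_to_human_review')
```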

Real OCR (v0.2)

Run with real OCR from video frames (requires opencv-python + easyocr):

python src/predict_video.py \
  --video data/demo/casino_high_risk.mp4 \
  --use-ocr \
  --asr-text "Переходи по ссылке в био" \
  --caption "#казино #бонус"

Manual OCR text (--ocr-text) still works and, when provided, always overrides --use-ocr:

python src/predict_video.py \
  --video data/demo/casino_high_risk.mp4 \
  --ocr-text "Бонус 100% за регистрацию" \
  --caption "#казино"

OCR flags:

Flag               Default               Description
--use-ocr          off                   Enable real OCR from frames
--ocr-fps          1.0                   Frames per second to sample
--ocr-max-frames   30                    Max frames to extract
--frames-dir       data/interim/frames   Where sampled frames are saved
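
To get a feel for how --ocr-fps and --ocr-max-frames interact, here is a rough sketch of the sampling arithmetic. It assumes frames are taken every 1/ocr_fps seconds starting at t=0 and that the list is cut off at the frame cap; the actual sampler in the repo may behave differently, and sample_timestamps is a hypothetical name.

```python
import math
from typing import List


def sample_timestamps(duration_s: float, ocr_fps: float = 1.0,
                      max_frames: int = 30) -> List[float]:
    """Timestamps (in seconds) at which frames would be sampled.

    Assumption: one frame every 1/ocr_fps seconds from t=0, strictly before
    the end of the video, truncated to at most max_frames entries.
    """
    if ocr_fps <= 0 or max_frames <= 0:
        raise ValueError("ocr_fps and max_frames must be positive")
    step = 1.0 / ocr_fps
    # Number of sample points that fit before the video ends, then capped.
    count = min(max_frames, max(0, math.ceil(duration_s * ocr_fps)))
    return [round(i * step, 3) for i in range(count)]


# With the defaults, a 45-second clip is only covered up to t=29s.
stamps = sample_timestamps(45.0)
print(len(stamps), stamps[-1])
```

Under these assumptions, the defaults cover only the first 30 seconds of a longer clip, so lowering --ocr-fps or raising --ocr-max-frames may be needed for long videos.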

Note: EasyOCR is optional. If it is not installed, OCR is skipped and a warning is printed to stderr. Use --ocr-text as the manual fallback. EasyOCR downloads ~500 MB of language models on first run.
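
The precedence and fallback behaviour described above could look roughly like this. It is a sketch under assumptions: resolve_ocr_text and its run_ocr callback are hypothetical, and the repo's actual control flow and warning text may differ.

```python
import importlib.util
import sys
from typing import Callable, Optional


def resolve_ocr_text(manual_text: Optional[str], use_ocr: bool,
                     run_ocr: Callable[[], str]) -> str:
    """Pick the OCR text for one run (hypothetical helper, not the repo's code)."""
    if manual_text:
        # --ocr-text always wins, even when --use-ocr is also set.
        return manual_text
    if not use_ocr:
        return ""
    if importlib.util.find_spec("easyocr") is None:
        # Optional dependency missing: warn on stderr and continue without OCR.
        print("warning: easyocr is not installed, skipping OCR", file=sys.stderr)
        return ""
    return run_ocr()


# Manual text overrides real OCR; run_ocr is never called here.
print(resolve_ocr_text("Бонус 100% за регистрацию", True, lambda: "unused"))
```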

Install EasyOCR: pip install easyocr


Real ASR (v0.3)

Run with real ASR from video audio (requires ffmpeg on PATH + faster-whisper):

python src/predict_video.py \
  --video data/demo/casino_high_risk.mp4 \
  --use-asr \
  --ocr-text "Бонус 100% за регистрацию" \
  --caption "#казино"

Run with both real OCR and real ASR:

python src/predict_video.py \
  --video data/demo/casino_high_risk.mp4 \
  --use-ocr \
  --use-asr \
  --caption "#казино"

Manual ASR text (--asr-text) still works and, when provided, always overrides --use-asr:

python src/predict_video.py \
  --video data/demo/casino_high_risk.mp4 \
  --asr-text "Переходи по ссылке в био"

ASR flags:

Flag           Default              Description
--use-asr      off                  Enable real ASR transcription
--asr-model    base                 Whisper model size (tiny/base/small/medium/large)
--asr-device   auto                 Inference device (auto/cpu/cuda)
--audio-dir    data/interim/audio   Where extracted WAV files are saved

Note: faster-whisper and ffmpeg are both optional. If either is missing, a warning is printed to stderr and the pipeline continues with empty ASR text. For demos on machines without ffmpeg, passing --asr-text manually is the safest option.
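
Before a demo it can help to check which ASR prerequisites are present. The snippet below is one way to do that with the standard library only; missing_asr_dependencies is a hypothetical helper and is not how the repo itself necessarily performs the check.

```python
import importlib.util
import shutil
from typing import List


def missing_asr_dependencies() -> List[str]:
    """List the ASR prerequisites that cannot be found on this machine."""
    missing = []
    if shutil.which("ffmpeg") is None:
        # ffmpeg must be discoverable on PATH for audio extraction.
        missing.append("ffmpeg")
    if importlib.util.find_spec("faster_whisper") is None:
        # The pip package faster-whisper is imported as faster_whisper.
        missing.append("faster-whisper")
    return missing


missing = missing_asr_dependencies()
if missing:
    print("ASR unavailable, missing:", ", ".join(missing))
else:
    print("ASR prerequisites found")
```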

Install: run pip install faster-whisper, and install ffmpeg separately via your OS package manager.


Context Guard (v0.5)

The system includes a lightweight context guard (src/context_guard.py) to reduce false positives on educational, warning, news, or debunking content.

Example — this warning should not be treated as a gambling ad:

"Не верьте онлайн казино. Это мошенничество." ("Don't trust online casinos. It's a scam.")

But this promotional ad should still be flagged:

"Бонус за регистрацию. Промокод. Депозит. Ссылка в био." ("Registration bonus. Promo code. Deposit. Link in bio.")

How it works:

  1. analyze_context(ocr_text, asr_text, caption) scores warning and promotional phrase density independently.
  2. fusion_scorer.score(signals, context=context) applies a suppression factor (0.15×) when content is warning-only.
  3. Mixed cases (warning + strong promotional): strong promotional intent overrides suppression.
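
The three steps above can be sketched as follows. This is a simplified illustration, not the code in src/context_guard.py: the phrase lists are invented for the example, "density" is reduced to a plain hit count, and apply_context stands in for the context handling inside fusion_scorer.score. Only the 0.15 factor and the analyze_context argument names come from the description.

```python
from dataclasses import dataclass

# Illustrative phrase lists only; the real vocabulary lives in the repo.
WARNING_PHRASES = ("не верьте", "мошенничество", "осторожно", "scam")
PROMO_PHRASES = ("бонус", "промокод", "депозит", "ссылка в био")

SUPPRESSION_FACTOR = 0.15  # from step 2


@dataclass
class Context:
    warning_hits: int
    promo_hits: int

    @property
    def warning_only(self) -> bool:
        # Any promotional hit disables suppression (simplified step 3).
        return self.warning_hits > 0 and self.promo_hits == 0


def analyze_context(ocr_text: str, asr_text: str, caption: str) -> Context:
    """Count warning and promotional phrases independently (step 1)."""
    text = " ".join((ocr_text, asr_text, caption)).lower()
    return Context(
        warning_hits=sum(p in text for p in WARNING_PHRASES),
        promo_hits=sum(p in text for p in PROMO_PHRASES),
    )


def apply_context(raw_score: float, context: Context) -> float:
    """Scale the score down only for warning-only content (step 2)."""
    if context.warning_only:
        return raw_score * SUPPRESSION_FACTOR
    return raw_score


warning = analyze_context("Не верьте онлайн казино. Это мошенничество.", "", "")
promo = analyze_context("Бонус за регистрацию. Промокод. Депозит. Ссылка в био.", "", "")
print(warning.warning_only, promo.warning_only)  # True False
```

Because this is substring matching, a phrase like "не бонус" would still count as a promotional hit, which is the negation limitation noted below.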

Limitations:

  • This is a heuristic, not perfect intent detection.
  • Ambiguous cases are routed to human review (needs_review).
  • Strong promotional signals (bonus, deposit, promo code, registration link) always override warning context.
  • No negation understanding: phrase matching only.

Demo Evaluation (v0.4)

Run evaluation on the demo annotation set:

python src/evaluate.py \
  --annotations data/annotations/demo_labels.jsonl \
  --output-dir outputs/demo_results \
  --report reports/evaluation_report.md

This evaluates 8 synthetic demo cases using manual ocr_text/asr_text/caption fields from the annotation file — no real video files required.

Important: This is a demo-only evaluation, not a production benchmark. All cases are synthetic and purpose-built to exercise the signal vocabulary.

Optional flags:

Flag        Description
--limit N   Evaluate only the first N cases
--use-ocr   (Future) run real OCR per case
--use-asr   (Future) run real ASR per case

Per-case prediction JSON files are saved to outputs/demo_results/{video_id}.json. The full Markdown report is saved to reports/evaluation_report.md.
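
For a quick look at the per-case files without opening the Markdown report, something like the following could be used. It assumes each {video_id}.json has the same shape as the Example Output (in particular a "severity" field); tally_severity is a hypothetical helper, not part of src/evaluate.py.

```python
import json
from collections import Counter
from pathlib import Path


def tally_severity(results_dir: str) -> Counter:
    """Count predictions per severity across the per-case JSON files."""
    counts: Counter = Counter()
    for path in sorted(Path(results_dir).glob("*.json")):
        with path.open(encoding="utf-8") as fh:
            counts[json.load(fh)["severity"]] += 1
    return counts


if __name__ == "__main__":
    # Prints nothing if the directory does not exist or is empty.
    for severity, n in tally_severity("outputs/demo_results").most_common():
        print(f"{severity}: {n}")
```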


Documentation

  • MODEL_CARD.md — model overview, intended use, evaluation, limitations, and ethical boundaries.
  • DEMO_SCRIPT.md — step-by-step demo guide with PowerShell commands and Q&A defense answers.

Current Limitations

  • No real QR code detection — QR keywords in text only
  • Rule-based matching only (no ML model yet)
  • Russian/Kazakh/English only
  • No database or persistence layer

Next Steps

  1. QR Detector — OpenCV + pyzbar for QR code detection in frames
  2. Visual classifier — fine-tune image classifier on gambling/scam screenshots
  3. Evaluation — build annotated test dataset and measure precision/recall
  4. API layer — FastAPI wrapper for batch processing
  5. Human review UI — case queue with evidence highlighting
