Draft
33 commits
d8de3b2
Add YouTube parser for comments and transcripts
claude Apr 28, 2026
8db2102
Add Streamlit UI for the YouTube parser
claude Apr 28, 2026
47fa108
Added Dev Container Folder
codeby Apr 28, 2026
558ddba
Read API key from st.secrets on Streamlit Cloud
claude Apr 28, 2026
35f8653
Translate UI to Russian and persist the API key locally
claude Apr 28, 2026
22ece4d
Mirror saved API key into .streamlit/secrets.toml
claude Apr 28, 2026
0148c52
Fix transcript fetching for youtube-transcript-api 1.x
claude Apr 28, 2026
7cc003e
Support proxy for transcript fetching on Streamlit Cloud
claude Apr 28, 2026
80f71ef
Introduce content_parser package and move YouTube into a plugin
claude Apr 29, 2026
147ea9b
Add unified CLI and dynamic Streamlit UI on top of plugin contract
claude Apr 29, 2026
d44d33d
Add Instagram plugin via Apify's instagram-scraper
claude Apr 29, 2026
601eca8
Harden core: registry diagnostics, runner finally, TOML escaping
claude Apr 29, 2026
b722714
Tighten Instagram plugin: per-kind resultsType, input validation, hea…
claude Apr 29, 2026
cb7de57
Wrap legacy CLI, defensive UI fallback, gitignore cleanup, tests
claude Apr 29, 2026
e51d3e2
Add Reddit plugin via PRAW (read-only auth)
claude Apr 29, 2026
669200a
Address security review findings
claude Apr 29, 2026
b56b239
Address follow-up review: stem collisions, fragment redaction, UA che…
claude Apr 29, 2026
a1c85d1
Add VK plugin: community search, walls, comments
claude Apr 29, 2026
dec5acc
VK plugin review fixes: cap correctness, retry, Session, defensive ad…
claude Apr 29, 2026
deb4590
Add Telegram plugin via Apify (public channels and posts)
claude Apr 29, 2026
9c16d58
Telegram plugin review fixes: actor_id, single-pass dedupe, replies-i…
claude Apr 29, 2026
97e8083
Add Google Sheets loader for plugin inputs
claude Apr 29, 2026
716bea1
Sheets loader review fixes: strict host, validate-before-save, UI polish
claude Apr 29, 2026
ab0525d
Add jobs core: YAML schema, filesystem store, run_job
claude Apr 29, 2026
5332195
Add jobs/cron.py + 'jobs' CLI subcommand
claude Apr 29, 2026
b10f696
Add Schedule panel to Streamlit UI
claude Apr 29, 2026
ee7ab9c
Stage C review fixes: input typing, output_dir guard, newline guard, …
claude Apr 29, 2026
3ed36b9
Add Whisper transcription via OpenAI API for video plugins
claude Apr 29, 2026
3118096
Whisper review fixes: SSRF guard, unknown-duration block, retry, vers…
claude Apr 29, 2026
2d30df6
Project-wide cleanup: CI, shared redact_spec, ApifyClient extraction
claude Apr 29, 2026
744ba79
Add Instagram Graph API plugin for owned business/creator accounts
claude Apr 29, 2026
4824d8c
Instagram Graph review fixes: token redaction, precedence, no-mutation
claude Apr 29, 2026
a7c524d
Deep-review fixes: CSV injection, trace redaction, replies cap, atomi…
claude Apr 30, 2026
33 changes: 33 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,33 @@
{
  "name": "Python 3",
  // Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
  "image": "mcr.microsoft.com/devcontainers/python:1-3.11-bookworm",
  "customizations": {
    "codespaces": {
      "openFiles": [
        "README.md",
        "app.py"
      ]
    },
    "vscode": {
      "settings": {},
      "extensions": [
        "ms-python.python",
        "ms-python.vscode-pylance"
      ]
    }
  },
  "updateContentCommand": "[ -f packages.txt ] && sudo apt update && sudo apt upgrade -y && sudo xargs apt install -y <packages.txt; [ -f requirements.txt ] && pip3 install --user -r requirements.txt; pip3 install --user streamlit; echo '✅ Packages installed and Requirements met'",
  "postAttachCommand": {
    "server": "streamlit run app.py --server.enableCORS false --server.enableXsrfProtection false"
  },
  "portsAttributes": {
    "8501": {
      "label": "Application",
      "onAutoForward": "openPreview"
    }
  },
  "forwardPorts": [
    8501
  ]
}
33 changes: 33 additions & 0 deletions .github/workflows/tests.yml
@@ -0,0 +1,33 @@
name: tests

on:
  push:
  pull_request:
    branches: [main]

jobs:
  unittest:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.11", "3.12"]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: pip

      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt

      - name: Run unit tests
        run: python -m unittest discover -s tests -v

      - name: Smoke-check CLI loads all plugins
        run: python -m content_parser.cli list-sources
10 changes: 10 additions & 0 deletions .gitignore
@@ -0,0 +1,10 @@
__pycache__/
*.pyc
.venv/
venv/
.env
.streamlit/secrets.toml
output/
.youtube_parser_config.json
.content_parser/
.pytest_cache/
21 changes: 20 additions & 1 deletion README.md
@@ -1,2 +1,21 @@
# claude
Claude's repository

A content parser for research: YouTube, Instagram (via Apify), Reddit (via PRAW).
Streamlit UI plus a CLI; results are written as JSON / Markdown / CSV.

## Sharing scraped results — security note

The `output/` folder contains **raw comments** from public APIs. Comment text is
written to Markdown without escaping. This is deliberate, to keep links and
formulas readable, but it has consequences:

- **Markdown injection.** An attacker can leave a comment under a video or post
  such as `[click here](javascript:alert(1))`, or one containing arbitrary HTML. In most
  Markdown viewers this renders as a clickable link or is executed as code.
- **Do not publish `output/` directly** to GitHub Pages, Notion, or chats that
  render Markdown without cleaning it first. Files under `output/` are already
  covered by `.gitignore` to prevent an accidental commit.
- For safe publishing, export to plain `.txt`/`.csv`, or pass the Markdown
  through a sanitizer (for example, `bleach`).

JSON files are safe (they contain no executable content).
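
One way to clean comment text before publishing, as the last bullet suggests, can be sketched with the standard library alone. This is only an illustration and not part of the project: the helper name `escape_markdown` and the set of escaped characters are assumptions made for this example, and it is not a substitute for a maintained sanitizer such as `bleach`. It HTML-escapes `<`, `>` and `&`, then backslash-escapes Markdown punctuation so the comment renders as literal text.

```python
import html
import re

# Markdown punctuation to neutralize. This set is an assumption for the sketch;
# CommonMark allows any ASCII punctuation character to be backslash-escaped.
_MD_SPECIAL = re.compile(r"([\\`*_{}\[\]()#+\-.!|>~])")


def escape_markdown(text: str) -> str:
    """Return text that should render literally in a Markdown viewer (hypothetical helper)."""
    # Step 1: turn raw HTML into inert entities (&lt;b&gt; instead of <b>).
    escaped = html.escape(text, quote=False)
    # Step 2: prefix each Markdown-significant character with a backslash,
    # which breaks link syntax such as [label](javascript:...).
    return _MD_SPECIAL.sub(r"\\\1", escaped)


comment = "[click here](javascript:alert(1)) <b>hi</b>"
safe = escape_markdown(comment)
print(safe)
```

Escaping everything gives up the readable links and formulas that the raw output keeps on purpose, so a sketch like this would fit a published copy rather than the files in `output/` themselves.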
4 changes: 4 additions & 0 deletions app.py
@@ -0,0 +1,4 @@
"""Streamlit entry point — calls into content_parser.ui.app.main()."""
from content_parser.ui.app import main

main()
Empty file added content_parser/__init__.py
Empty file.
237 changes: 237 additions & 0 deletions content_parser/cli.py
@@ -0,0 +1,237 @@
"""Unified CLI: python -m content_parser.cli {run,list-sources}."""
from __future__ import annotations

import argparse
import sys
from pathlib import Path

from .core.registry import all_plugins, get_plugin
from .core.runner import run
from .core.secrets import get_secret


def _build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="content_parser")
    sub = p.add_subparsers(dest="command", required=True)

    sub.add_parser("list-sources", help="Show registered source plugins")

    # ----- jobs subcommand -----
    jobs_p = sub.add_parser("jobs", help="Manage scheduled jobs")
    jobs_sub = jobs_p.add_subparsers(dest="jobs_command", required=True)
    jobs_sub.add_parser("list", help="List all saved jobs")
    show_p = jobs_sub.add_parser("show", help="Print a job's YAML")
    show_p.add_argument("name")
    run_job_p = jobs_sub.add_parser("run", help="Run a job once")
    run_job_p.add_argument("name")
    jobs_sub.add_parser("install-cron", help="Regenerate the managed crontab block")
    jobs_sub.add_parser("remove-cron", help="Remove the managed crontab block")
    jobs_sub.add_parser("cron-status", help="Show what's currently in the managed block")

    run_p = sub.add_parser("run", help="Resolve inputs and fetch items for one source")
    run_p.add_argument("--source", required=True, help="Plugin name (e.g. youtube, instagram)")
    run_p.add_argument("--output", "-o", default=None, help="Output directory")

    # Generic input flags — repeatable. Plugin decides which kinds it understands.
    run_p.add_argument(
        "--input", "-i", action="append", default=[],
        metavar="KIND=VALUE",
        help='Input as "kind=value" (e.g. --input video=https://youtu.be/x). Repeatable.',
    )
    # Convenience aliases
    run_p.add_argument("--query", "-q", action="append", default=[])
    run_p.add_argument("--channel", "-c", action="append", default=[])
    run_p.add_argument("--playlist", "-p", action="append", default=[])
    run_p.add_argument("--video", "-v", action="append", default=[])
    run_p.add_argument("--hashtag", action="append", default=[])
    run_p.add_argument("--account", action="append", default=[])
    run_p.add_argument("--post", action="append", default=[])

    # Plugin settings as key=value, repeatable
    run_p.add_argument(
        "--set", action="append", default=[],
        metavar="KEY=VALUE",
        help='Override a plugin setting (e.g. --set max_comments=100). Repeatable.',
    )
    return p


def _parse_kv(items: list[str]) -> dict[str, str]:
    out: dict[str, str] = {}
    for s in items:
        if "=" not in s:
            raise SystemExit(f"Expected KEY=VALUE, got {s!r}")
        k, v = s.split("=", 1)
        out[k.strip()] = v.strip()
    return out


def _coerce(value: str):
    low = value.lower()
    if low in ("true", "yes", "on"):
        return True
    if low in ("false", "no", "off"):
        return False
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return float(value)
    except ValueError:
        pass
    return value


def cmd_list_sources() -> int:
    for p in all_plugins():
        kinds = ", ".join(s.kind for s in p.input_specs())
        print(f"{p.name:12s} {p.label:20s} inputs=[{kinds}] secrets={p.secret_keys}")
    return 0


def cmd_run(args: argparse.Namespace) -> int:
    plugin = get_plugin(args.source)

    inputs: dict[str, list[str]] = {s.kind: [] for s in plugin.input_specs()}

    # Aliases → inputs
    for alias_attr, kind in [
        ("query", "query"), ("channel", "channel"), ("playlist", "playlist"),
        ("video", "video"), ("hashtag", "hashtag"), ("account", "account"),
        ("post", "post"),
    ]:
        for v in getattr(args, alias_attr, []):
            inputs.setdefault(kind, []).append(v)

    # Generic --input KIND=VALUE
    for raw in args.input:
        if "=" not in raw:
            raise SystemExit(f"--input expects KIND=VALUE, got {raw!r}")
        kind, value = raw.split("=", 1)
        inputs.setdefault(kind.strip(), []).append(value.strip())

    # Drop empty kinds
    inputs = {k: v for k, v in inputs.items() if v}

    if not inputs:
        accepted = ", ".join(s.kind for s in plugin.input_specs())
        raise SystemExit(f"No inputs given. Plugin {args.source!r} accepts: {accepted}")

    # Settings
    settings: dict = {s.key: s.default for s in plugin.settings_specs()}
    for k, v in _parse_kv(args.set).items():
        settings[k] = _coerce(v)

    # Secrets
    secrets: dict[str, str] = {k: get_secret(k) for k in plugin.secret_keys}
    # also pull any well-known optional secrets the plugin might use
    for opt in ("WEBSHARE_USERNAME", "WEBSHARE_PASSWORD", "PROXY_HTTP_URL", "PROXY_HTTPS_URL"):
        v = get_secret(opt)
        if v:
            secrets[opt] = v

    out_dir = Path(args.output) if args.output else None

    def log(msg: str) -> None:
        print(msg)

    def progress(done: int, total: int, message: str) -> None:
        print(f" [{done}/{total}] {message}")

    result = run(plugin, inputs, settings, secrets, output_dir=out_dir, log=log, progress=progress)
    print(f"\nDone. {len(result.items)} item(s) saved to {result.out_dir.resolve()}")
    return 0


def cmd_jobs(args: argparse.Namespace) -> int:
    from .jobs import store as jobs_store  # noqa: PLC0415
    from .jobs.runner import run_job  # noqa: PLC0415
    from .jobs.schema import dump_job_yaml  # noqa: PLC0415

    if args.jobs_command == "list":
        jobs = jobs_store.list_jobs()
        if not jobs:
            print("No jobs found in", jobs_store.JOBS_DIR)
            return 0
        for job in jobs:
            schedule = job.schedule or "(manual)"
            inputs_summary = ", ".join(f"{k}={len(v)}" for k, v in job.inputs.items()) or "—"
            sheet_count = len(job.sheet_inputs)
            print(
                f"{job.name:30s} source={job.source:10s} schedule={schedule:20s} "
                f"inline=[{inputs_summary}] sheet_refs={sheet_count}"
            )
        invalid = jobs_store.list_invalid()
        if invalid:
            print()
            print("Invalid job files:")
            for name, err in invalid:
                print(f" {name}: {err}")
        return 0

    if args.jobs_command == "show":
        job = jobs_store.load_job(args.name)
        print(dump_job_yaml(job))
        return 0

    if args.jobs_command == "run":
        from .core.errors import AuthError, PluginError  # noqa: PLC0415
        try:
            result = run_job(
                args.name, log=print,
                progress=lambda d, t, m: print(f" [{d}/{t}] {m}"),
            )
        except (AuthError, PluginError) as e:
            print(f"Error: {e}", file=sys.stderr)
            return 1
        except KeyError as e:
            # get_plugin raises KeyError for unknown source.
            print(f"Error: unknown plugin/source — {e}", file=sys.stderr)
            return 1
        print(f"\nDone. {len(result.items)} item(s) saved to {result.out_dir.resolve()}")
        return 0

    if args.jobs_command == "install-cron":
        from .jobs.cron import install_cron  # noqa: PLC0415
        entries = install_cron()
        if not entries:
            print("No scheduled jobs found. Managed block cleared.")
            return 0
        print(f"Installed {len(entries)} entry(ies) in crontab:")
        for e in entries:
            print(f" {e.schedule} {e.job_name}")
        return 0

    if args.jobs_command == "remove-cron":
        from .jobs.cron import remove_cron  # noqa: PLC0415
        removed = remove_cron()
        print("Removed managed block." if removed else "Managed block not present.")
        return 0

    if args.jobs_command == "cron-status":
        from .jobs.cron import read_block  # noqa: PLC0415
        entries = read_block()
        if not entries:
            print("Managed block is empty or absent.")
            return 0
        for e in entries:
            print(f"{e.schedule} job:{e.job_name}\n → {e.command}")
        return 0

    return 2


def main(argv: list[str] | None = None) -> int:
    args = _build_parser().parse_args(argv)
    if args.command == "list-sources":
        return cmd_list_sources()
    if args.command == "run":
        return cmd_run(args)
    if args.command == "jobs":
        return cmd_jobs(args)
    return 2


if __name__ == "__main__":
    sys.exit(main())
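
To illustrate how repeated `--set KEY=VALUE` overrides end up as typed settings, here is a self-contained sketch that mirrors the logic of `_parse_kv` and `_coerce` in `cli.py` above. The functions are renamed copies written for this example, and every setting name other than `max_comments` (which appears in the `--set` help text) is made up.

```python
def parse_kv(items: list[str]) -> dict[str, str]:
    """Split each "KEY=VALUE" string on the first "=" only."""
    out: dict[str, str] = {}
    for s in items:
        if "=" not in s:
            raise SystemExit(f"Expected KEY=VALUE, got {s!r}")
        k, v = s.split("=", 1)
        out[k.strip()] = v.strip()
    return out


def coerce(value: str):
    """Try boolean words first, then int, then float; otherwise keep the string."""
    low = value.lower()
    if low in ("true", "yes", "on"):
        return True
    if low in ("false", "no", "off"):
        return False
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


# Hypothetical overrides, as they would arrive from repeated --set flags.
raw = [
    "max_comments=100",
    "with_transcript=no",
    "lang=ru",
    "ratio=0.5",
    "url=https://x.test/?a=b",
]
settings = {k: coerce(v) for k, v in parse_kv(raw).items()}
print(settings)
```

Two consequences of this design are worth keeping in mind: only the first `=` separates key from value, so URLs with query strings survive intact, and any value that looks numeric is converted even when a string was intended (for instance `007` becomes the integer `7`).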