feat: add llms.txt detection and content-focused scoring criteria #58
IISweetHeartII merged 2 commits into develop
Conversation
Navi review: APPROVE.
Note: the current gh auth account is the same as the PR author, so GitHub does not allow a self-approve; the verdict is recorded as a comment instead.

CI failed: auto-label. Fix required.

CI failed: auto-label (failure). Fix required.

CI failed: auto-label. Fix required.

Reviewed deeply. No blocking issues found. The llms.txt, JSON-LD, feed, and semantic HTML changes are internally consistent, test coverage was expanded alongside the new scoring logic, and the config/report updates match the new audit surface.

Deep review is still positive on the code itself, but this PR is now blocked by merge conflicts against develop. I tried updating the branch and GitHub rejected the rebase due to conflicts, so this needs a conflict refresh before it can be merged.

CI failed: auto-label. Fix required.
IISweetHeartII left a comment
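🐱 Navi review: APPROVE, no blocking issues.
All 5 changes for issue #51 are reasonable and test coverage is sufficient.
Rationale:
- llms.txt: binary→numeric scoring, with the quality signals (H1/blockquote/sectioned URLs/length) cleanly broken out
- JSON-LD: handling of @graph plus @type arrays covers the edge cases
- AI crawlers: expanded 4→10, straightforward additions
- ContentFeed: the new audit + gatherer separation is a good structure
- Semantic HTML: <article> added, heading hierarchy validation logic is correct
Non-blocking (3 items):
- ContentFeedAudit.detectBodyFeedLinks performs the same <link rel="alternate"> regex parsing as HtmlGatherer.extractFeedLinks. The gatherer already exposes the result as feedLinks, so re-parsing the body in the audit is unnecessary. Because the results are deduplicated through a Set, behavior is unaffected, but the same logic lives in two places.
- ContentFeed score: min(1, feedCount * 0.5 + 0.5) → a single feed already yields the full score of 1.0, so the score does not differentiate between sites.
- In runner.test.ts, adding the /llms-full.txt 404 condition broke the formatting of the original if ( statement. Behavior is unaffected.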
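The saturation noted in the second non-blocking item can be checked directly. The following is a minimal sketch that only evaluates the formula quoted in the review; the function name contentFeedScore is hypothetical and not taken from the PR, and how the audit treats a site with zero feeds is not shown in this conversation.

```typescript
// Formula quoted in the review: min(1, feedCount * 0.5 + 0.5).
// The function name is an assumption made for illustration only.
function contentFeedScore(feedCount: number): number {
  return Math.min(1, feedCount * 0.5 + 0.5);
}

// Evaluate the formula for one, two, and three detected feeds.
const feedScores = [1, 2, 3].map(contentFeedScore);
console.log(feedScores); // [ 1, 1, 1 ]
```

Every count of one or more feeds maps to the same value, which is the lack of differentiation the review points out.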
Implement all 5 parts of issue #51:

1. llms.txt Detection (High Priority)
- Upgrade LlmsTxtAudit from binary to numeric scoring
- Check for H1 heading, blockquote summary, sectioned URL lists
- Add companion /llms-full.txt bonus detection
- Add llmsFullTxt probe to HttpGatherer

2. JSON-LD Schema Type Analysis (High Priority)
- Upgrade JsonLdAudit from binary to numeric scoring
- Analyze specific schema types (WebSite, BlogPosting, Article, Person, Organization, BreadcrumbList, FAQPage, HowTo, WebPage)
- Handle @graph arrays and @type arrays
- Give partial credit based on type richness

3. AI Crawler Permissions (Medium Priority)
- Expand AI_USER_AGENTS from 4 to 10 crawlers
- Add ClaudeBot, PerplexityBot, ChatGPT-User, Applebot-Extended, Bard, anthropic-ai

4. Content Feed Detection (Medium Priority)
- New ContentFeedAudit checking RSS/Atom feed availability
- HTML <link rel='alternate'> feed link extraction in HtmlGatherer
- Numeric scoring based on feed count

5. Semantic HTML Analysis (Lower Priority)
- Add <article> element to semantic checks (was already gathered)
- Add heading hierarchy validation (no skipped levels)
- Add headingLevels tracking to SemanticElements

Config changes:
- Add content-feed audit (weight 4) to structured-data category
- Rebalance json-ld/meta-tags/semantic-html weights

Fixes #51
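To illustrate what handling @graph arrays and @type arrays involves, here is a minimal sketch of collecting schema types from parsed JSON-LD. The helper name collectSchemaTypes and the sample document are assumptions for illustration; this is not the JsonLdAudit implementation from the PR, and it does not attempt the partial-credit scoring.

```typescript
// Hypothetical helper: gather every @type value from parsed JSON-LD,
// descending into @graph arrays and accepting @type as a string or an array.
function collectSchemaTypes(
  doc: unknown,
  found: Set<string> = new Set<string>()
): Set<string> {
  if (Array.isArray(doc)) {
    for (const item of doc) collectSchemaTypes(item, found);
    return found;
  }
  if (doc === null || typeof doc !== "object") return found;

  const node = doc as { [key: string]: unknown };
  const rawType = node["@type"];
  const typeList: unknown[] = Array.isArray(rawType) ? rawType : [rawType];
  for (const t of typeList) {
    if (typeof t === "string") found.add(t);
  }
  // @graph holds a list of nodes, each of which may carry its own @type.
  if ("@graph" in node) collectSchemaTypes(node["@graph"], found);
  return found;
}

// Sample document (invented for this sketch).
const sampleJsonLd = {
  "@context": "https://schema.org",
  "@graph": [
    { "@type": "WebSite", name: "Example" },
    { "@type": ["Article", "BlogPosting"], headline: "Hello" },
  ],
};

const sampleTypes = Array.from(collectSchemaTypes(sampleJsonLd)).sort();
console.log(sampleTypes); // [ 'Article', 'BlogPosting', 'WebSite' ]
```

A binary check that only looks at a top-level string @type would find nothing in this sample, which is the kind of edge case the change is described as covering.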
Add llmsFullTxt, headingLevels, and feedLinks to test helpers to match updated HttpGatherResult and HtmlGatherResult interfaces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2a53fc0 to a839f85
Summary
Implement all 5 parts of issue #51 — enhanced content-focused scoring for AX Score.
Changes
1. llms.txt Detection (High Priority)
- Upgrade LlmsTxtAudit from binary → numeric scoring
- Add companion /llms-full.txt bonus detection
- Add llmsFullTxt probe in HttpGatherer
2. JSON-LD Schema Type Analysis (High Priority)
- Upgrade JsonLdAudit from binary → numeric scoring
- Handle @graph arrays and @type arrays
3. AI Crawler Permissions (Medium Priority)
4. Content Feed Detection (Medium Priority)
- New ContentFeedAudit: checks for RSS/Atom feed availability
- HTML <link rel="alternate"> feed link extraction in HtmlGatherer
5. Semantic HTML Analysis (Lower Priority)
- Add <article> to semantic element checks
- Track headingLevels in SemanticElements
Config
- Add content-feed audit (weight 4) to structured-data category
Testing
Files Changed
20 files, +569/-71 lines
Fixes #51
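For reference, the "no skipped levels" rule behind the heading hierarchy validation can be sketched as follows. The function name hasValidHeadingHierarchy is hypothetical, and the sketch assumes headingLevels is a list of heading levels (1 to 6) in document order; whether the actual audit also requires the first heading to be an h1 is not stated in this PR, so that is not checked here.

```typescript
// Hypothetical helper: returns true when no heading level is skipped on the
// way down (for example, an h2 followed directly by an h4 is a skip).
function hasValidHeadingHierarchy(levels: number[]): boolean {
  for (let i = 1; i < levels.length; i++) {
    // Going deeper by more than one level skips a level; going back up is fine.
    if (levels[i] - levels[i - 1] > 1) return false;
  }
  return true;
}

console.log(hasValidHeadingHierarchy([1, 2, 3, 2, 3])); // true
console.log(hasValidHeadingHierarchy([1, 3])); // false
```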