Skip to content

feat: add llms.txt detection and content-focused scoring criteria#58

Merged
IISweetHeartII merged 2 commits intodevelopfrom
fix/issue-51
Apr 17, 2026
Merged

feat: add llms.txt detection and content-focused scoring criteria#58
IISweetHeartII merged 2 commits intodevelopfrom
fix/issue-51

Conversation

@IISweetHeartII
Copy link
Copy Markdown
Contributor

Summary

Implement all 5 parts of issue #51 — enhanced content-focused scoring for AX Score.

Changes

1. llms.txt Detection (High Priority)

  • Upgrade LlmsTxtAudit from binary → numeric scoring
  • Check for H1 heading, blockquote summary, sectioned URL lists
  • Add companion /llms-full.txt bonus detection
  • New llmsFullTxt probe in HttpGatherer

2. JSON-LD Schema Type Analysis (High Priority)

  • Upgrade JsonLdAudit from binary → numeric scoring
  • Analyze specific schema types: WebSite, BlogPosting, Article, Person, Organization, BreadcrumbList, FAQPage, HowTo, WebPage
  • Handle @graph arrays and @type arrays
  • Give partial credit based on type richness

3. AI Crawler Permissions (Medium Priority)

  • Expand AI crawler list from 4 → 10 agents
  • Add: ClaudeBot, PerplexityBot, ChatGPT-User, Applebot-Extended, Bard, anthropic-ai

4. Content Feed Detection (Medium Priority)

  • New ContentFeedAudit — checks for RSS/Atom feed availability
  • HTML <link rel="alternate"> feed link extraction in HtmlGatherer
  • Numeric scoring based on feed count

5. Semantic HTML Analysis (Lower Priority)

  • Add <article> to semantic element checks
  • Add heading hierarchy validation (no skipped levels, e.g. h1 → h3)
  • Track headingLevels in SemanticElements

Config

  • Add content-feed audit (weight 4) to structured-data category
  • Rebalance json-ld (8), meta-tags (4), semantic-html (4) weights

Testing

  • All 139 tests pass (7 new tests added)
  • TypeScript type-check clean
  • ESLint clean

Files Changed

20 files, +569/-71 lines

Fixes #51

@IISweetHeartII
Copy link
Copy Markdown
Contributor Author

Navi review: APPROVE 판단.

  • 구현 범위가 issue #51의 5개 항목을 모두 정확히 덮음
  • CI (lint, type-check, test) clean
  • 특히 llms.txt/JSON-LD/feed/semantic HTML scoring 확장이 일관됨

참고: 현재 gh auth 기준 PR 작성자와 동일 계정이라 GitHub self-approve는 불가해 코멘트로 판정 남깁니다.

@IISweetHeartII
Copy link
Copy Markdown
Contributor Author

CI 실패: auto-label. 수정 필요.

@IISweetHeartII
Copy link
Copy Markdown
Contributor Author

CI 실패: auto-label (failure). 수정 필요.

@IISweetHeartII
Copy link
Copy Markdown
Contributor Author

CI 실패: auto-label. 수정 필요.

@IISweetHeartII
Copy link
Copy Markdown
Contributor Author

Reviewed deeply. No blocking issues found. The llms.txt, JSON-LD, feed, and semantic HTML changes are internally consistent, test coverage was expanded alongside the new scoring logic, and the config/report updates match the new audit surface.

@IISweetHeartII IISweetHeartII enabled auto-merge (squash) April 17, 2026 05:24
@IISweetHeartII
Copy link
Copy Markdown
Contributor Author

Deep review is still positive on the code itself, but this PR is now blocked by merge conflicts against develop. I tried updating the branch and GitHub rejected the rebase due to conflicts, so this needs a conflict refresh before it can be merged.

@IISweetHeartII
Copy link
Copy Markdown
Contributor Author

CI 실패: auto-label. 수정 필요.

Copy link
Copy Markdown
Contributor Author

@IISweetHeartII IISweetHeartII left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐱 Navi review — APPROVE, blocking 없음.

issue #51의 5가지 변경 모두 합리적이고 테스트 커버리지 충분함.

판단 근거:

  • llms.txt: binary→numeric 점수 전환, 품질 신호(H1/blockquote/sectioned URLs/길이) 세분화 깔끔
  • JSON-LD: @graph + @type 배열 처리로 엣지 케이스 커버
  • AI 크롤러: 4→10 확장, 단순 추가
  • ContentFeed: 새 audit + gatherer 분리 구조 좋음
  • Semantic HTML: <article> 추가, heading hierarchy 검증 로직 정확

Non-blocking (3건):

  1. ContentFeedAudit.detectBodyFeedLinksHtmlGatherer.extractFeedLinks와 동일한 <link rel="alternate"> 정규식 파싱을 수행. gatherer에서 이미 feedLinks로 제공하므로 audit의 body re-parsing은 불필요. Set dedup이라 동작엔 문제 없지만, 같은 로직이 두 곳에 있음.

  2. ContentFeed 점수: min(1, feedCount * 0.5 + 0.5) → 피드 1개 = 만점 1.0. 점수 차별화가 안 됨.

  3. runner.test.ts에서 /llms-full.txt 404 조건 추가 시 원래 if ( 포맷이 망가짐. 동작은 정상.

IISweetHeartII and others added 2 commits April 17, 2026 20:55
Implement all 5 parts of issue #51:

1. llms.txt Detection (High Priority)
   - Upgrade LlmsTxtAudit from binary to numeric scoring
   - Check for H1 heading, blockquote summary, sectioned URL lists
   - Add companion /llms-full.txt bonus detection
   - Add llmsFullTxt probe to HttpGatherer

2. JSON-LD Schema Type Analysis (High Priority)
   - Upgrade JsonLdAudit from binary to numeric scoring
   - Analyze specific schema types (WebSite, BlogPosting, Article,
     Person, Organization, BreadcrumbList, FAQPage, HowTo, WebPage)
   - Handle @graph arrays and @type arrays
   - Give partial credit based on type richness

3. AI Crawler Permissions (Medium Priority)
   - Expand AI_USER_AGENTS from 4 to 10 crawlers
   - Add ClaudeBot, PerplexityBot, ChatGPT-User, Applebot-Extended,
     Bard, anthropic-ai

4. Content Feed Detection (Medium Priority)
   - New ContentFeedAudit checking RSS/Atom feed availability
   - HTML <link rel='alternate'> feed link extraction in HtmlGatherer
   - Numeric scoring based on feed count

5. Semantic HTML Analysis (Lower Priority)
   - Add <article> element to semantic checks (was already gathered)
   - Add heading hierarchy validation (no skipped levels)
   - Add headingLevels tracking to SemanticElements

Config changes:
- Add content-feed audit (weight 4) to structured-data category
- Rebalance json-ld/meta-tags/semantic-html weights

Fixes #51
Add llmsFullTxt, headingLevels, and feedLinks to test helpers
to match updated HttpGatherResult and HtmlGatherResult interfaces.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@IISweetHeartII IISweetHeartII merged commit 76d8f50 into develop Apr 17, 2026
2 of 3 checks passed
@IISweetHeartII IISweetHeartII deleted the fix/issue-51 branch April 17, 2026 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature]: Add llms.txt detection and content-focused scoring criteria

1 participant