1842 doc powershell scripts #1848

Closed
SturmCamper wants to merge 6 commits into apache:main from SturmCamper:1842-doc-powershell-scripts

@SturmCamper
Contributor

Thank you for contributing to Apache StormCrawler.

In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:

For all changes

  • Is there an issue associated with this PR? Is it referenced in the commit message?

  • Does your PR title start with #XXXX where XXXX is the issue number you are trying to resolve?

  • Has your PR been rebased against the latest commit within the target branch (typically main)?

  • Is your initial contribution a single, squashed commit?

  • Is the code properly formatted with mvn git-code-format:format-code -Dgcf.globPattern="**/*" -Dskip.format.code=false?

rzo1 and others added 6 commits March 26, 2026 19:22
…c files

Fix inaccurate default values, wrong config key names, and a
nonexistent class reference in the documentation:
- protocols: add missing "file" to default
- redirections.allowed → http.allow.redirects (default false, not true)
- fetchInterval.error: -1, not 44640
- cacheConfigParamName → robots.cache.spec
- errorcacheConfigParamName → robots.error.cache.spec
- http.accept and http.accept.language: add actual defaults
- protocol.md.prefix: default is "protocol.", not empty
- FrontierSpout → Spout (class doesn't exist)
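As a rough illustration of the corrected entries listed above, the fragment below restates the three keys whose defaults this commit message gives explicitly. The top-level `config:` wrapper is an assumption about the file layout; the values come only from the list above, and the robots cache keys are left out because their defaults are not stated here.

```yaml
# Sketch only: keys and values are taken from the commit message above.
# The enclosing "config:" block is an assumed file layout.
config:
  # Redirects are not followed by default
  # (previously documented as redirections.allowed with default true).
  http.allow.redirects: false
  # Previously documented as 44640.
  fetchInterval.error: -1
  # Previously documented as empty.
  protocol.md.prefix: "protocol."
```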
…ss documentation

- Fix wrong package names: org.apache.storm.crawler → org.apache.stormcrawler
- Fix outdated import: backtype.storm.Config → org.apache.storm.Config
- Fix non-existent class reference: StatusStreamBolt → DummyIndexer
- Fix method signature to match actual code (MetadataTransfer.getMetaForOutlink)
- Fix broken internal links (StatusStream, Configuration, HTTPProtocol anchors)
- Fix Selenium protocol links pointing to core/ instead of external/selenium/
- Fix incorrect GitHub line number references for crawler-default.yaml
- Fix incomplete references ("See blog post", "See example", "See default")
- Fix Storm UI URL protocol: https → http
- Fix TikaParser → ParserBolt (actual class name)
- Remove undocumented/unimplemented http.store.responsetime config entry
- Replace unexplained %HEAP-MEM% placeholder with concrete 2g value
- Fix typos: anonynmous, asomething, extra commas, grammar
- Fix AsciiDoc link syntax in powered-by.adoc
Configuration tables in configuration.adoc now cover all keys from
crawler-default.yaml, including robots, protocol, OkHttp, parsing,
sitemap, scheduling, and indexing options.

Internals documentation now covers:
- FeedParserBolt, URLPartitionerBolt, StatusEmitterBolt
- FileSpout, MemorySpout, AbstractQueryingSpout
- AdaptiveScheduler (with config example)
- DelegatorProtocol (with config example)
- SelfURLFilter

Also fixes missing default values for indexer.text.fieldname and
indexer.url.fieldname.
Add configuration tables for OpenSearch, Solr, SQL, URLFrontier, Tika,
AWS, AI/LLM, Playwright, Language ID, WARC, and common spout options.

Each module section includes all key config options with defaults and
descriptions, plus links to the module source for full setup details.
The Language ID module (which had no README) now has its parse filter
configuration documented for the first time.
New extending.adoc covers:
- Writing custom URL filters, parse filters, protocols, bolts/spouts
  with complete code examples and registration instructions
- Politeness and rate limiting: queue modes, crawl-delay handling,
  robots.txt compliance
- Error handling and retry logic: status lifecycle, retry mechanism,
  custom fetch intervals
- Monitoring and metrics: fetcher metrics, MetricsConsumer setup
- Scaling and tuning: thread config, queue sizing, connection pooling
- Security: SSL/TLS, basic auth, proxy auth
- add Bootstrapping PowerShell script
- add Inject Your First Seeds PowerShell script
- add Run Your First Crawl PowerShell script

Issue apache#1842

2 participants