Nous

Crawl websites and extract readable Markdown, optimized for LLM consumption. Inspired by sitefetch.

Nous fetches same-host pages starting from a seed URL, extracts readable content, and outputs clean Markdown as XML-tagged text or JSON. It supports concurrent crawling, glob-based URL filtering, and two extraction backends: a local parser (ruby-readability) and the Jina Reader API for JS-rendered sites.
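
For a quick taste, the same crawl is available from Ruby; a minimal round trip (see the Ruby API section below for the full surface) fetches a few pages and prints LLM-ready text:

require "nous"

pages = Nous.fetch("https://example.com", limit: 10)
puts Nous.serialize(pages, format: :text)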

Installation

Add to your Gemfile:

gem "nous"

Or install directly:

gem install nous

CLI Usage

# Crawl a site and print extracted content to stdout
nous https://example.com

# Output as JSON
nous https://example.com -f json

# Write to a file
nous https://example.com -o site.md

# Limit pages and increase concurrency
nous https://example.com -l 20 -c 5

# Only crawl pages matching a glob pattern
nous https://example.com -m "/blog/*"

# Scope extraction to a CSS selector
nous https://example.com -s "article.post"

# Use Jina Reader API for JS-rendered sites (Next.js, SPAs)
nous https://example.com --jina

# Debug logging
nous https://example.com -d

Options

Flag                       Description                           Default
-o, --output PATH          Write output to file                  stdout
-f, --format FORMAT        Output format: text or json           text
-c, --concurrency N        Concurrent requests                   3
-m, --match PATTERN        Glob filter for URLs (repeatable)     none
-s, --selector SELECTOR    CSS selector to scope extraction      none
-l, --limit N              Maximum pages to fetch                100
--timeout N                Per-request timeout in seconds        15
--jina                     Use Jina Reader API for extraction    off
-v, --version              Print version and exit                off
-h, --help                 Print usage and exit                  off
-d, --debug                Debug logging to stderr               off

Ruby API

require "nous"

# Fetch pages with the default extractor
pages = Nous.fetch("https://example.com", limit: 10, concurrency: 3)

# Each page is a Nous::Page with title, url, pathname, content
pages.each do |page|
  puts "#{page.title} (#{page.url})"
  puts page.content
end

# Serialize to XML-tagged text
text = Nous.serialize(pages, format: :text)

# Serialize to JSON
json = Nous.serialize(pages, format: :json)

# Use the Jina extractor for JS-heavy sites
pages = Nous.fetch("https://spa-site.com",
  extractor: Nous::Extractor::Jina.new,
  limit: 5
)
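
Because fetch returns a plain array of Nous::Page objects, downstream filtering and file output are ordinary Ruby. A small sketch (the /blog/ prefix and blog.md filename are illustrative):

require "nous"

pages = Nous.fetch("https://example.com", limit: 50)

# Keep only blog pages, then write XML-tagged text to disk
# (a post-hoc version of what the CLI's -m and -o flags do).
blog_pages = pages.select { |page| page.pathname.start_with?("/blog/") }
File.write("blog.md", Nous.serialize(blog_pages, format: :text))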

Extraction Backends

Default (ruby-readability)

Parses static HTML using ruby-readability, strips noisy elements (nav, footer, script, header), and converts to Markdown via reverse_markdown. It is fast and requires no external services, but cannot extract content from JS-rendered pages.
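
Conceptually, the pipeline is close to the standalone sketch below (not Nous's actual internals; the input file and the exact element list are illustrative):

require "nokogiri"
require "readability"
require "reverse_markdown"

html = File.read("page.html")

# Drop obviously noisy elements before readability scoring.
doc = Nokogiri::HTML(html)
doc.css("nav, footer, script, header").remove

# Pull out the main article HTML, then convert it to Markdown.
article_html = Readability::Document.new(doc.to_html).content
markdown = ReverseMarkdown.convert(article_html)

puts markdown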

Jina Reader API

Uses the Jina Reader API, which renders pages with headless Chrome. Handles Next.js App Router, React Server Components, SPAs, and other JS-heavy sites. The free tier allows 20 requests per minute without a key, or 500 RPM when a JINA_API_KEY environment variable is set.
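
For reference, the underlying service is the public r.jina.ai endpoint; a minimal request sketch, assuming its prepend-the-URL convention and optional bearer auth (not how Nous calls it internally):

require "net/http"
require "uri"

target = "https://example.com"
uri = URI("https://r.jina.ai/#{target}")

request = Net::HTTP::Get.new(uri)
# A key is optional; it only raises the free-tier rate limit.
if (key = ENV["JINA_API_KEY"])
  request["Authorization"] = "Bearer #{key}"
end

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.request(request)
end

puts response.body  # Markdown rendered by headless Chrome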

Output Formats

Text (default)

XML-tagged output designed for LLM context windows:

<page>
<title>Page Title</title>
<url>https://example.com/page</url>
<content>
# Heading

Extracted markdown content...
</content>
</page>

JSON

[
  {
    "title": "Page Title",
    "url": "https://example.com/page",
    "pathname": "/page",
    "content": "# Heading\n\nExtracted markdown content..."
  }
]
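
If you save this format to disk (for example with -o site.json -f json), it reads back as plain hashes; a hypothetical downstream script:

require "json"

pages = JSON.parse(File.read("site.json"))

pages.each do |page|
  puts "#{page["title"]} -> #{page["pathname"]}"
  # page["content"] holds the extracted Markdown
end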

Development

bin/setup               # Install dependencies
bundle exec rspec       # Run tests
bundle exec standardrb  # Lint
bundle exec exe/nous    # Run the CLI in development

License

MIT License. See LICENSE.txt.
