Extract structured data (JSON-LD, Microdata, RDFa) from any URL into clean JSON.
OpenSchemaExtract includes:
- A reusable TypeScript/Node extractor library
- A Next.js API endpoint (`/api/extract`)
- A minimal web UI for interactive extraction
- A Claude Code plugin for AI agents (see PLUGIN.md)
- An MCP server for Claude Desktop and other MCP clients (see mcp-server/README.md)
Extract JSON-LD, Microdata, and RDFa from any URL with a fast TypeScript library, API endpoint, and web UI.
- Parses `application/ld+json`, Microdata (`itemscope`/`itemprop`), and RDFa (`typeof`/`property`)
- Normalizes output into `schemaTypes`, `blocks`, and `byType`
- Bot-wall handling with fallback fetching via `r.jina.ai`
- Typed response model for app and package usage
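The JSON-LD half of that normalization can be sketched as follows. This is an illustration of the output shape, not the library's actual parser; the `SchemaBlockSketch` type and both helper functions are ours:

```typescript
// Sketch: pull JSON-LD blocks out of an HTML string and group them by @type,
// mirroring the blocks/byType shape described above. Not the real implementation.
interface SchemaBlockSketch {
  format: "json-ld";
  type: string;
  data: Record<string, unknown>;
}

function extractJsonLd(html: string): SchemaBlockSketch[] {
  const blocks: SchemaBlockSketch[] = [];
  const re = /<script[^>]*type="application\/ld\+json"[^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(re)) {
    try {
      const data = JSON.parse(match[1]);
      blocks.push({ format: "json-ld", type: String(data["@type"] ?? "Unknown"), data });
    } catch {
      // Skip malformed JSON-LD rather than failing the whole page
    }
  }
  return blocks;
}

function groupByType(blocks: SchemaBlockSketch[]): Record<string, SchemaBlockSketch[]> {
  const byType: Record<string, SchemaBlockSketch[]> = {};
  for (const b of blocks) (byType[b.type] ??= []).push(b);
  return byType;
}
```

The real extractor also handles Microdata and RDFa, which require DOM traversal rather than a regex.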
Get an API key from https://openschemaextract.chat-data.com/dashboard, then call the API:
```bash
curl -H "Authorization: Bearer osx_live_your_key" \
  "https://openschemaextract.chat-data.com/api/extract?url=https://schema.org/Recipe"
```

No API key? Demo usage is available (rate-limited); just omit the `Authorization` header.
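The same call from TypeScript might look like this. The endpoint and header format come from the curl example above; the `buildExtractRequest` helper name is ours:

```typescript
// Sketch: build the request for the hosted /api/extract endpoint.
// Without an apiKey, the rate-limited demo tier is used.
function buildExtractRequest(targetUrl: string, apiKey?: string) {
  const url = new URL("https://openschemaextract.chat-data.com/api/extract");
  url.searchParams.set("url", targetUrl);
  const headers: Record<string, string> = apiKey
    ? { Authorization: `Bearer ${apiKey}` }
    : {};
  return { url: url.toString(), headers };
}

// Usage (network call, so commented out here):
// const { url, headers } = buildExtractRequest("https://schema.org/Recipe", "osx_live_your_key");
// const res = await fetch(url, { headers });
// const json = await res.json();
```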
```bash
npx skills add chat-data-llc/OpenSchemaExtract
```

Then ask your AI agent: "extract schema from https://schema.org/Recipe"
See PLUGIN.md for details.
```bash
npm install openschemaextract
```

```ts
import { extractSchema } from "openschemaextract";

const result = await extractSchema("https://schema.org/Recipe");
if (result.ok) {
  console.log(result.data.blocks); // All schema blocks
}
```

Runs entirely on your server: no API calls, no rate limits.
```bash
claude mcp add -e OPENSCHEMAEXTRACT_API_KEY=osx_live_your_key \
  -- openschemaextract npx -y openschemaextract-mcp
```

See mcp-server/README.md for details.
Deploy the full app (Next.js + MongoDB + OAuth) to your own VPS:
```bash
git clone https://github.com/chat-data-llc/OpenSchemaExtract.git
cd OpenSchemaExtract
./deploy.sh
```

The deploy.sh script auto-generates secrets, sets up Docker containers, and starts the app on port 3000. See the Self-Hosting section below for the full guide with Nginx, SSL, and troubleshooting.
Prerequisites: Node.js 18+
```bash
git clone https://github.com/chat-data-llc/OpenSchemaExtract.git
cd OpenSchemaExtract
npm install
npm run dev
```

Open http://localhost:3000.
```bash
curl "http://localhost:3000/api/extract?url=schema.org/Recipe"
```

Success response:

```json
{
  "success": true,
  "data": {
    "url": "https://schema.org/Recipe",
    "schemaTypes": ["Recipe"],
    "blocks": [
      {
        "format": "json-ld",
        "type": "Recipe",
        "data": { "name": "..." }
      }
    ],
    "byType": {
      "Recipe": [
        {
          "format": "json-ld",
          "type": "Recipe",
          "data": { "name": "..." }
        }
      ]
    }
  }
}
```

Error response:

```json
{
  "success": false,
  "error": "No structured data found on this page",
  "errorCode": "EMPTY_CONTENT"
}
```

Install the package:
```bash
npm install openschemaextract
```

Use it in your Node.js code:

```ts
import { extractSchema } from "openschemaextract";

const result = await extractSchema("https://schema.org/Recipe");
if (result.ok) {
  console.log(result.data.schemaTypes); // ["Recipe"]
  console.log(result.data.blocks);      // All schema blocks
  console.log(result.data.byType);      // Grouped by @type
} else {
  console.error(result.error.code, result.error.message);
}
```

Available exports:

- `extractSchema(url: string)` - Main extraction function
- `extract(url: string)` - Alias
- TypeScript types: `ExtractionResult`, `SchemaBlock`, `ExtractionError`, `ExtractionErrorCode`
```bash
npm run dev        # Next.js dev server
npm run build      # Next.js production build
npm run start      # Run production server
npm run test       # Integration + smoke tests
npm run build:pkg  # Build package with tsup
```

```
app/          Next.js app router and API route
components/   UI components
src/          Extractor library and parsers
src/parsers/  JSON-LD, Microdata, RDFa parsers
tests/        Integration and smoke tests
```
Deploy your own instance of OpenSchemaExtract with Docker. The app runs in a container and connects to an existing MongoDB instance on your server.
- VPS or server with Docker and Docker Compose installed
- MongoDB running (as a container or standalone)
- Domain name pointing to your server (for HTTPS and GitHub OAuth)
- GitHub OAuth App (for user authentication)
Install Docker if needed:
```bash
curl -fsSL https://get.docker.com | sh
```

```bash
git clone https://github.com/chat-data-llc/OpenSchemaExtract.git
cd OpenSchemaExtract
```

- Go to https://github.com/settings/developers → New OAuth App
- Fill in:
  - Application name: `OpenSchemaExtract`
  - Homepage URL: `https://your-domain.com`
  - Authorization callback URL: `https://your-domain.com/api/auth/callback/github`
- Save your Client ID and Client Secret for the next step
If you have MongoDB running as a Docker container (e.g., named `mongo`), create a shared network so the app can reach it by name:
```bash
sudo docker network create shared
sudo docker network connect shared mongo
```

If MongoDB is running on the host directly (not in Docker), use `MONGODB_URI=mongodb://host.docker.internal:27017` instead and add `extra_hosts: ["host.docker.internal:host-gateway"]` to `docker-compose.yml`.
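For the host-MongoDB case, the compose service entry might look like this (service name, build context, and port mapping are illustrative; only `extra_hosts` is the required addition):

```yaml
services:
  app:
    build: .
    ports:
      - "127.0.0.1:3000:3000"
    env_file: .env.production
    extra_hosts:
      - "host.docker.internal:host-gateway"
```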
```bash
cp .env.production.example .env.production
nano .env.production
```

Fill in your values:
```bash
# Generate secrets with: openssl rand -base64 32
AUTH_SECRET=<your-generated-secret>
OAUTH_JWT_SECRET=<your-generated-secret>

# Your domain
AUTH_URL=https://your-domain.com
OAUTH_ISSUER=https://your-domain.com

# GitHub OAuth (from Step 2)
AUTH_GITHUB_ID=<your-client-id>
AUTH_GITHUB_SECRET=<your-client-secret>

# MongoDB — use your container name (e.g., "mongo") as the hostname
MONGODB_URI=mongodb://mongo:27017
MONGODB_DB=openschemaextract
```

```bash
./deploy.sh
```

Or manually:
```bash
sudo docker compose up -d --build
```

Your app will be available at http://your-server-ip:3000
```bash
sudo apt update
sudo apt install nginx certbot python3-certbot-nginx
```

Create `/etc/nginx/sites-available/openschemaextract`:
```nginx
server {
    listen 80;
    server_name your-domain.com;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
        proxy_read_timeout 90;
    }

    client_max_body_size 10M;
}
```

Enable and get SSL:
```bash
sudo ln -s /etc/nginx/sites-available/openschemaextract /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx
sudo certbot --nginx -d your-domain.com
```

```bash
# View logs
sudo docker compose logs -f app

# Restart app
sudo docker compose restart app

# Stop the app
sudo docker compose down

# Update to latest code
git pull && sudo docker compose up -d --build

# Check running containers
sudo docker compose ps
```

| Variable | Required | Description |
|---|---|---|
| `AUTH_SECRET` | Yes | NextAuth.js encryption key (`openssl rand -base64 32`) |
| `AUTH_URL` | Yes | Your full app URL (e.g., `https://your-domain.com`) |
| `AUTH_GITHUB_ID` | Yes | GitHub OAuth Client ID |
| `AUTH_GITHUB_SECRET` | Yes | GitHub OAuth Client Secret |
| `OAUTH_ISSUER` | Yes | OAuth 2.1 issuer URL (usually same as `AUTH_URL`) |
| `OAUTH_JWT_SECRET` | Yes | JWT signing secret (`openssl rand -base64 32`) |
| `MONGODB_URI` | Yes | MongoDB connection string (e.g., `mongodb://mongo:27017`) |
| `MONGODB_DB` | Yes | Database name (default: `openschemaextract`) |
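Since every variable in the table is required, a fail-fast startup check is a cheap safeguard. This helper is ours (the app may do its own validation differently); the variable names match the table above:

```typescript
// Sketch: report which required environment variables are missing, so a
// misconfigured deployment fails loudly at startup instead of at first request.
const REQUIRED_ENV = [
  "AUTH_SECRET",
  "AUTH_URL",
  "AUTH_GITHUB_ID",
  "AUTH_GITHUB_SECRET",
  "OAUTH_ISSUER",
  "OAUTH_JWT_SECRET",
  "MONGODB_URI",
  "MONGODB_DB",
] as const;

function missingEnv(env: Record<string, string | undefined>): string[] {
  return REQUIRED_ENV.filter((name) => !env[name]);
}

// Usage: const missing = missingEnv(process.env);
//        if (missing.length) throw new Error(`Missing env vars: ${missing.join(", ")}`);
```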
**Build fails with "MONGODB_URI environment variable is required"**
- The Dockerfile provides a dummy `MONGODB_URI` at build time. Make sure your Dockerfile hasn't been modified.

**Build fails with "Module not found: @/components/..."**
- Make sure `tsconfig.json` includes the `@/*` path alias. Restore it from git if corrupted.

**Tailwind oxide native binding error**
- The Dockerfile uses `node:20-slim` (Debian) and a fresh `npm install` to avoid this. Don't switch to Alpine.

**GitHub login doesn't work**
- Verify the callback URL in your GitHub OAuth App matches `https://your-domain.com/api/auth/callback/github` exactly.
- Make sure `AUTH_URL` in `.env.production` matches your domain.

**MongoDB connection fails**
- Verify your MongoDB container is on the `shared` network: `sudo docker network inspect shared`
- Check MongoDB is running: `sudo docker ps | grep mongo`
- Test connectivity: `sudo docker compose exec app sh -c "curl -s mongo:27017"`

**Port 3000 already in use**
- Change the port mapping in `docker-compose.yml`: `"127.0.0.1:3001:3000"`
See DEPLOYMENT.md for backup, monitoring, and advanced configurations.
- This project fetches user-provided URLs server-side.
- If deploying publicly, add SSRF protections (allowlist or private-address blocking).
- Proxy fallback sends blocked URLs through https://r.jina.ai/; avoid passing sensitive URLs/tokens.
- Live integration tests depend on external websites and may be flaky over time.
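A private-address check for SSRF protection can be sketched as below. This is a minimal illustration, not part of openschemaextract, and a real guard must also resolve hostnames to IPs before checking (an attacker can point a public DNS name at a private address):

```typescript
// Sketch: reject literal IPs in well-known private/loopback/link-local ranges.
// isPrivateAddress is a hypothetical helper; production SSRF defenses need
// DNS resolution, redirect re-checking, and IPv6-mapped-IPv4 handling too.
import { isIP } from "node:net";

function isPrivateAddress(host: string): boolean {
  if (host === "localhost") return true;
  if (isIP(host) === 4) {
    const [a, b] = host.split(".").map(Number);
    if (a === 10 || a === 127 || a === 0) return true;  // 10/8, loopback, "this" network
    if (a === 172 && b >= 16 && b <= 31) return true;   // 172.16/12
    if (a === 192 && b === 168) return true;            // 192.168/16
    if (a === 169 && b === 254) return true;            // link-local
  }
  if (isIP(host) === 6) {
    const h = host.toLowerCase();
    if (h === "::1" || h.startsWith("fc") || h.startsWith("fd") || h.startsWith("fe80")) return true;
  }
  return false;
}
```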
MIT