anjakDev/parsely

Parsely - Language Learning Vocabulary Extractor

Parsely uses AI to extract vocabulary from language-learning course notes (PDF/DOCX files) and stores it in a searchable database. It offers both a terminal user interface (TUI) and a web interface.

Features

  • AI-Powered Extraction: Uses Claude AI to intelligently extract vocabulary and phrases
  • Document Support: Parses PDF, DOCX, and plain text (TXT) files
  • Deduplication: Automatically skips vocabulary that's already in the database
  • Dual Interface: Choose between the CLI (terminal UI) and the web interface
  • Export: Export vocabulary to JSON for use in other applications
  • Security: Built with security best practices (SQL injection prevention, file validation, etc.)

Requirements

  • Go 1.23 or later
  • Claude API key (get one from Anthropic)
  • Optional: Bun or Node.js for the web frontend (if you want to develop it)

Installation

Clone the repository

git clone https://github.com/anjakDev/parsely.git
cd parsely

Install dependencies

go mod download

Build the binaries

# Build CLI version
go build -o parsely-cli ./cmd/cli

# Build web version
go build -o parsely-web ./cmd/web

Configuration

Parsely uses environment variables for configuration:

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| ANTHROPIC_API_KEY | Yes | (none) | Your Anthropic API key |
| DATABASE_PATH | No | /data/parsely.db | Path to the SQLite database file |
| LANGUAGE | No | auto-detect | Target language for extraction |
| PORT | No | 8080 | Port for the web server |
| API_TOKEN | No | (none) | Bearer token to protect API endpoints. If unset, auth is disabled (fine for local use). Set this in production. |
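As a rough illustration of how these variables and their defaults fit together, here is a minimal Go sketch. The `Config` struct, `getenvDefault`, and `loadConfig` are hypothetical names chosen for this example and are not taken from the Parsely source; only the variable names and defaults come from the table above.

```go
package main

import (
	"fmt"
	"os"
)

// Config mirrors the documented environment variables.
// The struct and its field names are illustrative assumptions.
type Config struct {
	AnthropicAPIKey string // ANTHROPIC_API_KEY (required)
	DatabasePath    string // DATABASE_PATH, default /data/parsely.db
	Language        string // LANGUAGE, empty means auto-detect
	Port            string // PORT, default 8080
	APIToken        string // API_TOKEN, empty disables auth
}

// getenvDefault returns the value of key, or fallback when it is unset or empty.
func getenvDefault(getenv func(string) string, key, fallback string) string {
	if v := getenv(key); v != "" {
		return v
	}
	return fallback
}

// loadConfig reads settings through getenv (normally os.Getenv)
// and applies the documented defaults.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		AnthropicAPIKey: getenv("ANTHROPIC_API_KEY"),
		DatabasePath:    getenvDefault(getenv, "DATABASE_PATH", "/data/parsely.db"),
		Language:        getenv("LANGUAGE"),
		Port:            getenvDefault(getenv, "PORT", "8080"),
		APIToken:        getenv("API_TOKEN"),
	}
	if cfg.AnthropicAPIKey == "" {
		return Config{}, fmt.Errorf("ANTHROPIC_API_KEY not set")
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Println("config error:", err)
		return
	}
	fmt.Printf("db=%s port=%s auth=%t\n", cfg.DatabasePath, cfg.Port, cfg.APIToken != "")
}
```

Passing the lookup function in as a parameter keeps the loader easy to test with a plain map instead of the real process environment.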

Deployment

Running locally

The default DATABASE_PATH is /data/parsely.db, which is intended for the Railway deployment (see below). When running locally, override it with a path inside a directory that exists and is writable on your machine:

DATABASE_PATH=parsely.db ANTHROPIC_API_KEY=sk-ant-... go run ./cmd/web

Or export the variables in your shell before running:

export ANTHROPIC_API_KEY="sk-ant-..."
export DATABASE_PATH="parsely.db"
./parsely-web

Deploying to Railway

The project includes a Dockerfile configured for Railway.

  1. Push the repository to GitHub and connect it to a new Railway project.
  2. Add a Volume in Railway and set the mount path to /data.
  3. Set the following environment variables in the Railway service settings:
    • ANTHROPIC_API_KEY: your Anthropic API key (required)
    • API_TOKEN: a secret token to protect your API (recommended)
    • LANGUAGE: target language, e.g. Spanish (optional)
    • DATABASE_PATH: can be left unset; defaults to /data/parsely.db
  4. Railway automatically injects the PORT variable, so no action is needed.

The SQLite database will be persisted on the mounted volume at /data/parsely.db across deployments and restarts.

Usage

CLI Version

Run the interactive terminal UI:

./parsely-cli

Features:

  • Parse new documents (PDF/DOCX)
  • View all vocabulary
  • Export to JSON
  • Navigate with arrow keys or vim keys (j/k)

Web Version

Start the web server:

./parsely-web

The API will be available at http://localhost:8080 by default; set PORT to use a different port.

API Endpoints

GET    /api/vocabulary       - List all vocabulary
GET    /api/vocabulary/{id}  - Get specific vocabulary item
DELETE /api/vocabulary/{id}  - Delete vocabulary item
POST   /api/upload           - Upload and process document
POST   /api/export           - Export vocabulary to JSON
GET    /api/stats            - Get vocabulary statistics
GET    /health               - Health check

Authentication

When API_TOKEN is set, all /api/* endpoints require an Authorization header carrying the token as a Bearer credential:

curl -H "Authorization: Bearer your-token" http://localhost:8080/api/vocabulary

The /health endpoint is always public. When API_TOKEN is not set (e.g. local development), no header is required.
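The rules above (token required under /api/ when API_TOKEN is set, /health always public, no check when the token is empty) could be expressed as a small middleware. This is a sketch under those stated rules, not Parsely's actual implementation; `requireBearer`, `okHandler`, and `statusFor` are hypothetical names, and the 401 response is an assumption about how a rejected request is reported.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// requireBearer wraps next so that paths under /api/ need
// "Authorization: Bearer <token>". An empty token disables the check,
// matching the documented behaviour when API_TOKEN is unset.
func requireBearer(token string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if token == "" || !strings.HasPrefix(r.URL.Path, "/api/") {
			next.ServeHTTP(w, r)
			return
		}
		got, ok := strings.CutPrefix(r.Header.Get("Authorization"), "Bearer ")
		// Constant-time comparison avoids leaking the token through timing.
		if !ok || subtle.ConstantTimeCompare([]byte(got), []byte(token)) != 1 {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for the real API handlers and always answers 200.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

// statusFor sends one in-memory GET through h and returns the status code.
func statusFor(h http.Handler, path, authHeader string) int {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	if authHeader != "" {
		req.Header.Set("Authorization", authHeader)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := requireBearer("your-token", okHandler)
	fmt.Println(statusFor(h, "/api/vocabulary", ""))                  // 401
	fmt.Println(statusFor(h, "/api/vocabulary", "Bearer your-token")) // 200
	fmt.Println(statusFor(h, "/health", ""))                          // 200
}
```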

Upload Document Example

curl -X POST \
  -H "Authorization: Bearer your-token" \
  -F "file=@/path/to/document.pdf" \
  http://localhost:8080/api/upload
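The same upload can be made from Go. The sketch below builds the equivalent multipart/form-data request with the standard library; `newUploadRequest` is a hypothetical helper written for this example, and the form field name `file` is taken from the curl command above.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newUploadRequest builds the same request as the curl example:
// a multipart POST to /api/upload with one form field named "file"
// and, when token is non-empty, a Bearer Authorization header.
func newUploadRequest(baseURL, token, filename string, content []byte) (*http.Request, error) {
	var body bytes.Buffer
	mw := multipart.NewWriter(&body)
	part, err := mw.CreateFormFile("file", filename)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the trailing boundary; the body is incomplete without it.
	if err := mw.Close(); err != nil {
		return nil, err
	}

	req, err := http.NewRequest(http.MethodPost, baseURL+"/api/upload", &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", mw.FormDataContentType())
	if token != "" {
		req.Header.Set("Authorization", "Bearer "+token)
	}
	return req, nil
}

func main() {
	// Placeholder bytes; a real caller would read the document with os.ReadFile.
	req, err := newUploadRequest("http://localhost:8080", "your-token", "document.pdf", []byte("%PDF-1.4"))
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String())
	// To actually send it: resp, err := http.DefaultClient.Do(req)
}
```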

Running Tests

Run all tests with coverage:

go test ./... -cover

Run tests for a specific package:

go test ./internal/db -v
go test ./internal/parser -v
go test ./internal/ai -v
go test ./internal/core -v
go test ./internal/api -v

Project Structure

parsely/
├── cmd/
│   ├── cli/          # CLI application entry point
│   └── web/          # Web server entry point
├── internal/
│   ├── ai/           # Claude AI integration
│   ├── parser/       # PDF/DOCX parsers
│   ├── db/           # SQLite database layer
│   ├── core/         # Core business logic
│   └── api/          # HTTP API handlers
├── testdata/         # Test fixtures
├── go.mod
├── go.sum
├── README.md
└── CLAUDE.md         # Development guidelines

Security Features

  • API Authentication: Bearer token auth protects all /api/* endpoints when API_TOKEN is set (/health stays public)
  • SQL Injection Prevention: All database queries use parameterized statements
  • Path Traversal Protection: File paths are validated to prevent directory traversal
  • File Size Limits: Maximum 10MB per document
  • File Type Validation: Only PDF and DOCX files accepted
  • Input Sanitization: All user input is validated and sanitized
  • Secure Permissions: Database and temp files created with restrictive permissions
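To make a few of these checks concrete, here is a minimal sketch of the size limit, file type validation, and path traversal protection. It follows the limits stated in this list (10MB, PDF and DOCX only) and is not Parsely's actual code; `validateUpload`, `safeJoin`, and the constants are hypothetical. Checking the extension alone is a simplification, since a stricter validator would also inspect the file contents.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// maxUploadBytes is the documented 10MB per-document limit.
const maxUploadBytes = 10 << 20

// allowedExt lists the extensions named in the Security Features list.
var allowedExt = map[string]bool{".pdf": true, ".docx": true}

// validateUpload rejects empty or oversized files and unsupported extensions.
func validateUpload(filename string, size int64) error {
	if size <= 0 || size > maxUploadBytes {
		return errors.New("file must be between 1 byte and 10MB")
	}
	ext := strings.ToLower(filepath.Ext(filename))
	if !allowedExt[ext] {
		return fmt.Errorf("unsupported file type %q", ext)
	}
	return nil
}

// safeJoin places name under baseDir and refuses names that would escape it,
// such as absolute paths or paths containing "..".
func safeJoin(baseDir, name string) (string, error) {
	if !filepath.IsLocal(name) {
		return "", fmt.Errorf("unsafe path %q", name)
	}
	return filepath.Join(baseDir, name), nil
}

func main() {
	fmt.Println(validateUpload("lesson1.pdf", 2<<20)) // accepted: <nil>
	fmt.Println(validateUpload("lesson1.exe", 2<<20)) // rejected: wrong type
	fmt.Println(validateUpload("big.pdf", 11<<20))    // rejected: too large

	_, err := safeJoin("uploads", "../parsely.db")
	fmt.Println(err) // rejected: escapes the base directory
}
```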

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Write tests first (TDD approach)
  4. Implement your feature
  5. Ensure all tests pass
  6. Submit a pull request

See CLAUDE.md for detailed development guidelines.

Troubleshooting

"ANTHROPIC_API_KEY not set"

Make sure you've exported your API key:

export ANTHROPIC_API_KEY="your-key"

Database Permission Errors

Make sure the user running Parsely owns the database file and can read and write it:

chmod 600 parsely.db

PDF Parsing Errors

Some PDFs may not contain extractable text. Try:

  1. Ensuring the PDF has selectable text (not scanned images)
  2. Using a different PDF viewer to verify text content
  3. Converting scanned PDFs to text-based PDFs using OCR

Large File Errors

Files over 10MB are rejected. Compress or split your documents.

License

MIT License - see LICENSE file for details

Acknowledgments
