phinde - generic web search engine

Self-hosted search engine you can use for your static blog or just about any other website that needs search functionality.

My live instance is at http://search.cweiske.de/ and indexes my website, blog and all linked URLs.

Features

  • Crawler and indexer with the ability to run many in parallel
  • Shows and highlights text that contains search words
  • Boolean search queries:
    • foo bar searches for foo AND bar
    • foo OR bar
    • title:foo searches for foo only in the page title
  • Facets for tag, domain, language and type
  • Date search:
    • before:2016-08-30 - modification date before that day
    • after:2016-08-30 - modified after that day
    • date:2016-08-30 - exact modification day match
  • Site search
    • Query: foo bar site:example.org/dir/
    • or use the site GET parameter: /?q=foo&site=example.org/dir
  • OpenSearch support with HTML and Atom result lists
  • Instant indexing with WebSub (formerly PubSubHubbub)
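
The query syntax and the site GET parameter listed above can be combined into a search URL. The following is a minimal sketch, not part of phinde: the helper names urlencode and phinde_search_url are made up for illustration, and the encoder only handles ASCII input.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with phinde.

# Percent-encode a string for use as a URL query value.
# Unreserved characters (RFC 3986) pass through; ASCII input only.
urlencode() {
    s=$1
    out=
    while [ -n "$s" ]; do
        rest=${s#?}        # everything after the first character
        c=${s%"$rest"}     # the first character itself
        s=$rest
        case $c in
            [A-Za-z0-9._~-]) out=$out$c ;;
            *) out=$out$(printf '%%%02X' "'$c") ;;
        esac
    done
    printf '%s\n' "$out"
}

# Build a search URL from the instance base URL, a query,
# and an optional site restriction (the "site" GET parameter).
phinde_search_url() {
    base=${1%/}
    url="$base/?q=$(urlencode "$2")"
    if [ -n "${3:-}" ]; then
        url="$url&site=$(urlencode "$3")"
    fi
    printf '%s\n' "$url"
}
```

For example, phinde_search_url http://phinde.example.org 'foo bar' example.org/dir prints http://phinde.example.org/?q=foo%20bar&site=example.org%2Fdir.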

Dependencies

  • PHP 5.5+
  • Elasticsearch 2.0
  • MySQL or MariaDB for WebSub subscriptions
  • Gearman (Debian 9: gearman-job-server, not gearman-server)
  • PHP Gearman extension
  • Console_CommandLine
  • Net_URL2
  • Twig 1.x

Setup

  1. Install and run Elasticsearch and Gearman

  2. Install php-gearman

  3. Get a local copy of the code:

    $ git clone https://git.cweiske.de/phinde.git phinde
    
  4. Install dependencies via composer:

    $ composer install
    
  5. Point your webserver's document root to phinde's www directory

  6. Copy data/config.php.dist to data/config.php and adjust it. Make sure you add your domain to the crawl whitelist.

  7. Create a MySQL database and import the schema from data/schema.sql

  8. Run bin/setup.php which sets up the Elasticsearch schema

  9. Put your homepage into the queue:

    $ ./bin/process.php http://example.org/
    
  10. Start at least one worker to process the crawl+index queue:

    $ ./bin/phinde-worker.php
    
  11. Check phinde's status page in your browser. Both the number of open tasks and the number of workers should be greater than 0.

Re-index when your site changes

When your site changes, the search engine needs to re-crawl and re-index the affected pages.

Simply tell phinde that something changed by running:

$ ./bin/process.php http://example.org/foo.htm

phinde supports HTML pages and Atom feeds, so if your blog has a feed it's enough to let phinde reindex that one. It will find all linked pages automatically.
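
If several pages changed at once, the command above can be wrapped in a loop. This is a sketch only: the function name phinde_queue_urls and the PHINDE_PROCESS override are assumptions made for illustration, and it expects to be run from phinde's directory.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with phinde.
# Queue every URL listed in a file (one per line) for re-crawling.
# PHINDE_PROCESS may point at process.php if it is not in ./bin/.
phinde_queue_urls() {
    list=$1
    while IFS= read -r url || [ -n "$url" ]; do
        case $url in
            ''|'#'*) continue ;;   # skip blank lines and comments
        esac
        # </dev/null keeps the command from consuming the URL list
        "${PHINDE_PROCESS:-./bin/process.php}" "$url" </dev/null || return 1
    done < "$list"
}
```

Call it as phinde_queue_urls changed-urls.txt, where changed-urls.txt is a file name of your choosing.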

Website integration

Adding a simple search form to your website is easy. It needs two things:

  • A <form> tag whose action points to the phinde instance
  • A text field named q

Example:

<form method="get" action="http://phinde.example.org">
  <input type="text" name="q" placeholder="Search text"/>
  <button type="submit">Search</button>
</form>

System service

When using systemd, you can let it run multiple worker instances when the system boots up:

  1. Copy the files data/systemd/phinde*.service into /etc/systemd/system/

  2. Adjust user and group names, and the work directories

  3. Enable three worker processes:

    $ systemctl daemon-reload
    $ systemctl enable phinde@1
    $ systemctl enable phinde@2
    $ systemctl enable phinde@3
    $ systemctl enable phinde
    $ systemctl start phinde
    
  4. Now three workers are running. Restarting the phinde service also restarts the workers.
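
To run a different number of workers, the commands from step 3 can be generated rather than typed. The sketch below assumes that the phinde@N instances can be enabled for any N in the same way as 1 to 3; the function name phinde_worker_cmds is made up for illustration. It only prints the commands so they can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with phinde.
# Print the systemctl commands that enable N worker instances
# (default 3), in the same order as the manual steps.
phinde_worker_cmds() {
    n=${1:-3}
    echo "systemctl daemon-reload"
    i=1
    while [ "$i" -le "$n" ]; do
        echo "systemctl enable phinde@$i"
        i=$((i + 1))
    done
    echo "systemctl enable phinde"
    echo "systemctl start phinde"
}
```

Review the output of phinde_worker_cmds 5, then pipe it to sh as root to apply it.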

Cron job

Run bin/renew-subscriptions.php once a day with cron. It will renew the WebSub subscriptions.
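
A possible crontab entry can be generated as follows. This is a sketch under assumptions: the install path /var/www/phinde, the time of day (04:30) and the function name phinde_cron_line are all made up for illustration, and the path must not contain spaces or % characters.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with phinde.
# Print a crontab line that renews the WebSub subscriptions
# once a day at 04:30 from the given install directory.
phinde_cron_line() {
    dir=${1:-/var/www/phinde}   # assumed install path
    printf '30 4 * * * cd %s && php bin/renew-subscriptions.php\n' "$dir"
}
```

The printed line can be pasted into the crontab of the user that runs phinde, for example via crontab -e.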

Howto

Delete index data from one domain:

$ curl -iv -XDELETE -H 'Content-Type: application/json' -d '{"query":{"term":{"domain":"example.org"}}}' http://127.0.0.1:9200/phinde/_query

This uses the delete-by-query plugin for Elasticsearch 2.0; see https://www.elastic.co/guide/en/elasticsearch/plugins/2.0/delete-by-query-usage.html
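
The same request can be wrapped in a small function that takes the domain as an argument. This is a sketch, not part of phinde: the function names and the PHINDE_ES variable are assumptions, and the domain check deliberately accepts only letters, digits, dots and hyphens so that nothing unexpected ends up in the JSON body.

```shell
#!/bin/sh
# Hypothetical helpers, not shipped with phinde.

# Build the delete-by-query request body for one domain.
phinde_delete_body() {
    printf '{"query":{"term":{"domain":"%s"}}}\n' "$1"
}

# Delete all index data of one domain.
# PHINDE_ES (assumed name) overrides the index URL.
phinde_delete_domain() {
    es=${PHINDE_ES:-http://127.0.0.1:9200/phinde}
    case $1 in
        ''|*[!A-Za-z0-9.-]*)
            echo "invalid domain: $1" >&2
            return 1 ;;
    esac
    curl -sS -XDELETE -H 'Content-Type: application/json' \
        -d "$(phinde_delete_body "$1")" "$es/_query"
}
```

Usage: phinde_delete_domain example.org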

Subscribe to a website/feed

Phinde supports WebSub to subscribe to changes of a website. When phinde gets notified by the website's hub about changes, it will immediately crawl and index the changed pages.

Subscribe to a website's feed:

$ php bin/subscribe.php http://example.org/feed.atom

Phinde will determine the website's hub and send a registration request to it.

The status page shows the number of working subscriptions and the number of open ones.

Unsubscribing also happens on the command line:

$ php bin/unsubscribe.php http://example.org/feed.atom

About phinde

Source code

phinde's source code is available from http://git.cweiske.de/phinde.git or the mirror on GitHub.

License

phinde is licensed under the AGPL v3 or later.

Author

phinde was written by Christian Weiske.
