Skip to content

cmbullock/clucy

 
 

Repository files navigation

ZClucy

Build Status

ZClucy is forked from clucy a Clojure interface to Lucene. There are some enhanced futures in ZClucy :

  1. Supports numeric values (such as int, long, float double)
  2. Supports multivalued fileds.
  3. Supports to sort results
  4. Supports parallel indexing large number of data
  5. Supports schema defined as index's metadata
  6. Supports upsert (Insert documents or update them if they exist)
  7. Supports to only load certain stored fields and fast collect field values for statistics (eg. sum, avg )

Installation

To install Clucy, add the following dependency to your project.clj file:

    [zclucy "0.9.2"]

Usage

To use Clucy, first require it:

    (ns example
      (:require [clucy.core :as clucy]))

Then create an index. You can use (memory-index), which stores the search index in RAM, or (disk-index "/path/to/a-folder"), which stores the index in a folder on disk.

    (def index (clucy/memory-index))

Next, add Clojure maps to the index:

    (clucy/add index
       {:name "Bob", :job "Builder"}
       {:name "Donald", :job "Computer Scientist"})

You can remove maps just as easily:

    (clucy/delete index
       {:name "Bob", :job "Builder"})

Once maps have been added, the index can be searched:

    user=> (clucy/search index "bob" 10)
    ({:name "Bob", :job "Builder"})
    user=> (clucy/search index "scientist" 10)
    ({:name "Donald", :job "Computer Scientist"})

You can search and remove all in one step. To remove all of the scientists...

    (clucy/search-and-delete index "job:scientist")

Manipulate Schema

By default, every field is a string stored, indexed, analyzed and stores norms. You can customise it just like :

(def people-schema {:name {:type "string"} :age {:type "int" }})
(binding [clucy/*schema-hints* people-schema]
;.... do some adding
;.....do some query
)

Or you can add a schema with index when create it :

(def index (clucy/memory-index people-schema))

Then you need not bind schema-hints anymore. Now here two statements get the same result :

(clucy/add ....)
(clucy/add ....)
(binding [clucy/*schema-hints* people-schema]
	(clucy/add ....)
	(clucy/add ....)
)

Then name is still a string stored, indexed, analyzed and stores norms, but age is a int without being analyzed and norms.

Numeric Types

You can add maps with numeric value to the index:

(def people-schema {:name {:type "string"} :age {:type "int"}})
(binding [clucy/*schema-hints* people-schema]
    (clucy/add index
       {:name "Bob", :age (int 20)}
       {:name "Donald", :age (int 35)}))

Once maps have been added, the index can be searched:

	user=> (binding [clucy/*schema-hints* people-schema]
	      (clucy/search index "age:20" 10))
	({:age 20, :name "Bob"})

Or do range query just as :

	user=> (binding [clucy/*schema-hints* people-schema]
   	      (clucy/search index "age:[32 TO 35]" 10))
	({:age 35, :name "Donald"})

Numberic type can be one of int, long, double, float.

Multivalued Fields

You can use clojure collection to manage multivalued fields, eg.

    (clucy/add index
       {:name "Bob", :books ["Clojure Programming" "Clojure In Action"] }
    user=> (clucy/search index "books:action" 10)
       ({:name "Bob", :books ["Clojure Programming" "Clojure In Action"]})

Sort Results

First add some documents with a defined schema

(def people-schema {:name {:type "string"} :age {:type "int" }})

(binding [clucy/*schema-hints* people-schema]
    (clucy/add index
       {:name "Bob", :age (int 20)}
       {:name "Donald", :age (int 35)}))

Then you can sort the result when search them :

user=> ((binding [clucy/*schema-hints* people-schema]
          (clucy/search index "*:*" 10 :sort-by "age desc"))
({:age 35, :name "Donald"} {:age 20, :name "Bob"})

You can sort by several fields just like :

((binding [clucy/*schema-hints* people-schema]
          (clucy/search index "*:*" 10 :sort-by "age desc, name asc"))

Or sort by document number (index order) :

((binding [clucy/*schema-hints* people-schema]
          (clucy/search index "*:*" 10 :sort-by "$doc asc"))

Or sort by document score (relevance):

((binding [clucy/*schema-hints* people-schema]
          (clucy/search index "*:*" 10 :sort-by "$score asc"))

Parallel indexing

When you want to index a large number of data such as data from a large text file with one record per line, you should use padd instead of add

    (with-open [r (clojure.java.io/reader file)]
       (let [stime (System/currentTimeMillis)
               reporter (fn [n] 
                                 (when (= (mod n 100000) 0) ; print process per 100K
                                   (println n " cost:"(- (System/currentTimeMillis) stime)))) ]
                (clucy/padd index reporter 
                      (map 
                              #(let [row (clojure.string/split % #"\s+")] 
                                                {:id (row 0), :name (row 1) })
                                                 (line-seq r)))))

Upsert

Before use upsert , you must define ID fields (just like primary key) in schema, for example use "id" as unique value field :

(def index (clucy/disk-index "mypath" {_id [:id]}, :id {:type "int"},  :name {:type "string"}))

;There's no record so just insert it.
(clucy/upsert  index {:id 1 :name "Tom"})

=>(clucy/search index "*:*" 10)
;({:id 1 :name "Tom"})

;now update Tom to Jack
(clucy/upsert index {:id 1 :name "Jack"})

=>(clucy/search index "*:*" 10)
;({:id 1 :name "Jack"})

Load certain stored fields and fast collect field values for statistics (eg. sum, avg )

(let [index (memory-index people-schema)
                   avg (atom 0) sum (atom 0) max (atom 0) min (atom 0)]
             ;Do some insert here ........
             ;.......
             (search index "age:[34 TO 48]" 100 :fields [:age] ; only load age field
                     :doc-collector (fn [{age :age} i total]
                                                 (println "i:" i ",total: " total ",age:" age)
                                                 (swap! sum + age)
                                                 (when (or (= i 0) (> @min age)) (reset! min age))
                                                 (when (or (= i 0) (> age @max)) (reset! max age))
                                                 (when (= i (dec total)) 
                                                   (reset! avg  (/ @sum total)))))

Default Search Field

A field called "_content" that contains all of the map's values is stored in the index for each map (excluding fields with {:stored false} in the map's metadata). This provides a default field to run all searches against. Anytime you call the search function without providing a default search field "_content" is used.

This behavior can be disabled by binding content to false, you must then specify the default search field with every search invocation.

About

Clojure interface to Lucene

Resources

License

Stars

Watchers

Forks

Packages

No packages published