GitHub - dzmoore/java-lucene-rest: A REST API exemplifying finding "similar" objects using Apache Lucene

Java Lucene REST

A REST project exemplifying finding "similar" objects using Apache Lucene (Java library used by stand-alone server Solr).

Implementation

This project defines a "customer" as being composed of the fields: 'First Name', 'Last Name', 'Email', 'Billing Email', and 'Phone Number'

The core technologies used in this project are:

Java 8
Maven - for project management
(XML-less!) Spring MVC - for web services
HSQLDB - as an embedded data store
Lucene - for searching and similarity matching

Operations Implemented

After downloading or cloning the project, the web services can be run by the command mvn tomcat7:run (or by building and deploying another app server). When running, the following endpoints are accessible:

POST /customer - creates a new customer - returns the new customer instance

example:

curl -i -X POST -d '{"firstName":"dylan", "lastName":"moore", "email":"dzmoore@gmail.com", "billingEmail":"billing@example.com", "phoneNumber":"(999) 123-3456"}'   -H "Content-Type: application/json"   http://localhost:8080/customer

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Sat, 19 Sep 2015 11:02:52 GMT

{"pk":1,"firstName":"dylan","lastName":"moore","email":"dzmoore@gmail.com","billingEmail":"billing@example.com","phoneNumber":"(999) 123-4567"}

GET /customer/{customer-pk} - retrieves a customer instance by a specified pk

example:

curl -i -X GET http://localhost:8080/customer/1

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Sat, 19 Sep 2015 11:03:40 GMT

{"pk":1,"firstName":"dylan","lastName":"moore","email":"dzmoore@gmail.com","billingEmail":"billing@example.com","phoneNumber":"(999) 123-4567"}

PUT /customer - updates an existing customer instance with new values - returns a message with whether or not the operation was successful

example:

curl -i -X PUT -d '{"pk":1,"firstName":"drylan", "lastName":"monore", "email":"email@email", "billingEmail":"billing@email", "phoneNumber":"1234112"}'   -H "Content-Type: application/json"   http://localhost:8080/customer

HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Sat, 19 Sep 2015 11:05:48 GMT

{"result":"Update successful.","error":false}

DELETE /customer/{customer-pk} - deletes a customer instance by the specified pk - returns a message with whether the operation was successful

example:

curl -i -X DELETE http://localhost:8080/customer/1
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Sat, 19 Sep 2015 11:06:48 GMT

{"result":"Update successful.","error":false}

GET /index/customers - updates the index of customer fields

example:

curl -i -X GET http://localhost:8080/index/customers
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Sat, 19 Sep 2015 11:12:28 GMT

{"result":"Customer index started.","error":false}

GET /customer/similar-to/{customer-pk}/{max-number-of-similar-customers}

curl -i -X GET http://localhost:8080/similarity/customer/3/10
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Content-Type: application/json;charset=UTF-8
Transfer-Encoding: chunked
Date: Sat, 19 Sep 2015 11:34:53 GMT

[{"pk":5,"firstName":"Harry","lastName":"Richardson","email":"richardson@gmail.com","billingEmail":"richardson@example.com","phoneNumber":"(123) 123-1365"},{"pk":4,"firstName":"Rary","lastName":"Gichardson","email":"richardson@gmail.com","billingEmail":"richardson@example.com","phoneNumber":"(123) 123-1365"}]

Additional

I initially considered implementing a Levenshtein Distance algorithm to determine the similarity of objects/fields, but then decided that the approach to use a more sophisticated indexing system (Lucene/Solr) would scale much better (and optimizing Levenshtein might require some form of 'indexing' regardless). The current implementation requires that a re-index command be executed if new customer entries are to be considered in the 'similar' list, but a production version of this project would most likely include periodic re-indexing from some type of cron.

Additionally, the project could be tweaked to improve the similarity results. The current implementation uses Levenshtein distance (fuzzy logic with the Lucene queries preceeded by a tilde ~) with the default threshold (0.5) and all non-null keywords OR'd together. Obviously, this might return similar profiles when customer's names are similar, but this would be an easy 'fix' to define something like first and last names being treated as a single unit, or perhaps even given a lower priority in the search by themselves.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
src		src
.gitignore		.gitignore
LICENSE		LICENSE
pom.xml		pom.xml
readme.md		readme.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Java Lucene REST

Implementation

Operations Implemented

Additional

About

Uh oh!

Releases

Packages

Languages

License

dzmoore/java-lucene-rest

Folders and files

Latest commit

History

Repository files navigation

Java Lucene REST

Implementation

Operations Implemented

Additional

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages