
Add options to detection, don't automatically load profiles
1 parent 95f5ff6 commit 8e7ed01ea37ec6f98320af40272334a8764b4e92 @dakrone committed Feb 16, 2012
Showing with 32 additions and 5 deletions.
  1. +28 −5 src/cld/core.clj
  2. +2 −0 test/cld/test/benchmarks.clj
  3. +2 −0 test/cld/test/core.clj
33 src/cld/core.clj
@@ -1,17 +1,40 @@
(ns cld.core
(:use [clojure.java.io :only [resource file]])
- (:import (com.cybozu.labs.langdetect Detector DetectorFactory)))
+ (:import (com.cybozu.labs.langdetect Detector DetectorFactory)
+ (java.util HashMap)))
-(DetectorFactory/loadProfile (file (resource "profiles")))
+(defn load-profiles
+ "Load detection profiles from either a File object or a String path."
+ [file-or-string]
+ (DetectorFactory/loadProfile file-or-string))
(defn detect
"Returns a tuple with the language as the first element and a map of
- languages to their probabilities."
- [^String text]
+ languages to their probabilities. Accepts an optional hash-map of options:
+ :smoothing <double> - Smoothing, defaults to 0.5
+ :max-length <int> - Maximum length of data to read, defaults to all
+ :prior-map <hash-map> - A map of languages to probabilities to use
+ :verbose <boolean> - Use verbose mode, defaults to false"
+ [text-or-reader & [opts]]
(let [^Detector detector (DetectorFactory/create)]
- (.append ^Detector detector text)
+ (when (:smoothing opts)
+ (.setAlpha detector (double (:smoothing opts))))
+ (when (:max-length opts)
+ (.setMaxTextLength detector (:max-length opts)))
+ (when (:prior-map opts)
+ (.setPriorMap detector (HashMap. (:prior-map opts))))
+ (when (:verbose opts)
+ (.setVerbose detector))
+ (.append ^Detector detector text-or-reader)
[(.detect ^Detector detector)
(->> (.getProbabilities ^Detector detector)
(map str)
(map #(vec (.split ^String % ":")))
(into {}))]))
+(defn default-init!
+ "Initialize the DetectorFactory with the included profiles. Will not throw an
+ exception on subsequent invocations."
+ []
+ (defonce _ (load-profiles (file (resource "profiles")))))
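
The threading pipeline at the end of `detect` builds the probability map by stringifying each result and splitting it on `:`. A minimal sketch of that step in isolation, assuming each element prints as `lang:probability`; the sample strings and the helper name `probabilities->map` are made up for illustration and stand in for what `.getProbabilities` returns:

```clojure
;; Stand-in for the stringified results of (.getProbabilities detector).
;; These sample values are invented for illustration only.
(def sample-probabilities ["en:0.85" "fr:0.15"])

(defn probabilities->map
  "Same transformation as the ->> pipeline in detect: split each
  \"lang:prob\" string on the colon and collect the pairs into a map."
  [language-strs]
  (->> language-strs
       (map str)
       (map #(vec (.split ^String % ":")))
       (into {})))

(probabilities->map sample-probabilities)
;; => {"en" "0.85", "fr" "0.15"}
```

Note that the map values are strings, not doubles. A caller that needs numbers would have to parse them, for example with `Double/parseDouble`.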
2 test/cld/test/benchmarks.clj
@@ -3,6 +3,8 @@
(require [criterium.core :as bench]))
(def text "The meaning of the term information retrieval can be very broad. Just getting a credit card out of your wallet so that you can type in the card number is a form of information retrieval. However, as an academic field of study, information retrieval might be defined thus:
Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers).
As defined in this way, information retrieval used to be an activity that only a few people engaged in: reference librarians, paralegals, and similar professional searchers. Now the world has changed, and hundreds of millions of people engage in information retrieval every day when they use a web search engine or search their email.[*]Information retrieval is fast becoming the dominant form of information access, overtaking traditional database-style searching (the sort that is going on when a clerk says to you: ``I'm sorry, I can only look up your order if you can give me your Order ID'').
2 test/cld/test/core.clj
@@ -2,6 +2,8 @@
(:use [cld.core]
(deftest t-detect
(is (= "en"
(first (detect (str "This is a sentence, it is written in "
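
Because profiles are no longer loaded when `cld.core` is required, callers have to initialize the factory themselves before detecting. A usage sketch based only on the functions in this diff, assuming the library is on the classpath; the sample text and option values are arbitrary:

```clojure
(require '[cld.core :as cld])

;; Load the bundled profiles; per the docstring, repeat calls do not throw.
(cld/default-init!)

;; No options: the detector's own defaults are used.
(cld/detect "This is a sentence, and it is written in English.")

;; With options: only the keys present in the map are applied.
(let [[lang probs] (cld/detect "This is a sentence, and it is written in English."
                               {:smoothing 0.5 :max-length 1000})]
  (println "language:" lang)
  (println "probabilities:" probs))
```

To use a different set of profiles, `load-profiles` can be called with a path string or a `File` in place of `default-init!`.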
