Normalization and other routines useful for dealing with library data in Solr
Java
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Failed to load latest commit information.
java
LICENSE
README.textile

README.textile

Don’t use this anymore!

DEPRECATED. Use https://github.com/billdueber/umich_solr_library_filters instead.

What is this?

These are java methods, along with some solr analyzers, that we’ve found useful in implementing our Solr-based OPAC at the University of Michigan University Library.

How do I install it?

For the solr normalizers, just put the .jar file somewhere your java container (Jetty, Tomcat, whatever) will be able to find and use it. If you’re using Solrmarc to index your data, you need to do the same thing — make sure there’s a copy in your lib directory so Solrmarc can find it. The HLB stuff is explained further down.

What Solr filters are here?

These Solr text filters are designed to be run at both index and query time. This, hopefully, allows you to not care what (valid) format your users enter their data in (so, for example, an ISBN10 will get a positive search on the equivalent ISBN13 and vice-versa). Both filters are designed to pass through unchanged anything that doesn’t seem like a valid identifier of the appropriate type.

ISBNLongifier — turn all ISBNs into ISBN13

The ISBNLongifer code takes anything that looks like it might be an ISBN (string of 9 digits followed by a digit or an X) and algorithmically turns it into an ISBN13.

No attempt is made to determine if it’s a legal ISBN with a good checksum! It just does the conversion, doing its very best to ignore leading/trailing garbage and internal hyphens and dots.

In your schema.xml file you’ll want a fieldType definition like this:

<!-- Longify ISBNs -->
    <fieldType name="ISBNLong" class="solr.TextField"  omitNorms="true">
       <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/> 
         <filter class="edu.umich.lib.solr.analysis.ISBNLongifierFilterFactory"/> 
       </analyzer>
     </fieldType>

…and a field definition like this:

<field name="isbn" type="ISBNLong" indexed="true" stored="true" multiValued="true"/>

LCCNNormalizer

LCCNs have a canonical normalization process that I follow here. Not a ton of effort is done to try to clean things up — the idea is that you have a pretty good idea that you have an LCCN before calling this code.

<!-- LCCN normalization on both index and query -->
     <fieldType name="lccnnormalizer" class="solr.TextField"  omitNorms="true">
       <analyzer>
         <tokenizer class="solr.KeywordTokenizerFactory"/> 
         <filter class="edu.umich.lib.solr.analysis.LCCNNormalizerFilterFactory"/> 
       </analyzer>
     </fieldType>
…and later
 <field name="lccn"    type="lccnnormalizer" indexed="true" stored="true"/> 

LC Call Number Normalization

My LCCNCallNumberNormalizer is not the best one around — check out the stuff in the Blacklight project for something more robust.

Other stuff — add High Level Browse (HLB) categories based on LC CallNumber

At the University of Michigan, we have a librarian-maintained database of call-number ranges corresponding to rough academic subject areas. These ranges are allowed to overlap and repeat, so a single call number may appear in many HLB categories. You can see a simple example of these categories in place at the library home page — just “Choose a subject” under the Browse column in the header. They are also in place as a facet in Mirlyn, our catalog using the designation “Academic Subject.”

While the HLB designations will necessarily be idiosyncratic due to their reliance on our local call numbers, they may nonetheless be of interest to folks at other institutions.

The code under HLB pulls a (2-3MB) JSON file from our website reflecting the most recent version of the database upon load, and will then categorize and LC callnumbers you feed it.

The HLB and RangeSet classes both need to be accessible — in our setup, I’ve added them to the solrmarc2/umich/src/ directory.

We use it like this:

 import edu.umich.lib.hlb.HLB;
 ... 
 public Set<String> getHLB3(final Record record) {
    Set<String> callNums = getFieldList(record, "050ab:082a:090ab:099a:086a:086z:852hij");
    Iterator<String> callNumsIter = callNums.iterator();
    
    Set<String> hlb3components = new LinkedHashSet<String>();
    while (callNumsIter.hasNext()) {
            String callNum = (String) callNumsIter.next();
        try{
            hlb3components.addAll(HLB.components(callNum));
        } catch (java.lang.NullPointerException e) {
            // logger.error("NullPointer in HLB3 on '" + callNum + "'");
        }
    }
    return hlb3components;
  }
  
  public Set<String> getHLB3Delimited(final Record record) {
    Set<String> callNums = getFieldList(record, "050ab:082a:090ab:099a:086a:086z:852hij");
    Iterator<String> callNumsIter = callNums.iterator();

    Set<String> hlb3cats = new LinkedHashSet<String>();
    while (callNumsIter.hasNext()) {
      String callNum = (String) callNumsIter.next();
      try {
          hlb3cats.addAll(HLB.categories(callNum));
      } catch (java.lang.NullPointerException e) {
          //      logger.error("NullPointer in HLB3 on '" + callNum + "'");
      }
    }
    return hlb3cats;
  }