brovikov/sunspot_cell

Sunspot Cell (gem)

Note by Brovikov

This gem adds Cell support (for indexing rich documents such as PDF, DOC, HTML, etc.) to Sunspot (compatible with Sunspot 2.5.5). It supports Paperclip or CarrierWave, with S3 storage or local files.

The code is based on the patch included here: github.com/zheileman/sunspot_cell

Requirements

Thanks to Chris Powell, there is now a cookbook-style blog post on getting a Rails 3.2 app to properly index rich documents using this gem: cbpowell.wordpress.com/2012/09/18/indexing-rich-documents-with-rails-sunspot-solr-sunspot-cell-and-carrierwave-cookbook-style/

  • Sunspot gem installed (>= 2.2.5)

  • Solr Cell libraries

Tested with Solr 5.3.1 and the Cell library solr-cell-5.3.1.jar (dist/apache-solr-cell-5.3.1.jar and contrib/extraction/lib/*.jar from the standard Solr distribution), placed in the /solr/lib directory created by the Sunspot gem, in the development environment. Your production setup might vary.

  • Adjustments to the Solr schema.xml:

<fieldType name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

and

<dynamicField name="*_attachment" stored="true" type="text" multiValued="true" indexed="true"/>
<dynamicField name="ignored_*" type="ignored"/>

  • Adjustments to the Solr solrconfig.xml:

    Add a requestHandler for the /update/extract path:

    <requestHandler name="/update/extract" class="solr.handler.extraction.ExtractingRequestHandler">
      <lst name="defaults">
        <str name="fmap.Last-Modified">last_modified</str>
        <str name="uprefix">ignored_</str>
      </lst>
      <!--Optional.  Specify a path to a tika configuration file.  See the Tika docs for details.-->
      <!-- <str name="tika.config">/my/path/to/tika.config</str> -->
      <!-- Optional. Specify one or more date formats to parse.  See DateUtil.DEFAULT_DATE_FORMATS for default date formats -->
      <!--  <lst name="date.formats">
        <str>yyyy-MM-dd</str>
      </lst> -->
    </requestHandler>
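
Once the handler is registered, one way to smoke-test it is to post a file with Solr Cell's extractOnly parameter, which returns the extracted text without indexing anything. The following is only a sketch under assumptions and is not part of this gem: the helper names extract_uri and extract_text are hypothetical, and the base URL used in the usage note is a guess at a typical development setup.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: builds the URL of the /update/extract handler.
# extractOnly=true asks Solr Cell to return the extracted text instead of indexing it.
def extract_uri(solr_base_url, params = {})
  uri = URI.parse("#{solr_base_url.chomp('/')}/update/extract")
  uri.query = URI.encode_www_form({ 'extractOnly' => 'true', 'wt' => 'json' }.merge(params))
  uri
end

# Hypothetical helper: posts a local file to the handler and returns the raw response body.
# Requires a running Solr instance with the request handler configured.
def extract_text(solr_base_url, path)
  uri = extract_uri(solr_base_url)
  request = Net::HTTP::Post.new(uri)
  request['Content-Type'] = 'application/octet-stream'
  request.body = File.binread(path)
  response = Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
  response.body
end
```

For example, extract_text('http://localhost:8982/solr/development', 'sample.pdf') should return the extracted text if the handler and the Cell jars are set up correctly. Both the URL and the file name are placeholders for your own values.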

Install Plugin

Add the sunspot and sunspot_cell gems to your Gemfile:

gem 'sunspot'
gem 'sunspot_solr', github: 'sunspot/sunspot', branch: 'master'
gem 'sunspot_rails', github: 'sunspot/sunspot', branch: 'master', :require => "sunspot_rails"
gem 'sunspot_cell', :git => 'https://github.com/brovikov/sunspot_cell.git'
gem 'progress_bar'

Also take a look at the nice gem github.com/mrcsparker/sunspot_cell_jars. It adds all the jar files needed to make sunspot_cell work with apache-solr-cell.

But pay attention to the tika-core-x.x.jar and tika-parser-x.x.jar files: sunspot_cell_jars includes outdated versions of both (check www.java2s.com/Code/Jar/t/tika.htm to find the correct version for your configuration).

I tested Solr 5.3.1 with the Cell library solr-cell-5.3.1.jar, tika-core-1.3.jar and tika-parser-1.3.jar.
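
Because mismatched Tika jars are easy to overlook, a small check over the jar file names can help. The sketch below is a hypothetical helper, not something this gem or sunspot_cell_jars provides. It only reports which Tika versions are present and whether they agree with each other; it cannot tell you which version is right for your Solr release.

```ruby
# Matches names such as tika-core-1.3.jar, tika-parser-1.3.jar or tika-parsers-1.7.jar.
TIKA_JAR = /\A(tika-(?:core|parsers?))-(\d+(?:\.\d+)*)\.jar\z/

# Hypothetical helper: maps each Tika artifact found in the list to its version string.
def tika_versions(jar_names)
  jar_names.each_with_object({}) do |name, versions|
    match = TIKA_JAR.match(File.basename(name))
    versions[match[1]] = match[2] if match
  end
end

# True when at least two Tika jars are present and they all share one version.
def tika_versions_consistent?(jar_names)
  versions = tika_versions(jar_names)
  versions.size >= 2 && versions.values.uniq.size == 1
end
```

A possible use, assuming the jars live in solr/lib relative to the application root, is tika_versions_consistent?(Dir.glob('solr/lib/*.jar')).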

Usage

class Doc
  searchable do
    text :title
    attachment :file
  end
end

Paperclip & S3 Storage

require 'open-uri'

class Doc
  searchable do
    text :title
    attachment :attached_file
  end

  private

  # remote_full_url must return the full URL of the stored file
  def attached_file
    URI.parse(remote_full_url)
  end
end
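
In the S3 case the attachment field is backed by a method that returns a URI object rather than a local file. As a rough illustration, the sketch below uses a stand-in class (a plain Struct named DocSketch, with no Sunspot or Paperclip involved) and a made-up bucket URL to show what attached_file evaluates to. Both the class name and the URL are assumptions for the example.

```ruby
require 'uri'

# Stand-in for the model: remote_full_url is assumed to hold the public URL of the file.
DocSketch = Struct.new(:title, :remote_full_url) do
  def attached_file
    URI.parse(remote_full_url)
  end
end

doc = DocSketch.new('Report', 'https://example-bucket.s3.amazonaws.com/docs/report.pdf')
file_uri = doc.attached_file

puts file_uri.class # URI::HTTPS
puts file_uri.host
puts file_uri.path
```

With open-uri required, a URI object like this can be opened and read over HTTP, which is why the example in this section requires it.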

About

[UPDATED] Continuation of Cell support for Sunspot
