Implement SolrDriver/GoraDriver for PO.DAAC Integration #84

Open
lewismc opened this issue Feb 1, 2017 · 2 comments

Comments

@lewismc
Collaborator

lewismc commented Feb 1, 2017

As requested by the project team, we should look into extending Mudrod's storage functionality so that we can use Apache Solr as an indexing server. The justification is simple: this is what is in use at PO.DAAC.
We should review both SolrJ and Apache Gora as options before hardcoding anything.

@lewismc lewismc added this to Engine Integration and Deployment in AIST Master Schedule Feb 1, 2017
@fgreg
Collaborator

fgreg commented Feb 3, 2017

After a little preliminary research, I think this is going to be very difficult. A search for "import org.elasticsearch" yields 27 different files that depend directly on the Elasticsearch libraries, so at a minimum we will need to alter these 27 files. I haven't done any further analysis of how hard it will be to extract ES from these files.

  • Occurrences of 'import org.elasticsearch' in Project
    • mudrod-core
      • esiptestbed.mudrod.driver
        • ESDriver.java
      • esiptestbed.mudrod.integration
        • LinkageIntegration.java
      • esiptestbed.mudrod.metadata.pre
        • ApiHarvester.java
      • esiptestbed.mudrod.metadata.structure
        • MetadataExtractor.java
      • esiptestbed.mudrod.ontology.process
        • OntologyLinkCal.java
      • esiptestbed.mudrod.recommendation.pre
        • ImportMetadata.java
        • NormalizeVariables.java
        • SessionCooccurence.java
      • esiptestbed.mudrod.recommendation.process
        • VariableBasedSimilarity.java
      • esiptestbed.mudrod.recommendation.structure
        • HybridRecommendation.java
        • MetadataOpt.java
        • RecomData.java
      • esiptestbed.mudrod.ssearch
        • ClickstreamImporter.java
        • Dispatcher.java
        • Searcher.java
      • esiptestbed.mudrod.ssearch.ranking
        • TrainingImporter.java
      • esiptestbed.mudrod.utils
        • ESTransportClient.java
        • LinkageTriple.java
      • esiptestbed.mudrod.weblog.pre
        • CrawlerDetection.java
        • HistoryGenerator.java
        • ImportLogFile.java
        • LogAbstract.java
        • RemoveRawLog.java
        • SessionGenerator.java
        • SessionStatistic.java
      • esiptestbed.mudrod.weblog.structure
        • Session.java
        • SessionExtractor.java
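
A list like the one above can be regenerated as the refactoring progresses. The following is only a minimal sketch, assuming a plain text match on import lines is good enough; the class name `EsImportScanner` and its method are hypothetical and are not part of Mudrod.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: lists .java files under a root that import a given package prefix.
final class EsImportScanner {

    static List<Path> filesImporting(Path root, String prefix) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> p.toString().endsWith(".java"))
                .filter(p -> importsPrefix(p, prefix))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    // True if any trimmed line starts with "import <prefix>"; unreadable files are skipped.
    private static boolean importsPrefix(Path file, String prefix) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.map(String::trim).anyMatch(l -> l.startsWith("import " + prefix));
        } catch (IOException | UncheckedIOException e) {
            return false;
        }
    }
}
```

Calling `EsImportScanner.filesImporting(root, "org.elasticsearch")` on the source root returns the matching files in sorted order. Static imports are not matched by this sketch.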

@lewismc
Collaborator Author

lewismc commented Feb 3, 2017

No joke, it is a non-trivial codebase amendment. We have two options:

  • essentially rip all ES stuff out and do a direct replacement with SolrJ, or
  • make an attempt to abstract the functionality out into a core Driver interface, which would live in esiptestbed.mudrod.driver.
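
To make the second option concrete, here is a rough sketch of what such an abstraction could look like. Everything in it is an assumption for illustration: `StorageDriver`, `InMemoryDriver`, `SessionCounter`, and their methods do not exist in Mudrod, and a real interface would need to cover bulk indexing, aggregations, and whatever else the 27 files actually use. The in-memory implementation only stands in for an ES- or Solr-backed one so the sketch runs without a server.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical backend-neutral contract; callers would depend on this, not on ES or Solr classes.
interface StorageDriver extends AutoCloseable {
    void index(String collection, String id, Map<String, Object> doc);

    List<Map<String, Object>> search(String collection, String field, Object value);

    @Override
    void close();
}

// In-memory stand-in for a real ES- or Solr-backed implementation.
final class InMemoryDriver implements StorageDriver {
    private final Map<String, Map<String, Map<String, Object>>> store = new LinkedHashMap<>();

    @Override
    public void index(String collection, String id, Map<String, Object> doc) {
        store.computeIfAbsent(collection, k -> new LinkedHashMap<>())
             .put(id, new LinkedHashMap<>(doc));
    }

    @Override
    public List<Map<String, Object>> search(String collection, String field, Object value) {
        Map<String, Map<String, Object>> docs =
            store.getOrDefault(collection, Collections.emptyMap());
        List<Map<String, Object>> hits = new ArrayList<>();
        for (Map<String, Object> doc : docs.values()) {
            if (Objects.equals(value, doc.get(field))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    @Override
    public void close() {
        store.clear();
    }
}

// Example caller: it only knows the interface, so the backend can be swapped.
final class SessionCounter {
    private final StorageDriver driver;

    SessionCounter(StorageDriver driver) {
        this.driver = driver;
    }

    int countByUser(String user) {
        return driver.search("sessions", "user", user).size();
    }
}
```

With something along these lines, classes such as the weblog and recommendation processors would take a `StorageDriver` in their constructor, and the choice between ES and Solr would be confined to the driver package.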

The other issue we need to consider is the performance tradeoff between the Spark + ES integration we currently have (parallelized log ingestion and subsequent processing) and the Spark + Solr alternative (which we still have to design and implement).

I have previously used Lucidworks spark-solr to achieve this. It would be a great place to start.
Right now I think that it may be best for us to

@fgreg fgreg added this to the 03/29/17 milestone Mar 30, 2017
@fgreg fgreg modified the milestones: 05/03/2017, 03/29/17 Apr 26, 2017