
loop through the database to create a full text index of the selected subset #15

Closed
molikd opened this issue Apr 15, 2020 · 16 comments
Labels: data, index

molikd (Collaborator) commented Apr 15, 2020

relates to #14

artunit (Collaborator) commented May 14, 2020

I think one approach to this is to use Solr's Data Import Handler to leverage the database but pull the full text from the file system; the FileReaderDataSource is sort of an example of this.

molikd (Collaborator, author) commented May 14, 2020

We are going to be switching to a Postgres endpoint, so the Data Import Handler might work well. Eric?

artunit (Collaborator) commented May 15, 2020

Solr normally uses a timestamp column to calculate delta entries; in SQLite this column could be maintained with a trigger. Postgres would work well for this too, and it also has an awesome foreign data wrapper.

artunit (Collaborator) commented May 15, 2020

I managed to get my laptop trapped in transit between a repair shop and the campus before the start of the pandemic, so these examples are from a borrowed Windows machine. I used the SQLite table id as the identifier here, but this should be whatever meaningful identifier the dataset allows (sha looks like it has duplicates?). A full import would use the identifiers from the articles table to determine which file to import. I thought this could be done with JSON files, but I think it has to be done with XML inputs. There are some Solr-specific functions we would want to position the data for anyway, so I have some Python code for setting up the XML input (a rough sketch follows the schema below); the basic idea is to link the database with the indexing, to allow "looping through". A Solr delta-import, which is typically what you set up in cron, would keep the index in sync (see the SQL triggers below).

<?xml version="1.0" encoding="utf-8"?>
<dataConfig>
  <dataSource name="ddr1" 
       url="jdbc:sqlite:C:/util/shared/DistantReader/test/test.db" 
       driver="org.sqlite.JDBC"/>
  <dataSource name="ddr2" type="FileDataSource" encoding="UTF-8" />
  <document>
    <!-- processor for database interactions -->
    <entity name="edr1" dataSource="ddr1" pk="id" processor="SqlEntityProcessor" 
         query="SELECT id from articles"
         deltaQuery="SELECT id from articles WHERE timeStamp > '${dataimporter.last_index_time}'"
         deletedPkQuery="SELECT deleted_id as id from deletes WHERE timeStamp > '${dataimporter.last_index_time}'">
        <!-- processor for filesystem interactions -->
        <entity name="edr2" dataSource="ddr2" 
             processor="XPathEntityProcessor" useSolrAddSchema="true" 
             stream="true" forEach="/doc" 
             url="C:\tmp\xml\${edr1.id}.xml"/> 
    </entity>
  </document>
</dataConfig>

There is a deletes table and I added 2 triggers to the database schema:

CREATE TABLE articles (
        id        INTEGER PRIMARY KEY AUTOINCREMENT,
        sha       TEXT,
        title     TEXT,
        journal   TEXT,
        date      TEXT,
        abstract  TEXT,
        doi       TEXT,
        timeStamp DATE
);
CREATE TRIGGER insert_article_timeStamp AFTER INSERT ON articles
BEGIN
  UPDATE articles SET timeStamp = DATETIME('NOW')
  WHERE rowid = new.rowid;
END;
CREATE TABLE deletes (
        id        INTEGER PRIMARY KEY,
        deleted_id INTEGER,
        timeStamp DATE
);
CREATE TRIGGER insert_after_delete AFTER DELETE ON articles
BEGIN
  INSERT INTO deletes (deleted_id,timeStamp) VALUES (OLD.id,DATETIME('NOW'));
END;
CREATE TABLE authors (
        sha     TEXT,
        author  TEXT
);
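
For reference, here is a rough sketch of the kind of Python that could set up those XML inputs. It is only an illustration, not the actual script: the output directory (C:/tmp/xml), the per-article <id>.xml naming, and the field choices are taken from the dataConfig and articles schema above, and anything beyond that is an assumption.

# Sketch of a script that writes one Solr add-doc style XML file per
# article (C:/tmp/xml/<id>.xml), matching the XPathEntityProcessor
# entity above (forEach="/doc", useSolrAddSchema="true").
# Paths and field choices follow the dataConfig and schema above;
# everything else is an assumption.
import sqlite3
import xml.etree.ElementTree as ET
from pathlib import Path

DB_PATH = r"C:/util/shared/DistantReader/test/test.db"
XML_DIR = Path(r"C:/tmp/xml")

def write_doc(row):
    # one <doc> root per file, with a <field name="..."> per column
    doc = ET.Element("doc")
    for name, value in row.items():
        if value is None:
            continue
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    out = XML_DIR / f"{row['id']}.xml"
    ET.ElementTree(doc).write(out, encoding="utf-8", xml_declaration=True)

def main():
    XML_DIR.mkdir(parents=True, exist_ok=True)
    con = sqlite3.connect(DB_PATH)
    con.row_factory = sqlite3.Row
    rows = con.execute(
        "SELECT id, sha, title, journal, date, abstract, doi FROM articles"
    )
    for row in rows:
        write_doc(dict(row))
    con.close()

if __name__ == "__main__":
    main()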

The last time I was involved with something like this, the Solr index was built entirely outside of the DataImportHandler, and delta-imports were used to keep ongoing, and typically more minor, updates in sync.

ericleasemorgan (Owner) commented May 15, 2020 via email (×4)

artunit (Collaborator) commented May 16, 2020

Sorry to be slow on the uptake; I received an email message for each comment. I don't think much could touch SQLite in terms of ease of deployment. Postgres has some unique options for leveraging Solr capabilities at the SQL level, but it might be overkill for this purpose. You definitely don't need Postgres for the Data Import Handler, which in itself is really just a way to keep a Solr index in sync with a database. I will go through a setup on the shared file system; I wanted to sort out the data handler stuff on my laptop first because I tend to lean a lot on the web-based Solr admin interface when setting something like this up. The data handler doesn't impact the index definition at all; it can be as faceted and as extensive as Solr schemas allow.

ericleasemorgan (Owner) commented May 16, 2020 via email

artunit (Collaborator) commented May 18, 2020

"…let's use Solr to index the content of a database I've already filled. It is an SQLite database, and it is located on our shared file system."

I ran through a simple example on the shared file system. I used an XML version of the records, but I didn't realize the full text of the content was now included in the database. SQLite could be the source of all of the content in the index(es) if that's where all of the content from the dataset ends up.

ericleasemorgan (Owner) commented May 18, 2020 via email

artunit (Collaborator) commented May 19, 2020

Sorry, it's a long weekend here in Canada and I am just catching up on today's email. The Solr instance is in /export/scratch/solr/solr-8.5.1, and I ran the import again with the PlainTextEntityProcessor. I misread the source column in the database last night and thought it was the full text, but I think the full text for Solr should come from the file system (since you already have it nicely processed), and everything else could come directly from the database. I am running this version of Solr on port 8984, and you can see the results of the import with:

curl http://localhost:8984/solr/cord/dataimport?command=status

It took just over 11 minutes, but that's only the full text. The status report claims there are no skipped documents, but there are documents which did not resolve to a file. The current config for the data handler stuff is in /export/scratch/solr/solr-8.5.1/server/solr/cord/conf/DIHconfigfile.xml. The field layout still needs to be worked out, as well as the faceting and so on, but Solr works well with a database for this kind of thing, and I don't think the indexing will be terribly onerous. I haven't worked on optimizing Solr at all, and I have never done much with SolrCloud or running multiple servers.
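
For what it's worth, a quick way to see which database rows won't resolve to a file would be something like the sketch below; the database location, text directory, and <id>.txt naming are assumptions about the layout on the shared file system, not the actual paths.

# Hypothetical check: list article identifiers in the database that do
# not have a matching plain-text file on disk. DB_PATH, TEXT_DIR, and
# the <id>.txt naming convention are assumptions, not the real layout.
import sqlite3
from pathlib import Path

DB_PATH = "/export/scratch/cord/test.db"     # assumed location
TEXT_DIR = Path("/export/scratch/cord/txt")  # assumed location

con = sqlite3.connect(DB_PATH)
missing = [
    row[0]
    for row in con.execute("SELECT id FROM articles")
    if not (TEXT_DIR / f"{row[0]}.txt").exists()
]
con.close()

print(f"{len(missing)} articles did not resolve to a file")
for article_id in missing[:20]:
    print(article_id)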

ericleasemorgan (Owner) commented
Art, great work. 'More real soon....

ericleasemorgan (Owner) commented May 19, 2020 via email

@molikd molikd moved this from Tasks to In Progress in The Reader Meets COVID-19 May 22, 2020
@nkmeyers nkmeyers removed this from the indexing enhancements milestone May 26, 2020
ericleasemorgan (Owner) commented
I think we can consider this "done"; we are now successfully able to create study carrels against subsets of CORD.
