DOCUMENT-VECTOR not consistent from node.js to browser #394

Closed
gburgett opened this Issue Jul 20, 2017 · 4 comments

@gburgett
Contributor

gburgett commented Jul 20, 2017

Hello,
I'm trying to build a searchable index of my static web site from my markdown source files. My intention is to use gulp to build the index, then serve it gzipped so that the client can run the search directly in the browser. I know there are probably better ways to do this for a static site, but it's more fun this way.

The problem I ran into is that the DOCUMENT-VECTOR key created in the LevelUp DB on NodeJS for searchable fields seems to be a different format from the one required by search-index.min.js built for the web browser. As far as I can tell here are the relevant sections of code:

dist/search-index.js (this is the bundle intended for the browser):

fields.forEach(function (field) {
  // get vector for whole document field
  that.options.indexes.get(
    'DOCUMENT-VECTOR' + sep + docID + sep + field + sep, function (err, docVector) {
      if (err) console.log(err) // TODO something clever with err
      // const vector = {}
      clause.queryClause.AND[field].forEach(function (token) {

search-index-adder/lib/add.js (this is what is building my search index in gulp):

if (that.storeVector) {
  that.deltaIndex['DOCUMENT-VECTOR' + sep + fieldName + sep + ingestedDoc.id + sep] =
    ingestedDoc.vector[fieldName]
}

You can see that in the adder, fieldName comes first and then the document ID, while in the browser bundle, where I'm doing the searching, docID comes first and then the field name. In fact, the error I get when I search is the following:

'NotFoundError: Key not found in database [DOCUMENT-VECTOR○post/2015/01_which-which.md○tags○]'

and if I unzip and grep my gzipped search index this is the entry I find:

{"key":"DOCUMENT-VECTOR○tags○post/2015/01_which-which.md○","value":{"Development":1,"golang":1,"*":1}}
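To make the mismatch concrete, here is a minimal sketch that rebuilds both keys with the same concatenation order as the two excerpts above. The helper names adderKey and searcherKey are hypothetical and exist only for illustration; they are not part of search-index or search-index-adder.

```typescript
// Hypothetical helper: same concatenation order as search-index-adder/lib/add.js
const adderKey = (sep: string, fieldName: string, docID: string): string =>
  'DOCUMENT-VECTOR' + sep + fieldName + sep + docID + sep

// Hypothetical helper: same concatenation order as the lookup in dist/search-index.js
const searcherKey = (sep: string, docID: string, field: string): string =>
  'DOCUMENT-VECTOR' + sep + docID + sep + field + sep

const sep = '○'
const written = adderKey(sep, 'tags', 'post/2015/01_which-which.md')
const lookedUp = searcherKey(sep, 'post/2015/01_which-which.md', 'tags')

console.log(written)              // the key stored by the adder
console.log(lookedUp)             // the key requested by the browser bundle
console.log(written === lookedUp) // false, so the lookup cannot find the stored entry
```

The two strings match the grep result and the NotFoundError message respectively, which suggests the problem is the order of the key parts rather than the separator character.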

I'm not sure whether you've considered cross-compatibility between Node.js and the browser before, but if you have a suggestion on how to proceed, I'd be happy to work on it and send a pull request as my time allows.

@fergiemcdowall

Owner

fergiemcdowall commented Jul 20, 2017

If you are using search-index in a browser (which is a very cool thing to do), you have to be careful about encodings. Long story short: initializing your search index with keySeparator: '~' should work. Look at the demo to see how this is done (the code is in searchapp.js).

@gburgett

Contributor

gburgett commented Jul 21, 2017

Thanks for the tip! I'm sure that will prevent other errors in the future :) But I'm still getting this error.

'NotFoundError: Key not found in database [DOCUMENT-VECTOR~post/2015/01_which-which.md~tags~]'

And in the index file generated by my gulp task it has this entry:

{"key":"DOCUMENT-VECTOR~tags~post/2015/01_which-which.md~","value":{"Development":1,"golang":1,"*":1}}

This is also affecting a simple text search. For example, the following code causes a similar error:

        const docs = []
        si.search('die zauberflöte').on('data', (doc) => {
          docs.push(doc)
        }).on('end', () => {
          expect(docs).to.have.length(1)
          expect(docs[0].id).to.equal('post/2015/01_euro-trip.md')

          si.close(done)
        })

'NotFoundError: Key not found in database [DOCUMENT-VECTOR~post/2015/01_euro-trip.md~*~]'

@gburgett

Contributor

gburgett commented Jul 21, 2017

I made a workaround for my use case:

import { Transform } from 'stream'

// chalk and searchIndexOptions are assumed to be defined elsewhere in the gulpfile
const invertDocumentVector = () => new Transform({
  objectMode: true,

  transform(obj: any, encoding, callback) {
    // a document vector's value is an object of token counts, e.g. {"golang":1,"*":1}
    const doc = obj as { key: string, value: { [token: string]: number } }
    if (doc.key && doc.key.startsWith('DOCUMENT-VECTOR')) {
      const vector = doc.key.split(searchIndexOptions.keySeparator)
      const tmp = vector[1]   // tmp = fieldName
      vector[1] = vector[2]   // move document ID to correct spot for web bundle
      vector[2] = tmp         // move fieldName to correct spot for web bundle
      doc.key = vector.join(searchIndexOptions.keySeparator)
      console.log(chalk.green('DOCUMENT-VECTOR:'), doc.key)

      callback(null, doc)
      return
    }

    // every other key passes through unchanged
    callback(null, obj)
  },
})

And I drop it into my gulp pipeline here:

index.dbReadStream({ gzip: true })
    .pipe(invertDocumentVector())
    .pipe(JSONStream.stringify(false))
    .pipe(zlib.createGzip())
    .pipe(fs.createWriteStream(file))
    .on('close', () => {

Now the gzipped export of the search index that I'm creating with gulp is in the format the web bundle expects, and my tests no longer cause errors in Firefox :)
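For anyone who wants to check the key swap on its own, outside a stream, here is a minimal sketch of the same idea as a pure function. The name swapDocumentVectorKey is hypothetical, and the function assumes that neither the document ID nor the field name contains the separator; if either does, the split yields a different number of parts and the key is returned unchanged rather than guessed at.

```typescript
// Hypothetical helper: swaps fieldName and docID inside a DOCUMENT-VECTOR key.
// Assumes the layout 'DOCUMENT-VECTOR' + sep + fieldName + sep + docID + sep.
function swapDocumentVectorKey(key: string, sep: string): string {
  const parts = key.split(sep)
  // Expect prefix, fieldName, docID, and an empty string after the trailing separator.
  if (parts.length !== 4 || parts[0] !== 'DOCUMENT-VECTOR' || parts[3] !== '') {
    return key // not a key this sketch understands, leave it alone
  }
  const [prefix, fieldName, docID] = parts
  return [prefix, docID, fieldName, ''].join(sep)
}

const swapped = swapDocumentVectorKey('DOCUMENT-VECTOR~tags~post/2015/01_which-which.md~', '~')
console.log(swapped) // DOCUMENT-VECTOR~post/2015/01_which-which.md~tags~
```

Applying the function twice returns the original key, so it is easy to tell whether an export has already been converted by checking which part looks like a document ID.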

@fergiemcdowall

Owner

fergiemcdowall commented Jul 21, 2017

Glad that you got it fixed. It sounds like you may have uncovered a bug.

If you have time to throw together a gist that reproduces the broken behaviour, then we can get it fixed for you.

gburgett added a commit to gburgett/hugo-search-index that referenced this issue Jul 21, 2017
