Can't seem to get results when searching for indexed text #234

Closed
joshterrill opened this issue Feb 8, 2016 · 27 comments

Comments

@joshterrill
Contributor

I'm trying to write something where I can upload a document, grab the text out of the document and store the text (and filename and mimetype) in search-index.

I've stored a document just fine, but when I go to search through text in it, nothing seems to ever come up.

So when I put wildcards (*) in both the search fields, I get the document that I have stored in the database:

app.post('/search', function(req, res) {
    var q = {};
    q.query = {
      '*': ['*']
    };
    si.search(q, function (err, searchResults) {
      if (!err) {
        res.json({results: searchResults});
      } else {
        res.json({error: err});
      }
    });
});


But when I search for a word that is inside the text field, like the word "traditional", I get no results:

app.post('/search', function(req, res) {
    var q = {};
    q.query = {
      '*': ['traditional']
    };
    si.search(q, function (err, searchResults) {
      if (!err) {
        res.json({results: searchResults});
      } else {
        res.json({error: err});
      }
    });
});


@fergiemcdowall
Owner

Hmmm- you definitely have a document in the index, and your query syntax seems to be OK (disclaimer: might be missing something- have not had morning coffee yet). This means that there may be something unusual about the way you are indexing it. The examples and tests should provide some clues. If you can post a test case which indexes the document and reproduces the error, then we can fix it for you.

@joshterrill
Contributor Author

@fergiemcdowall Well, all my code is hosted here if you want to take a look at it: https://github.com/joshterrill/doc-search
If you're on a Linux machine, you should be able to run npm run deploy to download the server dependencies, node modules, and bower dependencies, and then start the server. When you go to http://localhost:3000/extract there is a document upload form that should support uploading RTFs (the only kind of document I've tested so far with catdoc and textract). The document gets posted to /extract, then catdoc parses the content of the RTF into plain text, and search-index stores it.

@fergiemcdowall
Owner

Your code looks OK, there are no obvious errors.

We can only really look at contained test cases. If you can throw one together, we can fix it for you. Or you might find out what the problem is in the process of making the test case :)

@joshterrill
Contributor Author

I'm not really sure how to create a test case. But I'll look into this stuff again.

@joshterrill
Contributor Author

Okay, something is not right here. I created a route that basically just creates a simple object and pushes that through:

app.get('/test', function(req, res) {
    var pushObject = {"test": "123", "test2": "456"};
    si.add(pushObject, function (err) {
      if (!err) {
        res.json({success: true});
      } else {
        res.json({error: err});
      }
    });
});

And when I go to search it with:

    var q = {};
    q.query = {
      "*": ["123"]
    };
    si.search(q, function (err, searchResults) {
      if (!err) {
        res.json({results: searchResults});
      } else {
        res.json({error: err});
      }
    })

I still get 0 results....

@joshterrill
Contributor Author

I even followed this quickstart guide verbatim and can't get it to return results...

https://github.com/fergiemcdowall/search-index/blob/master/doc/quickstart.md

@ethanrubinson
Contributor

Hey there, I believe I'm running into a very similar problem to joshterrill's. I'm indexing a document (markdown syntax) with the fields "title" and "body", then searching for it like so:

q.query = {
      "title": searchString.split(" "),
      "body" : searchString.split(" ")
    };

However, it only ever seems to return results if the search string exactly matches a word in the title.

Update

If I use the query:

q.query = {
      "*": searchString.split(" ")
    };

It works fine (i.e., it finds occurrences of the search terms in the body). Is something wrong with our query syntax for searching on multiple fields? Am I correct in assuming that searching on multiple fields behaves like an OR rather than an AND?
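One thing worth ruling out when building the term array this way: searchString.split(" ") produces empty strings when the input has repeated or trailing spaces, and an empty term may not match anything. Below is a minimal sketch of a helper that builds the query object in the shape used in this thread. The name buildQuery and its parameters are made up for illustration and are not part of search-index.

```javascript
// Hypothetical helper (not part of search-index): turn a raw search string
// into the query shape used in this thread: { query: { field: [terms] } }.
function buildQuery(searchString, fields) {
  // Split on runs of whitespace and drop empty strings, so that
  // 'donald  duck ' yields ['donald', 'duck'] rather than
  // ['donald', '', 'duck', ''] as a plain split(' ') would.
  var terms = searchString.split(/\s+/).filter(function (term) {
    return term.length > 0;
  });

  var query = {};
  fields.forEach(function (field) {
    query[field] = terms.slice(); // give each field its own copy of the terms
  });
  return { query: query };
}

var q = buildQuery('donald  duck ', ['*']);
console.log(JSON.stringify(q)); // {"query":{"*":["donald","duck"]}}
```

The returned object can be passed to si.search in place of the hand-built q above.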

@joshterrill
Contributor Author

I can't even get anything to return, even if I match a word exactly as it is in the db. Like I said, I can't even get the example in the quickstart guide to work. Seems like it may be a problem with the module?

@ethanrubinson
Contributor

Try wrapping your object in an array:

var pushObject = [{"test": "123", "test2": "456"}];

Also, it's good practice to empty the index before adding stuff (if you only add stuff once)... just in case:

function addStuffToIndex(si, stuff, callback) {
  si.empty(function (err) {
    if (err) {
      callback(err);
    } else {
      si.add(stuff, function (err) {
        if (err) {
          callback(err);
        } else {
          callback(null);
        }
      });
    }
  });
}

@joshterrill
Contributor Author

Still does not work.

@ethanrubinson
Contributor

Try downgrading to search-index 0.6.15? You'll have to delete the module from node_modules/, change the version in package.json, and run npm install again.

Try changing your initialization options to this as well:

var siOptions = {
  indexPath: 'si',
  fieldsToStore: 'all'
}

If neither of those works, I'll fork your repo and give it a whirl tomorrow. Apart from the other issue I'm facing, everything works great for me.

Pro Tip: Syntax highlight your code like this:

```javascript
var something = function (put, your) {
// code here
}
```

@fergiemcdowall
Owner

@ethanrubinson yes, you need to arrayify search terms like so:

q.query = {
      "body": ['donald', 'duck']  // return documents containing either 'donald' or 'duck'
};

...so that you can do phrase search like so (if you have indexed with ngrams of 2 or more):

q.query = {
      "body": ['donald duck']  // returns documents with the phrase 'donald duck'
};

And yes, emptying before reindexing is faster. FYI, empty has been renamed to flush in the latest versions.

@joshterrill
Contributor Author

@ethanrubinson I tried downgrading and still no luck. I'm not sure what's going on... Could you see if you can even get the example in the quickstart guide to work? I linked to it a few posts above. You can try to fork my repo if you want, but to get the document uploading and parsing to work you'll need catdoc installed, which is pretty easy on a Linux box. I've put the command that installs it in an npm script in package.json.

@ethanrubinson
Contributor

@fergiemcdowall

I am arrayify-ing the search terms: searchString.split(" ") returns an array ("Search Terms 12 3" --> ["Search", "Terms", "12", "3"]). What I was wondering is whether the search terms need to show up in both queried fields, or in either one, for the corresponding doc to be returned.

In other words, if I search on fields 'title', and 'body', for the word "App", does "App" need to show up in both the title and the body for the doc to be returned... or just in one?

@joshterrill

Yes, I can get the quickstart to work. Try it on another machine maybe? A vanilla Amazon EC2 instance would work fine

@fergiemcdowall
Owner

If I understand you correctly: both

Each <fieldName>: [<searchterms>] clause has to be true for the document to be returned. The search terms are AND-y, meaning that they all have to be present in the specified field.
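To make that rule concrete, here is a plain-JavaScript model of it. This is an illustration of the semantics described in this comment, not search-index's actual implementation. The function names, the lowercasing, the tokenizer, and the treatment of '*' as "any field" are all assumptions made for the sketch.

```javascript
// Illustration only: a toy model of the matching rule, not search-index code.

// Assumed tokenizer: lowercase, split on non-word characters.
function tokenize(text) {
  return String(text).toLowerCase().split(/\W+/).filter(Boolean);
}

// True when every term appears in the named field.
// '*' is modeled as "any field": the tokens of all fields are pooled.
function fieldContainsAll(doc, fieldName, terms) {
  var fieldNames = fieldName === '*' ? Object.keys(doc) : [fieldName];
  var tokens = [];
  fieldNames.forEach(function (name) {
    if (doc[name] !== undefined) {
      tokens = tokens.concat(tokenize(doc[name]));
    }
  });
  return terms.every(function (term) {
    return tokens.indexOf(term.toLowerCase()) !== -1;
  });
}

// A document matches only if every field clause in the query is satisfied.
function matches(doc, query) {
  return Object.keys(query).every(function (fieldName) {
    return fieldContainsAll(doc, fieldName, query[fieldName]);
  });
}

var doc = { title: 'App release notes', body: 'The new version is out' };
console.log(matches(doc, { title: ['app'], body: ['app'] }));     // false
console.log(matches(doc, { '*': ['app'] }));                      // true
console.log(matches(doc, { title: ['app'], body: ['version'] })); // true
```

Under this model, querying the same terms against both "title" and "body" only succeeds when each field contains all of them, which would explain why the '*' query finds more documents.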

@fergiemcdowall
Owner

@joshterrill I will try to post a gist to get you started if I get time today

@joshterrill
Contributor Author

I'm searching for "*":["test"], which is a word that I have stored in my database. It's still not showing up in the results when I query it, though.

@ethanrubinson
Contributor

@fergiemcdowall Ah, yes that is exactly what I was asking. I'm going to submit a PR for your search.md doc, it's not clear on there if that was the case 👍

Thanks!

@joshterrill
Contributor Author

@fergiemcdowall thanks!

@joshterrill
Contributor Author

I actually just got this to work. I have no idea what I did wrong, but it had something to do with the si options that I had set. And I used the options that were recommended in the docs.

Here's what I had:

var siOptions = {
  deletable: true,
  fieldedSearch: true,
  indexPath: 'si',
  logLevel: 'error',
  nGramLength: 5,
  fieldsToStore: 'all'
}

I deleted all of the options except indexPath and now it works.

@fergiemcdowall
Owner

👍

@joshterrill
Contributor Author

Could you explain why those options weren't yielding any results? I'm confused by it a tad.

@fergiemcdowall
Owner

The code snippets you have posted look fine and should work. Maybe you have found a bug. We would love to get a reproducible test case (a simple script that adds a document, searches, and shows the breakage) so that it can be debugged. Then a proper answer can be given.
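For anyone unsure what such a test case might look like, here is a rough sketch of its shape: add a document, search, report the result. The reproduce function only assumes an object with add(docs, callback) and search(query, callback), which is how si is called elsewhere in this thread. createFakeIndex is a hypothetical in-memory stand-in (including its { totalHits, hits } result shape) so that the script runs without the library. To reproduce a real problem, pass an actual search-index instance created with the options under suspicion.

```javascript
// Sketch of a self-contained reproduction script.
// `si` is any object exposing add(docs, callback) and search(query, callback),
// mirroring how si is called in this thread.
function reproduce(si, docs, query, done) {
  si.add(docs, function (addErr) {
    if (addErr) return done(addErr);
    si.search(query, function (searchErr, results) {
      if (searchErr) return done(searchErr);
      done(null, results);
    });
  });
}

// Hypothetical in-memory stand-in for the index (NOT search-index itself).
// It only understands queries of the form { query: { '*': [terms] } }.
function createFakeIndex() {
  var stored = [];
  return {
    add: function (docs, callback) {
      stored = stored.concat(docs);
      callback(null);
    },
    search: function (q, callback) {
      var terms = q.query['*'] || [];
      var hits = stored.filter(function (doc) {
        var text = Object.keys(doc).map(function (key) {
          return doc[key];
        }).join(' ');
        return terms.every(function (term) {
          return text.indexOf(term) !== -1;
        });
      });
      callback(null, { totalHits: hits.length, hits: hits });
    }
  };
}

reproduce(
  createFakeIndex(),
  [{ test: '123', test2: '456' }],
  { query: { '*': ['123'] } },
  function (err, results) {
    console.log(err || results.totalHits);
  }
);
```

Running the same script once per option set (for example with and without nGramLength) would narrow down which option makes the hits disappear.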

@joshterrill
Contributor Author

Okay I'll write a test case tomorrow and see what I come up with.

@eklem
Collaborator

eklem commented Feb 15, 2016

Just saw a possible error in your siOptions, @joshterrill. Not sure if your nGramLength definition will work. I figured this out the hard way a while ago, but got it working by defining multiple nGramLength values as an array:

{
  ...
  nGramLength: [1, 2, 3],
  ...
}

@fergiemcdowall
Owner

nGramLength should work with a single integer, so if it doesn't then that is a bug

@eklem
Collaborator

eklem commented Feb 15, 2016

Ok, from what I remember, this is what happened when just using a single integer higher than 1.
#220

The matcher worked, but search did not. Something in the code may have changed since then, though.
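As background for the nGramLength discussion, the sketch below shows what word n-grams of different lengths look like. It is only an illustration of the general idea, not search-index code, and the nGrams function is hypothetical. It shows one possible explanation: if an index built with a single length of 5 stored only 5-word grams, a one-word query would have nothing to match. Whether search-index actually behaved that way is not established in this thread.

```javascript
// Illustration only: generate word n-grams for one or more lengths.
// `lengths` may be a single integer or an array of integers.
function nGrams(tokens, lengths) {
  var sizes = Array.isArray(lengths) ? lengths : [lengths];
  var grams = [];
  sizes.forEach(function (n) {
    // Slide a window of n tokens across the token list.
    for (var i = 0; i + n <= tokens.length; i++) {
      grams.push(tokens.slice(i, i + n).join(' '));
    }
  });
  return grams;
}

var tokens = ['a', 'traditional', 'approach', 'to', 'search', 'indexing'];

// With lengths [1, 2], single words are among the generated grams.
console.log(nGrams(tokens, [1, 2]).indexOf('traditional') !== -1); // true

// With a single length of 5, only 5-word grams exist.
console.log(nGrams(tokens, 5));
// [ 'a traditional approach to search',
//   'traditional approach to search indexing' ]
console.log(nGrams(tokens, 5).indexOf('traditional') !== -1); // false
```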
