Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature request: full text search with tf-idf Scoring #3211

Open
pjebs opened this issue Mar 26, 2019 · 8 comments
Open

feature request: full text search with tf-idf Scoring #3211

pjebs opened this issue Mar 26, 2019 · 8 comments

Comments

@pjebs
Copy link
Contributor

@pjebs pjebs commented Mar 26, 2019

Bleve supports tf-idf Scoring.

It would be great if there was a function that could filter based on scoring by relevance. The function would accept an argument where we set our own threshold value.

@pjebs

This comment has been minimized.

Copy link
Contributor Author

@pjebs pjebs commented Mar 26, 2019

Bleve also supports fuzzy search which would be great.

@srfrog srfrog added the kind/feature label Mar 26, 2019
@srfrog

This comment has been minimized.

Copy link
Contributor

@srfrog srfrog commented Mar 26, 2019

Bleve also supports fuzzy search which would be great.

We evaluated Bleve's fuzzy search but it didn't fit with our data model so we made our own. The master branch has fuzzy matching, it will be available in a future release.

@niravmehta

This comment has been minimized.

Copy link

@niravmehta niravmehta commented Apr 20, 2019

We want to build an auto-complete type of search and need to search across multiple predicates for a match, and then show results based on some ranking.

How can we perform full text search in a given set of predicates? And return results based on some scoring?

@quanghaisoft

This comment has been minimized.

Copy link

@quanghaisoft quanghaisoft commented Jun 6, 2019

How can we perform full text search in a given set of predicates? And return results based on some scoring?

@MichelDiz

This comment has been minimized.

Copy link
Member

@MichelDiz MichelDiz commented Jun 26, 2019

This feature does not currently exist. There is not much to do. But I have an idea that can help you. But you will have to use Facets.

https://docs.dgraph.io/query-language/#sorting-using-facets

The idea is to map the text by your application and then create a word structure. So you can track its usage. The only problem is that if you need the words to be unique in the system, you will need to manually add the number of repetitions. Inside the facets by the Map_text_body edge. Otherwise you would create several nodes of equal words. You can do it, but I think it would be kind of messy.

Using this structure you can create a system that analyzes words. You can identify texts that use a specific word thus identifying similarities. Let's say several texts contain the word "Dgraph". You can search for the word "Dgraph" and use @ reverse to identify which texts use it and also to collect values in facets.

An auto-complete system would be easily built. For you would just collect the mapped words in "Map_text_body". And you can also rank using values on facets.

Here is a map of the structure itself. (Remember, this is just an idea for now)

You have the node with the text body and you have an edge that will contain the words mapped to nodes. Each word is a node.

Captura de Tela 2019-06-26 às 11 54 16

Below I explain a little better.

The Schema

Map_text_body: [uid] @reverse . #pay attention, this [uid] syntax is just for master or Dgraph 1.1
# Map_text_body: uid @reverse . #Use this for oldest versions.
name: string @index(exact) .
Text_body: string .

Mutation

{
  set {
    _:Blank <name> "My Text Node" .
    _:Blank <Text_body> "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for itlet us know \n Our indiscretion sometimes serves us well ..." .

   #The mapping
    _:Blank <Map_text_body> _:Sir (times=1) .
    _:Blank <Map_text_body> _:heart (times=1) .
    _:Blank <Map_text_body> _:in_ (times=2) .
    _:Blank <Map_text_body> _:there (times=1) .
    _:Blank <Map_text_body> _:was (times=1).
    _:Blank <Map_text_body> _:my (times=1).
    
   #Creating the mapped words
    _:Sir 		<name> "Sir" .
    _:heart		<name> "heart" .
    _:in_ 		<name> "in" .
    _:there 	        <name> "there" .
    _:was 		<name> "was" .
    _:my 		<name> "my" .

    # Please add more words based on the "Text_body"
  }
}

Query

{
  q(func: eq(name,"My Text Node")) {
    uid
    Text_body
    Map_text_body @facets(orderdesc: times) {
      uid
      name
    }
  }
}

Response

{
  "data": {
    "q": [
      {
        "uid": "0x2",
        "Text_body": "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for it—let us know \n Our indiscretion sometimes serves us well ...",
        "Map_text_body": [
          {
            "uid": "0x5",
            "name": "in",
            "Map_text_body|times": 2
          },
          {
            "uid": "0x3",
            "name": "Sir",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x4",
            "name": "heart",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x6",
            "name": "there",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x7",
            "name": "was",
            "Map_text_body|times": 1
          },
          {
            "uid": "0x8",
            "name": "my",
            "Map_text_body|times": 1
          }
        ]
      }
    ]
  },
  "extensions": {
    "server_latency": {
      "parsing_ns": 15859,
      "processing_ns": 1857051,
      "encoding_ns": 598937
    },
    "txn": {
      "start_ts": 25
    }
  }
}

Using Facets you can do various functions in Dgraph. Like assign Facet values ​​to a variable, Sorting, Aggregation, Filtering and so on.

BTW

You will soon be able to use Upsert Block, this will make this process faster and more automatic.

But of course, you will have to map and create the block (write the query) manually or by some process that you have created in your application.

https://docs.dgraph.io/master/mutations/#upsert-block

e.g:

 upsert {
      query {
		#Check if the text already exist
        var(func: eq(name,"My Text Node")) {
          v_Text as uid
        }

		#Check if any words already exist
        var(func: has(type_mapped_word)) @filter(eq(name,"Sir")) {
          v_Sir as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"heart")) {
          v_heart as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"in")) {
          v_in as uid
        }
        var(func: has(type_mapped_word)) @filter(eq(name,"there")) {
          v_there as uid
        }
    # You're free to define the form that will categorize the type.
    # Here I am using "type_mapped_word".

      }
      mutation {
          set {
           uid(v_Text) <name> "My Text Node" .
           uid(v_Text) <Text_body> "Sir, in my heart there was a kind of fighting \n That would not let me sleep. Methought I lay \n Worse than the mutines in the bilboes. Rashly— \n And prais'd be rashness for itlet us know \n Our indiscretion sometimes serves us well ..." .
           uid(v_Text) <Map_text_body>  uid(v_Sir) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_heart) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_in) (times=2) .
           uid(v_Text) <Map_text_body> uid(v_there) (times=1) .
           uid(v_Text) <Map_text_body> uid(v_was) (times=1).
           uid(v_Text) <Map_text_body> uid(v_my) (times=1).
    
           uid(v_Sir) 		<name> "Sir" .
           uid(v_Sir)) 		<type_mapped_word> "" .

           uid(v_heart)		<name> "heart" .
           uid(v_heart)		<type_mapped_word> "" .

           uid(v_in) 		<name> "in" .
           uid(v_in) 		<type_mapped_word> "" .

           uid(v_there) 	<name> "there" .
           uid(v_there) 	<type_mapped_word> "" .

           uid(v_was) 		<name> "was" .
           uid(v_was) 		<type_mapped_word> "" .

           uid(v_my) 		<name> "my" .
           uid(v_my) 		<type_mapped_word> "" .
          }
      }
    }
@niravmehta

This comment has been minimized.

Copy link

@niravmehta niravmehta commented Jun 27, 2019

Interesting approach. If I understand this right, it's basically maintaining our own index of search words. But using Facets will allow tf-idf scoring.

What we're doing at the moment is to add a predicate - "search_index" to each node, and concatenate all searchable values into single field. Then searching within that index predicate.

This works as a simple search for now, but of course it does not do any scoring.

Of course, we'd love for this to be built-in ;-) Just the ability to search across multiple indexed predicates would work.

@MichelDiz

This comment has been minimized.

Copy link
Member

@MichelDiz MichelDiz commented Jun 27, 2019

Yes, exactly, it is your own index of search words with tf-idf scoring.

I think it would be great to implement this as built-in. Since we use KV indexing, it would be easy to implement this. But it does require some work and thought.

@campoy , what do you think of this feature? Can you define exp, status, and priority flags?

@campoy

This comment has been minimized.

Copy link
Contributor

@campoy campoy commented Jun 28, 2019

Thanks for the feature request, it does sound like something we could definitely add.

I'll be working on our roadmap once 1.1 is released (late July) and will take this feature into account then.

@campoy campoy added the area/indexes label Sep 15, 2019
@campoy campoy added this to the Dgraph v1.2 milestone Sep 15, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
6 participants
You can’t perform that action at this time.