Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feature: enhance Text.contains("word") #1652

Merged
merged 6 commits into from
Dec 7, 2021
Merged

feature: enhance Text.contains("word") #1652

merged 6 commits into from
Dec 7, 2021

Conversation

coderzc
Copy link
Member

@coderzc coderzc commented Nov 12, 2021

implement feature #1649

  • support entire word match Text.contains("(word)")
  • support custom words match Text.contains("(word1|word2|word3)")
  • support tokenizer words match Text.contains("word"), also contain word == propValue

@codecov
Copy link

codecov bot commented Nov 12, 2021

Codecov Report

Merging #1652 (9152ecb) into master (83180fb) will increase coverage by 2.51%.
The diff coverage is 90.00%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master    #1652      +/-   ##
============================================
+ Coverage     64.40%   66.92%   +2.51%     
- Complexity     6897     7065     +168     
============================================
  Files           421      421              
  Lines         34675    34682       +7     
  Branches       4803     4804       +1     
============================================
+ Hits          22334    23211     +877     
+ Misses        10058     9124     -934     
- Partials       2283     2347      +64     
Impacted Files Coverage Δ
...du/hugegraph/backend/tx/GraphIndexTransaction.java 83.72% <90.00%> (-0.10%) ⬇️
...va/com/baidu/hugegraph/auth/HugeAuthenticator.java 34.78% <0.00%> (-5.44%) ⬇️
...va/com/baidu/hugegraph/task/ServerInfoManager.java 71.34% <0.00%> (-2.81%) ⬇️
...ain/java/com/baidu/hugegraph/task/TaskManager.java 69.06% <0.00%> (-1.44%) ⬇️
...in/java/com/baidu/hugegraph/auth/HugeResource.java 77.92% <0.00%> (-1.30%) ⬇️
.../java/com/baidu/hugegraph/auth/RolePermission.java 90.58% <0.00%> (-1.18%) ⬇️
...egraph/backend/store/cassandra/CassandraStore.java 71.79% <0.00%> (-0.86%) ⬇️
...om/baidu/hugegraph/task/StandardTaskScheduler.java 75.85% <0.00%> (+0.24%) ⬆️
.../baidu/hugegraph/backend/query/ConditionQuery.java 85.94% <0.00%> (+0.26%) ⬆️
... and 20 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 83180fb...9152ecb. Read the comment docs.

@@ -76,6 +76,10 @@
TEXT_CONTAINS("textcontains", String.class, String.class, (v1, v2) -> {
return v1 != null && ((String) v1).contains((String) v2);
}),
TEXT_CONTAINS_ENTIRE("textcontainsentire", String.class, String.class,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my suggestion is to just enhance GraphIndexTransaction.segmentWords instead of adding textcontainsentire operator, check whether there are multiple special symbols in the text param, and then split it into multiple search terms.
we can support form like Text.contains("(word)") and Text.contains("word1|word2|word3")

Assert.assertEquals(5, vertices.size());
Assert.assertTrue(vertices.contains(vertex2));

vertices = g.V().has("name", Text.contains("(秦始皇)")).toList();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add case "秦始皇" & "秦朝", and query by "秦始皇" or "秦"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which text analyzer can gain the word『秦』?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can try ik max_word

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ik max_word text analyzer only can gain [秦始皇, 始皇]

@coderzc coderzc changed the title feature: support entire word match Text.contains("(word)") feature: enhance Text.contains(text) Nov 16, 2021
@coderzc coderzc requested a review from javeme November 16, 2021 11:38
@coderzc coderzc changed the title feature: enhance Text.contains(text) feature: enhance Text.contains("word") Nov 16, 2021
String[] split = StringUtils.split(text, WORD_DELIMITER);
return Sets.newHashSet(split);
}
// Add original text, retain word == propValue
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

improve comment

support Text.contains("(word)") and Text.contains("word1|word2|word3")
*/
if (text.startsWith(START_SYMBOL) && text.endsWith(END_SYMBOL)) {
return Sets.newHashSet(text.substring(1, text.length() - 1));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep ImmutableSet.of() style?

/*
Enhance segmentWords.
support Text.contains("(word)") and Text.contains("word1|word2|word3")
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/*
 * Support 3 kinds of query:
 *  - Text.contains("(word)"): query by user-specified word;
 *  - Text.contains("word1|word2|word3"): query by user-specified words;
 *  - Text.contains("words"): query by words splitted from analyzer;
 * /

}
/*
Add original text to segment set.
Let Text.contains("word") also contain `(word == propValue)`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not mean what the comment represents

javeme
javeme previously approved these changes Nov 17, 2021
String[] texts = StringUtils.split(text, WORD_DELIMITER);
return ImmutableSet.copyOf(texts);
}
Set<String> segment = this.textAnalyzer.segment(text);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to segments

return ImmutableSet.copyOf(texts);
}
Set<String> segment = this.textAnalyzer.segment(text);
segment.add(text);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't add text to segment

Copy link
Member Author

@coderzc coderzc Nov 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if don't add text to segment, will can't exact match words

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

textAnalyzer.segment() would ensue words the same between insertion and query.

@@ -903,7 +903,10 @@ private boolean matchSearchIndexWords(String propValue, String fieldValue) {
return ImmutableSet.copyOf(texts);
}
Set<String> segments = this.textAnalyzer.segment(text);

// Add original words to segments, in order to can be match fully words
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for more specific like: "Add original text to segments at the insertion stage, in order to can match fully words at the query stage."

javeme
javeme previously approved these changes Dec 1, 2021
@javeme
Copy link
Contributor

javeme commented Dec 2, 2021

@zhoney zhoney merged commit 3b623bc into master Dec 7, 2021
@zhoney zhoney deleted the enhance_contains branch December 7, 2021 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants