-
Notifications
You must be signed in to change notification settings - Fork 3.8k
Add basic text analysis to SAI: #2465
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
mike-tr-adamson
commented
Jul 5, 2023
- Adds the case_sensitive, normalize and ascii options to the index.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since CASSANDRA-18521 CQLTester#waitForIndexQueryable requires an argument specifying the index name. Its absence here produces a build error. Maybe it has been missed during rebase?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that also CQLTester#waitForTableIndexesQueryable can be used to just wait for all the indexes used by the table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since waitForTableIndexesQueryable was added to createIndex, I have removed these calls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it's simpler to get rid of ANALYZABLE_TYPES and TypeUtil#isIn and just use (type instanceof StringType) to check if a type is analyzable?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: can be protected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add @Override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this @Override is missed in the last commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, not sure how I missed that one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add @Override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add @Override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: add @Override
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: no need to break the line
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems there is some code duplication across the tests in this class. Maybe we could use a utility method such as, for example:
private void test(String input, String expected, AbstractAnalyzer analyzer) throws Exception
{
ByteBuffer toAnalyze = ByteBuffer.wrap(input.getBytes());
analyzer.reset(toAnalyze);
ByteBuffer analyzed = null;
while (analyzer.hasNext())
{
analyzed = analyzer.next();
}
String result = ByteBufferUtil.string(analyzed);
assertEquals(expected, result);
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I've done this and generally simplified / tidied this test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks nice now :)
|
@adelapena @bereng Please note that I have just made a number of changes for the Lucene 9.7 upgrade. |
- Adds the case_sensitive, normalize and ascii options to the index.
7051147 to
6a9c9c1
Compare
| { | ||
| NonTokenizingOptions options = NonTokenizingOptions.getDefaultOptions(); | ||
|
|
||
| assertNotEquals("nip it in the bud", getAnalyzedString("Nip it in the bud", options)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe assertEquals("Nip it in the bud", getAnalyzedString("Nip it in the bud", options)); achieves the same and is more strict by verifying what is exactly returned?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, I'm not entirely sure why the original test did this because it would have failed on any string.
|
@mike-tr-adamson I think it would be good to add a dtest quickly verifying that tokenisation and analysis don't break RFP (classic CASSANDRA-8272). I think something like this would work: public class ReplicaFilteringProtectionTest extends TestBaseImpl
{
private static final int REPLICAS = 2;
@Test
public void testRFPWithIndexTransformations() throws IOException
{
try (Cluster cluster = init(Cluster.build()
.withNodes(REPLICAS)
.withConfig(config -> config.set("hinted_handoff_enabled", false)
.set("commitlog_sync", "batch")).start()))
{
String tableName = "sai_rfp";
String fullTableName = KEYSPACE + '.' + tableName;
cluster.schemaChange("CREATE TABLE " + fullTableName + " (k int PRIMARY KEY, v text)");
cluster.schemaChange("CREATE CUSTOM INDEX ON " + fullTableName + "(v) USING 'StorageAttachedIndex' " +
"WITH OPTIONS = { 'case_sensitive' : false}");
// both nodes have the old value
cluster.coordinator(1).execute("INSERT INTO " + fullTableName + "(k, v) VALUES (0, 'OLD')", ALL);
String select = "SELECT * FROM " + fullTableName + " WHERE v = 'old'";
Object[][] initialRows = cluster.coordinator(1).execute(select, ALL);
assertRows(initialRows, row(0, "OLD"));
// only one node gets the new value
cluster.get(1).executeInternal("UPDATE " + fullTableName + " SET v = 'new' WHERE k = 0");
// querying by the old value shouldn't return the old surviving row
SimpleQueryResult oldResult = cluster.coordinator(1).executeWithResult(select, ALL);
assertRows(oldResult.toObjectArrays());
}
}
} |
|
@adelapena I've added the |
|
@mike-tr-adamson I think RFP is working because cassandra/src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java Lines 85 to 103 in 723cc16
You can artificially see the test failing if you modify that method or if you make |
|
Thank you, that's what I was trying to find. |
| import org.apache.cassandra.utils.ByteBufferUtil; | ||
|
|
||
| import static org.junit.Assert.assertEquals; | ||
| import static org.junit.Assert.assertNotEquals; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This unused import is breaking the build: https://app.circleci.com/pipelines/github/adelapena/cassandra/3004/workflows/cff4d9ec-aa49-466d-9f32-8535c3826a64/jobs/56610
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, sorry. I'll get my commit process correct one of these days.
|
Committed as 05dd587 |