Add basic text analysis to SAI: #2465

mike-tr-adamson · 2023-07-05T10:29:19Z

Adds the case_sensitive, normalize and ascii options to the index.
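
For readers unfamiliar with the three options, here is a rough, JDK-only sketch of the kind of transformation each one implies. It is not the SAI implementation: the class and method names are hypothetical, the order of the steps is an assumption, NFC is assumed as the normalization form, and the ASCII step (decompose, then drop combining marks) only approximates what a real folding filter does.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical illustration only; not part of Cassandra or SAI.
final class TextOptionsSketch
{
    static String analyze(String input, boolean caseSensitive, boolean normalize, boolean ascii)
    {
        String result = input;

        // normalize: bring equivalent Unicode sequences to one canonical form (NFC assumed here)
        if (normalize)
            result = Normalizer.normalize(result, Normalizer.Form.NFC);

        // case_sensitive = false: compare terms in lower case
        if (!caseSensitive)
            result = result.toLowerCase(Locale.ROOT);

        // ascii: approximate folding by decomposing and stripping combining marks
        if (ascii)
            result = Normalizer.normalize(result, Normalizer.Form.NFD).replaceAll("\\p{M}+", "");

        return result;
    }

    public static void main(String[] args)
    {
        System.out.println(analyze("Nip it in the bud", true, false, false));  // unchanged
        System.out.println(analyze("Nip it in the bud", false, false, false)); // lower-cased
        System.out.println(analyze("Caf\u00e9", false, true, true));           // prints "cafe"
    }
}
```

With all three flags at their pass-through values (case-sensitive, no normalization, no folding) the input comes back unchanged, which is the behaviour the default-options test later in this thread relies on.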

adelapena · 2023-07-05T11:17:16Z

test/unit/org/apache/cassandra/index/sai/cql/StorageAttachedIndexDDLTest.java

Since CASSANDRA-18521, CQLTester#waitForIndexQueryable requires an argument specifying the index name. Its absence here produces a build error. Maybe it was missed during the rebase?

Note that CQLTester#waitForTableIndexesQueryable can also be used to wait for all the indexes on the table.

Since waitForTableIndexesQueryable was added to createIndex, I have removed these calls.

adelapena · 2023-07-05T11:25:33Z

src/java/org/apache/cassandra/index/sai/analyzer/AbstractAnalyzer.java

Maybe it's simpler to get rid of ANALYZABLE_TYPES and TypeUtil#isIn and just use (type instanceof StringType) to check if a type is analyzable?

adelapena · 2023-07-05T11:33:52Z

src/java/org/apache/cassandra/index/sai/analyzer/AbstractAnalyzer.java

Nit: can be protected

adelapena · 2023-07-05T11:34:18Z

src/java/org/apache/cassandra/index/sai/analyzer/filter/BasicResultFilters.java

Nit: add @Override

I think this @Override was missed in the last commit.

Sorry, not sure how I missed that one.

adelapena · 2023-07-05T11:34:25Z

src/java/org/apache/cassandra/index/sai/analyzer/filter/BasicResultFilters.java

Nit: add @Override

adelapena · 2023-07-05T11:34:30Z

src/java/org/apache/cassandra/index/sai/analyzer/filter/BasicResultFilters.java

Nit: add @Override

adelapena · 2023-07-05T11:34:37Z

src/java/org/apache/cassandra/index/sai/analyzer/filter/BasicResultFilters.java

Nit: add @Override

adelapena · 2023-07-05T11:36:08Z

src/java/org/apache/cassandra/index/sai/analyzer/filter/FilterPipelineExecutor.java

Nit: no need to break the line

adelapena · 2023-07-05T11:48:55Z

test/unit/org/apache/cassandra/index/sai/analyzer/NonTokenizingAnalyzerTest.java

It seems there is some code duplication across the tests in this class. Maybe we could use a utility method such as, for example:

private void test(String input, String expected, AbstractAnalyzer analyzer) throws Exception
{
    ByteBuffer toAnalyze = ByteBuffer.wrap(input.getBytes());
    analyzer.reset(toAnalyze);
    ByteBuffer analyzed = null;
    while (analyzer.hasNext())
    {
        analyzed = analyzer.next();
    }
    String result = ByteBufferUtil.string(analyzed);
    assertEquals(expected, result);
}

Agreed, I've done this and generally simplified / tidied this test.

It looks nice now :)

mike-tr-adamson · 2023-07-05T12:04:59Z

@adelapena @bereng Please note that I have just made a number of changes for the Lucene 9.7 upgrade.

adelapena · 2023-07-05T13:22:27Z

test/unit/org/apache/cassandra/index/sai/analyzer/NonTokenizingAnalyzerTest.java

+    {
+        NonTokenizingOptions options = NonTokenizingOptions.getDefaultOptions();
+
+        assertNotEquals("nip it in the bud", getAnalyzedString("Nip it in the bud", options));


Maybe assertEquals("Nip it in the bud", getAnalyzedString("Nip it in the bud", options)); achieves the same and is stricter, since it verifies exactly what is returned?

Agreed, I'm not entirely sure why the original test did this because it would have failed on any string.
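
To make the difference in strictness concrete, here is a small JDK-only sketch. The class, the two checks and the "broken" analyzer are all hypothetical stand-ins; they only mimic what assertNotEquals and assertEquals would accept, without using JUnit or the real NonTokenizingAnalyzer.

```java
import java.util.function.UnaryOperator;

// Hypothetical illustration only; mimics the two assertion styles without JUnit.
final class AssertionStrictnessSketch
{
    // What assertNotEquals(unexpected, actual) accepts
    static boolean notEqualsCheck(String unexpected, String actual)
    {
        return !unexpected.equals(actual);
    }

    // What assertEquals(expected, actual) accepts
    static boolean equalsCheck(String expected, String actual)
    {
        return expected.equals(actual);
    }

    public static void main(String[] args)
    {
        String input = "Nip it in the bud";
        UnaryOperator<String> passThrough = s -> s; // assumed behaviour with default options
        UnaryOperator<String> broken = s -> "";     // made-up analyzer that loses the value

        // The negative check accepts both analyzers, so it cannot tell them apart.
        System.out.println(notEqualsCheck("nip it in the bud", passThrough.apply(input)));
        System.out.println(notEqualsCheck("nip it in the bud", broken.apply(input)));

        // The positive check accepts only the analyzer that returns the input unchanged.
        System.out.println(equalsCheck(input, passThrough.apply(input)));
        System.out.println(equalsCheck(input, broken.apply(input)));
    }
}
```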

adelapena · 2023-07-05T14:03:45Z

@mike-tr-adamson I think it would be good to add a dtest quickly verifying that tokenisation and analysis don't break RFP (classic CASSANDRA-8272). I think something like this would work:

public class ReplicaFilteringProtectionTest extends TestBaseImpl
{
    private static final int REPLICAS = 2;

    @Test
    public void testRFPWithIndexTransformations() throws IOException
    {
        try (Cluster cluster = init(Cluster.build()
                                           .withNodes(REPLICAS)
                                           .withConfig(config -> config.set("hinted_handoff_enabled", false)
                                                                       .set("commitlog_sync", "batch")).start()))
        {
            String tableName = "sai_rfp";
            String fullTableName = KEYSPACE + '.' + tableName;

            cluster.schemaChange("CREATE TABLE " + fullTableName + " (k int PRIMARY KEY, v text)");
            cluster.schemaChange("CREATE CUSTOM INDEX ON " + fullTableName + "(v) USING 'StorageAttachedIndex' " +
                                 "WITH OPTIONS = { 'case_sensitive' : false}");

            // both nodes have the old value
            cluster.coordinator(1).execute("INSERT INTO " + fullTableName + "(k, v) VALUES (0, 'OLD')", ALL);

            String select = "SELECT * FROM " + fullTableName + " WHERE v = 'old'";
            Object[][] initialRows = cluster.coordinator(1).execute(select, ALL);
            assertRows(initialRows, row(0, "OLD"));

            // only one node gets the new value
            cluster.get(1).executeInternal("UPDATE " + fullTableName + " SET v = 'new' WHERE k = 0");

            // querying by the old value shouldn't return the old surviving row
            SimpleQueryResult oldResult = cluster.coordinator(1).executeWithResult(select, ALL);
            assertRows(oldResult.toObjectArrays());
        }
    }
}

mike-tr-adamson · 2023-07-05T14:50:33Z

@adelapena I've added the ReplicaFilteringProtectionTest but, I have to admit, I'm not entirely sure how/why it's working. My main concern is that the test is passing for reasons that aren't the ones that we are testing for. I will have a bit of a dig to confirm this.

adelapena · 2023-07-05T15:14:56Z

@mike-tr-adamson I think RFP is working because StorageAttachedIndexSearcher#filterReplicaFilteringProtection takes care of applying the filters to the expressions that are used in the coordinator:

cassandra/src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

Lines 85 to 103 in 723cc16

    public PartitionIterator filterReplicaFilteringProtection(PartitionIterator fullResponse)
    {
        for (RowFilter.Expression expression : queryController.filterOperation())
        {
            AbstractAnalyzer analyzer = queryController.getContext(expression).getAnalyzerFactory().create();
            try
            {
                if (analyzer.transformValue())
                    return applyIndexFilter(fullResponse, Operation.buildFilter(queryController), queryContext);
            }
            finally
            {
                analyzer.end();
            }
        }

        // if no analyzer does transformation
        return Index.Searcher.super.filterReplicaFilteringProtection(fullResponse);
    }

You can make the test fail artificially by modifying that method or by making DataResolver#needsReplicaFilteringProtection return false.
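
As a rough mental model of why the coordinator-side filter has to go through the analyzer, here is a self-contained simulation of the scenario in the dtest. Everything in it is a hypothetical simplification: one row, two replicas, last-write-wins reconciliation, and a lower-casing function standing in for a 'case_sensitive': false analyzer. It does not reproduce the real RFP protocol, which fetches only the rows that some replicas stayed silent about.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; not Cassandra code.
final class RfpSketch
{
    /** One replica's copy of the single row: a value plus its write timestamp. */
    static final class Version
    {
        final String value;
        final long timestamp;

        Version(String value, long timestamp)
        {
            this.value = value;
            this.timestamp = timestamp;
        }
    }

    /** Stand-in for a case-insensitive analyzer. */
    static String analyze(String s)
    {
        return s.toLowerCase(Locale.ROOT);
    }

    static boolean indexMatches(String stored, String queried)
    {
        return analyze(stored).equals(analyze(queried));
    }

    static Version newest(List<Version> versions)
    {
        Version newest = null;
        for (Version v : versions)
            if (newest == null || v.timestamp > newest.timestamp)
                newest = v;
        return newest;
    }

    /** No protection: reconcile only the rows that each replica's index returned. */
    static String readWithoutProtection(List<Version> replicas, String queried)
    {
        List<Version> returned = new ArrayList<>();
        for (Version v : replicas)
            if (indexMatches(v.value, queried))
                returned.add(v);
        Version winner = newest(returned);
        return winner == null ? null : winner.value;
    }

    /** With protection: reconcile every replica's copy, then filter again on the coordinator. */
    static String readWithProtection(List<Version> replicas, String queried, boolean analyzedFilter)
    {
        Version winner = newest(replicas);
        if (winner == null)
            return null;
        boolean matches = analyzedFilter ? indexMatches(winner.value, queried)
                                         : winner.value.equals(queried);
        return matches ? winner.value : null;
    }

    public static void main(String[] args)
    {
        List<Version> inSync = Arrays.asList(new Version("OLD", 1), new Version("OLD", 1));
        List<Version> diverged = Arrays.asList(new Version("new", 2), new Version("OLD", 1));

        System.out.println(readWithoutProtection(diverged, "old"));     // stale "OLD" leaks through
        System.out.println(readWithProtection(diverged, "old", true));  // null: newest value is "new"
        System.out.println(readWithProtection(inSync, "old", true));    // "OLD": analyzed filter matches
        System.out.println(readWithProtection(inSync, "old", false));   // null: raw equality drops a valid row
    }
}
```

The last two calls are the point of the sketch: if the coordinator re-filtered with plain equality instead of the analyzed comparison, the first SELECT in the dtest would lose the row even when both replicas agree.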

mike-tr-adamson · 2023-07-05T15:26:01Z

Thank you, that's what I was trying to find.

adelapena · 2023-07-05T16:54:15Z

test/unit/org/apache/cassandra/index/sai/analyzer/NonTokenizingAnalyzerTest.java

+import org.apache.cassandra.utils.ByteBufferUtil;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertNotEquals;


This unused import is breaking the build: https://app.circleci.com/pipelines/github/adelapena/cassandra/3004/workflows/cff4d9ec-aa49-466d-9f32-8535c3826a64/jobs/56610

My bad, sorry. I'll get my commit process correct one of these days.

maedhroz · 2023-07-12T21:02:04Z

Committed as 05dd587

adelapena reviewed Jul 5, 2023

mike-tr-adamson added 3 commits July 5, 2023 13:42

Add basic text analysis to SAI:

b793e4b

- Adds the case_sensitive, normalize and ascii options to the index.

(DO NOT MERGE) PAID CircleCI

e14113c

Lucene 9.7 changes

6a9c9c1

mike-tr-adamson force-pushed the CASSANDRA-18479 branch from 7051147 to 6a9c9c1 Compare July 5, 2023 12:42

Address review comments

7a40be8

adelapena reviewed Jul 5, 2023

More review comments

9cdfbdd

Added ReplicaFilteringProtectionTest

723cc16

adelapena reviewed Jul 5, 2023

Fix checkstyle errors

0372568

adelapena approved these changes Jul 6, 2023

maedhroz closed this Jul 12, 2023
