Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revamp search API #286

Open
tobyzerner opened this issue Jan 31, 2018 · 26 comments
Open

Revamp search API #286

tobyzerner opened this issue Jan 31, 2018 · 26 comments
Assignees

Comments

@tobyzerner
Copy link

tobyzerner commented Jan 31, 2018

Plan for new search system:

Search driver (eg. Algolia) is called upon to update its index each time a post/discussion is changed
(Each post is a separate object in the index which contains not only post content but also its discussion's title/tags/other metadata)

q = foo "bar baz" (author:toby OR author:franz) is:following

  1. Parse q into an abstracted query object
. eg. https://github.com/pimcore/search-query-parser or https://github.com/netgen/query-translator
  2. Loop through each Term in the query object, allow gambits to match and convert term into a Where object to be processed by search driver
  3. Extensions may also populate the query object to enact permissions, eg. add a Where to restrict results to only tags that the user can see
  4. Search driver builds its vendor-specific query by inspecting query object's Terms/Wheres, returns list of discussion IDs

However, this is complex and will take a while to implement, so it would be wise to keep it slated for 0.2 (as per current roadmap). In the meantime, we are doing flarum/framework#1339

@franzliedke
Copy link

After continually jumping back and forth between performance issues and little bugs in our search implementation, it has become clear that we need to make progress on our driver-based search infrastructure (#509). This would then allow for swapping out a default, naive implementation (hand-maintained index, similar to FluxBB; MySQL fulltext search; etc.) for more advanced drivers based on e.g. Elasticsearch or Algolia.

I was asked to describe my thoughts on the initial architecture in a new ticket. But we already have this one, so I'll add my thoughts here.


Laravel Scout

Multiple people brought up using Laravel Scout, as it incorporates a driver infrastructure for searches on Eloquent models. This might be possible to use, but I am somewhat skeptical for several reasons:

  • We want to allow extensions to change things at runtime.
  • Tight coupling to Eloquent models / queries (possibly not compatible with Toby's approach outlined above)
  • We want to define interfaces based on Flarum's functional requirements. Laravel's APIs usually change more often.

What might be possible is using Scout's engines (i.e. drivers) so that we have less work in building out the adapters to the various backends. We would still have to take care of the integration into Eloquent / our search infrastructure manually, though. Definitely worth a shot when we get to the implementation phase.


Requirements

These are the filters / gambits supported by our current search layer...

  • Searching users:

    • email
    • group name
  • Searching discussions:

    • author name
    • tag
    • creation date (exact date, date range)
    • is hidden
    • is locked
    • is sticky
    • user-specific gambits:
      • is (un)read
      • subscription status (following / ignoring)

Initial thoughts:

  1. User search by email is based on a permission anyway, so it might be better placed in a user list / admin user management page (slightly related: User list framework#283).
  2. User-specific gambits are basically impossible to provide by the search driver. Can we combine driver results with our own database-backed filters (we have to use the IDs from the search driver to get the actual discussion titles etc. anyway). Pagination would be difficult or impossible, though.

We really have to come up with a better conceptual separation between what is search and what is filtering.

Driver capabilities

Once this list is complete, we should be able to derive an interface for driver implementations:

  • Update the index when a discussion / post / user is changed
    • This should be done on the queue, but that's probably not the driver's concern.
  • Rebuild the entire index
  • Search the index with a given list of filters / gambits, return a list of matching IDs, paginated

Open questions

  • Where do we draw the line between search (requiring the backend), and filtering (database-backed)?
  • Where is the best place for completely hiding certain posts from the index (e.g. those marked as private)?
  • How do we ensure the driver can satisfy our sort order requirements for the list of results?

Naive implementation

As we don't want to require something like Elasticsearch by default (this would complicate the setup far too much), we have to provide a simple implementation by default.

As an alternative to the current MySQL-based fulltext search, we could implement a search based on a hand-made index, similar to what FluxBB and other forum software do / have done. Now that the application layer has to notify the index when it should be changed, this is possible. The main benefit would be compatibility with other database systems, should we ever get there.

Quick primer on how FluxBB's search works:

  • When indexing, split posts by words.
  • Add each word (exactly once!) to a search_words table and give it an ID. (Only do this for new words, obviously.)
  • For each word occurring in a post, add an entry to the search_matches (word_id, post_id) table. This can also contain metadata about where the match occurred (subject, opening post, body...) and the number of matches.
  • When searching, get the word IDs from search_words, then get the matching post IDs from the search_matches table.

Possible future changes

We want to keep these things in mind when designing the interface, even if we don't implement them right away...

  • Stemming - enables finding "playing" when searching for "player" (implementations are language-specific)
  • Highlighted snippets - the more powerful the driver becomes, the more we will have to rely on its ability to provide highlighted snippets of the results (because it integrates stemming etc.)

This is very likely incomplete right now. I will try my best to keep this post updated, as we discuss things below.

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Mar 28, 2020

Currently, we have a DiscussionSearcher and a UserSearcher. I think we should abstract this system down to an AbstractSearcher, which would support extending ANY model, flarum-defined or not, with a searcher. When applied to Posts, this should also remove necessity for the (frankly redundant) ConfigurePostsQuery event.

Within that searcher, filtering is currently supported by gambits. I think this is a good system, and the new AbstractSearcher system should support extending ANY searcher with a gambit (if it exists).

Now, the searching itself it currently handled by a 'FullTextGambit'. This, in my opinion, is the part that we could refactor out into a system that supports various search drivers. A few possible ideas:

  1. Split gambits into AbstractFilterGambit and AbstractSearchGambit
  2. Make FullTextGambit not a gambit, but rather a feature of AbstractSearcher that supports various search drivers (of which the current system could be one). The driver would be configured in admin settings (similar to how mail driver is done).

In my opinion, the refactor I described above should be included by stable. Implementing an extender for custom search drivers could be delayed by 0.2.

@askvortsov1
Copy link
Sponsor Member

Franz and I had a very productive discussion today, here are my takeaways:

Searching and Filtering

One point of uncertainty is where to draw the line between searching and filtering, and how these 2 concepts should fit in/work together. I am in favor of a tightly coupled approach, that is, having filtering and searching together under one system. Here's why:

  • We would expect that the machinery for executing a filter, executing a search, and executing a filtered search would not be radically different. It would be unexpected that adding a search term to a filtered query would use entirely separate logic to compute. For instance, you would expect that filtering for open PRs by askvortsov and filtering for open PRs by askvortsov with a search term of Policy Refactor would not be processed radically differently, but that the latter would build upon the former.
  • Filtering and searching are, arguably, extensions of each other: filtering restricts a queryset by certain structured metadata (like author, title, etc). Searching restricts it by an abstract 'full text search'.

Drivers

Our goal in designing the new system should be to eventually support search drivers. This should be extendable, so that extension developers can add new drivers. The issue emerges of how to generalize our implementation, so that filters (currently presented by gambits) will work across various extenders.

What I'm Proposing

  • The Search extender will allow developers to add 'search driver's, and to add 'filter spec's
    • A search driver will handle the actual searching/filtering
    • A filter spec allows extension developers who are adding functionality to define somewhat of an interface for filters. These will bridge the gap between available filters and the driver's implementation. These will also be available to the frontend.
  • Although we generally discourage this, search drivers will be encouraged to create their own extenders that would allow other extensions to add implementations of filters. As each search driver is so different, having an interface for filters that would be used within the search driver would only make everyone's lives more difficult.
  • Flarum's current search system will become the SimpleFlarumSearch search driver. It will have a SimpleFlarumSearch extender, which will allow people to add gambits, and to set fulltextgambits (or filterers and searchers, respectively if we want to change the names to be clearer).
  • All current gambits will be moved to the SimpleFlarumSearch driver, and corresponding FilterSpecs will be registered in the core of search.

Extenders

public class Search implements ExtenderInterface {
    // Add search driver
    public function searchDriver(SearchDriverInterface $driver);

    // Add filter spec
    public function filterSpec(AbstractModel $model, FilterSpecInterface $filterSpec);
}
public function SimpleFlarumSearch implements ExtenderInterface {
    // Right now these are 'Gambits'
    public function filter(AbstractModel $model, FilterSpecInterface $filterSpec, SimpleFlarumFilterInterface $filter);

    // Right now these are 'FullTextGambits'
    public function search(AbstractModel $model, SimpleFlarumSearchInterface $fullTextSearcher);
}

Interfaces

public interface SearchDriverInterface {
    // This returns a searcher for a given model, which actually performs the searching/filtering
    public function searcher(AbstractModel $model): SearcherInterface;

    // This will return an array of specs for the filters that this driver implements.
    // This should not be a static list, but rather should be generated as a result of other devs
    // adding filter implementations through the driver's extender.
    public function implementedFilters(): array[FilterSpec];

    // This will be called during resolution, and will provide a list of all existing FilterSpecs to the driver.
    public function addAvailableFilter(AbstractModel $model, array[FilterSpec] $filterSpecs);
}
 public interface SearcherInterface {
     public function search(SearchCriteria $criteria, $limit = null, $offset = 0, array $load = []): SearchResults;
}
public interface FilterSpecInterface {
    // Regex representing the filter's pattern
    public function pattern(): string;
}

@luceos
Copy link
Member

luceos commented Apr 14, 2020

This sounds like a great plan @askvortsov1 ; I'm extremely curious about the technical implications once we move gambits to the SimpleSearch. Well done!

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Apr 15, 2020

After a bit more thought, I realized I'm missing one method in the SearchDriverInterface:

public function supportedModels(): array[AbstractModel];

This should work similarly to implementedFilters, but I think it's important to have the 2 separately, because we might have filters but not a searcher, or vice versa. However, there might be a bit of overlap, which might be unideal.

One other thing to consider: should filter specs go beyond a regex, and provide, say, a description? or non-regex filter types (equals, contains, between, gt, gte, etc)? In essence, I think that regex might be better as an implementation detail than as the actual interface, but I might be wrong

@w-4
Copy link

w-4 commented Sep 29, 2020

I feel for the API instead of:

filter[q]=testquery tag:testtag author:testauthor

this would be a lot clearer:

filter[q]=testquery&filter[tag]=testtag&filter[author]=testauthor

That would make it behave like the page parameter page[offset]=20&page[near]=20&page[limit]=20

But since it's not an easy change, the better alternative could be to rename filter[q] to either q or filter, which would also clarify things:

filter=testquery tag:testtag author:testauthor

(And maybe even simplify the page parameters to just offset, near, limit to keep it consistent.)

@askvortsov1
Copy link
Sponsor Member

I don't know that we'd want to rename filter[q] to q, as we want to ensure compliance with JSON-API. That being said, I like splitting up AND queries across multiple request params. We could also look into a broader overhaul, something like https://json-api-dotnet.github.io/JsonApiDotNetCore/usage/filtering.html.

That being said, I feel like we should separate discussion about the format of REST API requests, and the underlying search architecture (which is more focused on establishing a driver-based system)

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Oct 31, 2020

I've spent a bit more time playing around with the system (using elasticsearch as an example, but only as that), and here's a reiteration of the main issues (shocklingly, pretty much exactly what Franz identified above):

  1. SQL Fulltext search means that we can apply all filtering gambits AND the fulltext gambit in the same query. So we get one "block" of results, which we can then paginate, filter, etc.
  2. Flarum's visibility scoping system is awesome, but this is where it fails: because rules for who can see what can get very complex, we can't reliably convert them into metadata that could be used to filter visibility on the elasticsearch level
  3. Running a query against elasticsearch will return a fixed number of results. Because of (2), we will need to filter THOSE for visibility later, meaning that:
    a. We need to request more results from ElasticSearch than we are planning to return. How much more? 2x? 10x? a constant huge number?
    b. For proper pagination (no duplicates and no gaps), we'll need to supply an offset. What would that offset be?
  4. It doesn't make sense to use elasticsearch for queries that contain current-user-based filters (unread, filtering) AT ALL, because as per (3), any results we extract from ElasticSearch would be filtered down beyond recognition

So, possible, non-mutually-exclusive approaches to fighting these issues (some of these are quite rough) :

a. Run 2 search systems (Flarum's basic one and elasticsearch/algolia/whatever) at the same time. If a search includes an unsupportable gambit like subscription status or unread, default to the native Flarum driver
b. Implement search drivers by tying gambits to a driver name. So an elasticsearch driver would work by overriding (some of) the default filtration gambits
c. Figure out how to improve Flarum's default search so that it takes discussion titles into account (and works better with CJK characters)
d. Have ElasticSearch return 10,000 or 100,000 search results, and keep all offset / size logic in the flarum portion of the application. This wouldn't be too different from our current method of searching, but could have performance implications (how fast can we filter down by ID if the list of IDs is very large?) Can we maintain order for relevance searches as returned by ElasticSearch?
e. Ensure every gambit is aware of ALL other gambits being applied. Also, allow filtration gambits to pass some kind of metadata to the fulltext gambit so that instead of directly executing the filter, they could prepare some elasticsearch query fragment to be run in the fulltext gambit. The more gambits a driver can replace, the more effectively it'll perform, and the vast majority of gambits are replaceable.

I'm planning to take a better look through the Laravel Scout system and see how they accomplish this (and whether they actually do).

Somewhat concrete proposal:

If we consider typical user behavior, people are unlikely to look beyond the nth page of search results, where n is typically in single digits. Results become less relevant, and people have typically found something that works by then. (cue "The best place to hide a body is the 2nd page of google results" jokes).

Criteria that filter out significant portions of the search space (tags, dates, authors/participants, etc) can all be handled by Elasticsearch As long as we run queries that contain subscription and unread filters through the SQL search, searches we send to ElasticSearch should mostly contain results the user can view. We could further improve that by telling elasticsearch not to include anything from tags/categories the user can't view discussions in, which shouldn't be too performance-intensive, and would further cut down on inaccessible (permission-wise) results returned by ElasticSearch.

With these optimizations in place, if we request (offset + size) * c) results from ElasticSearch (where 2 < c < 10) and proceed as per (d) above, user experience shouldn't be affected in the VAST majority of cases. Furthermore, we might even be able to stick that returned data in redis or something, since if we don't use all of it, it could be recycled when the user next clicks "read more".

It goes without saying that this is an extension-space issue, although we might want to make it bundled (at least at the start) because of how critical an issue it is.

@clarkwinkelmann
Copy link
Member

Re-reading this discussion, I feel like we do need to separate filters and full text search into different API endpoints and implementations.

The filter part, like proposed in flarum/framework#2104 , makes a lot of sense to me for retrieving posts with filters, discussions with a particular flag (bookmarked, followed, ...) or users based on filters (enabled, group) for an admin UI. The current approach works well and apart from creating proper extenders and re-usable classes, I don't think we need to change it too much. Changing to query-string based over string-based would be nice.

The full text search really stands on its own. Very few filters matter. You just want relevant results, and fast. Currently there are only two situations where this endpoint would be used: the quick search dropdown that includes discussions and users, and the full discussion search results page. We don't really need to make the search work with anything else by default. Here's there's also potentially the use case of directly hitting a search server like Algolia without going through Flarum backend at all.

My preference/proposal would be as follows:

  • Rename what we currently call "search" to something like "filters"
  • Drop the ability to have full text filters
  • Revamp filters to use key-values in the filters[] query string instead of a string query
  • We keep the system with the search/searcher and gambit manager, just rename it to something different
  • Refactor search input and results page to use a new extensible javascript API. That API needs to support calling external non-Flarum API endpoints for search
  • Implement a driver system for this new full text search system with the ability to select a different one for each model (discussion, user, and other models registered into the global search by extensions)
  • Create a new gambit system for the full text search
    • Gambits could be shown as pills in the search field, and never as "hidden" gambits like we do now
    • Drivers could expose which gambits they support. If a gambit is not supported, it's not added to the search query and this will be clearly visible in the UI
  • Create a default Flarum search driver that uses a server-side database search, probably something like described higher with local indexing with word-split
  • Provide a way for extensions to add gambit support to that default driver
  • Technically, the default search driver could well be implemented using the existing API endpoints and re-use some of the logic from the filters, but I think a complete separation would help to keep things organized and allow easier refactoring in the future
  • Other drivers would expose their own extender to let extensions register gambits on them if they wish so
  • Search drivers would be entirely responsible for the indexing strategy and access control. I think most forums will be fine indexing just the public data so IMO we shouldn't invest too much effort into that. If we let drivers implement the API call on javascript side and add their own API endpoint on Flarum API (if they need it), they basically have access to everything they need to know who the actor is and perform their own logic. The default driver of Flarum could continue using the ScopeVisibilityTrait if we want to offer full search including in restricted tags

So the API could look something like this:

PHP

class FilteredQuery implements ExtenderInterface {
    public function __construct(string $filteredQueryClass);

    // Equivalent to FlarumSearch::addGambit in the PR
    public function addFilter(string $filterClass);

    // Equivalent to FlarumSearch::addSearchMutator in the PR
    public function addQueryMutator(callable $callback);
}
// This extender's only purpose would be to make the drivers and resources known to the admin panel to create a configuration screen. Everything else will be on the client-side
class Search implements ExtenderInterface {
    // Registers a new driver for full text search
    public function driver(string $driverClass);

    // Registers a new full text search resource that this extension provides (for example, pages or albums)
    public function resource(string $name);
}
// This is the extender's for the base search driver provided by Flarum
class SimpleTextSearch implements ExtenderInterface {
    // Add a gambit for the default search driver. Those could work very similarly to what we have now, but would be separate from the filtered query filters
    public function addGambit(string $gambitClass);

    // Allows altering the SQL query of the default search driver, just like we can with filtered queries
    public function addQueryMutator(callable $callback);
}
// The server-side search driver class would only serve to build the admin configuration panel. Flarum needs to get a name/description, and know which resources are compatible with this driver. The list of resources isn't hard-coded as an extension could use this search driver's own extenders to add functionality
class AbstractSearchDriver {
    abstract function name();

    abstract function compatibleResources(): array;
}

JS:

// The extension would be responsible to use this extender to register one class for each resource they are compatible with
class SearchExtender {
    addDriver(resource: string, class implements AbstractSearchDriver);
}
// The search driver would work similarly to the current SearchSource objects, but it would also be used for paginated results in the discussions search results
class AbstractSearchDriver {
    // This method lets Flarum know which gambits it can offer to this driver
    // Flarum would take the list of suggested gambits for the context and compare it against this before suggesting it
    abstract supportedGambits(resource: string);

    // Performs the search asynchronously. We could wrap the result in another object that gives pagination hints
    // The data returned does not need to come from Flarum's API, but must contain models with some minimal attributes/relationships that we could spec out
    async abstract search(resource: string, query: string, page: number): Model[];
}
// We could add a new method to the global search state that can be extended by extensions to provide gambit suggestions
// This would replace the current GlobalSearchState.prototype.params used by extensions like Tags
app.search.suggestedGambits(): string[];

In the UI, I suggest making it so gambits get offered under the search field when focused.

So for example you click the search bar, and the following appears:
[ 🔍 Search Forum ]
Suggested filters: [Tag General] [User Admin]

Clicking a gambit would add it visually to the search, either in text form or ideally in a fashion similar to the tag picker:
[ 🔍 (Tag General) Search Forum ]
Suggested filters: [User Admin]

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Nov 5, 2020

2 completely separated systems works even better! I like pretty much everything about this proposal. We'll probably have some duplicate code between the filters and SimpleTextSearch gambits, especially towards the start. I think we should just keep it duplicate without any forced abstractions to unite the two: as you mentioned, these are more complex systems and unnecessary complexity doesn't help.

The biggest challenge here will probably need to be implementation, as the PRs would be bulky. Possible split of work:

PR 1A

  1. Copy-paste new endpoints for searching users and groups, and tweak the frontend for the search dropdowns and search results page to point there instead of to the filter.
  2. Modify the current filter endpoints for groups to be a bit simpler: all we need to do is scope visibility, run it through the filters (based on key-value parameters), run it through all filter mutators, apply offset/size, and return. A lot of abstractions needed for search can probably be eliminated here
  3. Modify frontend calls to filter endpoints to send filters as key-value, not as a single string

PR 1B

  1. Copy-paste the Search Refactor and Extenders framework#2104 extender, set it up for filters
  2. Copy-paste the Search Refactor and Extenders framework#2104 extender, set it up for the simple flarum search driver (essentially exactly what it is now, but with different naming and invocation)

PR 2

  1. Allow registering new drivers in the frontend and backend
  2. Allow different drivers to be used for different resources (a resource search endpoint might call $driver = SearchDriverManager::getDriver(RESOURCE_NAME)
  3. Add a frontend UI for configuring which driver is used for each resource

PR 3

  1. Add a frontend system for adding/removing gambits in a search (IE revamp frontend search API).

PR 4

  1. Add support for searching posts within a discussion?

PR 5

  1. Switch the MYSQL fulltext driver to something like the FluxBB one

So, that begs the question: what do we need when? PR 1A and 1B I think needs to be included for stable so we can deprecate the old events (preferably in beta 15, instead of flarum/framework#2104, which I think is feasible).

PR 2 would also be really nice to have sooner than later to enable foreign language search. I would also like to have this for stable, but maybe as a beta 16 thing.

PRs 3, 4, and 5 can be done when developers are available, but are not THAT urgent. PR 3 is the most important IMO.

@luceos
Copy link
Member

luceos commented Nov 10, 2020

I love the idea of separating fulltext vs filters. This would also allow retrieving private discussions using the fulltext search and then running filtering afterwards. We only need to tackle some basic issues like retrieving more entries than we need etc.

This would also mean we could look at laravel scout again, potentially.. Well established and easy to replace.

I personally don't think that we need to allow search drivers per resource. I think that would over-complicate core as most (if not all) implementation will use one driver.

Changing search, adding an advanced search page and introducing better ux for gambits/filters but I think it doesn't make sense to add that to the scope before stable. For the issues with search that UTF-8 needs fixed, we can possibly go for a simpler, temporary solution?

@askvortsov1
Copy link
Sponsor Member

I personally don't think that we need to allow search drivers per resource. I think that would over-complicate core as most (if not all) implementation will use one driver.

I think it would be simpler to register drivers to resources, not even from the perspective of ALLOWING search per resources as for registering and accessing those drivers in the first place. Since each resource would have different gambits, trying to shoehorn search for ALL resources under one block of logic would be messy IMO.

Most of what I bring up in Phase 1A about custom pages isn't for an "advanced search UI". It's just that we need the index page to hit the filtering API endpoint, and the search page to hit the search API endpoint. This might be possible to integrate into the index page, but I need to play around with it first.

@clarkwinkelmann
Copy link
Member

I can't see a solution where there isn't one driver per resource and where extensions can define their own search endpoints.

Unless we standardize a way for extensions to provide the data that must be indexed, but this kind of goes against the flexibility that could be offered by having radically different drivers.

My use case for drivers per resources is as follows:

  • Core adds user and discussion search
  • Core adds basic driver for user and discussion
  • Extension A adds new advanced driver for user and discussion
  • Extension B adds picture search
  • Extension B adds picture support to basic driver

If we force the forum to use basic for everything since it's the only driver that supports all search types, we will create lots of incompatibilities or just discourage extensions from implementing search in their own data.

We could force selecting the same driver for both user and discussion however since the two will always be used together and can't be disabled.

@askvortsov1
Copy link
Sponsor Member

This would also allow retrieving private discussions using the fulltext search and then running filtering afterwards. We only need to tackle some basic issues like retrieving more entries than we need etc.

This actually wouldn't happen. Retrieving more entries than we need is the exact issue I brought up above. I'm envisioning these as 2 parallel, but maximally detached systems,

@askvortsov1
Copy link
Sponsor Member

I made another attempt at flarum/framework#2453, and came across some concerns:

Shared Endpoints

After some more thinking, I don't see why search and filter couldn't share one endpoint. My main reasoning behind this is that regardless of whether a query is searched or filtered, the frontend expects a consistent format of results. The distinction between which method is used is essentially:

  • If the filter param consists of just a q string which is made of gambits and a fulltext query, it should be searched
  • If the filter param consists of key-value filter pairs, it should be filtered

Either method should return an Eloquent collection, which can then be marked as paginated (the actual application of sort, offset, and size happens inside the filtering / searching logic), have related objects / includes added to it, and so on. IMO, https://github.com/flarum/core/blob/master/src/Search/SearchResults.php#L14-L14, which is currently returned by DiscussionSearcher and UserSearcher, is pretty much what we'd want both the SearchDriverInterface and the Filterer to return.

Gambits vs Filters code duplication

With the filtering / search split, everything that is currently a gambit for the SimpleFlarumSearchDriver (in core and extensions) could/should be supported as a filter for the Filterer. However, the formats in which they are provided are different:

  • filter parameters would be provided as key-value pairs
  • search gambits would be provided as key:value substrings of the filter[q] param

This means we'd have 2 copies of code for all filters / SimpleFlarumSearch gambits. A few ideas:

  • What if we had filters and SimpleFlarumSearch gambits implement both FilterInterface and GambitInterface?
  • We could provide some utility that would apply filters as gambits SearchUtil::filterFromGambit, but I don't like that as it would be magicky and could constrain functionality / format
  • We could implement SimpleFlarumSearch to use Filterer for its gambits (kind of the previous option, but better), but that's still hacky

Another concern is preserving backwards compatibility. With the approach I'm taking in flarum/framework#2453, it's completely preserved for the Search logic (since that is completely unchanged), but it seems like it'll be a breaking change for the filter endpoints, as we can't really apply the gambits from the ConfigureXGambit events, since they won't comply with FilterInterface. However, we could set off Searching events in the Filterer basec on the resource being filtered.

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Feb 1, 2021

I've gone ahead and refactored flarum/framework#2454 so that gamits and filters are combined into FilterGambits. I think this is the way to go: having the same code in 2 classes is... bad. That being said, having to fulfill both the requirements of a gambit and a filter looks messy for several reasons:

  • They extend AbstractRegexGambit, but implement FilterInterface. That still feels like it's predominantly a gambit over a filter, but I'm not sure about that.
  • The method naming is an absolute mess, we have pattern, apply, conditions, getFilterKey, and constrain. The order feels random, and it's not immediately clear which applies to which.
  • Generally speaking, conditions and filterValue tend to be very similar in implementation. That's probably fine though.
  • pattern and getFilterKey() are messy

A few thoughts on how we could (potentially) make it better:

  • Rename some of the AbstractRegexGambit stuff. Maybe applyGambit? But that would be confusing with apply. Then again, if we had applyGambit and applyFilter methods, it would be much more intuitive IMO
  • Right now, only EmailFilterGambit needs to check permissions, but that permission check is duplicated, and has to override the apply method. Maybe we could separate that out into their own methods?
  • pattern and getFilterKey() could be provided when registering gambits. I wouldn't mind this too much for getFilterKey(), but with pattern, the regex is part of the implementation, and so should be part of the class. Maybe they should both be properties, or both be methods? Should we have them both at the top?
  • On that note, the pattern for regex gambits (and non regex gambits) will need to eventually become public so we can send it to the frontend, as proposed by Clark for https://github.com/flarum/core/issues/1355#issuecomment-721423135.

That brings me to my other design concern (quoted from the PR):

For simplicity and to reduce diff size, I omitted some of the abstractions present in the old search system for filtering (SearchCriteria, etc). I also reused some SimpleFlarumSearch utils / components, like SearchResults, ApplySearchParametersTrait, AbstractSearch. What do we want to do about this? IMO simpler is better, especially seeing that custom filtration drivers aren't going to be a thing, so unnecessary abstraction is unnecessary On the other hand, if we have AbstractFilterer, and then a RESOURCEFilterer class for each resource, a custom filtering driver becomes possible by overriding bindings on the service provider level. Also, I want to eventually move the shared stuff into a Query namespace, but didn't do so here to minimize diff.

With how it's looking now, I think having a class per-resource might be the better solution. I wonder if we could use SearchCriteria for both filtering and searching? That way, the filter and `search methods would effectively have the same API.

@SychO9
Copy link
Member

SychO9 commented Feb 12, 2021

  • Rename some of the AbstractRegexGambit stuff. Maybe applyGambit? But that would be confusing with apply. Then again, if we had applyGambit and applyFilter methods, it would be much more intuitive IMO

How would it be confusing if we're renaming it though ? applyGambit and applyFilter would definitely be better than apply and filter, and would be more consistent with getFilterKey naming.

  • pattern and getFilterKey() could be provided when registering gambits. I wouldn't mind this too much for getFilterKey(), but with pattern, the regex is part of the implementation, and so should be part of the class. Maybe they should both be properties, or both be methods? Should we have them both at the top?
  • On that note, the pattern for regex gambits (and non regex gambits) will need to eventually become public so we can send it to the frontend, as proposed by Clark for Revamp search API #286 (comment).

we could deprecate the pattern property, and then add a getGambitPattern method to GambitInterface AbstractRegexGambit, thus exposing the pattern, and making it more consistent with the FilterInterface.

I don't know yet about the Query namespace, might be better to look into that after we're done with the current PRs as they are.

@askvortsov1
Copy link
Sponsor Member

  • Maybe applyGambit? But that would be confusing with apply

I didn't explain this well, sorry. GambitInterface has an apply method. But AbstractRegexGambit, which all of core's non-fulltext gambits use, has a conditions method. So I'm saying if we rename constraints to applyGambit, that would clash with the interface's apply, which AbstractRegexGambit implements.

I like the getGambitPattern idea! Will do that.

@SychO9
Copy link
Member

SychO9 commented Feb 14, 2021

I didn't explain this well, sorry. GambitInterface has an apply method. But AbstractRegexGambit, which all of core's non-fulltext gambits use, has a conditions method. So I'm saying if we rename constraints to applyGambit, that would clash with the interface's apply, which AbstractRegexGambit implements.

I see now, yeah.. in that case, I'd leave them as is for now.

@askvortsov1
Copy link
Sponsor Member

askvortsov1 commented Feb 17, 2021

The PR has been refactored to use SearchCriteria and SearchResults for both searches and filters. I like this very much, as we now have a consistent API for querying.

AbstractSearch / WrappedFilter is more complex (and probably the last big design decision left). Discussion began in flarum/framework#2454 (comment). These classes accomplish a few things:

  1. We can pass in the user and query builder instance together
  2. For searches, it keeps track of applied gambits (I imagine there could be a use case for knowing this in a search mutator)
  3. It allows specification of a default sort per-resource
  4. It allows us to add custom state for resource-specific searches. This was previously used to keep track of "most relevant post" for discussions, although thats not used anymore.

Since it's useful to have this kind of wrapper around user and query (it allows us to pass in additional stuff to gambits/filters in the future without breaking anything), I think we should keep this class in some. setDefaultSort is also used (Discussion FulltextGambit) so we can't remove that.

After some thinking, my preference would be the following:

  • A FilterState class (renamed to QueryState in the future, which takes (Builder $query, User $actor, $defaultSort = []) as it's constructor. This would replace WrappedFilter. We also would be passing in the fallback sort instead of defining it in the state class, which feels cleaner to me.
  • AbstractSearch => SearchState, which inherits from FilterState, and had getActiveGambits() and addActiveGambit() methods.
  • No more resource-specific SearchState classes (although if we need one, we can always switch to it, cause polymorphism). Extensions can't add methods to them anyways, so no issue there.

That leaves only the question of where the default sort should come from. I see 2 options:

  • Just use $sort in the controllers. This is simplest, but then it needs to be defined for every controller that uses DiscussionSearcher (just 1 right now). This is also my preferred option.
  • Allow defining it as a searcher property, and use it in AbstractSearcher when instantiating search state if defined (a bit more complex, plus it adds extra responsibility to the Searcher class, which is driver implementation specific (another reason not to have this in SearchState btw), so I don't like this as much).

@askvortsov1
Copy link
Sponsor Member

The remainder of this is building in an extender + manager class, and adding the frontend integration. Neither of these will be particularly breaking, so they can be delayed to the next release.

@askvortsov1 askvortsov1 unpinned this issue Mar 2, 2021
@mmm8955405
Copy link

mmm8955405 commented Sep 1, 2021

I see that you have modified the search statement, but why does the SQL statement still have syntax errors or SQL injection when there are special envoy characters in the search term. Match (Posts. content) against(), match (discussions. Title) against (in Boolean mode) is empty
11
When there are special characters in the search term, they are filtered out in the program. Then it is passed to the subsequent SQL statement, and the query value is null

@askvortsov1
Copy link
Sponsor Member

I see that you have modified the search statement, but why does the SQL statement still have syntax errors or SQL injection when there are special envoy characters in the search term. Match (Posts. content) against(), match (discussions. Title) against (in Boolean mode) is empty
11
When there are special characters in the search term, they are filtered out in the program. Then it is passed to the subsequent SQL statement, and the query value is null

Sorry, I don't quite follow. Do you think you've found an SQL injection vulnerability?

@mmm8955405
Copy link

I don't think I found the SQL injection vulnerability of the framework. I just don't know why when I search for some special characters, these characters will be filtered out by the PHP program, resulting in SQL syntax errors. When I modify the packaging mode of MySQL engine InnoDB to mroonga engine, I don't think MySQL storage engine has any impact on your program. But your code does filter out special characters and cause syntax errors in full-text search SQL.

in general:

  1. In InnoDB storage engine, SQL syntax errors will not appear when searching for special characters

  2. In mroonga storage engine, SQL syntax error occurs when searching for special characters

This logic puzzles me! Can't you use MySQL precompiling?

@mmm8955405

This comment has been minimized.

@mmm8955405
Copy link

I see that you have modified the search statement, but why does the SQL statement still have syntax errors or SQL injection when there are special envoy characters in the search term. Match (Posts. content) against(), match (discussions. Title) against (in Boolean mode) is empty
11
When there are special characters in the search term, they are filtered out in the program. Then it is passed to the subsequent SQL statement, and the query value is null

Sorry, I don't quite follow. Do you think you've found an SQL injection vulnerability?

1、In InnoDB storage engine, SQL syntax errors will not appear when searching for special characters

2、In mroonga storage engine, SQL syntax error occurs when searching for special characters

I put the wrong SQL screenshot in the picture above

@askvortsov1 askvortsov1 transferred this issue from flarum/framework Mar 10, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants