
Implement Fuzzy Search Using PostgreSQL #137

Draft · wants to merge 6 commits into base: develop
Conversation

@foxoffire33 commented Jun 5, 2024

As part of the project I'm working on, I need to implement fuzzy search functionality. For example, if a user searches for an artist and makes a typo, I still want to return the correct result. To achieve this, I plan to use a built-in feature of PostgreSQL, and I am currently testing against a PostgreSQL database.

Additionally, I have noticed that sorting is currently broken, and I might need some help resolving this issue. The database structure will follow in a future update.

I would appreciate any feedback you might have on this approach.
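For context, the `similarity()` comparisons that appear in this PR's diffs come from PostgreSQL's pg_trgm extension. As a rough sketch only (not code from the PR; the function name and 0.4 threshold mirror the diffs, everything else is illustrative), a fuzzy OR-clause could be assembled with `$n` placeholders instead of string interpolation, which also sidesteps SQL injection:

```typescript
// Sketch: build a parameterized pg_trgm fuzzy-match clause.
// Threshold 0.4 matches the value used in this PR's diffs; names are illustrative.

interface FuzzyClause {
  text: string;   // SQL fragment with $n placeholders
  params: string[]; // values to pass alongside the query
}

function buildFuzzyOrClause(
  column: string,
  values: string[],
  startIndex = 1, // number of the first $n placeholder
): FuzzyClause {
  const parts = values.map(
    (_, i) => `similarity(${column}, $${startIndex + i}) > 0.4`,
  );
  return {
    text: `(${parts.join(' OR ')})`,
    // strip the user-facing wildcard, as the PR's code does with replace('*', '')
    params: values.map((v) => v.replace('*', '')),
  };
}
```

For example, `buildFuzzyOrClause('tv.value', ['Radiohea*'])` yields `(similarity(tv.value, $1) > 0.4)` with params `['Radiohea']`, which can be handed to `pg`'s `client.query(text, params)`.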


coderabbitai bot commented Jun 5, 2024

Walkthrough

The recent changes introduce several new SQL scripts, configuration files, and TypeScript updates to enhance the database operations for a project. These updates include creating views and indexes for efficient searches, adding dependencies, configuring data loading from SQLite to PostgreSQL, and defining database-related constants. Additionally, new SQL queries for data retrieval, deletion, insertion, and updates are added, along with TypeScript functions and types for handling Postgres database operations.

Changes

Files/Paths and Change Summary:

  • migrations/...createViewForWildcardSearch.sql: Added SQL migration script to create view transactionsTagsDecoded and indexes on tag_names and tag_values.
  • migrations/down/...createViewForWildcardSearch.sql: Added down migration commands to drop view transactionsTagsDecoded and indexes tagNamesIndex and tagValuesIndex.
  • package.json: Added @types/pg and @types/yesql dependencies, updated the pg dependency.
  • pgloader.conf: Introduced configuration for loading data from SQLite to PostgreSQL.
  • src/config.ts: Added database-related constants using environment variables.
  • src/database/pgql/bundles/...: Introduced multiple SQL files with queries for operations like data retrieval, deletion, and insertion.
  • src/database/pgql/core/...: Added SQL files with queries for core database operations, including accessors, imports, and stats.
  • src/database/pgql/data/...: Added SQL queries for handling content attributes and nested content.
  • src/database/pgql/moderation/...: Introduced SQL queries for moderation checks and blocking data.
  • src/database/postgress/PostgressDatabaseHelpers.ts: Added functions for handling transactions and data items in Postgres.
  • src/database/postgress/PostgressDatabaseTypes.ts: Introduced types and interfaces for worker pools, debug information, and database operations.
  • src/database/postgress/StandalonePostgresDatabase.ts: Added StandalonePostgresDatabase class for various database operations, including worker management and data handling.
  • src/database/sql/core/accessors.sql: Added a FOR SHARE clause to a subquery, affecting locking behavior.
  • src/database/standalone-postgres.ts: Implemented a Postgres database worker for operations like fetching data, saving transactions, and handling errors.

Sequence Diagram(s) (Beta)

sequenceDiagram
    participant User
    participant API
    participant Database
    User->>API: Request to insert data
    API->>Database: Execute upsertNewDataItem SQL query
    Database-->>API: Data inserted/updated
    API-->>User: Response with success/failure
sequenceDiagram
    participant User
    participant API
    participant Database
    User->>API: Request to fetch data attributes
    API->>Database: Execute selectDataAttributes SQL query
    Database-->>API: Return data attributes
    API-->>User: Response with data attributes


@foxoffire33 marked this pull request as draft on June 5, 2024 at 04:21
@foxoffire33 changed the title from "tag match with FUZZY_AND,FUZZY_OR works now, postgres database" to "Implement Fuzzy Search Using PostgreSQL" on Jun 5, 2024
@coderabbitai bot left a comment

Actionable comments posted: 10

Outside diff range and nitpick comments (4)
src/database/pgql/bundles/cleanup.sql (1)

1-13: The DELETE queries are safe, using parameterized inputs. Consider reviewing performance if the new_data_items and new_data_item_tags tables are large, as these operations might be intensive.

src/database/sql/core/accessors.sql (1)

Line range hint 1-10: The SELECT queries use subqueries and UNIONs efficiently. Ensure that the tables involved are properly indexed to optimize these queries.

src/workers/bundle-repair-worker.ts (1)

Line range hint 23-92: Consider replacing forEach with for...of for better performance when handling large arrays, as suggested by the static analysis tool.

- this.intervalIds.forEach((intervalId) => clearInterval(intervalId));
+ for (const intervalId of this.intervalIds) {
+   clearInterval(intervalId);
+ }
src/workers/block-importer.ts (1)

Line range hint 160-162: Replace forEach with for...of for better performance.

- txs.forEach((tx) => {
+ for (const tx of txs) {
    this.eventEmitter.emit(events.BLOCK_TX_FETCHED, tx);
- });
+ }

Biome static analysis suggests avoiding forEach for performance reasons, especially with large arrays.

Also applies to: 167-169, 173-175

src/database/postgress/StandalonePostgresDatabase.ts (outdated review comment; resolved)
parentId: tx.parent_id ? toB64Url(tx.parent_id) : null,
})));

return [];
Remove unreachable code.

- return [];
Tools
Biome

[error] 1325-1325: This code is unreachable (lint/correctness/noUnreachable)

... because this statement will return from the function beforehand

src/database/postgress/PostgressDatabaseWorker.ts (two outdated review comments; resolved)
break;
case tagsMatch.FUZZY_OR:
queryWheres.push(`tn.name = '${tag.name}'`);
let queryOr: string[] = [];
Encapsulate variable declaration within a block to restrict scope within the switch case.

case tagsMatch.FUZZY_OR:
+ {
  let queryOr: string[] = [];
  tag.values.forEach(value => {
    queryOr.push(`similarity(convert_from(tv.value,'UTF8'),'${value.replace('*', '')}') > 0.4 `);
  });
  queryWheres.push(('(' + queryOr.join(' OR ') + ')'));
+ }
Tools
Biome

[error] 1029-1029: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. (lint/correctness/noSwitchDeclarations)

The declaration is defined in this switch clause:

Unsafe fix: Wrap the declaration in a block.

switch (tag.match) {
case tagsMatch.WILDCARD:
queryWheres.push(`tn.name = \'${tag.name}\'`);
const wildcardValue: string = tag.values[0].replace('*', '%');
Encapsulate variable declaration within a block to restrict scope within the switch case.

case tagsMatch.WILDCARD:
+ {
  const wildcardValue: string = tag.values[0].replace('*', '%');
  queryWheres.push(`tn.name = \'${tag.name}\'`);
  queryWheres.push(`tv.value LIKE \'${wildcardValue}\'`);
+ }
Tools
Biome

[error] 1018-1018: Other switch clauses can erroneously access this declaration.
Wrap the declaration in a block to restrict its access to the switch clause. (lint/correctness/noSwitchDeclarations)

The declaration is defined in this switch clause:

Unsafe fix: Wrap the declaration in a block.
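Both Biome findings above are instances of the same lint/correctness/noSwitchDeclarations rule. A minimal self-contained sketch of why the braces matter (simplified from the PR's code; note that for the real queries, interpolating tag values into SQL strings should ideally be replaced with parameterized placeholders):

```typescript
type TagMatch = 'EXACT' | 'WILDCARD';

// Without braces, a const declared in one case is lexically visible to the
// other clauses (where it may be uninitialized). Wrapping each case body in a
// block confines its declarations to that case, which is the fix Biome suggests.
function buildValueClause(match: TagMatch, value: string): string {
  switch (match) {
    case 'WILDCARD': {
      const wildcardValue = value.replace('*', '%'); // scoped to this case only
      return `tv.value LIKE '${wildcardValue}'`;
    }
    case 'EXACT': {
      return `tv.value = '${value}'`;
    }
  }
}
```

Because each case is a block, adding more cases later (e.g. the PR's FUZZY_OR with its own `queryOr` array) cannot collide with or leak into these declarations.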

src/database/postgress/PostgressDatabaseWorker.ts (outdated review comment; resolved)
Comment on lines +1 to +15
-- selectNewDataItemTags
SELECT name, value
FROM new_data_item_tags
JOIN tag_names ON tag_name_hash = tag_names.hash
JOIN tag_values ON tag_value_hash = tag_values.hash
WHERE data_item_id = @id
ORDER BY data_item_tag_index ASC

-- selectStableDataItemTags
SELECT name, value
FROM stable_data_item_tags
JOIN tag_names ON tag_name_hash = tag_names.hash
JOIN tag_values ON tag_value_hash = tag_values.hash
WHERE data_item_id = @id
ORDER BY data_item_tag_index ASC
@coderabbitai bot commented Jun 5, 2024

Tip

Codebase Verification

Indexes for tag_names.hash and tag_values.hash are missing, which could lead to performance issues in the queries.

  • Ensure that indexes are created for tag_names.hash and tag_values.hash.
Analysis chain

The SELECT queries are well-structured, using joins and parameterized inputs efficiently. Ensure that the necessary indexes are in place for tag_names.hash and tag_values.hash to optimize performance.

Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that indexes on `tag_names.hash` and `tag_values.hash` are properly set up.

# Test: Search for index creation statements in the migration scripts.
rg --type sql $'CREATE INDEX'

Length of output: 13127

@foxoffire33 (Author) replied:

Will fix this later

@coderabbitai bot replied:

@foxoffire33, understood! If you need any assistance when you get to it, feel free to reach out. Good luck with the rest of your updates! 🐇

@@ -119,7 +119,7 @@ export class BlockImporter {
previousDbBlockHash,
},
);
this.chainIndex.resetToHeight(previousHeight - 1);
await this.chainIndex.resetToHeight(previousHeight - 1);
Refactor repeated code for resetting the chain index.

- await this.chainIndex.resetToHeight(previousHeight - 1);
+ await this.resetChainIndex(previousHeight);

Consider extracting the repeated resetToHeight calls into a separate method to reduce duplication and improve maintainability.

Also applies to: 141-141

Committable suggestion was skipped due to low confidence.

src/database/standalone-postgres.ts (outdated review comment; resolved)
@djwhitt (Collaborator) commented Jun 5, 2024

Hey @foxoffire33, cool work! I'm happy to talk through issues with this. I'm not sure whether we'll want to integrate it into develop/main in the near term, but we could maintain a separate branch for those who want to play around with it. That way we could continue to iterate on development and get some feedback on the level of interest from the community.

I'll follow up later today or tomorrow with more thoughts about the approach and other potential fuzzy search options. Thanks again for all the work you put into this!

@djwhitt (Collaborator) commented Jun 6, 2024

@foxoffire33 I've been thinking about this more today. What I'm initially curious about is what kind of fuzzy search you want to provide and which specific Postgres features you intend to use for it (apologies if this is visible in the diffs; I probably won't be able to review those in detail till next week). I'm also curious whether you considered using the webhook support to populate an external index (e.g., OpenSearch or similar), and if not, why.

Regarding potential integration - how much effort are you willing to put into maintaining this? We would like to add Postgres support in develop/main at some point, but envisioned using a somewhat different schema from SQLite when we eventually implemented it. If you're interested in pursuing that, I could give you some info about the schema design we had in mind.

@foxoffire33 (Author):

I tried to do it with Meilisearch first, but it doesn't fit the case perfectly. Implementing the functionality I need would be similar in cost to changing the gateway itself.

Since this is for a project I'm working on, I'm happy to maintain a Postgres solution for fuzzy search as long as we need it.

I'm also interested in the database design you already have in place, and I might be able to contribute some code for that.

@djwhitt (Collaborator) commented Jun 20, 2024

I tried to do it with Meilisearch first, but it doesn't fit the case perfectly. Implementing the functionality I need would be similar in cost to changing the gateway itself.

Gotcha, just wanted to make sure you were aware of the webhook option.

Since this is for a project I'm working on, I'm happy to maintain a Postgres solution for fuzzy search as long as we need it.

Great! What's the current status of the PR? Is it working for your use case now or are you still sorting through issues?

I'm also interested in the database design you already have in place, and I might be able to contribute some code for that.

Also great! I don't think I'll get to it till next week at this point, but I'll write up some details soon.

@foxoffire33 (Author):

I'm still working on it, but I don't have a lot of time.

There is an issue with sorting in queries; I'll provide an update when I know more.

@djwhitt (Collaborator) commented Jul 18, 2024

Long overdue update from our side - we're working on a version of the schema I'd recommend (on an internal legacy database in a different code base). It's similar to the SQLite schema but uses a single height partitioned table in Postgres as opposed to separate 'new' and 'stable' tables. We represent null heights with a very large positive integer since partition keys can't be nullable. The advantage of this approach is that all the new/stable flushing is automatically handled by Postgres.
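The null-height encoding described above can be summarized in a small sketch (the sentinel value and names here are hypothetical illustrations, not the actual internal code): since Postgres range-partition keys can't be NULL, "no height yet" maps to a very large integer, and updating the key once the item stabilizes lets Postgres move the row to the right partition, replacing the separate new/stable flush.

```typescript
// Hypothetical sketch: encode "no height yet" as a large sentinel so it can
// serve as a NOT NULL range-partition key, per the approach described above.
const UNSTABLE_HEIGHT_SENTINEL = 2_147_483_647; // illustrative; any very large int works

function heightPartitionKey(height: number | null): number {
  return height ?? UNSTABLE_HEIGHT_SENTINEL;
}
```

Unstable rows then all land in the topmost partition, and an `UPDATE ... SET height_partition_key = <real height>` migrates them down when the block stabilizes.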

@foxoffire33 (Author):

Is it possible to share the SQL schema?

@djwhitt (Collaborator) commented Aug 15, 2024

Sure. This should give you the general idea, but keep in mind this is from a different DB with some legacy decisions (strings instead of binary fields). It's also under active development, so some details might still change.

CREATE TABLE data_items (
    height integer,
    height_partition_key integer,
    id character(43) NOT NULL,
    raw_offset bigint,
    raw_size bigint,
    signature character varying, -- only here to support ANS102
    signature_type bigint,
    signature_offset bigint,
    signature_size bigint,
    owner_address character(43),
    owner_offset bigint,
    owner_size bigint,
    data_offset bigint,
    data_size bigint,
    target character(43),
    quantity character varying,
    reward character varying,
    anchor character varying,
    content_type character varying,
    format smallint,
    created_at timestamp without time zone DEFAULT now() NOT NULL,
    updated_at timestamp without time zone DEFAULT now() NOT NULL,
    parent character(43),
    parent_root_tx_offset bigint, -- offset of the parent relative to the root TX
    root_tx_id character(43), -- same as parent when it's not a nested data-item
    PRIMARY KEY (height_partition_key, id)
) PARTITION BY RANGE (height_partition_key);

CREATE INDEX data_item_id_idx ON data_items (id);
CREATE INDEX data_item_height_part_key_desc_id_idx ON data_items (height_partition_key DESC, id);
CREATE INDEX data_item_owner_address_height_part_key_id_idx ON data_items (owner_address, height_partition_key, id);
CREATE INDEX data_item_owner_address_height_part_key_desc_id_idx ON data_items (owner_address, height_partition_key DESC, id);
CREATE INDEX data_item_target_height_part_key_id_idx ON data_items (target, height_partition_key, id);
CREATE INDEX data_item_target_height_part_key_desc_id_idx ON data_items (target, height_partition_key DESC, id);

CREATE TABLE data_item_tags (
    height integer,
    height_partition_key integer,
    tx_id character(43) NOT NULL,
    index integer NOT NULL,
    name_hash bytea NOT NULL,
    value_hash bytea NOT NULL,
    created_at timestamp without time zone DEFAULT now() NOT NULL,
    updated_at timestamp without time zone DEFAULT now() NOT NULL,
    PRIMARY KEY (height_partition_key, tx_id, index)
) PARTITION BY RANGE (height_partition_key);


CREATE INDEX data_item_tags_name_value_height_part_tx_id_idx ON data_item_tags (name_hash, value_hash, height_partition_key, tx_id);
CREATE INDEX data_item_tags_name_value_height_part_key_desc_tx_id_idx ON data_item_tags (name_hash, value_hash, height_partition_key DESC, tx_id);

CREATE TABLE data_item_tag_names (
    name_hash bytea NOT NULL PRIMARY KEY,
    name character varying NOT NULL
);

CREATE INDEX data_item_tag_names_name_idx ON data_item_tag_names (name);

CREATE TABLE data_item_tag_values (
    value_hash bytea NOT NULL PRIMARY KEY,
    value character varying NOT NULL
);

CREATE INDEX data_item_tag_values_value_idx ON data_item_tag_values (value);
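The tag tables above are keyed by `name_hash`/`value_hash` bytea columns, though the thread doesn't say which hash function the gateway uses. As an illustrative sketch only (SHA-256 is a placeholder assumption, and the helper names are hypothetical), normalizing one tag into the three tables might look like:

```typescript
import { createHash } from 'node:crypto';

// Placeholder hash: the actual function behind name_hash/value_hash is not
// specified in this thread; SHA-256 is assumed here purely for illustration.
function tagHash(input: string): Buffer {
  return createHash('sha256').update(input, 'utf8').digest();
}

// Rows mirroring data_item_tag_names, data_item_tag_values, and data_item_tags:
// the name/value strings are stored once per hash, and the per-item tag row
// references both hashes plus the item's tx id and tag index.
function tagRows(txId: string, index: number, name: string, value: string) {
  const nameHash = tagHash(name);
  const valueHash = tagHash(value);
  return {
    nameRow: { name_hash: nameHash, name },
    valueRow: { value_hash: valueHash, value },
    tagRow: { tx_id: txId, index, name_hash: nameHash, value_hash: valueHash },
  };
}
```

Deduplicating names and values behind a fixed-width hash keeps the wide, partitioned `data_item_tags` table narrow and makes the name/value lookups cacheable.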
