Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: id as number #441

Merged
merged 8 commits into from
Jul 18, 2023
Merged

Conversation

H4ad
Copy link
Contributor

@H4ad H4ad commented Jul 9, 2023

/claim #426

This is my attempt to closes #426, in this implementation, I added a new component called internalDocumentIDStore which is responsible for keeping the internal IDs and also has a list to reverse those IDs.

  • Added a new ID Store
  • Added support for all methods to handle the internal ID
  • Keep the API Support with plugins (instead of sending the internal ID, I only send the original ID)
  • Add support for serialization and deserialization (I didn't make it backward compatible if you want this, let me know)
  • Fixed all tests

About the performance, to compare this change with the old behavior, here are the current stats for the old behavior:

To be able to run this quickly, I rebase using #434 otherwise this will take a lot of time.

insert: 884.5955260000192ms
memory: 184.7719955444336MB
save: 172.09495500009507ms
total size: 42.089162826538086 MBs

And the new behavior is:

insert: 798.1346069998108ms
memory: 187.97183227539062MB
save: 132.9290320002474ms
total size: 27.622851371765137 MBs

To check this new behavior locally, checkout to this branch: https://github.com/h4ad-forks/orama/tree/feat/id-as-number-faster

database-size.mjs
import { create, insert, save } from './dist/index.js';
import { writeFileSync } from 'fs';
import crypto from 'crypto';

(async () => {
  const db = await create({
    schema: {
      name: "string"
    }
  });

  let now = performance.now();
  for (let i = 0; i < 1e5; i++) {
    await insert(db, {
      id: crypto.randomUUID(),
      name: Math.random().toString(16).substring(8)
    });
  }
  console.log(`insert: ${performance.now() - now}ms`);
  console.log(`memory: ${process.memoryUsage().heapUsed / 1024 / 1024}MB`);
  now = performance.now();

  const rawState = await save(db);
  console.log(`save: ${performance.now() - now}ms`);

  const jsonState = JSON.stringify(rawState);
  const totalSize = jsonState.length;
  console.log(`total size: ${totalSize / 1024 / 1024} MBs`);

  writeFileSync('./database-size.json', jsonState, 'utf8');
})();
We had a little increase in memory usage (1.7%) but we decreased the index size by 34.36%!

This is what looks like the current serialization:

{
   "internalIdStore":{
      "internalIdToId":[
         "4dca7125-6c6f-461c-9cb0-b0dba12119bc"
      ]
   },
   "index":{
      "indexes":{
         "name":{
            "word":"",
            "subWord":"",
            "children":{
               "b":{
                  "word":"ba2c536",
                  "subWord":"ba2c536",
                  "children":{
                     
                  },
                  "docs":[
                     1
                  ],
                  "end":true
               }
            },
            "docs":[
               
            ],
            "end":false
         }
      },
      "searchableProperties":[
         "name"
      ],
      "searchablePropertiesWithTypes":{
         "name":"string"
      },
      "frequencies":{
         "name":{
            "1":{
               "ba2c536":1
            }
         }
      },
      "tokenOccurrencies":{
         "name":{
            "ba2c536":1
         }
      },
      "avgFieldLength":{
         "name":1
      },
      "fieldLengths":{
         "name":{
            "1":1
         }
      }
   },
   "docs":{
      "docs":{
         "1":{
            "id":"4dca7125-6c6f-461c-9cb0-b0dba12119bc",
            "name":"ba2c536"
         }
      },
      "count":1
   },
   "sorting":{
      "sortableProperties":[
         "name"
      ],
      "sortablePropertiesWithTypes":{
         "name":"string"
      },
      "sorts":{
         "name":{
            "docs":{
               "1":0
            },
            "orderedDocs":[
               [
                  1,
                  "ba2c536"
               ]
            ],
            "type":"string"
         }
      },
      "enabled":true,
      "isSorted":true
   }
}

@vercel
Copy link

vercel bot commented Jul 9, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
orama-docs ✅ Ready (Inspect) Visit Preview 💬 Add feedback Jul 18, 2023 11:57am

@allevo
Copy link
Collaborator

allevo commented Jul 12, 2023

Hi @H4ad ! Could you resolve the conflicts?

Thanks

Copy link
Collaborator

@allevo allevo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR seems ok. I left some comments. In particular, related to DocumentID: why do we need to support both, number or string?

packages/orama/src/components/algorithms.ts Show resolved Hide resolved
packages/orama/src/components/algorithms.ts Show resolved Hide resolved
return false
}

delete store.docs[id]
delete store.docs[internalId]
store.count--

return true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should remove also from sharedInternalDocumentStore, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we just remove it from sharedInternalDocumentStore.idToInternalId, it's fast enough.

If we try to remove it from sharedInternalDocumentStore.internalIdToId, it will be slower as sorter.

What we can do is perform a cleanup of internalIdToId on serialize.

What do you prefer?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

keep like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure, so no need to remove the id from the sharedInternalDocumentStore, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No sorry, we should implement it.
For now, we don't care about the performance during the remove.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, we don't care about the performance during the remove.

Agreed. I'd dedicate a separate PR to this

packages/orama/src/methods/search.ts Show resolved Hide resolved
packages/orama/src/methods/search.ts Show resolved Hide resolved
packages/orama/src/methods/search.ts Outdated Show resolved Hide resolved
@@ -0,0 +1,70 @@
import { Orama } from '../types.js';

export type DocumentID = string | number
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why allow both?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My idea was to introduce less breaking change as possible (I also don't know what is public API and what is internal API).

Someone that changes the implementation of sorter, index, etc... will not need to modify their code in order to accept this change.

But if you want to go full breaking change mode, I can use InternalID in every code (but still return the original ID on search).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm ok to introduce a little breaking change with this. Regarding the naming. Is this a documented or IndexId?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will bump Orama to v1.1.0

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so I will change all the references for DocumentID to InternalID, and I will only use DocumentID on getByID and when we return the documents from search.

Regarding the naming. Is this a documented or IndexId?

I didn't understand your question but DocumentID is just an alias to reference the ID of the Document that was generated or was passed by the user, we you see this type, is referring to these two cases.

But I think we should add some documentation about it on Orama Docs, just to be clear about how we store IDs to be more efficient.

Copy link
Collaborator

@allevo allevo Jul 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, understood you point. fine for me

packages/orama/src/methods/search.ts Outdated Show resolved Hide resolved
packages/orama/src/methods/search.ts Outdated Show resolved Hide resolved
packages/orama/tests/group.test.ts Outdated Show resolved Hide resolved
Copy link
Collaborator

@allevo allevo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, amazing work!

LGTM

Copy link
Member

@micheleriva micheleriva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Terrific job @H4ad. As always!

LGTM

@micheleriva micheleriva merged commit 47295f1 into oramasearch:main Jul 18, 2023
@H4ad H4ad deleted the feat/id-as-number branch July 18, 2023 15:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Introduce Orama internal ID for documents
3 participants