Skip to content

Conversation

@chandrasekharan-zipstack
Copy link
Contributor

@chandrasekharan-zipstack chandrasekharan-zipstack commented Apr 24, 2024

What

  • Index key generation fixed - hash a dict of values instead of combining UUIDs
  • File hash function is optimized
  • Minor: Dead code removed in wrappers of adapters
  • Version bumped to 0.23.0, lockfile updated
  • MINOR: Some error handling code around vector DB / embedding was removed
  • MINOR: Increased line length limit to precommit hooks to 88

Why

  • File hashing was incorrect - hashes were appended while in reality they should be XOR-ed
  • Previous index key gen logic did not allow adapter updates to be reflected during reindex of a document, since they were based on UUIDs
  • Error handling code removal reason here

How

  • New index key gen logic uses the adapter configurations and hash them altogether

Can this PR break any existing features. If yes please list of possible items. If no please explain why. (PS: Admins do not merge the PR without this section filled)

Yes, since there's an update in the index key generated - all old index keys would be invalid. In order to perform a clean up - a migration has been added in Zipstack/unstract#289

Related PRs

Notes on Testing

  • Tested them along with other changes in unstract, index key generated as sha 256 hash.
  • Adapter changes causes index to be generated again

Screenshots

image
image

Checklist

I have read and understood the Contribution Guidelines.

@CLAassistant
Copy link

CLAassistant commented Apr 24, 2024

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@ritwik-g ritwik-g left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chandrasekharan-zipstack I have looked at the code from a very high level since the relation of same with existing code is not something that I know very well.

One major pattern I observed is that we have removed some of the instance variables. I would like you to double check and make sure those are not used in other repos and if they are we have made required changes

Co-authored-by: Ritwik G <100672805+ritwik-g@users.noreply.github.com>
Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Signed-off-by: Chandrasekharan M <117059509+chandrasekharan-zipstack@users.noreply.github.com>
Copy link
Contributor

@hari-kuriakose hari-kuriakose left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chandrasekharan-zipstack Doc ID generation looks good.

Please double check the impact on other areas.

@hari-kuriakose hari-kuriakose merged commit 14862e9 into main Apr 30, 2024
@hari-kuriakose hari-kuriakose deleted the feat/index-key-generation-fix branch April 30, 2024 16:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants