
Improve create index #868

Closed
wants to merge 16 commits into from

Conversation

gaurav274 (Member)

👋 Thanks for submitting a Pull Request to EvaDB!

🙌 We want to make contributing to EvaDB as easy and transparent as possible. Here are a few tips to get you started:

  • 🔍 Search existing EvaDB PRs to see if a similar PR already exists.
  • 🔗 Link this PR to an EvaDB issue to help us understand what bug fix or feature is being implemented.
  • 📈 Provide before and after profiling results to help us quantify the improvement your PR provides (if applicable).

👉 Please see our ✅ Contributing Guide for more details.

# logic: We assume that the maximum number of files in the table is <=
# MAGIC_NUMBER and the number of frames or chunks for each video/document is <=
# MAGIC_NUMBER. Based on this assumption, we can safely say that
# `_row_id` * MAGIC_NUMBER + `chunk_id` for the document table and
# `_row_id` * MAGIC_NUMBER + `id` for the video table are globally unique.
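The id scheme described in this comment can be sketched as follows. The concrete value of MAGIC_NUMBER below is illustrative only; in the PR it is simply an arbitrarily large constant, and the function names are hypothetical:

```python
# Illustrative sketch of the unique-id scheme; the constant's value is an
# assumption for this example, not the one used in EvaDB.
MAGIC_NUMBER = 10**9  # assumed bound on files per table and chunks/frames per file

def build_index_id(row_id: int, chunk_id: int) -> int:
    # Combine the persisted table-level `_row_id` with the per-file chunk
    # (or frame) id into a single globally unique id.
    assert 0 <= chunk_id < MAGIC_NUMBER, "chunk id exceeds the assumed bound"
    return row_id * MAGIC_NUMBER + chunk_id

def split_index_id(index_id: int) -> tuple:
    # Invert build_index_id to recover (row_id, chunk_id).
    return divmod(index_id, MAGIC_NUMBER)
```

As long as both ids stay below the bound, distinct (row, chunk) pairs can never collide, and the original pair can be recovered from the combined id.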
Member

How do you determine the magic number? And when will the chunking be done?

Member Author

Right now, I set it to an arbitrarily large value. The idea is that as long as the number of files in the table is fewer than this number, our assumption will hold.

Regarding chunking: it is done when we read the document, similar to frame decoding in videos.

Member

I see. And for video, each video will get a unique ID and each frame will be assigned a different frame ID? Is this assumption correct?

Member Author

We already have these IDs (`_row_id` for video and `id` for frame) and are building on top of them. The assumption is that these IDs won't change across runs. `_row_id` is persisted, so no issue there. `id` is generated at runtime, and as long as the reader is deterministic across runs, we don't have a problem.
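The determinism point above can be sketched quickly; the reader here is a hypothetical stand-in for EvaDB's video decoder, not the actual implementation:

```python
def assign_frame_ids(frames):
    # Frame ids are assigned purely by enumeration order at read time.
    # As long as the reader yields frames in the same order on every run,
    # the same frame always receives the same id, so the runtime `id`
    # is stable without ever being persisted.
    for frame_id, frame in enumerate(frames):
        yield frame_id, frame
```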

@@ -34,10 +34,12 @@ def _read(self) -> Iterator[Dict]:
doc = fitz.open(self.file_url)

# PAGE ID, PARAGRAPH ID, STRING
# Maintain a global paragraph number per PDF
global_paragraph_no = 0
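A minimal sketch of this change, with the fitz page parsing stubbed out (the function and dictionary keys are hypothetical, for illustration only):

```python
def read_pdf_paragraphs(pages):
    # Maintain one paragraph counter per PDF rather than per page, so that
    # (row id, paragraph id) stays unique even when paragraphs on different
    # pages would otherwise share the same per-page index.
    global_paragraph_no = 0
    for page_no, paragraphs in enumerate(pages):
        for text in paragraphs:
            yield {"page": page_no, "paragraph": global_paragraph_no, "data": text}
            global_paragraph_no += 1
```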
Member

I added this so that the calculated id for the create index method stays unique. Otherwise, it is still not unique, because multiple paragraphs on different pages can have the same paragraph id. @gaurav274

jiashenC added a commit that referenced this pull request Sep 8, 2023
Use a runtime `row_number` to build the index by incorporating design
discussions from #912 and
#868.
jiashenC (Member) commented Sep 8, 2023

Closing this for now. #1073 was merged as a fix.

@jiashenC jiashenC closed this Sep 8, 2023