Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Every 2048th inserted date is incorrect #10180

Closed
1 task done
slvr611 opened this issue Jan 9, 2024 · 4 comments · Fixed by #10291
Closed
1 task done

Every 2048th inserted date is incorrect #10180

slvr611 opened this issue Jan 9, 2024 · 4 comments · Fixed by #10291

Comments

@slvr611
Copy link

slvr611 commented Jan 9, 2024

What happens?

Every 2048th inserted date is incorrect. The incorrect date is seemingly random.

The problem does not seem to ocurr when values are inserted in large batches (5000+), but does in smaller batches (~2000)

The problem seems to only ocurr after running the program twice, it's not clear to me whether it only appears after the connection to the file has been closed, or whether it's just a question of the number of queries

The problem does not seem to appear when running a generic query that would show the entire table, but does when running a query for that specific line

To Reproduce

Run the following code to generate and add a series of consective dates associated with IDs to a database:

import pandas as pd
import datetime
import duckdb

dbFile = "testDB.db"

def InitializeDB(db):
    db.sql("CREATE SEQUENCE IF NOT EXISTS seq_dateID START 1")
    db.sql("""CREATE TABLE IF NOT EXISTS Dates (
                dateID LONG NOT NULL DEFAULT NEXTVAL('seq_dateID'),
                day DATE NOT NULL,
                PRIMARY KEY (dateID)
            )""")

def main():
    db = duckdb.connect(dbFile)
    
    InitializeDB(db)
        
    for i in range(2):
        datesToAdd = pd.date_range(start='2010-1-1', end=datetime.date.today(), freq='D').to_frame(name="date")
        
        datesToAdd["date"] = pd.to_datetime(datesToAdd["date"]).dt.date
        
        datesToAdd = datesToAdd.head(2000)
        
        db.sql("INSERT INTO Dates (dateID, day) SELECT nextval('seq_dateID'), date FROM datesToAdd")
    
    outputToCheck = db.sql("SELECT * FROM Dates").df()
    
    print(outputToCheck.iloc[2047:2050])
    
    outputToCheck2 = db.sql("SELECT * FROM Dates WHERE dateID = 2049").df()
    
    print(outputToCheck2)
    
    pass

if __name__ == "__main__":
    main()

Upon running it the second time it should produce the following text:

      dateID        day
2047    2048 2010-02-17
2048    2049 2010-02-18
2049    2050 2010-02-19
   dateID        day
0    2049 2005-07-16

Which shows that the date at index 2048, with a dateID of 2049 is not consecutive, but the generic query above returns a different date that is consective

OS:

Windows 10 Home x64

DuckDB Version:

0.9.2

DuckDB Client:

Python

Full Name:

Orion Gesslein

Affiliation:

None/Myself

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • Yes, I have
@szarnyasg
Copy link
Collaborator

Thanks! This must have been challenging to debug and turn into an MWE! We'll look into it.

@slvr611
Copy link
Author

slvr611 commented Jan 10, 2024

Thanks for your response! Let me know if there's any other info I can get to aid in your investigation.

@szarnyasg
Copy link
Collaborator

@slvr611 PR #10291 by @gitccl fixes this issue.

@slvr611
Copy link
Author

slvr611 commented Jan 23, 2024

Big thanks to both you and @gitccl !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants