Duplicated rows when using fetchmany_arrow method #286

Open · morleytj opened this issue Nov 21, 2023 · 13 comments
Labels: bug (Something isn't working)

@morleytj

Hello, I've been using this package to automate some SQL pulldowns of a fairly large dataset, but after running it I've realized that the fetchmany_arrow() method may be returning overlapping results. I've included details on the code I'm running and the results below.

MWE

from databricks import sql
import os
import sys
from datetime import date
from pyarrow import csv

def stream_cursor_to_file_pyarrow(cursor, query, filepath, size):
    cursor.execute(query)
    # Fetch an initial batch to get the schema for the writer
    res = cursor.fetchmany_arrow(size)
    with csv.CSVWriter(filepath, res.schema) as writer:
        writer.write_table(res)
        # Keep fetching batches of `size` rows until the result set is exhausted
        while True:
            res = cursor.fetchmany_arrow(size)
            if res.num_rows < 1:
                break
            writer.write_table(res)

if __name__ == "__main__":
    output_path = sys.argv[1]

    with sql.connect(server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
                     http_path=os.getenv("DATABRICKS_HTTP_PATH"),
                     access_token=os.getenv("DATABRICKS_TOKEN")) as connection:
        with connection.cursor() as cursor:
            print('connection established')
            q1_sql = "select * from table_name;"
            stream_cursor_to_file_pyarrow(cursor, q1_sql,
                                          output_path + '_counts_' + str(date.today()) + '.csv',
                                          10000)

Example data format

The data is stored in the table in long format, and looks like this:

ID feature_name feature_occurence earliest_date_of_occurrence
A1 F9 23 2010-04-04
A1 F10 12 2009-11-20
B3 CC1 2 2000-01-02
C4 F9 34 2002-04-02

Error

The resulting file written to disk has duplicated rows in it (including the header), similar to the following example:

ID feature_name feature_occurence earliest_date_of_occurrence
ID feature_name feature_occurence earliest_date_of_occurrence
A1 F9 23 2010-04-04
A1 F10 12 2009-11-20
B3 CC1 2 2000-01-02
C4 F9 34 2002-04-02
... ... ... ...
B3 CC1 2 2000-01-02
X8 CC9 3 2000-10-02
A1 F10 12 2009-11-20

Interestingly, the first duplicated row appears exactly at index 10000 (the same size I used for fetchmany_arrow). It is not a repeat of the first value of the first batch, so the batch is not simply pasted in again wholesale; the duplicated row's first occurrence is at index 9216. I am wondering if this indicates some overlap between consecutive fetch calls, which is also supported by the fact that a contiguous run of the following rows is duplicated.

The total number of duplicated rows is 3,849,407, representing roughly 0.05% of the total number of records.
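A minimal sketch of one way to count the duplicated rows in a written file (pandas, the placeholder path, and the file fitting in memory are all assumptions):

import pandas as pd

df = pd.read_csv("output.csv")  # placeholder path for the file written above
n_dupes = df.duplicated().sum()  # rows identical to an earlier row
print(f"{n_dupes} duplicated rows out of {len(df)} total")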

Initial investigation of potential error sources:

I have checked that the duplicates do not already exist on the SQL database side by running the following query:

select * from tablename group by ID, feature_name, feature_occurence, earliest_date_of_occurrence having count(*)>1

However, this returns an empty set, indicating that the duplicates are being generated during retrieval or writing. Given that duplicated rows appear at the start of each batch, I believe the error originates somewhere in the fetchmany_arrow() call rather than in the CSV-writing step.

Hopefully the error isn't somewhere in my code, haha, and hopefully this write-up helps in tracking down the issue.

Best,
Theodore

@susodapop
Contributor

Thanks for the fulsome write-up. I'm writing here to acknowledge that we've seen this and are working up a reproduction. Can you please share your versions of Python, pyarrow, and databricks-sql-connector?

susodapop self-assigned this Nov 21, 2023
@morleytj
Author

Ah, I forgot to include those:

  • python: version 3.10.9
  • pyarrow: version 14.0.1
  • databricks-sql-connector: version 3.0.0

Thanks for the quick reply! Let me know if there's any other information I can provide.

@Adomatic

@susodapop Totally off-topic but you should probably look up 'fulsome' before you accidentally insult someone with it, like I did once in front of over 200 people. (-:

@susodapop
Contributor

Fair point, @Adomatic ;) Let's just say I'm trying to accelerate the revival of its positive connotations as discussed here:

The senses shown above are the chief living senses of fulsome. Sense 2, which was a generalized term of disparagement in the late 17th century, is the least common of these. Fulsome became a point of dispute when sense 1, thought to be obsolete in the 19th century, began to be revived in the 20th. The dispute was exacerbated by the fact that the large dictionaries of the first half of the century missed the beginnings of the revival. Sense 1 has not only been revived but has spread in its application and continues to do so. The chief danger for the user of fulsome is ambiguity. Unless the context is made very clear, the reader or hearer cannot be sure whether such an expression as "fulsome praise" is meant in sense 1b or in sense 4.

@andrefurlan-db
Contributor

andrefurlan-db commented Feb 20, 2024

Hi @morleytj, is this issue still occurring? I have not been able to reproduce it.

@morleytj
Author

Hi @andrefurlan-db, I've just rerun my pipeline and the issue is still occurring. It's very consistent in always duplicating the header line, which gives me an easy way to check; it happens even in the new queries I've added, so it doesn't seem to be related to anything specific to a given query.

Is there any additional information I could provide to help you reproduce it? I'm happy to provide other package versions or environment info. If it's relevant, the cluster I'm running these scripts on uses CentOS 7 as its OS.
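A minimal sketch of that header check (the path is a placeholder; it assumes, as observed so far, that a duplicated header line accompanies duplicated data rows):

# Compare the first two lines of the output CSV: if they are identical,
# the header was duplicated and the file likely contains duplicated rows too.
with open("output.csv") as f:  # placeholder path
    first, second = f.readline(), f.readline()
print("header duplicated:", first == second)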

@morleytj
Author

For example, I've just pulled a file in this manner which has an individual's ID and a couple of summary variables, such as the most recent date associated with that individual (each line is unique, since the query groups by ID). The pulled file had 3,738,958 lines and, after dropping duplicates, 3,560,930 lines, which comes out to 178,028 duplicate lines.

The query is along the lines of

SELECT ID, max(date_col) as maxdate, count(distinct extract(year from date_col)) as distinct_years FROM table_name GROUP BY ID;

@morleytj
Author

Checking in to note that this issue is still occurring as of April 2024; I'm curious whether anyone has replicated it.

kravets-levko added the bug (Something isn't working) label Apr 17, 2024
@morleytj
Author

Some extra context I noticed today: an extra header line has very consistently appeared as an indicator that rows were duplicated, but one of the files I pulled today didn't have that extra header line, and when I checked it didn't have any duplicates, though the other two files did. This file is the smallest of the three I pulled and has only two columns.

At first I thought this might be because the batch size was larger than the retrieved table, but that isn't the case: the batch size was 10,000 and the non-duplicated table has 37,616 rows. Unsure if it's relevant, but in the interest of providing all information: this unduplicated table consists of two columns of unique IDs, a SELECT DISTINCT over two of the columns from one of the two larger queries, one of which is duplicated.
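One way to narrow down whether the overlap enters at batch boundaries would be to log each batch's size and boundary rows while fetching. A minimal sketch (a hypothetical helper, not part of the pipeline above):

def log_batches(cursor, query, size):
    # Fetch in batches and print each batch's row count plus its first and
    # last rows; overlapping batches would show a batch starting with rows
    # already seen near the end of the previous one.
    cursor.execute(query)
    batch_index = 0
    while True:
        res = cursor.fetchmany_arrow(size)
        if res.num_rows < 1:
            break
        first_row = res.slice(0, 1).to_pylist()
        last_row = res.slice(res.num_rows - 1, 1).to_pylist()
        print(f"batch {batch_index}: {res.num_rows} rows, "
              f"first={first_row}, last={last_row}")
        batch_index += 1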

@kravets-levko
Collaborator

@morleytj can you please try passing use_cloud_fetch=False to the sql.connect() method and check whether the behavior changes?
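In code, that is a single extra keyword argument to the connect call from the MWE above (a minimal sketch):

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
    use_cloud_fetch=False,  # disable CloudFetch while testing for duplicates
)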

@morleytj
Author

morleytj commented Apr 18, 2024

@kravets-levko I just ran my pipeline with that argument in the sql.connect() method, and it seems to have worked completely: the outputs don't have any duplicates when that argument is included.

@kravets-levko
Collaborator

Thank you @morleytj! This indicates that we probably have one more issue with the CloudFetch feature, which is sad. But at least we have a direction. I'll ask you to test a bit more with CloudFetch disabled, just to make sure it indeed helps. If you see duplicated rows again, please let me know.

@morleytj
Author

Unfortunate, but thank you for the help in identifying the source of the error! The pipeline I'm using runs daily, so I'll keep you updated on the health of the output.
