Duplicated rows when using fetchmany_arrow method #286

Open · morleytj opened this issue Nov 21, 2023 · 13 comments
Labels: bug (Something isn't working)

@morleytj

Hello, I've been using this package to automate some SQL pulldowns of a fairly large dataset, but after running it I've realized that the fetchmany_arrow() method may be returning overlapping results. I've included details on the code I'm running and the results below.

MWE

from databricks import sql
import os
import sys
from datetime import date
from pyarrow import csv

def stream_cursor_to_file_pyarrow(cursor, query, filepath, size):
    cursor.execute(query)
    # Fetch an initial batch to get the schema for the writer
    res = cursor.fetchmany_arrow(size)
    with csv.CSVWriter(filepath, res.schema) as writer:
        writer.write_table(res)
        # Keep fetching batches of `size` rows until the result set is exhausted
        while True:
            res = cursor.fetchmany_arrow(size)
            if res.num_rows < 1:
                break
            writer.write_table(res)

if __name__ == "__main__":
    output_path = sys.argv[1]

    with sql.connect(server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
                     http_path=os.getenv("DATABRICKS_HTTP_PATH"),
                     access_token=os.getenv("DATABRICKS_TOKEN")) as connection:
        with connection.cursor() as cursor:
            print('connection established')
            q1_sql = "select * from table_name;"
            stream_cursor_to_file_pyarrow(cursor, q1_sql,
                                          output_path + '_counts_' + str(date.today()) + '.csv',
                                          10000)

Example data format

The data is stored in the table in long format, and looks like this:

ID feature_name feature_occurence earliest_date_of_occurrence
A1 F9 23 2010-04-04
A1 F10 12 2009-11-20
B3 CC1 2 2000-01-02
C4 F9 34 2002-04-02

Error

The resulting file written to disk has duplicated rows in it (including the header), similar to the following example:

ID feature_name feature_occurence earliest_date_of_occurrence
ID feature_name feature_occurence earliest_date_of_occurrence
A1 F9 23 2010-04-04
A1 F10 12 2009-11-20
B3 CC1 2 2000-01-02
C4 F9 34 2002-04-02
... ... ... ...
B3 CC1 2 2000-01-02
X8 CC9 3 2000-10-02
A1 F10 12 2009-11-20

Interestingly, the first duplicated row appears exactly at index 10000 (the same size I used for fetchmany_arrow). It is not a repeat of the first value of the first batch, so the batch is not simply pasted in again wholesale; the duplicated row's first occurrence is at index 9216. I am wondering if this indicates some overlap between consecutive fetch calls, which is also supported by the fact that a contiguous run of the following rows is duplicated.

The total number of duplicated rows is 3,849,407, representing roughly 0.05% of the total number of records.
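A minimal sketch of one way to count the duplicated rows in a written file (pandas, the placeholder path, and the file fitting in memory are all assumptions):

import pandas as pd

df = pd.read_csv("output.csv")  # placeholder path for the file written above
n_dupes = df.duplicated().sum()  # rows identical to an earlier row
print(f"{n_dupes} duplicated rows out of {len(df)} total")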

Initial investigation of potential error sources:

I have checked that the duplicates do not already exist on the SQL database side by running the following query:

select * from tablename group by ID, feature_name, feature_occurence, earliest_date_of_occurrence having count(*)>1

However, this returns an empty set, indicating that the duplicates are being generated during retrieval or writing. Given that duplicated rows appear at the start of each batch, I believe the error originates somewhere in the fetchmany_arrow() call rather than in the CSV-writing step.

Hopefully the error isn't somewhere in my code, haha, and hopefully this write-up helps in tracking down the issue.

Best,
Theodore

@susodapop
Contributor

Thanks for the fulsome write-up. I'm writing here to acknowledge that we've seen this and are working up a reproduction. Can you please share your versions of Python, pyarrow, and databricks-sql-connector?

susodapop self-assigned this Nov 21, 2023
@morleytj
Author

Ah, I forgot to include those:

  • python: version 3.10.9
  • pyarrow: version 14.0.1
  • databricks-sql-connector: version 3.0.0

Thanks for the quick reply! Let me know if there's any other information I can provide.

@Adomatic

@susodapop Totally off-topic but you should probably look up 'fulsome' before you accidentally insult someone with it, like I did once in front of over 200 people. (-:

@susodapop
Contributor

Fair point, @Adomatic ;) Let's just say I'm trying to accelerate the revival of its positive connotations as discussed here:

The senses shown above are the chief living senses of fulsome. Sense 2, which was a generalized term of disparagement in the late 17th century, is the least common of these. Fulsome became a point of dispute when sense 1, thought to be obsolete in the 19th century, began to be revived in the 20th. The dispute was exacerbated by the fact that the large dictionaries of the first half of the century missed the beginnings of the revival. Sense 1 has not only been revived but has spread in its application and continues to do so. The chief danger for the user of fulsome is ambiguity. Unless the context is made very clear, the reader or hearer cannot be sure whether such an expression as "fulsome praise" is meant in sense 1b or in sense 4.

@andrefurlan-db
Contributor

andrefurlan-db commented Feb 20, 2024

Hi @morleytj, is this issue still occurring? I have not been able to reproduce it.

@morleytj
Author

Hi @andrefurlan-db, I've just rerun my pipeline and the issue is still occurring. It's very consistent in always duplicating the header line, which gives me an easy way to check; it happens even in the new queries I've added, so it doesn't seem to be related to anything specific to a given query.

Is there any additional information I could provide to help you reproduce it? I'm happy to provide other package versions or environment info. If it's relevant, the cluster I'm running these scripts on uses CentOS 7 as its OS.
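A minimal sketch of that header check (the path is a placeholder; it assumes, as observed so far, that a duplicated header line accompanies duplicated data rows):

# Compare the first two lines of the output CSV: if they are identical,
# the header was duplicated and the file likely contains duplicated rows too.
with open("output.csv") as f:  # placeholder path
    first, second = f.readline(), f.readline()
print("header duplicated:", first == second)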

@morleytj
Author

For example, I've just pulled a file in this manner which has an individual's ID and a couple of summary variables, such as the most recent date associated with that individual (each line is unique, since the query groups by ID). The pulled file had 3,738,958 lines and, after dropping duplicates, 3,560,930 lines, which comes out to 178,028 duplicate lines.

The query is along the lines of

SELECT ID, max(date_col) as maxdate, count(distinct extract(year from date_col)) as distinct_years FROM table_name GROUP BY ID;

@morleytj
Author

Checking in to note that this issue is still occurring as of April 2024; I'm curious whether anyone has replicated it.

kravets-levko added the bug (Something isn't working) label Apr 17, 2024
@morleytj
Author

Some extra context I noticed today: an extra header line has very consistently appeared as an indicator that rows were duplicated, but one of the files I pulled today didn't have that extra header line, and when I checked it didn't have any duplicates, though the other two files did. This file is the smallest of the three I pulled and has only two columns.

At first I thought this might be because the batch size was larger than the retrieved table, but that isn't the case: the batch size was 10,000 and the non-duplicated table has 37,616 rows. Unsure if it's relevant, but in the interest of providing all information: this unduplicated table consists of two columns of unique IDs, a SELECT DISTINCT over two of the columns from one of the two larger queries, one of which is duplicated.
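One way to narrow down whether the overlap enters at batch boundaries would be to log each batch's size and boundary rows while fetching. A minimal sketch (a hypothetical helper, not part of the pipeline above):

def log_batches(cursor, query, size):
    # Fetch in batches and print each batch's row count plus its first and
    # last rows; overlapping batches would show a batch starting with rows
    # already seen near the end of the previous one.
    cursor.execute(query)
    batch_index = 0
    while True:
        res = cursor.fetchmany_arrow(size)
        if res.num_rows < 1:
            break
        first_row = res.slice(0, 1).to_pylist()
        last_row = res.slice(res.num_rows - 1, 1).to_pylist()
        print(f"batch {batch_index}: {res.num_rows} rows, "
              f"first={first_row}, last={last_row}")
        batch_index += 1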

@kravets-levko
Collaborator

@morleytj can you please try passing use_cloud_fetch=False to the sql.connect() method and check whether the behavior changes?
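In code, that is a single extra keyword argument to the connect call from the MWE above (a minimal sketch):

connection = sql.connect(
    server_hostname=os.getenv("DATABRICKS_SERVER_HOSTNAME"),
    http_path=os.getenv("DATABRICKS_HTTP_PATH"),
    access_token=os.getenv("DATABRICKS_TOKEN"),
    use_cloud_fetch=False,  # disable CloudFetch while testing for duplicates
)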

@morleytj
Author

morleytj commented Apr 18, 2024

@kravets-levko I just ran my pipeline with that argument in the sql.connect() method, and it seems to have worked completely: the outputs don't have any duplicates when that argument is included.

@kravets-levko
Collaborator

Thank you @morleytj! This indicates that we probably have one more issue with the CloudFetch feature, which is sad. But at least we have a direction. I'll ask you to test a bit more with CloudFetch disabled, just to make sure it indeed helps. If you see duplicated rows again, please let me know.

@morleytj
Author

Unfortunate, but thank you for the help in identifying the source of the error! The pipeline I'm using runs daily, so I'll keep you updated on the health of the output.
