
#1740 Issue: fix failing Google Sheets Source with large spreadsheet #1762

Merged

Conversation

yevhenii-ldv
Contributor

What

Fix the Google Sheets source failing on large spreadsheets (more than 1,000 rows).

How

The error occurred when we requested rows that don't exist, which simply means the sheet has been read to the end; we only need to catch the error and exit the loop.
One more thing: I slightly increased the maximum backoff time, since on one sync of 192,302 rows a 429 (rate limit) error occurred after about 126,000 rows.
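In code, the approach looks roughly like this (a sketch, not the exact connector code: `client.get_values` and `ROW_BATCH_SIZE` come from the connector, while the helper function and the batch value below are illustrative):

```python
from googleapiclient import errors

ROW_BATCH_SIZE = 200  # illustrative value


def read_all_row_batches(client, spreadsheet_id: str, sheet: str):
    """Yield raw row batches until a request starts past the last row of the sheet."""
    row_cursor = 2  # start syncing past the header row
    while True:
        range_ = f"{sheet}!{row_cursor}:{row_cursor + ROW_BATCH_SIZE}"
        try:
            response = client.get_values(spreadsheetId=spreadsheet_id, ranges=range_, majorDimension="ROWS")
        except errors.HttpError:
            # Requesting a range that starts beyond the end of the sheet fails,
            # which tells us the sheet has been fully read.
            break
        yield response
        row_cursor += ROW_BATCH_SIZE + 1
```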

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images

Recommended reading order

  1. test.java
  2. component.ts
  3. the rest

try:
    row_batch = SpreadsheetValues.parse_obj(
        client.get_values(spreadsheetId=spreadsheet_id, ranges=range, majorDimension="ROWS")
    )
sherifnada (Contributor) commented on Jan 21, 2021:

can we instead cap the range based on how large the spreadsheet is? e.g.:

if row_cursor >= num_rows_in_sheet:
    break

range = f"{sheet}!{row_cursor}:{min(num_rows_in_sheet, row_cursor + ROW_BATCH_SIZE)}"

I think the current approach of swallowing any HttpError is not very robust, as we could be getting other kinds of exceptions.
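Assembled into the read loop, the suggestion would look roughly like this (a sketch; `num_rows_in_sheet` is assumed to come from the sheet's grid properties, and the `+ 1` cursor step reflects that A1 ranges are inclusive):

```python
row_cursor = 2  # start past the header row
while row_cursor <= num_rows_in_sheet:
    # Clamp the end of the range to the sheet's real row count so the request
    # never references rows that don't exist.
    range_ = f"{sheet}!{row_cursor}:{min(num_rows_in_sheet, row_cursor + ROW_BATCH_SIZE)}"
    row_batch = SpreadsheetValues.parse_obj(
        client.get_values(spreadsheetId=spreadsheet_id, ranges=range_, majorDimension="ROWS")
    )
    row_cursor += ROW_BATCH_SIZE + 1
```

With the cap in place, any HttpError that does surface is a real failure rather than an expected end-of-sheet signal, so it no longer needs to be swallowed.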

yevhenii-ldv (Contributor, Author) replied:

Okay, I'll try to make this change.

@@ -152,6 +152,11 @@ def get_sheets_in_spreadsheet(client, spreadsheet_id: str):
    spreadsheet_metadata = Spreadsheet.parse_obj(client.get(spreadsheetId=spreadsheet_id, includeGridData=False))
    return [sheet.properties.title for sheet in spreadsheet_metadata.sheets]

@staticmethod
def get_sheets_properties(client, spreadsheet_id: str):
sherifnada (Contributor) commented:

should this be called get_sheet_row_count? Also, can you add the return value type signature?
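Presumably something like the following, inside the same helper class as the diff above (a sketch only: `gridProperties.rowCount` is the Sheets API field for a sheet's row count, but the exact shape of the connector's `Spreadsheet` model here is an assumption):

```python
from typing import Dict

@staticmethod
def get_sheet_row_count(client, spreadsheet_id: str) -> Dict[str, int]:
    # Map each sheet title to its total row count from the grid properties.
    spreadsheet_metadata = Spreadsheet.parse_obj(client.get(spreadsheetId=spreadsheet_id, includeGridData=False))
    return {sheet.properties.title: sheet.properties.gridProperties.rowCount for sheet in spreadsheet_metadata.sheets}
```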


 for sheet in sheet_to_column_index_to_name.keys():
     logger.info(f"Syncing sheet {sheet}")
     column_index_to_name = sheet_to_column_index_to_name[sheet]
     row_cursor = 2  # we start syncing past the header row
-    while True:
+    while row_cursor <= sheet_parameters[sheet]:
sherifnada (Contributor) commented:

@yevhenii-ldv wouldn't this overcount in the case that row_cursor + ROW_BATCH_SIZE is greater than sheet_parameters[sheet]?

yevhenii-ldv (Contributor, Author) replied:

@sherifnada Only the starting row of the range needs to exist. If the end of the range goes beyond the last row of the table, that's okay: we only get back the rows that really exist, and the next iteration exits the loop.
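A quick walk-through with made-up numbers (assuming the cursor advances by `ROW_BATCH_SIZE + 1`, since A1 ranges are inclusive):

```python
num_rows, ROW_BATCH_SIZE = 1005, 200  # illustrative values

row_cursor = 2
while row_cursor <= num_rows:
    print(f"Sheet1!{row_cursor}:{row_cursor + ROW_BATCH_SIZE}")
    row_cursor += ROW_BATCH_SIZE + 1
```

The last request is `Sheet1!806:1006`: its end overshoots the 1,005-row sheet, but the API simply returns rows 806-1005. The cursor then becomes 1007, the condition fails, and the loop exits without an extra request.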

@sherifnada sherifnada merged commit c2dab06 into master Jan 21, 2021
@sherifnada sherifnada deleted the ykurochkin/google-sheets-fails-on-large-spreadsheet branch January 21, 2021 22:44