#1740 Issue: fix failing Google Sheets Source with large spreadsheet (more than 1000 rows) #1762
Conversation
row_batch = SpreadsheetValues.parse_obj(
    client.get_values(spreadsheetId=spreadsheet_id, ranges=range, majorDimension="ROWS")
)
try:
can we instead cap the range based on how large the spreadsheet is? e.g.:

if row_cursor >= num_rows_in_sheet:
    break
range = f"{sheet}!{row_cursor}:{min(num_rows_in_sheet, row_cursor + ROW_BATCH_SIZE)}"

I think the current approach of swallowing any HttpError is not very robust, as we could be getting other kinds of exceptions.
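A minimal sketch of the capped-range batching this comment suggests, reusing the num_rows_in_sheet and ROW_BATCH_SIZE names from the pseudocode above; the generator and its call site are illustrative, not the connector's actual code:

```python
ROW_BATCH_SIZE = 200  # illustrative value, not necessarily the connector's


def iter_row_ranges(sheet: str, num_rows_in_sheet: int, batch_size: int = ROW_BATCH_SIZE):
    """Yield A1-notation ranges that never extend past the end of the sheet."""
    row_cursor = 2  # start past the header row, as the sync loop does
    while row_cursor <= num_rows_in_sheet:
        last_row = min(num_rows_in_sheet, row_cursor + batch_size)
        yield f"{sheet}!{row_cursor}:{last_row}"
        row_cursor = last_row + 1


# Each yielded range can be handed to client.get_values(...)
for row_range in iter_row_ranges("Sheet1", num_rows_in_sheet=1005):
    print(row_range)  # Sheet1!2:202, Sheet1!203:403, ..., Sheet1!806:1005
```

Advancing the cursor to last_row + 1, rather than by the raw batch size, avoids re-reading the boundary row between batches.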
Okay, I'll try to make this change.
@@ -152,6 +152,11 @@ def get_sheets_in_spreadsheet(client, spreadsheet_id: str):
    spreadsheet_metadata = Spreadsheet.parse_obj(client.get(spreadsheetId=spreadsheet_id, includeGridData=False))
    return [sheet.properties.title for sheet in spreadsheet_metadata.sheets]

+ @staticmethod
+ def get_sheets_properties(client, spreadsheet_id: str):
should this be called get_sheet_row_count? Also, can you add the return value type signature?
@sherifnada done
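For illustration, a sketch of what the renamed helper with a return type could look like. It reuses the Spreadsheet model from the diff above; gridProperties.rowCount is the Sheets API field that carries a sheet's row count, but the exact attribute path in the connector's model is an assumption here:

```python
from typing import Dict


@staticmethod
def get_sheet_row_count(client, spreadsheet_id: str) -> Dict[str, int]:
    """Map each sheet title to its total row count, taken from grid metadata."""
    spreadsheet_metadata = Spreadsheet.parse_obj(client.get(spreadsheetId=spreadsheet_id, includeGridData=False))
    # rowCount counts all grid rows, including the header row and trailing empty rows
    return {sheet.properties.title: sheet.properties.gridProperties.rowCount for sheet in spreadsheet_metadata.sheets}
```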
for sheet in sheet_to_column_index_to_name.keys():
    logger.info(f"Syncing sheet {sheet}")
    column_index_to_name = sheet_to_column_index_to_name[sheet]
    row_cursor = 2  # we start syncing past the header row
-   while True:
+   while row_cursor <= sheet_parameters[sheet]:
@yevhenii-ldv wouldn't this overcount in the case that row_cursor + batch size is greater than sheet_parameters[sheet]?
@sherifnada Only the initial row of the range needs to exist. If the final row of the interval goes beyond the end of the table, that's okay: the API returns only the rows that actually exist, and the loop exits on the next iteration.
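To make that concrete, a sketch of the bounded loop; the cursor arithmetic and the sheet_row_counts name are assumptions about the implementation, and the key point is the comment on the range end:

```python
row_cursor = 2  # start past the header row
while row_cursor <= sheet_row_counts[sheet]:  # plays the role of sheet_parameters[sheet]
    # The range end may point past the last row; the values API returns only
    # the rows that actually exist, so nothing is over-counted.
    row_range = f"{sheet}!{row_cursor}:{row_cursor + ROW_BATCH_SIZE}"
    row_batch = SpreadsheetValues.parse_obj(
        client.get_values(spreadsheetId=spreadsheet_id, ranges=row_range, majorDimension="ROWS")
    )
    row_cursor += ROW_BATCH_SIZE + 1
    ...  # convert the returned rows to records and emit them
```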
What
Fix failing Google Sheets Source with large spreadsheets (more than 1000 rows).
How
The error occurred when we tried to request data from non-existent rows, which means the sheet has been read to the end; we only need to catch that error and exit the loop.
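As a sketch, the catch-and-exit approach this describes (which the review above later steers toward capping the range by row count instead); that the client raises googleapiclient's HttpError here is an assumption:

```python
from googleapiclient.errors import HttpError

row_cursor = 2
while True:
    row_range = f"{sheet}!{row_cursor}:{row_cursor + ROW_BATCH_SIZE}"
    try:
        row_batch = SpreadsheetValues.parse_obj(
            client.get_values(spreadsheetId=spreadsheet_id, ranges=row_range, majorDimension="ROWS")
        )
    except HttpError:
        # Asking for rows past the end of the sheet fails, which signals
        # that the whole sheet has been read.
        break
    row_cursor += ROW_BATCH_SIZE + 1
    ...  # convert the returned rows to records and emit them
```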
One more thing: I slightly increased the maximum time for backoff, since once, while reading 192302 rows, a 429 error was raised after 126000 rows.
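A hedged sketch of the kind of retry policy this refers to, using the backoff library's on_exception decorator; the max_time value and the giveup predicate are illustrative, not the connector's actual settings:

```python
import backoff
from googleapiclient.errors import HttpError


def is_not_rate_limited(error: HttpError) -> bool:
    """Give up immediately on anything other than a 429 rate-limit response."""
    return error.resp.status != 429


@backoff.on_exception(backoff.expo, HttpError, max_time=120, giveup=is_not_rate_limited)
def get_values_with_retry(client, spreadsheet_id: str, row_range: str):
    # Retries with exponential backoff (up to max_time seconds) while the
    # API keeps returning 429s; other HTTP errors propagate right away.
    return client.get_values(spreadsheetId=spreadsheet_id, ranges=row_range, majorDimension="ROWS")
```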
Pre-merge Checklist
Recommended reading order
test.java
component.ts