
#1740 Issue: fix failing Google Sheets Source with large spreadsheet #1762

Merged

Conversation

yevhenii-ldv
Contributor

What

Fix the Google Sheets source failing on large spreadsheets (more than 1,000 rows).

How

The error occurred when we requested rows that don't exist, which simply means the sheet has been read to the end; we only need to catch the error and exit the loop.
One more thing: I slightly increased the maximum backoff time, since on one sync of 192,302 rows a 429 (rate limit) error occurred after about 126,000 rows.
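In code, the approach looks roughly like this (a sketch, not the exact connector code: `client.get_values` and `ROW_BATCH_SIZE` come from the connector, while the helper function and the batch value below are illustrative):

```python
from googleapiclient import errors

ROW_BATCH_SIZE = 200  # illustrative value


def read_all_row_batches(client, spreadsheet_id: str, sheet: str):
    """Yield raw row batches until a request starts past the last row of the sheet."""
    row_cursor = 2  # start syncing past the header row
    while True:
        range_ = f"{sheet}!{row_cursor}:{row_cursor + ROW_BATCH_SIZE}"
        try:
            response = client.get_values(spreadsheetId=spreadsheet_id, ranges=range_, majorDimension="ROWS")
        except errors.HttpError:
            # Requesting a range that starts beyond the end of the sheet fails,
            # which tells us the sheet has been fully read.
            break
        yield response
        row_cursor += ROW_BATCH_SIZE + 1
```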

Pre-merge Checklist

  • Run integration tests
  • Publish Docker images

Recommended reading order

  1. test.java
  2. component.ts
  3. the rest

try:
    row_batch = SpreadsheetValues.parse_obj(
        client.get_values(spreadsheetId=spreadsheet_id, ranges=range, majorDimension="ROWS")
    )
sherifnada (Contributor) commented on Jan 21, 2021:

can we instead cap the range based on how large the spreadsheet is? e.g.:

if row_cursor >= num_rows_in_sheet:
    break

range = f"{sheet}!{row_cursor}:{min(num_rows_in_sheet, row_cursor + ROW_BATCH_SIZE)}"

I think the current approach of swallowing any HttpError is not very robust, as we could be getting other kinds of exceptions.
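Assembled into the read loop, the suggestion would look roughly like this (a sketch; `num_rows_in_sheet` is assumed to come from the sheet's grid properties, and the `+ 1` cursor step reflects that A1 ranges are inclusive):

```python
row_cursor = 2  # start past the header row
while row_cursor <= num_rows_in_sheet:
    # Clamp the end of the range to the sheet's real row count so the request
    # never references rows that don't exist.
    range_ = f"{sheet}!{row_cursor}:{min(num_rows_in_sheet, row_cursor + ROW_BATCH_SIZE)}"
    row_batch = SpreadsheetValues.parse_obj(
        client.get_values(spreadsheetId=spreadsheet_id, ranges=range_, majorDimension="ROWS")
    )
    row_cursor += ROW_BATCH_SIZE + 1
```

With the cap in place, any HttpError that does surface is a real failure rather than an expected end-of-sheet signal, so it no longer needs to be swallowed.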

yevhenii-ldv (Contributor, Author) replied:

Okay, I'll try to make this change.

@@ -152,6 +152,11 @@ def get_sheets_in_spreadsheet(client, spreadsheet_id: str):
    spreadsheet_metadata = Spreadsheet.parse_obj(client.get(spreadsheetId=spreadsheet_id, includeGridData=False))
    return [sheet.properties.title for sheet in spreadsheet_metadata.sheets]

@staticmethod
def get_sheets_properties(client, spreadsheet_id: str):
sherifnada (Contributor) commented:

should this be called get_sheet_row_count? Also, can you add the return value type signature?
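Presumably something like the following, inside the same helper class as the diff above (a sketch only: `gridProperties.rowCount` is the Sheets API field for a sheet's row count, but the exact shape of the connector's `Spreadsheet` model here is an assumption):

```python
from typing import Dict

@staticmethod
def get_sheet_row_count(client, spreadsheet_id: str) -> Dict[str, int]:
    # Map each sheet title to its total row count from the grid properties.
    spreadsheet_metadata = Spreadsheet.parse_obj(client.get(spreadsheetId=spreadsheet_id, includeGridData=False))
    return {sheet.properties.title: sheet.properties.gridProperties.rowCount for sheet in spreadsheet_metadata.sheets}
```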


 for sheet in sheet_to_column_index_to_name.keys():
     logger.info(f"Syncing sheet {sheet}")
     column_index_to_name = sheet_to_column_index_to_name[sheet]
     row_cursor = 2  # we start syncing past the header row
-    while True:
+    while row_cursor <= sheet_parameters[sheet]:
sherifnada (Contributor) commented:

@yevhenii-ldv wouldn't this overcount in the case that row_cursor + ROW_BATCH_SIZE is greater than sheet_parameters[sheet]?

yevhenii-ldv (Contributor, Author) replied:

@sherifnada Only the starting row of the range needs to exist. If the end of the range goes beyond the last row of the table, that's okay: we only get back the rows that really exist, and the next iteration exits the loop.
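A quick walk-through with made-up numbers (assuming the cursor advances by `ROW_BATCH_SIZE + 1`, since A1 ranges are inclusive):

```python
num_rows, ROW_BATCH_SIZE = 1005, 200  # illustrative values

row_cursor = 2
while row_cursor <= num_rows:
    print(f"Sheet1!{row_cursor}:{row_cursor + ROW_BATCH_SIZE}")
    row_cursor += ROW_BATCH_SIZE + 1
```

The last request is `Sheet1!806:1006`: its end overshoots the 1,005-row sheet, but the API simply returns rows 806-1005. The cursor then becomes 1007, the condition fails, and the loop exits without an extra request.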

@sherifnada sherifnada merged commit c2dab06 into master Jan 21, 2021
@sherifnada sherifnada deleted the ykurochkin/google-sheets-fails-on-large-spreadsheet branch January 21, 2021 22:44