Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Google Spreadsheet's metadata is being re-fetched too often #99

Open
mephinet opened this issue Jan 4, 2024 · 2 comments
Open

Google Spreadsheet's metadata is being re-fetched too often #99

mephinet opened this issue Jan 4, 2024 · 2 comments

Comments

@mephinet
Copy link
Contributor

mephinet commented Jan 4, 2024

gspread-pandas' Sheet class has implemented some logic to know when sheet metadata needs to be refetched - cf refresh_spread_metadata. However, the current implementation fails to make use of its potential. When accessing a multi-sheet spread using sheet_to_df, debug logs show that the spreadsheet's metadata endpoint is hit multiple times - in my example, 9 times to fetch 2 sheets.

I have created a standalone pytest that reproduces this behaviour: https://gist.github.com/mephinet/7415a4641a65fbf0d51f0eae11ee21dc
I failed to add it to the betamax'ed test suite because I ran into timeouts...

Tested with gspread-pandas version 3.2.3 on Arch Linux, as well as with the git master.

@mephinet
Copy link
Contributor Author

mephinet commented Jan 4, 2024

To show the potential: changing the Sheet.sheets decorator from @property to @cached_property reduces the requests from 9 down to 3:

  • one from open_spread -> open_by_url
  • one from open_spread -> refresh_metadata (which is unnecessary here)
  • and one from open_sheet (which is caused by accessing the sheets property)

Fixing this issue would improve the performance of gspread-pandas by a relevant factor as I'm currently seeing 429 responses caused by these metadata requests, causing a significant slowdown.

@aiguofer
Copy link
Owner

Hey, sorry I haven't had much time to maintain this and e-mails seem to get lost in my inbox. The reason I originally implemented all this refresh metadata logic is because there's a few methods that depend on available metadata. If we don't refresh it, the values will be wrong. For example, number of rows and cols and number of frozen rows and cols all depend on that metadata.

The best way to avoid these refreshes would be to parse the responses for the different methods that modify the sheet, extract the updated values, and update the metadata. Unfortunately, that is likely a pretty manual process since the response format depends on whether it's a batch operation or not.

I'm not sure if gspread v6 changes any of this, but unfortunately I don't have much time to look into this. any PRs would be welcome though!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants