Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate header check in 5.2.0 is not backward compatible #1007

Closed
robrap opened this issue Mar 7, 2022 · 17 comments · Fixed by #1021
Closed

Duplicate header check in 5.2.0 is not backward compatible #1007

robrap opened this issue Mar 7, 2022 · 17 comments · Fixed by #1021
Assignees
Labels
Milestone

Comments

@robrap
Copy link

robrap commented Mar 7, 2022

Describe the bug

A spreadsheet with multiple columns that had a blank header used to load using get_all_records before 5.2.0, but it now fails with "headers must be uniques" exception. I presume, but did not confirm, that it is due to this simplification: c8a5a73

To Reproduce
Steps to reproduce the behavior:

  1. Run get_all_records on a spreadsheet with multiple columns with a blank header.
  2. See error "headers must be uniques".

Expected behavior
This should work as it used to without an error.

Environment info:

  • Operating System [e.g. Linux, Windows, macOS]: macOS
  • Python version: 3.8
  • gspread version: 5.2.0

Stack trace or other output that would be helpful
Traceback (most recent call last):
File "", line 1, in
File "/edx/other/edx-repo-health/repo_health/check_ownership.py", line 79, in check_ownership
records = find_worksheet(google_creds_file, spreadsheet_url, worksheet_id)
File "/edx/other/edx-repo-health/repo_health/check_ownership.py", line 44, in find_worksheet
return worksheet.get_all_records()
File "/edx/venvs/edx-repo-health/lib/python3.8/site-packages/gspread/worksheet.py", line 408, in get_all_records
raise GSpreadException("headers must be uniques")
gspread.exceptions.GSpreadException: headers must be uniques

@lavigne958 lavigne958 added Need investigation This issue needs to be tested or investigated Bug labels Mar 7, 2022
@lavigne958 lavigne958 self-assigned this Mar 7, 2022
@lavigne958
Copy link
Collaborator

Hi @robrap

thank you for raising this issue. I confirm that it breaks if all your headers are empty strings "".

I am wondering why would you use the method get_all_records to retrieve all the values of a sheet if your headers are empty 🤔

Instead you could use the following methods:

  • get_values() with no range specified, it will return all the values of the current sheet
  • get_all_values() this is a legacy method that calls get_values
  • get_all_cells() that will return a single list of Cell object for every cell in that sheet.

I am still thinking about a way to prevent this breaking change and keep the new feature.

@lavigne958 lavigne958 removed the Need investigation This issue needs to be tested or investigated label Mar 7, 2022
@robrap
Copy link
Author

robrap commented Mar 7, 2022

Good question. Not all of our columns have empty string headers. We just have a number of columns at the end of the sheet with empty strings. Some of these empty header columns contain a pivot table related to the actual data in the sheet, so I don't want to just delete the columns.

robrap added a commit to openedx/edx-repo-health that referenced this issue Mar 7, 2022
gspread 5.2.0 introduced a backward-incompatible change related
to a sheet with extra columns with blank headers. For details
of the bug, see burnash/gspread#1007.
We can upgrade once the issue is resolved, or if we delete these
extra columns. Note: For edX specific ownership spreadsheet, this
sheet currently contains a pivot table that would need to be
moved elsewhere.
@lavigne958
Copy link
Collaborator

I understand, it bothers me to introduce backward incompatible feature.

The simplest way I can think of right now is to add a new flag to the function that enable/disable this feature.
That should suit you and allow you to benefit from newer features with future releases.

I still want to keep the feature enabled by default for the simplest reason that if 2 headers a equals then the column content of the first header is overridden by the second column content and that is the purpose of this method (get_all_records)

robrap added a commit to openedx/edx-repo-health that referenced this issue Mar 7, 2022
gspread 5.2.0 introduced a backward-incompatible change related
to a sheet with extra columns with blank headers. For details
of the bug, see burnash/gspread#1007.
We can upgrade once the issue is resolved, or if we delete these
extra columns. Note: For edX specific ownership spreadsheet, this
sheet currently contains a pivot table that would need to be
moved elsewhere.
@robrap
Copy link
Author

robrap commented Mar 7, 2022

Thanks. A workaround would be great.

I still want to keep the feature enabled by default for the simplest reason that if 2 headers a equals then the column content of the first header is overridden by the second column content and that is the purpose of this method (get_all_records)

I'm not clear on what "the feature" is? Is it just the more strict check on the headers?

Note: it's up to you whether you change the default in a 6.0.0 release to convey the breaking change, or do it in a minor release. You may want to update your release notes either way. Thanks again!

@MartinVardanyan
Copy link

I solved this problem. The problem was in gspread version. Just install 5.1.1 instead of 5.2.0.

@lavigne958
Copy link
Collaborator

something is still not clear to me with this situation, if you use the version 5.1.1 of gspread and use the method get_all_records if you have duplicated headers then it means you miss some of the data you have in the current sheet 🤔 in fact as reported by the original issue, the latest column with the duplicated header will override the content of the first one. so the entire column of data is not returned by this method and nothing raises to tell you that you miss an entire column of data.

@robrap
Copy link
Author

robrap commented Mar 21, 2022

@lavigne958: Here is an example of what I described above (in CSV format):

important-1,important-2,important-3,,
1,2,3,,
1,2,3,,
1,2,3,,non-essential-data
1,2,3,,36
1,2,3,,
1,2,3,,

Notice that the non-essential-data is in one of two columns with a blank column header, and I do not care if it gets lost. It is data that someone added to the side of the real data that I care about. Does this make it more clear?

  1. I don't want to delete the columns with the non-essential-data, because it may be important to someone, but just not in processing the sheet. Not sure whether columns with an empty column header should be treated differently?
  2. I'm just looking for some backward compatible solution.

@lavigne958
Copy link
Collaborator

Sure it makes it clear. Thank you for this data sample.

What I can think of that would solve your issue and provide a nice addition to the method get_all_records is adding a extra parameter that allows you to select the column range.

I need some time to think it through, but that should solve your issue.

@robrap
Copy link
Author

robrap commented Mar 22, 2022

Thanks @lavigne958. That might work. Unfortunately, it makes it a little more brittle, because if you care most about a subset of columns by header/key that are at the end of the spreadsheet (the right), and someone adds new columns to the left, they would fall out of range.

What if you could declare the column header keys you care about, and those must be unique, and are what gets loaded? It could even be a different method. Maybe something like get_columns? Keep in mind, I don't know the current API very well, so this is just food-for-thought.

@lavigne958
Copy link
Collaborator

This is not a bad idea, this is what has been asked here in this issue #976

I will look at it, it could be one way to solve this issue.

@lavigne958 lavigne958 added the Need investigation This issue needs to be tested or investigated label Apr 6, 2022
@lavigne958 lavigne958 added this to the 5.3.1 milestone Apr 6, 2022
lavigne958 added a commit that referenced this issue Apr 6, 2022
Add a new argument to `get_all_records` to provide the list of expected
headers.

The given expected headers must:
- be unique
- be part of the complete headers list
- must not contain extra headers

This will provide a way for users to use this method and still
have *some* duplicated headers that are not relevant to pull.

This will ensure the columns that matters have unique headers.

Closes #1007
@lavigne958
Copy link
Collaborator

I found a potential way to make the best of both worlds:

  • add extra argument to provide the expected list of keys that matters
  • make sure this list is unique
  • make sure this list is part of the pulled headers
  • make sure the list does not contain extra headers (it is not preventing gspread from working but it is safer)

See linked PR.

@lavigne958 lavigne958 removed the Need investigation This issue needs to be tested or investigated label Apr 10, 2022
@robrap
Copy link
Author

robrap commented Apr 12, 2022

Thank you @lavigne958.

@lavigne958
Copy link
Collaborator

It's done ✔️

This proposal for a fix has been released in https://github.com/burnash/gspread/releases/tag/v5.3.2

@pravarag
Copy link

Hi all, thanks for putting in above information. I too faced a similar error with version 5.3.2 today, had to downgrade it back to 5.1.1 to make it work. I tried the above data sample, but stil didn't work for me 🤔

@lavigne958
Copy link
Collaborator

Hi, version 5.3.2 provides an extra parameter that allow you to pass a list of headers you expect from the spreadsheet. This allows you to use the method get_all_records with only a subsets of your headers that are unique.

@robrap
Copy link
Author

robrap commented Apr 14, 2022

Note: The change is still backward incompatible, but passing expected_headers=[] will provide the legacy behavior. However, ideally you would set expected_headers to the actual list of headers you expect in order to enable more complete validation.

@deepansh96
Copy link

This was still happening to me
My version is 5.4.0

Workaround :

sheet_ref = gspread_client.open_by_key(sheet_key).get_worksheet_by_id(worksheet_gid)
expected_headers = sheet_ref.row_values(1)
all_records = sheet_ref.get_all_records(expected_headers=expected_headers)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants