Duplicate header check in 5.2.0 is not backward compatible #1007

robrap · 2022-03-07T17:30:28Z

Describe the bug

A spreadsheet with multiple columns that had a blank header used to load with get_all_records before 5.2.0, but it now fails with a "headers must be uniques" exception. I presume, but did not confirm, that this is due to this simplification: c8a5a73

To Reproduce
Steps to reproduce the behavior:

1. Run get_all_records on a spreadsheet with multiple columns with a blank header.
2. See error "headers must be uniques".

Expected behavior
This should work as it used to without an error.

Environment info:

Operating System: macOS
Python version: 3.8
gspread version: 5.2.0

Stack trace or other output that would be helpful
Traceback (most recent call last):
File "", line 1, in
File "/edx/other/edx-repo-health/repo_health/check_ownership.py", line 79, in check_ownership
records = find_worksheet(google_creds_file, spreadsheet_url, worksheet_id)
File "/edx/other/edx-repo-health/repo_health/check_ownership.py", line 44, in find_worksheet
return worksheet.get_all_records()
File "/edx/venvs/edx-repo-health/lib/python3.8/site-packages/gspread/worksheet.py", line 408, in get_all_records
raise GSpreadException("headers must be uniques")
gspread.exceptions.GSpreadException: headers must be uniques


lavigne958 · 2022-03-07T18:58:09Z

Hi @robrap

Thank you for raising this issue. I confirm that it breaks if all your headers are empty strings "".

I am wondering why you would use the method get_all_records to retrieve all the values of a sheet if your headers are empty 🤔

Instead you could use the following methods:

- get_values() with no range specified: returns all the values of the current sheet
- get_all_values(): a legacy method that calls get_values
- get_all_cells(): returns a single list of Cell objects, one for every cell in that sheet

I am still thinking about a way to prevent this breaking change and keep the new feature.
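As a rough illustration of the first alternative: the rows returned by get_values() (a list of lists, header row first) can be turned into records by hand while skipping columns whose header is blank. The helper name records_from_values and the sample data are hypothetical and not part of gspread; this is only a sketch under those assumptions.

```python
def records_from_values(values):
    """Build one dict per data row, ignoring columns whose header is blank.

    `values` is assumed to be a list of rows (lists of strings) with the
    header row first, such as the result of worksheet.get_values().
    Hypothetical helper, not part of gspread.
    """
    if not values:
        return []
    headers, *rows = values
    # Keep only the positions of columns that have a non-blank header.
    named = [(i, h) for i, h in enumerate(headers) if h.strip()]
    records = []
    for row in rows:
        # Treat cells missing from a short row as empty strings.
        records.append({h: (row[i] if i < len(row) else "") for i, h in named})
    return records


# Hypothetical sheet content: two named columns, two blank-header columns.
values = [
    ["repo", "owner", "", ""],
    ["repo-a", "team-x", "", "side note"],
]
records = records_from_values(values)
```

This sketch does not check the named headers for duplicates, so it trades the validation done by get_all_records for tolerance of blank headers.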

robrap · 2022-03-07T19:26:23Z

Good question. Not all of our columns have empty string headers. We just have a number of columns at the end of the sheet with empty strings. Some of these empty header columns contain a pivot table related to the actual data in the sheet, so I don't want to just delete the columns.

gspread 5.2.0 introduced a backward-incompatible change related to a sheet with extra columns with blank headers. For details of the bug, see burnash/gspread#1007. We can upgrade once the issue is resolved, or if we delete these extra columns. Note: For edX specific ownership spreadsheet, this sheet currently contains a pivot table that would need to be moved elsewhere.

lavigne958 · 2022-03-07T20:28:36Z

I understand. It bothers me to introduce a backward-incompatible feature.

The simplest way I can think of right now is to add a new flag to the function that enables or disables this feature.
That should suit you and allow you to benefit from newer features in future releases.

I still want to keep the feature enabled by default, for the simple reason that if 2 headers are equal, then the column content of the first header is overridden by the second column's content, and that is the purpose of this method (get_all_records).
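The overriding described here is what plain Python dictionaries do when the same key is assigned twice. The snippet below is not gspread's code, only a minimal sketch with hypothetical header and row values showing how a record built from a header row with a duplicate loses the first column's value.

```python
headers = ["name", "team", "team"]  # hypothetical header row with a duplicate
row = ["svc-a", "alpha", "beta"]    # hypothetical data row

# Pair each header with its cell; a dict can hold only one "team" key,
# so the value from the later column replaces the earlier one.
record = dict(zip(headers, row))
```

The value "alpha" is dropped silently, which is the data loss the uniqueness check is meant to surface.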


robrap · 2022-03-07T21:13:11Z

Thanks. A workaround would be great.

> I still want to keep the feature enabled by default, for the simple reason that if 2 headers are equal, then the column content of the first header is overridden by the second column's content, and that is the purpose of this method (get_all_records).

I'm not clear on what "the feature" is. Is it just the stricter check on the headers?

Note: it's up to you whether you change the default in a 6.0.0 release to convey the breaking change, or do it in a minor release. You may want to update your release notes either way. Thanks again!

MartinVardanyan · 2022-03-10T19:54:05Z

I solved this problem. The problem was in gspread version. Just install 5.1.1 instead of 5.2.0.

lavigne958 · 2022-03-11T09:42:45Z

Something is still not clear to me about this situation. If you use version 5.1.1 of gspread and call get_all_records on a sheet with duplicated headers, you miss some of the data in that sheet 🤔 In fact, as reported in the original issue, the last column with a duplicated header overrides the content of the first one, so an entire column of data is not returned by this method and nothing is raised to tell you that it is missing.

robrap · 2022-03-21T21:28:02Z

@lavigne958: Here is an example of what I described above (in CSV format):

important-1,important-2,important-3,,
1,2,3,,
1,2,3,,
1,2,3,,non-essential-data
1,2,3,,36
1,2,3,,
1,2,3,,

Notice that the non-essential-data is in one of two columns with a blank column header, and I do not care if it gets lost. It is data that someone added to the side of the real data that I care about. Does this make it more clear?

I don't want to delete the columns with the non-essential-data, because it may be important to someone, but just not in processing the sheet. Not sure whether columns with an empty column header should be treated differently?
I'm just looking for some backward compatible solution.
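To make the sample concrete, here is a small sketch that parses a shortened copy of it with Python's csv module and counts how often each header occurs. It does not use gspread; it only shows that the blank header is the sole duplicate in this layout.

```python
import csv
import io
from collections import Counter

# Shortened copy of the CSV sample above.
sample = """important-1,important-2,important-3,,
1,2,3,,
1,2,3,,non-essential-data
1,2,3,,36
"""

rows = list(csv.reader(io.StringIO(sample)))
# Count occurrences of each header in the first row.
header_counts = Counter(rows[0])
# Keep only the headers that appear more than once.
duplicated = {h: n for h, n in header_counts.items() if n > 1}
```

Here only the empty string is duplicated, while the three named headers are unique.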

lavigne958 · 2022-03-21T22:54:35Z

Sure it makes it clear. Thank you for this data sample.

What I can think of that would solve your issue, and provide a nice addition to the method get_all_records, is adding an extra parameter that allows you to select the column range.

I need some time to think it through, but that should solve your issue.

robrap · 2022-03-22T14:59:37Z

Thanks @lavigne958. That might work. Unfortunately, it makes it a little more brittle, because if you care most about a subset of columns by header/key that are at the end of the spreadsheet (the right), and someone adds new columns to the left, they would fall out of range.

What if you could declare the column header keys you care about, and those must be unique, and are what gets loaded? It could even be a different method. Maybe something like get_columns? Keep in mind, I don't know the current API very well, so this is just food-for-thought.

lavigne958 · 2022-04-06T10:31:12Z

This is not a bad idea. It is what has been asked for in issue #976.

I will look at it, it could be one way to solve this issue.

Add a new argument to `get_all_records` to provide the list of expected headers.

The given expected headers must:
- be unique
- be part of the complete headers list
- must not contain extra headers

This will provide a way for users to use this method and still have *some* duplicated headers that are not relevant to pull. This will ensure the columns that matter have unique headers.

Closes #1007

lavigne958 · 2022-04-06T14:48:29Z

I found a potential way to make the best of both worlds:

- add an extra argument to provide the expected list of keys that matter
- make sure this list is unique
- make sure this list is part of the pulled headers
- make sure the list does not contain extra headers (this does not prevent gspread from working, but it is safer)

See linked PR.
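The checks listed above can be sketched in plain Python as follows. This is not the code from the linked PR: the function name check_expected_headers, the use of ValueError, and the error messages are hypothetical, and the final check (each expected header names exactly one column) is an added assumption about what "the columns that matter have unique headers" implies.

```python
def check_expected_headers(headers, expected_headers):
    """Raise ValueError unless expected_headers passes the checks above.

    `headers` is the full header row of the sheet; `expected_headers` is
    the caller-supplied list. Hypothetical helper, not part of gspread.
    """
    # The expected list itself must not contain duplicates.
    if len(expected_headers) != len(set(expected_headers)):
        raise ValueError("expected headers are not unique")
    # Every expected header must exist in the sheet (no extra headers).
    unknown = [h for h in expected_headers if h not in headers]
    if unknown:
        raise ValueError(f"expected headers not found in the sheet: {unknown}")
    # Assumption: an expected header must not name more than one column.
    ambiguous = [h for h in expected_headers if headers.count(h) > 1]
    if ambiguous:
        raise ValueError(f"expected headers are duplicated in the sheet: {ambiguous}")


# Blank headers are duplicated in the sheet but are not expected, so this passes.
check_expected_headers(["a", "b", "", ""], ["a", "b"])
```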

robrap · 2022-04-12T14:56:10Z

Thank you @lavigne958.

lavigne958 · 2022-04-12T15:21:12Z

It's done ✔️

This proposal for a fix has been released in https://github.com/burnash/gspread/releases/tag/v5.3.2

pravarag · 2022-04-14T14:09:39Z

Hi all, thanks for providing the information above. I faced a similar error with version 5.3.2 today and had to downgrade to 5.1.1 to make it work. I tried the data sample above, but it still didn't work for me 🤔

lavigne958 · 2022-04-14T14:55:55Z

Hi, version 5.3.2 provides an extra parameter that allows you to pass a list of the headers you expect from the spreadsheet. This lets you use the method get_all_records with only a subset of your headers, as long as that subset is unique.

robrap · 2022-04-14T17:53:23Z

Note: The change is still backward incompatible, but passing expected_headers=[] will provide the legacy behavior. However, ideally you would set expected_headers to the actual list of headers you expect in order to enable more complete validation.
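A minimal sketch of passing the actual expected headers, assuming a gspread Worksheet from 5.3.2 or later where get_all_records() accepts expected_headers. The helper name load_expected_records is hypothetical, and the projection step rests on the assumption that the returned dictionaries may still contain keys for other columns, such as a blank header.

```python
def load_expected_records(worksheet, expected_headers):
    """Load records and keep only the expected columns.

    `worksheet` is assumed to be a gspread Worksheet (5.3.2 or later).
    Hypothetical helper, not part of gspread.
    """
    records = worksheet.get_all_records(expected_headers=expected_headers)
    # Returned dicts may still carry keys for other columns (for example
    # a blank header), so project each record onto the expected headers.
    return [{h: rec[h] for h in expected_headers} for rec in records]
```

With the sample sheet discussed earlier, the call would look like load_expected_records(worksheet, ["important-1", "important-2", "important-3"]).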


deepansh96 · 2022-08-30T07:51:39Z

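This was still happening to me.
My version is 5.4.0.

Workaround:

# Open the worksheet, then read its first row to use as the expected headers
sheet_ref = gspread_client.open_by_key(sheet_key).get_worksheet_by_id(worksheet_gid)
expected_headers = sheet_ref.row_values(1)
# Pass the header row back explicitly when loading the records
all_records = sheet_ref.get_all_records(expected_headers=expected_headers)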

lavigne958 added Need investigation and Bug labels Mar 7, 2022

lavigne958 self-assigned this Mar 7, 2022

lavigne958 removed the Need investigation label Mar 7, 2022

robrap mentioned this issue Mar 7, 2022

fix: add gspread<5.2.0 constraint openedx/edx-repo-health#254

Merged

sobisonator mentioned this issue Mar 8, 2022

Error loading google sheet on latest version of Python gspread sobisonator/imp19c#259

Closed

lavigne958 added the Need investigation label Apr 6, 2022

lavigne958 added this to the 5.3.1 milestone Apr 6, 2022

lavigne958 mentioned this issue Apr 6, 2022

Bugfix/get all record duplicated columns #1021

Merged

lavigne958 removed the Need investigation label Apr 10, 2022

lavigne958 closed this as completed in #1021 Apr 12, 2022

k4black added a commit to manytask/manytask that referenced this issue May 8, 2022

fix: gspread 5.2.0 get_all_records error (duplicated columns)

f79cca1

burnash/gspread#1007

trislee mentioned this issue Oct 26, 2022

fixed channel sync bugs bellingcat/cisticola#67

Merged

jerryboy1031 mentioned this issue Jul 7, 2023

GSpreadException: the given 'expected_headers' are not uniques jerryboy1031/data2sheet#5

Closed
