
Wrong error message when using insert_data_bulk #161

Open
diegoxfx opened this issue May 26, 2023 · 11 comments
Labels
bug Something isn't working

Comments

@diegoxfx

Driver version

2.0.910

Redshift version

I don't know

Client Operating System

Amazon Linux 2

Python version

3.10.8

Table schema

Problem description

I pass the column names of the temporary table I want to insert into. Let's say one of the column names is misspelled.

  1. Expected behaviour:
    What I expect is this error message: "Invalid column name: ..."
  2. Actual behaviour:
    What I get is: "TypeError: 'NoneType' object is not subscriptable"
  3. Error message/stack trace:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[32], line 2
      1 with get_redshift_connector() as conn:
----> 2     insert_data(conn, filename, columns_table, columns_type, file_index_columns, temp_table)

File ~/SageMaker/utils.py:34, in insert_data(conn, filename, columns_table, columns_type, columns_index_file, table_name, delimiter, batch_size)
     32 temp_table_query += '(' + ', '.join(cols) + ')'
     33 cursor.execute(temp_table_query)
---> 34 cursor.insert_data_bulk(
     35     filename=filename,
     36     table_name=table_name,
     37     parameter_indices=columns_index_file,
     38     column_names=columns_table,
     39     delimiter=',',
     40     batch_size=1024
     41 )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/redshift_connector/cursor.py:306, in Cursor.insert_data_bulk(self, filename, table_name, parameter_indices, column_names, delimiter, batch_size)
    304 if not self.__is_valid_table(table_name):
    305     raise InterfaceError("Invalid table name passed to insert_data_bulk: {}".format(table_name))
--> 306 if not self.__has_valid_columns(table_name, column_names):
    307     raise InterfaceError("Invalid column names passed to insert_data_bulk: {}".format(table_name))
    308 orig_paramstyle = self.paramstyle

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/redshift_connector/cursor.py:363, in Cursor.__has_valid_columns(self, table, columns)
    361         self.execute(q, params)
    362         res = self.fetchone()
--> 363         if typing.cast(typing.List[int], res)[0] != 1:
    364             raise InterfaceError("Invalid column name: {} specified for table: {}".format(params[1], table))
    365 except:

TypeError: 'NoneType' object is not subscriptable

Reproduction code

temp_table = 'some_temp_table'
filename = '/path/to/file'
columns_table_good = ['a']
columns_table_wrong = ['A']
columns_type = ['varchar']
file_index_columns = [0]

def insert_data(conn, filename, columns_table, columns_type, columns_index_file,
                table_name, delimiter=',', batch_size=1024):
    with conn.cursor() as cursor:
        cols = list(map(lambda x: ' '.join(x), (zip(columns_table, columns_type))))
        temp_table_query = f'CREATE TEMPORARY TABLE {table_name} ' 
        temp_table_query += '(' + ', '.join(cols) + ')'
        cursor.execute(temp_table_query)
        cursor.insert_data_bulk(
            filename=filename,
            table_name=table_name,
            parameter_indices=columns_index_file,
            column_names=columns_table,
            delimiter=',',
            batch_size=1024
        )

# This will work
insert_data(conn, filename, columns_table_good, columns_type, file_index_columns, temp_table)

# This won't work
insert_data(conn, filename, columns_table_wrong, columns_type, file_index_columns, temp_table)

Apparently, the error is in these lines, because "res" is None when the column check fails, so indexing it raises the TypeError.

for params in param_list:
    self.execute(q, params)
    res = self.fetchone()
    if typing.cast(typing.List[int], res)[0] != 1:
        raise InterfaceError("Invalid column name: {} specified for table: {}".format(params[1], table))
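A minimal sketch of the missing guard (hypothetical helper names, not the driver's actual code): checking whether fetchone() returned a row before indexing it turns the confusing TypeError into a meaningful InterfaceError:

```python
class InterfaceError(Exception):
    """Stand-in for redshift_connector.InterfaceError."""


def validate_column_result(res, column, table):
    # fetchone() returns None when the validity query matches no rows,
    # so guard against that before indexing into the row.
    if res is None or res[0] != 1:
        raise InterfaceError(
            "Invalid column name: {} specified for table: {}".format(column, table)
        )


validate_column_result([1], "a", "some_temp_table")  # valid column: no error
try:
    validate_column_result(None, "A", "some_temp_table")
except InterfaceError as exc:
    print(exc)  # → Invalid column name: A specified for table: some_temp_table
```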

@Brooke-white
Contributor

Hi @diegoxfx ,

Thank you for reporting this issue. I will take a look into this behavior and update with my findings.

@Brooke-white
Contributor

Hi @diegoxfx , I've implemented a fix for this issue. The check performed on LOC363 of cursor.py is invalid and causes this confusing error. We are incorrectly checking the first element of the return value of fetchone() without ensuring something is actually returned. I anticipate a fix for this in our next release.

@Brooke-white Brooke-white added the bug Something isn't working label Jun 21, 2023
@Brooke-white
Contributor

Hi @diegoxfx , this was just fixed in 2.0.912. Please open a new issue if you continue to experience this problem.

@diegoxfx
Author

Hi @Brooke-white, I'm facing this error again, but this time with every column name:

with get_redshift_connector() as conn:
    with conn.cursor() as cursor:
        cursor.execute("create temporary table temp_table (col1 varchar)")
        cursor.insert_data_bulk(
                    filename=filename,
                    table_name="temp_table",
                    parameter_indices=[0],
                    column_names=["col1"],
                    delimiter=",",
                    batch_size=1024
                )
---------------------------------------------------------------------------
InterfaceError                            Traceback (most recent call last)
Cell In[34], line 4
      2 with conn.cursor() as cursor:
      3     cursor.execute("create temporary table temp_table (col1 varchar)")
----> 4     cursor.insert_data_bulk(
      5                 filename=filename,
      6                 table_name="temp_table",
      7                 parameter_indices=[0],
      8                 column_names=['col1'],
      9                 delimiter=",",
     10                 batch_size=1024
     11             )

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/redshift_connector/cursor.py:311, in Cursor.insert_data_bulk(self, filename, table_name, parameter_indices, column_names, delimiter, batch_size)
    309 if not self.__is_valid_table(table_name):
    310     raise InterfaceError("Invalid table name passed to insert_data_bulk: {}".format(table_name))
--> 311 if not self.__has_valid_columns(table_name, column_names):
    312     raise InterfaceError("Invalid column names passed to insert_data_bulk: {}".format(table_name))
    313 orig_paramstyle = self.paramstyle

File ~/anaconda3/envs/python3/lib/python3.10/site-packages/redshift_connector/cursor.py:369, in Cursor.__has_valid_columns(self, table, columns)
    367 res = self.fetchone()
    368 if res is None:
--> 369     raise InterfaceError(
    370         "Invalid column name. No results were returned when performing column name validity check. Query: {} Parameters: {}".format(
    371             q, params
    372         )
    373     )
    374 if typing.cast(typing.List[int], res)[0] != 1:
    375     raise InterfaceError("Invalid column name: {} specified for table: {}".format(params[1], table))

InterfaceError: Invalid column name. No results were returned when performing column name validity check. Query: select 1 from pg_catalog.svv_all_columns where table_name = ? and column_name = ? Parameters: ['temp_table', 'col1']

Current version: 2.0.918

@Brooke-white
Contributor

Hi @diegoxfx , thank you for letting me know and including the repro. I see you're using a temporary table. By any chance does this reproduce for a non-temporary table? Also, could you let me know which redshift server version you are using?

@Brooke-white Brooke-white reopened this Jan 12, 2024
@diegoxfx
Author

@Brooke-white I don't have permissions to create non-temporary tables, so I can't check that. The Redshift server version is: PostgreSQL 8.0.2 on i686-pc-linux-gnu, compiled by GCC gcc (GCC) 3.4.2 20041017 (Red Hat 3.4.2-6.fc3), Redshift 1.0.61626

@Brooke-white
Contributor

Brooke-white commented Jan 12, 2024

@diegoxfx I see you're on Redshift server "Patch 179". Our latest server version is "Patch 180". I'd recommend upgrading to the latest server version, Patch 180 as of writing, or awaiting the cluster maintenance window. Please let me know if the issue still presents after the upgrade. https://docs.aws.amazon.com/redshift/latest/mgmt/cluster-versions.html#cluster-version-180

@diegoxfx
Author

@Brooke-white Once the server is updated to Patch 180 I'll let you know. For the moment, downgrading redshift-connector worked for me.

@Brooke-white
Contributor

Glad to hear it's working again for you @diegoxfx. Could you confirm the version of redshift-connector where you saw issue (I believe 2.0.910) and the version where you saw resolution? I'd like to double check there is no issue in our driver.

@diegoxfx
Author

diegoxfx commented Feb 2, 2024

@Brooke-white I think the problem is due to this validation:

if not self.__has_valid_columns(table_name, column_names):
    raise InterfaceError("Invalid column names passed to insert_data_bulk: {}".format(table_name))

This check verifies that the table/column pair exists and raises an error when it doesn't.

So, why is this problem happening in the latest version and not before? From what I can see, this line was changed in this commit:

b199bfa#diff-7a712c2f5672b7e37cafedba351f2cbdd8a07f9478317a8ab7ffd1c60f9dcee6L356

The original line was:
q: str = "select 1 from information_schema.columns where table_name = ? and column_name = ?"
and the new one:
q: str = "select 1 from pg_catalog.svv_all_columns where table_name = ? and column_name = ?"

So I made some experiments on Redshift and got this:

CREATE TEMPORARY TABLE test_temp_table(test_temp_column varchar);

select * from pg_catalog.svv_all_columns 
where 
    table_name LIKE '%test_temp_table%';

select * from information_schema.columns 
where 
    table_name LIKE '%test_temp_table%';


I basically created a temporary table using redshift editor, and searched for the table info using both pg_catalog.svv_all_columns and information_schema.columns.

For pg_catalog.svv_all_columns the result was empty.

However, for information_schema.columns the table/column was found.

Maybe the problem could be resolved by adding a parameter that indicates whether the table is temporary, and based on that using pg_catalog.svv_all_columns or information_schema.columns.
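A hedged sketch of that idea (hypothetical helper, not the driver's actual API): select the validation query based on a caller-supplied flag, since temporary tables showed up in information_schema.columns but not in pg_catalog.svv_all_columns in the experiments above:

```python
def column_check_query(is_temporary: bool) -> str:
    # Temporary tables are session-scoped and, per the experiments above,
    # not visible in pg_catalog.svv_all_columns, but they do appear in
    # information_schema.columns.
    view = "information_schema.columns" if is_temporary else "pg_catalog.svv_all_columns"
    return "select 1 from {} where table_name = ? and column_name = ?".format(view)


print(column_check_query(True))   # query against information_schema.columns
print(column_check_query(False))  # query against pg_catalog.svv_all_columns
```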

@diegoxfx
Copy link
Author

diegoxfx commented Feb 5, 2024

I found another bug, but I think it should go in a separate issue. To be brief: if the CSV contains null values (for example, ''), the insert fails because the value inserted is "''" instead of NULL.
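For illustration only (a workaround sketch, not the driver's behavior): pre-processing rows so empty CSV fields become None before binding parameters would insert SQL NULL rather than a literal empty string:

```python
import csv
import io


def rows_with_nulls(csv_text, delimiter=","):
    # csv.reader yields '' for empty fields; map those to None so a
    # parameterized insert binds SQL NULL instead of an empty string.
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    return [[field if field != "" else None for field in row] for row in reader]


print(rows_with_nulls("a,,c\n,,"))  # → [['a', None, 'c'], [None, None, None]]
```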
