feat(cursor): Add method insert_data_bulk #81

Merged: 19 commits into aws:master on Jan 24, 2022

Conversation

@Yash621 (Contributor) commented Dec 30, 2021

@kishaningithub This PR is with respect to issue #75

I have implemented the bulk insert function. Please review it and suggest any changes that are required.

@Brooke-white self-requested a review January 4, 2022 18:15
@Brooke-white (Contributor) left a comment

Hi @Yash621,

Thank you for opening this PR in response to this feature request. The team appreciates the time and effort you've put into this PR.

I've left some granular comments, but on a high level I don't see any tests for this method. Could you please add some so we can ensure it's behaving as desired?

Also, it seems like there are a number of whitespace changes in this PR unrelated to the desired change. Could you please revert those?

@Yash621 (Contributor, Author) commented Jan 16, 2022

@Brooke-white can you guide me on how I can write the tests for this?

@Brooke-white (Contributor)

Sure, @Yash621. I've listed a couple of ideas below. I think the testing for this method can be done using unit tests.

Reading the csv

  • happy path: when the csv file can be opened and is read, what do we expect values_list to look like? (i.e. validating values_list is as intended when passed to cursor.execute)
  • unhappy path: the csv cannot be loaded, the csv is malformed, etc. How do these errors percolate up to the user? Maybe we should add some error handling when trying to open the csv file, to let the end user know if the file cannot be found and provide some additional detail, or similarly for the case where reading the csv fails.

Building sql_query

  • ensuring we can handle a variable number of column names (i.e. sql_query looks as desired when passed to cursor.execute)
  • unhappy path: what happens if the number of columns read from the csv does not match the number of column names specified by the user? Do we raise an exception saying so? (We probably should, because this could provide a more descriptive error than the one that will be provided by cursor.execute(...), e.g. "The csv provided has a column mismatch with the column names provided".)

For the above ideas, when we aren't confirming an exception is raised, we often want to check what is passed to cursor.execute(). We can use unittest.mock to "spy" on the parameters passed to cursor.execute(), which means we can see the parameters passed to the function. This approach is taken in some of our unit tests to provide lower level testing; our existing cursor.execute tests show an example (please note the function parameter mocker), and there are good write-ups online on "spying" using mocks.
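
To make the spy idea concrete, here is a minimal, self-contained sketch (illustrative only; it uses the pytest-mock mocker fixture and a stand-in class rather than the real Cursor):

# minimal sketch of "spying" with pytest-mock's mocker fixture;
# Target is a hypothetical stand-in for the real Cursor class
class Target:
    def execute(self, operation, args=None):
        pass


def insert_one(cursor):
    cursor.execute("INSERT INTO t (c) VALUES (%s);", ["v"])


def test_spy_records_execute_args(mocker):
    cursor = Target()
    # wrap execute so the real method still runs, but calls are recorded
    spy = mocker.spy(cursor, "execute")

    insert_one(cursor)

    # inspect exactly what was passed to execute
    assert spy.called is True
    assert spy.call_args[0][0] == "INSERT INTO t (c) VALUES (%s);"
    assert spy.call_args[0][1] == ["v"]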

Please let me know if you have any questions about writing the tests, or other comments on the PR. I'm more than happy to discuss :)

@david-dest01

Can the AWS team please prioritize support for this? It's a pretty big limitation; it's a common feature across other drivers.

@Brooke-white (Contributor)

> Can the AWS team please prioritize support for this? It's a pretty big limitation; it's a common feature across other drivers.

@david-dest01, we are working with @Yash621, the author of this PR, to get this feature ready for being merged. In the meantime, feel free to review this PR to ensure it suits your needs.

@Yash621 -- do you have an idea of when you plan to post a revision of the PR? It'd be nice if we could include this feature in our next release (scheduled for early February). Please let me know how I can help make this happen.

@Yash621 (Contributor, Author) commented Jan 19, 2022

@Brooke-white will surely raise the revision by today or tomorrow. I'm just stuck on writing the tests for this, as I haven't done it before. Will raise the revision soon :)

@david-dest01

@Brooke-white thank you. @Yash621, thank you for contributing this; happy to help with testing and modifications based on feedback from the AWS team. In the interim I added some comments as well that you may want to consider, but no pressure.

@david-dest01

> @kishaningithub This PR is with respect to issue #75
>
> I have implemented the bulk insert function. Please review it and suggest any changes that are required.

There's a PR template for new features; could you fill that out when finished? It's located in .github.

@Brooke-white (Contributor)

@david-dest01 , thank you for your help reviewing this PR, it's much appreciated! @Yash621, I have resolved the PR comments from David and myself that you addressed so we can more easily see those remaining.

@david-dest01

@Yash621 lmk if you want to pair program to work through multiple args in execute. Hope the feedback provided helps clarify some desirable behavior that accomplishes what you're looking to do, plus other desirable use cases.

@Yash621 (Contributor, Author) commented Jan 19, 2022

@david-dest01 would love to pair program, let me know when we can do this :)

@Yash621 (Contributor, Author) commented Jan 19, 2022

@Brooke-white thanks :), working on the other changes, will push soon :)

@david-dest01

> @david-dest01 would love to pair program, let me know when we can do this :)

Let's coordinate - will shoot you an email.

@Yash621 (Contributor, Author) commented Jan 20, 2022

@Brooke-white I have improved the implementation for the function and have also updated the docstring for it.
Please review and let me know if any changes are required :)

Also, after this I will add the tests within 1-2 days.

@Brooke-white (Contributor)

Thanks for the quick turnaround, @Yash621! I'll post a review by end of day.

@Brooke-white (Contributor) left a comment

@Yash621, I've added one small note re: docs and two suggestions for validating table_name and column_names before we do the bulk insertion. Once these comments are addressed and the tests are added, we should be good to get this merged :). Let me know if you have any questions or need any help!

@Brooke-white (Contributor)

Thanks for the quick turnaround, @Yash621. The tests can go in test/unit/test_cursor.py if they do not need internet access; otherwise they can go in test/integration/test_cursor.py. Once the tests are added, I'll do a quick run of our CI to ensure all is ok, then merge :).

@Yash621 (Contributor, Author) commented Jan 20, 2022

@Brooke-white thanks. Can you also tell me what is_single_database_metadata_val is in the args of all the tests, and do I have to put it in test_insert_data_bulk?

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white I have added error handling for the cases you mentioned above, but I am still a bit confused about how I should use mock for spying on parameters.
Moreover, I have added support for working flexibly with any given set of columns, due to which I am accepting column_indexes as a parameter. So do we need to add that support, or should I remove it?

@Brooke-white (Contributor)

Thanks, @Yash621! I'll write up a sample test that you can build off of (with an explanation) and post it shortly :)

> Moreover, I have added support for working flexibly with any given set of columns, due to which I am accepting column_indexes as a parameter. So do we need to add that support, or should I remove it?

Could you clarify this for me? I'm not fully understanding. Are you saying column_indexes is not needed?

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white yeah, do we need column_indexes?

@Brooke-white (Contributor) commented Jan 21, 2022

@Yash621, I think it could be valuable to leave it in, since it would give more flexibility in addressing cases such as the following:

A user is interested in bulk insertion of only columns id and rating. They have already established a table holding id and fruit in their database, and do not wish to repeat this mapping in the table fruit_ratings.

insert_data_bulk(table_name='fruit_ratings', column_names=['id', 'rating'], column_indexes=[0, 2],...)

id,fruit,rating
1,apple,10
2,orange,8
3,pear,1
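
With that csv, the call above would end up passing roughly the following to cursor.execute (a hypothetical sketch following the query format shown in the tests further down, not captured output):

# hypothetical sketch: what insert_data_bulk would hand to cursor.execute here
sql_query = "INSERT INTO fruit_ratings (id, rating) VALUES (%s, %s), (%s, %s), (%s, %s);"
values_list = ['1', '10', '2', '8', '3', '1']
cursor.execute(sql_query, values_list)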

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white Yeah, I added it with exactly that in mind; I just wanted to confirm whether we need that flexibility or not.

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white I have resolved the prepending issue :)

@Brooke-white (Contributor) left a comment

@Yash621, while testing I noticed a minor issue with how values_list and sql_query are built. I added some comments showing what changes are needed to send things to the server in the way it expects. I also included some tests I wrote for this method; feel free to add more if you'd like.

After making these final changes, please make sure to run the pre-commit hooks to ensure this change matches the rest of the code base. Then let's get this merged! :)

Here are the tests; I had placed them in test/unit/test_cursor.py. The mocking and "spy" bits were a bit tricky to get set up :) Make sure unittest.mock.mock_open is imported, as it's used for reading in a test csv from memory.

import pytest
from unittest.mock import Mock, mock_open, patch

# Cursor and InterfaceError come from this driver; the mocker fixture
# used below is provided by the pytest-mock plugin
from redshift_connector import Cursor, InterfaceError


@pytest.mark.parametrize("indexes, names", [([1], []), ([], ["c1"])])
def test_insert_data_column_names_indexes_mismatch_raises(indexes, names, mocker):
    # mock fetchone to return "True" to ensure the table_name and column_name
    # validation steps pass
    mocker.patch("redshift_connector.Cursor.fetchone", return_value=[1])

    mock_cursor: Cursor = Cursor.__new__(Cursor)
    # mock out the connection
    mock_cursor._c = Mock()
    mock_cursor.paramstyle = 'qmark'

    with pytest.raises(InterfaceError, match="Column names and indexes must be the same length"):
        mock_cursor.insert_data_bulk(
            filename="test_file", table_name='test_table',
            column_indexes=indexes, column_names=names, delimeter=','
        )


in_mem_csv = """\
col1,col2,col3
1,3,foo
2,5,bar
-1,7,baz"""

insert_bulk_data = [
    ([0], ['col1'], ('INSERT INTO  test_table (col1) VALUES (%s), (%s), (%s);', ['1', '2', '-1'])),
    ([1], ['col2'], ('INSERT INTO  test_table (col2) VALUES (%s), (%s), (%s);', ['3', '5', '7'])),
    ([2], ['col3'], ('INSERT INTO  test_table (col3) VALUES (%s), (%s), (%s);', ['foo', 'bar', 'baz'])),
    ([0, 1], ['col1', "col2"], ('INSERT INTO  test_table (col1, col2) VALUES (%s, %s), (%s, %s), (%s, %s);', ['1', '3', '2', '5', '-1', '7'])),
    ([0, 2], ['col1', "col3"], ('INSERT INTO  test_table (col1, col3) VALUES (%s, %s), (%s, %s), (%s, %s);', ['1', 'foo', '2', 'bar', '-1', 'baz'])),
    ([1, 2], ['col2', "col3"], ('INSERT INTO  test_table (col2, col3) VALUES (%s, %s), (%s, %s), (%s, %s);', ['3', 'foo', '5', 'bar', '7', 'baz'])),
    ([0, 1, 2], ['col1', 'col2', "col3"], ('INSERT INTO  test_table (col1, col2, col3) VALUES (%s, %s, %s), (%s, %s, %s), (%s, %s, %s);', ['1', '3', 'foo', '2', '5', 'bar', '-1', '7', 'baz'])),
]


@patch("builtins.open", new_callable=mock_open, read_data=in_mem_csv)
@pytest.mark.parametrize("indexes,names,exp_execute_args", insert_bulk_data)
def test_insert_data_column_stmt(mocked_csv, indexes, names, exp_execute_args, mocker):
    # mock fetchone to return "True" to ensure the table_name and column_name
    # validation steps pass
    mocker.patch("redshift_connector.Cursor.fetchone", return_value=[1])
    mock_cursor: Cursor = Cursor.__new__(Cursor)

    # spy on the execute method, so we can check value of sql_query
    spy = mocker.spy(mock_cursor, "execute")

    # mock out the connection
    mock_cursor._c = Mock()
    mock_cursor.paramstyle = 'qmark'

    mock_cursor.insert_data_bulk(
        filename='mocked_csv', table_name='test_table',
        column_indexes=indexes, column_names=names, delimeter=','
    )

    assert spy.called is True
    assert spy.call_args[0][0] == exp_execute_args[0]
    assert spy.call_args[0][1] == exp_execute_args[1]
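
If it helps, these can be run locally with something like python -m pytest test/unit/test_cursor.py -k insert_data (assuming pytest and the pytest-mock plugin, which provides the mocker fixture, are installed).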

@Yash621 (Contributor, Author) commented Jan 22, 2022

@Brooke-white I have resolved all the changes you suggested and have also added tests.
Please review them and let me know if any changes are required :)

@Yash621 requested a review from @Brooke-white January 24, 2022 16:50
@Brooke-white (Contributor)

Thank you for bearing with us through the review process, @Yash621! We appreciate your hard work and persistence in making this contribution :).

@Brooke-white changed the title from "implemented bulk insert function" to "feat(cursor): Add method insert_data_bulk" on Jan 24, 2022
@Brooke-white merged commit affbdef into aws:master on Jan 24, 2022
@Yash621 (Contributor, Author) commented Jan 24, 2022

@Brooke-white my pleasure :)
Thank you so much for all your guidance and support.
I look forward to contributing more in the future :)

@snomula commented Jul 9, 2022

Hi @Yash621, this is a great feature that I have been looking for.
I am running this to insert data into my Redshift instance. insert_data_bulk doesn't throw any error, but I don't see any records inserted in the table. I printed the values that I am passing to the function and they all look correct. Not sure where it is going wrong. Can you please let me know if you have any idea on this?

Below is what I am passing to insert_data_bulk:
csv_file: \Temp\public_vpcflow_log_via_sql_8_070722162911.csv
table_name: vpcflow_log_via_sql_8
column_list: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
column_names: ['account-id', 'interface-id', 'srcaddr', 'dstaddr', 'srcport', 'dstport', 'protocol', 'packets', 'bytes', 'start', 'end', 'action', 'log-status', 'vpc-id', 'subnet-id', 'instance-id', 'tcp-flags', 'type', 'pkt-srcaddr', 'pkt-dstaddr', 'region', 'az-id', 'date-time', 'lz-name', 'lz-account-id', 'ou', 'bu', 'app', 'year', 'month', 'day']
file_delimiter: ^
batch_size: 1

After running insert_data_bulk, I executed cursor.rowcount and it returned -1. I also went to the database and checked the table's row count, and it doesn't show any increase. Any advice on this issue?
