feat(cursor): Add method insert_data_bulk #81

Merged: 19 commits into aws:master on Jan 24, 2022

Conversation

@Yash621 (Contributor) commented Dec 30, 2021

@kishaningithub This PR is with respect to issue #75

I have implemented the bulk insert function. Please review it and suggest any changes that are required.

@Brooke-white self-requested a review January 4, 2022 18:15
@Brooke-white (Contributor) left a comment

Hi @Yash621,

Thank you for opening this PR in response to this feature request. The team appreciates the time and effort you've put into this PR.

I've left some granular comments, but on a high level I don't see any tests for this method. Could you please add some so we can ensure it's behaving as desired?

Also, it seems like there are a number of whitespace changes in this PR unrelated to the desired change. Could you please revert those?

@Yash621 (Contributor, Author) commented Jan 16, 2022

@Brooke-white can you guide me on how I can write the tests for this?

@Brooke-white (Contributor)

Sure, @Yash621. I've listed a couple of ideas below. I think the testing for this method can be done using unit tests.

Reading the csv

  • happy path: when the csv file can be opened and is read, what do we expect values_list to look like? (i.e. validating values_list is as intended when passed to cursor.execute)
  • unhappy path: the csv cannot be loaded, the csv is malformed, etc. How do these errors percolate up to the user? Maybe we should add some error handling when trying to open the csv file, to let the end user know if the file cannot be found and provide some additional detail, or similarly for the case where reading the csv fails.

Building sql_query

  • ensuring we can handle a variable number of column names (i.e. sql_query looks as desired when passed to cursor.execute)
  • unhappy path: what happens if the number of columns read from the csv does not match the number of column names specified by the user? Do we raise an exception saying so? (We probably should, because this could provide a more descriptive error than the one that will be provided by cursor.execute(...), e.g. "The csv provided has a column mismatch with the column names provided".)

For the above ideas, when we aren't confirming an exception is raised, we often want to check what is passed to cursor.execute(). We can use unittest.mock to "spy" on the parameters passed to cursor.execute(), which means we can see the parameters passed to the function. This approach is taken in some of our unit tests to provide lower level testing; our existing cursor.execute tests show an example (please note the function parameter mocker), and there are good write-ups online on "spying" using mocks.
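
To make the spy idea concrete, here is a minimal, self-contained sketch (illustrative only; it uses the pytest-mock mocker fixture and a stand-in class rather than the real Cursor):

# minimal sketch of "spying" with pytest-mock's mocker fixture;
# Target is a hypothetical stand-in for the real Cursor class
class Target:
    def execute(self, operation, args=None):
        pass


def insert_one(cursor):
    cursor.execute("INSERT INTO t (c) VALUES (%s);", ["v"])


def test_spy_records_execute_args(mocker):
    cursor = Target()
    # wrap execute so the real method still runs, but calls are recorded
    spy = mocker.spy(cursor, "execute")

    insert_one(cursor)

    # inspect exactly what was passed to execute
    assert spy.called is True
    assert spy.call_args[0][0] == "INSERT INTO t (c) VALUES (%s);"
    assert spy.call_args[0][1] == ["v"]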

Please let me know if you have any questions about writing the tests, or other comments on the PR. I'm more than happy to discuss :)

@david-dest01

Can the AWS team please prioritize support for this? It's a pretty big limitation; it's a common feature across other drivers.

@Brooke-white (Contributor)

> Can the AWS team please prioritize support for this? It's a pretty big limitation; it's a common feature across other drivers.

@david-dest01, we are working with @Yash621, the author of this PR, to get this feature ready for being merged. In the meantime, feel free to review this PR to ensure it suits your needs.

@Yash621 -- do you have an idea of when you plan to post a revision of the PR? It'd be nice if we could include this feature in our next release (scheduled for early February). Please let me know how I can help make this happen.

@Yash621 (Contributor, Author) commented Jan 19, 2022

@Brooke-white will surely raise the revision by today or tomorrow. I'm just stuck on writing the tests for this, as I haven't done it before. Will raise the revision soon :)

@david-dest01

@Brooke-white thank you. @Yash621, thank you for contributing this; happy to help with testing and modifications based on feedback from the AWS team. In the interim I added some comments as well that you may want to consider, but no pressure.

@david-dest01

> @kishaningithub This PR is with respect to issue #75
>
> I have implemented the bulk insert function. Please review it and suggest any changes that are required.

There's a PR template for new features; could you fill that out when finished? It's located in .github.

@Brooke-white (Contributor)

@david-dest01 , thank you for your help reviewing this PR, it's much appreciated! @Yash621, I have resolved the PR comments from David and myself that you addressed so we can more easily see those remaining.

@david-dest01

@Yash621 lmk if you want to pair program to work through multiple args in execute. Hope the feedback provided helps clarify some desirable behavior that accomplishes what you're looking to do, plus other desirable use cases.

@Yash621 (Contributor, Author) commented Jan 19, 2022

@david-dest01 would love to pair program, let me know when we can do this :)

@Yash621 (Contributor, Author) commented Jan 19, 2022

@Brooke-white thanks :), working on the other changes, will push soon :)

@david-dest01

> @david-dest01 would love to pair program, let me know when we can do this :)

Let's coordinate - will shoot you an email.

@Yash621 (Contributor, Author) commented Jan 20, 2022

@Brooke-white I have improved the implementation for the function and have also updated the docstring for it.
Please review and let me know if any changes are required :)

Also, after this I will add the tests within 1-2 days.

@Brooke-white (Contributor)

Thanks for the quick turnaround, @Yash621! I'll post a review by end of day.

@Brooke-white (Contributor) left a comment

@Yash621, I've added one small note re: docs and two suggestions for validating table_name and column_names before we do the bulk insertion. Once these comments are addressed and the tests are added, we should be good to get this merged :). Let me know if you have any questions or need any help!

@Brooke-white (Contributor)

Thanks for the quick turnaround, @Yash621. The tests can go in test/unit/test_cursor.py if they do not need internet access; otherwise they can go in test/integration/test_cursor.py. Once the tests are added, I'll do a quick run of our CI to ensure all is ok, then merge :).

@Yash621 (Contributor, Author) commented Jan 20, 2022

@Brooke-white thanks. Can you also tell me what is_single_database_metadata_val is in the args of all the tests, and do I have to put it in test_insert_data_bulk?

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white I have added error handling for the cases you mentioned above, but I am still a bit confused about how I should use mock for spying on parameters.
Moreover, I have added support for working flexibly with any given set of columns, due to which I am accepting column_indexes as a parameter. So do we need to add that support, or should I remove it?

@Brooke-white (Contributor)

Thanks, @Yash621! I'll write up a sample test that you can build off of (with an explanation) and post it shortly :)

> Moreover, I have added support for working flexibly with any given set of columns, due to which I am accepting column_indexes as a parameter. So do we need to add that support, or should I remove it?

Could you clarify this for me? I'm not fully understanding. Are you saying column_indexes is not needed?

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white yeah, do we need column_indexes?

@Brooke-white (Contributor) commented Jan 21, 2022

@Yash621, I think it could be valuable to leave it in, since it would give more flexibility in addressing cases such as the following:

A user is interested in bulk insertion of only columns id and rating. They have already established a table holding id and fruit in their database, and do not wish to repeat this mapping in the table fruit_ratings.

insert_data_bulk(table_name='fruit_ratings', column_names=['id', 'rating'], column_indexes=[0, 2],...)

id,fruit,rating
1,apple,10
2,orange,8
3,pear,1
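
With that csv, the call above would end up passing roughly the following to cursor.execute (a hypothetical sketch following the query format shown in the tests further down, not captured output):

# hypothetical sketch: what insert_data_bulk would hand to cursor.execute here
sql_query = "INSERT INTO fruit_ratings (id, rating) VALUES (%s, %s), (%s, %s), (%s, %s);"
values_list = ['1', '10', '2', '8', '3', '1']
cursor.execute(sql_query, values_list)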

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white Yeah, I added it with exactly that in mind; I just wanted to confirm whether we need that flexibility or not.

@Yash621 (Contributor, Author) commented Jan 21, 2022

@Brooke-white I have resolved the prepending issue :)

@Brooke-white (Contributor) left a comment

@Yash621, while testing I noticed a minor issue with how values_list and sql_query are built. I added some comments showing what changes are needed to send things to the server in the way it expects. I also included some tests I wrote for this method; feel free to add more if you'd like.

After making these final changes, please make sure to run the pre-commit hooks to ensure this change matches the rest of the code base. Then let's get this merged! :)

Here are the tests; I had placed them in test/unit/test_cursor.py. The mocking and "spy" bits were a bit tricky to get set up :) Make sure unittest.mock.mock_open is imported, as it's used for reading in a test csv from memory.

import pytest
from unittest.mock import Mock, mock_open, patch

# Cursor and InterfaceError come from this driver; the mocker fixture
# used below is provided by the pytest-mock plugin
from redshift_connector import Cursor, InterfaceError


@pytest.mark.parametrize("indexes, names", [([1], []), ([], ["c1"])])
def test_insert_data_column_names_indexes_mismatch_raises(indexes, names, mocker):
    # mock fetchone to return "True" to ensure the table_name and column_name
    # validation steps pass
    mocker.patch("redshift_connector.Cursor.fetchone", return_value=[1])

    mock_cursor: Cursor = Cursor.__new__(Cursor)
    # mock out the connection
    mock_cursor._c = Mock()
    mock_cursor.paramstyle = 'qmark'

    with pytest.raises(InterfaceError, match="Column names and indexes must be the same length"):
        mock_cursor.insert_data_bulk(
            filename="test_file", table_name='test_table',
            column_indexes=indexes, column_names=names, delimeter=','
        )


in_mem_csv = """\
col1,col2,col3
1,3,foo
2,5,bar
-1,7,baz"""

insert_bulk_data = [
    ([0], ['col1'], ('INSERT INTO  test_table (col1) VALUES (%s), (%s), (%s);', ['1', '2', '-1'])),
    ([1], ['col2'], ('INSERT INTO  test_table (col2) VALUES (%s), (%s), (%s);', ['3', '5', '7'])),
    ([2], ['col3'], ('INSERT INTO  test_table (col3) VALUES (%s), (%s), (%s);', ['foo', 'bar', 'baz'])),
    ([0, 1], ['col1', "col2"], ('INSERT INTO  test_table (col1, col2) VALUES (%s, %s), (%s, %s), (%s, %s);', ['1', '3', '2', '5', '-1', '7'])),
    ([0, 2], ['col1', "col3"], ('INSERT INTO  test_table (col1, col3) VALUES (%s, %s), (%s, %s), (%s, %s);', ['1', 'foo', '2', 'bar', '-1', 'baz'])),
    ([1, 2], ['col2', "col3"], ('INSERT INTO  test_table (col2, col3) VALUES (%s, %s), (%s, %s), (%s, %s);', ['3', 'foo', '5', 'bar', '7', 'baz'])),
    ([0, 1, 2], ['col1', 'col2', "col3"], ('INSERT INTO  test_table (col1, col2, col3) VALUES (%s, %s, %s), (%s, %s, %s), (%s, %s, %s);', ['1', '3', 'foo', '2', '5', 'bar', '-1', '7', 'baz'])),
]


@patch("builtins.open", new_callable=mock_open, read_data=in_mem_csv)
@pytest.mark.parametrize("indexes,names,exp_execute_args", insert_bulk_data)
def test_insert_data_column_stmt(mocked_csv, indexes, names, exp_execute_args, mocker):
    # mock fetchone to return "True" to ensure the table_name and column_name
    # validation steps pass
    mocker.patch("redshift_connector.Cursor.fetchone", return_value=[1])
    mock_cursor: Cursor = Cursor.__new__(Cursor)

    # spy on the execute method, so we can check value of sql_query
    spy = mocker.spy(mock_cursor, "execute")

    # mock out the connection
    mock_cursor._c = Mock()
    mock_cursor.paramstyle = 'qmark'

    mock_cursor.insert_data_bulk(
        filename='mocked_csv', table_name='test_table',
        column_indexes=indexes, column_names=names, delimeter=','
    )

    assert spy.called is True
    assert spy.call_args[0][0] == exp_execute_args[0]
    assert spy.call_args[0][1] == exp_execute_args[1]
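
If it helps, these can be run locally with something like python -m pytest test/unit/test_cursor.py -k insert_data (assuming pytest and the pytest-mock plugin, which provides the mocker fixture, are installed).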

@Yash621 (Contributor, Author) commented Jan 22, 2022

@Brooke-white I have resolved all the changes you suggested and have also added tests.
Please review them and let me know if any changes are required :)

@Yash621 requested a review from @Brooke-white January 24, 2022 16:50
@Brooke-white (Contributor)

Thank you for bearing with us through the review process, @Yash621! We appreciate your hard work and persistence in making this contribution :).

@Brooke-white changed the title from "implemented bulk insert function" to "feat(cursor): Add method insert_data_bulk" on Jan 24, 2022
@Brooke-white merged commit affbdef into aws:master on Jan 24, 2022
@Yash621 (Contributor, Author) commented Jan 24, 2022

@Brooke-white my pleasure :)
Thank you so much for all your guidance and support.
I look forward to contributing more in the future :)

@snomula commented Jul 9, 2022

Hi @Yash621, this is a great feature that I have been looking for.
I am running this to insert data into my Redshift instance. insert_data_bulk doesn't throw any error, but I don't see any records inserted in the table. I printed the values that I am passing to the function and they all look correct. Not sure where it is going wrong. Can you please let me know if you have any idea on this?

Below is what I am passing to insert_data_bulk:
csv_file: \Temp\public_vpcflow_log_via_sql_8_070722162911.csv
table_name: vpcflow_log_via_sql_8
column_list: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]
column_names: ['account-id', 'interface-id', 'srcaddr', 'dstaddr', 'srcport', 'dstport', 'protocol', 'packets', 'bytes', 'start', 'end', 'action', 'log-status', 'vpc-id', 'subnet-id', 'instance-id', 'tcp-flags', 'type', 'pkt-srcaddr', 'pkt-dstaddr', 'region', 'az-id', 'date-time', 'lz-name', 'lz-account-id', 'ou', 'bu', 'app', 'year', 'month', 'day']
file_delimiter: ^
batch_size: 1

After running insert_data_bulk, I executed cursor.rowcount and it returned -1. I also went to the database and checked the table's row count, and it doesn't show any increase. Any advice on this issue?
