Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Python/SQLAlchemy] Add example programs about efficient bulk inserts with SQLAlchemy, pandas, and Dask #64

Merged
merged 9 commits into from Oct 30, 2023

Conversation

amotl
Copy link
Member

@amotl amotl commented May 11, 2023

Hi there,

this patch carries over the example programs about Python/SQLAlchemy/pandas/Dask from crate/crate-python#552. Originally, the corresponding programs have been conceived as individual gists:

With kind regards,
Andreas.

/cc @marijaselakovic, @karynzv, @WalBeh, @hammerhead, @hlcianfagna, @andnig, @ckurze

Backlog

  • Software tests and Dependabot configuration works well for the other code example directories, and will be added here as well. Done, with a code coverage of 95%.
  • Add fast-path INSERT method insert_bulk for SQLAlchemy/pandas/Dask crate-python#553 needs to be merged and released beforehand, to make the examples actually work. Merged in July 2023.
  • Some adjustments may be added to bring all programs into the same shape.
  • Other than demonstrating write operations, also demonstrate read operations?

=> Backlog items have been added to GH-95, in order to finally get this merged, so it can be promoted and re-used.

@amotl amotl changed the title [Python/SQLAlchemy] Add two example programs [Python/SQLAlchemy] Add example programs about efficient bulk inserts with SQLAlchemy, pandas, and Dask May 11, 2023
Comment on lines +60 to +62
- name: Validate by-language/python-sqlalchemy
run: |
python testing/ngr.py --accept-no-venv by-language/python-sqlalchemy
Copy link
Member Author

@amotl amotl Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We finally added a universal test runner, currently located at testing/ngr.py, which is effectively just wrapping a few other calls, to be able to start maintaining a concise incantation syntax across different CI recipes.

Because it is written in Python, and aims to be generic, without assuming it is invoked on any kind of CI system, it has another benefit that we can mirror incantation style and experience between CI systems vs. developer sandbox operations, which significantly eases administration, because the developer can easily run the same CI recipe locally without much efforts, DWIM-style.

Comment on lines +1 to +11
"""
-----
About
-----

Next Generation Runner (ngr): Effortless invoke programs and test harnesses.

-------
Backlog
-------
- After a few iterations, refactor to separate package.
Copy link
Member Author

@amotl amotl Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ngr.py will only be incubated here, it will be refactored to a standalone package after a few iterations. If you like the idea, it can be re-used on other projects, for example on crate-qa, after it learned to invoke relevant test suites of a few other programming languages.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

testing/ngr.py moved into pueblo.ngr already, and learned three more directory shapes to invoke: javascript, make, and rust.

class ItemType(Enum):
    DOTNET = "dotnet"
    JAVA = "java"
    JAVASCRIPT = "javascript"
    MAKE = "make"
    PHP = "php"
    PYTHON = "python"
    RUBY = "ruby"
    RUST = "rust"

@amotl amotl mentioned this pull request Oct 28, 2023
@amotl amotl requested review from seut and matriv October 28, 2023 16:00
Comment on lines +8 to +12
*****
About
*****

Example programs demonstrating CrateDB's SQLAlchemy adapter and dialect.
Copy link
Member Author

@amotl amotl Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve documentation I

Currently, the example programs added by this patch are mostly about pandas/Dask operations through SQLAlchemy. This is a special topic, where we elaborated extensively about, at 1. Here, we surely want to link to this resource.

Footnotes

  1. https://cratedb.com/docs/python/en/latest/by-example/sqlalchemy/dataframe.html

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amotl amotl marked this pull request as ready for review October 28, 2023 16:14
Comment on lines +15 to +23
*****
Usage
*****

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improve documentation II

In this case, within the Usage section, it should clearly be focused on "library use" details, like, how to exactly use the insert_bulk function with pandas and Dask.

from crate.client.sqlalchemy.support import insert_bulk

# pandas
df.to_sql(name=self.table_name, con=engine, if_exists="append", index=False, chunksize=bulk_size, method=insert_bulk)

# Dask
ddf.to_sql("testdrive", uri=DBURI, index=False, if_exists="replace", chunksize=10_000, parallel=True, method=insert_bulk)

What is currently within the Usage section, should go into an Examples section instead.

Copy link
Member Author

@amotl amotl Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed/improved, see above. An example has been added.

Comment on lines 45 to 46
# Connect to CrateDB Cloud.
time python insert_pandas.py --dburi='crate://admin:<PASSWORD>@example.aks1.westeurope.azure.cratedb.net:4200?ssl=true'
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit poor. Improve/highlight operations with CrateDB Cloud.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dito. Fixed.

Comment on lines +29 to +75
source .venv/bin/activate
pip install --upgrade --requirement requirements.txt
Copy link
Member Author

@amotl amotl Oct 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A test suite has been added. Document how it can be invoked after installing the requirements.

pytest

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.


logger = logging.getLogger(__name__)

pkg_resources.require("sqlalchemy>=2.0")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also reflect this within the requirements.txt file, and probably also within insert_dask.py.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it still needed at all then?

Comment on lines +27 to +36
# Run PostgreSQL
docker run --rm -it --publish=5432:5432 --env "POSTGRES_HOST_AUTH_METHOD=trust" postgres:15 postgres -c log_statement=all

# Use SQLite
time python insert_efficient.py sqlite multirow
time python insert_efficient.py sqlite batched

# Use PostgreSQL
time python insert_efficient.py postgresql multirow
time python insert_efficient.py postgresql batched
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this example related to other DB's? Why should I run this for other DB's?

Copy link
Member Author

@amotl amotl Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to have an easy stage to run comparisons side-by-side. It is now integrated into the relevant example program, and I find it convenient.

When using it with higher numbers of records, you can also get a good feeling about insert performance compared with the other DBs.


To start a CrateDB instance on your machine for evaluation purposes, invoke::

docker run -it --rm --publish=4200:4200 --publish=5432:5432 crate
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is publishing the PG port 5432 required for this example?

Copy link
Member Author

@amotl amotl Oct 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erm. Probably not. It is just my canonical command I slap everywhere, not needing to discriminate between different variants, and also advertising a bit.


@click.command()
@click.option("--dburi", type=str, default="crate://localhost:4200", required=False, help="SQLAlchemy database connection URI.")
@click.option("--mode", type=str, default="bulk", required=False, help="Insert mode. Choose one of basic, multi, bulk.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could use a choice option to make possible options strict.

Suggested change
@click.option("--mode", type=str, default="bulk", required=False, help="Insert mode. Choose one of basic, multi, bulk.")
@click.option("--mode", type=click.Choice(['basic', 'multi', 'bulk']), default="bulk", required=False, case_sensitive=False, help="Insert mode. Choose one of basic, multi, bulk.")

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you ask another expert on this?

TypeError: Parameter.__init__() got an unexpected keyword argument 'case_sensitive'

@amotl amotl merged commit af6471d into main Oct 30, 2023
1 check passed
@amotl amotl deleted the amo/python-sqlalchemy branch October 30, 2023 15:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants