[Python/SQLAlchemy] Add example programs about efficient bulk inserts with SQLAlchemy, pandas, and Dask #64
- name: Validate by-language/python-sqlalchemy
  run: |
    python testing/ngr.py --accept-no-venv by-language/python-sqlalchemy
We finally added a universal test runner, currently located at testing/ngr.py. It is effectively a thin wrapper around a few other calls, and lets us start maintaining a concise incantation syntax across different CI recipes.
Because it is written in Python and aims to be generic, it does not assume that it is invoked on any particular CI system. This has a further benefit: incantation style and experience are the same on CI systems and in developer sandboxes, which significantly eases administration, because developers can run the same CI recipe locally without much effort, DWIM-style.
"""
-----
About
-----

Next Generation Runner (ngr): Effortless invoke programs and test harnesses.

-------
Backlog
-------
- After a few iterations, refactor to separate package.
ngr.py is only being incubated here; it will be refactored into a standalone package after a few iterations. If you like the idea, it can be reused in other projects, for example crate-qa, once it has learned to invoke the relevant test suites of a few other programming languages.
testing/ngr.py has already moved into pueblo.ngr, and has learned to invoke three more directory shapes: javascript, make, and rust.
class ItemType(Enum):
    DOTNET = "dotnet"
    JAVA = "java"
    JAVASCRIPT = "javascript"
    MAKE = "make"
    PHP = "php"
    PYTHON = "python"
    RUBY = "ruby"
    RUST = "rust"
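As a rough illustration of what a "directory shape" could mean here, the following sketch maps marker files to the item types listed above. The marker file names, their priority order, and the helper names detect_from_names and detect_item_type are assumptions made for this example; they are not taken from pueblo.ngr.

```python
from enum import Enum
from pathlib import Path


class ItemType(Enum):
    DOTNET = "dotnet"
    JAVA = "java"
    JAVASCRIPT = "javascript"
    MAKE = "make"
    PHP = "php"
    PYTHON = "python"
    RUBY = "ruby"
    RUST = "rust"


# Hypothetical marker files, checked in this order. A Makefile comes last,
# because projects in other languages often ship one as well.
MARKERS = [
    ("Cargo.toml", ItemType.RUST),
    ("package.json", ItemType.JAVASCRIPT),
    ("pom.xml", ItemType.JAVA),
    ("composer.json", ItemType.PHP),
    ("Gemfile", ItemType.RUBY),
    ("requirements.txt", ItemType.PYTHON),
    ("pyproject.toml", ItemType.PYTHON),
    ("Makefile", ItemType.MAKE),
]


def detect_from_names(names):
    """Return the first matching ItemType for a collection of file names, or None."""
    names = set(names)
    for marker, item_type in MARKERS:
        if marker in names:
            return item_type
    # .NET projects are recognized by file extension rather than a fixed name.
    if any(name.endswith(".csproj") for name in names):
        return ItemType.DOTNET
    return None


def detect_item_type(directory):
    """Inspect the entries of a directory and derive its ItemType, or None."""
    return detect_from_names(entry.name for entry in Path(directory).iterdir())
```

A runner built along these lines would call detect_item_type on the target directory and then dispatch to the matching toolchain, which is what keeps the command line identical on CI and on a developer machine.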
*****
About
*****

Example programs demonstrating CrateDB's SQLAlchemy adapter and dialect.
Improve documentation I
Currently, the example programs added by this patch are mostly about pandas/Dask operations through SQLAlchemy. This is a special topic that we have elaborated on extensively at [1]. The README should definitely link to that resource.
Footnotes
The README has been improved, see its rendered version here.
*****
Usage
*****
Improve documentation II
The Usage section should focus on "library use" details, i.e. how exactly to use the insert_bulk function with pandas and Dask.
from crate.client.sqlalchemy.support import insert_bulk
# pandas
df.to_sql(name=self.table_name, con=engine, if_exists="append", index=False, chunksize=bulk_size, method=insert_bulk)
# Dask
ddf.to_sql("testdrive", uri=DBURI, index=False, if_exists="replace", chunksize=10_000, parallel=True, method=insert_bulk)
What is currently in the Usage section should go into an Examples section instead.
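For context, a minimal sketch of the pandas hook that insert_bulk plugs into: to_sql accepts a callable with the signature (pd_table, conn, keys, data_iter) as its method argument and calls it once per chunk. The example uses an in-memory SQLite database and a hypothetical stand-in named insert_chunk, so it runs without CrateDB; it does not reproduce what insert_bulk does internally. The load helper and the sample data are assumptions for illustration as well.

```python
import pandas as pd
import sqlalchemy as sa

# Records how many rows pandas hands over on each call of the insert method.
chunk_sizes = []


def insert_chunk(pd_table, conn, keys, data_iter):
    """Hypothetical stand-in for insert_bulk: one multi-parameter execute per chunk."""
    rows = [dict(zip(keys, row)) for row in data_iter]
    chunk_sizes.append(len(rows))
    # pd_table.table is the SQLAlchemy Table object pandas created for the frame.
    result = conn.execute(pd_table.table.insert(), rows)
    return result.rowcount


def load(df, engine, table_name="testdrive", bulk_size=2):
    """Write the frame in chunks of bulk_size rows and return the stored row count."""
    df.to_sql(
        name=table_name,
        con=engine,
        if_exists="append",
        index=False,
        chunksize=bulk_size,
        method=insert_chunk,
    )
    with engine.connect() as connection:
        query = sa.text(f"SELECT COUNT(*) FROM {table_name}")
        return connection.execute(query).scalar_one()


# SQLite in memory is used only so the sketch is self-contained.
engine = sa.create_engine("sqlite://")
df = pd.DataFrame({"id": range(5), "name": [f"item-{i}" for i in range(5)]})
row_count = load(df, engine)
```

The chunksize argument determines how many records reach the callable per call, so it is the main tuning knob when swapping in a bulk-oriented method.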
Fixed/improved, see above. An example has been added.
# Connect to CrateDB Cloud.
time python insert_pandas.py --dburi='crate://admin:<PASSWORD>@example.aks1.westeurope.azure.cratedb.net:4200?ssl=true'
This is a bit poor. Improve/highlight operations with CrateDB Cloud.
Ditto. Fixed.
source .venv/bin/activate
pip install --upgrade --requirement requirements.txt
A test suite has been added. Document how it can be invoked after installing the requirements.
pytest
Fixed.
logger = logging.getLogger(__name__)

pkg_resources.require("sqlalchemy>=2.0")
Also reflect this in the requirements.txt file, and probably also in insert_dask.py.
Fixed.
Is it still needed at all then?
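If a runtime guard is still wanted, one option that avoids pkg_resources (which setuptools has deprecated) is importlib.metadata from the standard library. The sketch below is an assumption about how such a check could look, not code from this patch. parse_release, satisfies_minimum, and require_minimum are hypothetical helpers, and the comparison only looks at the numeric release segments, so it is cruder than a full PEP 440 comparison.

```python
import re
from importlib import metadata


def parse_release(version):
    """Extract the leading numeric release segments, e.g. "2.0.0rc1" -> (2, 0, 0)."""
    parts = []
    for segment in version.split("."):
        match = re.match(r"\d+", segment)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies_minimum(installed, minimum):
    """Compare two version strings by their numeric release segments."""
    a, b = parse_release(installed), parse_release(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that "2" and "2.0" compare as equal.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def require_minimum(package, minimum):
    """Raise RuntimeError if the installed distribution is older than `minimum`.

    Raises importlib.metadata.PackageNotFoundError if it is not installed at all.
    """
    installed = metadata.version(package)
    if not satisfies_minimum(installed, minimum):
        raise RuntimeError(f"{package}>={minimum} is required, found {installed}")
```

With the constraint already pinned in requirements.txt, such a guard mainly helps people who run the program in an environment they did not install from that file.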
# Run PostgreSQL
docker run --rm -it --publish=5432:5432 --env "POSTGRES_HOST_AUTH_METHOD=trust" postgres:15 postgres -c log_statement=all

# Use SQLite
time python insert_efficient.py sqlite multirow
time python insert_efficient.py sqlite batched

# Use PostgreSQL
time python insert_efficient.py postgresql multirow
time python insert_efficient.py postgresql batched
How is this example related to other databases? Why should I run it against other databases?
I wanted an easy stage for running comparisons side by side. It is now integrated into the relevant example program, and I find it convenient.
With higher numbers of records, you also get a good feeling for insert performance compared with the other databases.
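To make the kind of side-by-side comparison concrete, here is a small sketch using only the sqlite3 module from the standard library. It is not the insert_efficient.py program from this patch; the table layout, the function names, and the rows_per_statement default are assumptions for illustration. It contrasts one statement per row, multi-row VALUES statements, and a batched executemany call.

```python
import sqlite3
import time


def make_records(count):
    """Generate simple (id, name) tuples."""
    return [(i, f"name-{i}") for i in range(count)]


def insert_row_by_row(conn, records):
    # One INSERT statement per record.
    for record in records:
        conn.execute("INSERT INTO testdrive (id, name) VALUES (?, ?)", record)


def insert_multirow(conn, records, rows_per_statement=100):
    # One INSERT statement carrying many VALUES tuples.
    for start in range(0, len(records), rows_per_statement):
        chunk = records[start:start + rows_per_statement]
        placeholders = ", ".join(["(?, ?)"] * len(chunk))
        params = [value for record in chunk for value in record]
        conn.execute(f"INSERT INTO testdrive (id, name) VALUES {placeholders}", params)


def insert_batched(conn, records):
    # One prepared statement, executed with a sequence of parameter sets.
    conn.executemany("INSERT INTO testdrive (id, name) VALUES (?, ?)", records)


def run(strategy, records):
    """Insert into a fresh in-memory database; return (row count, elapsed seconds)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE testdrive (id INTEGER, name TEXT)")
    started = time.perf_counter()
    strategy(conn, records)
    conn.commit()
    elapsed = time.perf_counter() - started
    count = conn.execute("SELECT COUNT(*) FROM testdrive").fetchone()[0]
    conn.close()
    return count, elapsed


if __name__ == "__main__":
    records = make_records(50_000)
    for strategy in (insert_row_by_row, insert_multirow, insert_batched):
        count, elapsed = run(strategy, records)
        print(f"{strategy.__name__}: {count} rows in {elapsed:.3f}s")
```

Timings from an in-memory SQLite database say little about a networked database, which is the reason for running the same program against several backends.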

To start a CrateDB instance on your machine for evaluation purposes, invoke::

    docker run -it --rm --publish=4200:4200 --publish=5432:5432 crate
Is publishing the PG port 5432 required for this example?
Erm. Probably not. It is just my canonical command I slap everywhere, not needing to discriminate between different variants, and also advertising a bit.

@click.command()
@click.option("--dburi", type=str, default="crate://localhost:4200", required=False, help="SQLAlchemy database connection URI.")
@click.option("--mode", type=str, default="bulk", required=False, help="Insert mode. Choose one of basic, multi, bulk.")
Could use a Choice type here to restrict the accepted values.
- @click.option("--mode", type=str, default="bulk", required=False, help="Insert mode. Choose one of basic, multi, bulk.")
+ @click.option("--mode", type=click.Choice(['basic', 'multi', 'bulk']), default="bulk", required=False, case_sensitive=False, help="Insert mode. Choose one of basic, multi, bulk.")
Did you ask another expert on this?
TypeError: Parameter.__init__() got an unexpected keyword argument 'case_sensitive'
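The TypeError is consistent with case_sensitive being passed to click.option, which forwards unknown keyword arguments to the parameter constructor. In Click, case_sensitive is an argument of click.Choice itself. A minimal sketch of the suggestion with the argument moved accordingly; the command name main and the echoed output format are assumptions for illustration.

```python
import click
from click.testing import CliRunner


@click.command()
@click.option(
    "--mode",
    # case_sensitive belongs to click.Choice, not to click.option.
    type=click.Choice(["basic", "multi", "bulk"], case_sensitive=False),
    default="bulk",
    required=False,
    help="Insert mode. Choose one of basic, multi, bulk.",
)
def main(mode):
    click.echo(f"mode={mode}")


if __name__ == "__main__":
    # Exercise the command in-process instead of parsing sys.argv.
    runner = CliRunner()
    print(runner.invoke(main, ["--mode", "BULK"]).output)
    print(runner.invoke(main, ["--mode", "turbo"]).exit_code)
```

With this in place, an unknown mode is rejected during argument parsing, before any database connection is attempted.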
Hi there,
this patch carries over the example programs about Python/SQLAlchemy/pandas/Dask from crate/crate-python#552. Originally, the corresponding programs were conceived as individual gists:
With kind regards,
Andreas.
/cc @marijaselakovic, @karynzv, @WalBeh, @hammerhead, @hlcianfagna, @andnig, @ckurze
Backlog
- Software tests and Dependabot configuration work well for the other code example directories, and will be added here as well. Done, with a code coverage of 95%.
- "Add fast-path INSERT method insert_bulk for SQLAlchemy/pandas/Dask" (crate-python#553) needs to be merged and released beforehand, to make the examples actually work. Merged in July 2023.
=> Backlog items have been added to GH-95, in order to finally get this merged, so it can be promoted and re-used.