
Bulk inserts are tricky to do efficiently #59

Open · Diggsey opened this issue Sep 6, 2016 · 16 comments

Diggsey (Contributor) commented Sep 6, 2016

I had to implement some kind of batching myself to improve performance. This involved building up a query string over time, something like INSERT INTO table (fields...) VALUES (?, ?, ?), (?, ?, ?), (?, ?, ?), ...

The query builder is not particularly well suited to this, and building up the query via string concatenation seems relatively inefficient. Also, I could only support positional parameters, since the name-parsing code is not exposed (which means I couldn't use Params::make_positional).

I don't know if there's a more efficient way to do this at the protocol level?

At the very least I think one or more of the following would be useful:

  • Provide some guidelines in the documentation for how to do this
  • Expose the named-parameter parsing functionality so users can build the batching themselves
  • Add some new functionality to assist with this use case
blackbeam (Owner) commented Sep 9, 2016

  1. The most efficient way, I think, is to use LOAD DATA INFILE, if applicable.

  2. The second option is a large handcrafted statement or query:

    • For a handcrafted query (:warning: strongly not recommended) you should be aware of the max_allowed_packet value.
    • For a handcrafted statement, you should also be aware of the maximum number of parameters a single prepared statement may contain (65,535, since the placeholder count is a 16-bit value; see the chunking sketch after this list). It may look like this (in pseudo-Rust and without checks):
    // Build one multi-row INSERT and bind every row's parameters positionally.
    // `Obj` is a stand-in for the caller's row type.
    fn bulk_insert<F, P>(
        pool: &crate::Pool,
        table: String,
        cols: Vec<String>,
        objects: Vec<Obj>,
        fun: F,
    ) -> crate::Result<()>
    where
        F: Fn(&Obj) -> P,
        P: Into<Params>,
    {
        let mut stmt = format!("INSERT INTO {} ({}) VALUES ", table, cols.join(","));
        let row = format!(
            "({}),",
            cols.iter()
                .map(|_| "?".to_string())
                .collect::<Vec<_>>()
                .join(",")
        );
        // Rough capacity estimate: "?," per column plus the parentheses and comma.
        stmt.reserve(objects.len() * (cols.len() * 2 + 2));
        for _ in 0..objects.len() {
            stmt.push_str(&row);
        }

        // remove the trailing comma
        stmt.pop();

        // Flatten each row's named params into one positional Vec, in column order.
        let mut params = Vec::new();
        for o in objects.iter() {
            let named_params: crate::Params = fun(o).into();
            let positional_params = named_params.into_positional(&cols)?;
            if let crate::Params::Positional(new_params) = positional_params {
                for param in new_params {
                    params.push(param);
                }
            }
        }

        let mut conn = pool.get_conn()?;
        conn.exec_drop(stmt, params)
    }
    // ...
    bulk_insert(
        &pool,
        "table".into(),
        vec!["value1".into(), "value2".into()],
        objects,
        |object| {
            params! {
                "value1" => object.value1,
                "value2" => object.value2,
            }
        },
    )?;
  3. The less efficient but most convenient option is to execute one statement per row. It looks like you can speed this up by setting autocommit to 0 (:warning: run SET autocommit=0 and all of the INSERT queries on one specific connection taken from the pool), but do not forget to set autocommit back to 1. It may look like this (in pseudo-Rust):

    fn bulk_insert<F, P>(pool: my::Pool, stmt: &str, objs: Vec<Object>, fun: F) -> my::Result<()>
    where
        F: Fn(&Object) -> P,
        P: Into<my::Params>,
    {
        let mut conn = pool.get_conn()?;
        conn.query("SET autocommit=0")?;
        {
            // Prepare on the same connection that autocommit was disabled on,
            // so every INSERT runs inside the same session.
            let mut stmt = conn.prepare(stmt)?;
            for obj in objs.iter() {
                stmt.execute(fun(obj))?;
            }
        }
        conn.query("COMMIT")?;
        conn.query("SET autocommit=1")?;
        Ok(())
    }
    // ...
    bulk_insert(pool, "INSERT INTO table (value1, value2) VALUES (:value1, :value2)", objs, |obj| {
        params! {
            "value1" => obj.value1,
            "value2" => obj.value2,
        }
    });
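
One caveat on option 2 is worth making concrete: MySQL rejects a prepared statement with more than 65,535 placeholders, so very large batches must be split. A hypothetical wrapper around the bulk_insert above (bulk_insert_chunked is not part of the crate, and it assumes Obj: Clone for brevity):

    // Hypothetical helper: splits the rows so that each generated INSERT
    // stays under MySQL's 65,535-placeholder limit per prepared statement.
    fn bulk_insert_chunked<F, P>(
        pool: &crate::Pool,
        table: &str,
        cols: &[String],
        objects: &[Obj],
        fun: F,
    ) -> crate::Result<()>
    where
        F: Fn(&Obj) -> P + Copy,
        P: Into<Params>,
    {
        let max_rows = 65_535 / cols.len().max(1);
        for chunk in objects.chunks(max_rows) {
            // `to_vec()` clones the chunk because `bulk_insert` takes owned values.
            bulk_insert(pool, table.to_string(), cols.to_vec(), chunk.to_vec(), fun)?;
        }
        Ok(())
    }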

Diggsey (Author) commented Sep 9, 2016

@blackbeam Thanks for the detailed reply:

  1. This is impossible since I'm using Google's Cloud SQL database; however, it did lead me to find LOAD DATA LOCAL INFILE, which does work.

It looks like, at the protocol level, the MySQL server receives the query containing the filename, then sends the client a request for the contents of that file, which the client sends back. Based on this, it seems the client could actually send data from memory rather than from a file, which opens up a lot of possibilities (a sketch follows at the end of this list).

  2. This is essentially what I've resorted to doing at the moment, but as you say, I'm at the mercy of max_allowed_packet and other arbitrary limits.

  3. I'm already in a transaction, so autocommit is off anyway. The problem is the network latency to the server, because even a few ms of RTT adds up to long delays over 10,000 rows.
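
For illustration, a minimal sketch of that idea using the crate's LocalInfileHandler (API as in the pre-20.x releases; rows is a hypothetical Vec<String> of pre-formatted, tab-separated lines):

    use std::io::Write;

    // `rows` lives in memory; no file is ever read. The filename in the
    // query is a dummy token that the handler simply ignores.
    conn.set_local_infile_handler(Some(LocalInfileHandler::new(move |_file_name, stream| {
        for row in &rows {
            writeln!(stream, "{}", row)?;
        }
        Ok(())
    })));
    conn.query("LOAD DATA LOCAL INFILE 'dummy' INTO TABLE mytable (col1, col2)")?;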

Diggsey (Author) commented Sep 9, 2016

For reference, I'm inserting ~2 million rows into one table with fairly large rows, and ~10 million rows into another table which is much more lightweight.

Currently it takes around 2 hours to insert all the data, even with a batch size of 512 (yep, I discovered that UTF-8 bug ~30 minutes into the import 😒), and despite the large quantity of data I don't think it should take that long.

0xpr03 (Contributor) commented Sep 1, 2018

@Diggsey have you tried something like LOCK TABLES `mytable` WRITE; followed by the inserts and then UNLOCK TABLES; (all on the same connection)? This could also be your bottleneck, and locking first is the way mysqldump works. A rough sketch follows below.
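
A rough sketch of that pattern (table and column names are illustrative; using the exec_drop/query_drop API of the current releases):

    // All three steps must run on the same connection, because
    // LOCK TABLES is per-session state.
    let mut conn = pool.get_conn()?;
    conn.query_drop("LOCK TABLES `mytable` WRITE")?;
    for obj in &objs {
        conn.exec_drop(
            "INSERT INTO `mytable` (value1, value2) VALUES (?, ?)",
            (obj.value1.clone(), obj.value2.clone()), // clone for simplicity
        )?;
    }
    // In real code, make sure the lock is released on the error path too.
    conn.query_drop("UNLOCK TABLES")?;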

Diggsey (Author) commented Sep 1, 2018

@0xpr03 I solved this using LOAD DATA LOCAL INFILE and then installing a custom handler which streamed the data directly from memory.

hpca01 (Contributor) commented Mar 2, 2020

> @0xpr03 I solved this using LOAD DATA LOCAL INFILE and then installing a custom handler which streamed the data directly from memory.

Sorry, do you mind showing an example of the LocalInfileHandler that you implemented? I'm lost as to which traits the handler should implement.

Diggsey (Author) commented Mar 2, 2020

@hpca01 there's an example in the documentation: https://docs.rs/mysql/18.0.0/mysql/struct.LocalInfileHandler.html

hpca01 (Contributor) commented Mar 2, 2020

@Diggsey I apologize, as I'm still new to the language; I got the whole thing to work by doing the following:

// Note: `write!` needs `use std::io::Write;` in scope.
val.conn
    .set_local_infile_handler(Some(LocalInfileHandler::new(move |_, stream| {
        for a_val in input.as_ref() {
            write!(stream, "{}", a_val).unwrap();
        }
        Ok(())
    })));
match val
    .conn
    .query("LOAD DATA LOCAL INFILE 'file' INTO TABLE schema.price_type_pricelineitem (price_type, price_amount, per_unit, linked_ndc_id, price_internal_id)")
{
    Ok(_) => {}
    Err(err) => println!("Error {}", err),
}

I wanted to rewrite it to exclude the closure and have a concrete struct that implements whatever trait is required, to make it easier for the next person.

Diggsey (Author) commented Mar 2, 2020

The trait that's required is FnMut. This is implemented automatically by closures, but implementing FnMut for your own types is still unstable.

I'm not really sure what you hope to make easier by moving it to a struct, but LocalInfileHandler is just a struct, so if you want to encapsulate creating that struct you can just write a function that returns it and hides the fact that you're supplying a closure (see the sketch below)...
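
For example, a minimal sketch of such a wrapper (the function name and the Vec<String> input are illustrative; API as in the 18.x releases):

    use std::io::Write;
    use mysql::LocalInfileHandler;

    // An ordinary constructor function that hides the closure from callers.
    fn infile_handler(rows: Vec<String>) -> LocalInfileHandler {
        LocalInfileHandler::new(move |_file_name, stream| {
            for row in &rows {
                writeln!(stream, "{}", row)?;
            }
            Ok(())
        })
    }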

hpca01 (Contributor) commented Mar 2, 2020

@Diggsey thanks so much for the clear explanation 😄

I will just leave it the way it is, and write some detailed comments instead. Thanks again for your patience, and your work on the crate.

orangesoup commented:

Hey @blackbeam! I have a use case where I need to insert (or update) thousands of rows per second, so my only real option (as far as I know) is your second one: handcrafting the query with ON DUPLICATE KEY UPDATE ....

I've tried to use your pseudo-Rust code with the latest 20.0.1 version; however, I get an error:

conn.exec_drop(stmt, params)
     ^^^^^^^^^ the trait `std::convert::From<mysql::Params>` is not implemented for `mysql::Value`

For objects I've used a VecDeque<SomeStruct>; otherwise I haven't really changed anything, besides using exec_drop instead of prep_exec.

Is there a way to fix my issue? Will the library later offer a "native" option to do these kinds of operations easily?

blackbeam (Owner) commented:

@orangesoup, hi. params should be a Vec of mysql::Value. I've updated the code snippet in #59 (comment).
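
For concreteness, the flattening might look like this (value1/value2 mirror the earlier example and are illustrative; stmt and objects are the statement string and rows from that snippet):

    use mysql::Value;

    // One Value per `?` placeholder, row-major, in the same column order
    // that was used to build the statement string.
    let mut params: Vec<Value> = Vec::with_capacity(objects.len() * 2);
    for o in &objects {
        params.push(Value::from(o.value1.clone()));
        params.push(Value::from(o.value2.clone()));
    }
    conn.exec_drop(stmt, params)?;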

orangesoup commented:

Thanks, will check!

midnightcodr commented Apr 6, 2021

Thanks @blackbeam for sharing the code snippet (the second option). I found a tiny issue with the code: stmt would have a dangling comma at the end, which triggers a MySQL error when exec_drop() is run. To fix that, simply do a

stmt.pop();

right before the line

let mut params = Vec::new();

Other than that, the bulk-insert code works quite nicely.

blackbeam (Owner) commented:

@midnightcodr, I've updated the snippet. Thanks.

midnightcodr commented:

@blackbeam The 2nd option's bulk_insert function signature can be improved even further by making the Obj type generic, that is:

pub fn bulk_insert<F, P, T>(
    pool: &crate::Pool,
    table: String,
    cols: Vec<String>,
    objects: Vec<T>,
    fun: F,
) -> crate::Result<()>
where
    F: Fn(&T) -> P,
    P: Into<Params>,
...

With this change, bulk_insert is much more powerful: I don't have to write a new bulk_insert function (under a new name, of course) if I need to bulk-insert a type other than Obj. I can confirm through testing that making the object type generic works quite well. A usage sketch follows below.
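
As an illustration, the same generic helper can then serve unrelated row types (User, Event, and their fields are hypothetical):

    bulk_insert(&pool, "users".into(), vec!["name".into(), "age".into()], users, |u: &User| {
        params! {
            "name" => u.name.clone(),
            "age" => u.age,
        }
    })?;

    bulk_insert(&pool, "events".into(), vec!["kind".into(), "ts".into()], events, |e: &Event| {
        params! {
            "kind" => e.kind.clone(),
            "ts" => e.ts,
        }
    })?;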
