Support for multipart form extractors #2849

Closed
jacob-pro opened this issue Aug 15, 2022 · 43 comments · Fixed by #2883
Labels: A-multipart (project: actix-multipart), C-feature (Category: new functionality), RFC / Proposal

Comments


jacob-pro commented Aug 15, 2022

If you have a look on crates.io there is a proliferation of crates for doing multipart form uploads with files in actix.

Including:

I think each of these has some unique improvements and features that the others do not. It seems silly to have so many unofficial crates, and it is not very friendly to people who just want to quickly get started with actix.

I am wondering if we should have an effort to consolidate all of these crates into one official one in the actix project?


asonix commented Aug 15, 2022

Hey there! As someone mentioned here, I have a requirement for any consolidation work on this front. I haven't looked at each of these options so I can't speak for what features they do or don't have, but I think it is important to be able to support streaming files from a multipart upload into various upstreams (files, object storage, etc.). Just aggregating a file upload into a Bytes or a Vec isn't flexible enough, especially if uploaded files are potentially large.

That's the primary motivation behind my library, which I use in my image host API pict-rs
(example: https://git.asonix.dog/asonix/pict-rs/src/branch/main/src/main.rs#L814-L833)


jacob-pro commented Aug 15, 2022

Yep, I think at least some of these support writing files. In my implementation, stuff gets written to a NamedTempFile which you can then choose to persist.

What I might do a bit later when I have time is write up a matrix of all the different features these crates have, so we can get an idea of what an official crate would need to include.


asonix commented Aug 15, 2022

I also think that maybe moving to just an Extractor would be a better idea than my current API, which is a combination Middleware + Extractor. Maybe I'll play with updating my library to work more like that.

edit (more details):
Currently, I support multiple different "Form" middlewares that you can register for different routes. To make this work as a more extractor-driven API, I would need to change the Form from a middleware to extension data, then move the stream processing logic to a FromRequest that accesses that extension. This would allow adding a single form as top-level data if you only need one upload endpoint for simplicity, but different scopes would still need different forms to handle multiple endpoints (nothing a couple example programs couldn't demonstrate)

edit 2 (even more details):
An alternative to registering different forms at different scopes would be to allow tagging a form implementation with some additional data (e.g. a unit struct) and then referencing that same unit struct in the extractor. This would simplify scoping forms to routes, but would increase the complexity of the extractor type


JSH32 commented Aug 15, 2022

Mine uses serde and does not support streaming. It was mostly just a proof of concept, but I've been working locally on something similar to a builder, with a proc-macro to auto-parse into the struct, which is what I think would be best.


asonix commented Aug 15, 2022

@JSH32 fwiw the proc-macro "extract into this struct" method is really nice, and ideally that would be the final API. Maybe if there was an annotation you could put on a field like #[stream_with(handler)] to allow custom stream processing while still storing the result in the struct


JSH32 commented Aug 15, 2022

@asonix You could allow both Vec<u8> for getting just the bytes or something which is an AsyncRead (or something like that) to put the stream into the struct.

@jacob-pro

@JSH32 yeah I was impressed with your serde compatible macro - it is a lot better than my one!


asonix commented Aug 15, 2022

@JSH32 that doesn't quite work, because you need to process each field in order. If you embed a file stream directly into a struct for processing later, you don't read the Multipart stream any farther and can't "deserialize" fields that come afterwards.


JSH32 commented Aug 15, 2022

> @JSH32 that doesn't quite work, because you need to process each field in order. If you embed a file stream directly into a struct for processing later, you don't read the Multipart stream any farther and can't "deserialize" fields that come afterwards.

The problem is that a lot of the time the processor for the field will need context from the rest of the data, and reading the raw request is kind of unintuitive. Maybe save the original stream as a temporary file and provide a Stream for reading the bytes back from the temporary file?


asonix commented Aug 15, 2022

Saving to temporary files solves the problem for a number of use-cases, but prevents creating a pure "file proxy" implementation, which could stream files in-line to some other service. It requires either using tmpfs (which means holding files in memory) or using a filesystem, which adds requirements for deploying the service

I like the ToStream trait, but maybe it could look something like this:

trait FromStream {
    type Error: ResponseError;
    type Future: Future<Output = Result<Self, Self::Error>>;
    
    // implementations must ensure they consume the entire stream if they complete successfully,
    // otherwise the multipart stream will be stuck and no further fields can be processed (leading to a deadlock)
    fn from_stream(req: HttpRequest, stream: actix_multipart::Field /* or maybe some other opaque stream type */) -> Self::Future;
}

struct StreamMetadata {
    object_storage_id: String,
}

impl FromStream for StreamMetadata {
    type Error = actix_web::Error;
    type Future = Pin<Box<dyn Future<Output = Result<Self, Self::Error>>>>;
    
    fn from_stream(req: HttpRequest, stream: Field) -> Self::Future {
        // extract useful things from HttpRequest here
        // ...
        
        Box::pin(async move {
            // upload to minio or s3 or something here
            // ...
            
            Ok(StreamMetadata { object_storage_id: /* blah */ })
        })
    }
}

#[derive(Multipart)]
struct MyForm {
    file_field: StreamMetadata
}


JSH32 commented Aug 15, 2022

That definitely works. Still kind of meh, though, because you can't get the full metadata of a route for conditionally uploading based on something like a user, and you'd need to process your form data outside of the route, but I'm pretty sure it's the only way to do it right.


asonix commented Aug 15, 2022

This FromStream trait could be implemented for PathBuf (or maybe a custom TemporaryPathbuf) that would stream the files to disk and return their paths in the deserialized struct.


jacob-pro commented Aug 21, 2022

Thanks @asonix @JSH32

I've had a think about it and I think these are all the features we will need:

What we need to do with streams of bytes:

  1. Store the bytes in memory, e.g.
    pub struct BytePart {
        pub bytes: Vec<u8>,
        pub filename: Option<String>,
        pub mime: Mime,
    }
  2. Temporary files on disk (that can later be persisted)
    pub struct FilePart {
        /// The file data itself stored as a temporary file on disk.
        pub file: NamedTempFile,
        /// The size in bytes of the file.
        pub size: usize,
        /// The `filename` value in the `Content-Disposition` header.
        pub filename: Option<String>,
        /// The Content-Type specified as reported in the uploaded form.
        pub mime: mime::Mime,
    }
  3. An arbitrary user implemented future that streams the bytes elsewhere (e.g. cloud object storage)
    pub struct ObjectPart {
        /// The file has been uploaded to an external service
        pub object_id: String,
    }

As suggested by @asonix the solution could be to require each field to implement a trait, something like:

pub trait HandlePart {
   // Providing the request context allows pulling from app_data() to read config etc.
   fn handler(req: HttpRequest, stream: Field) -> Future<Output = Result<Self>>;
}

struct Form {
    bytes_in_memory: BytePart,
    file_on_disk: FilePart,
    file_in_s3: ObjectPart
}

What about simple text fields?

We could provide a handler for T: FromStr which would cover all of these?

We could provide a handler for T: Deserialize, and use it with the serde_plain crate which would convert text/plain into primitive rust types and basic enums.

Edit: We can provide another part implementation:

/// Deserializes the part in memory as plain text (using serde_plain)
#[derive(Deref)]
struct TextPart<T: DeserializeOwned>(T);

This would allow deserializing a plain text field into a simple type, e.g. strings, ints, floats, and plain enums.

It also does not restrict us to a single deserialization format, because we could introduce additional implementations such as:

/// Deserializes the part in memory as JSON
#[derive(Deref)]
struct JsonPart<T: DeserializeOwned>(T);

How should we handle Lists and Optionals?

We would need to provide our own containers that implement the handler trait. (We can't implement directly on Vec/Option
because it would conflict with the generic Deserialize implementation).

According to the spec:

> To match widely deployed implementations, multiple files MUST be sent by supplying each file in a separate part but all with the same "name" parameter.

Using something like compile_type_eq (thanks @JSH32) the macro could detect the use of Vec and Option wrapped fields:

struct Form {
    files: Vec<S3Upload>,
    maybe: Option<String>,
}

A Vec can be used when multiple parts arrive with the same "name". An Option can be used to represent an optional part that may or may not be in the form.

Unknown and Extra fields

Serde allows using #[serde(deny_unknown_fields)] to prevent surplus fields - we should probably have something similar?

A further question is what to do if multiple fields are uploaded with the same name, but we are not expecting a Vector?

My crate has the #[from_multipart(deny_extra_parts)] which errors if:

  • A form is uploaded with an unknown field name
  • A form is uploaded with a duplicate field name (and the type is not a Vector)

But these two cases should really be separately configurable.

The duplicate field case should actually have 3 actions - keep first, keep last or error?

Renaming fields

We need to be able to support something similar to #[serde(rename = "name")] for cases where we can't use the field-name
e.g. rust keywords or different casing styles.

Data limits

We need to be able to set a global limit on the amount of data read into memory, for basic DDoS protection,
just as the official actix extractors support.
It would have to be up to the implementor of the handler to choose whether to respect / consume from the in-memory limits.

But we will also want to provide field level limits as @JSH32 has done, which will be useful for files etc.

pub struct ExampleForm {
    #[multipart(max_size = 5MB)]
    file_field: File,
}

Let me know if you can think of anything I have missed?

@jacob-pro

@robjtede Please can you advise if this sort of extractor would be accepted into actix-extras and what would be an appropriate name for it?

Or actually maybe it should go into the actix-multipart crate itself - which would clear up a lot of the naming confusion?

If so, I am happy to write a lot of the code for this - but would just like to know where to get started from an organizational standpoint? Maybe we can have a WIP/feature branch in the actix repository which we can open PRs against?


JSH32 commented Aug 22, 2022

The handler will already be generated by a derive macro that introspects field names/options, so Vec/Option support can just be handled there, with the appropriate code generated conditionally.


jacob-pro commented Aug 22, 2022

@JSH32 I think you will find this is easier said than done - I could be wrong, but when I tried doing that with my macro I couldn't get it to work.

The problem is how you tell if something is a std::vec::Vec or a std::option::Option - I don't think macros are smart enough to do this - they can't do import resolution. E.g. what if a user is using a type alias?

In my macro I came up with a hacky workaround, which is to define two traits with the same method name, the first trait implemented for T: FromStr and the second trait implemented for Vec/Option, and then use T::method_name() to pull the data (the compiler chooses the right trait automatically). But this is kind of ugly and I don't want to do it again.


junbl commented Aug 22, 2022

Is there a reason to use FromStr over an implementation with serde like @JSH32's? I may be misunderstanding the issue, but I'm imagining an interface like this:

/// The resulting struct from an upload.
/// 
/// All fields must either implement `Deserialize` or `HandlePart`
#[derive(Multipart)]
struct Upload {
    image: File,
    stream: Stream<Item = u8>,
    s3_upload: ImplementsS3Upload,
    id: Option<i32>,
    list: Vec<String>,
    metadata: Metadata,
}

#[derive(Deserialize)]
struct Metadata { ... }

struct ImplementsS3Upload { ... }
impl HandlePart for ImplementsS3Upload { ... }


jacob-pro commented Aug 22, 2022

@e-rhodes Yeah I did also wonder about that - it would possibly make error handling easier - but my question is which deserializer were you thinking of using?
We could use serde_plain - which works for primitive data types and simple enums?
https://github.com/mitsuhiko/serde-plain
I think this would be the only thing that makes sense in the case of text/plain fields at least.

@jacob-pro

@e-rhodes I guess a further benefit if we use serde Deserialize as you suggest we could match on the provided mime type of the part, and use that to deserialize appropriately?
text/plain -> serde_plain
application/json -> serde_json
etc.


junbl commented Aug 22, 2022

I was mostly thinking about the additional form fields that can be posted with the files, as opposed to deserializing the content of the files. I don't have a ton of experience with the multipart spec, but it seems like other fields will just be multipart/form-data (per this answer).

For these fields, serde_plain may be sufficient---though it doesn't look like it would support deserializing multiple fields into a #[serde(flatten)]ed struct, or all the remaining keys into a HashMap, etc.

Looks like @JSH32's uses serde_json, but it supports the field[]=item syntax for arrays, which is awesome.

I also realized that my example above

    stream: Stream<Item = u8>,
    s3_upload: ImplementsS3Upload,

probably won't work because of the issue @asonix and @JSH32 alluded to above, that you can't deserialize the rest of the fields without consuming the stream first. Would it be possible to deserialize successfully only if the stream is provided last in the form? That would probably be confusing behavior to work with, but a nice interface if it works.

The order is supposedly guaranteed but I'm not sure how that works with e.g. curl or other apis.


jacob-pro commented Aug 22, 2022

@e-rhodes There are a few issues here:

First is the question of what syntax do you use for a list/array of parts. According to the spec the recommended way to do it is that the uploader puts each part in with the same name. However this does vary based on implementation...

The reason I am suggesting having a separate VecPart and OptionPart type is because we should be able to upload a list of files (or an optional file) with an arbitrary handler:

#[derive(Multipart)]
struct Upload {
    files: VecPart<S3Upload>,
    text: String,
}

So given the following upload:

Content-Type: multipart/form-data; boundary=BOUNDARY

--BOUNDARY
content-disposition: form-data; name="text"

Hello world
--BOUNDARY
content-disposition: form-data; name="files"

....bytes1...
--BOUNDARY
content-disposition: form-data; name="files"

....bytes2...

(Note the filename and content-type headers for each part are optional - it is up to the server to choose how to interpret them)

The multipart comes in as a stream of fields that we have to read one at a time; the order is up to the uploader and doesn't have to match our struct. We would read the first part and see that its name is "text"; we would then need to do a lookup (the macro would provide an implementation of this lookup) that matches "text" -> a handler, in this case the one from impl<T: Deserialize> Handler for T.

Once each handler is completed we will need to store its output in some sort of state, likely a Map<(String, TypeId), Box<dyn Any>> (a similar concept to actix extensions).

Once we have processed all the parts in whatever order they arrived in, saved them into a map, we can then create the output struct Upload by retrieving each of its fields out of the map.

What we can't do is this:

#[derive(Multipart)]
struct Upload {
    files: Vec<S3Upload>,
    text: String,
}

First of all, S3Upload won't implement Deserialize, so Vec<S3Upload> can't implement Deserialize. The compiler (at least until trait specialisation arrives) won't allow us to implement our handler trait for both T: Deserialize and also Vec<T> at the same time, because of the possibility that Vec<T> itself implements Deserialize (which it does!), causing a conflict.

We could compile this:

#[derive(Multipart)]
struct Upload {
    files: OptionalPart<S3Upload>,
    text: Vec<String>,
    some_other_complex_structure: HashMap<String, String>,
    plain_text: String,
}

But it would only work where all the data for those fields is uploaded as a single text part. And to do that we would need to know which deserializer to use. One possibility is that the user could provide it using the content-type of the part:

Content-Type: multipart/form-data; boundary=BOUNDARY

--BOUNDARY
content-disposition: form-data; name="text"
content-type: application/json

["string 1", "string2"]
--BOUNDARY
content-disposition: form-data; name="some_other_complex_structure"
content-type: application/json

{"key": "value"}
--BOUNDARY
content-disposition: form-data; name="plain_text"
content-type: text/plain

abcdefg

@jacob-pro

Maybe https://github.com/robjtede/actix-web-lab is the place for development?


junbl commented Aug 23, 2022

> what syntax do you use for a list/array of parts

Seems like we should support both the multiple files with same field name as well as multiple fields with names like field[] (ideally for both the fields and the files).

> The compiler (at least until trait specialisation arrives) won't allow us to implement our handler trait for both T: Deserialize and also Vec<T> at the same time, because of the possibility that Vec<T> itself implements Deserialize (which it does!), causing a conflict

Makes sense now. Still, having wrappers for Vec and Option is less than ideal UX. Could we achieve this with attribute macros?

struct Upload {
    all_files: Vec<File>,
    cloud_file: S3Upload,
    #[multipart(deserialize)]
    field: Vec<String>,
}

Alternatively, maybe the library could provide a struct:

struct Multipart<F: HandlePart, T: Deserialize> {
    files: F,
    fields: T,
}
#[derive(HandlePart)]
struct Files { ... }
#[derive(Deserialize)]
struct Fields { ... }
pub async fn handler(upload: multipart::Multipart<Files, Fields>) { ... }

Doesn't seem like there's an easy way to support only having files with this approach, though.

> the user could provide it using the content-type field

This seems like it would be frustrating to require, since as far as I can tell there's no way to supply a content-type on an HTML form. I know you can do it with curl, but at least for my use case I need a frontend. I think supporting deserialization of JSON passed as a form field is a low-priority use case.

For me, the most valuable use cases would just involve deserializing from flat form fields to enums (either validating a dropdown or fields that can be multiple types) or (not as useful) collecting multiple fields into a struct/map. E.g.:

Content-Type: multipart/form-data; boundary=BOUNDARY

--BOUNDARY
content-disposition: form-data; name="field1"

dog
--BOUNDARY
content-disposition: form-data; name="multi_type"

1
--BOUNDARY
content-disposition: form-data; name="other_field"

cat
--BOUNDARY
content-disposition: form-data; name="dropdown"

bird
--BOUNDARY
content-disposition: form-data; name="file"

## bytes of file ##
struct Data {
     field1: String
}
#[serde(untagged)]
enum StrOrInt {
    Str(String),
    Int(i32),
}
#[serde(rename_all = "snake_case")]
enum Dropdown {
    Bird,
    Frog,
}

#[derive(Multipart)]
struct Upload {
    #[serde(flatten)]
    data: Data,
    multi_type: StrOrInt, 
    dropdown: Dropdown,
    #[serde(flatten)]
    rest_of_the_fields: HashMap<String, Value>, // grabs other_field
    file: File,
}


jacob-pro commented Aug 23, 2022

@e-rhodes

> Seems like we should support both the multiple files with same field name as well as multiple fields with names like field[]

Do you mean where multiple parts are uploaded each using the name field[] ? Would the following work?

struct Upload {
    #[multipart(rename="files[]")]
    files: VecPart<File>,
    #[multipart(rename="text[]")]
    texts: VecPart<String>,
}

In the example you've given you wouldn't be able to use those #[serde(flatten)] annotations since as mentioned before the Upload struct itself does not and cannot implement Deserialize itself.

This also doesn't work because of the limitations of serde_plain:

    #[derive(Deserialize, Debug)]
    #[serde(untagged)]
    enum StrOrInt {
        Int(i32),
        Str(String),
    }

    #[test]
    fn test_deserialize() {
        let s1: StrOrInt = serde_plain::from_str("1234").unwrap();
        // Str("1234") - doesn't match to int because plain text doesn't make a distinction unlike JSON
        println!("{:?}", s1);
    }

But the other enum would work, and this is one advantage of implementing for T: Deserialize instead of just T: FromStr:

    #[derive(Deserialize, Debug)]
    #[serde(rename_all = "snake_case")]
    enum Dropdown {
        Bird,
        Frog,
    }

    #[test]
    fn test_deserialize2() {
        let d1: Dropdown = serde_plain::from_str("frog").unwrap();
        // Dropdown::Frog
        println!("{:?}", d1);
    }

I think the fundamental issue here is that you are trying to categorise things into files and non files...

From the perspective of the server we just get a stream of parts, each one has:

  • A field name
  • An optional filename (this doesn't necessarily indicate if it is a file)
  • An optional mime type (but if it is missing, we can default to text/plain)
  • A stream of bytes

> The file name isn't mandatory for cases where the file name isn't available or is meaningless or private; this might result, for example, when selection or drag-and-drop is used or when the form data content is streamed directly from a device.

What the proposal above was going to do is map a handler based purely on the field name. If the field name is associated with the S3 uploader type, the data would get uploaded to S3 regardless of whether it is a 10-byte text/plain field or a 1 GiB application/octet-stream file.

I guess a scenario to imagine is a user uploading a 1 GiB .txt file: it would have a mime type of text/plain and the filename may be empty - there is no reliable way for the server to decide if that is "file" or "non-file"...

This seems like it would be frustrating to require, since there's no way to supply content-type on an HTML form.

Yes you are right, and it goes back to my question - that for a given part how do you know what deserializer to use?

My suggestion to use the content-type would work for HTML forms, since the forms either supply a mime type of text/plain or default to text/plain and therefore their fields would get deserialized with serde_plain.

If you wanted to do JSON, we would either need an API client to tell the server via the content type that it is uploading JSON within that part, or put in some sort of annotation on the server side so we know to deserialize that part with JSON instead.

> rest_of_the_fields: HashMap<String, T>, // grabs other_field

This is a feature I hadn't really thought about. It would be workable with an annotation similar to this. But it would come with the restriction that every part would have to use the same handler T, e.g. you have to choose whether you want to attempt to read every unknown field into memory, write to disk, or upload to S3, etc.

robjtede transferred this issue from actix/actix-extras Aug 25, 2022

robjtede commented Aug 25, 2022

(Sorry for the hiatus.)

I'm glad someone is giving this problem some thought; it definitely needs it.

> would just like to know where to get started from an organizational standpoint

actix-multipart is pre-v1 so I don't mind iterating on the version as quickly as I would in actix-web-lab, hence the issue transfer to this repo. Also if the actix-multipart-derive crate name is needed for this effort that's fine by me.


I've been thinking for a while that an easier first goal might be some sort of non-macro API similar to awmp or a builder API... idk... it might play nicer with the flexibility we're wanting here.

robjtede added the C-feature (Category: new functionality), A-multipart (project: actix-multipart), and RFC / Proposal labels Aug 25, 2022

junbl commented Aug 25, 2022

I wasn't familiar with awmp; looks like it has an API similar to what I was imagining here:

struct Upload {
    all_files: Vec<File>,
    cloud_file: S3Upload,
    #[multipart(deserialize)]
    field: Vec<String>,
}

Alternatively, maybe the library could provide a struct:

struct Multipart<F: HandlePart, T: Deserialize> {
    files: F,
    fields: T,
}

Either of these would lend themselves to using an internal struct similar to awmp::Parts that could eventually be extended into a macro API for custom structs.

@robjtede when you say a builder api, are you imagining something like this?

Multipart::new()
    .file("file_field_name")
    .text("text_field_name")

Which would basically be another interface to awmp::PartsConfig::with_{file|text}_fields. This could also lend itself to extending with generics like:

Multipart::new()
    .file::<S3Upload>("file_field_name") // T: Multipart or whatever we call the trait
    .text::<CustomEnum>("text_field_name") // T: Deserialize

Possible extensions to cover the use cases we've discussed:

Multipart::new()
    .files::<File>("name") // Vec<File>
    .into_struct::<FlattenedStruct>() // uses field names in struct? may be possible with serde but probably macros

The biggest issue with this approach is that it requires registering the configuration at the app level, instead of being able to specify the structure just with the struct given in the route (like other extractors e.g. web::Query<DeserializableStruct>). We'd need a way to specify multiple configurations per route, too, which from a glance I can't see how to do in awmp.

@jacob-pro

> Do you mean where multiple parts are uploaded each using the name field[] ? Would the following work?

Seems like it would--I'm of the opinion that it should be supported by default instead of having to decorate each field, but that's an unnecessary extension as long as there's some way to do it.

> The file name isn't mandatory

Dang lol that was the piece I was missing; I thought that's how it would distinguish them. Regardless, either of the approaches above (two different structs or specified with attribute macros) would solve this issue, since the derive macro can statically determine how to map fields to files/deserializers.

> It would be workable with an annotation similar to this

I wasn't imagining supporting this for the file parts, but that's an interesting thought. If we strictly separate them, then we can piggyback off of serde(flatten) for the text fields and at any point add the feature for e.g. multipart(flatten).


junbl commented Aug 25, 2022

Here's my proposal:

  1. An internal API similar to awmp.
    • This will support splitting your fields into text fields and file fields, which store a list of pairs (String, String) or (String,TempFile), respectively. (or &str or NamedTempFile or whatever else.)
      • Is this worth storing instead as a HashMap<String, Vec<String|TempFile>>? Probably less overhead to just keep as the Vec, and the less ergonomic interface isn't an issue with 2a.
    • The structure of these (where fields go) should be able to be specified per route. (This might not be feasible without 2a, or not worth reimplementing)
  2. Conversion from the internal representation to user defined structs.
    • A struct Multipart<F, T>, where T: Deserialize and F: MultipartUpload. Multipart will implement FromRequest, which will allow for specifying the structure just in the handler args, like in web::{Query, Json, Form}.
      • The internal representation Map<String,String> can be deserialized into T using e.g. serde_json.
      • Ideally, an interface like Multipart<T, ()> if you don't have any text fields.
    • Implementations of MultipartUpload for common use cases, e.g. TempFile, File, Vec<u8>.


jacob-pro commented Aug 25, 2022

@e-rhodes what you have described is very similar to what my existing crate already does.

It has an internal loader API:

let grouped: MultiMap<String, Field> = Loader::builder()
        .file_limit(500 * 1024 * 1024)
        .build()
        .load_grouped(payload)
        .await
        .unwrap();

// Where:
pub enum Field {
    File(File),
    Text(Text),
}

It determines whether a part is "text" or "file" based on this heuristic:

let item = if content_type == mime::TEXT_PLAIN && cd.get_filename().is_none() {
   // Treat as text
} else {
   // Write to tempfile
};

However like I have said many times before - this is a mistake - for multiple reasons:

  • It does not work if someone uploads a very large text file (which is a perfectly reasonable use case)
  • Error messages are confusing - what if someone uploads "text" (with the right field name), when the server is expecting "file"
  • What if you are only expecting small files and want them read directly into memory?
  • What if you want to write data somewhere other than Tempfiles?

The discussion we were having at the start of the thread, is that this isn't good enough as a general purpose solution, because we want to be able to support streaming to arbitrary backends (file, S3, memory) based on the field name.

It would thus allow us to support much broader use-cases, which your suggestion of Vec<String|TempFile> does not allow.


junbl commented Aug 25, 2022

@jacob-pro

TempFile in my example was a misnomer: I meant any kind of internal representation of the stream of bytes, one that can eventually be used as any type that implements MultipartUpload. Maybe the easiest way to do this is to just make it generic at this stage, rather than trying to make some kind of unified stream representation.

Maybe this example will show what I mean more clearly:

#[derive(Deserialize)]
struct TextFields {
    field1: String,
}

#[derive(MultipartUpload)]
struct FileFields {
    field2: Vec<u8>, // in memory, implemented by library
    field3: S3Upload, // custom implementation of `MultipartUpload`
}

async fn handler(upload: Multipart<TextFields, FileFields>) -> impl Responder { ... }

Multipart's implementation of FromRequest decides where to put each field based on the names in the struct. We never interact with the MIME type or the filename, which, as you say, can't be relied upon. The type of the field, combined with which struct it appears in, determines how the bytes from the upload are treated.

@jacob-pro
Contributor Author

@e-rhodes the distinction between file and non-file only makes sense in the context of the client or a web browser.

From the server side it is completely up to us to decide what we want to do with the parts. It is irrelevant whether they came from an actual file on disk or from the client's memory; the server can't reliably tell, and even if it could, it wouldn't matter:

For example I would often like to submit a multipart like this:

Content-Type: multipart/form-data; boundary=BOUNDARY

--BOUNDARY
content-disposition: form-data; name="manifest"
content-type: application/json

{"title": "Photo title", "category": "holidays"}
--BOUNDARY
content-disposition: form-data; name="image"
content-type: image/jpeg

JPEG_BYTES...

Note that neither of these parts have a filename. The mime types are correct, neither of them are text/plain.

My proposal above would work with this:

#[derive(Deserialize)]
struct Manifest {
   title: String,
   category: String
}

#[derive(MultipartForm)]
struct Form {
    manifest: Manifest, // Because the mime-type is provided we know we can use serde_json instead of serde_plain
    image: S3Upload
}

Or via a builder API:

MultipartForm::new()
    .part::<S3Upload>("image") // implements MultipartHandler or whatever we call the trait
    .part::<Manifest>("manifest") // implements Deserialize

Note there is no distinction between "file" and "text"

@JSH32

JSH32 commented Aug 25, 2022

Unrelated, but I'm just going to point out, since someone questioned it earlier, that a macro API can still have static type checking based on trait implementations, since the generated deserializer code won't compile if a field's type doesn't implement the required traits. Also, this exists if we ever need it: https://docs.rs/compile_type_eq/latest/compile_type_eq/ (although this would probably not be needed unless we go the serde route, which we will not)

@jacob-pro
Contributor Author

thanks @JSH32 that is very cool!

@jacob-pro
Contributor Author

jacob-pro commented Aug 26, 2022

I think I have a better option which might simplify the whole thing: let's not provide a generic implementation for deserialization, and instead always require use of a wrapper type. It is slightly more verbose but would make the intent far clearer:

/// Deserializes the part in memory as plain text (using serde_plain)
struct TextPart<T: Deserialize>(T);

/// Deserializes the part in memory as JSON
struct JsonPart<T: Deserialize>(T);

/// Uploads the part to S3
struct S3Upload{ .. }

/// Writes data to a temporary file
struct FilePart{ .. }

/// Reads bytes into memory
struct BytesPart(Vec<u8>);

#[derive(MultipartForm)]
struct Form {
    integer: TextPart<i32>,
    json: JsonPart<DataStructure>,
    s3: S3Upload,
    on_disk: FilePart,
}

This would then unlock the ability to use standard library Option to indicate a part that may or may not be included in the multipart, and Vec to group parts with the same field name.

Consider this complex example:

#[derive(MultipartForm)]
struct ComplexOptionalExample {
    field1: Option<TextPart<i32>>,
    field2: TextPart<Option<i32>>,
}

field1 allows the part to be omitted from the form, but if the part is included then the data must be an integer.

field2 requires the part to be included with the form, but allows its contents to be empty.

@junbl
Contributor

junbl commented Aug 26, 2022

You're right: "file" is the wrong word. The real distinction is between fields that are deserialized into a struct and fields that do something else with the stream of bytes. There's no reason to stop someone from collecting something that was uploaded as a file into a String or Vec<u8>, or saving data from an HTML form field as a file.

An implementation with required wrapper types does provide easier extraction for the use cases you present, but it's at the cost of usability in more common use cases.

The main difference between these two implementations is that I was suggesting putting all the TextParts in their own struct, so we can use serde directly. Whether JsonPart is supplied by the library or implemented by the user is not as important, the functionality is possible either way. I think your Option<TextPart<>> example is better expressed with a custom enum/custom deserialization if you need to distinguish between "field given as empty string" and "field doesn't appear".

Having to put TextPart/BytesPart/etc. on every field and then .into_inner() or .0 everywhere it's used is a pain. For my use case at least, there are far more form fields than files. I'd prefer dropping serde entirely and just having impl FromMultipart for T: FromStr.

The biggest question to me is: how much of serde are we reimplementing to support the nondeserializable fields, vs how much is only relevant to deserializable fields? From https://serde.rs/attributes.html:

Things like rename/alias are useful for both text and streamed data. default and flatten would be fine on just the in-memory fields.

Something like deserialize_with would probably be useful too, to prevent having to make your own struct if you want to quickly implement different upload behavior.

rename_all and deny_unknown_fields are definitely useful to be applied to all fields.

It seems to me that the cost of reimplementing some of these attributes is not much more if we implement them for all fields as opposed to just those we don't deserialize, and we risk confusing behavior if e.g. #[multipart(rename)] doesn't do the same thing as #[serde(rename)]. Plus, all of these attributes are non-essential features--we just have to design it in a way that supports adding them.

Considering this, I think I'd prefer a single-struct approach with minimal wrapper types, either with attribute macros to distinguish what should be deserialized/streamed or just using FromStr.

  1. serde
#[derive(FromMultipart)]
#[multipart(deny_unknown_fields, rename_all="PascalCase")]
struct Upload {
    #[multipart] // says use our FromMultipart trait instead of Deserialize
    file: Vec<File>,
    #[serde(deserialize_with = "custom_fn")]
    field: Vec<String>,
    #[serde(rename = "Type")]
    r#type: Format,
    #[serde(flatten)]
    metadata: Metadata,
    json: JsonMetadata, // impl Deserialize for JsonMetadata to parse json from string
}

Then our derive macro basically internally constructs the struct with just the fields that can implement Deserialize and passes those attributes directly to serde, and does its own processing on the streams for the other fields. We'll get more functionality at the beginning without having to reimplement those attributes, as well as any new functionality that serde provides.

  2. FromStr
#[derive(FromMultipart)]
#[multipart(deny_unknown_fields, rename_all="PascalCase")]
struct Upload {
    file: Vec<File>,
    #[multipart(with = "custom_fn")]
    field: Vec<i32>,
    #[multipart(rename = "Type")]
    r#type: Format,
    #[multipart(flatten)]
    metadata: Metadata,
    json: JsonMetadata, // impl FromMultipart for JsonMetadata parses the contents of the stream and deserializes with serde_json
}

We provide impl<T: FromStr> FromMultipart for T.

This gets around the issues with Vec and Option, since we're providing impl<T: FromMultipart> FromMultipart for Vec<T>--no collision since Vec/Option don't implement FromStr.

What does everyone else think? @robjtede @JSH32 @asonix Preferences for any of these apis/any missing features?

@JSH32

JSH32 commented Aug 26, 2022

I have an idea about custom deserialization: what about a custom extractor-like API? Something like MultipartField, which provides a stream of the field if the name matches. For example:

pub struct TextPart<T: Deserialize>(T);

impl<T: Deserialize> Deref for TextPart<T> {
    type Target = T;

    fn deref(&self) -> &T {
        &self.0
    }
}

impl<T: Deserialize> FromField for TextPart<T> {
   // This example is `async_trait` to make it easy to understand. Could also work with `Self::Future`
   async fn from_field(req: &HttpRequest, field: &Field) -> Result<Self, io::Error> {
        // Consume the field into `TextPart<T>`
        // In the main non-extractor code we will check and make sure that the field stream is entirely exhausted to prevent issues.
   }
}

It would make it very easy to implement different ways of parsing a field, and it goes well with the FromRequest conventions. The other problem is how we would handle array fields: they can be out of order, and each element is suffixed by a []. We can't exactly pass Vec<Field> to from_field, since they can be out of order and the stream has to be consumed in order.

Maybe we could have this for single fields:

#[async_trait]
trait FromSingleField {
    async fn from_field(req: &HttpRequest, field: &Field) -> Result<Self, io::Error>;
}

And this for multiple fields since T would be what the individual fields which are part of the array are parsed into, and handled by the main deserializer:

#[async_trait]
trait FromMultipleFields<T> {
    async fn add_field(req: &HttpRequest, field: &Field) -> Result<T, io::Error>;
    async fn complete(req: &HttpRequest, fields: Vec<T>) -> Result<Self, io::Error>;
}

You could also implement something like this for simple array field types that don't need any special logic and only need collection.

impl<T: FromSingleField> FromMultipleFields<T> for Vec<T> {
    async fn add_field(req: &HttpRequest, field: &Field) -> Result<T, io::Error> {
        T::from_field(req, field).await
    }

    async fn complete(_: &HttpRequest, fields: Vec<T>) -> Result<Self, io::Error> {
        Ok(fields)
    }
}

On a side note, I feel that using serde to actually parse the struct isn't such a good idea. Of course we can make an adapter that accepts fields of any type implementing Deserialize, but the form itself shouldn't implement Deserialize. For something so limited that requires streaming of possibly gigantic files, it seems like an antipattern.

@jacob-pro
Contributor Author

jacob-pro commented Aug 26, 2022

@e-rhodes I'm starting to see what you mean: define a MultipartForm as a tuple of two separate structs, one containing the fields you want handled by serde, and the other containing the fields you want handled by arbitrary async handlers?

Having to put TextPart/BytesPart/etc on every field and then .into_inner() or .0 everywhere it's used is a pain. For my use case at least, there are much more form fields than files. I'd prefer dropping serde entirely and just having impl FromMultipart for T: FromStr.

Actually I think the problem is the other way around!

In my case, you can simply dereference with * to get the inner values.

#[derive(Deref)]
struct TextPart<T: DeserializeOwned>(T);

#[derive(Deref)]
struct UnifiedFormExtractor<T>(T);

struct Upload {
    int_field: TextPart<i32>,
    text_field: TextPart<String>,
}

async fn route_a(form: UnifiedFormExtractor<Upload>) -> HttpResponse {
    let int: i32 = *form.int_field;
    let string: &str = &*form.text_field;
    unimplemented!()
}

Whereas in your case:

#[derive(Deserialize)]
struct SerdeFields {
    int_field: i32,
    text_field: String
}

struct SegregatedForm<T: DeserializeOwned, E> {
    serde_fields: T,
    other_fields: E,
}

async fn route_b(form: SegregatedForm<SerdeFields, ()>) -> HttpResponse {
    let int: i32 = form.serde_fields.int_field;
    let string: &str = &form.serde_fields.text_field;
    unimplemented!()
}

I don't personally think this would be a friendly API:

  1. By having two separate structs it kind of implies to the user that the fields are in separate namespaces when they are not. For example, what happens if you have a field with the same name in both of the structs?
  2. It wouldn't match any of the other actix-web extractor APIs (Json, Form, Path, etc.), which all extract into a single struct that dereferences to its inner fields.

I also think it still leaves a lot of ambiguity concerning deserialization - in the serde struct:

#[derive(Deserialize)]
struct SerdeFields {
    list: Vec<String>
}

What deserializer would you use? Does this correspond to a form with multiple string parts called list, or to deserializing a single part (e.g. JSON or XML) called list into a list of strings?

In comparison, my suggested approach would allow you to choose your deserialization type, letting you build an API that accepts plain text parts (e.g. from a browser), or more complex JSON/XML parts from an API client, etc.

@JSH32

JSH32 commented Aug 26, 2022

Check out my previous comment. It might solve some stuff by not making things dependent on serde (optionally) and allowing existing types to be used.

@asonix
Contributor

asonix commented Sep 7, 2022

I've updated my library to work as an extractor rather than a middleware: https://git.asonix.dog/asonix/actix-form-data/src/branch/v0.7.x/examples/simple.rs

@jacob-pro
Contributor Author

jacob-pro commented Sep 18, 2022

@JSH32 I was wondering if you could help with compile-time type checking within a proc-macro?

For a given Field with a Type, I can't figure out how to match on the type to tell whether it is an Option or a Vec?

Edit: Don't worry, I've figured out that I don't actually need to do this.

@jacob-pro
Contributor Author

jacob-pro commented Nov 5, 2022

Update: whilst my PR remains open, I have backported the new features to my previous library, actix-easy-multipart v3.0.0, in case anyone wants to use them now.

@jacob-pro
Contributor Author

Hi @Sirneij

The compilation issue doesn't seem related to this PR.

You have fore_image: actix_easy_multipart::tempfile::Tempfile, which is a temporary file already written to disk.

The Tempfile struct has a field file, which is a NamedTempFile.

You can save it (i.e. move it out of the temp directory) using persist,

e.g.

fore_image.file.persist("media/articles/name")

@Sirneij

Sirneij commented Jan 13, 2023

Just fixed it. Sorry.

@robjtede robjtede linked a pull request Feb 26, 2023 that will close this issue
@robjtede
Member

This feature was implemented by #2883 and is now in final testing before release in the next day or two.
