Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor/transform: introduce transform and improve error handling #35

Merged
merged 21 commits into from Feb 19, 2021
Merged

Conversation

drahnr
Copy link
Contributor

@drahnr drahnr commented Feb 18, 2021

Note that this PR still lacks test adjustments required and transform is not covered yet.

Changes:

  • improve the error type
  • add fs-err for better errors without much chore
  • introduce fn transform for transformations before artifacts hit the cache_dir (this is not quite correct now, tbd)
  • introduce a type alias type Result<T> to avoid boilerplate
  • use BufReader and BufWriter instead of intermediate Vec where possible
  • migrate a few elements to more idiomatic expressions

@bminixhofer
Copy link
Owner

Thanks for the PR!

I also think a transform_fn field on the builder is the best solution, I planned to do it the same way. I'm not sure a transform_path_fn is necessary since the transformed data might not actually exist in file form (e. g. if no cache dir is set). That said, not having a transform_path_fn would of course mean that the file extension of the cached data is slightly incorrect. I'm fine with both.

Also thanks for fixing #33 and the expects right away too! This PR is quite subsantial so feel free to add yourself to the authors of nlprule-build. I'll have a closer look once it is finished.

@drahnr
Copy link
Contributor Author

drahnr commented Feb 18, 2021

It seems crate brotli is f*cked. Neither smush nor brotli directly succeed, currently the relevant code is comments are prefixed with XXX. Filed dropbox/rust-brotli#52

@drahnr
Copy link
Contributor Author

drahnr commented Feb 18, 2021

@bminixhofer a first round of review would be appreciated 🦺 it could be simplified in a few areas, but that's smth for another day.

@bminixhofer
Copy link
Owner

a first round of review would be appreciated

Sure! But likely tomorrow 🙂

It seems crate brotli is f*cked.

IMO it's fine to just use something else in the tests here. I just tried this and have the same issue (hangs indefinitely). But I don't think it is needed for this PR. It's good to have two different compression algorithms in the tests but I see no reason to use brotli specifically.

Copy link
Owner

@bminixhofer bminixhofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, review done 🙂 Mostly just comments and small changes. The overall approach is good!

The most significant thing I would've done differently is not using traits for the transform arguments, that adds quite a bit of boilerplate. I think we can also simplify the signature a bit.

build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated
pub type Result<T> = std::result::Result<T, Error>;

/// Definition of the data transformation for the network retrieved, binencoded rules and tokenizer datasets.
pub trait TransformDataFn<I: Read>:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the rationale for having a trait and the generic I and Cursor<&'r mut Vec<u8>> here instead of e. g. type TransformDataFn = Box<dyn Fn(Vec<u8>) -> Result<Vec<u8>>>;? That would simplify things quite a bit and Fn(Vec<u8>) -> Result<Vec<u8>> is still expressive enough. I would argue readability is more important than performance here because the closure is likely called very infrequently.

Copy link
Contributor Author

@drahnr drahnr Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what I meant in my previous comment in the PR (not in the review pane). That's complexity to be removed.

I don't think it makes sense to use type aliases. Passing the type would then require the user to Box or have the trait bounds there (and hence twice).

While in this case one could argue, that a simplified fn signature might be good, I find that a bit of an anti-pattern. impl Read and impl Write can easily be "chained"/combined without intermediate buffers.

Copy link
Contributor Author

@drahnr drahnr Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just attempted it, but having an dyn Fn(dyn Read, dyn Write) -> result::Result<(), OtherError> {} opens a can of worms, so I am not sure that's in scope of this PR.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, I agree on the impl Read / impl Write now, it is also consistent with the signature of postprocess.

Passing the type would then require the user to Box or have the trait bounds there (and hence twice).

I don't see an issue with having the bounds twice (once in the type alias, once in the transform function). Something like this would work:

pub type TransformDataFn =
    Box<dyn Fn(Cursor<Vec<u8>>, Cursor<&mut Vec<u8>>) -> result::Result<(), OtherError>>;

and

pub fn transform<F, C>(mut self, proc_fn: C, path_fn: F) -> Self
where
    C: Fn(Cursor<Vec<u8>>, Cursor<&mut Vec<u8>>) -> result::Result<(), OtherError> + 'static,
    F: TransformPathFn + 'static,
{
    self.transform_data_fn = Some(Box::new(proc_fn));
    self.transform_path_fn = Some(Box::new(path_fn));
    self
}

or am I missing something? The repetition is not perfect but changing one or the other immediately gives a compile time error so that's not an issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the last commit, one needs late bound life times and a bunch of boxes to be able to use dyn Read and dyn Write in the args which is needed for the get_build_dir case.

build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Show resolved Hide resolved
@drahnr
Copy link
Contributor Author

drahnr commented Feb 19, 2021

Note that this now includes a bunch of lifetime trickery, which I am happy to explain.

@drahnr
Copy link
Contributor Author

drahnr commented Feb 19, 2021

Oh, and please squash merge this 🗜️

Copy link
Owner

@bminixhofer bminixhofer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can simplify the signature of TransformDataFn to use concrete structs. That would also remove the "lifetime trickery". 🙂

Besides that only small changes. Please also run rustfmt on the files.

build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
nlprule/src/bin/compile.rs Show resolved Hide resolved
build/src/lib.rs Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
build/src/lib.rs Outdated Show resolved Hide resolved
drahnr and others added 9 commits February 19, 2021 13:13
Co-authored-by: Benjamin Minixhofer <bminixhofer@gmail.com>
Co-authored-by: Benjamin Minixhofer <bminixhofer@gmail.com>
Co-authored-by: Benjamin Minixhofer <bminixhofer@gmail.com>
Co-authored-by: Benjamin Minixhofer <bminixhofer@gmail.com>
Co-authored-by: Benjamin Minixhofer <bminixhofer@gmail.com>
Co-authored-by: Benjamin Minixhofer <bminixhofer@gmail.com>
Co-authored-by: Benjamin Minixhofer <bminixhofer@gmail.com>
@bminixhofer
Copy link
Owner

CI failed because Backblaze bandwith got exceeded, I fixed that now. But there's a compilation error now.

@drahnr
Copy link
Contributor Author

drahnr commented Feb 19, 2021

Should be fixed now.

Sidenote: Why do I need python3.9 to link the rust crate?

@bminixhofer
Copy link
Owner

Why do I need python3.9 to link the rust crate?

I'm not sure; maybe because of the Python bindings? They're in the same workspace.

Looks good to me now! Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

postprocess has different semantic than anticipated Tighter error bounds around Error
2 participants