Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Web types #21

Merged
merged 4 commits into from Jul 24, 2020
Merged

Web types #21

merged 4 commits into from Jul 24, 2020

Conversation

ririsoft
Copy link
Contributor

Add some common web types support, using the whatwg specification.

Since the algorithm is stripping white spaces I am wondering wether we should limit the buffer size to 256 or 512 bytes (like other implementations do, Go http lib among others) or live this to the responsibility of the caller. Not doing this make this crate vulnerable to DOS attacks with very long space strings. Doing this limits the possibilities for the calling application. Maybe we should leave that to the caller but add a small comment about the danger. Comments welcome.

Cheers

Copy link
Collaborator

@paolobarbolini paolobarbolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take a more in depth look at it this evening. I'm not sure DOS is going to be a big issue, I guess if the user passes in a giant buffer then it could be!? We may want to enforce the 8192 bytes limit on Infer::get then, since there could be other matchers that are vulnerable to it?

If performance is really an issue here (I haven't benchmarked it yet), we may want to play with something like https://rust.godbolt.org to see if we can make the compiler elide bounds checks, optimize out unreachable branches or even get it to optimize the loops by making it generate SIMD instructions? Here's what the situation is like at the moment, if you feel like it's not fast enough and want to play with it https://rust.godbolt.org/z/o7rz1W

Anyway, I'll test this later, but looking at the current implementation it should be fine. We don't need to get extreme performance here, and I certainly wouldn't want to add any unsafe code...

src/matchers/text.rs Outdated Show resolved Hide resolved
src/map.rs Outdated
Comment on lines 65 to 70
(
MatcherType::APP,
"application/javascript",
"js",
matchers::no_match
),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we want this. I feel like this would be something the mime_guess crate should do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am using Infer in a web context where files have both a name (with extension) and streamed content (from network or local disk). When sniffing mime the priority is given to the content, and I fallback to file extension if the match fails. For some types, such Javascript or CSS, the sniffing can only be done using the file's extension.

I was thinking that infer would address both cases. If i have to use mime_guess in addition this is ok, but it seems to be unmaintained. I have a PR to fix a critical panic opened there for a month by the way.

Also if we decided that file extension matching is out of Infer scope then we should probably remove the Infer::get_from_ext method that I introduced in #16. On the other side If we decide that file extension matching is within Infer scope then I have several enhancements planned. For instance we could handle mime type aliases as well as file extension aliases.

What do you think?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think functionality like mime_guess's could fit into the infer crate, or at least we could argue it would have a similar api. Maybe the builtin extension matchers would be in a giant match instead of an Iterator but still...

I also use both infer and mime_guess, and I could benefit by having something like custom matchers for extensions, since my infer function currently looks like something like this:

pub fn infer<P: AsRef<Path>>(data: &[u8], path: P) -> String {
    // using infer from the master branch
    if let Some(kind) = infer::get(&data) {
        // look at the comment at the bottom on why I have to use String
        return kind.mime_type().into();
    }

    let path = path.as_ref();
    // this is a bit ugly
    let ext = path
        .extension()
        .and_then(|ext| ext.to_str())
        .unwrap_or_default()
        .to_lowercase();
    match ext.as_str() {
        // some RAW Image formats
        "arw" => "image/x-sony-arw".into(),
        "cr2" => "image/x-canon-cr2".into(),
        "crw" => "image/x-canon-crw".into(),
        "dng" => "image/x-adobe-dng".into(),
        "erf" => "image/x-epson-erf".into(),
        "nef" => "image/x-nikon-nef".into(),
        "orf" => "image/x-olympus-orf".into(),
        "rw2" => "image/x-panasonic-rw2".into(),

        // unfortunately I have to return a String, even though everything else is a &'static str
        // because mime_guess returns a lifetime that doesn't last enough, even though they
        // could return a &'static str if they didn't depend on the `mime` crate
        ext => mime_guess::from_ext(ext)
            .first_or_octet_stream()
            .essence_str()
            .into(),
    }
}

I think we only need to think if this is the best solution. We already have the mime_type and the extension for formats who also have a matcher, probably the biggest issue here is the pub(crate) fn no_match(_buf: &[u8]) -> bool { false } "hack".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we only need to think if this is the best solution. We already have the mime_type and the extension for formats who also have a matcher, probably the biggest issue here is the pub(crate) fn no_match(_buf: &[u8]) -> bool { false } "hack".

I share your concern about the "hack", but it does the job without the need for further internal refactoring at this stage.
Maybe we can keep it for now until we go further into a mime_guess & infer merge ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about it. What we could do is have the matcher inside infer::Type be an Option that takes None for the extension only matchers. Then we could put those matchers in a separate file from the magic bytes matchers, so that we could code generate them like mime_guess does without having it be a huge mess.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But still I'm not sure about it, it feels like one crate is doing the work two different crates should do

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am totally open to have mime_guess address extensions sniffing while infer would address content sniffing. In such case we should backout #16 which does not make much sense if we do not merge mime_guess into infer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose then that we move this discussion to a dedicated issue and that I remove the CSS and Javascript support with the no_match hack for now, so that we can merge this PR.

In the new issue we will discuss what features should Infer provide regarding file extensions and eventually decide to remove get_from_ext.

Is that ok for you ?

@ririsoft
Copy link
Contributor Author

I'll take a more in depth look at it this evening. I'm not sure DOS is going to be a big issue, I guess if the user passes in a giant buffer then it could be!? We may want to enforce the 8192 bytes limit on Infer::get then, since there could be other matchers that are vulnerable to it?

It makes sense indeed to have this addressed globally. I opened #22 to track this outside of this PR.

If performance is really an issue here (I haven't benchmarked it yet), we may want to play with something like https://rust.godbolt.org to see if we can make the compiler elide bounds checks, optimize out unreachable branches or even get it to optimize the loops by making it generate SIMD instructions? Here's what the situation is like at the moment, if you feel like it's not fast enough and want to play with it https://rust.godbolt.org/z/o7rz1W

Very interesting, thank you for sharing. I do not have performance issues related to Infer so far. I am using it in a web context where IOs are the bottleneck. I do not plan to work on this area in the short term personally. If you get into this area I will be pleased to look at your work, for the sake of learning stuffs.

@paolobarbolini
Copy link
Collaborator

I do not have performance issues related to Infer so far. I am using it in a web context where IOs are the bottleneck. I do not plan to work on this area in the short term personally. If you get into this area I will be pleased to look at your work, for the sake of learning stuffs.

Personally I run it on a background worker, so I don't mind having it be a bit slower. In some cases I myself pass the entire file to infer, since I already have it in memory, so I see how someone else could do the same. I guess I could still take a look at this in a livestream.

Copy link
Collaborator

@paolobarbolini paolobarbolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested an optimization

Before: https://rust.godbolt.org/z/jGv46j
After: https://rust.godbolt.org/z/dfaocx

src/matchers/text.rs Outdated Show resolved Hide resolved
src/matchers/text.rs Show resolved Hide resolved
@paolobarbolini
Copy link
Collaborator

paolobarbolini commented Jul 24, 2020

I think this is ready, but as discussed I don't think infer should also have extension matching functionality.

We should be able to get the mime_guess crate fixed. The author seems to still be active on GitHub, maybe you could write him an email and see if you can get him to take a look at your PR @ririsoft ? It worked for me for a different project I contributed to.

@ririsoft
Copy link
Contributor Author

I think this is ready, but as discussed I don't think infer should also have extension matching functionality.

We should be able to get the mime_guess crate fixed. The author seems to still be active on GitHub, maybe you could write him an email and see if you can get him to take a look at your PR @ririsoft ? It worked for me for a different project I contributed to.

Regarding my comment above in the code review I understand that:

  1. I should remove CSS and Javascript matching as well as the no_match hack
  2. I should remove get_from_ext initially submitted in Get from extension #16
  3. No need to open a separate issue to discuss support of extension matching.

Is my understanding correct ?

@paolobarbolini
Copy link
Collaborator

paolobarbolini commented Jul 24, 2020

Ops sorry didn't see your second comment. Right, let's open an issue about that so we can further discuss support for extension matching. For the moment let's leave #16 as is and decide before the release what to do with it

ririsoft and others added 3 commits July 24, 2020 22:46
A 30x optimization making the compiler optimize away a panic were &buf[i..] could be an invalid index

Co-authored-by: Paolo Barbolini <paolo@paolo565.org>
Copy link
Collaborator

@paolobarbolini paolobarbolini left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @ririsoft for all of the work. Merging now

@ririsoft
Copy link
Contributor Author

Here we go, hopping I did not misunderstood your points.
CSS and Javascript have been removed.
Thank you very much for your time on this!
Cheers.

@paolobarbolini paolobarbolini merged commit c43a7bd into bojand:master Jul 24, 2020
@paolobarbolini
Copy link
Collaborator

I gave another look at the mime_guess repo and it's a bit sad seeing how he's ignoring every PR and not doing anything with it. Anyway, it looks like there are already some alternatives to it https://crates.io/crates/mime-db, but very few crates are using it

@ririsoft
Copy link
Contributor Author

I gave another look at the mime_guess repo and it's a bit sad seeing how he's ignoring every PR and not doing anything with it. Anyway, it looks like there are already some alternatives to it https://crates.io/crates/mime-db, but very few crates are using it

I just opened #24 to continue the discussion on the file extension topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants