Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get rid of the unsafe block in Parser::new #6

Closed
IvanUkhov opened this issue Jun 23, 2017 · 12 comments · Fixed by #39 or #40
Closed

Get rid of the unsafe block in Parser::new #6

IvanUkhov opened this issue Jun 23, 2017 · 12 comments · Fixed by #39 or #40

Comments

@IvanUkhov
Copy link
Member

IvanUkhov commented Jun 23, 2017

It would be great to get rid of the unsafe code in Parser::new.

@P-E-Meunier
Copy link

In the latest crates.io release (0.5.12), if I keep a pointer to the text inside an Event::Text while parsing, the text inside the &str changes if I keep pulling events from the parser.

This seems linked to this issue, right?

@IvanUkhov
Copy link
Member Author

Hi, I’m not sure what you mean. Could you please provide a working example?

@d-e-s-o
Copy link

d-e-s-o commented Jul 3, 2020

Sorry to say that so bluntly, but based on my understanding (happy to be proven wrong!) what is being done in the unsafe block is scary and simply wrong. The memory of the string allocated in read_internal will get deallocated at the end of the function and all your parsing logic is operating on dead memory. That's likely what @P-E-Meunier is seeing.

I don't think that the current interface with the given lifetimes can be made to work safely. On the one hand you want to bind ownership of the string to the Parser instance and on the other you want to pass out references to the content that outlive the Parser.

I don't have a ready solution (Sorry! I am just driving by if you will), so at this point I am just raising awareness. I would not use the crate as-is if I were to rely on the Parser logic in any way.

@IvanUkhov
Copy link
Member Author

IvanUkhov commented Jul 3, 2020

Thanks, @d-e-s-o, for the feedback! Why do you say the memory will be deallocated in read_internal? The ownership of the string is moved into Parser, and the string is deallocated when the parser goes out of scope.

fn read_internal<'l, R>(mut source: R) -> io::Result<Parser<'l>>
where
    R: Read,
{
    let mut content = String::new();
    source.read_to_string(&mut content)?;
    Ok(Parser::new(content))
}

pub struct Parser<'l> {
    content: Cow<'l, str>,
    reader: Reader<'l>,
}

impl<'l> Parser<'l> {
    pub fn new<T>(content: T) -> Self
    where
        T: Into<Cow<'l, str>>,
    {
        let content = content.into();
        let reader = unsafe { ::std::mem::transmute(Reader::new(&*content)) };
        Parser { content, reader }
    }
}

@P-E-Meunier
Copy link

By the way, what I meant is really simple. If I keep a borrow a to the text inside an Event::Text, let's say let a = inner_text;, doing println!("{:?}", a); will print a value a0. If I pull another event and do println!("{:?}", a)', I get a different value a1 != a0`.

@IvanUkhov
Copy link
Member Author

@P-E-Meunier, do you have a working example of this?

@d-e-s-o
Copy link

d-e-s-o commented Jul 3, 2020

Thanks, @d-e-s-o, for the feedback! Why do you say the memory will be deallocated in read_internal? The ownership of the string is moved into Parser, and the string is deallocated when the parser goes out of scope.

Honestly, looking at it again...I made changes in an attempt to understand and fix and I believe was commenting on what I believe the code to look like before... it was late. :-| Yes, you are right about the passing of ownership.

In any case, the larger unsafety issue remains. Consider this patch:

diff --git src/lib.rs src/lib.rs
index 4bdeb8..84be52 100644
--- src/lib.rs
+++ src/lib.rs
@@ -136,6 +136,7 @@ mod tests {
     use crate::parser::{Event, Parser};

     const TEST_PATH: &'static str = "tests/fixtures/benton.svg";
+    const TEST_PATH2: &'static str = "tests/fixtures/benton2.svg";

     #[test]
     fn open() {
@@ -167,4 +168,36 @@ mod tests {

         assert!(parser.next().is_none());
     }
+
+    #[test]
+    fn read2() {
+      let t1: &str = {
+        let mut p1 = crate::read(&mut File::open(self::TEST_PATH).unwrap()).unwrap();
+        let t1 = p1.nth(9).unwrap();
+        match t1 {
+          Event::Text(text) => {
+            println!("{}", text);
+            text
+          },
+          _ => unreachable!(),
+        }
+      };
+
+      let t2: &str = {
+        let mut p2 = crate::read(&mut File::open(self::TEST_PATH2).unwrap()).unwrap();
+
+        let t2 = p2.nth(9).unwrap();
+        match t2 {
+          Event::Text(text) => {
+            println!("{}", text);
+            text
+          },
+          _ => unreachable!(),
+        }
+      };
+
+      println!("{} (should still be: 'Test1')", t1);
+      println!("{} (should still be: 'ZOMG IT CHANGED')", t2);
+    }
+
 }

This new test will do what @P-E-Meunier describes, I believe. I added a text element to the existing test file and created another file with a different text.
You can already see that t1 and t2 exist and seem to be dangling. After all, where is the memory backing them? The parser is gone, so is the string.

Here is the test data:

diff --git tests/fixtures/benton.svg tests/fixtures/benton.svg
index 003c9f..2658f9 100644
--- tests/fixtures/benton.svg
+++ tests/fixtures/benton.svg
@@ -15,4 +15,5 @@
 <path d="M544,50c-67,0-129,63-129,143c0,107,87,191,199,191c81,0,149-60,149-125c0-42-29-73-62-73c-22,0-44,17-44,40
   c0,25,19,43,41,43c11,0,26-5,31-11c16-16,26-3,27,5c1,45-40,93-123,93c-90,0-192-91-192-177c0-81,52-123,99-121c11,2,21,9,5,24
   c-7,4-12,20-12,30c0,22,20,40,45,40s43-21,43-43C621,78,588,50,544,50z"/>
+<text x="10" y="10">Test1</text>^M
 </svg>
diff --git tests/fixtures/benton2.svg tests/fixtures/benton2.svg
new file mode 100644
index 000000..95cc10
--- /dev/null
+++ tests/fixtures/benton2.svg
@@ -0,0 +1,19 @@
+<?xml version="1.0" encoding="utf-8"?>
+<!-- Generator: Adobe Illustrator 18.0.0, SVG Export Plug-In . SVG Version: 6.00 Build 0)  -->
+<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
+<svg version="1.1" id="Layer_1" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" x="0px" y="0px"
+   viewBox="0 0 800 800" enable-background="new 0 0 800 800" xml:space="preserve">
+<path d="M249,752c67,0,129-63,129-143c0-107-87-191-199-191c-81,0-149,60-149,125c0,42,29,73,62,73c22,0,44-17,44-40
+  c0-25-19-43-41-43c-11,0-26,5-31,11c-17,16-26,3-27-5c-2-45,40-93,123-93c90,0,192,91,192,177c0,81-52,123-99,121c-11-2-21-9-5>
+  c7-4,12-20,12-30c0-22-20-40-45-40s-43,21-43,43C172,724,205,752,249,752z"/>
+<path d="M544,752c44,0,77-28,77-59c0-22-18-43-43-43s-45,18-45,40c0,10,5,26,12,30c16,15,7,22-5,24c-47,2-99-40-99-121
+  c0-86,102-177,192-177c83,0,124,48,123,93c-1,8-11,21-27,5c-5-6-20-11-31-11c-22,0-41,18-41,43c0,23,22,40,44,40c33,0,62-31,62>
+  c0-65-68-125-149-125c-112,0-199,84-199,191C415,689,477,752,544,752z"/>
+<path d="M249,50c-44,0-77,28-77,59c0,22,18,43,43,43s45-18,45-40c0-10-5-26-12-30c-16-15-6-22,5-24c47-2,99,40,99,121
+  c0,86-102,177-192,177c-83,0-125-48-123-93c1-8,10-21,27-5c5,6,20,11,31,11c22,0,41-18,41-43c0-23-22-40-44-40c-33,0-62,31-62,>
+  c0,65,68,125,149,125c112,0,199-84,199-191C378,113,316,50,249,50z"/>
+<path d="M544,50c-67,0-129,63-129,143c0,107,87,191,199,191c81,0,149-60,149-125c0-42-29-73-62-73c-22,0-44,17-44,40
+  c0,25,19,43,41,43c11,0,26-5,31-11c16-16,26-3,27,5c1,45-40,93-123,93c-90,0-192-91-192-177c0-81,52-123,99-121c11,2,21,9,5,24
+  c-7,4-12,20-12,30c0,22,20,40,45,40s43-21,43-43C621,78,588,50,544,50z"/>
+<text x="10" y="10">ZOMG IT CHANGED</text>
+</svg>

Now, that still "works", in the sense that the text did not change as he described.

$ cargo test -- read2 --nocapture
   Compiling svg v0.8.0 (/home/deso/local/opt/svg)
    Finished test [unoptimized + debuginfo] target(s) in 0.89s
     Running svg-4dbe958f0f293962

running 1 test
Test1
ZOMG IT CHANGED
Test1 (should still be: 'Test1')
ZOMG IT CHANGED (should still be: 'ZOMG IT CHANGED')
test tests::read2 ... ok

However, I believe we are just lucky. The memory is not being reused.

So let's switch to jemalloc, which manages memory differently and potentially more aggressively:

diff --git Cargo.toml Cargo.toml
index 7fc3e9..3989b1 100644
--- Cargo.toml
+++ Cargo.toml
@@ -20,3 +20,6 @@ repository = "https://github.com/bodoni/svg"
 readme = "README.md"
 categories = ["multimedia::images", "parsing", "rendering::data-formats"]
 keywords = ["vector-graphics"]
+
+[dependencies]
+jemallocator = "0.3.2"
diff --git src/lib.rs src/lib.rs
index 84be52..e7b576 100644
--- src/lib.rs
+++ src/lib.rs
@@ -61,6 +61,11 @@
 //! # }
 //! ```

+use jemallocator::Jemalloc;
+
+#[global_allocator]
+static GLOBAL: Jemalloc = Jemalloc;
+
 use std::fs::File;
 use std::io::{self, Read, Write};
 use std::path::Path;
$ cargo test -- read2 --nocapture
    Finished test [unoptimized + debuginfo] target(s) in 0.01s
     Running svg-e9ef7eefbc050772

running 1 test
Test1
ZOMG IT CHANGED
ZOMG  (should still be: 'Test1')
ZOMG IT CHANGED (should still be: 'ZOMG IT CHANGED')
test tests::read2 ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 22 filtered out

And you can see that t1 got changed behind our back (at least that is the case on my system).

@IvanUkhov
Copy link
Member Author

IvanUkhov commented Jul 3, 2020

Interesting! I got the same output as you. The idea with the text node was that it would have the lifetime of the content passed into Parser:

pub struct Parser<'l> {
    content: Cow<'l, str>,
    reader: Reader<'l>,
}

pub enum Event<'l> {
    Text(&'l str),
}

It should not be legal to use t1 and t2 outside those two blocks. Since the compiler does not complain, 'l is probably inferred to be not what was desired. It is likely to be due to the moved ownership.

It works as expected when passing a reference (“content does not live long enough”).

    #[test]
    fn read2() {
      use std::io::Read;
      use crate::parser::Parser;
      let t1: &str = {
        let mut file = File::open(self::TEST_PATH).unwrap();
        let mut content = String::new();
        file.read_to_string(&mut content).unwrap();
        let mut p1 = Parser::new(&content);
        let t1 = p1.nth(9).unwrap();
        match t1 {
          Event::Text(text) => {
            println!("{}", text);
            text
          },
          _ => unreachable!(),
        }
      };

      let t2: &str = {
        let mut p2 = crate::read(&mut File::open(self::TEST_PATH2).unwrap()).unwrap();

        let t2 = p2.nth(9).unwrap();
        match t2 {
          Event::Text(text) => {
            println!("{}", text);
            text
          },
          _ => unreachable!(),
        }
      };

      println!("{} (should still be: 'Test1')", t1);
      println!("{} (should still be: 'ZOMG IT CHANGED')", t2);
    }

@d-e-s-o
Copy link

d-e-s-o commented Jul 3, 2020

It works as expected when passing a reference (“content does not live long enough”).

That makes sense to me. I believe 'l can be pretty much any lifetime that the compiler may infer from how the parser is used if you pass in an owned object (in my example, it would be pretty much the outer most scope). Because of the unsafe transmute you effectively erase the connection between content and reader, allowing for this wrong lifetime to be used to begin with (i.e., without the transmute no suitable lifetime could be inferred).
This explanation is probably not the best, but perhaps you understand what I am getting at.

Which brings me back to the initial point I made, which is that I don't believe that the current interface with the given lifetime annotations can be made work. Just from a conceptual point of view I don't see how it would be possible. But happy to be proven wrong.

Skynoodle added a commit to Skynoodle/svg that referenced this issue Sep 4, 2020
Per discussion in bodoni#6, the lifetimes and use of unsafe in
Parser's constructor may not be sound. This removes the offending unsafe
code and offers an argument for the unsoundness of the existing approach
and the necessity of leaking to soundly satisfy the current api's
lifetimes.
@Skynoodle
Copy link

Hello,

My understanding of this matches @d-e-s-o's - when providing an owned String to Parser::new, the lifetime 'l can be freely inferred to be as long as the caller desires (as it's only bounded by the Borrowed case). Since the produced Parser needn't outlive 'l, and the true lifetime of the owned string is erased by transmute, the current api seems to allow one to produce invalid references. My 'strawman solution' Skynoodle/svg@e65faa7 demonstrates safely satisfying the api's lifetimes at the deeply unfortunate cost of leaking the owned string supplied to the constructor. It also provides a best-effort argument for why leaking seems to be the only sound option (given the existing api) here that I can see.

@mlwilkerson
Copy link
Contributor

I've been looking at this too, trying understand what uses cases might motivate this API. (And trying to improve my understanding of Rust's ownership, borrowing, and lifetimes along the way. So this is more of an exploration than a conclusion.)

I wonder if it would be sufficient if Parser::new just took a &'l str. If so, then the caller of Parser::new would determine its lifetime ('l), and that &'l str could be passed down to to the Reader.

In the test cases that exercise the Parser::new, I don't see any that attempt to invoke Parser::new in any other way other than passing in a &str. They all look like this:

test!("<foo>", "foo");

which, after the macro does its thing, just becomes:

let mut parser = Parser::new("foo");

So I tried an experiment: just make that change and see what happens. Here's a commit to demonstrate that.

Now run cargo test and see the compilation failure at src/lib.rs:120:

error[E0308]: mismatched types
   --> src/lib.rs:120:20
    |
120 |     Ok(Parser::new(content))
    |                    ^^^^^^^
    |                    |
    |                    expected `&str`, found struct `std::string::String`
    |                    help: consider borrowing here: `&content`

Ok, that makes sense: read_internal would need to be updated to pass &str. But if we do that naively, the borrow checker complains:

error[E0515]: cannot return value referencing local variable `content`
   --> src/lib.rs:120:5
    |
120 |     Ok(Parser::new(&content))
    |     ^^^^^^^^^^^^^^^--------^^
    |     |              |
    |     |              `content` is borrowed here
    |     returns a value referencing data owned by the current function

Notice we're doing Parser::new(&content) in that second one, instead of Parser::new(content). So instead of passing ownership of a String to Parser::new, we're borrowing from that String as a &str, and passing that borrowed thing to Parser::new.

So the borrow checker complains because we're allocating memory in read_internal, and making content its owner. But content will go out of scope at the end of read_internal. So the borrow checker won't let us pass along a borrowed reference to memory that will go out of scope at the end of read_internal.

That's how I understand the point raised by @d-e-s-o:

The memory of the string allocated in read_internal will get deallocated at the end of the function and all your parsing logic is operating on dead memory.

My intuition seems similar to that: we can't have our cake and eat it too here. Because content will go out of scope, and the memory it owns will be deallocated at the end of read_internal, it's not safe to pass a borrow to its memory into Parser::new, and that's why the borrow checker would not let us do it if we were to make that borrow explicit, as I attempted.

So what if we instead pass ownership of that memory to Parser::new, and by returning a Parser from read_internal, we also pass ownership of the Parser, which carries with it the ownership of that content memory allocated by read_internal? That would seem to have the net effect of making the client of Parser--which is the client of the read (which invokes read_internal) and which determines the lifetime 'l in read--the owner of this memory with the appropriate lifetime.

That's how I understand @IvanUkhov's point: that the ownership of the memory for content in read_internal, being a String, is being passed to Parser::new. As a result, that memory would not be deallocated at the end of read_internal, as far as read_internal is concerned.

Back to @d-e-s-o's point, though, my intuition is that it seems like what we're fundamentally trying to achieve here is a borrow with a lifetime of 'l. It's that the client of the Parser should own some memory, that it lends out to the Parser to do its parsing, but for which it retains the right to reference directly, until that client is done with it and allows it to go out of scope.

So it seems that the key question is: is that what the unsafe transmute is accomplishing in Parser::new? Is it providing this continuity of ownership and lifetime, so that the ownership of that content memory is received from read_internal, and re-lifetimed to live as long as the lifetime 'l that the client originally specified to read?

This is how I understand the concerns about whether 'l can be inferred, such as from @d-e-s-o :

Because of the unsafe transmute you effectively erase the connection between content and reader, allowing for this wrong lifetime to be used to begin with (i.e., without the transmute no suitable lifetime could be inferred).

The fact that the lifetime couldn't be inferred in the absence of the transmute seems to be evidence that the use of transmute is not actually ensuring the continuity of the lifetime information. Instead, it's doing the opposite: "erasing the connection". Once that connection is erased by the transmute, we're in undefined territory, which is the very territory that Rust's safety guarantees are there to ensure we avoid. And as I understand it, @d-e-s-o 's demonstration with jemalloc proves this.

Reading the documentation on transmute, one of the things it's used for is:

Extending a lifetime, or shortening an invariant lifetime. This is advanced, very unsafe Rust!

Is that what this use of transmute is trying to do? Is it trying to link up the lifetime 'l that the client would have specified in the call to read with the same-named-but-different lifetime 'l in Parser::new?

If so, I think the reasoning and the demonstration above show that it's not having that intended effect. Instead, it's creating an undefined state.

So, circling back to my initial question: what use cases motivate this particular API?

If those were known, then we could consider alternative APIs that could be implemented safely.

@IvanUkhov
Copy link
Member Author

@mlwilkerson, thank you for presenting your thought process! It matches well my own understanding of the situation. Regarding the motivation behind the current API of Parser, I wanted to create a structure that would be self-sufficient and, if needed, could freely be passed around. I believed I had achieved that by using a bit of unsafe code. Before @d-e-s-o brought attention to this, I was not aware that what was actually happening was the erasal of the lifetime. If you look at it, 'l is everywhere, and it gives the impression that the original lifetime is preserved, but, as Daniel demonstrated, this is not.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
5 participants