Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement LocatedSpan::get_line(). #66

Merged
merged 10 commits into from
Oct 18, 2020
Merged

Implement LocatedSpan::get_line(). #66

merged 10 commits into from
Oct 18, 2020

Conversation

kaj
Copy link
Contributor

@kaj kaj commented Oct 16, 2020

Add a function to get the full input line containing the (start point of the) LocatedSpan.

As suggested in #53.

Add a function to get the full input line containing the (start point
of the) LocatedSpan.

As suggested in fflorent#53.
Copy link
Collaborator

@progval progval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Could you add some tests for this, checking edge cases? (first/last line, beginning/end of line, etc.)

src/lib.rs Outdated
Comment on lines 305 to 315
let self_bytes = self.fragment.as_bytes();
let self_ptr = self_bytes.as_ptr();
let offset = self.get_column() - 1;
let the_line = unsafe {
assert!(
offset <= isize::max_value() as usize,
"offset is too big"
);
let line_start_ptr = self_ptr.offset(-(offset as isize));
slice::from_raw_parts(line_start_ptr, offset + self_bytes.len())
};
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this code could/should be shared with get_columns_and_bytes_before. Could you add a private (unsafe) function for this?

@kaj
Copy link
Contributor Author

kaj commented Oct 16, 2020

Yes, I'll write some tests and then I'll see if the unsafe code can be unified.

@progval
Copy link
Collaborator

progval commented Oct 16, 2020

Actually, I'm not sure, but isn't the_line equal to get_columns_and_bytes_before().1?

@kaj
Copy link
Contributor Author

kaj commented Oct 16, 2020

Actually, I'm not sure, but isn't the_line equal to get_columns_and_bytes_before().1?

Not quite, as the_line may continue after the thing that is get_columns_and_bytes_before().1. If you look at the (newly added) line_for_non_ascii_chars test, i think get_columns_and_bytes_before().1 would be "Förra raden var ".

Copy link
Collaborator

@progval progval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does get_line never return None?

src/tests.rs Outdated Show resolved Hide resolved
@kaj
Copy link
Contributor Author

kaj commented Oct 16, 2020

Does get_line never return None?

Actually, as I was writing tests, I found myself asking me the same question. I don't think it does, the option is accidental. I'll get rid of it.

@progval
Copy link
Collaborator

progval commented Oct 16, 2020

Not quite, as the_line may continue after the thing that is get_columns_and_bytes_before().1. If you look at the (newly added) line_for_non_ascii_chars test, i think get_columns_and_bytes_before().1 would be "Förra raden var ".

Oh yeah, indeed

@kaj
Copy link
Contributor Author

kaj commented Oct 16, 2020

It's getting late in my time zone, so I'll go to sleep now. I'll look at refactoring and any further questions later in the weekend.

@progval
Copy link
Collaborator

progval commented Oct 16, 2020

looks good, just need to deduplicate that code now

@kaj
Copy link
Contributor Author

kaj commented Oct 17, 2020

I've been afk most of the day, it's time for bed in my tz, and I'm not entirely sober, but I've attempted a refactoring of the two similar unsafe blocks to one. Can look more at it in ten hours or so ...

@kaj kaj requested a review from progval October 18, 2020 08:18
Copy link
Collaborator

@progval progval left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't worry, we're in no hurry.

I just realized that this code doesn't work if LocatedSpans are sliced with a right bound.
For example, this fails:

#[test]
fn line_of_word_in_middle() {
    let data = "One line of text\
         \nFollowed by a second\
         \nand a third\n";
    assert_eq!(
        StrSpan::new(data)
            .slice(data.find('\n').unwrap()..data.find('\n').unwrap()+5)
            .get_line(),
        "Followed by a second".as_bytes(),
    );
}

because self.get_unoffsetted_slice() only returns the bytes before the LocatedSpan and the bytes in the LocatedSpan itself, but none of the bytes after, even if they are on the same line.

Unfortunately, I don't see a way out of this without including the size of the original &[u8] in every LocatedSpan; but it would increase the size of LocatedSpan<_, ()> from 20 to 28 bytes (probably 24 to 32 including the padding).
But this seems costly just to provide a convenience function that can already be implemented by users themselves with safe code.

What do you think?

@@ -257,17 +257,24 @@ impl<T: AsBytes, X> LocatedSpan<T, X> {
&self.fragment
}

fn get_columns_and_bytes_before(&self) -> (usize, &[u8]) {
fn get_unoffsetted_slice(&self) -> &[u8] {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the name doesn't make it clear what this does, but I can't think of a better one. :/
Also some comment to explain what it does would be good

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I considered get_original_size, as that is pretty much the intent. But I didn't, since it has no way of reconstructing the original length. So unoffsetted is what it actually does, it undos the offset.

As for the larger question about a possibly missing trailer of get_line(); Yes, I was a bit worried about that myself, but decided that it's not a big problem in my main use-case of reporting parse errors. In cases I can think of that involves marking an interval (and not just a position) of a line, I think I would have two LocatedSpans to combine, where each of them would (probably) be "the rest of input from a starting point".

But I agree that this should be explained somehow, both in i a comment at get_unoffsetted_slice and in the docstring of get_line. I'll try to write something.

@kaj
Copy link
Contributor Author

kaj commented Oct 18, 2020

On the other hand, the "original" adress and length won't change while the parser creates all those subslices that are LocatedSpans. So maybe it would make sense for a LocatedSpan to be a slice and a reference to the original slice (since it already borrows the actual data from the original slice, I think lifetimes should not be a problem). The line and offset could be calculated on demand, instead of the other way around. That way, a LocatedSpan should fit in 24 bytes (plus 8*sizeof(extra)/8). Which should be eight bytes (line: u32 + alignment) less than the current size.

But that is a quite big change ...

@progval
Copy link
Collaborator

progval commented Oct 18, 2020

Well that's what I was thinking of, minus removing the line (which I don't think we should, because it makes get_line run in linear time instead)

@progval
Copy link
Collaborator

progval commented Oct 18, 2020

What about adding a function (or method of LocatedSpan) that takes the original string as argument, to returns the line?

That way we don't make the struct larger and we still provide that feature (and as a bonus, it won't rely on unsafe code)

@kaj
Copy link
Contributor Author

kaj commented Oct 18, 2020

What about adding a function (or method of LocatedSpan) that takes the original string as argument, to returns the line?

That's what I do in rsass before starting to use LocatedSpan at all (in kaj/rsass#62). For me, the whole point of using LocatedSpan is that I can get the line number, column, and the line itself from inside a parser function, where I don't have the original string.

@progval
Copy link
Collaborator

progval commented Oct 18, 2020

Hmm yeah, good point. You could put a ref to the original string in the extra, but it would bloat LocatedSpan with duplicate data in your case.

@kaj
Copy link
Contributor Author

kaj commented Oct 18, 2020

For now, I think the docstring on get_line() is enough to handle the case where the LocatedSpan ends before the line.

Replacing the self.offset with a reference to the original data may be a good idea, but I think the pros and cons of that should be considered independent of this PR.

@progval
Copy link
Collaborator

progval commented Oct 18, 2020

Yeah, you're right. Could you just rename get_line to get_line_beginning, so we can introduce get_line later without it being a breaking change?

@kaj
Copy link
Contributor Author

kaj commented Oct 18, 2020

Yeah, you're right. Could you just rename get_line to get_line_beginning, so we can introduce get_line later without it being a breaking change?

Ok, done. Should I also squash the commits on this branch to one?

This test documents how `get_line_beginning()` differs from a
hypotetical `get_line()` method.
@progval
Copy link
Collaborator

progval commented Oct 18, 2020

I can squash it on my end :)

@progval progval merged commit 76f9e90 into fflorent:master Oct 18, 2020
@kaj kaj deleted the get_line branch October 18, 2020 17:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants