New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port 'string-lessp' to Rust #217

Merged
merged 15 commits into from Jul 11, 2017

Conversation

Projects
None yet
4 participants
@DavidDeSimone
Collaborator

DavidDeSimone commented Jun 25, 2017

This is a WIP PR for porting string-lessp to Rust. As part of this work, it seemed useful to add an iterator for a LispStringRef, in order to replace the macro FETCH_STRING_CHAR_ADVANCE

Work left to do:

  • Port basic string-lessp functionality to Rust
  • Remove string-lessp implmentation in C.
  • Add code point iterator for LispStringRef
  • Add safe API for working with LispSymbol's, similar to LispStringRef/as_string/as_string_or_error
    • Fix issue where the value coming out of the LispSymbolRef's mem::transmute seem to be nonsense
  • Verify that the alignment/size/padding of Rust's Lisp_Symbol matches the alignment/size/padding of C's Lisp_Symbol
  • Basic profiling to make sure that these new concepts do not add significant cost over their C equivalents. (Based on my findings, performance for this function is equivalent to it's C counterpart)

DavidDeSimone added some commits Jun 25, 2017

Porting 'string-lessp' to Rust. Adding a Codepoint iterator for LispS…
…tringRef. This commit is a work is part of a work in progress.
Merge branch 'master' into string-lessp
Conflicts:
	rust_src/src/strings.rs
Fixing range bug in string-lessp caused by adding the byte offset, in…
…stead of incrementing the char counter by 1 per iteration.
@@ -155,6 +155,10 @@ impl LispObject {
pub fn is_symbol(self) -> bool {
self.get_type() == LispType::Lisp_Symbol
}

pub fn symbol_name(&self) -> LispObject {

This comment has been minimized.

@birkenfeld

birkenfeld Jun 25, 2017

Collaborator

This is not safe - it'll treat the LispObject as a symbol regardless of its type.

You'll have to introduce a LispSymbolRef newtype and respective as_symbol and as_symbol_or_error methods on LispObject to convert safely.

This comment has been minimized.

@DavidDeSimone

DavidDeSimone Jun 26, 2017

Collaborator

Good callout, I will do this next.

This comment has been minimized.

@birkenfeld

birkenfeld Jun 26, 2017

Collaborator

Thanks!

@@ -106,6 +106,22 @@ impl LispStringRef {
}
}

// Substitue for FETCH_STRING_CHAR_ADVANCE
impl Iterator for LispStringRef {

This comment has been minimized.

@birkenfeld

birkenfeld Jun 25, 2017

Collaborator

Great idea to use an iterator here!

i1 += 1;

if codept1 != codept2 {
if codept1 < codept2 {

This comment has been minimized.

@birkenfeld

birkenfeld Jun 25, 2017

Collaborator

This is just return LispObject::from_bool(codept1 < codept2)

}
}

if i1 < lispstr2.len_bytes() {

This comment has been minimized.

@birkenfeld

birkenfeld Jun 25, 2017

Collaborator

same here (without return)

let end = cmp::min(lispstr1.len_bytes(), lispstr2.len_bytes());
let mut i1 = 0;
while i1 < end {
// Unwraps should be fine here, due to our manual tracking of

This comment has been minimized.

@birkenfeld

birkenfeld Jun 25, 2017

Collaborator

hmm, could this be done with a zip? I don't think we need i1 afterwards since if we finish the loop it will always be = end.

This comment has been minimized.

@DavidDeSimone

DavidDeSimone Jun 26, 2017

Collaborator

Also another good call out, I've changed the implementation to use zip

DavidDeSimone added some commits Jun 26, 2017

Adding LispStringRefIterator, an iterator used for iterating over the…
… codepoints on a LispString. Declaring string_lessp in Rust as a #[lisp_fn]. Removing c definition of string-lessp.
Fixing issue where lisp strings in string-lessp were using len_bytes …
…instead of len_chars, causing a potential iteration panic. Adding iteration zip implementation for cleaner code.
codepoint = cp;
self.cur += advance;
} else {
codepoint = ref_slice[self.cur] as u32;

This comment has been minimized.

@birkenfeld

birkenfeld Jun 26, 2017

Collaborator

.. as Codepoint?

self.cur += 1;
}

Some((codepoint, self.cur))

This comment has been minimized.

@birkenfeld

birkenfeld Jun 26, 2017

Collaborator

Hm, this returns the index of the next codepoint. Is this intended?

}

impl LispStringRef {
pub fn iter(&self) -> LispStringRefIterator {

This comment has been minimized.

@birkenfeld

birkenfeld Jun 26, 2017

Collaborator

Depending on need by other APIs, I think two iterators a la chars() and char_indices() would make sense.

This comment has been minimized.

@DavidDeSimone

DavidDeSimone Jul 5, 2017

Collaborator

This sounds reasonable to me, I like the idea of a chars() and char_indicies().

}
}

LispObject::from_bool(count < lispstr2.len_chars())

This comment has been minimized.

@birkenfeld

birkenfeld Jun 26, 2017

Collaborator

I think this is just lispstr1.len_chars() < lispstr2.len_chars(), and keeping count is unnecessary.

(If lispstr2 is shorter or equally long, this will test len2 < len2 which is false.)

@@ -1,6 +1,6 @@
//! Functions operating on strings.

use std::ptr;
use std::{ptr, cmp};

This comment has been minimized.

@birkenfeld

birkenfeld Jun 26, 2017

Collaborator

unused import?

@@ -155,6 +155,10 @@ impl LispObject {
pub fn is_symbol(self) -> bool {
self.get_type() == LispType::Lisp_Symbol
}

pub fn symbol_name(&self) -> LispObject {

This comment has been minimized.

@birkenfeld

birkenfeld Jun 26, 2017

Collaborator

Thanks!

David DeSimone added some commits Jun 28, 2017

David DeSimone
Removing unused cmp import. Updating LispStringRefIterator to return …
…the current codepoint location, not the next offset location. Updating string-lessp logic to not need an explicit 'count' variable
David DeSimone
Laying the groundwork for representing a lisp symbol as a rust struct…
…. This will allow us to create a LispSymbolRef like we have a LispStringRef. This will also allow a similar API for working with symbols as we have for strings.
fn get_string_or_symbol(mut string: LispObject) -> multibyte::LispStringRef {
if string.is_symbol() {
string = string.symbol_name()
string = string.as_symbol_or_error().symbol_name()

This comment has been minimized.

@birkenfeld

birkenfeld Jun 29, 2017

Collaborator

This still duplicates the symbol check. I'd make it

match string.as_symbol() {
    Some(sym) => sym.symbol_name().as_string().expect("symbol name not a string?")
    None => string.as_string_or_error()
}

If you like, you can also make string_equal use this function, it currently uses SYMBOL_NAME directly.

Adding C representation of Lisp_Symbol. This includes making an untag…
…ged Rust union for a SymbolUnion. Updating the implementation of 'get_string_or_symbol' to avoid an additional symbol check.
#[inline]
pub fn as_symbol(&self) -> Option<LispSymbolRef> {
if self.is_symbol() {
Some(LispSymbolRef::new(unsafe { mem::transmute(self.get_untaggedptr()) }))

This comment has been minimized.

@DavidDeSimone

DavidDeSimone Jul 4, 2017

Collaborator

This looks to be incorrect. The code in lisp.h for XSYMBOL looks like

INLINE struct Lisp_Symbol *
(XSYMBOL) (Lisp_Object a)
{
#if USE_LSB_TAG
  return lisp_h_XSYMBOL (a);
#else
  eassert (SYMBOLP (a));
  intptr_t i = (intptr_t) XUNTAG (a, Lisp_Symbol);
  void *p = (char *) lispsym + i;
  return p;
#endif
}

and it looks like we will have to emulate this logic for getting the address to mem::transmute.

// @TODO check the value of name post and pre transmutation, it seems that name is surviving but
// may not be the correct value
#[repr(C)]
pub struct Lisp_Symbol {

This comment has been minimized.

@shanavas786

shanavas786 Jul 4, 2017

Collaborator

gcmarkbit, redirect ...etc are missing ?

This comment has been minimized.

@DavidDeSimone

DavidDeSimone Jul 5, 2017

Collaborator

So this is my understanding of the situation re: the mentioned variables. If anyone spots any incorrect information please do not hesitate to correct me:

Lisp_Symbol has the following definition in lisp.h (I've stripped out the comments for clarity):

struct Lisp_Symbol
{
  bool_bf gcmarkbit : 1;
  ENUM_BF (symbol_redirect) redirect : 3;
  ENUM_BF (symbol_trapped_write) trapped_write : 2;
  unsigned interned : 2;
  bool_bf declared_special : 1;
  bool_bf pinned : 1;
  Lisp_Object name;
  union {
    Lisp_Object value;
    struct Lisp_Symbol *alias;
    struct Lisp_Buffer_Local_Value *blv;
    union Lisp_Fwd *fwd;
  } val;
  Lisp_Object function;
  Lisp_Object plist;
  struct Lisp_Symbol *next;

This struct is using the C notation for bit fields. According to my research, Rust does not (rust-lang/rfcs#314, https://users.rust-lang.org/t/c-structs-with-bit-fields-and-ffi/1429) support setting up an equivalent for C bitfieldsin it's #[repr(C)] directive.

On my system (64-bit Ubuntu 16.04), no matter what the typedef of bf_bool or ENUM_BF, the bit field section of the struct takes 4 bytes. Due to alignment, the struct is padded, and offsetof(struct Lisp_Symbol, name) reports 8. My initial solution to representing this struct in Rust was to represent the bit field block as a u32, taking up the 4 bytes I mentioned earlier. I have not fully convinced myself that this is 100% the correct and portable thing to do.

Even if we can safely represent this block with a u32 on every system we support, it seems that there are 'gotcha's with accessing these bit fields via the bit wise operators (due to endianness, and compiler differences w.r.t bit field implementation.)

Overall I am not sure the best way to handle the interop for C structs that use bit fields that we need to access in Rust. It seems that one option is to simply pad the Rust struct as best we can, and if we need to access these fields, we will need to maintain C bindings that access them for us.

DavidDeSimone added some commits Jul 5, 2017

Fixing issue where 'string-lessp' would SEGFAULT when given symbol ar…
…guments. This was due to an improper call to mem::transmute in the Rust layer.

Symbols are a special case in which you cannot just call mem::transmute(self.get_untaggedptr()). Instead, one must offset the pointer value based on the memory address of an emacs global 'lispsym'.
Adding "chars()", and "char_indicies()" implementation for LispString…
…Ref. This will allow a user to loop over the codepoints of a LispStringRef, or the indicies of the codepoints of a LispStringRef.
LispStringRefCharIterator(self.iter())
}

pub fn char_indices(&self) -> LispStringRefIndexIterator {

This comment has been minimized.

@birkenfeld

birkenfeld Jul 6, 2017

Collaborator

Sorry to nitpick again, but actually char_indices returns the (index, char) tuple - you just need to rename iter.

This comment has been minimized.

@DavidDeSimone

DavidDeSimone Jul 6, 2017

Collaborator

No need to apologize, this should be fixed now.

@DavidDeSimone DavidDeSimone changed the title from [WIP] Port 'string-lessp' to Rust to Port 'string-lessp' to Rust Jul 6, 2017

@DavidDeSimone

This comment has been minimized.

Collaborator

DavidDeSimone commented Jul 6, 2017

Overall, I feel pretty confident with the PR now. If any maintainers have additional feedback, I will be happy to address their concerns, otherwise I think this is ready to merge.

@Wilfred

This looks great to me: code looks clean :). Other than one missing docstring, I think it's good to merge.


pub struct LispStringRefCharIterator<'a>(LispStringRefIterator<'a>);

// Substitue for FETCH_STRING_CHAR_ADVANCE

This comment has been minimized.

@Wilfred

Wilfred Jul 9, 2017

Owner

*Substitute

@@ -202,6 +202,21 @@ fn string_to_unibyte(string: LispObject) -> LispObject {
}
}

#[lisp_fn]

This comment has been minimized.

@Wilfred

Wilfred Jul 9, 2017

Owner

Please put the docstring here.

Adding docstring to Rust impl for string-lessp. Fixing spelling error…
… in comment about LispStringRefIterator

@Wilfred Wilfred merged commit 008064a into Wilfred:master Jul 11, 2017

1 check failed

continuous-integration/travis-ci/pr The Travis CI build failed
Details
@Wilfred

This comment has been minimized.

Owner

Wilfred commented Jul 11, 2017

Marvellous :)

@DavidDeSimone DavidDeSimone deleted the DavidDeSimone:string-lessp branch Jul 11, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment