Bug in len() implementation / Implementing the length() primitive #105
Comments
Note: adding a length field to Rep is not necessary; the length should just be calculated on the fly.
Vectors in R today store a bit of metadata as well. I don't remember exactly what they track, but a number of algorithms could be made faster if we always know whether a vector has missingness, is sorted, has a known length - and I'm sure many more. Although I think it will happen eventually, I've been trying to avoid storing redundant metadata for now. I think we have to be really thoughtful about where it goes - whether it is specific to an alt-rep, should be universal to all vectors, or all objects. So far I've just opted for the minimum data and haven't worried too much about the performance costs of re-calculating things.
I'd recommend trying to store minimal state in the object. If you can get the behavior you're looking for without more data in the struct, then that simplicity is preferred for now. If it's unavoidable, then it's a pretty minimal change, so go for it.
This will be inevitable. My thinking for now is that the laziness is just hidden in the background. As soon as something needs a materialized vector it materializes. In this case, if an iterator doesn't have a conclusive size hint, then the vector is materialized and the length is returned.
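To make the "conclusive size hint" idea concrete, here is a minimal sketch using the standard Iterator::size_hint. The helper names conclusive_len and len_or_materialize are hypothetical and not part of the project; "materializing" is approximated by collecting into a Vec.

```rust
/// Returns the exact length if the iterator's size hint is conclusive,
/// i.e. the lower and upper bounds agree. (Hypothetical helper.)
fn conclusive_len<I: Iterator>(iter: &I) -> Option<usize> {
    match iter.size_hint() {
        (lower, Some(upper)) if lower == upper => Some(lower),
        _ => None,
    }
}

/// Uses the hint when it is conclusive; otherwise materializes the
/// iterator (here: collects it into a Vec) and counts the elements.
fn len_or_materialize<I: Iterator>(iter: I) -> usize {
    match conclusive_len(&iter) {
        Some(n) => n,
        None => iter.collect::<Vec<_>>().len(),
    }
}
```

A plain range such as 1..=10 reports matching bounds, so no materialization is needed, whereas a filtered range only reports an upper bound and would fall back to materializing.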
I am stumbling upon another problem here: […] Therefore, either all those methods have to be rewritten, and trait implementations such as of Display discarded, or the in-place materialization of vector representations uses unsafe Rust. Maybe there are also other options which I am not aware of. My gut feeling tells me that this might be a situation for unsafe Rust (because lazy representations and materialized representations should really behave identically when accessed through their public API), but as I have very little experience in the language I am not sure...
Yeah - I imagine things might need to change. This is good, though. This is exactly how the rest of the code has developed so far. We've identified a place where the current model of the problem doesn't work, and we've found a good, relatively minimal interface to test a redesign. In this case, the fundamental behavior is "how do we materialize a vector?". This effectively means instantiating a new […] I'm not proficient enough to know exactly what the solution will look like at the start, so I'll just speculate and maybe some ideas spur some new directions:
Yeah... that's certainly possible! I wouldn't see it so much as a failure of the interface, just that it was originally written without the full feature set in mind. As our understanding of the problem grows, some implementations will have to adapt as well. In most cases when I initially feel overwhelmed with a rewrite, I end up pleasantly surprised with how little has to change. Most likely the complexity will move into
I don't think this will require unsafe. It might require that we re-model a lot of the language, but fundamentally what we're trying to do is not so esoteric or magical.
Thanks for your input! Maybe one solution would be to introduce a new type:

pub struct VarRep<T>(RefCell<Rep<T>>);

This type can then be used as the data in the Vector enum:

#[derive(Debug, Clone, PartialEq)]
pub enum Vector {
Double(VarRep<Double>),
Integer(VarRep<Integer>),
Logical(VarRep<Logical>),
Character(VarRep<Character>),
}

For example, the […] To achieve this, we need to implement all the methods that are implemented for […] The most important method below is the […]

impl<T> VarRep<T>
where
    T: AtomicMode + Clone + Default,
{
    // the underlying Rep<T> should not be exposed through the public API
    fn borrow(&self) -> Ref<Rep<T>> {
        self.0.borrow()
    }

    pub fn new() -> Self {
        Rep::new().into()
    }

    fn materialize_inplace(&self) -> &Self {
        // Materialize first so the shared borrow is released before the
        // RefCell is mutated; replace() takes the new Rep<T> directly.
        let materialized = self.borrow().materialize();
        self.0.replace(materialized);
        self
    }

    pub fn len(&self) -> usize {
        if self.borrow().len_requires_materialization() {
            self.materialize_inplace();
        }
        self.borrow().len()
    }

    pub fn get(&self, index: usize) -> Option<Self> {
        self.borrow().get(index).map(|x| x.into())
    }
    // other methods
}
impl<T> From<Rep<T>> for VarRep<T>
where
    T: AtomicMode + Clone + Default,
{
    fn from(rep: Rep<T>) -> Self {
        VarRep(RefCell::new(rep))
    }
}
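For anyone who wants to try the interior-mutability idea in isolation, here is a self-contained sketch. It assumes a toy, non-generic Rep with only two variants (Values and Masked), which is a stand-in for illustration and not the project's actual Rep<T>. It shows that a RefCell lets len() take &self and still swap the lazy representation for the materialized one without unsafe, as long as the shared borrow ends before replace() is called.

```rust
use std::cell::RefCell;

/// Toy stand-in for Rep<T>: concrete values, or values behind a lazy mask.
enum Rep {
    Values(Vec<i32>),
    Masked { values: Vec<i32>, mask: Vec<bool> },
}

impl Rep {
    /// Builds the concrete representation; the mask is recycled over the values.
    fn materialize(&self) -> Rep {
        match self {
            Rep::Values(v) => Rep::Values(v.clone()),
            Rep::Masked { values, mask } => Rep::Values(
                values
                    .iter()
                    .zip(mask.iter().cycle())
                    .filter(|(_, keep)| **keep)
                    .map(|(v, _)| *v)
                    .collect(),
            ),
        }
    }
}

struct VarRep(RefCell<Rep>);

impl VarRep {
    fn materialize_inplace(&self) {
        // The shared borrow ends with this statement, so replace() cannot panic.
        let materialized = self.0.borrow().materialize();
        self.0.replace(materialized);
    }

    fn is_materialized(&self) -> bool {
        matches!(*self.0.borrow(), Rep::Values(_))
    }

    /// Takes &self, yet may swap the lazy representation for the concrete one.
    fn len(&self) -> usize {
        if !self.is_materialized() {
            self.materialize_inplace();
        }
        let n = match &*self.0.borrow() {
            Rep::Values(v) => v.len(),
            Rep::Masked { .. } => unreachable!("materialized above"),
        };
        n
    }
}
```

Calling len() on a masked VarRep materializes it once; later calls read the stored values directly. The trade-off is that borrow rules are checked at runtime, so holding a Ref across a call that materializes would panic.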
Testing this, I think just operating on lists is left before this is in really good shape.
The len() implementation of the Rep<T> enum is not correct in all cases, I believe:

- Mask subset: the length of the resulting vector is not the length of the last subset but the sum of the recycled logical vector.
- Indices subset: taking the minimum over the "actual vector" and the elements in the last subset does not work. Even if we assume that all indices are within bounds, it still fails in cases such as (1:2)[c(1, 1, 1)].

Context: I am just trying to see whether I can implement the length() primitive.

My current idea would be to add an Option<usize> field to the Subset variant of Rep<T> here. Then, whenever a new subset is added to Subsets by calling push(), we check whether the length of the new vector is known. If it is, we set the field to Option::Some(length), otherwise to None. The len() method of Rep<T> then simply reads the field.

If we define len() like this for Rep<T>, we can then define len() on Vector (R/src/object/vector/core.rs, line 199 in 2ef9780) in terms of len() on the Rep<T>. If Some(usize) is returned, we know the length. Otherwise, if None is returned, we could modify the vector in place, i.e. call materialize on the subset and replace the existing subset representation of the vector with the materialized one (so we avoid materializing the vector more than once).

Alternatively, we can also generalize this to use a size hint for lower and upper bounds, as you suggested in this issue: #98
What I don't like about this solution is that it is very tailored to the length. However, if the lazy vector representation is a core feature of the language, I expect that there will be more cases like this. With "like this" I mean that functions can in some cases be calculated on the lazy representation, whereas they otherwise might need to materialize the vector.
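The length rules described above for Mask and Indices subsets, together with the Option<usize> idea, can be sketched roughly as follows. The Subset enum and both functions are hypothetical simplifications, not the project's types. The sketch assumes zero-based, in-bounds indices and a mask that is no longer than the vector it is applied to.

```rust
/// Hypothetical, simplified subset kinds (not the project's actual types).
enum Subset {
    /// Zero-based positions; repeats are allowed.
    Indices(Vec<usize>),
    /// Logical mask, recycled to the length of the vector it is applied to.
    Mask(Vec<bool>),
}

/// Length of the result of applying `subset` to a vector of length `n`.
fn subset_len(n: usize, subset: &Subset) -> usize {
    match subset {
        // One output element per index, independent of `n`.
        Subset::Indices(idx) => idx.len(),
        // Number of `true` entries after recycling the mask to length `n`.
        Subset::Mask(mask) => mask.iter().cycle().take(n).filter(|keep| **keep).count(),
    }
}

/// Length after a chain of subsets; None means "unknown without materializing".
fn chained_len(base: Option<usize>, subsets: &[Subset]) -> Option<usize> {
    subsets.iter().fold(base, |len, s| match (len, s) {
        // An index subset fixes the length even if the input length is unknown.
        (_, Subset::Indices(idx)) => Some(idx.len()),
        (Some(n), Subset::Mask(_)) => Some(subset_len(n, s)),
        (None, Subset::Mask(_)) => None,
    })
}
```

With this, the (1:2)[c(1, 1, 1)] case corresponds to subset_len(2, &Subset::Indices(vec![0, 0, 0])), which gives 3 rather than the minimum of 2 and 3. A None result from chained_len marks the case where the vector would have to be materialized to learn its length.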