Can the small factors levels limits be increased from <128? #4

Open
xiaodaigh opened this issue Jan 28, 2018 · 1 comment

@xiaodaigh

While looking through lib/factor/factor_v7.cpp, I saw code like if (*nrOfLevels < 128). The accompanying comment says:

// use 1 byte per int (Na encoding takes 1 bit)

That seems to "waste" part of the byte: once one bit is reserved for NA, only the other 7 bits are left for levels, although a byte can technically represent up to 256 distinct values (including NA and NaN).

Without much background, I assume this has to do with how R encodes the values, so that they are always stored as int instead of unsigned int. I don't know whether it would be too expensive in terms of performance to relax the limit to 256 by converting to unsigned int. I know Julia supports unsigned integers with its UInt8 type.

@MarcusKlik
Collaborator

MarcusKlik commented Jan 30, 2018

Hi @xiaodaigh, thanks for your question. In an R factor, a missing value is encoded as an NA in the underlying integer vector:

# some factor
x <- factor(sample(LETTERS, 10), levels = LETTERS)

# set factor value to NA
x[5] <- NA

# underlying value is set to NA
as.integer(x)
#>  [1] 23 22 16 18 NA  5  8  4  2 26

This could probably have been done better in R, for example by encoding NA as the value 0, but with the current implementation a 0 code leads to an error:

# create factor manually
y <- c(1L, 2L, 3L, 0L, 4L, 5L)
attr(y, "levels") <- LETTERS
attr(y, "class") <- "factor"

print(y)
#> Error in as.character.factor(x): malformed factor

For performance reasons, fst takes the highest bit (bit 32, the sign bit) of each factor value, which is the bit that is set in R's integer NA, and combines it with the 7 lowest bits to get a single byte. Those 7 bits can only hold fewer than 128 levels. I could also re-code the value 0 as NA to free up the whole byte, but that would require more processing per value and would reduce the speed of the filter...
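As a rough illustration of the trade-off, here is a minimal C++ sketch of both encodings. It is not the actual fst source: the helper names (`pack_level`, `unpack_level`, `pack_level_zero_na`, `unpack_level_zero_na`) and the constant `kNaInteger` are made up for this example. The only assumptions are that R represents an integer NA as `INT_MIN` (only the sign bit set) and that factor codes are 32-bit integers.

```cpp
#include <cassert>
#include <climits>
#include <cstdint>

// R represents NA_integer_ as INT_MIN, i.e. only the sign bit is set.
constexpr int32_t kNaInteger = INT_MIN;

// Hypothetical helper: sign-bit packing as described above.
// Bit 7 of the byte carries the NA flag, bits 0-6 carry the level code.
// Pure shift/mask work, no comparison per value.
inline uint8_t pack_level(int32_t code) {
    const uint32_t bits = static_cast<uint32_t>(code);
    return static_cast<uint8_t>(((bits >> 24) & 0x80u) | (bits & 0x7Fu));
}

// Hypothetical helper: restore the 32-bit factor code from the packed byte.
inline int32_t unpack_level(uint8_t packed) {
    const uint32_t bits =
        (static_cast<uint32_t>(packed & 0x80u) << 24) | (packed & 0x7Fu);
    return static_cast<int32_t>(bits);
}

// Hypothetical alternative: re-code NA as 0 so that codes 1..255 fit in a byte.
// This needs a comparison for every value in both directions.
inline uint8_t pack_level_zero_na(int32_t code) {
    return code == kNaInteger ? uint8_t{0} : static_cast<uint8_t>(code);
}

inline int32_t unpack_level_zero_na(uint8_t packed) {
    return packed == 0 ? kNaInteger : static_cast<int32_t>(packed);
}
```

With the first pair, codes 1 to 127 and NA round-trip unchanged, while a code of 128 would lose its top bit, which is why the check is `< 128`. The second pair would round-trip codes up to 255, at the cost of the extra comparison per value.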

Hope that answers your question!

@MarcusKlik MarcusKlik self-assigned this Jan 30, 2018